EE Times-Asia

Embedded multimodal systems tell a good story

Posted: 17 Nov 2003

Keywords: multimodal technology, embedded systems, SALT, HTML, web pages

The convergence of several trends - falling computer cost and size, increasing ubiquity and the availability of high-resolution displays - has generated more interest in embedded multimodal systems that adopt speech, visual and haptic interfaces as primary modes of interaction. The target devices for these applications range from cellphones and PDAs to appliances and in-car computers.

Thus, multimodal technology is ready for embedded environments. The issue now is how to enable developers to author sophisticated applications. A promising approach, known as Speech Application Language Tags (SALT), enhances HTML with a set of elements that control speech-processing and call-control resources. These elements support the development of multimodal Web pages by binding graphics to speech I/O events. The World Wide Web Consortium (W3C) is working on a standard for multimodal authoring, and the SALT specification has been contributed as a possible candidate.
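As a rough illustration of the binding idea, consider the following hypothetical SALT-annotated fragment (element and attribute names follow the SALT 1.0 specification as contributed to the W3C; the field names, grammar file and prompt text are invented for this sketch):

```html
<!-- Hypothetical fragment: a SALT listen element whose recognition
     result is bound to an ordinary HTML text field, so speech and
     GUI input fill the same application variable. -->
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body>
    <input name="city" type="text" />
    <salt:prompt id="askCity">Which city are you flying to?</salt:prompt>
    <salt:listen id="getCity">
      <salt:grammar src="cities.grxml" />
      <!-- bind copies the recognized value into the GUI field -->
      <salt:bind targetelement="city" value="//city" />
    </salt:listen>
    <!-- A GUI event activates the speech resources for this turn -->
    <button onclick="askCity.Start(); getCity.Start();">Speak</button>
  </body>
</html>
```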

However, by analogy with Web applications, HTML is only one aspect of development; the real complexity lies in the server programs that manage the different layers of logic. Similarly, for multimodal interaction systems, the real complexity is behind the presentation layer, in what is known as the dialog or interaction manager.

While Web and network speech application development can rely on a wide variety of technologies and products for server side development, embedded systems still must rely on traditional programming. This situation can be alleviated by carefully identifying the different levels of logic involved in a multimodal interaction application.

Anatomy of multimodal dialog

The simplest multimodal dialog application is obtained by combining speech and GUI into a single system. To simplify even further, we will restrict our discussion to what is generally referred to as sequential multimodality, in which users provide one input at a time.

For example, users can either speak an utterance or provide input to the GUI by clicking a button or filling a text field at will, but cannot interact with both channels at the same time. The interaction manager is at the core of the system and represents the application logic. It receives data from the user input channels, such as the speech recognizer and GUI. In response, it sends data to the output channels, such as the audio prompt player.

The interaction manager has two main functions. The first is the management and update of the application state, which is generally embodied by a structured set of variables. The second is deciding on the next action given the current configuration of the application state; this decision mechanism is generally referred to as the dialog strategy.

Many forms of dialog strategy can be adopted. One of the most effective ways of representing a dialog strategy is the state machine controller. In its simplest form, a state machine controller is nothing more than an if-then-else structure, where the conditions are drawn over the variables of the application state. However, an interaction manager based on hard-coded case statements is difficult to decompose into modular, reusable elements of interaction, which makes the application hard to maintain and update.
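To make the contrast concrete, here is a minimal sketch (all names are illustrative, not from the article) of a dialog strategy lifted out of hard-coded if-then-else statements into a declarative rule table - each rule pairs a condition over the application state with the action to take, which keeps the strategy modular and easy to extend:

```python
# Sketch of a table-driven dialog strategy (illustrative names).
# The application state is a dict of variables; the strategy is a
# list of (condition, action) rules rather than nested if-then-else.

def need_city(state):
    return state.get("city") is None

def need_date(state):
    return state.get("date") is None

def done(state):
    return state.get("city") is not None and state.get("date") is not None

# The first rule whose condition holds selects the next action.
STRATEGY = [
    (need_city, "prompt_for_city"),
    (need_date, "prompt_for_date"),
    (done,      "confirm_booking"),
]

def next_action(state):
    """Select the next action given the current application state."""
    for condition, action in STRATEGY:
        if condition(state):
            return action
    return "fallback"

state = {"city": None, "date": None}
print(next_action(state))          # prompt_for_city
state["city"] = "Singapore"
print(next_action(state))          # prompt_for_date
state["date"] = "2003-11-17"
print(next_action(state))          # confirm_booking
```

Adding a new interaction element means appending a rule to the table, rather than threading another branch through a monolithic case statement.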

The granularity and distribution of the state information between the interaction manager and other components is a design choice. Peripheral components of a simple multimodal application can be made completely stateless. For instance, the speech recognizer can receive configuration information from the interaction manager at a certain interaction turn, recognize the input speech, send the results of the recognition back to the interaction manager and return into an idle state, ready to serve another request.

Similarly, the GUI can send user input information to the interaction manager as soon as it is available. In this case, the interaction manager also has to manage the integration of the inputs coming from different channels. Events raised at the peripheral components need to propagate to the interaction manager layer and activate proper event handlers.

Sophisticated applications spanning multiple HTML pages benefit from a well-designed interaction manager that dynamically generates pages with SALT rather than directly invoking the speech processing and GUI methods.

- Roberto Pieraccini

Director, Natural Dialog Team

SpeechWorks International Inc.



