WO2011000749A1 - Multimodal interaction on digital television applications

Info

Publication number
WO2011000749A1
Authority
WO
WIPO (PCT)
Prior art keywords
application
multimodal
browser
decoder
interpreter
Application number
PCT/EP2010/058886
Other languages
French (fr)
Inventor
José Luis GOMEZ SOTO
Susana Mielgo Fernandez
Original Assignee
Telefonica, S.A.
Application filed by Telefonica, S.A.
Publication of WO2011000749A1

Classifications

    • H04N 21/443: OS processes, e.g. booting an STB, implementing a Java virtual machine in an STB or power management in an STB
    • H04N 21/422: Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N 21/42203: Input devices connected to specially adapted client devices: sound input device, e.g. microphone
    • H04N 21/42204: User interfaces specially adapted for controlling a client device through a remote control device; remote control devices therefor
    • H04N 21/47202: End-user interface for requesting content, additional data or services on demand, e.g. video on demand
    • H04N 21/654: Transmission of management data by the server directed to the client
    • H04N 21/6582: Data stored in the client, e.g. viewing habits, hardware capabilities, credit card number
    • H04N 7/173: Analogue secrecy or subscription systems with two-way working, e.g. subscriber sending a programme selection signal
    • G06F 16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F 3/16: Sound input; sound output
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

The invention relates to a method of multimodal interaction on digital television applications, wherein the multimodal application resides in a web server and is downloaded by a browser (110) residing in the television decoder (100) itself. All the multimodal interaction analysis processes can be performed in real time using a distributed system of components and through the communications protocols. The system allows the interaction of the user with the application by means of the remote control or voice.

Description

MULTIMODAL INTERACTION ON DIGITAL TELEVISION APPLICATIONS
Field of the Invention
The present invention is applied to the digital television sector, more specifically to the field of human-computer interactions on terminals such as digital television decoders or mobile telephones capable of executing interactive applications which are displayed on a television set.
Background of the Invention
A multimodal system must simultaneously allow different input methods or mechanisms (keyboard, speech, images, etc.), collecting the information from each of them as needed. For example, sometimes the user could say something by means of a voice command, other times he could select a name from a list using the keyboard, and he could even select a menu or part of the screen by pointing with his own finger; the engine of the multimodal interface must therefore be capable of detecting the method of interaction which the user has freely chosen (discarding incongruent information received through the other methods).
User interfaces have traditionally been based on the desktop metaphor, developed decades ago in the Xerox laboratories, which attempts to transfer to the computer world all the objects and tasks normally performed in a real office; thus, for example, both real and electronic documents can be stored, the traditional typewriter has its equivalent in the word processor, the blank sheet of paper is equivalent to the blank document in the processor, etc. The mental model that the user has when he performs these traditional tasks is thus maintained with few changes when it is transferred to the computer field, i.e., attempting to reach the highest degree of familiarity between objects and actions. This desktop metaphor has been implemented through the WIMP (Windows, Icons, Menus and Pointers) paradigm, whose main elements support most of the current graphic interfaces.
However, this paradigm is clearly unsuitable in an interactive Digital TV environment for several reasons. The first of them is related to the actual nature of the tasks performed by a user on an interactive application (more relaxed and closer to an entertainment and social environment), which makes them rather different from those of a real office. Secondly, the device with which the user interacts (the remote control) is very different in functionality and accessibility from the keyboard and mouse, which imposes many restrictions when performing tasks in a Digital TV environment (for example, entering text through the remote control to perform a simple search can become a difficult task). Ever since it appeared, the remote control used in the TV environment has been the device par excellence, and through it it has been possible to control a wide variety of devices and associated functions. However, the task models used in the interactive services which can currently be deployed at a commercial level on any of the distribution technologies and their development environments make their use inefficient on numerous occasions, with great usability problems, which translates into discouragement and loss of interest in browsing by the users (usability is defined as the efficiency and satisfaction with which a product allows specific users, for example television viewers, to achieve specific objectives, such as the purchase of a soccer match, in a specific use context, such as the living room of a house).
If it is furthermore taken into account that many people have accessibility problems when using a traditional remote control, it can be concluded that the traditional mechanism of interaction with the television has clearly become old-fashioned and is surpassed by the new interactive services executed on digital television decoders.
Tasks such as entering text with the remote control when performing a search in an EPG (Electronic Programming Guide), or sending a message through an interactive TV application, can become difficult and may finally make the user lose interest in their use. When entering these data, a virtual keyboard appearing on the screen, which can have an appearance similar to that of a mobile telephone keypad or of an ANSI keyboard, is normally used. In any case, the process is slow, not everyone is accustomed to using the remote control as if it were a mobile telephone keypad, and mistakes made when using this mechanism are not uncommon (the remote control works by infrared, which, depending on the light of the environment, objects located between the user and the receiver, etc., can mean that pressed keys are not translated into an entry of characters). Almost all the usability studies and tests performed on interactive applications indicate that this process is somewhat difficult for the user.
It should also be indicated that TV has a much more social nature, and the user is normally in a much more relaxed environment, seated at 3-4 meters from the TV, and with an attitude of much lower concentration than that required for working with a computer. It is clear that many of the tasks which are performed on a computer through a traditional graphic interface cannot be performed or will have to be performed in a very different manner. All of the above has made it necessary to abandon this desktop metaphor in the development of Digital TV applications.
The interactive applications on Digital TV are furthermore executed on a single window presented simultaneously (instead of several windows such as graphic PC interfaces, for example) due to all the restrictions indicated above. The different multimedia objects making up the scene (text, graphics, videos, etc.) are arranged on this window, endeavoring that all of them are synchronized based on a timeline, generating a set of scenes which describe the different actions or steps that the user must complete until achieving his objective. For example, in the purchase of a movie from an on-demand interactive video system, the user must initially enter that section, perform a search of the content based on some criterion, enter the data, select the content, enter a purchase PIN, etc. The different objects in the scene gradually appear in a synchronized manner as the user interacts with them.
Although this concept based on the description of scenes and the simultaneous presentation of objects (videos, text, graphics) can seem simple, it is difficult from the point of view of graphic processing, especially on those devices, such as TV decoders, in which business models impose considerable restrictions on the electronic components forming the device for cost reasons. However, there are already technologies and mechanisms in the industry which support this paradigm at the level of the presentation layer.
If the concept of synchronization of objects in the presentation layer is to be transferred to that of synchronization of the different mechanisms of interaction (remote control, interaction by means of voice commands, etc.), it would be seen that architectures supporting this synchronization of mechanisms of interaction have hardly been developed in Digital TV environments. For example, returning to the interactive application for the purchase of a video movie on demand, the intention could be to perform the search of the content by means of a voice command given to the system, but then to enter the purchase PIN by means of the traditional remote control for privacy reasons. The simultaneous management of the different mechanisms of interaction is complex from a semantic point of view (for example, when contrary and simultaneous orders are given through the voice and graphic interfaces) and expensive in computational processing resources. If this is added to the fact that the multimodal interactive applications will be executed on devices with little processing capacity, such as digital television decoders (CPUs of 100 MHz and limited RAM memory of 32-64 MB, far inferior to the performance of any domestic PC with a 1 GHz CPU and 1 GB of RAM), on which it is impossible to process speech in real time, it can be concluded that there are considerable technical difficulties when offering multimodal interfaces in Digital TV environments.
There are different patents relating to multimodal interfaces, although none of them is specifically applied to the field of digital television.
Patent US5265014 is focused on how to solve ambiguities which occur when using natural language as a mechanism of interaction. US4829423 is similar to the previous one although it is more focused on how to solve the problems which occur when using a natural language in a multimodal environment. Patent US6345111 relates more to a mechanism of image analysis such that the system is capable of recognizing which object the user is gazing at, such that it serves as a mechanism of input in the selection of elements. US5577165 generally describes how the mechanism of dialogue between a computer and a person is performed, taking into account the key words detected during the recognition process, as well as the different states through which the system passes during the dialogue.
Attempting to implement a complete architecture for the use of multimodal interfaces on lightweight devices, such as digital TV decoders or mobile telephones, has performance problems (mainly in relation to real-time speech processing or to the incorporation of a multimodal interpreter, for example SALT or VoiceXML, within the device itself, which is also expensive in terms of processing and memory reservation). It is therefore necessary to approach new architectures which allow the response times in the interaction process, using any of the mechanisms (visual or voice commands), to be as short as possible, as already occurs in other more powerful devices such as PCs, in which all the speech processing (synthesis and recognition) is performed in the computer itself, without requiring external processing. Thus, for example, the CPU and RAM memory necessary for performing a real-time speech recognition process with acceptable response times (less than 5 seconds), which allow the user to keep his attention focused on the system, would involve the use of a current average type PC (512 MB RAM, 1 GHz CPU), which differs greatly from the processing capacities of digital television decoders (32 MB RAM, 100 MHz CPU), even top-end decoders (128 MB RAM, 200 MHz CPU).
In short, the technical problem consists of the fact that the limited processing capacity of digital television decoders prevents developing thereon authentic multimodal applications which use voice interaction simultaneously with visual interaction as mechanisms of interaction.
Object of the Invention
The object of the present invention is to create a method, platform or system which allows different mechanisms of interaction to coexist simultaneously and in a synchronized manner in a digital television environment, thus producing what is known as multimodal interfaces. To that end, the invention proposes a method of multimodal interaction on interactive Digital TV applications, in which the television is provided with a network decoder which incorporates an associated browser, and in which the method is mainly made up of the following steps:
a. Connecting the browser to a network server and downloading a multimodal application and its descriptive tags which are generated in response to an interaction event caused by a user during the human-computer dialogue.
b. The browser sending the tags characterizing the multimodal application to an interpreter residing in a network server.
c. The interpreter interpreting the tags, which interpreter orders the execution of actions corresponding to the tags.
d. Repeating steps a-c until the user exits the application.
In step a. the events can be graphic and/or speech events. It is advantageous to associate an external module with the browser, with the function of transferring the descriptive tags of the speech dialogue to the interpreter of said tags by means of an IP protocol. This interpreter preferably coordinates and controls all the speech events and communicates with one or several servers providing speech resources by means of the MRCP protocol. It also preferably analyzes the structure of the multimodal application and sends the corresponding commands to the voice server which complies with the MRCP protocol. It optionally communicates with the external module and transfers thereto the data necessary for such module to set up a session by means of SIP with the MRCP voice server. The decoder can receive and send the voice data to the MRCP server by means of the RTP protocol. The external module preferably sets up communication with an RTP client, thus obtaining the state of the communication between the decoder and the MRCP voice server. The decoder in turn has an application capable of collecting the data coming from any external device which collects audio data and of sending them by means of an IP connection to the voice servers. Said application is capable of compressing said audio data to a format compatible with an MRCP server and of sending them through the RTP protocol to the voice server. The decoder preferably has an application capable of collecting the audio data coming from the RTP channel, decompressing them to a format playable by the decoder and sending them to an electronic device existing therein in charge of the audio generation.
The communication between the browser existing in the decoder and the external module is performed through an application programming interface.
The multimodal applications executed in the browser are optionally pre-processed, separating the multimodal logic from the service logic before being displayed to the user.
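By way of illustration of steps a and b above, the following JavaScript sketch shows how a browser-side external module could collect the multimodal tags of the downloaded page and forward them to the network interpreter over IP. The function names, the interpreter endpoint and the tag selection are assumptions made for this sketch; they are not part of the claimed method.

    // Minimal sketch, assuming the SALT/VoiceXML tags are embedded in the downloaded
    // page and the interpreter is reachable over plain HTTP; all names are illustrative.
    var INTERPRETER_URL = "http://interpreter.example.net/salt"; // hypothetical endpoint

    // Collect the serialized multimodal tags (prompt/listen/reco) from the document.
    function collectMultimodalTags(doc) {
      var tags = [];
      var nodes = doc.getElementsByTagName("*");
      for (var i = 0; i < nodes.length; i++) {
        var name = nodes[i].tagName.toLowerCase();
        if (name === "prompt" || name === "listen" || name === "reco") {
          tags.push(nodes[i].outerHTML);
        }
      }
      return tags;
    }

    // Step b: send the tags characterizing the multimodal application to the interpreter.
    function sendTagsToInterpreter(tags, onDone) {
      var xhr = new XMLHttpRequest();
      xhr.open("POST", INTERPRETER_URL, true);
      xhr.setRequestHeader("Content-Type", "application/xml");
      xhr.onreadystatechange = function () {
        if (xhr.readyState === 4) {
          onDone(xhr.status === 200 ? xhr.responseText : null);
        }
      };
      xhr.send("<multimodal>" + tags.join("") + "</multimodal>");
    }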
The use of multimodal interfaces in the interactive TV environment does not intend to replace the use of the remote control (visual interaction) but rather to complement and improve it according to the needs and wishes of the user.
Brief Description of the Drawings
For the purpose of aiding to better understand the present description, according to a preferred practical embodiment of the invention, a figure is attached, which has an illustrative and non-limiting character and which describes the architecture of the system (Figure 1).
Detailed Description of the Invention
The system of the invention is capable of performing all the multimodal interaction analysis processes in real time using a distributed system of components through the communications protocols described in Figure 1. The power of the system is based on the distributed architecture of components, delegating those processes of intensive use of the CPU to external machines and servers.
The system of the invention must have:
• A decoder with an integrated web browser and a return channel providing it with the capacity to access an external server. The browser must allow the use and execution of a scriptable language, for example, JavaScript language.
• This decoder will be connected to the television set to allow the display of the graphic part of the multimodal application and play the voice messages.
• The browser must allow the development and installation of plug-ins (or external modules outside the browser) providing the browser, and therefore the decoder, with the specific functionality for the interpretation and execution of multimodal applications.
• One or several external servers in which the speech resources reside (speech synthesis and recognition).
• A server capable of interpreting the tags providing the multimodality (such as SALT/VoiceXML, for example). The system has an internal state machine which is updated during the entire human-computer dialogue. The SALT/VoiceXML tags (or other similar ones) are distributed throughout the webpage of the application, providing characteristics such as speech recognition or synthesis. Their location depends on the actual design of the webpage and, together with the traditional HTML or JS tags, they would form the so-called multimodal application. Thus, for example, there may be a <prompt> tag whose enclosed text produces a synthesized audio of the message appearing after the tag. In a similar manner, there are other tags, <listen> and <reco>, which allow recording the commands of the user for their subsequent recognition. In any case, the syntax of the tags depends on the actual language or specification which is being used (SALT/VoiceXML or others); an illustrative page fragment is sketched after this list.
• Web server in which the multimodal application resides.
• Any external device, such as for example a microphone connected to the decoder, a mobile telephone, a Bluetooth hands-free device, etc. which allows collecting the speech and sending it in digital format to the decoder.
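The sketch announced in the list above gives a purely illustrative idea of such a page fragment; the tag attributes shown are simplified and do not follow the exact SALT or VoiceXML syntax. The fragment is held as a JavaScript string so that the hidden frame could hand it to the interpreter, and a small helper shows how a recognition result could be mapped back onto the corresponding graphic field.

    // Illustrative only: a simplified mixture of HTML and SALT-like tags.
    var multimodalFragment =
      '<input type="text" id="movieTitle" name="movieTitle"/>' +               // graphic input field
      '<prompt id="askTitle">Please say or type the movie title</prompt>' +    // synthesized audio message
      '<listen id="recoTitle" grammar="titles.grxml"/>';                       // records speech for recognition

    // Map a recognized utterance back onto the graphic field, so that both modes
    // of interaction feed the same application data.
    function applyRecognitionResult(doc, fieldId, recognizedText) {
      var field = doc.getElementById(fieldId);
      if (field) {
        field.value = recognizedText;
      }
    }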
The system allows the interaction of the user with the application by means of using the remote control or voice. Both methods are complementary, allowing the user to decide which he wants to use in each case. This interaction of the user with the application is referred to as human-computer dialogue.
The system furthermore allows the synchronization between the events generated by the user through any of the possible modes (text/voice) together with what is presented to the user, solving those incoherencies which may occur during the interaction. This synchronization is done by an external module which is executed in the browser, preventing unwanted actions or actions which can cause adverse effects on the system.
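A minimal sketch of this kind of synchronization, under the assumption that the external module resolves conflicts by honouring only the first event received for a given dialogue step, could look as follows (the structure and names are illustrative, not the patent's):

    // Hypothetical event arbiter: if the graphic and voice interfaces produce
    // conflicting events for the same dialogue step, only the first one is kept.
    function createEventArbiter() {
      var handledStep = -1;
      return {
        accept: function (event, currentStep) {
          if (currentStep === handledStep) {
            return false;            // this step was already answered by one mode: discard
          }
          handledStep = currentStep; // first event (voice or graphic) wins
          return true;
        }
      };
    }

    // Usage: the external module passes every incoming event through the arbiter.
    var arbiter = createEventArbiter();
    arbiter.accept({ mode: "voice", value: "accept" }, 3);   // true, processed
    arbiter.accept({ mode: "graphic", value: "accept" }, 3); // false, discarded as incongruent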
In summary, the system performs the following steps:
1. The multimodal application resides in a Web server. In a first step, the user selects said application (purchase of contents, electronic banking, etc.) provided by his Digital TV service provider. Since it is a multimodal application, the service provider will inform the user of such fact, indicating that he must previously connect a microphone, mobile telephone or Bluetooth headset to the decoder (the details of this step, in any case, are outside the scope of the invention). Once the device is connected, the decoder downloads said application by means of the http protocol or another mechanism into the web browser residing in the decoder, a plug-in is executed and the multimodal application is executed in the browser. The plug-in at this time is already linked to the browser.
2. For the purpose of performing real-time processing, the browser plug-in sends the web page containing the tags providing the multimodality to an external interpreter (SALT/VoiceXML, for example).
3. The browser immediately processes the web page and presents it in the television set.
4. The multimodal interpreter (SALT/VoiceXML) simultaneously recognizes and processes the multimodal tags, communicating to the voice servers the actions which must be performed at all times (for example, at a given time during the human-computer dialogue with the interactive application, the multimodal tag indicates that it is necessary to play by audio a certain message appearing on the screen).
5. The voice server sends to the decoder the audio data, which are processed and adapted such that they are playable by the audio device of the television set and heard by the user.
6. At this point, the state of the dialogue has advanced one step, and the browser plug-in informs the SALT interpreter that it can process the following multimodal tag, this process being repeated until the user abandons the application.
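Step 6 can be pictured with a small sketch of the handshake by which the plug-in advances the dialogue; the function and message names are assumptions for illustration only.

    // Once the audio for the current tag has been played, the plug-in tells the
    // interpreter that it may process the following multimodal tag.
    function advanceDialogue(state, notifyInterpreter) {
      state.currentTag = state.currentTag + 1;      // the dialogue has advanced one step
      notifyInterpreter({ command: "PROCESS_NEXT_TAG", tagIndex: state.currentTag });
      return state.currentTag < state.totalTags;    // false once the application has been completed
    }

    // Example call after a prompt has finished playing; a real plug-in would send
    // the message to the SALT/VoiceXML interpreter over the IP connection.
    var dialogueState = { currentTag: 0, totalTags: 5 };
    advanceDialogue(dialogueState, function (message) { /* forward to the interpreter */ });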
The multimodal application is made up of an HTML document comprising two main frames (a frame being an element in HTML terminology which corresponds to a part of a web page): a frame containing the application, in which the actual web content resides (the application frame, which is what is displayed in the graphic interface and what the user ultimately sees in his terminal), and another frame for creating, at execution time, by instantiation, the external module or plug-in (the frame of the external module). The frames are used to separate the content from the application. The application frame contains the elements with which the user is able to interact both graphically and vocally. The external module or plug-in is an application which is related to the application frame in order to provide it with the specific functionality of the multimodal logic, i.e., that which allows maintaining a human-computer dialogue by means of voice commands, in a manner that is complementary to the traditional graphic interface and the remote control, during the entire execution period of the application. The structure of a multimodal application resides in an XML-type document formatted to comply with the specifications defined in SALT (Speech Application Language Tags) or VoiceXML.
The external frame is hidden from the user because it does not contain a graphic interface and it is only in charge of clustering the SALT/VoiceXML tags, mentioned in the previous paragraph, specific to each application, and of the instantiation of the external module. This set of SALT/VoiceXML tags determines the multimodal logic, i.e., the interactions allowed in the application. These tags allow configuring the synthesis and execution of the voice as well as the speech recognition device and the set of events that can be performed using the voice interface. For example, during the human-computer dialogue with the application, the user can see on his television set an edit field prompting him to enter text with the remote control; in parallel, in the application there will be a SALT/VoiceXML tag which will indicate that the system at that time is waiting for the user to give a voice command. At this time, the user can choose to enter the data with the remote control or he can give the equivalent voice order.
In order for there to be agreement between what is displayed on the screen, in this case the television set, and the interactions or events which are performed vocally, both frames communicate with one another using an Application Programming Interface or API through the architecture of the browser itself, which is executed in the decoder. This API defines the set of communication processes and functions between the two frames, achieving a level of abstraction and separation between them. The API is defined in JavaScript because it is compatible with the browser integrated in the decoder.
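A minimal sketch of such a JavaScript API between the two frames is shown below; the frame name, functions and hook are assumptions made for illustration and do not reproduce the patent's actual API.

    // In the hidden (multimodal) frame:
    var MultimodalAPI = {
      // Called by the application frame when a graphic event occurs, so that the
      // interpreter's state machine stays in step with the graphic interface.
      notifyGraphicEvent: function (elementId, eventType) {
        // forwarding to the interpreter over IP is omitted in this sketch
      },
      // Called when the interpreter reports a recognized voice command; the event
      // is dispatched to the application frame.
      dispatchVoiceEvent: function (elementId, recognizedText) {
        var appWindow = window.parent.frames["application"]; // hypothetical frame name
        if (appWindow && typeof appWindow.onVoiceEvent === "function") {
          appWindow.onVoiceEvent(elementId, recognizedText);
        }
      }
    };

    // In the application frame, a matching hook the hidden frame can call:
    // window.onVoiceEvent = function (elementId, text) { document.getElementById(elementId).value = text; };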
This frame structure allows separating the service logic from the multimodal logic. The multimodal logic (provided by the hidden frame in which the external module or plug-in is executed) is associated with the management of the human-computer interaction from either the graphic or the voice interface, i.e., it would be in charge of the management of the events from the voice or graphic interface, launching the relevant actions given these events. It would also be in charge of solving the problems raised by simultaneous interactions between both interfaces. The service logic (provided by the application frame) would be associated with achieving the objective that the user has when using the application, such as purchasing a movie in a pay-per-view DTV service. This structure also allows the external module or plug-in to keep the service providing the multimodality active when browsing between the different pages and contents of the main application. This means that while a new graphic interface is being loaded, Text To Speech or TTS conversions, or conversions of speech into a format that can be understood by the system, also referred to as SR (Speech Recognition), can be performed. This improves the user's experience, preventing any type of wait, because the service providing the multimodality is always working in the background.
The plug-in providing the multimodality requires the communications with the user to be based on events, i.e., it requires that some process take place in response to an action being performed. For example, at some time during the dialogue the user can press an accept button with the remote control or send the voice command "Accept" for its recognition; the same execution code will be generated in both cases, without distinguishing the mechanism of interaction through which the event arrived. If it is through the graphic interface, the plug-in will inform the SALT/VoiceXML interpreter that this button has been pressed so that its state machine is synchronized with the execution of the application, while at the same time the code associated with pressing the key will be executed. If the user sends a voice command and it is correctly recognized, the SALT/VoiceXML interpreter will inform the plug-in of such fact by means of a command, and the plug-in will generate the corresponding event, which will execute the code corresponding to that event.
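The event unification described in the previous paragraph can be sketched as follows: both the remote-control key and the recognized voice command end up calling the same handler, so the service logic never needs to know which mechanism was used. Key codes, command strings and the transport stub are assumptions made for this sketch.

    // Service-logic code executed for the Accept action, whichever way it arrived.
    function onAccept() {
      // e.g. confirm the purchase of the selected movie
    }

    // Graphic path: the remote control generates an ordinary key event.
    var ACCEPT_KEY = 13; // assumed key code for the remote's OK/accept button
    function handleRemoteKey(keyCode) {
      if (keyCode === ACCEPT_KEY) {
        informInterpreter("BUTTON_PRESSED", "accept"); // keep the interpreter's state machine in sync
        onAccept();
      }
    }

    // Voice path: the interpreter reports a correctly recognized command.
    function handleInterpreterCommand(command) {
      if (command === "accept") {
        onAccept(); // same code as the graphic path
      }
    }

    // Hypothetical transport stub towards the SALT/VoiceXML interpreter.
    function informInterpreter(type, value) { /* forward over IP */ }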
To manage all the events produced by both the user and the system, the JavaScript programming language is preferably used, since it allows programming event handlers, which are in charge of capturing the actions occurring in the system.
The browser that is used in Digital TV decoders interprets the JavaScript code which is integrated in the Web pages. Furthermore, the multimodal logic itself (the interactions allowed), determined by the SALT/VoiceXML tags, is encoded in JavaScript. As a result, the external multimodal module requires JavaScript in order to communicate with the browser.
The external multimodal module also needs to access the audio hardware resources of the decoder to control basic actions such as playing, stopping, etc.
Figure 1 shows the high-level architecture of an example of a system capable of carrying out the method of the invention:
Decoder [100]:
[110] Browser with possibility of including an external module [120].
Customized logic support capable of allowing the incorporation of modules enabling an IP connection with other systems and any other communication protocol.
SIP (Session Initiation Protocol) client [140]: in charge of creating a session between the decoder [100] and the MRCP (Media Resource Control Protocol) server [460].
RTP (Real Time Protocol) client [170]
Audio recording [190] and audio playback [180].
IP network connection [500]
External servers:
• MRCP server [350] and [460]
SALT/VoiceXML interpreter [300]
Speech resources [400] (TTS or Text-to-Speech, SR or Speech Recognition)
The communication processes between the client (decoder) [100] and the resources of the MRCP server [460] are performed through the SIP (Session Initiation Protocol) [610], which allows setting up multimedia sessions by means of exchanging messages between the parties who wish to communicate with one another. The decoder [100] implements a SIP module [140] which creates a session [610] by sending requests to the MRCP server. In this message, it also sends the characteristics of the session which it wishes to set up, such as supported audio coders/decoders, addresses, ports on which they are expected to be received, transmission rates, etc., which are necessary for performing the speech synthesis and recognition processes. All these actions are coordinated by the SALT interpreter [300].
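As a rough illustration, the session characteristics mentioned above are typically carried as a session description inside the SIP request; the sketch below builds such a description with placeholder addresses, ports and the standard G.711 payload types. None of these concrete values comes from the patent.

    // Sketch of the session characteristics offered during SIP session set-up.
    // Payload types 0 (PCMU) and 8 (PCMA) are the standard RTP numbers for G.711 audio.
    function buildSessionOffer(localIp, rtpPort) {
      return [
        "v=0",
        "o=decoder 0 0 IN IP4 " + localIp,
        "s=multimodal-session",
        "c=IN IP4 " + localIp,
        "t=0 0",
        "m=audio " + rtpPort + " RTP/AVP 0 8", // port on which the decoder expects the RTP audio
        "a=rtpmap:0 PCMU/8000",
        "a=rtpmap:8 PCMA/8000"
      ].join("\r\n");
    }

    // Example: the SIP client [140] could place this body inside its request to the MRCP server [460].
    var offer = buildSessionOffer("192.0.2.10", 4000);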
On the decoder [100] side, an RTP (Real Time Protocol) module [170] may be necessary, since it is possible that the decoder does not support streaming playback by means of RTP. It is therefore necessary to use a player [180] capable of sending to the system loudspeaker the raw voice data collected and stored in real time in the buffer of the RTP client [170].
The process of sending the audio [620] through the RTP channel to the RTP element [411] is based on using an application [190] capable of recording and collecting the raw voice data from the audio input device (for example, a microphone) to then create a buffer with said data. Depending on the possibilities of the voice servers and on the formats to be supported, it would be necessary to use different speech compressors/decompressors, such as for example, PCMU-PCM
(Pulse Code Modulation mu-law). This compressor/decompressor application [180][190] implements the client/server paradigm, the object of the transaction being the voice data on an RTP protocol [620]. The application consists of two main modules:
• RTP player [180]: decompresses the sound in PCMU (or PCMA) format coming from the RTP channel, converts it into PCM and finally plays it.
• RTP recorder [190]: reads data from the audio input device, compresses/converts it from PCM to PCMU (or PCMA) and finally sends it through the RTP channel [620].
All these components are in turn coordinated by the SALT interpreter [300].
The most common actions performed with them are PLAY, STOP, PAUSE and REC.
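The conversion performed by the RTP recorder [190] and player [180] between the decoder's linear PCM audio and the PCMU format carried on RTP is standard G.711 mu-law companding. A minimal sketch is given below; it is textbook G.711 code, not code extracted from the patent.

    var BIAS = 0x84;   // standard G.711 mu-law bias (132)
    var CLIP = 32635;  // clip level for 16-bit PCM input

    // RTP recorder direction: 16-bit signed PCM sample -> 8-bit PCMU byte.
    function pcmToMulaw(sample) {
      var sign = (sample >> 8) & 0x80;
      if (sign) sample = -sample;
      if (sample > CLIP) sample = CLIP;
      sample += BIAS;
      var exponent = 7;
      for (var mask = 0x4000; (sample & mask) === 0 && exponent > 0; exponent--, mask >>= 1) {}
      var mantissa = (sample >> (exponent + 3)) & 0x0F;
      return (~(sign | (exponent << 4) | mantissa)) & 0xFF;
    }

    // RTP player direction: 8-bit PCMU byte -> 16-bit signed PCM sample.
    function mulawToPcm(mulawByte) {
      var u = (~mulawByte) & 0xFF;
      var sign = u & 0x80;
      var exponent = (u >> 4) & 0x07;
      var mantissa = u & 0x0F;
      var sample = (((mantissa << 3) + BIAS) << exponent) - BIAS;
      return sign ? -sample : sample;
    }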
The invention can be applied to almost all the interactive services executed on Digital TV decoders that comply with the aforementioned characteristics. Examples of such interactive services include:
• Control and browsing through EPGs (Electronic Program Guide) or
ESG (Electronic Service Guide) or UEG (Unified Electronic Guide).
• Control and browsing through the VoD (Video on Demand), CoD (Content on Demand) functionalities.
• Control and browsing through services of home banking, electronic purchase/sale, access to product catalogues, etc.
• Control and browsing through the functionalities offered by electronic messaging applications and browsing through the Internet by means of browsers.

Claims

1.- Method of multimodal interaction on interactive Digital TV applications, wherein the television is provided with a network decoder (100) incorporating an associated browser (110), characterized by the following steps:
a. Connecting the browser to a network server and downloading a multimodal application and its descriptive tags which are generated in response to an interaction event caused by a user during the human-computer dialogue
b. The browser sending the tags characterizing the multimodal application to an interpreter (300) residing in a network server
c. The interpreter interpreting the tags, which interpreter orders the execution of actions corresponding to the tags
d. Repeating steps a-c until the user exits the application
2.- Method according to claim 1, characterized in that in step a. the events are graphics and/or speech.
3.- Method according to claim 2, wherein an external module (120) is associated with the browser (110) with the function of transferring the descriptive tags of the speech dialogue to the interpreter of said tags (300) by means of an IP protocol.
4.- Method according to claim 3, wherein the interpreter (300) of the descriptive tags of the speech dialogue coordinates and controls all the speech events.
5.- Method according to claim 4, wherein the interpreter (300) of the descriptive tags of the speech dialogue communicates with one or several servers providing speech resources by means of the MRCP protocol.
6.- Method according to claim 5, wherein the interpreter (300) of the descriptive tags of the speech dialogue analyzes the structure of the multimodal application and sends the corresponding commands to the voice server which complies with the MRCP protocol.
7.- Method according to claim 6, wherein the interpreter (300) of the descriptive tags of the speech dialogue communicates with the external module
(120) associated with the browser of the decoder and transfers thereto the data necessary for such module to set up a session by means of SIP with the MRCP voice server (460).
8.- Method according to claim 7, wherein the decoder receives and sends the voice data to the MRCP server (460) by means of the RTP protocol.
9.- Method according to claim 8, characterized in that the external module (120) associated with the browser sets up communication with an RTP client (170), thus obtaining the state of the communication between the decoder and the MRCP voice server (460).
10.- Method according to any of claims 5-9, wherein the decoder (100) has an application (190) capable of collecting the data coming from any external device which collects audio data and is capable of sending them by means of an IP connection to the voice servers.
11.- Method according to claim 10, wherein said application is capable of compressing said audio data to the format compatible with an MRCP server and sending them through the RTP protocol to the voice server (400).
12.- Method according to claim 11, wherein the decoder has an application (180) capable of collecting the audio data coming from the RTP channel, decompressing the data to the format playable by the decoder and sending them to an electronic device existing therein in charge of the audio generation.
13.- Method according to any of claims 3-12, wherein the communication between the browser (110) existing in the decoder and the external module (120) is performed through an application programming interface.
14.- Method according to any of the previous claims, wherein the multimodal applications executed in the browser (110) are pre-processed, separating the multimodal logic from the service logic before being displayed to the user.
15.- System capable of carrying out any of the methods of claims 1 to 14.
16.- Use of the system of claim 15 in a pay-per-view digital television service.
PCT/EP2010/058886 2009-06-30 2010-06-23 Multimodal interaction on digital television applications WO2011000749A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ES200930385A ES2382747B1 (en) 2009-06-30 2009-06-30 MULTIMODAL INTERACTION ON DIGITAL TELEVISION APPLICATIONS
ESP200930385 2009-06-30

Publications (1)

Publication Number Publication Date
WO2011000749A1 true WO2011000749A1 (en) 2011-01-06

Family

ID=42712507

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2010/058886 WO2011000749A1 (en) 2009-06-30 2010-06-23 Multimodal interaction on digital television applications

Country Status (4)

Country Link
AR (1) AR077281A1 (en)
ES (1) ES2382747B1 (en)
UY (1) UY32729A (en)
WO (1) WO2011000749A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4829423A (en) 1983-01-28 1989-05-09 Texas Instruments Incorporated Menu-based natural language understanding system
US5265014A (en) 1990-04-10 1993-11-23 Hewlett-Packard Company Multi-modal user interface
US5577165A (en) 1991-11-18 1996-11-19 Kabushiki Kaisha Toshiba Speech dialogue system for facilitating improved human-computer interaction
US6345111B1 (en) 1997-02-28 2002-02-05 Kabushiki Kaisha Toshiba Multi-modal interface apparatus and method
EP1455282A1 (en) * 2003-03-06 2004-09-08 Alcatel Markup language extension enabling speech recognition for controlling an application
US20060150082A1 (en) * 2004-12-30 2006-07-06 Samir Raiyani Multimodal markup language tags
US20080255850A1 (en) * 2007-04-12 2008-10-16 Cross Charles W Providing Expressive User Interaction With A Multimodal Application

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2675153A1 (en) * 2012-06-14 2013-12-18 Samsung Electronics Co., Ltd Display apparatus, interactive server, and method for providing response information
US9219949B2 (en) 2012-06-14 2015-12-22 Samsung Electronics Co., Ltd. Display apparatus, interactive server, and method for providing response information
EP2680597A3 (en) * 2012-06-29 2014-09-03 Samsung Electronics Co., Ltd Display apparatus, electronic device, interactive system, and controlling methods thereof
US8983299B2 (en) 2012-06-29 2015-03-17 Samsung Electronics Co., Ltd. Display apparatus, electronic device, interactive system, and controlling methods thereof
EP3214842A1 (en) * 2012-06-29 2017-09-06 Samsung Electronics Co., Ltd. Display apparatus, electronic device, interactive system, and controlling methods thereof
USRE47168E1 (en) 2012-06-29 2018-12-18 Samsung Electronics Co., Ltd. Display apparatus, electronic device, interactive system, and controlling methods thereof
USRE48423E1 (en) 2012-06-29 2021-02-02 Samsung Electronics Co., Ltd. Display apparatus, electronic device, interactive system, and controlling methods thereof
EP3833036A1 (en) * 2012-06-29 2021-06-09 Samsung Electronics Co., Ltd. Display apparatus, electronic device, interactive system, and controlling methods thereof
USRE49493E1 (en) 2012-06-29 2023-04-11 Samsung Electronics Co., Ltd. Display apparatus, electronic device, interactive system, and controlling methods thereof
TWI561072B (en) * 2015-08-05 2016-12-01 Chunghwa Telecom Co Ltd

Also Published As

Publication number Publication date
UY32729A (en) 2011-01-31
ES2382747B1 (en) 2013-05-08
ES2382747A1 (en) 2012-06-13
AR077281A1 (en) 2011-08-17

Similar Documents

Publication Publication Date Title
US10650816B2 (en) Performing tasks and returning audio and visual feedbacks based on voice command
EP1143679B1 (en) A conversational portal for providing conversational browsing and multimedia broadcast on demand
US7086079B1 (en) Method and apparatus for internet TV
CN101036385B (en) Method and system for providing interactive services in digital television
US20110067059A1 (en) Media control
US20120317492A1 (en) Providing Interactive and Personalized Multimedia Content from Remote Servers
EP3790284A1 (en) Interactive video generation
CN111625716B (en) Media asset recommendation method, server and display device
CN112163086A (en) Multi-intention recognition method and display device
WO2011000749A1 (en) Multimodal interaction on digital television applications
CN111182339A (en) Method for playing media item and display equipment
CN114900386A (en) Terminal equipment and data relay method
CN116614659A (en) Screen projection method, display device and intelligent device
CN111914565A (en) Electronic equipment and user statement processing method
CN112883144A (en) Information interaction method
CN113207042B (en) Media asset playing method and display equipment
CN112788372B (en) Media resource platform registration method, display equipment and server
CN113053380B (en) Server and voice recognition method
CN113490021B (en) User interface display method and display device
CN113849664A (en) Display device, server and media asset searching method
CN113282773A (en) Video searching method, display device and server
CN117407611A (en) Portal wall-mounting method, portal wall-mounting device and computer storage medium
US9578396B2 (en) Method and device for providing HTML-based program guide service in a broadcasting terminal, and recording medium therefor
WO2011125066A1 (en) A cost effective communication device
CN115344722A (en) Display device, server and media asset searching method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10726501

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10726501

Country of ref document: EP

Kind code of ref document: A1