WO2011000749A1 - Multimodal interaction on digital television applications

Info

Publication number
WO2011000749A1
Authority
WO
WIPO (PCT)
Prior art keywords
application
multimodal
browser
decoder
interpreter
Application number
PCT/EP2010/058886
Other languages
French (fr)
Inventor
José Luis GOMEZ SOTO
Susana Mielgo Fernandez
Original Assignee
Telefonica, S.A.
Application filed by Telefonica, S.A.
Publication of WO2011000749A1

Classifications

    • H04N 21/443: OS processes, e.g. booting an STB, implementing a Java virtual machine in an STB or power management in an STB
    • H04N 21/422: Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N 21/42203: Input devices connected to specially adapted client devices: sound input device, e.g. microphone
    • H04N 21/42204: User interfaces specially adapted for controlling a client device through a remote control device; remote control devices therefor
    • H04N 21/47202: End-user interface for requesting content, additional data or services on demand, e.g. video on demand
    • H04N 21/654: Transmission of management data by the server directed to the client
    • H04N 21/6582: Data stored in the client, e.g. viewing habits, hardware capabilities, credit card number
    • H04N 7/173: Analogue secrecy or subscription systems with two-way working, e.g. subscriber sending a programme selection signal
    • G06F 16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F 3/16: Sound input; sound output
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

The invention relates to a method of multimodal interaction on digital television applications, wherein the multimodal application resides in a web server and is downloaded by a browser (110) residing in the television decoder (100) itself. All the multimodal interaction analysis processes can be performed in real time using a distributed system of components and through the communications protocols. The system allows the interaction of the user with the application by means of the remote control or voice.

Description

MULTIMODAL INTERACTION ON DIGITAL TELEVISION APPLICATIONS
Field of the Invention
The present invention is applied to the digital television sector, more specifically to the field of human-computer interactions on terminals such as digital television decoders or mobile telephones capable of executing interactive applications which are displayed on a television set.
Background of the Invention
A multimodal system must simultaneously allow different input methods or mechanisms (keyboard, speech, images, etc.), collecting the information from each of them as needed. For example, sometimes the user could say something by means of a voice command, other times he could select a name from a list using the keyboard, and he could even select a menu or part of the screen by pointing with his own finger; the engine of the multimodal interface must therefore be capable of detecting the method of interaction which the user has freely chosen (discarding incongruent information received through the other methods).
User interfaces have traditionally been based on the desktop metaphor, developed decades ago in the Xerox laboratories, which attempts to transfer to the computer world all the objects and tasks normally performed in a real office; thus, for example, both real and electronic documents can be stored, the traditional typewriter has its equivalent in the word processor, the blank sheet of paper is equivalent to the blank document in the processor, etc. The mental model that the user has when he performs these traditional tasks is thus maintained with few changes when it is transferred to the computer field, i.e., attempting to reach the highest degree of familiarity between objects and actions. This desktop metaphor has been implemented through the WIMP (Windows, Icons, Menus and Pointers) paradigm, whose main elements support most of the current graphic interfaces.
However, this paradigm is clearly unsuitable in an interactive Digital TV environment for several reasons. The first of them is related to the actual nature of the tasks performed by a user on an interactive application (more relaxed and closer to an entertainment and social environment), which makes them rather different from those of a real office. Secondly, the device with which the user interacts (the remote control) is very different in functionality and accessibility from the keyboard and mouse, which imposes many restrictions when performing tasks in a Digital TV environment (for example, entering text through the remote control to perform a simple search can become a difficult task). Ever since it appeared, the remote control used in the TV environment has been the device par excellence, and through it it has been possible to control a wide variety of devices and associated functions. However, the task models used in the interactive services which can currently be deployed at a commercial level on any of the distribution technologies and their development environments make their use inefficient on numerous occasions, with great usability problems, which translates into discouragement and loss of interest in browsing by the users (usability is defined as the efficiency and satisfaction with which a product allows specific users, for example television viewers, to achieve specific objectives, such as the purchase of a soccer match, in a specific use context, such as the living room of a house).
If it is furthermore taken into account that many people have accessibility problems when using a traditional remote control, it can be concluded that the traditional mechanism of interaction with the television has clearly become old-fashioned and is surpassed by the new interactive services executed on digital television decoders.
Tasks such as entering text with the remote control when performing a search in an EPG (Electronic Programming Guide), or sending a message through an interactive TV application, can become difficult and may finally make the user lose interest in their use. When entering these data, a virtual keyboard appearing on the screen, which can have an appearance similar to that of a mobile telephone keypad or of an ANSI keyboard, is normally used. In any case, the process is slow, not everyone is accustomed to using the remote control as if it were a mobile telephone keypad, and mistakes made when using this mechanism are not uncommon (the remote control works by infrared, which, depending on the light of the environment, objects located between the user and the receiver, etc., can mean that pressed keys are not translated into an entry of characters). Almost all the usability studies and tests performed on interactive applications indicate that this process is somewhat difficult for the user.
It should also be indicated that TV has a much more social nature, and the user is normally in a much more relaxed environment, seated at 3-4 meters from the TV, and with an attitude of much lower concentration than that required for working with a computer. It is clear that many of the tasks which are performed on a computer through a traditional graphic interface cannot be performed or will have to be performed in a very different manner. All of the above has made it necessary to abandon this desktop metaphor in the development of Digital TV applications.
The interactive applications on Digital TV are furthermore executed on a single window presented simultaneously (instead of several windows such as graphic PC interfaces, for example) due to all the restrictions indicated above. The different multimedia objects making up the scene (text, graphics, videos, etc.) are arranged on this window, endeavoring that all of them are synchronized based on a timeline, generating a set of scenes which describe the different actions or steps that the user must complete until achieving his objective. For example, in the purchase of a movie from an on-demand interactive video system, the user must initially enter that section, perform a search of the content based on some criterion, enter the data, select the content, enter a purchase PIN, etc. The different objects in the scene gradually appear in a synchronized manner as the user interacts with them.
Although this concept based on the description of scenes and the simultaneous presentation of objects (videos, text, graphics) can seem simple, it is difficult from the point of view of graphic processing, especially on those devices, such as TV decoders, in which business models impose considerable restrictions on the electronic components forming the device for cost reasons. However, there are already technologies and mechanisms in the industry which support this paradigm at the level of the presentation layer.
If the concept of synchronization of objects in the presentation layer is to be transferred to that of synchronization of the different mechanisms of interaction (remote control, interaction by means of voice commands, etc.), it would be seen that architectures supporting this synchronization of mechanisms of interaction have hardly been developed in Digital TV environments. For example, returning to the interactive application for the purchase of a video movie on demand, the intention could be to perform the search of the content by means of a voice command given to the system, but then to enter the purchase PIN by means of the traditional remote control for privacy reasons. The simultaneous management of the different mechanisms of interaction is complex from a semantic point of view (for example, when contrary and simultaneous orders are given through the voice and graphic interfaces) and expensive in computational processing resources. If this is added to the fact that the multimodal interactive applications will be executed on devices with little processing capacity, such as digital television decoders (CPUs of 100 MHz and limited RAM memory of 32-64 MB, far inferior to the performance of any domestic PC with a 1 GHz CPU and 1 GB of RAM), on which it is impossible to process speech in real time, it can be concluded that there are considerable technical difficulties when offering multimodal interfaces in Digital TV environments.
There are different patents relating to multimodal interfaces, although none of them is specifically applied to the field of digital television.
Patent US5265014 is focused on how to solve ambiguities which occur when using natural language as a mechanism of interaction. US4829423 is similar to the previous one although it is more focused on how to solve the problems which occur when using a natural language in a multimodal environment. Patent US6345111 relates more to a mechanism of image analysis such that the system is capable of recognizing which object the user is gazing at, such that it serves as a mechanism of input in the selection of elements. US5577165 generally describes how the mechanism of dialogue between a computer and a person is performed, taking into account the key words detected during the recognition process, as well as the different states through which the system passes during the dialogue.
Attempting to implement a complete architecture for the use of multimodal interfaces on lightweight devices, such as digital TV decoders or mobile telephones, has performance problems (mainly in relation to real-time speech processing or to the incorporation of a multimodal interpreter, for example SALT or VoiceXML, within the device itself, which is also expensive in terms of processing and memory reservation). It is therefore necessary to approach new architectures which allow the response times in the interaction process, using any of the mechanisms (visual or voice commands), to be as short as possible, as already occurs in other more powerful devices such as PCs, in which all the speech processing (synthesis and recognition) is performed in the computer itself, without requiring external processing. Thus, for example, the CPU and RAM memory necessary for performing a real-time speech recognition process with acceptable response times (less than 5 seconds), which allow the user to keep his attention focused on the system, would involve the use of a current average type PC (512 MB RAM, 1 GHz CPU), which differs greatly from the processing capacities of digital television decoders (32 MB RAM, 100 MHz CPU), even top-end decoders (128 MB RAM, 200 MHz CPU).
In short, the technical problem consists of the fact that the limited processing capacity of digital television decoders prevents developing thereon authentic multimodal applications which use voice interaction simultaneously with visual interaction as mechanisms of interaction.
Object of the Invention
The object of the present invention is to create a method, platform or system which allows different mechanisms of interaction to coexist simultaneously and in a synchronized manner in a digital television environment, thus producing what is known as multimodal interfaces. To that end, the invention proposes a method of multimodal interaction on interactive Digital TV applications, in which the television is provided with a network decoder which incorporates an associated browser, and in which the method is mainly made up of the following steps:
a. Connecting the browser to a network server and downloading a multimodal application and its descriptive tags which are generated in response to an interaction event caused by a user during the human-computer dialogue.
b. The browser sending the tags characterizing the multimodal application to an interpreter residing in a network server.
c. The interpreter interpreting the tags, which interpreter orders the execution of actions corresponding to the tags.
d. Repeating steps a-c until the user exits the application.
In step a. the events can be graphic and/or speech events. It is advantageous to associate an external module with the browser, with the function of transferring the descriptive tags of the speech dialogue to the interpreter of said tags by means of an IP protocol. This interpreter preferably coordinates and controls all the speech events and communicates with one or several servers providing speech resources by means of the MRCP protocol. It also preferably analyzes the structure of the multimodal application and sends the corresponding commands to the voice server which complies with the MRCP protocol. It optionally communicates with the external module and transfers thereto the data necessary for such module to set up a session by means of SIP with the MRCP voice server. The decoder can receive and send the voice data to the MRCP server by means of the RTP protocol. The external module preferably sets up communication with an RTP client, thus obtaining the state of the communication between the decoder and the MRCP voice server. The decoder in turn has an application capable of collecting the data coming from any external device which collects audio data and of sending them by means of an IP connection to the voice servers. Said application is capable of compressing said audio data to a format compatible with an MRCP server and of sending them through the RTP protocol to the voice server. The decoder preferably has an application capable of collecting the audio data coming from the RTP channel, decompressing them to a format playable by the decoder and sending them to an electronic device existing therein in charge of the audio generation.
The communication between the browser existing in the decoder and the external module is performed through an application programming interface.
The multimodal applications executed in the browser are optionally pre-processed, separating the multimodal logic from the service logic before being displayed to the user.
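By way of illustration of steps a and b above, the following JavaScript sketch shows how a browser-side external module could collect the multimodal tags of the downloaded page and forward them to the network interpreter over IP. The function names, the interpreter endpoint and the tag selection are assumptions made for this sketch; they are not part of the claimed method.

    // Minimal sketch, assuming the SALT/VoiceXML tags are embedded in the downloaded
    // page and the interpreter is reachable over plain HTTP; all names are illustrative.
    var INTERPRETER_URL = "http://interpreter.example.net/salt"; // hypothetical endpoint

    // Collect the serialized multimodal tags (prompt/listen/reco) from the document.
    function collectMultimodalTags(doc) {
      var tags = [];
      var nodes = doc.getElementsByTagName("*");
      for (var i = 0; i < nodes.length; i++) {
        var name = nodes[i].tagName.toLowerCase();
        if (name === "prompt" || name === "listen" || name === "reco") {
          tags.push(nodes[i].outerHTML);
        }
      }
      return tags;
    }

    // Step b: send the tags characterizing the multimodal application to the interpreter.
    function sendTagsToInterpreter(tags, onDone) {
      var xhr = new XMLHttpRequest();
      xhr.open("POST", INTERPRETER_URL, true);
      xhr.setRequestHeader("Content-Type", "application/xml");
      xhr.onreadystatechange = function () {
        if (xhr.readyState === 4) {
          onDone(xhr.status === 200 ? xhr.responseText : null);
        }
      };
      xhr.send("<multimodal>" + tags.join("") + "</multimodal>");
    }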
The use of multimodal interfaces in the interactive TV environment does not intend to replace the use of the remote control (visual interaction) but rather to complement and improve it according to the needs and wishes of the user.
Brief Description of the Drawings
For the purpose of aiding to better understand the present description, according to a preferred practical embodiment of the invention, a figure is attached, which has an illustrative and non-limiting character and which describes the architecture of the system (Figure 1).
Detailed Description of the Invention
The system of the invention is capable of performing all the multimodal interaction analysis processes in real time using a distributed system of components through the communications protocols described in Figure 1. The power of the system is based on the distributed architecture of components, delegating those processes of intensive use of the CPU to external machines and servers.
The system of the invention must have:
• A decoder with an integrated web browser and a return channel providing it with the capacity to access an external server. The browser must allow the use and execution of a scriptable language, for example, JavaScript language.
• This decoder will be connected to the television set to allow the display of the graphic part of the multimodal application and play the voice messages.
• The browser must allow the development and installation of plug-ins (or external modules outside the browser) providing the browser, and therefore the decoder, with the specific functionality for the interpretation and execution of multimodal applications.
• One or several external servers in which the speech resources reside (speech synthesis and recognition).
• A server capable of interpreting the tags providing the multimodality (such as SALT/VoiceXML, for example). The system has an internal state machine which is updated during the entire human-computer dialogue. The SALT/VoiceXML tags (or other similar ones) are distributed throughout the webpage of the application, providing characteristics such as speech recognition or synthesis. Their location depends on the actual design of the webpage and, together with the traditional HTML or JS tags, they would form the so-called multimodal application. Thus, for example, there may be a <prompt> tag whose enclosed text produces a synthesized audio of the message appearing after the tag. In a similar manner, there are other tags, <listen> and <reco>, which allow recording the commands of the user for their subsequent recognition. In any case, the syntax of the tags depends on the actual language or specification which is being used (SALT/VoiceXML or others); an illustrative page fragment is sketched after this list.
• Web server in which the multimodal application resides.
• Any external device, such as for example a microphone connected to the decoder, a mobile telephone, a Bluetooth hands-free device, etc. which allows collecting the speech and sending it in digital format to the decoder.
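The sketch announced in the list above gives a purely illustrative idea of such a page fragment; the tag attributes shown are simplified and do not follow the exact SALT or VoiceXML syntax. The fragment is held as a JavaScript string so that the hidden frame could hand it to the interpreter, and a small helper shows how a recognition result could be mapped back onto the corresponding graphic field.

    // Illustrative only: a simplified mixture of HTML and SALT-like tags.
    var multimodalFragment =
      '<input type="text" id="movieTitle" name="movieTitle"/>' +               // graphic input field
      '<prompt id="askTitle">Please say or type the movie title</prompt>' +    // synthesized audio message
      '<listen id="recoTitle" grammar="titles.grxml"/>';                       // records speech for recognition

    // Map a recognized utterance back onto the graphic field, so that both modes
    // of interaction feed the same application data.
    function applyRecognitionResult(doc, fieldId, recognizedText) {
      var field = doc.getElementById(fieldId);
      if (field) {
        field.value = recognizedText;
      }
    }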
The system allows the interaction of the user with the application by means of using the remote control or voice. Both methods are complementary, allowing the user to decide which he wants to use in each case. This interaction of the user with the application is referred to as human-computer dialogue.
The system furthermore allows the synchronization between the events generated by the user through any of the possible modes (text/voice) together with what is presented to the user, solving those incoherencies which may occur during the interaction. This synchronization is done by an external module which is executed in the browser, preventing unwanted actions or actions which can cause adverse effects on the system.
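A minimal sketch of this kind of synchronization, under the assumption that the external module resolves conflicts by honouring only the first event received for a given dialogue step, could look as follows (the structure and names are illustrative, not the patent's):

    // Hypothetical event arbiter: if the graphic and voice interfaces produce
    // conflicting events for the same dialogue step, only the first one is kept.
    function createEventArbiter() {
      var handledStep = -1;
      return {
        accept: function (event, currentStep) {
          if (currentStep === handledStep) {
            return false;            // this step was already answered by one mode: discard
          }
          handledStep = currentStep; // first event (voice or graphic) wins
          return true;
        }
      };
    }

    // Usage: the external module passes every incoming event through the arbiter.
    var arbiter = createEventArbiter();
    arbiter.accept({ mode: "voice", value: "accept" }, 3);   // true, processed
    arbiter.accept({ mode: "graphic", value: "accept" }, 3); // false, discarded as incongruent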
In summary, the system performs the following steps:
1. The multimodal application resides in a Web server. In a first step, the user selects said application (purchase of contents, electronic banking, etc.) provided by his Digital TV service provider. Since it is a multimodal application, the service provider will inform the user of such fact, indicating that he must previously connect a microphone, mobile telephone or Bluetooth headset to the decoder (the details of this step, in any case, are outside the scope of the invention). Once the device is connected, the decoder downloads said application by means of the http protocol or another mechanism into the web browser residing in the decoder, a plug-in is executed and the multimodal application is executed in the browser. The plug-in at this time is already linked to the browser.
2. For the purpose of performing real-time processing, the browser plug-in sends the web page containing the tags providing the multimodality to an external interpreter (SALT/VoiceXML, for example).
3. The browser immediately processes the web page and presents it in the television set.
4. The multimodal interpreter (SALT/VoiceXML) simultaneously recognizes and processes the multimodal tags, communicating to the voice servers the actions which must be performed at all times (for example, at a given time during the human-computer dialogue with the interactive application, the multimodal tag indicates that it is necessary to play by audio a certain message appearing on the screen).
5. The voice server sends to the decoder the audio data, which are processed and adapted such that they are playable by the audio device of the television set and heard by the user.
6. At this point, the state of the dialogue has advanced one step, and the browser plug-in informs the SALT interpreter that it can process the following multimodal tag, this process being repeated until the user abandons the application.
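Step 6 can be pictured with a small sketch of the handshake by which the plug-in advances the dialogue; the function and message names are assumptions for illustration only.

    // Once the audio for the current tag has been played, the plug-in tells the
    // interpreter that it may process the following multimodal tag.
    function advanceDialogue(state, notifyInterpreter) {
      state.currentTag = state.currentTag + 1;      // the dialogue has advanced one step
      notifyInterpreter({ command: "PROCESS_NEXT_TAG", tagIndex: state.currentTag });
      return state.currentTag < state.totalTags;    // false once the application has been completed
    }

    // Example call after a prompt has finished playing; a real plug-in would send
    // the message to the SALT/VoiceXML interpreter over the IP connection.
    var dialogueState = { currentTag: 0, totalTags: 5 };
    advanceDialogue(dialogueState, function (message) { /* forward to the interpreter */ });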
The multimodal application is made up of an HTML document comprising two main frames (a frame being an element in HTML terminology which corresponds to a part of a web page): a frame containing the application, in which the actual web content resides (the application frame, which is what is displayed in the graphic interface and what the user ultimately sees in his terminal), and another frame for creating, at execution time, by instantiation, the external module or plug-in (the frame of the external module). The frames are used to separate the content from the application. The application frame contains the elements with which the user is able to interact both graphically and vocally. The external module or plug-in is an application which is related to the application frame in order to provide it with the specific functionality of the multimodal logic, i.e., that which allows maintaining a human-computer dialogue by means of voice commands, in a manner that is complementary to the traditional graphic interface and the remote control, during the entire execution period of the application. The structure of a multimodal application resides in an XML-type document formatted to comply with the specifications defined in SALT (Speech Application Language Tags) or VoiceXML.
The external frame is hidden from the user because it does not contain a graphic interface and it is only in charge of clustering the SALT/VoiceXML tags, mentioned in the previous paragraph, specific to each application, and of the instantiation of the external module. This set of SALT/VoiceXML tags determines the multimodal logic, i.e., the interactions allowed in the application. These tags allow configuring the synthesis and execution of the voice as well as the speech recognition device and the set of events that can be performed using the voice interface. For example, during the human-computer dialogue with the application, the user can see on his television set an edit field prompting him to enter text with the remote control; in parallel, in the application there will be a SALT/VoiceXML tag which will indicate that the system at that time is waiting for the user to give a voice command. At this time, the user can choose to enter the data with the remote control or he can give the equivalent voice order.
In order for there to be agreement between what is displayed on the screen, in this case the television set, and the interactions or events which are performed vocally, both frames communicate with one another using an Application Programming Interface or API through the architecture of the browser itself, which is executed in the decoder. This API defines the set of communication processes and functions between the two frames, achieving a level of abstraction and separation between them. The API is defined in JavaScript because it is compatible with the browser integrated in the decoder.
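A minimal sketch of such a JavaScript API between the two frames is shown below; the frame name, functions and hook are assumptions made for illustration and do not reproduce the patent's actual API.

    // In the hidden (multimodal) frame:
    var MultimodalAPI = {
      // Called by the application frame when a graphic event occurs, so that the
      // interpreter's state machine stays in step with the graphic interface.
      notifyGraphicEvent: function (elementId, eventType) {
        // forwarding to the interpreter over IP is omitted in this sketch
      },
      // Called when the interpreter reports a recognized voice command; the event
      // is dispatched to the application frame.
      dispatchVoiceEvent: function (elementId, recognizedText) {
        var appWindow = window.parent.frames["application"]; // hypothetical frame name
        if (appWindow && typeof appWindow.onVoiceEvent === "function") {
          appWindow.onVoiceEvent(elementId, recognizedText);
        }
      }
    };

    // In the application frame, a matching hook the hidden frame can call:
    // window.onVoiceEvent = function (elementId, text) { document.getElementById(elementId).value = text; };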
This frame structure allows separating the service logic from the multimodal logic. The multimodal logic (provided by the hidden frame in which the external module or plug-in is executed) is associated with the management of the human-computer interaction from either the graphic or the voice interface, i.e., it would be in charge of the management of the events from the voice or graphic interface, launching the relevant actions given these events. It would also be in charge of solving the problems raised by simultaneous interactions between both interfaces. The service logic (provided by the application frame) would be associated with achieving the objective that the user has when using the application, such as purchasing a movie in a pay-per-view DTV service. This structure also allows the external module or plug-in to keep the service providing the multimodality active when browsing between the different pages and contents of the main application. This means that while a new graphic interface is being loaded, Text To Speech or TTS conversions, or conversions of speech into a format that can be understood by the system, also referred to as SR (Speech Recognition), can be performed. This improves the user's experience, preventing any type of wait, because the service providing the multimodality is always working in the background.
The plug-in providing the multimodality requires the communications with the user to be based on events, i.e., it requires that some process take place in response to an action being performed. For example, at some time during the dialogue the user can press an accept button with the remote control or send the voice command "Accept" for its recognition; the same execution code will be generated in both cases, without distinguishing the mechanism of interaction through which the event arrived. If it is through the graphic interface, the plug-in will inform the SALT/VoiceXML interpreter that this button has been pressed so that its state machine is synchronized with the execution of the application, while at the same time the code associated with pressing the key will be executed. If the user sends a voice command and it is correctly recognized, the SALT/VoiceXML interpreter will inform the plug-in of such fact by means of a command, and the plug-in will generate the corresponding event, which will execute the code corresponding to that event.
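The event unification described in the previous paragraph can be sketched as follows: both the remote-control key and the recognized voice command end up calling the same handler, so the service logic never needs to know which mechanism was used. Key codes, command strings and the transport stub are assumptions made for this sketch.

    // Service-logic code executed for the Accept action, whichever way it arrived.
    function onAccept() {
      // e.g. confirm the purchase of the selected movie
    }

    // Graphic path: the remote control generates an ordinary key event.
    var ACCEPT_KEY = 13; // assumed key code for the remote's OK/accept button
    function handleRemoteKey(keyCode) {
      if (keyCode === ACCEPT_KEY) {
        informInterpreter("BUTTON_PRESSED", "accept"); // keep the interpreter's state machine in sync
        onAccept();
      }
    }

    // Voice path: the interpreter reports a correctly recognized command.
    function handleInterpreterCommand(command) {
      if (command === "accept") {
        onAccept(); // same code as the graphic path
      }
    }

    // Hypothetical transport stub towards the SALT/VoiceXML interpreter.
    function informInterpreter(type, value) { /* forward over IP */ }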
To manage all the events produced by both the user and the system, the JavaScript programming language is preferably used, since it allows programming event handlers, which are in charge of capturing the actions occurring in the system.
The browser that is used in Digital TV decoders interprets the JavaScript code which is integrated in the Web pages. Furthermore, the multimodal logic itself (the interactions allowed), determined by the SALT/VoiceXML tags, is encoded in JavaScript. As a result, the external multimodal module requires JavaScript in order to communicate with the browser.
The external multimodal module also needs to access the audio hardware resources of the decoder to control basic actions such as playing, stopping, etc.
Figure 1 shows the high-level architecture of an example of a system capable of carrying out the method of the invention:
Decoder [100]:
[110] Browser with possibility of including an external module [120].
Customized logic support capable of allowing the incorporation of modules enabling an IP connection with other systems and any other communication protocol.
SIP (Session Initiation Protocol) client [140]: in charge of creating a session between the decoder [100] and the MRCP (Media Resource Control Protocol) server [460].
RTP (Real Time Protocol) client [170]
Audio recording [190] and audio playback [180].
IP network connection [500]
External servers:
• MRCP server [350] and [460]
SALT/VoiceXML interpreter [300]
Speech resources [400] (TTS or Text-to-Speech, SR or Speech Recognition)
The communication processes between the client (decoder) [100] and the resources of the MRCP server [460] are performed through the SIP (Session Initiation Protocol) [610], which allows setting up multimedia sessions by means of exchanging messages between the parties who wish to communicate with one another. The decoder [100] implements a SIP module [140] which creates a session [610] by sending requests to the MRCP server. In this message, it also sends the characteristics of the session which it wishes to set up, such as supported audio coders/decoders, addresses, ports on which they are expected to be received, transmission rates, etc., which are necessary for performing the speech synthesis and recognition processes. All these actions are coordinated by the SALT interpreter [300].
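As a rough illustration, the session characteristics mentioned above are typically carried as a session description inside the SIP request; the sketch below builds such a description with placeholder addresses, ports and the standard G.711 payload types. None of these concrete values comes from the patent.

    // Sketch of the session characteristics offered during SIP session set-up.
    // Payload types 0 (PCMU) and 8 (PCMA) are the standard RTP numbers for G.711 audio.
    function buildSessionOffer(localIp, rtpPort) {
      return [
        "v=0",
        "o=decoder 0 0 IN IP4 " + localIp,
        "s=multimodal-session",
        "c=IN IP4 " + localIp,
        "t=0 0",
        "m=audio " + rtpPort + " RTP/AVP 0 8", // port on which the decoder expects the RTP audio
        "a=rtpmap:0 PCMU/8000",
        "a=rtpmap:8 PCMA/8000"
      ].join("\r\n");
    }

    // Example: the SIP client [140] could place this body inside its request to the MRCP server [460].
    var offer = buildSessionOffer("192.0.2.10", 4000);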
On the decoder [100] side, an RTP (Real Time Protocol) module [170] may be necessary, since it is possible that the decoder does not support streaming playback by means of RTP. It is therefore necessary to use a player [180] capable of sending to the system loudspeaker the raw voice data collected and stored in real time in the buffer of the RTP client [170].
The process of sending the audio [620] through the RTP channel to the RTP element [411] is based on using an application [190] capable of recording and collecting the raw voice data from the audio input device (for example, a microphone) to then create a buffer with said data. Depending on the possibilities of the voice servers and on the formats to be supported, it would be necessary to use different speech compressors/decompressors, such as for example, PCMU-PCM
(Pulse Code Modulation mu-law). This compressor/decompressor application [180][190] implements the client/server paradigm, the object of the transaction being the voice data on an RTP protocol [620]. The application consists of two main modules:
• RTP player [180]: decompresses the sound in PCMU (or PCMA) format coming from the RTP channel, converts it into PCM and finally plays it.
• RTP recorder [190]: reads data from the audio input device, compresses/converts it from PCM to PCMU (or PCMA) and finally sends it through the RTP channel [620].
All these components are in turn coordinated by the SALT interpreter [300].
The most common actions performed with them are PLAY, STOP, PAUSE and REC.
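The conversion performed by the RTP recorder [190] and player [180] between the decoder's linear PCM audio and the PCMU format carried on RTP is standard G.711 mu-law companding. A minimal sketch is given below; it is textbook G.711 code, not code extracted from the patent.

    var BIAS = 0x84;   // standard G.711 mu-law bias (132)
    var CLIP = 32635;  // clip level for 16-bit PCM input

    // RTP recorder direction: 16-bit signed PCM sample -> 8-bit PCMU byte.
    function pcmToMulaw(sample) {
      var sign = (sample >> 8) & 0x80;
      if (sign) sample = -sample;
      if (sample > CLIP) sample = CLIP;
      sample += BIAS;
      var exponent = 7;
      for (var mask = 0x4000; (sample & mask) === 0 && exponent > 0; exponent--, mask >>= 1) {}
      var mantissa = (sample >> (exponent + 3)) & 0x0F;
      return (~(sign | (exponent << 4) | mantissa)) & 0xFF;
    }

    // RTP player direction: 8-bit PCMU byte -> 16-bit signed PCM sample.
    function mulawToPcm(mulawByte) {
      var u = (~mulawByte) & 0xFF;
      var sign = u & 0x80;
      var exponent = (u >> 4) & 0x07;
      var mantissa = u & 0x0F;
      var sample = (((mantissa << 3) + BIAS) << exponent) - BIAS;
      return sign ? -sample : sample;
    }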
The invention can be applied to almost all the interactive services executed on Digital TV decoders that comply with the aforementioned characteristics. Examples of such interactive services include:
• Control and browsing through EPGs (Electronic Program Guide) or
ESG (Electronic Service Guide) or UEG (Unified Electronic Guide).
• Control and browsing through the VoD (Video on Demand), CoD (Content on Demand) functionalities.
• Control and browsing through services of home banking, electronic purchase/sale, access to product catalogues, etc.
• Control and browsing through the functionalities offered by electronic messaging applications and browsing through the Internet by means of browsers.

Claims

1.- Method of multimodal interaction on interactive Digital TV applications, wherein the television is provided with a network decoder (100) incorporating an associated browser (110), characterized by the following steps:
a. Connecting the browser to a network server and downloading a multimodal application and its descriptive tags which are generated in response to an interaction event caused by a user during the human-computer dialogue
b. The browser sending the tags characterizing the multimodal application to an interpreter (300) residing in a network server
c. The interpreter interpreting the tags, which interpreter orders the execution of actions corresponding to the tags
d. Repeating steps a-c until the user exits the application
2.- Method according to claim 1, characterized in that in step a. the events are graphics and/or speech.
3.- Method according to claim 2, wherein an external module (120) is associated with the browser (110) with the function of transferring the descriptive tags of the speech dialogue to the interpreter of said tags (300) by means of an IP protocol.
4.- Method according to claim 3, wherein the interpreter (300) of the descriptive tags of the speech dialogue coordinates and controls all the speech events.
5.- Method according to claim 4, wherein the interpreter (300) of the descriptive tags of the speech dialogue communicates with one or several servers providing speech resources by means of the MRCP protocol.
6.- Method according to claim 5, wherein the interpreter (300) of the descriptive tags of the speech dialogue analyzes the structure of the multimodal application and sends the corresponding commands to the voice server which complies with the MRCP protocol.
7.- Method according to claim 6, wherein the interpreter (300) of the descriptive tags of the speech dialogue communicates with the external module
(120) associated with the browser of the decoder and transfers thereto the data necessary for such module to set up a session by means of SIP with the MRCP voice server (460).
8.- Method according to claim 7, wherein the decoder receives and sends the voice data to the MRCP server (460) by means of the RTP protocol.
9.- Method according to claim 8, characterized in that the external module (120) associated with the browser sets up communication with an RTP client (170), thus obtaining the state of the communication between the decoder and the MRCP voice server (460).
10.- Method according to any of claims 5-9, wherein the decoder (100) has an application (190) capable of collecting the data coming from any external device which collects audio data and is capable of sending them by means of an IP connection to the voice servers.
11.- Method according to claim 10, wherein said application is capable of compressing said audio data to the format compatible with an MRCP server and sending them through the RTP protocol to the voice server (400).
12.- Method according to claim 11, wherein the decoder has an application (180) capable of collecting the audio data coming from the RTP channel, decompressing the data to the format playable by the decoder and sending them to an electronic device existing therein in charge of the audio generation.
13.- Method according to any of claims 3-12, wherein the communication between the browser (110) existing in the decoder and the external module (120) is performed through an application programming interface.
14.- Method according to any of the previous claims, wherein the multimodal applications executed in the browser (110) are pre-processed, separating the multimodal logic from the service logic before being displayed to the user.
15.- System capable of carrying out any of the methods of claims 1 to 14.
16.- Use of the system of claim 15 in a pay-per-view digital television service.
PCT/EP2010/058886 2009-06-30 2010-06-23 Multimodal interaction on digital television applications WO2011000749A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ES200930385A ES2382747B1 (en) 2009-06-30 2009-06-30 MULTIMODAL INTERACTION ON DIGITAL TELEVISION APPLICATIONS
ESP200930385 2009-06-30

Publications (1)

Publication Number Publication Date
WO2011000749A1 true WO2011000749A1 (en) 2011-01-06

Family

ID=42712507

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2010/058886 WO2011000749A1 (en) 2009-06-30 2010-06-23 Multimodal interaction on digital television applications

Country Status (4)

Country Link
AR (1) AR077281A1 (en)
ES (1) ES2382747B1 (en)
UY (1) UY32729A (en)
WO (1) WO2011000749A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4829423A (en) 1983-01-28 1989-05-09 Texas Instruments Incorporated Menu-based natural language understanding system
US5265014A (en) 1990-04-10 1993-11-23 Hewlett-Packard Company Multi-modal user interface
US5577165A (en) 1991-11-18 1996-11-19 Kabushiki Kaisha Toshiba Speech dialogue system for facilitating improved human-computer interaction
US6345111B1 (en) 1997-02-28 2002-02-05 Kabushiki Kaisha Toshiba Multi-modal interface apparatus and method
EP1455282A1 (en) * 2003-03-06 2004-09-08 Alcatel Markup language extension enabling speech recognition for controlling an application
US20060150082A1 (en) * 2004-12-30 2006-07-06 Samir Raiyani Multimodal markup language tags
US20080255850A1 (en) * 2007-04-12 2008-10-16 Cross Charles W Providing Expressive User Interaction With A Multimodal Application

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2675153A1 (en) * 2012-06-14 2013-12-18 Samsung Electronics Co., Ltd Display apparatus, interactive server, and method for providing response information
US9219949B2 (en) 2012-06-14 2015-12-22 Samsung Electronics Co., Ltd. Display apparatus, interactive server, and method for providing response information
EP2680597A3 (en) * 2012-06-29 2014-09-03 Samsung Electronics Co., Ltd Display apparatus, electronic device, interactive system, and controlling methods thereof
US8983299B2 (en) 2012-06-29 2015-03-17 Samsung Electronics Co., Ltd. Display apparatus, electronic device, interactive system, and controlling methods thereof
EP3214842A1 (en) * 2012-06-29 2017-09-06 Samsung Electronics Co., Ltd. Display apparatus, electronic device, interactive system, and controlling methods thereof
USRE47168E1 (en) 2012-06-29 2018-12-18 Samsung Electronics Co., Ltd. Display apparatus, electronic device, interactive system, and controlling methods thereof
USRE48423E1 (en) 2012-06-29 2021-02-02 Samsung Electronics Co., Ltd. Display apparatus, electronic device, interactive system, and controlling methods thereof
EP3833036A1 (en) * 2012-06-29 2021-06-09 Samsung Electronics Co., Ltd. Display apparatus, electronic device, interactive system, and controlling methods thereof
USRE49493E1 (en) 2012-06-29 2023-04-11 Samsung Electronics Co., Ltd. Display apparatus, electronic device, interactive system, and controlling methods thereof
TWI561072B (en) * 2015-08-05 2016-12-01 Chunghwa Telecom Co Ltd

Also Published As

Publication number Publication date
UY32729A (en) 2011-01-31
ES2382747B1 (en) 2013-05-08
ES2382747A1 (en) 2012-06-13
AR077281A1 (en) 2011-08-17

Similar Documents

Publication Publication Date Title
US10650816B2 (en) Performing tasks and returning audio and visual feedbacks based on voice command
EP1143679B1 (en) A conversational portal for providing conversational browsing and multimedia broadcast on demand
US7086079B1 (en) Method and apparatus for internet TV
CN101036385B (en) Method and system for providing interactive services in digital television
US20110067059A1 (en) Media control
US20120317492A1 (en) Providing Interactive and Personalized Multimedia Content from Remote Servers
EP3790284A1 (en) Interactive video generation
CN111625716B (en) Media asset recommendation method, server and display device
CN112163086A (en) Multi-intention recognition method and display device
WO2011000749A1 (en) Multimodal interaction on digital television applications
CN111182339A (en) Method for playing media item and display equipment
CN114900386A (en) Terminal equipment and data relay method
CN116614659A (en) Screen projection method, display device and intelligent device
CN111914565A (en) Electronic equipment and user statement processing method
CN112883144A (en) Information interaction method
CN113207042B (en) Media asset playing method and display equipment
CN112788372B (en) Media resource platform registration method, display equipment and server
CN113053380B (en) Server and voice recognition method
CN113490021B (en) User interface display method and display device
CN113849664A (en) Display device, server and media asset searching method
CN113282773A (en) Video searching method, display device and server
CN117407611A (en) Portal wall-mounting method, portal wall-mounting device and computer storage medium
US9578396B2 (en) Method and device for providing HTML-based program guide service in a broadcasting terminal, and recording medium therefor
WO2011125066A1 (en) A cost effective communication device
CN115344722A (en) Display device, server and media asset searching method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10726501

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10726501

Country of ref document: EP

Kind code of ref document: A1