US20120296644A1 - Hybrid Speech Recognition

Info

Publication number
US20120296644A1
Authority
US
United States
Prior art keywords
results
speech recognition
speech
recognizer
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/563,981
Inventor
Detlef Koll
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MModal IP LLC
Original Assignee
MModal IP LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by MModal IP LLC
Priority to US13/563,981
Assigned to ROYAL BANK OF CANADA, AS ADMINISTRATIVE AGENT: SECURITY AGREEMENT. Assignors: MMODAL IP LLC, MULTIMODAL TECHNOLOGIES, LLC, POIESIS INFOMATICS INC.
Assigned to MMODAL IP LLC: ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MULTIMODAL TECHNOLOGIES, LLC
Publication of US20120296644A1
Assigned to MMODAL IP LLC: RELEASE OF SECURITY INTEREST. Assignors: ROYAL BANK OF CANADA, AS ADMINISTRATIVE AGENT
Assigned to WELLS FARGO BANK, NATIONAL ASSOCIATION, AS AGENT: SECURITY AGREEMENT. Assignors: MMODAL IP LLC
Assigned to CORTLAND CAPITAL MARKET SERVICES LLC: PATENT SECURITY AGREEMENT. Assignors: MMODAL IP LLC
Assigned to MMODAL IP LLC: CHANGE OF ADDRESS. Assignors: MMODAL IP LLC
Assigned to MMODAL IP LLC: RELEASE BY SECURED PARTY (SEE DOCUMENT FOR DETAILS). Assignors: CORTLAND CAPITAL MARKET SERVICES LLC, AS ADMINISTRATIVE AGENT
Assigned to MEDQUIST CM LLC, MEDQUIST OF DELAWARE, INC., MULTIMODAL TECHNOLOGIES, LLC, MMODAL IP LLC, MMODAL MQ INC.: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS. Assignors: WELLS FARGO BANK, NATIONAL ASSOCIATION, AS AGENT
Legal status: Abandoned

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/32: Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Definitions

  • automatic speech recognizers (ASRs)
  • Some applications of automatic speech recognizers require shorter turnaround times (the amount of time between when the speech is spoken and when the speech recognizer produces output) than others in order to appear responsive to the end user.
  • a speech recognizer that is used for a “live” speech recognition application such as controlling the movement of an on-screen cursor, may require a shorter turnaround time (also referred to as a “response time”) than a speech recognizer that is used to produce a transcript of a medical report.
  • the desired turnaround time may depend, for example, on the content of the speech utterance that is processed by the speech recognizer. For example, for a short command-and-control utterance, such as “close window,” a turnaround time above 500 ms may appear sluggish to the end user. In contrast, for a long dictated sentence which the user desires to transcribe into text, response times of 1000 ms may be acceptable to the end user. In fact, in the latter case users may prefer longer response times because they may otherwise feel that their speech is being interrupted by the immediate display of text in response to their speech. For longer dictated passages, such as entire paragraphs, even longer response times of multiple seconds may be acceptable to the end user.
  • One known technique for overcoming these resource constraints in the context of embedded devices is to delegate some or all of the speech recognition processing responsibility to a speech recognition server that is located remotely from the embedded device and which has significantly greater computing resources than the embedded device.
  • When a user speaks into the embedded device in this situation, the embedded device does not attempt to recognize the speech using its own computing resources. Instead, the embedded device transmits the speech (or a processed form of it) over a network connection to the speech recognition server, which recognizes the speech using its greater computing resources and therefore produces recognition results more quickly than the embedded device could have produced with the same accuracy.
  • the speech recognition server then transmits the results back over the network connection to the embedded device.
  • this technique produces highly-accurate speech recognition results more quickly than would otherwise be possible using the embedded device alone.
  • In practice, this use of server-side speech recognition has a variety of shortcomings.
  • Because server-side speech recognition relies on the availability of high-speed and reliable network connections, the technique breaks down if such connections are not available when needed.
  • the potential increases in speed made possible by server-side speech recognition may be negated by use of a network connection without sufficiently high bandwidth.
  • the typical network latency of an HTTP call to a remote server can range from 100 ms to 500 ms. If spoken data arrives at a speech recognition server 500 ms after it is spoken, it will be impossible for that server to produce results quickly enough to satisfy the minimum turnaround time (500 ms) required by command-and-control applications. As a result, even the fastest speech recognition server will produce results that appear sluggish if used in combination with a slow network connection.
  • a hybrid speech recognition system uses a client-side speech recognition engine and a server-side speech recognition engine to produce speech recognition results for the same speech.
  • An arbitration engine produces speech recognition output based on one or both of the client-side and server-side speech recognition results.
  • FIG. 1 is a dataflow diagram of a speech recognition system according to one embodiment of the present invention
  • FIG. 2 is a flowchart of a method performed by the system of FIG. 1 according to one embodiment of the present invention
  • FIGS. 3A-3E are flowcharts of methods performed by an arbitration engine to produce hybrid speech recognition output according to various embodiments of the present invention.
  • FIGS. 4A-4F are flowcharts of methods performed by a speech recognition system to process overlapping recognition results from multiple speech recognition engines according to various embodiments of the present invention.
  • Referring to FIG. 1, a dataflow diagram is shown of a speech recognition system 100 according to one embodiment of the present invention.
  • Referring to FIG. 2, a flowchart is shown of a method 200 performed by the system 100 of FIG. 1 according to one embodiment of the present invention.
  • a user 102 of a client device 106 speaks and thereby provides speech 104 to the client device (step 202 ).
  • the client device 106 may be any device, such as a desktop or laptop computer, cellular telephone, personal digital assistant (PDA), or telephone. Embodiments of the present invention, however, are particularly useful in conjunction with resource-constrained clients, such as computers or mobile computing devices with slow processors or small amounts of memory, or computers running resource-intensive software.
  • the device 106 may receive the speech 104 from the user 102 in any way, such as through a microphone connected to a sound card.
  • the speech 104 may be embodied in an audio signal which is tangibly stored in a computer-readable medium and/or transmitted over a network connection or other channel.
  • the client device 106 includes an application 108 , such as a transcription application or other application which needs to recognize the speech 104 .
  • the application 108 transmits the speech 104 to a delegation engine 110 (step 204 ).
  • the application 108 may process the speech 104 in some way and provide the processed version of the speech 104 , or other data derived from the speech 104 , to the delegation engine 110 .
  • the delegation engine 110 itself may process the speech 104 (in addition to or instead of any processing performed on the speech by the application) in preparation for transmitting the speech for recognition.
  • the delegation engine 110 may present the same interface to the application 108 as that presented by a conventional automatic speech recognition engine.
  • the application 108 may provide the speech 104 to the delegation engine 110 in the same way that it would provide the speech 104 directly to a conventional speech recognition engine.
  • The creator of the application 108, therefore, need not know that the delegation engine 110 is not itself a conventional speech recognition engine.
  • the delegation engine 110 also provides speech recognition results back to the application 108 in the same manner as a conventional speech recognition engine. Therefore, the delegation engine 110 appears to perform the same function as a conventional speech recognition engine from the perspective of the application 108 .
  • the delegation engine 110 provides the speech 104 (or a processed form of the speech 104 or other data derived from the speech 104 ) to both a client-side automatic speech recognition engine 112 in the client device 106 (step 206 ) and to a server-side automatic speech recognition engine 120 in a server 118 located remotely over a network 116 (step 208 ).
  • the server 118 may be a computing device which has significantly greater computing resources than the client device.
  • the client-side speech recognizer 112 and server-side speech recognizer 120 may be conventional speech recognizers.
  • the client-side speech recognizer 112 and server-side speech recognizer 120 may, however, differ from each other.
  • the server-side speech recognizer 120 may use more complex speech recognition models which require more computing resources than those used by the client-side speech recognizer 112 .
  • one of the speech recognizers 112 and 120 may be speaker-independent, while the other may be adapted to the voice of the user 102 .
  • the client-side recognizer 112 and server-side recognizer 120 may have different response times due to a combination of differences in the computing resources of the client 106 and server 118 , differences in the speech recognizers themselves 112 and 120 , and the fact that the results from the server-side recognizer 120 must be provided back to the client device 106 over the network 116 , thereby introducing latency not incurred by the client-side recognizer 112 .
  • Responsibilities may be divided between the client-side speech recognizer 112 and server-side speech recognizer 120 in various ways, whether or not such recognizers 112 and 120 differ from each other.
  • the client-side speech recognizer 112 may be used solely for command-and-control speech recognition, while the server-side speech recognizer 120 may be used for both command-and-control and dictation recognition.
  • the client-side recognizer 112 may only be permitted to utilize up to a predetermined maximum percentage of processor time on the client device 106 .
  • the delegation engine 110 may be configured to transmit appropriate speech to the client-side recognizer 112 and server-side recognizer 120 in accordance with the responsibilities of each.
  • the client-side recognizer 112 produces speech recognition results 114 , such as text based on the speech 104 (step 210 ).
  • the server-side recognizer 120 produces speech recognition results 122 , such as text based on the speech 104 (step 212 ).
  • the results 114 may include other information, such as the set of best candidate words, confidence measurements associated with those words, and other output typically provided by speech recognition engines.
  • the client-side results 114 and server-side results 122 may differ from each other.
  • the client-side recognizer 112 and server-side recognizer 120 both provide their results 114 and 122, respectively, to an arbitration engine 124 in the client device 106.
  • the arbitration engine 124 analyzes one or both of the results 114 and 122 to decide which of the two results 114 and 122 to provide (as results 126 ) to the delegation engine 110 (step 214 ).
  • the arbitration engine 124 may perform step 214 either after receiving both of the results 114 and 122 , or after receiving one of the results 114 and 122 but not the other. Therefore, in general the arbitration engine 124 produces the output 126 based on the client-side results 114 and/or the server-side results 122 .
  • the delegation engine 110 provides the selected results 126 back to the requesting application 108 (step 216 ).
  • the requesting application 108 receives speech recognition results 126 back from the delegation engine 110 as if the delegation engine 110 were a single, integrated speech recognition engine 110 .
  • the details of the operations performed by the delegation engine 110 and arbitration engine 124 are hidden from the requesting application 108 .
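The dataflow of steps 204-216 can be sketched in Python as follows. This is a minimal illustration only, not the patented implementation: the names `Result`, `DelegationEngine`, `prefer_server`, and `_result_or_none` are hypothetical, the recognizers are stand-in callables, and the sketch assumes that a recognizer which raises an exception is simply "not accessible". For simplicity it also waits for both recognizers to finish before arbitrating, which the text above does not require.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    text: str
    confidence: float
    source: str  # "client" or "server" (illustrative field, not from the source)


Recognizer = Callable[[bytes], Result]
Arbiter = Callable[[Optional[Result], Optional[Result]], Result]


class DelegationEngine:
    """Exposes one recognize() call, as a conventional engine would."""

    def __init__(self, client: Recognizer, server: Recognizer, arbitrate: Arbiter):
        self._client = client
        self._server = server
        self._arbitrate = arbitrate

    def recognize(self, speech: bytes) -> Result:
        # Send the same speech to both recognizers (steps 206 and 208).
        with ThreadPoolExecutor(max_workers=2) as pool:
            client_future = pool.submit(self._client, speech)
            server_future = pool.submit(self._server, speech)
            client_result = _result_or_none(client_future)
            server_result = _result_or_none(server_future)
        # Arbitrate and hand the selected result back (steps 214-216).
        return self._arbitrate(client_result, server_result)


def _result_or_none(future: Future) -> Optional[Result]:
    """Treat any recognizer failure as 'no result available'."""
    try:
        return future.result()
    except Exception:
        return None


def prefer_server(client: Optional[Result], server: Optional[Result]) -> Result:
    """Assumed policy: use the server result if present, else the client result."""
    if server is not None:
        return server
    if client is not None:
        return client
    raise RuntimeError("no recognizer produced a result")
```

From the application's point of view, `DelegationEngine.recognize` is the only entry point, so the arbitration policy can be swapped without touching the caller.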
  • the arbitration engine 124 may use any of a variety of techniques to select which of the client-side results 114 and server-side results 122 to provide to the delegation engine 110. For example, as illustrated by the method 300 of FIG. 3A, the arbitration engine 124 may select the client-side results 114 as soon as those results 114 become available (step 302), if the server-side recognizer 120 is not accessible over the network (e.g., if the connection between the client 106 and the network 116 is down) (steps 304-306).
  • the arbitration engine 124 may select the server-side results 122 as soon as those results 122 become available (step 312 ), if the client-side recognizer 112 is not accessible (steps 314 - 316 ). This may occur, for example, if the client-side recognizer 112 has been disabled as a result of a high-priority CPU task being executed on the client device 106 .
  • the arbitration engine 124 may select the server-side recognizer's results 122 if those results 122 become available no later than a predetermined waiting time after the client-side recognizer's results 114 became available.
  • the arbitration engine 124 may return the server-side results 122 (step 330 ) only if they are received (step 324 ) before the predetermined waiting time has passed (step 326 ). If the server-side results 122 are not available by that time, then the arbitration engine 124 may return the client-side results 114 (step 328 ).
  • the predetermined waiting time may be selected in any way.
  • the predetermined waiting time may depend on the type of recognition result.
  • the predetermined waiting time applied by the method 320 to command-and-control grammars may be selected to be shorter than the predetermined waiting time applied to dictation grammars.
  • a predetermined waiting time of 500 ms may be applied to command-and-control grammars, while a predetermined waiting time of 1000 ms may be applied to dictation grammars.
  • the arbitration engine 124 may select the client-side recognizer's results 114 (step 346 ) as soon as those results 114 become available (step 342 ), if the confidence measure associated with those results 114 exceeds some predetermined threshold value (step 344 ).
  • the arbitration engine 124 is not limited to “selecting” one or the other of the results 114 and 122 produced by the client-side recognizer 112 and server-side recognizer 120 , respectively. Rather, for example, as illustrated by the method 350 of FIG. 3E , the arbitration engine 124 may receive the results 114 and 122 (steps 352 and 354 ), and combine or otherwise process those results 114 and 122 in various ways (step 356 ) to produce the output 126 provided back to the requesting application 108 (step 358 ). For example, the arbitration engine 124 may combine the results 114 and 122 using a well-known technology named ROVER (Recognizer Output Voting Error Reduction), or using other techniques, to produce the output 126 .
  • ROVER: Recognizer Output Voting Error Reduction
  • the arbitration engine 124 may combine the techniques disclosed above with respect to FIGS. 3A-3E , and with other techniques, in any combination.
  • the method 340 of FIG. 3D may be combined with the method 320 of FIG. 3C by performing steps 344 and 346 of method 340 after step 322 in FIG. 3C , and proceeding to step 324 of FIG. 3C if the confidence measure in step 344 does not exceed the threshold.
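The combination of FIG. 3D and FIG. 3C described above can be sketched in Python as follows. This is a hedged illustration under stated assumptions: `Result`, `arbitrate`, and `WAIT_SECONDS` are hypothetical names, the server-side results are modelled as a `concurrent.futures.Future`, and the confidence threshold of 0.9 is an arbitrary placeholder. Only the 500 ms and 1000 ms waiting times come from the text.

```python
from concurrent.futures import Future
from concurrent.futures import TimeoutError as FutureTimeout
from dataclasses import dataclass


@dataclass
class Result:
    text: str
    confidence: float


# Waiting times from the text: 500 ms for command-and-control, 1000 ms for dictation.
WAIT_SECONDS = {"command": 0.5, "dictation": 1.0}


def arbitrate(client: Result, server_future: Future, grammar: str,
              confidence_threshold: float = 0.9) -> Result:
    """Pick a result once the client-side results are already available."""
    # FIG. 3D: a sufficiently confident client-side result is returned at once.
    if client.confidence > confidence_threshold:
        return client
    # FIG. 3C: otherwise wait a bounded time for the server-side result.
    try:
        return server_future.result(timeout=WAIT_SECONDS[grammar])
    except FutureTimeout:
        return client  # server too slow: fall back to the client-side result
    except Exception:
        return client  # server failed or unreachable, as in FIG. 3A
```

Keeping the waiting times in a table keyed by grammar type makes it straightforward to tune them per application without changing the arbitration logic.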
  • results from one of the recognizers 112 and 120 may overlap in time with the results from the other recognizer, as illustrated by the method 400 of FIG. 4A .
  • For example, assume that the speech 104 is five seconds in duration, and that the client-side recognizer 112 produces high-confidence results 114 for the first two seconds of the speech 104 (step 402).
  • the arbitration engine 124 may submit those results 114 to the delegation engine 110 , which commits those results 114 (i.e., includes the results 114 in the results 126 that are passed back to the application 108 ) before the server-side results 122 become available (step 404 ).
  • the server-side results 122 for some or all of the same five seconds of speech 104 may conflict (overlap in time) with some or all the client-side results 114 (step 406 ).
  • the arbitration engine 124 may take action in response to such overlap (step 408 ).
  • if the results 114 and 122 overlap in time by less than some predetermined threshold time period (e.g., 100 ms), the arbitration engine 124 may consider results 114 and 122 to be non-overlapping and process them in any of the ways described above with respect to FIGS. 3A-3E (step 414). Otherwise, the arbitration engine 124 may consider the results 114 and 122 to be overlapping and process them accordingly, such as in the ways described in the following examples (step 416).
  • the arbitration engine 124 may consider one of the recognizers (e.g., the server-side recognizer 120 ) to be preferred over the other recognizer. In this case, if results (e.g., client-side results 114 ) from the non-preferred recognizer arrive first (step 422 ) and are committed first (step 424 ), and then results (e.g., server-side results 122 ) from the preferred recognizer arrive (step 428 ) which overlap with the previously-committed non-preferred results, the arbitration engine 124 may commit (i.e., include in the hybrid results 126 ) the preferred results (e.g., server-side results 122 ) as well (step 430 ).
  • Conversely, if results (e.g., server-side results 122) from the preferred recognizer are committed first, and results (e.g., client-side results 114) from the non-preferred recognizer then arrive which overlap with them, the arbitration engine 124 may discard the non-preferred results (step 450). Otherwise, the arbitration engine 124 may commit the later-received results or process them in another manner (step 452).
  • As illustrated by FIG. 4E, which represents one embodiment of step 408 of FIG. 4A, the arbitration engine 124 may ignore the words from the new recognition results that overlap in time with the words from the old recognition results (using timestamps associated with each word in both recognition results) (step 462), and then commit the remaining (non-overlapping) words from the new recognition results (step 464).
  • the arbitration engine 124 may use the newly-received results to update the previously-committed results (step 472 ). For example, the arbitration engine 124 may determine whether the confidence measure associated with the newly-received results exceeds the confidence measure associated with the previously-committed results (step 474 ) and, if so, replace the previously-committed results with the newly-received results (step 476 ).
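The timestamp-based handling of overlapping words (steps 462-464) can be sketched in Python as follows. This is an illustrative sketch only: `Word`, `overlap_ms`, and `commit_non_overlapping` are hypothetical names, and the `tolerance_ms` parameter is this sketch's reading of the threshold time period (e.g., 100 ms) mentioned above, under which small overlaps are treated as non-overlapping.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start_ms: int
    end_ms: int


def overlap_ms(a: Word, b: Word) -> int:
    """Length of the time interval shared by two words (0 if disjoint)."""
    return max(0, min(a.end_ms, b.end_ms) - max(a.start_ms, b.start_ms))


def commit_non_overlapping(committed: List[Word], new: List[Word],
                           tolerance_ms: int = 0) -> List[Word]:
    """Keep committed words; add only new words that do not overlap them."""
    kept = [
        w for w in new
        if all(overlap_ms(w, c) <= tolerance_ms for c in committed)
    ]
    # Return the merged transcript in time order.
    return sorted(committed + kept, key=lambda w: w.start_ms)
```

For example, if "close" (0-400 ms) and "window" (400-900 ms) are already committed, a newly received "windows" (420-900 ms) is ignored, while a newly received "now" (900-1200 ms) is committed.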
  • Embodiments of the present invention have a variety of advantages.
  • a client-side device such as a cellular telephone
  • the techniques disclosed herein effectively produce a hybrid speech recognition engine which uses both the client-side recognizer 112 and server-side recognizer 120 to produce better results than either of those recognizers could have produced individually. More specifically, the resulting hybrid result can have better operating characteristics with respect to system availability, recognition quality, and response time than could be obtained from either of the component recognizers 112 and 120 individually.
  • the techniques disclosed herein may be used to satisfy the user's turnaround time requirements even as the availability of the network 116 fluctuates over time, and even as the processing load on the CPU of the client device 106 fluctuates over time.
  • Such flexibility results from the ability of the arbitration engine 124 to respond to changes in the turnaround times of the client-side recognizer 112 and server-side recognizer 120, and to other time-varying factors.
  • Embodiments of the present invention thereby provide a distinct benefit over conventional server-side speech recognition techniques, which break down if the network slows down or becomes unavailable.
  • Hybrid speech recognition systems implemented in accordance with embodiments of the present invention may provide higher speech recognition accuracy than is provided by the faster of the two component recognizers (e.g., the server-side recognizer 120 in FIG. 1 ). This is a distinct advantage over conventional server-side speech recognition techniques, which only provide results having the accuracy of the server-side recognizer, since that is the only recognizer used by the system.
  • hybrid speech recognition systems implemented in accordance with embodiments of the present invention may provide a faster average response time than is provided by the slower of the two component recognizers (e.g., the client-side recognizer 112 in FIG. 1 ).
  • This is a distinct advantage over conventional server-side speech recognition techniques, which only provide results having the response time of the server-side recognizer, since that is the only recognizer used by the system.
  • Each of the client-side recognizer 112 and server-side recognizer 120 may be any kind of recognizer. Each of them may be chosen without knowledge of the characteristics of the other. Multiple client-side recognizers, possibly of different types, may be used in conjunction with a single server-side recognizer to effectively form multiple hybrid recognition systems. Either of the client-side recognizer 112 or server-side recognizer 120 may be modified or replaced without causing the hybrid system to break down. As a result, the techniques disclosed herein provide a wide degree of flexibility that makes them suitable for use in conjunction with a wide variety of client-side and server-side recognizers.
  • the techniques disclosed herein may be implemented without requiring any modification to existing applications which rely on speech recognition engines.
  • the delegation engine 110 may provide the same interface to the application 108 as a conventional speech recognition engine.
  • the application 108 may provide input to and receive output from the delegation engine 110 as if the delegation engine 110 were a conventional speech recognition engine.
  • the delegation engine 110, therefore, may be inserted into the client device 106 in place of a conventional speech recognition engine without requiring any modifications to the application 108.
  • the techniques described above may be implemented, for example, in hardware, software tangibly stored on a computer-readable medium, firmware, or any combination thereof.
  • the techniques described above may be implemented in one or more computer programs executing on a programmable computer including a processor, a storage medium readable by the processor (including, for example, volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • Program code may be applied to input entered using the input device to perform the functions described and to generate output.
  • the output may be provided to one or more output devices.
  • Each computer program within the scope of the claims below may be implemented in any programming language, such as assembly language, machine language, a high-level procedural programming language, or an object-oriented programming language.
  • the programming language may, for example, be a compiled or interpreted programming language.
  • Each such computer program may be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a computer processor.
  • Method steps of the invention may be performed by a computer processor executing a program tangibly embodied on a computer-readable medium to perform functions of the invention by operating on input and generating output.
  • Suitable processors include, by way of example, both general and special purpose microprocessors.
  • the processor receives instructions and data from a read-only memory and/or a random access memory.
  • Storage devices suitable for tangibly embodying computer program instructions include, for example, all forms of non-volatile memory, such as semiconductor memory devices, including EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROMs. Any of the foregoing may be supplemented by, or incorporated in, specially-designed ASICs (application-specific integrated circuits) or FPGAs (Field-Programmable Gate Arrays).
  • a computer can generally also receive programs and data from a storage medium such as an internal disk (not shown) or a removable disk.

Abstract

A hybrid speech recognition system uses a client-side speech recognition engine and a server-side speech recognition engine to produce speech recognition results for the same speech. An arbitration engine produces speech recognition output based on one or both of the client-side and server-side speech recognition results.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. patent application Ser. No. 12/890,280, filed on Sep. 24, 2010, entitled, “Hybrid Speech Recognition”; which is a continuation of U.S. patent application Ser. No. 12/550,380, filed on Aug. 30, 2009, entitled, “Hybrid Speech Recognition” (now U.S. Pat. No. 7,933,777, issued on Apr. 26, 2011); which claims the benefit of U.S. Prov. Pat. App. Ser. No. 61/093,220, filed on Aug. 29, 2008, entitled, “Hybrid Speech Recognition”; all of which are hereby incorporated by reference herein.
  • BACKGROUND
  • A variety of automatic speech recognizers (ASRs) exist for performing functions such as converting speech into text and controlling the operations of a computer in response to speech. Some applications of automatic speech recognizers require shorter turnaround times (the amount of time between when the speech is spoken and when the speech recognizer produces output) than others in order to appear responsive to the end user. For example, a speech recognizer that is used for a “live” speech recognition application, such as controlling the movement of an on-screen cursor, may require a shorter turnaround time (also referred to as a “response time”) than a speech recognizer that is used to produce a transcript of a medical report.
  • The desired turnaround time may depend, for example, on the content of the speech utterance that is processed by the speech recognizer. For example, for a short command-and-control utterance, such as “close window,” a turnaround time above 500 ms may appear sluggish to the end user. In contrast, for a long dictated sentence which the user desires to transcribe into text, response times of 1000 ms may be acceptable to the end user. In fact, in the latter case users may prefer longer response times because they may otherwise feel that their speech is being interrupted by the immediate display of text in response to their speech. For longer dictated passages, such as entire paragraphs, even longer response times of multiple seconds may be acceptable to the end user.
  • In typical prior art speech recognition systems, improving response time while maintaining recognition accuracy requires increasing the computing resources (processing cycles and/or memory) that are dedicated to performing speech recognition. Similarly, in typical prior art speech recognition systems, recognition accuracy may typically be increased without sacrificing response time only by increasing the computing resources that are dedicated to performing speech recognition. One example of a consequence of these tradeoffs is that when porting a given speech recognizer from a desktop computer platform to an embedded system, such as a cellular telephone, with fewer computing resources, recognition accuracy must typically be sacrificed if the same response time is to be maintained.
  • One known technique for overcoming these resource constraints in the context of embedded devices is to delegate some or all of the speech recognition processing responsibility to a speech recognition server that is located remotely from the embedded device and which has significantly greater computing resources than the embedded device. When a user speaks into the embedded device in this situation, the embedded device does not attempt to recognize the speech using its own computing resources. Instead, the embedded device transmits the speech (or a processed form of it) over a network connection to the speech recognition server, which recognizes the speech using its greater computing resources and therefore produces recognition results more quickly than the embedded device could have produced with the same accuracy. The speech recognition server then transmits the results back over the network connection to the embedded device. Ideally this technique produces highly-accurate speech recognition results more quickly than would otherwise be possible using the embedded device alone.
  • In practice, however, this use of server-side speech recognition has a variety of shortcomings. In particular, because server-side speech recognition relies on the availability of high-speed and reliable network connections, the technique breaks down if such connections are not available when needed. For example, the potential increases in speed made possible by server-side speech recognition may be negated by use of a network connection without sufficiently high bandwidth. As one example, the typical network latency of an HTTP call to a remote server can range from 100 ms to 500 ms. If spoken data arrives at a speech recognition server 500 ms after it is spoken, it will be impossible for that server to produce results quickly enough to satisfy the minimum turnaround time (500 ms) required by command-and-control applications. As a result, even the fastest speech recognition server will produce results that appear sluggish if used in combination with a slow network connection.
  • What is needed, therefore, are improved techniques for producing high-quality speech recognition results for embedded devices within the turnaround times required by those devices, but without requiring low-latency high-availability network connections.
  • SUMMARY
  • A hybrid speech recognition system uses a client-side speech recognition engine and a server-side speech recognition engine to produce speech recognition results for the same speech. An arbitration engine produces speech recognition output based on one or both of the client-side and server-side speech recognition results.
  • Other features and advantages of various aspects and embodiments of the present invention will become apparent from the following description and from the claims.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a dataflow diagram of a speech recognition system according to one embodiment of the present invention;
  • FIG. 2 is a flowchart of a method performed by the system of FIG. 1 according to one embodiment of the present invention;
  • FIGS. 3A-3E are flowcharts of methods performed by an arbitration engine to produce hybrid speech recognition output according to various embodiments of the present invention; and
  • FIGS. 4A-4F are flowcharts of methods performed by a speech recognition system to process overlapping recognition results from multiple speech recognition engines according to various embodiments of the present invention.
  • DETAILED DESCRIPTION
  • Referring to FIG. 1, a dataflow diagram is shown of a speech recognition system 100 according to one embodiment of the present invention. Referring to FIG. 2, a flowchart is shown of a method 200 performed by the system 100 of FIG. 1 according to one embodiment of the present invention.
  • A user 102 of a client device 106 speaks and thereby provides speech 104 to the client device (step 202). The client device 106 may be any device, such as a desktop or laptop computer, cellular telephone, personal digital assistant (PDA), or telephone. Embodiments of the present invention, however, are particularly useful in conjunction with resource-constrained clients, such as computers or mobile computing devices with slow processors or small amounts of memory, or computers running resource-intensive software. The device 106 may receive the speech 104 from the user 102 in any way, such as through a microphone connected to a sound card. The speech 104 may be embodied in an audio signal which is tangibly stored in a computer-readable medium and/or transmitted over a network connection or other channel.
  • The client device 106 includes an application 108, such as a transcription application or other application which needs to recognize the speech 104. The application 108 transmits the speech 104 to a delegation engine 110 (step 204). Alternatively, the application 108 may process the speech 104 in some way and provide the processed version of the speech 104, or other data derived from the speech 104, to the delegation engine 110. The delegation engine 110 itself may process the speech 104 (in addition to or instead of any processing performed on the speech by the application) in preparation for transmitting the speech for recognition.
  • The delegation engine 110 may present the same interface to the application 108 as that presented by a conventional automatic speech recognition engine. As a result, the application 108 may provide the speech 104 to the delegation engine 110 in the same way that it would provide the speech 104 directly to a conventional speech recognition engine. The creator of the application 108, therefore, need not know that the delegation engine 110 is not itself a conventional speech recognition engine. As will be described in more detail below, the delegation engine 110 also provides speech recognition results back to the application 108 in the same manner as a conventional speech recognition engine. Therefore, the delegation engine 110 appears to perform the same function as a conventional speech recognition engine from the perspective of the application 108.
  • The delegation engine 110 provides the speech 104 (or a processed form of the speech 104 or other data derived from the speech 104) to both a client-side automatic speech recognition engine 112 in the client device 106 (step 206) and to a server-side automatic speech recognition engine 120 in a server 118 located remotely over a network 116 (step 208). The server 118 may be a computing device which has significantly greater computing resources than the client device.
  • The client-side speech recognizer 112 and server-side speech recognizer 120 may be conventional speech recognizers. The client-side speech recognizer 112 and server-side speech recognizer 120 may, however, differ from each other. For example, the server-side speech recognizer 120 may use more complex speech recognition models which require more computing resources than those used by the client-side speech recognizer 112. As another example, one of the speech recognizers 112 and 120 may be speaker-independent, while the other may be adapted to the voice of the user 102. The client-side recognizer 112 and server-side recognizer 120 may have different response times due to a combination of differences in the computing resources of the client 106 and server 118, differences between the speech recognizers 112 and 120 themselves, and the fact that the results from the server-side recognizer 120 must be provided back to the client device 106 over the network 116, thereby introducing latency not incurred by the client-side recognizer 112.
  • Responsibilities may be divided between the client-side speech recognizer 112 and server-side speech recognizer 120 in various ways, whether or not such recognizers 112 and 120 differ from each other. For example, the client-side speech recognizer 112 may be used solely for command-and-control speech recognition, while the server-side speech recognizer 120 may be used for both command-and-control and dictation recognition. As another example, the client-side recognizer 112 may only be permitted to utilize up to a predetermined maximum percentage of processor time on the client device 106. The delegation engine 110 may be configured to transmit appropriate speech to the client-side recognizer 112 and server-side recognizer 120 in accordance with the responsibilities of each.
  • The client-side recognizer 112 produces speech recognition results 114, such as text based on the speech 104 (step 210). Similarly, the server-side recognizer 120 produces speech recognition results 122, such as text based on the speech 104 (step 212). The results 114 may include other information, such as the set of best candidate words, confidence measurements associated with those words, and other output typically provided by speech recognition engines.
  • The client-side results 114 and server-side results 122 may differ from each other. The client-side recognizer 112 and server-side recognizer 120 both provide their results 114 and 122, respectively, to an arbitration engine 124 in the client device 106. The arbitration engine 124 analyzes one or both of the results 114 and 122 to decide which of the two results 114 and 122 to provide (as results 126) to the delegation engine 110 (step 214). As will be described in more detail below, the arbitration engine 124 may perform step 214 either after receiving both of the results 114 and 122, or after receiving one of the results 114 and 122 but not the other. Therefore, in general the arbitration engine 124 produces the output 126 based on the client-side results 114 and/or the server-side results 122.
  • The delegation engine 110 provides the selected results 126 back to the requesting application 108 (step 216). As a result, the requesting application 108 receives speech recognition results 126 back from the delegation engine 110 as if the delegation engine 110 were a single, integrated speech recognition engine 110. In other words, the details of the operations performed by the delegation engine 110 and arbitration engine 124 are hidden from the requesting application 108.
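  • The dataflow of steps 206 through 216 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Result` type, the `DelegationEngine` class, the callable-based recognizer interface, and the two fake recognizers are all hypothetical names introduced here, and the sketch covers only the case in which arbitration runs after both results have been received.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    text: str
    confidence: float
    source: str  # "client" or "server" (hypothetical labels)


# Hypothetical recognizer interface: audio bytes in, Result out.
Recognizer = Callable[[bytes], Result]


class DelegationEngine:
    """Presents a single recognize() call to the application (assumed API)."""

    def __init__(self, client: Recognizer, server: Recognizer,
                 arbitrate: Callable[[Result, Result], Result]):
        self._client = client
        self._server = server
        self._arbitrate = arbitrate

    def recognize(self, audio: bytes) -> Result:
        # Steps 206 and 208: give the same audio to both recognizers in parallel.
        with ThreadPoolExecutor(max_workers=2) as pool:
            client_future = pool.submit(self._client, audio)
            server_future = pool.submit(self._server, audio)
            # Step 214: arbitrate once both results are in; step 216: return output.
            return self._arbitrate(client_future.result(), server_future.result())


def fake_client(audio: bytes) -> Result:
    return Result("open file", 0.6, "client")


def fake_server(audio: bytes) -> Result:
    return Result("open the file", 0.9, "server")


def prefer_higher_confidence(client: Result, server: Result) -> Result:
    return client if client.confidence >= server.confidence else server


engine = DelegationEngine(fake_client, fake_server, prefer_higher_confidence)
output = engine.recognize(b"\x00\x01")
```

  • The application only ever calls `engine.recognize`, which mirrors the point that the delegation and arbitration details stay hidden from the requester.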
  • The arbitration engine 124 may use any of a variety of techniques to select which of the client-side results 114 and server-side results 122 to provide to the delegation engine 110. For example, as illustrated by the method 300 of FIG. 3A, the arbitration engine 124 may select the client-side results 114 as soon as those results 114 become available (step 302), if the server-side recognizer 120 is not accessible over the network (e.g., if the connection between the client 106 and the network 116 is down) (steps 304-306).
  • Conversely, as illustrated by the method 310 of FIG. 3B, the arbitration engine 124 may select the server-side results 122 as soon as those results 122 become available (step 312), if the client-side recognizer 112 is not accessible (steps 314-316). This may occur, for example, if the client-side recognizer 112 has been disabled as a result of a high-priority CPU task being executed on the client device 106.
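  • The two availability rules of FIGS. 3A and 3B might be expressed as a single selection function. The sketch below is illustrative only; the function name, the boolean accessibility flags, and the use of `None` for results that are not yet available are assumptions made for this example.

```python
from typing import Optional


def select_available(client_results: Optional[str],
                     server_results: Optional[str],
                     client_accessible: bool,
                     server_accessible: bool) -> Optional[str]:
    # FIG. 3A: server unreachable, so use client-side results once they exist.
    if not server_accessible and client_results is not None:
        return client_results
    # FIG. 3B: client recognizer disabled, so use server-side results once they exist.
    if not client_accessible and server_results is not None:
        return server_results
    # Neither rule applies; some other arbitration technique must decide.
    return None


network_down = select_available("open file", None, True, False)
client_disabled = select_available(None, "open the file", False, True)
both_up = select_available("open file", "open the file", True, True)
```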
  • As another example, and assuming that the server-side recognizer 120 provides, on average, higher-quality recognition results than the client-side recognizer 112, the arbitration engine 124 may select the server-side recognizer's results 122 if those results 122 become available no later than a predetermined waiting time after the client-side recognizer's results 114 became available. In other words, as illustrated by the method 320 of FIG. 3C, once the client-side recognizer's results 114 become available (step 322), the arbitration engine 124 may return the server-side results 122 (step 330) only if they are received (step 324) before the predetermined waiting time has passed (step 326). If the server-side results 122 are not available by that time, then the arbitration engine 124 may return the client-side results 114 (step 328).
  • The predetermined waiting time may be selected in any way. For example, the predetermined waiting time may depend on the type of recognition result. For example, the predetermined waiting time applied by the method 320 to command-and-control grammars may be selected to be shorter than the predetermined waiting time applied to dictation grammars. As just one example, a predetermined waiting time of 500 ms may be applied to command-and-control grammars, while a predetermined waiting time of 1000 ms may be applied to dictation grammars.
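  • As a minimal sketch of method 320, assume that the arrival time of each result is known as a millisecond timestamp. The function name, the grammar labels used as dictionary keys, and the use of `None` for server-side results that never arrive are assumptions for illustration; the 500 ms and 1000 ms values are the examples given above.

```python
from typing import Optional

# Example waiting times from the text; the keys are hypothetical labels.
WAITING_TIME_MS = {"command_and_control": 500, "dictation": 1000}


def arbitrate_with_waiting_time(client_results: str,
                                client_arrival_ms: int,
                                server_results: Optional[str],
                                server_arrival_ms: Optional[int],
                                grammar: str) -> str:
    # The clock starts when the client-side results become available (step 322).
    deadline_ms = client_arrival_ms + WAITING_TIME_MS[grammar]
    server_in_time = (server_results is not None
                      and server_arrival_ms is not None
                      and server_arrival_ms <= deadline_ms)
    # Step 330 if the server answered in time, otherwise step 328.
    return server_results if server_in_time else client_results


# Server results arrive 700 ms after the client results in both calls.
late_for_commands = arbitrate_with_waiting_time(
    "open file", 300, "open the file", 1000, "command_and_control")
in_time_for_dictation = arbitrate_with_waiting_time(
    "open file", 300, "open the file", 1000, "dictation")
```

  • The same 700 ms gap therefore yields the client-side results for a command-and-control grammar and the server-side results for a dictation grammar.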
  • As yet another example, and as illustrated by the method 340 of FIG. 3D, even assuming that the server-side recognizer 120 provides, on average, higher-quality recognition results than the client-side recognizer 112, the arbitration engine 124 may select the client-side recognizer's results 114 (step 346) as soon as those results 114 become available (step 342), if the confidence measure associated with those results 114 exceeds some predetermined threshold value (step 344).
  • The arbitration engine 124 is not limited to “selecting” one or the other of the results 114 and 122 produced by the client-side recognizer 112 and server-side recognizer 120, respectively. Rather, for example, as illustrated by the method 350 of FIG. 3E, the arbitration engine 124 may receive the results 114 and 122 (steps 352 and 354), and combine or otherwise process those results 114 and 122 in various ways (step 356) to produce the output 126 provided back to the requesting application 108 (step 358). For example, the arbitration engine 124 may combine the results 114 and 122 using a well-known technology named ROVER (Recognizer Output Voting Error Reduction), or using other techniques, to produce the output 126.
  • The arbitration engine 124 may combine the techniques disclosed above with respect to FIGS. 3A-3E, and with other techniques, in any combination. For example, the method 340 of FIG. 3D may be combined with the method 320 of FIG. 3C by performing steps 344 and 346 of method 340 after step 322 in FIG. 3C, and proceeding to step 324 of FIG. 3C if the confidence measure in step 344 does not exceed the threshold.
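  • One way to picture this combination is a single function that first applies the confidence test of steps 344 and 346 and then falls through to the waiting-time logic of FIG. 3C. The sketch is hedged: the function signature, the 0.9 threshold, and the 500 ms default are illustrative assumptions, not values prescribed for the combined method.

```python
from typing import Optional


def arbitrate(client_text: str,
              client_confidence: float,
              client_arrival_ms: int,
              server_text: Optional[str],
              server_arrival_ms: Optional[int],
              confidence_threshold: float = 0.9,  # assumed value
              waiting_time_ms: int = 500) -> str:  # assumed value
    # Steps 344 and 346: a confident client result is returned immediately.
    if client_confidence > confidence_threshold:
        return client_text
    # Otherwise proceed to step 324: wait for the server up to the deadline.
    deadline_ms = client_arrival_ms + waiting_time_ms
    if (server_text is not None and server_arrival_ms is not None
            and server_arrival_ms <= deadline_ms):
        return server_text
    return client_text


confident_client = arbitrate("open file", 0.95, 300, "open the file", 400)
unsure_client = arbitrate("open file", 0.50, 300, "open the file", 400)
server_missing = arbitrate("open file", 0.50, 300, None, None)
```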
  • It is possible for results from one of the recognizers 112 and 120 to overlap in time with the results from the other recognizer, as illustrated by the method 400 of FIG. 4A. For example, assume that the speech 104 is five seconds in duration, and that the client-side recognizer 112 produces high-confidence results 114 for the first two seconds of the speech 104 (step 402). As a result of the high confidence measure of the results 114, the arbitration engine 124 may submit those results 114 to the delegation engine 110, which commits those results 114 (i.e., includes the results 114 in the results 126 that are passed back to the application 108) before the server-side results 122 become available (step 404). Then, when the server-side results 122 for some or all of the same five seconds of speech 104 become available, some or all of those results 122 may conflict (overlap in time) with some or all of the client-side results 114 (step 406). The arbitration engine 124 may take action in response to such overlap (step 408).
  • For example, as shown by the method 410 of FIG. 4B, if the client-side results 114 and the server-side results 122 overlap by less than some predetermined threshold time period (e.g., 100 ms) (step 412), then the arbitration engine 124 may consider results 114 and 122 to be non-overlapping and process them in any of the ways described above with respect to FIGS. 3A-3E (step 414). Otherwise, the arbitration engine 124 may consider the results 114 and 122 to be overlapping and process them accordingly, such as in the ways described in the following examples (step 416).
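  • The threshold test of step 412 reduces to interval arithmetic. In the sketch below, each result is assumed to carry a start and end time in milliseconds relative to the start of the speech; the function names are hypothetical, and 100 ms is the example threshold mentioned above.

```python
OVERLAP_THRESHOLD_MS = 100  # example value from the text


def overlap_ms(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    # Length of the intersection of the two time intervals (0 if disjoint).
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def treat_as_overlapping(a_start: int, a_end: int,
                         b_start: int, b_end: int) -> bool:
    # Overlaps shorter than the threshold are treated as no overlap (step 414).
    return overlap_ms(a_start, a_end, b_start, b_end) >= OVERLAP_THRESHOLD_MS


# Client results cover 0-2000 ms; two possible server result spans.
small_overlap = treat_as_overlapping(0, 2000, 1950, 5000)  # 50 ms in common
large_overlap = treat_as_overlapping(0, 2000, 1500, 5000)  # 500 ms in common
```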
  • For example, as illustrated by the method 420 of FIG. 4C, the arbitration engine 124 may consider one of the recognizers (e.g., the server-side recognizer 120) to be preferred over the other recognizer. In this case, if results (e.g., client-side results 114) from the non-preferred recognizer arrive first (step 422) and are committed first (step 424), and then results (e.g., server-side results 122) from the preferred recognizer arrive (step 428) which overlap with the previously-committed non-preferred results, the arbitration engine 124 may commit (i.e., include in the hybrid results 126) the preferred results (e.g., server-side results 122) as well (step 430). Although this results in certain portions of the speech 104 being committed twice, this may produce more desirable results than discarding the results of a preferred recognizer. If the later-received results are not from the preferred recognizer, those results may be discarded rather than committed (step 432).
  • As yet another example, as illustrated by the method 440 of FIG. 4D, if results (e.g., server-side results 122) from the preferred recognizer arrive first (step 442) and are committed first (step 444), and then results (e.g., client-side results 114) from the non-preferred recognizer arrive which overlap with the previously-committed preferred results (steps 446 and 448), then the arbitration engine 124 may discard the non-preferred results (step 450). Otherwise, the arbitration engine 124 may commit the later-received results or process them in another manner (step 452).
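  • Taken together, methods 420 and 440 amount to a simple rule for later-arriving results that overlap with committed ones: commit them if they come from the preferred recognizer, discard them otherwise. The sketch below assumes results are tagged with a source label and that the overlap has already been established; the function name and the tuple representation are hypothetical.

```python
from typing import List, Tuple

Committed = List[Tuple[str, str]]  # (source, text) pairs, assumed representation


def handle_overlapping(committed: Committed, new_source: str, new_text: str,
                       preferred: str = "server") -> Committed:
    # Steps 430 / 432 / 450: only the preferred recognizer may add overlapping results.
    if new_source == preferred:
        return committed + [(new_source, new_text)]
    return list(committed)


# Method 420: client committed first, preferred server results arrive later.
after_server = handle_overlapping([("client", "open file")],
                                  "server", "open the file")
# Method 440: server committed first, non-preferred client results arrive later.
after_client = handle_overlapping([("server", "open the file")],
                                  "client", "open file")
```

  • In the first case the overlapping speech is committed twice, which is the trade-off the text accepts in order to keep the preferred recognizer's output.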
  • More generally, as illustrated by FIG. 4E (which represents one embodiment of step 408 of FIG. 4A), if the arbitration engine 124 receives recognition results which overlap with any previously-committed result received from the same or a different speech recognizer, then the arbitration engine 124 may ignore the words from the new recognition results that overlap in time with the words from the old recognition results (using timestamps associated with each word in both recognition results) (step 462), and then commit the remaining (non-overlapping) words from the new recognition results (step 464).
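  • Steps 462 and 464 can be sketched with per-word timestamps. The `Word` type and function names below are hypothetical, and the sketch assumes that two words overlap only if their time intervals share more than a boundary point.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Word:
    text: str
    start_ms: int
    end_ms: int


def overlaps(a: Word, b: Word) -> bool:
    # Half-open interval test: words that merely touch do not overlap.
    return a.start_ms < b.end_ms and b.start_ms < a.end_ms


def commit_non_overlapping(committed: List[Word],
                           new_words: List[Word]) -> List[Word]:
    # Step 462: ignore new words that overlap any previously committed word.
    kept = [w for w in new_words
            if not any(overlaps(w, c) for c in committed)]
    # Step 464: commit the remaining words, keeping time order.
    return sorted(committed + kept, key=lambda w: w.start_ms)


committed = [Word("open", 0, 400), Word("the", 400, 600)]
new_words = [Word("open", 0, 390), Word("the", 410, 600),
             Word("file", 600, 1000)]
merged = commit_non_overlapping(committed, new_words)
merged_text = " ".join(w.text for w in merged)
```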
  • As yet another example, as illustrated by FIG. 4F (which represents one embodiment of step 408 of FIG. 4A), if the arbitration engine 124 receives recognition results which overlap with any previously-committed result received from the same or a different speech recognizer, then the arbitration engine 124 may use the newly-received results to update the previously-committed results (step 472). For example, the arbitration engine 124 may determine whether the confidence measure associated with the newly-received results exceeds the confidence measure associated with the previously-committed results (step 474) and, if so, replace the previously-committed results with the newly-received results (step 476).
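  • The confidence-based update of steps 474 and 476 might look like the following. The `Segment` type and function name are hypothetical, and the sketch makes one assumption the text leaves open: when the new results overlap several committed segments, they replace those segments only if their confidence exceeds that of every one of them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    text: str
    start_ms: int
    end_ms: int
    confidence: float


def update_committed(committed: List[Segment], new: Segment) -> List[Segment]:
    clashing = [s for s in committed
                if s.start_ms < new.end_ms and new.start_ms < s.end_ms]
    # Step 474: compare confidence measures of new and committed results.
    if all(new.confidence > s.confidence for s in clashing):
        # Step 476: replace the overlapping committed results with the new ones.
        kept = [s for s in committed if s not in clashing]
        return sorted(kept + [new], key=lambda s: s.start_ms)
    return list(committed)


old = [Segment("wreck a nice beach", 0, 2000, 0.7)]
replaced = update_committed(old, Segment("recognize speech", 0, 2000, 0.9))
unchanged = update_committed(old, Segment("recognize speech", 0, 2000, 0.5))
```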
  • Embodiments of the present invention have a variety of advantages. In general, embodiments of the invention enable a client-side device, such as a cellular telephone, having limited resources to obtain high-quality speech recognition results within predetermined turnaround time requirements without requiring a high-availability, high-bandwidth network connection. The techniques disclosed herein effectively produce a hybrid speech recognition engine which uses both the client-side recognizer 112 and server-side recognizer 120 to produce better results than either of those recognizers could have produced individually. More specifically, the resulting hybrid result can have better operating characteristics with respect to system availability, recognition quality, and response time than could be obtained from either of the component recognizers 112 and 120 individually.
  • For example, the techniques disclosed herein may be used to satisfy the user's turnaround time requirements even as the availability of the network 116 fluctuates over time, and even as the processing load on the CPU of the client device 106 fluctuates over time. Such flexibility results from the ability of the arbitration engine 124 to respond to changes in the turnaround times of the client-side recognizer 112 and server-side recognizer 120, and to other time-varying factors. Embodiments of the present invention thereby provide a distinct benefit over conventional server-side speech recognition techniques, which break down if the network slows down or becomes unavailable.
  • Hybrid speech recognition systems implemented in accordance with embodiments of the present invention may provide higher speech recognition accuracy than is provided by the faster of the two component recognizers (e.g., the server-side recognizer 120 in FIG. 1). This is a distinct advantage over conventional server-side speech recognition techniques, which only provide results having the accuracy of the server-side recognizer, since that is the only recognizer used by the system.
  • Similarly, hybrid speech recognition systems implemented in accordance with embodiments of the present invention may provide a faster average response time than is provided by the slower of the two component recognizers (e.g., the client-side recognizer 112 in FIG. 1). This is a distinct advantage over conventional server-side speech recognition techniques, which only provide results having the response time of the server-side recognizer, since that is the only recognizer used by the system.
  • Furthermore, embodiments of the present invention impose no constraints on the type or combinations of recognizers that may be used to form the hybrid system. Each of the client-side recognizer 112 and server-side recognizer 120 may be any kind of recognizer. Each of them may be chosen without knowledge of the characteristics of the other. Multiple client-side recognizers, possibly of different types, may be used in conjunction with a single server-side recognizer to effectively form multiple hybrid recognition systems. Either of the client-side recognizer 112 or server-side recognizer 120 may be modified or replaced without causing the hybrid system to break down. As a result, the techniques disclosed herein provide a wide degree of flexibility that makes them suitable for use in conjunction with a wide variety of client-side and server-side recognizers.
  • Moreover, the techniques disclosed herein may be implemented without requiring any modification to existing applications which rely on speech recognition engines. As described above, for example, the delegation engine 110 may provide the same interface to the application 108 as a conventional speech recognition engine. As a result, the application 108 may provide input to and receive output from the delegation engine 110 as if the delegation engine 110 were a conventional speech recognition engine. The delegation engine 110, therefore, may be inserted into the client device 106 in place of a conventional speech recognition engine without requiring any modifications to the application 108.
  • It is to be understood that although the invention has been described above in terms of particular embodiments, the foregoing embodiments are provided as illustrative only, and do not limit or define the scope of the invention. Various other embodiments, including but not limited to the following, are also within the scope of the claims. For example, elements and components described herein may be further divided into additional components or joined together to form fewer components for performing the same functions.
  • The techniques described above may be implemented, for example, in hardware, software tangibly stored on a computer-readable medium, firmware, or any combination thereof. The techniques described above may be implemented in one or more computer programs executing on a programmable computer including a processor, a storage medium readable by the processor (including, for example, volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Program code may be applied to input entered using the input device to perform the functions described and to generate output. The output may be provided to one or more output devices.
  • Each computer program within the scope of the claims below may be implemented in any programming language, such as assembly language, machine language, a high-level procedural programming language, or an object-oriented programming language. The programming language may, for example, be a compiled or interpreted programming language.
  • Each such computer program may be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a computer processor. Method steps of the invention may be performed by a computer processor executing a program tangibly embodied on a computer-readable medium to perform functions of the invention by operating on input and generating output. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, the processor receives instructions and data from a read-only memory and/or a random access memory. Storage devices suitable for tangibly embodying computer program instructions include, for example, all forms of non-volatile memory, such as semiconductor memory devices, including EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROMs. Any of the foregoing may be supplemented by, or incorporated in, specially-designed ASICs (application-specific integrated circuits) or FPGAs (Field-Programmable Gate Arrays). A computer can generally also receive programs and data from a storage medium such as an internal disk (not shown) or a removable disk. These elements will also be found in a conventional desktop or workstation computer as well as other computers suitable for executing computer programs implementing the methods described herein, which may be used in conjunction with any digital print engine or marking engine, display monitor, or other raster output device capable of producing color or gray scale pixels on paper, film, display screen, or other output medium.

Claims (1)

1. A computer-implemented method performed by a client device, the method comprising:
(A) receiving a request from a requester to apply automatic speech recognition to an audio signal;
(B) providing the audio signal to a first automatic speech recognition engine in the client device;
(C) receiving first speech recognition results from the first automatic speech recognition engine;
(D) determining whether a second automatic speech recognition engine, in a server device, is accessible to the client device;
(E) if the second automatic speech recognition engine is determined not to be accessible to the client device, then providing the first speech recognition results to the requester in response to the request.
US13/563,981 2008-08-29 2012-08-01 Hybrid Speech Recognition Abandoned US20120296644A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/563,981 US20120296644A1 (en) 2008-08-29 2012-08-01 Hybrid Speech Recognition

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US9322008P 2008-08-29 2008-08-29
US12/550,380 US7933777B2 (en) 2008-08-29 2009-08-30 Hybrid speech recognition
US12/890,280 US8249877B2 (en) 2008-08-29 2010-09-24 Hybrid speech recognition
US13/563,981 US20120296644A1 (en) 2008-08-29 2012-08-01 Hybrid Speech Recognition

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US12/890,280 Continuation US8249877B2 (en) 2008-08-29 2010-09-24 Hybrid speech recognition

Publications (1)

Publication Number Publication Date
US20120296644A1 (en) 2012-11-22

Family

ID=41722338

Family Applications (3)

Application Number Title Priority Date Filing Date
US12/550,380 Active 2029-12-08 US7933777B2 (en) 2008-08-29 2009-08-30 Hybrid speech recognition
US12/890,280 Active US8249877B2 (en) 2008-08-29 2010-09-24 Hybrid speech recognition
US13/563,981 Abandoned US20120296644A1 (en) 2008-08-29 2012-08-01 Hybrid Speech Recognition

Country Status (5)

Country Link
US (3) US7933777B2 (en)
EP (1) EP2329491B1 (en)
JP (2) JP2012501480A (en)
CA (2) CA3002206C (en)
WO (1) WO2010025440A2 (en)

US9997161B2 (en) 2015-09-11 2018-06-12 Microsoft Technology Licensing, Llc Automatic speech recognition confidence classifier
US10706852B2 (en) 2015-11-13 2020-07-07 Microsoft Technology Licensing, Llc Confidence features for automated speech recognition arbitration
CN107452383B (en) * 2016-05-31 2021-10-26 华为终端有限公司 Information processing method, server, terminal and information processing system
US11115463B2 (en) * 2016-08-17 2021-09-07 Microsoft Technology Licensing, Llc Remote and local predictions
US10546061B2 (en) 2016-08-17 2020-01-28 Microsoft Technology Licensing, Llc Predicting terms by using model chunks
KR101700099B1 (en) * 2016-10-11 2017-01-31 미디어젠(주) Hybrid speech recognition Composite Performance Auto Evaluation system
CN108010523B (en) * 2016-11-02 2023-05-09 松下电器(美国)知识产权公司 Information processing method and recording medium
JP6751658B2 (en) 2016-11-15 2020-09-09 クラリオン株式会社 Voice recognition device, voice recognition system
WO2018140420A1 (en) 2017-01-24 2018-08-02 Honeywell International, Inc. Voice control of an integrated room automation system
KR20180118461A (en) * 2017-04-21 2018-10-31 엘지전자 주식회사 Voice recognition module and and voice recognition method
US10984329B2 (en) 2017-06-14 2021-04-20 Ademco Inc. Voice activated virtual assistant with a fused response
DE112017007562B4 (en) * 2017-06-22 2021-01-21 Mitsubishi Electric Corporation Speech recognition device and method
US10515637B1 (en) 2017-09-19 2019-12-24 Amazon Technologies, Inc. Dynamic speech processing
KR102471493B1 (en) * 2017-10-17 2022-11-29 삼성전자주식회사 Electronic apparatus and method for voice recognition
US11597519B2 (en) 2017-10-17 2023-03-07 The Boeing Company Artificially intelligent flight crew systems and methods
DE102017222549A1 (en) 2017-12-13 2019-06-13 Robert Bosch Gmbh Control procedure and speech dialogue system
US10192554B1 (en) * 2018-02-26 2019-01-29 Sorenson Ip Holdings, Llc Transcription of communications using multiple speech recognition systems
KR102517228B1 (en) 2018-03-14 2023-04-04 삼성전자주식회사 Electronic device for controlling predefined function based on response time of external electronic device on user input and method thereof
US20190332848A1 (en) 2018-04-27 2019-10-31 Honeywell International Inc. Facial enrollment and recognition system
US10147428B1 (en) * 2018-05-30 2018-12-04 Green Key Technologies Llc Computer systems exhibiting improved computer speed and transcription accuracy of automatic speech transcription (AST) based on a multiple speech-to-text engines and methods of use thereof
US20190390866A1 (en) 2018-06-22 2019-12-26 Honeywell International Inc. Building management system with natural language interface
EP3800633B1 (en) * 2018-06-27 2023-10-11 Google LLC Rendering responses to a spoken utterance of a user utilizing a local text-response map
US11094326B2 (en) * 2018-08-06 2021-08-17 Cisco Technology, Inc. Ensemble modeling of automatic speech recognition output
US20210350802A1 (en) * 2019-01-08 2021-11-11 Samsung Electronics Co., Ltd. Method and system for performing speech recognition in an electronic device
US20220328047A1 (en) * 2019-06-04 2022-10-13 Nippon Telegraph And Telephone Corporation Speech recognition control apparatus, speech recognition control method, and program
WO2021029643A1 (en) 2019-08-13 2021-02-18 Samsung Electronics Co., Ltd. System and method for modifying speech recognition result
CN114223029A (en) 2019-08-13 2022-03-22 三星电子株式会社 Server supporting device to perform voice recognition and operation method of server
CN114207711A (en) 2019-08-13 2022-03-18 三星电子株式会社 System and method for recognizing speech of user
JP2020129130A (en) * 2020-04-27 2020-08-27 パイオニア株式会社 Information processing device
CN111627431B (en) * 2020-05-13 2022-08-09 广州国音智能科技有限公司 Voice recognition method, device, terminal and storage medium
CN111681647B (en) 2020-06-10 2023-09-05 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for identifying word slots
CN112164392A (en) * 2020-11-13 2021-01-01 北京百度网讯科技有限公司 Method, device, equipment and storage medium for determining displayed recognition text

Family Cites Families (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5101375A (en) * 1989-03-31 1992-03-31 Kurzweil Applied Intelligence, Inc. Method and apparatus for providing binding and capitalization in structured report generation
US5365574A (en) * 1990-05-15 1994-11-15 Vcs Industries, Inc. Telephone network voice recognition and verification using selectively-adjustable signal thresholds
ZA948426B (en) * 1993-12-22 1995-06-30 Qualcomm Inc Distributed voice recognition system
US6665639B2 (en) * 1996-12-06 2003-12-16 Sensory, Inc. Speech recognition in consumer electronic products
US6101473A (en) * 1997-08-08 2000-08-08 Board Of Trustees, Leland Stanford Jr., University Using speech recognition to access the internet, including access via a telephone
US6125345A (en) * 1997-09-19 2000-09-26 At&T Corporation Method and apparatus for discriminative utterance verification using multiple confidence measures
US6006183A (en) * 1997-12-16 1999-12-21 International Business Machines Corp. Speech recognition confidence level display
US6154465A (en) * 1998-10-06 2000-11-28 Vertical Networks, Inc. Systems and methods for multiple mode voice and data communications using intelligenty bridged TDM and packet buses and methods for performing telephony and data functions using the same
WO1999052237A1 (en) * 1998-04-03 1999-10-14 Vertical Networks Inc. System and method for transmitting voice and data using intelligent bridged tdm and packet buses
US6229880B1 (en) * 1998-05-21 2001-05-08 Bell Atlantic Network Services, Inc. Methods and apparatus for efficiently providing a communication system with speech recognition capabilities
US7003463B1 (en) * 1998-10-02 2006-02-21 International Business Machines Corporation System and method for providing network coordinated conversational services
US6377922B2 (en) * 1998-12-29 2002-04-23 At&T Corp. Distributed recognition system having multiple prompt-specific and response-specific speech recognizers
WO2000058942A2 (en) 1999-03-26 2000-10-05 Koninklijke Philips Electronics N.V. Client-server speech recognition
US6363349B1 (en) 1999-05-28 2002-03-26 Motorola, Inc. Method and apparatus for performing distributed speech processing in a communication system
US6292781B1 (en) * 1999-05-28 2001-09-18 Motorola Method and apparatus for facilitating distributed speech processing in a communication system
US7203651B2 (en) * 2000-12-07 2007-04-10 Art-Advanced Recognition Technologies, Ltd. Voice control system with multiple voice recognition engines
US6785654B2 (en) 2001-11-30 2004-08-31 Dictaphone Corporation Distributed speech recognition system with speech recognition engines offering multiple functionalities
GB2383459B (en) * 2001-12-20 2005-05-18 Hewlett Packard Co Speech recognition system and method
US6898567B2 (en) * 2001-12-29 2005-05-24 Motorola, Inc. Method and apparatus for multi-level distributed speech recognition
JP2004012653A (en) * 2002-06-05 2004-01-15 Matsushita Electric Ind Co Ltd Voice recognition system, voice recognition client, voice recognition server, voice recognition client program, and voice recognition server program
JP3759508B2 (en) * 2003-03-31 2006-03-29 オリンパス株式会社 Actuator, actuator driving method, and actuator system
US20040210443A1 (en) * 2003-04-17 2004-10-21 Roland Kuhn Interactive mechanism for retrieving information from audio and multimedia files containing speech
US7363228B2 (en) * 2003-09-18 2008-04-22 Interactive Intelligence, Inc. Speech recognition system and method
JP2005249829A (en) 2004-03-01 2005-09-15 Advanced Media Inc Computer network system performing speech recognition
US20050215260A1 (en) 2004-03-23 2005-09-29 Motorola, Inc. Method and system for arbitrating between a local engine and a network-based engine in a mobile communication network
JP4554285B2 (en) * 2004-06-18 2010-09-29 トヨタ自動車株式会社 Speech recognition system, speech recognition method, and speech recognition program
US8589156B2 (en) * 2004-07-12 2013-11-19 Hewlett-Packard Development Company, L.P. Allocation of speech recognition tasks and combination of results thereof
US7437297B2 (en) * 2005-01-27 2008-10-14 International Business Machines Corporation Systems and methods for predicting consequences of misinterpretation of user commands in automated systems
KR101073190B1 (en) 2005-02-03 2011-10-13 주식회사 현대오토넷 Distribute speech recognition system
JP2007033901A (en) 2005-07-27 2007-02-08 Nec Corp System, method, and program for speech recognition
US8612230B2 (en) * 2007-01-03 2013-12-17 Nuance Communications, Inc. Automatic speech recognition with a selection list
US8019608B2 (en) * 2008-08-29 2011-09-13 Multimodal Technologies, Inc. Distributed speech recognition using one way communication
US7933777B2 (en) * 2008-08-29 2011-04-26 Multimodal Technologies, Inc. Hybrid speech recognition
US8150696B2 (en) * 2008-12-08 2012-04-03 At&T Intellectual Property I, L.P. Method of providing dynamic speech processing services during variable network connectivity

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9105267B2 (en) 2012-01-05 2015-08-11 Denso Corporation Speech recognition apparatus
US9293137B2 (en) 2012-09-24 2016-03-22 Kabushiki Kaisha Toshiba Apparatus and method for speech recognition
US10885918B2 (en) 2013-09-19 2021-01-05 Microsoft Technology Licensing, Llc Speech recognition using phoneme matching
US9905225B2 (en) 2013-12-26 2018-02-27 Panasonic Intellectual Property Management Co., Ltd. Voice recognition processing device, voice recognition processing method, and display device
US10311878B2 (en) 2014-01-17 2019-06-04 Microsoft Technology Licensing, Llc Incorporating an exogenous large-vocabulary model into rule-based speech recognition
US9601108B2 (en) 2014-01-17 2017-03-21 Microsoft Technology Licensing, Llc Incorporating an exogenous large-vocabulary model into rule-based speech recognition
US20150281401A1 (en) * 2014-04-01 2015-10-01 Microsoft Corporation Hybrid Client/Server Architecture for Parallel Processing
CN106164869A (en) * 2014-04-01 2016-11-23 微软技术许可有限责任公司 Mixed-client/server architecture for parallel processing
US10749989B2 (en) * 2014-04-01 2020-08-18 Microsoft Technology Licensing Llc Hybrid client/server architecture for parallel processing
US10074372B2 (en) * 2014-04-07 2018-09-11 Samsung Electronics Co., Ltd. Speech recognition using electronic device and server
US20150287413A1 (en) * 2014-04-07 2015-10-08 Samsung Electronics Co., Ltd. Speech recognition using electronic device and server
US20170236519A1 (en) * 2014-04-07 2017-08-17 Samsung Electronics Co., Ltd. Speech recognition using electronic device and server
US9640183B2 (en) * 2014-04-07 2017-05-02 Samsung Electronics Co., Ltd. Speech recognition using electronic device and server
US20190080696A1 (en) * 2014-04-07 2019-03-14 Samsung Electronics Co., Ltd. Speech recognition using electronic device and server
US10643621B2 (en) * 2014-04-07 2020-05-05 Samsung Electronics Co., Ltd. Speech recognition using electronic device and server
EP3323126A4 (en) * 2015-07-17 2019-03-20 Nuance Communications, Inc. Reduced latency speech recognition system using multiple recognizers
WO2017014721A1 (en) 2015-07-17 2017-01-26 Nuance Communications, Inc. Reduced latency speech recognition system using multiple recognizers
CN108028044A (en) * 2015-07-17 2018-05-11 纽昂斯通讯公司 The speech recognition system of delay is reduced using multiple identifiers
US20170069307A1 (en) * 2015-09-09 2017-03-09 Samsung Electronics Co., Ltd. Collaborative recognition apparatus and method
US10446154B2 (en) * 2015-09-09 2019-10-15 Samsung Electronics Co., Ltd. Collaborative recognition apparatus and method
US10971157B2 (en) 2017-01-11 2021-04-06 Nuance Communications, Inc. Methods and apparatus for hybrid speech recognition processing
KR20210002921A (en) * 2019-07-01 2021-01-11 주식회사 한글과컴퓨터 Speech recognition apparatus capable of generating text corresponding to speech of a speaker based on divided speech recognition and operating method thereof
KR102266062B1 (en) * 2019-07-01 2021-06-17 주식회사 한글과컴퓨터 Speech recognition apparatus capable of generating text corresponding to speech of a speaker based on divided speech recognition and operating method thereof
US20210110824A1 (en) * 2019-10-10 2021-04-15 Samsung Electronics Co., Ltd. Electronic apparatus and controlling method thereof

Also Published As

Publication number Publication date
CA3002206C (en) 2020-04-07
CA3002206A1 (en) 2010-03-04
US20100057450A1 (en) 2010-03-04
US8249877B2 (en) 2012-08-21
CA2732255C (en) 2019-02-05
WO2010025440A3 (en) 2010-06-03
EP2329491A2 (en) 2011-06-08
JP6113008B2 (en) 2017-04-12
US7933777B2 (en) 2011-04-26
EP2329491A4 (en) 2012-11-28
CA2732255A1 (en) 2010-03-04
US20110238415A1 (en) 2011-09-29
JP2012501480A (en) 2012-01-19
EP2329491B1 (en) 2018-04-18
WO2010025440A2 (en) 2010-03-04
JP2013232001A (en) 2013-11-14

Similar Documents

Publication Publication Date Title
US7933777B2 (en) Hybrid speech recognition
US20210090554A1 (en) Enhanced speech endpointing
US10339917B2 (en) Enhanced speech endpointing
US20190318721A1 (en) Speech endpointing
US8849664B1 (en) Realtime acoustic adaptation using stability measures
US10269341B2 (en) Speech endpointing
US20060229873A1 (en) Methods and apparatus for adapting output speech in accordance with context of communication
JP7230806B2 (en) Information processing device and information processing method
US9196250B2 (en) Application services interface to ASR
US20170110118A1 (en) Speech endpointing
US11763819B1 (en) Audio encryption
EP2733697A1 (en) Application services interface to ASR
US20230230578A1 (en) Personalized speech query endpointing based on prior interaction(s)
US20230025709A1 (en) Transferring dialog data from an initially invoked automated assistant to a subsequently invoked automated assistant
WO2023086075A1 (en) Selectively generating and/or selectively rendering continuing content for spoken utterance completion

Legal Events

Date Code Title Description
AS Assignment

Owner name: ROYAL BANK OF CANADA, AS ADMINISTRATIVE AGENT, ONT

Free format text: SECURITY AGREEMENT;ASSIGNORS:MMODAL IP LLC;MULTIMODAL TECHNOLOGIES, LLC;POIESIS INFOMATICS INC.;REEL/FRAME:028824/0459

Effective date: 20120817

AS Assignment

Owner name: MMODAL IP LLC, TENNESSEE

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MULTIMODAL TECHNOLOGIES, LLC;REEL/FRAME:029205/0333

Effective date: 20121026

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: MMODAL IP LLC, TENNESSEE

Free format text: RELEASE OF SECURITY INTEREST;ASSIGNOR:ROYAL BANK OF CANADA, AS ADMINISTRATIVE AGENT;REEL/FRAME:033459/0935

Effective date: 20140731

AS Assignment

Owner name: WELLS FARGO BANK, NATIONAL ASSOCIATION, AS AGENT, NEW YORK

Free format text: SECURITY AGREEMENT;ASSIGNOR:MMODAL IP LLC;REEL/FRAME:034047/0527

Effective date: 20140731

AS Assignment

Owner name: CORTLAND CAPITAL MARKET SERVICES LLC, ILLINOIS

Free format text: PATENT SECURITY AGREEMENT;ASSIGNOR:MMODAL IP LLC;REEL/FRAME:033958/0729

Effective date: 20140731

AS Assignment

Owner name: MMODAL IP LLC, TENNESSEE

Free format text: CHANGE OF ADDRESS;ASSIGNOR:MMODAL IP LLC;REEL/FRAME:042271/0858

Effective date: 20140805

AS Assignment

Owner name: MMODAL IP LLC, TENNESSEE

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CORTLAND CAPITAL MARKET SERVICES LLC, AS ADMINISTRATIVE AGENT;REEL/FRAME:048211/0799

Effective date: 20190201

AS Assignment

Owner name: MEDQUIST CM LLC, TENNESSEE

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS;ASSIGNOR:WELLS FARGO BANK, NATIONAL ASSOCIATION, AS AGENT;REEL/FRAME:048411/0712

Effective date: 20190201

Owner name: MMODAL MQ INC., TENNESSEE

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS;ASSIGNOR:WELLS FARGO BANK, NATIONAL ASSOCIATION, AS AGENT;REEL/FRAME:048411/0712

Effective date: 20190201

Owner name: MULTIMODAL TECHNOLOGIES, LLC, TENNESSEE

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS;ASSIGNOR:WELLS FARGO BANK, NATIONAL ASSOCIATION, AS AGENT;REEL/FRAME:048411/0712

Effective date: 20190201

Owner name: MEDQUIST OF DELAWARE, INC., TENNESSEE

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS;ASSIGNOR:WELLS FARGO BANK, NATIONAL ASSOCIATION, AS AGENT;REEL/FRAME:048411/0712

Effective date: 20190201

Owner name: MMODAL IP LLC, TENNESSEE

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS;ASSIGNOR:WELLS FARGO BANK, NATIONAL ASSOCIATION, AS AGENT;REEL/FRAME:048411/0712

Effective date: 20190201