US20140358537A1 - System and Method for Combining Speech Recognition Outputs From a Plurality of Domain-Specific Speech Recognizers Via Machine Learning - Google Patents
- Publication number
- US20140358537A1 (application US14/459,719)
- Authority
- US
- United States
- Prior art keywords
- speech recognition
- speech
- candidates
- domain
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0638—Interactive procedures
Definitions
- the present disclosure relates to automatic speech recognition and, in particular, to automatic speech recognition across different applications or environments.
- speech recognition across multiple applications or environments is improved by using a collection of domain-specific speech recognizers to recognize received speech to yield respective speech recognition outputs; determining at least one speech recognition confidence score for the respective speech recognition outputs; selecting, via a machine-learning algorithm, speech recognition candidates from segments of the speech recognition outputs based on the at least one speech recognition confidence score for the respective speech recognition outputs; and combining, via a machine-learning algorithm, selected speech recognition candidates to generate text.
- FIG. 1 illustrates an example system embodiment
- FIG. 2 is a functional block diagram that illustrates an exemplary natural language spoken dialog system
- FIG. 3 is a schematic block diagram illustrating one embodiment of an example system for automatic speech recognition
- FIG. 4 is a schematic flow chart diagram illustrating one embodiment of an example method for automatic speech recognition.
- the present disclosure addresses the need in the art for developing a system capable of performing speech recognition across different applications or environments without model customization or prior knowledge of the domain of the received speech.
- This disclosure provides a system for performing speech recognition across different applications or environments without model customization or prior knowledge of the domain of the received speech.
- Known models for performing speech recognition across different applications or environments require a high volume of data. Disadvantageously, these systems are created by combining all potential data available into a single system. The increased volume of data requires intensive processing and causes out of memory problems. As a result, these systems are costly and hard to scale.
- the approaches discussed herein can be used to provide a standards-based API (like a web services API) where developers provide audio input and obtain text output without any model building, tuning, or optimization.
- the system determines the best recognition performance by aggregating information from a collection of domain-specific speech recognizers. Accordingly, the system provides speech recognition across multiple applications or environments without model customization and with a lower volume of data, thereby increasing scalability and reducing cost.
- an exemplary system 100 includes a general-purpose computing device 100 , including a processing unit (CPU or processor) 120 and a system bus 110 that couples various system components including the system memory 130 such as read only memory (ROM) 140 and random access memory (RAM) 150 to the processor 120 .
- the system 100 can include a cache 122 of high speed memory connected directly with, in close proximity to, or integrated as part of the processor 120 .
- the system 100 copies data from the memory 130 and/or the storage device 160 to the cache 122 for quick access by the processor 120 . In this way, the cache 122 provides a performance boost that avoids processor 120 delays while waiting for data.
- These and other modules can be configured to control the processor 120 to perform various actions.
- Other system memory 130 may be available for use as well.
- the memory 130 can include multiple different types of memory with different performance characteristics. It can be appreciated that the disclosure may operate on a computing device 100 with more than one processor 120 or on a group or cluster of computing devices networked together to provide greater processing capability.
- the processor 120 can include any general purpose processor and a hardware module or software module, such as module 1 162 , module 2 164 , module 3 166 , module 4 168 , module 5 172 , module 6 174 , and module 7 176 stored in storage device 160 , configured to control the processor 120 as well as a special-purpose processor where software instructions are incorporated into the actual processor design.
- the processor 120 may essentially be a completely self-contained computing system, containing multiple cores or processors, a bus, memory controller, cache, etc.
- a multi-core processor may be symmetric or asymmetric.
- the system bus 110 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
- a basic input/output system (BIOS) stored in ROM 140 or the like may provide the basic routine that helps to transfer information between elements within the computing device 100, such as during start-up.
- the computing device 100 further includes storage devices 160 such as a hard disk drive, a magnetic disk drive, an optical disk drive, tape drive or the like.
- the storage device 160 can include software modules 162 , 164 , 166 , 168 , 172 , 174 , 176 for controlling the processor 120 . Other hardware or software modules are contemplated.
- the storage device 160 is connected to the system bus 110 by a drive interface.
- the drives and the associated computer readable storage media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the computing device 100 .
- a hardware module that performs a particular function includes the software component stored in a non-transitory computer-readable medium in connection with the necessary hardware components, such as the processor 120 , bus 110 , display 170 , and so forth, to carry out the function.
- the basic components are known to those of skill in the art and appropriate variations are contemplated depending on the type of device, such as whether the device 100 is a small, handheld computing device, a desktop computer, or a computer server.
- an input device 190 represents any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech and so forth.
- An output device 170 can also be one or more of a number of output mechanisms known to those of skill in the art.
- multimodal systems enable a user to provide multiple types of input to communicate with the computing device 100 .
- the communications interface 180 generally governs and manages the user input and system output. There is no restriction on operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.
- the illustrative system embodiment is presented as including individual functional blocks including functional blocks labeled as a “processor” or processor 120 .
- the functions these blocks represent may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software and hardware, such as a processor 120 , that is purpose-built to operate as an equivalent to software executing on a general purpose processor.
- the functions of one or more processors presented in FIG. 1 may be provided by a single shared processor or multiple processors.
- Illustrative embodiments may include microprocessor and/or digital signal processor (DSP) hardware, read-only memory (ROM) 140 for storing software performing the operations discussed below, and random access memory (RAM) 150 for storing results.
- VLSI Very large scale integration
- the logical operations of the various embodiments are implemented as: (1) a sequence of computer implemented steps, operations, or procedures running on a programmable circuit within a general use computer, (2) a sequence of computer implemented steps, operations, or procedures running on a specific-use programmable circuit; and/or (3) interconnected machine modules or program engines within the programmable circuits.
- the system 100 shown in FIG. 1 can practice all or part of the recited methods, can be a part of the recited systems, and/or can operate according to instructions in the recited non-transitory computer-readable storage media.
- Such logical operations can be implemented as modules configured to control the processor 120 to perform particular functions according to the programming of the module. For example, FIG. 1 illustrates seven modules, Module 1 162, Module 2 164, Module 3 166, Module 4 168, Module 5 172, Module 6 174, and Module 7 176, which are modules configured to control the processor 120.
- These modules may be stored on the storage device 160 and loaded into RAM 150 or memory 130 at runtime or may be stored as would be known in the art in other computer-readable memory locations.
- FIG. 2 is discussed in terms of an exemplary system such as is shown in FIG. 1 configured to recognize speech input, transcribe the speech input, identify the meaning of the transcribed speech, determine an appropriate response to the speech input, generate text of the appropriate response, and generate audible “speech” based on the generated text.
- FIG. 2 is a functional block diagram that illustrates an exemplary natural language spoken dialog system.
- Spoken dialog systems aim to identify intents of humans, expressed in natural language, and take actions accordingly, to satisfy their requests.
- Natural language spoken dialog system 200 can include an automatic speech recognition (ASR) module 202 , a spoken language understanding (SLU) module 204 , a dialog management (DM) module 206 , a spoken language generation (SLG) module 208 , and text-to-speech module (TTS) 210 .
- the text-to-speech module can be any type of speech output module. For example, it can be a module wherein text is selected and played to a user. Thus, the text-to-speech module represents any type of speech output.
- the present disclosure focuses on innovations related to the ASR module 202 and can also relate to other components of the dialog system.
- the ASR module 202 analyzes speech input and provides a textual transcription of the speech input as output.
- SLU module 204 can receive the transcribed input and can use a natural language understanding model to analyze the group of words that are included in the transcribed input to derive a meaning from the input.
- the role of the DM module 206 is to interact in a natural way and help the user to achieve the task that the system is designed to support.
- the DM module 206 receives the meaning of the speech input from the SLU module 204 and determines an action, such as, for example, providing a response, based on the input.
- the SLG module 208 generates a transcription of one or more words in response to the action provided by the DM 206 .
- the text-to-speech module 210 receives the transcription as input and provides generated audible speech as output based on the transcribed speech.
- the modules of system 200 recognize speech input, such as speech utterances, transcribe the speech input, identify (or understand) the meaning of the transcribed speech, determine an appropriate response to the speech input, generate text of the appropriate response and from that text, generate audible “speech” from system 200 , which the user then hears. In this manner, the user can carry on a natural language dialog with system 200 .
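- The module chain described above can be sketched as a sequence of function calls. This is a minimal illustration only: every function body below is a hypothetical stub (the disclosure does not specify any of these interfaces), and the "call" intent and the response wording are assumptions chosen to keep the example small.

```python
# Hypothetical stubs for the five modules of dialog system 200.
# None of these signatures come from the disclosure; they only show the data flow.

def asr(speech):
    """ASR module 202: speech input -> textual transcription (stubbed)."""
    return speech["said"]

def slu(transcript):
    """SLU module 204: transcription -> meaning."""
    words = transcript.lower().split()
    if len(words) > 1 and words[0] == "call":
        return {"intent": "call", "target": " ".join(words[1:])}
    return {"intent": "unknown"}

def dm(meaning):
    """DM module 206: meaning -> action."""
    if meaning["intent"] == "call":
        return {"action": "confirm_call", "target": meaning["target"]}
    return {"action": "ask_repeat"}

def slg(action):
    """SLG module 208: action -> response text."""
    if action["action"] == "confirm_call":
        return f"Calling {action['target']}."
    return "Sorry, could you repeat that?"

def tts(text):
    """TTS module 210: response text -> audible speech (stubbed as a tagged string)."""
    return f"<audio:{text}>"

def dialog_turn(speech):
    """One pass through the cycle: ASR -> SLU -> DM -> SLG -> TTS."""
    return tts(slg(dm(slu(asr(speech)))))
```

- Keeping each module behind its own function mirrors the block diagram of FIG. 2, so that any one stage, such as the ASR stage this disclosure focuses on, could be replaced without touching the others.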
- a computing device such as a smartphone (or any processing device having a phone capability) can include an ASR module wherein a user says “call mom” and the smartphone acts on the instruction without a “spoken dialog.”
- a module for automatically transcribing user speech can join the system at any point or at multiple points in the cycle or can be integrated with any of the modules shown in FIG. 2 .
- FIG. 3 illustrates one embodiment of a system 202 for automatic speech recognition.
- the system 202 is part of the natural language spoken dialog system 200 of FIG. 2; for clarity, only the ASR module 202 is depicted here.
- the system 202 first receives speech 302 .
- the system 202 then recognizes the received speech with a collection of domain-specific speech recognizers 304 , 306 , 308 , and 310 , to yield respective speech recognition outputs 312 a, 312 b, 314 , 316 , and 318 .
- the collection of domain-specific speech recognizers 304 , 306 , 308 , and 310 includes at least two experts from different domains; at least one of the different domains includes SMS, question/answering, video search, broadcast news, voicemail to text, web search, or local business search.
- an expert is defined as a domain-specific speech recognizer.
- the collection of domain-specific speech recognizers 304 , 306 , 308 , and 310 can include one or more experts from a specific domain (e.g., web search 304 , web search 306 ), and at least one expert from a different domain (e.g., local business search 308 and video search 310 ).
- Other exemplary different domains include travel, banking, and business.
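- As a rough illustration of the collection described above, the experts can be modeled as (reference numeral, domain) records, with a check that the collection covers at least two different domains. The `Expert` class and the `spans_multiple_domains` helper are hypothetical names introduced here, not part of the disclosure.

```python
class Expert:
    """A domain-specific speech recognizer (an "expert" in the text)."""

    def __init__(self, ref, domain):
        self.ref = ref        # reference numeral from FIG. 3
        self.domain = domain  # e.g. "web search"

def spans_multiple_domains(experts):
    """True when the collection holds experts from at least two different domains."""
    return len({expert.domain for expert in experts}) >= 2

# Domain assignments follow the example in the text.
EXPERTS = [
    Expert(304, "web search"),
    Expert(306, "web search"),
    Expert(308, "local business search"),
    Expert(310, "video search"),
]
```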
- each expert from the collection of domain-specific speech recognizers 304 , 306 , 308 , and 310 provides a speech recognition output 312 a, 312 b, 314 , 316 , and 318 based on the received speech 302 .
- the following examples illustrate possible speech recognition outputs based on the words “Paris Hilton” as recognized by each expert: “Pairs Hill” 312 a , “Paris Hilton” 314 , “Paris Hill” 316 , and “Perez Hilton” 318 .
- An output can include a lattice, confidence scores, and other meta data including beam width.
- each output in our example above may include a confidence score, viz.: “Pairs Hill” may include a confidence score of 40, “Paris Hilton” may include a confidence score of 100, “Paris Hill” may include a confidence score of 74, and “Perez Hilton” may include a confidence score of 80.
- an output may include more than one confidence score; each confidence score corresponds to a different segment of the output.
- the following examples illustrate an output including a plurality of confidence scores: “Pairs Hill” and a confidence score of 40 for “Pairs” and 60 for “Hill,” “Paris Hilton” and a confidence score of 100 for “Paris” and 100 for “Hilton,” “Paris Hill” and a confidence score of 100 for “Paris” and 60 for “Hill,” and “Perez Hilton” and a confidence score of 80 for “Perez” and 100 for “Hilton.”
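- The per-segment scores in this example can be written down as a small data structure. The layout below (a list of `(segment_text, confidence)` pairs per output, keyed by reference numeral) is an assumption made for illustration; the disclosure only says that an output may carry a lattice, confidence scores, and other metadata.

```python
# Each output is a list of (segment_text, confidence) pairs.
# The words and scores are taken from the example in the text; the layout is assumed.
OUTPUTS = {
    "312a": [("Pairs", 40), ("Hill", 60)],
    "314": [("Paris", 100), ("Hilton", 100)],
    "316": [("Paris", 100), ("Hill", 60)],
    "318": [("Perez", 80), ("Hilton", 100)],
}

def output_text(segments):
    """Join the segment texts back into the recognized string."""
    return " ".join(text for text, _ in segments)

def lowest_confidence(segments):
    """Weakest segment score of an output, a simple way to spot a doubtful word."""
    return min(confidence for _, confidence in segments)
```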
- the machine-learning algorithm 300 analyzes the speech recognition outputs 312 a, 312 b, 314 , 316 , and 318 to determine at least one speech recognition confidence score for the respective speech recognition outputs 312 a, 312 b, 314 , 316 , and 318 .
- the machine-learning algorithm 300 selects speech recognition candidates from segments of the speech recognition outputs 312 a, 312 b, 314 , 316 , and 318 based on at least one speech recognition confidence score for the respective speech recognition outputs 312 a, 312 b, 314 , 316 , and 318 .
- the machine-learning algorithm 300 may select the speech recognition candidates from those segments of the speech recognition outputs in our example having the highest confidence scores (100, 74, 100 respectively): “Paris Hilton,” “Paris Hill,” and “Hilton.”
- the machine-learning algorithm 300 then combines the speech recognition candidates to yield a combination of the speech recognition candidates, and generates a text string 330 based on the combination. For example, the machine-learning algorithm 300 can generate the words “Paris Hilton” based on the combination of “Paris Hilton,” “Paris Hill,” and “Perez Hilton.” Alternatively, the machine-learning algorithm 300 can generate a text string 330 based on a single speech recognition candidate having the highest confidence score, which, in our example, corresponds to “Paris Hilton” 314 .
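- The two alternatives described above can be sketched as follows. The disclosure does not say how the machine-learning algorithm 300 combines candidates, so the confidence-weighted vote below is a simple stand-in, not the disclosed algorithm; it also assumes every candidate has the same number of segments, already aligned by position. The scores are the ones from the running example.

```python
from collections import defaultdict

# The three selected candidates with their per-segment scores from the example.
CANDIDATES = [
    [("Paris", 100), ("Hilton", 100)],
    [("Paris", 100), ("Hill", 60)],
    [("Perez", 80), ("Hilton", 100)],
]

# Whole-output scores from the example.
UTTERANCE_SCORES = {
    "Pairs Hill": 40,
    "Paris Hilton": 100,
    "Paris Hill": 74,
    "Perez Hilton": 80,
}

def combine_by_vote(candidates):
    """Stand-in combiner: at each position, sum confidences per word and keep the top word."""
    words = []
    for position in zip(*candidates):  # assumes aligned, equal-length candidates
        totals = defaultdict(int)
        for text, confidence in position:
            totals[text] += confidence
        words.append(max(totals, key=totals.get))
    return " ".join(words)

def single_best(scores):
    """Alternative path: return the one candidate with the highest confidence score."""
    return max(scores, key=scores.get)
```

- On the example data both paths yield "Paris Hilton"; they would diverge when no single expert gets every segment right.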
- the text string 330 includes a mesh of the speech recognition candidates.
- the experts divide the speech recognition candidates into substrings (e.g., “Paris” 312 a, “Hilton” 312 b ), and the machine-learning algorithm 300 selects a best speech recognition candidate for each substring.
- the system 202 collects usage statistics based on the speech recognition candidates.
- the system 202 uses the collected statistics to train the machine-learning algorithm 300 .
- the system 202 uses the collected statistics to train the collection of domain-specific speech recognizers 304 , 306 , 308 , and 310 .
- the system 202 may also use the collected statistics to train both the machine-learning algorithm 300 and the collection of domain-specific speech recognizers 304 , 306 , 308 , and 310 .
- Training parameters can include a lattice combination and a neural network graph that learns from an edit distance between the speech recognition candidates and a correct recognition candidate. This step ensures that the machine-learning algorithm 300 and each expert from the collection of domain-specific speech recognizers 304 , 306 , 308 , and 310 are optimized to increase overall performance.
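- The edit distance mentioned above can be computed with the standard Levenshtein recurrence. The sketch below works at the word level, which is an assumption (the disclosure does not state whether words, characters, or lattice arcs are compared), and it only produces the distance that could serve as a training signal; it does not implement the neural network itself.

```python
def edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance between a candidate and the correct transcript."""
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))  # distance from an empty hypothesis
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]  # distance to an empty reference
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (0 if hyp_word == ref_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

def training_signal(candidates, correct):
    """Map each candidate to its distance from the correct recognition candidate."""
    return {candidate: edit_distance(candidate, correct) for candidate in candidates}
```

- For the running example with "Paris Hilton" as the correct candidate, "Pairs Hill" is two word edits away, while "Paris Hill" and "Perez Hilton" are one edit away each.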
- an arrow can indicate a waiting or monitoring period of unspecified duration between enumerated steps of the depicted method.
- the order in which a particular method occurs may or may not strictly adhere to the order of the corresponding steps shown.
- One or more steps of the following methods are performed by a hardware component such as a processor or computing device.
- FIG. 4 is a schematic flow chart diagram illustrating a disclosed method 600 for automatic speech recognition.
- the method 600 starts and the collection of domain-specific speech recognizers 304, 306, 308, and 310 of FIG. 3 first recognize the received speech 302 of FIG. 3 to yield respective speech recognition outputs (604).
- the machine-learning algorithm 300 of FIG. 3 then analyzes the speech recognition outputs 312 a, 312 b, 314, 316, and 318 of FIG. 3 to determine at least one speech recognition confidence score for the respective speech recognition outputs 312 a, 312 b, 314, 316, and 318 of FIG. 3 (606).
- the machine-learning algorithm 300 of FIG. 3 selects speech recognition candidates from segments of the speech recognition outputs 312 a, 312 b, 314, 316, and 318 of FIG. 3, based on the at least one speech recognition confidence score for the respective speech recognition outputs (608).
- the machine-learning algorithm 300 of FIG. 3 then combines the speech recognition candidates to yield a combination of the speech recognition candidates (610), and generates a text string 330 of FIG. 3 based on the combination (612).
- the machine-learning algorithm 300 of FIG. 3 can generate a text string 330 of FIG. 3 based on a single speech recognition candidate having a highest confidence score.
- the text string 330 of FIG. 3 includes a mesh of the speech recognition candidates.
- the experts divide the speech recognition candidates into substrings, and the machine-learning algorithm 300 of FIG. 3 selects a best speech recognition candidate for each substring.
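- The steps of method 600 can be strung together in a short sketch. Everything here is illustrative: the experts are stubs that return fixed `(segment_text, confidence)` pairs from the running example instead of decoding audio, the per-position maximum stands in for the machine-learning algorithm 300, and the segments are assumed to be aligned across experts.

```python
def make_stub_expert(segments):
    """Build a fake recognizer that ignores the audio and returns fixed scored segments."""
    return lambda speech: list(segments)

# One stub per output of the example ("Pairs Hill", "Paris Hilton", "Paris Hill", "Perez Hilton").
STUB_EXPERTS = [
    make_stub_expert([("Pairs", 40), ("Hill", 60)]),
    make_stub_expert([("Paris", 100), ("Hilton", 100)]),
    make_stub_expert([("Paris", 100), ("Hill", 60)]),
    make_stub_expert([("Perez", 80), ("Hilton", 100)]),
]

def recognize(speech, experts):
    """Step 604: every expert recognizes the same received speech."""
    return [expert(speech) for expert in experts]

def select_candidates(outputs):
    """Steps 606-608: for each substring position, keep the highest-scoring segment."""
    return [max(position, key=lambda segment: segment[1]) for position in zip(*outputs)]

def combine(candidates):
    """Step 610: the combination is the ordered list of selected words."""
    return [text for text, _ in candidates]

def generate_text(combination):
    """Step 612: produce the text string."""
    return " ".join(combination)

def run_method_600(speech, experts):
    outputs = recognize(speech, experts)
    return generate_text(combine(select_candidates(outputs)))
```

- Calling `run_method_600(b"", STUB_EXPERTS)` on the stubbed data produces "Paris Hilton", the same text string 330 as in the worked example.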
- This approach allows for speech recognition across multiple applications or environments without model customization or knowledge of the domain of the received speech.
- This approach requires a lower volume of data, thereby increasing scalability and reducing cost, and provides numerous additional benefits, such as higher speech recognition performance and rapid deployment of speech applications without intensive development of expertise.
- Embodiments within the scope of the present disclosure may also include tangible computer-readable storage media for carrying or having computer-executable instructions or data structures stored thereon for controlling a data processing device or other computing device.
- Such computer-readable storage media can be any available media that can be accessed by a general purpose or special purpose computer, including the functional design of any special purpose processor as discussed above.
- Such computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code means in the form of computer-executable instructions, data structures, or processor chip design.
- Computer-executable instructions include, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions.
- Computer-executable instructions also include program modules that are executed by computers in stand-alone or network environments.
- program modules include routines, programs, components, data structures, objects, and the functions inherent in the design of special-purpose processors, etc. that perform particular tasks or implement particular abstract data types.
- Computer-executable instructions, associated data structures, and program modules represent examples of the program code means for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.
- Embodiments of the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. Embodiments may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination thereof) through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
Abstract
Description
- The present application is a continuation of U.S. patent application Ser. No. 12/895,359, filed Sep. 30, 2010, the content of which is incorporated herein by reference in its entirety.
- 1. Technical Field
- The present disclosure relates to automatic speech recognition and, in particular, to automatic speech recognition across different applications or environments.
- 2. Introduction
- Over the past 5 decades, researchers and developers have been creating tools and algorithms to enable rapid development of acoustic and language models to support domain-specific speech recognition applications. These applications rely on speech recognition models. Often, a generic speech model is used to recognize speech from multiple users. Similarly, current systems capable of performing speech recognition across different applications or environments rely on generic speech models. Given that speech recognizers depend significantly on the distribution of words and phrases, such systems typically fail as they attempt to provide generality while lowering performance.
- Moreover, these systems require tremendous costs to develop. For example, a team of 3-6 people may take 3-6 months to develop a single speech application. In addition, known models for performing speech recognition across different applications or environments perforce require a high volume of data. Disadvantageously, these systems are created by combining all potential data available into a single system. The increased volume of data requires intensive processing and causes out of memory problems. As a result, these systems are costly and hard to scale.
- Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be obvious from the description, or can be learned by practice of the herein disclosed principles. The features and advantages of the disclosure can be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the disclosure will become more fully apparent from the following description and appended claims, or can be learned by the practice of the principles set forth herein.
- Disclosed herein are systems, methods, and non-transitory computer-readable storage media for performing speech recognition across different applications or environments without model customization or prior knowledge of the domain of the received speech. In accordance with the disclosure, speech recognition across multiple applications or environments is improved by using a collection of domain-specific speech recognizers to recognize received speech to yield respective speech recognition outputs; determining at least one speech recognition confidence score for the respective speech recognition outputs; selecting, via a machine-learning algorithm, speech recognition candidates from segments of the speech recognition outputs based on the at least one speech recognition confidence score for the respective speech recognition outputs; and combining, via a machine-learning algorithm, selected speech recognition candidates to generate text.
- In this way, speech recognition across multiple applications or environments can be accomplished without model customization and necessitates a lower volume of data, thereby increasing scalability and reducing cost. This approach provides numerous additional benefits, such as higher speech recognition performance and rapid deployment of speech applications without intensive development of expertise.
- In order to describe the manner in which the above-recited and other advantages and features of the disclosure can be obtained, a more particular description of the principles briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only exemplary embodiments of the disclosure and are not therefore to be considered to be limiting of its scope, the principles herein are described and explained with additional specificity and detail through the use of the accompanying drawings in which:
- FIG. 1 illustrates an example system embodiment;
- FIG. 2 is a functional block diagram that illustrates an exemplary natural language spoken dialog system;
- FIG. 3 is a schematic block diagram illustrating one embodiment of an example system for automatic speech recognition; and
- FIG. 4 is a schematic flow chart diagram illustrating one embodiment of an example method for automatic speech recognition.
- Various embodiments of the disclosure are discussed in detail below. While specific implementations are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without parting from the spirit and scope of the disclosure.
- The present disclosure addresses the need in the art for developing a system capable of performing speech recognition across different applications or environments without model customization or prior knowledge of the domain of the received speech. Some introductory principles and concepts are discussed first, followed by a brief introductory description of a basic general purpose system or computing device in FIG. 1 which can be employed to practice the concepts disclosed herein. A more detailed description of an exemplary natural language spoken dialog system in FIG. 2, an exemplary automatic speech recognition system in FIG. 3, and an exemplary method in FIG. 4 will then follow.
- This disclosure provides a system for performing speech recognition across different applications or environments without model customization or prior knowledge of the domain of the received speech. Known models for performing speech recognition across different applications or environments require a high volume of data. Disadvantageously, these systems are created by combining all potential data available into a single system. The increased volume of data requires intensive processing and causes out of memory problems. As a result, these systems are costly and hard to scale.
- The approaches discussed herein can be used to provide a standards-based API (like a web services API) where developers provide audio input and obtain text output without any model building, tuning, or optimization. The system determines the best recognition performance by aggregating information from a collection of domain-specific speech recognizers. Accordingly, the system provides speech recognition across multiple applications or environments without model customization and with a lower volume of data, thereby increasing scalability and reducing cost. These principles provide numerous additional benefits, such as higher speech recognition performance and rapid deployment of speech applications without intensive development of expertise.
- With reference to
FIG. 1, an exemplary system 100 includes a general-purpose computing device 100, including a processing unit (CPU or processor) 120 and a system bus 110 that couples various system components including the system memory 130 such as read only memory (ROM) 140 and random access memory (RAM) 150 to the processor 120. The system 100 can include a cache 122 of high speed memory connected directly with, in close proximity to, or integrated as part of the processor 120. The system 100 copies data from the memory 130 and/or the storage device 160 to the cache 122 for quick access by the processor 120. In this way, the cache 122 provides a performance boost that avoids processor 120 delays while waiting for data. These and other modules can be configured to control the processor 120 to perform various actions. Other system memory 130 may be available for use as well. The memory 130 can include multiple different types of memory with different performance characteristics. It can be appreciated that the disclosure may operate on a computing device 100 with more than one processor 120 or on a group or cluster of computing devices networked together to provide greater processing capability. The processor 120 can include any general purpose processor and a hardware module or software module, such as module 1 162, module 2 164, module 3 166, module 4 168, module 5 172, module 6 174, and module 7 176 stored in storage device 160, configured to control the processor 120 as well as a special-purpose processor where software instructions are incorporated into the actual processor design. The processor 120 may essentially be a completely self-contained computing system, containing multiple cores or processors, a bus, memory controller, cache, etc. A multi-core processor may be symmetric or asymmetric.
- The
system bus 110 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. A basic input/output system (BIOS) stored in ROM 140 or the like may provide the basic routine that helps to transfer information between elements within the computing device 100, such as during start-up. The computing device 100 further includes storage devices 160 such as a hard disk drive, a magnetic disk drive, an optical disk drive, tape drive or the like. The storage device 160 can include software modules for controlling the processor 120. Other hardware or software modules are contemplated. The storage device 160 is connected to the system bus 110 by a drive interface. The drives and the associated computer readable storage media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the computing device 100. In one aspect, a hardware module that performs a particular function includes the software component stored in a non-transitory computer-readable medium in connection with the necessary hardware components, such as the processor 120, bus 110, display 170, and so forth, to carry out the function. The basic components are known to those of skill in the art and appropriate variations are contemplated depending on the type of device, such as whether the device 100 is a small, handheld computing device, a desktop computer, or a computer server.
- Although the exemplary embodiment described herein employs the
hard disk 160, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, digital versatile disks, cartridges, random access memories (RAMs) 150, read only memory (ROM) 140, a cable or wireless signal containing a bit stream and the like, may also be used in the exemplary operating environment. Tangible computer-readable storage media, computer-readable storage devices, or computer-readable memory devices, expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.
- To enable user interaction with the
computing device 100, an input device 190 represents any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech and so forth. An output device 170 can also be one or more of a number of output mechanisms known to those of skill in the art. In some instances, multimodal systems enable a user to provide multiple types of input to communicate with the computing device 100. The communications interface 180 generally governs and manages the user input and system output. There is no restriction on operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.
- For clarity of explanation, the illustrative system embodiment is presented as including individual functional blocks including functional blocks labeled as a “processor” or
processor 120. The functions these blocks represent may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software and hardware, such as a processor 120, that is purpose-built to operate as an equivalent to software executing on a general purpose processor. For example, the functions of one or more processors presented in FIG. 1 may be provided by a single shared processor or multiple processors. (Use of the term “processor” should not be construed to refer exclusively to hardware capable of executing software.) Illustrative embodiments may include microprocessor and/or digital signal processor (DSP) hardware, read-only memory (ROM) 140 for storing software performing the operations discussed below, and random access memory (RAM) 150 for storing results. Very large scale integration (VLSI) hardware embodiments, as well as custom VLSI circuitry in combination with a general purpose DSP circuit, may also be provided.
- The logical operations of the various embodiments are implemented as: (1) a sequence of computer implemented steps, operations, or procedures running on a programmable circuit within a general use computer, (2) a sequence of computer implemented steps, operations, or procedures running on a specific-use programmable circuit; and/or (3) interconnected machine modules or program engines within the programmable circuits. The
system 100 shown in FIG. 1 can practice all or part of the recited methods, can be a part of the recited systems, and/or can operate according to instructions in the recited non-transitory computer-readable storage media. Such logical operations can be implemented as modules configured to control the processor 120 to perform particular functions according to the programming of the module. For example, FIG. 1 illustrates seven modules, Module 1 162, Module 2 164, Module 3 166, Module 4 168, Module 5 172, Module 6 174, and Module 7 176, which are modules configured to control the processor 120. These modules may be stored on the storage device 160 and loaded into RAM 150 or memory 130 at runtime or may be stored as would be known in the art in other computer-readable memory locations.
- Having disclosed some basic system components, the disclosure now turns to the exemplary natural language spoken dialog system shown in
FIG. 2. For the sake of clarity, FIG. 2 is discussed in terms of an exemplary system such as is shown in FIG. 1 configured to recognize speech input, transcribe the speech input, identify the meaning of the transcribed speech, determine an appropriate response to the speech input, generate text of the appropriate response, and generate audible “speech” based on the generated text.
-
FIG. 2 is a functional block diagram that illustrates an exemplary natural language spoken dialog system. Spoken dialog systems aim to identify intents of humans, expressed in natural language, and take actions accordingly, to satisfy their requests. Natural language spoken dialog system 200 can include an automatic speech recognition (ASR) module 202, a spoken language understanding (SLU) module 204, a dialog management (DM) module 206, a spoken language generation (SLG) module 208, and text-to-speech module (TTS) 210. The text-to-speech module can be any type of speech output module. For example, it can be a module wherein text is selected and played to a user. Thus, the text-to-speech module represents any type of speech output. The present disclosure focuses on innovations related to the ASR module 202 and can also relate to other components of the dialog system.
- The
ASR module 202 analyzes speech input and provides a textual transcription of the speech input as output. SLU module 204 can receive the transcribed input and can use a natural language understanding model to analyze the group of words that are included in the transcribed input to derive a meaning from the input. The role of the DM module 206 is to interact in a natural way and help the user to achieve the task that the system is designed to support. The DM module 206 receives the meaning of the speech input from the SLU module 204 and determines an action, such as, for example, providing a response, based on the input. The SLG module 208 generates a transcription of one or more words in response to the action provided by the DM 206. The text-to-speech module 210 receives the transcription as input and provides generated audible speech as output based on the transcribed speech.
- Thus, the modules of
system 200 recognize speech input, such as speech utterances, transcribe the speech input, identify (or understand) the meaning of the transcribed speech, determine an appropriate response to the speech input, generate text of the appropriate response and from that text, generate audible “speech” from system 200, which the user then hears. In this manner, the user can carry on a natural language dialog with system 200. Those of ordinary skill in the art will understand the programming languages for generating and training ASR module 202 or any of the other modules in the spoken dialog system. Further, the modules of system 200 can operate independent of a full dialog system. For example, a computing device such as a smartphone (or any processing device having a phone capability) can include an ASR module wherein a user says “call mom” and the smartphone acts on the instruction without a “spoken dialog.” A module for automatically transcribing user speech can join the system at any point or at multiple points in the cycle or can be integrated with any of the modules shown in FIG. 2.
- The disclosure now turns to
FIG. 3, which illustrates one embodiment of a system 202 for automatic speech recognition. The system 202 is part of the natural language spoken dialog system 200 of FIG. 2; however, for clarity, only the ASR 202 is depicted here.
- The
system 202 first receives speech 302. The system 202 then recognizes the received speech with a collection of domain-specific speech recognizers. As shown in FIG. 3, the collection of domain-specific speech recognizers can include more than one expert from the same domain (e.g., web search 304, web search 306), and at least one expert from a different domain (e.g., local business search 308 and video search 310). Other exemplary different domains include travel, banking, and business.
- Next, each expert from the collection of domain-specific speech recognizers yields a respective speech recognition output from the received speech 302. The following examples illustrate possible speech recognition outputs based on the words “Paris Hilton” as recognized by each expert: “Pairs Hill” 312 a, “Paris Hilton” 314, “Paris Hill” 316, and “Perez Hilton” 318. An output can include a lattice, confidence scores, and other metadata including beam width. Accordingly, each output in our example above may include a confidence score, viz.: “Pairs Hill” may include a confidence score of 40, “Paris Hilton” may include a confidence score of 100, “Paris Hill” may include a confidence score of 74, and “Perez Hilton” may include a confidence score of 80. In one aspect, an output may include more than one confidence score; each confidence score corresponds to a different segment of the output. The following examples illustrate an output including a plurality of confidence scores: “Pairs Hill” and a confidence score of 40 for “Pairs” and 60 for “Hill,” “Paris Hilton” and a confidence score of 100 for “Paris” and 100 for “Hilton,” “Paris Hill” and a confidence score of 100 for “Paris” and 60 for “Hill,” and “Perez Hilton” and a confidence score of 80 for “Perez” and 100 for “Hilton.”
- Next, the machine-learning
algorithm 300 analyzes the speech recognition outputs 312 a, 312 b, 314, 316, and 318 to determine at least one speech recognition confidence score for the respective speech recognition outputs 312 a, 312 b, 314, 316, and 318. The machine-learning algorithm 300 then selects speech recognition candidates from segments of the speech recognition outputs 312 a, 312 b, 314, 316, and 318 based on at least one speech recognition confidence score for the respective speech recognition outputs 312 a, 312 b, 314, 316, and 318. For example, the machine-learning algorithm 300 may select the speech recognition candidates from those segments of the speech recognition outputs in our example having the highest confidence scores (100, 74, 100 respectively): “Paris Hilton,” “Paris Hill,” and “Hilton.”
- The machine-learning
algorithm 300 then combines the speech recognition candidates to yield a combination of the speech recognition candidates, and generates a text string 330 based on the combination. For example, the machine-learning algorithm 300 can generate the words “Paris Hilton” based on the combination of “Paris Hilton,” “Paris Hill,” and “Perez Hilton.” Alternatively, the machine-learning algorithm 300 can generate a text string 330 based on a single speech recognition candidate having the highest confidence score, which, in our example, corresponds to “Paris Hilton” 314.
- In particular embodiments, the
text string 330 includes a mesh of the speech recognition candidates. In another aspect, the experts divide the speech recognition candidates into substrings (e.g., “Paris” 312 a, “Hilton” 312 b), and the machine-learning algorithm 300 selects a best speech recognition candidate for each substring.
- Finally, the
system 202 collects usage statistics based on the speech recognition candidates. In one aspect, the system 202 uses the collected statistics to train the machine-learning algorithm 300. In another aspect, the system 202 uses the collected statistics to train the collection of domain-specific speech recognizers. The system 202 may also use the collected statistics to train both the machine-learning algorithm 300 and each expert from the collection of domain-specific speech recognizers.
- The schematic flow chart diagrams included herein are generally set forth as logical flow chart diagrams. As such, the depicted order and labeled steps are indicative of one embodiment of the presented method. Other steps and methods can be conceived that are equivalent in function, logic, or effect to one or more steps, or portions thereof, of the illustrated method. Additionally, the format and symbols employed are provided to explain the logical steps of the method and are understood not to limit the scope of the method. Although various arrow types and line types can be employed in the flow chart diagram, they are understood not to limit the scope of the corresponding method. Indeed, some arrows or other connectors can be used to indicate only the logical flow of the method. For instance, an arrow can indicate a waiting or monitoring period of unspecified duration between enumerated steps of the depicted method. Additionally, the order in which a particular method occurs may or may not strictly adhere to the order of the corresponding steps shown. One or more steps of the following methods are performed by a hardware component such as a processor or computing device.
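The per-substring selection described for FIG. 3 can be illustrated with a minimal Python sketch. It is not the disclosed machine-learning algorithm 300: it swaps in a plain highest-confidence pick at each position, and it assumes the segments of every output are already aligned one-to-one. The `Segment` class, the `expert_*` keys, and the helper names are hypothetical; only the words and scores come from the "Paris Hilton" example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str
    confidence: int  # 0-100 scale, matching the example scores in the text


# Hypothetical container: one list of scored segments per expert.
# The expert names are placeholders, not reference numerals from FIG. 3.
outputs = {
    "expert_a": [Segment("Pairs", 40), Segment("Hill", 60)],
    "expert_b": [Segment("Paris", 100), Segment("Hilton", 100)],
    "expert_c": [Segment("Paris", 100), Segment("Hill", 60)],
    "expert_d": [Segment("Perez", 80), Segment("Hilton", 100)],
}


def best_per_position(outputs):
    """Pick the highest-confidence segment at each position.

    Assumption: every output has the same number of segments and the
    segments line up position by position (no alignment step is done).
    """
    lengths = {len(segments) for segments in outputs.values()}
    if len(lengths) != 1:
        raise ValueError("outputs must have the same number of segments")
    columns = zip(*outputs.values())  # group segments by position
    return [max(column, key=lambda s: s.confidence) for column in columns]


def to_text(segments):
    """Join the selected segments into a single text string."""
    return " ".join(s.text for s in segments)


print(to_text(best_per_position(outputs)))  # Paris Hilton
```

A trained combiner would replace the `max` call with a learned scoring function. The simple pick is used here only to make the data flow concrete.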
-
FIG. 4 is a schematic flow chart diagram illustrating a disclosed method 600 for automatic speech recognition. As seen from FIG. 4, the method 600 starts and the collection of domain-specific speech recognizers of FIG. 3 first recognize the received speech 302 of FIG. 3 to yield respective speech recognition outputs 604. The machine-learning algorithm 300 of FIG. 3 then analyzes the speech recognition outputs 312 a, 312 b, 314, 316, and 318 of FIG. 3 to determine at least one speech recognition confidence score for the respective speech recognition outputs 312 a, 312 b, 314, 316, and 318 of FIG. 3 606.
- Next, the machine-learning
algorithm 300 of FIG. 3 selects speech recognition candidates from segments of the speech recognition outputs 312 a, 312 b, 314, 316, and 318 of FIG. 3, based on the at least one speech recognition confidence score for the respective speech recognition outputs 608. The machine-learning algorithm 300 of FIG. 3 then combines the speech recognition candidates to yield a combination of the speech recognition candidates 610, and generates a text string 330 of FIG. 3 based on the combination 612. Alternatively, the machine-learning algorithm 300 of FIG. 3 can generate a text string 330 of FIG. 3 based on a single speech recognition candidate having a highest confidence score. In particular embodiments, the text string 330 of FIG. 3 includes a mesh of the speech recognition candidates. In another aspect, the experts divide the speech recognition candidates into substrings, and the machine-learning algorithm 300 of FIG. 3 selects a best speech recognition candidate for each substring.
- This approach allows for speech recognition across multiple applications or environments without model customization or knowledge of the domain of the received speech. This approach requires a lower volume of data, thereby increasing scalability and reducing cost, and provides numerous additional benefits, such as higher speech recognition performance and rapid deployment of speech applications without intensive development of expertise.
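The steps of method 600 can be sketched end to end in Python under stated assumptions. The recognizers below are stubs that return the example transcripts and whole-output scores from the description. The confidence threshold of 70 and the confidence-weighted word vote are illustrative stand-ins for the machine-learning algorithm 300, which the disclosure does not specify at this level. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]              # (transcript, confidence score)
Recognizer = Callable[[bytes], Hypothesis]  # stand-in for one domain expert


def stub(text: str, confidence: float) -> Recognizer:
    """Build a fake expert that ignores the audio (illustration only)."""
    return lambda audio: (text, confidence)


# Stub experts returning the example outputs and scores from the text.
EXPERTS: List[Recognizer] = [
    stub("Pairs Hill", 40),
    stub("Paris Hilton", 100),
    stub("Paris Hill", 74),
    stub("Perez Hilton", 80),
]


def recognize_all(audio: bytes, recognizers: List[Recognizer]) -> List[Hypothesis]:
    """Step 604: run every expert on the same received speech."""
    return [recognize(audio) for recognize in recognizers]


def select_candidates(hyps: List[Hypothesis], threshold: float) -> List[Hypothesis]:
    """Steps 606-608: keep outputs whose score meets an assumed threshold."""
    return [h for h in hyps if h[1] >= threshold]


def combine_by_vote(candidates: List[Hypothesis]) -> str:
    """Steps 610-612: confidence-weighted word vote at each position.

    Assumes whitespace tokens and equal word counts; otherwise falls back
    to the single candidate with the highest confidence score.
    """
    if not candidates:
        return ""
    token_lists = [text.split() for text, _ in candidates]
    width = len(token_lists[0])
    if any(len(tokens) != width for tokens in token_lists):
        return max(candidates, key=lambda h: h[1])[0]
    words = []
    for i in range(width):
        votes = defaultdict(float)
        for tokens, (_, confidence) in zip(token_lists, candidates):
            votes[tokens[i]] += confidence
        words.append(max(votes, key=votes.get))
    return " ".join(words)


def transcribe(audio: bytes, recognizers: List[Recognizer], threshold: float = 70.0) -> str:
    hyps = recognize_all(audio, recognizers)
    return combine_by_vote(select_candidates(hyps, threshold))


print(transcribe(b"", EXPERTS))  # Paris Hilton
```

With the assumed threshold, the surviving candidates are "Paris Hilton," "Paris Hill," and "Perez Hilton," matching the combination named in the FIG. 3 example. The fallback branch corresponds to the alternative of using the single candidate with the highest confidence score.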
- Embodiments within the scope of the present disclosure may also include tangible computer-readable storage media for carrying or having computer-executable instructions or data structures stored thereon for controlling a data processing device or other computing device. Such computer-readable storage media can be any available media that can be accessed by a general purpose or special purpose computer, including the functional design of any special purpose processor as discussed above. By way of example, and not limitation, such computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code means in the form of computer-executable instructions, data structures, or processor chip design. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or combination thereof) to a computer, the computer properly views the connection as a computer-readable medium. Thus, any such connection is properly termed a computer-readable medium. Combinations of the above should also be included within the scope of the computer-readable media.
- Computer-executable instructions include, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Computer-executable instructions also include program modules that are executed by computers in stand-alone or network environments. Generally, program modules include routines, programs, components, data structures, objects, and the functions inherent in the design of special-purpose processors, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of the program code means for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.
- Those of skill in the art will appreciate that other embodiments of the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. Embodiments may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination thereof) through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
- The various embodiments described above are provided by way of illustration only and should not be construed to limit the scope of the disclosure. Those skilled in the art will readily recognize various modifications and changes that may be made to the principles described herein without following the example embodiments and applications illustrated and described herein, and without departing from the spirit and scope of the disclosure.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/459,719 US20140358537A1 (en) | 2010-09-30 | 2014-08-14 | System and Method for Combining Speech Recognition Outputs From a Plurality of Domain-Specific Speech Recognizers Via Machine Learning |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/895,359 US8812321B2 (en) | 2010-09-30 | 2010-09-30 | System and method for combining speech recognition outputs from a plurality of domain-specific speech recognizers via machine learning |
US14/459,719 US20140358537A1 (en) | 2010-09-30 | 2014-08-14 | System and Method for Combining Speech Recognition Outputs From a Plurality of Domain-Specific Speech Recognizers Via Machine Learning |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/895,359 Continuation US8812321B2 (en) | 2010-09-30 | 2010-09-30 | System and method for combining speech recognition outputs from a plurality of domain-specific speech recognizers via machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
US20140358537A1 (en) | 2014-12-04 |
Family
ID=45890571
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/895,359 Active 2031-12-01 US8812321B2 (en) | 2010-09-30 | 2010-09-30 | System and method for combining speech recognition outputs from a plurality of domain-specific speech recognizers via machine learning |
US14/459,719 Abandoned US20140358537A1 (en) | 2010-09-30 | 2014-08-14 | System and Method for Combining Speech Recognition Outputs From a Plurality of Domain-Specific Speech Recognizers Via Machine Learning |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/895,359 Active 2031-12-01 US8812321B2 (en) | 2010-09-30 | 2010-09-30 | System and method for combining speech recognition outputs from a plurality of domain-specific speech recognizers via machine learning |
Country Status (1)
Country | Link |
---|---|
US (2) | US8812321B2 (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150310858A1 (en) * | 2014-04-29 | 2015-10-29 | Microsoft Corporation | Shared hidden layer combination for speech recognition systems |
US9324321B2 (en) | 2014-03-07 | 2016-04-26 | Microsoft Technology Licensing, Llc | Low-footprint adaptation and personalization for a deep neural network |
US9430667B2 (en) | 2014-05-12 | 2016-08-30 | Microsoft Technology Licensing, Llc | Managed wireless distribution network |
US9477625B2 (en) | 2014-06-13 | 2016-10-25 | Microsoft Technology Licensing, Llc | Reversible connector for accessory devices |
US9529794B2 (en) | 2014-03-27 | 2016-12-27 | Microsoft Technology Licensing, Llc | Flexible schema for language model customization |
US9589565B2 (en) | 2013-06-21 | 2017-03-07 | Microsoft Technology Licensing, Llc | Environmentally aware dialog policies and response generation |
US9614724B2 (en) | 2014-04-21 | 2017-04-04 | Microsoft Technology Licensing, Llc | Session-based device configuration |
US9697200B2 (en) | 2013-06-21 | 2017-07-04 | Microsoft Technology Licensing, Llc | Building conversational understanding systems using a toolset |
US9717006B2 (en) | 2014-06-23 | 2017-07-25 | Microsoft Technology Licensing, Llc | Device quarantine in a wireless network |
US9728184B2 (en) | 2013-06-18 | 2017-08-08 | Microsoft Technology Licensing, Llc | Restructuring deep neural network acoustic models |
US9874914B2 (en) | 2014-05-19 | 2018-01-23 | Microsoft Technology Licensing, Llc | Power management contracts for accessory devices |
US10111099B2 (en) | 2014-05-12 | 2018-10-23 | Microsoft Technology Licensing, Llc | Distributing content in managed wireless distribution networks |
WO2019028282A1 (en) * | 2017-08-02 | 2019-02-07 | Veritone, Inc. | Methods and systems for transcription |
US10691445B2 (en) | 2014-06-03 | 2020-06-23 | Microsoft Technology Licensing, Llc | Isolating a portion of an online computing service for testing |
US20210312914A1 (en) * | 2018-11-29 | 2021-10-07 | Amazon Technologies, Inc. | Speech recognition using dialog history |
US11294942B2 (en) | 2016-09-29 | 2022-04-05 | Koninklijke Philips N.V. | Question generation |
Families Citing this family (213)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8645137B2 (en) | 2000-03-16 | 2014-02-04 | Apple Inc. | Fast, language-independent method for user authentication by voice |
US8677377B2 (en) | 2005-09-08 | 2014-03-18 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US9318108B2 (en) * | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
US8977255B2 (en) | 2007-04-03 | 2015-03-10 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US10002189B2 (en) | 2007-12-20 | 2018-06-19 | Apple Inc. | Method and apparatus for searching using an active ontology |
US9330720B2 (en) | 2008-01-03 | 2016-05-03 | Apple Inc. | Methods and apparatus for altering audio output signals |
US8996376B2 (en) | 2008-04-05 | 2015-03-31 | Apple Inc. | Intelligent text-to-speech conversion |
US10496753B2 (en) | 2010-01-18 | 2019-12-03 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US8077836B2 (en) * | 2008-07-30 | 2011-12-13 | At&T Intellectual Property, I, L.P. | Transparent voice registration and verification method and system |
US20100030549A1 (en) | 2008-07-31 | 2010-02-04 | Lee Michael M | Mobile device having human language translation capability with positional feedback |
US8676904B2 (en) | 2008-10-02 | 2014-03-18 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
WO2010067118A1 (en) | 2008-12-11 | 2010-06-17 | Novauris Technologies Limited | Speech recognition involving a mobile device |
US9858925B2 (en) | 2009-06-05 | 2018-01-02 | Apple Inc. | Using context information to facilitate processing of commands in a virtual assistant |
US10241752B2 (en) | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
US20120311585A1 (en) | 2011-06-03 | 2012-12-06 | Apple Inc. | Organizing task items that represent tasks to perform |
US10241644B2 (en) | 2011-06-03 | 2019-03-26 | Apple Inc. | Actionable reminder entries |
US9431006B2 (en) | 2009-07-02 | 2016-08-30 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
US10679605B2 (en) | 2010-01-18 | 2020-06-09 | Apple Inc. | Hands-free list-reading by intelligent automated assistant |
US10553209B2 (en) | 2010-01-18 | 2020-02-04 | Apple Inc. | Systems and methods for hands-free notification summaries |
US10705794B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US10276170B2 (en) | 2010-01-18 | 2019-04-30 | Apple Inc. | Intelligent automated assistant |
US8682667B2 (en) | 2010-02-25 | 2014-03-25 | Apple Inc. | User profiling for selecting user specific voice input processing information |
US8812321B2 (en) * | 2010-09-30 | 2014-08-19 | At&T Intellectual Property I, L.P. | System and method for combining speech recognition outputs from a plurality of domain-specific speech recognizers via machine learning |
US10762293B2 (en) | 2010-12-22 | 2020-09-01 | Apple Inc. | Using parts-of-speech tagging and named entity recognition for spelling correction |
US9262612B2 (en) | 2011-03-21 | 2016-02-16 | Apple Inc. | Device access using voice authentication |
US20120310642A1 (en) | 2011-06-03 | 2012-12-06 | Apple Inc. | Automatically creating a mapping between text data and audio data |
US10057736B2 (en) | 2011-06-03 | 2018-08-21 | Apple Inc. | Active transport based notifications |
US8994660B2 (en) | 2011-08-29 | 2015-03-31 | Apple Inc. | Text correction processing |
US10134385B2 (en) | 2012-03-02 | 2018-11-20 | Apple Inc. | Systems and methods for name pronunciation |
US9483461B2 (en) | 2012-03-06 | 2016-11-01 | Apple Inc. | Handling speech synthesis of content for multiple languages |
US9431012B2 (en) | 2012-04-30 | 2016-08-30 | 2236008 Ontario Inc. | Post processing of natural language automatic speech recognition |
US9093076B2 (en) * | 2012-04-30 | 2015-07-28 | 2236008 Ontario Inc. | Multipass ASR controlling multiple applications |
US9280610B2 (en) | 2012-05-14 | 2016-03-08 | Apple Inc. | Crowd sourcing information to fulfill user requests |
US10417037B2 (en) | 2012-05-15 | 2019-09-17 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US9721563B2 (en) | 2012-06-08 | 2017-08-01 | Apple Inc. | Name recognition system |
US9495129B2 (en) | 2012-06-29 | 2016-11-15 | Apple Inc. | Device, method, and user interface for voice-activated navigation and browsing of a document |
US9576574B2 (en) | 2012-09-10 | 2017-02-21 | Apple Inc. | Context-sensitive handling of interruptions by intelligent digital assistant |
US9547647B2 (en) | 2012-09-19 | 2017-01-17 | Apple Inc. | Voice-based media searching |
US9240184B1 (en) * | 2012-11-15 | 2016-01-19 | Google Inc. | Frame-level combination of deep neural network and gaussian mixture models |
US10282419B2 (en) * | 2012-12-12 | 2019-05-07 | Nuance Communications, Inc. | Multi-domain natural language processing architecture |
KR102516577B1 (en) | 2013-02-07 | 2023-04-03 | 애플 인크. | Voice trigger for a digital assistant |
US9626629B2 (en) | 2013-02-14 | 2017-04-18 | 24/7 Customer, Inc. | Categorization of user interactions into predefined hierarchical categories |
US9368114B2 (en) | 2013-03-14 | 2016-06-14 | Apple Inc. | Context-sensitive handling of interruptions |
US10652394B2 (en) | 2013-03-14 | 2020-05-12 | Apple Inc. | System and method for processing voicemail |
WO2014144579A1 (en) | 2013-03-15 | 2014-09-18 | Apple Inc. | System and method for updating an adaptive speech recognition model |
US10748529B1 (en) | 2013-03-15 | 2020-08-18 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
WO2014144949A2 (en) | 2013-03-15 | 2014-09-18 | Apple Inc. | Training an at least partial voice command system |
WO2014197334A2 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
WO2014197336A1 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for detecting errors in interactions with a voice-based digital assistant |
US9582608B2 (en) | 2013-06-07 | 2017-02-28 | Apple Inc. | Unified ranking with entropy-weighted information for phrase-based semantic auto-completion |
WO2014197335A1 (en) | 2013-06-08 | 2014-12-11 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
EP3008641A1 (en) | 2013-06-09 | 2016-04-20 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
CN105265005B (en) | 2013-06-13 | 2019-09-17 | Apple Inc. | System and method for the urgent call initiated by voice command |
WO2015020942A1 (en) | 2013-08-06 | 2015-02-12 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
US10296160B2 (en) | 2013-12-06 | 2019-05-21 | Apple Inc. | Method for extracting salient dialog usage from live data |
US9620105B2 (en) | 2014-05-15 | 2017-04-11 | Apple Inc. | Analyzing audio input for efficient speech and music recognition |
US10592095B2 (en) | 2014-05-23 | 2020-03-17 | Apple Inc. | Instantaneous speaking of content on touch devices |
US9502031B2 (en) | 2014-05-27 | 2016-11-22 | Apple Inc. | Method for supporting dynamic grammars in WFST-based ASR |
JP6596924B2 (en) * | 2014-05-29 | 2019-10-30 | 日本電気株式会社 | Audio data processing apparatus, audio data processing method, and audio data processing program |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US10078631B2 (en) | 2014-05-30 | 2018-09-18 | Apple Inc. | Entropy-guided text prediction using combined word and character n-gram language models |
US9785630B2 (en) | 2014-05-30 | 2017-10-10 | Apple Inc. | Text prediction using combined word N-gram and unigram language models |
US10289433B2 (en) | 2014-05-30 | 2019-05-14 | Apple Inc. | Domain specific language for encoding assistant dialog |
US9842101B2 (en) | 2014-05-30 | 2017-12-12 | Apple Inc. | Predictive conversion of language input |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US9734193B2 (en) | 2014-05-30 | 2017-08-15 | Apple Inc. | Determining domain salience ranking from ambiguous words in natural speech |
US9633004B2 (en) | 2014-05-30 | 2017-04-25 | Apple Inc. | Better resolution when referencing to concepts |
EP3149728B1 (en) | 2014-05-30 | 2019-01-16 | Apple Inc. | Multi-command single utterance input method |
US9715875B2 (en) | 2014-05-30 | 2017-07-25 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US9760559B2 (en) | 2014-05-30 | 2017-09-12 | Apple Inc. | Predictive text input |
US10659851B2 (en) | 2014-06-30 | 2020-05-19 | Apple Inc. | Real-time digital assistant knowledge updates |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10446141B2 (en) | 2014-08-28 | 2019-10-15 | Apple Inc. | Automatic speech recognition based on user feedback |
US9818400B2 (en) | 2014-09-11 | 2017-11-14 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10789041B2 (en) | 2014-09-12 | 2020-09-29 | Apple Inc. | Dynamic thresholds for always listening speech trigger |
US10127911B2 (en) | 2014-09-30 | 2018-11-13 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US10074360B2 (en) | 2014-09-30 | 2018-09-11 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US9886432B2 (en) | 2014-09-30 | 2018-02-06 | Apple Inc. | Parsimonious handling of word inflection via categorical stem + suffix N-gram language models |
US9646609B2 (en) | 2014-09-30 | 2017-05-09 | Apple Inc. | Caching apparatus for serving phonetic pronunciations |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
US10552013B2 (en) | 2014-12-02 | 2020-02-04 | Apple Inc. | Data detection |
US9711141B2 (en) | 2014-12-09 | 2017-07-18 | Apple Inc. | Disambiguating heteronyms in speech synthesis |
US9836452B2 (en) * | 2014-12-30 | 2017-12-05 | Microsoft Technology Licensing, Llc | Discriminating ambiguous expressions to enhance user experience |
US10152299B2 (en) | 2015-03-06 | 2018-12-11 | Apple Inc. | Reducing response latency of intelligent automated assistants |
EP3065133A1 (en) * | 2015-03-06 | 2016-09-07 | ZETES Industries S.A. | Method and system for generating an optimised solution in speech recognition |
US9865280B2 (en) | 2015-03-06 | 2018-01-09 | Apple Inc. | Structured dictation using intelligent automated assistants |
US9721566B2 (en) | 2015-03-08 | 2017-08-01 | Apple Inc. | Competing devices responding to voice triggers |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US9899019B2 (en) | 2015-03-18 | 2018-02-20 | Apple Inc. | Systems and methods for structured stem and suffix language models |
US20160300573A1 (en) * | 2015-04-08 | 2016-10-13 | Google Inc. | Mapping input to form fields |
US9842105B2 (en) | 2015-04-16 | 2017-12-12 | Apple Inc. | Parsimonious continuous-space phrase representations for natural language processing |
US10460227B2 (en) | 2015-05-15 | 2019-10-29 | Apple Inc. | Virtual assistant in a communication session |
US10200824B2 (en) | 2015-05-27 | 2019-02-05 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device |
US10083688B2 (en) | 2015-05-27 | 2018-09-25 | Apple Inc. | Device voice control for selecting a displayed affordance |
US10127220B2 (en) | 2015-06-04 | 2018-11-13 | Apple Inc. | Language identification from short strings |
US10101822B2 (en) | 2015-06-05 | 2018-10-16 | Apple Inc. | Language input correction |
US9578173B2 (en) | 2015-06-05 | 2017-02-21 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10255907B2 (en) | 2015-06-07 | 2019-04-09 | Apple Inc. | Automatic accent detection using acoustic models |
US10186254B2 (en) | 2015-06-07 | 2019-01-22 | Apple Inc. | Context-based endpoint detection |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US20160378747A1 (en) | 2015-06-29 | 2016-12-29 | Apple Inc. | Virtual assistant for media playback |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10331312B2 (en) | 2015-09-08 | 2019-06-25 | Apple Inc. | Intelligent automated assistant in a media environment |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US10740384B2 (en) | 2015-09-08 | 2020-08-11 | Apple Inc. | Intelligent automated assistant for media search and playback |
US9858923B2 (en) * | 2015-09-24 | 2018-01-02 | Intel Corporation | Dynamic adaptation of language models and semantic tracking for automatic speech recognition |
US9697820B2 (en) | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10956666B2 (en) | 2015-11-09 | 2021-03-23 | Apple Inc. | Unconventional virtual assistant interactions |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US11227589B2 (en) | 2016-06-06 | 2022-01-18 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
DK179309B1 (en) | 2016-06-09 | 2018-04-23 | Apple Inc | Intelligent automated assistant in a home environment |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
DK179343B1 (en) | 2016-06-11 | 2018-05-14 | Apple Inc | Intelligent task discovery |
DK179049B1 (en) | 2016-06-11 | 2017-09-18 | Apple Inc | Data driven natural language event detection and classification |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US10170110B2 (en) * | 2016-11-17 | 2019-01-01 | Robert Bosch Gmbh | System and method for ranking of hybrid speech recognition results with neural networks |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
CN108573706B (en) * | 2017-03-10 | 2021-06-08 | 北京搜狗科技发展有限公司 | Voice recognition method, device and equipment |
CN107122179A (en) * | 2017-03-31 | 2017-09-01 | 阿里巴巴集团控股有限公司 | The function control method and device of voice |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
DK201770383A1 (en) | 2017-05-09 | 2018-12-14 | Apple Inc. | User interface for correcting recognition errors |
DK201770439A1 (en) | 2017-05-11 | 2018-12-13 | Apple Inc. | Offline personal assistant |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
DK180048B1 (en) | 2017-05-11 | 2020-02-04 | Apple Inc. | MAINTAINING THE DATA PROTECTION OF PERSONAL INFORMATION |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT |
DK201770429A1 (en) | 2017-05-12 | 2018-12-14 | Apple Inc. | Low-latency intelligent automated assistant |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
DK201770432A1 (en) | 2017-05-15 | 2018-12-21 | Apple Inc. | Hierarchical belief states for digital assistants |
US20180336275A1 (en) | 2017-05-16 | 2018-11-22 | Apple Inc. | Intelligent automated assistant for media exploration |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US20180336892A1 (en) | 2017-05-16 | 2018-11-22 | Apple Inc. | Detecting a trigger of a digital assistant |
DK179560B1 (en) | 2017-05-16 | 2019-02-18 | Apple Inc. | Far-field extension for digital assistant services |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10665223B2 (en) | 2017-09-29 | 2020-05-26 | Udifi, Inc. | Acoustic and other waveform event detection and correction systems and methods |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
DK179822B1 (en) | 2018-06-01 | 2019-07-12 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
DK201870355A1 (en) | 2018-06-01 | 2019-12-16 | Apple Inc. | Virtual assistant operation in multi-device environments |
DK180639B1 (en) | 2018-06-01 | 2021-11-04 | Apple Inc | DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
US10496705B1 (en) | 2018-06-03 | 2019-12-03 | Apple Inc. | Accelerated task performance |
US11100140B2 (en) | 2018-06-04 | 2021-08-24 | International Business Machines Corporation | Generation of domain specific type system |
US11010561B2 (en) | 2018-09-27 | 2021-05-18 | Apple Inc. | Sentiment prediction from textual data |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US11170166B2 (en) | 2018-09-28 | 2021-11-09 | Apple Inc. | Neural typographical error modeling via generative adversarial networks |
US10839159B2 (en) | 2018-09-28 | 2020-11-17 | Apple Inc. | Named entity normalization in a spoken dialog system |
CN111128127A (en) * | 2018-10-15 | 2020-05-08 | 珠海格力电器股份有限公司 | Voice recognition processing method and device |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
US10388272B1 (en) | 2018-12-04 | 2019-08-20 | Sorenson Ip Holdings, Llc | Training speech recognition systems using word sequences |
US11170761B2 (en) | 2018-12-04 | 2021-11-09 | Sorenson Ip Holdings, Llc | Training of speech recognition systems |
US10573312B1 (en) | 2018-12-04 | 2020-02-25 | Sorenson Ip Holdings, Llc | Transcription generation from multiple speech recognition systems |
US11017778B1 (en) | 2018-12-04 | 2021-05-25 | Sorenson Ip Holdings, Llc | Switching between speech recognition systems |
US11638059B2 (en) | 2019-01-04 | 2023-04-25 | Apple Inc. | Content playback on multiple devices |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
DK201970509A1 (en) | 2019-05-06 | 2021-01-15 | Apple Inc | Spoken notifications |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
DK201970510A1 (en) | 2019-05-31 | 2021-02-11 | Apple Inc | Voice identification in digital assistant systems |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
DK180129B1 (en) | 2019-05-31 | 2020-06-02 | Apple Inc. | User activity shortcut suggestions |
US11468890B2 (en) | 2019-06-01 | 2022-10-11 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
CN110473547B (en) * | 2019-07-12 | 2021-07-30 | 云知声智能科技股份有限公司 | Speech recognition method |
US11488406B2 (en) | 2019-09-25 | 2022-11-01 | Apple Inc. | Text detection using global geometry estimators |
US11567788B1 (en) | 2019-10-18 | 2023-01-31 | Meta Platforms, Inc. | Generating proactive reminders for assistant systems |
US11861674B1 (en) | 2019-10-18 | 2024-01-02 | Meta Platforms Technologies, Llc | Method, one or more computer-readable non-transitory storage media, and a system for generating comprehensive information for products of interest by assistant systems |
US11810578B2 (en) | 2020-05-11 | 2023-11-07 | Apple Inc. | Device arbitration for digital assistant-based intercom systems |
US11061543B1 (en) | 2020-05-11 | 2021-07-13 | Apple Inc. | Providing relevant data items based on context |
US11183193B1 (en) | 2020-05-11 | 2021-11-23 | Apple Inc. | Digital assistant hardware abstraction |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11490204B2 (en) | 2020-07-20 | 2022-11-01 | Apple Inc. | Multi-device audio adjustment coordination |
US11438683B2 (en) | 2020-07-21 | 2022-09-06 | Apple Inc. | User identification using headphones |
US11488604B2 (en) | 2020-08-19 | 2022-11-01 | Sorenson Ip Holdings, Llc | Transcription of audio |
EP4254401A1 (en) * | 2021-02-17 | 2023-10-04 | Samsung Electronics Co., Ltd. | Electronic device and control method therefor |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2383459A (en) * | 2001-12-20 | 2003-06-25 | Hewlett Packard Co | Speech recognition system with confidence assessment |
US20030125940A1 (en) * | 2002-01-02 | 2003-07-03 | International Business Machines Corporation | Method and apparatus for transcribing speech when a plurality of speakers are participating |
US20030144837A1 (en) * | 2002-01-29 | 2003-07-31 | Basson Sara H. | Collaboration of multiple automatic speech recognition (ASR) systems |
US20040138885A1 (en) * | 2003-01-09 | 2004-07-15 | Xiaofan Lin | Commercial automatic speech recognition engine combinations |
US20050065790A1 (en) * | 2003-09-23 | 2005-03-24 | Sherif Yacoub | System and method using multiple automated speech recognition engines |
US20050125224A1 (en) * | 2003-11-06 | 2005-06-09 | Myers Gregory K. | Method and apparatus for fusion of recognition results from multiple types of data sources |
US20050143995A1 (en) * | 2001-07-03 | 2005-06-30 | Kibkalo Alexandr A. | Method and apparatus for dynamic beam control in viterbi search |
US20060009980A1 (en) * | 2004-07-12 | 2006-01-12 | Burke Paul M | Allocation of speech recognition tasks and combination of results thereof |
US6996525B2 (en) * | 2001-06-15 | 2006-02-07 | Intel Corporation | Selecting one of multiple speech recognizers in a system based on performance predections resulting from experience |
US20070038453A1 (en) * | 2005-08-09 | 2007-02-15 | Kabushiki Kaisha Toshiba | Speech recognition system |
US20070094270A1 (en) * | 2005-10-21 | 2007-04-26 | Callminer, Inc. | Method and apparatus for the processing of heterogeneous units of work |
US20070118373A1 (en) * | 2005-11-23 | 2007-05-24 | Wise Gerald B | System and method for generating closed captions |
US20100057451A1 (en) * | 2008-08-29 | 2010-03-04 | Eric Carraux | Distributed Speech Recognition Using One Way Communication |
US7711561B2 (en) * | 2004-01-05 | 2010-05-04 | Kabushiki Kaisha Toshiba | Speech recognition system and technique |
US20100250250A1 (en) * | 2009-03-30 | 2010-09-30 | Jonathan Wiggs | Systems and methods for generating a hybrid text string from two or more text strings generated by multiple automated speech recognition systems |
US8812321B2 (en) * | 2010-09-30 | 2014-08-19 | At&T Intellectual Property I, L.P. | System and method for combining speech recognition outputs from a plurality of domain-specific speech recognizers via machine learning |
Family Cites Families (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5710864A (en) * | 1994-12-29 | 1998-01-20 | Lucent Technologies Inc. | Systems, methods and articles of manufacture for improving recognition confidence in hypothesized keywords |
US5754978A (en) * | 1995-10-27 | 1998-05-19 | Speech Systems Of Colorado, Inc. | Speech recognition system |
DE19635754A1 (en) * | 1996-09-03 | 1998-03-05 | Siemens Ag | Speech processing system and method for speech processing |
US6487532B1 (en) * | 1997-09-24 | 2002-11-26 | Scansoft, Inc. | Apparatus and method for distinguishing similar-sounding utterances speech recognition |
US6061646A (en) * | 1997-12-18 | 2000-05-09 | International Business Machines Corp. | Kiosk for multiple spoken languages |
US6324510B1 (en) * | 1998-11-06 | 2001-11-27 | Lernout & Hauspie Speech Products N.V. | Method and apparatus of hierarchically organizing an acoustic model for speech recognition and adaptation of the model to unseen domains |
US6526380B1 (en) * | 1999-03-26 | 2003-02-25 | Koninklijke Philips Electronics N.V. | Speech recognition system having parallel large vocabulary recognition engines |
US7016835B2 (en) * | 1999-10-29 | 2006-03-21 | International Business Machines Corporation | Speech and signal digitization by using recognition metrics to select from multiple techniques |
US6671669B1 (en) * | 2000-07-18 | 2003-12-30 | Qualcomm Incorporated | Combined engine system and method for voice recognition |
US6973429B2 (en) * | 2000-12-04 | 2005-12-06 | A9.Com, Inc. | Grammar generation for voice-based searches |
US20030050777A1 (en) * | 2001-09-07 | 2003-03-13 | Walker William Donald | System and method for automatic transcription of conversations |
US7149689B2 (en) * | 2003-01-30 | 2006-12-12 | Hewlett-Packard Development Company, Lp. | Two-engine speech recognition |
US20040210437A1 (en) * | 2003-04-15 | 2004-10-21 | Aurilab, Llc | Semi-discrete utterance recognizer for carefully articulated speech |
US20050065789A1 (en) * | 2003-09-23 | 2005-03-24 | Sherif Yacoub | System and method with automated speech recognition engines |
US20050177371A1 (en) * | 2004-02-06 | 2005-08-11 | Sherif Yacoub | Automated speech recognition |
US20060064177A1 (en) * | 2004-09-17 | 2006-03-23 | Nokia Corporation | System and method for measuring confusion among words in an adaptive speech recognition system |
US7739286B2 (en) * | 2005-03-17 | 2010-06-15 | University Of Southern California | Topic specific language models built from large numbers of documents |
KR100755677B1 (en) * | 2005-11-02 | 2007-09-05 | 삼성전자주식회사 | Apparatus and method for dialogue speech recognition using topic detection |
ATE449403T1 (en) * | 2005-12-12 | 2009-12-15 | Gregory John Gadbois | MULTI-VOICE SPEECH RECOGNITION |
JP5212910B2 (en) * | 2006-07-07 | 2013-06-19 | 日本電気株式会社 | Speech recognition apparatus, speech recognition method, and speech recognition program |
US7840407B2 (en) * | 2006-10-13 | 2010-11-23 | Google Inc. | Business listing search |
WO2008096582A1 (en) * | 2007-02-06 | 2008-08-14 | Nec Corporation | Recognizer weight learning device, speech recognizing device, and system |
US8041565B1 (en) * | 2007-05-04 | 2011-10-18 | Foneweb, Inc. | Precision speech to text conversion |
US8275615B2 (en) * | 2007-07-13 | 2012-09-25 | International Business Machines Corporation | Model weighting, selection and hypotheses combination for automatic speech recognition and machine translation |
US8660844B2 (en) * | 2007-10-24 | 2014-02-25 | At&T Intellectual Property I, L.P. | System and method of evaluating user simulations in a spoken dialog system with a diversion metric |
US8843370B2 (en) * | 2007-11-26 | 2014-09-23 | Nuance Communications, Inc. | Joint discriminative training of multiple speech recognizers |
US8364481B2 (en) * | 2008-07-02 | 2013-01-29 | Google Inc. | Speech recognition with parallel recognition tasks |
JP5530729B2 (en) * | 2009-01-23 | 2014-06-25 | 本田技研工業株式会社 | Speech understanding device |
JP5377430B2 (en) * | 2009-07-08 | 2013-12-25 | 本田技研工業株式会社 | Question answering database expansion device and question answering database expansion method |
Family Applications
Application number | Date | Publication number | Status |
---|---|---|---|
US12/895,359 | 2010-09-30 | US8812321B2 | Active |
US14/459,719 | 2014-08-14 | US20140358537A1 | Abandoned |
Patent Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6996525B2 (en) * | 2001-06-15 | 2006-02-07 | Intel Corporation | Selecting one of multiple speech recognizers in a system based on performance predections resulting from experience |
US20050143995A1 (en) * | 2001-07-03 | 2005-06-30 | Kibkalo Alexandr A. | Method and apparatus for dynamic beam control in viterbi search |
GB2383459A (en) * | 2001-12-20 | 2003-06-25 | Hewlett Packard Co | Speech recognition system with confidence assessment |
US20030125940A1 (en) * | 2002-01-02 | 2003-07-03 | International Business Machines Corporation | Method and apparatus for transcribing speech when a plurality of speakers are participating |
US20030144837A1 (en) * | 2002-01-29 | 2003-07-31 | Basson Sara H. | Collaboration of multiple automatic speech recognition (ASR) systems |
US20040138885A1 (en) * | 2003-01-09 | 2004-07-15 | Xiaofan Lin | Commercial automatic speech recognition engine combinations |
US20050065790A1 (en) * | 2003-09-23 | 2005-03-24 | Sherif Yacoub | System and method using multiple automated speech recognition engines |
US20050125224A1 (en) * | 2003-11-06 | 2005-06-09 | Myers Gregory K. | Method and apparatus for fusion of recognition results from multiple types of data sources |
US7711561B2 (en) * | 2004-01-05 | 2010-05-04 | Kabushiki Kaisha Toshiba | Speech recognition system and technique |
US20060009980A1 (en) * | 2004-07-12 | 2006-01-12 | Burke Paul M | Allocation of speech recognition tasks and combination of results thereof |
US20070038453A1 (en) * | 2005-08-09 | 2007-02-15 | Kabushiki Kaisha Toshiba | Speech recognition system |
US20070094270A1 (en) * | 2005-10-21 | 2007-04-26 | Callminer, Inc. | Method and apparatus for the processing of heterogeneous units of work |
US20070118373A1 (en) * | 2005-11-23 | 2007-05-24 | Wise Gerald B | System and method for generating closed captions |
US20100057451A1 (en) * | 2008-08-29 | 2010-03-04 | Eric Carraux | Distributed Speech Recognition Using One Way Communication |
US20100250250A1 (en) * | 2009-03-30 | 2010-09-30 | Jonathan Wiggs | Systems and methods for generating a hybrid text string from two or more text strings generated by multiple automated speech recognition systems |
US8812321B2 (en) * | 2010-09-30 | 2014-08-19 | At&T Intellectual Property I, L.P. | System and method for combining speech recognition outputs from a plurality of domain-specific speech recognizers via machine learning |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9728184B2 (en) | 2013-06-18 | 2017-08-08 | Microsoft Technology Licensing, Llc | Restructuring deep neural network acoustic models |
US9589565B2 (en) | 2013-06-21 | 2017-03-07 | Microsoft Technology Licensing, Llc | Environmentally aware dialog policies and response generation |
US10572602B2 (en) | 2013-06-21 | 2020-02-25 | Microsoft Technology Licensing, Llc | Building conversational understanding systems using a toolset |
US10304448B2 (en) | 2013-06-21 | 2019-05-28 | Microsoft Technology Licensing, Llc | Environmentally aware dialog policies and response generation |
US9697200B2 (en) | 2013-06-21 | 2017-07-04 | Microsoft Technology Licensing, Llc | Building conversational understanding systems using a toolset |
US9324321B2 (en) | 2014-03-07 | 2016-04-26 | Microsoft Technology Licensing, Llc | Low-footprint adaptation and personalization for a deep neural network |
US9529794B2 (en) | 2014-03-27 | 2016-12-27 | Microsoft Technology Licensing, Llc | Flexible schema for language model customization |
US10497367B2 (en) | 2014-03-27 | 2019-12-03 | Microsoft Technology Licensing, Llc | Flexible schema for language model customization |
US9614724B2 (en) | 2014-04-21 | 2017-04-04 | Microsoft Technology Licensing, Llc | Session-based device configuration |
US20150310858A1 (en) * | 2014-04-29 | 2015-10-29 | Microsoft Corporation | Shared hidden layer combination for speech recognition systems |
US9520127B2 (en) * | 2014-04-29 | 2016-12-13 | Microsoft Technology Licensing, Llc | Shared hidden layer combination for speech recognition systems |
US10111099B2 (en) | 2014-05-12 | 2018-10-23 | Microsoft Technology Licensing, Llc | Distributing content in managed wireless distribution networks |
US9430667B2 (en) | 2014-05-12 | 2016-08-30 | Microsoft Technology Licensing, Llc | Managed wireless distribution network |
US9874914B2 (en) | 2014-05-19 | 2018-01-23 | Microsoft Technology Licensing, Llc | Power management contracts for accessory devices |
US10691445B2 (en) | 2014-06-03 | 2020-06-23 | Microsoft Technology Licensing, Llc | Isolating a portion of an online computing service for testing |
US9477625B2 (en) | 2014-06-13 | 2016-10-25 | Microsoft Technology Licensing, Llc | Reversible connector for accessory devices |
US9717006B2 (en) | 2014-06-23 | 2017-07-25 | Microsoft Technology Licensing, Llc | Device quarantine in a wireless network |
US11294942B2 (en) | 2016-09-29 | 2022-04-05 | Koninklijke Philips N.V. | Question generation |
WO2019028255A1 (en) * | 2017-08-02 | 2019-02-07 | Veritone, Inc. | Methods and systems for optimizing engine selection |
WO2019028279A1 (en) * | 2017-08-02 | 2019-02-07 | Veritone, Inc. | Methods and systems for optimizing engine selection using machine learning modeling |
WO2019028282A1 (en) * | 2017-08-02 | 2019-02-07 | Veritone, Inc. | Methods and systems for transcription |
US20210312914A1 (en) * | 2018-11-29 | 2021-10-07 | Amazon Technologies, Inc. | Speech recognition using dialog history |
Also Published As
Publication number | Publication date |
---|---|
US8812321B2 (en) | 2014-08-19 |
US20120084086A1 (en) | 2012-04-05 |
Similar Documents
Publication | Title |
---|---|
US8812321B2 (en) | System and method for combining speech recognition outputs from a plurality of domain-specific speech recognizers via machine learning |
US10726833B2 (en) | System and method for rapid customization of speech recognition models |
US11545142B2 (en) | Using context information with end-to-end models for speech recognition |
US10699702B2 (en) | System and method for personalization of acoustic models for automatic speech recognition |
US10152971B2 (en) | System and method for advanced turn-taking for interactive spoken dialog systems |
US11423883B2 (en) | Contextual biasing for speech recognition |
US8738375B2 (en) | System and method for optimizing speech recognition and natural language parameters with user feedback |
US8589163B2 (en) | Adapting language models with a bit mask for a subset of related words |
US9984679B2 (en) | System and method for optimizing speech recognition and natural language parameters with user feedback |
US9431005B2 (en) | System and method for supplemental speech recognition by identified idle resources |
US8600749B2 (en) | System and method for training adaptation-specific acoustic models for automatic speech recognition |
US10854191B1 (en) | Machine learning models for data driven dialog management |
US20130066632A1 (en) | System and method for enriching text-to-speech synthesis with automatic dialog act tags |
US20230368796A1 (en) | Speech processing |
Shon et al. | Leveraging pre-trained language model for speech sentiment analysis |
US11804225B1 (en) | Dialog management system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: AT&T INTELLECTUAL PROPERTY I, L.P., NEVADA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GILBERT, MAZIN;BANGALORE, SRINIVAS;HAFFNER, PATRICK;AND OTHERS;SIGNING DATES FROM 20100927 TO 20100930;REEL/FRAME:034442/0827 |
| AS | Assignment | Owner name: AT&T ALEX HOLDINGS, LLC, TEXAS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:AT&T INTELLECTUAL PROPERTY I, L.P.;REEL/FRAME:034482/0130 Effective date: 20141208 |
| AS | Assignment | Owner name: INTERACTIONS LLC, MASSACHUSETTS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:AT&T ALEX HOLDINGS, LLC;REEL/FRAME:034642/0640 Effective date: 20141210 |
| AS | Assignment | Owner name: ORIX VENTURES, LLC, NEW YORK Free format text: SECURITY INTEREST;ASSIGNOR:INTERACTIONS LLC;REEL/FRAME:034677/0768 Effective date: 20141218 |
| AS | Assignment | Owner name: ARES VENTURE FINANCE, L.P., NEW YORK Free format text: SECURITY INTEREST;ASSIGNOR:INTERACTIONS LLC;REEL/FRAME:036009/0349 Effective date: 20150616 |
| AS | Assignment | Owner name: SILICON VALLEY BANK, MASSACHUSETTS Free format text: FIRST AMENDMENT TO INTELLECTUAL PROPERTY SECURITY AGREEMENT;ASSIGNOR:INTERACTIONS LLC;REEL/FRAME:036100/0925 Effective date: 20150709 |
| AS | Assignment | Owner name: ARES VENTURE FINANCE, L.P., NEW YORK Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE CHANGE PATENT 7146987 TO 7149687 PREVIOUSLY RECORDED ON REEL 036009 FRAME 0349. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNOR:INTERACTIONS LLC;REEL/FRAME:037134/0712 Effective date: 20150616 |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
| AS | Assignment | Owner name: BEARCUB ACQUISITIONS LLC, CALIFORNIA Free format text: ASSIGNMENT OF IP SECURITY AGREEMENT;ASSIGNOR:ARES VENTURE FINANCE, L.P.;REEL/FRAME:044481/0034 Effective date: 20171107 |
| AS | Assignment | Owner name: SILICON VALLEY BANK, MASSACHUSETTS Free format text: INTELLECTUAL PROPERTY SECURITY AGREEMENT;ASSIGNOR:INTERACTIONS LLC;REEL/FRAME:049388/0082 Effective date: 20190603 |
| AS | Assignment | Owner name: ARES VENTURE FINANCE, L.P., NEW YORK Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:BEARCUB ACQUISITIONS LLC;REEL/FRAME:052693/0866 Effective date: 20200515 |
| AS | Assignment | Owner name: INTERACTIONS LLC, MASSACHUSETTS Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN INTELLECTUAL PROPERTY;ASSIGNOR:ORIX GROWTH CAPITAL, LLC;REEL/FRAME:061749/0825 Effective date: 20190606 |
| AS | Assignment | Owner name: INTERACTIONS CORPORATION, MASSACHUSETTS Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN INTELLECTUAL PROPERTY;ASSIGNOR:ORIX GROWTH CAPITAL, LLC;REEL/FRAME:061749/0825 Effective date: 20190606 |
| AS | Assignment | Owner name: INTERACTIONS LLC, MASSACHUSETTS Free format text: RELEASE OF SECURITY INTEREST IN INTELLECTUAL PROPERTY RECORDED AT REEL/FRAME: 049388/0082;ASSIGNOR:SILICON VALLEY BANK;REEL/FRAME:060558/0474 Effective date: 20220624 |
| AS | Assignment | Owner name: INTERACTIONS LLC, MASSACHUSETTS Free format text: RELEASE OF SECURITY INTEREST IN INTELLECTUAL PROPERTY RECORDED AT REEL/FRAME: 036100/0925;ASSIGNOR:SILICON VALLEY BANK;REEL/FRAME:060559/0576 Effective date: 20220624 |