US20010041978A1 - Search optimization for continuous speech recognition

Search optimization for continuous speech recognition

Publication number
US20010041978A1
Authority
US
United States
Prior art keywords
words
continuous speech
csr
providing
salient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US09/185,529
Other versions
US6397179B2 (en)
Inventor
Jean-Francois Crespo
Peter R. Stubley
Serge Robillard
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Benhov GmbH LLC
Original Assignee
Nortel Networks Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/119,621 (US6092045A)
Priority to US09/185,529 (US6397179B2)
Application filed by Nortel Networks Ltd
Assigned to NORTHERN TELECOM LIMITED reassignment NORTHERN TELECOM LIMITED ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CRESPO, JEAN-FRANCOIS, ROBILLARS, SERGE, SSTUBLEY, PETER R.
Priority to DE69908254T (DE69908254T2)
Priority to EP99305530A (EP0977174B1)
Assigned to NORTHERN TELECOM LIMITED reassignment NORTHERN TELECOM LIMITED RE-RECORD TO CORRECT THE SURNAME OF INVENTORS PREVIOUSLY RECORDED AT REEL/FRAME 9815/0728. Assignors: CRESPO. JEAN-FRANCOIS, ROBILLARD, SERGE, STUBLEY, PETER R.
Assigned to NORTEL NETWORKS CORPORATION reassignment NORTEL NETWORKS CORPORATION CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: NORTHERN TELECOM LIMITED
Assigned to NORTEL NETWORKS LIMITED reassignment NORTEL NETWORKS LIMITED CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: NORTEL NETWORKS CORPORATION
Publication of US20010041978A1
Publication of US6397179B2
Application granted
Assigned to INNOVATION MANAGEMENT SCIENCES, LLC reassignment INNOVATION MANAGEMENT SCIENCES, LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: NORTEL NETWORKS LIMITED
Assigned to POPKIN FAMILY ASSETS, L.L.C. reassignment POPKIN FAMILY ASSETS, L.L.C. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: INNOVATION MANAGEMENT SCIENCES LLC
Anticipated expiration
Expired - Fee Related

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/085 Methods for reducing search complexity, pruning

Definitions

  • This invention relates to a system and method for optimization of searching for continuous speech recognition.
  • Speech recognition for applications such as automated directory enquiry assistance and control of operation based on speech input requires a real time response.
  • Spoken input must be recognized within about half a second of the end of the spoken input to simulate the response of a human operator and avoid a perception of unnatural delay.
  • Processing of speech input falls into five main steps: audio channel adaptation, feature extraction, word end point detection, speech recognition, and accept/reject decision logic.
  • Pattern recognition generally, and more particularly recognition of patterns in continuous signals such as speech signals, requires complex calculations and is dependent on providing sufficient processing power to meet the computational load.
  • the speech recognition step is the most computationally intensive step of the process.
  • the computational load is dependent on the number of words or other elements of speech, which are modelled and held in a dictionary, for comparison to the spoken input (i.e. the size of vocabulary of the system); the complexity of the models in the dictionary; how the speech input is processed into a representation ready for comparison to the models; and the algorithm used for carrying out the comparison process. Numerous attempts have been made to improve the trade off between computational load, accuracy of recognition and speed of recognition.
  • the first is to make use of specialized hardware or parallel processing architectures.
  • the second is to develop optimized search methods based on search algorithms that yield reasonable accuracies, but at a fraction of the cost of more optimal architectures.
  • the latter approach is favored by many researchers, since it tackles the problem at the source, see for example, Schwartz, R., Nguyen, L., Makhoul, J., “Multiple-pass search strategies”, in Automatic Speech and Speaker Recognition, Lee, C. H., Soong, F. K., Paliwal, K. K. (eds.), Kluwer Academic Publishers (1996), pp 429-456.
  • This approach is appealing since the hardware and algorithmic optimizations are often orthogonal, so the latter can always be built on top of the former.
  • the basic components of a spoken language processing (SLP) system include a continuous speech recognizer (CSR) for receiving spoken input from the user and a Natural Language Understanding component (NLU), represented schematically in FIG. 1.
  • a conventional system operates as follows. Speech input is received by the CSR, and a search is performed by the CSR using acoustic models that model speech sounds, and a language model or ‘grammar’ that describes how words may be connected together.
  • the acoustic model is typically in the form of Hidden Markov Models (HMM) describing the acoustic space.
  • the language knowledge is usually used for both the CSR component and the NLU component, as shown in FIG. 1.
  • the language knowledge usually takes the form of a statistical language model (bigram or trigram). If the goal is to recognize a specific constrained vocabulary, then the language knowledge takes the form of a regular grammar.
  • the search passes the recognized word strings representing several likely choices, in the form of a graph, to the natural language understanding component for extracting meaning from the recognized word strings.
  • the language model provides knowledge to the NLU relating to understanding of the recognized word strings. More particularly the semantic information from the language knowledge is fed exclusively to the NLU component with information on how to construct a meaning representation of the CSR's output. This involves, among other things, identifying which words are important to the meaning and which are not. The latter are referred to as non-keywords or semantically-null words. Thus semantically-meaningful words and semantically-null words are identified to provide understanding of the input, and in the process, the word strings are converted to a standard logical form.
  • the logical form is passed to a discourse manager DM, which is the interface between the user and the application.
  • the DM gathers the necessary information from the user to request the applications to perform the user's goal by prompting the user for input.
  • a language model is defined as the graph that is used by the CSR search algorithm to perform recognition.
  • a grammar is a set of rules, which may also be represented as a graph, used by the NLU component to extract meaning from the recognized speech. There may be a one to one mapping between the language model and the grammar in the case where the language model is a constrained model. Connected Word Recognition (CWR) is an example of the latter.
  • the Viterbi beam search, without a doubt the most widely used optimization, prunes the paths whose scores (likelihoods) are outside a beam determined by the best local score.
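The pruning rule described above can be sketched as follows. This is a minimal illustration, not the recognizer's actual code: the function name `beam_prune`, the `beam_width` parameter and the example path names and scores are all hypothetical, and scores are assumed to be log-likelihoods.

```python
def beam_prune(scores, beam_width):
    """Keep only the paths whose log-score lies within beam_width of the best.

    scores: dict mapping a path identifier to its log-likelihood at the
    current time frame (hypothetical representation).
    """
    if not scores:
        return {}
    best = max(scores.values())  # best local score at this frame
    return {path: s for path, s in scores.items() if s >= best - beam_width}


# Hypothetical log-likelihoods of the active paths at one time frame.
active = {"path_a": -12.0, "path_b": -13.5, "path_c": -30.0}
survivors = beam_prune(active, beam_width=5.0)
```

A wider beam keeps more paths alive, which costs more computation but lowers the risk of pruning the path that would eventually have scored best.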
  • Some neural-network based approaches threshold the posterior probabilities of each state to determine if it should remain active (Bourlard, H. Morgan, N., “Connectionist Speech Recognition—A Hybrid Approach”, Kluwer Academic Press, 1994.)
  • Word spotting techniques are an attempt to indirectly use semantic information by focusing the recognizer on the list of keywords (or key phrases) that are semantically meaningful. Some word spotting techniques use background models of speech in an attempt to capture every word that is not in the word spotter's dictionary, including semantically null words (non-keywords) (Rohlicek, J. R., Russel, W., Roukos, S., Gish, H., “Word Spotting”, ICASSP 1989, pp 627-630).
  • LVCSR Large Vocabulary Continuous Speech Recognizers
  • the CSR recognizer simply outputs a string of keywords and non-keywords for further processing using semantic information: it does not make use of semantic information during the search. Consequently there is a need for further optimization of continuous speech recognizers.
  • the present invention seeks to provide a system and method for optimization of searching for continuous speech recognizers which overcomes or avoids the above mentioned problems.
  • a method for continuous speech recognition comprising: incorporating semantic information during searching by a continuous speech recognizer.
  • incorporating semantic information during searching comprises searching using semantic information to identify semantically-null words and thereby generate an N-best list of salient words, instead of an N-best list of both salient and semantically null words.
  • The savings, which reduce processing time both during the forward and the backward passes of the search, as well as during rescoring, are achieved by performing only the minimal amount of computation required to produce an exact N-best list of semantically meaningful words (N-best list of salient words).
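To make the difference between the two kinds of N-best list concrete, the sketch below derives a salient N-best list from a full one by filtering after the fact. This only illustrates what the output contains; the method described here obtains the savings by using the semantic information during the search itself. The function name, the example words and the scores are hypothetical.

```python
def salient_n_best(hypotheses, null_words, n):
    """Reduce (word_sequence, log_score) hypotheses to the N best salient sequences.

    Hypotheses that differ only in semantically null words map to the same
    salient sequence; the best score among them is kept.
    """
    best = {}
    for words, score in hypotheses:
        key = tuple(w for w in words if w not in null_words)
        if key not in best or score > best[key]:
            best[key] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical full N-best list with prefix words that carry no meaning.
hyps = [
    (["please", "call", "home"], -10.0),
    (["uh", "call", "home"], -11.0),
    (["please", "call", "rome"], -12.5),
]
result = salient_n_best(hyps, null_words={"please", "uh"}, n=2)
```

The first two hypotheses collapse into one salient entry, which is why a list of salient words is more compact than a list that also distinguishes semantically null words.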
  • a method for continuous speech recognition comprising: providing speech input to a continuous speech recognizer; providing to the continuous speech recognizer an acoustic model comprising a set of Hidden Markov Models, and a language model comprising both grammar and semantic information; performing recognition of speech input using semantic information to eliminate semantically null words from the N-best list of words and restrict searching to an N-best list of salient words; and performing word matching to output from the speech recognizer the N-best salient word sequences.
  • the step of performing recognition comprises: detecting connected word grammars bounded by semantically null words; collapsing each list of semantically null words into a unique single-input single-output acoustic network; and identifying stop nodes in the acoustic network.
  • scoring comprises Viterbi scoring or other known methods.
  • the method above may be combined with other techniques to save processing time.
  • searching may alternatively be based on beam searches and lexical trees to provide benefits of those methods in addition to benefits of the method above.
  • the method comprises searching using semantic information to identify semantically-null words and thereby generate a list of N-best salient words.
  • Yet another aspect of the invention provides software on a machine readable medium for performing a method for continuous speech recognition comprising: providing speech input to a continuous speech recognizer; providing to the continuous speech recognizer an acoustic model comprising a set of Hidden Markov Models, and a language model comprising both grammar and semantic information; performing recognition of speech input using semantic information to eliminate semantically null words from the N-best list of words and restrict searching to an N-best list of salient words.
  • Another aspect of the invention provides a system for continuous speech recognition comprising:
  • a spoken language processing system for speech recognition comprising: a continuous speech recognition component (CSR); a natural language understanding component (NLU); means for providing speech input to the CSR; means for providing acoustic-phonetic knowledge to the CSR comprising a set of Hidden Markov Models; means for providing language knowledge comprising grammar and statistical models to the CSR, and means for providing semantic knowledge to the NLU, and means for providing semantic knowledge to the CSR; the CSR being operable for searching using the semantic knowledge to constrain the search to an N-best list of salient words, and perform word matching to output N-best list of salient words to the NLU for interpretation of meaning.
  • Another aspect of the present invention provides a method for continuous speech recognition using a spoken language system comprising a continuous speech recognition component (CSR) linked to a natural language understanding component (NLU); providing speech input to the CSR; providing acoustic-phonetic knowledge to the CSR comprising a set of Hidden Markov Models; providing language knowledge comprising grammar and statistical models to the CSR; providing semantic knowledge to the CSR; performing searching with the CSR using the semantic knowledge to constrain the search to an N-best list of salient words comprising semantically meaningful words of the N-best list of words; and, performing word matching to output the N-best salient word sequences to the NLU.
  • searching may alternatively be based on beam searches and lexical trees to provide benefits of those methods in addition to benefits of the method described above.
  • FIG. 1 shows a known prior art spoken language processing system comprising a continuous speech recognition component (CSR) and a natural language understanding component (NLU);
  • FIG. 2 shows a spoken language processing system comprising a continuous speech recognizer for search optimization according to a first embodiment of the present invention
  • FIG. 3 shows an example of a search network for a prefix-core-suffix regular grammar
  • FIG. 4 represents forward scoring of the search network
  • FIG. 5 shows an example of a word graph using a backward pass using a known search optimization process
  • FIG. 6 shows the search network of FIG. 3 after collapsing of the affixes
  • FIG. 7 shows a rescore graph generated during the optimized backward pass.
  • a conventional known spoken language processing system 10 for continuous speech recognition is represented by the schematic diagram shown in FIG. 1, which comprises an input means 12 for receiving spoken input, a CSR component 14 for performing a search and word match outputting an N-best word sequence to an NLU component 16, providing output to a dialogue manager 26.
  • Another part of the language knowledge comprises semantic information 24, which is fed to the NLU component 16.
  • language knowledge 20 comprises separated parts for use by separate components of the systems: the grammar and statistical information 22 used by the CSR, and the semantic information 24 used by the NLU.
  • FIG. 2 represents schematically a spoken language system 100 comprising a CSR 120 and an NLU component 130.
  • Input means 110 receives spoken input in the form of a sentence which is passed to the CSR 120.
  • Acoustic phonetic information in the form of an acoustic model represented by element 140, and language knowledge 150 comprising grammar and statistical information 160 are fed to the CSR 120 in a conventional manner, typically to constrain the search space of the recognizer.
  • the system 100 is distinguished from known systems, such as that exemplified in FIG. 1, in that the language knowledge 150 comprising semantic information 170 is fed not only to the NLU 130, in a conventional manner, but also to the CSR 120.
  • the linkage 152 between the semantic information 170 and the CSR component 120 is represented by a heavy arrow.
  • the acoustic phonetic knowledge 140 is provided, as is conventional, in the form of Hidden Markov Models (HMM) describing the acoustic space.
  • the search is optimized to take advantage of available semantic information 170.
  • Each word in the vocabulary has its dedicated acoustic network
  • the search network branches all have zero weight.
  • the optimized CSR search is based on a known four-pass process as follows:
  • the first two passes, known as the fast match, prune the search space into a compact representation of a limited number of sentence hypotheses known as a word graph.
  • the last two passes, known as rescoring, perform a more detailed search of the word graph produced by the fast match to output the most likely word hypothesis.
  • the fast match search occurs in two passes.
  • forward scores are computed for each word-ending node of the search graph. These forward scores measure, for each word w in the graph and each time t, the likelihood of the best path which starts at time 0 and ends at the last node of w just before time t.
  • the path information is not preserved.
  • the task of the backward pass is to recover this path information by backtracking through the most likely word hypothesis. In doing so, the backward pass is able to construct a word graph to be used later during the rescoring phase.
  • FIG. 3 shows an example of a search network for a simple prefix-core-suffix type of regular grammar.
  • the search network consists of a collection of network nodes and branches. These are depicted in FIG. 3 as solid circles and arrows, respectively.
  • the hollow arrows and circles represent the acoustic networks for the words to be recognized.
  • Each of the branches on an acoustic network is in fact an HMM, with its own collection of branches and nodes. Dashed arrows represent null branches in the acoustic network.
  • the vocabulary consists of two prefix words, five core words and two suffix words.
  • score vectors containing the likelihood of the best path starting at time 0 and ending in the last state of each word w, for all times t are computed. This process is depicted in FIG. 4. The arrow below the score vector indicates that this is a forward score vector.
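A forward score vector of the kind described above can be sketched with a small log-domain Viterbi recursion. This is an illustration under simplifying assumptions, not the scoring method of the referenced applications: the word is modelled as a strictly left-to-right HMM, the transition log-probabilities `self_loop` and `advance` are shared by all states, and the function name and all numeric values are hypothetical.

```python
import math

NEG_INF = -math.inf


def forward_score_vector(entry, emissions, self_loop, advance):
    """Return exit_scores[t]: log-score of the best path ending in the word's
    last state after consuming frame t.

    entry[t]: log-score of the best path that may enter the word's first
    state at frame t (NEG_INF if the word cannot start there).
    emissions[t][s]: log-likelihood of frame t in state s.
    """
    num_states = len(emissions[0])
    prev = [NEG_INF] * num_states
    exit_scores = []
    for t, frame in enumerate(emissions):
        cur = [NEG_INF] * num_states
        for s in range(num_states):
            stay = prev[s] + self_loop
            # State 0 is entered from the preceding network node.
            move = prev[s - 1] + advance if s > 0 else entry[t]
            cur[s] = max(stay, move) + frame[s]
        exit_scores.append(cur[-1])
        prev = cur
    return exit_scores


# Hypothetical two-state word scored over three frames, starting at time 0.
entry = [0.0, NEG_INF, NEG_INF]
emissions = [[-1.0, -1.0]] * 3
exit_scores = forward_score_vector(entry, emissions, self_loop=-1.0, advance=-1.0)
```

Only the scores are kept for each time, not the sequence of states that produced them, which is consistent with the remark that path information is not preserved during the forward pass.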
  • the starting point of the backward pass is the last (right-most) network node of the search network.
  • a backward initial score buffer is initialized to the values (−∞, . . . , 0)
  • the operation is in the log-probability domain, so −∞ refers to the most unlikely event and 0 refers to the most likely event.
  • the value at time T is initialized to 0 because it is known for sure that the utterance must end at time T.
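The initialization just described can be written down directly. A minimal sketch, assuming the buffer holds one log-domain entry for each time 0 through T; the function name and the choice of a plain Python list are illustrative only.

```python
import math


def initial_backward_scores(num_frames):
    """Return the log-domain buffer for times 0..T with T = num_frames.

    Every entry is -inf (impossible) except the entry at time T, which is
    0.0 (certain), because the utterance is known to end at time T.
    """
    scores = [-math.inf] * (num_frames + 1)
    scores[num_frames] = 0.0
    return scores


backward_init = initial_backward_scores(3)
```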
  • All semantically null words which originate (directly or indirectly) from the same search network node and which merge (indirectly) to the same node are collapsed into a unique single-input single-output acoustic network.
  • All prefix words originate indirectly from node 0 and merge indirectly at node 5, so these words may be collapsed into a single acoustic network with a single input and a single output.
  • the suffix words may be collapsed into a single acoustic network, since they all originate from node 16 and merge at node 21.
  • FIG. 6 shows the search network of FIG. 3 when the affixes are collapsed, with the new node labeling.
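The collapsing step can be sketched as a grouping operation on the search network. The representation below is a deliberate simplification: each word is reduced to a single `(origin_node, merge_node, word)` triple, so the intermediate nodes of FIG. 3 and the merging of the underlying HMM branches are not modelled. The node numbers 0, 5, 16 and 21 come from the text above; the words themselves and the function name are hypothetical.

```python
from collections import defaultdict


def collapse_null_words(branches, null_words):
    """Merge semantically null words sharing an origin and a merge node.

    branches: list of (origin_node, merge_node, word) triples.
    Returns the salient branches unchanged, followed by one collapsed
    branch per (origin, merge) pair whose label is a tuple of the null words.
    """
    groups = defaultdict(list)
    kept = []
    for origin, merge, word in branches:
        if word in null_words:
            groups[(origin, merge)].append(word)
        else:
            kept.append((origin, merge, word))
    collapsed = [(o, m, tuple(sorted(ws))) for (o, m), ws in groups.items()]
    return kept + collapsed


# Two prefix words, five core words, two suffix words (hypothetical vocabulary).
branches = (
    [(0, 5, w) for w in ("please", "uh")]
    + [(5, 16, w) for w in ("one", "two", "three", "four", "five")]
    + [(16, 21, w) for w in ("thanks", "now")]
)
network = collapse_null_words(branches, null_words={"please", "uh", "thanks", "now"})
```

After collapsing, the nine word branches become seven: the five core words plus one single-input single-output network for the prefix and one for the suffix.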
  • a stop node is a special type of network node that signals the search algorithm to stop the Viterbi scoring along the path it is currently following.
  • the forward stop nodes are used during the forward pass of the search and signal the search to stop the forward scoring.
  • the backward stop nodes signal the search to stop the backward scoring.
  • the position of these stop nodes is uniquely determined by the layout of the search network and the position of the collapsed networks (hence the semantically null words).
  • the forward stop nodes are located at the end nodes of the right-most (i.e. closest to the network's end node) set of non-semantically null words (i.e. semantically meaningful words) that are connected to a semantically-null acoustic network.
  • the backward stop nodes are located at the end nodes of the left-most (i.e. closest to the network's start node) set of non-semantically null words that are connected to a semantically null acoustic network.
  • the search network of FIG. 6 may be used to locate stop nodes, starting with the forward stop nodes.
  • the right-most set of non-semantically null words happen to be the core words, because they are connected to the suffix (a collapsed acoustic network) and no other salient words occur past the suffix.
  • nodes 7, 8, 9, 10 and 11 are all forward stop nodes.
  • the core is also the left-most set of non-semantically null words, since it is connected to the prefix (a collapsed network) and no other salient words occur before the prefix. So in this case, the same nodes, 7, 8, 9, 10 and 11, are also backward stop nodes.
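The placement rule for stop nodes can be sketched for the simple case of a grammar that is a left-to-right chain of word sets. This is an assumption made for illustration; a general search network would need a graph traversal. Each stage is a hypothetical dictionary saying whether it is a collapsed semantically null network and listing the end nodes of its words. The end nodes 7 to 11 are taken from the example above.

```python
def find_stop_nodes(stages):
    """Return (forward_stop_nodes, backward_stop_nodes) for a chain of stages.

    stages: left-to-right list of {"null": bool, "end_nodes": [...]}.
    Forward stop nodes are the end nodes of the right-most salient stage
    adjacent to a null stage; backward stop nodes are those of the left-most.
    """
    def touches_null(i):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(stages)]
        return any(stages[j]["null"] for j in neighbours)

    candidates = [
        i for i, stage in enumerate(stages)
        if not stage["null"] and touches_null(i)
    ]
    if not candidates:
        return [], []
    forward_stops = list(stages[candidates[-1]]["end_nodes"])
    backward_stops = list(stages[candidates[0]]["end_nodes"])
    return forward_stops, backward_stops


# Prefix-core-suffix grammar: end nodes of the null stages are not needed here.
grammar = [
    {"null": True, "end_nodes": []},
    {"null": False, "end_nodes": [7, 8, 9, 10, 11]},
    {"null": True, "end_nodes": []},
]
forward_stops, backward_stops = find_stop_nodes(grammar)
```

With a single core between a collapsed prefix and a collapsed suffix, the same five nodes come out as both forward and backward stop nodes, matching the example.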
  • the first savings occur during the forward pass, when the prefix network is traversed. Because all words of the prefix were collapsed into a unique single-input single-output network, the resulting number of acoustic network branches is potentially much smaller. Note, however, that even without the proposed optimizations, it would have been possible to collapse the search network from the entry point, thus generating a tree instead of a graph. So the actual savings are the reduction in branches from a tree to a single-input single-output graph, which may or may not be significant, depending on the size of the prefix.
  • the forward pass then continues by generating the forward score vectors for nodes 1 through 11 . However, the forward processing stops there, since nodes 7 through 11 are forward stop nodes. This means that the score vector “max-out” at node 12 will not take place, and neither will the scoring of the suffix network. At this point, the forward pass is completed.
  • the backward pass then takes over by first reverse-scoring the collapsed suffix acoustic network. Because the suffix network was collapsed, scoring all suffix words occurs simultaneously.
  • the backward pass described above actually scores words on a “need-to” basis.
  • the backward pass extends paths with the highest total likelihood first. Hence alternate suffix words will be scored only if they belong to a path with a high total likelihood. So the backward scoring of the suffix network may end up being more costly than individual scoring of suffix words on a “need-to” basis.
  • the backward pass meets the reverse suffix score vector with the forward score vectors of nodes 7 through 11 .
  • the word that yields the best total likelihood would be chosen for backward scoring. But because this node is a backward stop node, the backward scoring does not take place. Instead, the word is still backtracked, but only to construct the rescore graph properly. Depending on the layout of the search network, this saving can be considerable. Note that most of the time spent during the backward pass is for back-scoring networks.
  • the rescoring algorithm is very similar to the fast match algorithm previously described. It contains a forward pass to compute the forward score vectors at each word-ending node and a backward pass to decode the list of choices, just as described above. The most notable differences from the fast match pass are that in rescoring:
  • the network does not contain any loops, so a block algorithm may be used;
  • the whole utterance is available, so the block may be set to the entire utterance;
  • FIG. 7 shows the optimized rescore graph.
  • Constrained window Viterbi scoring can only be used to a limited extent with the proposed optimizations.
  • Constrained window Viterbi scoring occurs when scoring is constrained to a fixed time window determined (approximately) by the word segmentation provided by the fast match pass. Since not all word segmentations are produced with the optimized backward pass of the fast match, the rescoring algorithm may be forced to score some words over a larger window than it should.
  • the extent to which this is a problem is highly dependent on the mean word durations of non-semantically null words with respect to semantically null words. In other words, the shorter the semantically null words are with respect to the non-semantically null words, the smaller the penalty.
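The penalty discussed above can be illustrated with a small helper that chooses the frames over which a word is rescored. This is a sketch under assumptions: the `margin` parameter, the half-open frame range and the fallback to the whole utterance when no segmentation is available are all hypothetical choices, not details taken from the referenced rescoring method.

```python
def scoring_window(segmentation, margin, num_frames):
    """Return the (start, end) frame range over which a word is rescored.

    segmentation: (start, end) frames from the fast match, or None when the
    optimized backward pass did not produce a segmentation for this word.
    """
    if segmentation is None:
        # No segmentation available: the word is scored over the whole utterance.
        return 0, num_frames
    seg_start, seg_end = segmentation
    return max(0, seg_start - margin), min(num_frames, seg_end + margin)


constrained = scoring_window((40, 65), margin=5, num_frames=100)
unconstrained = scoring_window(None, margin=5, num_frames=100)
```

In this example the constrained window covers 35 frames while the fallback covers all 100, which is the extra cost incurred for words whose segmentation was skipped.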
  • FIG. 5 shows a word graph representing the true N-best list.
  • a reduction in the amount of computations required to perform the search in continuous speech recognition is achieved by incorporating semantic information into the recognizer.
  • Search optimizations involve collapsing each list of semantically null words into a unique single-input single-output acoustic network, and identifying stop nodes in the acoustic network.
  • Time synchronous processing time, occurring while the utterance is being spoken, is reduced by computing only a subset of the search space.
  • the amount of delay after a person finishes speaking before the recognized word string is returned by the application is reduced.
  • the processing time for the backward pass of the search is reduced, by up to a factor of ten in some cases.
  • the post processing delay is also reduced during the rescoring pass since a more compact list of choices needs to be rescored.
  • a single generic continuous speech recognizer may be used for all types of tasks, including those that may be optimised by incorporating semantic information at the recognizer level.
  • searching may alternatively be based on beam searches and lexical trees to provide benefits of those methods in addition to benefits of the method described above.

Abstract

A system and method for continuous speech recognition (CSR) is optimized to reduce processing time for connected word grammars bounded by semantically null words. The savings, which reduce processing time both during the forward and the backward passes of the search, as well as during rescoring, are achieved by performing only the minimal amount of computation required to produce an exact N-best list of semantically meaningful words (N-best list of salient words). This departs from the standard Spoken Language System modeling, in which any notion of meaning is handled by the Natural Language Understanding (NLU) component. By expanding the task of the recognizer component from a simple acoustic match to allow semantic information to be fed to the recognizer, significant processing time savings are achieved, making it possible to run an increased number of speech recognition channels in parallel for improved performance, which may enhance users' perception of value and quality of service.

Description

    RELATED APPLICATIONS
  • This application is related to U.S. patent application Ser. No. 08/997,824 to Stubley et al. entitled “Order of matching observations to state models”, filed Dec. 24, 1997; U.S. patent application Ser. No. 09/118,621 to Stubley et al. entitled “Block algorithm for pattern recognition”, filed Jul. 21, 1998; and U.S. patent application Ser. No. 08/934,736 to Robillard et al. entitled “Search and rescoring method for a speech recognition system”, filed Sep. 22, 1997, which are incorporated herein by reference. [0001]
  • FIELD OF THE INVENTION
  • This invention relates to a system and method for optimization of searching for continuous speech recognition. [0002]
  • BACKGROUND OF THE INVENTION
  • Speech recognition for applications such as automated directory enquiry assistance and control of operation based on speech input requires a real time response. Spoken input must be recognized within about half a second of the end of the spoken input to simulate the response of a human operator and avoid a perception of unnatural delay. [0003]
  • Processing of speech input falls into five main steps: audio channel adaptation, feature extraction, word end point detection, speech recognition, and accept/reject decision logic. Pattern recognition generally, and more particularly recognition of patterns in continuous signals such as speech signals, requires complex calculations and is dependent on providing sufficient processing power to meet the computational load. Thus the speech recognition step is the most computationally intensive step of the process. [0004]
  • The computational load is dependent on the number of words or other elements of speech, which are modelled and held in a dictionary, for comparison to the spoken input (i.e. the size of vocabulary of the system); the complexity of the models in the dictionary; how the speech input is processed into a representation ready for comparison to the models; and the algorithm used for carrying out the comparison process. Numerous attempts have been made to improve the trade off between computational load, accuracy of recognition and speed of recognition. [0005]
  • Examples are described, e.g., in U.S. Pat. No. 5,390,278 to Gupta et al., and U.S. Pat. No. 5,515,475 to Gupta et al. Many other background references are included in the above referenced copending applications. [0006]
  • In order to provide speech recognition which works efficiently in real time, two approaches are generally considered. The first is to make use of specialized hardware or parallel processing architectures. The second is to develop optimized search methods based on search algorithms that yield reasonable accuracies, but at a fraction of the cost of more optimal architectures. The latter approach is favored by many researchers, since it tackles the problem at the source, see for example, Schwartz, R., Nguyen, L., Makhoul, J., “Multiple-pass search strategies”, in Automatic Speech and Speaker Recognition, Lee, C. H., Soong, F. K., Paliwal, K. K. (eds.), Kluwer Academic Publishers (1996), pp 429-456. This approach is appealing since the hardware and algorithmic optimizations are often orthogonal, so the latter can always be built on top of the former. [0007]
  • The basic components of a spoken language processing (SLP) system include a continuous speech recognizer (CSR) for receiving spoken input from the user and a Natural Language Understanding component (NLU), represented schematically in FIG. 1. A conventional system operates as follows. Speech input is received by the CSR, and a search is performed by the CSR using acoustic models that model speech sounds, and a language model or ‘grammar’ that describes how words may be connected together. The acoustic model is typically in the form of Hidden Markov Models (HMM) describing the acoustic space. The language knowledge is usually used for both the CSR component and the NLU component, as shown in FIG. 1, with information on grammar and/or statistical models being used by the CSR, and semantic information being used by the NLU. The structure of the language is often used to constrain the search space of the recognizer. If the goal is to recognize unconstrained speech, the language knowledge usually takes the form of a statistical language model (bigram or trigram). If the goal is to recognize a specific constrained vocabulary, then the language knowledge takes the form of a regular grammar. [0008]
  • The search passes the recognized word strings representing several likely choices, in the form of a graph, to the natural language understanding component for extracting meaning from the recognized word strings. The language model provides knowledge to the NLU relating to understanding of the recognized word strings. More particularly the semantic information from the language knowledge is fed exclusively to the NLU component with information on how to construct a meaning representation of the CSR's output. This involves, among other things, identifying which words are important to the meaning and which are not. The latter are referred to as non-keywords or semantically-null words. Thus semantically-meaningful words and semantically-null words are identified to provide understanding of the input, and in the process, the word strings are converted to a standard logical form. The logical form is passed to a discourse manager DM, which is the interface between the user and the application. The DM gathers the necessary information from the user to request the applications to perform the user's goal by prompting the user for input. [0009]
  • While the terms ‘grammar’ and ‘language model’ are often used interchangeably, in this application, a language model is defined as the graph that is used by the CSR search algorithm to perform recognition. A grammar is a set of rules, which may also be represented as a graph, used by the NLU component to extract meaning from the recognized speech. There may be a one to one mapping between the language model and the grammar in the case where the language model is a constrained model. Connected Word Recognition (CWR) is an example of the latter. Nevertheless, known spoken language systems described above separate language knowledge into grammar and semantic information, and feed the former to the CSR and feed the latter to the NLU. [0010]
  • Most search optimization techniques involve reducing computation by making use of local scores during the decoding of a speech utterance. Copending U.S. application Ser. No. 09/118,621 entitled “Block algorithm for pattern recognition”, referenced above describes in detail an example of a search algorithm and scoring method. [0011]
  • For example, the Viterbi beam search, without a doubt the most widely used optimization, prunes the paths whose scores (likelihoods) are outside a beam determined by the best local score. Some neural-network based approaches threshold the posterior probabilities of each state to determine if it should remain active (Bourlard, H. Morgan, N., “Connectionist Speech Recognition—A Hybrid Approach”, Kluwer Academic Press, 1994.) [0012]
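The beam pruning rule mentioned above can be illustrated with a minimal sketch. The function name `beam_prune`, the dictionary representation of active paths and the example scores are assumptions made for illustration only; they are not taken from the cited references.

```python
def beam_prune(scores, beam_width):
    """Keep only paths whose log-likelihood lies within beam_width of the best local score."""
    best = max(scores.values())
    return {state: s for state, s in scores.items() if s >= best - beam_width}


# Hypothetical active paths at one time frame (log-likelihoods).
active = {"s1": -10.0, "s2": -12.5, "s3": -40.0}
survivors = beam_prune(active, beam_width=5.0)
print(survivors)
```

A wider beam keeps more paths alive and costs more computation; a narrower beam is cheaper but risks pruning the path that would eventually have been the best one.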
  • Another important technique that helped reduce the computation burden was the use of lexical trees instead of dedicated acoustic networks as described by Ney, H., Aubert, X., “Dynamic Programming Search Strategies: From Digit Strings to Large Vocabulary Word Graphs”, in Automatic Speech and Speaker Recognition, Lee, C. H., Soong, F. K., Paliwal, K. K. (eds.), Kluwer Academic Publishers (1996), pp 385-411. Along with that idea came language model look-ahead techniques to enhance the pruning described by Murveit, H., Monaco, P., Digalakis, V., Butzberger, J., “Techniques to Achieve an Accurate Real-Time Large-Vocabulary Speech Recognition System”, in ARPA Workshop on Human Language Technology, pp 368-373. [0013]
  • While these techniques are undisputedly effective at solving these specific problems, in all cases, the sole sources of “language knowledge” used to reduce the search space are the language model and the grammar layout; semantic information is not used by the CSR. [0014]
  • Word spotting techniques are an attempt to indirectly use semantic information by focusing the recognizer on the list of keywords (or key phrases) that are semantically meaningful. Some word spotting techniques use background models of speech in an attempt to capture every word that is not in the word spotter's dictionary, including semantically null words (non-keywords) (Rohlicek, J. R., Russel, W., Roukos, S., Gish, H., “Word Spotting”, ICASSP 1989, pp 627-630). [0015]
  • While word spotting is generic, it is very costly and provides poor accuracy, especially when there is prior knowledge of which non-keywords are likely to be used. Because these latter models are so broad, they do not always efficiently model non-keywords which are likely to occur in an utterance (for example, hesitations, and polite formulations). [0016]
  • To overcome the low accuracy problems encountered in word spotting, Large Vocabulary Continuous Speech Recognizers (LVCSR) are used in the hope that any semantically null word will exist in the recognizer's vocabulary (Weintraub, M., “LVCSR Log-Likelihood Ratio Scoring For Keyword Spotting”, ICASSP 1995, Vol 1, pp 297-300). The output of the recognizer in this case is a string of keywords and non-keywords that is later processed by an NLU module to extract meaning. Language knowledge is separated into grammar and statistical information which are used by the CSR, and semantic information that is used by the NLU. [0017]
  • In all these approaches, the CSR simply outputs a string of keywords and non-keywords for further processing using semantic information: it does not make use of semantic information during the search. Consequently there is a need for further optimization of continuous speech recognizers. [0018]
  • SUMMARY OF THE INVENTION
  • Thus, the present invention seeks to provide a system and method for optimization of searching for continuous speech recognizers which overcomes or avoids the above mentioned problems. [0019]
  • Therefore, according to a first aspect of the present invention there is provided a method for continuous speech recognition comprising: incorporating semantic information during searching by a continuous speech recognizer. [0020]
  • Beneficially, incorporating semantic information during searching comprises searching using semantic information to identify semantically-null words and thereby generate an N-best list of salient words, instead of an N-best list of both salient and semantically null words. [0021]
  • The savings, which reduce processing time both during the forward and the backward passes of the search, as well as during rescoring, are achieved by performing only the minimal amount of computation required to produce an exact N-best list of semantically meaningful words (N-best list of salient words). This departs from the standard Spoken Language System modeling in which any notion of meaning is handled by the Natural Language Understanding (NLU) component. By expanding the task of the recognizer component from a simple acoustic match to allow semantic information to be fed to the recognizer, significant processing time savings are achieved. Thus, for example, it is possible to run an increased number of speech recognition channels in parallel for improved performance, which may enhance users' perception of value and quality of service. [0022]
  • According to another aspect of the present invention, there is provided a method for continuous speech recognition comprising: providing speech input to a continuous speech recognizer; providing to the continuous speech recognizer an acoustic model comprising a set of Hidden Markov Models, and a language model comprising both grammar and semantic information; performing recognition of speech input using semantic information to eliminate semantically null words from the N-best list of words and restrict searching to an N-best list of salient words; and performing word matching to output from the speech recognizer the N-best salient word sequences. [0023]
  • Advantageously, the step of performing recognition comprises: detecting connected word grammars bounded by semantically null words; collapsing each list of semantically null words into a unique single-input single-output acoustic network; and identifying stop nodes in the acoustic network. [0024]
  • Thus, during a forward pass of a search, forward stop nodes are detected, signalling the search to stop forward scoring along a path currently being followed, and during a backward pass of the search backward stop nodes are detected, signalling the search to stop backward scoring along a path currently being followed. Then, for example, right-most semantically null networks are not computed, and some semantically salient words are not backward-scored. Thus an N-best list of only salient words is rescored instead of a true N-best list. [0025]
  • Advantageously, scoring comprises Viterbi scoring or other known methods. The method above may be combined with other techniques to save processing time. For example, searching may alternatively be based on beam searches and lexical trees to provide benefits of those methods in addition to benefits of the method above. [0026]
  • According to another aspect of the invention there is provided software on a machine readable medium for performing a method of continuous speech recognition comprising: incorporating semantic information during searching by a continuous speech recognizer. [0027]
  • Preferably, the method comprises searching using semantic information to identify semantically-null words and thereby generate a list of N-best salient words. [0028]
  • Yet another aspect of the invention provides software on a machine readable medium for performing a method for continuous speech recognition comprising: providing speech input to a continuous speech recognizer; providing to the continuous speech recognizer an acoustic model comprising a set of Hidden Markov Models, and a language model comprising both grammar and semantic information; performing recognition of speech input using semantic information to eliminate semantically null words from the N-best list of words and restrict searching to an N-best list of salient words. [0029]
  • Another aspect of the invention provides a system for continuous speech recognition comprising: [0030]
  • means for incorporating semantic information during searching by a continuous speech recognizer; input means for providing speech input to the continuous speech recognizer; means for providing to the continuous speech recognizer an acoustic model comprising a set of Hidden Markov Models, and a language model comprising both grammar and semantic information; the continuous speech recognizer comprises means for performing recognition of speech input using the semantic information for eliminating semantically null words from the N-best list of words and thereby restricting searching to an N-best list of salient words, and performing word matching to output the N-best salient word sequences. [0031]
  • According to a further aspect of the present invention there is provided a spoken language processing system for speech recognition comprising: a continuous speech recognition component (CSR); a natural language understanding component (NLU); means for providing speech input to the CSR; means for providing acoustic-phonetic knowledge to the CSR comprising a set of Hidden Markov Models; means for providing language knowledge comprising grammar and statistical models to the CSR, and means for providing semantic knowledge to the NLU, and means for providing semantic knowledge to the CSR; the CSR being operable for searching using the semantic knowledge to constrain the search to an N-best list of salient words, and perform word matching to output an N-best list of salient words to the NLU for interpretation of meaning. [0032]
  • Another aspect of the present invention provides a method for continuous speech recognition using a spoken language system comprising a continuous speech recognition component (CSR) linked to a natural language understanding component (NLU), the method comprising: providing speech input to the CSR; providing acoustic-phonetic knowledge to the CSR comprising a set of Hidden Markov Models; providing language knowledge comprising grammar and statistical models to the CSR; providing semantic knowledge to the CSR; performing searching with the CSR using the semantic knowledge to constrain the search to an N-best list of salient words comprising semantically meaningful words of the N-best list of words; and, performing word matching to output the N-best salient word sequences to the NLU. [0033]
  • The method and system described above may be combined with other techniques to save processing time. For example, searching may alternatively be based on beam searches and lexical trees to provide benefits of those methods in addition to benefits of the method described above. [0034]
  • Thus systems and methods are provided which allow considerable savings in computation time, so that more complex speech applications may be implemented on smaller and older platforms. Thus existing products with older processors may advantageously be upgraded to provide extended services. In newer products and processors, the number of simultaneous channels that can be supported is higher, reducing the cost of deploying services. Improved performance may enhance users' perception of value and quality of service. [0035]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The invention will now be described in greater detail with reference to the attached drawings wherein: [0036]
  • FIG. 1 shows a known prior art spoken language processing system comprising a continuous speech recognition component (CSR) and a natural language understanding component (NLU); [0037]
  • FIG. 2 shows a spoken language processing system comprising a continuous speech recognizer for search optimization according to a first embodiment of the present invention; [0038]
  • FIG. 3 shows an example of a search network for a prefix-core-suffix regular grammar; [0039]
  • FIG. 4 represents forward scoring of the search network; [0040]
  • FIG. 5 shows an example of a word graph using a backward pass using a known search optimization process; [0041]
  • FIG. 6 shows the search network of FIG. 3 after collapsing of the affixes; [0042]
  • FIG. 7 shows a rescore graph generated during the optimized backward pass. [0043]
  • DETAILED DESCRIPTION OF THE EMBODIMENTS
  • A conventional known spoken language processing system 10 for continuous speech recognition is represented by the schematic diagram shown in FIG. 1, which comprises an input means 12 for receiving spoken input, and a CSR component 14 for performing a search and word match and outputting an N-best word sequence to an NLU component 16, which provides output to a dialogue manager 26. Acoustic phonetic information in the form of an acoustic model, represented by element 18, is fed to the CSR. Language knowledge is represented by element 20, part of which, comprising grammar and statistical information 22, is fed to the CSR component 14 in a conventional manner, typically to constrain the search space of the recognizer. Another part of the language knowledge comprises semantic information 24, which is fed to the NLU component 16. Thus language knowledge 20 comprises separated parts for use by separate components of the system: the grammar and statistical information 22 used by the CSR, and the semantic information 24 used by the NLU. [0044]
  • A system and method for continuous speech recognition according to a first embodiment of the present invention is described with reference to FIG. 2, which represents schematically a spoken language system 100 comprising a CSR 120 and an NLU component 130. Input means 110 receives spoken input in the form of a sentence which is passed to the CSR 120. Acoustic phonetic information in the form of an acoustic model represented by element 140, and language knowledge 150 comprising grammar and statistical information 160, are fed to the CSR 120 in a conventional manner, typically to constrain the search space of the recognizer. The system 100 is distinguished from known systems, such as that exemplified in FIG. 1, in that the semantic information 170 of the language knowledge 150 is fed not only to the NLU 130, in a conventional manner, but also to the CSR 120. The linkage 152 between the semantic information 170 and the CSR component 120 is represented by a heavy arrow. Thus when speech input in the form of a speech utterance comprising a series of words or a sentence is received by the CSR, a search is performed. The acoustic phonetic knowledge 140 is provided, as is conventional, in the form of Hidden Markov Models (HMM) describing the acoustic space. In addition, the search is optimized to take advantage of available semantic information 170. [0045]
  • In the following description, the following simplifying assumptions are made for the sake of clarity: [0046]
  • Each word in the vocabulary has its dedicated acoustic network; [0047]
  • The search network branches all have zero weight. [0048]
  • These simplifying assumptions do not in any way reflect limitations of the proposed optimization and are merely made for the sake of clarity. [0049]
  • The optimized CSR search is based on a known four-pass process as follows: [0050]
  • The first two passes, known as the fast match, prune the search space into a compact representation of a limited number of sentence hypotheses known as a word graph. The last two passes, known as rescoring, perform a more detailed search of the word graph produced by the fast match to output the most likely word hypothesis. [0051]
  • The fast match search occurs in two passes. During the first pass, forward scores are computed for each word-ending node of the search graph. These forward scores measure, for each word w in the graph, the likelihood of the best path which starts at time 0 and ends at the last node of w just before time t. During the forward pass, the path information is not preserved. The task of the backward pass is to recover this path information by backtracking through the most likely word hypotheses. In doing so, the backward pass is able to construct a word graph to be used later during the rescoring phase. [0052]
  • FIG. 3 shows an example of a search network for a simple prefix-core-suffix type of regular grammar. The search network consists of a collection of network nodes and branches. These are depicted in FIG. 3 as solid circles and arrows, respectively. The hollow arrows and circles represent the acoustic networks for the words to be recognized. Each of the branches on an acoustic network is in fact an HMM, with its own collection of branches and nodes. Dashed arrows represent null branches in the acoustic network. In this example, the vocabulary consists of two prefix words, five core words and two suffix words. [0053]
  • Forward Pass [0054]
  • During the forward pass of the fast match, score vectors containing the likelihood of the best path starting at time 0 and ending in the last state of each word w, for all times t, are computed. This process is depicted in FIG. 4. The arrow below the score vector indicates that this is a forward score vector. [0055]
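As a rough illustration of how such a forward score vector could be computed for one word, the sketch below runs a Viterbi recursion over a simple left-to-right HMM in the log domain. The function name, the argument layout, and the use of a single self-loop and a single advance transition score shared by all states are simplifying assumptions for illustration; the actual acoustic networks and the block algorithm of the copending application are not reproduced here.

```python
NEG_INF = float("-inf")


def forward_score_vector(entry_scores, emit_logp, self_logp, next_logp):
    """Viterbi forward pass through a left-to-right HMM with S states.

    entry_scores[t]: log-score of the best path arriving at the word start at frame t.
    emit_logp[t][s]: log-likelihood of frame t under state s.
    Returns a list whose entry t is the log-score of the best path that
    occupies the last state of the word at frame t.
    """
    num_frames = len(emit_logp)
    num_states = len(emit_logp[0])
    prev = [NEG_INF] * num_states
    exit_scores = []
    for t in range(num_frames):
        cur = [NEG_INF] * num_states
        for s in range(num_states):
            stay = prev[s] + self_logp
            # The first state is entered from the word start, later states from state s - 1.
            move = entry_scores[t] if s == 0 else prev[s - 1] + next_logp
            cur[s] = max(stay, move) + emit_logp[t][s]
        exit_scores.append(cur[num_states - 1])
        prev = cur
    return exit_scores
```

Only the score of the best path is kept at each frame; no back-pointers are stored, which matches the statement that path information is not preserved during the forward pass.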
  • Backward Pass [0056]
  • During the forward pass, path information is not saved. The purpose of the backward pass is to recover this path information for the N-best choices required. It uses a priority queue to keep track of the partial choices that are being extended. [0057]
  • The starting point of the backward pass is the last (right-most) network node of the search network. A backward initial score buffer is initialized to the values (−∞, . . . , 0). The operation is in the log-probability domain, so −∞ refers to the most unlikely event and 0 refers to the most likely event. The value at time T is initialized to 0 because it is known for sure that the utterance must end at time T. [0058]
  • The rest of the backward pass algorithm is as follows (each step is described below): [0059]
  • 1. pull the next entry from the priority queue; [0060]
  • 2. extend the word for this entry by back-scoring its acoustic network with the Viterbi algorithm; [0061]
  • 3. find all word-ending nodes connected to the word-starting node of the extended word; [0062]
  • 4. for all these word-ending nodes, meet the forward score vector with the backward score vector to determine the best meeting time; [0063]
  • 5. return to step 1 until the queue is empty or the number N of desired choices has been reached. [0064]
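Some of the building blocks of the backward pass can be sketched as follows: the initial backward score buffer, the "meeting" of a forward and a backward score vector, and a priority queue that returns partial choices best-first. All function names and data layouts are hypothetical, and the Viterbi back-scoring of the acoustic network is deliberately omitted.

```python
import heapq

NEG_INF = float("-inf")


def initial_backward_scores(num_frames):
    """Backward buffer (-inf, ..., 0): the utterance is known to end at the final time T."""
    return [NEG_INF] * num_frames + [0.0]


def best_meeting_time(forward, backward):
    """Return (time, total log-score) maximizing forward[t] + backward[t]."""
    totals = [f + b for f, b in zip(forward, backward)]
    t_best = max(range(len(totals)), key=totals.__getitem__)
    return t_best, totals[t_best]


def n_best_order(candidates, n):
    """Pop partial choices in order of decreasing total log-score until n are found."""
    # heapq is a min-heap, so scores are negated to pop the most likely entry first.
    queue = [(-score, word) for word, score in candidates.items()]
    heapq.heapify(queue)
    results = []
    while queue and len(results) < n:
        neg_score, word = heapq.heappop(queue)
        results.append((word, -neg_score))
    return results
```

Because scores are log-probabilities, meeting the two vectors is a simple addition followed by a maximization over time.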
  • This algorithm treats each word with equal salience, that is, each word is considered important in determining the meaning of the utterance. [0065]
  • In practice, some words are more salient than others. Consider the prefix-core-suffix grammar depicted in FIG. 3. This grammar essentially acts as a (limited) word spotter, where each word in the core list may be preceded by any prefix word, and succeeded by any suffix word. In this particular case, which affix is actually used is completely irrelevant to determining the meaning of the utterance: only the core entry is needed. Yet the word lattice produced by the backward pass described above will give a detailed segmentation of each N-best choice, which may look something like FIG. 5. [0066]
  • On the other hand, when the fact that the affixes are semantically null is used, that is, they bring nothing to the meaning of the utterance, substantial savings may be achieved. [0067]
  • The key to those savings is that instead of producing an N-best list of complete choices, we produce an N-best list consisting of only non-semantically null words, i.e. an N-best list of salient words. In our prefix-core-suffix example, this would be a list of only core entries. To achieve this, the search network is modified in at least two respects, which are described below. The optimizations work together to reduce search time. [0068]
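To make the difference between a true N-best list and an N-best list of salient words concrete, the sketch below derives the latter from the former by post-filtering. This is only an illustration of the target output: the optimization described here obtains the salient list during the search itself, without first computing the full list. The words and scores are invented for the example.

```python
def salient_n_best(hypotheses, null_words, n):
    """Reduce scored word sequences to their salient words, keeping the best score of each."""
    best = {}
    for words, score in hypotheses:
        key = tuple(w for w in words if w not in null_words)
        if key not in best or score > best[key]:
            best[key] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical prefix-core-suffix hypotheses with log-scores.
true_n_best = [
    (("please", "call", "thanks"), -10.0),
    (("uh", "call", "thanks"), -11.0),
    (("please", "dial", "thanks"), -12.0),
]
print(salient_n_best(true_n_best, {"please", "uh", "thanks"}, n=2))
```

Two of the three full hypotheses differ only in their prefix, so they collapse to a single salient entry.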
  • Collapsing of Acoustic Networks for Semantically Null Words. [0069]
  • All semantically null words which originate (directly or indirectly) from the same search network node and which merge (indirectly) to the same node are collapsed into a unique single-input single-output acoustic network. As an example, refer to FIG. 3. All prefix words originate indirectly from node 0 and merge indirectly at node 5, so these words may be collapsed into a single acoustic network with a single input and a single output. Similarly, the suffix words may be collapsed into a single acoustic network, since they all originate from node 16 and merge at node 21. [0070]
  • The reason for this collapsing is two-fold. First, because the acoustic network has a single input and a single output, greater graph compression may be achieved since the phonetic similarities of the words may be exploited from both ends. Second, the output score vector resulting from the backtracking of the collapsed acoustic network will yield the scores of the best paths (for all times) through that network, regardless of which word was traversed. FIG. 6 shows the search network of FIG. 3 when the affixes are collapsed, with the new node labeling. [0071]
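The second property can be illustrated with a small sketch. Under Viterbi (maximum) scoring, and assuming the parallel word paths share the same input and output, the output score vector of the collapsed network at each time equals the element-wise maximum of the score vectors that the individual words would have produced. The function name and the example values are hypothetical.

```python
def collapsed_score_vector(word_score_vectors):
    """Element-wise maximum over per-word score vectors (log domain)."""
    return [max(column) for column in zip(*word_score_vectors)]


# Hypothetical exit scores of two suffix words over three time frames.
suffix_1 = [-3.0, -1.0, -4.0]
suffix_2 = [-2.0, -5.0, -0.5]
print(collapsed_score_vector([suffix_1, suffix_2]))
```

The identity of the winning word at each time is lost, which is acceptable precisely because these words are semantically null.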
  • Stop Nodes [0072]
  • The lion's share of the savings resulting from the proposed optimizations are due to the presence of stop nodes in the search network. A stop node is a special type of network node that signals the search algorithm to stop the Viterbi scoring along the path it is currently following. There are two types of stop nodes: forward and backward. The forward stop nodes are used during the forward pass of the search and signal the search to stop the forward scoring. Similarly the backward stop nodes signal the search to stop the backward scoring. [0073]
  • The position of these stop nodes is uniquely determined by the layout of the search network and the position of the collapsed networks (hence the semantically null words). The forward stop nodes are located at the end nodes of the right-most (i.e. closest to the network's end node) set of non-semantically null words (i.e. semantically meaningful words) that are connected to a semantically-null acoustic network. The backward stop nodes are located at the end nodes of the left-most (i.e. closest to the network's start node) set of non-semantically null words that are connected to a semantically null acoustic network. [0074]
  • As an example, the search network of FIG. 6 may be used to locate stop nodes, starting with the forward stop nodes. In this case, the right-most set of non-semantically null words happens to be the core words, because they are connected to the suffix (a collapsed acoustic network) and no other salient words occur past the suffix. So nodes 7, 8, 9, 10 and 11 are all forward stop nodes. The core is also the left-most set of non-semantically null words, since it is connected to the prefix (a collapsed network) and no other salient words occur before the prefix. So in this case, the same nodes, 7, 8, 9, 10 and 11, are also backward stop nodes. [0075]
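The placement rule can be sketched on a small graph. The node numbering below is a guess at a layout consistent with the description of FIG. 6 (collapsed prefix from node 0 to node 1, core words ending at nodes 7 through 11, collapsed suffix starting at node 12); the arc representation and function names are hypothetical, and "connected to" is approximated here by graph reachability.

```python
from collections import defaultdict

# Each arc is (source, target, kind); kind is "null_net", "salient" or "eps".
ARCS = [(0, 1, "null_net")]                            # collapsed prefix network
ARCS += [(1, n, "eps") for n in range(2, 7)]           # null branches to core word starts
ARCS += [(n, n + 5, "salient") for n in range(2, 7)]   # five core words ending at nodes 7-11
ARCS += [(n, 12, "eps") for n in range(7, 12)]         # null branches after the core
ARCS += [(12, 13, "null_net")]                         # collapsed suffix network


def reachable_kinds(start, arcs, reverse=False):
    """Collect the kinds of all arcs reachable from start, forward or backward."""
    adjacency = defaultdict(list)
    for src, dst, kind in arcs:
        if reverse:
            adjacency[dst].append((src, kind))
        else:
            adjacency[src].append((dst, kind))
    seen, kinds, stack = {start}, set(), [start]
    while stack:
        node = stack.pop()
        for nxt, kind in adjacency[node]:
            kinds.add(kind)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return kinds


def stop_nodes(arcs):
    """Return (forward stop nodes, backward stop nodes) as sorted lists of end nodes."""
    forward, backward = set(), set()
    for src, dst, kind in arcs:
        if kind != "salient":
            continue
        after = reachable_kinds(dst, arcs)
        before = reachable_kinds(src, arcs, reverse=True)
        # Right-most salient words: a null network follows and no salient word does.
        if "null_net" in after and "salient" not in after:
            forward.add(dst)
        # Left-most salient words: a null network precedes and no salient word does.
        if "null_net" in before and "salient" not in before:
            backward.add(dst)
    return sorted(forward), sorted(backward)


print(stop_nodes(ARCS))
```

For this layout both sets come out as nodes 7 through 11, in agreement with the example in the text.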
  • With the semantically null words collapsed and stop nodes in place, search benefits from these alterations to the network will be described. Throughout this section, without loss in generality, the prefix-core-suffix network of FIG. 6 is used as an example. [0076]
  • The first savings occur during the forward pass, when the prefix network is traversed. Because all words of the prefix were collapsed into a unique single-input single-output network, the resulting number of acoustic network branches is potentially much smaller. Note, however, that even without the proposed optimizations, it would have been possible to collapse the search network from the entry point, thus generating a tree instead of a graph. So the actual savings are the reduction in branches from a tree to a single-input single-output graph, which may or may not be significant, depending on the size of the prefix. [0077]
  • The forward pass then continues by generating the forward score vectors for nodes 1 through 11. However, the forward processing stops there, since nodes 7 through 11 are forward stop nodes. This means that the score vector “max-out” at node 12 will not take place, and neither will the scoring of the suffix network. At this point, the forward pass is completed. [0078]
  • The backward pass then takes over by first reverse-scoring the collapsed suffix acoustic network. Because the suffix network was collapsed, scoring of all suffix words occurs simultaneously. By contrast, the unoptimized backward pass described above scores words on a “need-to” basis: it extends paths with the highest total likelihood first, so alternate suffix words are scored only if they belong to a path with a high total likelihood. Hence the backward scoring of the collapsed suffix network may end up being more costly than individual scoring of suffix words on a “need-to” basis. [0079]
  • After back-scoring the suffix, the backward pass meets the reverse suffix score vector with the forward score vectors of nodes 7 through 11. Conventionally, the word that yields the best total likelihood would be chosen for backward scoring. But because this node is a backward stop node, the backward scoring does not take place. Instead, the word is still backtracked, but only to construct the rescore graph properly. Depending on the layout of the search network, this saving can be considerable. Note that most of the time spent during the backward pass is for back-scoring networks. [0080]
  • Impact on Rescoring [0081]
  • The rescoring algorithm is very similar to the fast match algorithm previously described. It contains a forward pass to compute the forward score vectors at each word-ending node and a backward pass to decode the list of choices, just as described above. The most notable differences from the fast match pass are that in rescoring: [0082]
  • the network does not contain any loops, so a block algorithm may be used; [0083]
  • the whole utterance is available, so the block may be set to the entire utterance; [0084]
  • no pruning is done, since it is assumed that the fast match has already done the necessary pruning. [0085]
  • Given these strong parallels with the fast match steps, it is easy to see that all the optimizations previously described may be applied to the rescoring algorithm as well. [0086]
  • Furthermore, additional savings are made possible since the rescoring graph is a compact representation of the N-best list of non-semantically null word sequences, instead of the true N-best list. Hence, the rescoring algorithm is forced to focus only on the meaningful choice alternatives, leaving aside the non-informative affixes. FIG. 7 shows the optimized rescore graph. [0087]
  • Care must be taken, however, when designing the grammar. If the list of semantically null words is large, then rescoring time will be adversely affected, since all these words need to be rescored (remember there is no pruning in rescoring). If that is the case, then it may be more efficient to revert to the true N-best search. [0088]
  • Another point to mention is that constrained window Viterbi scoring can only be used to a limited extent with the proposed optimizations. Constrained window Viterbi scoring occurs when scoring is constrained to a fixed time window determined (approximately) by the word segmentation provided by the fast match pass. Since not all word segmentations are produced with the optimized backward pass of the fast match, the rescoring algorithm may be forced to score some words over a larger window than it should. The extent to which this is a problem is highly dependent on the mean word durations of non-semantically null words with respect to semantically null words. In other words, the shorter the semantically null words are with respect to the non-semantically null words, the smaller the penalty. [0089]
  • As mentioned before, rescoring is more efficient since we rescore only the list of N-best non-semantically null word sequences, instead of rescoring the true N-best list. To understand why this is so, refer to FIG. 5, which shows a word graph representing the true N-best list. Consider the word labeled “word 1” in the graph. Because this word is connected to two different suffixes, at different times (“suffix 1” and “suffix 2”), it will have to be scored twice. [0090]
  • Conclusion [0091]
  • A reduction in the amount of computations required to perform the search in continuous speech recognition is achieved by incorporating semantic information into the recognizer. Search optimizations involve collapsing each list of semantically null words into a unique single-input single-output acoustic network, and identifying stop nodes in the acoustic network. [0092]
  • These optimizations translate into savings in the processing required for the search because: [0093]
  • forward semantically null networks are collapsed into a graph. [0094]
  • right-most semantically null networks are not computed. [0095]
  • some non-semantically null words are not backward-scored. [0096]
  • an N-best list of only salient words is rescored instead of a true N-best list. [0097]
  • As a result, time savings during both forward and backward passes of the search, as well as during rescoring, are achieved by performing only the minimal amount of computation required to produce an exact N-best list of only semantically meaningful words, which is referred to as the N-best list of salient words. [0098]
  • The benefits are achieved by allowing semantic information, namely the knowledge of which words are semantically null, to be used by the recognizer component. [0099]
  • Time-synchronous processing time, occurring while the utterance is being spoken, is reduced by computing only a subset of the search space. The delay between the moment a person finishes speaking and the moment the recognized word string is returned by the application is also reduced. By performing only the necessary computation required to produce a top-N list of semantically meaningful words, the processing time for the backward pass of the search is reduced, by up to a factor of ten in some cases. [0100]
  • The post processing delay is also reduced during the rescoring pass since a more compact list of choices needs to be rescored. [0101]
  • Thus a single generic continuous speech recognizer may be used for all types of tasks, including those that may be optimised by incorporating semantic information at the recognizer level. [0102]
  • These processing time savings make it possible to run an increased number of speech recognition channels in parallel. This advantage is paramount for cost-effective real-time applications such as, for example, Nortel's Personal Voice Dialer (PVD), Voice Activated Business Directory (VABD), and Automated Directory Assistance Service Plus (ADAS+). [0103]
  • This development allows more complex speech applications to be implemented on smaller and older platforms. Thus existing products with older processors may advantageously be upgraded to provide extended services. In newer products and processors, the number of simultaneous channels that can be supported is higher, reducing the cost of deploying services. Improved performance may enhance users' perception of value and quality of service. [0104]
  • The method and system described above may be combined with other techniques to save processing time. For example, searching may alternatively be based on beam searches and lexical trees to provide benefits of those methods in addition to benefits of the method described above. [0105]
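  • For reference, beam pruning in general keeps, at each time frame, only those hypotheses whose score lies within a fixed margin of the best one. The following is a generic sketch of that idea and is not specific to the method described above; the function name `beam_prune`, the sample scores, and the `beam_width` value are illustrative assumptions.

```python
# Generic beam pruning sketch (an assumption, not taken from this document).


def beam_prune(hypotheses, beam_width):
    """Keep hypotheses whose log score is within beam_width of the best.

    hypotheses: dict mapping a hypothesis label to its log score.
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {h: s for h, s in hypotheses.items() if s >= best - beam_width}


active = {"john": -4.0, "joan": -5.5, "jim": -12.0}
kept = beam_prune(active, beam_width=3.0)
```

  • With a beam width of 3.0, the hypothesis scored -12.0 falls outside the beam and is dropped before the next frame is processed.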
  • Although specific embodiments of the invention have been described in detail, it will be apparent that numerous variations and modifications to the embodiments may be made within the scope of the following claims. [0106]

Claims (18)

What is claimed is:
1. A method for continuous speech recognition comprising:
incorporating semantic information during searching by a continuous speech recognizer.
2. A method for continuous speech recognition according to claim 1, comprising searching using semantic information to identify semantically-null words and thereby generate a list of N-best salient words.
3. A method for continuous speech recognition
providing speech input to a continuous speech recognizer,
providing to the continuous speech recognizer an acoustic model comprising a set of Hidden Markov Models, and a language model comprising both grammar and semantic information,
performing recognition of speech input using semantic information to eliminate semantically null words from the N-best list of words and restrict searching to an N-best list of salient words,
and performing word matching to output from the speech recognizer the N-best salient word sequences.
4. A method for a continuous speech recognition process according to claim 3 wherein the step of performing recognition comprises:
detecting connected word grammars bounded by semantically null words;
collapsing each list of semantically null words into a unique single-input single-output acoustic network;
and identifying stop nodes in the acoustic network.
5. A method according to claim 4 comprising:
during a forward pass of a search detecting forward stop nodes and signalling the search to stop forward scoring along a path currently being followed, and
during a backward pass of the search detecting backwards stop nodes and signalling the search to stop backward scoring along a path currently being followed.
6. A method according to claim 5 wherein right-most semantically null networks are not computed.
7. A method according to claim 5 wherein some semantically salient words are not backward-scored.
8. A method according to claim 5 wherein an N-best list of only salient words is rescored instead of a true N-best list.
9. A method according to claim 8 wherein scoring comprises Viterbi scoring.
10. Software on a machine readable medium for performing a method of continuous speech recognition comprising:
incorporating semantic information during searching by a continuous speech recognizer.
11. Software for performing a method of continuous speech recognition according to claim 10, wherein the method comprises searching using semantic information to generate a list of N-best salient words.
12. Software on a machine readable medium for performing a method for continuous speech recognition
providing speech input to a continuous speech recognizer,
providing to the continuous speech recognizer an acoustic model comprising a set of Hidden Markov Models, and a language model comprising both grammar and semantic information,
performing recognition of speech input using semantic information to eliminate semantically null words from the N-best list of words and restrict searching to an N-best list of salient words,
13. A system for continuous speech recognition comprising:
means for incorporating semantic information during searching by a continuous speech recognizer.
14. A system for continuous speech recognition according to claim 1, comprising means for searching using semantic information to generate a list of N-best salient words.
15. A system for continuous speech recognition
comprising a continuous speech recognizer,
input means for providing speech input to the continuous speech recognizer,
means for providing to the continuous speech recognizer an acoustic model comprising a set of Hidden Markov Models, and a language model comprising both grammar and semantic information,
the continuous speech recognizer comprising means for performing recognition of speech input using the semantic information for eliminating semantically null words from the N-best list of words and thereby restricting searching to an N-best list of salient words, and performing word matching to output the N-best salient word sequences.
16. A system according to claim 15 wherein the means for performing recognition of speech input using the semantic information comprises:
means for detecting connected word grammars bounded by semantically null words;
means for collapsing each list of semantically null words into a unique single-input single-output acoustic network;
and means for identifying stop nodes in the acoustic network.
17. A spoken language processing system for speech recognition comprising:
a continuous speech recognition component (CSR)
a natural language understanding component (NLU)
means for providing speech input to the CSR,
means for providing acoustic-phonetic knowledge to the CSR comprising a set of Hidden Markov Models;
means for providing language knowledge comprising grammar and statistical models to the CSR, and means for providing semantic knowledge to the NLU, and
means for providing semantic knowledge to the CSR,
the CSR being operable for searching using the semantic knowledge to constrain the search to an N-best list of salient words, and perform word matching to output N-best list of salient words to the NLU for interpretation of meaning.
18. A method for continuous speech recognition using a spoken language system comprising a continuous speech recognition component (CSR) linked to a natural language understanding component (NLU)
providing speech input to the CSR
providing acoustic-phonetic knowledge to the CSR comprising a set of Hidden Markov Models;
providing language knowledge comprising grammar and statistical models to the CSR;
providing language knowledge semantic knowledge to the CSR;
performing searching with the CSR using the semantic knowledge to constrain the search to an N-best list of salient words comprising semantically meaningful words of the N-best list of words,
and performing word matching to output the N-best salient word sequences to the NLU.
US09/185,529 1997-12-24 1998-11-04 Search optimization system and method for continuous speech recognition Expired - Fee Related US6397179B2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US09/185,529 US6397179B2 (en) 1997-12-24 1998-11-04 Search optimization system and method for continuous speech recognition
DE69908254T DE69908254T2 (en) 1998-07-21 1999-07-13 Search optimization system and method for continuous speech recognition
EP99305530A EP0977174B1 (en) 1998-07-21 1999-07-13 Search optimization system and method for continuous speech recognition

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US99782497A 1997-12-24 1997-12-24
US08997824 1997-12-24
US09/119,621 US6092045A (en) 1997-09-19 1998-07-21 Method and apparatus for speech recognition
US09/185,529 US6397179B2 (en) 1997-12-24 1998-11-04 Search optimization system and method for continuous speech recognition

Publications (2)

Publication Number Publication Date
US20010041978A1 (en) 2001-11-15
US6397179B2 US6397179B2 (en) 2002-05-28

Family

Family ID: 26817516

Family Applications (1)

Application Number Title Priority Date Filing Date
US09/185,529 Expired - Fee Related US6397179B2 (en) 1997-12-24 1998-11-04 Search optimization system and method for continuous speech recognition

Country Status (3)

Country Link
US (1) US6397179B2 (en)
EP (1) EP0977174B1 (en)
DE (1) DE69908254T2 (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2768561B2 (en) * 1990-12-19 1998-06-25 富士通株式会社 Network transformation device and creation device
US5388183A (en) * 1991-09-30 1995-02-07 Kurzwell Applied Intelligence, Inc. Speech recognition providing multiple outputs
US5621859A (en) 1994-01-19 1997-04-15 Bbn Corporation Single tree method for grammar directed, very large vocabulary speech recognizer
JP3265864B2 (en) 1994-10-28 2002-03-18 三菱電機株式会社 Voice recognition device
US5799276A (en) * 1995-11-07 1998-08-25 Accent Incorporated Knowledge-based speech recognition system and methods having frame length computed based upon estimated pitch period of vocalic intervals
US5819220A (en) * 1996-09-30 1998-10-06 Hewlett-Packard Company Web triggered word set boosting for speech interfaces to the world wide web
US5797123A (en) * 1996-10-01 1998-08-18 Lucent Technologies Inc. Method of key-phase detection and verification for flexible speech understanding
US6016470A (en) * 1997-11-12 2000-01-18 Gte Internetworking Incorporated Rejection grammar using selected phonemes for speech recognition system

Also Published As

Publication number Publication date
US6397179B2 (en) 2002-05-28
DE69908254T2 (en) 2003-11-27
EP0977174A3 (en) 2001-02-14
EP0977174B1 (en) 2003-05-28
EP0977174A2 (en) 2000-02-02
DE69908254D1 (en) 2003-07-03

Legal Events

Date Code Title Description
AS Assignment

Owner name: NORTHERN TELECOM LIMITED, CANADA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CRESPO, JEAN-FRANCOIS;SSTUBLEY, PETER R.;ROBILLARS, SERGE;REEL/FRAME:009815/0728

Effective date: 19990222

AS Assignment

Owner name: NORTHERN TELECOM LIMITED, CANADA

Free format text: RE-RECORD TO CORRECT THE SURNAME OF INVENTORS PREVIOUSLY RECORDED AT REEL/FRAME 9815/0728.;ASSIGNORS:CRESPO. JEAN-FRANCOIS;STUBLEY, PETER R.;ROBILLARD, SERGE;REEL/FRAME:010244/0661

Effective date: 19990222

AS Assignment

Owner name: NORTEL NETWORKS CORPORATION, CANADA

Free format text: CHANGE OF NAME;ASSIGNOR:NORTHERN TELECOM LIMITED;REEL/FRAME:010567/0001

Effective date: 19990429

AS Assignment

Owner name: NORTEL NETWORKS LIMITED, CANADA

Free format text: CHANGE OF NAME;ASSIGNOR:NORTEL NETWORKS CORPORATION;REEL/FRAME:011195/0706

Effective date: 20000830

REMI Maintenance fee reminder mailed
FPAY Fee payment

Year of fee payment: 4

SULP Surcharge for late payment
AS Assignment

Owner name: INNOVATION MANAGEMENT SCIENCES, LLC, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NORTEL NETWORKS LIMITED;REEL/FRAME:019215/0788

Effective date: 20070424

AS Assignment

Owner name: POPKIN FAMILY ASSETS, L.L.C., DELAWARE

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INNOVATION MANAGEMENT SCIENCES LLC;REEL/FRAME:019605/0022

Effective date: 20070427

FEPP Fee payment procedure

Free format text: PAYER NUMBER DE-ASSIGNED (ORIGINAL EVENT CODE: RMPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FPAY Fee payment

Year of fee payment: 8

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20140528