WO2014025282A1 - Method for recognition of speech messages and device for carrying out the method - Google Patents

Method for recognition of speech messages and device for carrying out the method Download PDF

Info

Publication number
WO2014025282A1
WO2014025282A1 (PCT/RU2012/000667)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
unit
recognition
data
audio signal
Prior art date
Application number
PCT/RU2012/000667
Other languages
French (fr)
Inventor
Mikhail Vasilevich KHITROV
Kirill Evgen'evich LEVIN
Original Assignee
Khitrov Mikhail Vasilevich
Levin Kirill Evgen Evich
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Khitrov Mikhail Vasilevich and Levin Kirill Evgen'evich
Priority to PCT/RU2012/000667 (WO2014025282A1)
Priority to EP12837648.0A (EP2883224A1)
Publication of WO2014025282A1

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/1822: Parsing for meaning understanding
    • G10L 17/00: Speaker identification or verification
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 21/0272: Voice signal separating
    • G10L 21/028: Voice signal separating using properties of sound source

Definitions

  • The preprocessing unit is arranged to determine the types and degrees of noise and distortion in the audio signal.
  • The audio signal receiving unit is arranged to provide data exchange with the user, control of data processing, loading of speech data from different sources, and output of resulting information.
  • The proposed device comprises a conversion unit arranged to convert input data of various formats, stored on various storage media, to a format suitable for speech recognition.
  • The proposed device comprises a grammatical agreement unit arranged to provide grammatical analysis of the word sequences obtained during secondary decoding.
  • The proposed device comprises storage media for storing the results of speech message recognition.
  • Fig. 1 shows the preferred embodiment of the device for recognition of speech messages according to the present invention.
  • Fig. 2 shows the preferred embodiment of the method for recognition of speech messages according to the present invention.
  • Fig. 1 shows the preferred embodiment of the device for recognition of speech messages according to the present invention.
  • The proposed device is a hardware and software system, comprising, for example, a computer system.
  • The proposed device comprises a user interface 1, a recognition job formation unit 2, a preprocessing unit 3, a speech recognition unit 4 comprising a decoder, a topic classification unit 5, an annotation unit 6, a learning unit 7, a linguistic processing unit 8, a result saving unit 9, a post-processing unit 10, a speaker identification unit 11 and a logical unit 12, a compensation unit 14, detecting elements 15 for detecting the presence or absence of speech, a computing device 16 for computing the Signal-to-Noise Ratio (SNR), and artificial neural networks 17 (shown in Fig. 2).
  • The audio signal is received by the interface 1, which enables initiating the recognition process after the audio signal is received.
  • The interface may comprise a keyboard and a computer mouse, and may further comprise means for receiving an audio signal, means for loading speech data from various data sources, and means for outputting resulting information.
  • Said interface 1 enables the user to control the data processing operation.
  • Speech data enters unit 2 through interface 1.
  • Unit 2 is an intermediate unit between interface 1 and unit 3.
  • Unit 2 converts input data of various formats, stored on various storage media, to a format suitable for speech recognition. Upon appropriate preprocessing of the speech signal, said signal is transmitted from unit 2 to unit 3.
  • Unit 3 converts the speech signal to a set of speech features, such as FBANK1 and FBANK2, which result from processing the speech message with Mel-frequency filter banks; F0, the fundamental frequency values of the speech signal; and MFCC, the mel-frequency cepstral coefficients. Said features allow extracting the informational component of the signal and reducing the inter-speaker and inter-session variability of the original signal.
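The FBANK features mentioned above are log energies of a mel filter bank applied to the short-time spectrum of a frame. The following is a minimal NumPy sketch of that computation; the frame length, filter count, FFT size and sample rate are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert mel values back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=512, sample_rate=8000):
    """Build triangular mel filters; rows are filters, columns are FFT bins."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = np.linspace(low, high, n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def fbank_features(frame, sample_rate=8000, n_filters=24, n_fft=512):
    """Log mel filter bank energies (FBANK) for one windowed frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    return np.log(mel_filterbank(n_filters, n_fft, sample_rate) @ spectrum + 1e-10)
```

MFCC would follow by applying a discrete cosine transform to these log energies.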
  • Detecting element 15 and computing element 16 determine the recording quality and provide data for further processing in unit 4.
  • Compensation of the FBANK1 features is performed constantly, allowing removal of constant signal distortions introduced into the input signal by the frequency response of the transmitting channel.
  • Other supporting feature sets (FBANK2) are received by detecting elements 15 for detecting the presence or absence of speech, noises and unwanted signals, by computing element 16 for computing the Signal-to-Noise Ratio (SNR), and by artificial neural networks 17 (ANN 1, ANN 2).
  • The neural networks compute posterior probabilities of the input data vector being a member of the background states. However, these probabilities are computed without taking into consideration an admissible phoneme pattern of speech; that pattern is considered in unit 4 during the decoding operation. Time points indicating a change of speaker and syntagma borders are determined using the remaining sets of features. Furthermore, intervals containing speech are extracted in unit 3, and the types and degrees of noises and distortions of the audio signal are determined.
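The posterior computation described above can be illustrated with a toy sketch: a single affine layer followed by a softmax yields, for each feature vector, a probability distribution over background states without reference to any phoneme grammar. The layer shape and weights below are placeholders, not the patent's ANN 1/ANN 2 architecture:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def state_posteriors(features, weights, bias):
    """One-layer sketch: map a feature vector to posterior probabilities of
    background (sub-phonemic) states, ignoring any phoneme grammar."""
    return softmax(features @ weights + bias)
```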
  • Unit 3 enables extracting several common types of distortions having the greatest impact on the reliability of recognition: non-linear distortions (overload) and additive noises of the transmitting channel.
  • The signal-to-noise ratio is computed, and intervals with amplitude changes typical of distortions caused by overload are determined.
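As a rough illustration of these quality measures, the sketch below computes an SNR from a speech interval and a noise-only interval, and flags runs of samples pinned near full scale, a typical signature of overload; the thresholds are illustrative assumptions:

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from a speech interval and a noise-only interval."""
    ps = np.mean(np.asarray(signal, float) ** 2)
    pn = np.mean(np.asarray(noise, float) ** 2)
    return 10.0 * np.log10(ps / pn)

def clipped_intervals(samples, limit=0.99, min_run=3):
    """Find runs of samples stuck near full scale, typical of overload distortion."""
    flat = np.abs(np.asarray(samples, float)) >= limit
    runs, start = [], None
    for i, f in enumerate(flat):
        if f and start is None:
            start = i
        elif not f and start is not None:
            if i - start >= min_run:
                runs.append((start, i))
            start = None
    if start is not None and len(flat) - start >= min_run:
        runs.append((start, len(flat)))
    return runs
```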
  • An important object of unit 3 is to determine the informative part of the speech signal. This allows saving recognition time by excluding silent intervals from the recognition operation.
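Excluding silent intervals can be sketched with a simple frame-energy detector; the frame length and threshold below are assumptions for illustration, not the patent's detecting element 15:

```python
import numpy as np

def speech_frames(samples, frame_len=160, threshold_ratio=0.1):
    """Mark frames as speech when their energy exceeds a fraction of the peak
    frame energy; silent frames can then be skipped during decoding."""
    x = np.asarray(samples, float)
    n = len(x) // frame_len
    energies = np.array([np.sum(x[i * frame_len:(i + 1) * frame_len] ** 2)
                         for i in range(n)])
    return energies > threshold_ratio * energies.max()
```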
  • Said signal is transmitted from unit 3 to unit 4.
  • Unit 4 enables determining the most probable grammatical hypothesis for an unknown phrase, i.e. the most probable path through a recognition network consisting of word models (which in turn are formed of single background models). The likelihood of a hypothesis depends on the following two factors: the probabilities of the background sequences, assigned by the acoustic model, and the probability of the consecutive arrangement of words; in particular, the probabilities of the background sequences within a word (a word can have several pronunciation variants) are multiplied by the probability of the consecutive arrangement of words. The operational speed of unit 4 remains acceptable because the search is performed with a limit, meaning that not all possible partial paths in the recognition network are analysed, but only those whose total likelihood is greater than a certain limit. Furthermore, at any specific time the likelihood of the most probable partial path present in the model is used to define a lower search limit. All paths with a likelihood lower than the defined limit are excluded from further analysis.
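The limited search described above is essentially beam pruning in a time-synchronous Viterbi search: at each frame, partial paths whose log likelihood falls more than a fixed beam below the best path are dropped. A minimal sketch, with a toy transition network standing in for real word models:

```python
import math

def beam_viterbi(obs_loglik, trans, beam=5.0):
    """Time-synchronous search sketch: keep only partial paths whose log
    likelihood is within `beam` of the current best.
    obs_loglik[t][s] is the log likelihood of state s at time t;
    trans[s] is a list of (next_state, log_prob) arcs leaving state s."""
    active = {0: 0.0}                      # state -> best log score so far
    for frame in obs_loglik:
        nxt = {}
        for s, score in active.items():
            for s2, lp in trans[s]:
                cand = score + lp + frame[s2]
                if cand > nxt.get(s2, -math.inf):
                    nxt[s2] = cand
        best = max(nxt.values())
        active = {s: v for s, v in nxt.items() if v >= best - beam}  # prune
    return max(active.items(), key=lambda kv: kv[1])
```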
  • In the proposed device, the language model is formed using unit 8. At the same time, the operational speed of unit 6 may also be increased by learning acoustic models with the help of unit 7. Said learning involves reconfiguration of the acoustic models using the results of previous recognition.
  • Unit 4 comprises a two-pass decoder, which enables gradual complication of the search condition when searching for the most probable word sequence.
  • Fine tuning of the acoustic models to the speech recording conditions is performed based on the information on the signal quality of the speech message, and the data on the topic of the message enables selection of the language model appropriate for that topic on the second pass of the decoder. It should be noted that when carrying out recognition, a separate conversion of the speech message features, tailoring the characteristics of each speaker to some "average" speaker, is performed for every speaker. For fine tuning of the acoustic models to the speech recording conditions, depending on the degree of unwanted signals, a compensation of the speech message spectrum is used.
  • The corresponding information is supplied directly to the decoder, which uses another decoding mode on these intervals, allowing only propagation of existing hypotheses and not generation of new word hypotheses. This allows excluding words known to be incorrect from the recognition results.
  • For each topic, a separate language model is formed in advance, for example a language model of political news or a language model of sportscasts.
  • The decoder selects the most appropriate language model for the given topic. This allows more accurate recognition of words and chunks of language specific to each topic.
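Topic-dependent language model selection can be sketched as scoring a word sequence under several per-topic models and keeping the best; the unigram tables below are invented placeholders for the n-gram models the patent assumes:

```python
import math

# Hypothetical per-topic unigram models; a real system would use n-gram LMs
# trained on topic-specific corpora (e.g. political news vs sportscasts).
TOPIC_LMS = {
    "politics": {"election": 0.4, "minister": 0.4, "goal": 0.2},
    "sports":   {"election": 0.1, "minister": 0.1, "goal": 0.8},
}

def log_prob(words, lm, floor=1e-4):
    """Log probability of a word sequence under a unigram model with flooring."""
    return sum(math.log(lm.get(w, floor)) for w in words)

def best_topic_lm(words, lms=TOPIC_LMS):
    """Pick the language model under which the hypothesis scores highest."""
    return max(lms, key=lambda t: log_prob(words, lms[t]))
```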
  • The speech component is transmitted from unit 4 to unit 5, where it is assigned a topic, and it is transmitted further to unit 6, where an annotation can be compiled for the speech component.
  • The annotation contains several sentences from the complete recognition result. Said sentences are selected according to the criterion of "information gain".
  • The device also comprises units 10, 11 and 12.
  • The speech component may be transmitted from unit 4 directly to unit 10, which performs analysis of the context of the recognition results; according to the grammar rules of the Russian language, grammatical agreement of case and gender endings is achieved in this unit and possible recognition errors are corrected.
  • The last stage of recognition is performed in the logical unit 12. While performing recognition of the speech component, probabilistic information received by the decoder from unit 11 and unit 5 is used in unit 12.
  • Unit 11 comprises models of speakers, provided as acoustic feature vectors whose dimensionality lies in the range of 200-300. Models of speakers are formed automatically from patterns of recordings of particular people whose speech is present in the speech message.
  • Unit 11 may comprise information storing means for storing the models of speakers. In unit 12, diverse hypotheses are combined as described above to provide the most probable word chain, in which the borders of sentences and of intervals with different topics and voices are defined.
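Matching an utterance against stored speaker models can be sketched as a nearest-neighbour search by cosine similarity over acoustic feature vectors of the dimensionality mentioned above; the 0.6 acceptance threshold is an illustrative assumption, not a value from the patent:

```python
import numpy as np

def identify_speaker(vector, speaker_models, threshold=0.6):
    """Match an acoustic feature vector against stored speaker models by
    cosine similarity; return the best model's name, or None when no
    model is close enough."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    best, best_sim = None, threshold
    for name, model in speaker_models.items():
        sim = cos(vector, model)
        if sim > best_sim:
            best, best_sim = name, sim
    return best
```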
  • Fig. 2 shows the preferred embodiment of the method for recognition of speech messages according to the present invention.
  • The received audio signal is transmitted to the speech features evaluation unit, which preprocesses the received audio signal so as to obtain several sets of features, used during further processing by other units, and the speech component of the signal, as described above.
  • In the proposed method, speech decoding and identification of a speaker are performed at the next stage.
  • For decoding, a two-pass decoder is used, which enables gradual complication of the search condition when searching for the most probable word sequence.
  • The speech component is supplied to the post-processing unit, which performs analysis of the context of the recognition results; according to the grammar rules of the Russian language, grammatical agreement of case and gender endings is achieved in this unit and possible recognition errors are corrected.
  • The proposed method allows converting an input speech message to text with high reliability.

Abstract

Proposed are a method for automatic recognition of speech messages and a device for carrying out the method. The proposed device and method allow converting an input speech message to text, subject to multifactorial processing of the speech message using algorithms for evaluating the quality of the speech message, an algorithm for identification of a speaker, syntagma border searching algorithms, and algorithms for determining the topic of the speech message. The device comprises a common logical unit, enabling the combination of diverse probabilistic estimators in order to reach a general decision on the content of the speech message. The use of said device and method increases the reliability of recognition of voice search queries and of recordings of conferences and negotiations.

Description

METHOD FOR RECOGNITION OF SPEECH MESSAGES AND DEVICE FOR CARRYING OUT THE METHOD
TECHNICAL FIELD
The present invention relates to the field of automatic speech recognition, particularly to a method for automatic recognition of speech messages and a device for carrying out said method. The present invention may be used for recognition of news reports and voice search queries, and also to process recordings of meetings and negotiations.
STATE OF THE ART
Nowadays, various devices and methods for recognition of speech messages and single words are known. Known solutions are based on comparison of inputted speech signals with reference signals provided in the corresponding dictionaries of speech patterns, and on analysis of the probabilities of matching upon such comparisons.
A device and method used for recognition of words in a continuous speech flow in order to determine user voice instructions are known from patent EP 1069551. In the proposed device and method, an algorithm for speech recognition based on hidden Markov models is implemented. According to the invention, a dictionary of reference speech patterns is created first; said dictionary comprises, for example, a set of common user voice instructions. Said reference patterns are compared with the patterns received from the user. After receiving a speech message, the probability of matching the delivered speech message with any of the speech messages from the dictionary of reference speech patterns is determined by a probability evaluation unit. If the inequality P > Pmax is true, where P stands for the probability value in the comparison with a given phrase from the speech pattern dictionary and Pmax stands for the maximum threshold, the speaker's phrase is assigned the value of the given phrase.
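The decision rule described above reduces to a threshold comparison over match probabilities. A minimal sketch, where the phrase names and scores are invented for illustration:

```python
def recognize_phrase(probabilities, p_max):
    """Assign the input the dictionary phrase whose match probability P
    exceeds the threshold Pmax, as in the scheme described above;
    return None when no phrase passes the threshold."""
    best_phrase, best_p = None, p_max
    for phrase, p in probabilities.items():
        if p > best_p:
            best_phrase, best_p = phrase, p
    return best_phrase
```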
The proposed solution is dedicated to recognition of single words and provides only low reliability when performing word-by-word recognition, because no multifactorial preprocessing of the speech message is performed, the set of criteria for calculation of probabilities is limited, the dictionary of reference speech patterns is limited, and no speakers' dictionary is created. Therefore, said solution does not allow recognition of speech messages with a high degree of reliability.
Known from patent RU 2223554 are a device and method for speech recognition; said device and method perform recognition of words based on inputted information on models of elementary speech units, where each elementary speech unit is shorter than a word. The device for speech recognition comprises means for accumulating an aggregate of dictionary symbols, used for accumulating symbolic sequences of said elementary speech units for common words which are generally used for word recognition based on the inputted speech information of random speakers.
The device also comprises means for extracting symbolic sequences for recorded words, generating symbolic sequences corresponding to internal connections of said elementary speech units by using an aggregate in which the requirement concerning connections between elementary speech units is described, such that the symbolic sequences of said elementary speech units have the largest probability for recorded words from the inputted speech information of a particular speaker. The device also comprises recording means storing said symbolic sequences of elementary speech units for the common words generally used for word recognition based on the inputted speech information from random speakers, together with the created symbolic sequences for recorded words, in the form of parallel aggregates. Said elementary speech units are acoustic events generated by separation of the hidden Markov model of a phoneme into separate states, without changing the values of transition probability, resultant probability and quantity of states.
The device enables performing word-by-word recognition of continuous speech, and the hardware of the device allows forming an extendable dictionary of models of elementary speech units, wherein word recognition is performed by using pre-defined models of random speakers. The disadvantage of the device and method disclosed in RU 2223554 is the low reliability of recognition of the speech message due to the limited probabilistic information used while performing recognition.
The closest prior art of the proposed method and device for recognition of speech messages is the method and device disclosed in patent RU 2296376. The method of RU 2296376 comprises the steps of receiving an audio signal; preprocessing the received audio signal by extracting intervals corresponding to the words separated from the background noise, and by dividing the signal into intervals of a certain duration smaller than the duration of phonemes; and further initial decoding of the speech component, with bispectral features formed during the decoding; the bispectral features are compared with reference features of phonemes in order to make a decision on a recognized phoneme of every word segment. When comparing a formed set of alphabetic codes of the phonemes of the word to be recognized with the sets of alphabetic codes of the phonemes of dictionary words using word reference features, a value array of the recognition factor is created, where the values are represented by the number of alphabetic codes and interval codes of the word to be recognized matching the corresponding ones of the dictionary words. The decision on the recognition of the word to be recognized is made in favour of that dictionary word which provides the highest value of the recognition factor when compared to the word to be recognized. Therefore, word-by-word recognition of speech messages may be achieved.
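The recognition-factor decision of RU 2296376 can be sketched as counting position-wise matches between code sequences and picking the dictionary word with the highest count; the codes below are invented for illustration:

```python
def recognition_factor(word_codes, dict_codes):
    """Count position-wise matches between the alphabetic codes of the word
    to be recognized and those of a dictionary word, a sketch of one
    entry of the value array described above."""
    return sum(1 for a, b in zip(word_codes, dict_codes) if a == b)

def recognize_word(word_codes, dictionary):
    """Pick the dictionary word with the highest recognition factor."""
    return max(dictionary,
               key=lambda w: recognition_factor(word_codes, dictionary[w]))
```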
The known method and device do not allow determining the topic of speech message and identifying the speaker. Said disadvantages limit probabilistic information used for recognition and do not allow achieving a high level of reliability of such recognition.
SUMMARY OF THE INVENTION
In this section following terms and definitions are used:
Acoustic model (AM) - a set of statistical features of separate speech sounds, which enables determination of the most probable word sequences.
Language (linguistic) model (LM) - an aggregation of the possible word sequences in oral speech.
Syntagma - an aggregation of several words, combined according to semantic, grammatical and phonetic rules of the language.
Background of a word (hereinafter referred to as - the background) - a unit of speech sound level, which can be extracted from the speech flow regardless of its phonemic place (i.e. without assigning it to one or another phoneme) or as a specific realization of the phoneme in speech.
The object of the present invention is to provide a technical solution, allowing recognition of speech messages with a high level of reliability and providing a multilevel textual layout which enables assigning separate phrases to different speakers.
The object is achieved by a method comprising the steps of receiving an audio signal, preprocessing the received audio signal by extracting a speech component of the signal, and initial decoding of the speech component using speech pattern dictionary data. The method is characterized in that during the preprocessing stage, time points indicating a change of speaker and syntagma borders are determined, and data on syntagma borders are used during the initial decoding stage; said method also comprises a stage of determining the speech component topic using a topic classifier; secondary decoding of the speech component using the data on the topic of the speech component and speech pattern dictionary data so as to obtain word sequences in text form; identification of speakers using data on models of the speakers; and logical processing of the obtained word sequence using data on the topic of the speech component and data on the identities of the speakers so as to obtain a multilevel textual layout. The technical result achieved by the proposed method is increased reliability of recognition of speech messages. The proposed method allows processing complex speech messages generated by single or multiple speakers and comprising intervals with different topics and recording quality. Statistical information about the content of the speech message, sufficient to achieve a high level of reliability of recognition, is provided in the form of several hypotheses resulting from multifactorial preprocessing and the use of the topic classifier and models of speakers.
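The claimed sequence of stages can be sketched end to end; every function body below is a toy stand-in (not the patented algorithms) whose only purpose is to show the order in which the stages feed each other:

```python
# Toy stand-ins for the claimed stages; each name and body is an
# illustration of the processing order only.
def preprocess(audio):
    return audio.strip(), ["spk1"], audio.split(".")

def initial_decode(speech, syntagmas):
    return speech.lower()

def classify_topic(hypothesis):
    return "sports" if "goal" in hypothesis else "general"

def secondary_decode(speech, topic):
    return speech.lower().split()

def identify_speakers(speech, changes):
    return changes

def logical_processing(words, topic, speakers):
    return {"topic": topic, "speakers": speakers, "text": " ".join(words)}

def recognize_message(audio):
    """Sketch of the claimed stage order: preprocess, initial decode,
    topic classification, secondary decode, speaker ID, logical processing."""
    speech, speaker_changes, syntagmas = preprocess(audio)
    hypothesis = initial_decode(speech, syntagmas)
    topic = classify_topic(hypothesis)
    words = secondary_decode(speech, topic)
    speakers = identify_speakers(speech, speaker_changes)
    return logical_processing(words, topic, speakers)
```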
According to yet another embodiment, a method is proposed wherein, after the secondary decoding, grammatical agreement of the obtained word sequence is provided.
According to yet another embodiment, a method is proposed wherein, after receiving the audio signal, it is converted to a format suitable for recognition.
The object of the present invention may also be achieved by a proposed device for recognition of speech messages, the device comprising an audio signal receiving unit; a preprocessing unit for preprocessing the received audio signal, the preprocessing unit being arranged to extract a speech component of the audio signal from the received audio signal; and a speech recognition unit comprising a decoder, the decoder being arranged to provide initial decoding of the speech component of the audio signal using speech pattern dictionary data.
The device is characterized in that it comprises a topic classification unit arranged to determine the topic of the speech component, and the decoder of the speech recognition unit is a two-pass decoder arranged to perform secondary decoding so as to obtain word sequences in text form, wherein the speech recognition unit is arranged to use data for the topic of said speech component received from the topic classification unit, and the preprocessing unit is arranged to determine time points indicating changes of speaker and syntagma borders. The device also comprises an identification unit for identification of speakers using the data for models of the speakers, and a logical unit arranged to perform logical processing of the obtained word sequence using data for the topic of the speech component and data for the identities of the speakers so as to obtain a multilevel textual layout.
The technical result achieved by the device is increased reliability of recognition of speech messages. The proposed device provides a high level of reliability of recognition of speech messages due to multifactorial preprocessing and the use of the topic classification unit and speaker models, which provide statistical information sufficient for reliable recognition of the speech message.
According to another embodiment of the device, the preprocessing unit is arranged to determine the types and degrees of noises and distortions of the audio signal.
According to another embodiment of the device, the audio signal receiving unit is arranged to provide data exchange with the user, data processing control, loading of speech data from different sources, and outputting of resulting information.
The proposed device according to yet another embodiment comprises a conversion unit arranged to convert input data, having various formats and stored on various storage mediums, to a format suitable for speech recognition.
The proposed device according to yet another embodiment comprises a grammatical agreement unit arranged to provide grammatical analysis of the word sequences obtained during secondary decoding.
The proposed device according to yet another embodiment comprises storage mediums for storing results of speech message recognition.
BRIEF DESCRIPTION OF DRAWINGS
A detailed description of the proposed invention is provided below with reference to figures, wherein
Fig. 1 shows the preferred embodiment of the device for recognition of speech messages according to the present invention.
Fig. 2 shows the preferred embodiment of the method for recognition of speech messages according to the present invention.
DETAILED DESCRIPTION OF THE INVENTION
Fig. 1 shows the preferred embodiment of the device for recognition of speech messages according to the present invention.
The proposed device according to the preferred embodiment is a hardware and software system comprising, for example, a computer system. As shown in Fig. 1, the proposed device comprises user interface 1, recognition job formation unit 2, preprocessing unit 3, speech recognition unit 4 comprising a decoder, topic classification unit 5, annotation unit 6, learning unit 7, linguistic processing unit 8, result saving unit 9, post-processing unit 10, speaker identification unit 11, logical unit 12, compensational unit 14, detecting elements 15 for detecting presence/absence of speech, computing device 16 for computing the Signal-to-Noise Ratio (SNR), and artificial neural networks 17 (shown in Fig. 2).
The operation principle of the proposed invention, with a detailed description of the interconnections between its units, is disclosed below. The audio signal is received by interface 1, which enables initiating the recognition process after the audio signal is received. The interface may comprise a keyboard and a computer mouse, and may further comprise means for receiving an audio signal, means for loading speech data from various data sources, and means for outputting resulting information. Said interface 1 enables the user to control the data processing operation. Speech data enter unit 2 through interface 1.
Unit 2 is an intermediate unit between interface 1 and unit 3. Unit 2 converts input data, having various formats and stored on various storage mediums, to a format suitable for speech recognition. After appropriate conversion, the speech signal is transmitted from unit 2 to unit 3.
Unit 3 converts the speech signal to a set of speech features, such as FBANK1 and FBANK2, which result from processing the speech message with mel-frequency filter banks; F0, the fundamental frequency values of the speech signal; and MFCC, the mel-frequency cepstral coefficients. These features allow extracting the informational component of the signal and reducing the inter-speaker and inter-session variability of the original signal.
Conversion of said original signal is performed using known algorithms, such as MFCC, FFT and LCRC. As can be seen from Fig. 2, the main feature set (FBANK1) is received by compensational unit 14, where initial tuning to the speech message transmission channel is performed, while detecting element 15 and computing element 16 determine the recording quality and provide data for further processing in unit 4. Furthermore, during the initial tuning to the transmission channel, compensation of the FBANK1 features is performed constantly, which removes from the input signal the constant distortions introduced by the frequency response of the transmission channel.
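As an illustration of the feature extraction performed in unit 3, the sketch below computes log mel filter-bank energies (FBANK-style features) with plain numpy. It is a minimal sketch, not the patented implementation; the frame length, hop, FFT size and filter count are assumed typical values for 16 kHz speech.

```python
import numpy as np

def mel_filterbank_features(signal, sr=16000, frame_len=400, hop=160, n_filters=26):
    """Log mel filter-bank energies (FBANK-style) for a mono signal."""
    # Frame the signal and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum of each frame
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:
            fbank[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)  # rising slope
        if r > c:
            fbank[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)  # falling slope
    # Log energies; small constant avoids log(0)
    return np.log(power @ fbank.T + 1e-10)
```

MFCC features would then be obtained by applying a DCT to these log energies, and F0 by a separate pitch tracker.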
The other supporting sets (FBANK2) are received by detecting elements 15 for detecting presence/absence of speech, noises and unwanted signals, by computing element 16 for computing the Signal-to-Noise Ratio (SNR), and by artificial neural networks 17 (ANN 1, ANN 2). The neural networks compute posterior probabilities of the input data vector belonging to the background states. These probabilities are computed without taking into consideration the admissible phoneme pattern of speech; the pattern is taken into account in unit 4 during the decoding operation. Time points indicating changes of speaker and syntagma borders are determined using the remaining sets of features. Furthermore, intervals containing speech are extracted in unit 3, and the types and degrees of noises and distortions of the audio signal are determined. Unit 3 enables extracting several common types of distortions having the greatest impact on reliability of recognition: non-linear distortions (overload) and additive noises of the transmission channel. In order to evaluate these distortions within a speech signal, the signal-to-noise ratio is computed and intervals with amplitude changes typical of overload distortions are determined. An important object of unit 3 is to determine the informative part of the speech signal. This saves recognition time by excluding silent intervals from the recognition operation. After appropriate preprocessing, the speech signal is transmitted from unit 3 to unit 4.
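The recording-quality measurements attributed to elements 15 and 16 can be illustrated with a simple energy-based sketch. The quantile-based noise-floor estimate and the 10 dB decision threshold below are our assumptions; the patent does not disclose the actual detection algorithm.

```python
import numpy as np

def frame_energies(signal, frame_len=400, hop=160):
    """Mean energy of each analysis frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.array([np.mean(signal[i * hop:i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

def estimate_snr(signal, frame_len=400, hop=160):
    """Rough SNR in dB: the loudest frames are taken as speech,
    the quietest as the noise floor."""
    e = frame_energies(signal, frame_len, hop)
    speech_e = np.quantile(e, 0.8)
    noise_e = np.quantile(e, 0.2) + 1e-12  # avoid division by zero
    return 10 * np.log10(speech_e / noise_e)

def speech_frames(signal, frame_len=400, hop=160, thresh_db=10.0):
    """Per-frame speech/non-speech decision relative to the noise floor."""
    e = frame_energies(signal, frame_len, hop)
    noise_e = np.quantile(e, 0.2) + 1e-12
    return 10 * np.log10(e / noise_e) > thresh_db
```

Dropping the frames flagged as non-speech is what allows the silent intervals to be excluded from the recognition operation.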
Unit 4 enables determining the most probable grammatical hypothesis for the unknown phrase, i.e. the most probable path through a recognition network, which consists of word models (which in turn are formed of models of single backgrounds). The likelihood of a hypothesis depends on the two following factors: the probabilities of the background sequences, assigned by the acoustic model, and the probability of the consecutive arrangement of words; in particular, the probabilities of the background sequences within a word (a word can have several pronunciation variants) are multiplied by the probability of the consecutive arrangement of words. The operational speed of unit 4 is acceptable and is achieved by performing the search with a limit, i.e. analyzing not all possible partial paths in the recognition network, but only those whose total likelihood is greater than a certain limit. Furthermore, at any specific time the likelihood of the most probable partial path in the model is used to define a lower search limit, and all paths with a likelihood lower than the defined limit are excluded from further analysis.
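The search-with-a-limit strategy described above is essentially beam pruning. The toy decoder below keeps only partial paths whose combined acoustic and language-model score stays within a fixed beam of the best path; the `frames_logprobs` and `lm_logprob` interfaces are hypothetical stand-ins for the acoustic model and the word-arrangement probabilities.

```python
import math

def beam_decode(frames_logprobs, lm_logprob, vocab, beam=5.0):
    """Minimal beam search over word sequences.

    frames_logprobs: list of dicts, acoustic log-probability per word per step.
    lm_logprob(seq, w): log-probability of word w following sequence seq.
    Paths scoring more than `beam` below the best path are pruned.
    """
    hyps = {(): 0.0}  # partial word sequence -> total log score
    for step_scores in frames_logprobs:
        new_hyps = {}
        for seq, score in hyps.items():
            for w in vocab:
                # acoustic score multiplied by LM score (added in log domain)
                s = score + step_scores[w] + lm_logprob(seq, w)
                key = seq + (w,)
                if s > new_hyps.get(key, -math.inf):
                    new_hyps[key] = s
        best = max(new_hyps.values())
        # prune: discard paths below the lower search limit
        hyps = {k: v for k, v in new_hyps.items() if v >= best - beam}
    return max(hyps, key=hyps.get)
```

A real decoder expands background-level (sub-word) models rather than whole words per step, but the pruning logic against a lower search limit is the same.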
In the proposed device, the language model is formed using unit 8. At the same time, the operational speed of unit 6 may also be increased by training the acoustic models with the help of unit 7. Said training involves reconfiguration of the acoustic models using the results of previous recognition.
Unit 4 comprises a two-pass decoder, which enables gradual complication of the search conditions when searching for the most probable word sequence. As shown in Fig. 2, fine tuning of the acoustic models to the speech recording conditions is performed based on information about the signal quality of the speech message, and data on the topic of the message enable selection of the language model appropriate for the particular topic on the second pass of the decoder. It should be noted that, when carrying out recognition, a separate conversion of the speech message features, mapping the characteristics of each speaker to some "average" speaker, is performed for every speaker. For fine tuning of the acoustic models to the speech recording conditions, depending on the degree of unwanted signals, compensation of the speech message spectrum is used. For some non-speech events (for example, cracks, honks, music) the corresponding information is supplied directly to the decoder, which uses a different decoding mode on these intervals, allowing only propagation of existing hypotheses and not generating new word hypotheses. This allows excluding words known to be incorrect from the recognition results.
Furthermore, for each topic a separate language model is formed in advance, for example a language model of political news or a language model of sportscasts. After determination of the topic, on the second pass the decoder selects the language model most appropriate for the given topic. This allows more accurate recognition of words and chunks of language specific to each topic.
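The topic-dependent second pass can be sketched as selecting, from several pre-built language models, the one that best fits the first-pass transcript, and then rescoring the hypotheses with it. Choosing the model by first-pass log-likelihood is an assumption; the patent does not specify the selection criterion, and the unigram models and weights below are purely illustrative.

```python
import math

def select_topic_lm(first_pass_words, topic_lms):
    """Pick the topic language model (here: unigram dicts) with the highest
    log-likelihood on the first-pass word sequence."""
    def loglik(lm):
        # unseen words get a small floor probability
        return sum(math.log(lm.get(w, 1e-6)) for w in first_pass_words)
    return max(topic_lms, key=lambda topic: loglik(topic_lms[topic]))

def rescore(hypotheses, lm, acoustic_weight=1.0, lm_weight=1.0):
    """Re-rank (acoustic_score, words) hypotheses with the selected topic LM."""
    def score(h):
        ac, words = h
        return acoustic_weight * ac + lm_weight * sum(
            math.log(lm.get(w, 1e-6)) for w in words)
    return max(hypotheses, key=score)
```

Real systems would use n-gram models with smoothing rather than raw unigram dictionaries, but the two-stage select-then-rescore flow is the same.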
As shown in Fig. 1, the speech component is transmitted from unit 4 to unit 5, where a topic is assigned to the speech component, and it is then transmitted to unit 6, where an annotation may be compiled for the speech component. The annotation contains several sentences from the complete recognition result; said sentences are selected according to the criterion of "information gain". To achieve higher accuracy of recognition, the device also comprises units 10, 11 and 12. As shown in Fig. 2, the speech component may be transmitted from unit 4 directly to unit 10, which analyzes the context of the recognition results; in accordance with the grammar rules of the Russian language, grammatical agreement of case and gender endings is achieved in this unit and possible recognition errors are corrected.
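The patent does not define its "information gain" criterion for annotation, so the sketch below approximates it with inverse sentence frequency: sentences containing words that appear in few other sentences are taken as more informative. This is an illustrative stand-in only, not the disclosed method.

```python
import math
from collections import Counter

def annotate(sentences, k=2):
    """Pick the k most informative sentences from a recognition result.

    'Information gain' is approximated (our assumption) by the average
    log inverse sentence frequency of a sentence's words."""
    # number of sentences each word occurs in
    df = Counter(w for s in sentences for w in set(s.split()))
    n = len(sentences)
    def gain(s):
        words = s.split()
        return sum(math.log(n / df[w]) for w in words) / max(len(words), 1)
    return sorted(sentences, key=gain, reverse=True)[:k]
```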
The last stage of recognition is performed in the logical unit 12. While performing recognition of the speech component in unit 12, probabilistic information received by the decoder from unit 11 and unit 5 is used. Unit 11 comprises models of speakers, provided as acoustic feature vectors whose dimensionality lies in the range of 200-300. Models of speakers are formed automatically according to patterns of recordings of certain people whose speech is present in the speech message. Unit 11 may comprise information storing means for storing the models of speakers. In unit 12, diverse hypotheses are combined as described above to provide the most probable word chain, in which the borders of sentences and of intervals with different topics and voices are defined.
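Matching an utterance against the speaker models of unit 11 can be sketched as a nearest-neighbor search over the acoustic feature vectors. Cosine similarity and the acceptance threshold are our assumptions; the patent only states that the models are feature vectors of dimensionality 200-300.

```python
import numpy as np

def identify_speaker(utterance_vec, speaker_models, threshold=0.5):
    """Match an utterance-level feature vector against enrolled speaker
    model vectors by cosine similarity.

    Returns (speaker_name, similarity), or (None, similarity) if no
    enrolled speaker scores above the threshold."""
    def cos(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    best, best_sim = None, -1.0
    for name, model in speaker_models.items():
        sim = cos(utterance_vec, model)
        if sim > best_sim:
            best, best_sim = name, sim
    return (best, best_sim) if best_sim >= threshold else (None, best_sim)
```

The per-speaker decisions would then feed the logical unit, which assigns phrases to speakers in the multilevel textual layout.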
Finally, the recognition result is received by unit 9, which saves the result. Fig. 2 shows the preferred embodiment of the method for recognition of speech messages according to the present invention. As can be seen from Fig. 2, at the initial processing stage the received audio signal is transmitted to the speech features evaluation unit, which preprocesses the received audio signal so as to obtain several sets of features, used during further processing by the other units, and the speech component of the signal, as described above. According to the proposed method, at the next stage speech decoding and identification of the speaker are performed.
For decoding, a two-pass decoder is used, which enables gradual complication of the search conditions when searching for the most probable word sequence. According to the proposed method, at the next stage the speech component is supplied to the post-processing unit, which analyzes the context of the recognition results; in accordance with the grammar rules of the Russian language, grammatical agreement of case and gender endings is achieved in this unit and possible recognition errors are corrected.
According to the proposed method, at the next stage the diverse hypotheses are combined in the logical unit to provide the most probable word chain, in which the borders of sentences and of intervals with different topics and voices are defined. Use of a single speech features evaluation unit allows a significant reduction of the computational costs at this processing stage.
The recognition result is saved during the final processing stage.
The proposed method allows converting an input speech message to text with high reliability.

Claims

WE CLAIM
1. A method for recognition of speech messages, the method comprising steps of: receiving an audio signal,
preprocessing of the received audio signal by extracting a speech component of said signal, initial decoding of the speech component using speech pattern dictionary data,
the method characterized in that
during preprocessing stage time points indicating change of speaker and syntagma borders are determined,
data on syntagma borders are used during initial decoding stage,
speech component topic is determined using a topic classifier,
and the method comprises the steps of:
secondary decoding of the speech component using the data for the topic of the speech component and speech pattern dictionary data so as to obtain word sequences in the text form; identification of speakers using data for models of the speakers;
logical processing of the obtained word sequence using data for the topic of the speech component and data for identities of the speakers so as to obtain a multilevel textual layout.
2. The method according to claim 1, wherein after the secondary decoding a grammatical agreement of the obtained word sequence is provided.
3. The method according to claim 2, wherein, after receiving the audio signal, conversion thereof to a format suitable for recognition is provided.
4. A device for recognition of speech messages, the device comprising
an audio signal receiving unit,
a preprocessing unit for preprocessing the received audio signal, the preprocessing unit is arranged to extract a speech component of the audio signal from the received audio signal,
a speech recognition unit, comprising a decoder, the decoder is arranged to provide initial decoding of the speech component of the audio signal using speech pattern dictionary data, the device characterized in that it comprises a topic classification unit arranged to determine the speech component topic, and
the decoder of the speech recognition unit is a two-pass decoder arranged to perform secondary decoding so as to obtain word sequences in the text form, wherein the speech recognition unit is arranged to use data for the topic of said speech component received from the topic classification unit,
the preprocessing unit is arranged to determine time points indicating change of speaker and syntagma borders, and
the device also comprises
an identification unit for identification of speakers using the data for models of the speakers, and
a logical unit arranged to perform logical processing of the obtained word sequence using data for the topic of the speech component and data for identities of the speakers so as to obtain a multilevel textual layout.
5. The device according to claim 4, wherein the preprocessing unit is arranged to determine types and degrees of noises and distortions of the audio signal.
6. The device according to claim 4, wherein the audio signal receiving unit is arranged to provide data exchange with the user, data processing control, loading of speech data from different sources and outputting resulting information.
7. The device according to claim 4, comprising a conversion unit arranged to convert input data, having various formats and stored on various storage mediums, to a format suitable for speech recognition.
8. The device according to claim 4, comprising a grammatical agreement unit arranged to provide grammatical analysis of the word sequences obtained during secondary decoding.
9. The device according to claim 4, comprising storage mediums for storing results of speech message recognition.
PCT/RU2012/000667 2012-08-10 2012-08-10 Method for recognition of speech messages and device for carrying out the method WO2014025282A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/RU2012/000667 WO2014025282A1 (en) 2012-08-10 2012-08-10 Method for recognition of speech messages and device for carrying out the method
EP12837648.0A EP2883224A1 (en) 2012-08-10 2012-08-10 Method for recognition of speech messages and device for carrying out the method

Publications (1)

Publication Number Publication Date
WO2014025282A1 true WO2014025282A1 (en) 2014-02-13

Family

ID=48014276

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2012/000667 WO2014025282A1 (en) 2012-08-10 2012-08-10 Method for recognition of speech messages and device for carrying out the method

Country Status (2)

Country Link
EP (1) EP2883224A1 (en)
WO (1) WO2014025282A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1069551A2 (en) 1999-07-16 2001-01-17 Bayerische Motoren Werke Aktiengesellschaft Speech recognition system and method for recognising given speech patterns, specially for voice control
US6434520B1 (en) * 1999-04-16 2002-08-13 International Business Machines Corporation System and method for indexing and querying audio archives
RU2223554C2 (en) 1998-09-09 2004-02-10 Асахи Касеи Кабусики Кайся Speech recognition device
US20050010411A1 (en) * 2003-07-09 2005-01-13 Luca Rigazio Speech data mining for call center management
RU2296376C2 (en) 2005-03-30 2007-03-27 Открытое акционерное общество "Корпорация "Фазотрон - научно-исследовательский институт радиостроения" Method for recognizing spoken words
US20070118372A1 (en) * 2005-11-23 2007-05-24 General Electric Company System and method for generating closed captions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DENENBERG L ET AL: "Gisting conversational speech in real time", 1993 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 1993. ICASSP-93; [PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP)], PISCATAWAY, NJ, USA, vol. 2, 27 April 1993 (1993-04-27), pages 131 - 134, XP010110411, ISBN: 978-0-7803-0946-3, DOI: 10.1109/ICASSP.1993.319249 *

Also Published As

Publication number Publication date
EP2883224A1 (en) 2015-06-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12837648

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2012837648

Country of ref document: EP