WO2014025282A1 - Method for recognition of speech messages and device for carrying out the method - Google Patents

Method for recognition of speech messages and device for carrying out the method Download PDF

Info

Publication number
WO2014025282A1
WO2014025282A1 (PCT/RU2012/000667)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
unit
recognition
data
audio signal
Prior art date
Application number
PCT/RU2012/000667
Other languages
French (fr)
Inventor
Mikhail Vasilevich KHITROV
Kirill Evgen'evich LEVIN
Original Assignee
Khitrov Mikhail Vasilevich
Levin Kirill Evgen Evich
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Khitrov Mikhail Vasilevich and Levin Kirill Evgen'evich
Priority to PCT/RU2012/000667 (WO2014025282A1)
Priority to EP12837648.0A (EP2883224A1)
Publication of WO2014025282A1

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/1822: Parsing for meaning understanding
    • G10L 17/00: Speaker identification or verification
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 21/0272: Voice signal separating
    • G10L 21/028: Voice signal separating using properties of sound source

Definitions

  • The preprocessing unit is arranged to determine the types and degrees of noise and distortion in the audio signal.
  • The audio signal receiving unit is arranged to provide data exchange with the user, control of data processing, loading of speech data from different sources, and output of resulting information.
  • The proposed device comprises a conversion unit arranged to convert input data of various formats, stored on various storage media, to a format suitable for speech recognition.
  • The proposed device comprises a grammatical agreement unit arranged to provide grammatical analysis of the word sequences obtained during secondary decoding.
  • The proposed device comprises storage media for storing the results of speech message recognition.
  • Fig. 1 shows the preferred embodiment of the device for recognition of speech messages according to the present invention.
  • Fig. 2 shows the preferred embodiment of the method for recognition of speech messages according to the present invention.
  • Fig. 1 shows the preferred embodiment of the device for recognition of speech messages according to the present invention.
  • The proposed device is a hardware and software system, comprising, for example, a computer system.
  • The proposed device comprises a user interface 1, a recognition job formation unit 2, a preprocessing unit 3, a speech recognition unit 4 comprising a decoder, a topic classification unit 5, an annotation unit 6, a learning unit 7, a linguistic processing unit 8, a result saving unit 9, a post-processing unit 10, a speaker identification unit 11 and a logical unit 12, a compensation unit 14, detecting elements 15 for detecting the presence or absence of speech, a computing device 16 for computing the Signal-to-Noise Ratio (SNR), and artificial neural networks 17 (shown in Fig. 2).
  • The audio signal is received by the interface 1, which enables initiating the recognition process after the audio signal is received.
  • The interface may comprise a keyboard and a computer mouse, and may further comprise means for receiving an audio signal, means for loading speech data from various data sources, and means for outputting resulting information.
  • Said interface 1 enables the user to control the data processing operation.
  • Speech data enters unit 2 through interface 1.
  • Unit 2 is an intermediate unit between interface 1 and unit 3.
  • Unit 2 converts input data of various formats, stored on various storage media, to a format suitable for speech recognition. Upon appropriate preprocessing of the speech signal, said signal is transmitted from unit 2 to unit 3.
  • Unit 3 converts the speech signal to a set of speech features, such as FBANK1 and FBANK2, which result from processing the speech message with Mel-frequency filter banks; F0, the fundamental frequency values of the speech signal; and MFCC, the mel-frequency cepstral coefficients. Said features allow extracting the informational component of the signal and reducing the inter-speaker and inter-session variability of the original signal.
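The FBANK features mentioned above are log energies of a mel filter bank applied to the short-time spectrum of a frame. The following is a minimal NumPy sketch of that computation; the frame length, filter count, FFT size and sample rate are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert mel values back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=512, sample_rate=8000):
    """Build triangular mel filters; rows are filters, columns are FFT bins."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = np.linspace(low, high, n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def fbank_features(frame, sample_rate=8000, n_filters=24, n_fft=512):
    """Log mel filter bank energies (FBANK) for one windowed frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    return np.log(mel_filterbank(n_filters, n_fft, sample_rate) @ spectrum + 1e-10)
```

MFCC would follow by applying a discrete cosine transform to these log energies.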
  • Detecting element 15 and computing element 16 determine the recording quality and provide data for further processing in unit 4.
  • Compensation of the FBANK1 features is performed constantly, allowing removal of constant signal distortions introduced into the input signal by the frequency response of the transmitting channel.
  • Other supporting feature sets (FBANK2) are received by detecting elements 15 for detecting the presence or absence of speech, noises and unwanted signals, by computing element 16 for computing the Signal-to-Noise Ratio (SNR), and by artificial neural networks 17 (ANN 1, ANN 2).
  • The neural networks compute posterior probabilities of the input data vector being a member of the background states. However, these probabilities are computed without taking into consideration an admissible phoneme pattern of speech; that pattern is considered in unit 4 during the decoding operation. Time points indicating a change of speaker and syntagma borders are determined using the remaining sets of features. Furthermore, intervals containing speech are extracted in unit 3, and the types and degrees of noises and distortions of the audio signal are determined.
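The posterior computation described above can be illustrated with a toy sketch: a single affine layer followed by a softmax yields, for each feature vector, a probability distribution over background states without reference to any phoneme grammar. The layer shape and weights below are placeholders, not the patent's ANN 1/ANN 2 architecture:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def state_posteriors(features, weights, bias):
    """One-layer sketch: map a feature vector to posterior probabilities of
    background (sub-phonemic) states, ignoring any phoneme grammar."""
    return softmax(features @ weights + bias)
```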
  • Unit 3 enables extracting several common types of distortions having the greatest impact on the reliability of recognition: non-linear distortions (overload) and additive noises of the transmitting channel.
  • The signal-to-noise ratio is computed, and intervals with amplitude changes typical of distortions caused by overload are determined.
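As a rough illustration of these quality measures, the sketch below computes an SNR from a speech interval and a noise-only interval, and flags runs of samples pinned near full scale, a typical signature of overload; the thresholds are illustrative assumptions:

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from a speech interval and a noise-only interval."""
    ps = np.mean(np.asarray(signal, float) ** 2)
    pn = np.mean(np.asarray(noise, float) ** 2)
    return 10.0 * np.log10(ps / pn)

def clipped_intervals(samples, limit=0.99, min_run=3):
    """Find runs of samples stuck near full scale, typical of overload distortion."""
    flat = np.abs(np.asarray(samples, float)) >= limit
    runs, start = [], None
    for i, f in enumerate(flat):
        if f and start is None:
            start = i
        elif not f and start is not None:
            if i - start >= min_run:
                runs.append((start, i))
            start = None
    if start is not None and len(flat) - start >= min_run:
        runs.append((start, len(flat)))
    return runs
```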
  • An important object of unit 3 is to determine the informative part of the speech signal. This allows saving recognition time by excluding silent intervals from the recognition operation.
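Excluding silent intervals can be sketched with a simple frame-energy detector; the frame length and threshold below are assumptions for illustration, not the patent's detecting element 15:

```python
import numpy as np

def speech_frames(samples, frame_len=160, threshold_ratio=0.1):
    """Mark frames as speech when their energy exceeds a fraction of the peak
    frame energy; silent frames can then be skipped during decoding."""
    x = np.asarray(samples, float)
    n = len(x) // frame_len
    energies = np.array([np.sum(x[i * frame_len:(i + 1) * frame_len] ** 2)
                         for i in range(n)])
    return energies > threshold_ratio * energies.max()
```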
  • Said signal is transmitted from unit 3 to unit 4.
  • Unit 4 enables determining the most probable grammatical hypothesis for an unknown phrase, i.e. the most probable path through a recognition network consisting of word models (which in turn are formed of single background models). The likelihood of a hypothesis depends on the following two factors: the probabilities of the background sequences, assigned by the acoustic model, and the probability of the consecutive arrangement of words; in particular, the probabilities of the background sequences within a word (a word can have several pronunciation variants) are multiplied by the probability of the consecutive arrangement of words. The operational speed of unit 4 remains acceptable because the search is performed with a limit, meaning that not all possible partial paths in the recognition network are analysed, but only those whose total likelihood is greater than a certain limit. Furthermore, at any specific time the likelihood of the most probable partial path present in the model is used to define a lower search limit. All paths with a likelihood lower than the defined limit are excluded from further analysis.
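The limited search described above is essentially beam pruning in a time-synchronous Viterbi search: at each frame, partial paths whose log likelihood falls more than a fixed beam below the best path are dropped. A minimal sketch, with a toy transition network standing in for real word models:

```python
import math

def beam_viterbi(obs_loglik, trans, beam=5.0):
    """Time-synchronous search sketch: keep only partial paths whose log
    likelihood is within `beam` of the current best.
    obs_loglik[t][s] is the log likelihood of state s at time t;
    trans[s] is a list of (next_state, log_prob) arcs leaving state s."""
    active = {0: 0.0}                      # state -> best log score so far
    for frame in obs_loglik:
        nxt = {}
        for s, score in active.items():
            for s2, lp in trans[s]:
                cand = score + lp + frame[s2]
                if cand > nxt.get(s2, -math.inf):
                    nxt[s2] = cand
        best = max(nxt.values())
        active = {s: v for s, v in nxt.items() if v >= best - beam}  # prune
    return max(active.items(), key=lambda kv: kv[1])
```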
  • In the proposed device, the language model is formed using unit 8. At the same time, the operational speed of unit 6 may also be increased by learning acoustic models with the help of unit 7. Said learning involves reconfiguration of the acoustic models using the results of previous recognition.
  • Unit 4 comprises a two-pass decoder, which enables gradual complication of the search condition when searching for the most probable word sequence.
  • Fine tuning of the acoustic models to the speech recording conditions is performed based on the information on the signal quality of the speech message, and the data on the topic of the message enables selection of the language model appropriate for that topic on the second pass of the decoder. It should be noted that when carrying out recognition, a separate conversion of the speech message features, tailoring the characteristics of each speaker to some "average" speaker, is performed for every speaker. For fine tuning of the acoustic models to the speech recording conditions, depending on the degree of unwanted signals, a compensation of the speech message spectrum is used.
  • The corresponding information is supplied directly to the decoder, which uses another decoding mode on these intervals, allowing only propagation of existing hypotheses and not generation of new word hypotheses. This allows excluding words known to be incorrect from the recognition results.
  • For each topic, a separate language model is formed in advance, for example a language model of political news or a language model of sportscasts.
  • The decoder selects the most appropriate language model for the given topic. This allows more accurate recognition of words and chunks of language specific to each topic.
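Topic-dependent language model selection can be sketched as scoring a word sequence under several per-topic models and keeping the best; the unigram tables below are invented placeholders for the n-gram models the patent assumes:

```python
import math

# Hypothetical per-topic unigram models; a real system would use n-gram LMs
# trained on topic-specific corpora (e.g. political news vs sportscasts).
TOPIC_LMS = {
    "politics": {"election": 0.4, "minister": 0.4, "goal": 0.2},
    "sports":   {"election": 0.1, "minister": 0.1, "goal": 0.8},
}

def log_prob(words, lm, floor=1e-4):
    """Log probability of a word sequence under a unigram model with flooring."""
    return sum(math.log(lm.get(w, floor)) for w in words)

def best_topic_lm(words, lms=TOPIC_LMS):
    """Pick the language model under which the hypothesis scores highest."""
    return max(lms, key=lambda t: log_prob(words, lms[t]))
```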
  • The speech component is transmitted from unit 4 to unit 5, where it is assigned a topic, and it is transmitted further to unit 6, where an annotation can be compiled for the speech component.
  • The annotation contains several sentences from the complete recognition result. Said sentences are selected according to the criterion of "information gain".
  • The device also comprises units 10, 11 and 12.
  • The speech component may be transmitted from unit 4 directly to unit 10, which performs analysis of the context of the recognition results; according to the grammar rules of the Russian language, grammatical agreement of case and gender endings is achieved in this unit and possible recognition errors are corrected.
  • The last stage of recognition is performed in the logical unit 12. While performing recognition of the speech component, probabilistic information received by the decoder from unit 11 and unit 5 is used in unit 12.
  • Unit 11 comprises models of speakers, provided as acoustic feature vectors whose dimensionality lies in the range of 200-300. Models of speakers are formed automatically from patterns of recordings of particular people whose speech is present in the speech message.
  • Unit 11 may comprise information storing means for storing the models of speakers. In unit 12, diverse hypotheses are combined as described above to provide the most probable word chain, in which the borders of sentences and of intervals with different topics and voices are defined.
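Matching an utterance against stored speaker models can be sketched as a nearest-neighbour search by cosine similarity over acoustic feature vectors of the dimensionality mentioned above; the 0.6 acceptance threshold is an illustrative assumption, not a value from the patent:

```python
import numpy as np

def identify_speaker(vector, speaker_models, threshold=0.6):
    """Match an acoustic feature vector against stored speaker models by
    cosine similarity; return the best model's name, or None when no
    model is close enough."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    best, best_sim = None, threshold
    for name, model in speaker_models.items():
        sim = cos(vector, model)
        if sim > best_sim:
            best, best_sim = name, sim
    return best
```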
  • Fig. 2 shows the preferred embodiment of the method for recognition of speech messages according to the present invention.
  • The received audio signal is transmitted to the speech features evaluation unit, which preprocesses the received audio signal so as to obtain several sets of features, used during further processing by other units, and the speech component of the signal, as described above.
  • In the proposed method, speech decoding and identification of a speaker are performed at the next stage.
  • For decoding, a two-pass decoder is used, which enables gradual complication of the search condition when searching for the most probable word sequence.
  • The speech component is supplied to the post-processing unit, which performs analysis of the context of the recognition results; according to the grammar rules of the Russian language, grammatical agreement of case and gender endings is achieved in this unit and possible recognition errors are corrected.
  • The proposed method allows converting an input speech message to text with high reliability.

Abstract

Proposed are a method for automatic recognition of speech messages and a device for carrying out the method. The proposed device and method allow converting an input speech message to text, subject to multifactorial processing of the speech message using algorithms for evaluating the quality of the speech message, an algorithm for identification of a speaker, syntagma border searching algorithms, and algorithms for determining the topic of the speech message. The device comprises a common logical unit, enabling the combination of diverse probabilistic estimators in order to reach a general decision on the content of the speech message. The use of said device and method increases the reliability of recognition of voice search queries and of recordings of conferences and negotiations.

Description

METHOD FOR RECOGNITION OF SPEECH MESSAGES AND DEVICE FOR CARRYING OUT THE METHOD
TECHNICAL FIELD
The present invention relates to the field of automatic speech recognition, particularly to a method for automatic recognition of speech messages and a device for carrying out said method. The present invention may be used for recognition of news reports and voice search queries, and also to process recordings of meetings and negotiations.
STATE OF THE ART
Nowadays, various devices and methods for recognition of speech messages and single words are known. Known solutions are based on comparison of inputted speech signals with reference signals provided in the corresponding dictionaries of speech patterns, and on analysis of the probabilities of matching upon such comparisons.
A device and method used for recognition of words in a continuous speech flow in order to determine user voice instructions are known from patent EP 1069551. In the proposed device and method, an algorithm for speech recognition based on hidden Markov models is implemented. According to the invention, a dictionary of reference speech patterns is created first; said dictionary comprises, for example, a set of common user voice instructions. Said reference patterns are compared with the patterns received from the user. After receiving a speech message, the probability of matching the delivered speech message with any of the speech messages from the dictionary of reference speech patterns is determined by a probability evaluation unit. If the inequality P > Pmax is true, where P stands for the probability value in the comparison with a given phrase from the speech pattern dictionary and Pmax stands for the maximum threshold, the speaker's phrase is assigned the value of the given phrase.
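The decision rule described above reduces to a threshold comparison over match probabilities. A minimal sketch, where the phrase names and scores are invented for illustration:

```python
def recognize_phrase(probabilities, p_max):
    """Assign the input the dictionary phrase whose match probability P
    exceeds the threshold Pmax, as in the scheme described above;
    return None when no phrase passes the threshold."""
    best_phrase, best_p = None, p_max
    for phrase, p in probabilities.items():
        if p > best_p:
            best_phrase, best_p = phrase, p
    return best_phrase
```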
The proposed solution is dedicated to recognition of single words and provides only low reliability when performing word-by-word recognition, because no multifactorial preprocessing of the speech message is performed, the set of criteria for calculation of probabilities is limited, the dictionary of reference speech patterns is limited, and no speakers' dictionary is created. Therefore, said solution does not allow recognition of speech messages with a high degree of reliability.
Known from patent RU 2223554 are a device and method for speech recognition; said device and method perform recognition of words based on inputted information on models of elementary speech units, where each elementary speech unit is shorter than a word. The device for speech recognition comprises means for accumulating an aggregate of dictionary symbols, used for accumulating symbolic sequences of said elementary speech units for common words which are generally used for word recognition based on the inputted speech information of random speakers.
The device also comprises means for extracting symbolic sequences for recorded words, generating symbolic sequences corresponding to internal connections of said elementary speech units by using an aggregate in which the requirement concerning connections between elementary speech units is described, such that the symbolic sequences of said elementary speech units have the largest probability for recorded words from the inputted speech information of a particular speaker. The device also comprises recording means storing said symbolic sequences of elementary speech units for the common words generally used for word recognition based on the inputted speech information from random speakers, together with the created symbolic sequences for recorded words, in the form of parallel aggregates. Said elementary speech units are acoustic events generated by separation of the hidden Markov model of a phoneme into separate states, without changing the values of transition probability, resultant probability and quantity of states.
The device enables performing word-by-word recognition of continuous speech, and the hardware of the device allows forming an extendable dictionary of models of elementary speech units, wherein word recognition is performed by using pre-defined models of random speakers. The disadvantage of the device and method disclosed in RU 2223554 is the low reliability of recognition of the speech message due to the limited probabilistic information used while performing recognition.
The closest prior art of the proposed method and device for recognition of speech messages is the method and device disclosed in patent RU 2296376. The method of RU 2296376 comprises the steps of receiving an audio signal; preprocessing the received audio signal by extracting intervals corresponding to the words separated from the background noise, and by dividing the signal into intervals of a certain duration smaller than the duration of phonemes; and further initial decoding of the speech component, with bispectral features formed during the decoding; the bispectral features are compared with reference features of phonemes in order to make a decision on a recognized phoneme of every word segment. When comparing a formed set of alphabetic codes of the phonemes of the word to be recognized with the sets of alphabetic codes of the phonemes of dictionary words using word reference features, a value array of the recognition factor is created, where the values are represented by the number of alphabetic codes and interval codes of the word to be recognized matching the corresponding ones of the dictionary words. The decision on the recognition of the word to be recognized is made in favour of that dictionary word which provides the highest value of the recognition factor when compared to the word to be recognized. Therefore, word-by-word recognition of speech messages may be achieved.
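The recognition-factor decision of RU 2296376 can be sketched as counting position-wise matches between code sequences and picking the dictionary word with the highest count; the codes below are invented for illustration:

```python
def recognition_factor(word_codes, dict_codes):
    """Count position-wise matches between the alphabetic codes of the word
    to be recognized and those of a dictionary word, a sketch of one
    entry of the value array described above."""
    return sum(1 for a, b in zip(word_codes, dict_codes) if a == b)

def recognize_word(word_codes, dictionary):
    """Pick the dictionary word with the highest recognition factor."""
    return max(dictionary,
               key=lambda w: recognition_factor(word_codes, dictionary[w]))
```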
The known method and device do not allow determining the topic of speech message and identifying the speaker. Said disadvantages limit probabilistic information used for recognition and do not allow achieving a high level of reliability of such recognition.
SUMMARY OF THE INVENTION
In this section following terms and definitions are used:
Acoustic model (AM) - a set of statistical features of separate speech sounds, which enables determination of the most probable word sequences.
Language (linguistic) model (LM) - an aggregation of the possible word sequences in oral speech.
Syntagma - an aggregation of several words, combined according to semantic, grammatical and phonetic rules of the language.
Background of a word (hereinafter referred to as - the background) - a unit of speech sound level, which can be extracted from the speech flow regardless of its phonemic place (i.e. without assigning it to one or another phoneme) or as a specific realization of the phoneme in speech.
The object of the present invention is to provide a technical solution, allowing recognition of speech messages with a high level of reliability and providing a multilevel textual layout which enables assigning separate phrases to different speakers.
The object is achieved by a method comprising the steps of receiving an audio signal, preprocessing the received audio signal by extracting a speech component of the signal, and initial decoding of the speech component using speech pattern dictionary data. The method is characterized in that during the preprocessing stage, time points indicating a change of speaker and syntagma borders are determined, and data on syntagma borders are used during the initial decoding stage; said method also comprises a stage of determining the speech component topic using a topic classifier; secondary decoding of the speech component using the data on the topic of the speech component and speech pattern dictionary data so as to obtain word sequences in text form; identification of speakers using data on models of the speakers; and logical processing of the obtained word sequence using data on the topic of the speech component and data on the identities of the speakers so as to obtain a multilevel textual layout. The technical result achieved by the proposed method is increased reliability of recognition of speech messages. The proposed method allows processing complex speech messages generated by single or multiple speakers and comprising intervals with different topics and recording quality. Statistical information about the content of the speech message, sufficient to achieve a high level of reliability of recognition, is provided in the form of several hypotheses resulting from multifactorial preprocessing and the use of the topic classifier and models of speakers.
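The claimed sequence of stages can be sketched end to end; every function body below is a toy stand-in (not the patented algorithms) whose only purpose is to show the order in which the stages feed each other:

```python
# Toy stand-ins for the claimed stages; each name and body is an
# illustration of the processing order only.
def preprocess(audio):
    return audio.strip(), ["spk1"], audio.split(".")

def initial_decode(speech, syntagmas):
    return speech.lower()

def classify_topic(hypothesis):
    return "sports" if "goal" in hypothesis else "general"

def secondary_decode(speech, topic):
    return speech.lower().split()

def identify_speakers(speech, changes):
    return changes

def logical_processing(words, topic, speakers):
    return {"topic": topic, "speakers": speakers, "text": " ".join(words)}

def recognize_message(audio):
    """Sketch of the claimed stage order: preprocess, initial decode,
    topic classification, secondary decode, speaker ID, logical processing."""
    speech, speaker_changes, syntagmas = preprocess(audio)
    hypothesis = initial_decode(speech, syntagmas)
    topic = classify_topic(hypothesis)
    words = secondary_decode(speech, topic)
    speakers = identify_speakers(speech, speaker_changes)
    return logical_processing(words, topic, speakers)
```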
According to yet another embodiment, a method is proposed wherein, after the secondary decoding, grammatical agreement of the obtained word sequence is provided.
According to yet another embodiment, a method is proposed wherein, after receiving the audio signal, it is converted to a format suitable for recognition.
The object of the present invention may also be achieved by a proposed device for recognition of speech messages, the device comprising an audio signal receiving unit; a preprocessing unit for preprocessing the received audio signal, the preprocessing unit being arranged to extract a speech component of the audio signal from the received audio signal; and a speech recognition unit comprising a decoder, the decoder being arranged to provide initial decoding of the speech component of the audio signal using speech pattern dictionary data.
The device is characterized in that it comprises a topic classification unit arranged to determine the topic of the speech component, and the decoder of the speech recognition unit is a two-pass decoder arranged to perform secondary decoding so as to obtain word sequences in text form, wherein the speech recognition unit is arranged to use data for the topic of said speech component received from the topic classification unit, and the preprocessing unit is arranged to determine time points indicating changes of speaker and syntagma borders. The device also comprises an identification unit for identification of speakers using the data for models of the speakers, and a logical unit arranged to perform logical processing of the obtained word sequence using data for the topic of the speech component and data for the identities of the speakers so as to obtain a multilevel textual layout.
The technical result achieved by the device is increased reliability of recognition of speech messages. The proposed device provides a high level of reliability of recognition of speech messages due to multifactorial preprocessing and the use of the topic classification unit and speaker models, which provide statistical information sufficient for reliable recognition of the speech message.
According to another embodiment of the device, the preprocessing unit is arranged to determine the types and degrees of noises and distortions of the audio signal.
According to another embodiment of the device, the audio signal receiving unit is arranged to provide data exchange with the user, data processing control, loading of speech data from different sources, and outputting of resulting information.
The proposed device according to yet another embodiment comprises a conversion unit arranged to convert input data, having various formats and stored on various storage mediums, to a format suitable for speech recognition.
The proposed device according to yet another embodiment comprises a grammatical agreement unit arranged to provide grammatical analysis of the word sequences obtained during secondary decoding.
The proposed device according to yet another embodiment comprises storage mediums for storing results of speech message recognition.
BRIEF DESCRIPTION OF DRAWINGS
A detailed description of the proposed invention is provided below with reference to figures, wherein
Fig. 1 shows the preferred embodiment of the device for recognition of speech messages according to the present invention.
Fig. 2 shows the preferred embodiment of the method for recognition of speech messages according to the present invention.
DETAILED DESCRIPTION OF THE INVENTION
Fig. 1 shows the preferred embodiment of the device for recognition of speech messages according to the present invention.
The proposed device according to the preferred embodiment is a hardware and software system comprising, for example, a computer system. As shown in Fig. 1, the proposed device comprises user interface 1, recognition job formation unit 2, preprocessing unit 3, speech recognition unit 4 comprising a decoder, topic classification unit 5, annotation unit 6, learning unit 7, linguistic processing unit 8, result saving unit 9, post-processing unit 10, speaker identification unit 11, logical unit 12, compensational unit 14, detecting elements 15 for detecting presence/absence of speech, computing device 16 for computing the Signal-to-Noise Ratio (SNR), and artificial neural networks 17 (shown in Fig. 2).
The operation principle of the proposed invention, with a detailed description of the interconnections between its units, is disclosed below. The audio signal is received by interface 1, which enables initiating the recognition process after the audio signal is received. The interface may comprise a keyboard and a computer mouse, and may further comprise means for receiving an audio signal, means for loading speech data from various data sources, and means for outputting resulting information. Said interface 1 enables the user to control the data processing operation. Speech data enter unit 2 through interface 1.
Unit 2 is an intermediate unit between interface 1 and unit 3. Unit 2 converts input data, having various formats and stored on various storage mediums, to a format suitable for speech recognition. After appropriate conversion, the speech signal is transmitted from unit 2 to unit 3.
Unit 3 converts the speech signal to a set of speech features, such as FBANK1 and FBANK2, which result from processing the speech message with mel-frequency filter banks; F0, the fundamental frequency values of the speech signal; and MFCC, the mel-frequency cepstral coefficients. These features allow extracting the informational component of the signal and reducing the inter-speaker and inter-session variability of the original signal.
Conversion of said original signal is performed using known algorithms, such as MFCC, FFT and LCRC. As can be seen from Fig. 2, the main feature set (FBANK1) is received by compensational unit 14, where initial tuning to the speech message transmission channel is performed, while detecting element 15 and computing element 16 determine the recording quality and provide data for further processing in unit 4. Furthermore, during the initial tuning to the transmission channel, compensation of the FBANK1 features is performed constantly, which removes from the input signal the constant distortions introduced by the frequency response of the transmission channel.
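As an illustration of the feature extraction performed in unit 3, the sketch below computes log mel filter-bank energies (FBANK-style features) with plain numpy. It is a minimal sketch, not the patented implementation; the frame length, hop, FFT size and filter count are assumed typical values for 16 kHz speech.

```python
import numpy as np

def mel_filterbank_features(signal, sr=16000, frame_len=400, hop=160, n_filters=26):
    """Log mel filter-bank energies (FBANK-style) for a mono signal."""
    # Frame the signal and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum of each frame
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:
            fbank[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)  # rising slope
        if r > c:
            fbank[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)  # falling slope
    # Log energies; small constant avoids log(0)
    return np.log(power @ fbank.T + 1e-10)
```

MFCC features would then be obtained by applying a DCT to these log energies, and F0 by a separate pitch tracker.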
The other supporting sets (FBANK2) are received by detecting elements 15 for detecting presence/absence of speech, noises and unwanted signals, by computing element 16 for computing the Signal-to-Noise Ratio (SNR), and by artificial neural networks 17 (ANN 1, ANN 2). The neural networks compute posterior probabilities of the input data vector belonging to the background states. These probabilities are computed without taking into consideration the admissible phoneme pattern of speech; the pattern is taken into account in unit 4 during the decoding operation. Time points indicating changes of speaker and syntagma borders are determined using the remaining sets of features. Furthermore, intervals containing speech are extracted in unit 3, and the types and degrees of noises and distortions of the audio signal are determined. Unit 3 enables extracting several common types of distortions having the greatest impact on reliability of recognition: non-linear distortions (overload) and additive noises of the transmission channel. In order to evaluate these distortions within a speech signal, the signal-to-noise ratio is computed and intervals with amplitude changes typical of overload distortions are determined. An important object of unit 3 is to determine the informative part of the speech signal. This saves recognition time by excluding silent intervals from the recognition operation. After appropriate preprocessing, the speech signal is transmitted from unit 3 to unit 4.
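The recording-quality measurements attributed to elements 15 and 16 can be illustrated with a simple energy-based sketch. The quantile-based noise-floor estimate and the 10 dB decision threshold below are our assumptions; the patent does not disclose the actual detection algorithm.

```python
import numpy as np

def frame_energies(signal, frame_len=400, hop=160):
    """Mean energy of each analysis frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.array([np.mean(signal[i * hop:i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

def estimate_snr(signal, frame_len=400, hop=160):
    """Rough SNR in dB: the loudest frames are taken as speech,
    the quietest as the noise floor."""
    e = frame_energies(signal, frame_len, hop)
    speech_e = np.quantile(e, 0.8)
    noise_e = np.quantile(e, 0.2) + 1e-12  # avoid division by zero
    return 10 * np.log10(speech_e / noise_e)

def speech_frames(signal, frame_len=400, hop=160, thresh_db=10.0):
    """Per-frame speech/non-speech decision relative to the noise floor."""
    e = frame_energies(signal, frame_len, hop)
    noise_e = np.quantile(e, 0.2) + 1e-12
    return 10 * np.log10(e / noise_e) > thresh_db
```

Dropping the frames flagged as non-speech is what allows the silent intervals to be excluded from the recognition operation.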
Unit 4 enables determining the most probable grammatical hypothesis for the unknown phrase, i.e. the most probable path through a recognition network, which consists of word models (which in turn are formed of models of single backgrounds). The likelihood of a hypothesis depends on the two following factors: the probabilities of the background sequences, assigned by the acoustic model, and the probability of the consecutive arrangement of words; in particular, the probabilities of the background sequences within a word (a word can have several pronunciation variants) are multiplied by the probability of the consecutive arrangement of words. The operational speed of unit 4 is acceptable and is achieved by performing the search with a limit, i.e. analyzing not all possible partial paths in the recognition network, but only those whose total likelihood is greater than a certain limit. Furthermore, at any specific time the likelihood of the most probable partial path in the model is used to define a lower search limit, and all paths with a likelihood lower than the defined limit are excluded from further analysis.
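The search-with-a-limit strategy described above is essentially beam pruning. The toy decoder below keeps only partial paths whose combined acoustic and language-model score stays within a fixed beam of the best path; the `frames_logprobs` and `lm_logprob` interfaces are hypothetical stand-ins for the acoustic model and the word-arrangement probabilities.

```python
import math

def beam_decode(frames_logprobs, lm_logprob, vocab, beam=5.0):
    """Minimal beam search over word sequences.

    frames_logprobs: list of dicts, acoustic log-probability per word per step.
    lm_logprob(seq, w): log-probability of word w following sequence seq.
    Paths scoring more than `beam` below the best path are pruned.
    """
    hyps = {(): 0.0}  # partial word sequence -> total log score
    for step_scores in frames_logprobs:
        new_hyps = {}
        for seq, score in hyps.items():
            for w in vocab:
                # acoustic score multiplied by LM score (added in log domain)
                s = score + step_scores[w] + lm_logprob(seq, w)
                key = seq + (w,)
                if s > new_hyps.get(key, -math.inf):
                    new_hyps[key] = s
        best = max(new_hyps.values())
        # prune: discard paths below the lower search limit
        hyps = {k: v for k, v in new_hyps.items() if v >= best - beam}
    return max(hyps, key=hyps.get)
```

A real decoder expands background-level (sub-word) models rather than whole words per step, but the pruning logic against a lower search limit is the same.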
In the proposed device, the language model is formed using unit 8. At the same time, the operational speed of unit 6 may also be increased by training the acoustic models with the help of unit 7. Said training involves reconfiguration of the acoustic models using the results of previous recognition.
Unit 4 comprises a two-pass decoder, which enables gradual complication of the search conditions when searching for the most probable word sequence. As shown in Fig. 2, fine tuning of the acoustic models to the speech recording conditions is performed based on information about the signal quality of the speech message, and data on the topic of the message enable selection of the language model appropriate for the particular topic on the second pass of the decoder. It should be noted that, when carrying out recognition, a separate conversion of the speech message features, mapping the characteristics of each speaker to some "average" speaker, is performed for every speaker. For fine tuning of the acoustic models to the speech recording conditions, depending on the degree of unwanted signals, compensation of the speech message spectrum is used. For some non-speech events (for example, cracks, honks, music) the corresponding information is supplied directly to the decoder, which uses a different decoding mode on these intervals, allowing only propagation of existing hypotheses and not generating new word hypotheses. This allows excluding words known to be incorrect from the recognition results.
Furthermore, for each topic a separate language model is formed in advance, for example a language model of political news or a language model of sportscasts. After determination of the topic, on the second pass the decoder selects the language model most appropriate for the given topic. This allows more accurate recognition of words and chunks of language specific to each topic.
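The topic-dependent second pass can be sketched as selecting, from several pre-built language models, the one that best fits the first-pass transcript, and then rescoring the hypotheses with it. Choosing the model by first-pass log-likelihood is an assumption; the patent does not specify the selection criterion, and the unigram models and weights below are purely illustrative.

```python
import math

def select_topic_lm(first_pass_words, topic_lms):
    """Pick the topic language model (here: unigram dicts) with the highest
    log-likelihood on the first-pass word sequence."""
    def loglik(lm):
        # unseen words get a small floor probability
        return sum(math.log(lm.get(w, 1e-6)) for w in first_pass_words)
    return max(topic_lms, key=lambda topic: loglik(topic_lms[topic]))

def rescore(hypotheses, lm, acoustic_weight=1.0, lm_weight=1.0):
    """Re-rank (acoustic_score, words) hypotheses with the selected topic LM."""
    def score(h):
        ac, words = h
        return acoustic_weight * ac + lm_weight * sum(
            math.log(lm.get(w, 1e-6)) for w in words)
    return max(hypotheses, key=score)
```

Real systems would use n-gram models with smoothing rather than raw unigram dictionaries, but the two-stage select-then-rescore flow is the same.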
As shown in Fig. 1, the speech component is transmitted from unit 4 to unit 5, where a topic is assigned to the speech component, and it is then transmitted to unit 6, where an annotation may be compiled for the speech component. The annotation contains several sentences from the complete recognition result; said sentences are selected according to the criterion of "information gain". To achieve higher accuracy of recognition, the device also comprises units 10, 11 and 12. As shown in Fig. 2, the speech component may be transmitted from unit 4 directly to unit 10, which analyzes the context of the recognition results; in accordance with the grammar rules of the Russian language, grammatical agreement of case and gender endings is achieved in this unit and possible recognition errors are corrected.
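The patent does not define its "information gain" criterion for annotation, so the sketch below approximates it with inverse sentence frequency: sentences containing words that appear in few other sentences are taken as more informative. This is an illustrative stand-in only, not the disclosed method.

```python
import math
from collections import Counter

def annotate(sentences, k=2):
    """Pick the k most informative sentences from a recognition result.

    'Information gain' is approximated (our assumption) by the average
    log inverse sentence frequency of a sentence's words."""
    # number of sentences each word occurs in
    df = Counter(w for s in sentences for w in set(s.split()))
    n = len(sentences)
    def gain(s):
        words = s.split()
        return sum(math.log(n / df[w]) for w in words) / max(len(words), 1)
    return sorted(sentences, key=gain, reverse=True)[:k]
```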
The last stage of recognition is performed in the logical unit 12. While performing recognition of the speech component in unit 12, probabilistic information received by the decoder from unit 11 and unit 5 is used. Unit 11 comprises models of speakers, provided as acoustic feature vectors whose dimensionality lies in the range of 200-300. Models of speakers are formed automatically according to patterns of recordings of certain people whose speech is present in the speech message. Unit 11 may comprise information storing means for storing the models of speakers. In unit 12, diverse hypotheses are combined as described above to provide the most probable word chain, in which the borders of sentences and of intervals with different topics and voices are defined.
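Matching an utterance against the speaker models of unit 11 can be sketched as a nearest-neighbor search over the acoustic feature vectors. Cosine similarity and the acceptance threshold are our assumptions; the patent only states that the models are feature vectors of dimensionality 200-300.

```python
import numpy as np

def identify_speaker(utterance_vec, speaker_models, threshold=0.5):
    """Match an utterance-level feature vector against enrolled speaker
    model vectors by cosine similarity.

    Returns (speaker_name, similarity), or (None, similarity) if no
    enrolled speaker scores above the threshold."""
    def cos(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    best, best_sim = None, -1.0
    for name, model in speaker_models.items():
        sim = cos(utterance_vec, model)
        if sim > best_sim:
            best, best_sim = name, sim
    return (best, best_sim) if best_sim >= threshold else (None, best_sim)
```

The per-speaker decisions would then feed the logical unit, which assigns phrases to speakers in the multilevel textual layout.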
Finally, the recognition result is received by unit 9, which saves the result. Fig. 2 shows the preferred embodiment of the method for recognition of speech messages according to the present invention. As can be seen from Fig. 2, at the initial processing stage the received audio signal is transmitted to the speech features evaluation unit, which preprocesses the received audio signal so as to obtain several sets of features, used during further processing by the other units, and the speech component of the signal, as described above. According to the proposed method, at the next stage speech decoding and identification of the speaker are performed.
For decoding, a two-pass decoder is used, which enables gradual complication of the search conditions when searching for the most probable word sequence. According to the proposed method, at the next stage the speech component is supplied to the post-processing unit, which analyzes the context of the recognition results; in accordance with the grammar rules of the Russian language, grammatical agreement of case and gender endings is achieved in this unit and possible recognition errors are corrected.
According to the proposed method, at the next stage the diverse hypotheses are combined in the logical unit to provide the most probable word chain, in which the borders of sentences and of intervals with different topics and voices are defined. Use of a single speech features evaluation unit allows a significant reduction of the computational costs at this processing stage.
The recognition result is saved during the final processing stage.
The proposed method allows converting an input speech message to text with high reliability.

Claims

WE CLAIM
1. A method for recognition of speech messages, the method comprising steps of: receiving an audio signal,
preprocessing of the received audio signal by extracting a speech component of said signal, initial decoding of the speech component using speech pattern dictionary data,
the method characterized in that
during preprocessing stage time points indicating change of speaker and syntagma borders are determined,
data on syntagma borders are used during initial decoding stage,
speech component topic is determined using a topic classifier,
and the method comprises the steps of:
secondary decoding of the speech component using the data for the topic of the speech component and speech pattern dictionary data so as to obtain word sequences in the text form; identification of speakers using data for models of the speakers;
logical processing of the obtained word sequence using data for the topic of the speech component and data for identities of the speakers so as to obtain a multilevel textual layout.
2. The method according to claim 1, wherein after the secondary decoding a grammatical agreement of the obtained word sequence is provided.
3. The method according to claim 2, wherein, after receiving the audio signal, conversion thereof to a format suitable for recognition is provided.
4. A device for recognition of speech messages, the device comprising
an audio signal receiving unit,
a preprocessing unit for preprocessing the received audio signal, the preprocessing unit is arranged to extract a speech component of the audio signal from the received audio signal,
a speech recognition unit, comprising a decoder, the decoder is arranged to provide initial decoding of the speech component of the audio signal using speech pattern dictionary data, the device characterized in that it comprises a topic classification unit arranged to determine the speech component topic, and
the decoder of the speech recognition unit is a two-pass decoder arranged to perform secondary decoding so as to obtain word sequences in the text form, wherein the speech recognition unit is arranged to use data for the topic of said speech component received from the topic classification unit,
the preprocessing unit is arranged to determine time points indicating change of speaker and syntagma borders, and
the device also comprises
an identification unit for identification of speakers using the data for models of the speakers, and
a logical unit arranged to perform logical processing of the obtained word sequence using data for the topic of the speech component and data for identities of the speakers so as to obtain a multilevel textual layout.
5. The device according to claim 4, wherein the preprocessing unit is arranged to determine types and degrees of noises and distortions of the audio signal.
6. The device according to claim 4, wherein the audio signal receiving unit is arranged to provide data exchange with the user, data processing control, loading of speech data from different sources and outputting resulting information.
7. The device according to claim 4, comprising a conversion unit arranged to convert input data, having various formats and stored on various storage mediums, to a format suitable for speech recognition.
8. The device according to claim 4, comprising a grammatical agreement unit arranged to provide grammatical analysis of the word sequences obtained during secondary decoding.
9. The device according to claim 4, comprising storage mediums for storing results of speech message recognition.
PCT/RU2012/000667 2012-08-10 2012-08-10 Method for recognition of speech messages and device for carrying out the method WO2014025282A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/RU2012/000667 WO2014025282A1 (en) 2012-08-10 2012-08-10 Method for recognition of speech messages and device for carrying out the method
EP12837648.0A EP2883224A1 (en) 2012-08-10 2012-08-10 Method for recognition of speech messages and device for carrying out the method

Publications (1)

Publication Number Publication Date
WO2014025282A1 true WO2014025282A1 (en) 2014-02-13

Family

ID=48014276

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2012/000667 WO2014025282A1 (en) 2012-08-10 2012-08-10 Method for recognition of speech messages and device for carrying out the method

Country Status (2)

Country Link
EP (1) EP2883224A1 (en)
WO (1) WO2014025282A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1069551A2 (en) 1999-07-16 2001-01-17 Bayerische Motoren Werke Aktiengesellschaft Speech recognition system and method for recognising given speech patterns, specially for voice control
US6434520B1 (en) * 1999-04-16 2002-08-13 International Business Machines Corporation System and method for indexing and querying audio archives
RU2223554C2 (en) 1998-09-09 2004-02-10 Асахи Касеи Кабусики Кайся Speech recognition device
US20050010411A1 (en) * 2003-07-09 2005-01-13 Luca Rigazio Speech data mining for call center management
RU2296376C2 (en) 2005-03-30 2007-03-27 Открытое акционерное общество "Корпорация "Фазотрон - научно-исследовательский институт радиостроения" Method for recognizing spoken words
US20070118372A1 (en) * 2005-11-23 2007-05-24 General Electric Company System and method for generating closed captions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DENENBERG L ET AL: "Gisting conversational speech in real time", 1993 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 1993. ICASSP-93; [PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP)], PISCATAWAY, NJ, USA, vol. 2, 27 April 1993 (1993-04-27), pages 131 - 134, XP010110411, ISBN: 978-0-7803-0946-3, DOI: 10.1109/ICASSP.1993.319249 *

Also Published As

Publication number Publication date
EP2883224A1 (en) 2015-06-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12837648

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2012837648

Country of ref document: EP