AN AMPLITUDE WARPING APPROACH TO INTRA-SPEAKER NORMALIZATION FOR SPEECH RECOGNITION
Technical Field
The present invention relates to an amplitude warping method of intra-speaker normalization for speech recognition. More specifically, the present invention relates to an amplitude warping method of intra-speaker normalization which applies a recognition model suited to a user's voice to a speaker recognition applied product and applies a speaker adaptive method to a recognition device, in order to improve the recognition rate and reliability of the product by achieving amplitude normalization according to pitch alteration utterance through intra-speaker warping factor estimation.
Background Art
Generally, the glottis of a speaker's vocal cords controls the pitch of a voice, while the vocal tract determines vowels through formants and articulates consonants. The pitch and formant components of an uttered speech are nearly independent components of the speech signal.
Frequency warping has been studied as a way to prevent the deterioration of speech recognition performance due to differences in vocal tract shape between speakers. That is, techniques that normalize the parametric representation of a speech signal have been researched in order to reduce the effect of inter-speaker differences.
Hence, in order to compensate for the shift of formant locations between speakers, normalization is performed using linear and nonlinear frequency warping functions. Such processes attempt to conform the actual vocal tract shapes of the speakers to each other and to avoid the complicated problem of formant location estimation that such compensation would otherwise require.
When Gaussian mixtures are used as the output distribution of a Hidden Markov Model, one main problem is that the various speaker dependent scale factors tend to be modeled by multiple modes of the mixture distributions.
Also, an inter-speaker factor plays an important part in speech recognition and is addressed by vocal tract normalization (VTN), a speaker adaptation technique requiring normalization between speakers. The present attempt is made to reduce the variation of speech by compensating for the change of pitch alteration utterance according to emotional state.
A conventional vocal tract normalization method is used to improve the accuracy of normalization between speakers and is based on frequency axis normalization.
Hereinafter, the frequency axis normalization for normalization between speakers will be described.
The object of VTN is to normalize the frequency axis for each speaker in order to eliminate speaker dependent variability from the acoustic vectors during speech recognition. For a given utterance, the locations of the formant peaks of the spectrum are inversely proportional to the length of the vocal tract.
The length of the vocal tract ranges from 13 cm to 18 cm, and the formant center frequencies consequently vary between speakers by about 25 %.
Such variation elements deteriorate the performances of speaker dependent and independent speech recognitions.
An optimal warping factor is obtained by searching 13 values at uniform intervals in the range 0.88 ≤ α ≤ 1.12. For example, the 13 values 0.88, 0.90, 0.92, ..., 1.12 are obtained by dividing the range uniformly at intervals of 0.02.
The range of α is selected in order to reflect the variation of 25 % in the length of the vocal tract which is found in adults.
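The candidate grid described above can be sketched as follows; this is a minimal illustration (the likelihood-based search over the candidates is not shown), and the function name is an assumption:

```python
# Sketch: enumerate the 13 candidate frequency warping factors described
# above -- 0.88 to 1.12 in uniform steps of 0.02, reflecting the roughly
# 25 % inter-speaker variation in vocal tract length.
def candidate_warping_factors(lo=0.88, hi=1.12, step=0.02):
    n = round((hi - lo) / step) + 1          # 13 candidates
    return [round(lo + i * step, 2) for i in range(n)]

alphas = candidate_warping_factors()
# alphas == [0.88, 0.90, 0.92, ..., 1.12], 13 values in total
```

In a full system, each candidate α would be scored by the recognizer's likelihood and the best-scoring value retained per speaker.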
Various methods have been proposed to determine the optimal magnitude of frequency warping in speech recognition.
A sequence of acoustic vectors in speech recognition is observed over time t = 1, 2, ..., T. Namely, X = x_1 x_2 ... x_T.
A model distribution p(X|W;θ) is assumed using reference model parameters θ suited to each hypothesized word sequence W. It is assumed that the following distribution holds for a feature parameter α of any speaker in acoustic speaker adaptation modeling:
p(X|W;θ,α)
The model distribution is transformed via conversions of two variables.
Firstly, the conventional method converts the model parameter θ, which is not normalized, for the respective values of the feature parameter between speakers. That is, it converts the model parameter θ into a normalized parameter θ_α:
θ → θ_α
Accordingly, the model distribution p(X|W;θ,α) becomes p(X|W;θ_α).
Secondly, the conventional method converts the observation vector X. The conversion of the observation vector X is formalized by a mapping of acoustic vectors:
X → X_α
At this time, the model distribution p(X|W;θ,α) becomes p(X_α|W;θ).
As described above, the conventional vocal tract normalization method compensates for the variation in the frequency axis of the spectrum envelope component of an utterance through normalization of the frequency axis. That variation arises because the length of the vocal tract varies from person to person. However, the conventional vocal tract normalization method compensates only for the frequency difference and not for the amplitude difference.
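As a minimal illustration of the conventional frequency-axis normalization (not the patent's exact implementation), a linear warping of a spectrum's frequency axis can be sketched by resampling; the function `warp_spectrum` and the interpolation approach are assumptions for illustration only:

```python
import numpy as np

# Illustrative sketch: warp the frequency axis of a magnitude spectrum by a
# linear factor alpha, mapping frequency bin k to source position k/alpha
# and resampling with linear interpolation. alpha = 1 leaves the spectrum
# unchanged; alpha > 1 compresses the formant locations toward 0 Hz.
def warp_spectrum(spectrum, alpha):
    n = len(spectrum)
    src = np.clip(np.arange(n) / alpha, 0, n - 1)   # warped source positions
    return np.interp(src, np.arange(n), spectrum)

spec = np.linspace(1.0, 0.0, 257)                   # toy magnitude envelope
same = warp_spectrum(spec, 1.0)                     # identity warp
```

Note that this rescales only frequency positions; amplitudes are carried over unchanged, which is exactly the limitation the invention addresses.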
Disclosure of Invention
Therefore, it is an object of the present invention to provide an amplitude warping method of intra-speaker normalization for speech recognition, which applies a recognition model suited to the user's voice to a speaker recognition applied product and applies a speaker adaptive method to a recognition device, in order to improve the recognition rate and reliability of the product by achieving amplitude normalization according to pitch alteration utterance through intra-speaker warping factor estimation.
According to the present invention, there is provided an amplitude warping method of intra-speaker normalization for speech recognition, the method comprising the steps of:
calculating an inverse ratio between the input and reference pitches of an input speaker and determining an amplitude warping factor, since the feature space of speech that is not transformed from intra-speaker pitch alteration utterance is variously distributed due to acoustic differences of a speaker arising from the glottis and vocal tract; and
determining an amplitude slope over the total frequency axis while adjusting the height and amplitude of a triangular filter.
Preferably, the calculation of the inverse ratio between the input and reference pitches of the input speaker computes p1/p2 as a new slope adjusting constant in order to adjust the amplitude to that of a first pitch p1, since the amplitude becomes higher when a second pitch p2 is high, wherein the first pitch p1 is the pitch during a normal utterance and the second pitch p2 is the pitch when the same voice is newly uttered.
More preferably, after the slope of the amplitude is determined, the resulting feature vectors are decoded with a hidden Markov model.
Brief Description of Drawings
The foregoing and other objects, features and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings in which:
FIGs. 1 and 2 are views showing linear prediction coefficient (LPC) spectrum envelopes of voiced sounds uttered by a man and a woman, respectively, according to a pitch change utterance;
FIG. 3 is a view showing a spectrum for a normal utterance (thick line) of a man's speech /a/ and an uttered speech with a reduced pitch (dotted line);
FIG. 4 is a view showing an intra-speaker feature parameter β according to pitch change utterance;
FIGs. 5a and 5b are views showing optimal frequency warping factor estimations based on mixture;
FIG. 6 is a view showing a mel-filter bank analysis for warping;
FIG. 7 is a view showing a mel-filter bank for an amplitude warping for intra-speaker normalization; and
FIG. 8 is a view showing a method of applying α and β in order.
Best Mode for Carrying Out the Invention
Hereinafter, an amplitude warping method of intra-speaker normalization for speech recognition according to a preferred embodiment of the present invention will be described with reference to the accompanying drawings.
At first, the process used to perform the amplitude warping method of intra-speaker normalization for speech recognition will be explained. The object of the process is to reduce the intra-speaker variation of speech by compensating for the transformation in pitch alteration utterance according to emotional state.
Since the distortions due to pitch change utterances are modeled by a simple linear warping in the frequency domain of the speech signal, the normalization adjusts the amplitude axis by a suitably estimated warping factor.
Prosody is known as an expression of the acoustic features of emotion. A feature parameter is analyzed from the voiced sound sections of speech waveform data; it is a main feature of the intra-speaker element.
FIGs. 1 and 2 are views showing linear prediction coefficient (LPC) spectrum envelopes of voiced sounds which a man and a woman uttered according to a pitch change utterance. That is, FIG. 1 shows the LPC spectrum envelope of a voiced sound uttered by a man according to a pitch change utterance; the voiced sound is the vowel /a/ with a pitch of 113 to 251 Hz. FIG. 2 shows the LPC spectrum envelope of a voiced sound uttered by a woman; the voiced sound is the vowel /a/ with a pitch range of 194 to 342 Hz. An energy gain in the higher harmonics is revealed by comparing the glottal flow waveforms. As the intensity of pronunciation increases, the closure rate of the glottis increases. A man's voice has a lower fundamental frequency and stronger harmonics than a woman's voice.
FIG. 3 is a view showing a spectrum for a normal utterance (thick line) of a man's speech /a/ and an uttered speech with a reduced pitch (dotted line). Due to the acoustic differences of a speaker arising from the glottis and vocal tract, the feature space of speech that is not transformed from intra-speaker pitch alteration utterance is variously distributed. To prevent this varied distribution of the speaker's feature space, an amplitude warping factor is determined by an inverse proportion calculation between the input and reference pitches of the input speaker. More specifically, the warping factor is the inverse ratio between the input and reference pitches of the input speaker. While the height and amplitude of the triangular filter are adjusted using the amplitude warping factor, the amplitude slope over the total frequency axis is determined. As shown in FIGs. 6, 7, and 8, when a speech is analyzed and a feature is extracted from it by the MFCC method, the speech is analyzed by applying the triangular filter to the frequency axis. That is, the triangular filter is used to divide the frequency axis into predetermined bandwidths.
Since the amplitude becomes higher when the second pitch p2 is high, the calculation of the inverse ratio between the input and reference pitches of the input speaker computes p1/p2 as a new slope adjusting constant in order to adjust the amplitude to that of the first pitch p1. The first pitch p1 is the pitch during a normal utterance, and the second pitch p2 is the pitch when the same voice is newly uttered.
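The slope adjusting constant described above can be sketched directly; this is a minimal illustration, and the function name is an assumption:

```python
# Sketch of the slope adjusting constant described above: p1 is the pitch
# of the normal (reference) utterance, p2 the pitch of a new utterance of
# the same voice. Because a higher pitch p2 raises the amplitude, the ratio
# p1/p2 is used to rescale the amplitude back toward the reference.
def amplitude_warping_factor(p1_hz, p2_hz):
    if p1_hz <= 0 or p2_hz <= 0:
        raise ValueError("pitch must be positive")
    return p1_hz / p2_hz

beta = amplitude_warping_factor(120.0, 150.0)  # raised-pitch utterance
# beta = 0.8 < 1: amplitudes are scaled down toward the reference pitch p1
```

In practice p1 and p2 would come from a pitch tracker run over the voiced sections of the reference and input utterances.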
The resulting feature vectors are used in HMM decoding. Each test utterance is warped so that its amplitude scale is normalized to match the HMM model.
FIG. 4 is a view showing the intra-speaker feature parameter β according to pitch change utterance. The intra-speaker feature scale factor is closely related to the energy of the spectrum. In FIG. 4, the intra-speaker feature parameter β, which satisfies β = p1/p2, is estimated using pitch and energy.
It is assumed that the following distribution holds for a feature parameter β of any speaker in acoustic speaker adaptation modeling:
p(X|W;θ,α,β)
The model distribution is transformed via conversions of two variables.
Firstly, the method converts the model parameter θ, which is not normalized, for the intra-speaker feature parameter β. That is, it converts the model parameter θ into a normalized parameter θ_β:
θ → θ_{α,β} (or θ → θ_β)
Accordingly, the model distribution p(X|W;θ,α,β) becomes p(X|W;θ_{α,β}) or p(X|W;θ_β).
Secondly, the method according to the present invention converts the observation vector X. The conversion of the observation vector X is formalized by a mapping of acoustic vectors:
X → X_{α,β} (or X → X_β)
At this time, the model distribution p(X|W;θ,α,β) becomes p(X_{α,β}|W;θ) or p(X_β|W;θ_α).
The intra-speaker scale element β of the amplitude axis is used to rescale the amplitude axis prior to calculating the acoustic vectors for speech recognition.
The amplitude warping method of intra-speaker normalization for speech recognition according to the present invention was evaluated using the SungKyunKwan University-1 (SKKU-I) speech database.
The vocabulary of the SKKU-I speech database includes Korean numeral sounds, names, phonetically balanced words (PBW), and phonetically rich words (PRW).
The speech signal is pre-emphasized with the high-pass filter 1 − 0.95z⁻¹, a 20 ms Hamming window is applied, and the signal is analyzed in 10 ms units. A 39-dimensional feature vector is extracted from each frame.
The features consist of a 12-order mel-frequency cepstrum coefficient (MFCC) vector, a 12-order delta-MFCC vector, a 12-order delta-delta-MFCC vector, log energy, delta log energy, and delta-delta log energy.
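The framing stage of this front end can be sketched as follows; the sample rate and function name are assumptions, and the MFCC and delta computations that follow framing are omitted:

```python
import numpy as np

# Sketch of the front end described above: pre-emphasis with 1 - 0.95 z^-1,
# then 20 ms Hamming-windowed frames advanced every 10 ms.
def frame_signal(x, fs=16000, win_ms=20, hop_ms=10, preemph=0.95):
    x = np.append(x[0], x[1:] - preemph * x[:-1])   # 1 - 0.95 z^-1
    win, hop = int(fs * win_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] for i in range(n_frames)])
    return frames * np.hamming(win)                  # windowed frames

frames = frame_signal(np.random.randn(16000))        # 1 s of audio at 16 kHz
# frames.shape == (99, 320): 99 frames of 320 samples (20 ms) each
```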
When phonetically balanced words are selected, the PBW set includes more words than the PRW set. That is, while the PBW includes several tens of thousands of words, the number of words in the PRW is not regulated. The Hamming window is a representative window function applied to each analysis section when analyzing the speech.
FIG. 5a shows an optimal frequency warping factor estimation based on mixture. FIG. 5b shows an optimal frequency warping factor estimation based on frequency warping of the input speech. The speech is warped using the estimated warping factor, and the resulting feature vectors are used for HMM decoding.
FIG. 6 shows a mel-filter bank analysis for warping. For the amplitude normalization, first, pitch and energy are extracted from the utterance; secondly, the intra-speaker parameter is determined. FIG. 7 is a view showing a mel-filter bank of an amplitude warping for intra-speaker normalization. That is, FIG. 6 shows the mel-filter bank analysis when the factor α is determined, whereas FIG. 7 shows the mel-filter bank analysis when the factor β is determined.
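A triangular mel-filter bank whose heights are adjusted by β can be sketched as follows. The text does not give the exact height/slope rule, so a uniform scaling of all triangle heights by β is assumed here purely for illustration:

```python
import numpy as np

def mel(f):       # Hz -> mel
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):   # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Sketch: build triangular filters on the mel scale and rescale their
# heights by the intra-speaker factor beta = p1/p2 (uniform scaling is an
# assumption; the patent determines an amplitude slope over the axis).
def mel_filter_bank(n_filters=23, n_fft=512, fs=16000, beta=1.0):
    pts = mel_inv(np.linspace(mel(0.0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            bank[i, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            bank[i, k] = (r - k) / max(r - c, 1)   # falling edge
    return beta * bank                              # beta adjusts heights

bank = mel_filter_bank(beta=0.8)                    # lowered triangle heights
```

Multiplying a power spectrum by `bank` then yields β-scaled filter-bank energies, from which the MFCCs are computed as usual.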
Table 1 shows the word error rates for numerals and words in the SKKU-I database when using a baseline recognition device, when using the baseline device with normalization between speakers, and when using the baseline device with both normalization between speakers and intra-speaker normalization.
[Table 1]
As shown in Table 1, "baseline" indicates the recognition error rate when no normalization is applied. "With α" indicates the error rate when only the conventional frequency normalization is applied. "With α and β" indicates the error rate when both the conventional frequency normalization and the amplitude normalization according to the present invention are applied. The error decreases in the order of "baseline", "with α", and "with α and β".
Namely, the word recognition rates according to the recognition results are 96.4 % and 98.2 %, and the error rate is reduced by 0.4 % to 2.3 % for the numeral and word recognitions, respectively.
Industrial Applicability
As mentioned above, according to the amplitude warping method of intra-speaker normalization for speech recognition, amplitude normalization according to pitch alteration utterance is achieved through a new intra-speaker warping factor estimation.
Furthermore, by applying a recognition model suited to the user's voice to a speaker recognition applied product and applying a speaker adaptive method to the recognition device, the recognition rate and reliability of the speaker recognition applied product are improved.