WO2004111999A1 - An amplitude warping approach to intra-speaker normalization for speech recognition - Google Patents

An amplitude warping approach to intra-speaker normalization for speech recognition Download PDF

Info

Publication number
WO2004111999A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
amplitude
pitch
intra
recognition
Prior art date
Application number
PCT/KR2003/001216
Other languages
French (fr)
Inventor
Kwang-Seok Hong
Original Assignee
Kwangwoon Foundation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kwangwoon Foundation filed Critical Kwangwoon Foundation
Priority to AU2003244240A priority Critical patent/AU2003244240A1/en
Publication of WO2004111999A1 publication Critical patent/WO2004111999A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065: Adaptation
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316: Speech enhancement by changing the amplitude


Abstract

Disclosed is an amplitude warping approach to intra-speaker normalization for speech recognition, which applies a recognition model suited to the user's voice to a speaker-recognition product and applies a speaker-adaptive method to the recognizer, improving the recognition rate and reliability of the product by normalizing amplitude according to pitch-altered utterances through intra-speaker warping-factor estimation. Owing to the acoustic differences of a speaker arising from the voiceprint and the vocal tract, the feature space of speech not transformed from intra-speaker pitch-altered utterances is widely scattered. To prevent this scattering of the feature space, the inverse ratio between the speaker's input and reference pitches is used to determine an amplitude warping factor, and the amplitude slope over the whole frequency axis is determined while the height of each triangular filter and the amplitude are adjusted.

Description

AN AMPLITUDE WARPING APPROACH TO INTRA-SPEAKER NORMALIZATION FOR SPEECH RECOGNITION
Technical Field The present invention relates to an amplitude warping method of intra-speaker normalization for speech recognition. More specifically, it relates to a method which applies a recognition model suited to the user's voice to a speaker-recognition product and applies a speaker-adaptive method to the recognizer, improving the recognition rate and reliability of the product by normalizing amplitude according to pitch-altered utterances through intra-speaker warping-factor estimation.
Background Art
Generally, the voiceprint of the speaker's vocal cords controls the pitch of the voice. The vocal tract determines vowels through formants and articulates consonants. The pitch and formant components of an uttered speech signal are nearly independent of each other.
Frequency warping has been studied as a way to prevent the degradation of speech-recognition performance caused by differences in vocal-tract shape between speakers. That is, techniques that normalize the parameter representations of a speech signal have been researched in order to reduce the effect of inter-speaker differences.
Hence, to compensate for the shift of formant locations between speakers, normalization is performed using linear and nonlinear frequency warping functions. Such processes attempt to conform the real vocal-tract shapes of the speakers to one another and to avoid the complicated problem of formant-location estimation by compensating for these differences. When Gaussian mixtures are used as the output distribution in a Hidden Markov Model, one main problem is that the various speaker-dependent scale factors tend to be modeled as multiple modes of the mixture distributions.
Also, inter-speaker factors play an important part in speech recognition, and speaker adaptation requiring normalization between speakers is based on vocal tract normalization (VTN). The attempt is performed in order to reduce the variation of speech between speakers by compensating for the changes of pitch-altered utterances according to emotional state.
A conventional vocal tract normalization method is used to improve the accuracy of normalization between speakers and is based on normalization of the frequency axis.
Hereinafter, the frequency axis normalization for normalization between speakers will be described.
The object of the VTN is to normalize the frequency axis for each speaker in order to eliminate speaker-dependent variability from the acoustic vectors during speech recognition. For a given utterance, the locations of the formant peaks of the spectrum are inversely proportional to the length of the vocal tract.
The length of the vocal tract ranges from 13 cm to 18 cm, so the formant center frequency varies by about 25 % between speakers. Such variation degrades the performance of both speaker-dependent and speaker-independent speech recognition.
An optimal warping factor is obtained by searching 13 values at uniform intervals in the range 0.88 ≤ α ≤ 1.12. For example, the 13 values 0.88, 0.90, 0.92, ..., 1.12 are obtained by dividing α uniformly at intervals of 0.02 over the range.
The range of α is selected in order to reflect the variation of 25 % in the length of the vocal tract which is found in adults.
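This grid search can be sketched as follows; the scoring function is hypothetical (in practice it would be, e.g., the HMM likelihood of the utterance after warping by the candidate α), and the function name is an assumption, not from the patent:

```python
import numpy as np

# Candidate warping factors: 13 values from 0.88 to 1.12 at 0.02 intervals,
# chosen to cover the ~25 % variation in adult vocal-tract length.
alphas = np.arange(0.88, 1.12 + 1e-9, 0.02)

def best_warping_factor(utterance, score_fn):
    """Return the alpha whose warped features score highest under score_fn
    (hypothetical: e.g. HMM log-likelihood of the alpha-warped utterance)."""
    return max(alphas, key=lambda a: score_fn(utterance, a))
```

The search is exhaustive over only 13 points, so no gradient-based optimization is needed.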
So as to determine an optimal magnitude of a frequency warping in speech recognition, various methods have been proposed.
A sequence of acoustic vectors in speech recognition is observed over time t = 1, 2, ..., T. Namely, X = x_1, x_2, ..., x_T.
A model distribution p(X|W; θ) is assumed, using reference model parameters θ suited to each hypothesized word sequence W. It is assumed that the following likelihood exists for the distribution of a feature parameter α of any speaker in acoustic speaker-adaptation modeling: p(X|W; θ, α)
The model distribution is divided into conversions of two variables. Firstly, the conventional method converts the model parameter θ, which is not normalized, for the respective values of the feature parameter between speakers. That is, it converts the model parameter θ into a normalized parameter θ_α.
θ → θ_α. Accordingly, the model distribution p(X|W; θ, α) becomes p(X|W; θ_α).
Secondly, the conventional method converts the observation vector X. The conversion of X is formalized by mapping acoustic vectors:
X → X_α
At this time, the model distribution p(X|W; θ, α) becomes p(X_α|W; θ).
As described above, the conventional vocal tract normalization method compensates, through normalization of the frequency axis, for the variation along the frequency axis of the spectral envelope of an utterance, which arises because the vocal-tract length varies from person to person. However, the conventional method compensates only for the frequency difference, not for the amplitude difference.
Disclosure of Invention
Therefore, it is an object of the present invention to provide an amplitude warping method of intra-speaker normalization for speech recognition which applies a recognition model suited to the user's voice to a speaker-recognition product and applies a speaker-adaptive method to the recognizer, improving the recognition rate and reliability of the product by normalizing amplitude according to pitch-altered utterances through intra-speaker warping-factor estimation.
According to the present invention, there is provided an amplitude warping method of intra-speaker normalization for speech recognition, the method comprising the steps of: calculating the inverse ratio between the input and reference pitches of an input speaker and determining an amplitude warping factor, since the feature space of speech not transformed from intra-speaker pitch-altered utterances is widely scattered owing to the acoustic differences of a speaker arising from the voiceprint and the vocal tract; and determining the amplitude slope over the whole frequency axis while adjusting the height of each triangular filter and the amplitude.
Preferably, the calculation of the inverse ratio between the input and reference pitches computes p1/p2 as a new slope-adjusting constant in order to bring the amplitude back to that of the first pitch p1, since the amplitude becomes higher when the second pitch p2 is high; the first pitch p1 is the pitch of a normal utterance, and the second pitch p2 is the pitch when the same voice is uttered anew.
More preferably, after a slope of the amplitude is determined, resulting feature vectors are hidden Markov model-decoded.
Brief Description of Drawings
The foregoing and other objects, features and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings in which:
FIGs. 1 and 2 are views showing linear prediction coefficient (LPC) spectrum envelopes of voiced sounds which a man and a woman uttered according to a pitch change utterance, respectively;
FIG. 3 is a view showing a spectrum for a normal utterance (thick line) of a man's speech /a/ and an uttered speech with a reduced pitch (dotted line);
FIG. 4 is a view showing an intra-speaker feature parameter β according to pitch-changed utterance;
FIGs. 5a and 5b are views for showing optimal frequency warping factor estimations based on mixture;
FIG. 6 is a view showing a mel-filter bank analysis for warping; and
FIG. 7 is a view showing a mel-filter bank for an amplitude warping for intra-speaker normalization.
FIG. 8 shows a method of applying α and β in that order.
Best Mode for Carrying Out the Invention
Hereinafter, an amplitude warping method of intra-speaker normalization for speech recognition according to a preferred embodiment of the present invention will be described with reference to the accompanying drawings.
At first, a process used to perform the amplitude warping method of intra-speaker normalization for speech recognition will be explained. An object of the process is to reduce intra-speaker speech variation by compensating for the transformation of pitch-altered utterances according to emotion.
Since the distortions due to pitch-changed utterances are modeled by a simple linear warping in the frequency domain of the speech signal, the normalization adjusts the amplitude axis by a suitably estimated warping factor.
Cadence is known as an acoustic expression of emotion. A feature parameter is analyzed from the voiced sections of the speech waveform data; it is the main feature of the intra-speaker element. FIGs. 1 and 2 show linear prediction coefficient (LPC) spectrum envelopes of voiced sounds uttered by a man and a woman, respectively, under pitch-changed utterance. That is, FIG. 1 shows the LPC spectrum envelope of a voiced sound uttered by a man: the vowel /a/ with a pitch of 113 ~ 251 Hz. FIG. 2 shows the LPC spectrum envelope of a voiced sound uttered by a woman: the vowel /a/ with a pitch range of 194 ~ 342 Hz. The energy gain in the higher harmonics is indicated by comparing the flow waveforms of the voiceprint. As the intensity of a pronunciation increases, the closed rate of the voiceprint increases. A man's voice has a lower fundamental frequency and stronger harmonics than a woman's voice.
FIG. 3 is a view showing the spectrum of a normal utterance (thick line) of a man's speech /a/ and of an utterance with a reduced pitch (dotted line). Owing to the acoustic differences of a speaker arising from the voiceprint and the vocal tract, the feature space of speech not transformed from intra-speaker pitch-altered utterances is widely scattered. To prevent this scattering of the speaker's feature space, an amplitude warping factor is determined by an inverse-proportion calculation between the input and reference pitches of the input speaker; that is, the warping factor is the inverse ratio between the input and reference pitches. While the height of each triangular filter and the amplitude are adjusted using this factor, the amplitude slope over the whole frequency axis is determined. As shown in FIGs. 6, 7, and 8, when a speech signal is analyzed and features are extracted by the MFCC method, the speech is analyzed by applying triangular filters along the frequency axis; that is, the triangular filters divide the frequency axis into predetermined bandwidths.
Since the amplitude becomes higher when the second pitch p2 is high, the calculation of the inverse ratio between the input and reference pitches of the input speaker computes p1/p2 as a new slope-adjusting constant in order to bring the amplitude back to that of the first pitch p1. The first pitch p1 is the pitch of a normal utterance, and the second pitch p2 is the pitch when the same voice is uttered anew.
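As a minimal sketch of this calculation (the function name and the error check are my own; the patent only specifies the ratio p1/p2):

```python
def amplitude_warping_factor(p1, p2):
    """Slope-adjusting constant beta = p1/p2.

    p1: pitch of the normal (reference) utterance.
    p2: pitch of the newly uttered same voice.
    When p2 is higher the amplitude rises, so scaling by p1/p2
    pulls the amplitude back toward the reference level.
    """
    if p2 <= 0:
        raise ValueError("pitch must be positive")
    return p1 / p2
```

For example, with a reference pitch of 113 Hz and a raised utterance at 226 Hz, the factor is 0.5, halving the amplitude scale.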
The resulting feature vectors are used in HMM decoding. Each test utterance is warped so as to match an HMM model whose amplitude scale has been normalized.
FIG. 4 is a view showing the intra-speaker feature parameter β according to pitch-changed utterance. The intra-speaker feature scale factor is closely related to the energy of the spectrum. FIG. 4 estimates, using pitch and energy, the intra-speaker feature parameter β that satisfies p1 = β · p2.
It is assumed that the following likelihood exists for the distribution of a feature parameter β of any speaker in acoustic speaker-adaptation modeling: p(X|W; θ, α, β)
The model distribution is divided into conversions of two variables. Firstly, the method converts the model parameter θ, which is not normalized for the intra-speaker feature parameter β. That is, it converts the model parameter θ into a normalized parameter θ_{α,β}: θ → θ_{α,β} (or θ → θ_β)
Accordingly, the model distribution p(X|W; θ, α, β) becomes p(X|W; θ_{α,β}) or p(X|W; θ_β).
Secondly, the method according to the present invention converts the observation vector X. The conversion is formalized by mapping acoustic vectors: X → X_{α,β} (or X → X_β)
At this time, the model distribution p(X|W; θ, α, β) becomes p(X_{α,β}|W; θ) or p(X_β|W; θ_α).
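The two conversion routes above can be restated compactly; in the notation of this section, α is the inter-speaker frequency warping factor and β the intra-speaker amplitude factor:

```latex
% Route 1: normalize the model parameters
\theta \rightarrow \theta_{\alpha,\beta}, \qquad
p(X \mid W; \theta, \alpha, \beta) = p(X \mid W; \theta_{\alpha,\beta})

% Route 2: normalize the observation vectors
X \rightarrow X_{\alpha,\beta}, \qquad
p(X \mid W; \theta, \alpha, \beta) = p(X_{\alpha,\beta} \mid W; \theta)
```

Both routes remove the speaker-dependent factors from the likelihood; the invention takes the second route, rescaling the observation before feature extraction.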
The intra-speaker scale element β of the amplitude axis is used to rescale the amplitude axis prior to calculating acoustic vectors for speech recognition. The amplitude warping method of intra-speaker normalization for speech recognition according to the present invention is evaluated using the SungKyunKwan University-1 (SKKU-I) speech database.
The vocabulary of the SKKU-I speech database includes Korean numeral sounds, names, phonetically balanced words (PBW), and phonetically rich words (PRW).
A speech signal is pre-emphasized by the high-pass filter 1 - 0.95z^-1, a 20 ms Hamming window is applied, and the signal is analyzed in 10 ms units. A 39-dimensional feature vector is extracted from each frame.
The features include 12-order mel-frequency cepstrum coefficient (MFCC) vector, 12-order delta-MFCC vector, 12-order delta-delta-MFCC vector, log energy, delta log energy, and delta-delta energy.
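A rough sketch of this front end in Python; the 16 kHz sampling rate and the helper names are assumptions (the patent states only the pre-emphasis filter, window length, frame shift, and feature composition):

```python
import numpy as np

def preemphasize(x, coef=0.95):
    """High-pass pre-emphasis filter 1 - 0.95 z^-1 applied before analysis."""
    return np.append(x[0], x[1:] - coef * x[:-1])

def frame_count(n_samples, sr=16000, win_ms=20, hop_ms=10):
    """Number of full 20 ms Hamming-windowed frames at a 10 ms shift
    (sr=16000 is an assumed sampling rate, not stated in the text)."""
    win, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
    return 1 + max(0, (n_samples - win) // hop)

# 39-dimensional feature vector per frame, as listed in the text:
# 12 MFCC + 12 delta-MFCC + 12 delta-delta-MFCC
# + log energy + delta log energy + delta-delta energy
FEATURE_DIM = 12 * 3 + 3
```

One second of 16 kHz audio thus yields 99 full frames, each carrying a 39-dimensional vector.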
When a phonemically balanced word set is selected, the PBW (phonetically balanced word) set includes more words than the PRW (phonetically rich word) set. That is, the PBW includes several tens of thousands of words, while the number of words in the PRW is not regulated. The Hamming window is a representative window function applied to each analysis section of the speech.
FIG. 5a shows an optimal frequency warping factor estimation based on mixtures. FIG. 5b shows an optimal frequency warping factor estimation based on frequency warping of the input speech. The speech is warped using the estimated warping factor, and the resulting feature vectors are used for HMM decoding.
FIG. 6 shows a mel-filter bank analysis for warping. For the amplitude normalization, pitch and energy are first extracted from the utterance; the intra-speaker parameter is then determined. FIG. 7 is a view showing the mel-filter bank of the amplitude warping for intra-speaker normalization. That is, FIG. 6 shows how the mel-filter bank is analyzed when the factor α is determined, whereas FIG. 7 shows how it is analyzed when the factor β is determined.
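The patent does not spell out how β reshapes the triangular filters beyond "adjusting height of triangle filter". One plausible reading, sketched below, linearly interpolates the filter heights from 1 to β across the bank so that an overall amplitude slope is imposed on the frequency axis; the interpolation scheme, the linear (non-mel) filter spacing, and all names here are assumptions:

```python
import numpy as np

def triangle_filter(n_bins, lo, center, hi, height=1.0):
    """One triangular filter over FFT bins, peaking at `center` with `height`."""
    f = np.zeros(n_bins)
    for k in range(lo, hi + 1):
        if k <= center:
            f[k] = height * (k - lo) / max(center - lo, 1)
        else:
            f[k] = height * (hi - k) / max(hi - center, 1)
    return f

def amplitude_warped_bank(n_filters, n_bins, beta):
    """Hypothetical amplitude-warped bank: heights run linearly from 1
    to beta across the filters, imposing an overall amplitude slope."""
    centers = np.linspace(0, n_bins - 1, n_filters + 2).astype(int)
    bank = []
    for i in range(1, n_filters + 1):
        h = 1.0 + (beta - 1.0) * (i - 1) / (n_filters - 1)
        bank.append(triangle_filter(n_bins, centers[i - 1], centers[i],
                                    centers[i + 1], h))
    return np.vstack(bank)
```

With beta = 0.5 (input pitch twice the reference), the highest-frequency filters are halved in height while the lowest keep unit height, tilting the log-spectral envelope back toward the reference utterance.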
Table 1 shows word error rates for numerals and words in the SKKU-I database when using the baseline recognizer, when using the baseline recognizer with normalization between speakers, and when using the baseline recognizer with both normalization between speakers and intra-speaker normalization. [Table 1]
(Table 1 is reproduced as an image in the original publication.)
As shown in Table 1, "baseline" indicates the recognition error rate when no normalization is applied, "with α" indicates the error rate when only the conventional frequency normalization is applied, and "with α and β" indicates the error rate when both the conventional frequency normalization and the amplitude normalization according to the present invention are applied. The error decreases progressively in the order "baseline", "with α", "with α and β". Namely, the word recognition rates according to the recognition results are 96.4 % and 98.2 %, and the error rate is reduced by 0.4 ~ 2.3 % for the numeral and word recognitions.
Industrial Applicability
As mentioned above, according to the amplitude warping method of intra-speaker normalization for speech recognition, amplitude normalization according to pitch-altered utterances is achieved through a new intra-speaker warping factor estimation.
Furthermore, by applying a recognition model suited to the user's voice to a speaker-recognition product and applying a speaker-adaptive method to the recognizer, the recognition rate and reliability of the product are improved.

Claims

1. An amplitude warping method of intra-speaker normalization for speech recognition, the method comprising the steps of: calculating a reverse rate between input and reference pitches of an input speaker and determining an amplitude warping factor, since the feature space of speech not transformed from intra-speaker pitch alteration utterance is variously distributed due to acoustic differences of a speaker occurring according to voiceprint and vocal tract; and determining an amplitude slope on the total frequency axis while adjusting a height of a triangle filter and an amplitude.
2. The method according to claim 1, wherein the calculation of the inverse ratio between the input and reference pitches of the input speaker computes p1/p2 as a new slope-adjusting constant in order to adjust the amplitude to that of a first pitch p1, since the amplitude becomes higher when a second pitch p2 is high, wherein the first pitch p1 is the pitch during a normal utterance and the second pitch p2 is the pitch when the same voice is uttered anew.
3. The method according to claim 1 or 2, wherein after the slope of the amplitude is determined, the resulting feature vectors are decoded with a hidden Markov model.
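The step of adjusting the triangular filter heights by the warping factor (claim 1) can be sketched as below. This is a minimal illustration under stated assumptions: the filter-edge representation, bin indexing, and function names are hypothetical, and only the uniform height scaling by beta reflects the claimed adjustment.

```python
import numpy as np

def triangular_filter(n_bins, lo, center, hi, height=1.0):
    # One triangular filter over FFT bins with an adjustable peak height.
    f = np.zeros(n_bins)
    for k in range(lo, hi + 1):
        if k <= center:
            f[k] = height * (k - lo) / max(center - lo, 1)
        else:
            f[k] = height * (hi - k) / max(hi - center, 1)
    return f

def warped_filterbank(edges, n_bins, beta):
    # Scale every filter's height by the warping factor beta so that
    # the filterbank energies follow the amplitude-normalized spectrum.
    return [triangular_filter(n_bins, lo, c, hi, height=beta)
            for lo, c, hi in edges]
```

The feature vectors computed through such a height-adjusted filterbank would then be decoded with a hidden Markov model, as in claim 3.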
PCT/KR2003/001216 2003-06-13 2003-06-20 An amplitude warping approach to intra-speaker normalization for speech recognition WO2004111999A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003244240A AU2003244240A1 (en) 2003-06-13 2003-06-20 An amplitude warping approach to intra-speaker normalization for speech recognition

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2003-0038102 2003-06-13
KR10-2003-0038102A KR100511248B1 (en) 2003-06-13 2003-06-13 An Amplitude Warping Approach to Intra-Speaker Normalization for Speech Recognition

Publications (1)

Publication Number Publication Date
WO2004111999A1 true WO2004111999A1 (en) 2004-12-23

Family

ID=33550159

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2003/001216 WO2004111999A1 (en) 2003-06-13 2003-06-20 An amplitude warping approach to intra-speaker normalization for speech recognition

Country Status (3)

Country Link
KR (1) KR100511248B1 (en)
AU (1) AU2003244240A1 (en)
WO (1) WO2004111999A1 (en)


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KWANG-SEOK HONG: 'An amplitude warping approach to intra-speaker normalization for speech recognition' LECTURE NOTES IN COMPUTER SCIENCE vol. 2668, January 2003, pages 639 - 645 *
LEE L. ET AL.: 'Speaker normalization using effi-cient frequency warping procedures' PROC. ICASSP-96 vol. 1, May 1996, pages 353 - 356, XP002093540 *
ZHAN P., WESTPHAL M.: 'Speaker normalization based on frequency warping' PROC. ICASSP-97 vol. 2, April 1997, pages 1039 - 1042, XP002904783 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7778831B2 (en) * 2006-02-21 2010-08-17 Sony Computer Entertainment Inc. Voice recognition with dynamic filter bank adjustment based on speaker categorization determined from runtime pitch
US8010358B2 (en) * 2006-02-21 2011-08-30 Sony Computer Entertainment Inc. Voice recognition with parallel gender and age normalization
US8442829B2 (en) 2009-02-17 2013-05-14 Sony Computer Entertainment Inc. Automatic computation streaming partition for voice recognition on multiple processors with limited memory
US8442833B2 (en) 2009-02-17 2013-05-14 Sony Computer Entertainment Inc. Speech processing with source location estimation using signals from two or more microphones
US8788256B2 (en) 2009-02-17 2014-07-22 Sony Computer Entertainment Inc. Multiple language voice recognition
TWI636452B (en) * 2017-05-10 2018-09-21 平安科技(深圳)有限公司 Method and system of voice recognition
TWI629680B (en) * 2017-06-15 2018-07-11 中華電信股份有限公司 Voice confidence assessment method and system
CN109102810A (en) * 2017-06-21 2018-12-28 北京搜狗科技发展有限公司 Method for recognizing sound-groove and device
CN109102810B (en) * 2017-06-21 2021-10-15 北京搜狗科技发展有限公司 Voiceprint recognition method and device

Also Published As

Publication number Publication date
KR100511248B1 (en) 2005-08-31
KR20040107173A (en) 2004-12-20
AU2003244240A1 (en) 2005-01-04

Similar Documents

Publication Publication Date Title
Vergin et al. Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition
Furui Speaker-independent isolated word recognition based on emphasized spectral dynamics
US7957959B2 (en) Method and apparatus for processing speech data with classification models
Shahnawazuddin et al. Pitch-Adaptive Front-End Features for Robust Children's ASR.
US20150025892A1 (en) Method and system for template-based personalized singing synthesis
Sündermann et al. A first step towards text-independent voice conversion
Vergin et al. Compensated mel frequency cepstrum coefficients
Bhat et al. Recognition of Dysarthric Speech Using Voice Parameters for Speaker Adaptation and Multi-Taper Spectral Estimation.
Krobba et al. Evaluation of speaker identification system using GSMEFR speech data
Shao et al. Pitch prediction from MFCC vectors for speech reconstruction
Hon et al. Towards large vocabulary Mandarin Chinese speech recognition
Sinha et al. On the use of pitch normalization for improving children's speech recognition
WO2004111999A1 (en) An amplitude warping approach to intra-speaker normalization for speech recognition
Zolnay et al. Using multiple acoustic feature sets for speech recognition
KR20060066483A (en) Method for extracting feature vectors for voice recognition
Liberatore et al. Voice conversion through residual warping in a sparse, anchor-based representation of speech
Irino et al. Evaluation of a speech recognition/generation method based on HMM and straight.
Jančovič et al. Incorporating the voicing information into HMM-based automatic speech recognition in noisy environments
Ouzounov Cepstral features and text-dependent speaker identification–A comparative study
Ashour et al. Characterization of speech during imitation.
Kitaoka et al. Speaker independent speech recognition using features based on glottal sound source
JP2834471B2 (en) Pronunciation evaluation method
Liu et al. The effect of fundamental frequency on Mandarin speech recognition.
Wang et al. Improved Mandarin speech recognition by lattice rescoring with enhanced tone models
Bořil et al. Lombard speech recognition: A comparative study

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 EP: The EPO has been informed by WIPO that EP was designated in this application
122 EP: PCT application non-entry into the European phase
NENP Non-entry into the national phase

Ref country code: JP