WO2004072953A1

WO2004072953A1 - Method for reducing computational quantity amount utterance verification using anti-phoneme model

Info

Publication number: WO2004072953A1
Application number: PCT/KR2003/000863
Authority: WO
Inventors: Soon-Hyob Kim; Ho-Jun Lee
Original assignee: Speechsoundnet Co., Ltd.; Institute Information Technology Assessment; Kwangwoon Foundation
Priority date: 2003-02-12
Filing date: 2003-04-29
Publication date: 2004-08-26
Also published as: AU2003223135A1; AU2003223135A8; KR100492089B1; KR20040072989A

Abstract

Disclosed is a method for reducing the amount of computation in utterance verification using an anti-phoneme model, in order to reduce scenario errors caused by incorrect recognition in a speech recognition application system. In the method, a plurality of phonemes are arranged. Distances between the phonemes are measured using the Bhattacharyya distance. The phonemes are merged one by one, starting from the phonemes having the greatest degree of similarity, to perform an agglomerative hierarchical clustering. The anti-phoneme model sets are classified into nine classes by this clustering, each of the nine classes containing similar phonemes. During utterance verification, a degree of similarity with respect to an uttered phoneme is calculated based on the anti-phoneme model sets classified into the nine classes.

Description

METHOD FOR REDUCING COMPUTATIONAL QUANTITY AMOUNT UTTERANCE VERIFICATION USING ANTI-PHONEME MODEL

Technical Field

The present invention relates to a method for reducing the amount of computation in utterance verification using an anti-phoneme model. More specifically, the present invention relates to such a method intended to reduce scenario errors caused by incorrect recognition in a speech recognition application system.

Background Art

Speech recognition refers to the ability of a machine to understand human speech and to perform a task according to that speech.

Owing to the development of computers and information technology, people can easily obtain information remotely without having to travel. Accordingly, speech recognition devices comprising systems that operate according to a given speech input have been developed.

Various speech recognition application systems based on such speech recognition have also been developed. One of them is a system that provides desired information in response to the words a user utters.

It is assumed that there is a telephone directory system covering a number of groups. When a user utters the name of a department in one of the groups to be searched, the speech recognition system displays the telephone number of the corresponding department. Such a system is one application of speech recognition.

Hereinafter, a conventional speech recognition system and utterance verification system will be described with reference to FIG. 9. FIG. 9 is a view that illustrates a conventional speech recognition system and utterance verification system.

When a user speaks a desired utterance, various parameters of the speech signal corresponding to the utterance are preprocessed and input to an Automatic Speech Recognition (ASR) system. A registered vocabulary and phoneme models are also input to the ASR system. The ASR system then recognizes the speech signal and performs a post-process that rejects or recognizes (approves) it. This is called an utterance verification step.

That is, the utterance verification step verifies whether the speech input to the ASR system should be rejected or recognized (approved).

According to a conventional method of rejecting incorrect input by utterance verification, an anti-phoneme model is formed from the mono-phoneme models. The recognition result is then analyzed in a post-process of the recognition engine (program), and frame-level phoneme label information is extracted. Based on the extracted label information, a class is formed for each frame from the established anti-phoneme models, excluding the mono-phoneme model that represents that frame. Conventionally, as shown in FIG. 3, the anti-phoneme models are produced from the mono-phoneme models obtained during training for word recognition. Here, the total number of mono-phoneme models is forty-five, and the total number of anti-phoneme models is forty-four.

In FIG. 5, the total number of phonemes is identical to the number of anti-phoneme models. The reliance of an uttered speech is calculated based on such an anti-phoneme model and is detected in the ASR system.

For example, as shown in FIG. 11, when a user utters "Kwang woon university", the mono-phonemes are aligned with the feature vectors. The first aligned mono-phoneme K (ㄱ) is compared with the anti-phoneme model (the remaining 44 phonemes), and the reliance of the uttered speech is detected.

That is, in order to express the alternative hypothesis for the reliance calculation, the model most similar to the feature parameters of each frame is searched for in the anti-phoneme model class of that frame, over all frames, and the reliance of the uttered speech is calculated using the most similar model found by the search in order to verify the input speech.
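The conventional search described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each phoneme is modeled by a single diagonal-covariance Gaussian, that the reliance is a log-likelihood ratio averaged over frames, and that every model other than the recognized one is a rival. The names `log_gauss`, `frame_reliance` and `utterance_reliance` are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def frame_reliance(x, label, models):
    """Log-likelihood ratio of the recognized phoneme against its best rival.

    models maps a phoneme name to (mean, var). Every model except `label`
    is treated as an anti-phoneme model (exhaustive, conventional search).
    Returns the score and the number of rival models evaluated.
    """
    target = log_gauss(x, *models[label])
    rivals = [log_gauss(x, *mv) for name, mv in models.items() if name != label]
    return target - max(rivals), len(rivals)

def utterance_reliance(frames, labels, models):
    """Average frame-level reliance and total number of rival evaluations."""
    total, evaluations = 0.0, 0
    for x, label in zip(frames, labels):
        score, n = frame_reliance(x, label, models)
        total += score
        evaluations += n
    return total / len(frames), evaluations
```

With N phoneme models and T frames, this exhaustive search evaluates (N - 1) x T rival models, which is the cost that grows with the length of the recognized speech.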

Small errors in speech recognition are detected by means of the reliance, and registered words are discriminated from unregistered words based on the reliance of the uttered speech. The reliance represents the relative degree of similarity between a recognized model and an unrecognized model. The models similar to each model are searched for and are called anti-phoneme models.

When an anti-phoneme model is searched in order to detect the reliance, the amount of computation increases in proportion to the length of the recognized speech and to the number of similar phoneme units. Consequently, a long computational time is required, which leads to a long response time.

In detail, since the 44 anti-phoneme models are sequentially compared with the mono-phoneme models, in order from the first aligned mono-phoneme to the last, the computation takes a considerably long time. As described above, in the conventional method of rejecting incorrect input by utterance verification, all similar phoneme areas are searched, so the amount of computation increases in proportion to the number of similar phoneme units.

Disclosure of the Invention

Therefore, it is an object of the present invention to provide a method capable of significantly reducing the amount of computation in utterance verification using an anti-phoneme model, and thereby obtaining a high computational speed, by measuring distances between phonemes using a Bhattacharyya's distance measuring method and forming anti-phoneme model classes of similar phonemes using an Agglomerative Hierarchical Clustering.

According to the present invention, there is provided a method for reducing the amount of computation in utterance verification using an anti-phoneme model, the method comprising the steps of: arranging a plurality of phonemes; measuring distances between the phonemes by using a Bhattacharyya's distance measuring method; integrating the phonemes one by one, starting from the phonemes having the greatest degree of similarity, to perform an Agglomerative Hierarchical Clustering; forming anti-phoneme model classes which are classified into nine classes by the Agglomerative Hierarchical Clustering, each of the nine classes having similar phonemes; and computing a degree of similarity with respect to an uttered phoneme based on the anti-phoneme model classes which are classified into the nine classes during an utterance verification.

In a preferred embodiment of the present invention, the above nine classes and the anti-phoneme model classes, classified into the above nine classes, include: {ti (final sound), τ= (final sound), ^~ι (final sound), T-, Ξ (final sound)}, μ-, -i , -, T, T^} H, -11 , 4 , , Ξ (initial sound), -sr},

{o, ^ϋ, },

{π _ t. , π (initial sound), "6^" (initial sound), ^ (initial sound)}, {-l , *, *κ, },

{mj, re, , Λ, H, n, t (initial sound), t. (initial sound)}, (TT, ^, =.1 , -] }, and

{ > , ^JT TI, ^C }.

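In a more preferred embodiment of the present invention, when searching the anti-phoneme model during the utterance verification, only those classes which contain a recognized phoneme among the above nine classes are searched, in order to reduce the amount of computation of the degree of similarity and to increase its speed. Meanwhile, the Bhattacharyya's distance measuring method measures a distance between two Gaussian distributions, with mean vectors $\mu_1$, $\mu_2$ and covariance matrices $\Sigma_1$, $\Sigma_2$, using the following equation:

$$D_{bhat} = \frac{1}{8}(\mu_2 - \mu_1)^T \left[\frac{\Sigma_1 + \Sigma_2}{2}\right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_1 + \Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}$$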

Brief Description of the Drawings

The foregoing and other objects, features and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings in which:

FIG. 1 is a block diagram showing configurations of a speech recognition system and an utterance verification system according to an embodiment of the present invention;

FIG. 2 is a block diagram for illustrating an anti-phoneme model forming method according to an embodiment of the present invention;

FIG. 3 is a flow chart of a phoneme classifying procedure using the Bhattacharyya's distance measuring method and the Agglomerative Hierarchical Clustering in order to form an anti-phoneme model according to an embodiment of the present invention;

FIG. 4 is a view showing a phoneme classification tree using the Agglomerative Hierarchical Clustering and the Bhattacharyya's distance measuring method in order to form an anti-phoneme model according to an embodiment of the present invention;

FIG. 5 is a view which shows a final anti-phoneme model class formed by the Agglomerative Hierarchical Clustering and the Bhattacharyya's distance measuring method which are used in an embodiment of the present invention;

FIG. 6 is a view which shows a state of executing an utterance verification using an anti-phoneme model according to the present invention;

FIG. 7 is a view which shows a performance estimate reference of a conventional utterance verification function;

FIG. 8 is a view which shows a performance estimate result of an utterance verification function according to the present invention;

FIG. 9 is a view illustrating a conventional speech recognition system and utterance verification system;

FIG. 10 is a view illustrating a conventional anti-phoneme model forming method which is used in an utterance verification;

FIG. 11 is a view illustrating an utterance verification method using a conventional anti-phoneme model forming method;

FIG. 12 is a view showing a performance of an utterance verification using the conventional method; and

FIG. 13 is a view showing a comparison of performances according to threshold values of a method according to the present invention and the conventional method.

Best Mode for Carrying Out the Invention

Reference will now be made in detail to the preferred embodiments of the present invention.

FIG. 1 is a block diagram showing configurations of a speech recognition system and an utterance verification system according to an embodiment of the present invention. FIG. 2 is a block diagram for illustrating an anti-phoneme model forming method according to an embodiment of the present invention.

When a user speaks desired utterances, various parameters of a speech signal corresponding to the utterances are preprocessed and inputted to an ASR system. Registered vocabulary and phoneme models are also inputted to the ASR system.

Then, the ASR system recognizes the corresponding speech signal and performs a post-process which rejects or recognizes (approves) it. This is called an utterance verification step.

That is, the utterance verification step verifies whether the speech input to the ASR system should be rejected or recognized (approved). In general, the anti-phoneme model is formed as a class of all phone models other than the recognized phoneme. In order to calculate the alternative hypothesis, the phoneme model having the greatest degree of similarity among the anti-phoneme models is searched for. The alternative hypothesis is the probability that the recognized phoneme is a wrong phoneme.

Conventionally, when 46 mono-phoneme models are used for an utterance verification, 45 computations of the degree of similarity are required. Accordingly, when the anti-phoneme model is searched in order to detect the reliance (similarity degree) of the uttered speech signal, the amount of computation increases in proportion to the length of the recognized speech and the number of similar phoneme units. Consequently, the computation takes a long time, leading to a long response time.

Therefore, the present invention is characterized by merging the mono-phoneme models extracted during recognition model training one by one, starting from the phonemes having the highest degree of similarity, using a Bhattacharyya's distance measuring method and an Agglomerative Hierarchical Clustering. Accordingly, when an anti-phoneme model is searched during an utterance verification, only the clusters containing the recognized phonemes among the previously classified clusters are searched, so that the number of computations for the degree of similarity is reduced from 5 to 3. This reduces the amount of computation and increases the computational speed.

That is, the present invention searches for similar phones only among the anti-phoneme models classified by the Bhattacharyya's distance measuring method and the Agglomerative Hierarchical Clustering, which greatly reduces the computation of the degree of similarity and increases the computational speed.

FIG. 3 is a flow chart of a phoneme classifying procedure using the Bhattacharyya's distance measuring method and the Agglomerative Hierarchical Clustering in order to form an anti-phoneme model according to the present invention.

The phoneme classifying procedure includes the steps of:

1) arranging a plurality of phonemes;

2) sequentially comparing each phoneme with the rest of the phonemes using the Bhattacharyya's distance measuring method, and clustering the phonemes into a plurality of classes in such a manner that phonemes phonetically similar to each other fall into the same class;

3) obtaining the minimal distance between the phonemes and clustering the phonemes using the Agglomerative Hierarchical Clustering; and

4) clustering the phonemes by repeating steps (2) and (3) until the number of clusters reaches the desired number.

The present invention uses the Agglomerative Hierarchical Clustering to reduce the anti-phoneme model: the anti-phoneme models similar to a recognized phoneme model are searched for and formed into a similar phoneme class, thereby reducing the number of searches and the search time.
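The classifying procedure can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the pairwise distances are taken as already computed, the distance between two clusters is taken to be the minimal pairwise distance between their members (single linkage, one reading of step 3), and the function name `agglomerate` and the toy phoneme labels in the usage note are hypothetical.

```python
from itertools import combinations

def agglomerate(names, dist, n_classes):
    """Merge the two closest clusters repeatedly until n_classes remain.

    names:     list of phoneme labels
    dist:      dict mapping an unordered pair (a, b) to a distance,
               stored under either (a, b) or (b, a)
    n_classes: desired number of clusters
    """
    clusters = [frozenset([n]) for n in names]

    def d(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def linkage(c1, c2):
        # Assumed single linkage: minimal distance between any two members.
        return min(d(a, b) for a in c1 for b in c2)

    while len(clusters) > n_classes:
        # Find the most similar pair of clusters and merge it.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters
```

For example, with five toy labels where `p`, `b`, `m` are mutually close and `a`, `e` are close, calling `agglomerate(names, dist, 2)` yields the two clusters {p, b, m} and {a, e}. In the patent the same stopping rule is applied with nine target classes.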

The present invention can use the N phoneme models having the greatest degree of similarity. However, a class formed from a fixed number N of phoneme models cannot adapt to differing numbers of similar phonemes, and the number of similar phonemes varies with each phoneme and its features. The Agglomerative Hierarchical Clustering is an unsupervised clustering method which clusters similar phonemes and forms a hierarchical classification, so that similar phoneme classes are formed based on the features of the phonemes. The present invention uses the Bhattacharyya's distance measuring method as the distance measure. This method measures a distance between two Gaussian distributions. Because it is simple to compute and provides a bound on the error rather than requiring an exact computation, it is flexible.

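The Bhattacharyya's distance measuring method measures a distance between two Gaussian distributions using the equation:

$$D_{bhat} = \frac{1}{8}(\mu_2 - \mu_1)^T \left[\frac{\Sigma_1 + \Sigma_2}{2}\right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_1 + \Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}$$

where $\mu_1$, $\mu_2$ are the mean vectors and $\Sigma_1$, $\Sigma_2$ the covariance matrices of the two distributions, and a boundary with respect to an error between the two Gaussian distributions is expressed by

$$\varepsilon \le \sqrt{P_1 P_2}\,\exp(-D_{bhat})$$

where $P_1$ and $P_2$ are the prior probabilities of the two distributions.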
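As a numerical illustration, the Bhattacharyya distance between two Gaussians and the associated error bound can be computed as below. The sketch assumes diagonal covariance matrices (a simplification not stated in the text), in which case the matrix inverse and determinants reduce to per-dimension sums. The function names and the default equal priors are hypothetical choices.

```python
import math

def bhattacharyya_diag(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    First term:  (1/8) * sum((m2 - m1)^2 / v), with v = (v1 + v2) / 2
    Second term: (1/2) * sum(ln(v / sqrt(v1 * v2)))
    """
    d_mean = 0.0
    d_cov = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        v = 0.5 * (v1 + v2)  # averaged variance for this dimension
        d_mean += (m2 - m1) ** 2 / v
        d_cov += math.log(v / math.sqrt(v1 * v2))
    return 0.125 * d_mean + 0.5 * d_cov

def error_bound(distance, prior1=0.5, prior2=0.5):
    """Upper bound on the two-class error: sqrt(P1 * P2) * exp(-D)."""
    return math.sqrt(prior1 * prior2) * math.exp(-distance)
```

Two identical distributions give a distance of 0 and, with equal priors, a bound of 0.5; the bound shrinks as the distributions move apart, which is why a small distance marks two phonemes as easily confused.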

FIG. 4 shows a phoneme classification tree using the Agglomerative Hierarchical Clustering and the Bhattacharyya's distance measuring method in order to form an anti-phoneme model according to the present invention. The phoneme classification tree is formed by the phonemes which are phonetically similar to each other.

FIG. 5 is a view which shows a final anti-phoneme model aggregate formed by the Agglomerative Hierarchical Clustering and the Bhattacharyya's distance measuring method which are used in the present invention. As shown in FIG. 5, anti-phoneme model classes are classified into nine classes. Preferably, the nine classes and the anti-phoneme model classes classified into the nine classes include:

{ ti (final sound), t_: (final sound), ^~ι (final sound), 1-, ≡ (final sound)}, P-, 4 , - T, )}

{τ-11, "11 , , Y , Ξ (initial sound), -&}, {o, ^, 1 ,-rϊ},

{^π , ^ , (initial sound), ^~& (initial sound), ^ (initial sound)}, {=ι , X-, , },

[m, ιx. , A, E, jr . t-: (initial sound), ti (initial sound)}, {IT, -^, 41 , -l }, and { h 4, TI, τ=}.

FIG. 6 is a view which shows an utterance verification being executed using an anti-phoneme model according to the present invention. The utterance verification is executed using the above anti-phoneme models.

For example, when the user utters "Kwang woon university", the mono-phonemes are aligned with the feature vectors. The first aligned mono-phoneme K (ㄱ) is compared with an anti-phoneme model (included in a class E of FIG. 5) and the reliance of the uttered speech is detected.

That is, the first aligned mono-phoneme K (ㄱ) is compared with {ti , π (initial sound), "S^", ^ (initial sound)}, and the reliance of the uttered speech is detected.

As described above, when the anti-phoneme model is searched during the utterance verification, only the clusters containing the recognized phonemes among the previously classified clusters are searched, so that the number of computations for the degree of similarity is reduced from 5 to 3, thereby reducing the amount of computation while increasing the computational speed.
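The class-restricted search can be sketched as a simple lookup. The code below is an illustration only: the class contents are toy Latin labels rather than the classes of FIG. 5, and the function names `anti_models` and `count_evaluations` are hypothetical. It compares how many rival models would be evaluated with an exhaustive search against a search limited to the class of each recognized phoneme.

```python
def anti_models(label, classes):
    """Return the rivals of a recognized phoneme: the other members of its class."""
    for members in classes:
        if label in members:
            return sorted(members - {label})
    raise KeyError(label)

def count_evaluations(labels, classes):
    """Count rival evaluations for exhaustive and class-restricted search.

    labels:  recognized phoneme label for each frame
    classes: list of disjoint sets of phoneme labels
    """
    all_phonemes = set().union(*classes)
    # Exhaustive search: every other phoneme model is a rival for each frame.
    full = sum(len(all_phonemes) - 1 for _ in labels)
    # Restricted search: only classmates of the recognized phoneme are rivals.
    restricted = sum(len(anti_models(label, classes)) for label in labels)
    return full, restricted
```

With three toy classes of sizes 3, 2 and 4 and one frame recognized from each class, the exhaustive search evaluates 3 x 8 = 24 rivals, while the restricted search evaluates 2 + 1 + 3 = 6.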

FIG. 7 is a view which shows a performance estimate reference of a conventional utterance verification function. FIG. 8 is a view which shows a performance estimate result of an utterance verification function according to the present invention. FIG. 12 is a view showing a performance of an utterance verification using the conventional method. FIG. 13 is a view showing a comparison of performances according to threshold values of a method according to the present invention and a conventional method.

When the performance estimate result of FIG. 8 is compared with the performance of FIG. 12, the total recognition ratio of the present invention is very slightly lower than that of the conventional method. It is a range which has a great influence on the recognition ratio.

Industrial Applicability

As seen from the foregoing, according to the method of the present invention for reducing the amount of computation in utterance verification using an anti-phoneme model, the anti-phoneme model is formed using the Agglomerative Hierarchical Clustering and the Bhattacharyya's distance measuring method. In an utterance verification function, which serves to reduce scenario errors caused by incorrect recognition in a speech recognition application system, this reduces the amount of computation during the search for similar phonemes by more than 50 %. Also, because only a limited area is searched, the effect of a change in the threshold value is minimized. Furthermore, by minimizing the amount of computation during utterance verification, the present invention allows utterance verification to be used to minimize scenario errors caused by incorrect recognition in actual field use, thereby providing a more convenient interface to the user.

While this invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not limited to the disclosed embodiment and the drawings, but to the contrary, it is intended to cover various modifications and variations within the spirit and scope of the appended claims.

Claims

What is claimed is:

1. A method for reducing computational process in utterance verification using an anti-phoneme model, wherein said method comprises the following steps of: arranging a plurality of phonemes; measuring distances between the phonemes by using a Bhattacharyya's distance measuring method; integrating the phonemes from a phoneme having the greatest degree of similarity one by one to perform an Agglomerative Hierarchical Clustering; generating anti-phoneme model classes which are classified into nine classes by the Agglomerative Hierarchical Clustering, each of said nine classes having a similar phoneme; and computing a degree of similarity with respect to an uttered phoneme based on the anti-phoneme model classes which are classified into the nine classes during an utterance verification.

2. The method according to claim 1, wherein said nine classes and said anti- phoneme model classes classified into said nine classes include: {u (final sound), ^ (final sound), ^~ι (final sound), ^L, s (final sound)}, μ-, -. , -, T, i}

H, "fl , 4 , , Ξ (initial sound), ^"&},

{ , ^y , (initial sound), ^"er (initial sound), A (initial sound)}, {=ι , *, , A }, [m, v , , , ≡, -si, τ=: (initial sound), ti (initial sound)}, {TT, >-, =fl , -l }, and { V , , TI, ^t=}.

3. The method according to claim 1 or 2, wherein only classes having a recognized phoneme among said nine classes are searched when searching the anti-phoneme model during the utterance verification, to reduce the amount of computation of the degree of similarity and to increase its speed.

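4. The method according to claim 1 or 2, wherein said Bhattacharyya's distance measuring method measures a distance between two Gaussian distributions using the following equation:

$$D_{bhat} = \frac{1}{8}(\mu_2 - \mu_1)^T \left[\frac{\Sigma_1 + \Sigma_2}{2}\right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_1 + \Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}$$

where $\mu_1$, $\mu_2$ are the mean vectors and $\Sigma_1$, $\Sigma_2$ the covariance matrices of the two distributions, and a boundary with respect to an error between the two Gaussian distributions is expressed by

$$\varepsilon \le \sqrt{P_1 P_2}\,\exp(-D_{bhat})$$

where $P_1$ and $P_2$ are the prior probabilities of the two distributions.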