WO1992006469A1 - Boundary relaxation for speech pattern recognition - Google Patents

Publication number
WO1992006469A1
WO1992006469A1 PCT/US1991/007165 US9107165W WO9206469A1 WO 1992006469 A1 WO1992006469 A1 WO 1992006469A1 US 9107165 W US9107165 W US 9107165W WO 9206469 A1 WO9206469 A1 WO 9206469A1
Authority
WO
WIPO (PCT)
Prior art keywords
point
pattern
path
score
feasible
Application number
PCT/US1991/007165
Other languages
French (fr)
Inventor
Ilan D. Shallom
Raziel Haimi-Cohen
Original Assignee
The Dsp Group, Inc.
Priority claimed from IL9586990A (IL95869A)
Priority claimed from IL98092A (IL98092A0)
Application filed by The Dsp Group, Inc.
Publication of WO1992006469A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/12: Speech classification or search using dynamic programming techniques, e.g. dynamic time warping [DTW]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A speech recognition algorithm is implemented in a computer program by sending speech input into a coder (2) and processing it in a standard computer (4) using reference patterns stored in memory (6). The algorithm uses the well-known technique called dynamic programming to include weighting and normalizing functions.

Description

BOUNDARY RELAXATION FOR SPEECH PATTERN RECOGNITION
FIELD OF THE INVENTION
The present invention relates to pattern recognition processing generally and more particularly to speech recognition using a dynamic programming algorithm, typically a modification of a standard Dynamic Time Warping (DTW) algorithm or of similar algorithms (for example, a Hidden Markov Model based on Viterbi's algorithm).
BACKGROUND OF THE INVENTION
Conventional Dynamic Time Warping (DTW) algorithms assume a precise knowledge of the boundaries of both reference and test utterances. However, the output of practical boundary detectors is inaccurate, particularly so in a noisy environment. This results in a severe deterioration of the accuracy of isolated word recognition. This problem has been well described in several publications, including an article by Wilpon, Rabiner and Martin, entitled "An Improved Word Detection Algorithm for Telephone Quality Speech Incorporating both Syntactic and Semantic Constraints", published in the AT&T Bell Lab. Tech. Journal, Vol. 63(3), March 1984, pp. 479-498. Wilpon et al show the results of recognition experiments in which the actual endpoints are manually varied. Their work suggests that the accuracy of isolated word recognition decreases dramatically as a function of errors in boundary detection. The method outlined by Wilpon et al describes a way to improve speech recognition, based on a new boundary estimation algorithm which reduces boundary recognition errors.
The degradation in recognition accuracy due to mismatch in boundary determination can be reduced by various approaches. The method of Wilpon et al improves the accuracy of boundary determination, but only to within a certain degree of uncertainty. To overcome the remaining problem, it is recommended that a procedure be developed that is immune to small endpoint errors.
Relaxing the requirement of exact knowledge of the boundaries gives a strong tool with which to measure the similarity between two speech events with uncertain endpoints (within a reasonable limit). This method for improved isolated word recognition is described in "Considerations in Dynamic Time Warping Algorithms for Discrete Word Recognition," by L.R. Rabiner, A.E. Rosenberg and S.E. Levinson, published in IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-26, Dec. 1978, pp. 575-582. The method of Rabiner et al attempts to improve speech recognition by relaxation of the boundary constraints and modification of the standard dynamic time warping algorithm, allowing the warping path to begin and end within a specified range with respect to the estimated boundaries. According to this method, the accumulated distance of the final path is normalized by its length.
The method of Rabiner et al is enhanced by the algorithm described in "Dynamic Time Warping with Boundaries Constraint Relaxation", by I.D. Shallom, R. Haimi-Cohen and T. Golan, published in Proc. Conf. IEEE Israel, 1989, paper 3.1.3. The algorithm of Shallom et al also uses relaxation of boundary constraints. Their method uses a dynamic time warping algorithm in which a path length normalization factor is applied in the dynamic equation at each grid point. This improves the path optimization process.
However, this method ignores the "length" of the "future" part of the warping function. Ignoring the future "length" may lead to inaccuracies, especially near the beginning of the warping path. As a result, errors may occur in the overall similarity measurement.
SUMMARY OF THE INVENTION
The present invention provides a method of improved pattern recognition which may be used for speech recognition by relaxation of boundary constraints so as to account for boundary detection errors. The dynamic programming algorithm is modified so that the known and predicted path lengths are taken into account when determining the optimal path to each grid point. Additionally, the present invention provides a method for improving the accuracy of the estimated boundaries of a tested pattern.
A method for determining the predicted path length and for utilizing it in a dynamic programming algorithm is outlined below.
There is therefore provided, in accordance with a preferred embodiment of the present invention, apparatus for pattern recognition including apparatus for providing a digital pattern to be inspected which contains a plurality of feature vectors, apparatus for providing at least one digital reference pattern containing a different plurality of parameter vectors and apparatus for comparing the digital pattern to be inspected with the at least one digital reference pattern. The apparatus for comparing includes apparatus for providing a search area including a grid with the feature vectors on a first axis and the parameter vectors on a second axis and apparatus for calculating a final normalized score which is the estimated minimum of a plurality of optimal normalized scores each associated with a corresponding feasible path, wherein each of the feasible paths is located in the search area. The
apparatus for calculating includes, for each point in the search area, apparatus for computing an accumulated score for a plurality of feasible paths which contain the point, apparatus for computing an overall weight for each of the plurality of feasible paths which contain the point, apparatus for computing a normalized score, whereby the normalized score is the accumulated score for the point divided by the overall weight for the point, for each of the plurality of feasible paths which contain the point, and apparatus for selecting the normalized score which is least, from the plurality of normalized scores, as an optimal normalized score for the point.
Additionally, in accordance with a preferred embodiment of the present invention, the search area includes a plurality of path beginning points and a plurality of path ending points.
Moreover, in accordance with a preferred embodiment of the present invention, the apparatus for pattern recognition also includes an apparatus for determining beginning and ending points of that feasible path which is associated with the final normalized score thereby to determine beginning and ending points of the digital pattern. Furthermore, in accordance with a preferred embodiment of the present invention, the overall weight includes an accumulated weight and a predicted weight.
Still further, in accordance with a preferred embodiment of the present invention, the pattern to be inspected is a speech utterance and the reference pattern is based on a Hidden Markov Model. Alternatively, the pattern to be inspected is a speech utterance, the reference pattern is a reference template, and the feasible paths are calculated according to a Dynamic Time Warping algorithm.
Moreover, in accordance with a preferred embodiment of the present invention, the beginning and ending points of the feasible path which is associated with the final normalized score are used to estimate beginning and ending points of the pattern to be inspected.
Additionally, in accordance with a preferred embodiment of the present invention, the digital pattern is derived from a speech signal.
There is further provided, in accordance with a preferred embodiment of the present invention, a method for producing a final normalized score which is the minimum of a plurality of optimal normalized scores each associated with a corresponding feasible path, wherein each of the feasible paths is located in a search area and wherein the search area includes a set of points characterized by a plurality of path beginning points and a plurality of path ending points. For each point in the search area, the method includes the steps of computing an accumulated score for a plurality of feasible paths which contain the point, computing an overall weight for each of the plurality of feasible paths which contain the point, computing a normalized score, whereby the normalized score is the accumulated score for the point divided by the overall weight for the point, for each of the plurality of feasible paths which contain the point, and selecting the normalized score which is least, from the plurality of normalized scores, as an optimal normalized score for the point.
Additionally, in accordance with a preferred embodiment of the present invention, the method also includes the step of determining beginning and ending points of that feasible path which is associated with the final normalized score.
Moreover, in accordance with a preferred embodiment of the present invention, the overall weight includes an accumulated weight and a predicted weight.
Furthermore, in accordance with a preferred embodiment of the present invention, the final normalized score indicates the similarity between a reference form and a pattern to be inspected. Preferably, the pattern to be inspected is a speech utterance and the reference form is based on a Hidden Markov Model. Alternatively, the pattern to be inspected is a speech utterance, the reference form is a reference template, and the feasible paths are calculated according to a Dynamic Time Warping algorithm.
Additionally, in accordance with a preferred embodiment of the present invention, the beginning and ending points of the feasible path which is associated with the final normalized score are used to estimate beginning and ending points of the pattern to be inspected.
Finally, there is provided, in accordance with a preferred embodiment of the present invention, a method for pattern recognition including the steps of providing a digital pattern to be inspected which contains a plurality of feature vectors, providing at least one digital reference pattern containing a different plurality of parameter vectors, and comparing the digital pattern to be inspected with the at least one digital reference pattern. The step of comparing includes the steps of providing a search area including a grid with the feature vectors on a first axis and the parameter vectors on a second axis, and calculating a final normalized score which is the minimum of a plurality of optimal normalized scores each associated with a corresponding feasible path, wherein each of the feasible paths is located in the search area. The step of calculating includes, for each point in the search area, the steps of computing an accumulated score for a plurality of feasible paths which contain the point, computing an overall weight for each of the plurality of feasible paths which contain the point, computing a normalized score, whereby the normalized score is the accumulated score for the point divided by the overall weight for the point, for each of the plurality of feasible paths which contain the point, and selecting the normalized score which is least, from the plurality of normalized scores, as an optimal normalized score for the point.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the drawings in which:
Fig. 1 is a schematic block diagram illustration of the architecture of a preferred embodiment of speech recognition apparatus constructed and operated in accordance with a preferred embodiment of the present invention;
Fig. 2 is a schematic block diagram illustration of a speech recognition system constructed and operated in accordance with the principles of a preferred embodiment of the present invention;
Fig. 3 is a graphical illustration of an optimization procedure of a preferred embodiment of the invention; and
Fig. 4 is a pseudo-code illustration of a scoring algorithm for pattern recognition in the speech recognition system of Fig. 2 in accordance with a dynamic programming technique of the invention.
DETAILED DESCRIPTION OF THE PRESENT INVENTION
Reference is now made to Figs. 1-4. Fig. 1 shows a schematic block diagram of the architecture of a microprocessor-based speech recognition system operated in accordance with the principles of the present invention.
A user codec 2, such as an Intel 2913 from Intel Corporation, interfaces with digital signal processing circuitry 4, typically a TMS 320C25 from Texas Instruments Corporation.
A memory storage area 6, which typically comprises a static random-access memory, such as a 32K by 8 bit memory with an access time of 100 nsec, is connected to the digital signal processing circuitry by means of a standard address, data and read-write control bus.
Fig. 2 shows a schematic block diagram of a microprocessor-based speech recognition system operated in accordance with the principles of the present invention.
The algorithms of Fig. 2 are typically carried out by software run on digital signal processing circuitry 4, such as the digital signal processing circuitry of Fig. 1.
An analog signal 12, which may be obtained from a microphone or similar device, is typically provided to a standard sampling device 14. The sampling device 14, which may be codec 2 (Fig. 1), converts the analog signal to a digital signal 16.
The output of the sampling device, the digital signal 16, is then supplied to a voice activated detection device 18, which may be a device as described in U.S. Patent Application 07/151,740 to the same assignee, which is incorporated herein by reference. The output of the voice activated detection device 18 is a digital speech signal 20. The voice activated detection device may be implemented by digital signal processing circuitry 4 (Fig. 1).
After the digital speech signal 20 has been extracted from the input signal, the digital speech signal 20 is provided to a boundary detector 22 which typically determines the beginning and end points of an utterance that is found in the digital speech signal. The determination may be carried out by a standard boundary detector algorithm such as the type described by Wilpon et al.
The utterance is then conveyed to a feature extraction device 26 where spectral or other features are extracted, typically through LPC analysis. The feature extraction procedure transforms the utterance into a sequence of test feature vectors 28. Preferably, each test vector contains the features of a speech frame of approximately 30 msec. An overlap of typically 50% may be applied between adjacent speech frames.
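The framing described above can be sketched as follows. This is a minimal illustration only: the function name `split_into_frames` and the 8 kHz sampling rate are assumptions made for the example and do not come from the description, and the LPC analysis itself is not shown.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=30, overlap=0.5):
    """Split a sample sequence into overlapping frames.

    sample_rate_hz is a hypothetical value; the description only gives the
    frame duration (about 30 msec) and the typical overlap (50%).
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    # With 50% overlap, each frame starts half a frame after the previous one.
    hop = max(1, int(frame_len * (1.0 - overlap)))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames
```

Each returned frame would then be passed to the feature analysis to yield one test feature vector.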
The sequence of test feature vectors 28 supplied by the feature extraction device 26 is provided to a pattern recognition algorithm 30. The pattern recognition algorithm consists of two primary parts: a scoring algorithm 31 and a decision procedure 36. A set of reference templates 32 from a memory 34 is passed to the scoring algorithm 31 to serve as a reference. The memory storage area 34 is typically of the type depicted in Fig. 1.
Preferably, reference templates, consisting of sequences of parameter vectors, are stored in the memory 34 during a process called training (not shown). Training typically consists of inputting signals of a certain class to the system according to the steps of voice detection through feature extraction described above. Following these steps, the input signals are processed, and reference templates 32 are generated and stored in the memory area 34.
The parameter vectors of the template provided by the training procedure represent characteristic features of the class of input signals. Typically, a template may represent utterances of a particular word or of a particular subword unit such as a syllable or a phoneme. Alternatively, the template may represent the voice of a particular person.
Typically each parameter vector is a feature vector of a reference utterance. Alternatively, the parameter vectors may include parameters defining a model for a feature sequence of a test utterance.
In accordance with the present invention, a novel approach to pattern recognition, using a modification of the dynamic programming method for the scoring procedure, is achieved based on a method of path estimation and normalization of an accumulated similarity score as described in detail hereinbelow. Preferably, the novel approach to pattern recognition uses a modified Dynamic Time Warping algorithm or, alternatively, a Hidden Markov Model algorithm for the scoring algorithm 31. According to a further embodiment, any other suitable dynamic programming based algorithm may be used instead of the examples offered herein.
The output of the scoring algorithm 31 is a set of final similarity scores (as defined hereinbelow), with each score indicating the similarity between the sequence of test vectors 28 and each of the reference templates 32.
The scoring algorithm output is typically provided to decision procedure 36 which may comprise a k-NN (k-Nearest Neighbor) rule for determination of the class of inputs to which the pattern between the beginning and endpoints in input signal 12 belongs.
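As an illustration of such a decision procedure, a k-NN rule over the final similarity scores might look like the sketch below. The function name `knn_decide`, the (label, score) input layout and the tie-breaking rule are assumptions made for the example; the description only states that a k-NN rule may be used and that lower scores mean greater similarity.

```python
from collections import Counter


def knn_decide(scored_templates, k=3):
    """Pick a class label from (label, final_score) pairs.

    Lower scores indicate greater similarity. The list is assumed to be
    non-empty. Ties in the vote go to the label whose best template has
    the lowest score (a choice made for this sketch).
    """
    nearest = sorted(scored_templates, key=lambda item: item[1])[:k]
    votes = Counter(label for label, _ in nearest)
    best_count = max(votes.values())
    for label, _ in nearest:
        if votes[label] == best_count:
            return label
```

For example, with several templates per vocabulary word, the word whose templates dominate the k lowest scores is returned as the recognized class.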
The overall output of the pattern recognition procedure provides a code or index 40, which describes the class of inputs to which the pattern between the beginning and the endpoints in input signal 12 belongs. Typically, this code or index indicates the verbal contents of input signal 12. Alternatively, the code or index indicates the identity of the speaker who uttered the speech embodied in the input signal 12.
Reference is now made to Fig. 3 which shows a graphical representation of a preferred embodiment of a part of the sequence of the pattern recognition procedure of Fig. 2 in accordance with a preferred embodiment of the invention.
The graph representation shows a non-linear time warping function which may be used for scoring the similarity between a test utterance and a reference template.
The time warping function maps the time axis of a test feature sequence 50 to the time axis of a reference template 52. The mapping provides a time registration between the reference template 52, which is preferably provided by the memory storage area 34 (Fig. 2), and the test feature sequence 50, which may be provided by the feature extraction device 26 (Fig. 2). The reference template 52 comprises a sequence of M parameter vectors representing a word from a vocabulary recognizable by a speech recognition system such as the speech recognition system of Fig. 2. M may vary according to the particular reference template. The test feature sequence 50 comprises a sequence of N test feature vectors.
The graph comprises a grid in which each point (n,m) is associated with a local similarity score, where m is the index of the m-th parameter vector of the reference template and n is the index of the n-th test feature vector in the sequence of test feature vectors.
The skilled professional may determine the local similarity score associated with each pair of test feature vectors and reference parameter vectors according to his considerations.
It is assumed that the lower the local similarity score, the greater the similarity between the pair of items being compared.
Preferably, the local similarity scores may be determined by computing standard Euclidean or Mahalanobis distances between the test feature vector and the reference parameter vector.
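For illustration, the two distances can be sketched as below. The function names are hypothetical, and the Mahalanobis variant assumes a diagonal covariance matrix (one variance per feature) to keep the example short; the description does not specify the covariance structure.

```python
import math


def euclidean(x, y):
    """Euclidean distance between a test feature vector and a reference vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mahalanobis_diag(x, y, variances):
    """Mahalanobis distance assuming a diagonal covariance (an assumption).

    variances holds one positive variance per feature dimension.
    """
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))
```

In both cases a smaller value means the two vectors are more similar, which matches the convention stated above for local similarity scores.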
According to a further embodiment, the local similarity score may be determined by a speech specific distortion measure such as the likelihood ratio distortion measure proposed by Itakura in the article "Minimum Prediction Residual Principle Applied to Speech Recognition", IEEE Trans. Acoustics, Speech and Signal Processing, Vol. ASSP-23, Feb. 1975, pp. 67-72. The article is incorporated herein by reference.
Alternatively, the local similarity score may be probabilistic. Typically, the probabilistic local similarity score could be computed using a parametric function of the test feature vector, which depends on the reference parameter vector. The function value provides a statistical estimate of the minus log of the likelihood of observing the test feature vector in a particular segment of the reference word.
A feasible warping path, 54, is a sequence of grid points which satisfy certain constraints. Specific constraints are determined by the skilled professional. A typical constraint requires the feasible warping path to map the beginning and ending feature vectors of the test to the beginning and ending parameter vectors of the reference, respectively. Another typical constraint is that the slope of the warping path will be within a specified limit, typically between 1:2 and 2:1.
A search area 56, in which all feasible warping paths (as defined above) are contained, is typically defined by:
{ (n,m) : g(n) < m < f(n) }
where f and g are defined as follows:
f = min {f_1, ..., f_p}
where {f_1, ..., f_p} is a set of linear searching area boundary functions defining the upper constraints, and
g = max {g_1, ..., g_q}
where {g_1, ..., g_q} is a set of linear searching area boundary functions defining the lower constraints.
In Fig. 3, p=3 and q=3. These figures are just offered as examples and should not be seen as limiting the possible definition for f or for g.
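A membership test for such a search area can be sketched as follows. The function name `in_search_area` and the representation of each linear boundary function as a (slope, intercept) pair are assumptions made for the example; the strict inequalities follow the definition as printed.

```python
def in_search_area(n, m, upper_bounds, lower_bounds):
    """Return True if grid point (n, m) lies between the boundary functions.

    upper_bounds and lower_bounds are lists of hypothetical (slope, intercept)
    pairs, each describing one linear boundary function of n.
    """
    f = min(a * n + b for a, b in upper_bounds)  # tightest upper constraint
    g = max(a * n + b for a, b in lower_bounds)  # tightest lower constraint
    return g < m < f
```

With two upper lines `[(2.0, 0.5), (0.5, 6.0)]` and two lower lines `[(0.5, -0.5), (2.0, -11.0)]`, for instance, the point (5,5) lies inside the area while (1,9) does not.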
Fig. 4 shows a pseudo-code description of a scoring algorithm as part of the pattern recognition in the speech recognition system of Fig. 2 in accordance with a preferred embodiment of a dynamic programming technique of the invention.
The algorithm of Fig. 4 can be implemented by the digital processing circuitry 4 of Fig. 1. Alternatively, the algorithm can be implemented using other suitable computing hardware in accordance with state-of-the-art electronic design and programming techniques.
The scoring procedure, which is typically based on a Dynamic Time Warping algorithm, or alternatively, on a Hidden Markov Model algorithm, is preferably used to determine the similarity between a test utterance and reference word in speech recognition procedures.
The operation of the scoring algorithm of Fig. 4 is described according to the following steps:
STEP 1: INITIALIZE GRID
During this step, initial values are assigned to each point in search area 56, where the search area is as defined above. This step is independent of the content of the sequence of test feature vectors, and depends only on the number N of test feature vectors in a certain sequence and the number M of parameter vectors in a reference template.
Initial grid properties are defined as follows:
(1) A set of path beginning grid points and a set of path ending grid points are defined. A typical definition of the beginning set is:
{ (n,1) ; n = 1, ..., x_1 } U { (1,m) ; m = 1, ..., x_1 }
A typical definition of the ending set is:
{ (n,M) ; n = N-x_2, ..., N } U { (N,m) ; m = M-x_2, ..., M }
where x_1 and x_2 are the maximum expected errors of the boundary detector at the beginning and at the end of the test word, respectively (assuming that the reference boundaries are sufficiently accurate).
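The two sets can be enumerated with a short sketch. It assumes the beginning set consists of the points (n,1) and (1,m) with n and m running from 1 up to the maximum expected beginning error, and the ending set consists of the points (n,M) and (N,m) within the maximum expected end error of N and M; the function names and the parameter names `x1` and `x2` are chosen for the example.

```python
def beginning_points(x1):
    """Grid points where a path may begin, for a maximum beginning error x1."""
    along_test_axis = {(n, 1) for n in range(1, x1 + 1)}
    along_reference_axis = {(1, m) for m in range(1, x1 + 1)}
    return along_test_axis | along_reference_axis


def ending_points(N, M, x2):
    """Grid points where a path may end, for a maximum end error x2."""
    top_row = {(n, M) for n in range(N - x2, N + 1)}
    last_column = {(N, m) for m in range(M - x2, M + 1)}
    return top_row | last_column
```

Setting both errors to the smallest values (x1 = 1, x2 = 0) reduces the sets to the single points (1,1) and (N,M), which corresponds to the conventional fixed-endpoint constraint.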
(2) For each grid point in the search area, as defined hereinabove, a list of "access paths" is defined. An access path is a short path leading from a neighboring grid point to a given grid point. The access paths should be defined in such a way that a concatenation of access paths leading from a path beginning grid point to a path ending grid point constitutes a feasible path (as defined above). Additionally, any feasible path must be representable as a concatenation of access paths from a path beginning grid point to a path ending grid point.
When the scoring algorithm is based on the Dynamic Time Warping algorithm, access paths are preferably defined according to the symmetric p=l rule of Sakoe and Chiba. The rule is described in the article, incorporated herein by reference, "Dynamic programming Algorithm Optimization for Spoken Word Recognition", published in the IEEE Trans. Acoustic. Speech and Signal Processing. Vol. ASSP-26, Feb. 1978, pp. 43-49.
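As a sketch of what such a list of access paths may look like, the code below enumerates the three access paths of the symmetric P=1 rule as it is commonly stated from the Sakoe and Chiba article; this reading of the rule is an assumption and should be checked against the article. The function names are hypothetical, and the per-step weight (sum of absolute coordinate differences between consecutive points) is the typical Dynamic Time Warping weight given later in this description.

```python
def access_paths_symmetric_p1(n, m):
    """Access paths leading to (n, m), each listed from its first point."""
    return [
        [(n - 1, m - 2), (n, m - 1), (n, m)],  # steep step
        [(n - 1, m - 1), (n, m)],              # diagonal step
        [(n - 2, m - 1), (n - 1, m), (n, m)],  # shallow step
    ]


def step_weights(path):
    """Local weight of every point on the access path except the first."""
    return [abs(n2 - n1) + abs(m2 - m1)
            for (n1, m1), (n2, m2) in zip(path, path[1:])]
```

Every access path here carries a total weight of 2 or 3, so a concatenation of access paths accumulates a weight equal to the total horizontal plus vertical distance travelled.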
Alternatively, an access path may be defined by a left to right finite state automaton where each reference parameter vector is represented by a state and each grid point (n,m) indicates that at time n, the automaton has reached state m. An access path to a grid point (n,m) is a two-point path of the form [(n-1,k), (n,m)] where there exists a transition leading from the state representing the k-th reference parameter vector to the state representing the m-th reference parameter vector. Such a definition is common in Hidden Markov Models.
STEP 2: LOOP ON GRID POINTS IN SEARCH AREA:
For each grid point (n,m) in the search area 56, the following steps are performed to establish the optimal path which reaches that point, where the following definitions hold true:
For each grid point (n,m) along a feasible path, a local weight may be defined indicating the significance of the local similarity score at that point. A bias at the point (n,m) may be defined to indicate the a priori likelihood of the feasible path passing through that point.
The accumulated similarity score, D(n,m), of a feasible path containing the grid point (n,m) is the sum of all biases along the path from the path beginning to the point (n,m), plus the sum of all local similarity scores from the path beginning to the point (n,m), where each local score is multiplied by a corresponding local weight. The local similarity score is calculated according to the methods outlined above and the bias and local weight are calculated as defined below.
The overall weight, W(n,m), of a path containing the point (n,m) is the sum of all local weights along that path from its beginning to its ending. The accumulated weight, B(n,m), of a path containing the point (n,m) is the sum of all local weights along the path, from the path beginning to the point (n,m).
The future weight, F(n,m), of a path containing the point (n,m) is the sum of all local weights along the path, from the point following (n,m) to the path end. For a given feasible path, the overall weight is the sum of the accumulated weight and the future weight.
The normalized similarity score with respect to a feasible path containing the grid point (n,m), A(n,m), is the accumulated similarity score divided by the overall weight, i.e. A(n,m) = D(n,m)/W(n,m).
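These definitions can be made concrete with a small sketch that evaluates D, B, F, W and A at one point of a single, fully known feasible path. The function name `path_scores` and the dictionary-based inputs are assumptions made for the example; in the actual algorithm the future part of the path is not known and F is predicted instead.

```python
def path_scores(points, local_score, local_weight, bias, index):
    """Return (D, B, F, W, A) at points[index] for one complete feasible path.

    points is the ordered list of grid points on the path; local_score,
    local_weight and bias are dictionaries keyed by grid point.
    """
    upto = points[:index + 1]
    # Accumulated similarity score: biases plus weighted local scores so far.
    D = sum(bias[p] + local_weight[p] * local_score[p] for p in upto)
    B = sum(local_weight[p] for p in upto)                # accumulated weight
    F = sum(local_weight[p] for p in points[index + 1:])  # future weight
    W = B + F                                             # overall weight
    return D, B, F, W, D / W
```

For a three-point path with local weights of 2, zero biases and local scores 1.0, 3.0 and 2.0, the values at the middle point are D = 8, B = 4, F = 2, W = 6 and A = 8/6.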
The optimal normalized similarity score, A*(n,m), is the minimum of the normalized similarity scores A(n,m), taken over all feasible paths containing (n,m). The optimal feasible path through (n,m) is the path for which A(n,m) was minimal. If there is more than one such path, the choice of the optimal one is arbitrary.
The optimal overall weight W*(n,m), the optimal accumulated weight B*(n,m), the optimal future weight F*(n,m) and the optimal accumulated similarity score D*(n,m) are the overall weight W(n,m), the accumulated weight B(n,m), the future weight F(n,m) and the accumulated similarity score D(n,m), respectively, associated with the optimal feasible path through (n,m). The optimal path beginning grid point b*(n,m) and the optimal path ending grid point e*(n,m) are the beginning and ending points, respectively, of the optimal feasible path through (n,m) (in the original notation e and b are underlined to indicate that each represents a pair of coordinates).
During this step, the values of D*(n,m), W*(n,m), B*(n,m), F*(n,m), A*(n,m) and b*(n,m) are estimated for each grid point (n,m) in the search area. In addition, the access path leading to (n,m) in the optimal feasible path through (n,m) is also computed. The preferable method of performing this task is according to the steps that follow.
STEP 2.1: COMPUTE LOCAL SIMILARITY SCORE
The local similarity score d(n,m) at point (n,m) is computed according to the methods outlined above.
STEP 2.2: ESTIMATING THE FUTURE WEIGHT.
F*(n,m), the optimal future weight is predicted. Preferably, F*(n,m) is the average of the future weights from (n,m) to each of the path ending grid points which are accessible from (n,m) by a feasible path. Alternatively, F*(n,m) may be the median of those future weights.
Typically, all the definitions of local weights mentioned hereinabove share the property of path invariance, which means that all future weights F(n,m) of paths with the same ending grid point are equal. Therefore the future weight from (n,m) to each path ending grid point is uniquely defined.
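The prediction can be sketched as follows for the typical Dynamic Time Warping weights, where the local weight of a step is the sum of the absolute coordinate differences, so that the future weight to a given ending point depends only on the two end coordinates. The function names are hypothetical, and the reachability test used here (the ending point is not behind the current point on either axis) is a simplifying assumption that ignores slope constraints.

```python
from statistics import mean, median


def future_weight(point, end):
    """Path-invariant future weight from point to a path ending grid point."""
    (n, m), (ne, me) = point, end
    return (ne - n) + (me - m)


def predict_future_weight(point, ending_points, use_median=False):
    """Average (or median) future weight over reachable ending points."""
    n, m = point
    reachable = [e for e in ending_points if e[0] >= n and e[1] >= m]
    if not reachable:
        return None  # no feasible continuation from this point
    weights = [future_weight(point, e) for e in reachable]
    return median(weights) if use_median else mean(weights)
```

With ending points (9,8), (10,8) and (10,7), the future weights from (5,5) are 7, 8 and 7, so the average prediction is 22/3 and the median prediction is 7.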
STEP 2.3: INITIALIZE SCORES FOR BEGINNING POINTS
During this step, initial estimates for the optimal scores of a grid point (n,m) are established, based on the assumption that the optimal path begins at that point.
If (n,m) does not belong to the set of path beginning grid points, the initial estimate of A*(n,m) is assigned the value of infinity, indicating the falseness of the assumption.
If (n,m) is in the set of path beginning grid points (as defined in step 1), the initial estimates are computed according to the following steps.
STEP 2.3.1: DEFINE LOCAL WEIGHT AND BIAS
A local weight and a bias are defined for a path beginning grid point (n,m). The skilled professional may determine these values according to his considerations.
In a dynamic time warping embodiment of the present invention, a typical value for the bias is 0 and a typical value for the local weight is 2.
Alternatively, in a Hidden Markov Model embodiment, a typical value for the bias may be minus log of the likelihood that the path begins at the given point (n,m) and the local weight may be set equal to 1. Typically, the value of the bias is estimated during the training procedure.
STEP 2.3.2: COMPUTE INITIAL SCORES
Using the local weight and bias calculated in Step 2.3.1, the initial estimate for the optimal scores and optimal path beginning grid point, under the assumption that the optimal path begins at (n,m), can be made as follows:
The optimal beginning point is set to be the same point: b*(n,m) = (n,m).
The optimal accumulated weight, B*(n,m) gets the value of the local weight.
The optimal overall weight W*(n,m) is the sum of optimal accumulated and future weights, B*(n,m)+F*(n,m).
The optimal accumulated similarity score, D*(n,m), is the bias for the point (n,m) plus the local similarity score of that same point multiplied by the local weight of the point.
The optimal normalized similarity score, A*(n,m), is the optimal accumulated similarity score divided by the optimal overall weight, D*(n,m)/W*(n,m).
STEP 2.4: LOOP ON LIST OF ACCESS PATHS LEADING TO (n,m)
In each execution of this loop, one of the access paths leading to a point (n,m) is checked for the hypothesis that the optimal path through (n,m) contains that particular access path. This is done by computing the normalized similarity score for a particular access path under this hypothesis and then comparing it to the current estimated value of the optimal normalized similarity score. If the computed value is smaller than the current estimate, all current estimates of optimal scores for that point (n,m) are replaced by the values computed for that access path.
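The initialization of Step 2.3 and the comparison made on each pass of this loop can be sketched together as follows. The `Scores` container and the function names are illustrative assumptions; the source specifies only which quantities are kept and when they are replaced.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scores:
    """Current optimal estimates for one grid point (field names are illustrative)."""
    D: float = math.inf            # accumulated similarity score D*
    B: float = 0.0                 # accumulated weight B*
    W: float = 0.0                 # overall weight W* = B* + F*
    A: float = math.inf            # normalized similarity score A* = D* / W*
    begin: Optional[tuple] = None  # optimal path beginning grid point b*

def initial_scores(point, is_beginning, local_score, weight, bias, future_weight):
    """Estimates under the assumption that the optimal path begins at `point`."""
    if not is_beginning:
        return Scores()  # A* is infinite, so any finite candidate replaces it
    D = bias + weight * local_score
    B = weight
    W = B + future_weight
    return Scores(D=D, B=B, W=W, A=D / W, begin=point)

def keep_better(current, candidate):
    """Adopt the candidate scores only if its normalized score is smaller."""
    return candidate if candidate.A < current.A else current
```

Because all scores of a point are replaced together, the accumulated score, the weights and the beginning point always describe the same path as the normalized score that won the comparison.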
The following steps describe the operation of the loop for each given access path to (n,m). In the description, (p,q) will denote the beginning point of the access path to (n,m) under consideration.
STEP 2.4.1: SET WEIGHTS AND BIASES FOR GIVEN ACCESS PATH
Local weights and biases are defined for each point on the given access path except for the first point of the access path. Typically, in a Dynamic Time Warping embodiment of the present invention, the bias is 0 and the weight is the sum of the absolute values of the differences of corresponding coordinates in the current and previous grid points on the access path (i.e. if point (k,l) immediately precedes (n,m) on the access path, then the local weight of (n,m) equals |n-k| + |m-l|).
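For the Dynamic Time Warping case, the local weight described above reduces to a one-line computation. The function name is hypothetical, and grid points are assumed to be (n, m) integer pairs.

```python
def dtw_local_weight(prev_point, point):
    """Local weight of `point` given the grid point that precedes it.

    Sum of the absolute coordinate differences: |n - k| + |m - l|.
    """
    (k, l), (n, m) = prev_point, point
    return abs(n - k) + abs(m - l)
```

A diagonal step from (3,4) to (4,5) therefore gets weight 2, while a horizontal step from (3,4) to (4,4) gets weight 1.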
Alternatively, the bias may be minus log of the likelihood of moving to the current grid point from the preceding one (this likelihood may typically be determined during training) and the local weight is 1. This is the common case in Hidden Markov Model devices.
STEP 2.4.2: COMPUTE ACCUMULATED SIMILARITY SCORE FOR GIVEN ACCESS PATH
The accumulated similarity score D(n,m) is computed for a path which comprises the concatenation of the optimal path to (p,q) and the given access path. Therefore D(n,m) is calculated as D*(p,q) plus the sum of all biases along the given access path (except for the first point (p,q)) plus the sum of all local similarity scores along the access path (except for the first point (p,q)), each multiplied by the corresponding local weight.
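A minimal sketch of this accumulation, assuming each point of the access path after (p,q) is described by a (local similarity score, local weight, bias) triple; the function name and this data layout are illustrative choices, not taken from the source.

```python
def accumulated_score(d_star_pq, steps):
    """D(n,m) = D*(p,q) + sum over the access path of (bias + weight * score).

    `steps` holds one (local_score, weight, bias) triple per point on the
    access path, excluding its first point (p,q).
    """
    return d_star_pq + sum(bias + weight * score
                           for score, weight, bias in steps)
```

For example, starting from D*(p,q) = 3.0, an access path with two further points scoring 0.5 (weight 1) and 0.25 (weight 2), both with bias 0, gives D(n,m) = 3.0 + 0.5 + 0.5 = 4.0.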
STEP 2.4.3: COMPUTE ACCUMULATED AND OVERALL WEIGHT FOR GIVEN ACCESS PATH
The accumulated weight B(n,m) is computed for a path which contains the concatenation of the optimal path to (p,q) and the given access path. Therefore B(n,m) is calculated as B*(p,q) plus the sum of all local weights along the access path (except for the first point (p,q)).
The overall weight W(n,m) is computed by adding the accumulated weight B(n,m) to the estimated optimal future weight F*(n,m).
STEP 2.4.4: COMPUTE NORMALIZED SIMILARITY SCORE FOR GIVEN ACCESS PATH
The normalized similarity score A(n,m) is computed for a path which contains the concatenation of the optimal path to (p,q) and the given access path. Therefore A(n,m) is calculated as D(n,m) divided by W(n,m).
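The weight bookkeeping and the normalization can be sketched in a few lines. The function name and argument order are assumptions made for illustration; `path_weights` stands for the local weights of the access path points other than (p,q).

```python
def normalized_score(d, b_star_pq, path_weights, future_weight):
    """Return (B, W, A) for the path reaching (n,m) through one access path."""
    b = b_star_pq + sum(path_weights)  # accumulated weight B(n,m)
    w = b + future_weight              # overall weight W(n,m) = B(n,m) + F*(n,m)
    return b, w, d / w                 # normalized similarity score A(n,m)
```

With D(n,m) = 4.0, B*(p,q) = 5, access path weights 1 and 2, and F*(n,m) = 8, this gives B = 8, W = 16 and A = 0.25. Dividing by the overall weight rather than the accumulated weight alone is what lets scores of paths with different beginning and ending points be compared on an equal footing.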
STEP 2.4.5: UPDATE OPTIMAL SCORES IF NECESSARY
If the normalized score for the given access path, A(n,m), is less than the current estimate of the optimal normalized similarity score, A*(n,m), the following step is performed:
STEP 2.4.5.1: ASSIGN NEW OPTIMAL VALUES
The current estimate for the optimal path through (n,m) is updated to be a path which contains the concatenation of the optimal path to point (p,q) and the given access path. Accordingly, the current values of D*(n,m), B*(n,m), W*(n,m), and A*(n,m) are replaced by the values corresponding to the updated optimal path, that is, D(n,m), B(n,m), W(n,m), and A(n,m), respectively.
In addition, the path beginning grid point b*(n,m) is set to be equal to b*(p,q), the optimal path beginning grid point of the beginning point of the given access path.
STEP 3: DETERMINE FINAL VALUES
After optimal scores have been estimated for all grid points in the search area, the final outputs of the algorithm are determined in the following steps:
STEP 3.1: DETERMINE FINAL NORMALIZED SIMILARITY SCORE
The minimal value of A*(n,m), over all the points in the set of path ending grid points (as defined in step 1) is the final normalized similarity score. The feasible path associated with the final normalized score is the final path.
The path ending grid point (n,m) of the final path is the final path ending grid point. The optimal path beginning grid point of the final path, b*(n,m) is the final path beginning grid point.
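The selection of the final score can be sketched as a minimum over the path ending grid points. The dictionaries `a_star` and `b_star`, keyed by grid point, are an assumed storage layout for A*(n,m) and b*(n,m); the source does not prescribe one.

```python
def final_result(end_points, a_star, b_star):
    """Pick the path ending grid point with the smallest A*.

    Returns (final normalized score, final beginning point, final ending point).
    """
    best_end = min(end_points, key=lambda p: a_star[p])
    return a_star[best_end], b_star[best_end], best_end
```

For instance, if the ending points (10,8), (11,8) and (12,8) carry A* values 0.42, 0.37 and 0.51, the final ending point is (11,8) and the final beginning point is whatever b* was recorded for it.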
STEP 3.2: DETERMINE FINAL BEGIN AND END ESTIMATES
The first coordinates of the final path beginning grid point and of the path ending grid point are the final estimates for the beginning and ending of a test utterance, respectively. The second coordinate of these grid points indicates the beginning and ending, respectively, of the part of a reference template sequence that was matched by the test utterance. If the second coordinate of the final beginning point or of the final ending point does not equal 1 or M, respectively, this indicates that the initial boundary estimate clipped the beginning or the ending, respectively, of the tested utterance.
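Reading the final grid points in this way can be sketched as below. The function name and the returned dictionary keys are hypothetical, and `reference_length` is assumed to be M, the last index of the reference template sequence with indices starting at 1.

```python
def interpret_boundaries(final_begin, final_end, reference_length):
    """Translate the final path beginning and ending grid points into estimates."""
    (n_begin, m_begin), (n_end, m_end) = final_begin, final_end
    return {
        "utterance_begin": n_begin,   # first coordinate of the beginning point
        "utterance_end": n_end,       # first coordinate of the ending point
        "reference_begin": m_begin,   # start of the matched part of the reference
        "reference_end": m_end,       # end of the matched part of the reference
        "beginning_clipped": m_begin != 1,
        "ending_clipped": m_end != reference_length,
    }
```

A final path from (2,1) to (11,8) against a reference of length 8 reports no clipping, whereas a final path beginning at (1,3) would flag the beginning of the tested utterance as clipped.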
Having described the invention with regard to certain specific embodiments thereof, it is to be understood that the description is not meant as a limitation since further modifications may now suggest themselves to those skilled in the art and it is intended to cover such modifications as fall within the scope of the appended claims.

Claims

1. Apparatus for pattern recognition comprising: means for providing a digital pattern to be inspected, said pattern containing a plurality of feature vectors; means for providing at least one digital reference pattern containing a different plurality of parameter vectors; and means for comparing said digital pattern to be inspected with said at least one digital reference pattern, said means comprising: means for providing a search area comprising a grid with said feature vectors on a first axis and said parameter vectors on a second axis; and means for calculating a final normalized score which is the minimum of a plurality of optimal normalized scores each associated with a corresponding feasible path, wherein each of said feasible paths is located in said search area, said means comprising for each point in said search area: means for computing an accumulated score for a plurality of feasible paths which contain said point; means for computing an overall weight for each of said plurality of feasible paths which contain said point; means for computing a normalized score, whereby the normalized score is the accumulated score for said point divided by the overall weight for said point, for each of said plurality of feasible paths which contain said point; and means for selecting the normalized score which is least, from said plurality of normalized scores, as an optimal normalized score for said point.
2. Apparatus according to claim 1, and wherein said search area comprises a plurality of path beginning points and a plurality of path ending points.
3. Apparatus according to claim 1, and also comprising means for determining beginning and ending points of that feasible path which is associated with said final normalized score thereby to determine beginning and ending points of said digital pattern.
4. Apparatus according to claim 1, and wherein said overall weight comprises an accumulated weight and a predicted weight.
5. Apparatus according to claim 3, and wherein said overall weight comprises an accumulated weight and a predicted weight.
6. Apparatus according to claim 1, wherein said digital pattern to be inspected is a speech utterance and said digital reference pattern is based on a Hidden Markov Model.
7. Apparatus according to claim 1, wherein said pattern to be inspected is a speech utterance, said reference pattern is a reference template, and said feasible paths are calculated according to a Dynamic Time Warping algorithm.
8. Apparatus according to claim 1, wherein the beginning and ending points of said feasible path which is associated with the final normalized score are used to estimate beginning and ending points of said pattern to be inspected.
9. Apparatus according to claim 1, and wherein said digital pattern is derived from a speech signal.
10. A method for producing a final normalized score which is the minimum of a plurality of optimal normalized scores each associated with a corresponding feasible path, wherein each of said feasible paths is located in a search area and wherein said search area comprises a set of points characterized by a plurality of path beginning points and a plurality of path ending points, for each point in said search area said method comprising the steps of: computing an accumulated score for a plurality of feasible paths which contain said point; computing an overall weight for each of said plurality of feasible paths which contain said point; computing a normalized score, whereby the normalized score is the accumulated score for said point divided by the overall weight for said point, for each of said plurality of feasible paths which contain said point; and selecting the normalized score which is least, from said plurality of normalized scores, as an optimal normalized score for said point.
11. A method according to claim 10, which also comprises the step of determining beginning and ending points of that feasible path which is associated with said final normalized score.
12. A method according to claim 10, and wherein said overall weight comprises an accumulated weight and a predicted weight.
13. A method according to claim 11, and wherein said overall weight comprises an accumulated weight and a predicted weight.
14. A method for pattern recognition utilizing the method of claim 10, and wherein said final normalized score indicates the similarity between a reference form and a pattern to be inspected.
15. A method for pattern recognition utilizing the method of claim 12, and wherein said final normalized score indicates the similarity between a reference form and a pattern to be inspected.
16. A method according to claim 14, wherein said pattern to be inspected is a speech utterance and said reference form is based on a Hidden Markov Model.
17. A method according to claim 15, wherein said pattern to be inspected is a speech utterance, said reference form is a reference template, and said feasible paths are calculated according to a Dynamic Time Warping algorithm.
18. A method according to claim 10, wherein the beginning and ending points of said feasible path which is associated with the final normalized score are used to estimate beginning and ending points of said pattern to be inspected.
19. A method for pattern recognition comprising the steps of: providing a digital pattern to be inspected, said pattern containing a plurality of feature vectors; providing at least one digital reference pattern containing a different plurality of parameter vectors; and comparing said digital pattern to be inspected with said at least one digital reference pattern, said step of comparing comprising the steps of: providing a search area comprising a grid with said feature vectors on a first axis and said parameter vectors on a second axis; and calculating a final normalized score which is the minimum of a plurality of optimal normalized scores each associated with a corresponding feasible path, wherein each of said feasible paths is located in said search area, said step of calculating comprising, for each point in said search area, the steps of: computing an accumulated score for a plurality of feasible paths which contain said point; computing an overall weight for each of said plurality of feasible paths which contain said point; computing a normalized score, whereby the normalized score is the accumulated score for said point divided by the overall weight for said point, for each of said plurality of feasible paths which contain said point; and selecting the normalized score which is least, from said plurality of normalized scores, as an optimal normalized score for said point.
20. Apparatus according to claim 9, wherein said digital reference pattern indicates the verbal contents of said speech signal.
21. Apparatus according to claim 9, wherein said digital reference pattern indicates the identity of the speaker of said speech signal.
22. A method according to claim 16, wherein said reference form indicates the verbal contents of said speech utterance.
23. A method according to claim 17, wherein said reference form indicates the verbal contents of said speech utterance.
24. A method according to claim 16, wherein said reference form indicates the identity of the speaker of said speech utterance.
25. A method according to claim 17, wherein said reference form indicates the identity of the speaker of said speech utterance.
PCT/US1991/007165 1990-10-02 1991-10-02 Boundary relaxation for speech pattern recognition WO1992006469A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
IL95869 1990-10-02
IL9586990A IL95869A (en) 1990-10-02 1990-10-02 Boundary relaxation for speech pattern recognition
IL98092 1991-05-09
IL98092A IL98092A0 (en) 1991-05-09 1991-05-09 Boundary relaxation for speech pattern recognition

Publications (1)

Publication Number Publication Date
WO1992006469A1 (en) 1992-04-16

Family

ID=26322136

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1991/007165 WO1992006469A1 (en) 1990-10-02 1991-10-02 Boundary relaxation for speech pattern recognition

Country Status (2)

Country Link
EP (1) EP0551374A4 (en)
WO (1) WO1992006469A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19930522A1 (en) * 1999-07-05 2001-02-01 Univ Ilmenau Tech Detecting sound signals involves weighting negative deviations of test vector coefficients from reference vector coefficients more heavily than positive deviations for score computation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4256924A (en) * 1978-11-22 1981-03-17 Nippon Electric Co., Ltd. Device for recognizing an input pattern with approximate patterns used for reference patterns on mapping
US4400788A (en) * 1981-03-27 1983-08-23 Bell Telephone Laboratories, Incorporated Continuous speech pattern recognizer
US4400828A (en) * 1981-03-27 1983-08-23 Bell Telephone Laboratories, Incorporated Word recognizer
US4467437A (en) * 1981-03-06 1984-08-21 Nippon Electric Co., Ltd. Pattern matching device with a DP technique applied to feature vectors of two information compressed patterns
US4570232A (en) * 1981-12-21 1986-02-11 Nippon Telegraph & Telephone Public Corporation Speech recognition apparatus
US4624008A (en) * 1983-03-09 1986-11-18 International Telephone And Telegraph Corporation Apparatus for automatic speech recognition
US4751737A (en) * 1985-11-06 1988-06-14 Motorola Inc. Template generation method in a speech recognition system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ICASSP'86, Tokyo, April 1986, QUENOT et al., "A Dynamic time warp VLSI processor for continuous speech recognition", see esp. fig. 2. *
IEEE Trans. on ASSP, Vol. 32, No. 2, April 1984, NEY, "The use of a One-Stage Dynamic Programming Algorithm for connected word recognition", pages 263-271, see esp. pages 265, 269. *
IEEE Trans. on ASSP, Vol. 26, No. 1, February 1978, SAKOE et al., "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", pages 43-49, see esp. page 44. *
IEEE Trans. on ASSP, Vol. 36, No. 9, September 1988, IRWIN, "A Digit Pipelined Dynamic Time Warp Processor", pages 1412-1422, see especially pages 1413, 1415 (Fig. 4), 1418 (Figs. 9, 11) and page 1420. *
See also references of EP0551374A4 *

Also Published As

Publication number Publication date
EP0551374A1 (en) 1993-07-21
EP0551374A4 (en) 1995-02-15

Similar Documents

Publication Publication Date Title
US6125345A (en) Method and apparatus for discriminative utterance verification using multiple confidence measures
US4918732A (en) Frame comparison method for word recognition in high noise environments
JP3549681B2 (en) Verification of utterance identification for recognition of connected digits
US7447634B2 (en) Speech recognizing apparatus having optimal phoneme series comparing unit and speech recognizing method
US6226612B1 (en) Method of evaluating an utterance in a speech recognition system
US7027985B2 (en) Speech recognition method with a replace command
US6615170B1 (en) Model-based voice activity detection system and method using a log-likelihood ratio and pitch
US7013276B2 (en) Method of assessing degree of acoustic confusability, and system therefor
US6029124A (en) Sequential, nonparametric speech recognition and speaker identification
US8271283B2 (en) Method and apparatus for recognizing speech by measuring confidence levels of respective frames
US5613037A (en) Rejection of non-digit strings for connected digit speech recognition
US6317711B1 (en) Speech segment detection and word recognition
EP0601778A1 (en) Keyword/non-keyword classification in isolated word speech recognition
US20020049593A1 (en) Speech processing apparatus and method
JP2001503154A (en) Hidden Markov Speech Model Fitting Method in Speech Recognition System
JPH07334184A (en) Calculating device for acoustic category mean value and adapting device therefor
McDermott et al. Prototype-based minimum classification error/generalized probabilistic descent training for various speech units
US4937870A (en) Speech recognition arrangement
EP0255529A4 (en) Frame comparison method for word recognition in high noise environments.
EP0177854B1 (en) Keyword recognition system using template-concatenation model
WO1992006469A1 (en) Boundary relaxation for speech pattern recognition
Sanchis et al. Improving utterance verification using a smoothed naive Bayes model
JP2853418B2 (en) Voice recognition method
IL95869A (en) Boundary relaxation for speech pattern recognition
Ying et al. Robust voice activity detection based on noise eigenspace

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP SU

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU NL SE

WWE Wipo information: entry into national phase

Ref document number: 1991917937

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1991917937

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1991917937

Country of ref document: EP