WO1992000586A1 - Keyword-based speaker selection - Google Patents


Info

Publication number
WO1992000586A1
Authority
WO
WIPO (PCT)
Prior art keywords
users, keyword, templates, uniquely identified, template
Prior art date
Application number
PCT/US1991/004327
Other languages
French (fr)
Inventor
Paul F. Smith
Kamyar Rohani
Mark R. Harrison
Original Assignee
Motorola, Inc.
Priority date
Filing date
Publication date
Application filed by Motorola, Inc.
Publication of WO1992000586A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065: Adaptation
    • G10L15/07: Adaptation to the speaker


Abstract

A method is provided for recognizing an utterance of a voice command sequence having a keyword spoken at the beginning of the sequence. A plurality of templates is stored, each template uniquely identified with one user. At least one spoken keyword uniquely identified with one of the users is received (220). The method determines (230) which particular user spoke the keyword and selects a subset of the templates uniquely identified with this particular user to provide a set of recognizable commands for subsequent utterances of the voice command sequences (270).

Description

KEYWORD-BASED SPEAKER SELECTION
BACKGROUND OF THE INVENTION
This invention relates generally to speech recognition, and more particularly to speaker dependent speech recognition systems, and is particularly directed toward providing a time-efficient and accurate method of template selection based on keyword recognition. In contemporary society, people communicate more effectively via spoken instructions or commands than through any other communication medium. Accordingly, it would be advantageous for many devices (or equipment) used in contemporary society to be at least partially controllable by voice commands. Such a speech recognition control system (responsive to the human voice) is highly desirable in automotive applications. Most mobile radio transceiver functions (e.g., on/off, transmit/receive, volume, squelch, changing channels) and mobile radio telephone control functions (e.g., push-button dialing, speech recognizer training, telephone call answering) can readily be performed by voice command without requiring any manual operation. Hence speech recognition has the potential to provide a totally hands-free telephone conversation without ever requiring the automobile driver to remove his or her hands from the steering wheel or take his or her eyes off the road, adding to the safety and convenience of using mobile radio telephones in vehicles.
Some designers of voice recognition equipment have attempted to utilize speaker dependent voice recognition technology to accomplish voice control. As is known, speaker dependent technology utilizes pre-stored voice templates as references to recognize the voice of a particular individual to perform specified functions that relate to a predetermined set of recognized command words. A template is a time-ordered set of features that characterize the behavior of the speech signal for a particular speaker. Most speech recognition technologies use either a word or a multi-word utterance as this reference template. Speaker dependent technology requires that the device be programmed (or trained) to recognize each individual operator. Training is commonly understood to be a process by which an individual repeats a predetermined set of command words a sufficient number of times so that an acceptable template is formed. Specifically, a word recognizer recognizes the word command by extracting features which adequately represent the utterance and deciding whether those features meet some distance criteria to match a particular template out of the set of pre-stored templates. These templates correspond to the set of pre-stored features representing the command words to be recognized. The speaker dependent word recognizer is thus designed to recognize the command words of users by comparing the utterance to pre-stored voice templates which contain the voice features of those users.
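The distance-based template matching described above can be sketched as follows. This is an illustrative sketch only: the data layout, the naive frame-by-frame Euclidean distance, and the rejection threshold are assumptions for exposition, not details taken from the patent.

```python
import math

def match_command(features, templates):
    """Pick the pre-stored template whose features are closest to the
    utterance, rejecting the match if even the best distance exceeds a
    threshold (the 'distance criteria' described above).

    features:  list of feature vectors for the utterance
    templates: {command_word: list of feature vectors}
    """
    REJECT_THRESHOLD = 10.0  # assumed tuning parameter, not from the patent

    def frame_distance(a, b):
        # naive frame-by-frame Euclidean distance over the shorter length
        return sum(math.dist(x, y) for x, y in zip(a, b))

    best_word = min(templates, key=lambda w: frame_distance(features, templates[w]))
    if frame_distance(features, templates[best_word]) > REJECT_THRESHOLD:
        return None  # no template is close enough: utterance not recognized
    return best_word
```

A real recognizer would time-align the sequences before comparing them, as discussed later in the description.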
However, in many prior art speaker dependent word recognition systems, the operator or speaker must first "log in": manipulate or adjust one or more control knobs or buttons to enter an identification code, or otherwise inform the recognizer of who the operator is, so that the recognizer can reference the voice templates which were generated when the operator initially trained the system to his or her voice. This "logging in" procedure is cumbersome but necessary in prior art systems, since the templates are stored in direct association with each user's identification code. One reason for the logging-in procedure is to preclude an exhaustive search of all the templates to match a particular speaker with the speaker's voice. Another reason is to improve accuracy by knowing who the speaker is before matching his or her voice command.
This logging-in procedure is nevertheless inefficient, since one purpose of voice control of a two-way mobile radio is to alleviate the need to divert a driver's attention from operating the vehicle to manipulate or adjust knobs on the radio. In addition, the procedure is cumbersome: it forces the operator to remember yet another number, for example which car the operator is in, what batch number he or she has, and which user the operator is. This detracts from the main purpose of using voice control in the first place, which is to improve usability.
Accordingly, a need exists in the art to provide for an efficient selection of a template while rendering the communication equipment user-friendly.
BRIEF SUMMARY OF THE INVENTION
Accordingly, it is an object of the present invention to reduce the template search time and improve accuracy while providing a user-friendly interface. The best match of a speaker with a particular reference template is determined. This template, corresponding to a keyword, identifies the user associated with it. Only the other templates associated with this speaker are then searched for subsequent commands, thus reducing the scope of the search. Briefly, according to the invention, a method for recognizing an utterance of a voice command sequence having a keyword spoken at the beginning of the sequence includes storing a plurality of templates, each template uniquely identified with one user. At least one spoken keyword uniquely identified with one of the users is received. The method determines which particular trained user spoke the keyword and selects a subset of the templates uniquely identified with this particular user to provide a set of recognizable commands for subsequent utterances of the voice command sequences.
In one aspect of the invention, the determining step comprises comparing the received spoken keyword with a portion of the set of templates. Each template of this portion is uniquely identified with one user by the spoken keyword being a unique word distinct for each of the users.
In another aspect of the invention, the spoken keyword is a single word characteristically spoken by each of the users.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a communication device in accordance with the present invention.
FIG. 2 is a flow diagram illustrating the operation of the communication device of FIG. 1 in accordance with the present invention.
DESCRIPTION OF THE PREFERRED EMBODIMENT
The invention, selecting a set of templates based on speaker identification from an utterance, can be applied to many applications. The following mobile radio application is just one example of the various applications possible.
Referring by characters of reference to the drawings and first to FIG. 1, a communication device 1000 is illustrated in block diagram form. Functionally, the communication device 1000 may comprise a land mobile two way radio frequency communication device, such as the SYNTOR X 9000 series radio manufactured by Motorola, Inc., but the present invention need not be limited thereto. The communication device 1000 includes a word recognizer 100 which utilizes the principles of the present invention to determine who a speaker is. By matching the corresponding template (using distance criteria to find the best match) to that particular speaker, a keyword is recognized before word commands (such as "change channel") are similarly processed. For subsequent word commands, the search for a matching template will only be conducted using the templates associated with this particular speaker.
Functionally, to change a channel, for example, a voice command sequence having a keyword at its beginning to represent who the speaker is, followed by the command words "change channel" is uttered by the user. This utterance is received by a microphone 102. The analog representation of the utterance from the microphone 102 is filtered, sampled, and digitized by the Codec 106. The digitized utterance is sent to the digital signal processor (DSP) 120, which performs the speech recognition function, providing recognition results to the controller 160. The DSP 120 also may send digitized audio (synthesized speech or other sounds) to the Codec 106 to provide audible feedback or error messages, as dictated by inputs from the controller 160. The Codec 106 converts the digital representation of the audio back to an analog representation, filters the analog signal, and drives the speaker 170. The controller 160 takes input from a keyboard 110 and recognition results from the DSP 120. Based upon these inputs, it controls the operation of the radio 104 (in this example, changing the radio channel).
The controller 160 may be any suitable microprocessor, microcomputer, or microcontroller, and preferably comprises an MC68HC11 (or its functional equivalent), manufactured by Motorola, Inc. Note that the functions of the controller 160 may be incorporated in the DSP 120 if desired. Generally, the controller 160 will require random access memory (RAM) and read only memory (ROM) in addition to what is included in the microprocessor itself. The RAM is used for temporary data, and the ROM is used to store permanent data and operating programs. Block 150 provides this additional memory.
The DSP 120 may be of any suitable type, such as the 56000 family of DSPs, manufactured by Motorola, Inc. Generally, the DSP 120 may also require RAM and ROM in addition to what is included on the DSP 120 itself, which is provided in block 150. The Codec 106 may be internal to the DSP 120 and incorporated in a single block. The DSP 120 also uses the RAM for temporary data storage and the ROM for program storage. Block 140 provides electrically erasable programmable read only memory (EEPROM), which is used mostly to store the recognition templates. Operationally, the speaker dependent word recognizer 100 recognizes words by comparing them to pre-stored reference templates which contain "extracted features" of the recognizable words, spoken by each user. Extracted features are representations of digitized words (or utterances) which are thought to contain the essential characteristics of the speech. The process of feature extraction is well known in the art, and examples may be found in G. White, R. Neely, "Speech Recognition Experiments with Linear Prediction, Band Pass Filtering, and Dynamic Programming", IEEE ASSP, Vol. ASSP-23, No. 2, April 1976, which is incorporated here by reference.
The pre-stored templates are generated by a process called "training". Training is known to be a process by which the individual user repeats a predetermined set of reference words or utterances a sufficient number of times, until acceptable voice features are extracted and stored.
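The training process just described can be sketched as a simple frame-wise average over repetitions of the same word. The data layout is an assumption for illustration; a practical trainer would first time-align the repetitions rather than require them to be of equal length.

```python
def train_template(repetitions):
    """Average several repetitions of the same word, frame by frame,
    into one reference template.

    repetitions: list of utterances; each utterance is a list of
    feature-vector frames, here assumed already time-aligned so all
    repetitions have the same number of frames.
    """
    n = len(repetitions)
    # group the k-th frame of every repetition, then average each
    # feature dimension across the group
    return [[sum(vals) / n for vals in zip(*frame_group)]
            for frame_group in zip(*repetitions)]
```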
In the preferred embodiment of the invention, the word recognizer 100 comprises a speaker dependent word recognizer and provides two modes of operation, a training mode and a recognition mode. A control panel 110 coupled to the controller 160 includes buttons 114 and 116 for selecting the desired mode. However, before the system 100 can be trained, it must be notified which user is being trained. Hence, the control panel 110 also includes buttons 112, one for each user to be trained. In the training mode, the extracted features of reference commands or keywords may be stored in an erasable memory means 140, such as any suitable EEPROM.
However, during the recognition mode, this invention eliminates the use of these user buttons 112. According to the present invention, the word recognizer 100 is designed such that in the recognition mode the user buttons need not be pressed, since identification of the individual user's voice is done automatically.
Figure 2 illustrates the use of speaker identification in the recognizer to select the subset of possible recognized templates. Step 208 is the initialization step, in which the DSP 120 resets internal variables used in the recognition process. In step 210, the DSP extracts the features from the digitized input utterance provided by the Codec 106. (It should be noted that step 210 need not be completed before the following steps in the flow chart; in fact, prior art systems often extract features from a new utterance while the features from a previous utterance are being recognized.)
Steps 220 and 230 are the critical steps of the present invention. If the utterance (a spoken word or phrase) is the first in the command sequence, the user will be uttering the keyword.
The system will then use this keyword to determine which user is currently speaking in step 230. Methods of identifying people from their speech are well known in the art. Examples of such speaker identification may be found in NAIK, "Speaker Verification: A Tutorial," IEEE Communications Magazine, pages 42-48, January 1990, which is incorporated here by reference.
Once the user has been identified, the system will only compare the features from the input utterance to that user's templates. By comparing the input features to only a subset of the templates, the recognition accuracy is improved, and the amount of computational effort to accomplish the recognition is reduced. The recognition process is accomplished in steps 240 and 250. Step 240 compares the features from the input utterance to those templates attributed to the identified user, determining distances between each template and the input utterance. Since the utterances will often be of different lengths, the distance determination method must include some means of aligning them at equivalent points in time. A well known technique of time alignment is called dynamic time warping, which is described in ITAKURA, "Minimum Prediction Residual Principle Applied to Speech Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume ASSP-23, No. 1, pages 67-72, February 1975, which is incorporated here by reference.
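A minimal dynamic time warping distance, illustrating the time-alignment technique cited above, might look like the following. The feature layout and the Euclidean local distance are assumptions for illustration; practical implementations add path constraints and normalization.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    a, b: lists of feature vectors, possibly of different lengths.
    Returns the minimum cumulative frame distance over all monotonic
    alignments of the two sequences.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cumulative distance aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])      # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

Note how a repeated frame in one sequence costs nothing extra when the other sequence simply dwells on the matching frame, which is exactly the alignment behavior the description calls for.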
Step 250 updates the total command sequence distances by adding the distances from the features of the input utterance to the templates identified with words that are included in the command sequence. In this way, entire command sequences may be compared to determine the best recognized sequence.
Step 260 determines if the command sequence is complete. If the sequence is complete, the recognizer outputs the recognized sequence of command words in step 270. If not, the system returns to step 210 to examine the next utterance.
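The overall flow of FIG. 2 can be sketched as follows. This is an illustrative sketch under assumed data structures, not the patent's implementation: the first utterance is treated as the keyword, it selects the user, and only that user's templates are searched thereafter.

```python
def recognize_sequence(utterances, keyword_templates, command_templates, dist):
    """Sketch of the FIG. 2 flow.

    utterances:        list of feature sequences, keyword first (step 210)
    keyword_templates: {user: keyword feature sequence}
    command_templates: {user: {command_word: feature sequence}}
    dist:              distance function between feature sequences (e.g. DTW)
    """
    user = None
    recognized = []
    for i, feats in enumerate(utterances):
        if i == 0:
            # steps 220/230: first utterance is the keyword; identify the user
            user = min(keyword_templates,
                       key=lambda u: dist(feats, keyword_templates[u]))
            subset = command_templates[user]  # restrict the template search
        else:
            # steps 240/250: match only against the identified user's templates
            word = min(subset, key=lambda w: dist(feats, subset[w]))
            recognized.append(word)
    return user, recognized  # step 270: output the recognized sequence
```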
Accordingly, the present invention involves using the utterance of a keyword to determine who the user or speaker is. In a first preferred embodiment, utilizing word recognition to identify the speaker, each user has a particular keyword associated with him or her which was used to train the word recognizer 100 to his or her voice to form the reference templates. Upon initiation, or turning on, of the mobile radio (208), the word recognizer 100 does not know who is about to use the mobile radio. The user may be a previous speaker already recognized or a new speaker. Thus the word recognizer 100 must initially be responsive to all of the keywords spoken by the different speakers in step 230. When an utterance is detected (210), the recognizer determines (230) whether it was sufficiently close to one of the limited number of keywords that the recognizer was trained on, and thus who said the utterance. Particular keywords may be selected because of their distinctive phonetic content to increase the reliability of the recognizer.
A second preferred embodiment uses a single keyword that has been trained by all the users to form the reference templates. Upon utterance of this single keyword, the recognizer 100 performs the speaker identification or speaker recognition of step 230 as previously described to determine who the speaker is. The advantage of this embodiment is that only one keyword needs to be matched. However, during training, the word recognizer 100 may verify that all the users naturally say the particular keyword acoustically differently from one another, so that the single keyword will be discernible as spoken by each speaker. The keyword may ideally be chosen because it accentuates the different acoustic characteristics of different speakers speaking the same word.
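The second embodiment reduces to a nearest-template decision over per-user renditions of the one shared keyword. The sketch below assumes each user's training produced a template of that same word; the names and data layout are illustrative, not from the patent.

```python
def identify_speaker_single_keyword(feats, per_user_keyword_templates, dist):
    """Second embodiment: every user trains the same keyword; the speaker
    is the user whose stored rendition of that word is closest to the
    incoming utterance.

    feats:                      features of the spoken keyword
    per_user_keyword_templates: {user: that user's template of the keyword}
    dist:                       distance function between feature sequences
    """
    return min(per_user_keyword_templates,
               key=lambda u: dist(feats, per_user_keyword_templates[u]))
```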
In either embodiment, word recognition or speaker recognition, the identification of the speaker is automatic (230) and requires no additional effort by the operator. This invention therefore enhances the ease of use that voice control systems strive for.
We claim:

Claims

1. A method for recognizing a voice command sequence, comprising the steps of: storing a plurality of templates, each template uniquely identified with one of a plurality of users; receiving at least one keyword spoken at the beginning of said voice command sequence and uniquely identified with said one of said plurality of users; determining which particular one of said plurality of users spoke said at least one keyword; and selecting a subset of said templates uniquely identified with said particular one of said plurality of users for at least the rest of said voice command sequence in response to said determining step.
2. The method of claim 1 wherein said determining step comprises comparing said received at least one keyword with a portion of said plurality of templates, each template of said portion uniquely identified with one of said plurality of users by said at least one keyword being a unique word for each of said plurality of users.
3. The method of claim 1 wherein said determining step comprises comparing said received at least one keyword with a portion of said plurality of templates, each template of said portion uniquely identified with one of said plurality of users by said at least one keyword being an utterance the same for but characteristic of each of said plurality of users.
4. The method of claim 1 further comprising searching only through said subset to recognize a command.
5. A speech recognizer in a communication device for recognizing a voice command sequence, comprising:
storing means in said communication device for storing a plurality of templates, each template uniquely identified with one of a plurality of users; a receiver for receiving at least one keyword spoken at the beginning of said voice command sequence and uniquely identified with said one of said plurality of users; determining means in said communication device for determining which particular one of said plurality of users spoke said at least one keyword; and selection means in said communication device for selecting a subset of said templates uniquely identified with said particular one of said plurality of users to provide a set of recognizable commands in response to said determining means.
6. The speech recognizer of claim 5 wherein said determining means comprises comparing means for comparing said received at least one keyword with a portion of said plurality of templates, each template of said portion uniquely identified with one of said plurality of users by said at least one keyword being a unique word distinct for each of said plurality of users.
7. The speech recognizer of claim 5 wherein said determining means comprises comparing means for comparing said received at least one keyword with a portion of said plurality of templates, each template of said portion uniquely identified with one of said plurality of users by said at least one keyword being an utterance the same for but characteristic of each of said plurality of users.
8. A method for recognizing a voice command sequence, comprising the steps of: storing a plurality of templates, each template uniquely identified with one of a plurality of users; apportioning said plurality of templates into a plurality of command subsets, each command subset uniquely identified with one of a plurality of users; receiving at least one keyword spoken at the beginning of said voice command sequence and uniquely identified with said one of said plurality of users; determining which particular one of said plurality of users spoke said at least one keyword; selecting said command subset of said templates uniquely identified with said particular one of said plurality of users to provide a set of recognizable commands in response to said determining step; and searching only through said command subset to recognize a command for at least the rest of said voice command sequence.
9. The method of claim 8 wherein said determining step comprises comparing said received at least one keyword with a portion of said plurality of templates, each template of said portion uniquely identified with one of said plurality of users by said at least one keyword being a unique word for each of said plurality of users.
10. The method of claim 8 wherein said determining step comprises comparing said received at least one keyword with a portion of said plurality of templates, each template of said portion uniquely identified with one of said plurality of users by said at least one keyword being an utterance the same for but characteristic of each of said plurality of users.
PCT/US1991/004327 1990-07-02 1991-06-17 Keyword-based speaker selection WO1992000586A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US54835990A 1990-07-02 1990-07-02
US548,359 1990-07-02

Publications (1)

Publication Number Publication Date
WO1992000586A1 true WO1992000586A1 (en) 1992-01-09

Family

ID=24188525

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1991/004327 WO1992000586A1 (en) 1990-07-02 1991-06-17 Keyword-based speaker selection

Country Status (1)

Country Link
WO (1) WO1992000586A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2778782A1 (en) * 1998-05-18 1999-11-19 Henri Benmussa Multi user voice recognition system
WO2000045575A1 (en) * 1999-01-28 2000-08-03 Telia Ab (Publ) Device and method at telecommunication systems
DE102004030054A1 (en) * 2004-06-22 2006-01-12 Bayerische Motoren Werke Ag Method for speaker-dependent speech recognition in a motor vehicle

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4363102A (en) * 1981-03-27 1982-12-07 Bell Telephone Laboratories, Incorporated Speaker identification system using word recognition templates
US4590604A (en) * 1983-01-13 1986-05-20 Westinghouse Electric Corp. Voice-recognition elevator security system
US4827520A (en) * 1987-01-16 1989-05-02 Prince Corporation Voice actuated control system for use in a vehicle
US4922538A (en) * 1987-02-10 1990-05-01 British Telecommunications Public Limited Company Multi-user speech recognition system

Legal Events

AK   Designated states (kind code of ref document: A1). Designated state(s): CA JP KR
AL   Designated countries for regional patents (kind code of ref document: A1). Designated state(s): AT BE CH DE DK ES FR GB GR IT LU NL SE
NENP Non-entry into the national phase. Ref country code: CA