WO2001082291A1 - Speech recognition and training methods and systems - Google Patents

Speech recognition and training methods and systems

Info

Publication number
WO2001082291A1
WO2001082291A1 (PCT/US2001/012959)
Authority
WO
WIPO (PCT)
Prior art keywords
user
error
speech
audible sounds
training
Prior art date
Application number
PCT/US2001/012959
Other languages
French (fr)
Inventor
H. Donald Wilson
Anthony H. Handal
Michael Lessac
Original Assignee
Lessac Systems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lessac Systems, Inc. filed Critical Lessac Systems, Inc.
Priority to AU2001255560A priority Critical patent/AU2001255560A1/en
Publication of WO2001082291A1 publication Critical patent/WO2001082291A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00Teaching not covered by other main groups of this subclass
    • G09B19/04Speaking
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0638Interactive procedures

Definitions

  • the present invention relates to speech recognition technology and voice training using speech recognition of the type typically embodied in speech recognition software implemented on personal computer systems.
  • Tiny laptop computing devices fly at speeds a thousand times that of those early powerhouse computers and boast thousands of times the memory. Instead of huge reels of recording tape, hard disks with capacities on the order of eighteen GB are found in those same laptop computing devices. These devices, with their huge memory and computing capabilities, move as freely in the business world as people, under the arm, in a bag, or on the lap of a businessman flying across the ocean. No doubt, this technology lies at the foundations of the most remarkable, reliable and completely unanticipated bull market in the history of business.
  • speech recognition programs generally have an error correction dialog window which is used to train the system to the features of an individual user's voice, as will be more fully described below.
  • an acoustic signal received by a microphone is input into a voice board which digitizes the signal.
  • the computer then generates a spectrogram which, for a series of discrete time intervals, records those frequency ranges at which sound exists and the intensity of sound in each of those frequency ranges.
  • the spectrogram referred to in the art as a token, is thus a series of spectrographic displays, one for each of a plurality of time intervals which together form an audible sound to be recognized.
  • Each spectrographic display shows the distribution of energy as a function of frequency during the time interval.
  • sampling rates of 6,000 to 16,000 samples per second are typical, and are used to generate about fifty spectrum intervals per second for an audible sound to be recognized.
  • a speech recognition system involves the input of vocabulary into the hard drive of a computer in the form of the above described spectral analysis matrix, with one or more spectral analysis matrices for each word in the vocabulary of the system. These matrices then serve as word models.
  • comparison of an audible sound to the models in the database can be used as a reliable means for speech recognition.
  • different speakers speak at different rates.
  • for one speaker, a word may take a certain period of time, while for another speaker the same word may take a longer period of time.
  • different speakers have voices of different pitch.
  • speakers may give different inflections, emphasis, duration and so forth to different syllables of a word in different ways, depending on the speaker. Even a single speaker will speak in different ways on different occasions.
  • each of the spectral sample periods for the sound to be recognized are compared against the corresponding spectral sample periods of the model which is being rated.
  • the cumulative score for all of the sample periods in the sound against the model is a quality rating for the match.
  • the quality ratings for all the proposed matches are compared and the proposed match having the highest quality rating is output to the system, usually in the form of a computer display of the word or phrase.
  • the first method in which the database is assembled is the input of global information using one or more speakers to develop a global database.
  • the second method in which the database is assembled is the training of the database to a particular user's speech, typically done both during a training session with preselected text, and on an ad hoc basis through use of the error correction dialog window in the speech recognition program.
  • the present invention stems from the recognition that spectrograms of audible sounds may be used not only to recognize speakers and words, but also mispronunciations.
  • the performance of the speech recognition software is improved by focusing in on the user, as opposed to the software.
  • the invention has as its objective the improvement of the speech patterns of persons using the software. The result is enhanced performance, with the bonus of voice training for the user.
  • Such training is of great importance. For example, salesmen, lawyers, store clerks, mothers dealing with children and many others rely heavily on oral communication skills to accomplish daily objectives. Nevertheless, many individuals possess poor speaking characteristics and take this handicap with them to the workplace and throughout daily life.
  • a specialized but highly effective speech training regimen is provided for application in the context of speech recognition software for receiving human language inputs in audio form to a microphone, analyzing the same in a personal computer and outputting alphanumeric documents and navigation commands for control of the personal computer, and alphanumeric guidance and aural pronunciation examples from the sound card and speakers associated with a personal computer.
  • speech recognition is performed on a first computing device using a microphone to receive audible sounds input by a user into a first computing device having a program with a database consisting of (i) digital representations of known audible sounds and associated alphanumeric representations of the known audible sounds and (ii) digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases.
  • the method is performed by receiving the audible sounds in the form of the electrical output of a microphone.
  • the training method is performed by having the person being trained read a preselected text and translating the audible sounds into the form of the electrical output of a microphone being sent to a computer.
  • a particular audible sound to be recognized is converted into a digital representation of the audible sound.
  • the digital representation of the particular audible sound is then compared to the digital representations of the known audible sounds to determine which of those known audible sounds is most likely to be the particular audible sound being compared to the sounds in the database.
  • a speech recognition output consisting of the alphanumeric representation associated with the audible sound most likely to be the particular audible sound is then produced.
  • An error indication is then received from the user indicating that there is an error in recognition.
  • the user also indicates the proper alphanumeric representation of the particular audible sound. This allows the system to determine whether the error is a result of a known type or instance of mispronunciation.
  • the digital representation of the particular audible sound is then compared to the digital representations of a proper pronunciation of that audible sound to determine whether there is an error that results from a known type or instance of mispronunciation.
  • the system presents an interactive training program from the computer to the user to enable the user to correct such mispronunciation.
  • the presented interactive training program comprises playback of the properly pronounced sound from a database of recorded sounds corresponding to proper pronunciations of the mispronunciations resulting from the known classes of mispronounced words and phrases.
  • the user is given the option of receiving speech training or of training the program to recognize the user's speech pattern; the choice rests with the user of the program.
  • the determination of whether the error is a result of a known type or instance of mispronunciation is performed by comparing the mispronunciation to the digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases using a speech recognition engine.
  • the inventive method will be implemented by having the database consisting of (i) digital representations of known audible sounds and associated alphanumeric representations of the known audible sounds and (ii) digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases, generated by the steps of speaking and digitizing the known audible sounds and the known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases.
  • the database will then be introduced into the computing device of many users after the generation by speaking and digitizing has been done on another computing device and transferred together with voice recognition and error correcting subroutines to the first computing device using CD-ROM or other appropriate data carrying medium.
  • mispronunciations are input into the database by actual speakers who have such errors as a natural part of their speech patterns.
  • normalization to word, phrase and other sound models may be achieved by normalizing words or phrases to one of a plurality of sound durations. This procedure is followed with respect to all the word and phrase models in the database. When a word is received by the system, it measures the actual duration, and then normalizes the duration of the sound to one of the plurality of preselected normalized sound durations. This reduces the number of items in the database against which the sound is compared and rated.
  • Figure 1 is a block diagram illustrating a voice recognition program in accordance with the method of the present invention
  • Figure 2 is a block diagram illustrating a voice recognition program in accordance with the training method of the present invention
  • Figure 3 is an alternative embodiment of the inventive system illustrated in Figure 2;
  • a voice and error model is generated using subroutines 12 and 112.
  • Subroutines 12 and 112 comprise a number of steps which are performed at the site of the software developer, the results of which are sent, for example, in the form of a CD-ROM, other media or via the Internet, together with the software for executing voice recognition, to a user, as will be apparent from the description below.
  • the inventive speech recognition method may be practiced on personal computers, as well as on more advanced systems, and even on relatively stripped down lightweight systems, those referred to as subnotebooks, and even smaller systems, provided the same have sound boards for interfacing with and receiving the output of a microphone. It is noted that quality sound board electronics are important to good recognition and to successful practice of the methods of the invention.
  • a database of word models is generated by having speakers speak the relevant words and phrases into a microphone connected to the sound board of the personal computer being used to generate the database.
  • speakers who have been trained in proper speech habits are used to input words, phrases and sounds into the database at steps 14 and 114.
  • as the information is generated by the speakers speaking into microphones attached to the sound boards in the computer, it is digitized, analyzed and stored on the hard drive 16, 116 of the computer.
  • the term "phoneme" is used to mean the smallest sound, perhaps meaningless in itself, capable of indicating a difference in meaning between two words.
  • the word "dog" differs from "cog" by virtue of a change of the phoneme "do" pronounced "daw" and "co" pronounced "cah."
  • the model generating speaker can speak a database of common phoneme errors into the microphone attached to the sound board of the computer to result in input of an error database into hard drive 16, 116 of the computer.
  • the phoneme errors are spoken by persons who in various ways make the pronunciation error as part of their normal speech patterns.
  • the system is enhanced by the introduction into the database contained on hard drive 16, 116 of a plurality of exercise word models, selected for the purpose of training the speech of a user of the system.
  • the same are input into the system through the use of a microphone and sound board, in the same way that the database of the language model was input into the system.
  • generally, a collection of word and/or phrase models is associated with each type of phoneme error. This is because if a person makes a speech pronunciation error of a particular type, it is likely that the same speaker makes certain other errors which have common characteristics with other pronunciation errors in the group. For example, a person who mispronounces the word "them" to sound like "dem" is also likely to mispronounce the words "that", "those" and "these."
  • Exercise phrase models are input at steps 22 and 122. These exercise phrase models are stored by the system on hard drive 16, 116. The exercise word models and the exercise phrase models input into the system at steps 20, 22, 120 and 122 respectively are associated in groups having common mispronunciation characteristics. The same are input into the system through the use of a microphone and sound board, in the same way that the database of the language model was input into the system.
  • a plurality of typical mispronunciations are input into the system to create a database of exercise word error models on hard drive 16, 116. The same are input into the system through the use of a microphone and sound board, in the same way that the database of the language model was input into the system.
  • the database of relatively common mispronunciation errors is completed at steps 26 and 126 where the speaker generating that database speaks into the system to generate a plurality of exercise phrase error models. These error models are also input into the system through the use of a microphone and stored on hard drive 16, 116.
  • preferably, input of the error models is done using a speaker or speakers who have the actual speech error as part of their normal speech patterns. The same is believed to achieve substantially enhanced recognition of speech errors, although the same is not believed to be necessary to a functioning system.
  • the models stored on hard disk 16, 116, and generated as described above, may be recorded on a CD-ROM or other program carrying media, together with a voice recognition engine, such as that marketed by any one of a number of manufacturers such as IBM, Dragon Systems, and others.
  • a prior art speech recognition program may be used for the purposes of recognizing words and of recognizing mispronunciations and phoneme errors, together with the above described audio recordings of proper pronunciations, both during speech recognition operation training sessions (Figure 1), and during speech training sessions with the inventive interactive program (Figures 2 and 3).
  • such software comprising the speech recognition engine, editing and training utilities, and database of word models, phrase models, vocal recordings, and error models may be supplied to the user for a one time fee and transported over a publicly accessible digital network, such as the Internet.
  • the software may be made available for limited use for any period of time with charges associated with each such use, in which case the software would never be permanently resident on the computer of a user.
  • the software containing the program and the database is loaded into a personal computer and words are spoken into a microphone coupled to the sound board of the computer, in order to input the speech into the computer in the manner of a conventional speech recognition program.
  • the software containing the program and the database is loaded into a personal computer and the student user is instructed to read a preselected text that appears on the screen of the computer.
  • words are spoken into a microphone coupled to the sound board of the computer, in order to input the speech into the computer in the manner of a conventional speech recognition program.
  • once the database has been generated at steps 18, 20, 22, 24, 26, 114, 118, 120, 124, and 126 and the speech recognition engine, editing and training utilities added, the system proceeds at steps 28 and 128 to receive, through a microphone, speech to be recognized from a user of the program who has loaded the speech recognition engine, editing and training utilities, and database of word models, phrase models, vocal recordings, and error models onto the user's personal computer.
  • the operation of the speech recognition program of the present invention is substantially identical to other speech recognition programs presently on the market. More particularly, at steps 30 and 130, a conventional speech recognition algorithm is applied to recognize audible sounds as the words which they are meant to represent.
  • the computer then outputs the recognized speech on the screen of the computer monitor, and the next phrase uttered by the user proceeds at steps 30 and 130 through the speech recognition algorithm resulting in that speech also being displayed on the monitor screen.
  • the user may use any one of a number of different techniques to bring up an error correction window at steps 32 and 132. For example, he may simply double-click on the error, or highlight the erroneous recognition and hit a key dedicated to presentation of the error correction window.
  • the call up of the error correction window at steps 34 and 134 has indicated to the system that there is an error. While some errors are unrelated to pronunciation errors, many are.
  • the system then proceeds at steps 36 and 136 to determine whether the error made by the user is recognized as one of the speech errors recognized by the system. If it is, this information is determined at steps 36 and 138. The nature of the pronunciation error is then input into the system and logged at steps 38 and 140. In this manner, the system keeps track of the number of errors of a particular type for the user by storing them and tallying them on hard drive 16, 116.
  • the speech training will not be triggered by a single mispronunciation. Instead, it is contemplated that repeated instances of a single type of mispronunciation error will be tallied, and only when a threshold of pronunciation errors for that error is reached in the tally will speech training be proposed by the appearance on the screen of a prompt window suggesting speech training (a minimal sketch of this tally-and-threshold logic appears after this list).
  • the same could take the form of a window having the words "The system has determined that it is likely that we can improve your recognition and speech by coaching you now. Would you like to speak to the speech coach?"
  • the screen may also have a headline above the question, such as "The coach wants to talk to you!"
  • the screen will also have a button bar "OK" to start a training session.
  • a button marked "Cancel" may also be included so the student may click on the "Cancel" button to delay the speech coaching session for a limited amount of time or cancel it altogether.
  • the error correction algorithm operates in a manner identical to the speech recognition algorithm at steps 30 and 130, except that the error correction algorithm checks the database of common phoneme errors input into the system by the software developer at steps 18 and 118.
  • if the system determines that the threshold number of errors in that class has not been reached, it sends the system back to steps 28 and 128, where speech recognition proceeds. If, on the other hand, a predetermined number of errors of the same class have been detected by the system and logged at steps 38 or 140, at steps 40 or 142 the system is sent to step 42 or 144, where the above described "The coach wants to talk to you!" screen is presented to the user, who is thus given the opportunity to train his voice.
  • if the speech recognition user declines the opportunity to train at step 42, he is given the opportunity to train the database at step 43. If he declines that opportunity also, the system is returned to step 28, where, again, speech recognition proceeds.
  • at step 45, the database is trained in the same manner as in a conventional speech recognition program.
  • if the speech training user declines the opportunity to train at step 146, the system is returned to step 128, where, again, reading of the rest of the preselected pronunciation error detecting text proceeds.
  • at step 42 or 146, when the user decides to accept speech training, the system proceeds to step 44 or 148 respectively, where the determination is made as to whether the particular error is an error in the pronunciation of a word or of what is referred to herein as a phrase.
  • by "phrase", in this context, is meant at least parts from two different words. This may mean two or more words, or the combination of one or more words and at least a syllable from another word, and most often the end of one word combined with the beginning of another word, following the tendency of natural speakers to couple sounds to each other, sometimes varying their stand-alone pronunciation. If, at step 44 or 148, the system determines that the mispronunciation is the mispronunciation of a word, the system is sent to step 46 or 150 respectively, where the system retrieves from memory words which have the same or similar mispronunciation errors.
  • these words have been stored in the system, not only in the form of alphanumeric presentations, but also in high-quality audio format.
  • the object of the storage of the high-quality audio sound is to provide for audible playback of the words in the training dialog screen.
  • the words retrieved at steps 46 and 150 are also presented on-screen in alphanumeric form to the user and the user is invited to pronounce the word. If the word is pronounced properly, this is determined at steps 48 and 154. If there is no error, the system proceeds to steps 50 and 156, where the system determines whether two incidences of no error have occurred consecutively. If no error has occurred twice consecutively, the system is returned to act as a voice recognition system at steps 28 and 128. If no error has occurred only once, at steps 50 and 158 the system is returned to the training dialog screen at step 46 or 150 respectively and the user is invited to pronounce the same or another word having the same type of mispronunciation to ensure that the user is pronouncing the word correctly. Once the user has pronounced words twice in a row without errors, the user is returned at step 50 or 156 respectively, to the voice recognition function.
  • if an error is found at step 48 or 146, the system proceeds to step 50 or 150 respectively, where an instruction screen telling the user how to make the sound, with physical instructions on how to move the muscles of the mouth and tongue to achieve the sound, is presented to the user.
  • the screen allows for the incorporation of more creative speech training approaches such as the Lessac method described in The Use and Training of the Human Voice: A Bio-Dynamic Approach to Vocal Life, Arthur Lessac, Mayfield Publishing Co. (1997).
  • the user is encouraged to use his "inner harmonic sensing." This enhances the description of a particular sound by having the user explore how the sound affects the user's feelings or encourages the user to some action.
  • the Lessac method teaches the sound of the letter "N" not only by describing the physical requirements but also by instructing the user to liken the sound to the "N" in violin and to "Play this consonant instrument tunefully."
  • This screen also has a button which may be clicked to cause the system to play back the high-quality audio sound from memory, which was previously recorded during software development, as described above.
  • the system may also incorporate interactive techniques.
  • This approach presents the user with a wire frame drawing of a human face depicting, amongst other information, placement of the tongue, movement of the lips, etc.
  • the user may interactively move the wire frame drawing to get a view from various angles or cause the sounds to be made slowly so that the "facial" movements can be carefully observed.
  • This screen also has a button which may be clicked to cause the system to play back the high-quality audio sound from memory, which was previously recorded during software development, as described above.
  • the user is then invited to say the sound again, and at steps 54 and 152, the user says the word into the microphone which is coupled to the computer, which compares the word to the database for proper pronunciation and determines whether there is an error in the pronunciation of the word at steps 56 and 154 respectively.
  • if there is an error, the system returns to step 46, where the word is displayed and the user is invited to say the word into the machine to determine whether there is error, with the system testing the output to determine whether it should proceed to speech recognition at step 28 when the standard of two consecutive correct pronunciations has been reached. If there is no error at step 56, however, the tally is cleared and the system proceeds to step 28, where normal speech recognition continues.
  • if a pronunciation error is found, the error tally flag is set at step 158 and the system is sent back to step 150 where, again, the sound is displayed in alphanumeric form and the user is invited to say the sound into the machine, with the system testing the output to determine whether there is an error at step 154. If no pronunciation error is found, the system determines at step 156 whether the previous attempt was an error by checking whether the tally error flag is set. If the flag is set, indicating that the previous attempt had a pronunciation error, then the system is sent to step 158 where the tally flag is now cleared and the system returns to step 150. At step 156, if the tally flag is found not set, indicating that the previous attempt had no pronunciation error, then the standard of two consecutive correct pronunciations has been met and training has been completed.
  • if, at step 44 or 148, the system determines that the mispronunciation is the mispronunciation of a phrase, the system is sent to step 58 or 150 respectively, where the system retrieves from memory phrases which have the same or similar mispronunciation errors.
  • the words or phrases retrieved at step 58 or 150 are also presented on-screen in alphanumeric form to the user and the user is invited to pronounce the word or phrase. If the word is pronounced properly, this is determined at step 60 or 154 respectively. If there is no error, the system proceeds to step 62 or 158 respectively, where the system determines whether two incidents of no error have occurred. If no error has occurred twice, the system is returned to act as a voice recognition system at steps 28 and 128.
  • if no error has occurred only once, at steps 62 and 158 the system is returned to the training dialog screen at step 58 or 150 respectively and the user is invited to pronounce the same or another word having the same type of mispronunciation to ensure that the user is pronouncing the word correctly. Once the user has pronounced words twice in a row without errors, the user is returned at step 62 or 158 to the voice recognition function.
  • if there is an error at step 60 or 154, the system proceeds to step 62 or 150 respectively, where an instruction screen telling the user how to make the sound, with physical instructions on how to move the muscles of the mouth and tongue to achieve the sound, is presented to the user, as well as any other techniques such as the Lessac method described hereinabove.
  • This screen also has a button which may be clicked to cause the system to play back the high-quality audio sound from memory, which was previously recorded during software development, as described above.
  • the user is then invited to say the sound again, and at steps 66 and 152, the user says the phrase into the microphone which is coupled to the computer, which compares the phrase to the database for proper pronunciation and determines whether there is an error in the pronunciation of the phrase at steps 68 and 154 respectively.
  • if there is an error, the system returns to step 58 or 150, where the word is displayed and the user is invited to say the word into the machine to determine whether there is error, with the system testing the output to determine whether it should proceed to speech recognition at steps 28 and 128 when the standard of two consecutive correct pronunciations has been reached. If there is no error at step 68 or 150, however, the tally is cleared and the system proceeds to steps 28 and 128, where normal speech recognition continues, the training session having been completed.
  • an alternative embodiment of the invention 210 is shown in Figure 3, wherein steps analogous to those of the Figure 2 embodiment are numbered with numbers one hundred higher than those in the Figure 2 embodiment.
  • the user has the additional alternative of training the database rather than having the database train his speech in step 262. This option is provided for those users who have particular speech habits that the user wants to accord special attention.
  • the database is taught the user's particular pronunciation error at step 262.
  • the user can assign a high error threshold or tell the database to ignore the error if he or she does not want training and prefers to keep his or her speech affectation.
  • the user may assign a low error threshold if he or she desires extra training for a certain type of error.
  • the above methods can be incorporated into computer systems, including computer-implementable programs, plug-in component hardware, and firmware.
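As noted in the bullet on tallying above, training is proposed only after repeated errors of one class, and the later bullets let the user raise or lower per-class thresholds. The following is a minimal Python sketch of that tally-and-threshold logic; the class labels and the default threshold value are illustrative assumptions, not part of the patent.

```python
from collections import Counter

class ErrorTally:
    """Tally mispronunciation errors by class (cf. steps 38, 140) and propose
    coaching only once a class's tally reaches its threshold (cf. steps 40, 142)."""

    def __init__(self, default_threshold: int = 3):
        # The default threshold of 3 is an illustrative assumption.
        self.counts = Counter()
        self.thresholds = {}          # per-class overrides, settable by the user
        self.default_threshold = default_threshold

    def log_error(self, error_class: str) -> bool:
        """Log one detected mispronunciation; return True when the
        'coach wants to talk to you' prompt should be offered."""
        self.counts[error_class] += 1
        limit = self.thresholds.get(error_class, self.default_threshold)
        return self.counts[error_class] >= limit

# A user who prefers to keep a speech habit can assign a high threshold to it,
# and a low threshold to an error deserving extra training, e.g.:
# tally.thresholds["th->d"] = 99
```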

Abstract

In accordance with the present invention, speech recognition (10) and training (110) methods and systems are disclosed. A microphone receives audible sounds input (28) from a user into a first computing device having a program with a database (16). The database consists of digital representations of known audible sounds and associated alphanumeric representations of the known audible sounds and mispronunciations. The program compares the digital representation to the digital representations of known audible sounds in a database (30) to determine the likely desired output. If an error in recognition (32) occurs, then the user can indicate the proper alphanumeric representation of the particular audible sound (34). This allows the system to determine whether the error is a result of a known type or instance of mispronunciation (36). In response to a determination of the error's nature, the system presents an interactive training program from the computer to the user to enable the user to correct such mispronunciation (45). The present invention has the advantage of improving the voice recognition and speech patterns of the user by focusing in on the user in error correction, thus improving the oral communication skills of the user.

Description

SPEECH RECOGNITION AND TRAINING METHODS AND SYSTEMS
TECHNICAL FIELD The present invention relates to speech recognition technology and voice training using speech recognition of the type typically embodied in speech recognition software implemented on personal computer systems.
BACKGROUND
In 1964, a group of computer scientists marveled over the new computer being delivered into the computer center at the Courant Institute for Mathematical Sciences at New York University. The machine was the latest introduction from Control Data Corporation, a Model CDC 6600, whose speed and memory capacity far outstripped the 7,094 K random access memory capacity of the now humble IBM 7094 that it replaced. In a portent of things to come, they little suspected that IBM, within months, would obsolete the long heralded CDC 6600 with its IBM 360, a machine which, incredibly, had an unheard-of 360 K of RAM, all built with discrete components, a conservative move in the face of concerns about the reliability of the then-new integrated circuit technology. This impressive machine came to be housed in a room about eighteen feet square, and surrounded by ten or so air conditioners necessary to keep the system from overheating and failing. A half dozen tape decks, nearly a meter across and as tall as a man, and several key punch machines, each the size of the table sewing machines used in the garment trade, completed the installation.
Thirty-five years later, changes in technology have been remarkable. Tiny laptop computing devices fly at speeds a thousand times that of those early powerhouse computers and boast thousands of times the memory. Instead of huge reels of recording tape, hard disks with capacities on the order of eighteen GB are found in those same laptop computing devices. These devices, with their huge memory and computing capabilities, move as freely in the business world as people, under the arm, in a bag, or on the lap of a businessman flying across the ocean. No doubt, this technology lies at the foundations of the most remarkable, reliable and completely unanticipated bull market in the history of business.
Just as certainly, the future holds the promise of similar progress.
Notwithstanding the gargantuan magnitude of the progress made in computing during the last third of the 20th century, the world of computing has been largely self-contained. The vast majority of all computing tasks involved computers talking to other computers, or otherwise communicating through use of electrical input signals whose characteristics are substantially absolutely determined. In this respect, computers are completely unlike the humans which they serve, humans whose communications vary in almost infinite ways, regardless of the method of communication, be it voice, writing, or other means. If computing is to continue to make progress, computers must become integrated into usual human communications modalities.
And, indeed, this is already happening. From a slow start at becoming an important factor in the marketplace about a decade ago, speech recognition technology holds just such a promise. A human voice interface to a computer represents what is probably the most ideal, evolutionarily defined modality for the human-computer communications interface. While humans customarily write, gesture and, to a limited extent, use other communications modes, voice communication remains predominant. This is not surprising, insofar as speech has evolved in human beings probably for many millions of years. This is believed to be the case because even relatively primitive forms of life have fairly highly developed "speech" characteristics. For example, much work has been done in the study of the use of sounds to communicate various items of information by whales. Likewise, scientists have identified and cataloged uniform global communications patterns in chimpanzees.
In view of the highly natural nature of communication by speech, a direct result of its having evolved over such a large fraction of the history of the species, speech communications impose an extremely low level of cognitive overhead on the brain, thus providing a facile communications interface while allowing the brain to perform a number of other functions simultaneously. We see this in everyday life. For example, people engaged in sports activities routinely combine complex physical tasks, situational analysis, and exchange of information through speech, sometimes simultaneously transmitting and receiving audible information, while doing all of these other tasks.
Clearly, the mind is well adapted to simultaneously control other tasks while communicating and receiving audible information in the form of speech. It is thus no surprise that virtually every culture on earth has devised its own highly sophisticated audible language.
In view of the above, it is thus easily understood why voice recognition technology has come to be the Holy Grail of computing. While useful work began to be done with this technology about ten years ago, users only obtained performance which left much to be desired. Individual quirks, regional pronunciations, speech defects and impediments, bad habits and the like pepper virtually everybody's speech to some extent. And this is no small matter. Good speech recognition requires not only good technology, it also requires recognizable speech.
Toward this end, speech recognition programs generally have an error correction dialog window which is used to train the system to the features of an individual user's voice, as will be more fully described below. The motivation behind such technique is apparent when one considers and analyzes the schemes typically used in speech recognition systems.
Early on, speech recognition was proposed through the use of a series of bandpass filters. These proposals grew out of the use of spectrographic analysis for the purpose of speaker identification. More particularly, it was discovered that if one made a spectral print of a person saying a particular word, wherein the x-axis represented time and y-axis represented frequency, with the intensity of sound at the various frequencies being displayed in shades of gray or in black and white, the pattern made by almost every speaker was unique, largely as a function of physiology, and speakers could be identified by their spectrographic "prints". Interestingly enough, however, the very diversity which this technique showed suggested to persons working in the field the likelihood that commonalities, as opposed to differences, could be used to identify words regardless of speaker. Hence the proposal for a series of bandpass filters to generate spectrographs for the purpose of speech recognition.
While such an approach was logical given the state of technology in the 1960s, the problems were also apparent. Obtaining high-quality factors or "Q" in electrical filters comprising inductors and capacitors is extremely difficult at audio frequencies.
This is due to a number of factors. First of all, obtaining resonance at these frequencies necessitates the use of large capacitors and inductors. Such components, in the case of capacitors, have substantial resistance leak-through. In the case of inductances, large values of inductance are required, thus requiring large lengths of wire for the windings and, accordingly, high resistance. The result is that the selectivity of the filters is extremely poor and the ability to separate different bandpasses is compromised. Finally, the approach was almost fatally flawed, from a mass-market standpoint, by the fact that these tuned electrical circuits were very large and mechanically cumbersome, as well as very expensive.
However, in the late 1960's, electrical engineers began to model the action of electrical circuits in the digital domain. This work was done by determining, using classical analytic techniques, the mathematical characteristics of the electrical circuit, and then solving these equations for various electrical inputs. In the 1970's, it was well understood that the emerging digital technology was going to be powerful enough to perform a wide variety of computing tasks previously defaulted to the analog world. Thus, it was inevitable that the original approaches to voice recognition through the concept of using banks of tuned circuits would eventually come to be executed in the digital domain.
In a typical speech recognition system, an acoustic signal received by a microphone is input into a voice board which digitizes the signal. The computer then generates a spectrogram which, for a series of discrete time intervals, records those frequency ranges at which sound exists and the intensity of sound in each of those frequency ranges. The spectrogram, referred to in the art as a token, is thus a series of spectrographic displays, one for each of a plurality of time intervals which together form an audible sound to be recognized. Each spectrographic display shows the distribution of energy as a function of frequency during the time interval. In a typical system, sampling rates of 6,000 to 16,000 samples per second are typical, and are used to generate about fifty spectrum intervals per second for an audible sound to be recognized.
In a typical system, quantitative spectral analysis is done for seven frequency ranges, resulting in eight spectral parameters for each fiftieth of a second, or spectral sample period. While the idea that a spectral analysis over time can be a reliable recognition strategy may be counterintuitive given the human perspective of listening to envelope, tonal variation and inflection, an objective view of the strategy shows that exactly this information is laid out in an easy to process spectral analysis matrix.
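As an illustration of the token structure just described, the following minimal Python sketch converts digitized samples into a matrix of band energies, about fifty spectral sample periods per second with eight spectral parameters per period. The sampling rate, band edges, windowing, and energy measure are assumptions chosen for illustration; the patent does not prescribe a particular implementation.

```python
import numpy as np

SAMPLE_RATE = 12000          # within the 6,000-16,000 samples/sec range cited
FRAMES_PER_SEC = 50          # about fifty spectrum intervals per second
BAND_EDGES_HZ = [0, 250, 500, 1000, 2000, 3000, 4500, 6000]  # seven ranges (assumed)

def make_token(samples: np.ndarray) -> np.ndarray:
    """Convert digitized audio into a token: one row of spectral parameters
    per spectral sample period (seven band energies plus total energy)."""
    frame_len = SAMPLE_RATE // FRAMES_PER_SEC
    n_frames = len(samples) // frame_len
    token = np.zeros((n_frames, len(BAND_EDGES_HZ)))  # 7 bands + total = 8
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / SAMPLE_RATE)
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len].astype(float)
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame_len))) ** 2
        for b in range(len(BAND_EDGES_HZ) - 1):
            lo, hi = BAND_EDGES_HZ[b], BAND_EDGES_HZ[b + 1]
            token[i, b] = spectrum[(freqs >= lo) & (freqs < hi)].sum()
        token[i, -1] = spectrum.sum()  # eighth parameter: overall frame energy
    return token
```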
Based on the theoretical underpinnings of the above recognition strategy, development of a speech recognition system involves the input of vocabulary into the hard drive of a computer in the form of the above described spectral analysis matrix, with one or more spectral analysis matrices for each word in the vocabulary of the system. These matrices then serve as word models.
In more advanced systems (such as those using so-called "natural" speech, that is continuous strings of words, the natural tendency of speakers to, on occasion, blend the end of one word into the beginning of another, and less frequently to separate words into two parts, sometimes with association of the parts with different words) models are also developed for these artifacts of the language to be recognized.
Once broken down into a spectral picture over time of frequency energy distributions, recognition of speech is reduced to comparison of known spectral pictures for particular sounds to the sound to be recognized, and achieving recognition through the determination of that model which best matches the unknown speech sound to be recognized. But this picture, while in principle correct, is an unrealistic simplification of the problem of speech recognition.
After a database of word models has been input into the system, comparison of an audible sound to the models in the database can be used as a reliable means for speech recognition. However, there are many differences in the speech patterns of users. For example, different speakers speak at different rates. Thus, for one speaker, a word may take a certain period of time, while for another speaker the same word may take a longer period of time. Moreover, different speakers have voices of different pitch. In addition, speakers may give different inflections, emphasis, duration and so forth to different syllables of a word in different ways, depending on the speaker. Even a single speaker will speak in different ways on different occasions.
Accordingly, effective speech recognition requires normalization of spoken sounds to word and phrase models in the database. In other words, the encoded received sound or token must be normalized to have a duration equal to that of the model. This technology is referred to as time aligning, and results in stretching out or compressing the spoken sound or word to fit it against the model of the word or sound with the objective of achieving the best match between the model and the sound input into the system. Of course, it is possible to leave the sound unchanged and stretch or compress the model.
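A minimal sketch of such time aligning, under the simplifying assumption of uniform linear interpolation (practical engines typically use more elaborate alignment, such as dynamic time warping):

```python
import numpy as np

def time_align(token: np.ndarray, target_frames: int) -> np.ndarray:
    """Stretch or compress a token (rows = spectral sample periods) so that
    its duration equals target_frames, the duration of the model."""
    src = np.linspace(0.0, 1.0, num=len(token))
    dst = np.linspace(0.0, 1.0, num=target_frames)
    # Interpolate each spectral parameter (column) independently over time.
    return np.column_stack(
        [np.interp(dst, src, token[:, p]) for p in range(token.shape[1])]
    )
```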
In accordance with existing technology, each of the spectral sample periods for the sound to be recognized are compared against the corresponding spectral sample periods of the model which is being rated. The cumulative score for all of the sample periods in the sound against the model is a quality rating for the match. In accordance with existing technology, the quality ratings for all the proposed matches are compared and the proposed match having the highest quality rating is output to the system, usually in the form of a computer display of the word or phrase.
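Continuing the sketches above, the period-by-period comparison and cumulative quality rating might be rendered as follows; the Euclidean frame distance and the convention that a higher rating is a better match are illustrative assumptions:

```python
import numpy as np

def rate_match(token: np.ndarray, model: np.ndarray) -> float:
    """Compare each spectral sample period of the token against the
    corresponding period of the model; the cumulative score is the
    quality rating for the match (here, negated total distance)."""
    aligned = time_align(token, len(model))  # time_align from the sketch above
    return -float(np.linalg.norm(aligned - model, axis=1).sum())

def recognize(token: np.ndarray, word_models: dict) -> str:
    """Rate the token against every model and output the proposed match
    having the highest quality rating."""
    return max(word_models, key=lambda word: rate_match(token, word_models[word]))
```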
However, even this relatively complex system fails to achieve adequate quality in the recognition of human speech. Accordingly, most commercial systems do a contextual analysis and also require or strongly recommend a period of additional training, during which the above matching functions are performed with respect to a preselected text. During this process, the model is amended to take into account the individual characteristics of the person training the system. Finally, during use, an error correction dialog box is used when the user detects an error, inputs this information into the system and thus causes the word model to become adapted to the user's speech. This supplemental training of the system may also be enhanced by inviting the user, during the error correction dialog, to speak the word, as well as other words that may be confused with the word by the system, into the system to further train the recognition engine.
As is apparent from the above discussion, the development of speech recognition systems has centered on assembling a database of sound models likely to have a high degree of correlation to the speech to be recognized by the speech recognition engine. Such assembly of the database takes two forms. The first is the input of global information using one or more speakers to develop a global database. The second method in which the database is assembled is the training of the database to a particular user's speech, typically done both during a training session with preselected text, and on an ad hoc basis through use of the error correction dialog window in the speech recognition program.
SUMMARY OF THE INVENTION The present invention stems from the recognition that spectrograms of audible sounds may be used not only to recognize speakers and words, but also mispronunciations. In accordance with the invention, the performance of the speech recognition software is improved by focusing in on the user, as opposed to the software. In particular, the invention has as its objective the improvement of the speech patterns of persons using the software. The result is enhanced performance, with the bonus of voice training for the user. Such training is of great importance. For example, salesmen, lawyers, store clerks, mothers dealing with children and many others rely heavily on oral communication skills to accomplish daily objectives. Nevertheless, many individuals possess poor speaking characteristics and take this handicap with them to the workplace and throughout daily life.
Perhaps even more seriously, speech defects, regionalisms, and artifacts indicative of social standing, ethnic background and level of education often hold back persons otherwise eminently qualified to advance themselves in life. For this reason, speech, as a subject, has long been a part of the curricula in many schools, although in recent years education in this area has become, more and more, relegated to courses of study highly dependent on good speaking ability, such as radio, television, motion pictures and the theater.
Part of the problem here has been the difficulty of finding good voice instructors and the relatively high cost of the individualized instruction needed for a high degree of effectiveness in this area. In accordance with the invention, a specialized but highly effective speech training regimen is provided for application in the context of speech recognition software for receiving human language inputs in audio form to a microphone, analyzing the same in a personal computer and outputting alphanumeric documents and navigation commands for control of the personal computer, and alphanumeric guidance and aural pronunciation examples from the sound card and speakers associated with a personal computer.
In accordance with the present invention, speech recognition is performed on a first computing device using a microphone to receive audible sounds input by a user into a first computing device having a program with a database consisting of (i) digital representations of known audible sounds and associated alphanumeric representations of the known audible sounds and (ii) digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases. The method is performed by receiving the audible sounds in the form of the electrical output of a microphone. The training method is performed by having the person being trained read a preselected text and translating the audible sounds into the form of the electrical output of a microphone being sent to a computer. A particular audible sound to be recognized is converted into a digital representation of the audible sound.
The digital representation of the particular audible sound is then compared to the digital representations of the known audible sounds to determine which of those known audible sounds is most likely to be the particular audible sound being compared to the sounds in the database. A speech recognition output consisting of the alphanumeric representation associated with the audible sound most likely to be the particular audible sound is then produced. An error indication is then received from the user indicating that there is an error in recognition. The user also indicates the proper alphanumeric representation of the particular audible sound. This allows the system to determine whether the error is a result of a known type or instance of mispronunciation. In the case of voice training, the digital representation of the particular audible sound is then compared to the digital representations of a proper pronunciation of that audible sound to determine whether there is an error that results from a known type or instance of mispronunciation. In response to a determination of error corresponding to a known type or instance of mispronunciation, the system presents an interactive training program from the computer to the user to enable the user to correct such mispronunciation.
The presented interactive training program comprises playback of the properly pronounced sound from a database of recorded sounds corresponding to proper pronunciations of the mispronunciations resulting from the known classes of mispronounced words and phrases.
In accordance with a preferred embodiment of the invention, the user is given the option of receiving speech training or of training the program to recognize the user's speech pattern; the choice rests with the user of the program.
In accordance with the invention, the determination of whether the error is a result of a known type or instance of mispronunciation is performed by comparing the mispronunciation to the digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases using a speech recognition engine. It is anticipated that the inventive method will be implemented by having the database consisting of (i) digital representations of known audible sounds and associated alphanumeric representations of the known audible sounds and (ii) digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases, generated by the steps of speaking and digitizing the known audible sounds and the known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases. The database will then be introduced into the computing device of many users after the generation by speaking and digitizing has been done on another computing device and transferred together with voice recognition and error correcting subroutines to the first computing device using CD-ROM or other appropriate data carrying medium.
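The mispronunciation-class determination described above can be sketched by running the same matcher against the error-model database; the dictionary shape and the acceptance threshold below are assumptions for illustration:

```python
def classify_mispronunciation(token, error_models, threshold=-50.0):
    """Compare the misrecognized token against the digital representations of
    known mispronunciations; return the best-matching error class, or None
    if no error model rates above the (illustrative) threshold."""
    best_class, best_score = None, threshold
    for error_class, model in error_models.items():
        score = rate_match(token, model)  # rate_match from the sketch above
        if score > best_score:
            best_class, best_score = error_class, score
    return best_class
```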
It is also contemplated that mispronunciations are input into the database by actual speakers who have such errors as a natural part of their speech patterns.
In accordance with the invention, normalization to word, phrase and other sound models may be achieved by normalizing words or phrases to one of a plurality of sound durations. This procedure is followed with respect to all the word and phrase models in the database. When a word is received by the system, it measures the actual duration, and then normalizes the duration of the sound to one of the plurality of preselected normalized sound durations. This reduces the number of items in the database against which the sound is compared and rated.
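A minimal sketch of this duration bucketing, reusing the time_align sketch above; the particular set of preselected durations is an assumption:

```python
# Illustrative preselected normalized sound durations, in spectral frames.
NORMALIZED_DURATIONS = (10, 20, 35, 50, 75)

def normalize_duration(token):
    """Measure the token's actual duration and normalize it to the nearest
    preselected duration, so it need only be compared and rated against the
    subset of models stored at that same duration."""
    nearest = min(NORMALIZED_DURATIONS, key=lambda d: abs(d - len(token)))
    return time_align(token, nearest)  # time_align from the sketch above
```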
BRIEF DESCRIPTION OF THE DRAWINGS The advantages, and the system and apparatus of the present invention will be understood from the following description taken together with the drawings, in which one way of carrying out the speech recognition invention, and two ways of carrying out the training invention, are described in connection with the figures, in which:
Figure 1 is a block diagram illustrating a voice recognition program in accordance with the method of the present invention;
Figure 2 is a block diagram illustrating a voice recognition program in accordance with the training method of the present invention;
Figure 3 is an alternative embodiment of the inventive system illustrated in Figure 2.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Referring to Figures 1 and 2, the system and method of the present invention may be understood. In accordance with the inventive method 10, a voice and error model is generated using subroutines 12 and 112. Subroutines 12 and 112 comprise a number of steps which are performed at the site of the software developer, the results of which are sent, for example, in the form of a CD-ROM, other media or via the Internet, together with the software for executing voice recognition, to a user, as will be apparent from the description below. In accordance with the present invention, the inventive speech recognition method may be practiced on personal computers, as well as on more advanced systems, and even on relatively stripped down lightweight systems, those referred to as subnotebooks, and even smaller systems, provided the same have sound boards for interfacing with and receiving the output of a microphone. It is noted that quality sound board electronics are important to good recognition and to successful practice of the methods of the invention.
SOFTWARE DEVELOPMENT PHASE At steps 14 and 114 a database of word models is generated by having speakers speak the relevant words and phrases into a microphone connected to the sound board of the personal computer being used to generate the database. In accordance with the preferred embodiment of the invention, speakers who have been trained in proper speech habits are used to input words, phrases and sounds into the database at steps 14 and 114. As the information is generated by the speakers speaking into microphones attached to the sound boards in the computer, the information is digitized, analyzed and stored on the hard drive 16, 116 of the computer.
In accordance with the present invention, relatively common pronunciation errors are also input into the system at steps 18 and 118. In this specification the term "phoneme" is used to mean the smallest sound, perhaps meaningless in itself, capable of indicating a difference in meaning between two words. The word "dog" differs from "cog" by virtue of a change between the phoneme "do", pronounced "daw", and the phoneme "co", pronounced "cah".
Thus, at steps 18 and 118, the model-generating speaker can speak a database of common phoneme errors into the microphone attached to the sound board of the computer, resulting in input of an error database into hard drive 16, 116 of the computer. However, it is preferred that the phoneme errors be spoken by persons who in various ways make the pronunciation error as part of their normal speech patterns.
At steps 20 and 120, the system is enhanced by the introduction into the database contained on hard drive 16, 116 of a plurality of exercise word models, selected for the purpose of training the speech of a user of the system. The same are input into the system through the use of a microphone and sound board, in the same way that the database of the language model was input into the system. Generally, a collection of word and/or phrase models is associated with each type of phoneme error. This is because, if a person makes a speech pronunciation error of a particular type, it is likely that the same speaker makes certain other errors which have common characteristics with the other pronunciation errors in the group. For example, a person who mispronounces the word "them" to sound like "dem" is also likely to mispronounce the words "that", "those" and "these".
Exercise phrase models are input at steps 22 and 122. These exercise phrase models are stored by the system in hard drive 16, 116. The exercise word models and the exercise phrase models, input into the system at steps 20, 22, 120 and 122 respectively, are associated in groups having common mispronunciation characteristics. The same are input into the system through the use of a microphone and sound board, in the same way that the database of the language model was input into the system.
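The grouping of exercise material by common mispronunciation characteristics might be organized as in the following illustrative sketch; the class labels, words and phrases are hypothetical examples, not an exhaustive list:

# Each known phoneme-error class is linked to the exercise words and
# phrases that tend to be mispronounced together (all names illustrative).
EXERCISE_GROUPS = {
    "th->d": {
        "words":   ["them", "that", "those", "these"],
        "phrases": ["those there", "them and these"],
    },
    "oil->earl": {
        "words":   ["oil", "boil", "spoil"],
        "phrases": ["boiling oil"],
    },
}

def exercises_for(error_class):
    """Fetch the exercise material associated with one error class."""
    group = EXERCISE_GROUPS[error_class]
    return group["words"] + group["phrases"]

print(exercises_for("th->d"))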
In addition, in accordance with the present invention, it is recognized that computer errors may result in misrecognition of a particular error, mistaken acceptance of a mispronunciation, or mistaken rejection of a proper pronunciation. Accordingly, during the database generation session in which properly pronounced exercise word models or exercise phrase models are input into the system at steps 20, 22, 120 and 122, audio recordings of the same are also stored on hard disk 16, 116, to allow for playback of these proper pronunciations during use of the program by a person performing speech recognition. This provides an audible cue to the user and allows the user to monitor the reliability of the system during the voice recognition and speech training process of the present invention.
In accordance with the invention, it is anticipated that there may be more than one mispronunciation associated with a particular word or phrase. Accordingly, at steps 24 and 124, a plurality of typical mispronunciations are input into the system to create a database of exercise word error models in hard drive 16, 116. The same are input into the system through the use of a microphone and sound board, in the same way that the database of the language model was input into the system.
Finally, the database of relatively common mispronunciation errors is completed at steps 26 and 126, where the speaker generating that database speaks into the system to generate a plurality of exercise phrase error models. These error models are also input into the system through the use of a microphone and stored on hard drive 16, 116.
In accordance with a preferred embodiment of the invention, the input of audible sounds into the system to generate the word error models at steps 24 and 124 and the exercise phrase error models at steps 26 and 126 is done using a speaker or speakers who have the actual speech error as part of their normal speech patterns. The same is believed to achieve substantially enhanced recognition of speech errors, although it is not believed to be necessary to a functioning system.
In accordance with the preferred embodiment of the invention, the models stored on hard disk 16, 116 and generated as described above may be recorded on a CD-ROM or other program-carrying media, together with a voice recognition engine, such as those marketed by any one of a number of manufacturers, such as IBM, Dragon Systems, and others. In accordance with the present invention, such a prior art speech recognition program may be used for the purposes of recognizing words and recognizing mispronunciations and phoneme errors, together with the above described audio recordings of proper pronunciations, both during speech recognition operation and training sessions (Figure 1), and during speech training sessions with the inventive interactive program (Figures 2 and 3).
In accordance with the invention, such software, comprising the speech recognition engine, editing and training utilities, and database of word models, phrase models, vocal recordings, and error models, may be supplied to the user for a one-time fee and transported over a publicly accessible digital network, such as the Internet. Alternatively, the software may be made available for limited use for any period of time, with charges associated with each such use, in which case the software would never be permanently resident on the computer of a user.
USER TRAINING PROGRAM
When a user desires to use the inventive speech recognition program, the software containing the program and the database is loaded into a personal computer and words are spoken into a microphone coupled to the sound board of the computer, in order to input the speech into the computer in the manner of a conventional speech recognition program.
Additionally, when a user desires to use the inventive voice training program, the software containing the program and the database is loaded into a personal computer and the student user is instructed to read a preselected text that appears on the screen of the computer. Thus, words are spoken into a microphone coupled to the sound board of the computer, in order to input the speech into the computer in the manner of a conventional speech recognition program.
More particularly, as discussed above, after the system has proceeded through the performance of steps 14, 18, 20, 22, 24, 26, 114, 118, 120, 124, and 126 and the speech recognition engine, editing and training utilities have been added, the system proceeds at steps 28 and 128 to receive, through a microphone, speech to be recognized from a user of the program who has loaded the speech recognition engine, editing and training utilities, and database of word models, phrase models, vocal recordings, and error models onto the user's personal computer. In this respect, the operation of the speech recognition program of the present invention is substantially identical to that of other speech recognition programs presently on the market. More particularly, at steps 30 and 130, a conventional speech recognition algorithm is applied to recognize audible sounds as the words which they are meant to represent.
The computer then outputs the recognized speech on the screen of the computer monitor, and the next phrase uttered by the user proceeds at steps 30 and 130 through the speech recognition algorithm, resulting in that speech also being displayed on the monitor screen. When the user notices that an error has occurred, he may use any one of a number of different techniques to bring up an error correction window at steps 32 and 132. For example, he may simply double-click on the error, or highlight the erroneous recognition and hit a key dedicated to presentation of the error correction window.
User correction occurs at steps 34 and 134. In typical programs, call-up of the error correction window results in the presentation of a screen showing the highlighted word and suggesting, through the use of a menu, a number of alternatives which may be selected by double-clicking in order to correct the error. If the problem word is not in the menu of alternatives, the user may type in the problem word or spell it out. After the system has been given the correct word by any of these means, the same is input into the system.
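A console-based stand-in for such an error correction window might look like the following sketch; the prompt wording is illustrative, and a real implementation would use the program's graphical dialog instead:

def correction_dialog(recognized, alternatives):
    """Console stand-in for the error-correction window: offer the menu
    of alternatives, or let the user type the intended word."""
    print(f"Recognized: {recognized!r}. Alternatives:")
    for i, alt in enumerate(alternatives, start=1):
        print(f"  {i}. {alt}")
    choice = input("Pick a number or type the correct word: ").strip()
    if choice.isdigit() and 1 <= int(choice) <= len(alternatives):
        return alternatives[int(choice) - 1]
    return choice  # the user typed or spelled out the problem word

# corrected = correction_dialog("dem", ["them", "den", "deem"])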
At this point, the call-up of the error correction window at steps 34 and 134 has indicated to the system that there is an error. While some errors are unrelated to pronunciation errors, many are. Once the user indicates the error, the system proceeds at steps 36 and 136 to determine whether the error made by the user is recognized as one of the speech errors known to the system. If it is, this is determined at steps 36 and 138, and the nature of the pronunciation error is then input into the system and logged at steps 38 and 140. In this manner, the system keeps track of the number of errors of a particular type for the user by storing them and tallying them in hard drive 16, 116.
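The per-class tally described above can be kept with a simple counter, as in this illustrative sketch (persistence to the hard drive is omitted for brevity):

from collections import Counter

error_tally = Counter()          # persisted to disk in the full system

def log_error(error_class):
    """Record one detected instance of a known mispronunciation class
    and report the running count for that class."""
    error_tally[error_class] += 1
    return error_tally[error_class]

log_error("th->d")
log_error("th->d")
print(error_tally["th->d"])      # -> 2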
In accordance with the present invention, it is contemplated that speech training will not be triggered by a single mispronunciation. Instead, it is contemplated that repeated instances of a single type of mispronunciation error will be tallied, and only when a threshold of pronunciation errors for that error is reached in the tally will speech training be proposed, by the appearance on the screen of a prompt window suggesting speech training. The same could take the form of a window having the words "The system has determined that it is likely that we can improve your recognition and speech by coaching you now. Would you like to speak to the speech coach?" The screen may also have a headline above the question, such as "The coach wants to talk to you!" The screen will also have a button bar "OK" to start a training session. A button marked "Cancel" may also be included, so the student can click on the "Cancel" button to delay the speech coaching session for a limited amount of time or cancel it altogether.
It is also noted that other combinations of events may be used to trigger training. For example, if the particular mispronunciation detected is a very well-defined one, such as the almost uniform tendency of some speakers to mispronounce the word "oil" as "earl", the definiteness with which this error has been determined makes training relatively likely to be necessary, and the threshold for that error can be lowered to, for example, one instance of that error being detected. In other cases, or in the general case, one may wish to set the threshold at three, five or even ten instances of the error before the "The coach wants to talk to you!" screen is presented to the user of the system.
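The per-class thresholds described here might be looked up as in this sketch; the default of five and the single-instance threshold for the "oil"/"earl" error follow the examples above, while the function name is hypothetical:

DEFAULT_THRESHOLD = 5            # general case: several instances required

# Well-defined errors can justify coaching after a single detection.
PER_CLASS_THRESHOLD = {"oil->earl": 1}

def should_offer_coaching(error_class, tally):
    """True once the tally for this class reaches its (possibly lowered)
    threshold, at which point the coaching prompt is shown."""
    return tally >= PER_CLASS_THRESHOLD.get(error_class, DEFAULT_THRESHOLD)

print(should_offer_coaching("oil->earl", 1))   # -> True
print(should_offer_coaching("th->d", 3))       # -> False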
Once a mispronunciation has been detected by the system as a result of the information input by the user to the user correction screen at steps 34 and 134, the error correction algorithm operates in a manner identical to the speech recognition algorithm at steps 30 and 130, except that the error correction algorithm checks the database of common phoneme errors input into the system by the software developer at steps 18 and 118, and the exercise word error models and exercise phrase error models input at steps 24, 26, 124 and 126. In connection with this, it is noted that the so-called phoneme errors relate to particular sounds consisting of one syllable or less, while the phrase and word models are somewhat more general, as described herein.
Thus, if, at steps 40 or 142, the system determines that the threshold number of errors in that class has not been reached, it sends the system back to steps 28 and 128, where speech recognition proceeds. If, on the other hand, a predetermined number of errors of the same class have been detected by the system and logged at steps 38 or 140, at steps 40 or 142 the system is sent to step 42 or 144, where the above described "The coach wants to talk to you!" screen is presented to the user, who is thus given the opportunity to train his voice.
If the speech recognition user declines the opportunity to train at step 42, he is given the opportunity to train the database at step 43. If he declines that opportunity also, the system is returned to step 28, where, again, speech recognition proceeds.
However, if he accepts the opportunity to train the database, the system proceeds to step 45, where the database is trained in the same manner as in a conventional speech recognition processing program.
If the speech training user declines the opportunity to train at step 146, the system is returned to step 128, where, again, reading of the rest of the preselected pronunciation-error-detecting text proceeds.
In the other case, at step 42 or 146, when the user decides to accept speech training, the system proceeds to step 44 or 148 respectively, where the determination is made as to whether the particular error is an error in the pronunciation of a word or of what is referred to herein as a phrase. By "phrase" in this context is meant a combination of parts of at least two different words. This may mean two or more words, or the combination of one or more words and at least a syllable from another word, most often the end of one word combined with the beginning of another word, following the tendency of natural speakers to couple sounds to each other, sometimes varying their stand-alone pronunciation. If, at step 44 or 148, the system determines that the mispronunciation is the mispronunciation of a word, the system is sent to step 46 or 150 respectively, where the system retrieves from memory words which have the same or similar mispronunciation errors.
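A crude stand-in for this word-versus-phrase determination, treating any error unit that spans a word boundary as a phrase (a simplification for illustration only; the actual determination is made at steps 44 and 148 against the stored models):

def error_unit(mispronounced_text):
    """Rough word/phrase decision: anything spanning a word boundary
    (whitespace) is treated as a phrase."""
    return "phrase" if " " in mispronounced_text.strip() else "word"

print(error_unit("them"))       # -> "word"
print(error_unit("want to"))    # -> "phrase" (coupled in natural speech)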
As noted above, these words have been stored in the system, not only in the form of alphanumeric presentations, but also in high-quality audio format. The object of the storage of the high-quality audio sound is to provide for audible playback of the words in the training dialog screen.
The words retrieved at steps 46 and 150 are also presented on-screen in alphanumeric form to the user, and the user is invited to pronounce the word. If the word is pronounced properly, this is determined at steps 48, 154. If there is no error, the system proceeds to steps 50, 156, where the system determines whether two instances of no error have occurred consecutively. If no error has occurred twice consecutively, the system is returned to act as a voice recognition system at steps 28, 128. If no error has occurred only once, at steps 50 and 158 the system is returned to the training dialog screen at step 46 or 150 respectively, and the user is invited to pronounce the same or another word having the same type of mispronunciation to ensure that the user is pronouncing the word correctly. Once the user has pronounced words twice in a row without errors, the user is returned at step 50 or 156 respectively, to the voice recognition function.
However, where an error has been detected at step 48 or 146, the system proceeds to step 50 or 150 respectively, where an instruction screen telling the user how to make the sound, with physical instructions on how to move the muscles of the mouth and tongue to achieve the sound, is presented to the user.
The screen allows for the incorporation of more creative speech training approaches, such as the Lessac method described in The Use and Training of the Human Voice - A Bio-Dynamic Approach to Vocal Life, Arthur Lessac, Mayfield Publishing Co. (1997). In this technique the user is encouraged to use his "inner harmonic sensing." This enhances the description of a particular sound by having the user explore how the sound affects the user's feelings, or encourages the user to some action.
In an illustrative example, the Lessac method teaches the sound of the letter "N" by not only describing the physical requirements but also instructing the user to liken the sound to the "N" in violin and to "Play this consonant instrument tunefully." This screen also has a button which may be clicked to cause the system to play back the high-quality audio sound from memory, which was previously recorded during software development, as described above.
The system may also incorporate interactive techniques. This approach presents the user with a wire-frame drawing of a human face depicting, amongst other information, placement of the tongue, movement of the lips, etc. The user may interactively move the wire-frame drawing to get a view from various angles, or cause the sounds to be made slowly so that the "facial" movements can be carefully observed.
Also, in accordance with the invention, it is contemplated that substantial improvement in the system may be obtained by training the database in the system in accordance with the training techniques, and in accordance with the speaking techniques, of individuals who are familiar with the particular method of speech training to be implemented in the system. For example, if the system will use the so-called Lessac system, then the individuals inputting into the database should be Lessac-trained speakers, who will input a vocal phonetic database which is particularly well tailored to the Lessac standard. On the other hand, if the system is to use a system other than Lessac, then the persons inputting the information into the database of phonetic sounds, words, and so forth would be individuals trained in that other system, thus resulting in consistency between the database and the particular training methodology used.
The user is then invited to say the sound again, and at steps 54 and 152, the user says the word into the microphone coupled to the computer, which compares the word to the database for proper pronunciation and determines whether there is an error in the pronunciation of the word at steps 56 and 154 respectively.
If there is an error, the speech recognition system 10 is sent back to step 46 where, again, the word is displayed and the user invited to say the word into the machine to determine whether there is an error, with the system testing the output to determine whether it should proceed to speech recognition at step 28, when the standard of two consecutive correct pronunciations has been reached. If there is no error at step 56, however, the tally is cleared and the system proceeds to step 28, where normal speech recognition continues.
If, in the speech training system 110, there is an error, the error tally flag is set at step 158 and the system is sent back to step 150 where, again, the sound is displayed in alphanumeric form and the user invited to say the sound into the machine, with the system testing the output to determine whether there is an error at step 154. If no pronunciation error is found, the system determines in step 156 whether the previous attempt was an error by checking whether the tally error flag is set. If the flag is set, indicating that the previous attempt had a pronunciation error, then the system is sent to step 158, where the tally flag is now cleared, and the system returns to step 150. If, in step 156, the tally flag is found not set, indicating that the previous attempt had no pronunciation error, then the standard of two consecutive correct pronunciations has been met and training has been completed.
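The two-consecutive-correct standard and tally flag described above amount to a small state machine, sketched below for illustration; the flag starts set because an error is what brings the user into training:

def run_training_loop(attempts):
    """Replay the tally-flag logic: training ends only after two correct
    pronunciations in a row. `attempts` lists True/False per try."""
    error_flag = True                # an error triggered training
    for correct in attempts:
        if not correct:
            error_flag = True        # step 158: remember the failure, retry
        elif error_flag:
            error_flag = False       # first success after an error: try again
        else:
            return True              # step 156: two consecutive successes
    return False                     # attempts exhausted, still training

print(run_training_loop([True, True]))          # -> True
print(run_training_loop([True, False, True]))   # -> False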
If, at step 44 or 148, the system determines that the mispronunciation is the mispronunciation of a phrase, the system is sent to step 58 or 150 respectively, where the system retrieves from memory phrases which have the same or similar mispronunciation errors.
As noted above, these phrases have been stored in the system, not only in the form of alphanumeric presentations, but also in high-quality audio format. The object of the storage of the high-quality audio sound is to provide for audible playback of the phrases in the training dialog screen.
The words or phrases retrieved at step 58 or 150 are also presented on-screen in alphanumeric form to the user, and the user is invited to pronounce the word or phrase. If the phrase is pronounced properly, this is determined at step 60 or 154 respectively. If there is no error, the system proceeds to step 62 or 158 respectively, where the system determines whether two instances of no error have occurred. If no error has occurred twice, the system is returned to act as a voice recognition system at steps 28 and 128.
If no error has occurred only once, at steps 62 and 158 the system is returned to the training dialog screen at step 58 or 150 respectively, and the user is invited to pronounce the same or another phrase having the same type of mispronunciation to ensure that the user is pronouncing the phrase correctly. Once the user has pronounced phrases twice in a row without errors, the user is returned at step 62 or 158 to the voice recognition function.
However, where an error has been detected at step 60 or 154, the system proceeds to step 62 or 150 respectively, where an instruction screen telling the user how to make the sound, with physical instructions on how to move the muscles of the mouth and tongue to achieve the sound, is presented to the user, as well as any other techniques, such as the Lessac method described hereinabove. This screen also has a button which may be clicked to cause the system to play back the high-quality audio sound from memory, which was previously recorded during software development, as described above.
The user is then invited to say the sound again, and at steps 66 and 152, the user says the phrase into the microphone coupled to the computer, which compares the phrase to the database for proper pronunciation and determines whether there is an error in the pronunciation of the phrase at steps 68 and 154 respectively.
If there is an error, the system is sent back to step 58 or 150 where, again, the phrase is displayed and the user invited to say the phrase into the machine to determine whether there is an error, with the system testing the output to determine whether it should proceed to speech recognition at steps 28, 128, when the standard of two consecutive correct pronunciations has been reached. If there is no error at step 68 or 150, however, the tally is cleared and the system proceeds to steps 28, 128, where normal speech recognition continues, the training session having been completed.
An alternative embodiment of the invention 210 is shown in Figure 3, wherein steps analogous to those of the Figure 2 embodiment are numbered with numbers one hundred higher than those in the Figure 2 embodiment. In Figure 3, the user has the additional alternative of training the database, rather than having the database train his speech, in step 262. This option is provided for those users who have particular speech habits to which they want to accord special attention. The database is taught the user's particular pronunciation error at step 262. The user can assign a high error threshold, or tell the database to ignore the error, if he or she does not want training and prefers to keep his or her speech affectation. Alternatively, the user may assign a low error threshold if he or she desires extra training for a certain type of error. The above methods can be incorporated into computer systems, including computer-implementable programs, plug-in component hardware, and firmware.
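The user-assigned thresholds of this embodiment might be layered over the shipped defaults as in this illustrative sketch; the sentinel value and function name are hypothetical:

IGNORE = float("inf")                     # sentinel: never trigger coaching
user_thresholds = {}                      # per-user overrides of the defaults

def effective_threshold(error_class, default=5):
    """User overrides win over the shipped defaults; IGNORE disables
    coaching for an error the user wants to keep as a speech habit."""
    return user_thresholds.get(error_class, default)

user_thresholds["oil->earl"] = IGNORE     # keep the affectation untouched
user_thresholds["th->d"] = 1              # ask for extra training instead

print(3 >= effective_threshold("th->d"))       # -> True, coach now
print(99 >= effective_threshold("oil->earl"))  # -> False, never coach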
While an illustrative embodiment of the invention has been described, together with several alternatives for various parts of the system, it is, of course, understood that various modifications will be obvious to those of ordinary skill in the art. Such modifications are within the spirit and scope of the invention, which is limited and defined only by the following claims.

Claims

Claims:
1. A method of speech recognition (10) using a microphone to receive audible sounds input by a user into a first computing device having a program with a database (16) consisting of (i) digital representations of known audible sounds and associated alphanumeric representations of said known audible sounds and (ii) digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases, comprising the steps of:
(a) receiving said audible sounds in the form of the electrical output of said microphone (28);
(b) converting a particular audible sound into a digital representation of said audible sound;
(c) comparing said digital representation of said particular audible sound to said digital representations of said known audible sounds to determine which of said known audible sounds is likely to be the particular audible sound being compared to the sounds in said database (30);
(d) outputting as a speech recognition output the alphanumeric representations associated with said audible sound likely to be said particular audible sound;
(e) receiving an error indication from said user indicating that there is an error in recognition (32); and
(f) receiving from said user an indication of the proper alphanumeric representations of said particular audible sound (34);
characterized by further comprising:
(g) determining whether said error is a result of a known type or instance of mispronunciation (36); and
(h) in response to a determination of error corresponding to a known type or instance of mispronunciation, presenting an interactive training program from said computer to said user to enable said user to correct such mispronunciation (45).
2. A method of speech recognition (10) as claimed in claim 1, characterized in that said interactive training program comprises playback of the properly pronounced sound from a database of recorded sounds corresponding to proper pronunciations of said mispronunciations resulting from said known classes of mispronounced words and phrases.
3. A method of speech recognition (10) as claimed in claim 1 or 2, characterized in that the user is given the option of receiving speech training or training the program to recognize the user's speech pattern (42).
4. A method of speech recognition (10) as claimed in claim 3, characterized in that said determination of whether said error is a result of a known type or instance of mispronunciation is performed by comparing the mispronunciation to said digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases (36) using a speech recognition engine.
5. A method of speech recognition (10) as claimed in claim 1, characterized in that said determination of whether said error is a result of a known type or instance of mispronunciation is performed by comparing the mispronunciation to said digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases (36) using a speech recognition engine.
6. A method of speech recognition (10) as claimed in claim 1, characterized in that said database consisting of (i) digital representations of known audible sounds and associated alphanumeric representations of said known audible sounds (14) and (ii) digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases (18), is generated by the steps of speaking (28) and digitizing said known audible sounds and said known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases (36).
7. A method of speech recognition (10) as claimed in claim 6, characterized in that said database (16) has been introduced into said computing device after said generation by speaking (28) and digitizing has been done on another computing device and transferred together with voice recognition (30) and error correcting subroutines to said first computing device.
8. A method of speech training (110) using a microphone to receive audible sounds input by a user into a first computing device having a program with a database (16) comprising (i) digital representations of known audible sounds and associated alphanumeric representations of said known audible sounds, and (ii) digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases, characterized by comprising the steps of:
(a) presenting a training text to the user for reading into a microphone to send audible sounds in the form of the electrical output of the microphone to the computing device (128);
(b) converting a particular audible sound into a digital representation of said audible sound;
(c) comparing said digital representation of said particular audible sound to said digital representations of said known audible sounds to determine which of said known audible sounds is likely to be the particular audible sound being compared to the sounds in said database (130);
(d) outputting as a speech recognition output the alphanumeric representations associated with said audible sound likely to be said particular audible sound;
(e) receiving an error indication from the computing device indicating that there is an error in recognition;
(f) receiving from said user an indication of the proper alphanumeric representations of said particular audible sound ( 134);
(g) determining whether said error is a result of a known type or instance of mispronunciation (36); and
(h) in response to a determination of error corresponding to a known type or instance of mispronunciation, presenting an interactive training program from said computer to said user to enable said user to correct such mispronunciation.
9. A method of speech training (110) as claimed in claim 8, characterized in that said interactive training program comprises playback of the properly pronounced sound from a database of recorded sounds corresponding to proper pronunciations of said mispronunciations resulting from said known classes of mispronounced words and phrases.
10. A method of speech training (110) as claimed in claim 8 or 9, characterized in that the user is given the option of receiving speech training or training the program to recognize the user's speech pattern (144).
11. A method of speech training (110) as claimed in claim 10, characterized in that said determination of whether said error is a result of a known type or instance of mispronunciation is performed by comparing the mispronunciation to said digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases (136) using a speech recognition engine.
12. A method of speech training (110) as claimed in claim 8, characterized in that said determination of whether said error is a result of a known type or instance of mispronunciation is performed by comparing the mispronunciation to said digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases (136) using a speech recognition engine.
13. A method of speech training (110) as claimed in claim 8, characterized in that said database consisting of (i) digital representations of known audible sounds and associated alphanumeric representations of said known audible sounds (114) and (ii) digital representations of known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases (118), is generated by the steps of speaking (128) and digitizing said known audible sounds and said known audible sounds corresponding to mispronunciations resulting from known classes of mispronounced words and phrases (136).
14. A method of speech training (110) as claimed in claim 13, characterized in that said database (116) has been introduced into said computing device after said generation by speaking (128) and digitizing has been done on another computing device and transferred together with voice recognition (130) and error correcting subroutines to said first computing device.
PCT/US2001/012959 2000-04-21 2001-04-23 Speech recognition and training methods and systems WO2001082291A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001255560A AU2001255560A1 (en) 2000-04-21 2001-04-23 Speech recognition and training methods and systems

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US55381100A 2000-04-21 2000-04-21
US55381000A 2000-04-21 2000-04-21
US09/553,811 2000-04-21
US09/553,810 2000-04-21

Publications (1)

Publication Number Publication Date
WO2001082291A1 true WO2001082291A1 (en) 2001-11-01

Family

ID=27070435

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/012959 WO2001082291A1 (en) 2000-04-21 2001-04-23 Speech recognition and training methods and systems

Country Status (2)

Country Link
AU (1) AU2001255560A1 (en)
WO (1) WO2001082291A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6847931B2 (en) 2002-01-29 2005-01-25 Lessac Technology, Inc. Expressive parsing in computerized conversion of text to speech
US6865533B2 (en) 2000-04-21 2005-03-08 Lessac Technology Inc. Text to speech
US6963841B2 (en) 2000-04-21 2005-11-08 Lessac Technology, Inc. Speech training method with alternative proper pronunciation database
EP1606793A1 (en) * 2002-12-31 2005-12-21 Lessac Technology, Inc. Speech recognition method
EP2337006A1 (en) * 2009-11-24 2011-06-22 Kai Yu Speech processing and learning
CN113284380A (en) * 2021-05-26 2021-08-20 秦皇岛职业技术学院 Oral english practice trainer based on artificial intelligence
US11735169B2 (en) 2020-03-20 2023-08-22 International Business Machines Corporation Speech recognition and training for data inputs

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4783803A (en) * 1985-11-12 1988-11-08 Dragon Systems, Inc. Speech recognition apparatus and method
US5010495A (en) * 1989-02-02 1991-04-23 American Language Academy Interactive language learning system
US5231670A (en) * 1987-06-01 1993-07-27 Kurzweil Applied Intelligence, Inc. Voice controlled system and method for generating text from a voice controlled input
US5487671A (en) * 1993-01-21 1996-01-30 Dsp Solutions (International) Computerized system for teaching speech
US5679001A (en) * 1992-11-04 1997-10-21 The Secretary Of State For Defence In Her Britannic Majesty's Government Of The United Kingdom Of Great Britain And Northern Ireland Children's speech training aid
US5717828A (en) * 1995-03-15 1998-02-10 Syracuse Language Systems Speech recognition apparatus and method for learning
US5787231A (en) * 1995-02-02 1998-07-28 International Business Machines Corporation Method and system for improving pronunciation in a voice control system
GB2323693A (en) * 1997-03-27 1998-09-30 Forum Technology Limited Speech to text conversion
US5864805A (en) * 1996-12-20 1999-01-26 International Business Machines Corporation Method and apparatus for error correction in a continuous dictation system

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4783803A (en) * 1985-11-12 1988-11-08 Dragon Systems, Inc. Speech recognition apparatus and method
US5231670A (en) * 1987-06-01 1993-07-27 Kurzweil Applied Intelligence, Inc. Voice controlled system and method for generating text from a voice controlled input
US5010495A (en) * 1989-02-02 1991-04-23 American Language Academy Interactive language learning system
US5679001A (en) * 1992-11-04 1997-10-21 The Secretary Of State For Defence In Her Britannic Majesty's Government Of The United Kingdom Of Great Britain And Northern Ireland Children's speech training aid
US5487671A (en) * 1993-01-21 1996-01-30 Dsp Solutions (International) Computerized system for teaching speech
US5787231A (en) * 1995-02-02 1998-07-28 International Business Machines Corporation Method and system for improving pronunciation in a voice control system
US5717828A (en) * 1995-03-15 1998-02-10 Syracuse Language Systems Speech recognition apparatus and method for learning
US5864805A (en) * 1996-12-20 1999-01-26 International Business Machines Corporation Method and apparatus for error correction in a continuous dictation system
GB2323693A (en) * 1997-03-27 1998-09-30 Forum Technology Limited Speech to text conversion

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6865533B2 (en) 2000-04-21 2005-03-08 Lessac Technology Inc. Text to speech
US6963841B2 (en) 2000-04-21 2005-11-08 Lessac Technology, Inc. Speech training method with alternative proper pronunciation database
US7280964B2 (en) 2000-04-21 2007-10-09 Lessac Technologies, Inc. Method of recognizing spoken language with recognition of language color
US6847931B2 (en) 2002-01-29 2005-01-25 Lessac Technology, Inc. Expressive parsing in computerized conversion of text to speech
EP1606793A1 (en) * 2002-12-31 2005-12-21 Lessac Technology, Inc. Speech recognition method
EP1606793A4 (en) * 2002-12-31 2007-05-16 Lessac Technology Inc Speech recognition method
EP2337006A1 (en) * 2009-11-24 2011-06-22 Kai Yu Speech processing and learning
US11735169B2 (en) 2020-03-20 2023-08-22 International Business Machines Corporation Speech recognition and training for data inputs
CN113284380A (en) * 2021-05-26 2021-08-20 秦皇岛职业技术学院 Oral english practice trainer based on artificial intelligence
CN113284380B (en) * 2021-05-26 2022-03-25 秦皇岛职业技术学院 Oral english practice trainer based on artificial intelligence

Also Published As

Publication number Publication date
AU2001255560A1 (en) 2001-11-07

Similar Documents

Publication Publication Date Title
US7280964B2 (en) Method of recognizing spoken language with recognition of language color
US6963841B2 (en) Speech training method with alternative proper pronunciation database
Gerosa et al. A review of ASR technologies for children's speech
US5717828A (en) Speech recognition apparatus and method for learning
Kumar et al. Improving literacy in developing countries using speech recognition-supported games on mobile devices
US6560574B2 (en) Speech recognition enrollment for non-readers and displayless devices
Bernstein et al. Automatic evaluation and training in English pronunciation.
US6397185B1 (en) Language independent suprasegmental pronunciation tutoring system and methods
USRE37684E1 (en) Computerized system for teaching speech
US6134529A (en) Speech recognition apparatus and method for learning
Mak et al. PLASER: Pronunciation learning via automatic speech recognition
Arimoto et al. Naturalistic emotional speech collection paradigm with online game and its psychological and acoustical assessment
US20080027731A1 (en) Comprehensive Spoken Language Learning System
US20070003913A1 (en) Educational verbo-visualizer interface system
Ouni et al. Visual contribution to speech perception: measuring the intelligibility of animated talking heads
US20060053012A1 (en) Speech mapping system and method
Stemberger et al. Phonetic transcription for speech-language pathology in the 21st century
Vicsi et al. A multimedia, multilingual teaching and training system for children with speech disorders
WO2001082291A1 (en) Speech recognition and training methods and systems
WO1999013446A1 (en) Interactive system for teaching speech pronunciation and reading
US20050144010A1 (en) Interactive language learning method capable of speech recognition
KR101270010B1 (en) Method and the system of learning words based on speech recognition
Kim et al. Non-native speech rhythm: A large-scale study of English pronunciation by Korean learners: A large-scale study of English pronunciation by Korean learners
Stativă et al. Assessment of Pronunciation in Language Learning Applications
JP3988270B2 (en) Pronunciation display device, pronunciation display method, and program for causing computer to execute pronunciation display function

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP