|Publication number||US6044346 A|
|Application number||US 09/037,329|
|Publication date||28 Mar 2000|
|Filing date||9 Mar 1998|
|Priority date||9 Mar 1998|
|Inventors||Syed S. Ali, Stephen C. Glinski|
|Original Assignee||Lucent Technologies Inc.|
The invention relates to a system and method for interfacing a digital audio processor used in voice recognition processing to a non-volatile storage device, and more particularly to a system and method for operating a programmable digital signal processor chip, which is used to sample, process, and recognize a limited set of spoken audio commands, with a flash memory device that can accommodate a limited number of write operations.
Only a few years ago, automatic electronic voice recognition was practically unheard of, or regarded as science fiction; today it is a reality. This technology, while complex, is becoming increasingly popular even in consumer devices.
Digital voice recognition is useful for several reasons. First, it offers a user the possibility of increased productivity at work, as a voice-operated device can be used hands-free. For example, a telephone "voice mail" system that uses voice recognition techniques to receive commands from the user can be operated via a user's voice while the user is looking at other things or performing other duties. Second, operating a device by voice commands is more natural for many people than entering cryptic command codes via a keyboard or keypad, such as one on a telephone. Operating a device by voice may seem slightly unnatural at first, as it is a new technology, but most people have been found to acclimate quickly. Finally, when a device is operated with spoken commands, and the user is addressed via a synthesized voice, there is a reduced need to memorize a complex set of commands. Voice commands can be set up using natural phrases, such as "retrieve messages" or "erase," and not sequences of numeric codes and "*" and "#" symbols, as would be necessary on a traditional telephone keypad.
The increase in the popularity of voice recognition systems has been facilitated by a number of technical advances, as well. Only recently has it become possible for a relatively cost-effective consumer-oriented device to perform a satisfactory level of voice recognition.
Over the last several years, there have been order-of-magnitude increases in computer performance. It is now possible for a relatively simple special-purpose digital computer to perform the kinds of mathematical calculations and signal processing operations necessary to accomplish voice recognition in real time. In the past, satisfactory voice recognition called for substantial amounts of processing time above and beyond that required to digitally capture the speech.
There have also been extremely significant decreases in price. Powerful special-purpose digital signal processing computer chips are now available at prices that make real-time voice recognition possible in low-priced consumer articles. The cost of other digital components, particularly memory, has also decreased drastically within the last several years.
Finally, there have also been great improvements and refinements in the signal processing algorithms used to accomplish voice recognition. Much research in this area has been undertaken within the last ten to fifteen years, and the refined algorithms now preferred for voice recognition have only recently been developed.
There are numerous types of voice recognition systems in development and use today. These types can be broken down by several characteristics: the vocabulary size, speaker dependency, and continuous vs. discrete speech recognition.
Large vocabulary voice recognition systems are typically used for dictation and complex control applications. These systems still require a large amount of computing power. For example, large vocabulary recognition can only be performed on a computer system comparable to those typically used as high-end personal or office computers. Accordingly, large vocabulary recognition is still not well-suited for use in consumer products.
However, small vocabulary voice recognition systems are still useful in a variety of applications. A relatively small number of command words or phrases can be used to operate a simple device, such as a telephone or a telephone answering machine. Traditionally, these devices have typically been operated via a small control panel. Accordingly, the functions performed by entering codes on the device's control panel can also be performed upon receiving an appropriate voice command. Because only a small number of words and phrases are understood by such a system, a reduced amount of computer processing capability is necessary to perform the required mathematical operations to identify any given spoken command. Thus, low-cost special-purpose digital signal processor chips can be used in consumer goods to implement such a small vocabulary voice recognition system.
Some voice recognition systems are known as "speaker-independent," while others are considered "speaker-dependent." Speaker-independent systems include generic models of the words and phrases that are to be recognized. Such systems need not be "trained" to understand a particular speaker's voice. However, because of this, a user's unusual accents or speech patterns may result in reduced recognition accuracy. On the other hand, speaker-dependent systems require some level of training. That is, the system requires a user to recite several words, or to speak for several minutes, so that the system can adapt its internal word models to match the user's particular manner of speaking. This approach usually results in improved recognition accuracy, but the necessary training before use can be tedious or inconvenient. Moreover, if multiple users will be using a speaker-dependent system, the device must provide for the storage of multiple user voice models, and each user must train the device separately.
Two final categories of voice recognition systems are those systems capable of recognizing continuous speech and those systems only capable of recognizing discrete speech. Continuous speech recognition is most often useful for natural language dictation systems. However, as continuous speech often "runs together" into a single long string of sounds, additional computing resources must be devoted to determining where individual words and phrases begin and end. This process typically requires more processing ability than would be present in a typical low-cost consumer product.
Discrete speech recognition systems require a short pause between each word or phrase to allow the system to determine where words begin and end. However, it should be noted that it is not necessary for each word to be pronounced separately; a small number of short command phrases can be treated as discrete speech for purposes of voice recognition.
While there are advantages to large-vocabulary, speaker-independent, continuous speech recognition systems, it is observed that several compromises must be made to facilitate the use of voice recognition in low-cost consumer articles. Accordingly, it is recognized that small-vocabulary, speaker-dependent, discrete speech recognition systems and methods are still useful in a variety of applications, as discussed above. Even so, additional compromises are necessary to permit the efficiencies in manufacturing and use that would allow such systems to gain acceptance among consumers.
For example, in most speech recognition systems, large amounts of memory are used for various purposes in the recognition system. Buffers are needed to store incoming sampled voice information, as well as to store intermediate versions of processed voice information before recognition is accomplished. These buffers are constantly written and rewritten during training and recognition processing to accommodate voice input, update voice models, alter internal variables, and for other reasons. In most cases, static random-access memory ("static RAM") has traditionally been used in this application; it will be discussed in further detail below.
The traditional low-cost digital memory devices used in most digital voice storage and recognition applications have a significant disadvantage. When power is removed, the memory contents are permanently lost. For example, the least expensive type of digital memory usable for audio recording and processing is dynamic random-access memory ("dynamic RAM"). Audio grade dynamic RAM, which may be partially defective (and thus not usable in data-storage applications) is known as ARAM. When power is disconnected from ARAM, the memory contents are lost. Moreover, ARAM must be periodically "refreshed" by electrically stimulating the memory cells. For these reasons, a battery backup must be provided to preserve ARAM contents when the device is removed from its primary power source. This is inconvenient for the user and adds bulk and expense to a device that uses ARAM. Moreover, additional circuitry can be necessary to provide the necessary refresh signals to the ARAM.
Despite their disadvantages, ARAM devices are in relatively high demand because of their low price point. Accordingly, ARAM devices are sometimes in short supply, causing their price advantage to be nullified.
Static RAM is also a type of volatile digital memory. Static RAM typically provides very fast memory access, but is also power-consuming and expensive. No refresh signals are necessary, but like dynamic RAM, power must be continually supplied to the device, or memory contents will be permanently lost.
With both of the foregoing types of volatile digital memory, speaker-dependent training data and other vital system information can be lost in a power failure unless battery backup is provided. If speaker-dependent training data is lost, the system must be re-trained by each user before it can be used again. As discussed above, training can be inconvenient and tedious, and it may take at least a few minutes.
Several types of non-volatile memory, or memory that retains its contents when power is removed, are also available. EEPROM, or Electrically Erasable Programmable Read-Only Memory, is expensive in the quantities and densities necessary for audio storage and processing. So-called bubble memory is also available; it, too, is expensive, and is generally too slow for advantageous use in audio applications. Finally, flash memory is available. Traditionally, flash memory has been expensive, and very slow to erase and to write. In recent years, the time required to program flash memory has been reduced, and it is now usable for audio recording and processing systems. However, flash memory is subject to a burnout effect. After a limited number of re-writes to a portion of the storage device, that portion of the device will wear out and become unusable.
The problems inherent in using volatile digital memory can be solved by combining a quantity of non-volatile memory with the usual volatile memory. However, this solution is disadvantageous in that it increases the component count, and therefore increases manufacturing expenses. Separate volatile and non-volatile memory components would be necessary when such a solution is used.
In light of the disadvantages of the various volatile and non-volatile digital storage options for voice recognition processing, there is a recognized need for a low-cost voice recognition system that is capable of using low-cost non-volatile memory for substantially all of its storage requirements. Such a system should be able to accommodate a relatively small vocabulary of commands for the control of an electronic device, such as a telephone answering device. Such a system should also be durable and resistant to memory burnout effects.
The invention uses a low-cost programmable digital signal processor (DSP) in conjunction with a low-cost flash memory device to provide for digital voice recognition in a consumer device, such as a telephone answering machine. Because flash memory is used, the data stored therein is nonvolatile, and no battery backup or refresh circuitry is required to prevent loss of speaker-dependent training data or other information when a power failure is encountered.
As flash memory is known to exhibit a burnout effect, in which a portion of the device that has been written to a large number of times (typically in excess of 100,000 write operations) eventually becomes inoperative, the invention employs a buffer manipulation scheme to reduce the number of write operations performed on any single portion of the flash memory device, thereby reducing the incidence of the flash memory burnout effect to a level acceptable for a consumer device.
In this scheme, two digital voice buffers are allocated from "scratchpad" memory within the DSP chip. While data from one buffer is being processed for training or recognition purposes, the other buffer continues to receive sampled audio data. The buffers are allowed to completely fill before being processed as necessary by the invention, and a certain amount of data is accumulated before being written to a "wandering" buffer in the flash memory, thereby minimizing the number of flash memory writes and reducing the incidence of flash burnout.
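The two-buffer arrangement can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the patent's firmware: the 160-sample buffer size matches the 20 ms frames described in the detailed description below, but the function and variable names (`sample_in`, `frame_buf`, `fill_idx`) are invented for this sketch.

```c
#include <assert.h>

#define FRAME_SAMPLES 160          /* one 20 ms frame at 8 kHz */

/* Two frame buffers allocated from on-chip scratchpad RAM: while the
 * CODEC interrupt fills one, the main loop processes the other. */
static short frame_buf[2][FRAME_SAMPLES];
static int fill_idx = 0;           /* buffer currently receiving samples */
static int fill_pos = 0;           /* next free slot in that buffer      */

/* Called once per sample (e.g. from the CODEC interrupt). Returns the
 * index of a buffer that has just become full and is ready for
 * processing, or -1 if sampling is still in progress. */
int sample_in(short s)
{
    frame_buf[fill_idx][fill_pos++] = s;
    if (fill_pos == FRAME_SAMPLES) {
        int ready = fill_idx;
        fill_idx ^= 1;             /* swap: keep sampling into the other */
        fill_pos = 0;
        return ready;
    }
    return -1;
}
```

Because processing only ever touches a completely filled buffer, the results for a whole frame can be accumulated in on-chip RAM and committed to flash in a single write, rather than sample by sample.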
The device uses voice recognition algorithms and schemes that are well-known in the art. For example, voice features are encoded via an 8th order algorithm known as Durbin's recursion, and the voice models used for training and recognition are Hidden Markov Models, which have been found to be useful in voice recognition applications.
In one embodiment of the system, a DSP1605 digital signal processor available from the Microelectronics Group of Lucent Technologies, Inc. ("Lucent Microelectronics"), is programmed and used as the digital signal processor.
These and other objects, features, and advantages of the invention will become apparent from the detailed description below and the accompanying drawings in which:
FIG. 1 is a block diagram of a telephone answering device that employs the voice recognition system and method of the invention;
FIG. 2 is a block diagram of the internal structure of a digital signal processor used in the telephone answering device of FIG. 1;
FIG. 3 is a flowchart illustrating the steps performed in speaker-dependent voice training according to the invention;
FIG. 4 is a flowchart illustrating the steps performed in voice recognition according to the invention;
FIG. 5 is a block diagram showing a wandering buffer employed to reduce the incidence of flash memory burnout according to the invention; and
FIG. 6 is a flowchart illustrating how voice training and recognition buffers are moved and tracked by the invention.
The invention is described below, with reference to detailed illustrative embodiments. It will be apparent that a system according to the invention may be embodied in a wide variety of forms. Consequently, the specific structural and functional details disclosed herein are representative and do not limit the scope of the invention.
In particular, described below is a telephone answering device that includes a voice recognition capability. As is well known in the art, the Hidden Markov Model is used for voice training and recognition; it has been found to provide better accuracy than other voice recognition techniques at an acceptable computational cost. More information on Hidden Markov Models and other algorithms and techniques used in implementing the invention can be found in L.R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, N.J., 1993.
Referring initially to FIG. 1, the interrelationships among the functional components of a telephone answering device according to the invention are shown. A digital signal processor (DSP) 110 is the heart of the device. The DSP 110 is coupled to a system microcontroller 112, which directs the operation of the telephone answering system by means known in the art. For example, the microcontroller 112 communicates with a user I/O interface 113, which may include such components as a digital display and push-button switches.
Also coupled to the DSP 110 is a signal coder and decoder unit (CODEC) 114. The CODEC 114 is capable of performing analog-to-digital and digital-to-analog conversions; it acts as an interface between the DSP 110, which receives and processes data in digital form, and the analog audio signals of the outside world. Accordingly, the CODEC 114 is coupled to an audio interface 116, which includes a microphone and a speaker, and a telephone interface 118, which connects the telephone answering device to the telephone system. The CODEC 114 is used to convert analog audio signals received from the audio interface 116 and the telephone interface 118 into digital data that can be processed by the DSP 110 (a process known as "sampling"); it is also used to convert the DSP's digital data back to analog audio signals when necessary for playback.
In one embodiment of the invention, the CODEC 114 and the DSP 110 are separate integrated circuits. For example, the DSP 110 can be a chip selected from the DSP160x family of digital signal processors from Lucent Microelectronics; a preferred embodiment of the invention uses the DSP1605 chip. This chip includes 1,024 16-bit words of on-board random-access memory (RAM) and 16K of on-board program read-only memory (ROM) into which the telephone answering device functionality is programmed. In this embodiment, the CODEC 114 is a separate analog-to-digital and digital-to-analog converter, such as the T7513B CODEC, also available from Lucent Microelectronics. In an alternative embodiment of the invention, the CODEC 114 and the DSP 110 are incorporated into the same integrated circuit chip; examples of this configuration are found in the DSP165x family of devices from Lucent Microelectronics.
The DSP 110 is in communication with a flash memory unit 120, which is used for the long-term storage of data, such as speaker-dependent training data (e.g., voice models), as well as recorded audio for voice prompts and other information.
The flash memory unit 120, which in one embodiment of the invention has a size of four megabits (or 512K bytes), comprises the sole means of long-term data storage used by the telephone answering device. Accordingly, the flash memory unit 120 includes regions reserved for the storage of the outgoing message, incoming messages, and system-specific data (such as, for example, a message table that identifies the stored incoming messages by a time and date stamp and identifies where the messages are located within the flash memory). In addition, as discussed above, the flash memory can store voice data representative of a number of voice prompts (e.g. the numerals and words used to speak message time-and-date stamps to the user). Although such voice prompt data typically would be permanently programmed at the factory and never altered in normal use of a telephone answering system, it has been found that flash memory is sufficiently reliable that the remaining portions of a telephone answering device are likely to wear out or break before there is any trouble with the permanent voice prompts stored in flash memory. By programming the flash memory with the voice prompt data, no additional external ROM is necessary to store the voice prompts, thereby reducing chip count and potentially reducing production costs.
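A partition map over the 512K-byte device, of the kind described above, might be modeled as in the following sketch. Only the region names come from the text; every base offset and length here is an invented assumption for illustration, since the patent does not specify the layout.

```c
#include <assert.h>

/* Illustrative partition map for the 4 Mbit (512 KB) flash unit 120.
 * Region boundaries are assumptions for this sketch; the patent names
 * the regions but not their sizes or offsets. */
typedef struct {
    const char   *name;
    unsigned long base;      /* byte offset into the flash */
    unsigned long length;    /* region size in bytes       */
} flash_region;

static const flash_region flash_map[] = {
    { "voice prompts",     0x00000, 0x10000 },  /* factory-programmed */
    { "word models",       0x10000, 0x08000 },  /* training data      */
    { "message table",     0x18000, 0x01000 },  /* system-specific    */
    { "outgoing message",  0x19000, 0x07000 },
    { "incoming messages", 0x20000, 0x60000 },  /* remainder          */
};

/* Check that the regions tile the 512 KB device with no gap or overlap. */
int flash_map_consistent(void)
{
    unsigned long next = 0;
    for (unsigned i = 0; i < sizeof flash_map / sizeof flash_map[0]; i++) {
        if (flash_map[i].base != next) return 0;
        next += flash_map[i].length;
    }
    return next == 0x80000;          /* 512 KB total */
}
```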
If power is removed from the telephone answering device, the system-specific data can be used to reconstruct any real-time data necessary for the operation of the system.
Several internal features of the DSP 110 are important and are illustrated in FIG. 2. The DSP 110 has an internal DSP processor 210. The processor 210 is coupled to a program ROM 212, which stores the program code necessary to implement the functionality of the telephone answering device, including the training and recognition operations that will be discussed in further detail below, in conjunction with the flowcharts of FIGS. 3 and 4.
The processor 210 is also coupled to internal RAM 214. The Lucent DSP1605 device has 1,024 words of RAM 214. A portion of the RAM 214 is used for the storage of variables and other data temporarily used by the program stored in the ROM 212. A sample buffer 216 is also allocated within the RAM 214. The sample buffer 216 is used to hold the raw recorded digital sound data received from the CODEC 114 before and during processing. The DSP processor 210 operates on the data in the sample buffer 216 and accumulates its output before writing it to the flash memory unit 120. A score buffer 218, also allocated from the internal RAM 214, is used during the voice recognition operation to keep track of the recognition scores for each word in the device's vocabulary. The functions performed in these operations will be discussed in further detail below.
As discussed above, a training operation is necessary before a speaker-dependent voice recognition system can be used by any particular user. The process performed in the training operation, including the interaction between the DSP 110 and the flash memory unit 120 during the training operation, will now be considered in conjunction with the flowchart of FIG. 3.
Training is accomplished by having a user speak each word desired to be recognized at least twice. Optionally, at the beginning of each training pass, the user can be prompted to speak the desired word by programming the DSP 110 to read certain voice prompts from the flash memory unit 120. The system records each utterance, and if two sequential recordings are sufficiently similar to each other, then a model representing the average sound of the spoken word is stored for later use in the voice recognition operation.
Initially, the voice recognition system begins sampling audio continuously into a sample buffer (step 310). Preferably, the sample buffer is maintained in the on-board RAM within the DSP 110. In a preferred embodiment, sampling is performed at a rate of 8 kHz, or 8,000 samples per second. Each sample has 8-bit resolution, and is encoded in μ-law format by the CODEC 114. As is well known in the art, μ-law quantization is a logarithmic quantization scheme; 8 bits of μ-law-encoded information provide approximately the same dynamic range as 14-bit linear encoding.
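The μ-law format referred to above is the standard G.711 companding scheme. The CODEC performs the conversion in hardware; the textbook software encoder below is included purely to illustrate how 8 bits of μ-law cover roughly the range of 14-bit linear PCM.

```c
#include <assert.h>

#define MU_BIAS 0x84     /* 132: standard G.711 bias */
#define MU_CLIP 32635

/* Encode a 16-bit linear PCM sample as 8-bit G.711 mu-law: find the
 * segment (exponent) containing the biased magnitude, keep a 4-bit
 * mantissa, then invert all bits per the standard. */
unsigned char mulaw_encode(int pcm)
{
    int sign = (pcm >> 8) & 0x80;          /* high bit marks negative */
    if (sign) pcm = -pcm;
    if (pcm > MU_CLIP) pcm = MU_CLIP;
    pcm += MU_BIAS;
    int exponent = 7;
    for (int mask = 0x4000; (pcm & mask) == 0 && exponent > 0; mask >>= 1)
        exponent--;
    int mantissa = (pcm >> (exponent + 3)) & 0x0F;
    return (unsigned char)~(sign | (exponent << 4) | mantissa);
}
```

Because the code is logarithmic, quantization steps are small for quiet signals and large for loud ones, which is well matched to speech.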
Sampling is performed into two 160-sample frame buffers in the on-board RAM. Together, these two frame buffers make up the sample buffer. Accordingly, at an 8 kHz sample rate, each frame buffer holds 20 milliseconds (or 1/50 second) of audio information. Stated another way, the frame rate is 50 Hz. Each successive frame begins upon completion of the prior frame; the frames are "touching" but do not overlap. While sampling is being performed into one frame buffer, the other frame buffer is asynchronously being processed by the invention, as will be discussed in further detail below.
After one frame buffer is full of samples, processing can begin. Initially, the samples in the frame buffer are signal-processed to pre-emphasize high frequencies and to "window" the frame (step 312). In a preferred embodiment of the invention, a trapezoidal window with 20-sample rise and fall times is employed. This is done to ensure that the greatest contribution to the signal is made by the samples at the center of the frame, and not the boundary samples near the preceding and following frames. At that time, autocorrelation coefficients are calculated for the frame (step 314). As is known in the art, the autocorrelation coefficients represent a time-based frequency spectrum for the samples in the frame.
The autocorrelation coefficients are then converted to a feature vector (step 316). This is performed via an 8th order Linear Predictive Coding (LPC) technique known as Durbin's recursion, which is known in the art. Resulting from this manipulation is a set of nine values known as "cepstral" coefficients. A first-order term is converted to a log energy coefficient, which represents the energy contained in the signal. The remaining eight terms are also part of the feature vector, as are seven additional terms, which are weighted delta values from the previous frame's feature vector. Accordingly, the feature vector for a single frame comprises sixteen terms: one log energy, eight LPC or cepstral terms, and seven delta cepstral terms.
The terms of the feature vector are then normalized based on the values of preceding feature vectors (step 318). This not only compensates for variations in signal amplitude, but also for signal variations (for example, based on whether the speaker is speaking directly into the apparatus or indirectly through a telephone connection). This normalization process is well-known in the art and is used in many voice recognition techniques.
In a system according to the present invention, the entire feature vector is then stored into a feature buffer in the flash memory 120 (step 320). Because flash memory is being used, rather than static RAM, for example, it is useful to store the entire feature vector (comprising sixteen values) at one time, rather than piecemeal. This then serves to reduce the incidence of flash burnout.
After the feature vector is computed, normalized, and stored, the endpoint is calculated (step 322). The endpoint is calculated based on the value of the feature vector just calculated, as well as the feature vectors corresponding to preceding frames. The endpoint, which indicates where a particular utterance or word ends, is calculated by means known in the art. It should be noted that the endpoint calculation algorithm usually must look back to determine where an utterance ends; it usually cannot be determined by looking only at a single feature vector. Accordingly, an utterance may be determined to have ended several frames after it actually ended.
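The patent leaves the endpoint calculation to means known in the art; one common approach is an energy-based hangover scheme, sketched below. The 15-frame silence threshold and all names are assumptions for illustration, but the sketch shows why the decision necessarily lags the true end of the word by several frames.

```c
#include <assert.h>

#define SILENCE_FRAMES 15   /* ~300 ms of quiet ends an utterance (assumed) */

/* Minimal energy-based endpointer: feed one log-energy value per 20 ms
 * frame. Returns how many frames AGO the utterance ended once enough
 * consecutive quiet frames have accumulated, or -1 otherwise. */
typedef struct { int in_word; int quiet; } endpointer;

int endpoint_update(endpointer *ep, double log_energy, double threshold)
{
    if (log_energy > threshold) {      /* speech frame                  */
        ep->in_word = 1;
        ep->quiet = 0;
    } else if (ep->in_word) {          /* quiet frame inside a word     */
        if (++ep->quiet == SILENCE_FRAMES) {
            ep->in_word = 0;
            ep->quiet = 0;
            return SILENCE_FRAMES;     /* word ended this many frames ago */
        }
    }
    return -1;
}
```

The returned lag is what makes the traceback buffer (discussed later in the recognition operation) necessary: scores from several frames back must still be retrievable when the endpoint is finally declared.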
If the endpoint does not indicate that an end-of-word has been reached (step 324), then the algorithm repeats and additional frames are processed (beginning at step 312). Otherwise, the pass number is determined (step 326), and the actions to follow depend on whether this is the first, second, or third (or greater) pass through the algorithm.
If this is the first pass through the algorithm, the normalized feature vectors corresponding to the spoken word are stored to a word model in the flash memory (step 328). The algorithm is then reset (step 330), and a second pass is made through the algorithm.
On the second pass through the algorithm, the newly computed normalized feature vectors are compared to the word model stored on the first pass (step 332). If they are sufficiently similar (step 334), then the two passes are averaged (step 336) and stored in the word model (step 338). The training algorithm is then finished for that word (step 340). It should be noted that there is a separate word model stored in the flash memory unit 120 for each word in the device's vocabulary.
If the two passes are not sufficiently similar (step 334), then the newly computed feature vectors are stored in a second word model (step 342), the algorithm is reset (step 344), and a third pass is made.
On the third pass, the newly computed normalized feature vectors are compared to both vocabulary models stored on the first and second passes (step 346). If the new feature vectors are sufficiently similar to either of the two prior vocabulary models (step 348), then the new feature vectors are averaged with those from the most similar word model (step 350), and the average is stored in the word model (step 352). Training is then complete for that word (step 354). If the new feature vectors match neither prior pass, then the new feature vectors are written to replace the least similar word model (step 356), the algorithm is reset (step 358), and another pass is made.
At the completion of this training algorithm, one word model comprising an average of at least two passes through the algorithm is stored in the flash memory 120. In a preferred embodiment of the invention, feature vector variances between the two passes are also stored in the word model for later use in the recognition process. By using three or more passes, it can be ensured that at least two utterances of the same word are sufficiently similar in sound, so that sufficient meaningful statistical information for the voice model can be derived therefrom.
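The control flow of the multi-pass procedure above (FIG. 3, steps 326 through 358) can be condensed into the following sketch, in which the frame-by-frame feature-vector comparison is reduced to a single scalar distance per pass. The distance measure, the similarity threshold, and the scalar "model" are all placeholders for illustration only.

```c
#include <assert.h>
#include <math.h>

#define SIMILAR 1.0   /* similarity threshold (assumed) */

/* Multi-pass training control: store the first pass; average with a
 * sufficiently similar later pass and finish; otherwise keep two
 * candidate models and replace the least similar one each pass. */
typedef struct { double model[2]; int n_models; int done; } trainer;

void train_pass(trainer *t, double utterance)
{
    if (t->n_models == 0) {                       /* first pass: store */
        t->model[0] = utterance;
        t->n_models = 1;
    } else if (t->n_models == 1) {                /* second pass       */
        if (fabs(utterance - t->model[0]) < SIMILAR) {
            t->model[0] = 0.5 * (t->model[0] + utterance);
            t->done = 1;                          /* average, finish   */
        } else {
            t->model[1] = utterance;              /* keep both, go on  */
            t->n_models = 2;
        }
    } else {                                      /* third+ pass       */
        double d0 = fabs(utterance - t->model[0]);
        double d1 = fabs(utterance - t->model[1]);
        int close = (d0 <= d1) ? 0 : 1;
        if ((close == 0 ? d0 : d1) < SIMILAR) {
            t->model[0] = 0.5 * (t->model[close] + utterance);
            t->n_models = 1;
            t->done = 1;                          /* average, finish   */
        } else {
            t->model[close == 0 ? 1 : 0] = utterance;  /* replace least similar */
        }
    }
}
```

Note that the model is only committed once per pass, consistent with the patent's strategy of batching flash writes.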
The foregoing training operation is repeated for each word in the system's operative vocabulary. These words may comprise a number of fixed commands, such as "erase messages," "play messages," and the numerical digits for use in a telephone answering device, and may also include a number of customized utterances for use in an automatic telephone dialing device. For example, in the latter case, the user might wish to train individuals' names, so that the device will dial properly when it recognizes one of the names.
After training is completed, a voice recognition operation can be performed. The process performed in the recognition operation, including the interaction between the DSP 110 and the flash memory unit 120 during the recognition operation, will now be considered in conjunction with the flowchart of FIG. 4.
Again, the voice recognition system begins sampling continuously into a sample buffer (step 410). Sampling is again performed into two 160-sample frame buffers in the on-board RAM. While sampling is being performed into one frame buffer, the other frame buffer is asynchronously being processed for recognition by the invention, as will be discussed in further detail below.
After one frame buffer is full of samples, processing can begin. Initially, the samples in the frame buffer are signal-processed to pre-emphasize high frequencies and to "window" the frame (step 412). Then, autocorrelation coefficients are calculated for the frame (step 414). The autocorrelation coefficients are then converted to a feature vector (step 416) via the Durbin's recursion technique discussed above. The sixteen terms of the feature vector are then normalized based on the values of preceding feature vectors (step 418). In a system according to the present invention, the entire feature vector is then stored into a feature buffer in the flash memory 120 (step 420). Again, storing the entire feature vector at once reduces the incidence of flash burnout.
After a feature vector is calculated, the feature vector is then scored against all of the word models in the device's vocabulary (step 422). In a preferred embodiment, this is accomplished by the Viterbi algorithm, which is well known in the art. The result of this processing is a set of scores, one for each Hidden Markov Model state in each vocabulary word. For example, if a device is trained to recognize 25 different vocabulary words, and each word model has eight states, then there will be a total of 200 scores at the conclusion of the Viterbi scoring step. These scores are all temporarily stored in the score buffer 218 in the DSP's internal RAM 214 (step 424). The scores corresponding to the final state score for each vocabulary word (in the example, a total of 25 scores) are further stored in a "traceback buffer" in the flash memory unit 120 (step 426).
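One frame of Viterbi scoring for a single left-to-right word model might look like the sketch below, under the assumptions that each state can be entered only from itself or its left neighbor and that scores are log-probabilities; the eight-state model size follows the example above, while the parameter names and the per-state emission log-likelihoods are placeholders.

```c
#include <assert.h>
#include <math.h>

#define STATES 8   /* states per word model */

/* One Viterbi time step for a left-to-right HMM word model. score[s]
 * holds the best log-probability of ending the previous frame in state
 * s; self_lp/next_lp are self-loop and advance transition
 * log-probabilities; emit_lp is this frame's observation
 * log-likelihood per state. */
void viterbi_frame(double score[STATES],
                   const double self_lp[STATES],
                   const double next_lp[STATES],
                   const double emit_lp[STATES])
{
    double updated[STATES];
    for (int s = 0; s < STATES; s++) {
        double stay = score[s] + self_lp[s];
        double move = (s > 0) ? score[s - 1] + next_lp[s - 1] : -INFINITY;
        updated[s] = (stay > move ? stay : move) + emit_lp[s];
    }
    for (int s = 0; s < STATES; s++)   /* commit all states at once */
        score[s] = updated[s];
}
```

With 25 word models of 8 states each, calling such an update for every model yields the 200 per-frame scores described above; only each model's final-state score needs to go to the flash traceback buffer.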
After the feature vector is computed, normalized, stored, and scored, the endpoint is calculated (step 428). The endpoint is calculated based on the value of the feature vector just calculated, as well as the feature vectors corresponding to preceding frames. The endpoint, which indicates where a particular utterance or word ends, is calculated by means known in the art. A count is also generated (step 430); it corresponds to the number of frames that have been processed since the last endpoint was located. The count corresponds roughly to the length of the current utterance.
It should be noted that the endpoint calculation algorithm usually must look back to determine where an utterance ends; the end of an utterance usually cannot be determined from a single feature vector alone. Accordingly, an utterance may be determined to have ended several frames after it actually ended. For this reason, the traceback buffer described above is used to keep track of previous model-ending scores; the model-ending scores corresponding to the endpoint are checked to determine whether an utterance has been recognized. Hence, once the endpointer algorithm determines that an end-of-word has previously been reached (step 432), the scores in the traceback buffer at the point identified by the endpoint are evaluated (step 434).
If the score for one word model in the device's vocabulary exceeds a recognition threshold and also exceeds the scores for all other words in the vocabulary, the word corresponding to that model has been successfully recognized (step 436).
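The decision of step 436 can be sketched as follows; the function name and the convention that higher scores are better are illustrative assumptions.

```c
/* Pick the recognized word: the model whose traceback score both exceeds
   the recognition threshold and beats every other model. Returns the word
   index, or -1 if no word was recognized. */
static int best_word(const float *final_scores, int n, float threshold)
{
    int best = -1;
    float best_score = threshold;   /* anything at or below is rejected */
    for (int i = 0; i < n; i++) {
        if (final_scores[i] > best_score) {
            best_score = final_scores[i];
            best = i;
        }
    }
    return best;
}
```

A return of -1 corresponds to the reset of step 440 described below.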
It should be noted that the invention is capable of recognizing a sequence of words rather than only individually spoken words. Sequence recognition is accomplished by recognizing the last word in a sequence, then moving back through the counts and scores stored in the traceback buffer to recognize the prior words. In either case, once recognition is complete, the DSP 110 or the microcontroller 112 is caused to act as specified by the recognized word (step 438), which can be either a command or data input from the user. If no score exceeds the recognition threshold, then the recognition operation is reset (step 440), and recognition processing continues. Optionally, at this time, an indication can be made to the user that the prior word was not recognized.
As discussed above, the present invention employs several means to reduce the incidence of flash memory burnout. For example, during training, a frame of sampled audio data is processed completely until its corresponding feature vector has been entirely calculated; individual portions of each feature vector are not written to flash memory until that time. Moreover, word models are not written to flash memory until a complete pass has been made through the recognition algorithm. During the recognition operation, similarly, feature vectors are not written piecemeal, and only final state scores for each frame are written to the flash memory.
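The write-batching idea can be illustrated with a small C sketch: feature terms accumulate in internal RAM and reach the flash in a single bulk write per frame, rather than sixteen one-word writes. The flash_write stub and all names here are hypothetical stand-ins for the real flash driver, not the patent's implementation.

```c
#define VEC_LEN 16   /* sixteen-term feature vector */

static unsigned flash_writes;          /* counts bulk writes (stub state) */
static float flash_image[VEC_LEN];     /* stands in for the flash buffer  */

/* Stand-in for one bulk program operation on the flash feature buffer. */
static void flash_write(const float *data, unsigned count)
{
    for (unsigned i = 0; i < count; i++)
        flash_image[i] = data[i];
    flash_writes++;
}

static float vec_ram[VEC_LEN];   /* feature vector staged in internal RAM */
static unsigned vec_fill;

/* Stage one feature term; commit to flash only when the whole vector is
   ready, so each frame costs one write instead of sixteen. */
static void feature_push(float term)
{
    vec_ram[vec_fill++] = term;
    if (vec_fill == VEC_LEN) {
        flash_write(vec_ram, VEC_LEN);
        vec_fill = 0;
    }
}
```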
An additional buffer manipulation scheme is further used to reduce flash burnout effects. As discussed above, a four megabit flash memory device is typically used for a telephone answering device that can incorporate both digital voice storage and voice recognition capabilities. However, only a small fraction of the flash memory unit 120 need be committed to voice recognition. Specifically, out of 512K bytes of storage available in a four megabit device, it has been found that only approximately 64K bytes are necessary for all of the buffers, variables, and model storage necessary in a voice recognition system according to the invention. However, as the 64K region is frequently written to by both the training and recognition operations, this segment of the flash memory unit 120 will frequently be subject to burnout effects after only a relatively short period of usage.
Accordingly, a scheme has been devised to minimize the impact of voice training and recognition writes on a buffer chosen from a larger flash memory unit. This scheme is depicted schematically in FIG. 5.
In FIG. 5, two instances of a flash memory unit 120 are shown. In the first (FIG. 5a), the flash memory unit 120 maintains an operating system region 510 and a voice recognition region 512. The operating system region 510 maintains critical system data used by the telephone answering device or other apparatus, including variables which specify the location of other data structures throughout the flash memory unit 120. When a telephone answering device or other apparatus is first initialized, the voice recognition region 512, which contains all buffers, models, and other data necessary for voice recognition operations, is located in a portion of the flash memory unit 120 immediately following the operating system region 510. The remaining portion of the flash memory unit 120 can be used to store other data, such as outgoing voice prompts and incoming messages. A consistently-located pointer 514 in the operating system region 510 specifies the location of the voice recognition region.
After the system has been operated for some time, the voice recognition region is relocated by the invention. The time after which this is performed can vary from several hours to several months; the only requirements are that the relocation operation not disrupt the ordinary operation of the telephone answering device (or other apparatus), and that it occur before the number of write operations to the voice recognition region 512 has caused a burnout of that region.
Accordingly, in the second instance of the flash memory unit 120 (FIG. 5b), a second voice recognition region 516 has been allocated to replace the first voice recognition region 512. A new pointer 518 identifies the location of the second voice recognition region 516. This is performed according to the algorithm set forth in FIG. 6.
Initially, at an idle time after a specified number of write operations have been performed or after a specified elapsed time, the present voice recognition block or region is identified (step 610). Then, a new location for the voice recognition region is calculated (step 612). The region is chosen such that there is no page-overlap between the prior region and the new region. This ensures that writes to the new region will not contribute to burnout of the old region.
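The region choice of step 612 might be sketched as below. The 4 Kbyte erase-page size is an assumption (the patent does not give one); the 512K device size and 64K region size are taken from the text.

```c
#define FLASH_BYTES  (512u * 1024u)   /* four-megabit flash device       */
#define REGION_BYTES (64u * 1024u)    /* voice recognition region        */
#define PAGE_BYTES   4096u            /* assumed erase-page size         */

/* Choose the next home for the voice recognition region: advance past the
   old region so old and new share no erase page, wrapping back to just
   after the operating system area when the end of the device is reached.
   Alignment is re-imposed in case os_bytes is not page-aligned. */
static unsigned next_region(unsigned current, unsigned os_bytes)
{
    unsigned next = current + REGION_BYTES;
    if (next + REGION_BYTES > FLASH_BYTES)
        next = os_bytes;                /* wrap back past the OS region   */
    return next - (next % PAGE_BYTES);  /* keep erase-page alignment      */
}
```

Because the step equals the full region size, no erase page can be shared between the old and new regions, which is the non-overlap condition named above.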
Data in the old location is swapped (step 614) with data in the new location. It should be noted that the new location need not be free or empty; it may already contain other data not used in the voice recognition operation. Finally, pointers in the operating system region or block are updated (step 616) to reflect the locations of the new voice recognition region, as well as the new location for the data that was previously occupying the new voice recognition region.
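The swap of step 614 and the pointer update of step 616 might look like the following C sketch, where plain array assignment stands in for the flash driver's erase/program cycle and all names are illustrative.

```c
#define REGION_WORDS 16384u   /* 64K-byte region as 32-bit words */

/* Step 614: exchange the old and new region contents word-by-word through
   a small temporary, preserving whatever data already occupied the new
   location. Step 616: update the operating-system pointer so it names the
   new voice recognition region. */
static void relocate(unsigned *flash, unsigned old_off, unsigned new_off,
                     unsigned *vr_pointer)
{
    for (unsigned i = 0; i < REGION_WORDS; i++) {
        unsigned tmp = flash[old_off + i];
        flash[old_off + i] = flash[new_off + i];
        flash[new_off + i] = tmp;
    }
    *vr_pointer = new_off;   /* pointer 518 now names the new region */
}
```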
By employing this "wandering buffer" technique, it has been found that a flash memory-based telephone answering device or other consumer device featuring voice recognition capability can have a lifespan of ten or more years, which is considered to be an acceptable lifespan for a relatively inexpensive consumer device.
The other operations performed by a telephone answering device according to the invention are traditional in nature. See, for example, the Product Note for the Lucent LJ30 NAND FlashTAD telephone answering device subsystem, which is incorporated by reference as though set forth in full herein. The LJ30 is capable of using certain standard types of flash memory for digital voice storage, but in its standard form is not capable of the voice recognition operations discussed herein.
It should be observed that while the foregoing description of various embodiments of the present invention is set forth in some detail, the invention is not limited to those details, and a digital voice recognition device made according to the invention can differ from the disclosed embodiments in numerous ways. In particular, it will be appreciated that embodiments of the present invention may be employed in many different applications to recognize spoken commands.
Moreover, while certain particular parts from Lucent Microelectronics are disclosed as being operable in a system according to the invention, other electronic components, including custom devices, having the essential attributes discussed above also may function as described. Specifically, while various components have been described as performing certain functions in the invention, it should be noted that these functional descriptions are illustrative only, and a device performing the same functions can be implemented in a different way, by either physically combining or separating the functions described, or by adding additional functions. In an alternative embodiment of the invention, for example, the DSP 110, the microcontroller 112, and the CODEC 114 can be combined into a single chip.
Hence, the appropriate scope hereof is deemed to be in accordance with the claims as set forth below.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US5127043 *||15 May 1990||30 Jun 1992||Vcs Industries, Inc.||Simultaneous speaker-independent voice recognition and verification over a telephone network|
|US5687279 *||23 Dec 1994||11 Nov 1997||Intel Corporation||Retro-storing analog information in a digital storage circuit|
|US5774859 *||3 Jan 1995||30 Jun 1998||Scientific-Atlanta, Inc.||Information system having a speech interface|
|US5787445 *||7 Mar 1996||28 Jul 1998||Norris Communications Corporation||Operating system including improved file management for use in devices utilizing flash memory as main memory|
|US5881134 *||3 Sep 1996||9 Mar 1999||Voice Control Systems, Inc.||Intelligent call processing platform for home telephone system|
|1||*||Iwata et al. IEEE Journal of Solid-State Circuits, vol. 30, Issue 11, pp. 1157-1164, Nov. 1995.|
|2||*||Tae-Sung et al. A 3.3 V 128 Mb multi-level NAND flash memory for mass storage applications. Solid-State Circuits Conference, 1996. Digest of Technical Papers, 42nd ISSCC, 1996 IEEE International. pp. 32-33 and 412, Feb. 1996.|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US6683943||24 Jan 2001||27 Jan 2004||Richard A. Wuelly||Automated mass audience telecommunications database creation method|
|US6766295||10 May 1999||20 Jul 2004||Nuance Communications||Adaptation of a speech recognition system across multiple remote sessions with a speaker|
|US6832194 *||26 Oct 2000||14 Dec 2004||Sensory, Incorporated||Audio recognition peripheral system|
|US7006967 *||4 Feb 2000||28 Feb 2006||Custom Speech Usa, Inc.||System and method for automating transcription services|
|US7058573 *||20 Apr 1999||6 Jun 2006||Nuance Communications Inc.||Speech recognition system to selectively utilize different speech recognition techniques over multiple speech recognition passes|
|US7126467||23 Jul 2004||24 Oct 2006||Innovalarm Corporation||Enhanced fire, safety, security, and health monitoring and alarm response method, system and device|
|US7129833||23 Jul 2004||31 Oct 2006||Innovalarm Corporation||Enhanced fire, safety, security and health monitoring and alarm response method, system and device|
|US7148797||23 Jul 2004||12 Dec 2006||Innovalarm Corporation||Enhanced fire, safety, security and health monitoring and alarm response method, system and device|
|US7170404||16 Aug 2005||30 Jan 2007||Innovalarm Corporation||Acoustic alert communication system with enhanced signal to noise capabilities|
|US7173525||23 Jul 2004||6 Feb 2007||Innovalarm Corporation||Enhanced fire, safety, security and health monitoring and alarm response method, system and device|
|US7346859||31 Mar 2004||18 Mar 2008||Lenovo Singapore, Ltd.||Administration of keyboard input in a computer having a display device supporting a graphical user interface|
|US7391316||27 Jul 2006||24 Jun 2008||Innovalarm Corporation||Sound monitoring screen savers for enhanced fire, safety, security and health monitoring|
|US7401017||4 Apr 2006||15 Jul 2008||Nuance Communications||Adaptive multi-pass speech recognition system|
|US7403110||27 Jul 2006||22 Jul 2008||Innovalarm Corporation||Enhanced alarm monitoring using a sound monitoring screen saver|
|US7477142||27 Jul 2006||13 Jan 2009||Innovalarm Corporation||Residential fire, safety and security monitoring using a sound monitoring screen saver|
|US7477143||18 Sep 2006||13 Jan 2009||Innovalarm Corporation||Enhanced personal monitoring and alarm response method and system|
|US7477144||18 Sep 2006||13 Jan 2009||Innovalarm Corporation||Breathing sound monitoring and alarm response method, system and device|
|US7508307||18 Dec 2006||24 Mar 2009||Innovalarm Corporation||Home health and medical monitoring method and service|
|US7522035||18 Sep 2006||21 Apr 2009||Innovalarm Corporation||Enhanced bedside sound monitoring and alarm response method, system and device|
|US7555430||4 Apr 2006||30 Jun 2009||Nuance Communications||Selective multi-pass speech recognition system and method|
|US7656287||7 Jul 2006||2 Feb 2010||Innovalarm Corporation||Alert system with enhanced waking capabilities|
|US8370157 *||8 Jul 2010||5 Feb 2013||Honeywell International Inc.||Aircraft speech recognition and voice training data storage and retrieval methods and apparatus|
|US8447619 *||21 Sep 2010||21 May 2013||Broadcom Corporation||User attribute distribution for network/peer assisted speech coding|
|US8589166||21 Sep 2010||19 Nov 2013||Broadcom Corporation||Speech content based packet loss concealment|
|US8818817||11 Oct 2010||26 Aug 2014||Broadcom Corporation||Network/peer assisted speech coding|
|US9058818||21 Sep 2010||16 Jun 2015||Broadcom Corporation||User attribute derivation and update for network/peer assisted speech coding|
|US9245535||28 Feb 2013||26 Jan 2016||Broadcom Corporation||Network/peer assisted speech coding|
|US9514739 *||21 Dec 2012||6 Dec 2016||Cypress Semiconductor Corporation||Phoneme score accelerator|
|US20010003173 *||6 Dec 2000||7 Jun 2001||Lg Electronics Inc.||Method for increasing recognition rate in voice recognition system|
|US20020042713 *||23 Aug 2001||11 Apr 2002||Korea Axis Co., Ltd.||Toy having speech recognition function and two-way conversation for dialogue partner|
|US20030158739 *||15 Feb 2002||21 Aug 2003||Moody Peter A.||Speech navigation of voice mail systems|
|US20060017558 *||23 Jul 2004||26 Jan 2006||Albert David E||Enhanced fire, safety, security, and health monitoring and alarm response method, system and device|
|US20060017560 *||23 Jul 2004||26 Jan 2006||Albert David E||Enhanced fire, safety, security and health monitoring and alarm response method, system and device|
|US20060017579 *||16 Aug 2005||26 Jan 2006||Innovalarm Corporation||Acoustic alert communication system with enhanced signal to noise capabilities|
|US20060106613 *||28 Dec 2005||18 May 2006||Sbc Technology Resources, Inc.||Method and system for evaluating automatic speech recognition telephone services|
|US20060178879 *||4 Apr 2006||10 Aug 2006||Hy Murveit||Adaptive multi-pass speech recognition system|
|US20060184360 *||4 Apr 2006||17 Aug 2006||Hy Murveit||Adaptive multi-pass speech recognition system|
|US20060250260 *||7 Jul 2006||9 Nov 2006||Innovalarm Corporation||Alert system with enhanced waking capabilities|
|US20060261974 *||27 Jul 2006||23 Nov 2006||Innovalarm Corporation||Health monitoring using a sound monitoring screen saver|
|US20060267755 *||27 Jul 2006||30 Nov 2006||Innovalarm Corporation||Residential fire, safety and security monitoring using a sound monitoring screen saver|
|US20060279418 *||27 Jul 2006||14 Dec 2006||Innovalarm Corporation||Enhanced alarm monitoring using a sound monitoring screen saver|
|US20070008153 *||18 Sep 2006||11 Jan 2007||Innovalarm Corporation||Enhanced personal monitoring and alarm response method and system|
|US20070024451 *||18 Sep 2006||1 Feb 2007||Innovalarm Corporation||Enhanced bedside sound monitoring and alarm response method, system and device|
|US20070116212 *||7 Oct 2005||24 May 2007||Microsoft Corporation||Dynamic call announcement using recipient identification|
|US20110099009 *||11 Oct 2010||28 Apr 2011||Broadcom Corporation||Network/peer assisted speech coding|
|US20110099014 *||21 Sep 2010||28 Apr 2011||Broadcom Corporation||Speech content based packet loss concealment|
|US20110099015 *||21 Sep 2010||28 Apr 2011||Broadcom Corporation||User attribute derivation and update for network/peer assisted speech coding|
|US20110099019 *||21 Sep 2010||28 Apr 2011||Broadcom Corporation||User attribute distribution for network/peer assisted speech coding|
|US20120010887 *||8 Jul 2010||12 Jan 2012||Honeywell International Inc.||Speech recognition and voice training data storage and access methods and apparatus|
|US20120065972 *||12 Sep 2010||15 Mar 2012||Var Systems Ltd.||Wireless voice recognition control system for controlling a welder power supply by voice commands|
|US20140180694 *||21 Dec 2012||26 Jun 2014||Spansion Llc||Phoneme Score Accelerator|
|CN102646067A *||27 Feb 2012||22 Aug 2012||Shenzhen Gongjin Electronics Co., Ltd.||Test method for embedded software|
|CN102646067B *||27 Feb 2012||29 Jul 2015||Shenzhen Gongjin Electronics Co., Ltd.||Test method for embedded software|
|CN103605606A *||1 Dec 2013||26 Feb 2014||Beihang University||Embedded software test case batch execution method capable of being automatically convertible|
|CN103605606B *||1 Dec 2013||16 Mar 2016||Beihang University||Embedded software test case batch execution method capable of being automatically convertible|
|CN104317717A *||31 Oct 2014||28 Jan 2015||Beihang University||Embedded software testing method on basis of dimension conversion|
|CN104317717B *||31 Oct 2014||15 Feb 2017||Beihang University||Embedded software testing method based on dimension conversion|
|CN104536880A *||28 Nov 2014||22 Apr 2015||Nanjing University||GUI program test case augmentation method based on symbolic execution|
|CN104536880B *||28 Nov 2014||15 Sep 2017||Nanjing University||GUI program test case augmentation method based on symbolic execution|
|EP1302929A1 *||16 Oct 2001||16 Apr 2003||Siemens Aktiengesellschaft||Method for automatically implementing a speech recognizer, and speech recognizer|
|EP1352389B1 *||10 Jan 2002||3 Sep 2008||QUALCOMM Incorporated||System and method for storage of speech recognition models|
|U.S. Classification||704/270, 704/231, 704/272, 704/E15.048, 704/275|
|International Classification||G11C16/02, G10L15/00, G10L15/28|
|8 Sep 1998||AS||Assignment|
Owner name: LUCENT TECHNOLOGIES INC., NEW JERSEY
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ALI, SYED S.;GLINSKI, STEPHEN C.;REEL/FRAME:009442/0285;SIGNING DATES FROM 19980803 TO 19980817
|20 Mar 2001||CC||Certificate of correction|
|25 Sep 2003||FPAY||Fee payment|
Year of fee payment: 4
|15 Nov 2004||AS||Assignment|
Owner name: AGERE SYSTEMS INC., PENNSYLVANIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:LUCENT TECHNOLOGIES INC.;REEL/FRAME:015980/0344
Effective date: 20010130
|20 Sep 2007||FPAY||Fee payment|
Year of fee payment: 8
|16 Sep 2011||FPAY||Fee payment|
Year of fee payment: 12
|8 May 2014||AS||Assignment|
Owner name: DEUTSCHE BANK AG NEW YORK BRANCH, AS COLLATERAL AG
Free format text: PATENT SECURITY AGREEMENT;ASSIGNORS:LSI CORPORATION;AGERE SYSTEMS LLC;REEL/FRAME:032856/0031
Effective date: 20140506
|3 Apr 2015||AS||Assignment|
Owner name: AVAGO TECHNOLOGIES GENERAL IP (SINGAPORE) PTE. LTD
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:AGERE SYSTEMS LLC;REEL/FRAME:035365/0634
Effective date: 20140804
|2 Feb 2016||AS||Assignment|
Owner name: LSI CORPORATION, CALIFORNIA
Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS (RELEASES RF 032856-0031);ASSIGNOR:DEUTSCHE BANK AG NEW YORK BRANCH, AS COLLATERAL AGENT;REEL/FRAME:037684/0039
Effective date: 20160201
Owner name: AGERE SYSTEMS LLC, PENNSYLVANIA
Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENT RIGHTS (RELEASES RF 032856-0031);ASSIGNOR:DEUTSCHE BANK AG NEW YORK BRANCH, AS COLLATERAL AGENT;REEL/FRAME:037684/0039
Effective date: 20160201
|11 Feb 2016||AS||Assignment|
Owner name: BANK OF AMERICA, N.A., AS COLLATERAL AGENT, NORTH
Free format text: PATENT SECURITY AGREEMENT;ASSIGNOR:AVAGO TECHNOLOGIES GENERAL IP (SINGAPORE) PTE. LTD.;REEL/FRAME:037808/0001
Effective date: 20160201
|3 Feb 2017||AS||Assignment|
Owner name: AVAGO TECHNOLOGIES GENERAL IP (SINGAPORE) PTE. LTD
Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS;ASSIGNOR:BANK OF AMERICA, N.A., AS COLLATERAL AGENT;REEL/FRAME:041710/0001
Effective date: 20170119