|Publication number||US3828132 A|
|Publication date||6 Aug 1974|
|Filing date||30 Oct 1970|
|Priority date||30 Oct 1970|
|Also published as||CA941968A, CA941968A1, DE2115258A1, DE2115258B2, DE2115258C3|
|Inventors||Flanagan J, Rabiner L, Schafer R|
|Original Assignee||Bell Telephone Labor Inc|
United States Patent [19]
Flanagan et al.
[54] SPEECH SYNTHESIS BY CONCATENATION OF FORMANT ENCODED WORDS

Inventors: James Loton Flanagan, Warren; Lawrence Richard Rabiner, Berkeley Heights; Ronald William Schafer, New Providence, all of N.J.

Assignee: Bell Telephone Laboratories, Incorporated, Murray Hill, N.J.

Filed: Oct. 30, 1970

Appl. No.: 85,660
References Cited

UNITED STATES PATENTS
11/1958  David     179/15.55 R
11/1964  Gerstman  179/1 SA
5/1967   DeClerk   179/1 SA
2/1968   French    179/1 SA
10/1970  Nakata    179/1 SA
6/1971   Martin    179/1 SA

OTHER PUBLICATIONS
Rabiner, "A Model for Synthesizing Speech by Rule," IEEE Transactions on Audio and Electroacoustics, AU-17, 3/69, pp. 7-13.
J. L. Flanagan et al., "Synthetic Voices for Computers," IEEE Spectrum, pp. 22-45, October 1970.
Primary Examiner: Kathleen H. Claffy
Assistant Examiner: Jon Bradford Leaheey
Attorney, Agent, or Firm: A. E. Hirsch; G. E. Murphy

[57] ABSTRACT

Audio response units that select speech sounds, stored in analog or coded digital form, as the excitation for a speech synthesizer are widely used, for example in telephone audio announcement terminals. The speech produced by most such units is noticeably artificial and mechanical sounding.
According to this invention, human speech is analyzed in terms of formant structure and coded for storage in the unit. As the individual words are called for, a stored program assembles them into a complete utterance, taking into account the durations of the words in the context of the complete utterance, pitch variations common to the language, and transitions between voiced portions of the speech. The result is a more natural sounding synthetic utterance.
13 Claims, 8 Drawing Figures
[Drawing sheets 1-6: FIG. 1, block diagram of the speech analyzer (formant, pitch, amplitude, and fricative pole/zero analyzers), parametric description storage, word concatenating processor, and digital speech synthesizer; FIG. 2, interpolated formant contours over word overlap regions; FIG. 3, table of word durations (msec) by position in the digit sequence; FIGS. 4 and 5, concatenation examples ("We were away" and "I saw this man"); FIGS. 6A-6C, flow chart of the word concatenating processor.]
SPEECH SYNTHESIS BY CONCATENATION OF FORMANT ENCODED WORDS

This invention relates to the synthesis of limited context messages from stored data, and more particularly to processing techniques for assembling stored information into an appropriate specification for energizing a speech synthesizer.
BACKGROUND OF THE INVENTION

Audio response systems answer inquiries by assembling, from stored data, the information with which to energize a speech synthesizer. The response to the question is eventually provided in the form of complete spoken utterances.
For such a service, it is evident that the system must have a large and flexible vocabulary. The system, therefore, must store sizable quantities of speech information, and it must have the information in a form amenable to the production of a great variety of messages. Speech generated by the system should be as intelligible as natural speech. Indeed, the possibility exists that it might be made more intelligible than natural speech. It need not, however, sound like any particular human and may even be permitted to have a machine accent.
DESCRIPTION OF THE PRIOR ART

One technique for the synthesis of messages is to store individually spoken words and to select the words in accordance with the desired message output. Words pieced together in this fashion yield intelligible but highly unnatural sounding messages. One difficulty is that word waveforms cannot easily be adjusted in duration. Also, it is difficult to make smooth transitions from one word to the next. Nonetheless, such systems are relatively simple to implement and afford a relatively large vocabulary with simple storage apparatus.
To avoid some of the difficulties of word storage and to reduce the size of the store needed for a reasonable variety of message responses, individual speech sounds may be stored in the form of phoneme specifications. Such specifications can be called out of storage in accordance with word and message assembly rules and used to energize a speech synthesizer. However, speech at the acoustic level is not particularly discrete. Articulations of adjacent phonemes interact, and transient movements of the vocal tract in the production of any phoneme last much longer than the average duration of the phoneme. That is, the articulatory gestures overlap and are superimposed on one another. Hence, transient motions of the vocal tract are perceptually important. Moreover, much information about the identity of a consonant is carried, not by the spectral shape at the steady-state time of the consonant, but by its dynamic interactions with adjacent phonemes.
Speech synthesis, therefore, is strongly concerned with dynamics. A synthesizer must reproduce not only the characteristics of sounds when they most nearly represent the ideal of each phoneme, but also the dynamics of vocal-tract motion as it progresses from one phoneme to another. This fact highlights a difference between speech synthesis from word or phrase storage and synthesis from more elementary speech units. If the library of speech elements is a small number of short units, such as phonemes, the linking procedures approach the complexity of the vocal tract itself. Conversely, if the library of speech elements is a much larger number of longer segments of speech, such as words or phrases, the elements can be linked together at points in the message where information in transients is minimal.
Thus, although phoneme synthesis techniques are attractive and sometimes adequate, the intermediate steps of assembling elementary speech specifications into words and words into messages according to prescribed rules require complicated equipment and, at best, yield mechanical sounding speech.
SUMMARY OF THE INVENTION

These shortcomings are overcome in accordance with the present invention by storing representations of spoken words or phrases in terms of individual formant and other speech defining characteristics. Formants are the natural resonances of the vocal tract, and they take on different frequency values as the vocal tract changes its shape during talking. Typically, three such resonances occur in the frequency range most important to intelligibility, namely, 0 to 3 kHz. Representation of the speech wave as a set of slowly varying excitation parameters and vocal tract resonances is attractive for at least two reasons. First, it is more efficient for data storage than, for example, a pulse code modulation (PCM) representation of the speech waveform. Secondly, a formant representation permits flexibility in manipulation of the speech signal for the concatenation of words or phrases.
Thus, in accordance with the invention, individual, naturally spoken, isolated words are analyzed to produce a word library which is stored in terms of formant frequencies. In the formant representation of an utterance, formant frequencies, voice pitch, amplitude, and timing can all be manipulated independently. Thus, in synthesizing an utterance, an artificial pitch contour, i.e., the time course of the relevant parameters, can be substituted for the natural contour. A steady-state sound can be lengthened or shortened, and even the entire utterance can be speeded up or slowed down with little or no loss in intelligibility. Formants can be locally distorted, and the entire formant contour can be uniformly raised or lowered to alter voice quality.
Upon program demand, word length formant data are accessed and concatenated to form complete formant functions for the desired utterance. The formant functions are interpolated in accordance with spectral derivatives to establish contours which define smooth transitions between words. Pitch contour and word duration data are calculated according to stored rules. Following the necessary processing and interpolation, the concatenated formant functions are used to synthesize a waveform which approximates a naturally spoken message. As an added advantage, economy in storage is achieved because formant and excitation parameters change relatively slowly and can be specified by fewer binary digits per second (bits) than can, for example, the speech waveform.
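As a rough illustration of this storage economy, the sketch below compares bit rates of the two representations. All figures (8 kHz, 12-bit PCM; 12 parameters updated 100 times per second at 8 bits each) are illustrative assumptions, not values from the patent.

```python
# Illustrative storage comparison (hypothetical numbers): direct PCM storage
# of the waveform vs. a slowly varying formant/excitation parameter
# description of the same speech.

def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bits per second for a PCM waveform representation."""
    return sample_rate_hz * bits_per_sample

def formant_bit_rate(frame_rate_hz, params_per_frame, bits_per_param):
    """Bits per second for a sampled parametric (formant) description."""
    return frame_rate_hz * params_per_frame * bits_per_param

pcm = pcm_bit_rate(8000, 12)            # 96,000 bit/s
formant = formant_bit_rate(100, 12, 8)  # 9,600 bit/s: F1-F3, pitch, gains, ...

print(pcm, formant, pcm // formant)     # roughly a 10:1 storage saving
```

Under these assumed numbers the parametric description needs about a tenth of the storage, which is the kind of advantage the paragraph above describes.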
BRIEF DESCRIPTION OF THE DRAWINGS

The invention will be fully apprehended from the following detailed description of illustrative embodiments thereof taken in connection with the appended drawings in which:
FIG. 1 illustrates schematically a suitable arrangement in accordance with the invention for synthesizing message-length utterances upon command;
FIG. 2 illustrates the manner of overlapping individual word formants, in accordance with the invention, for four different combinations of words;
FIG. 3 illustrates timing data which may be used for processing formant data;
FIG. 4 illustrates the processing of voiced formant data for individual words to produce a concatenated formant structure useful for actuating a speech synthesizer;
FIG. 5 illustrates the processing of both voiced and fricative formant data for individual words to produce a concatenated formant structure useful for actuating a speech synthesizer; and
FIGS. 6A, 6B and 6C illustrate by way of a flow chart the operations employed, in accordance with the invention, for processing parametric data and for concatenating these data to produce a complete set of control signals for energizing a formant speech synthesizer.
DETAILED DESCRIPTION OF THE INVENTION

A system for synthesizing speech by the concatenation of formant encoded words, in accordance with the invention, is illustrated schematically in FIG. 1. Isolated words spoken by a human being are analyzed to estimate the parameters required for synthesis. Thus, naturally spoken, isolated words originating, for example, in system 10, which may include either studio generated or recorded words, are converted, if desired, to digital form in converter 11. The individual words, in whatever format, are supplied to speech analyzer 12, wherein individual formants, amplitudes, pitch period designations, and fricative pole and zero identifications are developed at the Nyquist rate. A suitable speech analyzer is described in detail in a copending application of Rabiner-Schafer, Ser. No. 872,050, filed Oct. 29, 1969, now U.S. Pat. No. 3,649,765, granted Mar. 14, 1972. In essence, analyzer 12 includes individual channels: analyzer 13 for identifying formant (voiced) frequencies F1, F2, F3; analyzer 14 for developing a pitch period signal P; analyzer 15 for developing buzz (AV) and hiss (AN) level control signals; and analyzer 16 for developing fricative (unvoiced) pole and zero signals, FP and FZ. These control parameter values are delivered to parametric description storage unit 17, which may take any desired form. Both analog and digital stores, which may be accessed upon command, are known in the art. When completed, storage unit 17 constitutes a word catalog which may be referenced by the word concatenation portion of the system. The parameter values maintained in catalog 17 may be revised from time to time by the addition or deletion of words.
INPUT COMMAND

An input command from word sequence input 18 initiates the necessary operations to synthesize a message composed of words from catalog 17. The exact form of input 18 depends upon the particular application of the word synthesis system. Typically, an inquiry of some form is made to the system embodied by unit 18, the necessary data for a response are formulated, and the appropriate word designations for the response, for example, in the English language, are assembled in code language and delivered to the synthesis system as the output signal of unit 18. Such response units are known to those skilled in the art and are described in various patents and publications. The output developed by such a responsive unit may thus be in the form of machine code language, phoneme or other linguistic symbols, or the like. Whatever the form of the output signal, it is delivered, in accordance with this invention, to word processing system 20, wherein required word data is assembled, processed, and delivered to speech synthesizer 26.
To synthesize a message composed of words from storage unit 17 requires the generation of timing contours, a pitch contour, and formant and amplitude contours. Processor 20, in accordance with the invention, employs separate strategies for handling the segmental features of the message, such as formant frequencies, unvoiced pole and zero frequencies, and amplitudes, and the prosodic features, such as timing and pitch. Program strategy for treating the segmental features is self-stored in the processor. The prosodic feature information needed for processing is derived in, or is supplied to, processor 20. It is this flexibility in manipulating formant-coded speech that permits the breaking of the synthesis problem into two parts.
TIMING DATA

Timing information may be derived in one of several ways. For limited vocabulary applications, such as automatic intercept services, the timing rules need be nothing more complicated than a table specifying word duration as a function of position in an input string of data and as a function of the number of phonemes per word. Timing data for a typical seven-digit number string is illustrated in the table of FIG. 3 and is normally stored in timing unit 22. For more sophisticated applications, word duration is determined from rules which take into account the syntax of the specific message to be produced, i.e., rules based on models of the English language. Such data also are stored in timing store 22. It is also possible to specify the duration of each word in the input string to be synthesized from external timing data supplied from unit 23. In this case, word duration is chosen according to some external criterion, for example, or measured from a naturally spoken version of the message to be synthesized, and is not necessarily a typical duration for that word, independent of context. Thus, external timing data may be supplied from stored data or from real time adjustments made during synthesis.
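The table-lookup timing rule can be sketched as a small dictionary keyed on word position and phoneme count. The durations below are hypothetical placeholders standing in for the FIG. 3 entries, and the function name is illustrative.

```python
# Minimal sketch of the table-lookup timing rule for a fixed-format digit
# string: duration depends on the word's position in the string and on the
# number of phonemes in the word. Values are hypothetical placeholders.

DURATION_MS = {
    # (phonemes_in_word, position_in_string) -> duration in msec
    (2, 0): 260, (2, 1): 300, (2, 2): 340, (2, 3): 380,
    (3, 0): 340, (3, 1): 370, (3, 2): 410, (3, 3): 440,
}

def word_duration(num_phonemes, position):
    """Look up the contextual duration of one word in the string."""
    return DURATION_MS[(num_phonemes, position)]

durations = [word_duration(3, p) for p in range(4)]
print(durations)  # later positions in the string are allotted more time
```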
PITCH DATA

Synthesis also requires the determination of the appropriate pitch contour, i.e., pitch period as a function of time, for the message being synthesized. Pitch information can be obtained in several ways. For example, the pitch character of the original sequence of spoken words may be measured. Alternatively, a monotone or an arbitrarily shaped contour may be used. However, in practice both of these have been found to give unacceptable, unnatural results. Accordingly, it is in accordance with this invention to use a time-normalized pitch contour, stored in unit 24, and to modify it to match the word durations as determined from the timing rules. Thus, pitch data stored in unit 24 are supplied to concatenating processor 21, wherein the contour is locally lengthened or shortened as required by the specific utterance timing as specified by the timing data. If desired, pitch variation data may be supplied from external source 25, either in the form of auxiliary stored data or as real time input data. For example, a pitch contour extracted from a naturally spoken version of the message may be used. Such data would normally be used when word durations have been obtained in a similar manner, i.e., from external timing unit 23.
Pitch and timing information obtained externally in this manner provide the most natural sounding synthesized speech. It is also possible to calculate pitch contour information by rule. Thus, there are many ways in which the prosodic information for a message can be obtained, and the choice depends strongly on the desired quality of the synthetic speech and the specific application for which it is to be used.
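One simple way to realize the time-normalized pitch contour described above is to store the contour on a normalized axis and resample it onto however many frames the timing rules assign to the message. Linear interpolation, and all names and values below, are illustrative assumptions; the patent only requires that the contour be locally lengthened or shortened.

```python
# Sketch: stretch or compress a stored, time-normalized pitch contour to
# cover a message duration fixed by the timing rules, using linear
# interpolation between stored points.

def resample_contour(contour, num_frames):
    """Linearly interpolate a stored contour onto num_frames points."""
    n = len(contour)
    out = []
    for k in range(num_frames):
        x = k * (n - 1) / (num_frames - 1)   # position on the stored axis
        i = int(x)
        frac = x - i
        if i >= n - 1:
            out.append(contour[-1])
        else:
            out.append(contour[i] * (1 - frac) + contour[i + 1] * frac)
    return out

stored = [8.0, 7.5, 7.0, 7.5, 9.0]   # pitch period, msec, on normalized time
print(resample_contour(stored, 9))   # same shape, stretched to 9 frames
```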
WORD DURATION ADJUSTMENT

Once the timing pattern for the message is established, isolated words in word catalog 17 can be withdrawn and altered to match the specified timing. Thus, formant data for a word in the catalog may be either lengthened or shortened. The formant contours for successive voiced words are smoothly connected together to form continuous transitions and continuous formant contours for the message. The choice of place in a word to alter duration is based on the dynamics of the formant contours. For each subinterval of a voiced sound, typically 10 msec in duration, a measure of the rate of change of the formant contours is computed in processor 21. This measure is called the spectral derivative. Regions of the word where the spectral derivative is small are regions where the word can be shortened or lengthened with the least effect on word intelligibility. Thus, to shorten a word by a given amount, an appropriate number of 10 msec intervals are deleted in the region of the smallest spectral derivative. To lengthen a word, the region of the lowest spectral derivative is lengthened by adding an appropriate number of 10 msec intervals. Unvoiced regions of words are never modified.
In practice, the measure of spectral derivative for the i-th interval is calculated as

SD_i = Σ_{j=1}^{3} |F_j(i+1) − F_j(i)|,   i = 1, 2, ...,

where F_j(i) is the value of the j-th formant in the i-th 10 msec interval. To determine how many 10 msec intervals must be added to (or subtracted from) the isolated word controls, an equation is used based on desired word duration, isolated word duration, and some simple contextual information concerning how the current word is concatenated with its preceding and following neighbors. By defining the symbols:

I_PM = 1 if the end of the preceding word is voiced and the beginning of the current word is also voiced; 0 otherwise
I_FM = 1 if the end of the current word is voiced and the beginning of the following word is also voiced; 0 otherwise
W_I = duration of the current word spoken in isolation
W_C = duration of the current word spoken in context (as determined from the timing rules)
W_A = number of 10 msec intervals to be added (if W_A > 0) or subtracted (if W_A < 0),

then

W_A = W_C − W_I + 5 × (I_PM + I_FM).

The reason for the last term in the above equation is that whenever either I_PM = 1 or I_FM = 1, the two words must be smoothly merged together, and will overlap each other by 100 msec. However, this 100 msec region is shared by the two words; hence 50 msec (5 intervals) are allotted to each word separately in terms of the overall timing. The technique by which the W_A additional 10 msec intervals are inserted, or removed, is based entirely on the spectral derivative measurement. As noted above, for each 10 msec voiced interval of the isolated word, the spectral derivative is calculated. To shorten a word, the W_A intervals having the smallest spectral derivatives are removed. To lengthen a word, the region of the word having the smallest spectral derivative is located and W_A intervals are inserted at the middle of this region. Each of the W_A intervals is given the control parameters of the center of the region, i.e., a steady-state region of W_A intervals is added.
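The duration-adjustment rule described in the text can be sketched as follows. The function names and sample formant data are illustrative; the real implementation is the Fortran IV subroutine CRDELL of Appendix A.

```python
# Sketch of the duration-adjustment rule: compute a spectral derivative per
# 10 msec interval, compute the number of intervals to add or remove, then
# delete the flattest intervals (to shorten) or repeat a steady-state frame
# in the flattest region (to lengthen). All names are illustrative.

def spectral_derivative(formants):
    """formants[i] = (F1, F2, F3) of interval i; returns SD per interval."""
    return [sum(abs(b - a) for a, b in zip(formants[i], formants[i + 1]))
            for i in range(len(formants) - 1)]

def intervals_to_add(wc, wi, merged_before, merged_after):
    """WA = WC - WI + 5*(IPM + IFM), all counted in 10 msec intervals."""
    return wc - wi + 5 * (int(merged_before) + int(merged_after))

def adjust_duration(frames, sd, wa):
    """Shorten by dropping the smallest-SD intervals, or lengthen by
    inserting steady-state copies in the flattest region."""
    frames = list(frames)
    if wa < 0:                                    # shorten
        for i in sorted(range(len(sd)), key=lambda i: sd[i])[:-wa]:
            frames[i] = None                      # mark, then drop
        frames = [f for f in frames if f is not None]
    elif wa > 0:                                  # lengthen
        i = min(range(len(sd)), key=lambda i: sd[i])
        frames[i:i] = [frames[i]] * wa            # repeat steady-state frame
    return frames

f = [(300, 2300, 3000), (310, 2280, 3000), (500, 1500, 2500)]
sd = spectral_derivative(f)                       # flattest region is first
wa = intervals_to_add(5, 3, True, False)          # 5 - 3 + 5 = 7 intervals
print(len(adjust_duration(f, sd, wa)))            # word grows to 10 frames
```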
OVERLAP OF WORD DESCRIPTIONS

Except for the case when the end of the current word, as well as the beginning of the following word, are both voiced, the control data from word to word are simply abutted. Whenever the end of one word is voiced and the beginning of the next word is also voiced, a smooth transition is made from the formants at the end of one word to those at the beginning of the next word. This transition is made, for example, over the last 100 msec of the first word and the first 100 msec of the second. The transition rate depends on the relative rates of spectrum change of the two words over the merging region.
To perform this transition task, an interpolation function is used whose parameters depend strongly on the average spectral derivatives of the two words during the merging region. If the spectral derivative symbols are defined as:
SD1 = (1/10) Σ_{i=n}^{n+9} SD1_i,   SD2 = (1/10) Σ_{i=1}^{10} SD2_i,

where

n = starting interval of the merging region for the current word,
F_j(l) = value of formant j of the message contour at time l during the merging region, l = 0, 1, ..., 9,

then the interpolation function used is

F_j(l) = [F_j^1(n+l) · (9−l) · SD1 + F_j^2(l) · l · SD2] / [(9−l) · SD1 + l · SD2],

where F_j^k(l) = value of the j-th formant at time l for word k (k = 1 is the current word, k = 2 is the following word).
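The interpolation function described in the text can be sketched directly. The example data are illustrative: word 1's formant track is flat (small SD1), word 2's moves quickly (large SD2), so the merged contour should snap toward word 2 early in the overlap region.

```python
# Sketch of the overlap interpolation: over the 10-interval (100 msec)
# merging region, the output formant is a weighted mean of the two words'
# tracks, weighted by their average spectral derivatives SD1 and SD2.
# All names and values are illustrative.

def merge_formant(f1_tail, f2_head, sd1, sd2):
    """f1_tail[l], f2_head[l]: one formant over word 1's last and word 2's
    first 10 intervals; returns the interpolated message contour."""
    out = []
    for l in range(10):
        w1 = (9 - l) * sd1            # weight toward the current word
        w2 = l * sd2                  # weight toward the following word
        out.append((f1_tail[l] * w1 + f2_head[l] * w2) / (w1 + w2))
    return out

flat = [500.0] * 10                       # word 1 barely moves (small SD1)
moving = [700.0 + 30 * l for l in range(10)]
curve = merge_formant(flat, moving, sd1=5.0, sd2=300.0)
print([round(v) for v in curve])
```

By construction the curve starts exactly on word 1's formant (l = 0 zeroes the word 2 weight) and ends exactly on word 2's (l = 9 zeroes the word 1 weight), matching the behavior described for FIG. 2.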
FORMANT INTERPOLATION

FIG. 2 illustrates the type of interpolation performed for four simple cases in accordance with these considerations. Although all three formants of a sound are interpolated, only one formant is illustrated for each word to simplify the presentation. For the words in column 1, word 1 (the top spectrum) exhibits a very small change over its last 100 msec of voicing, whereas word 2 (middle spectrum) exhibits a large change. The interpolated curve shown at the bottom of the first column, although beginning at the formants of word 1, rapidly makes a transition and follows the formants of word 2. Column 2 shows the reverse situation: word 2 exhibits little spectrum change, whereas word 1 has a large spectrum change. The interpolated curve, therefore, follows the formants of word 1 for most of the merging or overlap region and makes the transition to the formants of word 2 at the end of the region. Columns 3 and 4 show examples in which the spectrum changes in both words are relatively the same. When they are small, as in column 3, the interpolated curve is essentially linear. When they are large, as in column 4, the interpolated curve tends to follow the formants of the first word for half of the overlap region, and the formants of the second word for the other half.
The interpolated curve thus always begins at the formants of word 1 (the current word) and terminates with the formants of word 2 (the following word). The rate at which the interpolated curve makes a transition from the formants of the first word to those of the second is determined by the average spectral derivatives SD1 and SD2. In the example of column 1, the spectral derivative of the second word is much greater than that of the first, so the transition occurs rapidly at the beginning of the overlap region. In the example of the second column, the spectral derivative of the first word is the greater, so the transition occurs rapidly at the end of the overlap region. As indicated above, the spectral derivatives for both words in the examples of columns 3 and 4 are much the same, so that no rapid transitions take place in the overlap region.
EXAMPLES OF CONCATENATION

FIGS. 4 and 5 illustrate the manner in which these rules and considerations are turned to account in the practice of the invention. FIG. 4 illustrates the manner in which three voiced words, "We," "Were," and "Away," are linked together to form the sentence "We were away." As spoken, the words have durations W1, W2, W3, as indicated, and through analysis have been determined to have formants F1, F2, and F3. These formant data are stored in storage unit 17 (FIG. 1) for the individual words, as discussed above. Upon an input command from word sequence unit 18 to assemble the three words into the sentence "We were away," the formant data are drawn from storage unit 17 and delivered to word concatenating processor 21. Timing data from storage 22 (or alternatively from external unit 23) and pitch variation data from store 24 (or alternatively from external source 25) are supplied to the processor. It is initially determined that the words "We" and "Were" are normally linked together in speech by a smooth transition and uttered as one continuous phrase, "Wewere." Hence, the two voiced words are adjusted in duration to values D1, D2 in accordance with the context of the utterance, and the formants of the words are overlapped and interpolated to provide the smooth transition. Similarly, the words "were" and "away" are normally spoken as "wereaway" with time emphasis on "away." Hence, the duration of "away" is lengthened to D3 and the formants for the two words are overlapped and interpolated.
The resulting smoothly interpolated formant specification is further modified by superimposing the pitch period contour illustrated in the figure. The resultant is a contiguous formant specification of the entire utterance. These formant data as modified, together with the pitch period contour and voiced-unvoiced character data AV and AN, are delivered to speech synthesizer 26 (FIG. 1).
FIG. 5 illustrates the concatenation of the words "I," "Saw," "This," and "Man" to form the phrase "I saw this man." In this case the words "I" and "Saw" are not overlapped because of the intervening fricative at the beginning of "Saw." However, the words "Saw" and "This" are generally spoken with a smooth transition. Hence, these words are overlapped and the formants are interpolated. Since the word "This" ends in a fricative, the words "This" and "Man" are not overlapped. In accordance with the context of the expression, the individual word lengths W are each modified to the new values D. Finally, a stored pitch period contour is superimposed according to a stored rule. The resultant specification of the phrase "I saw this man" is thus delivered, together with voiced-unvoiced character data AV and AN and fricative pole-zero data FP and FZ, to the speech synthesizer.
INTENSITY DATA

The unvoiced intensity parameter, AN, is obtained directly from the stored controls in word catalog 17 when the interval to be synthesized is unvoiced. The voiced intensity parameter, AV, is similarly obtained directly from word catalog 17, except during a merging region of two voiced intervals, in which case it is obtained by interpolation of the individual voiced intensities of the two words in a fashion similar to that described for the interpolation of formants.
CONCATENATION PROCESSOR IMPLEMENTATION

Although the operations described above for processing word formant data to form word sequence information may be carried out using any desired apparatus and techniques, one suitable arrangement used in practice relies upon the high-speed processing ability of a digital computer. In practice, general purpose digital computers, namely the Honeywell DDP-516 and the GE-635, have been found to be satisfactory. The two machines and their software systems are equally adaptable for receiving a program prepared to convert them from general purpose machines to a special purpose processor for use in the practice of the invention.
A flow chart of the programming steps employed to convert such a machine into special purpose processing apparatus which turns to account the features of the invention, is shown in FIGS. 6A, 6B, and 6C, taken together as one complete description. Each step illustrated in the flow chart is itself well known and can be reduced to a suitable program by any one skilled in the programming art. The unique subroutines employed in the word length modification operation and in the overlapping operation are set forth in Fortran IV language in Appendices A and B attached hereto.
Although any general purpose digital computer may be adapted to perform the operations required by the flow chart of FIG. 6, a unit with characteristics similar to those of the DDP-516 is preferred. The DDP-516 includes 16 k of core memory, hardware multiply and divide, direct multiplex control with 16 data channels (0.25 MHz each), and a direct memory access channel (1.0 MHz). Input is by way of a teletypewriter. A Fortran IV compiler, DAP-16 machine-language assembler, math libraries, and various utility software are standard items supplied by the manufacturer and delivered with the machine. If desired, a number of peripheral units may be interfaced with the computer for convenience. These may include auxiliary word stores, card readers, display scopes, printers, tape readers, registers, and the like. Such units are well known to those skilled in the art and are generally available on the open market. They may be interconnected with the basic computer as required by the specific application in which the processor of this invention is to be used.
PROCESSOR OPERATIONS

In the portion of the flow chart shown at the top of FIG. 6A there is indicated schematically the parametric description storage unit 17 of FIG. 1, which contains a catalog of formant, pitch, amplitude, and fricative specifications for each of the words in the catalog. Upon command from word sequence input 18, these data are transferred to word concatenating processor system 20, which is illustrated by the remainder of the flow chart.
Initially, the duration of each word in the connected sequence is determined, as indicated in block 61, for example, by examining a stored table of timing data 62, of the sort illustrated in FIG. 3 and by unit 22 in FIG. 1. If a timing change is necessary, the program statements of unit 63 determine whether the data in store 62 are sufficient or whether external timing data from unit 64 (block 23 of FIG. 1) should be used. In either event, the duration of each commanded word is established and a word sequence counter, in unit 65, is initialized by setting I = 1.
It is then necessary to modify the parametric description of the first word in accordance with timing data and other stored rules. Accordingly, it is determined whether the I-th word was merged with the (I-1)-st word. This determination is represented by block 66. If it was not, information for the I-th word is withdrawn from word catalog 17 and the first 50 msec of the I-th word is synthesized by unit 67. If the I-th word was so merged, the I-th word is lengthened or shortened to make its timing agree with the durational data supplied as above. This operation takes place in unit 68 in conjunction with subroutine CRDELL, a listing for which appears in Appendix A.
It is then ascertained whether the I-th word is to be merged with the (I+1)-st word via the steps of block 69. If there is to be a merger, the operations of block 70 are carried out to overlap the end of the I-th word with the beginning of the (I+1)-st word. This operation is carried out in conjunction with subroutine INTPL, a listing for which appears as Appendix B. If it is determined in block 69 that there is to be no merging, the operations of block 71 synthesize the last 50 msec of the I-th word using data for that word supplied from store 17.
It is then necessary in unit 72 to update the word sequencing index I and, in operation 73, to determine whether the word sequencing index is greater than the index of the last word in the input sequence. If it is not, control is returned to block 66, and the next word is composed in the fashion just described. The operations are thus iterated until the index is equal to the index of the last word in the input sequence, at which time the data from block 73 are transferred to block 74.
Pitch data is then superimposed on the formant and gain structure of each word in the utterance in the fashion described in detail above. These data are available in pitch variation data store 75 (store 24 of FIG. 1). It is next determined by the steps indicated in block 76 whether external pitch data is to be used. If it is, such data from unit 77 (unit 25 in FIG. 1) is supplied by way of data store 75 to the operations of unit 74.
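A sketch of how a stored, time-normalized pitch contour might be stretched over the assembled utterance. Uniform sampling of the stored contour and linear interpolation are assumptions for illustration; the patent specifies only that the contour is time-normalized and superimposed on the concatenated words.

```python
# Sketch of the pitch-superposition step (blocks 74-77): a stored,
# time-normalized pitch contour is stretched to the total utterance
# duration and sampled once per synthesis frame.

def apply_pitch_contour(contour, n_frames):
    """contour: pitch values sampled uniformly on normalized time [0, 1].
    Return one pitch value per synthesis frame of the utterance."""
    out = []
    for i in range(n_frames):
        t = i / (n_frames - 1) if n_frames > 1 else 0.0
        x = t * (len(contour) - 1)   # position in the stored contour
        j = int(x)
        frac = x - j
        if j >= len(contour) - 1:
            out.append(contour[-1])
        else:
            out.append(contour[j] * (1 - frac) + contour[j + 1] * frac)
    return out
```

Because the contour is normalized in time, the same stored shape serves utterances of any total duration once the word durations have been fixed.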
When the pitch contour operation has been completed, all of the data in word concatenating processor 20, as modified by the program of FIG. 6, are transferred, for example, to speech synthesizer 26 of FIG. 1.
FORMANT SYNTHESIS

When all of the control parameter contours of the commanded utterance have been generated, they may, if desired, be smoothed and band-limited to about 16 Hz. They are then used to control a formant synthesizer which produces a continuous speech output. Numerous systems, both analog and digital, have been described for synthesizing speech from formant data. One suitable synthesizer is described in J. L. Flanagan Pat. No. 3,330,910, another in David-Flanagan Pat. No. 3,190,963, FIG. 5, and another in Gerstman-Kelly Pat. No. 3,158,685. The above-cited Rabiner-Schafer application illustrates a typical formant synthesizer and relates the exact parameters described hereinabove to the input of the synthesizer described in the Flanagan patent. Very simply, a formant synthesizer includes a system for producing excitation as a train of impulses with a spacing proportional to the fundamental pitch period of the desired signal. The intensity of the pulse excitation is controlled and the signal is applied to a cascade of variable resonators.
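The excitation-plus-resonator-cascade structure just described can be sketched digitally as follows. The sample rate, formant bandwidth, and unity-DC-gain normalization are illustrative assumptions, not parameters taken from the cited synthesizers.

```python
# Minimal digital sketch of a formant synthesizer: pitch-spaced impulse
# excitation driving a cascade of second-order resonators, one per formant.
import math

def resonator(signal, freq, bw, fs):
    """Second-order digital resonator tuned to formant freq (Hz),
    with bandwidth bw (Hz) at sample rate fs."""
    r = math.exp(-math.pi * bw / fs)
    c1 = 2 * r * math.cos(2 * math.pi * freq / fs)
    c2 = -r * r
    g = 1 - c1 - c2                 # normalize to unity gain at DC
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = g * x + c1 * y1 + c2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize(formants, pitch_period, n, fs=10000.0, bw=60.0):
    """Impulse train with the given pitch period (in samples) passed
    through a cascade of resonators, one per formant frequency."""
    excitation = [1.0 if i % pitch_period == 0 else 0.0 for i in range(n)]
    sig = excitation
    for f in formants:
        sig = resonator(sig, f, bw, fs)
    return sig
```

In a practical synthesizer the formant frequencies, pitch period, and excitation intensity would all vary frame by frame under control of the parameter contours developed above.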
Suffice it to say, speech synthesizer 26 generates a waveform which approximates that required for the desired utterance. This signal is utilized in any desired fashion, for example, to energize output unit 27 which may be in the form of a loudspeaker, recording device, or the like.
      SUBROUTINE CRDELL(IST,ISP,NEL)
      COMMON JPTCH(300),JADR(8),LSTA,LNO,LTYPE,LCALL,
     1LLENG
      COMMON JAR(720,8),IA(25),IAD(25),ISEQ(25),
     1IWTIM(25),J2,J3,JRD2,JRD4,JRD8,JAV
      COMMON IFCT,JF1(500),JF2(500),JF3(500),JF4(500)
      IC=I1-IST
      IF(IC.NE.0)CALL CRDSYN(IST,IC)
      CALL CRDSYN(I1,1)
      IC=ISP-IST+1-IC
      RETURN
      CONTINUE
C     FIND SPECTRAL DERIVATIVE LEVEL SUCH THAT NEL
C     INTERVALS HAS THIS LEVEL OR SMALLER
      IDIF=500
      ITHR=0
   37 JC=0
      DO 40 I=IST,ISP
      IF(JAR(I,1).EQ.0)GOTO 40
      IF(JAR(I,8).LE.ITHR)JC=JC+1
   40 CONTINUE
      IDIF=IDIF/2
      IF(IDIF.EQ.0)GOTO 55
      IF(JC.GT.(NEL+1))GOTO 45
      IF(JC.LT.NEL)GOTO 50
      GOTO 55
   45 ITHR=ITHR-IDIF
      GOTO 37
   50 ITHR=ITHR+IDIF
      GOTO 37
   55 ICNT=0
C     ELIMINATE THE INTERVALS WITH SPECTRAL
C     DERIVATIVES LESS THAN THIS LEVEL
      DO 60 I=IST,ISP
      IF(JAR(I,1).EQ.0)GOTO 56
      IF(JAR(I,8).LE.ITHR)GOTO 57
   56 CONTINUE
      CALL CRDSYN(I,1)
      GOTO 60
   57 ICNT=ICNT+1
      IF(ICNT.GE.NEL)ITHR=-1
   60 CONTINUE
      RETURN
      END

APPENDIX B

C     SUBROUTINE TO MERGE WORDS AND INTERPOLATE
C     THEIR CONTROL SIGNALS
      SUBROUTINE INTPL(IST,JST,IW,LST,IL1,IL2,ITRANS,
     1INUM)
      COMMON JPTCH(300),JADR(8),LSTA,LNO,LTYPE,
     1LCALL,LLENG
      COMMON JAR(720,8),IA(25),IAD(25),ISEQ(25),
     1IWTIM(25),J2,J3,JRD2,JRD4,JRD8,JAV
      COMMON IFCT,JF1(500),JF2(500),JF3(500),JF4(500)
      COMMON JF5(500),JF6(500),JF7(500)
      DIMENSION JFIN(7)
      DIMENSION JSAT(7)
C     CALCULATE AVERAGE SPECTRAL DERIVATIVES OF BOTH
C     WORDS OVER THE MERGING REGION WHICH CONSISTS
C     OF IW INTERVALS
      CONTINUE
      JS1=0
      JS2=0
      DO 5 I=1,IW
      I1=IST+I-1
      I2=JST+I-1
      JS1=JS1+JAR(I1,8)
    5 JS2=JS2+JAR(I2,8)
      IND=1
      DO 30 I=1,IW
C**** GET STARTING ADDRESSES OF DATA FOR BOTH WORDS
      IL=IST+I-1
      JL=JST+I-1
      KL=LST+I-1
      LL=IND+I-1
      LM=IW+1-LL
      NORM=JS1*LM+JS2*LL
C     MERGE AND INTERPOLATE CONTROL SIGNALS OVER
C     THESE IW INTERVALS
      DO 20 J=1,7
      JL1=JAR(IL,J)*LM
      JL2=JAR(JL,J)*LL
      JAR(KL,J)=IKOL(JL1,JS1,NORM)+IKOL(JL2,JS2,NORM)
   20 CONTINUE
   30 CONTINUE
      CALL CRDSYN(LST,IW)
      RETURN
      END

What is claimed is:
1. A system for composing speech messages from sequences of prerecorded words, which comprises:
means for analyzing each word of a vocabulary of spoken words to produce a separate parametric description of each;
means for storing said parametric descriptions;
means under control of an applied command signal for sequentially withdrawing from storage those descriptions required to assemble a desired spoken message;
means for individually altering the duration of the description of each word of said message in accordance with prescribed timing rules;
means for merging consecutive word descriptions together on the basis of the respective voiced-unvoiced character of the merged word descriptions;
means for altering the pitch characteristic of said continuous message description in accordance with a prescribed contour; and
means for utilizing said continuous description to control a speech synthesizer.
2. A system for composing speech messages as defined in claim 1, wherein,
said parametric description of each word in said vocabulary comprises:
a representation of the formants, voiced and unvoiced amplitudes, and fricative pole-zero characteristics of said spoken word.
3. A system for composing speech messages as defined in claim 2, wherein,
said representations are in a coded digital format.
4. Apparatus for processing parametric descriptions of selected prerecorded spoken words to form a continuous description of a prescribed message suitable for actuating a speech synthesizer, which comprises:
means for deriving a spectral derivative function for each word description of said message;
means for individually altering the durations of selected word descriptions in accordance with stored timing information;
means operative in response to said spectral derivative functions for developing parametric descriptions of transitions between voiced word regions scheduled to be merged to form said message;
means for concatenating said altered word descriptions with said transition descriptions in accordance with said prescribed message to form a continuous parametric message description; and
means for altering the pitch characteristic of said message description in accordance with prescribed rules.
5. Apparatus for processing parametric descriptions as defined in claim 4, wherein:
said stored timing information comprises a schedule of word durations as a function of position in an input string of words, and of the number of phonemes per word.
6. Apparatus for processing parametric descriptions as defined in claim 4, wherein, said stored timing information comprises:
a schedule of word durations derived from rules based on common language usage.
7. Apparatus for processing parametric descriptions as defined in claim 4, wherein, said stored timing information comprises:
a schedule of word durations assembled from measurements of a naturally spoken version of said prescribed message.
8. Apparatus for processing parametric descriptions of selected prerecorded words, as defined in claim 4, wherein,
said parametric descriptions of transitions are developed for the last 100 msec of the first of two words to be merged and the first 100 msec of the second of said two words to be merged.
9. Apparatus as defined in claim 8, wherein,
the rate of transition between said two words is proportional to the average of said spectral derivatives for said two words.
10. Apparatus for processing parametric descriptions of selected words as defined in claim 4, wherein said means for altering the pitch characteristic of said message description comprises:
a stored, time-normalized pitch contour for a selected number of different messages; and
means for modifying said contour in accordance with said altered word description durations.
11. Apparatus for developing control signals for a speech synthesizer, which comprises:
means supplied with word-length segmental and prosodic functions of each individual word of a desired message for deriving the spectral derivatives of each of said functions;
means responsive to said spectral derivatives for interpolating said segmental functions to establish for said words as a function of message syntax.
UNITED STATES PATENT OFFICE
CERTIFICATE OF CORRECTION
Patent No. 3,828,132    Dated August 6, 1974
Inventor(s): James L. Flanagan, Lawrence R. Rabiner, Ronald W. Schafer
It is certified that error appears in the above-identified patent and that said Letters Patent are hereby corrected as shown below:
Col. 1, line 31, "should" di read -should;
line 66, "constant" should read --consonant--.
Col. 2, line 49, "parameters" should read --parameter--.
Col. 6, line 49, "spectram" should read --spectrum--;
line 61, "1 6(1)" should read --F .($Z.)--;
line 67, equation (3) should read --F z F (n +,Q,) 9-2 SDl F 1) -2,- SD2 (9mm +2-s D 2 (3)--. The bar should be only over SD1 and SD2, and the numeral (3) should be separated from the equation by spaces, as this numeral is not part of the equation but is only intended to identify it.
Col. line 13, "eontinguous" should read --contiguous--;
Col. 10, line 67, "IF(TS7)" should read --IF(TST)--.
Col. 11, line 1, "IF(N C.EQ.O|)GOTO22" should read --IF(NC.EQ.01)GOTO22--;
line 7, "Il =T+J-l" should read --I1=I+J-1--;
line 18, "I1 TLOC+NC/2" should read --I1=ILOC+NC/2--;
line 29, "IF(NEL.E0.0)GOTO30" should read --IF(NEL.EQ.0)GOTO30--;
line 22, "CALLCRDSYN(T1,1)" should read --CALLCRDSYN(I1,1)--;
Following line 23, add:
--IF(IC.NE.0)CALLCRDSYN(I1,IC)--;
line 10, "FLIMINATE" should read --ELIMINATE--;
line 55, "SUBROUTINE NTPL" should read --SUBROUTINE INTPL--;
line 62, "DIRENSIONJSAT(7)" should read --DIMENSIONJSAT(7)--.
Signed and sealed this 1st day of April 1975.
RUTH C. MASON
Attesting Officer

MARSHALL DANN
Commissioner of Patents and Trademarks
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US2860187 *||8 Dec 1955||11 Nov 1958||Bell Telephone Labor Inc||Artificial reconstruction of speech|
|US3158685 *||4 May 1961||24 Nov 1964||Bell Telephone Labor Inc||Synthesis of speech from code signals|
|US3319002 *||24 May 1963||9 May 1967||Clerk Joseph L De||Electronic formant speech synthesizer|
|US3369077 *||9 Jun 1964||13 Feb 1968||Ibm||Pitch modification of audio waveforms|
|US3532821 *||25 Nov 1968||6 Oct 1970||Hitachi Ltd||Speech synthesizer|
|US3588353 *||26 Feb 1968||28 Jun 1971||Rca Corp||Speech synthesizer utilizing timewise truncation of adjacent phonemes to provide smooth formant transition|
|1||*||J. L. Flanagan et al., Synthetic Voices for Computers, IEEE Spectrum, pp. 22-45, October 14, 1970.|
|2||*||Rabiner, A Model for Synthesizing Speech by Rule, IEEE Transactions AU-17, 3/69, pp. 7-13.|
|Citing Patent||Filing date||Publication date||Applicant||Title|
|US3982070 *||5 Jun 1974||21 Sep 1976||Bell Telephone Laboratories, Incorporated||Phase vocoder speech synthesis system|
|US3995116 *||18 Nov 1974||30 Nov 1976||Bell Telephone Laboratories, Incorporated||Emphasis controlled speech synthesizer|
|US4060848 *||22 Jan 1973||29 Nov 1977||Gilbert Peter Hyatt||Electronic calculator system having audio messages for operator interaction|
|US4075424 *||13 Dec 1976||21 Feb 1978||International Computers Limited||Speech synthesizing apparatus|
|US4092495 *||13 Dec 1976||30 May 1978||International Computers Limited||Speech synthesizing apparatus|
|US4144582 *||31 May 1977||13 Mar 1979||Hyatt Gilbert P||Voice signal processing system|
|US4163120 *||6 Apr 1978||31 Jul 1979||Bell Telephone Laboratories, Incorporated||Voice synthesizer|
|US4384170 *||29 Oct 1979||17 May 1983||Forrest S. Mozer||Method and apparatus for speech synthesizing|
|US4455551 *||20 Jul 1981||19 Jun 1984||Lemelson Jerome H||Synthetic speech communicating system and method|
|US4559602 *||27 Jan 1983||17 Dec 1985||Bates Jr John K||Signal processing and synthesizing method and apparatus|
|US5146502 *||26 Feb 1990||8 Sep 1992||Davis, Van Nortwick & Company||Speech pattern correction device for deaf and voice-impaired|
|US6366884 *||8 Nov 1999||2 Apr 2002||Apple Computer, Inc.||Method and apparatus for improved duration modeling of phonemes|
|US6405169 *||4 Jun 1999||11 Jun 2002||Nec Corporation||Speech synthesis apparatus|
|US6553344||22 Feb 2002||22 Apr 2003||Apple Computer, Inc.||Method and apparatus for improved duration modeling of phonemes|
|US6591240 *||25 Sep 1996||8 Jul 2003||Nippon Telegraph And Telephone Corporation||Speech signal modification and concatenation method by gradually changing speech parameters|
|US6708154 *||14 Nov 2002||16 Mar 2004||Microsoft Corporation||Method and apparatus for using formant models in resonance control for speech systems|
|US6785652 *||19 Dec 2002||31 Aug 2004||Apple Computer, Inc.||Method and apparatus for improved duration modeling of phonemes|
|US6915261||16 Mar 2001||5 Jul 2005||Intel Corporation||Matching a synthetic disc jockey's voice characteristics to the sound characteristics of audio programs|
|US7409347||23 Oct 2003||5 Aug 2008||Apple Inc.||Data-driven global boundary optimization|
|US7643990||23 Oct 2003||5 Jan 2010||Apple Inc.||Global boundary-centric feature extraction and associated discontinuity metrics|
|US7895041 *||27 Apr 2007||22 Feb 2011||Dickson Craig B||Text to speech interactive voice response system|
|US7930172||8 Dec 2009||19 Apr 2011||Apple Inc.||Global boundary-centric feature extraction and associated discontinuity metrics|
|US8015012||28 Jul 2008||6 Sep 2011||Apple Inc.||Data-driven global boundary optimization|
|US8175881 *||14 Aug 2008||8 May 2012||Kabushiki Kaisha Toshiba||Method and apparatus using fused formant parameters to generate synthesized speech|
|US8229086||1 Apr 2003||24 Jul 2012||Silent Communication Ltd||Apparatus, system and method for providing silently selectable audible communication|
|US8370149 *||15 Aug 2008||5 Feb 2013||Nuance Communications, Inc.||Speech synthesis system, speech synthesis program product, and speech synthesis method|
|US8494490||11 May 2010||23 Jul 2013||Silent Communicatin Ltd.||Method, circuit, system and application for providing messaging services|
|US8583418||29 Sep 2008||12 Nov 2013||Apple Inc.||Systems and methods of detecting language and natural language strings for text to speech synthesis|
|US8600743||6 Jan 2010||3 Dec 2013||Apple Inc.||Noise profile determination for voice-related feature|
|US8614431||5 Nov 2009||24 Dec 2013||Apple Inc.||Automated response to and sensing of user activity in portable devices|
|US8620662||20 Nov 2007||31 Dec 2013||Apple Inc.||Context-aware unit selection|
|US8621508||13 Sep 2004||31 Dec 2013||Xialan Chi Ltd., Llc||Encapsulated, streaming media automation and distribution system|
|US8645137||11 Jun 2007||4 Feb 2014||Apple Inc.||Fast, language-independent method for user authentication by voice|
|US8660849||21 Dec 2012||25 Feb 2014||Apple Inc.||Prioritizing selection criteria by automated assistant|
|US8670979||21 Dec 2012||11 Mar 2014||Apple Inc.||Active input elicitation by intelligent automated assistant|
|US8670985||13 Sep 2012||11 Mar 2014||Apple Inc.||Devices and methods for identifying a prompt corresponding to a voice input in a sequence of prompts|
|US8676904||2 Oct 2008||18 Mar 2014||Apple Inc.||Electronic devices with voice command and contextual data processing capabilities|
|US8677377||8 Sep 2006||18 Mar 2014||Apple Inc.||Method and apparatus for building an intelligent automated assistant|
|US8682649||12 Nov 2009||25 Mar 2014||Apple Inc.||Sentiment prediction from textual data|
|US8682667||25 Feb 2010||25 Mar 2014||Apple Inc.||User profiling for selecting user specific voice input processing information|
|US8688446||18 Nov 2011||1 Apr 2014||Apple Inc.||Providing text input using speech data and non-speech data|
|US8706472||11 Aug 2011||22 Apr 2014||Apple Inc.||Method for disambiguating multiple readings in language conversion|
|US8706503||21 Dec 2012||22 Apr 2014||Apple Inc.||Intent deduction based on previous user interactions with voice assistant|
|US8712776||29 Sep 2008||29 Apr 2014||Apple Inc.||Systems and methods for selective text to speech synthesis|
|US8713021||7 Jul 2010||29 Apr 2014||Apple Inc.||Unsupervised document clustering using latent semantic density analysis|
|US8713119||13 Sep 2012||29 Apr 2014||Apple Inc.||Electronic devices with voice command and contextual data processing capabilities|
|US8718047||28 Dec 2012||6 May 2014||Apple Inc.||Text to speech conversion of text messages from mobile communication devices|
|US8719006||27 Aug 2010||6 May 2014||Apple Inc.||Combined statistical and rule-based part-of-speech tagging for text-to-speech synthesis|
|US8719014||27 Sep 2010||6 May 2014||Apple Inc.||Electronic device with text error correction based on voice recognition data|
|US8731942||4 Mar 2013||20 May 2014||Apple Inc.||Maintaining context information between user interactions with a voice assistant|
|US8751238||15 Feb 2013||10 Jun 2014||Apple Inc.||Systems and methods for determining the language to use for speech generated by a text to speech engine|
|US8762156||28 Sep 2011||24 Jun 2014||Apple Inc.||Speech recognition repair using contextual information|
|US8762469||5 Sep 2012||24 Jun 2014||Apple Inc.||Electronic devices with voice command and contextual data processing capabilities|
|US8768702||5 Sep 2008||1 Jul 2014||Apple Inc.||Multi-tiered voice feedback in an electronic device|
|US8775442||15 May 2012||8 Jul 2014||Apple Inc.||Semantic search using a single-source semantic model|
|US8781836||22 Feb 2011||15 Jul 2014||Apple Inc.||Hearing assistance system for providing consistent human speech|
|US8788268 *||19 Nov 2012||22 Jul 2014||At&T Intellectual Property Ii, L.P.||Speech synthesis from acoustic units with default values of concatenation cost|
|US8792874||20 May 2013||29 Jul 2014||Silent Communication Ltd.||Systems, methods, circuits and associated software for augmenting contact details stored on a communication device with data relating to the contact contained on social networking sites|
|US8799000||21 Dec 2012||5 Aug 2014||Apple Inc.||Disambiguation based on active input elicitation by intelligent automated assistant|
|US8812294||21 Jun 2011||19 Aug 2014||Apple Inc.||Translating phrases from one language into another using an order-based set of declarative rules|
|US8862252||30 Jan 2009||14 Oct 2014||Apple Inc.||Audio user interface for displayless electronic device|
|US8892446||21 Dec 2012||18 Nov 2014||Apple Inc.||Service orchestration for intelligent automated assistant|
|US8898568||9 Sep 2008||25 Nov 2014||Apple Inc.||Audio user interface|
|US8903716||21 Dec 2012||2 Dec 2014||Apple Inc.||Personalized vocabulary for digital assistant|
|US8930191||4 Mar 2013||6 Jan 2015||Apple Inc.||Paraphrasing of user requests and results by automated digital assistant|
|US8935167||25 Sep 2012||13 Jan 2015||Apple Inc.||Exemplar-based latent perceptual modeling for automatic speech recognition|
|US8942986||21 Dec 2012||27 Jan 2015||Apple Inc.||Determining user intent based on ontologies of domains|
|US8977255||3 Apr 2007||10 Mar 2015||Apple Inc.||Method and system for operating a multi-function portable electronic device using voice-activation|
|US8977584||25 Jan 2011||10 Mar 2015||Newvaluexchange Global Ai Llp||Apparatuses, methods and systems for a digital conversation management platform|
|US8996376||5 Apr 2008||31 Mar 2015||Apple Inc.||Intelligent text-to-speech conversion|
|US9053089||2 Oct 2007||9 Jun 2015||Apple Inc.||Part-of-speech tagging using latent analogy|
|US9075783||22 Jul 2013||7 Jul 2015||Apple Inc.||Electronic device with text error correction based on voice recognition data|
|US9117447||21 Dec 2012||25 Aug 2015||Apple Inc.||Using event alert text as input to an automated assistant|
|US9190062||4 Mar 2014||17 Nov 2015||Apple Inc.||User profiling for voice input processing|
|US9236044||18 Jul 2014||12 Jan 2016||At&T Intellectual Property Ii, L.P.||Recording concatenation costs of most common acoustic unit sequential pairs to a concatenation cost database for speech synthesis|
|US9262612||21 Mar 2011||16 Feb 2016||Apple Inc.||Device access using voice authentication|
|US9275631 *||31 Dec 2012||1 Mar 2016||Nuance Communications, Inc.||Speech synthesis system, speech synthesis program product, and speech synthesis method|
|US9280610||15 Mar 2013||8 Mar 2016||Apple Inc.||Crowd sourcing information to fulfill user requests|
|US9300784||13 Jun 2014||29 Mar 2016||Apple Inc.||System and method for emergency calls initiated by voice command|
|US9311043||15 Feb 2013||12 Apr 2016||Apple Inc.||Adaptive audio feedback system and method|
|US9318108||10 Jan 2011||19 Apr 2016||Apple Inc.||Intelligent automated assistant|
|US9330720||2 Apr 2008||3 May 2016||Apple Inc.||Methods and apparatus for altering audio output signals|
|US9338493||26 Sep 2014||10 May 2016||Apple Inc.||Intelligent automated assistant for TV user interactions|
|US9361886||17 Oct 2013||7 Jun 2016||Apple Inc.||Providing text input using speech data and non-speech data|
|US9368114||6 Mar 2014||14 Jun 2016||Apple Inc.||Context-sensitive handling of interruptions|
|US9389729||20 Dec 2013||12 Jul 2016||Apple Inc.||Automated response to and sensing of user activity in portable devices|
|US9412392||27 Jan 2014||9 Aug 2016||Apple Inc.||Electronic devices with voice command and contextual data processing capabilities|
|US9424861||28 May 2014||23 Aug 2016||Newvaluexchange Ltd||Apparatuses, methods and systems for a digital conversation management platform|
|US9424862||2 Dec 2014||23 Aug 2016||Newvaluexchange Ltd||Apparatuses, methods and systems for a digital conversation management platform|
|US9430463||30 Sep 2014||30 Aug 2016||Apple Inc.||Exemplar-based natural language processing|
|US9431006||2 Jul 2009||30 Aug 2016||Apple Inc.||Methods and apparatuses for automatic speech recognition|
|US9431028||28 May 2014||30 Aug 2016||Newvaluexchange Ltd||Apparatuses, methods and systems for a digital conversation management platform|
|US9483461||6 Mar 2012||1 Nov 2016||Apple Inc.||Handling speech synthesis of content for multiple languages|
|US9495129||12 Mar 2013||15 Nov 2016||Apple Inc.||Device, method, and user interface for voice-activated navigation and browsing of a document|
|US9501741||26 Dec 2013||22 Nov 2016||Apple Inc.||Method and apparatus for building an intelligent automated assistant|
|US9502031||23 Sep 2014||22 Nov 2016||Apple Inc.||Method for supporting dynamic grammars in WFST-based ASR|
|US9535906||17 Jun 2015||3 Jan 2017||Apple Inc.||Mobile device having human language translation capability with positional feedback|
|US9547647||19 Nov 2012||17 Jan 2017||Apple Inc.||Voice-based media searching|
|US9548050||9 Jun 2012||17 Jan 2017||Apple Inc.||Intelligent automated assistant|
|US9565551||24 Jul 2014||7 Feb 2017||Mobile Synergy Solutions, Llc||Systems, methods, circuits and associated software for augmenting contact details stored on a communication device with data relating to the contact contained on social networking sites|
|US9576574||9 Sep 2013||21 Feb 2017||Apple Inc.||Context-sensitive handling of interruptions by intelligent digital assistant|
|US9582608||6 Jun 2014||28 Feb 2017||Apple Inc.||Unified ranking with entropy-weighted information for phrase-based semantic auto-completion|
|US9619079||11 Jul 2016||11 Apr 2017||Apple Inc.||Automated response to and sensing of user activity in portable devices|
|US9620104||6 Jun 2014||11 Apr 2017||Apple Inc.||System and method for user-specified pronunciation of words for speech synthesis and recognition|
|US9620105||29 Sep 2014||11 Apr 2017||Apple Inc.||Analyzing audio input for efficient speech and music recognition|
|US9626955||4 Apr 2016||18 Apr 2017||Apple Inc.||Intelligent text-to-speech conversion|
|US9633004||29 Sep 2014||25 Apr 2017||Apple Inc.||Better resolution when referencing to concepts|
|US9633660||13 Nov 2015||25 Apr 2017||Apple Inc.||User profiling for voice input processing|
|US9633674||5 Jun 2014||25 Apr 2017||Apple Inc.||System and method for detecting errors in interactions with a voice-based digital assistant|
|US9646609||25 Aug 2015||9 May 2017||Apple Inc.||Caching apparatus for serving phonetic pronunciations|
|US9646614||21 Dec 2015||9 May 2017||Apple Inc.||Fast, language-independent method for user authentication by voice|
|US9668024||30 Mar 2016||30 May 2017||Apple Inc.||Intelligent automated assistant for TV user interactions|
|US9668121||25 Aug 2015||30 May 2017||Apple Inc.||Social reminders|
|US9691376||8 Dec 2015||27 Jun 2017||Nuance Communications, Inc.||Concatenation cost in speech synthesis for acoustic unit sequential pair using hash table and default concatenation cost|
|US9691383||26 Dec 2013||27 Jun 2017||Apple Inc.||Multi-tiered voice feedback in an electronic device|
|US9697820||7 Dec 2015||4 Jul 2017||Apple Inc.||Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks|
|US9697822||28 Apr 2014||4 Jul 2017||Apple Inc.||System and method for updating an adaptive speech recognition model|
|US9706030||18 Jul 2012||11 Jul 2017||Mobile Synergy Solutions, Llc||System and method for telephone communication|
|US9711141||12 Dec 2014||18 Jul 2017||Apple Inc.||Disambiguating heteronyms in speech synthesis|
|US9715875||30 Sep 2014||25 Jul 2017||Apple Inc.||Reducing the need for manual start/end-pointing and trigger phrases|
|US9721563||8 Jun 2012||1 Aug 2017||Apple Inc.||Name recognition system|
|US9721566||31 Aug 2015||1 Aug 2017||Apple Inc.||Competing devices responding to voice triggers|
|US9733821||3 Mar 2014||15 Aug 2017||Apple Inc.||Voice control to diagnose inadvertent activation of accessibility features|
|US9734193||18 Sep 2014||15 Aug 2017||Apple Inc.||Determining domain salience ranking from ambiguous words in natural speech|
|US9756185 *||28 Dec 2016||5 Sep 2017||Teton1, Llc||System for automated call analysis using context specific lexicon|
|US9760559||22 May 2015||12 Sep 2017||Apple Inc.||Predictive text input|
|US9785630||28 May 2015||10 Oct 2017||Apple Inc.||Text prediction using combined word N-gram and unigram language models|
|US9798393||25 Feb 2015||24 Oct 2017||Apple Inc.||Text correction processing|
|US20020123130 *||1 Mar 2001||5 Sep 2002||Cheung Ling Y.||Methods and compositions for degrading polymeric compounds|
|US20020133349 *||16 Mar 2001||19 Sep 2002||Barile Steven E.||Matching a synthetic disc jockey's voice characteristics to the sound characteristics of audio programs|
|US20030097266 *||14 Nov 2002||22 May 2003||Alejandro Acero||Method and apparatus for using formant models in speech systems|
|US20050060759 *||13 Sep 2004||17 Mar 2005||New Horizons Telecasting, Inc.||Encapsulated, streaming media automation and distribution system|
|US20080270137 *||27 Apr 2007||30 Oct 2008||Dickson Craig B||Text to speech interactive voice response system|
|US20090048836 *||28 Jul 2008||19 Feb 2009||Bellegarda Jerome R||Data-driven global boundary optimization|
|US20090048844 *||14 Aug 2008||19 Feb 2009||Kabushiki Kaisha Toshiba||Speech synthesis method and apparatus|
|US20090070115 *||15 Aug 2008||12 Mar 2009||International Business Machines Corporation||Speech synthesis system, speech synthesis program product, and speech synthesis method|
|US20100145691 *||8 Dec 2009||10 Jun 2010||Bellegarda Jerome R||Global boundary-centric feature extraction and associated discontinuity metrics|
|US20100285778 *||11 May 2010||11 Nov 2010||Max Bluvband||Method, circuit, system and application for providing messaging services|
|US20120309363 *||30 Sep 2011||6 Dec 2012||Apple Inc.||Triggering notifications associated with tasks items that represent tasks to perform|
|US20130080176 *||19 Nov 2012||28 Mar 2013||At&T Intellectual Property Ii, L.P.||Methods and Apparatus for Rapid Acoustic Unit Selection From a Large Speech Corpus|
|US20130268275 *||31 Dec 2012||10 Oct 2013||Nuance Communications, Inc.||Speech synthesis system, speech synthesis program product, and speech synthesis method|
|DE2854601A1 *||18 Dec 1978||21 Jun 1979||Sanyo Electric Co||Ton-synthesizer und verfahren zur ton-aufbereitung|
|DE3019823A1 *||23 May 1980||11 Dec 1980||Texas Instruments Inc||Datenumsetzer und damit ausgestattete sprachsyntheseanordnung|
|EP1193615A2 *||15 Sep 2001||3 Apr 2002||Global Language Communication Systems e.K.||Electronic text translation apparatus|
|EP1193615A3 *||15 Sep 2001||13 Jul 2005||Global Language Communication Systems e.K.||Electronic text translation apparatus|
|WO1979000892A1 *||2 Apr 1979||15 Nov 1979||Western Electric Co||Voice synthesizer|
|WO2000070799A1 *||27 Apr 2000||23 Nov 2000||New Horizons Telecasting, Inc.||Streaming media automation and distribution system for multi-window television programming|
|U.S. Classification||704/268, 704/208, 704/E13.1|
|International Classification||G10L13/00, G10L13/06|