Title: Language Translation Voice Telephony
Field of the Invention
The present invention relates to telecommunications, and more particularly to the conversion of a voice message of one language to a corresponding voice or text message of a second language in a grammatically correct manner.
Background of the Invention
In the present global economy, people who speak different languages often have to communicate with each other. Such communication is often conducted either by voice or by text such as email. To communicate effectively, persons who speak different languages often have to settle on a common language. Yet because of the idiomatic and grammatical usage of a language, such communication may become confusing and be misunderstood. Moreover, the translation of one language into another is often done manually by a human translator, which incurs a time delay as well as the cost of the translator.
In a telephony environment where the conversation between two persons is held in real time, and the exchange of texts such as emails between the parties takes place in substantially real time, the use of translators would at best be awkward and time consuming, and at worst would not work.
Thus, if two parties who speak different languages are to communicate effectively without requiring the services of translators, a method of readily converting a voice stream in one language into either speech or written text in another language in substantially real time is needed.
Summary of the Invention
The translation system of the present invention provides an abstracted, or generalized, capability for effecting language translation among different human languages, both written and spoken. This method is context-centric and relies specifically on an object-oriented technique.
In particular, in a telephony environment, the incoming speech signal, either analog or digital, is first stored in a voice buffer. The speech stream that is stored in the buffer is broken down into phoneme sets. For an analog input signal, this could be done by sensing the distinct changes in the state of the carrier signal. In the case of a digital signal such as a voice packet stream, the silence symbol of the digitized voice stream can be used for separating the voice stream into the phoneme sets.
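The separation of a buffered digital voice stream at silence can be illustrated with a minimal sketch. It assumes the stream is already available as a list of integer amplitude samples. The function name split_on_silence and the parameters silence_threshold and min_silence_len are hypothetical and do not appear in this specification. The sketch only divides the buffer at runs of silence; it does not identify phonemes.

```python
from typing import List


def split_on_silence(samples: List[int],
                     silence_threshold: int = 100,
                     min_silence_len: int = 3) -> List[List[int]]:
    """Split buffered samples into segments at runs of silence.

    A sample is treated as silent when its absolute amplitude is below
    silence_threshold. A run of at least min_silence_len silent samples
    closes the current segment. Both parameters are illustrative
    assumptions, not values taken from the specification.
    """
    segments: List[List[int]] = []
    current: List[int] = []
    silent_run = 0
    for sample in samples:
        if abs(sample) < silence_threshold:
            # Silent samples are never stored in a segment.
            silent_run += 1
            if silent_run >= min_silence_len and current:
                segments.append(current)
                current = []
        else:
            # A silent gap shorter than min_silence_len does not end
            # the segment; the segment simply continues.
            silent_run = 0
            current.append(sample)
    if current:
        segments.append(current)
    return segments
```

For example, split_on_silence([500, 600, 0, 0, 0, 700, 0, 800]) yields two segments, because only the run of three silent samples is long enough to act as a separator.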
Once given the various phoneme sets, the speech is converted into texts. The texts, most likely in ASCII format, are stored in a text buffer.
The individual text words in the buffer are then tokenized so as to generate a token that is unique to each word. The tokenized words are used as keys for effecting a fast retrieval or look-up in a tokenized dictionary of a given standard reference language.
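The use of tokens as look-up keys can be sketched as follows. This is only an illustration under stated assumptions: the Tokenizer class, the functions build_tokenized_dictionary and look_up, and the choice of sequential integers as tokens are all hypothetical, and they are not the tokenizing technique of any referenced program.

```python
from typing import Dict, List, Optional


class Tokenizer:
    """Assigns each distinct word a unique integer token (assumed scheme)."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def token_for(self, word: str) -> int:
        # Words are compared case-insensitively; a new word receives
        # the next unused integer, so no two words share a token.
        key = word.lower()
        if key not in self._ids:
            self._ids[key] = len(self._ids)
        return self._ids[key]


def build_tokenized_dictionary(tokenizer: Tokenizer,
                               entries: Dict[str, str]) -> Dict[int, str]:
    """Key each dictionary entry by the token of its headword."""
    return {tokenizer.token_for(word): definition
            for word, definition in entries.items()}


def look_up(tokenizer: Tokenizer,
            dictionary: Dict[int, str],
            text: str) -> List[Optional[str]]:
    """Return the entry for each word in text, or None if it is absent.

    Words not yet seen are given a fresh token, which by construction
    has no entry in the dictionary.
    """
    return [dictionary.get(tokenizer.token_for(word))
            for word in text.split()]
```

Because each token is a hashable integer, retrieval from the tokenized dictionary is a constant-time operation on average, which is the property the fast look-up relies on in this sketch.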
A pattern matching mechanism is then used to validate the tokenized texts against the tokenized dictionary of the standard reference language. Thereafter, the tokenized texts in the text buffer are translated by means of a translation mechanism that is specific to the targeted language. This is done by applying grammatical rules using a grammar engine that parses the tokenized words in the buffer and applies the grammatical rules of the target language thereto. Once the text words have been translated into the target language, they are compressed and packetized. Thereafter, the packets could be transmitted.
In the case of a communication in a textual format, the packetized text words could be transmitted directly, for example in the form of an email or fax. On the other hand, if it is a voice communication, then the packetized words are fed to a voice synthesizer for conversion, so that the voice output is in the target language. If the tokenized texts were compressed, decompression is applied to restore them.
It is therefore an objective of the present invention to provide a method of translating in substantially real time an incoming voice message of one language into a communicative output of a different language.
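The compression, packetizing and decompression of translated text can be sketched as a round trip. The sketch assumes zlib compression and fixed-size packets, neither of which is specified here. The function names packetize and depacketize and the parameter packet_size are hypothetical. A real transport would also need sequencing and framing, which are omitted.

```python
import zlib
from typing import List


def packetize(text: str, packet_size: int = 64) -> List[bytes]:
    """Compress text and cut the result into packets of bounded size.

    zlib and the fixed packet_size are assumptions made for this
    illustration only.
    """
    compressed = zlib.compress(text.encode("utf-8"))
    return [compressed[i:i + packet_size]
            for i in range(0, len(compressed), packet_size)]


def depacketize(packets: List[bytes]) -> str:
    """Reassemble in-order packets and decompress back to text."""
    return zlib.decompress(b"".join(packets)).decode("utf-8")
```

On the receiving side, depacketize(packetize(text)) returns the original text, after which it could be delivered as text or passed to a voice synthesizer.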
It is another objective of the present invention to translate an input voice stream of one language into either an output voice stream or an output text stream of another language.
It is yet another objective of the present invention to allow persons who speak or understand different languages to communicate directly without the need for translators.
Brief Description of the Figures
The above-mentioned objectives and advantages of the present invention will become apparent and the invention itself will be best understood by reference to the following description of an embodiment of the invention taken in conjunction with the accompanying drawings, wherein:
Fig. 1 is a high level block diagram of the architecture of the system of the present invention; and
Fig. 2 is a flow chart illustrating the method in which an input voice stream is translated into a target language data stream.
Detailed Description of the Invention
With reference to Fig. 1, a general translation system (GTS) architecture of the present invention for converting an input voice stream of one language into an output voice or text stream of another language is shown.
In particular, an input voice stream 2, from an input transmission medium such as for example either a landline or wireless phone, is received by a conventional voice signal receiver 4. The voice stream is then routed to a speech to text converter 6 that converts the voice stream into text. Converter 6 could be a hardwired converter or a processor that runs any one of a number of conventional speech to text conversion programs such as for example Dragon Naturally Speaking by the Dragon System Inc. of Newton, Massachusetts that enables the conversion of spoken words into texts. Moreover, the technology for converting speech to texts is well known and two such systems are described in U.S. patent 5,031,113 assigned to the U.S. Philips Corporation and U.S. patent 5,754,978 assigned to the Speech Systems of Colorado Company. The disclosures of the '113 and '978 patents are incorporated by reference herein.
The incoming voice stream could be either an analog voice stream or a stream of digitized voice packets. In any event, the voice stream is chunked into phoneme sets for storage in a text buffer, or a series of text buffers 8. The partition of the voice stream may be done by the pauses that are inherent in the spoken words uttered by a speaker. There are a number of conventional systems available for identifying the phoneme sound types that are contained in an audio speech stream. One such system for recognizing speech and dividing the speech into phoneme sets is described in U.S. patent 5,646,490, assigned to the Fonix Corporation. The disclosure of the '490 patent is incorporated by reference herein.
The texts in the text buffer 8, which are in the language of the input voice stream, i.e., the source language, are then parsed. Each word of the text stream is then tokenized by a tokenizing algorithm so as to build a tree of tokenized words, which are tied to a standard reference language. This could be done by utilizing the NeoData program by the NeoCore Corporation of Colorado Springs. In essence, the NeoData program generates a grammar object based on the grammar of the source language, and then expresses the source grammar object in terms of the grammar object of a standard reference language. The NeoData program therefore provides a transform generator that, upon receipt of input transforms and data strings, would convert them into new transform outputs. The technology behind the NeoData program is disclosed in U.S. patent 5,942,002 assigned to the NeoCore Corporation. The disclosure of the '002 patent is incorporated by reference herein.
The source language, in addition to having a particular grammar, also has a given vocabulary. By means of the tokenizing technique as disclosed in the '002 patent, the grammar and the vocabulary of the source language are combined to generate a semantics object that is expressed in terms of the standard reference language. The standard reference language can be any language, such as for example English, that has a sufficiently rich vocabulary and grammar so as to be able to act as a standard against which other languages could be compared and from which they could be translated.
The vocabulary class is basically an object that points to a given place in a dictionary of the particular language for definition. So, too, any language has its own grammar object classes, such as for example nouns, subjects, objects, predicates, possessives, interrogatives, etc. that are common in a language such as for example the English language. In other words, the grammar object class encapsulates state models and language meaning modifiers for the language. And once the actual state models and language meaning modifiers are defined for a language such as for example the standard reference language, specific grammar classes for the language could be derived.
The standard reference language further has a semantics object that in essence provides a study of the meanings, significance and changes in the various words and phrases of the language, and the linguistic development of the meaning and relationship of the various words of the language. In essence, with the vocabulary object and the grammar object being combined to generate the semantics object, the source language, and also the standard reference language, would each have objects that comprise the vocabulary class 10, the grammar class 12 and the semantics class 14, as shown in the source language module 16 of Fig. 1.
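The relationship among the vocabulary, grammar and semantics classes of a language module can be pictured with a small object model. The sketch below is a loose illustration only. All class, field and function names are hypothetical, and the grammar class is reduced to a word-class table rather than the state models and meaning modifiers described above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Vocabulary:
    """Points each word to its dictionary definition."""
    definitions: Dict[str, str]

    def define(self, word: str) -> Optional[str]:
        return self.definitions.get(word)


@dataclass
class Grammar:
    """Assigns each word a grammatical class such as noun or predicate."""
    word_classes: Dict[str, str]

    def classify(self, word: str) -> Optional[str]:
        return self.word_classes.get(word)


@dataclass
class Semantics:
    """Combines the vocabulary and grammar objects for one language."""
    vocabulary: Vocabulary
    grammar: Grammar

    def describe(self, word: str) -> Tuple[Optional[str], Optional[str]]:
        # Returns (definition, grammatical class); either may be None.
        return (self.vocabulary.define(word), self.grammar.classify(word))


@dataclass
class LanguageModule:
    """Groups the three classes, in the manner of module 16 of Fig. 1."""
    name: str
    vocabulary: Vocabulary
    grammar: Grammar
    semantics: Semantics


def make_language_module(name: str,
                         definitions: Dict[str, str],
                         word_classes: Dict[str, str]) -> LanguageModule:
    vocabulary = Vocabulary(definitions)
    grammar = Grammar(word_classes)
    return LanguageModule(name, vocabulary, grammar,
                          Semantics(vocabulary, grammar))
```

Under these assumptions, a source language module and a standard reference language module would be two instances of the same structure, which is what makes class-by-class matching between them possible.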
Source language module 16 is further shown to be connected to a standard reference language module 18. The interconnection between source language module 16 and standard reference language module 18 allows the matching of the various classes between the source language and the standard reference language. This may be referred to as a pattern matching for looking up text, words, or phrases in the standard reference language that correspond to the transformed texts, words, or phrases in the source language. By thus patterning or correlating the source language with the standard reference language, a semantics object is created with the standard reference language that is a combination of the dictionary, grammar and semantics classes of the source language. Therefore, a meaningful translation of the voice stream in a standard reference language text could be generated.
The generation of texts from the mapping of the source language texts to the standard reference language texts could be done using the method and architecture as described in U.S. patent 5,677,835 assigned to Caterpillar Inc. In brief, the '835 patent discloses that source texts may be converted into target texts by using a constraints source language analyzer and a machine translation generator. The disclosure of the '835 patent is incorporated by reference herein.
The thus translated or derived source language text stream is stored in a source reference language mapped text store 20. Depending on the language to be targeted, a selector 22 would select from among a number of target language modules 24a-24n for mapping therewith the translated voice stream texts based on the standard reference language. Another mapping process such as that taught in the '835 patent, whereby the translated voice stream based on the standard reference language is mapped to the target language, is effected so that a text stream that now is based on the targeted language is sent to and stored in a target language text buffer 26. From there, the translated target language text stream could be output in a number of ways via a transmission output medium such as for example landline or wireless telephony.
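The role of selector 22 in choosing among target language modules 24a-24n can be sketched as a simple dispatch. In this sketch a target language module is modelled as a function from reference language words to target language words, and the word-for-word table is a stand-in for the mapping process, not the method of the '835 patent. The names Selector, TargetModule and make_word_map_module are hypothetical.

```python
from typing import Callable, Dict, List

# A target language module is modelled here as a function that maps
# reference language words to target language words (an assumption).
TargetModule = Callable[[List[str]], List[str]]


def make_word_map_module(mapping: Dict[str, str]) -> TargetModule:
    """Build a toy module that substitutes words one for one."""
    def translate(words: List[str]) -> List[str]:
        # Words without an entry pass through unchanged.
        return [mapping.get(word, word) for word in words]
    return translate


class Selector:
    """Chooses the target language module for a requested language."""

    def __init__(self, modules: Dict[str, TargetModule]) -> None:
        self._modules = dict(modules)

    def select(self, language: str) -> TargetModule:
        try:
            return self._modules[language]
        except KeyError:
            raise ValueError(
                f"no target language module for {language!r}") from None
```

With modules registered under keys such as "fr" and "es", select("fr") returns the French module, and its output would then be stored in the target language text buffer.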
For example, if it is a voice-to-voice communication between two speakers, the translated voice stream now based on the target language is output to a voice synthesizer 28 so that a voice stream based on the target language is output to the listener, who presumably is a speaker of the target language. Such translated speech may be output as voice packets in a telephony environment with insignificant lag time. Of course, the reverse process takes place in a two-way voice communication, as the listener then becomes the speaker and the same process as described above, but in the reverse order, will take place and will continue until the conversation is terminated.
In the event that the input voice stream is to be output as texts such as for example an email or fax to the receiving party, or a braille message if the receiving party happens to be blind, then the translated texts stored in the target language text buffer 26 are output as a text stream.
Note that the various modules as noted in Fig. 1 could be program applications that reside and run in a computer or processor means such as for example a Pentium based personal computer. Furthermore, the various buffers may be high speed and high capacity IC memory chips built onto a board inserted into any one of the available slots of the personal computer.
Fig. 2 provides a flow chart for illustrating the method of how the voice stream in the source language is translated to a standard reference language, and the subsequent translation of the standard reference language text stream to a target language text stream.
In particular, when a voice stream is received at the processor, it is converted into texts and chunked into segments using some natural division or pauses in the voice stream. The chunked voice text segments are then stored into a text buffer such as buffer 8. This is illustrated in step 30. Thereafter, in step 32, the stored texts are processed and mapped onto a source language object. This is done by matching each word to the vocabulary and placing each word in its grammatical context, and then matching the word with the semantics object of the language so as to place the word into its proper context as being used in the input voice stream. Having done that, the word could then be mapped with the standard reference language, as illustrated in process step 34. And with the vocabulary, grammar and semantics classes or objects having been established for each word, the now mapped standard reference language word is next related to the target language, by means of the different objects of the target language, per step 36. Once that is done and a target language object is generated, a target voice stream is created and provided to a buffer of the target language, per step 38. Thereafter, the now converted voice stream, in the form of a text stream, is output from the voice buffer per step 40. As was mentioned previously, the target language texts could be output in either a speech or a text format.
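The sequence of steps 30 through 40 can be traced in a toy form. The sketch assumes that speech recognition has already produced text, and it replaces the vocabulary, grammar and semantics matching with plain word tables, so it ignores word order and grammar entirely. The function run_pipeline and its two mapping parameters are hypothetical.

```python
from typing import Dict, List, Tuple


def run_pipeline(recognized_text: str,
                 source_to_reference: Dict[str, str],
                 reference_to_target: Dict[str, str]) -> str:
    """Trace steps 30-40 with word tables standing in for language objects."""
    # Step 30: store the chunked text segments in a text buffer.
    text_buffer: List[str] = recognized_text.lower().split()

    # Step 32: map each word onto the source language object; here this
    # only records whether the word is in the source vocabulary.
    source_words: List[Tuple[str, bool]] = [
        (word, word in source_to_reference) for word in text_buffer]

    # Step 34: map each known word to the standard reference language;
    # unknown words are carried through unchanged.
    reference_words = [source_to_reference[word] if known else word
                       for word, known in source_words]

    # Step 36: relate each reference language word to the target language.
    target_words = [reference_to_target.get(word, word)
                    for word in reference_words]

    # Step 38: place the result in the target language buffer.
    target_buffer: List[str] = list(target_words)

    # Step 40: output the buffer as a text stream.
    return " ".join(target_buffer)
```

The two tables make the role of the standard reference language visible: each language needs a mapping only to and from the reference language, rather than a separate mapping for every pair of languages.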
Inasmuch as the present invention is subject to many variations, modifications and changes in detail, it is intended that all matter described throughout this specification and shown in the accompanying drawings be interpreted as illustrative only and not in a limiting sense. For example, even though the present invention discussed so far relates to the conversion of an incoming voice message, in practice, an incoming text message in one language could be converted just as well. Such translation of an input text message may occur, for example, in the guise of a received email of one language that requires translation to another language, or for that matter, to an output voice message. In such an input text scenario, there is no need for any speech to text conversion process. Accordingly, it is intended that the invention be limited only by the spirit and scope of the hereto attached claims.