WO2003094153A1

WO2003094153A1 - Method and device for handling speech information

Info

Publication number: WO2003094153A1
Application number: PCT/EP2002/004738
Authority: WO
Inventors: Joseph Wallers
Original assignee: Digital Design Gmbh
Priority date: 2002-04-29
Filing date: 2002-04-29
Publication date: 2003-11-13
Also published as: AU2002319158A1; CN1628338A

Abstract

The invention relates to a method and a corresponding device for implementing a speech-based database and for storing and/or reproducing and/or transmitting speech information. Said method uses means for entering and/or storing and/or acoustically reproducing speech and data information and/or transmitting said information to other devices, where it is stored and/or reproduced, in addition to means for searching for one or more speech segments in the stored speech information. The aim of the invention is to develop a method and a device of this type, which overcomes the disadvantages of prior art and guarantees that speech information can be recorded, searched and reproduced without manual identification and classification and without the specification of a vocabulary. To achieve this, spoken words and/or coherent phrases (memoranda) are digitally recorded as speech signals in a memory, are spoken again in the partial form of at least one word for search purposes and are compared with the recordings in a device and evaluated. A distance value between the two speech patterns is determined and the memorandum with the lowest distance value is output acoustically.

Description

Method and device for handling speech information

The invention relates to a method and a device for handling voice information in order to realize a speech-based database with functions such as storage and/or playback and/or transmission. The method and device have means for the input and/or storage and/or acoustic playback of voice and data information and/or for its transmission to other devices for storage and/or playback there, and contain means for searching for one or more speech segments in the stored voice information.

Methods and devices with which it is possible to store voice information are already known, the stored voice information being provided with digital identification signal words in order to make it easier to find certain voice information for playback or transmission.

The published patent application DE 33 33 958 A1 describes a device for storing voice information which contains a controllable generator that produces digital identification signal words in response to keyboard input. These are recorded with or separately from the speech information and are used in later searches to find the information sought.

It is disadvantageous that the user has to classify the speech information in order to be able to start a search process for certain speech information. US Pat. No. 5,602,963 describes an electronic personal organizer which can record and play back voice memos. The organizer also has a function that enables the user, after recording a voice memo, to mark it by entering one or more spoken words for later retrieval.

This procedure has the disadvantage that, if a classification is desired, the user must explicitly classify each note after recording it. The set of words that can be searched for must be determined beforehand, and these words must be spoken beforehand in a training phase. The speech signals in the organizer are processed by different processing functions, depending on whether the speech is to be recorded or compared with a predefined vocabulary.

US 4,829,576 proposes, in order to increase the probability of correctly recognizing a word, to use for comparison only those words from the given vocabulary that are contained in the text part to be searched. For this purpose, a SEARCH WORD LIST is created in a separate step.

US Pat. No. 6,041,300 describes a method and a device according to which, in order to improve selection by voice, the speech is mapped into a sequence of "lefemes", which are compared with previously stored "lefeme sequences". The recognized word is synthesized from the sequence of "lefemes" stored in the database and output for confirmation by the user. It is particularly disadvantageous that the first step is to set up an additional (waveform) database with information about the "lefeme" representation.

The object of the invention is to develop a generic method and a generic device which avoid the disadvantages of the prior art, which provide a speech-based database, and which guarantee the recording, searching and playback of voice information without manual labeling and classification and without the specification of a vocabulary.

According to the invention, this object is achieved by the features of claims 1 and 22.

The method is characterized in that spoken words and/or coherent sentences are recorded digitally in a memory as the speech signals of a note, the speech signals being analyzed and represented directly in the time and/or frequency domain. For searching, a part of the note comprising at least one word is spoken again and is compared and evaluated in a device against the recordings; from this, a distance value between the speech patterns is determined, and notes containing one or more words with small distance values to one or more searched keywords are output acoustically.

Compared to speech recognition systems, the user has a greater tolerance for errors in the classification. The speaker dependency is not a disadvantage in the search, but rather a pleasant side effect with regard to confidentiality. The user does not explicitly assign the voice memos to certain words, and the vocabulary does not have to be specified explicitly. No training phase is required.

The device for carrying out the method is characterized in that a telecommunications terminal such as a cell phone or telephone with a memory card such as FlashROM and/or a data processing device such as a PC or a server, equipped with special additional software, is used.

The basic functions of the method and the device according to this invention can be described by two processes: recording and searching / reproducing.

When recording, notes are recorded in the form of individual spoken keywords (e.g. terms, names, numbers) or coherent sentences. Keywords that the user is particularly interested in finding can be spoken several times within a note. In a preferred embodiment, the keywords are spoken at the beginning of the note and repeated at the end of the note. The recording can take place either in the device itself, e.g. on a memory card built into a cell phone, or, by voice/data transmission, on a remote device, e.g. a server or a PC.

To search, the user speaks the searched keywords, names, etc. In the device (e.g. cell phone) or in the other, remote device (server or PC), the spoken speech patterns are compared with the stored speech information and evaluated for their similarity, i.e. a distance value between the two speech patterns is determined. The notes that contain words with the greatest similarity (smallest distance value) are then reproduced acoustically. If there are several hits, the playback can take place in the order of the recording (e.g. last recordings first) or according to the similarity between the searched speech pattern and the stored speech information. Search commands can contain one or more keywords. If there are several keywords, the search can be for notes that contain one, several or all of the keywords.

In a preferred embodiment, the notes which contain the highest number of keywords searched are reproduced first. In a further preferred embodiment, the notes are searched in reverse order of their recording: the last spoken first.
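The ordering rules in the two preceding paragraphs can be sketched as follows. This is only an illustration under stated assumptions: the `Hit` record, the `rank_hits` function, the threshold value, and the way the criteria are combined as tie-breakers (number of matched keywords, then mean distance, then recency) are introduced here and are not specified by the description.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    note_id: int        # hypothetical identifier of a stored note
    recorded_at: float  # recording time stamp; larger means newer
    distances: list     # best distance value found per searched keyword

def rank_hits(hits, threshold):
    """Order notes: most matched keywords first, then smallest mean
    distance of the matched keywords, then most recent recording."""
    def key(hit):
        matched = [d for d in hit.distances if d <= threshold]
        mean = sum(matched) / len(matched) if matched else float("inf")
        return (-len(matched), mean, -hit.recorded_at)
    return sorted(hits, key=key)

hits = [
    Hit(1, 100.0, [0.2, 0.9]),  # one keyword below the threshold
    Hit(2, 200.0, [0.3, 0.4]),  # both keywords below the threshold
    Hit(3, 300.0, [0.8, 0.7]),  # no keyword below the threshold
]
ranked = [h.note_id for h in rank_hits(hits, threshold=0.5)]
```

With these example values, the note matching both keywords is output first, followed by the note matching one keyword.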

In a preferred embodiment of the invention, the speech signals are recorded in compressed form.

A number of methods of speech compression are known in practice, e.g. Recommendations G.723 or G.729 of the ITU (International Telecommunication Union) or Recommendation GSM 06.10 of ETSI. These methods work in several stages. After preprocessing by filters, the signal is divided into segments and analyzed, e.g. using LPC (Linear Predictive Coding). The segments determined (speech / speech pause, voiced / unvoiced) and the calculated parameters (e.g. energy content, the autocorrelation coefficients, the LPC coefficients, the LSP (line spectral pair) coefficients and parameters of further processing stages) are also suitable for comparing speech patterns. Decomposing the speech with these methods and storing it in compressed form reduces the required storage space compared to uncompressed storage. At the same time, the later pattern comparison is accelerated.

Another embodiment of the invention stores not only the compressed speech information but also uncompressed signals.

The purpose of this procedure is to be able to use better algorithms at a later point in time. The recorded voice information may be required over a longer period (decades). With most speech compressions, detailed information is inevitably lost. Since both the performance of the information processing devices and the quality of the algorithms for pattern matching are likely to continue to develop, the original signals should be retained for later use. The expected further continuous capacity increase (with simultaneous price drop) of the storage media makes this option affordable for the user.

It is also possible according to the invention to specify when entering which part of the speech information is also stored uncompressed. The uncompressed signals can also be stored on another memory, an offline memory.

The method also allows hidden searches. If a search finds a voice memo for which the comparison between the speech pattern of the search command and the speech pattern of the voice memo exceeds a predetermined similarity threshold, that memo is reproduced. The search continues in the background during playback, so that part of the search time is hidden from the user. In a preferred embodiment, a hidden search is carried out during the recording. Keywords that were spoken at the beginning of the recording are searched for while the rest of the note is being spoken and recorded. If matching speech patterns are found in the notes already stored, the speech pattern found is reproduced acoustically at the end of the recording. The user can then decide whether it is identical to what was sought. The user's response is recorded, e.g. by storing a pointer to the speech pattern found for the newly recorded keyword. This creates a list of speech patterns of identical words, which is used in later searches to improve the hit rate and increase the search speed. Even if there is a mismatch, a pointer can be saved with the appropriate identification. This prevents the two speech patterns from being assumed to be identical again in later searches. In a further embodiment of this invention, the search process for identical keywords which have already been stored may take longer than speaking the new note.

In a further embodiment, the speech patterns of the keywords contained in the search commands, the pointers to the speech notes found, the calculated distance values and the reaction of the user are also stored. In this version, it is assumed that the user makes a rating after playing back a note: GOOD, FALSE. This reaction is saved together with the pointers to the played note. With a new search command, the current speech pattern of the search command is compared with the speech pattern from previous searches. If the patterns match or if they are very similar, the stored previous reaction of the user is checked and, if positive, the voice note to which the pointer of the recording of the previous search command points is output. The subsequent reaction of the user is saved again with the pointer to the output voice memo. This procedure has several advantages:

• it shortens the search,

• the hit accuracy increases continuously,

• gradual changes in the pronunciation or in the user's voice are compensated for,

• the saved speech patterns and evaluations of decisions can be used to optimize the procedures.

It is still possible to search the original notes.

Furthermore, according to the invention it is possible, after the indirect search by means of a pointer of a previous search command, to compare the speech pattern of the new search command with the speech pattern of the note shown, and to use the result to determine the distance value.

It is also possible, according to the invention, that the evaluations are carried out in finer stages: e.g. VERY WRONG, VERY GOOD.

A VERY WRONG rating then prevents the corresponding note from being reproduced in a later search. A FALSE rating moves the note back in the order of the candidates found, e.g. by increasing its distance value through multiplication by a factor greater than one. Accordingly, a VERY GOOD rating causes the note found to be preferred in the output order in a later search, provided that its distance value is below a predetermined threshold.
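The effect of stored ratings on a later search can be sketched as a small adjustment of the distance value. The function name `apply_rating`, the numeric factors and the threshold are assumptions chosen for illustration; the description only states that a FALSE rating multiplies the distance by a factor greater than one, that a VERY WRONG rating suppresses the note, and that a VERY GOOD rating prefers it when its distance is below a threshold.

```python
def apply_rating(distance, rating, penalty=2.0, bonus=0.5, threshold=1.0):
    """Return the adjusted distance value, or None if the note must not
    be reproduced again for this search pattern."""
    if rating == "VERY WRONG":
        return None                 # note is excluded from later output
    if rating == "FALSE":
        return distance * penalty   # factor > 1 moves the note back
    if rating == "VERY GOOD" and distance < threshold:
        return distance * bonus     # assumed way of preferring the note
    return distance                 # GOOD or no rating: unchanged

adjusted = [apply_rating(0.4, r) for r in ("GOOD", "FALSE", "VERY GOOD", "VERY WRONG")]
```

Candidates whose adjusted value is `None` would simply be dropped before the remaining notes are sorted by distance.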

In another embodiment, in addition to the pointer to the note and the user's rating, a pointer to the record of the previous search command is also added to the record of the current search command.

An additional refinement of the search function extends the search functionality: associations. The device searches for the keywords contained in the search command. If it finds the searched keywords in an earlier search command or in a note, and if the previous search command or the voice memo contains further keywords, the device asks by acoustic reproduction of these keywords whether the search process should be expanded to include these keywords.

In a further refinement, only those keywords are queried which occur several times when several search commands or notes are found.

In the preferred embodiment, the additional language patterns with the most frequent occurrence are reproduced first.
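The association step can be illustrated with a short sketch. It assumes, purely for illustration, that the speech patterns of the keywords have already been matched to one another and are represented by opaque identifiers; the function `suggest_associations` and the `min_hits` parameter are hypothetical names.

```python
from collections import Counter

def suggest_associations(hits, searched, min_hits=2):
    """hits: list of keyword-identifier lists, one per found note or
    earlier search command. Returns the additional keywords that occur
    in at least `min_hits` of them, most frequent first."""
    counts = Counter()
    for keywords in hits:
        counts.update(set(keywords) - set(searched))
    return [k for k, c in counts.most_common() if c >= min_hits]

hits = [
    ["p1", "p2", "p3"],
    ["p1", "p3"],
    ["p1", "p3", "p2", "p4"],
    ["p1", "p3"],
]
suggestions = suggest_associations(hits, searched={"p1"})
```

In a device, each suggested pattern would be played back acoustically so that the user can decide whether to include it in the search.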

The user can then add these patterns to the list of speech patterns to be searched for, have them ignored, or exclude voice memos that contain these speech patterns from playback. On the one hand, this function makes it possible to successively narrow down the number of voice memos found; on the other hand, it makes it possible to find related recordings.

To speed up the search in more extensive recordings, the device can create a list with keywords and pointers to the voice memos in which these keywords occur. The list entry belonging to a keyword can contain one or more pointers. If there are several pointers per keyword, the list can contain, for each pointer, the distance value between the keyword (speech pattern) in the index list and the keyword (speech pattern) in the referenced note. The user can be provided with a special function for dictating keywords for each note. Alternatively, all words that are spoken individually (with a clear pause at the beginning and end of the word) can be added to the index list automatically. Compiling this list requires computing power. It is therefore preferably compiled while the device is connected to an external power supply, e.g. while the battery is charging. The list can also be created in another device (e.g. a server).
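A possible shape for such an index list is sketched below. The functions `build_index` and `lookup`, the threshold, and the toy scalar "patterns" with an absolute-difference distance are assumptions made to keep the example small; in a real device the patterns would be sequences of feature vectors and the distance would come from the pattern comparison.

```python
def build_index(notes, distance, threshold):
    """notes: dict mapping note id -> list of keyword patterns.
    Returns a list of (reference_pattern, pointers), where pointers is a
    list of (note_id, distance_to_reference_pattern)."""
    index = []
    for note_id, patterns in notes.items():
        for pattern in patterns:
            for reference, pointers in index:
                d = distance(reference, pattern)
                if d <= threshold:
                    pointers.append((note_id, d))  # same keyword, new pointer
                    break
            else:
                index.append((pattern, [(note_id, 0.0)]))  # new keyword entry
    return index

def lookup(index, query, distance, threshold):
    """Return the pointers of all index entries similar to the query,
    sorted by their stored distance value."""
    results = []
    for reference, pointers in index:
        if distance(reference, query) <= threshold:
            results.extend(pointers)
    return sorted(results, key=lambda pointer: pointer[1])

def toy_distance(a, b):
    return abs(a - b)  # stand-in for a real speech-pattern distance

notes = {1: [10, 50], 2: [11, 90], 3: [52]}
index = build_index(notes, toy_distance, threshold=3)
found = [note_id for note_id, _ in lookup(index, 51, toy_distance, threshold=3)]
```

A search then only compares the query with one reference pattern per keyword instead of with every stored note.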

In addition, other data can be stored together with the voice information.

An example of this is image data from a digital camera integrated in the device, which are stored together with voice memos. In this embodiment, as already explained, the search is carried out by comparing the speech pattern contained in the search command with the stored speech signals. The notes found are played back together with the other stored data. Text data or images are output, e.g., on a screen; melodies, music, and links, for example to websites and e-mails, can also be output.

In one embodiment of this invention, images are stored in a digital camera together with the speech patterns of keywords and/or annotations for these images. The pictures can later be found by speaking keywords and searching for the corresponding speech patterns in the speech data recorded with the pictures, and can be output, for example, on a display or printer. Another example of storing other data is the recording of telephone calls or parts thereof, with or without additional comments and the telephone numbers. It is possible to search for keywords, together with telephone numbers, and, using the associative function described above, for speech patterns of the other party, for example the speech pattern of his or her name as spoken when answering at the beginning of the call.

In all searches, time restrictions in the search (between date and date, by time of day, by day of the week, season, etc.) can of course also be used to restrict the search space.
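Restricting the search space by the stored date and time stamp could look like the following sketch. The function `in_time_window` and its parameters are hypothetical; the description only states that such restrictions can be applied.

```python
from datetime import datetime

def in_time_window(stamp, start=None, end=None, weekdays=None, hours=None):
    """Return True if a note's time stamp passes all given restrictions.
    weekdays: set of ints (Monday = 0); hours: (first_hour, end_hour)."""
    if start is not None and stamp < start:
        return False
    if end is not None and stamp > end:
        return False
    if weekdays is not None and stamp.weekday() not in weekdays:
        return False
    if hours is not None and not (hours[0] <= stamp.hour < hours[1]):
        return False
    return True

stamps = {
    1: datetime(2002, 4, 29, 9, 30),
    2: datetime(2002, 5, 3, 18, 0),
    3: datetime(2003, 1, 10, 10, 15),
}
# Only notes recorded in 2002 during the morning are compared at all.
candidates = [
    note_id for note_id, stamp in stamps.items()
    if in_time_window(stamp, start=datetime(2002, 1, 1),
                      end=datetime(2002, 12, 31, 23, 59), hours=(8, 12))
]
```

Filtering before the pattern comparison keeps the costly distance calculation to the notes that can qualify.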

In one embodiment of this invention, in which the input and output device (cell phone) is connected to a remote storage and computing device by means of a voice or data transmission device, the following additional functional sequences result: entering offline, searching offline, separate memories with different storage volumes, and the need for encryption.

Enter offline: to record new voice memos, there does not need to be a communication connection to the remote device. The information is buffered, e.g. in compressed or uncompressed form, on a flash memory card. Several notes can be collected and transferred together. The transmission can take place at times when cheaper connection rates apply or when the user is in the vicinity of the second device anyway, e.g. a transfer in the office to a work PC.

Search offline: if the search is to take place on the remote device, there is no need for a permanent connection between the two devices. It is sufficient if the search command with the speech patterns, for example by IP packet, is transmitted to the remote device and the result is also transmitted by IP packet or callback.

It is also possible according to the invention to save voice recordings on different devices at the same time. The user will typically carry an input and output device in the form of a cell phone. According to the current state of storage technology and compression algorithms, voice recordings up to a total of a few hours can be stored there in a flash memory card. This memory can e.g. contain the last recordings (enter offline) and the current as well as frequently used notes. The recordings in the cell phone are periodically transferred to the remote device, see 'Enter offline'. Searching can be done on the local device in the local recordings or on the remote device.

The remote device can be a large server provided by a provider, similar to voice mail services. In this version, encrypted transmission and storage on the provider's server is particularly important. Methods for encrypting voice and data are known. The data should never be unencrypted on the server or on the transmission link. The search is carried out exclusively in the cell phone using the index lists or by searching the keywords and pointers of previous, stored search commands. The server is only used to save the notes.

In a further embodiment, the index list or the record of the previous search commands can partly reside on the server. The index list is structured hierarchically; the list of previous search commands is broken down by time. Lists with older search commands are on the server. For searching, the lists are transferred to the cell phone if necessary.

In addition to the described search and classification methods using distance values (scores) and "dynamic programming", other methods known to the person skilled in the art, such as Markov models or 'neural networks', can be used to implement this invention.

The invention is explained in more detail below using an exemplary embodiment. Fig. 1 shows a schematic representation of a possible communication configuration.

In the following description, the user's commands are triggered by pressing buttons. These can also be soft keys. It is also possible according to the invention to give the commands by voice commands.

Recording: The user presses the RECORD button on a cell phone 10 and speaks his note into the cell phone 10. At the end he presses the STOP button. The speech is input via a microphone of the cell phone 10. The analog voice signals are digitized in an analog-to-digital converter and sent to a DSP 11. There the signals are passed through a pre-filter (high and low pass) and then divided into segments (typically 10 to 20 ms long). Depending on the compression standard used, the segments overlap (e.g. by 10 ms). The signal values in the segments are weighted using a Hamming window function. The autocorrelation function of the signal values in the individual segments is then calculated, and the LPC coefficients are calculated from it. For the purpose of compression and storage, these coefficients and the speech signals are further processed in accordance with the specifications of the compression standard used. For the purpose of pattern comparison, the LPC coefficients or transformed representations (e.g. cepstrum coefficients, PARCOR coefficients) are stored in a memory card 12 as part of the compressed speech information. A date and time stamp is also saved.
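The analysis chain described for the recording (segmentation, Hamming window, autocorrelation, LPC coefficients) can be sketched in plain Python as follows. The 8 kHz sampling rate, the 20 ms frames with a 10 ms hop, the LPC order of 10 and the synthetic test signal are assumptions for illustration; the pre-filter and the codec-specific processing are omitted. The LPC step uses the standard Levinson-Durbin recursion, which the description does not name explicitly.

```python
import math
import random

def frames(signal, frame_len, hop):
    """Split the signal into overlapping segments of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame):
    """Weight the samples of one segment with a Hamming window."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one segment."""
    return [sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a, error) with a[0] = 1 for the prediction error filter
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # degenerate segment (e.g. silence)
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        error *= (1.0 - k * k)
    return a, error

SAMPLE_RATE = 8000   # assumed telephone-quality sampling rate
FRAME_LEN = 160      # 20 ms segments
HOP = 80             # segments overlap by 10 ms
ORDER = 10           # assumed LPC order

rng = random.Random(0)
signal = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) + rng.uniform(-0.1, 0.1)
          for n in range(800)]  # 100 ms synthetic test tone with noise
features = [levinson_durbin(autocorrelation(hamming(f), ORDER), ORDER)[0][1:]
            for f in frames(signal, FRAME_LEN, HOP)]
```

Each entry of `features` is one vector of LPC coefficients per segment; such vectors, or transformed representations of them, are what the later pattern comparison operates on.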

Instead of using the LPC method, other methods for speech compression and pattern recognition can also be used, e.g. based on short-term Fourier analysis or filter banks.

The recording can also take place by means of a voice / data transmission 13 on a remote device, here a computer 14 or a server 15.

Search: The user presses the SEARCH button on the cell phone 10 and, while holding the button, speaks the keywords to be searched for. The device 10 searches for corresponding notes and acoustically reproduces the first voice information found. The user can then press the NEXT button to continue searching or to output the next notes found, or press a button to give a rating (GOOD, FALSE) and then, if necessary, the NEXT button. The speech signals are processed in the same way as described under 'Recording'. The speech patterns are also saved. The LPC parameters or transformed representations, for example cepstrum coefficients, are then fed to the pattern recognition. For pattern recognition, the parameters are combined into vectors, and the individual keywords are combined into groups of vectors. These are then compared with the stored voice information. Differences in the speaking rate of the patterns are compensated for using the method known as 'dynamic programming'. For each keyword, the distance value (score) to the most similar stored pattern is determined in each note. Depending on the setting of the device, the first note found that contains patterns whose distance values are below a predefined threshold value is output while the search continues. In another setting, all recordings are searched first, the notes are sorted according to their distance values, and those with the smallest distance values are output first. Each time the NEXT button is pressed, the note with the next-best rating is output. Before a note is played back, the pointer to this note is added to the record of the search command. Ratings that the user enters after hearing the note are also added to this record, together with the pointer.
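The time alignment by dynamic programming can be illustrated with a small dynamic time warping (DTW) sketch. It uses a subsequence variant in which the keyword may start and end anywhere within the note; this variant, the Euclidean frame distance, the normalization by keyword length and the function names are assumptions, since the description does not fix these details.

```python
import math

def euclidean(u, v):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def keyword_distance(keyword, note):
    """Smallest DTW cost of aligning the whole keyword with any stretch
    of the note, normalized by the keyword length."""
    if not keyword or not note:
        return float("inf")
    inf = float("inf")
    m = len(note)
    prev = [0.0] * (m + 1)          # the keyword may start anywhere
    for i in range(1, len(keyword) + 1):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            d = euclidean(keyword[i - 1], note[j - 1])
            cur[j] = d + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return min(prev[1:]) / len(keyword)   # the keyword may end anywhere

def best_note(keyword, notes):
    """Return (note_id, distance) of the note with the smallest distance."""
    return min(((nid, keyword_distance(keyword, seq)) for nid, seq in notes.items()),
               key=lambda item: item[1])

keyword = [(1.0,), (2.0,), (3.0,)]
notes = {
    "a": [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,), (3.0,), (0.0,)],  # keyword spoken more slowly
    "b": [(9.0,), (9.0,), (9.0,)],
}
winner = best_note(keyword, notes)
```

In note "a" the middle frame of the keyword is stretched over two frames, yet the warped alignment still yields a distance of zero, which is the tolerance to speaking rate that the description relies on.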

Differences to speech recognition systems

Speech recognition systems are designed for other tasks. Their purpose is to convert dictated input into written form with as few errors as possible. In speech recognition systems, spoken language is mapped onto a predetermined, generally expandable set of words or functions, and the algorithms are structured accordingly. The mapping takes place in several steps. The last steps in particular differ from the procedure according to this invention. They use statistical models (mostly hidden Markov models) with information about the frequency of transitions between speech segments (speech sounds or phonemes). These models are partly created in a training phase before the first use, which is annoying for the user. In the method according to the invention, this training phase before the first use is omitted. The vocabulary (keywords) is also not defined a priori, but results dynamically and automatically during recording and searching. Another difference: in a speech recognition system, there is one 'correct' mapping for each spoken word, namely the one the user intended. In the device according to the invention, a keyword can have several correct "hits".

An extension for search functions in flash memory elements is described below. It has two decisive advantages: performance and access protection. Fig. 1 shows a cell phone 10 with a DSP 11 and a memory card 12. The memory card 12 can contain the voice memos and/or the index file. This arrangement has a disadvantage: with most of the small memory cards in use today, e.g. MultiMediaCard or Memory Stick, the data is transferred to the host system via a serial bus. These serial buses have such a small bandwidth that the search process is slowed down considerably if the search program runs in the cell phone 10 or in the DSP 11 while the data is on the memory card 12. Therefore, in one embodiment of this invention, special cards are used which, in addition to the memory for the index files and/or the voice memos, contain a search processor on the card. Only the search commands with the associated speech patterns, and the results in the form of pointers to the note(s) found or the voice data of the notes found, are transmitted via the interface of the card. If the memory card 12 contains only the index files, the voice memos can be on a server 15 or PC 14. After the search process in the card, the notes found are then retrieved from the server 15 / PC 14.

The architecture presented has another advantage: access protection. The memory card 12 can be protected against unauthorized access by means of a password or other authentication mechanisms (for example biometric methods). If the notes and, if available, the index files are kept in the card's memory, there is good protection against unauthorized access. If only the index files are saved on the card, the notes can be stored in encrypted form on a server / PC. The notes found are first transferred to the card before being output, decrypted there and forwarded to the cell phone 10 for playback. When recording voice memos, the transfer takes place in reverse: the voice data is transferred from the cell phone 10 to the memory card 12, where the index files, if available, are extended, and the voice data of the note is stored and/or encrypted and then forwarded to the server 15 / PC 14. The bandwidth of the serial interfaces is sufficient for the transmission of voice data in encrypted and unencrypted form in real time.

LIST OF REFERENCE NUMBERS

10 Cell phone
11 Digital signal processor (DSP)
12 Memory card
13 Voice / data transmission
14 Computer
15 Server

Claims


1. A method for handling voice information for realizing a speech-based database with functions for storage and/or playback and/or transmission, in which means for the input and/or storage and/or acoustic playback and/or for the transmission of voice and data information and of digital image information with speech data are used, and means for searching for one or more speech segments in the stored speech information are used, characterized in that spoken words and/or coherent sentences (notes) are recorded digitally as speech signals in a memory, means being provided for transforming the speech signals into a representation in the frequency domain and/or into a compressed representation in the time domain; that, for searching, a part of a note comprising at least one word is spoken again and is compared and evaluated in a device against the recordings, a distance value between the two speech patterns being determined on the basis of their parameters in the time and/or frequency domain; and that notes which contain speech segments with the smallest distance values are output acoustically.

2. The method according to claim 1, characterized in that the speech signals are recorded in compressed form and / or in addition to the compressed speech information, uncompressed signals are also stored.

3. The method according to claim 1, characterized in that further information, such as speech patterns of the keywords contained in the search commands, pointers to the voice memos found, calculated distance values, and reactions and/or evaluations of the user, is stored and/or that in later searches the speech patterns of the previous search commands are searched, taking into account the recorded ratings and/or user reactions and/or distance values and/or pointers.

4. The method according to claim 1, characterized in that the notes which contain the highest number of searched keywords are reproduced first.

5. The method according to claims 1 to 4, characterized in that the speech patterns are compared when searching with the same data sets that are also used for playback.

6. The method according to claims 1 to 5, characterized in that a hidden search is carried out during the recording and/or a hidden search is continued during the playback of a hit.

7. The method according to claims 1 to 6, characterized in that the search algorithms and parameters are optimized on the basis of the recorded patterns and ratings.

8. The method according to claims 1 to 7, characterized in that encrypted storage and access protection are installed.

9. The method according to claims 1 to 8, characterized in that the voice input takes place via microphone, telephone, or offline via dictation machine or voice box, and the playback takes place via headphones, speakers or telephone.

10. The method according to claims 1 to 9, characterized in that short-term storage is carried out in a cell phone and long-term storage on a server, data being copied to the long-term storage periodically and/or upon access, or voice recordings being made on different devices simultaneously.

11. The method according to claims 1 to 10, characterized in that an index is built up by storing individual speech patterns separately and providing pointers to the recorded notes, or by storing pointers together with matching coefficients (scores), and/or the index patterns can be determined by the user by speaking individual words.

12. The method according to claim 11, characterized in that the creation of the index files and the optimization of the index files take place when the cell phone is connected to the power supply (network), or take place offline, on a more powerful computer.

13. The method according to claims 1 to 10, characterized in that the search is carried out with a time restriction.

14. Device for performing the method according to claims 1 to 13, characterized in that a telecommunications terminal such as a cell phone / telephone (10) with a storage medium such as FlashROM and/or a memory card (12) and/or a data processing device such as a PC or server equipped with special additional software (11) is used.

15. The device for performing the method according to claim 14, characterized in that a special software on a computer (14) such as PC with voice input and output is used.

16. The device for performing the method according to claim 14, characterized in that a telephone is used via a network connected to a computer such as a PC or to a special server.

17. The apparatus for performing the method according to claim 14, characterized in that the memory card (12) contains a search processor and / or devices for access protection in addition to the memory for the index files and / or the voice memos.