WO1998035491A1 - Voice-data interface - Google Patents

Publication number
WO1998035491A1
Authority
WO
WIPO (PCT)
Prior art keywords
words
coded signals
speech
signals
link
Application number
PCT/GB1998/000194
Other languages
French (fr)
Inventor
Robert Denis Johnston
Original Assignee
British Telecommunications Public Limited Company
Application filed by British Telecommunications Public Limited Company
Priority to EP98900943A (EP0958692A1)
Priority to PCT/GB1998/000194 (WO1998035491A1)
Priority to JP53397198A (JP2001510660A)
Priority to AU56743/98A (AU5674398A)
Publication of WO1998035491A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/487 Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M3/493 Interactive information services, e.g. directory enquiries; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
    • H04M3/4938 Interactive information services, e.g. directory enquiries; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals comprising a voice browser which renders and interprets, e.g. VoiceXML
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00 Electrically-operated educational appliances
    • G09B5/06 Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
    • G09B5/065 Combinations of audio and video presentations, e.g. videotapes, videodiscs, television systems
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00 Electrically-operated educational appliances
    • G09B5/08 Electrically-operated educational appliances providing for individual presentation of information to a plurality of student stations
    • G09B5/14 Electrically-operated educational appliances providing for individual presentation of information to a plurality of student stations with provision for individual teacher-student communication
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/221 Announcement of recognition results

Abstract

A page of text from a database (4) has certain words marked with the addresses of other, linked pages. The text is received at (10) and converted into an audio signal by a speech synthesiser (15) so that it can be heard by a user. The user's spoken responses are fed to a speech recogniser (19) so that an address associated with a marked word in which the user is interested can be returned to the database (4) for retrieval of the corresponding linked page. Because the user will not necessarily know which words are marked, the recogniser is set up to match the user's response against the whole of the text fed to the synthesiser to identify the words in the text giving the best match. A resolver (20) finds the nearest marked word to the words identified and extracts the associated link address.

Description

VOICE-DATA INTERFACE
The present application is concerned with voice-interactive access to text-based services. According to one aspect of the invention there is provided an interface for a voice interactive service comprising: a speech synthesiser to receive coded signals representing sequences of words and to generate audio signals corresponding thereto for output; speech recognition means connected to receive the said coded signals and operable upon receipt of a speech signal to be recognised to identify that part of the word sequence represented by the coded signals which most resembles the speech signal to be recognised.
In another aspect the invention provides a method of operating a voice interactive service comprising (a) receiving coded signals representing a sequence of words and synthesising audio signals corresponding thereto for output;
(b) receiving a speech signal and identifying by means of a speech recogniser that part of the word sequence represented by the coded signals which most resembles the received speech signal; and (c) using the recognition result to select a further sequence of words.
Other aspects of the invention are set out in the claims.
Some embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings.
In Figure 1 an apparatus 1 for providing a voice-interactive service is shown and in this example it is intended to allow a user to access a text-based information service by voice only, using a telephone 2. Although the apparatus 1 could be located at the user's premises or at the location of the text-based information service, in this example it is located at a telephone exchange or other central location where it can be accessed by many users (at different times or - with duplication of its functions - simultaneously) via a telecommunications link such as a PSTN dialled connection 3. The information service is provided by a remote database server 4 which contains (or forms a gateway offering access to) stored pages of textual information - though the database could if desired be incorporated into the apparatus 1. Here we suppose that the server is part of a network accessible via a telecommunications link 5, such as the Internet, and responds to addresses transmitted to it by sending a document identified by that address.
Documents provided by the Internet are commonly formatted according to the hypertext markup language (HTML) which is itself a particular example of the standard generalised markup language according to international standard ISO 8879. As well as containing text characters forming the words of the text, an HTML document also contains formatting information suggesting the appearance of the document when displayed on a screen (or printed) such as position, font size, italics and so forth. The precise details of these are not important for present purposes; one thing that is of significance however is that these documents also have provision for flagging words or phrases as associated with the address of another document. Part of such a document is illustrated in Figure 2a with its displayed appearance shown in Figure 2b.
It is seen that this format and control information is enclosed within chevrons "< >" as delimiters, not being intended for display. The text "Patent Office Sites" is to be shown in bold type as indicated by the start and finish codes <b> and </b>. The text "US Patent and Trademark Office" is flanked by <a> and </a> delimiters which normally cause the text to be displayed in a distinctive manner - a special colour or underlined, for example - to identify this phrase as representing a link. Moreover the <a> code contains an associated address "http://www.uspto.gov" which is the address of the Internet page of the US Patent and Trademark Office. When a user with a visual display terminal receives such a document and wishes to select the USPTO page, he uses a pointing device such as a mouse to point to the underlined phrase, causing the terminal to extract the associated address and transmit it for selection of a new document.
The function of the apparatus 1 of Figure 1 is, in brief, as follows:
(a) to receive HTML documents from the server 4;
(b) to synthesise an audio signal reciting the text contained in the document and transmit it via the line 3 to the user at 2;
(c) to recognise spoken replies from the user;
(d) to recognise the replies from the user as indicating a selection of a further document;
(e) to transmit the address of that document to the server 4.
Figure 3 shows the apparatus 1 in more detail. It contains a network interface 10 which comprises a modem for connection to the link 5, and a processor programmed with software to transmit addresses via the modem to the server and receive documents from the server. This software differs from conventional browser software such as Netscape only in that (a) it receives addresses via a connection 11 rather than having them typed in at a keyboard or selected using a mouse and (b) it outputs the received text directly to a file or buffer 12 which can be accessed via a connection 13.
Suppose that a document has been received by the interface 10 and is stored in the buffer 12. A first portion of text is read out and a correspondingly coded signal is output on the line 13. The actual amount of the text output could rely on punctuation characters included in the text, for example up to the first (or second etc.) full stop, or up to the first paragraph mark. This is received by a text pre-processing unit 14 which serves to delete unwanted control information, and forward it to a conventional text-to-speech synthesiser 15. This produces an audio signal corresponding to the portion of text, which is transmitted over the telephone line 3 to the user at 2.
The portion of text is also copied to a buffer 16. This is shown as coming from a second output of the pre-processing unit 14, since whilst the unit 14 removes from the text sent to the synthesiser 15 all format and control information (i.e. the characters < and > and anything within them), the text sent to the buffer 16 still includes the link address commands (e.g. < a ref = "http//www.epo.co.at/epo" > .... < /a >) but omits all other formatting and control information.
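The two outputs of the pre-processing unit can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the regular expressions and the sample markup (written with the standard HTML `href` attribute rather than copied from Figure 2a) are assumptions made for illustration only.

```python
import re

# Any <...> control sequence.
TAG = re.compile(r"<[^>]*>")
# Any <...> sequence that is NOT an opening or closing anchor tag.
NON_ANCHOR_TAG = re.compile(r"<(?!\s*/?\s*a\b)[^>]*>", re.IGNORECASE)


def for_synthesiser(html: str) -> str:
    """Drop every <...> sequence so that only speakable words remain."""
    return re.sub(r"\s+", " ", TAG.sub("", html)).strip()


def for_link_buffer(html: str) -> str:
    """Drop formatting tags but keep <a ...> and </a> with their addresses."""
    return re.sub(r"\s+", " ", NON_ANCHOR_TAG.sub("", html)).strip()


# Illustrative markup only; not a reproduction of Figure 2a.
sample = (
    '<b>Patent Office Sites</b> '
    '<a href="http://www.uspto.gov">US Patent and Trademark Office</a>'
)
print(for_synthesiser(sample))
print(for_link_buffer(sample))
```

The first function corresponds to the text sent to the synthesiser 15 and the second to the text copied to the buffer 16; a real pre-processor would more likely use an HTML parser than regular expressions.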
If desired, one could allow selected markings to pass to the synthesiser, for example so that bold type could be more heavily stressed, but this is entirely optional.
Although the link addresses are stored in the buffer 16, they are removed by further text processing 17 before forwarding the text to a recognition network generator 18 which is connected to a speech recogniser 19.
The recogniser 19 is connected to receive audio signals from the telephone line 3, so that responses from the user at 2 may be recognised. The recogniser may have permanent programming to enable it to recognise some standard command words for control of the system; however its primary purpose is to match the user's response to the source text which has just been spoken by the synthesiser 15; more particularly to identify that part of the source text present in the buffer 16 which most closely resembles the user's response.
Thus the function of the recognition network generator 18 is to derive, from the text input to it, parameters for the recogniser defining a vocabulary and grammar corresponding to this task.
In this example, it is assumed that the output of the recogniser is a text string corresponding to the matched portion of text (or command word). This output, representing the user's response, is taken to be a request for a further document, and the next task is to identify this by locating the text string in the buffer 16 and returning the link address contained within it; or if there is none, returning the nearest link address stored in the buffer. This function (to be discussed in more detail below) is performed by a link resolver unit 20 which outputs the link address to the interface 10, which transmits it to the database server 4 as a request for a further document. If however the link represents a position in the current document, then this is recognised and a command issued to the buffer 12 to read text from a specified point.
Control functions - for example if the user wishes to move on to the next (or preceding) paragraph of the document currently stored in the buffer 12, or to return to some default document, or to terminate the connection - could be performed using the telephone keypad, but preferably are achieved by designating certain words as control words (e.g. More, Back, Home, Quit) stored as a permanent vocabulary in the recogniser 19 and received by a control unit 21 which, upon receiving one of these words alone, then issues appropriate instructions to the buffer 12 and/or interface 10.
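The split between control words and ordinary matched text might be sketched as follows. The action names and the pairing of each word with an action are assumptions inferred from the order in which the description lists them; the patent does not spell out the mapping.

```python
from typing import Tuple

# Assumed pairing of control words with actions (hypothetical action names).
CONTROL_WORDS = {
    "more": "next_paragraph",
    "back": "previous_paragraph",
    "home": "default_document",
    "quit": "terminate_connection",
}


def route(recognised: str) -> Tuple[str, str]:
    """Send a lone control word to the control unit, anything else to the link resolver."""
    text = recognised.strip()
    action = CONTROL_WORDS.get(text.lower())
    if action is not None:
        return ("control", action)
    return ("link", text)


print(route("More"))
print(route("Amazon basin"))
```

Only a control word spoken on its own is treated as a command here, so a phrase from the page that happens to contain "more" still goes to the link resolver.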
By way of further explanation of the operation of the apparatus, and in particular of the link resolver 20, consider a situation where the buffer 12 is loaded with a document as shown in Figure 4A. The appearance of this document were it to be displayed on a visual display unit is as in Figure 4B.
Suppose that the buffer 12 is set up to output one paragraph at a time; suppose further that the user has already heard the title and asked for "More"; the buffer 12 then outputs the next paragraph ("Welcome ... forests") to the text preprocessor as shown in Figure 4C.
Suppose now the user says "the Amazon basin". The recogniser 19 matches the speech signal and outputs the text string "Amazon basin", whereupon the link resolver 20 searches in the buffer 16 for this text string, finds that it is attached to the link address "http://www/amazon.basin", reads out this address and forwards it to the interface 10 which transmits it to the database server 4 to call up another page.
Naturally the user cannot know which expressions have link addresses attached and, to cater for the possibility of him/her uttering some other words, the link resolver operates according to the flowchart shown in Figure 5. In a first test 30, it is determined whether the matched source text is, or contains, a link. "Amazon basin", "birds in the Amazon basin" or even "basin many of" would pass this test. In this case, the link address in question is chosen at 31. Otherwise a second test 32 is performed to establish whether the matched source text lies in a sentence which contains a link; "one thousand species" for example would fall into this category. In this case the address in that sentence (or, if more than one, the one nearest to the matched source text) is chosen. Otherwise the nearest link to the matched source text is chosen, for example by counting the number of words (or the number of characters) from the matched text to the next link above and below it in the buffer, and choosing the link with the lower count. A more complex algorithm could examine the nearest links above and below the matched text for the degree of semantic similarity to the matched text and choose the more similar. In a refinement, one could weight this choice to take account of punctuation, for example by increasing by (e.g.) 10 words the count when crossing a paragraph boundary.
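The three-stage choice described for Figure 5 can be sketched in Python under some stated assumptions: the buffer is modelled as a list of word tokens that each carry an optional link address plus sentence and paragraph numbers, the match is given as a token range, and distances are counted in words with the paragraph weighting applied as a fixed penalty. The `Token` class, the helper names and the sample sentences are hypothetical; only the two addresses come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional

PARAGRAPH_PENALTY = 10  # extra "words" per paragraph boundary crossed


@dataclass
class Token:
    word: str
    link: Optional[str] = None  # address if the word sits inside a link
    sentence: int = 0
    paragraph: int = 0


def words(text, link=None, sentence=0, paragraph=0) -> List[Token]:
    return [Token(w, link, sentence, paragraph) for w in text.split()]


def resolve(tokens: List[Token], start: int, end: int) -> Optional[str]:
    """Choose a link address for the non-empty matched span tokens[start:end]."""
    span = tokens[start:end]
    # Test 30: the matched text is, or contains, a link.
    for tok in span:
        if tok.link:
            return tok.link
    linked = [i for i, tok in enumerate(tokens) if tok.link]
    if not linked:
        return None

    def distance(i: int) -> int:
        # Word count from the nearer edge of the match, weighted by paragraphs.
        edge = start if i < start else end - 1
        crossed = abs(tokens[i].paragraph - tokens[edge].paragraph)
        return abs(i - edge) + PARAGRAPH_PENALTY * crossed

    # Test 32: prefer links lying in a sentence touched by the match;
    # otherwise fall back to the nearest link anywhere in the buffer.
    sentences = {tok.sentence for tok in span}
    same_sentence = [i for i in linked if tokens[i].sentence in sentences]
    best = min(same_sentence or linked, key=distance)
    return tokens[best].link


# Made-up buffer contents; not a reproduction of Figure 4A.
buffer16 = (
    words("More than one thousand species of birds in the")
    + words("Amazon basin", link="http://www/amazon.basin")
    + words("many of which are rare.")
    + words("Contact the", sentence=1, paragraph=1)
    + words("British Wildlife Society", link="#3224", sentence=1, paragraph=1)
    + words("Thank you for listening.", sentence=2, paragraph=1)
)

print(resolve(buffer16, 10, 13))  # "basin many of": contains a link
print(resolve(buffer16, 2, 5))    # "one thousand species": same sentence
print(resolve(buffer16, 21, 23))  # "Thank you": nearest link overall
```

The semantic-similarity variant mentioned in the text is not attempted here; it would replace the `distance` function with a similarity score.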
The HTML language also permits links to other parts of the current document - as shown in Figure 4A for the British Wildlife Society. Upon recognition of this name by the recogniser, the address "#3224" would be recognised by the link resolver as an internal address and forwarded not to the interface 10 but to the buffer 12 to cause readout of a paragraph from a point in the document specified by the address.
The operation of the recognition network generator 18 may now be discussed further. There are essentially two components to the setting up of a recogniser for a given function. First, defining its vocabulary, and second, defining its grammar. The vocabulary is a question of ensuring that the recogniser has a set of models or templates, typically one for each of the words to be recognised - that is, one for each of the words (other than link addresses) present in the buffer 16. Vocabulary generation for this purpose may use any of the conventional methods. Typically this is done by using a recogniser preprogrammed with a set of sub-word models (e.g. one per phoneme) and processing each word delivered from the buffer, in similar manner to the operation of a text-to-speech synthesiser, to generate a word template by concatenation of the appropriate sub-word models. Alternatively the recogniser may have a standard store of word models which can be retrieved when the corresponding words are received from the buffer 16, though to accommodate proper names and other words not in the standard set the sub-word concatenation method would usually be employed as well.
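As a rough illustration of vocabulary generation, the sketch below looks a word up in a store of whole-word models first and otherwise builds a template by concatenating sub-word model identifiers. Everything concrete in it is assumed: the lexicon entries, the phoneme symbols and the model identifiers are placeholders, and a real system would derive pronunciations with text-to-speech style rules rather than a two-entry table.

```python
from typing import Dict, List, Tuple

# Hypothetical pronunciation lexicon; phoneme symbols are illustrative only.
LEXICON: Dict[str, List[str]] = {
    "amazon": ["ae", "m", "ax", "z", "aa", "n"],
    "basin": ["b", "ey", "s", "ax", "n"],
}
# Hypothetical standard store of ready-made whole-word models.
WORD_STORE: Dict[str, str] = {"the": "WORD_MODEL_the"}


def normalise(word: str) -> str:
    """Lower-case a word and trim surrounding punctuation."""
    return word.lower().strip(".,;:!?\"'()")


def word_template(word: str) -> Tuple[str, ...]:
    """Return a stored word model if present, else concatenate sub-word models."""
    key = normalise(word)
    if key in WORD_STORE:
        return (WORD_STORE[key],)
    return tuple(f"PHONE_MODEL_{p}" for p in LEXICON[key])


def build_vocabulary(text: str) -> Dict[str, Tuple[str, ...]]:
    """One template per distinct word in the text taken from the buffer."""
    return {normalise(w): word_template(w) for w in text.split()}


vocab = build_vocabulary("the Amazon basin.")
for word, template in vocab.items():
    print(word, template)
```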
The grammar of a recogniser is a set of stored parameters which define what word sequences are permissible; for example, considering the buffer contents shown in Figure 4A, whilst "Amazon basin" is a word sequence which is useful to recognise, "basin Amazon" is not. One possibility is to allow (as sequences for matching against the user's utterance) any number of words from 1 upwards, but only in the sequence in which they appear in the buffer. Figure 6 shows this represented graphically (for a portion only of the text) where 40 represents a start node of a recognition "tree", 41 represents an end node, 42 represents word models and the lines 43 represent allowable paths. It would be possible to include a network of 'carrier phrases' as shown in Figure 7 so that the user could say sentences such as "Tell me more about the Amazon Basin please". Alternatively a garbage or sink model (Fig. 8) could be included at the beginning and end of the network to allow any speech to surround the echoed phrase. In another embodiment the recogniser could simply allow any of the words on the page to be uttered in any order as shown in Figure 9. The accuracy of such a recogniser would not be as high as those shown in Figures 6 to 8, but if statistical constraints based on the contents of the HTML page were incorporated in the recognition process a working system could be created.
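The simplest grammar described above can be illustrated on plain word lists. This sketch assumes that "in the sequence in which they appear in the buffer" means an unbroken run of consecutive words, which is one reading of Figure 6 rather than something the text states outright; it also works on text, whereas a recogniser would apply the same constraint to acoustic word models.

```python
from typing import List, Optional, Tuple


def allowed_sequences(words: List[str]) -> List[Tuple[str, ...]]:
    """Every run of one or more consecutive words, in buffer order."""
    n = len(words)
    return [tuple(words[i:j]) for i in range(n) for j in range(i + 1, n + 1)]


def locate(words: List[str], utterance: List[str]) -> Optional[Tuple[int, int]]:
    """Return (start, end) word offsets if the utterance is an allowed run."""
    m = len(utterance)
    if m == 0:
        return None
    for i in range(len(words) - m + 1):
        if words[i:i + m] == utterance:
            return (i, i + m)
    return None


text = "birds in the Amazon basin".split()
print(len(allowed_sequences(text)))
print(locate(text, ["Amazon", "basin"]))
print(locate(text, ["basin", "Amazon"]))
```

For a passage of n words this grammar admits n(n+1)/2 sequences, which is why it is normally expressed as a network of paths between word models rather than as an explicit list.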
Returning briefly to Figure 3, in this embodiment it has been assumed that the recogniser returns, as a "label" representing its recognition result, the relevant part of the actual text string supplied to the recognition network generator 18 by the buffer 16, and the link resolver 20 matches this string against the buffer contents to locate the desired links. Whilst this may be convenient to permit use of a conventional unit for the recogniser 16, a way of speeding up the operation of the link resolver would be to set up the recogniser to return some parameter enabling faster access to the buffer, for example pointer values giving the addresses in the buffer 16 of the first and last characters of the matched source text string.
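A minimal sketch of the pointer-based lookup suggested above is given below. It assumes, purely for illustration, that the link resolver holds each link as a character range in the buffer together with its address; the `Link` record and the function names are hypothetical. The test for an internal address follows the "#3224" example given earlier.

```python
from typing import List, NamedTuple, Optional


class Link(NamedTuple):
    start: int    # buffer offset of the first character of the link words
    end: int      # buffer offset one past the last character (exclusive)
    address: str  # link address associated with those words


def resolve_link(match_start: int, match_end: int,
                 links: List[Link]) -> Optional[str]:
    """Return the address of the first link overlapping the matched range."""
    for link in links:
        if match_start < link.end and link.start < match_end:
            return link.address
    return None


def is_internal(address: str) -> bool:
    """Internal addresses (e.g. "#3224") go to the buffer, not the interface."""
    return address.startswith("#")


text = "Visit the British Wildlife Society today"
link_start = text.index("British")
links = [Link(link_start, link_start + len("British Wildlife Society"), "#3224")]

# Suppose the recogniser matched only the word "Wildlife".
m = text.index("Wildlife")
address = resolve_link(m, m + len("Wildlife"), links)
```

Because the recogniser supplies offsets directly, no string search of the buffer is needed at resolution time, which is the speed advantage the passage describes.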
Although only one server is shown in Figure 1, there could of course be others, and the transmitted link address could well be destined for a different server from the one sending the document from which it was obtained.
This embodiment presupposes that the source text carries hyperlink addresses; however it is also possible to operate this system without embedded addresses of this kind. For example one could transmit to the database server coordinates to identify the point in (or range of) the source text at which the match occurred. In the case of a connectionless service such as the Internet, it would be necessary to concatenate this information with the address of the server before transmitting it.
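One conceivable way of concatenating match coordinates with a server address is sketched below. The query parameter names `start` and `end`, the example URL and the function name are assumptions made for illustration only; the text does not specify any particular encoding.

```python
from urllib.parse import urlencode


def build_match_request(server_address, match_start, match_end):
    """Append the matched range to the server address as a query string."""
    query = urlencode({"start": match_start, "end": match_end})
    return f"{server_address}?{query}"


request = build_match_request("http://example.com/page.html", 17, 22)
```

The server would then need its own means of mapping the received range onto further information, since no link address is carried in the text itself.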
It was mentioned earlier that the text preprocessor 14 could be arranged to pass certain markings through to the synthesiser 15 to allow bold type to be emphasised. Similarly, it would be possible for the preprocessor to pass the hyperlink markings <a> ... </a> (albeit without the addresses) and arrange the synthesiser to respond to these by applying an emphasis, or even switching to a different voice (for example a male instead of a female voice) from that used for the remainder of the text. With this expedient, in an alternative embodiment, one can simplify the speech recogniser vocabulary to include only the link words, though it is still preferred to operate the recogniser as described above, against the possibility that the user may not always accurately recollect which words were spoken with the emphasis (or different voice).
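The passing of hyperlink markings without their addresses could be sketched as below, using regular expressions on HTML-like text. This is an illustrative simplification with hypothetical function names, not a full HTML parser; it assumes simple, non-nested anchor tags.

```python
import re


def strip_link_addresses(html):
    """Keep the <a> ... </a> markings but drop attributes such as href."""
    return re.sub(r"<a\b[^>]*>", "<a>", html, flags=re.IGNORECASE)


def link_phrases(marked_text):
    """List the link words, e.g. for a recogniser limited to link words."""
    return re.findall(r"<a>(.*?)</a>", marked_text,
                      flags=re.IGNORECASE | re.DOTALL)


source = 'See the <a href="#3224">British Wildlife Society</a> for details'
marked = strip_link_addresses(source)
phrases = link_phrases(marked)
```

The marked text would go to the synthesiser so that the link words can be emphasised, while the list of link phrases corresponds to the reduced recogniser vocabulary of the alternative embodiment.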

Claims

1. An interface for a voice interactive service comprising: a speech synthesiser to receive coded signals representing sequences of words and to generate audio signals corresponding thereto for output; speech recognition means connected to receive the said coded signals and operable upon receipt of a speech signal to be recognised to identify that part of the word sequence represented by the coded signals which most resembles the speech signal to be recognised.
2. An interface according to Claim 1 in which the coded signals include link signals identifying one or more words of a sequence which represent links to further information, and the apparatus is operable to select from the coded signals a link signal which is in or adjacent to the identified resembling part of the sequence.
3. An interface according to Claim 2 including a communications interface connected to receive the coded signals from a remote source and to transmit the selected link signal to the same or another remote source for requesting further coded signals.
4. An interface according to Claim 2 or 3 including a buffer for storing the coded signals, wherein:
(a) the interface is so operable that the speech synthesiser can generate audio signals corresponding to a portion only of the coded signals stored in the buffer and the recogniser thereupon identifies that part of the word sequence which is represented by said portion of the coded signals which part most resembles the speech signal to be recognised; and
(b) the interface includes control means responsive to a link signal which identifies a further portion of the coded signals stored in the buffer to transmit that further portion to the synthesiser and recognition means.
5. An interface according to any one of the preceding claims including a telephone line interface whereby the generated audio signals and received speech signals may respectively be sent to and received from a remote user.
6. An interface for a voice interactive service as herein described with reference to the accompanying drawings.
7. A method of operating a voice interactive service comprising
(a) receiving coded signals representing a sequence of words and synthesising audio signals corresponding thereto for output;
(b) receiving a speech signal and identifying by means of a speech recogniser that part of the word sequence represented by the coded signals which most resembles the received speech signal; and
(c) using the recognition result to select a further sequence of words.
8. A method according to Claim 7 in which the coded signals include link signals identifying one or more words of a sequence which represent links to further information, and step (c) includes selecting from the coded signals a link signal which is in or adjacent the identified resembling part of the sequence.
9. An interface for a voice interactive service comprising: a speech synthesiser to receive coded signals representing sequences of words and to generate audio signals corresponding thereto for output, in which the coded signals include link signals identifying one or more words of a sequence which represent links to further information, the synthesiser being responsive to receipt of the link signals to utter the words so identified in a different manner from words not so identified; and speech recognition means connected to receive at least those of the coded signals which represent link-representing words and operable upon receipt of a speech signal to be recognised to identify which of the link-representing words most resemble the speech signal to be recognised.
PCT/GB1998/000194 1997-02-05 1998-01-22 Voice-data interface WO1998035491A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP98900943A EP0958692A1 (en) 1997-02-05 1998-01-22 Voice-data interface
PCT/GB1998/000194 WO1998035491A1 (en) 1997-02-05 1998-01-22 Voice-data interface
JP53397198A JP2001510660A (en) 1997-02-05 1998-01-22 Voice data interface
AU56743/98A AU5674398A (en) 1997-02-05 1998-01-22 Voice-data interface

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP97300741.2 1997-02-05
PCT/GB1998/000194 WO1998035491A1 (en) 1997-02-05 1998-01-22 Voice-data interface

Publications (1)

Publication Number Publication Date
WO1998035491A1 true WO1998035491A1 (en) 1998-08-13

Family

ID=10824876

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1998/000194 WO1998035491A1 (en) 1997-02-05 1998-01-22 Voice-data interface

Country Status (1)

Country Link
WO (1) WO1998035491A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000014728A1 (en) * 1998-09-09 2000-03-16 One Voice Technologies, Inc. Network interactive user interface using speech recognition and natural language processing
EP0992980A2 (en) * 1998-10-06 2000-04-12 Lucent Technologies Inc. Web-based platform for interactive voice response (IVR)
WO2000052914A1 (en) * 1999-02-27 2000-09-08 Khan Emdadur R System and method for internet audio browsing using a standard telephone
WO2001016936A1 (en) * 1999-08-31 2001-03-08 Accenture Llp Voice recognition for internet navigation
WO2001043388A2 (en) * 1999-12-10 2001-06-14 Deutsche Telekom Ag Communication system and method for establishing an internet connection by means of a telephone
EP1134948A2 (en) * 2000-03-15 2001-09-19 Nec Corporation Information search system using radio portable terminal
EP1168799A2 (en) * 2000-06-30 2002-01-02 Fujitsu Limited Data processing system with vocalisation mechanism
JP2002091756A (en) * 2000-06-15 2002-03-29 Internatl Business Mach Corp <Ibm> System and method for simultaneously providing a large number of acoustic information sources
DE10201623C1 (en) * 2002-01-16 2003-09-11 Mediabeam Gmbh Method for data acquisition of data made available on an Internet page and method for data transmission to an Internet page
US6662163B1 (en) * 2000-03-30 2003-12-09 Voxware, Inc. System and method for programming portable devices from a remote computer system
DE102010001564A1 (en) 2010-02-03 2011-08-04 Bayar, Seher, 51063 A method and computer program product for automated configurable acoustic reproduction and editing of website content

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0735736A2 (en) * 1995-03-30 1996-10-02 AT&T IPM Corp. Method for automatic speech recognition of arbitrary spoken words
US5572625A (en) * 1993-10-22 1996-11-05 Cornell Research Foundation, Inc. Method for generating audio renderings of digitized works having highly technical content
GB2307619A (en) * 1995-11-21 1997-05-28 Alexander James Pollitt Internet information access system
WO1997023973A1 (en) * 1995-12-22 1997-07-03 Rutgers University Method and system for audio access to information in a wide area computer network
WO1997040611A1 (en) * 1996-04-22 1997-10-30 At & T Corp. Method and apparatus for information retrieval using audio interface

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ATKINS D L , BALL T , BARAN T R , BENEDICT M A , COX K C , LADD D A , MATAGA P A, PUCHOL C , RAMMING J C , REHOR K G , TUCKEY C.: "INTEGRATED WEB AND TELEPHONE SERVICE CREATION", BELL LABS TECHNICAL JOURNAL, vol. 2, no. I, 1 January 1997 (1997-01-01), USA, pages 19 - 35, XP002036350 *
PAGE J H ET AL: "THE LAUREATE TEXT-TO-SPEECH SYSTEM - ARCHITECTURE AND APPLICATIONS", BT TECHNOLOGY JOURNAL, vol. 14, no. 1, 1 January 1996 (1996-01-01), pages 57 - 67, XP000554639 *
RABINER L R: "The impact of voice processing on modern telecommunications", SPEECH COMMUNICATION, vol. 17, no. 3-4, November 1995 (1995-11-01), pages 217 - 226, XP000641894 *
RICCIO A ET AL: "VOICE BASED REMOTE DATA BASE ACCESS", PROCEEDINGS OF THE EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY (EUROSPEECH), PARIS, SEPT. 26 - 28, 1989, vol. 1, 26 September 1989 (1989-09-26), TUBACH J P;MARIANI J J, pages 561 - 564, XP000209922 *
TAKAHASHI J ET AL: "INTERACTIVE VOICE TECHNOLOGY DEVELOPMENT FOR TELECOMMUNICATIONS APPLICATIONS", SPEECH COMMUNICATION, vol. 17, no. 3/04, November 1995 (1995-11-01), pages 287 - 301, XP000641897 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000014728A1 (en) * 1998-09-09 2000-03-16 One Voice Technologies, Inc. Network interactive user interface using speech recognition and natural language processing
EP0992980A2 (en) * 1998-10-06 2000-04-12 Lucent Technologies Inc. Web-based platform for interactive voice response (IVR)
JP2000137596A (en) * 1998-10-06 2000-05-16 Lucent Technol Inc Interactive voice response system
EP0992980A3 (en) * 1998-10-06 2001-05-23 Lucent Technologies Inc. Web-based platform for interactive voice response (IVR)
US6587822B2 (en) 1998-10-06 2003-07-01 Lucent Technologies Inc. Web-based platform for interactive voice response (IVR)
WO2000052914A1 (en) * 1999-02-27 2000-09-08 Khan Emdadur R System and method for internet audio browsing using a standard telephone
CN100393073C (en) * 1999-02-27 2008-06-04 E·R·汗 System and method for internet audio browsing using standard telephone
US6606611B1 (en) * 1999-02-27 2003-08-12 Emdadur Khan System and method for audio-only internet browsing using a standard telephone
WO2001016936A1 (en) * 1999-08-31 2001-03-08 Accenture Llp Voice recognition for internet navigation
US7590538B2 (en) 1999-08-31 2009-09-15 Accenture Llp Voice recognition system for navigating on the internet
WO2001043388A3 (en) * 1999-12-10 2002-04-04 Deutsche Telekom Ag Communication system and method for establishing an internet connection by means of a telephone
WO2001043388A2 (en) * 1999-12-10 2001-06-14 Deutsche Telekom Ag Communication system and method for establishing an internet connection by means of a telephone
EP1134948A3 (en) * 2000-03-15 2003-04-23 Nec Corporation Information search system using radio portable terminal
EP1134948A2 (en) * 2000-03-15 2001-09-19 Nec Corporation Information search system using radio portable terminal
US7805145B2 (en) 2000-03-15 2010-09-28 Nec Corporation Information search system using radio portable terminal
US6662163B1 (en) * 2000-03-30 2003-12-09 Voxware, Inc. System and method for programming portable devices from a remote computer system
JP2002091756A (en) * 2000-06-15 2002-03-29 Internatl Business Mach Corp <Ibm> System and method for simultaneously providing a large number of acoustic information sources
EP1168799A2 (en) * 2000-06-30 2002-01-02 Fujitsu Limited Data processing system with vocalisation mechanism
EP1168799A3 (en) * 2000-06-30 2005-12-14 Fujitsu Limited Data processing system with vocalisation mechanism
DE10201623C1 (en) * 2002-01-16 2003-09-11 Mediabeam Gmbh Method for data acquisition of data made available on an Internet page and method for data transmission to an Internet page
US6741681B2 (en) 2002-01-16 2004-05-25 Mediabeam Gmbh Method for acquisition of data provided on an internet site and for data communication to an internet site
DE102010001564A1 (en) 2010-02-03 2011-08-04 Bayar, Seher, 51063 A method and computer program product for automated configurable acoustic reproduction and editing of website content
WO2011095457A2 (en) 2010-02-03 2011-08-11 Bayar, Seher Method and computer program product for automated configurable acoustic reproduction and processing of internet site content

Similar Documents

Publication Publication Date Title
US6532444B1 (en) Network interactive user interface using speech recognition and natural language processing
US6434524B1 (en) Object interactive user interface using speech recognition and natural language processing
US6604075B1 (en) Web-based voice dialog interface
KR100661687B1 (en) Web-based platform for interactive voice response (IVR)
US6282511B1 (en) Voiced interface with hyperlinked information
US20180007201A1 (en) Personal Voice-Based Information Retrieval System
US6188985B1 (en) Wireless voice-activated device for control of a processor-based host system
US8046228B2 (en) Voice activated hypermedia systems using grammatical metadata
US20020087328A1 (en) Automatic dynamic speech recognition vocabulary based on external sources of information
US20030144846A1 (en) Method and system for modifying the behavior of an application based upon the application's grammar
US5884262A (en) Computer network audio access and conversion system
US20020077823A1 (en) Software development systems and methods
US20020010715A1 (en) System and method for browsing using a limited display device
US8566102B1 (en) System and method of automating a spoken dialogue service
US20060235694A1 (en) Integrating conversational speech into Web browsers
WO1997032427A9 (en) Method and apparatus for telephonically accessing and navigating the internet
JPH08335160A (en) System for making video screen display voice-interactive
WO1997032427A1 (en) Method and apparatus for telephonically accessing and navigating the internet
GB2407657A (en) Automatic grammar generator comprising phase chunking and morphological variation
AU2001251354A1 (en) Natural language and dialogue generation processing
WO1998035491A1 (en) Voice-data interface
JP2009187349A (en) Text correction support system, text correction support method and program for supporting text correction
Brown et al. Web page analysis for voice browsing
EP0958692A1 (en) Voice-data interface
US20030091176A1 (en) Communication system and method for establishing an internet connection by means of a telephone

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 09043165

Country of ref document: US

AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM GW HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
ENP Entry into the national phase

Ref country code: JP

Ref document number: 1998 533971

Kind code of ref document: A

Format of ref document f/p: F

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 1998900943

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1998900943

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642