WO1998035491A1 - Voice-data interface

Info

Publication number: WO1998035491A1
Application number: PCT/GB1998/000194
Authority: WO
Inventors: Robert Denis Johnston
Original assignee: British Telecommunications Public Limited Company
Priority date: 1997-02-05
Filing date: 1998-01-22
Publication date: 1998-08-13

Abstract

A page of text from a database (4) has certain words marked with the addresses of other, linked pages. The text is received at (10) and converted into an audio signal by a speech synthesiser (15) so that it can be heard by a user. The user's spoken responses are fed to a speech recogniser (19) so that an address associated with a marked word in which the user is interested can be returned to the database (4) for retrieval of the corresponding linked page. Because the user will not necessarily know which words are marked, the recogniser is set up to match the user's response against the whole of the text fed to the synthesiser to identify the words in the text giving the best match. A resolver (20) finds the nearest marked word to the words identified and extracts the associated link address.

Description

VOICE-DATA INTERFACE

The present application is concerned with voice-interactive access to text-based services. According to one aspect of the invention there is provided an interface for a voice interactive service comprising: a speech synthesiser to receive coded signals representing sequences of words and to generate audio signals corresponding thereto for output; speech recognition means connected to receive the said coded signals and operable upon receipt of a speech signal to be recognised to identify that part of the word sequence represented by the coded signals which most resembles the speech signal to be recognised.

In another aspect the invention provides a method of operating a voice interactive service comprising

(a) receiving coded signals representing a sequence of words and synthesising audio signals corresponding thereto for output;

(b) receiving a speech signal and identifying by means of a speech recogniser that part of the word sequence represented by the coded signals which most resembles the received speech signal; and

(c) using the recognition result to select a further sequence of words.

Other aspects of the invention are set out in the claims.

Some embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings.

In Figure 1 an apparatus 1 for providing a voice-interactive service is shown and in this example it is intended to allow a user to access a text-based information service by voice only, using a telephone 2. Although the apparatus 1 could be located at the user's premises or at the location of the text-based information service, in this example it is located at a telephone exchange or other central location where it can be accessed by many users (at different times or - with duplication of its functions - simultaneously) via a telecommunications link such as a PSTN dialled connection 3. The information service is provided by a remote database server 4 which contains (or forms a gateway offering access to) stored pages of textual information - though the database could if desired be incorporated into the apparatus 1. Here we suppose that the server is part of a network accessible via a telecommunications link 5, such as the Internet, and responds to addresses transmitted to it by sending a document identified by that address.

Documents provided by the Internet are commonly formatted according to the hypertext markup language (HTML) which is itself a particular example of the standard generalised markup language according to international standard ISO 8879. As well as containing text characters forming the words of the text, an HTML document also contains formatting information suggesting the appearance of the document when displayed on a screen (or printed) such as position, font size, italics and so forth. The precise details of these are not important for present purposes; one thing that is of significance however is that these documents also have provision for flagging words or phrases as associated with the address of another document. Part of such a document is illustrated in Figure 2a with its displayed appearance shown in Figure 2b. It is seen that this format and control information is enclosed with chevrons "< >" as delimiters, not being intended for display. The text "Patent Office Sites" is to be shown in bold type as indicated by the start and finish codes <b> and </b>. The text "US Patent and Trademark Office" is flanked by <a> and </a> delimiters which normally cause the text to be displayed in a distinctive manner - a special colour or underlined, for example - to identify this phrase as representing a link. Moreover the <a> code contains an associated address "http://www.uspto.gov" which is the address of the Internet page of the US Patent and Trademark Office. When a user with a visual display terminal receives such a document and wishes to select the USPTO page, he uses a pointing device such as a mouse to point to the underlined phrase, causing the terminal to extract the associated address and transmit it for selection of a new document.

The function of the apparatus 1 of Figure 1 is, in brief, as follows:

(a) to receive HTML documents from the server 4;

(b) to synthesise an audio signal reciting the text contained in the document and transmit it via the line 3 to the user at 2;

(c) to recognise spoken replies from the user;

(d) to recognise the replies from the user as indicating a selection of a further document;

(e) to transmit the address of that document to the server 4.

Figure 3 shows the apparatus 1 in more detail. It contains a network interface 10 which comprises a modem for connection to the link 5, and a processor programmed with software to transmit addresses via the modem to the server and receive documents from the server. This software differs from conventional browser software such as Netscape only in that (a) it receives addresses via a connection 11 rather than having them typed in at a keyboard or selected using a mouse and (b) it outputs the received text directly to a file or buffer 12 which can be accessed via a connection 13.

Suppose that a document has been received by the interface 10 and is stored in the buffer 12. A first portion of text is read out and a correspondingly coded signal is output on the line 13. The actual amount of the text output could rely on punctuation characters included in the text, for example up to the first (or second etc.) full stop, or up to the first paragraph mark. This is received by a text pre-processing unit 14 which serves to delete unwanted control information, and forward it to a conventional text-to-speech synthesiser 15. This produces an audio signal corresponding to the portion of text, which is transmitted over the telephone line 3 to the user at 2.
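The punctuation-based read-out from the buffer 12 can be illustrated with a short Python sketch. It is only one possible reading of the passage: the function name `next_portion`, the use of a blank line as the "paragraph mark", and the `sentences` parameter are assumptions made for illustration and do not come from the specification.

```python
def next_portion(text: str, start: int = 0, sentences: int = 1) -> tuple[str, int]:
    """Return the next portion of text and the position to resume from.

    The portion runs up to the `sentences`-th full stop, or up to the
    first paragraph mark (assumed here to be a blank line), whichever
    comes first.
    """
    # A paragraph mark ends the portion early.
    para = text.find("\n\n", start)
    limit = para if para != -1 else len(text)

    pos = start
    for _ in range(sentences):
        stop = text.find(".", pos, limit)
        if stop == -1:          # no further full stop before the limit
            pos = limit
            break
        pos = stop + 1

    portion = text[start:pos].strip()

    # Skip whitespace so that the next call starts on a word.
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return portion, pos
```

Calling `next_portion` repeatedly, passing back the returned position each time, walks through the document one portion at a time, which is roughly what a "More" command would trigger.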

The portion of text is also copied to a buffer 16. This is shown as coming from a second output of the pre-processing unit 14, since whilst the unit 14 removes from the text sent to the synthesiser 15 all format and control information (i.e. the characters < and > and anything within them), the text sent to the buffer 16 still includes the link address commands (e.g. <a href="http://www.epo.co.at/epo"> .... </a>) but omits all other formatting and control information.

If desired, one could allow selected markings to pass to the synthesiser, for example so that bold type could be more heavily stressed, but this is entirely optional.
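As a rough illustration of the two outputs of the pre-processing unit 14, the following Python sketch strips every chevron-delimited code for the synthesiser and keeps only the <a> ... </a> link codes for the buffer 16. The function names and the regular expressions are assumptions made for this example; real HTML would need a proper parser rather than regular expressions.

```python
import re

# Any chevron-delimited format or control code.
TAG = re.compile(r"<[^>]*>")
# Opening or closing link codes only, e.g. <a href="..."> or </a>.
ANCHOR = re.compile(r"</?a(\s[^>]*)?>", re.IGNORECASE)


def for_synthesiser(html: str) -> str:
    """Text for the synthesiser 15: all format and control codes removed."""
    return TAG.sub("", html)


def for_link_buffer(html: str) -> str:
    """Text for the buffer 16: link codes kept, all other codes removed."""
    def keep_links(match: re.Match) -> str:
        code = match.group(0)
        return code if ANCHOR.fullmatch(code) else ""
    return TAG.sub(keep_links, html)


fragment = ('<b>Patent Office Sites</b> '
            '<a href="http://www.uspto.gov">US Patent and Trademark Office</a>')
print(for_synthesiser(fragment))
# Patent Office Sites US Patent and Trademark Office
print(for_link_buffer(fragment))
# Patent Office Sites <a href="http://www.uspto.gov">US Patent and Trademark Office</a>
```

Passing selected markings such as <b> through to the synthesiser would amount to widening the set of codes that the first function leaves in place.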

Although the link addresses are stored in the buffer 16, they are removed by further text processing 17 before forwarding the text to a recognition network generator 18 which is connected to a speech recogniser 19.

The recogniser 19 is connected to receive audio signals from the telephone line 3, so that responses from the user at 2 may be recognised. The recogniser may have permanent programming to enable it to recognise some standard command words for control of the system; however its primary purpose is to match the user's response to the source text which has just been spoken by the synthesiser 15; more particularly to identify that part of the source text present in the buffer 16 which most closely resembles the user's response.

Thus the function of the recognition network generator 18 is to derive, from the text input to it, parameters for the recogniser defining a vocabulary and grammar corresponding to this task.

In this example, it is assumed that the output of the recogniser is a text string corresponding to the matched portion of text (or command word). This output representing the user's response is taken to be a request for a further document, and the next task is to identify this by locating the text string in the buffer 16 and returning the link address contained within it; or if there is none, returning the nearest link address stored in the buffer. This function (to be discussed in more detail below) is performed by a link resolve unit 20 which outputs the link address to the interface 10, which transmits it to the database server 4 as a request for a further document. If however the link represents a position in the current document, then this is recognised and a command issued to the buffer 12 to read text from a specified point.

Control functions - for example if the user wishes to move on to the next (or preceding) paragraph of the document currently stored in the buffer 12, or to return to some default document, or to terminate the connection - could be performed using the telephone keypad, but are preferably achieved by designating certain words as control words (e.g. More, Back, Home, Quit) stored as a permanent vocabulary in the recogniser 19 and received by a control unit 21 which, upon receiving one of these words alone, issues appropriate instructions to the buffer 12 and/or interface 10.

By way of further explanation of the operation of the apparatus, and in particular of the link resolver 20, consider a situation where the buffer 12 is loaded with a document as shown in Figure 4A. The appearance of this document were it to be displayed on a visual display unit is as in Figure 4B.

Suppose that the buffer 12 is set up to output one paragraph at a time; suppose further that the user has already heard the title and asked for "More". The buffer 12 then outputs the next paragraph ("Welcome ... forests") to the text preprocessor as shown in Figure 4C.

Suppose now the user says "the Amazon basin". The recogniser 19 matches the speech signal and outputs the text string "Amazon basin", whereupon the link resolver 20 searches in the buffer 16 for this text string, finds that it is attached to the link address "http://www/amazon.basin", reads out this address and forwards it to the interface 10 which transmits it to the database server 4 to call up another page.

Naturally the user cannot know which expressions have link addresses attached and to cater for the possibility of him/her uttering some other words, the link resolver operates according to the flowchart shown in Figure 5. In a first test 30, it is determined whether the matched source text is, or contains, a link. "Amazon basin", "birds in the Amazon basin" or even "basin many of" would pass this test. In this case, the link address in question is chosen at 31. Otherwise a second test 32 is performed to establish whether the matched source text lies in a sentence which contains a link; "one thousand species" for example would fall into this category. In this case the address in that sentence (or, if more than one, the one nearest to the matched source text) is chosen. Otherwise the nearest link to the matched source text is chosen, for example by counting the number of words (or the number of characters) from the matched text to the next link above and below it in the buffer, and choosing the link with the lower count. A more complex algorithm could examine the nearest links above and below the matched text for the degree of semantic similarity to the matched text and choose the more similar. In a refinement, one could weight this choice to take account of punctuation, for example by increasing by (e.g.) 10 words the count when crossing a paragraph boundary.
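The three-stage decision of Figure 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the flowchart itself: the names `Link`, `parse_buffer` and `resolve` are hypothetical, sentences are assumed to end at a full stop, the match is located by a simple case-sensitive substring search, and the paragraph-boundary weighting and semantic-similarity refinements are left out.

```python
import re
from typing import NamedTuple, Optional


class Link(NamedTuple):
    start: int      # offset of the first character of the link text
    end: int        # offset one past the last character of the link text
    address: str


ANCHOR = re.compile(r'<a\s+href="([^"]*)"\s*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def parse_buffer(marked: str) -> tuple[str, list[Link]]:
    """Split buffer contents into plain text and the link spans within it."""
    parts, links, pos, length = [], [], 0, 0
    for m in ANCHOR.finditer(marked):
        before = marked[pos:m.start()]
        parts.append(before)
        length += len(before)
        links.append(Link(length, length + len(m.group(2)), m.group(1)))
        parts.append(m.group(2))
        length += len(m.group(2))
        pos = m.end()
    parts.append(marked[pos:])
    return "".join(parts), links


def word_gap(text: str, start: int, end: int, link: Link) -> int:
    """Count the words lying between the matched text and a link."""
    between = text[link.end:start] if link.end <= start else text[end:link.start]
    return len(between.split())


def resolve(marked: str, matched: str) -> Optional[str]:
    """Choose a link address for the text string returned by the recogniser."""
    text, links = parse_buffer(marked)
    start = text.find(matched)
    if start == -1 or not links:
        return None
    end = start + len(matched)

    # First test: the matched text is, or contains part of, a link.
    for link in links:
        if link.start < end and start < link.end:
            return link.address

    # Second test: the matched text lies in a sentence containing a link.
    left = text.rfind(".", 0, start) + 1
    right = text.find(".", end)
    right = len(text) if right == -1 else right + 1
    in_sentence = [l for l in links if left <= l.start and l.end <= right]

    # Otherwise fall back to the nearest link anywhere in the buffer.
    candidates = in_sentence or links
    return min(candidates, key=lambda l: word_gap(text, start, end, l)).address
```

With an invented paragraph in which "Amazon basin" carries a link, `resolve(paragraph, "basin many of")` would return that link's address through the first test, while a phrase elsewhere in the same sentence would reach it through the second.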

The HTML language also permits links to other parts of the current document - as shown in Figure 4A for the British Wildlife Society. Upon recognition of this name by the recogniser, the address "#3224" would be recognised by the link resolver as an internal address and forwarded not to the interface 10 but to the buffer 12 to cause readout of a paragraph from a point in the document specified by the address.

The operation of the recognition network generator 18 may now be discussed further. There are essentially two components to the setting up of a recogniser for a given function. First, defining its vocabulary, and second, defining its grammar. The vocabulary is a question of ensuring that the recogniser has a set of models or templates, typically one for each of the words to be recognised - that is, one for each of the words (other than link addresses) present in the buffer 16. Vocabulary generation for this purpose may use any of the conventional methods. Typically this is done by using a recogniser preprogrammed with a set of sub-word models (e.g. one per phoneme) and processing each word delivered from the buffer, in similar manner to the operation of a text-to-speech synthesiser, to generate a word template by concatenation of the appropriate sub-word models. Alternatively the recogniser may have a standard store of word models which can be retrieved when the corresponding words are received from the buffer 16, though to accommodate proper names and other words not in the standard set the sub-word concatenation method would usually be employed as well.
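The vocabulary step can be pictured with a toy Python sketch. Everything in it is a stand-in: `WORD_MODELS` plays the part of the standard store of word models, and `concatenate_subword_models` uses one placeholder unit per letter where a real recogniser would use phoneme models derived by text-to-speech style rules. Only the overall flow (one template per distinct word, link addresses excluded, stored model preferred, concatenation as fallback) follows the description.

```python
import re

# Hypothetical standard store of ready-made word models (opaque placeholders).
WORD_MODELS = {
    "the": ["<word:the>"],
    "in": ["<word:in>"],
}


def concatenate_subword_models(word: str) -> list[str]:
    """Stand-in for sub-word concatenation: one placeholder unit per letter."""
    return [ch.upper() for ch in word if ch.isalpha()]


def build_vocabulary(buffer_text: str) -> dict[str, list[str]]:
    """Return one template per distinct word in the buffer text."""
    # Remove link codes so that addresses never enter the vocabulary.
    plain = re.sub(r"<[^>]*>", " ", buffer_text)
    vocabulary: dict[str, list[str]] = {}
    for word in re.findall(r"[A-Za-z']+", plain):
        key = word.lower()
        if key not in vocabulary:
            # Prefer a stored word model; otherwise build one from sub-word units.
            vocabulary[key] = WORD_MODELS.get(key) or concatenate_subword_models(key)
    return vocabulary
```

A proper name such as "Amazon" is absent from the small store above, so it would receive a concatenated template, which mirrors the reason given for keeping the sub-word method available.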

The grammar of a recogniser is a set of stored parameters which define what word sequences are permissible; for example, considering the buffer contents shown in Figure 4A, whilst "Amazon basin" is a word sequence which is useful to recognise, "basin Amazon" is not. One possibility is to allow (as sequences for matching against the user's utterance) any number of words from 1 upwards, but only in the sequence in which they appear in the buffer. Figure 6 shows this represented graphically (for a portion only of the text) where 40 represents a start node of a recognition "tree", 41 represents an end node, 42 represents word models and the lines 43 represent allowable paths.

It would be possible to include a network of 'carrier phrases' as shown in Figure 7 so that the user could say sentences such as "Tell me more about the Amazon Basin please". Alternatively a garbage or sink model (Fig. 8) could be included at the beginning and end of the network to allow any speech to surround the echoed phrase. In another embodiment the recogniser could simply allow any of the words on the page to be uttered in any order as shown in Figure 9. The accuracy of such a recogniser would not be as high as those shown in Figures 6 to 8, but if statistical constraints based on the contents of the HTML page were incorporated in the recognition process a working system could be created.
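The in-order grammar (any run of one or more consecutive words from the buffer) can be modelled as a small network of arcs. The Python sketch below is an assumption-laden illustration: it works on already-recognised word strings rather than on acoustic models, and the names `build_network` and `accepts` are invented for the example. Carrier phrases and garbage models are not included.

```python
START, END = "START", "END"


def build_network(words: list[str]) -> dict:
    """Arcs of an in-order network: enter at any word, step to the next
    word in text order, and leave after any word."""
    arcs = {START: set(range(len(words)))}
    for i in range(len(words)):
        arcs[i] = {END}
        if i + 1 < len(words):
            arcs[i].add(i + 1)
    return arcs


def accepts(words: list[str], arcs: dict, utterance: list[str]) -> bool:
    """True if the utterance follows some allowable path from START to END."""
    states = {START}
    for spoken in utterance:
        states = {
            nxt
            for state in states
            for nxt in arcs[state]
            if nxt != END and words[nxt] == spoken
        }
        if not states:
            return False
    return any(END in arcs[state] for state in states)


text_words = "there are many birds in the amazon basin".split()
network = build_network(text_words)
print(accepts(text_words, network, ["amazon", "basin"]))   # True
print(accepts(text_words, network, ["basin", "amazon"]))   # False
```

Allowing words in any order, as in the last embodiment mentioned, would correspond to adding an arc from every word to every other word, at the cost of a far less constrained search.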

Returning briefly to Figure 3, in this embodiment it has been assumed that the recogniser returns, as a "label" representing its recognition result, the relevant part of the actual text string supplied to the recognition network generator 18 by the buffer 16, and the link resolver 20 matches this string against the buffer contents to locate the desired links. Whilst this may be convenient to permit use of a conventional unit for the recogniser 19, a way of speeding up the operation of the link resolver would be to set up the recogniser to return some parameter enabling faster access to the buffer, for example pointer values giving the addresses in the buffer 16 of the first and last characters of the matched source text string.
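One way to picture the pointer-based alternative is to record, for each word handed to the network generator, where it sits in the buffer, so that a recognised run of words maps straight to character positions without a string search. The sketch below is hypothetical: the function names are invented, and the end pointer follows Python's slice convention (one past the last character) rather than pointing at the last character itself.

```python
import re


def word_offsets(text: str) -> list[tuple[int, int]]:
    """(start, end) character offsets of each word in the buffer text."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def pointers_for_match(offsets: list[tuple[int, int]],
                       first_word: int, last_word: int) -> tuple[int, int]:
    """Buffer pointers for a recognised run of words, given word indices."""
    return offsets[first_word][0], offsets[last_word][1]


buffer_text = "birds in the Amazon basin"
offsets = word_offsets(buffer_text)
start, end = pointers_for_match(offsets, 3, 4)
print(buffer_text[start:end])   # Amazon basin
```

A link resolver given such pointers could compare them directly with stored link positions instead of searching the buffer for the returned string.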

Although only one server is shown in Figure 1, of course there could be others, and the transmitted link address could well be destined for a different server from the one sending the document from which it was obtained.

This embodiment presupposes that the source text carries hyperlink addresses; however it is also possible to operate this system without embedded addresses of this kind. For example one could transmit to the database server coordinates to identify the point in (or a range of) the source text at which the match occurred. In the case of a connectionless service such as the Internet, it would be necessary to concatenate this information with the address of the server before transmitting it.

It was mentioned earlier that the text preprocessor 14 could be arranged to pass certain markings through to the synthesiser 15 to allow bold type to be emphasised. Similarly, it would be possible for the preprocessor to pass the hyperlink markings <a> ... </a> (albeit without the addresses) and arrange the synthesiser to respond to these by applying an emphasis, or even switching to a different voice (for example a male instead of a female voice) from that used for the remainder of the text. With this expedient, in an alternative embodiment, one can simplify the speech recogniser vocabulary to include only the link words, though it is still preferred to operate the recogniser as described above, against the possibility that the user may not always accurately recollect which words were spoken with the emphasis (or different voice).

Claims

1. An interface for a voice interactive service comprising: a speech synthesiser to receive coded signals representing sequences of words and to generate audio signals corresponding thereto for output; speech recognition means connected to receive the said coded signals and operable upon receipt of a speech signal to be recognised to identify that part of the word sequence represented by the coded signals which most resembles the speech signal to be recognised.

2. An interface according to Claim 1 in which the coded signals include link signals identifying one or more words of a sequence which represent links to further information, and the apparatus is operable to select from the coded signals a link signal which is in or adjacent to the identified resembling part of the sequence.

3. An interface according to Claim 2 including a communications interface connected to receive the coded signals from a remote source and to transmit the selected link signal to the same or another remote source for requesting further coded signals.

4. An interface according to Claim 2 or 3 including a buffer for storing the coded signals, wherein:

(a) the interface is so operable that the speech synthesiser can generate audio signals corresponding to a portion only of the coded signals stored in the buffer and the recogniser thereupon identifies that part of the word sequence which is represented by said portion of the coded signals which part most resembles the speech signal to be recognised; and

(b) the interface includes control means responsive to a link signal which identifies a further portion of the coded signals stored in the buffer to transmit that further portion to the synthesiser and recognition means.

5. An interface according to any one of the preceding claims including a telephone line interface whereby the generated audio signals and received speech signals may respectively be sent to and received from a remote user.

6. An interface for a voice interactive service as herein described with reference to the accompanying drawings.

7. A method of operating a voice interactive service comprising

(a) receiving coded signals representing a sequence of words and synthesising audio signals corresponding thereto for output;

(b) receiving a speech signal and identifying by means of a speech recogniser that part of the word sequence represented by the coded signals which most resembles the received speech signal; and

(c) using the recognition result to select a further sequence of words.

8. A method according to Claim 7 in which the coded signals include link signals identifying one or more words of a sequence which represent links to further information, and step (c) includes selecting from the coded signals a link signal which is in or adjacent the identified resembling part of the sequence.

9. An interface for a voice interactive service comprising: a speech synthesiser to receive coded signals representing sequences of words and to generate audio signals corresponding thereto for output, in which the coded signals include link signals identifying one or more words of a sequence which represent links to further information, the synthesiser being responsive to receipt of the link signals to utter the words so identified in a different manner from words not so identified; and speech recognition means connected to receive at least those of the coded signals which represent link-representing words and operable upon receipt of a speech signal to be recognised to identify which of the link-representing words most resemble the speech signal to be recognised.