WO2003098373A2 - Voice authentication - Google Patents

Voice authentication

Info

Publication number
WO2003098373A2
Authority
WO
WIPO (PCT)
Prior art keywords
feature vectors
recorded signal
user
smart card
voice authentication
Application number
PCT/GB2003/002246
Other languages
French (fr)
Other versions
WO2003098373A3 (en)
Inventor
Timothy Phipps
John H Robson
Original Assignee
Domain Dynamics Limited
Application filed by Domain Dynamics Limited filed Critical Domain Dynamics Limited
Priority to AU2003230039A priority Critical patent/AU2003230039A1/en
Publication of WO2003098373A2 publication Critical patent/WO2003098373A2/en
Publication of WO2003098373A3 publication Critical patent/WO2003098373A3/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L17/24 Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Definitions

  • the present invention relates to voice authentication.
  • Voice authentication may be defined as a process in which a user's identity is validated by analysing the user's speech patterns. Such a process may be used for controlling access to a system, such as a personal computer, cellular telephone handset or telephone banking account.
  • voice authentication is known from voice recognition systems. Examples of voice recognition systems are described in US-A-4956865, US-A-507939, US-A-5845092 and WO-A-0221513.
  • the present invention seeks to provide voice authentication.
  • a token storing a voice authentication biometric.
  • the token may be suitable for possession by the user and may be small enough to be kept on a user, worn by the user as jewellery or kept in a pocket of an article of clothing worn by them.
  • the token may be an information storage medium or device.
  • a smart card storing a voice authentication biometric.
  • Storing a voice authentication biometric on a smart card can help validate that a smart card user is the smart card owner.
  • the voice authentication biometric may be suitable for use in authenticating a user using a sample of speech from the user.
  • the voice authentication biometric may include at least one set of feature vectors, such as an archetype set of feature vectors.
  • the voice authentication biometric may include at least one prompt, each prompt associated with a respective set of feature vectors.
  • the voice authentication biometric may include corresponding statistical information relating to each set of feature vectors.
  • the voice authentication biometric may include data for controlling authentication procedure, data for determining authentication and/or data for configuring a voice authentication apparatus.
  • the voice authentication biometric may be encrypted.
  • the token or smart card may include non-volatile memory storing the voice authentication biometric.
  • the token or smart card may store a computer program comprising program instructions for causing a computer to perform a matching process for use in voice authentication.
  • a token for voice authentication including a processor, the token storing a voice authentication biometric including a first set of feature vectors and a computer program comprising program instructions for causing the processor to perform a method, the method comprising receiving a second set of feature vectors and comparing the first and second set of feature vectors.
  • a smart card for voice authentication including a processor, the smart card storing a voice authentication biometric including a first set of feature vectors and storing a computer program comprising program instructions for causing the processor to perform a method, the method comprising receiving a second set of feature vectors and comparing the first and second set of feature vectors.
  • the computer program may comprise program instructions for causing the processor to perform a method, the method comprising requesting a user to provide a spoken response.
  • the computer program may comprise program instructions for causing the processor to perform a method, the method comprising receiving a recorded signal including a recorded signal portion corresponding to a spoken response.
  • the computer program may comprise program instructions for causing the processor to perform a method, the method comprising determining endpoints of the recorded signal portion corresponding to a spoken response.
  • the computer program may comprise program instructions for causing the processor to perform a method, the method comprising deriving the second set of feature vectors for characterising the recorded signal portion.
  • the computer program may comprise program instructions for causing the processor to perform a method, the method comprising producing a score dependent upon a degree of matching between the first and second set of feature vectors.
  • the computer program may comprise program instructions for causing the processor to perform a method, the method comprising comparing the score with a predefined threshold so as to determine authentication of a user.
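  • By way of illustration only (not code from the patent), the on-card matching step described above can be sketched as follows; the distance measure, the equal-length assumption and the threshold value are assumptions made for this sketch:

```python
def frame_distance(u, v):
    # Euclidean distance between two feature vectors (e.g. twelve mel-cepstral coefficients).
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def match_score(stored_vectors, received_vectors):
    # Compare the first (stored) set of feature vectors with the second (received) set.
    # The sets are assumed here to be time-aligned and of equal length; the patent
    # removes timing differences with dynamic time warping before scoring.
    n = min(len(stored_vectors), len(received_vectors))
    return sum(frame_distance(a, b)
               for a, b in zip(stored_vectors[:n], received_vectors[:n])) / n

def authenticate(stored_vectors, received_vectors, threshold=25.0):
    # A lower score means a closer match; the threshold value here is arbitrary.
    return match_score(stored_vectors, received_vectors) <= threshold
```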
  • a method of voice authentication comprising in a token or smart card providing a first set of feature vectors, receiving a second set of feature vectors for characterising a recorded signal portion and comparing the first and second sets of feature vectors.
  • the method may further comprise providing data relating to a prompt.
  • the method may further comprise receiving a recorded signal including a recorded signal portion corresponding to a spoken response.
  • the method may further comprise determining endpoints of the recorded signal portion.
  • the method may further comprise deriving the second set of feature vectors for characterising the recorded signal portion.
  • the method may further comprise producing a score dependent upon a degree of matching between the first and second set of feature vectors.
  • the method may further comprise comparing the score with a predefined threshold so as to determine authentication of a user.
  • the method may further comprise receiving a recorded signal which includes a recorded signal portion corresponding to a spoken response and which includes a plurality of frames, determining endpoints of the recorded signal including determining whether a value of energy for a first frame exceeds a first predetermined value and determining whether a second frame immediately preceding the first frame represents a spoken utterance portion.
  • the method may further comprise requesting the authenticating user to provide first and second spoken responses to the prompt, obtaining a recorded signal including first and second recorded signal portions corresponding to the first and second spoken responses, isolating the first and second recorded signal portions, deriving second and third sets of feature vectors for characterising the first and second isolated recorded signal portions respectively, comparing the second set of feature vectors with the third set of feature vectors so as to produce a score dependent upon the degree of matching; and comparing the score with a predefined threshold so as to determine whether the first set of feature vectors is substantially identical to the second set of feature vectors.
  • the method may further comprise requesting a user to provide a plurality of spoken responses to a prompt, obtaining a plurality of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response, deriving a plurality of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion, comparing the sets of feature vectors with the first set of feature vectors so as to produce a plurality of scores dependent upon a degree of matching and determining whether authentication is successful in dependence upon the plurality of scores.
  • the method may further comprise receiving a recorded signal which includes a recorded signal portion, determining endpoints of the recorded signal by dynamic time warping the second set of feature vectors onto the first set of feature vectors, including determining a first sub-set of feature vectors within the second set of feature vectors from which a dynamic time warping winning path may start and determining a second sub-set of feature vectors within the second set of feature vectors at which the dynamic time warping winning path may finish.
  • Endpointing seeks to locate a start and stop point of a spoken utterance.
  • the present invention seeks to provide an improved method of endpointing.
  • a method of determining an endpoint of a recorded signal portion in a recorded signal including a plurality of frames comprising determining whether a value of energy for a first frame exceeds a first predetermined value and determining whether a second frame immediately preceding the first frame represents a spoken utterance portion.
  • the first predetermined value may represent a value of energy of a frame comprised of background noise.
  • the method may comprise defining a start point if the value of energy of the first frame exceeds the first predetermined value and the second frame does not represent a spoken utterance portion.
  • the method may further comprise indicating that the first frame represents a spoken utterance portion.
  • the method may comprise defining a stop point if the value of energy of the first frame does not exceed the first predetermined value and the second frame represents a spoken utterance portion.
  • the method may comprise defining the first frame as not representing a spoken utterance portion.
  • the method may comprise counting a number of frames preceding a start point of the spoken utterance portion.
  • the method may further comprise pairing the stop point with the start point of the spoken utterance portion if the number of frames exceeds a predetermined number.
  • the method may further comprise pairing the stop point with start point of a preceding spoken utterance portion if the number of frames does not exceed a predetermined number.
  • the method may comprise determining whether the value of energy for a first frame exceeds a third predetermined value and counting a number of frames preceding a start point of the spoken utterance portion.
  • the method may further comprise defining a start point if the value of energy of the first frame exceeds the third predetermined value, the second frame does not represent a spoken utterance portion and if the number of frames does not exceed a predetermined number.
  • the method may further comprise determining whether a value of energy for a third frame following the first frame exceeds the second predetermined value.
  • the method may further comprise defining a stop point if the value of energy of the third frame does not exceed the third predetermined value.
  • the method may further comprise pairing the stop point with the start point of the spoken utterance portion.
  • the method may further comprise pairing the stop point with a start point of a preceding spoken utterance portion.
  • the method may comprise defining the first frame as representing background noise if the value of energy of the first frame does not exceed the third predetermined value.
  • the method may further comprise calculating an updated value of background energy using the value of energy of the first frame.
  • the method may further comprise counting a number of frames preceding a start point of the spoken utterance portion and determining whether the number of frames exceeds another, larger number.
  • the method may comprise determining whether a value of rate of change of energy of the first frame exceeds a second predetermined value.
  • the second predetermined value may represent a value of rate of change of energy of a frame comprised of background noise.
  • the method may further comprise defining a start point if the value of energy of the first frame exceeds the first predetermined value, and the value of rate of change of energy exceeds the second predetermined value and the second frame does not represent a spoken utterance portion.
  • the method may comprise defining a stop point if the value of energy of the first frame does not exceed the first predetermined value, and the value of rate of change of energy does not exceed the second predetermined value and the second frame represents a spoken utterance portion.
  • the method may comprise determining whether the value of rate of change of energy for the first frame exceeds a fourth predetermined value.
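  • A minimal sketch of the energy-based start/stop logic set out in the bullets above is given below; it covers only the first-threshold test and the "word" flag, and the multiplier value is an assumption (the patent also applies delta-energy tests and further thresholds for unvoiced sections):

```python
def find_endpoints(frame_energies, background_energy, k1=3.0):
    # frame_energies: one energy value per frame (timeslice) of the recorded signal.
    threshold = k1 * background_energy   # first predetermined value
    in_word = False                      # whether the preceding frame was a spoken utterance portion
    starts, stops = [], []
    for i, energy in enumerate(frame_energies):
        if energy >= threshold and not in_word:
            starts.append(i)             # define a start point
            in_word = True
        elif energy < threshold and in_word:
            stops.append(i)              # define a stop point
            in_word = False
    return starts, stops
```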
  • voice recognition and authentication systems use dynamic time warping to match a recording to a template.
  • a user may pause, cough, sigh or generate other sounds before or after providing a response to a prompt. These silences or sounds may be included in the recording. Thus, only a portion of the recording is relevant.
  • the present invention seeks to provide a solution to this problem.
  • a method of dynamic time warping for warping a first speech pattern characterised by a first set of feature vectors onto a second speech pattern characterised by a second set of feature vectors comprising identifying a first sub-set of feature vectors within the first set of feature vectors from which a dynamic time warping winning path starts and identifying a second sub-set of feature vectors within the first set of feature vectors at which the dynamic time warping winning path finishes.
  • the first speech pattern may include speech, background noise and/or silence.
  • the present invention seeks to provide a method of voice authentication.
  • a method of voice authentication comprising: enrolling a user including requesting the enrolling user to provide a spoken response to a prompt, obtaining a recorded signal including a recorded signal portion corresponding to the spoken response, determining endpoints of the recorded signal portion, deriving a set of feature vectors for characterising the recorded signal portion, averaging a plurality of sets of feature vectors, each set of feature vectors relating to one or more different spoken responses to the prompt by the enrolling user so as to provide an archetype set of feature vectors for the response, storing the archetype set of feature vectors together with data relating to the prompt; and authenticating a user including retrieving the data relating to the prompt and the archetype set of feature vectors, requesting the authenticating user to provide another spoken response to the prompt, obtaining another recorded signal including another recorded signal portion corresponding to the other spoken response, determining endpoints of the other recorded signal portion, deriving another set of feature vectors for characterising the other recorded signal portion, comparing the other set of feature vectors with the archetype set of feature vectors so as to produce a score dependent upon a degree of matching, and comparing the score with a predefined threshold so as to determine authentication of the user.
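  • As a rough illustration of the averaging step in the enrolment method above, the sketch below forms an archetype from several featuregrams; it assumes the featuregrams have already been endpointed and time-aligned to a common length (the patent uses dynamic time warping for the alignment):

```python
import numpy as np

def featuregram_archetype(featuregrams):
    # Each featuregram is a 2-D array of shape (frames, coefficients).
    # With the featuregrams already aligned, an element-wise mean gives the archetype.
    stacked = np.stack([np.asarray(fg, dtype=float) for fg in featuregrams])
    return stacked.mean(axis=0)
```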
  • Voice authentication systems typically include an amplifier. If a user provides a spoken response which is too quiet, then amplifier gain may be increased. Conversely, if a spoken response is too loud, then amplifier gain may be reduced. Usually, a succession of samples is taken and amplifier gain is increased or reduced accordingly until a settled value of amplifier gain is obtained. However, there is a danger that the amplifier gain rises and falls and never settles.
  • the present invention seeks to ameliorate this problem.
  • a method of gain control comprising a plurality of times determining whether an amplified signal level is above a predetermined limit, either decreasing gain if the amplified signal level is above the predetermined limit or maintaining gain otherwise, thereby permitting no increase in gain.
  • a method of gain control comprising a plurality of times determining whether an amplified signal level is below a predetermined limit, either increasing gain if the amplified signal level is below the predetermined limit or maintaining gain otherwise, thereby permitting no decrease in gain.
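  • A minimal sketch of the decrease-only variant follows; the helper names, step size and attempt count are assumptions, and the point of interest is that gain may only fall or stay, so it cannot oscillate:

```python
def settle_gain_decreasing(measure_level_db, set_gain, max_gain, min_gain,
                           limit_db, step_db=3.0, attempts=15):
    # measure_level_db() records a sample and returns its level in dB;
    # set_gain() programs the amplifier gain.
    gain = max_gain
    for _ in range(attempts):
        set_gain(gain)
        if measure_level_db() <= limit_db:
            return gain                       # settled: level is within the limit
        if gain <= min_gain:
            return None                       # still too loud even at the lowest gain
        gain = max(min_gain, gain - step_db)  # decrease only, never increase
    return None
```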
  • a potential threat to the security offered by any voice authentication system is the possibility of an impostor secretly recording a spoken response of a valid user and subsequently replaying a recording to gain access to the system. This is known as a "replay attack".
  • the present invention seeks to help detect a replay attack.
  • a method of voice authentication comprising requesting a user to provide first and second spoken responses to a prompt, obtaining a recorded signal including first and second recorded signal portions corresponding to the first and second spoken responses, isolating the first and second recorded signal portions, deriving first and second sets of feature vectors for characterising the first and second isolated recorded signal portions respectively, comparing the first set of feature vectors with the second set of feature vectors so as to produce a second score dependent upon the degree of matching and comparing the second score with another predefined threshold so as to determine whether the first set of feature vectors is substantially identical to the second set of feature vectors.
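  • The replay check above relies on the observation that two live repetitions of the same prompt are never exactly alike; a hedged sketch (the difference measure and the threshold value are illustrative assumptions, not the patent's scoring method) is:

```python
import numpy as np

def looks_like_replay(featuregram_1, featuregram_2, identity_threshold=1.0):
    # A near-zero difference between the two responses suggests the same recording
    # was played twice, i.e. the sets of feature vectors are substantially identical.
    a = np.asarray(featuregram_1, dtype=float)
    b = np.asarray(featuregram_2, dtype=float)
    n = min(len(a), len(b))
    score = np.abs(a[:n] - b[:n]).mean()
    return score < identity_threshold
```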
  • users may occasionally provide an uncharacteristic spoken response.
  • the present invention seeks to provide an improved method of dealing with uncharacteristic responses.
  • a method of voice authentication including requesting an authenticating user to provide a plurality of spoken responses to a prompt, obtaining a plurality of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response, deriving a plurality of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion, comparing the sets of feature vectors with an archetype set of feature vectors so as to produce a plurality of scores dependent upon a degree of matching and determining whether authentication is successful in dependence upon the plurality of scores.
  • a threshold score is usually generated.
  • the present invention seeks to provide an improved method of determining an authentication threshold score.
  • a method of determining an authentication threshold score including requesting a first set of users to provide respective spoken responses to a prompt, for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response, for each user, deriving a set of feature vectors for characterising the recorded signal portion, for each user, comparing the set of feature vectors with an archetype set of feature vectors for the user so as to produce a score dependent upon a degree of matching, fitting a first probability density function to the frequency of scores for the first set of users, requesting a second set of users to provide respective spoken responses to a prompt, for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response, for each user, deriving a set of feature vectors for characterising the recorded signal portion, for each user, comparing the set of feature vectors with an archetype set of feature vectors for a different user so as to produce a score dependent upon a degree of matching, fitting a second probability density function to the frequency of scores for the second set of users and determining the authentication threshold score in dependence upon the first and second probability density functions.
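  • One plausible reading of the threshold determination above is sketched below: fit a probability density function to the scores of users matched against their own archetypes and another to the scores of users matched against other users' archetypes, then take the score where the two densities meet. The normal fits and the crossing criterion are assumptions made for illustration:

```python
import numpy as np

def authentication_threshold(genuine_scores, impostor_scores):
    g_mu, g_sigma = np.mean(genuine_scores), np.std(genuine_scores)
    i_mu, i_sigma = np.mean(impostor_scores), np.std(impostor_scores)

    def pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    # Search for the crossing point between the two fitted densities,
    # assumed to lie between the two means.
    xs = np.linspace(min(g_mu, i_mu), max(g_mu, i_mu), 1000)
    gap = np.abs(pdf(xs, g_mu, g_sigma) - pdf(xs, i_mu, i_sigma))
    return float(xs[np.argmin(gap)])
```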
  • a method of averaging a plurality of sets of feature vectors comprising providing a plurality of sets of feature vectors, comparing each set of feature vectors with each other set of feature vectors so as to produce a respective set of scores dependent upon a degree of matching, searching for a minimum score and determining whether at least one score is below a predetermined threshold.
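  • A small sketch of that pairwise consistency check follows; compare() stands for whatever matching routine is in use (the patent uses dynamic time warping), and the acceptance rule below is an assumption:

```python
def consistent_enough(featuregrams, compare, accept_threshold):
    # Compare each featuregram with every other one and collect the scores.
    scores = [compare(featuregrams[i], featuregrams[j])
              for i in range(len(featuregrams))
              for j in range(i + 1, len(featuregrams))]
    if not scores:
        return False
    return min(scores) <= accept_threshold   # at least one score below the threshold
```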
  • a smart card for voice authentication comprising means for storing a first set of feature vectors and data relating to a prompt, means for providing the data to an external circuit, means for receiving a second set of feature vectors relating to the prompt, means for comparing the first and second set of feature vectors so as to determine a score; and means for comparing the score with a predetermined threshold.
  • a smart card for voice authentication comprising a memory for storing a first set of feature vectors and data relating to a prompt, an interface for providing the data to an external circuit and for receiving a second set of feature vectors relating to the prompt, a processor for comparing the first and second set of feature vectors so as to determine a score and for comparing the score with a predetermined threshold.
  • an information storage medium storing a voice authentication biometric.
  • the information storage medium is portable and may be for example a memory stick.
  • a computer program comprising program instructions for causing a smart card to perform a method, the method comprising retrieving from memory a first set of feature vectors, receiving a second set of feature vectors and comparing the first and second set of feature vectors.
  • a method comprising writing at least part of a voice authentication biometric to a smart card or token.
  • the at least part of a voice authentication biometric is a set of feature vectors.
  • a method comprising writing a computer program to a smart card or token, the computer program comprising computer instructions for performing a method, the method comprising performing voice authentication.
  • a method comprising writing at least part of a voice authentication biometric to a smart card or token and writing a computer program to said smart card or token, the computer program comprising computer instructions for performing a method, the method comprising performing voice authentication
  • a smart card for voice authentication including a processor, the smart card storing a computer program comprising program instructions for causing the processor to perform a method, the method comprising performing voice authentication.
  • a smart card reader/writer connected to apparatus for recording speech and generating feature vectors, said reader/writer being configured to transmit a set of feature vectors to a smart card or token and receive a response therefrom.
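  • A hedged sketch of the host-side exchange suggested above is given below: the reader/writer sends the freshly derived feature vectors to the card and receives a pass/fail response. The transmit() helper, the CLA/INS values and the chunked transfer are hypothetical and are not taken from the patent or its APDU table:

```python
def verify_on_card(transmit, featuregram_bytes):
    # transmit(apdu) is assumed to send a raw APDU to the card and
    # return a tuple (data, sw1, sw2).
    CLA, INS_LOAD, INS_VERIFY = 0x80, 0x20, 0x22   # hypothetical instruction codes
    chunk_size = 200
    for offset in range(0, len(featuregram_bytes), chunk_size):
        chunk = featuregram_bytes[offset:offset + chunk_size]
        apdu = bytes([CLA, INS_LOAD, 0x00, 0x00, len(chunk)]) + chunk
        _, sw1, sw2 = transmit(apdu)
        if (sw1, sw2) != (0x90, 0x00):
            raise IOError("card rejected feature vector block")
    _, sw1, sw2 = transmit(bytes([CLA, INS_VERIFY, 0x00, 0x00, 0x00]))
    return (sw1, sw2) == (0x90, 0x00)              # card reports the authentication result
```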
  • Figure 1 shows a voice authentication system 1 for performing a method of voice authentication
  • Figure 2 is a process flow diagram of a method of voice authentication
  • Figure 3 is a process flow diagram of a method of enrolment
  • Figure 4 is a process flow diagram of a method of calibration
  • Figure 5 is an analog representation of a recorded signal
  • Figure 6 is a generic representation of a recorded signal
  • Figure 7 is a digital representation of a recorded signal
  • Figure 8 illustrates dividing a recorded signal into timeslices
  • Figure 9 is a process flow diagram of a method of generating a featuregram
  • Figure 10 illustrates generation of a feature vector
  • Figure 11 illustrates generation of a featuregram from a plurality of feature vectors
  • Figure 12 shows first and second endpointing processes
  • Figure 14 is a process flow diagram of a method of explicit endpointing
  • Figure 15 illustrates determination of energy and delta energy values of a timeslice
  • Figure 16 shows pairing of a stop point with two start points
  • Figure 17 shows pairing of a stop point with a start point of a preceding section
  • Figure 18 is a process flow diagram of a method of detecting lip smack
  • Figure 19 shows pairing of a stop point with an updated start point for removing lip smack
  • Figure 20 illustrates a dynamic time warping process for word spotting
  • Figure 21 shows a warping function from a start point to an end point
  • Figure 22 illustrates a local slope constraint on a warping function
  • Figure 23 illustrates a global condition imposed on a warping function
  • Figure 24 is a process flow diagram of a method of finding a minimum distance for an optimised path from a start point to an end point representing matched speech patterns
  • Figure 25a shows an array following initialisation for holding a cumulative distance associated with a path from a start point to an end point
  • Figure 25b shows an array for holding a cumulative distance associated with a path from a start point to an end point
  • Figure 25c shows a completed array for holding a cumulative distance associated with a path from a start point to an end point including a winning path
  • Figure 26 shows a process flow diagram of a method of performing a plurality of sanity checks
  • Figure 27 illustrates creation of a speech featuregram
  • Figure 28 illustrates generation of a speech featuregram archetype
  • Figure 29 is a process flow diagram of a method of generating a speech featuregram archetype
  • Figure 30 illustrates generation of a featuregram cost matrix
  • Figure 31 shows a featuregram cost matrix
  • Figure 32 is a process flow diagram of a method of finding a minimum distance for an optimised path from a start point to an end point representing matched speech patterns
  • Figure 33 illustrates creation of featuregram archetypes using featuregrams
  • Figure 34 illustrates generation of a featuregram archetype cost matrix
  • Figure 35 shows a featuregram archetype cost matrix
  • Figure 36 shows a probability distribution function
  • Figure 37 shows a continuous distribution function
  • Figure 38 shows a voice authentication biometric
  • Figure 39 is a process flow diagram of a method of authentication
  • Figure 40 is an analog representation of an authentication recorded signal
  • Figure 41 illustrates dividing an authentication recorded signal into timeslices
  • Figure 42 illustrates generation of an authentication feature vector
  • Figure 43 illustrates generation of an authentication featuregram from a plurality of feature vectors
  • Figure 44 illustrates generation of endpoints
  • Figure 45 illustrates comparison of a featuregram archetype with an authentication featuregram
  • Figure 46 illustrates a featuregram including first and second spoken responses of the same prompt for detecting replay attack
  • Figure 47 is a process flow diagram of a method of detecting replay attack
  • Figure 48 shows a voice authentication system employing a smart card
  • Figure 49 illustrates a contact smart card
  • Figure 50 illustrates a contactless smart card
  • Figure 51 is a schematic diagram showing a smart card reader and a smart card
  • Figure 52 is an application protocol data unit (APDU) table
  • Figure 53 shows exchange of messages between a laptop and a smart card during template loading
  • Figure 54 shows a first exchange of messages between a laptop and a smart card during authentication
  • Figure 55 shows a second exchange of messages between a laptop and a smart card during authentication.
  • a voice authentication system 1 for performing a method of voice authentication is shown.
  • the voice authentication system 1 limits access by a user 2 to a secure system 3.
  • the secure system 3 may be physical, such as a room or building, or logical, such as a computer system, cellular telephone handset or bank account.
  • the voice authentication system 1 is managed by a system administrator 4.
  • the voice authentication system 1 includes a microphone 5 into which a user may provide a spoken response and which converts a sound signal into an electrical signal, an amplifier 6 for amplifying the electrical signal, an analog-to-digital (A/D) converter 7 for sampling the amplified signal and generating a digital signal, a filter 8, a processor 9 for performing signal processing on the digital signal and controlling the voice authentication system 1, volatile memory 10 and non-volatile memory 11.
  • the A/D converter 7 samples the amplified signal at 11025 Hz and provides a mono-linear 16-bit pulse code modulation (PCM) representation of the signal.
  • PCM: pulse code modulation
  • the system 1 further includes a digital-to-analog (D/A) converter 12, another amplifier 13 and a speaker 14 for providing audio prompts to the user 2 and a display 15 for providing text prompts to the user 2.
  • the system 1 also includes an interface 16, such as a keyboard and/or mouse, and a display 17 for allowing access by the system administrator 4.
  • the system 1 also includes an interface 18 to the secure system 3.
  • the voice authentication system 1 is provided by a personal computer which operates software performing the voice authentication process.
  • the voice authentication process comprises two stages, namely enrolment (step S1) and authentication (step S2).
  • the purpose of the enrolment is to obtain a plurality of specimens of speech from a person who is authorised to enrol with the system 1, referred to herein as a "valid user".
  • the specimens of speech are used to generate a reliable and distinctive voice authentication biometric, which is subsequently used in authentication.
  • a voice authentication biometric is a compact data structure comprising acoustic information-bearing attributes that characterise the way a valid user speaks. These attributes take the form of templates, herein referred to as "featuregram archetypes" (FGAs), which are described in more detail later.
  • FGAs: featuregram archetypes
  • the valid user's voice authentication biometric may also include further information relating to enrolment and authentication.
  • the further information may include data relating to prompts to which a valid user has responded during enrolment, which may take the form of text prompts or equivalent identifiers, the number of prompts to be used during authentication and whether prompts should be presented in a random order during authentication, and other data relating to authentication such as scoring strategy, pass/fail/retry thresholds, the number of acceptable failed attempts and amplifier gain.
  • the voice authentication system 1 is calibrated, for example to ensure that a proper amplifier gain is set (step S1.1).
  • a plurality of spoken responses are recorded (step S1.2).
  • the recordings are characterised by generating so-called "featuregrams", which comprise a set of feature vectors (step S1.3).
  • the recordings are also examined so as to isolate speech from background noise and periods of silence (step S1.4, step S1.5).
  • Checks are performed to ensure that the recorded responses, isolated specimens of speech and featuregrams are suitable for processing (step S1.6).
  • a plurality of speech featuregrams are then generated (step S1.7).
  • An average of some or all of the featuregrams is taken, thereby forming a more representative featuregram, namely a featuregram archetype (step S1.8).
  • a pass level is set (step S1.9) and a voice authentication biometric is generated and stored (step S1.10).
  • step S1.1 (Figure 3) is described in more detail:
  • One of the purposes of calibration is to set the gain of the amplifier 6 (Figure 1) such that the amplitude of a captured speech utterance is of a predetermined standard.
  • the predetermined standard may specify that the amplitude of the speech utterance peaks at a predetermined value, such as 70% of a full-scale deflection of a recording range.
  • the A/D converter 7 (Figure 1) is 16 bits wide and so 70% of full-scale deflection corresponds to a signal of 87 dB.
  • the predetermined standard may also specify that the signal has a minimum signal-to- noise ratio, for instance 20dB which corresponds to a signal ten times stronger than the background noise.
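  • As a quick check of the figures quoted above (assuming levels are expressed relative to one quantisation step of the 16-bit converter):

```python
import math

full_scale = 2 ** 15 - 1                 # 32767 for signed 16-bit PCM
peak_target = 0.70 * full_scale          # 70% of full-scale deflection
print(20 * math.log10(peak_target))      # ~87.2 dB, matching the "87 dB" figure
print(10 ** (20 / 20))                   # a 20 dB SNR is an amplitude ratio of 10
```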
  • the gain of the amplifier 6 (Figure 1) is set to the highest value (step S1.1.1) and first and second counters are set to zero (steps S1.1.2 & S1.1.3).
  • the first counter keeps a tally of the number of specimens provided by the valid user.
  • the second counter is used to determine the number of consecutive specimens which meet the predetermined standard.
  • a prompt is issued (step S1.1.4).
  • the prompt is randomly selected. This has the advantage that it prevents the valid user from anticipating the prompt and thereby providing an uncharacteristic or unnatural response, for example one which is unnaturally loud or quiet.
  • the prompt may be a text prompt or an audio prompt.
  • the valid user may be prompted to say a single word, such as "thirty-four", or a phrase such as "My voice is my pass phrase".
  • the prompts comprise numbers. Preferably, the numbers are chosen from a range between 21 and 99. This has the advantage that the spoken utterance is sufficiently long and complex so as to include a plurality of features.
  • a speech utterance is recorded (step S1.1.5). This comprises the user providing a spoken response which is picked up by the microphone 5 (Figure 1), amplified by the amplifier 6 (Figure 1), sampled by the analog-to-digital (A/D) converter 7 (Figure 1), filtered and stored in volatile memory 10 (Figure 1) as the recorded signal.
  • the processor 9 (Figure 1) calculates the power of the speech utterance included in the recorded signal and analyses the result.
  • the signal-to-noise ratio is determined (step S1.1.6). If the signal-to-noise ratio is too low, for example less than 20 dB, then the spoken response is too quiet and the corresponding signal generated is too weak, even at the highest gain. The user is informed of this fact (step S1.1.7) and the calibration stage ends. Otherwise, the process continues.
  • the signal level is determined (step S1.1.8). If the signal level is too high, for example greater than 87 dB, which corresponds to the 95th percentile of the speech utterance energy being greater than 70% of the full-scale deflection of the A/D converter 7 (Figure 1), then the spoken response is too loud and the corresponding signal generated is too strong. If the gain has already been reduced to its lowest value, then the signal is too strong, even at the lowest gain (step S1.1.9). The user is informed of this fact and calibration ends (step S1.1.10). Otherwise, the gain of the amplifier 6 is reduced (step S1.1.11). The gain may be reduced by a fixed amount regardless of signal strength. Alternatively, the gain may be reduced by an amount dependent on signal strength.
  • the fact that a specimen spoken response has been taken is noted by incrementing the first counter by one (step S1.1.12).
  • the second counter is reset (step S1.1.13). If too many specimens have been taken, for example 15, then calibration ends (step S1.1.14). Otherwise, the process returns to step S1.1.4, wherein the user is prompted, and the process of recording, calculating and analysing is repeated.
  • If, at step S1.1.8, the signal level is not too high, then the spoken response is considered to be satisfactory, i.e. neither too loud nor too quiet. Thus, the recorded signal falls within an appropriate range of values of signal energy.
  • the fact that a specimen spoken response has been taken is recorded by incrementing the first counter by one (step S1.1.16).
  • the fact that the specimen is satisfactory is also recorded by incrementing the second counter by one (step S1.1.17).
  • the gain remains unchanged.
  • the gain setting of the amplifier 6 (Figure 1) is stored (steps S1.1.18 & S1.1.19).
  • the gain setting is stored in the voice authentication biometric.
  • at step S1.1.18, the signal level is measured a final time. If the signal level is too low, then calibration ends without the gain setting being stored.
  • the calibration process allows decreases, but not increases, in gain. This has the advantage of preventing infinite loops in which the gain fluctuates without reaching a stable setting.
  • the calibration process may be modified to start at the lowest gain and allow increases, but not decreases, in gain.
  • gain is increased.
  • the signal level may be measured a final time to determine whether it is too high.
  • the calibration process may include a further check of signal-to-noise ratio. For example, once a settled value of gain has been determined, the peak signal-to-noise ratio of the signal is measured. If the signal-to-noise ratio exceeds a predetermined level, such as 20 dB, then the gain setting is stored. Otherwise, the user is instructed to repeat calibration in a quieter environment, move closer to the microphone or speak with a louder voice.
  • the voice authentication system 1 records one or more spoken responses (step S1.2). This may occur during calibration at step S1.1. Additionally or alternatively, a separate recording stage may be used.
  • the voice authentication system 1 asks the user to provide a spoken response.
  • the system prompts the user a plurality of times.
  • Four types of prompt may be used:
  • the prompt comprises a request for a single word, for example "Say 81".
  • the user is asked to repeat the word.
  • the user may be asked to repeat the word once so as to obtain two specimens of the spoken response.
  • the user may be asked to repeat the word more than once so as to obtain multiple examples.
  • the prompt comprises a request for a single phrase, for example "Say My voice is my pass phrase".
  • the user is asked to repeat the phrase.
  • the user may be asked to repeat the phrase once or more than once.
  • the prompt may comprise a challenge requesting personal information, such as "What is your home telephone number?".
  • the valid user provides a spoken response which includes the personal information.
  • This type of prompt is referred to as a "challenge-response".
  • This type of prompt has the advantage of increasing security. During subsequent authentication, an impostor must know or guess what to say as well as attempt to say the spoken response in the correct manner.
  • a valid user may pronounce digits in different ways, such as pronouncing "10" as "ten", "one, zero", "one, nought" or "one-oh", and/or pause while saying a string of numbers, such as reciting "12345678" as "12-34-56-78" or "1234-5678".
  • the prompt may comprise a cryptic challenge-response, such as "NOD?".
  • NOD may signify "Name of dog?”.
  • the cryptic challenge is specified by the user. This type of prompt has the advantage of increasing security since the prompt is meaningful only to the valid user. It offers few clues as to what the spoken response should be.
  • a set of prompts may be common to all users. Alternatively, a set of prompts may be randomly selected on an individual basis. If the prompts are chosen randomly, then a record of the prompts issued to each user is stored in the voice authentication biometric, together with corresponding data generated from the spoken response. Preferably, this information is used during the authentication stage to ensure that only prompts responded to by the valid user are issued and that appropriate comparisons are made with corresponding featuregram archetypes.
  • the administrator 4 ( Figure 1) determines the type and number of prompts used during enrolment and authentication.
  • a spoken response is recorded by the microphone 5, amplified by amplifier 6 and sampled using A/D converter 7 at 11025 Hz to provide a 16-bit PCM digital signal.
  • the duration of the recording may be fixed. Preferably, the recording lasts between 2 and 3 seconds.
  • the signal is then filtered to remove any d.c. component.
  • the signal may be stored in volatile memory 10.
  • the recorded signal 19 may comprise one or more speech utterances 20, one or more background noises 21 and/or one or more silence intervals 22.
  • a speech utterance 20 is defined as a period in a recorded signal 19 which is derived solely from the spoken response of the user.
  • background noise 21 is defined as a period in a recorded signal arising from audible sounds, but not originating from the speech utterance.
  • a silence interval 22 is defined as a period in a recorded signal which is free from background noise and speech utterance.
  • the purpose of the enrolment is to obtain a plurality of specimens of speech so as to generate a voice authentication biometric.
  • recorded responses are characterised by generating "featuregrams" which comprise sets of feature vectors.
  • featuregrams comprise sets of feature vectors.
  • the recordings are also examined so as to isolate speech from background noise and silences.
  • if the recordings are known to contain specific words, then they are searched for those words. This is known as "word spotting". If there is no prior knowledge of the content of the recordings, then the recordings are inspected to identify spoken utterances. This is known as "endpointing". By identifying speech utterances using one or both of these processes, a speech featuregram may be generated which corresponds to portions of the recorded signal comprising speech utterances.
  • the recorded signal 19 is divided into frames, referred to herein as timeslices 23.
  • the recorded signal 19 is divided into partially-overlapping timeslices 23 having a predetermined period.
  • a feature vector 24 is a one-dimensional data structure comprising data related to acoustic information-bearing attributes of the timeslice 23.
  • a feature vector 24 comprises a string of numbers, for example 10 to 50 numbers, which represent the acoustic features of the signal comprised in the timeslice 23.
  • each feature vector 24 comprises twelve signed 8-bit integers, typically representing the second to thirteenth calculated mel-cepstral coefficients. Data relating to energy (in dB) may be included as a 13th feature. This has the advantage of helping to improve the performance of a word spotting routine that would otherwise operate on the feature vector coefficients alone.
  • the transform 25 may also calculate first and second differentials, referred to as “delta” and “delta-delta” values.
  • LPC: linear predictive coefficient
  • TESPAR: time encoded speech processing and recognition
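  • A sketch of the framing stage described above follows; the frame length and overlap are assumptions, and the twelve mel-cepstral coefficients would in practice be produced per frame by a standard cepstral routine, which is omitted here (a zero placeholder is used) so that only the 13th, energy, feature is actually computed:

```python
import numpy as np

def featuregram(signal, sample_rate=11025, frame_ms=20, overlap=0.5):
    # Split the recorded signal into partially-overlapping timeslices and attach
    # a log-energy term to each frame.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    features = []
    for frame in frames:
        frame = np.asarray(frame, dtype=float)
        energy_db = 10 * np.log10(np.mean(frame ** 2) + 1e-12)   # energy feature (dB)
        features.append(np.append(np.zeros(12), energy_db))      # 12 cepstra left as placeholders
    return np.array(features)
```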
  • Endpointing seeks to identify portions of a recorded signal which contain spoken utterances. This allows generation of speech featuregrams which characterise the spoken utterances.
  • step S1.4: explicit endpointing
  • DTW: dynamic time warping
  • Explicit endpointing: explicit endpointing seeks to locate approximate endpoints of a speech utterance in a particular domain without using any a priori knowledge of the words that might have been spoken.
  • Explicit endpointing tracks changes in signal energy profile over time and frequency and makes boundary decisions based on general assumptions regarding the nature of profiles that are indicative of speech and those that are representative of noise or silence.
  • Explicit endpointing cannot easily distinguish between speech spoken by the enrolling user and speech prominently forming part of background noise. Therefore, it is desirable that no-one else speaks in close proximity to the valid user when enrolment takes place.
  • an explicit endpointing process 27 generates a plurality of pairs 28 of possible start and stop points for a stream of timeslices 23.
  • the advantage of generating a plurality of endpoints is that the true endpoints are likely to be identified.
  • a drawback is that if too many endpoint combinations are identified, then the system response time is adversely affected. Therefore, a trade-off is sought between the number of potential endpoint combinations identified and the response time required.
  • Explicit endpointing is suitable for both fixed and continuous recording environments, although it is mainly intended for use with isolated word or isolated phrase recognition systems.
  • A check is made whether initialisation is needed, whereby background noise energy is measured (step S1.4.A). If so, a background noise signal is recorded (step S1.4.B), divided into timeslices (step S1.4.C) and a background energy value is calculated (step S1.4.D).
  • a signal is recorded and divided into a plurality of timeslices 23 (Figure 8) (step S1.4.1).
  • a first counter, i, for keeping track of which timeslice 23_i is currently being processed is set to one (step S1.4.2).
  • a second counter, j, for counting the number of consecutive timeslices 23 which represent background noise is set to zero (step S1.4.3).
  • a "word" flag is set to zero to represent that the current timeslice 23_i does not represent a spoken utterance portion, such as a word portion (step S1.4.4).
  • the energy of the current timeslice 23_i is calculated (step S1.4.5).
  • a plurality of timeslices 23 are used to calculate a value of energy for the current timeslice 23_i.
  • the timeslices 23 are comprised in a window 29.
  • five timeslices 23_{i-2}, 23_{i-1}, 23_i, 23_{i+1}, 23_{i+2} are used to calculate a value of energy of the i-th timeslice 23_i.
  • a time encoded speech processing and recognition (TESPAR) coding process 30 is used to calculate an energy value for each timeslice 23_{i-2}, 23_{i-1}, 23_i, 23_{i+1}, 23_{i+2}. This comprises taking each timeslice and dividing it into a plurality of so-called "epochs" according to where the signal magnitude changes from positive to negative and vice versa. These points are known as "zero crossings".
  • An absolute value of a peak magnitude for each epoch is used to calculate an average energy for each timeslice 23_{i-2}, 23_{i-1}, 23_i, 23_{i+1}, 23_{i+2}.
  • five energy values are obtained from which a mean energy value 31 is calculated.
  • a delta energy value 32 indicative of changes in the five energy values is also calculated (step S1.4.6).
  • the delta energy value 32 is calculated by performing a smoothed linear regression calculation using the energy values for the timeslices 23_{i-2}, 23_{i-1}, 23_i, 23_{i+1}, 23_{i+2}.
  • the delta energy value 32 represents the gradient of a straight line fitted to the values of energy. Thus, large changes in the energy values result in a large value of the delta energy value 32.
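  • The five-timeslice window calculation described above can be sketched as below; a plain least-squares fit stands in for the smoothed linear regression mentioned in the text:

```python
import numpy as np

def energy_and_delta(frame_energies, i):
    # Mean energy of the five timeslices centred on timeslice i, and the gradient
    # of a straight line fitted to those five values (the delta energy).
    window = np.asarray(frame_energies[i - 2:i + 3], dtype=float)  # 23_{i-2} .. 23_{i+2}
    mean_energy = window.mean()
    slope = np.polyfit(np.arange(5), window, 1)[0]
    return mean_energy, slope
```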
  • the values 31, 32 of energy E_i and delta energy ΔE_i are used to determine whether the i-th timeslice 23_i represents a spoken utterance.
  • if the energy 31 of an i-th timeslice 23_i is equal to or greater than a first threshold, which is a first predetermined multiple of the background noise energy, i.e. E_i ≥ k_1 × E_0 (step S1.4.7), and the delta energy 32 is equal to or greater than a second threshold, which is a second predetermined multiple of the background delta energy, i.e. ΔE_i ≥ k_2 × ΔE_0 (step S1.4.8), then the i-th timeslice 23_i is considered to form part of a word.
  • the timeslice 23 is said to form part of a voiced or energetic section.
  • If the word flag is not set to one, representing that the previous timeslice 23_{i-1} was background noise (step S1.4.9), then the current timeslice 23_i is considered to be the beginning of a word (step S1.4.10). Thus, the word flag is set to one (step S1.4.11).
  • If, at step S1.4.9, the word flag is set to one, then the beginning of the word has already been detected and so the current timeslice 23_i is located within an existing word (step S1.4.12).
  • the first counter, i, is incremented by one (step S1.4.13) and the process returns to step S1.4.5 where the energy of the new i-th timeslice 23_i is calculated. If the energy value 31 falls below the first threshold at step S1.4.7 or the delta energy value 32 falls below the second threshold at step S1.4.8, then it is determined whether there is a stop point and, if so, with which start point or start points it could be paired.
  • If the word flag is set to one (step S1.4.14), then the current timeslice 23_i is considered to be a stop point.
  • the stop point may be paired with one or more other start points, as will now be explained:
  • first and second sections 33, 34 are separated by a gap 35.
  • the first section 33 includes a first start point 36_1 and a first stop point 37_1.
  • the second section 34 has a second start point 36_2.
  • a second stop point 37_2 is found.
  • the second stop point 37_2 may be paired with the second start point 36_2, thereby identifying the second section 34 as a word.
  • the second stop point 37_2 may also be paired with the first start point 36_1.
  • the first start point 36_1 and the second stop point 37_2 may define a larger word 38 which includes both the first and second sections 33, 34.
  • the second counter, j, is reset to zero (step S1.4.19), the word flag is set to zero (step S1.4.20) and the first counter is incremented by one (step S1.4.21) before returning to step S1.4.5.
  • If, at step S1.4.14, the word flag is not set to one, then a further check is made as to whether the current timeslice 23_i may be considered to be the start point of an unvoiced or unenergetic section, hereinafter referred to simply as an unvoiced section.
  • If the energy 31 of an i-th timeslice 23_i is equal to or greater than a third threshold, which is lower than the first and which is a third predetermined multiple of the background noise energy, i.e. E_i ≥ k_3 × E_0 (step S1.4.22), and the delta energy 32 is equal to or greater than a fourth threshold, which is lower than the second and which is a fourth predetermined multiple of the background delta energy, i.e. ΔE_i ≥ k_4 × ΔE_0 (step S1.4.23), and provided that the timeslice 23_i is found within 10 timeslices of the previous stop point (step S1.4.24), then the i-th timeslice 23_i is considered to be the start point of an unvoiced section (step S1.4.25).
  • the current timeslice 23 is deemed to represent background noise (step S1.4.31).
  • the values of background noise energy and delta background noise energy are updated using the current timeslice 23_i.
  • a weighted average is taken using 95% of the background noise energy E_0 and 5% of the timeslice energy E_i (step S1.4.32).
  • a weighted average is taken using 95% of the delta background noise energy ΔE_0 and 5% of the delta energy ΔE_i (step S1.4.33).
  • the second counter, j, is incremented by one (step S1.4.34).
  • A check is made to see whether an isolated word has been found (step S1.4.35). If a sufficiently long period of background noise is identified, for example by counting twenty timeslices after the end of a word, which corresponds to 0.5 seconds of silence (step S1.4.36), then it is reasonable to assume that the last stop point represents the end of an isolated word. If an isolated word is found, then pairing of possible start and stop points may be terminated. Otherwise, searching continues by returning to step S1.4.5.
  • If, at step S1.4.30, the energy 31 of the timeslice 23_i falls below the fifth threshold, then a stop point of an unvoiced section is identified (step S1.4.36).
  • the stop point is associated with the start point of the preceding word (step S1.4.37) and the first counter, i, is incremented by one (step S1.4.38).
  • a first section 39 precedes a second section 40 and has a first start point 41_1 and a first stop point 42_1.
  • a second stop point 42_2 is found in the second section 40 according to step S1.4.36.
  • the second stop point 42_2 is paired with the first start point 41_1.
  • the first start point 41_1 and the second stop point 42_2 may define a word 43 which includes both the first and second sections 39, 40.
  • the stop point may be an end point of a voiced section, such as the "le” in “left”, or a stop point of an unvoiced section, such as the "t" in “left".
  • a process for finding and removing extraneous noises such as lip smack and generating an additional pair of endpoints is shown:
  • a stop point is located at step S1.4.16 or S1.4.17 in a voiced section
  • the current start point is located (step S1.4.39).
  • First and second pointers p, q are set to the start point (steps S1.4.40 & S1.4.41).
  • the first index p points to an updated start point.
  • the second index q points keeps track of which timeshce is currently 1 being examined.
  • the delta energy of a current timeshce 23 is compared with the delta energy of a succeeding timeshce 23 q+1 (step SI .4.42). If the delta energy of the current timeshce 23 is greater than the delta energy of the succeeding timeshce 23 q+1 , then the delta energy of the succeeding timeshce 23 +1 is compared with the delta energy of a second succeeding timeshce 23 q+2 (step SI.4.43). If the delta energy of the succeeding timeshce 23 q+1 is greater than the delta energy of the second succeeding timeshce 23 q+2 , then the start point is updated by incrementing the first index p by one (step SI.4.44). A check is made to see whether updated start point and the stop position are separated by at least three timeshces (step SI.4.45). If not, then the process terminates without generating an additional pair of endpoints including an updated start point.
  • step SI.4.42 or SI.4.43 the delta energy of the current timeshce 23 q is less than the delta energy of the succeeding timeshce 23 q+1 or delta energy of the succeding timeshce 23 q+1 is less than the delta energy of the succeeding timeshce 23 +2 , then the process terminates and generates an additional pair of endpoints including an updated start point.
  • Figure 19a shows a voiced section 44 having a pair of start and stop points 45, 46.
  • Figure 19b shows the voiced section 44 after the process has identified a section portion 47 comprising a lip smack. Another pair of start and stop points 48, 49 are generated.
  • explicit endpointing is performed in real-time. This has the advantage that it may be determined whether or not a timeslice 23 corresponds to a spoken utterance, i.e. whether a portion of the recorded signal currently being processed corresponds to part of a word. If so, a featuregram is generated. If not, a featuregram need not be generated. Processing resources may be better put to use, for example by generating a template (if in the training mode) or performing a comparison (if in the real-time live interrogation mode).
  • Word spotting: word spotting seeks to locate endpoints of a speech utterance in a particular domain using a priori knowledge of the words that should have been spoken as a guide.
  • the a priori knowledge is typically presented as a speaker-independent featuregram archetype (FGA) generated from speech utterances of the word or phrase being sought that have previously been supplied by a wide range of representative speakers.
  • the featuregram archetype may include an energy term.
  • a dynamic time warping process 50, herein referred to as DTWFlex
  • the process 50 compares a featuregram 51 derived from the recorded signal 19 (Figure 5) with a speaker-independent featuregram archetype 52 representing a word or phrase being sought. This is achieved by compressing and/or expanding different sections of the featuregram 51 until a region inside the featuregram 51 matches the speaker-independent featuregram archetype 52. The best fit is known as the winning path and the endpoints of the winning path are output 28'.
  • word spotting delivers more accurate endpoints than those produced by explicit endpointing, particularly when heavy non-stationary background noise is present. If word spotting is used during enrolment, users are asked to respond to fixed-word or fixed-phrase prompts for which speaker-independent featuregram archetypes have been prepared in advance. It is difficult to use word spotting in conjunction with challenge-response prompts, particularly if spoken responses cannot be easily anticipated. Thus, it is preferable to use explicit endpointing when using challenge-response prompts. An outline of a word spotting process will now be described:
  • First and second speech patterns A, B may be expressed as a sequence of first and second respective sets of feature vectors a, b, wherein:
  • A = a_1, a_2, ..., a_i, ..., a_I (1a) and B = b_1, b_2, ..., b_j, ..., b_J (1b)
  • Each respective vector a, b represents a fixed period of time.
• a dynamic time warping process seeks to eliminate timing differences between the first and second speech patterns A, B.
• the timing differences may be illustrated using an i–j plot, wherein the first speech pattern A is developed along an i-axis 53 and the second speech pattern B is developed along a j-axis 54.
• the timing differences between the first and second speech patterns A, B may be represented by a sequence F, wherein F = c(1), c(2), ..., c(k), ..., c(K) and c(k) = (i(k), j(k)) is a point matching the i(k)-th feature vector of A with the j(k)-th feature vector of B.
• the sequence F may be considered to represent a function which approximately maps the time axis of the first speech pattern A onto that of the second speech pattern B.
  • the sequence F is referred to as a warping function 55.
• the warping function 55 increasingly deviates from the diagonal line 56.
• a weighted sum of distances along the warping function 55 is calculated using E(F) = Σ d(c(k))·w(k), summed for k = 1 to K, where d(c(k)) is the distance between the feature vectors a_i(k) and b_j(k), and
  • w(k) is a positive weighting coefficient.
  • E(F) reaches a minimum value when the warping function 55 optimally adjusts the timing differences between the first and second speech patterns A, B.
• the minimum value may be considered to be a distance between the first and second speech patterns A, B once the timing differences between them have been eliminated, and is expected to be stable against time-axis fluctuation. Based on these considerations, a time-normalised distance D between the first and second speech patterns A, B is defined as D(A, B) = min over all warping functions F of [ Σ d(c(k))·w(k) / Σ w(k) ].
• Two conditions are imposed on the speech patterns A, B. Firstly, the speech patterns A, B are time-sampled with a common and constant sampling period. Secondly, there is no a priori knowledge about which parts of the speech pattern contain linguistically important information. In this case, each part of the speech pattern is considered to carry an equal amount of linguistic information.
  • the warping function 55 is a model of time-axis fluctuations in a speech pattern.
• the warping function 55, when viewed as a mapping function from the time axis of the second speech pattern B onto that of the first speech pattern A, preserves linguistically important structures in the second speech pattern B time axis, and vice versa.
• important speech pattern time-axis structures include continuity, monotonicity and limitation on acoustic parameter transition speed in speech.
• asymmetric time warping is used, wherein the weighting function w(k) is dependent on i but not j. This condition is realised using the following restrictions on the warping function 55. Firstly, a monotonic condition is imposed, wherein i(k − 1) ≤ i(k) and j(k − 1) ≤ j(k).
  • the monotonic condition specifies that the warping function 55 does not turn back on itself. Secondly, a continuity condition is imposed, wherein:
  • the continuity condition specifies that the warping function 55 advances a predetermined number of steps at a time.
• Boundary conditions are set such that the warping function 55 starts at (1, 1) and ends at (I, J), i.e. i(1) = 1, j(1) = 1, i(K) = I and j(K) = J.
  • a local slope constraint condition is also imposed. This defines a relation between consecutive points on the warping function 55 and places limitations on possible configurations. In this example, the Itakura condition is used.
• the second speech pattern B may be maximally compressed or expanded by a factor of 2 in order to time align it with the first speech pattern A.
• the above conditions effectively constrain the possible warping function 55 to a region bounded by a parallelogram 58, which is referred to as the "legal" search region.
• the legal search region conforms to the following conditions:
  • j may take a maximum value 58 max and minimum value 58 min for a particular value of i.
• Equation 5 may then be simplified and re-written as:
  • the time-normalised distance D may be solved using standard dynamic programming techniques.
  • the aim is to find the cost of the shortest path.
• an asymmetric weighting function w(k) is used, namely w(k) = i(k) − i(k − 1), so that the sum of the weighting coefficients is I, where
  • I is the length of speech pattern A.
• An algorithm for solving equation 13 comprises defining an array g for holding the lowest cost path to each point and initialising it such that g(1, 1) = d(1, 1)·w(1).
  • the lowest cost to the first point is the distance between the first two elements multiphed by the weighting factor.
• w(1) = 2
• w(1) = 1.
• the algorithm of equations 17, 18 and 19 is simplified and comprises defining an array g for holding the lowest cost path to each point and initialising it such that:
  • the lowest cost to the first point is the distance between the first two elements.
• the algorithm comprises calculating g(i, j) for each row i and column j, wherein:
• the algorithm further comprises applying the following global conditions, namely j ≥ max(2i − 2I + J, (i + 1)/2) (22) and, correspondingly, j ≤ min(2i − 1, (i − I)/2 + J)
  • An algorithm based on equations 19 to 24 may be used to obtain a score when comparing speech utterances of substantially the same length.
  • the algorithm is used when comparing featuregram archetypes, which is described in more detail later.
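A minimal sketch of the dynamic programming solution described above is given below. It assumes Euclidean frame distances, the simplified weighting with w(1) = 1, and steps of (i−1, j), (i−1, j−1) and (i−1, j−2), which approximate the slope constraint of between 1/2 and 2; the function and variable names follow the description (g, d, I, J) but are otherwise illustrative.

```python
import numpy as np

def dtw_score(A, B):
    """Minimal sketch of the fixed-endpoint dynamic time warping score between two
    featuregrams A (length I) and B (length J), one feature vector per row. The
    accumulated cost of the winning path is normalised by I to give the
    time-normalised distance D."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    I, J = len(A), len(B)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # local distances d(i, j)
    g = np.full((I, J), np.inf)
    g[0, 0] = d[0, 0]                      # lowest cost to the first point
    for i in range(1, I):
        for j in range(J):
            # global (parallelogram) conditions: only points inside the legal region
            lo = max(2 * i - 2 * (I - 1) + (J - 1), (i + 1) // 2)
            hi = min(2 * i, (J - 1) - (I - i) // 2)
            if not (lo <= j <= hi):
                continue
            prev = [g[i - 1, j]]
            if j >= 1:
                prev.append(g[i - 1, j - 1])
            if j >= 2:
                prev.append(g[i - 1, j - 2])
            g[i, j] = d[i, j] + min(prev)
    return g[I - 1, J - 1] / I             # time-normalised distance D
```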
  • equations 19 to 24 may be adapted for word spotting applications.
• In word spotting, it is assumed that the start and stop points of the first speech pattern A are known.
  • the start and stop points of the relevant speech in the second pattern B are unknown. Therefore, the conditions of equation 9 no longer hold and can be re-defined such that:
• the featuregram 51 derived from the recorded signal 19 (Figure 5) is compared with the speaker-independent featuregram archetype 52.
  • the featuregram comprises a speech utterance, such as "twenty-one", silence intervals and background noise.
  • the speaker-independent featuregram archetype 52 comprises a word or phrase being sought, which in this example is "twenty-one".
  • the featuregram 51 is warped onto the speaker-independent featuregram archetype 52.
  • the aim is to locate a region within the featuregram 51 (speech pattern B) which best matches speaker-independent featuregram archetype 52 (speech pattern A).
  • An array g for holding the lowest cost path to each point is defined (step SI.5.1).
  • the array may be considered as a net of points or nodes.
  • the start point can assume any value from 1 to J-I/2, therefore the elements g(l, 1) to g(l, J-I/2) are set to values d(l, 1) to d(l, J-I/2) respectively (step Sl.5.2).
  • Elements g(l, J-I/2+1) to g(l, J) may be set to a large number.
  • a corresponding array 59 is shown in Figure 25a.
  • Equation 20 is then calculated for some, but not all, elements (i, j) of array g.
  • the winning score with the lowest value is found (step SI.5.12).
  • the stop point may assume any value from 1/2 to J. Therefore, elements g(I, 1/2) to g(I, J) are searched.
  • a start point 61 may be estimated by tracing back the winning path 62.
• endpoints 28' are found by reading the j-values corresponding to the start and stop points 60, 61.
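The relaxed-endpoint search can be sketched as a small modification of the fixed-endpoint algorithm. The sketch below is illustrative only: it assumes unit weights, a simple three-way step rule instead of the full slope constraint, and hypothetical function and argument names; fga is the speaker-independent featuregram archetype (speech pattern A) and featuregram is the recording-derived featuregram (speech pattern B).

```python
import numpy as np

def word_spot(fga, featuregram):
    """Minimal sketch of word spotting by relaxed-endpoint dynamic time warping:
    the winning path may start anywhere in columns 1..J - I/2 and finish anywhere in
    columns I/2..J, so the start and stop points of the sought word within the
    featuregram are recovered rather than assumed."""
    A, B = np.asarray(fga, float), np.asarray(featuregram, float)
    I, J = len(A), len(B)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    g = np.full((I, J), np.inf)
    back = np.zeros((I, J), dtype=int)              # j chosen in the previous row
    free_start = max(1, J - I // 2)
    g[0, :free_start] = d[0, :free_start]           # free start points g(1, 1)..g(1, J - I/2)
    for i in range(1, I):
        for j in range(J):
            steps = [j - s for s in (0, 1, 2) if j - s >= 0]
            costs = [g[i - 1, jj] for jj in steps]
            k = int(np.argmin(costs))
            g[i, j] = d[i, j] + costs[k]
            back[i, j] = steps[k]
    stop = int(np.argmin(g[I - 1, I // 2:])) + I // 2   # winning score over free stop points
    score = g[I - 1, stop] / I
    start = stop
    for i in range(I - 1, 0, -1):                   # trace the winning path back to its start
        start = back[i, start]
    return score, start, stop
```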
• a plurality of sanity checks may be applied during enrolment and authentication, preferably on the recorded signal or a recorded signal portion, to ensure that they are suitable for enrolment and authentication, i.e. that the speech utterances carry sufficient information for featuregrams to be generated. Preferably, all the following sanity checks are performed.
  • a first sanity check comprises confirming that the length of speech exceeds a minimum length.
• the minimum length of speech is a function not only of time but also of the number of feature vector timeslices. In this example, the minimum length of speech is 0.5 seconds of speech and 30 feature vector timeslices, and timeslice duration and overlap are defined accordingly.
  • a second sanity check comprises checking that each speech utterance includes a silence interval which exceeds a minimum length.
• the silence interval is used to determine noise threshold levels for explicit endpointing, signal-to-noise measurements and for speech/noise entropy.
• the minimum length of silence is 0.5 seconds and 30 feature vector timeslices.
  • a third sanity check includes examining whether the signal-to-noise ratio exceeds a minimum.
  • the minimum signal-to-noise ratio is 20dB.
  • the purpose of setting a minimum signal-to-noise ratio is to obtain an accurate speaker biometric template uncorrupted by background noise.
• An estimate of the SNR can be determined using SNR = 10·log10(I s / I n), where
  • I s is the speech energy and I n is the noise energy.
  • the speech and noise energy I s , I n can be calculated using:
  • pcm is the value of the digital signal.
• Other values of signal-to-noise ratio may be used, for example 25dB.
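The following is a minimal sketch of the signal-to-noise sanity check, assuming that the energies I_s and I_n are mean squared PCM sample values and that SNR = 10·log10(I_s / I_n); the function names and the way the speech and noise samples are supplied are illustrative.

```python
import numpy as np

def snr_db(speech_samples, noise_samples):
    """Estimate the signal-to-noise ratio in dB from PCM samples of the isolated
    speech and of a silence (background noise) interval."""
    i_s = np.mean(np.square(np.asarray(speech_samples, float)))   # speech energy I_s
    i_n = np.mean(np.square(np.asarray(noise_samples, float)))    # noise energy I_n
    return 10.0 * np.log10(i_s / i_n)

def passes_snr_check(speech_samples, noise_samples, minimum_db=20.0):
    """Third sanity check: require the SNR to exceed a minimum, for example 20 dB."""
    return snr_db(speech_samples, noise_samples) >= minimum_db
```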
  • a fourth sanity check comprises checking whether the speech energy exceeds a minimum.
  • the purpose of setting a minimum speech intensity is not only to provide adequate signal-to-noise, but also to avoid excessive quantisation in the digital signal.
  • the minimum speech intensity is 47 dB.
• a fifth sanity check comprises determining whether the degree of clipping exceeds a maximum value.
• the degree of clipping is defined as the average number of samples per speech frame which exceed a threshold absolute value. In this case, the threshold absolute value is 32000, which represents about 98% of the full-scale deflection of a 16-bit analog-to-digital converter.
  • a sixth sanity check includes checking whether a so-called "speech entropy" exceeds a minimum.
  • the minimum speech entropy is 40.
  • Speech entropy is defined as the average distance between a speech featuregram and the mean feature vector of the speech featuregram.
  • the mean feature vector is calculated by taking an average of the n-feature vectors in the featuregram.
  • a distance between each feature vector and the mean feature vector is determined.
  • a Manhattan distance is calculated, although a Euclidian distance may be used.
  • An average distance is calculated by taking an average of n-values of distance.
• a seventh sanity check comprises testing whether a so-called "speech-to-noise entropy" exceeds a minimum.
  • Speech-to-noise entropy is defined as the average distance between the mean feature vector of the speech featuregram and the feature vectors of the background noise. In this example, the minimum speech-to-noise entropy is 40.
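The two entropy checks can be sketched as follows. This is a minimal sketch: the featuregrams are taken as arrays with one feature vector per row, and the Manhattan distance is used, as in the example above; the function names are illustrative.

```python
import numpy as np

def speech_entropy(speech_featuregram):
    """"Speech entropy": the average Manhattan distance between each feature vector
    in the speech featuregram and the mean feature vector of that featuregram."""
    fg = np.asarray(speech_featuregram, float)
    mean_vector = fg.mean(axis=0)                          # mean of the n feature vectors
    return np.abs(fg - mean_vector).sum(axis=1).mean()     # average Manhattan distance

def speech_to_noise_entropy(speech_featuregram, noise_featuregram):
    """"Speech-to-noise entropy": the average distance between the mean feature vector
    of the speech featuregram and the feature vectors of the background noise."""
    mean_vector = np.asarray(speech_featuregram, float).mean(axis=0)
    noise = np.asarray(noise_featuregram, float)
    return np.abs(noise - mean_vector).sum(axis=1).mean()
```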
• If, at steps S1.6.1 to S1.6.6, the number of failures exceeds a threshold, for example 3, then the signal is deemed to be inadequate and the user is asked to check their set-up (steps S1.6.7 and S1.6.8). Otherwise, the recorded signal 19 (Figure 5) is considered to be satisfactory (step S1.6.9).
  • a speech featuregram 63 is created using a process 64 by concatenating feature vectors 24 extracted from the section of the featuregram 25 that originates from the speech utterance.
  • the speech section of the featuregram is located via the speech endpoints 28, 28'.
• Creating a speech featuregram archetype. The aim of enrolment is to provide a characteristic voiceprint for one or more words or phrases.
• specimens of the same word or phrase provided by the same user usually differ from one another. Therefore, it is desirable to obtain a plurality of specimens and derive a model or archetypal specimen. This may involve discarding one or more specimens that differ significantly from other specimens.
• a speech featuregram archetype 65 is calculated using an averaging process 66 using w featuregrams 63_1, 63_2, ..., 63_w.
  • an average of three featuregrams 63 is taken.
• the featuregram archetype 65 is computed by determining a winning score D for each featuregram 63_1, 63_2, ..., 63_w warped, using a modified version of process 50 which is shown in Figure 32, against each other featuregram 63_1, 63_2, ..., 63_w to create a w-by-w featuregram cost matrix 67, whose diagonal elements are zero (steps S1.8.1 to S1.8.9).
• a minimum value D_min in the featuregram cost matrix 67 is determined (step S1.8.10). If the minimum value D_min is greater than a predefined threshold distance D_0, then all the featuregrams 63_1, 63_2, ..., 63_w are considered to be so dissimilar that a featuregram archetype 65 cannot be created (step S1.8.11).
• w featuregram archetypes 68_1, 68_2, ..., 68_w are computed using each featuregram 63_1, 63_2, ..., 63_w as a reference and warping each of the remaining (w−1) featuregrams 63_1, 63_2, ..., 63_w onto it (steps S1.8.12 to S1.8.21).
• a w-by-w featuregram archetype cost matrix 69 is computed whose elements consist of the winning scores E' from warping each featuregram 63_1, 63_2, ..., 63_w onto each featuregram archetype 68_1, 68_2, ..., 68_w (steps S1.8.22 to S1.8.28).
• An average featuregram archetype cost matrix 70 is computed by averaging elements within each column 71 corresponding to a featuregram 63_1, 63_2, ..., 63_w (steps S1.8.29 to S1.8.37).
• a maximum value E'_max in the featuregram archetype cost matrix 69 is also determined (step S1.8.38).
• the featuregram archetype 68_1, 68_2, ..., 68_w which provides the lowest mean featuregram archetype cost <E'_1>, <E'_2>, ..., <E'_w> is chosen to be included in the voice authentication biometric (steps S1.8.37 to S1.8.50).
• the mean featuregram archetype costs <E'_1>, <E'_2>, ..., <E'_w> are calculated by averaging elements within each row 72.
• If the maximum value E'_max in the featuregram archetype cost matrix 69 is greater than the threshold D_0, then a featuregram 63_1, 63_2, ..., 63_w is excluded, thus reducing the number of featuregrams to (w−1), and steps S1.8.1 to S1.8.50 are repeated (step S1.8.54).
• a featuregram 63_1, 63_2, ..., 63_w is chosen for exclusion by calculating a variance σ_1, σ_2, ..., σ_w for each featuregram archetype 68_1, 68_2, ..., 68_w and excluding the featuregram 63_1, 63_2, ..., 63_w corresponding to the featuregram archetype 68_1, 68_2, ..., 68_w having the lowest value of variance σ_1, σ_2, ..., σ_w (steps S1.8.51 to S1.8.53).
• a variance σ is calculated from the average featuregram archetype cost matrix 70 using:
• Steps S1.8.1 to S1.8.50 are repeated until a featuregram archetype 65 (Figure 28) is obtained or until only one featuregram 63_1, 63_2, ..., 63_w is left.
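The archetype-selection procedure can be summarised in a highly simplified sketch. All of the names below are illustrative; dtw_score stands for the winning-score comparison (for example the dtw_score sketch given earlier), and dtw_average stands for the warping-and-averaging step that builds a candidate archetype from a reference featuregram, which the description above does not spell out in detail.

```python
import numpy as np

def create_archetype(featuregrams, dtw_score, dtw_average, d0):
    """Simplified sketch of featuregram archetype selection for one prompt. Returns
    the chosen archetype, or None if the specimens are too dissimilar."""
    fgs = list(featuregrams)
    while fgs:
        w = len(fgs)
        if w == 1:
            return dtw_average(fgs[0], [])       # only one specimen left
        # w-by-w featuregram cost matrix with zero diagonal
        cost = np.array([[0.0 if i == j else dtw_score(fgs[i], fgs[j])
                          for j in range(w)] for i in range(w)])
        if cost[~np.eye(w, dtype=bool)].min() > d0:
            return None                          # all specimens too dissimilar
        # one candidate archetype per featuregram used as the reference
        archetypes = [dtw_average(fgs[i], [fgs[j] for j in range(w) if j != i])
                      for i in range(w)]
        # cost of warping every featuregram onto every candidate archetype
        arch_cost = np.array([[dtw_score(a, f) for f in fgs] for a in archetypes])
        if arch_cost.max() <= d0:
            # choose the candidate with the lowest mean cost across featuregrams
            return archetypes[int(np.argmin(arch_cost.mean(axis=1)))]
        # otherwise exclude the featuregram whose candidate archetype has the lowest
        # variance of costs, and repeat with w - 1 featuregrams
        fgs.pop(int(np.argmin(arch_cost.var(axis=1))))
    return None
```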
• a featuregram archetype 65 is obtained for each prompt. Thus, during subsequent authentication, a user is asked to provide a response to a prompt. A featuregram is obtained and compared with the featuregram archetype 65 using a dynamic time warping process which produces a score. The score is compared with a preset pass level. A score which falls below the pass level indicates a good match and so the user is accepted as being a valid user.
• a valid user is likely to provide a response that results in a low score, falling below the pass level, and which is accepted.
• occasionally, a valid user provides a response that results in a high score and which is rejected.
• an impostor may be expected to provide poor responses which are usually rejected. Nevertheless, they may occasionally provide a sufficiently close-matching response which is accepted.
• the pass level affects the proportion of valid users being incorrectly rejected, i.e. the "false reject rate" (FRR), and the proportion of impostors which are accepted, i.e. the "false accept rate" (FAR).
• a pass level for a fixed-word or fixed-phrase prompt is determined using previously captured recordings taken from a wide range of representative speakers.
• a featuregram archetype is obtained for each of a first set of users for the same prompt in a manner hereinbefore described. Thereafter, each user provides a spoken response to the prompt from which a featuregram is obtained and compared with the user's featuregram archetype using a dynamic time warping process so as to produce a score. This produces a first set of scores corresponding to valid users.
• the process is repeated for a second set of users, again using the same prompt. Once more, each user provides a spoken response to the prompt from which a featuregram is obtained. However, the featuregram is compared with a different user's featuregram archetype. Another set of scores is produced, this time corresponding to impostors.
• First and second probability density functions 73_1, 73_2 are fitted to the frequencies of the valid-user and impostor scores respectively, for example normal (Gaussian) density functions of the form p(x) = (1/(σ·√(2π)))·exp(−(x − μ)²/(2σ²)), where p is the probability density, x is the score, μ is the mean score and σ is the standard deviation.
• Other probability density functions may be used.
• the mean score μ_1 for valid users is expected to be lower than the mean score μ_2 for the impostors.
• the standard deviation σ_1 of the first density function, for the valid users, is usually smaller than the standard deviation σ_2 of the second density function.
• the first and second probability density functions 73_1, 73_2 are numerically integrated to produce first and second cumulative density functions 74_1, 74_2.
• the point of intersection 75 of the first and second cumulative density functions corresponds to the equal error rate (EER), at which the false reject rate equals the false accept rate.
  • the score at the point of intersection 75 is used as a pass score for the prompt.
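A minimal sketch of the pass-level calculation follows. It assumes normal probability density functions fitted to the valid-user and impostor scores and finds the point where the false reject and false accept rates are equal by numerical integration; the function name and grid resolution are illustrative.

```python
import numpy as np

def pass_level(valid_scores, impostor_scores):
    """Find the score at which FRR (valid users rejected) equals FAR (impostors
    accepted), assuming lower scores indicate better matches."""
    mu1, sd1 = np.mean(valid_scores), np.std(valid_scores)
    mu2, sd2 = np.mean(impostor_scores), np.std(impostor_scores)
    x = np.linspace(min(mu1 - 4 * sd1, mu2 - 4 * sd2),
                    max(mu1 + 4 * sd1, mu2 + 4 * sd2), 2000)
    dx = x[1] - x[0]

    def pdf(mu, sd):
        return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

    frr = 1.0 - np.cumsum(pdf(mu1, sd1)) * dx     # valid users scoring above the threshold
    far = np.cumsum(pdf(mu2, sd2)) * dx           # impostors scoring below the threshold
    return float(x[np.argmin(np.abs(frr - far))]) # point of intersection (equal error rate)
```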
• the voice authentication biometric 76 comprises sets of data 77_1, 77_2, ..., 77_q corresponding to featuregram archetypes 65 and associated prompts 78. Statistical information 79 regarding each featuregram archetype 65 and an associated prompt 78 may also be stored and will be described in more detail later.
  • the voice authentication biometric 76 further comprises ancillary information including the number of prompts to be issued during authentication 80, scoring strategy 81, higher level and gain settings 82.
  • the biometric 76 may include further information, for example related to high-level logic for analysing scores.
• the voice authentication biometric 76 is stored in non-volatile memory 11 (Figure 1).
• the voice authentication system 1 is initialised, for example by setting amplifier gain to a value stored in the voice authentication biometric, or calibrated, for example to ensure that an appropriate amplifier gain is set (step S2.1).
• the user is then prompted (step S2.2) and the user's responses are recorded (step S2.3).
• Featuregrams are generated from the recordings (step S2.4).
  • the recordings are examined so as to isolate speech from background noise and periods of silence (step S2.5, step S2.6).
  • Checks are performed to ensure that the recordings, isolated speech utterances and featuregrams are suitable for processing (step S2.7).
  • the featuregrams are then matched with the featuregram archetype (step S2.8).
  • the response is also checked for replay attack (step S2.9).
• the user's response is then scored (step S2.10).
• the gain of the amplifier 6 (Figure 1) is set according to the value 82 (Figure 37) stored in the voice authentication biometric 76 (Figure 37), which is stored in non-volatile memory 11 (Figure 1).
  • the system may be calibrated in a way similar to that used in enrolment.
  • the process may differ.
  • prompts used in authentication may differ from those used in enrolment.
• a value of gain determined during enrolment calibration need not be recorded but may be compared with a value stored in the voice authentication biometric and used to determine whether the user is a valid user.
  • Authentication prompts are chosen from those stored in the voice authentication biometric 76 ( Figure 37). Preferably, prompts are randomly chosen from a sub-set. This has the advantage that it becomes more difficult for a user to guess what prompt will be used and so give an unnatural response. Moreover, this improves security.
  • a signal 83 is recorded using the microphone 5 ( Figure 1) in a manner hereinbefore described.
• the or each recorded signal 83 is divided into timeslices 84.
• the timeslices 84 use the same window size and the same overlap as used for enrolment.
  • Feature vectors 85 are created. Again, the same process 25 is used in authentication as enrolment.
  • the feature vectors 85 are concatenated to produce featuregrams 86.
  • the featuregrams 86 generated during authentication are usually referred to as authentication featuregrams.
• explicit endpointing may be performed using the process 27 described earlier so as to generate endpoints 87.
• Explicit endpointing may be used to support sanity checks.
• the process 50 and the featuregram archetype 65 are used to word spot the authentication featuregram 86 and provide a dynamic time warping winning score 87.
  • the process 50 may be used to provide endpoints 28'.
• a potential threat to the security offered by any voice authentication system is the possibility of an impostor secretly recording a spoken response of the valid user and subsequently replaying the recording to gain access to the system. This is known as a "replay attack."
  • a fixed-phrase prompt is randomly selected (step S2.9.1).
  • An example of a fixed- phrase prompt is "This is my voiceprint”.
  • a recording is started (step S2.9.2).
  • the user is then prompted a first time (step S2.9.3).
• After a predetermined period of time, for example 1 or 2 seconds, the user is prompted a second time with the same prompt (steps S2.9.4 & S2.9.5).
• the user supplies two different examples 89_1, 89_2 separated by a 1-2 second interval 90.
• a featuregram 86 is generated as described earlier (step S2.9.6).
  • the interval may comprise silence and/or noise.
• the word spotting process 50 is used to isolate the two spoken responses 89_1, 89_2 to the fixed-phrase prompt and the interval 90 (steps S2.9.7 & S2.9.8).
• the isolated responses 89_1, 89_2 are fed to process 88.
  • Each truncated featuregram provides a representation of the spoken response.
  • the duration of the interval 90 is determined.
• a record 92 is kept of the degree of match between the two featuregrams derived from the responses 89_1, 89_2 and the length of the intermediate silence 90 (step S2.9.11).
  • This record 92 is known as a "Replay Attack Statistic" (RAS).
• the record 92 comprises two integers. Therefore, it is possible to store a plurality of replay attack statistics 92 for each fixed-phrase prompt in the voice authentication biometric 76 (Figure 36) without consuming a significant amount of memory.
  • the record 92 is stored in statistical information 79 ( Figure 37).
• the authentication may be rejected on suspicion of a replay attack. Additionally or alternatively, the process may be repeated using a different prompt and the check for a replay attack performed based on another set of replay attack statistics 92.
  • the authentication may also be rejected on suspicion of a replay attack (step S2.9.17 and step S2.9.18).
• the advantage of using this approach is that it is possible to monitor and detect suspicious similarities between featuregram archetypes even if the acoustic environment has changed since the time the recording was originally made. Furthermore, the approach helps to guard against replay attacks based on recordings made during enrolment and authentication. Additionally, the cost of storing the replay attack statistics is low, typically 3 bytes per prompt. Thus, to monitor the last 5 authentication attempts across 5 fixed prompts typically requires 75 bytes of memory.
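The replay attack statistic can be sketched as follows. The packing into two small integers follows the description; the comparison against statistics stored from earlier attempts, and the tolerances used, are illustrative assumptions rather than values from the description.

```python
def replay_attack_statistic(match_score, interval_length):
    """Pack a Replay Attack Statistic (RAS) as two integers: the degree of match
    between the two spoken responses and the length of the intermediate silence."""
    return (int(round(match_score)), int(round(interval_length)))

def suspected_replay(new_ras, stored_ras, score_tolerance=1, interval_tolerance=1):
    """Flag a suspected replay when a new statistic is implausibly close to one
    recorded during a previous enrolment or authentication attempt."""
    score, interval = new_ras
    return any(abs(score - s) <= score_tolerance and abs(interval - g) <= interval_tolerance
               for s, g in stored_ras)
```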
  • a decision on whether to accept or reject the user is based on the degree of match between featuregram archetypes 65 stored in the voice authentication biometric 76 ( Figure 36) and the featuregrams 86 derived from the authentication recordings.
• Higher-level decision logic is subsequently applied.
• Higher-level decision logic may include calculating an average score for a plurality of featuregrams 86 and determining whether the average score falls below a first predetermined scoring threshold, i.e. D_av < D_thresh1. If the average score falls below the first predetermined scoring threshold, then authentication is considered successful.
• Higher-level decision logic may include determining the number, n, of featuregrams 86 whose scores fall below a second predetermined scoring threshold, i.e. D_i < D_thresh2, for i = 1, ..., p.
  • the decision logic subsequently comprises checking a pass condition.
• the pass condition may be that the scores for n out of p featuregrams 86 fall below the second predetermined scoring threshold, where 1 ≤ n ≤ p. Allowing one or more of the featuregram scores to be ignored is useful because it allows the valid user to provide an uncharacteristic response to at least one of the prompts without being unduly penalised.
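The two higher-level strategies can be sketched as follows. The description presents them as alternatives, so the sketch returns each result separately; names and the way scores are supplied are illustrative.

```python
def average_score_pass(scores, avg_threshold):
    """First strategy: authentication succeeds if the average score D_av falls
    below the first predetermined scoring threshold."""
    return sum(scores) / len(scores) < avg_threshold

def n_of_p_pass(scores, per_prompt_threshold, n_required):
    """Second strategy: authentication succeeds if at least n of the p individual
    scores fall below the second predetermined scoring threshold, allowing one or
    more uncharacteristic responses to be ignored."""
    return sum(1 for s in scores if s < per_prompt_threshold) >= n_required
```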
• the scoring thresholds may be set based upon the statistical method described earlier.
  • a threshold may be determined during enrolment.
• a plurality of specimens, preferably two or three, of the same response are taken.
• a featuregram archetype is determined.
• Additionally, a variance is determined.
  • an alternative strategy may be used, which adaptively determines a number of prompts to be issued.
• Initially, a user is prompted a predetermined number of times, for example two or three times. Spoken responses are recorded, corresponding featuregrams are obtained and compared with the featuregram archetype so as to produce a number of scores. Depending on the scores, further prompts may be issued. For example, if all or substantially all the scores fall below a threshold score, indicating a good number of matches, then no further prompts are issued and authentication is successful. Conversely, if all or substantially all the scores exceed the threshold score, indicating a poor number of matches, then authentication is unsuccessful. However, if some scores fall below the threshold and other scores exceed the threshold, then further prompts are issued and further scores obtained.
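A sketch of the adaptive strategy is given below. prompt_user() and score_response() are placeholder callbacks, and the prompt counts and the final majority rule are illustrative assumptions; the description only states that further prompts are issued while the scores are mixed.

```python
def adaptive_authentication(prompt_user, score_response, threshold,
                            initial_prompts=3, max_prompts=6):
    """Prompt a few times, then keep prompting only while some scores pass and
    others fail; accept when all pass, reject when all fail."""
    scores = [score_response(prompt_user()) for _ in range(initial_prompts)]
    while len(scores) < max_prompts:
        passes = sum(s < threshold for s in scores)
        if passes == len(scores):
            return True                                   # all good matches
        if passes == 0:
            return False                                  # all poor matches
        scores.append(score_response(prompt_user()))      # mixed: issue a further prompt
    return sum(s < threshold for s in scores) > len(scores) / 2
```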
  • the voice authentication system is comprised in a single unit, such as a personal computer.
  • the voice authentication system may be distributed.
• the processor for performing the matching process and non-volatile memory holding the voice authentication biometric may be held on a so-called "smart card" which is carried by the valid user.
  • the door is provided with a microphone and a smart card reader.
  • the door is also provided with a speaker for providing audio prompts and/or a display for providing text prompts.
  • the voice authentication system is connected and permits authentication and, optionally, enrolment. Enrolment may be performed elsewhere, preferably under supervision of the system administrator, using another microphone and smart card reader together with speaker and/or display. This has the advantage that access is conditional not only on successful authentication, but also possession of the smart card.
  • the voice authentication biometric and the matching process may be encrypted.
• the smart card may also be used in personal electronic devices, such as cellular telephones and personal data assistants.
  • a modified voice authentication system 1' is provided by a personal computer 93, for example in the form of a lap-top personal computer, and a smart card 94.
  • the personal computer 93 includes a smart card reader 95 for permitting the personal computer 93 and smart card 94 to exchange signals.
  • the smart card reader 95 may be a peripheral device connected to the computer 93.
• the smart card 94 includes an input/output circuit 96, processor 9', non-volatile memory 10' and volatile memory 11'. If the smart card 94 is used for storing the voice authentication biometric, but not for performing the matching or other processes, then the smart card 94 need not include the processor 9' and volatile memory 11'.
• the smart card 94 takes the form of a contact smart card 94_1.
• the contact smart card 94_1 includes a set of contacts 97 and a chip 98.
• An example of a contact smart card 94_1 is a JCOP20 card.
• the smart card 94 may alternatively take the form of a contactless smart card 94_2.
• the contactless smart card 94_2 includes a loop or coil 99 and a chip.
• the contactless smart card 94_2 may include a plurality of sets of loops (not shown) and corresponding chips (not shown).
• An example of a contactless smart card 94_2 is an iCLASS™ card produced by HID Corporation™.
• the contact smart card 94_1 and smart card reader 95 are shown in more detail.
• the contact smart card 94_1 and smart card reader 95 are connected by an interface 101 including a voltage line Vcc 102_1, for example at 3 or 5 V, a reset line Rst 102_2 for resetting RAM, a clock line 102_3 for providing an external clock signal from which an internal clock is derived and an input/output line 102_4.
  • the interface conforms to ISO 7816.
• Volatile memory 11' (Figure 48) is in the form of RAM 103 and is used during operation of software on the card. If a reset signal is applied to line Rst 102_2 or if the card is disconnected from the card reader 95, then the contents of the RAM 103 are reset.
  • Non-volatile memory 10' ( Figure 48) is in the form of ROM 104 and EEPROM 105.
  • An operating system (not shown) is stored in ROM 104.
• Application software (not shown) and the voice authentication biometric 76 (Figure 38) are stored in EEPROM 105.
  • Contents of the EEPROM 105 may be set using a card manufacturer's development kit.
• the EEPROM 105 may have a memory size of 8 or 16 kbits, although other memory sizes may be used.
• Processor 9' may be in the form of an embedded processor 106 which handles, amongst other things, encryption, for example based on the triple data encryption standard (DES).
• the interface 96, RAM 103, ROM 104, EEPROM 105 and processor 106 may be incorporated into chip 98 (Figure 49), although a plurality of chips may be used.
• the smart card reader 95 is connected to the personal computer 93, which runs one or more computer programs for permitting communication with the smart card 94_1 using Application Protocol Data Units (APDUs), for example as specified according to ISO 7816-4.
• Referring to Figure 52, a table 107 lists APDU commands 108 which can be sent to the smart card 94 (Figure 49) and corresponding responses 109.
• Matrix 110 links commands 108 with corresponding responses 109 using an 'X'.
• The term "template" is used hereinafter to refer to a featuregram archetype 65.
  • a voice authentication biometric 76 ( Figure 38) is stored on the smart card 94 which includes one or more templates 65 ( Figure 38).
  • One or more "template download” commands are transmitted, each command including a respective section or portion of the template 65 (step S3.1) to be stored on the smart card 94 in EEPROM 105 ( Figure 51).
  • a section of a template can be a feature vector.
  • the smart card 94 returns a response indicating whether template download commands were successful or unsuccessful (step S3.2). If unsuccessful, a response specifies an error and the process is repeated (steps S3.3 & S3.4).
• This process may be repeated a plurality of times for each template 65 (Figure 38), for example corresponding to different prompts.
• If no more sections of a template are to be downloaded, then a "template download ended" command is sent (step S3.5).
  • the smart card 94 returns a response indicating whether template download was successful or unsuccessful (step S3.6).
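The template download exchange can be sketched as follows. transmit(apdu) stands for whatever card interface is in use and is assumed to return the status words (sw1, sw2); the class and instruction bytes are purely illustrative assumptions, not values defined by the description or by ISO 7816-4 (only the 9000 success status is standard).

```python
def download_template(transmit, template_sections,
                      cla=0x80, ins_download=0x20, ins_download_end=0x22):
    """Send one "template download" command per template section (e.g. one feature
    vector each), then a "template download ended" command."""
    for section in template_sections:
        data = list(section)
        apdu = [cla, ins_download, 0x00, 0x00, len(data)] + data
        sw1, sw2 = transmit(apdu)
        if (sw1, sw2) != (0x90, 0x00):                   # 0x9000 indicates success
            raise IOError("template download command failed: %02X %02X" % (sw1, sw2))
    sw1, sw2 = transmit([cla, ins_download_end, 0x00, 0x00, 0x00])
    return (sw1, sw2) == (0x90, 0x00)
```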
• Other data, such as data relating to the prompt, may be included in the data field of the "template download" command.
• Template upload.
• At least one template 65 is compared with a featuregram 86 (Figure 45).
  • the smart card 94 need not necessarily perform the comparison, i.e. the comparison is performed "off-chip". This may be because the smart card 94 does not have a sufficiently powerful processor. Alternatively, it may be decided to perform an "off-chip" comparison. Under these circumstances, one or more templates 65 ( Figure 38) are uploaded to the computer 93.
• the process is similar to template downloading, but carried out in the reverse direction.
  • an "upload template to laptop” command is used.
• At least one template 65 is compared with a featuregram 86 (Figure 45). It is advantageous for the smart card 94 to perform the comparison, i.e. the comparison is performed "on-chip", in which case templates 65 (Figure 38) do not leave the smart card 94. This can help prevent copying, stealing and corruption of templates 65 (Figure 38).
  • One or more "feature vector download” commands are transmitted each including a respective feature vector (step S4.1).
  • the smart card 94 returns a response indicating whether feature vector download was successful or unsuccessful (step S4.2). If unsuccessful, the response specifies an error and the process is repeated (steps S4.3 & S4.4).
  • a "feature vector download ended" command is sent (step S4.5).
  • the smart card 94 returns a response indicating whether feature vector download was successful or unsuccessful (step S4.6).
• a "return score" command is sent to the card (step S5.1).
• the processor 106 compares the template 65 with the featuregram 86 as described above to produce a score, which is compared with a threshold, which may be hard coded or previously downloaded (step S5.2), and a response is returned indicating whether authentication was successful or unsuccessful (step S5.3).
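On-chip matching follows the same pattern, ending with the "return score" command; again the instruction bytes and the transmit() callback are illustrative assumptions.

```python
def authenticate_on_card(transmit, feature_vectors,
                         cla=0x80, ins_fv=0x30, ins_fv_end=0x32, ins_score=0x34):
    """Download the authentication feature vectors one per command, signal the end of
    the download, then ask the card to compare them with its stored template and
    report whether the score passes the threshold."""
    for fv in feature_vectors:
        data = list(fv)
        sw1, sw2 = transmit([cla, ins_fv, 0x00, 0x00, len(data)] + data)
        if (sw1, sw2) != (0x90, 0x00):
            raise IOError("feature vector download failed")
    transmit([cla, ins_fv_end, 0x00, 0x00, 0x00])
    sw1, sw2 = transmit([cla, ins_score, 0x00, 0x00, 0x00])
    return (sw1, sw2) == (0x90, 0x00)                    # success implies authentication passed
```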
• APDU commands may be used to delete featuregrams. As shown in Figure 57, other APDU commands may be provided, such as "delete biometric" for deleting all templates and other data and "verify biometric loaded" for checking whether the card holds a voice authentication biometric.
• the smart card can perform other processes, including some of the processes described earlier, such as detecting replay attacks and performing higher-level logic.
  • Storing the voice authentication biometric on a smart card can have several advantages.
  • Authentication can be performed at a remote site without the need to communicate with a server holding a database containing voice authentication biometrics.
• Because the smart card is available locally at the point of use, it helps avoid the need to communicate through telephone or data lines, thus helping to save costs, increase speed and improve security.
  • Performing matching on the smart card can also have several advantages.
  • the recorded signal may comprise a stereo recording.
• the smart card may be any sort of token, such as a tag or key, or incorporated into objects such as a watch or jewellery, which can be held in the user's possession.
  • Information storage media and devices may be used, such as a memory stick, floppy disk or optical disk.
  • the smart card may be a mobile telephone SIM card.
  • the smart card may be marked and/or store data so as to identify that the card belongs to a given user.
  • a prompt may be a green light or the word "Go".
• Measurements of background noise may be made in different ways. For example, a recorded signal, or part thereof, may be divided into a plurality of frames. A value of background noise may be determined by selecting one or more of the lowest energy frames and either using one of the selected frames as a representative frame or obtaining an average of all the selected frames. To select the one or more lowest energy frames, the frames may be arranged in order of signal energy. Thereafter, the ordered frames may be examined to determine a boundary where signal energy jumps from a relatively low level to a relatively high level. Alternatively, a predetermined number of frames at the lower energy end may be selected.
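The frame-based background noise measurement described above can be sketched as follows; the frame length and the number of low-energy frames averaged are illustrative assumptions.

```python
import numpy as np

def background_noise_level(recorded_signal, frame_length=256, n_lowest=5):
    """Divide the recorded signal into frames, order the frames by energy and
    average a number of the lowest-energy frames to estimate the background noise."""
    samples = np.asarray(recorded_signal, float)
    n_frames = len(samples) // frame_length
    frames = samples[:n_frames * frame_length].reshape(n_frames, frame_length)
    energies = np.sort(np.square(frames).mean(axis=1))   # frame energies, ascending
    return float(energies[:n_lowest].mean())
```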

Abstract

A method of voice authentication comprises enrolment and authentication stages. During enrolment, a user is prompted to provide a spoken response which is recorded using a microphone (5). The recorded signal is divided into frames and converted into feature vectors. Feature vectors are concatenated to form featuregrams. Endpoints of speech may be determined either using explicit endpointing based on analysis of the energy of the timeslices or using dynamic time warping methods. Featuregrams corresponding to speech are generated and averaged together to produce a speech featuregram archetype. During an authentication stage, a user is prompted to provide a spoken response to the prompt, from which a speech featuregram is obtained. The speech featuregram obtained during authentication is compared with the speech featuregram archetype and is scored. The score is evaluated to determine whether the user is a valid user or an impostor.

Description

Voice authentication
Field of the Invention
The present invention relates to voice authentication.
Background Art
Voice authentication may be defined as a process in which a user's identity is validated by analysing the user's speech patterns. Such a process may be used for controlling access to a system, such as a personal computer, cellular telephone handset or telephone banking account.
Aspects of voice authentication are known in voice recognition systems. Examples of voice recognition systems are described in US-A-4956865, US-A-507939, US-A- 5845092 and WO-A-0221513.
Summary of the Invention
The present invention seeks to provide voice authentication.
According to the present invention there is provided a token storing a voice authentication biometric. The token may be suitable for possession by the user and may be small enough to be kept on a user, worn by the user as jewellery or kept in a pocket of an article of clothing worn by them. The token may be an information storage medium or device.
According to the present invention there is also provided a smart card storing a voice authentication biometric.
Storing a voice authentication biometric on a smart card can help validate that a smart card user is the smart card owner.
The voice authentication biometric may be suitable for use in authenticating a user using a sample of speech from the user. The voice authentication biometric may include at least one set of feature vectors, such as archetype set of feature vectors.
Case: PJP/40810PCT1 The voice authentication biometric may include at least one prompt, each prompt associated with a respective set of feature, vectors. The voice authentication biometric may include corresponding statistical information relating to each set of feature vectors. The voice authentication biometric may include data for controlling authentication procedure, data for determining authentication and/or data for configuring a voice authentication apparatus. The voice authentication biometric may be encrypted. The token or smart card may include non-volatile memory storing the voice authentication biometric. The token or smart card may store a computer program comprising program instructions for causing a computer to perform a matching process for use in voice authentication.
According to the present invention there is provided a token for voice authentication including a processor, the token storing a voice authentication biometric including a first set of feature vectors and a computer program comprising program instructions for causing the processor to perform a method, the method comprising receiving a second set of feature vectors and comparing the first and second set of feature vectors.
According to the present invention there is provided a smart card for voice authentication including a processor, the smart card storing a voice authentication biometric including a first set of feature vectors and storing a computer program comprising program instructions for causing the processor to perform a method, the method comprising receiving a second set of feature vectors and comparing the first and second set of feature vectors.
The computer program may comprise program instructions for causing the processor to perform a method, the method comprising requesting a user to provide a spoken response. The computer program may comprise program instructions for causing the processor to perform a method, the method comprising receiving a recorded signal including a recorded signal portion corresponding to a spoken response. The computer program may comprise program instructions for causing the processor to perform a method, the method comprising determining endpoints of the recorded signal portion corresponding to a spoken response. The computer program may comprise program instructions for causing the processor to perform a method, the method comprising deriving the second set of feature vectors for characterising the recorded signal portion. The computer program may comprise program instructions for causing the processor to perform a method, the method comprising producing a score dependent upon a degree of matching between the first and second set of feature vectors. The computer program may comprise program instructions for causing the processor to perform a method, the method comprising comparing the score with a predefined threshold so as to determine authentication of a user.
According to the present invention there is also provided a method of voice authentication, the method comprising in a token or smart card providing a first set of feature vectors, receiving a second set of feature vectors for characterising a recorded signal portion and comparing the first and second sets of feature vectors.
The method may further comprise providing data relating to a prompt. The method may further comprise receiving a recorded signal including a recorded signal portion corresponding to a spoken response. The method may further comprise determining endpoints of the recorded signal portion. The method may further comprise deriving the second set of feature vectors for characterising the recorded signal portion. The method may further comprise producing a score dependent upon a degree of matching between the first and second set of feature vectors. The method may further comprise comparing the score with a predefined threshold so as to determine authentication of a user. The method may further comprise receiving a recorded signal which includes a recorded signal portion corresponding to a spoken response and which includes a plurahty of frames, determining endpoints of the recorded signal including determining whether a value of energy for a first frame exceeds a first predetermined value and determining whether a second frame immediately preceding the first frame represents a spoken utterance portion. The method may further comprise requesting the authenticating user to provide first and second spoken responses to the prompt, obtaining a recorded signal including first and second recorded signal portions corresponding to the first and second spoken responses, isolating the first and second recorded signal portions, deriving second and third sets of feature vectors for characterising the first and second isolated recorded signal portions respectively, comparing the second set of feature vectors with the third set of feature vectors so as to produce a score dependent upon the degree of matching; and comparing the score with a predefined threshold so as for determine whether the first set of feature vectors is substantially identical to the second set of feature vectors. The method may further comprise requesting a user to provide a plurahty of spoken responses to a prompt, obtaining a plurahty of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response, deriving a plurahty of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion, comparing the sets of feature vectors with the first set of feature vectors so as to produce a plurahty of scores dependent upon a degree of matching and determining whether authentication is successful in dependence upon the plurahty of scores. The method may further comprise receiving a recorded signal which includes a recorded signal portion, determining endpoints of the recorded signal by dynamic time warping the second set of feature vectors onto the first set of feature vectors, including determining a first sub-set of feature vectors within the second set of feature vectors from which a dynamic time warping winning path may start and determining a second sub-set of feature vectors within the second set of feature vectors at which the dynamic time warping winning path may finish.
Endpointing seeks to locate a start and stop point of a spoken utterance.
The present invention seeks to provide an improved method of endpointing.
According to the present invention there is also provided a method of determining an endpoint of a recorded signal portion in a recorded signal including a plurahty of frames, the method comprising deterrmning whether a value of energy for a first frame exceeds a first predetermined value and determining whether a second frame immediately preceding the first frame represents a spoken utterance portion. The first predetermined value may represent a value of energy of a frame comprised of background noise. The method may comprise defining a start point if the value of energy of the first frame exceeds the first predetermined value and the second frame does not represent a spoken utterance portion. The method may further comprise indicating that the first frame represents a spoken utterance portion. The method may comprise defining a stop point if the value of energy of the first frame does not exceed the first predetermined value and the second frame represents a spoken utterance portion. The method may comprise defining the first frame as not representing a spoken utterance portion. The method may comprise counting a number of frames preceding a start point of the spoken utterance portion. The method may further comprise pairing the stop point with the start point of the spoken utterance portion if the number of frames exceeds a predetermined number. The method may further comprise pairing the stop point with start point of a preceding spoken utterance portion if the number of frames does not exceed a predetermined number. The method may comprise determining whether the value of energy for a first frame exceeds a third predetermined value and counting a number of frames preceding a start point of the spoken utterance portion. The method may further comprise defining a start point if the value of energy of the first frame exceeds the third predetermined value, the second frame does not represent a spoken utterance portion and if the number of frames does not exceed a predetermined number. The method may further comprise determining whether a value of energy for a third frame following the first frame exceeds the second predetermined value. The method may further comprise defining a stop point if the value of energy of the third frame does not exceed the third predetermined value. The method may further comprise pairing the stop point with the start point of the spoken utterance portion. The method may further comprise pairing the stop point with a start point of a preceding spoken utterance portion. The method may comprise defining the first frame as representing background noise if the value of energy of the first frame does not exceed the third predetermined value. The method may further comprise calculating an updated value of background energy using the value of energy of the first frame. The method may further comprise counting a number of frames preceding a start point of the spoken utterance portion and determining whether the number of frames exceeds another, large number. The method may comprise determining whether a value of rate of change of energy of the first frame exceeds a second predetermined value. The second predetermined value may represent a value of rate of change of energy of a frame comprised of background noise. 
The method may further comprise defining a start point if the value of energy of the first frame exceeds the first predetermined value, and the value of rate of change of energy exceeds the second predetermined value and the second frame does not represent a spoken utterance portion. The method may comprise defining a stop point if the value of energy of the first frame does not exceed the first predetermined value, and the value of rate of change of energy does not exceed the second predetermined value and the second frame represents a spoken utterance portion. The method may comprise determining whether the value of rate of change of energy for the first frame exceeds a fourth predetermined value.
Many voice recognition and authentication systems use dynamic time warping to match a recording to a template. However, a user may pause, cough, sigh or generate other sounds before or after providing a response to a prompt. These silences or sounds may be included in the recording. Thus, only a portion of the recording is relevant.
The present invention seeks to provide a solution to this problem.
According to the present invention there is provided a method of dynamic time warping for warping a first speech pattern characterised by a first set of feature vectors onto a second speech pattern characterised by a second set of feature vectors, the method comprising identifying a first sub-set of feature vectors within the first set of feature vectors from which a dynamic time warping winning path starts and identifying a second sub-set of feature vectors within the first set of feature vectors at which the dynamic time warping winning path finishes.
The first speech pattern may include speech, background noise and/or silence.
The present invention seeks to provide a method of voice authentication. According to the present invention there is provided a method of voice authentication comprising: enrolling a user including requesting the enrolling user to provide a spoken response to a prompt, obtaining a recorded signal including a recorded signal portion corresponding to the spoken response, determining endpoints of the recorded signal portion, deriving a set of feature vectors for characterising the recorded signal portions, averaging a plurahty of sets of feature vectors, each set of feature vectors relating to one or more different spoken responses to the prompt by the enrolling user so as to provide an archetype set of feature vectors for the response, storing the archetype set of feature vectors together with data relating to the prompt; and authenticating a user including retrieving the data relating to the prompt and the archetype set of feature vectors, requesting the authenticating user to provide another spoken response to the prompt, obtaining another recorded signal including another recorded signal portion corresponding to the other spoken response, determining endpoints of the other recorded signal portion, deriving another set of feature vectors characterising the other recorded signal portions, comparing the another set of feature vectors with the archetype set of feature vectors so as to produce a score dependent upon a degree of matching and comparing the score with a predefined threshold so as to determine whether the enrolling user and the authenticating user are the same.
Voice authentication systems typically include an amplifier. If a user provides a spoken response which is too quiet, then amplifier gain may be increased. Conversely, if a spoken response is too loud, then amplifier gain may be reduced. Usually, a succession of samples is taken and amplifier gain is increased or reduced accordingly until a settled value of amplifier gain is obtained. However, there is a danger that the amplifier gain rises and falls and never settles.
The present invention seeks to ameliorate this problem.
According to the present invention there is provided a method of gain control comprising, a plurality of times, determining whether an amplified signal level is above a predetermined limit, either decreasing gain if the amplified signal level is above the predetermined limit or maintaining gain otherwise, thereby permitting no increase in gain.
According to the present invention there is provided a method of gain control comprising, a plurality of times, determining whether an amplified signal level is below a predetermined limit, either increasing gain if the amplified signal level is below the predetermined limit or maintaining gain otherwise, thereby permitting no decrease in gain.
A potential threat to the security offered by any voice authentication system is the possibility of an impostor secretly recording a spoken response of a valid user and subsequently replaying a recording to gain access to the system. This is known as a "replay attack."
The present invention seeks to help detect a replay attack.
According to the present invention there is provided a method of voice authentication comprising requesting a user to provide first and second spoken responses to a prompt, obtaining a recorded signal including first and second recorded signal portions corresponding to the first and second spoken responses, isolating the first and second recorded signal portions, deriving first and second sets of feature vectors for characterising the first and second isolated recorded signal portions respectively, comparing the first set of feature vectors with the second set of feature vectors so as to produce a second score dependent upon the degree of matching and comparing the second score with another predefined threshold so as to determine whether the first set of feature vectors is substantially identical to the second set of feature vectors.
During authentication, users may occasionally provide an uncharacteristic spoken response.
The present invention seeks to provide an improved method of dealing with uncharacteristic responses. According to the present invention there is provided a method of voice authentication including requesting an authenticating user to provide a plurality of spoken responses to a prompt, obtaining a plurality of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response, deriving a plurality of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion, comparing the sets of feature vectors with an archetype set of feature vectors so as to produce a plurality of scores dependent upon a degree of matching and determining whether authentication is successful in dependence upon the plurality of scores.
If a user provides a spoken response during authentication, a threshold score is usually generated.
The present invention seeks to provide an improved method of determining an authentication threshold score.
According to the present invention there is provided a method of determining an authentication threshold score, the method including requesting a first set of users to provide respective spoken responses to a prompt, for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response, for each user, deriving a set of feature vectors for characterising the recorded signal portion, for each user, comparing the set of feature vectors with an archetype set of feature vectors for the user so as to produce a score dependent upon a degree of matching, fitting a first probabihty density function to frequency of scores for the first set of users, requesting a second set of users to provide respective spoken responses to a prompt, for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response, for each user, deriving a set of feature vectors for characterising the recorded signal portion, for each user, comparing the set of feature vectors with an archetype set of feature vectors for a different user so as to produce a score dependent upon a degree of matching, fitting a second probabihty density function to frequency of scores for the set of users. The present invention seeks to provide a, method of averaging a plurahty of feature vectors.
According to the present invention there is provided a method of averaging a plurality of feature vectors, the method comprising providing a plurality of sets of feature vectors, comparing each set of feature vectors with each other set of feature vectors so as to produce a respective set of scores dependent upon a degree of matching, searching for a minimum score and determining whether at least one score is below a predetermined threshold.
According to the present invention there is provided a smart card for voice authentication comprising means for storing a first set of feature vectors and data relating to a prompt, means for providing the data to an external circuit, means for receiving a second set of feature vectors relating to the prompt, means for comparing the first and second set of feature vectors so as to determine a score; and means for comparing the score with a predetermined threshold.
According to the present invention there is provided a smart card for voice authentication comprising a memory for storing a first set of feature vectors and data relating to a prompt, an interface for providing the data to an external circuit and for receiving a second set of feature vectors relating to the prompt, a processor for comparing the first and second set of feature vectors so as to determine a score and for comparing the score with a predetermined threshold.
According to the present invention there is also provided an information storage medium storing a voice authentication biometric. Preferably, the information storage medium is portable and may be, for example, a memory stick.
According to the present invention there is also provided a computer program comprising program instructions for causing a smart card to perform a method, the method comprising retrieving from memory a first set of feature vectors, receiving a second set of feature vectors and comparing the first and second set of feature vectors.
According to the present invention there is also provided a method comprising writing at least part of a voice authentication biometric to a smart card or token. The at least part of a voice authentication biometric is a set of feature vectors.
According to the present invention there is also provided a method comprising writing a computer program to a smart card or token, the computer program comprising computer instructions for performing a method, the method comprising performing voice authentication.
According to the present invention there is also provided a method comprising writing at least part of a voice authentication biometric to a smart card or token and writing a computer program to said smart card or token, the computer program comprising computer instructions for performing a method, the method comprising performing voice authentication.
According to the present invention there is also provided a smart card for voice authentication including a processor, the smart card storing a computer program comprising program instructions for causing the processor to perform a method, the method comprising performing voice authentication.
According to the present invention there is also provided a smart card reader/writer connected to apparatus for recording speech and generating feature vectors, said reader/writer being configured to transmit a set of feature vectors to a smart card or token and receive a response therefrom.
Brief Description of the Drawings Embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings in which:
Figure 1 shows a voice authentication system 1 for performing a method of voice authentication; Figure 2 is a process flow diagram of a method of voice authentication;
Figure 3 is a process flow diagram of a method of enrolment;
Figure 4 is a process flow diagram of a method of calibration;
Figure 5 is an analog representation of a recorded signal; Figure 6 is a generic representation of a recorded signal;
Figure 7 is a digital representation of a recorded signal;
Figure 8 illustrates dividing a recorded signal into timeslices;
Figure 9 is a process flow diagram of a method of generating a featuregram;
Figure 10 illustrates generation of a feature vector; Figure 11 illustrates generation of a featuregram from a plurality of feature vectors;
Figure 12 shows first and second endpointing processes;
Figure 13 illustrates explicit endpointing;
Figure 14 is a process flow diagram of a method of explicit endpointing;
Figure 15 illustrates determination of energy and delta energy values of a timeslice; Figure 16 shows pairing of a stop point with two start points;
Figure 17 shows pairing of a stop point with a start point of a preceding section;
Figure 18 is a process flow diagram of a method of detecting lip smack;
Figure 19 shows pairing of a stop point with an updated start point for removing lip smack; Figure 20 illustrates a dynamic time warping process for word spotting;
Figure 21 shows a warping function from a start point to an end point;
Figure 22 illustrates a local slope constraint on a warping function;
Figure 23 illustrates a global condition imposed on a warping function;
Figure 24 is a process flow diagram of a method of finding a minimum distance for an optimised path from a start point to an end point representing matched speech patterns;
Figure 25a shows an array following initialisation for holding a cumulative distance associated with a path from a start point to an end point;
Figure 25b shows an array for holding a cumulative distance associated with a path from a start point to an end point;
Figure 25c shows a completed array for holding a cumulative distance associated with a path from a start point to an end point including a winning path; Figure 26 shows a process flow diagram of a method of performing a plurality of sanity checks;
Figure 27 illustrates creation of a speech featuregram;
Figure 28 illustrates generation of a speech featuregram archetype; Figure 29 is a process flow diagram of a method of generating a speech featuregram archetype;
Figure 30 illustrates generation of a featuregram cost matrix;
Figure 31 shows a featuregram cost matrix;
Figure 32 is a process flow diagram of a method of finding a minimum distance for an optimised path from a start point to an end point representing matched speech patterns;
Figure 33 illustrates creation of featuregram archetypes using featuregrams;
Figure 34 illustrates generation of a featuregram archetype cost matrix;
Figure 35 shows a featuregram archetype cost matrix; Figure 36 shows a probability distribution function;
Figure 37 shows a continuous distribution function;
Figure 38 shows a voice authentication biometric;
Figure 39 is a process flow diagram of a method of authentication;
Figure 40 is an analog representation of an authentication recorded signal; Figure 41 illustrates dividing an authentication recorded signal into timeslices;
Figure 42 illustrates generation of an authentication feature vector;
Figure 43 illustrates generation of an authentication featuregram from a plurality of feature vectors;
Figure 44 illustrates generation of endpoints; Figure 45 illustrates comparison of a featuregram archetype with an authentication featuregram;
Figure 46 illustrates a featuregram including first and second spoken responses of the same prompt for detecting replay attack;
Figure 47 is a process flow diagram of a method of detecting replay attack; Figure 48 shows a voice authentication system employing a smart card;
Figure 49 illustrates a contact smart card;
Figure 50 illustrates a contactless smart card;
Figure 51 is a schematic diagram showing a smart card reader and a smart card; Figure 52 is an application program data unit (APDU) table; Figure 53 shows exchange of messages between a laptop and a smart card during template loading;
Figure 54 shows a first exchange of messages between a laptop and a smart card during authentication; and
Figure 55 shows a second exchange of messages between a laptop and a smart card during authentication.
Detailed Description of the Invention Voice authentication system 1
Referring to Figure 1, a voice authentication system 1 for performing a method of voice authentication is shown. The voice authentication system 1 limits access by a user 2 to a secure system 3. The secure system 3 may be physical, such as a room or building, or logical, such as a computer system, cellular telephone handset or bank account. The voice authentication system 1 is managed by a system administrator 4.
The voice authentication system 1 includes a microphone 5 into which a user may provide a spoken response and which converts a sound signal into an electrical signal, an amplifier 6 for amplifying the electrical signal, an analog-to-digital (A/D) converter 7 for sampling the amplified signal and generating a digital signal, a filter 8, a processor 9 for performing signal processing on the digital signal and controlling the voice authentication system 1, volatile memory 10 and non-volatile memory 11. In this example, the A/D converter 7 samples the amplified signal at 11025 Hz and provides a mono-linear 16-bit pulse code modulation (PCM) representation of the signal. The digital signal is filtered using a 4th order 100Hz high-pass filter to remove any d.c. offset.
The system 1 further includes a digital-to-analog (D/A) converter 12, another amplifier 13 and a speaker 14 for providing audio prompts to the user 2 and a display 15 for providing text prompts to the user 2. The system 1 also includes an interface 16, such as a keyboard and/or mouse, and a display 17 for allowing access by the system administrator 4. The system 1 also includes an interface 18 to the secure system 3. In this embodiment, the voice authentication system 1 is provided by a personal computer which operates software performing the voice authentication process.
Referring to Figure 2, the voice authentication process comprises two stages, namely enrolment (step S1) and authentication (step S2).
The purpose of the enrolment is to obtain a plurality of specimens of speech from a person who is authorised to enrol with the system 1, referred to herein as a "valid user". The specimens of speech are used to generate a reliable and distinctive voice authentication biometric, which is subsequently used in authentication.
A voice authentication biometric is a compact data structure comprising acoustic information-bearing attributes that characterise the way a valid user speaks. These attributes take the form of templates, herein referred to as "featuregram archetypes" (FGAs), which are described in more detail later.
The valid user's voice authentication biometric may also include further information relating to enrolment and authentication. The further information may include data relating to prompts to which a valid user has responded during enrolment, which may take the form of text prompts or equivalent identifiers, the number of prompts to be used during authentication and whether prompts should be presented in a random order during authentication, and other data relating to authentication such as scoring strategy, pass/fail/retry thresholds, the number of acceptable failed attempts and amplifier gain.
Enrolment
Referring to Figure 3, the enrolment process corresponding to step S1 in Figure 2 is shown:
The voice authentication system 1 is calibrated, for example to ensure that a proper amplifier gain is set (step S1.1). Once the system is calibrated, a plurality of spoken responses are recorded (step S1.2). The recordings are characterised by generating so-called "featuregrams", which comprise a set of feature vectors (step S1.3). The recordings are also examined so as to isolate speech from background noise and periods of silence (step S1.4, step S1.5). Checks are performed to ensure that the recorded responses, isolated specimens of speech and featuregrams are suitable for processing (step S1.6). A plurality of speech featuregrams are then generated (step S1.7). Thereafter, an average of some or all of the featuregrams is taken thereby forming a more representative featuregram, namely a featuregram archetype (step S1.8). A pass level is set (step S1.9) and a voice authentication biometric is generated and stored (step S1.10).
Calibration
Referring to Figure 4, the calibration process of step S1.1 (Figure 3) is described in more detail:
One of the purposes of calibration is to set the gain of the amplifier 6 (Figure 1) such that the amplitude of a captured speech utterance is of a predetermined standard. The predetermined standard may specify that the amplitude of the speech utterance peaks at a predetermined value, such as 70% of a full-scale deflection of a recording range. In this example, the A/D converter 7 (Figure 1) is 16-bits wide and so 70% of full-scale deflection corresponds to a signal of 87dB. The predetermined standard may also specify that the signal has a minimum signal-to-noise ratio, for instance 20dB which corresponds to a signal ten times stronger than the background noise.
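The 87dB figure can be reproduced directly from the converter width. A minimal sketch in Python, assuming the level is expressed relative to one least-significant bit of the 16-bit converter (the reference level is not stated in the text):

```python
import math

FULL_SCALE = 2 ** 15 - 1          # peak amplitude of a signed 16-bit sample (32767)
TARGET_FRACTION = 0.70            # peak at 70% of full-scale deflection

peak = TARGET_FRACTION * FULL_SCALE
level_db = 20 * math.log10(peak)  # amplitude level relative to 1 LSB

print(f"70% of full scale = {peak:.0f} counts = {level_db:.1f} dB")  # ~87.2 dB
```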
The gain of the amplifier 6 (Figure 1) is set to the highest value (step S1.1.1) and first and second counters are set to zero (steps S1.1.2 & S1.1.3). The first counter keeps a tally of the number of specimens provided by the valid user. The second counter is used to determine the number of consecutive specimens which meet the predetermined standard.
A prompt is issued (step S1.1.4). In this example, the prompt is randomly selected. This has the advantage that it prevents the valid user from anticipating the spoken response and thereby providing an uncharacteristic or unnatural response, for example one which is unnaturally loud or quiet. The prompt may be a text prompt or an audio prompt. The valid user may be prompted to say a single word, such as "thirty-four", or a phrase, such as "My voice is my pass phrase". In this example, the prompts comprise numbers. Preferably, the numbers are chosen from a range between 21 and 99. This has the advantage that the spoken utterance is sufficiently long and complex so as to include a plurality of features.
A speech utterance is recorded (step S1.1.5). This comprises the user providing a spoken response which is picked up by the microphone 5 (Figure 1), amplified by the amplifier 6 (Figure 1), sampled by the analog-to-digital (A/D) converter 7 (Figure 1), filtered and stored in volatile memory 10 (Figure 1) as the recorded signal. The processor 9 (Figure 1) calculates the power of the speech utterance included in the recorded signal and analyses the result.
The signal-to-noise ratio is determined (step S1.1.6). If the signal-to-noise ratio is too low, for example less than 20dB, then the spoken response is too quiet and the corresponding signal generated is too weak, even at the highest gain. The user is informed of this fact (step S1.1.7) and the calibration stage ends. Otherwise, the process continues.
The signal level is determined (step S1.1.8). If the signal level is too high, for example greater than 87 dB, which corresponds to the 95th percentile of the speech utterance energy being greater than 70% of the full scale deflection of the A/D converter 7 (Figure 1), then the spoken response is too loud and the corresponding signal generated is too strong. If the gain has already been reduced to its lowest value, then the signal is too strong, even at the lowest gain (step S1.1.9). The user is informed of this fact and calibration ends (step S1.1.10). Otherwise, the gain of the amplifier 6 is reduced (step S1.1.11). The gain may be reduced by a fixed amount regardless of signal strength. Alternatively, the gain may be reduced by an amount dependent on signal strength. This has the advantage of obtaining an appropriate gain more quickly. The fact that a specimen spoken response has been taken is noted by incrementing the first counter by one (step S1.1.12). The second counter is reset (step S1.1.13). If too many specimens have been taken, for example 15, then calibration ends (step S1.1.14). Otherwise, the process returns to step S1.1.4, wherein the user is prompted, and the process of recording, calculating and analysing is repeated.
If, at step S1.1.8, the signal level is not too high, then the spoken response is considered to be satisfactory, i.e. neither too loud nor too quiet. Thus, the recorded signal falls within an appropriate range of values of signal energy. The fact that a specimen spoken response has been taken is recorded by incrementing the first counter by one (step S1.1.16). The fact that the specimen is satisfactory is also recorded by incrementing the second counter by one (step S1.1.17). The gain remains unchanged.
If a predetermined number of consecutive specimens are taken without a change in gain, then calibration is successfully terminated and the gain setting of the amplifier 6 (Figure 1) is stored (steps S1.1.18 & S1.1.19). In this example, the gain setting is stored in the voice authentication biometric.
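The calibration loop described above can be summarised in a minimal sketch. This is an illustration only: the number of consecutive satisfactory specimens required is not stated in the text and is assumed to be three, the fixed gain-reduction step is only one of the two options mentioned, and the (snr_db, level_db) pairs stand in for the prompt/record/measure chain of Figure 4.

```python
MAX_SPECIMENS = 15
REQUIRED_CONSECUTIVE = 3   # assumed; the text leaves the count unspecified
MIN_SNR_DB = 20.0
MAX_LEVEL_DB = 87.0        # ~70% of full-scale deflection of the 16-bit converter

def calibrate(measurements, n_gain_settings):
    """Return a settled gain index (0 = highest gain), or None on failure.

    `measurements` yields one (snr_db, level_db) pair per recorded specimen,
    standing in for the microphone/amplifier/measurement chain.
    """
    gain = 0
    specimens = consecutive = 0
    for snr_db, level_db in measurements:
        specimens += 1
        if snr_db < MIN_SNR_DB:                 # response too quiet: abort
            return None
        if level_db > MAX_LEVEL_DB:             # response too loud
            if gain == n_gain_settings - 1:     # already at lowest gain: abort
                return None
            gain += 1                           # reduce gain by a fixed step
            consecutive = 0
        else:
            consecutive += 1
            if consecutive >= REQUIRED_CONSECUTIVE:
                return gain                     # settled gain, stored in the biometric
        if specimens >= MAX_SPECIMENS:
            return None
    return None

# Example: two loud specimens, then three satisfactory ones
print(calibrate([(30, 90), (28, 88), (25, 80), (26, 82), (27, 81)], n_gain_settings=8))  # -> 2
```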
Additional steps may be included. For example, once a settled value of gain is achieved at step S1.1.18, then the signal level is measured a final time. If the signal level is too low, then calibration ends without the gain setting being stored.
The calibration process allows decreases, but not increases, in gain. This has the advantage of preventing infinite loops in which the gain fluctuates without reaching a stable setting.
Alternatively, the calibration process may be modified to start at the lowest gain and allow increases, but not decreases, in gain. Thus, if the signal strength is too low, for example below a predetermined limit, then the gain is increased. Once a settled value of gain has been achieved, then the signal level may be measured a final time to determine whether it is too high. The calibration process may include a further check of signal-to-noise ratio. For example, once a settled value of gain has been determined, then the peak signal-to-noise ratio of the signal is measured. If the signal-to-noise ratio exceeds a predetermined level, such as 20dB, then the gain setting is stored. Otherwise, the user is instructed to repeat calibration in a quieter environment, move closer to the microphone or speak with a louder voice.
Referring again to Figure 3, during enrolment, the voice authentication system 1 records one or more spoken responses (step S1.2). This may occur during calibration at step S1.1. Additionally or alternatively, a separate recording stage may be used.
During enrolment, the voice authentication system 1 asks the user to provide a spoken response. Preferably, the system prompts the user a plurality of times. Four types of prompt may be used:
In a first type, the prompt comprises a request for a single word, for example "Say 81". Preferably, the user is asked to repeat the word. The user may be asked to repeat the word once so as to obtain two specimens of the spoken response. The user may be asked to repeat the word more than once so as to obtain multiple examples.
In a second type, the prompt comprises a request for a single phrase, for example "Say My voice is my pass phrase". Preferably, the user is asked to repeat the phrase. The user may be asked to repeat the phrase once or more than once.
In a third type, the prompt may comprise a challenge requesting personal information, such as "What is your home telephone number?". The valid user provides a spoken response which includes the personal information. This type of prompt is referred to as a "challenge-response". This type of prompt has the advantage of increasing security. During subsequent authentication, an impostor must know or guess what to say as well as attempt to say the spoken response in the correct manner. For example, a valid user may pronounce digits in different ways, such as pronouncing "10" as "ten", "one, zero", "one, nought" or "one-oh", and/or pause while saying a string of numbers, such as reciting "12345678" as "12-34-56-78" or "1234-5678".
In a fourth type, the prompt may comprise a cryptic challenge-response, such as "NOD?". For example, "NOD" may signify "Name of dog?". Preferably, the cryptic challenge is specified by the user. This type of prompt has the advantage of increasing security since the prompt is meaningful only to the valid user. It offers few clues as to what the spoken response should be.
A set of prompts may be common to all users. Alternatively, a set of prompts may be randomly selected on an individual basis. If the prompts are chosen randomly, then a record of the prompts issued to each user is stored in the voice authentication biometric, together with corresponding data generated from the spoken response. Preferably, this information is used during the authentication stage to ensure that only prompts responded to by the valid user are issued and that appropriate comparisons are made with corresponding featuregram archetypes. Preferably, the administrator 4 (Figure 1) determines the type and number of prompts used during enrolment and authentication.
Recording
Referring again to Figure 1, a spoken response is recorded by the microphone 5, amplified by amplifier 6 and sampled using A/D converter 7 at 11025 Hz to provide a 16-bit PCM digital signal. The duration of the recording may be fixed. Preferably, the recording lasts between 2 and 3 seconds. The signal is then filtered to remove any d.c. component. The signal may be stored in volatile memory 10.
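A minimal sketch of this capture chain, assuming the 4th order 100Hz high-pass filter is a Butterworth design (the filter family is not named in the text) and using a synthetic signal in place of a microphone:

```python
import numpy as np
from scipy.signal import butter, lfilter

FS = 11025          # sampling rate in Hz
DURATION = 2.5      # seconds; recordings of 2-3 s are suggested

# Synthetic stand-in for a 16-bit PCM recording with a d.c. offset
t = np.arange(int(FS * DURATION)) / FS
pcm = 2000.0 * np.sin(2 * np.pi * 220 * t) + 500.0

# 4th-order 100 Hz high-pass filter removes the d.c. component
b, a = butter(N=4, Wn=100.0, btype="highpass", fs=FS)
filtered = lfilter(b, a, pcm)

print(round(pcm.mean(), 1), round(filtered.mean(), 3))  # offset before vs. after filtering
```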
Referring to Figures 5, 6, 7, an example of a recorded signal 19 is shown in analog, generic and digital representations.
Referring particularly to Figure 5, the recorded signal 19 may comprise one or more speech utterances 20, one or more background noises 21 and/or one or more silence intervals 22. A speech utterance 20 is defined as a period in a recorded signal 19 which is derived solely from the spoken response of the user. A background noise 21 is defined as a period in a recorded signal arising from audible sounds, but not originating from the speech utterance. A silence interval 22 is defined as a period in a recorded signal which is free from background noise and speech utterance.
As explained earlier, the purpose of the enrolment is to obtain a plurality of specimens of speech so as to generate a voice authentication biometric. To help achieve this, recorded responses are characterised by generating "featuregrams" which comprise sets of feature vectors. The recordings are also examined so as to isolate speech from background noise and silences.
If the recordings are known to contain specific words, then they are searched for those words. This is known as "word spotting". If there is no prior knowledge of the content of the recordings, then the recordings are inspected to identify spoken utterances. This is known as "endpointing". By identifying speech utterances using one or both of these processes, a speech featuregram may be generated which corresponds to portions of the recorded signal comprising speech utterances.
Referring to Figure 8, a portion 19' of the recorded signal 19 is shown. The recorded signal 19 is divided into frames, referred to herein as timeslices 23. The recorded signal 19 is divided into partially-overlapping timeslices 23 having a predetermined period. In this example, timeslices 23 have a period of 50 ms, i.e. t1 = 50 ms, and overlap by 50%, i.e. t2 = 25 ms.
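Dividing the recording into 50 ms timeslices with 50% overlap can be sketched as follows; the function name and structure are illustrative rather than taken from the text:

```python
import numpy as np

def slice_signal(signal, fs=11025, slice_ms=50, overlap=0.5):
    """Split a 1-D signal into partially overlapping timeslices (frames)."""
    slice_len = int(fs * slice_ms / 1000)          # 551 samples at 11025 Hz
    hop = int(slice_len * (1 - overlap))           # 50% overlap -> 25 ms hop
    n_slices = 1 + max(0, (len(signal) - slice_len) // hop)
    return np.stack([signal[i * hop : i * hop + slice_len] for i in range(n_slices)])

frames = slice_signal(np.zeros(11025))             # one second of signal
print(frames.shape)                                # about 39 slices of 551 samples
```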
Featuregram generation
Referring to Figures 9, 10 and 11, a process by which a featuregram is generated will be described in more detail:
The recorded signal 19 is divided into frames, herein referred to as timeslices 23 (step S1.3.1). Each timeslice 23 is converted into a feature vector 24 using a feature transform 25 (step S1.3.2). The content of the feature vector 24 depends on the transform 25 used. In general, a feature vector 24 is a one-dimensional data structure comprising data related to acoustic information-bearing attributes of the timeslice 23. Typically, a feature vector 24 comprises a string of numbers, for example 10 to 50 numbers, which represent the acoustic features of the signal comprised in the timeslice 23.
In this example, a so-called mel-cepstral transform 25 is used. This transform 25 is suitable for use with a 32-bit fixed-point microprocessor. A mel-cepstral transform 25 is a cosine transform of the real part of a logarithmic-scale energy spectrum. A mel is a measure of perceived pitch or frequency of a tone by a human auditory system. Thus, in this example, for a sampling rate of 11025Hz, each feature vector 24 comprises twelve signed 8-bit integers, typically representing the second to thirteenth calculated mel-cepstral coefficients. Data relating to energy (in dB) may be included as a 13th feature. This has the advantage of helping to improve the performance of a word spotting routine that would otherwise operate on the feature vector coefficients alone.
The transform 25 may also calculate first and second differentials, referred to as "delta" and "delta-delta" values.
Further details regarding mel-cepstral transforms may be found in "Fundamentals of Speech Recognition" by Rabiner & Juang (Prentice Hall, 1993).
Other transforms may be used. For example, a linear predictive coefficient (LPC) transform may be used in conjunction with a regression algorithm so as to produce LPC cepstral coefficients. This transform is suitable for use with a 16-bit microprocessor. Alternatively, a TESPAR transform may be used.
The linear predictive coefficient (LPC) transform is described by B.S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification", Journal of the Acoustical Society of America, Vol. 55, pp. 1304-1312, June 1974. Further details regarding the TESPAR transform may be found in GB-B-2162025. Referring to Figure 11, a featuregram 25 comprises a set or concatenation of feature vectors 24. The featuregram 25 includes speech utterances, background noise and silence intervals.
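The sketch below illustrates the general shape of the transform stage rather than the exact mel-cepstral transform described above: it computes plain (non-mel-warped) cepstral coefficients from each timeslice, keeps twelve of them, quantises them to signed 8-bit integers and appends a dB energy term, then stacks the vectors into a featuregram. All function and variable names are illustrative assumptions.

```python
import numpy as np

def feature_vector(timeslice, n_coeffs=12):
    """Very simplified cepstral feature vector: no mel warping, no liftering."""
    spectrum = np.abs(np.fft.rfft(timeslice * np.hanning(len(timeslice)))) + 1e-10
    log_spec = np.log(spectrum)
    # DCT-II of the log spectrum gives cepstral coefficients (indices 1..12 here)
    n = len(log_spec)
    k = np.arange(1, n_coeffs + 1)
    basis = np.cos(np.pi * np.outer(k, np.arange(n) + 0.5) / n)
    cepstrum = basis @ log_spec
    quantised = np.clip(np.round(cepstrum), -128, 127).astype(np.int8)
    energy_db = 10 * np.log10(np.sum(timeslice ** 2) + 1e-10)   # optional 13th feature
    return np.append(quantised.astype(np.float32), energy_db)

def featuregram(timeslices):
    """A featuregram is simply the concatenation of per-timeslice feature vectors."""
    return np.stack([feature_vector(ts) for ts in timeslices])

fg = featuregram(np.random.randn(39, 551))
print(fg.shape)                                                  # (39, 13)
```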
Endpointing
Endpointing seeks to identify portions of a recorded signal which contain spoken utterances. This allows generation of speech featuregrams which characterise the spoken utterances.
Referring to Figure 12, two methods of endpointing may be used, namely explicit endpointing (step S1.4) and dynamic time warping (DTW) word spotting (step S1.5).
-Explicit Endpointing- Explicit endpointing seeks to locate approximate endpoints of a speech utterance in a particular domain without using any a priori knowledge of the words that might have been spoken. Explicit endpointing tracks changes in signal energy profile over time and frequency and makes boundary decisions based on general assumptions regarding the nature of profiles that are indicative of speech and those that are representative of noise or silence. Explicit endpointing cannot easily distinguish between speech spoken by the enrolling user and speech prominently forming part of background noise. Therefore, it is desirable that no-one else speaks in close proximity to the valid user when enrolment takes place.
Referring to Figure 13, an explicit endpointing process 27 generates a plurality of pairs 28 of possible start and stop points for a stream of timeslices 23. The advantage of generating a plurality of endpoints is that the true endpoints are likely to be identified. However, a drawback is that if too many endpoint combinations are identified, then the system response time is adversely affected. Therefore, a trade-off is sought between the number of potential endpoint combinations identified and the response time required. Explicit endpointing is suitable for both fixed and continuous recording environments, although it is mainly intended for use with isolated word or isolated phrase recognition systems.
Referring to Figure 14, an explicit endpointing process is shown in more detail:
A check is made whether initialisation is needed, whereby background noise energy is measured (step S1.4.A). If so, a background noise signal is recorded (step S1.4.B), divided into timeslices (step S1.4.C) and a background energy value is calculated (step S1.4.D).
After initialisation, or if no initialisation is needed, a signal is recorded and divided into a plurality of timeslices 23 (Figure 8) (step S1.4.1). A first counter, i, for keeping track of which timeslice 23i is currently being processed is set to one (step S1.4.2). A second counter, j, for counting the number of consecutive timeslices 23 which represent background noise is set to zero (step S1.4.3). A "word" flag is set to zero to represent that the current timeslice 23i does not represent a spoken utterance portion, such as a word portion (step S1.4.4).
Referring also to Figure 15, the energy of the current timeslice 23i is calculated (step S1.4.5).
Preferably, a plurality of timeslices 23 are used to calculate a value of energy for the current timeslice 23i. The timeslices 23 are comprised in a window 29. In this example, five timeslices 23i-2, 23i-1, 23i, 23i+1, 23i+2 are used to calculate a value of energy of the ith timeslice 23i.
A time encoded speech processing and recognition (TESPAR) coding process 30 is used to calculate an energy value for each timeslice 23i-2, 23i-1, 23i, 23i+1, 23i+2. This comprises taking each timeslice 23i-2, 23i-1, 23i, 23i+1, 23i+2 and dividing it into a plurality of so-called "epochs" according to where the signal magnitude changes from positive to negative and vice versa. These points are known as "zero crossings". An absolute value of a peak magnitude for each epoch is used to calculate an average energy for each timeslice 23i-2, 23i-1, 23i, 23i+1, 23i+2. Thus, five energy values are obtained from which a mean energy value 31 is calculated.
A description of the TESPAR coding process is given in GB-A-2020517.
A delta energy value 32 indicative of changes in the five energy values is also calculated (step S1.4.6). In this example, the delta energy value 32 is calculated by performing a smoothed linear regression calculation using the energy values for the timeslices 23i-2, 23i-1, 23i, 23i+1, 23i+2. The delta energy value 32 represents the gradient of a straight line fitted to the values of energy. Thus, large changes in the energy values result in a large value of the delta energy value 32.
The values 31, 32 of energy Ei and delta energy ΔEi are used to determine whether the ith timeslice 23i represents a spoken utterance.
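A minimal sketch of the energy 31 and delta-energy 32 measures for the ith timeslice, using a plain per-slice mean-square energy in place of the TESPAR epoch-based calculation (an acknowledged simplification) and a least-squares slope as the delta value:

```python
import numpy as np

def slice_energy(timeslice):
    """Stand-in for the TESPAR epoch-based energy of a single timeslice."""
    return float(np.mean(timeslice ** 2))

def energy_and_delta(timeslices, i):
    """Mean energy 31 and delta energy 32 for timeslice i from a 5-slice window."""
    window = [slice_energy(timeslices[j]) for j in range(i - 2, i + 3)]
    mean_energy = float(np.mean(window))
    # Slope of a straight line fitted to the five energy values
    delta_energy = float(np.polyfit(np.arange(5), window, deg=1)[0])
    return mean_energy, delta_energy

slices = np.random.randn(20, 551)
print(energy_and_delta(slices, 5))
```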
Referring again to Figure 14, if the energy 31 of an ith timeslice 23i is equal to or greater than a first threshold, which is a first predetermined multiple of background noise energy, i.e. Ei ≥ k1 × E0 (step S1.4.7), and the delta energy 32 is equal to or greater than a second threshold, which is a second predetermined multiple of background delta energy, i.e. ΔEi ≥ k2 × ΔE0 (step S1.4.8), then the ith timeslice 23i is considered to form part of a word. The timeslice 23i is said to form part of a voiced or energetic section. In this example, k1 = 2 and k2 = 3.
If the word flag is not set to one, representing that the previous timeslice 23i-1 was background noise (step S1.4.9), then the current timeslice 23i is considered to be the beginning of a word (step S1.4.10). Thus, the word flag is set to one (step S1.4.11).
If, at step S1.4.9, the word flag is set to one, then the beginning of the word has already been detected and so the current timeslice 23i is located within an existing word (step S1.4.12).
The first counter, i, is incremented by one (step S1.4.13) and the process returns to step S1.4.5 where the energy of the new ith timeslice 23i is calculated. If the energy value 31 falls below the first threshold at step S1.4.7 or the delta energy value 32 falls below the second threshold at step S1.4.8, then it is determined whether there is a stop point, and if so with which start point or start points it could be paired.
If the word flag is set to one (step S1.4.14), then the current timeslice 23i is considered to be a stop point.
The stop point may be paired with one or more other start points, as will now be explained:
Referring to Figure 16, first and second sections 33, 34 are separated by a gap 35. The first section 33 includes a first start point 36₁ and a first stop point 37₁. The second section 34 has a second start point 36₂. According to step S1.4.7 or S1.4.8 and step S1.4.14, a second stop point 37₂ is found. The second stop point 37₂ may be paired with the second start point 36₂, so identifying the second section 34 as a word. However, the second stop point 37₂ may also be paired with the first start point 36₁. Thus, the first start point 36₁ and the second stop point 37₂ may define a larger word 38 which includes both the first and second sections 33, 34. Therefore, it is desirable to determine the duration of the gap 35 between the first stop point 37₁ and the second start point 36₂. If the gap 35 is sufficiently short, then an additional pairing is made and the additional word 38 is identified. This has the advantage of identifying a greater number of candidates and thus increasing the chances of correctly identifying a word.
Referring again to Figure 14, a check is made as to whether the start point preceding the current endpoint occurs within ten timeslices 23 of the stop point of the previous word (step S1.4.15). If it does not, then the current endpoint is paired with only the preceding start point, thereby identifying a single word (step S1.4.16). If the start point is within ten timeslices 23 or less of the stop point, then the current stop point is paired with both the start point of the current section (step S1.4.17) and the start point of the preceding word (step S1.4.18), thereby identifying two potential words.
The second counter, j, is reset to zero (step S1.4.19), the word flag is set to zero (step S1.4.20) and the first counter is incremented by one (step S1.4.21) before returning to step S1.4.5.
If, at step S1.4.14, the word flag is not set to one, then a further check is made as to whether the current timeslice 23i may be considered to be the start point of an unvoiced or unenergetic section, hereinafter referred to as simply an unvoiced section.
If the energy 31 of an ith timeslice 23i is equal to or greater than a third threshold, which is lower than the first and which is a third predetermined multiple of background noise energy, i.e. Ei ≥ k3 × E0 (step S1.4.22), and the delta energy 32 is equal to or greater than a fourth threshold, which is lower than the second and which is a fourth predetermined multiple of background delta energy, i.e. ΔEi ≥ k4 × ΔE0 (step S1.4.23), and provided that the timeslice 23i is found within 10 timeslices of the previous stop point (step S1.4.24), then the ith timeslice 23i is considered to be the start point of an unvoiced section (step S1.4.25). In this example, k3 = 1.25 and k4 = 2.
The extent of the unvoiced section is determined by incrementing the first counter i (step S1.4.26), calculating values 31, 32 of energy and delta energy (steps S1.4.27 & S1.4.28) and determining whether the energy 31 of the current timeslice 23i exceeds a fifth threshold corresponding to a fifth predetermined multiple of background noise energy, i.e. Ei ≥ k5 × E0 (step S1.4.29). In this case, k5 = k3 = 1.25. Provided that the energy 31 of the current timeslice 23i exceeds the fifth threshold, the timeslice 23i is identified as being part of the unvoiced section (step S1.4.30).
If the energy value 31 falls below the third threshold at step S1.4.22 or the delta energy value 32 falls below the fourth threshold at step S1.4.23, then the current timeslice 23i is deemed to represent background noise (step S1.4.31). The values of background noise energy and delta background noise energy are updated using the current timeslice 23i. In this case, a weighted average is taken using 95% of the background noise energy E0 and 5% of the timeslice energy Ei (step S1.4.32). Similarly, a weighted average is taken using 95% of the delta background noise energy ΔE0 and 5% of the delta energy ΔEi (step S1.4.33). The second counter, j, is incremented by one (step S1.4.34).
A check is made to see whether an isolated word has been found (step S1.4.35). If a sufficiently long period of background noise is identified, for example by counting twenty timeslices after the end of a word, which corresponds to 0.5 seconds of silence (step S1.4.36), then it is reasonable to assume that the last stop point represents the end of an isolated word. If an isolated word is found, then pairing of possible start and stop points may be terminated. Otherwise, searching continues by returning to step S1.4.5.
If, at step S1.4.30, the energy 31 of the timeslice 23i falls below the fifth threshold, then a stop point of an unvoiced section is identified (step S1.4.36). The stop point is associated with the start point of the preceding word (step S1.4.37) and the first counter, i, is incremented by one (step S1.4.38).
Referring to Figure 17, a first section 39 precedes a second section 40 and has a first start point 41₁ and a first stop point 42₁. A second stop point 42₂ is found in the second section 40 according to step S1.4.36. The second stop point 42₂ is paired with the first start point 41₁. Thus, the first start point 41₁ and the second stop point 42₂ may define a word 43 which includes both the first and second sections 39, 40.
Thus, two types of stop points may be identified. The stop point may be an end point of a voiced section, such as the "le" in "left", or a stop point of an unvoiced section, such as the "t" in "left".
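A greatly simplified sketch of the voiced-section part of this explicit endpointing, using the thresholds k1 = 2 and k2 = 3 and the 95%/5% background update given above; unvoiced extensions, the ten-timeslice gap pairing and lip-smack removal of Figure 14 are deliberately omitted, and the function name is illustrative.

```python
def find_voiced_sections(energies, delta_energies, e0, de0, k1=2.0, k2=3.0):
    """Pair start and stop points of voiced sections from per-timeslice values.

    `energies`/`delta_energies` are the per-timeslice values 31 and 32; `e0`/`de0`
    are the background levels obtained during initialisation.
    """
    sections, start = [], None
    for i, (e, de) in enumerate(zip(energies, delta_energies)):
        if e >= k1 * e0 and de >= k2 * de0:
            if start is None:
                start = i                        # beginning of a word
        else:
            if start is not None:
                sections.append((start, i - 1))  # stop point of a voiced section
                start = None
            # timeslice deemed background noise: update the running estimates
            e0 = 0.95 * e0 + 0.05 * e
            de0 = 0.95 * de0 + 0.05 * de
    if start is not None:
        sections.append((start, len(energies) - 1))
    return sections

print(find_voiced_sections([1, 1, 9, 9, 9, 1, 1], [0, 0, 4, 4, 4, 0, 0], e0=1.0, de0=1.0))
# -> [(2, 4)]
```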
Referring to Figure 18, a process for finding and removing extraneous noises such as lip smack and generating an additional pair of endpoints is shown: When a stop point is located at step S1.4.16 or S1.4.17 in a voiced section, the current start point is located (step S1.4.39). First and second pointers p, q are set to the start point (steps S1.4.40 & S1.4.41). The first index p points to an updated start point. The second index q keeps track of which timeslice is currently being examined.
The delta energy of a current timeslice 23q is compared with the delta energy of a succeeding timeslice 23q+1 (step S1.4.42). If the delta energy of the current timeslice 23q is greater than the delta energy of the succeeding timeslice 23q+1, then the delta energy of the succeeding timeslice 23q+1 is compared with the delta energy of a second succeeding timeslice 23q+2 (step S1.4.43). If the delta energy of the succeeding timeslice 23q+1 is greater than the delta energy of the second succeeding timeslice 23q+2, then the start point is updated by incrementing the first index p by one (step S1.4.44). A check is made to see whether the updated start point and the stop position are separated by at least three timeslices (step S1.4.45). If not, then the process terminates without generating an additional pair of endpoints including an updated start point.
If, at either step S1.4.42 or S1.4.43, the delta energy of the current timeslice 23q is less than the delta energy of the succeeding timeslice 23q+1, or the delta energy of the succeeding timeslice 23q+1 is less than the delta energy of the second succeeding timeslice 23q+2, then the process terminates and generates an additional pair of endpoints including an updated start point.
Referring to Figures 19a and 19b, the effect of the process for finding and removing extraneous noises is illustrated.
Figure 19a shows a voiced section 44 having a pair of start and stop points 45, 46. Figure 19b shows the voiced section 44 after the process has identified a section portion 47 comprising a lip smack. Another pair of start and stop points 48, 49 are generated. Preferably, explicit endpointing is performed in real-time. This has the advantage that it may be determined whether or not a timeslice 23 corresponds to a spoken utterance, i.e. whether a portion of the recorded signal currently being processed corresponds to part of a word. If so, a featuregram is generated. If not, a featuregram need not be generated. Processing resources may be better put to use, for example by generating a template (if in the training mode) or performing a comparison (if in the real-time live interrogation mode).
-Word spotting- Word spotting seeks to locate endpoints of a speech utterance in a particular domain using a priori knowledge of the words that should have been spoken as a guide. The a priori knowledge is typically presented as a speaker-independent featuregram archetype (FGA) generated from speech utterances of the word or phrase being sought that have previously been supplied by a wide range of representative speakers. The featuregram archetype may include an energy term.
Referring to Figure 20, a dynamic time warping process 50, herein referred to as DTWFlex, is used. The process 50 compares a featuregram 51 derived from the recorded signal 19 (Figure 5) with a speaker-independent featuregram archetype 52 representing a word or phrase being sought. This is achieved by compressing and/or expanding different sections of the featuregram 51 until a region inside the featuregram 51 matches the speaker-independent featuregram archetype 52. The best fit is known as the winning path and the endpoints of the winning path are output 28'.
One advantage of word spotting is that it delivers more accurate endpoints than those produced by explicit endpointing, particularly when heavy non-stationary background noise is present. If word spotting is used during enrolment, users are asked to respond to fixed-word or fixed-phrase prompts for which speaker-independent featuregram archetypes have been prepared in advance. It is difficult to use word spotting in conjunction with challenge-response prompts, particularly if spoken responses cannot be easily anticipated. Thus, it is preferable to use explicit endpointing when using challenge-response prompts. An outline of a word spotting process will now be described:
First and second speech patterns A, B may be expressed as a sequence of first and second respective sets of feature vectors a, b, wherein:
A = a1, a2, ..., ai, ..., aI   (1a)
B = b1, b2, ..., bj, ..., bJ   (1b)
Each respective vector a, b represents a fixed period of time.
Referring to Figure 21, a dynamic time warping process seeks to eliminate timing differences between the first and second speech patterns A, B. The timing differences may be illustrated using an i-j plot, wherein the first speech pattern A is developed along an i-axis 53 and the second speech pattern B is developed along a j-axis 54.
The timing differences between the first and second speech patterns A, B may be represented by a sequence F, wherein:
F = c(1), c(2), ..., c(k), ..., c(K)   (2)
where c(k) = (i(k), j(k)). The sequence F may be considered to represent a function which approximately maps the time axis of the first speech pattern A onto that of the second speech pattern B. The sequence F is referred to as a warping function 55. When there is no timing difference between the first and second speech patterns A, B, the warping function 55 coincides with a diagonal line j = i, indicated by reference number 56. As the timing differences grow, the warping function 55 increasingly deviates from the diagonal line 56.
A Euclidean distance, d, is used to measure the timing difference between a pair of time points in the form of feature vectors ai, bj, wherein:
d(c) = d(i, j) = ||ai - bj||   (3)
However, other distances may be used to measure the timing difference, such as the Manhattan distance. A weighted sum of distances along the warping function 55 is calculated using:
E(F) = Σk=1..K d(c(k)) · w(k)   (4)
where w(k) is a positive weighting coefficient. E(F) reaches a minimum value when the warping function 55 optimally adjusts the timing differences between the first and second speech patterns A, B. The minimum value may be considered to be a distance between the first and second speech patterns A, B, once the timing differences between them have been eliminated, and is expected to be stable against time-axis fluctuation. Based on these considerations, a time-normalised distance D between the first and second speech patterns A, B is defined as:
D(A, B) = minF [ Σk=1..K d(c(k)) · w(k) / Σk=1..K w(k) ]   (5)
where the denominator Σ w(k) compensates for the number of points on the warping function 55.
Two conditions are imposed on the speech patterns A, B. Firstly, the speech patterns A, B are time-sampled with a common and constant sampling period. Secondly, there is no a priori knowledge about which parts of the speech pattern contain linguistically important information. In this case, each part of the speech pattern is considered to have an equal amount of linguistic information.
As explained earlier, the warping function 55 is a model of time-axis fluctuations in a speech pattern. Thus, the warping function 55, when viewed as a mapping function from the time axis of the second speech pattern B onto that of the first speech pattern A, preserves linguistically important structures in the second speech pattern B time axis, and vice versa. In this example, important speech pattern time-axis structures include continuity, monotonicity and limitation on acoustic parameter transition speed in speech.
In this example, asymmetric time warping is used, wherein a weighting function w(k) is dependent on i but not j. This condition is realised using the following restrictions on the warping function 55:
Firstly, a monotonic condition is applied, wherein:
i(k-1) < i(k) and j(k-1) ≤ j(k)   (6)
The monotonic condition specifies that the warping function 55 does not turn back on itself. Secondly, a continuity condition is imposed, wherein:
i(k) - i(k-1) = 1 and j(k) - j(k-1) ≤ 2.   (7)
The continuity condition specifies that the warping function 55 advances a predetermined number of steps at a time. As a result of these two conditions, the following relation holds between two consecutive points, namely:
c(k-1) = (i(k)-1, j(k)), (i(k)-1, j(k)-1), or (i(k)-1, j(k)-2).   (8)
Boundary conditions are set such that the warping function 55 starts at (1, 1) and ends at (I, J), i.e.:
i(1) = 1, j(1) = 1 and i(K) = I, j(K) = J   (9)
A local slope constraint condition is also imposed. This defines a relation between consecutive points on the warping function 55 and places limitations on possible configurations. In this example, the Itakura condition is used.
Referring to Figure 22, if point 57₁ moves forward in the i-direction but not in the j-direction, then the point 57₂ cannot move again in the i-direction without consecutively moving in the j-direction. Therefore, this condition, combined with the monotonicity and continuity conditions, imposes a maximum slope of 2 and a minimum slope of 0.5 on the warping function F. In other words, the second speech pattern B may be maximally compressed or expanded by a factor of 2 in order to time-align it with the first speech pattern A.
Referring to Figure 23, the above conditions effectively constrain the possible warping function 55 to a region in the time axis bounded by a parallelogram 58, which is referred to as the "legal" search region. The legal search region conforms to the following conditions:
j(k) ≥ max[ 2i(k) - 2I + J, (i(k) + 1)/2 ]   (10)
and
j(k) ≤ min[ 2i(k) - 1, (i(k) - I)/2 + J ]   (11)
Thus, j may take a maximum value 58max and minimum value 58min for a particular value of i.
The weighting coefficient is also restricted. If the denominator in equation (5) is independent of the warping function, then:
N = Σk=1..K w(k)   (12)
where N is the normalisation coefficient. Equation 5 may then be simplified and re-written as:
D(A, B) = (1/N) · minF [ Σk=1..K d(c(k)) · w(k) ]   (13)
The time-normalised distance D may be solved using standard dynamic programming techniques. The aim is to find the cost of the shortest path.
In this example, an asymmetric weighting function w(k) is used, namely:
w(k) = (i(k) - i(k-1))   (14)
The use of an asymmetric weighting function simplifies the normalisation coefficient N of equation 12, such that:
N = I (15)
where I is the length of speech pattern A.
An algorithm for solving equation 13 comprises defining an array g for holding the lowest cost path to each point and initialising such that:
g1(c(1)) = d(c(1)) · w(1).   (16)
In other words, the lowest cost to the first point is the distance between the first two elements multiplied by the weighting factor. For a symmetric weighting factor w(1) = 2, while for an asymmetric weighting factor w(1) = 1.
The algorithm comprises calculating gk(i, j) for each row i and column j, wherein:
gk(c(k)) = min over c(k-1) of [ gk-1(c(k-1)) + d(c(k)) · w(k) ]   (17)
The solution for the time-normalised distance D(A, B) is given by:
D(A, B) = (1/N) · gK(c(K))   (18)
The asymmetric weighting coefficient w(k) of equation 14 may be substituted into equation 17, wherein w(1) = 1.
Thus, the algorithm defined by equations 17, 18, 19 is simplified and comprises defining an array g for holding the lowest cost path to each point and initialising such that:
g1(c(1)) = d(c(1))   (19)
In other words, the lowest cost to the first point is the distance between the first two elements.
The algorithm comprises calculating gk(i, j) for each row i and column j, wherein:
g(i, j) = min[ g'(i-1, j), g(i-1, j-1), g(i-1, j-2) ] + d(i, j)   (20)
where
g'(i-1, j) = g(i-1, j) if c(k-2) ≠ (i-2, j), and g'(i-1, j) = ∞ if c(k-2) = (i-2, j)   (21)
so that two consecutive horizontal steps are forbidden.
The algorithm further comprises applying the following global conditions, namely:
j ≥ max[ 2i - 2I + J, (i + 1)/2 ]   (22)
j ≤ min[ 2i - 1, (i - I)/2 + J ]   (23)
Thus, the solution for the time-normalised distance D(A, B) is given by:
D(A, B) = (1/I) · g(I, J)   (24)
An algorithm based on equations 19 to 24 may be used to obtain a score when comparing speech utterances of substantially the same length. For example, the algorithm is used when comparing featuregram archetypes, which is described in more detail later.
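The recursion of equations 19 to 24 can be sketched as follows. This is a minimal illustration rather than the implementation described here: it fills the full I-by-J array instead of restricting the search to the parallelogram of equations 22-23, enforces the no-two-consecutive-horizontal-steps rule of equation 21 with a parallel flag array, and returns the time-normalised distance g(I, J)/I.

```python
import numpy as np

def dtw_score(A, B):
    """Time-normalised distance D(A, B) between two featuregrams.

    A has shape (I, n_features), B has shape (J, n_features).  The local
    distance is Euclidean; steps are allowed from (i-1, j), (i-1, j-1) and
    (i-1, j-2), with two consecutive horizontal steps forbidden.
    """
    I, J = len(A), len(B)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # local distances
    g = np.full((I, J), np.inf)
    horizontal = np.zeros((I, J), dtype=bool)   # True if the best step into (i, j) was horizontal
    g[0, 0] = d[0, 0]                           # equation 19
    for i in range(1, I):
        for j in range(J):
            candidates = []
            if j >= 1:
                candidates.append((g[i - 1, j - 1], False))
            if j >= 2:
                candidates.append((g[i - 1, j - 2], False))
            if not horizontal[i - 1, j]:        # equation 21
                candidates.append((g[i - 1, j], True))
            if not candidates:
                continue
            best, was_horizontal = min(candidates, key=lambda c: c[0])
            if np.isfinite(best):
                g[i, j] = best + d[i, j]
                horizontal[i, j] = was_horizontal
    return g[I - 1, J - 1] / I                  # equation 24

a = np.random.randn(30, 13)
print(dtw_score(a, a))                          # identical patterns score 0.0
```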
However, the algorithm based on equations 19 to 24 may be adapted for word spotting applications. In word spotting, it is assumed that the start and stop points of the first speech vector A are known. However, the start and stop points of the relevant speech in the second pattern B are unknown. Therefore, the conditions of equation 9 no longer hold and can be re-defined such that:
i(1) = 1, j(1) = start, and i(K) = I, j(K) = stop   (25)
Based on the fact that the maximum expansion/compression in the speech pattern is 2, the start point can assume any value from 1 to J - I/2 and the stop point may assume any value from I/2 to J. Consequently, the global conditions specify:
j(k) ≥ (i(k) + 1)/2   (26)
and
j(k) ≤ (i(k) - I)/2 + J   (27)
The time-normalised difference D(A, B) is now defined as:
D(A, B) = (1/I) · min[ g(I, k) ], where k = I/2, ..., J   (28)
Referring to Figures 21, 24 and 25, a process for calculating the time-normalised distance D is shown.
The featuregram 51 derived from the recorded signal 19 (Figure 5) is compared with the speaker-independent featuregram archetype 52. As explained earlier, the featuregram comprises a speech utterance, such as "twenty-one", silence intervals and background noise. The speaker-independent featuregram archetype 52 comprises a word or phrase being sought, which in this example is "twenty-one".
The featuregram 51 is warped onto the speaker-independent featuregram archetype 52. The aim is to locate a region within the featuregram 51 (speech pattern B) which best matches the speaker-independent featuregram archetype 52 (speech pattern A).
An array g for holding the lowest cost path to each point is defined (step S1.5.1). The array may be considered as a net of points or nodes. As explained earlier, the start point can assume any value from 1 to J - I/2, therefore the elements g(1, 1) to g(1, J - I/2) are set to values d(1, 1) to d(1, J - I/2) respectively (step S1.5.2). Elements g(1, J - I/2 + 1) to g(1, J) may be set to a large number. A corresponding array 59 is shown in Figure 25a.
Equation 20 is then calculated for some, but not all, elements (i, j) of array g. The process comprises incrementing index i (step S1.5.3), checking whether the algorithm has come to an end (step S1.5.4), determining the bounds 58max, 58min (Figure 23) of the legal search region (steps S1.5.5 to S1.5.8) and determining whether an index value j falls outside the bounds 58max, 58min (step S1.5.9). If so, then a large distance is entered, i.e. g(i, j) = ∞, which in practice is a large number (step S1.5.10). Otherwise, equation 20 is calculated and a corresponding distance, herein labelled d'i,j, is entered, i.e. g(i, j) = d'i,j (step S1.5.11). The process continues by incrementing index j at step S1.5.7 and continuing until j exceeds J (step S1.5.8). A corresponding array 59', partially filled, is shown in Figure 25b.
The algorithm continues until the array is completed, i.e. (i, j) = (I, J) (step S1.5.4). A corresponding completed array 59" is shown in Figure 25c.
The winning score with the lowest value is found (step S1.5.12). As explained earlier, the stop point may assume any value from I/2 to J. Therefore, elements g(I, I/2) to g(I, J) are searched. Once a stop point 60 has been found, a start point 61 may be estimated by tracing back the winning path 62. Thus, endpoints 28' are found by reading the i-values corresponding to the start and stop points 60, 61.
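A compact sketch of the word-spotting variant of equations 25-28, in which the template may start and stop anywhere legal within the longer featuregram. Two simplifications are made relative to the description above: the two-consecutive-horizontal-step restriction of equation 21 is omitted, and no winning path is traced back; the function and variable names are illustrative.

```python
import numpy as np

def word_spot_score(A, B):
    """Best time-normalised score for template A embedded somewhere in B."""
    I, J = len(A), len(B)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    g = np.full((I, J), np.inf)
    j_max_start = J - I // 2                    # start point may lie in columns 1 .. J - I/2
    g[0, :j_max_start] = d[0, :j_max_start]     # free start on the B axis
    for i in range(1, I):
        for j in range(J):
            prev = [g[i - 1, j]]
            if j >= 1:
                prev.append(g[i - 1, j - 1])
            if j >= 2:
                prev.append(g[i - 1, j - 2])
            g[i, j] = min(prev) + d[i, j]
    # Free stop: best score over the last row, columns I/2 .. J (equation 28)
    return float(np.min(g[I - 1, I // 2 - 1:])) / I

template = np.random.randn(20, 13)
recording = np.vstack([np.random.randn(15, 13), template, np.random.randn(25, 13)])
print(round(word_spot_score(template, recording), 6))   # embedded template scores ~0
```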
Performing sanity checks
The ability of a voice authentication system to consistently accept valid users and reject impostors is dependent on the generation of featuregrams that in some way represent the user's key speech characteristics. A plurality of sanity checks may be applied during enrolment and authentication, preferably on the recorded signal or a recorded signal portion, to ensure that they are suitable for enrolment and authentication, i.e. that the speech utterances carry sufficient information for featuregrams to be generated. Preferably, all the following sanity checks are performed.
—Speech Length—
A first sanity check comprises confirming that the length of speech exceeds a minimum length. The minimum length of speech is a function of not only time but also of the number of feature vector time slices. In this example, the minimum length of speech is 0.5 seconds of speech and 30 feature vector timeslices, and timeslice duration and overlap are defined accordingly.
—Noise Length-
A second sanity check comprises checking that each speech utterance includes a silence interval which exceeds a minimum length. The silence interval is used to determine noise threshold levels for explicit endpointing, signal-to-noise measurements and for Speech/Noise entropy. In this example, the minimum length of silence is 0.5 seconds and 30 feature vector timeslices.
-Signal-to-Noise Ratio (SNR)-
A third sanity check includes examining whether the signal-to-noise ratio exceeds a minimum. In this example, the minimum signal-to-noise ratio is 20dB. The purpose of setting a minimum signal-to-noise ratio is to obtain an accurate speaker biometric template uncorrupted by background noise.
An estimate of the SNR can be determined using:
SNR = 10 · log10( Is / In )   (29)
where Is is the speech energy and In is the noise energy. The speech and noise energy Is, In can be calculated using:
I = Σ pcmi²   (30)
where pcmi is the value of the digital signal. Other values of signal-to-noise ratio may be used, for example 25dB.
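Equations 29 and 30 translate directly into code, assuming the conventional 10·log10 form for a ratio of energies; the speech and noise segments are taken here as plain sample arrays.

```python
import numpy as np

def energy(pcm):
    """Equation 30: sum of squared PCM sample values."""
    return float(np.sum(np.asarray(pcm, dtype=np.float64) ** 2))

def snr_db(speech_pcm, noise_pcm):
    """Equation 29: signal-to-noise ratio in dB from speech and noise energies."""
    return 10.0 * np.log10(energy(speech_pcm) / energy(noise_pcm))

rng = np.random.default_rng(0)
speech = 3000 * rng.standard_normal(5000)
noise = 300 * rng.standard_normal(5000)
print(round(snr_db(speech, noise), 1))   # about 20 dB
```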
—Speech Intensity— A fourth sanity check comprises checking whether the speech energy exceeds a minimum. The purpose of setting a minimum speech intensity is not only to provide adequate signal-to-noise, but also to avoid excessive quantisation in the digital signal. In this example, the minimum speech intensity is 47 dB.
-Clipping—
A fifth sanity check comprises determining whether the degree of clipping exceeds a maximum value. The degree of clipping is defined as the average number of samples which exceed an absolute value in each speech frame. In this case, the absolute value is 32000, which represents about 98% of the full-scale deflection of a 16-bit analog-to-digital converter.
—Speech Entropy—
A sixth sanity check includes checking whether a so-called "speech entropy" exceeds a minimum. In this example, the minimum speech entropy is 40.
Speech entropy is defined as the average distance between a speech featuregram and the mean feature vector of the speech featuregram. The mean feature vector is calculated by taking an average of the n feature vectors in the featuregram. A distance between each feature vector and the mean feature vector is determined. Preferably, a Manhattan distance is calculated, although a Euclidean distance may be used. An average distance is calculated by taking an average of the n values of distance.
—Speech/Noise Entropy—
A seventh sanity check comprises testing whether a so-called "speech-to-noise entropy" exceeds a minimum. Speech-to-noise entropy is defined as the average distance between the mean feature vector of the speech featuregram and the feature vectors of the background noise. In this example, the minimum speech-to-noise entropy is 40.
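Both entropy measures reduce to average Manhattan distances around the mean speech feature vector. A minimal sketch (the function names are illustrative):

```python
import numpy as np

def speech_entropy(speech_fg):
    """Average Manhattan distance between each speech feature vector and their mean."""
    mean_vec = speech_fg.mean(axis=0)
    return float(np.mean(np.sum(np.abs(speech_fg - mean_vec), axis=1)))

def speech_noise_entropy(speech_fg, noise_fg):
    """Average Manhattan distance between the mean speech vector and the noise vectors."""
    mean_vec = speech_fg.mean(axis=0)
    return float(np.mean(np.sum(np.abs(noise_fg - mean_vec), axis=1)))

speech = np.random.randn(40, 13) * 10
noise = np.random.randn(40, 13)
print(round(speech_entropy(speech), 1), round(speech_noise_entropy(speech, noise), 1))
```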
Referring to Figure 26, a process of performing sanity checks is shown. A plurality of sanity checks are performed and a tally kept of the number of failures (steps S1.6.1 to S1.6.6). If the number of failures exceeds a threshold, for example 3, then the signal is deemed to be inadequate and the user is asked to check their set-up (steps S1.6.7 and S1.6.8). Otherwise, the recorded signal 19 (Figure 5) is considered to be satisfactory (step S1.6.9).
Creating speech featuregram
Once the endpoints of the recorded signal 19 (Figure 5) have been identified and the recorded signal (Figure 5) passes a plurality of sanity checks, then a speech featuregram may be created.
Referring to Figure 27, a speech featuregram 63 is created using a process 64 by concatenating feature vectors 24 extracted from the section of the featuregram 25 that originates from the speech utterance. The speech section of the featuregram is located via the speech endpoints 28, 28'.
Creating speech featuregram archetype
The aim of the enrolment is to provide a characteristic voiceprint for one or more words or phrases. However, specimens of the same word or phrase provided by the same user usually differ from one another. Therefore, it is desirable to obtain a plurality of specimens and derive a model or archetypal specimen. This may involve discarding one or more specimens that differ significantly from other specimens.
Referring to Figure 28, a speech featuregram archetype 65 is calculated using an averaging process 66 using w-featuregrams 63l3 632,..., 63w. Typically, an average of three featuregrams 63 is taken.
Referring to Figure 29, 30, 31 and 32, the featuregram archetype 65 is computed by determining a winning score D for each featuregram 63l5 632,..., 63w warped, using a modified version of process 50 which is shown in Figure 32, against each other featuregram 63l5 632,..., 63w to create an w-by-w featuregram cost matrix 67, whose diagonal elements are zero (steps SI.8.1 to SI.8.9).
Excluding the diagonal elements, a minimum value D_min in the featuregram cost matrix 67 is determined (step S1.8.10). If the minimum value D_min is greater than a predefined threshold distance D_0, then all the featuregrams 63_1, 63_2, ..., 63_w are considered to be so dissimilar that a featuregram archetype 65 cannot be created (step S1.8.11).
Referring to Figures 29 and 31, if one or more values in the featuregram cost matrix 67 is less than the threshold D_0, then w featuregram archetypes 68_1, 68_2, ..., 68_w are computed using each featuregram 63_1, 63_2, ..., 63_w as a reference and warping each of the remaining (w-1) featuregrams onto it (steps S1.8.12 to S1.8.21).
Referring to Figures 29, 33, 34 and 35, once w featuregram archetypes 68_1, 68_2, ..., 68_w have been created, a w-by-w featuregram archetype cost matrix 69 is computed whose elements consist of the winning scores E' from warping each featuregram 63_1, 63_2, ..., 63_w onto each featuregram archetype 68_1, 68_2, ..., 68_w (steps S1.8.22 to S1.8.28).
An average featuregram archetype cost matrix 70 is computed by averaging elements within each column 71 corresponding to a featuregram 63_1, 63_2, ..., 63_w (steps S1.8.29 to S1.8.37).
A maximum value E'_max in the featuregram archetype cost matrix 69 is also determined (step S1.8.38).
If the maximum value E'_max in the featuregram archetype cost matrix 69 is less than the threshold D_0, then the featuregram archetype 68_1, 68_2, ..., 68_w which provides the lowest mean featuregram archetype cost <E'_1>, <E'_2>, ..., <E'_w> is chosen to be included in the voice authentication biometric (steps S1.8.37 to S1.8.50). The mean featuregram archetype costs <E'_1>, <E'_2>, ..., <E'_w> are calculated by averaging elements within each row 72.
If the maximum value E'_max in the featuregram archetype cost matrix 69 is greater than the threshold D_0, then a featuregram 63_1, 63_2, ..., 63_w is excluded, thus reducing the number of featuregrams to (w-1), and steps S1.8.1 to S1.8.50 are repeated (step S1.8.54). A featuregram 63_1, 63_2, ..., 63_w is chosen for exclusion by calculating a variance σ_1, σ_2, ..., σ_w for each featuregram archetype 68_1, 68_2, ..., 68_w and excluding the featuregram 63_1, 63_2, ..., 63_w corresponding to the featuregram archetype 68_1, 68_2, ..., 68_w having the lowest value of variance σ_1, σ_2, ..., σ_w (steps S1.8.51 to S1.8.53). For example, for an ith featuregram archetype 68_i, a variance σ_i is calculated from the average featuregram archetype cost matrix 70 using:
σ_i = Σ_{j=1..w} Σ_{k=1..w, k≠j} (E'_jk - <E'_i>)²    (31)
Thus, the mean featuregram archetype cost <E'_1>, <E'_2>, ..., <E'_w> which produced the lowest average distance results in the reference featuregram 63_1, 63_2, ..., 63_w from which it was created being discarded.
Steps S1.8.1 to S1.8.50 are repeated until a featuregram archetype 65 (Figure 28) is obtained or until only one featuregram 63_1, 63_2, ..., 63_w is left.
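The selection loop described above might be sketched as follows. This is a condensed sketch, assuming two helpers that are not defined here: dtw_score(a, b), standing for the winning score produced by the modified process 50, and build_archetype(ref, others), standing for the warping-and-averaging step; the row and column averaging of matrices 69 and 70 is also collapsed into simple row statistics.

```python
import numpy as np

def select_featuregram_archetype(featuregrams, dtw_score, build_archetype, d0):
    # Sketch of steps S1.8.1 to S1.8.54; dtw_score and build_archetype are assumed helpers
    fgs = list(featuregrams)
    while len(fgs) > 1:
        w = len(fgs)
        # w-by-w featuregram cost matrix with zero diagonal (steps S1.8.1 to S1.8.9)
        d = np.array([[0.0 if i == j else dtw_score(fgs[i], fgs[j])
                       for j in range(w)] for i in range(w)])
        if d[~np.eye(w, dtype=bool)].min() > d0:
            return None  # all specimens too dissimilar (steps S1.8.10 and S1.8.11)
        # One candidate archetype per reference featuregram (steps S1.8.12 to S1.8.21)
        archetypes = [build_archetype(fgs[i], [fgs[j] for j in range(w) if j != i])
                      for i in range(w)]
        # Archetype cost matrix: each featuregram warped onto each archetype (steps S1.8.22 to S1.8.28)
        e = np.array([[dtw_score(archetypes[i], fgs[j]) for j in range(w)]
                      for i in range(w)])
        mean_costs = e.mean(axis=1)             # mean archetype costs <E'_i>
        if e.max() < d0:                        # steps S1.8.38 to S1.8.50
            return archetypes[int(np.argmin(mean_costs))]
        # Otherwise exclude the featuregram whose archetype shows the lowest variance
        variances = ((e - mean_costs[:, None]) ** 2).sum(axis=1)
        fgs.pop(int(np.argmin(variances)))      # steps S1.8.51 to S1.8.54
    return None
```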
Setting an appropriate pass level
A featuregram archetype 65 is obtained for each prompt. Thus, during subsequent authentication, a user is asked to provide a response to a prompt. A featuregram is obtained and compared with the featuregram archetype 65 using a dynamic time warping process which produces a score. The score is compared with a preset pass level. A score which falls below the pass level indicates a good match and so the user is accepted as being a valid user.
A valid user is likely to provide a response that results in a low score, falling below the pass level, and which is accepted. However, there may be occasions when even a valid user provides a response that results in a high score and which is rejected. Conversely, an impostor may be expected to provide poor responses which are usually rejected. Nevertheless, they may occasionally provide a sufficiently close-matching response which is accepted. Thus, the pass level affects the proportion of valid users being incorrectly rejected, i.e. the "false reject rate" (FRR), and the proportion of impostors which are accepted, i.e. the "false accept rate" (FAR).
In this example, a neutral strategy is adopted which shows no bias towards preventing unauthorised access or allowing authorised access.
A pass level for a fixed-word or fixed-phrase prompt is determined using previously captured recordings taken from a wide range of representative speakers.
A featuregram archetype is obtained for each of a first set of users for the same prompt in a manner hereinbefore described. Thereafter, each user provides a spoken response to the prompt from which a featuregram is obtained and compared with the user's featuregram archetype using a dynamic time warping process so as to produce a score. This produces a first set of scores corresponding to valid users.
The process is repeated for a second set of users, again using the same prompt. Once more, each user provides a spoken response to the prompt from which a featuregram is obtained. However, the featuregram is compared with a different user's featuregram archetype. Another set of scores is produced, this time corresponding to impostors.
Referring to Figure 36, the frequencies of scores for valid users and impostors are fitted to first and second probability density functions 73_1, 73_2 respectively using:
p(x) = (2πσ²)^(-1/2) x^(-1) exp(-(ln(x) - μ)² / (2σ²))    (32)
where p is probability, x is score, μ is mean score and σ is standard deviation. Other probability density functions may be used. The mean score μ_1 for valid users is expected to be lower than the mean score μ_2 for the impostors. Furthermore, the standard deviation σ_1 for the valid users is usually smaller than the standard deviation σ_2 of the second probability density function.
Referring to Figure 37, the first and second probability density functions 73_1, 73_2 are numerically integrated to produce first and second cumulative density functions 74_1, 74_2. The point of intersection 75 of the first and second cumulative density functions 74_1, 74_2 is the equal error rate (EER), wherein FRR = FAR. The score at the point of intersection 75 is used as a pass score for the prompt.
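An illustrative sketch of this pass-level calculation follows, assuming equation (32) takes the standard log-normal form; the score grid, the fitting of μ and σ from the log-scores and the crossing search are assumptions. The curves are arranged so that their crossing point satisfies FRR = FAR, as described above.

```python
import numpy as np

def lognormal_pdf(x, mu, sigma):
    # Standard log-normal density, taken here as the form of equation (32)
    return np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (x * sigma * np.sqrt(2 * np.pi))

def pass_score_for_prompt(valid_scores, impostor_scores, n_points=2000):
    # Fit a log-normal density to each population of scores
    mu1, s1 = np.mean(np.log(valid_scores)), np.std(np.log(valid_scores))
    mu2, s2 = np.mean(np.log(impostor_scores)), np.std(np.log(impostor_scores))
    upper = 1.5 * max(np.max(valid_scores), np.max(impostor_scores))
    x = np.linspace(upper / n_points, upper, n_points)
    dx = x[1] - x[0]
    # Numerically integrate the densities to obtain cumulative curves
    cdf_valid = np.cumsum(lognormal_pdf(x, mu1, s1)) * dx
    cdf_impostor = np.cumsum(lognormal_pdf(x, mu2, s2)) * dx
    frr = 1.0 - cdf_valid   # valid users rejected when their score exceeds the pass level
    far = cdf_impostor      # impostors accepted when their score falls below the pass level
    idx = int(np.argmin(np.abs(frr - far)))
    return float(x[idx])    # score at the crossing point (equal error rate), used as the pass level
```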
Creating a voice authentication biometric
Referring to Figure 38, a voice authentication biometric 76 is shown. The voice authentication biometric 76 comprises sets of data 77_1, 77_2, ..., 77_q corresponding to featuregram archetypes 65 and associated prompts 78. Statistical information 79 regarding each featuregram archetype 65 and an associated prompt 78 may also be stored and will be described in more detail later. The voice authentication biometric 76 further comprises ancillary information including the number of prompts to be issued during authentication 80, scoring strategy 81, and higher-level and gain settings 82. The biometric 76 may include further information, for example related to high-level logic for analysing scores.
The voice authentication biometric 76 is stored in non-volatile memory 11 (Figure 1).
Authentication
Referring again to Figures 1 and 2, once enrolment has been successfully completed, the user is registered as a valid user. Access to the secure system 3 is conditional on successful authentication.
Referring to Figure 39, the authentication process corresponding to step S2 in Figure 2 is shown. The voice authentication system 1 is initialised, for example by setting the amplifier gain to a value stored in the voice authentication biometric, or calibrated, for example to ensure that an appropriate amplifier gain is set (step S2.1). The user is then prompted (step S2.2) and the user's responses are recorded (step S2.3). Featuregrams are generated from the recordings (step S2.4). The recordings are examined so as to isolate speech from background noise and periods of silence (steps S2.5 and S2.6). Checks are performed to ensure that the recordings, isolated speech utterances and featuregrams are suitable for processing (step S2.7). The featuregrams are then matched with the featuregram archetype (step S2.8). The response is also checked for replay attack (step S2.9). The user's response is then scored (step S2.10).
Initialisation / Calibration
The gain of the amplifier 6 (Figure 1) is set according to the value 82 (Figure 38) stored in the voice authentication biometric 76 (Figure 38), which is stored in non-volatile memory 11 (Figure 1).
Alternatively, the system may be calibrated in a way similar to that used in enrolment. However, the process may differ. For example, prompts used in authentication may differ from those used in enrolment. A value of gain determined during enrolment calibration need not be recorded but may be compared with a value stored in the voice authentication biometric and used to determine whether the user is a valid user.
Authentication prompts
Authentication prompts are chosen from those stored in the voice authentication biometric 76 (Figure 38). Preferably, prompts are randomly chosen from a sub-set. This has the advantage that it becomes more difficult for a user to guess which prompt will be used and so give an unnatural response. Moreover, this improves security.
Recording
Referring to Figure 40, following the or each prompt, a signal 83 is recorded using the microphone 5 (Figure 1) in a manner hereinbefore described.
Creating authentication featuregrams
Referring to Figures 41, 42 and 43, the or each recorded signal 83 is divided into timeslices 84. The timeslices 84 use the same window size and the same overlap as used for enrolment. Feature vectors 85 are created. Again, the same process 25 is used in authentication as in enrolment. The feature vectors 85 are concatenated to produce featuregrams 86. The featuregrams 86 generated during authentication are usually referred to as authentication featuregrams.
Referring to Figure 44, explicit endpointing may be performed using the process 27 described earlier so as to generate endpoints 87. Explicit endpointing may be used to support sanity checks.
Sanity checks
Sanity checks are conducted on the recorded signal 83 as described earlier.
Matching authentication featuregrams with the voice authentication biometric
Referring to Figure 45, the process 50 and the featuregram archetype 65 are used to word spot the authentication featuregram 86 and provide a dynamic time warping winning score 87. The process 50 may be used to provide endpoints 28'.
Rejecting a replay attack
A potential threat to the security offered by any voice authentication system is the possibility of an impostor secretly recording a spoken response of the valid user and subsequently replaying a recording to gain access to the system. This is known as a "replay attack".
One solution to this problem is to issue, during each separate authentication, a randomly chosen subset of prompts from a full set of prompts responded to during enrolment. This means that several different authentication sessions will need to be secretly recorded before an impostor can collect a complete set of the responses. However, this does not combat the threat from recordings made during enrolment. Another solution is to store copies of the featuregrams generated during recent authentications and track them to see if they vary sufficiently over time. However, this has several drawbacks. Firstly, additional storage is needed. Secondly, replaying the same recording on several occasions under different levels and types of background noise may in itself provide sufficient variability for the system to be fooled into thinking that it is observing legitimate live spoken responses provided by the valid user.
Referring to Figures 46 and 47, a process for rejecting a replay attack is shown:
A fixed-phrase prompt is randomly selected (step S2.9.1). An example of a fixed-phrase prompt is "This is my voiceprint". A recording is started (step S2.9.2). The user is then prompted a first time (step S2.9.3). After a predetermined period of time, for example 1 or 2 seconds, the user is prompted a second time with the same prompt (steps S2.9.4 and S2.9.5). Thus, the user supplies two different examples 89_1, 89_2 separated by a 1-2 second interval 90. A featuregram 86 is generated as described earlier (step S2.9.6). The interval may comprise silence and/or noise.
The word spotting process 50 is used to isolate the two spoken responses 89_1, 89_2 to the fixed-phrase prompt and the interval 90 (steps S2.9.7 and S2.9.8). The isolated responses 89_1, 89_2, in the form of truncated featuregrams, are fed to process 88. Each truncated featuregram provides a representation of the spoken response. The duration of the interval 90 is determined.
If the featuregrams 89_1, 89_2 are too similar, either to each other or to the featuregram archetype 65 stored in the voice authentication biometric, then authentication is rejected on suspicion of a replay attack (steps S2.9.9 to S2.9.13). A corresponding reject flag 91 is set.
A record 92 is kept of the degree of match between the two featuregrams 89_1, 89_2 and the length of the intermediate silence 90 (step S2.9.11). This record 92 is known as a "Replay Attack Statistic" (RAS). The record 92 comprises two integers. Therefore, it is possible to store a plurality of replay attack statistics 92 for each fixed-phrase prompt in the voice authentication biometric 76 (Figure 38) without consuming a significant amount of memory. The record 92 is stored in the statistical information 79 (Figure 38).
If, during a subsequent authentication, a close match is detected between the latest replay attack statistic 92 and any previously stored replay attack statistic 92 in the voice authentication biometric 76 (Figure 38) (steps S2.9.15 and S2.9.16), then the authentication may be rejected on suspicion of a replay attack. Additionally or alternatively, the process may be repeated using a different prompt to check for a replay attack based on another set of replay attack statistics 92.
If, during a subsequent authentication, the duration of the interval 90 is found to be the same as the duration of the interval 90 for the same prompt arising from an earlier authentication, then the authentication may also be rejected on suspicion of a replay attack (steps S2.9.17 and S2.9.18).
The advantage of using this approach is that it is possible to monitor and detect suspicious similarities between featuregram archetypes even if the acoustic environment has changed since the time the recording was originally made. Furthermore, the approach helps to guard against replay attacks based on recordings made during enrolment and authentication. Additionally, the cost of storing the replay attack statistics is low, typically 3 bytes per prompt. Thus, to monitor the last 5 authentication attempts across 5 fixed prompts typically requires 75 bytes of memory.
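The double-response check and the replay attack statistic might be sketched as follows; the matching helper dtw_score, the similarity thresholds and the way the two integers are quantised are all assumptions rather than values taken from the description.

```python
def replay_attack_check(resp1_fg, resp2_fg, archetype, interval_frames,
                        stored_stats, dtw_score,
                        min_self_distance=10, min_archetype_distance=10,
                        match_tolerance=2):
    # Sketch of steps S2.9.9 to S2.9.18; lower dtw_score means a closer match
    pair_score = int(dtw_score(resp1_fg, resp2_fg))
    # The two responses are suspiciously similar to each other or to the stored archetype
    if pair_score < min_self_distance:
        return False
    if (dtw_score(archetype, resp1_fg) < min_archetype_distance or
            dtw_score(archetype, resp2_fg) < min_archetype_distance):
        return False
    # Replay Attack Statistic: two small integers per attempt (roughly 3 bytes per prompt)
    ras = (pair_score, int(interval_frames))
    for match, interval in stored_stats:
        if abs(match - ras[0]) <= match_tolerance:  # close match to an earlier statistic
            return False
        if interval == ras[1]:                      # identical interval for the same prompt
            return False
    stored_stats.append(ras)  # keep the statistic for future authentications
    return True
```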
Higher-level decision logic
A decision on whether to accept or reject the user is based on the degree of match between the featuregram archetypes 65 stored in the voice authentication biometric 76 (Figure 38) and the featuregrams 86 derived from the authentication recordings.
Higher-level decision logic is subsequently applied. Higher-level decision logic may include calculating an average score for a plurality of featuregrams 86 and determining whether the average score falls below a first predetermined scoring threshold, i.e. D_av < D_thresh1. If the average score falls below the first predetermined scoring threshold, then authentication is considered successful.
Higher-level decision logic may include determining the number, n, of featuregrams 86 whose scores fall below a second predetermined scoring threshold, i.e. D_i < D_thresh2 for all 0 < i ≤ p. The decision logic subsequently comprises checking a pass condition. For example, the pass condition may be that the scores for n out of p featuregrams 86 fall below the second predetermined scoring threshold, where 1 < n < p. Allowing one or more of the featuregram scores to be ignored is useful because it allows the valid user to provide an uncharacteristic response to at least one of the prompts without being unduly penalised.
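A compact sketch combining the two checks just described into a single accept/reject function follows; combining them with a logical OR, and the parameter names, are presentational choices rather than part of the description.

```python
def higher_level_decision(scores, avg_threshold, single_threshold, n_required):
    # Accept if the average score satisfies D_av < D_thresh1 ...
    p = len(scores)
    if sum(scores) / p < avg_threshold:
        return True
    # ... or if at least n_required of the p scores satisfy D_i < D_thresh2
    below = sum(1 for d in scores if d < single_threshold)
    return below >= n_required
```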
For fixed prompts, with a priori knowledge of a response, the scoring thresholds may be set based upon the statistical method described earlier.
For challenge-response prompts, a threshold may be determined during enrolment. A plurality of specimens, preferably two or three, of the same response are taken. A featuregram archetype is determined. Additionally, a variance is determined.
Thus, a fixed number of prompts are issued and spoken responses are recorded. The spoken responses are analysed to determine whether a valid user is addressing the system.
However, an alternative strategy may be used, which adaptively determines a number of prompts to be issued.
Initially, a user is prompted a predetermined number of times, for example two or three times. Spoken responses are recorded, corresponding featuregrams are obtained and compared with the featuregram archetype so as to produce a number of scores. Depending on the scores, further prompts may be issued. For example, if all or substantially all the scores fall below a threshold score, indicating a good number of matches, then no further prompts are issued and authentication is successful. Conversely, if all or substantially all the scores exceed the threshold score, indicating a poor number of matches, then authentication is unsuccessful. However, if some scores fall below the threshold and other scores exceed the threshold, then further prompts are issued and further scores obtained.
This process continues until either the proportion of successful scores exceeds a first predetermined proportion, for example 70%, in which case authentication is successful, or falls below a second predetermined proportion, such as 30%, in which case authentication is considered unsuccessful.
This has the advantage that valid users who provide consistently good examples of speech when prompted need only provide a small number of spoken responses, thus saving time.
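An adaptive prompting loop of this kind might look like the following sketch; the 70% and 30% proportions come from the example above, while the initial and maximum numbers of prompts and the scoring callback are assumptions.

```python
def adaptive_authentication(issue_prompt_and_score, pass_level,
                            initial_prompts=3, max_prompts=10,
                            pass_fraction=0.70, fail_fraction=0.30):
    # issue_prompt_and_score() is assumed to prompt the user, record a response and
    # return the dynamic time warping score against the featuregram archetype
    scores = [issue_prompt_and_score() for _ in range(initial_prompts)]
    while True:
        good = sum(1 for s in scores if s < pass_level) / len(scores)
        if good >= pass_fraction:
            return True           # authentication successful
        if good <= fail_fraction:
            return False          # authentication unsuccessful
        if len(scores) >= max_prompts:
            return False          # assumed cap on the number of prompts
        scores.append(issue_prompt_and_score())  # undecided: issue a further prompt
```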
In the above embodiment, the voice authentication system is comprised in a single unit, such as a personal computer. However, the voice authentication system may be distributed.
For example, the processor for performing the matching process and non-volatile memory holding the voice authentication biometric may be held on a so-called "smart card" which is carried by the valid user. This is particularly convenient for controlling access to a room or building via an electronically-controlled lockable door. The door is provided with a microphone and a smart card reader. The door is also provided with a speaker for providing audio prompts and/or a display for providing text prompts. When the smart card is inserted into the smart card reader, the voice authentication system is connected and permits authentication and, optionally, enrolment. Enrolment may be performed elsewhere, preferably under supervision of the system administrator, using another microphone and smart card reader together with speaker and/or display. This has the advantage that access is conditional not only on successful authentication, but also possession of the smart card. Furthermore, the voice authentication biometric and the matching process may be encrypted. The smart card may also be used in personal electronic devices, such as cellular telephones and personal data assistants.
Smart Card
Voice authentication using a smart card will now be described in more detail:
Referring to Figure 48, a modified voice authentication system 1' is provided by a personal computer 93, for example in the form of a lap-top personal computer, and a smart card 94. The personal computer 93 includes a smart card reader 95 for permitting the personal computer 93 and smart card 94 to exchange signals. The smart card reader 95 may be a peripheral device connected to the computer 93.
The smart card 94 includes an input/output circuit 96, processor 9', non-volatile memory 10' and volatile memory 11'. If the smart card 94 is used for storing the voice authentication biometric, but not for performing the matching or other processes, then the smart card 94 need not include the processor 9' and volatile memory 11'.
Referring to Figure 49, the smart card 94 takes the form of a contact smart card 94_1. The contact smart card 94_1 includes a set of contacts 97 and a chip 98. An example of a contact smart card 94_1 is a JCOP20 card.
Referring to Figure 50, the smart card 94 may alternatively take the form of a contactless smart card 94_2. The contactless smart card 94_2 includes a loop or coil 99 and a chip 99. The contactless smart card 94_2 may include a plurality of sets of loops (not shown) and corresponding chips (not shown). An example of a contactless smart card 94_2 is an iCLASS™ card produced by HID Corporation™.
Referring to Figure 51, the contact smart card 94_1 and smart card reader 95 are shown in more detail. The contact smart card 94_1 and smart card reader 95 are connected by an interface 101 including a voltage line Vcc 102_1, for example at 3 or 5 V, a reset line Rst 102_2 for resetting RAM, a clock line 102_3 for providing an external clock signal from which an internal clock is derived, and an input/output line 102_4. Preferably, the interface conforms to ISO 7816.
Volatile memory 11' (Figure 48) is in the form of RAM 103 and is used during operation of software on the card. If a reset signal is applied to line Rst 102_2 or if the card is disconnected from the card reader 95, then the contents of the RAM 103 are reset.
Non-volatile memory 10' (Figure 48) is in the form of ROM 104 and EEPROM 105. An operating system (not shown) is stored in ROM 104. Application software (not shown) and the voice authentication biometric 76 (Figure 38) are stored in EEPROM 105. Contents of the EEPROM 105 may be set using a card manufacturer's development kit. The EEPROM 105 may have a memory size of 8 or 16 kbits, although other memory sizes may be used.
Processor 9' (Figure 48) may be in the form of an embedded processor 106 which handles, amongst other things, encryption, for example based on the triple data encryption standard (DES).
The interface 96, RAM 103, ROM 104, EEPROM 105 and processor 106 may be incorporated into chip 98 (Figure 49), although a plurality of chips may be used.
The smart card reader 95 is connected to the personal computer 93 which runs one or more computer programs for permitting communication with the smart card 94_1 using Application Protocol Data Units (APDUs), for example as specified according to ISO 7816-4. However, other schemes and/or protocols which allow communication between a smart card and reader may be used. Referring to Figure 52, a table 107 lists APDU commands 108 which can be sent to the smart card 94 (Figure 49) and corresponding responses 109. Matrix 110 links commands 108 with corresponding responses 109 using an 'X'.
The term "template" is used hereinafter to refer to a featuregram archetype 65.
Template download
During or following enrolment, a voice authentication biometric 76 (Figure 38) is stored on the smart card 94 which includes one or more templates 65 (Figure 38).
Referring to Figure 53, a process of downloading templates 65 (Figure 38) to the smart card 94 is shown.
One or more "template download" commands are transmitted, each command including a respective section or portion of the template 65 (step S3.1) to be stored on the smart card 94 in EEPROM 105 (Figure 51). For example, a section of a template can be a feature vector. The smart card 94 returns a response indicating whether the template download command was successful or unsuccessful (step S3.2). If unsuccessful, the response specifies an error and the process is repeated (steps S3.3 and S3.4).
This process may be repeated a plurality of times for each template 65 (Figure 38), for example corresponding to different prompts.
If no more sections of a template are to be downloaded, then a "template download ended" command is sent (step S3.5). The smart card 94 returns a response indicating whether template download was successful or unsuccessful (step S3.6).
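By way of illustration, the download exchange might be driven as in the sketch below; the class and instruction bytes are placeholders (the real encodings are those listed in Figure 52), and the transmit callable is assumed to return (data, sw1, sw2) in the manner of, for example, pyscard's CardConnection.transmit.

```python
INS_TEMPLATE_DOWNLOAD = 0x10        # placeholder instruction byte, not taken from Figure 52
INS_TEMPLATE_DOWNLOAD_ENDED = 0x12  # placeholder instruction byte, not taken from Figure 52
SW_SUCCESS = (0x90, 0x00)           # standard ISO 7816-4 "success" status words

def download_template(transmit, template_sections, cla=0x80, max_retries=3):
    # Send each template section (for example one feature vector) in a
    # "template download" command, retrying on error (steps S3.1 to S3.4)
    for section in template_sections:
        for _ in range(max_retries):
            apdu = [cla, INS_TEMPLATE_DOWNLOAD, 0x00, 0x00, len(section)] + list(section)
            _, sw1, sw2 = transmit(apdu)
            if (sw1, sw2) == SW_SUCCESS:
                break
        else:
            raise IOError("template download command failed")
    # Close the transfer with a "template download ended" command (steps S3.5 and S3.6)
    _, sw1, sw2 = transmit([cla, INS_TEMPLATE_DOWNLOAD_ENDED, 0x00, 0x00, 0x00])
    return (sw1, sw2) == SW_SUCCESS
```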
Other data, such as data relating to the prompt, may be included in the data field of the "template download" command.
Template upload
During authentication, at least one template 65 is compared with a featuregram 76 (Figure 45). The smart card 94 need not necessarily perform the comparison, i.e. the comparison is performed "off-chip". This may be because the smart card 94 does not have a sufficiently powerful processor. Alternatively, it may be decided to perform an "off-chip" comparison. Under these circumstances, one or more templates 65 (Figure 38) are uploaded to the computer 93.
The process is similar to template downloading, but carried out in the reverse direction. In other words, an "upload template to laptop" command is used.
Featuregram download
As explained earlier, during authentication, at least one template 65 is compared with a featuregram 76 (Figure 45). It is advantageous for the smart card 94 to perform the comparison, i.e. the comparison is performed "on-chip", in which case templates 65 (Figure 38) do not leave the smart card 94. This can help prevent copying, stealing and corruption of templates 65 (Figure 38).
One or more "feature vector download" commands are transmitted, each including a respective feature vector (step S4.1). The smart card 94 returns a response indicating whether the feature vector download was successful or unsuccessful (step S4.2). If unsuccessful, the response specifies an error and the process is repeated (steps S4.3 and S4.4).
This process is repeated a plurality of times until a featuregram 76 is downloaded (Figure 38).
If no more feature vectors are to be downloaded, then a "feature vector download ended" command is sent (step S4.5). The smart card 94 returns a response indicating whether feature vector download was successful or unsuccessful (step S4.6).
Referring to Figure 55, once the featuregram 76 has been downloaded, a "return score" command is sent to the card (step S5.1). The processor 106 compares the template 65 with the featuregram 76 as described above to produce a score, which is compared with a threshold, which may be hard coded or previously downloaded (step S5.2), and a response is returned indicating whether authentication was successful or unsuccessful (step S5.3).
APDU commands may be used to delete featuregrams. As shown in Figure 57, other APDU commands may be provided, such as "delete biometric" for deleting all templates and other data, and "verify biometric loaded" for checking whether the card holds a voice authentication biometric.
The smart card can perform other processes, including some of the processes described earlier, such as detecting replay attack and performing higher-level logic.
Storing the voice authentication biometric on a smart card can have several advantages.
It helps provide a secure mechanism for validating that a smart card user is the smart card owner. This is of particular importance for financial transactions, for example using credit and debit cards.
It helps keep the voice authentication biometric in the user's possession. This helps to avoid data protection issues, such as the need to comply with data protection legislation.
Authentication can be performed at a remote site without the need to communicate with a server holding a database containing voice authentication biometrics.
Furthermore, because the smart card is available locally at the point of use, it helps avoid the need to communicate through telephone or data lines, thus helping to save costs, increase speed and improve security.
Performing matching on the smart card can also have several advantages.
It helps to avoid the voice authentication biometric being copied, stolen or corrupted.
Additionally, it can help provide backward compatibility by minimising modifications to an existing system so as to provide a facility for voice authentication.
Many modifications may be made to the embodiment hereinbefore described. For example, the recorded signal may comprise a stereo recording. The smart card may be any sort of token, such as a tag or key, or incorporated into objects such as a watch or jewellery, which can be held in the user's possession. Information storage media and devices may be used, such as a memory stick, floppy disk or optical disk. The smart card may be a mobile telephone SIM card. The smart card may be marked and/or store data so as to identify that the card belongs to a given user.
Prompts need not be explicitly stated. For example, a prompt may be a green light or the word "Go".
Measurements of background noise may be made in different ways. For example, a recorded signal, or part thereof, may be divided into a plurality of frames. A value of background noise may be determined by selecting one or more of the lowest energy frames and either using one of the selected frames as a representative frame or obtaining an average of all the selected frames. To select the one or more lowest energy frames, the frames may be arranged in order of signal energy. Thereafter, the ordered frames may be examined to determine a boundary where signal energy jumps from a relatively low level to a relatively high level. Alternatively, a predetermined number of frames at the lower energy end may be selected.
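One of the alternatives just described, averaging a predetermined number of the lowest-energy frames, might be sketched as follows; the frame representation and the choice of five frames are assumptions.

```python
import numpy as np

def background_noise_level(frames, n_lowest=5):
    # Energy of each frame
    energies = np.array([np.mean(np.square(np.asarray(f, dtype=np.float64))) for f in frames])
    # Average of the lowest-energy frames is taken as the background noise value
    return float(np.mean(np.sort(energies)[:n_lowest]))
```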

Claims

1. A token storing a voice authentication biometric.
2. A token according to claim 1 for possession by a user.
3. A token according to claim 1 or 2, which is small enough to be kept on a user.
4. A token according to any preceding claim, which is small enough to be worn by a user as jewellery.
5. A token according to any preceding claim, which is small enough to be kept in a pocket of an article of clothing worn by a user.
6. A smart card storing a voice authentication biometric.
7. A token or smart card according to any preceding claim, wherein the voice authentication biometric is for use in authenticating a user using a sample of speech from the user.
8. A token or smart card according to any preceding claim, wherein the voice authentication biometric includes at least one set of feature vectors.
9. A token or smart card according to any preceding claim, wherein the voice authentication biometric includes at least one archetype set of feature vectors.
10. A token or smart card according to claim 8 or 9, wherein the voice authentication biometric includes at least one prompt, each prompt associated with a respective set of feature vectors.
11. A token or smart card according to any one of claims 8 to 10, wherein the voice authentication biometric includes corresponding statistical information relating to each set of feature vectors.
12. A token or smart card according to any preceding claim, wherein the voice authentication biometric includes data for controlling authentication procedure.
13. A token or smart card according to any preceding claim, wherein the voice authentication biometric includes data for determining authentication.
14. A token or smart card according to any preceding claim, wherein the voice authentication biometric includes data for configuring a voice authentication apparatus.
15. A token or smart card according to any preceding claim, wherein the voice authentication biometric is encrypted.
16. A token or smart card according to any preceding claim including nonvolatile memory storing the voice authentication biometric.
17. A token or smart card according to any preceding claim storing a computer program comprising program instructions for causing a computer to perform a matching process for use in voice authentication.
18. A token for voice authentication including a processor, the token storing a voice authentication biometric including a first set of feature vectors and a computer program comprising program instructions for causing the processor to perform a method, the method comprising: receiving a second set of feature vectors; and comparing the first and second set of feature vectors.
19. A smart card for voice authentication including a processor, the smart card storing a voice authentication biometric including a first set of feature vectors and storing a computer program comprising program instructions for causing the processor to perform a method, the method comprising: receiving a second set of feature vectors; and comparing the first and second set of feature vectors.
20. A token or smart card according to claim 18 or 19, wherein the computer program comprises program instructions for causing the processor to perform a method, the method comprising: requesting a user to provide a spoken response.
21. A token or smart card according to any one of claims 18 to 20, wherein the computer program comprises program instructions for causing the processor to perform a method, the method comprising: receiving a recorded signal including a recorded signal portion corresponding to a spoken response.
22. A token or smart card according to claim 21, wherein the computer program comprises program instructions for causing the processor to perform a method, the method comprising: determining endpoints of said recorded signal portion corresponding to a spoken response.
23. A token or smart card according to claim 21 or 22, wherein the computer program comprises program instructions for causing the processor to perform a method, the method comprising: deriving said second set of feature vectors for characterising said recorded signal portion.
24. A token or smart card according to any one of claims 18 to 23, wherein the computer program comprises program instructions for causing the processor to perform a method, the method comprising: producing a score dependent upon a degree of matching between said first and second set of feature vectors.
25. A token or smart card according to claim 24, wherein the computer program comprises program instructions for causing the processor to perform a method, the method comprising:
comparing the score with a predefined threshold so as to determine authentication of a user.
26. A method of voice authentication, the method comprising in a token or smart card: providing a first set of feature vectors; receiving a second set of feature vectors for characterising a recorded signal portion; and comparing said first and second sets of feature vectors.
27. A method according to claim 26, further comprising: providing data relating to a prompt.
28. A method according to claim 26 or 27, further comprising: receiving a recorded signal including a recorded signal portion corresponding to a spoken response.
29. A method according to claim 28, further comprising: determining endpoints of said recorded signal portion.
30. A method according to claim 28 or 29, further comprising: deriving said second set of feature vectors for characterising said recorded signal portion.
31. A method according to any one of claims 26 to 30, further comprising: producing a score dependent upon a degree of matching between said first and second set of feature vectors.
32. A method according to claim 31, further comprising: comparing the score with a predefined threshold so as to determine authentication of a user.
33. A method according to any one of claims 26 to 32, further comprising: receiving a recorded signal which includes a recorded signal portion corresponding to a spoken response and which includes a plurality of frames; determining endpoints of said recorded signal including: determining whether a value of energy for a first frame exceeds a first predetermined value; and determining whether a second frame immediately preceding the first frame represents a spoken utterance portion.
34. A method according to any one of claims 26 to 33, further comprising: requesting said authenticating user to provide first and second spoken responses to said prompt; obtaining a recorded signal including first and second recorded signal portions corresponding to said first and second spoken responses; isolating said first and second recorded signal portions; deriving second and third sets of feature vectors for characterising said first and second isolated recorded signal portions respectively; comparing said second set of feature vectors with said third set of feature vectors so as to produce a score dependent upon the degree of matching; and comparing the score with a predefined threshold so as to determine whether the first set of feature vectors is substantially identical to the second set of feature vectors.
35. A method according to any one of claims 26 to 34, further comprising: requesting a user to provide a plurality of spoken responses to a prompt; obtaining a plurality of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response; deriving a plurality of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion; comparing said sets of feature vectors with said first set of feature vectors so as to produce a plurality of scores dependent upon a degree of matching and determining whether authentication is successful in dependence upon said plurality of scores.
36. A method according to any one of claims 26 to 35, further comprising: receiving a recorded signal which includes a recorded signal portion; determining endpoints of said recorded signal by dynamic time warping said second set of feature vectors onto said first set of feature vectors, including: determining a first sub-set of feature vectors within said second set of feature vectors from which a dynamic time warping winning path may start and determining a second sub-set of feature vectors within said second set of feature vectors at which the dynamic time warping winning path may finish.
37. A method of determining an endpoint of a recorded signal portion in a recorded signal including a plurality of frames, the method comprising: determining whether a value of energy for a first frame exceeds a first predetermined value; and determining whether a second frame immediately preceding the first frame represents a spoken utterance portion.
38. A method according to claim 37, wherein the first predetermined value represents a value of energy of a frame comprised of background noise.
39. A method according to claim 37 or 38, comprising: defining a start point if the value of energy of the first frame exceeds the first predetermined value and the second frame does not represent a spoken utterance portion.
40. A method according to claim 39, further comprising: indicating that the first frame represents a spoken utterance portion.
41. A method according to any one of claims 37 to 39, comprising: defining a stop point if the value of energy of the first frame does not exceed the first predetermined value and the second frame represents a spoken utterance portion.
42. A method according to claim 41, further comprising: defining the first frame as not representing a spoken utterance portion.
43. A method according to claim 41 or 42, further comprising: counting a number of frames preceding a start point of the spoken utterance portion.
44. A method according to claim 43 further comprising: pairing the stop point with said start point of the spoken utterance portion if the number of frames exceeds a predetermined number.
45. A method according to claim 43 further comprising: pairing the stop point with a start point of a preceding spoken utterance portion if the number of frames does not exceed a predetermined number.
46. A method according to claim 37 or 38 comprising: determining whether the value of energy for a first frame exceeds a third predetermined value; and counting a number of frames preceding a start point of the spoken utterance portion.
47. A method according to claim 46 further comprising: defining a start point if the value of energy of the first frame exceeds the third predetermined value, the second frame does not represent a spoken utterance portion and if the number of frames does not exceed a predetermined number.
48. A method according to claim 47 further comprising: determining whether a value of energy for a third frame following said first frame exceeds the second predetermined value.
49. A method according to claim 48 further comprising: defining a stop point if the value of energy of the third frame does not exceed the third predetermined value.
50. A method according to claim 49 further comprising: pairing the stop point with the start point of the spoken utterance portion.
51. A method according to claim 50 further comprising: pairing the stop point with a start point of a preceding spoken utterance portion.
52. A method according to claim 42 comprising: defining the first frame as representing background noise if the value of energy of the first frame does not exceed the third predetermined value.
53. A method according to claim 52 further comprising: calculating an updated value of background energy using said value of energy of the first frame.
54. A method according to claim 53, further comprising: counting a number of frames preceding a start point of the spoken utterance portion and determining whether said number of frames exceeds another, larger number.
55. A method according to any one of claims 37 to 54 comprising: determining whether a value of rate of change of energy of the first frame exceeds a second predetermined value.
56. A method according to claim 55, wherein the second predetermined value represents a value of rate of change of energy of a frame comprised of background noise.
57. A method according to claim 55 or 56, comprising: defining a start point if the value of energy of the first frame exceeds the first predetermined value, and the value of rate of change of energy exceeds the second predetermined value and the second frame does not represent a spoken utterance portion.
58. A method according to claim 55 or 57, comprising: defining a stop point if the value of energy of the first frame does not exceed the first predetermined value, and the value of rate of change of energy does not exceed the second predetermined value and the second frame represents a spoken utterance portion.
59. A method according to any one of claims 55 to 58 comprising: determining whether the value of rate of change of energy for the first frame exceeds a fourth predetermined value.
60. A method of dynamic time warping for warping a first speech pattern (B) characterised by a first set of feature vectors onto a second speech pattern (A) characterised by a second set of feature vectors, the method comprising: identifying a first sub-set of feature vectors within said first set of feature vectors from which a dynamic time warping winning path starts and identifying a second sub-set of feature vectors within said first set of feature vectors at which the dynamic time warping winning path finishes.
61. A method of voice authentication comprising: enrolling a user including: requesting said enrolling user to provide a spoken response to a prompt; obtaining a recorded signal including a recorded signal portion corresponding to said spoken response; determining endpoints of said recorded signal portion; deriving a set of feature vectors for characterising said recorded signal portions; averaging a plurality of sets of feature vectors, each set of feature vectors relating to one or more different spoken responses to the prompt by said enrolling user so as to provide an archetype set of feature vectors for said response; storing said archetype set of feature vectors together with data relating to said prompt; authenticating a user including: retrieving said data relating to said prompt and said archetype set of feature vectors; requesting said authenticating user to provide another spoken response to said prompt; obtaining another recorded signal including another recorded signal portion corresponding to said other spoken response; determining endpoints of said other recorded signal portion; deriving another set of feature vectors for characterising said other recorded signal portions; comparing said another set of feature vectors with said archetype set of feature vectors so as to produce a score dependent upon a degree of matching; and comparing said score with a predefined threshold so as to determine whether said enrolling user and said authenticating user are the same.
62. A method of gain control comprising, a plurality of times: determining whether an amplified signal level is above a predetermined limit; either decreasing gain if the amplified signal level is above the predetermined limit or maintaining gain if otherwise; thereby permitting no increase in gain.
63. A method of gain control comprising, a plurality of times: determining whether an amplified signal level is below a predetermined limit; either increasing gain if the amplified signal level is below the predetermined limit or maintaining gain if otherwise; thereby permitting no decreases in gain.
64. A method of voice authentication comprising: requesting a user to provide first and second spoken responses to a prompt; obtaining a recorded signal including first and second recorded signal portions corresponding to said first and second spoken responses; isolating said first and second recorded signal portions; deriving first and second sets of feature vectors for characterising said first and second isolated recorded signal portions respectively; comparing said first set of feature vectors with said second set of feature vectors so as to produce a second score dependent upon the degree of matching; and comparing the second score with another predefined threshold so as to determine whether the first set of feature vectors is substantially identical to the second set of feature vectors.
65. A method of voice authentication including: requesting an authenticating user to provide a plurality of spoken responses to a prompt; obtaining a plurality of corresponding recorded signals, each recorded signal including a recorded signal portion corresponding to a respective spoken response; deriving a plurality of sets of feature vectors, each set of feature vectors for characterising a respective recorded signal portion; comparing said sets of feature vectors with an archetype set of feature vectors so as to produce a plurality of scores dependent upon a degree of matching and determining whether authentication is successful in dependence upon said plurality of scores.
66. A method of determining an authentication threshold score, the method including: requesting a first set of users to provide respective spoken responses to a prompt; for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response; for each user, deriving a set of feature vectors for characterising the recorded signal portion; for each user, comparing said set of feature vectors with an archetype set of feature vectors for said user so as to produce a score dependent upon a degree of matching; fitting a first probability density function to frequency of scores for said first set of users; requesting a second set of users to provide respective spoken responses to a prompt; for each user, obtaining a recorded signal which includes a recorded signal portion corresponding to the user's spoken response; for each user, deriving a set of feature vectors for characterising the recorded signal portion; for each user, comparing said set of feature vectors with an archetype set of feature vectors for a different user so as to produce a score dependent upon a degree of matching; fitting a second probability density function to frequency of scores for said set of users.
67. A method of averaging a plurality of feature vectors, the method comprising: providing a plurality of sets of feature vectors; comparing each set of feature vectors with each other set of feature vectors so as to produce a respective set of scores dependent upon a degree of matching; searching for a minimum score; determining whether at least one score is below a predetermined threshold.
68. A smart card for voice authentication comprising: means for storing a first set of feature vectors and data relating to a prompt; means for providing said data to an external circuit; means for receiving a second set of feature vectors relating to said prompt; means for comparing said first and second set of feature vectors so as to determine a score; and means for comparing said score with a predetermined threshold.
69. A smart card for voice authentication comprising: a memory for storing a first set of feature vectors and data relating to a prompt; an interface for providing said data to an external circuit and for receiving a second set of feature vectors relating to said prompt; a processor for comparing said first and second set of feature vectors so as to determine a score and for comparing said score with a predetermined threshold.
70. Information storage medium storing a voice authentication biometric.
71. A medium according to claim 70, which is portable.
72. A computer program comprising program instructions for causing a smart card to perform a method, the method comprising: retrieving from memory a first set of feature vectors; receiving a second set of feature vectors; and comparing the first and second set of feature vectors.
73. A method comprising writing at least part of a voice authentication biometric to a smart card or token.
74. A method according to claim 73, wherein said at least part of a voice authentication biometric is a set of feature vectors.
75. A method comprising writing a computer program to a smart card or token, the computer program comprising computer instructions for performing a method, the method comprising performing voice authentication.
76. A method according to claim 73 and 75.
77. A smart card for voice authentication including a processor, the smart card storing a computer program comprising program instructions for causing the processor to perform a method, the method comprising performing voice authentication.
78. A smart card reader/writer connected to apparatus for recording speech and generating feature vectors, said reader/writer being configured to transmit a set of feature vectors to a smart card or token and receive a response therefrom.
PCT/GB2003/002246 2002-05-22 2003-05-22 Voice authentication WO2003098373A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003230039A AU2003230039A1 (en) 2002-05-22 2003-05-22 Voice authentication

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0211842.0 2002-05-22
GB0211842A GB2388947A (en) 2002-05-22 2002-05-22 Method of voice authentication

Publications (2)

Publication Number Publication Date
WO2003098373A2 true WO2003098373A2 (en) 2003-11-27
WO2003098373A3 WO2003098373A3 (en) 2004-04-29

Family

ID=9937239

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2003/002246 WO2003098373A2 (en) 2002-05-22 2003-05-22 Voice authentication

Country Status (3)

Country Link
AU (1) AU2003230039A1 (en)
GB (1) GB2388947A (en)
WO (1) WO2003098373A2 (en)


Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2407681B (en) * 2003-10-29 2007-02-28 Vecommerce Ltd Voice recognition system and method
US20080208580A1 (en) * 2004-06-04 2008-08-28 Koninklijke Philips Electronics, N.V. Method and Dialog System for User Authentication
US20080195395A1 (en) * 2007-02-08 2008-08-14 Jonghae Kim System and method for telephonic voice and speech authentication
US8817964B2 (en) 2008-02-11 2014-08-26 International Business Machines Corporation Telephonic voice authentication and display
DK2364495T3 (en) * 2008-12-10 2017-01-16 Agnitio S L Method of verifying the identity of a speaking and associated computer-readable medium and computer
GB2541466B (en) * 2015-08-21 2020-01-01 Validsoft Ltd Replay attack detection
US9928840B2 (en) * 2015-10-16 2018-03-27 Google Llc Hotword recognition
GB2545534B (en) * 2016-08-03 2019-11-06 Cirrus Logic Int Semiconductor Ltd Methods and apparatus for authentication in an electronic device
GB2552721A (en) 2016-08-03 2018-02-07 Cirrus Logic Int Semiconductor Ltd Methods and apparatus for authentication in an electronic device
GB2555660B (en) 2016-11-07 2019-12-04 Cirrus Logic Int Semiconductor Ltd Methods and apparatus for authentication in an electronic device


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2237135A (en) * 1989-10-16 1991-04-24 Logica Uk Ltd Speaker recognition
DE4422545A1 (en) * 1994-06-28 1996-01-04 Sel Alcatel Ag Start / end point detection for word recognition
US6195638B1 (en) * 1995-03-30 2001-02-27 Art-Advanced Recognition Technologies Inc. Pattern recognition system
US6012027A (en) * 1997-05-27 2000-01-04 Ameritech Corporation Criteria for usable repetitions of an utterance during speech reference enrollment
AU2684100A (en) * 1999-03-11 2000-09-28 British Telecommunications Public Limited Company Speaker recognition

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4851654A (en) * 1987-05-30 1989-07-25 Kabushiki Kaisha Toshiba IC card
US4827518A (en) * 1987-08-06 1989-05-02 Bell Communications Research, Inc. Speaker verification system using integrated circuit cards
EP0920674A1 (en) * 1996-08-20 1999-06-09 Domain Dynamics Limited Security devices and systems
WO2003021539A1 (en) * 2001-08-31 2003-03-13 Schlumberger Systemes S.A. Voice activated smart card

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010025523A1 (en) * 2008-09-05 2010-03-11 Auraya Pty Ltd Voice authentication system and methods
AU2009290150B2 (en) * 2008-09-05 2011-11-03 Auraya Pty Ltd Voice authentication system and methods
AU2012200605B2 (en) * 2008-09-05 2014-01-23 Auraya Pty Ltd Voice authentication system and methods
US8775187B2 (en) 2008-09-05 2014-07-08 Auraya Pty Ltd Voice authentication system and methods
CN109313903A (en) * 2016-06-06 2019-02-05 思睿逻辑国际半导体有限公司 Voice user interface
KR20190015488A (en) * 2016-06-06 2019-02-13 시러스 로직 인터내셔널 세미컨덕터 리미티드 Voice user interface
US20190214022A1 (en) * 2016-06-06 2019-07-11 Cirrus Logic International Semiconductor Ltd. Voice user interface
US11322157B2 (en) 2016-06-06 2022-05-03 Cirrus Logic, Inc. Voice user interface
KR102441863B1 (en) * 2016-06-06 2022-09-08 시러스 로직 인터내셔널 세미컨덕터 리미티드 voice user interface
WO2019173304A1 (en) * 2018-03-05 2019-09-12 The Trustees Of Indiana University Method and system for enhancing security in a voice-controlled system

Also Published As

Publication number Publication date
GB0211842D0 (en) 2002-07-03
GB2388947A (en) 2003-11-26
AU2003230039A8 (en) 2003-12-02
WO2003098373A3 (en) 2004-04-29
AU2003230039A1 (en) 2003-12-02


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP