WO2010136722A1

WO2010136722A1 - Method for detecting words in a voice and use thereof in a karaoke game

Info

Publication number: WO2010136722A1
Application number: PCT/FR2010/051013
Authority: WO
Inventors: Nicolas Delorme; Damien Henry; Aymeric Zils
Original assignee: Voxler
Priority date: 2009-05-29
Filing date: 2010-05-27
Publication date: 2010-12-02
Also published as: ES2477198T3; FR2946175B1; EP2436004B1; EP2436004A1; FR2946175A1

Abstract

The invention essentially relates to a method for detecting the presence of words in a voice signal (S), characterized in that it comprises: a step consisting in measuring, at the moment of analysis (ti), a phonemic alternation (Vi) in the voice signal (S) over a reference period (TRi); and, if no phonemic alternation is detected over the reference period (TRi), deducing that no words are pronounced in the voice signal (S) at the analysis moment (ti), otherwise deducing that words are pronounced in the voice signal (S) at the analysis moment (ti). The invention can be advantageously used for a karaoke-type game.

Description

METHOD FOR DETECTING VOICE WORDS AND USE THEREOF IN A KARAOKE GAME

[001]. The present invention relates to a method for detecting sung lyrics in a voice signal. The invention aims in particular to provide a method that is simple to implement and consumes few resources to detect speech in the voice.

[002]. The invention finds a particularly advantageous, but not exclusive, application in "karaoke" type applications. Recall that karaoke is a game in which the player sings a known song over an accompaniment, in place of the original singer, usually following the lyrics on a screen. Alternatively, the invention could also be used in voice-interactive applications, for example in any video game in which it is desired to detect whether the player is speaking.

[003]. Karaoke video games such as "SingStar" (registered trademark) only evaluate the accuracy of a player's singing in relation to a reference melody. As a result, a player who hums the melody in rhythm (without singing the lyrics) will get the same score as, or a better score than, a player who actually sings the lyrics. Indeed, by humming, the player can focus only on the accuracy of the melody and/or the rhythmic precision, which is much easier than having to place the right lyrics of the song on the right melody and/or the right rhythm.

[004]. In particular, in some rap songs, there is no melody and the rhythm is too fast to be reliably evaluated. In this case, the detection of the lyrics in the song is a relevant criterion for evaluating the player.

[005]. To account for the words in the player's score, some recent games try to incorporate speech recognition, with questionable performance, since these speech recognition mechanisms are very difficult to build and algorithmically very expensive. Indeed, they require complex calculations (use of HMM models) to recognize complete words, which is difficult to implement and leads to frequent errors and significant latency.

[006]. The present invention makes it possible to check whether the player sings the lyrics in a much simpler way than traditional voice recognition, by tackling the problem in an original way: one does not seek to "recognize" the words sung by the player, which does not really make sense since these are already known (they are displayed on the screen), but to "check" whether the player sings words, instead of, for example, simply humming the melody.

[007]. The invention thus starts from the observation that all spoken and a fortiori sung language is characterized by an alternation of varied sounds (different phonemes) called in this document "phonemic alternation". Phonemic means what relates to phonemes, that is to say to each of the sounds composing a language. This phonemic alternation can for example be defined by an alternation between vowels and consonants, or between voiced sounds and voiceless sounds, or between various vowels, or between various consonants etc.

[008]. Humming is understood to be the absence of phonemic alternation. For example, when we hum, we only emit voiced sounds such as "la la la", "mmmm", "ah ah ah", characterized by an absence of alternation between voiced sounds and unvoiced sounds and therefore an absence of phonemic alternation, if one chooses to define it by an alternation of voiced sounds and unvoiced sounds. Conversely, a person who sings the lyrics of a song alternates, with rare exceptions, the emission of voiced sounds and unvoiced sounds.

[009]. The invention proposes to distinguish phonemic alternation, that is to say the pronunciation of words, from the absence of phonemic alternation (humming).

[010]. Recall that a sound is said to be "voiced" if its production is accompanied by a vibration of the vocal cords, and "unvoiced" otherwise. Since spoken language is made up of vowels and voiced consonants, which vibrate the vocal cords, and unvoiced consonants, which do not vibrate the vocal cords, we naturally observe this alternation between voiced and unvoiced sounds. This is true for the main languages spoken in the world. On the other hand, when humming, the sound emitted corresponds to a continuous emission of voiced sounds of the type "lalala" or "aaaaaaa" or "mmmmmmm".

[011]. In the invention it is observed whether, during a reference period, the voice of the player has variations of voicing or not. If this is the case, then we deduce that the player is singing lyrics over this reference period; while if this is not the case, we deduce that the player is humming on this reference period. It was found that a reference period of about one second provided good results. However any other reference period is possible.

[012]. In one implementation, the phonemic alternation related to the voiced and unvoiced character of the voice is measured. For this purpose, a voicing coefficient of the voice is calculated which has high values when the sound of the voice is voiced and low values when the sound of the voice is not voiced. In one example, this voicing coefficient corresponds to the measurement of the quality of the extraction of the fundamental frequency of the voice signal. When this voicing coefficient is greater than a threshold value throughout the reference period, it is deduced that the player is humming; on the other hand, when the voicing coefficient does not remain greater than the threshold value throughout the reference period, that is to say when it is not greater than the threshold at least once, it is deduced that the player is singing.
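The decision rule of this implementation can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `is_humming` is hypothetical, and the default threshold of 0.3 is only the example value given later in the description for a normalized signal.

```python
from typing import Sequence


def is_humming(voicing: Sequence[float], threshold: float = 0.3) -> bool:
    """Return True if every voicing coefficient of the reference period
    is greater than the threshold (no unvoiced moment was observed)."""
    if not voicing:
        raise ValueError("the reference period contains no measurement")
    return all(v > threshold for v in voicing)


# Hypothetical reference period of 50 measurements (about one second).
hummed = [0.9] * 50                      # voiced throughout
sung = [0.9] * 30 + [0.1] + [0.9] * 19   # one unvoiced moment

print(is_humming(hummed))  # True
print(is_humming(sung))    # False
```

A single measurement at or below the threshold is enough to conclude that words are being sung, which is why the check reduces to one pass over the reference period.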

[013]. The invention thus consists in verifying only whether the player utters real words and is not humming, without ensuring that the words actually correspond to the lyrics of the song. It is therefore not useful to check whether the sung words are the true lyrics of the song, but only whether words are sung. Indeed, while humming makes this kind of game considerably easier, singing other lyrics to a song is rather an additional difficulty for the player.

[014]. More generally, the measure of voicing / non-voicing is only one way of measuring phonemic alternation. Any other method of measuring a variation, such as variation in pronounced consonants (measuring the presence of certain consonants by other methods than measuring the rate of voicing) or variation of pronounced vowels (in the vowel triangle), would produce the same type of result.

[015]. Thus, alternatively, if we choose to characterize the phonemic alternation by the alternation of different vowels, we measure a variation of timbre in the vowel triangle. Indeed, a player who hums does not vary the timbre of his voice, whereas a player who sings words naturally varies the timbre of his voice. In the case where we do not detect a variation of the timbre of the voice in the vowel triangle over the reference period, we deduce that the player is humming; while in the case where we detect a variation of the timbre of the voice in the vowel triangle over the reference period, we deduce that the player is singing lyrics.

[016]. Alternatively, the consonants and/or vowels are separated into several groups, for example four groups of consonants and vowels. If all consonants and vowels belong to the same group, then the person can be considered to be humming. On the other hand, if the group to which the consonants and/or vowels belong varies, the person is saying words, that is to say a text whose content varies in terms of consonants and/or vowels.

[017]. The invention thus relates to a method for distinguishing the pronunciation of words with respect to the humming in a voice signal of a user, characterized in that it comprises the following steps:

- measuring a voicing coefficient at different times of a reference period,

- comparing the voicing coefficients thus measured over the reference period with a threshold value, and

- based on the results of these comparisons over the reference period, deducing whether the user is saying words or is humming at an instant of analysis.

[018]. According to one implementation, the reference period precedes the instant of analysis.

[019]. According to one implementation:

- if the voicing coefficient is greater than the threshold value throughout the reference period, then it is deduced that there is no unvoiced moment in the voice during this period and that the user is humming at the instant of analysis,

- otherwise it is deduced that the user is uttering words at the instant of analysis.

[020]. According to one implementation, the voicing coefficient is the quality parameter in the extraction of the fundamental frequency of the voice signal.

[021]. According to one implementation, the reference period is of the order of 1 second.

[022]. According to one implementation, the step of comparing the voicing parameter with the threshold value is performed only if the energy of the voice signal is greater than a threshold value.

[023]. According to one implementation, the voice signal being sampled, the method comprises the following steps:

- calculating an instantaneous energy and an instantaneous voicing coefficient for points of the voice signal at instants of analysis spaced apart by an analysis period over the reference period,

- determining the instantaneous state of the voice signal at each instant of analysis from the instantaneous energy and voicing coefficient measurements of the voice signal, each instantaneous state being either the "voiced" state corresponding to the emission of a sound of voiced nature, or the "unvoiced" state corresponding to the emission of a sound of unvoiced nature,

- if all the instantaneous states are of type "voiced" over the reference period, deducing that no words are pronounced in the voice signal at the instant of analysis,

- otherwise deducing that words are pronounced in the voice signal at the instant of analysis.

[024]. According to one implementation, to determine the instantaneous state of the voice signal at the instant of analysis:

- the voicing coefficient is compared with a threshold,

- if the voicing coefficient is lower than the threshold, then the instantaneous state is "unvoiced",

- otherwise it is deduced that the instantaneous state is "voiced".

[025]. According to one implementation, the instantaneous state can also take the "silence" state corresponding to the absence of a sound of sufficient power:

- if the last N instantaneous states over the reference period are of type "silence", then it is deduced that the signal does not contain voice at the instant of analysis, otherwise

- only the instantaneous states of type "voiced" or "unvoiced", excluding the instantaneous states of type "silence", are retained over the reference period.

[026]. According to one implementation, to determine the instantaneous state of the voice signal:

- the instantaneous energy of the voice signal is compared with a first threshold,

- if the energy of the signal is below the first threshold, then it is deduced that the instantaneous state is "silence",

- otherwise, the voicing coefficient is compared with a second threshold,

- if the voicing coefficient is lower than the second threshold, then the instantaneous state is "unvoiced",

- otherwise it is deduced that the instantaneous state "State_Pi" is "voiced".

[027]. According to one implementation, the analysis period is 20 ms and the duration of the reference period is 1 s.

[028]. According to one implementation, the voice signal is sampled at 16kHz.

[029]. The invention further relates to the use of the method according to the invention in a Karaoke game type application.

[030]. According to one use, the implementation of the method according to the invention is inhibited for voiced passages of song having a duration greater than the duration of the reference period or on passages of songs arbitrarily chosen.

[031]. The invention will be better understood on reading the description which follows and on examining the figures which accompany it. These figures are given by way of illustration and do not limit the invention. They show:

[032]. Figure 1: a graphical representation as a function of time of the amplitude of a voice signal, of the fundamental frequency extracted from it using a fundamental frequency detection algorithm, and of the quality signal of the extraction of the fundamental frequency;

[033]. Figure 2: a schematic representation of the steps of the method according to the invention for calculating instantaneous states of the voice signal;

[034]. Figure 3: a schematic representation of the steps of the method according to the invention for detecting whether the player sings words or hums from the instantaneous states of the voice signal;

[035]. Figure 4: a graphical representation of the amplitude of the voice signal corresponding to sung words as well as the activated or deactivated state of the speech detection function according to the invention during the course of the song.

[036]. Identical elements retain the same reference from one figure to another.

[037]. Figure 1 shows a schematic representation of the amplitude of a voice signal S as a function of time t.

[038]. In a first step 10 of the method according to the invention shown in FIG. 2, the instantaneous energy Ei and the voicing coefficient Vi of the voice are measured for all the points Pi of the voice signal S analyzed at the analysis instants ti, spaced apart in time by an analysis period TA. The higher the coefficient Vi, the more the sound of the voice at instant ti is voiced; while the lower this coefficient Vi, the less the sound of the voice at the instant ti is voiced.

[039]. From these measurements, we deduce the instantaneous state "State_Pi" of the voice signal S at each point Pi. This state "State_Pi" can be the "silence" state corresponding to the absence of a voice signal of sufficient power, the "voiced" state corresponding to the emission of a sound of voiced nature, or the "unvoiced" state corresponding to the emission of a sound of unvoiced nature.

[040]. For this purpose, the instantaneous energy Ei of the voice signal S is compared in a step 13 with a threshold A. In an example, this threshold is equal to 0.02 for a normalized signal. If the energy Ei of the signal is lower than the threshold A, then it is deduced in a step 15 that the instantaneous state "State_Pi" of the point Pi is "silence". On the other hand, if the energy Ei of the voice signal is greater than the threshold A, then it is deduced that a sound of sufficient power is actually leaving the mouth of the player, and it is then determined whether the sound is voiced or unvoiced.

[041]. For this purpose, the voicing coefficient Vi is compared with a threshold B in a step 17. In one example, B is equal to 0.3 for a normalized signal. If the voicing coefficient Vi is lower than the threshold B, then it is deduced in a step 18 that the sound is unvoiced (the instantaneous state "State_Pi" is then "unvoiced"). This means that the player is probably pronouncing a sound such as P, T, K, B, D, G, CH, F, S.

[042]. Whereas if the voicing coefficient Vi is greater than the threshold B, then it is deduced in a step 19 that the sound is voiced (the instantaneous state "State_Pi" is then "voiced"). This means that the player is probably pronouncing a vowel or a voiced consonant.
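The two comparisons of steps 13 to 19 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `classify_state` is hypothetical, the default thresholds are the example values A = 0.02 and B = 0.3 given above for a normalized signal, and the handling of a value exactly equal to a threshold is an assumption, since the text does not specify it.

```python
def classify_state(energy: float, voicing: float,
                   threshold_a: float = 0.02,
                   threshold_b: float = 0.3) -> str:
    """Return the instantaneous state of one analysis point:
    "silence", "unvoiced" or "voiced"."""
    # Step 13/15: not enough power, so no usable voice at this instant.
    if energy < threshold_a:
        return "silence"
    # Step 17/18: enough power but poorly voiced sound.
    if voicing < threshold_b:
        return "unvoiced"
    # Step 19: enough power and well voiced sound.
    # Assumption: values equal to a threshold fall on the upper side.
    return "voiced"


print(classify_state(0.01, 0.9))  # silence
print(classify_state(0.50, 0.1))  # unvoiced
print(classify_state(0.50, 0.8))  # voiced
```

The energy test comes first so that background noise of low power is never labelled as voiced or unvoiced.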

[043]. In one example, in order to calculate the instantaneous energy Ei and the voicing coefficient Vi, an algorithm is applied to the voice signal S which makes it possible to extract the fundamental frequencies of this signal S, represented as a function of time by the curve S' in Figure 1.

[044]. The voicing coefficient Vi corresponds to the coefficient Q measuring the quality of the fundamental frequency detection by the frequency detection algorithm, represented as a function of time by the curve S''. The extraction quality corresponds to the reliability of the detection of the fundamental frequency. The quality Q of the extraction of the fundamental frequency of the voice signal S, which is very closely related to the voicing of the voice, will be very high for the voiced parts of the voice, during which the vocal cords vibrate, which makes it easy to extract the fundamental frequency of the voice signal S. Conversely, the quality Q of the extraction of the fundamental frequency of the voice signal S will be low for the unvoiced parts, during which the vocal cords vibrate very little or not at all, which makes it difficult to extract the fundamental frequency of the voice signal S.

[045]. In one example, the fundamental frequency detection algorithm is the YIN algorithm. This algorithm, known to those skilled in the art, is precisely described in the France Telecom patent document having the French national registration number 0107284. The quality of the pitch detection is the value (1 - d'), d' being the averaged and normalized difference function of the YIN algorithm as described in that patent document, and is represented as a function of time by the curve S''.
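As an illustration of the quantity (1 - d'), the sketch below computes a normalized difference function in the spirit of YIN and returns one minus its minimum over a range of lags. It is a simplified sketch under stated assumptions, not the algorithm of the cited patent document: it takes the global minimum instead of YIN's thresholded first dip, it omits parabolic interpolation, it clamps the result at 0, and the names `yin_quality`, `tau_min` and `tau_max` are hypothetical.

```python
import math
import random


def yin_quality(frame, tau_min=20, tau_max=400):
    """Return (quality, lag), where quality = 1 - d'(lag) and lag is the
    lag in samples that minimises the normalized difference d'."""
    n = len(frame) - tau_max  # number of terms in each difference sum
    if n <= 0:
        raise ValueError("frame too short for tau_max")
    running_sum = 0.0
    best_d, best_tau = float("inf"), tau_min
    for tau in range(1, tau_max + 1):
        # Squared difference between the frame and its copy shifted by tau.
        d = sum((frame[j] - frame[j + tau]) ** 2 for j in range(n))
        running_sum += d
        # Normalise by the mean of d over lags 1..tau.
        d_norm = d * tau / running_sum if running_sum > 0 else 1.0
        if tau >= tau_min and d_norm < best_d:
            best_d, best_tau = d_norm, tau
    return max(0.0, 1.0 - best_d), best_tau


# A 200 Hz sine sampled at 16 kHz has a period of 80 samples.
sine = [math.sin(2 * math.pi * 200 * j / 16000) for j in range(1024)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(1024)]

q_sine, lag = yin_quality(sine, tau_min=20, tau_max=120)
q_noise, _ = yin_quality(noise, tau_min=20, tau_max=120)
print(lag, q_sine > q_noise)
```

A periodic (voiced-like) frame gives a quality close to 1, while a noise-like (unvoiced-like) frame gives a much lower value, which is the behaviour the voicing coefficient relies on.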

[046]. As a variant, the voicing coefficient is for example a measure of the non-harmonic noise contained in the audio signal, measured for example by the zero-crossing rate (ZCR), a low value of ZCR being characteristic of a voiced sound while a high value of ZCR is characteristic of an unvoiced sound. The use of the ZCR is particularly advantageous in the case where it is desired to minimize the CPU consumption of the system.
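The zero-crossing rate mentioned in this variant can be sketched as follows. This is an illustrative sketch only: the function name `zero_crossing_rate` is hypothetical, and the text does not say how the ZCR would be mapped onto a voicing coefficient that is high for voiced sounds, so any such mapping (for example an inverted comparison against a threshold) would be an assumption.

```python
import math


def zero_crossing_rate(frame):
    """Return the fraction of consecutive sample pairs whose signs differ."""
    if len(frame) < 2:
        raise ValueError("need at least two samples")
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


# A 200 Hz sine at 16 kHz crosses zero rarely; an alternating signal
# crosses zero between every pair of samples.
sine = [math.sin(2 * math.pi * 200 * j / 16000) for j in range(1024)]
alternating = [1.0, -1.0] * 512

print(zero_crossing_rate(sine) < 0.05)   # True
print(zero_crossing_rate(alternating))   # 1.0
```

The computation is a single pass with one comparison per sample, which is consistent with the low CPU cost noted above.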

[047]. In one example, the voice signal S being sampled at 16 kHz, the instantaneous energy Ei and the quality Qi are calculated every TA = 20 ms by applying the fundamental frequency detection algorithm to the last 1024 sampled points of the signal S, so that the successive analyzed segments of the signal S overlap (the last 1024 points corresponding to approximately 3 analysis periods TA of 20 ms). Alternatively, there is no overlap between the successive analyzed segments of the signal.
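The arithmetic of this example can be checked with a short sketch: at 16 kHz, an analysis period of 20 ms is 320 samples, and a window of 1024 samples lasts 64 ms, that is 3.2 analysis periods. The helper names below are hypothetical, and the mean-square definition of the energy is an assumption, since the text does not define how Ei is computed.

```python
SAMPLE_RATE_HZ = 16_000
ANALYSIS_PERIOD_MS = 20
WINDOW_SAMPLES = 1024

# Number of new samples between two analysis instants: 320.
HOP_SAMPLES = SAMPLE_RATE_HZ * ANALYSIS_PERIOD_MS // 1000


def analysis_windows(signal):
    """Yield (end_index, window) for each analysis instant, the window
    being the last 1024 samples available at that instant."""
    for end in range(WINDOW_SAMPLES, len(signal) + 1, HOP_SAMPLES):
        yield end, signal[end - WINDOW_SAMPLES:end]


def mean_energy(window):
    """Mean of the squared samples (assumed definition of the energy)."""
    return sum(x * x for x in window) / len(window)


one_second = [0.5] * SAMPLE_RATE_HZ
windows = list(analysis_windows(one_second))
print(HOP_SAMPLES, WINDOW_SAMPLES / HOP_SAMPLES, len(windows))
print(mean_energy(windows[0][1]))
```

Two consecutive windows share 1024 - 320 = 704 samples, which is the overlap referred to in the text.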

[048]. Then, as shown in FIG. 3, in a step 25, an analysis of the voice signal S is carried out over a reference period TRi of reference duration TR (approximately one second) before the instant ti, which amounts to keeping the last 50 instantaneous states State_Pj for TA = 20 ms. As a variant, the number of stored instantaneous states State_Pj could be different, in order to perform the analysis over a shorter or longer reference period TRi. As a variant, the reference period TRi may be replaced by a set of points around the instant ti, whether these points are before or after the instant ti.
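Keeping the last 50 instantaneous states amounts to a fixed-length history: 1 s divided by 20 ms gives 50 entries. A minimal sketch, assuming a bounded double-ended queue as the storage (the text does not prescribe a data structure, and the names below are hypothetical):

```python
from collections import deque

ANALYSIS_PERIOD_MS = 20      # TA
REFERENCE_DURATION_MS = 1000  # TR
HISTORY_LENGTH = REFERENCE_DURATION_MS // ANALYSIS_PERIOD_MS  # 50

# A deque with maxlen drops the oldest state when a new one is appended.
history = deque(maxlen=HISTORY_LENGTH)

for i in range(60):
    # Hypothetical stream of states: 10 silences followed by voiced sound.
    history.append("silence" if i < 10 else "voiced")

print(HISTORY_LENGTH, len(history), history[0])
```

After 60 analysis instants only the 50 most recent states remain, so the 10 initial silences have been discarded. Changing `REFERENCE_DURATION_MS` gives the shorter or longer reference period mentioned as a variant.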

[049]. In a step 27, it is analyzed whether the last N instantaneous states State_Pj of the signal S (typically N = 5, that is 100 ms) are silences. If this is the case, it is deduced that the instant of analysis ti is a moment of silence. Otherwise, it is deduced that ti is not a moment of silence, and it is then determined whether ti is a sung or hummed moment.

[050]. For this purpose, in a step 30, one keeps among the last 50 instantaneous states of the signal only the instantaneous states "State_Pj" of type "voiced" or "unvoiced", excluding the states of silence. Then, in a step 33, it is analyzed whether all the stored instantaneous states "State_Pj" are "voiced" states. If this is the case, then it is deduced in step 34 that the voice signal S corresponds to humming at the instant ti, since it is a priori impossible not to observe at least one unvoiced passage during the reference period TRi in sung language. On the other hand, if there are not only voiced states, then it is deduced in step 35 that the voice signal S corresponds to sung words at the instant ti, since it is a priori natural to observe at least one unvoiced passage during the reference period TRi in language sung with words.
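Steps 27 to 35 can be sketched as a single decision function over the stored states. This is a minimal sketch, not the patented implementation: the function name `decide` and the returned labels are hypothetical, N = 5 is the typical value given above, and returning "silence" when no voiced or unvoiced state remains is an assumption for a case the text does not cover.

```python
def decide(states, n_silence=5):
    """Return "silence", "humming" or "words" for the analysis instant,
    given the stored instantaneous states (most recent last)."""
    recent = list(states)
    # Step 27: the last N states are all silences.
    if len(recent) >= n_silence and all(
        s == "silence" for s in recent[-n_silence:]
    ):
        return "silence"
    # Step 30: keep only the voiced and unvoiced states.
    kept = [s for s in recent if s != "silence"]
    if not kept:
        return "silence"  # assumption: nothing left to analyse
    # Steps 33 to 35: only voiced states means humming.
    if all(s == "voiced" for s in kept):
        return "humming"
    return "words"


print(decide(["voiced"] * 50))                                # humming
print(decide(["voiced"] * 30 + ["unvoiced"] + ["voiced"] * 19))  # words
print(decide(["voiced"] * 45 + ["silence"] * 5))              # silence
```

Short silences inside the reference period are ignored rather than treated as unvoiced sound, so a breath between two hummed notes does not count as a word.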

[051]. When using the invention in a Karaoke, the player may be penalized for each moment ti during which he hummed instead of singing the lyrics of the song to be interpreted, or otherwise rewarded for each moment ti where he has sung with the words.

[052]. Some songs may have voiced passages longer than the duration TR of the reference period. Thus, FIG. 4 shows the amplitude 41 of the voice signal S corresponding to the words 42 of a song in which the entirely voiced passage 42.1 "the moon my friend" (in gray) has a duration TD greater than the duration TR of the reference period.

[053]. In order to avoid false humming detections on these particular words, it may be useful to inhibit the speech detection function for the entire duration TD of the voiced passage 42.1. Thus, as shown in the strip 43 of FIG. 4, the speech detection function according to the invention is inhibited over the period TD (turned OFF) but activated for the rest of the song (turned ON).

[054]. One can also activate this lyrics detection function during only part of the song (for example the chorus), for which it is necessary to know the words, and not during other parts (for example the verses), during which knowledge of the lyrics becomes optional.
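Inhibiting the detection over chosen passages can be sketched as a lookup against a list of time intervals. This is an illustrative sketch: the function name `detection_enabled`, the representation of passages as (start, end) pairs in seconds and the interval values are all hypothetical, since the text does not say how the inhibited passages are stored.

```python
def detection_enabled(t, inhibited_intervals):
    """Return False if instant t (in seconds) falls inside a passage
    where the speech detection function is turned OFF."""
    return not any(start <= t < end for start, end in inhibited_intervals)


# Hypothetical song annotation: one fully voiced passage of duration TD
# longer than the reference duration TR, from 10.0 s to 13.5 s.
inhibited = [(10.0, 13.5)]

print(detection_enabled(5.0, inhibited))   # True
print(detection_enabled(12.0, inhibited))  # False
print(detection_enabled(13.5, inhibited))  # True
```

The same list could hold arbitrarily chosen passages, so restricting the detection to part of the song only requires marking the other parts as inhibited.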

[055]. It should be noted that the detection of silences in the voice signal S improves the operation of the method according to the invention, as it prevents certain parasitic white noise from being arbitrarily considered as voiced or unvoiced sounds. However, alternatively, in a degraded mode of operation, the silence detection steps 13, 15, 27 and 29 are omitted, the instantaneous state "State_Pi" of the signal S is either "voiced" or "unvoiced", and the instantaneous states of the voice signal S are simply analyzed over the reference period TRi. It is deduced that the player is humming if all these instantaneous states are of type "voiced", and that he is singing otherwise.

Claims

1. A method for distinguishing the pronunciation of words with respect to the humming in a voice signal (S) of a user, characterized in that it comprises the following steps:

- measuring a voicing coefficient (Vi) at different times of a reference period (TRi),

- comparing the voicing coefficients (Vi) thus measured over the reference period (TRi) with a threshold value (B), and

- according to the results of these comparisons (State_Pj) over the reference period (TRi), deducing whether the user is pronouncing words or is humming at an instant of analysis (ti).

2. Method according to claim 1, characterized in that the reference period (TRi) precedes the analysis time (ti).

3. Method according to claim 1 or 2, characterized in that:

- if the voicing coefficient (Vi) is greater than the threshold value (B) throughout the reference period (TRi), then it is deduced that there is no unvoiced instant in the voice during this period and that the user is humming at the instant of analysis (ti),

- otherwise it is deduced that the user is pronouncing words at the instant of analysis (ti).

4. Method according to one of claims 1 to 3, characterized in that the voicing coefficient (Vi) is the quality parameter (Q) in the extraction of the fundamental frequency (f) of the voice signal (S).

5. Method according to one of claims 1 to 4, characterized in that the reference period (TRi) is of the order of 1 second.

6. Method according to one of claims 3 to 5, characterized in that the step of comparing the voicing parameter (Vi) with the threshold value (B) is performed only if the energy (Ei) of the voice signal (S) is greater than a threshold value (A).

7. Method according to claim 1 or 2, characterized in that, the voice signal (S) being sampled, it comprises the following steps:

- calculating an instantaneous energy (Ei) and an instantaneous voicing coefficient (Vi) for points (Pi) of the voice signal at instants (ti) of analysis spaced apart by an analysis period (TA) over the reference period (TRi),

- determining the instantaneous states "State_Pi" of the voice signal (S) at each instant (ti) from the measurements of the instantaneous energy (Ei) and of the voicing coefficient (Vi) of the voice signal (S), these instantaneous states being able to be the "voiced" state corresponding to the emission of a sound of voiced nature, or the "unvoiced" state corresponding to the emission of a sound of unvoiced nature,

- if all the instantaneous states "State_Pj" are of type "voiced" over the reference period (TRi), then it is deduced that there is no pronunciation of words in the voice signal (S) at the instant of analysis (ti),

- otherwise it is deduced that there is pronunciation of words in the voice signal (S) at the instant of analysis (ti).

8. Method according to claim 7, characterized in that, for determining the instantaneous state "State_Pi" of the voice signal (S) at the instant of analysis (ti),

- the voicing coefficient (Vi) is compared with a threshold (B),

- if the voicing coefficient (Vi) is lower than the threshold (B), then the instantaneous state "State_Pi" is "unvoiced",

- otherwise it is deduced that the instantaneous state "State_Pi" is "voiced".

9. Method according to claim 7, characterized in that the instantaneous state "State_Pi" can furthermore take the state "silence" corresponding to the absence of a sound of sufficient power,

- if the last N instantaneous states "State_Pj" over the reference period (TRi) are of type "silence", then it is deduced that the signal does not contain voice at the instant (ti), otherwise

- only the instantaneous states of type "voiced" or "unvoiced" are retained over the reference period (TRi), excluding the instantaneous states "State_Pj" of type "silence".

10. Method according to claim 9, characterized in that, for determining the instantaneous state "State_Pi" of the voice signal (S),

- the instantaneous energy (Ei) of the voice signal (S) is compared with a first threshold (A),

- if the energy (Ei) of the signal is lower than the first threshold (A), then it is deduced that the instantaneous state "State_Pi" is "silence",

- otherwise, the voicing coefficient (Vi) is compared with a second threshold (B),

- if the voicing coefficient (Vi) is lower than the second threshold (B), then the instantaneous state "State_Pi" is "unvoiced",

- otherwise it is deduced that the instantaneous state "State_Pi" is "voiced".

11. Method according to one of claims 7 to 10, characterized in that the analysis period (TA) is 20 ms and the duration (TR) of the reference period is 1 s.

12. Method according to one of claims 7 to 11, characterized in that the voice signal (S) is sampled at 16 kHz.

13. Use of the method according to one of claims 1 to 12 in a Karaoke game type application.

14. Use according to claim 13, characterized in that the implementation of the method according to one of claims 1 to 12 is inhibited for voiced passages (42.1) of song having a duration (TD) greater than the duration (TR) of the reference period, or on arbitrarily chosen passages of songs.