EP0953969A1

EP0953969A1 - Method for rendering speech with silence regulation

Info

Publication number: EP0953969A1
Application number: EP99400873A
Authority: EP
Inventors: Philippe Charbonnier
Original assignee: Sagem SA
Current assignee: Sagem SA
Priority date: 1998-04-27
Filing date: 1999-04-09
Publication date: 1999-11-03
Anticipated expiration: 2019-04-09
Also published as: DE69911685T2; FR2778011B1; EP0953969B1; DE69911685D1; FR2778011A1

Abstract

Silences between speech signals are detected, and their playback duration is regulated so that the speech signals themselves are rendered in one piece.

Description

In the switched telephone network, speech is transmitted as a continuous stream of code words representing the instantaneous amplitude of the voice signal, which is cyclically sampled and digitized for this purpose. A call permanently holds a transmission channel, or circuit, to carry this continuous stream.

By contrast, on a computer network such as the Internet, or on certain packet-mode radio networks, the code words representing the voice are transmitted in packets, over a channel offering enough throughput for it to be time-shared among several terminals.

Packet transmission across a network nevertheless raises problems.

Indeed, the network transit time is inherently variable, because each packet may follow a different path, and packets cannot be played as soon as they are received: a packet may arrive before playback of the previous one has finished or, on the contrary, after it. In the first case, playing the packet immediately would overwrite the previous one; in the second case, there would be a dead time.

A buffer memory is therefore interposed, in which packets are stored at the random rate of their arrivals and from which they are read, for playback, at the fixed rate of their transmitter, with a delay relative to their transmission times that is fixed and large enough, compared with the fluctuations in transmission time, for the packets to have been received.

Packets may even be received in an order different from that of their transmission, so they must be numbered at transmission in order to restore the intended order at playback.
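The conventional scheme described above (fixed playout delay plus reordering by sequence number) can be sketched as follows. This is only an illustration of the prior-art buffer, not of the invention; the `Packet` fields, the function name and the millisecond units are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int         # sequence number assigned by the transmitter (hypothetical field)
    sent_ms: int     # transmission time, in milliseconds
    arrived_ms: int  # arrival time at the receiver, in milliseconds


def playout_schedule(packets, fixed_delay_ms):
    """Return (seq, playout_ms, on_time) tuples in transmission order.

    Each packet is scheduled at its transmission time plus a fixed delay;
    a packet is on time only if it arrived no later than its playout time.
    """
    schedule = []
    for p in sorted(packets, key=lambda p: p.seq):  # restore transmission order
        playout_ms = p.sent_ms + fixed_delay_ms
        schedule.append((p.seq, playout_ms, p.arrived_ms <= playout_ms))
    return schedule
```

The larger `fixed_delay_ms` is, the fewer packets miss their playout time, but the longer the end-to-end delay, which is the trade-off the description goes on to criticize.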

All this requires a relatively large buffer memory. However, the resulting playback delay becomes perceptible and hinders the dialogue between the two parties.

The present invention aims to remedy these drawbacks.

To this end, the invention relates to a method for audibly rendering speech signals that are received in successive packets representing successive time slices of speech and are stored temporarily before being rendered audibly, the method being characterized in that the presence of silences is detected in the received slices and their playback duration is regulated so that the speech signals other than the silences are rendered in one piece.

Time is thus modulated at playback so as, in practice, to compensate for the fluctuation in packet arrival times. The parts of an original sound sequence that were separated physically, and therefore in time, by being placed in different packets are thereby reunited in time. Since the time modulation acts on the silences, it has in practice no adverse effect on the intelligibility of the speech.

Advantageously, the playback of any silence that is followed by a complete sound sequence to be rendered is shortened.

The invention will be better understood from the following description of a preferred embodiment of the method of the invention, with reference to the appended drawing, in which:

FIG. 1 is a block diagram illustrating a device for implementing the method of the invention,
FIG. 2, made up of FIGS. 2A, 2B, 2C, 2D and 2E, illustrates the division of a speech signal into packets as a function of time t, and
FIG. 3 is a flow diagram illustrating the method.

The device of FIG. 1 comprises a microprocessor-based controller 1 that controls the operation of a chain for receiving and audibly rendering speech signals. The chain, whose input is connected to a line 2 of a packet communication network, has at its input a circuit 3 for reading the received packets, which identifies the sound segments and the silences and reconstructs their position in time. Depending on the type of coding used to represent the voice, explicit markers may exist that distinguish the segments and indicate their dates; otherwise, no such explicit markers exist and circuit 3 reconstructs them from the sequence numbers of the packets and from their content, decoded into a voice signal whose shape it analyzes. Circuit 3 is followed by a buffer memory 4 for storing the packets, which puts them back in their order of transmission and passes them to a playback circuit 5 driving an earpiece 6.

The operation of the above device will now be explained.

Conventionally, a transmitting terminal in communication with that of FIG. 1 analyzes the signal from its microphone with a vocoder in order to encode it in compressed form.

FIG. 2A shows the amplitude S of the voice signal as a function of time t.

In an explicit variant, a vocoder seeks to delimit segments containing a voice signal that corresponds to a standardized sound held in a library, such as a vowel, a consonant or a near-silence. To transmit information representing the signal S, the signal is then replaced by a series of code words representing the sounds recognized in it. The volume of information data is thus greatly reduced. At reception, consulting a similar library allows the original signal S to be reconstructed.

In an implicit variant, the signal is not analyzed so finely for coding, and it is circuit 3 of the receiver that analyzes the reconstructed signal to locate the silence segments. In all cases, the coded information is transmitted in packets P1, P2, P3, P4, each carrying a longer or shorter slice of the speech.
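For the implicit variant, the description does not say how circuit 3 analyzes the decoded signal. One common approach, used here purely as an assumption, is to compare the mean energy of short frames with a threshold. The function name, `frame_len` and `threshold` are hypothetical.

```python
def segment_by_energy(samples, frame_len, threshold):
    """Split decoded samples into ("sound" | "silence", start, end) segments.

    A frame whose mean squared amplitude is below `threshold` is treated as
    silence; adjacent frames of the same kind are merged into one segment.
    Energy thresholding is an assumed technique, not taken from the patent.
    """
    segments = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        kind = "silence" if energy < threshold else "sound"
        end = start + len(frame)
        if segments and segments[-1][0] == kind:
            # Extend the previous segment instead of opening a new one.
            segments[-1] = (kind, segments[-1][1], end)
        else:
            segments.append((kind, start, end))
    return segments
```

The resulting segment boundaries play the role of the markers that, in the explicit variant, the coder would supply directly.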

In the example of FIG. 2A, the signal S comprises, in packet P1, the end S0 of a silence, a block 11 of energetic speech signal followed by a silence S1, and another block 12. In packet P2, block 12 continues, under reference 13, and is followed by two blocks 14 and 15 with interposed silences S2 and S3. Packet P3 contains the end of block 15, referenced 16, a silence S4, a block 17, a silence S5 and the beginning 18 of a block, followed, in packet P4, by the end 19 of that block and then the beginning S6 of a silence.

Blocks 11, 12-13, 14, 15-16, 17, 18 here represent the six respective sequences: "and" "the invention" "is" "new" "and" "inventive" (FIG. 2B).

At reception, block 12-13 (like 15-16 and 18-19), which is spread over the two packets P1 and P2, risks being split in two in time when it is played back. The aim here is therefore to avoid this.

The instants t0, t1, t2, t3 (FIG. 2A) delimit the slices of the initial signal assigned to the successive packets P1, P2, P3, P4. The references t'0, t'1, t'2, t'3 (FIG. 2C), uniformly shifted relative to the respective instants t0, t1, t2, t3, mark the corresponding theoretical playback dates in earpiece 6. Because of the fluctuation in transmission delay, packets may arrive early or late. In this example, packet P2 arrives after a dead time, or delay R, following the instant t'1 at which playback of packet P1 ends. By contrast, packet P3, although arriving after t'2, arrives ahead of the end of playback of packet P2.
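The timing relations of this example can be written out as follows, assuming each packet is played as soon as it arrives. The symbols D (the uniform shift), a_2 and a_3 (arrival times of P2 and P3) and e_2 (actual end of playback of P2) are notation introduced here, not taken from the figures.

```latex
\begin{aligned}
t'_i &= t_i + D, \qquad i = 0, 1, 2, 3 \\
a_2 &= t'_1 + R, \qquad R > 0 \\
e_2 &= a_2 + (t_2 - t_1) = t'_2 + R \\
t'_2 &< a_3 < e_2 \quad\Rightarrow\quad \text{overlap} = e_2 - a_3 > 0
\end{aligned}
```

Under that assumption, the late arrival of P2 creates a gap of duration R after P1, and the same lateness pushes the end of P2 past the arrival of P3, which produces the overlap.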

FIG. 2D illustrates what immediate playback would produce. The sentence:

"and...the invention...is...new...and...inventive",

in which "..." represents a natural silence such as S1, becomes:

"and...the inven□ □ □ tion...is...ne/w...and...inventive"

where "□ □ □" represents a parasitic silence (of arbitrary duration) that intrudes and cuts the word "invention",

and "/" represents an additive superposition, or overwriting, between the beginning and the end (which arrives too early) of the word "new", the word "inventive" being distorted in the same way.

FIG. 2E illustrates the method of the invention. In this figure,

"." represents an original silence that has been shortened, and

"....." represents an original silence that has been lengthened.

Note first that the parasitic silence □ □ □ of FIG. 2D has disappeared, as has the overlap "/".

When packet P1 arrives, the whole sound block 11 ("and") is played back immediately (it is assumed that the device was in the course of playing the beginning of silence S0, transmitted in the preceding packet). By contrast, sequence 12 ("the inven") of block 12-13 is not played back at this point, because it is only part of a sequence. Silence S1 is lengthened (.....) until packet P2 is received, whereupon the complete sequences 12-13 and 14, with silence S2, are played back. When packet P3 arrives, relatively early with respect to packet P2, silence S3 is shortened (.) so as to play back, without appreciable delay, the sequence "new.and.". Silences S2 and S4 are here shortened, or abbreviated (.), in order to empty buffer memory 4 as quickly as possible and so better tolerate early arrivals of subsequent packets. This is of interest especially when the playback duration of the sound sequence(s) triggered by the arrival of a packet exceeds the theoretical period. Indeed, an arriving packet may complete an earlier sound block representing an uninterrupted stretch of signal spanning several time slices, which is then played back.
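The rule of holding back a sound block until its end has arrived can be sketched as below. Each packet is modelled as a list of `(kind, duration_ms)` segments; this representation, the function names and the durations used with them are illustrative assumptions. As a simplification, a sound block that reaches the end of the received data is always treated as possibly incomplete.

```python
def merge_segments(packets):
    """Concatenate per-packet segment lists, joining pieces of the same kind
    that meet at a packet boundary (such as 12 and 13 in FIG. 2A)."""
    merged = []
    for segments in packets:
        for kind, duration_ms in segments:
            if merged and merged[-1][0] == kind:
                merged[-1] = (kind, merged[-1][1] + duration_ms)
            else:
                merged.append((kind, duration_ms))
    return merged


def split_playable(merged):
    """Return (ready, held): a trailing sound block is held back because its
    end may still be in a packet that has not arrived yet."""
    if merged and merged[-1][0] == "sound":
        return merged[:-1], merged[-1:]
    return merged, []
```

With only P1 received, the sketch releases S0, block 11 and S1 and holds block 12; once P2 is merged in, block 12-13 becomes a single releasable segment and block 15 is held in turn. While a block is held, the silence that precedes it is what gets lengthened.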

FIG. 3 illustrates how the playback of received packets is managed for this purpose.

The starting point here is a state 21 in which circuit 5 plays back a sound sequence such as 12-13 or 14. When the end of this sequence arrives, that is, the beginning of a silence detected by circuit 3, a step 22 tests whether buffer memory 4 holds the continuous description of the signal up to at least the beginning of the next sound segment or block. If so, a step 23 tests whether the delay exceeds a high threshold A. The delay here means the total duration of signal awaiting playback in the buffer; the buffer fill level can serve as a convenient approximation. The threshold A may be chosen to be zero, in which case step 22 leads directly to a step 24, described below, which returns to state 21.

If the test of step 23 is positive, step 24 shortens the current silence, or even removes it, and the above sound sequence is played back by moving to state 21. If the outcome is negative at step 22 or 23, the device moves to a state 26 of silence reproduction.
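The decision taken at the end of a sound sequence (steps 22 to 24) can be sketched as a small function. The state numbers follow FIG. 3; the function name, the argument names and the action labels are assumptions made for the example.

```python
STATE_SOUND = 21    # state 21: playing a sound sequence
STATE_SILENCE = 26  # state 26: reproducing silence


def after_sound_sequence(next_block_buffered, backlog_ms, threshold_a_ms=0):
    """Choose the next state when a sound sequence ends and a silence begins.

    next_block_buffered: step 22, the buffer holds the signal up to at least
        the start of the next sound block.
    backlog_ms: the "delay", i.e. total duration of signal waiting in the buffer.
    threshold_a_ms: high threshold A of step 23 (zero means step 22 alone decides
        whenever some signal is buffered).
    """
    if next_block_buffered and backlog_ms > threshold_a_ms:  # steps 22 and 23
        return STATE_SOUND, "shorten_silence"                # step 24
    return STATE_SILENCE, "play_silence"
```

For instance, `after_sound_sequence(True, 120, 100)` selects state 21 with a shortened silence, whereas a backlog of 80 ms with the same threshold leads to state 26.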

State 26 can also be reached from state 21, by inserting an additional silence, if a step 25 detects an underflow of memory 4 (no information available in memory for the next segment to be reproduced).

In state 26, it is also detected whether memory 4 lacks the next segment to be reproduced, in which case a step 27 decides to extend the silence beyond its normal duration, that is, an additional silence is inserted.

Conversely, at a step 28, if a contiguous sound sequence such as 12-13 is present and therefore available (packets P1 and P2 received) and represents a duration exceeding a threshold B, the current silence of state 26 (S1) is shortened. In this example, the same is done if a block (12) of a sound sequence has a playback delay exceeding threshold A, even if the end (13) of the block has not been received. If, on the other hand, threshold B is not reached, an additional silence is inserted.
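One possible reading of the decisions made in state 26 (steps 27 and 28) is sketched below. The function and argument names are hypothetical, and treating the "playback delay" of a block as the buffered backlog is an interpretation rather than something the description states outright.

```python
def silence_action(next_block_ms, next_block_complete, backlog_ms,
                   threshold_a_ms, threshold_b_ms):
    """Decide whether the silence being reproduced is shortened or extended.

    next_block_ms: buffered duration of the next sound block (0 if none).
    next_block_complete: True if the end of that block has been received.
    backlog_ms: total duration of signal waiting in the buffer.
    """
    if next_block_ms == 0:
        return "extend"   # step 27: nothing to play next, insert extra silence
    if next_block_complete and next_block_ms > threshold_b_ms:
        return "shorten"  # step 28: complete block longer than threshold B
    if backlog_ms > threshold_a_ms:
        return "shorten"  # step 28: delay above threshold A, even if incomplete
    return "extend"       # threshold B not reached: insert extra silence
```

The second `shorten` branch is what lets an incomplete block such as 12 start playing when waiting any longer would let the delay grow beyond threshold A.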

State 26 is left when all of the silence that was to be reproduced has been reproduced and the next sound sequence is due to be reproduced. That sequence is nevertheless delayed by passing, transiently or for longer, through a state 29 in which a background or ambient noise with a vocal sound or tone, of the "uhh" kind, is emitted, signalling that the speaker is about to speak again; replacing a prolonged or additional silence with this background noise avoids the speaker being interrupted. The two conditions of step 28 (thresholds B and A) are also checked in state 29 and, in the affirmative, the device moves to playback state 21.

In that state, a step 20 tests whether the playback delay exceeds a threshold C, for example 1.5 times the value of threshold A, which indicates an overflow of memory 4. In that case, the oldest silences, which are due to be played back first, are reduced and possibly almost eliminated. Provision may also be made to discard the oldest sound sequences, or simply time slices of them, which amounts to modulating, here accelerating, the playback speed of the sound signal.
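The overflow handling of step 20 can be sketched as trimming the oldest buffered silences until the backlog falls back to threshold C. The segment representation, the function name and the residual `min_silence_ms` (standing in for "almost eliminated") are assumptions; only the factor 1.5 comes from the description, and only as an example.

```python
def trim_oldest_silences(segments, threshold_a_ms, factor=1.5, min_silence_ms=10):
    """Shorten the oldest silences while the backlog exceeds C = factor * A.

    segments: buffered (kind, duration_ms) tuples, oldest first.
    Sound segments are left untouched in this sketch.
    """
    threshold_c_ms = factor * threshold_a_ms
    backlog_ms = sum(duration for _, duration in segments)
    trimmed = []
    for kind, duration in segments:
        excess_ms = backlog_ms - threshold_c_ms
        if kind == "silence" and excess_ms > 0:
            # Cut no more than needed, and keep a short residual silence.
            cut = max(0, min(duration - min_silence_ms, excess_ms))
            duration -= cut
            backlog_ms -= cut
        trimmed.append((kind, duration))
    return trimmed
```

If the silences alone cannot absorb the excess, the backlog stays above C; the description then allows old sound sequences or slices of them to be dropped, which this sketch does not attempt.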

Claims

Method for audibly rendering speech signals (11, S1, 12, 13, S2) received in successive packets (P1, P2, P3) representing successive time slices of speech and stored temporarily (4) before being rendered audibly (5, 6), the method being characterized in that the presence of silences (S1, S2) is detected in the received slices and their playback duration is regulated so that the speech signals (11, 12) other than the silences are rendered in one piece.

Method according to claim 1, in which the playback of any silence (S1) for which at least the beginning of the following sound sequence is available is shortened (24, 28).

Method according to claim 1, in which the playback of a silence (S1) is shortened (23, 28) if the delay in playing back the speech signals exceeds a threshold (A).

Method according to one of claims 1 to 3, in which an additional silence is inserted (25, 27) when no following sequence (11) is available in memory.

Method according to one of claims 1 to 4, in which, when a silence (S1) is not followed in memory by a sound sequence of duration exceeding a threshold (B), an additional silence is inserted.

Method according to one of claims 4 and 5, in which the additional silence is replaced by a background noise (29).

Method according to claim 6, in which the background noise has a vocal tone.

Method according to one of claims 1 to 7, in which the oldest stored signals are removed (20) from the playback when their playback delay exceeds a threshold (C).