US20070192100A1 - Method and system for the quick conversion of a voice signal - Google Patents
- Publication number
- US20070192100A1 (application US 10/591,599)
- Authority
- US
- United States
- Prior art keywords
- model
- acoustic features
- speaker
- source
- converted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
Definitions
- the present invention relates to a method for converting a voice signal delivered by a source speaker into a converted voice signal having acoustic features resembling those of a target speaker, and a system implementing such a method.
- in the context of voice conversion applications, such as voice services, man-machine oral dialog applications or the voice synthesis of texts, the auditory reproduction is essential and, to achieve acceptable quality, it is necessary to have firm control over the parameters related to the prosody of the voice signals.
- the main acoustic or prosodic parameters modified during voice conversion methods are the parameters relating to the spectral envelope and/or, for voiced sounds involving the vibration of the vocal cords, the parameters relating to the periodic structure, i.e. the fundamental period, the inverse of which is called the fundamental frequency or pitch.
- Conventional voice conversion methods comprise in general the determination of at least one function for transforming acoustic features of the source speaker into acoustic features similar to those of the target speaker, and the transformation of a voice signal to be converted by the application of this or these functions.
- This transformation is an operation that is long and costly in terms of computation time.
- transformation functions are conventionally considered as linear combinations of a large finite number of transformation elements applied to elements representing the voice signal to be converted.
- the object of the invention is to solve these problems by defining a method and a system, that are fast and of good quality, for converting a voice signal.
- a subject of the present invention is a method for converting a voice signal delivered by a source speaker into a converted voice signal having acoustic features resembling those of a target speaker, comprising:
- the transformation comprises a step for applying only a determined part of at least one transformation function to the signal to be converted.
- the method of the invention thus provides for reducing the computation time necessary for the implementation, by virtue of the application only of a determined part of at least one transformation function.
- the step for applying only a determined part of at least one transformation function comprising the application to the frames to be converted of the sole part of the at least one transformation function corresponding to the selected components of the model;
- Another subject of the invention is a system for converting a voice signal delivered by a source speaker into a converted voice signal having acoustic features resembling those of a target speaker, comprising:
- transformation means are adapted for the application only of a determined part of at least one transformation function to the signal to be converted.
- the application means being adapted to apply only a determined part of the at least one transformation function corresponding to the selected components of the model.
- FIGS. 1A and 1B represent a general flow chart of the method of the invention.
- FIG. 2 represents a block diagram of a system implementing the method of the invention.
- Voice conversion consists in modifying the voice signal of a reference speaker called the source speaker such that the signal produced appears to have been delivered by another speaker, called the target speaker.
- Such a method includes first the determination of functions for transforming acoustic or prosodic features of voice signals from the source speaker into acoustic features similar to those of voice signals from the target speaker, using voice samples delivered by the source speaker and the target speaker.
- the determination 1 of transformation functions is carried out on databases of voice samples corresponding to the acoustic realization of the same phonetic sequences delivered respectively by the source and target speakers.
- This determination process is denoted in FIG. 1A by the general numerical reference 1 and is also commonly referred to as “training”.
- the method then includes a transformation of the acoustic features of a voice signal to be converted delivered by the source speaker, using the function or functions determined previously. This transformation is denoted by the general numerical reference 2 in FIG. 1B .
- various acoustic features are transformed such as spectral envelope and/or fundamental frequency features.
- the method begins with steps 4 X and 4 Y for analyzing voice samples delivered respectively by the source and target speakers. These steps are for grouping the samples together by frames, in order to obtain, for each frame of samples, information relating to the spectral envelope and/or information relating to the fundamental frequency.
- the analysis steps 4 X and 4 Y are based on the use of a sound signal model in the form of a sum of a harmonic signal with a noise signal according to a model commonly referred to as HNM (Harmonic plus Noise Model).
- the HNM model comprises the modeling of each voice signal frame as a harmonic part representing the periodic component of the signal, made up of a sum of L harmonic sinusoids of amplitude A_l and phase φ_l, and as a noise part representing the friction noise and the variation in glottal excitation.
- h(n) therefore represents the harmonic approximation of the signal s(n).
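The decomposition just described can be written explicitly (the symbols below are assumed, since the patent's equations are not reproduced in this extract):

```latex
s(n) \;=\; h(n) + b(n),
\qquad
h(n) \;=\; \sum_{l=1}^{L} A_l \cos\!\left(l\,\omega_0\, n + \phi_l\right),
```

where ω₀ is the fundamental pulsation of the current frame and b(n) is the noise part of the signal.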
- the embodiment described is based on a representation of the spectral envelope by the discrete cepstrum.
- Steps 4 X and 4 Y include sub-steps 8 X and 8 Y for estimating, for each frame, the fundamental frequency, for example by means of an autocorrelation method.
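Sub-steps 8 X and 8 Y can be illustrated by a minimal autocorrelation pitch estimator. This is a generic sketch, not the patent's exact procedure; the sampling rate, frame length and search band below are invented for the example:

```python
import numpy as np

def estimate_f0(frame, fs, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency of a frame by autocorrelation.

    Searches for the autocorrelation peak between the lags corresponding
    to f_max and f_min (lag = fs / f0).
    """
    frame = frame - frame.mean()
    # Keep the non-negative lags of the full autocorrelation.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f_max)                    # shortest admissible period
    lag_max = min(int(fs / f_min), len(corr) - 1)
    lag = lag_min + np.argmax(corr[lag_min:lag_max + 1])
    return fs / lag

# A 200 Hz sine sampled at 16 kHz should yield an estimate near 200 Hz.
fs = 16000
t = np.arange(1024) / fs
f0 = estimate_f0(np.sin(2 * np.pi * 200 * t), fs)
```

In a real analysis chain this estimate would then drive the pitch-synchronous windowing of the following sub-steps.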
- Sub-steps 8 X and 8 Y are each followed by a sub-step 10 X and 10 Y for the synchronized analysis of each frame on its fundamental frequency, enabling the parameters of the harmonic part as well as the parameters of the noise of the signal and in particular the maximum voicing frequency to be estimated.
- this frequency can be fixed arbitrarily or be estimated by other known means.
- this synchronized analysis corresponds to the determination of the parameters of the harmonics by minimization of a weighted least-squares criterion between the complete signal, reduced, in the embodiment described, by the estimated noise signal, and its harmonic decomposition.
- w(n) is the analysis window and T_i is the fundamental period of the current frame.
- the analysis window is centered around the mark of the fundamental period and has a duration of twice this period.
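The pitch-synchronous estimation described above amounts to minimizing, for the current frame i, a criterion of the form (a standard weighted least-squares formulation, assumed here):

```latex
E_i \;=\; \sum_{n} w^2(n)\,\bigl(s(n) - h(n)\bigr)^2,
```

with the window w(n) centered on the mark of the fundamental period and of duration 2 T_i, as stated in the text.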
- these analyses are performed asynchronously with a fixed analysis step and a window of fixed size.
- the analysis steps 4 X and 4 Y lastly include sub-steps 12 X and 12 Y for estimating parameters of the spectral envelope of signals using, for example, a regularized discrete cepstrum method and a Bark scale transformation to reproduce as faithfully as possible the properties of the human ear.
- the analysis steps 4 X and 4 Y deliver respectively, for the voice samples delivered by the source and target speakers and for each frame numbered n of speech signal samples, a scalar denoted by F_n representing the fundamental frequency and a vector denoted by c_n comprising spectral envelope information in the form of a sequence of cepstral coefficients.
- The use of cepstral coefficients corresponds to an operational technique that is known in the prior art and, for this reason, will not be described in further detail.
- the method of the invention therefore provides for defining, for each frame n of the source speaker, a vector denoted by x_n made up of the cepstral coefficients c_x(n) and the fundamental frequency.
- the method provides for defining, for each frame n of the target speaker, a vector y_n made up of the cepstral coefficients c_y(n) and the fundamental frequency.
- Steps 4 X and 4 Y are followed by a step 18 for alignment between the source vector x_n and the target vector y_n, so as to form a match between these vectors; this match is obtained by a conventional dynamic time alignment algorithm called DTW (Dynamic Time Warping).
- the alignment step 18 is followed by a step 20 for determining a model representing in a weighted manner the common acoustic features of the source speaker and of the target speaker on a finite set of model components.
- the model is a probabilistic model of the acoustic features of the target speaker and of the source speaker, in the form of a Gaussian mixture model (GMM) whose components are Gaussian densities.
- the parameters of the components are estimated from source and target vectors containing, for each speaker, the discrete cepstrum.
- Q denotes the number of components in the model
- N(z; μ_i, Σ_i) is the probability density of the normal distribution with mean μ_i and covariance matrix Σ_i, and the coefficients α_i are the coefficients of the mixture.
- the coefficient α_i corresponds to the a priori probability that the random variable z is generated by the i-th Gaussian component of the mixture.
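With this notation, the mixture density determined at step 20 can be written (a standard GMM formulation, consistent with the quantities named in the text):

```latex
p(z) \;=\; \sum_{i=1}^{Q} \alpha_i\, \mathcal{N}\!\left(z;\, \mu_i,\, \Sigma_i\right),
\qquad \sum_{i=1}^{Q} \alpha_i = 1,
```

where z = [xᵀ, yᵀ]ᵀ is the joint source-target vector formed from the aligned cepstral vectors.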
- Step 20 then includes a sub-step 24 for estimating the GMM parameters (α, μ, Σ) of the density p(z).
- This estimation can be achieved, for example, using a conventional algorithm of the EM (Expectation-Maximization) type, corresponding to an iterative method leading to the obtaining of an estimator of maximum likelihood between the data of the speech samples and the Gaussian mixture model.
- the initial parameters of the GMM model are determined using a conventional vector quantization technique.
- the model determination step 20 thus delivers the parameters of a mixture of Gaussian densities, which parameters are representative of common acoustic features of the source speaker and target speaker voice samples.
- the model thus defined therefore forms a weighted representation of common spectral envelope acoustic features of the target speaker and source speaker voice samples on the finite set of components of the model.
- the method then includes a step 30 for determining, from the model and voice samples, a function for transforming the spectral envelope of the signal of the source speaker to the target speaker.
- This transformation function is determined from an estimator for the realization of the acoustic features of the target speaker given the acoustic features of the source speaker, formed, in the embodiment described, by the conditional expectation.
- step 30 includes a sub-step 32 for determining the conditional expectation of the acoustic features of the target speaker given the acoustic feature information of the source speaker.
- the conditional expectation is denoted by F(x) and is determined using the following formulae:
- F(x) = E[y | x] = Σ_{i=1}^{Q} h_i(x) [ μ_i^y + Σ_i^{yx} (Σ_i^{xx})⁻¹ (x − μ_i^x) ]
- h_i(x) = α_i N(x; μ_i^x, Σ_i^{xx}) / Σ_{j=1}^{Q} α_j N(x; μ_j^x, Σ_j^{xx})
- where μ_i^x and μ_i^y denote the source and target parts of the mean μ_i of the i-th component, and Σ_i^{xx} and Σ_i^{yx} the corresponding blocks of its covariance matrix Σ_i
- h_i(x) corresponds to the a posteriori probability that the source vector x is generated by the i-th component of the Gaussian mixture model
- the term in square brackets corresponds to a transformation element determined from the model. It is recalled that y denotes the target vector.
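As a purely numerical illustration of the conversion function F(x), the sketch below evaluates the conditional expectation for a toy two-component joint GMM over scalar (x, y) pairs. Every parameter value is invented for the example; in the scalar case the matrix product Σ^{yx}(Σ^{xx})⁻¹ reduces to the ratio s_yx / s_xx:

```python
import math

# Toy joint GMM over (x, y): weight, means, and (co)variance terms per
# component. All numbers are illustrative, not taken from the patent.
components = [
    dict(alpha=0.5, mu_x=0.0, mu_y=1.0, s_xx=1.0, s_yx=0.8),
    dict(alpha=0.5, mu_x=4.0, mu_y=6.0, s_xx=1.0, s_yx=0.5),
]

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posteriors(x):
    """h_i(x): a posteriori probability of each component given x."""
    num = [c["alpha"] * normal_pdf(x, c["mu_x"], c["s_xx"]) for c in components]
    tot = sum(num)
    return [v / tot for v in num]

def convert(x):
    """F(x) = sum_i h_i(x) * (mu_y_i + s_yx_i / s_xx_i * (x - mu_x_i))."""
    return sum(h * (c["mu_y"] + c["s_yx"] / c["s_xx"] * (x - c["mu_x"]))
               for h, c in zip(posteriors(x), components))
```

Near x = 0 the first component dominates, so F(x) closely follows that component's local linear transformation; halfway between the two means, both components contribute equally.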
- Step 30 also includes a sub-step 34 for determining a function for transforming the fundamental frequency, by a scaling of the fundamental frequency of the source speaker, onto the fundamental frequency of the target speaker.
- This step 34 is achieved conventionally at any instant in the method after sub-steps 8 X and 8 Y for estimating the fundamental frequency.
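The patent does not spell out the scaling formula of step 34. A common realization of such a pitch transformation, sketched here under that assumption, matches the mean and spread of log-F0 between the two speakers:

```python
import math

def make_f0_transform(src_f0s, tgt_f0s):
    """Build a log-domain scaling from source to target pitch statistics.

    A common choice for 'transformation by scaling' (assumed here, not
    quoted from the patent): match the log-F0 mean and standard deviation
    of the target speaker.
    """
    def stats(vals):
        logs = [math.log(v) for v in vals]
        mean = sum(logs) / len(logs)
        var = sum((l - mean) ** 2 for l in logs) / len(logs)
        return mean, math.sqrt(var)

    mu_s, sd_s = stats(src_f0s)
    mu_t, sd_t = stats(tgt_f0s)
    ratio = sd_t / sd_s if sd_s > 0 else 1.0

    def transform(f0):
        return math.exp(mu_t + ratio * (math.log(f0) - mu_s))
    return transform

# Target pitch values here are exactly double the source values,
# so the transform should double any input F0.
transform = make_f0_transform([100.0, 120.0, 140.0], [200.0, 240.0, 280.0])
```

Working in the log domain keeps the transformed contour positive and treats pitch ratios, which is how pitch intervals are perceived, rather than absolute differences.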
- the conversion method then includes the transformation 2 of a voice signal to be converted delivered by the source speaker, which signal to be converted can be different from the voice signals used previously.
- This transformation 2 begins with an analysis step 36 performed, in the embodiment described, using a decomposition according to the HNM model similar to those performed in steps 4 X and 4 Y described previously.
- This step 36 is for delivering spectral envelope information in the form of cepstral coefficients, fundamental frequency information as well as maximum voicing frequency and phase information.
- This analysis step 36 is followed by a step 38 for determining an index of correspondence between the vector to be converted and each component of the model.
- each of these indices corresponds to the a posteriori probability of the realization of the vector to be converted by each of the different components of the model, i.e. to the term h_i(x).
- the method then includes a step 40 for selecting a restricted number of components of the model according to the correspondence indices determined in the previous step, which restricted set is denoted by S(x).
- This selection step 40 is implemented by an iterative procedure enabling a minimal set of components to be retained, components being selected as long as the cumulative sum of their correspondence indices remains below a predetermined threshold.
- this selection step comprises the selection of a fixed number of components, the correspondence indices of which are the highest.
- the selection step 40 is followed by a step 42 for normalizing the correspondence indices of the selected components of the model. This normalization is achieved by the ratio of each selected index to the sum of all the selected indices.
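Steps 40 and 42 can be sketched together as follows. The threshold value and the index values are invented for the example; the procedure retains components in decreasing order of correspondence index until their cumulative sum reaches the threshold, then renormalizes the retained indices:

```python
def select_components(indices, threshold=0.95):
    """Select a minimal set of model components (step 40) and normalize
    their correspondence indices (step 42).

    `indices` maps component id -> correspondence index (a posteriori
    probability). Components are kept, highest index first, until the
    cumulative sum reaches `threshold`; kept indices are then divided
    by their sum so that they add up to one.
    """
    ranked = sorted(indices.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for comp, h in ranked:
        kept.append((comp, h))
        total += h
        if total >= threshold:
            break
    return {comp: h / total for comp, h in kept}

# With a 0.85 threshold, only the two strongest components survive.
selected = select_components({0: 0.6, 1: 0.3, 2: 0.08, 3: 0.02},
                             threshold=0.85)
```

Only the transformation elements of the surviving components then need to be applied to the frame, which is the source of the computation saving described below.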
- the method then includes a step 43 for storing selected model components and associated normalized correspondence indices.
- Such a storage step 43 is particularly useful if the analysis is performed at a deferred time with respect to the rest of the transformation 2 , which means that a later conversion can be prepared efficiently.
- the method then includes a step 44 for partially applying the spectral envelope transformation function by applying the sole transformation elements corresponding to the model components selected. These sole transformation elements selected are applied to the frames of the signal to be converted, in order to reduce the time required to implement this transformation.
- step 44 for partially applying the transformation function is thus limited to N(P²+1) multiplications, to which are added the Q(P²+1) multiplications needed to determine the correspondence indices, as opposed to 2Q(P²+1) for a full application. Consequently, the overall reduction in complexity obtained is of the order of 2Q/(Q+N).
- considering the transformation function application step 44 alone, N(P²+1) operations are performed rather than 2Q(P²+1) in the prior art, so that, for this step, the reduction in computation time is of the order of 2Q/N.
- the method then includes a step 46 for transforming fundamental frequency features of the voice signal to be converted, using the function for transformation by scaling as determined at step 34 and realized according to conventional techniques.
- the conversion method then includes a step 48 for synthesizing the output signal produced, in the example described, by an HNM type synthesis which directly delivers the converted voice signal using spectral envelope information transformed at step 44 and fundamental frequency information delivered by step 46 .
- This step 48 also uses maximum voicing frequency and phase information delivered by step 36 .
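The harmonic part of an HNM-style synthesis at step 48 can be sketched as a sum of sinusoids at multiples of the fundamental frequency. This is a toy sketch under assumed variable names, omitting the noise part and frame overlap-add of a real synthesizer:

```python
import numpy as np

def synthesize_harmonic(f0, amplitudes, phases, fs, n_samples):
    """Harmonic part of an HNM-style synthesis: a sum of L sinusoids at
    multiples of the fundamental frequency f0 (Hz), sampled at fs (Hz)."""
    n = np.arange(n_samples)
    signal = np.zeros(n_samples)
    for l, (a, phi) in enumerate(zip(amplitudes, phases), start=1):
        signal += a * np.cos(2 * np.pi * l * f0 * n / fs + phi)
    return signal

# One 10 ms frame at 200 Hz with three harmonics (invented parameters).
frame = synthesize_harmonic(200.0, [1.0, 0.5, 0.25], [0.0, 0.0, 0.0],
                            fs=16000, n_samples=160)
```

In the method described, the amplitudes would come from the transformed spectral envelope of step 44, f0 from step 46, and the phases and maximum voicing frequency from the analysis of step 36.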
- the conversion method of the invention thus provides for achieving a high-quality conversion with low complexity and therefore a significant gain in computation time.
- FIG. 2 shows a block diagram of a voice conversion system implementing the method described with reference to FIGS. 1A and 1B .
- This system uses as input a database 50 of voice samples delivered by the source speaker and a database 52 containing at least the same voice samples delivered by the target speaker.
- a module 54 for determining functions for transforming acoustic features of the source speaker into acoustic features of the target speaker.
- This module 54 is adapted to implement step 1 as described with reference to FIG. 1 and therefore provides for the determination of at least one function for transforming acoustic features and in particular the function for transforming spectral envelope features and the function for transforming the fundamental frequency.
- the module 54 is adapted to determine the spectral envelope transformation function from a model representing in a weighted manner common acoustic features of voice samples from the target speaker and from the source speaker, on a finite set of model components.
- the voice conversion system receives as input a voice signal 60 corresponding to a speech signal delivered by the source speaker and intended to be converted.
- the signal 60 is introduced in an analysis module 62 implementing, for example, an HNM type decomposition enabling spectral envelope information of the signal 60 to be extracted in the form of cepstral coefficients and fundamental frequency information.
- the module 62 also delivers maximum voicing frequency and phase information obtained through the application of the HNM model.
- the module 62 therefore implements step 36 of the method as described previously.
- the module 62 is implemented beforehand and the information is stored in order to be used later.
- the system then includes a module 64 for determining indices of correspondence between the voice signal to be converted 60 and each component of the model. To this end, the module 64 receives the parameters of the model determined by the module 54 .
- the module 64 therefore implements step 38 of the method as described previously.
- the system then comprises a module 65 for selecting components of the model, implementing step 40 of the method described previously and enabling the selection of components whose correspondence index reflects a strong match with the voice signal to be converted.
- this module 65 also performs the normalization of the correspondence indices of the selected components with respect to their sum by implementing step 42 .
- the system then includes a module 66 for partially applying the spectral envelope transformation function determined by the module 54 , by applying sole transformation elements selected by the module 65 according to the correspondence indices.
- this module 66 is adapted to implement step 44 for the partial application of the transformation function, so as to deliver as output source speaker acoustic information transformed by the sole selected elements of the transformation function, i.e. by the components of the model exhibiting a high correspondence index with the frames of the signal to be converted 60 .
- This module therefore provides for a fast transformation of the voice signal to be converted by virtue of the partial application of the transformation function.
- the quality of the transformation is preserved by the selection of components of the model exhibiting a high index of correspondence with the signal to be converted.
- the module 66 is also adapted to perform a transformation of the fundamental frequency features, which is carried out conventionally by the application of the function for transformation by scaling realized according to step 46 .
- the system then includes a synthesis module 68 receiving as input the spectral envelope and fundamental frequency information transformed and delivered by the module 66 as well as maximum voicing frequency and phase information delivered by the analysis module 62 .
- the module 68 thus implements step 48 of the method described with reference to FIGS. 1A and 1B and delivers a signal 70 corresponding to the voice signal 60 of the source speaker but for which the spectral envelope and fundamental frequency features have been modified in order to be similar to those of the target speaker.
- the system described can be implemented in various ways and in particular with the aid of computer programs adapted and connected to hardware sound acquisition means.
- This system can also be implemented on determined databases in order to form databases of converted signals ready to be used.
- this system can be implemented in a first operating phase in order to deliver, for a database of signals, information relating to the selected components of the model and to their respective correspondence indices, this information then being stored.
- the modules 66 and 68 of the system are implemented later upon demand to generate a voice synthesis signal using the voice signals to be converted and the information relating to the selected components and to their correspondence indices in order to obtain a maximum reduction in computation time.
- the method of the invention and the corresponding system can also be implemented in real time.
- the method of the invention and the corresponding system are adapted for the determination of several transformation functions. For example, a first and a second function are determined for the transformation respectively of spectral envelope parameters and of fundamental frequency parameters for frames of a voiced nature and a third function is determined for the transformation of frames of an unvoiced nature.
- the voice conversion is achieved by the transformation of spectral envelope features and of fundamental frequency features separately, with only the spectral envelope transformation function being applied partially.
- several functions for transforming different acoustic features and/or for simultaneously transforming several acoustic features are determined and at least one of these transformation functions is applied partially.
- the system is adapted to implement all the steps of the method described with reference to FIGS. 1A and 1B .
- the HNM and GMM models can be replaced by other techniques and models known to the person skilled in the art.
- the analysis is performed using techniques known as LPC (Linear Predictive Coding), sinusoidal or MBE (Multi-Band Excited) models
- the spectral parameters are parameters called Line Spectral Frequencies (LSF), or even parameters related to the formants or to a glottal signal.
- the GMM model is replaced by a fuzzy vector quantization (Fuzzy VQ).
- the estimator implemented during step 30 can also be a maximum a posteriori (MAP) criterion, corresponding to calculating the expectation only for the model component that best represents the source-target pair of vectors.
- a transformation function is determined using a technique called least squares instead of estimating the joint density described.
- the determination of a transformation function comprises the modeling of the probability density of the source vectors using a GMM model, then the determination of the parameters of the model using an EM algorithm.
- the modeling thus takes into account speech segments of the source speaker for which the corresponding ones delivered by the target speaker are not available.
- the determination then comprises the minimization of a criterion of least squares between target and source parameters in order to obtain the transformation function. It is to be noted that the estimator of this function is still expressed in the same way but that the parameters are estimated differently and that additional data are taken into account.
Abstract
Description
- To this end, a subject of the present invention is a method for converting a voice signal delivered by a source speaker into a converted voice signal having acoustic features resembling those of a target speaker, comprising:
-
- the determination of at least one function for transforming acoustic features of the source speaker into acoustic features similar to those of the target speaker, using voice samples from the source and target speakers; and
- the transformation of acoustic features of the source speaker voice signal to be converted by applying the at least one transformation function,
- characterized in that the transformation comprises a step for applying only a determined part of at least one transformation function to the signal to be converted.
- The method of the invention thus provides for reducing the computation time necessary for the implementation, by virtue of the application only of a determined part of at least one transformation function.
- According to other features of the invention:
-
- the determination of at least one transformation function comprises a step for determining a model representing in a weighted manner common acoustic features of voice samples from the target speaker and from the source speaker on a finite set of model components, and the transformation comprises:
- a step for analyzing the voice signal to be converted, which voice signal being grouped into frames, in order to obtain, for each frame of samples, information relating to the acoustic features;
- a step for determining an index of correspondence between the frames to be converted and each component of the model; and
- a step for selecting a determined part of the components of the model according to the correspondence indices,
- the step for applying only a determined part of at least one transformation function comprising the application to the frames to be converted of the sole part of the at least one transformation function corresponding to the selected components of the model;
-
- it additionally comprises a step for normalizing each of the correspondence indices of the selected components with respect to the sum of all the correspondence indices of the selected components;
- it additionally comprises a step for storing the correspondence indices and the determined part of the model components, performed before the transformation step, which is delayed in time;
- the determination of the at least one transformation function comprises:
- a step for analyzing voice samples from the source and target speakers, grouped into frames in order to obtain acoustic features for each frame of samples from a speaker;
- a step for the time alignment of the acoustic features of the source speaker with the acoustic features of the target speaker, this step being performed before the step for determining a model;
- the step for determining a model corresponds to the determination of a Gaussian probability density mixture model;
- the step for determining a model comprises:
- a sub-step for determining a model corresponding to a Gaussian probability density mixture, and
- a sub-step for estimating parameters of the Gaussian probability density mixture from the estimation of the maximum likelihood between the acoustic features of the samples from the source and target speakers and the model;
- the determination of at least one transformation function is performed based on an estimator of the realization of the acoustic features of the target speaker given the acoustic features of the source speaker;
- the estimator is formed by the conditional expectation of the realization of the acoustic features of the target speaker given the realization of the acoustic features of the source speaker;
- it additionally includes a synthesis step for forming a converted voice signal from the transformed acoustic information.
- Another subject of the invention is a system for converting a voice signal delivered by a source speaker into a converted voice signal having acoustic features resembling those of a target speaker, comprising:
-
- means for determining at least one function for transforming acoustic features of the source speaker into acoustic features similar to those of the target speaker, using voice samples from the source and target speakers; and
- means for transforming acoustic features of the source speaker voice signal to be converted by applying the at least one transformation function,
- characterized in that the transformation means are adapted for the application only of a determined part of at least one transformation function to the signal to be converted.
- According to other features of the system:
-
- the determination means are adapted for the determination of at least one transformation function using a model representing in a weighted manner common acoustic features of voice samples from the source and target speakers on a finite set of components, and the system includes:
- means for analyzing the signal to be converted, which signal being grouped into frames, in order to obtain, for each frame of samples, information relating to the acoustic features;
- means for determining an index of correspondence between the frames to be converted and each component of the model; and
- means for selecting a determined part of the components of the model according to the correspondence indices,
- the application means being adapted to apply only a determined part of the at least one transformation function corresponding to the selected components of the model.
- The invention will be better understood on reading the following description given purely by way of example and with reference to the appended drawings in which:
-
FIGS. 1A and 1B represent a general flow chart of the method of the invention; and
- FIG. 2 represents a block diagram of a system implementing the method of the invention.
- Voice conversion consists in modifying the voice signal of a reference speaker, called the source speaker, such that the signal produced appears to have been delivered by another speaker, called the target speaker.
- Such a method first includes the determination of functions for transforming acoustic or prosodic features of voice signals from the source speaker into acoustic features similar to those of voice signals from the target speaker, using voice samples delivered by the source speaker and the target speaker.
- More specifically, the determination 1 of transformation functions is carried out on databases of voice samples corresponding to the acoustic realization of the same phonetic sequences delivered respectively by the source and target speakers.
- This determination process is denoted in FIG. 1A by the general numerical reference 1 and is also commonly referred to as "training".
- The method then includes a transformation of the acoustic features of a voice signal to be converted delivered by the source speaker, using the function or functions determined previously. This transformation is denoted by the general numerical reference 2 in FIG. 1B.
- Depending on the embodiment, various acoustic features are transformed, such as spectral envelope and/or fundamental frequency features.
- The method begins with steps 4X and 4Y for analyzing voice samples delivered respectively by the source speaker and the target speaker.
- In the embodiment described, the analysis steps 4X and 4Y rely on a decomposition of the voice signals according to a harmonic plus noise model, known as HNM.
- The HNM model comprises the modeling of each voice signal frame as a harmonic part representing the periodic component of the signal, made up of a sum of L harmonic sinusoids of amplitude Al and phase φl, and as a noise part representing the friction noise and the variation in glottal excitation.
- Hence, one can express:
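The expression referenced above appears as an image in the original publication; a standard HNM decomposition consistent with the surrounding description (harmonic part h(n) plus noise part b(n), with the harmonic amplitudes Al and phases φl) would read:

```latex
s(n) = h(n) + b(n), \qquad h(n) = \sum_{l=1}^{L} A_l \cos\!\bigl(l\,\omega_0 n + \varphi_l\bigr)
```

where ω0 denotes the fundamental angular frequency of the frame; the exact notation of the original equation may differ.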
- The term h(n) therefore represents the harmonic approximation of the signal s(n).
- Furthermore, the embodiment described is based on a representation of the spectral envelope by the discrete cepstrum.
-
- Steps 4X and 4Y include sub-steps 8X and 8Y for estimating the fundamental frequency of each frame of the signals.
- Sub-steps 8X and 8Y are followed by a sub-step for the analysis of each frame, synchronized on the fundamental frequency.
- In the embodiment described, this synchronized analysis corresponds to the determination of the parameters of the harmonics by minimization of a weighted least squares criterion between the complete signal and its harmonic decomposition, the difference corresponding to the estimated noise signal. The criterion denoted by E is equal to:
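The criterion E is shown as an image in the original publication; a formulation consistent with the standard HNM weighted least squares analysis and with the notation of the following sentence (window w(n), fundamental period Ti), with ti denoting the current pitch mark, would be:

```latex
E = \sum_{n = t_i - T_i}^{t_i + T_i} w^2(n)\,\bigl(s(n) - h(n)\bigr)^2
```

The summation bounds reflect the statement that the window is centered on the fundamental period mark and lasts twice the period; they are an assumption, not a quotation of the original equation.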
- In this equation, w(n) is the analysis window and Ti is the fundamental period of the current frame.
- Thus, the analysis window is centered around the mark of the fundamental period and has a duration of twice this period.
- As a variant, these analyses are performed asynchronously with a fixed analysis step and a window of fixed size.
- The analysis steps 4X and 4Y lastly include sub-steps 12X and 12Y for estimating parameters of the spectral envelope of signals using, for example, a regularized discrete cepstrum method and a Bark scale transformation to reproduce as faithfully as possible the properties of the human ear.
- Thus, the analysis steps 4X and 4Y deliver respectively for the voice samples delivered by the source and target speakers, for each frame numbered n of samples of speech signals, a scalar denoted by Fn representing the fundamental frequency and a vector denoted by cn comprising spectral envelope information in the form of a sequence of cepstral coefficients.
- The manner in which the cepstral coefficients are calculated corresponds to an operational technique that is known in the prior art and, for this reason, will not be described further in detail.
- The method of the invention therefore provides for defining, for each frame n of the source speaker, a vector denoted by xn of cepstral coefficients cx(n) and the fundamental frequency.
- Similarly, the method provides for defining, for each frame n of the target speaker, a vector yn of cepstral coefficients cy(n), and the fundamental frequency.
-
- The analysis steps 4X and 4Y are followed by a step 18 for alignment between the source vectors xn and the target vectors yn, so as to form a match between these vectors; this match is obtained by a conventional dynamic time alignment algorithm called DTW (Dynamic Time Warping).
- The alignment step 18 is followed by a step 20 for determining a model representing in a weighted manner the common acoustic features of the source speaker and of the target speaker on a finite set of model components.
- In the embodiment described, the model is a probabilistic model of the acoustic features of the target speaker and of the source speaker, according to a model denoted by GMM, made up of a mixture of components formed of Gaussian densities. The parameters of the components are estimated from source and target vectors containing, for each speaker, the discrete cepstrum.
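The DTW alignment of step 18 can be sketched as follows; this is an illustrative implementation with a Euclidean frame distance, not the patent's own code:

```python
import numpy as np

def dtw_align(X, Y):
    """Align source frames X (n, p) with target frames Y (m, p) by classic
    dynamic time warping, as in alignment step 18 (illustrative sketch)."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])  # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path, yielding matched (source, target) pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Applied to two identical frame sequences, the path is simply the diagonal, i.e. each frame matched with itself.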
- Conventionally, the probability density of a random variable, denoted generally by p(z), according to a Gaussian probability density mixture model (GMM), is expressed mathematically as follows:
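The formula is an image in the original publication; given the notation of the following sentence, it is the standard GMM density:

```latex
p(z) = \sum_{i=1}^{Q} \alpha_i \, \mathcal{N}(z;\, \mu_i, \Sigma_i), \qquad \sum_{i=1}^{Q} \alpha_i = 1
```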
- In this formula, Q denotes the number of components in the model, N(z; μi, Σi) is the probability density of the normal distribution of mean μi and of covariance matrix Σi and the coefficients αi are the coefficients of the mixture.
- Thus, the coefficient αi corresponds to the probability a priori that the random variable z is generated by the ith Gaussian component of the mixture.
- More specifically, step 20 for determining the model includes a sub-step 22 for modeling the joint density p(z) of the source vector denoted by x and the target vector denoted by y, such that:
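The expression following "such that" is an image in the original; in joint-density GMM modeling the vector z is conventionally the stacking of the source and target vectors, an assumed reconstruction:

```latex
z = \begin{bmatrix} x \\ y \end{bmatrix}, \qquad p(z) = p(x, y)
```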
-
Step 20 then includes a sub-step 24 for estimating the GMM parameters (α, μ, Σ) of the density p(z). This estimation can be achieved, for example, using a conventional algorithm of the EM (Expectation-Maximization) type, an iterative method that yields a maximum likelihood estimate of the model parameters given the speech sample data. - The initial parameters of the GMM model are determined using a conventional vector quantization technique.
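Sub-step 24 can be sketched with a compact EM loop over the joint vectors; this is an illustrative implementation, not the patent's own, and the simple deterministic initialization stands in for the vector quantization mentioned in the text:

```python
import numpy as np

def fit_joint_gmm(z, Q, n_iter=50):
    """EM estimation of GMM parameters (alpha, mu, Sigma) for the joint
    density p(z), as in sub-step 24 (illustrative sketch)."""
    n, d = z.shape
    mu = z[np.linspace(0, n - 1, Q).astype(int)].copy()   # initial means
    sigma = np.array([np.cov(z.T) + 1e-6 * np.eye(d) for _ in range(Q)])
    alpha = np.full(Q, 1.0 / Q)
    for _ in range(n_iter):
        # E-step: a posteriori probabilities h_i(z_n), computed in the log domain
        logp = np.empty((n, Q))
        for i in range(Q):
            diff = z - mu[i]
            inv = np.linalg.inv(sigma[i])
            _, logdet = np.linalg.slogdet(sigma[i])
            maha = np.einsum('nd,dk,nk->n', diff, inv, diff)
            logp[:, i] = np.log(alpha[i]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        h = np.exp(logp - logp.max(axis=1, keepdims=True))
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights, means and covariances
        nk = h.sum(axis=0)
        alpha = nk / n
        mu = (h.T @ z) / nk[:, None]
        for i in range(Q):
            diff = z - mu[i]
            sigma[i] = (h[:, i, None] * diff).T @ diff / nk[i] + 1e-6 * np.eye(d)
    return alpha, mu, sigma
```

On two well-separated clusters of joint vectors, the estimated means converge to the cluster centers and the weights to the cluster proportions.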
- The
model determination step 20 thus delivers the parameters of a mixture of Gaussian densities, which parameters are representative of common acoustic features of the source speaker and target speaker voice samples. - The model thus defined therefore forms a weighted representation of common spectral envelope acoustic features of the target speaker and source speaker voice samples on the finite set of components of the model.
- The method then includes a
step 30 for determining, from the model and voice samples, a function for transforming the spectral envelope of the signal of the source speaker to the target speaker. - This transformation function is determined from an estimator for the realization of the acoustic features of the target speaker given the acoustic features of the source speaker, formed, in the embodiment described, by the conditional expectation.
- For this purpose,
step 30 includes a sub-step 32 for determining the conditional expectation of the acoustic features of the target speaker given the acoustic feature information of the source speaker. The conditional expectation is denoted by F(x) and is determined using the following formulae: - In these equations, hi(x) corresponds to the probability a posteriori that the source vector x is generated by the ith component of the Gaussian density mixture of the model, and the term in square brackets corresponds to a transformation element determined from the model. It is recalled that y denotes the target vector.
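The formulae of sub-step 32 appear as images in the original publication; in the GMM-based conversion literature, with the means of component i partitioned into μi^x, μi^y and its covariance into blocks Σi^xx, Σi^xy, Σi^yx, Σi^yy, they are written:

```latex
F(x) = \sum_{i=1}^{Q} h_i(x)\left[\mu_i^{y} + \Sigma_i^{yx}\left(\Sigma_i^{xx}\right)^{-1}\left(x - \mu_i^{x}\right)\right],
\qquad
h_i(x) = \frac{\alpha_i \,\mathcal{N}(x;\,\mu_i^{x}, \Sigma_i^{xx})}{\sum_{j=1}^{Q} \alpha_j \,\mathcal{N}(x;\,\mu_j^{x}, \Sigma_j^{xx})}
```

The bracketed term is the transformation element mentioned in the text; the exact notation of the original may differ.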
- By determining the conditional expectation it is thus possible to obtain the function for transforming spectral envelope features between the source speaker and the target speaker in the form of a weighted linear combination of transformation elements.
-
Step 30 also includes a sub-step 34 for determining a function for transforming the fundamental frequency, by a scaling of the fundamental frequency of the source speaker onto the fundamental frequency of the target speaker. This step 34 can be carried out conventionally at any point in the method after sub-steps 8X and 8Y for estimating the fundamental frequency. - With reference to
FIG. 1B, the conversion method then includes the transformation 2 of a voice signal to be converted delivered by the source speaker, which signal can be different from the voice signals used previously. - This
transformation 2 begins with an analysis step 36 performed, in the embodiment described, using a decomposition according to the HNM model similar to those performed in steps 4X and 4Y. Step 36 delivers spectral envelope information in the form of cepstral coefficients, fundamental frequency information, as well as maximum voicing frequency and phase information. - This
analysis step 36 is followed by a step 38 for determining an index of correspondence between the vector to be converted and each component of the model.
- The method then includes a
step 40 for selecting a restricted number of components of the model according to the correspondence indices determined in the previous step, which restricted set is denoted by S(x). - This
selection step 40 is implemented by an iterative procedure enabling a minimal set of components to be retained, these components being selected as long as the cumulative sum of their correspondence indices is less than a predetermined threshold.
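The iterative selection of step 40, together with the normalization of step 42 described further on, can be sketched as follows; the function name and the threshold value are illustrative assumptions:

```python
import numpy as np

def select_components(h, threshold=0.8):
    """Step 40 (sketch): retain a minimal set of components, taken in
    decreasing order of correspondence index, until their cumulative sum
    reaches `threshold`; then normalize the retained indices (step 42)."""
    order = np.argsort(h)[::-1]                     # components by decreasing index
    cum = np.cumsum(h[order])
    n = int(np.searchsorted(cum, threshold)) + 1    # minimal count reaching the threshold
    selected = order[:n]
    weights = h[selected] / h[selected].sum()       # normalization of step 42
    return selected, weights
```

For correspondence indices (0.55, 0.30, 0.10, 0.05) and a threshold of 0.8, the two strongest components are retained and their indices are renormalized to sum to one.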
- In the embodiment described, the
selection step 40 is followed by a step 42 for normalizing the correspondence indices of the selected components of the model. This normalization is achieved by the ratio of each selected index to the sum of all the selected indices. - Advantageously, the method then includes a
step 43 for storing selected model components and associated normalized correspondence indices. - Such a
storage step 43 is particularly useful if the analysis is performed at a deferred time with respect to the rest of the transformation 2, which means that a later conversion can be prepared efficiently. - The method then includes a
step 44 for partially applying the spectral envelope transformation function, by applying only the transformation elements corresponding to the selected model components. These selected transformation elements alone are applied to the frames of the signal to be converted, in order to reduce the time required to implement this transformation. - This
application step 44 corresponds to solving the following equation for the sole model components selected, forming the retained set S(x), such that: - Thus, for a given frame, with p being the dimension of the data vectors, Q the total number of components and N the number of components selected, step 44 for partially applying the transformation function is limited to N(p²+1) multiplications, in addition to the Q(p²+1) multiplications enabling the correspondence indices to be determined, as opposed to twice Q(p²+1). Consequently, the reduction in complexity obtained is at least of the order of Q/(Q+N).
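The equation solved at step 44 appears as an image in the original publication; restricting the conversion function to the selected set S(x), with normalized correspondence indices h̃i(x), it would take the form (an assumed reconstruction):

```latex
\hat{F}(x) = \sum_{i \in S(x)} \tilde{h}_i(x)\left[\mu_i^{y} + \Sigma_i^{yx}\left(\Sigma_i^{xx}\right)^{-1}\left(x - \mu_i^{x}\right)\right]
```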
- Furthermore, if the results of steps 36 to 42 have been stored through the implementation of step 43, the transformation function application step 44 is limited to N(p²+1) operations rather than the 2Q(p²+1) of the prior art, such that, for this step 44, the reduction in computation time is of the order of 2Q/N.
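The partial application of step 44 can be sketched as follows; all names are illustrative, with `A[i]` standing for the precomputed matrix mapping the source deviation of component i into the target space:

```python
import numpy as np

def partial_convert(x, sel, weights, mu_x, mu_y, A):
    """Step 44 (sketch): apply only the transformation elements of the
    selected components `sel`, weighted by their normalized correspondence
    indices `weights`; the loop runs over N components instead of Q."""
    y = np.zeros_like(mu_y[0])
    for i, w in zip(sel, weights):
        y = y + w * (mu_y[i] + A[i] @ (x - mu_x[i]))
    return y
```

With a single selected component of weight one and a zero mapping matrix, the converted frame is simply that component's target mean.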
- The method then includes a
step 46 for transforming fundamental frequency features of the voice signal to be converted, using the function for transformation by scaling determined at step 34 and realized according to conventional techniques. - Also conventionally, the conversion method then includes a
step 48 for synthesizing the output signal, produced, in the example described, by an HNM type synthesis which directly delivers the converted voice signal using the spectral envelope information transformed at step 44 and the fundamental frequency information delivered by step 46. This step 48 also uses the maximum voicing frequency and phase information delivered by step 36.
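The fundamental frequency transformation of steps 34 and 46 is only described as a scaling; a minimal sketch, assuming a simple mean-ratio scaling (the exact form used is not specified in the text), would be:

```python
import numpy as np

def scale_f0(f0_frames, source_mean, target_mean):
    """Scale source F0 values so that their mean matches the target
    speaker's mean F0 (assumed form of the step 34 scaling function)."""
    return np.asarray(f0_frames, dtype=float) * (target_mean / source_mean)
```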
-
FIG. 2 shows a block diagram of a voice conversion system implementing the method described with reference to FIGS. 1A and 1B. - This system uses as input a
database 50 of voice samples delivered by the source speaker and a database 52 containing at least the same voice samples delivered by the target speaker. - These two databases are used by a
module 54 for determining functions for transforming acoustic features of the source speaker into acoustic features of the target speaker. - This
module 54 is adapted to implement step 1 as described with reference to FIG. 1 and therefore provides for the determination of at least one function for transforming acoustic features, and in particular the function for transforming spectral envelope features and the function for transforming the fundamental frequency. - In particular, the
module 54 is adapted to determine the spectral envelope transformation function from a model representing in a weighted manner common acoustic features of voice samples from the target speaker and from the source speaker, on a finite set of model components. - The voice conversion system receives as input a
voice signal 60 corresponding to a speech signal delivered by the source speaker and intended to be converted. - The
signal 60 is introduced in an analysis module 62 implementing, for example, an HNM type decomposition enabling spectral envelope information of the signal 60 to be extracted in the form of cepstral coefficients and fundamental frequency information. The module 62 also delivers maximum voicing frequency and phase information obtained through the application of the HNM model. - The
module 62 therefore implements step 36 of the method as described previously. - If necessary, the
module 62 is implemented beforehand and the information is stored in order to be used later. - The system then includes a
module 64 for determining indices of correspondence between the voice signal to be converted 60 and each component of the model. To this end, the module 64 receives the parameters of the model determined by the module 54. - The
module 64 therefore implements step 38 of the method as described previously. - The system then comprises a
module 65 for selecting components of the model, implementing step 40 of the method described previously and enabling the selection of components whose correspondence index reflects a strong match with the voice signal to be converted. - Advantageously, this
module 65 also performs the normalization of the correspondence indices of the selected components with respect to their sum, by implementing step 42. - The system then includes a
module 66 for partially applying the spectral envelope transformation function determined by the module 54, by applying only the transformation elements selected by the module 65 according to the correspondence indices. - Thus, this
module 66 is adapted to implement step 44 for the partial application of the transformation function, so as to deliver as output source speaker acoustic information transformed by only the selected elements of the transformation function, i.e. by the components of the model exhibiting a high correspondence index with the frames of the signal to be converted 60. This module therefore provides for a fast transformation of the voice signal to be converted, by virtue of the partial application of the transformation function.
- The
module 66 is also adapted to perform a transformation of the fundamental frequency features, which is carried out conventionally by the application of the function for transformation by scaling realized according to step 46. - The system then includes a
synthesis module 68 receiving as input the spectral envelope and fundamental frequency information transformed and delivered by the module 66, as well as the maximum voicing frequency and phase information delivered by the analysis module 62. - The
module 68 thus implements step 48 of the method described with reference to FIG. 1 and delivers a signal 70 corresponding to the voice signal 60 of the source speaker, but whose spectral envelope and fundamental frequency features have been modified so as to be similar to those of the target speaker.
- This system can also be implemented on determined databases in order to form databases of converted signals ready to be used.
- In particular, this system can be implemented in a first operating phase in order to deliver, for a database of signals, information relating to the selected components of the model and to their respective correspondence indices, this information then being stored.
- The corresponding modules are then implemented at a later stage, using the stored information.
- Depending on the complexity of the signals and on the quality desired, the method of the invention and the corresponding system can also be implemented in real time.
- As a variant, the method of the invention and the corresponding system are adapted for the determination of several transformation functions. For example, a first and a second function are determined for the transformation respectively of spectral envelope parameters and of fundamental frequency parameters for frames of a voiced nature and a third function is determined for the transformation of frames of an unvoiced nature.
- In such an embodiment, provision is therefore made for a step for separating, in the voice signal to be converted, the voiced frames from the unvoiced frames, and for one or more steps for transforming each of these groups of frames.
- In the context of the invention, only one, or several, of the transformation functions is applied partially so as to reduce the processing time.
- Moreover, in the example described, the voice conversion is achieved by the transformation of spectral envelope features and of fundamental frequency features separately, with only the spectral envelope transformation function being applied partially. As a variant, several functions for transforming different acoustic features and/or for simultaneously transforming several acoustic features are determined and at least one of these transformation functions is applied partially.
- Generally, the system is adapted to implement all the steps of the method described with reference to
FIGS. 1A and 1B. - Naturally, embodiments other than those described can be envisaged.
- In particular, the HNM and GMM models can be replaced by other techniques and models known to the person skilled in the art. For example, the analysis may be performed using techniques known as LPC (Linear Predictive Coding), sinusoidal or MBE (Multi-Band Excited) models; the spectral parameters may be LSF (Line Spectrum Frequencies) parameters, or even parameters related to the formants or to a glottal signal. As a variant, the GMM model is replaced by a fuzzy vector quantization (Fuzzy VQ).
- As a variant, the estimator implemented during
step 30 can be a maximum a posteriori (MAP) criterion, corresponding to the calculation of the expectation only for the model component best representing the source-target pair of vectors.
- In this variant, the determination of a transformation function comprises the modeling of the probability density of the source vectors using a GMM model, then the determination of the parameters of the model using an EM algorithm. The modeling thus takes into account speech segments of the source speaker for which the corresponding ones delivered by the target speaker are not available.
- The determination then comprises the minimization of a criterion of least squares between target and source parameters in order to obtain the transformation function. It is to be noted that the estimator of this function is still expressed in the same way but that the parameters are estimated differently and that additional data are taken into account.
Claims (17)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR0403405A FR2868587A1 (en) | 2004-03-31 | 2004-03-31 | METHOD AND SYSTEM FOR RAPID CONVERSION OF A VOICE SIGNAL |
FR0403405 | 2004-03-31 | ||
PCT/FR2005/000607 WO2005106853A1 (en) | 2004-03-31 | 2005-03-14 | Method and system for the quick conversion of a voice signal |
Publications (2)
Publication Number | Publication Date |
---|---|
US20070192100A1 true US20070192100A1 (en) | 2007-08-16 |
US7792672B2 US7792672B2 (en) | 2010-09-07 |
Family
ID=34944345
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/591,599 Expired - Fee Related US7792672B2 (en) | 2004-03-31 | 2005-03-14 | Method and system for the quick conversion of a voice signal |
Country Status (4)
Country | Link |
---|---|
US (1) | US7792672B2 (en) |
EP (1) | EP1730728A1 (en) |
FR (1) | FR2868587A1 (en) |
WO (1) | WO2005106853A1 (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070027687A1 (en) * | 2005-03-14 | 2007-02-01 | Voxonic, Inc. | Automatic donor ranking and selection system and method for voice conversion |
US20070213987A1 (en) * | 2006-03-08 | 2007-09-13 | Voxonic, Inc. | Codebook-less speech conversion method and system |
US20100208987A1 (en) * | 2009-02-16 | 2010-08-19 | Institute For Information Industry | Method and system for foreground detection using multi-modality fusion graph cut |
WO2010105602A1 (en) * | 2009-03-16 | 2010-09-23 | Hayo Becks | Device and method for adapting acoustic patterns |
US20110112838A1 (en) * | 2009-11-10 | 2011-05-12 | Research In Motion Limited | System and method for low overhead voice authentication |
US20110264453A1 (en) * | 2008-12-19 | 2011-10-27 | Koninklijke Philips Electronics N.V. | Method and system for adapting communications |
US20140086420A1 (en) * | 2011-08-08 | 2014-03-27 | The Intellisis Corporation | System and method for tracking sound pitch across an audio signal using harmonic envelope |
US20140270226A1 (en) * | 2013-03-15 | 2014-09-18 | Broadcom Corporation | Adaptive modulation filtering for spectral feature enhancement |
US20190019500A1 (en) * | 2017-07-13 | 2019-01-17 | Electronics And Telecommunications Research Institute | Apparatus for deep learning based text-to-speech synthesizing by using multi-speaker data and method for the same |
US20220122623A1 (en) * | 2020-10-15 | 2022-04-21 | Agora Lab, Inc. | Real-Time Voice Timbre Style Transform |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4928465B2 (en) * | 2005-12-02 | 2012-05-09 | 旭化成株式会社 | Voice conversion system |
JP4966048B2 (en) * | 2007-02-20 | 2012-07-04 | 株式会社東芝 | Voice quality conversion device and speech synthesis device |
EP1970894A1 (en) * | 2007-03-12 | 2008-09-17 | France Télécom | Method and device for modifying an audio signal |
EP3273442B1 (en) * | 2008-03-20 | 2021-10-20 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and method for synthesizing a parameterized representation of an audio signal |
JP5038995B2 (en) * | 2008-08-25 | 2012-10-03 | 株式会社東芝 | Voice quality conversion apparatus and method, speech synthesis apparatus and method |
JP5961950B2 (en) * | 2010-09-15 | 2016-08-03 | ヤマハ株式会社 | Audio processing device |
JP6271748B2 (en) | 2014-09-17 | 2018-01-31 | 株式会社東芝 | Audio processing apparatus, audio processing method, and program |
US20190362737A1 (en) * | 2018-05-25 | 2019-11-28 | i2x GmbH | Modifying voice data of a conversation to achieve a desired outcome |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5327521A (en) * | 1992-03-02 | 1994-07-05 | The Walt Disney Company | Speech transformation system |
US5572624A (en) * | 1994-01-24 | 1996-11-05 | Kurzweil Applied Intelligence, Inc. | Speech recognition system accommodating different sources |
US6029124A (en) * | 1997-02-21 | 2000-02-22 | Dragon Systems, Inc. | Sequential, nonparametric speech recognition and speaker identification |
US20010037195A1 (en) * | 2000-04-26 | 2001-11-01 | Alejandro Acero | Sound source separation using convolutional mixing and a priori sound source knowledge |
US6336092B1 (en) * | 1997-04-28 | 2002-01-01 | Ivl Technologies Ltd | Targeted vocal transformation |
US6405166B1 (en) * | 1998-08-13 | 2002-06-11 | At&T Corp. | Multimedia search apparatus and method for searching multimedia content using speaker detection by audio data |
US6615174B1 (en) * | 1997-01-27 | 2003-09-02 | Microsoft Corporation | Voice conversion system and methodology |
US20050137862A1 (en) * | 2003-12-19 | 2005-06-23 | Ibm Corporation | Voice model for speech processing |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2002067245A1 (en) * | 2001-02-16 | 2002-08-29 | Imagination Technologies Limited | Speaker verification |
-
2004
- 2004-03-31 FR FR0403405A patent/FR2868587A1/en active Pending
-
2005
- 2005-03-14 WO PCT/FR2005/000607 patent/WO2005106853A1/en not_active Application Discontinuation
- 2005-03-14 EP EP05735426A patent/EP1730728A1/en not_active Withdrawn
- 2005-03-14 US US10/591,599 patent/US7792672B2/en not_active Expired - Fee Related
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5327521A (en) * | 1992-03-02 | 1994-07-05 | The Walt Disney Company | Speech transformation system |
US5572624A (en) * | 1994-01-24 | 1996-11-05 | Kurzweil Applied Intelligence, Inc. | Speech recognition system accommodating different sources |
US6615174B1 (en) * | 1997-01-27 | 2003-09-02 | Microsoft Corporation | Voice conversion system and methodology |
US6029124A (en) * | 1997-02-21 | 2000-02-22 | Dragon Systems, Inc. | Sequential, nonparametric speech recognition and speaker identification |
US6336092B1 (en) * | 1997-04-28 | 2002-01-01 | Ivl Technologies Ltd | Targeted vocal transformation |
US6405166B1 (en) * | 1998-08-13 | 2002-06-11 | At&T Corp. | Multimedia search apparatus and method for searching multimedia content using speaker detection by audio data |
US20010037195A1 (en) * | 2000-04-26 | 2001-11-01 | Alejandro Acero | Sound source separation using convolutional mixing and a priori sound source knowledge |
US6879952B2 (en) * | 2000-04-26 | 2005-04-12 | Microsoft Corporation | Sound source separation using convolutional mixing and a priori sound source knowledge |
US20050137862A1 (en) * | 2003-12-19 | 2005-06-23 | Ibm Corporation | Voice model for speech processing |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070027687A1 (en) * | 2005-03-14 | 2007-02-01 | Voxonic, Inc. | Automatic donor ranking and selection system and method for voice conversion |
US20070213987A1 (en) * | 2006-03-08 | 2007-09-13 | Voxonic, Inc. | Codebook-less speech conversion method and system |
US20110264453A1 (en) * | 2008-12-19 | 2011-10-27 | Koninklijke Philips Electronics N.V. | Method and system for adapting communications |
US8478034B2 (en) * | 2009-02-16 | 2013-07-02 | Institute For Information Industry | Method and system for foreground detection using multi-modality fusion graph cut |
US20100208987A1 (en) * | 2009-02-16 | 2010-08-19 | Institute For Information Industry | Method and system for foreground detection using multi-modality fusion graph cut |
WO2010105602A1 (en) * | 2009-03-16 | 2010-09-23 | Hayo Becks | Device and method for adapting acoustic patterns |
US8510104B2 (en) * | 2009-11-10 | 2013-08-13 | Research In Motion Limited | System and method for low overhead frequency domain voice authentication |
US8321209B2 (en) * | 2009-11-10 | 2012-11-27 | Research In Motion Limited | System and method for low overhead frequency domain voice authentication |
US20110112838A1 (en) * | 2009-11-10 | 2011-05-12 | Research In Motion Limited | System and method for low overhead voice authentication |
US20140086420A1 (en) * | 2011-08-08 | 2014-03-27 | The Intellisis Corporation | System and method for tracking sound pitch across an audio signal using harmonic envelope |
US9473866B2 (en) * | 2011-08-08 | 2016-10-18 | Knuedge Incorporated | System and method for tracking sound pitch across an audio signal using harmonic envelope |
US20140270226A1 (en) * | 2013-03-15 | 2014-09-18 | Broadcom Corporation | Adaptive modulation filtering for spectral feature enhancement |
US9520138B2 (en) * | 2013-03-15 | 2016-12-13 | Broadcom Corporation | Adaptive modulation filtering for spectral feature enhancement |
US20190019500A1 (en) * | 2017-07-13 | 2019-01-17 | Electronics And Telecommunications Research Institute | Apparatus for deep learning based text-to-speech synthesizing by using multi-speaker data and method for the same |
US20220122623A1 (en) * | 2020-10-15 | 2022-04-21 | Agora Lab, Inc. | Real-Time Voice Timbre Style Transform |
US11380345B2 (en) * | 2020-10-15 | 2022-07-05 | Agora Lab, Inc. | Real-time voice timbre style transform |
Also Published As
Publication number | Publication date |
---|---|
FR2868587A1 (en) | 2005-10-07 |
WO2005106853A1 (en) | 2005-11-10 |
EP1730728A1 (en) | 2006-12-13 |
US7792672B2 (en) | 2010-09-07 |
Similar Documents
Publication | Publication Date | Title
---|---|---
US7792672B2 (en) | | Method and system for the quick conversion of a voice signal
US7765101B2 (en) | | Voice signal conversation method and system
Erro et al. | | Voice conversion based on weighted frequency warping
EP2881947B1 (en) | | Spectral envelope and group delay inference system and voice signal synthesis system for voice analysis/synthesis
Chen et al. | | Voice conversion with smoothed GMM and MAP adaptation
US8255222B2 (en) | | Speech separating apparatus, speech synthesizing apparatus, and voice quality conversion apparatus
US8234110B2 (en) | | Voice conversion method and system
US6741960B2 (en) | | Harmonic-noise speech coding algorithm and coder using cepstrum analysis method
Lee et al. | | MAP-based adaptation for speech conversion using adaptation data selection and non-parallel training
Acero | | Formant analysis and synthesis using hidden Markov models
US20120265534A1 (en) | | Speech Enhancement Techniques on the Power Spectrum
JPH10124088A (en) | | Device and method for expanding voice frequency band width
JPH04313034A (en) | | Synthesized-speech generating method
CN110648684B (en) | | Bone conduction voice enhancement waveform generation method based on WaveNet
US7643988B2 (en) | | Method for analyzing fundamental frequency information and voice conversion method and system implementing said analysis method
US20100217584A1 (en) | | Speech analysis device, speech analysis and synthesis device, correction rule information generation device, speech analysis system, speech analysis method, correction rule information generation method, and program
US20130311189A1 (en) | | Voice processing apparatus
Lee | | Statistical approach for voice personality transformation
RU2427044C1 (en) | | Text-dependent voice conversion method
En-Najjary et al. | | A voice conversion method based on joint pitch and spectral envelope transformation
JP3973492B2 (en) | | Speech synthesis method and apparatus thereof, program, and recording medium recording the program
JP2798003B2 (en) | | Voice band expansion device and voice band expansion method
Al-Radhi et al. | | Continuous wavelet vocoder-based decomposition of parametric speech waveform synthesis
Al-Radhi et al. | | RNN-based speech synthesis using a continuous sinusoidal model
Irino et al. | | Evaluation of a speech recognition/generation method based on HMM and STRAIGHT
Legal Events
Date | Code | Title | Description
---|---|---|---
 | AS | Assignment | Owner name: FRANCE TELECOM, FRANCE. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST. Assignors: ROSEC, OLIVIER; EN-NAJJARY, TAOUFIK. Reel/frame: 018301/0460. Effective date: 20060721
 | FPAY | Fee payment | Year of fee payment: 4
 | FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.)
 | LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY
 | STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362
20180907 | FP | Lapsed due to failure to pay maintenance fee | Effective date: 20180907