WO1997034289A1 - System for automatically morphing audio information - Google Patents

System for automatically morphing audio information

Publication number
WO1997034289A1
WO1997034289A1 PCT/US1997/004337 US9704337W WO9734289A1 WO 1997034289 A1 WO1997034289 A1 WO 1997034289A1 US 9704337 W US9704337 W US 9704337W WO 9734289 A1 WO9734289 A1 WO 9734289A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
sounds
representation
representations
spectrogram
Application number
PCT/US1997/004337
Other languages
French (fr)
Inventor
Malcolm Slaney
Original Assignee
Interval Research Corporation
Application filed by Interval Research Corporation
Priority to AU22165/97A
Publication of WO1997034289A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H7/00 Instruments in which the tones are synthesised from a data store, e.g. computer organs
    • G10H7/008 Means for controlling the transition from one tone waveform to another
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K15/00 Acoustics not otherwise provided for
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/025 Envelope processing of music signals in, e.g. time domain, transform domain or cepstrum domain
    • G10H2250/035 Crossfade, i.e. time domain amplitude envelope control of the transition between musical sounds or melodies, obtained for musical purposes, e.g. for ADSR tone generation, articulations, medley, remix
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/471 General musical sound synthesis principles, i.e. sound category-independent synthesis methods
    • G10H2250/481 Formant synthesis, i.e. simulating the human speech production mechanism by exciting formant resonators, e.g. mimicking vocal tract filtering as in LPC synthesis vocoders, wherein musical instruments may be used as excitation signal to the time-varying filter estimated from a singer's speech

Definitions

  • the present invention is directed to the manipulation of sounds, and more particularly to the morphing of two audio signals to generate a new sound having characteristics between those of the original sounds.
  • a first type of sound modification involves the mixing of two or more sounds. This type of modification might be employed in a musical environment, for example, to provide equalization or reverberation. These effects are achieved by passing the sounds through simple filters whose operation is independent of the actual data being filtered.
  • a second type of sound modification is based upon data-dependent filtering. For example, the pitch of a sound can be increased or decreased by a predetermined percentage to disguise a person's voice.
  • a third type of manipulation which is more heavily data-dependent, is known as voice transformation.
  • an acoustic feature of speech such as its spectral profile or average pitch
  • histogram mapping might be employed to transform the speaker's pitch to that of the target voice.
  • Each time a particular sound is spoken, its formant frequencies are changed to match those of the target speaker.
  • the target voice results. Further information relating to this type of sound manipulation is described in U.S. Patent No.
  • Audio morphing differs from sound filtering, from the standpoint that two or more sounds are used as inputs to create a single sound having characteristics of each of the original sounds. Audio morphing also differs from voice transformation by virtue of the fact that the resulting sound is a transition from a beginning sound to an ending sound, and has characteristics which lie between the original sounds, rather than being a jump from a source sound to a target sound.
  • morphing is the process of changing one physical entity smoothly into another. Its most prevalent use today is in the visual domain. In this context, interpolations are made between the data of two images, and then cross fades are implemented so that one image blends smoothly into the other. Typically, the beginning and ending images are static, i.e., they do not change with time as the morphing process is carried out.
  • Audio morphing involves the process of generating sounds that lie between two source sounds. For example, in a series of steps the sound of a human scream might morph into the sound of a siren. Unlike images, sounds are not static. The amplitude of a sound at any given time, by itself, does not present meaningful information. Rather, it must be considered over a period of time. Thus, audio morphing is more complex, because it must take into consideration the time course of a sound during the morphed sequence. In the past, audio morphing has been carried out by using a sinusoidal analysis of the sounds used to create the morph.
  • a sound morphing process in accordance with the present invention is comprised of a series of basic steps. As a first step, each sound which forms the basis for the morph is converted into multiple representations that quantitatively depict one or more salient features of the sounds. In a preferred embodiment of the invention, the multiple representations are independent of one another. After the representations have been obtained, the temporal axes of the two sounds are matched, so that similar components of the two sounds, such as onsets, harmonic regions and inharmonic regions, are aligned with one another.
  • the two sounds can be cross-faded, to produce a representation of the morphed sound, such as a new spectrogram. This representation is then inverted, to generate the morphed sound.
  • the morphing process is not limited to harmonic sounds. Rather, any sound which is capable of being represented can form the basis for an audio morph.
  • the particular representations that are chosen will be dependent upon the characteristics of the sound that are important. The only criterion is that the representation be perceptually relevant, i.e. it relates to some aspect of the sound which is detectable to the human ear, and provides a distance metric of that aspect. Using such representations, any two or more sounds can be matched to one another to produce a morph.
  • Another advantage of the morphing process of the present invention is that it can be easily automated. For example, the temporal warping of two representations of a sound, to match them to one another, can be computed using known techniques, such as correlation that produces the lowest mean-squared difference. Similarly, other components of the sound can be automatically matched with one another, for example, using autocorrelation techniques.
  • Figure 1 is a block diagram illustrating the overall process for morphing two sounds in accordance with the present invention
  • Figure 2 is a more detailed block diagram of an embodiment of the invention for morphing speech
  • Figure 3 is an illustration of the audio correspondence between two sounds
  • Figure 4 is a diagram of the correspondence between the pitches of two sounds
  • Figures 5A and 5B are illustrations of a continuous morph and a cyclostationary morph, respectively;
  • Figure 6 is a spectrogram of a morph in which the pitch of a spoken vowel changes.
  • Figure 7 is an illustration of a sequence of spectrograms in a cyclostationary morph.
  • morphing is the process of generating a range of physical phenomena that move smoothly from one arbitrary entity to another.
  • a video morph consists of a series of images which successively show one object smoothly changing its shape and texture until it becomes another object.
  • the same objectives are desirable for an audio morph.
  • a sound that is perceived as coming from one object should smoothly change into another sound, maintaining the shared properties of the starting and ending sounds while smoothly changing other properties.
  • two different types of audio morphing can be produced.
  • One type of morph is temporally based. In this situation, a monotonic sound is considered as a point in a multi-dimensional space. The dimensions of this space can include the spectral shape, pitch, rhythm and other perceptually relevant auditory dimensions.
  • a morph is obtained by defining a path between two sounds represented at two points in the space.
  • This type of morph is analogous to image morphing. For example, a steady state clarinet tone might morph into the sound of an oboe or into a singer's voice.
  • a sequence of individual sounds is generated which smoothly change from one to another.
  • the spoken word "corner" can change into the word "morning" in a sequence of small steps.
  • Each individual step represents a small difference from the previous word, and in the middle of the sequence the word sounds like a cross between "corner" and "morning."
  • This type of morph is referred to as a cyclostationary morph. It is cyclic because a sound is played repetitively to transition from one word to the other. It is also stationary since each sound instance is a static example of one of the in-between sounds in the sequence.
  • the desired output may be just one of the intermediate sounds.
  • a sound can be produced that is a mixture of different components of the original sounds.
  • the output sound might utilize the pitch from one word, the timing from a second word, and the spectral resonances from a third word.
  • the morphing of one sound into another is schematically illustrated in the block diagram of Figure 1. A brief description of the overall process is first presented, and followed by a more detailed discussion of individual aspects of the process. This particular embodiment relates to the morphing of speech. It will be appreciated, however, that this example is for illustrative purposes. The principles which underlie the invention are equally applicable to music and other types of sound as well.
  • two input sounds provide the basis from which the morphed sound is produced.
  • more than two sounds can be used to provide the original input data.
  • a two-sound example will be described.
  • various representations 10 of each sound are generated.
  • the representations might be one or more spectrograms of each sound.
  • Corresponding representations of the two sounds are then temporally matched, preferably by means of a dynamic time warping process 12.
  • similar components of each sound such as the onset or attack portion, harmonic and inharmonic regions, and a decay region, are temporally aligned with one another.
  • other relevant features of the two sounds undergo a matching process 14. For example, if the sounds contain harmonic components, the pitches of the two sounds can be matched.
  • the matching of the two sounds results in a dense mapping of corresponding elements of the sounds to one another, for each of the dimensions of interest.
  • the sounds undergo interpolation and cross fading 16.
  • the first interpolation of the sound in the sequence comprises 100% of Sound 1 and 0% of Sound 2.
  • the second interpolated sound of the sequence is comprised of 75% of Sound 1's components and 25% of Sound 2's components.
  • Successive interpolation steps comprise greater proportions of Sound 2, until the final step is comprised entirely of Sound 2.
  • the interpolation determines the appropriate percentage of each of the two components to combine with one another.
  • the representation 10 of the sound transforms it from a simple waveform into a multi-dimensional representation that can be warped, or modified, to produce a desired result.
  • the representation of the sound must be one that is invertible, i.e. after one or more of its parameters are modified, the result can be used to generate an audible sound.
  • the particular representation that is employed should preserve all relevant dimensions of the sound. For example, in harmonic sounds pitch is an important characteristic. Thus, for the morphing of harmonic sounds, a representation which preserves the pitch information should be employed.
  • Suitable representations for harmonic sound include spectrograms, such as the short-term Fourier transform, as well as cochleagrams and correlograms.
  • Inharmonic sounds such as noise and spoken fricatives, do not have a pitch component. Similarly, if a spoken word is whispered, its pitch is not significant. Consequently, other types of representation may be more appropriate for these types of sounds. For example, linear predictive coding (LPC) coefficients might be used to represent the spectral characteristics of an inharmonic sound.
  • LPC: linear predictive coding
  • a multi-dimensional representation of sounds is employed, where each dimension is independent and salient to the perceived result.
  • two relevant dimensions of a sound are its pitch and its broad spectral shape, i.e. its frequency formants. These two dimensions roughly correspond to the rate at which the human glottis produces air pulses during speech (pitch) and the filtering of these pulses that is carried out by the mouth and nasal passages (formants).
  • Figure 2 illustrates one embodiment of the invention in which each of these three dimensions can be separately represented to generate a morph.
  • a conventional magnitude spectrogram of a sound is obtained by processing it through a Fast Fourier Transform 20.
  • the Fast Fourier Transform provides a quantitative analysis of the sound in terms of its frequency content.
  • the spectrogram of the sound is then further analyzed to determine its mel-frequency cepstral coefficients (MFCC) 22.
  • MFCC: mel-frequency cepstral coefficients
  • the MFCC for a sound is computed by resampling the magnitude spectrum to match critical bands that are related to auditory perception. This is carried out by passing the spectrogram through a filter bank which approximates the auditory characteristics of the human ear.
  • the filter bank produces a number of output signals, e.g. forty signals, which undergo a discrete cosine transform to rearrange the data values, and a predetermined number of the lowest frequency components, e.g. the thirteen lowest filter coefficients, are then selected.
  • These coefficients indicate the Euclidean distance between vectors, and therefore provide a good measure of how close two sounds are. Hence, they can be used to find a temporal match between two sounds, as described in detail hereinafter.
  • Since the MFCC contains only the lower frequency component information about a sound, it can be used to obtain a representation of the broad spectral shape of the sound.
  • the MFCC is inverted at 24 by applying the inverse of the cosine transform, to provide a smooth estimate of the filter bank output that was used to compute the MFCC.
  • This smooth estimate is then reinterpolated, for example by means of an inverse Bark scale, to yield a new spectrogram.
  • This spectrogram corresponds to the original spectrogram, minus the higher frequency pitch information. In the context of the present invention, this spectrogram is referred to as a "smooth spectrogram", and provides a representation of the frequency formants in the original sound.
  • the smooth spectrogram can be used to obtain a representation of the pitch information in a sound. More particularly, a conventional spectrogram encodes all of the information in a sound signal, and the smooth spectrogram describes the sound's overall spectral shape. The conventional spectrogram is divided by the smooth spectrogram at 26, to produce a residual spectrogram that contains the pitch and voicing information in a sound. In the context of the present invention, the residual spectrogram is referred to as a "pitch spectrogram."
  • Temporal matching of sounds at 28 is desirable since, over the course of a morph, features which are common to both sounds should remain relatively fixed in time.
  • a spectrogram for one sound e.g. a beginning sound
  • the spectrogram for an ending sound is shown above and to the left of the spectrogram for the beginning sound.
  • time is represented along the horizontal axis
  • frequency is depicted on the vertical axis.
  • the spectrogram for the ending sound is rotated counter-clockwise 90° relative to the spectrogram for the beginning sound.
  • dynamic time warping is employed to find the best temporal match between two sounds, using the distance metric provided by the MFCC transforms of the sounds.
  • the result of the dynamic time warping process is to provide control points in time which identify the frames of one sound that line up with those of the other sound.
  • the correspondence of the frames provides an indication of the amount by which each segment of a sound must be temporally compressed or expanded to match it to the corresponding features in the other sound.
  • Once the two sounds have been aligned temporally at 28, they can be matched at each corresponding time instant. For each pair of corresponding frames, the relevant acoustical features that are indicated by the representations of the two sounds need to be matched. For example, in the pitch spectrogram, the pitch information in the sound is visible as a series of peaks. The spacing of the peaks is proportional to the pitch.
  • the matching of the pitch data for two sounds at 30 essentially involves expanding or compressing the pitch spectrograms to align the pitch peaks.
  • the pitch of one sound can be represented as p1
  • the pitch of the other sound at the corresponding time is p2.
  • the frequency axis of the second sound's pitch spectrogram must be compressed by p1/p2. If p1 is larger than p2, the frequency axis of the pitch spectrogram for the second sound is actually stretched.
  • a spoken word may include both voiced and unvoiced sounds.
  • An example of an unvoiced sound is the consonant "c" in the word "corner".
  • the unvoiced components of the word do not contain pitch information.
  • the voiced, or harmonic, components include pitch, which should be matched to the pitch of another sound to form the morph.
  • a dynamic programming technique can be used to calculate a smooth pitch function throughout the entirety of a sound.
  • a morph includes some type of interpolation or cross-fading step.
  • Scalar dimensions are easiest to morph. If one component of a sound description is loudness, then the loudness of the morph should change smoothly from the loudness of the first sound to the loudness of the second. The same holds true for a scalar quantity like pitch.
  • acoustic information is not always scalar. Interpolations of temporal information, smooth spectrograms, and pitch spectrograms present a more complex problem, because they are based upon a dense match between two one-dimensional curves.
  • the data to be morphed can be described as s1(t) and s2(t). These two curves might represent pitch, for example.
  • the objective of the morph is to find a new curve s(λ,t) such that the s function is a fraction, λ, between the s1 and s2 curves. Since the matches between curves are monotonic, matching lines do not cross such that, for each point (λ,t), there is only one line establishing correspondence.
  • the interpolation problem simplifies to finding the times t1 and t2 that should be interpolated to generate the data at (λ,t).
  • the objective is to interpolate using the best possible t1 and t2.
  • a value t* can be calculated for all values of t1 using the expression above.
  • the value for t1 that produces t* closest to t can be used for the first half of the s-interpolation equation above.
  • Matching the features of the smooth spectrograms for the two sounds, at 32, is less critical than matching of the pitch spectrograms, at least where speech is concerned.
  • the two smooth spectrograms can simply be cross-faded, without prior warping.
  • dynamic warping can be applied to the smooth spectra, as a function of frequency, to match peaks in the two sounds before cross-fading them to obtain the morphed sound.
  • the interpolation and cross-fading is carried out independently at 34 for each of the relevant components of the sounds. For example, at the 50% point of a morph, a formant value and a pitch that is halfway between each of the two original sounds can be employed.
  • the resulting sound will be in between the two sounds.
  • the broad spectral shape for the morph might remain fixed with the first sound, while the pitch is changed to match the second sound.
  • the result of performing the cross-fades of the matched components of the two signals is a new set of representations for a sound having characteristics of each of the original input sounds. These representations are then combined to form a complete spectrogram. The spectrogram is then inverted at 36, to generate the new sound.
  • there are two different types of audio morphing that can be attained with the present invention.
  • One type of morph is continuous, as depicted in Figure 5A, and the other type of morph is cyclostationary, as depicted in Figure 5B.
  • a continuous morph is obtained in the case of simple sounds. For example, a note played on an oboe can smoothly transform over a given time span into a vowel spoken by a person. In another example, one vowel might morph into a different vowel, or the same vowel might morph from one pitch to another.
  • a spectrogram for this latter example, which was produced in accordance with the present invention, is illustrated in Figure 6.
  • a cyclostationary morph is comprised of multiple sound instantiations that form a sequence in which each sound differs from the others.
  • the word "corner" can transform into the word "morning" over a sequence of six steps.
  • the spectrograms for such a sequence are illustrated in Figure 7.
  • the first spectrogram relates to the pronunciation of the word "corner" and the last spectrogram pertains to the word "morning."
  • the four spectrograms between them represent various weighted combinations of the two words.
  • the present invention provides a morphing procedure in which any given sound can morph into any other sound. Since it is not based upon sinusoidal analysis, it is not limited in the types of sounds that can be utilized. Rather, a variety of different types of sound representations can be employed, in accordance with the perceptually significant features of the particular sounds that are chosen.
  • the morphing process can be completely automated.
  • the different steps of the process, including the temporal and feature-based matching steps, can be implemented in a computer which is suitably programmed to convert input sounds into appropriate representations, analyze the representations to match them to one another as described above, and then select a point between matched components to produce a new sound.
  • the labor-intensive requirements of previous audio morphing approaches can be avoided.
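
The split of a conventional spectrogram into a smooth spectrogram and a pitch spectrogram, as described in the list above, might be sketched for a single frame as follows. This is a simplified illustration under stated assumptions, not the patented implementation: it omits the auditory filter bank and the inverse Bark-scale reinterpolation, it takes a logarithm before the cosine transform (a common cepstral convention that the text above does not spell out), and the names `dct2`, `idct2` and `split_frame` are hypothetical.

```python
import math


def dct2(x):
    """Unnormalized DCT-II of a list of floats."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        for k in range(n)
    ]


def idct2(c):
    """Inverse of dct2 (a scaled DCT-III)."""
    n = len(c)
    return [
        c[0] / n
        + (2.0 / n)
        * sum(c[k] * math.cos(math.pi / n * (i + 0.5) * k) for k in range(1, n))
        for i in range(n)
    ]


def split_frame(magnitudes, keep=13):
    """Split one magnitude-spectrum frame into (smooth, residual).

    smooth   : broad spectral shape, rebuilt from the `keep` lowest
               cosine-transform coefficients (assumed default of 13).
    residual : frame divided by smooth, which retains the fine
               structure such as pitch peaks.
    """
    # Small offset avoids log(0); this is an assumption of the sketch.
    log_mag = [math.log(m + 1e-12) for m in magnitudes]
    coeffs = dct2(log_mag)
    # Zero every coefficient above the cutoff to discard fine detail.
    truncated = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
    smooth = [math.exp(v) for v in idct2(truncated)]
    residual = [m / s for m, s in zip(magnitudes, smooth)]
    return smooth, residual
```

Multiplying `smooth` by `residual` element by element recovers the input frame, which is why the two parts can be morphed separately and recombined afterwards.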

Abstract

In the first step of a sound morphing process, each sound which forms the basis for the morph is converted into one or more quantitative representations, such as spectrograms. After the representations have been obtained, the temporal axes of the two sounds are matched, so that similar components of the two sounds, such as onsets, harmonic regions and inharmonic regions, are aligned with one another. Other characteristics of the sounds, such as pitch, formant frequencies, or the like, are then matched. Once the energy in each of the sounds has been accounted for and matched to that of the other sound, the two sounds are cross-faded, to produce a representation of a new sound. This representation is then inverted, to generate the morphed sound.

Description

SYSTEM FOR AUTOMATICALLY MORPHING AUDIO INFORMATION
Field of the Invention
The present invention is directed to the manipulation of sounds, and more particularly to the morphing of two audio signals to generate a new sound having characteristics between those of the original sounds.
Background of the Invention
The manipulation of a sound, to produce a different sound, has applicability to a number of different fields. For example, in musical applications the transformation of one audio signal into another audio signal can be used to produce new sounds that are generated with synthesizers and the like. In the movie industry, the transformation of one sound into another sound, such as changing a speaker's voice to sound like the voice of a different person, can be used to create special effects. In a similar fashion, a person's voice can be manipulated so that it is disguised, for security purposes. In the past, different types of sound manipulation have been employed for these various purposes. A first type of sound modification involves the mixing of two or more sounds. This type of modification might be employed in a musical environment, for example, to provide equalization or reverberation. These effects are achieved by passing the sounds through simple filters whose operation is independent of the actual data being filtered.
A second type of sound modification is based upon data-dependent filtering. For example, the pitch of a sound can be increased or decreased by a predetermined percentage to disguise a person's voice.
A third type of manipulation, which is more heavily data-dependent, is known as voice transformation. In this type of manipulation, an acoustic feature of speech, such as its spectral profile or average pitch, is analyzed to represent it as a sequence of numbers, and then modified from the original speaker's voice, typically in accordance with a target voice. For example, histogram mapping might be employed to transform the speaker's pitch to that of the target voice. Each time a particular sound is spoken, its formant frequencies are changed to match those of the target speaker. When the sound is resynthesized with the new acoustical parameters, the target voice results. Further information relating to this type of sound manipulation is described in U.S. Patent No. 5,327,521, as well as in Savic et al., "Voice Personality Transformation", Digital Signal Processing 1, Academic Press, Inc., 1991, pp. 107-110; and Valbret et al., "Voice Transformation Using PSOLA Technique", Speech Communication 11, Elsevier Science Publishers, 1992, pp. 175-187.
A fourth type of audio manipulation, and the one to which the present invention is directed, is known as audio morphing. Audio morphing differs from sound filtering, from the standpoint that two or more sounds are used as inputs to create a single sound having characteristics of each of the original sounds. Audio morphing also differs from voice transformation by virtue of the fact that the resulting sound is a transition from a beginning sound to an ending sound, and has characteristics which lie between the original sounds, rather than being a jump from a source sound to a target sound.
Generally speaking, morphing is the process of changing one physical entity smoothly into another. Its most prevalent use today is in the visual domain. In this context, interpolations are made between the data of two images, and then cross fades are implemented so that one image blends smoothly into the other. Typically, the beginning and ending images are static, i.e., they do not change with time as the morphing process is carried out.
Audio morphing involves the process of generating sounds that lie between two source sounds. For example, in a series of steps the sound of a human scream might morph into the sound of a siren. Unlike images, sounds are not static. The amplitude of a sound at any given time, by itself, does not present meaningful information. Rather, it must be considered over a period of time. Thus, audio morphing is more complex, because it must take into consideration the time course of a sound during the morphed sequence. In the past, audio morphing has been carried out by using a sinusoidal analysis of the sounds used to create the morph. See, for example, Tellman et al., "Timbre Morphing of Sounds with Unequal Numbers of Features", CERL Sound Group, University of Illinois, 1995. In sinusoidal analysis, a sound is broken down into a number of discrete sinusoids. A morph is generated by changing the amplitude and frequency of the sinusoids. This technique only has applicability to harmonic sounds, such as those from musical instruments. It cannot be used to morph other types of sounds, such as noise or speech that includes fricatives, i.e. inharmonic sounds, as exemplified by the consonant "c" in the word "corner." Furthermore, even for harmonic sounds, if the beginning and ending sounds have different pitches, the result will be perceived as two different auditory objects, rather than a continuous morph from one sound to another.
Another limitation associated with morphing based upon sinusoidal analysis is that it requires a significant amount of manual effort to correctly label individual sinusoids in the two original sounds and match them to one another. Often, there is a significant amount of hand tuning that is required, to identify the discrete sinusoids that result in the best sound.
It is desirable, therefore, to provide a technique for morphing any given sound into any other sound, which is not limited to specific types of sounds, such as harmonic sounds. It is further desirable to provide such a technique which readily lends itself to automation, and thereby reduces the manual effort required to produce a morphed sound.
Brief Statement of the Invention
In accordance with the present invention, these objectives are achieved by a sound morphing process that is based on the fact that the different dimensions of sounds can be separated and individually operated upon. A sound morphing process in accordance with the present invention is comprised of a series of basic steps. As a first step, each sound which forms the basis for the morph is converted into multiple representations that quantitatively depict one or more salient features of the sounds. In a preferred embodiment of the invention, the multiple representations are independent of one another. After the representations have been obtained, the temporal axes of the two sounds are matched, so that similar components of the two sounds, such as onsets, harmonic regions and inharmonic regions, are aligned with one another. After the temporal matching, other relevant characteristics of the sounds, such as pitch, are also matched independently of the time matching. Once the energy in each of the sounds has been accounted for and matched to that of the other sound, the two sounds can be cross-faded, to produce a representation of the morphed sound, such as a new spectrogram. This representation is then inverted, to generate the morphed sound.
By using a spectrogram or other perceptual representation of a sound, the morphing process is not limited to harmonic sounds. Rather, any sound which is capable of being represented can form the basis for an audio morph. The particular representations that are chosen will be dependent upon the characteristics of the sound that are important. The only criterion is that the representation be perceptually relevant, i.e. it relates to some aspect of the sound which is detectable to the human ear, and provides a distance metric of that aspect. Using such representations, any two or more sounds can be matched to one another to produce a morph.
Another advantage of the morphing process of the present invention is that it can be easily automated. For example, the temporal warping of two representations of a sound, to match them to one another, can be computed using known techniques, such as correlation that produces the lowest mean-squared difference. Similarly, other components of the sound can be automatically matched with one another, for example, using autocorrelation techniques.
Further features of the invention, and the advantages provided thereby, are explained in greater detail hereinafter with reference to exemplary embodiments illustrated in the accompanying drawings.
Brief Description of the Drawings
Figure 1 is a block diagram illustrating the overall process for morphing two sounds in accordance with the present invention;
Figure 2 is a more detailed block diagram of an embodiment of the invention for morphing speech;
Figure 3 is an illustration of the audio correspondence between two sounds;
Figure 4 is a diagram of the correspondence between the pitches of two sounds;
Figures 5A and 5B are illustrations of a continuous morph and a cyclostationary morph, respectively;
Figure 6 is a spectrogram of a morph in which the pitch of a spoken vowel changes; and
Figure 7 is an illustration of a sequence of spectrograms in a cyclostationary morph.
Detailed Description
Generally speaking, morphing is the process of generating a range of physical phenomena that move smoothly from one arbitrary entity to another. For example, a video morph consists of a series of images which successively show one object smoothly changing its shape and texture until it becomes another object. The same objectives are desirable for an audio morph. A sound that is perceived as coming from one object should smoothly change into another sound, maintaining the shared properties of the starting and ending sounds while smoothly changing other properties. In the context of the present invention, two different types of audio morphing can be produced. One type of morph is temporally based. In this situation, a monotonic sound is considered as a point in a multi-dimensional space. The dimensions of this space can include the spectral shape, pitch, rhythm and other perceptually relevant auditory dimensions. A morph is obtained by defining a path between two sounds represented at two points in the space. This type of morph is analogous to image morphing. For example, a steady state clarinet tone might morph into the sound of an oboe or into a singer's voice.
In the second type of morph, a sequence of individual sounds is generated which smoothly change from one to another. For example, the spoken word "corner" can change into the word "morning" in a sequence of small steps. Each individual step represents a small difference from the previous word, and in the middle of the sequence the word sounds like a cross between "corner" and "morning." This type of morph is referred to as a cyclostationary morph. It is cyclic because a sound is played repetitively to transition from one word to the other. It is also stationary since each sound instance is a static example of one of the in-between sounds in the sequence.
Different variations of this second type of morph are possible. For example, rather than generating a sequence of sounds that transition from one word to another, the desired output may be just one of the intermediate sounds. Alternatively, a sound can be produced that is a mixture of different components of the original sounds. For example, the output sound might utilize the pitch from one word, the timing from a second word, and the spectral resonances from a third word. The morphing of one sound into another, in accordance with the present invention, is schematically illustrated in the block diagram of Figure 1. A brief description of the overall process is first presented, and followed by a more detailed discussion of individual aspects of the process. This particular embodiment relates to the morphing of speech. It will be appreciated, however, that this example is for illustrative purposes. The principles which underlie the invention are equally applicable to music and other types of sound as well.
Referring to Figure 1, two input sounds provide the basis from which the morphed sound is produced. In practice, more than two sounds can be used to provide the original input data. For purposes of the present explanation, a two-sound example will be described. As a first step, various representations 10 of each sound are generated. For example, the representations might be one or more spectrograms of each sound. Corresponding representations of the two sounds are then temporally matched, preferably by means of a dynamic time warping process 12. In this step, similar components of each sound, such as the onset or attack portion, harmonic and inharmonic regions, and a decay region, are temporally aligned with one another. After the temporal alignment, other relevant features of the two sounds undergo a matching process 14. For example, if the sounds contain harmonic components, the pitches of the two sounds can be matched. The matching of the two sounds results in a dense mapping of corresponding elements of the sounds to one another, for each of the dimensions of interest.
After all of the relevant energy components in the two sound signals have been matched, the sounds undergo interpolation and cross-fading 16. For example, if a morph from Sound 1 to Sound 2 is to take place in five steps, the first interpolation of the sound in the sequence comprises 100% of Sound 1 and 0% of Sound 2. The second interpolated sound of the sequence is comprised of 75% of Sound 1's components and 25% of Sound 2's components. Successive interpolation steps comprise greater proportions of Sound 2, until the final step is comprised entirely of Sound 2. For each step in the sequence, the interpolation determines the appropriate percentage of each of the two components to combine with one another. These combined components form a new representation 18 of the morphed sound, e.g., a new spectrogram. This representation can then be inverted, to generate the actual morphed sound for that step in the sequence. By successively reproducing each of the sounds in the sequence and cross-fading them into one another, a smooth transition from Sound 1 to Sound 2 can be heard.
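The five-step schedule described above can be sketched in a few lines of Python. This is only an illustration: the function names `morph_weights` and `cross_fade` are hypothetical, the steps are assumed to be evenly spaced, and the frames are assumed to be already matched and of equal length.

```python
def morph_weights(num_steps):
    """Return (weight_sound1, weight_sound2) pairs for an evenly spaced morph."""
    if num_steps < 2:
        raise ValueError("a morph needs at least two steps")
    weights = []
    for step in range(num_steps):
        lam = step / (num_steps - 1)  # fraction of Sound 2 at this step
        weights.append((1.0 - lam, lam))
    return weights


def cross_fade(frame1, frame2, lam):
    """Blend two matched frames; lam=0 returns frame1, lam=1 returns frame2."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(frame1, frame2)]


# Print the proportions used at each of the five steps.
for w1, w2 in morph_weights(5):
    print(f"{w1:.0%} of Sound 1, {w2:.0%} of Sound 2")
```

With five steps the proportions advance in increments of 25%, which matches the example in the text.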
The representation 10 of the sound transforms it from a simple waveform into a multi-dimensional representation that can be warped, or modified, to produce a desired result. To be useful, the representation of the sound must be one that is invertible, i.e. after one or more of its parameters are modified, the result can be used to generate an audible sound. The particular representation that is employed should preserve all relevant dimensions of the sound. For example, in harmonic sounds pitch is an important characteristic. Thus, for the morphing of harmonic sounds, a representation which preserves the pitch information should be employed. Examples of suitable representations for harmonic sound include spectrograms, such as the short-term Fourier transform, as well as cochleagrams and correlograms. Inharmonic sounds, such as noise and spoken fricatives, do not have a pitch component. Similarly, if a spoken word is whispered, its pitch is not significant. Consequently, other types of representation may be more appropriate for these types of sounds. For example, linear predictive coding (LPC) coefficients might be used to represent the spectral characteristics of an inharmonic sound.
Preferably, a multi-dimensional representation of sounds is employed, where each dimension is independent and salient to the perceived result. In the case of speech, two relevant dimensions of a sound are its pitch and its broad spectral shape, i.e. its frequency formants. These two dimensions roughly correspond to the rate at which the human glottis produces air pulses during speech (pitch) and the filtering of these pulses that is carried out by the mouth and nasal passages (formants). As discussed previously, another relevant dimension of sounds is their timing.
Figure 2 illustrates one embodiment of the invention in which each of these three dimensions can be separately represented to generate a morph. At the outset, a conventional magnitude spectrogram of a sound is obtained by processing it through a Fast Fourier Transform 20. The Fast Fourier Transform provides a quantitative analysis of the sound in terms of its frequency content. The spectrogram of the sound is then further analyzed to determine its mel-frequency cepstral coefficients (MFCC) 22. Briefly, the MFCC for a sound is computed by resampling the magnitude spectrum to match critical bands that are related to auditory perception. This is carried out by passing the spectrogram through a filter bank which approximates the auditory characteristics of the human ear. The filter bank produces a number of output signals, e.g. forty signals, which undergo a discrete cosine transform to rearrange the data values, and a predetermined number of the lowest frequency components, e.g. the thirteen lowest filter coefficients, are then selected. The Euclidean distance between vectors of these coefficients provides a good measure of how close two sounds are. Hence, the coefficients can be used to find a temporal match between two sounds, as described in detail hereinafter.
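A minimal sketch of the last two stages described above (cosine transform of the filter-bank outputs, then truncation to the lowest coefficients) is given here, together with the Euclidean distance used for matching. Several details are assumptions rather than statements from the text: the filter-bank outputs are taken as already computed, a logarithm is applied before the transform as is conventional for MFCCs, the DCT-II is left unnormalized, and the names `dct_ii`, `mfcc_like` and `euclidean` are hypothetical.

```python
import math


def dct_ii(values):
    """Unnormalized type-II discrete cosine transform of a sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def mfcc_like(filter_bank_outputs, num_coeffs=13):
    """Keep the lowest DCT coefficients of the log filter-bank outputs.

    The log step is an assumption based on common MFCC practice.
    """
    log_outputs = [math.log(v) for v in filter_bank_outputs]
    return dct_ii(log_outputs)[:num_coeffs]


def euclidean(a, b):
    """Euclidean distance between two equal-length coefficient vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Forty filter-bank outputs per frame, thirteen coefficients kept.
frame_a = [1.0 + 0.1 * i for i in range(40)]
frame_b = [1.0 + 0.2 * i for i in range(40)]
distance = euclidean(mfcc_like(frame_a), mfcc_like(frame_b))
```

A small distance between two frames indicates similar broad spectral shape, which is what makes these vectors usable as the cost in a temporal alignment.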
Since the MFCC contains only the lower frequency component information about a sound, it can be used to obtain a representation of the broad spectral shape of the sound. To this end, the MFCC is inverted at 24 by applying the inverse of the cosine transform, to provide a smooth estimate of the filter bank output that was used to compute the MFCC. This smooth estimate is then reinterpolated, for example by means of an inverse Bark scale, to yield a new spectrogram. This spectrogram corresponds to the original spectrogram, minus the higher frequency pitch information. In the context of the present invention, this spectrogram is referred to as a "smooth spectrogram", and provides a representation of the frequency formants in the original sound.
Furthermore, the smooth spectrogram can be used to obtain a representation of the pitch information in a sound. More particularly, a conventional spectrogram encodes all of the information in a sound signal, and the smooth spectrogram describes the sound's overall spectral shape. The conventional spectrogram is divided by the smooth spectrogram at 26, to produce a residual spectrogram that contains the pitch and voicing information in a sound. In the context of the present invention, the residual spectrogram is referred to as a "pitch spectrogram."
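The separation into a smooth part and a residual part can be illustrated on a single spectral frame. The sketch below is a simplification under stated assumptions: it works directly on one positive-valued frame and omits the auditory filter bank and the inverse Bark-scale reinterpolation, it takes a logarithm before the cosine transform (an assumption, not stated in the text), and the names `inverse_dct` and `split_frame` are hypothetical.

```python
import math


def dct_ii(values):
    """Unnormalized type-II discrete cosine transform."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def inverse_dct(coeffs, n):
    """Invert dct_ii for an n-point signal; missing high coefficients count as zero."""
    out = []
    for i in range(n):
        total = coeffs[0] / n
        for k in range(1, len(coeffs)):
            total += (2.0 / n) * coeffs[k] * math.cos(
                math.pi * k * (2 * i + 1) / (2 * n))
        out.append(total)
    return out


def split_frame(frame, num_coeffs=13):
    """Return (smooth, residual) such that smooth[i] * residual[i] == frame[i]."""
    log_frame = [math.log(v) for v in frame]
    low = dct_ii(log_frame)[:num_coeffs]          # keep broad spectral shape only
    smooth = [math.exp(v) for v in inverse_dct(low, len(frame))]
    residual = [v / s for v, s in zip(frame, smooth)]  # fine structure (pitch)
    return smooth, residual
```

Because the residual is obtained by division, multiplying the two parts back together recovers the original frame, which is what allows each part to be modified separately and then recombined.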
In the embodiment of Figure 2, three representations are derived for each sound, namely the MFCC transform which provides temporal information, the smooth spectrogram which provides formant information, and the pitch spectrogram which provides pitch and voicing information. In Figure 2, the individual steps for obtaining these representations are shown with respect to one sound. It will be appreciated that similar processing is carried out to provide representations for a second sound, which forms another component of the audio morph. The corresponding representations of the two sounds are then matched to one another at 28-32.
Temporal matching of sounds at 28 (Fig. 2) is desirable since, over the course of a morph, features which are common to both sounds should remain relatively fixed in time. Referring to Figure 3, an example of the temporal correspondence between two sounds is illustrated. In the figure, a spectrogram for one sound, e.g. a beginning sound, is shown at the bottom of the figure, and the spectrogram for an ending sound is shown above and to the left of the spectrogram for the beginning sound. In the spectrogram for the beginning sound, time is represented along the horizontal axis, and frequency is depicted on the vertical axis. To illustrate the temporal matching of the two sounds, the spectrogram for the ending sound is rotated counter-clockwise 90° relative to the spectrogram for the beginning sound.
In the preferred embodiment of the invention, dynamic time warping is employed to find the best temporal match between two sounds, using the distance metric provided by the MFCC transforms of the sounds. For detailed information regarding dynamic time warping, reference is made to the disclosure of which is incorporated herein by reference. The result of the dynamic time warping process is to provide control points in time which identify the frames of one sound that line up with those of the other sound. The correspondence of the frames provides an indication of the amount by which each segment of a sound must be temporally compressed or expanded to match it to the corresponding features in the other sound.
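For illustration, a textbook form of dynamic time warping is sketched below. It is a generic formulation and not necessarily the variant used in the embodiment: the allowed steps (diagonal, vertical, horizontal), the absence of slope constraints, and the names `dtw_path` and `euclidean` are all assumptions. Each frame is a list of coefficients, such as an MFCC vector.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_path(frames_a, frames_b, dist=euclidean):
    """Return the lowest-cost alignment as a list of (index_a, index_b) pairs."""
    n, m = len(frames_a), len(frames_b)
    inf = float("inf")
    # cost[i][j] is the best cumulative cost of aligning the first i frames
    # of A with the first j frames of B.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(frames_a[i - 1], frames_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Trace back from the end, preferring the diagonal step on ties.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


# The second sound holds its first frame twice as long as the first sound.
sound_a = [[0.0], [1.0], [2.0]]
sound_b = [[0.0], [0.0], [1.0], [2.0]]
alignment = dtw_path(sound_a, sound_b)
```

Each pair in the returned path plays the role of a control point: repeated indices on one side show where that sound must be stretched to line up with the other.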
Once the two sounds have been aligned temporally at 28, they can be matched at each corresponding time instant. For each pair of corresponding frames, the relevant acoustical features that are indicated by the representations of the two sounds need to be matched. For example, in the pitch spectrogram, the pitch information in the sound is visible as a series of peaks. The spacing of the peaks is proportional to the pitch. The matching of the pitch data for two sounds at 30 essentially involves expanding or compressing the pitch spectrograms to align the pitch peaks. For any given instant in time, the pitch of one sound can be represented as p1, and the pitch of the other sound at the corresponding time is p2. To perform a match, the frequency axis of the second sound's pitch spectrogram must be compressed by p1/p2. If p1 is larger than p2, the frequency axis of the pitch spectrogram for the second sound is actually stretched. When this process is carried out, the result is a dense match between a frequency f1 in the first pitch spectrogram and a corresponding frequency f2=p1/p2*f1 in the second pitch spectrogram.
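The rescaling of the frequency axis described above can be sketched for a single frame of a pitch spectrogram. The sketch assumes linearly spaced frequency bins, uses linear interpolation between bins, and fills bins that fall outside the original range with zero; the function name `scale_frequency_axis` is hypothetical. A factor greater than one stretches the axis, as in the case where p1 is larger than p2.

```python
def scale_frequency_axis(frame, factor):
    """Resample a frame so that a peak at bin b moves to bin b * factor."""
    n = len(frame)
    scaled = []
    for k in range(n):
        pos = k / factor              # position in the original frame
        if pos > n - 1:
            scaled.append(0.0)        # nothing to read beyond the last bin
            continue
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        scaled.append((1.0 - frac) * frame[lo] + frac * frame[hi])
    return scaled


# Second sound has its pitch peak at bin 4; the first sound has it at bin 8.
p1, p2 = 8.0, 4.0
second = [0.0] * 16
second[4] = 1.0
aligned = scale_frequency_axis(second, p1 / p2)
```

After the call, the peak of the second frame sits at the same bin as the peak of the first frame, so the two frames can be compared or blended bin by bin.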
Some sounds contain both harmonic and inharmonic components. For example, a spoken word may include both voiced and unvoiced sounds. An example of an unvoiced sound is the consonant "c" in the word "corner". The unvoiced components of the word do not contain pitch information. However, the voiced, or harmonic, components include pitch, which should be matched to the pitch of another sound to form the morph. To ensure that the pitch of the morphed sound is consistent and smoothly changing, it is desirable to find a curve which provides an estimate for pitch throughout the entire time duration of the sound, including the inharmonic regions where it is normally absent. In a preferred implementation of the invention, a dynamic programming technique can be used to calculate a smooth pitch function throughout the entirety of a sound. Examples of suitable dynamic programming techniques are disclosed, for example, in Amini et al, "Using Dynamic Programming for Solving Variational Problems in Vision," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 9, September 1990, pp. 855-867, and Secrest et al, "An Integrated Pitch Tracking Algorithm for Speech Systems", Proceedings of 1983 ICASSP, Boston, MA, vol. 3, pp. 1352-1355, 1983. The pitch functions that are calculated for respective sounds with such a technique can then be matched to one another, as described previously.
Once all of the relevant energy in each sound has been accounted for and matched, the corresponding portions of the two sounds can be cross-faded to produce a representation for a new sound. A morph includes some type of interpolation or cross-fading step. Scalar dimensions are easiest to morph. If one component of a sound description is loudness, then the loudness of the morph should change smoothly from the loudness of the first sound to the loudness of the second. The same holds true for a scalar quantity like pitch. However, acoustic information is not always scalar.
Interpolations of temporal information, smooth spectrograms, and pitch spectrograms present a more complex problem, because they are based upon a dense match between two one-dimensional curves.
With reference to Figure 4, the data to be morphed can be described as s1(t) and s2(t). These two curves might represent pitch, for example. The objective of the morph is to find a new curve s(λ,t) such that the s function is a fraction, λ, between the s1 and s2 curves. Since the matches between curves are monotonic, matching lines do not cross, so that for each point (λ,t) there is only one line establishing correspondence. The interpolation problem simplifies to finding the times t1 and t2 that should be interpolated to generate the data at (λ,t).
Given lines ending at t1 and t2, the intersection with a line at some fractional distance λ between the two curves is at
$$\frac{t - t_1}{t_2 - t_1} = \lambda \quad\Rightarrow\quad t = \lambda\,(t_2 - t_1) + t_1$$
Given the proper values for t1 and t2, the new data at (λ,t) is generated by cross-fading the warped signals.
$$s(\lambda, t) = (1 - \lambda)\,s_1(t_1) + \lambda\,s_2(t_2)$$
When λ is zero, the result will be identical to s1. When λ is 1, the result is s2. In between, the morphing process smoothly cross-fades between the two functions. The mappings between s1 and s2 are described as paths. Path1 warps s1 to look like s2. Thus, path1 is the path that produces the smallest difference between s1(path1(t)) and s2(t). Likewise, s2(path2(t)) is close to s1(t). Using these paths, the above equation can be simplified so that the intermediate t is given by
$$t = \lambda\,(\mathrm{path2}(t_1) - t_1) + t_1$$
For each point t along the s(λ,t) line, the objective is to interpolate using the best possible t1 and t2. A value t* can be calculated for all values of t1 using the expression above. The value for t1 that produces t* closest to t can be used for the first half of the s-interpolation equation above. To calculate the appropriate t2, the procedure is repeated from the other side. This is used to calculate the second term in the s-interpolation equation above.
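The search for t1 and t2 described above can be sketched on sampled curves. The sketch makes several assumptions: both curves have the same number of samples, the two mappings are given as integer index lists (here called `path_1to2` and `path_2to1`, hypothetical names), the nearest sample is used instead of sub-sample interpolation, and "repeated from the other side" is read as tracing each matching line from its s2 end with fraction (1 - λ).

```python
def morph_curve(s1, s2, path_1to2, path_2to1, lam):
    """Interpolate between matched curves s1 and s2 at fraction lam in [0, 1].

    path_1to2[u] is the sample of s2 matched to sample u of s1;
    path_2to1[v] is the sample of s1 matched to sample v of s2.
    """
    out = []
    for t in range(len(s1)):
        # Sample of s1 whose matching line passes closest to t at fraction lam.
        t1 = min(range(len(s1)),
                 key=lambda u: abs(lam * (path_1to2[u] - u) + u - t))
        # Same search started from the s2 end of the matching lines.
        t2 = min(range(len(s2)),
                 key=lambda v: abs((1.0 - lam) * (path_2to1[v] - v) + v - t))
        out.append((1.0 - lam) * s1[t1] + lam * s2[t2])
    return out


# A peak at sample 1 in the first curve is matched to a peak at sample 3.
s1 = [0.0, 1.0, 0.0, 0.0, 0.0]
s2 = [0.0, 0.0, 0.0, 1.0, 0.0]
path_1to2 = [0, 3, 4, 4, 4]
path_2to1 = [0, 0, 0, 1, 2]
halfway = morph_curve(s1, s2, path_1to2, path_2to1, 0.5)
```

In this example the halfway curve has its peak at sample 2, between the two original peak positions, rather than two half-height peaks as a plain cross-fade without warping would give.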
With reference to Figure 4, during a morph energy moves along the dashed lines which connect corresponding frequencies of the two sounds. For instance, at a point which is 25% through the morph, the generated sound has a frequency equal to 75% of that for Sound 1 and 25% of the corresponding, matched frequency for Sound 2. As the morph progresses, successively greater proportions of the frequencies for Sound 2 are employed.
Matching the features of the smooth spectrograms for the two sounds, at 32, is less critical than matching of the pitch spectrograms, at least where speech is concerned. In one approach, the two smooth spectrograms can simply be cross-faded, without prior warping. In an alternative approach, dynamic warping can be applied to the smooth spectra, as a function of frequency, to match peaks in the two sounds before cross-fading them to obtain the morphed sound.
The interpolation and cross-fading is carried out independently at 34 for each of the relevant components of the sounds. For example, at the 50% point of a morph, a formant value and a pitch that is halfway between each of the two original sounds can be employed. In such a case, the resulting sound will be in between the two sounds. Alternatively, it is possible to keep one of the components fixed, while varying another component. Thus, for example, the broad spectral shape for the morph might remain fixed with the first sound, while the pitch is changed to match the second sound. Various other combinations of modifications will be readily apparent.
The result of performing the cross-fades of the matched components of the two signals is a new set of representations for a sound having characteristics of each of the original input sounds. These representations are then combined to form a complete spectrogram. The spectrogram is then inverted at 36, to generate the new sound.
As discussed previously, there are two different types of audio morphing that can be attained with the present invention. One type of morph is continuous, as depicted in Figure 5A, and the other type of morph is cyclostationary, as depicted in Figure 5B. A continuous morph is obtained in the case of simple sounds. For example, a note played on an oboe can smoothly transform over a given time span into a vowel spoken by a person. In another example, one vowel might morph into a different vowel, or the same vowel might morph from one pitch to another.
A spectrogram for this latter example, which was produced in accordance with the present invention, is illustrated in Figure 6.
In contrast to a continuous morph, a cyclostationary morph is comprised of multiple sound instantiations that form a sequence in which each sound differs from the others. For example, the word "corner" can transform into the word "morning" over a sequence of six steps. The spectrograms for such a sequence are illustrated in Figure 7. Thus, the first spectrogram relates to the pronunciation of the word "corner" and the last spectrogram pertains to the word "morning." The four spectrograms between them represent various weighted combinations of the two words.
From the foregoing, it can be seen that the present invention provides a morphing procedure in which any given sound can morph into any other sound. Since it is not based upon sinusoidal analysis, it is not limited in the types of sounds that can be utilized. Rather, a variety of different types of sound representations can be employed, in accordance with the perceptually significant features of the particular sounds that are chosen.
Furthermore, by utilizing spectrographic representations of sounds, the morphing process can be completely automated. The different steps of the process, including the temporal and feature-based matching steps, can be implemented in a computer which is suitably programmed to convert input sounds into appropriate representations, analyze the representations to match them to one another as described above, and then select a point between matched components to produce a new sound. As such, the labor-intensive requirements of previous audio morphing approaches can be avoided.
It will be appreciated by those of ordinary skill in the art that the present invention can be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The foregoing discussion of an embodiment of the invention was particularly directed to speech. However, the principles of the invention are equally applicable to other types of sounds as well, such as music. Depending upon the particular sounds to be morphed, different types of representations might be employed to provide a distance metric of the sound's features that are considered to be perceptually relevant. The presently disclosed embodiments are considered in all respects to be illustrative and not restrictive. The scope of the invention is indicated by the appended claims, rather than the foregoing description, and all changes that come within the meaning and range of equivalence thereof are intended to be embraced therein.

Claims

WHAT IS CLAIMED IS:
1. A method for morphing from a first sound to a second sound, comprising the steps of: analyzing each of said first and second sounds to obtain at least one quantitative representation of each sound; temporally matching common regions of said representations to one another; determining a cross-faded sound representation comprised of predetermined proportions of corresponding temporally aligned features of said first and second sounds; and inverting the cross-faded sound representation to produce a sound having acoustic characteristics between those of said first and second sounds.
2. The method of claim 1 further including the step of determining correspondence between at least one acoustic feature in the quantitative representation of said first and second sounds.
3. The method of claim 2 wherein the quantitative representation comprises a dense spectral analysis of each sound.
4. The method of claim 3 wherein said dense spectral analysis comprises a pitch spectrogram which provides a distance metric for pitch information in a sound.
5. The method of claim 3 wherein said dense spectral analysis comprises a smooth spectrogram which provides a distance metric for formant frequencies in a sound.
6. The method of claim 1 wherein multiple representations are obtained for each sound.
7. The method of claim 6 wherein said multiple representations include a pitch spectrogram and a smooth spectrogram for each sound.
8. The method of claim 6 wherein each of said multiple representations are separately cross-faded and then combined to form said cross-faded sound representation.
9. A method for morphing from a first sound to a second sound, comprising the steps of: analyzing each of said first and second sounds to obtain a pitch representation and a separate spectral shape representation for each sound; determining correspondence between the respective pitch representations of said sounds and between the respective spectral shape representations of said sounds; independently modifying the pitch representation and spectral shape representations of said sounds, based on said correspondence; combining the modified representations to form a new representation; and inverting the new representation to produce a morphed sound.
10. The method of claim 9 further including the step of generating a third representation of each sound that provides a distance metric of the temporal correspondence between the two sounds.
11. The method of claim 10 wherein said third representation comprises an MFCC analysis of each sound.
12. The method of claim 10 further including the step of temporally aligning said two sounds prior to determining said pitch and spectral shape correspondence.
13. A method for morphing from a first sound to a second sound, comprising the steps of: factoring each of said two sounds into a plurality of representations which respectively relate to different acoustic features of the sounds; independently modifying said plural representations to produce a plurality of new representations; combining said new representations to produce a representation for a morphed sound; and inverting the representation for the morphed sound.
14. The method of claim 13 wherein said modifying step includes the step of interpolating corresponding values for a representation of each of the two sounds.
15. The method of claim 13 wherein said plural representations are independent of one another.
16. The method of claim 13 wherein one of said representations comprises a pitch spectrogram.
17. The method of claim 13 wherein one of said representations comprises a spectrogram of the formant frequencies in a sound.
18. A method for generating a dense spectral representation of a sound, comprising the steps of: generating a first spectrogram of the sound; determining the mel-frequency cepstral coefficients for the sound from said first spectrogram; and inverting the mel-frequency cepstral coefficients to obtain a spectrogram of the formant frequencies of the sound.
19. The method of claim 18 further including the step of dividing said first spectrogram by said formant frequency spectrogram to obtain a pitch spectrogram.
20. A method for producing a morph comprising a transition from one spoken word to another spoken word, comprising the steps of: generating a dense spectral representation of each spoken word; generating a plurality of modified representations, each of which comprises a different respective interpolation of corresponding values in the representation of said two sounds; and sequentially inverting each of said modified representation to produce a series of discrete sounds which transition from one of said spoken words to the other of said spoken words, and include characteristics of each of said spoken words.
PCT/US1997/004337 1996-03-15 1997-03-14 System for automatically morphing audio information WO1997034289A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU22165/97A AU2216597A (en) 1996-03-15 1997-03-14 System for automatically morphing audio information

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US08/616,290 1996-03-15
US08/616,290 US5749073A (en) 1996-03-15 1996-03-15 System for automatically morphing audio information

Publications (1)

Publication Number Publication Date
WO1997034289A1 true WO1997034289A1 (en) 1997-09-18

Family

ID=24468815

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1997/004337 WO1997034289A1 (en) 1996-03-15 1997-03-14 System for automatically morphing audio information

Country Status (3)

Country Link
US (1) US5749073A (en)
AU (1) AU2216597A (en)
WO (1) WO1997034289A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003025923A1 (en) * 2001-09-12 2003-03-27 Matsushita Electric Industrial Co., Ltd. Optical information recording medium and recording method using it
EP3513405B1 (en) * 2016-09-14 2023-07-19 Magic Leap, Inc. Virtual reality, augmented reality, and mixed reality systems with spatialized audio

Families Citing this family (62)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6591240B1 (en) * 1995-09-26 2003-07-08 Nippon Telegraph And Telephone Corporation Speech signal modification and concatenation method by gradually changing speech parameters
JP3502247B2 (en) * 1997-10-28 2004-03-02 ヤマハ株式会社 Voice converter
US6054646A (en) * 1998-03-27 2000-04-25 Interval Research Corporation Sound-based event control using timbral analysis
US7003120B1 (en) 1998-10-29 2006-02-21 Paul Reed Smith Guitars, Inc. Method of modifying harmonic content of a complex waveform
US7620527B1 (en) 1999-05-10 2009-11-17 Johan Leo Alfons Gielis Method and apparatus for synthesizing and analyzing patterns utilizing novel “super-formula” operator
US7035873B2 (en) * 2001-08-20 2006-04-25 Microsoft Corporation System and methods for providing adaptive media property classification
US7532943B2 (en) * 2001-08-21 2009-05-12 Microsoft Corporation System and methods for providing automatic classification of media entities according to sonic properties
WO2002037471A2 (en) * 2000-11-03 2002-05-10 Zoesis, Inc. Interactive character system
US6633839B2 (en) * 2001-02-02 2003-10-14 Motorola, Inc. Method and apparatus for speech reconstruction in a distributed speech recognition system
US6915261B2 (en) * 2001-03-16 2005-07-05 Intel Corporation Matching a synthetic disc jockey's voice characteristics to the sound characteristics of audio programs
US7711123B2 (en) * 2001-04-13 2010-05-04 Dolby Laboratories Licensing Corporation Segmenting audio signals into auditory events
US7610205B2 (en) * 2002-02-12 2009-10-27 Dolby Laboratories Licensing Corporation High quality time-scaling and pitch-scaling of audio signals
US7283954B2 (en) * 2001-04-13 2007-10-16 Dolby Laboratories Licensing Corporation Comparing audio using characterizations based on auditory events
US7461002B2 (en) * 2001-04-13 2008-12-02 Dolby Laboratories Licensing Corporation Method for time aligning audio signals using characterizations based on auditory events
JP4180249B2 (en) * 2001-04-27 2008-11-12 パイオニア株式会社 Audio signal processing device
ATE387000T1 (en) * 2001-05-10 2008-03-15 Dolby Lab Licensing Corp IMPROVE TRANSIENT PERFORMANCE IN LOW BITRATE ENCODERS BY SUPPRESSING PRE-NOISE
US6876728B2 (en) 2001-07-02 2005-04-05 Nortel Networks Limited Instant messaging using a wireless interface
JP4012506B2 (en) * 2001-08-24 2007-11-21 株式会社ケンウッド Apparatus and method for adaptively interpolating frequency components of a signal
US8644475B1 (en) 2001-10-16 2014-02-04 Rockstar Consortium Us Lp Telephony usage derived presence information
WO2003036621A1 (en) * 2001-10-22 2003-05-01 Motorola, Inc., A Corporation Of The State Of Delaware Method and apparatus for enhancing loudness of an audio signal
US20030135624A1 (en) * 2001-12-27 2003-07-17 Mckinnon Steve J. Dynamic presence management
US20030182106A1 (en) * 2002-03-13 2003-09-25 Spectral Design Method and device for changing the temporal length and/or the tone pitch of a discrete audio signal
GB0209770D0 (en) * 2002-04-29 2002-06-05 Mindweavers Ltd Synthetic speech sound
US8392609B2 (en) 2002-09-17 2013-03-05 Apple Inc. Proximity detection for media proxies
GB2397736B (en) * 2003-01-21 2005-09-07 Hewlett Packard Co Visualization of spatialized audio
US9118574B1 (en) 2003-11-26 2015-08-25 RPX Clearinghouse, LLC Presence reporting using wireless messaging
US7412377B2 (en) * 2003-12-19 2008-08-12 International Business Machines Corporation Voice model for speech processing based on ordered average ranks of spectral features
WO2006004050A1 (en) * 2004-07-01 2006-01-12 Nippon Telegraph And Telephone Corporation System for detection section including particular acoustic signal, method and program thereof
US7676362B2 (en) * 2004-12-31 2010-03-09 Motorola, Inc. Method and apparatus for enhancing loudness of a speech signal
US7567903B1 (en) * 2005-01-12 2009-07-28 At&T Intellectual Property Ii, L.P. Low latency real-time vocal tract length normalization
US7825321B2 (en) * 2005-01-27 2010-11-02 Synchro Arts Limited Methods and apparatus for use in sound modification comparing time alignment data from sampled audio signals
JP4645241B2 (en) * 2005-03-10 2011-03-09 ヤマハ株式会社 Voice processing apparatus and program
US8280730B2 (en) 2005-05-25 2012-10-02 Motorola Mobility Llc Method and apparatus of increasing speech intelligibility in noisy environments
US20070036297A1 (en) * 2005-07-28 2007-02-15 Miranda-Knapp Carlos A Method and system for warping voice calls
US8239190B2 (en) * 2006-08-22 2012-08-07 Qualcomm Incorporated Time-warping frames of wideband vocoder
GB2443027B (en) * 2006-10-19 2009-04-01 Sony Comp Entertainment Europe Apparatus and method of audio processing
US20090071315A1 (en) * 2007-05-04 2009-03-19 Fortuna Joseph A Music analysis and generation method
US8762143B2 (en) 2007-05-29 2014-06-24 At&T Intellectual Property Ii, L.P. Method and apparatus for identifying acoustic background environments based on time and speed to enhance automatic speech recognition
US7689421B2 (en) * 2007-06-27 2010-03-30 Microsoft Corporation Voice persona service for embedding text-to-speech features into software programs
US8255222B2 (en) * 2007-08-10 2012-08-28 Panasonic Corporation Speech separating apparatus, speech synthesizing apparatus, and voice quality conversion apparatus
US8036891B2 (en) * 2008-06-26 2011-10-11 California State University, Fresno Methods of identification using voice sound analysis
WO2010003068A1 (en) * 2008-07-03 2010-01-07 The Board Of Trustees Of The University Of Illinois Systems and methods for identifying speech sound features
US8595005B2 (en) * 2010-05-31 2013-11-26 Simple Emotion, Inc. System and method for recognizing emotional state from a speech signal
US8670577B2 (en) 2010-10-18 2014-03-11 Convey Technology, Inc. Electronically-simulated live music
US9264840B2 (en) 2012-05-24 2016-02-16 International Business Machines Corporation Multi-dimensional audio transformations and crossfading
US10638221B2 (en) * 2012-11-13 2020-04-28 Adobe Inc. Time interval sound alignment
US9201580B2 (en) 2012-11-13 2015-12-01 Adobe Systems Incorporated Sound alignment user interface
US9355649B2 (en) 2012-11-13 2016-05-31 Adobe Systems Incorporated Sound alignment using timing information
US10249321B2 (en) 2012-11-20 2019-04-02 Adobe Inc. Sound rate modification
US9451304B2 (en) 2012-11-29 2016-09-20 Adobe Systems Incorporated Sound feature priority alignment
WO2014098498A1 (en) * 2012-12-20 2014-06-26 삼성전자 주식회사 Audio correction apparatus, and audio correction method thereof
KR102212225B1 (en) * 2012-12-20 2021-02-05 삼성전자주식회사 Apparatus and Method for correcting Audio data
US9025822B2 (en) 2013-03-11 2015-05-05 Adobe Systems Incorporated Spatially coherent nearest neighbor fields
US9129399B2 (en) 2013-03-11 2015-09-08 Adobe Systems Incorporated Optical flow with nearest neighbor field fusion
US9165373B2 (en) 2013-03-11 2015-10-20 Adobe Systems Incorporated Statistics of nearest neighbor fields
US9031345B2 (en) 2013-03-11 2015-05-12 Adobe Systems Incorporated Optical flow accounting for image haze
NL2011811C2 (en) 2013-11-18 2015-05-19 Genicap Beheer B V METHOD AND SYSTEM FOR ANALYZING AND STORING INFORMATION.
JP2017508188A (en) 2014-01-28 2017-03-23 シンプル エモーション, インコーポレイテッド (Simple Emotion, Inc.) A method for adaptive spoken dialogue
US10453434B1 (en) 2017-05-16 2019-10-22 John William Byrd System for synthesizing sounds from prototypes
US10981073B2 (en) * 2018-10-22 2021-04-20 Disney Enterprises, Inc. Localized and standalone semi-randomized character conversations
US11205056B2 (en) * 2019-09-22 2021-12-21 Soundhound, Inc. System and method for voice morphing
EP4038610A1 (en) 2019-12-02 2022-08-10 Google LLC Methods, systems, and media for seamless audio melding

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4706537A (en) * 1985-03-07 1987-11-17 Nippon Gakki Seizo Kabushiki Kaisha Tone signal generation device
US5327521A (en) * 1992-03-02 1994-07-05 The Walt Disney Company Speech transformation system
US5371315A (en) * 1986-11-10 1994-12-06 Casio Computer Co., Ltd. Waveform signal generating apparatus and method for waveform editing system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0410045A1 (en) * 1989-07-27 1991-01-30 Koninklijke Philips Electronics N.V. Image audio transformation system, particularly as a visual aid for the blind
US5291557A (en) * 1992-10-13 1994-03-01 Dolby Laboratories Licensing Corporation Adaptive rematrixing of matrixed audio signals
US5473759A (en) * 1993-02-22 1995-12-05 Apple Computer, Inc. Sound analysis and resynthesis using correlograms
US5583961A (en) * 1993-03-25 1996-12-10 British Telecommunications Public Limited Company Speaker recognition using spectral coefficients normalized with respect to unequal frequency bands
US5625749A (en) * 1994-08-22 1997-04-29 Massachusetts Institute Of Technology Segment-based apparatus and method for speech recognition by analyzing multiple speech unit frames and modeling both temporal and spatial correlation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
COVELL M ET AL: "SPANNING THE GAP BETWEEN MOTION ESTIMATION AND MORPHING", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), I. IMAGE AND MULTIDIMENSIONAL SIGNAL PROCESSING ADELAIDE, APR. 19 - 22, 1994, vol. 5, 19 April 1994 (1994-04-19), INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, pages V-213 - V-216, XP000533722 *
DAVIS S.B., MERMELSTEIN P.: "comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, vol. 28, no. 4, 4 August 1980 (1980-08-04), pages 357 - 366, XP002036829 *
TELLMAN E., ET AL.: "timbre morphing of sounds with unequal numbers of features", JOURNAL OF THE AUDIO ENGINEERING SOCIETY, vol. 43, no. 9, 1 July 1995 (1995-07-01), pages 678 - 689, XP002036828 *

Also Published As

Publication number Publication date
AU2216597A (en) 1997-10-01
US5749073A (en) 1998-05-05

Similar Documents

Publication Publication Date Title
US5749073A (en) System for automatically morphing audio information
Slaney et al. Automatic audio morphing
EP0979503B1 (en) Targeted vocal transformation
US5248845A (en) Digital sampling instrument
KR940002854B1 (en) Sound synthesizing system
Watanabe Formant estimation method using inverse-filter control
US8280724B2 (en) Speech synthesis using complex spectral modeling
JPH06266390A (en) Waveform editing type speech synthesizing device
Bonada et al. Sample-based singing voice synthesizer by spectral concatenation
CN108369803A (en) Method for forming the excitation signal of a parametric speech synthesis system based on a glottal pulse model
JPH09512645A (en) Multi-pulse analysis voice processing system and method
Caetano et al. A source-filter model for musical instrument sound transformation
Jensen The timbre model
JP2018077283A (en) Speech synthesis method
Wright et al. Analysis/synthesis comparison
Verfaille et al. Adaptive digital audio effects
JP3281266B2 (en) Speech synthesis method and apparatus
JP4469986B2 (en) Acoustic signal analysis method and acoustic signal synthesis method
Villavicencio et al. Efficient pitch estimation on natural opera-singing by a spectral correlation based strategy
JP3468337B2 (en) Interpolated tone synthesis method
Arroabarren et al. Inverse filtering in singing voice: A critical analysis
Wang et al. Beijing opera synthesis based on straight algorithm and deep learning
JP6834370B2 (en) Speech synthesis method
Pardo et al. Applying source separation to music
Driedger Processing music signals using audio decomposition techniques

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE HU IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG US UZ VN AM AZ BY KG KZ MD RU TJ TM

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH KE LS MW SD SZ UG AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
WR Later publication of a revised version of an international search report
NENP Non-entry into the national phase

Ref country code: JP

Ref document number: 97532913

Format of ref document f/p: F

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA