WO1998006090A1 - Speech/audio coding with non-linear spectral-amplitude transformation - Google Patents


Info

Publication number
WO1998006090A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
audio signal
encoding
linear transform
signal
Prior art date
Application number
PCT/CA1997/000543
Other languages
French (fr)
Inventor
Roch Lefebvre
Claude Laflamme
Jean-Pierre Adoul
Original Assignee
Universite De Sherbrooke
Priority date
Filing date
Publication date
Application filed by Universite De Sherbrooke filed Critical Universite De Sherbrooke
Priority to AU36901/97A priority Critical patent/AU3690197A/en
Publication of WO1998006090A1 publication Critical patent/WO1998006090A1/en


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0204Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using subband decomposition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0264Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Definitions

  • the present invention relates to the field of digital encoding of speech and/or audio signals, where the main objective is to maintain the highest possible sound quality at a given bit rate.
  • it is widely recognized that, to reduce the bit rate of speech and audio encoders without compromising quality, proper care should be given to the spectral shape of the coding noise.
  • a general rule is that the short-term spectrum of the coding noise must follow the short-term spectrum of the speech and/or audio signal. This is known as noise shaping.
  • a time-varying perceptual filter controls the noise level as a function of frequency.
  • This perceptual filter is derived from an autoregressive filter which models the formants, or spectral envelope, of the speech spectrum. Hence, the noise spectrum approximately follows the speech formants. Further perceptual improvements can be achieved by using a post-filter, which emphasizes the formant and harmonic structure of the synthesized speech signal.
  • the noise spectrum is controlled by dynamic bit allocation in the frequency domain.
  • a sophisticated hearing model is used to determine a masking threshold. Bit allocation is conducted so as to maintain the distortion below the masking threshold at any frequency (provided the encoder operates at a sufficient bit rate).
  • the resulting coding noise will be correlated to the signal spectrum, with corresponding peaks and valleys.
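The bit-allocation rule described above can be illustrated with a deliberately simplified sketch (the function name, the per-band representation and the 6 dB-per-bit noise model are illustrative assumptions, not taken from this patent): each frequency band receives just enough bits to push its quantization noise below the masking threshold.

```python
def allocate_bits(band_energy_db, mask_db, max_bits=16):
    """For each band, find the smallest bit count whose quantization
    noise (approximated as energy minus 6 dB per bit) falls below the mask."""
    bits = []
    for energy, mask in zip(band_energy_db, mask_db):
        b = 0
        while energy - 6.0 * b > mask and b < max_bits:
            b += 1
        bits.append(b)
    return bits

# Bands whose energy exceeds the mask get bits; bands already below it get none.
print(allocate_bits([60.0, 40.0, 20.0], [30.0, 30.0, 30.0]))  # → [5, 2, 0]
```

Because the bit counts track the signal-to-mask ratio band by band, the resulting noise spectrum follows the signal spectrum, with corresponding peaks and valleys.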
  • the perceptual filter and bit allocation discussed above have to be adaptive; more specifically, they are time-varying functions. This adaptation implies either that side information has to be transmitted to the decoder, or that the noise shaping function is an inherent part of the encoder.
  • An object of the present invention is therefore to overcome the above-discussed drawbacks of the prior art.
  • Another object of the present invention is to provide a speech/audio encoding method and device capable of enhancing the perceptual quality of speech and/or audio signals.
  • a further object of the present invention is to provide an encoding/decoding method and device conducting a warping of the spectral amplitude of a speech and/or audio signal prior to encoding, and an unwarping of the spectral amplitude of the speech and/or audio signal after decoding, in view of enhancing the perceptual quality of the encoded and subsequently synthesized speech and/or audio signal.
  • a method of encoding a speech and/or audio signal in view of enhancing perceptual quality, comprising the steps of non-linearly transforming the speech and/or audio signal, and encoding the non-linearly transformed speech and/or audio signal to produce an encoded speech and/or audio signal.
  • the speech and/or audio signal is a time-domain speech and/or audio signal
  • the step of producing a spectrum representation comprises: breaking down the time-domain speech and/or audio signal into a succession of overlapping finite-duration speech and/or audio signal segments; and applying a first linear transform to each speech and/or audio signal segment to obtain short-term spectral components;
  • the step of non-linearly transforming the spectrum representation of the speech and/or audio signal comprises the step of: applying a non-linear transform to the short-term spectral components in order to produce warped spectral components; - the encoding step is performed in the time domain, and the method further comprises, prior to the encoding step, the steps of: applying a second linear transform to the warped spectral components to obtain a time-domain signal interval; multiplying the time-domain signal interval by a time window to produce a windowed-signal interval; and adding the successive overlapping windowed-signal intervals corresponding to the successive overlapping finite-duration speech and/or audio signal segments to obtain a pre-processed signal applied to the encoding step for encoding the non-linearly transformed spectrum representation; wherein the second linear transform is the inverse of the first linear transform.
  • Ak' = fnl(Ak), where Ak is the amplitude of the kth spectral component Sk, Ak' is the amplitude of the kth warped spectral component Sk', and fnl(Ak) is a non-linear function of Ak, wherein the non-linear function is given by
  • fnl(Ak) = Ak^α(k), where α(k) is a constant, or by fnl(Ak) = logb(Ak).
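The two example non-linear functions named here, a power law Ak^α(k) and a logarithm logb(Ak), can be sketched directly (the exponent, the log base and the function names are illustrative free parameters, not values prescribed by this patent):

```python
import math

def warp_power(amplitude, alpha=0.5):
    """fnl(Ak) = Ak**alpha(k): compresses large amplitudes when alpha < 1."""
    return amplitude ** alpha

def warp_log(amplitude, base=10.0):
    """fnl(Ak) = log_b(Ak): an even stronger amplitude compression."""
    return math.log(amplitude, base)

# A 40 dB amplitude gap (100 vs 1) shrinks to 20 dB after warping with alpha = 0.5.
print(warp_power(100.0), warp_power(1.0))  # → 10.0 1.0
print(warp_log(100.0))
```

Both choices reduce the spectral dynamic range before encoding, which is what lets a fixed encoder spend relatively more accuracy on low-energy spectral regions.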
  • a method of decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal, in which the speech and/or audio signal has been encoded by producing a spectrum representation of the speech and/or audio signal, non-linearly transforming the spectrum representation of the speech and/or audio signal, and encoding the non-linearly transformed spectrum representation to produce the encoded speech and/or audio signal.
  • the decoding method comprises the steps of decoding the encoded speech and/or audio signal to recover a non-linearly transformed spectrum representation of the speech and/or audio signal, non-linearly transforming the recovered non-linearly transformed spectrum representation to recover a spectrum representation of the speech and/or audio signal, and transforming the recovered spectrum representation into the synthesized speech and/or audio signal.
  • a method for decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal, in which the encoded speech and/or audio signal has been produced by breaking down a time-domain speech and/or audio signal into a succession of overlapping finite-duration speech and/or audio signal segments, applying a first linear transform to each of the speech and/or audio signal segments to obtain short-term spectral components, applying a first non-linear transform to the short-term spectral components in order to produce warped spectral components, and encoding the warped spectral components to produce an encoded speech and/or audio signal.
  • the decoding method comprises the steps of decoding the encoded speech and/or audio signal to recover warped signal components, applying a second non-linear transform to the recovered warped signal components to produce unwarped short-term spectral components, and applying a second linear transform to the unwarped signal components to produce the synthesized speech and/or audio signal.
  • the present invention further relates to a method of decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal, in which the encoded speech and/or audio signal has been produced by breaking down a time-domain speech and/or audio signal into a succession of first overlapping finite-duration speech and/or audio signal segments, applying a first linear transform to each of the first speech and/or audio signal segments to obtain first short-term spectral components, applying a first non-linear transform to the first short-term spectral components in order to produce warped spectral components, applying a second linear transform to the warped spectral components to obtain a first time-domain signal interval, multiplying the first time-domain signal interval by a first time window to produce a first windowed-signal interval, adding the successive overlapping first windowed-signal intervals corresponding to the successive first overlapping finite-duration speech and/or audio signal segments to obtain a time-domain pre-processed speech and/or audio signal, and encoding the time-domain pre-processed speech and/or audio signal to produce the encoded speech and/or audio signal.
  • the decoding method comprises the steps of decoding the encoded speech and/or audio signal to recover a decoded time-domain pre-processed speech and/or audio signal, breaking down the decoded time-domain pre-processed speech and/or audio signal into a succession of second overlapping finite-duration speech and/or audio signal segments, applying a third linear transform to each of the second speech and/or audio signal segments to obtain second short-term spectral components, applying a second non-linear transform to the second short-term spectral components in order to produce unwarped spectral components, applying a fourth linear transform to the unwarped spectral components to obtain a second time-domain signal interval, multiplying the second time-domain signal interval by a second time window to produce a second windowed-signal interval, adding the successive overlapping second windowed-signal intervals corresponding to the successive second overlapping finite-duration speech and/or audio signal segments to obtain the synthesized speech and/or audio signal.
  • the present invention also concerns a device for carrying into practice the above-defined encoding and decoding methods.
  • adaptive noise shaping may be obtained with a constant function. More precisely, a constant non-linear transformation is applied to the speech/audio short-term spectrum prior to the quantizing and/or encoding per se. Since noise shaping is performed by means of a constant function, this function can be viewed as a completely separate entity from the encoder, making it possible to enhance the noise shaping capability of an existing encoder without modifying the encoder itself, or having to transmit additional side information.
  • Figure 1a, which is labelled as "prior art", is a simplified block diagram of a speech/audio encoding device using pre-processing;
  • Figure 1b, which is labelled as "prior art", is a simplified block diagram of a speech/audio decoding device using post-processing, this speech/audio decoding device corresponding to the speech/audio encoding device of Figure 1a;
  • Figure 2a is a block diagram of a speech/audio encoding device using spectral-amplitude warping as pre-processing, in which the input signal is a time- domain speech and/or audio signal;
  • Figure 2b is a block diagram of a speech/audio decoding device using spectral-amplitude unwarping as post-processing, this speech/audio decoding device corresponding to the speech/audio encoding device of Figure 2a;
  • Figure 3a is a block diagram of a more general configuration of the speech/audio encoding device of Figure 2a, still using spectral-amplitude warping as pre-processing;
  • Figure 3b is a block diagram of a more general configuration of the speech/audio decoding device of Figure 2b, still using spectral-amplitude unwarping as post-processing;
  • Figure 4a is a graph showing an example of short-term amplitude spectrum of a voiced speech segment, using a fast Fourier transform (FFT);
  • Figure 4b is a graph showing the short-term amplitude spectrum of the coding noise, corresponding to the speech segment of Figure 4a, when this speech segment is encoded with wideband speech coding standard G.722 [P. Mermelstein, "G.722, a new CCITT coding standard for digital transmission of wideband audio signals", IEEE Communications Magazine, Vol. 26, No. 1, 1988] at 48 kbits/second; and
  • Figure 4c is a graph showing the resulting short-term amplitude spectrum of the coding noise, corresponding to the speech segment of Figure 4a, when this speech segment is encoded with the same G.722 standard at 48 kbits/second using spectral-amplitude warping as pre-processing and spectral-amplitude unwarping as post-processing.
  • Noise shaping can be implemented through the two following, fundamentally different approaches: (1) a weighting measure is included in the encoder itself, which weighting measure determines the level of accuracy reached in quantizing each spectral component; or (2) the speech and/or audio signal is pre-processed prior to encoding, and correspondingly post-processed after decoding, so that noise shaping is performed outside the encoder itself.
  • the present invention is concerned with the pre-processing approach (approach No. (2)).
  • Figure 1a is a simplified block diagram of a speech/audio encoding device 100 with a pre-processing module 102.
  • the speech/audio encoding device 100 comprises an optional module 101 for conditioning the input speech and/or audio signal 106.
  • Input signal conditioning module 101 conditions the input speech and/or audio signal 106 to account for operations such as soft saturation to prevent clipping, high-pass filtering to remove DC component, gain control, etc.
  • signal conditioning is viewed as an entity separate from signal pre-processing.
  • the speech/audio encoding device 100 also comprises the preprocessing module 102 per se.
  • the main purpose of the pre-processing module is to improve the perceptual quality of the speech/audio encoding device 100. More specifically, module 102 modifies the speech and/or audio signal 106, conditioned by a signal conditioning module 101 or not, to emphasize the perceptually relevant features of this signal prior to encoding. This enables proper encoding of these perceptually relevant features.
  • the characteristics of the pre-processing module 102 will be fully explained in the following description.
  • the pre-processed speech and/or audio signal from module 102 is then supplied to the encoder 103.
  • the encoder 103 produces a bitstream 108 to be transmitted over a communication channel.
  • Figure 1b is a simplified block diagram of a speech/audio decoding device 107 comprising a post-processing module 105.
  • the bitstream 108 from the encoder 103 is received by a decoder 104 of the speech/audio decoding device 107.
  • Decoder 104 produces a pre-processed synthesis speech and/or audio signal 110 in response to the received bitstream 108.
  • the post-processing module 105 conducts a post-processing operation which is typically the inverse of the pre-processing operation conducted by module 102 of Figure 1a. Hence, if the output of the decoder 104 was exactly the same as the input of the encoder 103, i.e. if there were no coding noise and no channel noise, then the synthesized speech and/or audio signal 109 would be exactly the same as the input speech and/or audio signal 106 (conditioned by module 101 or not).
  • the pre-processing 102 and post-processing 105 modules have been mostly implemented by means of linear filters.
  • a drawback of this linear-filter implementation is that adaptation of the pre-processing and post-processing requires adaptation of the linear filters themselves. This implementation also requires the transmission of additional side information to the decoder 104 in order to adapt the post-processing accordingly.
  • Figure 2a is a block diagram of the speech/audio encoding device 100, in which the pre-processing module 102 is broken down into four distinct modules 202-205.
  • the input speech and/or audio signal 106 is a time-domain signal, conditioned by the module 101 or not, and consisting of a block of samples supplied at recurrent time intervals called frames.
  • This signal structure is well known to those of ordinary skill in the art and, accordingly, will not be further described in the present disclosure.
  • the speech/audio encoding device 100 comprises a module 202 for multiplying the block of samples of the input speech and/or audio signal 106, conditioned by module 101 or not, by a time window to produce a windowed signal 201. In this manner, the speech and/or audio signal 106 is broken down into a succession of overlapping finite-duration speech and/or audio signal segments.
  • the window can be a rectangular window, a Hanning or Hamming window [A.V. Oppenheim, A.S. Willsky, "Signals and Systems", Prentice-Hall Signal Processing Series, 1983], or a window having a more complex form.
  • module 203 applies a linear transform, for example a Fast Fourier Transform (FFT), to each speech and/or audio signal segment to obtain short-term spectral components [A.V. Oppenheim, A.S. Willsky, "Signals and Systems", Prentice-Hall Signal Processing Series, 1983]. It is within the scope of the present invention to use any other similar linear transform, including, but not limited to, the Sine, Cosine and MLT transforms, that can be represented by a set of spectral components each having an amplitude, a phase (or a sign) and a distinct index on a frequency scale.
  • Sk is the kth short-term spectral component, and Ak is the amplitude of the spectral component Sk.
  • The function of module 204 is to apply a non-linear transformation (non-linear warping) to the spectral components Sk in order to produce so-called "warped" spectral components Sk'.
  • This operation can be summarized as follows: Ak' = fnl(Ak), where fnl(Ak) is a non-linear function of Ak, and Ak' is the amplitude of the warped spectral component Sk'.
  • Examples of fnl(Ak) include, but are not limited to: fnl(Ak) = Ak^α(k), where α(k) is a constant, possibly the same for all indexes k; and fnl(Ak) = logb(Ak).
  • Module 205 then applies the inverse of the linear transform of module 203, i.e. the inverse Fast Fourier Transform (IFFT), to the warped spectral components Sk'. This yields a time-domain signal 206. To minimize the effects of discontinuities (frame effects), successive frames of time-domain signal 206 are added using overlap-add.
  • module 205 applies a second linear transform to the warped spectral components to obtain a time-domain signal interval, multiplies the time-domain signal interval by a time window to produce a windowed-signal interval, and adds the successive overlapping windowed-signal intervals corresponding to the above mentioned successive overlapping finite-duration speech and/or audio signal segments to obtain the time-domain pre-processed signal 206 applied to the encoder 103.
  • other methods such as overlap-discard could be used to reconstruct a continuous time-domain signal from successive frames.
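A minimal per-frame sketch of the pre-processing chain of modules 202-205 (windowing, FFT, amplitude warping with the phase kept intact, inverse FFT, overlap-add) might look as follows; the Hanning window, 50% overlap, frame length of 256 samples and alpha = 0.5 are illustrative choices, not prescribed by the patent:

```python
import numpy as np

def warp_frame(frame, alpha=0.5):
    """Warp the short-term spectral amplitudes of one windowed frame."""
    spectrum = np.fft.rfft(frame)                       # module 203: linear transform
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    warped = (amplitude ** alpha) * np.exp(1j * phase)  # module 204: Ak' = Ak**alpha
    return np.fft.irfft(warped, n=len(frame))           # module 205: inverse transform

def preprocess(signal, frame_len=256):
    """Modules 202 and 205: windowing, per-frame warping, overlap-add."""
    hop = frame_len // 2
    window = np.hanning(frame_len)
    out = np.zeros(len(signal) + frame_len)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window  # module 202
        out[start:start + frame_len] += warp_frame(frame) * window
    return out[:len(signal)]

# A 440 Hz tone at a 16 kHz sampling rate, pre-processed frame by frame.
noisy_tone = np.sin(2 * np.pi * 440 * np.arange(2048) / 16000.0)
print(preprocess(noisy_tone).shape)  # → (2048,)
```

The synthesis window applied before overlap-add is one common way to smooth frame boundaries; as the text notes, other reconstruction methods such as overlap-discard would serve equally well.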
  • Figure 2b is a block diagram of the speech/audio decoding device 107, in which the post-processing module 105 is broken down into distinct modules 209-212.
  • the decoder 208 receives the bitstream 108 from the encoder 103 and, in response to this bitstream, produces a time-domain pre-processed synthesis speech and/or audio signal 110. Since the input to the encoder 103 is a pre-processed signal, the output of the decoder 208 requires post-processing to recover a synthesized speech and/or audio signal 109 suitable for listening.
  • modules 209, 210 and 212 of Figure 2b are identical to modules 202, 203 and 205 of Figure 2a, respectively. However, it is within the scope of the present invention to use modules 209, 210 and 212 different from modules 202, 203 and 205.
  • a non-linear transform is applied by module 211 to the short-term spectral components produced by module 210 to perform an operation referred to as non-linear spectral-amplitude "unwarping", since the non-linear transform of module 211 is the inverse of the non-linear transform of module 204 (Figure 2a). Indeed, the term "unwarping" emphasizes the fact that this operation is essentially the inverse of the above described spectral-amplitude warping. More specifically, if fnl is the warping function applied at the encoder side, the unwarping applies its inverse so that the original spectral amplitudes Ak are recovered.
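For the power-law warping, for example, the decoder-side unwarping is simply the reciprocal exponent, so that warping followed by unwarping restores the original amplitude exactly (alpha = 0.5 is an illustrative value; the function names are hypothetical):

```python
def warp(amplitude, alpha=0.5):
    """Encoder-side non-linear warping: Ak' = Ak**alpha."""
    return amplitude ** alpha

def unwarp(amplitude, alpha=0.5):
    """Decoder-side unwarping: the inverse function Ak = Ak'**(1/alpha)."""
    return amplitude ** (1.0 / alpha)

# Round trip: absent coding and channel noise, the amplitude is recovered.
print(unwarp(warp(81.0)))  # → 81.0
```

This inverse relationship is what makes the scheme transparent when there is no coding noise: only the noise added between warp and unwarp gets reshaped.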
  • Figure 4a shows the amplitude E (dB) as a function of frequency (kHz) of the short-term Fourier spectrum for a voiced segment of female speech, using a Fast Fourier Transform (FFT).
  • Figure 4b shows the amplitude E(dB) of the short-term Fourier spectrum as a function of frequency (kHz) of the coding noise, corresponding to the voiced segment spectrum amplitude of Figure 4a, when this speech segment is encoded using ITU wideband speech coding standard G.722 [3] at 48 kbits/second. (ITU is the successor to CCITT).
  • the coding noise is the difference signal between the original speech and/or audio signal 106, conditioned or not by module 101 , and the synthesized speech and/or audio signal 109 at the output of the speech/audio decoding device 107.
  • Figures 4a and 4b show that, between 2 kHz and 5 kHz, the noise spectrum exceeds the original speech spectrum, which results in audible distortion.
  • Figure 4c shows the resulting coding noise when pre-processing (spectral-amplitude warping) and post-processing (spectral-amplitude unwarping) are used respectively prior to encoding and after decoding, with wideband speech coding standard G.722 at 48 kbits/second.
  • the pre-processing is that performed by module 102 of Figure 2a;
  • the post-processing is that performed by module 105 of Figure 2b.
  • the non-linear warping operation of Module 204 is, in this particular case, as follows:
  • the noise spectrum is strongly correlated with the original speech spectrum of Figure 4a.
  • both spectra present corresponding peaks and valleys.
  • the noise spectrum of Figure 4c is much more attenuated in the low energy portions of the original speech spectrum.
  • An obvious example is in the 2-4 kHz region, where the noise level is more than 10 dB below the noise level of the G.722 encoder without pre-processing and post-processing.
  • the present invention can be generalized to the cases where the encoder 103 does not necessarily operate in the time-domain, for example to transform encoders operating directly in the frequency domain.
  • In Figure 3a, a generalized version of the speech/audio encoding device 100 of Figure 2a is presented.
  • the pre-processing operation conducted by module 102 is decomposed into three functions.
  • Modules 302, 303 and 304 are identical to module 202, 203 and 204 of Figure 2a, respectively.
  • the input of the encoder 103 in Figure 3a is then the "warped" spectral components, instead of the time-domain signal as described with reference to Figure 2a.
  • the encoder 103 has the choice either to operate in the frequency domain directly, as in the case of transform/sub-band encoders, or to apply the inverse linear transform and overlap-add functions of module 205 of Figure 2a to obtain a time-domain signal prior to encoding.
  • Figure 3b shows a modified version of the speech/audio decoding device 107 with post-processing of Figure 2b.
  • the output of the decoder 104 is then assumed to be a frequency-domain signal, i.e. a series of quantized (synthesized) spectral components, as in the case of transform/sub-band decoders.
  • Modules 308 and 309 are similar to modules 211 and 212 of Figure 2b. If the encoder 103 operates in the time domain, it is assumed that the decoding device 107 of Figure 3b includes internally modules 209 and 210 of Figure 2b to provide the spectral components required at the input of module 308 of Figure 3b.

Abstract

In a method and device for encoding a speech and/or audio signal in view of enhancing perceptual quality, the time-domain speech and/or audio signal is broken down into a succession of overlapping finite-duration speech and/or audio signal segments, and a first linear transform is applied to each of these speech and/or audio signal segments to obtain short-term spectral components. Then, a non-linear transform is applied to the short-term spectral components in order to produce warped spectral components. A second linear transform is applied to the warped spectral components to obtain a time-domain signal interval, the time-domain signal interval is multiplied by a time window to produce a windowed-signal interval, and the successive overlapping windowed-signal intervals corresponding to the successive overlapping finite-duration speech and/or audio signal segments are added to obtain a pre-processed signal. This pre-processed signal is encoded to produce the encoded speech and/or audio signal. Corresponding speech/audio decoding method and device are also provided.

Description

SPEECH/AUDIO CODING WITH NON-LINEAR SPECTRAL-AMPLITUDE TRANSFORMATION
BACKGROUND OF THE INVENTION
1. Field of the invention:
The present invention relates to the field of digital encoding of speech and/or audio signals, where the main objective is to maintain the highest possible sound quality at a given bit rate.
2. Brief description of the prior art:
There is an increasing demand for high-quality encoding of speech and audio signals at low bit rates. Applications such as video conference, video telephony and multimedia require the transmission of audio signals with a high level of fidelity over channels of limited bandwidth and capacity.
It is widely recognized that, to reduce the bit rate of speech and audio encoders without compromising quality, proper care should be given to the spectral shape of the coding noise. A general rule is that the short-term spectrum of the coding noise must follow the short-term spectrum of the speech and/or audio signal. This is known as noise shaping.
In low bit-rate speech encoders (<16 kbits/second), which typically use the CELP paradigm [M. Schroeder, B. Atal, "Code-Excited Linear Prediction (CELP): High-quality speech at very low bit rates", Proc. IEEE Int. Conf. ASSP, 1985, pp. 937-940], a time-varying perceptual filter controls the noise level as a function of frequency. This perceptual filter is derived from an autoregressive filter which models the formants, or spectral envelope, of the speech spectrum. Hence, the noise spectrum approximately follows the speech formants. Further perceptual improvements can be achieved by using a post-filter, which emphasizes the formant and harmonic structure of the synthesized speech signal.
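The time-varying perceptual filter mentioned above is commonly derived from the LPC analysis filter A(z) by bandwidth expansion, W(z) = A(z/gamma1)/A(z/gamma2) with 0 < gamma2 < gamma1 <= 1. The sketch below shows this standard CELP construction; the coefficient values and gamma choices are illustrative, and this formulation is general CELP practice rather than something specified in this patent:

```python
def bandwidth_expand(lpc, gamma):
    """Form the coefficients of A(z/gamma): scale the i-th LPC
    coefficient a_i by gamma**i, which widens the formant peaks."""
    return [a * gamma ** i for i, a in enumerate(lpc)]

# Hypothetical 2nd-order LPC polynomial [1, a1, a2] with a formant resonance.
a = [1.0, -1.6, 0.64]
numerator = bandwidth_expand(a, 0.9)    # A(z/gamma1): stays close to the formants
denominator = bandwidth_expand(a, 0.6)  # A(z/gamma2): a much flatter response
print(numerator, denominator)
```

The ratio of the two bandwidth-expansion factors sets how closely the shaped coding noise is allowed to hug the formant envelope of the speech spectrum.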
In higher bit-rate audio encoders, which are typically transform or sub-band encoders, the noise spectrum is controlled by dynamic bit allocation in the frequency domain. In the most complex algorithms, a sophisticated hearing model is used to determine a masking threshold. Bit allocation is conducted so as to maintain the distortion below the masking threshold at any frequency (provided the encoder operates at a sufficient bit rate). The resulting coding noise will be correlated to the signal spectrum, with corresponding peaks and valleys.
Because of the non-stationarities of speech and audio signals, the perceptual filter and bit allocation discussed above have to be adaptive; more specifically, they are time-varying functions. This adaptation implies either that side information has to be transmitted to the decoder, or that the noise shaping function is an inherent part of the encoder.
OBJECTS OF THE INVENTION
An object of the present invention is therefore to overcome the above-discussed drawbacks of the prior art.
Another object of the present invention is to provide a speech/audio encoding method and device capable of enhancing the perceptual quality of speech and/or audio signals.
A further object of the present invention is to provide an encoding/decoding method and device conducting a warping of the spectral amplitude of a speech and/or audio signal prior to encoding, and an unwarping of the spectral amplitude of the speech and/or audio signal after decoding, in view of enhancing the perceptual quality of the encoded and subsequently synthesized speech and/or audio signal.
SUMMARY OF THE INVENTION
More specifically, in accordance with the present invention, there is provided a method of encoding a speech and/or audio signal in view of enhancing perceptual quality, comprising the steps of non-linearly transforming the speech and/or audio signal, and encoding the non-linearly transformed speech and/or audio signal to produce an encoded speech and/or audio signal.
The method according to the invention, for encoding a speech and/or audio signal in view of enhancing perceptual quality may comprise the steps of producing a spectrum representation of the speech and/or audio signal, non-linearly transforming the spectrum representation of the speech and/or audio signal, and encoding the non-linearly transformed spectrum representation to produce an encoded speech and/or audio signal.
In accordance with preferred embodiments of the method of encoding a speech and/or audio signal:
- the speech and/or audio signal is a time-domain speech and/or audio signal, and the step of producing a spectrum representation comprises: breaking down the time-domain speech and/or audio signal into a succession of overlapping finite-duration speech and/or audio signal segments; and applying a first linear transform to each speech and/or audio signal segment to obtain short-term spectral components;
- the step of non-linearly transforming the spectrum representation of the speech and/or audio signal comprises the step of: applying a non-linear transform to the short-term spectral components in order to produce warped spectral components; - the encoding step is performed in the time domain, and the method further comprises, prior to the encoding step, the steps of: applying a second linear transform to the warped spectral components to obtain a time-domain signal interval; multiplying the time-domain signal interval by a time window to produce a windowed-signal interval; and adding the successive overlapping windowed-signal intervals corresponding to the successive overlapping finite-duration speech and/or audio signal segments to obtain a pre-processed signal applied to the encoding step for encoding the non-linearly transformed spectrum representation; wherein the second linear transform is the inverse of the first linear transform.
- the non-linear transform applied to the short-term spectral components Sk in order to produce the warped spectral components Sk' is expressed as

Ak' = fnl(Ak)

where - Ak is the amplitude of the kth spectral component Sk;
- Ak' is the amplitude of the kth warped spectral component Sk'; and
- fnl(Ak) is a non-linear function of Ak, wherein the non-linear function is given by

fnl(Ak) = Ak^α(k), where α(k) is a constant, or by

fnl(Ak) = logb(Ak).
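By way of a non-limiting illustration (forming no part of the claimed subject matter), the two example warping functions above can be written down directly. The function names and the small guard against log(0) in the following Python sketch are assumptions of this description, not requirements of the method:

```python
import numpy as np

def warp_power(A, alpha=0.5):
    """Power-law warp fnl(Ak) = Ak**alpha(k); alpha may be a scalar or a per-bin array."""
    return A ** alpha

def warp_log(A, b=10.0, eps=1e-12):
    """Logarithmic warp fnl(Ak) = logb(Ak); eps (illustrative) avoids log(0) on silent bins."""
    return np.log(A + eps) / np.log(b)

amps = np.array([1e-4, 1.0, 1e4])
# With alpha = 0.5 the warp halves the dynamic range expressed in dB,
# boosting low-amplitude spectral components relative to the peaks.
print(warp_power(amps))  # [0.01, 1.0, 100.0]
```

In both cases the mapping compresses the amplitude range of the short-term spectrum, which is what lets a subsequent encoder spend relatively more accuracy on the low-energy spectral regions.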
According to another aspect of the present invention, there is provided a method of decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal, in which the speech and/or audio signal has been encoded by producing a spectrum representation of the speech and/or audio signal, non-linearly transforming the spectrum representation of the speech and/or audio signal, and encoding the non-linearly transformed spectrum representation to produce the encoded speech and/or audio signal. The decoding method comprises the steps of decoding the encoded speech and/or audio signal to recover a non-linearly transformed spectrum representation of the speech and/or audio signal, non-linearly transforming the recovered non-linearly transformed spectrum representation to recover a spectrum representation of the speech and/or audio signal, and transforming the recovered spectrum representation into the synthesized speech and/or audio signal.
According to a further aspect of the subject invention, there is provided a method for decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal, in which the encoded speech and/or audio signal has been produced by breaking down a time-domain speech and/or audio signal into a succession of overlapping finite-duration speech and/or audio signal segments, applying a first linear transform to each of the speech and/or audio signal segments to obtain short-term spectral components, applying a first non-linear transform to the short-term spectral components in order to produce warped spectral components, and encoding the warped spectral components to produce an encoded speech and/or audio signal. The decoding method comprises the steps of decoding the encoded speech and/or audio signal to recover warped signal components, applying a second non-linear transform to the recovered warped signal components to produce unwarped short-term spectral components, and applying a second linear transform to the unwarped signal components to produce the synthesized speech and/or audio signal.
The present invention further relates to a method of decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal, in which the encoded speech and/or audio signal has been produced by breaking down a time-domain speech and/or audio signal into a succession of first overlapping finite-duration speech and/or audio signal segments, applying a first linear transform to each of the first speech and/or audio signal segments to obtain first short-term spectral components, applying a first non-linear transform to the first short-term spectral components in order to produce warped spectral components, applying a second linear transform to the warped spectral components to obtain a first time-domain signal interval, multiplying the first time-domain signal interval by a first time window to produce a first windowed-signal interval, adding the successive overlapping first windowed-signal intervals corresponding to the successive first overlapping finite-duration speech and/or audio signal segments to obtain a time-domain pre-processed speech and/or audio signal, and encoding the time-domain pre-processed speech and/or audio signal to produce the encoded speech and/or audio signal.
The decoding method comprises the steps of decoding the encoded speech and/or audio signal to recover a decoded time-domain pre-processed speech and/or audio signal, breaking down the decoded time-domain pre-processed speech and/or audio signal into a succession of second overlapping finite-duration speech and/or audio signal segments, applying a third linear transform to each of the second speech and/or audio signal segments to obtain second short-term spectral components, applying a second non-linear transform to the second short-term spectral components in order to produce unwarped spectral components, applying a fourth linear transform to the unwarped spectral components to obtain a second time-domain signal interval, multiplying the second time-domain signal interval by a second time window to produce a second windowed-signal interval, and adding the successive overlapping second windowed-signal intervals corresponding to the successive second overlapping finite-duration speech and/or audio signal segments to obtain the synthesized speech and/or audio signal.
Finally, the present invention concerns a device for carrying into practice the above-defined encoding and decoding methods.
An important advantage of the present invention over the prior art is that adaptive noise shaping may be obtained with a constant function. More precisely, a constant non-linear transformation is applied to the speech/audio short-term spectrum prior to the quantizing and/or encoding per se. Since noise shaping is performed by means of a constant function, this function can be viewed as a completely separate entity from the encoder, making it possible to enhance the noise shaping capability of an existing encoder without modifying the encoder itself, or having to transmit additional side information.
The objects, advantages and other features of the present invention will become more apparent upon reading of the following non-restrictive description of preferred embodiments thereof, given by way of example only with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
In the appended drawings:
Figure 1a, which is labelled as "prior art", is a simplified block diagram of a speech/audio encoding device using pre-processing;
Figure 1b, which is labelled as "prior art" is a simplified block diagram of a speech/audio decoding device using post-processing, this speech/audio decoding device corresponding to the speech/audio encoding device of Figure 1a;
Figure 2a is a block diagram of a speech/audio encoding device using spectral-amplitude warping as pre-processing, in which the input signal is a time- domain speech and/or audio signal;
Figure 2b is a block diagram of a speech/audio decoding device using spectral-amplitude unwarping as post-processing, this speech/audio decoding device corresponding to the speech/audio encoding device of Figure 2a;
Figure 3a is a block diagram of a more general configuration of the speech/audio encoding device of Figure 2a, still using spectral-amplitude warping as pre-processing;
Figure 3b is a block diagram of a more general configuration of the speech/audio decoding device of Figure 2b, still using spectral-amplitude unwarping as post-processing;
Figure 4a is a graph showing an example of short-term amplitude spectrum of a voiced speech segment, using a fast Fourier transform (FFT);
Figure 4b is a graph showing the short-term amplitude spectrum of the coding noise, corresponding to the speech segment of Figure 4a, when this speech segment is encoded with wideband speech coding standard G.722 [P. Mermelstein, "G.722, a new CCITT coding standard for digital transmission of wideband audio signals", IEEE Communications Magazine, Vol. 26, No. 1, 1988] at 48 kbits/second; and
Figure 4c is a graph showing the resulting short-term amplitude spectrum of the coding noise, corresponding to the speech segment of Figure 4a, when spectral-amplitude warping and spectral-amplitude unwarping are used respectively prior to encoding and after decoding with wideband speech coding standard G.722 at 48 kbits/second.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In the different figures of the appended drawings, the corresponding elements are identified by the same references.
Reducing the bit rate in speech/audio encoding inevitably increases the distortion introduced by the encoder. To minimize the perceived effects of this distortion, the encoder must resort to noise shaping, which essentially controls the level of noise as a function of time and frequency. Noise shaping can be implemented through the two following, fundamentally different approaches:
(1) a weighting measure is included in the encoder itself, which weighting measure determines the level of accuracy reached in quantizing each spectral component; or
(2) the original speech and/or audio signal is spectrally weighted (pre- processed) prior to the quantizing (encoding) operation per se.
The present invention is concerned with the pre-processing approach (approach No. (2)).
Figure 1a is a simplified block diagram of a speech/audio encoding device 100 with a pre-processing module 102.
As illustrated in Figure 1a, the speech/audio encoding device 100 comprises an optional module 101 for conditioning the input speech and/or audio signal 106. Input signal conditioning module 101 conditions the input speech and/or audio signal 106 to account for operations such as soft saturation to prevent clipping, high-pass filtering to remove DC component, gain control, etc. In the present disclosure, signal conditioning is viewed as an entity separate from signal pre-processing.
The speech/audio encoding device 100 also comprises the preprocessing module 102 per se. The main purpose of the pre-processing module is to improve the perceptual quality of the speech/audio encoding device 100. More specifically, module 102 modifies the speech and/or audio signal 106, conditioned by a signal conditioning module 101 or not, to emphasize the perceptually relevant features of this signal prior to encoding. This enables proper encoding of these perceptually relevant features. The characteristics of the pre-processing module 102 will be fully explained in the following description.
The pre-processed speech and/or audio signal from module 102 is then supplied to the encoder 103. As well known to those of ordinary skill in the art, the encoder 103 produces a bitstream 108 to be transmitted over a communication channel.
Figure 1b is a simplified block diagram of a speech/audio decoding device 107 comprising a post-processing module 105.
The bitstream 108 from the encoder 103 is received by a decoder 104 of the speech/audio decoding device 107. Decoder 104 produces a pre-processed synthesis speech and/or audio signal 110 in response to the received bitstream 108. The post-processing module 105 conducts a post-processing operation which is typically the inverse of the pre-processing operation conducted by module 102 of Figure 1a. Hence, if the output of the decoder 104 was exactly the same as the input of the encoder 103, i.e. if there were no coding noise and no channel noise, then the synthesized speech and/or audio signal 109 would be exactly the same as the input speech and/or audio signal 106 (conditioned by module 101 or not).
In the past, the pre-processing 102 and post-processing 105 modules have been mostly implemented by means of linear filters. A drawback of this linear-filter implementation is that adaptation of the pre-processing and post-processing requires adaptation of the linear filters themselves. This implementation also requires the transmission of additional side information to the decoder 104 in order to adapt the post-processing accordingly.
Figure 2a is a block diagram of the speech/audio encoding device 100, in which the pre-processing module 102 is broken down into four distinct modules 202-205.
In the preferred embodiment of the invention, the input speech and/or audio signal 106 is a time-domain signal, conditioned by the module 101 or not, and consisting of a block of samples supplied at recurrent time intervals called frames. This signal structure is well known to those of ordinary skill in the art and, accordingly, will not be further described in the present disclosure. The speech/audio encoding device 100 comprises a module 202 for multiplying the block of samples of the input speech and/or audio signal 106, conditioned by module 101 or not, by a time window to produce a windowed signal 201. In this manner, the speech and/or audio signal 106 is broken down into a succession of overlapping finite-duration speech and/or audio signal segments. Depending on the transform used in the next module 203, the window can be a rectangular window, a Hanning or Hamming window [A.V. Oppenheim, A.S. Willsky, "Signals and Systems", Prentice-Hall Signal Processing Series, 1983], or a window having a more complex form.
Then, module 203 applies a linear transform, for example a Fast Fourier Transform (FFT), to each speech and/or audio signal segment to obtain short-term spectral components [A.V. Oppenheim, A.S. Willsky, "Signals and Systems", Prentice-Hall Signal Processing Series, 1983]. It is within the scope of the present invention to use any other similar linear transforms including, but not limited to, Sine, Cosine and MLT transforms, that can be represented by a set of spectral components each having an amplitude, a phase (or a sign) and a distinct index on a frequency scale. In the following description, Sk is the kth short-term spectral component, and Ak is the amplitude of the spectral component Sk.
The function of module 204 is to apply a non-linear transformation (non-linear warping) to the spectral components Sk in order to produce so-called "warped" spectral components Sk'. This operation can be summarized as follows:

Ak' = fnl(Ak)

where fnl(Ak) is a non-linear function of Ak and Ak' is the amplitude of the warped spectral component Sk'. Examples of fnl(Ak) include, but are not limited to:

fnl(Ak) = Ak^α(k)

where α(k) is a constant, possibly the same for all indexes k; and

fnl(Ak) = logb(Ak).
Module 205 then applies the inverse of the linear transform of module 203, i.e. the inverse Fast Fourier Transform (IFFT), to the warped spectral components Sk'. This yields a time-domain signal 206. To minimize the effects of discontinuities (frame effects), successive frames of time-domain signal 206 are added using overlap-add. More specifically, module 205 applies a second linear transform to the warped spectral components to obtain a time-domain signal interval, multiplies the time-domain signal interval by a time window to produce a windowed-signal interval, and adds the successive overlapping windowed-signal intervals corresponding to the above mentioned successive overlapping finite-duration speech and/or audio signal segments to obtain the time-domain pre-processed signal 206 applied to the encoder 103. Depending on the window in module 202, other methods such as overlap-discard could be used to reconstruct a continuous time-domain signal from successive frames. The resulting time-domain signal 206 is then supplied as input signal to the encoder 103 which, as described in the foregoing description, produces the bitstream 108 to be transmitted over the communication channel.
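Purely for illustration, the encoder-side chain of modules 202-205 (windowing, linear transform, amplitude warping, inverse transform, synthesis windowing and overlap-add) can be sketched in Python as follows. This is a non-limiting sketch: the frame length, the 50% overlap, the square-root Hanning analysis/synthesis window and the power-law warp are assumed choices of this illustration, not requirements of the invention.

```python
import numpy as np

def spectral_warp(x, frame=256, alpha=0.5):
    """Sketch of the pre-processing of Figure 2a.

    Windows the signal (module 202), takes an FFT (module 203), warps the
    spectral amplitudes while keeping the phases (module 204), then applies
    the inverse FFT, a synthesis window and overlap-add (module 205).
    """
    hop = frame // 2
    # Periodic Hann window; its square overlap-adds to 1 at 50% overlap, so
    # the same square-root window serves for analysis and synthesis.
    win = np.sqrt(np.hanning(frame + 1)[:frame])
    y = np.zeros(len(x))
    for start in range(0, len(x) - frame + 1, hop):
        seg = x[start:start + frame] * win            # module 202: windowing
        S = np.fft.rfft(seg)                          # module 203: first linear transform
        A = np.abs(S)                                 # amplitudes Ak
        Sw = (A ** alpha) * np.exp(1j * np.angle(S))  # module 204: Ak' = Ak**alpha, phase kept
        t = np.fft.irfft(Sw, frame)                   # module 205: inverse transform
        y[start:start + frame] += t * win             # synthesis window + overlap-add
    return y
```

With alpha = 1 the warp is the identity and the chain is transparent away from the signal edges, which gives a quick sanity check; the post-processing of Figure 2b is the same chain with the warp replaced by its inverse, e.g. alpha replaced by 1/alpha in the power-law case.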
Figure 2b is a block diagram of the speech/audio decoding device 107 in which the post-processing module 105 has been divided into four modules 209-212.
The decoder 208 receives the bitstream 108 from the encoder 103 and, in response to this bitstream, produces a time-domain pre-processed synthesis speech and/or audio signal 110. Since the input to the encoder 103 is a pre-processed signal, the output of the decoder 208 requires post-processing to recover a synthesized speech and/or audio signal 109 suitable for listening.
In the preferred embodiment as illustrated in Figures 2a and 2b, modules 209, 210 and 212 of Figure 2b are identical to modules 202, 203 and 205 of Figure 2a, respectively. However, it is within the scope of the present invention to use modules 209, 210 and 212 different from modules 202, 203 and 205.
In module 211, a non-linear transform is applied to the short-term spectral components produced by module 210 to perform an operation referred to as non-linear spectral-amplitude "unwarping", since the non-linear transform of module 211 is the inverse of the non-linear transform of module 204 (Figure 2a). Indeed, the term "unwarping" emphasizes the fact that this operation is essentially the inverse of the above described spectral-amplitude warping. More specifically, if

Ak' = fnl(Ak)

is the non-linear transformation made by module 204 and applied to the amplitude Ak of the kth spectral component, then

Bk = fnl^-1(Bk')

is the non-linear transformation conducted by module 211, where Bk' is the amplitude of the kth "warped" spectral component at the receiving end (output of module 210), where Bk is the amplitude of the kth "unwarped" spectral component, and where fnl^-1 is the inverse of the function fnl. For example, in the case where the non-linear transformation of module 204 is given by

Ak' = fnl(Ak) = Ak^α(k)

then the non-linear transformation in module 211 is given by

Bk = fnl^-1(Bk') = Bk'^(1/α(k)).

It is easy to show that Bk = Ak if Bk' = Ak', i.e. when no coding noise and no channel noise are present. In other words, in this particular example, fnl^-1 is indeed the inverse of fnl.

Referring to Figures 4a, 4b and 4c, the noise shaping capability of the subject invention will be demonstrated.
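Continuing the illustrative Python sketch (an assumption of this description, not a limitation of the invention), the unwarping of module 211 for the power-law case, and the fact that it inverts the warping of module 204, can be checked numerically:

```python
import numpy as np

def warp(A, alpha=0.5):
    """Module 204 (power-law case): Ak' = Ak**alpha(k)."""
    return A ** alpha

def unwarp(B_warped, alpha=0.5):
    """Module 211: Bk = (Bk')**(1/alpha(k)), the inverse of warp()."""
    return B_warped ** (1.0 / alpha)

A = np.array([0.01, 1.0, 100.0])
# With no coding noise and no channel noise, Bk' = Ak' and the round trip is exact.
recovered = unwarp(warp(A))
print(np.allclose(recovered, A))  # True
```

In the presence of coding noise, Bk' differs from Ak', and the unwarping expands the quantization error more in the high-amplitude regions than in the low-amplitude ones, which is precisely the noise shaping effect shown in Figure 4c.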
Figure 4a shows the amplitude E (dB) as a function of frequency (kHz) of the short-term Fourier spectrum for a voiced segment of female speech, using a Fast Fourier Transform (FFT).
Figure 4b shows the amplitude E(dB) of the short-term Fourier spectrum as a function of frequency (kHz) of the coding noise, corresponding to the voiced segment spectrum amplitude of Figure 4a, when this speech segment is encoded using ITU wideband speech coding standard G.722 [3] at 48 kbits/second. (ITU is the successor to CCITT). It is recalled that the coding noise is the difference signal between the original speech and/or audio signal 106, conditioned or not by module 101 , and the synthesized speech and/or audio signal 109 at the output of the speech/audio decoding device 107. It should be noted that there is no pre-processing and post-processing in the case of Figure 4b, to emphasize the distortion introduced by the encoding and decoding devices 100 and 107 themselves. It can be seen that the short-term spectrum of noise is not correlated to the original speech spectrum (Figure 4a).
Figures 4a and 4b show that, between 2 kHz and 5 kHz, the noise spectrum exceeds the original speech spectrum, which results in audible distortion.
Figure 4c shows the resulting coding noise when pre-processing (spectral-amplitude warping) and post-processing (spectral-amplitude unwarping) are used respectively prior to encoding and after decoding, with wideband speech coding standard G.722 at 48 kbits/second. The pre-processing is described in module 102 of Figure 2a, and the post-processing is described in module 105 of Figure 2b. The non-linear warping operation of module 204 is, in this particular case, as follows:
Ak' = fnl(Ak) = Ak^α(k), with α(k) = 0.5 for all k.
The non-linear unwarping of Module 211 of Figure 2b is as follows:
Bk = fnl^-1(Bk') = Bk'^(1/α(k)), with α(k) = 0.5 for all k.
In Figure 4c, the noise spectrum is strongly correlated with the original speech spectrum of Figure 4a. In particular, both spectra present corresponding peaks and valleys. Further, the noise spectrum of Figure 4c is much more attenuated in the low-energy portions of the original speech spectrum. An obvious example is the 2-4 kHz region, where the noise level is more than 10 dB below the noise level of the G.722 encoder without pre-processing and post-processing.
It is important to emphasize the fact that the present invention can be generalized to the cases where the encoder 103 does not necessarily operate in the time-domain, for example to transform encoders operating directly in the frequency domain.
In Figure 3a, a generalized version of the speech/audio encoding device 100 of Figure 2a is presented. Here, the pre-processing operation conducted by module 102 is decomposed into three functions. Modules 302, 303 and 304 are identical to modules 202, 203 and 204 of Figure 2a, respectively. The input of the encoder 103, in Figure 3a, is then the "warped" spectral components, instead of the time-domain signal as described with reference to Figure 2a. The encoder 103 then has the choice either to operate in the frequency domain directly, as in the case of transform/sub-band encoders, or to apply the inverse linear transform and overlap-add functions of module 205 of Figure 2a to obtain a time-domain signal prior to encoding.
In the same manner, Figure 3b shows a modified version of the speech/audio decoding device 107 with post-processing of Figure 2b. The output of the decoder 104 is then assumed to be a frequency-domain signal, i.e. a series of quantized (synthesized) spectral components, as in the case of transform/sub-band decoders. Modules 308 and 309 are similar to modules 211 and 212 of Figure 2b. If the encoder 103 operates in the time domain, it is assumed that the decoding device 107 of Figure 3b includes internally modules 209 and 210 of Figure 2b to provide the spectral components required at the input of module 308 of Figure 3b.
Although the present invention has been described hereinabove by way of a preferred embodiment thereof, this embodiment can be modified at will, within the scope of the appended claims, without departing from the spirit and nature of the subject invention.

Claims

WHAT IS CLAIMED IS:
1. A method of encoding a speech and/or audio signal in view of enhancing perceptual quality, comprising the steps of: producing a spectrum representation of the speech and/or audio signal; non-linearly transforming the spectrum representation of the speech and/or audio signal; and encoding the non-linearly transformed spectrum representation to produce an encoded speech and/or audio signal.
2. A method of encoding a speech and/or audio signal as recited in claim 1 , wherein the speech and/or audio signal is a time-domain speech and/or audio signal, and wherein the step of producing a spectrum representation comprises: breaking down the time-domain speech and/or audio signal into a succession of overlapping finite-duration speech and/or audio signal segments; and applying a first linear transform to each of said speech and/or audio signal segments to obtain short-term spectral components.
3. A method of encoding a speech and/or audio signal as recited in claim 2, wherein the step of non-linearly transforming the spectrum representation of the speech and/or audio signal comprises the step of applying a non-linear transform to the short-term spectral components in order to produce warped spectral components.
4. A method of encoding a speech and/or audio signal as recited in claim 3, wherein the encoding step is performed in the time domain, and wherein said method further comprises, prior to the encoding step, the step of applying a second linear transform to the warped spectral components to obtain a time-domain signal.
5. A method of encoding a speech and/or audio signal as recited in claim 4, wherein the second linear transform is the inverse of said first linear transform.
6. A method of encoding a speech and/or audio signal as recited in claim 3, wherein the encoding step is performed in the time domain, and wherein said method further comprises, prior to the encoding step, the steps of: applying a second linear transform to the warped spectral components to obtain a time-domain signal interval; multiplying the time-domain signal interval by a time window to produce a windowed-signal interval; and adding the successive overlapping windowed-signal intervals corresponding to said successive overlapping finite-duration speech and/or audio signal segments to obtain a pre-processed signal applied to the encoding step for encoding the non-linearly transformed spectrum representation.
7. A method of encoding a speech and/or audio signal as recited in claim 6, wherein the second linear transform is the inverse of said first linear transform.
8. A method of encoding a speech and/or audio signal as recited in claim 2, wherein the breaking-down step comprises multiplying the time-domain speech and/or audio signal by a time window.
9. A method of encoding a speech and/or audio signal as recited in claim 2, wherein the first linear transform is a Fourier transform.
10. A method of encoding a speech and/or audio signal as recited in claim 3, wherein the non-linear transform applied to the short-term spectral components Sk in order to produce the warped spectral components Sk' is expressed as
Ak' = fnl(Ak)

where - Ak is the amplitude of the kth spectral component Sk;
- Ak' is the amplitude of the kth warped spectral component Sk'; and
- fnl(Ak) is a non-linear function of Ak.
11. A method of encoding a speech and/or audio signal as recited in claim 10, wherein the non-linear function is given by

fnl(Ak) = Ak^α(k)

where α(k) is a constant.
12. A method of encoding a speech and/or audio signal as recited in claim 10, wherein the non-linear function is given by fnl(Ak) = logb(Ak).
13. A method of encoding a speech and/or audio signal in view of enhancing perceptual quality, comprising the steps of: non-linearly transforming the speech and/or audio signal; and encoding the non-linearly transformed speech and/or audio signal to produce an encoded speech and/or audio signal.
14. A method of decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal, in which the speech and/or audio signal has been encoded by producing a spectrum representation of the speech and/or audio signal, non-linearly transforming the spectrum representation of the speech and/or audio signal, and encoding the non-linearly transformed spectrum representation to produce the encoded speech and/or audio signal, said decoding method comprising the steps of: a) decoding the encoded speech and/or audio signal to recover a non-linearly transformed spectrum representation of the speech and/or audio signal; b) non-linearly transforming the recovered non-linearly transformed spectrum representation to recover a spectrum representation of the speech and/or audio signal; and c) transforming the recovered spectrum representation into the synthesized speech and/or audio signal.
15. A method of decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal, in which the encoded speech and/or audio signal has been produced by breaking down a time-domain speech and/or audio signal into a succession of overlapping finite-duration speech and/or audio signal segments, applying a first linear transform to each of said speech and/or audio signal segments to obtain short-term spectral components, applying a first non-linear transform to the short-term spectral components in order to produce warped spectral components, and encoding the warped spectral components to produce an encoded speech and/or audio signal, said method comprising the steps of: a) decoding the encoded speech and/or audio signal to recover warped signal components; b) applying a second non-linear transform to the recovered warped signal components to produce unwarped short-term spectral components; and c) applying a second linear transform to the unwarped signal components to produce the synthesized speech and/or audio signal.
16. A method of decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal as recited in claim 15, wherein the second non-linear transform is the inverse of the first non-linear transform, and wherein the second linear transform is the inverse of the first linear transform.
17. A method of decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal, in which the encoded speech and/or audio signal has been produced by breaking down a time-domain speech and/or audio signal into a succession of first overlapping finite-duration speech and/or audio signal segments, applying a first linear transform to each of said first speech and/or audio signal segments to obtain first short-term spectral components, applying a first non-linear transform to the first short-term spectral components in order to produce warped spectral components, applying a second linear transform to the warped spectral components to obtain a first time-domain signal interval, multiplying the first time-domain signal interval by a first time window to produce a first windowed-signal interval, adding the successive overlapping first windowed-signal intervals corresponding to said successive first overlapping finite-duration speech and/or audio signal segments to obtain a time-domain pre-processed speech and/or audio signal, and encoding the time-domain pre-processed speech and/or audio signal to produce the encoded speech and/or audio signal, said decoding method comprising the steps of: a) decoding the encoded speech and/or audio signal to recover a decoded time-domain pre-processed speech and/or audio signal; b) breaking down the decoded time-domain pre-processed speech and/or audio signal into a succession of second overlapping finite-duration speech and/or audio signal segments; c) applying a third linear transform to each of said second speech and/or audio signal segments to obtain second short-term spectral components; d) applying a second non-linear transform to the second short-term spectral components in order to produce unwarped spectral components; e) applying a fourth linear transform to the unwarped spectral components to obtain a second time-domain signal interval; f) multiplying the second time-domain signal interval by a second time window to produce a second windowed-signal interval; g) adding the successive overlapping second windowed-signal intervals corresponding to said successive second overlapping finite-duration speech and/or audio signal segments to obtain the synthesized speech and/or audio signal.
18. A method of decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal as recited in claim 17, wherein the second non-linear transform is the inverse of the first non-linear transform, wherein the second linear transform is the inverse of the first linear transform, wherein the fourth linear transform is the inverse of the third linear transform, wherein the third linear transform is the same as the first linear transform, and wherein the fourth linear transform is the same as the second linear transform.
19. A device for encoding a speech and/or audio signal in view of enhancing perceptual quality, comprising: means for producing a spectrum representation of the speech and/or audio signal; means for non-linearly transforming the spectrum representation of the speech and/or audio signal; and an encoder for encoding the non-linearly transformed spectrum representation to produce an encoded speech and/or audio signal.
20. A device for encoding a speech and/or audio signal as recited in claim 19, wherein the speech and/or audio signal is a time-domain speech and/or audio signal, and wherein said means for producing a spectrum representation comprises: means for breaking down the time-domain speech and/or audio signal into a succession of overlapping finite-duration speech and/or audio signal segments; and means for applying a first linear transform to each of said speech and/or audio signal segments to obtain short-term spectral components.
21. A device for encoding a speech and/or audio signal as recited in claim 20, wherein said means for non-linearly transforming the spectrum representation of the speech and/or audio signal comprises means for applying a non-linear transform to the short-term spectral components in order to produce warped spectral components.
22. A device for encoding a speech and/or audio signal as recited in claim 21, wherein the encoder operates in the time domain, and wherein said device further comprises: means for applying a second linear transform to the warped spectral components to obtain a time-domain signal interval; means for multiplying the time-domain signal interval by a time window to produce a windowed-signal interval; and means for adding the successive overlapping windowed-signal intervals corresponding to said successive overlapping finite-duration speech and/or audio signal segments to obtain a pre-processed signal applied to the encoder.
23. A device for decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal, in which the speech and/or audio signal has been encoded by producing a spectrum representation of the speech and/or audio signal, non-linearly transforming the spectrum representation of the speech and/or audio signal, and encoding the non-linearly transformed spectrum representation to produce the encoded speech and/or audio signal, said decoding device comprising: a) means for decoding the encoded speech and/or audio signal to recover a non-linearly transformed spectrum representation of the speech and/or audio signal; b) means for non-linearly transforming the recovered non-linearly transformed spectrum representation to recover a spectrum representation of the speech and/or audio signal; and c) means for transforming the recovered spectrum representation into the synthesized speech and/or audio signal.
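The windowing and overlap-add steps recited in claims 17 (steps f and g) and 22 can be illustrated compactly. With 50%-overlapping frames and a sine window applied at both analysis and synthesis, the squared windows of adjacent frames sum to one, so the window/overlap-add chain is lossless wherever frames fully overlap. This is a minimal NumPy sketch; the frame length, hop, and window choice are illustrative assumptions, not values specified in the patent:

```python
import numpy as np

def window_overlap_add(x, frame_len=8):
    """Window 50%-overlapping frames twice (analysis + synthesis),
    then overlap-add the windowed frames back together."""
    hop = frame_len // 2
    n = np.arange(frame_len)
    # Sine window: w[n]**2 + w[n + hop]**2 == 1, so applying the window
    # twice and overlap-adding reconstructs the input where frames overlap.
    w = np.sin(np.pi * (n + 0.5) / frame_len)
    y = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = x[start:start + frame_len] * w   # analysis window
        # (per-frame spectral processing would happen here)
        y[start:start + frame_len] += seg * w  # synthesis window + add
    return y
```

With an input of length 24 and `frame_len=8`, samples 4 through 19 lie in two overlapping frames and are reconstructed exactly; the un-overlapped edge samples are attenuated by a single squared window.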
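The full encoder-side chain of claims 19-22 can likewise be sketched end to end: window, first linear transform (a DFT here), non-linear amplitude warp, second linear transform (inverse DFT), window, overlap-add. The power-law warp of the spectral magnitudes with phases left untouched is only an illustrative choice of non-linear transform; the claims do not fix a specific function, and `gamma`, the frame length, and the window below are all assumptions of this sketch:

```python
import numpy as np

def warp_spectrum(X, gamma):
    """Illustrative non-linear transform: power-law warp of the
    spectral amplitudes, leaving the phases unchanged."""
    return (np.abs(X) ** gamma) * np.exp(1j * np.angle(X))

def saw_process(x, frame_len=32, gamma=0.5):
    """Sketch of the claimed encoder pre-processing chain:
    window -> DFT -> amplitude warp -> inverse DFT -> window -> OLA."""
    hop = frame_len // 2
    n = np.arange(frame_len)
    w = np.sin(np.pi * (n + 0.5) / frame_len)   # 50%-overlap sine window
    y = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = x[start:start + frame_len] * w     # analysis window
        X = np.fft.rfft(seg)                     # first linear transform
        Xw = warp_spectrum(X, gamma)             # non-linear transform
        seg_w = np.fft.irfft(Xw, frame_len)      # second linear transform
        y[start:start + frame_len] += seg_w * w  # synthesis window + OLA
    return y
```

On the decoder side (claims 18 and 23) the warp is undone with the inverse exponent, `warp_spectrum(X, 1.0 / gamma)`; with `gamma = 1` the chain degenerates to plain windowed analysis/synthesis and reconstructs the overlapped interior of the signal exactly.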
PCT/CA1997/000543 1996-08-02 1997-07-30 Speech/audio coding with non-linear spectral-amplitude transformation WO1998006090A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU36901/97A AU3690197A (en) 1996-08-02 1997-07-30 Speech/audio coding with non-linear spectral-amplitude transformation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US2298696P 1996-08-02 1996-08-02
US60/022,986 1996-08-02

Publications (1)

Publication Number Publication Date
WO1998006090A1 (en) 1998-02-12

Family

ID=21812476

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA1997/000543 WO1998006090A1 (en) 1996-08-02 1997-07-30 Speech/audio coding with non-linear spectral-amplitude transformation

Country Status (2)

Country Link
AU (1) AU3690197A (en)
WO (1) WO1998006090A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1989009985A1 (en) * 1988-04-08 1989-10-19 Massachusetts Institute Of Technology Computationally efficient sine wave synthesis for acoustic waveform processing
US5394508A (en) * 1992-01-17 1995-02-28 Massachusetts Institute Of Technology Method and apparatus for encoding decoding and compression of audio-type data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Lefebvre, R. et al.: "Spectral amplitude warping (SAW) for noise spectrum shaping in audio coding", Proc. 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany, 21-24 April 1997, IEEE Comput. Soc. Press, Los Alamitos, CA, USA, ISBN 0-8186-7919-0, vol. 1, pp. 335-338, XP002044323 *
Patisaul, C. R. et al.: "Time-frequency resolution experiment in speech analysis and synthesis", Journal of the Acoustical Society of America, vol. 58, no. 6, Dec. 1975, ISSN 0001-4966, pp. 1296-1307, XP002044324 *
Tokuda, K. et al.: "Recursion formula for calculation of mel generalized cepstrum coefficients", Transactions of the Institute of Electronics, Information and Communication Engineers A, Japan, vol. J71-A, no. 1, Jan. 1988, ISSN 0373-6091, pp. 128-131, XP002044325 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1724758A2 (en) * 1999-02-09 2006-11-22 AT&T Corp. Delay reduction for a combination of a speech preprocessor and speech encoder
EP1724758A3 (en) * 1999-02-09 2007-08-01 AT&T Corp. Delay reduction for a combination of a speech preprocessor and speech encoder
FR2824978A1 (en) * 2001-05-15 2002-11-22 Wavecom Sa Audio signal processing method, for mobile telephone signal, includes signal segmentation and filtering to remove background noise
WO2002093558A1 (en) * 2001-05-15 2002-11-21 Wavecom Device and method for processing an audio signal
FR2826492A1 (en) * 2001-06-22 2002-12-27 Thales Sa METHOD AND SYSTEM FOR PRE AND POST-PROCESSING AUDIO SIGNAL FOR TRANSMISSION ON A HIGHLY DISTURBED CHANNEL
EP1271473A1 (en) * 2001-06-22 2003-01-02 Thales System and method for PRE-AND POST-PROCESSING of an audio signal for transmission over a strongly distorted channel
US7561702B2 (en) 2001-06-22 2009-07-14 Thales Method and system for the pre-processing and post processing of an audio signal for transmission on a highly disturbed channel
WO2003009639A1 (en) * 2001-07-19 2003-01-30 Vast Audio Pty Ltd Recording a three dimensional auditory scene and reproducing it for the individual listener
US7489788B2 (en) 2001-07-19 2009-02-10 Personal Audio Pty Ltd Recording a three dimensional auditory scene and reproducing it for the individual listener
US7765100B2 (en) * 2005-02-05 2010-07-27 Samsung Electronics Co., Ltd. Method and apparatus for recovering line spectrum pair parameter and speech decoding apparatus using same
US8214203B2 (en) 2005-02-05 2012-07-03 Samsung Electronics Co., Ltd. Method and apparatus for recovering line spectrum pair parameter and speech decoding apparatus using same
US8219389B2 (en) 2005-04-20 2012-07-10 Qnx Software Systems Limited System for improving speech intelligibility through high frequency compression
US7813931B2 (en) 2005-04-20 2010-10-12 QNX Software Systems, Co. System for improving speech quality and intelligibility with bandwidth compression/expansion
US8086451B2 (en) 2005-04-20 2011-12-27 Qnx Software Systems Co. System for improving speech intelligibility through high frequency compression
US8249861B2 (en) 2005-04-20 2012-08-21 Qnx Software Systems Limited High frequency compression integration
EP1739658A1 (en) * 2005-06-28 2007-01-03 Harman Becker Automotive Systems-Wavemakers, Inc. Frequency extension of harmonic signals
US8311840B2 (en) 2005-06-28 2012-11-13 Qnx Software Systems Limited Frequency extension of harmonic signals
US7720677B2 (en) 2005-11-03 2010-05-18 Coding Technologies Ab Time warped modified transform coding of audio signals
US8412518B2 (en) 2005-11-03 2013-04-02 Dolby International Ab Time warped modified transform coding of audio signals
US8838441B2 (en) 2005-11-03 2014-09-16 Dolby International Ab Time warped modified transform coding of audio signals
US7546237B2 (en) 2005-12-23 2009-06-09 Qnx Software Systems (Wavemakers), Inc. Bandwidth extension of narrowband speech
US7912729B2 (en) 2007-02-23 2011-03-22 Qnx Software Systems Co. High-frequency bandwidth extension in the time domain
US8200499B2 (en) 2007-02-23 2012-06-12 Qnx Software Systems Limited High-frequency bandwidth extension in the time domain
WO2023056920A1 (en) * 2021-10-05 2023-04-13 Huawei Technologies Co., Ltd. Multilayer perceptron neural network for speech processing

Also Published As

Publication number Publication date
AU3690197A (en) 1998-02-25

Similar Documents

Publication Publication Date Title
US7529660B2 (en) Method and device for frequency-selective pitch enhancement of synthesized speech
KR101213840B1 (en) Decoding device and method thereof, and communication terminal apparatus and base station apparatus comprising decoding device
US8265940B2 (en) Method and device for the artificial extension of the bandwidth of speech signals
Tribolet et al. Frequency domain coding of speech
AU2009267529B2 (en) Apparatus and method for calculating bandwidth extension data using a spectral tilt controlling framing
EP3602549B1 (en) Apparatus and method for post-processing an audio signal using a transient location detection
US20070219785A1 (en) Speech post-processing using MDCT coefficients
EP1328923B1 (en) Perceptually improved encoding of acoustic signals
US20110125507A1 (en) Method and System for Frequency Domain Postfiltering of Encoded Audio Data in a Decoder
US10170126B2 (en) Effective attenuation of pre-echoes in a digital audio signal
AU2001284606A1 (en) Perceptually improved encoding of acoustic signals
WO1998006090A1 (en) Speech/audio coding with non-linear spectral-amplitude transformation
EP0994463A2 (en) Post filter
US11562756B2 (en) Apparatus and method for post-processing an audio signal using prediction based shaping
US6058360A (en) Postfiltering audio signals especially speech signals
EP1395982B1 (en) Adpcm speech coding system with phase-smearing and phase-desmearing filters
Füg et al. Temporal noise shaping on MDCT subband signals for transform audio coding
Luo et al. High quality wavelet-packet based audio coder with adaptive quantization
VIJAYASRI et al. IMPLEMENTATION OF A NOVEL TRANSFORMATION TECHNIQUE TO IMPROVE SPEECH COMPRESSION RATIO
Bhaskar Adaptive predictive coding with transform domain quantization using block size adaptation and high-resolution spectral modeling

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH HU IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW AM AZ BY KG KZ MD RU TJ TM

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH KE LS MW SD SZ UG ZW AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: CA

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

NENP Non-entry into the national phase

Ref country code: JP

Ref document number: 1998507412

Format of ref document f/p: F

122 Ep: pct application non-entry in european phase