SPEECH/AUDIO CODING WITH NON-LINEAR SPECTRAL-AMPLITUDE TRANSFORMATION
BACKGROUND OF THE INVENTION
1. Field of the invention:
The present invention relates to the field of digital encoding of speech and/or audio signals, where the main objective is to maintain the highest possible sound quality at a given bit rate.
2. Brief description of the prior art:
There is an increasing demand for high-quality encoding of speech and audio signals at low bit rates. Applications such as video conferencing, video telephony and multimedia require the transmission of audio signals with a high level of fidelity over channels of limited bandwidth and capacity.
It is widely recognized that, to reduce the bit rate of speech and audio encoders without compromising quality, proper care should be given to the spectral shape of the coding noise. A general rule is that the
short-term spectrum of the coding noise must follow the short-term spectrum of the speech and/or audio signal. This is known as noise shaping.
In low bit-rate speech encoders (<16 kbits/second), which typically use the CELP paradigm [M. Schroeder, B. Atal, "Code-Excited Linear Prediction (CELP): High-quality speech at very low bit rates", Proc. IEEE Int. Conf. ASSP, 1985, pp. 937-940], a time-varying perceptual filter controls the noise level as a function of frequency. This perceptual filter is derived from an autoregressive filter which models the formants, or spectral envelope, of the speech spectrum. Hence, the noise spectrum approximately follows the speech formants. Further perceptual improvements can be achieved by using a post-filter, which emphasizes the formant and harmonic structure of the synthesized speech signal.
In higher bit-rate audio encoders, which are typically transform or sub-band encoders, the noise spectrum is controlled by dynamic bit allocation in the frequency domain. In the most complex algorithms, a sophisticated hearing model is used to determine a masking threshold. Bit allocation is conducted so as to maintain the distortion below the masking threshold at any frequency (provided the encoder operates at a sufficient bit rate). The resulting coding noise will be correlated to the signal spectrum, with corresponding peaks and valleys.
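The masking-threshold-driven bit allocation described above can be sketched with a toy greedy loop. This is an illustrative assumption, not taken from the patent or any particular standard: the function name, the 6 dB-per-bit rule of thumb, and the dB-domain bookkeeping are all simplifications.

```python
import numpy as np

def allocate_bits(signal_db, masking_db, total_bits, step_db=6.0):
    """Greedy bit allocation sketch: repeatedly give one bit to the band
    whose quantization noise is furthest above its masking threshold.
    Assumptions: zero bits leaves noise at the signal level, and each
    added bit lowers the noise in that band by ~6 dB (step_db)."""
    n = len(signal_db)
    bits = np.zeros(n, dtype=int)
    noise_db = np.array(signal_db, dtype=float)  # noise level per band, in dB
    for _ in range(total_bits):
        margin = noise_db - np.array(masking_db, dtype=float)
        k = int(np.argmax(margin))  # band most in need of extra resolution
        bits[k] += 1
        noise_db[k] -= step_db
    return bits
```

Bands with a larger signal-to-mask ratio receive more bits, so the coding noise tracks the masking threshold rather than being flat.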
Because of the non-stationarities of speech and audio signals, the perceptual filter and bit allocation discussed above have to be adaptive; more specifically, they are time-varying functions. This adaptation implies
either that side information has to be transmitted to the decoder, or that the noise shaping function is an inherent part of the encoder.
OBJECTS OF THE INVENTION
An object of the present invention is therefore to overcome the above-discussed drawbacks of the prior art.
Another object of the present invention is to provide a speech/audio encoding method and device capable of enhancing the perceptual quality of speech and/or audio signals.
A further object of the present invention is to provide an encoding/decoding method and device conducting a warping of the spectral amplitude of a speech and/or audio signal prior to encoding, and an unwarping of the spectral amplitude of the speech and/or audio signal after decoding, in view of enhancing the perceptual quality of the encoded and subsequently synthesized speech and/or audio signal.
SUMMARY OF THE INVENTION
More specifically, in accordance with the present invention, there is provided a method of encoding a speech and/or audio signal in view of enhancing perceptual quality, comprising the steps of non-linearly
transforming the speech and/or audio signal, and encoding the non-linearly transformed speech and/or audio signal to produce an encoded speech and/or audio signal.
The method according to the invention, for encoding a speech and/or audio signal in view of enhancing perceptual quality may comprise the steps of producing a spectrum representation of the speech and/or audio signal, non-linearly transforming the spectrum representation of the speech and/or audio signal, and encoding the non-linearly transformed spectrum representation to produce an encoded speech and/or audio signal.
In accordance with preferred embodiments of the method of encoding a speech and/or audio signal:
- the speech and/or audio signal is a time-domain speech and/or audio signal, and the step of producing a spectrum representation comprises: breaking down the time-domain speech and/or audio signal into a succession of overlapping finite-duration speech and/or audio signal segments; and applying a first linear transform to each speech and/or audio signal segment to obtain short-term spectral components;
- the step of non-linearly transforming the spectrum representation of the speech and/or audio signal comprises the step of: applying a non-linear transform to the short-term spectral components in order to produce warped spectral components;
- the encoding step is performed in the time domain, and the method further comprises, prior to the encoding step, the steps of: applying a second linear transform to the warped spectral components to obtain a time-domain signal interval; multiplying the time-domain signal interval by a time window to produce a windowed-signal interval; and adding the successive overlapping windowed-signal intervals corresponding to the successive overlapping finite-duration speech and/or audio signal segments to obtain a pre-processed signal applied to the encoding step for encoding the non-linearly transformed spectrum representation; wherein the second linear transform is the inverse of the first linear transform.
- the non-linear transform applied to the short-term spectral components Sk in order to produce the warped spectral components Sk' is expressed as

Ak' = fnl(Ak)

where - Ak is the amplitude of the kth spectral component Sk; - Ak' is the amplitude of the kth warped spectral component Sk'; and - fnl(Ak) is a non-linear function of Ak; wherein the non-linear function is given by

fnl(Ak) = Ak^α(k)

where α(k) is a constant, or by

fnl(Ak) = logb(Ak).
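As a rough illustration (not part of the patent text), the two example warping functions above can be sketched in Python. The names `warp_power` and `warp_log` are hypothetical; only the amplitude Ak is warped, while the phase of each spectral component is left untouched:

```python
import numpy as np

def warp_power(spectrum, alpha=0.5):
    """Power-law warping Ak' = Ak ** alpha(k); phase is preserved."""
    amplitude = np.abs(spectrum)
    phase = np.exp(1j * np.angle(spectrum))
    return (amplitude ** alpha) * phase

def warp_log(spectrum, base=10.0, floor=1e-9):
    """Logarithmic warping Ak' = log_b(Ak); a small floor avoids log(0).
    Note that amplitudes below 1 map to negative values, which would need
    special handling in a real implementation."""
    amplitude = np.maximum(np.abs(spectrum), floor)
    phase = np.exp(1j * np.angle(spectrum))
    return (np.log(amplitude) / np.log(base)) * phase
```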
According to another aspect of the present invention, there is provided a method of decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal, in which the speech and/or audio signal has been encoded by producing a spectrum representation of the speech and/or audio signal, non-linearly transforming the spectrum representation of the speech and/or audio signal, and encoding the non-linearly transformed spectrum representation to produce the encoded speech and/or audio signal. The decoding method comprises the steps of decoding the encoded speech and/or audio signal to recover a non-linearly transformed spectrum representation of the speech and/or audio signal, non-linearly transforming the recovered non-linearly transformed spectrum representation to recover a spectrum representation of the speech and/or audio signal, and transforming the recovered spectrum representation into the synthesized speech and/or audio signal.
According to a further aspect of the subject invention, there is provided a method for decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal, in which the encoded speech and/or audio signal has been produced by breaking down a time-domain speech and/or audio signal into a succession of overlapping finite-duration speech and/or audio signal segments, applying a first linear transform to each of the speech and/or audio signal segments to obtain short-term spectral
components, applying a first non-linear transform to the short-term spectral components in order to produce warped spectral components, and encoding the warped spectral components to produce an encoded speech and/or audio signal. The decoding method comprises the steps of decoding the encoded speech and/or audio signal to recover warped signal components, applying a second non-linear transform to the recovered warped signal components to produce unwarped short-term spectral components, and applying a second linear transform to the unwarped spectral components to produce the synthesized speech and/or audio signal.
The present invention further relates to a method of decoding an encoded speech and/or audio signal in view of enhancing the perceptual quality of a synthesized speech and/or audio signal, in which the encoded speech and/or audio signal has been produced by breaking down a time- domain speech and/or audio signal into a succession of first overlapping finite-duration speech and/or audio signal segments, applying a first linear transform to each of the first speech and/or audio signal segments to obtain first short-term spectral components, applying a first non-linear transform to the first short-term spectral components in order to produce warped spectral components, applying a second linear transform to the warped spectral components to obtain a first time-domain signal interval, multiplying the first time-domain signal interval by a first time window to produce a first windowed-signal interval, adding the successive overlapping first windowed-signal intervals corresponding to the successive first overlapping finite-duration speech and/or audio signal segments to obtain a time-domain pre-processed speech and/or audio signal, and encoding the time-domain pre-processed speech and/or audio
signal to produce the encoded speech and/or audio signal. The decoding method comprises the steps of decoding the encoded speech and/or audio signal to recover a decoded time-domain pre-processed speech and/or audio signal, breaking down the decoded time-domain pre-processed speech and/or audio signal into a succession of second overlapping finite-duration speech and/or audio signal segments, applying a third linear transform to each of the second speech and/or audio signal segments to obtain second short-term spectral components, applying a second non-linear transform to the second short-term spectral components in order to produce unwarped spectral components, applying a fourth linear transform to the unwarped spectral components to obtain a second time-domain signal interval, multiplying the second time-domain signal interval by a second time window to produce a second windowed-signal interval, adding the successive overlapping second windowed-signal intervals corresponding to the successive second overlapping finite-duration speech and/or audio signal segments to obtain the synthesized speech and/or audio signal.
Finally, the present invention concerns a device for carrying the above-defined encoding and decoding methods into practice.
An important advantage of the present invention over the prior art is that adaptive noise shaping may be obtained with a constant function. More precisely, a constant non-linear transformation is applied to the speech/audio short-term spectrum prior to the quantizing and/or encoding per se. Since noise shaping is performed by means of a constant function, this function can be viewed as a completely separate entity from the encoder, making it possible to enhance the noise shaping capability
of an existing encoder without modifying the encoder itself, or having to transmit additional side information.
The objects, advantages and other features of the present invention will become more apparent upon reading of the following non-restrictive description of preferred embodiments thereof, given by way of examples only with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
In the appended drawings:
Figure 1a, which is labelled as "prior art", is a simplified block diagram of a speech/audio encoding device using pre-processing;
Figure 1b, which is labelled as "prior art", is a simplified block diagram of a speech/audio decoding device using post-processing, this speech/audio decoding device corresponding to the speech/audio encoding device of Figure 1a;
Figure 2a is a block diagram of a speech/audio encoding device using spectral-amplitude warping as pre-processing, in which the input signal is a time- domain speech and/or audio signal;
Figure 2b is a block diagram of a speech/audio decoding device using spectral-amplitude unwarping as post-processing, this speech/audio
decoding device corresponding to the speech/audio encoding device of Figure 2a;
Figure 3a is a block diagram of a more general configuration of the speech/audio encoding device of Figure 2a, still using spectral-amplitude warping as pre-processing;
Figure 3b is a block diagram of a more general configuration of the speech/audio decoding device of Figure 2b, still using spectral-amplitude unwarping as post-processing;
Figure 4a is a graph showing an example of short-term amplitude spectrum of a voiced speech segment, using a fast Fourier transform (FFT);
Figure 4b is a graph showing the short-term amplitude spectrum of the coding noise, corresponding to the speech segment of Figure 4a, when this speech segment is encoded with wideband speech coding standard G.722 [P. Mermelstein, "G.722, a new CCITT coding standard for digital transmission of wideband audio signals", IEEE Communications Magazine, Vol. 26, No. 1, 1988] at 48 kbits/second; and
Figure 4c is a graph showing the resulting short-term amplitude spectrum of the coding noise, corresponding to the speech segment of
Figure 4a, when spectral-amplitude warping and spectral-amplitude unwarping are used respectively prior to encoding and after decoding with wideband speech coding standard G.722 at 48 kbits/second.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In the different figures of the appended drawings, the corresponding elements are identified by the same references.
Reducing the bit rate in speech/audio encoding inevitably increases the distortion introduced by the encoder. To minimize the perceived effects of this distortion, the encoder must resort to noise shaping, which essentially controls the level of noise as a function of time and frequency. Noise shaping can be implemented through the two following, fundamentally different approaches:
(1) a weighting measure is included in the encoder itself, which weighting measure determines the level of accuracy reached in quantizing each spectral component; or
(2) the original speech and/or audio signal is spectrally weighted (pre- processed) prior to the quantizing (encoding) operation per se.
The present invention is concerned with the pre-processing approach (approach No. (2)).
Figure 1a is a simplified block diagram of a speech/audio encoding device 100 with a pre-processing module 102.
As illustrated in Figure 1a, the speech/audio encoding device 100 comprises an optional module 101 for conditioning the input speech
and/or audio signal 106. Input signal conditioning module 101 conditions the input speech and/or audio signal 106 to account for operations such as soft saturation to prevent clipping, high-pass filtering to remove DC component, gain control, etc. In the present disclosure, signal conditioning is viewed as an entity separate from signal pre-processing.
The speech/audio encoding device 100 also comprises the preprocessing module 102 per se. The main purpose of the pre-processing module is to improve the perceptual quality of the speech/audio encoding device 100. More specifically, module 102 modifies the speech and/or audio signal 106, conditioned by a signal conditioning module 101 or not, to emphasize the perceptually relevant features of this signal prior to encoding. This enables proper encoding of these perceptually relevant features. The characteristics of the pre-processing module 102 will be fully explained in the following description.
The pre-processed speech and/or audio signal from module 102 is then supplied to the encoder 103. As well known to those of ordinary skill in the art, the encoder 103 produces a bitstream 108 to be transmitted over a communication channel.
Figure 1b is a simplified block diagram of a speech/audio decoding device 107 comprising a post-processing module 105.
The bitstream 108 from the encoder 103 is received by a decoder 104 of the speech/audio decoding device 107. Decoder 104 produces a pre-processed synthesis speech and/or audio signal 110 in response to the received bitstream 108.
The post-processing module 105 conducts a post-processing operation which is typically the inverse of the pre-processing operation conducted by module 102 of Figure 1a. Hence, if the output of the decoder 104 was exactly the same as the input of the encoder 103, i.e. if there were no coding noise and no channel noise, then the synthesized speech and/or audio signal 109 would be exactly the same as the input speech and/or audio signal 106 (conditioned by module 101 or not).
In the past, the pre-processing 102 and post-processing 105 modules have been mostly implemented by means of linear filters. A drawback of this linear-filter implementation is that adaptation of the preprocessing and post-processing requires adaptation of the linear filters themselves. This implementation also requires the transmission of additional side information to the decoder 104 in order to adapt postprocessing accordingly.
Figure 2a is a block diagram of the speech/audio encoding device 100, in which the pre-processing module 102 is broken down into four distinct modules 202-205.
In the preferred embodiment of the invention, the input speech and/or audio signal 106 is a time-domain signal, conditioned by the module 101 or not, and consisting of a block of samples supplied at recurrent time intervals called frames. This signal structure is well known to those of ordinary skill in the art and, accordingly, will not be further described in the present disclosure.
The speech/audio encoding device 100 comprises a module 202 for multiplying the block of samples of the input speech and/or audio signal 106, conditioned by module 101 or not, by a time window to produce a windowed signal 201. In this manner, the speech and/or audio signal 106 is broken down into a succession of overlapping finite-duration speech and/or audio signal segments. Depending on the transform used in next module 203, the window can be a rectangular window, a Hanning or Hamming window [A.V. Oppenheim, A.S. Willsky, "Signals and Systems", Prentice-Hall Signal Processing Series, 1983], or a window having a more complex form.
Then, module 203 applies a linear transform, for example a Fast Fourier Transform (FFT), to each speech and/or audio signal segment to obtain short-term spectral components [A.V. Oppenheim, A.S. Willsky, "Signals and Systems", Prentice-Hall Signal Processing Series, 1983]. It is within the scope of the present invention to use any other similar linear transforms including, but not limited to, Sine, Cosine and MLT transforms, and that can be represented by a set of spectral components each having an amplitude, a phase (or a sign) and a distinct index on a frequency scale. In the following description, Sk is the kth short-term spectral component, and Ak is the amplitude of the spectral component Sk.
The function of module 204 is to apply a non-linear transformation (non-linear warping) to the spectral components Sk in order to produce so-called "warped" spectral components Sk'. This operation can be summarized as follows:

Ak' = fnl(Ak)

where fnl(Ak) is a non-linear function of Ak and Ak' is the amplitude of the warped spectral component Sk'. Examples of fnl(Ak) include, but are not limited to:

fnl(Ak) = Ak^α(k)

where α(k) is a constant, possibly the same for all indexes k; and

fnl(Ak) = logb(Ak).
Module 205 then applies the inverse of the linear transform of module 203, i.e. the inverse Fast Fourier Transform (IFFT), to the warped spectral components Sk'. This yields a time-domain signal 206. To minimize the effects of discontinuities (frame effects), successive frames of time-domain signal 206 are added using overlap-add. More specifically, module 205 applies a second linear transform to the warped spectral components to obtain a time-domain signal interval, multiplies the time-domain signal interval by a time window to produce a windowed-signal interval, and adds the successive overlapping windowed-signal intervals corresponding to the above mentioned successive overlapping finite-duration speech and/or audio signal segments to obtain the time-domain pre-processed signal 206 applied to the encoder 103. Depending on the window in module 202, other methods such as overlap-discard could be used to reconstruct a continuous time-domain signal from
successive frames. The resulting time-domain signal 206 is then supplied as input signal to the encoder 103 which, as described in the foregoing description, produces the bitstream 108 to be transmitted over the communication channel.
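The pre-processing chain of modules 202 to 205 can be sketched end to end. This is a minimal illustration with hypothetical names, assuming a square-root Hanning analysis/synthesis window pair at 50% overlap (the patent allows other windows and overlap schemes), and power-law warping with a constant α:

```python
import numpy as np

def preprocess(signal, frame_len=256, alpha=0.5):
    """Spectral-amplitude warping pre-processor sketch (modules 202-205):
    window, FFT, amplitude warp, inverse FFT, window again, overlap-add.
    The sqrt-Hanning pair at 50% overlap approximately satisfies the
    overlap-add condition, so with alpha = 1 the interior of the signal
    is reconstructed nearly unchanged."""
    hop = frame_len // 2
    window = np.sqrt(np.hanning(frame_len))  # hedged choice of window
    out = np.zeros(len(signal) + frame_len)
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = signal[start:start + frame_len] * window     # module 202
        spectrum = np.fft.rfft(segment)                        # module 203
        amp, phase = np.abs(spectrum), np.angle(spectrum)
        warped = (amp ** alpha) * np.exp(1j * phase)           # module 204
        interval = np.fft.irfft(warped, n=frame_len) * window  # module 205
        out[start:start + frame_len] += interval               # overlap-add
    return out[:len(signal)]
```

The output of `preprocess` plays the role of the time-domain pre-processed signal 206 that is fed to the encoder 103.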
Figure 2b is a block diagram of the speech/audio decoding device
107 in which the post-processing module 105 has been divided into four modules 209-212.
The decoder 208 receives the bitstream 108 from the encoder 103 and, in response to this bitstream, produces a time-domain pre-processed synthesis speech and/or audio signal 110. Since the input to the encoder 103 is a pre-processed signal, the output of the decoder 208 requires post-processing to recover a synthesized speech and/or audio signal 109 suitable for listening.
In the preferred embodiment as illustrated in Figures 2a and 2b, modules 209, 210 and 212 of Figure 2b are identical to modules 202, 203 and 205 of Figure 2a, respectively. However, it is within the scope of the present invention to use modules 209, 210 and 212 different from modules 202, 203 and 205.
In module 211, a non-linear transform is applied to the short-term spectral components produced by module 210 to perform an operation referred to as non-linear spectral-amplitude "unwarping", since the non-linear transform of module 211 is the inverse of the non-linear transform of module 204 (Figure 2a). Indeed, the term "unwarping" emphasizes the fact that this operation is essentially the inverse of the above described spectral-amplitude warping. More specifically, if

Ak' = fnl(Ak)

is the non-linear transformation made by module 204 and applied to the amplitude Ak of the kth spectral component, then

Bk = fnl^-1(Bk')

is the non-linear transformation conducted by module 211, where Bk' is the amplitude of the kth "warped" spectral component at the receiving end (output of module 210), where Bk is the amplitude of the kth "unwarped" spectral component, and where fnl^-1 is the inverse of the function fnl. For example, in the case where the non-linear transformation of module 204 is given by

Ak' = fnl(Ak) = Ak^α(k)

then the non-linear transformation in module 211 is given by

Bk = fnl^-1(Bk') = (Bk')^(1/α(k)).

It is easy to show that Bk = Ak if Bk' = Ak' when no coding noise and no channel noise are present. In other words, in this particular example, fnl^-1 is indeed the inverse of fnl.
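This inverse relationship is easy to check numerically. The following minimal sketch (hypothetical function names, α(k) = 0.5 for all k) confirms that the amplitude survives a noiseless warp/unwarp round trip:

```python
def warp(a, alpha=0.5):
    """Ak' = fnl(Ak) = Ak ** alpha  (module 204)."""
    return a ** alpha

def unwarp(b, alpha=0.5):
    """Bk = fnl^-1(Bk') = Bk' ** (1 / alpha)  (module 211)."""
    return b ** (1.0 / alpha)

# Noiseless round trip: Bk' = Ak' implies Bk = Ak.
amplitudes = [0.25, 1.0, 9.0, 100.0]
recovered = [unwarp(warp(a)) for a in amplitudes]
```

With coding or channel noise present, Bk' differs from Ak', and the unwarping correspondingly reshapes the noise along with the signal.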
Referring to Figures 4a, 4b and 4c, the noise shaping capability of the subject invention will be demonstrated.
Figure 4a shows the amplitude E (dB) as a function of frequency (kHz) of the short-term Fourier spectrum for a voiced segment of female speech, using a Fast Fourier Transform (FFT).
Figure 4b shows the amplitude E(dB) of the short-term Fourier spectrum as a function of frequency (kHz) of the coding noise, corresponding to the voiced segment spectrum amplitude of Figure 4a, when this speech segment is encoded using ITU wideband speech coding standard G.722 [3] at 48 kbits/second. (ITU is the successor to CCITT). It is recalled that the coding noise is the difference signal between the original speech and/or audio signal 106, conditioned or not by module 101, and the synthesized speech and/or audio signal 109 at the output of the speech/audio decoding device 107. It should be noted that there is no pre-processing and post-processing in the case of Figure 4b, to emphasize the distortion introduced by the encoding and decoding devices 100 and 107 themselves. It can be seen that the short-term spectrum of noise is not correlated to the original speech spectrum (Figure 4a).
Figures 4a and 4b show that, between 2 kHz and 5 kHz, the noise spectrum exceeds the original speech spectrum, which results in audible distortion.
Figure 4c shows the resulting coding noise when pre-processing (spectral-amplitude warping) and post-processing (spectral-amplitude
unwarping) are used respectively prior to encoding and after decoding, with wideband speech coding standard G.722 at 48 kbits/seconds. The pre-processing is described in module 102 of Figure 2a, and the postprocessing is described in module 105 of Figure 2b. The non-linear warping operation of Module 204 is, in this particular case, as follows:
Ak' = fnl(Ak) = Ak^α(k), with α(k) = 0.5 for all k.
The non-linear unwarping of Module 211 of Figure 2b is as follows:
Bk = fnl^-1(Bk') = (Bk')^(1/α(k)), with α(k) = 0.5 for all k.
In Figure 4c, the noise spectrum is strongly correlated to the original speech spectrum of Figure 4a. In particular, both spectra present corresponding peaks and valleys. Further, the noise spectrum of Figure 4c is much more attenuated in the low-energy portions of the original speech spectrum. An obvious example is the 2-4 kHz region, where the noise level is more than 10 dB below the noise level of the G.722 encoder without pre-processing and post-processing.
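One intuition for this behaviour, offered here as an illustration rather than taken from the patent: with α(k) = 0.5, the warping halves the dynamic range, in dB, of the spectrum the encoder sees, so a roughly flat coding-noise floor, once unwarped, is pushed down in the spectral valleys relative to the peaks:

```python
import math

def to_db(a):
    """Convert a linear amplitude to decibels."""
    return 20.0 * math.log10(a)

# A formant peak 40 dB above a spectral valley (linear amplitudes 100 and 1).
peak, valley = 100.0, 1.0
warped_peak, warped_valley = peak ** 0.5, valley ** 0.5

range_before = to_db(peak) - to_db(valley)              # 40 dB before warping
range_after = to_db(warped_peak) - to_db(warped_valley) # 20 dB after warping
```

Halving the dB range means the encoder spends relatively more of its accuracy on the low-energy regions, which after unwarping translates into noise that tracks the peaks and valleys of the original spectrum.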
It is important to emphasize the fact that the present invention can be generalized to the cases where the encoder 103 does not necessarily operate in the time-domain, for example to transform encoders operating directly in the frequency domain.
In Figure 3a, a generalized version of the speech/audio encoding device 100 of Figure 2a is presented. Here, the pre-processing operation
conducted by module 102 is decomposed into three functions. Modules 302, 303 and 304 are identical to modules 202, 203 and 204 of Figure 2a, respectively. The input of the encoder 103, in Figure 3a, is then the "warped" spectral components, instead of the time-domain signal as described with reference to Figure 2a. The encoder 103 then has the choice either to operate in the frequency domain directly, as in the case of transform/sub-band encoders, or to apply the inverse linear transform and overlap-add functions of module 205 of Figure 2a to obtain a time-domain signal prior to encoding.
In the same manner, Figure 3b shows a modified version of the speech/audio decoding device 107 with post-processing of Figure 2b. The output of the decoder 104 is then assumed to be a frequency-domain signal, i.e. a series of quantized (synthesized) spectral components, as in the case of transform/sub-band decoders. Modules 308 and 309 are similar to modules 211 and 212 of Figure 2b. If the encoder 103 operates in the time domain, it is assumed that the decoding device 107 of Figure 3b includes internally modules 209 and 210 of Figure 2b to provide the spectral components required at the input of module 308 of Figure 3b.
Although the present invention has been described hereinabove by way of a preferred embodiment thereof, this embodiment can be modified at will, within the scope of the appended claims, without departing from the spirit and nature of the subject invention.