WO1996027869A1 - Voice-band compression system - Google Patents

Voice-band compression system

Info

Publication number
WO1996027869A1
Authority
WO
WIPO (PCT)
Prior art keywords: band, signal, voice, bands, compression apparatus
Application number
PCT/CA1996/000127
Other languages
French (fr)
Inventor
Eric Verreault
Tyseer Aboulnasr
Original Assignee
Newbridge Networks Corporation
Application filed by Newbridge Networks Corporation
Priority to AU47806/96A
Publication of WO1996027869A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/032Quantisation or dequantisation of spectral components
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0204Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using subband decomposition

Abstract

A voice compression apparatus for use in telephony comprises a device for decomposing a voice signal into a bank of narrow frequency band signals, an estimator for estimating the instantaneous energy in each band, a comparator for comparing the energy in each band to that of all other bands to estimate the audibility threshold of the signal in each band, and a device for independently quantizing the signal in each band on the basis of the estimated audibility threshold. The inventive apparatus offers low processing delays and reduced computational requirements, and is suitable for use in telecommunications networks.

Description

VOICE-BAND COMPRESSION SYSTEM
This invention relates to a telephone voice-band compression system.
Digital telephone systems employ voice compression in order to make the most effective use of available bandwidth. Most common telephone voice-band compression systems rely on modeling of the vocal tract to eliminate redundant information.
Psycho-acoustic based techniques have been used to compress pre-recorded high-fidelity audio signals. Two such known systems are the DCC (digital compact cassette) developed and commercialized by Philips, and Sony Corporation's MiniDisc. Hi-fi systems cannot, however, be applied to telephone voice-band signals because they process a larger signal bandwidth. Also, the resulting processing delay is too long for two-way communication, such as is found in telephone communications.
The psycho-acoustic approach relies on modeling the hearing mechanism to eliminate the redundant non-audible information.
The psycho-acoustic based voice compression system makes use of the fact that the human hearing mechanism is incapable of perceiving some sounds in the presence of others. This phenomenon is called the masking effect and can be predominantly analyzed in the frequency domain.
Coupled with the masking effect is the fact that certain critical frequency bands define the resolution capability of the ear.
All psycho-acoustic based compression systems use these two properties. However, known systems rely on the evaluation of the frequency contents of a signal by analyzing the signal over a period of time. This evaluation causes the processing delay, which is unacceptable for two-way telephonic communication.
According to the present invention there is provided a voice compression apparatus comprising means for decomposing a voice signal into a bank of narrow frequency band signals; an estimator for estimating the instantaneous energy in each band; a comparator for comparing the energy in each band to that of all other bands to estimate the audibility threshold of the signal in each band; and means for independently quantizing the signal in each band on the basis of the estimated audibility threshold.
A final number coding stage can be used to further compress the voice signal. The present system uses an optimally selected short time analysis window based on each critical band.
Compared to other voice compression systems, the novel system guarantees a low processing delay combined with low computational requirements. Also, because the compressed signal is composed of frequency-independent information, the new system uncouples the information to be propagated, making it better adapted to packet transport techniques.
The invention also provides a method of compressing a voice signal comprising the steps of decomposing the voice signal into a bank of narrow frequency band signals, estimating the instantaneous energy in each band, comparing the energy in each band to that of all other bands to estimate the audibility threshold of the signal in each band, and independently quantizing the signal in each band on the basis of the estimated audibility threshold. More generally, and in its broadest aspect, the invention provides a voice compression scheme based on human auditory response characteristics, as opposed to the vocal tract characteristics used in prior art compression schemes.
The invention will now be described in more detail, by way of example only, with reference to the accompanying drawings, in which:-
Figure 1 shows an analysis/synthesis filter bank for wavelet transform;
Figure 2 shows the time-scale plane for wavelet transform showing multi-resolution representation;
Figure 3 shows a time-frequency grid for short-time Fourier transform showing same resolution at all frequencies/time;
Figure 4 shows masking threshold as a function of frequency;
Figure 5 shows the actual bands used in signal decomposition; Figure 6 is a block diagram of an encoder for compressing a voice signal in accordance with the invention; and
Figure 7 is a detailed diagram of an encoder for a single band. In order to assist in the understanding of the invention, a brief discussion of the underlying theory will be presented.
Spectral analysis has long been based on Fourier analysis, or more specifically the short-time Discrete Fourier Transform. A finite-length segment x_i(n) of a signal x(n) is defined by multiplying the signal by a pre-selected window w(n). The frequency content of this windowed signal is then determined using the DFT.
For an N-point DFT:

$$X_i(k) = \sum_{n=0}^{N-1} x_i(n)\, e^{-j\omega n}, \qquad \omega = 2\pi k/N, \quad k = 0, \ldots, N-1$$

where x_i(n) = x(n) w(n), with w(n) = 1 for 0 ≤ n ≤ N−1 and w(n) = 0 otherwise. The inverse transform is

$$x_i(n) = \frac{1}{N} \sum_{k=0}^{N-1} X_i(k)\, e^{j\omega n}, \qquad \omega = 2\pi k/N.$$
The segment x_i(n) is expressed as the weighted sum of complex exponentials, the weights being the coefficients of the discrete Fourier transform of that segment. Those coefficients indicate the presence or absence of specific sinusoids and their relative magnitude. However, the frequencies of the sinusoids used in the expansion are equally spaced across the frequency range, providing the same frequency resolution at all frequencies. The higher N, i.e. the longer the window in the time domain, the more points are calculated in the frequency domain, i.e. the higher the frequency resolution of the DFT.
The resolution in the time domain is also determined by the length of the window. The longer the window, the longer the effective observation period and the lower the time resolution. Even though the frequencies present in the signal would be known with higher resolution (more DFT points), the actual time of their presence could only be determined with lower resolution (since the observation period considered is longer).
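This trade-off is easy to see numerically. The following MATLAB fragment is an illustrative sketch only; the sampling rate, test tone, and window length are assumptions, not values taken from the patent:

fs = 8000;                        % assumed sampling rate (Hz)
x  = sin(2*pi*1000*(0:fs-1)/fs);  % 1 kHz test tone, one second long
N  = 256;                         % window length = DFT length
xi = x(1:N) .* hamming(N)';       % windowed segment x_i(n)
Xi = fft(xi);                     % N-point DFT: bin spacing fs/N = 31.25 Hz
% doubling N halves the bin spacing (finer frequency resolution) but
% doubles the observation period (coarser time resolution)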
Thus, there are two fundamental limitations that cannot be bypassed when using the standard short-time Fourier transform: 1) The attainable frequency resolution is the same for all frequencies.
2) The frequency resolution can only be increased at the expense of the time domain resolution and vice versa. It is these limitations that the wavelet transform addresses.
It would be desirable to have an adjustable window length: longer for lower time resolution, shorter for higher time resolution. However, it is important to note that the time resolution required for a given signal is related to its frequency content: higher frequencies need only a short period of observation to determine their frequency, while lower frequencies need longer periods of time to accurately determine the frequency. Thus, variable length windows are needed: ones with shorter effective time-support for the high frequency components of the signal and others of longer effective time support to analyze the lower frequency components of the signal. The choice of the type (function) of these windows has to be made very carefully for the transform to be meaningful and invertible.
This is the basic conceptual idea behind the wavelet transform. The wavelet transform provides a trade-off between the time resolution and the frequency resolution by varying the length of the window. For a pre-selected window function, the duration of the window is selected shorter for higher time/lower frequency resolution for the higher frequency components, and longer for higher frequency/lower time resolution for the lower frequency components.
Like the Fourier transform, the discrete wavelet transform is basically an expansion of a given signal in terms of a set of (almost) orthogonal basis functions. The signal is then expressed as the weighted sum of those functions, the weights being the coefficients of the wavelet transform. This provides the reconstruction equation for reproducing x(n) from its wavelet transform coefficients. The coefficients themselves can be computed as the inner product of the signal and each of the basis functions individually. Up to this point, this is the same as any other signal expansion. It is the conditions on the basis functions (the wavelets) that make the wavelet transform different from other transforms.
While the Fourier transform uses complex exponentials e^{jωn}, ω = 2πk/N, k = 0, ..., N−1 as its basis, the wavelet transform uses a set of basis functions ψ_{j,k}(t) that are dilates (expanded versions) and translates of a mother wavelet ψ(t). Thus, starting from a given function, the mother wavelet ψ(t), the wavelets are generated as ψ_{j,k}(t) = ψ(2^j t − k). The scaling by 2^j provides the dilation and the shift by k provides the translation. Basically, the different wavelets are identical in nature but have effective support in the time domain that depends on the 'scale' parameter j and a position that is a function of the scale j as well as of the translation parameter k. As j increases, the effective time support of the wavelet is reduced and more shifted wavelets are used to cover the duration of the signal. This provides higher capacity for representing finer details for larger j; the opposite is true for lower j. The wavelets are orthogonal to each other so that together they span the whole signal space. The wavelets have to be chosen so that the wavelet transform or expansion provides information that can be directly related to the original signal and that it is invertible. Starting out with a mother wavelet ψ(t), a function f(t) may be expressed in terms of translates (shifted copies) of ψ(t):
$$f(t) = \sum_k a_k\, \psi_{0,k}(t) \qquad (1)$$

where ψ_{0,k}(t) = ψ(t−k). A more general set of basis functions is obtained by using scaled versions of the mother wavelet (prior to shifting):

$$\psi_{j,0}(t) = 2^{j/2}\, \psi(2^j t), \qquad (2)$$

$$\psi_{j,k}(t) = 2^{j/2}\, \psi(2^j t - k), \qquad (3)$$

where j is a measure of the compression/expansion and k specifies the function's location in time.
The wavelet functions have to satisfy:

$$\psi(t) = \sum_n g(n)\, \phi(2t - n) \qquad (4)$$

where φ(t) satisfies the dilation equation

$$\phi(t) = \sum_n h(n)\, \phi(2t - n) \qquad (5)$$

with g(n) = (−1)^n h(N−1−n) for a filter h(n) of length N, and

$$\int \phi(t)\, \phi(t - k)\, dt = K\, \delta(k) \qquad (6)$$
The wavelet transform as a signal expansion in terms of wavelet functions exclusively can be described as follows:

$$f(t) = \sum_{j,k} \langle \psi_{j,k}(t), f(t) \rangle\, \psi_{j,k}(t) = \sum_{j,k} d_j(k)\, \psi_{j,k}(t) \qquad (7)$$
However, starting from a certain scale (level of detail), the same transform is obtained in terms of the wavelet functions (describing the added detail) and a scaling function (describing the basic approximation). This is expressed as follows:

$$f(t) = \sum_k c_i(k)\, \phi_{i,k}(t) + \sum_k d_i(k)\, \psi_{i,k}(t) \qquad (8)$$

The relation between the wavelet transform and basic signal processing techniques can be obtained by relating the coefficients at different scales. It can be shown that

$$d_j(k) = \sum_m h(m - 2k)\, d_{j+1}(m) \qquad (9)$$

Thus, the WT coefficients at scale j can be obtained from those at scale j+1 by a basic convolution operation:

$$d_j(k) = \sum_m h(m - k)\, d_{j+1}(m) \qquad (10)$$

followed by a decimation operation equivalent to replacing k by 2k in the right-hand side of equation (10).
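In MATLAB terms, equations (9) and (10) amount to a filtering step followed by 2:1 downsampling. A minimal sketch (the data and the filter below are placeholders, and filter alignment conventions are glossed over):

dj1 = randn(1, 256);   % stand-in for the scale j+1 coefficients d_{j+1}(m)
h   = [1 1];           % placeholder lowpass h(n); the Haar filter is one valid choice
tmp = conv(dj1, h);    % the convolution of equation (10)
dj  = tmp(1:2:end);    % decimation: k -> 2k, keep every second sample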
The wavelet transform can be recast in basic DSP (Digital Signal Processing) notation as shown in Figure 1, which goes from scale j+1 to scale j. The original conditions imply fairly simple and reasonably familiar conditions on h(n) and g(n), i.e.
$$\sum_n h(n) = 2 \quad \text{(normalization)} \qquad (11)$$

$$\sum_n |h(n)|^2 = 2 \qquad (12)$$

$$\sum_n h(n)\, h(n - 2k) = 2\,\delta(k) \qquad (13)$$

Given that h(n) has an even duration N, the above constraints leave N/2 − 1 degrees of freedom in choosing the coefficients h(n) for a given order N. The equivalent conditions in the frequency domain look more familiar:

$$|H(\omega)|^2 + |H(\omega + \pi)|^2 = 4, \qquad |H(0)| = 2, \qquad |H(\pi)| = 0.$$
The basic approximation is provided in c_j (low frequency) and the detail is provided in d_j (high frequency). To understand the decimation operation, it should be noted that after filtering the input, the output signal in each branch is bandlimited to half the original bandwidth. Maintaining the sampling rate would result in each output signal being effectively oversampled by a factor of two. Decimation does not result in any information loss and keeps the total number of input and output points equal. The above process can be repeated again on the low frequency signal as far down as needed.
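One complete split stage of Figure 1 can be sketched in MATLAB as follows (a sketch assuming the daub function of Appendix A; branch delays are not compensated here, as the note in Appendix B also warns):

h = 2*daub(6);                     % length-12 lowpass, rescaled so sum(h) = 2
N = length(h);
g = ((-1).^(0:N-1)) .* h(N:-1:1);  % highpass via g(n) = (-1)^n h(N-1-n)
x  = randn(1, 512);                % test input
cj = conv(x, h); cj = cj(1:2:end); % lowpass branch, decimated (approximation)
dj = conv(x, g); dj = dj(1:2:end); % highpass branch, decimated (detail)
% synthesis mirrors this: upsample each branch by two, filter with the
% time-reversed filters (hlps/hhps in Appendix B) and sum the branches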
For a fairly high scale j, the samples of f(t) taken at a rate above the Nyquist rate have been shown to be a good approximation to the discrete wavelet transform at that scale. This is used as the starting scale, i.e. f(nT) = c_j(n) at the initial scale.
The total process of finding the wavelet transform and its inverse is shown in Figure 1. Selecting the lowpass filter is equivalent to selecting the specific wavelet function. If the conditions set out in equations (11) to (13) are not satisfied, then the system is a valid filterbank but not a wavelet transform, i.e. the wavelet transform is a special case of the filterbank. As can be seen from the equations, depending on the filter order, there are a varying number of degrees of freedom. These degrees of freedom can be used to optimize the wavelet in some sense. For N=2, there are no degrees of freedom left and there is only one possible wavelet, the Haar function, that satisfies the required conditions. For N=4, there are three equations in four unknowns, resulting in one degree of freedom and thus leading to different possible wavelet functions. One very popular use of these degrees of freedom is to increase the regularity of the wavelet function (the differentiability of the frequency response of the associated filter). The more regular the wavelet function, the better the compaction property of the wavelet transform. In signal processing terms, the flatter the frequency response of the filters at ω = 0 and ω = π, the more regular the wavelets and the better the compaction property of the transform.
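As a concrete check (standard textbook values, not quoted from the patent): for N = 2 the Haar filter is h = [1 1], and for N = 4 the Daubechies choice is h = [1+sqrt(3), 3+sqrt(3), 3-sqrt(3), 1-sqrt(3)]/4, both scaled so that sum(h) = 2. A short MATLAB verification of conditions (11) to (13):

h2 = [1 1];                                        % Haar: the only N=2 solution
h4 = [1+sqrt(3) 3+sqrt(3) 3-sqrt(3) 1-sqrt(3)]/4;  % Daubechies, N=4
for hh = {h2, h4}
    h = hh{1};
    fprintf('sum=%g  energy=%g  lag-2 corr=%g\n', ...
            sum(h), sum(h.^2), sum(h(3:end).*h(1:end-2)));
    % prints 2, 2 and 0 in both cases: conditions (11), (12) and (13) for k = 1
end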
Looking at Figure 1, it can be seen that the signal is basically split into high and low frequency components in splitters 1, 2. The decimated high frequency output 3 (detail) is then maintained as is, while the decimated low frequency output 4 is split again in splitters 5, 6. This process is continued in further splitters 7, 8 as far down as needed. The more stages, the higher the frequency resolution of the lower frequencies and the lower the time resolution (fewer output samples). The first highpass filter output has the frequency range π/2-π. The second highpass filter output has the frequencies π/4-π/2, while the next highpass filter will have the frequencies π/8-π/4, and so on. Thus, the filters have a bandwidth that is a constant fraction of the filter center frequency (constant-Q bandwidth), which basically provides narrow bandwidth (higher frequency resolution, lower time resolution) for the lower frequencies and wider bandwidths (lower frequency resolution, higher time resolution) for higher frequencies. Figure 2 shows the corresponding time-scale division. On the time axis, if the original sampling period is assumed to be T, then the output of the first highpass filter/decimator will have samples that are separated by 2T, N/2 points in total. The second highpass filter/decimator output has N/4 samples separated by 4T, and so on. Fewer points and a higher sampling period are used to represent the lower frequencies. In the limit, the zero frequency is represented by just one point. The higher the frequency, the higher the time resolution needed. This is exactly what we needed. This is shown in Figure 2 as grid points, and is generally referred to as tiling of the time-scale plane. On the other hand, a typical short-time Fourier transform would result in uniform tiling with the same resolution maintained across the time and frequency axes, as shown in Figure 3.
It will thus be seen that the wavelet transform is a smart way of splitting a signal into frequency bands on a logarithmic frequency scale using constant-Q filters (bandwidth proportional to center frequency) to provide higher frequency resolution and lower time resolution at the lower frequencies, while providing lower frequency resolution and higher time resolution at the higher frequencies. The crucial point is the choice of the filter used (i.e. the wavelet function) to ensure invertibility (the ability to reconstruct the original signal) and optimality in some sense. For a compression application, the wavelet is chosen to provide the best energy compaction possible, i.e. requiring the fewest transform coefficients to represent a given signal. The Daubechies wavelet is optimized for just that. The coefficients for these Daubechies filters are generated using the Matlab program in Appendix A. It is to be noted that the signal can be decomposed into other bands, not necessarily on a logarithmic scale, as needed. The logarithmic scale decomposition is obtained using a dyadic tree where only the low frequency band is repeatedly split. Basic sub-band coding would result if both high and low bands were split. Other trees result when different combinations of the tree branches are divided into smaller bands. The actual number of bands per branch is also a variable, resulting in a general M-band decomposition rather than the basic two-band case in the dyadic tree.
Audio compression is traditionally a very different field from speech compression due to the very wide range of possible signal sources: speech as well as all possible musical instruments. As a result, audio compression cannot directly build on the achievements of speech processing, where in many cases the compression was based on some form of modeling the speech production system and utilizing the associated redundancies. Rather than model the source of the signal production, efforts in audio compression were directed at modeling the receiver of the music signal, the human ear. Psychoacoustic principles have been extensively applied to identify what the ear can and cannot hear.
Signal-to-noise ratio then loses its validity as a measure of quality since the quality of a 'noisy' signal does not only depend on how much noise was added but also, to a large extent, on where that noise was added. The trick becomes to carefully place the quantization noise introduced due to the fewer bits used where the ear cannot hear it. Thus, though the SNR may be low, the quality of the signal can be quite acceptable since the increased noise is somehow masked by (inaudible to) the ear.
This concept provided an enormous power to audio compression algorithms. The wavelet transform came in as an almost custom-made representation for audio coding since it provides the information in a form that directly emulates the way the ear hears and as such provides a very compatible representation. It should be noted that the quality required for music signals is significantly higher than the traditional toll quality required for speech transmission over telephone lines. CD quality audio signals have a bandwidth of ≈20 kHz and are sampled at 44 kHz, 16 bits/sample. This results in an uncompressed bit rate of 702 kb/s. It has been shown that using wavelets and psychoacoustic based bit allocation algorithms, a compression factor of over 10 can be achieved, enabling transmission at the standard 64 kb/s telephone rate. One thing that helped achieve such compression ratios was the fact that higher delays and considerable transmitter complexity can be afforded compared to what is possible with two-way speech communication.
The fundamental idea of psychoacoustics is that the ear has very definite masking abilities. A signal cannot be heard unless its amplitude exceeds a certain hearing threshold. This threshold specifies the absolute hearing level of the ear. However, the actual audibility threshold at a given frequency can increase depending on other signals present at neighbouring frequencies. As an example, a given tone at one frequency f_0 can effectively mask another tone at f_1 unless the latter's magnitude exceeds a threshold, as shown in Figure 4.
The complex ability of the ear to mask certain frequencies in the presence of others means that a 20 dB SNR (signal-to-noise ratio) can result in near-transparent coding for perceptually shaped noise (noise placed where it is masked by the ear), while more than 60 dB SNR would be required for additive white noise. As an extreme example, transparency is achieved for a 1 kHz tone with white noise at 90 dB SNR, while the same transparency is achievable at 25 dB SNR with psychoacoustically shaped noise. The masking threshold in Figure 4 is calculated on a bark scale, which is effectively a log frequency scale at higher frequencies, i.e. the same scale as provided by the dyadic tree structure wavelet transform. Other thresholds exist for tones masking noise as well as for noise masking tone/noise. Since normally the signal is neither pure tone nor pure noise, some measure of the tonality of the signal is used to determine a compromise value for the threshold between those provided by the two extremes. This masking threshold has to be determined for each signal segment. To do that, the signal spectrum is determined and transformed to the bark scale. The masking threshold due to the individual signal component in each of the bands on that scale is then computed. It is assumed that all those masking thresholds add up to provide the total masking curve for that segment.
Now given the masking curve, it is apparent that quantization noise can be tolerated if it occurs in the right place: the higher the threshold, the less able the ear is to hear noise at that frequency. The signal in that band can then be represented with fewer bits, i.e. more quantization noise, without actually adding any audible noise. Thus, based on the masking curve, the number of bits used to represent each transform coefficient is determined for a specified overall bit rate. The bit allocation algorithm is a dynamic one that has to be updated periodically based on the spectral content of the current signal segment. This information has to be transmitted to the receiver, specifying the number of bits used to represent the coefficient in each of the bark bands, to allow for reconstruction on the other side. It can be seen that the transmitter is significantly more complex, since it requires the computation of the spectrum on the bark scale, the masking threshold, as well as the bit allocation algorithm, while the receiver simply reconstructs the signal from the various frequency bands.
In one aspect, the invention can be briefly stated as follows:
1. Find the wavelet transform of a given speech segment decomposing it into bark bands.
2. Based on that transform, find the power spectrum on the bark scale directly.
3. Determine the masking curve for that segment based on psychoacoustic principles.
4. Determine the number of bits to be allocated to the wavelet transform coefficient in each band to ensure quantization noise in that band is less than the audible threshold for the current speech segment.
5. The quantized wavelet coefficients, along with the number of bits used per coefficient, are then transmitted. At the receiver, the inverse wavelet transform is implemented to reconstruct the speech.
Unlike in the audio applications, the masking threshold is obtained directly from the wavelet transform. This eliminates the need for the DFT-based spectrum computation performed (in parallel with the wavelet transform) and the associated translation of the regular spectrum into the bark scale spectrum needed for determination of the masking threshold. Quantization on the logarithm of the coefficient improves the quality of the signal for the same number of bits, since the ear hears on a logarithmic magnitude scale. In regular codecs for PCM, 8-bit μ-law quantization is effectively equivalent to 12-bit linear quantization. At least the same improvement is expected here, in addition to the benefit of matching the ear's logarithmic hearing property.
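A minimal sketch of such logarithmic quantization in MATLAB (the function name and the dynamic-range limits cmin/cmax are assumptions for illustration, not the patent's quantizer):

function q = logquant(c, b, cmin, cmax)
% quantize the log-magnitude of coefficient(s) c to 2^b levels, keeping sign;
% cmin/cmax are the assumed limits of the coefficient magnitude range
    L = log(min(max(abs(c), cmin), cmax));          % clamp, then take the log
    u = (L - log(cmin)) ./ (log(cmax) - log(cmin)); % map the log range to [0, 1]
    q = sign(c) .* round(u * (2^b - 1));            % signed b-bit index
end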
Rather than obtaining the bark spectrum for a block, it is possible to attempt to find the instantaneous estimate of the spectrum using the Hilbert transform of the wavelet coefficients to determine the masking threshold. The Hilbert transform of a signal gives an estimate of the envelope of the signal, thus being a better estimate than instantaneous values while not requiring block operation (it allows us to work on a sample-by-sample basis). It is to be noted that the system delay is equivalent to the delays of the filters involved in the signal decomposition/reconstruction, as well as possibly in the Hilbert transform computation.
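As a sketch, using MATLAB's hilbert function (Signal Processing Toolbox) on one of the Appendix B band signals:

env   = abs(hilbert(y45)); % analytic-signal magnitude = envelope of the .75-1 band
Pinst = env.^2;            % sample-by-sample power estimate for the masking computation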
The code to decompose and recombine a signal into bark bands and to reconstruct it is given in Appendix B. The input is one sentence from a voice file which has to be loaded prior to running the program. The program works on the file spll.lin ("Tom's birthday is in June"; male speaker). Figure 5 gives the actual bands used in the decomposition, while Appendix C gives details of the filters used for decomposition and reconstruction of the input signal. Some crucial implementation issues are also given there.
The code to determine the Bark spectrum directly from the wavelet filter outputs and the associated masking threshold is given in Appendix D, (excluding the tonality measure) but has yet to be incorporated in the main code pending the completion of the bit allocation section.
The bit allocation algorithm is crucial in obtaining the best quality signal by assigning the minimum possible number of bits for the coefficients in one frequency band such that the resulting noise in that band is masked by the signal. The number of bits assigned to each frequency band is determined so as to force the quantization noise/masked power ratio to be roughly the same for all bands. This measure is defined as:
Noise-to-mask ratio = a measure of noise power to masked power in a given band:

$$\mathrm{NMR}_i = \log\!\left( \frac{p_i}{\sigma_{m,i}} \right) - b_i$$

where p_i is the peak value, σ_{m,i} is the masked power, and b_i is the number of bits assigned to the i-th band.
This is achieved through an iterative procedure where one starts out with zero bits allocated for all bands. Next, the band with highest noise to mask ratio is determined. The number of bits representing coefficients in this band is increased by 1, leaving fewer bits available for distribution to other coefficients at other frequency bands. The process is then repeated until all bits are used up.
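A sketch of this greedy allocation in MATLAB (Pw and sigmasq as in Appendix D; the per-segment bit budget is an assumed parameter):

b      = zeros(1, 18);                % bits allocated to each of the 18 bands
budget = 64;                          % assumed total bits available per segment
for n = 1:budget
    NMR    = log2(Pw ./ sigmasq) - b; % noise-to-mask ratio at current allocation
    [~, i] = max(NMR);                % band where the noise is most audible
    b(i)   = b(i) + 1;                % one more bit roughly halves its noise power
end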
Typically, high frequencies will have little energy in them and high mask values. The masked power will also have to incorporate the absolute hearing threshold for that frequency, so that we allow for the maximum undetectable noise in any given band in the absence of masking in that band. Referring now to Figure 6, which shows a functional block diagram of a system in accordance with the invention, a voice signal A to be compressed is applied to a frequency band splitter 20, which splits the signal into a series of narrow bands C. Instantaneous energy estimator 21 receives at its input the narrow band signals C and for each one outputs an estimate E of the energy content.
The estimates E of the energy content are then applied to the respective inputs of the perceptual masking estimators 22, which in turn output signals to the bit allocator 23, which outputs a compressed voice signal G.
As can be seen in more detail in Figure 7, each narrow band signal C is output to instantaneous energy estimator 21 whose output is applied to perceptual comparator 25 along with the energy estimates from the other bands. This in turn controls quantizer 26, which is input to number coder 27 that produces the coded, compressed output signal G.
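In code form, the Figure 7 chain for one band i might look like the following sketch (bands is assumed to be a cell array holding the narrow band signals C; maskthresh and bitsfor are hypothetical placeholders standing in for comparator 25 and the bit allocator, not functions defined in the appendices):

Eall = cellfun(@(c) mean(c.^2), bands);  % energy estimate E for every band (estimator 21)
thr  = maskthresh(Eall, i);              % hypothetical: audibility threshold for band i
                                         % from the energies of all bands (comparator 25)
bi   = bitsfor(Eall(i), thr);            % hypothetical: bits keeping quantization noise masked
s    = max(abs(bands{i}));               % per-band scale factor
q    = round(bands{i}/s * (2^(bi-1)-1)); % quantizer 26; number coder 27 then codes q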
The decoder (not shown) recovers from the number coded signal the quantized signal of each band and regenerates a perceptually close representation of the original voice signal.
The described system is particularly suitable for use in any voice transport system, such as modern telephones, where the voice signal is carried over a digital link.
APPENDIX A: MATLAB CODE FOR DETERMINATION OF DAUBECHIES FILTER COEFFICIENTS
function hn = daub(N2)
% program to generate Daubechies filters (e.g. D-6)
% as in her paper and Burrus' last appendix page
a = 1; p = 1; q = 1; hn = [1 1];
for j = 1:N2-1,
    hn = conv(hn, [1 1]);          % build up the binomial (maximally flat) factor
    a = -a*.25*(j+N2-1)/j;
    p = conv(p, [1 -2 1]);
    q = [0 q 0] + a*p;             % polynomial whose roots give the remaining zeros
end;
q = sort(roots(q));
hn = conv(hn, real(poly(q(1:N2-1)))); % keep the smaller-magnitude half of the roots
hn = hn/sum(hn);                   % normalize so sum(hn) = 1
APPENDIX B: CODE TO DECOMPOSE THE SIGNAL INTO BARK BANDS AND TO RECONSTRUCT

% Program decomposes input speech signal roughly into bark scale (5 stages)
% and recombines again to produce output.
% Not all frequency bands go through the same delay; this should be
% compensated for in the final version.
% 3-band and 5-band filters are binomial filters used for simplicity.
% Others are Daubechies. Filter order or type is not optimized.
% Block processing is performed.
% We may not need the first three bands since they are cut off anyway by
% the operating company, but we have to ensure that the signal in the next
% two bands is reconstructable exactly.
%******* This program is provided simply for general reference. It calls
%******* a lot of functions that are not listed here. For a running copy,
%******* consult the software provided.
% stage 1: 0-2k; 2-4k;
% stage 2: 0-1, 1-2; 2-3, 3-4;
% stage 3: 0-.5, .5-1; 1-1.5, 1.5-2; 2-2.333, 2.333-2.667, 2.667-3; 3-3.5, 3.5-4;
% stage 4: 0-.1, .1-.2, .2-.3, .3-.4, .4-.5;
%          .5-.75, .75-1; 1-1.25, 1.25-1.5; 1.5-1.75, 1.75-2
% stage 5: .5-.625, .625-.75;
%          .75-.875, .875-1
%
Ndb=16; % length of Daub filter for half-band lp/hp
N=2^14; % number of data points (one full sentence)
%
% generate input
INP2=spll(1:16384);
h1=daub(Ndb/2); % daub lowpass filter
Llp=length(h1);
hlps=h1(Llp:-1:1);
hhps=zeros(1,Llp);
for m=1:Llp; hhps(m)=h1(m).*(-1)^(m+1); end;
% get highpass
h2=hhps(Llp:-1:1); % reversed high pass filter
%
% stage 1 i/p is x 0-4khz
yh = filtdec(h2,INP2,N); % 2-4
yl = filtdec(h1,INP2,N); % 0-2
% stage 2 high freq i/p is yh 2-4Khz
yh2h = filtdec(h1,yh,N/2); % 3-4
yh2l = filtdec(h2,yh,N/2); % 2-3
% stage 2 low freq i/p is yl 0-2Khz
yl2h = filtdec(h2,yl,N/2); % 1-2
yl2l = filtdec(h1,yl,N/2); % 0-1
% stage 3 i/p is yh2h 3-4Khz
y31 = filtdec(h1,yh2h,N/4); % 3.5-4 #18
y32 = filtdec(h2,yh2h,N/4); % 3-3.5 #17
% stage 3 i/p is yh2l 2-3khz: split yh2l into 3 bands y33,y34,y35
hb3lp=[.25 .5 .25];
hb3hp=[.25 -.5 .25];
hb3bp=[.3536 0 -.3536];
hb3bpa=[-.3536 0 .3536];
y35=filtdec3(hb3lp,yh2l,N/4);  % 2-2.33    #16
y34=filtdec3(hb3bpa,yh2l,N/4); % 2.33-2.66 #15
y33=filtdec3(hb3hp,yh2l,N/4);  % 2.66-3    #14
% stage 3 i/p is yl2h 1-2Khz
y36 = filtdec(h1,yl2h,N/4); % 1.5-2
y37 = filtdec(h2,yl2h,N/4); % 1-1.5
% stage 3 i/p is yl2l 0-1K
y38 = filtdec(h2,yl2l,N/4); % .5-1
y39 = filtdec(h1,yl2l,N/4); % 0-.5
% stage 4 i/p is y36 1.5-2K
y41 = filtdec(h1,y36,N/8); % 1.75-2   #13
y42 = filtdec(h2,y36,N/8); % 1.5-1.75 #12
% stage 4 i/p is y37 1-1.5k
y43 = filtdec(h2,y37,N/8); % 1.25-1.5 #11
y44 = filtdec(h1,y37,N/8); % 1-1.25   #10
% stage 4 i/p is y38 .5-1
y45 = filtdec(h1,y38,N/8); % .75-1
y46 = filtdec(h2,y38,N/8); % .5-.75
% stage 4 i/p is y39 0-.5: split into 5 bands
ha=[1 4 6 4 1]/16;
hb=[1 2 0 -2 -1]/8;
hba=[-1 -2 0 2 1]/8;
hc=[.1531 0 -.3062 .1531];
hca=[.1531 -.3062 0 .1531];
hd=[1 -2 0 2 -1]/8;
hda=[-1 2 0 -2 1]/8;
he=[1 -4 6 -4 1]/16;
y411=filtdec5(ha,y39,N/8);
y410=filtdec5(hba,y39,N/8);
y49=filtdec5(hca,y39,N/8);
y48=filtdec5(hda,y39,N/8);
y47=filtdec5(he,y39,N/8);
% stage 5 i/p is y45 .75-1
y51 = filtdec(h1,y45,N/16); % .875-1
y52 = filtdec(h2,y45,N/16); % .75-.875
% stage 5 i/p is y46 .5-.75
y53 = filtdec(h2,y46,N/16); % .625-.75
y54 = filtdec(h1,y46,N/16); % .5-.625
%###### start recombination ##### bismillah ####################
% stage 5 i/p is y51-y52 .75-1
x45 = intfilt(hlps,y51,N/32)+intfilt(hhps,y52,N/32); % .875-1, .75-.875
% stage 5 i/p is y53-y54 .5-.75
x46 = intfilt(hhps,y53,N/32)+intfilt(hlps,y54,N/32); % .625-.75, .5-.625
% stage 4 i/p is y41-y42 1.5-2K
x36 = intfilt(hlps,y41,N/16)+intfilt(hhps,y42,N/16); % 1.75-2, 1.5-1.75
% stage 4 i/p is y43-y44 1-1.5k
x37 = intfilt(hhps,y43,N/16)+intfilt(hlps,y44,N/16); % 1.25-1.5, 1-1.25
% stage 4 i/p is y45-y46 .5-1
x38 = intfilt(hlps,y45,N/16)+intfilt(hhps,y46,N/16); % .75-1, .5-.75
% stage 4 i/p is y47..y411 0-.5
i411=intfilt5(ha,y411,N/40);
i410=intfilt5(hb,y410,N/40);
i49=intfilt5(hc,y49,N/40);
i48=intfilt5(hd,y48,N/40);
i47=intfilt5(he,y47,N/40);
x39 = i411 + i410 + i49 + i48 + i47;
% stage 3 i/p is y31..y32
xh2h = intfilt(hlps,y31,N/8)+intfilt(hhps,y32,N/8); % 3.5-4, 3-3.5
% stage 3 i/p is y33..y35 2-3khz
xh2l = intfilt3(hb3lp,y35,N/12);
xh2l = xh2l+intfilt3(hb3bp,y34,N/12)+intfilt3(hb3hp,y33,N/12); % 2-2.33, 2.33-2.66, 2.66-3
% stage 3 i/p is x36..x37 1-2Khz
xl2h = intfilt(hlps,y36,N/8)+intfilt(hhps,y37,N/8); % 1.5-2, 1-1.5
% stage 3 i/p is x38..x39 0-1K
xl2l = intfilt(hhps,y38,N/8)+intfilt(hlps,y39,N/8); % .5-1, 0-.5
% stage 2 high freq i/p is xh2l..xh2h 2-4Khz
xh = intfilt(hlps,xh2h,N/4)+intfilt(hhps,xh2l,N/4); % 2-4
% stage 2 low freq i/p is xl2l..xl2h 0-2Khz
xl = intfilt(hhps,xl2h,N/4)+intfilt(hlps,xl2l,N/4);
% stage 1 i/p is xh,xl 0-4khz
xrecon5 = intfilt(hhps,xh,N)+intfilt(hlps,xl,N); % 2-4, 0-2
fprintf('Original ...pausing')
pause
playsound(spll(1:16384),8000)
pause
fprintf('reconstructed')
playsound(xrecon5,8000)
APPENDIX C: FILTER COEFFICIENTS
The analysis system implements the filters needed to split the signal into the bark bands of Figure 5 using the arrangement shown in Figure 1. However, it is to be noted that when splitting the high frequency bands, the locations of the highpass and lowpass filters are reversed to maintain the outputs in the desired order after decimation. The specific filters are as follows:
1) The filters on the synthesis side are reversed and delayed to ensure proper reconstruction (see code). 2) Lowpass and highpass filters for two-band splits are all 12th-order Daubechies filters as generated by the function daub.
3) 3-band splits use binomial filters as generated by the function band3binomial (normalized). Note that the lowpass and highpass filters are both symmetric and thus unaffected by coefficient reversal, but the bandpass filter is affected: hb3bpa is used for the analysis filter while hb3bp is used for the synthesis one.
hb3lp=[.25 .5 .25]; hb3hp=[.25 -.5 .25];
hb3bp=[.3536 0 -.3536]; hb3bpa=[-.3536 0 .3536];
4) 5-band splits use binomial filters as generated by the function band5binomial (normalized). Again, note the usage of hba, hca, hda for analysis and hb, hc, hd respectively for synthesis (hc is symmetric, so hca equals hc).
ha=[1 4 6 4 1]/16; hb=[1 2 0 -2 -1]/8; hba=[-1 -2 0 2 1]/8;
hc=[.1531 0 -.3062 0 .1531]; hca=[.1531 0 -.3062 0 .1531];
hd=[1 -2 0 2 -1]/8; hda=[-1 2 0 -2 1]/8; he=[1 -4 6 -4 1]/16;
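As noted at the top of this appendix, the highpass and lowpass filters trade places in the high-frequency branches because decimation mirrors the spectrum of a high band. A tiny stand-alone illustration of the effect (not part of the original listing):

fs = 8000; t = (0:255)/fs;
x = cos(2*pi*3000*t); % a 3 kHz tone, i.e. in the upper 2-4 kHz band
y = x(1:2:end);       % decimate by 2: the new sampling rate is 4 kHz
% At 4 kHz the tone appears at 4000-3000 = 1000 Hz, so the top of the old
% band lands at the bottom of the new one; hence the swapped filters.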
APPENDIX D: MATLAB CODE FOR DETERMINATION OF MASKING THRESHOLD AND QUANTIZATION

% 11111. Compute power in each filter band
%
Pw(1:18)=zeros(1,18);
for k=1:length(y47)
Pw(1)=Pw(1)+y411(k)^2; Pw(2)=Pw(2)+y410(k)^2;
Pw(3)=Pw(3)+y49(k)^2;  Pw(4)=Pw(4)+y48(k)^2;
Pw(5)=Pw(5)+y47(k)^2;
end
Pw(1:5)=Pw(1:5)/length(y47);
for k=1:length(y54)
Pw(6)=Pw(6)+y54(k)^2;
Pw(7)=Pw(7)+y53(k)^2; Pw(8)=Pw(8)+y52(k)^2; Pw(9)=Pw(9)+y51(k)^2;
end
Pw(6:9)=Pw(6:9)/length(y54);
for k=1:length(y44)
Pw(10)=Pw(10)+y44(k)^2; Pw(11)=Pw(11)+y43(k)^2;
Pw(12)=Pw(12)+y42(k)^2; Pw(13)=Pw(13)+y41(k)^2;
end
Pw(10:13)=Pw(10:13)/length(y44);
for k=1:length(y35)
Pw(14)=Pw(14)+y35(k)^2; Pw(15)=Pw(15)+y34(k)^2; Pw(16)=Pw(16)+y33(k)^2;
end
Pw(14:16)=Pw(14:16)/length(y35);
for k=1:length(y32)
Pw(17)=Pw(17)+y32(k)^2; Pw(18)=Pw(18)+y31(k)^2;
end
Pw(17:18)=Pw(17:18)/length(y32);
bar(Pw); title('power in critical bands'); pause
bar(10*log(Pw)/ee); title('log10 power in critical bands'); pause % ee = log(10), presumably defined earlier in the listing
%
% 222222. find the spreading function
%
Spdb=spreadf; Spdb=Spdb-40;
for i=1:18; for j=1:18;
Sp(i,j)=10^(Spdb(i,j)/10); % get rid of the dB
end; end
%
% 333333. find the spread critical band spectrum as in Johnston, JSAC 1988
%
C=Sp*Pw';
bar(C); title('spread critical band spectrum'); pause
logC=log(C)/ee;
bar(10*logC); title('dB spread critical band spectrum'); pause
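The function spreadf, which supplies the 18 x 18 inter-band masking matrix in dB, is not reproduced here either. Johnston's method uses the spreading function of Schroeder et al. (1979), so a plausible reconstruction is the sketch below (an assumption, not the original routine; the -40 dB renormalization is applied after the call, as above):

function Spdb = spreadf
% assumed: bark-domain spreading function of Schroeder et al., evaluated
% at integer band separations for the 18 roughly one-bark-wide bands
Spdb = zeros(18,18);
for i = 1:18   % maskee band
for j = 1:18   % masker band
dz = i - j;    % separation in bark
Spdb(i,j) = 15.81 + 7.5*(dz+0.474) - 17.5*sqrt(1 + (dz+0.474)^2);
end
end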
%
% 4444444. get the masking threshold
% To calculate the noise masking threshold, first get the offset.
% According to Johnston, speech has a spectral flatness measure
% of about -20 to -30 dB. This gives tonality index alpha = .3 to .5.
% Alpha=.5 gives the lower (conservative) threshold and is used here
% in general: offset(i) = alpha*(14.5+i) + 5.5*(1-alpha).
% for speech: offset(i) = 10 + .5*i  (alpha = .5)
% for tone:   offset(i) = 14.5 + i   (alpha = 1)
%
for i=1:18
offset(i) = 10 + .5*i; % subtracted (in dB) from the spread spectrum to get the threshold
end
%
bar(Mthr); title('final masking threshold linear'); pause
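The listing never shows how Mthr is formed from the spread spectrum C and the offset before being plotted. Following Johnston, the threshold is the spread spectrum lowered by the offset in dB, so the elided step presumably resembles the line below (a reconstruction; Johnston additionally renormalizes the spread spectrum and floors the result at the absolute threshold of hearing):

Mthr = 10.^((10*log10(C') - offset)/10); % assumed: spread spectrum minus offset, returned to linear power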
Mthrdb=10*log10(Mthr)
bar(Mthrdb); title('final masking threshold in db')
%%
out41(1,:)=y411; out41(2,:)=y410; out41(3,:)=y49;
out41(4,:)=y48;  out41(5,:)=y47;  out5(1,:)=y54;
out5(2,:)=y53;   out5(3,:)=y52;   out5(4,:)=y51;
out4m(1,:)=y44;  out4m(2,:)=y43;  out4m(3,:)=y42;
out4m(4,:)=y41;  out31(1,:)=y35;  out31(2,:)=y34;
out31(3,:)=y33;  out3h(1,:)=y32;  out3h(2,:)=y31;
%%%%%%%%%%%%% quantization
% Sigma is defined as the sqrt of the masked power (linear)
% masked power here is calculated based on Johnston. Need to verify, or use
% "Sub-band coding of digital audio signals", Philips, 1989
sigmasq = Mthr; % masked power
%
% noise/masked power ratio log(m/sigma)-b, m = peak signal in band
% used differently here
NMR=log2(Pw./sigmasq)
bar(NMR); title('NMR'); pause
NMRs=NMR.*sqrt(Pw); % scaling just to make sure the reduction by 1 is meaningful
bar(NMRs); title('NMRs=NMR*sqrt(Pw)')
NMRs(1:3)=[-1 -1 -1]; % ensure channels 1:3 get no bits. This will introduce
% errors in reconstruction of other channels but simulates tel. line
pause
bavg=4;       % final average number of bits wanted
NB=N*bavg;    % total number of bits allowed for the block
b=zeros(1,18);
while NB>0
  [maxNMRs,kmax]=max(NMRs);
  if( (max(NMRs)) > 0 )
    if (b(kmax)<8) % increment bits allocated only if b<8, otherwise just reduce NMRs
      b(kmax)=b(kmax)+1;
      NMRs(kmax)=NMRs(kmax)-1;
      if kmax <=5,        NB=NB-length(out41(1,:)); % other out41 rows have the same length
      elseif (kmax <=9),  NB=NB-length(out5(1,:));
      elseif (kmax <=13), NB=NB-length(out4m(1,:));
      elseif (kmax <=16), NB=NB-length(out31(1,:));
      elseif (kmax <=18), NB=NB-length(out3h(1,:));
      else;end % end of kmax branch
    else; NMRs(kmax)=NMRs(kmax)-1; end % end of b(kmax) branch
  else % max(NMRs) <= 0
    fprintf('looks like threshold is higher than power in what is left');
  end % end of maxNMRs
  if( (max(NMRs)) < 0 ), break, end
end % of while loop
fprintf('end of bit allocation section, followed by bits/band')
b
pause
y31=quantizeM(out3h(2,:),b(18));
y32=quantizeM(out3h(1,:),b(17));
y33=quantizeM(out31(3,:),b(16));
y34=quantizeM(out31(2,:),b(15));
y35=quantizeM(out31(1,:),b(14));
y41=quantizeM(out4m(4,:),b(13));
y42=quantizeM(out4m(3,:),b(12));
y43=quantizeM(out4m(2,:),b(11));
y44=quantizeM(out4m(1,:),b(10));
y51=quantizeM(out5(4,:),b(9));
y52=quantizeM(out5(3,:),b(8));
y53=quantizeM(out5(2,:),b(7));
y54=quantizeM(out5(1,:),b(6));
y47=quantizeM(out41(5,:),b(5));
y48=quantizeM(out41(4,:),b(4));
y49=quantizeM(out41(3,:),b(3));
y410=quantizeM(out41(2,:),b(2));
y411=quantizeM(out41(1,:),b(1));
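The routine quantizeM is not reproduced either. Given that b(k) is the per-sample bit budget for band k and that bands allocated no bits must come back as silence, a reasonable sketch is a uniform quantizer scaled to the block peak (hypothetical; a real codec would also have to transmit the peak m to the decoder):

function y = quantizeM(x, b)
% assumed: uniform mid-rise quantization of the block x with b bits
if b <= 0, y = zeros(size(x)); return; end % bands given no bits are zeroed
m = max(abs(x)) + eps;    % block peak; eps avoids division by zero
step = 2*m/2^b;           % 2^b levels across [-m, m]
y = step*(floor(x/step) + 0.5);       % round to level centres
y = min(max(y, -m+step/2), m-step/2); % clamp the end levels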

Claims
1. A voice compression apparatus for use in a telecommunications system, comprising means for compressing speech signals, characterized in that said compression means applies a compression scheme based on the auditory response characteristics of the human ear.
2. A voice compression apparatus as claimed in claim 1, characterized in that said auditory response characteristics include the ability to mask certain frequencies in the presence of others.
3. A voice compression apparatus as claimed in claim 2, characterized in that it comprises means for deriving a wavelet transform of said speech signals, the number of bits used to represent the transform coefficients being variable and dependent on the masking curve for a human ear so that introduced quantization noise is located in frequency bands where it is masked by other frequencies.
4. A voice compression apparatus as claimed in claim 2, characterized in that it further comprises means for decomposing a voice signal into a bank of narrow frequency band signals; an estimator for estimating the instantaneous energy in each band; a comparator for comparing the energy in each band to that of all other bands to estimate the audibility threshold of the signal in each band; and means for independently quantizing the signal in each band on the basis of the estimated audibility threshold.
5. A voice compression apparatus as claimed in claim 4, characterized in that it further comprises a final number coding stage for further compressing the voice signal.
6. A voice compression apparatus as claimed in claim 4, characterized in that said narrow bands are bark bands.
7. A voice compression apparatus as claimed in claim 4, characterized in that it comprises means for allocating a number of bits in said bark bands based on the ability of the human ear to tolerate quantization noise in each band due to masking effects.
8. A method of compressing a voice signal for transmission in a telecommunications system, characterized in that voice signals are compressed taking into account the auditory response characteristics of the human ear.
9. A method as claimed in claim 8, characterized in that said auditory characteristics include the ability to mask certain frequencies in the presence of others.
10. A method as claimed in claim 9, characterized in that it comprises the steps of decomposing the voice signal into a bank of narrow frequency band signals, estimating the instantaneous energy in each band, comparing the energy in each band to that of all other bands to estimate the audibility threshold of the signal in each band, and independently quantizing the signal in each band on the basis of the estimated audibility threshold.
11. A method of compressing a voice signal, characterized in that it comprises the steps of finding the wavelet transform of a speech segment, decomposing the signal into a series of bark bands, finding the power spectrum on the bark scale directly, determining the masking curve for the segment on the basis of psychoacoustic analysis, determining the number of bits to be allocated to the wavelet transform coefficients in each band to ensure quantization noise in that band is less than the audible threshold for the current speech segment, and transmitting the quantized wavelet coefficients along with the number of bits used.
12. A voice compression apparatus characterized in that it comprises means for decomposing a voice signal into a bank of narrow frequency band signals; an estimator for estimating the instantaneous energy in each band; a comparator for comparing the energy in each band to that of all other bands to estimate the audibility threshold of the signal in each band; and means for independently quantizing the signal in each band on the basis of the estimated audibility threshold.
PCT/CA1996/000127 1995-03-04 1996-03-01 Voice-band compression system WO1996027869A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU47806/96A AU4780696A (en) 1995-03-04 1996-03-01 Voice-band compression system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB9504377.4 1995-03-04
GB9504377A GB9504377D0 (en) 1995-03-04 1995-03-04 Voice-band compression system

Publications (1)

Publication Number Publication Date
WO1996027869A1 true WO1996027869A1 (en) 1996-09-12

Family

ID=10770654

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA1996/000127 WO1996027869A1 (en) 1995-03-04 1996-03-01 Voice-band compression system

Country Status (4)

Country Link
AU (1) AU4780696A (en)
CA (1) CA2211402A1 (en)
GB (1) GB9504377D0 (en)
WO (1) WO1996027869A1 (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2648567A1 (en) * 1989-05-24 1990-12-21 Inst Nat Sante Rech Med Method for the digital processing of a signal by reversible transformation into wavelets
US5388182A (en) * 1993-02-16 1995-02-07 Prometheus, Inc. Nonlinear method and apparatus for coding and decoding acoustic signals with data compression and noise suppression using cochlear filters, wavelet analysis, and irregular sampling reconstruction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
D'ALESSANDRO C ET AL: "Transformation en ondelettes sur une échelle fréquentielle auditive", Colloque GRETSI sur le traitement du signal et des images, Juan-les-Pins, 16-20 September 1991, vol. 2, pages 745-748, XP000242883 *
MOHAN VISHWANATH: "The recursive pyramid algorithm for the discrete wavelet transform", IEEE Transactions on Signal Processing, vol. 42, no. 3, 1 March 1994, pages 673-676, XP000450724 *
SEN D ET AL: "Use of an auditory model to improve speech coders", Speech Processing, Minneapolis, 27-30 April 1993, vol. 2 of 5, pages II-411 - II-414, IEEE, XP000427813 *

Also Published As

Publication number Publication date
GB9504377D0 (en) 1995-04-26
AU4780696A (en) 1996-09-23
CA2211402A1 (en) 1996-09-12


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BB BG BR BY CA CH CN CZ DE DK EE ES FI GB GE HU IS JP KE KG KP KR KZ LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG US UZ VN AM AZ BY KG KZ MD RU

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): KE LS MW SD SZ UG AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
ENP Entry into the national phase

Ref document number: 2211402

Country of ref document: CA

Kind code of ref document: A


ENP Entry into the national phase

Ref document number: 1997 894940

Country of ref document: US

Date of ref document: 19970904

Kind code of ref document: A

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase