WO2006114101A1 - Detection of speech present in a noisy signal and speech enhancement making use thereof


Info

Publication number: WO2006114101A1
Application number: PCT/DK2006/000221
Authority: WO (WIPO/PCT)
Prior art keywords: speech, noise, smoothing, temporal, spectral
Other languages: French (fr)
Inventors: Søren Vang ANDERSEN, Karsten Vandborg SØRENSEN
Original assignee: Aalborg Universitet
Application filed by Aalborg Universitet
Publication of WO2006114101A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a method for speech presence detection of speech in a noisy signal. Based on the noisy signal represented as a temporal-spectral distribution, the method includes performing a temporal-spectral smoothing on the temporal-spectral representation of the noisy signal prior to making a speech presence decision. This allows detection of connected regions in the time-frequency domain representation of the noisy signal, and thus allows improved possibilities for efficient noise suppression without musical tones as resulting artifacts. In preferred embodiments, the temporal-spectral smoothing includes performing a spectral smoothing prior to performing a temporal smoothing, and preferably the temporal smoothing includes performing a recursive smoothing with time and frequency varying smoothing parameters, these parameters being lower limited so as to ensure a minimum degree of smoothing. In preferred embodiments, the speech presence decision is based on a minimum tracking of an output of the temporal-spectral smoothing, and a noise estimation is performed based on the speech presence decision, this decision being a binary decision. The mentioned speech presence decision is advantageous for speech enhancement, especially speech enhancement including applying different attenuation functions to regions detected as noise and speech, respectively. In a preferred embodiment noise regions are attenuated by multiplying by a scalar, e.g. a scalar of 0.05-0.4. The methods are applicable within a number of devices, e.g. hearing aids, headsets and mobile phones.

Description

DETECTION OF SPEECH PRESENT IN A NOISY SIGNAL AND SPEECH ENHANCEMENT MAKING USE THEREOF
Field of the invention
The invention relates to the field of signal processing, more specifically to processing aiming at detecting presence of speech in a noisy signal. Especially, the speech presence detection may serve the purpose of enhancing speech contained in the noisy signal. The invention provides a method and a device, e.g. a headset, adapted to perform the method.
Background of the invention
Signal processing with the purpose of enhancing speech contained in a noisy signal so as to improve speech intelligibility is well-known within a number of applications, e.g. hearing aids, headsets for mobile phones etc. However, most known low-complexity methods suited for e.g. hearing aids suffer from poor sound quality.
In the state of the art it is known to process the noisy signal so as to represent it as a temporal-spectral distribution, e.g. a periodogram. A speech presence detection is performed in order to classify regions of the temporal-spectral distribution as either noise or speech plus noise. A spectral subtraction is then performed in order to suppress noise, e.g. an average noise spectrum is subtracted from all regions. However, a spectral subtraction significantly changes the noise in the regions where no speech is present. The noise is attenuated, but since the variation of the actual noise spectrum over the region considered is not taken into account, the remaining noise suffers from artificial-sounding musical tones, and the result is an overall poor sound quality.
Summary of the invention
Thus, it may be seen as an object of the present invention to provide a speech enhancement method that combines a low computational complexity with a high sound quality, without suffering from severe artefacts such as musical tones.
In a first aspect, the invention provides a method for speech presence detection of speech in a noisy signal, the noisy signal being represented as a temporal-spectral distribution, the method including performing a temporal-spectral smoothing on the temporal-spectral representation of the noisy signal prior to making a speech presence decision.
By "represented as a temporal-spectral distribution" is understood a spectrogram type of representation of a portion of noisy signal in time-frequency domain distribution, thus a three-dimensional representation of the, i.e. time, frequency and e.g. signal intensity or signal power. E.g. the temporal dimension can be divided in discrete time frames, and the spectral dimension can be divided in discrete frequency bins. However, it is to be understood that other temporal-spectral distributions other than fixed discrete time frame and frequency bins are possible. The method is advantageous e.g. in that a smoothing of a time-frequency domain representation of the noisy signal prior to making the speech presence decision allows a better possibility to detect connected time-frequency regions of noise and connected time- frequency regions of speech plus noise. Thus, the smoothing helps to facilitate the speech presence decision task, and it leads in addition to a better noise estimate. E.g. in a preferred embodiment, the speech presence decision is a binary decision.
The speech presence detection method is advantageous for use in connection with speech enhancement since the mentioned detection of connected regions of noise and speech plus noise, respectively, provides a basis for applying an efficient noise suppression algorithm which still, due to the connected regions, results in a natural sounding enhanced speech. E.g. it becomes possible to apply different attenuation functions to noise regions and speech plus noise regions and thus achieve a more efficient noise suppression without resulting in musical tone artifacts.
In preferred embodiments of the method, the temporal-spectral smoothing includes performing a spectral smoothing prior to performing a temporal smoothing, and especially it is preferred that the temporal smoothing includes performing a recursive smoothing with time and frequency varying smoothing parameters, and more preferably these smoothing parameters are lower limited so as to ensure a minimum degree of smoothing.
Preferably, the speech presence decision is based on a minimum tracking of an output of the temporal-spectral smoothing. Preferably, a noise estimation is performed based on the speech presence decision.
In a second aspect, the invention provides a speech enhancement method including performing the speech presence detection method described in the first aspect, and performing enhancement of the speech in the noisy signal based on the speech presence detection.
Preferably, a first attenuation function is applied to regions of the temporal-spectral distribution where presence of speech is detected, and a second attenuation function is applied to other regions of the temporal-spectral distribution, the first and second attenuation functions being different. With different attenuation of the detected noise regions and the detected speech plus noise regions, it is possible to apply more dedicated attenuation functions that result in a more efficient noise suppression still without resulting in musical noise or tone artifacts. Especially, it is preferred that the second attenuation function applied to the other regions, i.e. the regions detected as noise, of the temporal-spectral distribution is a scalar attenuation, and preferably this scalar attenuation is implemented by multiplying by a scalar in the range 0.05 to 0.4, such as in the range 0.1 to 0.3, and preferably in the range 0.15 to 0.25, such as 0.2 or approximately 0.2.
Since the speech enhancement method of the second aspect includes performing the speech presence detection method of the first aspect, it has the same advantages as mentioned for the first aspect, and the preferred embodiments mentioned for the first aspect therefore also apply.
The speech enhancement method is suited for applications where a noisy audio signal containing speech is corrupted by noise. The noise may be caused by electrical noise interfering with an electrical audio signal, or the noise may be acoustic noise such as noise introduced during recording of the speech, e.g. a person speaking on a telephone at a place with traffic noise etc. The speech enhancement method can then be used to increase speech intelligibility by enhancing the speech in relation to the noise.
In a third aspect the invention provides a device including a processor adapted to perform the method of any one of the first or second aspects. Thus, the advantages and embodiments mentioned for the first and second aspects apply for the third aspect as well.
Especially, the device may be: a mobile phone, a radio communication device, an internet telephony system, sound recording equipment, sound processing equipment, sound editing equipment, broadcasting sound equipment, or a monitoring system.
Alternatively, the device may be: a hearing aid, a headset, an assistive listening device, an electronic hearing protector, or a headphone with a built-in microphone (so as to allow sound from the environment to reach the listener).
In a fourth aspect, the invention provides computer executable program code adapted to perform the method according to any one of the first, second or third aspects. Thus, the same advantages as mentioned for these aspects apply.
The program code may be present on a program carrier, e.g. a memory card, a disk etc. or in a RAM or ROM memory of a device.
Brief description of the drawings
In the following the invention is described in more detail with reference to the accompanying figures, of which
Fig. 1 illustrates a block diagram of a preferred speech enhancement algorithm, and
Fig. 2 illustrates a preferred device.
While the invention is susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and will be described in detail herein. It should be understood, however, that the invention is not intended to be limited to the particular forms disclosed. Rather, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

Description of preferred embodiments
Fig. 1 illustrates a block diagram of main parts of a preferred speech enhancement algorithm, including preferred steps of a speech presence detection method indicated with a dashed box. The following detailed description of a speech enhancement algorithm with natural sounding residual noise based on connected time-frequency speech presence regions is based on the principles already described. The connected time-frequency regions are used by a noise estimation method, and both the speech presence decisions and the noise estimate are used in the speech enhancement method. Different attenuation rules are applied to regions with and without speech presence to achieve enhanced speech with natural sounding attenuated background noise. The preferred speech enhancement method has a low computational complexity, and thus it is suited e.g. for implementation in low processing power devices such as hearing aids. Listening tests have shown that the method provides higher scores than minimum mean-square error log-spectral amplitude (MMSE-LSA) and decision-directed MMSE-LSA methods.
The performance of many speech enhancement methods relies mainly on the quality of a noise Power Spectral Density (PSD) estimate. When the noise estimate differs from the true noise it will lead to artifacts in the enhanced speech. The approach taken in this paper is based on connected region speech presence detection. Our aim is to exploit spectral and temporal masking mechanisms in the human auditory system [1] to reduce the perception of these artifacts in speech presence regions and eliminate the artifacts in speech absence regions. We achieve this by leaving down-scaled natural sounding background noise in the enhanced speech in connected time-frequency regions with no speech presence. The down-scaled natural sounding background noise will spectrally and temporally mask artifacts in the speech estimate while preserving the naturalness of the background noise.
In the definition of speech presence regions we are inspired by the work by Yang [2]. Yang demonstrates high perceptual quality of a speech enhancement method where constant gain is applied in frames with no detected speech presence. Yang lets a single decision cover a full frame. Thus, musical noise is present in the full spectrum of the enhanced speech in frames with speech activity. We therefore extend the notion of speech presence to individual time-frequency locations. This, in our experience, significantly improves the naturalness of the residual noise. The speech enhancement method proposed in this paper thereby eliminates audible musical noise in the enhanced speech. However, fluctuating speech presence decisions will reduce the naturalness of the enhanced speech and the background noise. Thus, reasonably connected regions of the same speech presence decision must be established.
To achieve this, we use spectral-temporal periodogram smoothing. To this end, we make use of the spectral-temporal smoothing method by Martin and Lotter [3], which extends the original ground-breaking work by Martin [4, 5]. Martin and Lotter derive optimum smoothing coefficients for (generalized) χ²-distributed spectrally smoothed spectrograms, which is particularly well suited for noise types with a smooth power spectrum. The underlying assumption in this approach is that the real and imaginary parts of the associated STFT coefficients for the averaged periodograms have the same means and variances. For the application of spectral-temporal smoothing to obtain connected regions of speech presence decisions, we augment Martin and Lotter's smoothing method with the spectral smoothing method used by Cohen and Berdugo [6].

For minimum statistics noise estimation, Martin [5] has suggested a theoretically founded bias compensation factor, which is a function of the minimum search window length, the smoothed noisy speech, and the noise PSD estimate variances. This enables a low-biased noise estimate that does not rely on a speech presence detector. However, as our proposed speech enhancement method has connected speech presence regions as an integrated component, this enables us to make use of a new, simple, yet efficient, bias compensation. To verify the performance of the new bias compensation, we objectively evaluate the noise estimation method that uses this bias compensation of minimum tracks from our spectrally-temporally smoothed periodograms, prior to integrating this noise estimate in the final speech enhancement method.

As a result, our proposed speech enhancement algorithm has a low computational complexity, which makes it particularly relevant for application in digital signal processors with limited computational power, such as those found in digital hearing aids. In particular, the obtained algorithm provides a significantly higher perceived quality than our implementation of the Decision-Directed Minimum Mean-Square Error Log-Spectral Amplitude (MMSE-LSA-DD) estimator [7] when evaluated in listening tests. Furthermore, the noise PSD estimate that we use to obtain a noise magnitude spectrum estimate for the attenuation rule in connected regions of speech presence is shown to be superior to estimates from Minimum Statistics (MS) noise estimation [5] and our implementation of χ²-based noise estimation [3] for spectrally smooth noise types.

The remainder of this paper is organized as follows. In Section II, we describe the signal model and give an overview of the proposed algorithm. In Section III, we list the necessary equations to perform the spectral-temporal periodogram smoothing. Section IV contains a description of our detector for connected speech presence regions, and in Section V, we describe how the spectrally-temporally smoothed periodograms and the speech presence regions can be used to obtain both a noise PSD estimate and a noise periodogram estimate, which both rely on the new bias compensation. In the latter noise estimation method, we estimate the squared magnitudes of the noise Short-Time Fourier Transform (STFT) coefficients. In Section VI the connected region speech presence detector is introduced in a speech enhancement method with the purpose of reducing noise and augmenting listening comfort. Section VII contains the experimental setup and all necessary initializations.
Finally, Section VIII describes the experimental results and Section IX concludes the paper with a discussion of the proposed methods and obtained results.
Structure of the Algorithm
After an introduction to the signal model, we give a structural description of the algorithm to provide an algorithmic overview before the individual methods, which constitute the algorithm, are described in detail.
Signal Model
We assume that noisy speech y(i) at sampling time index i consists of speech s(i) and additive noise n(i). For joint time-frequency analysis of y(i) we apply the K-point STFT, i.e.

Y(λ, k) = Σ_{μ=0}^{L−1} h(μ) y(λR + μ) e^{−j2πkμ/K},    (1)

where λ ∈ Z is the (sub-sampled) time index, k ∈ {0, 1, …, K − 1} is the frequency index, and L is the window length. In this paper, we have that L equals K. The quantity R is the number of samples that successive frames are shifted, and h(μ) is a unit energy window function, i.e. Σ_{μ=0}^{L−1} h²(μ) = 1. From the linearity of (1) we have that

Y(λ, k) = S(λ, k) + N(λ, k),    (2)

where S(λ, k) and N(λ, k) are the STFT coefficients of speech s(i) and additive noise n(i), respectively. We further assume that s(i) and n(i) are zero mean and statistically independent, which leads to a power relation where the noise is additive [8], i.e.

E{|Y(λ, k)|²} = E{|S(λ, k)|²} + E{|N(λ, k)|²}.    (3)
Structural Algorithm Description
The structure of the proposed algorithm, and names of variables with a central role, are shown in Figure 1. After applying an analysis window to the noisy speech, we take the STFT, from which we calculate periodograms P_Y(λ, k) = |Y(λ, k)|². These periodograms are spectrally smoothed, yielding P̄_Y(λ, k), and then temporally smoothed to produce P̄(λ, k). These smoothed periodograms are temporally minimum tracked, and by comparing ratios and differences of the minimum tracked values to P̄(λ, k), they are used for speech presence detection. As a distinct feature of the proposed method we use speech presence detection to achieve low-biased noise PSD estimates P̃_N(λ, k), but also noise periodogram estimates P̂_N(λ, k), which equal P_Y(λ, k) when D(λ, k) = 0, i.e. no detected speech presence. When D(λ, k) = 1, i.e. detected speech presence, the noise periodogram estimate equals the noise PSD estimate, i.e. a recursively smoothed bias compensation factor applied to the minimum tracked values. The bias compensation factor is a recursively smoothed power ratio between the noise periodogram estimates and the minimum tracks. This factor is only updated while no speech is present in the frames and kept fixed while speech is present. A noise magnitude spectrum estimate |Ñ(λ, k)| obtained from the noise PSD estimate and the speech presence decisions are used in a speech enhancement method that applies different attenuation rules for speech presence and no speech presence. For speech synthesis we take the inverse STFT of the estimated speech magnitude spectrum with the phase from the STFT of the noisy speech. The synthesized frame is used in a Weighted OverLap Add (WOLA) method, where we apply a synthesis window before overlap and add.
Spectral-Temporal Periodogram Smoothing

In this section we briefly describe the spectral-temporal periodogram smoothing method.

Spectral Smoothing
First the noisy speech periodograms P_Y(λ, k) are spectrally smoothed by letting a spectrally smoothed periodogram bin P̄_Y(λ, k) consist of a weighted sum of 2D + 1 periodogram bins, spectrally centered at k [6], i.e.

P̄_Y(λ, k) = Σ_{ν=−D}^{D} b(ν) P_Y(λ, ((k − ν))_K),    (4)

where ((m))_K denotes m modulo K, and K is the length of the full (mirrored) spectrum. The window function b(ν) used for spectral weighting is chosen such that it sums to 1, i.e. Σ_{ν=−D}^{D} b(ν) = 1, and therefore preserves the total power of the spectrum.
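To make this concrete, here is a minimal NumPy sketch of the spectral smoothing in (4). The Hann-shaped weight window and the half-width D = 2 are illustrative assumptions, not values taken from the source; any window b(ν) that sums to one preserves the total power.

```python
import numpy as np

def spectral_smooth(P_Y, D=2):
    """Spectrally smooth a periodogram frame P_Y (length K) as in (4):
    each bin becomes a weighted sum of 2D+1 neighbouring bins, with
    circular (modulo-K) indexing and a weight window that sums to one."""
    K = len(P_Y)
    # Illustrative choice: a Hann-shaped window, normalized to unit sum
    # so that the total power of the spectrum is preserved.
    b = np.hanning(2 * D + 3)[1:-1]            # strictly positive weights
    b /= b.sum()
    P_bar = np.zeros(K)
    for nu in range(-D, D + 1):
        P_bar += b[nu + D] * np.roll(P_Y, nu)  # ((k - nu))_K indexing
    return P_bar

# Example: smooth one noisy periodogram frame of length K = 8
P_Y = np.array([1.0, 4.0, 0.5, 2.0, 1.5, 0.2, 3.0, 0.8])
print(spectral_smooth(P_Y, D=2))
```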
Temporal Smoothing
The spectrally smoothed periodograms P̄_Y(λ, k), see Figure 1, are now temporally smoothed recursively with time and frequency varying smoothing parameters α(λ, k), to produce a spectrally-temporally smoothed noisy speech periodogram P̄(λ, k), i.e.

P̄(λ, k) = α(λ, k) P̄(λ − 1, k) + (1 − α(λ, k)) P̄_Y(λ, k).    (5)
We use the optimum smoothing parameters proposed by Martin and Lotter [3]. Their method consists of optimum smoothing parameters for χ²-distributed data with some modifications that make it suited for practical implementation. The optimum smoothing parameters α̂(λ, k) are functions of the "equivalent" degrees of freedom of a χ²-distribution [3]. [The closed-form expression, equation (6), is given as a figure in the original document and is not reproduced here.] For practical implementation the noise PSD, which is used in the calculation of the optimum smoothing parameters, is estimated as the previous noise PSD estimate, i.e. σ̂_N²(λ, k) = P̃_N(λ − 1, k).
Complete Periodogram Smoothing Algorithm
Pseudocode for the complete spectral-temporal periodogram smoothing method is provided in Algorithm 1.

A smoothing parameter correction factor α_c(λ), proposed by Martin [5], is multiplied on α̂(λ, k). Additionally, in this paper, we lower limit the resulting smoothing parameters to ensure a minimum degree of smoothing, i.e.

α(λ, k) = max(α_c(λ) α̂(λ, k), α_min).    (9)
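The following sketch illustrates one frame of the recursive temporal smoothing in (5) with the lower limit of (9). Since the full Martin-Lotter expressions are not legible in the source, the smoothing parameter below uses Martin's simpler single-channel form [5] as a stand-in, and the alpha_c and alpha_min values are illustrative assumptions.

```python
import numpy as np

def temporal_smooth(P_bar_Y, noise_psd_prev, P_prev, alpha_c=0.9, alpha_min=0.3):
    """One frame of recursive temporal smoothing, (5), with the lower
    limit of (9).  The smoothing-parameter formula below is a simplified
    stand-in (Martin's single-channel form), not the full Martin-Lotter
    chi^2-optimal expression."""
    # Simplified optimum smoothing parameter: close to 1 when the previous
    # smoothed value matches the noise PSD, smaller when speech appears.
    ratio = P_prev / np.maximum(noise_psd_prev, 1e-12)
    alpha_opt = 1.0 / (1.0 + (ratio - 1.0) ** 2)
    # Correction factor and lower limit, cf. (9).
    alpha = np.maximum(alpha_c * alpha_opt, alpha_min)
    return alpha * P_prev + (1.0 - alpha) * P_bar_Y

# Example: smooth two consecutive spectrally smoothed frames (K = 4 bins)
noise_psd = np.array([1.0, 1.0, 1.0, 1.0])
P = np.array([1.0, 1.0, 1.0, 1.0])            # initial smoothed periodogram
for frame in ([1.2, 0.9, 5.0, 1.1], [1.0, 1.1, 6.0, 0.9]):
    P = temporal_smooth(np.array(frame), noise_psd, P)
    print(P)
```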
In the next section, we use temporal minimum tracking on the spectrally-temporally smoothed noisy speech periodograms in a method for detection of connected speech presence regions, which later will be used for noise estimation and speech enhancement.
[Algorithm 1, the pseudocode for the complete spectral-temporal periodogram smoothing method, and the accompanying window definitions are given as figures in the original document and are not reproduced here. The accompanying notes state that the analysis window h(μ) is scaled to unit energy, to avoid scaling factors throughout the paper; that the synthesis window h_s(μ) is scaled such that h(μ) multiplied with h_s(μ) yields a HANNING(K) window; that the spectral weighting window is scaled to unit sum; and that the number of frames is calculated at run-time as M = round(length(y(i))/R − 1/2) − 1.]
Connected Speech Presence Regions
We now base a speech presence detection method on comparisons, at each frequency, between the smoothed noisy speech periodograms and temporal minimum tracks of the smoothed noisy speech periodograms.
Temporal Minimum Tracking
From the spectrally-temporally smoothed noisy speech periodograms P̄(λ, k), we track temporal minimum values P̄_min(λ, k) within a minimum search window of length D_min, i.e.

P̄_min(λ, k) = min{ P̄(λ − φ, k) : φ ∈ {0, 1, …, D_min − 1} },    (10)

with φ ∈ Z. D_min is chosen as a tradeoff between the ability to bridge over periods of speech presence [5], which is crucial for the minimum track to be robust to speech presence, and the ability to follow non-stationary noise. Typically, a window length corresponding to 0.5-1.5 seconds yields an acceptable tradeoff between these two properties [5, 6]. We now have that P̄_min(λ, k) is approximately unaffected by periods of speech presence, but on the average biased towards lower values, when no spectral smoothing is applied [5]. Memory requirements of the tracking method can be reduced at the cost of lost temporal resolution, see e.g. [5]. In the following, the temporal minimum tracks P̄_min(λ, k) are used in a speech presence decision rule.
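A minimal sketch of the minimum tracking in (10) follows; the D_min value in the example merely illustrates the 0.5-1.5 second guideline and is not taken from the source.

```python
import numpy as np

def minimum_track(P_smoothed, D_min):
    """Temporal minimum tracking, cf. (10): for every frame lambda and
    frequency k, take the minimum of the smoothed periodogram over the
    most recent D_min frames.  P_smoothed has shape (frames, K)."""
    M, K = P_smoothed.shape
    P_min = np.empty((M, K))
    for lam in range(M):
        start = max(0, lam - D_min + 1)        # shorter window at start-up
        P_min[lam] = P_smoothed[start:lam + 1].min(axis=0)
    return P_min

# Example: ~1 second at fs = 8 kHz with frame skip R = 256 corresponds to
# roughly D_min = 31 frames (illustrative numbers, not from the source).
P = np.abs(np.random.randn(100, 4)) + 1.0      # toy smoothed periodograms
print(minimum_track(P, D_min=31)[-1])
```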
Binary Speech Presence Decision Rule
We have shown in previous work [9] that temporally smoothed periodograms and their temporal minimum tracks can be used for speech presence detection. Also shown in [9] is that including terms to compensate for bias on the minimum tracks improves the speech presence detection performance (measured as the decrease in a cost function) by less than one percent. In this paper, we therefore do not consider a bias compensation factor in the speech presence decision rule. Rather, as we show later in this paper, the speech presence decisions can be used in the estimation of a simple and very well performing bias compensation factor for noise estimation. Similar to our previous approach for temporally smoothed periodograms [9], we now exploit the properties of spectrally-temporally smoothed periodograms P̄(λ, k) in a binary decision rule for the detection of speech presence. The presence of speech will cause an increase of power in P̄(λ, k) at a particular time-frequency location, due to (3). Thus, the ratio between P̄(λ, k) and a noise PSD estimate, given by a minimum track P̄_min(λ, k) with a bias reduction, yields a robust (due to the smoothing) estimate of the signal-plus-noise to noise ratio at the particular time-frequency location. Our connected region speech presence detection method is based on the smooth nature of P̄(λ, k) and P̄_min(λ, k). The smoothness will ensure that spurious fluctuations in the noisy speech power will not cause spurious fluctuations in our speech presence decisions. Thus, we will be able to obtain connected regions of speech presence and of no speech presence. This property is fundamental for the proposed noise estimation and speech enhancement methods.

As a rule to decide between the two speech presence hypotheses, namely

H₀(λ, k): speech absence,    (11)
H₁(λ, k): speech presence,    (12)

which can be written in terms of the STFT coefficients, i.e.

H₀(λ, k): Y(λ, k) = N(λ, k),    (13)
H₁(λ, k): Y(λ, k) = S(λ, k) + N(λ, k),    (14)

we use a combination of two binary initial decision rules. First, let D(λ, k) = i be the decision to believe in hypothesis H_i(λ, k) for i ∈ {0, 1}. We define two initial decision rules, which will give two initial decisions D′(λ, k) and D″(λ, k). The initial decision rules are given by a rule where the smoothed noisy speech periodograms P̄(λ, k) are compared with the temporal minimum tracks P̄_min(λ, k), weighted with a constant γ′, i.e.

D′(λ, k) = 1 if P̄(λ, k) > γ′ P̄_min(λ, k), and D′(λ, k) = 0 otherwise,    (15)

and one where, at time λ, the difference is compared to the average of the minimum tracks scaled by γ″, i.e.

D″(λ, k) = 1 if P̄(λ, k) − P̄_min(λ, k) > γ″ (1/K) Σ_{k′=0}^{K−1} P̄_min(λ, k′), and D″(λ, k) = 0 otherwise.    (16)-(17)

For the initial decision rules we have adopted the notation used by Shanmugan and Breipohl [8]. Because the minimum tracks are representative of the noise PSD's [5], the first initial decision rule classifies time-frequency bins based on the estimated signal-plus-noise to noise power ratio. Note that this can be seen as a special case of the indicator function proposed by Cohen [10] (with ζ₀ = γ′/B_min and γ₀ = ∞). The second initial decision rule D″(λ, k) classifies bins from the estimated power difference between the noisy speech and the noise using a threshold that adapts to the minimum track power level in each frame. Multiplication of the two binary initial decisions corresponds to the logical AND-operation, when we define true as deciding on H₁(λ, k) and false as deciding on H₀(λ, k). We therefore propose a decision that combines the two initial decisions from the initial decision rules above, i.e.

D(λ, k) = D′(λ, k) · D″(λ, k).    (18)
In effect, the combined decision allows detection of speech in low signal-to-noise ratios (SNR's) without letting low power regions with high SNR's contaminate the decisions. Thereby we obtain connected time-frequency regions of speech presence. The constant γ′ is not sensitive to the type and intensity of environmental noise [11], and it can be adjusted empirically. This is also the case for γ″. For applications where a reasonable objective performance measure can be defined, the constants γ′ and γ″ can be obtained by interpreting the decision rule as an artificial neural network and then conducting a supervised training of this network [9].
Speech at frequencies below 100 Hz is considered perceptually unimportant, and bins below this frequency are therefore always classified with no speech presence. Real-life noise sources often have a large part of their power at the low frequencies, so this rule ensures that this power does not cause the speech presence detection method to falsely classify these low-frequency bins as if speech is present. If less than 5% of the K periodogram bins are classified with speech presence, we expect that these decisions have been falsely caused by the noise characteristics, and all decisions in the current frame are reclassified to no speech presence. When the speech presence decisions are used in a speech enhancement method, as we propose in the speech enhancement section below, this reclassification will ensure the naturalness of the background noise in periods of speaker silence.
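The combined decision rule, the 100 Hz exclusion and the 5% reclassification can be sketched as follows; the threshold values gamma1 and gamma2 are illustrative assumptions (the source lists its values in Table 3, which is not reproduced).

```python
import numpy as np

def speech_presence_decision(P, P_min, gamma1=4.0, gamma2=1.5,
                             freqs=None, min_fraction=0.05):
    """Binary speech presence decision for one frame, combining the two
    initial rules (15)-(17) by a logical AND as in (18), plus the
    low-frequency exclusion and the 5% reclassification rule."""
    # Rule 1: smoothed periodogram vs. weighted minimum track.
    D1 = P > gamma1 * P_min
    # Rule 2: power difference vs. the scaled frame-average minimum track.
    D2 = (P - P_min) > gamma2 * P_min.mean()
    D = D1 & D2                                # logical AND, cf. (18)
    # Bins below 100 Hz are always classified as no speech presence.
    if freqs is not None:
        D &= freqs >= 100.0
    # If fewer than 5% of the bins indicate speech, reclassify the frame.
    if D.mean() < min_fraction:
        D[:] = False
    return D

# Example with toy numbers (K = 6 bins)
P     = np.array([9.0, 8.0, 1.2, 1.1, 7.5, 1.0])
P_min = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
freqs = np.array([62.5, 125.0, 250.0, 500.0, 1000.0, 2000.0])
print(speech_presence_decision(P, P_min, freqs=freqs))
```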
Noise Estimation
The spectral-temporal smoothing method [3], which we use in this paper, reduces the bias between the noise PSD and the minimum track P̄_min(λ, k) if the noise is assumed to be ergodic in its PSD. That is, it reduces the bias compared to minimum tracked values from periodograms smoothed temporally using Martin's first method [5]. Martin gives a parametric description of a bias compensation factor, which depends on the minimum search window length, the smoothed noisy speech, and the noise PSD estimate variances. The spectral smoothing lowers the smoothed noisy speech periodogram variance, and as a consequence, a longer minimum search window can be applied when the noise spectrum is not changing rapidly. This gives the ability to bridge over longer speech periods.
We propose to use the speech presence detection method from the previous section to obtain two different noise estimates, i.e. a noise PSD estimate and a noise periodogram estimate. The PSD estimate will be used in the speech enhancement methods, and the noise periodogram estimate will illustrate some of the properties of the residual noise from the speech enhancement method proposed below.
Noise Periodogram Estimation
The noise periodogram estimate is equal to a time-varying power scaling of the minimum tracks P̄_min(λ, k) for D(λ, k) = 1. For D(λ, k) = 0 it is equal to the noisy speech periodogram P_Y(λ, k), i.e.

P̂_N(λ, k) = P_Y(λ, k) for D(λ, k) = 0, and P̂_N(λ, k) = R̄_min(λ) P̄_min(λ, k) for D(λ, k) = 1.    (19)

In the above equation a bias compensation factor R̄_min(λ) scales the minimum. The scaling factor is updated in frames where no speech presence is detected and kept fixed while speech presence is detected in the frames. We let R_min(λ) be given by the ratio between the sums of the previous noise periodogram estimate P̂_N(λ − 1, k) and the minimum tracks P̄_min(λ, k), i.e.

R_min(λ) = Σ_{k=0}^{K−1} P̂_N(λ − 1, k) / Σ_{k=0}^{K−1} P̄_min(λ, k),    (20)

which is recursively smoothed when no speech is present in the frame, and fixed when speech is present in the frame, i.e.

R̄_min(λ) = ζ_min R̄_min(λ − 1) + (1 − ζ_min) R_min(λ) when no speech is present in the frame, and R̄_min(λ) = R̄_min(λ − 1) otherwise,    (21)

where 0 < ζ_min < 1 is a constant recursive smoothing parameter. The magnitude spectrum, at time index λ, is obtained by taking the square root of the noise periodogram estimate, i.e.

|N̂(λ, k)| = √(P̂_N(λ, k)).    (22)
This noise periodogram estimate equals the true noise periodogram |N(λ, k)|² when the speech presence detection is correctly detecting no speech presence. When entering a region with speech presence, the noise periodogram estimate will take on the smooth shape of the minimum track, scaled with the bias compensation factor in (21) such that the power develops smoothly into the speech presence region.

Noise PSD Estimation
The noise PSD estimate P̃_N(λ, k) is obtained exactly as the noise periodogram estimate, but with (19) modified such that the noise PSD estimate is obtained directly as the power-scaled minimum tracks, i.e.

P̃_N(λ, k) = R̄_min(λ) P̄_min(λ, k).    (23)

A smooth estimate of the noise magnitude spectrum can be obtained by taking the square root of the noise PSD estimates, i.e.

|Ñ(λ, k)| = √(P̃_N(λ, k)).    (24)
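A sketch of the two noise estimates and the bias compensation of (19)-(23) for a single frame is given below; zeta_min and all numbers in the example are illustrative assumptions.

```python
import numpy as np

def noise_estimates(P_Y, P_min, D, R_bar_prev, P_N_prev_sum, zeta_min=0.9):
    """One frame of the two noise estimates, cf. (19)-(23).  Returns the
    noise periodogram estimate, the noise PSD estimate and the updated
    smoothed bias compensation factor."""
    # Instantaneous bias factor, cf. (20): ratio between the summed
    # previous noise periodogram estimate and the summed minimum tracks.
    R = P_N_prev_sum / max(P_min.sum(), 1e-12)
    # Recursive smoothing only in frames without speech presence, cf. (21).
    if not D.any():
        R_bar = zeta_min * R_bar_prev + (1.0 - zeta_min) * R
    else:
        R_bar = R_bar_prev                     # frozen while speech present
    # Noise PSD estimate, cf. (23): power-scaled minimum tracks everywhere.
    P_N_psd = R_bar * P_min
    # Noise periodogram estimate, cf. (19): raw periodogram where no
    # speech is detected, scaled minimum track where speech is detected.
    P_N_per = np.where(D, P_N_psd, P_Y)
    return P_N_per, P_N_psd, R_bar

# Example: one frame with speech detected in the middle two bins
P_Y   = np.array([1.0, 1.2, 9.0, 8.0, 1.1, 0.9])
P_min = np.array([0.8, 0.9, 0.9, 0.8, 0.9, 0.8])
D     = np.array([False, False, True, True, False, False])
per, psd, R_bar = noise_estimates(P_Y, P_min, D, R_bar_prev=1.2,
                                  P_N_prev_sum=6.0)
print(per, psd, R_bar, sep="\n")
```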
Speech Enhancement
We now describe the speech enhancement method for which the speech presence detection method has been developed. It is well known that methods that subtract a noise PSD estimate from a noisy speech periodogram, e.g. using an attenuation rule, will introduce musical noise. This happens whenever the noisy speech periodogram exceeds the noise PSD estimate. If, on the other hand, the noise PSD estimate is too high, the attenuation will reduce more noise, but also cause the speech estimate to be distorted. To mitigate these effects, we propose to distinguish between connected regions with speech presence and no speech presence. In speech presence we will use a traditional estimation technique, by means of generalized spectral subtraction, with the noise magnitude spectrum estimate obtained using (24) from the noise PSD estimate. In no speech presence we will use a simple noise-scaling attenuation rule to preserve the naturalness in the residual noise.

Note that this approach, but with only a single speech presence decision covering all frequencies in each frame, has previously been proposed by Yang [2]. Moreover, Cohen and Berdugo [11] propose a binary detection of speech presence/absence (called the indicator function in their paper), which is similar to the one we propose in this paper. However, their decision includes noisy speech periodogram bins without smoothing, hence some decisions will not be regionally connected. In our experience, this leads to artifacts if the decisions are used directly in a speech enhancement scheme with two different attenuation rules for speech absence and speech presence. Cohen and Berdugo smooth their binary decisions to obtain estimated speech presence probabilities, which are used for a soft-decision between two separate attenuation functions. Our approach, as opposed to this, is to obtain adequately time-frequency smoothed spectra from which connected speech presence regions can be obtained directly in a robust manner. As a consequence, we avoid distortion in speech absence regions, and thereby obtain a natural sounding background noise.
Let the generalized spectral subtraction variant be given similar to the one proposed by Berouti et al. [12], but with the decision of which attenuation rule to use given explicitly by the proposed speech presence decisions, instead of comparisons between the estimated speech power and an estimated noise floor. The immediate advantage of our approach is a higher degree of control with the properties of the enhancement algorithm. Our proposed method is given by

|Ŝ(λ, k)| = (|Y(λ, k)|^{a₁} − β₁ |Ñ(λ, k)|^{a₁})^{1/a₁} for D(λ, k) = 1, and |Ŝ(λ, k)| = β₀ |Y(λ, k)| for D(λ, k) = 0,    (25)

where a₁ determines the power in which the subtraction is performed, and β₁ is a noise over-estimation factor that scales the estimated magnitude of the noise STFT coefficient |Ñ(λ, k)|, obtained from the noise PSD estimate by (24), raised to the a₁'th power. The factor β₀ scales the noisy speech STFT coefficient magnitude, which before this scaling equals the square root of the noise periodogram estimate for bins with D(λ, k) = 0. After the scaling, these noisy speech STFT magnitudes lead to the noise component that will be left, after STFT synthesis, in the speech estimate as artifact masking [1] and natural sounding attenuated background noise. For synthesis we let the STFT spectrum of the estimated speech be given by the magnitude obtained from (25) and the noisy phase ∠Y(λ, k), i.e.

Ŝ(λ, k) = |Ŝ(λ, k)| e^{j∠Y(λ, k)}.    (26)
By applying the inverse STFT we synthesize a time-domain frame, which we use in a WOLA scheme, as illustrated in Figure 1, to form the synthesized signal. Depending on the analysis window, a corresponding synthesis window hs(μ) is applied before overlap add is performed.
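A per-frame sketch of the two attenuation rules in (25) and the synthesis with the noisy phase in (26) follows. The spectral floor and the parameter values are illustrative assumptions; the source recommends a noise-region scalar roughly in the range 0.05-0.4.

```python
import numpy as np

def enhance_frame(Y, N_mag, D, a1=2.0, beta1=1.0, beta0=0.2, floor=1e-6):
    """Apply the two attenuation rules of (25) to one STFT frame Y and
    resynthesize the complex spectrum with the noisy phase, cf. (26)."""
    Y_mag = np.abs(Y)
    # Speech presence: generalized spectral subtraction in the a1 domain,
    # floored to avoid negative values before the 1/a1 root.
    sub = np.maximum(Y_mag ** a1 - beta1 * N_mag ** a1, floor)
    S_mag_speech = sub ** (1.0 / a1)
    # Speech absence: simple down-scaling of the noisy magnitude, which
    # leaves natural sounding attenuated background noise.
    S_mag = np.where(D, S_mag_speech, beta0 * Y_mag)
    return S_mag * np.exp(1j * np.angle(Y))    # noisy phase, cf. (26)

# Example: toy frame of K = 4 complex STFT coefficients
Y = np.array([1 + 1j, 4 - 2j, 0.5 + 0.1j, 3 + 0j])
N_mag = np.array([1.0, 1.0, 0.5, 1.0])
D = np.array([False, True, False, True])
print(np.abs(enhance_frame(Y, N_mag, D)))
```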
Experimental Setup
In the experiments we use 6 speech recordings from the TIMIT database [13]. The speech is spoken by 3 different male and 3 different female speakers, all uttering different sentences of 2-3 seconds duration. These sentences are added with zero-mean highway noise and car interior noise in 0, 5, and 10 dB overall signal-to-noise ratios to form a test set of 36 noisy speech sequences. Spectrograms of time-domain signals are shown with time-frequency axes and always with the time-domain signals. When we plot intermediate coefficients the figures are shown with axes of sub-sampled time index λ and frequency index k. For all illustrations in this paper, we use the noisy speech from one of the male speakers with additive highway noise in a 5 dB overall SNR.
The general setup in the experiments is listed in Table 1. The analysis window h(μ) is the square-root of a Hanning window, scaled to unit energy. As the synthesis window hs(μ), we also use the square-root of a Hanning window, but scaled, such that an unmodified frame would be windowed by a Hanning window after both the analysis and synthesis window have been applied. It will therefore be ready for overlap add with 50% overlapping frames.
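This analysis/synthesis choice can be checked with a short sketch: with square-root periodic Hann windows at 50% overlap, the product of analysis and synthesis windows is a Hann window, and unmodified frames overlap-add back to the input. The unit-energy scaling used in the paper is omitted here, since it cancels between analysis and synthesis; the K and R values are illustrative.

```python
import numpy as np

def wola_roundtrip(y, K=512, R=256):
    """Sqrt-Hann analysis/synthesis at 50% overlap: analysis times
    synthesis window equals a periodic Hann window, which overlap-adds
    to a constant at hop K/2, so unmodified frames reconstruct the
    input exactly (away from the edges)."""
    n = np.arange(K)
    hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / K)   # periodic Hann
    h = np.sqrt(hann)                              # analysis window
    hs = np.sqrt(hann)                             # synthesis window
    out = np.zeros(len(y))
    for start in range(0, len(y) - K + 1, R):
        spec = np.fft.fft(y[start:start + K] * h)  # analysis STFT
        # ... spectral modification (enhancement) would go here ...
        out[start:start + K] += np.real(np.fft.ifft(spec)) * hs
    return out

y = np.random.randn(4096)
y_hat = wola_roundtrip(y)
print(np.allclose(y[512:3584], y_hat[512:3584]))   # True away from edges
```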
For the spectral-temporal periodogram smoothing, we use the settings and algorithm initializations given in Table 2. The decision rules that are used for speech presence detection have the threshold values listed in Table 3. For noise estimation, we use the two parameters from Table 4. The speech enhancement method uses the parameter settings that are listed in Table 5. [Tables 1-5 are given as figures in the original document and are not reproduced here.]
Experimental Results
In this section, we evaluate the performance of the proposed algorithm. We measure the performance of the algorithm by means of visual inspection of spectrograms, spectral distortion measures, and informal listening tests.
By removing the power in no speech presence regions and speech presence regions from the noisy speech periodogram, we see in Figure 4 (top) and (bottom), respectively, that most of the speech that is detectable by visual inspection has been detected by the proposed algorithm.
We evaluate the performance of the noise estimation methods by means of their spectral distortion, which we measure as segmental noise-to-error ratios (SegNER). We calculate the SegNER in the time-frequency domain, as the ratio (in dB) between the noise energy and the noise estimation error energy. These values are upper and lower limited by 35 and 0 dB [14], respectively, i.e.

SegNER(λ) = min{ 35, max{ 0, 10 log₁₀ ( Σ_{k=0}^{K−1} |N(λ, k)|² / Σ_{k=0}^{K−1} (|N(λ, k)| − |N̂(λ, k)|)² ) } },    (27)-(28)

and averaged over all (M) frames, i.e.

SegNER = (1/M) Σ_{λ} SegNER(λ).    (29)
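A sketch of the SegNER measure in (27)-(29) follows. The source does not legibly define the error term, so taking the error on the magnitude spectra is an assumption here.

```python
import numpy as np

def seg_ner(N_true, N_est, lo=0.0, hi=35.0):
    """Segmental noise-to-error ratio, cf. (27)-(29): per-frame ratio (in
    dB) of noise energy to noise-estimation-error energy, limited to
    [0, 35] dB and averaged over all frames.  Inputs have shape
    (frames, K) and hold squared magnitudes (periodograms)."""
    noise_energy = N_true.sum(axis=1)
    # Assumption: the error is taken on the magnitude spectra.
    error_energy = ((np.sqrt(N_true) - np.sqrt(N_est)) ** 2).sum(axis=1)
    ner_db = 10.0 * np.log10(noise_energy / np.maximum(error_energy, 1e-12))
    return np.clip(ner_db, lo, hi).mean()      # average over M frames

# Example with toy periodograms (3 frames, 4 bins)
N_true = np.abs(np.random.randn(3, 4)) + 0.5
N_est = N_true * 1.1                           # 10% over-estimate
print(seg_ner(N_true, N_est))
```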
In Table 6 we list the average SegNER's over the same 6 speakers that are used in the informal listening test of the speech enhancement method. We list the average SegNER's for the noise periodogram estimation method, the noise PSD estimation
method, our implementation of χ2-based noise estimation [3], and minimum statistics (MS) noise estimation [5]. Our implementation of the χ2-based noise estimation uses the MS noise estimate [5] in the calculation of the optimum smoothing parameters, as suggested by Martin and Lotter [3]. The spectral averaging in our implementation of the χ2-based noise estimation is performed in sliding spectral windows, of the same size as used by the two proposed noise estimation methods.
We see that the noise PSD estimate has less spectral distortion than both our implementation of the χ2-based noise estimation [3] and MS noise estimation [5]. This can be explained by a more accurate bias compensation factor, which uses speech presence information. Note that in many scenarios the proposed smooth and low-biased noise PSD estimate is preferable over the noise periodogram estimate.
As an objective measure of time-domain waveform similarity, we list the signal-to-noise ratios, and as a subjective measure of speech quality, we conduct an informal listening test. In this test, test subjects give scores from the scale in Table 7, ranging from 1 to 5 in steps of 0.1, to three different speech enhancement methods, with the noisy speech as a reference signal. A higher score is given to the preferred speech enhancement method. The test subjects are asked to take parameters such as the naturalness of the enhanced speech, the quality of the speech, and the degree of noise reduction into account when assigning a score to an estimate. The presentation order of estimates from individual methods is blinded, randomized, and varies in each test set and for each test subject.
A total of 8 listeners, all working within the field of speech signal processing, participated in the test. The proposed speech enhancement method was compared with our implementation of two reference methods:
• MMSE-LSA: Minimum mean-square error log-spectral amplitude estimation, as proposed by Ephraim and Malah [7].
• MMSE-LSA-DD: Decision-directed MMSE-LSA, which is the MMSE-LSA estimation in combination with a smoothing mechanism [7]. Constants are as proposed by Ephraim and Malah.
[Tables 7-9, giving the opinion score scale, the SNR's, and the mean opinion scores, are given as figures in the original document and are not reproduced here.]
All three methods in the test use the proposed noise PSD estimate. Also, they all use the analysis/synthesis setup described in the experimental setup section.

SNR's and Mean Opinion Scores (MOS) from the informal subjective listening test are listed in Table 8 and Table 9. All results are averaged over both speakers and listeners. The best obtained results are emphasized using bold letters.
To identify if the proposed method is significantly better, i.e. has a higher MOS, than MMSE-LSA-DD, we use the matched sample design [15], where the absolute values of the opinion scores are eliminated as a source of variation. Let μ_d be the mean of the opinion score difference between the proposed method and the MMSE-LSA-DD. Using this formulation, we write the null and alternative hypotheses as

H₀: μ_d ≤ 0,    (30)
H_a: μ_d > 0,    (31)

respectively. The null hypothesis H₀ in this context should not be mistaken for the hypothesis H₀ in the speech presence detection method. With 48 experiments at each combination of SNR and noise type, we are in the large sample case, and we therefore assume that the differences are normally distributed. The rejection rule, at 1% level of significance, is

reject H₀ if z = d̄ / (s_d / √n) > z₀.₀₁,    (32)

with z₀.₀₁ = 2.33. Tables 10 and 11 list the test statistic z and the corresponding test result. Also, the two-tailed 99% confidence interval [15] of the difference between the MOS of the proposed method and MMSE-LSA-DD, for highway traffic and car interior noise, respectively, is listed. From our results we can therefore state, with a confidence level of 99%, that the proposed method has a higher perceptual quality than MMSE-LSA-DD. Furthermore, the difference corresponds generally to more than 0.5 MOS, which generally changes the ratings from somewhere between Poor and Fair to somewhere between Fair and Good on the MOS scale.
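The significance test of (30)-(32) is a standard matched-sample z-test and can be sketched as follows; the synthetic scores in the example are purely illustrative.

```python
import numpy as np

def matched_sample_z(scores_proposed, scores_reference, z_crit=2.33):
    """Matched-sample test of H0: mu_d <= 0 against Ha: mu_d > 0 at the
    1% significance level, cf. (30)-(32), with d the per-trial opinion
    score difference.  Large-sample case: z is compared to z_0.01 = 2.33."""
    d = np.asarray(scores_proposed) - np.asarray(scores_reference)
    n = len(d)
    z = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    # Two-tailed 99% confidence interval for the mean difference,
    # using z_0.005 = 2.576 in the large-sample interval.
    half = 2.576 * d.std(ddof=1) / np.sqrt(n)
    return z, z > z_crit, (d.mean() - half, d.mean() + half)

# Example with synthetic scores for 48 paired trials
rng = np.random.default_rng(0)
ref = rng.uniform(2.0, 3.0, 48)                # e.g. reference method scores
prop = ref + rng.normal(0.6, 0.3, 48)          # proposed method scores
print(matched_sample_z(prop, ref))
```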
Discussion
We have in this paper presented new noise estimation and speech enhancement methods that utilize a proposed connected region speech presence detection method. Despite the simplicity, the proposed methods are shown to have superior performance when compared to our implementation of state-of-the-art reference methods in the case of both noise estimation and speech enhancement.
In the first proposed noise estimation method the connected speech presence regions are used to achieve noise periodogram estimates in the regions where no speech is present. In the remaining regions, where speech is present, minimum tracks of the smoothed noisy speech periodograms are bias compensated with a factor that is updated in regions with no speech presence. A second proposed noise estimation method provides a noise PSD estimate by means of the same power-scaled minimum tracks that are used by the noise periodogram estimation method when speech is present. It is shown that the noise PSD estimate has less spectral distortion than both our implementation of χ²-based noise estimation [3] and minimum statistics noise estimation [5]. This can be explained by a more accurate bias compensation factor, which uses speech presence information. The noise periodogram estimate is by far the least spectrally distorted noise estimate of the tested noise estimation methods. This verifies the connected region speech presence principle which is fundamental for the proposed speech enhancement method.
Our proposed enhancement method uses different attenuation rules for each of the two types of speech presence regions. When no speech is present the noisy speech is down-scaled and left in the speech estimate as natural sounding masking noise, and when speech is present a noise PSD estimate is used in a traditional generalized spectral subtraction. In addition to enhancing the speech, the most distinct feature of the proposed speech enhancement method is that it leaves natural sounding background noise matching the actual surroundings of the person wearing the hearing aid.

The proposed method performs well at SNR's equal to or higher than 0 dB for noise types with slowly changing and spectrally smooth periodograms. Rapid, and speech-like, changes in the noise will be treated as speech, and will therefore be enhanced, causing a decrease in the naturalness of the background noise. At very low SNR's the detection of speech presence will begin to fail. In this case, we suggest the implementation of the proposed method in a scheme where low SNR is detected and causes a change to an approach with only a single and very conservative attenuation rule. Strong tonal interferences will affect the speech presence decisions as well as the noise estimation and enhancement method, and should be detected and removed by preprocessing of the noisy signal immediately after the STFT analysis. Otherwise, a sufficiently strong tonal interference with duration longer than the minimum search window will cause the signal to be treated as if speech is absent, and the speech enhancement algorithm will down-scale the entire noisy speech by multiplication with β₀.

Our approach generalizes to other noise reduction schemes. As an example, the proposed binary scheme can also be used with MMSE-LSA-DD for the speech presence regions. For such a combination we expect performance similar to, or better than, what we have shown in this paper for the generalized spectral subtraction. This is supported by the findings of Cohen and Berdugo [11], who have shown that a soft-decision approach improves MMSE-LSA-DD. The informal listening test confirms that listeners prefer the down-scaled background noise with fully preserved naturalness over the less realistic whitened residual noise from e.g. MMSE-LSA-DD. From our experiments we can conclude, with a confidence level of 99%, that the proposed speech enhancement method receives significantly higher MOS than MMSE-LSA-DD at all tested combinations of SNR and noise type.
References
[1] T. Painter and A. Spanias, "Perceptual Coding of Digital Audio," Proc. IEEE, vol. 88, no. 4, pp. 451-513, Apr. 2000.
[2] J. Yang, "Frequency Domain Noise Suppression Approaches in Mobile Telephone Systems," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 2, Apr. 1993, pp. 363-366.
[3] R. Martin and T. Lotter, "Optimal Recursive Smoothing of Non-Stationary Periodograms," in Proc. IEEE Int. Workshop on Acoustic Echo and Noise Control, Sept. 2001, pp. 43-46.
[4] R. Martin, "Spectral Subtraction Based on Minimum Statistics," in Proc. EURASIP European Signal Processing Conference, Sept. 1994, pp. 1182-1185.
[5] , "Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics," IEEE Trans. Speech Audio Processing, vol. 9, no. 5, pp. 504-512, July 2001.
[6] I. Cohen and B. Berdugo, "Noise Estimation by Minima Controlled Recursive Averaging for Robust Speech Enhancement," IEEE Signal Processing Lett., vol. 9, no. 1, pp. 12-15, Jan. 2002.
[7] Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator," IEEE Trans. Acoust., Speech, Signal Processing, vol. 33, no. 2, pp. 443-445, Apr. 1985.
[8] K. S. Shanmugan and A. M. Breipohl, Random Signals - Detection, Estimation and Data Analysis. John Wiley & Sons, Inc., 1988.
[9] K. V. Sørensen and S. V. Andersen, "Speech Presence Detection in the Time-Frequency Domain using Minimum Statistics," in Proc. Nordic Signal Processing Symposium, June 2004, pp. 340-343.
[10] I. Cohen, "Noise Spectrum Estimation in Adverse Environments: Improved Minima Controlled Recursive Averaging," IEEE Trans. Speech Audio Processing, vol. 11, no. 5, pp. 466-475, Sept. 2003.
[11] I. Cohen and B. Berdugo, "Speech Enhancement for Non-Stationary Noise Environments," Signal Processing, vol. 81, no. 11, pp. 2403-2418, Nov. 2001, Elsevier Science.
[12] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of Speech Corrupted by Acoustic Noise," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 4, Apr. 1979, pp. 208-211.
[13] DARPA TIMIT Acoustic-Phonetic Speech Database, National Institute of Standards and Technology (NIST), Gaithersburg, MD 20899, USA, CD-ROM.
[14] J. R. Deller, Jr., J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals. Wiley-Interscience, 2000.
[15] D. R. Anderson, D. J. Sweeney, and T. A. Williams, Statistics for Business and Economics. South-Western, 1990.
[16] D. Brillinger, Time Series: Data Analysis and Theory. Holden-Day, 1981.

In the following, a detailed description is given of a preferred method for noise power spectral density estimation from noisy speech using on-line trained hidden Markov models. This method is used in conjunction with the described speech presence detection method, which provides connected time-frequency regions of each decision type as a basis for the noise estimation. In speech absence regions hidden Markov models are trained on-line, and in speech presence regions the trained models are used for (MMSE) optimum estimation. Both types of speech presence regions can be present in each frame, and on-line training of the models in speech absence can be conducted while the models in speech presence are used for estimation. Experiments show that the proposed noise power spectral density estimation performs better than three state of the art reference methods. For real life noise types the special case of the hidden Markov model where it reduces to a Gaussian mixture model is shown to be nearly as good as the hidden Markov model.
Statistical noise estimation methods very often rely on an assumption of stationarity; the parameters of the noise PDF are estimated during speech absence and kept constant during speech presence. In this paper we investigate if dynamic modeling of the noise PDF using hidden Markov models (HMM's) results in better performance than when using static Gaussian mixture models (GMM's). We evaluate the performance for real-life noise types. Transition probabilities in the HMM's will, to a certain degree, capture the dynamic behavior of non-stationary noise. The general method of HMM's has already proven its worth for modeling different classes of noise. In particular, Sameti et al. [1] and Gaunard et al. [2] have used off-line trained noise models for noise classification. In their approach only a predefined set of noise classes are modeled and the off-line training is supervised. Similarly trained models have been suggested by Ghoreishi and Sheikhzadeh [3] for speech pause detection. A subband approach has been proposed by Hosoki et al. [4] for noise detection. If a subband is unlikely to contain clean speech they classify it as being noise contaminated. In previous work [5] we have proposed a connected time-frequency domain region speech presence detector and applied it to find bias compensation factors for minimum statistics based noise estimation. We have shown that this approach results in a less spectrally distorted noise estimate than the original minimum statistics based noise estimation [6]. In this paper, we apply our new approach to train subband HMM's on-line while speech is absent. When speech is present we use the most recently trained noise models for MMSE optimum noise power spectral density (PSD) estimation. This way, instead of choosing from a finite set of predefined noise class models, the proposed method models the local behavior of the noise to more accurately adapt to the actual noise environment.
The remainder of this paper is organized as follows. Section 2 describes the signal model, the speech presence hypotheses, the spectral smoothing method, and the fundamental noise PSD estimation approach. Section 3 provides the details of the applied statistical model. In Section 4 the methods for estimation of state probabilities and unknown observations are described. Section 5 contains the experiments and Section 6 provides a discussion of the proposed method and the obtained results.
Signal Model
We assume that noisy speech y(i) at sampling time index i consists of speech s(i) and additive noise n(i). For joint time-frequency analysis of y(i) we apply the L-point discrete Short-Time Fourier Transform (STFT), i.e.

Y(τ, ω) = Σ_{μ=0}^{L−1} h(μ) y(τR + μ) e^{−j2πωμ/L},    (1)

where τ ∈ Z is the (sub-sampled) time index, ω is the frequency index, and L is the STFT size, which in this paper equals the window length. R is the skip between frames and h(μ) is a unit energy window function, i.e. Σ_{μ=0}^{L−1} h²(μ) = 1. From the linearity of (1) we have that Y(τ, ω) = S(τ, ω) + N(τ, ω), where S(τ, ω) and N(τ, ω) are the STFT coefficients of speech s(i) and additive noise n(i), respectively. We further assume that s(i) and n(i) are zero mean and statistically independent, which leads to a power relation where the noise is additive. Now, let the hypotheses
H₀ and H₁ for speech absence and speech presence, respectively, be defined by two power relations, i.e.

H₀: E{|Y(τ, ω)|²} = E{|N(τ, ω)|²},    (2)
H₁: E{|Y(τ, ω)|²} = E{|S(τ, ω)|²} + E{|N(τ, ω)|²}.    (3)

The decision of which hypothesis to believe is true is made by means of the connected region speech presence detection method, which we have proposed in previous work [5]. This speech presence detection method provides individual decisions at each time-frequency location. At the same time it ensures that decisions of the same type are connected in larger time-frequency regions. In the regions where no speech is detected we can directly observe what we assume to be the realizations of the stochastic noise process. The approach taken in this paper is to exploit this property to train dynamic statistical noise models in connected regions of speech absence and use them for noise PSD estimation in regions of speech presence. Initially, we apply a spectral window of size 2D + 1, centered at ω, to reduce the fluctuations of the noisy speech periodogram bins, i.e.

B(τ, ω) = (1/(2D + 1)) Σ_{ν=−D}^{D} |Y(τ, ω − ν)|².    (4)
At each time-frequency location (τ, ω) we let the spectrally averaged B(τ, ω) constitute the noise PSD estimate if speech is absent in all the noisy speech periodogram bins within the spectral window in (4). If speech is present in any of these bins we turn to HMM based MMSE estimation.
Training The Statistical Model
We model the spectrally averaged noisy periodogram bins B(τ, ω) at each ω using a continuous density ΗMM with a Gaussian mixture model in each state of the ΗMM modeling the observation PDF. At current time, say T, we consider the T" most recent 31
spectrally smoothed periodograms, i.e.
$$B(\tau,\omega), \qquad \tau \in \{T - T' + 1, \ldots, T\}$$
If no noisy speech periodogram bins with speech presence were used in (4) to calculate any of these, we denote the case D(T, ω) = 0, and otherwise we denote it D(T, ω) = 1. Binary speech presence detection methods generally need a certain amount of speech power to detect speech presence. Therefore, the last few noisy speech periodogram bins leading up to a speech presence region will most likely contain enough speech power to contaminate the HMM training. To avoid this, we train the HMM on the training set that consists of the first T spectrally smoothed periodograms within the sliding window of length T′. We train the model on the training set only if D(T, ω) = 0; otherwise the model parameters from T − 1 are preserved except for the forward likelihoods, which are estimated by prediction from the forward likelihoods and state transition probabilities at T − 1. The set of T spectrally smoothed bins B(τ,ω) for τ ∈ {T − T′ + 1, . . . , T − T′ + T}
that constitutes a training set of T scalar observations at ω will cause the means and variances of the GMM's to be scalars. For easy generalization to the vector case we will, however, describe the theory using vector/matrix notation. As the training procedure is the same for all sets of training vectors we use the same notation for all T and ω, i.e. $\mathcal{X}_1^T = \{\mathbf{x}_1, \ldots, \mathbf{x}_T\}$. We want to find model parameters Φ that maximize the joint likelihood¹ conditioned on the training set $\mathcal{X}_1^T$ of T observation vectors by adjusting the model parameters, i.e.
$$\hat{\Phi} = \arg\max_{\Phi}\; p(\mathcal{X}_1^T \mid \Phi) \qquad (5)$$
This optimization, however, of the (generally) non-convex objective function $p(\mathcal{X}_1^T \mid \Phi)$ requires knowledge of the hidden states and mixture components and is therefore not feasible. Instead we use the Baum-Welch algorithm [7], which by alternating maximization will converge to a model parameter estimate corresponding to a local maximum of the likelihood function. To ensure numerical stability we use the lower limit ε₁ = 10⁻⁶ on the multivariate Gaussian mixture components c_jk, on the single element of the 1-by-1 covariance matrices Σ_jk, and on the sampled individual Gaussians 𝒩(x_t; μ_jk, Σ_jk) for state j and mixture number k. Also, we use a lower limit ε₂ = 10⁻² on
¹We use p(·) to denote probability density functions and P(·) to denote probability mass functions.
all entries in the state transition probability matrix A = {a_ij}, where a_ij = P(q_{t+1} = j | q_t = i) [8, pp. 381-382] for the states i, j ∈ {1, . . . , N}.
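A minimal sketch of how such flooring might be applied after each Baum-Welch re-estimation pass is shown below; the helper name and the renormalization steps (keeping A row-stochastic and the per-state weights summing to one) are assumptions of ours:

```python
import numpy as np

EPS1 = 1e-6   # floor on mixture weights, variances, and sampled Gaussians
EPS2 = 1e-2   # floor on state transition probabilities

def apply_floors(A, c, var):
    """Impose the lower limits eps1/eps2 after a re-estimation pass.
    A is N-by-N; c and var are N-by-M mixture weights and variances."""
    A = np.maximum(A, EPS2)
    A /= A.sum(axis=1, keepdims=True)        # keep A row-stochastic
    c = np.maximum(c, EPS1)
    c /= c.sum(axis=1, keepdims=True)        # weights per state sum to one
    var = np.maximum(var, EPS1)              # scalar (1-by-1) covariances
    return A, c, var
```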
Let α_t(i) denote the forward likelihoods, defined as the likelihoods for being in state i at t while having produced the observations $\mathcal{X}_1^t$, conditioned on the model, i.e.
$$\alpha_t(i) = p(\mathcal{X}_1^t, q_t = i \mid \Phi) \qquad (6)$$
The forward likelihoods α₀(i) initializing the HMM training for t = 0 at each T are initialized as uniform distributions. After initialization at t = 0 the forward likelihoods for t ∈ {1, . . . , T} in the Baum-Welch training algorithm are induced from the forward likelihoods at t − 1, i.e.
$$\alpha_t(j) = \Big[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\Big]\, b_j(\mathbf{x}_t) \qquad (7)$$
where the sampled observation PDF b_j(x_t) from state j is given by the weighted mixture of sampled Gaussians, i.e.

$$b_j(\mathbf{x}_t) = \sum_{k=1}^{M} c_{jk}\, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}) \qquad (8)$$
To avoid numerical instability in the implementation all forward likelihood vectors are scaled to unit l₁-norm [8], i.e.

$$\bar{\boldsymbol{\alpha}}_t = \frac{\boldsymbol{\alpha}_t}{\|\boldsymbol{\alpha}_t\|_1} \qquad (9)$$
This scaling does not affect training nor does it affect HMM based estimation.
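Equations (7)-(9) amount to the following scaled forward pass for scalar observations; this is a sketch with illustrative function and argument names only:

```python
import numpy as np

def sampled_gaussian(x, mu, var, eps1=1e-6):
    """Scalar Gaussian density, floored at eps1 for numerical stability."""
    p = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return np.maximum(p, eps1)

def forward_pass(x_seq, A, c, mu, var):
    """Scaled forward recursion (7)-(9). A is N-by-N; c, mu, var are
    N-by-M arrays of mixture weights, means, and variances per state."""
    N = A.shape[0]
    alpha = np.full(N, 1.0 / N)              # uniform initialization at t = 0
    for x_t in x_seq:
        b = (c * sampled_gaussian(x_t, mu, var)).sum(axis=1)   # b_j(x_t), (8)
        alpha = (A.T @ alpha) * b            # induction step, (7)
        alpha /= alpha.sum()                 # unit l1-norm scaling, (9)
    return alpha
```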
Hidden Markov Model based Estimation
When speech presence is detected in a noisy speech periodogram bin it will cause D(τ, ω) = 1 at every time-frequency location (τ, ω) where it is located within the spectral window in (4). While D(τ, ω) = 1 the forward likelihoods α_T′ at ω, which are required for HMM based estimation, are predicted from the most recently trained model at ω, i.e. the model at the most recent τ for which D(τ, ω) = 0. We denote the last observable training set, on which this model has been trained, $\mathcal{X}_1^T$. When D(τ, ω) = 1 we set the sampled observation PDF equal to one for all states j, i.e. b_j(x_t) ≜ 1. This corresponds to an observation x_t that does not affect the forward likelihoods and therefore it leads to a simplified version of the forward likelihood induction equation in (7), i.e. α_t(j) = Σ_{i=1}^{N} α_{t−1}(i) a_ij for j ∈ {1, . . . , N}. The simplified equation can be compactly written in vector/matrix notation, i.e.
$$\boldsymbol{\alpha}_t = \mathbf{A}^T \boldsymbol{\alpha}_{t-1} \qquad (10)$$
From (10) it follows that the F'th successively estimated forward likelihood vector for T′ (at current time T) is given by

$$\boldsymbol{\alpha}_{T'} = (\mathbf{A}^T)^F \boldsymbol{\alpha}_{T'-F} \qquad (11)$$
In the above, α_{T′−F} is relative to current time T, i.e. it equals α_T′ from time T − F + (T′ − T). We now investigate whether or not this estimate will converge to a "steady state". Suppose Aᵀ ∈ ℝ^{N×N} has N linearly independent eigenvectors. Then, by means of the eigenvalue decomposition we have that Aᵀ = SΛS⁻¹, where Λ is a diagonal matrix with non-increasingly sorted eigenvalues λ₁ ≥ λ₂ ≥ · · · ≥ λ_{N−1} ≥ λ_N on the diagonal and S = [s₁, . . . , s_N] is a matrix of associated eigenvectors. We then have that (Aᵀ)^F = SΛ^F S⁻¹. For the strictly positive, hence primitive, Markov matrix Aᵀ the Perron-Frobenius theorem for primitive matrices [9, Theorem 1.1] states that there will be a unique dominant eigenvalue λ₁ > |λᵢ| for any eigenvalue λᵢ ≠ λ₁, which can be associated with a strictly positive (or strictly negative) dominant eigenvector s₁. For a positive Markov matrix the dominant eigenvalue will be λ₁ = 1 [9, p. 118]. Now, if we let F → ∞ we have that the dominant eigenvalue remains one while the rest go to zero. Therefore, lim_{F→∞} SΛ^F S⁻¹ becomes a rank-1 projection matrix that projects onto the subspace 𝒞(s₁) ⊂ ℝᴺ, i.e. the line, spanned by a dominant eigenvector s₁ of Aᵀ. Since α_{T′−F} is strictly positive it is not possible that s₁ ⊥ α_{T′−F}. In effect,
$$\lim_{F \to \infty} \boldsymbol{\alpha}_{T'} = c\, \mathbf{s}_1 \qquad (12)$$
where c ∈ ℝ \ {0} is a constant. In the practical implementation, α_{T′−F} is scaled by (9) to unit l₁-norm. Since left multiplication with the Markov matrix Aᵀ does not affect the l₁-norm, the limit value remains c s₁, but readily scaled to unit l₁-norm. Thus, c = sign(s₁(i))/‖s₁‖₁ for any i ∈ {1, . . . , N}. To summarize, forward likelihood vectors predicted successively by (10) converge to a dominant eigenvector of Aᵀ. In fact, this implies that the HMM converges to a special case where it reduces to a GMM.
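Under these assumptions (a strictly positive Aᵀ, which the flooring by ε₂ guarantees), the convergence is easy to verify numerically. The following toy check, with a randomly generated Markov matrix and illustrative names, predicts forward likelihoods by (10) with rescaling by (9) and compares the result against a dominant eigenvector of Aᵀ:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
A = rng.uniform(0.1, 1.0, (N, N))
A /= A.sum(axis=1, keepdims=True)       # strictly positive Markov matrix

alpha = np.full(N, 1.0 / N)             # any strictly positive start vector
for _ in range(200):                    # repeated prediction with b_j := 1
    alpha = A.T @ alpha                 # (10)
    alpha /= alpha.sum()                # rescale to unit l1-norm, (9)

w, V = np.linalg.eig(A.T)
s1 = np.real(V[:, np.argmax(np.abs(w))])
s1 = s1 / s1.sum()                      # c*s1 at unit l1-norm (s1 is one-signed)
print(np.allclose(alpha, s1))           # True: converged to dominant eigenvector
```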
Once the forward likelihoods have been estimated we obtain the noise PSD estimate as the conditional MMSE optimum observation vector $\hat{\mathbf{x}}_{T'}$, i.e.
$$\hat{\mathbf{x}}_{T'} = E\big[\mathbf{x}_{T'} \mid \mathcal{X}_1^T, \Phi\big] \qquad (13)$$
We assume that the unknown observation vector x_T′ is drawn from a continuous density HMM with observation PDF's modeled by a GMM in each state, i.e.
$$p(\mathbf{x}_{T'} \mid \mathcal{X}_1^T, \Phi) = \sum_{j=1}^{N} \sum_{k=1}^{M} P(q_{T'} = j, m_{T'} = k \mid \mathcal{X}_1^T, \Phi)\, \mathcal{N}(\mathbf{x}_{T'}; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}) \qquad (14)$$
The conditional MMSE estimate of the observation parameter vector is therefore given by

$$\hat{\mathbf{x}}_{T'} = \sum_{j=1}^{N} \sum_{k=1}^{M} P(q_{T'} = j, m_{T'} = k \mid \mathcal{X}_1^T, \Phi)\, \boldsymbol{\mu}_{jk} \qquad (15)$$
where the jk'th conditional state and mixture probability

$$P(q_{T'} = j, m_{T'} = k \mid \mathcal{X}_1^T, \Phi) = c_{jk}\, \bar{\alpha}_{T'}(j) \qquad (16)$$

is obtained from the j'th entry in ᾱ_T′, given by (9).
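Assuming the factorization in our reconstruction of (16), the estimation step during speech presence reduces to a few lines; names below are illustrative:

```python
import numpy as np

def predict_alpha(alpha_bar, A):
    """One forward likelihood prediction step (10), used while D(tau, omega) = 1."""
    alpha = A.T @ alpha_bar
    return alpha / alpha.sum()              # unit l1-norm, (9)

def mmse_noise_psd(alpha_bar, c, mu):
    """Conditional MMSE observation estimate (15)-(16) for scalar observations:
    x_hat = sum_j sum_k alpha_bar(j) * c_jk * mu_jk."""
    return float(np.sum(alpha_bar[:, None] * c * mu))
```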
Experiments
For the experiments we use signals sampled at 8 kHz sampling frequency. Both frame and FFT size are L = 256 and the frame skip is R = 128 samples. The analysis window is a square-root Hanning window scaled to unit energy. D = 2 defines the size of the spectral window and the number of observations in a training set is T = 16. These are taken from a sliding window of length T′ = 20. In the experiments we evaluate the performance in subbands consisting of individual frequency tracks, i.e. scalar observations for all HMM's. The following methods are compared:
• HMM(5,1): The proposed HMM based noise estimation method with M = 1 Gaussian in each state and N = 5 states.
• GMM(5): The proposed method with M = 5 and N = 1. With N = 1 the HMM reduces to a GMM with M Gaussians.
• CR-SPD: Connected time-frequency region speech presence detection based smooth noise estimation [5].
• MCRA²: Minima controlled recursive averaging [10].
• MS: Minimum statistics noise estimation [6].
The first three methods are based on the connected time-frequency region speech presence detector proposed in [5]. The last two methods, MCRA and MS, serve as well-known reference methods, which both feature independence from explicit speech presence detection. The performance of these methods is evaluated by means of their spectral distortion, which we measure as average segmental noise-to-error ratios (SegNER's). We calculate the SegNER directly in the time-frequency domain, as the ratio in dB between the noise energy and the noise estimation error energy. These values are upper and lower limited by 35 and 0 dB [11, p. 586], respectively, and averaged over all T′ frames, i.e.
²MCRA is implemented for 8 kHz sampling frequency and 16 ms frame skip. Filter coefficients have time constants equal to the coefficients proposed by Cohen and Berdugo [10].

$$\text{SegNER} = \frac{1}{T'} \sum_{\tau=1}^{T'} \min\big\{\max\{\text{NER}(\tau),\, 0\},\, 35\big\} \qquad (17)$$
where the noise-to-error ratio, in dB, at τ is given by

$$\text{NER}(\tau) = 10 \log_{10} \frac{\sum_{\omega} |N(\tau,\omega)|^2}{\sum_{\omega} \big(|N(\tau,\omega)|^2 - \hat{P}_N(\tau,\omega)\big)^2}$$
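A sketch of the metric follows; taking the error energy as the squared difference between true and estimated noise power per bin matches our reconstruction above but is an assumption:

```python
import numpy as np

def seg_ner(P_noise, P_noise_est, lo=0.0, hi=35.0):
    """Average segmental NER (17) over frames: per-frame dB ratio of noise
    energy to noise-estimation-error energy, limited to [0, 35] dB.
    Inputs are (frames, bins) arrays of true and estimated noise power."""
    noise_energy = np.sum(P_noise, axis=1)
    error_energy = np.sum((P_noise - P_noise_est) ** 2, axis=1)
    ner = 10.0 * np.log10(noise_energy / np.maximum(error_energy, 1e-12))
    return float(np.mean(np.clip(ner, lo, hi)))
```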
We evaluate for noisy speech with 4 different noise types, i.e. highway traffic, car interior, white, and helicopter noise, which (with zero mean) are added to the speech at -5, 0, 5, and 10 dB SNR. In each combination of SNR and noise type we average the SegNER's over speech from 3 male and 3 female speakers from the TIMIT database [12]. Initially, we illustrate the distinct dynamic modeling abilities and the "steady state" estimate convergence of HMM(5,1) with an example where the model is trained on the first 16 observations. For illustrative purposes, as well as for implementation verification, we have in this example used artificially created noise with dynamics that are well captured by the applied model. We conclude that the initialization and subsequent Baum-Welch training lead to model parameters where the dynamics of the noise have been well captured by the model.
The best average SegNER in each combination of noise type and input SNR is emphasized using bold letters. For -5 dB highway traffic noise HMM(5,1) and GMM(5) are equally good, so we emphasize the average SegNER of the method that was best before the SegNER's were rounded.
From the table we see that HMM(5,1) has the highest average SegNER's for highway traffic, car interior, and helicopter noise and GMM(5) has the highest average SegNER's for white noise. That GMM(5) is better than HMM(5,1) in white noise is to be expected
since the additional degrees of freedom in the HMM(5,1) relative to the GMM(5) tend to cause over-modeling. This is particularly the case when the training sets are small. In case of the HMM(5,1) the local variations in the realizations of stationary stochastic noise processes are, generally, captured by a dynamic model and in case of the GMM(5) by a static model. In all the tested combinations of noise type and SNR the proposed method is considerably better than both MCRA [10] and MS [6]. This is the case in regions of speech presence as well as in regions of speech absence. We note that in all cases the SegNER's for HMM(5,1) and GMM(5) have only minor differences.
Discussion
We have proposed a HMM based method for noise PSD estimation that depends on an external connected time-frequency region speech presence detector. The method is trained on-line in speech absence and applied on-line for noise PSD estimation in connected regions of speech presence. Estimates from the proposed method are consistently less spectrally distorted than estimates from any of the three reference methods. We have shown that for the tested real-life noise types there are only minor differences in performance between GMM(5) and HMM(5,1), i.e. the static model leads to similar performance as the dynamic model. For noise environments with clearly dynamically changing noise PSD PDF's the HMM would be advantageous, though, as it does have the ability to model dynamic behavior of the PDF, that is, non-stationary noise. However, in order for the HMM to be a significantly better model than the GMM the number of states and the number of Gaussians in each GMM must be adequately chosen. Also, the size of the training sets needs to be large enough for the dynamics to be captured during the Baum-Welch training.
At the cost of increased computational complexity and memory requirements the number of states and Gaussians in the HMM could be estimated on-line during speech pauses. At each ω various models could be trained and the best choice could be made from measuring the distortion between estimated and observed noise.
We have shown that the MMSE estimate under certain conditions converges to a "steady state" estimate determined by a dominant eigenvector of the transposed transition matrix. In case the columns of the transposed state transition probability matrix Aᵀ after Baum-Welch training remain uniformly distributed, the trace, which equals the sum of the eigenvalues, will be 1. Therefore the dominant eigenvalue of the symmetric and positive semi-definite matrix, implying non-negative eigenvalues [13, p. 269], will be the only eigenvalue different from zero. This will cause immediate convergence to the "steady state" estimate where the HMM reduces to a GMM. For stationary noise we consider this to be desired behavior. More generally, the gap in magnitude between the dominant eigenvalue and the remaining eigenvalues affects the rate of convergence. There will also be an impact from the angle between the forward likelihood vector and the individual subspaces spanned by associated eigenvectors. Experiments have shown that the rate of convergence for a number of models trained on the same training set with different initializations differs for parameters associated with each of the local maxima of the likelihood function. There is no direct relationship, however, between the likelihood of a local maximum and the rate of convergence.
For environments with increasing or decreasing noise levels delta parameters will be better suited for the proposed noise estimation method. They will give the "steady state" estimate the ability to follow the level of the noise. Using a non-delta representation is the most conservative approach and will, unlike the delta representation, ensure a stable MMSE estimate. It will, however, not be able to follow increasing or decreasing noise during speech presence.
If the statistical models are trained with no spectral smoothing of the training sets, i.e. with D = 0, the method proposed in this paper could easily be modified to provide estimated noise PSD PDF's at each frequency. This makes the proposed method applicable, and very well suited, for statistical speech enhancement.
The method described in this paper can be applied on subbands of any width. For the model to benefit from any inter-frequency dependencies the vectors in each subband should be modeled using full (non-diagonal) covariance matrices. Better performance could very well be a consequence of the ability to model inter-frequency dependencies in the noise. HMM modeling employing parametric descriptions of larger time-frequency regions is a topic of current research.
References
[1] H. Sameti, H. Sheikhzadeh, L. Deng, and R. L. Brennan, "HMM-Based Strategies for Enhancement of Speech Signals Embedded in Nonstationary Noise," IEEE Trans. Speech Audio Processing, vol. 6, no. 5, pp. 445-455, Sept. 1998.
[2] P. Gaunard, C. G. Mubikangiey, C. Couvreur, and V. Fontaine, "Automatic Classification of Environmental Noise Events by Hidden Markov Models," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 6, May 1998, pp. 3609-3612.
[3] M. H. Ghoreishi and H. Sheikhzadeh, "A Hybrid Speech Enhancement System Based on HMM and Spectral Subtraction," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 3, June 2000, pp. 1855-1858.
[4] M. Hosoki, T. Nagai, and A. Kurematsu, "Speech Signal Band Width Extension and Noise Removal using Subband HMM," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 1, May 2002, pp. 245-248.
[5] K. V. Sørensen and S. V. Andersen, "Speech Enhancement with Natural Sounding Residual Noise based on Connected Time-Frequency Speech Presence Regions," EURASIP Journal on Applied Signal Processing, vol. 2005, no. 18, pp. 2954-2964, Oct. 2005.
[6] R. Martin, "Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics," IEEE Trans. Speech Audio Processing, vol. 9, no. 5, pp. 504-512, July 2001.

[7] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Annals of Mathematical Statistics, vol. 41, pp. 164-171, 1970.
[8] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Prentice-Hall, 1993.
[9] E. Seneta, Non-negative Matrices and Markov Chains. Springer-Verlag, 1981.

[10] I. Cohen and B. Berdugo, "Noise Estimation by Minima Controlled Recursive Averaging for Robust Speech Enhancement," IEEE Signal Processing Lett., vol. 9, no. 1, pp. 12-15, Jan. 2002.

[11] J. R. Deller, Jr., J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals. Wiley-Interscience, 2000.
[12] DARPA TIMIT Acoustic-Phonetic Speech Database, National Institute of Standards and Technology (NIST), Gaithersburg, MD 20899, USA, CD-ROM.
[13] P. Stoica and R. Moses, Introduction to Spectral Analysis. Prentice-Hall, 1997.
Fig. 2 illustrates a block diagram of a preferred device embodiment. The illustrated device may be such as a mobile phone, a headset or a part thereof. The device is adapted to receive a noisy signal, e.g. an electrical analog or digital signal representing an audio signal containing speech and unintended noise. The device includes a digital signal processor DSP that performs a signal processing on the noisy signal. First, a speech presence detection is performed such as described in the foregoing, including spectral-temporal smoothing prior to making a speech presence decision, and such as will be described in detail in the following. The speech presence detection serves as input to a speech enhancement algorithm, as already mentioned, which will also be described in more detail in the following. The output of the speech enhancement algorithm is a signal where the speech is enhanced in relation to the noise. This signal with enhanced speech is applied to a loudspeaker, preferably via an amplifier, so as to present an acoustic representation of the speech enhanced signal to a listener.
As mentioned, the device in Fig. 2 may be a hearing aid, a headset or a mobile phone or the like. In case of a headset, the DSP may either be built into the headset, or the DSP may be positioned remote from the headset, e.g. built into other equipment such as amplifier equipment. In case of a hearing aid, the noisy signal can originate from a remote audio source or from a microphone built into the hearing aid.
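As a sketch of the region-based enhancement step described by claims 9-11 below, a binary speech-presence mask can select between the two attenuation functions; using unity gain in the speech regions is a placeholder of ours, not the method's actual first attenuation function:

```python
import numpy as np

def enhance(Y, speech_mask, noise_gain=0.2):
    """Apply a first gain to time-frequency bins detected as speech and a
    scalar attenuation (0.05-0.4, preferably 0.2, per the claims) to bins
    detected as noise. Y is the STFT of the noisy signal; speech_mask is a
    boolean array of the same shape from the speech presence detector."""
    gain = np.where(speech_mask, 1.0, noise_gain)
    return Y * gain          # enhanced STFT; reconstruct via overlap-add
```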
It is to be understood that reference signs in the claims should not be construed as limiting with respect to the scope of the claims.

Claims
1. A method for speech presence detection of speech in a noisy signal, the noisy signal being represented as a temporal-spectral distribution, the method including performing a temporal-spectral smoothing on the temporal-spectral representation of the noisy signal prior to making a speech presence decision.
2. Method according to claim 1, wherein the temporal-spectral smoothing includes performing a spectral smoothing prior to performing a temporal smoothing.
3. Method according to claim 2, wherein the temporal smoothing includes performing a recursive smoothing with time and frequency varying smoothing parameters.
4. Method according to claim 3, wherein the smoothing parameters are lower limited so as to ensure a minimum degree of smoothing.
5. Method according to any of the preceding claims, wherein the speech presence decision is based on a minimum tracking of an output of the temporal-spectral smoothing.
6. Method according to any of the preceding claims, wherein a noise estimation is performed based on the speech presence decision.
7. Method according to any of the preceding claims, wherein the speech presence decision is a binary decision.
8. A speech enhancement method including performing the method according to any of the preceding claims, and performing enhancement of the speech in the noisy signal based on the speech presence detection.

9. Method according to claim 8, wherein a first attenuation function is applied to regions of the temporal-spectral distribution where presence of speech is detected, and a second attenuation function is applied to other regions of the temporal-spectral distribution, the first and second attenuation functions being different.

10. Method according to claim 9, wherein the second attenuation function applied to the other regions of the temporal-spectral distribution is a scalar attenuation.
11. Method according to claim 10, wherein the scalar attenuation is implemented by multiplying by a scalar in the range 0.05 to 0.4, preferably 0.2.
12. Device including a processor adapted to perform the method according to any of the preceding claims.
13. Device according to claim 12, the device being selected from the group consisting of: a mobile phone, a radio communication device, an internet telephony system, sound recording equipment, sound processing equipment, sound editing equipment, broadcasting sound equipment, and a monitoring system.
14. Device according to claim 12, the device being selected from the group consisting of: a hearing aid, a headset, an assistive listening device, an electronic hearing protector, and a headphone with a built-in microphone.
15. Computer executable program code adapted to perform the method according to any of claims 1-11.
PCT/DK2006/000221 2005-04-26 2006-04-26 Detection of speech present in a noisy signal and speech enhancement making use thereof WO2006114101A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
DKPA200500606 2005-04-26
DKPA200500607 2005-04-26
DKPA200500607 2005-04-26
DKPA200500606 2005-04-26

Publications (1)

Publication Number Publication Date
WO2006114101A1 true WO2006114101A1 (en) 2006-11-02

Family

ID=36566135

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/DK2006/000221 WO2006114101A1 (en) 2005-04-26 2006-04-26 Detection of speech present in a noisy signal and speech enhancement making use thereof

Country Status (1)

Country Link
WO (1) WO2006114101A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6363345B1 (en) * 1999-02-18 2002-03-26 Andrea Electronics Corporation System, method and apparatus for cancelling noise

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BEROUTI M ET AL: "ENHANCEMENT OF SPEECH CORRUPTED BY ACOUSTIC NOISE", INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH & SIGNAL PROCESSING. ICASSP. WASHINGTON, APRIL 2 - 4, 1979, NEW YORK, IEEE, US, vol. CONF. 4, 1979, pages 208 - 211, XP001079151 *
MARTIN R: "Noise power spectral density estimation based on optimal smoothing and minimum statistics", IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING IEEE USA, vol. 9, no. 5, July 2001 (2001-07-01), pages 504 - 512, XP002385167, ISSN: 1063-6676 *
SORENSEN K V ET AL: "Speech presence detection in the time-frequency domain using minimum statistics", PROCEEDINGS OF THE 6TH NORDIC SIGNAL PROCESSING SYMPOSIUM (IEEE CAT. NO.04EX856) IEEE PISCATAWAY, NJ, USA, 2004, pages 340 - 343, XP002385166 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1971186A3 (en) * 2007-03-12 2016-07-20 Sivantos GmbH Method for reducing interference noises with trainable models
WO2009124926A2 (en) * 2008-04-07 2009-10-15 Cambridge Silicon Radio Limited Noise reduction
WO2009124926A3 (en) * 2008-04-07 2010-01-21 Cambridge Silicon Radio Limited Noise reduction
US9142221B2 (en) 2008-04-07 2015-09-22 Cambridge Silicon Radio Limited Noise reduction
DE112009000805B4 (en) 2008-04-07 2018-06-14 Qualcomm Technologies International, Ltd. noise reduction
US20130054236A1 (en) * 2009-10-08 2013-02-28 Telefonica, S.A. Method for the detection of speech segments
US9082411B2 (en) 2010-12-09 2015-07-14 Oticon A/S Method to reduce artifacts in algorithms with fast-varying gain
WO2012127278A1 (en) * 2011-03-18 2012-09-27 Nokia Corporation Apparatus for audio signal processing
CN106340310A (en) * 2015-07-09 2017-01-18 展讯通信(上海)有限公司 Speech detection method and device
CN112289337A (en) * 2020-11-03 2021-01-29 北京声加科技有限公司 Method and device for filtering residual noise after machine learning voice enhancement
CN112289337B (en) * 2020-11-03 2023-09-01 北京声加科技有限公司 Method and device for filtering residual noise after machine learning voice enhancement

Similar Documents

Publication Publication Date Title
Parchami et al. Recent developments in speech enhancement in the short-time Fourier transform domain
Kim et al. Improving speech intelligibility in noise using environment-optimized algorithms
Xiao et al. Normalization of the speech modulation spectra for robust speech recognition
Hansen et al. Speech enhancement based on generalized minimum mean square error estimators and masking properties of the auditory system
WO2006114101A1 (en) Detection of speech present in a noisy signal and speech enhancement making use thereof
Garg et al. A comparative study of noise reduction techniques for automatic speech recognition systems
Sørensen et al. Speech enhancement with natural sounding residual noise based on connected time-frequency speech presence regions
Jaiswal et al. Implicit wiener filtering for speech enhancement in non-stationary noise
Gupta et al. Speech enhancement using MMSE estimation and spectral subtraction methods
Erell et al. Energy conditioned spectral estimation for recognition of noisy speech
Flynn et al. Combined speech enhancement and auditory modelling for robust distributed speech recognition
Han et al. Reverberation and noise robust feature compensation based on IMM
Thiagarajan et al. Pitch-based voice activity detection for feedback cancellation and noise reduction in hearing aids
Goel et al. Developments in spectral subtraction for speech enhancement
Shankar et al. Noise dependent super gaussian-coherence based dual microphone speech enhancement for hearing aid application using smartphone
Nidhyananthan et al. A review on speech enhancement algorithms and why to combine with environment classification
Maganti et al. A perceptual masking approach for noise robust speech recognition
Xie et al. Speech enhancement by nonlinear spectral estimation-a unifying approach.
Feng et al. DNN-based linear prediction residual enhancement for speech dereverberation
Naik et al. A literature survey on single channel speech enhancement techniques
Hepsiba et al. Computational intelligence for speech enhancement using deep neural network
Dionelis On single-channel speech enhancement and on non-linear modulation-domain Kalman filtering
Pacheco et al. Spectral subtraction for reverberation reduction applied to automatic speech recognition
Krishnamoorthy et al. Application of combined temporal and spectral processing methods for speaker recognition under noisy, reverberant or multi-speaker environments
Esch et al. Model-based speech enhancement exploiting temporal and spectral dependencies

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

NENP Non-entry into the national phase

Ref country code: RU

WWW Wipo information: withdrawn in national office

Country of ref document: RU

122 Ep: pct application non-entry in european phase

Ref document number: 06722913

Country of ref document: EP

Kind code of ref document: A1