CN103310789A - Sound event recognition method based on optimized parallel model combination - Google Patents

Sound event recognition method based on optimized parallel model combination

Info

Publication number
CN103310789A
CN103310789A, CN2013102397247A, CN201310239724A
Authority
CN
China
Prior art keywords
sound event
noise
model
template
spectral domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013102397247A
Other languages
Chinese (zh)
Other versions
CN103310789B (en)
Inventor
刘宏
王一
李晓飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN201310239724.7A priority Critical patent/CN103310789B/en
Publication of CN103310789A publication Critical patent/CN103310789A/en
Application granted granted Critical
Publication of CN103310789B publication Critical patent/CN103310789B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to a sound event recognition method based on optimized parallel model combination. The method includes: 1) recording sound event data, training a GMM (Gaussian mixture model) on clean sound events, and establishing clean sound event templates; 2) acquiring noise data of the current environment in a real noisy indoor environment, training a GMM on the noise data, and establishing a noise template; 3) processing the noise template and the clean sound event templates with the optimized parallel model combination method to obtain templates of the noisy sound events; 4) sampling to obtain noisy sound event sample signals and recognizing them according to the parameters in the noisy sound event templates. The method builds a GMM that better describes the distribution of background noise features and uses it as one input of the PMC (parallel model combination) method, while the clean GMMs of five sound events form the other input, thereby ensuring the robustness of the recognition system to noise.

Description

A sound event recognition method based on improved parallel model combination
Technical field
The invention belongs to the field of audio signal processing for intelligent monitoring and relates to sound event recognition in indoor environments, specifically to a sound event recognition method based on improved parallel model combination.
Background art
Compared with the mature speech recognition methods in the field of artificial intelligence, using computers to recognize sound events is a relatively recent research direction. Sound event recognition aims to automatically detect and classify sounds produced in the physical environment that carry a certain meaning or reflect human behavior. In a smart home monitoring system, sound event recognition can help people remotely monitor what is happening in the indoor environment and inform the user in time which event has occurred, which helps the user react promptly. However, real environments contain complex noise, so achieving effective monitoring in a real environment makes handling the noise both necessary and urgent.
First, sound event recognition is a pattern recognition problem similar to automatic speech recognition. The fundamental approach is signal processing followed by pattern recognition. Existing sound event recognition methods comprise the following steps:
(1) Recording, pre-filtering, and analog-to-digital conversion of the sound event signal. The recorded analog sound signal is first pre-filtered: high-pass filtering suppresses the 50 Hz power-line noise, and low-pass filtering removes the frequency components above half the sampling frequency to prevent aliasing. The analog sound signal is then sampled and quantized into a digital signal.
(2) Framing and windowing. Like speech signals, sound signals are non-stationary as a whole but stationary over short segments; by analogy with speech, a sound signal can be considered stationary within 10-30 ms, so it can be divided into frames of about 30 ms. A window function is used to extract the signal when framing, and the choice of window function (shape and length) strongly affects the short-time analysis parameters. Common window functions include the rectangular window, the Hanning window, and the Hamming window; the Hamming window is generally chosen because it reflects the characteristic variations of the sound signal well (an illustrative sketch of framing, windowing, and feature extraction follows this list of steps).
(3) Feature extraction. Different sound events have different features, and to distinguish different sound signals the features of the audio signal must be described mathematically. Commonly used features for sound event recognition include time-domain features such as short-time energy and short-time zero-crossing rate; frequency-domain features such as sub-band energy and wavelet time-frequency features; and cepstral-domain features such as linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC).
(4) Recognition. Sound event recognition also adopts algorithms similar to those of speech recognition. Commonly used methods include classification based on support vector machines (SVM), clustering based on Gaussian mixture models (GMM), hidden Markov model (HMM) methods, and Bayesian algorithms.
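As an illustration of steps (2) and (3) above, the following sketch frames a signal with a Hamming window and extracts 13-dimensional MFCC features. It is not part of the patent: the file name is hypothetical, the 256-sample frame length, 128-sample shift, and 13 MFCC dimensions are the values used in the embodiment described later, and librosa is just one possible library (its MFCC routine performs its own framing and windowing internally).

```python
import numpy as np
import librosa

def frame_and_window(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    win = np.hamming(frame_len)
    return np.stack([x[i * hop:i * hop + frame_len] * win for i in range(n_frames)])

# "sound_event.wav" is a hypothetical recording; the embodiment samples at 11025 Hz, 16 bit.
x, sr = librosa.load("sound_event.wav", sr=11025)

frames = frame_and_window(x)                            # (n_frames, 256) windowed frames
mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13,
                            n_fft=256, hop_length=128)  # 13 coefficients per frame
features = mfcc.T                                       # (n_frames, 13) feature vector sequence
```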
Second, the handling of noise. When the recognition methods described above are used in a real environment, the performance of the recognition system deteriorates sharply with the mismatch between training data and test data, and environmental noise is exactly what causes this mismatch. The training/test mismatch caused by noise can be analyzed in three spaces: the signal space, the feature space, and the model space. Commonly used noise-handling methods that improve the robustness of the system include sound enhancement methods similar to speech enhancement, robust feature extraction, feature compensation, and model compensation such as parallel model combination (PMC).
Most existing methods still follow the practice of speech recognition, and the handling of noise is limited to the methods above. Among them, PMC-based methods are widely adopted because they describe the environmental noise and can fully exploit the information in the environment to improve the robustness of recognition. However, in existing PMC methods the noise characteristics are described with a single Gaussian model (SGM); when the noise is complex, an SGM cannot characterize the noise well, so the recognition rate under complex noise is not satisfactory.
Summary of the invention
In order to solve the above technical problems, the object of the present invention is to provide a method that obtains, through improved model parameter fusion, noisy sound event models matching the noise environment, so as to recognize sound events to be identified in a real noisy environment.
In order to achieve the above purpose, the technical solution of the present invention is a sound event recognition method based on improved parallel model combination, whose steps comprise:
1) training a GMM (Gaussian mixture model) on clean sound events and establishing clean sound event templates;
2) training a GMM on noise data and establishing a noise template;
3) applying the parallel model fusion method to the noise template and the clean sound event templates to obtain noisy sound event templates;
4) sampling to obtain a noisy sound event sample signal and recognizing the sound of the sample signal according to the parameters in the noisy sound event templates.
Further, the method for establishing the templates of the clean sound events is as follows:
1) the sound event data are recorded in a quiet, noise-free indoor environment, and the recorded sound events are pre-filtered, analog-to-digital converted, and then framed and windowed;
2) MFCC (Mel-frequency cepstral coefficient) features are extracted and the GMM (Gaussian mixture) templates of the sound events are trained.
Further, the Gaussian mixture model is trained with the EM algorithm, which updates the Gaussian model parameters. The GMM parameters of a clean sound event obtained by training are $\lambda_x = \{w_{xk}, \mu_{xk}, \Sigma_{xk}\},\ k = 1, 2, \dots, M$, where $w_{xk}$, $\mu_{xk}$, and $\Sigma_{xk}$ are the mixture weight, mean, and variance of the clean sound event model, and $M$ is the number of Gaussian mixture components.
Further, the noise data of the current environment are acquired in the real noisy indoor environment, and the noise template is established by extracting the MFCC features and building the GMM template of the noise. The noise template GMM parameters are $\lambda_n = \{w_{nk}, \mu_{nk}, \Sigma_{nk}\},\ k = 1, 2, \dots, M$, where $w_{nk}$, $\mu_{nk}$, and $\Sigma_{nk}$ are the mixture weight, mean, and variance of the noise model, and $M$ is the number of Gaussian mixture components.
Further, the improved parallel model fusion applied to the noise template and the clean sound event templates is as follows:
(1) the model parameters are mapped from the cepstral domain by the inverse discrete cosine transform, giving the log-spectral-domain mean $\mu^{log} = C^{-1}\mu$ and variance $\Sigma^{log} = C^{-1}\Sigma(C^{-1})^{T}$, where $C$ is the discrete cosine transform matrix and $\mu$, $\Sigma$ are respectively the cepstral-domain mean and variance of the model;
(2) the log-spectral mean and variance are transformed to the linear spectral domain by the exponential function: $\mu_i^{lin} = \exp(\mu_i^{log} + \Sigma_{ii}^{log}/2)$ is the $i$-th element of the linear-spectral mean vector and $\Sigma_{ij}^{lin} = \mu_i^{lin}\mu_j^{lin}[\exp(\Sigma_{ij}^{log}) - 1]$ is the element in row $i$, column $j$ of the linear-spectral covariance matrix, where $\mu_i^{log}$ and $\Sigma_{ij}^{log}$ are the corresponding elements of the log-spectral mean vector and covariance matrix;
(3) the improved parallel model combination fuses the clean sound event model parameters and the noise model parameters in the linear spectral domain: $\mu_{yk}^{lin} = g\,\mu_{xk}^{lin} + (1-g)\sum_{k=1}^{K} w_{nk}\,\mu_{nk}^{lin}$ is the mean and $\Sigma_{yk}^{lin} = g^{2}\,\Sigma_{xk}^{lin} + (1-g)^{2}\sum_{k=1}^{K} w_{nk}\,\Sigma_{nk}^{lin}$ is the variance of the fused noisy sound event model in the linear spectral domain, where $\mu_{xk}^{lin}$ and $\Sigma_{xk}^{lin}$ are the mean and variance of the clean sound event model and $\mu_{nk}^{lin}$ and $\Sigma_{nk}^{lin}$ are the mean and variance of the noise model after the transforms of steps (1) and (2);
(4) the linear-spectral mean and variance of the fused noisy sound event model are transformed back to the log-spectral domain by the inverse of step (2) and then to the cepstral-domain characteristic parameters by the inverse of step (1), giving the mean vector and variance of the noisy sound event model.
Further, the parameters of the noisy sound event model are $\lambda_y = \{w_{yk}, \mu_{yk}, \Sigma_{yk}\},\ k = 1, 2, \dots, M$, where $w_{yk}$, $\mu_{yk}$, and $\Sigma_{yk}$ are respectively its mixture weight, mean, and variance. Since the mixture weights do not differ between the linear spectral, log-spectral, and cepstral domains, the mixture weight $w_{yk}$ of the noisy sound event model is the weight $w_{xk}$ of the clean sound event template, and $M$ is the number of Gaussian mixture components.
Further, the method of recognizing the sound of the sample signal according to the parameters in the noisy sound event models is as follows:
1) the sample signal is pre-filtered and analog-to-digital converted, then framed and windowed, and multi-dimensional MFCC features are extracted to obtain the feature sequence of the sample signal;
2) the feature vector sequence of the sample signal is matched against the noisy sound event models, the matching likelihood is computed, and the template with the maximum likelihood is the recognition result.
Further, the noise data use the babble noise of NoiseX-92 and/or air-conditioning noise in an indoor environment.
Technical effects of the invention:
Under a complex noise background, the present invention builds a background GMM that better describes the distribution of background noise features and uses it as one input of the PMC method, while the clean GMMs of the five sound events serve as the other input. The improved model parameter fusion yields noisy sound event models that match the noise environment and gives good recognition of the sound events to be identified under real noise, ensuring the robustness of the recognition system to noise.
Description of drawings
Fig. 1 is a schematic diagram of the overall recognition process of the sound event recognition method based on improved parallel model combination according to the present invention.
Fig. 2 is a schematic diagram of the fusion method in one embodiment of the sound event recognition method based on improved parallel model combination according to the present invention.
Fig. 3 is a schematic diagram of the recognition results for the five sound events in one embodiment of the sound event recognition method based on improved parallel model combination according to the present invention.
Detailed description of the embodiments
The technical scheme in the embodiments of the invention is described clearly and completely below with reference to the accompanying drawings. It should be understood that the described embodiments are only part of the embodiments of the present invention rather than all of them; all other embodiments obtained by those skilled in the art without creative work, based on the embodiments of the present invention, fall within the scope of protection of the present invention.
The present invention recognizes five sound events that frequently occur in indoor environments and need attention. In addition, complex noise is fully taken into account (air-conditioning noise recorded in an indoor environment and the babble noise of the public noise database NoiseX-92). A Gaussian mixture model (GMM; see "Speech Signal Processing", 2nd edition, Zhao Li, China Machine Press, pp. 228-230) is used to describe the background noise signal: a GMM describes the background feature distribution as a weighted sum of several Gaussian distributions and can therefore describe the information of the background noise more fully. The background noise model parameters are then used to compensate the clean sound event model parameters at the model level, yielding noisy sound event models and preventing the mismatch between training data and test data caused by noise.
The present invention is a sound event recognition method based on improved parallel model combination; its content is as follows:
First, establish the templates of the clean sound events.
(1) The data of the five sound events are recorded in a quiet environment and preprocessed (framing, windowing, etc.) according to the sound signal processing steps described above.
(2) The robust MFCC features are then extracted as described above, and a Gaussian mixture model is trained for each of the five sound events. The Gaussian mixture models are trained with the EM algorithm, which updates the Gaussian model parameters. The GMM parameters of one of the clean sound events obtained by training are assumed to be:
$\lambda_x = \{w_{xk}, \mu_{xk}, \Sigma_{xk}\},\ k = 1, 2, \dots, M \quad (1)$
Second, acquire the noise data of the current environment, extract the MFCC features, and build the GMM template of the noise. The noise template parameters obtained are:
$\lambda_n = \{w_{nk}, \mu_{nk}, \Sigma_{nk}\},\ k = 1, 2, \dots, M \quad (2)$
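To make the two training steps concrete, the sketch below fits the clean and noise GMMs with scikit-learn, whose GaussianMixture estimator uses the EM algorithm; the random arrays merely stand in for stacked 13-dimensional MFCC vectors, the 8 diagonal-covariance components follow the embodiment described below, and all names are illustrative rather than the patent's own code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-ins for stacked 13-dimensional MFCC vectors; in practice they come from
# the clean sound event recordings and the background noise recordings.
rng = np.random.default_rng(0)
mfcc_clean = rng.standard_normal((600, 13))
mfcc_noise = rng.standard_normal((200, 13))

# GaussianMixture is fitted by expectation-maximization; M = 8 components as in the embodiment.
gmm_clean = GaussianMixture(n_components=8, covariance_type="diag", max_iter=200).fit(mfcc_clean)
gmm_noise = GaussianMixture(n_components=8, covariance_type="diag", max_iter=200).fit(mfcc_noise)

# The fitted parameters correspond to the sets in formulas (1) and (2).
w_x, mu_x, var_x = gmm_clean.weights_, gmm_clean.means_, gmm_clean.covariances_
w_n, mu_n, var_n = gmm_noise.weights_, gmm_noise.means_, gmm_noise.covariances_
```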
Third, perform the model fusion. Since the data used to train the GMMs in the present invention are MFCC features, which belong to the cepstral domain, while background noise and sound event model parameters are only additive in the linear spectral domain, both models are processed as follows (using $\lambda = \{w_k, \mu_k, \Sigma_k\},\ k = 1, 2, \dots, M$ to denote either the clean sound GMM or the background noise GMM):
1) The model parameters are mapped from the cepstral domain toward the linear spectral domain; specifically, the inverse of the discrete cosine transform is applied first. Here the difference (delta) coefficients of the MFCC are not extracted. The calculation is given by formulas (3) and (4):
$\mu^{log} = C^{-1}\mu \quad (3)$
$\Sigma^{log} = C^{-1}\Sigma(C^{-1})^{T} \quad (4)$
where $\mu^{log}$ and $\Sigma^{log}$ are the mean and variance of the log-spectral-domain model, $\mu$ and $\Sigma$ are the cepstral-domain mean and variance of the model, and $C$ is the discrete cosine transform matrix.
2) The normally distributed random variables of the log-spectral domain are transformed to the linear spectral domain by the exponential function, as in formulas (5) and (6):
$\mu_i^{lin} = \exp\left(\mu_i^{log} + \frac{\Sigma_{ii}^{log}}{2}\right) \quad (5)$
$\Sigma_{ij}^{lin} = \mu_i^{lin}\,\mu_j^{lin}\left[\exp\left(\Sigma_{ij}^{log}\right) - 1\right] \quad (6)$
where $\mu_i^{lin}$ and $\Sigma_{ij}^{lin}$ are respectively the $i$-th element of the linear-spectral mean vector and the element in row $i$, column $j$ of the linear-spectral covariance matrix, and $\mu_i^{log}$ and $\Sigma_{ij}^{log}$ are the corresponding elements of the log-spectral mean vector and covariance matrix.
3) Let the linear-spectral mean vector and variance of the clean sound event model obtained from the above formulas be $\mu_{xk}^{lin}$ and $\Sigma_{xk}^{lin}$, and the linear-spectral mean vector and variance of the noise model be $\mu_{nk}^{lin}$ and $\Sigma_{nk}^{lin}$. The two models are fused with formulas (7) and (8):
$\mu_{yk}^{lin} = g\,\mu_{xk}^{lin} + (1-g)\sum_{k=1}^{K} w_{nk}\,\mu_{nk}^{lin} \quad (7)$
$\Sigma_{yk}^{lin} = g^{2}\,\Sigma_{xk}^{lin} + (1-g)^{2}\sum_{k=1}^{K} w_{nk}\,\Sigma_{nk}^{lin} \quad (8)$
where $\mu_{yk}^{lin}$ and $\Sigma_{yk}^{lin}$ are the mean vector and variance of the fused noisy sound event model and $g$ is the gain factor.
4) The fused linear-spectral model parameters are transformed back to the log-spectral domain through the inverses of formulas (5) and (6), and then to the cepstral-domain model parameters through the inverses of formulas (3) and (4). Applying this processing to all five sound event models finally yields the parameters of the five fused noisy sound event models.
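The following numerical sketch walks through formulas (3)-(8) for a single clean mixture component, assuming diagonal covariances, a square orthonormal DCT matrix, and no delta coefficients; the function and variable names are illustrative and this is not the patent's own implementation.

```python
import numpy as np
from scipy.fftpack import dct

def pmc_fuse(mu_x, var_x, mu_n, var_n, w_n, g=0.5):
    """Fuse one clean GMM component with a noise GMM (sketch of formulas (3)-(8)).

    mu_x, var_x : cepstral mean / diagonal variance of one clean component, shape (D,)
    mu_n, var_n : cepstral means / diagonal variances of the K noise components, shape (K, D)
    w_n         : noise mixture weights, shape (K,)
    g           : gain factor (the embodiment uses g = 0.5)
    """
    D = mu_x.shape[0]
    # DCT-II matrix C such that C @ log_spectrum gives the cepstrum (assumed square, orthonormal)
    C = dct(np.eye(D), type=2, norm="ortho", axis=0)
    C_inv = np.linalg.inv(C)

    def to_linear(mu_c, var_c):
        # cepstral -> log-spectral, formulas (3)-(4) (diagonal elements only)
        mu_log = C_inv @ mu_c
        var_log = np.diag(C_inv @ np.diag(var_c) @ C_inv.T)
        # log-spectral -> linear spectral, formulas (5)-(6)
        mu_lin = np.exp(mu_log + var_log / 2.0)
        var_lin = mu_lin ** 2 * (np.exp(var_log) - 1.0)
        return mu_lin, var_lin

    def to_cepstral(mu_lin, var_lin):
        # inverse of formulas (5)-(6), then of formulas (3)-(4)
        var_log = np.log(var_lin / mu_lin ** 2 + 1.0)
        mu_log = np.log(mu_lin) - var_log / 2.0
        mu_c = C @ mu_log
        var_c = np.diag(C @ np.diag(var_log) @ C.T)
        return mu_c, var_c

    mu_x_lin, var_x_lin = to_linear(mu_x, var_x)

    # weight-average the noise components in the linear spectral domain
    mu_n_lin = np.zeros(D)
    var_n_lin = np.zeros(D)
    for k in range(len(w_n)):
        m_lin, v_lin = to_linear(mu_n[k], var_n[k])
        mu_n_lin += w_n[k] * m_lin
        var_n_lin += w_n[k] * v_lin

    # formulas (7)-(8): fuse clean and noise statistics with gain factor g
    mu_y_lin = g * mu_x_lin + (1.0 - g) * mu_n_lin
    var_y_lin = g ** 2 * var_x_lin + (1.0 - g) ** 2 * var_n_lin

    return to_cepstral(mu_y_lin, var_y_lin)
```

Applying this fusion to every component of each of the five clean GMMs, with the mixture weights copied from the clean templates, would yield the five noisy sound event models used for matching.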
Fourth, for a noisy sound event sample extracted from the indoor noisy environment, the purpose of recognition is to determine which of the five sound events the current sample belongs to, i.e. to compute the posterior probability of the sample under each of the five models; the model with the largest posterior probability gives the class of the sample. By Bayes' rule, since the five sound events are equally likely a priori, for a given observation vector the maximum a posteriori decision is equivalent to computing the probability of the observation under each of the five sound event models and taking the model with the largest probability as the class of the sample.
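A minimal sketch of this decision rule, assuming the five fused models are scikit-learn GaussianMixture objects (an assumption, not the patent's implementation): with equal priors, the maximum a posteriori decision reduces to choosing the model with the largest total log-likelihood over the frame sequence.

```python
import numpy as np

def classify(sample_feats, noisy_models, class_names):
    """Return the class whose fused GMM gives the highest likelihood.

    sample_feats : (n_frames, 13) MFCC sequence of the noisy test sample
    noisy_models : the five fused GMMs (e.g. sklearn GaussianMixture objects)
    class_names  : the five sound event labels
    """
    # GaussianMixture.score() returns the average per-frame log-likelihood;
    # multiplying by the number of frames gives the total for the sequence.
    totals = [m.score(sample_feats) * len(sample_feats) for m in noisy_models]
    return class_names[int(np.argmax(totals))]
```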
Fig. 1 is a schematic diagram of the overall recognition process of the sound event recognition method based on improved parallel model combination, comprising a training part and a recognition part.
The present invention considers five sound events that occur frequently in indoor environments and need attention: door-closing sounds, knocking, clapping, speech, and bird calls. The training of the five sound event templates and the noise template proceeds as follows:
1. The five sound event databases are recorded in a quiet environment and labeled. One hundred samples of each sound event type are obtained from 5 male and 5 female subjects producing the sound or the corresponding action. The noise uses the babble noise of NoiseX-92 and air-conditioning noise in an indoor environment.
2. Pre-filtering: high-pass filtering suppresses the 50 Hz power-line noise, and low-pass filtering removes the frequency components above half the sampling frequency. Analog-to-digital conversion: the sampling frequency is 11025 Hz and the sampling precision is 16 bits;
3. Each complete sound segment is framed and windowed. The frame length is 256 samples and the frame shift is 128 samples; a Hamming window is used;
4. Feature extraction: 13-dimensional MFCC features are extracted;
5. Sixty feature vector sequences are used for each sound event and ten for the noise. The GMM templates $\lambda_{xk},\ k = 1, 2, \dots, 5$ of the five sounds and the noise template $\lambda_n$ are trained with the expectation-maximization (EM) algorithm; each template is a Gaussian mixture model with 8 Gaussian components.
The model fusion process of the present invention is shown in Fig. 2, a schematic diagram of the fusion method in one embodiment of the sound event recognition method based on improved parallel model combination.
Concrete steps are as follows:
1. The background noise model and the five clean sound event model parameters are converted from the cepstral domain to the linear spectral domain using formulas (3), (4), (5), and (6).
2. The linear-spectral parameters of the five clean sound events are each fused with the linear-spectral parameters of the noise using formulas (7) and (8), with g = 0.5.
3. The linear-spectral parameters of the fused noisy sound event models are transformed back through the inverses of formulas (5) and (6) and then of formulas (3) and (4), yielding the five noisy sound event GMMs $\lambda_{yk},\ k = 1, 2, \dots, 5$.
The recognition process of the present invention is as follows:
1. Under the above two noise conditions, a total of 110 noisy signals of the five sound events are recorded. They are pre-filtered and analog-to-digital converted at a sampling frequency of 11025 Hz with 16-bit precision.
2. Framing and windowing: the frame length is 256 samples, the frame shift is 128 samples, and a Hamming window is used. 13-dimensional MFCC features are extracted.
3. Template matching: the feature vector sequence of the current audio signal is matched against the five noisy sound event templates. The feature vector sequence is $X_k,\ k = 1, \dots, N$, and the five templates are $\lambda_{yk},\ k = 1, 2, \dots, 5$. The matching likelihood is computed, and the template with the maximum likelihood is the recognition result. Fig. 3 is a schematic diagram of the recognition results for the five sound events in this embodiment of the sound event recognition method based on improved parallel model combination.
The above is an example of the present invention. Although this example is disclosed for the purpose of illustration, those skilled in the art will appreciate that various substitutions, variations, and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the present invention should not be limited to the content of this example.

Claims (9)

1. A sound event recognition method based on improved parallel model combination, whose steps comprise:
1) training a GMM (Gaussian mixture model) on clean sound events and establishing clean sound event templates;
2) training a GMM on noise data and establishing a noise template;
3) applying the parallel model fusion method to the noise template and the clean sound event templates to obtain noisy sound event templates;
4) sampling to obtain a noisy sound event sample signal and recognizing the sound of the sample signal according to the parameters in the noisy sound event templates.
2. The sound event recognition method based on improved parallel model combination of claim 1, characterized in that the templates of the clean sound events are established as follows:
1) the sound event data are recorded in a quiet, noise-free indoor environment, and the recorded sound events are pre-filtered, analog-to-digital converted, and then framed and windowed;
2) MFCC (Mel-frequency cepstral coefficient) features are extracted and the GMM (Gaussian mixture) templates of the sound events are trained.
3. The sound event recognition method based on improved parallel model combination of claim 1, characterized in that the Gaussian mixture model is trained with the EM algorithm, which updates the Gaussian model parameters, and the GMM parameters of the clean sound event obtained by training are $\lambda_x = \{w_{xk}, \mu_{xk}, \Sigma_{xk}\},\ k = 1, 2, \dots, M$, where $w_{xk}$, $\mu_{xk}$, and $\Sigma_{xk}$ are the mixture weight, mean, and variance of the clean sound event model, and $M$ is the number of Gaussian mixture components.
4. The sound event recognition method based on improved parallel model combination of claim 1, characterized in that the noise data of the current environment are acquired in the real noisy indoor environment and the noise template is established by extracting the MFCC features and building the GMM template of the noise, whose parameters are $\lambda_n = \{w_{nk}, \mu_{nk}, \Sigma_{nk}\},\ k = 1, 2, \dots, M$, where $w_{nk}$, $\mu_{nk}$, and $\Sigma_{nk}$ are the mixture weight, mean, and variance of the noise model, and $M$ is the number of Gaussian mixture components.
5. The sound event recognition method based on improved parallel model combination of claim 1, characterized in that the parallel model fusion applied to the noise template and the clean sound event templates is as follows:
(1) the model parameters are mapped from the cepstral domain by the inverse discrete cosine transform, giving the log-spectral-domain mean $\mu^{log} = C^{-1}\mu$ and variance $\Sigma^{log} = C^{-1}\Sigma(C^{-1})^{T}$, where $C$ is the discrete cosine transform matrix and $\mu$, $\Sigma$ are respectively the cepstral-domain mean and variance of the model;
(2) the log-spectral mean and variance are transformed to the linear spectral domain by the exponential function: $\mu_i^{lin} = \exp(\mu_i^{log} + \Sigma_{ii}^{log}/2)$ is the $i$-th element of the linear-spectral mean vector and $\Sigma_{ij}^{lin} = \mu_i^{lin}\mu_j^{lin}[\exp(\Sigma_{ij}^{log}) - 1]$ is the element in row $i$, column $j$ of the linear-spectral covariance matrix, where $\mu_i^{log}$ and $\Sigma_{ij}^{log}$ are the corresponding elements of the log-spectral mean vector and covariance matrix;
(3) the improved parallel model combination fuses the clean sound event model parameters and the noise model parameters in the linear spectral domain: $\mu_{yk}^{lin} = g\,\mu_{xk}^{lin} + (1-g)\sum_{k=1}^{K} w_{nk}\,\mu_{nk}^{lin}$ is the mean and $\Sigma_{yk}^{lin} = g^{2}\,\Sigma_{xk}^{lin} + (1-g)^{2}\sum_{k=1}^{K} w_{nk}\,\Sigma_{nk}^{lin}$ is the variance of the fused noisy sound event model in the linear spectral domain, where $\mu_{xk}^{lin}$ and $\Sigma_{xk}^{lin}$ are the mean and variance of the clean sound event model and $\mu_{nk}^{lin}$ and $\Sigma_{nk}^{lin}$ are the mean and variance of the noise model after the transforms of steps (1) and (2);
(4) the linear-spectral mean and variance of the fused noisy sound event model are transformed back to the log-spectral domain by the inverse of step (2) and then to the cepstral-domain characteristic parameters by the inverse of step (1), giving the mean vector and variance of the noisy sound event model.
6. The sound event recognition method based on improved parallel model combination of claim 1, characterized in that the parameters of the noisy sound event model are $\lambda_y = \{w_{yk}, \mu_{yk}, \Sigma_{yk}\},\ k = 1, 2, \dots, M$, where $w_{yk}$, $\mu_{yk}$, and $\Sigma_{yk}$ are respectively its mixture weight, mean, and variance; since the mixture weights do not differ between the linear spectral, log-spectral, and cepstral domains, the mixture weight $w_{yk}$ of the noisy sound event model is the weight $w_{xk}$ of the clean sound event template, and $M$ is the number of Gaussian mixture components.
7. The sound event recognition method based on improved parallel model combination of any one of claims 1-6, characterized in that the noise data use the babble noise of NoiseX-92 and/or air-conditioning noise in an indoor environment.
8. The sound event recognition method based on improved parallel model combination of any one of claims 1-6, characterized in that the extracted features are Mel-frequency cepstral coefficients (MFCC).
9. The sound event recognition method based on improved parallel model combination of claim 1, characterized in that the method of recognizing the sound of the sample signal according to the parameters in the noisy sound event models is as follows:
1) the sample signal is pre-filtered and analog-to-digital converted, then framed and windowed, and multi-dimensional MFCC features are extracted to obtain the feature sequence of the sample signal;
2) the feature vector sequence of the sample signal is matched against the noisy sound event models, the matching likelihood is computed, and the template with the maximum likelihood is the recognition result.
CN201310239724.7A 2013-05-08 2013-06-17 A sound event recognition method based on improved parallel model combination Expired - Fee Related CN103310789B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310239724.7A CN103310789B (en) 2013-05-08 2013-06-17 A sound event recognition method based on improved parallel model combination

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201310166660.2 2013-05-08
CN201310166660 2013-05-08
CN2013101666602 2013-05-08
CN201310239724.7A CN103310789B (en) 2013-05-08 2013-06-17 A sound event recognition method based on improved parallel model combination

Publications (2)

Publication Number Publication Date
CN103310789A true CN103310789A (en) 2013-09-18
CN103310789B CN103310789B (en) 2016-04-06

Family

ID=49135932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310239724.7A Expired - Fee Related CN103310789B (en) 2013-05-08 2013-06-17 A sound event recognition method based on improved parallel model combination

Country Status (1)

Country Link
CN (1) CN103310789B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6876966B1 (en) * 2000-10-16 2005-04-05 Microsoft Corporation Pattern recognition training method and apparatus using inserted noise followed by noise reduction
CN1397929A (en) * 2002-07-12 2003-02-19 清华大学 Speech intensifying-characteristic weighing-logrithmic spectrum addition method for anti-noise speech recognization
US20040138882A1 (en) * 2002-10-31 2004-07-15 Seiko Epson Corporation Acoustic model creating method, speech recognition apparatus, and vehicle having the speech recognition apparatus
CN1819019A (en) * 2006-03-13 2006-08-16 华南理工大学 Phonetic identifier based on matrix characteristic vector function and identification thereof
CN102426837A (en) * 2011-12-30 2012-04-25 中国农业科学院农业信息研究所 Robustness method used for voice recognition on mobile equipment during agricultural field data acquisition

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104485108A (en) * 2014-11-26 2015-04-01 河海大学 Noise and speaker combined compensation method based on multi-speaker model
CN105657338A (en) * 2014-12-02 2016-06-08 深圳大学 Internet based remote mobile terminal control system and control method
CN104408440B (en) * 2014-12-10 2017-10-17 重庆邮电大学 A kind of facial expression recognizing method merged based on two step dimensionality reductions and Concurrent Feature
CN104408440A (en) * 2014-12-10 2015-03-11 重庆邮电大学 Identification method for human facial expression based on two-step dimensionality reduction and parallel feature fusion
CN105118516A (en) * 2015-09-29 2015-12-02 浙江图维电力科技有限公司 Identification method of engineering machinery based on sound linear prediction cepstrum coefficients (LPCC)
CN105405447A (en) * 2015-10-27 2016-03-16 航宇救生装备有限公司 Telephone transmitter respiration noise shielding method
CN105405447B (en) * 2015-10-27 2019-05-24 航宇救生装备有限公司 One kind sending words respiratory noise screen method
CN107492153A (en) * 2016-06-07 2017-12-19 腾讯科技(深圳)有限公司 Attendance checking system, method, work attendance server and attendance record terminal
CN106340292A (en) * 2016-09-08 2017-01-18 河海大学 Voice enhancement method based on continuous noise estimation
CN106340292B (en) * 2016-09-08 2019-08-20 河海大学 A kind of sound enhancement method based on continuing noise estimation
CN108922518A (en) * 2018-07-18 2018-11-30 苏州思必驰信息科技有限公司 voice data amplification method and system
WO2020029332A1 (en) * 2018-08-09 2020-02-13 厦门亿联网络技术股份有限公司 Rnn-based noise reduction method and device for real-time conference
CN109631104A (en) * 2018-11-01 2019-04-16 广东万和热能科技有限公司 Air quantity Automatic adjustment method, device, equipment and the storage medium of kitchen ventilator
CN109472311A (en) * 2018-11-13 2019-03-15 北京物灵智能科技有限公司 A kind of user behavior recognition method and device
CN110120230A (en) * 2019-01-08 2019-08-13 国家计算机网络与信息安全管理中心 A kind of acoustic events detection method and device
CN110120230B (en) * 2019-01-08 2021-06-01 国家计算机网络与信息安全管理中心 Acoustic event detection method and device
CN110544469A (en) * 2019-09-04 2019-12-06 秒针信息技术有限公司 Training method and device of voice recognition model, storage medium and electronic device
CN110544469B (en) * 2019-09-04 2022-04-19 秒针信息技术有限公司 Training method and device of voice recognition model, storage medium and electronic device
CN110838306A (en) * 2019-11-12 2020-02-25 广州视源电子科技股份有限公司 Voice signal detection method, computer storage medium and related equipment
CN110838306B (en) * 2019-11-12 2022-05-13 广州视源电子科技股份有限公司 Voice signal detection method, computer storage medium and related equipment
CN113112681A (en) * 2020-01-13 2021-07-13 阿里健康信息技术有限公司 Vending equipment, and shipment detection method and device
CN111028841A (en) * 2020-03-10 2020-04-17 深圳市友杰智新科技有限公司 Method and device for awakening system to adjust parameters, computer equipment and storage medium
CN111711881A (en) * 2020-06-29 2020-09-25 深圳市科奈信科技有限公司 Self-adaptive volume adjustment method according to environmental sound and wireless earphone
CN111711881B (en) * 2020-06-29 2022-02-18 深圳市科奈信科技有限公司 Self-adaptive volume adjustment method according to environmental sound and wireless earphone
CN112820318A (en) * 2020-12-31 2021-05-18 西安合谱声学科技有限公司 Impact sound model establishment and impact sound detection method and system based on GMM-UBM

Also Published As

Publication number Publication date
CN103310789B (en) 2016-04-06

Similar Documents

Publication Publication Date Title
CN103310789A (en) Sound event recognition method based on optimized parallel model combination
Rabaoui et al. Using one-class SVMs and wavelets for audio surveillance
Chachada et al. Environmental sound recognition: A survey
Cowling et al. Comparison of techniques for environmental sound recognition
Wang et al. Robust environmental sound recognition for home automation
CN101136199B (en) Voice data processing method and equipment
Stowell et al. Birdsong and C4DM: A survey of UK birdsong and machine recognition for music researchers
Souli et al. Audio sounds classification using scattering features and support vectors machines for medical surveillance
CN104795064A (en) Recognition method for sound event under scene of low signal to noise ratio
CN106024010A (en) Speech signal dynamic characteristic extraction method based on formant curves
Todkar et al. Speaker recognition techniques: A review
Nishida et al. Unsupervised speaker indexing using speaker model selection based on Bayesian information criterion
Maganti et al. Unsupervised speech/non-speech detection for automatic speech recognition in meeting rooms
Gupta et al. Automatic speech recognition technique for voice command
Nyodu et al. Automatic identification of Arunachal language using K-nearest neighbor algorithm
Ravindran et al. Improving the noise-robustness of mel-frequency cepstral coefficients for speech processing
Biagetti et al. Robust speaker identification in a meeting with short audio segments
Venkatesan et al. Deep recurrent neural networks based binaural speech segregation for the selection of closest target of interest
Camarena-Ibarrola et al. Speaker identification through spectral entropy analysis
Suzuki et al. MFCC enhancement using joint corrupted and noise feature space for highly non-stationary noise environments
Yue et al. Speaker age recognition based on isolated words by using SVM
Therese et al. A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system
Fang et al. A generalized denoising method with an optimized loss function for automated bird sound recognition
Jun A speaker recognition system based on MFCC and SCHMM
Ai et al. Application of hierarchical clustering analysis for vocal feature extraction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160406

Termination date: 20170617