CN103310789B - Sound event recognition method based on improved parallel model combination - Google Patents

Sound event recognition method based on improved parallel model combination

Info

Publication number
CN103310789B
CN103310789B (Application CN201310239724.7A)
Authority
CN
China
Prior art keywords
sound event
model
noise
template
spectral domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310239724.7A
Other languages
Chinese (zh)
Other versions
CN103310789A (en)
Inventor
刘宏
王一
李晓飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN201310239724.7A priority Critical patent/CN103310789B/en
Publication of CN103310789A publication Critical patent/CN103310789A/en
Application granted granted Critical
Publication of CN103310789B publication Critical patent/CN103310789B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The present invention relates to a sound event recognition method based on improved parallel model combination (PMC). Its steps comprise: 1) recording sound event data, training a GMM (Gaussian mixture model) on clean sound events, and establishing clean sound event templates; 2) acquiring noise data in the current real indoor noisy environment, training a GMM on the noise data, and establishing a noise template; 3) fusing the noise template and the clean sound event templates with the improved parallel model combination method to obtain noisy sound event templates; 4) sampling a noisy sound event signal and recognizing it according to the parameters of the noisy sound event templates. The invention uses a GMM that better describes the distribution of background noise features as one input to the PMC method, and the clean GMMs of 5 kinds of sound events as the other input, thereby ensuring the robustness of the recognition system to noise.

Description

Sound event recognition method based on improved parallel model combination
Technical field
The invention belongs to the field of audio signal processing in intelligent monitoring, relates to sound event recognition in indoor environments, and specifically relates to a sound event recognition method based on improved parallel model combination.
Background technology
Compared with the relatively mature speech recognition methods in artificial intelligence, using computers to recognize sound events is a comparatively recent research direction. Sound event recognition automatically judges and classifies sounds occurring in the physical environment that carry a certain meaning or reflect human behaviour. In a home intelligent monitoring system, sound event recognition can help people remotely monitor what is happening in the home environment and inform the user in time of what kind of event has occurred, which helps the user respond promptly. However, real environments contain complex noise; to achieve effective monitoring in real environments, handling noise is both necessary and urgent.
First, sound event recognition is a pattern recognition problem similar to automatic speech recognition; its basic methods are signal processing and pattern recognition. Existing sound event recognition methods comprise the following steps:
(1) Recording, pre-filtering and analog-to-digital conversion of the sound event signal. The recorded analog sound signal is first pre-filtered: high-pass filtering suppresses the 50 Hz mains noise, and low-pass filtering removes the frequency components above half the sampling frequency to prevent aliasing. The analog sound signal is then sampled and quantized to obtain a digital signal.
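For illustration only, a minimal pre-filtering sketch in Python follows. The filter family (Butterworth), order and exact cutoff frequencies are assumptions made for the example; the patent only specifies suppressing the 50 Hz mains component and removing components above half the sampling frequency.

import numpy as np
from scipy.signal import butter, filtfilt

def pre_filter(signal, fs=11025):
    # High-pass filtering to attenuate the 50 Hz mains component
    # (cutoff placed slightly above 50 Hz; the exact value is an assumption).
    b_hp, a_hp = butter(4, 60.0, btype="highpass", fs=fs)
    x = filtfilt(b_hp, a_hp, signal)
    # Low-pass filtering to remove components above half the sampling
    # frequency and so prevent aliasing.
    b_lp, a_lp = butter(4, 0.45 * fs, btype="lowpass", fs=fs)
    return filtfilt(b_lp, a_lp, x)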
(2) Framing and windowing. Like speech signals, sound signals are globally non-stationary but locally stationary in the short term. By analogy with speech, a sound signal can be regarded as stationary within 10-30 ms, so it can be divided into frames of about 30 ms. A window function is used to extract each frame; the choice of window function (shape and length) strongly influences the short-time analysis parameters. Common window functions include the rectangular, Hanning and Hamming windows; the Hamming window is generally chosen because it reflects the characteristic variations of the sound signal well.
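For illustration, a minimal framing-and-windowing sketch, assuming NumPy and the frame length and shift used later in the embodiment (256-sample frames, 128-sample shift, Hamming window):

import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    # Split a 1-D signal into overlapping frames and apply a Hamming window.
    # Assumes len(x) >= frame_len; trailing samples that do not fill a frame are dropped.
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return frames * window

At an 11025 Hz sampling rate a 256-sample frame is about 23 ms, which falls inside the 10-30 ms quasi-stationary range mentioned above.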
(3) Feature extraction. Different sound events have different features; to distinguish different sound signals, their features must be described mathematically. Commonly used features for sound event recognition include time-domain features (short-time energy, short-time zero-crossing rate), frequency-domain features (sub-band energy, wavelet time-frequency features) and cepstral-domain features (linear prediction cepstral coefficients, LPCC; mel-frequency cepstral coefficients, MFCC).
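For illustration, a short sketch of the two time-domain features named above (short-time energy and short-time zero-crossing rate), computed on windowed frames produced by a framing step such as the one sketched earlier; the exact normalization is our choice, not specified here:

import numpy as np

def short_time_energy(frames):
    # Per-frame energy: sum of squared samples in each frame.
    return np.sum(frames ** 2, axis=1)

def short_time_zero_crossing_rate(frames):
    # Per-frame fraction of adjacent sample pairs whose sign changes.
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

Cepstral features such as MFCC, which the rest of the method relies on, are typically computed with a mel filter bank followed by a logarithm and a discrete cosine transform, or taken from an existing speech-feature library.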
(4) Recognition. Sound event recognition adopts algorithms similar to those of speech recognition. Commonly used methods include support vector machine (SVM) classification, Gaussian mixture model (GMM) clustering, hidden Markov model (HMM) methods and Bayesian classification.
Second, noise handling. When the above recognition methods are applied in real environments, the performance of the recognition system deteriorates sharply with the mismatch between training data and test data, and the cause of this mismatch is precisely the influence of environmental noise. The training/testing mismatch caused by noise can be analysed in three spaces: the signal space, the feature space and the model space. Common countermeasures include sound enhancement similar to speech enhancement, robust feature extraction, feature compensation, and model compensation methods such as parallel model combination (PMC), all of which improve the robustness of the system.
Most existing methods still follow the toolkit of speech recognition, and noise handling is limited to the methods above. Among them, the PMC-based methods, which can describe the environmental noise and are widely adopted, fully exploit the information in the environment and improve the robustness of recognition. However, existing PMC methods describe the noise characteristics with a single Gaussian model (SGM); when the noise is relatively complex, an SGM cannot characterize the noise well, so the recognition rate is not ideal under complex noise conditions.
Summary of the invention
In order to solve the above technical problems, the object of the present invention is to provide a method that obtains, through improved model parameter fusion, noisy sound event models matching the noise environment, so that sound events to be recognized in a real noisy environment can be identified.
In order to achieve the above object, the technical solution of the present invention is a sound event recognition method based on improved parallel model combination, whose steps comprise:
1) training a GMM (Gaussian mixture model) on clean sound events and establishing clean sound event templates;
2) training a GMM on noise data and establishing a noise template;
3) fusing the noise template and the clean sound event templates with the parallel model combination method to obtain noisy sound event templates;
4) sampling a noisy sound event signal and recognizing it according to the parameters of the noisy sound event templates.
Further, the method for establishing the clean sound event templates is as follows:
1) recording sound event data in a quiet, noise-free indoor environment, pre-filtering and analog-to-digital converting the recorded sound events, then framing and windowing;
2) extracting MFCC (mel-frequency cepstral coefficient) features and training the GMM (Gaussian mixture) template of each sound event.
Further, the Gaussian mixture models are trained and their parameters updated with the EM algorithm, and the GMM parameters of a trained clean sound event are λ_x = {w_xk, μ_xk, Σ_xk}, k = 1, 2, ..., M, where w_xk is the mixture weight, μ_xk the mean and Σ_xk the variance of the clean sound event model, and M is the order of the Gaussian mixture.
Further, the noise data in the current environment is acquired under the real indoor noisy environment, and the noise template is established by extracting MFCC features and training the GMM template of the noise; the noise template GMM parameters are λ_n = {w_nk, μ_nk, Σ_nk}, k = 1, 2, ..., M, where w_nk is the mixture weight, μ_nk the mean and Σ_nk the variance of the noise model, and M is the order of the Gaussian mixture.
Further, the improved parallel model fusion applied to the noise template and the clean sound event templates is as follows:
(1) The inverse discrete cosine transform maps any model parameters from the cepstral domain to the log-spectral domain, giving the log-spectral mean μ^log = C⁻¹μ and variance Σ^log = C⁻¹Σ(C⁻¹)^T, where C is the discrete cosine transform matrix and μ, Σ are respectively the cepstral-domain mean and variance of the model;
(2) The log-spectral mean and variance of the log-spectral-domain model are transformed to the linear spectral domain through the exponential function: μ_i^lin = exp(μ_i^log + Σ_ii^log/2) is the i-th element of the linear-spectral mean vector and Σ_ij^lin = μ_i^lin μ_j^lin [exp(Σ_ij^log) − 1] is the element in row i, column j of the linear-spectral covariance matrix, where μ_i^log is the i-th element of the log-spectral mean vector and Σ_ij^log is the element in row i, column j of the log-spectral covariance matrix;
(3) The improved parallel model combination fuses the clean sound event model parameters and the noise model parameters in the linear spectral domain, giving the mean μ_yk^lin and variance Σ_yk^lin of the fused noisy sound event model in the linear spectral domain, where μ_xk^lin and Σ_xk^lin are the mean and variance of the clean sound event model in the linear spectral domain after the transformations of steps (1) and (2), and μ_nk^lin and Σ_nk^lin are the mean and variance of the noise model in the linear spectral domain after the transformations of steps (1) and (2);
(4) The mean and variance of the fused noisy sound event model in the linear spectral domain are transformed back to log-spectral parameters through the inverse of step (2), and then to cepstral-domain parameters through the inverse of step (1), yielding the mean vector and variance of the noisy sound event model.
Further, the parameters of the noisy sound event model are λ_y = {w_yk, μ_yk, Σ_yk}, k = 1, 2, ..., M, where w_yk, μ_yk and Σ_yk are respectively the mixture weight, mean and variance of the noisy template. Since the mixture weights do not differ between the linear spectral, log-spectral and cepstral domains, the mixture weight w_yk of the noisy sound event model equals the weight w_xk of the clean sound event template; M is the order of the Gaussian mixture.
Further, the method for recognizing a sample signal according to the parameters of the noisy sound event models is as follows:
1) pre-filtering and analog-to-digital converting the sample signal, framing and windowing, then extracting multi-dimensional MFCC features to obtain the sample feature sequence;
2) matching the feature vector sequence of the sample signal against the noisy sound event models, computing the matching likelihood, and taking the template with the maximum likelihood as the recognition result.
Further, the noise data adopts the babble noise of NoiseX-92 and/or air-conditioning noise in an indoor environment.
Technical effects of the present invention:
Under a complex noise background, the present invention establishes a background GMM that better describes the distribution of background noise features and uses it as one input to the PMC method, with the clean GMMs of 5 kinds of sound events as the other input. The improved model parameter fusion yields noisy sound event models matching the noise environment, giving good recognition results for the sound events to be recognized in a real noisy environment. The invention thus ensures the robustness of the recognition system to noise.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall recognition process of the sound event recognition method based on improved parallel model combination of the present invention.
Fig. 2 is a schematic diagram of the fusion method in an embodiment of the sound event recognition method based on improved parallel model combination of the present invention.
Fig. 3 is a schematic diagram of the recognition results for 5 kinds of sound events in an embodiment of the sound event recognition method based on improved parallel model combination of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative work fall within the protection scope of the present invention.
The present invention targets 5 kinds of sound events that frequently occur in indoor environments and need attention. In addition, it fully considers the presence of complex noise (air-conditioning noise recorded in an indoor environment and the babble noise of the public noise database NoiseX-92) and describes the environmental noise signal with a GMM (Gaussian mixture model; see "Speech Signal Processing", 2nd edition, by Zhao Li, China Machine Press, pp. 228-230). The weighted combination of multiple Gaussian distributions in a GMM describes the feature distribution of the background and can more fully capture the information of the background noise. At the model level, the clean sound event model parameters are compensated with the described background noise model parameters to obtain noisy sound event models, preventing the mismatch between training data and test data caused by noise.
The present invention is a sound event recognition method based on improved parallel model combination, specifically:
First, establish the templates of the clean sound events.
(1) In a quiet environment, record the data of the 5 kinds of sound events and apply the preprocessing (windowing, framing, etc.) according to the aforementioned sound signal processing steps.
(2) Then, as described above, extract the robust MFCC features and train the Gaussian mixture models of the 5 kinds of sound events. The Gaussian mixture models are trained with the EM algorithm, which updates the Gaussian model parameters. Suppose the GMM parameters of one of the trained clean sound events are:
λ_x = {w_xk, μ_xk, Σ_xk}, k = 1, 2, ..., M    (1)
Second, acquire the noise data in the current environment, extract MFCC features, and establish the GMM template of the noise. The noise template parameters are:
λ_n = {w_nk, μ_nk, Σ_nk}, k = 1, 2, ..., M    (2)
Third, perform model fusion. Since the data used to train the GMMs in the present invention are all MFCC features, which belong to the cepstral domain, while the background noise and sound event model parameters can only be added in the linear spectral domain, both models are processed as follows (λ = {w_k, μ_k, Σ_k}, k = 1, 2, ..., M, uniformly denotes the GMM of a clean sound event or of the background noise):
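For illustration, a small container for this unified parameter set λ = {w_k, μ_k, Σ_k}; the field names and array shapes are our own choices:

from dataclasses import dataclass
import numpy as np

@dataclass
class GMMParams:
    # Unified GMM parameter set λ = {w_k, μ_k, Σ_k}, k = 1..M.
    weights: np.ndarray  # shape (M,), mixture weights w_k
    means: np.ndarray    # shape (M, D), one mean vector μ_k per component
    covars: np.ndarray   # shape (M, D, D), one covariance matrix Σ_k per component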
1) The model parameters are mapped from the cepstral domain to the log-spectral domain, specifically by the inverse of the discrete cosine transform. Here the difference (delta) coefficients of the MFCC are not extracted. The computation follows formulas (3) and (4):
μ^log = C⁻¹ μ    (3)
Σ^log = C⁻¹ Σ (C⁻¹)^T    (4)
where μ^log and Σ^log are the mean and variance of the log-spectral-domain model, μ and Σ are the mean and variance of the cepstral-domain model, and C is the discrete cosine transform matrix.
2) The normally distributed random variables of the log-spectral domain are transformed to the linear spectral domain through the exponential function, as in formulas (5) and (6):
μ_i^lin = exp(μ_i^log + Σ_ii^log / 2)    (5)
Σ_ij^lin = μ_i^lin μ_j^lin [exp(Σ_ij^log) − 1]    (6)
where μ_i^lin and Σ_ij^lin are respectively the i-th element of the linear-spectral mean vector and the element in row i, column j of the linear-spectral covariance matrix, and μ_i^log and Σ_ij^log are respectively the i-th element of the log-spectral mean vector and the element in row i, column j of the log-spectral covariance matrix.
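The following Python sketch illustrates formulas (3)-(6), mapping a single cepstral-domain Gaussian component to the linear spectral domain via the log-spectral domain. The unnormalized DCT-II construction and the use of a pseudo-inverse for C⁻¹ (C is generally rectangular, e.g. 13 cepstra from 26 filter-bank channels) are assumptions made for the example; the patent does not specify the filter-bank size or the DCT normalization.

import numpy as np

def dct_matrix(n_ceps=13, n_filters=26):
    # DCT-II matrix C mapping log filter-bank energies to cepstral coefficients.
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))

def cepstral_to_linear(mu_c, cov_c, C):
    # Cepstral-domain Gaussian (μ, Σ) -> log-spectral domain -> linear spectral domain.
    C_inv = np.linalg.pinv(C)              # pseudo-inverse stands in for C⁻¹
    mu_log = C_inv @ mu_c                  # formula (3)
    cov_log = C_inv @ cov_c @ C_inv.T      # formula (4)
    mu_lin = np.exp(mu_log + np.diag(cov_log) / 2.0)              # formula (5)
    cov_lin = np.outer(mu_lin, mu_lin) * (np.exp(cov_log) - 1.0)  # formula (6)
    return mu_lin, cov_lin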
3) Suppose the linear-spectral mean vector and variance of the clean sound event model obtained from the formulas above are μ_xk^lin and Σ_xk^lin, and the linear-spectral mean vector and variance of the noise model are μ_nk^lin and Σ_nk^lin; the two models are fused with formulas (7) and (8):
μ_yk^lin = g μ_xk^lin + (1 − g) Σ_{k=1}^{K} w_nk μ_nk^lin    (7)
Σ_yk^lin = g² Σ_xk^lin + (1 − g)² Σ_{k=1}^{K} w_nk Σ_nk^lin    (8)
where μ_yk^lin and Σ_yk^lin represent the mean vector and variance of the fused noisy sound event model, and g represents a gain factor.
4) The fused linear-spectral-domain model parameters are transformed back through the inverse of formulas (5) and (6) to obtain the log-spectral parameters of the model, and then through the inverse of formulas (3) and (4) to obtain the cepstral-domain parameters of the model. Applying this process to all 5 kinds of sound event models finally yields the 5 fused noisy sound event model parameter sets.
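A sketch of the fusion of formulas (7) and (8) and of the inverse mapping back to the cepstral domain, under the same illustrative assumptions as the previous sketch; the function and argument names are ours. pmc_fuse compensates one clean-event Gaussian component with the weighted sum over all noise-GMM components, and linear_to_cepstral inverts formulas (5)-(6) and then (3)-(4).

import numpy as np

def pmc_fuse(mu_x_lin, cov_x_lin, noise_w, noise_mu_lin, noise_cov_lin, g=0.5):
    # Improved PMC fusion in the linear spectral domain, formulas (7) and (8).
    noise_mu = np.einsum("k,kd->d", noise_w, noise_mu_lin)       # Σ_k w_nk μ_nk^lin
    noise_cov = np.einsum("k,kij->ij", noise_w, noise_cov_lin)   # Σ_k w_nk Σ_nk^lin
    mu_y = g * mu_x_lin + (1.0 - g) * noise_mu                   # formula (7)
    cov_y = g ** 2 * cov_x_lin + (1.0 - g) ** 2 * noise_cov      # formula (8)
    return mu_y, cov_y

def linear_to_cepstral(mu_lin, cov_lin, C):
    # Inverse of formulas (5)-(6): linear spectral domain back to the log-spectral domain.
    cov_log = np.log(cov_lin / np.outer(mu_lin, mu_lin) + 1.0)
    mu_log = np.log(mu_lin) - np.diag(cov_log) / 2.0
    # Inverse of formulas (3)-(4): log-spectral domain back to the cepstral domain.
    return C @ mu_log, C @ cov_log @ C.T

Applying cepstral_to_linear, pmc_fuse and linear_to_cepstral to every component of every clean sound event GMM (with the noise GMM parameters mapped to the linear spectral domain in the same way) reproduces the per-component compensation described above.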
Fourth, for a noisy sound event signal sample recorded in the indoor noise environment, the purpose of recognition is to determine which of the 5 kinds of sound events the current sample belongs to, i.e. to compute the posterior probability of the sample under the 5 models and take the model with the maximum posterior probability as the class of the sample. According to Bayes' formula, since the prior probabilities of the 5 kinds of sound events are identical, for a given observation vector the maximum a posteriori computation is equivalent to computing the likelihood of the observation under the 5 compensated sound event models and taking the model with the maximum likelihood as the class of the sample.
Fig. 1 shows the overall recognition process of the sound event recognition method based on improved parallel model combination of the present invention, comprising a training part and a recognition part.
The present invention considers 5 kinds of sound events that often occur in indoor environments and need attention: the sound of a door closing, tapping, clapping, voice, and a birdie sound. The training processes for the 5 sound event templates and the noise template are as follows:
1. In a quiet environment, record a database of the 5 kinds of sound events and label it. There are 100 recordings per sound event type, produced respectively by 5 male and 5 female subjects through vocalization or the corresponding action. The noise adopts the babble noise of NoiseX-92 and air-conditioning noise in an indoor environment.
2. Pre-filtering: high-pass filtering suppresses the 50 Hz mains noise, and low-pass filtering removes the frequency components above half the sampling frequency. Analog-to-digital conversion: the sampling frequency is 11025 Hz and the sampling precision is 16 bits.
3. For each complete sound segment, framing and windowing: the frame length is 256 samples, the frame shift is 128 samples, and the window function is the Hamming window.
4. Feature extraction: extract 13-dimensional MFCC features.
5. Each kind of sound event uses 60 feature vector sequences and the noise uses 10 feature vector sequences; the GMM templates λ_xk, k = 1, 2, ..., 5, of the 5 kinds of sounds and the noise template λ_n are trained with the expectation-maximization (EM) algorithm. Each template is a Gaussian mixture model with 8 Gaussian components.
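As an illustration, a minimal training sketch using scikit-learn's EM-based GaussianMixture; the diagonal covariance type and the EM settings are assumptions made for the example (the patent only specifies 8 Gaussian components, 13-dimensional MFCC features and EM training):

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(feature_sequences, n_components=8):
    # Fit an 8-component GMM with EM on stacked 13-dimensional MFCC frame vectors.
    # feature_sequences: list of (n_frames, 13) arrays, one per recording.
    X = np.vstack(feature_sequences)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    return gmm.fit(X)

# One model per sound event plus one for the background noise, e.g.:
#   event_models = {name: train_gmm(seqs) for name, seqs in event_training_data.items()}
#   noise_model = train_gmm(noise_sequences)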
The model fusion process of the present invention is shown in Fig. 2, a schematic diagram of the fusion method in an embodiment of the sound event recognition method based on improved parallel model combination.
The concrete steps are as follows:
1. Using formulas (3), (4), (5) and (6), convert the background noise model parameters and the clean sound event model parameters from the cepstral domain to the linear spectral domain.
2. Using formulas (7) and (8), fuse the linear-spectral-domain parameters of each clean sound event with the linear-spectral-domain parameters of the noise; here g = 0.5.
3. Transform the fused linear-spectral-domain parameters of the noisy sound event models back through the inverse of formulas (5) and (6) and then the inverse of formulas (3) and (4), obtaining the 5 noisy sound event GMM models λ_yk, k = 1, 2, ..., 5.
The recognition process of the present invention is as follows:
1. Under the above two noise conditions, record 110 noisy sound event signals of the 5 kinds in total. Pre-filter them and perform analog-to-digital conversion; the sampling frequency is 11025 Hz and the sampling precision is 16 bits.
2. Framing and windowing: the frame length is 256 samples, the frame shift is 128 samples, and the window function is the Hamming window. Extract 13-dimensional MFCC features.
3. Template matching: the feature vector sequence of the current sound signal is matched against the 5 kinds of noisy sound event templates. The feature vector sequence is X_k, k = 1, ..., N, and the 5 templates are λ_yk, k = 1, 2, ..., 5. The matching likelihood is computed and the template with the maximum likelihood is the recognition result. Fig. 3 shows the recognition results for the 5 kinds of sound events in this embodiment of the sound event recognition method based on improved parallel model combination.
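For illustration, a matching sketch that assumes each noisy sound event template is available as an object with a scikit-learn-style score_samples method returning per-frame log-likelihoods; summing these over the frames of the sample gives the matching likelihood, and the template with the maximum value is taken as the recognition result:

def classify(sample_features, noisy_event_models):
    # sample_features: (n_frames, 13) array of MFCC vectors of the noisy sample.
    # noisy_event_models: dict mapping event name -> fitted/compensated GMM object.
    scores = {name: gmm.score_samples(sample_features).sum()
              for name, gmm in noisy_event_models.items()}
    best = max(scores, key=scores.get)
    return best, scores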
The above embodiment is an example of the present invention. Although an example of the present invention is disclosed for the purpose of illustration, those skilled in the art will appreciate that various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the present invention should not be limited to the content of this example.

Claims (7)

1. A sound event recognition method based on improved parallel model combination, whose steps comprise:
1) training a GMM (Gaussian mixture model) on clean sound events and establishing clean sound event templates;
2) training a GMM on noise data and establishing a noise template;
3) fusing the noise template and the clean sound event templates with the parallel model combination method to obtain noisy sound event templates, comprising the following sub-steps:
3-1) mapping any model parameters from the cepstral domain to the log-spectral domain with the inverse discrete cosine transform, obtaining the log-spectral mean μ^log = C⁻¹μ and variance Σ^log = C⁻¹Σ(C⁻¹)^T, where C is the discrete cosine transform matrix and μ, Σ are respectively the cepstral-domain mean and variance of the model;
3-2) transforming the log-spectral mean and variance of the log-spectral-domain model to the linear spectral domain through the exponential function, where μ_i^lin = exp(μ_i^log + Σ_ii^log/2) is the i-th element of the linear-spectral mean vector and Σ_ij^lin = μ_i^lin μ_j^lin [exp(Σ_ij^log) − 1] is the element in row i, column j of the linear-spectral covariance matrix, μ_i^log is the i-th element of the log-spectral mean vector, and Σ_ij^log is the element in row i, column j of the log-spectral covariance matrix;
3-3) fusing the clean sound event model parameters and the noise model parameters in the linear spectral domain with the improved parallel model combination method, obtaining the mean μ_yk^lin and variance Σ_yk^lin of the fused noisy sound event model in the linear spectral domain, where μ_xk^lin and Σ_xk^lin are the mean and variance of the clean sound event model in the linear spectral domain after the transformations of steps 3-1) and 3-2), and μ_nk^lin and Σ_nk^lin are the mean and variance of the noise model in the linear spectral domain after the transformations of steps 3-1) and 3-2);
3-4) transforming the mean and variance of the fused noisy sound event model in the linear spectral domain back through the inverse of step 3-2) to obtain the log-spectral parameters, and then through the inverse of step 3-1) to obtain the cepstral-domain model parameters, yielding the mean vector and variance of the noisy sound event model;
4) sampling a noisy sound event signal and recognizing it according to the parameters of the noisy sound event templates.
2. The sound event recognition method based on improved parallel model combination of claim 1, characterized in that the method for establishing the clean sound event templates is as follows:
1) recording sound event data in a quiet, noise-free indoor environment, pre-filtering and analog-to-digital converting the recorded sound events, then framing and windowing;
2) extracting MFCC (mel-frequency cepstral coefficient) features and training the GMM (Gaussian mixture) template of each sound event.
3. The sound event recognition method based on improved parallel model combination of claim 1, characterized in that the Gaussian mixture models are trained and their parameters updated with the EM algorithm, and the GMM parameters of a trained clean sound event are λ_x = {w_xk, μ_xk, Σ_xk}, k = 1, 2, ..., M, where w_xk is the mixture weight, μ_xk the mean and Σ_xk the variance of the clean sound event model, and M is the order of the Gaussian mixture.
4. The sound event recognition method based on improved parallel model combination of claim 1, characterized in that the noise data in the current environment is acquired under the real indoor noisy environment, and the noise template is established by extracting MFCC features and training the GMM template of the noise, the noise template GMM parameters being λ_n = {w_nk, μ_nk, Σ_nk}, k = 1, 2, ..., M, where w_nk is the mixture weight, μ_nk the mean and Σ_nk the variance of the noise model, and M is the order of the Gaussian mixture.
5. The sound event recognition method based on improved parallel model combination of claim 1, characterized in that the parameters of the noisy sound event model are λ_y = {w_yk, μ_yk, Σ_yk}, k = 1, 2, ..., M, where w_yk, μ_yk and Σ_yk are respectively the mixture weight, mean and variance of the noisy template; since the mixture weights do not differ across the linear spectral, log-spectral and cepstral domains, the mixture weight w_yk of the noisy sound event model equals the weight w_xk of the clean sound event template; M is the order of the Gaussian mixture.
6. The sound event recognition method based on improved parallel model combination of any one of claims 1-5, characterized in that the noise data adopts the babble noise of NoiseX-92 and/or air-conditioning noise in an indoor environment.
7. The sound event recognition method based on improved parallel model combination of claim 1, characterized in that the method for recognizing a sample signal according to the parameters of the noisy sound event models is as follows:
1) pre-filtering and analog-to-digital converting the sample signal, framing and windowing, then extracting multi-dimensional MFCC features to obtain the sample feature sequence;
2) matching the feature vector sequence of the sample signal against the noisy sound event models, computing the matching likelihood, and taking the template with the maximum likelihood as the recognition result.
CN201310239724.7A 2013-05-08 2013-06-17 Sound event recognition method based on improved parallel model combination Expired - Fee Related CN103310789B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310239724.7A CN103310789B (en) 2013-05-08 2013-06-17 Sound event recognition method based on improved parallel model combination

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201310166660 2013-05-08
CN2013101666602 2013-05-08
CN201310166660.2 2013-05-08
CN201310239724.7A CN103310789B (en) 2013-05-08 2013-06-17 Sound event recognition method based on improved parallel model combination

Publications (2)

Publication Number Publication Date
CN103310789A CN103310789A (en) 2013-09-18
CN103310789B true CN103310789B (en) 2016-04-06

Family

ID=49135932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310239724.7A Expired - Fee Related CN103310789B (en) Sound event recognition method based on improved parallel model combination

Country Status (1)

Country Link
CN (1) CN103310789B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104485108A (en) * 2014-11-26 2015-04-01 河海大学 Noise and speaker combined compensation method based on multi-speaker model
CN105657338A (en) * 2014-12-02 2016-06-08 深圳大学 Internet based remote mobile terminal control system and control method
CN104408440B (en) * 2014-12-10 2017-10-17 重庆邮电大学 A kind of facial expression recognizing method merged based on two step dimensionality reductions and Concurrent Feature
CN105118516A (en) * 2015-09-29 2015-12-02 浙江图维电力科技有限公司 Identification method of engineering machinery based on sound linear prediction cepstrum coefficients (LPCC)
CN105405447B (en) * 2015-10-27 2019-05-24 航宇救生装备有限公司 One kind sending words respiratory noise screen method
CN107492153B (en) * 2016-06-07 2020-04-07 腾讯科技(深圳)有限公司 Attendance system, method, attendance server and attendance terminal
CN106340292B (en) * 2016-09-08 2019-08-20 河海大学 A kind of sound enhancement method based on continuing noise estimation
CN108922518B (en) * 2018-07-18 2020-10-23 苏州思必驰信息科技有限公司 Voice data amplification method and system
CN109273021B (en) * 2018-08-09 2021-11-30 厦门亿联网络技术股份有限公司 RNN-based real-time conference noise reduction method and device
CN109631104A (en) * 2018-11-01 2019-04-16 广东万和热能科技有限公司 Air quantity Automatic adjustment method, device, equipment and the storage medium of kitchen ventilator
CN109472311A (en) * 2018-11-13 2019-03-15 北京物灵智能科技有限公司 A kind of user behavior recognition method and device
CN110120230B (en) * 2019-01-08 2021-06-01 国家计算机网络与信息安全管理中心 Acoustic event detection method and device
CN110544469B (en) * 2019-09-04 2022-04-19 秒针信息技术有限公司 Training method and device of voice recognition model, storage medium and electronic device
CN110838306B (en) * 2019-11-12 2022-05-13 广州视源电子科技股份有限公司 Voice signal detection method, computer storage medium and related equipment
CN113112681A (en) * 2020-01-13 2021-07-13 阿里健康信息技术有限公司 Vending equipment, and shipment detection method and device
CN111028841B (en) * 2020-03-10 2020-07-07 深圳市友杰智新科技有限公司 Method and device for awakening system to adjust parameters, computer equipment and storage medium
CN111711881B (en) * 2020-06-29 2022-02-18 深圳市科奈信科技有限公司 Self-adaptive volume adjustment method according to environmental sound and wireless earphone
CN112820318A (en) * 2020-12-31 2021-05-18 西安合谱声学科技有限公司 Impact sound model establishment and impact sound detection method and system based on GMM-UBM

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1397929A (en) * 2002-07-12 2003-02-19 清华大学 Speech intensifying-characteristic weighing-logrithmic spectrum addition method for anti-noise speech recognization
US6876966B1 (en) * 2000-10-16 2005-04-05 Microsoft Corporation Pattern recognition training method and apparatus using inserted noise followed by noise reduction
CN1819019A (en) * 2006-03-13 2006-08-16 华南理工大学 Phonetic identifier based on matrix characteristic vector function and identification thereof
CN102426837A (en) * 2011-12-30 2012-04-25 中国农业科学院农业信息研究所 Robustness method used for voice recognition on mobile equipment during agricultural field data acquisition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4352790B2 (en) * 2002-10-31 2009-10-28 セイコーエプソン株式会社 Acoustic model creation method, speech recognition device, and vehicle having speech recognition device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6876966B1 (en) * 2000-10-16 2005-04-05 Microsoft Corporation Pattern recognition training method and apparatus using inserted noise followed by noise reduction
CN1397929A (en) * 2002-07-12 2003-02-19 清华大学 Speech intensifying-characteristic weighing-logrithmic spectrum addition method for anti-noise speech recognization
CN1819019A (en) * 2006-03-13 2006-08-16 华南理工大学 Phonetic identifier based on matrix characteristic vector function and identification thereof
CN102426837A (en) * 2011-12-30 2012-04-25 中国农业科学院农业信息研究所 Robustness method used for voice recognition on mobile equipment during agricultural field data acquisition

Also Published As

Publication number Publication date
CN103310789A (en) 2013-09-18

Similar Documents

Publication Publication Date Title
CN103310789B (en) Sound event recognition method based on improved parallel model combination
CN103280220B (en) A kind of real-time recognition method for baby cry
CN103177722B (en) A kind of song retrieval method based on tone color similarity
WO2019019252A1 (en) Acoustic model training method, speech recognition method and apparatus, device and medium
CN102968990B (en) Speaker identifying method and system
CN102436809B (en) Network speech recognition method in English oral language machine examination system
CN104900229A (en) Method for extracting mixed characteristic parameters of voice signals
CN103345923A (en) Sparse representation based short-voice speaker recognition method
CN103065629A (en) Speech recognition system of humanoid robot
CN103117059A (en) Voice signal characteristics extracting method based on tensor decomposition
CN101923855A (en) Test-irrelevant voice print identifying system
CN104078039A (en) Voice recognition system of domestic service robot on basis of hidden Markov model
CN104900235A (en) Voiceprint recognition method based on pitch period mixed characteristic parameters
CN102789779A (en) Speech recognition system and recognition method thereof
CN109256144A (en) Sound enhancement method based on integrated study and noise perception training
CN109949823A (en) A kind of interior abnormal sound recognition methods based on DWPT-MFCC and GMM
CN104795064A (en) Recognition method for sound event under scene of low signal to noise ratio
CN106024010A (en) Speech signal dynamic characteristic extraction method based on formant curves
CN102592593B (en) Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech
CN109192200A (en) A kind of audio recognition method
CN104887263A (en) Identity recognition algorithm based on heart sound multi-dimension feature extraction and system thereof
CN111341319B (en) Audio scene identification method and system based on local texture features
CN105845149A (en) Predominant pitch acquisition method in acoustical signal and system thereof
CN103456302A (en) Emotion speaker recognition method based on emotion GMM model weight synthesis
Li et al. Multi-level attention model with deep scattering spectrum for acoustic scene classification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160406

Termination date: 20170617