CN105512108A

CN105512108A - English pun recognition method based on likelihood ratio estimation

Info

Publication number: CN105512108A
Application number: CN201510918577.5A
Authority: CN
Inventors: 邹航; 王月芳; 孔令璇; 李�瑞; 刘树英; 戴继生
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2015-12-11
Filing date: 2015-12-11
Publication date: 2016-04-20
Anticipated expiration: 2035-12-11
Also published as: CN105512108B

Abstract

The invention discloses an English pun recognition method based on likelihood ratio estimation. The method comprises the following steps: step 1, software reads the English sentence to be recognized; step 2, the pun word and all notional words of the sentence in step 1 are extracted and denoted h and w_m (m = 1, 2, ..., M), and the two meanings of the pun word h are denoted I_1 and I_2; step 3, for each notional word w_m (m = 1, 2, ..., M), the degree of correlation between w_m and the pun meaning I_i (i = 1 or 2) is denoted R(w_m, I_i), and the values of R(w_m, I_i) are collected in advance by questionnaire; step 4, the R(w_m, I_i) obtained in step 3 are used to construct a likelihood ratio λ(I); step 5, whether the sentence contains a pun meaning is judged from the computed λ(I): when λ(I) is close to 0, the sentence contains a pun meaning, and otherwise it does not. The invention proposes a probabilistic calculation method that quantifies the ambiguity of a sentence and recognizes puns, overcoming the inability of conventional methods to analyze pun meanings accurately and quantitatively.

Description

An English pun recognition method based on likelihood ratio estimation

Technical field

The invention belongs to the field of natural language processing and relates to the recognition of English puns, specifically to an English pun recognition method based on likelihood ratio estimation.

Background technology

In recent years, the rise of computational linguistics has injected new vitality into linguistic research and has opened a new approach to the study of puns. Computational linguistics typically uses probabilistic methods, with computer technology as the means, to obtain useful statistical information from large-scale real text. Domestic scholars have so far produced few results that apply statistical methods to puns. One example is Zhao Huijun, "A quantitative model for the pragmatic translation of puns", Foreign Language Research 135 (5) (2012) 72-76, which proposes a fairly simple quantitative model for the pragmatic translation of puns; it is a useful attempt by domestic scholars to apply computational linguistics and artificial intelligence techniques to pun translation.

Most foreign scholars treat the quantitative analysis of word incongruity as the focus of pun recognition research. However, the academic community does not yet have a rigorous and accurate standard for measuring word incongruity, and this uncertainty creates many difficulties for pun recognition and analysis. The current mainstream pun recognition methods fail to provide a general computational cognitive theory for pun analysis and recognition. Pun recognition research based on computational linguistics is still in its early stages; its feature extraction, computational model design, and theoretical analysis methods all need further improvement and development.

Summary of the invention

To solve the above problems, the present invention applies a probabilistic model suited to analyzing the incongruity of pun words and proposes an English pun recognition method based on likelihood ratio estimation; the method can recognize English puns quickly and automatically. The technical scheme adopted is as follows:

An English pun recognition method based on likelihood ratio estimation comprises the following steps:

Step 1: read the English sentence to be recognized by software;

Step 2: extract the pun word and all notional words of the sentence in step 1, denoted h and w_m, m = 1, 2, ..., M, respectively, where the two meanings of the pun word h are denoted I_1 and I_2;

Step 3: for each notional word w_m, m = 1, 2, ..., M, and each pun meaning I_i, i = 1, 2, determine the degree of correlation between them, whose value is denoted R(w_m, I_i);

Step 4: use the R(w_m, I_i) obtained in step 3 to construct the likelihood ratio λ(I);

Step 5: judge from the computed λ(I) whether the sentence has a pun meaning.

In a preferred technical scheme, the software in step 1 is implemented in Matlab or Visual C++.

In a preferred technical scheme, step 2 further comprises: manually building a corpus that contains the part of speech of each word and the pun words, storing it in a computer, and extracting the pun word and all notional words by having the computer query the corpus.
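The corpus lookup described for step 2 can be sketched roughly as follows. This is a minimal Python illustration, not the patent's implementation (the source names Matlab or Visual C++). The corpus layout (`POS_CORPUS`, `PUN_CORPUS`), the tag names, the choice to treat auxiliary verbs as non-notional, and the function `extract_pun_and_notional_words` are all assumptions made for the example, and the tiny corpus covers only example sentence (a).

```python
import re

# Hypothetical hand-built corpus: word -> part-of-speech tag.
POS_CORPUS = {
    "britain": "noun", "is": "aux_verb", "a": "article", "wet": "adjective",
    "place": "noun", "since": "conjunction", "the": "article",
    "queen": "noun", "has": "aux_verb", "had": "aux_verb",
    "long": "adjective", "reign": "noun",
}

# Hypothetical pun-word table: pun word h -> its two meanings (I_1, I_2).
PUN_CORPUS = {"reign": ("reign", "rain")}

# Assumed set of tags that count as notional words in this sketch.
NOTIONAL_TAGS = {"noun", "verb", "adjective"}


def extract_pun_and_notional_words(sentence):
    """Return (pun word h, (I_1, I_2), [w_1, ..., w_M]) found by corpus lookup."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    # The first token listed in the pun table is taken as the pun word.
    pun_word = next((t for t in tokens if t in PUN_CORPUS), None)
    # Notional words: tokens with a notional tag, excluding the pun word itself.
    notional = [
        t for t in tokens
        if t != pun_word and POS_CORPUS.get(t) in NOTIONAL_TAGS
    ]
    return pun_word, PUN_CORPUS.get(pun_word), notional


h, meanings, words = extract_pun_and_notional_words(
    "Britain is a wet place since the queen has had a long reign."
)
```

A dictionary lookup keeps the sketch close to the "query the corpus" wording; a real system would need a much larger corpus or a part-of-speech tagger.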

In a preferred technical scheme, the value of the degree of correlation R(w_m, I_i) in step 3 is obtained in advance by questionnaire survey statistics.

In a preferred technical scheme, the value of the degree of correlation R(w_m, I_i) lies between 0 and 10.

In a preferred technical scheme, the likelihood ratio λ(I) in step 4 is computed as:

\lambda(I) = \log \frac{P(w_1, \ldots, w_M \mid I_1)}{P(w_1, \ldots, w_M \mid I_2)} = \sum_{m=1}^{M} R(w_m, I_1) - \sum_{m=1}^{M} R(w_m, I_2);

In the formula, P(· | ·) denotes the conditional probability function and log(·) denotes the natural logarithm.
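The right-hand side of the formula, a difference of two sums of correlation scores, can be computed as in the sketch below. The function name `likelihood_ratio` and the score values are hypothetical; the scores are placeholders, not survey results from the source.

```python
def likelihood_ratio(scores):
    """Compute lambda(I) as sum_m R(w_m, I_1) - sum_m R(w_m, I_2).

    scores: list of (R(w_m, I_1), R(w_m, I_2)) pairs, one per notional word.
    """
    support_i1 = sum(r1 for r1, _ in scores)
    support_i2 = sum(r2 for _, r2 in scores)
    return support_i1 - support_i2


# Hypothetical scores for two notional words (placeholders only).
example_scores = [(2.0, 8.0), (7.5, 1.5)]
lam = likelihood_ratio(example_scores)
```

With these placeholder scores both meanings receive the same total support (9.5 each), so `lam` is 0.0.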

In a preferred technical scheme, the judgment in step 5 of whether the sentence has a pun meaning is made as follows: when |λ(I)| < 1, the sentence is judged to have a pun meaning; otherwise, the sentence is judged not to have a pun meaning.
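The decision rule is a threshold test on the absolute value of λ(I). A minimal sketch, with the hypothetical function name `has_pun_meaning` and the threshold exposed as a parameter (the source uses 1):

```python
def has_pun_meaning(lam, threshold=1.0):
    """Return True when |lambda(I)| < threshold, i.e. both meanings are
    supported about equally by the notional words."""
    return abs(lam) < threshold


# 0.04 is the value the source reports for example sentence (a).
verdict_a = has_pun_meaning(0.04)
```

The comparison is strict, so a value exactly equal to the threshold is classified as not having a pun meaning.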

Beneficial effect of the present invention:

Based on likelihood ratio estimation theory, the present invention provides a probabilistic calculation method that can accurately quantify the ambiguity of a sentence and recognize puns, overcoming the inability of conventional methods to quantify pun meanings accurately.

Description of the drawings

Fig. 1 is a flowchart of the English pun recognition method proposed by the present invention;

Fig. 2 shows the statistical values of the degree of correlation between each notional word in example sentence (a) and the pun meanings (I_1 = reign and I_2 = rain);

Fig. 3 shows the statistical values of the degree of correlation between each notional word in example sentence (b) and the pun meanings (I_1 = reign and I_2 = rain);

Fig. 4 shows the statistical values of the degree of correlation between each notional word in example sentence (c) and the pun meanings (I_1 = reign and I_2 = rain);

Fig. 5 shows the statistical values of the degree of correlation between each notional word in example sentence (d) and the pun meanings (I_1 = reign and I_2 = rain).

Embodiment

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a flowchart of the English pun recognition method proposed by the present invention, which comprises the following steps:

Step 1: read the English sentence to be recognized by software. The sentence consists of one pun word and M notional words; the pun word, prepositions, adverbs, and articles are not counted in the total M.

Step 2: manually build a corpus that contains the part of speech of each word and the pun words, and store it in a computer; by querying the corpus, the computer extracts the pun word and all notional words of the sentence in step 1, denoted h and w_m, m = 1, 2, ..., M, respectively, where the two meanings of the pun word h are denoted I_1 and I_2;

Step 3: determine the degree of correlation R(w_m, I_i) between each notional word w_m, m = 1, 2, ..., M, and each pun meaning I_i, i = 1, 2. R(w_m, I_i) can be collected in advance by questionnaire survey; the number of respondents should generally exceed 50. The survey requires each respondent to judge independently the semantic relatedness between w_m and I_i and to give a corresponding score from 0 to 10, where 0 means completely unrelated and 10 means extremely related. The estimate of each R(w_m, I_i) is the mean of the relatedness scores obtained for w_m and I_i in the survey.
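The averaging of questionnaire scores described here can be sketched as follows. The function name `estimate_relevance`, the dictionary layout, and the sample scores are hypothetical; three respondents are used only to keep the example short, whereas the text recommends more than 50.

```python
from statistics import mean


def estimate_relevance(responses, min_respondents=1):
    """Estimate each R(w_m, I_i) as the mean of the respondents' scores.

    responses: dict mapping (word, meaning) -> list of scores in [0, 10],
               one score per respondent.
    min_respondents: smallest acceptable number of respondents per pair
                     (the text suggests using more than 50 in practice).
    """
    estimates = {}
    for pair, scores in responses.items():
        if len(scores) < min_respondents:
            raise ValueError(f"too few respondents for {pair}")
        if any(not 0 <= s <= 10 for s in scores):
            raise ValueError(f"score outside the 0-10 range for {pair}")
        estimates[pair] = mean(scores)
    return estimates


# Placeholder scores from three imaginary respondents.
sample = {
    ("queen", "reign"): [9, 10, 8],
    ("queen", "rain"): [1, 0, 2],
}
R = estimate_relevance(sample)
```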

Step 4: use the R(w_m, I_i) obtained in step 3 to construct the likelihood ratio λ(I):

\lambda(I) = \log \frac{P(w_1, \ldots, w_M \mid I_1)}{P(w_1, \ldots, w_M \mid I_2)} = \sum_{m=1}^{M} R(w_m, I_1) - \sum_{m=1}^{M} R(w_m, I_2) \qquad (1)

In formula (1), P(· | ·) denotes the conditional probability function and log(·) denotes the natural logarithm.
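One possible reading of formula (1), offered as an interpretation rather than something the source states: read the left-hand side as a log-likelihood ratio, assume the notional words are conditionally independent given the meaning, and assume each questionnaire score acts as a log-likelihood up to a word-specific constant, R(w_m, I_i) = \log P(w_m \mid I_i) + c_m, where c_m does not depend on i. Under those assumptions the constants cancel and the difference of sums follows:

```latex
\begin{aligned}
\lambda(I)
  &= \log \frac{P(w_1, \ldots, w_M \mid I_1)}{P(w_1, \ldots, w_M \mid I_2)}
   = \log \frac{\prod_{m=1}^{M} P(w_m \mid I_1)}{\prod_{m=1}^{M} P(w_m \mid I_2)} \\
  &= \sum_{m=1}^{M} \bigl[ \log P(w_m \mid I_1) - \log P(w_m \mid I_2) \bigr] \\
  &= \sum_{m=1}^{M} \bigl[ (R(w_m, I_1) - c_m) - (R(w_m, I_2) - c_m) \bigr] \\
  &= \sum_{m=1}^{M} R(w_m, I_1) - \sum_{m=1}^{M} R(w_m, I_2).
\end{aligned}
```

On this reading, λ(I) near zero means the sentence is about equally likely under both meanings, which matches the decision rule in step 5.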

Step 5: judge from the computed λ(I) whether the sentence has a pun meaning. If λ(I) is close to zero (for example, |λ(I)| < 1), the sentence is judged to have a pun meaning; otherwise, the sentence is judged not to have a pun meaning.

The effect of the present invention is further illustrated below with an example.

To assess the performance of the proposed method, it is used to perform pun recognition on the 4 English example sentences in Table 1.

Table 1 English example sentences

No.	Example sentence
(a)	Britain is a wet place since the queen has had a long reign.
(b)	Britain is a wet place since the autumn has had a long reign.
(c)	The king's reign ended and his heir took over.
(d)	Rain fell on the city last night.

As Table 1 shows, reign and rain in sentence (a) form a homophonic pun. Sentence (b) replaces the word queen in sentence (a) with the word autumn, which destroys the pun context; such sentences are generally called "de-punned". Sentences (c) and (d) are both non-puns: the meaning clearly expressed by sentence (c) is I_1 = reign, and the meaning clearly expressed by sentence (d) is I_2 = rain. The sentences above contain 13 notional words in total; for convenience, they are numbered as follows: w_1 = Britain, w_2 = wet, w_3 = place, w_4 = queen, w_5 = long, w_6 = autumn, w_7 = king, w_8 = end, w_9 = heir, w_10 = take over, w_11 = fall, w_12 = city, w_13 = last night. The questionnaire asks each respondent to judge independently the semantic relatedness between each pair of phrases in Table 2 and to give a relatedness score from 0 to 10, where 0 means completely unrelated and 10 means extremely related.
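The numbering of the 13 notional words can be held in a small lookup, as sketched below. The assignment of words to sentences is inferred here from Table 1 and the word list and is not spelled out in the source; likewise, pairing every word with both meanings to form the questionnaire items is an assumption about the contents of Table 2. The names `NOTIONAL_WORDS`, `SENTENCE_WORDS`, `notional_words_of`, and `questionnaire_pairs` are hypothetical.

```python
# Index m -> notional word w_m, as numbered in the text.
NOTIONAL_WORDS = {
    1: "Britain", 2: "wet", 3: "place", 4: "queen", 5: "long",
    6: "autumn", 7: "king", 8: "end", 9: "heir", 10: "take over",
    11: "fall", 12: "city", 13: "last night",
}

# Inferred mapping: example sentence -> indices of its notional words.
SENTENCE_WORDS = {
    "a": [1, 2, 3, 4, 5],
    "b": [1, 2, 3, 6, 5],
    "c": [7, 8, 9, 10],
    "d": [11, 12, 13],
}


def notional_words_of(sentence_id):
    """Return the notional words of one example sentence, in order."""
    return [NOTIONAL_WORDS[m] for m in SENTENCE_WORDS[sentence_id]]


def questionnaire_pairs(meanings=("reign", "rain")):
    """Pair every notional word with both pun meanings (assumed item list)."""
    return [(w, i) for w in NOTIONAL_WORDS.values() for i in meanings]


pairs = questionnaire_pairs()
```

Under this assumption the questionnaire would contain 26 word-meaning pairs, two for each of the 13 notional words.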

Table 2 Questionnaire content

Figs. 2 to 5 show the statistics of the degree of correlation between each notional word of the example sentences in Table 1 and the pun meanings (I_1 = reign and I_2 = rain). As can be seen from Figs. 2 to 5, the notional words of sentence (a) lend differing support to I_1 = reign and I_2 = rain; the notional words of sentences (b) and (d) mainly support I_2 = rain; and the notional words of sentence (c) mainly support I_1 = reign. This shows that R(w_m, I_i) measures the degree of correlation between notional words and pun meanings fairly accurately.

The likelihood ratio λ(I) of each example sentence is calculated from formula (1) and the correlation values between the notional words and the pun meanings in Figs. 2 to 5; for sentence (a), for example, λ(I) is obtained by substituting the values in Fig. 2 into formula (1). The results are shown in Table 3. As Table 3 shows, sentence (a) has λ(I) = 0.04, a value close to zero whose absolute value is less than 1, so sentence (a) is judged to have a pun meaning; the λ(I) values of the other sentences are not close to zero, so those sentences are judged not to have a pun meaning. These recognition results agree with the actual situation, which confirms the effectiveness of the present invention.

Table 3 Calculated likelihood ratio λ(I) for each example sentence

The above merely describes the technical scheme and specific embodiments of the present invention and is not intended to limit its scope of protection. It should be understood that any change, improvement, or equivalent replacement made without departing from the substance and spirit of the present invention falls within the scope of protection of the present invention.

Claims

1. An English pun recognition method based on likelihood ratio estimation, characterized in that it comprises the following steps:

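Step 1: read the English sentence to be recognized by software;

Step 2: extract the pun word and all notional words of the sentence in step 1, denoted h and w_m, m = 1, 2, ..., M, respectively, where the two meanings of the pun word h are denoted I_1 and I_2;

Step 3: for each notional word w_m, m = 1, 2, ..., M, and each pun meaning I_i, i = 1, 2, determine the degree of correlation between them, whose value is denoted R(w_m, I_i);

Step 4: use the R(w_m, I_i) obtained in step 3 to construct the likelihood ratio λ(I);

Step 5: judge from the computed λ(I) whether the sentence has a pun meaning.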

2. The English pun recognition method based on likelihood ratio estimation according to claim 1, characterized in that the software in step 1 is implemented in Matlab or Visual C++.

3. The English pun recognition method based on likelihood ratio estimation according to claim 1, characterized in that step 2 further comprises: manually building a corpus that contains the part of speech of each word and the pun words, storing it in a computer, and extracting the pun word and all notional words by having the computer query the corpus.

4. The English pun recognition method based on likelihood ratio estimation according to claim 1, characterized in that the value of the degree of correlation R(w_m, I_i) in step 3 is obtained in advance by questionnaire survey statistics.

5. The English pun recognition method based on likelihood ratio estimation according to claim 4, characterized in that the value of the degree of correlation R(w_m, I_i) lies between 0 and 10.

6. The English pun recognition method based on likelihood ratio estimation according to claim 1, characterized in that the likelihood ratio λ(I) in step 4 is computed as:

\lambda(I) = \log \frac{P(w_1, \ldots, w_M \mid I_1)}{P(w_1, \ldots, w_M \mid I_2)} = \sum_{m=1}^{M} R(w_m, I_1) - \sum_{m=1}^{M} R(w_m, I_2);

7. The English pun recognition method based on likelihood ratio estimation according to claim 1, characterized in that the judgment in step 5 of whether the sentence has a pun meaning is made as follows: when |λ(I)| < 1, the sentence is judged to have a pun meaning; otherwise, the sentence is judged not to have a pun meaning.