CN103631817A

CN103631817A - Method and device for mining attribute-name paraphrases

Info

Publication number: CN103631817A
Application number: CN201210307150.8A
Authority: CN
Inventors: 赵世奇
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2012-08-24
Filing date: 2012-08-24
Publication date: 2014-03-12
Anticipated expiration: 2032-08-24
Also published as: CN103631817B

Abstract

The invention provides a method and device for mining attribute-name paraphrases. The method comprises: obtaining at least one of the resources Q-Q, Q-T and T-T from a search log as candidate sentence pairs, where Q-Q is a sentence pair formed by two queries searched by a user in one session, Q-T is a sentence pair formed by a query and a clicked web page title corresponding to that query, and T-T is a sentence pair formed by two clicked titles corresponding to the same query; extracting, from the candidate sentence pairs, phrase pairs that share the same context as candidate paraphrase phrase pairs; extracting, from the candidate paraphrase phrase pairs, those in which at least one phrase belongs to an attribute-name list; and performing noise filtering on the candidate paraphrase phrase pairs extracted in the third step to obtain attribute-name paraphrase phrase pairs. The method and device can obtain the expression forms of attribute names, so that users' flexible and diverse query expressions can be matched better.

Description

A method and apparatus for mining attribute-name paraphrases

[Technical Field]

The present invention relates to the field of computer application technology, and in particular to a method and apparatus for mining attribute-name paraphrases.

[Background]

In the field of network information, triple data can be expressed as (e, a, v), where e is an entity name (entity), a is an attribute name (attribute), and v is an attribute value (value); for example, (Yao Ming, height, 2.26 meters) is a triple. Triple data is applied in many areas, especially in search engines: triple data is stored in a structured database to provide a data source for vertical search, so that when a user searches for an attribute of an entity, the search engine can directly return the corresponding attribute value. For example, when a user searches for "what is Yao Ming's height", the exact answer "2.26 meters" can be returned directly.

However, during actual searches, the wording a user adopts may differ from the wording in the structured database, and the difference is especially apparent in attribute names. In the above example, the user may search for "Yao Ming's height", "how tall is Yao Ming", "how tall Yao Ming is", and so on. Although all these queries are intended to obtain Yao Ming's height, the attribute name is worded differently in each, so they may fail to hit the content in the structured database. It is therefore necessary to mine paraphrases for the attribute names in the structured database, that is, to mine the expression forms that each attribute name has, so as to better match users' flexible and diverse query expressions.

[Summary of the Invention]

In view of this, the present invention provides a method and apparatus for mining attribute-name paraphrases, in order to mine the expression forms that attribute names have and thereby better match users' flexible and diverse query expressions.

The specific technical solution is as follows:

A method for mining attribute-name paraphrases, the method comprising the following steps:

S1. Obtaining at least one of the resources Q-Q, Q-T and T-T from a search log as candidate sentence pairs, where Q-Q is a sentence pair formed by two queries that a user searched in one session, Q-T is a sentence pair formed by a query and a corresponding clicked web page title, and T-T is a sentence pair formed by two clicked titles corresponding to the same query;

S2. Extracting, from each candidate sentence pair, phrase pairs having the same context as candidate paraphrase phrase pairs;

S3. Extracting, from the candidate paraphrase phrase pairs, those in which at least one phrase belongs to an attribute-name list;

S4. Performing noise filtering on the candidate paraphrase phrase pairs extracted in step S3 to obtain attribute-name paraphrase phrase pairs, where the two phrases of an attribute-name paraphrase phrase pair are attribute-name paraphrases of each other.

According to a preferred embodiment of the present invention, in step S2 phrase pairs are extracted as candidate paraphrase phrase pairs according to the following phrase extraction rule: the word preceding each of the two phrases is identical and the word following each of the two phrases is identical, but the two phrases themselves are not identical.

According to a preferred embodiment of the present invention, the phrase extraction rule further comprises at least one of the following: the lengths of the two phrases are within a preset length range; the two phrases contain no punctuation and do not consist entirely of stop words; or the words immediately before and after the two phrases are not punctuation.

According to a preferred embodiment of the present invention, step S2 further comprises: counting the number of times each candidate paraphrase phrase pair is extracted from Q-Q, Q-T and T-T respectively, and filtering out the candidate paraphrase phrase pairs whose total count is less than a preset count threshold.

According to a preferred embodiment of the present invention, the noise filtering in step S4 comprises at least one of the following:

If the length ratio of the two phrases in a candidate paraphrase phrase pair is greater than a preset length-ratio threshold, filtering out the candidate paraphrase phrase pair;

If the two phrases in a candidate paraphrase phrase pair differ only by stop words, filtering out the candidate paraphrase phrase pair;

If the phrase in a candidate paraphrase phrase pair that is not in the attribute-name list contains digits or English letters, filtering out the candidate paraphrase phrase pair;

If the first word or last word of the phrase in a candidate paraphrase phrase pair that is not in the attribute-name list appears in a preset filter vocabulary, filtering out the candidate paraphrase phrase pair;

If the phrase in a candidate paraphrase phrase pair that is not in the attribute-name list contains a place name, filtering out the candidate paraphrase phrase pair;

Determining the word-frequency score of each candidate paraphrase phrase pair that contains the same phrase, and filtering out the candidate paraphrase phrase pairs whose word-frequency scores rank outside the top N, where N is a preset positive integer.

According to a preferred embodiment of the present invention, the word-frequency score score(p2|p1) of a candidate paraphrase phrase pair <p1, p2> is calculated using the following formula:

score(p2 | p1) = \lambda_{q-q} P_{q-q}(p2 | p1) + \lambda_{q-t} P_{q-t}(p2 | p1) + \lambda_{t-t} P_{t-t}(p2 | p1);

where

P_{q-q}(p2 | p1) = \frac{C_{q-q}(p1, p2)}{\sum_{x} C_{q-q}(p1, x) + C},

P_{q-t}(p2 | p1) = \frac{C_{q-t}(p1, p2)}{\sum_{x} C_{q-t}(p1, x) + C},

P_{t-t}(p2 | p1) = \frac{C_{t-t}(p1, p2)}{\sum_{x} C_{t-t}(p1, x) + C},

C_{q-q}(p1, p2) is the number of times <p1, p2> is extracted from Q-Q, C_{q-t}(p1, p2) is the number of times <p1, p2> is extracted from Q-T, and C_{t-t}(p1, p2) is the number of times <p1, p2> is extracted from T-T;

\sum_{x} C_{q-q}(p1, x) is the sum of the numbers of times that all candidate paraphrase phrase pairs containing phrase p1 are extracted from Q-Q, \sum_{x} C_{q-t}(p1, x) is the corresponding sum for Q-T, and \sum_{x} C_{t-t}(p1, x) is the corresponding sum for T-T;

C is a preset smoothing coefficient, and \lambda_{q-q}, \lambda_{q-t} and \lambda_{t-t} are preset weight coefficients.

According to a preferred embodiment of the present invention, \lambda_{q-t} is greater than \lambda_{q-q} and \lambda_{t-t}.

A device for mining attribute-name paraphrases, the device comprising:

A candidate sentence pair acquiring unit, configured to obtain at least one of the resources Q-Q, Q-T and T-T from a search log as candidate sentence pairs, where Q-Q is a sentence pair formed by two queries that a user searched in one session, Q-T is a sentence pair formed by a query and a corresponding clicked web page title, and T-T is a sentence pair formed by two clicked titles corresponding to the same query;

A first phrase pair extraction unit, configured to extract, from each candidate sentence pair, phrase pairs having the same context as candidate paraphrase phrase pairs;

A second phrase pair extraction unit, configured to extract, from the candidate paraphrase phrase pairs, those in which at least one phrase belongs to an attribute-name list;

A noise filtering unit, configured to perform noise filtering on the candidate paraphrase phrase pairs extracted by the second phrase pair extraction unit to obtain attribute-name paraphrase phrase pairs, where the two phrases of an attribute-name paraphrase phrase pair are attribute-name paraphrases of each other.

According to a preferred embodiment of the present invention, the first phrase pair extraction unit extracts phrase pairs as candidate paraphrase phrase pairs according to the following phrase extraction rule: the word preceding each of the two phrases is identical and the word following each of the two phrases is identical, but the two phrases themselves are not identical.

According to a preferred embodiment of the present invention, the device further comprises:

A candidate filter unit, configured to count the number of times each candidate paraphrase phrase pair extracted by the first phrase pair extraction unit is extracted from Q-Q, Q-T and T-T respectively, to filter out the candidate paraphrase phrase pairs whose total count is less than a preset count threshold, and to provide the remaining candidate paraphrase phrase pairs to the second phrase pair extraction unit.

According to a preferred embodiment of the present invention, the noise filtering performed by the noise filtering unit comprises at least one of the filtering rules given for step S4 of the method.

According to a preferred embodiment of the present invention, when determining the word-frequency score score(p2|p1) of a candidate paraphrase phrase pair <p1, p2>, the noise filtering unit uses the following formula:

score(p2 | p1) = \lambda_{q-q} P_{q-q}(p2 | p1) + \lambda_{q-t} P_{q-t}(p2 | p1) + \lambda_{t-t} P_{t-t}(p2 | p1);

where

P_{q-q}(p2 | p1) = \frac{C_{q-q}(p1, p2)}{\sum_{x} C_{q-q}(p1, x) + C},

P_{q-t}(p2 | p1) = \frac{C_{q-t}(p1, p2)}{\sum_{x} C_{q-t}(p1, x) + C},

P_{t-t}(p2 | p1) = \frac{C_{t-t}(p1, p2)}{\sum_{x} C_{t-t}(p1, x) + C},

\sum_{x} C_{q-q}(p1, x) is the sum of the numbers of times that all candidate paraphrase phrase pairs containing phrase p1 are extracted from Q-Q,

\sum_{x} C_{q-t}(p1, x) is the corresponding sum for Q-T, and \sum_{x} C_{t-t}(p1, x) is the corresponding sum for T-T; the counts C_{q-q}(p1, p2), C_{q-t}(p1, p2) and C_{t-t}(p1, p2), the smoothing coefficient C and the weight coefficients \lambda_{q-q}, \lambda_{q-t} and \lambda_{t-t} are as defined for the method.

As can be seen from the above technical solutions, the present invention mines attribute-name paraphrase phrase pairs, i.e., phrase pairs whose members are attribute-name paraphrases of each other. The expression forms that an attribute name has can thereby be obtained, so that users' flexible and diverse query expressions can be matched better.

[Brief Description of the Drawings]

Fig. 1 is a flowchart of the method for mining attribute-name paraphrases provided by Embodiment 1 of the present invention;

Fig. 2 is a structural diagram of the device for mining attribute-name paraphrases provided by Embodiment 2 of the present invention.

[Detailed Description]

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.

Embodiment 1,

Fig. 1 is a flowchart of the method for mining attribute-name paraphrases provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method may comprise the following steps:

Step 101: Obtain at least one of the resources Q-Q, Q-T and T-T from a search log (query log) as candidate sentence pairs.

The purpose of this step is to obtain, from the query log, the sentence-pair resources used in subsequent mining. The query log records the data of users' query sessions and clicked web page titles. The query log used may be that of a specified period, for example the query log of one day.

Q-Q denotes a query-query pair, i.e., two queries that one user searched in one session; the two queries may have the same meaning.

Q-T denotes a query-clicked-title pair, i.e., a query and a corresponding clicked title; the query a user searches and the title the user subsequently clicks are usually likely to be semantically the same.

T-T denotes a title-title pair, i.e., two clicked titles corresponding to the same query; in general, two titles clicked when searching for one query may be semantically the same.
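The three resources of step 101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the log record layout `(session_id, query, clicked_titles)` and the function name `build_sentence_pairs` are assumptions made for the example, and T-T pairs are formed only within a single log record.

```python
from collections import defaultdict
from itertools import combinations


def build_sentence_pairs(log):
    """Build Q-Q, Q-T and T-T sentence pairs from a query log.

    `log` is an iterable of (session_id, query, clicked_titles) records;
    this schema is hypothetical and chosen only for illustration.
    """
    queries_by_session = defaultdict(list)
    qq, qt, tt = [], [], []
    for session_id, query, titles in log:
        queries_by_session[session_id].append(query)
        # Q-T: the query paired with each title clicked for it.
        qt.extend((query, title) for title in titles)
        # T-T: every two titles clicked for this query (within this record).
        tt.extend(combinations(titles, 2))
    # Q-Q: every two distinct queries issued in the same session.
    for queries in queries_by_session.values():
        unique = list(dict.fromkeys(queries))  # drop repeats, keep order
        qq.extend(combinations(unique, 2))
    return qq, qt, tt
```

A log with two queries in one session, the first of which led to two clicked titles, yields one Q-Q pair, two Q-T pairs and one T-T pair.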

Step 102: Extract, from each candidate sentence pair, phrase pairs having the same context as candidate paraphrase phrase pairs.

The purpose of this step is to further mine semantically equivalent phrase pairs from the candidate sentence pairs. The phrase pair extraction rule checks whether the two phrases share the same context: the word preceding each of the two phrases is identical and the word following each is also identical, but the two phrases themselves are not identical. Further, the lengths of the two phrases may be restricted to a preset length range, for example fewer than 5 words; the two phrases may be required to contain no punctuation and not to consist entirely of stop words; and, in addition, the words before and after the two phrases may be required not to be punctuation.

Suppose <s1, s2> is a candidate sentence pair and <p1, p2> is a phrase pair formed by a phrase p1 from s1 and a phrase p2 from s2. Let the words before and after p1 in s1 be b1 and e1 respectively, and the words before and after p2 in s2 be b2 and e2 respectively. If <p1, p2> satisfies the following rule, it is extracted as a candidate paraphrase phrase pair: p1 ≠ p2, b1 = b2 and e1 = e2.

Preferably, the above rule may further include one or a combination of the following: the lengths of p1 and p2 are within a preset length range; p1 and p2 contain no punctuation and do not consist entirely of stop words; b1, b2, e1 and e2 are not punctuation.

For example, given the candidate sentence pair formed by "national number-plate number inquiry" and "national license plate inquiry", the phrases "number-plate number" and "license plate" share the context "national _ inquiry" and are extracted to form a candidate paraphrase phrase pair.
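The extraction rule of step 102 (p1 ≠ p2, b1 = b2, e1 = e2, plus the optional constraints) can be sketched as below. The sketch assumes the sentences are already tokenized into word lists; the function names, the default punctuation set and the default `max_len=4` (reflecting the "fewer than 5 words" example) are illustrative choices, not values fixed by the source.

```python
def context_spans(tokens, max_len, stop_words, punct):
    """Yield (previous word, phrase, next word) for each admissible phrase."""
    n = len(tokens)
    for i in range(1, n - 1):
        for j in range(i + 1, min(i + max_len, n - 1) + 1):
            phrase = tuple(tokens[i:j])
            before, after = tokens[i - 1], tokens[j]
            if before in punct or after in punct:
                continue  # context words must not be punctuation
            if any(w in punct for w in phrase):
                continue  # phrase must not contain punctuation
            if all(w in stop_words for w in phrase):
                continue  # phrase must not consist entirely of stop words
            yield before, phrase, after


def extract_candidate_pairs(s1, s2, max_len=4, stop_words=frozenset(),
                            punct=frozenset(",.?!;:")):
    """Return the set of (p1, p2) with p1 != p2, b1 == b2 and e1 == e2."""
    by_context = {}
    for b, p, e in context_spans(s2, max_len, stop_words, punct):
        by_context.setdefault((b, e), set()).add(p)
    pairs = set()
    for b, p1, e in context_spans(s1, max_len, stop_words, punct):
        for p2 in by_context.get((b, e), ()):
            if p1 != p2:
                pairs.add((p1, p2))
    return pairs
```

Indexing the phrases of s2 by their (previous word, next word) context avoids comparing every phrase of s1 with every phrase of s2.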

In addition, the number of times each candidate paraphrase phrase pair is extracted from Q-Q, Q-T and T-T is counted separately, giving C_{q-q}(p1, p2), C_{q-t}(p1, p2) and C_{t-t}(p1, p2) respectively, and only the candidate paraphrase phrase pairs whose total count is greater than or equal to a preset count threshold are retained. For example, the candidate paraphrase phrase pairs with C_{q-q}(p1, p2) + C_{q-t}(p1, p2) + C_{t-t}(p1, p2) ≥ 3 are retained and the others are filtered out.
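The per-resource counting and the total-count threshold can be sketched with `collections.Counter`. The function name and the dictionary keys `"q-q"`, `"q-t"`, `"t-t"` are assumptions for the example; the default threshold of 3 follows the example in the text.

```python
from collections import Counter


def count_and_filter(qq_pairs, qt_pairs, tt_pairs, min_total=3):
    """Count extractions per resource and keep pairs whose total >= min_total."""
    counts = {
        "q-q": Counter(qq_pairs),
        "q-t": Counter(qt_pairs),
        "t-t": Counter(tt_pairs),
    }
    # Adding Counters sums the counts of each phrase pair across resources.
    total = counts["q-q"] + counts["q-t"] + counts["t-t"]
    kept = {pair for pair, n in total.items() if n >= min_total}
    return counts, kept
```

The per-resource counters are returned alongside the retained pairs because the word-frequency score used later in noise filtering needs the counts separately for Q-Q, Q-T and T-T.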

Step 103: Extract, from the candidate paraphrase phrase pairs, those in which at least one phrase belongs to the attribute-name list.

The purpose of the present invention is to mine attribute-name paraphrases, which are a special case of the full paraphrase resource. It is therefore necessary to ensure that a candidate paraphrase phrase pair is a candidate for an attribute name, which can be done by checking whether at least one phrase of the pair belongs to the attribute-name list.

The attribute-name list may be a list of attribute names obtained manually and/or by automatic mining. As long as one of the two phrases is in the attribute-name list, the candidate paraphrase phrase pair is retained; there is no requirement on the other phrase, which may or may not be in the attribute-name list.
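The membership test of step 103 is a one-line filter. In this sketch the attribute-name list is assumed to be held as a Python set of phrases, and the function name is hypothetical.

```python
def keep_attribute_pairs(pairs, attribute_names):
    """Keep pairs in which at least one phrase is in the attribute-name list."""
    return [
        (p1, p2)
        for p1, p2 in pairs
        if p1 in attribute_names or p2 in attribute_names
    ]
```

Pairs in which both phrases are attribute names are also retained, consistent with the statement that the other phrase may or may not be in the list.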

Step 104: Perform noise filtering on the candidate paraphrase phrase pairs extracted in step 103 to obtain attribute-name paraphrase phrase pairs.

The noise filtering performed in this step may adopt, but is not limited to, at least one of the following filtering rules:

Filtering rule one: filtering based on length ratio. If the length ratio of the two phrases in a candidate paraphrase phrase pair is greater than a preset length-ratio threshold, the candidate paraphrase phrase pair is filtered out. Suppose that in the candidate paraphrase phrase pair <p1, p2> the lengths of phrases p1 and p2 are L1 and L2 respectively, and the preset length-ratio threshold is T (for example, T = 2.5). If

\frac{L1}{L2} > T or \frac{L2}{L1} > T,

then <p1, p2> is filtered out.
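Filtering rule one can be sketched as follows. The source does not state the unit of length, so this example assumes character length of non-empty strings; the function name is hypothetical and the default threshold 2.5 follows the example value of T.

```python
def passes_length_ratio(p1, p2, threshold=2.5):
    """Return True if neither L1/L2 nor L2/L1 exceeds the threshold.

    Length is measured in characters here, which is an assumption;
    both phrases are assumed to be non-empty.
    """
    l1, l2 = len(p1), len(p2)
    # max/min covers both L1/L2 > T and L2/L1 > T in a single comparison.
    return max(l1, l2) / min(l1, l2) <= threshold
```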

Filtering rule two: filtering based on stop words. If the two phrases in a candidate paraphrase phrase pair differ only by stop words, the candidate paraphrase phrase pair is filtered out. For example, in a candidate paraphrase phrase pair whose two phrases both mean "height", the only difference between them is a stop word; such a paraphrase provides no valuable knowledge and is therefore filtered out.
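One way to test filtering rule two is to strip the stop words from both phrases and compare what remains. This is a sketch under the assumption that phrases are token sequences; the function name and the sample stop-word set are illustrative.

```python
def differs_only_by_stop_words(p1, p2, stop_words):
    """Return True if p1 and p2 are equal once stop words are removed."""
    def strip(phrase):
        return [w for w in phrase if w not in stop_words]
    return strip(p1) == strip(p2)
```

A pair for which this returns True would be filtered out under rule two.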

Filtering rule three: filtering based on a filter vocabulary. This specifically includes the following cases:

1) Digits and English letters are placed in the filter vocabulary in advance. If the phrase in a candidate paraphrase phrase pair that is not in the attribute-name list contains digits or English letters, the candidate paraphrase phrase pair is filtered out.

2) Words without independent meaning, such as auxiliary words, prepositions and conjunctions, are placed in the filter vocabulary in advance. If the first word or last word of the phrase in a candidate paraphrase phrase pair that is not in the attribute-name list appears in the filter vocabulary, the candidate paraphrase phrase pair is filtered out. This filter mainly removes phrases with unreasonable boundaries. Because no linguistic constraint is applied when phrases are extracted, some extracted phrases, such as "body weight", "middle speed per hour", "height and" and "complete or collected works it", have unreasonable boundaries and should be filtered out.

3) Place names are placed in the filter vocabulary in advance. If the phrase in a candidate paraphrase phrase pair that is not in the attribute-name list contains a place name, the candidate paraphrase phrase pair is filtered out. This filter mainly reflects the consideration that an attribute name should not contain a place name that restricts it to a specific region.
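The three cases of filtering rule three can be combined into a single check. The sketch below assumes phrases are tuples of Chinese word tokens and that the attribute-name list is a set of such tuples; the function name and the sample vocabulary entries used in any test data are illustrative, not taken from the source.

```python
import re

_DIGIT_OR_LETTER = re.compile(r"[0-9A-Za-z]")


def violates_filter_vocabulary(pair, attribute_names, boundary_words, place_names):
    """Return True if a phrase outside the attribute-name list breaks rule three."""
    for phrase in pair:
        if phrase in attribute_names:
            continue  # the rule only constrains the phrase not in the list
        # Case 1: the phrase contains digits or English letters.
        if any(_DIGIT_OR_LETTER.search(word) for word in phrase):
            return True
        # Case 2: the first or last word is a function word (bad boundary).
        if phrase[0] in boundary_words or phrase[-1] in boundary_words:
            return True
        # Case 3: the phrase contains a place name.
        if any(word in place_names for word in phrase):
            return True
    return False
```

Keeping the boundary words and place names as separate sets makes it easy to apply case 2 only to the first and last word while applying case 3 to every word of the phrase.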

Filtering rule four: top-N filtering based on word-frequency score. An attribute name may have multiple candidate paraphrases; for example, the paraphrases of "postcode" may be "postcode", "postcode number", "postal code number" and so on. Here at most N paraphrases are retained, which requires a scoring mechanism. The scoring mechanism adopted here is based on word frequency: the word-frequency score of each candidate paraphrase phrase pair containing the same phrase is determined (the same phrase being one that is included in the attribute-name list), and the candidate paraphrase phrase pairs whose word-frequency scores rank outside the top N are filtered out, where N is a preset positive integer.

The method of determining the word-frequency score is described below. The word-frequency score score(p2|p1) of a candidate paraphrase phrase pair <p1, p2> can be calculated using the following formula:

score(p2 | p1) = \lambda_{q-q} P_{q-q}(p2 | p1) + \lambda_{q-t} P_{q-t}(p2 | p1) + \lambda_{t-t} P_{t-t}(p2 | p1) \quad (1)

where P_{q-q}(p2 | p1), P_{q-t}(p2 | p1) and P_{t-t}(p2 | p1) are the scores of <p1, p2> based on Q-Q, Q-T and T-T respectively, calculated using the following formulas:

P_{q-q}(p2 | p1) = \frac{C_{q-q}(p1, p2)}{\sum_{x} C_{q-q}(p1, x) + C} \quad (2)

P_{q-t}(p2 | p1) = \frac{C_{q-t}(p1, p2)}{\sum_{x} C_{q-t}(p1, x) + C} \quad (3)

P_{t-t}(p2 | p1) = \frac{C_{t-t}(p1, p2)}{\sum_{x} C_{t-t}(p1, x) + C} \quad (4)

Here C_{q-q}(p1, p2), C_{q-t}(p1, p2) and C_{t-t}(p1, p2) are the numbers of times <p1, p2> is extracted from Q-Q, Q-T and T-T respectively; \sum_{x} C_{q-q}(p1, x) is the sum of the numbers of times that all candidate paraphrase phrase pairs containing phrase p1 are extracted from Q-Q, \sum_{x} C_{q-t}(p1, x) is the corresponding sum for Q-T, and \sum_{x} C_{t-t}(p1, x) is the corresponding sum for T-T.

C is a preset smoothing coefficient, for example C = 100. It prevents the denominators of formulas (2), (3) and (4) from being too small, and the overall score from being too large, when p1 has very few candidate paraphrases.

\lambda_{q-q}, \lambda_{q-t} and \lambda_{t-t} are preset weight coefficients. Because the score P_{q-t}(p2 | p1) is more important than the other two, \lambda_{q-t} is preferably set greater than \lambda_{q-q} and \lambda_{t-t}; for example, \lambda_{q-q}, \lambda_{q-t} and \lambda_{t-t} are set to 0.2, 0.6 and 0.2 respectively.
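Formulas (1) to (4) and the top-N selection of filtering rule four can be sketched as below. The layout of `counts` (a dictionary from resource name to a mapping of `(p1, p2)` to extraction count) and the function names are assumptions for the example; the default weights 0.2, 0.6, 0.2 and smoothing coefficient 100 are the example values from the text.

```python
DEFAULT_WEIGHTS = {"q-q": 0.2, "q-t": 0.6, "t-t": 0.2}


def score(p1, p2, counts, weights=DEFAULT_WEIGHTS, smoothing=100):
    """Word-frequency score of <p1, p2>: weighted sum of smoothed ratios."""
    total = 0.0
    for source, lam in weights.items():
        c = counts[source]
        # Denominator: sum over x of C(p1, x), plus the smoothing coefficient.
        denom = sum(n for (a, _), n in c.items() if a == p1) + smoothing
        total += lam * c.get((p1, p2), 0) / denom
    return total


def top_n(p1, counts, n=3, weights=DEFAULT_WEIGHTS, smoothing=100):
    """Return the n paraphrase candidates of p1 with the highest scores."""
    candidates = {b for c in counts.values() for (a, b) in c if a == p1}
    ranked = sorted(
        candidates,
        key=lambda p2: score(p1, p2, counts, weights, smoothing),
        reverse=True,
    )
    return ranked[:n]
```

With a large smoothing coefficient, a candidate seen a handful of times for a rare p1 cannot outscore a candidate seen many times for a frequent one, which is the effect the text attributes to C.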

The above is a description of the method provided by the present invention. The device provided by the present invention is described in detail below through Embodiment 2.

Embodiment 2,

Fig. 2 is a structural diagram of the device for mining attribute-name paraphrases provided by Embodiment 2 of the present invention. As shown in Fig. 2, the device comprises: a candidate sentence pair acquiring unit 201, a first phrase pair extraction unit 202, a second phrase pair extraction unit 203 and a noise filtering unit 204.

The candidate sentence pair acquiring unit 201 obtains at least one of the resources Q-Q, Q-T and T-T from the query log as candidate sentence pairs, where Q-Q is a sentence pair formed by two queries that a user searched in one session, Q-T is a sentence pair formed by a query and a corresponding clicked title, and T-T is a sentence pair formed by two clicked titles corresponding to the same query.

The first phrase pair extraction unit 202 extracts, from each candidate sentence pair, phrase pairs having the same context as candidate paraphrase phrase pairs. Specifically, phrase pairs may be extracted as candidate paraphrase phrase pairs according to the following phrase extraction rule: the word preceding each of the two phrases is identical and the word following each of the two phrases is identical, but the two phrases themselves are not identical.

Further, the phrase extraction rule may also comprise at least one of the following: the lengths of the two phrases are within a preset length range; the two phrases contain no punctuation and do not consist entirely of stop words; or the words immediately before and after the two phrases are not punctuation.

The second phrase pair extraction unit 203 extracts, from the candidate paraphrase phrase pairs, those in which at least one phrase belongs to the attribute-name list. That is, to ensure that a candidate paraphrase phrase pair is a candidate for an attribute name, it checks whether at least one phrase of the pair belongs to the attribute-name list.

The noise filtering unit 204 performs noise filtering on the candidate paraphrase phrase pairs extracted by the second phrase pair extraction unit 203 to obtain attribute-name paraphrase phrase pairs, where the two phrases of an attribute-name paraphrase phrase pair are attribute-name paraphrases of each other.

Preferably, the device may further comprise a candidate filter unit 205, configured to count the number of times each candidate paraphrase phrase pair extracted by the first phrase pair extraction unit 202 is extracted from Q-Q, Q-T and T-T respectively, to filter out the candidate paraphrase phrase pairs whose total count is less than a preset count threshold, and to provide the remaining candidate paraphrase phrase pairs to the second phrase pair extraction unit 203. In this case, the second phrase pair extraction unit 203 actually extracts from the candidate paraphrase phrase pairs that remain after filtering by the candidate filter unit 205.

When performing noise filtering, the noise filtering unit 204 may adopt at least one of the following filtering rules:

Filtering rule one: filtering based on length ratio. If the length ratio of the two phrases in a candidate paraphrase phrase pair is greater than a preset length-ratio threshold, the candidate paraphrase phrase pair is filtered out.

Filtering rule two: filtering based on stop words. If the two phrases in a candidate paraphrase phrase pair differ only by stop words, the candidate paraphrase phrase pair is filtered out.

Filtering rule three: filtering based on a filter vocabulary. This specifically includes the following cases:

1) Digits and English letters are placed in the filter vocabulary in advance. If the phrase in a candidate paraphrase phrase pair that is not in the attribute-name list contains digits or English letters, the candidate paraphrase phrase pair is filtered out.

2) Words without independent meaning, such as auxiliary words, prepositions and conjunctions, are placed in the filter vocabulary in advance. If the first word or last word of the phrase in a candidate paraphrase phrase pair that is not in the attribute-name list appears in the preset filter vocabulary, the candidate paraphrase phrase pair is filtered out.

3) Place names are placed in the filter vocabulary in advance. If the phrase in a candidate paraphrase phrase pair that is not in the attribute-name list contains a place name, the candidate paraphrase phrase pair is filtered out.

Filtering rule four: top-N filtering based on word-frequency score. The word-frequency score of each candidate paraphrase phrase pair containing the same phrase is determined, and the candidate paraphrase phrase pairs whose word-frequency scores rank outside the top N are filtered out, where N is a preset positive integer.

In filtering rule four, the word-frequency score score(p2|p1) of a candidate paraphrase phrase pair <p1, p2> can be calculated using the following formula:

score(p2 | p1) = \lambda_{q-q} P_{q-q}(p2 | p1) + \lambda_{q-t} P_{q-t}(p2 | p1) + \lambda_{t-t} P_{t-t}(p2 | p1);

where

P_{q-q}(p2 | p1) = \frac{C_{q-q}(p1, p2)}{\sum_{x} C_{q-q}(p1, x) + C},

P_{q-t}(p2 | p1) = \frac{C_{q-t}(p1, p2)}{\sum_{x} C_{q-t}(p1, x) + C},

P_{t-t}(p2 | p1) = \frac{C_{t-t}(p1, p2)}{\sum_{x} C_{t-t}(p1, x) + C},

C_{q-q}(p1, p2) is the number of times <p1, p2> is extracted from Q-Q, C_{q-t}(p1, p2) is the number of times <p1, p2> is extracted from Q-T, and C_{t-t}(p1, p2) is the number of times <p1, p2> is extracted from T-T; \sum_{x} C_{q-q}(p1, x) is the sum of the numbers of times that all candidate paraphrase phrase pairs containing phrase p1 are extracted from Q-Q, \sum_{x} C_{q-t}(p1, x) is the corresponding sum for Q-T, and \sum_{x} C_{t-t}(p1, x) is the corresponding sum for T-T; C is a preset smoothing coefficient.

Because the score P_{q-t}(p2 | p1) is more important than the other two, \lambda_{q-t} is preferably set greater than \lambda_{q-q} and \lambda_{t-t}; for example, \lambda_{q-q}, \lambda_{q-t} and \lambda_{t-t} are set to 0.2, 0.6 and 0.2 respectively.

After paraphrase phrase pairs have been mined with the above method and device, a large number of phrase pairs whose members are attribute-name paraphrases of each other are obtained. These phrase pairs can then be organized to determine the paraphrase phrases corresponding to each attribute name.

In addition, because mining may be a periodic process, a mined paraphrase phrase may already have been found in an earlier round. Paraphrase phrases that already exist in the paraphrase phrase library can therefore be filtered out, and the remaining paraphrase phrases added to the library, which contains the paraphrase phrases corresponding to each attribute name.

Furthermore, before a paraphrase phrase is added to the paraphrase phrase library, it may first be annotated manually to determine whether it really is a paraphrase of the attribute name. This is in effect a manual filtering step: only the phrases that manual annotation confirms as paraphrases of the attribute name are added to the library. For example, the information finally obtained in the paraphrase phrase library may be as shown in Table 1.

Table 1

Attribute name	Attribute-name paraphrases
The TV of not liking	Poor TV
The number-plate number	Number, take pictures, automotive license plate number, good board, the trade mark
Net profit	Profit, income, net profit, total profit, profit, income
Camera titbits	Film clips, excellent titbit
Luxurious	Honorable luxury, expensive, most distinguished, top, well-known, famous brand
Song catalog	Song, song complete works

In this way, even if a user enters a query that words an attribute name differently, the attribute name the user intends can be determined by looking up the paraphrase phrase library, so that the queried information can be returned to the user in a targeted manner.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
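The library update and lookup described above can be sketched as follows. This is an illustration only: the library is assumed to be a dictionary from attribute name to a set of paraphrases, mined pairs are assumed to be oriented with the attribute name first, and the function names and the `approve` callback (standing in for manual annotation) are hypothetical.

```python
def merge_into_library(library, mined_pairs, approve=lambda attr, para: True):
    """Add newly mined paraphrases, skipping known ones and rejected ones."""
    added = []
    for attr, para in mined_pairs:
        if para in library.get(attr, set()):
            continue  # already mined in an earlier round
        if approve(attr, para):  # stand-in for the manual annotation step
            library.setdefault(attr, set()).add(para)
            added.append((attr, para))
    return added


def resolve_attribute(phrase, library):
    """Map a phrase from a query to the attribute name it paraphrases."""
    if phrase in library:
        return phrase
    for attr, paraphrases in library.items():
        if phrase in paraphrases:
            return attr
    return None
```

Once a query phrase is resolved to its attribute name, the structured database can be queried with that name in the usual way.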

Claims

1. A method for mining attribute-name paraphrases, characterized in that the method comprises the following steps:

2. The method according to claim 1, characterized in that in step S2 phrase pairs are extracted as candidate paraphrase phrase pairs according to the following phrase extraction rule: the word preceding each of the two phrases is identical and the word following each of the two phrases is identical, but the two phrases themselves are not identical.

3. The method according to claim 2, characterized in that the phrase extraction rule further comprises at least one of the following: the lengths of the two phrases are within a preset length range; the two phrases contain no punctuation and do not consist entirely of stop words; or the words immediately before and after the two phrases are not punctuation.

4. The method according to claim 1, characterized in that step S2 further comprises: counting the number of times each candidate paraphrase phrase pair is extracted from Q-Q, Q-T and T-T respectively, and filtering out the candidate paraphrase phrase pairs whose total count is less than a preset count threshold.

5. The method according to claim 1, characterized in that the noise filtering in step S4 comprises at least one of the following:

6. The method according to claim 5, characterized in that the word-frequency score score(p2|p1) of a candidate paraphrase phrase pair <p1, p2> is calculated using the following formula:

score(p2 | p1) = \lambda_{q-q} P_{q-q}(p2 | p1) + \lambda_{q-t} P_{q-t}(p2 | p1) + \lambda_{t-t} P_{t-t}(p2 | p1);

where

P_{q-q}(p2 | p1) = \frac{C_{q-q}(p1, p2)}{\sum_{x} C_{q-q}(p1, x) + C},

P_{q-t}(p2 | p1) = \frac{C_{q-t}(p1, p2)}{\sum_{x} C_{q-t}(p1, x) + C},

P_{t-t}(p2 | p1) = \frac{C_{t-t}(p1, p2)}{\sum_{x} C_{t-t}(p1, x) + C},

7. The method according to claim 6, characterized in that \lambda_{q-t} is greater than \lambda_{q-q} and \lambda_{t-t}.

8. A device for mining attribute-name paraphrases, characterized in that the device comprises:

9. The device according to claim 8, characterized in that the first phrase pair extraction unit extracts phrase pairs as candidate paraphrase phrase pairs according to the following phrase extraction rule: the word preceding each of the two phrases is identical and the word following each of the two phrases is identical, but the two phrases themselves are not identical.

10. The device according to claim 9, characterized in that the phrase extraction rule further comprises at least one of the following: the lengths of the two phrases are within a preset length range; the two phrases contain no punctuation and do not consist entirely of stop words; or the words immediately before and after the two phrases are not punctuation.

11. The device according to claim 8, characterized in that the device further comprises:

12. The device according to claim 8, characterized in that the noise filtering performed by the noise filtering unit comprises at least one of the following:

13. The device according to claim 12, characterized in that, when determining the word-frequency score score(p2|p1) of a candidate paraphrase phrase pair <p1, p2>, the noise filtering unit uses the following formula:

score(p2 | p1) = \lambda_{q-q} P_{q-q}(p2 | p1) + \lambda_{q-t} P_{q-t}(p2 | p1) + \lambda_{t-t} P_{t-t}(p2 | p1);

where

P_{q-q}(p2 | p1) = \frac{C_{q-q}(p1, p2)}{\sum_{x} C_{q-q}(p1, x) + C},

P_{q-t}(p2 | p1) = \frac{C_{q-t}(p1, p2)}{\sum_{x} C_{q-t}(p1, x) + C},

P_{t-t}(p2 | p1) = \frac{C_{t-t}(p1, p2)}{\sum_{x} C_{t-t}(p1, x) + C},

14. The device according to claim 13, characterized in that \lambda_{q-t} is greater than \lambda_{q-q} and \lambda_{t-t}.