CN103631817A - Method and device for mining attribute-name paraphrases

Publication number
CN103631817A
Authority
CN
China
Prior art keywords
phrase
candidate
repeats
attribute
phrases
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210307150.8A
Other languages
Chinese (zh)
Other versions
CN103631817B (en)
Inventor
赵世奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201210307150.8A (patent CN103631817B)
Publication of CN103631817A
Application granted
Publication of CN103631817B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention provides a method and device for mining attribute-name paraphrases. The method comprises: obtaining at least one of the resources Q-Q, Q-T and T-T from a search log as candidate sentence pairs, where Q-Q is a sentence pair formed by two queries searched by a user within one session, Q-T is a sentence pair formed by a query and the title of a clicked web page corresponding to that query, and T-T is a sentence pair formed by two clicked titles corresponding to the same query; extracting, from the candidate sentence pairs, phrase pairs that share the same context as candidate paraphrase phrase pairs; extracting, from the candidate paraphrase phrase pairs, those in which at least one phrase belongs to an attribute-name list; and performing noise filtering on the extracted candidate pairs to obtain attribute-name paraphrase phrase pairs. The method and device obtain the forms of expression that an attribute name has, so that users' flexible and diverse query expressions can be matched better.

Description

A method and device for mining attribute-name paraphrases
[Technical field]
The present invention relates to the field of computer application technology, and in particular to a method and device for mining attribute-name paraphrases.
[Background art]
In the field of network information, triple data can be expressed as (e, a, v), where e is an entity name (entity), a is an attribute name (attribute) and v is an attribute value (value); for example, (Yao Ming, height, 2.26 meters) is a triple. Triple data has many applications, especially in search engines: triples are stored in a structured database that serves as a data source for vertical search, so that when a user searches for an attribute of an entity, the search engine can return the corresponding attribute value directly. For example, when a user searches "What is Yao Ming's height", the exact answer "2.26 meters" can be returned directly.
However, the wording a user adopts in an actual search may differ from the wording in the structured database, and the difference is especially evident in attribute names. In the example above, a user may search "Yao Ming's height", "how tall is Yao Ming", "how high is Yao Ming" and so on. Although all of these queries aim to obtain Yao Ming's height, they may fail to hit the content of the structured database because the attribute name is expressed differently. It is therefore necessary to mine paraphrases of the attribute names in the structured database, that is, the various forms of expression each attribute name has, so as to better match users' flexible and diverse query expressions.
[Summary of the invention]
In view of this, the invention provides a method and device for mining attribute-name paraphrases, so as to mine the forms of expression that attribute names have and thereby better match users' flexible and diverse query expressions.
The specific technical solution is as follows:
A method for mining attribute-name paraphrases, the method comprising the following steps:
S1. obtaining at least one of the resources Q-Q, Q-T and T-T from a search log as candidate sentence pairs, where Q-Q is a sentence pair formed by two queries searched by a user within one session, Q-T is a sentence pair formed by a query and the title of a corresponding clicked web page, and T-T is a sentence pair formed by two clicked titles corresponding to the same query;
S2. extracting, from each candidate sentence pair, phrase pairs that share the same context as candidate paraphrase phrase pairs;
S3. extracting, from the candidate paraphrase phrase pairs, those in which at least one phrase belongs to an attribute-name list;
S4. performing noise filtering on the candidate paraphrase phrase pairs extracted in step S3 to obtain attribute-name paraphrase phrase pairs, the two phrases of each such pair being attribute-name paraphrases of each other.
According to a preferred embodiment of the present invention, in step S2 phrase pairs are extracted as candidate paraphrase phrase pairs according to the following phrase extraction rule: the words preceding the two phrases are identical and the words following them are identical, but the two phrases themselves are not identical.
According to a preferred embodiment of the present invention, the phrase extraction rule further comprises at least one of the following: the lengths of the two phrases are within a preset length range; the two phrases contain no punctuation and do not consist entirely of stop words; the words before and after the two phrases are not punctuation.
According to a preferred embodiment of the present invention, step S2 further comprises: counting the number of times each candidate paraphrase phrase pair is extracted from Q-Q, Q-T and T-T respectively, and filtering out the candidate pairs whose total count is less than a preset count threshold.
According to a preferred embodiment of the present invention, the noise filtering in step S4 comprises at least one of the following:
if the length ratio of the two phrases of a candidate paraphrase phrase pair is greater than a preset length-ratio threshold, filtering out the pair;
if the two phrases of a candidate paraphrase phrase pair differ only by stop words, filtering out the pair;
if the phrase of a candidate paraphrase phrase pair that is not in the attribute-name list contains digits or English letters, filtering out the pair;
if the first or last word of the phrase of a candidate paraphrase phrase pair that is not in the attribute-name list appears in a preset filter vocabulary, filtering out the pair;
if the phrase of a candidate paraphrase phrase pair that is not in the attribute-name list contains a place name, filtering out the pair;
determining the word-frequency score of each candidate paraphrase phrase pair that contains the same phrase, and filtering out the pairs whose word-frequency scores rank outside the top N, where N is a preset positive integer.
According to a preferred embodiment of the present invention, the word-frequency score score(p2|p1) of a candidate paraphrase phrase pair <p1, p2> is calculated with the following formula:
$$\mathrm{score}(p_2 \mid p_1) = \lambda_{q-q} P_{q-q}(p_2 \mid p_1) + \lambda_{q-t} P_{q-t}(p_2 \mid p_1) + \lambda_{t-t} P_{t-t}(p_2 \mid p_1)$$
where
$$P_{q-q}(p_2 \mid p_1) = \frac{C_{q-q}(p_1, p_2)}{\sum_{x} C_{q-q}(p_1, x) + C}$$
$$P_{q-t}(p_2 \mid p_1) = \frac{C_{q-t}(p_1, p_2)}{\sum_{x} C_{q-t}(p_1, x) + C}$$
$$P_{t-t}(p_2 \mid p_1) = \frac{C_{t-t}(p_1, p_2)}{\sum_{x} C_{t-t}(p_1, x) + C}$$
Here $C_{q-q}(p_1, p_2)$, $C_{q-t}(p_1, p_2)$ and $C_{t-t}(p_1, p_2)$ are the numbers of times <p1, p2> is extracted from Q-Q, Q-T and T-T respectively; $\sum_{x} C_{q-q}(p_1, x)$, $\sum_{x} C_{q-t}(p_1, x)$ and $\sum_{x} C_{t-t}(p_1, x)$ are the sums of the numbers of times that all candidate paraphrase phrase pairs containing phrase p1 are extracted from Q-Q, Q-T and T-T respectively; C is a preset smoothing factor; and $\lambda_{q-q}$, $\lambda_{q-t}$ and $\lambda_{t-t}$ are preset weight coefficients.
According to a preferred embodiment of the present invention, $\lambda_{q-t}$ is greater than $\lambda_{q-q}$ and $\lambda_{t-t}$.
A device for mining attribute-name paraphrases, the device comprising:
a candidate sentence pair acquiring unit, configured to obtain at least one of the resources Q-Q, Q-T and T-T from a search log as candidate sentence pairs, where Q-Q is a sentence pair formed by two queries searched by a user within one session, Q-T is a sentence pair formed by a query and the title of a corresponding clicked web page, and T-T is a sentence pair formed by two clicked titles corresponding to the same query;
a first phrase pair extraction unit, configured to extract, from each candidate sentence pair, phrase pairs that share the same context as candidate paraphrase phrase pairs;
a second phrase pair extraction unit, configured to extract, from the candidate paraphrase phrase pairs, those in which at least one phrase belongs to an attribute-name list;
a noise filtering unit, configured to perform noise filtering on the candidate paraphrase phrase pairs extracted by the second phrase pair extraction unit to obtain attribute-name paraphrase phrase pairs, the two phrases of each such pair being attribute-name paraphrases of each other.
According to a preferred embodiment of the present invention, the first phrase pair extraction unit extracts phrase pairs as candidate paraphrase phrase pairs according to the following phrase extraction rule: the words preceding the two phrases are identical and the words following them are identical, but the two phrases themselves are not identical.
According to a preferred embodiment of the present invention, the phrase extraction rule further comprises at least one of the following: the lengths of the two phrases are within a preset length range; the two phrases contain no punctuation and do not consist entirely of stop words; the words before and after the two phrases are not punctuation.
According to a preferred embodiment of the present invention, the device further comprises:
a candidate filtering unit, configured to count the number of times each candidate paraphrase phrase pair extracted by the first phrase pair extraction unit is extracted from Q-Q, Q-T and T-T respectively, to filter out the candidate pairs whose total count is less than a preset count threshold, and to provide the remaining candidate pairs to the second phrase pair extraction unit.
According to a preferred embodiment of the present invention, the noise filtering performed by the noise filtering unit comprises at least one of the following:
if the length ratio of the two phrases of a candidate paraphrase phrase pair is greater than a preset length-ratio threshold, filtering out the pair;
if the two phrases of a candidate paraphrase phrase pair differ only by stop words, filtering out the pair;
if the phrase of a candidate paraphrase phrase pair that is not in the attribute-name list contains digits or English letters, filtering out the pair;
if the first or last word of the phrase of a candidate paraphrase phrase pair that is not in the attribute-name list appears in a preset filter vocabulary, filtering out the pair;
if the phrase of a candidate paraphrase phrase pair that is not in the attribute-name list contains a place name, filtering out the pair;
determining the word-frequency score of each candidate paraphrase phrase pair that contains the same phrase, and filtering out the pairs whose word-frequency scores rank outside the top N, where N is a preset positive integer.
According to a preferred embodiment of the present invention, the noise filtering unit calculates the word-frequency score score(p2|p1) of a candidate paraphrase phrase pair <p1, p2> with the following formula:
$$\mathrm{score}(p_2 \mid p_1) = \lambda_{q-q} P_{q-q}(p_2 \mid p_1) + \lambda_{q-t} P_{q-t}(p_2 \mid p_1) + \lambda_{t-t} P_{t-t}(p_2 \mid p_1)$$
where
$$P_{q-q}(p_2 \mid p_1) = \frac{C_{q-q}(p_1, p_2)}{\sum_{x} C_{q-q}(p_1, x) + C}$$
$$P_{q-t}(p_2 \mid p_1) = \frac{C_{q-t}(p_1, p_2)}{\sum_{x} C_{q-t}(p_1, x) + C}$$
$$P_{t-t}(p_2 \mid p_1) = \frac{C_{t-t}(p_1, p_2)}{\sum_{x} C_{t-t}(p_1, x) + C}$$
Here $C_{q-q}(p_1, p_2)$, $C_{q-t}(p_1, p_2)$ and $C_{t-t}(p_1, p_2)$ are the numbers of times <p1, p2> is extracted from Q-Q, Q-T and T-T respectively; $\sum_{x} C_{q-q}(p_1, x)$, $\sum_{x} C_{q-t}(p_1, x)$ and $\sum_{x} C_{t-t}(p_1, x)$ are the sums of the numbers of times that all candidate paraphrase phrase pairs containing phrase p1 are extracted from Q-Q, Q-T and T-T respectively; C is a preset smoothing factor; and $\lambda_{q-q}$, $\lambda_{q-t}$ and $\lambda_{t-t}$ are preset weight coefficients.
According to a preferred embodiment of the present invention, $\lambda_{q-t}$ is greater than $\lambda_{q-q}$ and $\lambda_{t-t}$.
As can be seen from the above technical solutions, the present invention mines attribute-name paraphrase phrase pairs, that is, phrase pairs whose two phrases are attribute-name paraphrases of each other. The forms of expression that an attribute name has can thus be obtained, so that users' flexible and diverse query expressions can be matched better.
[Brief description of the drawings]
Fig. 1 is a flowchart of the method for mining attribute-name paraphrases provided by Embodiment 1 of the present invention;
Fig. 2 is a structural diagram of the device for mining attribute-name paraphrases provided by Embodiment 2 of the present invention.
[Detailed description]
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.
Embodiment 1
Fig. 1 is a flowchart of the method for mining attribute-name paraphrases provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method may comprise the following steps:
Step 101: obtain at least one of the resources Q-Q, Q-T and T-T from a search log (query log) as candidate sentence pairs.
The purpose of this step is to obtain from the query log the sentence-pair resources used in the subsequent mining. The query log records users' query sessions and the titles of the web pages they clicked. The query log used may be that of a specified period, for example the query log of one day.
Q-Q denotes a query-query pair: two queries searched by one user within one session, whose meanings may be the same.
Q-T denotes a query-clicked-title pair: a query and a corresponding clicked title. The query a user searches and the title the user subsequently clicks are usually likely to have the same meaning.
T-T denotes a title-title pair: two clicked titles corresponding to the same query. The two titles a user clicks when searching one query are likely to have the same meaning.
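The following is a minimal Python sketch of how the three resources might be assembled from a query log. The record layout (a session as a list of `(query, clicked_titles)` tuples) and the function name `candidate_sentence_pairs` are assumptions made for illustration; the source does not specify a log format.

```python
from itertools import combinations


def candidate_sentence_pairs(sessions):
    """Build Q-Q, Q-T and T-T sentence pairs from session records.

    sessions: list of sessions; each session is a list of
    (query, clicked_titles) tuples (hypothetical layout).
    """
    qq, qt, tt = [], [], []
    for session in sessions:
        queries = [query for query, _ in session]
        # Q-Q: every two queries issued within the same session.
        qq.extend(combinations(queries, 2))
        for query, titles in session:
            # Q-T: the query paired with each title clicked for it.
            qt.extend((query, title) for title in titles)
            # T-T: every two titles clicked for the same query.
            tt.extend(combinations(titles, 2))
    return qq, qt, tt
```

Any one of the three returned lists, or any combination of them, can serve as the candidate sentence pairs for the next step.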
Step 102: extract, from each candidate sentence pair, phrase pairs that share the same context as candidate paraphrase phrase pairs.
The purpose of this step is to further mine semantically equivalent phrase pairs from the candidate sentence pairs. The phrase pair extraction rule checks whether the two phrases have the same context: the words preceding the two phrases are identical and the words following them are also identical, but the two phrases themselves are not identical. In addition, the lengths of the two phrases may be restricted to a preset length range, for example fewer than 5 words; the two phrases may be required to contain no punctuation and not to consist entirely of stop words; and the words before and after the two phrases may be required not to be punctuation.
Suppose <s1, s2> is a candidate sentence pair and <p1, p2> is a phrase pair formed by a phrase p1 from s1 and a phrase p2 from s2. Let b1 and e1 be the words immediately before and after p1 in s1, and let b2 and e2 be the words immediately before and after p2 in s2. If <p1, p2> satisfies the following rule, it is extracted as a candidate paraphrase phrase pair: p1 ≠ p2, b1 = b2 and e1 = e2.
Preferably, the rule may further include one or a combination of the following: the lengths of p1 and p2 are within a preset length range; p1 and p2 contain no punctuation and do not consist entirely of stop words; b1, b2, e1 and e2 are not punctuation.
As an example, given the candidate sentence pair formed by "national number-plate number inquiry" and "national license plate inquiry", the phrases "number-plate number" and "license plate" share the same context "national _ inquiry" and are therefore extracted as a candidate paraphrase phrase pair.
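The core rule (p1 ≠ p2, b1 = b2, e1 = e2) can be sketched in Python as follows. This is an illustrative sketch only: it assumes the sentences are already tokenized into word lists, the names `spans` and `extract_candidate_pairs` are hypothetical, `max_len=4` reflects the "fewer than 5 words" example, and the optional punctuation and stop-word checks are omitted.

```python
def spans(tokens, max_len):
    """Yield (before, phrase, after) for each phrase with a word on both sides."""
    n = len(tokens)
    for i in range(1, n - 1):
        for j in range(i + 1, min(i + max_len, n - 1) + 1):
            yield tokens[i - 1], tuple(tokens[i:j]), tokens[j]


def extract_candidate_pairs(s1, s2, max_len=4):
    """Return the set of (p1, p2) with the same context but different wording."""
    # Index the phrases of s2 by their (before, after) context.
    by_context = {}
    for before, phrase, after in spans(s2, max_len):
        by_context.setdefault((before, after), set()).add(phrase)
    pairs = set()
    for before, p1, after in spans(s1, max_len):
        for p2 in by_context.get((before, after), ()):
            if p1 != p2:  # identical phrases are not paraphrases
                pairs.add((p1, p2))
    return pairs


s1 = ["national", "number-plate", "number", "inquiry"]
s2 = ["national", "license", "plate", "inquiry"]
pairs = extract_candidate_pairs(s1, s2)
```

For the example sentence pair above, the only shared context is ("national", "inquiry"), so the single extracted pair is ("number-plate", "number") with ("license", "plate").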
In addition, the number of times each candidate paraphrase phrase pair is extracted from Q-Q, Q-T and T-T is counted, giving $C_{q-q}(p_1, p_2)$, $C_{q-t}(p_1, p_2)$ and $C_{t-t}(p_1, p_2)$ respectively. Only the candidate pairs whose total count is greater than or equal to a preset count threshold are retained. For example, candidate pairs with $C_{q-q}(p_1, p_2) + C_{q-t}(p_1, p_2) + C_{t-t}(p_1, p_2) \ge 3$ are retained and the others are filtered out.
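A short sketch of this count-based filter, assuming the per-resource counts are held in dictionaries keyed by phrase pair (the function name and data layout are illustrative, not from the source):

```python
from collections import Counter


def filter_by_total_count(counts_qq, counts_qt, counts_tt, min_total=3):
    """Keep pairs whose summed extraction count reaches min_total."""
    total = Counter()
    for counts in (counts_qq, counts_qt, counts_tt):
        total.update(counts)  # adds the counts key by key
    return {pair for pair, n in total.items() if n >= min_total}
```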
Step 103: extract, from the candidate paraphrase phrase pairs, those in which at least one phrase belongs to an attribute-name list.
The purpose of the invention is to mine attribute-name paraphrases, which are a special case of general paraphrase resources. It is therefore necessary to ensure that each candidate paraphrase phrase pair is a candidate paraphrase of an attribute name, which can be done by checking whether at least one phrase of the pair belongs to the attribute-name list.
The attribute-name list is a list of attribute names obtained manually and/or by automatic mining. A candidate pair is retained as long as one of its phrases is in the attribute-name list; there is no requirement on the other phrase, which may or may not be in the list.
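As a sketch, this step reduces to a membership test against the attribute-name list. The function name and the use of plain strings for phrases are assumptions for illustration:

```python
def keep_attribute_pairs(pairs, attribute_names):
    """Keep pairs in which at least one phrase is a known attribute name."""
    names = set(attribute_names)
    return [(p1, p2) for p1, p2 in pairs if p1 in names or p2 in names]
```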
Step 104: perform noise filtering on the candidate paraphrase phrase pairs extracted in step 103 to obtain attribute-name paraphrase phrase pairs.
The noise filtering in this step may adopt, but is not limited to, at least one of the following filtering rules:
Filtering rule one: filtering based on length ratio. If the length ratio of the two phrases of a candidate paraphrase phrase pair is greater than a preset length-ratio threshold, the pair is filtered out. Suppose that in the candidate pair <p1, p2> the lengths of phrases p1 and p2 are $L_1$ and $L_2$ respectively, and that the preset length-ratio threshold is T (for example, T may be 2.5). If $L_1 / L_2 > T$ or $L_2 / L_1 > T$, then <p1, p2> is filtered out.
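A minimal sketch of the length-ratio check. The source does not state whether length is measured in words or characters, so the sketch simply uses `len()` of whatever sequence is passed in and assumes both phrases are non-empty:

```python
def passes_length_ratio(p1, p2, threshold=2.5):
    """Return True if neither phrase is more than `threshold` times longer."""
    l1, l2 = len(p1), len(p2)
    # Equivalent to rejecting when L1/L2 > T or L2/L1 > T.
    return max(l1, l2) / min(l1, l2) <= threshold
```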
Filtering rule two: filtering based on stop words. If the two phrases of a candidate paraphrase phrase pair differ only by stop words, the pair is filtered out. For example, if two phrases that both mean "height" differ only by a stop word such as a function particle, the pair provides no valuable knowledge and is therefore filtered out.
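One way to sketch this check is to compare the two phrases as multisets of tokens and test whether everything left over is a stop word. Treating the "difference" as a multiset difference is an assumption of this sketch, as are the function name and the English stop words used in the example:

```python
from collections import Counter


def differs_only_by_stop_words(p1, p2, stop_words):
    """Return True if the tokens not shared by p1 and p2 are all stop words."""
    c1, c2 = Counter(p1), Counter(p2)
    leftover = (c1 - c2) + (c2 - c1)  # tokens present in only one phrase
    return bool(leftover) and all(tok in stop_words for tok in leftover)


stop_words = {"of", "the"}
drop = differs_only_by_stop_words(["height"], ["the", "height"], stop_words)
```

A pair for which this function returns True would be filtered out.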
Filtering rule three: filtering based on a filter vocabulary. This covers the following cases:
1) Digits and English letters are placed in the filter vocabulary in advance. If the phrase of a candidate paraphrase phrase pair that is not in the attribute-name list contains digits or English letters, the pair is filtered out.
2) Words without independent meaning, such as auxiliary words, prepositions and conjunctions, are placed in the filter vocabulary in advance. If the first or last word of the phrase of a candidate pair that is not in the attribute-name list appears in the filter vocabulary, the pair is filtered out. This filter mainly removes phrases with unreasonable boundaries. Because no linguistic constraint is imposed when phrases are extracted, some extracted phrases begin or end at an unreasonable boundary, for example "height and" or "complete works of", and these should be filtered out.
3) Place names are placed in the filter vocabulary in advance. If the phrase of a candidate pair that is not in the attribute-name list contains a place name, the pair is filtered out. This filter reflects the consideration that an attribute name should not contain a place name that restricts it to a specific region.
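The three cases can be sketched as a single predicate applied to the phrase that is not in the attribute-name list. The function name, the tokenized input and the small vocabularies in the example are illustrative assumptions; Chinese tokens are used because the digit and English-letter rule targets Chinese phrases:

```python
import re

# Matches any ASCII digit or English letter inside a token.
ALNUM = re.compile(r"[0-9A-Za-z]")


def passes_vocabulary_filters(phrase, boundary_words, place_names):
    """phrase: token list of the phrase that is not in the attribute-name list."""
    # Case 1: the phrase contains digits or English letters.
    if any(ALNUM.search(tok) for tok in phrase):
        return False
    # Case 2: the first or last word is a particle, preposition or conjunction.
    if phrase[0] in boundary_words or phrase[-1] in boundary_words:
        return False
    # Case 3: the phrase contains a place name.
    if any(tok in place_names for tok in phrase):
        return False
    return True


boundary_words = {"的", "和", "之"}   # "of", "and", literary "of"
place_names = {"北京"}                # "Beijing"
ok = passes_vocabulary_filters(["邮政", "编码"], boundary_words, place_names)
```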
Filtering rule four: top-N filtering based on word-frequency score. One attribute name may have many candidate paraphrases; for example, the paraphrases of "postcode" may include "postal code", "postcode number", "postal code number" and so on. At most N paraphrases are retained here, which requires a scoring mechanism. The scoring mechanism adopted is based on word frequency: the word-frequency score of each candidate paraphrase phrase pair containing the same phrase (that phrase being one in the attribute-name list) is determined, and the pairs whose scores rank outside the top N are filtered out, where N is a preset positive integer.
The word-frequency score is determined as follows. The word-frequency score score(p2|p1) of a candidate paraphrase phrase pair <p1, p2> can be calculated with the following formula:
$$\mathrm{score}(p_2 \mid p_1) = \lambda_{q-q} P_{q-q}(p_2 \mid p_1) + \lambda_{q-t} P_{q-t}(p_2 \mid p_1) + \lambda_{t-t} P_{t-t}(p_2 \mid p_1) \tag{1}$$
where $P_{q-q}(p_2 \mid p_1)$, $P_{q-t}(p_2 \mid p_1)$ and $P_{t-t}(p_2 \mid p_1)$ are the scores of <p1, p2> based on Q-Q, Q-T and T-T respectively, calculated with the following formulas:
$$P_{q-q}(p_2 \mid p_1) = \frac{C_{q-q}(p_1, p_2)}{\sum_{x} C_{q-q}(p_1, x) + C} \tag{2}$$
$$P_{q-t}(p_2 \mid p_1) = \frac{C_{q-t}(p_1, p_2)}{\sum_{x} C_{q-t}(p_1, x) + C} \tag{3}$$
$$P_{t-t}(p_2 \mid p_1) = \frac{C_{t-t}(p_1, p_2)}{\sum_{x} C_{t-t}(p_1, x) + C} \tag{4}$$
Here $C_{q-q}(p_1, p_2)$, $C_{q-t}(p_1, p_2)$ and $C_{t-t}(p_1, p_2)$ are the numbers of times <p1, p2> is extracted from Q-Q, Q-T and T-T respectively, and $\sum_{x} C_{q-q}(p_1, x)$, $\sum_{x} C_{q-t}(p_1, x)$ and $\sum_{x} C_{t-t}(p_1, x)$ are the sums of the numbers of times that all candidate paraphrase phrase pairs containing phrase p1 are extracted from Q-Q, Q-T and T-T respectively.
C is a preset smoothing factor, for example C = 100. It prevents the overall score from becoming too large when p1 has only a few candidate paraphrases and the denominators of formulas (2), (3) and (4) would otherwise be too small.
$\lambda_{q-q}$, $\lambda_{q-t}$ and $\lambda_{t-t}$ are preset weight coefficients. Because the score $P_{q-t}(p_2 \mid p_1)$ is more important than the other two, $\lambda_{q-t}$ is preferably set greater than $\lambda_{q-q}$ and $\lambda_{t-t}$; for example, $\lambda_{q-q}$, $\lambda_{q-t}$ and $\lambda_{t-t}$ may be set to 0.2, 0.6 and 0.2 respectively.
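The scoring and top-N selection might be sketched as below. The dictionary layout, the function names and the sample counts are assumptions for illustration; the sketch also interprets "all candidate pairs containing phrase p1" as the pairs whose first element is p1. The default weights and smoothing factor are the example values given in the text.

```python
def score(p1, p2, counts, weights, smoothing=100.0):
    """Weighted sum of smoothed relative frequencies over the three resources.

    counts: {"qq": {(p1, p2): n, ...}, "qt": {...}, "tt": {...}}
    weights: {"qq": 0.2, "qt": 0.6, "tt": 0.2}
    """
    total = 0.0
    for source, weight in weights.items():
        c = counts[source]
        # Denominator: counts of all pairs starting with p1, plus smoothing.
        denom = sum(n for (a, _), n in c.items() if a == p1) + smoothing
        total += weight * c.get((p1, p2), 0) / denom
    return total


def top_n(attribute, counts, weights, n, smoothing=100.0):
    """Return the n best-scoring paraphrase candidates of one attribute name."""
    candidates = {b for c in counts.values() for (a, b) in c if a == attribute}
    ranked = sorted(
        candidates,
        key=lambda p2: score(attribute, p2, counts, weights, smoothing),
        reverse=True,
    )
    return ranked[:n]


weights = {"qq": 0.2, "qt": 0.6, "tt": 0.2}
counts = {  # made-up counts, for illustration only
    "qq": {("postcode", "postal code"): 100},
    "qt": {("postcode", "postal code"): 300, ("postcode", "postcode number"): 100},
    "tt": {("postcode", "postcode number"): 100},
}
best = top_n("postcode", counts, weights, n=1)
```

With these made-up counts, "postal code" scores 0.2·100/200 + 0.6·300/500 = 0.46 and "postcode number" scores 0.6·100/500 + 0.2·100/200 = 0.22, so "postal code" is retained for N = 1.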
The above describes the method provided by the present invention. The device provided by the present invention is described in detail below through Embodiment 2.
Embodiment 2
Fig. 2 is a structural diagram of the device for mining attribute-name paraphrases provided by Embodiment 2 of the present invention. As shown in Fig. 2, the device comprises: a candidate sentence pair acquiring unit 201, a first phrase pair extraction unit 202, a second phrase pair extraction unit 203 and a noise filtering unit 204.
The candidate sentence pair acquiring unit 201 obtains at least one of the resources Q-Q, Q-T and T-T from the query log as candidate sentence pairs, where Q-Q is a sentence pair formed by two queries searched by a user within one session, Q-T is a sentence pair formed by a query and a corresponding clicked title, and T-T is a sentence pair formed by two clicked titles corresponding to the same query.
The first phrase pair extraction unit 202 extracts, from each candidate sentence pair, phrase pairs that share the same context as candidate paraphrase phrase pairs. Specifically, it may extract them according to the following phrase extraction rule: the words preceding the two phrases are identical and the words following them are identical, but the two phrases themselves are not identical.
Further, the phrase extraction rule may also comprise at least one of the following: the lengths of the two phrases are within a preset length range; the two phrases contain no punctuation and do not consist entirely of stop words; the words before and after the two phrases are not punctuation.
The second phrase pair extraction unit 203 extracts, from the candidate paraphrase phrase pairs, those in which at least one phrase belongs to the attribute-name list. That is, to ensure that each candidate pair is a candidate paraphrase of an attribute name, it checks whether at least one phrase of the pair belongs to the attribute-name list.
The noise filtering unit 204 performs noise filtering on the candidate paraphrase phrase pairs extracted by the second phrase pair extraction unit 203 to obtain attribute-name paraphrase phrase pairs, the two phrases of each such pair being attribute-name paraphrases of each other.
Preferably, the device may further comprise a candidate filtering unit 205, configured to count the number of times each candidate paraphrase phrase pair extracted by the first phrase pair extraction unit 202 is extracted from Q-Q, Q-T and T-T respectively, to filter out the candidate pairs whose total count is less than a preset count threshold, and to provide the remaining candidate pairs to the second phrase pair extraction unit 203. In this case the second phrase pair extraction unit 203 actually extracts from the candidate pairs that remain after filtering by the candidate filtering unit 205.
When performing noise filtering, the noise filtering unit 204 may adopt at least one of the following filtering rules:
Filtering rule one: filtering based on length ratio. If the length ratio of the two phrases of a candidate paraphrase phrase pair is greater than a preset length-ratio threshold, the pair is filtered out.
Filtering rule two: filtering based on stop words. If the two phrases of a candidate paraphrase phrase pair differ only by stop words, the pair is filtered out.
Filtering rule three: filtering based on a filter vocabulary. This covers the following cases:
1) Digits and English letters are placed in the filter vocabulary in advance. If the phrase of a candidate paraphrase phrase pair that is not in the attribute-name list contains digits or English letters, the pair is filtered out.
2) Words without independent meaning, such as auxiliary words, prepositions and conjunctions, are placed in the filter vocabulary in advance. If the first or last word of the phrase of a candidate pair that is not in the attribute-name list appears in the preset filter vocabulary, the pair is filtered out.
3) Place names are placed in the filter vocabulary in advance. If the phrase of a candidate pair that is not in the attribute-name list contains a place name, the pair is filtered out.
Filtering rule four: top-N filtering based on word-frequency score. The word-frequency score of each candidate paraphrase phrase pair containing the same phrase is determined, and the pairs whose scores rank outside the top N are filtered out, where N is a preset positive integer.
In filtering rule four, the word-frequency score $\mathrm{score}(p_2 \mid p_1)$ of a candidate repeat phrase pair <p1, p2> can be calculated with the following formula:

$$\mathrm{score}(p_2 \mid p_1) = \lambda_{q\text{-}q} P_{q\text{-}q}(p_2 \mid p_1) + \lambda_{q\text{-}t} P_{q\text{-}t}(p_2 \mid p_1) + \lambda_{t\text{-}t} P_{t\text{-}t}(p_2 \mid p_1);$$

where

$$P_{q\text{-}q}(p_2 \mid p_1) = \frac{C_{q\text{-}q}(p_1, p_2)}{\sum_x C_{q\text{-}q}(p_1, x) + C},$$

$$P_{q\text{-}t}(p_2 \mid p_1) = \frac{C_{q\text{-}t}(p_1, p_2)}{\sum_x C_{q\text{-}t}(p_1, x) + C},$$

$$P_{t\text{-}t}(p_2 \mid p_1) = \frac{C_{t\text{-}t}(p_1, p_2)}{\sum_x C_{t\text{-}t}(p_1, x) + C}.$$

Here $C_{q\text{-}q}(p_1, p_2)$, $C_{q\text{-}t}(p_1, p_2)$ and $C_{t\text{-}t}(p_1, p_2)$ are the numbers of times that <p1, p2> is extracted from Q-Q, Q-T and T-T respectively; $\sum_x C_{q\text{-}q}(p_1, x)$, $\sum_x C_{q\text{-}t}(p_1, x)$ and $\sum_x C_{t\text{-}t}(p_1, x)$ are the sums of the numbers of times that all candidate repeat phrase pairs containing phrase p1 are extracted from Q-Q, Q-T and T-T respectively; $C$ is a preset smoothing factor; and $\lambda_{q\text{-}q}$, $\lambda_{q\text{-}t}$ and $\lambda_{t\text{-}t}$ are preset weight coefficients.
Because the $P_{q\text{-}t}(p_2 \mid p_1)$ score is more important than the other two, $\lambda_{q\text{-}t}$ is preferably set greater than $\lambda_{q\text{-}q}$ and $\lambda_{t\text{-}t}$; for example, $\lambda_{q\text{-}q}$, $\lambda_{q\text{-}t}$ and $\lambda_{t\text{-}t}$ can be set to 0.2, 0.6 and 0.2 respectively.
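As a rough illustration of filtering rule four, the score and the top-N cut could be computed as in the sketch below. The weights are the example values 0.2, 0.6 and 0.2 from the text. The smoothing value, the function names and the layout of the count tables (one dictionary per resource, keyed by phrase pair) are assumptions made for the example.

```python
# Example weights from the text; the smoothing factor value is an assumption.
WEIGHTS = {"q-q": 0.2, "q-t": 0.6, "t-t": 0.2}
SMOOTHING_C = 1.0


def score(p1, p2, counts):
    """Compute score(p2|p1); counts[source][(p1, p2)] is the extraction count."""
    total = 0.0
    for source, weight in WEIGHTS.items():
        table = counts.get(source, {})
        # Sum over x of C(p1, x), plus the smoothing factor C.
        denominator = sum(c for (a, _), c in table.items() if a == p1) + SMOOTHING_C
        total += weight * table.get((p1, p2), 0) / denominator
    return total


def top_n(p1, candidates, counts, n):
    """Keep the n candidate phrases with the highest score(p2|p1) for phrase p1."""
    ranked = sorted(candidates, key=lambda p2: score(p1, p2, counts), reverse=True)
    return ranked[:n]
```

Because the Q-T weight dominates, a pair extracted mainly from query-title data outranks a pair with a similar relative frequency that comes only from Q-Q or T-T.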
After repeat phrase pairs have been mined with the above method and device, a large number of phrase pairs that are attribute-name repeats of each other are obtained. These phrase pairs can then be consolidated to determine the repeat phrases corresponding to each attribute name. In addition, because mining may be a periodic process, some of the mined repeat phrases may already have been found in an earlier round of mining. The repeat phrases that already exist in the repeat phrase library can therefore be filtered out, and the remaining repeat phrases added to the library, which contains the repeat phrases corresponding to each attribute name. Before a repeat phrase is added to the library, it can also first be annotated manually to confirm that it really is a repeat of the attribute name. This is in effect a manual filtering step, and only the phrases confirmed by manual annotation to be repeats of the attribute name are added to the repeat phrase library. For example, the information finally held in the repeat phrase library can be as shown in Table 1.
Table 1
Attribute name | Attribute-name repeats
The TV of not liking | Poor TV
The number-plate number | Number, take pictures, automotive license plate number, good board, the trade mark
Net profit | Profit, income, net profit, total profit, profit, income
Camera titbits | Film clips, excellent titbit
Luxurious | Honorable luxury, expensive, most distinguished, top, well-known, famous brand
Song catalog | Song, song complete works
In this way, even if users phrase the same attribute name differently in their queries, the attribute-name intent of the user can be determined by looking up the repeat phrase library, so that targeted information can be returned for the query.
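One way to picture the repeat phrase library and its lookup is the sketch below. It assumes the library is a dictionary that maps each attribute name to a set of repeat phrases. The function names and the `approved` hook standing in for manual annotation are hypothetical.

```python
def merge_into_library(library, new_pairs, approved=lambda name, phrase: True):
    """Add (attribute_name, repeat_phrase) pairs that the library does not hold yet.

    `approved` is a hypothetical stand-in for the manual annotation step.
    """
    for name, phrase in new_pairs:
        if phrase in library.get(name, set()):
            continue  # already mined in an earlier round
        if approved(name, phrase):
            library.setdefault(name, set()).add(phrase)
    return library


def resolve_attribute(term, library):
    """Map a query term to its attribute name, or return None if it is unknown."""
    if term in library:
        return term
    for name, phrases in library.items():
        if term in phrases:
            return name
    return None


# Two rows of Table 1 used as sample data.
library = merge_into_library({}, [
    ("Net profit", "total profit"),
    ("Net profit", "income"),
    ("Song catalog", "song complete works"),
])
```

With this sample data, `resolve_attribute("total profit", library)` returns `"Net profit"`, so a query that uses either wording can be served with the same attribute.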
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (14)

1. A method for mining attribute-name repeats, characterized in that the method comprises the following steps:
S1, obtaining at least one of the resources Q-Q, Q-T and T-T from a search log as candidate sentence pairs, wherein Q-Q is a sentence pair formed by two queries searched by a user within one session, Q-T is a sentence pair formed by a query and the title of a corresponding clicked web page, and T-T is a sentence pair formed by two clicked titles corresponding to the same query;
S2, extracting, from each candidate sentence pair, phrase pairs that have the same context as candidate repeat phrase pairs;
S3, extracting, from the candidate repeat phrase pairs, the candidate repeat phrase pairs in which at least one phrase belongs to an attribute-name list;
S4, performing noise filtering on the candidate repeat phrase pairs extracted in step S3 to obtain attribute-name repeat phrase pairs, the two phrases of each attribute-name repeat phrase pair being attribute-name repeats of each other.
2. The method according to claim 1, characterized in that, in step S2, phrase pairs are extracted as candidate repeat phrase pairs according to the following phrase extraction rule: the word preceding the two phrases is the same and the word following the two phrases is the same, but the two phrases themselves are different.
3. The method according to claim 2, characterized in that the phrase extraction rule further comprises at least one of the following: the lengths of the two phrases are within a preset length range; the two phrases contain no punctuation and do not consist entirely of stop words; or the words immediately before and after the two phrases are not punctuation.
4. The method according to claim 1, characterized in that step S2 further comprises: counting the number of times each candidate repeat phrase pair is extracted from Q-Q, Q-T and T-T respectively, and filtering out the candidate repeat phrase pairs whose total count is less than a preset count threshold.
5. The method according to claim 1, characterized in that the noise filtering in step S4 comprises at least one of the following:
if the length ratio of the two phrases of a candidate repeat phrase pair is greater than a preset length-ratio threshold, filtering out the candidate repeat phrase pair;
if the two phrases of a candidate repeat phrase pair differ only by stop words, filtering out the candidate repeat phrase pair;
if the phrase of a candidate repeat phrase pair that is not included in the attribute-name list contains a digit or an English letter, filtering out the candidate repeat phrase pair;
if the head word or tail word of the phrase of a candidate repeat phrase pair that is not included in the attribute-name list appears in a preset filter vocabulary, filtering out the candidate repeat phrase pair;
if the phrase of a candidate repeat phrase pair that is not included in the attribute-name list contains a place name, filtering out the candidate repeat phrase pair;
determining the word-frequency score of each candidate repeat phrase pair that contains the same phrase, and filtering out the candidate repeat phrase pairs whose word-frequency score ranks outside the top N, N being a preset positive integer.
6. The method according to claim 5, characterized in that the word-frequency score $\mathrm{score}(p_2 \mid p_1)$ of a candidate repeat phrase pair <p1, p2> is calculated with the following formula:

$$\mathrm{score}(p_2 \mid p_1) = \lambda_{q\text{-}q} P_{q\text{-}q}(p_2 \mid p_1) + \lambda_{q\text{-}t} P_{q\text{-}t}(p_2 \mid p_1) + \lambda_{t\text{-}t} P_{t\text{-}t}(p_2 \mid p_1);$$

wherein

$$P_{q\text{-}q}(p_2 \mid p_1) = \frac{C_{q\text{-}q}(p_1, p_2)}{\sum_x C_{q\text{-}q}(p_1, x) + C},$$

$$P_{q\text{-}t}(p_2 \mid p_1) = \frac{C_{q\text{-}t}(p_1, p_2)}{\sum_x C_{q\text{-}t}(p_1, x) + C},$$

$$P_{t\text{-}t}(p_2 \mid p_1) = \frac{C_{t\text{-}t}(p_1, p_2)}{\sum_x C_{t\text{-}t}(p_1, x) + C},$$

$C_{q\text{-}q}(p_1, p_2)$, $C_{q\text{-}t}(p_1, p_2)$ and $C_{t\text{-}t}(p_1, p_2)$ are the numbers of times that <p1, p2> is extracted from Q-Q, Q-T and T-T respectively; $\sum_x C_{q\text{-}q}(p_1, x)$, $\sum_x C_{q\text{-}t}(p_1, x)$ and $\sum_x C_{t\text{-}t}(p_1, x)$ are the sums of the numbers of times that all candidate repeat phrase pairs containing phrase p1 are extracted from Q-Q, Q-T and T-T respectively; $C$ is a preset smoothing factor; and $\lambda_{q\text{-}q}$, $\lambda_{q\text{-}t}$ and $\lambda_{t\text{-}t}$ are preset weight coefficients.
7. The method according to claim 6, characterized in that $\lambda_{q\text{-}t}$ is greater than $\lambda_{q\text{-}q}$ and $\lambda_{t\text{-}t}$.
8. A device for mining attribute-name repeats, characterized in that the device comprises:
a candidate sentence pair acquiring unit, configured to obtain at least one of the resources Q-Q, Q-T and T-T from a search log as candidate sentence pairs, wherein Q-Q is a sentence pair formed by two queries searched by a user within one session, Q-T is a sentence pair formed by a query and the title of a corresponding clicked web page, and T-T is a sentence pair formed by two clicked titles corresponding to the same query;
a first phrase pair extraction unit, configured to extract, from each candidate sentence pair, phrase pairs that have the same context as candidate repeat phrase pairs;
a second phrase pair extraction unit, configured to extract, from the candidate repeat phrase pairs, the candidate repeat phrase pairs in which at least one phrase belongs to an attribute-name list;
a noise filtering unit, configured to perform noise filtering on the candidate repeat phrase pairs extracted by the second phrase pair extraction unit to obtain attribute-name repeat phrase pairs, the two phrases of each attribute-name repeat phrase pair being attribute-name repeats of each other.
9. The device according to claim 8, characterized in that the first phrase pair extraction unit extracts phrase pairs as candidate repeat phrase pairs according to the following phrase extraction rule: the word preceding the two phrases is the same and the word following the two phrases is the same, but the two phrases themselves are different.
10. The device according to claim 9, characterized in that the phrase extraction rule further comprises at least one of the following: the lengths of the two phrases are within a preset length range; the two phrases contain no punctuation and do not consist entirely of stop words; or the words immediately before and after the two phrases are not punctuation.
11. The device according to claim 8, characterized in that the device further comprises:
a candidate filtering unit, configured to count the number of times each candidate repeat phrase pair extracted by the first phrase pair extraction unit is extracted from Q-Q, Q-T and T-T respectively, filter out the candidate repeat phrase pairs whose total count is less than a preset count threshold, and provide the remaining candidate repeat phrase pairs to the second phrase pair extraction unit.
12. The device according to claim 8, characterized in that the noise filtering performed by the noise filtering unit comprises at least one of the following:
if the length ratio of the two phrases of a candidate repeat phrase pair is greater than a preset length-ratio threshold, filtering out the candidate repeat phrase pair;
if the two phrases of a candidate repeat phrase pair differ only by stop words, filtering out the candidate repeat phrase pair;
if the phrase of a candidate repeat phrase pair that is not included in the attribute-name list contains a digit or an English letter, filtering out the candidate repeat phrase pair;
if the head word or tail word of the phrase of a candidate repeat phrase pair that is not included in the attribute-name list appears in a preset filter vocabulary, filtering out the candidate repeat phrase pair;
if the phrase of a candidate repeat phrase pair that is not included in the attribute-name list contains a place name, filtering out the candidate repeat phrase pair;
determining the word-frequency score of each candidate repeat phrase pair that contains the same phrase, and filtering out the candidate repeat phrase pairs whose word-frequency score ranks outside the top N, N being a preset positive integer.
13. The device according to claim 12, characterized in that the noise filtering unit calculates the word-frequency score $\mathrm{score}(p_2 \mid p_1)$ of a candidate repeat phrase pair <p1, p2> with the following formula:

$$\mathrm{score}(p_2 \mid p_1) = \lambda_{q\text{-}q} P_{q\text{-}q}(p_2 \mid p_1) + \lambda_{q\text{-}t} P_{q\text{-}t}(p_2 \mid p_1) + \lambda_{t\text{-}t} P_{t\text{-}t}(p_2 \mid p_1);$$

wherein

$$P_{q\text{-}q}(p_2 \mid p_1) = \frac{C_{q\text{-}q}(p_1, p_2)}{\sum_x C_{q\text{-}q}(p_1, x) + C},$$

$$P_{q\text{-}t}(p_2 \mid p_1) = \frac{C_{q\text{-}t}(p_1, p_2)}{\sum_x C_{q\text{-}t}(p_1, x) + C},$$

$$P_{t\text{-}t}(p_2 \mid p_1) = \frac{C_{t\text{-}t}(p_1, p_2)}{\sum_x C_{t\text{-}t}(p_1, x) + C},$$

$C_{q\text{-}q}(p_1, p_2)$, $C_{q\text{-}t}(p_1, p_2)$ and $C_{t\text{-}t}(p_1, p_2)$ are the numbers of times that <p1, p2> is extracted from Q-Q, Q-T and T-T respectively; $\sum_x C_{q\text{-}q}(p_1, x)$, $\sum_x C_{q\text{-}t}(p_1, x)$ and $\sum_x C_{t\text{-}t}(p_1, x)$ are the sums of the numbers of times that all candidate repeat phrase pairs containing phrase p1 are extracted from Q-Q, Q-T and T-T respectively; $C$ is a preset smoothing factor; and $\lambda_{q\text{-}q}$, $\lambda_{q\text{-}t}$ and $\lambda_{t\text{-}t}$ are preset weight coefficients.
14. The device according to claim 13, characterized in that $\lambda_{q\text{-}t}$ is greater than $\lambda_{q\text{-}q}$ and $\lambda_{t\text{-}t}$.
CN201210307150.8A 2012-08-24 2012-08-24 A kind of method and apparatus for excavating attribute name repeat Active CN103631817B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210307150.8A CN103631817B (en) 2012-08-24 2012-08-24 A kind of method and apparatus for excavating attribute name repeat

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210307150.8A CN103631817B (en) 2012-08-24 2012-08-24 A kind of method and apparatus for excavating attribute name repeat

Publications (2)

Publication Number Publication Date
CN103631817A true CN103631817A (en) 2014-03-12
CN103631817B CN103631817B (en) 2018-04-03

Family

ID=50212884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210307150.8A Active CN103631817B (en) 2012-08-24 2012-08-24 A kind of method and apparatus for excavating attribute name repeat

Country Status (1)

Country Link
CN (1) CN103631817B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080126408A1 (en) * 2006-06-23 2008-05-29 Invensys Systems, Inc. Presenting continuous timestamped time-series data values for observed supervisory control and manufacturing/production parameters
JP2009070282A (en) * 2007-09-14 2009-04-02 Fujifilm Corp Content retrieval device, and program
CN101599985A (en) * 2008-06-05 2009-12-09 华为技术有限公司 Content is obtained and content reception method, server and terminal
US20110231347A1 (en) * 2010-03-16 2011-09-22 Microsoft Corporation Named Entity Recognition in Query

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵军: ""命名实体识别、排歧和跨语言关联"", 《中文信息学报》 *

Also Published As

Publication number Publication date
CN103631817B (en) 2018-04-03

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant