CN102063497A - Open type knowledge sharing platform and entry processing method thereof - Google Patents

Open type knowledge sharing platform and entry processing method thereof

Publication number
CN102063497A
Authority
CN
China
Prior art keywords
entry
directory
mark
feature
polysemant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010619675
Other languages
Chinese (zh)
Other versions
CN102063497B (en
Inventor
邓亮
陈浩然
来瑾颖
唐益龙
梁东杰
耿磊
李永强
严冰
韦晨曦
乔峤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN 201010619675
Publication of CN102063497A
Application granted
Publication of CN102063497B
Legal status: Active
Anticipated expiration

Abstract

The invention provides an open knowledge sharing platform and an entry processing method thereof. The entry processing method comprises: acquiring an entry and its entry content; judging whether the entry is a polysemous word relating to different topics; and, if so, dividing the entry content relating to the different topics under the corresponding word-sense options. With this technical scheme, entry content relating to different topics is divided under different word-sense options, so that the content of the open knowledge sharing platform has a finer granularity, content that shares a name but covers different topics is easier to edit and improve, extended content can be introduced in a targeted way, and the user's browsing experience is improved.

Description

An open knowledge sharing platform and an entry processing method thereof
[technical field]
The present invention relates to network technology, and in particular to an open knowledge sharing platform and an entry processing method thereof.
[background art]
With the development of Internet technology, open knowledge sharing platforms have come into widespread use on the Internet. On such a platform, a large number of users edit entries so that people who need the information can consult them. However, an open knowledge sharing platform often contains polysemous words that relate to different topics. For example, the entry "Sun Yue" may refer to "singer Sun Yue" or to "athlete Sun Yue". Likewise, the entry "apple" may refer to a plant, a company, domestic and foreign films, and so on. At present, existing open knowledge sharing platforms present all the entry content of a polysemous word to the user together, regardless of topic. The user has to search through a large amount of entry content for the explanation he or she wants, which makes for a poor browsing experience.
[summary of the invention]
In view of this, the present invention provides an open knowledge sharing platform and an entry processing method thereof, which divide entry content relating to different topics under different word-sense options. This makes the granularity of the platform's content finer, makes it easier to edit and improve same-named content on different topics and to introduce extended content in a targeted way, and thereby improves the user's browsing experience.
The present invention provides an entry processing method for an open knowledge sharing platform, characterized in that the entry processing method comprises: a. obtaining an entry and entry content; b. judging whether the entry is a polysemous word relating to different topics; c. if the entry is a polysemous word relating to different topics, dividing the entry content relating to the different topics under the corresponding word-sense options.
According to a preferred embodiment of the present invention, the entry content comprises a plurality of directories, and in step b, whether the entry is a polysemous word relating to different topics is judged according to the directory information in the directories.
According to a preferred embodiment of the present invention, in step a, classification information of the entry is further obtained, and in step b, whether the entry is a polysemous word relating to different topics is judged according to the classification information.
According to a preferred embodiment of the present invention, step b comprises: b1. performing feature extraction on the entry content to obtain a plurality of entry features; b2. obtaining a labeled feature set, the labeled feature set comprising a plurality of labeled features that carry weights; b3. assigning a corresponding weight to each entry feature according to the labeled features; and b4. summing the weights of the plurality of entry features, and treating an entry whose weight sum is higher than a threshold as a polysemous word.
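Steps b3 and b4 can be illustrated with a minimal Python sketch. The function names, the example weights, and the threshold of 1.5 are hypothetical and are not taken from the patent; treating features missing from the labeled feature set as contributing zero weight is also an assumption.

```python
from typing import Dict, Iterable


def weight_sum(entry_features: Iterable[str],
               labeled_weights: Dict[str, float]) -> float:
    """Steps b3-b4: look up the weight of each entry feature and sum them."""
    # Assumption: a feature absent from the labeled feature set adds nothing.
    return sum(labeled_weights.get(feature, 0.0) for feature in entry_features)


def is_polysemous(entry_features: Iterable[str],
                  labeled_weights: Dict[str, float],
                  threshold: float) -> bool:
    """Step b4: the entry counts as polysemous if the sum exceeds the threshold."""
    return weight_sum(entry_features, labeled_weights) > threshold


# Hypothetical labeled feature set (feature -> weight).
EXAMPLE_WEIGHTS = {"singer": 0.8, "athlete": 0.75, "biography": 0.2}

if __name__ == "__main__":
    features = ["singer", "athlete", "biography"]
    print(is_polysemous(features, EXAMPLE_WEIGHTS, threshold=1.5))
```

In practice the threshold would need to be tuned against labeled data; the patent does not state a value.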
According to a preferred embodiment of the present invention, step b2 further comprises: b21. obtaining a labeled corpus comprising a plurality of polysemous-word samples and non-polysemous-word samples; b22. extracting a plurality of labeled features from the labeled corpus; b23. assigning a corresponding weight to each labeled feature according to its occurrences in the polysemous-word samples and in the non-polysemous-word samples.
According to a preferred embodiment of the present invention, in step b23, the number of times M that the labeled feature occurs in the polysemous-word samples is counted, the number of times N that the labeled feature occurs in the non-polysemous-word samples is counted, and the weight is calculated as M/(M+N).
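The weight M/(M+N) from step b23 can be sketched as follows. This is an illustrative sketch only: the function name and the sample data are hypothetical, and each sample is assumed to be already reduced to a list of feature strings.

```python
from collections import Counter
from typing import Dict, Iterable, List


def labeled_feature_weights(
    polysemous_samples: List[Iterable[str]],
    non_polysemous_samples: List[Iterable[str]],
) -> Dict[str, float]:
    """Return weight = M / (M + N) for every feature seen in the corpus.

    M is the number of occurrences in polysemous samples and
    N is the number of occurrences in non-polysemous samples.
    """
    m = Counter(f for sample in polysemous_samples for f in sample)
    n = Counter(f for sample in non_polysemous_samples for f in sample)
    # Every feature in the union occurs at least once, so M + N > 0.
    return {f: m[f] / (m[f] + n[f]) for f in set(m) | set(n)}


if __name__ == "__main__":
    polysemous = [["singer", "athlete"], ["film", "company", "singer"]]
    non_polysemous = [["singer", "history"], ["history"]]
    for feature, weight in sorted(
        labeled_feature_weights(polysemous, non_polysemous).items()
    ):
        print(feature, round(weight, 3))
```

A feature seen only in polysemous samples gets weight 1, and one seen only in non-polysemous samples gets weight 0.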
According to a preferred embodiment of the present invention, in step c, each first-level directory of the entry, together with the directory content under it, is taken as a basic prediction unit; whether the basic prediction units belong to different topics is predicted; and, according to the prediction result, the basic prediction units belonging to different topics are divided under the corresponding word-sense options.
According to a preferred embodiment of the present invention, in step c, the prediction is made according to the directory information in the first-level directories.
According to a preferred embodiment of the present invention, step c further comprises: c1. extracting a plurality of directory features from the basic prediction units; c2. judging the degree of association between the basic prediction units according to the directory features, and generating the prediction result according to the degree of association.
According to a preferred embodiment of the present invention, step c further comprises: c1. obtaining the directory information of the entry; c2. extracting directory features from the directory information; c3. obtaining a machine model containing the association relations of directory features; c4. applying the machine model to the extracted directory features to calculate the degree of association between the directory features of adjacent directory information; c5. labeling the directory information according to the association calculation result.
According to a preferred embodiment of the present invention, step c2 further comprises: performing word segmentation before extracting the directory features.
According to a preferred embodiment of the present invention, the word segmentation method comprises: forward matching segmentation, reverse matching segmentation, bidirectional matching segmentation, segmentation based on a full-segmentation word graph, maximum entropy Markov model segmentation, maximum entropy segmentation, or conditional random field segmentation.
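Of the segmentation methods listed, forward matching is the simplest to illustrate. The sketch below implements forward maximum matching, a common variant of forward matching; the patent does not say which variant is intended, and the vocabulary and the unspaced ASCII input are hypothetical stand-ins for a real dictionary and Chinese text.

```python
from typing import List, Set


def forward_max_match(text: str, vocabulary: Set[str]) -> List[str]:
    """Greedy left-to-right segmentation: take the longest dictionary word
    starting at the current position, falling back to a single character."""
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    vocab = {"basket", "basketball", "ball", "career"}
    print(forward_max_match("basketballcareer", vocab))
```

Reverse matching works the same way from the end of the string, and bidirectional matching compares the two results.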
According to a preferred embodiment of the present invention, step c3 further comprises: c31. obtaining a sample library of polysemous entries that have been classified into word-sense options for different topics; c32. obtaining the directory information of the word-sense options; c33. extracting the directory features of the word-sense options from the directory information of the word-sense options; c34. performing machine modeling on the directory features of the word-sense options to generate a machine model containing the association relations between the directory features of adjacent directory information of the word-sense options.
According to a preferred embodiment of the present invention, the association relations between the directory features of adjacent directory information of the word-sense options comprise one of, or a combination of: the number of identical words, the lexical attributes of the identical words, the proportion of the directory information occupied by the identical words, the number of related words, the lexical attributes of the related words, and the proportion of the directory information occupied by the related words.
According to a preferred embodiment of the present invention, in step c4, the method of calculating the degree of association comprises counting the identical words in the directory features of adjacent directory information.
According to a preferred embodiment of the present invention, in step c4, the method of calculating the degree of association further comprises determining the lexical attributes of the identical words.
According to a preferred embodiment of the present invention, in step c4, the method of calculating the degree of association further comprises calculating the proportion of the directory information occupied by the identical words.
According to a preferred embodiment of the present invention, in step c4, the method of calculating the degree of association comprises counting the related words in the directory features of adjacent directory information.
According to a preferred embodiment of the present invention, in step c4, the method of calculating the degree of association further comprises determining the lexical attributes of the related words.
According to a preferred embodiment of the present invention, in step c4, the method of calculating the degree of association further comprises calculating the proportion of the directory information occupied by the related words.
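The counting and proportion measures named for step c4 can be sketched in Python. Several choices here are assumptions rather than anything the patent specifies: the proportion is taken relative to the distinct words of the current directory information, related words come from a hypothetical table of related pairs, and lexical attributes are omitted.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Pair = Tuple[str, str]


def association_features(
    prev_words: Iterable[str],
    curr_words: Iterable[str],
    related_pairs: FrozenSet[Pair] = frozenset(),
) -> Dict[str, float]:
    """Compare the segmented words of two adjacent pieces of directory information."""
    prev_set, curr_set = set(prev_words), set(curr_words)
    identical = prev_set & curr_set
    # A current word is "related" if the table links it to any previous word.
    related = {
        b for b in curr_set - identical
        if any((a, b) in related_pairs or (b, a) in related_pairs for a in prev_set)
    }
    size = len(curr_set)
    return {
        "identical_count": len(identical),
        "identical_ratio": len(identical) / size if size else 0.0,
        "related_count": len(related),
        "related_ratio": len(related) / size if size else 0.0,
    }


if __name__ == "__main__":
    pairs = frozenset({("albums", "awards")})  # hypothetical related-word table
    print(association_features(["singer", "albums"], ["singer", "awards"], pairs))
```

These values could serve as inputs to the machine model of step c3, or be thresholded directly in a simpler setup.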
According to a preferred embodiment of the present invention, step c5 further comprises: c51. classifying adjacent directory information as related or unrelated according to the association calculation result; c52. labeling the first directory, and directory information that is related to the preceding directory information, with a first label; c53. labeling directory information that is unrelated to the preceding directory information with a second label.
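The labeling in steps c51 to c53 can be sketched as below. The label values "A" and "B", the numeric scores, and the threshold are hypothetical. Grouping the directories at each second label is one plausible way to turn the labels into word-sense options; it is an assumption, not a procedure stated in the patent.

```python
from typing import List, Sequence

FIRST_LABEL = "A"   # first directory, or related to the preceding directory
SECOND_LABEL = "B"  # unrelated to the preceding directory


def label_directories(scores: Sequence[float], threshold: float) -> List[str]:
    """scores[i] is the association between directory i and directory i + 1."""
    labels = [FIRST_LABEL]  # c52: the first directory always gets the first label
    for score in scores:
        labels.append(FIRST_LABEL if score >= threshold else SECOND_LABEL)
    return labels


def group_by_labels(directories: Sequence[str], labels: Sequence[str]) -> List[List[str]]:
    """Assumed follow-up: start a new group wherever the second label appears."""
    groups: List[List[str]] = []
    for directory, label in zip(directories, labels):
        if label == SECOND_LABEL or not groups:
            groups.append([])
        groups[-1].append(directory)
    return groups


if __name__ == "__main__":
    dirs = ["Singer Sun Yue", "Albums", "Athlete Sun Yue", "Career"]
    labels = label_directories([0.9, 0.1, 0.8], threshold=0.5)
    print(labels)
    print(group_by_labels(dirs, labels))
```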
The present invention also provides an open knowledge sharing platform, which comprises: an entry acquisition module, which obtains an entry and entry content; a polysemous-word judgment module, which judges whether the entry is a polysemous word relating to different topics; and a word-sense option division module, which, if the entry is a polysemous word relating to different topics, divides the entry content relating to the different topics under the corresponding word-sense options.
According to a preferred embodiment of the present invention, the entry content comprises a plurality of directories, and the polysemous-word judgment module judges whether the entry is a polysemous word relating to different topics according to the directory information in the directories.
According to a preferred embodiment of the present invention, the entry acquisition module further obtains classification information of the entry, and the polysemous-word judgment module judges whether the entry is a polysemous word relating to different topics according to the classification information.
According to a preferred embodiment of the present invention, the polysemous-word judgment module comprises: an entry feature extraction module, which performs feature extraction on the entry content to obtain a plurality of entry features; a labeled feature set acquisition module, which obtains a labeled feature set comprising a plurality of labeled features that carry weights; an entry feature weight calculation module, which assigns a corresponding weight to each entry feature according to the labeled features; and a threshold judgment module, which sums the weights of the plurality of entry features and treats an entry whose weight sum is higher than a threshold as a polysemous word.
According to a preferred embodiment of the present invention, the labeled feature set acquisition module further comprises: a labeled corpus acquisition module, which obtains a labeled corpus comprising a plurality of polysemous-word samples and non-polysemous-word samples; a labeled feature extraction module, which extracts a plurality of labeled features from the labeled corpus; and a labeled feature weight calculation module, which assigns a corresponding weight to each labeled feature according to its occurrences in the polysemous-word samples and in the non-polysemous-word samples.
According to a preferred embodiment of the present invention, the labeled feature weight calculation module counts the number of times M that the labeled feature occurs in the polysemous-word samples, counts the number of times N that the labeled feature occurs in the non-polysemous-word samples, and calculates the weight as M/(M+N).
According to a preferred embodiment of the present invention, the word-sense option division module takes each first-level directory of the entry, together with the directory content under it, as a basic prediction unit, predicts whether the basic prediction units belong to different topics, and, according to the prediction result, divides the basic prediction units belonging to different topics under the corresponding word-sense options.
According to a preferred embodiment of the present invention, the word-sense option division module makes the prediction according to the directory information in the first-level directories.
According to a preferred embodiment of the present invention, the word-sense option division module further comprises: a directory information acquisition module, which obtains the directory information of the entry; a directory feature extraction module, which extracts directory features from the directory information; a machine model acquisition module, which obtains a machine model containing the association relations of directory features; an association calculation module, which applies the machine model to the extracted directory features to calculate the degree of association between the directory features of adjacent directory information; and a labeling module, which labels the directory information according to the association calculation result.
According to a preferred embodiment of the present invention, the machine model acquisition module further comprises: a word-sense option sample acquisition module, which obtains a sample library of polysemous entries that have been classified into word-sense options for different topics; a word-sense option directory information acquisition module, which obtains the directory information of the word-sense options; a word-sense option directory feature extraction module, which extracts the directory features of the word-sense options from the directory information of the word-sense options; and a machine modeling module, which performs machine modeling on the directory features of the word-sense options to generate a machine model containing the association relations between the directory features of adjacent directory information of the word-sense options.
According to a preferred embodiment of the present invention, the association relations between the directory features of adjacent directory information of the word-sense options comprise one of, or a combination of: the number of identical words, the lexical attributes of the identical words, the proportion of the directory information occupied by the identical words, the number of related words, the lexical attributes of the related words, and the proportion of the directory information occupied by the related words.
According to a preferred embodiment of the present invention, the method by which the association calculation module calculates the degree of association comprises counting the identical words in the directory features of adjacent directory information.
According to a preferred embodiment of the present invention, the method by which the association calculation module calculates the degree of association further comprises determining the lexical attributes of the identical words.
According to a preferred embodiment of the present invention, the method by which the association calculation module calculates the degree of association further comprises calculating the proportion of the directory information occupied by the identical words.
According to a preferred embodiment of the present invention, the method by which the association calculation module calculates the degree of association comprises counting the related words in the directory features of adjacent directory information.
According to a preferred embodiment of the present invention, the method by which the association calculation module calculates the degree of association further comprises determining the lexical attributes of the related words.
According to a preferred embodiment of the present invention, the method by which the association calculation module calculates the degree of association further comprises calculating the proportion of the directory information occupied by the related words.
According to a preferred embodiment of the present invention, the labeling module further comprises: an association classification module, which classifies adjacent directory information as related or unrelated according to the association calculation result; a first labeling module, which labels the first directory, and directory information that is related to the preceding directory information, with a first label; and a second labeling module, which labels directory information that is unrelated to the preceding directory information with a second label.
With the technical scheme described above, the present invention provides an open knowledge sharing platform and an entry processing method thereof that can divide entry content relating to different topics under different word-sense options. This makes the granularity of the platform's content finer, makes it easier to edit and improve same-named content on different topics and to introduce extended content in a targeted way, and thereby improves the user's browsing experience.
[brief description of the drawings]
Fig. 1 is a schematic flowchart of the polysemous-word display method of the open knowledge sharing platform of the present invention;
Fig. 2 is a schematic block diagram of the open knowledge sharing platform of the present invention;
Fig. 3 is a schematic flowchart of the entry processing method of the open knowledge sharing platform of the present invention;
Fig. 4 is a schematic flowchart of the polysemous-word determination method of the open knowledge sharing platform of the present invention;
Fig. 5 is a schematic flowchart of the labeled feature set acquisition method of the open knowledge sharing platform of the present invention;
Fig. 6 is a schematic flowchart of the word-sense option division method for ambiguous entries of the open knowledge sharing platform of the present invention;
Fig. 7 is a schematic block diagram of the entry processing apparatus of the open knowledge sharing platform of the present invention;
Fig. 8 is a schematic block diagram of the polysemous-word determination apparatus of the open knowledge sharing platform of the present invention;
Fig. 9 is a schematic block diagram of the labeled feature set acquisition apparatus of the open knowledge sharing platform of the present invention;
Fig. 10 is a schematic block diagram of the word-sense option division apparatus for ambiguous entries of the open knowledge sharing platform of the present invention.
[detailed description of the embodiments]
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.
Referring first to Fig. 1, Fig. 1 is a schematic flowchart of the polysemous-word display method of the open knowledge sharing platform of the present invention.
In step 10, the entry content of a single entry that relates to different topics is associated with the corresponding word-sense options. For example, in this step, the entry content of the entry "Sun Yue" that relates to "singer Sun Yue" is divided and associated under the word-sense option "singer Sun Yue", and the entry content that relates to "athlete Sun Yue" is divided and associated under the word-sense option "athlete Sun Yue". The specific process of determining polysemous words and the specific process of dividing the entry content are described below.
In step 11, a query request from a user is received. Specifically, the user sends the query request through a browser, and the query request is sent to the open knowledge sharing platform over the Internet.
In step 12, the entry that matches the query request is looked up.
In step 13, a plurality of word-sense options corresponding to the matching entry are output and presented in the browser. Specifically, after receiving the query request, the open knowledge sharing platform looks up the entry that matches the query request in a database. If the entry is a polysemous word relating to different topics, the platform outputs the plurality of word-sense options corresponding to the matching entry and presents them in the browser. For example, when the query request entered by the user is "Sun Yue", the open knowledge sharing platform outputs the two word-sense options "singer Sun Yue" and "athlete Sun Yue" and presents them in the browser.
In step 14, the associated entry content is displayed according to the user's request for a word-sense option. Specifically, the user identifies the topic of interest from the word-sense options and clicks the corresponding word-sense option. The open knowledge sharing platform then outputs the entry content associated with that word-sense option to the browser, where it is displayed to the user. For example, if the user is interested in "singer Sun Yue" and clicks the word-sense option "singer Sun Yue", the open knowledge sharing platform outputs the entry content relating to "singer Sun Yue" to the browser, where it is displayed to the user.
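The query flow of steps 11 to 14 can be sketched with an in-memory dictionary standing in for the platform's database. The store layout, the function names, and the placeholder content strings are all hypothetical; a real platform would sit behind a web server and a persistent store.

```python
from typing import Dict, List

# Hypothetical store: entry -> {word-sense option -> associated entry content}.
DATABASE: Dict[str, Dict[str, str]] = {
    "Sun Yue": {
        "singer Sun Yue": "Entry content about the singer.",
        "athlete Sun Yue": "Entry content about the athlete.",
    },
}


def sense_options_for(query: str) -> List[str]:
    """Steps 12-13: find the matching entry and return its word-sense options."""
    return list(DATABASE.get(query, {}))


def content_for(query: str, option: str) -> str:
    """Step 14: return the entry content associated with the chosen option."""
    return DATABASE[query][option]


if __name__ == "__main__":
    options = sense_options_for("Sun Yue")
    print(options)
    print(content_for("Sun Yue", options[0]))
```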
In step 13, the plurality of word-sense options presented in the browser are preferably sorted according to user behavior, so that the word-sense options that attract the most attention are placed first, which further improves the user's browsing experience.
For example, the word-sense options may be sorted according to the display count of the entry content, the user's browsing time on the entry content, or the ratio of the click count of the entry content to its display count. The display count of the entry content is the number of times the associated entry content has been displayed in response to users' requests for the word-sense option. In general, the higher the display count of the entry content, the more attention the entry content receives, and the higher its word-sense option should be ranked. The user's browsing time on the entry content is the time from when the entry content is displayed until it is closed. The longer the browsing time, the more attention the user pays to the entry, and the higher its word-sense option should be ranked. The click count of the entry content is the number of times users click on items such as titles, pictures, or links in the displayed entry content. The higher the ratio of the click count to the display count, the more attention users pay to the entry, and the higher its word-sense option should be ranked.
In a preferred embodiment, the three criteria above are considered together to sort the word-sense options. That is, the word-sense options are sorted according to weighted statistics of the display count of the word-sense content, the user's browsing time on the word-sense content, and the ratio of the click count of the word-sense content to its display count. The specific weighted statistics algorithm can be designed according to actual needs. For example, when counting the displays of the word-sense content, displays with a short browsing time may be given a lower weight before they are included in the display count, which reduces the influence of displays caused by accidental user operations on the ranking of the word-sense options.
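Since the text leaves the weighted statistics algorithm open, the following Python sketch shows only one possible design. The linear combination, every coefficient, the 3-second cutoff for a short view, and the class and function names are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SenseStats:
    name: str
    view_durations: List[float]  # browsing time in seconds, one value per display
    clicks: int                  # clicks on titles, pictures, or links


def score(stats: SenseStats,
          short_view_seconds: float = 3.0,
          short_view_weight: float = 0.2,
          w_displays: float = 1.0,
          w_time: float = 0.1,
          w_click_ratio: float = 10.0) -> float:
    """Combine display count, browsing time, and click-to-display ratio."""
    # Displays with a short browsing time count less (likely accidental).
    weighted_displays = sum(
        short_view_weight if d < short_view_seconds else 1.0
        for d in stats.view_durations
    )
    displays = len(stats.view_durations)
    avg_time = sum(stats.view_durations) / displays if displays else 0.0
    click_ratio = stats.clicks / displays if displays else 0.0
    return w_displays * weighted_displays + w_time * avg_time + w_click_ratio * click_ratio


def rank_sense_options(options: List[SenseStats]) -> List[SenseStats]:
    """Highest score first."""
    return sorted(options, key=score, reverse=True)


if __name__ == "__main__":
    singer = SenseStats("singer Sun Yue", [30.0, 45.0, 1.0], clicks=2)
    athlete = SenseStats("athlete Sun Yue", [1.0, 2.0, 1.5, 0.5], clicks=0)
    print([s.name for s in rank_sense_options([athlete, singer])])
```

In the example data, the athlete option has more raw displays, but all of them are short views, so the singer option ranks first.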
Fig. 2 is a schematic block diagram of the open knowledge sharing platform of the present invention. As shown in Fig. 2, the open knowledge sharing platform of the present invention comprises an association module 20, an input module 21, a matching module 22, and an output module 23.
The association module 20 associates the entry content of a single entry that relates to different topics with the corresponding word-sense options. For example, the association module 20 divides and associates the entry content of the entry "Sun Yue" that relates to "singer Sun Yue" under the word-sense option "singer Sun Yue", and divides and associates the entry content that relates to "athlete Sun Yue" under the word-sense option "athlete Sun Yue".
The input module 21 receives the query request that the user sends through a browser. The matching module 22 looks up, in a database, the entry that matches the query request received by the input module 21. If the entry is a polysemous word relating to different topics, the output module 23 outputs the plurality of word-sense options corresponding to the entry, which are then presented in the browser. For example, when the query request entered by the user is "Sun Yue", the output module 23 outputs the two word-sense options "singer Sun Yue" and "athlete Sun Yue" and presents them in the browser.
The input module 21 further receives the user's request for a particular word-sense option, and the output module 23 further outputs and displays the associated entry content according to that request. Specifically, the user identifies the topic of interest from the word-sense options and clicks the corresponding word-sense option. The output module 23 then outputs the entry content associated with that word-sense option to the browser, where it is displayed to the user. For example, if the user is interested in "singer Sun Yue" and clicks the word-sense option "singer Sun Yue", the output module 23 outputs the entry content relating to "singer Sun Yue" to the browser, where it is displayed to the user.
The output module 23 preferably sorts the plurality of word-sense options presented in the browser according to user behavior, so that the word-sense options that attract the most attention are placed first, which further improves the user's browsing experience.
For example, the word-sense options may be sorted according to the display count of the entry content, the user's browsing time on the entry content, or the ratio of the click count of the entry content to its display count. The display count of the entry content is the number of times the associated entry content has been displayed in response to users' requests for the word-sense option. In general, the higher the display count of the entry content, the more attention the entry content receives, and the higher its word-sense option should be ranked. The user's browsing time on the entry content is the time from when the entry content is displayed until it is closed. The longer the browsing time, the more attention the user pays to the entry, and the higher its word-sense option should be ranked. The click count of the entry content is the number of times users click on items such as titles, pictures, or links in the displayed entry content. The higher the ratio of the click count to the display count, the more attention users pay to the entry, and the higher its word-sense option should be ranked.
In a preferred embodiment, the three criteria above are considered together to sort the word-sense options. That is, the word-sense options are sorted according to weighted statistics of the display count of the word-sense content, the user's browsing time on the word-sense content, and the ratio of the click count of the word-sense content to its display count. The specific weighted statistics algorithm can be designed according to actual needs. For example, when counting the displays of the word-sense content, displays with a short browsing time may be given a lower weight before they are included in the display count, which reduces the influence of displays caused by accidental user operations on the ranking of the word-sense options.
Fig. 3 is a schematic flowchart of the entry processing method of the open knowledge sharing platform of the present invention.
In step 30, an entry and its entry content are acquired. In a preferred embodiment, the entry and entry content may be those presented in directory form on an existing open knowledge sharing platform. That is, the entry content comprises a plurality of directories and the directory content located under each directory. The directories may include a plurality of first-level directories, and each first-level directory may further contain subdirectories such as second-level and third-level directories.
In step 31, it is judged whether the entry is a polysemous word relating to different themes. There are several ways to make this judgment, which are described below through several embodiments.
In one embodiment, whether the entry is a polysemous word relating to different themes is judged from the directory information in the directories. Specifically, it is judged whether keywords relating to different themes appear in different pieces of directory information. For example, if "singer Sun Yue" and "sportsman Sun Yue" appear respectively in two pieces of directory information of the entry "Sun Yue", the entry "Sun Yue" is considered a polysemous word, because "singer" and "sportsman" relate to different themes. As another example, if "1983 version" and "2008 version" appear respectively in two pieces of directory information of the entry "Hero Shooting Vulture", the entry "Hero Shooting Vulture" is likewise considered a polysemous word.
In one embodiment, the classification information of the entry is additionally acquired in step 30, and in step 31 whether the entry is a polysemous word relating to different themes is judged from that classification information. For example, if the classification information of the entry "apple" contains three different categories, "plant", "film" and "company", the entry "apple" is considered a polysemous word.
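The two rule-based judgments above (theme keywords in the directory information, and distinct categories in the classification information) might be sketched as below. The keyword table `THEME_KEYWORDS`, the "YYYY version" pattern, and the function names are assumptions made for illustration; the description does not specify how keywords are mapped to themes.

```python
import re

# Hypothetical keyword-to-theme table; a real platform would supply its own.
THEME_KEYWORDS = {"singer": "music", "sportsman": "sports"}

def themes_in_headings(headings):
    """Collect the themes signalled by keywords in the directory headings."""
    themes = set()
    for heading in headings:
        for keyword, theme in THEME_KEYWORDS.items():
            if keyword in heading.lower():
                themes.add(theme)
    return themes

def versions_in_headings(headings):
    """Collect distinct 'YYYY version' markers such as '1983 version'."""
    return {year for h in headings
            for year in re.findall(r"\b(\d{4}) version\b", h)}

def is_polysemous_by_rules(headings, categories=()):
    """True if two or more themes, versions, or categories are found."""
    return (len(themes_in_headings(headings)) >= 2
            or len(versions_in_headings(headings)) >= 2
            or len(set(categories)) >= 2)
```

For instance, `is_polysemous_by_rules(["singer Sun Yue", "sportsman Sun Yue"])` is true because two themes are detected, while a single theme with a single category is not enough.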
In another embodiment, whether an entry is a polysemous word can be identified automatically from the entry content by a machine mining method. Fig. 4 is a schematic flowchart of the polysemous-word determination method of the open knowledge sharing platform of the present invention.
In step 40, feature extraction is performed on the entry content of the entry to be determined, to obtain a plurality of entry features. Specifically, the entry content is segmented into words and filtered, and the words remaining after segmentation and filtering are used as the entry features. The purpose of word segmentation is to cut the Chinese character sequence of the entry content into meaningful words for subsequent processing. Specific segmentation methods include forward matching, reverse matching, bidirectional matching, segmentation based on a full segmentation word graph, maximum-entropy Markov model segmentation, maximum-entropy segmentation, and conditional random field segmentation. These methods are well known in the art and are not described further here. The purpose of filtering is to remove useless information such as punctuation marks and particles.
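As a rough illustration of step 40, the sketch below uses forward maximum matching, one of the listed segmentation methods, followed by a stop-list filter. The toy vocabulary, the stop list, and the function names are assumptions; a production system would use a full dictionary or a statistical segmenter.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

def extract_features(text, vocab, stop_words):
    """Segment, then drop punctuation and particles listed in stop_words."""
    return [t for t in forward_max_match(text, vocab) if t not in stop_words]

# Toy dictionary and stop list (assumed for this example only).
VOCAB = {"知识", "共享", "知识共享", "平台", "词条"}
STOP_WORDS = {"的", "。", "，"}

features = extract_features("知识共享平台的词条。", VOCAB, STOP_WORDS)
```

Here the particle "的" and the full stop are removed, and the remaining words become the entry features.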
In step 41, a labeled feature set is acquired. The labeled feature set comprises a plurality of labeled features, each carrying a weight value. Fig. 5 is a schematic flowchart of the method for acquiring the labeled feature set in the open knowledge sharing platform of the present invention.
In step 50, a labeled corpus comprising a plurality of polysemous samples and non-polysemous samples is acquired. In the labeled corpus, a polysemous sample is an entry, together with its entry content, that has been judged to be a polysemous word, and a non-polysemous sample is an entry, together with its entry content, that has been judged not to be a polysemous word.
In step 51, a plurality of labeled features are extracted from the labeled corpus. Specifically, each polysemous sample and each non-polysemous sample is segmented and filtered, and the words remaining after segmentation and filtering are used as the labeled features.
In step 52, each labeled feature is assigned a weight value according to how it occurs in the polysemous samples and the non-polysemous samples. Specifically, the number of times M that the labeled feature occurs in the polysemous samples and the number of times N that it occurs in the non-polysemous samples are counted, and the weight value of the labeled feature is calculated as M/(M+N). It follows that a labeled feature that occurs often in polysemous samples and rarely in non-polysemous samples receives a relatively high weight value. A labeled feature that occurs about equally often in both kinds of sample, or that occurs rarely in polysemous samples and often in non-polysemous samples, receives a relatively low weight value.
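The weight M/(M+N) from step 52 can be computed with two frequency counters. In this minimal sketch each sample is assumed to be already segmented and filtered into a list of words; the function name and the toy samples are hypothetical.

```python
from collections import Counter

def labeled_feature_weights(polysemous_samples, non_polysemous_samples):
    """Weight of each feature = M / (M + N), where M and N are its
    occurrence counts in polysemous and non-polysemous samples."""
    m = Counter(w for sample in polysemous_samples for w in sample)
    n = Counter(w for sample in non_polysemous_samples for w in sample)
    # Every feature in the union occurs at least once, so M + N > 0.
    return {w: m[w] / (m[w] + n[w]) for w in set(m) | set(n)}

polysemous = [["singer", "sportsman"], ["singer", "film"]]
non_polysemous = [["film", "history"]]
weights = labeled_feature_weights(polysemous, non_polysemous)
```

With these toy samples, "singer" occurs only in polysemous samples and gets weight 1.0, "film" occurs once in each kind and gets 0.5, and "history" occurs only in a non-polysemous sample and gets 0.0.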
In step 42, a weight value is assigned, according to the labeled features, to each entry feature extracted from the entry content of the entry to be determined. Specifically, for each entry feature it is judged whether an identical labeled feature exists in the labeled feature set; if so, the weight value of that labeled feature is assigned to the entry feature.
In step 43, the weight values of the entry features extracted from the entry content of the entry to be determined are summed, and an entry whose weight sum exceeds a threshold is taken to be a polysemous word. Specifically, the higher the weight values of the entry features of the entry to be determined, the more often those features occur in polysemous samples, and the higher the probability that the entry is a polysemous word. In this embodiment, the specific threshold can be set according to actual conditions.
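Steps 42 and 43 amount to a dictionary lookup followed by a sum and a comparison. The sketch below assumes that an entry feature with no matching labeled feature contributes nothing to the sum; the description only states what happens when a match exists. The threshold value and function names are hypothetical.

```python
def polysemy_score(entry_features, weights):
    """Sum the weights of entry features found in the labeled feature set.
    Features without a matching labeled feature contribute 0 (an assumption)."""
    return sum(weights.get(feature, 0.0) for feature in entry_features)

def is_polysemous(entry_features, weights, threshold):
    """An entry is taken as polysemous if its weight sum exceeds the threshold."""
    return polysemy_score(entry_features, weights) > threshold

example_weights = {"singer": 1.0, "film": 0.5, "history": 0.0}
score = polysemy_score(["singer", "film", "unknown"], example_weights)
```

Because the score is a plain sum, longer entries tend to accumulate larger totals, which is one reason the threshold has to be tuned to the data at hand.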
In step 32, if the entry is judged to be a polysemous word relating to different themes, the entry content relating to the different themes is divided into the corresponding word-sense options. In a preferred embodiment, each first-level directory of the entry, together with the directory content under it, is treated as one basic prediction unit; it is predicted whether the basic prediction units belong to different themes, and according to the prediction result the basic prediction units belonging to the same theme are assigned to the same word-sense option. There are several ways to predict whether basic prediction units belong to different themes, which are described below through several embodiments.
In one embodiment, the prediction is made from the directory information of the first-level directories. For example, if "singer Sun Yue" and "sportsman Sun Yue" appear respectively in the directory information of two first-level directories of the entry "Sun Yue", then, because "singer" and "sportsman" relate to different themes, the first-level directory whose directory information contains "singer Sun Yue" is assigned, with its directory content, to the word-sense option "singer Sun Yue", and the first-level directory whose directory information contains "sportsman Sun Yue" is assigned, with its directory content, to the word-sense option "sportsman Sun Yue". The prediction can also be made from the user editing behavior reflected in the directory information. For example, if the first word in the directory information of several first-level directories is a number and the numbers run consecutively, then the numbered first-level directories with their directory content, and the first unnumbered directory below them with its directory content, are assigned to different word-sense options.
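The numbering heuristic at the end of the paragraph above admits more than one reading. The sketch below takes one possible reading, stated here as an assumption: a run of consecutively numbered first-level headings at the top of the entry is detected, and the index of the first unnumbered heading after that run is reported as a split position. The function names are hypothetical.

```python
import re

def leading_number(heading):
    """Return the integer a heading starts with, or None if there is none."""
    match = re.match(r"\s*(\d+)", heading)
    return int(match.group(1)) if match else None

def numbered_run_boundary(headings):
    """Index of the first unnumbered heading after a consecutive numbered
    run that starts at the first heading; None if no such run exists."""
    numbers = [leading_number(h) for h in headings]
    run = 0
    while (run < len(numbers) and numbers[run] is not None
           and (run == 0 or numbers[run] == numbers[run - 1] + 1)):
        run += 1
    # Require at least two numbered headings followed by an unnumbered one.
    if 2 <= run < len(numbers) and numbers[run] is None:
        return run
    return None
```

For the headings `["1. Early life", "2. Career", "Other uses"]` the function returns 2, marking "Other uses" as the start of a different group.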
In one embodiment, when the entry is judged to be a polysemous word relating to different themes, the entry content relating to the different themes can be divided into the corresponding word-sense options by machine mining. Fig. 6 is a schematic flowchart of the method for dividing an ambiguous entry into word-sense options in the open knowledge sharing platform of the present invention.
The directories in an entry are normally arranged in sequence, so the directories belonging to the same theme in an ambiguous entry are usually contiguous and out-of-order arrangements are rare. In this case it is only necessary to judge whether each pair of adjacent directories is related in order to find the positions at which directories of different themes are split.
In step 61, polysemous entry data that has not yet been classified into word-sense options is acquired. Such data can be obtained through step 31 shown in Fig. 3 or through the polysemous-word determination method shown in Fig. 4.
In step 62, the directory information of the entry is acquired according to the positions of the directories in the entry. In a preferred embodiment, the entry is split at the positions of the first-level directories in the polysemous entry to obtain the directory information of each first-level directory; the directory information includes the first-level directory title, the directory content under that first-level directory, and so on.
In step 63, a plurality of features are extracted from the acquired directory information. To extract features from the directory information, the content is first segmented into words and filtered, and the words remaining after segmentation and filtering are used as the features. The purpose of word segmentation is to cut the Chinese character sequence into meaningful words for subsequent processing. Specific segmentation methods include forward matching, reverse matching, bidirectional matching, segmentation based on a full segmentation word graph, maximum-entropy Markov model segmentation, maximum-entropy segmentation, and conditional random field segmentation. These methods are well known in the art and are not described further here. The purpose of filtering is to remove useless information such as punctuation marks and particles. In a preferred embodiment, forward maximum matching and reverse maximum matching are combined to correct the segmentation result and obtain a more accurate segmentation.
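The description does not say how the forward and reverse maximum-matching results are combined. The sketch below uses a common heuristic, which is an assumption here: prefer the segmentation with fewer words, then the one with fewer single-character words, and fall back to the reverse result on a tie. The toy vocabulary and function names are also hypothetical.

```python
def forward_mm(text, vocab, max_len=3):
    """Forward maximum matching: scan left to right."""
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                out.append(piece)
                i += size
                break
    return out

def backward_mm(text, vocab, max_len=3):
    """Reverse maximum matching: scan right to left."""
    out, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in vocab:
                out.append(piece)
                end -= size
                break
    return out[::-1]

def bidirectional_mm(text, vocab, max_len=3):
    """Pick the result with fewer words, then fewer single characters;
    on a full tie the reverse result is kept (assumed tie-break)."""
    fwd = forward_mm(text, vocab, max_len)
    bwd = backward_mm(text, vocab, max_len)

    def cost(tokens):
        return (len(tokens), sum(len(t) == 1 for t in tokens))

    return fwd if cost(fwd) < cost(bwd) else bwd

VOCAB_MM = {"研究", "研究生", "生命", "起源"}
result = bidirectional_mm("研究生命起源", VOCAB_MM)
```

On this classic example the forward pass greedily takes "研究生" and leaves a stray single character, while the reverse pass yields three dictionary words, so the reverse result is chosen.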
In step 64, a machine model containing the degree-of-association relationships between directory features is acquired. As shown in Fig. 6, step 64 further comprises the following steps.
Step 641: acquire a sample library of polysemous entries that have already been classified into word-sense options of different themes. Because the samples are ambiguous entries already classified into word-sense options of different themes, the directories under each word-sense option correspond to the same theme.
Step 642: acquire the directory information of the word-sense options, that is, the directory information sharing the same theme under each word-sense option. In a preferred embodiment, the first-level directory information of the word-sense options is acquired.
Step 643: extract the directory features of the word-sense options from their directory information. The corresponding directory features are extracted from directory information that shares the same theme.
Step 644: perform machine modeling on the directory features of the word-sense options to generate a machine model containing the degree-of-association relationships between the directory features of adjacent directory information of the word-sense options. Because each entry in the sample library contains a plurality of word-sense options, such a model can be built by training on the directory features that share a theme under the same word-sense option and on the directory features that differ in theme under different word-sense options. In a preferred embodiment, the degree-of-association relationship between the directory features of adjacent directory information comprises one of, or a combination of, the following: the number of identical words, the lexical attributes of identical words, the proportion of the directory information made up of identical words, the number of related words, the lexical attributes of related words, and the proportion of the directory information made up of related words.
In step 65, the machine model is applied to the extracted directory features to calculate the degree of association between the directory features of adjacent directory information. Several calculation methods can be used, either separately or in combination. Two such methods are given below as examples; they are not intended to limit the embodiments of the present invention.
In one embodiment of the present invention, the machine model is applied to calculate parameters of the identical words in the directory features of adjacent directory information. The degree of association between adjacent directory information is obtained by calculating the number of identical words and the proportion of the directory information they make up, or by examining the lexical attributes of the identical words. For example, for artistic works, particularly film and television series, serialized stories and the like, the directory names are identical while the content differs; identical words occur in large numbers in the directory content, and their lexical attributes are nouns, gerunds and so on, so the degree of association between adjacent directories can be calculated accordingly.
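The identical-word parameters named above (count and proportion) can be computed directly from the word sets of two adjacent directories. This sketch assumes each directory has already been segmented and filtered into a list of words, and it leaves out lexical attributes, which would need a part-of-speech tagger. The function and key names are hypothetical.

```python
def identical_word_features(prev_words, next_words):
    """Count the words shared by two adjacent directories and the share of
    each directory's distinct vocabulary that they represent."""
    prev_set, next_set = set(prev_words), set(next_words)
    shared = prev_set & next_set
    return {
        "shared_count": len(shared),
        "shared_ratio_prev": len(shared) / len(prev_set) if prev_set else 0.0,
        "shared_ratio_next": len(shared) / len(next_set) if next_set else 0.0,
    }

version_1983 = ["cast", "plot", "1983", "episode"]
version_2008 = ["cast", "plot", "2008", "episode", "remake"]
shared_features = identical_word_features(version_1983, version_2008)
```

In the toy data three of the four distinct words of the first directory reappear in the second, which is the kind of signal a trained model could use.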
In another embodiment of the present invention, the machine model is applied to calculate parameters of the related words in the directory features of adjacent directory information. The degree of association between adjacent directory information is obtained by calculating the number of related words and the proportion of the directory information they make up, or by examining the lexical attributes of the related words. For example, the relatedness of "Liu Dehua" and "Zhu Liqian" is high while the relatedness of "Liu Dehua" and "old man" is low; the relatedness of "singer" and "album" is high while the relatedness of "singer" and "war" is low. Such word relatedness can be determined from a related-word dictionary or learned from samples by machine.
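A related-word dictionary lookup of the kind mentioned above might look like the following sketch. The dictionary contents, the relatedness scores, the 0.5 cutoff, and the function names are all assumptions for illustration; the description does not give concrete values.

```python
# Hypothetical related-word dictionary: unordered word pair -> relatedness.
RELATED = {
    frozenset({"singer", "album"}): 0.9,
    frozenset({"singer", "concert"}): 0.8,
    frozenset({"singer", "war"}): 0.1,
}

def relatedness(a, b):
    """Look up the relatedness of two words; unknown pairs score 0."""
    return RELATED.get(frozenset({a, b}), 0.0)

def related_word_features(prev_words, next_words, min_score=0.5):
    """Count related word pairs across two adjacent directories and the
    share of the second directory's vocabulary that is related."""
    next_set = set(next_words)
    pairs = sorted((a, b) for a in set(prev_words) for b in next_set
                   if a != b and relatedness(a, b) >= min_score)
    related_next = {b for _, b in pairs}
    return {
        "related_count": len(pairs),
        "related_ratio_next": len(related_next) / len(next_set) if next_set else 0.0,
    }

related_features = related_word_features(["singer"], ["album", "concert", "war"])
```

Here "album" and "concert" pass the cutoff and "war" does not, so two of the three words in the second directory count as related.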
In step 66, the directory information is labeled according to the result of the degree-of-association calculation. The labeling can be done in several ways. In one embodiment of the present invention, the directory information is classified by theme according to the calculation result. In another embodiment of the present invention, adjacent directory information is divided into related and unrelated according to the calculation result; the first directory and any directory information related to the preceding directory information are given a first label, and directory information unrelated to the preceding directory information is given a second label. For instance, suppose an ambiguous entry comprises 6 directories. For each directory, the directory and its directory content are used to decide whether it is the beginning of a word-sense option for a theme; if so, it is labeled "B", and if not, it is labeled "I". The 6 directories might then be labeled "BIBIIB", in which case directories 1-2 form one word-sense option, directories 3-5 form another, and directory 6 forms a third. In this way the directories sharing a theme within an ambiguous entry are grouped together.
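The "B"/"I" example above can be sketched in two small functions: one that turns adjacent-directory association scores into labels, and one that groups directories by those labels. The threshold and the idea of a single numeric score per adjacent pair are assumptions; the grouping itself follows the "BIBIIB" example in the text.

```python
def labels_from_scores(adjacent_scores, threshold=0.5):
    """adjacent_scores[i] is the association between directory i+1 and
    directory i+2 (1-based). The first directory always starts a group."""
    return "B" + "".join(
        "I" if score >= threshold else "B" for score in adjacent_scores)

def group_by_labels(labels):
    """Split 1-based directory numbers into groups, one per 'B' label."""
    groups = []
    for number, label in enumerate(labels, start=1):
        if label == "B" or not groups:
            groups.append([number])
        else:
            groups[-1].append(number)
    return groups

labels = labels_from_scores([0.9, 0.1, 0.8, 0.7, 0.2])
groups = group_by_labels(labels)
```

With the scores shown, the labels come out as "BIBIIB" and the groups as directories 1-2, 3-5, and 6, matching the example in the description.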
Fig. 7 is a schematic block diagram of the entry processing apparatus of the open knowledge sharing platform of the present invention. In this embodiment, the entry processing apparatus comprises an entry acquisition module 70, a polysemous-word judging module 71, and a word-sense option division module 72.
The entry acquisition module 70 is used to acquire an entry and its entry content. In a preferred embodiment, the entry and entry content may be those presented in directory form on an existing open knowledge sharing platform. That is, the entry content comprises a plurality of directories and the directory content located under each directory. The directories may include a plurality of first-level directories, and each first-level directory may further contain subdirectories such as second-level and third-level directories.
The polysemous-word judging module 71 is used to judge whether the entry is a polysemous word relating to different themes. There are several ways to make this judgment, which are described below through several embodiments.
In one embodiment, the polysemous-word judging module 71 judges from the directory information in the directories whether the entry is a polysemous word relating to different themes. Specifically, the polysemous-word judging module 71 judges whether keywords relating to different themes appear in different pieces of directory information. For example, if "singer Sun Yue" and "sportsman Sun Yue" appear respectively in two pieces of directory information of the entry "Sun Yue", the entry "Sun Yue" is considered a polysemous word, because "singer" and "sportsman" relate to different themes. As another example, if "1983 version" and "2008 version" appear respectively in two pieces of directory information of the entry "Hero Shooting Vulture", the entry "Hero Shooting Vulture" is likewise considered a polysemous word.
In one embodiment, the entry acquisition module 70 additionally acquires the classification information of the entry, and the polysemous-word judging module 71 judges from that classification information whether the entry is a polysemous word relating to different themes. For example, if the classification information of the entry "apple" contains three different categories, "plant", "film" and "company", the entry "apple" is considered a polysemous word.
In another embodiment, whether an entry is a polysemous word can be identified automatically from the entry content by a machine mining method. Fig. 8 is a schematic block diagram of the polysemous-word judging module of the open knowledge sharing platform of the present invention. In this embodiment, the polysemous-word judging module comprises an entry feature extraction module 80, a labeled feature set acquisition module 81, an entry feature weight calculation module 82, and a threshold judging module 83.
The entry feature extraction module 80 is used to perform feature extraction on the entry content of the entry to be determined, to obtain a plurality of entry features. Specifically, the entry feature extraction module 80 segments the entry content into words and filters it, and uses the words remaining after segmentation and filtering as the entry features. The purpose of word segmentation is to cut the Chinese character sequence of the entry content into meaningful words for subsequent processing. Specific segmentation methods include forward matching, reverse matching, bidirectional matching, segmentation based on a full segmentation word graph, maximum-entropy Markov model segmentation, maximum-entropy segmentation, and conditional random field segmentation. These methods are well known in the art and are not described further here. The purpose of filtering is to remove useless information such as punctuation marks and particles.
The labeled feature set acquisition module 81 is used to acquire a labeled feature set. The labeled feature set comprises a plurality of labeled features, each carrying a weight value. Fig. 9 is a schematic block diagram of the labeled feature set acquisition module of the open knowledge sharing platform of the present invention. In this embodiment, the labeled feature set acquisition module comprises a labeled corpus acquisition module 90, a labeled feature extraction module 91, and a labeled feature weight calculation module 92.
The labeled corpus acquisition module 90 is used to acquire a labeled corpus comprising a plurality of polysemous samples and non-polysemous samples. In the labeled corpus, a polysemous sample is an entry, together with its entry content, that has been judged to be a polysemous word, and a non-polysemous sample is an entry, together with its entry content, that has been judged not to be a polysemous word.
The labeled feature extraction module 91 is used to extract a plurality of labeled features from the labeled corpus. Specifically, the labeled feature extraction module 91 segments and filters each polysemous sample and each non-polysemous sample, and uses the words remaining after segmentation and filtering as the labeled features.
The labeled feature weight calculation module 92 is used to assign each labeled feature a weight value according to how it occurs in the polysemous samples and the non-polysemous samples. Specifically, the labeled feature weight calculation module 92 counts the number of times M that the labeled feature occurs in the polysemous samples and the number of times N that it occurs in the non-polysemous samples, and calculates the weight value of the labeled feature as M/(M+N). It follows that a labeled feature that occurs often in polysemous samples and rarely in non-polysemous samples receives a relatively high weight value. A labeled feature that occurs about equally often in both kinds of sample, or that occurs rarely in polysemous samples and often in non-polysemous samples, receives a relatively low weight value.
The entry feature weight calculation module 82 is used to assign, according to the labeled features, a weight value to each entry feature extracted from the entry content of the entry to be determined. Specifically, for each entry feature the entry feature weight calculation module 82 judges whether an identical labeled feature exists in the labeled feature set; if so, the weight value of that labeled feature is assigned to the entry feature.
The threshold judging module 83 is used to sum the weight values of the entry features extracted from the entry content of the entry to be determined, and to take an entry whose weight sum exceeds a threshold to be a polysemous word. Specifically, the higher the weight values of the entry features of the entry to be determined, the more often those features occur in polysemous samples, and the higher the probability that the entry is a polysemous word. In this embodiment, the specific threshold can be set according to actual conditions.
If the polysemous-word judging module 71 judges that the entry is a polysemous word relating to different themes, the word-sense option division module 72 divides the entry content relating to the different themes into the corresponding word-sense options. In a preferred embodiment, the word-sense option division module 72 treats each first-level directory of the entry, together with the directory content under it, as one basic prediction unit, predicts whether the basic prediction units belong to different themes, and according to the prediction result assigns the basic prediction units belonging to the same theme to the same word-sense option. There are several ways to predict whether basic prediction units belong to different themes, which are described below through several embodiments.
Fig. 10 is a schematic block diagram of the apparatus for dividing an ambiguous entry into word-sense options in the open knowledge sharing platform of the present invention. The word-sense option division module further comprises: an entry data acquisition module 101, a directory information acquisition module 102, a directory feature extraction module 103, a machine model acquisition module 104, a directory association calculation module 105, and a labeling module 106.
The entry data acquisition module 101 is used to acquire polysemous entry data that has not yet been classified into word-sense options. Such data can be obtained through step 31 shown in Fig. 3 or through the polysemous-word determination method shown in Fig. 4.
The directory information acquisition module 102 is used to acquire the directory information of the entry according to the positions of the directories in the entry. In a preferred embodiment, the entry is split at the positions of the first-level directories in the polysemous entry to obtain the directory information of each first-level directory; the directory information includes the first-level directory title, the directory content under that first-level directory, and so on.
The directory feature extraction module 103 is used to extract a plurality of features from the acquired directory information. To extract features from the directory information, the content is first segmented into words and filtered, and the words remaining after segmentation and filtering are used as the features. In a preferred embodiment, forward maximum matching and reverse maximum matching are combined to correct the segmentation result and obtain a more accurate segmentation.
The machine model acquisition module 104 is used to acquire a machine model containing the degree-of-association relationships between directory features. As shown in Fig. 10, the machine model acquisition module 104 further comprises: a sample acquisition module 1041, a word-sense option directory information acquisition module 1042, a directory feature extraction module 1043, and a machine modeling module 1044. The sample acquisition module 1041 is used to acquire a sample library of polysemous entries that have already been classified into word-sense options of different themes. Because the samples are ambiguous entries already classified into word-sense options of different themes, the directories under each word-sense option correspond to the same theme. The word-sense option directory information acquisition module 1042 is used to acquire the directory information of the word-sense options, that is, the directory information sharing the same theme under each word-sense option. In a preferred embodiment, the first-level directory information of the word-sense options is acquired. The directory feature extraction module 1043 is used to extract the directory features of the word-sense options from their directory information; the corresponding directory features are extracted from directory information that shares the same theme. The machine modeling module 1044 is used to perform machine modeling on the directory features of the word-sense options, generating a machine model containing the degree-of-association relationships between the directory features of adjacent directory information of the word-sense options. In a preferred embodiment, the degree-of-association relationship between the directory features of adjacent directory information comprises one of, or a combination of, the following: the number of identical words, the lexical attributes of identical words, the proportion of the directory information made up of identical words, the number of related words, the lexical attributes of related words, and the proportion of the directory information made up of related words.
The directory association calculation module 105 is used to apply the machine model to the extracted directory features and calculate the degree of association between the directory features of adjacent directory information. The directory association calculation module 105 can be realized with several calculation structures, which may be implemented separately or in combination. In one embodiment of the present invention, the directory association calculation module 105 applies the machine model to calculate parameters of the identical words in the directory features of adjacent directory information; the degree of association between adjacent directory information is obtained by calculating the number of identical words and the proportion of the directory information they make up, or by examining the lexical attributes of the identical words. In another embodiment of the present invention, the directory association calculation module 105 applies the machine model to calculate parameters of the related words in the directory features of adjacent directory information; the degree of association between adjacent directory information is obtained by calculating the number of related words and the proportion of the directory information they make up, or by examining the lexical attributes of the related words.
The labeling module 106 is used to label the directory information according to the result of the degree-of-association calculation. The labeling can be done in several ways. In one embodiment of the present invention, the directory information is classified by theme according to the calculation result. In another embodiment of the present invention, adjacent directory information is divided into related and unrelated according to the calculation result; the first directory and any directory information related to the preceding directory information are given a first label, and directory information unrelated to the preceding directory information is given a second label. The labeling module 106 further comprises: an association classification module, a first labeling module, and a second labeling module. The association classification module divides adjacent directory information into related and unrelated according to the calculation result. The first labeling module gives the first label to the first directory and to directory information related to the preceding directory information. The second labeling module gives the second label to directory information unrelated to the preceding directory information.
Through the technical solution described above, the invention provides an open knowledge sharing platform and a polysemous word display method thereof, which can display the word-sense options for the different subjects of a polysemous word so that the user can choose among them, improving the user experience.
The above is only a preferred embodiment of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the invention.

Claims (39)

1. An entry processing method for an open knowledge sharing platform, characterized in that the entry processing method comprises:
a. acquiring an entry and entry content;
b. judging whether the entry is a polysemous word relating to different subjects;
c. if the entry is a polysemous word relating to different subjects, sorting the entry content relating to the different subjects under the corresponding word-sense options respectively.
2. The entry processing method according to claim 1, characterized in that the entry content comprises a plurality of directories, and in step b, whether the entry is a polysemous word relating to different subjects is judged according to the directory information in the directories.
3. The entry processing method according to claim 1, characterized in that in step a, classification information of the entry is further acquired, and in step b, whether the entry is a polysemous word relating to different subjects is judged according to the classification information.
4. The entry processing method according to claim 1, characterized in that step b comprises:
b1. performing feature extraction on the entry content to obtain a plurality of entry features;
b2. acquiring a labeled feature set, the labeled feature set comprising a plurality of labeled features that carry weight values;
b3. assigning a corresponding weight value to each entry feature according to the labeled features; and
b4. summing the weight values of the plurality of entry features, and treating an entry whose weight sum is higher than a threshold as a polysemous word.
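Steps b1 to b4 of claim 4 can be sketched as a weighted lookup followed by a threshold test. The sketch assumes that an entry feature absent from the labeled feature set contributes a weight of zero, which the claim does not specify; the name `is_polysemous` and the sample weights are hypothetical.

```python
def is_polysemous(entry_features, labeled_feature_weights, threshold):
    """Sum the weights of the entry's features and compare with a threshold.

    Features missing from `labeled_feature_weights` are given weight 0.0,
    which is an assumption made for this illustration.
    """
    total = sum(labeled_feature_weights.get(f, 0.0) for f in entry_features)
    return total > threshold


# Hypothetical labeled feature set with weight values.
weights = {"singer": 0.8, "film": 0.7, "river": 0.1}

likely_polysemous = is_polysemous(["singer", "film"], weights, threshold=1.0)
likely_single_sense = is_polysemous(["river", "unknown"], weights, threshold=1.0)
```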
5. The entry processing method according to claim 4, characterized in that step b2 further comprises:
b21. acquiring a labeled corpus comprising a plurality of polysemous word samples and non-polysemous word samples;
b22. extracting a plurality of the labeled features from the labeled corpus;
b23. assigning a corresponding weight value to each labeled feature according to how often the labeled feature occurs in the polysemous word samples and in the non-polysemous word samples.
6. The entry processing method according to claim 5, characterized in that in step b23, the number of times M that the labeled feature occurs in the polysemous word samples is calculated, the number of times N that the labeled feature occurs in the non-polysemous word samples is calculated, and the weight value is calculated as M/(M+N).
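The weight M/(M+N) of claim 6 can be computed directly from occurrence counts. The following is a minimal sketch that assumes each sample is represented as a list of already-extracted features; the function name `labeled_feature_weights` and the sample data are hypothetical.

```python
from collections import Counter


def labeled_feature_weights(polysemous_samples, non_polysemous_samples):
    """Return {feature: M / (M + N)}.

    M is the number of occurrences of the feature in the polysemous word
    samples and N the number in the non-polysemous word samples.
    """
    m = Counter(f for sample in polysemous_samples for f in sample)
    n = Counter(f for sample in non_polysemous_samples for f in sample)
    # Every feature seen in either corpus has M + N >= 1, so no division by zero.
    return {f: m[f] / (m[f] + n[f]) for f in set(m) | set(n)}


polysemous = [["singer", "film"], ["singer"]]
non_polysemous = [["singer", "river"]]
weights = labeled_feature_weights(polysemous, non_polysemous)
```

Here "singer" has M = 2 and N = 1, so its weight is 2/3, while a feature seen only in non-polysemous word samples receives weight 0.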
7. The entry processing method according to claim 1, characterized in that in step c, each first-level directory of the entry, together with the directory content under that first-level directory, is taken as a basic prediction unit; whether the basic prediction units belong to different subjects is predicted; and the basic prediction units belonging to different subjects are sorted into the corresponding word-sense options according to the prediction result.
8. The entry processing method according to claim 7, characterized in that in step c, the prediction is made according to the directory information in the first-level directory.
9. The entry processing method according to claim 7, characterized in that step c further comprises:
c1. extracting a plurality of directory features from the basic prediction units;
c2. judging the degree of association between the basic prediction units according to the directory features, and producing the prediction result according to the degree of association.
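The division of basic prediction units into word-sense options described in claims 7 to 9 can be sketched as below. This is a simplification that only groups consecutive units and uses word overlap as the degree of association; the overlap measure, the threshold value, and the names `jaccard` and `split_into_sense_options` are assumptions, since the claims leave the prediction model open.

```python
def jaccard(a, b):
    """Word overlap of two feature lists (assumed stand-in for the model)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_into_sense_options(units, threshold=0.3, relatedness=jaccard):
    """Group consecutive basic prediction units into word-sense options.

    `units` is a list of (first-level directory title, directory features).
    A unit joins the current option when it is related to the unit before
    it; otherwise it opens a new option.
    """
    options = []
    previous = None
    for title, features in units:
        if previous is not None and relatedness(previous, features) >= threshold:
            options[-1].append(title)
        else:
            options.append([title])
        previous = features
    return options


units = [
    ("Fruit", ["apple", "fruit", "nutrition"]),
    ("Cultivation", ["apple", "fruit", "cultivation"]),
    ("Company", ["apple", "company", "products"]),
    ("Products", ["company", "products", "iphone"]),
]
options = split_into_sense_options(units)
```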
10. The entry processing method according to claim 1, characterized in that step c further comprises:
c1. acquiring the directory information of the entry;
c2. extracting directory features according to the directory information;
c3. acquiring a machine model containing degree-of-association relations between directory features;
c4. according to the extracted directory features, applying the machine model to calculate the degree of association between the directory features of adjacent directory information;
c5. marking the directory information according to the result of the degree-of-association calculation.
11. The entry processing method according to claim 10, characterized in that step c2 further comprises: performing word segmentation before extracting the directory features.
12. The entry processing method according to claim 11, characterized in that the word segmentation method comprises: forward matching segmentation, reverse matching segmentation, bidirectional matching segmentation, segmentation based on a full-segmentation word graph, maximum entropy Markov model segmentation, maximum entropy segmentation, or conditional random field segmentation.
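Of the segmentation methods listed in claim 12, forward matching is the simplest to illustrate. The sketch below shows the common forward maximum matching variant: at each position it takes the longest dictionary word, and falls back to a single character when nothing matches. The dictionary contents, the `max_len` limit, and the name `forward_max_match` are assumptions for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` left to right, preferring the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                # A single character is always accepted as a fallback.
                words.append(piece)
                i += size
                break
    return words


# Hypothetical dictionary of heading vocabulary.
dictionary = {"人物", "简介", "人物简介", "作品"}
segmented = forward_max_match("人物简介作品", dictionary)
```

Reverse matching works the same way from the end of the text, and bidirectional matching compares the two results.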
13. The entry processing method according to claim 10, characterized in that step c3 further comprises:
c31. acquiring a sample library of polysemous entries that have already been classified into word-sense options of different subjects;
c32. acquiring the directory information of the word-sense options;
c33. extracting the directory features of the word-sense options according to the directory information of the word-sense options;
c34. performing machine modeling according to the directory features of the word-sense options, to generate a machine model containing the degree-of-association relations between the directory features of adjacent directory information of the word-sense options.
14. The entry processing method according to claim 13, characterized in that the degree-of-association relation between the directory features of adjacent directory information of the word-sense options comprises one of, or a combination of: the number of identical words, the lexical attributes of identical words, the proportion of the directory information that identical words account for, the number of associated words, the lexical attributes of associated words, and the proportion of the directory information that associated words account for.
15. The entry processing method according to claim 10, characterized in that in step c4, the method of calculating the degree of association comprises calculating the number of identical words in the directory features of adjacent directory information.
16. The entry processing method according to claim 15, characterized in that in step c4, the method of calculating the degree of association further comprises judging the lexical attributes of the identical words.
17. The entry processing method according to claim 15, characterized in that in step c4, the method of calculating the degree of association further comprises calculating the proportion of the directory information that the identical words account for.
18. The entry processing method according to claim 10, characterized in that in step c4, the method of calculating the degree of association comprises calculating the number of associated words in the directory features of adjacent directory information.
19. The entry processing method according to claim 18, characterized in that in step c4, the method of calculating the degree of association further comprises judging the lexical attributes of the associated words.
20. The entry processing method according to claim 18, characterized in that in step c4, the method of calculating the degree of association further comprises calculating the proportion of the directory information that the associated words account for.
21. The entry processing method according to claim 10, characterized in that step c5 further comprises:
c51. classifying adjacent directory information as related or unrelated according to the result of the degree-of-association calculation;
c52. labeling the first directory, and directory information related to the preceding directory information, with a first mark;
c53. labeling directory information unrelated to the preceding directory information with a second mark.
22. An open knowledge sharing platform, characterized in that the open knowledge sharing platform comprises:
an entry acquisition module, which acquires an entry and entry content;
a polysemous word judgment module, which judges whether the entry is a polysemous word relating to different subjects;
a word-sense option division module, which, if the entry is a polysemous word relating to different subjects, sorts the entry content relating to the different subjects under the corresponding word-sense options respectively.
23. The open knowledge sharing platform according to claim 22, characterized in that the entry content comprises a plurality of directories, and the polysemous word judgment module judges whether the entry is a polysemous word relating to different subjects according to the directory information in the directories.
24. The open knowledge sharing platform according to claim 22, characterized in that the entry acquisition module further acquires classification information of the entry, and the polysemous word judgment module judges whether the entry is a polysemous word relating to different subjects according to the classification information.
25. The open knowledge sharing platform according to claim 22, characterized in that the polysemous word judgment module comprises:
an entry feature extraction module, which performs feature extraction on the entry content to obtain a plurality of entry features;
a labeled feature set acquisition module, which acquires a labeled feature set, the labeled feature set comprising a plurality of labeled features that carry weight values;
an entry feature weight calculation module, which assigns a corresponding weight value to each entry feature according to the labeled features; and
a threshold judgment module, which sums the weight values of the plurality of entry features and treats an entry whose weight sum is higher than a threshold as a polysemous word.
26. The open knowledge sharing platform according to claim 25, characterized in that the labeled feature set acquisition module further comprises:
a labeled corpus acquisition module, which acquires a labeled corpus comprising a plurality of polysemous word samples and non-polysemous word samples;
a labeled feature extraction module, which extracts a plurality of the labeled features from the labeled corpus;
a labeled feature weight calculation module, which assigns a corresponding weight value to each labeled feature according to how often the labeled feature occurs in the polysemous word samples and in the non-polysemous word samples.
27. The open knowledge sharing platform according to claim 26, characterized in that the labeled feature weight calculation module calculates the number of times M that the labeled feature occurs in the polysemous word samples, calculates the number of times N that the labeled feature occurs in the non-polysemous word samples, and calculates the weight value as M/(M+N).
28. The open knowledge sharing platform according to claim 22, characterized in that the word-sense option division module takes each first-level directory of the entry, together with the directory content under that first-level directory, as a basic prediction unit, predicts whether the basic prediction units belong to different subjects, and sorts the basic prediction units belonging to different subjects into the corresponding word-sense options according to the prediction result.
29. The open knowledge sharing platform according to claim 28, characterized in that the word-sense option division module makes the prediction according to the directory information in the first-level directory.
30. The open knowledge sharing platform according to claim 22, characterized in that the word-sense option division module further comprises:
a directory information acquisition module, which acquires the directory information of the entry;
a directory feature extraction module, which extracts directory features according to the directory information;
a machine model acquisition module, which acquires a machine model containing degree-of-association relations between directory features;
a degree-of-association calculation module, which, according to the extracted directory features, applies the machine model to calculate the degree of association between the directory features of adjacent directory information; and
a marking module, which marks the directory information according to the result of the degree-of-association calculation.
31. The open knowledge sharing platform according to claim 30, characterized in that the machine model acquisition module further comprises:
a word-sense option sample acquisition module, which acquires a sample library of polysemous entries that have been classified into word-sense options of different subjects;
a word-sense option directory information acquisition module, which acquires the directory information of the word-sense options;
a word-sense option directory feature extraction module, which extracts the directory features of the word-sense options according to the directory information of the word-sense options;
a machine modeling module, which performs machine modeling according to the directory features of the word-sense options, to generate a machine model containing the degree-of-association relations between the directory features of adjacent directory information of the word-sense options.
32. The open knowledge sharing platform according to claim 31, characterized in that the degree-of-association relation between the directory features of adjacent directory information of the word-sense options comprises one of, or a combination of: the number of identical words, the lexical attributes of identical words, the proportion of the directory information that identical words account for, the number of associated words, the lexical attributes of associated words, and the proportion of the directory information that associated words account for.
33. The open knowledge sharing platform according to claim 30, characterized in that the method by which the degree-of-association calculation module calculates the degree of association comprises calculating the number of identical words in the directory features of adjacent directory information.
34. The open knowledge sharing platform according to claim 33, characterized in that the method by which the degree-of-association calculation module calculates the degree of association further comprises judging the lexical attributes of the identical words.
35. The open knowledge sharing platform according to claim 33, characterized in that the method by which the degree-of-association calculation module calculates the degree of association further comprises calculating the proportion of the directory information that the identical words account for.
36. The open knowledge sharing platform according to claim 30, characterized in that the method by which the degree-of-association calculation module calculates the degree of association comprises calculating the number of associated words in the directory features of adjacent directory information.
37. The open knowledge sharing platform according to claim 36, characterized in that the method by which the degree-of-association calculation module calculates the degree of association further comprises judging the lexical attributes of the associated words.
38. The open knowledge sharing platform according to claim 36, characterized in that the method by which the degree-of-association calculation module calculates the degree of association further comprises calculating the proportion of the directory information that the associated words account for.
39. The open knowledge sharing platform according to claim 30, characterized in that the marking module further comprises:
a degree-of-association classification module, which classifies adjacent directory information as related or unrelated according to the result of the degree-of-association calculation;
a first marking module, which labels the first directory, and directory information related to the preceding directory information, with a first mark;
a second marking module, which labels directory information unrelated to the preceding directory information with a second mark.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010619675 CN102063497B (en) 2010-12-31 2010-12-31 Open type knowledge sharing platform and entry processing method thereof

Publications (2)

Publication Number    Publication Date
CN102063497A (en)    2011-05-18
CN102063497B (en)    2013-07-10

Family

ID=43998772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010619675 Active CN102063497B (en) 2010-12-31 2010-12-31 Open type knowledge sharing platform and entry processing method thereof

Country Status (1)

Country Link
CN (1) CN102063497B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6647383B1 (en) * 2000-09-01 2003-11-11 Lucent Technologies Inc. System and method for providing interactive dialogue and iterative search functions to find information
CN1535433A (en) * 2001-07-04 2004-10-06 库吉萨姆媒介公司 Category based, extensible and interactive system for document retrieval
CN1942856A (en) * 2003-04-04 2007-04-04 雅虎公司 Universal search interface systems and methods
CN101454750A (en) * 2006-03-31 2009-06-10 谷歌公司 Disambiguation of named entities

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103123636A (en) * 2011-11-21 2013-05-29 北京百度网讯科技有限公司 Method to build vocabulary entry classification models, method of vocabulary entry automatic classification and device
CN103123636B (en) * 2011-11-21 2016-04-27 北京百度网讯科技有限公司 Set up the method and apparatus of the method for entry disaggregated model, entry automatic classification
CN104008098A (en) * 2013-02-21 2014-08-27 腾讯科技(深圳)有限公司 Polysemy keyword based text filtering method and device
CN104008098B (en) * 2013-02-21 2018-09-18 腾讯科技(深圳)有限公司 Text filtering method based on ambiguity keyword and device
CN105159936A (en) * 2015-08-06 2015-12-16 广州供电局有限公司 File classification apparatus and method
CN111444707A (en) * 2020-03-26 2020-07-24 腾讯科技(深圳)有限公司 Title generation method and device and computer readable storage medium
CN111444707B (en) * 2020-03-26 2022-07-01 腾讯科技(深圳)有限公司 Title generation method and device and computer readable storage medium
CN111428498A (en) * 2020-04-02 2020-07-17 北京明略软件系统有限公司 Entry filtering method and device for special name dictionary
CN112464115A (en) * 2020-11-24 2021-03-09 北京字节跳动网络技术有限公司 Information display method and device and computer storage medium
WO2022111249A1 (en) * 2020-11-24 2022-06-02 北京字节跳动网络技术有限公司 Information presentation method, apparatus, and computer storage medium

Also Published As

Publication number Publication date
CN102063497B (en) 2013-07-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant