CN100461183C - Metadata automatic extraction method based on multiple rule in network search - Google Patents


Info

Publication number
CN100461183C
CN100461183C · CNB2007101185908A · CN200710118590A
Authority
CN
China
Prior art keywords
word
rule
metadata
information
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2007101185908A
Other languages
Chinese (zh)
Other versions
CN101101600A (en)
Inventor
张铭
杨宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CNB2007101185908A priority Critical patent/CN100461183C/en
Publication of CN101101600A publication Critical patent/CN101101600A/en
Application granted granted Critical
Publication of CN100461183C publication Critical patent/CN100461183C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The method includes the following steps: (1) preprocessing the raw web pages and normalizing all pages into a more regular format; (2) pre-positioning: performing a first, coarse location of the page content, within each web-page file, that contains the information to be extracted; (3) extracting metadata from the pre-positioned content according to specified rules. The invention first separates the core content area from the large-scale noise, then performs rule-based extraction only on the core area, which greatly improves extraction accuracy. The invention can also extract the metadata in a web page according to multiple rules: the rules' matching order is determined by given priorities, and the results are refined according to a two-stage extraction method.

Description

Metadata automatic extraction method based on multiple rules in web search
Technical field:
The invention belongs to the technical field of web search, and in particular relates to methods of subject-oriented search over Internet pages.
Background technology:
Metadata is data that describes data ("data about data"). It is used to describe the features and attributes of data, and serves as a tool for describing, organizing, and discovering Internet information resources. In every field there are large-scale resource-publishing websites; by extracting the metadata on these sites, a large number of useful resources can be collected, helping different users build domain-specific databases. Metadata extraction therefore has very wide application.
Metadata extraction occupies the fundamental data-preparation position in the overall process of information organization and retrieval. The data sources of the extraction process first undergo necessary preprocessing, in which documents with serious problems or omissions in format, content, or language are rejected, yielding text documents with a relatively regular format. The documents are then processed by the metadata extraction module, which generates document metadata conforming to the standard definition and stores the results in a metadata repository or in other files or text databases associated with the concrete system. Document metadata can also be organized in different standard formats, depending on the system, to facilitate data sharing and information exchange. With the further development of the network, web-page metadata has become one of the main ways of storing useful information, so research on web-page metadata extraction has grown increasingly broad and increasingly valued.
In the field of metadata extraction, researchers have carried out a large amount of theoretical work, proposed many extraction techniques and methods, and developed many usable tools based on these theories.
A common classification in the field of Web information extraction divides Web metadata extraction methods into roughly six classes, according to the techniques each tool adopts: (1) language-based metadata extraction methods; (2) methods based on HTML structure; (3) methods based on NLP (Natural Language Processing); (4) induction-based methods; (5) model-based methods; (6) Ontology-based methods.
NLP-based metadata extraction applies NLP techniques to learn extraction rules from documents described in natural language. These methods mainly employ phrase-syntax and semantic-analysis techniques, including identification and tagging of syntactic constituents, keyword extraction, retrieval-feature extraction, indexing, and so on. NLP-based methods are comparatively well suited to extracting useful information from loosely structured or even plain text. A typical tool of this class is WHISK [Soderland, S. Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning 34, 1-3 (1999), 233-272.].
WHISK is a fairly widely used system that performs automatic rule learning over semi-structured information. Compared with earlier systems, WHISK can learn the needed rules according to more precise principles, because it can learn rules not only from the content formats of a particular field but also with some added semantic information. The rules in WHISK are expressed as regular expressions, and a single rule can specify the extraction of one or more information fields. But the results of this automatically-learned rule system still show many deficiencies when actually used for extraction. The reason is that, for a complete, large semi-structured document, the rules WHISK produces can only try to concentrate on the region of information to be extracted; it is difficult for a single rule both to reject unwanted information completely and to delimit precisely the information that the extraction needs. On the other hand, the semantic information in the rules WHISK learns is represented by listing all learned synonyms, to guarantee that any different term with the same meaning can be recognized. This makes the final rules very cumbersome, and because the words expressing a given meaning emerge endlessly, the final extraction results are not good; the problem is even more evident in Chinese extraction.
Summary of the invention:
The purpose of this invention is to provide a metadata extraction method that extracts automatically according to specified rules. The method can extract the metadata in a web page according to multiple rules; the rules' matching order is determined by given priorities, and the results are refined according to a two-stage extraction method.
Technical scheme of the present invention is as follows:
The metadata automatic extraction method based on multiple rules in web search comprises the following steps:
1. Preprocess the raw web pages, normalizing all pages into a relatively standard format. The open-source toolkit NekoHTML can be used for this preprocessing; it belongs to the series of tools (Java APIs) written by J. Andrew Clark. NekoHTML is a simple HTML scanner and tag balancer that lets the calling program parse HTML documents and access their information through standard XML interfaces. Pages processed in this way meet the basic requirements, and even loosely structured HTML documents are converted into pages conforming to the basic XML document standard.
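The tag-balancing idea can be illustrated without NekoHTML itself. The toy sketch below is not part of the patent and does far less than NekoHTML: it only lowercases tag names and self-closes the void elements <br> and <img>, to show the kind of normalization this step performs:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy stand-in for the NekoHTML normalization step (illustration only).
public class PageNormalizer {
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*?)(/?)>");

    // Lowercase tag names; self-close the void elements <br> and <img>.
    public static String normalize(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String close = m.group(1);
            String name = m.group(2).toLowerCase();
            String slash = m.group(4);
            boolean isVoid = name.equals("br") || name.equals("img");
            if (slash.isEmpty() && isVoid && close.isEmpty()) {
                slash = "/";            // complete the missing closing mark
            }
            String rep = "<" + close + name + m.group(3) + slash + ">";
            m.appendReplacement(out, Matcher.quoteReplacement(rep));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, normalize("<HTML><Body>Hi<BR></BODY></HTML>") yields "<html><body>Hi<br/></body></html>".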
2. Information pre-positioning
Information pre-positioning means performing a first, coarse location of the most valuable part of the whole web document, i.e. the page content that contains the information to be extracted.
Another open-source tool can be used here: Apache xalan-java, an XSLT processor. Within it, XPath is the language XSLT uses to address the parts of an XML document; it describes how to identify, select, and match the components of an XML document, including elements, attributes, text content, and so on.
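Concretely, pre-positioning by XPath can be sketched with the JDK's built-in XPath support, which follows the same XPath semantics as xalan-java (the page markup and path below are invented for illustration):

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Pre-positioning: pull the core content area out of a normalized page.
public class PrePositioner {
    public static String locate(String xml, String xpathExpr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xp = XPathFactory.newInstance().newXPath();
            // evaluate(...) returns the string value of the first matching node
            return xp.evaluate(xpathExpr, doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

For a normalized page such as <html><body><div id="nav">Home</div><div id="core">Data Structures</div></body></html>, the path //div[@id='core'] selects only the core area and skips the navigation bar.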
3. Specified-information extraction based on multiple rules
Metadata is extracted from the pre-positioned content according to the specified rules. The rules can be described with the regular-expression toolkit provided by java.util.regex, and content that matches a rule is extracted.
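As a minimal illustration (the field name and pattern below are invented, not the patent's actual rules), a single rule with one capture group can be applied to the pre-positioned text like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Apply one extraction rule: a regular expression with a single capture group.
public class RuleExtractor {
    public static String extract(String text, String rule) {
        Matcher m = Pattern.compile(rule).matcher(text);
        return m.find() ? m.group(1) : null;   // null when the rule does not match
    }
}
```

For example, extract("Course Number: 6.006 (Fall)", "Course Number:\\s*(\\S+)") returns "6.006".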
Further, in actual extraction no single XPath expression or regular expression can handle the extraction for all pages. The method of the invention therefore sets, for the same class of pages, several different pre-positioning XPath paths and a series of different rules, and assigns each XPath path and each rule a priority. The pre-positioning path that the page matches is chosen according to priority, so as to extract the core content area; the rules are then matched in order of their different priorities until the wanted content is extracted. All the prioritized XPath paths and rules applicable to a given page type are written into a configuration file in XML format, which is parsed into a DOM tree for processing.
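The priority-ordered matching described above can be sketched as follows (a simplification: real configurations also pair each rule set with a pre-positioning XPath and a page type):

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Try rules in descending priority order; the first capture that matches wins.
public class PriorityMatcher {
    public static String firstMatch(String text, List<String> rulesInPriorityOrder) {
        for (String rule : rulesInPriorityOrder) {
            Matcher m = Pattern.compile(rule).matcher(text);
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;   // no rule matched
    }
}
```

For example, with the rule list ["Course:\\s*(\\w+)", "Title:\\s*(\\w+)"], the text "Title: Algorithms" fails the first rule and is caught by the second, yielding "Algorithms".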
Further, in the information-extraction process of step 3 of the above method, the extracted information is also refined as follows, first distinguishing which parts are redundant.
First determine which data fields need the refinement step: some data fields can be extracted with guaranteed accuracy once their rules are specified. For the other field classes, however, a large amount of redundant or garbage text appears, and the machine-learning results naturally mark those data fields as the type needing refinement.
A two-tuple <word, feature> represents each unnecessary word encountered during learning. Here word is a record of the unnecessary word itself, and feature is the type of the word, judged and marked according to the given classification. In the method of the invention, the words encountered are divided into the following 4 types: (1) all letters lowercase: the lowcase type; (2) containing digits and letters, or digits only: the in_number type; (3) the first word of the data field, or the first word of a sentence, with its first letter capitalized: the first_word type; (4) a word whose first letter is capitalized and whose preceding or following word also begins with a capital: the con_capital type.
During learning, if the probability that a certain word or word type appears as an impurity at a particular position exceeds a specific threshold, the word or word type and the position of its appearance are recorded. During extraction, if an impurity word appears at the recorded matching position, i.e. it matches the first element of <word, feature>, it is filtered out directly. If a word appears that differs from the impurity word but has the same type as the impurity, it is likewise judged to be an impurity and filtered out.
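The four word types and the impurity filter can be sketched as below. This is a simplification of the learned <word, feature> records: threshold and position learning are omitted, and the sample words are invented for illustration:

```java
import java.util.Set;

// Sketch of the <word, feature> impurity filter; assumes non-empty tokens.
public class ImpurityFilter {
    // Classify a word into the four feature types described in the text.
    public static String featureOf(String word, boolean isFirst, boolean neighborCapitalized) {
        if (word.matches("[a-z]+")) return "lowcase";           // all letters lowercase
        if (word.matches(".*\\d.*")) return "in_number";        // digits, or digits plus letters
        boolean capitalized = Character.isUpperCase(word.charAt(0));
        if (capitalized && isFirst) return "first_word";        // capitalized first word
        if (capitalized && neighborCapitalized) return "con_capital"; // run of capitalized words
        return "other";
    }

    // Drop words recorded as impurities, by exact word or by learned feature type.
    public static String filter(String[] words, Set<String> impurityWords, Set<String> impurityFeatures) {
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            boolean neighborCap =
                    (i > 0 && Character.isUpperCase(words[i - 1].charAt(0)))
                 || (i + 1 < words.length && Character.isUpperCase(words[i + 1].charAt(0)));
            String f = featureOf(words[i], i == 0, neighborCap);
            if (impurityWords.contains(words[i]) || impurityFeatures.contains(f)) continue;
            if (kept.length() > 0) kept.append(' ');
            kept.append(words[i]);
        }
        return kept.toString();
    }
}
```

For example, if "ISBN" was recorded as an impurity word and the in_number type as an impurity feature, filtering the tokens ["ISBN", "12345", "Knuth"] keeps only "Knuth".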
Advantage of the present invention and good effect:
The invention provides a metadata extraction method that extracts automatically according to specified rules. The invention gives an inventive two-step location process: the first step separates the core area from the large-scale garbage, and the second performs rule-based extraction on the core area, so that extraction accuracy is greatly improved. The method of the invention can also extract the metadata in a web page according to multiple rules, with the matching order decided by given priorities, and refines the results according to the two-step extraction method. With this system, different users only need to write rules for different pages in the prescribed format; extraction then proceeds automatically, with no program-specific modification required. In addition, the invention further refines the information during extraction: we found that learning and recording impurity information works much better than learning and recording the leading vocabulary, so the refinement part of the invention learns and filters impurities, and experiments show the effect is remarkable.
Experiments show that the method of the invention handles semi-structured web pages well, with an F-measure above 85%, and has good practical value.
Description of drawings:
Fig. 1 is a structural schematic of the invention.
Fig. 2 is a flowchart of the machine-learning stage of the invention.
Fig. 3 is a flowchart of the core program of the invention.
Embodiment:
The specific embodiment of the invention is described in detail below in conjunction with an example of educational-resource integration.
This embodiment describes the method for extracting web-page metadata from resource websites within educational-resource integration work. The goal of that work is an integrated platform of educational resources for e-learners and teachers. As an important step of the work, metadata extraction must achieve good accuracy on semi-structured pages and have the ability to process loosely structured documents.
As shown in Fig. 1, in this embodiment the extraction of metadata comprises the following steps:
1. coarse webpage pre-service
Generally, extraction algorithms are better suited to processing XML documents and well-formed HTML documents. "Well-formed" here means that, when the page was written, its tags were strictly paired one-to-one: every opening tag has a corresponding closing tag.
Because HTML pages do not strictly require tags to be paired in the code, the code formats of the various pages circulating on the network can differ greatly. What the crawler program grabs are raw pages with no standardization to speak of.
Before extraction work formally begins, these raw pages need a certain amount of preprocessing: the missing tags are completed and all pages are normalized into a relatively standard format, to ease later processing. At this stage, all images in the pages are filtered out automatically. The open-source toolkit NekoHTML is used here; it belongs to the series of tools (Java APIs) written by J. Andrew Clark. NekoHTML is a simple HTML scanner and tag balancer that lets the calling program parse HTML documents and access their information through standard XML interfaces. Pages processed in this way meet the basic requirements, and even loosely structured HTML documents are converted into pages conforming to the basic XML document standard.
2. information extraction stage
After the first-stage page preprocessing, the information-extraction stage begins. To improve extraction accuracy, this stage is divided into three steps: information pre-positioning, information extraction based on multiple specified rules, and information refinement.
(1) information pre-determined bit
Information pre-positioning performs a first, coarse location of the most valuable part of the whole web document, i.e. the page's core content. A page to be processed often contains both a core content area and structured content areas. The core content generally includes the information to be extracted, for example the course name, course number, course features, and teacher information; the structured areas include navigation bars, search-field navigation, and so on. These structured contents are meaningless for the processing, but every page contains them. Therefore, in the first extraction step, the valuable content area is extracted as a whole. It may still contain page code such as tags, which are not ultimately needed; but the pre-positioning step does not distinguish these tags concretely — confirming the whole core part of the page is sufficient. Here another open-source tool is used: Apache xalan-java, an XSLT processor. Within it, XPath is the language XSLT uses to address the parts of an XML document; it describes how to identify, select, and match the components of an XML document, including elements, attributes, text content, and so on.
The content extracted by pre-positioning is handed to the multiple-rule information-extraction step for processing. For each field, once the corresponding processing rule is specified, the program can extract as required. The rule descriptions here are written and processed with the regular-expression toolkit provided by java.util.regex: the given regular expression is matched against the target string, and content that matches the rule is conveniently extracted.
(2) extract based on multiple regular specified message
In actual extraction, no single XPath expression or regular expression can handle the extraction for all pages. To make the method of the invention more flexible, an option is provided: for the same class of pages, different pre-positioning XPath paths and a series of different rules can be set, with every XPath path and rule given its own priority, so that extraction becomes more accurate. The program picks, by priority, the pre-positioning path that the page matches, so as to extract the core content area; it then matches the rules in order of their different priorities until the wanted content is extracted.
All the prioritized XPath paths and rules applicable to a given page type are written into a configuration file in XML format, which is parsed into a DOM tree for processing. The configuration file is written in the following format:
(The configuration-file listing appears as an image in the original publication.)
The first layer of the configuration file is the Domain information, i.e. the common root URL of a class of pages. The next layer gives the URL rule of each kind of page to be processed under that Domain: a regular expression is matched against the page address to decide which set of rules to use for metadata extraction. Then come the specifications of the different metadata extraction fields, for example the course name (title), the course content (outline), etc. For each information field, the pre-positioning XPaths are then given in priority order, for locating the core content; after the XPaths come the concrete extraction rules, again expressed as regular expressions.
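A hypothetical sketch of such a layered configuration file (the element and attribute names here are illustrative only; the patent's actual schema is shown in the original figure):

```xml
<!-- Illustrative only: Domain -> per-page URL rule -> fields -> prioritized XPaths and rules -->
<domain root="http://ocw.example.edu">
  <page urlPattern=".*/courses/.*\.htm">
    <field name="title">
      <xpath priority="1">//div[@id='course_title']</xpath>
      <xpath priority="2">//h1</xpath>
      <rule priority="1">Course\s*Title:\s*(.+)</rule>
    </field>
    <field name="outline">
      <xpath priority="1">//div[@id='syllabus']</xpath>
      <rule priority="1">Outline:\s*(.+)</rule>
    </field>
  </page>
</domain>
```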
(3) information is refined
After all metadata has been extracted according to the rules, the extraction results for some information fields contain a lot of irrelevant content. As is well known, a fairly general metadata extraction program cannot achieve one-hundred-percent accurate extraction and classification: whether for loosely structured HTML documents or semanticized XML documents, one cannot stipulate that all pages under the same Domain use an identical form of expression for the same object or the same attribute.
To guarantee accuracy, the irrelevant words need to be weeded out as far as possible. This step uses a machine-learning-based method; the learning process is shown in Fig. 2. Observation shows that, generally speaking, the unnecessary information repeats frequently in both content and form; therefore, the training data in this module is used chiefly to distinguish which parts are redundant.
First determine which data fields need the refinement step. Some data fields — for example course id, course name, and course description — can be extracted with guaranteed accuracy once their rules are specified: their rules are relatively fixed and their description methods fairly uniform, so essentially no irrelevant words are mixed into the extraction, and the second-step result can be output directly as the final extraction result. But for the other field classes — such as collateral-reading author, collateral-reading title, and collateral-reading source — a large amount of redundant or garbage text appears unless the information is refined, and the machine-learning results naturally mark these data fields as the type needing refinement.
A two-tuple <word, feature> represents each unnecessary word encountered during learning. Here word is a record of the unnecessary word itself, and feature is the type of the word, judged and marked according to the given classification. In this module, the words encountered are divided into the following 4 types: (1) all letters lowercase: the lowcase type; (2) containing digits and letters, or digits only: the in_number type; (3) the first word of the data field, or the first word of a sentence, with its first letter capitalized: the first_word type; (4) a word whose first letter is capitalized and whose preceding or following word also begins with a capital: the con_capital type.
During learning, if the probability that a certain word or word type appears as an impurity at a particular position exceeds a specific threshold, the word or word type and the position of its appearance are recorded. During extraction, if an impurity word appears at the recorded matching position, i.e. it matches the first element of <word, feature>, it is filtered out directly. If a word appears that differs from the impurity word but has the same type as the impurity, it is likewise judged to be an impurity and filtered out.
Because this work targets semi-structured educational websites and resources, the page content is relatively standard, and the redundant information is often similar or identical. Therefore, the judgment on the word itself serves as the primary criterion: if the redundant information is found to be the same as, or similar to, an existing record, the judgment can be made simply, without extra work.
Performance evaluating
The quality of a metadata extraction system is measured mainly by the following three indices:
Recall (R)

    R = (number of metadata items correctly extracted) / (number of metadata items actually present in the pages)

Precision (P)

    P = (number of metadata items correctly extracted) / (total number of items extracted)

F-measure

    F = (β² + 1)PR / (β²P + R)
Recall and precision directly measure the quality of a metadata extraction system, and the F-measure lets the user strike a balance between them. We consider the precision of metadata extraction more important than recall, because in practical applications accurate data is the guarantee of service quality, and data completeness should be built on a foundation of data accuracy. Therefore the relative-importance value β = 0.5 is chosen, meaning that P is weighted twice as heavily as R.
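For reference, the chosen weighting can be computed directly; with β = 0.5 the formula above reduces to F = 1.25PR / (0.25P + R):

```java
// F-measure as defined above; beta = 0.5 weights precision twice as heavily as recall.
public class FMeasure {
    public static double f(double beta, double p, double r) {
        double b2 = beta * beta;
        return (b2 + 1) * p * r / (b2 * p + r);
    }
}
```

Note that when P = R the formula collapses to F = P for any β, so the weighting only matters when precision and recall differ.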
The test data comprise the MIT OpenCourseWare website of the Massachusetts Institute of Technology (http://www.core.org.cn/OcwWeb), which covers MIT's 1900 courses at both the undergraduate and graduate levels and is a rare free and open educational resource; more than 3000 course resources from the University of Illinois at Urbana-Champaign (UIUC); and 1000 course resources from the University of Wisconsin-Madison (WISC). The following table lists the F-measure of the extraction effect for the different metadata information fields in the three universities' course pages:
Information field                 OCW     UIUC    WISC
Course id                         0.995   0.998   0.996
Course name                       0.995   0.998   0.990
Course description                0.890   0.936   0.997
Collateral reading information    0.821   0.897   --
Teacher's information             0.857   0.890   --
Class hour information            0.902   --      0.953
Syllabus                          0.798   --      --
Because not every course page contains all the metadata fields the project needs, some values in the table above could not be computed: for example, the UIUC course pages contain no class-hour information or syllabus, and the WISC pages contain no teacher information or syllabus, so the program cannot extract that metadata and its accuracy cannot be computed.
As the table shows, for information fields with relatively fixed formats, such as course id, course name, and course description, the extraction effect of this project stays above 90%, providing very good metadata resources for practical applications. For highly variable information fields, such as collateral-reading information, teacher information, and syllabus, the project still reaches an accuracy of about 90% on course websites whose description formats are fairly fixed (such as UIUC and WISC); even when the format is not very fixed, the extraction accuracy reaches 80%-85%, which likewise satisfies practical needs and completes the metadata-integration task for most websites.

Claims (7)

  1. A metadata automatic extraction method based on multiple rules in web search, characterized by comprising the following steps:
    (1) preprocessing the raw web pages and normalizing all pages into a standard format;
    (2) pre-positioning, within the web documents normalized into the standard format, the page content that contains the information to be extracted;
    (3) extracting metadata from the pre-positioned content according to the specified rules.
  2. The metadata automatic extraction method of claim 1, characterized in that the preprocessing in step (1) converts the pages from HTML documents into XML documents.
  3. The metadata automatic extraction method of claim 1, characterized in that the pre-positioning in step (2) identifies, selects, and matches the components of the XML document.
  4. The metadata automatic extraction method of claim 1, characterized in that, in step (2), for the same class of web pages, different pre-positioning paths are set, each path with its own priority; the pre-positioning path that the page matches is picked out according to priority, and the picked-out pre-positioning path is used to extract the core content area.
  5. The metadata automatic extraction method of claim 1, characterized in that, in step (3), for the same class of web pages, a series of different rules is set, each rule with its own priority, and the rules are matched according to their different priorities to carry out the information extraction.
  6. The metadata automatic extraction method of claim 1, characterized in that the rules of step (3) are described with the regular-expression toolkit provided by java.util.regex.
  7. The metadata automatic extraction method of claim 1, characterized in that, in the information-extraction process of step (3), the information is further refined as follows:
    (1) a two-tuple <word, feature> represents each unnecessary word, where word is a record of the unnecessary word itself and feature is the type of the word, judged and marked according to the given classification;
    (2) if the probability that a certain word or word type appears as an impurity at a particular position exceeds a certain threshold, the word or word type and the position of its appearance are recorded;
    (3) during extraction, if an impurity word appears at the recorded matching position, i.e. it matches the first element of <word, feature>, it is filtered out directly; if a word appears that differs from the impurity word but has the same type as the impurity, it is likewise judged to be an impurity and filtered out.
CNB2007101185908A 2007-07-10 2007-07-10 Metadata automatic extraction method based on multiple rule in network search Expired - Fee Related CN100461183C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007101185908A CN100461183C (en) 2007-07-10 2007-07-10 Metadata automatic extraction method based on multiple rule in network search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007101185908A CN100461183C (en) 2007-07-10 2007-07-10 Metadata automatic extraction method based on multiple rule in network search

Publications (2)

Publication Number Publication Date
CN101101600A CN101101600A (en) 2008-01-09
CN100461183C true CN100461183C (en) 2009-02-11

Family

ID=39035874

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007101185908A Expired - Fee Related CN100461183C (en) 2007-07-10 2007-07-10 Metadata automatic extraction method based on multiple rule in network search

Country Status (1)

Country Link
CN (1) CN100461183C (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101576885B (en) * 2008-05-08 2012-02-22 韩露 Technical scheme for extracting dynamic generation web page contents
CN101290624B (en) * 2008-06-11 2012-02-01 华东师范大学 News web page metadata automatic extraction method
JP2011100403A (en) * 2009-11-09 2011-05-19 Sony Corp Information processor, information extraction method, program and information processing system
CN102467497B (en) * 2010-10-29 2014-11-05 国际商业机器公司 Method and system for text translation in verification program
CN102799597A (en) * 2011-05-26 2012-11-28 株式会社日立制作所 Content extraction method
CN103207878B (en) * 2012-01-17 2016-05-04 阿里巴巴集团控股有限公司 The inspection method releasing news and device
CN102819580B (en) * 2012-07-25 2016-09-21 广州翼锋信息科技有限公司 Internet third party online media sites broadcast monitoring method and system
CN103838796A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Webpage structured information extraction method
CN103902578B (en) * 2012-12-27 2017-05-31 中国移动通信集团四川有限公司 A kind of method for abstracting web page information and device
CN103092973B (en) * 2013-01-24 2015-12-02 浪潮(北京)电子信息产业有限公司 information extraction method and device
CN104598472B (en) * 2013-10-31 2019-02-12 腾讯科技(深圳)有限公司 The extracting method of web page contents, apparatus and system
CN103970898A (en) * 2014-05-27 2014-08-06 重庆大学 Method and device for extracting information based on multistage rule base
CN105653531B (en) * 2014-11-12 2020-02-07 中兴通讯股份有限公司 Data extraction method and device
CN106033417B (en) * 2015-03-09 2020-07-21 深圳市腾讯计算机系统有限公司 Method and device for sequencing series of video search
CN104965783A (en) * 2015-06-16 2015-10-07 百度在线网络技术(北京)有限公司 Method and apparatus for monitoring web content presentation
US10650065B2 (en) 2016-02-26 2020-05-12 Rovi Guides, Inc. Methods and systems for aggregating data from webpages using path attributes
CN106126688B (en) * 2016-06-29 2020-03-24 厦门趣处网络科技有限公司 Intelligent network information acquisition system and method based on WEB content and structure mining
CN108694205B (en) * 2017-04-11 2021-01-26 北京京东尚科信息技术有限公司 Method and device for matching target field
CN107608949B (en) * 2017-10-16 2019-04-16 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device based on semantic model
CN108133010A (en) * 2017-12-22 2018-06-08 新奥(中国)燃气投资有限公司 A kind of information grasping means and device
CN108874870A (en) * 2018-04-24 2018-11-23 北京中科闻歌科技股份有限公司 A kind of data pick-up method, equipment and computer can storage mediums
CN109684457A (en) * 2018-12-27 2019-04-26 清华大学 A kind of method and system that personal share advertisement data is extracted
CN109783728B (en) * 2018-12-29 2021-10-19 安徽听见科技有限公司 Page crawler rule updating method and system
CN110096568B (en) * 2019-03-22 2022-12-06 泰康保险集团股份有限公司 Method, device, equipment and storage medium for marketing company performance early warning
CN111767363A (en) * 2019-04-02 2020-10-13 杭州全拓科技有限公司 Internet-based big data analysis and extraction device and method
CN110704781A (en) * 2019-09-30 2020-01-17 北京百度网讯科技有限公司 Web page parser

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6999959B1 (en) * 1997-10-10 2006-02-14 Nec Laboratories America, Inc. Meta search engine
US7162691B1 (en) * 2000-02-01 2007-01-09 Oracle International Corp. Methods and apparatus for indexing and searching of multi-media web pages
CN1967535A (en) * 2005-11-17 2007-05-23 国际商业机器公司 System and method for using text analytics to identify a set of related documents from a source document

Also Published As

Publication number Publication date
CN101101600A (en) 2008-01-09

Similar Documents

Publication Publication Date Title
CN100461183C (en) Metadata automatic extraction method based on multiple rule in network search
CN110147436B (en) Education knowledge map and text-based hybrid automatic question-answering method
Nielsen Subject Access Points in Electronic Retrieval
EP3096246A1 (en) Method, system and storage medium for realizing intelligent answering of questions
CN112231494B (en) Information extraction method and device, electronic equipment and storage medium
Peters et al. Tag gardening for folksonomy enrichment and maintenance
Rubinstein Historical corpora meet the digital humanities: the Jerusalem corpus of emergent modern Hebrew
Nowroozi et al. The comparison of thesaurus and ontology: Case of ASIS&T web-based thesaurus and designed ontology
Chieze et al. An automatic system for summarization and information extraction of legal information
Sakai et al. ASKMi: A Japanese Question Answering System based on Semantic Role Analysis.
Seadle Managing and mining historical research data
KR102256007B1 (en) System and method for searching documents and providing an answer to a natural language question
Biletskiy et al. Information extraction from syllabi for academic e-Advising
Généreux et al. A large Portuguese corpus on-line: cleaning and preprocessing
Aouichat et al. Building TALAA-AFAQ, a corpus of Arabic FActoid question-answers for a question answering system
Mohnot et al. Hybrid approach for Part of Speech Tagger for Hindi language
Abdelhamid et al. Using ontology for associating Web multimedia resources with the Holy Quran
Balaji et al. Finding related research papers using semantic and co-citation proximity analysis
Drymonas et al. Opinion mapping travelblogs
Mastora et al. Failed queries: A morpho-syntactic analysis based on transaction log files
Kergosien et al. Automatic identification of research fields in scientific papers
Kundu How to write research article for a journal: Techniques and rules
Thanadechteemapat et al. Thai word segmentation for visualization of thai web sites
Manna et al. Information retrieval-based question answering system on foods and recipes
De Groat Future directions in metadata remediation for metadata aggregators

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2009-02-11

Termination date: 2016-07-10