CN100461183C - Metadata automatic extraction method based on multiple rule in network search - Google Patents


Info

Publication number
CN100461183C
CN100461183C · CNB2007101185908A · CN200710118590A
Authority
CN
China
Prior art keywords
word
rule
metadata
information
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2007101185908A
Other languages
Chinese (zh)
Other versions
CN101101600A (en)
Inventor
张铭
杨宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CNB2007101185908A priority Critical patent/CN100461183C/en
Publication of CN101101600A publication Critical patent/CN101101600A/en
Application granted granted Critical
Publication of CN100461183C publication Critical patent/CN100461183C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The method includes the following steps: (1) preprocessing the raw web pages and normalizing all pages into a more regular format; (2) pre-positioning: performing a first, coarse location of the page content, within each web-page file, that contains the information to be extracted; (3) extracting metadata from the pre-positioned content according to specified rules. The invention first separates the core content area from the large-scale noise, then performs rule-based extraction only on the core area, which greatly improves extraction accuracy. The invention can also extract the metadata in a web page according to multiple rules: the rules' matching order is determined by given priorities, and the results are refined according to a two-stage extraction method.

Description

Metadata automatic extraction method based on multiple rules in web search
Technical field:
The invention belongs to the technical field of web search, and in particular relates to methods of subject-oriented search over Internet pages.
Background technology:
Metadata is data that describes data ("data about data"). It is used to describe the features and attributes of data, and serves as a tool for describing, organizing, and discovering Internet information resources. In every field there are large-scale resource-publishing websites; by extracting the metadata on these sites, a large number of useful resources can be collected, helping different users build domain-specific databases. Metadata extraction therefore has very wide application.
Metadata extraction occupies the fundamental data-preparation position in the overall process of information organization and retrieval. The data sources of the extraction process first undergo necessary preprocessing, in which documents with serious problems or omissions in format, content, or language are rejected, yielding text documents with a relatively regular format. The documents are then processed by the metadata extraction module, which generates document metadata conforming to the standard definition and stores the results in a metadata repository or in other files or text databases associated with the concrete system. Document metadata can also be organized in different standard formats, depending on the system, to facilitate data sharing and information exchange. With the further development of the network, web-page metadata has become one of the main ways of storing useful information, so research on web-page metadata extraction has grown increasingly broad and increasingly valued.
In the field of metadata extraction, researchers have carried out a large amount of theoretical work, proposed many extraction techniques and methods, and developed many usable tools based on these theories.
A common classification in the field of Web information extraction divides Web metadata extraction methods into roughly six classes, according to the techniques each tool adopts: (1) language-based metadata extraction methods; (2) methods based on HTML structure; (3) methods based on NLP (Natural Language Processing); (4) induction-based methods; (5) model-based methods; (6) Ontology-based methods.
NLP-based metadata extraction applies NLP techniques to learn extraction rules from documents described in natural language. These methods mainly employ phrase-syntax and semantic-analysis techniques, including identification and tagging of syntactic constituents, keyword extraction, retrieval-feature extraction, indexing, and so on. NLP-based methods are comparatively well suited to extracting useful information from loosely structured or even plain text. A typical tool of this class is WHISK [Soderland, S. Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning 34, 1-3 (1999), 233-272.].
WHISK is a fairly widely used system that performs automatic rule learning over semi-structured information. Compared with earlier systems, WHISK can learn the needed rules according to more precise principles, because it can learn rules not only from the content formats of a particular field but also with some added semantic information. The rules in WHISK are expressed as regular expressions, and a single rule can specify the extraction of one or more information fields. But the results of this automatically-learned rule system still show many deficiencies when actually used for extraction. The reason is that, for a complete, large semi-structured document, the rules WHISK produces can only try to concentrate on the region of information to be extracted; it is difficult for a single rule both to reject unwanted information completely and to delimit precisely the information that the extraction needs. On the other hand, the semantic information in the rules WHISK learns is represented by listing all learned synonyms, to guarantee that any different term with the same meaning can be recognized. This makes the final rules very cumbersome, and because the words expressing a given meaning emerge endlessly, the final extraction results are not good; the problem is even more evident in Chinese extraction.
Summary of the invention:
The purpose of this invention is to provide a metadata extraction method that extracts automatically according to specified rules. The method can extract the metadata in a web page according to multiple rules; the rules' matching order is determined by given priorities, and the results are refined according to a two-stage extraction method.
Technical scheme of the present invention is as follows:
The metadata automatic extraction method based on multiple rules in web search comprises the following steps:
1. Preprocess the raw web pages, normalizing all pages into a relatively standard format. The open-source toolkit NekoHTML can be used for this preprocessing; it belongs to the series of tools (Java APIs) written by J. Andrew Clark. NekoHTML is a simple HTML scanner and tag balancer that lets the calling program parse HTML documents and access their information through standard XML interfaces. Pages processed in this way meet the basic requirements, and even loosely structured HTML documents are converted into pages conforming to the basic XML document standard.
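The tag-balancing idea can be illustrated without NekoHTML itself. The toy sketch below is not part of the patent and does far less than NekoHTML: it only lowercases tag names and self-closes the void elements <br> and <img>, to show the kind of normalization this step performs:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy stand-in for the NekoHTML normalization step (illustration only).
public class PageNormalizer {
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*?)(/?)>");

    // Lowercase tag names; self-close the void elements <br> and <img>.
    public static String normalize(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String close = m.group(1);
            String name = m.group(2).toLowerCase();
            String slash = m.group(4);
            boolean isVoid = name.equals("br") || name.equals("img");
            if (slash.isEmpty() && isVoid && close.isEmpty()) {
                slash = "/";            // complete the missing closing mark
            }
            String rep = "<" + close + name + m.group(3) + slash + ">";
            m.appendReplacement(out, Matcher.quoteReplacement(rep));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, normalize("<HTML><Body>Hi<BR></BODY></HTML>") yields "<html><body>Hi<br/></body></html>".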
2. Information pre-positioning
Information pre-positioning means performing a first, coarse location of the most valuable part of the whole web document, i.e. the page content that contains the information to be extracted.
Another open-source tool can be used here: Apache xalan-java, an XSLT processor. Within it, XPath is the language XSLT uses to address the parts of an XML document; it describes how to identify, select, and match the components of an XML document, including elements, attributes, text content, and so on.
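Concretely, pre-positioning by XPath can be sketched with the JDK's built-in XPath support, which follows the same XPath semantics as xalan-java (the page markup and path below are invented for illustration):

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Pre-positioning: pull the core content area out of a normalized page.
public class PrePositioner {
    public static String locate(String xml, String xpathExpr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xp = XPathFactory.newInstance().newXPath();
            // evaluate(...) returns the string value of the first matching node
            return xp.evaluate(xpathExpr, doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

For a normalized page such as <html><body><div id="nav">Home</div><div id="core">Data Structures</div></body></html>, the path //div[@id='core'] selects only the core area and skips the navigation bar.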
3. Specified-information extraction based on multiple rules
Metadata is extracted from the pre-positioned content according to the specified rules. The rules can be described with the regular-expression toolkit provided by java.util.regex, and content that matches a rule is extracted.
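As a minimal illustration (the field name and pattern below are invented, not the patent's actual rules), a single rule with one capture group can be applied to the pre-positioned text like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Apply one extraction rule: a regular expression with a single capture group.
public class RuleExtractor {
    public static String extract(String text, String rule) {
        Matcher m = Pattern.compile(rule).matcher(text);
        return m.find() ? m.group(1) : null;   // null when the rule does not match
    }
}
```

For example, extract("Course Number: 6.006 (Fall)", "Course Number:\\s*(\\S+)") returns "6.006".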
Further, in actual extraction no single XPath expression or regular expression can handle the extraction for all pages. The method of the invention therefore sets, for the same class of pages, several different pre-positioning XPath paths and a series of different rules, and assigns each XPath path and each rule a priority. The pre-positioning path that the page matches is chosen according to priority, so as to extract the core content area; the rules are then matched in order of their different priorities until the wanted content is extracted. All the prioritized XPath paths and rules applicable to a given page type are written into a configuration file in XML format, which is parsed into a DOM tree for processing.
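The priority-ordered matching described above can be sketched as follows (a simplification: real configurations also pair each rule set with a pre-positioning XPath and a page type):

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Try rules in descending priority order; the first capture that matches wins.
public class PriorityMatcher {
    public static String firstMatch(String text, List<String> rulesInPriorityOrder) {
        for (String rule : rulesInPriorityOrder) {
            Matcher m = Pattern.compile(rule).matcher(text);
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;   // no rule matched
    }
}
```

For example, with the rule list ["Course:\\s*(\\w+)", "Title:\\s*(\\w+)"], the text "Title: Algorithms" fails the first rule and is caught by the second, yielding "Algorithms".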
Further, in the information-extraction process of step 3 of the above method, the extracted information is also refined as follows, first distinguishing which parts are redundant.
First determine which data fields need the refinement step: some data fields can be extracted with guaranteed accuracy once their rules are specified. For the other field classes, however, a large amount of redundant or garbage text appears, and the machine-learning results naturally mark those data fields as the type needing refinement.
A two-tuple <word, feature> represents each unnecessary word encountered during learning. Here word is a record of the unnecessary word itself, and feature is the type of the word, judged and marked according to the given classification. In the method of the invention, the words encountered are divided into the following 4 types: (1) all letters lowercase: the lowcase type; (2) containing digits and letters, or digits only: the in_number type; (3) the first word of the data field, or the first word of a sentence, with its first letter capitalized: the first_word type; (4) a word whose first letter is capitalized and whose preceding or following word also begins with a capital: the con_capital type.
During learning, if the probability that a certain word or word type appears as an impurity at a particular position exceeds a specific threshold, the word or word type and the position of its appearance are recorded. During extraction, if an impurity word appears at the recorded matching position, i.e. it matches the first element of <word, feature>, it is filtered out directly. If a word appears that differs from the impurity word but has the same type as the impurity, it is likewise judged to be an impurity and filtered out.
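The four word types and the impurity filter can be sketched as below. This is a simplification of the learned <word, feature> records: threshold and position learning are omitted, and the sample words are invented for illustration:

```java
import java.util.Set;

// Sketch of the <word, feature> impurity filter; assumes non-empty tokens.
public class ImpurityFilter {
    // Classify a word into the four feature types described in the text.
    public static String featureOf(String word, boolean isFirst, boolean neighborCapitalized) {
        if (word.matches("[a-z]+")) return "lowcase";           // all letters lowercase
        if (word.matches(".*\\d.*")) return "in_number";        // digits, or digits plus letters
        boolean capitalized = Character.isUpperCase(word.charAt(0));
        if (capitalized && isFirst) return "first_word";        // capitalized first word
        if (capitalized && neighborCapitalized) return "con_capital"; // run of capitalized words
        return "other";
    }

    // Drop words recorded as impurities, by exact word or by learned feature type.
    public static String filter(String[] words, Set<String> impurityWords, Set<String> impurityFeatures) {
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            boolean neighborCap =
                    (i > 0 && Character.isUpperCase(words[i - 1].charAt(0)))
                 || (i + 1 < words.length && Character.isUpperCase(words[i + 1].charAt(0)));
            String f = featureOf(words[i], i == 0, neighborCap);
            if (impurityWords.contains(words[i]) || impurityFeatures.contains(f)) continue;
            if (kept.length() > 0) kept.append(' ');
            kept.append(words[i]);
        }
        return kept.toString();
    }
}
```

For example, if "ISBN" was recorded as an impurity word and the in_number type as an impurity feature, filtering the tokens ["ISBN", "12345", "Knuth"] keeps only "Knuth".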
Advantage of the present invention and good effect:
The invention provides a metadata extraction method that extracts automatically according to specified rules. The invention gives an inventive two-step location process: the first step separates the core area from the large-scale garbage, and the second performs rule-based extraction on the core area, so that extraction accuracy is greatly improved. The method of the invention can also extract the metadata in a web page according to multiple rules, with the matching order decided by given priorities, and refines the results according to the two-step extraction method. With this system, different users only need to write rules for different pages in the prescribed format; extraction then proceeds automatically, with no program-specific modification required. In addition, the invention further refines the information during extraction: we found that learning and recording impurity information works much better than learning and recording the leading vocabulary, so the refinement part of the invention learns and filters impurities, and experiments show the effect is remarkable.
Experiments show that the method of the invention handles semi-structured web pages well, with an F-measure above 85%, and has good practical value.
Description of drawings:
Fig. 1 is a structural schematic of the invention.
Fig. 2 is a flowchart of the machine-learning stage of the invention.
Fig. 3 is a flowchart of the core program of the invention.
Embodiment:
The specific embodiment of the invention is described in detail below in conjunction with an example of educational-resource integration.
This embodiment describes the method for extracting web-page metadata from resource websites within educational-resource integration work. The goal of that work is an integrated platform of educational resources for e-learners and teachers. As an important step of the work, metadata extraction must achieve good accuracy on semi-structured pages and have the ability to process loosely structured documents.
As shown in Fig. 1, in this embodiment the extraction of metadata comprises the following steps:
1. coarse webpage pre-service
Generally, extraction algorithms are better suited to processing XML documents and well-formed HTML documents. "Well-formed" here means that, when the page was written, its tags were strictly paired one-to-one: every opening tag has a corresponding closing tag.
Because HTML pages do not strictly require tags to be paired in the code, the code formats of the various pages circulating on the network can differ greatly. What the crawler program grabs are raw pages with no standardization to speak of.
Before extraction work formally begins, these raw pages need a certain amount of preprocessing: the missing tags are completed and all pages are normalized into a relatively standard format, to ease later processing. At this stage, all images in the pages are filtered out automatically. The open-source toolkit NekoHTML is used here; it belongs to the series of tools (Java APIs) written by J. Andrew Clark. NekoHTML is a simple HTML scanner and tag balancer that lets the calling program parse HTML documents and access their information through standard XML interfaces. Pages processed in this way meet the basic requirements, and even loosely structured HTML documents are converted into pages conforming to the basic XML document standard.
2. information extraction stage
After the first-stage page preprocessing, the information-extraction stage begins. To improve extraction accuracy, this stage is divided into three steps: information pre-positioning, information extraction based on multiple specified rules, and information refinement.
(1) information pre-determined bit
Information pre-positioning performs a first, coarse location of the most valuable part of the whole web document, i.e. the page's core content. A page to be processed often contains both a core content area and structured content areas. The core content generally includes the information to be extracted, for example the course name, course number, course features, and teacher information; the structured areas include navigation bars, search-field navigation, and so on. These structured contents are meaningless for the processing, but every page contains them. Therefore, in the first extraction step, the valuable content area is extracted as a whole. It may still contain page code such as tags, which are not ultimately needed; but the pre-positioning step does not distinguish these tags concretely — confirming the whole core part of the page is sufficient. Here another open-source tool is used: Apache xalan-java, an XSLT processor. Within it, XPath is the language XSLT uses to address the parts of an XML document; it describes how to identify, select, and match the components of an XML document, including elements, attributes, text content, and so on.
The content extracted by pre-positioning is handed to the multiple-rule information-extraction step for processing. For each field, once the corresponding processing rule is specified, the program can extract as required. The rule descriptions here are written and processed with the regular-expression toolkit provided by java.util.regex: the given regular expression is matched against the target string, and content that matches the rule is conveniently extracted.
(2) extract based on multiple regular specified message
In actual extraction, no single XPath expression or regular expression can handle the extraction for all pages. To make the method of the invention more flexible, an option is provided: for the same class of pages, different pre-positioning XPath paths and a series of different rules can be set, with every XPath path and rule given its own priority, so that extraction becomes more accurate. The program picks, by priority, the pre-positioning path that the page matches, so as to extract the core content area; it then matches the rules in order of their different priorities until the wanted content is extracted.
All the prioritized XPath paths and rules applicable to a given page type are written into a configuration file in XML format, which is parsed into a DOM tree for processing. The configuration file is written in the following format:
(The configuration-file listing appears as an image in the original publication.)
The first layer of the configuration file is the Domain information, i.e. the common root URL of a class of pages. The next layer gives the URL rule of each kind of page to be processed under that Domain: a regular expression is matched against the page address to decide which set of rules to use for metadata extraction. Then come the specifications of the different metadata extraction fields, for example the course name (title), the course content (outline), etc. For each information field, the pre-positioning XPaths are then given in priority order, for locating the core content; after the XPaths come the concrete extraction rules, again expressed as regular expressions.
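A hypothetical sketch of such a layered configuration file (the element and attribute names here are illustrative only; the patent's actual schema is shown in the original figure):

```xml
<!-- Illustrative only: Domain -> per-page URL rule -> fields -> prioritized XPaths and rules -->
<domain root="http://ocw.example.edu">
  <page urlPattern=".*/courses/.*\.htm">
    <field name="title">
      <xpath priority="1">//div[@id='course_title']</xpath>
      <xpath priority="2">//h1</xpath>
      <rule priority="1">Course\s*Title:\s*(.+)</rule>
    </field>
    <field name="outline">
      <xpath priority="1">//div[@id='syllabus']</xpath>
      <rule priority="1">Outline:\s*(.+)</rule>
    </field>
  </page>
</domain>
```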
(3) information is refined
After all metadata has been extracted according to the rules, the extraction results for some information fields contain a lot of irrelevant content. As is well known, a fairly general metadata extraction program cannot achieve one-hundred-percent accurate extraction and classification: whether for loosely structured HTML documents or semanticized XML documents, one cannot stipulate that all pages under the same Domain use an identical form of expression for the same object or the same attribute.
To guarantee accuracy, the irrelevant words need to be weeded out as far as possible. This step uses a machine-learning-based method; the learning process is shown in Fig. 2. Observation shows that, generally speaking, the unnecessary information repeats frequently in both content and form; therefore, the training data in this module is used chiefly to distinguish which parts are redundant.
First determine which data fields need the refinement step. Some data fields — for example course id, course name, and course description — can be extracted with guaranteed accuracy once their rules are specified: their rules are relatively fixed and their description methods fairly uniform, so essentially no irrelevant words are mixed into the extraction, and the second-step result can be output directly as the final extraction result. But for the other field classes — such as collateral-reading author, collateral-reading title, and collateral-reading source — a large amount of redundant or garbage text appears unless the information is refined, and the machine-learning results naturally mark these data fields as the type needing refinement.
A two-tuple <word, feature> represents each unnecessary word encountered during learning. Here word is a record of the unnecessary word itself, and feature is the type of the word, judged and marked according to the given classification. In this module, the words encountered are divided into the following 4 types: (1) all letters lowercase: the lowcase type; (2) containing digits and letters, or digits only: the in_number type; (3) the first word of the data field, or the first word of a sentence, with its first letter capitalized: the first_word type; (4) a word whose first letter is capitalized and whose preceding or following word also begins with a capital: the con_capital type.
During learning, if the probability that a certain word or word type appears as an impurity at a particular position exceeds a specific threshold, the word or word type and the position of its appearance are recorded. During extraction, if an impurity word appears at the recorded matching position, i.e. it matches the first element of <word, feature>, it is filtered out directly. If a word appears that differs from the impurity word but has the same type as the impurity, it is likewise judged to be an impurity and filtered out.
Because this work targets semi-structured educational websites and resources, the page content is relatively standard, and the redundant information is often similar or identical. Therefore, the judgment on the word itself serves as the primary criterion: if the redundant information is found to be the same as, or similar to, an existing record, the judgment can be made simply, without extra work.
Performance evaluating
The quality of a metadata extraction system is measured mainly by the following three indices:
Recall (R)

    R = (number of metadata items correctly extracted) / (number of metadata items actually present in the pages)

Precision (P)

    P = (number of metadata items correctly extracted) / (total number of items extracted)

F-measure

    F = (β² + 1)PR / (β²P + R)
Recall and precision directly measure the quality of a metadata extraction system, and the F-measure lets the user strike a balance between them. We consider the precision of metadata extraction more important than recall, because in practical applications accurate data is the guarantee of service quality, and data completeness should be built on a foundation of data accuracy. Therefore the relative-importance value β = 0.5 is chosen, meaning that P is weighted twice as heavily as R.
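For reference, the chosen weighting can be computed directly; with β = 0.5 the formula above reduces to F = 1.25PR / (0.25P + R):

```java
// F-measure as defined above; beta = 0.5 weights precision twice as heavily as recall.
public class FMeasure {
    public static double f(double beta, double p, double r) {
        double b2 = beta * beta;
        return (b2 + 1) * p * r / (b2 * p + r);
    }
}
```

Note that when P = R the formula collapses to F = P for any β, so the weighting only matters when precision and recall differ.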
The test data comprise the MIT OpenCourseWare website of the Massachusetts Institute of Technology (http://www.core.org.cn/OcwWeb), which covers MIT's 1900 courses at both the undergraduate and graduate levels and is a rare free and open educational resource; more than 3000 course resources from the University of Illinois at Urbana-Champaign (UIUC); and 1000 course resources from the University of Wisconsin-Madison (WISC). The following table lists the F-measure of the extraction effect for the different metadata information fields in the three universities' course pages:
Information field                 OCW     UIUC    WISC
Course id                         0.995   0.998   0.996
Course name                       0.995   0.998   0.990
Course description                0.890   0.936   0.997
Collateral reading information    0.821   0.897   --
Teacher's information             0.857   0.890   --
Class hour information            0.902   --      0.953
Syllabus                          0.798   --      --
Because not every course page contains all the metadata fields the project needs, some values in the table above could not be computed: for example, the UIUC course pages contain no class-hour information or syllabus, and the WISC pages contain no teacher information or syllabus, so the program cannot extract that metadata and its accuracy cannot be computed.
As the table shows, for information fields with relatively fixed formats, such as course id, course name, and course description, the extraction effect of this project stays above 90%, providing very good metadata resources for practical applications. For highly variable information fields, such as collateral-reading information, teacher information, and syllabus, the project still reaches an accuracy of about 90% on course websites whose description formats are fairly fixed (such as UIUC and WISC); even when the format is not very fixed, the extraction accuracy reaches 80%-85%, which likewise satisfies practical needs and completes the metadata-integration task for most websites.

Claims (7)

  1. A metadata automatic extraction method based on multiple rules in web search, characterized by comprising the following steps:
    (1) preprocessing the raw web pages and normalizing all pages into a standard format;
    (2) pre-positioning, within the web documents normalized into the standard format, the page content that contains the information to be extracted;
    (3) extracting metadata from the pre-positioned content according to the specified rules.
  2. The metadata automatic extraction method of claim 1, characterized in that the preprocessing in step (1) converts the pages from HTML documents into XML documents.
  3. The metadata automatic extraction method of claim 1, characterized in that the pre-positioning in step (2) identifies, selects, and matches the components of the XML document.
  4. The metadata automatic extraction method of claim 1, characterized in that, in step (2), for the same class of web pages, different pre-positioning paths are set, each path with its own priority; the pre-positioning path that the page matches is picked out according to priority, and the picked-out pre-positioning path is used to extract the core content area.
  5. The metadata automatic extraction method of claim 1, characterized in that, in step (3), for the same class of web pages, a series of different rules is set, each rule with its own priority, and the rules are matched according to their different priorities to carry out the information extraction.
  6. The metadata automatic extraction method of claim 1, characterized in that the rules of step (3) are described with the regular-expression toolkit provided by java.util.regex.
  7. The metadata automatic extraction method of claim 1, characterized in that, in the information-extraction process of step (3), the information is further refined as follows:
    (1) a two-tuple <word, feature> represents each unnecessary word, where word is a record of the unnecessary word itself and feature is the type of the word, judged and marked according to the given classification;
    (2) if the probability that a certain word or word type appears as an impurity at a particular position exceeds a certain threshold, the word or word type and the position of its appearance are recorded;
    (3) during extraction, if an impurity word appears at the recorded matching position, i.e. it matches the first element of <word, feature>, it is filtered out directly; if a word appears that differs from the impurity word but has the same type as the impurity, it is likewise judged to be an impurity and filtered out.
CNB2007101185908A 2007-07-10 2007-07-10 Metadata automatic extraction method based on multiple rule in network search Expired - Fee Related CN100461183C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007101185908A CN100461183C (en) 2007-07-10 2007-07-10 Metadata automatic extraction method based on multiple rule in network search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007101185908A CN100461183C (en) 2007-07-10 2007-07-10 Metadata automatic extraction method based on multiple rule in network search

Publications (2)

Publication Number Publication Date
CN101101600A CN101101600A (en) 2008-01-09
CN100461183C true CN100461183C (en) 2009-02-11

Family

ID=39035874

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007101185908A Expired - Fee Related CN100461183C (en) 2007-07-10 2007-07-10 Metadata automatic extraction method based on multiple rule in network search

Country Status (1)

Country Link
CN (1) CN100461183C (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101576885B (en) * 2008-05-08 2012-02-22 韩露 Technical scheme for extracting dynamic generation web page contents
CN101290624B (en) * 2008-06-11 2012-02-01 华东师范大学 News web page metadata automatic extraction method
JP2011100403A (en) * 2009-11-09 2011-05-19 Sony Corp Information processor, information extraction method, program and information processing system
CN102467497B (en) * 2010-10-29 2014-11-05 国际商业机器公司 Method and system for text translation in verification program
CN102799597A (en) * 2011-05-26 2012-11-28 株式会社日立制作所 Content extraction method
CN103207878B (en) * 2012-01-17 2016-05-04 阿里巴巴集团控股有限公司 The inspection method releasing news and device
CN102819580B (en) * 2012-07-25 2016-09-21 广州翼锋信息科技有限公司 Internet third party online media sites broadcast monitoring method and system
CN103838796A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Webpage structured information extraction method
CN103902578B (en) * 2012-12-27 2017-05-31 中国移动通信集团四川有限公司 A kind of method for abstracting web page information and device
CN103092973B (en) * 2013-01-24 2015-12-02 浪潮(北京)电子信息产业有限公司 information extraction method and device
CN104598472B (en) * 2013-10-31 2019-02-12 腾讯科技(深圳)有限公司 The extracting method of web page contents, apparatus and system
CN103970898A (en) * 2014-05-27 2014-08-06 重庆大学 Method and device for extracting information based on multistage rule base
CN105653531B (en) * 2014-11-12 2020-02-07 中兴通讯股份有限公司 Data extraction method and device
CN106033417B (en) * 2015-03-09 2020-07-21 深圳市腾讯计算机系统有限公司 Method and device for sequencing series of video search
CN104965783A (en) * 2015-06-16 2015-10-07 百度在线网络技术(北京)有限公司 Method and apparatus for monitoring web content presentation
US10650065B2 (en) 2016-02-26 2020-05-12 Rovi Guides, Inc. Methods and systems for aggregating data from webpages using path attributes
CN106126688B (en) * 2016-06-29 2020-03-24 厦门趣处网络科技有限公司 Intelligent network information acquisition system and method based on WEB content and structure mining
CN108694205B (en) * 2017-04-11 2021-01-26 北京京东尚科信息技术有限公司 Method and device for matching target field
CN107608949B (en) * 2017-10-16 2019-04-16 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device based on semantic model
CN108133010A (en) * 2017-12-22 2018-06-08 新奥(中国)燃气投资有限公司 A kind of information grasping means and device
CN108874870A (en) * 2018-04-24 2018-11-23 北京中科闻歌科技股份有限公司 A kind of data pick-up method, equipment and computer can storage mediums
CN109684457A (en) * 2018-12-27 2019-04-26 清华大学 A kind of method and system that personal share advertisement data is extracted
CN109783728B (en) * 2018-12-29 2021-10-19 安徽听见科技有限公司 Page crawler rule updating method and system
CN110096568B (en) * 2019-03-22 2022-12-06 泰康保险集团股份有限公司 Method, device, equipment and storage medium for marketing company performance early warning
CN111767363A (en) * 2019-04-02 2020-10-13 杭州全拓科技有限公司 Internet-based big data analysis and extraction device and method
CN110704781A (en) * 2019-09-30 2020-01-17 北京百度网讯科技有限公司 Web page parser

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6999959B1 (en) * 1997-10-10 2006-02-14 Nec Laboratories America, Inc. Meta search engine
US7162691B1 (en) * 2000-02-01 2007-01-09 Oracle International Corp. Methods and apparatus for indexing and searching of multi-media web pages
CN1967535A (en) * 2005-11-17 2007-05-23 国际商业机器公司 System and method for using text analytics to identify a set of related documents from a source document

Also Published As

Publication number Publication date
CN101101600A (en) 2008-01-09

Similar Documents

Publication Publication Date Title
CN100461183C (en) Metadata automatic extraction method based on multiple rule in network search
CN110147436B (en) Education knowledge map and text-based hybrid automatic question-answering method
Nielsen Subject Access Points in Electronic Retrieval
EP3096246A1 (en) Method, system and storage medium for realizing intelligent answering of questions
CN112231494B (en) Information extraction method and device, electronic equipment and storage medium
Peters et al. Tag gardening for folksonomy enrichment and maintenance
Rubinstein Historical corpora meet the digital humanities: the Jerusalem corpus of emergent modern Hebrew
Nowroozi et al. The comparison of thesaurus and ontology: Case of ASIS&T web-based thesaurus and designed ontology
Chieze et al. An automatic system for summarization and information extraction of legal information
Sakai et al. ASKMi: A Japanese Question Answering System based on Semantic Role Analysis.
Seadle Managing and mining historical research data
KR102256007B1 (en) System and method for searching documents and providing an answer to a natural language question
Biletskiy et al. Information extraction from syllabi for academic e-Advising
Généreux et al. A large Portuguese corpus on-line: cleaning and preprocessing
Aouichat et al. Building TALAA-AFAQ, a corpus of Arabic FActoid question-answers for a question answering system
Mohnot et al. Hybrid approach for Part of Speech Tagger for Hindi language
Abdelhamid et al. Using ontology for associating Web multimedia resources with the Holy Quran
Balaji et al. Finding related research papers using semantic and co-citation proximity analysis
Drymonas et al. Opinion mapping travelblogs
Mastora et al. Failed queries: A morpho-syntactic analysis based on transaction log files
Kergosien et al. Automatic identification of research fields in scientific papers
Kundu How to write research article for a journal: Techniques and rules
Thanadechteemapat et al. Thai word segmentation for visualization of thai web sites
Manna et al. Information retrieval-based question answering system on foods and recipes
De Groat Future directions in metadata remediation for metadata aggregators

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2009-02-11

Termination date: 2016-07-10