CN102768660A - Dynamic-interaction-based generation method of template of internet acquisition system - Google Patents


Publication number
CN102768660A
CN102768660A CN2011101146416A CN201110114641A CN102768660A CN 102768660 A CN102768660 A CN 102768660A CN 2011101146416 A CN2011101146416 A CN 2011101146416A CN 201110114641 A CN201110114641 A CN 201110114641A CN 102768660 A CN102768660 A CN 102768660A
Authority
CN
China
Prior art keywords
node
acquisition system
generation method
source code
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011101146416A
Other languages
Chinese (zh)
Other versions
CN102768660B (en)
Inventor
陈宗华
陈永江
伊鹏
刘永超
李存华
仲兆满
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Golden Feather Network Technology Nanjing Co ltd
Original Assignee
JIANGSU JINGE NETWORK TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGSU JINGE NETWORK TECHNOLOGY Co Ltd
Priority to CN201110114641.6A
Publication of CN102768660A
Application granted
Publication of CN102768660B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a dynamic-interaction-based method for generating templates for an internet acquisition system. The method comprises the following steps: first, accessing the internet to load a target web page and obtain a web page text source code set S; second, identifying the node set N in S using a regular expression that matches tags, and adding a unique serial number to each node in N; third, constructing a model tree T with an interdependent hierarchical structure from N; and fourth, inputting a node ID and a move operation, and using a traversal analysis algorithm to iteratively compute the front and rear tag expressions of the node. With this method, template customization is completed through human-computer interaction, and the operation can be finished with mouse clicks. The acquired information is separated into the fields visible in a browser, such as title, author, content, reply time and reply content. Acquisition efficiency is high: only the content defined by the template is acquired, so network resource usage is low. In addition, the template supports retrieving only new content.

Description

An internet acquisition system template generation method based on dynamic interaction
Technical field
The invention belongs to the field of internet information acquisition, and specifically relates to an internet acquisition system template generation method based on dynamic interaction.
Background
With the rapid development of the information society, the network has become an important source of information. Network information is massive, complex and unstructured, which makes both acquiring it and carrying out network-based analysis and research very difficult. Extensive practice also shows that acquiring information from the various information carriers on the network (news sites, blogs, forums, microblogs, etc.) is hard. In particular, when information about a specific target must be acquired at short notice, high demands are placed on the adaptability, acquisition efficiency and ease of operation of the acquisition system. To meet growing market demand, methods for quickly generating a template for each acquisition target have emerged. In the field of automatic acquisition, the internet acquisition system template generation method based on dynamic interaction can customize templates for specific information carriers (news sites, blogs, forums, microblogs, etc.), covering common fields such as the title, content, author and time of occurrence of a piece of information.
The method can be applied, on the one hand, to public opinion management in government departments such as public security, security and safety supervision; on the other hand, it can be applied to information analysis, for example in the headhunting industry. The method assumes that internet information carriers vary, so acquiring information from different carriers requires different templates. Many information acquisition systems are already on the market, but most suffer from problems such as overly specific acquisition content, a high technical threshold for template configuration, and a narrow range of acquisition targets. For example, the TIS information collector is suited to news websites but is inefficient on other information carriers (forums, blogs, microblogs, etc.). Heritrix supports a very wide range of acquisition targets, but it collects everything indiscriminately rather than applying different templates to different kinds of information, so the collected information is a jumble of useful and useless material that is hard to analyze. The Network Expression information acquisition system improves on the defects of both systems, but the improvement is incomplete: when acquiring different objects it only defines some acquisition rules to distinguish the targets, and those rules are highly specialized and demand considerable operating skill, which hinders broad adoption in the market.
Summary of the invention
The technical problem to be solved by the invention is to address the deficiencies of the prior art by providing an internet acquisition system template generation method based on dynamic interaction that loads the acquisition target automatically and builds a specific template through interaction with the user.
To solve this problem, the invention adopts the following technical solution. The invention is an internet acquisition system template generation method based on dynamic interaction, characterized by the following steps:
(1) access the internet and load the target page to obtain the web page text source code set S;
(2) identify the node set N in S using a regular expression that matches tags, and add a unique serial number to each node in N;
(3) construct the model tree T, which has an interdependent hierarchical structure, from the node set N;
(4) finally, input a node ID and a move operation, and use a traversal analysis algorithm to iteratively compute the front and rear tag expressions of the node.
In the technical solution above, step (1) can be carried out by the following specific operations:
(1-1) input the web page address and use HttpClient to obtain the original HTML source code set S1;
(1-2) use Cobra to execute the JavaScript in S1 and fill its return results into the corresponding positions in S1, finally obtaining the text source code set S2, which is the web page text source code set S.
In the technical solution above, step (2) can be carried out by the following specific operations:
(2-1) match the text source code set S against the regular expression to obtain a node n;
(2-2) determine whether the type of node n is one of: end-tag node, script node or a node nested in it, link node, br node, or comment/comment-expression node; if it is none of these types, proceed to step (2-3); otherwise return to step (2-1);
(2-3) use a sequence generator to generate a unique system identifier serial number and append it to the tag of node n;
(2-4) return to step (2-1) and repeat until all nodes n have been identified; the set formed by all nodes n is the node set N.
In the technical solution above, step (3) can be carried out by the following specific operations:
(3-1) use HtmlParser to generate the HTML DOM structure as R', where R' is initially the root node of the DOM; at the same time create the custom hierarchical model tree HtmlNode as node R, where R is initially the root node of the tree;
(3-2) assign the node content, ID and child node information of R' to node R;
(3-3) if node R' has child nodes, take the tag of R', convert it into an end node, and add it as a sibling node of R;
(3-4) take the next node after R' and assign it to R', then return to step (3-2), until the HTML DOM has been fully traversed; the hierarchy produced by the nodes R is the hierarchical model tree T.
In the technical solution above, step (4) can be carried out by the following specific operations:
(4-1) click the load-node operation to obtain the node identifier ID;
(4-2) look up the corresponding node object in tree T by ID;
(4-3) click the front/rear move operation O;
(4-4) calculate the front/rear boundary;
(4-5) obtain the front/rear expression.
Compared with the prior art, the internet acquisition system template generation method based on dynamic interaction of the invention has the following effects:
1. Template customization is completed through human-computer interaction: the user only needs to click the mouse to complete the operation, and no specialist technical knowledge is required.
2. The attributes of information are distinguished during acquisition: the acquired information is separated into what is seen in the browser, such as title, author, content, reply time and reply content.
3. Acquisition efficiency is high: only the content defined by the template is acquired, so few network resources are used.
4. The template supports finding new content: only updated content is acquired, and nothing is acquired twice.
Description of drawings
Fig. 1 is a flow block diagram of the method of the invention;
Fig. 2 is a flowchart of step 102 in Fig. 1, identifying the node set N among the tags in set S;
Fig. 3 is a flowchart of step 103 in Fig. 1, constructing the hierarchical model tree T of the nodes;
Fig. 4 is a flowchart of step 104 in Fig. 1, inputting a node ID and a move operation and computing the front and rear tag expressions of the node.
Detailed description
The specific technical solutions of the invention are described further below with reference to the drawings, so that those skilled in the art can better understand the invention; this description does not limit its claims.
Embodiment 1. An internet acquisition system template generation method based on dynamic interaction, with the following steps:
(1) access the internet and load the target page to obtain the web page text source code set S;
(2) identify the node set N in S using a regular expression that matches tags, and add a unique serial number to each node in N;
(3) construct the model tree T, which has an interdependent hierarchical structure, from the node set N;
(4) finally, input a node ID and a move operation, and use a traversal analysis algorithm to iteratively compute the front and rear tag expressions of the node.
Embodiment 2. In the method of embodiment 1, the specific operations of step (1) are as follows:
(1-1) input the web page address and use HttpClient to obtain the original HTML source code set S1;
(1-2) use Cobra to execute the JavaScript in S1 and fill its return results into the corresponding positions in S1, finally obtaining the text source code set S2, which is the web page text source code set S.
Embodiment 3. In the method of embodiment 1 or 2, the specific operations of step (2) are as follows:
(2-1) match the text source code set S against the regular expression to obtain a node n;
(2-2) determine whether the type of node n is one of: end-tag node, script node or a node nested in it, link node, br node, or comment/comment-expression node; if it is none of these types, proceed to step (2-3); otherwise return to step (2-1);
(2-3) use a sequence generator to generate a unique system identifier serial number and append it to the tag of node n;
(2-4) return to step (2-1) and repeat until all nodes n have been identified; the set formed by all nodes n is the node set N.
Embodiment 4. In the method of embodiment 1, 2 or 3, the specific operations of step (3) are as follows:
(3-1) use HtmlParser to generate the HTML DOM structure as R', where R' is initially the root node of the DOM; at the same time create the custom hierarchical model tree HtmlNode as node R, where R is initially the root node of the tree;
(3-2) assign the node content, ID and child node information of R' to node R;
(3-3) if node R' has child nodes, take the tag of R', convert it into an end node, and add it as a sibling node of R;
(3-4) take the next node after R' and assign it to R', then return to step (3-2), until the HTML DOM has been fully traversed; the hierarchy produced by the nodes R is the hierarchical model tree T.
Embodiment 5. In the method of any one of embodiments 1-4, the specific operations of step (4) are as follows:
(4-1) click the load-node operation to obtain the node identifier ID;
(4-2) look up the corresponding node object in tree T by ID;
(4-3) click the front/rear move operation O;
(4-4) calculate the front/rear boundary;
(4-5) obtain the front/rear expression.
Embodiment 6. With reference to Figs. 1-4, an operating experiment carried out with the internet acquisition system template generation method based on dynamic interaction of the invention comprises the following steps:
Step 101: load the target page to obtain the text source code set S, as follows:
(1) Input the web page address and use HttpClient to obtain the original HTML source code set S1. For example, the original HTML source code set S1 obtained over the internet is as follows:
<html>
<head>
<title>Template target to be generated</title>
<script type="text/javascript">
var go = function(){
document.getElementById("content_id").innerHTML = "JS replacement"; }
</script>
</head>
<body onload="javascript:go();">
<p id="content_id">Content</p>
</body>
</html>
The HTML source code set consists of various HTML tags and page content.
(2) Use Cobra to execute the JavaScript in S1 and fill the results into the corresponding positions in S1. Much web page information is displayed only after being processed by scripts such as JS, so the information carried by scripts must be processed as well. For example, part of the information in the set S1 shown in (1) lives in a JS script; after Cobra executes it, the text content of the p tag becomes "JS replacement". Processing in this way yields a new source code set S2, which is the web page text source code set S.
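As a rough illustration of sub-step (1) only, the sketch below fetches the raw source S1 in Python, using the standard library in place of HttpClient. The function name fetch_source and the fallback encoding are assumptions made for this example. Executing the page's JavaScript, which the patent delegates to Cobra, is not reproduced here.

```python
from urllib.request import urlopen


def fetch_source(url, timeout=10.0, fallback_encoding="utf-8"):
    """Fetch a page and return its source text (the set S1 in the description).

    Hypothetical helper: the patent uses HttpClient; urllib stands in for it.
    """
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared by the server, if any; otherwise assume UTF-8.
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors="replace")
```

The returned string would still contain the original "Content" text in the p tag, since no script has been run at this point.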
Step 102: identify the node set N among the tags in set S. With reference to Fig. 2, this comprises the following steps:
Step 201: use the regular expression r = "<([^<>]*)>" to identify a tag in set S as node n. Node n comprises a name, attributes and text content. For example, the node n corresponding to the p tag in the set S1 shown in step 101 (1) is "<p id="content_id">Content".
Step 202: determine whether node n is null. If it is null, all tags in set S have been identified and the identification loop started at step 201 ends; if it is non-null, the tags in set S have not all been identified, so proceed to step 203.
Step 203: determine whether the type of node n is one of: end-tag node, script node or a node nested in it, link node, br node, or comment/comment-expression node. If it is one of these types, return to step 201 to identify the next node n; if it is none of them, proceed to step 204.
Step 204: the sequence generator generates a unique system identifier and appends it to the node's tag. For example, to add a serial number to the p tag in the set S1 shown in step 101 (1): if the generated serial number is 20100120112233 (the current date timestamp plus a random number between 1 and 1000000), it is added to the p tag, whose content then becomes "<p id="content_id" systemid="20100120112233">". After the system identifier has been added, return to step 201, until every node has a system identifier.
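Steps 201-204 can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name add_system_ids is hypothetical, and a simple counter stands in for the timestamp-plus-random sequence generator so that the output is reproducible.

```python
import itertools
import re

TAG_RE = re.compile(r"<([^<>]*)>")       # the expression r from step 201
SKIPPED_TAGS = {"script", "link", "br"}  # tag names that step 203 leaves unnumbered


def add_system_ids(source, first_id=1):
    """Return (annotated_source, ids); each kept start tag gains a systemid attribute."""
    counter = itertools.count(first_id)  # assumption: counter instead of timestamp+random
    ids = []
    inside_script = False

    def annotate(match):
        nonlocal inside_script
        inner = match.group(1)
        words = inner.strip().rstrip("/").split()
        name = words[0].lower() if words else ""
        if inside_script:
            # Nodes nested in a script are skipped; only </script> ends the skip.
            inside_script = name != "/script"
            return match.group(0)
        if name == "script":
            inside_script = True
        if not name or name[0] in "/!" or name in SKIPPED_TAGS:
            return match.group(0)  # end tags, comments, script, link and br nodes
        system_id = str(next(counter))
        ids.append(system_id)
        if inner.rstrip().endswith("/"):  # keep self-closing tags self-closing
            return f'<{inner.rstrip()[:-1].rstrip()} systemid="{system_id}" />'
        return f'<{inner} systemid="{system_id}">'

    return TAG_RE.sub(annotate, source), ids
```

Applied to the sample page, the p tag would come back as a start tag carrying both its original id attribute and the new systemid attribute, while the script, br and end tags are returned unchanged.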
Step 103: construct the hierarchical model tree T of the nodes. With reference to Fig. 3, this comprises the following steps:
Step 301: use HtmlParser to parse the set S, to which system identifiers have been added, and generate the HTML DOM structure R'; R' is initially the root node of the DOM.
Step 302: create the custom hierarchical model tree HtmlNode as node R; R is initially the root node of the tree.
Step 303: assign the tag name, attributes and text content of node R' to node R. For example, when the p tag in the set S1 shown in step 101 (1) is assigned to node R as a node, the information held by node R includes the name "<p id="content_id">" and the system identifier 20100120112233.
Step 304: determine whether the current node R' has child nodes. If it does, proceed to step 305; if it does not, go to step 307.
Step 305: convert the tag name of node R' into an end tag L. For example, the p tag "<p id="content_id" systemid="20100120112233">" from the set S1 shown in step 101 (1) is converted into the end tag "</p>", denoted L.
Step 306: convert the end tag L obtained in step 305 into a node and add it as a sibling node of node R.
Step 307: take the next child node of R' and assign its value to R' as the node for the next iteration.
Step 308: determine whether the newly assigned R' is null. If R' is null, the traversal of the HTML DOM structure generated in step 301 is complete (the set of all nodes R is the hierarchical model tree T) and step 103 ends; if R' is non-null, the traversal is not complete, so return to step 303.
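The tree construction of steps 301-308 might look like the following Python sketch. It uses the standard-library html.parser module instead of the Java HtmlParser library named in the text. The HtmlNode fields, the "#root" and "#text" placeholder names and the VOID_TAGS set are assumptions made for illustration. The part taken from the description is that an element with children is followed by a sibling node for its end tag.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements without an end tag


@dataclass
class HtmlNode:
    tag: str
    system_id: Optional[str] = None
    text: str = ""
    children: List["HtmlNode"] = field(default_factory=list)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = HtmlNode("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Step 303: copy the tag name and system identifier into a new node R.
        node = HtmlNode(tag, dict(attrs).get("systemid"))
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            node = self.stack.pop()
            if node.children:
                # Steps 304-306: a node with children gets an end-tag sibling.
                self.stack[-1].children.append(HtmlNode("/" + tag))

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(HtmlNode("#text", text=data.strip()))


def build_tree(annotated_source):
    builder = TreeBuilder()
    builder.feed(annotated_source)
    builder.close()
    return builder.root
```

Storing the end tag as an explicit sibling keeps the closing markup addressable as a node, which is presumably what later allows a rear boundary to be expressed in terms of tree nodes.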
Step 104: input a node ID and a move operation and compute the front and rear tag expressions of the node. With reference to Fig. 4, this comprises the following steps:
Step 401: click to load a page node and obtain the identifier ID with which the node has been marked. For example, for the set S obtained in step 101 (2), the browser displays the text "JS replacement"; select that text and click the load button, and the system identifier 20100120112233 marked on the p tag is passed to step 402 as a request parameter.
Step 402: after obtaining the request parameter, look up the corresponding node in tree T by ID.
Step 403: click the front/rear move operation button.
Step 404: determine whether the operation is a front move or a rear move. If it is a front move, go to step 405; if it is a rear move, go to step 407.
Step 405: calculate the front boundary.
Step 406: obtain the front expression.
Step 407: calculate the rear boundary.
Step 408: obtain the rear expression.
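The description does not spell out how the front and rear boundaries are calculated, so the following Python sketch shows only one possible reading. It assumes the front expression starts as the selected node's own start tag and the rear expression as its matching end tag, and that each move operation widens the corresponding boundary by one token. The names tag_expressions and extract, the token-based approach and the systemid value "5" are all hypothetical.

```python
import re

TOKEN_RE = re.compile(r"<[^<>]*>|[^<]+")          # tags and the text between them
SYSTEMID_RE = re.compile(r'\s*systemid="[^"]*"')  # marker added when nodes were numbered


def tag_expressions(annotated, system_id, front_moves=0, rear_moves=0):
    """Return (front, rear) tag expressions for the node carrying system_id."""
    tokens = TOKEN_RE.findall(annotated)
    marker = f'systemid="{system_id}"'
    start = next((i for i, tok in enumerate(tokens)
                  if tok.startswith("<") and marker in tok), None)
    if start is None:
        raise KeyError(system_id)
    name = tokens[start][1:-1].split()[0]
    opening = re.compile(rf"<{re.escape(name)}[\s>]")
    closing = f"</{name}>"
    depth = 0
    for end in range(start, len(tokens)):  # find the matching end tag
        if opening.match(tokens[end]):
            depth += 1
        elif tokens[end] == closing:
            depth -= 1
            if depth == 0:
                break
    else:
        raise ValueError("no matching end tag for " + name)
    front = "".join(tokens[max(0, start - front_moves):start + 1])
    rear = "".join(tokens[end:end + 1 + rear_moves])
    # Real pages carry no systemid attribute, so strip it from both expressions.
    return SYSTEMID_RE.sub("", front), SYSTEMID_RE.sub("", rear)


def extract(page, front, rear):
    """Apply a front/rear expression pair to a page and return the text between them."""
    begin = page.find(front)
    if begin < 0:
        return None
    begin += len(front)
    stop = page.find(rear, begin)
    return page[begin:stop] if stop >= 0 else None
```

Under these assumptions, a pair of expressions computed once from the annotated page can be reused by extract on later fetches of the same page layout, which matches the stated goal of acquiring only the content that the template defines.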

Claims (5)

1. An internet acquisition system template generation method based on dynamic interaction, characterized by the following steps:
(1) access the internet and load the target page to obtain the web page text source code set S;
(2) identify the node set N in S using a regular expression that matches tags, and add a unique serial number to each node in N;
(3) construct the model tree T, which has an interdependent hierarchical structure, from the node set N;
(4) finally, input a node ID and a move operation, and use a traversal analysis algorithm to iteratively compute the front and rear tag expressions of the node.
2. The internet acquisition system template generation method based on dynamic interaction according to claim 1, characterized in that the specific operations of step (1) are as follows:
(1-1) input the web page address and use HttpClient to obtain the original HTML source code set S1;
(1-2) use Cobra to execute the JavaScript in S1 and fill its return results into the corresponding positions in S1, finally obtaining the text source code set S2, which is the web page text source code set S.
3. The internet acquisition system template generation method based on dynamic interaction according to claim 1, characterized in that the specific operations of step (2) are as follows:
(2-1) match the text source code set S against the regular expression to obtain a node n;
(2-2) determine whether the type of node n is one of: end-tag node, script node or a node nested in it, link node, br node, or comment/comment-expression node; if it is none of these types, proceed to step (2-3); otherwise return to step (2-1);
(2-3) use a sequence generator to generate a unique system identifier serial number and append it to the tag of node n;
(2-4) return to step (2-1) and repeat until all nodes n have been identified; the set formed by all nodes n is the node set N.
4. The internet acquisition system template generation method based on dynamic interaction according to claim 1, characterized in that the specific operations of step (3) are as follows:
(3-1) use HtmlParser to generate the HTML DOM structure as R', where R' is initially the root node of the DOM; at the same time create the custom hierarchical model tree HtmlNode as node R, where R is initially the root node of the tree;
(3-2) assign the node content, ID and child node information of R' to node R;
(3-3) if node R' has child nodes, take the tag of R', convert it into an end node, and add it as a sibling node of R;
(3-4) take the next node after R' and assign it to R', then return to step (3-2), until the HTML DOM has been fully traversed; the hierarchy produced by the nodes R is the hierarchical model tree T.
5. The internet acquisition system template generation method based on dynamic interaction according to claim 1, characterized in that the specific operations of step (4) are as follows:
(4-1) click the load-node operation to obtain the node identifier ID;
(4-2) look up the corresponding node object in tree T by ID;
(4-3) click the front/rear move operation O;
(4-4) calculate the front/rear boundary;
(4-5) obtain the front/rear expression.
CN201110114641.6A 2011-05-05 2011-05-05 Dynamic-interaction-based generation method of template of internet acquisition system Active CN102768660B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110114641.6A CN102768660B (en) 2011-05-05 2011-05-05 Dynamic-interaction-based generation method of template of internet acquisition system

Publications (2)

Publication Number Publication Date
CN102768660A (en) 2012-11-07
CN102768660B (en) 2014-09-03

Family

ID=47096064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110114641.6A Active CN102768660B (en) 2011-05-05 2011-05-05 Dynamic-interaction-based generation method of template of internet acquisition system

Country Status (1)

Country Link
CN (1) CN102768660B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013167006A1 (en) * 2012-12-14 2013-11-14 中兴通讯股份有限公司 Method for configuring browser bookmarks, device and terminal thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6026413A (en) * 1997-08-01 2000-02-15 International Business Machines Corporation Determining how changes to underlying data affect cached objects
CN101615178A (en) * 2008-06-26 2009-12-30 日电(中国)有限公司 Be used to set up the method and system of object hierarchy structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李睿, 曾俊瑀, 周四望: "《基于局部标签树匹配的改进网页聚类算法》", 《计算机应用》 *

Also Published As

Publication number Publication date
CN102768660B (en) 2014-09-03

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200521

Address after: Room 17-2-1209, Huaguoshan Avenue, Haizhou District, Lianyungang City, Jiangsu Province

Patentee after: Lianyungang Dayu Information Technology Co.,Ltd.

Address before: 222000, room 7, building 706, West Tower, dragon river building, Sinpo District, Jiangsu, Lianyungang, China

Patentee before: JIANGSU JINGE NETWORK TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20240129

Address after: Room 302, Building G, Yunmi City, No. 19 Ningshuang Road, Yuhuatai District, Nanjing City, Jiangsu Province, 210012

Patentee after: Golden Feather Network Technology (Nanjing) Co.,Ltd.

Country or region after: China

Address before: Room 17-2-1209, Huaguoshan Avenue, Haizhou District, Lianyungang City, Jiangsu Province, 222000

Patentee before: Lianyungang Dayu Information Technology Co.,Ltd.

Country or region before: China