Summary of the invention
The technical problem to be solved by the present invention is to address the deficiencies of the prior art by providing a template generation method for an internet acquisition system based on dynamic interaction. The method can automatically load the acquisition target and complete the creation of a specific template through interaction with the user.
To solve the above problem, the present invention adopts the following technical solution: a template generation method for an internet acquisition system based on dynamic interaction, characterized in that its steps are as follows:
(1) access the internet and load the target page to obtain the web page text source code set S;
(2) identify the node set N in the set S according to a regular expression for tags, and add a unique sequence number to each node in the set N;
(3) construct, from the node set N, a hierarchical model tree T with dependency relationships;
(4) finally, input a node ID and a move operation, and use a traversal analysis algorithm to iteratively compute the preceding and following tag expressions of the node.
In the technical solution of the above template generation method for an internet acquisition system based on dynamic interaction, step (1) may be carried out through the following specific operations:
(1-1) input the web page address and use HttpClient to obtain the original HTML source code set S1;
(1-2) use Cobra to execute the JavaScript in the set S1 and fill the returned results into the corresponding positions in S1, finally obtaining the text source code set S2, which is the said web page text source code set S.
In the technical solution of the above template generation method for an internet acquisition system based on dynamic interaction, step (2) may be carried out through the following specific operations:
(2-1) obtain a node n by matching the text source code set S against the regular expression;
(2-2) determine whether the type of node n is one of: end-tag node, script node or a node nested within it, link node, br node, or comment/comment-expression node; if it is none of these types, execute step (2-3); otherwise execute step (2-1);
(2-3) use a sequence generator to generate a unique system identification sequence number and append it to the tag of node n;
(2-4) return to and repeat step (2-1) until all nodes n have been identified; the set formed by all nodes n is the said node set N.
In the technical solution of the above template generation method for an internet acquisition system based on dynamic interaction, step (3) may be carried out through the following specific operations:
(3-1) use HtmlParser to generate the HTML DOM structure as R', where R' is initially the root node of the DOM; at the same time, create a custom hierarchical model tree HtmlNode as node R, where R is by default initially the root node of the tree;
(3-2) assign the node content, ID, and child node information of R' to node R;
(3-3) if node R' has child nodes, obtain the tag of R', convert it into an end-tag node, and add it as a sibling node of R;
(3-4) obtain the next node of R' and assign it to R', then return to and execute step (3-2) again until the HTML DOM has been fully traversed; the hierarchy produced by the nodes R is the said hierarchical model tree T.
In the technical solution of the above template generation method for an internet acquisition system based on dynamic interaction, step (4) may be carried out through the following specific operations:
(4-1) click the load-node operation to obtain the node identifier ID;
(4-2) look up the corresponding node object in tree T according to the ID;
(4-3) click the pre/post move operation O;
(4-4) compute the pre/post boundary;
(4-5) obtain the pre/post expression.
Compared with the prior art, the template generation method for an internet acquisition system based on dynamic interaction of the present invention has the following effects:
1. Template customization is completed through human-computer interaction: the user only needs to click the mouse to complete the operation, and no specialist technical knowledge is required;
2. The attributes of the information are distinguished during acquisition: the title, author, body, reply time, reply content, and so on of the acquired information are distinguished, matching the content seen in the browser;
3. High acquisition efficiency: during acquisition, only the content defined by the template is collected, so few network resources are consumed;
4. The template supports checking for new content: only updated content is collected during acquisition, with no repeated collection.
Embodiment
The specific technical solutions of the present invention are further described below with reference to the accompanying drawings, so that those skilled in the art may further understand the present invention; this description does not limit the scope of its claims.
Embodiment 1. A template generation method for an internet acquisition system based on dynamic interaction, the steps of which are as follows:
(1) access the internet and load the target page to obtain the web page text source code set S;
(2) identify the node set N in the set S according to a regular expression for tags, and add a unique sequence number to each node in the set N;
(3) construct, from the node set N, a hierarchical model tree T with dependency relationships;
(4) finally, input a node ID and a move operation, and use a traversal analysis algorithm to iteratively compute the preceding and following tag expressions of the node.
Embodiment 2. The specific operations of step (1) of the template generation method for an internet acquisition system based on dynamic interaction described in embodiment 1 are as follows:
(1-1) input the web page address and use HttpClient to obtain the original HTML source code set S1;
(1-2) use Cobra to execute the JavaScript in the set S1 and fill the returned results into the corresponding positions in S1, finally obtaining the text source code set S2, which is the said web page text source code set S.
Embodiment 3. The specific operations of step (2) of the template generation method for an internet acquisition system based on dynamic interaction described in embodiment 1 or 2 are as follows:
(2-1) obtain a node n by matching the text source code set S against the regular expression;
(2-2) determine whether the type of node n is one of: end-tag node, script node or a node nested within it, link node, br node, or comment/comment-expression node; if it is none of these types, execute step (2-3); otherwise execute step (2-1);
(2-3) use a sequence generator to generate a unique system identification sequence number and append it to the tag of node n;
(2-4) return to and repeat step (2-1) until all nodes n have been identified; the set formed by all nodes n is the said node set N.
Embodiment 4. The specific operations of step (3) of the template generation method for an internet acquisition system based on dynamic interaction described in embodiment 1, 2, or 3 are as follows:
(3-1) use HtmlParser to generate the HTML DOM structure as R', where R' is initially the root node of the DOM; at the same time, create a custom hierarchical model tree HtmlNode as node R, where R is by default initially the root node of the tree;
(3-2) assign the node content, ID, and child node information of R' to node R;
(3-3) if node R' has child nodes, obtain the tag of R', convert it into an end-tag node, and add it as a sibling node of R;
(3-4) obtain the next node of R' and assign it to R', then return to and execute step (3-2) again until the HTML DOM has been fully traversed; the hierarchy produced by the nodes R is the said hierarchical model tree T.
Embodiment 5. The specific operations of step (4) of the template generation method for an internet acquisition system based on dynamic interaction described in any one of embodiments 1-4 are as follows:
(4-1) click the load-node operation to obtain the node identifier ID;
(4-2) look up the corresponding node object in tree T according to the ID;
(4-3) click the pre/post move operation O;
(4-4) compute the pre/post boundary;
(4-5) obtain the pre/post expression.
Embodiment 6. With reference to Figs. 1-4, an operating experiment carried out with the template generation method for an internet acquisition system based on dynamic interaction of the present invention comprises the following steps:
Step 101: load the target page to obtain the text source code set S, specifically as follows:
(1) Input the web page address and use HttpClient to obtain the original HTML source code set S1. For example, the original HTML source code set S1 obtained via the internet is as follows:
<html>
<head>
<title>Template target to be generated</title>
<script type="text/javascript">
var go = function(){
document.getElementById("content_id").innerHTML="JS replacement"; }
</script>
</head>
<body onload="javascript:go();">
<p id="content_id">Content</p>
</body>
</html>
The HTML source code set consists of various HTML tags and page content;
(2) Use Cobra to execute the JavaScript in S1 and fill the results into the corresponding positions in S1. Much web page information is displayed only after being processed by scripts such as JS, so the information carried by the scripts must also be processed. For example, part of the information in the set S1 described in (1) above resides in the JS script; after execution with Cobra, the text content of the p tag should be "JS replacement". The new source code set S2 obtained after this processing is the said web page text source code set S.
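By way of illustration only, sub-step (1) of step 101 might be sketched as follows. This sketch uses the Python standard library in place of HttpClient, and it does not execute JavaScript, so it yields only the set S1; producing S2 would still require a script-capable engine such as the Cobra library named above. The function name `fetch_source`, the 10-second timeout, and the UTF-8 fallback are assumptions made for the example and are not taken from the method itself.

```python
from urllib.request import urlopen


def fetch_source(url, timeout=10.0):
    """Return the raw HTML text (set S1) found at `url`.

    Hypothetical helper: scripts in the page are NOT executed here.
    """
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the response headers when present;
        # falling back to UTF-8 is an assumption of this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as `fetch_source(address)`, where `address` is the web page address entered by the user, would return the text that the later steps treat as S1.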
Step 102: identify and label the node set N in the set S. With reference to Fig. 2, this comprises the following steps:
Step 201: use the regular expression r = "<([^<>]*)>" to identify a tag in the set S as node n. Node n comprises a name, attributes, and text content. For example, the node n corresponding to the p tag in the set S1 described in (1) of step 101 is "<p id="content_id">content";
Step 202: determine whether node n is null. If it is null, all tags in the set S have been identified, and step 102 ends; if it is not null, the tags in the set S have not all been identified, and the method proceeds to step 203;
Step 203: determine whether the type of node n is one of: end-tag node, script node or a node nested within it, link node, br node, or comment/comment-expression node. If it is one of these types, return to step 201 to identify the next node n; if it is none of these types, proceed to step 204;
Step 204: a sequence generator generates a unique system identifier, which is appended to the node's tag. For example, to add a sequence number to the p tag in the set S1 described in (1) of step 101: if the generated sequence number is 20100120112233 (the current date-time stamp plus a random number between 1 and 1000000), it is added to the p tag, and the information in the p tag becomes "<p id="content_id" systemid="20100120112233">". After the system identifier has been added, return to step 201, until a system identifier has been added to every node.
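By way of illustration only, steps 201 to 204 might be sketched in Python as follows. The regular expression is the one given in step 201; everything else is an assumption of this sketch: the helper names `make_sequence`, `tag_name`, and `add_system_ids` are hypothetical, a simple counter stands in for the timestamp-plus-random sequence generator so that the output is repeatable, and any tag beginning with "!" (comments and doctype declarations) is treated as a comment node.

```python
import itertools
import re

TAG_RE = re.compile(r"<([^<>]*)>")       # expression r from step 201
SKIPPED_NAMES = {"script", "link", "br"}  # node types excluded in step 203


def make_sequence(start=1):
    """Hypothetical sequence generator: a counter instead of timestamp+random."""
    counter = itertools.count(start)
    return lambda: str(next(counter))


def tag_name(inner):
    """Lower-cased tag name taken from the text between '<' and '>'."""
    parts = inner.strip("/ ").split()
    return parts[0].lower() if parts else ""


def add_system_ids(source, next_id):
    """Return `source` with a systemid attribute appended to each kept tag."""
    out, pos, in_script = [], 0, False
    for match in TAG_RE.finditer(source):
        out.append(source[pos:match.start()])
        pos = match.end()
        inner = match.group(1)
        name = tag_name(inner)
        is_end = inner.lstrip().startswith("/")
        if in_script:
            # Anything matched inside a script is a nested node: leave it alone.
            in_script = not (is_end and name == "script")
            out.append(match.group(0))
            continue
        if name == "script" and not is_end:
            in_script = True
        is_comment = inner.lstrip().startswith("!")
        if is_end or is_comment or not name or name in SKIPPED_NAMES:
            out.append(match.group(0))
            continue
        body, closer = inner.rstrip(), ""
        if body.endswith("/"):            # keep the slash of a self-closing tag
            body, closer = body[:-1].rstrip(), "/"
        out.append(f'<{body} systemid="{next_id()}"{closer}>')
    out.append(source[pos:])
    return "".join(out)


SAMPLE = ('<html><head><title>T</title>'
          '<script type="text/javascript">var a = 1 < 2;</script></head>'
          '<body><p id="content_id">Content</p><br/></body></html>')
tagged = add_system_ids(SAMPLE, make_sequence())
```

In this sample the html, head, title, body, and p tags each receive an identifier, while the script tag, the br tag, and all end tags are left unchanged.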
Step 103: construct the hierarchical model tree T of the nodes. With reference to Fig. 3, this comprises the following steps:
Step 301: use HtmlParser to parse the set S to which system identifiers have been added, generating the HTML DOM structure R', where R' is initially the root node of the DOM;
Step 302: create a custom hierarchical model tree HtmlNode as node R, where R is initially the root node of the tree;
Step 303: assign the tag name, attributes, and text content of node R' to node R. For example, when the p tag in the set S1 described in (1) of step 101 is assigned to node R as a node, the information held by node R includes the name "<p id="content_id">" and the system identifier 20100120112233;
Step 304: determine whether the current node R' has child nodes; if it has, proceed to step 305; if it has none, execute step 307;
Step 305: convert the tag name of node R' into the end tag L. For example, the tag name of the p node in the set S1 described in (1) of step 101, "<p id="content_id" systemid="20100120112233">", is converted into the end tag "</p>", denoted L;
Step 306: convert the end tag L obtained in step 305 into a node and add it as a sibling node of node R;
Step 307: obtain the next child node of R' and assign its value to R' as the node for the next iteration;
Step 308: determine whether the newly assigned R' is null. If R' is null, the traversal of the HTML DOM structure generated in step 301 is complete (the set of all nodes R is the said hierarchical model tree T), and step 103 ends; if R' is not null, the traversal of the HTML DOM structure generated in step 301 is not complete, and the method returns to step 303.
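By way of illustration only, steps 301 to 308 might be sketched as follows. The sketch substitutes Python's standard `html.parser` module for the HtmlParser library and builds the tree during a single event-driven pass instead of traversing a separately built DOM. The field names of `HtmlNode`, the `VOID_TAGS` set, and the choice to count a node's own text as a child (so that a p tag holding only text still receives an end-tag sibling, as in the example of step 305) are assumptions of this sketch.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never have children


@dataclass
class HtmlNode:
    name: str                                   # tag name, e.g. "p" or "/p"
    tag: str                                    # tag text as written in the source
    system_id: Optional[str] = None             # value of the systemid attribute
    text: str = ""
    parent: Optional["HtmlNode"] = field(default=None, repr=False)
    children: List["HtmlNode"] = field(default_factory=list, repr=False)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = HtmlNode(name="#root", tag="")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Step 303: copy tag text and system identifier into a new node R.
        node = HtmlNode(name=tag, tag=self.get_starttag_text(),
                        system_id=dict(attrs).get("systemid"),
                        parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.name != tag:
            node = node.parent
        if node is self.root:
            return                               # stray end tag: ignore it
        if node.children or node.text:
            # Steps 305-306: the end tag becomes a sibling of the node it closes.
            node.parent.children.append(
                HtmlNode(name="/" + tag, tag=f"</{tag}>", parent=node.parent))
        self.current = node.parent


def build_tree(tagged_source):
    builder = TreeBuilder()
    builder.feed(tagged_source)
    builder.close()
    return builder.root


tree = build_tree('<html systemid="1"><body systemid="2">'
                  '<p id="content_id" systemid="3">Content</p><br/></body></html>')
```

In the resulting tree the children of the body node are the p node, its end-tag sibling, and the br node, in that order.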
Step 104: input the node ID and the move operation, and compute the preceding and following tag expressions of the node. With reference to Fig. 4, this comprises the following steps:
Step 401: click to load a page node and obtain the identifier ID with which this node has been labeled. For example, for the set S2 described in (2) of step 101, text such as "JS replacement" is seen in the browser; select this text and click the load button, and the system identifier 20100120112233 attached to the p tag is passed as a request parameter to step 402;
Step 402: after obtaining the request parameter, query tree T by the ID to obtain the corresponding node;
Step 403: click the pre/post move operation button;
Step 404: determine whether the operation is a pre action or a post action; if it is a pre action, execute step 405; if it is a post action, execute step 407;
Step 405: compute the pre boundary;
Step 406: obtain the pre expression;
Step 407: compute the post boundary;
Step 408: obtain the post expression.
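The description does not state how the pre and post boundaries are computed, so the following is only one possible reading, offered as a sketch under stated assumptions. It assumes that the pre expression is the text ending at the selected node's start tag, that the post expression is the text beginning at the next tag after it, that each pre or post move widens the corresponding boundary by one tag, and that the systemid attributes are removed so that the expressions match the original page. For brevity it searches a flat list of tags instead of the tree T; the function names `boundary_expressions` and `extract_between` are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^<>]*>")
SYSTEMID_RE = re.compile(r'\s+systemid="[^"]*"')


def boundary_expressions(tagged_source, system_id, pre_moves=0, post_moves=0):
    """Return (pre, post) expressions around the node labeled `system_id`."""
    tags = list(TAG_RE.finditer(tagged_source))
    marker = f'systemid="{system_id}"'
    idx = next((i for i, m in enumerate(tags) if marker in m.group(0)), None)
    if idx is None:
        raise KeyError(system_id)
    # Pre boundary: start `pre_moves` tags before the selected start tag.
    first = max(0, idx - pre_moves)
    pre = tagged_source[tags[first].start():tags[idx].end()]
    # Post boundary: the next tag, extended by `post_moves` further tags.
    if idx + 1 < len(tags):
        last = min(len(tags) - 1, idx + 1 + post_moves)
        post = tagged_source[tags[idx + 1].start():tags[last].end()]
    else:
        post = ""
    return SYSTEMID_RE.sub("", pre), SYSTEMID_RE.sub("", post)


def extract_between(page, pre, post):
    """Return the text of `page` lying between the two expressions."""
    start = page.index(pre) + len(pre)
    return page[start:page.index(post, start)]


TAGGED = ('<body systemid="4"><p id="content_id" systemid="5">'
          'JS replacement</p></body>')
pre, post = boundary_expressions(TAGGED, "5")
wide_pre, wide_post = boundary_expressions(TAGGED, "5", pre_moves=1, post_moves=1)
content = extract_between(
    '<body><p id="content_id">JS replacement</p></body>', pre, post)
```

Under these assumptions, a template would store the two expressions, and a later acquisition run would collect whatever text lies between them on the live page.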