CN103092817A - Data collection method and data collection device based on script engine - Google Patents

Data collection method and data collection device based on script engine

Info

Publication number
CN103092817A
CN103092817A CN2013100196239A CN201310019623A CN103092817A CN 103092817 A CN103092817 A CN 103092817A CN 2013100196239 A CN2013100196239 A CN 2013100196239A CN 201310019623 A CN201310019623 A CN 201310019623A CN 103092817 A CN103092817 A CN 103092817A
Authority
CN
China
Prior art keywords
script
target data
data
rule
title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013100196239A
Other languages
Chinese (zh)
Inventor
侯赋文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing 58 Information Technology Co Ltd
Original Assignee
Beijing 58 Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing 58 Information Technology Co Ltd filed Critical Beijing 58 Information Technology Co Ltd
Priority to CN2013100196239A priority Critical patent/CN103092817A/en
Publication of CN103092817A publication Critical patent/CN103092817A/en
Pending legal-status Critical Current

Abstract

The invention discloses a data collection method and a data collection device based on a script engine. The method comprises the following steps: loading the pre-configured collection configuration file corresponding to the current collection task, and parsing it to obtain the target data collection rules; initializing each script engine that supports a different scripting language, and loading the pre-configured script file made up of the script methods that collect the target data; downloading the web page data, looking up the collection rules defined for the target data to be collected on that web page, and sending the downloaded web page data and the script method names configured in those rules to the script engine of the corresponding scripting language; and calling and executing the corresponding script methods through the script engine according to the script method names, thereby collecting the target data from the web page data. Extraction, cleaning, processing, and conversion are all carried out by script during data collection, which solves the stated technical problem well.

Description

Data collection method and device based on a script engine
Technical field
The present invention relates to the field of computer technology, and in particular to a data collection method and device based on a script engine.
Background art
Many mature targeted collection tools already exist in the industry. Almost all of them are implemented through template configuration, and their template-based data extraction methods are generally regular-expression matching, marker interception, XPath extraction, plug-in customization, and the like.
Regular-expression matching: some extraction results need secondary cleaning, processing, and conversion before the target data is obtained, and the method demands expertise, since the user must be proficient in regular expressions.
Marker interception: some extraction results need secondary cleaning, processing, and conversion before the target data is obtained.
XPath extraction: the web page content must be structured, and the method demands expertise, since the user must be proficient in XPath syntax. In addition, some extraction results need secondary cleaning, processing, and conversion before the target data is obtained.
Plug-in customization: every change to the extraction rule code requires recompilation, which is cumbersome when rules change often, and the method also demands expertise.
In summary, existing template-based data extraction methods share one characteristic: much of the extracted data must go through secondary cleaning, processing, and conversion before the desired target data is obtained, which lowers extraction efficiency. In addition, some of the methods demand considerable expertise, which hinders wide adoption.
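The two-pass pattern criticized above can be illustrated with a minimal Python sketch. The page fragment, the pattern, and the variable names are assumptions made for illustration only; they do not come from the patent.

```python
import re
from datetime import datetime

# Hypothetical page fragment (not taken from the patent).
PAGE = '<div class="meta">Posted:&nbsp; 2013-01-18 &nbsp;by <b> Zhang San </b></div>'

# Pass 1: a template-style regular expression pulls out the raw fragment.
raw = re.search(r"Posted:(.*?)by", PAGE).group(1)

# Pass 2: secondary cleaning and conversion, performed separately from
# extraction. This is the extra step that the background section describes.
cleaned = raw.replace("&nbsp;", " ").strip()
published = datetime.strptime(cleaned, "%Y-%m-%d").date()

print(repr(raw))
print(published.isoformat())
```

The raw match still carries entity references and padding, so the value is only usable after the second pass.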
Summary of the invention
In view of the above problems, the present invention is proposed to provide a data collection method and device based on a script engine that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a data collection method based on a script engine is provided, comprising:
Step 1: load the pre-configured collection configuration file corresponding to the current collection task and parse it to obtain the target data collection rules, where each target data collection rule comprises a target data type together with the script method name and scripting language used to collect that type of target data;
Step 2: initialize each script engine that supports a different scripting language, and load the pre-configured script file made up of the script methods that collect the target data;
Step 3: download the web page data, look up the collection rules defined for the target data to be collected on that web page, and send the downloaded web page data and the script method name configured in each rule found to the script engine of the corresponding scripting language;
Step 4: the script engine calls and executes the corresponding script method according to the script method name, and collects the target data from the web page data.
Optionally, in the method of the present invention, target data extraction, cleaning, processing, and conversion rules are defined in the script method according to the requirements of the collection task.
Optionally, in the method of the present invention, the target data extraction rule comprises: an extraction rule defined according to the regular-expression matching method, an extraction rule defined according to the marker interception method, an extraction rule defined according to the XPath extraction method, or an extraction rule defined according to the plug-in customization method.
Optionally, in step 4 of the method of the present invention, executing the corresponding script method to collect the target data from the web page data specifically comprises:
extracting the specified target data from the web page data according to the extraction rule defined in the script method, and cleaning, processing, and converting the extracted target data according to the cleaning, processing, and conversion rules defined in the script method, to obtain the required target data.
Optionally, in the method of the present invention, the target data types include but are not limited to title, author, date, and content.
According to another aspect of the present invention, a data collection device based on a script engine is provided, comprising:
a Command Line Parsing module, configured to load the pre-configured collection configuration file corresponding to the current collection task and parse it to obtain the target data collection rules, where each target data collection rule comprises a target data type together with the script method name and scripting language used to collect that type of target data;
a data processing module, configured to download the web page data, look up the collection rules defined for the target data to be collected on that web page, and send the downloaded web page data and the script method name configured in each rule found to the corresponding script engine in the script engine module, selected by scripting language;
a script engine module, comprising a plurality of script engines that support different scripting languages, where each script engine, after initialization, loads the pre-configured script file made up of the script methods that collect the target data and, after receiving the data sent by the data processing module, calls and executes the corresponding script method according to the script method name and collects the target data from the web page data.
Optionally, in the device of the present invention, target data extraction, cleaning, processing, and conversion rules are defined, according to the requirements of the collection task, in the script methods of the script file loaded by the script engine module.
Optionally, in the device of the present invention, the target data extraction rule in the script engine module comprises: an extraction rule defined according to the regular-expression matching method, an extraction rule defined according to the marker interception method, an extraction rule defined according to the XPath extraction method, or an extraction rule defined according to the plug-in customization method.
Optionally, in the device of the present invention, the script engine module is specifically configured to extract the specified target data from the web page data according to the extraction rule defined in the script method, and to clean, process, and convert the extracted target data according to the cleaning, processing, and conversion rules defined in the script method, to obtain the required target data.
Optionally, in the device of the present invention, the target data types in the Command Line Parsing module include but are not limited to title, author, date, and content.
The beneficial effects of the present invention are as follows:
The method and device of the present invention configure script methods in a simple, easy-to-use scripting language. This makes target data collection flexible and convenient, lowers the expertise required for data collection, and facilitates wide adoption. Moreover, because script methods can be configured flexibly in a scripting language, cleaning, processing, and conversion are completed at the time of extraction, so the target data obtained needs no further processing, which greatly improves collection efficiency.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference signs denote the same parts. In the drawings:
Fig. 1 is a flowchart of a data collection method based on a script engine provided by an embodiment of the present invention;
Fig. 2 is an execution block diagram of the method of the embodiment of the present invention;
Fig. 3 is a structural block diagram of a data collection device based on a script engine provided by an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed completely to those skilled in the art.
To lower the expertise required for data collection and to improve collection efficiency, embodiments of the present invention provide a data collection method and device based on a script engine. Using scripts, the method and device perform extraction, cleaning, processing, and conversion together during data collection, which solves the stated technical problem well.
Before the scheme of the present invention is described in detail, several technical terms used in the technical solution are explained as follows:
Collection configuration file: defines the collection rule configuration for the target data that a collection task collects on each web page. The collection rule configuration mainly comprises the target data type and the script method name and scripting language used to collect that type of target data. For example, if the target data type to extract is "title", the script method name defined for collecting "title" data is "parseTitle", and the scripting language used is javascript.
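A configuration file of this kind might be parsed as in the following Python sketch. The patent does not specify a file format, so the JSON layout, the field names ("task", "rules", "type", "method", "language"), the "parseDate" entry, and the CollectionRule class are all assumptions; only the "title" / "parseTitle" / javascript example comes from the text.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical configuration text; the JSON layout is an assumption.
CONFIG_TEXT = """
{
  "task": "example-task",
  "rules": [
    {"type": "title", "method": "parseTitle", "language": "javascript"},
    {"type": "date",  "method": "parseDate",  "language": "javascript"}
  ]
}
"""


@dataclass(frozen=True)
class CollectionRule:
    data_type: str    # target data type, e.g. "title"
    method_name: str  # name of the script method that collects it
    language: str     # scripting language the method is written in


def parse_collection_config(text: str) -> List[CollectionRule]:
    """Parse the configuration text into a list of collection rules."""
    doc = json.loads(text)
    return [
        CollectionRule(rule["type"], rule["method"], rule["language"])
        for rule in doc["rules"]
    ]


rules = parse_collection_config(CONFIG_TEXT)
for rule in rules:
    print(rule)
```

Each parsed rule carries exactly the three pieces of information the definition lists, which is what later lets a page be routed to the right engine and method.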
Script file: a file made up of script methods, written by the user in a scripting language, for collecting target data. Scripting languages are usually simple, easy to learn, and easy to use, so once the specific requirements of the collection task are clear, the script methods can be configured in a scripting language, which greatly lowers the expertise required. Common scripting languages include javascript, vbscript, and php.
Script engine: a tool that parses and executes script methods. In the present invention, the script engine obtains script methods by loading the configured script file. Existing script engines include the javascript script engine and the vbscript script engine provided by Microsoft.
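The relationship between a script file and a script engine can be sketched in Python as follows. This is a toy stand-in, not one of the engines named above: it treats Python itself as the scripting language and uses the built-in exec to load the script text. The class name PythonScriptEngine, its methods, and the sample script are assumptions for illustration.

```python
class PythonScriptEngine:
    """Toy script engine: loads script text and calls its functions by name."""

    language = "python"

    def __init__(self) -> None:
        self._namespace: dict = {}

    def load_script(self, source: str) -> None:
        # Run the script text so that its functions are defined inside the
        # engine's private namespace. Only do this with trusted scripts.
        exec(compile(source, "<script file>", "exec"), self._namespace)

    def call(self, method_name: str, page: str) -> str:
        method = self._namespace.get(method_name)
        if not callable(method):
            raise KeyError(f"script method not found: {method_name}")
        return method(page)


# Hypothetical script file content written by the user.
SCRIPT_FILE = '''
def parseTitle(page):
    start = page.index("<title>") + len("<title>")
    end = page.index("</title>", start)
    return page[start:end].strip()
'''

engine = PythonScriptEngine()
engine.load_script(SCRIPT_FILE)
title = engine.call("parseTitle", "<html><title> Used bikes </title></html>")
print(title)
```

Because methods are looked up by name at call time, editing the script file changes collection behaviour without recompiling the host program, which is the contrast drawn with plug-in customization.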
Based on the above explanation of terms, the specific implementation of the method embodiment and the device embodiment of the present invention is given below.
Method embodiment
As shown in Fig. 1, an embodiment of the present invention provides a data collection method based on a script engine, comprising:
Step S101: load the pre-configured collection configuration file corresponding to the current collection task and parse it to obtain the target data collection rules, where each target data collection rule comprises a target data type together with the script method name and scripting language used to collect that type of target data;
In this step, the target data types include but are not limited to title, author, date, and content; those skilled in the art can divide them flexibly according to user requirements.
Step S102: initialize each script engine that supports a different scripting language, and load the pre-configured script file made up of the script methods that collect the target data;
In this step, target data extraction, cleaning, processing, and conversion rules are defined in the script method according to the requirements of the collection task.
The target data extraction rule may be defined according to the regular-expression matching method, the marker interception method, the XPath extraction method, or the plug-in customization method. The technical solution of the present invention is of course not limited to these extraction rules, and rules can be configured flexibly according to specific requirements.
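Three of the extraction styles named above can be sketched with the Python standard library. The helper names, the sample page, and the patterns are assumptions for illustration; the XPath helper relies on the limited XPath subset of xml.etree.ElementTree and therefore needs well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = "<html><body><h1>Used bikes</h1><p class='author'>Li Lei</p></body></html>"


def extract_by_regex(page: str, pattern: str) -> str:
    """Regular-expression matching: return the first capture group."""
    match = re.search(pattern, page, re.S)
    return match.group(1) if match else ""


def extract_between(page: str, start_mark: str, end_mark: str) -> str:
    """Marker interception: return the text between two literal markers."""
    start = page.find(start_mark)
    if start < 0:
        return ""
    start += len(start_mark)
    end = page.find(end_mark, start)
    return page[start:end] if end >= 0 else ""


def extract_by_xpath(page: str, path: str) -> str:
    """XPath extraction using the ElementTree XPath subset."""
    node = ET.fromstring(page).find(path)
    return node.text if node is not None and node.text else ""


title_a = extract_by_regex(PAGE, r"<h1>(.*?)</h1>")
title_b = extract_between(PAGE, "<h1>", "</h1>")
author = extract_by_xpath(PAGE, "./body/p[@class='author']")
print(title_a, title_b, author)
```

Any of these helpers could sit inside a script method, so the choice of extraction style becomes a per-method decision rather than a property of the collection tool.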
Step S103: download the web page data, look up the collection rules defined for the target data to be collected on that web page, and send the downloaded web page data and the script method name configured in each rule found to the script engine of the corresponding scripting language;
Step S104: the script engine calls and executes the corresponding script method according to the script method name, and collects the target data from the web page data.
Preferably, in this step, calling and executing the corresponding script method to collect the target data from the web page data specifically comprises: extracting the specified target data from the web page data according to the extraction rule defined in the script method, and cleaning, processing, and converting the extracted target data according to the cleaning, processing, and conversion rules defined in the script method, to obtain the required target data.
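A single script method that extracts, cleans, processes, and converts in one call might look like the Python sketch below. The method name parse_date, the page markup, the "Posted on" label, and the date format are hypothetical.

```python
import re
from datetime import datetime
from html import unescape


def parse_date(page: str) -> str:
    """Hypothetical script method: return the publish date as YYYY-MM-DD."""
    # Extraction: a regular-expression rule defined inside the method.
    match = re.search(r'<span class="date">(.*?)</span>', page, re.S)
    if match is None:
        return ""
    # Cleaning: decode HTML entities, then collapse runs of whitespace.
    text = unescape(match.group(1)).replace("\xa0", " ")
    text = " ".join(text.split())
    # Processing: drop a leading label if one is present.
    text = re.sub(r"^Posted on\s*", "", text)
    # Conversion: normalise day/month/year to ISO format.
    return datetime.strptime(text, "%d/%m/%Y").date().isoformat()


result = parse_date('<span class="date">Posted on&nbsp; 18/01/2013 </span>')
print(result)
```

The caller receives a finished value, so no second pass over the collected data is needed.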
Fig. 2 is an execution framework diagram that follows the principle of Fig. 1. Specifically, this embodiment divides collection into two phases: before collection and during collection. Before a task is collected, the system performs the preparation appropriate to that collection task. It first parses the collection configuration file corresponding to the task, so that the rule configuration for grabbing target data is associated with the script methods. It then initializes the script engine for each scripting language, and each script engine loads the script file for its language. Collection can then begin. During collection, the web page data is downloaded first, and the target data rule configuration defined for extraction on that web page is then looked up. The script method name in each target data extraction rule is passed, together with the downloaded web page data, to the script engine, which executes the script method. The script engine performs data extraction, cleaning, processing, conversion, and other operations according to that script method, and the target data finally obtained is passed on for subsequent processing.
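The two phases described above can be sketched in Python as a preparation function and a collection function. Everything here is an assumption made for illustration: rules are plain tuples, each "engine" is a dictionary of already loaded Python functions, and the download step is an injected callable so that the sketch runs without network access.

```python
from typing import Callable, Dict, List, Tuple

Rule = Tuple[str, str, str]               # (data type, method name, language)
Engine = Dict[str, Callable[[str], str]]  # method name -> loaded script method


def prepare(rules: List[Rule], scripts: Dict[str, Engine]) -> Dict[str, Engine]:
    """Before collection: select the engines the rules need and check that
    every configured method name exists in the matching script file."""
    engines = {lang: scripts[lang] for _, _, lang in rules}
    for data_type, method, lang in rules:
        if method not in engines[lang]:
            raise KeyError(f"{lang} script lacks {method!r} for {data_type!r}")
    return engines


def collect(url: str, rules: List[Rule], engines: Dict[str, Engine],
            download: Callable[[str], str]) -> Dict[str, str]:
    """During collection: download the page, then run each rule's method."""
    page = download(url)
    return {
        data_type: engines[lang][method](page)
        for data_type, method, lang in rules
    }


def parse_title(page: str) -> str:
    return page.split("<title>")[1].split("</title>")[0].strip()


def parse_author(page: str) -> str:
    return page.split("by ")[1].split("<")[0].strip().title()


rules = [("title", "parseTitle", "python"), ("author", "parseAuthor", "python")]
scripts = {"python": {"parseTitle": parse_title, "parseAuthor": parse_author}}


def fake_download(url: str) -> str:
    # Stand-in for an HTTP fetch; returns a fixed hypothetical page.
    return "<title> Used bikes </title><p>by li lei</p>"


engines = prepare(rules, scripts)
record = collect("http://example.invalid/item/1", rules, engines, fake_download)
print(record)
```

Checking method names during preparation means a misconfigured rule is reported before any page is downloaded.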
In summary, the method of the embodiment of the present invention configures script methods in a simple, easy-to-use scripting language. This makes target data collection flexible and convenient, lowers the expertise required for data collection, and facilitates wide adoption. Moreover, because script methods can be configured flexibly in a scripting language, cleaning, processing, and conversion are completed at the time of extraction, so the target data obtained needs no further processing, which greatly improves collection efficiency.
Device embodiment
As shown in Fig. 3, an embodiment of the present invention provides a data collection device based on a script engine, comprising: a Command Line Parsing module 310, a data processing module 320, and a script engine module 330;
The Command Line Parsing module 310 is configured to load the pre-configured collection configuration file corresponding to the current collection task and parse it to obtain the target data collection rules, where each target data collection rule comprises a target data type together with the script method name and scripting language used to collect that type of target data, and where the target data types include but are not limited to title, author, date, and content;
The data processing module 320 is configured to download the web page data, look up the collection rules defined for the target data to be collected on that web page, and send the downloaded web page data and the script method name configured in each rule found to the corresponding script engine in the script engine module 330, selected by scripting language;
The script engine module 330 comprises a plurality of script engines that support different scripting languages. Each script engine, after initialization, loads the pre-configured script file made up of the script methods that collect the target data and, after receiving the data sent by the data processing module 320, calls and executes the corresponding script method according to the script method name and collects the target data from the web page data.
Further, in the embodiment of the present invention, target data extraction, cleaning, processing, and conversion rules are defined, according to the requirements of the collection task, in the script methods of the script file loaded by the script engine module 330.
The target data extraction rule may be defined according to the regular-expression matching method, the marker interception method, the XPath extraction method, or the plug-in customization method. The technical solution of the present invention is of course not limited to these extraction rules, and rules can be configured flexibly according to specific requirements.
Further, the script engine module 330 is specifically configured to extract the specified target data from the web page data according to the extraction rule defined in the script method, and to clean, process, and convert the extracted target data according to the cleaning, processing, and conversion rules defined in the script method, to obtain the required target data.
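The division of work between the data processing module and the script engine module can be sketched in Python as two small classes. The class and method names are assumptions, and the per-language "engines" are stand-ins: the keys "javascript" and "vbscript" are only labels here, and the methods registered under them are ordinary Python callables rather than real javascript or vbscript code.

```python
from typing import Callable, Dict, List

ScriptMethod = Callable[[str], str]


class ScriptEngineModule:
    """Holds one engine per scripting language and dispatches calls to it."""

    def __init__(self) -> None:
        self._engines: Dict[str, Dict[str, ScriptMethod]] = {}

    def load(self, language: str, methods: Dict[str, ScriptMethod]) -> None:
        # Stand-in for "initialize the engine and load its script file".
        self._engines[language] = dict(methods)

    def execute(self, language: str, method_name: str, page: str) -> str:
        return self._engines[language][method_name](page)


class DataProcessingModule:
    """Downloads a page and forwards it, rule by rule, to the engines."""

    def __init__(self, download: Callable[[str], str],
                 engines: ScriptEngineModule) -> None:
        self._download = download
        self._engines = engines

    def process(self, url: str, rules: List[Dict[str, str]]) -> Dict[str, str]:
        page = self._download(url)
        return {
            rule["type"]: self._engines.execute(
                rule["language"], rule["method"], page)
            for rule in rules
        }


engines = ScriptEngineModule()
engines.load("javascript",
             {"parseTitle": lambda page: page.split("|")[0].strip()})
engines.load("vbscript",
             {"parseAuthor": lambda page: page.split("|")[1].strip().title()})

rules = [
    {"type": "title", "method": "parseTitle", "language": "javascript"},
    {"type": "author", "method": "parseAuthor", "language": "vbscript"},
]
device = DataProcessingModule(lambda url: " Used bikes | li lei ", engines)
record = device.process("http://example.invalid/item/1", rules)
print(record)
```

Keeping the language lookup inside the script engine module means the data processing module never needs to know which engine implements a given method.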
In summary, the device of the embodiment of the present invention configures script methods in a simple, easy-to-use scripting language. This makes target data collection flexible and convenient, lowers the expertise required for data collection, and facilitates wide adoption. Moreover, because script methods can be configured flexibly in a scripting language, cleaning, processing, and conversion are completed at the time of extraction, so the target data obtained needs no further processing, which greatly improves collection efficiency.
The algorithms and displays provided here are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings given here. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that various programming languages may be used to implement the content of the invention described here, and that the description above of a specific language is given in order to disclose the best mode of the invention.
Numerous specific details are described in the specification provided here. It will be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules of the device in an embodiment can be changed adaptively and arranged in one or more devices different from that embodiment. The modules, units, or components in an embodiment can be combined into one module, unit, or component, and can also be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will understand that, although some embodiments described here include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the data collection device based on a script engine according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described here. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A data collection method based on a script engine, characterized by comprising:
Step 1: loading the pre-configured collection configuration file corresponding to the current collection task and parsing it to obtain the target data collection rules, wherein each target data collection rule comprises a target data type together with the script method name and scripting language used to collect that type of target data;
Step 2: initializing each script engine that supports a different scripting language, and loading the pre-configured script file made up of the script methods that collect the target data;
Step 3: downloading the web page data, looking up the collection rules defined for the target data to be collected on that web page, and sending the downloaded web page data and the script method name configured in each rule found to the script engine of the corresponding scripting language;
Step 4: the script engine calling and executing the corresponding script method according to the script method name, and collecting the target data from the web page data.
2. The method of claim 1, characterized in that target data extraction, cleaning, processing, and conversion rules are defined in the script method according to the requirements of the collection task.
3. The method of claim 2, characterized in that the target data extraction rule comprises: an extraction rule defined according to the regular-expression matching method, an extraction rule defined according to the marker interception method, an extraction rule defined according to the XPath extraction method, or an extraction rule defined according to the plug-in customization method.
4. The method of claim 2 or 3, characterized in that, in step 4, executing the corresponding script method to collect the target data from the web page data specifically comprises:
extracting the specified target data from the web page data according to the extraction rule defined in the script method, and cleaning, processing, and converting the extracted target data according to the cleaning, processing, and conversion rules defined in the script method, to obtain the required target data.
5. The method of claim 1, characterized in that the target data types comprise: title, author, date, and content.
6. the data collector based on script engine, is characterized in that, comprising:
The Command Line Parsing module is used for loading the pre-configured acquisition configuration file corresponding with current acquisition tasks, resolves this acquisition configuration file, obtains the target data collection rule; Wherein, described target data collection rule comprises target data type and gathers all kinds of target datas corresponding script method title and script;
Data processing module, be used for the downloading web pages data, and search and be defined in the collection rule that needs the target data that gathers on this webpage, with the script method title that configures in the web data of downloading and the collection rule that finds, be sent in the script engine module in corresponding script engine by script;
The script engine module, comprise a plurality of script engines of supporting different scripts, each script engine is after initialization, load the script file that the pre-configured script method by gathering target data consists of, and after the data that receive the data processing module transmission, according to described script method title, call and carry out corresponding script method, gather out target data in described web data.
7. device as claimed in claim 6, is characterized in that, according to the acquisition tasks demand, in the script method in the script file of described script engine module loading, definition has target data extraction, cleaning, processing and transformation rule.
8. device as claimed in claim 7, it is characterized in that, in described script engine module, described target data decimation rule comprises: the decimation rule that the decimation rule that the decimation rule that the decimation rule according to the definition of canonical matching method extracts, defines according to mark intercepting method extracts, defines according to the Xpath extraction method extracts or defines according to plug-in unit customization method extracts.
9. The device as claimed in claim 7 or 8, characterized in that the script engine module is specifically configured to extract the specified target data from the web page data according to the extraction rules defined in the script method, and to clean, process and convert the extracted target data according to the cleaning, processing and conversion rules defined in the script method, so as to obtain the required target data.
10. The device as claimed in claim 6, characterized in that, in the command line parsing module, the target data types comprise: title, author, date and content.
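The device described in claims 6 to 10 can be illustrated with a minimal Python sketch. It is not the patent's implementation. The rule table `COLLECTION_RULES`, the `ScriptEngine` class, the method names `extract_title` and `extract_date`, and the sample page are all hypothetical, and a single in-process object holding Python functions stands in for real script engines that would load script files written in different script languages. The sketch shows rules that map a target data type to a script language and a script method name, dispatch by method name, and two of the extraction styles named in claim 8 (regular-expression matching and marker interception) followed by cleaning and conversion.

```python
import re
from typing import Callable, Dict

# Hypothetical parsed collection rules:
# target data type -> script language and script method name.
COLLECTION_RULES = {
    "title": {"language": "python", "method": "extract_title"},
    "date": {"language": "python", "method": "extract_date"},
}


def extract_title(page: str) -> str:
    """Regex-matching extraction, then cleaning of redundant whitespace."""
    match = re.search(r"<title>(.*?)</title>", page, re.S)
    return " ".join(match.group(1).split()) if match else ""


def extract_date(page: str) -> str:
    """Marker interception: take the text between two markers,
    clean it, then convert the date separator."""
    start_marker, end_marker = '<span class="date">', "</span>"
    start = page.find(start_marker)
    if start == -1:
        return ""
    start += len(start_marker)
    end = page.find(end_marker, start)
    if end == -1:
        return ""
    raw = page[start:end].strip()   # cleaning
    return raw.replace("/", "-")    # conversion, e.g. 2013/01/18 -> 2013-01-18


class ScriptEngine:
    """Stand-in for one script engine holding its loaded script methods."""

    def __init__(self, methods: Dict[str, Callable[[str], str]]) -> None:
        self.methods = methods

    def invoke(self, method_name: str, page: str) -> str:
        # Call and execute the script method selected by its name.
        return self.methods[method_name](page)


# Stand-in for the script engine module: one engine per script language.
ENGINES = {
    "python": ScriptEngine(
        {"extract_title": extract_title, "extract_date": extract_date}
    ),
}


def collect(page: str) -> Dict[str, str]:
    """Data processing step: for each rule, send the page and the method
    name to the engine that matches the rule's script language."""
    result = {}
    for data_type, rule in COLLECTION_RULES.items():
        engine = ENGINES[rule["language"]]
        result[data_type] = engine.invoke(rule["method"], page)
    return result


PAGE = """<html><head><title>  Script   engine
 demo </title></head>
<body><span class="date"> 2013/01/18 </span></body></html>"""

if __name__ == "__main__":
    print(collect(PAGE))
```

Because the rules only name a language and a method, a new target data type can be supported in this sketch by adding one rule entry and one script method, without touching `collect`.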
CN2013100196239A 2013-01-18 2013-01-18 Data collection method and data collection device based on script engine Pending CN103092817A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013100196239A CN103092817A (en) 2013-01-18 2013-01-18 Data collection method and data collection device based on script engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013100196239A CN103092817A (en) 2013-01-18 2013-01-18 Data collection method and data collection device based on script engine

Publications (1)

Publication Number Publication Date
CN103092817A (en) 2013-05-08

Family ID=48205405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013100196239A Pending CN103092817A (en) 2013-01-18 2013-01-18 Data collection method and data collection device based on script engine

Country Status (1)

Country Link
CN (1) CN103092817A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020007358A1 (en) * 1998-09-01 2002-01-17 David E. Johnson Architecure of a framework for information extraction from natural language documents
WO2002007358A1 (en) * 2000-07-14 2002-01-24 Fujitsu Limited Cdma receiver
US20020065802A1 (en) * 2000-05-30 2002-05-30 Koki Uchiyama Distributed monitoring system providing knowledge services
CN101561802A (en) * 2008-04-18 2009-10-21 上海复旦光华信息科技股份有限公司 Web page structural data extraction method and system
CN101582075A (en) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system
CN101673256A (en) * 2008-09-11 2010-03-17 北大方正集团有限公司 Method and system for automatically extracting article metadata information based on word flow
CN102254046A (en) * 2011-08-18 2011-11-23 深圳市融创天下科技股份有限公司 Webpage data acquiring method and system
CN102495885A (en) * 2011-12-08 2012-06-13 中国信息安全测评中心 Method for integrating information safety data based on base-networking engine

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412890A (en) * 2013-07-19 2013-11-27 北京亿赞普网络技术有限公司 Webpage loading method and device
CN103412890B (en) * 2013-07-19 2017-06-06 北京亿赞普网络技术有限公司 A kind of webpage loading method and device
CN104462140A (en) * 2013-09-24 2015-03-25 北大方正集团有限公司 Webpage data collecting method and device
CN105721519B (en) * 2014-12-02 2019-02-05 阿里巴巴集团控股有限公司 A kind of webpage data acquiring method, apparatus and system
WO2016086784A1 (en) * 2014-12-02 2016-06-09 阿里巴巴集团控股有限公司 Method, apparatus and system for collecting webpage data
CN105721519A (en) * 2014-12-02 2016-06-29 阿里巴巴集团控股有限公司 Webpage data acquisition method, device and system
CN104850361A (en) * 2015-06-01 2015-08-19 广东电网有限责任公司信息中心 Data cleaning method and system
CN106649353B (en) * 2015-10-30 2020-05-22 北京国双科技有限公司 Method and device for collecting webpage data
CN106649353A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Webpage data collection method and apparatus
CN106708846B (en) * 2015-11-12 2020-04-21 北京国双科技有限公司 Method and device for collecting webpage data
CN106708846A (en) * 2015-11-12 2017-05-24 北京国双科技有限公司 Collection method and device for webpage data
CN105868169B (en) * 2016-04-06 2019-04-30 西安电子科技大学 A kind of data acquisition device, collecting method and system
CN105868169A (en) * 2016-04-06 2016-08-17 西安电子科技大学 Data acquisition interface and data acquisition method and system
CN107315576A (en) * 2016-04-26 2017-11-03 中兴通讯股份有限公司 A kind of method and system of dynamic expansion software flow
CN108021621A (en) * 2017-11-15 2018-05-11 平安科技(深圳)有限公司 Database data acquisition method, application server and computer-readable recording medium
WO2019095667A1 (en) * 2017-11-15 2019-05-23 平安科技(深圳)有限公司 Database data collection method, application server, and computer readable storage medium
CN109088908A (en) * 2018-06-06 2018-12-25 武汉酷犬数据科技有限公司 A kind of the distributed general collecting method and system of network-oriented
CN109360093A (en) * 2018-09-12 2019-02-19 珠海凡泰极客科技有限责任公司 A kind of securities trading reference data integration system
CN109273077A (en) * 2018-10-08 2019-01-25 北京万东医疗科技股份有限公司 Data processing method, device and smart machine
CN109273077B (en) * 2018-10-08 2021-08-31 北京万东医疗科技股份有限公司 Data processing method and device and intelligent equipment
CN109542555A (en) * 2018-10-26 2019-03-29 深圳点猫科技有限公司 A kind of international programming implementation method of realization educational applications and device
CN109492149A (en) * 2018-11-29 2019-03-19 深圳墨世科技有限公司 Crawler task processing method and device
CN109492149B (en) * 2018-11-29 2021-04-09 深圳大宇无限科技有限公司 Crawler task processing method and device
CN109766206A (en) * 2018-12-29 2019-05-17 北京中电普华信息技术有限公司 A kind of log collection method and system
CN110347399A (en) * 2019-05-31 2019-10-18 深圳绿米联创科技有限公司 Data processing method, real time computation system and information system
CN110347399B (en) * 2019-05-31 2023-06-06 深圳绿米联创科技有限公司 Data processing method, real-time computing system and information system
CN110244956A (en) * 2019-06-04 2019-09-17 北京中亦安图科技股份有限公司 Data analysis method, device and system
CN110347667A (en) * 2019-06-27 2019-10-18 上海淇馥信息技术有限公司 A kind of data cleaning method and device
CN111026796A (en) * 2019-11-29 2020-04-17 华南农业大学 Multi-source heterogeneous data acquisition method, device, system, medium and equipment
CN111026796B (en) * 2019-11-29 2023-05-16 华南农业大学 Multi-source heterogeneous data acquisition method, device, system, medium and equipment
CN111881404A (en) * 2020-08-05 2020-11-03 广州裕睿信息科技有限公司 Configuration data acquisition method, device and system
CN114428635A (en) * 2022-04-06 2022-05-03 杭州未名信科科技有限公司 Data acquisition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130508
