CN103092817A - Data collection method and data collection device based on script engine - Google Patents
Data collection method and data collection device based on script engine
- Publication number
- CN103092817A (application number CN2013100196239A)
- Authority
- CN
- China
- Prior art keywords
- script
- target data
- data
- rule
- title
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a data collection method and a data collection device based on a script engine. The method comprises the following steps: loading a preconfigured collection configuration file corresponding to the current collection task, and parsing it to obtain the target data collection rules; initializing the script engines that support different scripting languages, and loading preconfigured script files made up of script methods for collecting target data; downloading web page data, looking up the collection rules defined for the target data to be collected from that web page, and sending the downloaded web page data together with the script method names configured in those rules to the script engine of the corresponding scripting language; and having the script engine call and execute the corresponding script methods by name, so as to collect the target data from the web page data. Extraction, cleaning, processing and conversion are thus performed together during data collection by means of scripts, which addresses the stated technical problem.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a data collection method and device based on a script engine.
Background technology
Many mature targeted collection software products already exist in the industry. Almost all of them are implemented through template configuration, and their template-based data extraction methods are generally regular expression matching, marker-based interception, XPath extraction, plug-in customization and the like.
Regular expression matching: some extraction results may need secondary cleaning, processing and conversion before the target data is obtained, and the method demands specialist skill, namely proficiency with regular expressions.
Marker-based interception: some extraction results may need secondary cleaning, processing and conversion before the target data is obtained.
XPath extraction: the web page content must be structured, and the method demands specialist skill, namely proficiency with XPath syntax. In addition, some extraction results may need secondary cleaning, processing and conversion before the target data is obtained.
Plug-in customization: every change to the extraction rule code requires recompilation, which is cumbersome when the rules change frequently, and the method also demands specialist skill.
In summary, existing template-based data extraction methods share one characteristic: much of the extracted data must go through secondary cleaning, processing, conversion and so on before the desired target data is obtained, which lowers extraction efficiency. In addition, some extraction methods demand specialist skill, which hinders wide adoption.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a data collection method and device based on a script engine that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a data collection method based on a script engine is provided, comprising:
Step 1: load the preconfigured collection configuration file corresponding to the current collection task and parse it to obtain the target data collection rules, wherein the target data collection rules comprise the target data types and, for each type of target data, the corresponding script method name and scripting language;
Step 2: initialize each script engine supporting a different scripting language, and load the preconfigured script files made up of script methods for collecting target data;
Step 3: download the web page data, look up the collection rules defined for the target data to be collected from that web page, and send the downloaded web page data together with the script method names configured in the rules found to the script engine of the corresponding scripting language;
Step 4: the script engine calls and executes the corresponding script method according to the script method name, collecting the target data from the web page data.
Optionally, in the method of the invention, target data extraction, cleaning, processing and conversion rules are defined in the script method according to the requirements of the collection task.
Optionally, in the method of the invention, the target data extraction rules comprise: extraction rules defined according to regular expression matching, extraction rules defined according to marker-based interception, extraction rules defined according to XPath extraction, or extraction rules defined according to plug-in customization.
Optionally, in Step 4 of the method of the invention, executing the corresponding script method to collect the target data from the web page data specifically comprises:
extracting the specified target data from the web page data according to the extraction rules defined in the script method, and cleaning, processing and converting the extracted target data according to the cleaning, processing and conversion rules defined in the script method, to obtain the required target data.
Optionally, in the method of the invention, the target data types include but are not limited to title, author, date and content.
According to another aspect of the present invention, a data collection device based on a script engine is provided, comprising:
a Command Line Parsing module, configured to load the preconfigured collection configuration file corresponding to the current collection task and parse it to obtain the target data collection rules, wherein the target data collection rules comprise the target data types and, for each type of target data, the corresponding script method name and scripting language;
a data processing module, configured to download the web page data, look up the collection rules defined for the target data to be collected from that web page, and send the downloaded web page data together with the script method names configured in the rules found to the corresponding script engine in the script engine module, selected by scripting language;
a script engine module, comprising a plurality of script engines supporting different scripting languages, wherein each script engine, after initialization, loads the preconfigured script files made up of script methods for collecting target data and, after receiving the data sent by the data processing module, calls and executes the corresponding script method according to the script method name, collecting the target data from the web page data.
Optionally, in the device of the invention, target data extraction, cleaning, processing and conversion rules are defined, according to the requirements of the collection task, in the script methods of the script files loaded by the script engine module.
Optionally, in the device of the invention, in the script engine module, the target data extraction rules comprise: extraction rules defined according to regular expression matching, extraction rules defined according to marker-based interception, extraction rules defined according to XPath extraction, or extraction rules defined according to plug-in customization.
Optionally, in the device of the invention, the script engine module is specifically configured to extract the specified target data from the web page data according to the extraction rules defined in the script method, and to clean, process and convert the extracted target data according to the cleaning, processing and conversion rules defined in the script method, to obtain the required target data.
Optionally, in the device of the invention, in the Command Line Parsing module, the target data types include but are not limited to title, author, date and content.
The beneficial effects of the present invention are as follows:
The method and device of the invention configure script methods in a simple, easy-to-use scripting language, which makes target data collection flexible and convenient, lowers the specialist skill required for data collection, and favors wide adoption. Moreover, because script methods can be configured flexibly in a scripting language, cleaning, processing, conversion and similar operations are completed at extraction time, so the target data obtained needs no further processing, which greatly improves collection efficiency.
The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set forth below.
Description of drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 is a flowchart of a data collection method based on a script engine provided by an embodiment of the present invention;
Fig. 2 is an execution block diagram of the method according to the embodiment of the present invention;
Fig. 3 is a structural block diagram of a data collection device based on a script engine provided by an embodiment of the present invention.
Embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
To lower the specialist skill required for data collection and to improve collection efficiency, embodiments of the present invention provide a data collection method and device based on a script engine. By means of scripts, the method and device perform extraction, cleaning, processing and conversion together during data collection, which addresses the stated technical problem well.
Before the scheme of the invention is described in detail, several technical terms used in the technical solution are explained as follows:
Collection configuration file: defines the collection rule configuration for the target data that the collection task collects on each web page. The collection rule configuration mainly comprises the target data type and, for that type of target data, the corresponding script method name and scripting language. For example, if the target data type to extract is "title", the script method name defined for collecting the "title" data is "parseTitle", and the scripting language used is JavaScript.
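The configuration file described above can be pictured with a small Python sketch. The patent does not specify a file format, so the JSON layout and the key names (`task`, `fields`, `type`, `method`, `language`) are assumptions made purely for illustration; only the "title" / "parseTitle" / JavaScript triple comes from the text.

```python
import json

# Hypothetical configuration file content. The JSON layout and all key
# names are assumptions; the patent only says that each rule carries a
# target data type, a script method name and a scripting language.
CONFIG_TEXT = """
{
  "task": "news-articles",
  "fields": [
    {"type": "title",  "method": "parseTitle",  "language": "javascript"},
    {"type": "author", "method": "parseAuthor", "language": "javascript"}
  ]
}
"""


def parse_collection_rules(config_text):
    """Return one (data type, script method name, scripting language) tuple per rule."""
    config = json.loads(config_text)
    return [(f["type"], f["method"], f["language"]) for f in config["fields"]]


rules = parse_collection_rules(CONFIG_TEXT)
for data_type, method_name, language in rules:
    print(data_type, method_name, language)
```

Keeping each rule as a flat triple makes the later lookup simple: the language selects the engine and the method name selects the function to call.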
Script file: a file made up of script methods, written by the user in a scripting language, for collecting target data. Scripting languages are usually simple, easy to learn and easy to use, so once the specific requirements of the collection task are clear, the script methods can be configured in such a language, which greatly lowers the specialist skill required. Common scripting languages include JavaScript, VBScript and PHP.
Script engine: a tool that parses and executes script methods. In the present invention, a script engine obtains its script methods by loading the configured script files. Existing script engines include the JavaScript script engine and the VBScript script engine provided by Microsoft, among others.
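As a rough illustration of the two terms above, the following Python sketch treats a script file as source text and a script engine as an object that loads that text and then calls a method by name. It uses Python's built-in `exec` as a stand-in for a real JavaScript or VBScript engine; the class name `PythonScriptEngine` and its methods are hypothetical and are not taken from the patent.

```python
class PythonScriptEngine:
    """Toy script engine: loads script text, then calls its methods by name."""

    language = "python"

    def __init__(self):
        # Functions defined by loaded script files end up in this namespace.
        self._namespace = {}

    def load_script(self, source):
        # Run the script text so its top-level functions become callable.
        # exec must only be given trusted scripts.
        exec(source, self._namespace)

    def invoke(self, method_name, page_data):
        method = self._namespace.get(method_name)
        if not callable(method):
            raise KeyError(f"script method not found: {method_name}")
        return method(page_data)


# A one-method "script file"; parseTitle is the method name used in the text.
SCRIPT_FILE = '''
def parseTitle(page):
    start = page.find("<title>") + len("<title>")
    end = page.find("</title>", start)
    return page[start:end].strip()
'''

engine = PythonScriptEngine()
engine.load_script(SCRIPT_FILE)
title = engine.invoke("parseTitle", "<html><title> Hello </title></html>")
print(title)
```

Because the method is resolved by name at call time, the script file can be edited and reloaded without recompiling the host program, which is the property the text contrasts with plug-in customization.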
Based on the above explanation of terms, the specific implementation of the method embodiment and the device embodiment of the invention is given below.
Method embodiment
As shown in Fig. 1, an embodiment of the present invention provides a data collection method based on a script engine, comprising:
Step S101: load the preconfigured collection configuration file corresponding to the current collection task and parse it to obtain the target data collection rules, wherein the target data collection rules comprise the target data types and, for each type of target data, the corresponding script method name and scripting language;
In this step, the target data types include but are not limited to title, author, date and content; those skilled in the art can divide them flexibly according to user requirements.
Step S102: initialize each script engine supporting a different scripting language, and load the preconfigured script files made up of script methods for collecting target data;
In this step, target data extraction, cleaning, processing and conversion rules are defined in the script methods according to the requirements of the collection task.
The target data extraction rules may extract according to rules defined by regular expression matching, by marker-based interception, by XPath extraction, or by plug-in customization. Of course, the technical solution of the invention is not limited to these extraction rules, which can also be configured flexibly according to specific needs.
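Three of the extraction styles named above can be sketched as small Python helpers that a script method might call. This is only an illustration: the helper names, the sample page and the author name in it are made up, and "marker-based interception" is read here as taking the text between a start marker and an end marker. Plug-in customization is omitted because the text gives no detail about its interface.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page used only for this sketch.
PAGE = ("<html><head><title>Sample</title></head>"
        "<body><p class='author'>By Li Wei</p></body></html>")


def extract_by_regex(page, pattern):
    """Regular expression matching: return the first capture group, or None."""
    match = re.search(pattern, page, re.S)
    return match.group(1) if match else None


def extract_between_markers(page, start_marker, end_marker):
    """Marker-based interception: return the text between two markers, or None."""
    start = page.find(start_marker)
    if start < 0:
        return None
    start += len(start_marker)
    end = page.find(end_marker, start)
    return page[start:end] if end >= 0 else None


def extract_by_xpath(page, path):
    """XPath extraction: ElementTree handles only an XPath subset and
    requires well-formed markup, so real pages usually need an HTML parser."""
    node = ET.fromstring(page).find(path)
    return node.text if node is not None else None


print(extract_by_regex(PAGE, r"<title>(.*?)</title>"))
print(extract_between_markers(PAGE, "<p class='author'>By ", "</p>"))
print(extract_by_xpath(PAGE, ".//p[@class='author']"))
```

Note that the XPath result still carries the "By " prefix, which is the kind of leftover that the cleaning step defined in the same script method is meant to remove.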
Step S103: download the web page data, look up the collection rules defined for the target data to be collected from that web page, and send the downloaded web page data together with the script method names configured in the rules found to the script engine of the corresponding scripting language;
Step S104: the script engine calls and executes the corresponding script method according to the script method name, collecting the target data from the web page data.
Preferably, in this step, calling and executing the corresponding script method to collect the target data from the web page data specifically comprises: extracting the specified target data from the web page data according to the extraction rules defined in the script method, and cleaning, processing and converting the extracted target data according to the cleaning, processing and conversion rules defined in the script method, to obtain the required target data.
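The point of the step above is that one script method can extract, clean, process and convert in a single pass. A minimal Python sketch of such a method for a "date" field follows; the `<span class="date">` markup, the function name `parse_date` and the choice of ISO 8601 as the output format are all assumptions for illustration.

```python
import re
from datetime import date


def parse_date(page):
    """Hypothetical script method for the 'date' target data type."""
    # Extract: take the raw content of an assumed <span class="date"> element.
    match = re.search(r'<span class="date">(.*?)</span>', page, re.S)
    if match is None:
        return None
    raw = match.group(1)
    # Clean: drop nested tags and collapse runs of whitespace.
    cleaned = re.sub(r"<[^>]+>", "", raw)
    cleaned = " ".join(cleaned.split())
    # Process: pick out the year, month and day digits.
    parts = re.search(r"(\d{4})\D+(\d{1,2})\D+(\d{1,2})", cleaned)
    if parts is None:
        return None
    year, month, day = (int(p) for p in parts.groups())
    # Convert: emit the date in ISO 8601 form.
    return date(year, month, day).isoformat()


sample = '<span class="date"> Published:\n <b>2013/1/18</b> </span>'
result = parse_date(sample)
print(result)
```

The caller receives a value that is ready to store, so no separate post-processing stage is needed after extraction.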
Fig. 2 shows the execution framework based on the principle of Fig. 1. Specifically, this embodiment divides collection into two phases: before collection and during collection. Before a task is collected, the system performs the preparation appropriate to that collection task. It first parses the collection configuration file corresponding to the task, so as to establish the correspondence between the rule configuration for capturing target data and the script methods. It then initializes the script engine for each scripting language, and each script engine loads the script file for its language. Collection can then begin. During collection, the system first downloads the web page data and then finds the target data rule configuration defined for extraction on that web page. It passes the script method names in the target data extraction rule configuration, one by one, together with the downloaded web page data, to the script engine, which executes the script methods. The script engine performs data extraction, cleaning, processing, conversion and other operations according to the corresponding script method, and the resulting target data is finally taken for subsequent processing.
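The two phases described for Fig. 2 can be sketched end to end in Python. Everything concrete here is an assumption: the configuration layout (`sites`, `url_prefix`, `fields`), matching pages to rules by URL prefix, the `ExecEngine` stand-in that runs Python source instead of JavaScript, and the `fake_download` function that replaces a real HTTP request.

```python
class ExecEngine:
    """Stand-in script engine for one language (Python source run with exec)."""

    def __init__(self):
        self.namespace = {}

    def load_script(self, source):
        exec(source, self.namespace)  # trusted scripts only

    def invoke(self, method_name, page):
        return self.namespace[method_name](page)


def prepare(config, script_sources):
    """Before collection: parse the rules, then initialise engines and load scripts."""
    rules_by_prefix = {}
    for site in config["sites"]:
        rules_by_prefix[site["url_prefix"]] = [
            (f["type"], f["method"], f["language"]) for f in site["fields"]
        ]
    engines = {}
    for language, source in script_sources.items():
        engines[language] = ExecEngine()
        engines[language].load_script(source)
    return rules_by_prefix, engines


def collect(url, download, rules_by_prefix, engines):
    """During collection: download, find the page's rules, dispatch by language."""
    page = download(url)
    record = {}
    for prefix, rules in rules_by_prefix.items():
        if url.startswith(prefix):
            for data_type, method_name, language in rules:
                # Pass the method name and the page data to the matching engine.
                record[data_type] = engines[language].invoke(method_name, page)
    return record


CONFIG = {"sites": [{"url_prefix": "http://example.com/news/",
                     "fields": [{"type": "title", "method": "parseTitle",
                                 "language": "python"}]}]}
SCRIPTS = {"python": "def parseTitle(page):\n"
                     "    return page.split('<h1>')[1].split('</h1>')[0].strip()\n"}


def fake_download(url):
    # Placeholder for a real HTTP download.
    return "<html><h1> Script engines </h1></html>"


rules_by_prefix, engines = prepare(CONFIG, SCRIPTS)
record = collect("http://example.com/news/1.html", fake_download,
                 rules_by_prefix, engines)
print(record)
```

Passing the download function in as a parameter keeps the sketch runnable offline; a real collector would substitute an HTTP client and hand `record` on to whatever subsequent processing the task requires.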
In summary, the method of the embodiment of the present invention configures script methods in a simple, easy-to-use scripting language, which makes target data collection flexible and convenient, lowers the specialist skill required for data collection, and favors wide adoption. Moreover, because script methods can be configured flexibly in a scripting language, cleaning, processing, conversion and similar operations are completed at extraction time, so the target data obtained needs no further processing, which greatly improves collection efficiency.
Device embodiment
As shown in Fig. 3, an embodiment of the present invention provides a data collection device based on a script engine, comprising a Command Line Parsing module 310, a data processing module 320 and a script engine module 330.
The Command Line Parsing module 310 is configured to load the preconfigured collection configuration file corresponding to the current collection task and parse it to obtain the target data collection rules, wherein the target data collection rules comprise the target data types and, for each type of target data, the corresponding script method name and scripting language. The target data types include but are not limited to title, author, date and content.
Further, in the embodiment of the present invention, target data extraction, cleaning, processing and conversion rules are defined, according to the requirements of the collection task, in the script methods of the script files loaded by the script engine module 330.
The target data extraction rules may extract according to rules defined by regular expression matching, by marker-based interception, by XPath extraction, or by plug-in customization. Of course, the technical solution of the invention is not limited to these extraction rules, which can also be configured flexibly according to specific needs.
Further, the script engine module 330 is specifically configured to extract the specified target data from the web page data according to the extraction rules defined in the script method, and to clean, process and convert the extracted target data according to the cleaning, processing and conversion rules defined in the script method, to obtain the required target data.
In summary, the device of the embodiment of the present invention configures script methods in a simple, easy-to-use scripting language, which makes target data collection flexible and convenient, lowers the specialist skill required for data collection, and favors wide adoption. Moreover, because script methods can be configured flexibly in a scripting language, cleaning, processing, conversion and similar operations are completed at extraction time, so the target data obtained needs no further processing, which greatly improves collection efficiency.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the description above of a specific language is given to disclose the best mode of the invention.
Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of an embodiment may be combined into one module or unit or component, and may furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
In addition, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the script-engine-based data collection device according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference symbols placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
Claims (10)
1. A data collection method based on a script engine, characterized by comprising:
Step 1: loading the preconfigured collection configuration file corresponding to the current collection task and parsing it to obtain the target data collection rules, wherein the target data collection rules comprise the target data types and, for each type of target data, the corresponding script method name and scripting language;
Step 2: initializing each script engine supporting a different scripting language, and loading the preconfigured script files made up of script methods for collecting target data;
Step 3: downloading the web page data, looking up the collection rules defined for the target data to be collected from that web page, and sending the downloaded web page data together with the script method names configured in the rules found to the script engine of the corresponding scripting language;
Step 4: the script engine calling and executing the corresponding script method according to the script method name, collecting the target data from the web page data.
2. The method of claim 1, characterized in that target data extraction, cleaning, processing and conversion rules are defined in the script method according to the requirements of the collection task.
3. The method of claim 2, characterized in that the target data extraction rules comprise: extraction rules defined according to regular expression matching, extraction rules defined according to marker-based interception, extraction rules defined according to XPath extraction, or extraction rules defined according to plug-in customization.
4. The method of claim 2 or 3, characterized in that, in Step 4, executing the corresponding script method to collect the target data from the web page data specifically comprises:
extracting the specified target data from the web page data according to the extraction rules defined in the script method, and cleaning, processing and converting the extracted target data according to the cleaning, processing and conversion rules defined in the script method, to obtain the required target data.
5. The method of claim 1, characterized in that the target data types comprise: title, author, date, content.
6. A data collection device based on a script engine, characterized by comprising:
a Command Line Parsing module, configured to load the preconfigured collection configuration file corresponding to the current collection task and parse it to obtain the target data collection rules, wherein the target data collection rules comprise the target data types and, for each type of target data, the corresponding script method name and scripting language;
a data processing module, configured to download the web page data, look up the collection rules defined for the target data to be collected from that web page, and send the downloaded web page data together with the script method names configured in the rules found to the corresponding script engine in the script engine module, selected by scripting language;
a script engine module, comprising a plurality of script engines supporting different scripting languages, wherein each script engine, after initialization, loads the preconfigured script files made up of script methods for collecting target data and, after receiving the data sent by the data processing module, calls and executes the corresponding script method according to the script method name, collecting the target data from the web page data.
7. The device of claim 6, characterized in that target data extraction, cleaning, processing and conversion rules are defined, according to the requirements of the collection task, in the script methods of the script files loaded by the script engine module.
8. The device of claim 7, characterized in that, in the script engine module, the target data extraction rules comprise: extraction rules defined according to regular expression matching, extraction rules defined according to marker-based interception, extraction rules defined according to XPath extraction, or extraction rules defined according to plug-in customization.
9. The device of claim 7 or 8, characterized in that the script engine module is specifically configured to extract the specified target data from the web page data according to the extraction rules defined in the script method, and to clean, process and convert the extracted target data according to the cleaning, processing and conversion rules defined in the script method, to obtain the required target data.
10. The device of claim 6, characterized in that, in the Command Line Parsing module, the target data types comprise: title, author, date, content.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013100196239A CN103092817A (en) | 2013-01-18 | 2013-01-18 | Data collection method and data collection device based on script engine |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013100196239A CN103092817A (en) | 2013-01-18 | 2013-01-18 | Data collection method and data collection device based on script engine |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103092817A true CN103092817A (en) | 2013-05-08 |
Family
ID=48205405
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013100196239A (Pending) CN103092817A (en) | 2013-01-18 | 2013-01-18 | Data collection method and data collection device based on script engine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103092817A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020007358A1 (en) * | 1998-09-01 | 2002-01-17 | David E. Johnson | Architecture of a framework for information extraction from natural language documents |
WO2002007358A1 (en) * | 2000-07-14 | 2002-01-24 | Fujitsu Limited | Cdma receiver |
US20020065802A1 (en) * | 2000-05-30 | 2002-05-30 | Koki Uchiyama | Distributed monitoring system providing knowledge services |
CN101561802A (en) * | 2008-04-18 | 2009-10-21 | 上海复旦光华信息科技股份有限公司 | Web page structural data extraction method and system |
CN101582075A (en) * | 2009-06-24 | 2009-11-18 | 大连海事大学 | Web information extraction system |
CN101673256A (en) * | 2008-09-11 | 2010-03-17 | 北大方正集团有限公司 | Method and system for automatically extracting article metadata information based on word flow |
CN102254046A (en) * | 2011-08-18 | 2011-11-23 | 深圳市融创天下科技股份有限公司 | Webpage data acquiring method and system |
CN102495885A (en) * | 2011-12-08 | 2012-06-13 | 中国信息安全测评中心 | Method for integrating information safety data based on base-networking engine |
2013-01-18: Application CN2013100196239A filed in CN (publication CN103092817A); status: Pending
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103412890A (en) * | 2013-07-19 | 2013-11-27 | 北京亿赞普网络技术有限公司 | Webpage loading method and device |
CN103412890B (en) * | 2013-07-19 | 2017-06-06 | 北京亿赞普网络技术有限公司 | A kind of webpage loading method and device |
CN104462140A (en) * | 2013-09-24 | 2015-03-25 | 北大方正集团有限公司 | Webpage data collecting method and device |
CN105721519B (en) * | 2014-12-02 | 2019-02-05 | 阿里巴巴集团控股有限公司 | A kind of webpage data acquiring method, apparatus and system |
WO2016086784A1 (en) * | 2014-12-02 | 2016-06-09 | 阿里巴巴集团控股有限公司 | Method, apparatus and system for collecting webpage data |
CN105721519A (en) * | 2014-12-02 | 2016-06-29 | 阿里巴巴集团控股有限公司 | Webpage data acquisition method, device and system |
CN104850361A (en) * | 2015-06-01 | 2015-08-19 | 广东电网有限责任公司信息中心 | Data cleaning method and system |
CN106649353B (en) * | 2015-10-30 | 2020-05-22 | 北京国双科技有限公司 | Method and device for collecting webpage data |
CN106649353A (en) * | 2015-10-30 | 2017-05-10 | 北京国双科技有限公司 | Webpage data collection method and apparatus |
CN106708846B (en) * | 2015-11-12 | 2020-04-21 | 北京国双科技有限公司 | Method and device for collecting webpage data |
CN106708846A (en) * | 2015-11-12 | 2017-05-24 | 北京国双科技有限公司 | Collection method and device for webpage data |
CN105868169B (en) * | 2016-04-06 | 2019-04-30 | 西安电子科技大学 | A kind of data acquisition device, collecting method and system |
CN105868169A (en) * | 2016-04-06 | 2016-08-17 | 西安电子科技大学 | Data acquisition interface and data acquisition method and system |
CN107315576A (en) * | 2016-04-26 | 2017-11-03 | 中兴通讯股份有限公司 | A kind of method and system of dynamic expansion software flow |
CN108021621A (en) * | 2017-11-15 | 2018-05-11 | 平安科技(深圳)有限公司 | Database data acquisition method, application server and computer-readable recording medium |
WO2019095667A1 (en) * | 2017-11-15 | 2019-05-23 | 平安科技(深圳)有限公司 | Database data collection method, application server, and computer readable storage medium |
CN109088908A (en) * | 2018-06-06 | 2018-12-25 | 武汉酷犬数据科技有限公司 | A kind of the distributed general collecting method and system of network-oriented |
CN109360093A (en) * | 2018-09-12 | 2019-02-19 | 珠海凡泰极客科技有限责任公司 | A kind of securities trading reference data integration system |
CN109273077A (en) * | 2018-10-08 | 2019-01-25 | 北京万东医疗科技股份有限公司 | Data processing method, device and smart machine |
CN109273077B (en) * | 2018-10-08 | 2021-08-31 | 北京万东医疗科技股份有限公司 | Data processing method and device and intelligent equipment |
CN109542555A (en) * | 2018-10-26 | 2019-03-29 | 深圳点猫科技有限公司 | A kind of international programming implementation method of realization educational applications and device |
CN109492149A (en) * | 2018-11-29 | 2019-03-19 | 深圳墨世科技有限公司 | Crawler task processing method and device |
CN109492149B (en) * | 2018-11-29 | 2021-04-09 | 深圳大宇无限科技有限公司 | Crawler task processing method and device |
CN109766206A (en) * | 2018-12-29 | 2019-05-17 | 北京中电普华信息技术有限公司 | A kind of log collection method and system |
CN110347399A (en) * | 2019-05-31 | 2019-10-18 | 深圳绿米联创科技有限公司 | Data processing method, real time computation system and information system |
CN110347399B (en) * | 2019-05-31 | 2023-06-06 | 深圳绿米联创科技有限公司 | Data processing method, real-time computing system and information system |
CN110244956A (en) * | 2019-06-04 | 2019-09-17 | 北京中亦安图科技股份有限公司 | Data analysis method, device and system |
CN110347667A (en) * | 2019-06-27 | 2019-10-18 | 上海淇馥信息技术有限公司 | A kind of data cleaning method and device |
CN111026796A (en) * | 2019-11-29 | 2020-04-17 | 华南农业大学 | Multi-source heterogeneous data acquisition method, device, system, medium and equipment |
CN111026796B (en) * | 2019-11-29 | 2023-05-16 | 华南农业大学 | Multi-source heterogeneous data acquisition method, device, system, medium and equipment |
CN111881404A (en) * | 2020-08-05 | 2020-11-03 | 广州裕睿信息科技有限公司 | Configuration data acquisition method, device and system |
CN114428635A (en) * | 2022-04-06 | 2022-05-03 | 杭州未名信科科技有限公司 | Data acquisition method and device, electronic equipment and storage medium |
Similar Documents
Publication | Title |
---|---|
CN103092817A (en) | Data collection method and data collection device based on script engine |
CN104965901A (en) | Method and apparatus for grabbing content of target page |
CN102843445B (en) | A kind of browser and carry out the method for domain name mapping |
CN102982161A (en) | Method and device for acquiring webpage information |
CN102982162B (en) | The acquisition system of info web |
CN103744853A (en) | Method and device for providing web cache information in search engine |
CN103714116A (en) | Webpage information extracting method and webpage information extracting equipment |
CN103793462A (en) | URL (uniform resource locator) purifying method and device |
CN103577552A (en) | Webpage picture processing method and device |
CN102882991A (en) | Browser and domain name resolution method thereof |
CN103761079A (en) | Method and device for automatically graying page |
CN103020266A (en) | Method and device for extracting webpage text content |
CN111143403B (en) | SQL conversion method and device and storage medium |
CN102855334A (en) | Browser and method for acquiring domain name system (DNS) resolving data |
CN110109671B (en) | Webpack label size and style conversion method and device |
CN103034622A (en) | Rich text content processing method and server |
CN102981848A (en) | Webpage main body element processing browser and method |
CN105335516A (en) | Construction method of universal acquisition system |
CN106202323A (en) | A kind for the treatment of method and apparatus of daily record |
CN103593406A (en) | Static resource identifier processing method and device |
CN102902784B (en) | Web page classification storage system and method |
CN103034700A (en) | Rich text content processing method and system |
CN104361007A (en) | Browser and processing method for browser favorites |
CN109299423A (en) | A method of obtaining network data |
CN106951405A (en) | Data processing method and device based on typesetting engine |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 2013-05-08 |