AUTOMATIC DATA EXTRACTION
Field of the Invention
The present invention relates to a method and apparatus for automating the extraction of information from on-line information sources.
Background to the Invention
It is well known that databases can be set up so as to be accessible over the World Wide Web or other communications networks. However, it is common for a user to be required to enter a certain amount of information in order to gain access to the data in such an on-line information source. This information is typically subscriber details, such as a password, and/or search criteria. When the user later desires to repeat access to the Website to extract further data, it is usual for the user to be required to again navigate the Website and enter a certain amount of information to gain access to the further data. The majority of the navigation steps and information being entered will be similar or identical to that entered previously.
Summary of the Invention
According to the present invention, a method of extracting data from an on-line information source which requires a sequence of inputs from a user terminal before data can be accessed, comprises the steps of: providing a user with an emulated user interface to the on-line information source; monitoring a sequence of user inputs whilst the user accesses the on-line information source via the emulated user interface; building an on-line information source access script from the sequence of inputs; and, storing the on-line information source access script for subsequent automatic access to the on-line information source. Preferably, the stored script can be accessed and edited by the user. The stored script may be run in response to a command from a user or in response to a schedule. The script may contain subscriber details, search terms and general navigation data.
Preferably, the method further includes the step of providing the user with extracted data from the on-line information source through the emulated user interface. The method may further include the step of storing the extracted data so that it can be compared with subsequently extracted data and only new data provided to the user or reports of new data may be generated. The extracted data may also be filtered according to predefined user preferences.
The on-line information source may, for example, be a Website or an on-line database. Preferably, the emulated user interface is a substitute Web browser window. More preferably the emulated user interface determines input options at the on-line information source and presents modified input options to the user.
According to the present invention a data extraction apparatus for use with a user terminal and an on-line information source which requires a sequence of inputs before data can be accessed, comprises: means for providing a user with an emulated user interface to the on-line information source; means for monitoring a sequence of user inputs whilst the user accesses the on-line information source via the emulated user interface; means for building an on-line information source access script from the sequence of inputs; and, means for storing the on-line information source access script for subsequent automatic access to the on-line information source.
Preferably, the apparatus takes the form of computer executable code. The computer executable code may be run on a user terminal or on a remote server. Alternatively, the computer executable code may be distributed and run on a number of connected machines.
Preferably, the stored script can be accessed and amended by a user. The script may be run on the on-line information source in response to a command from a user, or the processor may further include scheduling means so that the script can be run on the on-line information source at predetermined times. According to the present invention a computer program comprises program code adapted to perform all the steps of a method according to the present invention, when said program is run on a computer. The computer program may be embodied on a computer readable medium.
The present invention provides an apparatus and a method for automating the data extraction process so that certain information may be presented automatically, to obviate the need for it to be entered repeatedly.
In addition the system enables the user to enter this information in such a way that the sequence is learned so that it may be used again in whole or in part to make further automated accesses. Having first learned the sequence the system then allows the user to vary the sequence in a defined manner to gain access to further similar information.
Having automated the data extraction process it is then possible to repeat the process without the need for the user to be involved and to collect the data to a predetermined schedule. It is then further possible to automatically check this data for predetermined changes in order to generate further reports or alarms.
Brief Description of the Drawings
Examples of the present invention will now be described with reference to the accompanying drawings, in which:-
Figure 1 is a block diagram of apparatus operating as a server in accordance with the present invention to provide the inter-layer service to the user; and
Figure 2 is a block diagram of apparatus operating as a client in accordance with the present invention to provide the inter-layer service to the user.
Detailed Description
Figure 1 is a block diagram of an intermediate server or "inter-layer" system to provide the functions of an Internet Service Provider (ISP) server operating in accordance with the present invention. The block diagram illustrates functionality that may be embodied as software or alternatively attached as hardware or a combination of the two.
The ISP server is represented as a single machine but may consist of a number of machines providing similar functionality or be an integral part of another machine or system that also performs other functions. The ISP server has similar functionality to a standard ISP server with the addition of other specific functions to enable it to operate in accordance with the present invention.
The ISP server has standard communications connections linking it to the network 1. Examples of suitable networks are the Internet or an Intranet. One such connection is illustrated as the target site access connection 3 which provides access to the network for the main inter-layer processor 9 so that it may connect to Web pages to collect information on behalf of the user. Another such connection is illustrated as the user access connection 6, which provides user access to the inter-layer system including the main inter-layer processor 9 and also the controlling inter-layer Website. An alternative access connection 8 may also be provided to allow for access to the system from an alternative network 2. One example of such alternative access would be a link established by a user over a dial-in connection, such as a modem over a telephone line.
The e-mail server 5 provides the normal e-mail facilities to the inter-layer site. In addition to providing standard e-mail facilities, the server also allows the automatic inter-layer processing to filter recovered data, generate reports and then e-mail these to the user as determined by the user preferences.
Other services may be provided, either internally or externally to the inter-layer server. Connection to the network is provided by the general access connection 4. Alternatively, some services may require use of an alternative network 2 and an alternative network connection 7 may be provided for this purpose. There may be several alternative network connections provided. The type of connection may vary depending on the specific type of network. Examples of alternative network connections are a digital adapter to connect to a dedicated leased line or a virtual private network adapter allowing a TCP/IP connection to be established over the Internet.
The main inter-layer processing 9 presents the user with a graphical user interface that provides the facilities and appearance of a standard browser, with the user's own browser controls being hidden from view. Thus accesses to Web pages made by the user are accomplished only indirectly through the inter-layer processor. The inter-layer processor is therefore able to monitor and record transactions and data in either direction on behalf of the user and save these in the user sequences 19 area. These sequences can then be recalled later and executed again.
The sequence builder 13 can access these stored sequences and present them to the user so that they may be edited or enhanced to include complex substitution of access data in order to enable the system to retrieve alternative information on behalf of the user.
Editing of such sequences may be further facilitated by use of an appropriate computer readable script and associated scripting language. It is therefore possible for a suitably skilled person to directly create and edit such a script to automate the process of retrieving information from Web pages. The interpretation of such inter-layer processor (ILP) scripts is achieved by the ILP script processor 14.
The script is a generic template describing the logical path which a user navigates through a Website to retrieve the desired information. It goes beyond a simple hyperlink as it allows the user to access an item which may be buried behind authentication/login pages, tables of contents, menus, indices and is particularly useful where no direct hyperlink is provided to the data.
The script describes in concise pseudo-English a navigation pathway through the Website, in terms of following hyperlinks which match given criteria and submitting form data, and allows the scope within a page in which the hyperlinks are detected to be restricted, i.e. it is a readable description of the procedure followed when browsing.
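By way of illustration only, the step of following a hyperlink which matches a criterion, with the search restricted to a part of the page, might be sketched as follows. The sketch uses the Python standard library HTML parser. The names LinkFinder and find_link, the use of an element id to define the scope, and the sample page are assumptions made for this example and are not taken from the described system.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LinkFinder(HTMLParser):
    """Collect (href, text) pairs, optionally only inside one element id."""

    def __init__(self, scope_id: Optional[str] = None):
        super().__init__()
        self.scope_id = scope_id
        self.depth = 0  # nesting depth inside the scoped element (0 = outside)
        self.href: Optional[str] = None
        self.text: List[str] = []
        self.links: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attributes = dict(attrs)
        if self.depth > 0:
            self.depth += 1
        elif self.scope_id is not None and attributes.get("id") == self.scope_id:
            self.depth = 1
        in_scope = self.scope_id is None or self.depth > 0
        if tag == "a" and in_scope and attributes.get("href"):
            self.href, self.text = attributes["href"], []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "a" and self.href is not None:
            self.links.append((self.href, "".join(self.text).strip()))
            self.href = None
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)


def find_link(html: str, contains: str, scope_id: Optional[str] = None) -> Optional[str]:
    """Return the href of the first in-scope link whose text contains `contains`."""
    finder = LinkFinder(scope_id)
    finder.feed(html)
    finder.close()
    for href, text in finder.links:
        if contains in text:
            return href
    return None


# Hypothetical page with a menu and a table of contents.
PAGE = (
    '<div id="menu"><a href="/home">Home</a></div>'
    '<div id="contents"><ul>'
    '<li><a href="/v12/i3">Volume 12, Issue 3</a></li>'
    '<li><a href="/v12/i4">Volume 12, Issue 4</a></li>'
    "</ul></div>"
)
print(find_link(PAGE, "Issue 4", scope_id="contents"))
```

In this sketch a link in the menu is ignored when the scope is limited to the table of contents, which corresponds to restricting where on the page hyperlinks are detected.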
The script is generated automatically by the inter-layer processor by tracking how the user navigates sequentially through the Website and is presented in user readable form to the user so that it can be amended/corrected once the user indicates that the target data has been reached (e.g. by clicking on a 'stop recording script' icon).
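The recording behaviour described above might be sketched as follows. This is a minimal illustration under assumptions: the class name SequenceRecorder, its method names and the wording of the generated script lines are hypothetical, since the actual script syntax is not specified here.

```python
from typing import Dict, List


class SequenceRecorder:
    """Collect navigation events until the user stops recording."""

    def __init__(self) -> None:
        self.recording = True
        self.steps: List[str] = []

    def link_followed(self, link_text: str) -> None:
        """Record that the user clicked a hyperlink with the given text."""
        if self.recording:
            self.steps.append(f'follow link containing "{link_text}"')

    def form_submitted(self, fields: Dict[str, str]) -> None:
        """Record that the user submitted a form with the given values."""
        if self.recording:
            pairs = ", ".join(f"{name}={value}" for name, value in fields.items())
            self.steps.append(f"submit form with {pairs}")

    def stop(self) -> str:
        """Stop recording and return the script in readable form for editing."""
        self.recording = False
        return "\n".join(self.steps)


recorder = SequenceRecorder()
recorder.form_submitted({"user": "alice"})
recorder.link_followed("Volume 12")
script_text = recorder.stop()            # the 'stop recording script' action
recorder.link_followed("not recorded")   # ignored once recording has stopped
print(script_text)
```

The text returned by stop() is what would be presented to the user for amendment or correction.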
The user is prompted to resolve any ambiguities which the inter-layer processor cannot resolve given knowledge of the user variables. The user is invited to answer using one of a number of pre-formed answers. For example, given the question "Why did you click this link and not that one?" the user will use answers such as "because it contains x" or "because it follows y".
The script, with ambiguities thus removed, is then sufficiently generic to find all items of data from the Website that are accessed by the same navigation procedure but with different values of the user variables (e.g. different page numbers) and consequently a different choice of which navigation links are followed and which form data is submitted at each step of the navigation sequence. The user variables are search criteria which specify the data the user is seeking. For example, to locate an article in an electronic journal, a user might want to specify the volume, issue number, page number and publication year and may therefore store these values as the variables $vol, $num, $page, $year.
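As an illustrative sketch only, the substitution of user variables into a generic script can be shown with the Python standard library class string.Template, which happens to use the same $name notation. The wording of the script steps and the function name bind are assumptions made for this example.

```python
from string import Template
from typing import Dict, List

# Hypothetical generic script: each step refers to user variables by name.
GENERIC_STEPS = [
    "follow link containing 'Volume $vol'",
    "follow link containing 'Issue $num ($year)'",
    "follow link containing 'p. $page'",
]


def bind(steps: List[str], variables: Dict[str, str]) -> List[str]:
    """Replace $vol, $num, $page and $year in each step with concrete values.

    Template.substitute raises KeyError if a variable is missing, so an
    incomplete set of search criteria is detected before any access is made.
    """
    return [Template(step).substitute(variables) for step in steps]


for line in bind(GENERIC_STEPS, {"vol": "12", "num": "3", "page": "145", "year": "1999"}):
    print(line)
```

Running the same generic steps with a different set of values yields a different concrete navigation sequence without the script itself being re-recorded.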
Further automation of this process on behalf of the user is accomplished by scheduling 10 which enables the stored sequences to be executed on an automated and repeated basis in order to check for changes in the data recovered over a period of time. Event trigger generation 12 can check the retrieved data against a set of predetermined rules on behalf of the user and cause further activities to be undertaken on the basis of these findings.
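The checking of retrieved data against predetermined rules might, for example, be sketched as below. The Rule structure, the rule names, the action labels and the sample values are all hypothetical and serve only to illustrate the principle.

```python
from typing import Callable, Dict, List, NamedTuple


class Rule(NamedTuple):
    name: str
    condition: Callable[[Dict[str, float]], bool]  # test applied to retrieved data
    action: str                                    # label of the follow-up activity


def triggered(data: Dict[str, float], rules: List[Rule]) -> List[Rule]:
    """Return the rules whose condition holds for the retrieved data."""
    return [rule for rule in rules if rule.condition(data)]


# Illustrative rules for a retrieved value and the value stored from the last run.
rules = [
    Rule("fell more than 2%", lambda d: d["value"] < 0.98 * d["previous"], "e-mail report"),
    Rule("above 7000", lambda d: d["value"] > 7000, "send text message"),
]
data = {"value": 6120.5, "previous": 6300.0}

for rule in triggered(data, rules):
    print(f"{rule.name} -> {rule.action}")
```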
The information extracted from Web pages on behalf of the user can be further filtered, formatted or processed and then presented in a pre-defined format by the report generation 11. The report generation 11 provides the ability to filter and select the extracted data according to the preferences set by the user and to generate reports based on these. Reports may be generated from data extracted directly from Web pages or data derived by further processing. Reports may be presented in textual or various graphical layouts or both and may be formatted and encoded in various forms for transmission and viewing by the user. Examples of the various forms of the reports generated would include HTML Web pages, E-mail messages, graphical files, text messages and WML pages for viewing using a WAP enabled mobile terminal amongst others. The user may choose to view these reports directly and have the availability of the report signalled by various means or have the contents of the report forwarded directly to a remote location. One example is the extraction of time-variable data such as a stock market index at regular intervals, and the tabulation and graphing of these values for storage, viewing or onward transmission.
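A minimal sketch of filtering extracted values and presenting them in two of the forms mentioned above (plain text and an HTML page) is given below. The function names, the layout of the reports and the sample readings are assumptions made for illustration.

```python
from html import escape
from typing import List, Tuple

Reading = Tuple[str, float]  # (time label, extracted value)


def above(readings: List[Reading], threshold: float) -> List[Reading]:
    """Filter step: keep only readings above a user-chosen threshold."""
    return [reading for reading in readings if reading[1] > threshold]


def text_report(title: str, readings: List[Reading]) -> str:
    """Tabulate readings as plain text, e.g. for an e-mail or text message."""
    lines = [title] + [f"{when}  {value:10.2f}" for when, value in readings]
    return "\n".join(lines)


def html_report(title: str, readings: List[Reading]) -> str:
    """Tabulate the same readings as a small HTML page."""
    rows = "".join(
        f"<tr><td>{escape(when)}</td><td>{value:.2f}</td></tr>"
        for when, value in readings
    )
    return f"<html><body><h1>{escape(title)}</h1><table>{rows}</table></body></html>"


# Illustrative values only.
readings = [("09:00", 6310.25), ("10:00", 6295.5), ("11:00", 6342.0)]
print(text_report("Index", above(readings, 6300)))
print(html_report("Index", readings))
```

Keeping the filter separate from the formatters allows the same selected data to be encoded in whichever form the user preferences call for.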
The server has provision for storing various information in order to enable this process. User preferences 16 are determined by the user, initially at registration then modified as a result of usage of the system or manually updated by accessing the inter-layer control Website. User data 17 is data retained by the inter-layer server on behalf of the user and may comprise various data required by the server to facilitate inter-layer processing, ILP script processing and subsequent reporting of results. Caching 18 is employed by the server to provide temporary copies of information from both the target Web pages and the inter-layer system so that it may be displayed more quickly to the user.
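The caching of temporary copies might be sketched as follows. The expiry policy (a fixed lifetime per entry), the class name PageCache and the stand-in fetch function are assumptions for this example; no particular caching policy is specified above.

```python
import time
from typing import Callable, Dict, List, Tuple


class PageCache:
    """Keep temporary copies of fetched pages for `ttl` seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str, fetch: Callable[[str], str]) -> str:
        """Return a cached copy if still fresh, otherwise fetch and store it."""
        now = self.clock()
        cached = self.entries.get(url)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]
        content = fetch(url)
        self.entries[url] = (now, content)
        return content


calls: List[str] = []


def fake_fetch(url: str) -> str:
    """Stand-in for a real network access; records each call."""
    calls.append(url)
    return f"content of {url}"


cache = PageCache(ttl=60.0)
cache.get("http://example.com/a", fake_fetch)
cache.get("http://example.com/a", fake_fetch)  # served from the cache
print(len(calls))
```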
The user access to the system is governed by the subscriber account management 15 with reference to the subscription data 20 and user authentication 21 information retained by the server. This ensures that only registered users have access to the facilities provided.
Figure 2 is a block diagram of apparatus to provide the functions of an inter-layer client operating in accordance with the present invention. The block diagram illustrates functionality that may be embodied within software or alternatively attached as hardware or a combination of the two.
The inter-layer client provides the user with the functionality of the inter-layer system and may be utilised at any location provided that a connection can be established with the network from time to time.
The inter-layer client described may exist as a single entity or various functionality may be distributed to different parts of the system or be an integral part of another machine or system that also performs other functions. The inter-layer client may operate in isolation or may operate in conjunction with, or be integrated with, other apparatus. One example is operation in conjunction with, or integrated as part of, appropriate software functioning as a browser. There is little difference between the fundamental operation of a server based inter-layer system and a client based inter-layer system. The client based system allows closer coupling with the rest of the PC functionality but the server based system allows much greater power and functionality due to the power and connectivity of the server. The server also allows more frequent searches as the user does not need to have his machine connected to the network.
The client has standard communications connections linking it to the network 1. One such connection is illustrated as the target site access connection 3 which provides access to the Internet for the main inter-layer processing so that it may connect to Web pages to collect information on behalf of the user. One example of such a connector is a link to an existing TCP/IP protocol socket provided by the operating system or networking software and providing onward connection to the network. Other services may be provided, either internally or externally to the inter-layer system. Connection to the network is provided by the general access connection 4. Alternatively, some services may require use of an alternative network 2 and an alternative network connection 7 may be provided for this purpose. An example of a connection to an alternative network is a link to a pager or mobile telephone network to send a text message where a direct link may be required or connection may need to be established via a dial-in line.
The main inter-layer processing 9 presents the user with a graphical user interface that provides the facilities and appearance of a standard browser, with the user's own browser controls being hidden from view, as described with reference to Figure 1.
The functionality of the sequence builder 13, the ILP script processing, scheduling 10, event trigger generation 12, report generation, user preferences 16, user data 17 and caching 18 shown in Figure 2 is identical to that described with reference to Figure 1, and so further description has been omitted.
The invention provides a method for automating the extraction of information from Web pages. An example of a Website to which the invention may be applied is one containing information about patent publications.
The user may be required to subscribe in order to gain access to a service or to obtain the appropriate apparatus to provide the service locally.
The inter-layer system may be accessed by the user as a Web page in the normal way. In order to be able to intercept the data traffic flowing between the user and the target Website the inter-layer system will instruct the user's Internet browser software to open a new blank window displaying no controls. The system will then provide a Web page that provides the appearance and features of a standard Web browser with the addition of various extra features.
In order to access the target Website the user enters all control and addressing information into the replacement inter-layer browser window instead of directly through the normal browser. The inter-layer system provides access for the user by relaying the information between the user and the target Website and is therefore able to intercept and record all the transactions. The inter-layer processor is able to intelligently adapt the user interface to the Website so that the user is presented with simplified options. For example, the inter-layer processor could present the user with a "select all" option when ordinarily the user would have to manually populate a number of fields.
When the user has successfully completed accessing the target Website, the inter-layer system may be instructed to save the access sequence and data. This sequence can then be written as a script and be allocated a name or identifier that can be used to recall it. The user has the ability to access the sequences that are stored within his own profile. The sequence builder can now be called upon to display the sequence in a way that allows the entries to be modified. The script can be edited directly by those with an understanding of its function or the inexperienced user may be provided with a Graphical User Interface that presents the same information in terms of check boxes and pull-down menus and allows for simple selection from the options. In this example, a sequence designed to cause a search for certain data such as patents in the field of semiconductor technology owned by a particular company may be modified to search for other similar patents owned by another company. The sequence builder may also provide various other more complex facilities to be incorporated into sequences. For example, fields may be provided with features that allow the entry to be selected from a list or modified at the time of executing the sequence. Sequences thus modified may then be saved with alternative names or identifiers so that they may be recalled later. Additionally, the scheduler may be used to introduce various controls linked to times and dates. It is therefore possible to execute sequences repeatedly from time to time and to vary the content of the sequences based on this schedule.
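By way of example, the selection of named sequences that are due to be executed at a given time might be sketched as follows. The function name due_sequences, the use of a fixed repeat interval per sequence, and the sequence names and dates are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def due_sequences(
    intervals: Dict[str, timedelta],
    last_run: Dict[str, Optional[datetime]],
    now: datetime,
) -> List[str]:
    """Return the names of stored sequences whose repeat interval has elapsed.

    A sequence that has never been run (no entry, or None) is always due.
    """
    due = []
    for name, interval in intervals.items():
        previous = last_run.get(name)
        if previous is None or now - previous >= interval:
            due.append(name)
    return due


# Two saved variants of one recorded sequence, with hypothetical names.
intervals = {
    "patents-company-a": timedelta(days=7),
    "patents-company-b": timedelta(days=1),
}
last_run = {
    "patents-company-a": datetime(2000, 1, 1, 9, 0),
    "patents-company-b": datetime(2000, 1, 1, 9, 0),
}
print(due_sequences(intervals, last_run, now=datetime(2000, 1, 3, 9, 0)))
```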
Data extracted by these processes can be recorded by the inter-layer system and processed later. It is therefore possible to detect changes in the data obtained and produce further information for the user based on the results of this processing.
It is consequently possible to trigger further events from this processed data, as previously described.
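The detection of changes between a stored extraction and a later one might, in its simplest form, be sketched as a comparison of two lists of items. The function name compare and the sample item labels are hypothetical.

```python
from typing import List, Tuple


def compare(previous: List[str], current: List[str]) -> Tuple[List[str], List[str]]:
    """Return (added, removed) items between two extraction runs, keeping order."""
    before, after = set(previous), set(current)
    added = [item for item in current if item not in before]
    removed = [item for item in previous if item not in after]
    return added, removed


stored = ["item-1", "item-2"]            # data recorded on an earlier run
latest = ["item-2", "item-3", "item-4"]  # data extracted on the latest run
added, removed = compare(stored, latest)
print(added)
print(removed)
```

Only the added items would need to be reported to the user, and a non-empty result could serve as the condition for triggering a further event.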
The information extracted from the target Web pages along with information available as a result of further processing may be filtered and formatted for the user. Such reports may be viewed by the user as a Web page or e-mailed to the user from time to time according to the preferences set.
It is also possible to put together commercial services, either across the Web or across a company's Intranet, in which an inter-layer system visits many sites, logs in as a user and automatically extracts comparative data. For example this would be useful for a buying department where the inter-layer summary gives an ordered list of the prices of identical parts from a number of possible suppliers and their current stock quantity.
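The ordered summary for a buying department might be sketched as below. The Quote structure, the tie-breaking on stock quantity and the sample figures are assumptions made for illustration and are not data from any real supplier.

```python
from typing import List, NamedTuple


class Quote(NamedTuple):
    supplier: str
    price: float  # price of the part as extracted from the supplier's site
    stock: int    # current stock quantity as extracted


def ordered_summary(quotes: List[Quote]) -> List[str]:
    """List quotes for one part, cheapest first; larger stock wins a price tie."""
    ranked = sorted(quotes, key=lambda q: (q.price, -q.stock))
    return [f"{q.supplier}: {q.price:.2f} ({q.stock} in stock)" for q in ranked]


# Illustrative values only.
quotes = [
    Quote("Supplier A", 4.20, 150),
    Quote("Supplier B", 3.95, 0),
    Quote("Supplier C", 3.95, 40),
]
for line in ordered_summary(quotes):
    print(line)
```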
A further application might be setting up an inter-layer system to access a user's bank account using a log-in procedure and then downloading accounts for browsing off-line. The inter-layer system could also perform periodic log-ins and issue a warning if the balance drops below a certain threshold.
The user will gain the following benefits from the system:
i) repeated identical access to the same target sites will require reduced manual entry of data;
ii) similar accesses to target Websites will require only changed parameters to be entered;
iii) repeated accesses to the same Website to submit varied information will require only the changed parameters to be entered for each access;
iv) repeated accesses to the target Website can be accomplished on an automated basis;
v) target Websites can be monitored for any change of data;
vi) target Websites can be monitored for specific changes in data; and
vii) information extracted from target Websites can be automatically summarised and presented to the user.