WO2002029626A2 - Automatic data extraction - Google Patents

Automatic data extraction

Publication number
WO2002029626A2
Authority
WO
WIPO (PCT)
Prior art keywords
user
information source
line information
script
sequence
Application number
PCT/GB2001/004339
Other languages
French (fr)
Other versions
WO2002029626A3 (en
Inventor
Jane Lesley Aldridge
Philip Michael Gaffney
Mark Geoffrey Harrison
Original Assignee
Internet-Extra Ltd.
Application filed by Internet-Extra Ltd.
Priority to AU2001292031A (AU2001292031A1)
Publication of WO2002029626A2
Publication of WO2002029626A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/166: Editing, e.g. inserting or deleting
    • G06F40/174: Form filling; Merging

Definitions

  • the present invention relates to a method and apparatus for automating the extraction of information from on-line information sources.
  • databases can be set up so as to be accessible over the World Wide Web or other communications networks.
  • it is common for a user to be required to enter a certain amount of information in order to gain access to the data in such an on-line information source.
  • This information is typically subscriber details, such as a password, and/or search criteria.
  • When the user later desires to repeat access to the Website to extract further data, it is usual for the user to be required to again navigate the Website and enter a certain amount of information to gain access to the further data. The majority of the navigation steps and information being entered will be similar or identical to that entered previously.
  • a method of extracting data from an on-line information source which requires a sequence of inputs from a user terminal before data can be accessed comprises the steps of: providing a user with an emulated user interface to the on-line information source; monitoring a sequence of user inputs whilst the user accesses the on-line information source via the emulated user interface; building an on-line information source access script from the sequence of inputs; and, storing the on-line information source access script for subsequent automatic access to the on-line information source.
  • the stored script can be accessed and edited by the user.
  • the stored script may be run in response to a command from a user or in response to a schedule.
  • the script may contain subscriber details, search terms and general navigation data.
  • the method further includes the step of providing the user with extracted data from the on-line information source through the emulated user interface.
  • the method may further include the step of storing the extracted data so that it can be compared with subsequently extracted data and only new data provided to the user or reports of new data may be generated.
  • the extracted data may also be filtered according to predefined user preferences.
  • the on-line information source may, for example, be a Website or an on-line database.
  • the emulated user interface is a substitute Web browser window. More preferably the emulated user interface determines input options at the on-line information source and presents modified input options to the user.
  • a data extraction apparatus for use with a user terminal and an on-line information source which requires a sequence of inputs before data can be accessed, comprises: means for providing a user with an emulated user interface to the on-line information source; means for monitoring a sequence of user inputs whilst the user accesses the on-line information source via the emulated user interface; means for building an on-line information source access script from the sequence of inputs; and, means for storing the on-line information source access script for subsequent automatic access to the on-line information source.
  • the apparatus takes the form of computer executable code.
  • the computer executable code may be run on a user terminal or on a remote server.
  • the computer executable code may be distributed and run on a number of connected machines.
  • the stored script can be accessed and amended by a user.
  • the script may be run on the on-line information source in response to a command from a user, or the processor may further include scheduling means so that the script can be run on the on-line information source at predetermined times.
  • a computer program comprises program code adapted to perform all the steps of a method according to the present invention, when said program is run on a computer.
  • the computer program may be embodied on a computer readable medium.
  • the present invention provides an apparatus and a method for automating the data extraction process so that certain information may be presented automatically, to obviate the need for it to be entered repeatedly.
  • the system enables the user to enter this information in such a way that the sequence is learned so that it may be used again in whole or in part to make further automated accesses. Having first learned the sequence the system then allows the user to vary the sequence in a defined manner to gain access to further similar information.
  • Figure 1 is a block diagram of apparatus operating as a server in accordance with the present invention to provide the inter-layer service to the user;
  • Figure 2 is a block diagram of apparatus operating in accordance with the present invention to provide the inter-layer service to the user.
  • FIG. 1 is a block diagram of an intermediate server or "inter-layer" system to provide the functions of an Internet Service Provider (ISP) server operating in accordance with the present invention.
  • the block diagram illustrates functionality that may be embodied as software or alternatively attached as hardware or a combination of the two.
  • the ISP server is represented as a single machine but may consist of a number of machines providing similar functionality or be an integral part of another machine or system that also performs other functions.
  • the ISP server has similar functionality to a standard ISP server with the addition of other specific functions to enable it to operate in accordance with the present invention.
  • the ISP server has standard communications connections linking it to the network 1.
  • suitable networks are the Internet or an Intranet.
  • One such connection is illustrated as the target site access connection 3 which provides access to the network for the main inter-layer processor 9 so that it may connect to Web pages to collect information on behalf of the user.
  • Another such connection is illustrated as the user access connection 6, which provides user access to the inter-layer system including the main inter-layer processor 9 and also the controlling inter-layer Website.
  • An alternative access connection 8 may also be provided to allow for access to the system from an alternative network 2.
  • One example of such alternative access would be a link established by a user over a dial-in connection, such as a modem over a telephone line.
  • the e-mail server 5 provides the normal e-mail facilities to the inter-layer site. In addition to providing standard e-mail facilities, the server also allows the automatic inter-layer processing to filter recovered data, generate reports and then e-mail these to the user as determined by the user preferences.
  • connection to the network is provided by the general access connection 4.
  • some services may require use of an alternative network 2 and an alternative network connection 7 may be provided for this purpose.
  • There may be several alternative network connections provided. The type of connection may vary depending on the specific type of network. Examples of alternative network connections are a digital adapter to connect to a dedicated leased line or a virtual private network adapter allowing a TCP/IP connection to be established over the Internet.
  • the main inter-layer processing 9 presents the user with a graphical user interface that provides the facilities and appearance of a standard browser, with the user's own browser controls being hidden from view. Thus accesses to Web pages made by the user are accomplished only indirectly through the inter-layer processor.
  • the inter-layer processor is therefore able to monitor and record transactions and data in either direction on behalf of the user and save these in the user sequences 19 area. These sequences can then be recalled later and executed again.
  • the sequence builder 13 can access these stored sequences and present them to the user so that they may be edited or enhanced to include complex substitution of access data in order to enable the system to retrieve alternative information on behalf of the user.
  • ILP: inter-layer processor
  • the script is a generic template describing the logical path which a user navigates through a Website to retrieve the desired information. It goes beyond a simple hyperlink as it allows the user to access an item which may be buried behind authentication/login pages, tables of contents, menus, indices and is particularly useful where no direct hyperlink is provided to the data.
  • the script describes in concise pseudo-English a navigation pathway through the Website, in terms of following hyperlinks which match the criteria, submitting form data and allows for restricting the scope within a page where the hyperlinks should be detected, i.e. it is a readable description of the procedure followed when browsing.
  • the script is generated automatically by the inter-layer processor by tracking how the user navigates sequentially through the Website and is presented in user readable form to the user so that it can be amended/corrected once the user indicates that the target data has been reached (e.g. by clicking on a 'stop recording script' icon).
  • the user is prompted to resolve any ambiguities which the inter-layer processor cannot resolve given knowledge of the user variables.
  • the user is invited to answer using one of a number of pre-formed answers. For example, given the question "Why did you click this link and not that one?" the user will use answers such as "because it contains x" or "because it follows y".
  • the script is then sufficiently generic to find all items of data from the Website that are accessed by the same navigation procedure but with different values of the user variables (e.g. different page numbers) and consequently a different choice of which navigation links are followed and which form data is submitted at each step of the navigation sequence.
  • the user variables are search criteria which specify the data the user is seeking. For example, to locate an article in an electronic journal, a user might want to specify the volume, issue number, page number and publication year and may therefore store these values as the variables $vol, $num, $page, $year.
  • Further automation of this process on behalf of the user is accomplished by scheduling 10, which enables the stored sequences to be executed on an automated and repeated basis in order to check for changes in the data recovered over a period of time.
  • Event trigger generation 12 can check the retrieved data against a set of predetermined rules on behalf of the user and cause further activities to be undertaken on the basis of these findings.
  • the information extracted from Web pages on behalf of the user can be further filtered, formatted or processed and then presented in a pre-defined format by the report generation 11.
  • the report generation 11 provides the ability to filter and select the extracted data according to the preferences set by the user and to generate reports based on these.
  • Reports may be generated from data extracted directly from Web pages or data derived by further processing. Reports may be presented in textual or various graphical layouts or both and may be formatted and encoded in various forms for transmission and viewing by the user. Examples of the various forms of the reports generated would include HTML Web pages, E-mail messages, graphical files, text messages and WML pages for viewing using a WAP enabled mobile terminal amongst others.
  • the user may choose to view these reports directly and have the availability of the report signalled by various means or have the contents of the report forwarded directly to a remote location.
  • One example is the extraction of time-variable data such as a stock market index at regular intervals, and the tabulation and graphing of these values for storage, viewing or onward transmission.
  • the server has provision for storing various information in order to enable this process.
  • User preferences 16 are determined by the user, initially at registration then modified as a result of usage of the system or manually updated by accessing the inter-layer control Website.
  • User data 17 is data retained by the inter-layer server on behalf of the user and may comprise various data required by the server to facilitate inter-layer processing, ILP script processing and subsequent reporting of results.
  • Caching 18 is employed by the server to provide temporary copies of information from both the target Web pages and the inter-layer system so that it may be displayed more quickly to the user.
  • the user access to the system is governed by the subscriber account management 15 with reference to the subscription data 20 and user authentication 21 information retained by the server. This ensures that only registered users have access to the facilities provided.
  • Figure 2 is a block diagram of apparatus to provide the functions of an inter-layer client operating in accordance with the present invention.
  • the block diagram illustrates functionality that may be embodied within software or alternatively attached as hardware or a combination of the two.
  • the inter-layer client provides the user with the functionality of the inter-layer system and may be utilised at any location provided that a connection can be established with the network from time to time.
  • the inter-layer client described may exist as a single entity or various functionality may be distributed to different parts of the system or be an integral part of another machine or system that also performs other functions.
  • the inter-layer client may operate in isolation or may operate in conjunction or be integrated with other apparatus.
  • One example is operation in conjunction with, or integrated as part of, appropriate software functioning as a browser.
  • the client based system allows closer coupling with the rest of the PC functionality but the server based system allows much greater power and functionality due to the power and connectivity of the server.
  • the server also allows more frequent searches as the user does not need to have his machine connected to the network.
  • the client has standard communications connections linking it to the network 1.
  • One such connection is illustrated as the target site access connection 3 which provides access to the Internet for the main inter-layer processing so that it may connect to Web pages to collect information on behalf of the user.
  • One example of such a connector is a link to an existing TCP/IP protocol socket provided by the operating system or networking software and providing onward connection to the network.
  • Other services may be provided, either internally or externally to the inter-layer system. Connection to the network is provided by the general access connection 4. Alternatively, some services may require use of an alternative network 2 and an alternative network connection 7 may be provided for this purpose.
  • An example of a connection to an alternative network is a link to a pager or mobile telephone network to send a text message where a direct link may be required or connection may need to be established via a dial-in line.
  • the main inter-layer processing 9 presents the user with a graphical user interface that provides the facilities and appearance of a standard browser, with the user's own browser controls being hidden from view, as described with reference to Figure 1.
  • The functionality of the sequence builder 13, the ILP script processing, scheduling 10, event trigger generation 12, report generation, user preferences 16, user data 17 and caching 18 shown in Figure 2 is identical to that described with reference to Figure 1, and so further description has been omitted.
  • the invention provides a method for automating the extraction of information from Web pages.
  • An example of a Web page to which the invention may be applied is a Website containing information about patent publications.
  • the user may be required to subscribe in order to gain access to a service or to obtain the appropriate apparatus to provide the service locally.
  • the inter-layer system may be accessed by the user as a Web page in the normal way.
  • the inter-layer system will instruct the user's Internet browser software to open a new blank window displaying no controls.
  • the system will then provide a Web page that provides the appearance and features of a standard Web browser with the addition of various extra features.
  • the inter-layer system provides access for the user by relaying the information between the user and the target Website and is therefore able to intercept and record all the transactions.
  • the inter-layer processor is able to intelligently adapt the user interface to the Website so that the user is presented with simplified options. For example, the inter-layer processor could present the user with a "select all" option when ordinarily the user would have to manually populate a number of fields.
  • the inter-layer system may be instructed to save the access sequence and data.
  • This sequence can then be written as a script and allocated a name or identifier that can be used to recall it.
  • the user has the ability to access the sequences that are stored within his own profile.
  • the sequence builder can now be called upon to display the sequence in a way that allows the entries to be modified.
  • the script can be edited directly by those with an understanding of its function or the inexperienced user may be provided with a Graphical User Interface that presents the same information in terms of check boxes and pull-down menus and allows for simple selection from the options.
  • a sequence designed to cause a search for certain data such as patents in the field of semiconductor technology owned by a particular company may be modified to search for other similar patents owned by another company.
  • the sequence builder may also provide various other more complex facilities to be incorporated into sequences. For example, fields may be provided with features that allow the entry to be selected from a list or modified at the time of executing the sequence. Sequences thus modified may then be saved with alternative names or identifiers so that they may be recalled later. Additionally, the scheduler may be used to introduce various controls linked to times and dates. It is therefore possible to execute sequences repeatedly from time to time and to vary the content of the sequences based on this schedule.
  • Data extracted by these processes can be recorded by the inter-layer system and processed later. It is therefore possible to detect changes in the data obtained and produce further information for the user based on the results of this processing.
  • the information extracted from the target Web pages along with information available as a result of further processing may be filtered and formatted for the user.
  • Such reports may be viewed by the user as a Web page or e-mailed to the user from time to time according to the preferences set.
  • a further application might be setting up an inter-layer system to access a user's bank account using a log-in procedure and then downloading accounts for browsing off-line.
  • the inter-layer system could also perform periodic log-ins and issue a warning if the balance drops below a certain threshold.
  • the user will gain the following benefits from the system: i) repeated identical access to the same target sites will require reduced manual entry of data; ii) similar accesses to target Websites will require only changed parameters to be entered; iii) repeated accesses to the same Website to submit varied information will require only the changed parameters to be entered for each access; iv) repeated accesses to the target Website can be accomplished on an automated basis; v) target Websites can be monitored for any change of data; vi) target Websites can be monitored for specific changes in data; and vii) information extracted from target Websites can be automatically summarised and presented to the user.

Abstract

The present invention provides an apparatus and a method for automating a data extraction process from an 'on-line' database so that certain information may be presented automatically, to obviate the need for it to be entered repeatedly. In addition the system enables the user to enter this information in such a way that the sequence is learned so that it may be used again in whole or in part to make further automated accesses. The method of the invention comprises the steps of providing a user with an emulated user interface of an on-line information source, monitoring a sequence of user inputs whilst the user accesses the on-line information source via the emulated user interface, building an on-line information source access script from the sequence of inputs and storing the on-line information source access script for subsequent automatic access to the on-line information source.

Description

AUTOMATIC DATA EXTRACTION
Field of the Invention
The present invention relates to a method and apparatus for automating the extraction of information from on-line information sources.
Background to the Invention
It is well known that databases can be set up so as to be accessible over the World Wide Web or other communications networks. However, it is common for a user to be required to enter a certain amount of information in order to gain access to the data in such an on-line information source. This information is typically subscriber details, such as a password, and/or search criteria. When the user later desires to repeat access to the Website to extract further data, it is usual for the user to be required to again navigate the Website and enter a certain amount of information to gain access to the further data. The majority of the navigation steps and information being entered will be similar or identical to that entered previously.
Summary of the Invention
According to the present invention, a method of extracting data from an on-line information source which requires a sequence of inputs from a user terminal before data can be accessed, comprises the steps of: providing a user with an emulated user interface to the on-line information source; monitoring a sequence of user inputs whilst the user accesses the on-line information source via the emulated user interface; building an on-line information source access script from the sequence of inputs; and, storing the on-line information source access script for subsequent automatic access to the on-line information source. Preferably, the stored script can be accessed and edited by the user. The stored script may be run in response to a command from a user or in response to a schedule. The script may contain subscriber details, search terms and general navigation data. Preferably, the method further includes the step of providing the user with extracted data from the on-line information source through the emulated user interface. The method may further include the step of storing the extracted data so that it can be compared with subsequently extracted data and only new data provided to the user or reports of new data may be generated. The extracted data may also be filtered according to predefined user preferences.
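The comparison of stored data with subsequently extracted data might be sketched as follows. This is a minimal illustration only, not the method of the specification: the function name new_items and the sample record labels are hypothetical, and extracted data is assumed to be a simple list of hashable items.

```python
def new_items(previous, current):
    """Return the items of `current` that do not appear in `previous`.

    Order is preserved and repeated items in `current` are reported once.
    """
    seen = set(previous)
    fresh = []
    for item in current:
        if item not in seen:
            fresh.append(item)
            seen.add(item)
    return fresh


# Hypothetical data: results stored from an earlier run and from the latest run.
stored = ["record-A", "record-B"]
latest = ["record-B", "record-C", "record-C"]

report = new_items(stored, latest)
print(report)
```

Only the items returned by new_items would then need to be provided to the user or included in a report of new data.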
The on-line information source may, for example, be a Website or an on-line database. Preferably, the emulated user interface is a substitute Web browser window. More preferably the emulated user interface determines input options at the on-line information source and presents modified input options to the user.
According to the present invention a data extraction apparatus for use with a user terminal and an on-line information source which requires a sequence of inputs before data can be accessed, comprises: means for providing a user with an emulated user interface to the on-line information source; means for monitoring a sequence of user inputs whilst the user accesses the on-line information source via the emulated user interface; means for building an on-line information source access script from the sequence of inputs; and, means for storing the on-line information source access script for subsequent automatic access to the on-line information source.
Preferably, the apparatus takes the form of computer executable code. The computer executable code may be run on a user terminal or on a remote server. Alternatively, the computer executable code may be distributed and run on a number of connected machines.
Preferably, the stored script can be accessed and amended by a user. The script may be run on the on-line information source in response to a command from a user, or the processor may further include scheduling means so that the script can be run on the on-line information source at predetermined times. According to the present invention a computer program comprises program code adapted to perform all the steps of a method according to the present invention, when said program is run on a computer. The computer program may be embodied on a computer readable medium.
The present invention provides an apparatus and a method for automating the data extraction process so that certain information may be presented automatically, to obviate the need for it to be entered repeatedly.
In addition the system enables the user to enter this information in such a way that the sequence is learned so that it may be used again in whole or in part to make further automated accesses. Having first learned the sequence the system then allows the user to vary the sequence in a defined manner to gain access to further similar information.
Having automated the data extraction process it is then possible to repeat the process without the need for the user to be involved and to collect the data to a predetermined schedule. It is then further possible to automatically check this data for predetermined changes in order to generate further reports or alarms.
Brief Description of the Drawings
Examples of the present invention will now be described with reference to the accompanying drawings, in which:-
Figure 1 is a block diagram of apparatus operating as a server in accordance with the present invention to provide the inter-layer service to the user; and
Figure 2 is a block diagram of apparatus operating in accordance with the present invention to provide the inter-layer service to the user.
Detailed Description
Figure 1 is a block diagram of an intermediate server or "inter-layer" system to provide the functions of an Internet Service Provider (ISP) server operating in accordance with the present invention. The block diagram illustrates functionality that may be embodied as software or alternatively attached as hardware or a combination of the two.
The ISP server is represented as a single machine but may consist of a number of machines providing similar functionality or be an integral part of another machine or system that also performs other functions. The ISP server has similar functionality to a standard ISP server with the addition of other specific functions to enable it to operate in accordance with the present invention.
The ISP server has standard communications connections linking it to the network 1. Examples of suitable networks are the Internet or an Intranet. One such connection is illustrated as the target site access connection 3 which provides access to the network for the main inter-layer processor 9 so that it may connect to Web pages to collect information on behalf of the user. Another such connection is illustrated as the user access connection 6, which provides user access to the inter-layer system including the main inter-layer processor 9 and also the controlling inter-layer Website. An alternative access connection 8 may also be provided to allow for access to the system from an alternative network 2. One example of such alternative access would be a link established by a user over a dial-in connection, such as a modem over a telephone line.
The e-mail server 5 provides the normal e-mail facilities to the inter-layer site. In addition to providing standard e-mail facilities, the server also allows the automatic inter-layer processing to filter recovered data, generate reports and then e-mail these to the user as determined by the user preferences.
Other services may be provided, either internally or externally to the inter-layer server. Connection to the network is provided by the general access connection 4. Alternatively, some services may require use of an alternative network 2 and an alternative network connection 7 may be provided for this purpose. There may be several alternative network connections provided. The type of connection may vary depending on the specific type of network. Examples of alternative network connections are a digital adapter to connect to a dedicated leased line or a virtual private network adapter allowing a TCP/IP connection to be established over the Internet.
The main inter-layer processing 9 presents the user with a graphical user interface that provides the facilities and appearance of a standard browser, with the user's own browser controls being hidden from view. Thus accesses to Web pages made by the user are accomplished only indirectly through the inter-layer processor. The inter-layer processor is therefore able to monitor and record transactions and data in either direction on behalf of the user and save these in the user sequences 19 area. These sequences can then be recalled later and executed again.
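The recording of transactions into a stored sequence might be sketched as follows. This is an illustrative sketch under stated assumptions: the names Step and SequenceRecorder, the action labels "submit_form" and "follow_link", and the in-memory dictionary standing in for the user sequences 19 area are all hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str                       # hypothetical labels, e.g. "follow_link"
    target: str                       # link text or form address
    data: dict = field(default_factory=dict)  # form fields submitted, if any


class SequenceRecorder:
    """Records user transactions relayed through an intermediary."""

    def __init__(self):
        self.sequences = {}           # stands in for the user sequences area
        self._current = []            # steps recorded since the last save

    def record(self, action, target, data=None):
        self._current.append(Step(action, target, dict(data or {})))

    def save(self, name):
        """Store the recorded steps under a name so they can be recalled."""
        self.sequences[name] = list(self._current)
        self._current = []
        return self.sequences[name]


recorder = SequenceRecorder()
recorder.record("submit_form", "/login", {"user": "alice"})
recorder.record("follow_link", "Search")
saved = recorder.save("patent-search")
print([step.action for step in saved])
```

A saved sequence can later be looked up by its name in recorder.sequences and replayed step by step.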
The sequence builder 13 can access these stored sequences and present them to the user so that they may be edited or enhanced to include complex substitution of access data in order to enable the system to retrieve alternative information on behalf of the user.
Editing of such sequences may be further facilitated by use of an appropriate computer readable script and associated scripting language. It is therefore possible for a suitably skilled person to directly create and edit such a script to automate the process of retrieving information from Web pages. The interpretation of such inter-layer processor (ILP) scripts is achieved by the ILP script processor 14.
The script is a generic template describing the logical path which a user navigates through a Website to retrieve the desired information. It goes beyond a simple hyperlink as it allows the user to access an item which may be buried behind authentication/login pages, tables of contents, menus, indices and is particularly useful where no direct hyperlink is provided to the data.
The script describes, in concise pseudo-English, a navigation pathway through the Website in terms of following hyperlinks which match given criteria and submitting form data. It also allows the scope within a page in which hyperlinks are detected to be restricted. In other words, it is a readable description of the procedure followed when browsing.
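A single "follow the hyperlink matching a criterion" step, with an optional scope restriction, might be realised along the following lines. This is a minimal sketch using Python's standard `html.parser` module. The function name `find_link`, the `scope_after` parameter and the sample page are assumptions for illustration, and the specification does not define how the ILP script processor performs matching.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LinkCollector(HTMLParser):
    """Collects (link text, href) pairs from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def find_link(html: str, contains: str, scope_after: str = "") -> Optional[str]:
    """Return the href of the first link whose text contains `contains`.

    `scope_after` restricts detection to the part of the page following
    the given marker text (an assumed form of scope restriction).
    """
    if scope_after:
        _, found, rest = html.partition(scope_after)
        html = rest if found else ""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    for text, href in collector.links:
        if contains in text:
            return href
    return None


page = ('<h2>Archive</h2><a href="/v1">Volume 1</a>'
        '<h2>Current</h2><a href="/v2">Volume 2</a>'
        '<a href="/v1b">Volume 1 supplement</a>')
```

Calling `find_link(page, "Volume 1")` selects the first matching link on the whole page, whereas `find_link(page, "Volume 1", scope_after="Current")` considers only links that appear after the "Current" heading.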
The script is generated automatically by the inter-layer processor by tracking how the user navigates sequentially through the Website and is presented in user readable form to the user so that it can be amended/corrected once the user indicates that the target data has been reached (e.g. by clicking on a 'stop recording script' icon).
The user is prompted to resolve any ambiguities which the inter-layer processor cannot resolve given knowledge of the user variables. The user is invited to answer using one of a number of pre-formed answers. For example, given the question "Why did you click this link and not that one?", the user will use answers such as "because it contains x" or "because it follows y".
The script, with ambiguities thus removed, is then sufficiently generic to find all items of data from the Website that are accessed by the same navigation procedure but with different values of the user variables (e.g. different page numbers) and consequently a different choice of which navigation links are followed and which form data is submitted at each step of the navigation sequence. The user variables are search criteria which specify the data the user is seeking. For example, to locate an article in an electronic journal, a user might want to specify the volume, issue number, page number and publication year and may therefore store these values as the variables $vol, $num, $page, $year.
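The substitution of user variables into a generic script can be illustrated with Python's standard `string.Template`, which happens to use the same `$name` notation. The wording of the script steps below is invented for this sketch, since the specification does not reproduce the ILP script syntax; only the variable names `$vol`, `$num`, `$page` and `$year` come from the description.

```python
from string import Template
from typing import Dict, List

# Hypothetical recorded steps; the phrasing is an assumption.
script_template = [
    "follow link containing 'Volume $vol'",
    "follow link containing 'Issue $num ($year)'",
    "follow link containing 'p. $page'",
]


def instantiate(steps: List[str], variables: Dict[str, str]) -> List[str]:
    """Substitute the user variables into every step of the template."""
    return [Template(step).substitute(variables) for step in steps]


article_steps = instantiate(
    script_template,
    {"vol": "12", "num": "3", "page": "245", "year": "2000"},
)
```

`Template.substitute` raises `KeyError` when a variable is missing, so an incomplete set of search criteria is detected before any navigation is attempted.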
Further automation of this process on behalf of the user is accomplished by scheduling 10 which enables the stored sequences to be executed on an automated and repeated basis in order to check for changes in the data recovered over a period of time. Event trigger generation 12 can check the retrieved data against a set of predetermined rules on behalf of the user and cause further activities to be undertaken on the basis of these findings.
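The checking of retrieved data against predetermined rules may be pictured as a list of named conditions evaluated after each scheduled run. The following is a minimal sketch; the `Rule` structure, the rule names and the threshold values are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A named condition evaluated against the retrieved data."""
    name: str
    condition: Callable[[Dict[str, float]], bool]


def fired_rules(data: Dict[str, float], rules: List[Rule]) -> List[str]:
    """Return the names of all rules whose condition holds for `data`."""
    return [rule.name for rule in rules if rule.condition(data)]


# Hypothetical rules set by a user monitoring a stock market index.
rules = [
    Rule("index_below_6000", lambda d: d["index"] < 6000),
    Rule("index_above_6500", lambda d: d["index"] > 6500),
]
```

Each name returned by `fired_rules` could then be mapped to a further activity, such as sending a message or generating a report.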
The information extracted from Web pages on behalf of the user can be further filtered, formatted or processed and then presented in a pre-defined format by the report generation 11. The report generation 11 provides the ability to filter and select the extracted data according to the preferences set by the user and to generate reports based on these. Reports may be generated from data extracted directly from Web pages or data derived by further processing. Reports may be presented in textual or various graphical layouts or both and may be formatted and encoded in various forms for transmission and viewing by the user. Examples of the various forms of the reports generated would include HTML Web pages, E-mail messages, graphical files, text messages and WML pages for viewing using a WAP enabled mobile terminal amongst others. The user may choose to view these reports directly and have the availability of the report signalled by various means or have the contents of the report forwarded directly to a remote location. One example is the extraction of time-variable data such as a stock market index at regular intervals, and the tabulation and graphing of these values for storage, viewing or onward transmission.
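The tabulation of time-variable data mentioned above can be sketched as a small formatting routine. The function name `tabulate`, the column layout and the sample values are assumptions for illustration. An actual report generator might equally emit HTML, WML or a graphical file.

```python
from typing import List, Tuple


def tabulate(samples: List[Tuple[str, float]]) -> str:
    """Format (timestamp, value) samples as a plain-text table.

    The third column shows the change from the previous sample.
    """
    lines = [f"{'Time':<16} {'Value':>10} {'Change':>8}"]
    previous = None
    for timestamp, value in samples:
        change = "" if previous is None else f"{value - previous:+.2f}"
        lines.append(f"{timestamp:<16} {value:>10.2f} {change:>8}")
        previous = value
    return "\n".join(lines)


# Invented sample values for a stock market index.
samples = [
    ("2001-09-28 09:00", 6100.00),
    ("2001-09-28 10:00", 6120.50),
    ("2001-09-28 11:00", 6095.25),
]
report = tabulate(samples)
```

The resulting text could be stored, displayed, or forwarded as the body of an e-mail or text message.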
The server has provision for storing various information in order to enable this process. User preferences 16 are determined by the user, initially at registration and then modified as a result of usage of the system or manually updated by accessing the inter-layer control Website. User data 17 is data retained by the inter-layer server on behalf of the user and may comprise various data required by the server to facilitate inter-layer processing, ILP script processing and subsequent reporting of results. Caching 18 is employed by the server to provide temporary copies of information from both the target Web pages and the inter-layer system so that it may be displayed more quickly to the user.
The user access to the system is governed by the subscriber account management 15 with reference to the subscription data 20 and user authentication 21 information retained by the server. This ensures that only registered users have access to the facilities provided.
Figure 2 is a block diagram of apparatus to provide the functions of an inter-layer client operating in accordance with the present invention. The block diagram illustrates functionality that may be embodied within software or alternatively attached as hardware or a combination of the two.
The inter-layer client provides the user with the functionality of the inter-layer system and may be utilised at any location provided that a connection can be established with the network from time to time.
The inter-layer client described may exist as a single entity or various functionality may be distributed to different parts of the system or be an integral part of another machine or system that also performs other functions. The inter-layer client may operate in isolation or may operate in conjunction or be integrated with other apparatus. One example is operation in conjunction with, or integrated as part of, appropriate software functioning as a browser. There is little difference between the fundamental operation of a server based inter-layer system and a client based inter-layer system. The client based system allows closer coupling with the rest of the PC functionality but the server based system allows much greater power and functionality due to the power and connectivity of the server. The server also allows more frequent searches as the user does not need to have his machine connected to the network.
The client has standard communications connections linking it to the network 1. One such connection is illustrated as the target site access connection 3 which provides access to the Internet for the main inter-layer processing so that it may connect to Web pages to collect information on behalf of the user. One example of such a connector is a link to an existing TCP/IP protocol socket provided by the operating system or networking software and providing onward connection to the network. Other services may be provided, either internally or externally to the inter-layer system. Connection to the network is provided by the general access connection 4. Alternatively, some services may require use of an alternative network 2 and an alternative network connection 7 may be provided for this purpose. An example of a connection to an alternative network is a link to a pager or mobile telephone network to send a text message where a direct link may be required or connection may need to be established via a dial-in line.
The main inter-layer processing 9 presents the user with a graphical user interface that provides the facilities and appearance of a standard browser, with the user's own browser controls being hidden from view, as described with reference to Figure 1.
The functionality of the sequence builder 13, the ILP script processing, scheduling 10, event trigger generation 12, report generation, user preferences 16, user data 17 and caching 18 shown in Figure 2 is identical to that described with reference to Figure 1, and so further description has been omitted.
The invention provides a method for automating the extraction of information from Web pages. An example of a Web page to which the invention may be applied is a Website containing information about patent publications.
The user may be required to subscribe in order to gain access to a service or to obtain the appropriate apparatus to provide the service locally.
The inter-layer system may be accessed by the user as a Web page in the normal way. In order to be able to intercept the data traffic flowing between the user and the target Website, the inter-layer system will instruct the user's Internet browser software to open a new blank window displaying no controls. The system will then provide a Web page that provides the appearance and features of a standard Web browser with the addition of various extra features.
In order to access the target Website the user enters all control and addressing information into the replacement inter-layer browser window instead of directly through the normal browser. The inter-layer system provides access for the user by relaying the information between the user and the target Website and is therefore able to intercept and record all the transactions. The inter-layer processor is able to intelligently adapt the user interface to the Website so that the user is presented with simplified options. For example, the inter-layer processor could present the user with a "select all" option when ordinarily the user would have to manually populate a number of fields.
When the user has successfully completed accessing the target Website, the inter-layer system may be instructed to save the access sequence and data. This sequence can then be written as a script and allocated a name or identifier that can be used to recall it. The user has the ability to access the sequences that are stored within his own profile. The sequence builder can now be called upon to display the sequence in a way that allows the entries to be modified. The script can be edited directly by those with an understanding of its function, or the inexperienced user may be provided with a Graphical User Interface that presents the same information in terms of check boxes and pull-down menus and allows for simple selection from the options. In this example, a sequence designed to cause a search for certain data such as patents in the field of semiconductor technology owned by a particular company may be modified to search for other similar patents owned by another company. The sequence builder may also provide various other more complex facilities to be incorporated into sequences. For example, fields may be provided with features that allow the entry to be selected from a list or modified at the time of executing the sequence. Sequences thus modified may then be saved with alternative names or identifiers so that they may be recalled later. Additionally, the scheduler may be used to introduce various controls linked to times and dates. It is therefore possible to execute sequences repeatedly from time to time and to vary the content of the sequences based on this schedule.
Data extracted by these processes can be recorded by the inter-layer system and processed later. It is therefore possible to detect changes in the data obtained and produce further information for the user based on the results of this processing.
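Detection of changes between a stored extraction and a later one can be sketched in two complementary ways: a fingerprint that shows whether a page has changed at all, and a comparison that isolates the items not seen before. The function names and sample records below are assumptions for illustration and the specification does not prescribe a particular comparison technique.

```python
import hashlib
from typing import List


def fingerprint(text: str) -> str:
    """Return a SHA-256 digest; differing digests mean the text changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def new_items(previous: List[str], current: List[str]) -> List[str]:
    """Return items of `current` absent from `previous`, in original order."""
    seen = set(previous)
    return [item for item in current if item not in seen]


# Invented example: records extracted on two successive runs.
first_run = ["Record A", "Record B"]
second_run = ["Record B", "Record C", "Record D"]
added = new_items(first_run, second_run)
```

Storing only the fingerprint is sufficient when the user merely wants to know that something changed, whereas storing the items themselves allows a report of new data alone.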
It is consequently possible to trigger further events from this processed data, as previously described.
The information extracted from the target Web pages along with information available as a result of further processing may be filtered and formatted for the user. Such reports may be viewed by the user as a Web page or e-mailed to the user from time to time according to the preferences set.
It is also possible to put together commercial services, either across the Web or across a company's Intranet, in which an inter-layer system visits many sites, logs in as a user and automatically extracts comparative data. For example, this would be useful for a buying department, where the inter-layer summary gives an ordered list of the prices of identical parts from a number of possible suppliers, together with each supplier's current stock quantity.
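The ordered price summary in the buying department example could be produced as follows, once a quote has been extracted from each supplier site. The `Quote` structure, supplier names, part numbers and prices are invented for this sketch.

```python
from typing import List, NamedTuple


class Quote(NamedTuple):
    """One extracted offer (all field names are illustrative)."""
    supplier: str
    part: str
    price: float
    stock: int


def cheapest_first(quotes: List[Quote], part: str) -> List[Quote]:
    """Return the quotes for `part`, ordered by ascending price."""
    matching = [q for q in quotes if q.part == part]
    return sorted(matching, key=lambda q: q.price)


# Hypothetical data gathered by running one script per supplier site.
quotes = [
    Quote("Supplier A", "BC547", 0.12, 500),
    Quote("Supplier B", "BC547", 0.09, 0),
    Quote("Supplier C", "BC547", 0.10, 2000),
    Quote("Supplier A", "1N4148", 0.03, 10000),
]
summary = cheapest_first(quotes, "BC547")
```

Keeping the stock quantity alongside the price lets the buyer see at a glance that the cheapest offer in this invented data set is out of stock.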
A further application might be setting up an inter-layer system to access a user's bank account using a log-in procedure and then downloading accounts for browsing off-line. The inter-layer system could also perform periodic log-ins and issue a warning if the balance drops below a certain threshold.
The user will gain the following benefits from the system: i) repeated identical access to the same target sites will require reduced manual entry of data; ii) similar accesses to target Websites will require only changed parameters to be entered; iii) repeated accesses to the same Website to submit varied information will require only the changed parameters to be entered for each access; iv) repeated accesses to the target Website can be accomplished on an automated basis; v) target Websites can be monitored for any change of data; vi) target Websites can be monitored for specific changes in data; and vii) information extracted from target Websites can be automatically summarised and presented to the user.

Claims

1. A method of extracting data from an on-line information source which requires a sequence of inputs from a user terminal before data can be accessed, comprising the steps of: providing a user with an emulated user interface of the on-line information source; monitoring a sequence of user inputs whilst the user accesses the on-line information source via the emulated user interface; building an on-line information source access script from the sequence of inputs; and, storing the on-line information source access script for subsequent automatic access to the on-line information source.
2. A method according to claim 1, wherein the stored script can be accessed and edited by the user.
3. A method according to claim 1 or claim 2, wherein the stored script is run in response to a command from a user.
4. A method according to claim 1 or claim 2, wherein the stored script is run in response to a schedule.
5. A method according to any preceding claim, further including the step of providing the user with extracted data from the on-line information source through the emulated user interface.
6. A method according to any preceding claim, further including the steps of: storing the extracted data; comparing it with subsequently extracted data; and, providing only new data to the user or generating a report indicative of new data.
7. A method according to any preceding claim, further including the step of filtering the extracted data according to predefined user preferences.
8. A method according to any preceding claim, wherein the emulated user interface is a substitute Web browser window.
9. A method according to any preceding claim, wherein the emulated user interface determines input options at the on-line information source and presents modified input options to the user.
10. An apparatus for use with a user terminal and an on-line information source which requires a sequence of inputs before data can be accessed, comprising: means for providing a user with an emulated user interface of the on-line information source; means for monitoring a sequence of user inputs whilst the user accesses the on-line information source via the emulated user interface; means for building an on-line information source access script from the sequence of inputs; and, means for storing the on-line information source access script for subsequent automatic access to the on-line information source.
11. An apparatus according to claim 10, wherein the apparatus takes the form of computer executable code.
12. An apparatus according to claim 11, wherein the computer executable code is run on a remote server.
13. An apparatus according to any one of claims 10 to 12, wherein the stored script can be accessed and amended by a user.
14. An apparatus according to any one of claims 10 to 13, further including scheduling means so that the script can be run on the on-line information source at predetermined times.
15. A computer program comprising program code adapted to perform all the steps of a method according to any one of claims 1 to 9, when said program is run on a computer.
16. A computer program according to claim 15, wherein the computer program is embodied on a computer readable medium.
PCT/GB2001/004339 2000-09-30 2001-09-28 Automatic data extraction WO2002029626A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001292031A AU2001292031A1 (en) 2000-09-30 2001-09-28 Automatic data extraction

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB0023969.9A GB0023969D0 (en) 2000-09-30 2000-09-30 Mechanism for automating the extraction of selected information from web based pages design and implementation
GB0023969.9 2000-09-30

Publications (2)

Publication Number Publication Date
WO2002029626A2 true WO2002029626A2 (en) 2002-04-11
WO2002029626A3 WO2002029626A3 (en) 2003-12-31

Family

ID=9900426

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2001/004339 WO2002029626A2 (en) 2000-09-30 2001-09-28 Automatic data extraction

Country Status (3)

Country Link
AU (1) AU2001292031A1 (en)
GB (1) GB0023969D0 (en)
WO (1) WO2002029626A2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5809250A (en) * 1996-10-23 1998-09-15 Intel Corporation Methods for creating and sharing replayable modules representive of Web browsing session
US5902352A (en) * 1995-03-06 1999-05-11 Intel Corporation Method and apparatus for task scheduling across multiple execution sessions
WO2000033202A1 (en) * 1998-12-01 2000-06-08 University Of Florida Web page accessing of data bases and mainframes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANUPAM V ET AL: "Automating Web navigation with the WebVCR" COMPUTER NETWORKS, ELSEVIER SCIENCE PUBLISHERS B.V., AMSTERDAM, NL, vol. 33, no. 1-6, June 2000 (2000-06), pages 503-517, XP004304788 ISSN: 1389-1286 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1606914A1 (en) * 2003-03-03 2005-12-21 Encentuate Pte. Ltd. Secure object for convenient identification
EP1606914A4 (en) * 2003-03-03 2008-12-31 Ibm Secure object for convenient identification

Also Published As

Publication number Publication date
AU2001292031A1 (en) 2002-04-15
GB0023969D0 (en) 2000-11-15
WO2002029626A3 (en) 2003-12-31

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PH PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP