US6732102B1 - Automated data extraction and reformatting - Google Patents

Automated data extraction and reformatting

Info

Publication number
US6732102B1
Authority
US
United States
Prior art keywords
data
elements
web site
xml
extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime, expires
Application number
US09/714,644
Inventor
Pramod Khandekar
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
InstaKnow Com Inc
Original Assignee
InstaKnow Com Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by InstaKnow Com Inc filed Critical InstaKnow Com Inc
Priority to US09/714,644 priority Critical patent/US6732102B1/en
Assigned to INSTAKNOW.COM INC. reassignment INSTAKNOW.COM INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KHANDEKAR, PRAMOD
Application granted granted Critical
Publication of US6732102B1 publication Critical patent/US6732102B1/en
Adjusted expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/80 - Information retrieval of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 - Mapping; Conversion
    • G06F16/86 - Mapping to a database

Definitions

  • the present invention relates to a method and system for automated browsing of, and data extraction from, global communication network sites such as Web sites that include HTML or XML data.
  • HTML (HyperText Markup Language)
  • the Web (World Wide Web)
  • the Internet can also be viewed as a global database.
  • a large amount of valuable business information is present on the Internet as HTML pages.
  • HTML pages are meant for human eyes, not for computers to read, which poses serious limitations on how that information can be used in an automated manner.
  • HTML Web pages are built as HTML tags within other tags, in effect forming a “tree”. Certain automated browsers interpret the hierarchy and type of tags and render a visual picture of the HTML for a user to view.
  • HTML data-capture technology currently available follows a paradigm of “design” and “run” modes.
  • In design mode, a user (e.g., a designer), through software, locates Web sites and extracts data from those sites by way of an “example”.
  • the software program saves the example data and, in the “run” mode, automatically repeats the example for new data.
  • most Web pages can, and do, change as frequently and as much as their Webmaster desires, sometimes changing the tree hierarchy completely between design time and run time. As a result, reliable extraction of data, including business data, from an HTML page becomes a challenging task.
  • OnDisplay Inc. of San Ramon, Calif. has a “CenterStage eContent” product that can access, integrate and transform data from multiple HTML pages.
  • OnDisplay's HTML data recognition algorithm works by remembering the depth and location of the required business information within the HTML “tree” between the design and run modes.
  • Neptunet Inc. provides a system and method whereby, after getting the Web data, all further processing of that data has to be programmatically specified.
  • Neptunet's HTML data recognition algorithm works by remembering the depth and location of the required business information within the HTML “tree” between the design and run modes.
  • HTML data capture mechanisms include methods whereby HTML data extraction is performed by specifying (i.e., hard coding) the exact HTML tag number of the data to be extracted using a programming language such as Visual Basic or Visual C++.
  • HTML is a very useful information presentation protocol. It allows visually pleasing formatting and colors to be set for data being presented to make it more understandable. For example, a stock price change can be shown in green color if the stock is going up and in red if the stock is going down, making the change visually and intuitively more understandable.
  • While HTML is a wonderful mechanism for human interaction, it is not ideally suited for computer-to-computer communication. Its main disadvantage for this purpose is that there is no way for the data being sent to be described in terms of “what” the data is supposed to represent. For example, the number “85” appearing on a Web stock-trading screen in the browser may be the stock price or the share quantity. The data simply gets shown in the browser, and it is the human being looking at the browser who knows what each number means because of the casual context information shown around the data. In machine-to-machine communication, however, the receiving computer lacks that context-resolution intelligence and has to be told very specifically that the number “85” is the stock price and not the share quantity.
  • XML (Extensible Markup Language)
  • XML provides a perfect solution to specify explicitly and clearly what each number reaching the receiving computer is supposed to be.
  • XML has a feature called “tags” which go with the data and describe what the data is supposed to be. For example, the stock price can be sent in an XML stream as a tag pair of the kind shown in the sketch below.
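  • A minimal sketch of such a stream and of a program reading it back, assuming an illustrative element name of StockPrice (the text does not fix a particular tag name):

```python
import xml.etree.ElementTree as ET

# Illustrative XML stream; the tag name "StockPrice" is assumed for the example.
stream = "<StockPrice>85</StockPrice>"

element = ET.fromstring(stream)
# The tag tells the receiving computer what the value represents.
print(element.tag, "=", element.text)   # prints: StockPrice = 85
```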
  • the “/” in the second tag signifies that the data description for that data element is complete. Other tag pairs may follow, describing and giving values of other data elements. This allows computer-to-computer data exchange without needing a prior agreement between the computers about how the data is formatted or sequenced. Additionally, XML is capable of showing relationships between pieces of data using a “tree” or hierarchical structure.
  • XML has its own unique problems. While useful as a data definition mechanism, XML tree structures cannot be fed to existing data manipulation mechanisms that operate on relational (tabular) data formats using well-known languages like SQL.
  • OnDisplay, Neptunet and WebMethods are companies allowing a fairly user-friendly design time specification of XML data interchange between computers, saving the specifications and reapplying them at a later point in time on new data.
  • Several companies offer point-and-click programming environments with varying capabilities. Some are used to generate source code in other programming languages, while others execute the language directly. Examples are Visual Flowcoder by FlowLynx, Software-through-pictures by Aonix, ProGraph by Pictorius, LabView by National Instruments and Sanscript by Northwoods Software. All of these methods lack the critical built-in ability to capture and use Web based (HTML/XML) real-time data.
  • One aspect of the present invention provides a computer-implemented method for automated data extraction from a Web site.
  • the method comprising: navigating to a Web site during a design phase; extracting data elements associated with the Web site and producing a visible display corresponding to the extracted data elements; selecting and storing at least one Page ID data element in the display from the data elements; selecting and storing one or more Extraction data elements in the display; selecting and storing at least one Base ID data element having an offset distance from the Extraction elements; setting a tolerance for possible deviation from the offset distance; and renavigating to the Web site during a playback phase and extracting data from the Extraction data elements if the Page ID data element is located in the Web site and if the offset distance of the Base ID data element has not changed by more than the tolerance.
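  • A compact way to picture what gets recorded for each step during the design phase; the field names below are illustrative assumptions (the patent stores these choices as design-grid columns and, later, as a surflet/schema XML file):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GridLine:
    """One row of the design grid (illustrative field names)."""
    grid_row: int            # position of the HTML element in the grid
    tag_number: int          # source index of the HTML element
    visible_text: str        # text shown for the element
    is_page_id: bool = False   # identifies the page at playback
    is_base_id: bool = False   # anchor for locating extractions
    extract: bool = False      # value should be extracted
    variable_name: str = ""    # user-chosen name for the extracted value
    fwd_tolerance: int = 0     # allowed downward drift at playback
    back_tolerance: int = 0    # allowed upward drift at playback

@dataclass
class Step:
    """All grid rows recorded for one Web page (one 'step' of a surflet)."""
    url: str
    lines: List[GridLine] = field(default_factory=list)
```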
  • the user-specific information is entered into the Web site and used in connection with producing the data to be extracted from the Extraction data elements.
  • the data elements preferably are HTML elements.
  • the visible display may comprise a grid containing rows and columns including information about each of the data elements extracted.
  • the information comprises, for each data element, fixed information of grid row number, HTML tag number and visible text, and user-selected information of Page ID, Base ID, Extract and tolerance.
  • a position of the Page ID data element within the Web site is stored and the extracting occurs during the playback phase if the Page ID data element has not changed position.
  • the Page ID data element is desirably selected as a data element that is unlikely to change position upon reformatting of the Web site, and the display contains the data desired to be extracted.
  • Another aspect of the present invention provides a computer-implemented method for automated data extraction from a Web site, comprising: navigating to a Web site during a design phase; extracting data elements associated with the Web site and producing a visible current display grid corresponding to the extracted data elements; selecting and storing at least one Page ID data element in the current display from the data elements; selecting and storing one or more Extraction data elements in the current display; selecting and storing at least one Base ID data element in the current display having an offset distance from the Extraction elements; entering a tolerance in the current display for possible deviation from the offset distance; displaying a playback display grid during a playback phase with the selected Page ID data element, the Extraction data elements, and the Base ID data element; renavigating to the Web site; extracting data elements associated with the Web site to the visible current display grid; and comparing the extracted data elements in the current display grid with the playback display grid and extracting data from the Extraction data elements if the Page ID data element is found in the current display grid and if the offset distance of the Base ID data element has not changed by more than the tolerance.
  • a further aspect of the present invention provides a computer-implemented method for automated browsing of Web sites on a global communications network and for extracting usable data, comprising: accessing at least one Web site page containing data, wherein the data comprises a hierarchy of HTML tags; transforming the hierarchy of tags into a computer-readable list; identifying a base data element from the list; identifying an offset from the base data element to the usable data; and extracting the usable data for use by a user regardless of changes to the Web site, provided that the offset between the base data element and the usable data does not change.
  • the offset is identified during a design phase and saved for use in a run time phase, which extracts the usable data.
  • Another aspect of the present invention provides a computer-implemented method for automated browsing Web sites and for extracting usable data, comprising: filling a current display grid with rows of HTML data elements from at least one Web site page currently selected by a Web browser; displaying in a playback display grid previously-stored HTML data elements; examining the rows of the playback grid to locate an HTML data element previously selected as a Page ID data element; comparing the rows of the current grid to locate an HTML element that matches the Page ID data element; examining the rows of the playback grid to locate HTML data elements previously selected as Extraction data elements and a Base ID data element used as a reference for locating the Extraction data elements; comparing the rows of the current grid to locate HTML elements that match the Extraction data elements and match the Base ID data element; and extracting data from the Extraction data elements regardless of changes to the Web site, provided that the Page ID elements match and any offset between the Base ID elements is within a predetermined tolerance.
  • a still further aspect of the present invention provides a computer-based system for automatically browsing Web sites, comprising a client computer and a server computer for receiving requests from the client computers over a network connecting the client and server computers, the client computer running an application to: navigate to a Web site during a design phase; extract data elements associated with the Web site and produce a visible display corresponding to the extracted data elements; select and store at least one Page ID data element in the display from the data elements; select and store one or more Extraction data elements in the display; select and store at least one Base ID data element having an offset distance from the Extraction elements; set a tolerance for possible deviation from the offset distance; and renavigate to the Web site during a playback phase and extract data from the Extraction data elements if the Page ID data element is located in the Web site and if the offset distance of the Base ID data element has not changed by more than the tolerance.
  • Another aspect of the instant invention provides a computer-implemented method for automated XML data extraction, comprising: identifying selections of XML data elements for extraction from a source of XML data comprising XML data stored in XML format; storing information related to the identified selections of XML data elements for subsequent use; acquiring the source of XML data and retrieving the XML data elements; comparing the retrieved XML data elements to the identified selections and extracting only the data from the XML data elements that correspond to the identified selections; and reformatting the extracted XML data into a relational format.
  • the source of XML data can be a Web site or a file.
  • the extracted data may be saved into a relational data table, and the reformatted extracted XML data is passed to a calling application.
  • a further aspect of the instant invention provides a computer-implemented method for automated XML data extraction, comprising: navigating to a Web site containing XML data; identifying selections of XML data elements for extraction from the Web site, the XML data comprising data elements containing the data stored in XML format; storing information related to the identified selections of XML data elements for subsequent use; re-navigating to the Web site and retrieving the XML data elements; comparing the retrieved XML data elements to the identified selections and extracting only the data from the XML data elements that correspond to the identified selections; and reformatting the extracted XML data into a relational format.
  • the extracted data is desirably saved into a relational data table.
  • a yet further aspect of the present invention provides a computer-implemented method for automated XML data extraction, comprising: navigating a client computer to a Web site containing XML data; generating a graphical tree structure on the client computer to display XML nodes and subnodes representing the XML data at the Web site; selecting one or more of the nodes and/or subnodes from the tree structure associated with the data to be extracted; storing information related to the selected nodes and/or subnodes; renavigating the client computer to the Web site and retrieving the XML data using the information; comparing the retrieved XML data with the selected nodes and/or subnodes and extracting only the data corresponding to the selected nodes and/or subnodes; and reformatting the extracted XML data into a relational format.
  • selecting one subnode under a parent node automatically selects all subnodes under the parent node.
  • Another aspect provides a computer readable medium storing a set of instructions for controlling a computer to automatically extract desired XML data from a source of XML data, the medium comprising a set of instructions for causing the computer to: identify selections of XML data elements for extraction from a source of XML data comprising XML data stored in XML format; store information related to the identified selections of XML data elements for subsequent use; acquire the source of XML data and retrieve the XML data elements; compare the retrieved XML data elements to the identified selections and extract only the data from the XML data elements that correspond to the identified selections; and reformat the extracted XML data into a relational format.
  • a still further aspect provides a computer-based system for automated XML data extraction, comprising a client computer and server computer for receiving requests from the client computer over a network connecting the client and server computers, the client computer running an application to: identify selections of XML data elements for extraction from a source of XML data contained at the server computer and comprising XML data stored in XML format; store information related to the identified selections of XML data elements for subsequent use; acquire the source of XML data and retrieve the XML data elements; compare the retrieved XML data elements to the identified selections and extract only the data from the XML data elements that correspond to the identified selections; and reformat the extracted XML data into a relational format.
  • FIG. 1A is a depiction of the program user interface used in accordance with a preferred embodiment of the present invention, displaying an HTML screen from a Web page.
  • FIG. 1B is a depiction of the user interface showing the current and playback grids.
  • FIG. 2 is a depiction of the user interface displaying one Web page of a Web site and the design grid.
  • FIG. 3 is a depiction of the design grid used in accordance with a preferred embodiment of the present invention.
  • FIG. 4 is a flowchart of the design phase of one embodiment of the present invention.
  • FIG. 5 is a depiction of the user interface showing one Web page of a Web site and the design grid used in accordance with a preferred embodiment of the present invention.
  • FIG. 6 is a depiction of the user interface showing a Web page and design grid with a refreshed grid with new data from the Web page.
  • FIG. 7 is a depiction of the user interface showing a second Web page containing HTML data and showing the design grid and the selection of the Base ID data element.
  • FIG. 8 is a depiction of the user interface showing a second Web page containing HTML data and showing selection of an Extraction element including the data to be extracted from the Web page.
  • FIG. 9 is a depiction of the user interface showing a second Web page containing HTML data and showing user specified variable names and extraction patterns of the data to be extracted.
  • FIG. 10 is a depiction of a schema file storing information recorded during the design phase of one embodiment of the present invention.
  • FIG. 11 is a depiction of the user interface showing a Web page and the current grid and playback grids.
  • FIG. 12 is a flow chart of the playback phase of one embodiment of the present invention.
  • FIG. 13 is a depiction of the user interface showing the current and playback grids and the information from the playback grid submitted to the Web page.
  • FIG. 14 is a depiction of the user interface showing a second Web page and current and playback grids associated with a second Web page.
  • FIG. 15 is a depiction of a program user interface used in connection with automated browsing of XML data Web sites in accordance with another embodiment of the present invention.
  • FIG. 16A is a depiction of XML data islands extracted from a Web page containing embedded XML data islands.
  • FIG. 16B is a depiction of XML data islands extracted from an XML file containing embedded XML data islands.
  • FIG. 17 is a depiction of the user interface showing an XML file and data islands associated with the file.
  • FIG. 18 is a depiction of the user interface further showing node details for one of the data islands associated with the XML file.
  • FIG. 19 is a depiction of the user interface displaying tree nodes and data from the rss data island shown in FIG. 18 .
  • FIG. 20 is a depiction of the user interface showing user-selected nodes which have been highlighted.
  • FIG. 21 is a depiction of the user interface showing details of the user-selected nodes from FIG. 20 .
  • FIG. 22 is a chart listing items that are written to a file storing the various user-entered information to be used in accordance with a second embodiment of the present invention.
  • FIG. 23 is a depiction of the arrangement of the file in its stored format.
  • FIG. 24 is a depiction of the user interface showing a user opening a previously-saved file to be used in connection with the playback mode.
  • FIG. 25 is a depiction of two files showing the final relational version of the data extracted in accordance with a second embodiment of the present invention.
  • the present invention provides various preferred embodiments of a unique, instantly deployable business-solution construction and delivery mechanism that makes use of live Web data and conventional data.
  • the present invention in the embodiments described below, is preferably implemented in the form of software forming a computer program that is adapted to run on a computer.
  • the computer is a client side computer which is adapted to connect to a server side computer. More preferably, the computers are connected via the Internet, with the client computer running a program to allow Web browsing and the server computer running a Web server program, as is typical with Internet and Web communications.
  • the program preferably provides a user interface that the user can manipulate during a design phase, with saved settings later called back during a playback phase.
  • the program can be stored on various types of storage media such as floppy disks, CD ROM, hard disk, RAM, etc. and is preferably installed on the client computer.
  • a Web server is adapted to connect to the Internet or Web in a typical way to deliver Web pages to client computers.
  • Client computers run the program implementing the various embodiments of the present invention which allow browsing of Internet Web sites, data extraction from such Web sites and reformatting of the data found at such Web sites.
  • Client computers connect to the Internet in a typical fashion (e.g., dial-up access, cable modem, DSL, T-1 connection, etc.).
  • Client computers typically include standard operating systems such as Microsoft Windows (CE, 95, 98, NT, 2000), MAC OS, DOS, Unix, etc.
  • Client computers can comprise other devices beyond PCs which can connect to the Web in a wired or wireless fashion, such as PDAs, notebook computers, mobile phones, etc.
  • FIG. 1A is a screen shot of a typical Internet HTML screen from a Web site 1 as it appears to the user, showing useful information like current prices for five stocks in a chart 2 , which stocks are Microsoft, IBM, DELL, Oracle, and Hewlett-Packard.
  • the present invention is preferably in the form of a Web browser application 4 , which looks like a normal Web browser screen, but has additional buttons 5 and menu options 6 on the top.
  • the application is programmed to bring up additional specific screen elements when certain buttons 5 are pressed on the top menu bar.
  • a preferred embodiment of the present invention includes a built-in Web browser that is used to actually navigate the Web.
  • this browser can accept commands from a program in addition to commands from a user.
  • the present invention is preferably implemented in two stages, a design phase and a playback phase.
  • In the design phase, when the user wants to start recording the sequence of automatic navigation/extraction steps, the user clicks on the “Record” button 7 on the menu bar.
  • this action preferably produces a design grid 8 at the bottom of the Web browser that displays the content of the current page loaded in the Web browser (here the URL of the page is http://us.yimg.com/i/fi/c/zc.html).
  • the user can give instructions in the design grid 8 relevant to what information from the Web page needs to be extracted, which information needs to be entered in which entry slot on this screen, and which button or hyperlink needs to be pressed for the next Web page to show up. As described in more detail below, checking the appropriate check boxes and entering information in the appropriate row in the grid achieves these actions.
  • the process of understanding which Web page is currently being processed, what data is to be entered on which element on the page, which buttons or links are to be clicked, and retrieving desired pieces of information from the Web page (i.e., all processing that happens on a single Web page) is referred to herein as a “step”.
  • Many pages can be processed one after another in a continuous string.
  • the string, referred to herein as a “surflet” because it contains information on surfing one specific site for one specific purpose, can be saved under a special user-given name as an XML file and can be re-used in the future to achieve exactly the same navigation and data extraction from the same site but for new, refreshed data.
  • the stock information surflet can be played back automatically by other users for their own customized list of stocks (say Wal-Mart and AOL) as desired by a program.
  • the recorded instructions can be played back automatically.
  • a user or a program tells the program which saved (recorded) surflet is to be played back and with what input parameters.
  • the input parameters are the stocks the particular user is interested in.
  • the steps within the surflet are repeated (played back) automatically and the extracted new business information is returned to the user or program initiating the request.
  • FIG. 2 shows the user interface of the program, depicting a Web browser component 4 , design grid 8 , and menu buttons 5 .
  • a Web browser is an application that displays information from the Internet in page format.
  • Microsoft Internet Explorer 4.0, Netscape 4.6, and Eudora are examples of Web browsers.
  • a Web page is a visible interface written in HTML language that displays information over the Internet.
  • the HTML language is a finite set of formatting tags that are read by the Web browser, which decides how to present the information as a Web page.
  • the Web browser parses the formatting tags inside the Web page, and creates a “tree-like” structure in its memory, based on the relation of the tags to each other. This internal tree-like memory structure is referred to herein as a “Web document.” It is normally not shown to the user and is required only if a user or an application wants to manipulate the content or information on the Web page programmatically.
  • a Web page written in HTML language consists of numerous formatting tags.
  • the formatting tags inside the Web page are identified by the Web browser as HTML elements.
  • the HTML elements are read from top down, and are arranged in a hierarchical fashion. If one element has other elements inside it, those elements are treated as child elements, and are at a level lower than the original element in the hierarchy.
  • the HTML elements that constitute a Web page are read by the Web browser in an ascending order, and are assigned a number (starting from 1) for identification. This number is called the “source index” of that HTML element.
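  • As a sketch of that numbering (using Python's html.parser purely for illustration), a top-down walk over the start tags assigns the same kind of 1-based source index:

```python
from html.parser import HTMLParser

class SourceIndexer(HTMLParser):
    """Assigns a 1-based 'source index' to each start tag, in document order."""
    def __init__(self):
        super().__init__()
        self.index = 0
        self.elements = []   # (source_index, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.index += 1
        self.elements.append((self.index, tag))

indexer = SourceIndexer()
indexer.feed("<html><body><table><tr><td>IBM</td><td>85</td></tr></table></body></html>")
print(indexer.elements)
# [(1, 'html'), (2, 'body'), (3, 'table'), (4, 'tr'), (5, 'td'), (6, 'td')]
```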
  • The types of HTML elements that can accept information, or which allow the information presented to be changed, are called “Web page controls.” In short, these Web page controls allow user interaction to change their content. Examples of Web page controls are TEXT, TEXTAREA, CHOICE, SELECT, RADIO, SUBMIT, RESET, and PASSWORD.
  • the design grid 8 is a mechanism which displays the information about the HTML elements in the HTML document currently loaded in the Web browser.
  • the design grid 8 has several columns, some of which describe the properties of the HTML tags, and others which accept information from the user on what to extract and how to extract it. The remaining columns contain instructions for fine-tuning the data extraction process.
  • An example of the design grid is shown in FIG. 3.
  • FIG. 3 shows a consolidated image of various parts of the same grid in one picture frame.
  • the HTML Tag Number column 32 is the Source Index of the HTML element from the HTML document. This is a read-only property provided as an attribute of the HTML element. The tag number is useful in identifying the exact element inside an HTML document, and to perform operations like extracting values from that element, or posting some data to that element.
  • the Tag Type column 34 is the Type attribute given to an HTML element in an HTML document. This attribute is provided to the control type of HTML elements only.
  • the formatting tags like TD, TR, P, etc. do not have a tag type attribute.
  • the Web page controls are TEXT, TEXTAREA, CHOICE, SELECT, RADIO, SUBMIT, RESET and PASSWORD.
  • the Visible Text column 36 displays the text contained inside every HTML element that is displayed inside design grid 8 .
  • the controls on the Web page are displayed with their default text.
  • the TEXT, PASSWORD, TEXTAREA controls are generally kept blank for the user to enter values.
  • the SELECT control 37 usually shows the first item in its list as the default item selected.
  • the RADIO or CHOICE may or may not be selected by default.
  • the HTML tag specific information is automatically filled in design grid 8 when the grid is displayed.
  • the designer has to supply the following information in the appropriate columns against the appropriate HTML tag.
  • a Web page will change its data content and also possibly its data presentation format between the time a recording is done in the design phase and later played back in the playback phase.
  • the user needs to ascertain that he/she is working with the same Web page that was used during the recording of the surflet.
  • the user identifies a firm piece of information on the Web page that has a low probability or the least probability of being modified or changed when the Web page data or format is modified.
  • This piece of information will work as a guide during the playback phase, based on which the program will decide whether the Web page is the right Web page or not.
  • This piece of information is called the Page ID data element, shown in column 38 .
  • the user selected “Yahoo! Finance” as the Page ID data element since it is very unlikely this information will change at this Web site.
  • the user thus has to examine the Web page for such stable information (Page ID text), and then click the check box in the Page ID column 38 against the same Page ID text in the design grid 8 .
  • a Base ID data element contained in Base Column 40 , is an HTML data element that acts as a starting reference or base or anchor for other HTML elements during data extraction. The designer identifies one or more such HTML elements which have a high or the highest probability of appearing in the same relative position from the data to be extracted, even if the Web page undergoes modifications.
  • the Xtract column 42 is of type check box. The check box is clicked if the information contained in the Extraction data element (as seen in the Visible Text Column 36 ) associated with a check box is to be extracted from the Web site.
  • the Variable Name column 44 is a user-defined variable name that will contain the extracted business data from the Web browser page. This variable can be supplied to other functions as a variable, or can be set from other applications inside the program to receive input values to be entered on the Web page.
  • the Forward Tolerance column 46 is a numeric field. The application will go “down” the HTML elements list in the grid within this forward tolerance limit to find a match for the HTML element in consideration. A typical tolerance number is 10, meaning that it is acceptable for the HTML element to wander a little bit here and there as a result of Web page design changes, as long as it is within 10 positions of the recording time position of the same element.
  • the Backward Tolerance column 48 is also a numeric field. The application will go “up” the HTML element tree within this limit to find a match for the HTML element in consideration.
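  • A sketch of how the forward and backward tolerances might be applied when relocating a recorded element in the refreshed page; the grid is reduced to (visible text, tag type) pairs and the sample rows are invented for illustration:

```python
def find_within_tolerance(current_rows, expected_row, visible_text, tag_type,
                          forward=10, backward=10):
    """Search the current grid around the expected row for a matching element.

    current_rows: list of (visible_text, tag_type) in grid order (0-based).
    Returns the matching row index, or None if nothing matches within tolerance.
    """
    # Check the expected position first, then fan out within the tolerances.
    candidates = [expected_row]
    candidates += [expected_row + i for i in range(1, forward + 1)]
    candidates += [expected_row - i for i in range(1, backward + 1)]
    for row in candidates:
        if 0 <= row < len(current_rows):
            text, tag = current_rows[row]
            if text == visible_text and tag == tag_type:
                return row
    return None

# The element recorded at row 2 has drifted down to row 4 in the refreshed page.
grid = [("Chart", "IMG"), ("Ad", "IMG"), ("News", "TD"), ("Extra", "TD"), ("Quotes", "TD")]
print(find_within_tolerance(grid, 2, "Quotes", "TD"))   # 4
```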
  • the Xtract All Rows column 50 defines a fixed pattern to be extracted.
  • the designer provides a unique Extract All ID to all the fields that will form the pattern.
  • the number of rows filled in the grid with this unique name form the number of columns in the extraction pattern.
  • This column is useful to define the extraction of data formatted in tabular form. In effect this column is used by the user to specify one complete row of the table to the software.
  • the implied instruction to the software is to Extract All similar rows.
  • the Stop At column 52 specifies certain text at which to stop extracting.
  • the Extract All Rows pattern looks for the text entered in this field to stop the extraction. If not provided, the end of the document is assumed to be the end of the extraction pattern.
  • the extracted information is passed from the Web browser to the design grid, and ultimately to the surflet.
  • the Web page is first loaded in the Web browser and is read into memory at action 60.
  • the information on this Web page is displayed in the design grid at action 62 .
  • the user then submits supporting instructions in the design grid at action 64 and also submits some information on the Web page at action 66.
  • the design grid is resynchronized with the current Web page so that it will contain the most current version of the Web page, and this design grid is saved to the system's memory as one “step”.
  • the Web server returns a Web page as a response to the earlier request.
  • This new page is displayed in the Web browser and its content is shown in the design grid at action 70 .
  • the process continues until the user decides to stop the recording at action 72 . When stopped, the steps are saved in the surflet for future playback.
  • Actions preferably performed by the user are actions 64 , 66 , 70 and the stop recording action 72 .
  • Operations preferably performed by the program include actions 60 , 62 , 68 and the write action 72 .
  • the designer enters the URL in the address box 80 and brings up the design page.
  • the designer clicks the record button 7 to start recording the navigation and extraction process.
  • the click of the Record button 7 populates the design grid 8 with the information contained in the currently loaded Web page in the Web browser.
  • When the user clicks the Record button 7, the program automatically copies the Web page HTML from the Web browser to an Mshtml.dll document, which automatically makes a list of HTML tags and their names and values available to the program.
  • a utility is preferably provided that can allow or disallow certain tags to be included in the design grid based on the user's preference. Each element in the document is checked to see whether it is in the allowed list. If it is, it is processed further and its information is included in design grid 8.
  • the designer enters the stock symbols in the Text Control box 82 to get back more information on those stocks. But the designer has not clicked on any buttons or hyperlinks yet.
  • the designer believed that whatever changes may take place on the Web page, the image 20 that says “Yahoo Finance” will always appear at the same place. In short, that image will always be there to identify at a later time that this is the correct page that will be automatically visited. Therefore, the designer eyeballs the lines in the design grid 8, locates the line 84 that contains “Yahoo! Finance” in the Visible Text Column and Image “IMG” 86 in the Tag Type column, and clicks the check box 88 in the Page ID column. The tolerance given is zero, which means that when this recording is played back, and when this particular Web page is brought up, the program will search for the “Yahoo! Finance” image at exactly the same position in the grid.
  • the designer does not wish to extract any data from this first page of recording, and is ready to receive the detailed information on the stock symbols that was entered in the text control. Therefore, the designer clicks the Submit Button 83 (“Get Quotes”) on the Web browser.
  • Before the Web page is submitted to the Web server for a response, an event in the Web browser called BeforeNavigate is always triggered automatically. This allows the program a chance to extract the relevant information from the current page before it is sent back to the server.
  • the program inspects the status of every tag in the MSHTML tag list, and all tags that were used by the user as Page ID, Base ID, Extract, or Extract All, or were clicked on in the browser window, are saved to a memory structure.
  • This second Web page 90 has data of interest to the designer who wishes to extract it.
  • the table 8 in the Web browser displays the information returned as a result of the user request in the first step.
  • the designer wants to extract the information on the date and time of the stock information on this page, shown at element 100 .
  • This is an isolated data extraction. This information can move up or down within the Web page later on. But there are certain key information pieces on every Web page that always appear on that page, albeit their position in the layout of the Web page may have changed a bit. These are referred to herein as the Base ID elements.
  • These Base ID elements act as an anchor with whose reference other data can be extracted.
  • the Quotes text 102 has been assigned the role of a Base ID element. The tolerance provided is 10, both for backward and forward tolerances. This ensures that during the playback of this recording, if this second Web page has changed within the provided tolerance for the specified Base ID element, the Base ID element will still be located on the Web page during the playback.
  • the designer wants to extract the date and time 100 of the stock values returned on the Web page.
  • the designer also has provided a backward and a forward tolerance of 5. This means that during the playback of this recording, if the given text has moved within the provided tolerance, it still will be identified and retrieved.
  • the designer also wishes to extract the information contained in the table 2 , i.e., the stock name 104 , the stock value 106 , the date and time 108 , the percent change 110 , and the volume traded 112 .
  • the designer identified the column header Symbol 120 as the Base ID element for this second data extraction, and has provided a value of 10 for the backward and forward tolerance.
  • This Base ID element will act as an anchor to the “Extract All” data extraction from the table.
  • the designer next has to provide a sample of the pattern that the user wants to extract. As shown in FIG. 9, the first row 130 of the table 2 serves as the sample. In the design grid, the designer provides a user-friendly name to the data that will be extracted in the Variable Name column, and also a user-chosen name for this pattern of data extraction (“Row 1” in this example).
  • the program also needs to understand at what point it should stop extracting data in an Extract All type of extraction. Therefore, some text is provided where the extraction will stop.
  • the text “Recent News” 132 is used as a relatively firm piece of information that should always appear on this particular Web page. Hence it is selected. Now when the Extract All extraction is implemented, it will extract the first row from the table, look to see if it has reached an element that has a text called “Recent News”. If not, the program will continue to extract until it finds that text. Thus, the program will extract data from all the desired rows from the table on the Web page.
  • This process is repeated until the designer decides to stop the recording by clicking the stop button 134 .
  • the designer is provided with an option to save the current recording in a surflet file. Once saved, this file can be reloaded any time, as many times as desired for playback.
  • the schema file is a representation of the steps that were recorded. As can be seen in FIG. 10, there are some Global variables 140 , two folders 142 a and 142 b called Step, and many subfolders 144 called Gridline within those folders.
  • the variables “PageIDTolerance”, “AnchorTolerance”, and “ExtractTolerance” are applicable to the whole schema, i.e., to all steps. They decide default tolerances, saving the user the work of specifying tolerances on every row.
  • the folder Step corresponds to one step in the recording. That means the file in FIG. 10 has only two steps, which in turn means that only two steps were recorded in the design phase.
  • the Gridline folders 144 under each Step folder each represent one row in the design grid.
  • Each gridline folder has numerous variable fields, which represent one column each in the design grid. The value of the variable is the value in that row of the associated column of the design grid.
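  • An illustrative sketch of that nesting, built with Python's standard XML library; the element names are assumptions, since the exact tags used in the saved schema file are not spelled out here:

```python
import xml.etree.ElementTree as ET

# Illustrative layout only: global tolerances at the top, one Step per recorded
# page, one Gridline per design-grid row, one child element per grid column.
schema = ET.Element("Schema")
ET.SubElement(schema, "PageIDTolerance").text = "0"
ET.SubElement(schema, "AnchorTolerance").text = "10"
ET.SubElement(schema, "ExtractTolerance").text = "5"

step = ET.SubElement(schema, "Step", url="http://us.yimg.com/i/fi/c/zc.html")
gridline = ET.SubElement(step, "Gridline")
ET.SubElement(gridline, "GridRow").text = "2"
ET.SubElement(gridline, "TagType").text = "IMG"
ET.SubElement(gridline, "VisibleText").text = "Yahoo! Finance"
ET.SubElement(gridline, "PageID").text = "True"

print(ET.tostring(schema, encoding="unicode"))
```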
  • the current grid is the same as the design grid in its layout. This grid is populated during the playback of a recorded surflet only. It always displays the content of the current Web page loaded in this embodiment's Web browser.
  • An example of the current grid is shown in FIG. 11.
  • the upper grid 150 a of the two grids seen is the current grid.
  • the playback grid 150 b is the same as the current grid 150 a in its layout. This grid is populated during the playback of a recorded surflet only. It displays only those rows that were saved as a step during the design phase.
  • FIG. 12 shows the playback loop flow diagram. The playback starts when the playback button is clicked. Before the actual playback starts, the Web browser should show the same starting Web page from where the playback was recorded earlier.
  • the first action in the playback is the reading of information from the surflet XML file identified by the user in a file open dialog box.
  • the information in the surflet file is saved during the design phase in such a way that there are as many XML nodes as there are steps.
  • Each row in the design grid for a step is again a separate sub-node to the step node in the surflet file (FIG. 10 ).
  • the memory structure is filled with information from the surflet file in a similar hierarchical fashion.
  • the user or a program navigates to the Web page in the design step from where the recording began.
  • validation action 164 is performed where it is ascertained whether the currently loaded Web page is the Web page that the user wants to work with. A check is made based on the identification marks made in the playback grid.
  • Similar validations are made for elements identified as Base ID.
  • each HTML element in the playback grid is searched for in the current grid based on the grid row number, tag type, and tag text.
  • the current grid 150 a is filled with information from the Web page in the Web browser in view (FIG. 13 ).
  • the information about the first step from the memory is displayed in the playback grid 150 b.
  • the rows from the playback grid are picked up one by one in a loop to find out if any of the rows have HTML elements that have been defined as a Page ID element. When such a row is identified, an attempt is made to find the same HTML element in the current grid 150 a.
  • There are four rows in the playback grid. These rows were saved during the first step of the design schema.
  • the first row 180 was identified to be a page marker at that time.
  • the information associated with this row is as follows:
  • the information about this row is saved in memory when the program starts to search for the exact element in the current grid 150 a.
  • the first row 181 in the current grid is checked against this saved information from the playback grid and does not match.
  • the program then moves on to the next row 182 in the current grid, which has the following information describing itself:
  • the grid row number from the playback grid matches the grid row number of the current grid exactly, as do the tag type and visible text. Therefore the tolerances of zero will work here, and there will be a match on the grids. As a result, the program knows that it is dealing with the same and correct Web page that was used during the design phase.
  • the forward and backward tolerances operate as a guide that tell the program how much up or down in the grid it should look to find a possible match.
  • Had the element not been found at its recorded position, the program would have had to search the current grid to find a possible match.
  • If the forward tolerance were 5, then from the location in the current grid where the program originally expected to find a match (row 2 in the current grid), the program moves down the current grid one row at a time.
  • the software compares the Visible Text and Tag Type of that row to the saved information of the row declared as Page ID in the playback grid. If a match is found in the third move, then the program knows that the Web page has changed since the last recording, but within the tolerance limit provided. Hence, the program proclaims that a match is found, albeit with some adjustment.
  • the adjustment, which would be 3 rows of movement in the example, is called the offset of that element. This means that any subsequent HTML element present in the playback grid is also expected to appear 3 rows below its originally expected position in the current grid.
  • a Base ID element is a reference ID of an element for other elements.
  • the presence (though not the exact location) of this element on the Web page is more or less assured, and hence it serves as a good anchor (or base) to locate other information from the same page.
  • the Base ID element itself can move between the design time and playback time. In such case, the backward or forward tolerance is applied to find out the correct new location of the element tag that had been declared as a Base ID element in the design phase.
  • the offset is determined, which is the difference between the old and the new row number of the Base ID element. All the elements associated with this Base ID element are identified further based on the offset calculated from the new position of the Base ID element.
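  • A sketch of that offset arithmetic; the helper names are illustrative, and the sample row numbers follow the worked example later in the text (the Base ID is found 8 rows lower than recorded, and the extraction target is found one row beyond its offset-adjusted position):

```python
def locate(rows, text, start, forward, backward):
    """Find `text` in rows near `start`, searching within the given tolerances."""
    candidates = [start] + [start + i for i in range(1, forward + 1)] \
                         + [start - i for i in range(1, backward + 1)]
    for row in candidates:
        if 0 <= row < len(rows) and rows[row] == text:
            return row
    return None

def extract_with_base(rows, base_text, base_row, target_text, target_row,
                      base_tolerance=10, extract_tolerance=5):
    # Relocate the Base ID element first and compute its offset.
    new_base = locate(rows, base_text, base_row, base_tolerance, base_tolerance)
    if new_base is None:
        return None
    offset = new_base - base_row
    # Dependent elements are expected to have shifted by roughly the same offset.
    return locate(rows, target_text, target_row + offset,
                  extract_tolerance, extract_tolerance)

# Sample numbers following the worked example: the Base ID recorded at row 21 is
# now at row 29 (offset 8); the target recorded at row 43 is now at row 52.
rows = ["x"] * 60
rows[29] = "Quotes"
rows[52] = "Wed Oct 20 10:24am ET"
print(extract_with_base(rows, "Quotes", 21, "Wed Oct 20 10:24am ET", 43))   # 52
```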
  • There are two Base ID elements 191 and 192, one for the stand-alone extraction, the other for the column data extraction. If there is more than one Base ID element, then the elements that lie between two successive Base ID elements are associated with the Base ID element which appears earlier in the playback grid.
  • Action 166 is essentially the repeat of action 164 .
  • the difference is that the offset found in action 164 is added to the row number from the playback grid, and then the match is performed on the current grid. This method takes care of any rippling effect the base element has on the other elements because of its relative displacement in the new Web page.
  • each element may be assigned a forward and/or backward tolerance. This further helps to cushion changes made to the HTML page since the design was recorded.
  • the third row 193 in the playback grid has instructions to extract some information from the Web page.
  • This row has the following information:
  • the Base ID appearing before this extract is the Base ID with text as Quotes in row 190 .
  • This row has the following information:
  • the program validated the Base ID element before it reached this extraction.
  • This Base ID has a tolerance of 10 units.
  • the Web page changed from the time the user created this surflet, and some information was added before the text Quotes. Also, some text was added between the text Quotes and the data to be extracted (i.e., “Wed Oct 20 10:24 am ET—US Markets closes in 5 hours 37 minutes”).
  • the difference (8) is still within the tolerance limit declared for the Base ID (10). Therefore, this Base ID will be located in the current grid in row number 29 , and will be validated.
  • the offset of 8 tells the program that all other HTML elements appearing after this Base ID element would also have moved down the hierarchy by at least 8 values.
  • the program looks up the current grid at row number 51 for the exact text that is present in the playback grid. But the data of interest also has moved down because of some changes made to the Web page, and its new location in the current grid is 52. Therefore, the program applies the forward tolerance (of 5) and correctly locates the target at row number 52 .
  • the HTML tag number associated with this row is the actual Source Index of that HTML element in the changed HTML page.
  • the handle on the HTML element's source index allows the program to programmatically manipulate that element as instructed.
  • Base ID helps to cushion the changes made to the Web page after it was used for creating a surflet. This method assures that given the right instructions, the program will find the correct HTML element from a given Web page.
  • an Extract can be a single extraction, or can be an Extract All type of extraction.
  • If an instruction to extract is in isolation, meaning no other element's information is to be extracted along with the one in focus, it is a single extraction.
  • the Stop At text is the text of the element that tells the program where to stop extracting during a pattern search and extract in Extract All type of extraction.
  • the Stop At element's position adjusts the offset for all further extractions on that page. This allows data appearing after varying-length tables to be correctly extracted, regardless of how many rows existed in the table at design time and at actual run time.
  • the first row of the table 2 has been identified as a pattern for extraction.
  • the program knows the number of rows in a given pattern because of the same name given against the HTML elements in the playback grid that form the pattern (Row 1 ). In the example, there are five elements in the pattern.
  • the pattern is read into memory from the playback grid. Each element from the memory is read one at a time, and a match is found in the current grid as described earlier. Likewise, all the elements in the memory are matched, at the end of which a check is made as to whether the text given in the Stop At column has been reached. If not, the first element from the memory collection is read again and a match is searched for in the next element in the current grid.
  • the extraction logic keeps applying the pattern given in playback grid (identified by Row 1 ) to the rows in the current grid table until it encounters the “Recent News” text. This results in retrieval of all the rows in the Web page table.
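  • A simplified sketch of that loop; the real matching also compares tag types and applies tolerances, and the stock figures below are invented for illustration:

```python
def extract_all(current_cells, pattern_length, stop_at):
    """Apply a recorded row pattern repeatedly until the Stop At text is reached.

    current_cells: visible text of the grid rows, in order.
    pattern_length: number of consecutive cells making up one table row (5 in the example).
    stop_at: text that marks the end of the table (e.g. "Recent News").
    """
    rows, current = [], []
    for text in current_cells:
        if text == stop_at:
            break
        current.append(text)
        if len(current) == pattern_length:
            rows.append(current)
            current = []
    return rows

cells = ["MSFT", "92.5", "10:24am", "+1.2%", "31,000,000",
         "IBM", "107.3", "10:24am", "-0.4%", "8,500,000",
         "Recent News"]
print(extract_all(cells, 5, "Recent News"))
```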
  • the information entered by the user during the design phase in the HTML controls is saved in the surflet, and is reproduced again in the playback phase in the playback grid.
  • the next step is to identify the input controls in the playback grid and update the appropriate Web page controls with information in the playback grid row.
  • the same method of HTML element control identification is employed as in actions 164 and 165 .
  • the text is set inside the control, or the index is set if it is a multi item control (e.g., radio button, choice, select, etc.).
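  • A sketch of that control update, with the page's controls reduced to a plain dictionary for illustration (the program itself manipulates the controls through the browser's HTML document object, and the field names below are assumptions):

```python
def apply_inputs(page_controls, playback_rows):
    """Push recorded input values back into the page's controls at playback."""
    for row in playback_rows:
        control = page_controls.get(row["name"])
        if control is None:
            continue
        if row["type"] in ("RADIO", "CHOICE", "SELECT"):
            control["index"] = row["value"]      # multi-item controls take an index
        else:
            control["value"] = row["value"]      # TEXT, TEXTAREA, PASSWORD take text

controls = {"symbols": {"type": "TEXT", "value": ""}}
apply_inputs(controls, [{"name": "symbols", "type": "TEXT", "value": "MSFT IBM DELL"}])
print(controls)   # {'symbols': {'type': 'TEXT', 'value': 'MSFT IBM DELL'}}
```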
  • Another embodiment provides a system and method for data extraction from XML data, including a method for capturing, filtering and converting XML data into more conventional tabular (relational) table formats, using an easy point-and-click user interface.
  • the data can be from the Internet, Intranets or flat files on local drives or LANS/WANS.
  • the program captures the live data from the XML source and converts the data into a relational (tabular) format. The new format can then be used by anyone needing filtered original data in the more conventional relational format.
  • This embodiment preferably provides a user interface with point-and-click user interaction to: identify an XML data source; identify the XML data of interest within the data source; save these instructions for later use; automatically retrieve live XML data at a later point in time using the saved instructions; automatically filter the live data to cull it down to the data of interest; and automatically convert the filtered live data into more conventional table formats, making the data easier to use by applying well-known SQL techniques.
  • a preferred aspect of this embodiment executes in two modes, the design mode and the playback mode.
  • the design mode allows a person (referred to as the “Designer”) to instruct the program which and how much data is to be captured from a Web site (or a file) capable of supplying XML data.
  • the program saves these instructions in a file, referred to herein as a “schema” file.
  • the schema file itself is preferably an internal XML file but is not directly related to the XML data being captured.
  • the playback mode can be monitored by a person for testing purposes or can be completely automatic, devoid of human interaction, for program to program communication purposes. It reads the saved instructions from the schema file and gets selective live XML data from the source (Internet, Intranet or file) and converts it in a relational (tabular) format. The new format can then be used by anyone used to using conventional data.
  • the design mode essentially is a teach mode, wherein the set of instructions taught by the designer is saved as a schema in an XML file.
  • the screen 200 shown in FIG. 15 is presented as the user interface.
  • the designer identifies the data source from where the XML data is to be captured at playback time by typing in a Web address or choosing a file in field 202 .
  • the designer clicks on the Show Data Islands/Tree button 204 .
  • the program navigates to that Web site or opens the file, and loads the XML contents of the source into the Microsoft-supplied MSXML.DLL utility as a document object. Further processing takes place in the document object.
  • the document can be of two main types. It may either be an HTML Web page with embedded XML “data islands” or it may be a standalone XML file.
  • An example of an HTML Web page 210 with data islands 212 and 214 is shown in FIG. 16A.
  • An example of a stand-alone XML file 220 is shown in FIG. 16 B.
  • When the designer presses the Show Data Islands/Tree button 204, if the XML data source contains data islands, a list of those islands is shown in the data islands list box 204. As shown in FIG. 17, the user chooses one data island (here the data islands are “rss” and “moreovernews”) for further data extraction definition by clicking on it in box 204.
  • If the XML source contains only XML data (i.e., without any data islands), then that entire XML is shown in box 206 (shown in FIG. 18) directly, skipping the step of requiring the user to identify the particular data island. All further processing is the same regardless of whether the data in lower box 206 was loaded from an XML file or from an XML data island embedded inside another file.
  • the program then uses another utility, such as the Microsoft-supplied MSXML.DLL facilities, to read the XML tree node by node.
  • a visual image of the XML tree is created for the user. This is preferably done by loading every node into a third party utility software component such as Tree Visualizer sold by Green Tree Software.
  • the user is also given the flexibility of viewing the tree node with and without the data.
  • In FIG. 18 there are shown tree nodes 220 and data 222.
  • FIG. 19 displays tree nodes 230 and data 232 from the “rss” data island displayed in FIG. 18 .
  • the user can click on any tree nodes to serve as identification of tree branches that contain data to be captured at run time. Multiple nodes on different branches can be clicked. All clicked nodes are highlighted for visual identification and easier understanding. A selected node can be de-selected by clicking on it again.
  • the designer is required to click on only one node from any desired branch.
  • the implied instruction is that all similar branches are also desired. If the design time tree has 50 similar looking branches, the designer has to click only one node in any of 50 branches.
  • the “depth” of the clicked node within the desired branch decides which data within the similar branches will be captured at playback time. Only the data from the clicked node upward, towards and including the root of the tree is captured at playback time.
  • the user-clicked nodes' full path up to and including the tree root and also their children or subnode names are saved as a part of the saved instructions.
  • the details of the user clicked node in the Visual Tree are shown in the far point grid 250 .
  • the user-clicked nodes' paths and their immediate subnodes' paths are shown in columns 252 and 254.
  • the number of rows the user wishes to capture for each path is specified in number column 256 .
  • the Node Name is specified in Table/Grid column 258.
  • a description is entered in description column 260 . Wait-time for the response from the Web site at run time is also specified.
  • the designer can specify the number of rows he wishes to extract at playback time, a meaningful business name for the whole table of relational data to be generated at run time, and a description.
  • the user can also specify the number of seconds he wishes to wait for the response from the Web site at runtime.
  • FIG. 22 shows the items which are written to the schema file.
  • FIG. 23 shows how the schema file looks when it is saved, which in this case was saved in the file “sample.xml” shown in address box 260 .
  • the end user identifies a saved schema that is to be executed.
  • the program opens the user specified schema file from the open box and reads the saved instructions from the file. It then navigates to that particular Web site or opens the file to get the live XML data.
  • the XML data is then loaded, preferably into the Microsoft supplied MSXML.DLL as a Document. If the XML data fails to load within the time specified by the designer, the program returns back to the calling application with appropriate error messages.
  • One example of why such failures may occur is a Web server being down.
  • the data is retrieved node by node from the top, preferably using Microsoft supplied methods.
  • a relational table is created in the user-specified name (Table-Name), which is retrieved from the saved schema.
  • the Parent-Path 280 and the To-Be-Extracted Path 282 for that clicked node form the Columns of the relational table.
  • This process of retrieving the next node and inspecting it continues node-by-node until the traversal reaches a node having the same name as the clicked node.
  • the data from all of the children nodes is also written out to their corresponding column names in the tables being filled.
  • the node inspection process stops when the end of the XML data is encountered or the number of nodes specified by the designer have been written out to the tables.
  • the tables are created in user specified names, which, in this example are Image and Item.
  • the program can show the final relational version of the retrieved, filtered and tabularized data in an application such as the Microsoft-supplied Notepad utility shown in FIG. 25.
  • the first line in the output corresponds to the column names and the following lines have data.
  • the XML extraction method of the present invention can be used as a stand-alone program or can be implemented as a utility subprogram inside any other program to retrieve, filter and convert into tables any XML data from any source, Web site or file.
  • the “business” can be any industry, segment or market, including the commercial and academic areas of any manufacturing, service, information or knowledge oriented institution.
  • the technologies and capabilities achieved include automated capture/entry of data on live HTML Web (Internet/Intranets) pages and automated capture/entry of data on live XML Web (Internet/Intranets) pages.
  • the present invention provides a method to capture useful and latest business data from the Internet, Intranets and Extranets. Most preferably, it is implemented via a program running on a user's computer.
  • the program can be stored on any type of recordable media (e.g., floppy disk, hard disk, ROM, RAM) and instructs a user's computer in the typical fashion. With the present invention, the program learns which Web pages are of interest to the user, how to reach (navigate to) those Web pages, and which business information on those pages is of interest to the user. These steps are recorded and saved in a schema file.
  • when requested by a program or a user, the program can automatically repeat the saved Web navigation and data extraction steps and capture the latest instance of the changed business data from the same pages as originally taught by the user.
  • the present invention is capable of surfing the Web without human assistance.
  • the present invention intelligently accommodates the changes and adjusts itself to get the correct business information from the new pages in spite of the layout change. If the layout has changed so drastically that the business information is no longer present on that page, the present invention will provide an error message.
  • This present invention provides the capabilities of automatically navigating to pre-requested Web pages and automatically entering data and clicking on pre-requested buttons or links.
  • the advantages of such Internet automation include unattended, continuous monitoring of real-time business information, and automatic surfing 10 to 1000 times faster than human interaction with the Web.
  • the present invention is preferably built on Microsoft-supplied generic basic technologies and components, although any other equivalent components can be used to achieve the same results.
  • Preferred components for the HTML extraction program include: the Visual Basic 6.0 IDE; the Web Browser Control (SHDOCVW.DLL); and MSHTML.DLL.
  • Preferred components for the XML extraction program include Visual Basic 6.0 IDE and MSXML.DLL.
  • Visual Basic 6.0 IDE is a generic object oriented programming language. Equivalents are C++, VC++, and Java.
  • Web Browser Control (Shdocvw.dll) is a generic tool supplied by Microsoft to provide a browser interface to the Web under a program's control.
  • Mshtml.dll is a generic tool supplied by Microsoft to convert an HTML Web page into a program understandable list of HTML tags and their values.
  • Equivalents are third party tools like the “Internet tool pack” from Crescent Technologies. These tools are the interface layer that allow a program to read the browser's current content and detect where the user clicked on the Web page in the browser. They also provide a programmatic interface to automatically fill information on the browser and simulate user actions like clicks.
  • Msxml.dll is a generic XML parser supplied by Microsoft. Many other equivalents are available from leading software manufacturers like IBM, Sun Microsystems, and Netscape.
  • the present invention can also be configured to extract the entire text from a non-HTML page, such as a Word, Excel or PDF file.
  • an “Extract Entire Text” option can be provided to get the entire content of the page.
  • only a Page ID needs to be provided and one would check the “body tag” for extraction.
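By way of illustration only, the following is a minimal sketch of the design-mode flow described in the list above, written in Python rather than with the Visual Basic and MSXML.DLL components the preferred embodiment actually names. It assumes IE-style data islands of the form <xml id="...">, identifies the designer's "clicked" node simply by its tag name, and uses hypothetical file and element names (news_page.html, sample_schema.xml, and the schema/path/children layout), none of which are taken from the patent itself.

```python
import re
import xml.etree.ElementTree as ET

def extract_data_islands(html_text):
    """Return {island_id: xml_string} for <xml id="..."> blocks embedded in an HTML page."""
    islands = {}
    pattern = r'<xml\s+id="([^"]+)"\s*>(.*?)</xml>'
    for match in re.finditer(pattern, html_text, re.IGNORECASE | re.DOTALL):
        islands[match.group(1)] = match.group(2).strip()
    return islands

def node_path(root, clicked_tag):
    """Walk the tree node by node and return the path from the root down to
    the first node whose tag matches the designer's 'clicked' node."""
    def walk(node, path):
        path = path + [node.tag]
        if node.tag == clicked_tag:
            return path
        for child in node:
            found = walk(child, path)
            if found:
                return found
        return None
    return walk(root, [])

def save_schema(source, island_id, clicked_tag, root, filename, max_rows=50):
    """Record the design-time instructions (source, data island, the clicked
    node's full path to the root, and its child node names) as a schema file."""
    path = node_path(root, clicked_tag)
    if not path:
        raise ValueError("clicked node %r not found" % clicked_tag)
    clicked = root if root.tag == clicked_tag else root.find(".//" + clicked_tag)
    schema = ET.Element("schema")
    ET.SubElement(schema, "source").text = source
    ET.SubElement(schema, "island").text = island_id or ""
    ET.SubElement(schema, "path").text = "/".join(path)
    ET.SubElement(schema, "rows").text = str(max_rows)
    children = ET.SubElement(schema, "children")
    for child in clicked:
        ET.SubElement(children, "child").text = child.tag
    ET.ElementTree(schema).write(filename)

# Example design session against a saved HTML page with an "rss" data island:
# the designer picks the island, "clicks" the item node, and the instructions
# are written out for later playback.
html_text = open("news_page.html", encoding="utf-8").read()
islands = extract_data_islands(html_text)
root = ET.fromstring(islands["rss"])
save_schema("news_page.html", "rss", "item", root, "sample_schema.xml")
```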
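A matching playback-mode sketch follows, again in Python and again with hypothetical names. It assumes the schema layout written by the design-mode sketch above, re-extracts the named data island from the enclosing page when one was recorded, and writes the output in the form described above: the first line carries the column names and each following line carries one row of data.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

def load_live_xml(source, island_id=None, timeout_seconds=30):
    """Fetch the live source from a URL or a local file; failures (e.g., a Web
    server being down, or a timeout) surface as exceptions the caller can report.
    If a data island was recorded, it is re-extracted from the enclosing page."""
    if source.startswith(("http://", "https://")):
        with urllib.request.urlopen(source, timeout=timeout_seconds) as response:
            text = response.read().decode("utf-8", errors="replace")
    else:
        text = open(source, encoding="utf-8").read()
    if island_id:
        pattern = r'<xml\s+id="%s"\s*>(.*?)</xml>' % re.escape(island_id)
        match = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
        if match is None:
            raise ValueError("data island %r not found" % island_id)
        text = match.group(1)
    return ET.fromstring(text)

def playback(schema_file, output_file):
    schema = ET.parse(schema_file).getroot()
    clicked_tag = schema.findtext("path").split("/")[-1]
    max_rows = int(schema.findtext("rows"))
    columns = [child.text for child in schema.find("children")]

    root = load_live_xml(schema.findtext("source"), schema.findtext("island"))
    rows = []
    # Inspect the live tree node by node; every node matching the clicked
    # node's tag contributes one relational row built from its children.
    for node in root.iter(clicked_tag):
        rows.append([node.findtext(column, default="") for column in columns])
        if len(rows) >= max_rows:
            break

    with open(output_file, "w", encoding="utf-8") as out:
        out.write("\t".join(columns) + "\n")          # column names first
        for row in rows:
            out.write("\t".join(row) + "\n")

playback("sample_schema.xml", "item_table.txt")
```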

Abstract

A system and method for automated browsing and data extraction from Internet Web sites. Our preferred method and system selects various data elements within the Web site during a design phase and extracts data from the Web site based on the matching of the selected data elements at the Web site during a playback phase. Another preferred method and system extracts XML data based on matching previously selected XML data elements during a design phase with XML data elements present during a playback phase, and reformats the extracted XML data into a relational format.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Application No. 60/174,747, filed Jan. 4, 2000, U.S. Provisional Application No. 60/166,247, filed Nov. 18, 1999, and U.S. Provisional Application No. 60/171,143, filed Dec. 16, 1999, the disclosures of which are hereby incorporated by reference herein.
FIELD OF THE INVENTION
The present invention relates to a method and system for automated browsing and data extraction of data found at global communication network sites such as Web sites that include HTML or XML data.
BACKGROUND OF THE INVENTION
The Internet is becoming the de facto default network for people and computers to connect to each other because of its truly global reach and its free nature. HTML (HyperText Markup Language) is the widely accepted standard for human interaction with the Internet and particularly the World Wide Web (the “Web”). HTML, in conjunction with a browser, allows people to communicate with other people or computers at the other end of the network.
The Internet can also be viewed as a global database. A large amount of valuable business information is present on the Internet as HTML pages. However, HTML pages are meant for human eyes, not for a computer to read them, posing serious limitations on how that information can be used in an automated manner.
HTML Web pages are built as HTML tags within other tags, in effect forming a “tree”. Certain automated browsers interpret the hierarchy and type of tags and render a visual picture of the HTML for a user to view. HTML data-capture technology currently available follows a paradigm of “design” and “run” modes. In design mode, a user (e.g., a designer), through software, locates Web sites and extracts data from those sites, by way of an “example”. The software program saves the example data and in the “run” mode, automatically repeats the example for the new data. However, most Web pages can, and do, change as frequently and as much as their Webmaster desires, sometimes changing the tree hierarchy completely between design time and run time. As a result, reliable extraction of data, including business data, from an HTML page becomes a challenging task.
There are certain known methods for extracting this information. For example, OnDisplay Inc. of San Ramon, Calif. has a “CenterStage eContent” product that can access, integrate and transform data from multiple HTML pages. OnDisplay's HTML data recognition algorithm works by remembering the depth and location of the required business information within the HTML “tree” between the design and run modes.
As another example, Neptunet Inc., of San Jose, Calif., provides for a system comprising a method, whereby, after getting the Web data, all further processing of that data has to be programmatically specified. Neptunet's HTML data recognition algorithm works by remembering the depth and location of the required business information within the HTML “tree” between the design and run modes.
Other HTML data capture mechanisms include methods whereby HTML data extraction is performed by specifying (i.e., hard coding) the exact HTML tag number of the data to be extracted using a programming language such as Visual Basic or Visual C++. The drawbacks of these types of methods is that at the slightest change in the appearance of the Web page, the program has to be changed, making it an impractical solution for reliable data processing solutions.
HTML is a very useful information presentation protocol. It allows visually pleasing formatting and colors to be set for data being presented to make it more understandable. For example, a stock price change can be shown in green color if the stock is going up and in red if the stock is going down, making the change visually and intuitively more understandable.
But more and more, the Internet is also being used for machine to machine (i.e., computer to computer) communications. While HTML is a wonderful mechanism for the purpose of human interaction, it is not ideally-suited for computer to computer communication. It has the main disadvantage for this purpose that there is no way for the data being sent to be described as “what” the data is supposed to represent. For example, a number “85” appearing on a Web stock trading screen in the browser may be the stock price or the share quantity. The data just gets shown in the browser and it is the human being looking at the browser who knows what numbers mean what because of casual context information shown around the data. But in machine to machine communication, the receiving computer lacks the context resolution intelligence and has to be told very specifically that the number “85” is the stock price and not the share quantity.
The need for correct and specific understanding of the data at the receiving computer's end has been conventionally satisfied via EDI (Electronic Data Interchange), where the sending and receiving computers have to be synched up to agree on the sequence, length and format of the data elements that can be sent as a complete message. This mechanism, while it works, is cumbersome because of the prior agreement required between the two computers and hence can be used effectively only in a network of relatively few computers in communication with one another. It does not work in an extremely large network like the Internet.
The void of clarity of data definition in a large network is being filled today by a new Internet protocol called XML (Extensible Markup Language). XML provides a perfect solution to specify explicitly and clearly what each number reaching the receiving computer is supposed to be. XML has a feature called “tags” which go with the data and describe what the data is supposed to be. For example, the stock price will be sent in a XML stream as:
<Stock Price> 85 </Stock Price>
The “/” in the second tag signifies that the data description for that data element is complete. Other tag pairs may follow, describing and giving values of other data elements. This allows computer to computer data exchange without needing a prior agreement between the computers about how the data is formatted or sequenced. Additionally, XML is capable of showing relationships between pieces of data using a “tree” or hierarchical structure.
But XML has its own unique problems. While useful as data definition mechanisms, XML tree structures cannot be fed to existing data manipulation mechanisms operating on relational (tabular) data formats using well known languages like SQL.
It is believed that OnDisplay, Neptunet and WebMethods are companies allowing a fairly user-friendly design time specification of XML data interchange between computers, saving the specifications and reapplying them at a later point in time on new data. Several companies offer point-and-click programming environments with varying capabilities. Some are used to generate source code in other programming languages, while others execute the language directly. Examples are Visual Flowcoder by FlowLynx, Software-through-pictures by Aonix, ProGraph by Pictorius, LabView by National Instruments and Sanscript by Northwoods Software. All of these methods lack the critical built-in ability to capture and use Web based (HTML/XML) real-time data.
SUMMARY OF THE INVENTION
One aspect of the present invention provides a computer-implemented method for automated data extraction from a Web site. The method comprising: navigating to a Web site during a design phase; extracting data elements associated with the Web site and producing a visible display corresponding to the extracted data elements; selecting and storing at least one Page ID data element in the display from the data elements; selecting and storing one or more Extraction data elements in the display; selecting and storing at least one Base ID data element having an offset distance from the Extraction elements; setting a tolerance for possible deviation from the offset distance; and renavigating to the Web site during a playback phase and extracting data from the Extraction data elements if the Page ID data element is located in the Web site and if the offset distance of the Base ID data element has not changed by more than the tolerance.
Preferably, the user-specific information is entered into the Web site and used in connection with producing the data to be extracted from the Extraction data elements. The data elements preferably are HTML elements. The visible display may comprise a grid containing rows and columns including information about each of the data elements extracted. Desirably, the information comprises, for each data element, fixed information of grid row number, HTML tag number and visible text, and user-selected information of Page ID, Base ID, Extract and tolerance. Also preferred, a position of the Page ID data element within the Web site is stored and the extracting occurs during the playback phase if the Page ID data element has not changed position. Further, the Page ID data element is desirably selected as a data element that is unlikely to change position upon reformatting of the Web site, and the display contains data desired to be extracted.
Another aspect of the present invention provides a computer-implemented method for automated data extraction from a Web site, comprising: navigating to a Web site during a design phase; extracting data elements associated with the Web site and producing a visible current display grid corresponding to the extracted data elements; selecting and storing at least one Page ID data element in the current display from the data elements; selecting and storing one or more Extraction data elements in the current display; selecting and storing at least one Base ID data element in the current display having an offset distance from the Extraction elements; entering a tolerance in the current display for possible deviation from the offset distance; displaying a playback display grid during a playback phase with the selected Page ID data element, the Extraction data elements, and the Base ID data element; renavigating to the Web site; extracting data elements associated with the Web site to the visible current display grid; and comparing the extracted data elements in the current display grid with the playback display grid and extracting data from the Extraction data elements if the Page ID data element is found in the current display grid and if the offset distance of the Base ID data element has not changed by more than the tolerance. Preferably, the tolerance comprises a forward and backward tolerance.
A further aspect of the present invention provides a computer-implemented method for automated browsing of Web sites on a global communications network and for extracting usable data, comprising: accessing at least one Web site page containing data, wherein the data comprises a hierarchy of HTML tags; transforming the hierarchy of tags into a computer-readable list; identifying a base data element from the list; identifying an offset from the base data element to the usable data; and extracting the usable data for use by a user regardless of changes to the Web site, provided that the offset between the base data element and the usable data does not change. Desirably, the offset is identified during a design phase and saved for use in a run time phase, which extracts the usable data.
Another aspect of the present invention provides a computer-implemented method for automated browsing Web sites and for extracting usable data, comprising: filling a current display grid with rows of HTML data elements from at least one Web site page currently selected by a Web browser; displaying in a playback display grid previously-stored HTML data elements; examining the rows of the playback grid to locate an HTML data element previously selected as a Page ID data element; comparing the rows of the current grid to locate an HTML element that matches the Page ID data element; examining the rows of the playback grid to locate HTML data elements previously selected as Extraction data elements and a Base ID data element used as a reference for locating the Extraction data elements; comparing the rows of the current grid to locate HTML elements that match the Extraction data elements and match the Base ID data element; and extracting data from the Extraction data elements regardless of changes to the Web site, provided that the Page ID elements match and any offset between the Base ID elements is within a predetermined tolerance.
A still further aspect of the present invention provides a computer-based system for automatically browsing Web sites, comprising a client computer and a server computer for receiving requests from the client computers over a network connecting the client and server computers, the client computer running an application to: navigate to a Web site during a design phase; extract data elements associated with the Web site and produce a visible display corresponding to the extracted data elements; select and store at least one Page ID data element in the display from the data elements; select and store one or more Extraction data elements in the display; select and store at least one Base ID data element having an offset distance from the Extraction elements; set a tolerance for possible deviation from the offset distance; and renavigate to the Web site during a playback phase and extract data from the Extraction data elements if the Page ID data element is located in the Web site and if the offset distance of the Base ID data element has not changed by more than the tolerance.
Another aspect of the instant invention provides a computer-implemented method for automated XML data extraction, comprising: identifying selections of XML data elements for extraction from a source of XML data comprising XML data stored in XML format; storing information related to the identified selections of XML data elements for subsequent use; acquiring the source of XML data and retrieving the XML data elements; comparing the retrieved XML data elements to the identified selections and extracting only the data from the XML data elements that correspond to the identified selections; and reformatting the extracted XML data into a relational format. The source of XML data can be a Web site or a file. The extracted data may be saved into a relational data table, and the reformatted extracted XML data is passed to a calling application.
A further aspect of the instant invention provides a computer-implemented method for automated XML data extraction, comprising: navigating to a Web site containing XML data; identifying selections of XML data elements for extraction from the Web site, the XML data comprising data elements containing the data stored in XML format; storing information related to the identified selections of XML data elements for subsequent use; re-navigating to the Web site and retrieving the XML data elements; comparing the retrieved XML data elements to the identified selections and extracting only the data from the XML data elements that correspond to the identified selections; and reformatting the extracted XML data into a relational format. The extracted data is desirably saved into a relational data table.
A yet further aspect of the present invention provides a computer-implemented method for automated XML data extraction, comprising: navigating a client computer to a Web site containing XML data; generating a graphical tree structure on the client computer to display XML nodes and subnodes representing the XML data at the Web site; selecting one or more of the nodes and/or subnodes from the tree structure associated with the data to be extracted; storing information related to the selected nodes and/or subnodes; renavigating the client computer to the Web site and retrieving the XML data using the information; comparing the retrieved XML data with the selected nodes and/or subnodes and extracting only the data corresponding to the selected nodes and/or subnodes; and reformatting the extracted XML data into a relational format. Desirably, selecting one subnode under a parent node automatically selects all subnodes under the parent node.
Another aspect provides a computer readable medium storing a set of instructions for controlling a computer to automatically extract desired XML data from a source of XML data, the medium comprising a set of instructions for causing the computer to: identify selections of XML data elements for extraction from a source of XML data comprising XML data stored in XML format; store information related to the identified selections of XML data elements for subsequent use; acquire the source of XML data and retrieve the XML data elements; compare the retrieved XML data elements to the identified selections and extract only the data from the XML data elements that correspond to the identified selections; and reformat the extracted XML data into a relational format.
A still further aspect provides a computer-based system for automated XML data extraction, comprising a client computer and server computer for receiving requests from the client computer over a network connecting the client and server computers, the client computer running an application to: identify selections of XML data elements for extraction from a source of XML data contained at the server computer and comprising XML data stored in XML format; store information related to the identified selections of XML data elements for subsequent use; acquire the source of XML data and retrieve the XML data elements; compare the retrieved XML data elements to the identified selections and extract only the data from the XML data elements that correspond to the identified selections; and reformat the extracted XML data into a relational format.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1A is a depiction of the program user interface used in accordance with a preferred embodiment of the present invention, displaying an HTML screen from a Web page.
FIG. 1B is a depiction of the user interface showing the current and playback grids.
FIG. 2 is a depiction of the user interface displaying one Web page of a Web site and the design grid.
FIG. 3 is a depiction of the design grid used in accordance with a preferred embodiment of the present invention.
FIG. 4 is a flowchart of the design phase of one embodiment of the present invention.
FIG. 5 is a depiction of the user interface showing one Web page of a Web site and the design grid used in accordance with a preferred embodiment of the present invention.
FIG. 6 is a depiction of the user interface showing a Web page and design grid with a refreshed grid with new data from the Web page.
FIG. 7 is a depiction of the user interface showing a second Web page containing HTML data and showing the design grid and the selection of the Base ID data element.
FIG. 8 is a depiction of the user interface showing a second Web page containing HTML data and showing selection of an Extraction element including the data to be extracted from the Web page.
FIG. 9 is a depiction of the user interface showing a second Web page containing HTML data and showing user specified variable names and extraction patterns of the data to be extracted.
FIG. 10 is a depiction of a schema file storing information recorded during the design phase of one embodiment of the present invention.
FIG. 11 is a depiction of the user interface showing a Web page and the current grid and playback grids.
FIG. 12 is a flow chart of the playback phase of one embodiment of the present invention.
FIG. 13 is a depiction of the user interface showing the current and playback grids and the information from the playback grid submitted to the Web page.
FIG. 14 is a depiction of the user interface showing a second Web page and current and playback grids associated with a second Web page.
FIG. 15 is a depiction of a program user interface used in connection with automated browsing of XML data Web sites in accordance with another embodiment of the present invention.
FIG. 16A is a depiction of XML data islands extracted from a Web page containing embedded XML data islands,
FIG. 16B is a depiction of XML data islands extracted from a XML file containing embedded XML data islands.
FIG. 17 is a depiction of the user interface showing an XML file and data islands associated with the file.
FIG. 18 is a depiction of the user interface further showing node details for one of the data islands associated with the XML file.
FIG. 19 is a depiction of the user interface displaying tree nodes and data from the rss data island shown in FIG. 18.
FIG. 20 is a depiction of the user interface showing user-selected nodes which have been highlighted.
FIG. 21 is a depiction of the user interface showing details of the user-selected nodes from FIG. 20.
FIG. 22 is a chart listing items that are written to a file storing the various user-entered information to be used in accordance with a second embodiment of the present invention.
FIG. 23 is a depiction of the arrangement of the file in its stored format.
FIG. 24 is a depiction of the user interface showing a user opening a previously-saved file to be used in connection with the playback mode.
FIG. 25 is a depiction of two files showing the final relational version of the data extracted in accordance with a second embodiment of the present invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
The present invention provides various preferred embodiments to provide a unique, instantly deployable business solution construction and delivery mechanism to make use of live Web data and conventional data. The present invention, in the embodiments described below, is preferably implemented in the form of software forming a computer program that is adapted to run on a computer. Preferably, the computer is a client side computer which is adapted to connect to a server side computer. More preferably, the computers are connected via the Internet, with the client computer running a program to allow Web browsing and the server computer running a Web server program, as is typical with Internet and Web communications.
The program preferably provides a user interface which the user can manipulate during a design phase and from which the user can later call back saved settings during a playback phase. The program can be stored on various types of storage media such as floppy disks, CD ROM, hard disk, RAM, etc. and is preferably installed on the client computer. In a typical setup, a Web server is adapted to connect to the Internet or Web in a typical way to deliver Web pages to client computers. Client computers run the program implementing the various embodiments of the present invention which allow browsing of Internet Web sites, data extraction from such Web sites and reformatting of the data found at such Web sites. Client computers connect to the Internet in a typical fashion (e.g., dial-up access, cable modem, DSL, T-1 connection, etc.). Client computers typically include standard operating systems such as Microsoft Windows (CE, 95, 98, NT, 2000), MAC OS, DOS, Unix, etc. Client computers can comprise other devices beyond PCs which can connect to the Web in a wired or wireless fashion, such as PDAs, notebook computers, mobile phones, etc.
In a preferred embodiment of the present invention, a method and system is provided for automated browsing and data extraction from Web sites that include HTML data. Some Web pages, however, embed other types of data such as data stored in formats such as Microsoft Word or Excel or in Adobe Acrobat (PDF files). The present invention can also automatically detect that a Web page is not an HTML page and still extract the correct data from other such file types. FIG. 1A is a screen shot of a typical Internet HTML screen from a Web site 1 as it appears to the user, showing useful information like current prices for five stocks in a chart 2, which stocks are Microsoft, IBM, DELL, Oracle, and Hewlett-Packard. The present invention is preferably in the form of a Web browser application 4, which looks like a normal Web browser screen, but has additional buttons 5 and menu options 6 on the top. The application is programmed to bring up additional specific screen elements when certain buttons 5 are pressed on the top menu bar.
A preferred embodiment of the present invention includes a built-in Web browser that is used to actually navigate the Web. This browser can accept commands from a program in addition to commands from a user. The present invention is preferably implemented in two stages, a design phase and a playback phase. In the design phase, when the user wants to start recording the sequence of automatic navigation/extraction steps, the user clicks on the “Record” button 7 on the menu bar. As shown in FIG. 1B, this action preferably produces a design grid 8 at the bottom of the Web browser that displays the content of the current page loaded in the Web browser (here the URL of the page is http://us.yimg.com/i/fi/c/zc.html).
At this point, the user can give instructions in the design grid 8 relevant to what information from the Web page needs to be extracted, which information needs to be entered in which entry slot on this screen, and which button or hyperlink needs to be pressed for the next Web page to show up. As described in more detail below, checking the appropriate check boxes and entering information in the appropriate row in the grid achieves these actions.
The process of understanding which Web page is currently being processed, what data is to be entered on which element on the page, which buttons or links are to be clicked, and retrieving desired pieces of information from the Web page (i.e., all processing that happens on a single Web page) is referred to herein as a “step”. Many pages can be processed one after another in a continuous string. The string, referred to herein as a “surflet” because it contains information on surfing one specific site for one specific purpose, can be saved under a special user given name as an XML file and can be re-used again in the future to achieve exactly the same navigation and data extraction from the same site but for new, refreshed data.
For example, once saved by the designer, the stock information surflet can be played back automatically by other users for their own customized list of stocks (say Wal-Mart and AOL) as desired by a program. There is no relationship between the stock list used at design time and the actual stock list used at playback time. As long as the same Web page supplies the same business information in a similar layout, the recorded instructions can be played back automatically.
In the playback phase, a user or a program tells the program which saved (recorded) surflet is to be played back and with what input parameters. In this example, the input parameters are the stocks the particular user is interested in. The steps within the surflet are repeated (played back) automatically and the extracted new business information is returned to the user or program initiating the request.
In the event that the Web page layout changes drastically, or if the Web server is down at playback time, the program will return to the caller with a message explaining why the surflet could not find what it was instructed to find.
Even though the above example is for stocks, this aspect of the present invention can be applied to automatically navigate and extract any type of data from Web pages from any Web site.
Details of the design phase of the HTML aspect of the present invention now follow. FIG. 2 shows the user interface of the program, depicting a Web browser component 4, design grid 8, and menu buttons 5.
By way of background, a Web browser is an application that displays information from the Internet in page format. Microsoft Internet Explorer 4.0, Netscape 4.6, and Eudora are examples of a Web browser. A Web page is a visible interface written in HTML language that displays information over the Internet. The HTML language is a finite set of formatting tags that are read by the Web browser, which decides how to present the information as a Web page. When a Web page is loaded inside a Web browser, the Web browser parses the formatting tags inside the Web page, and creates a “tree-like” structure in its memory, based on the relation of the tags to each other. This internal tree-like memory structure is referred to herein as a “Web document.” It is normally not shown to the user and is required only if a user or an application wants to manipulate the content or information on the Web page programmatically.
A Web page written in HTML language consists of numerous formatting tags. When a Web page is read by the Web browser, the formatting tags inside the Web page are identified by the Web browser as HTML elements. The HTML elements are read from top down, and are arranged in a hierarchical fashion. If one element has other elements inside it, those elements are treated as child elements, and are at a level lower than the original element in the hierarchy.
When a Web page is loaded, the HTML elements that constitute a Web page are read by the Web browser in an ascending order, and are assigned a number (starting from 1) for identification. This number is called the “source index” of that HTML element.
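As a rough illustration only, this numbering can be reproduced with a simple parser that assigns an incrementing index to each element as it is encountered from the top down. The sketch below uses Python's built-in html.parser module rather than the MSHTML document object the preferred embodiment relies on, and the sample page is invented for the example.

```python
from html.parser import HTMLParser

class SourceIndexer(HTMLParser):
    """Assign an incrementing 'source index' to every element, top down."""
    def __init__(self):
        super().__init__()
        self.next_index = 1
        self.depth = 0
        self.elements = []            # (source_index, tag, nesting depth)

    def handle_starttag(self, tag, attrs):
        self.elements.append((self.next_index, tag.upper(), self.depth))
        self.next_index += 1
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

indexer = SourceIndexer()
indexer.feed("<html><body><table><tr><td>IBM</td><td>85</td></tr></table></body></html>")
for source_index, tag, depth in indexer.elements:
    print(source_index, "  " * depth + tag)
# Prints 1 HTML, 2 BODY, 3 TABLE, 4 TR, 5 TD, 6 TD, indented by nesting level.
```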
The type of HTML elements that can accept information, or which allow the information presented to be changed, are called “Web page controls.” In short, these Web page controls allow user interaction to change its content. Examples of Web page controls are TEXT, TEXTAREA, CHOICE, SELECT, RADIO, SUBMIT, RESET, and PASSWORD.
The design grid 8 is a mechanism which displays the information about the HTML elements in the HTML document currently loaded in the Web browser. The design grid 8 has several columns, some of which describe the properties of the HTML tags, and others which accept information from the user on what to extract and how to extract it. The other columns contain instructions for fine-tuning the data extraction process. An example of the design grid is shown in FIG. 3.
Many pages can be processed one after another in a continuous string. The string can be saved under a special user given name as an XML file. It is referred to herein as a surflet because it contains information on surfing one specific site for one specific purpose.
A brief description of the columns in the design grid follows. As shown in FIG. 3, which depicts a sample design grid (and playback grid), the grid row number 30 is the actual row number of the HTML element in the design grid, such as HTML element 20 (FIG. 2). This number plays an important role in detecting the same business information in spite of possible Web page layout changes when this information is retrieved for playback. FIG. 3 shows a consolidated image of various parts of the same grid in one picture frame.
The HTML Tag Number column 32 is the Source Index of the HTML element from the HTML document. This is a read-only property provided as an attribute of the HTML element. The tag number is useful in identifying the exact element inside an HTML document, and to perform operations like extracting values from that element, or posting some data to that element.
The Tag Type column 34 is the Type attribute given to an HTML element in an HTML document. This attribute is provided to the control type of HTML elements only. The formatting tags like TD, TR, P, etc. do not have a tag type attribute. The Web page controls are TEXT, TEXTAREA, CHOICE, SELECT, RADIO, SUBMIT, RESET and PASSWORD.
The Visible Text column 36 displays the text contained inside every HTML element that is displayed inside design grid 8. The controls on the Web page are displayed with their default text. The TEXT, PASSWORD, TEXTAREA controls are generally kept blank for the user to enter values. The SELECT control 37 usually shows the first item in its list as the default item selected. The RADIO or CHOICE may or may not be selected by default.
The HTML tag specific information is automatically filled in design grid 8 when the grid is displayed. The designer has to supply the following information in the appropriate columns against an appropriate HTML tag.
A Web page will change its data content and also possibly its data presentation format between the time a recording is done in the design phase and later played back in the playback phase. At playback time, the user needs to ascertain that he/she is working with the same Web page that was used during the recording of the surflet. To achieve this, the user identifies a firm piece of information on the Web page that has a low probability or the least probability of being modified or changed when the Web page data or format is modified. This piece of information will work as a guide during the playback phase, based on which the program will decide whether the Web page is the right Web page or not. This piece of information is called the Page ID data element, shown in column 38. Here, the user selected “Yahoo! Finance” as the Page ID data element since it is very unlikely this information will change at this Web site.
The user thus has to examine the Web page for such stable information (Page ID text), and then click the check box in the Page ID column 38 against the same Page ID text in the design grid 8. There can be more than one such Page ID data element on one Web page. All record-time Page IDs must remain unchanged (within a tolerance limit as described below) for the playback phase to determine that the Web page is the correct Web page. If a match is not found for the Page ID data element, the user is notified that the Web page in question has changed beyond recognition and no further processing is done.
A Base ID data element, contained in Base Column 40, is an HTML data element that acts as a starting reference or base or anchor for other HTML elements during data extraction. The designer identifies one or more such HTML elements which have a high or the highest probability of appearing in the same relative position from the data to be extracted, even if the Web page undergoes modifications.
The design assumes that if the Base ID data element has moved up or down the HTML element hierarchy because of some changes made on the Web page, all the HTML elements associated with data extraction or data submission also will have moved up or down the HTML element hierarchy by the same number.
The Xtract column 42 is of type check box. The check box is clicked if the information contained in the Extraction data element (as seen in the Visible Text Column 36) associated with a check box is to be extracted from the Web site.
The Variable Name column 44 is a user-defined variable name that will contain the extracted business data from the Web browser page. This variable can be supplied to other functions as a variable, or can be set from other applications inside the program to receive input values to be entered on the Web page.
The Forward Tolerance column 46 is a numeric field. The application will go “down” the HTML elements list in the grid within this forward tolerance limit to find a match for the HTML element in consideration. A typical tolerance number is 10, meaning that it is acceptable for the HTML element to wander a little bit here and there as a result of Web page design changes, as long as it is within 10 positions of the recording time position of the same element. The Backward Tolerance column 48 is also a numeric field. The application will go “up” the HTML element tree within this limit to find a match for the HTML element in consideration.
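A minimal sketch of this tolerance search follows, assuming each grid row has been reduced to a (tag type, visible text) pair; the real grids carry more columns, and the sample rows below are invented for the example. The search starts at the recorded position and widens outward within the forward and backward limits, reporting the offset by which the element has drifted.

```python
def find_with_tolerance(current_grid, recorded_row, recorded_position,
                        forward_tolerance=10, backward_tolerance=10):
    """Return (new_position, offset) of the row matching recorded_row, or None
    if no match lies within the allowed tolerances (page changed too much)."""
    candidates = [recorded_position]
    for step in range(1, max(forward_tolerance, backward_tolerance) + 1):
        if step <= forward_tolerance:
            candidates.append(recorded_position + step)   # look "down" the grid
        if step <= backward_tolerance:
            candidates.append(recorded_position - step)   # look "up" the grid
    for position in candidates:
        if 0 <= position < len(current_grid) and current_grid[position] == recorded_row:
            return position, position - recorded_position
    return None

# The Page ID row recorded at design time sat at grid index 1; in the
# refreshed page it has drifted down by three rows, which is within tolerance.
current_grid = [("P", "Quotes"), ("IMG", "Ad banner"), ("P", ""),
                ("P", "Markets open"), ("IMG", "Yahoo! Finance")]
print(find_with_tolerance(current_grid, ("IMG", "Yahoo! Finance"), 1))  # (4, 3)
```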
The Xtract All Rows column 50 defines a fixed pattern to be extracted. The designer provides a unique Extract All ID to all the fields that will form the pattern. The number of rows filled in the grid with this unique name form the number of columns in the extraction pattern. This column is useful to define the extraction of data formatted in tabular form. In effect this column is used by the user to specify one complete row of the table to the software. The implied instruction to the software is to Extract All similar rows.
The Stop At column 52 means to stop extracting at certain text. The Extract All Rows pattern looks for the text entered in this field to stop the extraction. If not provided, the end of the document is assumed to be the end of the extraction pattern.
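The following sketch shows the combined effect of the Extract All Rows and Stop At columns, assuming the visible text of the cells following the Base ID anchor has already been collected into a flat list; the column count corresponds to the number of grid rows the designer tagged with the same Extract All ID, and the sample data is invented.

```python
def extract_all_rows(cells, column_count, stop_at):
    """Collect consecutive groups of column_count cells until stop_at is seen
    (or until the end of the document, if stop_at never appears)."""
    table = []
    row = []
    for cell in cells:
        if cell == stop_at:
            break
        row.append(cell)
        if len(row) == column_count:
            table.append(row)
            row = []
    return table

cells = ["MSFT", "85", "+1.5%", "IBM", "112", "-0.3%",
         "DELL", "43", "+0.8%", "Recent News", "MSFT cuts prices"]
for row in extract_all_rows(cells, 3, "Recent News"):
    print(row)
# ['MSFT', '85', '+1.5%'], ['IBM', '112', '-0.3%'], ['DELL', '43', '+0.8%']
```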
The extracted information is passed from the Web browser to the design grid, and ultimately to the surflet.
As shown in the flow chart of FIG. 4 (the design phase loop), the Web page is first loaded in the Web browser and is read in the memory at action 60. The information on this Web page is displayed in the design grid at action 62. The user then submits supporting instructions in the design grid at action 64 and also submits some information on the Web page at action 66. Before the Web page is submitted to the server to receive a new page at action 68, the design grid is resynchronized with the current Web page so that it will contain the most current version of the Web page, and this design grid is saved to the system's memory as one “step”. The Web server returns a Web page as a response to the earlier request. This new page is displayed in the Web browser and its content is shown in the design grid at action 70. The process continues until the user decides to stop the recording at action 72. When stopped, the steps are saved in the surflet for future playback.
Actions preferably performed by the user are actions 64, 66, 70 and the stop recording action 72. Operations preferably performed by the program include actions 60, 62, 68 and the write action 72.
Referring now to FIG. 5, in the design phase, the designer enters the URL in the address box 80 and brings up the design page. Next, the designer clicks the record button 7 to start recording the navigation and extraction process. The click of the Record button 7 populates the design grid 8 with the information contained in the currently loaded Web page in the Web browser.
When the user clicks the Record button 7, the program automatically copies the Web page HTML from the Web browser to a Mshtml.dll document, which automatically makes a list of HTML tags and their names and values available to the program. A utility is preferably provided that can allow or disallow certain tags to be included in the design grid based on the user's preference. Each element in the document is checked whether it is included in the included list. If it is included, it is processed further to include its information in design grid 8.
In the next step, the designer enters the stock symbols in the Text Control box 82 to get back more information on those stocks. But the designer has not clicked on any buttons or hyperlinks yet.
In this example, the designer believed that whatever changes may take place on the Web page, the image 20 that says “Yahoo Finance” will always appear at the same place. In short, that image will always be there to identify at a later time that this is the correct page that will be automatically visited. Therefore, the designer eyeballs the lines in the design grid 8, locates the line 84 that contains “Yahoo! Finance” in the Visible Text Column and Image “IMG” 86 in the Tag Type column and clicks the check box 88 in the Page ID column. The tolerance given is zero, which means that when this recording is played back, and when this particular Web page is brought up, the program will search for the “Yahoo! Finance” image to be at the exact current location (i.e., grid row number 2, in the design grid). Preferably, no deviation is allowed. In an alternate embodiment, some tolerance can be given to the position of the given element. For instance, if given a tolerance of 20, the program will look for 20 tags above or below the current tag position of “Yahoo!Finance.” Such a tolerance can be applied to all elements on the HTML page. If at playback time the program does not find the “Yahoo! Finance” image at that exact position in the list of tags, the program will assume that the Web server has responded with a brand new layout page or error message page and no data can be extracted in such situations until the user re-records the data extraction for the changed page.
In this example, the designer does not wish to extract any data from this first page of recording, and is ready to receive the detailed information on the stock symbols that was entered in the text control. Therefore, the designer clicks the Submit Button 83 (“Get Quotes”) on the Web browser.
As explained above, some information (stock symbols) was entered in the Web browser after that page information was displayed in the design grid. Therefore, the grid needs to be updated with the information on operations performed on the Web page in the Web browser, including the click of the submit button, before the Web page is sent back to the Web server. This update of design grid is necessary because after the update, the grid will carry entire information that is self-sufficient to reproduce this recording in its entirety.
Before the Web page is submitted to the Web server for a response, an event in the Web browser called BeforeNavigate is always triggered automatically. This allows the program a chance to extract the relevant information from the current page before it is sent back to the server. The program inspects the status of every tag in the MSHTML tag list, and all tags that were used by the user as Page ID, Base ID, Extract, or Extract All, or that were clicked on in the browser window, are saved to a memory structure.
When the Web server sends a response back to the client browser, an event in the Web browser called Document Complete indicates to the program that all data has been received, and the program updates the Design Grid with the information contained in this newly-loaded Web page 90 as shown in FIG. 6.
This second Web page 90 has data of interest to the designer who wishes to extract it. As shown in FIG. 7, the table 8 in the Web browser displays the information returned as a result of the user request in the first step. The designer wants to extract the information on the date and time of the stock information on this page, shown at element 100. This is an isolated data extraction. This information can move up or down within the Web page later on. But there are certain key information pieces on every Web page that always appear on that page, albeit their position in the layout of the Web page may have changed a bit. These are referred to herein as the Base ID elements. These Base ID elements act as an anchor with reference to which other data can be extracted. In FIG. 7, the Quotes text 102 has been assigned the role of a Base ID element. The tolerance provided is 10, both for backward and forward tolerances. This ensures that during the playback of this recording, if this second Web page has changed within the provided tolerance for the specified Base ID element, the Base element will still be located from the Web page in the playback.
As shown in FIG. 8, the designer wants to extract the date and time 100 of the stock values returned on the Web page. The designer also has provided a backward and a forward tolerance of 5. This means that during the playback of this recording, if the given text has moved within the provided tolerance, it still will be identified and retrieved.
The designer also wishes to extract the information contained in the table 2, i.e., the stock name 104, the stock value 106, the date and time 108, the percent change 110, and the volume traded 112. The designer identified the column header Symbol 120 as the Base ID element for this second data extraction, and has provided a value of 10 for the backward and forward tolerance.
This Base ID element will act as an anchor to the “Extract All” data extraction from the table.
The designer next has to provide a sample of the pattern that the user wants to extract. As shown in FIG. 9, the first row 130 of the table 2 serves as the sample. In the design grid, the designer provides, in the Variable Name column, a user-friendly name for the data that will be extracted, and also a user-chosen name for this pattern of data extraction (“Row 1” in this example).
The program also needs to understand at what point it should stop extracting data in an Extract All type of extraction. Therefore, some text is provided where the extraction will stop. The text “Recent News” 132 is used as a relatively firm piece of information that should always appear on this particular Web page. Hence it is selected. Now when the Extract All extraction is implemented, it will extract the first row from the table, look to see if it has reached an element that has a text called “Recent News”. If not, the program will continue to extract until it finds that text. Thus, the program will extract data from all the desired rows from the table on the Web page.
This process is repeated until the designer decides to stop the recording by clicking the stop button 134. The designer is provided with an option to save the current recording in a surflet file. Once saved, this file can be reloaded any time, as many times as desired for playback.
The schema file is a representation of the steps that were recorded. As can be seen in FIG. 10, there are some Global variables 140, two folders 142 a and 142 b called Step, and many subfolders 144 called Gridline within those folders.
The variables “PageIDTolerance”, “AnchorTolerance”, and “ExtractTolerance” are applicable to the whole schema, i.e., to all steps. They decide default tolerances, saving the user the work of specifying tolerances on every row. The folder Step corresponds to one step in the recording. That means the file in FIG. 10 has only two steps, which in turn means that only two steps were recorded in the design phase. The Gridline folders 144 under each Step folder each represent one row in the design grid. Each Gridline folder has numerous variable fields, which represent one column each in the design grid. The value of the variable is the value in that row of the associated column of the design grid.
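This structure can be reproduced programmatically. The sketch below writes a surflet with global tolerance variables, one Step node per recorded page and one Gridline node per design-grid row, using Python's ElementTree rather than the components named earlier; the element names and the sample rows are illustrative, not copied from the saved file of FIG. 10.

```python
import xml.etree.ElementTree as ET

def write_surflet(filename, steps, page_id_tol=0, anchor_tol=10, extract_tol=5):
    surflet = ET.Element("Surflet")
    ET.SubElement(surflet, "PageIDTolerance").text = str(page_id_tol)
    ET.SubElement(surflet, "AnchorTolerance").text = str(anchor_tol)
    ET.SubElement(surflet, "ExtractTolerance").text = str(extract_tol)
    for step_rows in steps:                       # one recorded Web page each
        step = ET.SubElement(surflet, "Step")
        for row in step_rows:                     # one design-grid row each
            gridline = ET.SubElement(step, "Gridline")
            for column, value in row.items():     # one design-grid column each
                ET.SubElement(gridline, column).text = str(value)
    ET.ElementTree(surflet).write(filename)

# Two recorded steps: the first page's Page ID row and stock-symbol entry box,
# and the second page's Base ID anchor for the quote table.
write_surflet("stock_quotes.xml", steps=[
    [{"GridRowNumber": 2, "TagType": "IMG", "VisibleText": "Yahoo! Finance",
      "PageID": "YES"},
     {"GridRowNumber": 7, "TagType": "TEXT", "VisibleText": "",
      "VariableName": "StockSymbols"}],
    [{"GridRowNumber": 12, "TagType": "TD", "VisibleText": "Symbol",
      "BaseID": "YES"}],
])
```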
The current grid is the same as the design grid in its layout. This grid is populated during the playback of a recorded surflet only. It always displays the content of the current Web page loaded in this embodiment's Web browser. An example of the current grid is shown in FIG. 11. The upper grid 150 a of the two grids seen is the current grid. The playback grid 150 b is the same as the current grid 150 a in its layout. This grid is populated during the playback of a recorded surflet only. It displays only those rows that were saved as a step during the design phase.
FIG. 12 shows the playback loop flow diagram. The playback starts when the playback button is clicked. Before the actual playback starts, the Web browser should show the same starting Web page from where the playback was recorded earlier.
The first action in the playback is the reading of information from the surflet XML file identified by the user in a file open dialog box. The information in the surflet file is saved during the design phase in such a way that there are as many XML nodes as there are steps. Each row in the design grid for a step is again a separate sub-node to the step node in the surflet file (FIG. 10). Thereafter, the memory structure is filled with information from the surflet file in a similar hierarchical fashion.
As a starting step, the user or a program navigates to the Web page in the design step from where the recording began. The user clicks the playback button, and, in action 160, the current grid is populated with the starting Web page, whereas the playback grid, in action 162, is populated with the previously saved grid information from the memory. Then, validation action 164 is performed where it is ascertained whether the currently loaded Web page is the Web page that the user wants to work with. A check is made based on the identification marks made in the playback grid. At action 166, similar validations are made for elements identified as Base ID. Then, at action 168, each HTML element in the playback grid is searched for in the current grid based on the grid row number, tag type, and tag text. After it has been located, appropriate action is performed (like Extract, Extract All, Update Web page) based upon the information in the playback grid for that element. A click on the current browser is simulated to submit current information to the Web server. This completes one step of the playback loop. When the Web server returns a Web page back, its content is loaded in the current grid. At action 170, the next step from the memory is loaded in the playback grid, and the process continues until all the steps in the memory have been encountered, at which point the loop is exited at action 172.
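A compressed sketch of one pass through this loop follows, with the browser interaction stripped out: the saved rows for one step arrive as dictionaries, the Page ID and Base ID rows are re-located first (re-using the tolerance idea sketched earlier), the Base ID supplies an offset, and each Extract row is then read at its saved position plus that offset. The field names and the tiny sample grids are hypothetical stand-ins for the real grids.

```python
def find_row(grid, saved_row, tolerance=10):
    """Locate a saved row in the current grid by tag type and visible text,
    searching within +/- tolerance positions of its recorded grid row."""
    start = saved_row["GridRow"]
    for position in range(max(0, start - tolerance),
                          min(len(grid), start + tolerance + 1)):
        if (grid[position]["TagType"] == saved_row["TagType"]
                and grid[position]["VisibleText"] == saved_row["VisibleText"]):
            return position
    return None

def playback_step(current_grid, playback_grid, tolerance=10):
    """One playback step: validate Page IDs, re-locate the Base ID to get the
    offset, then read each Extract row at its saved position plus the offset."""
    offset = 0
    for row in playback_grid:
        if row.get("PageID") or row.get("BaseID"):
            position = find_row(current_grid, row, tolerance)
            if position is None:
                raise RuntimeError("%r not found; page changed beyond recognition"
                                   % row["VisibleText"])
            if row.get("BaseID"):
                offset = position - row["GridRow"]
    extracted = {}
    for row in playback_grid:
        if row.get("Extract"):
            cell = current_grid[row["GridRow"] + offset]
            extracted[row["VariableName"]] = cell["VisibleText"]
    return extracted

# The Base ID ("Quotes") has drifted down one row since the recording, so the
# extracted quote time is read one row below its recorded position as well.
current_grid = [
    {"GridRow": 0, "TagType": "P",  "VisibleText": "Advertisement"},
    {"GridRow": 1, "TagType": "B",  "VisibleText": "Quotes"},
    {"GridRow": 2, "TagType": "TD", "VisibleText": "4:00pm ET"},
]
playback_grid = [
    {"GridRow": 0, "TagType": "B",  "VisibleText": "Quotes", "BaseID": True},
    {"GridRow": 1, "TagType": "TD", "VisibleText": "1:30pm ET",
     "Extract": True, "VariableName": "QuoteTime"},
]
print(playback_step(current_grid, playback_grid))   # {'QuoteTime': '4:00pm ET'}
```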
In action 160, the current grid 150 a is filled with information from the Web page currently in view in the Web browser (FIG. 13). Next, at action 162, the information about the first step from memory is displayed in the playback grid 150 b. Next, in the validate Web page action 164, the rows from the playback grid are picked up one by one in a loop to find out whether any of the rows contain an HTML element that has been defined as a Page ID element. When such a row is identified, an attempt is made to find the same HTML element in the current grid 150 a.
In this example, there are four rows in the playback grid. These rows were saved during the first step of the design schema. The first row 180 was identified to be a page marker at that time. The information associated with this row is as follows:
Grid Row Number: 2
HTML Tag Number: 10
Tag Type: IMG (Image)
Visible Text: “Yahoo! Finance”
Forward Tolerance: 0
Backward Tolerance: 0
Page ID: YES (checked)
The information about this row is held in memory when the program starts to search for the exact element in the current grid 150 a. In the example, the first row 181 in the current grid is checked against this saved information from the playback grid and does not match. The program then moves on to the next row 182 in the current grid, which has the following information describing itself:
Grid Row Number: 2
HTML Tag Number: 10
Tag Type: IMG (Image)
Visible Text: “Yahoo! Finance”
The grid row number from the playback grid matches the grid row number of the current grid exactly, as do the tag type and visible text. Therefore, the tolerances of zero suffice, and there is a match between the grids. As a result, the program knows that it is dealing with the same, correct Web page that was used during the design phase.
If the image (Yahoo! Finance) had moved between the time of recording and playback, its grid line number and HTML tag number would have changed in the current grid. In that case, the program would have applied the backward and forward tolerances to find a match.
The forward and backward tolerances operate as a guide that tells the program how far up or down in the grid it should look to find a possible match. In the current example, if the rows in the playback phase had not matched, the program would have had to search the current grid to find a possible match. If the forward tolerance were 5, then from the location in the current grid where the program originally expected to find a match (row 2 in the current grid), the program moves down the current grid one row at a time. The software compares the Visible Text and Tag Type of that row to the saved information of the row declared as Page ID in the playback grid. If a match is found on the third move, then the program knows that the Web page has changed since the last recording, but within the tolerance limit provided. Hence, the program declares that a match is found, albeit with some adjustment.
The adjustment, which is a movement of 3 rows in the example, is called the offset of that element. This means that any subsequent HTML element present in the playback grid is also expected to appear 3 rows below its originally expected position in the current grid.
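The tolerance search and the resulting offset can be pictured with a minimal sketch in which the grid is modeled as a list of dictionaries; the function and field names are illustrative assumptions, not the patent's code:

    # Locate a saved element in the current grid, scanning within the tolerances,
    # and report the offset between where it was expected and where it was found.
    def find_with_tolerance(current_grid, saved_row, forward_tol, backward_tol):
        expected = saved_row["grid_row"]
        candidates = [expected]                                   # exact spot first
        candidates += [expected + i for i in range(1, forward_tol + 1)]
        candidates += [expected - i for i in range(1, backward_tol + 1)]
        for pos in candidates:
            if 1 <= pos <= len(current_grid):
                row = current_grid[pos - 1]                       # grid rows are 1-based
                if (row["tag_type"] == saved_row["tag_type"]
                        and row["visible_text"] == saved_row["visible_text"]):
                    return pos, pos - expected                    # matched row and its offset
        return None, None                                         # no match within tolerance

    saved = {"grid_row": 2, "tag_type": "IMG", "visible_text": "Yahoo! Finance"}
    grid = [{"tag_type": "A", "visible_text": "Home"},
            {"tag_type": "IMG", "visible_text": "Yahoo! Finance"}]
    print(find_with_tolerance(grid, saved, forward_tol=0, backward_tol=0))   # -> (2, 0)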
As described above, a Base ID element is a reference ID for other elements. The presence (though not the exact location) of this element on the Web page is more or less assured, and hence it serves as a good anchor (or base) for locating other information on the same page. The Base ID element itself can move between design time and playback time. In such a case, the backward or forward tolerance is applied to find the correct new location of the element tag that was declared as a Base ID element in the design phase. When the match is found, the offset is determined, which is the difference between the old and the new row number of the Base ID element. All the elements associated with this Base ID element are then identified based on the offset calculated from the new position of the Base ID element.
In this example, as shown in FIG. 14, there are two Base ID elements 191 and 192, one for the stand-alone extraction and the other for the column data extraction. If there is more than one Base ID element, then the elements that lie between two successive Base ID elements are associated with the Base ID element that appears earlier in the playback grid.
Action 166 is essentially a repeat of action 164. The difference is that the offset found in action 164 is added to the row number from the playback grid, and the match is then performed on the current grid. This method takes care of any rippling effect the base element has on the other elements because of its relative displacement in the new Web page. In addition, each element may be assigned a forward and/or backward tolerance, which further helps to cushion changes made to the HTML page since the design was recorded.
For example, in FIG. 14, the third row 193 in the playback grid has instructions to extract some information from the Web page. This row has the following information:
Grid Row Number: 43
HTML Tag Number: 82
Tag Type: P (Paragraph)
Visible Text: Wed Oct 20 10:24am ET - US Markets closes in 5 hours 37 minutes.
Forward Tolerance: 5
Backward Tolerance: 5
Xtract: YES
The Base ID appearing before this extract is the Base ID with the text "Quotes" in row 190. This row has the following information:
Grid Row Number: 21
HTML Tag Number: 48
Tag Type: TD
Visible Text: Quotes
Forward Tolerance: 10
Backward Tolerance: 10
Base ID: YES
The program validated the Base ID element before it reached this extraction. This Base ID has a tolerance of 10 units. Suppose the Web page has changed since the user created this surflet, and some information was added before the text "Quotes". Also, some text was added between the text "Quotes" and the data to be extracted (i.e., "Wed Oct 20 10:24am ET - US Markets closes in 5 hours 37 minutes"). Let the new row number for the text "Quotes" in the current grid be 29 (instead of 21), and the new row number for the element of interest for data extraction be 52 (instead of 43).
Therefore, the offset for the Base ID element will be 29−21=8. The difference (8) is still within the tolerance limit declared for the Base ID (10). Therefore, this Base ID will be located in the current grid at row number 29 and will be validated. The offset of 8 tells the program that all other HTML elements appearing after this Base ID element are also expected to have moved down the grid by at least 8 rows.
To identify the element to be extracted in the current grid, the program takes the row number of that element from the playback grid (43) and applies the offset to that value; here, 43+8=51. The program looks at row number 51 of the current grid for the exact text that is present in the playback grid. But the data of interest has itself moved down because of changes made to the Web page, and its new location in the current grid is row 52. Therefore, the program applies the forward tolerance (of 5) and correctly locates the target at row number 52.
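The arithmetic of this example can be restated in a few lines; the numbers are taken directly from the text above and the snippet is purely illustrative:

    # Base ID "Quotes": saved at row 21, found at row 29 in the changed page.
    offset = 29 - 21                      # 8, within the Base ID tolerance of 10
    expected_extract_row = 43 + offset    # 51: where the extract is looked for first
    actual_extract_row = 52               # where the element really is on the new page
    shift = actual_extract_row - expected_extract_row
    assert shift <= 5                     # within the element's forward tolerance of 5
    print(offset, expected_extract_row, shift)   # prints: 8 51 1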
Now the row number of the element is correctly located in the current grid. The HTML tag number associated with this row is the actual Source Index of that HTML element in the changed HTML page. The handle on the HTML element's source index allows the program to programmatically manipulate that element as instructed.
The tolerances on the Base ID and on the element itself help to cushion changes made to the Web page after it was used to create the surflet. This method assures that, given the right instructions, the program will find the correct HTML element on a given Web page.
At action 168, after the element has been correctly identified, methods such as inner-text and outer-text are invoked to extract the information contained inside that element. An extraction can be a single Extract or an Extract All. When an instruction to extract stands in isolation, meaning that no other element's information is to be extracted along with the one in focus, it is a single extraction.
Sometimes, however, data is presented in HTML tables. The idea is to define a pattern of extraction for the first row, and the program will then extract the remaining rows based on the outlined extraction pattern. To achieve this, the following supporting information is required: the Variable Name, a unique name to hold the extracted value; the Xtract All command, the name given to an extraction pattern of Extract All type (there can be more than one pattern name per playback step); and the Stop At text.
The Stop At text is the text of the element that tells the program where to stop extracting during a pattern search in an Extract All type of extraction. The Stop At element's position adjusts the offset for all further extractions on that page. This allows data appearing after variable-length tables to be correctly extracted, regardless of how many rows existed in the table at design time and at actual run time.
In the example (FIG. 14), the first row of the table 2 has been identified as a pattern for extraction. The program knows the number of rows in a given pattern because the HTML elements in the playback grid that form the pattern share the same name (Row 1). In the example, there are five elements in the pattern. The pattern is read into memory from the playback grid. Each element from memory is read one at a time, and a match is found in the current grid as described above. Likewise, all the elements in memory are matched, after which a check is made as to whether the text given in the Stop At column has been reached. If not, the first element from the memory collection is read again and a match is searched for among the following elements in the current grid, and the other elements are likewise matched in turn, until the text in the Stop At column is reached. As a result, the extraction logic keeps applying the pattern given in the playback grid (identified by Row 1) to the rows in the current grid table until it encounters the "Recent News" text. This results in retrieval of all the rows in the Web page table.
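A condensed sketch of this repeating-pattern extraction follows; it assumes the pattern elements occupy consecutive grid rows and that the table ends with the Stop At row, so it illustrates the idea rather than reproducing the program's matching logic:

    # Apply the saved one-row pattern repeatedly to the current grid until the
    # Stop At text is reached; return the extracted rows and the Stop At position.
    def extract_all(current_grid, pattern_len, start_row, stop_at_text):
        rows, pos = [], start_row
        while pos < len(current_grid):
            if current_grid[pos]["visible_text"] == stop_at_text:
                break                                             # Stop At element reached
            rows.append([current_grid[pos + i]["visible_text"]
                         for i in range(pattern_len)])            # one table row per match
            pos += pattern_len
        return rows, pos

    grid = [{"visible_text": t} for t in
            ["IBM", "120", "+1", "MSFT", "85", "-2", "Recent News"]]
    print(extract_all(grid, pattern_len=3, start_row=0, stop_at_text="Recent News"))
    # -> ([['IBM', '120', '+1'], ['MSFT', '85', '-2']], 6)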
The information entered by the user during the design phase in the HTML controls is saved in the surflet and is reproduced again in the playback phase in the playback grid. The next step is to identify the input controls in the playback grid and update the appropriate Web page controls with the information in the playback grid row. This programmatically simulates the user's action of typing information on the HTML page. The same method of HTML element identification is employed as in actions 164 and 166. After the element is identified, depending on its type, the text is set inside the control, or the index is set if it is a multi-item control (e.g., radio button, choice, select, etc.).
In the current example, the symbols for the stocks were updated accordingly in the text control on the Web page.
In the design stage, when the user clicks an HTML element, the information on that clicked element is saved in the surflet. This information is reproduced in the playback grid. The same logic of element validation, with or without an offset, is applied to find the new position of the same HTML element in the current grid. Thereafter, this element is programmatically clicked by invoking the click method on the HTML element. This prepares the document to be submitted to the Web server with the user-entered information.
This completes one step in the series of playback steps, in which one submit and one extract are done together. The HTML page is submitted to the server, and a response is returned. This new page is again loaded in the current grid. The next step from the memory structure is read and displayed in the playback grid.
In another preferred embodiment of the present invention, a system and method are provided for data extraction from XML data, including a method for capturing, filtering, and converting XML data into more conventional tabular (relational) formats, using an easy point-and-click user interface. The data can come from the Internet, intranets, or flat files on local drives or LANs/WANs. The program captures the live data from the XML source and converts the data into a relational (tabular) format. The new format can then be used by anyone needing filtered original data in the more conventional relational form.
This embodiment preferably provides a user interface with point-and-click interaction to: identify an XML data source; identify the XML data of interest within the data source; save these instructions for later use; automatically retrieve live XML data at a later point in time using the saved instructions; automatically filter the live data to cull it down to the data of interest; and automatically convert the filtered live data into more conventional table formats, by applying well-known SQL techniques, for easier use of the data.
A preferred aspect of this embodiment executes in two modes, the design mode and the playback mode. The design mode allows a person (referred to as the “Designer”) to instruct the program which and how much data is to be captured from a Web site (or a file) capable of supplying XML data. The program saves these instructions in a file, referred to herein as a “schema” file. The schema file itself is preferably an internal XML file but is not directly related to the XML data being captured.
The playback mode can be monitored by a person for testing purposes or can be completely automatic, devoid of human interaction, for program-to-program communication purposes. It reads the saved instructions from the schema file, gets selective live XML data from the source (Internet, intranet, or file), and converts it into a relational (tabular) format. The new format can then be used by anyone accustomed to working with conventional data.
With this aspect of the present invention, a design mode and a playback mode are again preferably used. The design mode is essentially a teach mode, wherein the set of instructions taught by the designer is saved as a schema in an XML file.
When the program is started in design mode, the screen 200 shown in FIG. 15 is presented as the user interface. The designer identifies the data source from which the XML data is to be captured at playback time by typing in a Web address or choosing a file in field 202. The designer then clicks on the Show Data Islands/Tree button 204. At this point, the program navigates to that Web site or opens the file and loads the XML contents of the source into the Microsoft-supplied MSXML.DLL utility as a document object. Further processing takes place on the document object.
The document can be of two main types. It may either be an HTML Web page with embedded XML “data islands” or it may be a standalone XML file.
An example of an HTML Web page 210 with data islands 212 and 214 is shown in FIG. 16A. An example of a stand-alone XML file 220 is shown in FIG. 16B.
When the designer presses the Show Data Islands/Tree button 204 and the XML data source contains data islands, a list of those islands is shown in the data islands list box 204. As shown in FIG. 17, the user chooses one data island (here the data islands are "rss" and "moreovernews") for further data extraction definition by clicking on it in box 204.
If the XML source contains only XML data (i.e., without any data islands), then that entire XML is shown directly in box 206 (shown in FIG. 18), skipping the step of requiring the user to identify a particular data island. All further processing is the same regardless of whether the data in lower box 206 was loaded from an XML file or from an XML data island embedded inside another file.
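As a rough illustration of this branch in the logic, a few lines of Python can distinguish the two source types; the regular expression used to find data islands is a simplification of real HTML parsing, and the sample document is invented:

    import re
    import xml.etree.ElementTree as ET

    def load_xml_source(text):
        # Return named data islands if present, otherwise the whole stand-alone XML document.
        islands = re.findall(r'<xml\s+id="([^"]+)"\s*>(.*?)</xml>', text, re.S | re.I)
        if islands:                                           # HTML page with embedded data islands
            return {name: ET.fromstring(body.strip()) for name, body in islands}
        return {"document": ET.fromstring(text)}              # stand-alone XML file

    sample = '<html><body><xml id="rss"><rss><channel><title>News</title></channel></rss></xml></body></html>'
    print(list(load_xml_source(sample)))                      # -> ['rss']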
The program then uses another utility, such as the Microsoft-supplied MSXML.DLL facilities, to read the XML tree node by node. In addition, a visual image of the XML tree is created for the user. This is preferably done by loading every node into a third-party utility software component such as the Tree Visualizer sold by Green Tree Software. The user is also given the flexibility of viewing the tree nodes with or without the data. FIG. 18 shows tree nodes 220 and data 222. FIG. 19 displays tree nodes 230 and data 232 from the "rss" data island displayed in FIG. 18.
Once the tree is displayed, the user can click on any tree nodes to serve as identification of tree branches that contain data to be captured at run time. Multiple nodes on different branches can be clicked. All clicked nodes are highlighted for visual identification and easier understanding. A selected node can be de-selected by clicking on it again.
The designer is required to click on only one node from any desired branch. The implied instruction is that all similar branches are also desired. If the design-time tree has 50 similar-looking branches, the designer has to click only one node in any one of the 50 branches. The "depth" of the clicked node within the desired branch decides which data within the similar branches will be captured at playback time. Only the data from the clicked node upward, toward and including the root of the tree, is captured at playback time.
After selecting all desired nodes from where the XML data is to be captured at playback time, the user presses the Show Node Details button 240 (FIG. 20). This brings up the screen shown in FIG. 21.
The user-clicked nodes' full paths, up to and including the tree root, and also their children or subnode names are saved as part of the saved instructions. As shown in FIG. 20, the user clicked on "image" and "item", which are displayed, along with their subnodes, in the far point grid 250 (FIG. 21). The details of the user-clicked nodes in the visual tree are shown in the grid 250. The user-clicked nodes' paths and their immediate subnodes' paths are shown in columns 252 and 254. The number of rows the user wishes to capture for each path is specified in number column 256. The node name is specified in Table/Grid column 258. A description is entered in description column 260. The wait time for the response from the Web site at run time is also specified.
Thus, as shown in FIG. 21, for every clicked node, the designer can specify the number of rows he wishes to extract at playback time, a meaningful business name for the whole table of relational data to be generated at run time, and a description. The user can also specify the number of seconds he wishes to wait for the response from the Web site at runtime.
These form a complete set of instructions, which is saved in a file, preferably as an XML schema file. FIG. 22 shows the items that are written to the schema file. FIG. 23 shows how the schema file looks when it is saved; in this case it was saved in the file "sample.xml" shown in address box 260.
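Purely as an illustration of what such a saved instruction set might contain, a schema along these lines could be written; the tag names are placeholders and not the exact markup produced by the program:

    <ExtractionSchema>
      <Source url="http://example.com/news.xml" waitSeconds="30"/>
      <ClickedNode tableName="Item" rows="10" description="News items">
        <Path>rss/channel/item</Path>
        <SubNode>title</SubNode>
        <SubNode>link</SubNode>
        <SubNode>description</SubNode>
      </ClickedNode>
    </ExtractionSchema>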
In the playback phase, the end user identifies a saved schema that is to be executed. As shown in FIG. 24, after the user clicks on the Run Schema button 270, the program opens the user-specified schema file from the open box and reads the saved instructions from the file. It then navigates to the particular Web site or opens the file to get the live XML data. The XML data is then loaded, preferably into the Microsoft-supplied MSXML.DLL, as a document. If the XML data fails to load within the time specified by the designer, the program returns to the calling application with appropriate error messages. One example of why such a failure may occur is a Web server being down.
Once the XML data is loaded successfully into the MSXML document, the data is retrieved node by node from the top, preferably using Microsoft-supplied methods. For each of the user-clicked node paths in the schema, a relational table is created with the user-specified name (Table-Name), which is retrieved from the saved schema. As shown in FIG. 25, the Parent-Path 280 and the To-Be-Extracted Path 282 for that clicked node form the columns of the relational table.
For every retrieved node, if the node's name falls within any of the designer-specified "paths" saved in the schema, that node's data is written to the same-named column in the appropriate user-specified table. If the node's name falls outside the designer-specified paths, that node's data is ignored.
This process of retrieving the next node and inspecting it continues node by node until the traversal reaches a node having the same name as the clicked node. At this point, the data from all of that node's children is also written out to the corresponding column names in the tables being filled. The node inspection process stops when the end of the XML data is encountered or when the number of rows specified by the designer has been written out to the tables. The tables are created with user-specified names, which in this example are Image and Item.
When all of the user-clicked node paths have been traversed for data, the program returns to the calling application with a set of relational tables, one for each of the user-clicked selections.
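A compact sketch of this playback-side conversion, using Python's standard xml.etree module, a single saved path, and an invented XML fragment, shows the shape of the result; it is illustrative only and not the patented implementation:

    import xml.etree.ElementTree as ET

    def extract_table(xml_text, item_path, columns, max_rows):
        # Emit one relational row per element matching the saved path, up to the row limit.
        root = ET.fromstring(xml_text)
        rows = []
        for node in root.iterfind(item_path):                 # every branch similar to the clicked one
            rows.append([node.findtext(col, default="") for col in columns])
            if len(rows) >= max_rows:                         # designer-specified row limit
                break
        return rows

    xml_text = ("<rss><channel>"
                "<item><title>Rates rise</title><link>a</link></item>"
                "<item><title>Stocks fall</title><link>b</link></item>"
                "</channel></rss>")
    print(extract_table(xml_text, "channel/item", ["title", "link"], max_rows=10))
    # -> [['Rates rise', 'a'], ['Stocks fall', 'b']]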
If needed, for testing purposes, the program can show the final relational version of the retrieved, filtered, and tabularized data in an application such as the Microsoft-supplied Notepad utility, as shown in FIG. 25.
The first line in the output corresponds to the column names and the following lines have data.
The XML extraction method of the present invention can be used as a stand-alone program or can be implemented as a utility subprogram inside any other program to retrieve, filter, and convert into tables any XML data from any source, Web site, or file.
With both embodiments of the present invention, after the data has been captured and brought back for further processing as specified by the user, it is preferred that such information be further processed by a program such as InstaKnow™, offered by InstaKnow Inc. of Edison, N.J., set forth in detail in provisional application Nos. 60/174,747, 60/166,247 and 60/171,143, the disclosures of which are incorporated by reference herein. Such software is capable of specifying business logic/processes using a point-and-click, wizard-based interface, without the need for a programmer. This enables business solutions to be deployed much faster than competing solutions, reducing solution costs dramatically.
Complexity involved with high-level language programming is eliminated with the present invention. With minimal initial one-time training, a user can use the present embodiment's point-and-click interfaces to achieve advanced business computing, information management, and knowledge management results, as long as the user has a clear understanding of the business. The "business" can be in any industry, segment, or market, including the commercial or academic areas of any manufacturing, service, information, or knowledge-oriented institution.
With the various preferred embodiments of the present invention, the technologies and capabilities achieved include automated capture/entry of data on live HTML Web (Internet/Intranets) pages and automated capture/entry of data on live XML Web (Internet/Intranets) pages.
The present invention provides a method to capture useful, up-to-date business data from the Internet, intranets, and extranets. Most preferably, it is implemented via a program running on a user's computer. The program can be stored on any type of recordable media (e.g., floppy disk, hard disk, ROM, RAM) and instructs a user's computer in the typical fashion. With the present invention, the program learns which Web pages are of interest to the user, how to reach (navigate to) those Web pages, and which business information on those pages is of interest to the user. These steps are recorded and saved in a schema file. From that point on, when requested by a program or a user, the program can automatically repeat the saved Web navigation and data extraction steps and capture the latest instance of the changed business data from the same pages as originally taught by the user. In effect, the present invention is capable of surfing the Web without human assistance.
When the HTML layout of the involved Web pages changes (as it does every few minutes on any commercial Internet page showing a banner ad, for example), the present invention intelligently accommodates the changes and adjusts itself to get the correct business information from the new pages in spite of the layout change. If the layout has changed so drastically that the business information is no longer present on that page, the present invention will provide an error message.
The present invention provides the capabilities of automatically navigating to pre-requested Web pages and automatically entering data and clicking on pre-requested buttons or links. The advantages of such Internet automation include unattended, continuous monitoring of real-time business information and automatic surfing 10 to 1000 times faster than human interaction with the Web.
The present invention is preferably built on Microsoft-supplied generic basic technologies and components, although any other equivalent components can be used to achieve the same results. Preferred components for the HTML extraction program include: the Visual Basic 6.0 IDE; the Web Browser Control (SHDOCVW.DLL); and MSHTML.DLL. Preferred components for the XML extraction program include the Visual Basic 6.0 IDE and MSXML.DLL. Visual Basic 6.0 is a generic object-oriented programming language; equivalents are C++, VC++, and Java. The Web Browser Control (Shdocvw.dll) is a generic tool supplied by Microsoft to provide a browser interface to the Web under a program's control. Mshtml.dll is a generic tool supplied by Microsoft to convert an HTML Web page into a program-understandable list of HTML tags and their values; equivalents are third-party tools like the "Internet tool pack" from Crescent Technologies. These tools are the interface layer that allows a program to read the browser's current content and detect where the user clicked on the Web page in the browser. They also provide a programmatic interface to automatically fill in information on the browser and simulate user actions such as clicks. Msxml.dll is a generic XML parser supplied by Microsoft; many equivalents are available from leading software manufacturers such as IBM, Sun Microsystems, and Netscape.
The present invention can also be configured to extract the entire text from a non-HTML page, such as a Word, Excel, or PDF file. In this case, an "Extract Entire Text" option can be provided to get the entire content of the page; only a Page ID needs to be provided, and one would check the "body tag" for extraction.
As these and other variations and combinations of features discussed above can be utilized without departing from the present invention as defined by the claims, the foregoing description of the preferred embodiments should be taken by way of illustration rather than by way of limitation of the present invention.

Claims (33)

What is claimed is:
1. A computer-implemented method for automated data extraction from a Web site, comprising:
(a) navigating to a Web site during a design phase;
(b) extracting data elements associated with said Web site and producing a visible display corresponding to said extracted data elements;
(c) selecting and storing at least one Page ID data element in said display from said data elements;
(d) selecting and storing one or more Extraction data elements in said display;
(e) selecting and storing at least one Base ID data element having an offset distance from said Extraction elements;
(f) setting a tolerance for possible deviation from said offset distance; and
(g) renavigating to said Web site during a playback phase and extracting data from said Extraction data elements if said Page ID data element is located in said Web site and if said offset distance of said Base ID data element has not changed by more than said adjustable tolerance.
2. A method as claimed in claim 1, wherein user-specific information is entered into said Web site and used in connection with producing the data to be extracted from said Extraction data elements.
3. A method as claimed in claim 1, wherein said data elements comprise HTML elements.
4. A method as claimed in claim 1, wherein said visible display comprises a grid containing rows and columns including information about each said data elements extracted.
5. A method as claimed in claim 4, wherein said information comprises, for each said data element, fixed information of grid row number, HTML tag number and visible text, and user-selected information of Page ID, Base ID, Extract and tolerance.
6. A method as claimed in claim 1, wherein a position of said Page ID data element within said Web site is stored and said extracting occurs during said playback phase if said Page ID data element has not changed said position.
7. A method as claimed in claim 1, wherein said Page ID data element is selected as a data element that is unlikely to change position upon reformatting of said Web site.
8. A method as claimed in claim 1, wherein said display contains data desired to be extracted.
9. A computer-implemented method for automated data extraction from a Web site, comprising:
(a) navigating to a Web site during a design phase;
(b) extracting data elements associated with said Web site and producing a visible current display grid corresponding to said extracted data elements;
(c) selecting and storing at least one Page ID data element in said current display from said data elements;
(d) selecting and storing one or more Extraction data elements in said current display;
(e) selecting and storing at least one Base ID data element in said current display having an offset distance from said Extraction elements;
(f) entering a tolerance in said current display for possible deviation from said offset distance;
(g) displaying a playback display grid during a playback phase with said selected Page ID data element, said Extraction data elements, and said Base ID data element;
(h) renavigating to said Web site;
(i) extracting data elements associated with said Web site to said visible current display grid;
(j) comparing said extracted data elements in said current display grid with said playback display grid and extracting data from said Extraction data elements if said Page ID data element is found in said current display grid and if said offset distance of said Base ID data element has not changed by more than said tolerance; and
(k) adjusting said tolerance based on said offset distance of said Extraction elements found during renavigation.
10. A method as claimed in claim 9, wherein said tolerance comprises a forward and backward tolerance.
11. A computer-implemented method for automated browsing of Web sites on a global communications network and for extracting usable data, comprising:
(a) accessing at least one Web site page containing data, wherein said data comprises a plurality of data formats;
(b) transforming said data in a plurality of formats into a computer-readable list;
(c) identifying a base data element from said list;
(d) identifying an offset from said base data element to the usable data; and
(e) extracting the usable data for use by a user regardless of changes to the Web site, provided that said offset between said base data element and the usable data does not change.
12. The method of claim 11, wherein identifying said offset comprises identifying said offset during a design phase and saving said offset for use in a run time phase including said extracting of said usable data.
13. A computer-implemented method for automated browsing Web sites and for extracting usable data, comprising:
(a) filling a current display grid with rows of HTML data elements from at least one Web site page currently selected by a Web browser;
(b) displaying in a playback display grid previously-stored HTML data elements;
(c) examining said rows of said playback grid to locate an HTML data element previously selected as a Page ID data element;
(d) comparing said rows of said current grid to locate an HTML element that matches said Page ID data element;
(e) examining said rows of said playback grid to locate HTML data elements previously selected as Extraction data elements and a Base ID data element used as a reference for locating said Extraction data elements;
(f) comparing said rows of said current grid to locate HTML elements that match said Extraction data elements and match said Base ID data element;
(g) extracting data from said Extraction data elements regardless of changes to said Web site, provided that said Page ID elements match and any offset between said Base ID elements is within a predetermined tolerance; and,
(h) resetting said tolerance based on said offset of said Base ID elements.
14. A computer-based system for automatically browsing Web sites, comprising a client computer and a server computer for receiving requests from said client computers over a network connecting said client and server computers, said client computer running an application to:
(a) navigate to a Web site during a design phase;
(b) extract data elements associated with said Web site and produce a visible display corresponding to said extracted data elements;
(c) select and store at least one Page ID data element in said display from said data elements;
(d) select and store one or more Extraction data elements in said display;
(e) select and store at least one Base ID data element having an offset distance from said Extraction elements;
(f) set an adjustable tolerance for possible deviation from said offset distance;
(g) renavigate to said Web site during a playback phase and extract data from said Extraction data elements if said Page ID data element is located in said Web site and if said offset distance of said Base ID data element has not changed by more than said tolerance; and
(h) reset said tolerance based on changes to said Web site found during renavigation.
15. A computer-implemented method for automated data extraction, comprising:
(a) identifying selections of data elements in one of a plurality of data formats for extraction from a source of data comprising data stored in one of said plurality of formats;
(b) storing information related to said identified selections of data elements in XML format for subsequent use;
(c) acquiring said source of data and retrieving said data elements;
(d) comparing said retrieved XML data elements to said identified selections and extracting only the data from said data elements that correspond to said identified selections; and
(e) reformatting said extracted XML data into a relational format.
16. A method as claimed in claim 15, wherein said source of said data is a Web site.
17. A method as claimed in claim 15, wherein said source of said data is a file.
18. A method as claimed in claim 15, including saving said extracted data into a relational data table.
19. A method as claimed in claim 15, wherein said reformatted extracted data is passed to a calling application.
20. A computer-implemented method for automated XML data extraction, comprising:
(a) navigating to a Web site including a plurality of web pages containing XML data;
(b) identifying selections of XML data elements for extraction from said Web site from said plurality of pages, said XML data comprising data elements containing said data stored in XML format;
(c) storing information related to said identified selections of XML data elements for subsequent use;
(d) re-navigating to said Web site and retrieving said XML data elements from said plurality of web pages;
(e) comparing said retrieved XML data elements to said identified selections and extracting only the data from said XML data elements that correspond to said identified selections; and
(f) reformatting said extracted XML data into a relational format.
21. A method as claimed in claim 20, including saving said extracted data into a relational data table.
22. A computer-implemented method for automated XML data extraction, comprising:
(a) navigating a client computer to a Web site including a plurality of web pages, said Web site containing XML data;
(b) generating a graphical tree structure on said client computer to display XML nodes and subnodes representing said XML data at said plurality of web pages on said Web site;
(c) selecting one or more of said nodes and/or subnodes from said tree structure associated with the data to be extracted;
(d) storing information related to said selected nodes and/or subnodes;
(e) renavigating said client computer to said Web site and retrieving said XML data using said information;
(f) comparing said retrieved XML data with said selected nodes and/or subnodes and extracting only the data corresponding to said selected nodes and/or subnodes; and
(g) reformatting said extracted XML data into a relational format.
23. A method as claimed in claim 22, wherein selecting one subnode under a parent node automatically selects all subnodes under said parent node.
24. A computer readable medium storing a set of instructions for controlling a computer to automatically extract desired XML data from a source of data in a plurality of formats, said medium comprising a set of instructions for causing said computer to:
(a) identify selections of data elements for extraction from a source of data comprising data stored in a plurality of formats;
(b) store information related to said identified selections of data elements for subsequent use;
(c) acquire said source of data and retrieve said data elements in XML format;
(d) compare said retrieved XML data elements to said identified selections and extract only the data from said data elements that correspond to said identified selections; and
(e) reformat said extracted XML data into a relational format.
25. A computer-based system for automated XML data extraction, comprising a client computer and server computer for receiving requests from said client computer over a network connecting said client and server computers, said client computer running an application to:
(a) identify selections of XML data elements for extraction from a plurality of sources of XML data contained at said server computer;
(b) store information related to said identified selections of XML data elements for subsequent use;
(c) acquire said plurality of sources of XML data and retrieve said XML data elements from said plurality of sources;
(d) compare said retrieved XML data elements to said identified selections and extract only the data from said XML data elements that correspond to said identified selections; and
(e) reformat said extracted XML data into a relational format.
26. A computer-implemented method for automated data extraction from a Web site, comprising:
(a) navigating to a Web site during a design phase;
(b) extracting data elements associated with said Web site and producing a visible display corresponding to said extracted data elements;
(c) selecting and storing at least one Page ID data element in said display from said data elements;
(d) selecting and storing one or more Extraction data elements in said display;
(e) selecting and storing at least one Base ID data element having an offset distance from said Extraction elements;
(f) setting an adjustable tolerance for possible deviation from said offset distance; and,
(g) renavigating to said Web site during a playback phase and extracting data from said Extraction data elements if said Page ID data element is located in said Web site and adjusting said tolerance based on said offset distance of said Base ID data element.
27. A method as claimed in claim 26, wherein user-specific information is entered into said Web site based on said adjustable tolerance and said offset.
28. A method as claimed in claim 26, wherein said adjustable tolerance is reset based on renavigation of said Web site during said playback phase.
29. A method as claimed in claim 26, wherein user-specific information is entered into said Web site and used in connection with producing the data to be extracted from said Extraction data elements.
30. A method as claimed in claim 26, wherein said visible display comprises a grid containing rows and columns including information about each said data elements extracted.
31. A method as claimed in claim 29, wherein said information comprises, for each said data element, fixed information of grid row number, HTML tag number and visible text, and user-selected information of Page ID, Base ID, Extract and tolerance.
32. A method as claimed in claim 26, wherein a position of said Page ID data element within said Web site is stored and said extracting occurs during said playback phase if said Page ID data element has not changed said position.
33. A method as claimed in claim 26, wherein said data elements are extracted from a Web page embedding at least one of the following formats: XML, PDF, Word, and Excel.
US09/714,644 1999-11-18 2000-11-16 Automated data extraction and reformatting Expired - Lifetime US6732102B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US09/714,644 US6732102B1 (en) 1999-11-18 2000-11-16 Automated data extraction and reformatting

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US16624799P 1999-11-18 1999-11-18
US17114399P 1999-12-16 1999-12-16
US17474700P 2000-01-04 2000-01-04
US09/714,644 US6732102B1 (en) 1999-11-18 2000-11-16 Automated data extraction and reformatting

Publications (1)

Publication Number Publication Date
US6732102B1 true US6732102B1 (en) 2004-05-04

Family

ID=32180667

Family Applications (1)

Application Number Title Priority Date Filing Date
US09/714,644 Expired - Lifetime US6732102B1 (en) 1999-11-18 2000-11-16 Automated data extraction and reformatting

Country Status (1)

Country Link
US (1) US6732102B1 (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5664109A (en) * 1995-06-07 1997-09-02 E-Systems, Inc. Method for extracting pre-defined data items from medical service records generated by health care providers
US6014680A (en) * 1995-08-31 2000-01-11 Hitachi, Ltd. Method and apparatus for generating structured document
US6405221B1 (en) * 1995-10-20 2002-06-11 Sun Microsystems, Inc. Method and apparatus for creating the appearance of multiple embedded pages of information in a single web browser display
US6222847B1 (en) * 1997-10-08 2001-04-24 Lucent Technologies Inc. Apparatus and method for retrieving data from a network site
US6304870B1 (en) * 1997-12-02 2001-10-16 The Board Of Regents Of The University Of Washington, Office Of Technology Transfer Method and apparatus of automatically generating a procedure for extracting information from textual information sources
US6138129A (en) * 1997-12-16 2000-10-24 World One Telecom, Ltd. Method and apparatus for providing automated searching and linking of electronic documents
US6424980B1 (en) * 1998-06-10 2002-07-23 Nippon Telegraph And Telephone Corporation Integrated retrieval scheme for retrieving semi-structured documents
US6564254B1 (en) * 1998-11-04 2003-05-13 Dream Technologies Corporation System and a process for specifying a location on a network
US6538673B1 (en) * 1999-08-23 2003-03-25 Divine Technology Ventures Method for extracting digests, reformatting, and automatic monitoring of structured online documents based on visual programming of document tree navigation and transformation
US6516308B1 (en) * 2000-05-10 2003-02-04 At&T Corp. Method and apparatus for extracting data from data sources on a network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Brochure, "Vignette eContent," Vignette © 1997-2000.
Eikvil, "Information Extraction from World Wide Web-A Survey," 1999, pp. 1-39.*
Webpages from Aonix website (www.aonix.com) © 1999.
Webpages from Knowmadic Inc. website (www.knowmadic.com).
Webpages from Vignette website "The Right Content in the Right Context at the Right Time" (www.vignette.com) © 1996-2000.
Webpages from webMethods website "Resolve Complex B2B Integration Challenges Once and for All" (www.webmethods.com).

Cited By (160)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060230343A1 (en) * 1998-12-08 2006-10-12 Yodlee.Com, Inc. Method and apparatus for detecting changes in websites and reporting results to web developers for navigation template repair purposes
US8190629B2 (en) 1998-12-08 2012-05-29 Yodlee.Com, Inc. Network-based bookmark management and web-summary system
US8069407B1 (en) * 1998-12-08 2011-11-29 Yodlee.Com, Inc. Method and apparatus for detecting changes in websites and reporting results to web developers for navigation template repair purposes
US7672879B1 (en) 1998-12-08 2010-03-02 Yodlee.Com, Inc. Interactive activity interface for managing personal data and performing transactions over a data packet network
US7752535B2 (en) 1999-06-01 2010-07-06 Yodlee.com, Inc. Categorization of summarized information
US20040068693A1 (en) * 2000-04-28 2004-04-08 Jai Rawat Client side form filler that populates form fields based on analyzing visible field labels and visible display format hints without previous examination or mapping of the form
US7979856B2 (en) 2000-06-21 2011-07-12 Microsoft Corporation Network-based software extensions
US20020107834A1 (en) * 2000-11-20 2002-08-08 Larry Yen Quality assurance of data extraction
US20020194379A1 (en) * 2000-12-06 2002-12-19 Bennett Scott William Content distribution system and method
US8230323B2 (en) * 2000-12-06 2012-07-24 Sra International, Inc. Content distribution system and method
US7165088B2 (en) * 2001-01-24 2007-01-16 Microsoft Corporation System and method for incremental and reversible data migration and feature deployment
US20020099767A1 (en) * 2001-01-24 2002-07-25 Microsoft Corporation System and method for incremental and reversible data migration and feature deployment
US20060242266A1 (en) * 2001-02-27 2006-10-26 Paula Keezer Rules-based extraction of data from web pages
US20080313532A1 (en) * 2001-03-06 2008-12-18 International Business Machines Corporation Method and apparatus for repurposing formatted content
US20020129067A1 (en) * 2001-03-06 2002-09-12 Dwayne Dames Method and apparatus for repurposing formatted content
US8055999B2 (en) 2001-03-06 2011-11-08 International Business Machines Corporation Method and apparatus for repurposing formatted content
US7546527B2 (en) * 2001-03-06 2009-06-09 International Business Machines Corporation Method and apparatus for repurposing formatted content
US20020174185A1 (en) * 2001-05-01 2002-11-21 Jai Rawat Method and system of automating data capture from electronic correspondence
US7287036B2 (en) * 2001-05-01 2007-10-23 K-Plex Inc. Method and apparatus for automatically searching hypertext structure
US8560621B2 (en) 2001-05-01 2013-10-15 Mercury Kingdom Assets Limited Method and system of automating data capture from electronic correspondence
US9280763B2 (en) 2001-05-01 2016-03-08 Mercury Kingdom Assets Limited Method and system of automating data capture from electronic correspondence
US20020165858A1 (en) * 2001-05-01 2002-11-07 Koji Kusumoto Method and apparatus for automatically searching hypertext structure
US10027613B2 (en) 2001-05-01 2018-07-17 Mercury Kingdom Assets Limited Method and system of automating data capture from electronic correspondence
US8095597B2 (en) * 2001-05-01 2012-01-10 Aol Inc. Method and system of automating data capture from electronic correspondence
US7979539B2 (en) * 2001-05-18 2011-07-12 Network Resonance, Inc. System, method and computer program product for analyzing data from network-based structured message stream
US7464154B2 (en) * 2001-05-18 2008-12-09 Network Resonance, Inc. System, method and computer program product for analyzing data from network-based structured message stream
US20020174218A1 (en) * 2001-05-18 2002-11-21 Dick Kevin Stewart System, method and computer program product for analyzing data from network-based structured message stream
US20090193114A1 (en) * 2001-05-18 2009-07-30 Network Resonance, Inc. System, method and computer program product for analyzing data from network-based structured message stream
US7979533B2 (en) 2001-05-18 2011-07-12 Network Resonance, Inc. System, method and computer program product for auditing XML messages in a network-based message stream
US20080091821A1 (en) * 2001-05-18 2008-04-17 Network Resonance, Inc. System, method and computer program product for auditing xml messages in a network-based message stream
US7979343B2 (en) 2001-05-18 2011-07-12 Network Resonance, Inc. System, method and computer program product for providing an efficient trading market
US20090177572A1 (en) * 2001-05-18 2009-07-09 Network Resonance, Inc. System, method and computer program product for providing an efficient trading market
US7936693B2 (en) 2001-05-18 2011-05-03 Network Resonance, Inc. System, method and computer program product for providing an IP datalink multiplexer
US7272594B1 (en) 2001-05-31 2007-09-18 Autonomy Corporation Ltd. Method and apparatus to link to a related document
US8005858B1 (en) 2001-05-31 2011-08-23 Autonomy Corporation PLC Method and apparatus to link to a related document
US20040194023A1 (en) * 2001-06-12 2004-09-30 Frank Wiechers User selective reload of images
US7200599B2 (en) * 2001-06-21 2007-04-03 Microsoft Corporation Automated generator of input-validation filters
US20030037236A1 (en) * 2001-06-21 2003-02-20 Simon Daniel R. Automated generator of input-validation filters
US20030028559A1 (en) * 2001-06-27 2003-02-06 Jean-Jacques Moreau Method of analyzing a document represented in a markup language
US6801673B2 (en) * 2001-10-09 2004-10-05 Hewlett-Packard Development Company, L.P. Section extraction tool for PDF documents
US20030068099A1 (en) * 2001-10-09 2003-04-10 Hui Chao Section extraction tool for PDF documents
US8166406B1 (en) 2001-12-04 2012-04-24 Microsoft Corporation Internet privacy user interface
US20050160095A1 (en) * 2002-02-25 2005-07-21 Dick Kevin S. System, method and computer program product for guaranteeing electronic transactions
US7853795B2 (en) 2002-02-25 2010-12-14 Network Resonance, Inc. System, method and computer program product for guaranteeing electronic transactions
US20050091540A1 (en) * 2002-02-25 2005-04-28 Dick Kevin S. System, method and computer program product for guaranteeing electronic transactions
US7769997B2 (en) 2002-02-25 2010-08-03 Network Resonance, Inc. System, method and computer program product for guaranteeing electronic transactions
US20040073555A1 (en) * 2002-03-15 2004-04-15 Dennis Hevener Web callbook interface for amateur radio logging systems
US7032181B1 (en) * 2002-06-18 2006-04-18 Good Technology, Inc. Optimized user interface for small screen devices
US20030237047A1 (en) * 2002-06-18 2003-12-25 Microsoft Corporation Comparing hierarchically-structured documents
US7437664B2 (en) * 2002-06-18 2008-10-14 Microsoft Corporation Comparing hierarchically-structured documents
US20120239675A1 (en) * 2003-03-04 2012-09-20 Error Brett M Associating Website Clicks with Links on a Web Page
US8918729B2 (en) 2003-03-24 2014-12-23 Microsoft Corporation Designing electronic forms
US7925621B2 (en) 2003-03-24 2011-04-12 Microsoft Corporation Installing a solution
US9229917B2 (en) 2003-03-28 2016-01-05 Microsoft Technology Licensing, Llc Electronic form user interfaces
US8572024B2 (en) * 2003-07-23 2013-10-29 Ebay Inc. Systems and methods for extracting information from structured documents
US8090678B1 (en) * 2003-07-23 2012-01-03 Shopping.Com Systems and methods for extracting information from structured documents
US20120101979A1 (en) * 2003-07-23 2012-04-26 Shopping.Com Systems and methods for extracting information from structured documents
US9239821B2 (en) 2003-08-01 2016-01-19 Microsoft Technology Licensing, Llc Translation file
US8892993B2 (en) 2003-08-01 2014-11-18 Microsoft Corporation Translation file
US9268760B2 (en) 2003-08-06 2016-02-23 Microsoft Technology Licensing, Llc Correlation, association, or correspondence of electronic forms
US8429522B2 (en) 2003-08-06 2013-04-23 Microsoft Corporation Correlation, association, or correspondence of electronic forms
US20050065965A1 (en) * 2003-09-19 2005-03-24 Ziemann David M. Navigation of tree data structures
US7325191B2 (en) * 2003-12-08 2008-01-29 Microsoft Corporation Preservation of source code formatting
US20050125730A1 (en) * 2003-12-08 2005-06-09 Microsoft Corporation. Preservation of source code formatting
US7512900B2 (en) 2004-02-11 2009-03-31 Autonomy Corporation Ltd. Methods and apparatuses to generate links from content in an active window
US20050177805A1 (en) * 2004-02-11 2005-08-11 Lynch Michael R. Methods and apparatuses to generate links from content in an active window
US10776343B1 (en) 2004-09-02 2020-09-15 Lyft, Inc. Automated messaging tool
US11386073B2 (en) 2004-09-02 2022-07-12 Lyft, Inc. Automated messaging tool
US20060059234A1 (en) * 2004-09-02 2006-03-16 Atchison Charles E Automated messaging tool
US10204129B2 (en) 2004-09-02 2019-02-12 Prosper Technology, Llc Automated messaging tool
US8275840B2 (en) * 2004-09-02 2012-09-25 At&T Intellectual Property I, L.P. Automated messaging tool
US8996634B2 (en) 2004-09-02 2015-03-31 At&T Intellectual Property I, L.P. Automated messaging tool
US20060167911A1 (en) * 2005-01-24 2006-07-27 Stephane Le Cam Automatic data pattern recognition and extraction
US20060259519A1 (en) * 2005-05-12 2006-11-16 Microsoft Corporation Iterative definition of flat file data structure by using document instance
US9104773B2 (en) 2005-06-21 2015-08-11 Microsoft Technology Licensing, Llc Finding and consuming web subscriptions in a web browser
WO2007001864A1 (en) * 2005-06-21 2007-01-04 Microsoft Corporation Content syndication platform
US8751936B2 (en) 2005-06-21 2014-06-10 Microsoft Corporation Finding and consuming web subscriptions in a web browser
US8661459B2 (en) 2005-06-21 2014-02-25 Microsoft Corporation Content syndication platform
US8832571B2 (en) 2005-06-21 2014-09-09 Microsoft Corporation Finding and consuming web subscriptions in a web browser
US9762668B2 (en) 2005-06-21 2017-09-12 Microsoft Technology Licensing, Llc Content syndication platform
US20060288011A1 (en) * 2005-06-21 2006-12-21 Microsoft Corporation Finding and consuming web subscriptions in a web browser
US20060288329A1 (en) * 2005-06-21 2006-12-21 Microsoft Corporation Content syndication platform
US20090013266A1 (en) * 2005-06-21 2009-01-08 Microsoft Corporation Finding and Consuming Web Subscriptions in a Web Browser
US9894174B2 (en) 2005-06-21 2018-02-13 Microsoft Technology Licensing, Llc Finding and consuming web subscriptions in a web browser
US20070011184A1 (en) * 2005-07-07 2007-01-11 Morris Stuart D Method and apparatus for processing XML tagged data
US7657549B2 (en) * 2005-07-07 2010-02-02 Acl Services Ltd. Method and apparatus for processing XML tagged data
US8074272B2 (en) 2005-07-07 2011-12-06 Microsoft Corporation Browser security notification
US7831547B2 (en) 2005-07-12 2010-11-09 Microsoft Corporation Searching and browsing URLs and URL history
US9141716B2 (en) 2005-07-12 2015-09-22 Microsoft Technology Licensing, Llc Searching and browsing URLs and URL history
US20110022971A1 (en) * 2005-07-12 2011-01-27 Microsoft Corporation Searching and Browsing URLs and URL History
US10423319B2 (en) 2005-07-12 2019-09-24 Microsoft Technology Licensing, Llc Searching and browsing URLs and URL history
US7865830B2 (en) * 2005-07-12 2011-01-04 Microsoft Corporation Feed and email content
US20070016609A1 (en) * 2005-07-12 2007-01-18 Microsoft Corporation Feed and email content
US20070033006A1 (en) * 2005-07-19 2007-02-08 Sony Corporation Information processing apparatus, method and program
US7587673B2 (en) * 2005-07-19 2009-09-08 Sony Corporation Information processing apparatus, method and program
US8601001B2 (en) * 2005-07-28 2013-12-03 The Boeing Company Selectively structuring a table of contents for accessing a database
US20070027897A1 (en) * 2005-07-28 2007-02-01 Bremer John F Selectively structuring a table of contents for accessing a database
US20070101250A1 (en) * 2005-10-31 2007-05-03 Advanced Micro Devices, Inc. Data analysis visualization with hyperlink to external content
US9210234B2 (en) 2005-12-05 2015-12-08 Microsoft Technology Licensing, Llc Enabling electronic documents for limited-capability computing devices
US20070150447A1 (en) * 2005-12-23 2007-06-28 Anish Shah Techniques for generic data extraction
US7860903B2 (en) 2005-12-23 2010-12-28 Teradata Us, Inc. Techniques for generic data extraction
US7577963B2 (en) 2005-12-30 2009-08-18 Public Display, Inc. Event data translation system
US20070185881A1 (en) * 2006-02-03 2007-08-09 Autodesk Canada Co. Database-managed image processing
US8024356B2 (en) * 2006-02-03 2011-09-20 Autodesk, Inc. Database-managed image processing
US20070204220A1 (en) * 2006-02-27 2007-08-30 Microsoft Corporation Re-layout of network content
US8280843B2 (en) 2006-03-03 2012-10-02 Microsoft Corporation RSS data-processing object
US20070208759A1 (en) * 2006-03-03 2007-09-06 Microsoft Corporation RSS Data-Processing Object
US8768881B2 (en) 2006-03-03 2014-07-01 Microsoft Corporation RSS data-processing object
US7979803B2 (en) 2006-03-06 2011-07-12 Microsoft Corporation RSS hostable control
US8201072B2 (en) * 2006-03-20 2012-06-12 Ricoh Company, Ltd. Image forming apparatus, electronic mail delivery server, and information processing apparatus
US9060085B2 (en) 2006-03-20 2015-06-16 Ricoh Company, Ltd. Image forming apparatus, electronic mail delivery server, and information processing apparatus
US20070230778A1 (en) * 2006-03-20 2007-10-04 Fabrice Matulic Image forming apparatus, electronic mail delivery server, and information processing apparatus
US20070242925A1 (en) * 2006-04-17 2007-10-18 Hiroaki Kikuchi Recording and reproducing apparatus and reproducing apparatus
US8196037B2 (en) * 2006-06-19 2012-06-05 Tencent Technology (Shenzhen) Company Limited Method and device for extracting web information
US20090100056A1 (en) * 2006-06-19 2009-04-16 Tencent Technology (Shenzhen) Company Limited Method And Device For Extracting Web Information
US7856386B2 (en) 2006-09-07 2010-12-21 Yodlee, Inc. Host exchange in bill paying services
US11537665B2 (en) * 2006-09-11 2022-12-27 Willow Acquisition Corporation System and method for collecting and processing data
US20170270221A1 (en) * 2006-09-11 2017-09-21 Willow Acquisition Corporation System and method for collecting and processing data
US20080077417A1 (en) * 2006-09-21 2008-03-27 Lazzarino William A Systems and Methods for Citation Management
US10311136B2 (en) * 2006-12-11 2019-06-04 Microsoft Technology Licensing, Llc Really simple syndication for data
US20080141113A1 (en) * 2006-12-11 2008-06-12 Microsoft Corporation Really simple syndication for data
US20080216023A1 (en) * 2007-03-02 2008-09-04 Omnitus Ab Method and a system for creating a website guide
US20090187585A1 (en) * 2008-01-18 2009-07-23 Oracle International Corporation Comparing very large xml data
US8639709B2 (en) * 2008-01-18 2014-01-28 Oracle International Corporation Comparing very large XML data
US8234566B2 (en) * 2008-03-03 2012-07-31 Microsoft Corporation Collapsing margins in documents with complex content
US20090222714A1 (en) * 2008-03-03 2009-09-03 Microsoft Corporation Collapsing margins in documents with complex content
US8261334B2 (en) 2008-04-25 2012-09-04 Yodlee Inc. System for performing web authentication of a user by proxy
US8555359B2 (en) 2009-02-26 2013-10-08 Yodlee, Inc. System and methods for automatically accessing a web site on behalf of a client
US9398031B1 (en) 2009-04-25 2016-07-19 Dasient, Inc. Malicious advertisement detection and remediation
US9154364B1 (en) * 2009-04-25 2015-10-06 Dasient, Inc. Monitoring for problems and detecting malware
US9298919B1 (en) 2009-04-25 2016-03-29 Dasient, Inc. Scanning ad content for malware with varying frequencies
US9704188B1 (en) * 2009-07-29 2017-07-11 Open Invention Network Llc Method and apparatus of creating electronic forms to include internet list data
US10460372B1 (en) * 2009-07-29 2019-10-29 Open Invention Network Llc Method and apparatus of creating electronic forms to include Internet list data
US10089673B1 (en) * 2009-07-29 2018-10-02 Open Invention Network Llc Method and apparatus of creating electronic forms to include internet list data
US9223770B1 (en) * 2009-07-29 2015-12-29 Open Invention Network, Llc Method and apparatus of creating electronic forms to include internet list data
US8291313B1 (en) * 2009-08-26 2012-10-16 Adobe Systems Incorporated Generation of a container hierarchy from a document design
US8769392B2 (en) 2010-05-26 2014-07-01 Content Catalyst Limited Searching and selecting content from multiple source documents having a plurality of native formats, indexing and aggregating the selected content into customized reports
US9430470B2 (en) 2010-05-26 2016-08-30 Content Catalyst Limited Automated report service tracking system and method
WO2011148342A1 (en) * 2010-05-26 2011-12-01 Nokia Corporation Method and apparatus for enabling generation of multiple independent user interface elements from a web page
US8843814B2 (en) 2010-05-26 2014-09-23 Content Catalyst Limited Automated report service tracking system and method
US20130067313A1 (en) * 2011-09-09 2013-03-14 Damien LEGUIN Format conversion tool
US8910039B2 (en) * 2011-09-09 2014-12-09 Accenture Global Services Limited File format conversion by automatically converting to an intermediate form for manual editing in a multi-column graphical user interface
US8935622B2 (en) * 2011-09-21 2015-01-13 International Business Machines Corporation Supplementary calculation of numeric data in a web browser
US20130073992A1 (en) * 2011-09-21 2013-03-21 International Business Machines Corporation Supplementary Calculation Of Numeric Data In A Web Browser
US8893028B2 (en) * 2011-09-21 2014-11-18 International Business Machines Corporation Supplementary calculation of numeric data in a web browser
US8996539B2 (en) 2012-04-13 2015-03-31 Microsoft Technology Licensing, Llc Composing text and structured databases
US9411902B2 (en) 2012-04-27 2016-08-09 Microsoft Technology Licensing, Llc Retrieving content from website through sandbox
US9256733B2 (en) 2012-04-27 2016-02-09 Microsoft Technology Licensing, Llc Retrieving content from website through sandbox
US9323767B2 (en) 2012-10-01 2016-04-26 Longsand Limited Performance and scalability in an intelligent data operating layer system
US11188549B2 (en) * 2014-05-28 2021-11-30 Aravind Musuluri System and method for displaying table search results
US20150347535A1 (en) * 2014-05-28 2015-12-03 Aravind Musuluri System and method for displaying table search results
US10394421B2 (en) 2015-06-26 2019-08-27 International Business Machines Corporation Screen reader improvements
US10452231B2 (en) * 2015-06-26 2019-10-22 International Business Machines Corporation Usability improvements for visual interfaces
US20160378274A1 (en) * 2015-06-26 2016-12-29 International Business Machines Corporation Usability improvements for visual interfaces
CN107066258A (en) * 2017-03-06 2017-08-18 武汉斗鱼网络科技有限公司 Page identity image updating method and system
CN110457509A (en) * 2018-05-08 2019-11-15 本田技研工业株式会社 Data disclosure system
CN110457509B (en) * 2018-05-08 2022-10-18 本田技研工业株式会社 Data publishing system
US11381628B1 (en) * 2021-12-22 2022-07-05 Hopin Ltd Browser-based video production
CN116630990A (en) * 2023-07-21 2023-08-22 杭州实在智能科技有限公司 Intelligent restoration method and system for RPA flow element paths
CN116630990B (en) * 2023-07-21 2023-10-10 杭州实在智能科技有限公司 Intelligent restoration method and system for RPA flow element paths

Similar Documents

Publication Publication Date Title
US6732102B1 (en) Automated data extraction and reformatting
US6826553B1 (en) System for providing database functions for multiple internet sources
Hammer et al. Semistructured data: The TSIMMIS experience
US6434554B1 (en) Method for querying a database in which a query statement is issued to a database management system for which data types can be defined
US7165073B2 (en) Dynamic, hierarchical data exchange system
US5911145A (en) Hierarchical structure editor for web sites
US8306998B2 (en) Method for sending an electronic message utilizing connection information and recipient information
US8452776B2 (en) Spatial data portal
US6405216B1 (en) Internet-based application program interface (API) documentation interface
US20020026441A1 (en) System and method for integrating multiple applications
US7086002B2 (en) System and method for creating and editing, an on-line publication
US20110185273A1 (en) System and method for extracting content elements from multiple Internet sources
EP1376408B1 (en) Extraction of information from structured documents
US7770123B1 (en) Method for dynamically generating a “table of contents” view of a HTML-based information system
US20020026461A1 (en) System and method for creating a source document and presenting the source document to a user in a target format
CN101211336B (en) Visualized system and method for generating query files
US20050198567A1 (en) Web navigation method and system
WO2002027537A1 (en) System and method for in-context editing
CN113177168B (en) Positioning method based on Web element attribute characteristics
JP4830637B2 (en) Electronic document update notification device and electronic document update notification method
US7685229B1 (en) System and method for displaying server side code results in an application program
US7480910B1 (en) System and method for providing information and associating information
US20070094289A1 (en) Dynamic, hierarchical data exchange system
KR100522186B1 (en) Methods for dynamically building the home page and apparatus embodied on the web therefor
Lingam et al. Supporting end-users in the creation of dependable web clips

Legal Events

Date Code Title Description

AS Assignment
Owner name: INSTAKNOW.COM INC., NEW JERSEY
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:KHANDEKAR, PRAMOD;REEL/FRAME:011331/0797
Effective date: 20001116

STCF Information on status: patent grant
Free format text: PATENTED CASE

FEPP Fee payment procedure
Free format text: PAT HOLDER NO LONGER CLAIMS SMALL ENTITY STATUS, ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: STOL); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FPAY Fee payment
Year of fee payment: 4

FPAY Fee payment
Year of fee payment: 8

FPAY Fee payment
Year of fee payment: 12