WO2003005240A1 - Apparatus for searching on internet - Google Patents

Apparatus for searching on internet

Info

Publication number
WO2003005240A1
Authority
WO
WIPO (PCT)
Prior art keywords
web
web pages
list
unit
server
Prior art date
Application number
PCT/NO2002/000244
Other languages
French (fr)
Inventor
Allan Lochert
Jan Otto Reberg
Gudbrand Eggen
Original Assignee
Wide Computing As
Priority date
Filing date
Publication date
Application filed by Wide Computing As filed Critical Wide Computing As
Priority to EP02736301A priority Critical patent/EP1412878A1/en
Publication of WO2003005240A1 publication Critical patent/WO2003005240A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results

Abstract

A crawler (54) installed on a web server (50) crawls web pages (52) on a web site and produces a list of the web pages (52) present on the web site. This list is compared with a corresponding list from the previous crawling. Changes in the list, classified as new, changed or deleted web pages, are reported in a modification list and transmitted to a central web change server. The data are transmitted further to a search engine which updates its index based on the modification list. Several variations of this are also disclosed.

Description

APPARATUS FOR SEARCHING ON INTERNET
Detailed description
The invention relates to publishing and searching on the World Wide Web, in short the web. The web contains a large number of web pages, provided by content providers and stored on web servers connected together on the internet. End users often use search engines for finding information. Typical examples of search engines are Fast, AltaVista, Google and Yahoo.
A description of one search engine can be found in "Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page, 1997.
Today's search engines are typically based on submission or crawling, or a combination of these two principles, for providing data for a search index. This search index is then the basis for serving search requests from end users.
For submission based search engines, URLs are submitted by end users or content providers, or fetched from directory services like Open Directory. The web pages corresponding to the URLs are then downloaded and used as the basis for the search index.
Crawling based search engines are based on a crawler that starts on chosen start web pages, downloads these, finds web pages that are referenced on the downloaded pages and downloads the referenced pages, and so on. The downloaded pages are used as basis for a search index. The start pages can be submitted pages.
When web pages are added, changed or deleted, this should be reflected in the search index. For detecting additions, changes and deletions, crawling is typically used. This means that the crawling must be repeated regularly, leading to heavy load on transmission lines, on the web servers of the content providers and on the computers at the search engine. Because of these costs, the crawling of the presumably interesting part of the web is often restricted to a speed corresponding to an average revisit period of a few weeks to a few months. Therefore, news can be available for some time at the web servers before being found by the search engines. New pages that are not linked to can remain unfound.
There are several search engines trying to crawl a significant part of the web. It is an object of the invention to provide search engines with a new way of updating their indices based on what is available on the internet.
It is a further object of the invention to improve the quality of the search engine indices by reducing the time from when a web page becomes available until it has been included in the indices.
It is a further object of the invention to decrease the costs associated with keeping indices updated.
It is a further object of the invention to give content providers a possibility to publish news and changes on their web pages to potential consumers of information.
It is a further object of the invention to reduce the amount of data traffic needed to keep indices updated.
It is a further object of the invention to give portals improved possibilities for providing news on their web pages.
It is a further object of the invention to lower the load caused by crawling on web servers.
It is a further object of the invention to give better control over which web pages should be included by search engines.
It is a further object of the invention to improve categorization of web pages.
It is a further object of the invention to decrease misuse of categorization.
It is a further object of the invention to allow dynamically generated web pages to be indexed.
It is a further object of the invention to provide an overview of which web pages are available on the Internet.
Fig. 1 a illustrates a search engine based on prior art.
Fig. 1 b illustrates a search engine based on the present innovation.
Fig. 2 shows a detailed example of information that can flow from web servers to a central change server and further to search engines.
Fig. 3 shows several details of the inner structure for an agent.
Figs. 4 a and b illustrate ways of balancing the communication load for a central site.
Fig. 5 shows an example of a configuration file for the agent.
Fig. 6 illustrates one way of interconnecting search engines and cache servers based on the present innovation.
Fig. 7 shows a flexible, error tolerant, scalable architecture for data flow in a web change server.
Figs. 8 a and b illustrate two possible arrangements for building a list of documents existing on the web.
The basic principle of the invention is that local agents are installed on or near web servers or groups of web servers, and these agents detect and transmit changes to a central web change server. Web changes are then communicated further to search engines, giving the search engines a basis for downloading only pages that are new or modified, and allowing them to remove parts of the index corresponding to deleted web pages.
The basic principle is illustrated both for use within a search engine and for providing a standalone change directory service that can serve several search engines.
Several methods for detecting changes are disclosed.
The basic principle for existing search engines is illustrated in fig. 1 a. Web pages 12 14 are stored on a web server 10. A search engine 20 has a crawler 22. The crawler 22 reads through the web pages it finds, and produces an index 24. A user on an end user machine 30 searches the index 24 with a local browser 32. Wanted documents 14 are downloaded from the web server 10.
Fig. 1b shows a search engine based on the present invention. Web pages 52 are stored on a web server 50. An agent 54 crawls the web pages 52. For each crawl, a log file 56 is generated. This log file 56 is used for change detection. Information about modifications, covering new, changed and deleted pages, is sent from the agent 54 to a loader 62 in a search engine 60. The search engine 60 updates an index 64 based on the results from crawling new and changed pages, and deletes index entries based on deleted pages. The index 64 can then be queried and the results presented in a browser 72 on an end user machine 70 in the traditional way.
The number of modified web pages on a web server will usually be significantly smaller than the total number of web pages on the server. Therefore, much less data will be transmitted, and the work load for the search engine will be much lower when using the present innovation. Also, the agent can transmit the modification list to the search engine soon after modifications have been made, instead of waiting for the search engine to find the modifications during some later crawl.
There are several possible ways of initiating execution for an agent. One way is to implement a service or daemon that executes regularly according to a time interval and a first starting time, or some other schedule. Another way is to let administrators start the agent manually. Yet another way is to implement the agent so that it can be started from a script, possibly synchronized with other scripts executing on web servers.
The web change server can be a component of a search engine.
Alternatively, the web change server can be used as a standalone service, serving several search engines. The latter has the advantage of allowing data to be published to several search engines while installing and maintaining only one agent.
Fig. 2 shows an example of information that can be transmitted. Two agents, each residing on a web server, send two modification lists 10 20 containing URLs referring to modified web pages to a central server where the lists 10 20 are combined into one aggregated modification list 30. Various extracts 40 50 can then be made from this aggregated list 30 and sent to various search engines.
In the figure, new web pages are marked with '+', changed web pages are marked with '!' and deleted web pages are marked with '-'.
The aggregated list can be produced by concatenating incoming modification lists in order of arrival.
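As an illustration, a minimal sketch of this aggregation step, assuming a simple line-oriented text format with the '+', '!' and '-' markers described above (the exact wire format is not prescribed by the application):
    # Hypothetical line format: "<marker> <URL>", e.g. "+ http://a.example/new.html".
    # '+' denotes a new page, '!' a changed page and '-' a deleted page.
    def aggregate(incoming_lists):
        """Concatenate incoming modification lists in order of arrival."""
        aggregated = []
        for modification_list in incoming_lists:      # lists arrive over time
            for line in modification_list:
                marker, url = line.split(" ", 1)
                if marker in ("+", "!", "-"):
                    aggregated.append((marker, url))
        return aggregated

    lists = [["+ http://a.example/new.html", "! http://a.example/changed.html"],
             ["- http://b.example/gone.html"]]
    print(aggregate(lists))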
Fig. 3 shows one preferred embodiment for detecting modifications. An agent 10 is installed on a web server 5. A crawler 15 crawls the web pages 25 on the web server 5. A log 35 is made based on the crawling. After the crawling has finished, a change detector 40 compares the log 35 from the newest crawling with the log 45 from the previous crawling. The differences between these logs are summarized in a modification list and transmitted.
One preferred embodiment for the log is to make this as a table with one row for each web page, one column with the URL for each web page and one column with a checksum for each web page. The change detector 50 can then compare the newest log 40 with the previous log 60 and report changes as follows: A URL which is present only in the newest log is reported as a new page, a URL which is present only in the previous log is reported as a deleted page, and a URL which is present in both logs but with different checksums is reported as a changed page. URLs which are present in both logs 40 60 with identical checksums are not reported.
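A minimal sketch of such a change detector, assuming each crawl log is held as a mapping from URL to checksum (the function and variable names are illustrative, not taken from the application):
    def detect_changes(previous_log, newest_log):
        """Compare two crawl logs (dicts of URL -> checksum) and build a modification list."""
        modifications = []
        for url, checksum in newest_log.items():
            if url not in previous_log:
                modifications.append(("+", url))          # new page
            elif previous_log[url] != checksum:
                modifications.append(("!", url))          # changed page
        for url in previous_log:
            if url not in newest_log:
                modifications.append(("-", url))          # deleted page
        return modifications                              # unchanged URLs are not reported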
In one preferred embodiment, the checksum is generated based on Exclusive Or (XOR) of groups of characters on the web page. E.g., if 32 bit checksums are wanted, the web page is divided into consecutive groups of 4 bytes, and the checksum is generated by XORing the groups.
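As a concrete illustration of the XOR scheme just described, a small sketch; padding of the last group and byte order are implementation choices not specified in the application:
    def xor_checksum(page_bytes, width=4):
        """32-bit checksum: XOR the page together in groups of 'width' bytes."""
        checksum = 0
        for i in range(0, len(page_bytes), width):
            group = page_bytes[i:i + width].ljust(width, b"\0")   # pad the last group
            checksum ^= int.from_bytes(group, "big")
        return checksum

    print(hex(xor_checksum(b"<html><body>Hello</body></html>")))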
Other methods for generating checksums can be used. The method should preferably generate a relatively short checksum, e.g. from 16 to 128 bits, and at the same time there should be a relatively low probability that two web pages with different contents are assigned the same checksum.
In many cases, the same content may be represented by several URLs. These web pages are called duplicates. One such case is when aliases are used.
When aliases are used, the content provider often has a preference as to which URL should be reported for the duplicates.
This can be solved with the use of checksums. The crawl log may be sorted or accessed in checksum order. In cases where two or more pages have the same checksum, one of the URLs can be discarded.
Generally, every checksum generation method where the checksum is shorter than the non-redundant part of the original data may accidentally produce the same checksum for two different web pages. Therefore, in case of checksum collisions, the web pages themselves must be compared in order to ensure that duplicates are correctly detected.
The selection of which web page to report in case of duplicates can be done using a fixed rule. E.g., the shortest URL can be used, and in case two or more URLs have the same length, the first in alphabetic order can be used.
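A sketch of this fixed rule, grouping URLs by checksum and keeping the shortest URL with alphabetic order as tie-breaker (ignoring, for brevity, the page comparison needed when different pages collide on the same checksum):
    def deduplicate(log):
        """log: dict of URL -> checksum. Keep one URL per checksum value."""
        by_checksum = {}
        for url, checksum in log.items():
            by_checksum.setdefault(checksum, []).append(url)
        kept = {}
        for checksum, urls in by_checksum.items():
            best = min(urls, key=lambda u: (len(u), u))   # shortest URL, then alphabetic
            kept[best] = checksum
        return kept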
Some web pages have mostly static content but also contain some minor automatically changing part, e.g. a clock or a hit counter. Such pages should not be reported as always changing, because this would cause a search engine to download and reindex all such pages each time "changes" are reported, resulting in unnecessary load on servers and network. This problem can be solved in several different ways. One way is to take certain elements of such pages out of the checksum calculation. E.g., all instances of strings of the forms "99:99:99" or "99/99-9999" can be replaced by blanks before or during checksum generation. It should be possible to control this using a configuration tool, including controlling exactly which strings should be taken out. An administrator could be presented with a menu of examples of strings to take away. Alternatively, the administrator could be given the possibility of specifying such strings, e.g. by using regular expressions as used by the "grep" command in UNIX.
One special case of dynamically generated pages is error message pages corresponding to missing pages, also called "dead links". The HTTP protocol allows missing pages to be reported with error message 404, without further content. A crawler can detect this and stop further actions. However, some web servers are programmed to respond with an error message containing an element such as "Sorry, the page AAAA.htm was not found". In such a case, every URL that leads to the web server but that does not refer to an existing web page would result in a unique web page, which in malign cases could lead to an endless amount of web pages, overloading the agent and/or the search engines. Such cases can be solved by removing self references from the web pages before or during checksum generation. E.g., for the URL http://linkloader.com/nonexisting.page.htm resulting in an error page containing the string "Sorry, the page http://linkloader.com/nonexistingpage.htm does not exist", the agent could modify the string to "Sorry, the page does not exist", which would then be caught during duplicate detection.
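Both measures, blanking volatile strings such as clocks and counters and removing self references from error pages, can be applied to the page text before the checksum is generated. A sketch using regular expressions; the patterns and names are only examples, in practice they would come from the agent configuration:
    import re

    VOLATILE_PATTERNS = [
        re.compile(r"\d\d:\d\d:\d\d"),       # e.g. clocks of the form 99:99:99
        re.compile(r"\d\d/\d\d-\d\d\d\d"),   # e.g. dates of the form 99/99-9999
    ]

    def normalize(page_text, own_url):
        """Blank out volatile strings and references to the page's own URL."""
        for pattern in VOLATILE_PATTERNS:
            page_text = pattern.sub(" ", page_text)
        return page_text.replace(own_url, " ")   # remove self references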
On some web sites where web pages are generated dynamically or change detection is difficult because of other reasons, it may be more efficient to report only new and deleted pages. This should be configurable.
On some web sites, some pages may be dynamically generated. For such cases, it should be possible to implement the agent so that for some pages, both new, changed and deleted pages are reported, while for other pages, only new and deleted pages are reported. One possibility is to discriminate based on file type, e.g. so that pages whose URL has the extension ".html" are checked for all types of modifications, while pages whose URL has the extension ".asp" are only checked with respect to new and deleted pages. Another possibility is to discriminate based on folders. In some cases, a publishing tool is used for preparing web sites. For such cases, it may be more efficient to base the list of modifications directly on results from the publishing tool instead of crawling, or the two methods may be used in combination. This can be incorporated in the system by defining an interface file, such that at each scheduled crawling or transmission time, all URLs in the interface file are picked up by the agent and reported to the central site. This can be a list of existing URLs, so that the agent would perform change detection and transmit afterwards. Alternatively, it can be a list of modifications, suitable for direct transmission.
The URLs reported by the agent should be the same as the URLs as seen from the perspective of a search engine and end user. This is a major reason for using the HTTP protocol for crawling, instead of using the FILE protocol. Therefore, the HTTP protocol will be a natural choice in most cases. However, advantages of the FILE protocol may be significant in some cases, like faster execution or not depending on a web server. When the FILE protocol is used, the agent should have a mechanism for converting FILE-based URLs into HTTP-based URLs that can be used from outside.
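One possible mapping from FILE-based URLs to externally usable HTTP-based URLs is sketched below; the document root and host name are assumed to come from the agent configuration, and the values shown are placeholders:
    def file_url_to_http(file_url, document_root="/var/www/html",
                         site_base="http://www.example.com"):
        """Rewrite file:///var/www/html/a/b.html into http://www.example.com/a/b.html."""
        path = file_url[len("file://"):]
        if not path.startswith(document_root):
            raise ValueError("path outside the configured document root")
        return site_base + path[len(document_root):]

    print(file_url_to_http("file:///var/www/html/news/index.html"))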
The agent must transmit the modification list back to the central site. Possible protocols include FTP (File Transfer Protocol), mail (e.g. SMTP) and HTTP (Hypertext Transfer Protocol). In one preferred embodiment, the agent takes the initiative to communicate by either starting an FTP session, sending an email or issuing an HTTP request. This results in efficient use of the communication channel, in that communication is only initiated when there is something to communicate.
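With FTP, for instance, the agent-initiated upload could look roughly like the sketch below; the host name, credentials and file name are placeholders, not values from the application:
    from ftplib import FTP
    from io import BytesIO

    def transmit_by_ftp(modification_list, host="ftp.changeserver.example",
                        user="agent42", password="secret"):
        """Upload the modification list as a text file to the central site."""
        payload = "\n".join(f"{marker} {url}" for marker, url in modification_list)
        with FTP(host) as ftp:                      # the agent initiates the session
            ftp.login(user, password)
            ftp.storbinary("STOR modifications.txt", BytesIO(payload.encode("utf-8")))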
Alternatively, the central site could initiate the communication. One advantage of this is that the central site could achieve a better load distribution over time.
The agent should be authenticated whether using FTP, email or HTTP. Authentication can be divided into two phases: Authenticating an administrator and the corresponding agent when registering, and later authenticating submissions supposedly coming from the given agent.
The authenticating made during registration can be made manually. Alternatively, automatic support for the process can be added. The connection between crawl area and the name and email address of the administrator can be ensured using lookup in a Whois-database. Validity of the email address can be ensured by sending an email to the given address and requesting an answer. Subsequent authentication of submissions can be done using Public Key Encryption, using key pairs generated during registration.
In addition to search engines, other parties may also use data from a web change server. One example is caching services.
One method for building a cache on the internet is to provide a mechanism that copies web pages from a content provider, stores these web pages at one or more intermediate locations, and delivers these pages on request.
Such a caching service needs a mechanism for ensuring cache coherency, that is ensuring that the copy delivered to users is functionally identical to the original web page residing at the web server. One traditional method is based on HTTP headers: Each time a web page is requested, a caching server fetches the corresponding header from the original web server. If the header is identical to the header stored at the cache server, then the rest of the web page is served from the cache server. This method relies on correctly generated HTTP headers, which cannot always be ensured. This method further relies on communication with the content provider for each web page to be delivered, which results in unwanted network traffic. By employing the methods disclosed in the present application, both problems can be reduced. The modification list can be used as basis for indicating which web pages can be served from the cache and which web pages have to be refetched from the original web servers.
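A sketch of how a cache server could apply a received modification list, assuming the cache is a simple mapping from URL to stored content and representing the refetch by a caller-supplied function (both assumptions are illustrative):
    def apply_modifications(cache, modification_list, refetch):
        """Keep the cache coherent using a modification list instead of per-request header checks."""
        for marker, url in modification_list:
            if marker == "-":
                cache.pop(url, None)          # deleted page: drop the cached copy
            elif marker in ("+", "!"):
                cache[url] = refetch(url)     # new or changed page: fetch a fresh copy
        # URLs not mentioned in the list can be served directly from the cache.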
A caching service might be able to report hit count for each web page. This can be reported back to the web change server and distributed further to search engines and other interested participants. Hit counts can be valuable for the search engines for selecting which pages to download and index, and also for ranking results to be presented to end users.
This is illustrated in fig. 6. An agent 10 on a web server 15 reports modifications to a web change server 20. The web change server 20 further sends URLs to a cache server 25 and a search engine 30. Both will download the modified pages from the web server 15, enabling them to deliver search results and web pages respectively. The cache server 25 will maintain hit rates, which are reported back to the web change server 20 and are further reported to the search engine 30, thereby allowing improved search result ranking. Commercially, several different business models are possible.
Content providers could be requested to pay for receiving a more efficient way to publish their content than what is otherwise possible. Payment could be calculated e.g. based on number of URLs submitted, or based on size of the monitored web site.
Infra structure vendors like communication, hosting or caching companies could be requested to pay for reducing stress on their infra structure or for adding functionality to their customers. Payment could be based on estimated or measured reduced stress of their infrastructure, or by splitting income that such vendors might receive from their customers for the added functionality.
Search engines could be requested to pay for improving the quality of their indices or for reducing their communication costs. Payment could be based on number of URLs received or exploited.
Some search engines specialize in certain categories. For such search engines, it is relevant to subscribe to data within the selected categories only.
A category can be assigned to the crawl area at the time of registration. A user can select category from a list or from a tree structure. OpenDirectory is one example of a category tree structure that can be used.
In addition to crawl area level categorization, there may be a need for URL level categorization, in which each URL is assigned one or more categories.
One data format that is useful for URL level categorization is to add a category column to the modification list. This column could then be filled with category codes according to a given list or tree structure.
The category column can be based on configuration data entered by a web server administrator. E.g., all content within a given folder may be assigned a given category.
Categorization may also be based on data or metadata. The agent can look for given keywords in the header or body part of the web pages.
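One way to combine folder-based configuration with keyword inspection when filling the category column is sketched below; the folder paths, keywords and category names are invented for illustration:
    FOLDER_CATEGORIES = {"/sports/": "Sports", "/finance/": "Business"}
    KEYWORD_CATEGORIES = {"football": "Sports", "stock exchange": "Business"}

    def categorize(url, page_text):
        """Return a sorted list of category codes for one URL."""
        categories = set()
        for folder, category in FOLDER_CATEGORIES.items():
            if folder in url:                         # configured by the administrator
                categories.add(category)
        lowered = page_text.lower()
        for keyword, category in KEYWORD_CATEGORIES.items():
            if keyword in lowered:                    # based on data in the page itself
                categories.add(category)
        return sorted(categories)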
Just like some search engines may specialize in certain categories or otherwise use category information for serving search requests, search engines could do the same with language. This can be handled in a similar way as with categories. A language could be registered at crawl area level. Alternatively, a language code can be carried in the modification lists. This language code can be configured, based on data, based on meta data, or otherwise supplied by an administrator.
For maximum speed, the agent can be allowed to run without limitation regarding processor load or network traffic.
In cases where the agent competes with other processes, it may be advantageous to limit the use of resources. E.g., if the agent is executed on the same computer as a web server program, then the web server performance might be degraded while the agent is executing. For such cases, the agent should be limited with respect to resource usage. One way is to limit the HTTP requests to a given number of pages or kilobytes per second or minute. Another way is to limit the percentage of CPU time used. Another way is to limit the amount of RAM used. Yet another way is to limit the amount of disk used. There should be a possibility to set such limits during configuration.
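A very simple way of enforcing a pages-per-second limit is to sleep between requests, as sketched below; the limit value and the caller-supplied fetch function are illustrative:
    import time

    def crawl_throttled(urls, fetch, max_pages_per_second=2.0):
        """Fetch pages one by one, never exceeding the configured request rate."""
        min_interval = 1.0 / max_pages_per_second
        results = {}
        for url in urls:
            started = time.monotonic()
            results[url] = fetch(url)                 # 'fetch' is supplied by the caller
            elapsed = time.monotonic() - started
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)    # give the web server room to breathe
        return results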
Fig. 7 shows a scalable and error tolerant architecture for a back end system handling modification lists.
FTP servers 00 05 accept incoming FTP sessions from agents transferring modification lists. The modification lists are stored on disk 10 15. As long as there is at least one FTP server running, the agents will be able to transmit their modification lists. Each FTP server is essentially independent of the rest of the architecture, making them robust to failures in the rest of the system.
Aggregation servers 20 25 read modification lists and store them in aggregated modification lists 30 35. The aggregation servers can also authenticate the modification lists relative to the crawl areas registered in a database 70.
Each aggregation server 20 25 can read modification lists from disks 10 15 of several FTP servers 00 05. Therefore, the overall system will still function when one or more aggregation servers are out of order, as long as one aggregation server still works.
Extract servers 40 45 extract data from the aggregated modification lists, based on extract profiles stored in the database 70, again storing on disk 50 55.
Playout servers 60 65 distribute extracted data to the respective subscribers. The playout servers 60 65 can be FTP servers, email clients, HTTP servers or other means for communicating with subscribers. The disks at each stage serve as buffers. If one stage stops or starts running slowly, then the disks will buffer the results from the previous stage until the stage is operating again.
If a large number of agents are installed on various web servers, this may result in a heavy load of network traffic when many of these submit modification lists at the same time. Fig. 4 a shows one method for scaling and load balancing. Various crawlers 10 20 30 each have a list 15 25 35 of prioritized addresses for FTP servers 40 50. When a crawler 10 tries to contact an FTP server, it chooses a prioritized address. If no contact can be made, the next address on the list can be tried.
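The prioritized-address failover could be sketched as follows; the server addresses are placeholders:
    from ftplib import FTP, all_errors

    PRIORITIZED_SERVERS = ["ftp1.changeserver.example", "ftp2.changeserver.example"]

    def connect_to_any(servers=PRIORITIZED_SERVERS, timeout=10):
        """Try FTP servers in priority order and return the first connection that succeeds."""
        for host in servers:
            try:
                return FTP(host, timeout=timeout)
            except all_errors:                 # no contact: try the next address on the list
                continue
        raise ConnectionError("no FTP server could be reached")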
Alternatively, scaling and load balancing can be made using Network Address Translation, abbreviated NAT. Using this technique, incoming FTP sessions are distributed to a set of FTP servers based on round robin or load based methods.
In addition to modification lists describing modification of web pages, search engines might also want to have a list of which web pages are available on the internet at a given time. One example is when a new search engine is established. Such a new search engine might then need a list of available web pages to start its index, to have a baseline for later modifications. A list of web pages available on the internet will hereafter be called a baseline list.
Fig. 8 a shows a way of integrating a baseline database 15 in the pipeline described in fig. 7. An FTP server 05 is connected to a network 00, receiving modification lists. The modification lists are aggregated by an aggregation server 10, inserting URLs into an aggregated modification list 20 and also consolidating into a baseline list 25. An extract server 30 and a playout server 35 then handle the data further, distributing to subscribers over a network 40.
One method for consolidation can be summarized by the following pseudo code:
    For each URL received from the Aggregate server 15:
        If the operation is '+', i.e. a new web page:
            If the URL already exists in the Baseline list:
                Give a warning
            else:
                Insert the URL into the Baseline list
        If the operation is '-', i.e. a deleted web page:
            If the URL exists in the Baseline list:
                Delete the URL
            else:
                Give a warning
        If the operation is '!', i.e. a changed web page:
            If the URL exists in the Baseline list:
                Update the URL
            else:
                Insert the URL
                Give a warning
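For illustration, the same consolidation logic as a small Python sketch, holding the baseline as a dictionary keyed by URL; this is a simplification of the database-backed version discussed next:
    import logging

    def consolidate(baseline, modifications):
        """Apply a batch of (operation, url) pairs to the baseline dict in place."""
        for operation, url in modifications:
            if operation == "+":                       # new web page
                if url in baseline:
                    logging.warning("new page already in baseline: %s", url)
                else:
                    baseline[url] = "present"
            elif operation == "-":                     # deleted web page
                if url in baseline:
                    del baseline[url]
                else:
                    logging.warning("deleted page not in baseline: %s", url)
            elif operation == "!":                     # changed web page
                if url in baseline:
                    baseline[url] = "present"          # update the entry
                else:
                    baseline[url] = "present"
                    logging.warning("changed page was missing from baseline: %s", url)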
The method for maintaining a baseline list as shown in fig. 8 a is well suited for real time operation for small or medium amounts of data.
However, the demand for Insert, Update and Delete operations makes the method dependent on some database management system, e.g. an SQL-based database.
A file based version for batch based operation is illustrated in fig. 8 b. An FTP server 55 receives modification lists over a network 50. An aggregation server 60 aggregates the data and stores them in an aggregated modification list 65. Batches of the modification lists are collected and sorted by a Consolidator module 70. The sorted batches and a previous version of the baseline list 80 are then read in parallel, the results are consolidated in a merge process, and the data are written to a new version 85 of the baseline database. Extract 85 and playout 90 servers can then handle the data further for final distribution over a network 95.
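The batch variant amounts to a classic sorted merge. A sketch over in-memory lists, assuming at most one operation per URL in the sorted batch; in practice both inputs would be streamed line by line from files:
    def merge_baseline(old_baseline, sorted_modifications):
        """Merge a sorted baseline (list of URLs) with sorted (url, operation) pairs,
        producing the new sorted baseline."""
        new_baseline = []
        i, j = 0, 0
        while i < len(old_baseline) or j < len(sorted_modifications):
            if j == len(sorted_modifications) or (
                    i < len(old_baseline) and old_baseline[i] < sorted_modifications[j][0]):
                new_baseline.append(old_baseline[i])      # URL only in the old baseline: keep
                i += 1
            elif i == len(old_baseline) or sorted_modifications[j][0] < old_baseline[i]:
                url, op = sorted_modifications[j]
                if op in ("+", "!"):                      # page unknown to the old baseline
                    new_baseline.append(url)
                j += 1
            else:                                         # same URL in both inputs
                url, op = sorted_modifications[j]
                if op != "-":                             # keep unless the page was deleted
                    new_baseline.append(url)
                i += 1
                j += 1
        return new_baseline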
Many web sites offer a search mechanism for searching within the web site. This is sometimes achieved by installing and operating a complete search engine running at the web server.
Using the present innovation, such functionality can be outsourced to an existing search engine.
Fig. 9 illustrates a search mechanism where indexing, searching and ranking are outsourced. A web server 10 has a number of web pages 15, among these a search page 20. An agent 25 reports modifications to a web change server 30. The results are reported to a search engine 35, which has a module 40 for downloading and indexing, producing an index 45. The search engine has a query motor 50. When an end user uses the search page 20 to issue a query, the query is sent to the query motor 50 on the search engine 35, results are generated based on the index 45, and the results are returned back for display on the search page 20.
As described above, the agent can transmit modifications to the central site. Alternatively, the agent can transfer a complete list of URLs found on the web site to the central site, and the modifications can be computed at the central site. This solution results in an agent with less complexity since operations are carried out on the central site instead of in the agent. However, this also results in more network traffic, since complete lists of URLs have to be transmitted, instead of just modifications.

Claims

Claims
1. An apparatus for searching among web pages on Internet, the apparatus called search engine, where each web page is identified by a Uniform Resource Locator, abbreviated
URL, and where the web pages are stored on a plurality of web servers, the apparatus comprising unit for reading web pages, unit for creating an index based on the web pages, unit for removing a part of the index corresponding to removed web pages, and unit for receiving search requests and returning results based on content of the index, the mentioned units residing on a central site, c h a r a c t e r i z e d i n t h a t the apparatus further comprises a unit called agent, residing on each web server or on a local area network together with each web server, the agent transmits one list containing URLs for new web pages and one list containing URLs for deleted web pages, the two lists together called a modification list, to the central site, and the unit for reading web pages chooses which pages to read based on the modification list.
2. An apparatus according to claim 1 , where the agent also transmits a list containing URLs for changed web pages as part of the modification list.
3. An apparatus according to claim 2, wherein the agent contains a unit for crawling web pages on each web server, the unit called crawler, so that for each crawling a list of web pages is generated, and the list of modifications for the time span between two crawlings is made based on the difference between the lists of web pages corresponding to the two crawlings.
4. An apparatus according to claim 3, wherein the crawler uses Hyper Text Transfer Protocol, abbreviated HTTP.
5. An apparatus according to claim 3, wherein the crawler uses File protocol, and the URLs thereby found are modified so as to appear as valid HTTP-based URLs when accessed from a browser.
6. An apparatus according to one of the claims 3 to 5, the apparatus further comprising a unit for calculating a checksum for each web page that is found during crawling, wherein a change in a web page is detected by a change in the corresponding checksums.
7. An apparatus according to claim 6, where the unit for calculating a checksum can disregard parts of the web pages corresponding to given regular expressions.
8. An apparatus according to one of claims 6 or 7, where the unit for calculating a checksum disregards references to the current web page.
9. An apparatus according to one of the claims 1 to 8, where web pages with identical content but with different URLs, the URLs called aliases, are detected, and one of these URLs is selected to be reported by the agent to the search engine.
10. An apparatus according to claim 9, wherein a rule set defines which of the aliases are selected.
11. An apparatus according to one of the claims 1 to 10, wherein the agent further comprises an interface for accepting an externally produced list of modifications.
12. An apparatus according to one of the claims 1 to 11, wherein File Transfer Protocol, abbreviated FTP, is used for transmitting the modification list to the search engine.
13. An apparatus according to one of the claims 1 to 11, wherein electronic mail is used for transmitting the modification list to the central site.
14. An apparatus according to one of the claims 1 to 11, wherein Hypertext Transfer Protocol, abbreviated HTTP, is used for transmitting the modification list to the search engine.
15. An apparatus according to one of the claims 1 to 14, the apparatus further comprising a unit for starting the agent or the transmission at given times or coupled to given events.
16. An apparatus according to one of the claims 1 to 15, the apparatus further comprising means for registering users and corresponding passwords and crawl areas, and means for authenticating data according to the passwords, wherein a crawl area delimits a part of the web by defining a top level domain, a domain and possibly one or more limitations within the domain, and the means for authenticating data verifies that received URLs are within the crawl area.
17. An apparatus according to one of claims 1 to 16, the agent further comprising a unit for associating each URL with one or more categories, based on data or metadata in the web pages, or configurable rules, or lookup in a register of web pages, and the association between web pages and categories is transferred to the central site.
18. An apparatus according to one of claims 1 to 17, the apparatus further comprising a unit, located on each web server or on a device connected to each web server over a local area network, for receiving search requests, forwarding the search requests to the central site, receiving results from the central site, and presenting these from each web server.
19. An apparatus for producing an overview of modifications to web pages on the internet, the apparatus called web change server, where each web page is identified by a URL and where the web pages are stored on a plurality of web servers, the apparatus comprising a unit called agent for assembling a list of URLs for new, changed and deleted web pages, the list called the total modification list, and a unit for transmitting or presenting the total modification list or an extract thereof to a set of subscribers, c h a r a c t e r i z e d i n t h a t the agent is run on each web server or on a device connected to the web server over a local area network, the unit for assembling the total modification list resides on a central site, and the unit for assembling the total modification list receives modification lists from each agent.
20. An apparatus according to claim 19, modified according to any of the claims 2 to 18.
21. An apparatus according to one of claims 19 to 20, where an extract specification is associated with each subscriber, the specification comprising rules defining which URLs from the total modification list should be transmitted to the subscriber.
22. An apparatus for caching web pages, the apparatus creating copies of a plurality of original web pages, the copies collectively called cache, c h a r a c t e r i z e d i n t h a t the cache is kept coherent with the original web pages by means of an apparatus according to one of claims 19 to 21.
23. A search engine according to one of claims 1 to 18, the search engine connected together with an apparatus for caching web pages, wherein the apparatus for caching web pages produces statistics over cache hits for web pages, the statistics are transferred to the search engine, and the search engine utilizes the statistics over cache hits when ranking search results.
24. A web change server according to one of claims 19 to 21, the web change server connected together with an apparatus for caching web pages, wherein the apparatus for caching web pages produces statistics over cache hits for web pages, the statistics are transferred back to the web change server, and the statistics are presented together with the total modification list or the extract thereof.
25. An apparatus for producing a list of web pages on a plurality of web servers connected to the internet, the list called web status, the apparatus called web status server, c h a r a c t e r i z e d i n t h a t modifications from a web change server according to one of claims 19 to 21 are consolidated into a web status.
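
Claims 1 to 3 describe an agent that crawls its own web server and derives the modification list by comparing the pages found in one crawling with those found in the previous crawling. The following is a minimal sketch of that comparison, assuming each crawling produces a mapping from URL to a checksum of the page content; the function and variable names are illustrative and do not appear in the application.

def modification_list(previous, current):
    """Compare two crawl snapshots (dicts mapping URL -> checksum) and
    return lists of new, deleted and changed URLs (claims 1 to 3)."""
    previous_urls = set(previous)
    current_urls = set(current)

    new_pages = sorted(current_urls - previous_urls)
    deleted_pages = sorted(previous_urls - current_urls)
    changed_pages = sorted(
        url for url in current_urls & previous_urls
        if previous[url] != current[url]
    )
    return {"new": new_pages, "deleted": deleted_pages, "changed": changed_pages}

In this sketch the agent would keep the snapshot from the previous crawling locally and transmit only the three lists, so the central site never has to re-crawl unchanged pages.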
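Claims 6 to 8 detect changes by comparing checksums, where parts of a page matching given regular expressions, and references to the page's own URL, can be disregarded before the checksum is calculated. A sketch of such a checksum unit, assuming MD5 as the checksum algorithm (the claims do not name one) and illustrative ignore patterns:

import hashlib
import re

def page_checksum(content, page_url, ignore_patterns=()):
    """Checksum a page while disregarding configured regions (claim 7)
    and references to the page's own URL (claim 8)."""
    # Strip regions matching the configured regular expressions,
    # e.g. timestamps or rotating banners that change on every request.
    for pattern in ignore_patterns:
        content = re.sub(pattern, "", content)
    # Strip self-references so a page is not reported as changed merely
    # because the form of its own URL in the markup varies.
    content = content.replace(page_url, "")
    return hashlib.md5(content.encode("utf-8")).hexdigest()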
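Claims 9 and 10 detect aliases, i.e. different URLs serving identical content, and report only one URL per group, chosen by a rule set. A sketch under the assumption that identical content is recognised by equal checksums and that the rule set simply prefers the shortest URL; the application leaves the rule set open.

from collections import defaultdict

def select_aliases(url_checksums):
    """Group URLs by checksum and pick one representative per group."""
    groups = defaultdict(list)
    for url, checksum in url_checksums.items():
        groups[checksum].append(url)
    # Illustrative rule set: shortest URL wins, ties broken alphabetically.
    return {checksum: min(urls, key=lambda u: (len(u), u))
            for checksum, urls in groups.items()}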
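Claims 12 to 14 allow the modification list to be transmitted by FTP, electronic mail or HTTP. A sketch of the HTTP variant (claim 14) using only the Python standard library; the endpoint URL and the JSON encoding are assumptions, since the application does not specify a wire format.

import json
import urllib.request

def transmit_modification_list(mod_list, endpoint="https://search.example.com/agent/upload"):
    """POST the modification list to the central site as JSON."""
    data = json.dumps(mod_list).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status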
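Claim 16 registers, for each user, a password and a crawl area (a top level domain, a domain and possibly further limitations) and checks that received URLs fall within that area. A minimal sketch of the check, with an illustrative representation of the crawl area:

from urllib.parse import urlparse

def url_in_crawl_area(url, crawl_area):
    """Check that a received URL lies within a registered crawl area.
    crawl_area is an illustrative dict such as
    {"domain": "example.no", "path_prefixes": ["/news/"]}."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host != crawl_area["domain"] and not host.endswith("." + crawl_area["domain"]):
        return False
    prefixes = crawl_area.get("path_prefixes")
    if prefixes and not any(parsed.path.startswith(p) for p in prefixes):
        return False
    return True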
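Claims 19 to 21 describe a web change server that consolidates the agents' modification lists into a total modification list and forwards to each subscriber only the URLs matching that subscriber's extract specification. A sketch of the filtering step, assuming the extract specification is a list of regular expressions over URLs; the claims only require "rules defining which URLs ... should be transmitted".

import re

def extract_for_subscriber(total_modification_list, extract_specification):
    """Return the subset of the total modification list whose URLs
    match any rule in the subscriber's extract specification."""
    rules = [re.compile(pattern) for pattern in extract_specification]
    return {
        kind: [url for url in urls if any(rule.search(url) for rule in rules)]
        for kind, urls in total_modification_list.items()
    }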
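Claim 23 lets the search engine use cache-hit statistics from a connected caching apparatus when ranking results. How the statistics enter the ranking is not specified in the application; one plausible sketch boosts a page's relevance score by a slowly growing function of its hit count.

import math

def rank_with_cache_hits(results, cache_hits):
    """results: list of (url, relevance_score); cache_hits: dict url -> hits.
    Boost each score by log(1 + hits) -- an illustrative choice, not taken
    from the application -- and sort best first."""
    def boosted(item):
        url, score = item
        return score * (1.0 + math.log1p(cache_hits.get(url, 0)))
    return sorted(results, key=boosted, reverse=True)
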
PCT/NO2002/000244 2001-07-03 2002-07-02 Apparatus for searching on internet WO2003005240A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP02736301A EP1412878A1 (en) 2001-07-03 2002-07-02 Apparatus for searching on internet

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
NO20013308 2001-07-03
NO20013308A NO20013308L (en) 2001-07-03 2001-07-03 Device for searching the Internet

Publications (1)

Publication Number Publication Date
WO2003005240A1 (en) 2003-01-16

Family

ID=19912636

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/NO2002/000244 WO2003005240A1 (en) 2001-07-03 2002-07-02 Apparatus for searching on internet

Country Status (3)

Country Link
EP (1) EP1412878A1 (en)
NO (1) NO20013308L (en)
WO (1) WO2003005240A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5974455A (en) * 1995-12-13 1999-10-26 Digital Equipment Corporation System for adding new entry to web page table upon receiving web page including link to another web page not having corresponding entry in web page table
US5855020A (en) * 1996-02-21 1998-12-29 Infoseek Corporation Web scan process
US6219818B1 (en) * 1997-01-14 2001-04-17 Netmind Technologies, Inc. Checksum-comparing change-detection tool indicating degree and location of change of internet documents
WO2001027793A2 (en) * 1999-10-14 2001-04-19 360 Powered Corporation Indexing a network with agents

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005020104A1 (en) * 2003-08-18 2005-03-03 Sap Aktiengesellschaft User-requested search or modification of indices for search engines
GB2417342A (en) * 2004-08-19 2006-02-22 Fujitsu Serv Ltd Indexing system for a computer file store
US8140507B2 (en) 2007-07-02 2012-03-20 International Business Machines Corporation Method and system for searching across independent applications
EP2223202A1 (en) * 2007-11-02 2010-09-01 Paglo Labs Inc. Hosted searching of private local area network information with support for add-on applications
EP2223202A4 (en) * 2007-11-02 2014-02-05 Paglo Labs Inc Hosted searching of private local area network information with support for add-on applications
US10346483B2 (en) * 2009-10-02 2019-07-09 Akamai Technologies, Inc. System and method for search engine optimization
WO2014008468A2 (en) * 2012-07-06 2014-01-09 Blekko, Inc. Searching and aggregating web pages
WO2014008468A3 (en) * 2012-07-06 2014-03-20 Blekko, Inc. Searching and aggregating web pages
US9767206B2 (en) 2012-07-06 2017-09-19 International Business Machines Corporation Searching and aggregating web pages
US11630875B2 (en) 2012-07-06 2023-04-18 International Business Machines Corporation Searching and aggregating web pages
CN105740384A (en) * 2016-01-27 2016-07-06 浪潮软件集团有限公司 Crawler agent automatic switching method and device

Also Published As

Publication number Publication date
NO20013308D0 (en) 2001-07-03
NO20013308L (en) 2003-01-06
EP1412878A1 (en) 2004-04-28

Similar Documents

Publication Publication Date Title
US6636854B2 (en) Method and system for augmenting web-indexed search engine results with peer-to-peer search results
EP1706832B1 (en) Improved user interface
US9703885B2 (en) Systems and methods for managing content variations in content delivery cache
US7093012B2 (en) System and method for enhancing crawling by extracting requests for webpages in an information flow
KR100781725B1 (en) Method and system for peer-to-peer authorization
US6360215B1 (en) Method and apparatus for retrieving documents based on information other than document content
US8280868B2 (en) Method and system for monitoring domain name registrations
US7200665B2 (en) Allowing requests of a session to be serviced by different servers in a multi-server data service system
JP3990115B2 (en) Server-side proxy device and program
US6625624B1 (en) Information access system and method for archiving web pages
JP4704750B2 (en) Link generation system
US7293012B1 (en) Friendly URLs
US20080114739A1 (en) System and Method for Searching for Internet-Accessible Content
US20060235873A1 (en) Social network-based internet search engine
US20050091202A1 (en) Social network-based internet search engine
US20110035553A1 (en) Method and system for cache management
US20030158953A1 (en) Protocol to fix broken links on the world wide web
WO2004084097A1 (en) Method and apparatus for detecting invalid clicks on the internet search engine
JP2002507308A (en) Method and apparatus for redirecting a hyperlink query to an external server
AU2001290363A1 (en) A method for searching and analysing information in data networks
JP2004502987A (en) How to build a real-time search engine
JP2000357176A (en) Contents indexing retrieval system and retrieval result providing method
CN101046806B (en) Search engine system and method
US8055665B2 (en) Sorted search in a distributed directory environment using a proxy server
WO2003005240A1 (en) Apparatus for searching on internet

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2002736301

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2002736301

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWW Wipo information: withdrawn in national office

Ref document number: 2002736301

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP