WO2003005240A1 - Apparatus for searching on internet - Google Patents

Apparatus for searching on internet

Info

Publication number
WO2003005240A1
Authority
WO
WIPO (PCT)
Prior art keywords
web
web pages
list
unit
server
Prior art date
Application number
PCT/NO2002/000244
Other languages
French (fr)
Inventor
Allan Lochert
Jan Otto Reberg
Gudbrand Eggen
Original Assignee
Wide Computing As
Priority date
Filing date
Publication date
Application filed by Wide Computing As filed Critical Wide Computing As
Priority to EP02736301A priority Critical patent/EP1412878A1/en
Publication of WO2003005240A1 publication Critical patent/WO2003005240A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results

Abstract

A crawler (54) installed on a web server (50) crawls web pages (52) on a web site and produces a list of the web pages (52) present on the web site. This list is compared with a corresponding list from the previous crawling. Changes in the list, classified as new, changed or deleted web pages, are reported in a modification list and transmitted to a central web change server. The data are transmitted further to a search engine which updates its index based on the modification list. Several variations of this are also disclosed.

Description

APPARATUS FOR SEARCHING ON INTERNET
Detailed description
The invention relates to publishing and searching on the World Wide Web, in short the web. The web contains a large number of web pages, provided by content providers and stored on web servers connected together on the internet. End users often use search engines for finding information. Typical examples of search engines are Fast, AltaVista, Google and Yahoo.
A description of one search engine can be found in "Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page, 1997.
Today's search engines are typically based on submission or crawling, or a combination of these two principles, for providing data for a search index. This search index is then the basis for serving search requests from end users.
For submission based search engines, URLs are submitted by end users or content providers, or fetched from directory services like Open Directory. The web pages corresponding to the URLs are then downloaded and used as the basis for the search index.
Crawling based search engines are based on a crawler that starts on chosen start web pages, downloads these, finds web pages that are referenced on the downloaded pages and downloads the referenced pages, and so on. The downloaded pages are used as basis for a search index. The start pages can be submitted pages.
When web pages are added, changed or deleted, this should be reflected in the search index. For detecting additions, changes and deletions, crawling is typically used. This means that the crawling must be repeated regularly, leading to heavy load on transmission lines, on the web servers of the content providers and on the computers at the search engine. Because of these costs, the crawling of the presumably interesting part of the web is often restricted to a speed corresponding to an average revisit period of a few weeks to a few months. Therefore, news can be available for some time at the web servers before being found by the search engines. New pages that are not linked to can remain unfound.
There are several search engines trying to crawl a significant part of the web. It is an object of the invention to provide search engines with a new way of updating their indices based on what is available on the internet.
It is a further object of the invention to improve the quality of the search engine indices by reducing the time from when a web page becomes available until it has been included in the indices.
It is a further object of the invention to decrease the costs associated with keeping indices updated.
It is a further object of the invention to give content providers a possibility to publish news and changes on their web pages to potential consumers of information.
It is a further object of the invention to reduce the amount of data traffic needed to keep indices updated.
It is a further object of the invention to give portals improved possibilities for providing news on their web pages.
It is a further object of the invention to lower the load caused by crawling on web servers.
It is a further object of the invention to give better control over which web pages should be included by search engines.
It is a further object of the invention to improve categorization of web pages.
It is a further object of the invention to decrease misuse of categorization.
It is a further object of the invention to allow dynamically generated web pages to be indexed.
It is a further object of the invention to provide an overview of which web pages are available on the Internet.
Fig. 1 a illustrates a search engine based on prior art.
Fig. 1 b illustrates a search engine based on the present innovation.
Fig. 2 shows a detailed example of information that can flow from web servers to a central change server and further to search engines.
Fig. 3 shows several details of the inner structure for an agent.
Figs. 4 a and b illustrate ways of balancing the communication load for a central site.
Fig. 5 shows an example of a configuration file for the agent.
Fig. 6 illustrates one way of interconnecting search engines and cache servers based on the present innovation.
Fig. 7 shows a flexible, error tolerant, scalable architecture for data flow in a web change server.
Figs. 8 a and b illustrate two possible arrangements for building a list of documents existing on the web.
The basic principle of the invention is that local agents are installed on or near web servers or groups of web servers, and these agents detect and transmit changes to a central web change server. Web changes are then communicated further to search engines, giving the search engines a basis for downloading only pages that are new or modified, and allowing them to remove parts of the index corresponding to deleted web pages.
The basic principle is illustrated both for use within a search engine and for providing a standalone change directory service that can serve several search engines.
Several methods for detecting changes are disclosed.
The basic principle for existing search engines is illustrated in fig. 1 a. Web pages 12 14 are stored on a web server 10. A search engine 20 has a crawler 22. The crawler 22 reads through the web pages it finds, and produces an index 24. A user on an end user machine 30 searches the index 24 with a local browser 32. Wanted documents 14 are downloaded from the web server 10.
Fig. 1b shows a search engine based on the present invention. Web pages 52 are stored on a web server 50. An agent 54 crawls the web pages 52. For each crawl, a log file 56 is generated. This log file 56 is used for change detection. Information about modifications, covering new, changed and deleted pages, is sent from the agent 54 to a loader 62 in a search engine 60. The search engine 60 updates an index 64 based on the results from crawling new and changed pages, and deletes index entries based on deleted pages. The index 64 can then be queried and the results presented in a browser 72 on an end user machine 70 in the traditional way.
The number of modified web pages on a web server will usually be significantly smaller than the total number of web pages on the server. Therefore, much less data will be transmitted, and the work load for the search engine will be much lower when using the present innovation. Also, the agent can transmit the modification list to the search engine soon after modifications have been made, instead of waiting for the search engine to find the modifications during some later crawl.
There are several possible ways of initiating execution for an agent. One way is to implement a service or daemon that executes regularly according to a time interval and a first starting time, or some other schedule. Another way is to let administrators start the agent manually. Yet another way is to implement the agent so that it can be started from a script, possibly synchronized with other scripts executing on web servers.
The web change server can be a component of a search engine.
Alternatively, the web change server can be used as a standalone service, serving several search engines. The latter has the advantage of allowing data to be published to several search engines while installing and maintaining only one agent.
Fig. 2 shows an example of information that can be transmitted. Two agents, each residing on a web server, send two modification lists 10 20 containing URLs referring to modified web pages to a central server where the lists 10 20 are combined into one aggregated modification list 30. Various extracts 40 50 can then be made from this aggregated list 30 and sent to various search engines.
In the figure, new web pages are marked with '+', changed web pages are marked with '!' and deleted web pages are marked with '-'.
The aggregated list can be produced by concatenating incoming modification lists in order of arrival.
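As an illustration, a minimal sketch of this aggregation step, assuming a simple line-oriented text format with the '+', '!' and '-' markers described above (the exact wire format is not prescribed by the application):
    # Hypothetical line format: "<marker> <URL>", e.g. "+ http://a.example/new.html".
    # '+' denotes a new page, '!' a changed page and '-' a deleted page.
    def aggregate(incoming_lists):
        """Concatenate incoming modification lists in order of arrival."""
        aggregated = []
        for modification_list in incoming_lists:      # lists arrive over time
            for line in modification_list:
                marker, url = line.split(" ", 1)
                if marker in ("+", "!", "-"):
                    aggregated.append((marker, url))
        return aggregated

    lists = [["+ http://a.example/new.html", "! http://a.example/changed.html"],
             ["- http://b.example/gone.html"]]
    print(aggregate(lists))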
Fig. 3 shows one preferred embodiment for detecting modifications. An agent 10 is installed on a web server 5. A crawler 15 crawls the web pages 25 on the web server 5. A log 35 is made based on the crawling. After the crawling has finished, a change detector 40 compares the log 35 from the newest crawling with the log 45 from the previous crawling. The differences between these logs are summarized in a modification list and transmitted.
One preferred embodiment for the log is to make this as a table with one row for each web page, one column with the URL for each web page and one column with a checksum for each web page. The change detector 50 can then compare the newest log 40 with the previous log 60 and report changes as follows: A URL which is present only in the newest log is reported as a new page, a URL which is present only in the previous log is reported as a deleted page, and a URL which is present in both logs but with different checksums is reported as a changed page. URLs which are present in both logs 40 60 with identical checksums are not reported.
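A minimal sketch of such a change detector, assuming each crawl log is held as a mapping from URL to checksum (the function and variable names are illustrative, not taken from the application):
    def detect_changes(previous_log, newest_log):
        """Compare two crawl logs (dicts of URL -> checksum) and build a modification list."""
        modifications = []
        for url, checksum in newest_log.items():
            if url not in previous_log:
                modifications.append(("+", url))          # new page
            elif previous_log[url] != checksum:
                modifications.append(("!", url))          # changed page
        for url in previous_log:
            if url not in newest_log:
                modifications.append(("-", url))          # deleted page
        return modifications                              # unchanged URLs are not reported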
In one preferred embodiment, the checksum is generated based on Exclusive Or (XOR) of groups of characters on the web page. E.g., if 32 bit checksums are wanted, the web page is divided into consecutive groups of 4 bytes, and the checksum is generated by XORing the groups.
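As a concrete illustration of the XOR scheme just described, a small sketch; padding of the last group and byte order are implementation choices not specified in the application:
    def xor_checksum(page_bytes, width=4):
        """32-bit checksum: XOR the page together in groups of 'width' bytes."""
        checksum = 0
        for i in range(0, len(page_bytes), width):
            group = page_bytes[i:i + width].ljust(width, b"\0")   # pad the last group
            checksum ^= int.from_bytes(group, "big")
        return checksum

    print(hex(xor_checksum(b"<html><body>Hello</body></html>")))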
Other methods for generating checksums can be used. The method should preferably generate a relatively short checksum, e.g. from 16 to 128 bits, and at the same time there should be a relatively low probability that two web pages with different contents are assigned the same checksum.
In many cases, the same content may be represented by several URLs. These web pages are called duplicates. One such case is when aliases are used.
When aliases are used, the content provider often has a preference as to which URL should be reported for the duplicates.
This can be solved with the use of checksums. The crawl log may be sorted or accessed in checksum order. In cases where two or more pages have the same checksum, one of the URLs can be discarded.
Generally, every checksum generation method where the checksum is shorter than the non-redundant part of the original data may accidentally produce the same checksum for two different web pages. Therefore, in case of checksum collisions, the web pages themselves must be compared in order to ensure that duplicates are correctly detected.
The selection of which web page to report in case of duplicates can be done using a fixed rule. E.g., the shortest URL can be used, and in case two or more URLs have the same length, the first in alphabetic order can be used.
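A sketch of this fixed rule, grouping URLs by checksum and keeping the shortest URL with alphabetic order as tie-breaker (ignoring, for brevity, the page comparison needed when different pages collide on the same checksum):
    def deduplicate(log):
        """log: dict of URL -> checksum. Keep one URL per checksum value."""
        by_checksum = {}
        for url, checksum in log.items():
            by_checksum.setdefault(checksum, []).append(url)
        kept = {}
        for checksum, urls in by_checksum.items():
            best = min(urls, key=lambda u: (len(u), u))   # shortest URL, then alphabetic
            kept[best] = checksum
        return kept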
Some web pages have mostly static content but also contain some minor automatically changing part, e.g. a clock or a hit counter. Such pages should not be reported as always changing, because this would cause a search engine to download and reindex all such pages each time "changes" are reported, resulting in unnecessary load on servers and network. This problem can be solved in several different ways. One way is to take certain elements of such pages out of the checksum calculation. E.g., all instances of strings of the forms "99:99:99" or "99/99-9999" can be replaced by blanks before or during checksum generation. It should be possible to control this using a configuration tool, including controlling exactly which strings should be taken out. An administrator could be presented with a menu of examples of strings to take away. Alternatively, the administrator could be given the possibility of specifying such strings, e.g. by using regular expressions as used by the "grep" command in UNIX.
One special case of dynamically generated pages is error message pages corresponding to missing pages, also called "dead links". The HTTP protocol allows missing pages to be reported with error message 404, without further content. A crawler can detect this and stop further actions. However, some web servers are programmed to respond with an error message containing an element such as "Sorry, the page AAAA.htm was not found". In such a case, every URL that leads to the web server but that does not refer to an existing web page would result in a unique web page, which in malign cases could lead to an endless amount of web pages, overloading the agent and/or the search engines. Such cases can be solved by removing self references from the web pages before or during checksum generation. E.g., for the URL http://linkloader.com/nonexisting.page.htm resulting in an error page containing the string "Sorry, the page http://linkloader.com/nonexistingpage.htm does not exist", the agent could modify the string to "Sorry, the page does not exist", which would then be caught during duplicate detection.
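Both measures, blanking volatile strings such as clocks and counters and removing self references from error pages, can be applied to the page text before the checksum is generated. A sketch using regular expressions; the patterns and names are only examples, in practice they would come from the agent configuration:
    import re

    VOLATILE_PATTERNS = [
        re.compile(r"\d\d:\d\d:\d\d"),       # e.g. clocks of the form 99:99:99
        re.compile(r"\d\d/\d\d-\d\d\d\d"),   # e.g. dates of the form 99/99-9999
    ]

    def normalize(page_text, own_url):
        """Blank out volatile strings and references to the page's own URL."""
        for pattern in VOLATILE_PATTERNS:
            page_text = pattern.sub(" ", page_text)
        return page_text.replace(own_url, " ")   # remove self references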
On some web sites where web pages are generated dynamically or change detection is difficult because of other reasons, it may be more efficient to report only new and deleted pages. This should be configurable.
On some web sites, some pages may be dynamically generated. For such cases, it should be possible to implement the agent so that for some pages, both new, changed and deleted pages are reported, while for other pages, only new and deleted pages are reported. One possibility is to discriminate based on file type, e.g. so that pages whose URL has the extension ".html" are checked for all types of modifications, while pages whose URL has the extension ".asp" are only checked with respect to new and deleted pages. Another possibility is to discriminate based on folders. In some cases, a publishing tool is used for preparing web sites. For such cases, it may be more efficient to base the list of modifications directly on results from the publishing tool instead of crawling, or the two methods may be used in combination. This can be incorporated in the system by defining an interface file, such that at each scheduled crawling or transmission time, all URLs in the interface file are picked up by the agent and reported to the central site. This can be a list of existing URLs, so that the agent would perform change detection and transmit afterwards. Alternatively, it can be a list of modifications, suitable for direct transmission.
The URLs reported by the agent should be the same as the URLs as seen from the perspective of a search engine and end user. This is a major reason for using the HTTP protocol for crawling, instead of using the FILE protocol. Therefore, the HTTP protocol will be a natural choice in most cases. However, advantages of the FILE protocol may be significant in some cases, like faster execution or not depending on a web server. When the FILE protocol is used, the agent should have a mechanism for converting FILE-based URLs into HTTP-based URLs that can be used from outside.
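One possible mapping from FILE-based URLs to externally usable HTTP-based URLs is sketched below; the document root and host name are assumed to come from the agent configuration, and the values shown are placeholders:
    def file_url_to_http(file_url, document_root="/var/www/html",
                         site_base="http://www.example.com"):
        """Rewrite file:///var/www/html/a/b.html into http://www.example.com/a/b.html."""
        path = file_url[len("file://"):]
        if not path.startswith(document_root):
            raise ValueError("path outside the configured document root")
        return site_base + path[len(document_root):]

    print(file_url_to_http("file:///var/www/html/news/index.html"))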
The agent must transmit the modification list back to the central site. Possible protocols include FTP (File Transfer Protocol), mail (e.g. SMTP) and HTTP (Hypertext Transfer Protocol). In one preferred embodiment, the agent takes the initiative to communicate by either starting an FTP session, sending an email or issuing an HTTP request. This results in efficient use of the communication channel, in that communication is only initiated when there is something to communicate.
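With FTP, for instance, the agent-initiated upload could look roughly like the sketch below; the host name, credentials and file name are placeholders, not values from the application:
    from ftplib import FTP
    from io import BytesIO

    def transmit_by_ftp(modification_list, host="ftp.changeserver.example",
                        user="agent42", password="secret"):
        """Upload the modification list as a text file to the central site."""
        payload = "\n".join(f"{marker} {url}" for marker, url in modification_list)
        with FTP(host) as ftp:                      # the agent initiates the session
            ftp.login(user, password)
            ftp.storbinary("STOR modifications.txt", BytesIO(payload.encode("utf-8")))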
Alternatively, the central site could initiate the communication. One advantage of this is that the central site could achieve a better load distribution over time.
The agent should be authenticated whether using FTP, email or HTTP. Authentication can be divided into two phases: Authenticating an administrator and the corresponding agent when registering, and later authenticating submissions supposedly coming from the given agent.
The authenticating made during registration can be made manually. Alternatively, automatic support for the process can be added. The connection between crawl area and the name and email address of the administrator can be ensured using lookup in a Whois-database. Validity of the email address can be ensured by sending an email to the given address and requesting an answer. Subsequent authentication of submissions can be done using Public Key Encryption, using key pairs generated during registration.
In addition to search engines, other parties may also use data from a web change server. One example is caching services.
One method for building a cache on the internet is to provide a mechanism that copies web pages from a content provider, stores these web pages at one or more intermediate locations, and delivers these pages on request.
Such a caching service needs a mechanism for ensuring cache coherency, that is ensuring that the copy delivered to users is functionally identical to the original web page residing at the web server. One traditional method is based on HTTP headers: Each time a web page is requested, a caching server fetches the corresponding header from the original web server. If the header is identical to the header stored at the cache server, then the rest of the web page is served from the cache server. This method relies on correctly generated HTTP headers, which cannot always be ensured. This method further relies on communication with the content provider for each web page to be delivered, which results in unwanted network traffic. By employing the methods disclosed in the present application, both problems can be reduced. The modification list can be used as basis for indicating which web pages can be served from the cache and which web pages have to be refetched from the original web servers.
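A sketch of how a cache server could apply a received modification list, assuming the cache is a simple mapping from URL to stored content and representing the refetch by a caller-supplied function (both assumptions are illustrative):
    def apply_modifications(cache, modification_list, refetch):
        """Keep the cache coherent using a modification list instead of per-request header checks."""
        for marker, url in modification_list:
            if marker == "-":
                cache.pop(url, None)          # deleted page: drop the cached copy
            elif marker in ("+", "!"):
                cache[url] = refetch(url)     # new or changed page: fetch a fresh copy
        # URLs not mentioned in the list can be served directly from the cache.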
A caching service might be able to report hit count for each web page. This can be reported back to the web change server and distributed further to search engines and other interested participants. Hit counts can be valuable for the search engines for selecting which pages to download and index, and also for ranking results to be presented to end users.
This is illustrated in fig. 6. An agent 10 on a web server 15 reports modifications to a web change server 20. The web change server 20 further sends URLs to a cache server 25 and a search engine 30. Both will download the modified pages from the web server 15, enabling them to deliver search results and web pages respectively. The cache server 25 will maintain hit rates, which are reported back to the web change server 20 and are further reported to the search engine 30, thereby allowing improved search result ranking. Commercially, several different business models are possible.
Content providers could be requested to pay for receiving a more efficient way to publish their content than what is otherwise possible. Payment could be calculated e.g. based on number of URLs submitted, or based on size of the monitored web site.
Infra structure vendors like communication, hosting or caching companies could be requested to pay for reducing stress on their infra structure or for adding functionality to their customers. Payment could be based on estimated or measured reduced stress of their infrastructure, or by splitting income that such vendors might receive from their customers for the added functionality.
Search engines could be requested to pay for improving the quality of their indices or for reducing their communication costs. Payment could be based on number of URLs received or exploited.
Some search engines specialize in certain categories. For such search engines, it is relevant to subscribe to data within the selected categories only.
A category can be assigned to the crawl area at the time of registration. A user can select category from a list or from a tree structure. OpenDirectory is one example of a category tree structure that can be used.
In addition to crawl area level categorization, there may be a need for URL level categorization, in which each URL is assigned one or more categories.
One data format that is useful for URL level categorization is to add a category column to the modification list. This column could then be filled with category codes according to a given list or tree structure.
The category column can be based on configuration data entered by a web server administrator. E.g., all content within a given folder may be assigned a given category.
Categorization may also be based on data or metadata. The agent can look for given keywords in the header or body part of the web pages.
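One way to combine folder-based configuration with keyword inspection when filling the category column is sketched below; the folder paths, keywords and category names are invented for illustration:
    FOLDER_CATEGORIES = {"/sports/": "Sports", "/finance/": "Business"}
    KEYWORD_CATEGORIES = {"football": "Sports", "stock exchange": "Business"}

    def categorize(url, page_text):
        """Return a sorted list of category codes for one URL."""
        categories = set()
        for folder, category in FOLDER_CATEGORIES.items():
            if folder in url:                         # configured by the administrator
                categories.add(category)
        lowered = page_text.lower()
        for keyword, category in KEYWORD_CATEGORIES.items():
            if keyword in lowered:                    # based on data in the page itself
                categories.add(category)
        return sorted(categories)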
Just like some search engines may specialize in certain categories or otherwise use category information for serving search requests, search engines could do the same with language. This can be handled in a similar way as with categories. A language could be registered at crawl area level. Alternatively, a language code can be carried in the modification lists. This language code can be configured, based on data, based on meta data, or otherwise supplied by an administrator.
For maximum speed, the agent can be allowed to run without limitation regarding processor load or network traffic.
In cases where the agent competes with other processes, it may be advantageous to limit the use of resources. E.g., if the agent is executed on the same computer as a web server program, then the web server performance might be degraded while the agent is executing. For such cases, the agent should be limited with respect to resource usage. One way is to limit the HTTP requests to a given number of pages or kilobytes per second or minute. Another way is to limit the percentage of CPU time used. Another way is to limit the amount of RAM used. Yet another way is to limit the amount of disk used. There should be a possibility to set such limits during configuration.
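A very simple way of enforcing a pages-per-second limit is to sleep between requests, as sketched below; the limit value and the caller-supplied fetch function are illustrative:
    import time

    def crawl_throttled(urls, fetch, max_pages_per_second=2.0):
        """Fetch pages one by one, never exceeding the configured request rate."""
        min_interval = 1.0 / max_pages_per_second
        results = {}
        for url in urls:
            started = time.monotonic()
            results[url] = fetch(url)                 # 'fetch' is supplied by the caller
            elapsed = time.monotonic() - started
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)    # give the web server room to breathe
        return results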
Fig. 7 shows a scalable and error tolerant architecture for a back end system handling modification lists.
FTP servers 00 05 accept incoming FTP sessions from agents transferring modification lists. The modification lists are stored on disk 10 15. As long as there is at least one FTP server running, the agents will be able to transmit their modification lists. Each FTP server is essentially independent of the rest of the architecture, making them robust to failures in the rest of the system.
Aggregation servers 20 25 read modification lists and store them in aggregated modification lists 30 35. The aggregation servers can also authenticate the modification lists relative to the crawl areas registered in a database 70.
Each aggregation server 20 25 can read modification lists from disks 10 15 of several FTP servers 00 05. Therefore, the overall system will still function when one or more aggregation servers are out of order, as long as one aggregation server still works.
Extract servers 40 45 extract data from the aggregated modification lists, based on extract profiles stored in the database 70, again storing on disk 50 55.
Playout servers 60 65 distribute extracted data to the respective subscribers. The playout servers 60 65 can be FTP servers, email clients, HTTP servers or other means for communicating with subscribers. The disks at each stage serve as buffers. If one stage stops or starts running slowly, then the disks will buffer the results from the previous stage until the stage is operating again.
If a large number of agents are installed on various web servers, this may result in a heavy load of network traffic when many of these submit modification lists at the same time. Fig. 4 a shows one method for scaling and load balancing. Various crawlers 10 20 30 each have a list 15 25 35 of prioritized addresses for FTP servers 40 50. When a crawler 10 tries to contact an FTP server, it chooses a prioritized address. If no contact can be made, the next address on the list can be tried.
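The prioritized-address failover could be sketched as follows; the server addresses are placeholders:
    from ftplib import FTP, all_errors

    PRIORITIZED_SERVERS = ["ftp1.changeserver.example", "ftp2.changeserver.example"]

    def connect_to_any(servers=PRIORITIZED_SERVERS, timeout=10):
        """Try FTP servers in priority order and return the first connection that succeeds."""
        for host in servers:
            try:
                return FTP(host, timeout=timeout)
            except all_errors:                 # no contact: try the next address on the list
                continue
        raise ConnectionError("no FTP server could be reached")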
Alternatively, scaling and load balancing can be made using Network Address Translation, abbreviated NAT. Using this technique, incoming FTP sessions are distributed to a set of FTP servers based on round robin or load based methods.
In addition to modification lists describing modification of web pages, search engines might also want to have a list of which web pages are available on the internet at a given time. One example is when a new search engine is established. Such a new search engine might then need a list of available web pages to start its index, to have a baseline for later modifications. A list of web pages available on the internet will hereafter be called a baseline list.
Fig. 8 a shows a way of integrating a baseline database 15 in the pipeline described in fig. 7. An FTP server 05 is connected to a network 00, receiving modification lists. The modification lists are aggregated by an aggregation server 10, inserting URLs into an aggregated modification list 20 and also consolidating into a baseline list 25. An extract server 30 and a playout server 35 then handle the data further, distributing to subscribers over a network 40.
One method for consolidation can be summarized by the following pseudo code:
    For each URL received from the Aggregate server 15:
        If the operation is '+', i.e. a new web page:
            If the URL already exists in the Baseline list:
                Give a warning
            else:
                Insert the URL into the Baseline list
        If the operation is '-', i.e. a deleted web page:
            If the URL exists in the Baseline list:
                Delete the URL
            else:
                Give a warning
        If the operation is '!', i.e. a changed web page:
            If the URL exists in the Baseline list:
                Update the URL
            else:
                Insert the URL
                Give a warning
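For illustration, the same consolidation logic as a small Python sketch, holding the baseline as a dictionary keyed by URL; this is a simplification of the database-backed version discussed next:
    import logging

    def consolidate(baseline, modifications):
        """Apply a batch of (operation, url) pairs to the baseline dict in place."""
        for operation, url in modifications:
            if operation == "+":                       # new web page
                if url in baseline:
                    logging.warning("new page already in baseline: %s", url)
                else:
                    baseline[url] = "present"
            elif operation == "-":                     # deleted web page
                if url in baseline:
                    del baseline[url]
                else:
                    logging.warning("deleted page not in baseline: %s", url)
            elif operation == "!":                     # changed web page
                if url in baseline:
                    baseline[url] = "present"          # update the entry
                else:
                    baseline[url] = "present"
                    logging.warning("changed page was missing from baseline: %s", url)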
The method for maintaining a baseline list as shown in fig. 8 a is well suited for real time operation for small or medium amounts of data.
However, the demand for Insert, Update and Delete operations makes the method dependent on some database management system, e.g. an SQL-based database.
A file based version for batch based operation is illustrated in fig. 8 b. An FTP server 55 receives modification lists over a network 50. An aggregation server 60 aggregates the data and stores them in an aggregated modification list 65. Batches of the modification lists are collected and sorted by a Consolidator module 70. The sorted batches and a previous version of the baseline list 80 are then read in parallel, the results are consolidated in a merge process, and the data are written to a new version 85 of the baseline database. Extract 85 and playout 90 servers can then handle the data further for final distribution over a network 95.
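The batch variant amounts to a classic sorted merge. A sketch over in-memory lists, assuming at most one operation per URL in the sorted batch; in practice both inputs would be streamed line by line from files:
    def merge_baseline(old_baseline, sorted_modifications):
        """Merge a sorted baseline (list of URLs) with sorted (url, operation) pairs,
        producing the new sorted baseline."""
        new_baseline = []
        i, j = 0, 0
        while i < len(old_baseline) or j < len(sorted_modifications):
            if j == len(sorted_modifications) or (
                    i < len(old_baseline) and old_baseline[i] < sorted_modifications[j][0]):
                new_baseline.append(old_baseline[i])      # URL only in the old baseline: keep
                i += 1
            elif i == len(old_baseline) or sorted_modifications[j][0] < old_baseline[i]:
                url, op = sorted_modifications[j]
                if op in ("+", "!"):                      # page unknown to the old baseline
                    new_baseline.append(url)
                j += 1
            else:                                         # same URL in both inputs
                url, op = sorted_modifications[j]
                if op != "-":                             # keep unless the page was deleted
                    new_baseline.append(url)
                i += 1
                j += 1
        return new_baseline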
Many web sites offer a search mechanism for searching within the web site. This is sometimes achieved by installing and operating a complete search engine running at the web server.
Using the present innovation, such functionality can be outsourced to an existing search engine.
Fig. 9 illustrates a search mechanism where indexing, searching and ranking are outsourced. A web server 10 has a number of web pages 15, among these a search page 20. An agent 25 reports modifications to a web change server 30. The results are reported to a search engine 35, which has a module 40 for downloading and indexing, producing an index 45. The search engine has a query motor 50. When an end user uses the search page 20 to issue a query, the query is sent to the query motor 50 on the search engine 35, results are generated based on the index 45, and the results are returned back for display on the search page 20.
As described above, the agent can transmit modifications to the central site. Alternatively, the agent can transfer a complete list of URLs found on the web site to the central site, and the modifications can be computed at the central site. This solution results in an agent with less complexity since operations are carried out on the central site instead of in the agent. However, this also results in more network traffic, since complete lists of URLs have to be transmitted, instead of just modifications.

Claims

Claims
1. An apparatus for searching among web pages on Internet, the apparatus called search engine, where each web page is identified by a Uniform Resource Locator, abbreviated
URL, and where the web pages are stored on a plurality of web servers, the apparatus comprising unit for reading web pages, unit for creating an index based on the web pages, unit for removing a part of the index corresponding to removed web pages, and unit for receiving search requests and returning results based on content of the index, the mentioned units residing on a central site, c h a r a c t e r i z e d i n t h a t the apparatus further comprises a unit called agent, residing on each web server or on a local area network together with each web server, the agent transmits one list containing URLs for new web pages and one list containing URLs for deleted web pages, the two lists together called a modification list, to the central site, and the unit for reading web pages chooses which pages to read based on the modification list.
2. An apparatus according to claim 1 , where the agent also transmits a list containing URLs for changed web pages as part of the modification list.
3. An apparatus according to claim 2, wherein the agent contains a unit for crawling web pages on each web server, the unit called crawler, so that for each crawling a list of web pages is generated, and the list of modifications for the time span between two crawlings is made based on the difference between the lists of web pages corresponding to the two crawlings.
4. An apparatus according to claim 3, wherein the crawler uses Hyper Text Transfer Protocol, abbreviated HTTP.
5. An apparatus according to claim 3, wherein the crawler uses File protocol, and the URLs thereby found are modified so as to appear as valid HTTP-based URLs when accessed from a browser.
6. An apparatus according to one of the claims 3 to 5, the apparatus further comprising a unit for calculating a checksum for each web page that is found during crawling, wherein a change in a web page is detected by a change in the corresponding checksums.
7. An apparatus according to claim 6, where the unit for calculating a checksum can disregard parts of the web pages corresponding to given regular expressions.
8. An apparatus according to one of claims 6 or 7, where the unit for calculating a checksum disregards references to the current web page.
9. An apparatus according to one of the claims 1 to 8, where web pages with identical content but with different URLs, the URLs called aliases, are detected, and one of these URLs is selected to be reported by the agent to the search engine.
10. An apparatus according to claim 9, wherein a rule set defines which of the aliases are selected.
11. An apparatus according to one of the claims 1 to 10, wherein the agent further comprises an interface for accepting an externally produced list of modifications.
12. An apparatus according to one of the claims 1 to 11, wherein File Transfer Protocol, abbreviated FTP, is used for transmitting the modification list to the search engine.
13. An apparatus according to one of the claims 1 to 11, wherein electronic mail is used for transmitting the modification list to the central site.
14. An apparatus according to one of the claims 1 to 11, wherein Hypertext Transfer Protocol, abbreviated HTTP, is used for transmitting the modification list to the search engine.
15. An apparatus according to one of the claims 1 to 14, the apparatus further comprising a unit for starting the agent or the transmission at given times or coupled to given events.
16. An apparatus according to one of the claims 1 to 15, the apparatus further comprising means for registering users and corresponding passwords and crawl areas, and means for authenticating data according to the passwords, wherein a crawl area delimits a part of the web by defining a top level domain, a domain and possibly one or more limitations within the domain, and the means for authenticating data verifies that received URLs are within the crawl area.
17. An apparatus according to one of claims 1 to 16, the agent further comprising a unit for associating each URL with one or more categories, based on data or metadata in the web pages, or configurable rules, or lookup in a register of web pages, and the association between web pages and categories is transferred to the central site.
18. An apparatus according to one of claims 1 to 17, the apparatus further comprising a unit, located on each web server or on a device connected to each web server over a local area network, for receiving search requests, forwarding the search requests to the central site, receiving results from the central site, and presenting these from each web server.
19. An apparatus for producing an overview of modifications to web pages on the internet, the apparatus called web change server, where each web page is identified by a URL and where the web pages are stored on a plurality of web servers, the apparatus comprising a unit called agent for assembling a list of URLs for new, changed and deleted web pages, the list called the total modification list, and a unit for transmitting or presenting the total modification list or an extract thereof to a set of subscribers, c h a r a c t e r i z e d i n t h a t the agent is run on each web server or on a device connected to the web server over a local area network, the unit for assembling the total modification list resides on a central site, and the unit for assembling the total modification list receives modification lists from each agent.
20. An apparatus according to claim 19, modified according to any of the claims 2 to 18.
21. An apparatus according to one of claims 19 to 20, where an extract specification is associated with each subscriber, the specification comprising rules defining which URLs from the total modification list should be transmitted to the subscriber.
22. An apparatus for caching web pages, the apparatus creating copies of a plurality of original web pages, the copies collectively called cache, c h a r a c t e r i z e d i n t h a t the cache is kept coherent with the original web pages by means of an apparatus according to one of claims 19 to 21.
23. A search engine according to one of claims 1 to 18, the search engine connected together with an apparatus for caching web pages, wherein the apparatus for caching web pages produces statistics over cache hits for web pages, the statistics are transferred to the search engine, and the search engine utilizes the statistics over cache hits when ranking search results.
24. A web change server according to one of claims 19 to 21, the web change server connected together with an apparatus for caching web pages, wherein the apparatus for caching web pages produces statistics over cache hits for web pages, the statistics are transferred back to the web change server, and the statistics are presented together with the total modification list or the extract thereof.
25. An apparatus for producing a list of web pages on a plurality of web servers connected to the internet, the list called web status, the apparatus called web status server, c h a r a c t e r i z e d i n t h a t modifications from a web change server according to one of claims 19 to 21 are consolidated into a web status.
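
Claims 1 to 3 describe an agent that crawls its own web server and derives the modification list by comparing the pages found in one crawling with those found in the previous crawling. The following is a minimal sketch of that comparison, assuming each crawling produces a mapping from URL to a checksum of the page content; the function and variable names are illustrative and do not appear in the application.

def modification_list(previous, current):
    """Compare two crawl snapshots (dicts mapping URL -> checksum) and
    return lists of new, deleted and changed URLs (claims 1 to 3)."""
    previous_urls = set(previous)
    current_urls = set(current)

    new_pages = sorted(current_urls - previous_urls)
    deleted_pages = sorted(previous_urls - current_urls)
    changed_pages = sorted(
        url for url in current_urls & previous_urls
        if previous[url] != current[url]
    )
    return {"new": new_pages, "deleted": deleted_pages, "changed": changed_pages}

In this sketch the agent would keep the snapshot from the previous crawling locally and transmit only the three lists, so the central site never has to re-crawl unchanged pages.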
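Claims 6 to 8 detect changes by comparing checksums, where parts of a page matching given regular expressions, and references to the page's own URL, can be disregarded before the checksum is calculated. A sketch of such a checksum unit, assuming MD5 as the checksum algorithm (the claims do not name one) and illustrative ignore patterns:

import hashlib
import re

def page_checksum(content, page_url, ignore_patterns=()):
    """Checksum a page while disregarding configured regions (claim 7)
    and references to the page's own URL (claim 8)."""
    # Strip regions matching the configured regular expressions,
    # e.g. timestamps or rotating banners that change on every request.
    for pattern in ignore_patterns:
        content = re.sub(pattern, "", content)
    # Strip self-references so a page is not reported as changed merely
    # because the form of its own URL in the markup varies.
    content = content.replace(page_url, "")
    return hashlib.md5(content.encode("utf-8")).hexdigest()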
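Claims 9 and 10 detect aliases, i.e. different URLs serving identical content, and report only one URL per group, chosen by a rule set. A sketch under the assumption that identical content is recognised by equal checksums and that the rule set simply prefers the shortest URL; the application leaves the rule set open.

from collections import defaultdict

def select_aliases(url_checksums):
    """Group URLs by checksum and pick one representative per group."""
    groups = defaultdict(list)
    for url, checksum in url_checksums.items():
        groups[checksum].append(url)
    # Illustrative rule set: shortest URL wins, ties broken alphabetically.
    return {checksum: min(urls, key=lambda u: (len(u), u))
            for checksum, urls in groups.items()}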
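Claims 12 to 14 allow the modification list to be transmitted by FTP, electronic mail or HTTP. A sketch of the HTTP variant (claim 14) using only the Python standard library; the endpoint URL and the JSON encoding are assumptions, since the application does not specify a wire format.

import json
import urllib.request

def transmit_modification_list(mod_list, endpoint="https://search.example.com/agent/upload"):
    """POST the modification list to the central site as JSON."""
    data = json.dumps(mod_list).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status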
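Claim 16 registers, for each user, a password and a crawl area (a top level domain, a domain and possibly further limitations) and checks that received URLs fall within that area. A minimal sketch of the check, with an illustrative representation of the crawl area:

from urllib.parse import urlparse

def url_in_crawl_area(url, crawl_area):
    """Check that a received URL lies within a registered crawl area.
    crawl_area is an illustrative dict such as
    {"domain": "example.no", "path_prefixes": ["/news/"]}."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host != crawl_area["domain"] and not host.endswith("." + crawl_area["domain"]):
        return False
    prefixes = crawl_area.get("path_prefixes")
    if prefixes and not any(parsed.path.startswith(p) for p in prefixes):
        return False
    return True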
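Claims 19 to 21 describe a web change server that consolidates the agents' modification lists into a total modification list and forwards to each subscriber only the URLs matching that subscriber's extract specification. A sketch of the filtering step, assuming the extract specification is a list of regular expressions over URLs; the claims only require "rules defining which URLs ... should be transmitted".

import re

def extract_for_subscriber(total_modification_list, extract_specification):
    """Return the subset of the total modification list whose URLs
    match any rule in the subscriber's extract specification."""
    rules = [re.compile(pattern) for pattern in extract_specification]
    return {
        kind: [url for url in urls if any(rule.search(url) for rule in rules)]
        for kind, urls in total_modification_list.items()
    }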
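Claim 23 lets the search engine use cache-hit statistics from a connected caching apparatus when ranking results. How the statistics enter the ranking is not specified in the application; one plausible sketch boosts a page's relevance score by a slowly growing function of its hit count.

import math

def rank_with_cache_hits(results, cache_hits):
    """results: list of (url, relevance_score); cache_hits: dict url -> hits.
    Boost each score by log(1 + hits) -- an illustrative choice, not taken
    from the application -- and sort best first."""
    def boosted(item):
        url, score = item
        return score * (1.0 + math.log1p(cache_hits.get(url, 0)))
    return sorted(results, key=boosted, reverse=True)
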
PCT/NO2002/000244 2001-07-03 2002-07-02 Apparatus for searching on internet WO2003005240A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP02736301A EP1412878A1 (en) 2001-07-03 2002-07-02 Apparatus for searching on internet

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
NO20013308 2001-07-03
NO20013308A NO20013308L (en) 2001-07-03 2001-07-03 Device for searching the Internet

Publications (1)

Publication Number Publication Date
WO2003005240A1 (en) 2003-01-16

Family

ID=19912636

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/NO2002/000244 WO2003005240A1 (en) 2001-07-03 2002-07-02 Apparatus for searching on internet

Country Status (3)

Country Link
EP (1) EP1412878A1 (en)
NO (1) NO20013308L (en)
WO (1) WO2003005240A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5974455A (en) * 1995-12-13 1999-10-26 Digital Equipment Corporation System for adding new entry to web page table upon receiving web page including link to another web page not having corresponding entry in web page table
US5855020A (en) * 1996-02-21 1998-12-29 Infoseek Corporation Web scan process
US6219818B1 (en) * 1997-01-14 2001-04-17 Netmind Technologies, Inc. Checksum-comparing change-detection tool indicating degree and location of change of internet documents
WO2001027793A2 (en) * 1999-10-14 2001-04-19 360 Powered Corporation Indexing a network with agents

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005020104A1 (en) * 2003-08-18 2005-03-03 Sap Aktiengesellschaft User-requested search or modification of indices for search engines
GB2417342A (en) * 2004-08-19 2006-02-22 Fujitsu Serv Ltd Indexing system for a computer file store
US8140507B2 (en) 2007-07-02 2012-03-20 International Business Machines Corporation Method and system for searching across independent applications
EP2223202A1 (en) * 2007-11-02 2010-09-01 Paglo Labs Inc. Hosted searching of private local area network information with support for add-on applications
EP2223202A4 (en) * 2007-11-02 2014-02-05 Paglo Labs Inc Hosted searching of private local area network information with support for add-on applications
US10346483B2 (en) * 2009-10-02 2019-07-09 Akamai Technologies, Inc. System and method for search engine optimization
WO2014008468A2 (en) * 2012-07-06 2014-01-09 Blekko, Inc. Searching and aggregating web pages
WO2014008468A3 (en) * 2012-07-06 2014-03-20 Blekko, Inc. Searching and aggregating web pages
US9767206B2 (en) 2012-07-06 2017-09-19 International Business Machines Corporation Searching and aggregating web pages
US11630875B2 (en) 2012-07-06 2023-04-18 International Business Machines Corporation Searching and aggregating web pages
CN105740384A (en) * 2016-01-27 2016-07-06 浪潮软件集团有限公司 Crawler agent automatic switching method and device

Also Published As

Publication number Publication date
NO20013308D0 (en) 2001-07-03
NO20013308L (en) 2003-01-06
EP1412878A1 (en) 2004-04-28

Similar Documents

Publication Publication Date Title
US6636854B2 (en) Method and system for augmenting web-indexed search engine results with peer-to-peer search results
EP1706832B1 (en) Improved user interface
US9703885B2 (en) Systems and methods for managing content variations in content delivery cache
US7093012B2 (en) System and method for enhancing crawling by extracting requests for webpages in an information flow
KR100781725B1 (en) Method and system for peer-to-peer authorization
US6360215B1 (en) Method and apparatus for retrieving documents based on information other than document content
US8280868B2 (en) Method and system for monitoring domain name registrations
US7200665B2 (en) Allowing requests of a session to be serviced by different servers in a multi-server data service system
JP3990115B2 (en) Server-side proxy device and program
US6625624B1 (en) Information access system and method for archiving web pages
JP4704750B2 (en) Link generation system
US7293012B1 (en) Friendly URLs
US20080114739A1 (en) System and Method for Searching for Internet-Accessible Content
US20060235873A1 (en) Social network-based internet search engine
US20050091202A1 (en) Social network-based internet search engine
US20110035553A1 (en) Method and system for cache management
US20030158953A1 (en) Protocol to fix broken links on the world wide web
WO2004084097A1 (en) Method and apparatus for detecting invalid clicks on the internet search engine
JP2002507308A (en) Method and apparatus for redirecting a hyperlink query to an external server
AU2001290363A1 (en) A method for searching and analysing information in data networks
JP2004502987A (en) How to build a real-time search engine
JP2000357176A (en) Contents indexing retrieval system and retrieval result providing method
CN101046806B (en) Search engine system and method
US8055665B2 (en) Sorted search in a distributed directory environment using a proxy server
WO2003005240A1 (en) Apparatus for searching on internet

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2002736301

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2002736301

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWW Wipo information: withdrawn in national office

Ref document number: 2002736301

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP