WO2002017137A1

WO2002017137A1 - Document retrieval system

Info

Publication number: WO2002017137A1
Application number: PCT/GB2001/003568
Authority: WO
Inventors: William Thomas Harrington Loggie
Original assignee: British Telecommunications Public Limited Company
Priority date: 2000-08-18
Filing date: 2001-08-08
Publication date: 2002-02-28
Also published as: AU2001277617A1

Abstract

Apparatus for generating a search strategy for use in retrieval of information, the apparatus including data storage, or access to data storage, for storing a plurality of entities, data items allocated to each entity, and a predetermined data item structure comprising at least some of the allocated data items; identifying means for identifying a plurality of entities within the data storage; retrieving means for retrieving data items allocated to each identified entity; selecting means for selecting additional data items in the data item structure in accordance with one or more predetermined rules; and means for generating a search strategy from the selected additional data items and the retrieved data items corresponding to the identified plurality of entities.

Description

DOCUMENT RETRIEVAL SYSTEM

The present invention is concerned with document retrieval systems, in particular document retrieval systems where documents are retrieved in respect of a user's interest.

Current methods of document retrieval allow a user to search for papers based on a keyword query, such as "ADSL" or "inductive logic programming". This approach works well when the subject of interest has a small set of well-defined standard terms, but not so well for subjects with which a user is not particularly familiar, and which include non-standard terminology. For example, queries based on keywords "voice over IP" and "internet telephony" typically return different sets of papers - there could be overlap between the papers returned, but typically both of these keywords return papers missed by the other. Ideally, a user would know all of the keywords relevant to the field of interest, and enter them all as a query. However, for a given subject, firstly the majority of users are usually only familiar with a subset of keywords, and secondly the keywords within a particular subject can change. In general, the most up to date source for keywords characterising a subject is papers (documents, journals, conference proceedings, book chapters etc.). Papers are commonly listed in bibliographic databases, such as the INSPEC database. The INSPEC database is a database for Physics, Electronics and Computing, providing access to the worldwide literature in these areas. The papers in INSPEC are classified by subject, codes and terms, the subject area being arranged hierarchically in categories and sub-categories: A - Physics, B - Electrical and Electronic Engineering, C - Computers and Control, D - Information Technology.

As an example of the hierarchy within category A:

A0100 - COMMUNICATION, EDUCATION, HISTORY, AND PHILOSOPHY
A0110 - ANNOUNCEMENTS, NEWS, AND ORGANIZATIONAL ACTIVITIES
A0110F - CONFERENCES, LECTURES, AND INSTITUTES.

More information about the INSPEC database can be found in the INSPEC database user manual available from the Institute of Electrical Engineers, and a description of the database is available on-line at http://webrech.tu-darmstadt.de/webrech/inspec/insdata.html. One of the problems with databases such as the INSPEC database is that they have a complex form, and in order to make use of data within the database the user has to have a good working knowledge of the organization of the data. This involves understanding the hierarchy of codes and terms, and can be an additional, and inconvenient, distraction for the user.

Many web-based search facilities include an option "More Like This", which appears to retrieve documents based on documents and/or words identified by the user. In particular, Knowledge Discovery Ltd (KDL) developed a system in 1996 that performs searches on key words, phrases in text passages, and titles of web pages. The system piggybacks onto major search engines (including HotBot, Lycos, Infoseek, Yahoo, AltaVista, Magellan, WebCrawler, and DejaNews) to perform the actual search work. Thus the phrase "More Like This" reflects the fact that the KDL system accesses seven search engines, and thus finds more information than is possible when accessing a single search engine. Whump.com is a site that provides information on building dynamic web sites, XML, information architecture, etc., and it offers suggested areas of interest under "More Like This", based on an internal data model of links between tables of meta categories. The data model is specific to Whump.com, and only operates within the scope of items defined in the tables. Lexis Publishing, a web-based legal information service, also includes a "More Like This" function for retrieving legal documentation relating to a case of interest. This system primarily works on citations referenced in the case, searching for other cases, related statutory citations, patent citations, law review articles etc., that reference the same citations. Thus the "More Like This" function finds more documents that include citations within a specified document.

According to a first aspect of the present invention there is provided apparatus for generating a search strategy for use in retrieval of information, the apparatus including data storage, or access to data storage, for storing a plurality of entities, data items allocated to each entity, and a predetermined data item structure comprising at least some of the allocated data items, identifying means for identifying a plurality of entities within the data storage; retrieving means for retrieving data items allocated to each identified entity; selecting means for selecting additional data items in the data item structure in accordance with one or more predetermined rules, means for generating a search strategy from the selected additional data items and the retrieved data items corresponding to the identified plurality of entities.

Conveniently the means for generating a search strategy includes analysing means for analysing relationships between one or more groups of first data items and second data items. The first data items include selected additional data items and retrieved data items corresponding to one or more first identified entities, and the second data items include selected additional data items and retrieved data items corresponding to each of the other identified entities. Preferably the relationships analysed by analysing means include commonality between first data items and second data items.

Typically each group of first data items is a subset of all of the selected data items and the retrieved data items corresponding to a single entity. Each of the groups is then compared with the second data items (in other words, data items, both selected and retrieved, from all of the entities identified), and the degree of overlap between each group and the second data items per corresponding entity is analysed. Conveniently this analysis is quantified into a commonality value (a measure of the degree of overlap). Once all of the groups comprising first data items for a first entity have been considered, this process is repeated for the next uncovered entity, and so on until each of the uncovered entities has been analysed from the standpoint of a "first data item". When all of the entities have been considered, groups of first data items corresponding to commonality values above a predetermined threshold are used to establish the search strategy. An uncovered entity, for the purposes of this description, is an entity corresponding to second data items that do not overlap with groups of first data items having commonality values above a predetermined threshold, i.e. an entity whose additional and retrieved data items do not fall within the group of first data items that have commonality values above a predetermined threshold.

Preferably the retrieving means is further adapted to receive categorisation information corresponding to each identified entity and/or each data item, the categorisation information including positive and negative categories. This therefore tags the entity, and the data items relating to the entity, as either positive or negative representations of the search strategy.

As described above, the selecting means uses predetermined rules to select additional data items from the data item structure. Conveniently, the predetermined rules include relational conditions between the retrieved data items and the data items comprising the data item structure; these relational conditions could include lateral and/or vertical relationships - for example, "select a data item if it is vertically above a retrieved data item".
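To make the "vertically above" condition concrete, the following is a minimal Python sketch, not the patented implementation. The function names `code_ancestors` and `term_ancestors` and the two small hierarchies are illustrative assumptions: one structure is a tree (each item has a single parent) and the other is a directed acyclic graph (an item may have several parents).

```python
# Illustrative hierarchy data (assumed for this sketch only).
# Tree: each code has exactly one parent.
CODE_PARENT = {"A101": "A10", "A10": "A1", "A1": "A"}
# Directed acyclic graph: a term may have more than one parent.
TERM_PARENTS = {
    "electric generators": ["electric machines"],
    "CMOS integrated circuits": ["integrated circuits", "networks (circuits)"],
}

def code_ancestors(code):
    """Walk the single upward branch of a tree, nearest ancestor first."""
    ancestors = []
    while code in CODE_PARENT:
        code = CODE_PARENT[code]
        ancestors.append(code)
    return ancestors

def term_ancestors(term):
    """Collect every item reachable by following upward branches in a DAG."""
    seen = set()
    stack = [term]
    while stack:
        for parent in TERM_PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

additional_codes = code_ancestors("A101")
additional_terms = term_ancestors("CMOS integrated circuits")
```

The tree walk needs no visited set because each item has one parent; the graph walk keeps one so that an ancestor reachable by two routes is selected only once.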

Conveniently an entity includes a document.

Further aspects, features and advantages of embodiments of the present invention are described below.

A document retrieval system, in particular for retrieving documents from information sources that are arranged in a structured form, such as a hierarchical form, will now be described, by way of example only as an embodiment of the present invention, and with reference to the accompanying drawings, in which:

Figure 1 is a schematic block diagram of a document retrieval system according to the present invention;

Figure 2 is a schematic block diagram of interfacing means providing part of the system of Figure 1;

Figure 3 is a flow diagram describing the operating steps of selecting means providing part of the system of Figure 1;

Figure 4a is an example of data arranged hierarchically in a tree arrangement;

Figure 4b is an example of data arranged in an acyclic arrangement;

Figure 5 is a flow diagram describing the operating steps of a comparator providing part of the system of Figure 1; and

Figure 6 is a block diagram showing in greater detail the processes present in a client and server arrangement forming part of the embodiment of Figure 1.

Overview

The embodiment performs inductive reasoning based on examples submitted by a user, so as to identify one or more statements that characterize the user's interest. Essentially, the embodiment enables a user to submit one or more example papers and query terms that are representative of the user's interests, and carries out a search based on information extracted from the papers and query terms. The invention is particularly suited to retrieval of documents from information sources that are arranged in a relational, specifically a hierarchical, form.

With reference to Figure 1, document retrieval system 100 essentially comprises three components. The first component is interfacing means 101 for interfacing with a user: it presents information to, and receives input from, the user, transposes the received input into a form that can be used to query an information source 103, and queries the information source 103. The second component is selecting means 105 for receiving output from the information source 103, and, based on the output received from the information source 103, selecting additional information relating thereto. When the information source 103 arranges its information hierarchically, the selecting means 105 accesses a data store 107, which stores the hierarchical structure of the information source 103. The additional information selected by selecting means 105 thus includes information retrieved from data store 107. The selecting means 105 also defines one or more selection rules that specify conditions for selecting at least some of the additional information; the selecting means 105 collates the rules, additional information and output received from the information source 103 and outputs this as collated data.

The third component is a comparator 109, which receives the said collated data from the selecting means 105, and performs a comparative analysis, based on the selection rules in the collated data, between the output received from the information source 103 and the additional information. The comparative analysis results in identification of one or more characterizing rules 111 that describe the subject matter of interest to the user. The interfacing means 101 can then retrieve material such as papers in accordance with the characterizing rules 111, and present the retrieved papers to the user.

The interfacing means 101 includes a program having a Graphical User Interface (GUI) for displaying information to, and receiving information from, a user. In use of the document retrieval system 100, a user enters a query, via interfacing means 101, that the user knows falls within a subject of interest; e.g. if a user is interested in papers on voice traffic over the Internet, the user can enter the phrase "voice over IP" as representative of the subject area. The interfacing means 101 then performs a search on an index of papers, such as the INSPEC database, and extracts papers that include that phrase. Alternatively, or in addition, the user can submit one or more papers that are representative of the user's interest. In the case of a database such as the INSPEC database described above, where papers are classified by codes, terms and free text, the interfacing means 101 requests all of the codes, terms and free text associated with the paper(s); the retrieved data items described above thus include the codes, terms and free text.

Selecting means 105 receives output from the information source 103, and then accesses the hierarchy data associated with INSPEC classifications stored in data store 107. Selecting means 105 also defines a predetermined set of selection rules (described below), which define selection of additional codes and/or terms that are themselves not explicitly listed within the papers; the selected additional data items described above thus include these additional codes and/or terms. The system submits the selection rules, additional and extracted terms, codes and free text to the comparator 109, which performs comparative analysis thereon (analysis of degree of commonality introduced above thus includes performing comparative analysis, as is described more fully below), and outputs one or more characterizing rules 111 that best characterize the terms, codes and free text. These characterizing rules 111 are typically of the form of terms, phrases and codes (described more fully below), and are submitted to the user, via interfacing means 101, for verification. The user can accept the characterizing rules 111 ("positive"), or mark them as inappropriate ("negative"), and the system will use this information to perform a second round of paper extraction and induction, this time based on user-verified categorizations. These characterizing rules 111 therefore form the basis for a search strategy.

The invention thus both identifies a wide range of papers that may be relevant to a user's interest, and, when the information sources include hierarchically structured sources, the invention retrieves papers without needing the user to have any specific knowledge of this structured form. For the purposes of this embodiment, the information sources 103 are assumed to include the INSPEC database.

Interfacing means 101

As shown in Figure 2, interfacing means 101 comprises a GUI 201, paper identifying means 203 and data retrieving means 205. When a user submits a query via the GUI 201, the query (which may be text, title of a paper, author, code etc.) is input to the paper identifying means 203, which sends the query to the INSPEC database 103. As is known to one with ordinary skill in the art, the INSPEC database 103 is operable to receive queries on papers, terms, codes and authors, and returns data in a format specified by a user. Thus the paper identifying means 203 sends a query to the INSPEC database 103, requesting paper reference numbers corresponding to the queries. The INSPEC database 103 returns any matching paper reference numbers, together with an abstract of the paper (if the user specifies that he wants to see the abstract), and these are passed to the user for review. Papers that are selected by the user, and marked as positive or negative, are passed on to the data retrieving means 205. If the user entered a paper reference number directly into the GUI 201, then this is sent to the data retrieving means 205. Data retrieving means 205 inputs all of the paper reference numbers to the INSPEC database 103, and requests a predetermined (configurable) amount of data for each of the paper references.

Example:

A user submits a query such as "Ethernet", whereupon paper identifying means 203 requests reference numbers of all papers that include the keyword "Ethernet" in the text. The INSPEC database 103 returns paper reference numbers for all of these papers, together with their abstracts, to the user, who reads through the abstracts and selects some, or all, of the papers returned from the database 103. Paper identifying means 203 passes these details on to data retrieving means 205, which submits a request for the codes, terms and free text listed for each of the paper references. Thus, although the user submits a single query keyword "Ethernet", many terms are likely to be identified in relation to this query (the relevance, or otherwise, of these terms is discussed below).
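The two-stage flow in this example (identify papers by keyword, then retrieve the data items of the selected papers) can be sketched in Python as follows. This is an illustration under stated assumptions only: the in-memory `DATABASE`, its records and placeholder codes such as "X100", and the function names `identify_papers` and `retrieve_data_items` are all invented for the sketch and do not reflect the real INSPEC interface.

```python
# Toy stand-in for a bibliographic database; every record below is invented.
DATABASE = {
    101: {"abstract": "Switched Ethernet in campus networks",
          "codes": ["X100"], "terms": ["local area networks"],
          "free": ["switched Ethernet"]},
    102: {"abstract": "Protocol analysis for Ethernet bridges",
          "codes": ["X100", "X200"], "terms": ["protocols"],
          "free": ["bridges"]},
    103: {"abstract": "Speech coding for mobile radio",
          "codes": ["X300"], "terms": ["speech coding"],
          "free": ["mobile radio"]},
}

def identify_papers(keyword):
    """Paper identifying step: reference numbers and abstracts matching a keyword."""
    needle = keyword.lower()
    return [(ref, rec["abstract"]) for ref, rec in DATABASE.items()
            if needle in rec["abstract"].lower()]

def retrieve_data_items(refs):
    """Data retrieving step: codes, terms and free text for each selected paper."""
    return {ref: {kind: list(DATABASE[ref][kind])
                  for kind in ("codes", "terms", "free")}
            for ref in refs}

hits = identify_papers("Ethernet")
# In this sketch the user is assumed to select every paper returned.
selected = [ref for ref, _abstract in hits]
items = retrieve_data_items(selected)
```

A single keyword thus yields several codes and terms across the selected papers, which is the material passed on for analysis.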

Post analysis, the GUI 201 is also used to present terms and codes, which are the characterizing rules 111 output by the comparator 109, to the user. Codes and terms that are accepted by the user are added to a positive list; if any of these terms and/or codes are rejected by the user, then they are added to a negative list (described below).

Selecting means 105

Selecting means 105 receives the codes, terms and free text output from the INSPEC database 103, and arranges them into a form suitable for input to the comparator 109. This involves the following steps (outlined in Figure 3):

• S 3.1 Review codes and terms extracted from INSPEC database 103 and write to an input file;

• S 3.2 Select additional codes and terms that are hierarchically related to the extracted codes and terms (interface with data store 107), and write to the input file;

• S 3.3 Create selection rules that control selection of, and combination of, additional and extracted codes and terms and write to the input file;

In step S 3.1, extracted codes and terms are written, in Prolog, to an input file. Prolog is a computer language created primarily for logic applications, and is the language utilised by the comparator 109. (Information relating to the language can be found in reference books such as "Prolog: The Standard: Reference Manual", published in 1996 by Springer-Verlag Telos; ISBN number 3540593047, authors: P. Deransart, A. Ed-Dbali, L. Cervoni). Thus for ease of I/O, data is input to the comparator 109 using Prolog statements. The extracted codes are symbolised by code_m, and terms are similarly symbolised by term_m. An example of extracted codes and terms populating an input file, and corresponding to one paper, is presented below:

code_m(5276296,'C6110J'). // thus the paper reference number is 5276296
code_m(5276296,'C6140D').
code_m(5276296,'C5640').
code_m(5276296,'C6110B').
code_m(5276296,'C6115').
term_m(5276296,'LISP').
term_m(5276296,'object-oriented languages').
term_m(5276296,'object-oriented programming').
term_m(5276296,'protocols').
term_m(5276296,'software libraries').
term_m(5276296,'software standards').
free(5276296,'Telos').
free(5276296,'The EuLisp Object System').
free(5276296,'metaobject protocol').
free(5276296,'EuLisp standard').
free(5276296,'object-oriented programming').
free(5276296,'user-implementable code').
free(5276296,'libraries').
free(5276296,'Common Lisp').
free(5276296,'meta-programming').
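Facts in the format shown above can be generated mechanically from the retrieved data. The following Python sketch is one possible way to do so, offered as an illustration only; the helper names `quote_atom` and `facts_for_paper` are hypothetical, and the escaping assumes standard Prolog quoted-atom conventions (a single quote is doubled, a backslash is escaped).

```python
def quote_atom(text):
    """Wrap text as a Prolog quoted atom, escaping backslashes and single quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"

def facts_for_paper(ref, codes, terms, free_text):
    """Return one Prolog fact per code, term and free-text phrase of a paper."""
    lines = []
    for predicate, values in (("code_m", codes),
                              ("term_m", terms),
                              ("free", free_text)):
        for value in values:
            lines.append(f"{predicate}({ref},{quote_atom(value)}).")
    return lines

# A subset of the data items listed above for paper 5276296.
facts = facts_for_paper(
    5276296,
    codes=["C6110J", "C5640"],
    terms=["LISP", "protocols"],
    free_text=["Telos", "Common Lisp"],
)
print("\n".join(facts))
```

Quoting every value, rather than only those containing spaces or capitals, keeps the writer simple and avoids a value beginning with an upper-case letter being read as a Prolog variable.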

Step S 3.2 involves interaction with the data store 107, where the structure template is stored. Selecting means 105 selects all codes and terms that are hierarchically above the extracted terms and codes. The structure corresponding to the codes is a tree hierarchy, whereas the structure corresponding to the terms is a directed acyclic graph; thus, and as shown in Figures 4a and 4b, each code has only one branch 401 leading up therefrom, whereas each term could have one or more branches 403a, 403b leading up therefrom. Referring to Figures 4a and 4b, and assuming that one of the extracted codes were A101, step S 3.2 would retrieve codes A10, A1 and A as additional codes, and assuming that one of the extracted terms were electric generators, step S 3.2 would retrieve electric machines as an additional term. These additional codes and terms are written to the input file, and take the following form (example text is unrelated to Figures 4a and 4b):

code_parent('C7400','C7490'). // code C7400 is a parent of code C7490
code_parent('C7800','C7810').
code_parent('C7810','C7810C').
code_parent('C7800','C7820').
code_parent('C7800','C7830').
code_parent('C7800','C7840').
code_parent('C7000','C7800').
code_parent('C7800','C7890').
term_parent('standards','ANSI standards'). // term standards is a parent of term ANSI standards
term_parent('digital computers','Apple computers').
term_parent('high level languages','BASIC').
term_parent('high level languages','C language').
term_parent('integrated circuits','CMOS integrated circuits').
term_parent('networks (circuits)','CMOS integrated circuits').

The selection rules created in step S 3.3 define conditions for selection of the hierarchy data (and thus analysis thereof by the comparator 109). In other words, these selection rules take account of negative and positive codes and terms, as categorised by the user, and specify selection of the additional terms and codes (i.e. specify inclusion of the terms and codes that are hierarchically related to the extracted codes and terms).

Examples of conditions are presented below (in Prolog). These rules (R1 - R8) are written to the input file:

R1. code(P,C) :- code_m(P,C), not(reject_code(C)). // accept code C for paper P if it is listed as a code for paper P and it is not one of the negative codes

R2. term(P,T) :- term_m(P,T), not(reject_term(T)). // accept term T for paper P if it is listed as a term for paper P and it is not one of the negative terms

R3. code_m_tax(P,C) :- code_m(P,C). // code C is in the code taxonomy for paper P

R4. code_m_tax(P,C) :- code_parent(C,D), code_m_tax(P,D). // code C is in the code taxonomy code_m_tax for paper P if D is in the code taxonomy for paper P and C is the parent of code D

R5. code_tax(P,C) :- code_m_tax(P,C), not(reject_code(C)). // code C is in the code taxonomy for paper P if it is in the code taxonomy for code_m_tax and it is not listed as a negative code

R6. term_m_tax(P,T) :- term_m(P,T). // term T is in the term taxonomy for paper P

R7. term_m_tax(P,T) :- term_parent(T,D), term_m_tax(P,D). // term T is in the term taxonomy term_m_tax for paper P if D is in the term taxonomy for paper P and T is the parent of term D

R8. term_tax(P,T) :- term_m_tax(P,T), not(reject_term(T)). // term T is in the term taxonomy for paper P if it is in the term taxonomy for term_m_tax and it is not listed as a negative term
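The effect of the taxonomy rules can be illustrated outside Prolog. The Python sketch below is an assumed re-expression for illustration, not the embodiment's code: the function name `taxonomy` is hypothetical, and the parent/child pairs are taken from the code_parent examples given earlier. It computes the upward closure (rules R3 and R4, or R6 and R7) and then filters out rejected items (rule R5, or R8).

```python
def taxonomy(listed, parent_pairs, rejected):
    """Upward closure of the listed items over (parent, child) pairs,
    followed by removal of any rejected (negative) items."""
    parents_of = {}
    for parent, child in parent_pairs:
        parents_of.setdefault(child, set()).add(parent)

    closure = set(listed)            # R3 / R6: listed items are in the taxonomy
    frontier = list(listed)
    while frontier:                  # R4 / R7: so is every parent of a member
        for parent in parents_of.get(frontier.pop(), ()):
            if parent not in closure:
                closure.add(parent)
                frontier.append(parent)
    return closure - set(rejected)   # R5 / R8: drop negative items

CODE_PARENT_PAIRS = [
    ("C7000", "C7800"),
    ("C7800", "C7810"),
    ("C7810", "C7810C"),
]

full = taxonomy(["C7810C"], CODE_PARENT_PAIRS, rejected=[])
filtered = taxonomy(["C7810C"], CODE_PARENT_PAIRS, rejected=["C7800"])
```

Because the rejection test is applied only after the closure has been built, rejecting an intermediate code such as C7800 removes that code alone; its own ancestor C7000 remains in the taxonomy, which matches the way R4 recurses on code_m_tax rather than on code_tax.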

The codes, terms and free text retrieved from both the INSPEC database 103 and the data store 107 are in a mixture of formats; for convenience a Perl script is used to create the Prolog file in the format presented above (the use of Perl is inessential - any programming language could be used).

Comparator 109

In the present embodiment comparator 109 makes use of the Progol Inductive Logic Programming (ILP) system (refer to Muggleton S. "Inverse Entailment and Progol", New Generation Computing, 13: 245-286, 1995 for further details), but the use of an ILP system is inessential to the invention; any means that provides the functionality described below could be used.

The input file created by selecting means 105 as described above specifies how many of the codes, terms, taxonomies and free text corresponding to papers written to the input file the comparator 109 should use, and in what format. Codes, terms, free terms and taxonomies, which are used to generate a hypothesis based on the examples (papers), are, in Prolog terminology, known as "literals". The input file specifies literals to be used by the comparator 109 in accordance with the following lines of Prolog:

(a) :- modeh(1,inquery(+any))? // look for rules that have an "inquery" literal as the consequent, i.e. rules of the form "IF conditions K1, K2, ..., Kn hold on paper P THEN P is a positive example of the user's interest". The "+" means that this will be a variable (e.g. A) and not a value (e.g. 5276296)

(b) :- modeb(*,code(+any,#any))? // select an infinite number (* - in practice, as many as is given in the input file) of codes from the input file, in any format, for each inquery

(c) :- modeb(*,code_tax(+any,#any))? // select an infinite number (* - in practice, as many as is calculated) of code taxonomies, in any format, for each inquery (recall that code taxonomies are evaluated according to rules R1 - R8 discussed above)

(d) :- modeb(*,term(+any,#any))? // select an infinite number (* - in practice, as many as is given in the input file) of terms from the input file, in any format, for each inquery

(e) :- modeb(*,term_tax(+any,#any))? // select an infinite number (* - in practice, as many as is calculated) of term taxonomies, in any format, for each inquery (recall that term taxonomies are evaluated according to rules R1 - R8 discussed above)

(f) :- modeb(*,free(+any,#any))? // select an infinite number (*) of free terms from the input file, for each inquery

Each of (a) - (f) is a mode declaration, which is a query aimed at a Prolog interpreter within the comparator 109. Prolog queries have a ':-' at the beginning and a '?' at the end to differentiate them from asserted statements.

Operation and functionality of the comparator 109 are described with reference to Figure 5:

• S 5.1 Identify literals to perform inductive reasoning on;

• S 5.2 Select combinations of identified literals;

• S 5.3 Evaluate occurrence of selected combinations in other examples (papers);

In step S 5.1, the comparator 109 identifies literals in accordance with (a) - (f) above (implicitly including rules R1 - R8). Additional parameters relating to the selection criteria are included in the input file and include:

:- set(h,100)? // number of literals per example (paper) to consider (here set to 100)

In step S 5.2, the comparator 109 selects combinations of the identified literals. The selection of combinations of literals is also specified in the input file, and takes the following form:

:- set(c,2)? // maximum number of literals to combine and base search on (so occurrences of a single literal (which is a null combination as it involves a single literal) are considered)

In step S 5.3 the comparator 109 evaluates the occurrence of these combinations in all of the other examples (i.e. in this embodiment other papers). The number of combinations to be considered is specified in the input file and takes the following form:

:- set(nodes,5000)? // maximum number of combinations of literals which will be tested. For this example it is set to 5000 to allow for 100 different literals (since there are (100*99)/2=4950 different pairs of literals).
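The arithmetic in the comment above can be checked with a few lines of Python. The helper `candidate_count` is a hypothetical name introduced for this sketch; it counts every combination of between one and `max_size` literals.

```python
from math import comb

def candidate_count(num_literals, max_size):
    """Number of distinct combinations of 1..max_size literals."""
    return sum(comb(num_literals, size) for size in range(1, max_size + 1))

pairs = comb(100, 2)              # (100*99)/2 distinct pairs of literals
total = candidate_count(100, 2)   # single literals plus pairs
```

Here `pairs` is 4950, which is within the limit of 5000. Whether single literals count towards the same limit is not stated in the text; if they do (an assumption), the total of 5050 candidates would slightly exceed it.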

The evaluation of occurrences of combinations involves performing a greedy search on the literals listed for each example, and is quantified by a score. Thus a combination of literals that occurs in all of the papers, say, will return a high score. For each example, the combination of literals having the highest score is taken to be a rule that most accurately characterizes the examples; so that a characterizing rule is typically a combination of literals. While there are still papers to be analysed, the comparator 109 moves to the next uncovered example (i.e. a paper that doesn't include the combination of literals previously identified as having the highest score), and repeats the process. With a greedy search, it is unlikely that all of the papers will be individually considered - as previously stated, any papers that include one of the combination of literals that has been identified as having a highest score (for a paper processed previously) will not be processed in its own right. However, as an alternative, each of the papers could be considered individually, with the condition that if a combination that has already been identified appears in a paper, for the purposes of analysis in respect of the said paper, the combination is discarded. Variations on the two searching approaches could also be applied, such as:

Find best rule in first example. Find best rule (that is different to the best rule in the first example) in second example, compare with first.

Find best rule in last example, compare with previous best. Take overall best rule as rule 1 and remove covered examples. Repeat for uncovered examples, until all covered.

The output from the comparator 109 is thus one or more characterizing rules 111 that have been identified as being most representative of all of the examples given.
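The greedy covering procedure described above can be sketched in Python as follows. This is a simplified illustration and not the Progol algorithm: the score used here (positives covered, minus negatives covered, minus the number of literals) is an assumed stand-in for the comparator's own scoring, and the names `covers`, `best_rule`, `greedy_cover` and the example literals are hypothetical.

```python
from itertools import combinations

def covers(rule, literals):
    """A rule covers an example if every literal of the rule occurs in it."""
    return set(rule) <= literals

def best_rule(seed, positives, negatives, max_size=2):
    """Best-scoring combination of up to max_size literals drawn from the seed."""
    best, best_score = None, None
    for size in range(1, max_size + 1):
        for rule in combinations(sorted(positives[seed]), size):
            p = sum(covers(rule, lits) for lits in positives.values())
            n = sum(covers(rule, lits) for lits in negatives.values())
            score = p - n - size          # assumed, simplified score
            if best_score is None or score > best_score:
                best, best_score = rule, score
    return best

def greedy_cover(positives, negatives, max_size=2):
    """Repeat: take the next uncovered example, keep its best rule,
    then drop every example that the rule covers."""
    uncovered = list(positives)
    rules = []
    while uncovered:
        rule = best_rule(uncovered[0], positives, negatives, max_size)
        if rule is None:                  # example with no literals: skip it
            uncovered = uncovered[1:]
            continue
        rules.append(rule)
        uncovered = [ref for ref in uncovered
                     if not covers(rule, positives[ref])]
    return rules

# Invented examples: paper reference -> set of literals.
positives = {
    1: {"term_tax:logic", "code_tax:C6100", "free:Prolog predicates"},
    2: {"term_tax:logic", "code_tax:C6100"},
    3: {"term_tax:logic", "free:interface"},
    4: {"code_tax:C7800"},
}
negatives = {9: {"code_tax:C6100"}}

rules = greedy_cover(positives, negatives)
```

In this toy data the first rule found from paper 1 also covers papers 2 and 3, so only paper 4 is processed in its own right, which mirrors the remark that a greedy search is unlikely to consider every paper individually.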

Example outputs:

1. Literals selected for consideration: Step S 5.1

[Generalising inquery(3281416).] // first example considered: paper ref number 3281416
inquery(A) :- code(A,'C6110'), code(A,'C6115'),
term(A,'logic programming'), term(A,'programming environments'), term(A,'software tools'),
code_tax(A,'C6110'), code_tax(A,'C6115'), code_tax(A,'C6100'),
term_tax(A,'logic programming'), term_tax(A,'programming environments'), term_tax(A,'software tools'), term_tax(A,'software engineering'), term_tax(A,programming), term_tax(A,logic),
free(A,'logic programming'), free(A,interface), free(A,'Xerox Quintus Prolog'), free(A,'Xerox AI environment'), free(A,'object-oriented programming'), free(A,'Xerox Lisp machines'), free(A,'support tools'), free(A,layers), free(A,'Prolog predicates'),

2. Combinations of literals searched (step S 5.2), and scores (step S 5.3):

[C:180,18,0,0 inquery(A) :- code(A,'C6110').] // first combination considered; the 4-dimensional value after C is the score
[C:196,65,0,0 inquery(A) :- term(A,'logic programming').]
[C:196,130,2,0 inquery(A) :- code_tax(A,'C6110').]
[C:187,27,0,0 inquery(A) :- code_tax(A,'C6115').]
[C:196,67,0,0 inquery(A) :- term_tax(A,'logic programming').]
[C:180,18,0,0 inquery(A) :- term_tax(A,'programming environments').]
[C:198,108,0,0 inquery(A) :- term_tax(A,logic).]
[C:182,20,0,0 inquery(A) :- free(A,'logic programming').]
[C:2,2,0,0 inquery(A) :- free(A,interface).]
[C:122,5,0,0 inquery(A) :- free(A,'object-oriented programming').]
[C:2,2,0,0 inquery(A) :- free(A,'Prolog predicates').]
....
[C:194,77,0,0 inquery(A) :- term_tax(A,programming), term_tax(A,logic).] // combinations of 2 literals start here
[C:172,20,0,0 inquery(A) :- term_tax(A,programming), free(A,'logic programming').]
[C:193,66,0,0 inquery(A) :- code_tax(A,'C6100'), term_tax(A,'logic programming').]
[C:169,18,0,0 inquery(A) :- code_tax(A,'C6100'), term_tax(A,'programming environments').]
[C:156,13,0,0 inquery(A) :- code_tax(A,'C6100'), term_tax(A,'software tools').]
[C:52,4,0,0 inquery(A) :- code_tax(A,'C6100'), free(A,'Prolog database').]
[C:162,15,0,0 inquery(A) :- code_tax(A,'C6110'), code_tax(A,'C6115').]
[C:194,130,2,0 inquery(A) :- code_tax(A,'C6110'), code_tax(A,'C6100').]
[C:192,62,0,0 inquery(A) :- code_tax(A,'C6110'), term_tax(A,'logic programming').]

f=198,p=108,n=0,h=0 // This is the "best score", i.e. the one having the highest values of f and p, while n and h are equal to 0, where:
f = Number of positive examples covered - Number of negative examples covered - Number of literals in body of clause - Optimistic estimate of literals needed
p = Number of positive examples covered
n = Number of negative examples covered
h = Optimistic estimate of literals needed

Note that these scores are specific to the type of comparator 109 used in this embodiment, Progol (for further details, refer to the above-referenced paper), and are thus inessential to the invention; if an alternative comparator were used, the scoring mechanism would most likely be different.

[Result of search is] inquery(A) :- term_tax(A,logic). // literal corresponding to this "best score"
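The selection of the "best score" from the trace above can be sketched as follows. This is a minimal illustration in Python, not the Progol search itself: it takes a subset of the (f, p, n, h) tuples listed in the trace as given, discards candidates for which n or h is non-zero, and picks the one with the highest f and p. The names `Scored` and `best_clause` are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Scored(NamedTuple):
    f: int      # overall score reported by the comparator
    p: int      # positive examples covered
    n: int      # negative examples covered
    h: int      # optimistic estimate of literals still needed
    body: str   # body of the candidate clause


# Scores copied from the trace above (a subset of the candidates listed).
candidates = [
    Scored(180, 18, 0, 0, "code(A,'C6110')"),
    Scored(196, 65, 0, 0, "term(A,'logic programming')"),
    Scored(196, 130, 2, 0, "code_tax(A,'C6110')"),
    Scored(187, 27, 0, 0, "code_tax(A,'C6115')"),
    Scored(196, 67, 0, 0, "term_tax(A,'logic programming')"),
    Scored(198, 108, 0, 0, "term_tax(A,logic)"),
    Scored(182, 20, 0, 0, "free(A,'logic programming')"),
    Scored(194, 77, 0, 0, "term_tax(A,programming), term_tax(A,logic)"),
    Scored(194, 130, 2, 0, "code_tax(A,'C6110'), code_tax(A,'C6100')"),
]


def best_clause(scored: Sequence[Scored]) -> Optional[Scored]:
    """Return the candidate with the highest (f, p) among those with n == 0 and h == 0."""
    consistent = [c for c in scored if c.n == 0 and c.h == 0]
    return max(consistent, key=lambda c: (c.f, c.p), default=None)


best = best_clause(candidates)
print(f"inquery(A) :- {best.body}.")  # prints: inquery(A) :- term_tax(A,logic).
```

Note that code_tax(A,'C6110') covers more positive examples (p=130) but is excluded because it also covers two negative examples (n=2).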

Once the comparator 109 has performed the above analysis on each of the examples (papers), the characterizing rules 111 identified by the comparator 109 (e.g. combinations of literals, such as, in the above illustration, "logic" (in this case a null combination)) are input to interfacing means 101 and displayed to the user. The user confirms, or otherwise, that the characterizing rules (literals) 111 are representative of the subject of interest. For the characterizing rules that are marked by the user as falling within the subject of interest (positive), the interfacing means 101 submits these characterizing rules to the paper identifying means 203. Additional papers having the literals comprising the characterizing rules 111 are then retrieved. Clearly, if a code is used to identify papers, then all of the papers having that code in their code listings will be retrieved by paper identifying means 203. These papers will have other codes, free text and terms, and once each paper has been submitted to data retrieving means 205, the selecting means 105, and thus the comparator 109, will receive all of these other codes, free text and terms. As a result, the search space for the comparator 109 will increase correspondingly.

The user can also enter more keywords via the GUI 201, prompting paper identifying means 203 to retrieve more papers (and data retrieving means 205 to retrieve codes/terms and/or free text corresponding thereto).

The paper(s) corresponding to characterizing rules 111 that have been marked by the user as negative examples are passed to selecting means 105, for inclusion in the input file as negative example(s); similarly for paper(s) corresponding to positively marked characterizing rules 111. Individual terms/codes and/or free text can also be marked as negative literals (i.e. independent of paper, these literals are never representative of the subject) and written to the input file by selecting means 105. Note that the user could include negative examples of papers and literals in the first run of the system 100.
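By way of illustration only, the step of writing positive and negative example papers to the input file might be sketched as below. The sketch assumes that positive examples are written as facts of the form inquery(N). and that negative examples are written as headless clauses of the form :- inquery(N). (an assumed convention, not one stated in this description). The function name `build_examples` and the reference number 1234567 are hypothetical.

```python
from typing import Dict, List


def build_examples(feedback: Dict[int, bool]) -> List[str]:
    """Render user feedback as example lines for the comparator's input file.

    feedback maps a paper reference number to True (positive) or False (negative).
    Positive examples are listed first, then negative ones, each sorted by number.
    """
    positives = [f"inquery({ref})." for ref, ok in sorted(feedback.items()) if ok]
    # Assumed convention: a negative example is a headless clause.
    negatives = [f":- inquery({ref})." for ref, ok in sorted(feedback.items()) if not ok]
    return positives + negatives


# 3281416 is the example paper used above; 1234567 is a made-up reference number.
feedback = {3281416: True, 1234567: False}
for line in build_examples(feedback):
    print(line)
```

Each repetition of the process would regenerate these lines from the user's accumulated markings before the comparator 109 is run again.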

The comparator 109 operates on the new input file, and presents the results to the user, as described above. This process is repeatable indefinitely.

A further advantage of the invention is that a user can enter papers relating to multiple fields of interest and the comparator 109 will retrieve characterizing rules for each of the multiple fields. As described above the comparator 109 searches for combinations of literals that occur in each of the example papers. Thus papers having terms, codes and free text relating to subject A will naturally group together and the searching process will simply discard terms, codes and free text relating to subjects B, C, D, etc.

Although the embodiment described above discusses a user submitting one or more example papers to interfacing means 101, the document retrieval system 100 can be expected to provide increasingly useful characterizing information with increasing numbers of examples. This is primarily because the document retrieval system 100 determines the characterizing information by a process of induction, which, as is well known, is the inference of general hypotheses from specific facts. Thus the quality of the hypothesis can be expected to be proportional to the number of facts.

Additional conditions

In the embodiment described above, the comparator 109 operates on codes and terms that are hierarchically related to codes and terms within papers, as specified by selection rules R1 - R8. Additional relations could be included, such as:

• Horizontal relationships e.g. for codes, include code_see_also('A2300','B5615').

R5. rule definition: code_related(P,C) :- code(P,D), code_see_also(D,C).

(g) :- modeb(2,code_related(+any,#any))? as Progol instructions in the input file

• Cross interests between users (i.e. user U is interested in paper P if user V is interested and P has code C) e.g. for codes, include code_user('Fred','Sam').

R6. rule definition: user_choice(Paper,Person) is a relation describing which papers others have liked

(h) :- modeb(2,user_choice(+any,#any))? as Progol instructions in the input file.

• Inter-example (paper) e.g. some information sources include references to other papers (e.g. BIDS), in which case other papers could be implicitly retrieved (thus making use of a recursive relationship, as for the taxonomies in the embodiment).

Other parameters that could be used to select papers in a relational manner include author and paper treatment.
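The horizontal ("see also") relationship listed above can be illustrated with a short sketch. It reads the rule as: a paper P is related to code C when P has some code D for which a see-also link from D to C exists. The sketch is in Python rather than Progol, the paper identifiers and the table `paper_codes` are hypothetical sample data, and only the see-also pair ('A2300', 'B5615') is taken from the text.

```python
from typing import Dict, Set, Tuple

# Hypothetical sample data: the codes listed for each paper.
paper_codes: Dict[str, Set[str]] = {"P1": {"A2300"}, "P2": {"C6110"}}

# Lateral links between codes; this pair is the example given in the text.
code_see_also: Set[Tuple[str, str]] = {("A2300", "B5615")}


def code_related(paper: str) -> Set[str]:
    """Codes C such that the paper has a code D and code_see_also(D, C) holds."""
    own = paper_codes.get(paper, set())
    return {c for (d, c) in code_see_also if d in own}


print(code_related("P1"))  # the code reached through the see-also link
print(code_related("P2"))  # no see-also link starts from C6110 in this sample
```

Literals obtained in this way would simply be added to those already available to the comparator 109 for the paper concerned, in the same manner as the taxonomy-derived literals.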

Implementation

Referring to Figure 6, the user enters data via a browser 614, which is stored on a client terminal 605 (e.g. on the hard disk drive thereof), and interacts, in known fashion, with hypertext information received from an application server 620 via a LAN. In this embodiment the hypertext information may be an HTML form, which is displayed to a user as GUI 201. The user enters various parameters and/or queries (as described above), and posts the form, in a known manner, to a co-operating program (paper identifying means and data retrieving means 203, 205) stored on the server 620. The information sources 103 are located on a database server 625, and for convenience the selecting means, data store and comparator 105, 107, 109 are stored on the application server 620.

For further information relating to HTML forms in the context of client/server arrangements, see "Client/Server Programming with Java and CORBA", 2nd Edition, R. Orfali and D. Harkey, pp. 239 - 242.

Additional details and modifications

Although the present invention is mainly concerned with a user specifying papers that are related to the user's subject of interest, as described above it is also possible for the system to look for papers that include user specified keywords. When a search is performed on a keyword, all papers that include instances of that keyword will be returned. One of the features of language is that a single word can take a different meaning in different contexts. Thus the papers returned may include several papers that are unrelated to the subject of interest (this is one of the major problems with accessing information on the Internet). E.g. a user is interested in mining (geological mining), and submits "mining" as a keyword; of the papers that are returned, at least some are likely to be in the field of "data mining", which is unrelated to the user's interest. When the papers are presented to the user for review, he can mark such papers as "negative" examples, which will proactively force the comparator 109 to discard all of the codes, terms and free text that are listed for those papers.

When a user has submitted a keyword K via the GUI 201, and paper identifying means 203 has returned a selection of papers that match on K, the user may specify that all papers matching keyword K are to be included as positive examples of the user's interest. If the user then enters a paper P1 (not returned by information source 103) as a negative example of his interest, and that paper includes keyword K in it, this keyword will be presented to comparator 109 as being both positive and negative (positive in respect of some examples, and negative in respect of one example). This therefore presents conflicting information to the comparator 109. In order to prevent this, conditions, such as "an explicit mention of a paper overrides a paper identified via a query on a keyword" can be introduced to data retrieving means 205. Applying this condition to the example, where the user has explicitly included information relating to P1, the status of paper P1 would override that of all of the papers returned from the keyword search. Note that the exact nature of these conditions is unimportant, and is likely to change between scenarios.
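One possible reading of the condition "an explicit mention of a paper overrides a paper identified via a query on a keyword" is a simple precedence rule applied per paper, sketched below. The function name `resolve_status`, the status strings and the paper identifiers are hypothetical; as noted above, the exact nature of such conditions may differ between scenarios.

```python
from typing import Dict


def resolve_status(explicit: Dict[str, str], from_keyword: Dict[str, str]) -> Dict[str, str]:
    """Merge paper statuses, letting explicit mentions override keyword-derived ones.

    Both arguments map a paper identifier to 'positive' or 'negative'.
    """
    merged = dict(from_keyword)
    merged.update(explicit)  # explicit entries win wherever the two sources conflict
    return merged


# All papers matching keyword K were marked positive by the user.
from_keyword = {"P1": "positive", "P2": "positive"}
# The user then names P1 explicitly as a negative example.
explicit = {"P1": "negative"}

print(resolve_status(explicit, from_keyword))
```

With this rule each paper reaches the comparator 109 with a single status, so the conflicting positive and negative markings for P1 do not both arrive.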

As an alternative to a user explicitly entering papers of interest, data retrieving means 205 could be adapted to receive input from either a web browser, if papers are viewed from within a browser, or from a log that is maintained by the information sources 103. Web browsers typically save details of web pages that have been accessed via the browser during a pre-specified time period. Thus data retrieving means 205 could include means for searching the list of web pages that have been viewed by the user, in order to retrieve details corresponding to papers viewed from particular information source sites. Details of these papers could then be input to the GUI 201, which enables the user to review the selection of these papers, and remove any papers that are unrelated to the subject of interest. For example, a user could have several profiles, one for each topic of interest, each including a list of papers that the user has recently looked at relating to that topic. The GUI 201 could include a plurality of activatable dialog boxes, each corresponding to one profile, such that when a dialog box is activated in a manner well known to one with ordinary skill in the art, papers related to the topic associated with that dialog box are displayed to the user. The user could then "drag and drop" papers as required, in order to select papers that are to be included as examples by the system 100.
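Searching the list of viewed web pages for papers from particular information source sites could, for instance, amount to filtering the browser history by host name. The following is a minimal sketch under that assumption; the function name `papers_from_history`, the URLs and the host names are all made up for illustration, and no particular browser's history format is implied.

```python
from typing import Iterable, List, Set
from urllib.parse import urlparse


def papers_from_history(history: Iterable[str], source_hosts: Set[str]) -> List[str]:
    """Keep the visited URLs whose host belongs to a known information source."""
    return [url for url in history if urlparse(url).hostname in source_hosts]


# Hypothetical browsing history and information source host.
history = [
    "https://papers.example.org/abstract/3281416",
    "https://news.example.com/today",
]
print(papers_from_history(history, {"papers.example.org"}))
```

The URLs retained in this way would then be presented to the user for review before being used as examples.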

As will be understood by those skilled in the art, the invention described above may be embodied in one or more computer programs. These programs can be contained on various transmission and/or storage media such as a floppy disc, CD-ROM, or magnetic tape so that the programs can be loaded onto one or more general purpose computers or could be downloaded over a computer network using a suitable transmission medium.

Unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising" and the like are to be construed in an inclusive as opposed to an exclusive or exhaustive sense; that is to say, in the sense of "including, but not limited to".

Claims

1. Apparatus for generating a search strategy for use in retrieval of information, the apparatus including

(a) data storage, or access to data storage, for storing a plurality of entities, data items allocated to each entity, and a predetermined data item structure comprising at least some of the allocated data items,

(b) identifying means for identifying a plurality of entities within the data storage;

(c) retrieving means for retrieving data items allocated to each identified entity;

(d) selecting means for selecting additional data items in the data item structure in accordance with one or more predetermined rules,

(e) means for generating a search strategy from the selected additional data items and the retrieved data items corresponding to the identified plurality of entities.

2. Apparatus according to claim 1, wherein the means for generating a search strategy (e) includes analysing means for analysing relationships between one or more groups of first data items and second data items, the first data items comprising selected additional data items and retrieved data items corresponding to one or more first identified entities and the second data items comprising selected additional data items and retrieved data items corresponding to each of the other identified entities.

3. Apparatus according to claim 2, wherein the relationships analysed by analysing means include commonality between first data items and second data items.

4. Apparatus according to claim 3, wherein the analysing means is further adapted to quantify said commonality into commonality values and determine one or more groups of first data items corresponding to commonality values above a predetermined threshold.

5. Apparatus according to any one of the preceding claims, wherein the retrieving means (c) is further adapted to receive categorisation information corresponding to each identified entity and/or each data item, the categorisation information including positive and negative categories.

6. Apparatus according to any one of the preceding claims, wherein the predetermined rules used by the selecting means (d) include relational conditions between the retrieved data items and the data items comprising the data item structure.

7. Apparatus according to claim 6, wherein the relational conditions include any one or some of hierarchical, acyclic and/or lateral associations characterising the data item structure.

8. Apparatus according to any one of the preceding claims, wherein the identifying means (b) is adapted to receive a query and to process the query so as to identify an entity associated with the query.

9. Apparatus according to any one of the preceding claims, wherein the identifying means (b) is further adapted to receive one or more references corresponding to one or more entities, and to use the or each reference in order to identify the or each entity.

10. Apparatus according to any one of claims 2 to 9, the apparatus further including means for presenting the determined first data items to a user, and means for enabling the user to specify categorisation information for the first data items.

11. Apparatus according to any one of the preceding claims, wherein the retrieved data items include at least some of terms and/or codes and/or free text data.

12. Apparatus according to any one of the preceding claims, wherein an entity includes a document.

13. A method of generating a search strategy for use in retrieval of information, the method including the steps of:

(a) identifying a plurality of entities stored within a data store;

(b) retrieving data items allocated to each identified entity, the allocated data items also being stored within a data store;

(c) selecting additional data items in the data item structure in accordance with one or more predetermined rules,

(d) generating a search strategy from the selected additional data items and the retrieved data items corresponding to the identified plurality of entities.

14. A method according to claim 13, in which step (d) of generating a search strategy includes analysing relationships between one or more groups of first data items and second data items, the first data items comprising selected additional data items and retrieved data items corresponding to one or more first identified entities and the second data items comprising selected additional data items and retrieved data items corresponding to each of the other identified entities.

15. A method according to claim 14, in which said step of analysing relationships between the first data items and second data items includes establishing the degree of commonality, by means of a commonality value, between each of the groups of first data items and second data items, such that first data items corresponding to commonality values above a predetermined threshold are used, at least in part, to generate the search strategy.

16. A method according to any one of claims 13 to 15, in which the identifying step (a) includes receiving a query and processing the query so as to identify an entity associated with the query.

17. A method according to any one of claims 13 to 16, in which the identifying step (a) further includes receiving one or more references corresponding to one or more entities.

18. A method according to any one of claims 15 to 17, further including the steps of presenting the first data items comprising at least part of the search strategy to a user and retrieving categorisation information for the said first data items from the user, so as to modify the search strategy to include categorisation information.

19. A method according to any one of claims 13 to 18, including repeating steps (a) to (d) a predetermined number of times.

20. A computer program, or a suite of computer programs, comprising a set of instructions to cause a computer, or a suite of computers, to perform the method steps according to any one of claims 12 - 19.