WO1999056224A1 - Associating files of machine-readable data with specified information types - Google Patents

Associating files of machine-readable data with specified information types Download PDF

Info

Publication number
WO1999056224A1
WO1999056224A1 PCT/GB1999/001222 GB9901222W WO9956224A1 WO 1999056224 A1 WO1999056224 A1 WO 1999056224A1 GB 9901222 W GB9901222 W GB 9901222W WO 9956224 A1 WO9956224 A1 WO 9956224A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
computer
score
file
instructions
Prior art date
Application number
PCT/GB1999/001222
Other languages
French (fr)
Inventor
Llewelyn Ignazio Fernandes
Rachel Hammond
Original Assignee
The Dialog Corporation Plc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Dialog Corporation Plc filed Critical The Dialog Corporation Plc
Publication of WO1999056224A1 publication Critical patent/WO1999056224A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition

Definitions

  • the present invention relates to associating files of machine-readable data with specified information types so as to enhance the availability of information from a database in which users access databases via transmission channels.
  • search engines which assist in the location of information.
  • free text searching in which a user specifies words which they believe are contained within the target file as a mechanism for instructing the system to retrieve files of interest.
  • problems with this technique are well known to users of the available search engines, and a simple enquiry can generate hundreds of thousands of "hits", the majority of which will tend to be totally irrelevant to the user's needs.
  • other relevant files may be missed simply because they do not contain the specific chosen words.
  • a problem with the Internet is that the freedom of the Internet is also its downfall. Information is not classified before it is made available, therefore it is highly likely that even the simplest search will fail to identify relevant documentation and will take a considerable period of time to perform.
  • Procedures are known for processing a data file so as to determine whether the data file should be associated with a particular information category.
  • the known processes require a machine readable association file, also known as an outline file and usually identified by file extension .OTL.
  • OTL file extension
  • association files are particularly well suited to associating new data files which are of a substantially similar size to the original source data files.
  • association files generated by more traditional techniques still tend to be well suited to input data files of a particular size and less well suited to incoming data files of differing sizes.
  • relevant files will fail to be categorised given that the processing of these files will result in an inappropriately low weighting value being calculated.
  • apparatus for associating files of machine-readable data with information types comprising examining means for examining data elements in a file to identify occurrences of specified data types; adjusting means for adjusting a score in response to said identified occurrences and for adjusting said score in relation to said size of said data; and associating means for associating the information types with processed files dependent upon said score values.
  • the adjusting means is configured to further adjust the score inversely in relation to a size of said data.
  • means are included for determining the size of said data in terms of the volume of the stored file representing the number of characters present in the file.
  • a method of associating files of machine-readable data with information types comprising steps of examining data elements in a file to identify occurrences of specified data types; adjusting a score in response to said identified occurrences; further adjusting said score in relation to a size of said data; and associating said information types with processed files dependent upon said score values.
  • the further adjustment is performed if the size of the data is below a predetermined threshold and a further adjustment may be performed up to a predetermined maximum weighting value.
  • the information types are expressed as preferred terms and are specified by information outlines.
  • the information outlines may define preferred terms in a branching hierarchical structure and the adjustable values may be determined at branches of this structure.
  • the further adjustment may include adjustment at a plurality of the branches and score values may be determined by multiplying values stored at branches throughout the hierarchy.
  • Figure 1 shows a data distribution environment, including a data processing, storage and retrieval system
  • Figure 2A illustrates procedures performed by the processing system shown in Figure 1, including a process for specifying preferred terms, a process for associating preferred terms with source files and a process for performing a search in response to a user request;
  • Figure 2B illustrates underlying principles of the operation of the preferred embodiments;
  • Figure 3 details the processing system shown in Figure 1, including subsidiary processing units;
  • Figure 4 details the process shown in Figure 2A for specifying preferred terms for association with data files, including a process for the generation or modification of outline files;
  • Figure 5 details the step shown in Figure 4 for the generation or modification of outline files
  • Figure 6 illustrates a graphical view of an OTL file
  • Figure 7 illustrates the actual OTL data for the graphical representation shown in Figure 6;
  • Figure 8 shows a diagrammatic representation of the data illustrated in Figure 7;
  • Figure 9 shows a subsidiary processor of the type identified in Figure
  • Figure 10 shows operations performed by the processing unit shown in Figure 9 in response to instructions read from the memory shown in Figure 9;
  • Figure 11 shows rule bases generated for each OTL file in response to the selection step shown in Figure 10;
  • Figure 12 details the step for associating preferred terms with source files identified in Figure 2A, including a step for processing data to determine associated preferred terms
  • Figure 13 details the step for processing data to determine associated preferred terms identified in Figure 12, including a triggering phase, a scoring phase and a list generation phase
  • Figure 14 details the triggering phase identified in Figure 13;
  • Figure 15 details the generation of file weighting factor
  • Figure 16 shows an example of a short data file
  • Figure 17 details the scoring phase identified in Figure 13
  • Figure 18 details the generation phase for a list of associated preferred terms identified in Figure 13;
  • Figure 19 illustrates a preferred term table, having pointers to a linked list
  • Figure 20 details the linked list referred to in Figure 19;
  • Figure 21 details procedures performing a search in response to a user request identified in Figure 2A;
  • Figure 22 illustrates a screen display generated by the requesting step shown in Figure 21;
  • Figure 23 details a screen for prompting search criteria generated by the criteria requesting step in Figure 21;
  • Figure 24 illustrates the displaying of titles of associated files generated by the procedure shown in Figure 21.
  • FIG. 1 A data distribution environment is illustrated in Figure 1 in which data, received from a plurality of data sources 101 , 102, 103 is supplied to a data processing, storage and retrieval system 104.
  • Data sources 101 and 102 supply data directly to processing system 104 while data source 103 supplies data via a local area network 105, thereby allowing user terminals 106 and 107 to gain direct access to their local data source 103.
  • the processing system 104 provides access to a plurality of users, such as users 111 , 112, 113, 114, 115, 116 and 117.
  • User 111 has direct access to the processing system 104 while users 112, 113 and 114 gain access to the processing system 104 via the Internet 118.
  • Users 115, 116 and 117 exist within a more sophisticated environment in which they have access, via a local area network 119 to their own local database system 120 in addition to a connection, via an interface 121 , to the data processing system 104.
  • All incoming data from data sources 101 to 103 is categorised with a key word in seven separate fields, comprising "market sector", “location”, “company name”, “publisher”, “publication date” and “scope”.
  • Users such as users 112 to 117 may specify almost any term as the basis for a search and are then prompted by an equivalent word or phrase which constitutes more preferred search parameters.
  • a user may specify a search word such as "confectionery” and the system will prompt the user to consider narrower terms such as “chocolate” along with related terms such as “cakes” or “desserts”, or broader terms such as "food”. From a simple request, a user is given an option of focusing further or of taking a broader overview of the subject under consideration.
  • the scope of an article refers to the context in which the document or article was written. For example, the scope field may consider questions as to whether the article concerns “mergers and acquisitions” or “seasonal trends” et cetera. Such terms are useful in gathering related information from a wide variety of industries and markets and may prove invaluable for particular applications.
  • FIG. 2A An overview of procedures performed by the data processing system 104 is illustrated in Figure 2A.
  • step 201 categories for association with data files are specified. This step is essentially performed as an "off-line" process; establishing the environment for allowing source data to be processed as it is received from sources.
  • Steps 202, 203 and 204 represent on-line procedures after the categories have been specified at step 201.
  • the processing system 104 receives data from sources such as sources 101 , 102 and 103.
  • the source data may be transmitted using different protocols, formats and standards therefore the processing system performs a standardisation process so that the data may be stored locally at the data processing system using standardised formats.
  • the data is processed so as to enhance a user's ability to identify information of interest.
  • Files of machine-readable data received from the sources are associated with specific categories, that may be considered as defining particular information types.
  • a file is considered as individual data elements, usually in the form of natural language words, and files are examined to identify occurrences of specified data types.
  • the purpose of this association is to identify files of data which are of interest in relation to particular topics. This enables a user to organise a search which should result in useful information being supplied to said user, with reference to said topics and categories, from an extremely large database of stored data files.
  • association step 203 significantly enhances the overall functionality of the system and provides an industrially applicable approach to allowing highly focused sets of information to be supplied to a user in preference to large volumes of data; much of which will tend to be totally irrelevant.
  • the files of data are considered and are given a score representing a numerical value as to their relevance with respect to the predefined topics.
  • Scores are adjusted in response to the number of identified occurrences of a specified data type.
  • these scores are also adjusted in relation to the size of the data contained within the file.
  • occurrences of data types in relatively small files are given a higher weighting with occurrences in larger files being given a lower weighting.
  • the adjustment of scores is related inversely to the actual size of the data file.
  • a threshold for the scoring values may be set and information types are associated with particular files dependent upon whether particular value scores fall on one side or on the other side of this threshold.
  • a search is performed, in response to categories identified by a user such that information of interest may be identified within the data stored by the data processing system 104 and transmitted to user terminals, such as terminal 111 , over transmission channels as illustrated in Figure 1.
  • Process 221 considers each incoming data file 222 to determine whether each data file should be associated with a particular category. Thus, process 221 is repeated for each category where associations are required. Decisions concerning the relevance of a particular data file to a particular category are made on the basis of the presence of particular data types within a file under consideration.
  • procedures 223 identify the presence of data type A
  • procedures 224 identify the presence of data type B
  • procedures 225 identify the presence of data type C
  • procedures 226 identify the presence of data type D and so on until procedures 227 identify the existence of data type N.
  • Each identification made by procedures 223 to 227 results in a score value being generated which may be accumulated at local accumulators 228, 229, 230, 231 and 232 respectively.
  • the data elements are examined in a file to identify occurrences of specified data types.
  • Scores, present in accumulators 228 to 232 are adjusted in response to the identified occurrences, a process usually involving accumulation and scaling. Having been examined for the existence of all relevant data types, the present file is considered by process 223, that determines the total size of the file. Having done this, a scaling factor is generated by process 233 which is supplied to each of the accumulation processes 228 to 232. In this way, it is possible to further adjust the score values in relation to the size of the incoming data file.
  • the scaled accumulation values are further accumulated by a unifying process 234, that in turn produces a single bit yes/no response to a categorisation process 235.
  • the incoming file data is received by the categorisation process 235 and in response to the output from process 234, an output file 236 is labelled as either belonging to the category under consideration or as not belonging to the category under consideration.
  • Processing system 104 is detailed in Figure 3. Data signals from data sources 101 to 103 are supplied to input interfaces 301 via data input lines 302. Similarly, output data signals are supplied to users 111 to 117 via an output interface 303 and output wires 304. Input interface 301 and output interface 303 communicate with a central processing system 305 based on DEC Alpha integrated circuitry. The central processing system 305 also communicates with other processing systems in a distributed processing architecture. Processing system 104 includes eight Intel chip based processing systems 311 to 318, each implementing instructions under the control of conventional operating systems such as Windows NT.
  • An operator communicates with the processing system 104 by means of an operator terminal, having a visual display unit 321 and a manually operable keyboard 322.
  • Data files received from sources 101 to 103 are written to bulk storage devices 323 in the form of large magnetic disk arrays. Data files are written to disk arrays 323 after these files have been associated with categories, as illustrated at step 203.
  • These association processes are performed by the subsidiary processors 311 to 318 and the central processing system 305 is mainly concerned with the switching and transferring of data between the interface circuits 301 , 303 and the disk arrays 323.
  • the central processing system 305 communicates with the subsidiary processors 311 to 318 via an Ethernet connection 324 and processing requirements are distributed between processors 311 to 318.
  • each individual incoming data file is supplied exclusively to one of the subsidiary processors.
  • the selected subsidiary processor is then responsible for performing the association process, to identify categories relevant to that particular data file.
  • one incoming data file 222 as shown in Figure 2B is supplied exclusively to one of processors 311 to 318 and the whole of process 221 is performed on said processor.
  • multiple operations of process 221 may be performed, one for each category being considered by the system.
  • the associated data file After being processed within a processor, the associated data file is returned to the central processing system 305, over connection 324 and the central processing system 305 is then responsible for writing the associated data file to the disk array 323. In this way, it is possible to scale the degree of processing capacity provided by system 104 in dependence upon the volume of data files to be processed.
  • the central processing system 305 also maintains a table of categories, pointing to particular data files which have been identified as relevant to said categories.
  • Computer system 305 is interfaced to a CD ROM reader 325 and in this way executable instructions may be supplied to computer system 305 and to processing systems 311 to 318, by means of a CD ROM 326.
  • Process 201 for specifying categories for association with data files is detailed in Figure 4.
  • a preferred term is selected and at step 402 an outline (OTL) file is generated or modified.
  • OTL outline
  • Step 402 for the generation or modification of outline files is detailed in Figure 5.
  • a visual OTL editor is opened resulting in the editor's visual interface being displayed on VDU 321.
  • a question is asked as to whether an existing file is to be loaded for modification and if answered in the negative a new OTL file is created at step 503. If the question asked at step 502 is answered in the affirmative, step 503 is bypassed and at step 504 modifications or additions are made to the OTL definition.
  • the OTL modifications created at step 504 are tested on a sample of test data and at step 506 a question is asked as to whether another modification is to be made. When answered in the affirmative, control is returned to step 504 resulting in further modifications or additions being made to the OTL definitions.
  • the new or modified OTL file is saved at step 507.
  • a graphical representation of the OTL file data is presented to an operator via the visual display unit 321.
  • An example of a display of this type is illustrated in Figure 6, representing a graphical illustration of a specific OTL file.
  • the OTL file stores definitions in an hierarchical tree structure and this structure is represented in the graphical view as shown in Figure 6.
  • a representation of the tree may be contracted or expanded and the possibility of expanding a particular branch is identified by a plus sign on a particular line, as shown at 601. Similarly, when a particular branch has been fully expanded, the line is identified by a minus sign as shown at 602.
  • Definitions within the file consist of rules, words and labels. The labels allow relationships to be defined between various parts of the file and between individual files themselves.
  • the words identify specific words within an input file of interest and the rules define how and what weights are to be attributed to these words.
  • Each rule line includes, at its beginning, a weight value 603 representing the score that will be attributed when a particular rule condition is met.
  • OTL file data represented graphically in the form shown in Figure 6 is actually stored in a data file having a format of the type shown in Figure 7.
  • the actual data file shown in Figure 7 corresponds to the data display in Figure 6 but in Figure 7 all of the data, some of which has been rolled up in Figure 6, is present.
  • the data contained within the file shown in Figure 7 is manipulated interactively by an operator in response to the graphical interface displayed as illustrated in Figure 6.
  • Score values 603 are also identified in the data file shown in Figure 7. Displayed line 601 in Figure 6 is generated from line 701 of the actual stored data.
  • the syntax of the language used for recording the data, as illustrated in Figure 7, may vary and the example shown is specific to this particular application. However, the underlying functionality of the language may be considered with reference to the diagrammatic representation shown in Figure 8.
  • this particular outline file is concerned with the topic of the oil industry and therefore the purpose of the OTL file is to identify words and phrases within an input file so as to provide an indication as to how relevant that input data is to users having an interest in the oil industry.
  • the purpose of procedures exploiting these OTL files is to generate evidence showing that a particular data file conveys information which may be of interest to those studying the oil industry.
  • OTL definitions and structures are determined empirically and would be modified and upgraded over a period of time.
  • the system does more than merely register the existence of a particular word item by placing the word items within an interacting structure; the nature of which is illustrated in Figure 8.
  • the particular entry, given label "oil-industry-mkt" relates to marketing aspects of the oil industry and as such can contribute to an overall score as to the pertinence of incoming data to this particular topic.
  • the first line 801 shows that this particular contribution may provide a total score of forty percent. This total of forty percent is then subdivided such that at line 802 the presence of the phase "buying oil from” has a score of fifty percent.
  • the total contribution made the presence of this phrase consists of fifty percent of forty percent, ie a total of twenty percent being made to the total contribution.
  • particular words may be identified which result in contributions of sixty percent of thirty percent of forty percent.
  • a complete OTL file is structured in this way with particular words and phrases making contributions to an overall score value. These words and phrases may also be specified in the rules as making single contributions or as being allowed to accrue.
  • score value 603 are illustrated in Figure 8 at 804 to 815.
  • the hierarchical structure in Figure 9 consists of a plurality of branches with lowest level entries being considered as leaves. Each leaf has a score value associated with it and values 809 to 815 are leaf score values. Above these, branch score values exist such that score values 807 and 808 exist at the lowest level of branching with score values 805 and 806 being at the next level of branching further up each connected to the highest level of branching illustrated by score value 804.
  • a total score value for a particular occurrence, detected at any level within the hierarchical tree results in a final score contribution derived from the product of the score value assigned to that particular level with all score levels identified while ascending the tree structure back to its root.
  • the score contributions are produced and possibly accumulated when occurrences of the specified elements are identified. This provides a numerical weight to assess whether a file being processed should be associated with a particular information type.
  • the present invention as implemented within the preferred embodiment, further adjusts these score weights in relation to the size of the overall data file. In particular, it is the branch score values (804 to 808) which are modified in preference to leaf values (809 to 815).
  • Subsidiary processor 311 is detailed in Figure 9.
  • the processor includes an Intel Pentium processing unit 901 connected to sixty-four megabytes of randomly accessible memory 902 via a PCI bus 903.
  • a local disk drive 904 and an interface circuit 905 are connected to bus 903.
  • Interface circuit 905 communicates with the TCP/IP network 324.
  • Random access memory 902 stores instructions executable by the processing unit 901 , in addition to storing input data files received from the data sources 101 to 103 and intermediate data. Operations performed on processing unit 901, in response to instructions read from memory 902 are identified in Figure 10.
  • temporary memory structures are cleared and at step
  • an OTL description file is selected.
  • an item in the OTL file is identified and at step 1004 a question is asked as to whether the item selected at step 1003 is a rule definition. If this question is answered in the affirmative, a rule object is defined at step 1005. Alternatively, if the question asked at step 1004 is answered in the negative, to the effect that the item is not a rule definition, a question is asked at step 1006 as to whether the item is a word definition. If this question is answered in the affirmative, a dictionary link is created at step 1004.
  • step 1008 a question is asked as to whether the item is a label and when answered in the affirmative a new entry is created in a label list, whereafter at step 1010 a question is asked as to whether another item is present.
  • control is directed to step 1010.
  • step 1010 When the question asked at step 1010 is answered in the affirmative, to the effect that another item is present, control is returned to step 1003 and the next item is identified in the OTL file. Eventually, all of the items will have been identified resulting in the question asked at step 1010 being answered in the negative. Thereafter, at step 1011 a question is asked as to whether another OTL file is present and when answered in the affirmative control is returned to step 1002 allowing the next OTL description file to be selected. Thus, this process continues until all of the OTL files have been considered resulting in the question asked at step 1011 being answered in the negative.
  • a rule base is generated and a plurality of such rule bases are illustrated in Figure
  • Rule bases 1101 to 1109 are stored in memory 902, which also provides storage space for a dictionary 1121, a label list 1122 and a data buffer 1123.
  • the dictionary stores a list of words which have importance in any of the stored rule bases. Associated with each word in the dictionary there is at least one pointer and possibly many pointers to specific entries in specific rule bases 1101 to 1109. Thus, the words identified at 803 in Figure 8 would all be included in dictionary 1121. Entries within the dictionary 1121 are implemented upon execution of step 1007 in Figure 10. Similarly, execution of step 1009, creating a new entry in the label list, allows a label to relate to rules that are elsewhere in the tree structure. Step 203, as shown in Figure 2A for associating preferred terms with source files, is detailed in Figure 12.
  • central processor 305 obtains access to one of the subsidiary processors 311 to 318.
  • the central processor then expects to receive authorisation so that communication may be effected with one of the subsidiary processors, after a connection has been established, the source file is supplied to the selected subsidiary processor at step 1203 and at step 1204 the data is processed to determine associated preferred terms.
  • Step 1204 for the processing of data to determine associated preferred terms is detailed in Figure 13.
  • the overall processing is broken down into three major phases, consisting of a triggering phase at 1301 , followed by a scoring phase at 1302 followed finally by a list generation phase at step 1303.
  • Triggering phase 1301 is detailed in Figure 14.
  • a section of the data such as its title, market sector or main body of text, is identified and at step 1402 an item of the identified section is selected.
  • a question is asked as to whether the item indicates a new context, which may be considered as a grammatical marker in the form of a full stop, capital, start of a sentence or quotation marks et cetera.
  • a new context which may be considered as a grammatical marker in the form of a full stop, capital, start of a sentence or quotation marks et cetera.
  • step 1404 is bypassed and a look-up address is obtained for rule objects in rule bases from the dictionary at step 1405.
  • step 1406 all addressed objects are triggered and a multiplication of scores is effected by a score weighting factor.
  • step 1407 a question is asked as to whether another item is present and when answered in the affirmative control is returned to step 1402.
  • step 1408 a question is asked as to whether another section is to be considered and when answered in the affirmative control is returned to step 1401.
  • step 1401 the next section is identified and steps 1402 to 1408 are repeated. Eventually, all of the sections will have been considered and the question asked at step 1408 will be answered in the negative.
  • each lowest level leaf of the hierarchical tree has a numerical value associated with the identification of a particular item, as identified generally at 803. If the amount of data contained within a particular file is less than what would generally be accepted, scores are further adjusted in relation to this size so as to improve the association of files with particular information types. This further adjustment is performed at the lowest leaf level (an example of this being level 803 in Figure 8) and these leaf values are multiplied by a file weighting factor, derived from the size of the file, which is triggered at step 1406 as shown in Figure 14.
  • a file weighting factor W is illustrated in Figure 15.
  • granularity the irregular distribution of evidential terms, referred to as granularity, can effect the accuracy of the association process and therefore requires compensation.
  • a threshold value is identified, below which compensation is required, and recorded by variable T.
  • T A typical example for T would be two thousand and forty-eight bytes.
  • the size of the current file is recorded in variable S and a weighting constant, such as zero point eight
  • weighting factor W (0.8), is stored by variable N. From these values, weighting factor W is derived as follows. If the size S of the current file is smaller than threshold T, weighting factor W is calculated by subtracting the size S of the current file from the threshold T and dividing this by the threshold T. This partial result is then multiplied by the weighting constant and added to unity. Alternatively, if the size of the current file is not smaller than the threshold T, the weighting factor is set to unity.
  • the threshold value T represents the minimum file size that can be indexed reliably and files smaller than this value, when detected, result in further adjustments being made to the identification process so as to compensate for this problem.
  • the threshold value could be calculated on an on-going basis so as to provide, for example, the average of the previous one thousand files considered. However, given the inherent unpredictability of file size distribution, it is better to specify a threshold value and, in some circumstances, it may be preferable to modify the threshold value in order to achieve a particular result.
  • the weighting constant may also be modified in response to particular data types and experiments have shown that by optimising the weighting constant and by optimising the threshold value it should be possible to obtain a ten percent improvement in terms of the number of files that are selected correctly for a particular category.
  • weighting factor The relationship between weighting factor and file size is illustrated graphically in Figure 15 and it can be seen that, as the file size falls below the threshold value, the weighting factor is linearly increased up to the value of the weighting constant. Alternatively, a more sophisticated procedure may be achieved, having a non-linear relationship, by using interpolated look-up tables.
  • An example of a short data file is illustrated in Figure 16.
  • the file does not contain many words therefore it is unlikely that enough words would be identified so as to provide evidence for the data file to be included within a particular category identified by its associated preferred terms.
  • Figure 16 also shows particular words and phrases that have been highlighted, so as to identify them as having being caught by a particular rule base. Thus, a decision then has to be made as to whether these highlighted words provide sufficient evidence for the data file to be included.
  • the present procedures, as illustrated in Figure 15, would increase the likelihood of a file of this type being included correctly and as being pointed to by a particular preferred term.
  • Scoring phase 1302 is detailed in Figure 17.
  • a rule base is selected and at step 1702 a score variable is re-set to zero.
  • a branch is identified for score accumulation/accrue and at step 1704 scores are accumulated or accrued from triggered rules attached to the branch.
  • a question is asked as to whether another branch is to be considered and when answered in the affirmative control is returned to step 1703.
  • a next branch is selected at step 1703 with procedure 1704 being repeated. Eventually all of the branches will have been considered resulting in the question asked at step 1705 being answered in the negative.
  • step 1706 an overall score in the range of zero to one hundred is stored for the rule base and at step 1707 a question is asked as to whether another rule base is present. When answered in the affirmative, control is returned to step 1701 and steps 1701 to 1707 are repeated. Eventually, all of the rule bases will have been considered and the question asked at step 1707 will be answered in the negative.
  • Phase 1303 for the generation of a list of associated preferred terms is detailed in Figure 18.
  • a rule base is identified having a score greater than a predetermined threshold.
  • a threshold may be set at forty-eight percent.
  • additional triggered preferred data characteristics are identified by associating successful rule bases with parent categorisations by rule base links.
  • Step 1803 lists of successful and inferred rule bases are combined to form overall lists of preferred data characteristics.
  • Step 1803 results in data being generated by a subsidiary processor, such as processor 311 , which is then supplied back to the central processing system 305.
  • Central processing system 305 is responsible for constructing a table of the type shown in Figure 19 in which an entry is present for each preferred term.
  • the specific preferred terms are stored in column 1901 and, for each of these terms, column 1902 defines a specific pointer to a position in memory associated with the central processing system 305.
  • Specific data files are identified by file names and the number of files associated with each preferred term is variable, depending on the nature and the amount of input data being considered. Thus, in order for this data to be accessible quickly while optimising use of the storage capacity within the central processing system 305, an indication of the file names is stored in the form of a linked list as illustrated in Figure 20.
  • the preferred term "OILJNDUSTRY” has been associated to a pointer 0F8912.
  • Address 0F8912 is the first in column 2001 of the linked list.
  • Column 2002 identifies a particular file name and column 2003 identifies the next pointer in the list.
  • entry 0F8912 points to a particular file with the file name "OIL_INDUSTRY_NETHERLAND_3" with a further pointer to memory location 0F8A20.
  • a new file name is provided, illustrated at column 2002 and again a new pointer is present at column 2003.
  • the database 323 will be continually updated and users will continually be given access to the database, all under the control of the central processing system 305.
  • the association step illustrated at 203 and the searching step at 204 are actually concurrent and will be effected in response to the availability of data and the demand for searching respectively.
  • Procedures 204 for performing a search in response to a user request are detailed in Figure 21.
  • a user logs onto the system and at step 2102 a search method is identified.
  • search criteria are defined and at step 2104 the search criteria are processed to determine categories.
  • a list of categories are supplied to the central processing system 305.
  • step 2106 a question is asked as to whether the host has responded and when answered in the affirmative titles of associated data files are displayed at step 2107.
  • a question is asked as to whether the user wishes to view identified data and when answered in the affirmative the data is viewed; after being downloaded over the communication channel, at step 2109.
  • a question is asked as to whether another search is to be performed and when answered in the affirmative control is returned to step 2102.
  • Step 2102 requires the search method to be identified and in order to achieve this a user is prompted by a screen display of the type shown in Figure 22. Thus, a plurality of text boxes are presented to the user inviting the user to specify a search method.
  • Step 2103 for the defining of search criteria results in the user being prompted by a screen of the type shown in Figure 23. Terms providing a basis for the user's search are displayed in a window 2301. Categories are displayed in uppercase characters, such as the entry shown at position 2302.
  • Each entry such as entry 2401 , includes a check box 2402.
  • Check boxes 2402 allow a particular item to be selected by a user such that the actual information file may be supplied to the user from the central database over a communication channel.

Abstract

A data processing, storage and retrieval system (104) includes a central processing system (305) and a network of subsidiary processors (311 to 318). Files of machine-readable data are supplied to the central processing system (305) and individual files are supplied to a selected subsidiary processor. The selected subsidiary processor examines data elements in a file to identify occurrences of specified terms. A score is adjusted in response to identified occurrences. Furthermore, the score is also adjusted in relation to the size of the data file. Score values are returned to the central processing system (305) and the processing system retains an association of information types with processed files based upon their score values.

Description

Associating Files Of Machine-Readable Data With Specified
Information Types
Field Of The Invention The present invention relates to associating files of machine-readable data with specified information types so as to enhance the availability of information from a database in which users access databases via transmission channels.
Background To The Invention
Traditionally, database technology has been dedicated to the organisation of numerical and tabular data and it is only recently, particularly with the expansion of the Internet, that demand has grown significantly for the retrieval of text-based files. Several facilities are available on the Internet, commonly referred to as "search engines" which assist in the location of information. The majority of these operate by performing what has become known as "free text" searching, in which a user specifies words which they believe are contained within the target file as a mechanism for instructing the system to retrieve files of interest. Problems with this technique are well known to users of the available search engines, and a simple enquiry can generate hundreds of thousands of "hits", the majority of which will tend to be totally irrelevant to the user's needs. Furthermore, other relevant files may be missed simply because they do not contain the specific chosen words. As is well documented, a problem with the Internet is that the freedom of the Internet is also its downfall. Information is not classified before it is made available, therefore it is highly likely that even the simplest search will fail to identify relevant documentation and will take a considerable period of time to perform.
Procedures for classifying volumes of data so as to facilitate subsequent searching are known but these classification processes often involve manual intervention, thereby making them time consuming and prone to human error. Furthermore, except in circumstances where the documentation is considered to be extremely valuable and will continue to be required over a significant period of time, the cost of performing this manual exercise cannot be justified in terms of the commercial worth of the data sources being considered. Consequently, the problem results in much data being effectively inaccessible and outside the realm of searchable knowledge.
Procedures are known for processing a data file so as to determine whether the data file should be associated with a particular information category. The known processes require a machine readable association file, also known as an outline file and usually identified by file extension .OTL. In this way, it is possible for the incoming data file to be processed with reference to one or many outline files whereupon each outline file produces a numerical score value defining an extent to which the data file is relevant to an associated category, whereafter a decision may be made on the basis of a threshold comparison.
In practical systems, thousands of such outline files would be required in order to provide a useful level of categorisation, in the present applicant's co-pending British patent application number GB 98 08 808.1 , (co-pending European application DGC-P11-EP and co-pending United States patent application DGC-P11-US) a method of generating machine readable association files is described. A plurality of data files are manually selected as being examples of files which should be associated with a particular category. In addition, a plurality of files are selected manually which are considered not to be associated with a particular category. Having identified these files, the process identifies preferred term candidates from the associated files, weights these candidates with reference to files not associated with the category and applies terms to a machine readable association file by analysing the weighting values.
The resulting association files are particularly well suited to associating new data files which are of a substantially similar size to the original source data files. Similarly, association files generated by more traditional techniques still tend to be well suited to input data files of a particular size and less well suited to incoming data files of differing sizes. Thus, if a new incoming data file is smaller than the optimum file size, it is possible that relevant files will fail to be categorised given that the processing of these files will result in an inappropriately low weighting value being calculated.
Summary Of The Invention
According to a first aspect of the present invention, there is provided apparatus for associating files of machine-readable data with information types, comprising examining means for examining data elements in a file to identify occurrences of specified data types; adjusting means for adjusting a score in response to said identified occurrences and for adjusting said score in relation to said size of said data; and associating means for associating the information types with processed files dependent upon said score values. In a preferred embodiment, the adjusting means is configured to further adjust the score inversely in relation to a size of said data. Preferably, means are included for determining the size of said data in terms of the volume of the stored file representing the number of characters present in the file.
According to a second aspect of the present invention, there is provided a method of associating files of machine-readable data with information types, comprising steps of examining data elements in a file to identify occurrences of specified data types; adjusting a score in response to said identified occurrences; further adjusting said score in relation to a size of said data; and associating said information types with processed files dependent upon said score values. In a preferred embodiment, the further adjustment is performed if the size of the data is below a predetermined threshold and a further adjustment may be performed up to a predetermined maximum weighting value.
In a preferred embodiment, the information types are expressed as preferred terms and are specified by information outlines. The information outlines may define preferred terms in a branching hierarchical structure and the adjustable values may be determined at branches of this structure. The further adjustment may include adjustment at a plurality of the branches and score values may be determined by multiplying values stored at branches throughout the hierarchy.
Brief Description of The Drawings
Figure 1 shows a data distribution environment, including a data processing, storage and retrieval system;
Figure 2A illustrates procedures performed by the processing system shown in Figure 1, including a process for specifying preferred terms, a process for associating preferred terms with source files and a process for performing a search in response to a user request; Figure 2B illustrates underlying principles of the operation of the preferred embodiments;
Figure 3 details the processing system shown in Figure 1, including subsidiary processing units; Figure 4 details the process shown in Figure 2A for specifying preferred terms for association with data files, including a process for the generation or modification of outline files;
Figure 5 details the step shown in Figure 4 for the generation or modification of outline files; Figure 6 illustrates a graphical view of an OTL file;
Figure 7 illustrates the actual OTL data for the graphical representation shown in Figure 6;
Figure 8 shows a diagrammatic representation of the data illustrated in Figure 7; Figure 9 shows a subsidiary processor of the type identified in Figure
3;
Figure 10 shows operations performed by the processing unit shown in Figure 9 in response to instructions read from the memory shown in Figure 9; Figure 11 shows rule bases generated for each OTL file in response to the selection step shown in Figure 10;
Figure 12 details the step for associating preferred terms with source files identified in Figure 2A, including a step for processing data to determine associated preferred terms; Figure 13 details the step for processing data to determine associated preferred terms identified in Figure 12, including a triggering phase, a scoring phase and a list generation phase; Figure 14 details the triggering phase identified in Figure 13;
Figure 15 details the generation of file weighting factor;
Figure 16 shows an example of a short data file;
Figure 17 details the scoring phase identified in Figure 13; Figure 18 details the generation phase for a list of associated preferred terms identified in Figure 13;
Figure 19 illustrates a preferred term table, having pointers to a linked list;
Figure 20 details the linked list referred to in Figure 19; Figure 21 details procedures performing a search in response to a user request identified in Figure 2A;
Figure 22 illustrates a screen display generated by the requesting step shown in Figure 21;
Figure 23 details a screen for prompting search criteria generated by the criteria requesting step in Figure 21;
Figure 24 illustrates the displaying of titles of associated files generated by the procedure shown in Figure 21.
Detailed Description of The Preferred Embodiments A data distribution environment is illustrated in Figure 1 in which data, received from a plurality of data sources 101 , 102, 103 is supplied to a data processing, storage and retrieval system 104. Data sources 101 and 102 supply data directly to processing system 104 while data source 103 supplies data via a local area network 105, thereby allowing user terminals 106 and 107 to gain direct access to their local data source 103.
The processing system 104 provides access to a plurality of users, such as users 111 , 112, 113, 114, 115, 116 and 117. User 111 has direct access to the processing system 104 while users 112, 113 and 114 gain access to the processing system 104 via the Internet 118. Users 115, 116 and 117 exist within a more sophisticated environment in which they have access, via a local area network 119 to their own local database system 120 in addition to a connection, via an interface 121 , to the data processing system 104.
All incoming data from data sources 101 to 103 is categorised with a key word in seven separate fields, comprising "market sector", "location", "company name", "publisher", "publication date" and "scope". Users, such as users 112 to 117 may specify almost any term as the basis for a search and are then prompted by an equivalent word or phrase which constitutes more preferred search parameters. For example, a user may specify a search word such as "confectionery" and the system will prompt the user to consider narrower terms such as "chocolate" along with related terms such as "cakes" or "desserts", or broader terms such as "food". From a simple request, a user is given an option of focusing further or of taking a broader overview of the subject under consideration.
The scope of an article refers to the context in which the document or article was written. For example, the scope field may consider questions as to whether the article concerns "mergers and acquisitions" or "seasonal trends" et cetera. Such terms are useful in gathering related information from a wide variety of industries and markets and may prove invaluable for particular applications.
The same criteria used for indexing are offered for search purposes and the same indexing terms are used for all documents across a range of specific databases. An overview of procedures performed by the data processing system 104 is illustrated in Figure 2A. At step 201 , categories for association with data files are specified. This step is essentially performed as an "off-line" process; establishing the environment for allowing source data to be processed as it is received from sources.
Steps 202, 203 and 204 represent on-line procedures after the categories have been specified at step 201. At step 202 the processing system 104 receives data from sources such as sources 101 , 102 and 103. The source data may be transmitted using different protocols, formats and standards therefore the processing system performs a standardisation process so that the data may be stored locally at the data processing system using standardised formats.
At step 203 the data is processed so as to enhance a user's ability to identify information of interest. Files of machine-readable data received from the sources are associated with specific categories, that may be considered as defining particular information types. A file is considered as individual data elements, usually in the form of natural language words, and files are examined to identify occurrences of specified data types. The purpose of this association is to identify files of data which are of interest in relation to particular topics. This enables a user to organise a search which should result in useful information being supplied to said user, with reference to said topics and categories, from an extremely large database of stored data files.
In this way, the technical procedures performed by association step 203 significantly enhances the overall functionality of the system and provides an industrially applicable approach to allowing highly focused sets of information to be supplied to a user in preference to large volumes of data; much of which will tend to be totally irrelevant.
In order to achieve this, the files of data are considered and are given a score representing a numerical value as to their relevance with respect to the predefined topics. Scores are adjusted in response to the number of identified occurrences of a specified data type. Furthermore, when embodying the present invention, these scores are also adjusted in relation to the size of the data contained within the file. In particular, occurrences of data types in relatively small files are given a higher weighting with occurrences in larger files being given a lower weighting. Thus, the adjustment of scores is related inversely to the actual size of the data file. Thereafter, a threshold for the scoring values may be set and information types are associated with particular files dependent upon whether particular value scores fall on one side or on the other side of this threshold.
At step 204 a search is performed, in response to categories identified by a user such that information of interest may be identified within the data stored by the data processing system 104 and transmitted to user terminals, such as terminal 111 , over transmission channels as illustrated in Figure 1. Underlying principles of operation for the embodiments are illustrated in Figure 2B. Process 221 considers each incoming data file 222 to determine whether each data file should be associated with a particular category. Thus, process 221 is repeated for each category where associations are required. Decisions concerning the relevance of a particular data file to a particular category are made on the basis of the presence of particular data types within a file under consideration. Thus, procedures 223 identify the presence of data type A, procedures 224 identify the presence of data type B, procedures 225 identify the presence of data type C, procedures 226 identify the presence of data type D and so on until procedures 227 identify the existence of data type N. Each identification made by procedures 223 to 227 results in a score value being generated which may be accumulated at local accumulators 228, 229, 230, 231 and 232 respectively. Thus, the data elements are examined in a file to identify occurrences of specified data types.
Scores, present in accumulators 228 to 232 are adjusted in response to the identified occurrences, a process usually involving accumulation and scaling. Having been examined for the existence of all relevant data types, the present file is considered by process 223, that determines the total size of the file. Having done this, a scaling factor is generated by process 233 which is supplied to each of the accumulation processes 228 to 232. In this way, it is possible to further adjust the score values in relation to the size of the incoming data file. The scaled accumulation values are further accumulated by a unifying process 234, that in turn produces a single bit yes/no response to a categorisation process 235. The incoming file data is received by the categorisation process 235 and in response to the output from process 234, an output file 236 is labelled as either belonging to the category under consideration or as not belonging to the category under consideration.
Processing system 104 is detailed in Figure 3. Data signals from data sources 101 to 103 are supplied to input interfaces 301 via data input lines 302. Similarly, output data signals are supplied to users 111 to 117 via an output interface 303 and output wires 304. Input interface 301 and output interface 303 communicate with a central processing system 305 based on DEC Alpha integrated circuitry. The central processing system 305 also communicates with other processing systems in a distributed processing architecture. Processing system 104 includes eight Intel chip based processing systems 311 to 318, each implementing instructions under the control of conventional operating systems such as Windows NT.
An operator communicates with the processing system 104 by means of an operator terminal, having a visual display unit 321 and a manually operable keyboard 322. Data files received from sources 101 to 103 are written to bulk storage devices 323 in the form of large magnetic disk arrays. Data files are written to disk arrays 323 after these files have been associated with categories, as illustrated at step 203. These association processes are performed by the subsidiary processors 311 to 318 and the central processing system 305 is mainly concerned with the switching and transferring of data between the interface circuits 301 , 303 and the disk arrays 323. The central processing system 305 communicates with the subsidiary processors 311 to 318 via an Ethernet connection 324 and processing requirements are distributed between processors 311 to 318. Having addressed a subsidiary processor 311 to 318, the transferring of data to an addressed processor is performed. Each individual incoming data file is supplied exclusively to one of the subsidiary processors. The selected subsidiary processor is then responsible for performing the association process, to identify categories relevant to that particular data file. Thus, one incoming data file 222 as shown in Figure 2B is supplied exclusively to one of processors 311 to 318 and the whole of process 221 is performed on said processor. Furthermore, multiple operations of process 221 may be performed, one for each category being considered by the system.
After being processed within a processor, the associated data file is returned to the central processing system 305, over connection 324 and the central processing system 305 is then responsible for writing the associated data file to the disk array 323. In this way, it is possible to scale the degree of processing capacity provided by system 104 in dependence upon the volume of data files to be processed. The central processing system 305 also maintains a table of categories, pointing to particular data files which have been identified as relevant to said categories.
Computer system 305 is interfaced to a CD ROM reader 325 and in this way executable instructions may be supplied to computer system 305 and to processing systems 311 to 318, by means of a CD ROM 326.
Process 201 for specifying categories for association with data files is detailed in Figure 4. At step 401 a preferred term is selected and at step 402 an outline (OTL) file is generated or modified. At step 403 a question is asked as to whether another term is to be processed and when answered in the affirmative control is returned to step 401 , allowing the next term to be processed at step 402. Eventually, all of the terms will have been processed resulting in appropriate generations or modifications to their related outline files. Consequently, the question asked at step 403 is answered in the negative whereafter at step 404 data structures are initialised by parsing the OTL files generated at step 402.
Step 402 for the generation or modification of outline files is detailed in Figure 5. At step 501 a visual OTL editor is opened resulting in the editor's visual interface being displayed on VDU 321. At step 502 a question is asked as to whether an existing file is to be loaded for modification and if answered in the negative a new OTL file is created at step 503. If the question asked at step 502 is answered in the affirmative, step 503 is bypassed and at step 504 modifications or additions are made to the OTL definition. At step 505 the OTL modifications created at step 504 are tested on a sample of test data and at step 506 a question is asked as to whether another modification is to be made. When answered in the affirmative, control is returned to step 504 resulting in further modifications or additions being made to the OTL definitions. When answered in the negative at step 506, the new or modified OTL file is saved at step 507.
When performing modifications or additions at step 504, a graphical representation of the OTL file data is presented to an operator via the visual display unit 321. An example of a display of this type is illustrated in Figure 6, representing a graphical illustration of a specific OTL file.
The OTL file stores definitions in an hierarchical tree structure and this structure is represented in the graphical view as shown in Figure 6. A representation of the tree may be contracted or expanded and the possibility of expanding a particular branch is identified by a plus sign on a particular line, as shown at 601. Similarly, when a particular branch has been fully expanded, the line is identified by a minus sign as shown at 602. Definitions within the file consist of rules, words and labels. The labels allow relationships to be defined between various parts of the file and between individual files themselves. The words identify specific words within an input file of interest and the rules define how and what weights are to be attributed to these words. Each rule line includes, at its beginning, a weight value 603 representing the score that will be attributed when a particular rule condition is met. Rules may also have leaves and the rule defines the way in which scores generated from leaves are combined. OTL file data represented graphically in the form shown in Figure 6 is actually stored in a data file having a format of the type shown in Figure 7. The actual data file shown in Figure 7 corresponds to the data display in Figure 6 but in Figure 7 all of the data, some of which has been rolled up in Figure 6, is present. The data contained within the file shown in Figure 7 is manipulated interactively by an operator in response to the graphical interface displayed as illustrated in Figure 6. Score values 603 are also identified in the data file shown in Figure 7. Displayed line 601 in Figure 6 is generated from line 701 of the actual stored data. The syntax of the language used for recording the data, as illustrated in Figure 7, may vary and the example shown is specific to this particular application. However, the underlying functionality of the language may be considered with reference to the diagrammatic representation shown in Figure 8.
Purely to provide a specific example, this particular outline file is concerned with the topic of the oil industry and therefore the purpose of the OTL file is to identify words and phrases within an input file so as to provide an indication as to how relevant that input data is to users having an interest in the oil industry. Thus, the purpose of procedures exploiting these OTL files is to generate evidence showing that a particular data file conveys information which may be of interest to those studying the oil industry.
The outlines analyse data files in order to produce numerical evidence as to the relevance of a particular file with relation to a particular topic. The
OTL definitions and structures are determined empirically and would be modified and upgraded over a period of time. The system does more than merely register the existence of a particular word item by placing the word items within an interacting structure; the nature of which is illustrated in Figure 8. The particular entry, given label "oil-industry-mkt" relates to marketing aspects of the oil industry and as such can contribute to an overall score as to the pertinence of incoming data to this particular topic. The first line 801 shows that this particular contribution may provide a total score of forty percent. This total of forty percent is then subdivided such that at line 802 the presence of the phase "buying oil from" has a score of fifty percent.
Thus, the total contribution made the presence of this phrase consists of fifty percent of forty percent, ie a total of twenty percent being made to the total contribution. Similarly, as shown at line 803 and below, particular words may be identified which result in contributions of sixty percent of thirty percent of forty percent. Thus, a complete OTL file is structured in this way with particular words and phrases making contributions to an overall score value. These words and phrases may also be specified in the rules as making single contributions or as being allowed to accrue.
Examples of score value 603 are illustrated in Figure 8 at 804 to 815. The hierarchical structure in Figure 9 consists of a plurality of branches with lowest level entries being considered as leaves. Each leaf has a score value associated with it and values 809 to 815 are leaf score values. Above these, branch score values exist such that score values 807 and 808 exist at the lowest level of branching with score values 805 and 806 being at the next level of branching further up each connected to the highest level of branching illustrated by score value 804. A total score value for a particular occurrence, detected at any level within the hierarchical tree, results in a final score contribution derived from the product of the score value assigned to that particular level with all score levels identified while ascending the tree structure back to its root.
The score contributions are produced and possibly accumulated when occurrences of the specified elements are identified. This provides a numerical weight to assess whether a file being processed should be associated with a particular information type. The present invention, as implemented within the preferred embodiment, further adjusts these score weights in relation to the size of the overall data file. In particular, it is the branch score values (804 to 808) which are modified in preference to leaf values (809 to 815).
Subsidiary processor 311 is detailed in Figure 9. The processor includes an Intel Pentium processing unit 901 connected to sixty-four megabytes of randomly accessible memory 902 via a PCI bus 903. In addition, a local disk drive 904 and an interface circuit 905 are connected to bus 903. Interface circuit 905 communicates with the TCP/IP network 324. Random access memory 902 stores instructions executable by the processing unit 901 , in addition to storing input data files received from the data sources 101 to 103 and intermediate data. Operations performed on processing unit 901, in response to instructions read from memory 902 are identified in Figure 10. At step 1001 temporary memory structures are cleared and at step
1002 an OTL description file is selected. At step 1003 an item in the OTL file is identified and at step 1004 a question is asked as to whether the item selected at step 1003 is a rule definition. If this question is answered in the affirmative, a rule object is defined at step 1005. Alternatively, if the question asked at step 1004 is answered in the negative, to the effect that the item is not a rule definition, a question is asked at step 1006 as to whether the item is a word definition. If this question is answered in the affirmative, a dictionary link is created at step 1004.
At step 1008 a question is asked as to whether the item is a label and when answered in the affirmative a new entry is created in a label list, whereafter at step 1010 a question is asked as to whether another item is present. After executing step 1005 or after executing step 1007, control is directed to step 1010.
When the question asked at step 1010 is answered in the affirmative, to the effect that another item is present, control is returned to step 1003 and the next item is identified in the OTL file. Eventually, all of the items will have been identified resulting in the question asked at step 1010 being answered in the negative. Thereafter, at step 1011 a question is asked as to whether another OTL file is present and when answered in the affirmative control is returned to step 1002 allowing the next OTL description file to be selected. Thus, this process continues until all of the OTL files have been considered resulting in the question asked at step 1011 being answered in the negative.
For each OTL file considered, by being selected at step 1002, a rule base is generated and a plurality of such rule bases are illustrated in Figure
11. Thus, a first OTL file processed in accordance with the procedures shown in Figure 10 results in the generation of a first rule base 1101. Similarly, further iterations of the procedures shown in Figure 7 result in the generation of rule bases 1102 to 1109. Typically, for a specific installation, in the order of three thousand rule bases would be generated by execution of the procedures illustrated in Figure 10.
Rule bases 1101 to 1109 are stored in memory 902, which also provides storage space for a dictionary 1121, a label list 1122 and a data buffer 1123. The dictionary stores a list of words which have importance in any of the stored rule bases. Associated with each word in the dictionary there is at least one pointer and possibly many pointers to specific entries in specific rule bases 1101 to 1109. Thus, the words identified at 803 in Figure 8 would all be included in dictionary 1121. Entries within the dictionary 1121 are implemented upon execution of step 1007 in Figure 10. Similarly, execution of step 1009, creating a new entry in the label list, allows a label to relate to rules that are elsewhere in the tree structure. Step 203, as shown in Figure 2A for associating preferred terms with source files, is detailed in Figure 12. At step 1201 central processor 305 obtains access to one of the subsidiary processors 311 to 318. The central processor then expects to receive authorisation so that communication may be effected with one of the subsidiary processors, after a connection has been established, the source file is supplied to the selected subsidiary processor at step 1203 and at step 1204 the data is processed to determine associated preferred terms.
After performing the processing at step 1204, the results are transmitted back to the central processing system at step 1205 and at step
1206 data with associated preferred terms is stored and data pointers associated with the preferred data terms are updated at step 1207.
Step 1204 for the processing of data to determine associated preferred terms is detailed in Figure 13. The overall processing is broken down into three major phases, consisting of a triggering phase at 1301 , followed by a scoring phase at 1302 followed finally by a list generation phase at step 1303.
Triggering phase 1301 is detailed in Figure 14. At step 1401 a section of the data, such as its title, market sector or main body of text, is identified and at step 1402 an item of the identified section is selected. At step 1403 a question is asked as to whether the item indicates a new context, which may be considered as a grammatical marker in the form of a full stop, capital, start of a sentence or quotation marks et cetera. When answered in the affirmative new context information is supplied to all rule bases 1101 to 1109 at step 1404 and control is then directed to step 1407.
If the question asked at step 1403 is answered in the negative, step 1404 is bypassed and a look-up address is obtained for rule objects in rule bases from the dictionary at step 1405. Thereafter, at step 1406 all addressed objects are triggered and a multiplication of scores is effected by a score weighting factor. Thereafter, at step 1407 a question is asked as to whether another item is present and when answered in the affirmative control is returned to step 1402. Eventually, all of the items for a selected section will have been considered resulting in the question asked at step 1407 being answered in the negative. Thereafter, at step 1408 a question is asked as to whether another section is to be considered and when answered in the affirmative control is returned to step 1401.
At step 1401 the next section is identified and steps 1402 to 1408 are repeated. Eventually, all of the sections will have been considered and the question asked at step 1408 will be answered in the negative.
As shown in Figure 8, each lowest level leaf of the hierarchical tree has a numerical value associated with the identification of a particular item, as identified generally at 803. If the amount of data contained within a particular file is less than what would generally be accepted, scores are further adjusted in relation to this size so as to improve the association of files with particular information types. This further adjustment is performed at the lowest leaf level (an example of this being level 803 in Figure 8) and these leaf values are multiplied by a file weighting factor, derived from the size of the file, which is triggered at step 1406 as shown in Figure 14.
The generation of a file weighting factor W is illustrated in Figure 15. Experiments have shown that below a certain file size, the irregular distribution of evidential terms, referred to as granularity, can effect the accuracy of the association process and therefore requires compensation. In order to provide compensation, a threshold value is identified, below which compensation is required, and recorded by variable T. A typical example for T would be two thousand and forty-eight bytes. The size of the current file is recorded in variable S and a weighting constant, such as zero point eight
(0.8), is stored by variable N. From these values, weighting factor W is derived as follows. If the size S of the current file is smaller than threshold T, weighting factor W is calculated by subtracting the size S of the current file from the threshold T and dividing this by the threshold T. This partial result is then multiplied by the weighting constant and added to unity. Alternatively, if the size of the current file is not smaller than the threshold T, the weighting factor is set to unity.
The threshold value T represents the minimum file size that can be indexed reliably and files smaller than this value, when detected, result in further adjustments being made to the identification process so as to compensate for this problem. The threshold value could be calculated on an on-going basis so as to provide, for example, the average of the previous one thousand files considered. However, given the inherent unpredictability of file size distribution, it is better to specify a threshold value and, in some circumstances, it may be preferable to modify the threshold value in order to achieve a particular result. Furthermore, the weighting constant may also be modified in response to particular data types and experiments have shown that by optimising the weighting constant and by optimising the threshold value it should be possible to obtain a ten percent improvement in terms of the number of files that are selected correctly for a particular category.
The relationship between weighting factor and file size is illustrated graphically in Figure 15 and it can be seen that, as the file size falls below the threshold value, the weighting factor is linearly increased up to the value of the weighting constant. Alternatively, a more sophisticated procedure may be achieved, having a non-linear relationship, by using interpolated look-up tables. An example of a short data file is illustrated in Figure 16. The file does not contain many words therefore it is unlikely that enough words would be identified so as to provide evidence for the data file to be included within a particular category identified by its associated preferred terms. Figure 16 also shows particular words and phrases that have been highlighted, so as to identify them as having being caught by a particular rule base. Thus, a decision then has to be made as to whether these highlighted words provide sufficient evidence for the data file to be included. The present procedures, as illustrated in Figure 15, would increase the likelihood of a file of this type being included correctly and as being pointed to by a particular preferred term.
Scoring phase 1302 is detailed in Figure 17. At step 1701 a rule base is selected and at step 1702 a score variable is re-set to zero. At step 1703 a branch is identified for score accumulation/accrue and at step 1704 scores are accumulated or accrued from triggered rules attached to the branch. At step 1705 a question is asked as to whether another branch is to be considered and when answered in the affirmative control is returned to step 1703. A next branch is selected at step 1703 with procedure 1704 being repeated. Eventually all of the branches will have been considered resulting in the question asked at step 1705 being answered in the negative.
At step 1706 an overall score in the range of zero to one hundred is stored for the rule base and at step 1707 a question is asked as to whether another rule base is present. When answered in the affirmative, control is returned to step 1701 and steps 1701 to 1707 are repeated. Eventually, all of the rule bases will have been considered and the question asked at step 1707 will be answered in the negative.
The operations illustrated in Figure 17 may be considered with reference to the illustration of the structure in Figure 8. Thus if any of the defined words at 803 are identified within the file a provisional score of one hundred will be allocated. However, the process as shown in Figure 17, must then ascend up the branches so that any scores lower down will be modified in response to scores higher up the structure.
Phase 1303 for the generation of a list of associated preferred terms is detailed in Figure 18. At step 1801 a rule base is identified having a score greater than a predetermined threshold. Thus, for a particular application a threshold may be set at forty-eight percent. At step 1802 additional triggered preferred data characteristics are identified by associating successful rule bases with parent categorisations by rule base links.
At step 1803 lists of successful and inferred rule bases are combined to form overall lists of preferred data characteristics. Step 1803 results in data being generated by a subsidiary processor, such as processor 311 , which is then supplied back to the central processing system 305. Central processing system 305 is responsible for constructing a table of the type shown in Figure 19 in which an entry is present for each preferred term. The specific preferred terms are stored in column 1901 and, for each of these terms, column 1902 defines a specific pointer to a position in memory associated with the central processing system 305. Specific data files are identified by file names and the number of files associated with each preferred term is variable, depending on the nature and the amount of input data being considered. Thus, in order for this data to be accessible quickly while optimising use of the storage capacity within the central processing system 305, an indication of the file names is stored in the form of a linked list as illustrated in Figure 20.
Referring to Figure 19, the preferred term "OILJNDUSTRY" has been associated to a pointer 0F8912. Address 0F8912 is the first in column 2001 of the linked list. Column 2002 identifies a particular file name and column 2003 identifies the next pointer in the list. Thus, entry 0F8912 points to a particular file with the file name "OIL_INDUSTRY_NETHERLAND_3" with a further pointer to memory location 0F8A20. At memory location 0F8A20 a new file name is provided, illustrated at column 2002 and again a new pointer is present at column 2003. Eventually, all relevant files will have been considered and the end of the list is identified by address 000000 at the pointer location in column 2003.
In an active system, the database 323 will be continually updated and users will continually be given access to the database, all under the control of the central processing system 305. Thus, with reference to Figure 2A, it should be understood that the association step illustrated at 203 and the searching step at 204 are actually concurrent and will be effected in response to the availability of data and the demand for searching respectively.
Procedures 204 for performing a search in response to a user request are detailed in Figure 21. At step 2101 a user logs onto the system and at step 2102 a search method is identified. At step 2103 search criteria are defined and at step 2104 the search criteria are processed to determine categories. At step 2105 a list of categories are supplied to the central processing system 305.
At step 2106 a question is asked as to whether the host has responded and when answered in the affirmative titles of associated data files are displayed at step 2107.
At step 2108 a question is asked as to whether the user wishes to view identified data and when answered in the affirmative the data is viewed; after being downloaded over the communication channel, at step 2109. At step 2110 a question is asked as to whether another search is to be performed and when answered in the affirmative control is returned to step 2102. Step 2102 requires the search method to be identified and in order to achieve this a user is prompted by a screen display of the type shown in Figure 22. Thus, a plurality of text boxes are presented to the user inviting the user to specify a search method. Step 2103 for the defining of search criteria results in the user being prompted by a screen of the type shown in Figure 23. Terms providing a basis for the user's search are displayed in a window 2301. Categories are displayed in uppercase characters, such as the entry shown at position 2302.
The displaying of titles of associated files at step 2107 results in the user seeing information displayed of the type illustrated in Figure 24. Each entry, such as entry 2401 , includes a check box 2402. Check boxes 2402 allow a particular item to be selected by a user such that the actual information file may be supplied to the user from the central database over a communication channel.

Claims

What We Claim Is:
1. Apparatus for associating files of machine-readable data with information types, comprising examining means configured to examine data elements in a file to identify occurrences of specified data types; adjusting means configured to examine a score in response to said identified occurrences and to further adjusting said score in relation to the size of said data; and associating means configured to associate said information types with processed files dependent upon said score values.
2. Apparatus according to claim 1 , wherein said adjusting means further adjusts said score if the size of said data is below a predetermined threshold.
3. Apparatus according to claim 1 or claim 2, wherein said adjusting means is configured to perform said further adjustment up to a predetermined maximum weighting value.
4. Apparatus according to any of claims 1 to 3, wherein said associating means is configured to associate said information types and said information types are expressed as categories specified by information outlines.
5. Apparatus according to claim 4, including means for storing a branching hierarchical structure of data types defined by said information outlines, wherein said adjusting means is configured to adjust values defined at branches of said structure.
6. Apparatus according to claim 5, wherein said adjusting means is configured to adjust values at a plurality of said branches.
7. Apparatus according to claim 6, including means for multiplying values stored at branches throughout said hierarchy to produce said score value.
8. Apparatus according to any of claims 1 to 7, wherein said adjusting means is configured to further adjust said score inversely in relation to a size of said data.
9. Apparatus according to claim 8, including means for determining the size of said data in terms of the volume of the stored file representing the number of characters present in the file.
10. Apparatus according to any of claims 1 to 9, wherein machine readable files associated to categories are made available to a plurality of users over a distributed network.
11. A method of associating files of machine-readable data with information types, comprising steps of examining data elements in a file to identify occurrences of specified data types; adjusting a score in response to said identified occurrences; further adjusting said score in relation to a size of said data; and associating said information types with processed files based upon said score values.
12. A method according to claim 11 , wherein said further adjustment is performed if the size of said data is below a predetermined threshold.
13. A method according to claim 12 or claim 13, wherein said further adjustment is performed up to a predetermined maximum weighting value.
14. A method according to any of claims 11 to 13, wherein said information types are expressed as preferred terms and are specified by information outlines.
15. A method according to claim 14, wherein said information outlines define preferred terms in a branching hierarchical structure and said adjustable values are defined at branches of said structure.
16. A method according to claim 15, wherein said further adjustment includes adjustment at a plurality of said branches.
17. A method according to claim 16, wherein said a score value is determined by multiplying values stored at branches throughout said hierarchy.
18. A method according to any of claims 11 to 17, wherein said relation for adjusting said score is an inverse relation, such that said score is adjusted inversely in relation to a size of said data.
19. A method according to claim 18, wherein said size of said data is determined in terms of the volume of the stored file, representing the number of characters present in the file.
20. A method according to any of claims 11 to 19, wherein files associated to categories are made available over a distributed network to a plurality of users.
21. A computer system programmed to execute stored instructions such that in response to said stored instructions said system is configured to: examine data elements in a file to identify occurrences of specified data types; adjust a score in response to said identified occurrences and to further adjust said score in relation to the size of said data; and to associate said information types with processed files dependent upon said score values.
22. A computer system programmed to execute stored instructions according to claim 21 , configured to further adjust said score if the size of said data is below a predetermined threshold.
23. A computer system programmed to execute stored instructions according to claim 21 or claim 22, configured to perform said further adjustment upto a predetermined maximum waiting value.
24. A computer system programmed to execute stored instructions according to any of claims 21 to 23, configured to associate said information types expressed as categories and specified by information outlines.
25. A computer system programmed to execute instructions according to claim 24, configured to store a branching hierarchical structure of data types defined by said information outlines, and to adjust values defined at branches of said structure.
26. A computer system programmed to execute instructions according to claim 25, configured to adjust values at a plurality of said branches.
27. A computer system programmed to execute stored instructions according to claim 26, configured to multiply values stored at branches throughout said hierarchy to produce said store value.
28. A computer system programmed to execute stored instructions according to any of claims 21 to 27, configured to further adjust said score inversely in relation to a size of said data.
29. A computer system programmed to execute stored instructions according to claim 28, configured to determine the size of said data in terms of the volume of the stored file representing the number of characters present in the file.
30. A computer-readable medium having computer-readable instructions executable by a computer such that, when executing said instructions, a computer will perform the steps of: examining data elements in a file to identify occurrences of specified data types; adjusting a score in response to said identified occurrences; further adjusting said score in relation to a size of said data; and associating said information types with processed files based upon said score values.
31. A computer-readable medium having computer-readable instructions according to claim 30, such that when executing said instructions a computer will also perform the step of performing said further adjustment if the size of said data is below a predetermined threshold.
32. A computer-readable medium having computer-readable instructions according to claim 30 or claim 31 , such that when executing said instructions a computer will also perform the step of performing said further adjustment upto a predetermined maximum weighting value.
33. A computer-readable medium having computer-readable instructions according to any of claims 30 to 32, such that when executing said instructions a computer will also perform the step of processing incoming data files with respect to information outlines.
34. A computer-readable medium having computer-readable instructions according to claim 33, such that when executing said instructions a computer will also perform the step of adjusting values defined at branches of information outlines defined in terms of a hierarchical structure.
35. A computer-readable medium having computer-readable instructions according to claim 34, such that when executing said instructions a computer will also perform the step of adjusting said structure at a plurality of branches.
36. A computer-readable medium having computer-readable instructions according to claim 35, such that when executing said instructions a computer will also perform the step of multiplying values stored at branches throughout the outline hierarchy to produce a score value.
37. A computer-readable medium having computer-readable instructions according to any of claims 30 to 36, such that when executing said instructions a computer will also perform the step of adjusting said score inversely in relation to the size of data.
38. A computer-readable medium having computer-readable instructions according to claim 37, such that when executing said instructions a computer will also perform the step of determining the size of a data file in terms of the volume of the stored file, representing the number of characters present in the file.
PCT/GB1999/001222 1998-04-24 1999-04-21 Associating files of machine-readable data with specified information types WO1999056224A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB9808807A GB2336699A (en) 1998-04-24 1998-04-24 Automatic classification of text files
GB9808807.3 1998-04-24

Publications (1)

Publication Number Publication Date
WO1999056224A1 true WO1999056224A1 (en) 1999-11-04

Family

ID=10830952

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1999/001222 WO1999056224A1 (en) 1998-04-24 1999-04-21 Associating files of machine-readable data with specified information types

Country Status (2)

Country Link
GB (1) GB2336699A (en)
WO (1) WO1999056224A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008046338A1 (en) * 2006-10-18 2008-04-24 Alibaba Group Holding Limited Method and system of determining garbage information

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7333983B2 (en) 2000-02-03 2008-02-19 Hitachi, Ltd. Method of and an apparatus for retrieving and delivering documents and a recording media on which a program for retrieving and delivering documents are stored
DE60044423D1 (en) * 2000-02-03 2010-07-01 Hitachi Ltd Method and device for retrieving and outputting documents and storage medium with corresponding program
AUPQ915600A0 (en) 2000-08-03 2000-08-24 Ltdnetwork Pty Ltd Online network and associated methods
WO2002013045A2 (en) * 2000-08-03 2002-02-14 Ltdnetwork Pty. Ltd. Online network and associated methods
EP1324220A1 (en) * 2001-12-24 2003-07-02 Abb Research Ltd. Process and system for generating and improving a collection of information objects

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5428778A (en) * 1992-02-13 1995-06-27 Office Express Pty. Ltd. Selective dissemination of information

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA1318403C (en) * 1988-10-11 1993-05-25 Michael J. Hawley Method and apparatus for extracting keywords from text
US5598557A (en) * 1992-09-22 1997-01-28 Caere Corporation Apparatus and method for retrieving and grouping images representing text files based on the relevance of key words extracted from a selected file to the text files
US5659766A (en) * 1994-09-16 1997-08-19 Xerox Corporation Method and apparatus for inferring the topical content of a document based upon its lexical content without supervision
EP0822502A1 (en) * 1996-07-31 1998-02-04 BRITISH TELECOMMUNICATIONS public limited company Data access system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5428778A (en) * 1992-02-13 1995-06-27 Office Express Pty. Ltd. Selective dissemination of information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAYES P J: "TCS: A SHELL FOR CONTENT-BASED TEXT CATEGORIZATION", PROCEEDINGS OF THE CONFERENCE ON ARTIFICIAL INTELLIGENCE APPLICATIO, SANTA BARBARA, MAR. 5 - 9, 1990, vol. 1, no. CONF. 6, 5 March 1990 (1990-03-05), INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, pages 320 - 326, XP000295098, ISBN: 0-8186-2032-3 *
WONG J W T ET AL: "ACTION: AUTOMATIC CLASSIFICATION FOR FULL-TEXT DOCUMENTS", SIGIR FORUM, vol. 30, no. 1, 21 March 1996 (1996-03-21), pages 26 - 41, XP000699962 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008046338A1 (en) * 2006-10-18 2008-04-24 Alibaba Group Holding Limited Method and system of determining garbage information
US8234291B2 (en) 2006-10-18 2012-07-31 Alibaba Group Holding Limited Method and system for determining junk information

Also Published As

Publication number Publication date
GB9808807D0 (en) 1998-06-24
GB2336699A (en) 1999-10-27

Similar Documents

Publication Publication Date Title
US6308176B1 (en) Associating files of data
US8321455B2 (en) Method for clustering automation and classification techniques
Robertson et al. The TREC 2002 Filtering Track Report.
US7299247B2 (en) Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
Ahmed et al. Language identification from text using n-gram based cumulative frequency addition
US6772170B2 (en) System and method for interpreting document contents
AU746762B2 (en) Data summariser
US8275773B2 (en) Method of searching text to find relevant content
WO2002054288A1 (en) Automated adaptive classification system for bayesian knowledge networks
JP2001515623A (en) Automatic text summary generation method by computer
EP2019361A1 (en) A method and apparatus for extraction of textual content from hypertext web documents
JPH03108064A (en) Information retrieving method and system
US7162413B1 (en) Rule induction for summarizing documents in a classified document collection
US6640219B2 (en) Analyzing data files
CN101138001A (en) Learning processing method, learning processing device, and program
CN106997390A (en) A kind of equipment part or parts commodity transaction information search method
US20040158558A1 (en) Information processor and program for implementing information processor
WO1999056224A1 (en) Associating files of machine-readable data with specified information types
JPH0962684A (en) Information retrieval method and information retrieval device, and information guiding method and information guiding device
WO1999056226A1 (en) Alerting user-processing sites as to the availability of information
JP2003016106A (en) Device for calculating degree of association value
GB2336700A (en) Generating machine readable association files
US7356604B1 (en) Method and apparatus for comparing scores in a vector space retrieval process
US6615200B1 (en) Information-retrieval-performance evaluating method, information-retrieval-performance evaluating apparatus, and storage medium containing information-retrieval-performance-retrieval-evaluation processing program
JPH09319752A (en) Retrieval supporting device

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase