US20030088577A1 - Database and method of generating same - Google Patents
Database and method of generating same
- Publication number
- US20030088577A1 (application number US10/195,847)
- Authority
- US
- United States
- Prior art keywords
- data
- node
- database
- arcs
- nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Definitions
- the present invention relates to a database and a method of generating a database.
- the invention relates to a database which facilitates efficient storage of data, rapid search and retrieval of data.
- Databases are used in computer-based information and processing systems for the storage of large quantities of information or data items for subsequent retrieval and processing. Such databases often require updating from time to time for redistribution to users who may be situated remotely from the producer of the database. Logistical difficulties can arise when databases become large. For example, users of the database might not have the same storage capacity enjoyed by the database creator and in cases where users download updated databases via a computer network such as the Internet, download times can become burdensome and functionality may be compromised.
- Preferred methods of data storage vary depending on the type of data to be stored. Opportunities for compression of data exist particularly when data to be stored contains repetitive elements.
- a relational database might be adopted for customer contact information having different categories.
- Such a database might employ a plurality of separate database tables, one for each category of information such as: one for customer name and address, a second for accounting records and a third for product information. These tables are linked, or related, by a customer ID so that accounting and/or product information can be retrieved without the need to store customer name and address data in the table of each category.
- the database is likely to be structured as a single table listing the data items.
- the category of each data item is then stored against each data item in the table.
- the difficulty is that the resulting table becomes wasteful of storage space because the same category identification is stored many times within the same table.
- the larger the database becomes, the more difficult it is to transfer between users and the longer it takes to retrieve information from it.
- U.S. Pat. No. 6,219,786 relates to a method and system for monitoring and controlling computer users' access to network resources from both inside and outside the network.
- the system monitors network traffic and applies access rules to the traffic to permit or deny access to predetermined network resources.
- a networked computer may be monitored so that access to predetermined Internet web-sites can be permitted while others denied.
- Such a system may include a database of URL's which are categorised by subject. Given the existence of many tens or even hundreds of millions of URL's which may be accessed via the World Wide Web (www), a database of these containing a category data tag for each can be expected to require a great deal of storage capacity and be slow to search.
- a database comprising a plurality of keys representing respective data items stored in the database and respective data tags associated with at least some of the data items, respective data tags representing different identifiers or categories among which the associated data items are grouped, wherein the database is arranged in the form of a tree data structure in which each of said plurality of keys is represented by a series of nodes and arcs defining a path between a root node and a terminal node, each node being linked to at least one other node by a respective arc, respective arcs for a given one of said plurality of keys representing a respective character or characters of said given key, and wherein the arcs and the nodes depending from said root node of data items which represent a sequence of characters shared by different keys are combined, and the data tags are associated with the arcs.
- a data tag is associated with each one of the arcs so that a data tag is read from the database as said respective character(s) of the key are read from the database.
- the last data tag which is read before reaching a terminal node defines the category or identifier of the key.
- a database comprising a plurality of keys representing respective data items stored in the database, wherein the database is arranged in the form of a tree data structure in which each of said plurality of keys is represented by a series of nodes and arcs defining a path between a root node and a terminal node, each node being linked to at least one other node by a respective arc, respective arcs for a given one of said plurality of keys representing a respective character or characters of said given key, and wherein the arcs and the nodes depending from said root node of data items representing a sequence of characters shared by different keys are combined, and the arcs and the nodes extending from a given terminal node of data items representing a sequence of characters shared by different keys are also combined, said given terminal node being a sink.
- a database may incorporate the first and the second aspects of the invention.
- the data tags are rationalised to minimise the amount of storage space taken up by category or identifier information for the keys and further storage saving measures are achieved by the combining of arcs and nodes between characters or character sequences shared by different keys when reading from the root node to the terminal nodes and when reading from the terminal nodes to the root node, wherein said terminal nodes are sinks.
- each of said plurality of keys is represented by a series of nodes and arcs defining a path between a root node and a terminal node, each node being linked to at least one other node by a respective arc, and respective arcs for a given one of said plurality of keys representing a respective character or characters of said given key wherein arcs and nodes depending from said root node of data items which represent a sequence of characters shared by different keys and category or identifier are combined;
- the method further includes compacting the data set by removing from a sequence of repeating identical data tags all but one of said identical data tags. Preferably, successive data tags identical to the first occurrence thereof in the sequence are removed. This allows redundant data tags to be removed from the database thereby making space available for more data items.
- a method of generating a database having a plurality of keys representing respective data items stored in the database comprising:
- each of said plurality of keys is represented by a series of nodes and arcs defining a path between a root node and a terminal node, each node being linked to at least one other node by a respective arc, and respective arcs for a given one of said plurality of keys representing a respective character or characters of said given key, wherein arcs and nodes depending from said root node of data items which represent a sequence of characters shared by different keys are combined;
- each of said plurality of keys is represented by a series of nodes and arcs defining a path between a root node and a terminal node, each node being linked to at least one other node by a respective arc, and respective arcs for a given one of said plurality of keys representing a respective character or characters of said given key wherein arcs and nodes depending from said root node of data items which represent a sequence of characters shared by different keys and category or identifier are combined;
- the steps of compacting the data set may each include a recursive routine. Successive data tags identical to the first occurrence thereof in the sequence may be the ones removed.
- said compacting step may include assigning a weight value to nodes of the data set, the weight value of a given node being dependent on the characters between said given node and an associated sink(s), said given node and associated sink(s) defining a sub-tree of said data set, and identifying two or more nodes having identical weight values as potentially having identical sub-trees.
- the weight value may be based on a checksum value incorporating the category or identifier of an arc extending from the node to which the weight value is being applied, in addition to the characters in the sub-tree.
- the checksum value may further incorporate an indication of the size of the associated sub-tree of the given node.
- the step of compacting to reduce identical sub-trees includes comparing with one another the nodes and sub-trees depending from, and including, nodes having identical weight values. Nodes having weight values representative of longer sub-trees are preferably compared and compacted prior to those representative of shorter ones. This provides for a faster compaction operation. Nodes and their respective sub-trees identified as identical are rationalised by directing the arc(s) leading to one of the nodes to the other node and removing said one node and its associated sub-tree from the database. This may be done using a recursive routine.
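A minimal Python sketch of this compaction step may help. It rests on several assumptions that are not taken from the specification: nodes are plain dicts, the weight value is the plain sum of character codes (the data tag and size terms mentioned above are left out), and the names `insert`, `same`, `compact` and `count_nodes` are illustrative only.

```python
def new_node():
    return {"arcs": {}, "terminal": False}   # arcs: char -> [data_tag, child]

def insert(root, key, tag):
    node = root
    for ch in key:
        if ch not in node["arcs"]:
            node["arcs"][ch] = [tag, new_node()]
        node = node["arcs"][ch][1]
    node["terminal"] = True

def same(a, b):
    """Full comparison: arc names, data tags and terminal flags must all match."""
    if a is b:
        return True
    if a["terminal"] != b["terminal"] or a["arcs"].keys() != b["arcs"].keys():
        return False
    return all(a["arcs"][c][0] == b["arcs"][c][0] and
               same(a["arcs"][c][1], b["arcs"][c][1]) for c in a["arcs"])

def compact(root):
    """Merge identical sub-trees, largest weight first; returns the merge count."""
    records = []                              # (weight, node, arc leading to node)

    def weigh(node, parent_arc):
        total = sum(ord(c) + weigh(arc[1], arc) for c, arc in node["arcs"].items())
        records.append((total, node, parent_arc))
        return total

    def mark_dead(node, dead):
        dead.add(id(node))
        for arc in node["arcs"].values():
            mark_dead(arc[1], dead)

    weigh(root, None)
    records.sort(key=lambda r: r[0], reverse=True)
    dead, kept, last_cs, merges = set(), [], None, 0
    for cs, node, parent_arc in records:
        if cs != last_cs:                     # new weight value: start a new group
            kept, last_cs = [], cs
        if parent_arc is None or id(node) in dead:
            continue                          # root, or part of a removed sub-tree
        twin = next((k for k in kept if same(k, node)), None)
        if twin is None:
            kept.append(node)
        else:
            parent_arc[1] = twin              # redirect the arc to the retained node
            mark_dead(node, dead)             # the duplicate sub-tree is dropped
            merges += 1
    return merges

def count_nodes(root):
    seen, stack = set(), [root]
    while stack:
        n = stack.pop()
        if id(n) not in seen:
            seen.add(id(n))
            stack.extend(arc[1] for arc in n["arcs"].values())
    return len(seen)

root = new_node()
insert(root, "METALLOPHON", 0)
insert(root, "MONOPHON", 0)                   # tag 0 for MONOPHON is an assumption
before = count_nodes(root)
merges = compact(root)
after = count_nodes(root)
```

In this sketch the two keys share the trailing sequence "OPHON", so a single merge removes one copy of that sub-tree. Equal weights only nominate candidates; the full comparison in `same` decides whether a merge takes place.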
- Any node except the root node may be a terminal node, provided it represents the end of a path defining a key. All nodes that have no further arcs leading to further nodes are terminal nodes, sometimes referred to as ‘sinks’.
- a node may be a terminal node because it defines the end of a key, but may also have further arcs leading to other nodes, the further arcs representing characters of other keys.
- the tree data structure may be in the form of a tree-structured directed graph.
- the data items may represent Universal Resource Locators (URL'S) for identifying Internet web pages, the categories corresponding to subject matter types, respective data tags representing different subject matter types.
- a data carrier having stored thereon a database as defined according to any aspect of the invention hereinabove.
- the data items of the database may be URL's and the data tags may be subject matter types for them.
- the data carrier may be in the form of any computer readable medium, such as: CD-ROM; a hard disk of a personal computer or network server; magnetic tape; or data stream.
- a computer program containing code, which when run on a computer can configure the computer to generate a database according to any aspect of the invention defined hereinabove.
- the computer program may contain code for configuring a computer to perform any of the methods of generating a database as defined hereinabove.
- Embodiments of the invention have the advantage that information in the form of sequences of characters that recur in many different keys (for example, the sequences “www.”, and “.com” occur in a great many URLs) need only be stored a minimum number of times in the database. This results in a substantial reduction in the bit size of the database and the amount of memory required.
- a further advantage is that searching is very fast because once a sequence of characters occurring in the key being sought has been found, there is no need to search anywhere else in the database for those characters. This arises from the tree-structured directed graph in which there is only one valid next move as a data item to be searched is looked up in the tree-structure. Also, once it is determined that the next character in a sequence is not present in the database, the search can be terminated because the key will not be present elsewhere.
- FIG. 1 is a schematic diagram of a known computer system on which a database embodying aspects of the present invention may be implemented;
- FIG. 2 is a flow diagram outlining a method for generating a database in accordance with the first and second aspects of the present invention
- FIG. 3 is an example of data items for use in an illustration of a database embodying the first and/or second aspects of the present invention
- FIG. 4 is a flow diagram with reference to which generation of a database embodying the first aspect of the present invention is explained.
- FIGS. 5 a to 5 e are conceptual representations for explaining the building up of a tree data structure for the data items of FIG. 3;
- FIG. 6 is a conceptual representation of a tree data structure in accordance with the first aspect of the present invention.
- FIG. 7 is a flow diagram with reference to which a process of data tag optimisation is described.
- FIG. 8 shows the directed graph representation of FIG. 6, in which redundancy of the data tags in accordance with the process of FIG. 7 has been reduced;
- FIG. 9 shows the directed graph of FIG. 8 with weight values assigned to nodes in accordance with creation of the database embodying the first and second aspects of the present invention
- FIG. 10 is a flow diagram with reference to which data compaction in accordance with a fourth stage of the process of FIG. 2 is described;
- FIG. 11 is a flow diagram showing a recursive procedure adopted within the flow diagram of FIG. 10;
- FIG. 12 shows the directed graph of FIG. 8 with an example of how arcs and nodes may be shared to extend from a common terminal node for a pair of data items having a common string of characters;
- FIG. 13 shows the directed graph of FIG. 8 with further examples of how arcs and nodes are shared
- FIGS. 14 a and 14 b show examples of paths for two data items which do not share the same root node or sink node;
- FIG. 15 shows the directed graph of FIG. 8 with yet further examples of how arcs and nodes are shared
- FIG. 16 shows the directed graph representation of FIG. 15, redrawn to illustrate a database structure optimised for redundancy using the example of FIG. 3;
- FIG. 17 shows how a database embodying the first and second aspects of the present invention may be represented in a data stream
- FIG. 18 is a flow diagram showing a rapid search and retrieval procedure for use with a database embodying the invention.
- FIGS. 19 a and 19 b show further examples of paths for data items having weight values assigned to nodes in accordance with creation of the database embodying the first and second aspects of the present invention.
- a computer system comprises a user interface 10 , a processor 12 , a data storage means 14 , and program memory 16 , all of which communicate with each other via a data bus 18 .
- the computer system further comprises an internet interface device 20 for facilitating communication with the internet 22 .
- a disk drive and/or CD ROM drive 24 facilitate reading and/or writing of data to and from portable media such as floppy disks or CDs.
- User interface 10 comprises an information display, for example a monitor, and a user input means such as a keyboard and/or a mouse. Instructions contained in the program memory 16 control the processor 12 to process data stored in the data storage means 14 or read from portable media via the drive 24 or downloaded from the internet 22 .
- the system shown in FIG. 1 is a single user system; however, it will be appreciated that the system is extendable to link two or more users communicating via the data bus 18 or internet/intranet/extranet links thereto.
- Computer systems such as the one described in FIG. 1 utilise databases comprising lists of information items and associated categories.
- the information items are in the form of keys, each key comprising a unique character string, for example names of people/companies/places/products etc.
- the categories are represented in the database by a category code, for example a number, and take the form of a data tag associated with each key.
- the computer performs a search of the database to locate the key and retrieve the data tag.
- Databases can be very large, some holding many millions of keys and their associated data tags.
- Prior art database structures tend to be such that the computer has to search sequentially through the entire list of keys stored in the database to find one that matches the required key. It then retrieves the data tag to identify the category.
- Two problems limit the efficacy of such systems: firstly, the amount of data stored can be prohibitively large, using up an excessive amount of data storage capacity; secondly, the processing time for completing the search can be very long and use up a large amount of computer memory.
- FIG. 2 shows a process for creating a compact and rapidly searchable database in accordance with the various aspects of the present invention.
- the processes that make up the steps of FIG. 2 will be described for a specific example, using the data items shown in FIG. 3, with reference to FIGS. 4 to 17 .
- the raw data 28 (keys and associated data tags) are read in at step 30 .
- the raw data is processed to produce a data structure representative of a tree data structure or tree-structured directed graph 34 , as will be described in more detail below with reference to FIGS. 4 to 6 .
- an algorithm is used to identify and discard superfluous data tags and produce a data structure representative of an optimised directed graph 38 , as will be described below with reference to FIGS. 7 and 8.
- the optimised directed graph 38 is compacted by the processes of steps 40 and 44 .
- weight values are assigned as will be described with reference to FIG. 9.
- the weight values are used to identify and reduce redundant key data to produce a data structure representative of a compacted directed graph 46 , as will be described with reference to FIGS. 10 to 16 .
- the optimised and compacted directed graph 46 is stored as a final database 50 in a data storage format that will be described with reference to FIG. 17.
- the key data is read by the system and the database 50 is searched at step 54 to rapidly retrieve the required data tag 56 .
- FIG. 3 illustrates a data set to be used as an example for describing the processes that make up an embodiment of the invention.
- the data set of FIG. 3 comprises a set of keys “BABYLON”, “BARITONE” etc., to each of which is assigned a data tag 0, 1, 2, or 3 according to which of the four categories: music, property, city or material entity, the key has been assigned.
- the data set of FIG. 3 is shown here only for the purpose of describing the embodiment of the invention, and is very small compared with most databases in use on computer systems.
- FIG. 4 shows the process for generating a tree-structured directed graph, and will be described with reference to FIGS. 5 and 6 to describe generation of a tree-structured directed graph for the data set of FIG. 3.
- a directed graph is a way of visualising, in two dimensions, an arrangement of data. Trees in the context of data structures, graphs and directed graphs are all known terms in the art (see for example, the NIST dictionary referred to above). The data itself remains as a binary encoded bit stream stored electronically by the computer system.
- the data in a directed graph structure is represented by arcs, each arc representing a character (e.g. a letter or numeral). It is contemplated that a given arc could represent more than one alpha-numeric character of the data item.
- a node does not represent any of the source data, but represents a point or junction between one character and one or more further characters.
- nodes are represented as circles and arcs are represented as lines having arrowheads pointing towards the node to which the arc leads.
- the root node is represented by a larger circle having a smaller circle inside it, and terminal nodes are represented by bold circles. The structure of the directed graph will become more apparent as the process of generation is described.
- the graph is blank and has only a root node with no arcs assigned. All the keys are now incorporated individually into the graph character by character, whereby their characters are stored along the arcs, and all arcs of a node are sorted in ascending order according to their key-character information. Sorting the arcs lends itself to fast search operations within a node. If a new arc is created, and not merely traversed, the data tag (or a reference to it) for the current key must also be filed along this arc so embodying the first aspect of the present invention. Each node to which the last arc of a key opens has to be marked as a terminal node and must be equipped with the current key's data tag. Consequently, following completion of the process there is a deterministic finite state machine available, which is the basis of the further steps.
- the process of building a graph from a set of data items is started at step 60 .
- a key and associated data tag are read from the source data set 64 .
- an indexing counter is set to 0.
- the directed graph consists only of a single root node and no arcs, as shown by the “initial state” of FIG. 5 a .
- the directed graph generator is positioned on the root node.
- the process reads the next character of the key, key[i]. The first time through the process this is the first character of the key, key[ 0 ], as defined by the indexing counter.
- FIGS. 5 b to 5 e show the example where the first key read is METALLOPHON.
- the first character is the letter “M”, and this is called the arc name of the next (first) arc.
- the process interrogates the data structure as to whether the character “M” already exists as an arc. As no arcs have yet been generated, the answer must clearly be No, and the process proceeds to step 74 , where the arc is generated.
- the associated data tag is also added to the arc. In the example, “metallophon” has been assigned the category 0 , “music”.
- the arc is traversed to position the generator on the next node, i.e. the node at the end of the arc.
- the directed graph is now at state 2 as shown in FIG. 5 b.
- the indexing counter is incremented by 1.
- the process interrogates the data to ask if the end of the key has been reached. The answer in the example case is No, and the process returns to step 70 to commence generation of the next arc, which this time is given the arc name key[ 1 ], the letter “E”.
- the data tag is added to the arc, and the directed graph is then at state 3 as shown in FIG. 5 c . The process repeats for each letter of the key until eventually all the letters of “METALLOPHON” have been assigned to arcs.
- at step 82 the answer is then Yes, and the process proceeds to step 84 where a flag data bit is added to the data to indicate that the node at the end of the last arc “N” is a terminal node.
- the directed graph is then at state 4 , as shown in FIG. 5 d.
- at step 85 the process ensures that the data tag associated with the last arc of the key is that associated with the key.
- the data tag will have been associated with the arc name at step 76 , however it is possible that the key may be made up entirely of characters already contained in the database and that step 76 will have been by-passed for every character of the key. In such circumstances it is necessary to associate the correct data tag with the last arc in the key.
- FIG. 6 shows the directed graph for the data set of FIG. 3.
- the key POLY has all its characters the same as the first four characters of the key POLYMORPH, but has a data tag of 1 whereas POLYMORPH has a data tag of 0 .
- at step 86 the process interrogates the data to see if the end of the data set has been reached. If the answer is Yes, the process is ended. However, in the illustrative example the answer is No, so the process returns to step 62 to read the next key and associated data tag. The next key is “MONOPHON”.
- the process reaches step 72 for the first time and asks whether the arc name “M” exists for the current node (in this case the root node), the answer is Yes because the arc with arc name “M” was generated for the key “METALLOPHON”. The process therefore steps ahead to step 78 , without generating an arc.
- on the next pass through step 72 the process asks the same question of the arc name “O”, but here the answer is No, and so a new arc must be generated. Thereafter, for MONOPHON all arcs will be new arcs because there will be no existing arcs connected to the nodes. State 5 , as shown in FIG. 5 e has then been reached.
- the data will represent the directed graph of FIG. 6.
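The build process of FIG. 4 can be sketched in Python as follows. This is an illustrative sketch, not the specification's implementation: nodes are plain dicts, a dict lookup stands in for the sorted arc list described above, the names `new_node`, `insert` and `build` are hypothetical, and the tag 0 given to MONOPHON is an assumption (the tags for METALLOPHON, POLYMORPH and POLY follow the text).

```python
def new_node():
    # arcs maps an arc name (one character) to [data_tag, child_node]
    return {"arcs": {}, "terminal": False}

def insert(root, key, tag):
    node, arc = root, None                    # generator starts on the root node
    for ch in key:                            # step 70: read key[i]
        if ch not in node["arcs"]:            # step 72: does the arc exist?
            node["arcs"][ch] = [tag, new_node()]   # steps 74/76: new arc with tag
        arc = node["arcs"][ch]
        node = arc[1]                         # step 78: traverse the arc
    node["terminal"] = True                   # step 84: flag the terminal node
    if arc is not None:
        arc[0] = tag                          # step 85: last arc carries the key's tag

def build(items):
    root = new_node()
    for key, tag in items:                    # step 62: read next key and data tag
        insert(root, key, tag)
    return root

root = build([("METALLOPHON", 0), ("MONOPHON", 0),
              ("POLYMORPH", 0), ("POLY", 1)])
```

After the build, the root has the two arcs "M" and "P", the node below "M" branches into "E" and "O", and the arc "Y" of POLY carries tag 1 even though it was first created for POLYMORPH.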
- the directed graph is termed “tree-structured”, because each key is represented by a pathway of arcs commencing at the root node and terminating at a terminal node. Each arc may only be traversed once and (at this stage) each node is only arrived at via one arc, but may have more than one arc departing from it.
- the data structure represented by FIG. 6 is well suited for searching. Starting at the root node a searching algorithm only needs to look for an arc with an arc name the same as the first character of the key being searched, and then to follow the path of arcs with arc names equivalent to the characters of the key, to identify the existence of the key in the database when the terminal node is reached. Reaching any node that has no arc whose arc name matches the next character of the key identifies the absence of the key from the database.
- the algorithm reads the data tags of the arcs as it traverses the pathway, disregarding the previously read data tag each time it reads a new data tag, then when it reaches a terminal node, the last data tag to be read will be the one associated with the key and will correctly identify the category of the key.
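The search just described can be sketched in Python as below. The dict-based node layout and the names `insert` and `lookup` are assumptions made for illustration; the tag 2 for BABYLON follows the walk-through of FIG. 7, and the tags for POLY and POLYMORPH follow the text.

```python
def new_node():
    return {"arcs": {}, "terminal": False}   # arcs: char -> [data_tag, child]

def insert(root, key, tag):
    node, arc = root, None
    for ch in key:
        if ch not in node["arcs"]:
            node["arcs"][ch] = [tag, new_node()]
        arc = node["arcs"][ch]
        node = arc[1]
    node["terminal"] = True
    if arc is not None:
        arc[0] = tag

def lookup(root, key):
    """Return the data tag of key, or None if the key is not in the database."""
    node, tag = root, None
    for ch in key:
        arc = node["arcs"].get(ch)
        if arc is None:
            return None            # no matching arc: the key cannot be present
        if arc[0] is not None:
            tag = arc[0]           # a newly read tag replaces the previous one
        node = arc[1]
    return tag if node["terminal"] else None

root = new_node()
for key, tag in [("POLYMORPH", 0), ("POLY", 1), ("BABYLON", 2)]:
    insert(root, key, tag)

found = [lookup(root, k) for k in ("POLY", "POLYMORPH", "BABYLON")]
missing = [lookup(root, k) for k in ("POLYM", "BABEL")]
```

Here "POLYM" follows existing arcs but ends on a node that is not terminal, while "BABEL" fails as soon as no arc "E" is found; both are reported as absent.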
- the data structure of FIG. 6 is, however, far from optimised. Data tags are stored with every arc, but this entails storing a great many more data tags than necessary to identify the tag associated with a key.
- the process shown in FIG. 7 removes superfluous data tags. The process is recursive, which is to say that it involves passing through the steps of a procedure that includes all the steps of the procedure itself as one of the steps. In other words it involves calling a subroutine, which calls itself.
- the process illustrated by the flow chart of FIG. 7 is started at step 100 , and at step 102 calls the data tag optimisation subroutine “data_tag_opt”, which operates on the parameters “current_node” and “data_tag”.
- the directed graph data structure is optimised by analysing the structure node by node, recursively, along each branch of the tree. The procedure keeps track of which node in the structure it is analysing by reference to a node label called p_node.
- the subroutine starts at step 104 .
- the node being analysed is labelled p_node and this becomes the current node.
- the process interrogates the data as to whether the current node has arcs.
- at step 110 the number “n” of arcs branching from the node is read and, at step 112 , a counter “i” is initialised to 0.
- at step 116 the data tag is compared with the previous data tag. If it is the same, then at step 118 the data tag is removed. If not, then the data tag is not removed and the routine moves directly to step 120 where it moves on to the next node (i.e. the node at the end of arc[i]).
- at step 122 the subroutine calls itself, i.e. it calls “data_tag_opt”, to perform the analysis for the next node. This can be considered as performing the analysis at the next level down the tree.
- if at step 108 the answer is No, the node must be a sink, and the subroutine returns (i.e. goes back up a level to the previous node) via step 128 .
- at step 124 the counter “i” is incremented by 1 and at step 126 , if the counter has not reached “n”, the number of arcs at the node, the data tag on the next arc is read by looping back to step 114 .
- once the counter has reached “n”, the subroutine moves to step 128 where it is returned back up to the node at the level above.
- the subroutine will be returned back to step 102 and the process is ended at step 130 .
- the routine moves on down the levels through the arcs “B”, “Y”, “L”, “O”, and “N”, removing the data tags ( 2 ) from all of these arcs as they are the same as the first ( 2 ) on the first arc “B”.
- when the routine reaches the sink (the last node) it is returned back up the levels until it reaches a node where there are further, as yet unanalysed, arcs branching from it, in this case the node with the arc “R”.
- the procedure continues for all the arcs of the directed graph, finally producing the directed graph of FIG. 8, which has been optimised to contain a minimal number of data tags, thereby reducing redundancy of data tag information in the database.
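The recursive routine of FIG. 7 can be sketched in Python as follows. The name `data_tag_opt` and its two parameters come from the text; everything else (the dict-based node layout, `count_tags`, `lookup`, and the tag 0 for BARITONE) is assumed for illustration.

```python
def new_node():
    return {"arcs": {}, "terminal": False}   # arcs: char -> [data_tag, child]

def insert(root, key, tag):
    node, arc = root, None
    for ch in key:
        if ch not in node["arcs"]:
            node["arcs"][ch] = [tag, new_node()]
        arc = node["arcs"][ch]
        node = arc[1]
    node["terminal"] = True
    if arc is not None:
        arc[0] = tag

def data_tag_opt(current_node, data_tag=None):
    """Remove every tag equal to the last tag seen on the path from the root."""
    for arc in current_node["arcs"].values():
        if arc[0] is None or arc[0] == data_tag:
            arc[0] = None                    # step 118: superfluous tag removed
            effective = data_tag
        else:
            effective = arc[0]               # a different tag is kept
        data_tag_opt(arc[1], effective)      # step 122: recurse one level down

def count_tags(node):
    return sum((arc[0] is not None) + count_tags(arc[1])
               for arc in node["arcs"].values())

def lookup(root, key):
    node, tag = root, None
    for ch in key:
        arc = node["arcs"].get(ch)
        if arc is None:
            return None
        tag = arc[0] if arc[0] is not None else tag
        node = arc[1]
    return tag if node["terminal"] else None

root = new_node()
insert(root, "BABYLON", 2)
insert(root, "BARITONE", 0)                  # tag 0 for BARITONE is an assumption
tags_before = count_tags(root)
data_tag_opt(root)
tags_after = count_tags(root)
```

In this example only the tag on the first arc "B" and the tag on the arc "R" survive, yet `lookup` still returns 2 for BABYLON and 0 for BARITONE, because the last tag read on the path is the one that applies.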
- the optimised database described above can be further reduced in size in accordance with an embodiment of the second aspect of the present invention.
- the nature of a directed graph requires that the path starting at the root node is the same for all keys that have an equal sequence of characters up to the point of a difference in one single character. Although keys might have equal character sequences in subsequent parts of the string, the path is held separately. Therefore, the database can be compacted by finding paths in the tree that have the same sequence of characters and data—i.e. paths that are equal—and reusing one single path rather than storing the path multiple times. Paths can be considered as equal only if the sequence of arcs is identical and the data tags stored along the arcs are identical.
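The equality test for paths can be sketched as a recursive comparison in Python. The dict-based node layout, the helper `chain` and the function name `same_subtree` are hypothetical; comparing the terminal flags as well is an assumption added here, since two sub-trees that end keys at different points cannot be interchanged.

```python
def new_node(terminal=False):
    return {"arcs": {}, "terminal": terminal}   # arcs: char -> [data_tag, child]

def chain(chars, tags):
    """Build a single path whose arcs carry the given characters and tags."""
    head = node = new_node()
    for ch, tag in zip(chars, tags):
        child = new_node()
        node["arcs"][ch] = [tag, child]
        node = child
    node["terminal"] = True
    return head

def same_subtree(a, b):
    if a is b:
        return True
    if a["terminal"] != b["terminal"]:
        return False
    if a["arcs"].keys() != b["arcs"].keys():     # same arc names at this node
        return False
    for ch in a["arcs"]:
        tag_a, child_a = a["arcs"][ch]
        tag_b, child_b = b["arcs"][ch]
        if tag_a != tag_b or not same_subtree(child_a, child_b):
            return False
    return True

equal = same_subtree(chain("ON", [None, None]), chain("ON", [None, None]))
tag_differs = same_subtree(chain("ON", [None, None]), chain("ON", [1, None]))
chars_differ = same_subtree(chain("ON", [None, None]), chain("OM", [None, None]))
```

The second comparison fails although the characters agree, which illustrates why identical arc sequences alone are not sufficient for sharing a path.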
- FIG. 9 shows the directed graph of FIG. 8 for the example data set of FIG. 3.
- the nodes have been assigned weight values (shown as numbers in the node circles).
- each character has been assigned a character value, which in this case is the character's ASCII value.
- the weight value of a node may be a checksum which is the sum of the character values of all the characters in the sub-tree below the node (i.e. between the node and all sinks that can be reached from the node). Put another way, the checksum is the sum of the character values of all the arcs branching from the node plus the weight values of the nodes at the ends of those arcs (sinks have zero weight value).
- FIG. 19 a shows a simple example of assigning checksums which does not form a part of the example database, but uses the same method.
- a very simple checksum algorithm can be used: the checksum of a particular node is the sum of all character ASCII values of the node's arcs plus the checksum of all connected nodes.
- This algorithm is sufficient for the sample because it provides a reasonably unique value for a sub-tree and also reflects the level of the node: the higher the value, the larger the sub-tree.
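A short Python sketch of this simple checksum is given below. The dict-based node layout and the names `insert` and `checksum` are assumptions, and the keys used are not those of FIG. 19 a, which is not reproduced here.

```python
def new_node():
    return {"arcs": {}}                      # arcs: char -> child node

def insert(root, key):
    node = root
    for ch in key:
        node = node["arcs"].setdefault(ch, new_node())

def checksum(node):
    """Sum of the ASCII values of the node's arcs plus the checksums of the
    nodes they lead to; a sink has no arcs and therefore weighs zero."""
    return sum(ord(ch) + checksum(child) for ch, child in node["arcs"].items())

root = new_node()
insert(root, "AT")
weight_root = checksum(root)                 # 65 ('A') + 84 ('T')
weight_mid = checksum(root["arcs"]["A"])     # 84 ('T')
weight_sink = checksum(root["arcs"]["A"]["arcs"]["T"])
```

Because every arc adds a positive value, a node always weighs more than any node in its own sub-tree, which is what allows the weight to stand in for the level.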
- the checksum is the concatenation of (1) the length of the longest path of the sub-tree, (2) the sum of the character values and (3) the sum of the data tag values.
- the format is a 9-digit number, padded with leading zeros in the form lllcccddd, where lll is the level, ccc is the character sum and ddd is the data sum.
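The lllcccddd format can be illustrated with a short Python sketch. The node representation (a dictionary mapping each arc character to a pair of data tag and child node) and the function names are assumptions for the example, and the sketch assumes that each of the three components fits in three digits, as it does for a small sample data set.

```python
# Assumed representation: a node is a dict
# {arc character: (data tag or None, child node)}; a sink is an empty dict.
def measure(node):
    """Return (level, character sum, data sum) for the sub-tree below node."""
    level = char_sum = data_sum = 0
    for ch, (tag, child) in node.items():
        c_level, c_chars, c_data = measure(child)
        level = max(level, 1 + c_level)   # length of the longest path
        char_sum += ord(ch) + c_chars     # ASCII values of all arcs below
        data_sum += (tag or 0) + c_data   # data tag values of all arcs below
    return level, char_sum, data_sum

def composite_checksum(node):
    """Format the components as lllcccddd (assumes each fits in three digits)."""
    level, char_sum, data_sum = measure(node)
    return f"{level:03d}{char_sum:03d}{data_sum:03d}"

# A node with a single arc "T", no data tag, leading to a sink.
node6 = {"T": (None, {})}
print(composite_checksum(node6))  # 001084000
```

Placing the level in the leading digits means that sorting the checksums in descending order lists the deepest sub-trees first.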
- the checksum values for each of the nodes of FIG. 19b are summarised in the table below.
Node 6: Characters T; Data none; Level 1; Character sum 84; Data sum 0; Checksum 001084000.
Node 5. Character sum 84 Level.
- checksums represent a hash of a data set. The hash is not necessarily unique: depending on the data set, several different sets of data can have the same value. Computing time is, however, saved by comparing only sub-trees with equal checksums. Equal checksums indicate that sub-trees have a high probability of being identical. For fast and easy processing the checksums are first collected into a list, which is then sorted by descending value. As already indicated, the checksum should represent the level information. The list will, therefore, show the largest sub-trees first.
- Each record in the list should additionally store reference information for the corresponding node as a means of finding the node again later in the process.
- the reference may, for example, be a pointer to the memory location, or anything else appropriate. Best optimisation can be achieved by reducing large sub-trees prior to small sub-trees. Special care should be taken on implementation to ensure that, when reducing sub-trees, references stored with nodes do not become invalid.
- the method reads in the database and at step 202 compiles a list 204 of all the nodes (identified by node references) and their associated checksums.
- the list is sorted into a descending order of checksum values.
- a variable called “last_cs” is set to 0.
- the next checksum on the list is read and its value assigned to the variable “current_cs”.
- the values of “current_cs” and “last_cs” are compared. If they are not equal, the sub-trees below the nodes must be different and the method steps forward via step 213 where the parameter last_cs is set equal to current_cs (i.e.
- at step 216 the node references, noderef1 and noderef2, of the nodes having equal checksums are read and at step 218 the comparison of the sub-trees is performed, as will be described below with reference to FIG. 11. If the comparison determines that the sub-trees are not identical by returning a FALSE flag at step 220, the method is stepped forward to step 224.
- at step 220, if the comparison has determined that the sub-trees are identical by returning a TRUE flag, then at step 222 the arc leading into the node of noderef2 is redirected to the node of noderef1 so that the sub-tree below the node of noderef2 can be removed from the database.
- at step 224 the method determines if there are any more nodes on the list. If there are, the method loops back to step 210, but if not the method is ended at step 226.
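The compaction loop described above may be sketched in Python as follows. This is a hedged illustration only: the node representation, the helper names (`checksum`, `identical`, `collect`, `mark_removed`, `compact`, `insert`) and the use of Python object identity as the node reference are assumptions, and the simple character-sum checksum is used for brevity. Each list entry records the parent node and arc character so that the arc leading into a node can be redirected.

```python
# Assumed representation: a node is a dict {arc character: (data tag or None, child node)}.
def checksum(node):
    return sum(ord(ch) + checksum(child) for ch, (_tag, child) in node.items())

def identical(n1, n2):
    """Compare characters, data tags and children of two sub-trees."""
    return n1.keys() == n2.keys() and all(
        n1[ch][0] == n2[ch][0] and identical(n1[ch][1], n2[ch][1]) for ch in n1)

def collect(node, entries):
    """List (checksum, parent, arc character) for every node below the root."""
    for ch, (_tag, child) in node.items():
        entries.append((checksum(child), node, ch))
        collect(child, entries)

def mark_removed(node, removed):
    removed.add(id(node))
    for _tag, child in node.values():
        mark_removed(child, removed)

def compact(root):
    entries = []
    collect(root, entries)
    entries.sort(key=lambda e: e[0], reverse=True)  # largest sub-trees first
    removed = set()                                 # ids of nodes already cut off
    last_cs, last_node = None, None
    for cs, parent, ch in entries:
        if id(parent) in removed:
            continue                                # lies in a sub-tree already removed
        tag, node = parent[ch]
        if cs == last_cs and identical(last_node, node):
            parent[ch] = (tag, last_node)           # redirect the arc to the kept node
            mark_removed(node, removed)
        else:
            last_cs, last_node = cs, node

def insert(root, key):
    """Hypothetical helper: add a key without data tags to the tree."""
    node = root
    for ch in key:
        node = node.setdefault(ch, (None, {}))[1]

root = {}
for key in ("METALLOPHON", "XYLOPHON"):
    insert(root, key)
compact(root)
node_b = root["M"][1]["E"][1]["T"][1]["A"][1]["L"][1]  # node reached after "METAL"
node_c = root["X"][1]["Y"][1]                          # node reached after "XY"
print(node_b is node_c)  # True
```

Skipping entries whose parent lies in a removed sub-tree is one way of ensuring that stale references are never followed after a redirection.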
- the method for comparing the sub-trees is performed recursively.
- the subroutine “compare_tree” is started at step 300 to compare the sub-trees of two nodes identified at step 212 of FIG. 10 as having identical checksums and called here node1 and node2.
- a comparison is made of the number of arcs branching from each of the nodes. If these are not equal, the sub-trees cannot be identical, and so the subroutine is returned with a FALSE flag at step 318.
- the subroutine continues at step 304 to set a variable “n” to equal the number of arcs and at step 305 initialises a counter “i” to 0.
- the subroutine reads the arc names (i.e. the characters) on the first arc of each node. The characters are read in the order of ascending character value (the values used to determine the node checksums).
- a comparison of the arc names is made. If they are not the same, then the subroutine immediately returns with a FALSE flag at step 318.
- the data tags of the arcs being compared are read.
- the data tags are compared and if they are not the same the subroutine immediately returns with a FALSE flag at step 318. If they are the same then the subroutine moves on to compare the next nodes of the two sub-trees (next_node1 and next_node2) at steps 320 and 322.
- the subroutine calls itself to compare the next nodes and to continue down the levels of the sub-tree in a recursive manner.
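The recursive comparison can be sketched in Python along the following lines. The node representation and the helper `path` are assumptions made for the example; only the order of the checks (number of arcs, arc names in ascending character order, data tags, then the next nodes) follows the description above.

```python
# Assumed representation: a node is a dict {arc character: (data tag or None, next node)}.
def compare_tree(node1, node2):
    """Return True only if both sub-trees hold identical arcs and data tags."""
    if len(node1) != len(node2):          # different numbers of arcs
        return False
    # Read the arcs of both nodes in ascending character order.
    for name1, name2 in zip(sorted(node1), sorted(node2)):
        if name1 != name2:                # arc names differ
            return False
        tag1, next_node1 = node1[name1]
        tag2, next_node2 = node2[name2]
        if tag1 != tag2:                  # data tags differ
            return False
        if not compare_tree(next_node1, next_node2):
            return False                  # a difference further down
    return True

def path(word, tag=None):
    """Hypothetical helper: a single-path sub-tree spelling word, tag on the first arc."""
    node = {}
    for ch in reversed(word):
        node = {ch: (None, node)}
    first = word[0]
    node[first] = (tag, node[first][1])
    return node

# "NTH" and "RPH" both sum to 234 but are not identical.
print(compare_tree(path("NTH"), path("RPH")))        # False
print(compare_tree(path("LOPHON"), path("LOPHON")))  # True
```

The routine returns as soon as the first difference is found, so sub-trees that merely share a checksum are normally rejected after very few comparisons.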
- the compaction method described with reference to FIGS. 10 and 11 can be applied to the example database shown in FIG. 9.
- the checksum information from the tree is extracted into a sequential list of “checksum, pointer”.
- the pointer is a reference to the particular node, and provides a means of finding it again. 6296, Node0 1303, . . . 0309, . . . 0864, . . . 0861, . . . 0226, . . . 0918, Node1 0689, . . . 0229, . . . 0788, . . . 0779, . . . 0147, . . .
- This list is then sorted into descending checksum value order: 6296, Node0 0699, . . . 0388, . . . 0234, . . . 0157, . . . 0078, . . . 1303, . . . 0689, . . . 0383, . . . 0233, . . . 0157, . . . 0078, . . . 1014, . . . 0631, . . . 0380, . . . 0233, . . . 0157, . . . 0078, . . . 0943, . . .
- the first value found that is equal for two nodes is 464. Comparing the underlying trees shows that they are equal in character sequence as well as in data tags (no data tags in this case). Consequently reassigning the arc named “Y” of node A to point to node B can cut off the second tree. The storage resources used by the tree starting at node C can now be freed up—the tree is not connected any more.
- 388 is the next value to look at. Again one tree can be reduced. Although 388 occurs in the list three times, the third occurrence had already been cut off in the previous step and can therefore be ignored.
- FIG. 12 illustrates this example.
- the keys METALLOPHON and XYLOPHON have both been categorised as music (category 0) and both end with the sequence of characters LOPHON.
- the nodes labelled B and C in FIG. 12 both have the checksum value 464. Comparison of the sub-trees determines that both contain identical characters and data tags, so the arc having arc name “Y” that connects the nodes labelled A and C is redirected to connect node A to node B. All the arcs that comprise the sub-tree below node C are then removed from the database.
- FIG. 13 shows similar compaction of the example database for other nodes having equal checksum values.
- the sub-trees shown in boxes with a shaded background are those that are being removed from the database.
- FIG. 14a presents an example of two nodes having equal checksum values, but which are not identical.
- the character values of both the sub-trees “NTH” and “RPH” produce checksums totalling 234 (see the nodes in the keys “NINTH” and “POLYMORPH” in FIG. 13).
- comparison of the individual characters soon indicates that they are not identical and causes the comparison subroutine of FIG. 11 to return a FALSE flag.
- FIG. 14b illustrates an example of two keys BARITONE and MARITAL. Both contain the same string of characters “ARIT”. However, the subroutine would not identify equal checksums and so compaction of the database to produce the sub-tree illustrated in FIG. 14b would not occur. This is important because compaction in this way would give rise to the possibility of keys not in the original data set being present in the final compacted database. In the example, the keys “MARITONE” and “BARITAL” would be present in the compacted tree of FIG. 14b, even though they were not part of the original data set.
- FIG. 15 illustrates further examples of compacting of the example data set at lower levels (i.e. at nodes having lower checksum values). Again, the sub-trees shown in boxes with a shaded background are those that are being removed from the database. It should be noted that the most efficient method of compacting the database is to start by comparing the highest checksum values first so as to remove the largest equivalent sub-trees from the database first, and then proceed by comparing progressively smaller sub-trees having equal checksum values.
- FIG. 16 illustrates the example database in its final compacted form, with all the redundant arcs removed, and as such represents an embodiment of both of the first and second aspects of the present invention. All the original keys from the data set of FIG. 3 are present together with their associated data tags. Some of the nodes in FIG. 16 have numbers appearing in the circles that represent the nodes. These are not checksum values, but are node labels which will be used to describe the format in which the data is stored with reference to FIG. 17.
- the data itself must be stored.
- the data may be stored electronically in the format of a one dimensional binary encoded bit stream.
- a node is stored as its set of arcs, sorted in ascending order in terms of character information.
- arcs are stored in ascending sorted order by their character value.
- FIG. 17 is a representation of a bit stream.
- the top line 400 in FIG. 17 comprises 56 bits which are used to store the data associated with a single arc.
- the first 8 bits (bits 0 to 7) are the character itself as represented by its ASCII value.
- the ninth bit (bit 8) is a data tag flag. If this bit is a 1 it indicates that a data tag is also stored with the arc, but if it is a 0 there is no data tag.
- the tenth bit (bit 9) is a second flag, which indicates whether or not the arc leads to a terminal node.
- the next 30 bits (bits 10 to 39) contain pointer information in the form of an address to the location in the database of the first arc of the next node.
- the last 16 bits (bits 40 to 55) contain the data tag, if the data tag flag indicates its presence. Otherwise these bits are not present.
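The arc record described above can be illustrated by a small Python sketch that packs and unpacks the fields. The function names, the use of a string of '0' and '1' characters to stand for the bit stream, the most-significant-bit-first ordering within each field and the pointer value used in the example are all assumptions; the description only fixes the field widths and their order.

```python
def pack_arc(char, pointer, terminal=False, tag=None):
    """Encode one arc as a string of '0'/'1' characters, first field first."""
    if not (0 <= ord(char) < 2**8 and 0 <= pointer < 2**30):
        raise ValueError("character or pointer out of range")
    if tag is not None and not 0 <= tag < 2**16:
        raise ValueError("data tag out of range")
    bits = format(ord(char), "08b")           # bits 0 to 7: character
    bits += "1" if tag is not None else "0"   # bit 8: data tag flag
    bits += "1" if terminal else "0"          # bit 9: terminal node flag
    bits += format(pointer, "030b")           # bits 10 to 39: pointer
    if tag is not None:
        bits += format(tag, "016b")           # bits 40 to 55: data tag, if present
    return bits

def unpack_arc(bits):
    """Decode a record produced by pack_arc."""
    char = chr(int(bits[0:8], 2))
    has_tag = bits[8] == "1"
    terminal = bits[9] == "1"
    pointer = int(bits[10:40], 2)
    tag = int(bits[40:56], 2) if has_tag else None
    return char, pointer, terminal, tag

record = pack_arc("B", pointer=7, tag=2)      # pointer value 7 is hypothetical
print(len(record))         # 56
print(unpack_arc(record))  # ('B', 7, False, 2)
print(len(pack_arc("A", pointer=12)))  # 40: no data tag is stored
```

An arc without a data tag occupies 40 bits instead of 56, which is where the removal of redundant data tags saves space in the stored stream.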
- next two lines 402 illustrate the data for a set of nodes corresponding to the nodes labelled 0 to 11 in FIG. 16.
- the bits that comprise the arc data in line 400 are shown compressed into the four fields, character, flags, pointer and data tag.
- Node 0 is the root node in FIG. 16 and has arcs with the characters B, M, N, P, S, T and X.
- the arc representing the character ‘B’ carries a data flag (shown here as a ‘Y’ for ‘Yes’) indicating the presence of a data tag, but no flag (shown here as N for ‘No’) indicating the presence of a terminal node; the pointer data for this arc points to the first arc (‘A’) of Node 1; and the arc carries the data tag ‘2’. Similar data is contained in the fields representing the other arcs of node 0, all of which carry data tags, but none of which lead to a terminal node. Finally the character ‘&’ is used to represent the termination of data for the node.
- Node 1 has only a single arc representing the character ‘A’. This arc carries no data tag and does not lead to a terminal node, so the flags are both shown as N.
- FIG. 18 illustrates a procedure for rapidly searching the database to find a key and return its associated data tag.
- a query is read in the form of the key to be sought.
- an indexing counter “i” is set to 0 and at step 504 the search is started at the root node of the data structure.
- a parameter result_tag is set to a null value.
- the parameter arc_name is set to the next character of the key, key[i].
- the procedure determines whether the arc name exists for the current node. If the arc_name does not exist the procedure steps directly to step 524 to return a null value. This means that the key is not to be found in the database.
- the procedure determines if a data tag is stored with the arc. If there is a data tag, the parameter result_tag is set to the value of the data tag at step 514. At step 516 the procedure moves on to the next node and at step 518 the indexing counter “i” is incremented by 1. At step 520 the procedure determines whether the counter “i” is less than the length of the key (i.e. the number of characters in the key), and if it is the procedure returns to step 508 to look for the next character in the key.
- the procedure determines at step 522 whether the current node is a terminal node by reading the terminal node flag associated with the arc (see the data stream representation of FIG. 17). If the current node is not a terminal node, the key is not in the database and the procedure moves directly to step 524 to return a null data tag value. If the node is a terminal node then the procedure moves to step 526 to return the result_tag value, which is the data tag associated with the key, and confirms the presence of the key in the database.
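The search procedure can be sketched in Python as follows. The node representation, the function name `lookup` and the two sample keys are assumptions made for illustration; the variable names `result_tag` and `arc_name` follow the description. Note that, in this sketch, a key that is present but has no data tag anywhere on its path cannot be distinguished from an absent key, since both return `None`.

```python
# Assumed representation: a node is a dict
# {arc character: (data tag or None, terminal flag, next node)}.
def lookup(root, key):
    """Return the data tag of key, or None if the key is not in the database."""
    node = root
    result_tag = None
    terminal = False
    for arc_name in key:
        if arc_name not in node:
            return None                  # no such arc: the key is absent
        tag, terminal, next_node = node[arc_name]
        if tag is not None:
            result_tag = tag             # remember the last data tag read
        node = next_node
    # The key is present only if its last arc leads to a terminal node.
    return result_tag if terminal else None

# Hypothetical data: keys "NO" (data tag 1) and "NOT" (data tag 2).
root = {"N": (1, False, {"O": (None, True, {"T": (2, True, {})})})}
print(lookup(root, "NO"))   # 1
print(lookup(root, "NOT"))  # 2
print(lookup(root, "N"))    # None
print(lookup(root, "NOW"))  # None
```

The number of steps depends only on the length of the key and not on the number of keys stored, and a search for an absent key stops at the first character for which no arc exists.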
Abstract
Description
- The present invention relates to a database and a method of generating a database. In particular the invention relates to a database which facilitates efficient storage of data, rapid search and retrieval of data.
- Databases are used in computer-based information and processing systems for the storage of large quantities of information or data items for subsequent retrieval and processing. Such databases often require updating from time to time for redistribution to users who may be situated remotely from the producer of the database. Logistical difficulties can arise when databases become large. For example, users of the database might not have the same storage capacity enjoyed by the database creator and in cases where users download updated databases via a computer network such as the Internet, download times can become burdensome and functionality may be compromised.
- Preferred methods of data storage vary depending on the type of data to be stored. Opportunities for compression of data exist particularly when data to be stored contains repetitive elements. Various schemes exist in the art for increasing the efficiency of data management. For example, relational databases are adopted in situations where it is desirable to avoid repetition of data entry. A relational database might be adopted for customer contact information having different categories. Such a database might employ a plurality of separate database tables, one for each category of information such as: one for customer name and address, a second for accounting records and a third for product information. These tables are linked, or related, by a customer ID so that accounting and/or product information can be retrieved without the need to store customer name and address data in the table of each category.
- A difficulty arises when it is desired to store a large number of data items which are to be classified into a relatively small number of different categories. In such a case, the database is likely to be structured as a single table listing the data items. The category of each data item is then stored against each data item in the table. The difficulty is that the resulting table becomes extravagant on storage space because the same category identification is stored many times within the same table. As the database becomes larger, the more difficult it is to transfer between users and the longer it takes to retrieve information from it.
- This problem of ‘wasted space’ is exacerbated in cases where the data items contain repetitive elements or components. For example, in a database for relating Internet web pages identified by Uniform Resource Locators (URL's) to subject category, it is expected that there will be millions of URL's and subject categories numbered in the order of a few tens to hundreds, possibly a few thousand. URL's are keys containing strings of alphanumeric and other characters. Not only is there ‘wasted space’ in the storage of identical subject categories against multiple data items, but there is ‘wasted space’ in storing elements (i.e. strings of characters) which repeat themselves among the URL's.
- Although numerous methods of data compression are known in the art, these techniques are generally applicable to the passive storage and transport of data. In other words, the database is not designed to facilitate search and retrieval of data while in a compressed state. It is an aim of the invention to devise a database structure which provides for greater storage capacity and searching speed in a decompressed state.
- U.S. Pat. No. 6,219,786 relates to a method and system for monitoring and controlling computer users' access to network resources from both inside and outside the network. The system monitors network traffic and applies access rules to the traffic to permit or deny access to predetermined network resources. In one application of this system, a networked computer may be monitored so that access to predetermined Internet web-sites can be permitted while others denied. Such a system may include a database of URL's which are categorised by subject. Given the existence of many tens or even hundreds of millions of URL's which may be accessed via the World Wide Web (www), a database of these containing a category data tag for each can be expected to require a great deal of storage capacity and be slow to search.
- It is therefore an aim of the invention to devise a database structure and method of generating same which alleviates these problems. In particular, it is an aim of the invention to devise a database structure which can contain more data items than in prior art database structures having the same storage capacity. It is another aim to provide for faster retrieval of data. It is a further aim of the invention to devise a database structure which provides for faster confirmation of the absence of a data item.
- It is an aim of the invention to devise a database which can store many millions of URL's and their respective category data tags (numbering tens to hundreds) with a reduced storage requirement. It is a further aim to provide for faster retrieval and searching of such a database.
- According to a first aspect of the present invention there is provided a database comprising a plurality of keys representing respective data items stored in the database and respective data tags associated with at least some of the data items, respective data tags representing different identifiers or categories among which the associated data items are grouped, wherein the database is arranged in the form of a tree data structure in which each of said plurality of keys is represented by a series of nodes and arcs defining a path between a root node and a terminal node, each node being linked to at least one other node by a respective arc, respective arcs for a given one of said plurality of keys representing a respective character or characters of said given key, and wherein the arcs and the nodes depending from said root node of data items which represent a sequence of characters shared by different keys are combined, and the data tags are associated with the arcs.
- In a preferred embodiment of the invention, a data tag is associated with each one of the arcs so that a data tag is read from the database as said respective character(s) of the key are read from the database. The last data tag which is read before reaching a terminal node defines the category or identifier of the key. In cases where successive arcs within a path have the same data tags associated with them, only one, for example the first occurrence of the data tag when reading from the root node, is stored in the database to reduce or eliminate redundancy of data therein.
- According to a second aspect of the present invention there is provided a database comprising a plurality of keys representing respective data items stored in the database, wherein the database is arranged in the form of a tree data structure in which each of said plurality of keys is represented by a series of nodes and arcs defining a path between a root node and a terminal node, each node being linked to at least one other node by a respective arc, respective arcs for a given one of said plurality of keys representing a respective character or characters of said given key, and wherein the arcs and the nodes depending from said root node of data items representing a sequence of characters shared by different keys are combined, and the arcs and the nodes extending from a given terminal node of data items representing a sequence of characters shared by different keys are also combined, said given terminal node being a sink.
- A database may incorporate the first and the second aspects of the invention. In such a database, the data tags are rationalised to minimise the amount of storage space taken up by category or identifier information for the keys and further storage saving measures are achieved by the combining of arcs and nodes between characters or character sequences shared by different keys when reading from the root node to the terminal nodes and when reading from the terminal nodes to the root node, wherein said terminal nodes are sinks.
- According to a further aspect of the present invention there is provided a method of generating a database having a plurality of keys representing respective data items stored in the database and respective data tags associated with at least some of the data items, respective data tags representing different identifiers or categories among which the data items are grouped, wherein the method comprises:
- generating a data set represented by a tree data structure in which each of said plurality of keys is represented by a series of nodes and arcs defining a path between a root node and a terminal node, each node being linked to at least one other node by a respective arc, and respective arcs for a given one of said plurality of keys representing a respective character or characters of said given key wherein arcs and nodes depending from said root node of data items which represent a sequence of characters shared by different keys and category or identifier are combined; and
- associating at least some of the arcs with data tags which correspond to the category or identifier of the key represented by the character or characters of the arc.
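The two steps above (building the tree with shared leading arcs, then associating data tags with arcs) might be sketched in Python as follows. This is an illustrative assumption and not the method of the patent: in particular, the rule that newly created arcs take the data tag of the key being inserted, while the final arc of every key is always given that key's own data tag, is a choice made for the sketch so that a lookup which returns the last data tag read yields the correct category.

```python
# Assumed representation: a node is a dict
# {arc character: (data tag, terminal flag, next node)}.
def insert(root, key, tag):
    """Add key with its data tag, sharing arcs with keys that begin identically."""
    node = root
    for i, ch in enumerate(key):
        is_last = i == len(key) - 1
        if ch in node:
            old_tag, terminal, child = node[ch]
            if is_last:
                # The final arc of a key always carries that key's own data tag.
                node[ch] = (tag, True, child)
        else:
            child = {}
            node[ch] = (tag, is_last, child)   # a new arc takes the key's data tag
        node = child

root = {}
insert(root, "NOT", 2)   # hypothetical keys and data tags
insert(root, "NO", 1)
tag_n, terminal_n, after_n = root["N"]
tag_o, terminal_o, after_o = after_n["O"]
print(tag_n, terminal_n)  # 2 False
print(tag_o, terminal_o)  # 1 True
print(after_o["T"][:2])   # (2, True)
```

Only one arc "N" and one arc "O" are stored for the two keys, which is the sharing of leading character sequences referred to above.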
- In a preferred embodiment, the method further includes compacting the data set by removing from a sequence of repeating identical data tags all but one of said identical data tags. Preferably, successive data tags identical to the first occurrence thereof in the sequence are removed. This allows redundant data tags to be removed from the database thereby making space available for more data items.
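Removal of repeated data tags along a path can be illustrated with the following Python sketch, which keeps only the first occurrence of each run of identical data tags when reading from the root node. The node representation and the name `strip_redundant_tags` are assumptions, and the sketch assumes a tree in which no node is yet shared between paths.

```python
# Assumed representation: a node is a dict
# {arc character: (data tag or None, terminal flag, next node)}.
def strip_redundant_tags(node, inherited=None):
    """Drop a data tag when it repeats the last data tag seen on the path from the root."""
    for ch, (tag, terminal, child) in list(node.items()):
        if tag == inherited:
            node[ch] = (None, terminal, child)   # redundant: same as the tag above
        strip_redundant_tags(child, inherited if tag is None else tag)

# Hypothetical data: keys "NO" (data tag 1) and "NOT" (data tag 2), every arc tagged.
root = {"N": (1, False, {"O": (1, True, {"T": (2, True, {})})})}
strip_redundant_tags(root)
print(root)  # {'N': (1, False, {'O': (None, True, {'T': (2, True, {})})})}
```

A search that remembers the last data tag read still returns 1 for "NO" and 2 for "NOT", although one fewer data tag is stored.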
- According to a yet further aspect of the present invention, there is provided a method of generating a database having a plurality of keys representing respective data items stored in the database, wherein the method comprises:
- generating a data set represented by a tree data structure in which each of said plurality of keys is represented by a series of nodes and arcs defining a path between a root node and a terminal node, each node being linked to at least one other node by a respective arc, and respective arcs for a given one of said plurality of keys representing a respective character or characters of said given key, wherein arcs and nodes depending from said root node of data items which represent a sequence of characters shared by different keys are combined; and
- compacting the data set so that arcs and nodes extending from a given terminal node towards said root node of data items which represent a sequence of characters shared by different keys are also combined, said given terminal node being a sink.
- In a yet further aspect of the present invention, there is provided a method of generating a database having a plurality of keys representing respective data items stored in the database and respective data tags associated with at least some of the data items, respective data tags representing different categories or identifiers among which the data items are grouped, wherein the method comprises:
- generating a data set represented by a tree data structure in which each of said plurality of keys is represented by a series of nodes and arcs defining a path between a root node and a terminal node, each node being linked to at least one other node by a respective arc, and respective arcs for a given one of said plurality of keys representing a respective character or characters of said given key wherein arcs and nodes depending from said root node of data items which represent a sequence of characters shared by different keys and category or identifier are combined;
- associating at least some of the arcs with data tags which correspond to the category or identifier of the key represented by the character or characters of the arc;
- compacting the data set by removing from a sequence of repeating identical data tags all but one of said identical data tags; and
- further compacting the data set so that arcs and nodes extending from a given terminal node towards said root node of data items which represent a sequence of characters and category or identifier shared by different keys are also combined, wherein said given terminal node is a sink node.
- The steps of compacting the data set may each include a recursive routine. Successive data tags identical to the first occurrence thereof in the sequence may be the ones removed.
- In a preferred embodiment, said compacting step may include assigning a weight value to nodes of the data set, the weight value of a given node being dependent on the characters between said given node and an associated sink(s), said given node and associated sink(s) defining a sub-tree of said data set, and identifying two or more nodes having identical weight values as potentially having identical sub-trees. The weight value may be based on a checksum value incorporating the category or identifier of an arc extending from the node to which the weight value is being applied, in addition to the characters in the sub-tree. The checksum value may further incorporate an indication of the size of the associated sub-tree of the given node.
- The step of compacting to reduce identical sub-trees includes comparing with one another the nodes and sub-trees depending from, and including, nodes having identical weight values. Nodes having weight values representative of longer sub-trees are preferably compared and compacted prior to those representative of shorter ones. This provides for a faster compaction operation. Nodes and their respective sub-trees identified as identical are rationalised by directing the arc(s) leading to one of the nodes to the other node and removing said one node and its associated sub-tree from the database. This may be done using a recursive routine.
- Any node except the root node may be a terminal node, provided it represents the end of a path defining a key. All nodes that have no further arcs leading to further nodes are terminal nodes, sometimes referred to as ‘sinks’. A node may be a terminal node because it defines the end of a key, but may also have further arcs leading to other nodes, the further arcs representing characters of other keys. The tree data structure may be in the form of a tree-structured directed graph.
- In an embodiment of the present invention, the data items may represent Uniform Resource Locators (URL's) for identifying Internet web pages, the categories corresponding to subject matter types, respective data tags representing different subject matter types.
- According to the present invention, there is yet further provided a data carrier having stored thereon a database as defined according to any aspect of the invention hereinabove. The data items of the database may be URL's and the data tags may be subject matter types for them. The data carrier may be in the form of any computer readable medium, such as: CD-ROM; a hard disk of a personal computer or network server; magnetic tape; or data stream.
- According to the present invention, there is yet further provided a computer program containing code, which when run on a computer can configure the computer to generate a database according to any aspect of the invention defined hereinabove. The computer program may contain code for configuring a computer to perform any of the methods of generating a database as defined hereinabove.
- The terms used herein are defined in a dictionary published by the National Institute of Standards and Technology (NIST), see in particular their Dictionary of Algorithms, Data Structures and Problems. This may be accessed via the Internet (see URL: http://www.nist.gov/dads/terms.html).
- It should be noted that variations may be made to embodiments of the present invention without departing from the scope thereof. For example, there may be instances within a tree-structured directed graph in which pairs of nodes are linked by more than one arc.
- Embodiments of the invention have the advantage that information in the form of sequences of characters that recur in many different keys (for example, the sequences “www.”, and “.com” occur in a great many URLs) need only be stored a minimum number of times in the database. This results in a substantial reduction in the bit size of the database and the amount of memory required. A further advantage is that searching is very fast because once a sequence of characters occurring in the key being sought has been found, there is no need to search anywhere else in the database for those characters. This arises from the tree-structured directed graph in which there is only one valid next move as a data item to be searched is looked up in the tree-structure. Also, once it is determined that the next character in a sequence is not present in the database, the search can be terminated because the key will not be present elsewhere.
- Further advantages arise from the optimisation and storage of the data tag information. By storing the data tags with the characters of the keys, as a key is read the data tag is also read, removing the need to retrieve the data tag from an associated data location. Removal of redundant data tags results in a substantial reduction in the amount of data that has to be stored. In the case where the database stores URL's, a tenfold reduction in the size of the database is contemplated relative to prior art database structures which may be employed.
- The invention will now be further described by way of example, with reference to the following drawings, in which:
- FIG. 1 is a schematic diagram of a known computer system on which a database embodying aspects of the present invention may be implemented;
- FIG. 2 is a flow diagram outlining a method for generating a database in accordance with the first and second aspects of the present invention;
- FIG. 3 is an example of data items for use in an illustration of a database embodying the first and/or second aspects of the present invention;
- FIG. 4 is a flow diagram with reference to which generation of a database embodying the first aspect of the present invention is explained;
- FIGS. 5a to 5e are conceptual representations for explaining the building up of a tree data structure for the data items of FIG. 3;
- FIG. 6 is a conceptual representation of a tree data structure in accordance with the first aspect of the present invention;
- FIG. 7 is a flow diagram with reference to which a process of data tag optimisation is described;
- FIG. 8 shows the directed graph representation of FIG. 6, in which redundancy of the data tags in accordance with the process of FIG. 7 has been reduced;
- FIG. 9 shows the directed graph of FIG. 8 with weight values assigned to nodes in accordance with creation of the database embodying the first and second aspects of the present invention;
- FIG. 10 is a flow diagram with reference to which data compaction in accordance with a fourth stage of the process of FIG. 2 is described;
- FIG. 11 is a flow diagram showing a recursive procedure adopted within the flow diagram of FIG. 10;
- FIG. 12 shows the directed graph of FIG. 8 with an example of how arcs and nodes may be shared to extend from a common terminal node for a pair of data items having a common string of characters;
- FIG. 13 shows the directed graph of FIG. 8 with further examples of how arcs and nodes are shared;
- FIGS. 14a and 14b show examples of paths for two data items which do not share the same root node or sink node;
- FIG. 15 shows the directed graph of FIG. 8 with yet further examples of how arcs and nodes are shared;
- FIG. 16 shows the directed graph representation of FIG. 15, redrawn to illustrate a database structure optimised for redundancy using the example of FIG. 3;
- FIG. 17 shows how a database embodying the first and second aspects of the present invention may be represented in a data stream;
- FIG. 18 is a flow diagram showing a rapid search and retrieval procedure for use with a database embodying the invention; and
- FIGS. 19a and 19b show further examples of paths for data items having weight values assigned to nodes in accordance with creation of the database embodying the first and second aspects of the present invention.
- Referring to FIG. 1, a computer system comprises a user interface 10, a processor 12, a data storage means 14, and program memory 16, all of which communicate with each other via a data bus 18. The computer system further comprises an internet interface device 20 for facilitating communication with the internet 22. A disk drive and/or CD ROM drive 24 facilitates reading and/or writing of data to and from portable media such as floppy disks or CDs. User interface 10 comprises an information display, for example a monitor, and a user input means such as a keyboard and/or a mouse. Instructions contained in the program memory 16 control the processor 12 to process data stored in the data storage means 14 or read from portable media via the drive 24 or downloaded from the internet 22. The system shown in FIG. 1 is a single user system; however, it will be appreciated that the system is extendable to link two or more users communicating via the data bus 18 or internet/intranet/extranet links thereto.
- Computer systems such as the one described in FIG. 1 utilise databases comprising lists of information items and associated categories. The information items are in the form of keys, each key comprising a unique character string, for example names of people, companies, places, products, etc. The categories are represented in the database by a category code, for example a number, and take the form of a data tag associated with each key. When information about the category of an item is required, for example at the request of a user, or in response to a coded instruction as part of a software routine or control procedure, the computer performs a search of the database to locate the key and retrieve the data tag.
- Databases can be very large, some holding many millions of keys and their associated data tags. Prior art database structures tend to be such that the computer has to search sequentially through the entire list of keys stored in the database to find one that matches the required key. It then retrieves the data tag to identify the category. Two problems limit the efficacy of such systems: firstly, the amount of data stored can be prohibitively large, using up an excessive amount of data storage capacity; secondly, the processing time for completing the search can be very long and use up a large amount of computer memory.
- FIG. 2 shows a process for creating a compact and rapidly searchable database in accordance with the various aspects of the present invention. The processes that make up the steps of FIG. 2 will be described for a specific example, using the data items shown in FIG. 3, with reference to FIGS. 4 to 17. Referring to FIG. 2, the raw data 28 (keys and associated data tags) are read in at step 30. At step 32 the raw data is processed to produce a data structure representative of a tree data structure or tree-structured directed graph 34, as will be described in more detail below with reference to FIGS. 4 to 6. At step 36 an algorithm is used to identify and discard superfluous data tags and produce a data structure representative of an optimised directed graph 38, as will be described below with reference to FIGS. 7 and 8.
- The optimised directed graph 38 is compacted by the processes of steps 40 and 44. At step 40 weight values are assigned as will be described with reference to FIG. 9. At step 44 the weight values are used to identify and reduce redundant key data to produce a data structure representative of a compacted directed graph 46, as will be described with reference to FIGS. 10 to 16.
- At step 48 the optimised and compacted directed graph 46 is stored as a final database 50 in a data storage format that will be described with reference to FIG. 17. When the system needs to know the category (data tag) associated with a key, the key data is read by the system and the database 50 is searched at step 54 to rapidly retrieve the required data tag 56.
- FIG. 3 illustrates a data set to be used as an example for describing the processes that make up an embodiment of the invention. The data set of FIG. 3 comprises a set of keys “BABYLON”, “BARITONE” etc., to each of which is assigned a data tag.
- FIG. 4 shows the process for generating a tree-structured directed graph, and will be described with reference to FIGS. 5 and 6 for the data set of FIG. 3. A directed graph is a way of visualising, in two dimensions, an arrangement of data. Trees in the context of data structures, graphs and directed graphs are all known terms in the art (see for example, the NIST dictionary referred to above). The data itself remains as a binary encoded bit stream stored electronically by the computer system. The data in a directed graph structure is represented by arcs, each arc representing a character (e.g. a letter or numeral). It is contemplated that a given character could represent more than one alpha-numeric character of the data item. The arcs interconnect nodes. A node does not represent any of the source data, but represents a point or junction between one character and one or more further characters. In FIGS. 5 and 6 nodes are represented as circles and arcs are represented as lines having arrowheads pointing towards the node to which the arc leads. The root node is represented by a larger circle having a smaller circle inside it, and terminal nodes are represented by bold circles. The structure of the directed graph will become more apparent as the process of generation is described.
- Initially the graph is empty and has only a root node with no arcs assigned. The keys are then incorporated into the graph one at a time, character by character: their characters are stored along the arcs, and all arcs of a node are sorted in ascending order according to their key-character information. Sorting the arcs lends itself to fast search operations within a node. If a new arc is created, and not merely traversed, the data tag (or a reference to it) for the current key must also be filed along this arc, so embodying the first aspect of the present invention. Each node to which the last arc of a key leads has to be marked as a terminal node and must be equipped with the current key's data tag. Consequently, on completion of the process a deterministic finite state machine is available, which is the basis of the further steps.
- The process of building a graph from a set of data items is started at step 60. At step 62 a key and associated data tag are read from the source data set 64. At step 66 an indexing counter is set to 0. Thus far no data has been processed and the directed graph consists only of a single root node and no arcs, as shown by the “initial state” of FIG. 5a. At step 68 the directed graph generator is positioned on the root node. At step 70 the process reads the next character of the key, key[i]. The first time through the process this is the first character of the key, key[0], as defined by the indexing counter. FIGS. 5b to 5e show the example where the first key read is METALLOPHON. Thus the first character is the letter “M”, and this is called the arc name of the next (first) arc. At step 72 the process interrogates the data structure as to whether the character “M” already exists as an arc. As no arcs have yet been generated, the answer must clearly be No, and the process proceeds to step 74, where the arc is generated. At step 76 the associated data tag is also added to the arc. In the example, “metallophon” has been assigned the category 0, “music”. At step 78 the arc is traversed to position the generator on the next node, i.e. the node at the end of the arc. The directed graph is now at state 2 as shown in FIG. 5b.
- At step 80, the indexing counter is incremented by 1. At step 82 the process interrogates the data to ask if the end of the key has been reached. The answer in the example case is No, and the process returns to step 70 to commence generation of the next arc, which this time is given the arc name key[1], the letter “E”. Again at step 76, the data tag is added to the arc, and the directed graph is then at state 3 as shown in FIG. 5c. The process repeats for each letter of the key until eventually all the letters of “METALLOPHON” have been assigned to arcs. This time, at step 82 the answer is Yes and the process proceeds to step 84 where a flag data bit is added to the data to indicate that the node at the end of the last arc “N” is a terminal node. The directed graph is then at state 4, as shown in FIG. 5d.
- At step 85 the process ensures that the data tag associated with the last arc of the key is that associated with the key. In most cases the data tag will have been associated with the arc name at step 76; however, it is possible that the key may be made up entirely of characters already contained in the database and that step 76 will have been by-passed for every character of the key. In such circumstances it is necessary to associate the correct data tag with the last arc in the key. An example of this can be seen in FIG. 6, which shows the directed graph for the data set of FIG. 3. The key POLY has all its characters the same as the first four characters of the key POLYMORPH, but has a data tag of 1 whereas POLYMORPH has a data tag of 0. Therefore if POLY is entered into the database after POLYMORPH, all the arcs will already exist and have associated data tags of 0. The arc representing the last character “Y” of POLY must therefore have the correct data tag 1 associated with it by overwriting the previous data tag. Note that the arc “Y” leads to a terminal node, but the terminal node is not a sink.
- At step 86 the process interrogates the data to see if the end of the data set has been reached. If the answer is Yes, the process is ended. However, in the illustrative example the answer is No, so the process returns to step 62 to read the next key and associated data tag. The next key is “MONOPHON”. Here, when the process reaches step 72 for the first time and asks whether the arc name “M” exists for the current node (in this case the root node), the answer is Yes because the arc with arc name “M” was generated for the key “METALLOPHON”. The process therefore steps ahead to step 78, without generating an arc. The next time around, at step 72, the process asks the same question of the arc name “O”, but here the answer is No, and so a new arc must be generated. Thereafter, for MONOPHON all arcs will be new arcs because there will be no existing arcs connected to the nodes. State 5, as shown in FIG. 5e, has then been reached.
- Once the process has been undertaken for all of the keys of the data set, the data will represent the directed graph of FIG. 6. Note that the directed graph is termed “tree-structured”, because each key is represented by a pathway of arcs commencing at the root node and terminating at a terminal node. Each arc may only be traversed once and (at this stage) each node is only arrived at via one arc, but may have more than one arc departing from it.
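- The building process of FIG. 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the specification: the class names `Node` and `Arc` and the functions `add_key` and `build` are hypothetical, arcs are held in a dictionary rather than a sorted list, and the three sample keys with their tags are those named in the text above.

```python
class Node:
    def __init__(self):
        self.arcs = {}         # arc name (one character) -> Arc
        self.terminal = False  # set when a key ends at this node


class Arc:
    def __init__(self, tag):
        self.tag = tag         # data tag filed along the arc (step 76)
        self.target = Node()   # node at the end of the arc


def add_key(root, key, tag):
    node, arc = root, None             # step 68: start at the root node
    for ch in key:                     # step 70: read key[i]
        arc = node.arcs.get(ch)        # step 72: does this arc exist?
        if arc is None:
            arc = Arc(tag)             # steps 74 and 76: new arc plus tag
            node.arcs[ch] = arc
        node = arc.target              # step 78: traverse the arc
    node.terminal = True               # step 84: mark the terminal node
    if arc is not None:
        arc.tag = tag                  # step 85: last arc gets the key's tag


def build(items):
    root = Node()
    for key, tag in items:
        add_key(root, key, tag)
    return root


root = build([("METALLOPHON", 0), ("POLYMORPH", 0), ("POLY", 1)])
```

Because POLY is entered after POLYMORPH, the existing arc “Y” has its tag overwritten with 1 at step 85, while the arc “M” that follows it keeps the tag 0. Iterating over `sorted(node.arcs)` would give the ascending arc order described above.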
- The data structure represented by FIG. 6 is well suited for searching. Starting at the root node a searching algorithm only needs to look for an arc with an arc name the same as the first character of the key being searched, and then to follow the path of arcs with arc names equivalent to the characters of the key. The existence of the key in the database is identified when the last character of the key leads to a terminal node. Reaching any node that has no arc with an arc name equivalent to the next character of the key identifies the absence of the key from the database. Furthermore, if the algorithm reads the data tags of the arcs as it traverses the pathway, disregarding the previously read data tag each time it reads a new data tag, then when it reaches a terminal node, the last data tag to be read will be the one associated with the key and will correctly identify the category of the key.
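- The search just described can be illustrated as follows. The sketch assumes a simple dictionary representation of nodes and arcs; the helper names `make_node`, `insert` and `lookup` are hypothetical and are introduced only so that the example runs on its own.

```python
def make_node():
    return {"terminal": False, "arcs": {}}


def insert(root, key, tag):
    """Build the tree-structured graph: every new arc carries the key's tag."""
    node, arc = root, None
    for ch in key:
        if ch not in node["arcs"]:
            node["arcs"][ch] = {"tag": tag, "node": make_node()}
        arc = node["arcs"][ch]
        node = arc["node"]
    node["terminal"] = True
    if arc is not None:
        arc["tag"] = tag               # last arc always carries the key's tag


def lookup(root, key):
    """Return the data tag of key, or None if the key is not stored."""
    node, tag = root, None
    for ch in key:
        arc = node["arcs"].get(ch)
        if arc is None:
            return None                # no matching arc: key is absent
        if arc["tag"] is not None:
            tag = arc["tag"]           # a newly read tag replaces the old one
        node = arc["node"]
    return tag if node["terminal"] else None


root = make_node()
for key, tag in [("POLYMORPH", 0), ("POLY", 1), ("BABYLON", 2)]:
    insert(root, key, tag)
```

With this data, `lookup(root, "POLY")` gives 1 and `lookup(root, "POLYMORPH")` gives 0, whereas `lookup(root, "POL")` gives `None` because the node reached is not a terminal node. The `None` check on the arc tag is not needed for the graph of FIG. 6, but lets the same function be used after superfluous tags have been removed.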
- Nevertheless, the data structure of FIG. 6 is far from optimised. Data tags are stored with every arc, but this entails storing a great many more data tags than necessary to identify the tag associated with a key. The process shown in FIG. 7 removes superfluous data tags. The process is recursive, which is to say that it involves passing through the steps of a procedure that includes all the steps of the procedure itself as one of the steps. In other words it involves calling a subroutine, which calls itself.
- The process illustrated by the flow chart of FIG. 7 is started at step 100, and at step 102 calls the data tag optimisation subroutine “data_tag_opt”, which operates on the parameters “current_node” and “data_tag”. The directed graph data structure is optimised by analysing the structure node by node, recursively, along each branch of the tree. The procedure keeps track of which node in the structure it is analysing by reference to a node label called p_node. The subroutine starts at step 104. At step 106 the node being analysed is labelled p_node and this becomes the current node. At step 108, the process interrogates the data as to whether the current node has arcs. If the answer is Yes, then at step 110 the number “n” of arcs branching from the node is read and, at step 112, a counter “i” is initialised to 0. At step 114, the data tag stored with the next arc, arc[i], is read (when i=0, arc[0] is the first arc at the node). At step 116 the data tag is compared with the previous data tag. If it is the same, then at step 118 the data tag is removed. If not, the data tag is not removed. In either case the routine then moves to step 120, where it moves on to the next node (i.e. the node at the end of arc[i]). At step 122 the subroutine calls itself, i.e. it calls “data_tag_opt”, to perform the analysis for the next node. This can be considered as performing the analysis at the next level down the tree.
- If at step 108 the answer is No, the node must be a sink, and the subroutine returns (i.e. goes back up a level to the previous node) via step 128.
- When the subroutine has been returned back up a level it continues to step 124 where the counter “i” is incremented by 1 and at step 126, if the counter has not reached “n”, the number of arcs at the node, the data tag on the next arc is read by looping back to step 114. Once all the arcs at a node have been analysed (i.e. i=n) the subroutine moves to step 128 where it is returned back up to the node at the level above. Eventually, when the entire database has been analysed, the subroutine will be returned back to step 102 and the process is ended at step 130.
- Referring back to FIG. 6, if the process is started at the root node and the first arc to be analysed is “B”, then as there is no previous data tag the arc “B” retains the data tag (2) and the routine moves down a level to the next node (the node between “B” and “A”). The arc “A” is the next to be analysed and because this also has the data tag (2), which is the same as the previous arc, it is removed. The routine moves down a level to the next node. Here there are two arcs branching from the node, “B” and “R”. The routine considers first the arc “B” (it could consider the arc “R” first; it would make no difference to the outcome). The routine moves on down the levels through the arcs “B”, “Y”, “L”, “O”, and “N”, removing the data tags (2) from all of these arcs as they are the same as the first (2) on the first arc “B”. When the routine reaches the sink (the last node) it is returned back up the levels until it reaches a node where there are further, as yet unanalysed, arcs branching from it, in this case the node with the arc “R”. The procedure continues for all the arcs of the directed graph, finally producing the directed graph of FIG. 8, which has been optimised to contain a minimal number of data tags, thereby reducing redundancy of data tag information in the database.
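- The recursive subroutine of FIG. 7 might be sketched as follows. The dictionary representation and the helpers `make_node`, `insert`, `count_tags` and `lookup` are assumptions made for the example, and the tag 2 for BARITONE is assumed for illustration; only `data_tag_opt` corresponds to the subroutine named above.

```python
def make_node():
    return {"terminal": False, "arcs": {}}


def insert(root, key, tag):
    node, arc = root, None
    for ch in key:
        if ch not in node["arcs"]:
            node["arcs"][ch] = {"tag": tag, "node": make_node()}
        arc = node["arcs"][ch]
        node = arc["node"]
    node["terminal"] = True
    if arc is not None:
        arc["tag"] = tag


def data_tag_opt(node, data_tag=None):
    for ch in sorted(node["arcs"]):              # steps 110 to 126: each arc
        arc = node["arcs"][ch]
        tag_in_force = arc["tag"]                # step 114: read the arc's tag
        if tag_in_force == data_tag:             # step 116: same as previous?
            arc["tag"] = None                    # step 118: remove it
        data_tag_opt(arc["node"], tag_in_force)  # steps 120 and 122: recurse


def count_tags(node):
    return sum((arc["tag"] is not None) + count_tags(arc["node"])
               for arc in node["arcs"].values())


def lookup(root, key):
    node, tag = root, None
    for ch in key:
        arc = node["arcs"].get(ch)
        if arc is None:
            return None
        if arc["tag"] is not None:
            tag = arc["tag"]
        node = arc["node"]
    return tag if node["terminal"] else None


root = make_node()
for key, tag in [("BABYLON", 2), ("BARITONE", 2), ("POLYMORPH", 0), ("POLY", 1)]:
    insert(root, key, tag)
before = count_tags(root)
data_tag_opt(root)
after = count_tags(root)
```

In this small example the 22 tags of the unoptimised graph reduce to 4: one on the arc “B” leaving the root, and one each on the arcs “P”, “Y” and “M” of the POLY and POLYMORPH path, where the tag in force changes. Every key still resolves to its original tag.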
- The optimised database described above can be further reduced in size in accordance with an embodiment of the second aspect of the present invention. To achieve efficient storage of all keys it is desirable to rid the graph of redundancy. The nature of a directed graph requires that the path starting at the root node is the same for all keys that have an equal sequence of characters up to the point of a difference in one single character. Although keys might have equal character sequences in subsequent parts of the string, the path is held separately. Therefore, the database can be compacted by finding paths in the tree that have the same sequence of characters and data (i.e. paths that are equal) and reusing one single path rather than storing the path multiple times. Paths can be considered as equal only if the sequence of arcs is identical and the data tags stored along the arcs are identical.
- The method of creating the database embodying the second aspect of the invention will be described with reference to FIGS. 9 to 16. FIG. 9 shows the directed graph of FIG. 8 for the example data set of FIG. 3. In FIG. 9 the nodes have been assigned weight values (shown as numbers in the node circles). In this example each character has been assigned a character value, which in this case is the character's ASCII value. It will be appreciated that any consistent set of values could be used, which uniquely identifies every possible character found in the keys. The weight value of a node may be a checksum which is the sum of the character values of all the characters in the sub-tree below the node (i.e. between the node and all sinks that can be reached from the node). Put another way, the checksum is the sum of the character values of all the arcs branching from the node plus the weight values of the nodes at the ends of those arcs (sinks have zero weight value).
- FIG. 19a shows a simple example of assigning checksums which does not form a part of the example database, but uses the same method. For the example presented a very simple checksum algorithm can be used: the checksum of a particular node is the sum of all character ASCII values of the node's arcs plus the checksum of all connected nodes.
- Example calculation, using the ASCII values A=65, B=66, C=67, D=68, E=69:
- Node 3 = D = 68
- Node 4 = E = 69
- Node 2 = Node 3 + Node 4 + B + C = 68 + 69 + 66 + 67 = 270
- Node 1 = Node 2 + A = 270 + 65 = 335
- This algorithm is sufficient for the sample, as it provides a reasonably unique value for a sub-tree and also reflects the level of the node: the higher the value, the larger the sub-tree. However, for larger trees it is recommended to use a more complex calculation to reduce the number of equal checksums and to take counter overflows into consideration.
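- The simple checksum can be expressed as a short recursive function. In the sketch below a node is represented as a dictionary mapping each arc character to the node at the end of that arc, and a sink is an empty dictionary. The shape of the tree (arc A from Node 1 to Node 2, arcs B and C from Node 2 to Nodes 3 and 4, arcs D and E to sinks) is inferred from the calculation above and is an assumption, as are the function names.

```python
def checksum(node):
    """Sum of the character values of a node's arcs plus the child checksums."""
    return sum(ord(ch) + checksum(child) for ch, child in node.items())


def chain(text):
    """Build a single path of arcs spelling text, ending in a sink."""
    node = {}
    for ch in reversed(text):
        node = {ch: node}
    return node


node3 = {"D": {}}
node4 = {"E": {}}
node2 = {"B": node3, "C": node4}
node1 = {"A": node2}

print(checksum(node2), checksum(node1))               # 270 335
print(checksum(chain("NTH")), checksum(chain("RPH")))  # 234 234
```

The second line of output shows why equal checksums only indicate candidates for sharing: the different sub-trees “NTH” and “RPH” both sum to 234.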
- Other methods of assigning checksums may be used. CRC and MD5 are two examples of known methods.
- An example of calculating a compound checksum value is described with reference to FIG. 19b. The checksum is the concatenation of (1) the length of the longest path of the sub-tree, (2) the sum of the character values and (3) the sum of the data tag values. The format is a 9-digit number, padded with leading zeros, in the form lllcccddd, where lll is the level, ccc is the character sum and ddd is the data sum. The checksum values for each of the nodes of FIG. 19b are summarised in the table below.
Node 6: characters T; character sum 84; level 1; data none; data sum 0; checksum 001084000
Node 5: characters T; character sum 84; level 1; data none; data sum 0; checksum 001084000
Node 4: characters S, R, T, T; character sum 83 + 82 + 84 + 84 = 333; level 2; data 9; data sum 9; checksum 002333009
Node 3: characters E; character sum 69; level 1; data none; data sum 0; checksum 001069000
Node 2: characters R, E, O, S, T, R, T; character sum 82 + 69 + 79 + 83 + 84 + 82 + 84 = 563; level 3 (the longest path); data 5, 9; data sum 5 + 9 = 14; checksum 003563014
Node 1: characters P, R, E, O, S, T, R, T; character sum 80 + 82 + 69 + 79 + 83 + 84 + 82 + 84 = 643; level 4 (the longest path); data 3, 5, 9; data sum 3 + 5 + 9 = 17; checksum 004643017
- The purpose of assigning checksums to the nodes is to perform the compaction method outlined in FIG. 10. A checksum represents a hash of a data set. The hash is not necessarily unique: depending on the data, several different sub-trees can produce the same value. Computing time is, however, saved by comparing only sub-trees with equal checksums. Equal checksums indicate that sub-trees have a high probability of being identical. For fast and easy processing the checksums are first collected into a list, which is then sorted by descending value. As already indicated, the checksum should represent the level information. The list will, therefore, show the largest sub-trees first. Each record in the list should additionally store a reference to the corresponding node as a means of finding the node again later in the process. The reference may, for example, be a pointer to the memory location, or anything else appropriate. Best optimisation can be achieved by reducing large sub-trees prior to small sub-trees. Special care should be taken on implementation to ensure that, when reducing sub-trees, references stored with nodes do not become invalid.
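- The compound checksum of FIG. 19b and the descending sort described above can be sketched as follows. A node is represented here as a dictionary mapping an arc character to a pair (data tag or `None`, next node). The shape of the tree is inferred from the table, and the placement of the tags 3, 5 and 9 on particular arcs is an assumption; the sums do not depend on it. The function names are hypothetical.

```python
def stats(node):
    """Return (level, character sum, data sum) for the sub-tree below node."""
    level = char_sum = data_sum = 0
    for ch, (tag, child) in node.items():
        c_level, c_chars, c_data = stats(child)
        level = max(level, 1 + c_level)           # longest path below the node
        char_sum += ord(ch) + c_chars
        data_sum += (tag or 0) + c_data
    return level, char_sum, data_sum


def compound_checksum(node):
    return "%03d%03d%03d" % stats(node)           # format lllcccddd


SINK = {}
node5 = {"T": (None, SINK)}
node6 = {"T": (None, SINK)}
node4 = {"S": (9, node5), "R": (None, node6)}     # arc carrying 9 is assumed
node3 = {"E": (None, SINK)}
node2 = {"R": (5, node3), "O": (None, node4)}     # arc carrying 5 is assumed
node1 = {"P": (3, node2)}                         # arc carrying 3 is assumed

nodes = {"Node 1": node1, "Node 2": node2, "Node 3": node3,
         "Node 4": node4, "Node 5": node5, "Node 6": node6}

# List of (checksum, reference) records, largest sub-trees first.
records = sorted(((compound_checksum(n), name) for name, n in nodes.items()),
                 reverse=True)
```

Because the level occupies the leading digits, sorting the fixed-width strings in descending order places Node 1 first and the single-arc nodes last. A three-digit field overflows once a character sum exceeds 999, so a practical implementation would need wider fields or one of the other checksum methods mentioned above.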
- Starting at step 200, the method reads in the database and at step 202 compiles a list 204 of all the nodes (identified by node references) and their associated checksums. At step 206 the list is sorted into a descending order of checksum values. At step 208 a variable called “last_cs” is set to 0. At step 210 the next checksum on the list is read and its value assigned to the variable “current_cs”. At step 212 the values of “current_cs” and “last_cs” are compared. If they are not equal, the sub-trees below the nodes must be different and the method steps forward via step 213, where the parameter last_cs is set equal to current_cs (i.e. the checksum value of the current node), and on to step 224. However, if they are equal there is a possibility that the two sub-trees are identical. As will be described in an example later, it is not possible to be certain that they are identical and so it is necessary to perform a comparison of the sub-trees. At step 216 the node references, noderef1 and noderef2, of the nodes having equal checksums are read and at step 218 the comparison of the sub-trees is performed, as will be described below with reference to FIG. 11. If the comparison determines that the sub-trees are not identical by returning a FALSE flag at step 220, the method is stepped forward to step 224. If, at step 220, the comparison has determined that the sub-trees are identical by returning a TRUE flag, then at step 222 the arc leading into the node of noderef2 is redirected to the node of noderef1 so that the sub-tree below the node of noderef2 can be removed from the database.
- At step 224 the method determines if there are any more nodes on the list. If there are, the method loops back to step 210, but if not the method is ended at step 226.
- Referring to FIG. 11, the method for comparing the sub-trees is performed recursively. The subroutine “compare_tree” is started at step 300 to compare the sub-trees of two nodes identified at step 212 of FIG. 10 as having identical checksums, called here node1 and node2. At step 302 a comparison is made of the number of arcs branching from each of the nodes. If these are not equal, the sub-trees cannot be identical, and so the subroutine is returned with a FALSE flag at step 318. If the number of arcs is equal, then the subroutine continues at step 304 to set a variable “n” to equal the number of arcs and at step 305 initialises a counter “i” to 0. The arc names of the corresponding arcs of the two nodes are then read and compared; if they are not the same, the subroutine returns with a FALSE flag at step 318.
- Even if the arc names are the same, it is important that the arcs are only considered identical if they carry the same data tags. Therefore the data tags of the two arcs are read and, at step 316, compared; if they are not the same the subroutine immediately returns with a FALSE flag at step 318. If they are the same then the subroutine moves on to the next nodes of the two sub-trees (next_node1 and next_node2), and at step 324 the subroutine calls itself to compare the next nodes and to continue down the levels of the sub-tree in a recursive manner. If at any stage the subroutine identifies a disparity between the two sub-trees it is immediately returned with a FALSE flag. If at step 326 the subroutine has returned recursively without a FALSE flag it moves to step 328 where the counter “i” is incremented by 1. If at step 330 it is determined that the entire sub-tree has been compared without a FALSE flag (i.e. i=n), then the subroutine returns with a TRUE flag.
- The compaction method described with reference to FIGS. 10 and 11 can be applied to the example database shown in FIG. 9. To simplify the task of finding trees that are potentially equal, the checksum information from the tree is extracted into a sequential list of “checksum, pointer”. The pointer is a reference to the particular node, and provides a means of finding it again.
6296, Node0; 0918, Node1; 0853, . . .; 0322, . . .; 0233, . . .; 0157, . . .; 0078, . . .; 0383, . . .; 0310, . . .; 0226, . . .; 0147, . . .; 0069, . . .;
1303, . . .; 0689, . . .; 0605, . . .; 0540, . . .; 0464, . . .; 0388, . . .; 0309, . . .; 0229, . . .; 0157, . . .; 0078, . . .; 0466, . . .; 0388, . . .;
0309, . . .; 0229, . . .; 0157, . . .; 0078, . . .; 0629, . . .; 0234, . . .; 0156, . . .; 0072, . . .; 0233, . . .; 0157, . . .; 0078, . . .; 0943, . . .;
0864, . . .; 0788, . . .; 0699, . . .; 0313, . . .; 0234, . . .; 0152, . . .; 0072, . . .; 0229, . . .; 0157, . . .; 0078, . . .; 1014, . . .; 0930, . . .;
0861, . . .; 0779, . . .; 0710, . . .; 0631, . . .; 0229, . . .; 0157, . . .; 0078, . . .; 0238, . . .; 0149, . . .; 0069, . . .; 0380, . . .; 0308, . . .;
0226, . . .; 0147, . . .; 0069, . . .; 0553, . . .; 0464, . . .; 0388, . . .; 0309, . . .; 0229, . . .; 0157, . . .; 0078, . . .
- This list is then sorted into descending checksum value order:
6296, Node0; 1303, . . .; 1014, . . .; 0943, . . .; 0930, . . .; 0918, Node1; 0864, . . .; 0861, . . .; 0853, . . .; 0788, . . .; 0779, . . .; 0710, . . .;
0699, . . .; 0689, . . .; 0631, . . .; 0629, . . .; 0605, . . .; 0553, . . .; 0540, . . .; 0466, . . .; 0464, . . .; 0464, . . .; 0388, . . .; 0388, . . .;
0388, . . .; 0383, . . .; 0380, . . .; 0322, . . .; 0313, . . .; 0310, . . .; 0309, . . .; 0309, . . .; 0309, . . .; 0308, . . .; 0238, . . .; 0234, . . .;
0234, . . .; 0233, . . .; 0233, . . .; 0229, . . .; 0229, . . .; 0229, . . .; 0229, . . .; 0229, . . .; 0226, . . .; 0226, . . .; 0157, . . .; 0157, . . .;
0157, . . .; 0157, . . .; 0157, . . .; 0157, . . .; 0157, . . .; 0156, . . .; 0152, . . .; 0149, . . .; 0147, . . .; 0147, . . .; 0078, . . .; 0078, . . .;
0078, . . .; 0078, . . .; 0078, . . .; 0078, . . .; 0078, . . .; 0072, . . .; 0072, . . .; 0069, . . .; 0069, . . .; 0069, . . .
- The first value found that is equal for two nodes is 464. Comparing the underlying trees shows that they are equal in character sequence as well as in data tags (no data tags in this case). Consequently reassigning the arc named “Y” of node A to point to node B can cut off the second tree. The storage resources used by the tree starting at node C can now be freed, because that tree is no longer connected.
- 388 is the next value to look at. Again one tree can be reduced. Although 388 occurs in the list three times, the third occurrence had already been cut off in the previous step and can therefore be ignored.
- There are 3 occurrences of 309. However, after the above compaction only one is left and so no further action is necessary. The next value is 234. The two sub-trees have an equal checksum. On comparing the tree, it can be seen that they differ in character sequence. No reduction is therefore possible here.
- FIG. 12 illustrates this example. The keys METALLOPHON and XYLOPHON have both been categorised as music (category 0) and both end with the sequence of characters LOPHON. The nodes labelled B and C in FIG. 12 both have the checksum value 464. Comparison of the sub-trees determines that both contain identical characters and data tags, so the arc having arc name “Y” that connects the nodes labelled A and C is redirected to connect node A to node B. All the arcs that comprise the sub-tree below node C are then removed from the database.
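- The comparison and redirection of FIGS. 10 and 11 can be sketched for the METALLOPHON and XYLOPHON example as follows. This is an illustration under several assumptions rather than the procedure of the flow charts: the simple character-sum checksum is used, each node is compared against all surviving nodes with the same checksum rather than only the preceding list entry, terminal-node flags are compared as well as arc names and data tags, sinks are left unshared, and the `insert` helper places the data tag on the first arc only, as in the already optimised graph. All names are hypothetical.

```python
class Node:
    def __init__(self):
        self.arcs = {}         # character -> [data tag or None, target Node]
        self.terminal = False
        self.dead = False      # True once the node has been cut out


def insert(root, key, tag):
    node = root
    for i, ch in enumerate(key):
        if ch not in node.arcs:
            node.arcs[ch] = [tag if i == 0 else None, Node()]
        node = node.arcs[ch][1]
    node.terminal = True


def collect(node, records, parent_arc=None):
    """Post-order walk recording (checksum, node, arc leading into node)."""
    total = 0
    for ch in sorted(node.arcs):
        arc = node.arcs[ch]
        total += ord(ch) + collect(arc[1], records, arc)
    records.append((total, node, parent_arc))
    return total


def compare_tree(n1, n2):
    if len(n1.arcs) != len(n2.arcs) or n1.terminal != n2.terminal:
        return False
    for ch, (tag1, next1) in n1.arcs.items():
        arc2 = n2.arcs.get(ch)
        if arc2 is None or arc2[0] != tag1:       # arc names, then data tags
            return False
        if not compare_tree(next1, arc2[1]):      # recurse one level down
            return False
    return True


def kill(node):
    node.dead = True
    for _tag, target in node.arcs.values():
        kill(target)


def compact(root):
    records = []
    collect(root, records)
    records.sort(key=lambda r: r[0], reverse=True)  # largest sub-trees first
    kept = {}                                       # checksum -> surviving nodes
    for cs, node, parent_arc in records:
        if cs == 0 or node.dead or parent_arc is None:
            continue                                # sinks, cut nodes, root
        for other in kept.setdefault(cs, []):
            if compare_tree(other, node):
                parent_arc[1] = other               # redirect the arc
                kill(node)                          # old sub-tree is unreachable
                break
        else:
            kept[cs].append(node)


def count_nodes(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(target for _tag, target in node.arcs.values())
    return len(seen)


def contains(root, key):
    node = root
    for ch in key:
        if ch not in node.arcs:
            return False
        node = node.arcs[ch][1]
    return node.terminal


root = Node()
insert(root, "METALLOPHON", 0)
insert(root, "XYLOPHON", 0)
before = count_nodes(root)
compact(root)
after = count_nodes(root)
```

In this two-key example the graph shrinks from 20 reachable nodes to 13: the arc “Y” is redirected to the node reached after “METAL”, and both keys remain retrievable. Marking the cut sub-tree as dead is one way of ensuring that later list entries referring to removed nodes are ignored.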
- FIG. 13 shows similar compaction of the example database for other nodes having equal checksum values. The sub-trees shown in boxes with a shaded background are those that are being removed from the database.
- FIG. 14a presents an example of two nodes having equal checksum values, but which are not identical. The character values of both the sub-trees “NTH” and “RPH” produce checksums totalling 234 (see the nodes in the keys “NINTH” and “POLYMORPH” in FIG. 13). However, comparison of the individual characters soon indicates that they are not identical and causes the comparison subroutine of FIG. 11 to return a FALSE flag.
- The more information that can be provided in the form of a weight value for each node, the more efficient the process of identifying equivalent sub-trees.
- It might appear that further compaction of the data set is possible by combining groups of identical characters or character strings that occur in keys. FIG. 14b illustrates an example of two keys BARITONE and MARITAL. Both contain the same string of characters “ARIT”. However, the subroutine would not identify equal checksums and so compaction of the database to produce the sub-tree illustrated in FIG. 14b would not occur. This is important because compaction in this way would give rise to the possibility of keys not in the original data set being present in the final compacted database. In the example, the keys “MARITONE” and “BARITAL” would be present in the compacted tree of FIG. 14b, even though they were not part of the original data set.
- FIG. 15 illustrates further examples of compacting of the example data set at lower levels (i.e. at nodes having lower checksum values). Again, the sub-trees shown in boxes with a shaded background are those that are being removed from the database. It should be noted that the most efficient method of compacting the database is to start by comparing the highest checksum values first so as to remove the largest equivalent sub-trees from the database first, and then proceed by comparing progressively smaller sub-trees having equal checksum values.
- FIG. 16 illustrates the example database in its final compacted form, with all the redundant arcs removed, and as such represents an embodiment of both of the first and second aspects of the present invention. All the original keys from the data set of FIG. 3 are present together with their associated data tags. Some of the nodes in FIG. 16 have numbers appearing in the circles that represent the nodes. These are not checksum values, but are node labels which will be used to describe the format in which the data is stored with reference to FIG. 17.
- Having optimised and compacted the database, the data itself must be stored. As previously described, the data may be stored electronically in the format of a one-dimensional binary encoded bit stream. A node is stored as its set of arcs. For the purpose of fast searching, the arcs are stored in ascending order of their character value.
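- By way of illustration, one arc record could be packed into such a bit stream as sketched below, using the field widths given for FIG. 17 in the following paragraph (an 8-bit character, a data tag flag, a terminal node flag, a 30-bit pointer and an optional 16-bit data tag). The most-significant-bit-first ordering, the function names and the pointer values used in the example are assumptions made for this sketch.

```python
POINTER_BITS = 30
TAG_BITS = 16


def pack_arc(char, pointer, terminal=False, tag=None):
    """Pack one arc into 5 bytes, or 7 bytes when a data tag is present."""
    if not 0 <= ord(char) < 2 ** 8:
        raise ValueError("character must fit in 8 bits")
    if not 0 <= pointer < 2 ** POINTER_BITS:
        raise ValueError("pointer must fit in 30 bits")
    head = ord(char)                              # bits 0 to 7: character
    head = (head << 1) | (tag is not None)        # bit 8: data tag flag
    head = (head << 1) | bool(terminal)           # bit 9: terminal node flag
    head = (head << POINTER_BITS) | pointer       # bits 10 to 39: pointer
    data = head.to_bytes(5, "big")
    if tag is not None:
        if not 0 <= tag < 2 ** TAG_BITS:
            raise ValueError("data tag must fit in 16 bits")
        data += tag.to_bytes(2, "big")            # bits 40 to 55: data tag
    return data


def unpack_arc(data, offset=0):
    """Return (char, pointer, terminal, tag, size in bytes) of one arc."""
    head = int.from_bytes(data[offset:offset + 5], "big")
    char = chr(head >> 32)
    has_tag = (head >> 31) & 1
    terminal = bool((head >> 30) & 1)
    pointer = head & (2 ** POINTER_BITS - 1)
    if has_tag:
        tag = int.from_bytes(data[offset + 5:offset + 7], "big")
        return char, pointer, terminal, tag, 7
    return char, pointer, terminal, None, 5


# Arc 'B' of the root node: carries data tag 2, does not lead to a terminal node.
arc_b = pack_arc("B", pointer=7, tag=2)
# Arc 'A' of Node 1: no data tag, not terminal.
arc_a = pack_arc("A", pointer=12)
```

An arc without a data tag therefore occupies 40 bits and an arc with a data tag occupies 56 bits, so omitting superfluous tags shortens the stream directly.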
- FIG. 17 is a representation of a bit stream. The top line 400 in FIG. 17 comprises 56 bits which are used to store the data associated with a single arc. The first 8 bits are the character itself as represented by its ASCII value. The ninth bit is a data tag flag. If this bit is a 1 it indicates that a data tag is also stored with the arc, but if it is a 0 there is no data tag. The tenth bit is another flag which indicates whether or not the arc leads to a terminal node. The next 30 bits (bits 10 to 39) contain pointer information in the form of an address to the location in the database of the first arc of the next node. The last 16 bits (bits 40 to 55) contain the data tag, if the data tag flag indicates its presence. Otherwise these bits are not present.
- The next two lines 402 illustrate the data for a set of nodes corresponding to the nodes labelled 0 to 11 in FIG. 16. The bits that comprise the arc data in line 400 are shown compressed into the four fields: character, flags, pointer and data tag. Node 0 is the root node in FIG. 16 and has arcs with the characters B, M, N, P, S, T and X. The arc representing the character ‘B’ carries a data flag (shown here as a ‘Y’ for ‘Yes’) indicating the presence of a data tag, but no flag (shown here as ‘N’ for ‘No’) indicating the presence of a terminal node; the pointer data for this arc points to the first arc (‘A’) of Node 1; and the arc carries the data tag ‘2’. Similar data is contained in the fields representing the other arcs of Node 0, all of which carry data tags, but none of which lead to a terminal node. Finally, the character ‘&’ is used to represent the termination of data for the node.
- Node 1 has only a single arc representing the character ‘A’. This arc carries no data tag and does not lead to a terminal node, so the flags are both shown as N.
- Similar data appears in the data stream for all the other nodes. Note, however, that both Node 6 and Node 11 have arcs that lead to terminal nodes and carry the flag ‘Y’.
- FIG. 18 illustrates a procedure for rapidly searching the database to find a key and return its associated data tag. At step 500 a query is read in the form of the key to be sought. At step 502 an indexing counter “i” is set to 0 and at step 504 the search is started at the root node of the data structure. As yet no data tags have been read and so at step 506 a parameter result_tag is set to a null value. At step 508 the parameter arc_name is set to the next character of the key, key[i]. At step 510 the procedure determines whether the arc_name exists for the current node. If the arc_name does not exist, the procedure steps directly to step 524 to return a null value. This means that the key is not to be found in the database.
- If the arc_name of key[i] does exist, at step 512 the procedure determines if a data tag is stored with the arc. If there is a data tag, the parameter result_tag is set to the value of the data tag at step 514. At step 516 the procedure moves on to the next node and at step 518 the indexing counter “i” is incremented by 1. At step 520 the procedure determines whether the counter “i” is less than the length of the key (i.e. the number of characters in the key), and if it is the procedure returns to step 508 to look for the next character in the key. If the last character in the key has been reached (i = key length), the procedure determines at step 522 whether the current node is a terminal node by reading the terminal node flag associated with the arc (see the data stream representation of FIG. 17). If the current node is not a terminal node, the key is not in the database and the procedure moves directly to step 524 to return a null data tag value. If the node is a terminal node then the procedure moves to step 526 to return the result_tag value, which is the data tag associated with the key, and confirms the presence of the key in the database.
Claims (56)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/128,669 US7809758B2 (en) | 2001-07-20 | 2005-05-13 | Database and method of generating same |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GBGB0117721.1A GB0117721D0 (en) | 2001-07-20 | 2001-07-20 | Database and method of generating same |
GB0117721.1 | 2001-07-20 |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/128,669 Continuation US7809758B2 (en) | 2001-07-20 | 2005-05-13 | Database and method of generating same |
Publications (1)
Publication Number | Publication Date |
---|---|
US20030088577A1 true US20030088577A1 (en) | 2003-05-08 |
Family
ID=9918880
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/195,847 Abandoned US20030088577A1 (en) | 2001-07-20 | 2002-07-11 | Database and method of generating same |
US11/128,669 Expired - Lifetime US7809758B2 (en) | 2001-07-20 | 2005-05-13 | Database and method of generating same |
Family Applications After (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/128,669 Expired - Lifetime US7809758B2 (en) | 2001-07-20 | 2005-05-13 | Database and method of generating same |
Country Status (3)
Country | Link |
---|---|
US (2) | US20030088577A1 (en) |
EP (1) | EP1278136A3 (en) |
GB (1) | GB0117721D0 (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050267902A1 (en) * | 2001-07-20 | 2005-12-01 | Surfcontrol Plc | Database and method of generating same |
US20060253584A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Reputation of an entity associated with a content item |
US20060253458A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Determining website reputations using automatic testing |
US20060253582A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Indicating website reputations within search results |
US20060253578A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Indicating website reputations during user interactions |
US20060253579A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Indicating website reputations during an electronic commerce transaction |
US20060253580A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Website reputation product architecture |
US20060253581A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Indicating website reputations during website manipulation of user information |
WO2008090373A1 (en) | 2007-01-22 | 2008-07-31 | Websense Uk Limited | Resource access filtering system and database structure for use therewith |
US20080219278A1 (en) * | 2007-03-06 | 2008-09-11 | International Business Machines Corporation | Method for finding shared sub-structures within multiple hierarchies |
US20090049035A1 (en) * | 2007-08-14 | 2009-02-19 | International Business Machines Corporation | System and method for indexing type-annotated web documents |
US20100309933A1 (en) * | 2009-06-03 | 2010-12-09 | Rebelvox Llc | Method for synchronizing data maintained at a plurality of nodes |
CN102073722A (en) * | 2011-01-11 | 2011-05-25 | 吕晓东 | URL (Uniform Resource Locator) cloud publishing system |
US20110314028A1 (en) * | 2010-06-18 | 2011-12-22 | Microsoft Corporation | Presenting display characteristics of hierarchical data structures |
US8244817B2 (en) | 2007-05-18 | 2012-08-14 | Websense U.K. Limited | Method and apparatus for electronic mail filtering |
US8566726B2 (en) | 2005-05-03 | 2013-10-22 | Mcafee, Inc. | Indicating website reputations based on website handling of personal information |
US8701196B2 (en) | 2006-03-31 | 2014-04-15 | Mcafee, Inc. | System, method and computer program product for obtaining a reputation associated with a file |
US9378282B2 (en) | 2008-06-30 | 2016-06-28 | Raytheon Company | System and method for dynamic and real-time categorization of webpages |
CN111309851A (en) * | 2020-02-13 | 2020-06-19 | 北京金山安全软件有限公司 | Entity word storage method and device and electronic equipment |
CN113365270A (en) * | 2021-06-15 | 2021-09-07 | 王云森 | RFID multi-label joint authentication system and method based on application of Internet of things |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2416879B (en) | 2004-08-07 | 2007-04-04 | Surfcontrol Plc | Device resource access filtering system and method |
GB2418037B (en) | 2004-09-09 | 2007-02-28 | Surfcontrol Plc | System, method and apparatus for use in monitoring or controlling internet access |
CA2577277A1 (en) * | 2004-09-09 | 2006-03-16 | Surfcontrol Plc | System, method and apparatus for use in monitoring or controlling internet access |
GB2418108B (en) | 2004-09-09 | 2007-06-27 | Surfcontrol Plc | System, method and apparatus for use in monitoring or controlling internet access |
GB2418999A (en) | 2004-09-09 | 2006-04-12 | Surfcontrol Plc | Categorizing uniform resource locators |
US20080015887A1 (en) * | 2006-07-14 | 2008-01-17 | Aegon Direct Marketing Services, Inc. | System and process for enrollment and sales |
US8799250B1 (en) * | 2007-03-26 | 2014-08-05 | Amazon Technologies, Inc. | Enhanced search with user suggested search information |
US9348884B2 (en) * | 2008-05-28 | 2016-05-24 | International Business Machines Corporation | Methods and apparatus for reuse optimization of a data storage process using an ordered structure |
US8935129B1 (en) * | 2010-06-04 | 2015-01-13 | Bentley Systems, Incorporated | System and method for simplifying a graph'S topology and persevering the graph'S semantics |
CN102279856B (en) | 2010-06-09 | 2013-10-02 | 阿里巴巴集团控股有限公司 | Method and system for realizing website navigation |
CA2706743A1 (en) * | 2010-06-30 | 2010-09-08 | Ibm Canada Limited - Ibm Canada Limitee | Dom based page uniqueness indentification |
US8863084B2 (en) * | 2011-10-28 | 2014-10-14 | Google Inc. | Methods, apparatuses, and computer-readable media for computing checksums for effective caching in continuous distributed builds |
CN103218719B (en) | 2012-01-19 | 2016-12-07 | 阿里巴巴集团控股有限公司 | A kind of e-commerce website air navigation aid and system |
US10318703B2 (en) | 2016-01-19 | 2019-06-11 | Ford Motor Company | Maximally standard automatic completion using a multi-valued decision diagram |
US10621032B2 (en) * | 2017-06-22 | 2020-04-14 | Uber Technologies, Inc. | Checksum tree generation for improved data accuracy verification |
IT201800002790A1 (en) * | 2018-02-19 | 2019-08-19 | Siae – Soc It Degli Autori Ed Editori | Method and system for detecting anomalous structures in oriented graphs. |
US10872090B2 (en) * | 2018-09-18 | 2020-12-22 | Mastercard International Incorporated | Generating test data based on data value rules of linked data nodes |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6073135A (en) * | 1998-03-10 | 2000-06-06 | Alta Vista Company | Connectivity server for locating linkage information between Web pages |
US6219786B1 (en) * | 1998-09-09 | 2001-04-17 | Surfcontrol, Inc. | Method and system for monitoring and controlling network access |
US6675169B1 (en) * | 1999-09-07 | 2004-01-06 | Microsoft Corporation | Method and system for attaching information to words of a trie |
Family Cites Families (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5758152A (en) * | 1990-12-06 | 1998-05-26 | Prime Arithmetics, Inc. | Method and apparatus for the generation and manipulation of data structures |
JP3397431B2 (en) * | 1994-03-16 | 2003-04-14 | 富士通株式会社 | Data compression method and device and data decompression method and device |
US5774668A (en) | 1995-06-07 | 1998-06-30 | Microsoft Corporation | System for on-line service in which gateway computer uses service map which includes loading condition of servers broadcasted by application servers for load balancing |
US5712979A (en) * | 1995-09-20 | 1998-01-27 | Infonautics Corporation | Method and apparatus for attaching navigational history information to universal resource locator links on a world wide web page |
US5884325A (en) | 1996-10-09 | 1999-03-16 | Oracle Corporation | System for synchronizing shared data between computers |
US6065059A (en) | 1996-12-10 | 2000-05-16 | International Business Machines Corporation | Filtered utilization of internet data transfers to reduce delay and increase user control |
US5896502A (en) | 1996-12-10 | 1999-04-20 | International Business Machines Corporation | Internet data transfer control via a client system to reduce delay |
US6862602B2 (en) | 1997-03-07 | 2005-03-01 | Apple Computer, Inc. | System and method for rapidly identifying the existence and location of an item in a file |
US6058389A (en) | 1997-10-31 | 2000-05-02 | Oracle Corporation | Apparatus and method for message queuing in a database system |
US6742003B2 (en) * | 2001-04-30 | 2004-05-25 | Microsoft Corporation | Apparatus and accompanying methods for visualizing clusters of data and hierarchical cluster classifications |
US6519571B1 (en) | 1999-05-27 | 2003-02-11 | Accenture Llp | Dynamic customer profile management |
US20020049883A1 (en) | 1999-11-29 | 2002-04-25 | Eric Schneider | System and method for restoring a computer system after a failure |
US7159237B2 (en) | 2000-03-16 | 2007-01-02 | Counterpane Internet Security, Inc. | Method and system for dynamic network intrusion monitoring, detection and response |
US7418440B2 (en) | 2000-04-13 | 2008-08-26 | Ql2 Software, Inc. | Method and system for extraction and organizing selected data from sources on a network |
US6571249B1 (en) * | 2000-09-27 | 2003-05-27 | Siemens Aktiengesellschaft | Management of query result complexity in hierarchical query result data structure using balanced space cubes |
US20020073089A1 (en) | 2000-09-29 | 2002-06-13 | Andrew Schwartz | Method and system for creating and managing relational data over the internet |
US7209893B2 (en) | 2000-11-30 | 2007-04-24 | Nokia Corporation | Method of and a system for distributing electronic content |
US6782388B2 (en) | 2000-12-29 | 2004-08-24 | Bellsouth Intellectual Property Corporation | Error usage investigation and disposal system |
US7058663B2 (en) | 2001-03-13 | 2006-06-06 | Koninklijke Philips Electronics, N.V. | Automatic data update |
US7114184B2 (en) | 2001-03-30 | 2006-09-26 | Computer Associates Think, Inc. | System and method for restoring computer systems damaged by a malicious computer program |
US7188368B2 (en) | 2001-05-25 | 2007-03-06 | Lenovo (Singapore) Pte. Ltd. | Method and apparatus for repairing damage to a computer system using a system rollback mechanism |
US6741997B1 (en) | 2001-06-14 | 2004-05-25 | Oracle International Corporation | Instantiating objects in distributed database systems |
EP1410258A4 (en) | 2001-06-22 | 2007-07-11 | Inc Nervana | System and method for knowledge retrieval, management, delivery and presentation |
GB0117721D0 (en) | 2001-07-20 | 2001-09-12 | Surfcontrol Plc | Database and method of generating same |
US6947985B2 (en) | 2001-12-05 | 2005-09-20 | Websense, Inc. | Filtering techniques for managing access to internet sites or other software applications |
US20030126139A1 (en) | 2001-12-28 | 2003-07-03 | Lee Timothy A. | System and method for loading commercial web sites |
US7136867B1 (en) * | 2002-04-08 | 2006-11-14 | Oracle International Corporation | Metadata format for hierarchical data storage on a raw storage device |
US7379978B2 (en) | 2002-07-19 | 2008-05-27 | Fiserv Incorporated | Electronic item management and archival system and method of operating the same |
US20040068479A1 (en) | 2002-10-04 | 2004-04-08 | International Business Machines Corporation | Exploiting asynchronous access to database operations |
AU2003294245A1 (en) | 2002-11-08 | 2004-06-03 | Dun And Bradstreet, Inc. | System and method for searching and matching databases |
US7529754B2 (en) | 2003-03-14 | 2009-05-05 | Websense, Inc. | System and method of monitoring and controlling application files |
US7185015B2 (en) | 2003-03-14 | 2007-02-27 | Websense, Inc. | System and method of monitoring and controlling application files |
JP4218451B2 (en) | 2003-08-05 | 2009-02-04 | 株式会社日立製作所 | License management system, server device and terminal device |
- 2001-07-20 GB GBGB0117721.1A patent/GB0117721D0/en not_active Ceased
- 2002-07-11 US US10/195,847 patent/US20030088577A1/en not_active Abandoned
- 2002-07-18 EP EP02255055A patent/EP1278136A3/en not_active Withdrawn
- 2005-05-13 US US11/128,669 patent/US7809758B2/en not_active Expired - Lifetime
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050267902A1 (en) * | 2001-07-20 | 2005-12-01 | Surfcontrol Plc | Database and method of generating same |
US7809758B2 (en) | 2001-07-20 | 2010-10-05 | Websense Uk Limited | Database and method of generating same |
US9384345B2 (en) | 2005-05-03 | 2016-07-05 | Mcafee, Inc. | Providing alternative web content based on website reputation assessment |
US8826155B2 (en) | 2005-05-03 | 2014-09-02 | Mcafee, Inc. | System, method, and computer program product for presenting an indicia of risk reflecting an analysis associated with search results within a graphical user interface |
US20060253578A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Indicating website reputations during user interactions |
US20060253579A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Indicating website reputations during an electronic commerce transaction |
US20060253580A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Website reputation product architecture |
US20060253581A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Indicating website reputations during website manipulation of user information |
US20080109473A1 (en) * | 2005-05-03 | 2008-05-08 | Dixon Christopher J | System, method, and computer program product for presenting an indicia of risk reflecting an analysis associated with search results within a graphical user interface |
US20060253458A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Determining website reputations using automatic testing |
US8429545B2 (en) | 2005-05-03 | 2013-04-23 | Mcafee, Inc. | System, method, and computer program product for presenting an indicia of risk reflecting an analysis associated with search results within a graphical user interface |
US20060253582A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Indicating website reputations within search results |
US7562304B2 (en) | 2005-05-03 | 2009-07-14 | Mcafee, Inc. | Indicating website reputations during website manipulation of user information |
US7765481B2 (en) | 2005-05-03 | 2010-07-27 | Mcafee, Inc. | Indicating website reputations during an electronic commerce transaction |
US20060253584A1 (en) * | 2005-05-03 | 2006-11-09 | Dixon Christopher J | Reputation of an entity associated with a content item |
US7822620B2 (en) | 2005-05-03 | 2010-10-26 | Mcafee, Inc. | Determining website reputations using automatic testing |
US8826154B2 (en) | 2005-05-03 | 2014-09-02 | Mcafee, Inc. | System, method, and computer program product for presenting an indicia of risk associated with search results within a graphical user interface |
US8296664B2 (en) | 2005-05-03 | 2012-10-23 | Mcafee, Inc. | System, method, and computer program product for presenting an indicia of risk associated with search results within a graphical user interface |
US8566726B2 (en) | 2005-05-03 | 2013-10-22 | Mcafee, Inc. | Indicating website reputations based on website handling of personal information |
US8516377B2 (en) | 2005-05-03 | 2013-08-20 | Mcafee, Inc. | Indicating Website reputations during Website manipulation of user information |
US8438499B2 (en) | 2005-05-03 | 2013-05-07 | Mcafee, Inc. | Indicating website reputations during user interactions |
US8701196B2 (en) | 2006-03-31 | 2014-04-15 | Mcafee, Inc. | System, method and computer program product for obtaining a reputation associated with a file |
WO2008090373A1 (en) | 2007-01-22 | 2008-07-31 | Websense Uk Limited | Resource access filtering system and database structure for use therewith |
US8250081B2 (en) | 2007-01-22 | 2012-08-21 | Websense U.K. Limited | Resource access filtering system and database structure for use therewith |
US20080219278A1 (en) * | 2007-03-06 | 2008-09-11 | International Business Machines Corporation | Method for finding shared sub-structures within multiple hierarchies |
US8244817B2 (en) | 2007-05-18 | 2012-08-14 | Websense U.K. Limited | Method and apparatus for electronic mail filtering |
US9473439B2 (en) | 2007-05-18 | 2016-10-18 | Forcepoint Uk Limited | Method and apparatus for electronic mail filtering |
US8799388B2 (en) | 2007-05-18 | 2014-08-05 | Websense U.K. Limited | Method and apparatus for electronic mail filtering |
US20090049035A1 (en) * | 2007-08-14 | 2009-02-19 | International Business Machines Corporation | System and method for indexing type-annotated web documents |
US9378282B2 (en) | 2008-06-30 | 2016-06-28 | Raytheon Company | System and method for dynamic and real-time categorization of webpages |
US20100309933A1 (en) * | 2009-06-03 | 2010-12-09 | Rebelvox Llc | Method for synchronizing data maintained at a plurality of nodes |
US8345707B2 (en) * | 2009-06-03 | 2013-01-01 | Voxer Ip Llc | Method for synchronizing data maintained at a plurality of nodes |
US20110314028A1 (en) * | 2010-06-18 | 2011-12-22 | Microsoft Corporation | Presenting display characteristics of hierarchical data structures |
CN102073722A (en) * | 2011-01-11 | 2011-05-25 | 吕晓东 | URL (Uniform Resource Locator) cloud publishing system |
CN111309851A (en) * | 2020-02-13 | 2020-06-19 | 北京金山安全软件有限公司 | Entity word storage method and device and electronic equipment |
CN113365270A (en) * | 2021-06-15 | 2021-09-07 | 王云森 | RFID multi-label joint authentication system and method based on application of Internet of things |
Also Published As
Publication number | Publication date |
---|---|
GB0117721D0 (en) | 2001-09-12 |
US20050267902A1 (en) | 2005-12-01 |
EP1278136A3 (en) | 2004-08-18 |
US7809758B2 (en) | 2010-10-05 |
EP1278136A2 (en) | 2003-01-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7809758B2 (en) | Database and method of generating same | |
JP3771271B2 (en) | Apparatus and method for storing and retrieving ordered collections of keys in a compact zero complete tree | |
CN102142038B (en) | Multi-stage query processing system and method for use with tokenspace repository | |
KR100798609B1 (en) | Data sort method, data sort apparatus, and storage medium storing data sort program | |
US8335779B2 (en) | Method and apparatus for gathering, categorizing and parameterizing data | |
US20090063538A1 (en) | Method for normalizing dynamic urls of web pages through hierarchical organization of urls from a web site | |
JPH06103497B2 (en) | Record search method and database system | |
CN112579155A (en) | Code similarity detection method and device and storage medium | |
EP0435476A2 (en) | Database system | |
CN101310277B (en) | Method of obtaining a representation of a text and system | |
CN111562920A (en) | Method and device for determining similarity of small program codes, server and storage medium | |
CN110795526A (en) | Mathematical formula index creating method and system for retrieval system | |
Kashyap et al. | Analysis of the multiple-attribute-tree data-base organization | |
JPH09245043A (en) | Information retrieval device | |
JP3258063B2 (en) | Database search system and method | |
CN115130043B (en) | Database-based data processing method, device, equipment and storage medium | |
US20060101045A1 (en) | Methods and apparatus for interval query indexing | |
CN111858601A (en) | Tree structure data query method, device, equipment and storage medium | |
JP2020135530A (en) | Data management device, data search method and program | |
JP2000090115A (en) | Index generating method and retrieval method | |
JPH10240741A (en) | Managing method for tree structure type data | |
KR100434718B1 (en) | Method and system for indexing document | |
JP2000250930A (en) | Structured document retrieval system | |
JP2001195400A (en) | Method and device for structuralizing document context | |
JP2002183142A (en) | Device and method for storing data, and computer- readable recording medium with recorded program making computer to implement data storing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: SURFCONTROL PLC, ENGLAND Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:STOIBER, HARALD;REEL/FRAME:013266/0961 Effective date: 20020808 Owner name: SURFCONTROL PLC, UNITED KINGDOM Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:THURNHOFER, KLAUS;REEL/FRAME:013266/0964 Effective date: 20020726 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
|
AS | Assignment |
Owner name: WEBSENSE UK LIMITED, UNITED KINGDOM Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SURFCONTROL LIMITED (FORMERLY NAMED SURFCONTROL PLC);REEL/FRAME:020675/0792 Effective date: 20080131 |