WO2002008953A1 - A method and apparatus for dynamically clustering complex profile information - Google Patents

A method and apparatus for dynamically clustering complex profile information

Info

Publication number
WO2002008953A1
WO2002008953A1 PCT/NL2000/000516 NL0000516W WO0208953A1 WO 2002008953 A1 WO2002008953 A1 WO 2002008953A1 NL 0000516 W NL0000516 W NL 0000516W WO 0208953 A1 WO0208953 A1 WO 0208953A1
Authority
WO
WIPO (PCT)
Prior art keywords
keys
pattern
general
cluster
profiles
Application number
PCT/NL2000/000516
Other languages
French (fr)
Inventor
Odin Taylor
Nicholas John Schmansky
Ihor Theodore Kuz
Joost Van De Wijgerd
Original Assignee
Smarthaven B.V.
Application filed by Smarthaven B.V.
Priority to AU2000261903A
Priority to PCT/NL2000/000516
Publication of WO2002008953A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification

Definitions

  • the present invention relates to an improved method and/or system for clustering information items expressed as features or profiles.
  • the Internet can be viewed as a gargantuan store of information, and finding relevant information is analogous to the needle-in-a-haystack problem.
  • This problem is most easily seen in the domain of search sites that exist to enable Internet users to input keywords and search for information.
  • most of the information that is located by the various search engines is not what the user requires.
  • Many other domains of information selection, such as television show or product selection, suffer the same problem.
  • the problem reduces to the fact that a person seeking information is a complex and unique individual, and the information they seek is just as complex and unique. There exists a huge void or bottleneck between the user and the information, which must be traversed to bring the two together in an informative manner.
  • the present invention relates to an improved method and/or system for clustering profiles of objects, for the purpose of improved search and information discovery, as defined by the claims.
  • the present invention relates to a clustering method and apparatus that provide a quick way to find similar profiles of objects. Keys are extracted from a profile and collected into a general pattern. A set of general patterns is formed from a number of profiles and used in a clustering process. After similar profiles have been located, specific keys that could relate to the query are then extracted from the profiles. A specific query pattern is then formed by this extraction of new keys from the profiles that have been identified as similar.
  • a query is embodiment dependent (such as a user entering keywords into a search engine, or an information processing system querying an information store).
  • This method is applicable because the number of object profiles will be large and the profiles themselves may be large (in terms of the number of keys that a profile contains).
  • the improvement offered by this invention over other clustering methods is the ability to cluster a dynamically changing number of profiles, the ability to cluster into groups without having to specify groups a priori (that is, it works in an 'unsupervised' manner), and the ability to perform multiple levels of clustering.
  • the process described in this invention may be calculated in software or specialized hardware.
  • Figure 1 depicts the definition of entities, objects, profiles, keys and patterns.
  • Figure 2 illustrates the sequence of events composing the main embodiment of the invention: the clustering engine.
  • Figure 3 depicts a sample of a pattern that is constructed from n keys
  • Figure 4 is a table that represents a description of the trace information that a neuron contains.
  • Figure 5 illustrates the logic flow for adding trace information to a neuron
  • Figure 6 details how a Gaussian curve is fitted over a neuron and used for thresholding and generating confidence value information
  • Figure 7 illustrates Level One clustering
  • Figure 8 illustrates Level Three clustering
  • Figure 9 illustrates Level Two clustering
  • Figure 10 shows the logic flow from the first level of clustering through to the second level and how the third level of clustering is applied when needed.
  • Figure 11 shows a computer arrangement by means of which the invention can be implemented.
  • an object (101) describes an entity (100) (which can be anything from a user profile, a physical item, a concept, a document to a monitored process) by the use of a profile.
  • a profile (102) consists of an n dimensional array of keys, the number of which could vary from profile to profile.
  • a key (103) consists of a label and a numerical weight value (not shown in figure 1) that describes the importance of the key to the current object.
  • the present invention is a clustering method that provides a quick way to find similar profiles of objects. Keys are extracted from a profile (102) and collected into a general pattern (104). A set of general patterns is formed from a number of profiles and used in a clustering process. After similar profiles have been located, specific keys that could relate to the query are then extracted from the profiles. A specific query pattern is then formed by this extraction of new keys from the profiles that have been identified as similar. These specific query patterns are then re-clustered dynamically to search for a match to the initial query, where a query is embodiment dependent (such as a user entering keywords into a search engine, or an information processing system querying an information store). This method is applicable because the number of object profiles will be large and the profiles themselves may be large (in terms of the number of keys that a profile contains). This will be explained in detail below.
  • Neural networks in general have the advantage that they can learn by example the general characteristics of a data set, and from that classify new, previously unseen data items. Neural networks have the advantage over statistical techniques in that they can operate with missing values, are simple to implement, can work with a "noisy" input, can handle large data sets and are very robust. Neural network models fall into two general categories: that of supervised and unsupervised methods. Using a supervised method (as used by networks such as the multi-layer perceptron) normally indicates that there is some knowledge of the data set and the classes contained within it, which is not suitable for this type of application (as it is not known a priori which cluster or group a profile should belong to).
  • unsupervised methods such as the Kohonen SOM (Self Organizing Map) can be used to learn the characteristics of high dimensional data sets (Murtagh and Hernandez, 1995; Siores and Seeto, 1995) and can aid in visualization [Mori et al, 1993; Muknahallipatana et al, 1996; Kohonen et al, 1996].
  • This ability to learn the characteristics of unknown data makes it well suited to profile information, where the incoming data is of a high dimensionality and complexity.
  • the Kohonen SOM is limited with respect to the requirements of this invention, as its parameters have to be assigned before the network is trained, which presupposes some prior information about the clusters contained within the data.
  • the network cannot be retrained once a training session has ended, which is not useful in a dynamic environment where profiles adaptively change and increase (or decrease) in number.
  • a dynamic, growing neural network architecture is used. Dynamic architectures are extremely powerful as data distributions can be modelled accurately without "dead" areas of zero probability density being included in the model. Units ending up in areas of zero probability density are a side effect of training algorithms, which could be harmful to the overall network performance and also limit the resulting mappings [Fritzke, 1993]. The number of units in a dynamic network architecture also increases and decreases during training, removing the need to specify the network architecture a priori [Fritzke, 1996]. An advantage of this is that neuron units do not all have to be initialized with a random value, which could result in the network producing different results every time it is trained.
  • the algorithm selected for usage within the present invention is the Growing Neural Gas (GNG) network as proposed by Fritzke [Fritzke and Wilke, 1991; Fritzke, 1991; Fritzke, 1993; Fritzke, 1994a; Fritzke, 1994b; Fritzke, 1995; Fritzke, 1995b; Fritzke, 1996].
  • GNG Growing Neural Gas
  • the clustering process (in general) is illustrated in Figure 2, with the patterns being extracted (202) from a store of overall profiles (201). Every training cycle a random pattern is chosen from the overall set (203) and used for training of the neural network (204); it then becomes unavailable for selection again until every other pattern in the set has been selected, ensuring that each pattern is presented equally.
  • after a certain number of epochs (an epoch equals one full presentation of the pattern set) has elapsed, or some other training criteria have been reached, training finishes (205).
  • the resulting network can then be saved in memory or used immediately for querying, although it should be noted that at the end of every training cycle (presentation of one pattern) the network could also be queried (206).
  • one, two, or three levels may be employed in clustering profile information.
  • the first level uses a general set of keys that are extracted from the profile and are formed into a pattern.
  • the pattern forms a general overview of the information contained within the profile, and its keys are common to all profiles. These patterns are used to perform a first-level clustering of the profiles.
  • the second level is the extraction of query specific keys from the profile, forming a subset of query related patterns, and clustering on those patterns with the aim of finding matches to the incoming query pattern.
  • a third level of clustering is also employed if no matching information is found in the first two cluster methods. In this third level, an alternate cluster, which matches the query specific information, is identified.
  • the first level of operation is the clustering of the n dimensional general keys.
  • Figure 3 shows a profile having n keys key₁, key₂, ..., keyₙ.
  • Clustering will be performed on this general key "string" (pattern) to identify similar profiles with similar interests.
  • Interest value or weight value
  • the clustering process in the present invention can run continuously, with updates at a specified time interval.
  • New profiles can be associated with one or more of these general groups upon initial querying, without a major re-clustering of the network, due to the non-decaying parameters of the neural network architecture.
  • removal also occurs in the same manner, with minimal disturbance to the clustering process.
  • the n dimensional general pattern as shown in Fig. 3 is clustered with the general pattern from every other profile in the global set of profiles (which can change dynamically as profiles are added, removed or change).
  • the clustering process in this invention is carried out by the unsupervised Growing Neural Gas (GNG) architecture, with queries being processed during training or after a certain time period of constant training has elapsed.
  • GNG unsupervised Growing Neural Gas
  • the GNG neural network learns the underlying structure of the data (data in this term refers to profile information) contained within a data set.
  • the learnt representation can then be queried with a new input.
  • the query returns a "distance" that represents how "similar" the query is to the "represented" knowledge that the neural network has learned (from observation of the data).
  • This form of representation is fine for some standard types of neural network applications. However, in the case of this invention, additional knowledge is needed to capture information about which patterns affected the position of the network's neurons.
  • each neuron in this invention is given a "trace" facility that allows each pattern that affects the neuron's position to be recorded.
  • the trace list shown in Fig. 4 records the identification numbers of the patterns (first column) that have had an effect on the neuron's position, the number of hits (second column) the neuron has received from a certain pattern, and how old each trace (third column) is.
  • the neuron's trace list (Figure 4) is checked to see if the pattern identification number (501) is in its trace (502). If the unique pattern identification number is not in the trace list and the list is not full (503), then it is added to the end of the trace list (504), and the number of hits for that pattern identification number is set to zero in the table of Fig. 4, as is the age parameter. If the pattern identification number does exist in the trace list of Fig. 4, then the 'number of hits' parameter is incremented by one and the age is refreshed to zero (505).
  • the oldest trace is found (506) and replaced with the new pattern identification number (the number of hits and age parameters are refreshed to zero).
  • all the age information for each trace in the list is incremented and checked to see that the age has not passed a pre-determined threshold. If the age of a trace has passed a pre-determined threshold then that trace is removed from the list. This method ensures that only up-to-date and accurate trace information is recorded in the neuron.
  • a decision threshold is also needed for each neuron to determine whether a query input is similar to the winning (closest) neuron.
  • the standard GNG neural network has no facility to do this; the closest match is always located, even when the input is a significant distance away from the winning neuron.
  • each neuron in the network is given its own threshold that it uses (if it is the winning neuron) to decide if the query input is similar to it (and the knowledge it describes).
  • Each neuron in the network (after training) is assigned its own threshold automatically.
  • the threshold is in the form of a Gaussian curve (or other mathematical function) that fits over the neuron (Figure 6) and is used to decide if a new input is inside its threshold or not. If it is inside, then the neuron classes the new input as similar to the neuron, whereas, if it is not, then it is classified as too dissimilar. Deciding whether the input is similar to a neuron provides only basic information. Further information is gained by examining the location of the input (inside the threshold) by referencing the Gaussian function and determining a confidence value. If the input is the same as the neuron, then the output from the Gaussian function will be around 1.0 (shown as I1 in Figure 6).
  • the neural network is examined by algorithms within the clustering engine to establish some statistical information.
  • Statistics include: the number of distinct clusters (clusters that are separate from other clusters), number of neurons and edge connections in a cluster (to determine pseudo-probability density of the clusters), and the overall values for neurons and edges.
  • the statistical information can then be used to derive knowledge through data mining, for instance plotting any developing trends in the data (using a base, or first, neural network model as a reference so that trends can be plotted and tracked as the neural network learns the data over time).
  • the clustering engine also automatically attempts to label the identified clusters.
  • the engine examines all of the weights that belong to each neuron in a cluster, and from that determines which synapses are the most dominant. These dominant features are then used to label each cluster individually.
  • the labeled clusters are used further in the clustering and searching process, as well as for data mining and information discovery.
  • the general key information that was used for creating training patterns is extracted from the profile of the querying object, which may or may not have been used for the initial stage of the clustering.
  • the query pattern is a list of keys (that define an area of knowledge) that the user is searching for (1001).
  • the process of querying involves comparing the object's general pattern against those stored by the neurons in the neural network.
  • the object's general pattern is extracted from its profile (1002) and is then compared to see if there is a similar match within the neural network.
  • the comparison process uses a distance metric (1003) to find the similarity of the user's general pattern against the positions of the neurons (1004), which describe a generalized view of the profile data.
  • the general pattern is compared to the winning neuron's threshold (as this defines the bounds of similarity). If the new input is not within the bounds of the threshold (as described previously with reference to Fig. 6), the result is returned to the clustering engine that then progresses to a further level of clustering (1005).
  • if the new input (shown as A in Figure 7) is within the bounds, then the confidence value is retrieved (from the neuron), and the identifications of the patterns that have affected the position of the neuron are retrieved from the trace information.
  • the patterns identified in the trace information form the basis of the subset patterns that are used in the second level of clustering.
  • Second clustering level: if a similar group of profiles is found (as checked in 1005 in Fig. 10 and as shown by C5 in Figure 7), then these profiles are used in a further clustering process.
  • the query specific keys are extracted (1006) from each of the identified profiles (identified in the first stage as being similar to the query object's profile) and are formed into patterns, which are used as a training set for another clustering process.
  • This second stage of clustering groups together query specific patterns from each profile, with the query specific patterns defining information that each profile has knowledge of (1007).
  • the clustering is performed (on-the-fly) by another Growing Neural Gas architecture, which is immediately queried with the querying object's search pattern, and matches are recorded (1008).
  • a list of matching profiles is then returned (1009) to the higher-level process or application making use of this invention (the clustering engine) (1010).
  • Third clustering level: if, as established in 1005 at the second level, no matches are found, the clustering engine will revert to the third level of clustering.
  • the third level of clustering examines the labelled clusters (from the first level of training) to see if another clustered group could hold information that is more specific to the querying object (1011). If another candidate cluster is found (1012), then this cluster will be used as the basis for the second level of clustering, as illustrated in an example in Figure 8. If no other clusters can be found (1012), a message to that effect will be returned to the client (1009).
  • the new profile A is clustered as part of an initial query.
  • the new profile is defined as being similar to profiles that describe cluster C5 (in Figure 8) and can be incorporated into the constantly updating training data set, or just be used for the current query (as the current querying object may not be part of the original clustered set but an external guest to the clustered space).
  • the profile, however, needs to run a query on a subject that is defined in the general pattern, which for demonstration could be "sport".
  • the profiles that show a high interest in sport are defined in cluster C1 (in Figure 8), which can be identified by examining the cluster labelling.
  • the profiles that have had an effect in positioning neurons in that cluster can be identified, with the rest of the profiles being discarded from the search.
  • Information relating to sport can then be extracted from each of the identified candidate profiles and formed into patterns, re-clustered, and queried (as shown in Figure 9) as in the second level of clustering described above.
  • the clustering method is used to provide collaborative filtering of information found by personalized Internet agents.
  • An example of this aspect occurs as follows.
  • search engines work by requiring the entry of one or more 'keywords', which are supposed to 'describe' the topic of interest to the user.
  • the problem is that in reality, it is necessary to specify to the search engine a long and very narrowly defined set of keywords in order to get back exactly the web page of interest, and more often than not, the user does not know exactly what these keywords should be a priori.
  • a solution to this problem involves the usage of a "personalized intelligent agent".
  • This agent is software that learns the interests of the user by extracting the entire set of 'keywords', or concepts, associated with a web page that the user feels is highly valued, as indicated by the user giving his favourite web pages a "high rating", or score, via a web-browser enabled facility of the agent. These concepts are then stored within the user's agent, alongside other highly rated concepts from previous searches. Over time, the user agent is filled with personal information in the form of a set of concepts, each concept (or keyword) having a 'weight', or score, associated with it.
  • the user uses his agent as a front-end interface in conducting Internet searches.
  • the agent receives the query keywords from the user (typically via a web-browser, or any web-interface mechanism).
  • the agent looks at its internal personalized concept store for similar keywords, which may be added to the query, thus typically resulting in higher quality results returned by the search engines.
  • These web pages are returned to the user's web browser via the agent.
  • the agent also stores a list of URLs associated with each highly rated concept, such that the agent may provide to the user these web pages in the event that the user requests information on these concepts in future queries.
  • the agent is able to return even more intelligent and personalized results if the agent 'consults' or 'communicates' with other agents that are similar to the user agent.
  • the premise is that 'similar minded' agents will have stored URLs that are of high interest to other similar minded agents.
  • a real-life analogy is that an English football fan is far more likely to know about Manchester United than an American nuclear physicist is.
  • each agent stores a set of concepts that is associated with the interests of that user. This set is a "profile" of the user. Every user has an agent with a unique profile. Each profile is different in respect to the keywords and the interest level in those keywords (where interest level may be expressed as a number ranging from zero to one).
  • a user profile is a weighted set of concepts, which in mathematical terms is an n-dimensional vector of floating point values, where n is the number of concepts in each profile, and n is different for each profile. It is of course possible to simply compare profiles keyword by keyword, comparing only similar keywords and determining agent 'similarity' based on a sum of differences between keyword interest values.
  • the present invention takes a much more sophisticated statistical approach than this.
  • the present invention 'clusters' profiles into groups, and the invention allows for three different forms of clustering, described earlier as 'level one', 'level two' and 'level three'. In the level one clustering, a subset of concepts, which are common to ALL user profiles, is used to cluster an entire set of profiles into groups.
  • This subset might for example consist of general concepts such as 'sports', 'travel', 'music', 'books', etc., with interest values ranging from zero to one.
  • This set of concepts is present in every user profile at the time of agent creation, and some mechanism such as a question and answer session between the agent and the user assigns an interest level value to each of these 'base' concepts.
  • in level one clustering, users with common interests are grouped together.
  • the first step taken by the agent is to query the clustering engine for agents similar to the profile of the agent's user. Recall that the clustering engine stores the names of agents associated with each profile, so the clustering engine is able to perform this de-referencing operation.
  • the agent is then able to query each agent directly to discover if any of these similar-minded agents can supply URLs of web pages matching the original user query.
  • a problem with this solution is that this sub-group of similar-minded agents might potentially be very large in number; thus, another level of clustering is necessary to sub-cluster within the first cluster, such that 'very-similar-minded' profiles are found. In this case, it is necessary to identify the profiles within the already identified sub-group whose interests coincide with the concepts specified in the user's original query.
  • the clustering process is executed again (termed 'level-two' clustering), this time making use of the set of concepts from the search query.
  • the clustering engine is then able to return a list of agents that are not just similar-minded in a general way, but also have interests in the concepts currently of interest to the user in their query.
  • the agent consults this set of agents, and very likely, these agents will supply URLs to web pages of pertinence.
  • a potential problem is that, upon completing the second level of clustering, no profiles exist that contain the specific concepts of immediate interest to the user (as found in their query). In this case, the third type of clustering is necessary.
  • the clustering engine is able to 'label' cluster groups, meaning, for each cluster of similar minded profiles, the highest rated interests within that cluster may be identified. For instance, one group may be highly interested in sports, and another in travel, and yet another in music.
  • the agent maps this 'label' information to the concepts found in the user's query, and then requests the clustering engine to find a set of agents that are 'like-minded' in this specific sense.
  • this clustering engine invention is for the clustering of documents within an information database, for example on a corporate intranet, for use by a search engine in fulfilling requests for information on a particular interest.
  • a document may be 'profiled' in the same manner as an Internet user's interests were profiled in the previous example.
  • a text summarization or concept extraction tool of some sort is used to extract the key concepts from a document, whether it be a web page, email message, or MS Word document.
  • the set of concepts, each concept having a weight that describes the relevancy of that concept to the document, composes this profile (a minimal profiling sketch appears after this list).
  • a collection of document profiles may be grouped by the clustering engine, such that 'themes' are identified. For instance, the clustering engine may group a set of documents all of which pertain to 'sports', 'football', 'Manchester United' and possibly some other concepts that could not be predicted a priori (maybe 'beer', or 'hats').
  • a search engine is thus much better able to match user queries to the information in a database.
  • the arrangement comprises a processor 1 for carrying out arithmetic operations.
  • the processor 1 is connected to a plurality of memory components, including a hard disk 5, Read Only Memory (ROM) 7, Electrically Erasable Programmable Read Only Memory (EEPROM) 9, and Random Access Memory (RAM) 11. Not all of these memory types need necessarily be provided. Moreover, these memory components need not be located physically close to the processor 1 but may be located remote from the processor 1.
  • the processor 1 is also connected to means for inputting instructions, data etc. by a user, like a keyboard 13, and a mouse 15. Other input means, such as a touch screen, a track ball and/or a voice converter, known to persons skilled in the art may be provided too.
  • a reading unit 17 connected to the processor 1 is provided.
  • the reading unit 17 is arranged to read data from and possibly write data on a data carrier like a floppy disk 19 or a CDROM 21.
  • Other data carriers may be tapes, DVD, etc., as is known to persons skilled in the art.
  • the processor 1 is also connected to a printer 23 for printing output data on paper, as well as to a display 3, for instance, a monitor or LCD (Liquid Crystal Display) screen, or any other type of display known to persons skilled in the art.
  • a printer 23 for printing output data on paper
  • a display 3 for instance, a monitor or LCD (Liquid Crystal Display) screen, or any other type of display known to persons skilled in the art.
  • LCD Liquid Crystal Display
  • the processor 1 may be connected to a communication network 27, for instance, the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), etc. by means of I/O means 25.
  • the processor 1 may be arranged to communicate with other communication arrangements through the network 27.
  • the processor 1 may be implemented as a stand-alone system, or as a plurality of parallel operating processors each arranged to carry out subtasks of a larger computer program, or as one or more main processors with several subprocessors. Parts of the functionality of the invention may even be carried out by remote processors communicating with processor 1 through the network 27.
  • the intelligence of the invention will be, preferably, implemented by software running on an arrangement as shown in Fig. 11 whereas all data relating to the profiles defined above will be stored in one of the memory components.
  • the computer arrangement is preferably a server accessible for multiple users (that may be persons or other computer arrangements) through the network 27.
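As a concrete illustration of the document-profiling embodiment described in the list above, the Python sketch below reduces a document to a weighted concept set with a simple TF-IDF heuristic. All names and the scoring formula are assumptions; the patent requires only some text summarization or concept extraction tool that yields weighted concepts.

```python
import collections
import math
import re
from typing import Dict

def profile_document(text: str,
                     corpus_doc_freq: Dict[str, int],
                     corpus_size: int,
                     top_k: int = 10) -> Dict[str, float]:
    """Extract the top_k concepts of a document as TF-IDF scores, normalised
    to [0, 1] so they can serve as key weights in a document profile."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = collections.Counter(words)
    scores = {}
    for word, tf in counts.items():
        # log(1 + ...) keeps every score positive, even for common words
        idf = math.log(1.0 + corpus_size / (1 + corpus_doc_freq.get(word, 0)))
        scores[word] = tf * idf
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    if not top:
        return {}
    peak = max(scores[w] for w in top)
    return {w: scores[w] / peak for w in top}   # weights normalised to [0, 1]
```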

Abstract

A method and arrangement to perform the following functions: receiving a source profile including a general pattern of keys and a query pattern of keys; comparing the general pattern of keys with clusters (1003); identifying the cluster that is closest to the extracted general pattern of keys (1004); if the extracted general pattern is within predetermined bounds of the identified cluster (1005), performing the following steps: producing a subcluster of the identified cluster, the subcluster comprising general profiles that match the query pattern of keys (1007); carrying out a task related to the query pattern of keys using information associated with the subcluster; if the extracted general pattern is not within predetermined bounds of the identified cluster (1005), performing the following steps: producing another subcluster of another cluster (1007); carrying out a task related to the query pattern of keys using information associated with that other subcluster. The above-mentioned functions are preferably performed using a Growing Neural Gas neural network architecture.

Description

A method and apparatus for dynamically clustering complex profile information
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to an improved method and/or system for clustering information items expressed as features or profiles.
2. Related Art
The Internet is exponentially increasing in size, both in the number of users who "surf" and in the number of sites appearing that provide information or services. The Internet can be viewed as a gargantuan store of information, and finding relevant information is analogous to the needle-in-a-haystack problem. This problem is most easily seen in the domain of search sites that exist to enable Internet users to input keywords and search for information. Typically, most of the information that is located by the various search engines is not what the user requires. Many other domains of information selection, such as television show or product selection, suffer the same problem. The problem reduces to the fact that a person seeking information is a complex and unique individual, and the information they seek is just as complex and unique. There exists a huge void or bottleneck between the user and the information, which must be traversed to bring the two together in an informative manner.
Many techniques have been developed that attempt to bridge the gap between the user and the information sought. On the user side, techniques exist that model a user's interests (which compose a 'user profile'), and these profiles are compared to find 'clusters' of similar profiles, as profiles similar to a user's may contain more specific information that the user is searching for, since the profiles will be from similar people with similar interests. Often, this technique is called 'collaborative filtering'. On the information side of the bridge, an item, whether it be a document or a product profile, may also be clustered into similar groups, such that a user seeking a particular information item will have available to them 'similar' items which may be of value.
SUMMARY OF THE INVENTION
The present invention relates to an improved method and/or system for clustering profiles of objects, for the purpose of improved search and information discovery, as defined by the claims. The present invention relates to a clustering method and apparatus that provide a quick way to find similar profiles of objects. Keys are extracted from a profile and collected into a general pattern. A set of general patterns is formed from a number of profiles and used in a clustering process. After similar profiles have been located, specific keys that could relate to the query are then extracted from the profiles. A specific query pattern is then formed by this extraction of new keys from the profiles that have been identified as similar. These specific query patterns are then re-clustered dynamically to search for a match to the initial query, where a query is embodiment dependent (such as a user entering keywords into a search engine, or an information processing system querying an information store). This method is applicable because the number of object profiles will be large and the profiles themselves may be large (in terms of the number of keys that a profile contains).
The improvement offered by this invention over other clustering methods is the ability to cluster a dynamically changing number of profiles, the ability to cluster into groups without having to specify groups a priori (that is, it works in an 'unsupervised' manner), and the ability to perform multiple levels of clustering. The process described in this invention may be calculated in software or specialized hardware.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 depicts the definition of entities, objects, profiles, keys and patterns.
Figure 2 illustrates the sequence of events composing the main embodiment of the invention: the clustering engine.
Figure 3 depicts a sample of a pattern that is constructed from n keys.
Figure 4 is a table that represents a description of the trace information that a neuron contains.
Figure 5 illustrates the logic flow for adding trace information to a neuron.
Figure 6 details how a Gaussian curve is fitted over a neuron and used for thresholding and generating confidence value information.
Figure 7 illustrates Level One clustering.
Figure 8 illustrates Level Three clustering.
Figure 9 illustrates Level Two clustering.
Figure 10 shows the logic flow from the first level of clustering through to the second level and how the third level of clustering is applied when needed.
Figure 11 shows a computer arrangement by means of which the invention can be implemented.
DETAILED DESCRIPTION OF THE INVENTION
In an embodiment of this invention shown in figure 1, an object (101) describes an entity (100) (which can be anything from a user profile, a physical item, a concept, a document to a monitored process) by the use of a profile. A profile (102) consists of an n dimensional array of keys, the number of which could vary from profile to profile. A key (103) consists of a label and a numerical weight value (not shown in figure 1) that describes the importance of the key to the current object.
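The data model of Figure 1 can be made concrete with a short sketch. The Python below is a minimal, hypothetical rendering of these definitions; the class and field names (Key, Profile, object_id) are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Key:
    label: str     # names a concept, e.g. "sports"
    weight: float  # numerical importance of the key to the current object

@dataclass
class Profile:
    object_id: str                                 # identifies the object (101) being described
    keys: List[Key] = field(default_factory=list)  # n-dimensional array of keys; n varies per profile

# Example: a profile describing a user entity (100).
user_profile = Profile("user-42", [Key("sports", 0.95),
                                   Key("football", 0.80),
                                   Key("travel", 0.10)])
```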
The present invention is a clustering method that provides a quick way to find similar profiles of objects. Keys are extracted from a profile (102) and collected into a general pattern (104). A set of general patterns is formed from a number of profiles and used in a clustering process. After similar profiles have been located, specific keys that could relate to the query are then extracted from the profiles. A specific query pattern is then formed by this extraction of new keys from the profiles that have been identified as similar. These specific query patterns are then re-clustered dynamically to search for a match to the initial query, where a query is embodiment dependent (such as a user entering keywords into a search engine, or an information processing system querying an information store). This method is applicable because the number of object profiles will be large and the profiles themselves may be large (in terms of the number of keys that a profile contains). This will be explained in detail below.
In the context of this invention, the solution adopted uses an unsupervised neural network. Neural networks in general have the advantage that they can learn by example the general characteristics of a data set, and from that classify new, previously unseen data items. Neural networks have the advantage over statistical techniques in that they can operate with missing values, are simple to implement, can work with a "noisy" input, can handle large data sets and are very robust. Neural network models fall into two general categories: that of supervised and unsupervised methods. Using a supervised method (as used by networks such as the multi-layer perceptron) normally indicates that there is some knowledge of the data set and the classes contained within it, which is not suitable for this type of application (as it is not known a priori which cluster or group a profile should belong to).
When there is no a priori knowledge about the data set, unsupervised methods such as the Kohonen SOM (Self Organizing Map) can be used to learn the characteristics of high dimensional data sets (Murtagh and Hernandez, 1995; Siores and Seeto, 1995) and can aid in visualization [Mori et al, 1993; Muknahallipatana et al, 1996; Kohonen et al, 1996]. This ability to learn the characteristics of unknown data makes it well suited to profile information, where the incoming data is of a high dimensionality and complexity. However, the Kohonen SOM is limited with respect to the requirements of this invention, as its parameters have to be assigned before the network is trained, which presupposes some prior information about the clusters contained within the data. In addition, the network cannot be retrained once a training session has ended, which is not useful in a dynamic environment where profiles adaptively change and increase (or decrease) in number.
As a basis for the clustering in this invention, a dynamic, growing neural network architecture is used. Dynamic architectures are extremely powerful as data distributions can be modelled accurately without "dead" areas of zero probability density being included in the model. Units ending up in areas of zero probability density are a side effect of training algorithms, which could be harmful to the overall network performance and also limit the resulting mappings [Fritzke, 1993]. The number of units in a dynamic network architecture also increases and decreases during training, removing the need to specify the network architecture a priori [Fritzke, 1996]. An advantage of this is that neuron units do not all have to be initialized with a random value, which could result in the network producing different results every time it is trained. This also avoids initialization problems where the initial vector weights are in local minima that cannot be resolved by the training algorithm. The algorithm selected for usage within the present invention is the Growing Neural Gas (GNG) network as proposed by Fritzke [Fritzke and Wilke, 1991; Fritzke, 1991; Fritzke, 1993; Fritzke, 1994a; Fritzke, 1994b; Fritzke, 1995; Fritzke, 1995b; Fritzke, 1996]. The clustering process (in general) is illustrated in Figure 2, with the patterns being extracted (202) from a store of overall profiles (201). Every training cycle a random pattern is chosen from the overall set (203) and used for training of the neural network (204); it then becomes unavailable for selection again until every other pattern in the set has been selected, ensuring that each pattern is presented equally. After a certain number of epochs (an epoch equals one full presentation of the pattern set) has elapsed, or some other training criteria have been reached, training finishes (205). The resulting network can then be saved in memory or used immediately for querying, although it should be noted that at the end of every training cycle (presentation of one pattern) the network could also be queried (206).
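The pattern-presentation scheme of Figure 2 (random selection without replacement, so each pattern is presented once per epoch) can be sketched as follows. This is only an illustration: the present callback stands in for one GNG adaptation step, whose details Fritzke's papers specify.

```python
import random
from typing import Callable, List, Sequence

def train(patterns: Sequence[List[float]],
          present: Callable[[List[float]], None],
          epochs: int) -> None:
    """Present every pattern once per epoch in a random order, so a pattern
    cannot be selected again until every other pattern has been selected."""
    order = list(range(len(patterns)))
    for _ in range(epochs):        # one epoch = one full presentation of the pattern set (205)
        random.shuffle(order)      # random choice without replacement
        for i in order:
            present(patterns[i])   # one training cycle: adapt the network to this pattern (204)
```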
Depending on the embodiment of this invention, one, two, or three levels may be employed in clustering profile information. The first level uses a general set of keys that are extracted from the profile and are formed into a pattern. The pattern forms a general overview of the information contained within the profile, and its keys are common to all profiles. These patterns are used to perform a first-level clustering of the profiles. The second level is the extraction of query specific keys from the profile, forming a subset of query related patterns, and clustering on those patterns with the aim of finding matches to the incoming query pattern. A third level of clustering is also employed if no matching information is found in the first two cluster methods. In this third level, an alternate cluster, which matches the query specific information, is identified.
First clustering level.
The first level of operation is the clustering of the n dimensional general keys. Figure 3 shows a profile having n keys key₁, key₂, ..., keyₙ. Clustering will be performed on this general key "string" (pattern) to identify similar profiles with similar interests. Interest value (or weight value) is defined as how much knowledge a certain profile contains relating to a subject or concept, and is described as a numerical value per key in the range from zero (no interest) to one (high interest) for the key concerned. The clustering process in the present invention can run continuously, with updates at a specified time interval. New profiles can be associated with one or more of these general groups upon initial querying, without a major re-clustering of the network, due to the non-decaying parameters of the neural network architecture. As with the addition of new profiles to the cluster process, removal also occurs in the same manner, with minimal disturbance to the clustering process.
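Forming the n-dimensional general pattern of Fig. 3 from a profile might look as follows, reusing the Profile sketch above. The shared key set GENERAL_LABELS is an assumed example; the patent requires only that these keys be common to all profiles.

```python
from typing import List

GENERAL_LABELS = ["sports", "travel", "music", "books"]  # assumed shared key set

def general_pattern(profile: "Profile") -> List[float]:
    """Collect the interest values of the shared keys into a fixed-order
    pattern vector; a key absent from the profile contributes zero interest."""
    weights = {k.label: k.weight for k in profile.keys}
    return [min(1.0, max(0.0, weights.get(label, 0.0))) for label in GENERAL_LABELS]
```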
For the first level of clustering, the n dimensional general pattern as shown in Fig. 3 is clustered with the general pattern from every other profile in the global set of profiles (which can change dynamically as profiles are added, removed or change). The clustering process in this invention is carried out by the unsupervised Growing Neural Gas (GNG) architecture, with queries being processed during training or after a certain time period of constant training has elapsed.
In order to support the clustering of patterns, and subsequent queries upon that information, knowledge of the patterns that affected the position of a neuron in the network has to be recorded. In the standard form, the GNG neural network learns the underlying structure of the data (data in this term refers to profile information) contained within a data set. The learnt representation can then be queried with a new input. The query returns a "distance" that represents how "similar" the query is to the "represented" knowledge that the neural network has learned (from observation of the data). This form of representation is fine for some standard types of neural network applications. However, in the case of this invention, additional knowledge is needed to capture information about which patterns affected the position of the network's neurons. As a solution, each neuron in this invention is given a "trace" facility that allows each pattern that affects the neuron's position to be recorded. The trace list shown in Fig. 4 records the identification numbers of the patterns (first column) that have had an effect on the neuron's position, the number of hits (second column) the neuron has received from a certain pattern, and how old each trace (third column) is.
When a neuron is identified as a winner (500), i.e. the input pattern is identified to be closest to that neuron, the neuron's trace list (Figure 4) is checked to see if the pattern identification number (501) is in its trace (502). If the unique pattern identification number is not in the trace list and the list is not full (503), then it is added to the end of the trace list (504), and the number of hits for that pattern identification number is set to zero in the table of Fig. 4, as is the age parameter. If the pattern identification number does exist in the trace list of Fig. 4, then the 'number of hits' parameter is incremented by one and the age is refreshed to zero (505). If the trace list is full (503) and a new pattern needs to be added, then the oldest trace is found (506) and replaced with the new pattern identification number (the number of hits and age parameters are refreshed to zero). At the end of every training cycle, the age information for each trace in the list is incremented and checked to see that the age has not passed a pre-determined threshold. If the age of a trace has passed the threshold, then that trace is removed from the list. This method ensures that only up-to-date and accurate trace information is recorded in the neuron.
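The trace bookkeeping of Figures 4 and 5 reduces to a few list operations. The sketch below is one possible reading; the capacity MAX_TRACES and age threshold MAX_AGE are assumed parameters that the patent leaves open.

```python
from dataclasses import dataclass, field
from typing import List

MAX_TRACES = 16   # assumed capacity of the trace list
MAX_AGE = 100     # assumed pre-determined age threshold

@dataclass
class TraceEntry:
    pattern_id: int   # first column of Fig. 4
    hits: int = 0     # second column
    age: int = 0      # third column

@dataclass
class Neuron:
    position: List[float]
    traces: List[TraceEntry] = field(default_factory=list)

def record_trace(winner: Neuron, pattern_id: int) -> None:
    """Fig. 5 logic: refresh an existing trace (505), append when there is
    room (504), otherwise evict the oldest trace (506)."""
    for t in winner.traces:
        if t.pattern_id == pattern_id:
            t.hits += 1          # pattern already traced: count the hit
            t.age = 0            # and refresh its age
            return
    if len(winner.traces) < MAX_TRACES:
        winner.traces.append(TraceEntry(pattern_id))
    else:
        oldest = max(range(len(winner.traces)), key=lambda i: winner.traces[i].age)
        winner.traces[oldest] = TraceEntry(pattern_id)

def age_traces(neuron: Neuron) -> None:
    """End of every training cycle: age all traces and drop stale ones."""
    for t in neuron.traces:
        t.age += 1
    neuron.traces = [t for t in neuron.traces if t.age <= MAX_AGE]
```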
Further to the addition of the trace information, a decision threshold is also needed for each neuron to determine whether a query input is similar to the winning (closest) neuron. The standard GNG neural network has no facility to do this; the closest match is always located, even when the input is a significant distance away from the winning neuron. To overcome this, each neuron in the network is given its own threshold that it uses (if it is the winning neuron) to decide if the query input is similar to it (and the knowledge it describes).
Each neuron in the network (after training) is assigned its own threshold automatically. The threshold is in the form of a Gaussian curve (or other mathematical function) that fits over the neuron (Figure 6) and is used to decide if a new input is inside its threshold or not. If it is inside, then the neuron classes the new input as similar to the neuron, whereas, if it is not, then it is classified as too dissimilar. Deciding whether the input is similar to a neuron provides only basic information. Further information is gained by examining the location of the input (inside the threshold) by referencing the Gaussian function and determining a confidence value. If the input is the same as the neuron, then the output from the Gaussian function will be around 1.0 (shown as I1 in Figure 6). However, if the input is only just within the bounds of the function, then the output will be towards zero (shown as I2 in Figure 6), with any other input between these bounds producing an intermediate output value, as shown by I3 in Figure 6. This information provides a numerical rating (confidence value) of similarity to the learnt information, which in turn can be used by the other levels of processing for decision making.
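In code, the Gaussian threshold and confidence value of Figure 6 could be realised as below. The per-neuron width sigma and the cutoff are assumptions; the patent states only that a Gaussian (or other function) is fitted over the neuron.

```python
import math

def confidence(distance: float, sigma: float) -> float:
    """Gaussian fitted over the neuron: about 1.0 when the input coincides
    with the neuron (I1), falling towards zero near the edge of the
    threshold (I2), with intermediate values in between (I3)."""
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))

def is_similar(distance: float, sigma: float, cutoff: float = 0.05) -> bool:
    """The input is inside the threshold when the Gaussian output exceeds
    an assumed cutoff."""
    return confidence(distance, sigma) >= cutoff
```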
After training has occurred, the neural network is examined by algorithms within the clustering engine to establish some statistical information. Statistics include: the number of distinct clusters (clusters that are separate from other clusters), the number of neurons and edge connections in a cluster (to determine the pseudo-probability density of the clusters), and the overall values for neurons and edges. The statistical information can then be used to derive knowledge through data mining, for instance plotting any developing trends in the data (using a base, or first, neural network model as a reference so that trends can be plotted and tracked as the neural network learns the data over time).
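One way to compute these statistics, and the cluster labels described in the next paragraph, is to treat the trained network as a graph of neurons and edges: each connected component is a distinct cluster, and a cluster can be labelled by its dominant weight dimensions. The sketch below assumes integer neuron ids and is only one plausible reading.

```python
from typing import Dict, List, Set, Tuple

def distinct_clusters(neurons: List[int],
                      edges: Set[Tuple[int, int]]) -> List[Set[int]]:
    """Each connected component of the neuron/edge graph is one distinct cluster."""
    adjacency: Dict[int, Set[int]] = {n: set() for n in neurons}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    clusters: List[Set[int]] = []
    seen: Set[int] = set()
    for start in neurons:
        if start in seen:
            continue
        component: Set[int] = set()
        stack = [start]
        while stack:                      # depth-first walk of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

def label_cluster(cluster_positions: List[List[float]],
                  labels: List[str], top_k: int = 2) -> List[str]:
    """Label a (non-empty) cluster by its dominant dimensions: average the
    neuron weight vectors and keep the strongest components."""
    dims = len(labels)
    means = [sum(p[d] for p in cluster_positions) / len(cluster_positions)
             for d in range(dims)]
    ranked = sorted(range(dims), key=lambda d: means[d], reverse=True)
    return [labels[d] for d in ranked[:top_k]]
```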
The clustering engine also automatically attempts to label the identified clusters. The engine examines all of the weights that belong to each neuron in a cluster, and from that determines which synapses are the most dominant. These dominant features are then used to label each cluster individually. The labeled clusters are used further in the clustering and searching process, as well as for data mining and information discovery.
To perform a query, cf. Fig. 10, the general key information that was used for creating training patterns is extracted from the profile of the querying object, which may or may not have been used for the initial stage of the clustering. Along with the general pattern comes the query pattern. The query pattern is a list of keys (that define an area of knowledge) that the user is searching for (1001). The process of querying involves comparing the object's general pattern against those stored by the neurons in the neural network. The object's general pattern is extracted from its profile (1002) and is then compared to see if there is a similar match within the neural network. The comparison process uses a distance metric (1003) to find the similarity of the user's general pattern against the positions of the neurons (1004), which describe a generalized view of the profile data.
When the closest match (in terms of the similarity distance) is found, the general pattern is compared to the winning neuron's threshold (as this defines the bounds of similarity). If the new input is not within the bounds of the threshold (as described previously with reference to Fig. 6), the result is returned to the clustering engine that then progresses to a further level of clustering (1005).
If the new input (shown as A in Figure 7) is within the bounds, then the confidence value is retrieved (from the neuron), and the identifications of the patterns that have affected the position of the neuron are retrieved from the trace information. The patterns identified in the trace information form the basis of the subset patterns that are used in the second level of clustering.
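Putting the pieces above together, the first-level query (1001-1005) amounts to a nearest-neuron search followed by the threshold test; on success it yields the confidence value and the traced pattern ids. This sketch reuses the Neuron, confidence and is_similar helpers from earlier and assumes a Euclidean distance metric, which the patent does not fix.

```python
import math
from typing import List, Optional, Tuple

def euclidean(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def query_first_level(general: List[float],
                      neurons: List["Neuron"],
                      sigma: float) -> Optional[Tuple[float, List[int]]]:
    """Find the winning (closest) neuron for the general pattern; if the input
    falls inside the winner's threshold, return the confidence value and the
    traced pattern ids, otherwise None (escalate to another level, 1005)."""
    winner = min(neurons, key=lambda n: euclidean(general, n.position))
    d = euclidean(general, winner.position)
    if not is_similar(d, sigma):       # outside the bounds of Fig. 6
        return None
    return confidence(d, sigma), [t.pattern_id for t in winner.traces]
```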
Second clustering level. If a similar group of profiles is found (as checked in 1005 in Fig. 10 and as shown by C5 in Figure 7), then these profiles are used in a further clustering process. The query specific keys are extracted (1006) from each of the identified profiles (identified in the first stage as being similar to the query object's profile) and are formed into patterns, which are used as a training set for another clustering process. This second stage of clustering groups together query specific patterns from each profile, with the query specific patterns defining information that each profile has knowledge of (1007). The clustering is performed (on-the-fly) by another Growing Neural Gas architecture, which is immediately queried with the querying object's search pattern, and matches are recorded (1008). A list of matching profiles is then returned (1009) to the higher-level process or application making use of this invention (the clustering engine) (1010).
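A compressed sketch of the second level follows. For brevity, a plain distance test against an assumed match radius stands in for the on-the-fly GNG network the patent actually describes; query_labels and query_pattern are assumed to be aligned, and the Profile type and euclidean helper come from the earlier sketches.

```python
from typing import List

def query_second_level(candidate_profiles: List["Profile"],
                       query_labels: List[str],
                       query_pattern: List[float],
                       radius: float = 0.25) -> List[str]:
    """Extract the query-specific keys from each candidate profile (1006),
    form patterns (1007), and return the ids of matching profiles (1008)."""
    matches: List[str] = []
    for profile in candidate_profiles:
        weights = {k.label: k.weight for k in profile.keys}
        pattern = [weights.get(label, 0.0) for label in query_labels]
        if euclidean(pattern, query_pattern) < radius:   # assumed match radius
            matches.append(profile.object_id)
    return matches
```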
Third clustering level. If, as established in 1005 at the second level, no matches are found, the clustering engine will revert to the third level of clustering. The third level of clustering examines the labelled clusters (from the first level of training) to see if another clustered group could hold information that is more specific to the querying object (1011). If another candidate cluster is found (1012), then this cluster will be used as the basis for the second level of clustering, as illustrated in an example in Figure 8. If no other clusters can be found (1012), a message to that effect will be returned to the client (1009).
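The third-level fallback can be read as a label-matching scan over the clusters found at the first level: pick the cluster whose labels best overlap the query concepts, or report that none exists. The mapping from cluster id to labels is an assumed representation, not one the patent prescribes.

```python
from typing import Dict, List, Optional

def third_level_fallback(cluster_labels: Dict[str, List[str]],
                         query_labels: List[str]) -> Optional[str]:
    """Return the id of the labelled cluster that best overlaps the query
    concepts (1011), to seed a fresh second-level pass; None when no
    candidate cluster can be found (1012)."""
    wanted = set(query_labels)
    best_cluster: Optional[str] = None
    best_overlap = 0
    for cluster_id, labels in cluster_labels.items():
        overlap = len(wanted & set(labels))
        if overlap > best_overlap:
            best_cluster, best_overlap = cluster_id, overlap
    return best_cluster
```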
In the example of Fig. 8, the new profile A is clustered as part of an initial query. The new profile is defined as being similar to profiles that describe cluster C5 (in Figure 8) and can be incorporated into the constantly updating training data set, or just be used for the current query (as the current querying object may not be part of the original clustered set but an external guest to the clustered space). The profile, however, needs to run a query on a subject that is defined in the general pattern, which for demonstration could be "sport". The profiles that show a high interest in sport are defined in cluster C1 (in Figure 8), which can be identified by examining the cluster labelling. The profiles that have had an effect in positioning neurons in that cluster can be identified, with the rest of the profiles being discarded from the search. Information relating to sport can then be extracted from each of the identified candidate profiles and formed into patterns, re-clustered, and queried (as shown in Figure 9) as in the second level of clustering described above.
In one embodiment of the present invention, the clustering method is used to provide collaborative filtering of information found by personalized Internet agents. An example of this aspect occurs as follows. When conducting query searches for information on the Internet, it is typically the case that online search engines return to the user web pages that are either irrelevant or "not quite what I wanted". The reason is that search engines work by requiring the entry of one or more 'keywords', which are supposed to 'describe' the topic of interest to the user. The problem is that in reality, it is necessary to specify to the search engine a long and very narrowly defined set of keywords in order to get back exactly the web page of interest, and more often than not, the user does not know exactly what these keywords should be a priori. The user is most often required to search through a large set of web pages until the page of interest is found. A solution to this problem involves the usage of a "personalized intelligent agent". This agent is software that learns the interests of the user by extracting the entire set of 'keywords', or concepts, associated with a web page that the user feels is highly valued, as indicated by the user giving his favourite web pages a "high rating", or score, via a web-browser enabled facility of the agent. These concepts are then stored within the user's agent, alongside other highly rated concepts from previous searches. Over time, the user agent is filled with personal information in the form of a set of concepts, each concept (or keyword) having a 'weight', or score, associated with it. The user uses his agent as a front-end interface in conducting Internet searches. The agent receives the query keywords from the user (typically via a web-browser, or any web-interface mechanism). The agent then looks at its internal personalized concept store for similar keywords, which may be added to the query, thus typically resulting in higher quality results returned by the search engines. These web pages are returned to the user's web browser via the agent. The agent also stores a list of URLs associated with each highly rated concept, such that the agent may provide to the user these web pages in the event that the user requests information on these concepts in future queries. Continuing with this example, the agent is able to return even more intelligent and personalized results if the agent 'consults' or 'communicates' with other agents that are similar to the user agent. The premise is that 'similar-minded' agents will have stored URLs that are of high interest to other similar-minded agents. A real-life analogy is that an English football fan is far more likely to know about Manchester United than an American nuclear physicist is. Recall that each agent stores a set of concepts that is associated with the interests of that user. This set is a "profile" of the user. Every user has an agent with a unique profile. Each profile is different with respect to the keywords and the interest level in those keywords (where interest level may be expressed as a number ranging from zero to one). For example, most users may have the keyword 'sports' in their profile, but some may have an interest level of 0.2 in 'sports', while a sports fanatic might have an interest level of 0.95. It is also possible that one of these profiles stores the keyword 'badminton' while the other stores the keyword 'football'.
Thus, it should be clear that if an agent is seeking Internet information on football, then it is best that this agent 'consult' the agent whose user has a high interest in sports and football. This process is known as 'collaborative filtering', in the sense that agents work with each other to 'filter out' irrelevant web pages on behalf of the agent's user.

Continuing with this example, the present invention addresses the problem of identifying 'similar-minded' agents. Recall that a user profile is a weighted set of concepts, which in mathematical terms is an n-dimensional vector of floating point values, where n is the number of concepts in each profile and differs from profile to profile. It is of course possible simply to compare profiles keyword by keyword, comparing only shared keywords and determining agent 'similarity' from a sum of differences between keyword interest values (a sketch of this naive baseline is given after this passage). The present invention takes a much more sophisticated statistical approach: it 'clusters' profiles into groups, and allows for three different forms of clustering, described earlier as 'level one', 'level two' and 'level three'.

In level one clustering, a subset of concepts that is common to ALL user profiles is used to cluster an entire set of profiles into groups. This subset might, for example, consist of general concepts such as 'sports', 'travel', 'music', 'books', etc., with interest values ranging from zero to one. This set of concepts is present in every user profile at the time of agent creation, and some mechanism, such as a question and answer session between the agent and the user, assigns an interest level value to each of these 'base' concepts. Thus, in level one clustering, users with common interests are grouped together. When an agent conducts an Internet search query on behalf of its user, the first step taken by the agent is to query the clustering engine for agents similar to the profile of the agent's user. Recall that the clustering engine stores the names of the agents associated with each profile, so the clustering engine is able to perform this de-referencing operation. With this agent list in hand, the agent is then able to query each agent directly to discover whether any of these similar-minded agents can supply URLs of web pages matching the original user query.

A problem with this solution is that this sub-group of similar-minded agents might potentially be very large, so another level of clustering is necessary to sub-cluster within the first cluster, such that 'very-similar-minded' profiles are found. In this case, it is necessary to identify the profiles within the already identified sub-group whose interests coincide with the concepts specified in the user's original query. The clustering process is executed again (termed 'level two' clustering), this time making use of the set of concepts from the search query. The clustering engine is then able to return a list of agents that are not just similar-minded in a general way, but that have interests in the concepts currently of interest to the user in their query. The agent consults this set of agents, and very likely these agents will supply URLs to web pages of pertinence. A potential problem is that, upon completing the second level of clustering, no profiles exist which contain the specific concepts of immediate interest to the user (as found in their query). In this case, the third type of clustering is necessary.
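For contrast, the naive keyword-by-keyword comparison mentioned above can be sketched as follows. This is an illustration only, with made-up profiles; it is the baseline the invention improves upon, not the invention itself.

    def naive_similarity(profile_a, profile_b):
        # Sum of absolute differences over the shared keywords only,
        # rescaled so that 1.0 means identical interest levels.
        shared = set(profile_a) & set(profile_b)
        if not shared:
            return 0.0
        diff = sum(abs(profile_a[k] - profile_b[k]) for k in shared)
        return 1.0 - diff / len(shared)

    fan    = {'sports': 0.95, 'football': 0.90, 'music': 0.30}
    casual = {'sports': 0.20, 'travel': 0.80, 'music': 0.40}
    print(naive_similarity(fan, casual))   # 0.575: strong disagreement on 'sports'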
Recall that the clustering engine is able to 'label' cluster groups, meaning that, for each cluster of similar-minded profiles, the highest rated interests within that cluster may be identified. For instance, one group may be highly interested in sports, another in travel, and yet another in music. The agent maps this 'label' information to the concepts found in the user's query, and then requests the clustering engine to find a set of agents that are 'like-minded' in this specific sense.
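As an illustrative sketch of this level-three step, the label matching might be expressed as follows. The label format, a mapping from keyword to rated interest, is an assumption made for the example.

    def best_labelled_cluster(cluster_labels, query_concepts):
        # cluster_labels: {cluster id: {keyword: highest rated interest}}.
        # Pick the cluster whose label best covers the query concepts.
        def score(labels):
            return sum(labels.get(c, 0.0) for c in query_concepts)
        return max(cluster_labels, key=lambda cid: score(cluster_labels[cid]))

    labels = {'C1': {'sports': 0.9, 'football': 0.8},
              'C2': {'travel': 0.85},
              'C3': {'music': 0.9}}
    print(best_labelled_cluster(labels, ['football']))   # -> C1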
This concludes the description of an example usage of this clustering invention in a personal intelligent agent system for conducting Internet searches. The same agent system incorporating clustering of similar-minded profiles could just as easily be used to recommend items for purchase, or news items to read; the variations are limited only by the type of information stored alongside each concept (keyword) within an agent's profile.

Another embodiment of a system making use of this clustering engine invention is the clustering of documents within an information database, for example on a corporate intranet, for use by a search engine in fulfilling requests for information on a particular interest. In this aspect, a document may be 'profiled' in the same manner as an Internet user's interests were profiled in the previous example. A text summarization or concept extraction tool of some sort is used to extract the key concepts from a document, whether it be a web page, an email message, or an MS Word document. The set of concepts, each concept having a weight that describes the relevancy of that concept to that document, composes this profile. A collection of document profiles may be grouped by the clustering engine, such that 'themes' are identified. For instance, the clustering engine may group a set of documents all of which pertain to 'sports', 'football', 'Manchester United' and possibly some other concepts that could not be predicted a priori (maybe 'beer', or 'hats'). A search engine is thus much better able to match user queries to the information in a database.
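By way of illustration, and assuming plain normalised term frequency as a stand-in for the unspecified concept extraction tool, a document profile of this kind might be produced as follows:

    from collections import Counter

    def profile_document(text, stop_words=frozenset({'the', 'a', 'of', 'and'})):
        # Weight each remaining word by its frequency relative to the most
        # frequent word, giving relevancy weights in (0, 1].
        words = [w for w in text.lower().split() if w not in stop_words]
        counts = Counter(words)
        top = counts.most_common(1)[0][1] if counts else 1
        return {w: n / top for w, n in counts.items()}

    doc = "Manchester United football football sports beer"
    print(profile_document(doc))
    # {'manchester': 0.5, 'united': 0.5, 'football': 1.0, 'sports': 0.5, 'beer': 0.5}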
In Fig. 11, an overview is given of a computer arrangement that can be used to implement the invention. The arrangement comprises a processor 1 for carrying out arithmetic operations.
The processor 1 is connected to a plurality of memory components, including a hard disk 5, Read Only Memory (ROM) 7, Electrically Erasable Programmable Read Only Memory (EEPROM) 9, and Random Access Memory (RAM) 11. Not all of these memory types need necessarily be provided. Moreover, these memory components need not be located physically close to the processor 1 but may be located remote from the processor 1. The processor 1 is also connected to means for inputting instructions, data, etc. by a user, such as a keyboard 13 and a mouse 15. Other input means, such as a touch screen, a track ball and/or a voice converter, known to persons skilled in the art, may be provided too.
A reading unit 17 connected to the processor 1 is provided. The reading unit 17 is arranged to read data from and possibly write data on a data carrier like a floppy disk 19 or a CDROM 21. Other data carriers may be tapes, DVDs, etc., as is known to persons skilled in the art.
The processor 1 is also connected to a printer 23 for printing output data on paper, as well as to a display 3, for instance, a monitor or LCD (Liquid Crystal Display) screen, or any other type of display known to persons skilled in the art.
The processor 1 may be connected to a communication network 27, for instance, the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), etc. by means of I/O means 25. The processor 1 may be arranged to communicate with other communication arrangements through the network 27.
The processor 1 may be implemented as a stand-alone system, or as a plurality of parallel operating processors each arranged to carry out subtasks of a larger computer program, or as one or more main processors with several subprocessors. Parts of the functionality of the invention may even be carried out by remote processors communicating with the processor 1 through the network 27.
The intelligence of the invention, described above, will preferably be implemented by software running on an arrangement as shown in Fig. 11, while all data relating to the profiles defined above will be stored in one of the memory components. The computer arrangement is preferably a server accessible to multiple users (who may be persons or other computer arrangements) through the network 27.
References.
Fritzke B. & Wilke P., "FLEXMAP: A Neural Network for the Traveling Salesman Problem with Linear Time and Space Complexity", Proceedings of the International Joint Conference on Neural Networks, Singapore, pp. 929-934, 1991.
Fritzke B., "Unsupervised Clustering with Growing Cell Structures", Proceedings of the International Joint Conference on Neural Networks, Seattle WA, Vol. 2, pp. 531-536, 1991.
Fritzke B., "Kohonen Feature Maps and Growing Cellular Structures - A Performance Comparison", Advances in Neural Information Processing Systems, Vol. 5, 1993.
Fritzke B., "Growing Cell Structures - a self organizing network for unsupervised and supervised learning", Neural Networks, Vol. 7, pp. 1441-1460, 1994a.
Fritzke B., "Fast Learning with incremental RBF networks", Neural Processing Letters, Vol. 1, pp. 2-5, 1994b.
Fritzke B., "A Growing neural gas network learns topologies", in Advances in Neural Information Processing Systems 7, Eds. Tesauro G., Touretzky D.S. & Leen T.K., pp. 625-632, MIT Press, 1995a.
Fritzke B., "Growing Grid - a self organizing network with constant neighborhood range and adaptation strength", Neural Processing Letters, Vol. 2, No. 5, pp. 9-13, 1995b.
Fritzke B., "Growing Self-Organizing Networks - Why?", European Symposium on Artificial Neural Networks ESANN 96, pp. 61-72, 1996.
Murtagh F. & Hernandez-Pajares M., "The Kohonen Self-Organizing Map Method: An Assessment", Journal of Classification, Vol. 12, pp. 165-190, 1995.
Siores E. & Seeto T., "Categorizing Acoustic Emission Signals using Self Organizing Maps", Non-Destructive Testing - Australia, Vol. 32, No. 4, 1995.
Mori H., Tamura Y. & Tsuzuki S., "An Artificial Neural-Net based technique for power system dynamic stability with the Kohonen Model", IEEE Transactions on Power Systems, Vol. 7, No. 2, May 1992.
Muknahallipatna S. & Chowdhury B.H., "Determination, by Kohonen network, of the generator coherency in dynamic studies", Electric Machines and Power Systems, Vol. 24, No. 8, pp. 869-992, 1996.
Kohonen T., Oja E., Simula O., Visa A. & Kangas J., "Engineering Applications of the Self-Organising Map", Proceedings of the IEEE, Vol. 84, No. 10, October 1996.

Claims
1. A computer arrangement comprising a processor and memory for storing data, the processor being arranged to perform the following functions:
• receiving a source profile including a general pattern of keys relating to an object or user and receiving a query pattern of keys relating to a query;
• extracting the general pattern of keys from said source profile (1002);
• comparing said general pattern of keys with clusters stored in said memory, each cluster comprising a plurality of general profiles of objects or users (1003);
• identifying a cluster that is closest to said extracted general pattern of keys (1004);
• if said extracted general pattern is within predetermined bounds of said identified cluster (1005) performing the following steps:
> producing a subcluster of the identified cluster, the subcluster comprising those general profiles of said plurality of general profiles that match with said query pattern of keys (1007);
> carrying out a task related to said query pattern of keys using information associated with said subcluster;
• if said extracted general pattern is not within predetermined bounds of said identified cluster (1005) performing the following steps:
> identifying another cluster that holds information related to said query pattern of keys, said another cluster comprising another plurality of general profiles (1011);
> producing another subcluster of said another cluster, said another subcluster comprising those general profiles of said another plurality of general profiles that match with said query pattern of keys (1007);
> carrying out a task related to said query pattern of keys using information associated with said another subcluster.
2. The computer arrangement according to claim 1, arranged to cluster on general extracted patterns from received source profiles by using a Growing Neural Gas neural network architecture.
3. The computer arrangement according to claim 1 or 2, arranged to produce subclusters on query patterns of keys from received source profiles by using a Growing Neural Gas neural network architecture.
4. The computer arrangement according to claim 1, 2 or 3, arranged to use a modified Growing Neural Gas neural network architecture to contain a dynamic trace information system for each cluster.
5. The computer arrangement according to any of the preceding claims, arranged to perform a dynamic thresholding method during a training process to decide whether or not said extracted general pattern is within said predetermined bounds of said identified cluster and to provide a numerical representation of the extent to which said extracted general pattern is similar to said identified cluster.
6. The computer arrangement according to claim 5, wherein said numerical representation is made by using a Gaussian function.
7. The computer arrangement according to any of the preceding claims, wherein said task is selected from the following group of tasks: searching and querying, trending information, and data mining/knowledge discovery.
8. The computer arrangement according to any of the preceding claims, arranged to use a modified Growing Neural Gas network architecture to derive statistical information of the clusters.
9. The computer arrangement according to any of the preceding claims, arranged to track trends in the clusters over time, and to derive information from these trends.
10. The computer arrangement according to any of the preceding claims, implemented within at least one of software and hardware.
11. The computer arrangement according to any of the preceding claims, arranged to perform tasks selected from the following group of tasks: collaborative filtering in Internet related activities, document categorization, product categorization, and machine monitoring applications.
12. A method to be carried out on a computer arrangement comprising a processor and memory for storing data, the method including the following steps:
• receiving a source profile including a general pattern of keys relating to an object or user and receiving a query pattern of keys relating to a query;
• extracting the general pattern of keys from said source profile (1002);
• comparing said general pattern of keys with clusters stored in said memory, each cluster comprising a plurality of general profiles of objects or users (1003);
• identifying a cluster that is closest to said extracted general pattern of keys (1004);
• if said extracted general pattern is within predetermined bounds of said identified cluster (1005) performing the following steps:
> producing a subcluster of the identified cluster, the subcluster comprising those general profiles of said plurality of general profiles that match with said query pattern of keys (1007);
> carrying out a task related to said query pattern of keys using information associated with said subcluster;
• if said extracted general pattern is not within predetermined bounds of said identified cluster (1005) performing the following steps:
> identifying another cluster that holds information related to said query pattern of keys, said another cluster comprising another plurality of general profiles (1011);
> producing another subcluster of said another cluster, said another subcluster comprising those general profiles of said another plurality of general profiles that match with said query pattern of keys (1007);
> carrying out a task related to said query pattern of keys using information associated with said another subcluster.
13. The method according to claim 12, wherein clustering on general extracted patterns from received source profiles is performed by using a Growing Neural Gas neural network architecture.
14. The method according to claim 12 or 13, wherein subclusters on query patterns of keys from received source profiles are produced by using a Growing Neural Gas neural network architecture.
15. The method according to claim 12, 13 or 14, wherein a modified Growing Neural Gas neural network architecture is used to contain a dynamic trace information system for each cluster.
16. The method according to any of the claims 12 through 15, wherein a dynamic thresholding method is performed during a training process to decide whether or not said extracted general pattern is within said predetermined bounds of said identified cluster and to provide a numerical representation of the extent to which said extracted general pattern is similar to said identified cluster.
17. The method according to claim 16, wherein said numerical representation is made by using a Gaussian function.
18. The method according to any of the claims 12 through 17, wherein said task is selected from the following group of tasks: searching and querying, trending information, and data mining/knowledge discovery.
19. The method according to any of the claims 12 through 18, wherein a modified Growing Neural Gas network architecture is used to derive statistical information of the clusters.
20. The method according to any of the claims 12 through 19, wherein trends in the clusters are tracked over time, and information is derived from these trends.
21. The method according to any of the claims 12 through 20, implemented within at least one of software and hardware.
22. The method according to any of the claims 12 through 21, wherein tasks are performed that are selected from the following group of tasks: collaborative filtering in Internet related activities, document categorization, product categorization, and machine monitoring applications.
23. A computer program product to be loaded by a computer arrangement comprising a processor and memory for storing data, the computer program product, after being loaded, providing said computer arrangement the capacity to perform the following functions:
• receiving a source profile including a general pattern of keys relating to an object or a user and receiving a query pattern of keys relating to a query;
• extracting the general pattern of keys from said source profile (1002);
• comparing said general pattern of keys with clusters stored in said memory, each cluster comprising a plurality of general profiles of objects or users (1003);
• identifying a cluster that is closest to said extracted general pattern of keys (1004);
• if said extracted general pattern is within predetermined bounds of said identified cluster (1005) performing the following steps:
> producing a subcluster of the identified cluster, the subcluster comprising those general profiles of said plurality of general profiles that match with said query pattern of keys (1007);
> carrying out a task related to said query pattern of keys using information associated with said subcluster;
• if said extracted general pattern is not within predetermined bounds of said identified cluster (1005) performing the following steps:
> identifying another cluster that holds information related to said query pattern of keys, said another cluster comprising another plurality of general profiles (1011);
> producing another subcluster of said another cluster, said another subcluster comprising those general profiles of said another plurality of general profiles that match with said query pattern of keys (1007);
> carrying out a task related to said query pattern of keys using information associated with said another subcluster.
24. A data carrier provided with a computer program product according to claim 23.
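Purely as an aid to understanding claims 1, 12 and 23, the following Python sketch traces the claimed sequence of functions end to end. The Cluster class, the Euclidean distance measure, the fixed bound, and the label-based fall-back for locating "another cluster" are illustrative assumptions only; the claims do not prescribe these choices (the description contemplates, for instance, Growing Neural Gas clustering and Gaussian thresholding instead).

    from dataclasses import dataclass

    @dataclass
    class Cluster:
        centre: list     # position of the cluster in general-key space
        profiles: list   # member profiles (dicts: keyword -> interest)

        def label_score(self, keys):
            # How strongly this cluster's members cover the given keys.
            return sum(p.get(k, 0.0) for p in self.profiles for k in keys)

    def handle_query(source_profile, query_keys, clusters, general_keys, bound=0.5):
        # (1002) extract the general pattern of keys from the source profile
        general_pattern = [source_profile.get(k, 0.0) for k in general_keys]

        # (1003)/(1004) compare with the stored clusters; take the closest one
        def dist2(cluster):
            return sum((a - b) ** 2 for a, b in zip(general_pattern, cluster.centre))
        closest = min(clusters, key=dist2)

        if dist2(closest) ** 0.5 <= bound:
            # (1005) within bounds: subcluster the identified cluster
            candidates = closest.profiles
        else:
            # (1011) otherwise: identify another cluster via its labelling
            candidates = max(clusters, key=lambda c: c.label_score(query_keys)).profiles

        # (1007) keep only the profiles that match the query pattern of keys
        return [p for p in candidates if any(k in p for k in query_keys)]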
PCT/NL2000/000516 2000-07-21 2000-07-21 A method and apparatus for dynamically clustering complex profile information WO2002008953A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU2000261903A AU2000261903A1 (en) 2000-07-21 2000-07-21 A method and apparatus for dynamically clustering complex profile information
PCT/NL2000/000516 WO2002008953A1 (en) 2000-07-21 2000-07-21 A method and apparatus for dynamically clustering complex profile information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/NL2000/000516 WO2002008953A1 (en) 2000-07-21 2000-07-21 A method and apparatus for dynamically clustering complex profile information

Publications (1)

Publication Number Publication Date
WO2002008953A1 true WO2002008953A1 (en) 2002-01-31

Family

ID=19760703

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/NL2000/000516 WO2002008953A1 (en) 2000-07-21 2000-07-21 A method and apparatus for dynamically clustering complex profile information

Country Status (2)

Country Link
AU (1) AU2000261903A1 (en)
WO (1) WO2002008953A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5802507A (en) * 1992-12-16 1998-09-01 U.S. Philips Corporation Method for constructing a neural device for classification of objects
US5619709A (en) * 1993-09-20 1997-04-08 Hnc, Inc. System and method of context vector generation and retrieval
US5867799A (en) * 1996-04-04 1999-02-02 Lang; Andrew K. Information system and method for filtering a massive flow of information entities to meet user information classification needs

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FRITZKE B: "GROWING CELL STRUCTURES A SELF-ORGANIZING NETWORK FOR UNSUPERVISED AND SUPERVISED LEARNING", NEURAL NETWORKS,GB,ELSEVIER SCIENCE PUBLISHERS, BARKING, vol. 7, no. 9, 1994, pages 1441 - 1460, XP000489767, ISSN: 0893-6080 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8001008B2 (en) 2006-10-24 2011-08-16 Garett Engle System and method of collaborative filtering based on attribute profiling
US8645224B2 (en) 2006-10-24 2014-02-04 Fossick Services Limited Liability Company System and method of collaborative filtering based on attribute profiling
CN105913011A (en) * 2016-04-08 2016-08-31 北京大学深圳研究生院 Human body abnormal behavior detection method based on parameter self-regulation nerve network
CN105913011B (en) * 2016-04-08 2019-06-04 深圳市感动智能科技有限公司 Human body anomaly detection method based on parameter self-regulation neural network

Also Published As

Publication number Publication date
AU2000261903A1 (en) 2002-02-05

Similar Documents

Publication Publication Date Title
US20060095852A1 (en) Information storage and retrieval
GB2395808A (en) Information retrieval
EP1426882A2 (en) Information storage and retrieval
Saeed et al. Text stream to temporal network-a dynamic heartbeat graph to detect emerging events on twitter
Rajamohana et al. An effective hybrid cuckoo search with harmony search for review spam detection
Punitha et al. Performance evaluation of semantic based and ontology based text document clustering techniques
McGeer et al. Performance enhancement through the generalized bypass transform
Komalavalli et al. Challenges in big data analytics techniques: a survey
Gaber et al. Detection and classification of changes in evolving data streams
Dai et al. A two-layer text clustering approach for retrospective news event detection
Casalino et al. Enhancing the dissfcm algorithm for data stream classification
Nguyen et al. Learning approach for domain-independent linked data instance matching
Trieschnigg et al. Hierarchical topic detection in large digital news archives: exploring a sample based approach
WO2002008953A1 (en) A method and apparatus for dynamically clustering complex profile information
Roy Predicting User’s web navigation behaviour using AMD and HMM approaches
CN113761125A (en) Dynamic summary determination method and device, computing equipment and computer storage medium
Gaber et al. Classification of changes in evolving data streams using online clustering result deviation
Velayutham et al. A novel entropy based unsupervised feature selection algorithm using rough set theory
Veparala et al. Big Data and Different Subspace Clustering Approaches: From social media promotion to genome mapping
Li et al. Research on the evaluation of learning behavior on MOOCs based on cluster analysis
Sundari et al. A review on pattern discovery techniques of web usage mining
Tejasree et al. An improved differential bond energy algorithm with fuzzy merging method to improve the document clustering for information mining
GeethaRamani et al. Grouping of users based on user navigation behaviour using supervised association rule tree mining
Shi et al. Everyone’s preference changes differently: A weighted multi-interest model for retrieval
Vats et al. Dimensionality Reduction Techniques: Comparative Analysis

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: COMMUNICATION UNDER RULE 69 EPC (EPO FORM 1205A DATED 25.04.2003)

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP