WO2003063030A1 - System and method for clustering data - Google Patents


Publication number
WO2003063030A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
state
state value
values
updating
Prior art date
Application number
PCT/US2003/001806
Other languages
French (fr)
Inventor
Guangzhou Zou
Xun Wang
Zhen Su
Original Assignee
Syngenta Participations Ag
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Syngenta Participations Ag filed Critical Syngenta Participations Ag
Publication of WO2003063030A1 publication Critical patent/WO2003063030A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Definitions

  • the present invention relates generally to analysis of data, and more particularly, to a method and apparatus for data clustering.
  • Clustering is a type of pattern recognition. Clustering is a process of organizing data into clusters by revealing naturally occurring patterns or structures. The resulting clusters allow users to discover similarities and differences among patterns and to derive useful conclusions about them. Some general applications of clustering are data reduction, hypothesis generation, hypothesis testing, and prediction based on clusters.
  • Clustering is useful in many fields such as life sciences, medical sciences, social sciences, earth sciences, and engineering. Clustering is often found under different names in different contexts such as data mining in software engineering, machine learning in pattern recognition, numerical taxonomy in biology and ecology, typology in social sciences, and partition in graph theory.
  • One example of clustering is grouping genes according to similarities in their expression patterns. Other examples are helping marketers discover and characterize customers, identifying areas of similar land use in an earth observation database, identifying groups of automobile insurance policy holders with a high average claim cost, and identifying groups of houses in an area according to house types, value, and geographic location.
  • Some conventional clustering methods are not efficient for large data sets and suffer from long computation times.
  • the agglomerative type of hierarchical clustering algorithms has a computational complexity of O(n³).
  • Other methods have inflexible clustering criteria which result in clusters that are either too coarse or too fine so that the natural patterns in the data are missed.
  • some methods, such as partitioning methods, try to fit data to predefined or arbitrary patterns and, thus, they too are unable to reveal the natural patterns in the data. In addition, few methods are scalable for massively parallel computation. Therefore, a need exists for a scalable method that lends itself to parallel computation and that employs flexible clustering criteria.
  • the present invention provides systems and methods that cluster data in a data space.
  • the present invention is scalable and thus allows for parallel computation. Additionally, the clustering criteria of the present invention impose minimal a priori restriction on the data and thus allow for a more natural clustering that does not obscure the natural patterns in the data.
  • values of cluster parameters are selected and a data set is created from the data points in the data space that satisfy the selected cluster parameter values.
  • Each data point in the data set is assigned a different initial state value.
  • the state value for each data point in the data set is updated according to one or more rules that are related to the cluster parameters.
  • the update process is capable of being performed in a substantially parallel manner. After the update process has been applied to the entire data set, the update process is repeated for the entire data set until the state values of respective data points in the data set have stabilized. For example, stabilization of state values is indicated by all of the state values of the respective data points in the data set remaining unchanged after a completed update process of the entire data set.
  • the data points in the data set can be grouped into data clusters as a function of the state values. For example, data points with the same stabilized state value belong to the same data cluster.
  • the process can be repeated for different cluster parameter values.
  • data clusters can be discovered for different sets of cluster parameter values to reveal the natural patterns of the data.
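The summarized flow (assign distinct states, update to stability, group by state) can be sketched as a short program. This is an illustrative sketch, not the patent's implementation; the `points` list and the `neighbors_of` mapping are assumed inputs, with neighborhoods already computed from the selected cluster parameter values.

```python
def cluster_by_states(points, neighbors_of):
    """Sketch of the clustering loop: assign each point a distinct state,
    then repeatedly lower every state to the minimum in its neighborhood
    until a full pass leaves all states unchanged."""
    # Each data point starts with a different initial state value.
    state = {p: i for i, p in enumerate(sorted(points))}
    changed = True
    while changed:                      # repeat until states stabilize
        changed = False
        old = dict(state)               # all points update from one snapshot
        for p in points:
            lowest = min([old[q] for q in neighbors_of.get(p, [])] + [old[p]])
            if lowest != state[p]:
                state[p] = lowest
                changed = True
    # Points sharing a stabilized state value form one cluster.
    clusters = {}
    for p, s in state.items():
        clusters.setdefault(s, set()).add(p)
    return list(clusters.values())
```

Points with the same stabilized state value come back in the same set, so the number of clusters equals the number of distinct surviving state values.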
  • FIG. 1 is a representation illustrating an exemplary embodiment of a system that clusters data according to the present invention.
  • FIG. 2 is a flowchart illustrating an exemplary embodiment of a method for clustering data according to the present invention.
  • FIGS. 3, 4, and 5 are flowcharts of alternative embodiments of methods for clustering data according to the present invention.
  • FIG. 6 is a representation illustrating an exemplary embodiment of a hierarchy or an aspect of a hierarchy according to the present invention.
  • the present invention generally relates to systems and methods that cluster data.
  • the present invention provides a system and a method for discovering data clusters in a data space.
  • An exemplary embodiment of a system 10 according to the present invention is illustrated in FIG. 1.
  • the system 10 includes a computing unit 90 such as at least one data processor, at least one computer (e.g., a Beowulf cluster, a supercomputer, a server, a mainframe, a desktop, a laptop, a notebook, a portable or a handheld computer) and/or any equivalents thereof.
  • the computing unit 90 includes a controller 20, a memory 30, a bus 40, an input device 50 and/or an output device 60.
  • the system 10 includes a data device 70 (e.g., an external data storage device or a remote data storage device) and/or a link 80 (e.g., a cable link or a wireless link).
  • the controller 20 includes a computing device or a plurality of computing devices (e.g., processors, microprocessors and/or state machines).
  • the memory 30 includes volatile memory components and/or non- volatile memory components.
  • the controller 20 and the memory 30 are in two-way communications with the bus 40.
  • the input device 50 (e.g., receiver, microphone, mouse, keypad, keyboard, sensor and/or touch-sensitive display), the output device 60 (e.g., display, speaker and/or transmitter) and the data device 70 (e.g., a conventional memory and/or conventional data storage device with or without computing power) each provide a conventional input/output interface with the bus 40.
  • the link 80 also provides a conventional input/output interface with the bus 40.
  • the components of the system 10 are connected directly to each other in addition to or instead of being connected via the bus 40.
  • Other conventional methods of communicating between components (e.g., conventional wireless communications means) are also contemplated by the present invention.
  • various levels of integration between components are contemplated by the present invention. For example, any component may be integrated in part or in whole with any other component or components.
  • the controller 20 controls data flow and/or access on the bus 40. Programs and/or data are accessed by the controller 20 from, for example, the memory 30, the input device 50 and/or the data device 70.
  • the input device 50 provides a user interface for entering data and/or commands (e.g., via a keyboard, programming the system 10 or entering values for parameters during the execution of a program).
  • the output device 60 provides, for example, a display and/or an interface that transmits information to a user or to another device, under control of the controller 20.
  • a method 100 (shown in Figure 2) that clusters data according to an exemplary embodiment of the present invention is stored, for example, in the memory 30, the controller 20 or some combination thereof. Furthermore, the method 100 is encompassed in software, hardware (e.g., an application specific integrated circuit (ASIC)) or some combination thereof.
  • a data space represents the set of all data points from which data sets are generated and/or from which clusters are formed. Each data point includes a plurality of dimensions and/or degrees of freedom.
  • the data space is stored, at least partly, in the memory 30, the controller 20, the data device 70 or some combination thereof. Furthermore, information including data points is transmitted and stored between, for example, the memory 30, the controller 20 and the data device 70. For example, if the data space is not efficiently stored within the memory 30 and/or the controller 20, then the data device 70 via the link 80 at least partly provides storage for the data space. Furthermore, the data device 70 provides additional computing power that can process at least portions of the data space stored in the data device 70 in some embodiments.
  • the controller 20 controls the other components of the system 10. For example, the controller 20 accesses the method 100 stored in the memory 30. The controller 20 then executes the method 100 and processes the data points received via the bus 40, the memory 30 and/or the data device 70. In one example, the memory 30, the data device 70, the input device 50 and/or the controller 20 serve as a source of data points. In another example, the memory 30 provides a local cache corresponding to the memory of the data device 70.
  • At least a portion of the data can be processed substantially in parallel by the controller 20, in one embodiment.
  • parallel processing is achieved via one or more processors and/or state machines.
  • the controller 20 can process data stored in the memory 30, the data device 70, the controller 20 or some combination thereof. In another example, the processing power is distributed between the controller 20 and the data device 70.
  • the controller 20, in executing the method 100 according to an exemplary embodiment of the present invention, creates at least one data set from the data space. In one example, the data set is further processed via a substantially parallel, and possibly iterative, process resulting in the clustering of data.
  • for different cluster conditions (e.g., cluster parameter values), possibly different data sets are generated from the data space, resulting in one or more possibly different data clusters.
  • the clusters of data are organized, for example, according to cluster conditions to illustrate one or more hierarchical levels in one or more hierarchies.
  • a flowchart shows an exemplary embodiment of the method 100 that clusters data from a data space according to the present invention.
  • the method 100 begins in step 110 and proceeds with the selection of values for cluster parameters in step 120.
  • the values for the cluster parameters are automatically selected.
  • the method 100 automatically selects initial values for the cluster parameters and automatically changes the values for the cluster parameters.
  • one or more of the cluster parameter values are increased or decreased by the integer multiples of a particular resolution value up to or down to a particular threshold value.
  • the values for the cluster parameters are selected or updated manually by an operator.
  • the number and/or type of cluster parameters are preset or chosen as a function of, for example, the application and/or the data point type.
  • the method 100 employs two cluster parameters, namely a number of neighbors n and a radius r.
  • a data set is generated by selecting data points of the data space that satisfy the particular values of the cluster parameters.
  • the data set includes data points that have at least n neighbors within a radius r.
  • the data set includes data points that satisfy at least one of the cluster parameter values.
  • Other methods such as conventional methods for applying the cluster parameter values are used to create the data set in some embodiments.
  • each of the data points in the data set is assigned a different initial state value. Integers and other types of numbers or representations are used for initial state values.
  • step 150 the state value of each data point in the data set is updated according to particular rules.
  • the rules are preset by the user, programmed by the user and/or selected automatically.
  • the rules are selected automatically as a function of the type of data space being processed or the type of application.
  • the rule is related to the cluster parameters.
  • An example rule is that a particular data point should be given the lowest state value in a corresponding neighborhood as defined by the cluster parameters.
  • the step 150 is carried out as a parallel process with all the data points simultaneously undergoing the updating process in step 150. Parallel processing usually reduces computing time, especially when the data sets are very large.
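Step 150's simultaneous update can be expressed as a single pass that reads every state value from a snapshot taken before the pass begins, so the per-point updates are independent of one another and could be farmed out to parallel workers. The function below is a sketch under that assumption; `neighbors_of` is a hypothetical precomputed neighborhood map.

```python
def update_states_once(state, neighbors_of):
    """One update stage: every point takes the lowest state value found in its
    neighborhood (including itself), computed from a snapshot of the states at
    the start of the stage.  Returns True if any state value changed."""
    old = dict(state)                   # snapshot: makes the stage order-independent
    changed = False
    for p in state:
        lowest = min([old[q] for q in neighbors_of.get(p, [])] + [old[p]])
        if lowest != state[p]:
            state[p] = lowest
            changed = True
    return changed
```

Repeating this function until it returns False implements the stabilization check of step 160.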
  • step 160 the method 100 determines whether or not any state values were changed in the previous step (i.e., step 150). If a state value of any of the data points was changed, then the method 100 jumps back to step 150. If no state values were changed, then the state values have stabilized and the method 100 proceeds with step 170.
  • Step 170 determines the clusters according to state values.
  • the data points are grouped into clusters according to state value.
  • a first cluster is formed from the data points with a first state value.
  • a second cluster is formed from the data points with a second state value.
  • the number of clusters is determined by the number of different state values.
  • step 180 the method 100 determines whether or not the value of any cluster parameters are to be changed. If not, then the method 100 terminates in step 190. Otherwise, if at least one of the cluster parameter values is to be changed, then the method 100 updates the particular cluster parameter value (step 190) and the process proceeds again to step 130, in which possibly different data sets are generated according to the one or more updated cluster parameter values. Any of the parameter values are manually and/or automatically changeable by the method 100. Furthermore, the change in at least one cluster parameter value is by a constant or variable incremental amount. In one embodiment, the increments are selected according to the noise level of the data. The changes are made by addition, subtraction, multiplication, division or any other conventional method for changing the value of a particular parameter.
  • For the cluster parameters n and r described above, in one example, for a particular value of n, the values of r are changed by adding a constant value to r until r reaches a particular threshold value. In another example, for a particular value of r, the values of n are changed by subtracting a constant value from n until n reaches a particular threshold value.
  • Each set of cluster parameter values generates a possibly different data set and possibly clusters of different composition, size and number.
  • each set of cluster parameter values forms, for example, at least portions of hierarchical levels in one or more hierarchies.
  • FIG. 3 is a flowchart of an alternative embodiment of a method 300 for clustering data according to the present invention.
  • Cluster criteria are selected 302 and a different state is assigned to each data point in the data 304.
  • the state of each data point is updated according to at least one rule that is a function of the cluster criteria 306 and this is repeated until all of the states remain unchanged 308.
  • at least one of the cluster criteria are changed and the method 300 is repeated for the new criteria.
  • data points grouped by states are displayed as a result of method 300.
  • the data points are grouped by states in a hierarchy according to cluster criteria.
  • FIG. 4 is a flowchart of another alternative embodiment of a method 400 for clustering data according to the present invention.
  • Data points are selected from a k-dimensional space that have at least n neighboring data points with a similarity measure less than or equal to r 402.
  • Each selected data point is labeled with a unique initial state 404.
  • the state of each labeled data point is updated to the lowest state in its neighborhood, if the state differs from the lowest state in its neighborhood 406.
  • the updating is repeated until there is no state change in the k-dimensional space 408.
  • each data point represents a gene and its characteristics.
  • k, n, and r are predetermined values.
  • the states are updated simultaneously.
  • genes grouped by state are displayed.
  • r is increased by a resolution Δr and the method is repeated for the new r.
  • the resolution Δr is selected according to noise level.
  • the resulting clusters are provided before selecting the resolution Δr so that the resolution Δr is selected according to the resulting clusters.
  • r is varied by Δr over a range of values to produce a hierarchy of clusters. In another embodiment, the hierarchy of clusters is displayed.
  • FIG. 5 is a flowchart of another alternative embodiment of a method 500 for clustering data according to the present invention.
  • a system for clustering data comprises one or more memory units that store at least a portion of a data space and a controller coupled to the one or more memory units.
  • the data space contains a plurality of data points.
  • the controller includes a plurality of computing devices that operate to perform a method.
  • a state value is updated for each data point according to at least one rule that is a function of cluster criteria 502.
  • the cluster criteria comprise a minimum number of neighbors (n) and a similarity value (r).
  • the similarity value (r) is increased by a pre-determined increment and, then, the updating is repeated 504.
  • the method is performed by the computing devices in parallel. In another embodiment, the controller and the plurality of computing devices are part of a multicomputer architecture capable of parallel processing. In another embodiment, the controller and the plurality of computing devices are part of a supercomputer. In another embodiment, the method further comprises displaying a hierarchy of data points grouped by state values over a range of similarity values (r).
  • n represents the minimum number of neighbors required and r represents a value of a similarity measure.
  • the parameters n and r allow for control of data density.
  • a similarity measure is chosen depending on the application.
  • Additional examples of similarity measures include the family of Minkowski metrics, of which the Euclidean distance is a member.
  • An example of a data point is a gene and its characteristics, such as an expression pattern.
  • a group is a collection of similar data points.
  • Neighbors are data points having a defined similarity measure with respect to a particular data point. States are associated with data points and identify groups or clusters.
  • a ⁇ -dimensional space is a data set of size k, such as k number of experiments. The example is to be constraed merely as an illustration and is not to be constraed as a limitation in any manner.
  • Step 1 Select data points that have at least n neighboring data points within a given radius r.
  • the initial value of the minimum neighbors requirement n and the neighborhood size r are pre-determined.
  • the distance between any two data points can be calculated in a k-dimensional space.
  • Step 2 Label each selected data point with a unique integer i, which becomes the initial state of the data point.
  • Step 3 Simultaneously update the state of all labeled data points according to the following rules:
    o Change the state of the data point under consideration to the lowest state that occurred in its neighborhood.
    o Keep the state of the data point unchanged if its state is the lowest one in its neighborhood.
  • Step 4 Repeat step 3 until there is no state change in the entire data space.
  • Step 5 Output the groups of the data points that have the same state as the clusters formed at the parameter point (r, n).
  • Step 6 Increase the value of parameter r (or n) by a user-specified resolution Δr and repeat steps 1-5 to create the set of the lower-level clusters with a smaller neighborhood (or minimum neighbors requirement).
  • Step 7 Repeat step 6 to cover a meaningful range in the parameter space. This will produce a hierarchy of clusters with respect to the selected resolution Δr.
  • Steps 3, 6, and 7 can all be carried out in a massively parallel fashion.
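Steps 1 through 7 can be collected into one routine. The sketch below is a sequential rendering (the patent notes that steps 3, 6 and 7 can run massively in parallel); the function name and the `dist` callable are illustrative assumptions, not taken from the patent.

```python
def cluster_hierarchy(points, dist, n, r, dr, levels):
    """Sketch of Steps 1-7.  For each level: select dense points (Step 1),
    label them (Step 2), iterate the lowest-state rule to stability
    (Steps 3-4), output clusters (Step 5), then grow r by dr (Steps 6-7)."""
    hierarchy = []
    for _ in range(levels):
        # Step 1: keep points with at least n neighbors within radius r.
        nbrs = {p: [q for q in points if q != p and dist(p, q) <= r]
                for p in points}
        selected = [p for p in points if len(nbrs[p]) >= n]
        # Step 2: label each selected data point with a unique integer state.
        state = {p: i for i, p in enumerate(selected)}
        # Steps 3-4: simultaneous lowest-state updates until nothing changes.
        changed = True
        while changed:
            changed = False
            old = dict(state)
            for p in selected:
                lowest = min([old[q] for q in nbrs[p] if q in old] + [old[p]])
                if lowest != state[p]:
                    state[p], changed = lowest, True
        # Step 5: points sharing a state form one cluster at (r, n).
        groups = {}
        for p, s in state.items():
            groups.setdefault(s, []).append(p)
        hierarchy.append((r, sorted(sorted(g) for g in groups.values())))
        # Step 6: increase r by the resolution dr for the next level.
        r += dr
    # Step 7: the accumulated levels form the cluster hierarchy.
    return hierarchy
```

Each entry of the returned list pairs a parameter value r with the clusters found at that level, which is the hierarchy the patent describes.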
  • a data space includes ten data points labeled A to J.
  • Each data point represents, for example, an m-dimensional space where m is an integer.
  • point A includes m parameters that are represented, for example, as the m coordinates of point A, e.g., (A0, A1, ..., Am-1). These parameters represent information such as measurement(s), characteristic(s) or representative value(s) of the same type(s) or different type(s).
  • each data point represents a particular person or group of persons with a particular financial history.
  • Information stored in the coordinates of each data point includes, for example, income data, asset data, debt data, liability data, overhead data, statistical scores relating to personal financial history and other data that is relevant as to whether or not a bank should approve a loan or extend a pre-approved credit card offer to a particular person or group of persons.
  • each data point represents a test subject (e.g., one or more organisms, cells, organic material, DNA, etc.) that is the focus of scientific research.
  • Information stored in the coordinates of each data point includes, for example, test conditions, subject characteristics, statistical data relating to the test conditions and/or subject. It will be appreciated that these are merely illustrations and not intended to limit the present invention in any way.
  • the systems and methods for clustering data according to the present invention find application in a wide variety of applications in which information is processed and/or analyzed.
  • a distance is determined between each point and every other point of the data set.
  • the term “distance” includes, for example, conventional spatial distances or its equivalent between two points (e.g., A and B) as represented, for example, by the conventional relation of the square root of the sum of the square of the differences of corresponding coordinates between two points.
  • the term “distance” also includes, for example, a conventional correlation value or its equivalent.
  • the distance is a normalized correlation between two points (e.g., A and B) subtracted from an offset such as, for example, one. Accordingly, the distance between, for example, point A and itself would be zero.
  • the term “distance” need not be limited to any one of the above-identified embodiments, but includes any conventional mathematical (e.g., statistical) parameters as known to one of ordinary skill in the art.
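The two notions of "distance" described above can be written out directly. The Minkowski form below reduces to the Euclidean distance at p = 2, and the correlation form subtracts a normalized (Pearson) correlation from an offset of one, so the distance between a point and itself is zero. Both are sketches; in practice the correlation form would need a guard against constant vectors, which make its denominator zero.

```python
import math

def minkowski(a, b, p=2):
    """Minkowski distance; p=2 gives the ordinary Euclidean distance
    (square root of the sum of squared coordinate differences)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def correlation_distance(a, b):
    """One minus the normalized (Pearson) correlation of two points, so a
    point is at distance zero from itself and at distance two from a
    perfectly anti-correlated point."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)
```

For gene-expression profiles, a correlation-based distance groups genes with similar expression *shapes* even when their absolute levels differ, which is why it is a common choice in that application.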
  • the distance information is storable in a matrix.
  • the example matrix below is for storing distances between the points of a data set and contains ten points A-J.
  • the information need not be stored in a matrix format and there can be more or less than ten data points.
  • other methods for organizing and/or keeping track of data e.g., coordinate systems, pointers, etc. are also employed.
  • the distance between a particular point and itself is zero.
  • the distance between point A and point B is, for example, shown to have a value of one. This value is reflected in row one, column two, and row two, column one.
  • Other distance values between points are also stored in the matrix.
  • the distance values are shown in the matrix to be integers, other types of numbers (e.g., non-integers) are also stored. Integers were employed, in this example, to simplify the discussion.
  • point C has at least two neighbors (i.e., points F and G), each within a distance of 1.
  • points B, C, D, E, G, H and I each have at least two neighbors, each within a distance of 1.
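Selecting the points that qualify for the clustering process is a row-by-row scan of the distance matrix. The three-point matrix below is a small hypothetical stand-in (the full ten-point matrix for A-J is not reproduced here); the function itself mirrors the "at least n neighbors within distance r" test described above.

```python
def select_dense_points(dist_matrix, n, r):
    """Return the points that have at least n neighbors within distance r,
    reading distances from a symmetric matrix given as a dict of dicts
    with a zero diagonal."""
    selected = []
    for p, row in dist_matrix.items():
        neighbors = [q for q, d in row.items() if q != p and d <= r]
        if len(neighbors) >= n:
            selected.append(p)
    return sorted(selected)
```

Points failing the test (like point F in the example, by implication) are simply left out of the data set and take no part in the state updates.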
  • Each point is assigned an initial state value. In this example, the selected points are each assigned a different integer value. However, non-integer values may be used and may be arbitrarily assigned. The following indicates the initial state values for each of the points:
  • point B has a state value of 0.
  • point B has two neighbors in the clustering process: point D with a state value of 2 and point H with a state value of 5. Accordingly, since point B has the lowest state value (i.e., 0) when compared with its neighbors' state values (i.e., 2 and 5), the state value of point B remains unchanged at 0.
  • point C has a state value of 1 and has one neighbor in the clustering process: point G with a state value of 4. Accordingly, since point C has the lowest state value (i.e., 1) when compared with its neighbor's state value (i.e., 4), the state value of point C remains unchanged at 1.
  • Point D has a state value of 2 and has two neighbors in the clustering process: point B with a state value of 0 (which is not the updated value) and point H with a state value of 5.
  • a second update stage using the rules of the first update stage is performed.
  • the difference in the second update stage is that the updated state values from the first update stage, instead of the initial state values, are employed.
  • point E has a state value of 3 after the first update stage.
  • Point E has two neighbors: point G with a state value of 1 after the first update stage and point I with a state value of 3 after the first update stage. Accordingly, point E takes on the lowest state value of 1 from neighbor G.
  • point I has a state value of 3 after the first update stage.
  • Point I has two neighbors: point E with a state value of 3 after the first update stage and point G with a state value of 1 after the first update stage. Accordingly, point I takes on the lowest state value of 1 from neighbor G.
  • a third update stage using the rules of the second update stage is performed.
  • the difference in the third update stage is that the updated state values from the second update stage are employed.
  • the result of the above-disclosed clustering process is a single cluster containing all of the points A to I.
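The update stages walked through above can be replayed in code. The neighbor relations and initial state values below are taken from the worked example (B:0, C:1, D:2, E:3, G:4, H:5); point I's initial value of 6 is an inference from the points being labeled with distinct integers, and is consistent with its stated value of 3 after the first stage. Running the loop reproduces the two groups {B, D, H} and {C, E, G, I} discussed in connection with FIG. 6.

```python
# Neighbor relations as stated in the example; point F was not selected,
# so it takes no part in the clustering process.
neighbors = {
    "B": ["D", "H"], "C": ["G"], "D": ["B", "H"], "E": ["G", "I"],
    "G": ["C", "E", "I"], "H": ["B", "D"], "I": ["E", "G"],
}
state = {"B": 0, "C": 1, "D": 2, "E": 3, "G": 4, "H": 5, "I": 6}

changed = True
while changed:
    changed = False
    old = dict(state)  # every point updates from the previous stage's values
    for p in state:
        lowest = min([old[q] for q in neighbors[p]] + [old[p]])
        if lowest != state[p]:
            state[p], changed = lowest, True

# Group points by stabilized state value.
clusters = {}
for p, s in sorted(state.items()):
    clusters.setdefault(s, []).append(p)
```

After three stages the states stop changing, leaving state 0 for B, D and H and state 1 for C, E, G and I.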
  • FIG. 6 illustrates the clustering information shown on different hierarchical levels of a hierarchy.
  • a first cluster is formed by the set including the points B, D and H.
  • a second cluster is formed by the set including the points C, E, G and I.
  • FIG. 6 illustrates a hierarchy generated by keeping a first cluster parameter n constant and by changing a second cluster parameter r.
  • Another hierarchy or another aspect of the same hierarchy is illustrated by keeping the second cluster parameter r constant and changing the first cluster parameter n.
  • cluster parameters n and r are both changed.
  • the hierarchy includes more or less than two hierarchical levels in other embodiments.
  • an exemplary embodiment provides that at least some steps of the process (e.g., the update stage) are performed in parallel and/or simultaneously, wherein the terms "in parallel” and “simultaneously” have overlapping meanings.
  • the updating of all the state values of respective data points in the data set can be performed in parallel since the updating process uses the state values from the previously completed update stage.
  • the state values can be updated separately from each other.
  • the data points are split among processes.
  • selected steps are performed in parallel, while other steps are performed sequentially.
  • Parallel processing often reduces processing time and embodiments are scalable for massively parallel computations, in particular, when the number of data points becomes very large. Such parallel processing is achieved, for example, by one or more processors and/or state machines.
  • FIG. 2 illustrates a particular order of steps
  • the present invention also contemplates other orderings and groupings.
  • the present invention includes fewer or more steps than illustrated in FIG. 2.
  • the present invention contemplates that a process is formed from a subset of the steps illustrated in FIG. 2 such as, for example, a process including steps 140 to 160.
  • additional steps not illustrated in FIG. 2 are included such as, for example, forming and/or displaying hierarchical levels and/or at least some aspects of one or more hierarchies.
  • Data sources for the above-described application can be obtained from microarrays and DNA chips, which give expression levels for hundreds to thousands of genes.
  • the methods and systems of the present invention can be used to process the data from microarrays and DNA chips, obtained from either a single experiment or multiple experiments, to group together genes whose expression profiles are similar to each other.
  • the data source can also be nucleotide sequences.
  • the methods and systems of the present invention can also be used to align nucleotide sequences in order to produce a global alignment of the sequences collected from an organism or across organisms.
  • Other examples of applications of the systems and methods of the present invention include determining the socio-economic demographic of the world population or of each country or city in the world or hemisphere.
  • the source data could be the World Bank statistics of countries from a selected period of time.
  • the data could include various quality of life factors such as state of health, nutrition, educational service, etc.
  • countries that have similar values will be grouped together with each group assigned its own unique color.
  • the socio-economic demographic of each country of the world can then be visualized in a straightforward manner, wherein each country on the geographic map is colored according to its socio-economic type.
  • Embodiments of the present invention have many advantages over existing technology.
  • the dependence of the clustering results on the selection of parameters has been minimized so that the natural or true structure of the data can be revealed.
  • Using simple unified state transition rules allows computations to be carried out much faster or in a parallel fashion, especially for problems involving a large data set.
  • Various embodiments have computational complexities of O(n), after the distance matrix computation. Searching clusters by simultaneous state transition operations and constructing a cluster hierarchy by continuous parameter changes provide these and other advantages.

Abstract

System and method cluster data by searching clusters through simultaneous state transition operations and constructing a cluster hierarchy by continuous parameter changes. Parallel systems and methods may be used. A data set is created (130) from the data points by applying selected cluster parameter values. After assigning a different initial state value to each data point in the data set (140), the state value for each data point in the data set is updated according to one or more rules that are related to the cluster parameters. After the update process has been applied to the entire data set (150), the update process is repeated for the entire data set until the state values of respective data points in the data set have stabilized. Data clusters can be formed as a function of the stabilized state values. The process can be repeated for different cluster parameter values (190).

Description

SYSTEM AND METHOD FOR CLUSTERING DATA
FIELD OF THE INVENTION
The present invention relates generally to analysis of data, and more particularly, to a method and apparatus for data clustering.
BACKGROUND OF THE INVENTION
Clustering is a type of pattern recognition. Clustering is a process of organizing data into clusters by revealing naturally occurring patterns or structures. The resulting clusters allow users to discover similarities and differences among patterns and to derive useful conclusions about them. Some general applications of clustering are data reduction, hypothesis generation, hypothesis testing, and prediction based on clusters.
Clustering is useful in many fields such as life sciences, medical sciences, social sciences, earth sciences, and engineering. Clustering is often found under different names in different contexts such as data mining in software engineering, machine learning in pattern recognition, numerical taxonomy in biology and ecology, typology in social sciences, and partition in graph theory. One example of clustering is grouping genes according to similarities in their expression patterns. Other examples are helping marketers discover and characterize customers, identifying areas of similar land use in an earth observation database, identifying groups of automobile insurance policy holders with a high average claim cost, and identifying groups of houses in an area according to house types, value, and geographic location.
Some conventional clustering methods are not efficient for large data sets and suffer from long computation times. For example, the agglomerative type of hierarchical clustering algorithms have a computational complexity of O(n³). Other methods have inflexible clustering criteria which result in clusters that are either too coarse or too fine, so that the natural patterns in the data are missed. Also, some methods, such as partitioning methods, try to fit data to predefined or arbitrary patterns and, thus, they too are unable to reveal the natural patterns in the data. In addition, few methods are scalable for massively parallel computation. Therefore, a need exists for a scalable method that lends itself to parallel computation and that employs flexible clustering criteria.
SUMMARY OF THE INVENTION
The present invention provides systems and methods that cluster data in a data space. The present invention is scalable and thus allows for parallel computation. Additionally, the clustering criteria of the present invention impose minimal a priori restriction on the data and thus allow for a more natural clustering that does not obscure the natural patterns in the data.
In embodiments of the present invention, values of cluster parameters are selected and a data set is created from the data points in the data space that satisfy the selected cluster parameter values. Each data point in the data set is assigned a different initial state value. The state value for each data point in the data set is updated according to one or more rules that are related to the cluster parameters.
The update process is capable of being performed in a substantially parallel manner. After the update process has been applied to the entire data set, the update process is repeated for the entire data set until the state values of respective data points in the data set have stabilized. For example, stabilization of state values is indicated by all of the state values of the respective data points in the data set remaining unchanged after a completed update process of the entire data set.
The data points in the data set can be grouped into data clusters as a function of the state values. For example, data points with the same stabilized state value belong to the same data cluster. The process can be repeated for different cluster parameter values. Thus, for example, data clusters can be discovered for different sets of cluster parameter values to reveal the natural patterns of the data.
Other features and advantages will become apparent from the following detailed description, drawings, and claims. BRIEF DESCRIPTION OF THE DRAWINGS
The invention is pointed out with particularity in the appended claims. The advantages of the invention described above, as well as further advantages of the invention, are better understood by reference to the following detailed description taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a representation illustrating an exemplary embodiment of a system that clusters data according to the present invention.
FIG. 2 is a flowchart illustrating an exemplary embodiment of a method for clustering data according to the present invention.
FIGS. 3, 4, and 5 are flowcharts of alternative embodiments of methods for clustering data according to the present invention.
FIG. 6 is a representation illustrating an exemplary embodiment of a hierarchy or an aspect of a hierarchy according to the present invention.
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS
The present invention generally relates to systems and methods that cluster data. In an exemplary embodiment, the present invention provides a system and a method for discovering data clusters in a data space. An exemplary embodiment of a system 10 according to the present invention is illustrated in FIG. 1. The system 10 includes a computing unit 90 such as at least one data processor, at least one computer (e.g., a Beowulf cluster, a supercomputer, a server, a mainframe, a desktop, a laptop, a notebook, a portable or a handheld computer) and/or any equivalents thereof. The computing unit 90 includes a controller 20, a memory 30, a bus 40, an input device 50 and/or an output device 60. Optionally, the system 10 includes a data device 70 (e.g., an external data storage device or a remote data storage device) and/or a link 80 (e.g., a cable link or a wireless link).
The controller 20 includes a computing device or a plurality of computing devices (e.g., processors, microprocessors and/or state machines). The memory 30 includes volatile memory components and/or non- volatile memory components. The controller 20 and the memory 30 are in two-way communications with the bus 40. The input device 50 (e.g., receiver, microphone, mouse, keypad, keyboard, sensor and/or touch-sensitive display) and the output device 60 (e.g., display, speaker and/or transmitter) are in at least one-way communications with the bus 40. The data device 70 (e.g., a conventional memory and/or conventional data storage device with or without computing power) is in at least one-way communications with, for example, the bus 40 via the link 80. The link 80 also provides a conventional input/output interface with the bus 40.
Although illustrated as connected via the bus 40, in other embodiments the components of the system 10 are connected directly to each other in addition to or instead of being connected via the bus 40. Other conventional methods of communicating between components (e.g., conventional wireless communications means) are employed in some embodiments. Furthermore, various levels of integration between components are contemplated by the present invention. For example, any component may be integrated in part or in whole with any other component or components.
The controller 20 controls data flow and/or access on the bus 40. Programs and/or data are accessed by the controller 20 from, for example, the memory 30, the input device 50 and/or the data device 70. The input device 50 provides a user interface for entering data and/or commands (e.g., via a keyboard, programming the system 10 or entering values for parameters during the execution of a program). The output device 60 provides, for example, a display and/or an interface that transmits information to a user or to another device, under control of the controller 20.
A method 100 (shown in Figure 2) that clusters data according to an exemplary embodiment of the present invention is stored, for example, in the memory 30, the controller 20 or some combination thereof. Furthermore, the method 100 is encompassed in software, hardware (e.g., an application specific integrated circuit (ASIC)) or some combination thereof.
A data space represents the set of all data points from which data sets are generated and/or from which clusters are formed. Each data point includes a plurality of dimensions and/or degrees of freedom. The data space is stored, at least partly, in the memory 30, the controller 20, the data device 70 or some combination thereof. Furthermore, information including data points is transmitted and stored between, for example, the memory 30, the controller 20 and the data device 70. For example, if the data space is not efficiently stored within the memory 30 and/or the controller 20, then the data device 70 via the link 80 at least partly provides storage for the data space. Furthermore, the data device 70 provides additional computing power that can process at least portions of the data space stored in the data device 70 in some embodiments.
In operation, the controller 20 controls the other components of the system 10. For example, the controller 20 accesses the method 100 stored in the memory 30. The controller 20 then executes the method 100 and processes the data points received via the bus 40, the memory 30 and/or the data device 70. In one example, the memory 30, the data device 70, the input device 50 and/or the controller 20 serve as a source of data points. In another example, the memory 30 provides a local cache corresponding to the memory of the data device 70.
At least a portion of the data can be processed substantially in parallel by the controller 20, in one embodiment. In one example, parallel processing is achieved via one or more processors and/or state machines. Furthermore, the controller 20 can process data stored in the memory 30, the data device 70, the controller 20 or some combination thereof. In another example, the processing power is distributed between the controller 20 and the data device 70.
The controller 20, in executing the method 100 according to an exemplary embodiment of the present invention, creates at least one data set from the data space. In one example, the data set is further processed via a substantially parallel and possibly iterative process resulting in the clustering of data. By changing cluster conditions (e.g., cluster parameter values), possibly different data sets are generated from the data space resulting in one or more possibly different data clusters. The clusters of data are organized, for example, according to cluster conditions to illustrate one or more hierarchical levels in one or more hierarchies.
In FIG. 2, a flowchart shows an exemplary embodiment of the method 100 that clusters data from a data space according to the present invention. The method 100 begins in step 110 and proceeds with the selection of values for cluster parameters in step 120. In some embodiments, the values for the cluster parameters are automatically selected. For example, the method 100 automatically selects initial values for the cluster parameters and automatically changes the values for the cluster parameters. For example, one or more of the cluster parameter values are increased or decreased by the integer multiples of a particular resolution value up to or down to a particular threshold value. Alternatively, the values for the cluster parameters are selected or updated manually by an operator.
The number and/or type of cluster parameters are preset or chosen as a function of, for example, the application and/or the data point type. For example, the method 100 employs two cluster parameters, namely a number of neighbors n and a radius r.
In step 130, a data set is generated by selecting data points of the data space that satisfy the particular values of the cluster parameters. For example, the data set includes data points that have at least n neighbors within a radius r. In another example, the data set includes data points that satisfy at least one of the cluster parameter values. Other methods such as conventional methods for applying the cluster parameter values are used to create the data set in some embodiments.
In step 140, each of the data points in the data set is assigned a different initial state value. Integers and other types of numbers or representations are used for initial state values.
In step 150, the state value of each data point in the data set is updated according to particular rules. The rules are preset by the user, programmed by the user and/or selected automatically. In one embodiment, the rules are selected automatically as a function of the type of data space being processed or the type of application. In another embodiment, the rule is related to the cluster parameters. An example rule is that a particular data point should be given the lowest state value in a corresponding neighborhood as defined by the cluster parameters. In an exemplary embodiment, the step 150 is carried out as a parallel process with all the data points simultaneously undergoing the updating process in step 150. Parallel processing usually reduces computing time, especially when the data sets are very large.
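For illustration, one synchronous pass of this update step can be sketched as follows. The function and data-structure names here are our own assumptions, not the patented implementation; the neighborhoods are assumed to have been precomputed from the cluster parameters n and r:

```python
def update_pass(states, neighbors):
    """One synchronous update stage: every data point takes the lowest
    state value found in its neighborhood (itself included). All points
    read the old state values, so the pass can also run in parallel."""
    new_states = {p: min([states[p]] + [states[q] for q in nbrs])
                  for p, nbrs in neighbors.items()}
    changed = new_states != states
    return new_states, changed
```

Repeating `update_pass` until `changed` is false corresponds to the stabilization test of step 160.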
In step 160, the method 100 determines whether or not any state values were changed in the previous step (i.e., step 150). If a state value of any of the data points was changed, then the method 100 jumps back to step 150. If no state values were changed, then the state values have stabilized and the method 100 proceeds with step 170.
Step 170 determines the clusters according to state values. In an exemplary embodiment, the data points are grouped into clusters according to state value.
Thus, for example, a first cluster is formed from the data points with a first state value. Additionally, a second cluster is formed from the data points with a second state value. Thus, in this example, the number of clusters is determined by the number of different state values.
In step 180, the method 100 determines whether or not the value of any cluster parameters are to be changed. If not, then the method 100 terminates in step 190. Otherwise, if at least one of the cluster parameter values is to be changed, then the method 100 updates the particular cluster parameter value (step 190) and the process proceeds again to step 130 in which possibly different data sets are generated according to the one or more updated cluster parameter values. Any of the parameter values are manually and/or automatically changeable by the method 100. Furthermore, the change in at least one cluster parameter value is by a constant or variable incremental amount. In one embodiment, the incremental amounts are selected according to the noise level of the data. The changes are by addition, subtraction, multiplication, division or any other conventional methods for changing the value of a particular parameter. For example, for cluster parameters n and r as described above, for a particular value of n, the values of r are changed by adding a constant value to r until r reaches a particular threshold value. In another example, for a particular value of r, the values of n are changed by subtracting a constant value from n until n reaches a particular threshold value. Each set of cluster parameter values generates a possibly different data set and possibly clusters of different composition, size and number.
Furthermore, each set of cluster parameter values form, for example, at least portions of hierarchical levels in one or more hierarchies.
FIG. 3 is a flowchart of an alternative embodiment of a method 300 for clustering data according to the present invention. Cluster criteria are selected 302 and a different state is assigned to each data point in the data 304. The state of each data point is updated according to at least one rule that is a function of the cluster criteria 306 and this is repeated until all of the states remain unchanged 308. In another embodiment, at least one of the cluster criteria is changed and the method 300 is repeated for the new criteria. In another embodiment, data points grouped by states are displayed as a result of method 300. In another embodiment, the data points are grouped by states in a hierarchy according to cluster criteria.
FIG. 4 is a flowchart of another alternative embodiment of a method 400 for clustering data according to the present invention. Data points are selected from a k-dimensional space that have at least n neighboring data points with a similarity measure less than or equal to r 402. Each selected data point is labeled with a unique initial state 404. The state of each labeled data point is updated to the lowest state in its neighborhood, if the state differs from the lowest state in its neighborhood 406. The updating is repeated until there is no state change in the k-dimensional space 408. In one embodiment, each data point represents a gene and its characteristics. In another embodiment, k, n, and r are predetermined values. In another embodiment, the states are updated simultaneously. In another embodiment, genes grouped by state are displayed. In another embodiment, r is increased by a resolution Δr and the method is repeated for the new r. In another embodiment, the resolution Δr is selected according to noise level. In another embodiment, the resulting clusters are provided before selecting the resolution Δr so that the resolution Δr is selected according to resulting clusters. In another embodiment, r is varied by Δr over a range of values to produce a hierarchy of clusters. In another embodiment, the hierarchy of clusters is displayed.
FIG. 5 is a flowchart of another alternative embodiment of a method 500 for clustering data according to the present invention. A system for clustering data comprises one or more memory units that store at least a portion of a data space and a controller coupled to the one or more memory units. The data space contains a plurality of data points. The controller includes a plurality of computing devices that operate to perform a method. A state value is updated for each data point according to at least one rule that is a function of cluster criteria 502. The cluster criteria comprise a minimum number of neighbors (n) and a similarity value (r). The similarity value (r) is increased by a pre-determined increment and, then, the updating is repeated 504. In one embodiment, the method is performed by the computing devices in parallel. In another embodiment, the controller and the plurality of computing devices are part of a multicomputer architecture capable of parallel processing. In another embodiment, the controller and the plurality of computing devices are part of a supercomputer. In another embodiment, the method further comprises displaying a hierarchy of data points grouped by state values over a range of similarity values (r).
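Because each data point reads only the previous stage's state values, one update stage can be split among independent workers. A minimal thread-based sketch (our own illustrative code, assuming precomputed neighborhoods, not the patented implementation) is:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_update_pass(states, neighbors, workers=4):
    """One synchronous update stage split among worker threads. Every
    worker reads only the old state values, so the points can be
    processed independently and the result is identical to a
    sequential pass."""
    def lowest(p):
        # Lowest state in the neighborhood of p, including p itself.
        return p, min([states[p]] + [states[q] for q in neighbors[p]])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lowest, list(states)))
```

The same partitioning idea extends to processes or compute nodes in a multicomputer architecture, with only the neighborhood lookups needing shared access.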
An example illustrating the operation of a method according to the present invention will be described. A sample method is shown in Table 1. In the sample method, n represents the minimum number of neighbors required and r represents a value of a similarity measure. The parameters n and r allow for control of data density. A similarity measure is chosen depending on the application. Some examples of similarity measures are the Euclidean distance d(x, y) = √(Σᵢ (xᵢ − yᵢ)²) and positive linear correlations. Additional examples of similarity measures include the family of Minkowski metrics, of which the Euclidean distance is a member. An example of a data point is a gene and its characteristics, such as an expression pattern. A group is a collection of similar data points. Neighbors are data points having a defined similarity measure with respect to a particular data point. States are associated with data points and identify groups or clusters. A k-dimensional space is a data set of size k, such as k number of experiments. The example is to be construed merely as an illustration and is not to be construed as a limitation in any manner.
  • Step 1: Select data points that have at least n neighboring data points within a given radius r. The initial value of the minimum neighbors requirement n and the neighborhood size r are pre-determined. The distance between any two data points can be calculated in a k-dimensional space.
• Step 2: Label each selected data point with a unique integer i, which becomes the initial state of the data point.
• Step 3: Simultaneously update the state of all labeled data points according to the following rules: o Change the state of the data point under consideration to the lowest state that occurred in its neighborhood. o Keep the state of the data point unchanged if its state is the lowest one in its neighborhood.
• Step 4: Repeat step 3 until there is no state change in the entire data space. • Step 5: Output the groups of the data points that have the same state as the clusters formed at the parameter point (r, n).
  • Step 6: Increase the value of parameter r (or n) by a user specified resolution Δr and repeat steps 1-5 to create the set of the lower level clusters with a smaller neighborhood (or number of minimum neighbors requirement). • Step 7: Repeat step 6 to cover a meaningful range in the parameter space. This will produce a hierarchy of clusters with respect to the selected resolution Δr.
• Steps 3, 6, and 7 can all be carried out in a massively parallel fashion.
Table 1. Sample method.
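Steps 1-5 of Table 1, read sequentially, might be sketched as follows. This is an illustrative interpretation only — the function and variable names are ours, and `dist(p, q)` stands in for whatever similarity measure the application chooses:

```python
def cluster(points, dist, n, r):
    """Sketch of Steps 1-5 of Table 1 for one parameter point (r, n).

    points: iterable of sortable point identifiers
    dist:   similarity measure between two point identifiers
    n, r:   minimum-neighbor count and neighborhood radius
    """
    points = list(points)
    # Step 1: keep points with at least n neighbors within radius r.
    nbrs = {p: [q for q in points if q != p and dist(p, q) <= r]
            for p in points}
    selected = [p for p in points if len(nbrs[p]) >= n]
    nbrs = {p: [q for q in nbrs[p] if q in selected] for p in selected}
    # Step 2: label each selected point with a unique integer state.
    states = {p: i for i, p in enumerate(sorted(selected))}
    # Steps 3-4: synchronously lower every state to the minimum in its
    # neighborhood until an entire pass leaves all states unchanged.
    while True:
        new = {p: min([states[p]] + [states[q] for q in nbrs[p]])
               for p in selected}
        if new == states:
            break
        states = new
    # Step 5: points sharing a final state form one cluster.
    groups = {}
    for p, s in states.items():
        groups.setdefault(s, set()).add(p)
    return list(groups.values())
```

Steps 6 and 7 then amount to calling `cluster` again with r increased by Δr (or n decreased) and collecting the resulting cluster sets into a hierarchy.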
As an example, a data space includes ten data points labeled A to J. Each data point represents, for example, an m-dimensional space where m is an integer. Thus, for example, point A includes m parameters that are represented, for example, as the m coordinates of point A, e.g., (A0, A1, ..., Am-1). These parameters represent information such as measurement(s), characteristic(s) or representative value(s) of the same type(s) or different type(s).
For example, each data point represents a particular person or group of persons with a particular financial history. Information stored in the coordinates of each data point includes, for example, income data, asset data, debt data, liability data, overhead data, statistical scores relating to personal financial history and other data that is relevant as to whether or not a bank should approve a loan or extend a pre-approved credit card offer to a particular person or group of persons. In another example, each data point represents a test subject (e.g., one or more organisms, cells, organic material, DNA, etc.) that is the focus of scientific research. Information stored in the coordinates of each data point includes, for example, test conditions, subject characteristics, statistical data relating to the test conditions and/or subject. It will be appreciated that these are merely illustrations and not intended to limit the present invention in any way. The systems and methods for clustering data according to the present invention find application in a wide variety of applications in which information is processed and/or analyzed.
In one embodiment, a distance is determined between each point and every other point of the data set. The term "distance" includes, for example, conventional spatial distances or its equivalent between two points (e.g., A and B) as represented, for example, by the conventional relation of the square root of the sum of the square of the differences of corresponding coordinates between two points. The term "distance" also includes, for example, a conventional correlation value or its equivalent. Thus, for example, the distance is a normalized correlation between two points (e.g., A and B) subtracted from an offset such as, for example, one. Accordingly, the distance between, for example, point A and itself would be zero. The term "distance" need not be limited to any one of the above-identified embodiments, but includes any conventional mathematical (e.g., statistical) parameters as known to one of ordinary skill in the art.
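Both notions of "distance" described above can be written out directly. A small sketch (assuming plain Python sequences as points; the function names are ours):

```python
import math

def euclidean_distance(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def correlation_distance(x, y):
    # Normalized (Pearson) correlation subtracted from an offset of one,
    # so that perfectly correlated points are at distance zero.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)
```

With either function, the distance between a point and itself is zero, as the text requires.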
Regardless of the method used in determining the distance between points in a data space, the distance information is storable in a matrix. The example matrix below is for storing distances between the points of a data set and contains ten points A-J. However, the information need not be stored in a matrix format and there can be more or less than ten data points. For example, other methods for organizing and/or keeping track of data (e.g., coordinate systems, pointers, etc.) are also employed.

    A  B  C  D  E  F  G  H  I  J
A   0  1  2  4  6  7  5  8  5  2
B   1  0  3  1  4  3  7  1  2  9
C   2  3  0  5  4  1  1  2  2  4
D   4  1  5  0  7  2  6  1  2  3
E   6  4  4  7  0  5  1  4  1  9
F   7  3  1  2  5  0  7  2  8  2
G   5  7  1  6  1  7  0  3  1  9
H   8  1  2  1  4  2  3  0  9  1
I   5  2  2  2  1  8  1  9  0  3
J   2  9  4  3  9  2  9  1  3  0
As illustrated in the matrix by the main diagonal containing zeroes, the distance between a particular point and itself is zero. The distance between point A and point B is, for example, shown to have a value of one. This value is reflected in row one, column two, and row two, column one. Other distance values between points are also stored in the matrix. Although the distance values are shown in the matrix to be integers, other types of numbers (e.g., non-integers) are also stored. Integers were employed, in this example, to simplify the discussion.
Cluster parameter values are then selected. For example, for clustering purposes, a particular data point should have at least two neighbors (i.e., n = 2) that are within a distance of one (i.e., r = 1) with respect to the particular data point. By selecting such parameter values, it is clear from the matrix, in this example, that point A should not be considered in the clustering process. Point A has only one neighbor (i.e., point B) within a distance of 1. Using the same analysis, point B, for example, should be considered in the clustering process. Point B has at least two neighbors (i.e., points A, D and H), each within a distance of 1 with respect to point B. Similarly, point C has at least two neighbors (i.e., points F and G), each within a distance of 1. Using such a process, it is determined that points B, C, D, E, G, H and I each have at least two neighbors, each within a distance of 1.
Each point is assigned an initial state value. In this example, the selected points are each assigned a different integer value. However, non-integer values may be used and may be arbitrarily assigned. The following indicates the initial state values for each of the points:
B => 0
C => 1
D => 2
E => 3
G => 4
H => 5
I => 6.
Each state then is updated according to the following rules:
(1) if a particular data point under consideration has a state value that is the lowest state value in its neighborhood, then the state value of the particular data point under consideration remains unchanged; and
(2) if the particular data point under consideration has a state value that is greater than the lowest state value in its neighborhood, then the state value of the particular data point under consideration should be changed to the lowest state value in its neighborhood.
During the first update stage, point B has a state value of 0. For example, point B has two neighbors in the clustering process: point D with a state value of 2 and point H with a state value of 5. Accordingly, since point B has the lowest state value (i.e., 0) when compared with its neighbors' state values (i.e., 2 and 5), the state value of point B remains unchanged at 0. In a further example, point C has a state value of 1 and has one neighbor in the clustering process: point G with a state value of 4. Accordingly, since point C has the lowest state value (i.e., 1) when compared with its neighbor's state value (i.e., 4), the state value of point C remains unchanged at 1. Point D has a state value of 2 and has two neighbors in the clustering process: point B with a state value of 0 (the value before the update) and point H with a state value of 5. Continuing with the analysis, it is evident that, after the first update stage, point E has the same state value and points G, H and I will be updated with lower state values. The results of the first update stage are summarized below:
B: 0 => 0
C: 1 => 1
D: 2 => 0
E: 3 => 3
G: 4 => 1
H: 5 => 0
I: 6 => 3.
A second update stage using the rules of the first update stage is performed. The difference in the second update stage is that the updated state values from the first update stage, instead of the initial state values, are employed. Thus, for example, point E has a state value of 3 after the first update stage. Point E has two neighbors: point G with a state value of 1 after the first update stage and point I with a state value of 3 after the first update stage. Accordingly, point E takes on the lowest state value of 1 from neighbor G. Furthermore, point I has a state value of 3 after the first update stage. Point I has two neighbors: point E with a state value of 3 after the first update stage and point G with a state value of 1 after the first update stage. Accordingly, point I takes on the lowest state value of 1 from neighbor G.
The results of the second update stage are summarized below:
B: 0 => 0 => 0
C: 1 => 1 => 1
D: 2 => 0 => 0
E: 3 => 3 => 1
G: 4 => 1 => 1
H: 5 => 0 => 0
I: 6 => 3 => 1.
A third update stage using the rules of the second update stage is performed.
The difference in the third update stage is that the updated state values from the second update stage are employed. For this example, the state values all remain the same and the clustering process is completed for the particular cluster parameters n = 2 and r = 1.
The clustering process for cluster parameters n = 2 and r = 1 is interpreted in an exemplary embodiment as initially having seven clusters corresponding to the seven initial state values (i.e., 0 to 6). After a first update stage, there were three clusters. A first cluster included those points with updated state values of 0 (i.e., points B, D and H). A second cluster included those points with updated state values of 1 (i.e., points C and G). A third cluster included those points with updated state values of 3 (i.e., points E and I). After a second and subsequent update stages, there were two clusters. A first cluster included those points with updated state values of 0 (i.e., points B, D and H). A second cluster included those points with updated state values of 1 (i.e., points C, E, G and I).
The process begins again by changing one or more cluster parameter values (e.g., incrementing or decrementing a cluster parameter by an integer or non-integer amount). For example, the radius r is changed (e.g., by increasing r by an integer value of 1) and the number of neighbors n is kept the same. Then the process is repeated for the updated cluster parameters of, for example, n = 2 and r = 2. The result of the above-disclosed clustering process is a single cluster containing all of the points A to J.
FIG. 6 illustrates the clustering information shown on different hierarchical levels of a hierarchy. The hierarchy illustrates that for the hierarchical level defined by the cluster parameters (n = 2, r = 1) the data set is grouped into two clusters. A first cluster is formed by the set including the points B, D and H. A second cluster is formed by the set including the points C, E, G and I. Another hierarchical level is defined by the cluster parameters (n = 2, r = 2). In such an example, there is only one cluster formed by the set including the points A-J. Thus, FIG. 6 illustrates a hierarchy generated by keeping a first cluster parameter n constant and by changing a second cluster parameter r. Another hierarchy or another aspect of the same hierarchy is illustrated by keeping the second cluster parameter r constant and changing the first cluster parameter n. In yet another example, cluster parameters n and r are both changed. Furthermore, although illustrated in FIG. 6 as having only two hierarchical levels, the hierarchy includes more or less than two hierarchical levels in other embodiments.
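The two hierarchical levels of FIG. 6 can be reproduced directly from the example distance matrix. In the sketch below (the helper names and data layout are our own illustration, not the patent's), the matrix is stored as a dictionary of pairwise distances and the state-transition process is run at both parameter points:

```python
# Upper triangle of the example distance matrix for points A-J.
D = {"AB": 1, "AC": 2, "AD": 4, "AE": 6, "AF": 7, "AG": 5, "AH": 8,
     "AI": 5, "AJ": 2, "BC": 3, "BD": 1, "BE": 4, "BF": 3, "BG": 7,
     "BH": 1, "BI": 2, "BJ": 9, "CD": 5, "CE": 4, "CF": 1, "CG": 1,
     "CH": 2, "CI": 2, "CJ": 4, "DE": 7, "DF": 2, "DG": 6, "DH": 1,
     "DI": 2, "DJ": 3, "EF": 5, "EG": 1, "EH": 4, "EI": 1, "EJ": 9,
     "FG": 7, "FH": 2, "FI": 8, "FJ": 2, "GH": 3, "GI": 1, "GJ": 9,
     "HI": 9, "HJ": 1, "IJ": 3}
POINTS = "ABCDEFGHIJ"

def dist(p, q):
    # The matrix is symmetric, so look up either orientation.
    return 0 if p == q else D.get(p + q, D.get(q + p))

def cluster(n, r):
    # Keep points having at least n neighbors within distance r.
    nbrs = {p: [q for q in POINTS if q != p and dist(p, q) <= r]
            for p in POINTS}
    sel = [p for p in POINTS if len(nbrs[p]) >= n]
    nbrs = {p: [q for q in nbrs[p] if q in sel] for p in sel}
    # Unique initial states, then synchronous updates until stable.
    states = {p: i for i, p in enumerate(sel)}
    while True:
        new = {p: min([states[p]] + [states[q] for q in nbrs[p]])
               for p in sel}
        if new == states:
            break
        states = new
    groups = {}
    for p, s in states.items():
        groups.setdefault(s, set()).add(p)
    return sorted(groups.values(), key=sorted)
```

Here `cluster(2, 1)` yields the two groups {B, D, H} and {C, E, G, I}, while `cluster(2, 2)` merges all ten points into a single cluster, matching the two hierarchical levels of FIG. 6.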
Although the data points were described as processed, at least in part, in a certain order, it will be appreciated by one of ordinary skill in the art that the process need not start with any particular point nor continue in any particular order through the relevant data points.
Furthermore, an exemplary embodiment provides that at least some steps of the process (e.g., the update stage) are performed in parallel and/or simultaneously, wherein the terms "in parallel" and "simultaneously" have overlapping meanings. For example, the updating of all the state values of respective data points in the data set can be performed in parallel since the updating process uses the state values from the previously completed update stage. Thus, the state values can be updated separately from each other. In an example embodiment, there is a process for each data point. In another embodiment, the data points are split among processes. In another embodiment, selected steps are performed in parallel, while other steps are performed sequentially. Parallel processing often reduces processing time and embodiments are scalable for massively parallel computations, in particular, when the number of data points becomes very large. Such parallel processing is achieved, for example, by one or more processors and/or state machines.
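Why the update stage parallelizes can be illustrated with a small sketch (hypothetical names, Euclidean distance assumed): each point's new state value is computed only from the previous stage's values, so the points may be visited in any order, or split among parallel workers, without changing the stage's result. Here this independence is demonstrated by deliberately randomizing the visit order.

```python
import math
import random

def update_stage(points, states, n, r):
    """One synchronous update stage. Every point reads only `states`
    (the previous stage's values) and writes into `new`, so the visit
    order has no effect on the stage's outcome."""
    new = states[:]
    order = list(range(len(points)))
    random.shuffle(order)  # deliberately randomized visit order
    for i in order:
        neigh = [j for j, q in enumerate(points)
                 if j != i and math.dist(points[i], q) <= r]
        if len(neigh) >= n:
            new[i] = min([states[i]] + [states[j] for j in neigh])
    return new
```

In a real parallel deployment, each worker would compute the new values for its share of the points against the same read-only snapshot of the previous stage.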
Although examples have been provided with particular steps, it is understood that the present invention need not be limited to those particular examples. For example, although FIG. 2 illustrates a particular order of steps, the present invention also contemplates other orderings and groupings. In addition, the present invention includes fewer or more steps than illustrated in FIG. 2. For example, the present invention contemplates that a process is formed from a subset of the steps illustrated in FIG. 2 such as, for example, a process including steps 140 to 160. In some embodiments, additional steps not illustrated in FIG. 2 are included such as, for example, forming and/or displaying hierarchical levels and/or at least some aspects of one or more hierarchies.
Further applications of methods and systems of the present invention include predicting gene functions by correlating gene expression profiles with gene functional classes; discovering gene regulatory elements by correlating gene expression with promoter regions; and correlating data to reconstruct gene regulation networks. Data sources for the above-described applications can be obtained from microarrays and DNA chips, which give expression levels for hundreds to thousands of genes. The methods and systems of the present invention can be used to process the data from microarrays and DNA chips, obtained from either a single experiment or multiple experiments, to group together genes whose expression profiles are similar to each other. The data source can also be nucleotide sequences. The methods and systems of the present invention can also be used to align nucleotide sequences in order to produce a global alignment of the sequences collected from an organism or across organisms.
Other examples of applications of the systems and methods of the present invention include determining the socio-economic demographics of the world population or of each country or city in the world or hemisphere. For such an application, the source data could be World Bank statistics of countries from a selected period of time. The data could include various quality-of-life factors such as state of health, nutrition, educational services, etc. Countries that have similar values will be grouped together, with each group assigned its own unique color. The socio-economic demographic of each country of the world can then be visualized in a straightforward manner, wherein each country on the geographic map is colored according to its socio-economic type.
In general, it should be emphasized that the various components of embodiments of the present invention can be implemented in hardware, software, or a combination thereof. In such embodiments, the various components and steps would be implemented in hardware and/or software to perform the functions of the present invention. Any presently available or future developed computer software language and/or hardware components can be employed in such embodiments of the present invention. For example, at least some of the functionality mentioned above could be implemented using C or C++ programming languages.
Thus, it is seen that systems and methods for clustering data are provided. One skilled in the art will appreciate that the present invention can be practiced by other than the preferred embodiments which are presented in this description for purposes of illustration and not of limitation and that numerous changes in the details of construction and combination and arrangement of processes and equipment may be made without departing from the spirit and scope of the invention, and the present invention is limited only by the claims that follow. It is noted that equivalents for the particular embodiments discussed in this description may practice the present invention as well.
Conclusion

Embodiments of the present invention have many advantages over existing technology. The dependence of the clustering results on the selection of parameters has been minimized so that the natural or true structure of the data can be revealed. Using simple unified state transition rules allows computations to be carried out much faster or in a parallel fashion, especially for problems involving a large data set. Various embodiments have computational complexities of O(n), after the distance matrix computation. Searching clusters by simultaneous state transition operations and constructing a cluster hierarchy by continuous parameter changes provide these and other advantages.

Claims

What is claimed is:
1. A computer-implemented method for clustering data, comprising: assigning a different state value to each data point in the data; updating the state value of each data point according to at least one rule as a function of selected cluster criteria; and repeating the updating until all of the states remain unchanged.
2. The method according to claim 1, further comprising displaying data points grouped by state values.
3. The method according to claim 1, further comprising changing at least one of the cluster criteria; and performing the assigning, updating, and repeating.
4. The method according to claim 3, further comprising: displaying data points grouped by state values in a hierarchy according to cluster criteria.
5. A computer-readable medium having computer-executable instructions for performing a method for clustering, comprising:
selecting data points from a k-dimensional space that have at least n neighboring data points with a similarity measure less than or equal to r;
labeling each selected data point with a unique initial state value;
updating the state value of each labeled data point to the lowest state value in its neighborhood, if its state value differs from the lowest state value in its neighborhood; and
repeating the updating until there is no state value change in the k-dimensional space.
6. The computer-readable medium according to claim 5, wherein each data point represents a gene and its characteristics.
7. The computer-readable medium according to claim 5, wherein each data point represents gene expression data obtained from at least one of a microarray or a DNA chip.
8. The computer-readable medium according to claim 5, wherein k, n, and r are predetermined values.
9. The computer-readable medium according to claim 5, wherein the state values are updated simultaneously.
10. The computer-readable medium according to claim 5, further comprising: displaying genes grouped by state values.
11. The computer-readable medium according to claim 5, further comprising increasing r by a resolution Δr, and repeating at least once the selecting, labeling, updating and repeating.
12. The computer-readable medium according to claim 11, wherein the resolution Δr is selected according to noise level.
13. The computer-readable medium according to claim 11, further comprising providing resulting clusters before selecting the resolution Δr, wherein the resolution Δr is selected according to resulting clusters.
14. The computer-readable medium according to claim 11, further comprising varying r by Δr over a range of values to produce a hierarchy of clusters.
15. The computer-readable medium according to claim 14, further comprising displaying the hierarchy of clusters.
16. A system for clustering data, comprising: one or more memory units that store at least a portion of a data space, the data space containing a plurality of data points; and
a controller coupled to the one or more memory units, the controller including a plurality of computing devices that operate to perform a method, the method comprising:
updating a state value for each data point according to at least one rule that is a function of cluster criteria, the cluster criteria comprising a minimum number of neighbors (n) and a similarity value (r); and increasing the similarity value (r) by a pre-determined increment and repeating the updating.
17. The system according to claim 16, wherein the method is performed by the computing devices in parallel.
18. The system according to claim 16, wherein the controller and the plurality of computing devices are part of a multicomputer architecture capable of parallel processing.
19. The system according to claim 16, wherein the controller and the plurality of computing devices are part of a supercomputer.
20. The system according to claim 16, wherein the method further comprises displaying a hierarchy of data points grouped by state values over a range of similarity values (r).
21. A method for clustering data, comprising: creating a data set including data points that have at least n neighbors and a similarity measure less than or equal to r, where n and r are selected cluster parameter values that define a particular cluster parameter point (n, r);
associating a different state value with each of the data points;
updating the state values according to the following rule:
if the state value of a particular data point is greater than any of the other state values associated with data points having similarity measure less than or equal to the similarity measure r of the particular data point, then the state value of the particular data point in the data set is changed to a lowest state value of all of the state values attributed to the data points that have similarity measure less than or equal to the similarity measure r of the particular data point; and
repeating the updating until the state values of all of the data points remain unchanged.
22. The method according to claim 21, further comprising the step of grouping the data points of the data set into clusters as a function of respective state values.
23. The method according to claim 21, further comprising repeating the method for cluster parameter point (n, r + kΔr), where kΔr is an integer multiple k of a specified resolution Δr.
24. The method according to claim 23, wherein k has an initial value of 1 or -1.
25. The method according to claim 23, further comprising the step of repeating the method for different integer values of k.
26. The method according to claim 21, further comprising the step of repeating the method for cluster parameter point (n + jΔn, r), where jΔn is an integer multiple j of a specified resolution Δn.
27. The method according to claim 26, wherein j has an initial value of 1 or -1.
28. The method according to claim 26, further comprising the step of repeating the method for different integer values of j.
29. A method for correlating gene expression with gene function comprising: assigning a different state number to each gene expression data point; updating the state number of each gene expression data point according to at least one rule, wherein said at least one rule is a function of selected cluster criteria; and repeating the updating until all of the states remain unchanged from one iteration to the next.
30. A method for clustering gene expression data comprising:
assigning to each gene expression data point a state value;
updating the state value for each gene expression data point, wherein said updating step comprises comparing the state values of at least two gene expression data points; and repeating said updating step until the state values are self consistent between two successive updating steps.
31. A method for clustering data, comprising:
means for creating a data set including data points that have at least n neighbors and a similarity measure less than or equal to r, where n and r are selected cluster parameter values that define a particular cluster parameter point (n, r);
means for associating a different state value with each of the data points;
means for updating the state values according to the following rule:
if the state value of a particular data point is greater than any of the other state values associated with data points having similarity measure less than or equal to the similarity measure r of the particular data point, then the state value of the particular data point in the data set is changed to a lowest state value of all of the state values attributed to the data points that have similarity measure less than or equal to the similarity measure r of the particular data point; and
means for repeating the updating until the state values of all of the data points remain unchanged.
32. A method of discovering trends and features in economic data comprising:
assigning to each economic data point a state value;
updating the state value for each economic data point, wherein said updating step comprises comparing the state values of at least two economic data points; and
repeating said updating step until the state values are self consistent between successive updating steps.
33. The method according to claim 32, wherein said economic data point comprises m-parameters, wherein m is an integer.
34. A method of identifying economic sectors as clusters of assets with similar economic dynamics from a set of economic data points, said method comprising:
assigning to each economic data point a state value;
determining the state value of each economic data point according to at least one rule; and
repeating said determining the state value for each economic data point until the state values are self consistent between successive determining steps.
35. The method according to claim 34, wherein said at least one rule is a function of selected cluster criteria.
PCT/US2003/001806 2002-01-22 2003-01-22 System and method for clustering data WO2003063030A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US35057002P 2002-01-22 2002-01-22
US60/350,570 2002-01-22

Publications (1)

Publication Number Publication Date
WO2003063030A1 true WO2003063030A1 (en) 2003-07-31

Family

ID=27613403

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/001806 WO2003063030A1 (en) 2002-01-22 2003-01-22 System and method for clustering data

Country Status (1)

Country Link
WO (1) WO2003063030A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6263334B1 (en) * 1998-11-11 2001-07-17 Microsoft Corporation Density-based indexing method for efficient execution of high dimensional nearest-neighbor queries on large databases
US6529891B1 (en) * 1997-12-04 2003-03-04 Microsoft Corporation Automatic determination of the number of clusters by mixtures of bayesian networks

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340104A (en) * 2020-02-24 2020-06-26 中移(杭州)信息技术有限公司 Method and device for generating control rule of intelligent device, electronic device and readable storage medium
CN111340104B (en) * 2020-02-24 2023-10-31 中移(杭州)信息技术有限公司 Method and device for generating control rules of intelligent equipment, electronic equipment and readable storage medium


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SC SD SE SG SK SL TJ TM TR TT TZ UA UG US UZ VC VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP