WO2002015122A2 - A system and method for a greedy pairwise clustering - Google Patents

A system and method for a greedy pairwise clustering

Info

Publication number
WO2002015122A2
WO2002015122A2 PCT/IB2001/001892
Authority
WO
WIPO (PCT)
Prior art keywords
similarity
cluster
score
value
heap
Prior art date
Application number
PCT/IB2001/001892
Other languages
French (fr)
Other versions
WO2002015122A3 (en)
Inventor
Eliyahu Dichterman
Gideon Maliniak
Ori Berger
Original Assignee
Camelot Information Technologies Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Camelot Information Technologies Ltd. filed Critical Camelot Information Technologies Ltd.
Priority to AU2001294089A priority Critical patent/AU2001294089A1/en
Publication of WO2002015122A2 publication Critical patent/WO2002015122A2/en
Publication of WO2002015122A3 publication Critical patent/WO2002015122A3/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/102 Entity profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/107 License processing; Key processing
    • G06F21/1078 Logging; Metering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/552 Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/554 Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/604 Tools and structures for managing or administering access control systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/105 Multiple levels of security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2141 Access rights, e.g. capability lists, access control lists, access tables, access matrices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2145 Inheriting rights or properties, e.g., propagation of permissions or restrictions within a hierarchy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2149 Restricted operating environment

Definitions

  • This disclosure teaches techniques for pairwise clustering of data elements. More specifically, the teachings include, but are not limited to, systems and methods for the implementation of a greedy pairwise clustering of data elements by iteratively merging the best elements and/or clusters.
  • Clustering Assignment - a specific case that represents an allocation of elements to one or more clusters.
  • A clustering assignment indicates which element belongs to each cluster.
  • Cluster - a set of one or more elements.
  • Merge - the operation of combining two clusters, including a cluster containing a single element, to form a merged cluster such that all the elements contained in each cluster become part of the merged cluster.
  • Score - a real number representing a quantitative and qualitative indication of how good a clustering assignment is.
  • Clustering assignment problems are conventionally considered a subset of pattern recognition problems. Solutions to clustering assignment problems are well-known. In such clustering assignment problems, the goal is to partition a set of data points into multiple sets of data points. These multiple sets of data points are commonly referred to as clusters. Clearly, the partitioning (or clustering) of data points is influenced by assumptions regarding the origin of these data points that form the cluster. The particular field of application in which the clusters will be used also influences the partitioning.
  • FIGs. 1A and 1B are examples of two possible clustering assignments for elements "X1" through "X13" organized in a two-dimensional projection.
  • The first clustering assignment is shown in FIG. 1A, where there are three clusters 10, 11, and 12 containing {X1, X2, X3, X4, X5}, {X6, X7, X8, X9}, and {X10, X11, X12, X13} respectively.
  • FIG. 1B shows a second possible clustering assignment, using the same data points.
  • In this clustering assignment, there are four clusters 15, 16, 17, and 18 containing {X1, X2, X3, X5}, {X4, X9}, {X6, X7, X8}, and {X10, X11, X12, X13} respectively.
  • Such conventional clustering assignment techniques have been used, for example, in pattern and image processing (US patents 5,519,789, 5,537,491, 5,703,964, 5,857,169, 6,038,340), speech analysis (US patents 4,181,821, 4,837,831, 5,276,766, 5,806,030), placement (US patents 5,202,840, 5,566,078, 5,682,321), data clustering in databases (US Patents 5,706,503, 5,832,182, 5,940,832), and general clustering and clustering analysis (US patents 5,040,133, 5,389,936, 5,404,561, 5,764,824, 6,021,383, 6,026,397, 6,031,979, 6,134,541).
  • The disclosed teachings provide a clustering system.
  • The system comprises an initializer, a merger, and a selector. All three units are further connected to an assigner database.
  • The initializer is adapted to accept pairwise similarity data as input and to provide the merger and the assigner database with the initialized data ready for the clustering assignment process.
  • The merger is adapted to perform the repetitive task of calculating delta values, determining the clustering assignment with the highest score (obtained from the highest delta) among all pairwise clustering options in each round of calculations, and updating the assigner database.
  • The selector is adapted to determine the clustering assignment having the highest overall score.
  • The assigner database is adapted to hold the initialized pairwise similarity data and updates thereof (such updates are made by the merger), as well as other data structures as may be required for the operation of the system.
  • The initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or a cluster to all other elements or clusters. More specifically, the initial similarity value may instead be an average of the similarity of all other elements or clusters to said first one of an element or cluster.
  • The initial similarity values are organized in a two-dimensional table.
  • The initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same row in the table.
  • Alternatively, the initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same column in the table.
  • The score is a sum of multiple sub-scores. More specifically, at least one sub-score is based on scores related to all elements within a cluster. Still more specifically, an initial overall score is a sum of the similarity of each element to itself. Even more specifically, each subsequent overall score, other than said initial overall score, is calculated as a differential value from an immediately preceding score. Even more specifically, each score for a potential clustering assignment is determined by adding an immediately preceding overall score to said differential value. Even more specifically, only a highest of said differential values in each round of calculations is added to the immediately preceding score.
  • Updating of the similarity matrix is achieved by adding together the rows corresponding to the merged clusters in the table and adding together the columns corresponding to the merged clusters.
  • The differential values are placed in a priority queue. More specifically, the priority queue is implemented by a heap. Still more specifically, a back-annotation of each differential value and its position in the heap is maintained. Even more specifically, updates of the similarity matrix result in removal of all delta values in the heap corresponding to said merged clusters. Even more specifically, the system is adapted to replace a removed value with the value at the bottom of the heap, reduce the heap length by one, swap the value with its parent if the value is larger than the parent or heapify the heap starting from the node corresponding to the value if the value is smaller than at least one of its child nodes, and update said back-annotation to reflect the changes in the heap.
  • At least one new differential value is inserted into the heap.
  • Another aspect of the disclosed teachings is a computer system that implements the system summarized above.
  • Yet another aspect of the disclosed teachings is a method for the indirect calculation of a series of greedy pairwise IN scores.
  • The method comprises inputting pairwise similarity between elements; replacing a self-similarity of each element to itself with an average of said element's similarity to all other elements and of all other elements to said element; calculating an initial IN score as the sum of said self-similarities; calculating a delta value corresponding to each case of merging two elements into a cluster; determining the highest of said delta values and accordingly selecting the corresponding merge of the two elements having the highest delta score for a clustering assignment; calculating the new IN score as the sum of the most recent IN score calculated and said highest delta value; updating the pairwise similarity by combining rows corresponding to said merged clusters and combining columns corresponding to said merged clusters; repeating until all elements are merged into a single cluster; and selecting the clustering assignment which has the highest IN score.
  • The pairwise similarities are organized in a similarity matrix.
  • The initial similarity value for a first element to itself is replaced by an average of the similarity of said first element to all other elements in a same row.
  • The initial similarity value for a first element to itself is replaced by an average of the similarity of said first element to all other elements in a same column.
  • The delta values for merging two clusters are computed using a method comprising: adding the value of similarity between a first cluster and a second cluster to the value of similarity between the second cluster and the first cluster; subtracting the self-similarity of the first cluster multiplied by the ratio between the number of elements in said second cluster and said first cluster; subtracting the self-similarity of the second cluster multiplied by the ratio between the number of elements in said first cluster and said second cluster; and dividing by a total number of elements to be merged in said potential clustering assignment.
  • The repetitive loop is stopped when a new IN score is smaller than an immediately preceding IN score.
  • Yet another aspect of the disclosed teachings is a method for clustering of elements comprising accessing initial similarity values in a similarity matrix, said similarity values corresponding to pairs of clusters, wherein each of said pair of clusters could also be an individual element; calculating a score for an optional merger of a pair of clusters; determining a highest score from all possible cluster pairs; updating said similarity matrix as a result of merging the clusters with said highest score; and determining the clustering assignment with the highest overall score.
  • The initial similarity value of a first element to itself is replaced by an average of the similarity of said first one of an element or a cluster to all other elements, and of all other elements to said first one of an element or a cluster.
  • The initial similarity values are organized in a two-dimensional table.
  • The initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same row in the table. More specifically, it may instead be an average of the similarity of said first one of an element or cluster to all other elements in a same column in the table.
  • The score is a sum of multiple sub-scores. More specifically, the sum is a weighted sum. More specifically, at least one sub-score is based on scores related to all elements within a cluster. Still more specifically, an initial overall score is a sum of the similarity of each element to itself. Even more specifically, each subsequent overall score, other than said initial overall score, is calculated as a differential value from an immediately preceding score. Even more specifically, each score for a potential clustering assignment is determined by adding an immediately preceding overall score to said differential value. Even more specifically, only a highest of said differential values in each round of calculations is added to the immediately preceding score.
  • The updating of the similarity matrix is achieved by adding together the rows corresponding to the merged clusters in the table and adding together the columns corresponding to the merged clusters.
  • The differential values are placed in a priority queue. More specifically, the priority queue is implemented by a heap. Still more specifically, a back-annotation of a differential value and its position in the heap is maintained. Even more specifically, updates of the similarity matrix result in removal of all delta values in said heap corresponding to said merged clusters.
  • The method comprises replacing the removed value with the value at the bottom of the heap; reducing the heap length by one; if said value is larger than its parent, then swapping places and continuing said swap while a child is larger than a parent; if said value is smaller than at least one of its respective child nodes, then heapifying the heap from said value's node; and updating said back-annotation to reflect the changes in the heap.
  • At least one new differential value is inserted into the heap.
  • Yet another aspect of the disclosed teachings is a method for sorting and removal of data in a heap, the heap including nodes. The method comprises placing values in a heap; removing at least one node in the heap and updating the heap to conform with the heap property; and updating a back-annotation of said values and their position in said heap.
  • The removal of a value comprises replacing the removed value with a value corresponding to a bottom of the heap; reducing a length of the heap by one; if said value is larger than a parent, then swapping places and continuing said swap while a child is larger than a parent; and if said value is smaller than at least one of its respective child nodes, then heapifying the heap from said value's node.
  • Another aspect of the disclosed teachings is a method for the indirect calculation of a series of greedy pairwise IN-OUT scores.
  • The method comprises inputting a cluster pairwise similarity between pairs of clusters; replacing a self-similarity of each element to itself with an average of said element's similarity to all other elements and of all other elements to said element; creating a vector of values where each value in said vector corresponds to an element and where the content of said value is the sum of the similarities of said element to all other elements; setting an initial IN-OUT score to zero; calculating IN-OUT delta values of each case of merging two clusters based on the data in said pairwise similarity and said vector; determining a highest of said delta values and accordingly selecting a corresponding clustering assignment; calculating a new IN-OUT score as a sum of the most recent IN-OUT score calculated and said highest delta value; updating said similarity matrix by combining rows of said merged clusters and combining columns of said merged clusters; updating said vector; repeating until all elements are merged into a single cluster; and selecting the clustering assignment with the highest IN-OUT score.
  • the initial similarity value for a first element to itself is replaced by the average of the similarity of said first element to all the other elements in the same row.
  • the initial similarity value for a first element to itself is replaced by the average of the similarity of said first element to all the other elements in the same column.
  • the combined IN-OUT score is determined by assigning a first weight to the IN score and a second weight to the OUT score and subtracting a weighted OUT score from a weighted IN score.
  • The IN delta values are computed by a sub-process comprising: adding the value of similarity between a first cluster and a second cluster to the value of similarity between the second cluster and the first cluster; subtracting the self-similarity of the first cluster multiplied by the ratio between the number of elements in said second cluster and said first cluster; subtracting the self-similarity of the second cluster multiplied by the ratio between the number of elements in said first cluster and said second cluster; and dividing by a total number of elements to be merged in said potential clustering assignment.
  • The OUT delta values are computed by a sub-process comprising: multiplying an aggregation of similarities of a first cluster to all clusters but itself by a ratio of a number of elements in a second cluster to a difference of a total number of elements and a number of elements of the first cluster; adding an aggregation of similarities of the second cluster to all clusters but itself multiplied by a ratio of a number of elements in the first cluster to a difference of a total number of elements and a number of elements of the second cluster; subtracting a sum of the values of similarity between the first cluster and the second cluster and similarity between the second cluster and the first cluster; and dividing by a difference of a total number of elements and a sum of the numbers of elements of the first cluster and second cluster.
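Rendered symbolically, the OUT delta just described corresponds to the formula below. This rendering is an interpretation of the verbal recipe above (the patent's own formula images are not reproduced here), with V_ij denoting the similarity of cluster i to cluster j, n_i the number of elements in cluster i, U_i the aggregate similarity of cluster i to all other clusters, and N the total number of elements:

$$\Delta_{OUT}(i,j) = \frac{\frac{n_j}{N-n_i}\,U_i + \frac{n_i}{N-n_j}\,U_j - (V_{ij}+V_{ji})}{N - n_i - n_j}$$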
  • The repetitive loop is stopped when a new IN-OUT score is smaller than the immediately preceding IN-OUT score.
  • FIGs. 1A-B are conventional illustrations of two possible clustering assignments.
  • FIGs. 2A-B are conventional illustrations of a heap.
  • FIG. 3 is a general similarity matrix.
  • FIGs. 4A-E depict an exemplary similarity matrix input and various stages in the calculation of the IN score.
  • FIG. 5 is a graph of IN scores.
  • FIGs. 6A-D depict an exemplary similarity matrix and various stages in the calculation of the IN-OUT score.
  • FIG. 7 is a graph of IN-OUT scores.
  • FIGs. 8A-E show diagrams of a heap used for the purpose of determining scores for greedy based clustering assignments.
  • FIG. 9 is an exemplary block diagram of the disclosed system.
  • A set of data points can be clustered in various clustering assignments. It is important to know which clustering assignment is the best among the various clustering assignments possible for the same set of data points. In other instances, it may be important to know whether a given clustering assignment is good, i.e., satisfies a minimum quality threshold. This is done by allocating a global measure, or score, to each of the clustering assignments; the quality of a clustering assignment is then determined based on this score. Techniques are then devised to obtain a good clustering assignment (or the best among several potential clustering assignments) based on this score.
  • The pairwise similarity (or dissimilarity) measure is an application-specific measure of the similarity (or dissimilarity) between two data points.
  • This similarity need not be symmetric. In other words, the similarity of element X1 to element X2 could be different from the similarity of element X2 to element X1.
  • For example, where the elements represent persons in a workplace, a first employee may be highly dependent on the work done by a second employee, while the second employee is much less dependent on the work done by the first employee (hence the similarity of employee two to employee one is low).
  • Similarly, driving from point A to point B may be a shorter distance than driving from point B to point A, as a result of various access constraints such as one-way streets and other road conditions.
  • the disclosed data clustering algorithms use pairwise similarity or dissimilarity only, without making any further assumptions apart from the similarity or dissimilarity measure itself. This makes the disclosed data clustering algorithms belong to a class of statistical non-parametric techniques.
  • A clustering assignment where the average pairwise similarity within clusters is higher than in another clustering assignment will have a better score. It should be noted that any qualitative assessments like "better," "best," "good," etc., made between (or among) clustering assignments are relative to the pairwise relationship measure (similarity, dissimilarity, proximity, etc.) and the scoring method used. A different similarity measure and/or scoring method could lead to different results.
  • FIGs. 1A-B provide examples of clustering assignments.
  • In these examples, similarity is related to physical proximity. It may be difficult for the human eye to visualize and evaluate the clustering assignment of FIG. 1A and compare it against the clustering assignment of FIG. 1B.
  • a score gives a concrete measure indicating the better choice of clustering assignment relative to this score. The score itself, as is noted earlier, is based on a predefined measure. Therefore, a score is calculated for each of the potential clustering assignments. The clustering assignment having the better overall score is selected.
  • A technique for providing a score for a certain clustering assignment is based on quantitatively determining the relationships of the elements within each cluster, combining them to calculate a score for each cluster, and combining the scores of the clusters within a potential clustering assignment into an overall score for that specific potential clustering assignment. This is referred to herein as the "IN" score.
  • The IN score thus provides an indication of the average intra-cluster proximity of elements within a clustering assignment.
  • The higher the proximity of elements to other elements within their cluster, the higher the overall score.
  • One of the techniques that is useful in the implementation of the disclosed teaching is a priority queue.
  • the priority queue is designed to help in finding an item with the highest priority efficiently over a series of operations.
  • a priority queue can be implemented using a heap.
  • A heap is a tree where every node is associated with a key. Further, the keys satisfy the condition that for every node, the key of the parent is greater than or equal to the key of a child. Alternately, for every node, the key of the parent could be less than or equal to the key of a child.
  • the heap data structure is well-known in the art and is further described in Cormen et al "Introduction to Algorithms", pages 140-151.
  • the heap data structure is organized as a binary tree, as shown in the example depicted in FIG. 2A.
  • each node I has at most two child nodes, a left child, Left(I), and a right child, Right(I). Heaps are required to satisfy the "heap" property.
  • a parent is always greater than or equal to the value of its children according to some predefined criteria.
  • In some implementations of a heap, the smallest value will always be at the top of the heap. Further, the implementation of basic operations on a heap requires a processing time proportional to, at most, the height of the binary tree created by the heap. Such a height is of the order of the logarithm of N, where N is the number of elements within the heap.
  • Heaps can be organized in a memory array, as illustrated using an example in FIG. 2B. It should be noted that the address of the left child of a node in a heap is always two times the address of the parent. Accordingly, if the parent is placed in location "n", the left child of the parent will be placed in location "2*n". It should be further noted that the right child of a parent located at address "n" will be placed in the location with the address "2*n + 1".
  • the heap is further defined by the heap size that determines the maximum number of nodes within the heap. It is further possible to attach additional data that can be used in conjunction with the node value, to the node.
  • the first is the build-heap operation that creates a heap from an unsorted input of data.
  • Second is the "heapify" operation that ensures that every node in the heap obeys the heap property; assuming that individual heaps under the node also obey the heap property.
  • the heapify operation performs a swap. A parent is swapped with the child that has the largest value.
  • the heap property of the child may now be corrupted. It should be noted that only the children along the branch from the root to the leaf that belong to the swapping sequence may be corrupted. Therefore, an update has to be recursively performed until such point where the parent is larger than, or equal to, both its children.
  • the heapify operation may be performed starting at any node of the heap affecting all the children nodes thereof.
  • the heap-extract-max operation performs at least the following: a) extracts the element at the top of the heap; b) removes that element from the top of the heap; c) replaces the extracted element from the heap with the element removed from the bottom of the heap; d) shortens the length of the heap by one; and e) uses the heapify operation to ensure that after the changes made by the above operations, the heap property is still maintained.
  • Fourth is heap-insert, used for adding a new element to the heap. New elements are added at the bottom of the heap, as a new node.
  • The heap-insert operation performs at least the following: a) increases the heap length by one; b) inserts the new element in the new location created in the heap; c) checks if the new element is larger than its parent. If in step c) the child is larger than the parent, the content of the nodes is swapped, and due to the nature of the heap the parent is now in conformance with the heap property. However, it is possible that the parent, possibly a child of another parent, is in violation of the heap property, and hence the operation is recursively repeated until no further swapping is required. At the end of this operation, the heap satisfies the heap property after the insertion.
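By way of illustration, the four heap operations described above can be sketched over a 1-based array, matching the parent/child address arithmetic of FIG. 2B. This is a minimal illustrative sketch, not the patent's implementation; all names are illustrative.

```python
# Minimal 1-based array max-heap (an illustrative sketch, not the patent's code).
# Index 0 is unused so that Left(n) = 2*n and Right(n) = 2*n + 1, as in FIG. 2B.

def heapify(heap, i):
    """Restore the heap property at node i, assuming its subtrees already obey it."""
    size = len(heap) - 1
    largest, left, right = i, 2 * i, 2 * i + 1
    if left <= size and heap[left] > heap[largest]:
        largest = left
    if right <= size and heap[right] > heap[largest]:
        largest = right
    if largest != i:
        heap[i], heap[largest] = heap[largest], heap[i]
        heapify(heap, largest)          # recurse down the swapped branch only

def build_heap(heap):
    """Create a heap from unsorted input (the build-heap operation)."""
    for i in range((len(heap) - 1) // 2, 0, -1):
        heapify(heap, i)

def heap_extract_max(heap):
    """Extract the top, move the bottom node up, shorten by one, re-heapify."""
    top = heap[1]
    heap[1] = heap[-1]
    heap.pop()
    if len(heap) > 1:
        heapify(heap, 1)
    return top

def heap_insert(heap, value):
    """Add a new bottom node, then swap up while the child exceeds its parent."""
    heap.append(value)
    i = len(heap) - 1
    while i > 1 and heap[i] > heap[i // 2]:
        heap[i], heap[i // 2] = heap[i // 2], heap[i]
        i //= 2

heap = [None, 0.1, 0.3125, -0.0893]     # slot 0 unused; values are arbitrary
build_heap(heap)
assert heap_extract_max(heap) == 0.3125
```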
  • the disclosed teachings provide new systems and methods for clustering of elements, including the establishment of new methods for the efficient calculation of clustering assignment scores.
  • One possible implementation of the disclosed teachings is in a computer system having as an input a two dimensional similarity matrix generally of the format described in FIG. 3.
  • the similarity matrix defines the similarity between each pair of input elements and therefore respective data is organized in a two dimensional matrix.
  • the similarity of an element to itself is naturally considered to be very high and denoted by the value "1".
  • Other similarities may have a different degree of similarity, ranging from "0" to "1". The lower the value in this matrix, the lower the similarity between the corresponding elements.
  • The similarity of X1 to X2 is denoted by the value V12, and the similarity of X2 to X1 is denoted by the value V21.
  • Vij is not necessarily equal to Vji. It should be noted that the above range of [0,1] is only illustrative, and any convenient range [x,y] can be used without deviating from the scope of the disclosed teachings.
  • The disclosed teachings contemplate using modified values in the similarity matrix. This is done by replacing the self-similarity value, in each row of the similarity matrix, with the average similarity to the other elements in the row. Therefore the modified self-similarity value V11 for X1 shall be the sum of the elements V12 through V1n divided by n-1. While in this implementation the averaging is performed by rows, columns, a combination of columns and rows, or a weighted combination of columns and rows could be used as a basis for averaging without deviating from the spirit of the disclosed teachings.
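In symbols, the row-based replacement just described sets, for each element X_i in an n-element similarity matrix:

$$V_{ii} \leftarrow \frac{1}{n-1}\sum_{j\neq i} V_{ij}$$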
  • FIG. 4A shows a non-limiting exemplary similarity matrix for five elements. As a first step, the self-similarity values V11, V22, V33, V44, and V55 are replaced with the respective average for each row.
  • The modified similarity matrix has the values shown in FIG. 4B.
  • a repetitive procedure is used for the purpose of identifying the best two clusters to be merged in a specific similarity matrix.
  • the pair having a highest score is chosen to be merged. It should be noted that the pair could consist of two elements, two clusters, or a cluster and an element, where an element is considered to be in a cluster containing a single element, otherwise known in the art as a singleton.
  • The similarity matrix is then recalculated by adding the merged clusters' respective rows and columns. By this operation the similarity matrix may become a non-normalized similarity matrix. For simplicity, this non-normalized similarity matrix will continue to be referred to herein as a similarity matrix. The process continues until all the elements are grouped in one cluster. This repetitive procedure is discussed in detail subsequently with reference to FIGs. 4B-4E.
  • a maximum score will correspond to one of the selected clustering assignments of each round of calculation.
  • the clustering assignment corresponding to the maximum score is then chosen. Relative to the relationship measure and the scoring technique used, this chosen clustering assignment can be characterized as the best clustering assignment. As noted earlier, a different relationship measure and a different scoring technique could yield a different result.
  • The disclosed teachings further note that once the clustering assignment score drops for the first time, no further calculations are necessary. This is because the immediately preceding clustering assignment has at least a local maximum score and hence can be chosen as the clustering assignment of choice.
  • The difference between consecutive score values, herein referred to as the delta or delta values, is used. These delta values are added to the immediately preceding score, resulting in the determination of the next score. Using this technique, the processing time for score calculations is reduced further.
  • A score is determined based on the pairwise proximity or similarity; that is, it measures the proximity or similarity of elements within a cluster.
  • Multiple scores may be calculated. Each such score may be denoted as m1 through mn.
  • A final score may be obtained by adding the multiple scores to one another to get a combined score.
  • The multiple scores may also be weighted using a variety of weights, α, β, γ, etc., such that the more important of the multiple scores may influence the final combined score more than others.
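Written out, the weighted combination just described takes the following form (an illustrative rendering of the verbal description, with m1 through mn the individual scores):

$$m = \alpha\,m_1 + \beta\,m_2 + \gamma\,m_3 + \cdots$$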
  • a measure relative to the average similarity (and/or the score noted above) of each element of a cluster to all other elements within that cluster is used. This is known as the "IN” score.
  • The IN(5) score for the example in FIG. 4B, i.e., the IN score where there are five separate elements each belonging to its own cluster, is the sum of the five modified self-similarity values.
  • This IN score represents the case where each element remains in its own cluster. It should be noted that this IN score is also the highest score possible for this case. Each time a cluster assignment is tested, this calculation must be repeated.
  • The disclosed teachings also contemplate using a simpler method of efficiently selecting a clustering assignment, as well as an efficient technique for calculating the IN score of each clustering option.
  • In this technique, instead of calculating the IN score for all possible clustering assignments (ten in the case described in FIG. 4B), only delta values from the immediately previous IN score need to be calculated.
  • the new overall score is a sum of the immediately previous IN score and the delta value.
  • the alternative would be to create ten new non-normalized similarity matrices and calculate the IN score from there.
  • the non-normalized similarity is the simple sum of the similarities between elements.
  • the next step is to calculate the delta value that would represent the results of merging any two clusters into one cluster, while leaving all other clusters intact.
  • The disclosed teachings contemplate calculating the delta for merging clusters i and j using the following formula (reconstructed here to be consistent with the verbal description given elsewhere in this disclosure):

$$\Delta_{IN}(i,j) = \frac{V_{ij} + V_{ji} - \frac{n_j}{n_i}\,V_{ii} - \frac{n_i}{n_j}\,V_{jj}}{n_i + n_j}$$

  • n_i - the number of elements in the ith cluster
  • V_ij - the non-normalized similarity value of cluster i to cluster j
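A direct transcription of this formula into code might look as follows (an illustrative sketch; `V` is the non-normalized similarity matrix indexed by cluster, and `n` holds cluster sizes):

```python
def delta_in(V, n, i, j):
    """IN delta for merging clusters i and j (transcribing the formula above).

    V[i][j]: non-normalized similarity of cluster i to cluster j.
    n[i]:    number of elements in cluster i.
    """
    return (V[i][j] + V[j][i]
            - V[i][i] * n[j] / n[i]
            - V[j][j] * n[i] / n[j]) / (n[i] + n[j])
```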
  • A new IN score is calculated based on the best clustering assignment, which is the one where Δ_IN(i,j) is the highest.
  • In FIG. 4B it is possible to identify ten optional clustering assignments for merging two of the five clusters, and hence ten delta values have to be calculated.
  • Δ_IN(1,2) is equal to 0.3125.
  • The delta value for merging the cluster containing X3 with the cluster containing X5 into one cluster is calculated in the same manner.
  • the next step of the technique recalculates the similarity matrix of FIG. 4B such that it takes into account that two clusters have been merged. In each case where there is an effect of the merge of clusters the recalculation must take place.
  • Initially, in this recalculation, the rows respective to the clusters to be merged are added to each other, followed by the same process for the columns of the clusters to be merged.
  • FIG. 4C shows the similarity matrix after this recalculation. It can be easily seen that the values for clusters that have not merged in this step have remained the same. For example, the value V34, which is 0.5 in FIG. 4B, has the same value in FIG. 4C.
  • The value V15 no longer exists in FIG. 4C, as the clusters containing elements X1 and X2 were merged into one cluster.
  • The new value, now V{1,2},5, is the sum of the values of V15 and V25, which is 0.762.
  • The values V11, V12, V21, and V22 are summed up, resulting in a value of 2.47.
  • FIGs. 4C, 4D, and 4E depict the repetition of this process, which enables the calculation of all possible IN scores using the disclosed teachings.
  • the delta values are calculated and the highest delta score selected.
  • the clusters respective to that delta value are merged and the similarity matrix updated accordingly. This process continues until all elements are merged into one cluster.
  • the calculations of deltas will cease when the first decline of IN scores is detected.
  • multiple scoring techniques are used, each designated its own weight factor.
  • A graph similar to the graph shown in FIG. 5 is drawn. However, in this case, the weighted overall score is used rather than the IN scores alone. The highest overall score is used to determine the best clustering assignment possible.
  • the calculation ceases when the first decline in score values is observed.
  • The IN-OUT score is a variation of the previously explained IN score, the variation being based on a new score measurement named OUT.
  • the OUT score is reflective of the average proximity between a cluster and all other clusters. It should be noted that proximity between two clusters is the aggregate (or average) of the proximity between all possible pairwise combinations of members of the first cluster and the second cluster. More specifically, the use of IN-OUT score is aimed at merging two clusters having a strong proximity to each other but a weak proximity to all other clusters.
  • N - the total number of elements to be clustered
  • A score may now be calculated based on both the Δ_IN and Δ_OUT values, with weights α and β, as follows:

$$\Delta_{IN\text{-}OUT}(i,j) = \alpha\,\Delta_{IN}(i,j) - \beta\,\Delta_{OUT}(i,j)$$
  • The modified similarity matrix using the IN-OUT score is now depicted in FIG. 6A, with the corresponding delta values Δ_IN-OUT being calculated for each possible clustering assignment.
  • the similarity matrix is updated and the highest overall score is selected in the same way as previously described.
  • The vector representing all the U values is updated using the following formula.
  • U_merge(i,j) represents the value of U for the merged cluster formed by the merging of clusters i and j.
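The formula image itself is not reproduced in this text. A reconstruction consistent with the definition of the U vector (each U_i aggregates the similarity of cluster i to all other clusters, so the pairwise terms that become internal after the merge must be removed) would be:

$$U_{merge(i,j)} = U_i + U_j - (V_{ij} + V_{ji})$$

This rendering is an inference from the surrounding definitions rather than the patent's stated formula.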
  • FIGs. 6B through 6D depict the remaining stages. The IN-OUT scores for IN-OUT(5), IN-OUT(4), IN-OUT(3), and IN-OUT(2) are 0, 0.5208, 0.9606, and 0.9977 respectively. These scores are plotted in FIG. 7, and it can be easily seen that in this case the highest IN-OUT score is achieved when there are only two clusters. The selected cluster assignment is therefore {X1,X2}{X3,X4,X5}. The IN-OUT score for leaving each element in its own cluster is always zero, as explained above. Different α and β values may result in a different cluster assignment.
  • An efficient sorting of the calculated delta values is required. This is because this clustering method may be employed for clustering a large number of elements, and hence efficient sorting is required for a practical implementation.
  • the present implementation uses a heap for priority queue sorting.
  • it is not an absolute requirement for implementing the disclosed teachings to use a heap.
  • a back-annotation that includes the cluster location within the heap is used for easy update of the heap.
  • This enables fast updates of the heap, including size reduction of the heap, without requiring recreation of the heap after every update of the table. Therefore, a remove function is added to the functions for managing the heap.
  • the remove function allows for the removal of certain node information from the heap, as may be desired, and consequently updating the heap such that it retains the heap property. The removal is required as a result of the merging of elements into a cluster as explained above.
  • The remove operation performs the following when receiving the index of a node to be removed: a) replaces the content of the node to be removed with the content of the node at the bottom of the heap; b) reduces the length of the heap by one; c) compares the content of the node with the content of its parent and performs the following: i) if the content of the element is larger than that of its parent, then a swap-up operation, as explained below, takes place; otherwise ii) a heapify operation beginning at that node takes place.
  • The swap-up operation exchanges the content of a parent and child if the content of the child is higher than that of the parent, and continues recursively as long as a swap can take place. This is repeated, when necessary, all the way to the top of the heap. By doing this, it is guaranteed that once performed, the heap in its entirety will maintain the heap property.
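The remove and swap-up operations with back-annotation can be sketched as follows. This is an illustrative sketch over a 1-based array max-heap, not the patent's code; entries are (value, key) pairs, and `pos` stands in for the back-annotation table, mapping each key (e.g., a cluster pair) to its node index.

```python
# Back-annotated max-heap sketch (illustrative). heap[0] is an unused slot.

def swap(heap, pos, a, b):
    heap[a], heap[b] = heap[b], heap[a]
    pos[heap[a][1]], pos[heap[b][1]] = a, b       # keep the back-annotation current

def swap_up(heap, pos, i):
    """Swap a node with its parent while its value exceeds the parent's."""
    while i > 1 and heap[i][0] > heap[i // 2][0]:
        swap(heap, pos, i, i // 2)
        i //= 2

def heapify_ba(heap, pos, i):
    """Heapify that also maintains the back-annotation."""
    size = len(heap) - 1
    largest, left, right = i, 2 * i, 2 * i + 1
    if left <= size and heap[left][0] > heap[largest][0]:
        largest = left
    if right <= size and heap[right][0] > heap[largest][0]:
        largest = right
    if largest != i:
        swap(heap, pos, i, largest)
        heapify_ba(heap, pos, largest)

def remove(heap, pos, key):
    """Remove the node holding `key`: replace it with the bottom node, shorten
    the heap by one, then swap up or heapify down, as described in the text."""
    i = pos.pop(key)
    last = len(heap) - 1
    if i != last:
        heap[i] = heap[last]
        pos[heap[i][1]] = i
    heap.pop()
    if i < len(heap):                             # the moved node may violate the heap property
        if i > 1 and heap[i][0] > heap[i // 2][0]:
            swap_up(heap, pos, i)                 # larger than its parent: swap up
        else:
            heapify_ba(heap, pos, i)              # otherwise heapify down from this node

# Usage (values and keys are illustrative; the array already satisfies the heap property):
heap = [None, (0.3125, ("1", "2")), (-0.0893, ("1", "3")), (0.05, ("1", "4"))]
pos = {entry[1]: idx for idx, entry in enumerate(heap) if entry}
remove(heap, pos, ("1", "3"))
```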
  • the delta values are the same values that were calculated in the first example.
  • the first step is to place the calculated delta values in heap 800, in no specific order, as shown in FIG. 8A.
  • The delta values were taken from FIG. 4B, using the order from Δ_IN(1,2) through Δ_IN(4,5).
  • A back-annotation table (BAT) mapping the location of each delta value in the heap is also shown in FIG. 8A. It can be easily seen that Δ_IN(1,2), which equals 0.3125, is placed at location "1" of the heap in node 801.
  • Δ_IN(1,3), which is equal to -0.0893, is placed at location "2" of the heap in node 802. This continues until the last delta value is placed in location "10" at node 810.
  • the corresponding node number is placed in BAT 850.
  • ⁇ IN (1,2) has a designator "1" in the table as it is placed in node 801, which is the first location of the heap.
  • the origin of the delta value is also included, resulting in an immediate indication of the potential merge of elements or clusters.
  • At this point, the heap does not satisfy the heap property. That is, presently, a parent node is not always equal to, or larger than, its child nodes. Therefore a build-heap operation is required. However, a modified build-heap (MBH) operation is used, because an update of BAT 850 is also necessary as delta values are moved around heap 800. Hence, in addition to the normal operation of the build-heap operation, the corresponding update of BAT 850 takes place.
  • In FIG. 8B, heap 800 and BAT 850 are shown after the heap was updated in order to conform to the heap property.
  • At the top of the heap is the largest delta value and the indication that elements "1" and "2" are the candidates for a merge.
  • As elements "1" and "2" are to be merged, there will be implications for the delta values in which either element "1" or element "2" appears. All such delta values need to be updated.
  • the procedure of removing the top of the heap is conventionally known as explained above. After applying the heap-extract-max, the heap will have the structure described in FIG. 8C.
  • the remove operation is now used to remove elements "1,3", “1,4", “1,5", “2,3”, “2,4", and “2,5" from the heap and the result is shown in FIG. 8D.
  • the operation can be performed due to the data stored in BAT 850 that provides for the back-annotation of the elements or clusters to the nodes.
  • BAT 850 is updated to allow for the back-annotation corresponding to the changes in the heap. It should be noted that of the two merged elements or clusters, the row with more data cells in BAT 850 is replaced by "-1" to denote an invalid value. The index to the row containing fewer data cells is updated to correspond to a designator indicating that it is a merged case. The process is now repeated until all elements/clusters are merged into one cluster.
  • this repetitive procedure will cease once the new IN score drops for the first time from the immediately preceding IN score.
  • Other sorting structures, for example a binary search tree (BST), could be used instead of a heap.
  • System 900 comprises an initializer 910, a merger 920, a selector 930, and an assigner database 940.
  • Initializer 910 receives the pairwise similarity data and performs the functions required to allow the iterative processing disclosed in this invention.
  • Initialization functions may include at least the update of the self-similarity values and the calculation of the initial scores.
  • Merger 920 is adapted to perform the iterative function of finding the best two clusters to be merged at each step of the calculations, and providing a score for the selected clustering assignment.
  • Assigner database 940 contains all the data necessary for the operation of merger 920 and is a source for data provided to selector 930. The assigner database may hold the similarity data and its updates, and the delta values, possibly in a heap structure.
  • Any block described hereinabove can be implemented in hardware, software, or combination of hardware and software.
  • the system can also be implemented using any type of computer including PCs, workstations and mainframes.
  • The disclosed teachings also include computer program products that include a computer-readable medium with instructions. These instructions can include, but are not limited to, source code, object code and executable code, and can be in any higher-level language, assembly language or machine language.
  • The computer-readable medium is not restricted to any particular type of medium but can include, but is not limited to, RAMs, ROMs, hard disks, floppies, tapes and Internet downloads.

Abstract

This disclosure teaches a clustering system. The clustering system includes an initializer, a merger, a clustering assignment selector and an assigner database. The initializer is adapted to perform at least one of: update of self-similarity values and calculation of an initial score for an optional clustering assignment. The merger is adapted to perform at least one of: calculate the delta values for the potential merger of any two clusters, and update the pairwise similarity as a result of merging the two clusters with the highest score in each round of calculations. The clustering assignment selector is adapted to determine the clustering assignment with the highest overall score. The assigner database is adapted to store and retrieve at least: pairwise similarity and updates thereof, and scores of potential clustering assignments determined at each round of calculations.

Description

A System and Method for a Greedy Pairwise Clustering
I. Description
I.A. Claim for Priority
This application claims priority of U.S. Provisional Patent Application serial No. 60/226,128 filed on August 18, 2000, incorporated in its entirety by reference herein. This application claims priority of U.S. Provisional Patent Application serial No. 60/259,575 filed on January 4, 2001, incorporated in its entirety by reference herein.
I.B. Field
This disclosure teaches techniques for pairwise clustering of data elements. More specifically, the teachings include, but are not limited to, systems and methods for the implementation of a greedy pairwise clustering of data elements by iteratively merging the best elements and/or clusters.
I.C. Background and Related Art
The following documents provide background for a better understanding of the disclosed teaching, and to that extent, they are incorporated herein by reference.
1. US Patents
4,181,821 Jan 1980 Pirz et al.
4,837,831 Jun 1989 Gillick et al.
5,040,133 Aug 1991 Feintuch et al.
5,202,840 Apr 1993 Wong
5,276,766 Jan 1994 Bahl et al.
5,389,936 Feb 1995 Alcock
5,404,561 Apr 1995 Castelaz
5,519,789 May 1996 Etoh
5,537,491 Jul 1996 Mahoney et al.
5,566,078 Oct 1996 Ding et al.
5,682,321 Oct 1997 Ding et al.
5,703,964 Dec 1997 Menon et al.
5,706,503 Jan 1998 Poppen et al.
5,764,824 Jun 1998 Kurtzberg et al.
5,806,030 Sep 1998 Junqua
5,832,182 Nov 1998 Zhang et al.
5,857,169 Jan 1999 Seide
5,940,832 Aug 1999 Hamada
6,021,383 Feb 2000 Domany et al.
6,026,397 Feb 2000 Sheppard
6,031,979 Feb 2000 Hachiya
6,038,340 Mar 2000 Ancin et al.
6,134,541 Oct 2000 Castelli et al.
6,144,838 Nov 2000 Sheehan
2. Other References
Buhmann, J.M. & Hofmann, T., "A Maximum Entropy Approach to Pairwise Data Clustering", Proceedings of the International Conference on Pattern Recognition '94, Hebrew University Jerusalem, pp. 1-6.
Hofmann, T., Puzicha, J. & Buhmann, J.M., "Deterministic Annealing for Unsupervised Texture Segmentation", Proceedings of the EMMCVPR'97, Venice, 1997
Puzicha, J., Hofmann, T. & Buhmann, J.M., "A Theory of Proximity Based Clustering: Structure Detection by Optimization", October 1998
Gdalyahu, Y., Weinshall, D. & Werman, M., "Stochastic Clustering and its Application to Image Segmentation", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 1999
Cormen, T. et al, Introduction to Algorithms, McGraw-Hill, 23rd printing, 1999, pp. 140-151.
Hofmann, T., Puzicha, J. & Buhmann, J.M., "An Optimization Approach to Unsupervised Hierarchical Texture Segmentation"
Hofmann, T. & Buhmann, J.M., "Inferring Hierarchical Clustering Structures by Deterministic Annealing"
I.D. Definitions
To better understand this disclosure and its teachings, the following terms are described as under: Clustering Assignment - a specific case that represents an allocation of elements to one or more clusters. A clustering assignment indicates which element belongs to each cluster. Cluster - a set of one or more elements.
Element - a specific data point in the domain under consideration
Merge - the operation of combining two clusters, including a cluster containing a single element, to form a merged cluster such that all the elements contained in each cluster become part of the merged cluster.
Score - a real number representing a quantitative and qualitative indication of how good a cluster assignment is.
Conventionally, clustering assignment problems are considered a subset of pattern recognition problems. Solutions to clustering assignment problems are well-known. In such clustering assignment problems, the goal is to partition a set of data points into multiple sets of data points. These multiple sets of data points are commonly referred to as clusters. Clearly, the partitioning (or clustering) of data points is influenced by assumptions regarding the origin of these data points that form the cluster. The particular field of application in which the clusters will be used also influences the partitioning.
Given N data points that need to be clustered, it can be easily determined that 1 to N clusters can be formed in a given clustering assignment. FIGs. 1A and 1B are examples of two possible clustering assignments for elements "X1" through "X13" organized in a two-dimensional projection. The first clustering assignment is shown in FIG. 1A, where there are three clusters 10, 11, and 12 containing {X1, X2, X3, X4, X5}, {X6, X7, X8, X9}, and {X10, X11, X12, X13} respectively. FIG. 1B shows a second possible clustering assignment, using the same data points. In this clustering assignment, there are four clusters 15, 16, 17, and 18 containing {X1, X2, X3, X5}, {X4, X9}, {X6, X7, X8}, and {X10, X11, X12, X13} respectively.
Such conventional clustering assignment techniques have been used, for example, in pattern and image processing (US patents 5,519,789, 5,537,491, 5,703,964, 5,857,169, 6,038,340), speech analysis (US patents 4,181,821, 4,837,831, 5,276,766, 5,806,030), placement (US patents 5,202,840, 5,566,078, 5,682,321), data clustering in databases (US Patents 5,706,503, 5,832,182, 5,940,832), and general clustering and clustering analysis (US patents 5,040,133, 5,389,936, 5,404,561, 5,764,824, 6,021,383, 6,026,397, 6,031,979, 6,134,541).
These conventional techniques allegedly solve certain clustering and scoring problems. However, they constrain the data elements by imposing certain conditions on them. It will be advantageous to develop efficient clustering techniques that require as few assumptions as possible.
II. Summary
To help realize the advantages mentioned above, the disclosed teachings provide a clustering system. The system comprises an initializer, a merger, and a selector. All three units are further connected to an assigner database. The initializer is adapted to accept pairwise similarity data as input and to provide the merger and the assigner database with the initialized data ready for the clustering assignment process. The merger is adapted to perform the repetitive task of calculating delta values, determining the clustering assignment with the highest score (obtained from the highest delta) among all pairwise clustering options in each round of calculations, and updating the assigner database. The selector is adapted to determine the clustering assignment having the highest overall score. The assigner database is adapted to hold the initialized pairwise similarity data and updates thereof (such updates are made by the merger), as well as other data structures as may be required for the operation of the system.
Specifically, the initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or a cluster to all other elements or clusters. More specifically, the initial similarity value may instead be an average of the similarity of all other elements or clusters to said first one of an element or cluster.
Specifically, the initial similarity values are organized in a two-dimensional table.
More specifically, the initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same row in the table.
More specifically, the initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same column in the table.
Specifically, the score is a sum of multiple sub-scores. More specifically, at least one sub-score is based on scores related to all elements within a cluster. Still more specifically, an initial overall score is a sum of the similarity of each element to itself. Even more specifically, each subsequent overall score, other than said initial overall score, is calculated as a differential value from an immediately preceding score. Even more specifically, each score for a potential clustering assignment is determined by adding an immediately preceding overall score to said differential value. Even more specifically, only a highest of said differential values in each round of calculations is added to the immediately preceding score.
More specifically, updating of the similarity matrix is achieved by adding rows corresponding to the merged clusters in the table and adding columns corresponding to the merged clusters.
More specifically, the differential values are placed in a priority queue. More specifically, the priority queue is implemented by a heap. Still more specifically, a back-annotation of a differential value and its position in the heap is maintained. Even more specifically, the updates of the similarity matrix result in removal of all delta values in the heap corresponding to said merged clusters. Even more specifically, the system is adapted to replace said removed value with the value at the bottom of the heap, reduce the heap length by one, swap the value with a parent if said value is larger than the parent, or heapify the heap starting from a node corresponding to the value if the value is smaller than at least one of its child nodes, and update said back-annotation to reflect changes in the heap.
More specifically, at least one new differential value is inserted into the heap. Another aspect of the disclosed teachings is a computer system that implements the system summarized above.
Yet another aspect of the disclosed teachings is a method for the indirect calculation of a series of greedy pairwise IN scores. The method comprises inputting pairwise similarity between elements; replacing a self-similarity of each element to itself with an average of said element's similarity to all other elements and all other elements to said element; calculating an initial IN score as the sum of said self-similarities; calculating a delta value corresponding to each case of merging two elements into a cluster; determining the highest of said delta values and accordingly selecting the corresponding merge of the two elements having the highest delta score for a clustering assignment; calculating the new IN score as the sum of the most recent IN score calculated and said highest delta value; updating the pairwise similarity by combining rows corresponding to said merged clusters and combining columns corresponding to said merged clusters; repeating until all elements are merged into a single cluster; and selecting the clustering assignment which has the highest IN score. Specifically, the pairwise similarities are organized in a similarity matrix.
Specifically, the initial similarity value for a first element to itself is replaced by an average of the similarity of said first element to all other elements in a same row.
Specifically, the initial similarity value for a first element to itself is replaced by an average of the similarity of said first element to all other elements in a same column.
Specifically, the delta values for merging two clusters are computed by a method comprising: adding the values of similarity between a first cluster and a second cluster and similarity between the second cluster and the first cluster; subtracting the self-similarity of the first cluster multiplied by the ratio between the number of elements in said second cluster and said first cluster; subtracting the self-similarity of the second cluster multiplied by the ratio between the number of elements in said first cluster and said second cluster; and dividing by a total number of elements to be merged in said potential clustering assignment. Specifically, the repetitive loop is stopped when a new IN score is smaller than an immediately preceding IN score.
Yet another aspect of the disclosed teachings is a method for clustering of elements comprising accessing initial similarity values in a similarity matrix, said similarity values corresponding to pairs of clusters wherein each of said pair of clusters could also be an individual element; calculating a score for an optional merger of a pair of clusters; determining a highest score from all possible cluster pairs; updating said similarity matrix as a result of merging said clusters with said highest score; and determining the clustering assignment with the highest overall score.
Specifically, after said accessing of initial similarity values, the initial similarity value of a first element to itself is replaced by an average of a similarity of said first one of an element or a cluster to all other elements, and all other elements to said first one of an element or a cluster. Specifically, the initial similarity values are organized in a two-dimensional table.
More specifically, the initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same row in the table. More specifically, the initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same column in the table.
Specifically, the score is a sum of multiple sub-scores. More specifically, the sum is a weighted sum. More specifically, at least one sub-score is based on scores related to all elements within a cluster. Still more specifically, an initial overall score is a sum of the similarity of each element to itself. Even more specifically, each subsequent overall score, other than said initial overall score, is calculated as a differential value from an immediately preceding score. Even more specifically, each score for a potential clustering assignment is determined by adding an immediately preceding overall score to said differential value. Even more specifically, only a highest of said differential values in each round of calculations is added to the immediately preceding score.
More specifically, the updating of the similarity matrix is achieved by adding rows corresponding to the merged clusters in the table and adding columns corresponding to the merged clusters.
More specifically, the differential values are placed in a priority queue. More specifically, the priority queue is implemented by a heap. Still more specifically, a back-annotation of a differential value and its position in the heap is maintained. Even more specifically, updates of the similarity matrix result in removal of all delta values in said heap corresponding to said merged clusters.
More specifically, the method comprises replacing the removed value with the value at the bottom of the heap; reducing the heap length by one; if said value is larger than the parent, then swapping places and continuing said swap while a child is larger than a parent; if said value is smaller than at least one of its respective child nodes, then heapifying the heap from said value's node; and updating said back-annotation to reflect the changes in the heap.
More specifically, at least one new differential value is inserted into the heap. Yet another aspect of the disclosed teachings is a method for sorting and removal of data in a heap, the heap including nodes. The method comprises placing values in a heap; removing at least one node in the heap and updating the heap to conform with the heap property; and updating a back-annotation of said values and their position in said heap.
Specifically, the removal of a value comprises replacing the removed value with a value corresponding to a bottom of the heap; reducing a length of the heap by one; if said value is larger than a parent, then swapping places and continuing said swap while a child is larger than a parent; and if said value is smaller than at least one of its respective child nodes, then heapifying the heap from said value's node.
More specifically, at least one new value is inserted into the heap. Another aspect of the disclosed teachings is a method for the indirect calculation of a series of greedy pairwise IN-OUT scores. The method comprises inputting a cluster pairwise similarity between pairs of clusters; replacing a self-similarity of each element to itself with an average of said element's similarity to all other elements and all other elements to said element; creating a vector of values where each value in said vector corresponds to an element and where the content of said value is the sum of the similarities of said element to all other elements; setting an initial IN-OUT score to zero; calculating IN-OUT delta values for each case of merging two clusters based on the data in said pairwise similarity and said vector; determining a highest of said delta values and accordingly selecting a corresponding clustering assignment; calculating a new IN-OUT score as a sum of the most recent IN-OUT score calculated and said highest delta value; updating said similarity matrix by combining rows of said merged clusters and combining columns of said merged clusters; updating said vector by adding values corresponding to each of the merged clusters, subtracting the similarity of the first cluster to the second cluster, and subtracting the similarity of the second cluster to the first cluster; replacing said values with the newly calculated value; repeating until only two clusters remain; and selecting a clustering assignment which has a highest IN-OUT score.
Specifically, the initial similarity value for a first element to itself is replaced by the average of the similarity of said first element to all the other elements in the same row.
Specifically, the initial similarity value for a first element to itself is replaced by the average of the similarity of said first element to all the other elements in the same column. Specifically, the combined IN-OUT score is determined by assigning a first weight to the IN score and a second weight to the OUT score and subtracting a weighted OUT score from a weighted IN score.
Specifically, the IN delta values are computed by a sub-process comprising: adding values of similarity between a first cluster and a second cluster and similarity between the second cluster and the first cluster; subtracting the self-similarity of the first cluster multiplied by the ratio between the number of elements in said second cluster and said first cluster; subtracting the self-similarity of the second cluster multiplied by the ratio between the number of elements in said first cluster and said second cluster; and dividing by a total number of elements to be merged in said potential clustering assignment.
Specifically, the OUT delta values are computed by a sub-process comprising: multiplying an aggregation of similarities of a first cluster to all clusters but itself by a ratio of a number of elements in a second cluster to a difference of a total number of elements and a number of elements of the first cluster; adding an aggregation of similarities of the second cluster to all clusters but itself multiplied by a ratio of a number of elements in the first cluster to a difference of a total number of elements and a number of elements of the second cluster; subtracting a sum of values of similarity between the first cluster and the second cluster and similarity between the second cluster and the first cluster; and dividing by a difference of a total number of elements and a sum of the number of elements of the first cluster and second cluster.
Specifically, the repetitive loop is stopped when a new IN-OUT score is smaller than the immediately preceding IN-OUT score.
The above summaries are merely meant to provide guidance for a better understanding of the disclosed teachings and are not intended to limit the scope of the claims in any manner.
III. Brief Description of the Drawings
The disclosed teachings and techniques are described in more detail using embodiments thereof with reference to the attached drawings, in which:
FIGs. 1A-B are conventional illustrations of two possible clustering assignments.
FIGs. 2A-B are conventional illustrations of a heap.
FIG. 3 is a general similarity matrix.
FIGs. 4A-E depict an exemplary similarity matrix input and various stages in the calculation of the IN score.
FIG. 5 is a graph of IN scores.
FIGs. 6A-D depict an exemplary similarity matrix and various stages in the calculation of the IN-OUT score.
FIG. 7 is a graph of IN-OUT scores.
FIGs. 8A-E show diagrams of a heap used for the purpose of determining scores for greedy based clustering assignments.
FIG. 9 is an exemplary block diagram of the disclosed system.
IV. Detailed Description of the Disclosed Techniques
Initially, an understanding of some basic techniques is required for understanding implementations of the disclosed techniques.
It will be clear that a set of data points can be clustered in various clustering assignments. It is important to know which clustering assignment is the best among the various clustering assignments possible for the same set of data points. In other instances, it may be important to know if a given clustering assignment is good, thereby satisfying a minimum quality threshold. This is done by allocating a global measure, or score, to each of the clustering assignments. The quality of a clustering assignment is then determined based on this score, and techniques are then determined to obtain a good clustering assignment (or the best among several potential clustering assignments) based on this score. In many data clustering problems, a matrix of pairwise similarity or dissimilarity measures is used to calculate the overall score. The pairwise similarity (or dissimilarity) measure may be a measure, based on the specific application, of the similarity (or dissimilarity) between two data points. However, it should be noted that this similarity need not be symmetric. In other words, the similarity of element X1 to element X2 could be different from the similarity of element X2 to element X1. For instance, if the elements represent persons in a workplace, it is possible that a first employee is highly dependent on the work done by a second employee (hence the similarity of employee one to employee two is high), while the second employee is much less dependent on the work done by the first employee (hence the similarity of employee two to employee one is low). Likewise, driving from point A to point B may be a shorter distance than driving from point B to point A, as a result of various access constraints such as one-way streets and other road conditions. The disclosed data clustering algorithms use pairwise similarity or dissimilarity only, without making any further assumptions apart from the similarity or dissimilarity measure itself. This makes the disclosed data clustering algorithms belong to a class of statistical non-parametric techniques.
Generally, a clustering assignment in which the average pairwise similarity within clusters is higher than in another clustering assignment will have a better score. It should be noted that any qualitative assessment like "better," "best," "good," etc., made between (or among) clustering assignments is relative to the pairwise relationship measure (similarity, dissimilarity, proximity, etc.) and the scoring method used. A different similarity measure and/or scoring method could lead to different results.
FIGs. 1A-B provide examples of clustering assignments. In the example of FIGs. 1A-B, similarity is related to physical proximity. It may be difficult for the human eye to evaluate the clustering assignment of FIG. 1A and compare it against the clustering assignment of FIG. 1B. On the other hand, a score gives a concrete measure indicating the better choice of clustering assignment relative to that score. The score itself, as noted earlier, is based on a predefined measure. Therefore, a score is calculated for each of the potential clustering assignments. The clustering assignment having the better overall score is selected.
A technique for providing a score for a certain clustering assignment is based on quantitatively determining the relationship of the elements within each cluster, combining these relationships to calculate a score for each cluster, and combining the scores for the clusters within a potential clustering assignment into an overall score for that specific potential clustering assignment. This is referred to herein as the "IN" score. This IN score thus provides an indication of the average intra-cluster proximity of elements within a clustering assignment. Clearly, the higher the proximity of elements to other elements within their cluster, the higher is the overall score. One of the techniques that is useful in the implementation of the disclosed teachings is a priority queue. The priority queue is designed to help in finding an item with the highest priority efficiently over a series of operations. Some of the basic operations that are required for implementing a priority queue are: insert, find-minimum (or maximum), and delete-minimum (or maximum). In addition, some implementations also support joining two priority queues efficiently. Such a joining operation is called melding. Still further, deleting an arbitrary item, increasing (or decreasing) the priority of an item, etc., may be supported efficiently. A priority queue can be implemented using a heap. A heap is a tree where every node is associated with a key. Further, a key satisfies the condition that, for every node, the key of the parent is greater than or equal to the key of a child. Alternately, for every node, the key of the parent could be less than or equal to the key of a child. The heap data structure is well-known in the art and is further described in Cormen et al., "Introduction to Algorithms", pages 140-151. The heap data structure is organized as a binary tree, as shown in the example depicted in FIG. 2A. In this tree, each node I has at most two child nodes, a left child, Left(I), and a right child, Right(I). Heaps are required to satisfy the "heap" property. According to the heap property, a parent is always greater than or equal to the value of its children according to some predefined criteria. An advantage of using heaps for sorting data is their efficiency in ensuring that the node with the largest value will always be at the top of the heap. In some implementations of a heap, the smallest value will always be at the top of the heap. Further, the implementation of basic operations on a heap requires a processing time proportional to, at most, the height of the binary tree created by the heap. Such a height is of the order of the logarithm of N, where N is the number of elements within the heap.
Heaps can be organized in a memory array, as illustrated using an example in FIG. 2B. It should be noted that the address of the left child of a node in a heap is always two times the address of the parent. Accordingly, if the parent is placed in location "n", the left child of the parent will be placed in location "2*n". It should be further noted that the right child of a parent located at address "n" will be placed in the location with the address "2*n + 1". The heap is further defined by the heap size, which determines the maximum number of nodes within the heap. It is further possible to attach to a node additional data that can be used in conjunction with the node value.
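By way of a non-limiting illustration, the index arithmetic of this array layout can be sketched as follows; index 0 of the array is left unused so that the arithmetic holds, and the values shown are hypothetical:

```python
def parent(n): return n // 2
def left(n):   return 2 * n
def right(n):  return 2 * n + 1

heap = [None, 9, 7, 8, 3, 5]   # heap[1] is the root; values already in heap order
assert heap[left(1)] == 7 and heap[right(1)] == 8
assert heap[parent(5)] == heap[2]
```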
Several computational operations are required to create, maintain, and utilize heaps efficiently. The first is the build-heap operation, which creates a heap from an unsorted input of data. Second is the "heapify" operation, which ensures that every node in the heap obeys the heap property, assuming that the individual heaps under the node already obey the heap property. In the case where the left child and right child of a node obey the heap property but the parent does not, the parent being smaller than at least one of its children, the heapify operation performs a swap: the parent is swapped with the child that has the largest value.
However, as a result of the swap, the heap property of the child may now be corrupted. It should be noted that only the children along the branch from the root to the leaf that belong to the swapping sequence may be corrupted. Therefore, an update has to be recursively performed until such point where the parent is larger than, or equal to, both its children. The heapify operation may be performed starting at any node of the heap affecting all the children nodes thereof.
Third, the heap-extract-max operation performs at least the following: a) extracts the element at the top of the heap; b) removes that element from the top of the heap; c) replaces the extracted element with the element removed from the bottom of the heap; d) shortens the length of the heap by one; and e) uses the heapify operation to ensure that, after the changes made by the above operations, the heap property is still maintained. Fourth is heap-insert, used for adding a new element to the heap. New elements are added at the bottom of the heap, as a new node. The heap-insert operation performs at least the following: a) increases the heap length by one; b) inserts the new element in the new location created in the heap; and c) checks if the new element is larger than its parent. If in step c) the child is larger than the parent, the content of the nodes is swapped, and due to the nature of the heap the parent is now in conformance with the heap property. However, it is possible that the parent, possibly a child of another parent, is in violation of the heap property, and hence the operation is recursively repeated until no further swapping is required. At the end of this operation, the heap satisfies the heap property after the insertion.
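A minimal sketch of these operations, assuming a max-heap stored in a 1-based Python list as above, is given below; it is illustrative only and not the patent's implementation:

```python
def heapify(heap, n, size):
    """Restore the heap property at node n, assuming both subtrees obey it."""
    largest = n
    for child in (2 * n, 2 * n + 1):
        if child <= size and heap[child] > heap[largest]:
            largest = child
    if largest != n:
        heap[n], heap[largest] = heap[largest], heap[n]
        heapify(heap, largest, size)        # recurse along the swapped branch

def heap_extract_max(heap):
    """Remove and return the root, refilling it with the bottom element."""
    top = heap[1]
    heap[1] = heap[-1]                      # bottom element moves to the top
    heap.pop()                              # heap length shrinks by one
    heapify(heap, 1, len(heap) - 1)
    return top

def heap_insert(heap, value):
    """Append at the bottom, then swap upward while the child beats its parent."""
    heap.append(value)
    n = len(heap) - 1
    while n > 1 and heap[n] > heap[n // 2]:
        heap[n], heap[n // 2] = heap[n // 2], heap[n]
        n //= 2

h = [None]                                   # index 0 unused
for v in (0.31, -0.09, 0.10, 0.05):
    heap_insert(h, v)
assert heap_extract_max(h) == 0.31
```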
The disclosed teachings provide new systems and methods for clustering of elements, including the establishment of new methods for the efficient calculation of clustering assignment scores. One possible implementation of the disclosed teachings is in a computer system having as an input a two-dimensional similarity matrix, generally of the format described in FIG. 3. However, it should be noted that the similarity between two elements might be provided in several other forms, including a list. The similarity matrix defines the similarity between each pair of input elements, and the respective data is therefore organized in a two-dimensional matrix. The similarity of an element to itself is naturally considered to be very high and is denoted by the value "1". Other similarities may have a different degree of similarity, ranging from "0" to "1". The lower the value in this matrix, the lower is the similarity between the corresponding elements. The similarity of X1 to X2 is denoted by the value V12, and the similarity of X2 to X1 is denoted by the value V21. As explained earlier, Vij is not necessarily equal to Vji. It should be noted that the above range of [0,1] is only illustrative, and any convenient range [x,y] can be used without deviating from the scope of the disclosed teachings.
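As a non-limiting illustration, such a similarity matrix may be represented as follows; the values are invented for this sketch and are not taken from the figures:

```python
# Hypothetical 3-element similarity matrix in the [0, 1] range.
V = [
    [1.0, 0.7, 0.3],   # similarities of X1 to X1, X2, X3
    [0.8, 1.0, 0.2],   # similarities of X2 to X1, X2, X3
    [0.4, 0.3, 1.0],   # similarities of X3 to X1, X2, X3
]
assert V[0][1] != V[1][0]   # the measure need not be symmetric
```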
In the implemented technique, a "greedy" approach is used. The term "agglomerative" is also used in the art to denote this approach. In such a greedy approach, in every step, a current best score is calculated and used from then on, rather than calculating the entire universe of possible cases. Calculating all the possible cases may provide a more accurate result. However, this is not feasible, especially where there is a significantly large number of elements to be clustered.
Hence, in order to entice pairs of elements into merging initially, the disclosed teachings contemplate preferably replacing the self-similarity Vii=1, i.e., the similarity of an element to itself, by another value more conducive to allowing the merge of elements to take place. A skilled artisan will note that leaving the self-similarity at a value of "1", though not incorrect, will result in increased "resistance" for an element to cluster with other elements.
To further improve the chance of merging, the disclosed teachings contemplate using modified values in the similarity matrix. This is done by replacing the self-similarity value, in each row of the similarity matrix, with the average similarity to the other elements in the row. Therefore, the modified self-similarity value V11 for X1 shall be the result of the sum of the values V12 through V1n divided by n-1. While in this implementation the averaging is performed by rows, without deviating from the spirit of the disclosed teachings, columns, combinations of columns and rows, as well as a weighted combination of columns and rows, could be used as a basis for averaging. FIG. 4A shows a non-limiting exemplary similarity matrix for five elements. As a first step, the self-similarity values V11, V22, V33, V44, and V55 are replaced with the respective average for each row, and hence:
V11 = (V12 + V13 + V14 + V15)/4 = (0.714 + 0.286 + 0.429 + 0.429)/4 = 0.464
Similarly
V22 = (V21 + V23 + V24 + V25)/4 = (0.833 + 0.167 + 0.5 + 0.333)/4 = 0.458
After completing all the calculations and replacing the original self-similarities with the new averages, the modified similarity matrix has the values shown in FIG. 4B.
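A minimal sketch of this diagonal replacement is given below; the helper name is illustrative, and only the two rows whose off-diagonal values are quoted in the text are filled in (the remaining rows of FIG. 4A are not reproduced here):

```python
def replace_self_similarity(V):
    # Replace each diagonal entry with the average of the other values
    # in its row, per the disclosed modification.
    n = len(V)
    for i in range(n):
        others = [V[i][j] for j in range(n) if j != i]
        V[i][i] = sum(others) / (n - 1)

V = [[0.0] * 5 for _ in range(5)]               # placeholder rows 3-5
V[0] = [1.0, 0.714, 0.286, 0.429, 0.429]        # similarities of X1 to X1..X5
V[1] = [0.833, 1.0, 0.167, 0.5, 0.333]          # similarities of X2 to X1..X5
replace_self_similarity(V)
assert abs(V[0][0] - 0.4645) < 1e-9             # the text rounds this to 0.464
assert abs(V[1][1] - 0.45825) < 1e-9            # the text rounds this to 0.458
```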
A repetitive procedure is used for the purpose of identifying the best two clusters to be merged in a specific similarity matrix. The pair having the highest score is chosen to be merged. It should be noted that the pair could consist of two elements, two clusters, or a cluster and an element, where an element is considered to be a cluster containing a single element, otherwise known in the art as a singleton. The similarity matrix is then recalculated by adding the merged clusters' respective rows and columns. By this operation the similarity matrix could become a non-normalized similarity matrix. For simplicity, this non-normalized similarity matrix will continue to be referred to herein as a similarity matrix. The process continues until all the elements are grouped in one cluster. This repetitive procedure is discussed in detail subsequently with reference to FIGs. 4B-4E.
A maximum score will correspond to one of the selected clustering assignments of each round of calculation. The clustering assignment corresponding to the maximum score is then chosen. Relative to the relationship measure and the scoring technique used, this chosen clustering assignment can be characterized as the best clustering assignment. As noted earlier, a different relationship measure and a different scoring technique could yield a different result.
The disclosed teachings further note that once the clustering assignment score drops for the first time, no further calculations are necessary. This is because the immediately preceding clustering assignment has at least a local maximum score and hence can be chosen as the clustering assignment of choice.
In another implementation, instead of directly calculating the score in each of the iterations, the difference between consecutive score values, herein referred to as the delta or delta values, is used. These delta values are added to the immediately preceding score, resulting in the determination of the next score. Using this technique, the processing time for score calculations is reduced further.
As mentioned above, a score is determined based on the pairwise proximity or similarity; that is, it measures the proximity or similarity of elements within a cluster. However, in alternate implementations, multiple scores may be calculated. Each such score may be denoted as m1 through mn. A final score may be obtained by adding the multiple scores to one another to get a combined score. The multiple scores may also be weighted using a variety of weights α, β, γ, etc., such that the more important of the multiple scores influence the final combined score more than others. Hence, in a general form, the combined score for the clustering assignment would be calculated as follows:

Combined score = α*m1 + β*m2 + ...
In an implementation of the disclosed teachings, a measure relative to the average similarity (and/or the score noted above) of each element of a cluster to all other elements within that cluster is used. This is known as the "IN" score. The initial "IN" value is calculated for a clustering assignment wherein each cluster consists of one element, with each element having its own cluster. Therefore, the initial "IN" score is the summation of the Vii values for i=1 through i=n.
Hence, the IN(5) score for the example in FIG. 4B, i.e., the IN score where there are five separate elements each belonging to its own cluster, has the following value:
IN(5) = V11 + V22 + V33 + V44 + V55 = 0.464 + 0.458 + 0.5 + 0.35 + 0.389 = 2.162
This IN score represents the case where each element remains in its own cluster. It should be noted that this IN score is also the highest score possible for this case. Each time a cluster assignment is tested, this calculation must be repeated.
However, the disclosed teachings also contemplate using a simpler method of efficiently selecting a clustering assignment, as well as an efficient technique for calculating the IN score of each clustering option. According to this technique, instead of calculating the IN score for all possible clustering assignments (ten in the case described in FIG. 4B), only delta values from the immediately previous IN score need to be calculated. The new overall score is the sum of the immediately previous IN score and the delta value. This method of calculating IN(i-1) through a delta value from IN(i), which is simpler and faster, can easily be proven to be equally effective by a skilled artisan.
The alternative would be to create ten new non-normalized similarity matrices and calculate the IN score from there. The non-normalized similarity is the simple sum of the similarities between elements.
The next step is to calculate the delta value that would represent the results of merging any two clusters into one cluster, while leaving all other clusters intact. The disclosed teachings contemplate calculating delta using the following formula:
ΔIN(i,j) = [Vij + Vji - (Nj/Ni)*Vii - (Ni/Nj)*Vjj] / (Ni + Nj)
where the notations have the following meanings:
ΔIN(i,j) - is the incremental value to the score of merging cluster i and cluster j
Ni - is the number of elements in the ith cluster
Vij - is the non-normalized similarity value of cluster i to cluster j
A new IN score is calculated based on the best clustering assignment, which is the one where ΔIN(i,j) is the highest. In FIG. 4B it is possible to identify ten optional clustering assignments for merging two of the five clusters, and hence ten delta values have to be calculated. For example, the value calculated for merging the cluster containing X1 with the cluster containing X2 into one cluster is calculated as follows:

ΔIN(1,2) = [0.714 + 0.833 - (1/1)*0.464 - (1/1)*0.458] / (1 + 1)
The result of the calculation is that ΔIN(1,2) is equal to 0.3125. In another example, the value calculated for merging the cluster containing X3 with the cluster containing X5 into one cluster is calculated as follows:

ΔIN(3,5) = [0.75 + 0.333 - (1/1)*0.5 - (1/1)*0.389] / (1 + 1)

The result of the calculation is that ΔIN(3,5) is equal to 0.09722. This sequence is repeated for all ten possible merges, and the respective results are presented in FIG. 4B. The best score for IN(4) can now be easily calculated, as it is defined as the value of IN(5) plus the highest of the delta values calculated in this step. In this case the highest delta value is that of ΔIN(1,2), and hence the value of IN(4) is determined to be 2.474.
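A minimal sketch of this delta computation, implementing the formula above, is given below; names are illustrative, and only the matrix entries needed for the two worked examples are filled in:

```python
def delta_in(V, sizes, i, j):
    # Incremental IN score of merging clusters i and j.
    ni, nj = sizes[i], sizes[j]
    return (V[i][j] + V[j][i] - (nj / ni) * V[i][i]
            - (ni / nj) * V[j][j]) / (ni + nj)

sizes = [1, 1, 1, 1, 1]                          # all clusters are singletons
V = [[0.0] * 5 for _ in range(5)]
V[0][0], V[1][1], V[2][2], V[3][3], V[4][4] = 0.464, 0.458, 0.5, 0.35, 0.389
V[0][1], V[1][0] = 0.714, 0.833                  # X1 <-> X2
V[2][4], V[4][2] = 0.75, 0.333                   # X3 <-> X5
assert round(delta_in(V, sizes, 0, 1), 4) == 0.3125
assert round(delta_in(V, sizes, 2, 4), 3) == 0.097   # 0.09722 with unrounded inputs
```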
The next step of the technique recalculates the similarity matrix of FIG. 4B such that it takes into account that two clusters have been merged. In each case where the merge of clusters has an effect, the recalculation must take place. Initially, in this recalculation, the rows corresponding to the clusters to be merged are added to each other, followed by the same process for the columns of the clusters to be merged. FIG. 4C shows the similarity matrix after this recalculation. It can be easily seen that the values for clusters that have not merged in this step have remained the same. For example, the value V34, which has the value of 0.5 in FIG. 4B, has the same value in FIG. 4C. In comparison, the value V15 no longer exists in FIG. 4C, as the clusters containing elements X1 and X2 were merged into one cluster. The new value, now V12,5, is the sum of the values of V15 and V25, which is 0.762. In the case of the value V12,12, the values V11, V12, V21, and V22 are summed up, resulting in a value of 2.47.
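The row-and-column summation can be sketched as follows; this is a non-limiting illustration in which only the entries needed to reproduce the two values discussed above are filled in:

```python
def merge_rows_columns(V, i, j):
    # Merge cluster j into cluster i: sum their rows, sum their columns,
    # then delete j's row and column.
    for k in range(len(V)):
        V[i][k] += V[j][k]
    for row in V:
        row[i] += row[j]
    del V[j]
    for row in V:
        del row[j]

V = [[0.0] * 5 for _ in range(5)]
V[0][0], V[0][1], V[1][0], V[1][1] = 0.464, 0.714, 0.833, 0.458
V[0][4], V[1][4] = 0.429, 0.333
merge_rows_columns(V, 0, 1)
assert round(V[0][0], 2) == 2.47       # V11 + V12 + V21 + V22
assert round(V[0][3], 3) == 0.762      # former V15 + V25
```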
FIGs. 4C, 4D, and 4E depict the repetition of this process, which enables the calculation of all possible IN scores using the disclosed teachings. At each stage, the delta values are calculated and the highest delta score is selected. The clusters corresponding to that delta value are merged and the similarity matrix is updated accordingly. This process continues until all elements are merged into one cluster.
The results of IN(5)=2.162, IN(4)=2.474, IN(3)=2.7379, IN(2)=2.6647, and IN(1)=2.1615 are depicted in graphical form in FIG. 5. It can be easily seen that the IN score rises when there are four clusters in comparison to five clusters, and rises even more when there are three clusters in comparison to four clusters. However, the IN score begins to decline when two clusters are created and continues to decline when all elements are combined into one cluster.
Therefore, according to the disclosed teachings, it can be concluded that by maximizing the IN score, a "best" clustering assignment relative to the relationship measure and the scoring technique (the IN score computation technique in this case) is obtained. Accordingly, the case where there are three clusters, namely {X1,X2}, {X3}, {X4,X5}, should be chosen.
In another implementation of the disclosed teachings, the calculations of deltas will cease when the first decline of IN scores is detected.
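By way of a non-limiting illustration, the overall greedy loop with this early stop may be sketched as follows. The helpers restate the earlier sketches (with cluster sizes threaded through) so that the example is self-contained; all names are illustrative, and the input matrix is assumed to already have its diagonal replaced by row averages:

```python
def delta_in(V, sizes, i, j):
    # IN delta for merging clusters i and j, per the formula above.
    ni, nj = sizes[i], sizes[j]
    return (V[i][j] + V[j][i] - (nj / ni) * V[i][i]
            - (ni / nj) * V[j][j]) / (ni + nj)

def merge(V, sizes, i, j):
    # Sum cluster j's row and column into cluster i's, then drop j.
    for k in range(len(V)):
        V[i][k] += V[j][k]
    for row in V:
        row[i] += row[j]
    del V[j]
    for row in V:
        del row[j]
    sizes[i] += sizes[j]
    del sizes[j]

def greedy_in_clustering(V):
    sizes = [1] * len(V)
    clusters = [[i] for i in range(len(V))]
    score = sum(V[i][i] for i in range(len(V)))     # initial IN score
    best = (score, [list(c) for c in clusters])
    while len(V) > 1:
        pairs = [(i, j) for i in range(len(V)) for j in range(i + 1, len(V))]
        i, j = max(pairs, key=lambda p: delta_in(V, sizes, *p))
        d = delta_in(V, sizes, i, j)
        if d < 0:                     # first decline of the IN score: stop here
            break
        score += d
        clusters[i] += clusters[j]
        del clusters[j]
        merge(V, sizes, i, j)
        if score > best[0]:
            best = (score, [list(c) for c in clusters])
    return best                       # (highest IN score, its clustering assignment)
```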
In yet another implementation of the disclosed teachings, multiple scoring techniques are used, each designated its own weight factor. A graph similar to the graph shown in FIG. 5 is drawn. However, in this case, the weighted overall score is used rather than the IN scores alone. The highest overall score is used to determine the best clustering assignment possible.
In still another implementation, the calculation ceases when the first decline in score values is observed.
For calculations related to the modified similarity matrix, such as the exemplary implementation described in FIG. 4B, another quality measurement, the IN-OUT score, could be used. The IN-OUT score is a variation of the previously explained IN score, the variation being based on a new score measurement named OUT. The OUT score is reflective of the average proximity between a cluster and all other clusters. It should be noted that the proximity between two clusters is the aggregate (or average) of the proximity between all possible pairwise combinations of members of the first cluster and the second cluster. More specifically, the use of the IN-OUT score is aimed at merging two clusters having a strong proximity to each other but a weak proximity to all other clusters.
The initial OUT value is identical to that of the initial IN score explained above, as each cluster contains only one element; hence IN(5)=OUT(5) in the example. It should be further noted that when all elements are clustered into one cluster, the OUT score has no meaning, as there is no proximity/similarity to another cluster to compare with, and hence the score calculation ceases at two clusters. It should be noted that the meaning of delta in the context of an OUT score is the same as before. The delta value for an OUT score, relative to the immediately preceding OUT score, is calculated as follows:
ΔOUT(i,j) = [(Nj/(N - Ni))*Ui + (Ni/(N - Nj))*Uj - (Vij + Vji)] / (N - Ni - Nj)
where the notations have the following meanings:
N - is the total number of elements to be clustered
Ui - is the aggregation of similarities of cluster i to all clusters but itself
All other designators have the meanings explained above in reference to the IN score.
A score may now be calculated based on both the ΔIN and ΔOUT values as follows:
ΔIN-OUT(i,j) = α*ΔIN(i,j) - β*ΔOUT(i,j)
The modified similarity matrix using the IN-OUT score is now depicted in FIG. 6A, with the corresponding delta values ΔIN-OUT being calculated for each possible clustering assignment. The similarity matrix is updated and the highest overall score is selected in the same way as previously described. The vector representing all the U values is updated using the following formula:
Umerge(i,j) = Ui + Uj - Vij - Vji
Umerge(i,j) represents the value of U for the merged cluster formed by the merging of clusters i and j.
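A minimal sketch of the OUT delta and of this U-vector update, following the formulas reconstructed above, is given below. Names are illustrative; U[i] is assumed to hold the aggregation of similarities of cluster i to all clusters but itself, N the total number of elements, and sizes[i] the number of elements in cluster i:

```python
def delta_out(V, U, sizes, N, i, j):
    # OUT delta for merging clusters i and j, per the formula above.
    ni, nj = sizes[i], sizes[j]
    return ((nj / (N - ni)) * U[i]
            + (ni / (N - nj)) * U[j]
            - (V[i][j] + V[j][i])) / (N - ni - nj)

def merged_u(V, U, i, j):
    # U value of the merged cluster: the pair's mutual similarities become
    # intra-cluster and therefore leave the OUT aggregate.
    return U[i] + U[j] - V[i][j] - V[j][i]

# Combined delta, with alpha and beta weighting the two measures:
#   delta_in_out = alpha * delta_in(...) - beta * delta_out(...)
```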
Assuming α and β values each equal to "1", the results may be viewed in FIGs. 6B through 6D. The IN-OUT scores for IN-OUT(5), IN-OUT(4), IN-OUT(3), and IN-OUT(2) are 0, 0.5208, 0.9606, and 0.9977, respectively. These scores are plotted in FIG. 7, and it can be easily seen that in this case the highest IN-OUT score is achieved when there are only two clusters. The selected cluster assignment is therefore {X1,X2}, {X3,X4,X5}. The IN-OUT score for leaving each element in its own cluster is always zero, as explained above. Different α and β values may result in a different cluster assignment.
In order to facilitate an efficient implementation of the disclosed teachings, an efficient sorting of the calculated delta values is required. This is desirable as this clustering method may be employed for clustering a large number of elements, and hence efficient sorting is required for a practical implementation with a large number of elements. For such a sorting operation, the present implementation uses a heap for priority queue sorting. However, it should be noted that using a heap is not an absolute requirement for implementing the disclosed teachings.
It is advantageous to use a heap-based technique, as the heap guarantees that the largest number will always be at the top of the heap. However, certain modifications to the standard heap implementation are required to handle the fact that data stored in nodes of the heap may be changed or removed as a result of the merging of elements. It is undesirable to recreate the heap each time some of the values change, as this would be a time-consuming task.
Therefore, a back-annotation, which includes the cluster location within the heap, is used for easy update of the heap. This enables fast updates of the heap, including size reduction of the heap, without requiring recreation of the heap after every update of the table. Therefore, a remove function is added to the functions for managing the heap. The remove function allows for the removal of certain node information from the heap, as may be desired, and consequently updates the heap such that it retains the heap property. The removal is required as a result of the merging of elements into a cluster, as explained above.
The remove operation performs the following when receiving the index of a node to be removed: a) replaces the content of the node to be removed with the content of the node at the bottom of the heap; b) reduces the length of the heap by one; c) compares the content of the node with the content of its parent and performs the following: i) if the content of the element is larger than that of its parent, then a swap-up operation, as explained below, takes place; otherwise ii) a heapify operation beginning at that node takes place.
The swap-up operation replaces the content of a parent and child if the content of the child is higher than that of the parent and continues recursively as long as a swap can take place. This is repeated, when necessary, all the way to the top of the heap. By doing this, it is guaranteed that once performed, the heap in its entirety will maintain the heap property.
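A minimal sketch of a heap augmented with such a back-annotation table, including the remove operation and swap-up, is given below; the class and key layout are illustrative assumptions, not the patent's code:

```python
class AnnotatedHeap:
    # Max-heap of (value, key) pairs in a 1-based list (index 0 unused);
    # keys identify a candidate merge (i, j), and bat[key] tracks that
    # key's current node index so it can be removed without a rebuild.
    def __init__(self):
        self.nodes = [None]          # nodes[1] is the root
        self.bat = {}                # key -> index of its node in the heap

    def _swap(self, a, b):
        self.nodes[a], self.nodes[b] = self.nodes[b], self.nodes[a]
        self.bat[self.nodes[a][1]] = a
        self.bat[self.nodes[b][1]] = b

    def _swap_up(self, n):
        while n > 1 and self.nodes[n][0] > self.nodes[n // 2][0]:
            self._swap(n, n // 2)
            n //= 2

    def _heapify(self, n):
        size = len(self.nodes) - 1
        largest = n
        for child in (2 * n, 2 * n + 1):
            if child <= size and self.nodes[child][0] > self.nodes[largest][0]:
                largest = child
        if largest != n:
            self._swap(n, largest)
            self._heapify(largest)

    def insert(self, value, key):
        self.nodes.append((value, key))
        self.bat[key] = len(self.nodes) - 1
        self._swap_up(len(self.nodes) - 1)

    def remove(self, key):
        n = self.bat.pop(key)
        last = self.nodes.pop()              # bottom of the heap
        if n < len(self.nodes):              # removed node was not the bottom
            self.nodes[n] = last
            self.bat[last[1]] = n
            self._swap_up(n)                 # larger than its parent?
            self._heapify(n)                 # or smaller than a child?

h = AnnotatedHeap()
h.insert(0.3125, (1, 2)); h.insert(-0.0893, (1, 3)); h.insert(0.0972, (3, 5))
h.remove((1, 3))                             # merge of 1,2 invalidates (1,3)
assert h.nodes[1] == (0.3125, (1, 2)) and h.bat[(3, 5)] == 2
```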
The operation of this technique is best explained through the example described in FIGs. 8A-8E. The delta values are the same values that were calculated in the first example. The first step is to place the calculated delta values in heap 800, in no specific order, as shown in FIG. 8A. The delta values were taken from FIG. 4B, using the order from ΔIN(1,2) through ΔIN(4,5). In addition, a back-annotation table (BAT) 850, mapping the location of each delta value in the heap, is also shown in FIG. 8A. It can be easily seen that ΔIN(1,2), which equals 0.3125, is placed at location "1" of the heap in node 801. Similarly, ΔIN(1,3), which is equal to -0.0893, is placed at location "2" of the heap in node 802. This continues until the last delta value is placed in location "10" at node 810. For each delta value placed in a node, the corresponding node number is placed in BAT 850. Hence, ΔIN(1,2) has a designator "1" in the table, as it is placed in node 801, which is the first location of the heap. In addition to the value of each node, the origin of the delta value is also included, resulting in an immediate indication of the potential merge of elements or clusters.
At this time, it should be noted that the heap does not satisfy the heap property. That is, presently, a parent node is not always equal to, or larger than, its child nodes. Therefore a build-heap operation is required. However, a modified build-heap (MBH) operation is used, because an update of BAT 850 is also necessary as delta values are moved around heap 800. Hence, in addition to the normal operation of the build-heap operation, the corresponding update of BAT 850 takes place.
In FIG. 8B, heap 800 and BAT 850 are shown after the heap was updated in order to conform to the heap property. At the top of the heap is the largest delta value and the indication that elements "1" and "2" are the candidates for a merge. As elements "1" and "2" are to be merged, there will be implications for the delta values in which either element "1" or element "2" appears. All such delta values need to be updated. The procedure for removing the top of the heap is conventionally known, as explained above. After applying heap-extract-max, the heap will have the structure described in FIG. 8C. The remove operation is now used to remove elements "1,3", "1,4", "1,5", "2,3", "2,4", and "2,5" from the heap, and the result is shown in FIG. 8D. The operation can be performed due to the data stored in BAT 850, which provides for the back-annotation of the elements or clusters to the nodes.
At this stage, three new nodes will be added, representing the possible merges between the cluster {1,2} and elements 3, 4, and 5 respectively. The corresponding delta values can be found in FIG. 4C, and the newly created heap is shown in FIG. 8E. It should be noted that this technique allowed for the reduction of the size of the heap, and hence, as the operation continues, a smaller heap is processed. BAT 850 is updated to allow for the back-annotation corresponding to the changes in the heap. It should be noted that, of the two merged elements or clusters, the row with more data cells in BAT 850 is replaced by "-1" to denote an invalid value. The index to the row containing fewer data cells is updated to correspond to a designator indicating that it is a merged case. The process is now repeated until all elements/clusters are merged into one cluster.
In another implementation, this repetitive procedure will cease once the new IN score drops for the first time from the immediately preceding IN score.
While the above discussion was for the case of delta values for the IN score, it can be used for the IN-OUT score, or other scores or combinations of scores. A person skilled in the art could easily implement, instead of a heap-based sorting mechanism, a binary search tree (BST) based sorting mechanism. The first step for the BST is somewhat different from that of the heap; however, the rest of the techniques used in the disclosed teachings apply.
Furthermore, a skilled artisan could easily implement the system for use with dissimilarity measurements instead of similarity between elements. However, in such a case it will be necessary to seek minimum values instead of maximum values.
It should be clear to a skilled artisan that the disclosed teachings can be implemented using hardware, software, or a combination thereof. An exemplary block diagram of such a clustering system 900 is described in FIG. 9. System 900 comprises an initializer 910, a merger 920, a selector 930, and an assigner database 940. Initializer 910 receives the pairwise similarity data and performs the functions required to allow the iterative processing disclosed in this invention. Initialization functions may include at least the update of the self-similarity values and the calculation of the initial scores. Merger 920 is adapted to perform the iterative function of finding the best two clusters to be merged at each step of the calculations, and providing a score for the selected clustering assignment. It is further adapted to perform the calculation of the scores by calculating delta values to be added to the immediately preceding score. It is further adapted to repeat this process until all clusters are clustered into a single cluster, or until selector 930 determines that the highest score possible using the disclosed method was achieved. Selector 930 selects the best clustering assignment from all the potential clustering assignments it receives. In some embodiments it will select the clustering assignment upon the first indication of a decrease in the score of the immediately preceding potential clustering assignment. Assigner database 940 contains all the data necessary for the operation of merger 920 and is a source for data provided to selector 930. The assigner database may hold the similarity data and its updates, as well as the delta values, possibly in a heap structure. Any block described hereinabove can be implemented in hardware, software, or a combination of hardware and software. The system can also be implemented using any type of computer, including PCs, workstations, and mainframes. The disclosed teachings also include computer program products that include a computer-readable medium with instructions. These instructions can include, but are not limited to, source code, object code, and executable code, and can be in any higher-level language, assembly language, or machine language. The computer-readable medium is not restricted to any particular type of medium; it can include, but is not limited to, RAMs, ROMs, hard disks, floppies, tapes, and internet downloads.
Other modifications and variations to the disclosed teachings will be apparent to those skilled in the art from the foregoing disclosure and teachings.
Thus, while only certain embodiments using the disclosed teachings have been specifically described herein, it will be apparent that numerous modifications may be made thereto without departing from the spirit and scope of the disclosed teachings.

Claims

What is claimed is:
1. A clustering system comprising: an initializer adapted to perform at least one of: update of self-similarity values and calculation of an initial score for an optional clustering assignment; a merger adapted to perform at least one of: calculation of the delta values for the potential merger of any two clusters, and update of the pairwise similarity as a result of merging the two clusters with the highest score in each round of calculations; a clustering assignment selector adapted to determine the clustering assignment with the highest overall score; and an assigner database adapted to store and retrieve at least: pairwise similarity and updates thereof, and scores of potential clustering assignments determined at each round of calculations.
2. The clustering system of claim 1, wherein said initial similarity value for said first one of an element or a cluster to itself is an average of a similarity of said first one of an element or a cluster to all other elements.
3. The clustering system of claim 1, wherein said initial similarity values are organized in a two-dimensional table.
4. The clustering system of claim 3, wherein said initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same row in said table.
5. The clustering system of claim 3, wherein said initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same column in said table.
6. The clustering system of claim 1, wherein said score is a sum of multiple sub-scores.
7. The clustering system of claim 6, wherein said sum is a weighted sum.
8. The clustering system of claim 6, wherein at least one sub-score is based on scores related to all elements within a cluster.
9. The clustering system of claim 8, wherein an initial overall score is a sum of the similarity of each element to itself.
10. The clustering system of claim 9, wherein each subsequent overall score, other than said initial overall score, is calculated as a differential value from an immediately preceding score.
11. The clustering system of claim 10, wherein each score for a potential clustering assignment is determined by adding an immediately preceding overall score to said differential value.
12. The clustering system of claim 11, wherein only a highest of said differential values in each round of calculations is added to the immediately preceding score.
13. The clustering system of claim 3, wherein said updating of similarity matrix is achieved by adding rows corresponding to the merged clusters in said table and adding columns corresponding to the merged clusters.
14. The clustering system of claim 10, wherein the differential values are placed in a priority queue.
15. The clustering system of claim 14, wherein the priority queue is implemented by a heap.
16. The clustering system of claim 15, wherein a back-annotation of a differential value and its position in the heap is maintained.
17. The clustering system of claim 16, wherein said updates of similarity matrix result in removal of all delta values in said heap corresponding to said merged clusters.
18. The clustering system of claim 17, wherein said system is adapted to replace said removed value with the value at the bottom of the heap, reduce the heap length by one, swap the value with a parent if said value is larger than the parent, or heapify the heap starting from a node corresponding to the value if the value is smaller than at least one of its child nodes, and update said back-annotation to reflect changes in the heap.
19. The clustering system of claim 17, wherein at least one new differential value is inserted into the heap.
20. A computer system with a memory and a central processing unit, the memory comprising instructions, said instructions being capable of enabling the computer to implement clustering, said instructions including instructions for implementing: an initializer adapted to perform at least one of: update of self-similarity values and calculation of an initial score for an optional clustering assignment; a merger adapted to perform at least one of: calculation of the delta values for the potential merger of any two clusters, and update of the pairwise similarity as a result of merging the two clusters with the highest score in each round of calculations; a clustering assignment selector adapted to determine the clustering assignment with the highest overall score; and an assigner database adapted to store and retrieve at least: pairwise similarity and updates thereof, and scores of potential clustering assignments determined at each round of calculations.
21. The computing system of claim 20, wherein said initial similarity value for said first one of an element or a cluster to itself is an average of a similarity of said first one of an element or a cluster to all other elements, and all other elements to said first one of an element or a cluster.
22. The computing system of claim 20, wherein said initial similarity values are organized in a two-dimensional table.
23. The computing system of claim 22, wherein said initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same row in said table.
24. The computing system of claim 22, wherein said initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same column in said table.
25. The computing system of claim 20, wherein said score is a sum of multiple sub-scores.
26. The computing system of claim 25, wherein said sum is a weighted sum.
27. The computing system of claim 25, wherein at least one sub-score is based on scores related to all elements within a cluster.
28. The computing system of claim 27, wherein an initial overall score is a sum of the similarity of each element to itself.
29. The computing system of claim 28, wherein each subsequent overall score, other than said initial overall score, is calculated as a differential value from an immediately preceding score.
30. The computing system of claim 29, wherein each score for a potential clustering assignment is determined by adding an immediately preceding overall score to said differential value.
31. The computing system of claim 30, wherein only a highest of said differential values in each round of calculations is added to the immediately preceding score.
32. The computing system of claim 22, wherein said updating of similarity matrix is achieved by adding rows corresponding to the merged clusters in said table and adding columns corresponding to the merged clusters.
33. The computing system of claim 29, wherein the differential values are placed in a priority queue.
34. The computing system of claim 33, wherein said priority queue is implemented by a heap.
35. The computing system of claim 34, wherein a back-annotation of a differential value and its position in the heap is maintained.
36. The computing system of claim 35, wherein said updates of similarity matrix result in removal of all delta values in said heap corresponding to said merged clusters.
37. The computing system of claim 36, wherein said system is adapted to replace the removed value with the value at the bottom of the heap, reduce the heap length by one, swap the value with a parent if said value is larger than the parent and swap the value with a child if the child is larger than a parent, heapify the heap starting from a node corresponding to the value if the value is smaller than at least one of its child nodes, and update said back-annotation to reflect changes in the heap.
38. The computing system of claim 36, wherein at least one new differential value is inserted into the heap.
39. The computing system of claim 20, wherein said central processing unit includes a plurality of processing units.
40. The computing system of claim 20, wherein said central processing unit includes a plurality of distributed processing units.
41. A computer program product for clustering elements, the computer program product including computer-readable media carrying instructions, said instructions including instructions for enabling a computer to implement operations comprising: a) accessing initial similarity values, said similarity values corresponding to a pair of clusters wherein each of said pair of clusters could also be an individual element; b) calculating a delta score for an optional merger of a pair of clusters; c) determining a highest score from all possible cluster pairs' scores; d) updating said similarity values as a result of merging said clusters with said highest score; and e) determining the clustering assignment with the highest overall score.
42. The computer program product of claim 41, wherein said similarity values are organized in a similarity matrix.
43. The computer program product of claim 41, wherein after said accessing of initial similarity value, the initial similarity value of a first element to itself is replaced by an average of a similarity of said first one of an element or a cluster to all other elements, and all other elements to said first one of an element or a cluster.
44. The computer program product of claim 41, wherein said initial similarity values are organized in a two-dimensional table.
45. The computer program product of claim 44, wherein said initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same row in said table.
46. The computer program product of claim 44, wherein said initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same column in said table.
47. The computer program product of claim 41, wherein said score is a sum of multiple sub-scores.
48. The computer program product of claim 47, wherein said sum is a weighted sum.
49. The computer program product of claim 47, wherein at least one sub-score is based on scores related to all elements within a cluster.
50. The computer program product of claim 49, wherein an initial overall score is a sum of the similarity of each element to itself.
51. The computer program product of claim 50, wherein each subsequent overall score, other than said initial overall score, is calculated as a differential value from an immediately preceding score.
52. The computer program product of claim 51, wherein each score for a potential clustering assignment is determined by adding an immediately preceding overall score to said differential value.
53. The computer program product of claim 52, wherein only a highest of said differential values in each round of calculations is added to the immediately preceding score.
54. The computer program product of claim 44, wherein said updating of said similarity matrix is achieved by adding rows corresponding to the merged clusters in said table and adding columns corresponding to the merged clusters.
55. The computer program product of claim 51, wherein the differential values are placed in a priority queue.
56. The computer program product of claim 55, wherein said priority queue is implemented by a heap.
57. The computer program product of claim 56, wherein a back-annotation of a differential value and its position in the heap is maintained.
58. The computer program product of claim 57, wherein said updating of said similarity matrix results in removal of all delta values in said heap corresponding to said merged clusters.
59. The computer program product of claim 58, wherein said instructions further comprise instructions for: g) replacing said removed value with the value at the bottom of the heap; h) reducing the heap length by one; i) if said value is larger than the parent, then swapping places and continuing said swap while a child is larger than a parent; j) if said value is smaller than at least one of its respective child nodes, then heapifying the heap from said value's node; and k) updating said back-annotation to reflect the changes in the heap.
60. The computer program product of claim 59, wherein at least one new differential value is inserted into the heap.
61. A method for the indirect calculation of a series of greedy pairwise IN scores, the method comprising: a) inputting pairwise similarity between elements; b) replacing a self-similarity of each element to itself with an average of said element's similarity to all other elements and all other elements to said element; c) calculating an initial IN score as the sum of said self-similarities; d) calculating a delta value corresponding to each case of merging two elements into a cluster; e) determining the highest of said delta values and accordingly selecting the corresponding clustering assignment; f) calculating the new IN score as the sum of the most recent IN score calculated and said highest delta value; g) updating the pairwise similarity by combining rows corresponding to said merged clusters and combining columns corresponding to said merged clusters; h) repeating steps d) through g) until all elements are merged into a single cluster; i) selecting the clustering assignment which has the highest IN score.
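As a rough, non-authoritative sketch of the loop recited in claim 61 — with the delta expression of claim 65 inlined, a brute-force pair scan standing in for the heap of the earlier claims, and NumPy assumed for the matrix arithmetic; the function name and data layout are inventions for illustration:

```python
import numpy as np

def greedy_in_clustering(S):
    """Sketch of the claim 61 loop. S is an n-by-n similarity matrix
    (S[i, j] = similarity of element i to element j)."""
    S = S.astype(float).copy()
    n = S.shape[0]
    # Step b): replace each self-similarity with the average of the
    # element's similarity to all others and of all others to it.
    for i in range(n):
        S[i, i] = ((S[i, :].sum() - S[i, i])
                   + (S[:, i].sum() - S[i, i])) / (2 * (n - 1))

    clusters = {i: [i] for i in range(n)}     # live clusters, keyed by row index
    sizes = {i: 1 for i in range(n)}
    in_score = sum(S[i, i] for i in clusters)            # step c)
    best_score = in_score
    best_assignment = [list(c) for c in clusters.values()]

    while len(clusters) > 1:
        # Step d): a delta value for every candidate merge (claim 65's
        # expression); a brute-force scan replaces the heap claims here.
        live = sorted(clusters)
        best = None
        for a, i in enumerate(live):
            for j in live[a + 1:]:
                ni, nj = sizes[i], sizes[j]
                delta = (S[i, j] + S[j, i]
                         - S[i, i] * nj / ni
                         - S[j, j] * ni / nj) / (ni + nj)
                if best is None or delta > best[0]:
                    best = (delta, i, j)
        delta, i, j = best
        in_score += delta                                # steps e) and f)
        S[i, :] += S[j, :]                               # step g): combine rows
        S[:, i] += S[:, j]                               #          and columns
        clusters[i] += clusters.pop(j)
        sizes[i] += sizes.pop(j)
        if in_score > best_score:                        # step i): best level wins
            best_score = in_score
            best_assignment = [list(c) for c in clusters.values()]
    return best_assignment, best_score
```

The sketch runs the agglomeration to a single cluster, as step h) recites; under claim 66 one could instead break out of the loop as soon as the new IN score drops below its predecessor.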
62. The method of claim 61, wherein said pairwise similarities are organized in a similarity matrix.
63. The method of claim 61, wherein in step b) said initial similarity value for a first element to itself is replaced by an average of the similarity of said first element to all other elements in a same row.
64. The method of claim 61, wherein in step b) said initial similarity value for a first element to itself is replaced by an average of the similarity of said first element to all other elements in a same column.
65. The method of claim 61, wherein said delta values for merging two clusters are computed using a method comprising: i) adding values of similarity between a first cluster and a second cluster and similarity between a second cluster and a first cluster; ii) subtracting from the result of step i) the self-similarity of the first cluster multiplied by the ratio between the number of elements in said second cluster and said first cluster; iii) subtracting from the result of step ii) the self-similarity of the second cluster multiplied by the ratio between the number of elements in said first cluster and said second cluster; and iv) dividing the result of step iii) by a total number of elements to be merged in said potential clustering assignment.
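In symbols, writing S(a,b) for the similarity of cluster a to cluster b and n_a for the number of elements in cluster a, the sub-process of claim 65 evaluates to:

```latex
\Delta_{\mathrm{IN}}(1,2) \;=\;
  \frac{S(1,2) + S(2,1)
        - S(1,1)\,\frac{n_2}{n_1}
        - S(2,2)\,\frac{n_1}{n_2}}
       {n_1 + n_2}
```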
66. The method of claim 61, wherein the repetitive loop is stopped when a new IN score is smaller than an immediately preceding IN score.
67. A method for clustering elements comprising: a) accessing initial similarity values in a similarity matrix, said similarity values corresponding to pairs of clusters wherein each of said pair of clusters could also be an individual element; b) calculating a score for an optional merger of a pair of clusters; c) determining a highest score from all possible cluster pairs; d) updating said similarity matrix as a result of merging said clusters with said highest score; and e) determining the clustering assignment with the highest overall score.
68. The method of claim 67, wherein after said accessing of initial similarity values, the initial similarity value of a first one of an element or a cluster to itself is replaced by an average of the similarity of said first one of an element or a cluster to all other elements, and of all other elements to said first one of an element or a cluster.
69. The method of claim 67, wherein said initial similarity values are organized in a two-dimensional table.
70. The method of claim 69, wherein said initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same row in said table.
71. The method of claim 69, wherein said initial similarity value for said first one of an element or a cluster to itself is an average of the similarity of said first one of an element or cluster to all other elements in a same column in said table.
72. The method of claim 67, wherein said score is a sum of multiple sub-scores.
73. The method of claim 72, wherein said sum is a weighted sum.
74. The method of claim 72, wherein at least one sub-score is based on scores related to all elements within a cluster.
75. The method of claim 74, wherein an initial overall score is a sum of the similarity of each element to itself.
76. The method of claim 75, wherein each subsequent overall score, other than said initial overall score, is calculated as a differential value from an immediately preceding score.
77. The method of claim 76, wherein each score for a potential clustering assignment is determined by adding an immediately preceding overall score to said differential value.
78. The method of claim 77, wherein only a highest of said differential values in each round of calculations is added to the immediately preceding score.
79. The method of claim 69, wherein said updating of said similarity matrix is achieved by adding rows corresponding to the merged clusters in said table and adding columns corresponding to the merged clusters.
80. The method of claim 76, wherein the differential values are placed in a priority queue.
81. The method of claim 80, wherein said priority queue is implemented by a heap.
82. The method of claim 81, wherein a back-annotation of a differential value and its position in the heap is maintained.
83. The method of claim 82, wherein said updating of said similarity matrix results in removal of all delta values in said heap corresponding to said merged clusters.
84. The method of claim 83, further comprising: g) replacing said removed value with the value at the bottom of the heap; h) reducing the heap length by one; i) if said value is larger than the parent, then swapping places and continuing said swap while a child is larger than a parent; j) if said value is smaller than at least one of its respective child nodes, then heapifying the heap from said value's node; and k) updating said back-annotation to reflect the changes in the heap.
85. The method of claim 84, wherein at least one new differential value is inserted into the heap.
86. A method for sorting and removal of data in a heap, the heap including nodes, the method comprising: a) placing values in a heap; b) removing at least one node in the heap and updating the heap to conform with the heap property; and c) updating a back-annotation of said values and their position in said heap.
87. The method of claim 86, wherein said removal of a value comprises: b1) replacing the removed value with a value corresponding to a bottom of the heap; b2) reducing a length of the heap by one; b3) if said value is larger than a parent, then swapping places and continuing said swap while a child is larger than a parent; and b4) if said value is smaller than at least one of its respective child nodes, then heapifying the heap from said value's node.
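Continuing the hypothetical BackAnnotatedMaxHeap sketched after claim 37, the removal path of claims 86 through 88 might be exercised as follows when a merge retires the delta values of the winning pair; the keys and numeric values here are invented for illustration:

```python
heap = BackAnnotatedMaxHeap()
heap.insert((0, 1), 0.8)             # delta value for merging clusters 0 and 1
heap.insert((0, 2), 0.5)
heap.insert((1, 2), 0.3)

(i, j), top_delta = heap.peek_max()  # winning pair (0, 1)
# Remove every delta keyed by a pair touching the merged clusters, using
# the back-annotation of claim 86 c) and the steps of claim 87 b1)-b4).
for key in [k for k in list(heap.pos) if i in k or j in k]:
    heap.remove(key)
heap.insert((0, 2), 0.6)             # claim 88: a fresh delta for the merged cluster
```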
88. The method of claim 87, wherein at least one new value is inserted into the heap.
89. A method for the indirect calculation of a series of greedy pairwise IN-OUT scores, the method comprising: a) inputting a pairwise similarity between pairs of clusters; b) replacing a self-similarity of each element to itself with an average of said element's similarity to all other elements and all other elements to said element; c) creating a vector of values where each value in said vector corresponds to an element and where the content of said value is the sum of the similarities of said element to all other elements; d) setting an initial IN-OUT score to zero; e) calculating IN-OUT delta values of each case of merging two clusters based on the data in said pairwise similarity and said vector; f) determining a highest of said delta values and accordingly selecting a corresponding clustering assignment; g) calculating a new IN-OUT score as a sum of the most recent IN-OUT score calculated and said highest delta value; h) updating said similarity matrix by combining rows of said merged clusters and combining columns of said merged clusters; i) updating said vector by: i1) adding values corresponding to each of the merged clusters; i2) subtracting similarity of the first cluster to the second cluster; i3) subtracting similarity of the second cluster to the first cluster; and i4) replacing said values with the newly calculated value; j) repeating steps e) through i) until only two clusters remain; and k) selecting a clustering assignment which has a highest IN-OUT score.
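The vector bookkeeping of step i) is compact enough to show directly. A minimal sketch, assuming NumPy arrays and that cluster j is folded into cluster i; the function and variable names are invented, not claim language:

```python
import numpy as np

def merge_out_vector(V, S, i, j):
    """One reading of claim 89 step i): V[k] is the sum of cluster k's
    similarities to all other clusters. When cluster j merges into cluster
    i, their mutual similarities become internal and leave the out-sum."""
    V[i] = V[i] + V[j] - S[i, j] - S[j, i]   # steps i1) through i3)
    V[j] = 0.0                               # i4): entry j is retired
    return V

# Example with three singleton clusters (values are arbitrary):
S = np.array([[0.0, 0.4, 0.1],
              [0.2, 0.0, 0.3],
              [0.5, 0.6, 0.0]])
V = S.sum(axis=1)                            # step c): row sums, self excluded
merge_out_vector(V, S, 0, 1)                 # merge cluster 1 into cluster 0
```

In this example the merged cluster's out-sum becomes 0.4, which equals S[0,2] + S[1,2], as expected once the pair's mutual similarities are no longer counted as outgoing.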
90. The method of claim 89, wherein said pairwise similarity is organized in a similarity matrix.
91. The method of claim 89, wherein in step b) said initial similarity value for a first element to itself is replaced by the average of the similarity of said first element to all the other elements in the same row.
92. The method of claim 89, wherein in step b) said initial similarity value for a first element to itself is replaced by the average of the similarity of said first element to all the other elements in the same column.
93. The method of claim 89, wherein the combined IN-OUT score is determined by assigning a first weight to the IN score and a second weight to the OUT score and subtracting a weighted OUT score from a weighted IN score.
94. The method of claim 93, wherein the IN delta values are computed by a sub-process comprising: i) adding values of similarity between a first cluster and a second cluster and similarity between a second cluster and a first cluster; ii) subtracting from the result of step i) the self-similarity of the first cluster multiplied by the ratio between the number of elements in said second cluster and said first cluster; iii) subtracting from the result of step ii) the self-similarity of the second cluster multiplied by the ratio between the number of elements in said first cluster and said second cluster; and iv) dividing the result of step iii) by a total number of elements to be merged in said potential clustering assignment.
95. The method of claim 93, wherein the OUT delta values are computed by a sub-process comprising: i) multiplying an aggregation of similarities of a first cluster to all clusters but itself by a ratio of a number of elements in a second cluster to a difference of a total number of elements and a number of elements of the first cluster; ii) adding to the result of i) an aggregation of similarities of the second cluster to all clusters but itself multiplied by a ratio of a number of elements in the first cluster to a difference of a total number of elements and a number of elements of the second cluster; iii) subtracting from the result of ii) a sum of values of similarity between a first cluster and a second cluster and similarity between a second cluster and a first cluster; and iv) dividing the result of iii) by a difference of a total number of elements and a sum of the number of elements of the first cluster and second cluster.
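In symbols, with V_a the aggregation of cluster a's similarities to all clusters but itself, n_a the element count of cluster a, and N the total number of elements, the sub-process of claim 95 evaluates to:

```latex
\Delta_{\mathrm{OUT}}(1,2) \;=\;
  \frac{V_1\,\frac{n_2}{N - n_1}
        + V_2\,\frac{n_1}{N - n_2}
        - S(1,2) - S(2,1)}
       {N - (n_1 + n_2)}
```

Claim 94 restates the Δ_IN expression given after claim 65; under claim 93 the two scores are then combined as a weighted difference of the IN and OUT components.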
96. The method of claim 89, wherein the repetitive loop is stopped when a new IN-OUT score is smaller than the immediately preceding IN-OUT score.
97. The clustering system of claim 1, wherein said initial similarity value for said first one of an element or a cluster to itself is an average of a similarity of all other elements to said first one of an element or a cluster.
PCT/IB2001/001892 2000-08-18 2001-08-20 A system and method for a greedy pairwise clustering WO2002015122A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001294089A AU2001294089A1 (en) 2000-08-18 2001-08-20 A system and method for a greedy pairwise clustering

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US22612800P 2000-08-18 2000-08-18
US60/226,128 2000-08-18
US25957501P 2001-01-04 2001-01-04
US60/259,575 2001-01-04

Publications (2)

Publication Number Publication Date
WO2002015122A2 true WO2002015122A2 (en) 2002-02-21
WO2002015122A3 WO2002015122A3 (en) 2003-12-04

Family

ID=26920229

Family Applications (4)

Application Number Title Priority Date Filing Date
PCT/IB2001/001876 WO2002014987A2 (en) 2000-08-18 2001-08-20 An adaptive system and architecture for access control
PCT/IB2001/001923 WO2002014989A2 (en) 2000-08-18 2001-08-20 Permission level generation based on adaptive learning
PCT/IB2001/001877 WO2002014988A2 (en) 2000-08-18 2001-08-20 A method and an apparatus for a security policy
PCT/IB2001/001892 WO2002015122A2 (en) 2000-08-18 2001-08-20 A system and method for a greedy pairwise clustering

Family Applications Before (3)

Application Number Title Priority Date Filing Date
PCT/IB2001/001876 WO2002014987A2 (en) 2000-08-18 2001-08-20 An adaptive system and architecture for access control
PCT/IB2001/001923 WO2002014989A2 (en) 2000-08-18 2001-08-20 Permission level generation based on adaptive learning
PCT/IB2001/001877 WO2002014988A2 (en) 2000-08-18 2001-08-20 A method and an apparatus for a security policy

Country Status (2)

Country Link
AU (4) AU2001294083A1 (en)
WO (4) WO2002014987A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106778314A (en) * 2017-03-01 2017-05-31 全球能源互联网研究院 A kind of distributed difference method for secret protection based on k means

Families Citing this family (59)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003203140A (en) * 2001-10-30 2003-07-18 Asgent Inc Method for grasping situation of information system and device used in the same
WO2003063449A1 (en) * 2002-01-18 2003-07-31 Metrowerks Corporation System and method for monitoring network security
EP1339199A1 (en) * 2002-02-22 2003-08-27 Hewlett-Packard Company Dynamic user authentication
CA2478128A1 (en) 2002-03-06 2003-09-12 Peregrine Systems, Inc. Method and system for a network management console
FR2838207B1 (en) * 2002-04-08 2006-06-23 France Telecom INFORMATION EXCHANGE SYSTEM WITH CONDITIONED ACCESS TO AN INFORMATION TRANSFER NETWORK
US7302488B2 (en) 2002-06-28 2007-11-27 Microsoft Corporation Parental controls customization and notification
ATE540373T1 (en) 2002-11-29 2012-01-15 Sap Ag METHOD AND COMPUTER SYSTEM FOR PROTECTING ELECTRONIC DOCUMENTS
CN1417690A (en) * 2002-12-03 2003-05-14 南京金鹰国际集团软件系统有限公司 Application process audit platform system based on members
US10110632B2 (en) * 2003-03-31 2018-10-23 Intel Corporation Methods and systems for managing security policies
US8266699B2 (en) 2003-07-01 2012-09-11 SecurityProfiling Inc. Multiple-path remediation
US20070113272A2 (en) 2003-07-01 2007-05-17 Securityprofiling, Inc. Real-time vulnerability monitoring
US9118710B2 (en) 2003-07-01 2015-08-25 Securityprofiling, Llc System, method, and computer program product for reporting an occurrence in different manners
US9100431B2 (en) 2003-07-01 2015-08-04 Securityprofiling, Llc Computer program product and apparatus for multi-path remediation
US9118708B2 (en) 2003-07-01 2015-08-25 Securityprofiling, Llc Multi-path remediation
US9118711B2 (en) 2003-07-01 2015-08-25 Securityprofiling, Llc Anti-vulnerability system, method, and computer program product
US9118709B2 (en) 2003-07-01 2015-08-25 Securityprofiling, Llc Anti-vulnerability system, method, and computer program product
US9350752B2 (en) 2003-07-01 2016-05-24 Securityprofiling, Llc Anti-vulnerability system, method, and computer program product
US8984644B2 (en) 2003-07-01 2015-03-17 Securityprofiling, Llc Anti-vulnerability system, method, and computer program product
EP1510904B1 (en) * 2003-08-19 2008-12-31 France Telecom Method and system for evaluating the level of security of an electronic equipment and for providing conditional access to resources
DE10348729B4 (en) 2003-10-16 2022-06-15 Vodafone Holding Gmbh Setup and procedures for backing up protected data
FR2864657B1 (en) * 2003-12-24 2006-03-24 Trusted Logic METHOD FOR PARAMETRABLE SECURITY CONTROL OF COMPUTER SYSTEMS AND EMBEDDED SYSTEMS USING THE SAME
US7907934B2 (en) 2004-04-27 2011-03-15 Nokia Corporation Method and system for providing security in proximity and Ad-Hoc networks
JP4811271B2 (en) * 2004-08-25 2011-11-09 日本電気株式会社 Information communication apparatus and program execution environment control method
JP4643204B2 (en) 2004-08-25 2011-03-02 株式会社エヌ・ティ・ティ・ドコモ Server device
US7979889B2 (en) 2005-01-07 2011-07-12 Cisco Technology, Inc. Methods and apparatus providing security to computer systems and networks
US7193872B2 (en) 2005-01-28 2007-03-20 Kasemsan Siri Solar array inverter with maximum power tracking
US7661111B2 (en) 2005-10-13 2010-02-09 International Business Machines Corporation Method for assuring event record integrity
JP2009519546A (en) * 2005-12-13 2009-05-14 インターデイジタル テクノロジー コーポレーション Method and system for protecting user data in a node
US7882560B2 (en) 2005-12-16 2011-02-01 Cisco Technology, Inc. Methods and apparatus providing computer and network security utilizing probabilistic policy reposturing
US9286469B2 (en) 2005-12-16 2016-03-15 Cisco Technology, Inc. Methods and apparatus providing computer and network security utilizing probabilistic signature generation
US8413245B2 (en) 2005-12-16 2013-04-02 Cisco Technology, Inc. Methods and apparatus providing computer and network security for polymorphic attacks
US8495743B2 (en) 2005-12-16 2013-07-23 Cisco Technology, Inc. Methods and apparatus providing automatic signature generation and enforcement
US8326296B1 (en) 2006-07-12 2012-12-04 At&T Intellectual Property I, L.P. Pico-cell extension for cellular network
CN101350052B (en) 2007-10-15 2010-11-03 北京瑞星信息技术有限公司 Method and apparatus for discovering malignancy of computer program
CN101350054B (en) 2007-10-15 2011-05-25 北京瑞星信息技术有限公司 Method and apparatus for automatically protecting computer noxious program
US8626223B2 (en) 2008-05-07 2014-01-07 At&T Mobility Ii Llc Femto cell signaling gating
US8719420B2 (en) 2008-05-13 2014-05-06 At&T Mobility Ii Llc Administration of access lists for femtocell service
US8179847B2 (en) * 2008-05-13 2012-05-15 At&T Mobility Ii Llc Interactive white list prompting to share content and services associated with a femtocell
US8743776B2 (en) 2008-06-12 2014-06-03 At&T Mobility Ii Llc Point of sales and customer support for femtocell service and equipment
JP5482667B2 (en) * 2009-02-10 2014-05-07 日本電気株式会社 Policy management apparatus, policy management system, method and program used therefor
US8510801B2 (en) 2009-10-15 2013-08-13 At&T Intellectual Property I, L.P. Management of access to service in an access point
US8713056B1 (en) 2011-03-30 2014-04-29 Open Text S.A. System, method and computer program product for efficient caching of hierarchical items
US10225249B2 (en) * 2012-03-26 2019-03-05 Greyheller, Llc Preventing unauthorized access to an application server
US10229222B2 (en) 2012-03-26 2019-03-12 Greyheller, Llc Dynamically optimized content display
US9355261B2 (en) 2013-03-14 2016-05-31 Appsense Limited Secure data management
US8959657B2 (en) 2013-03-14 2015-02-17 Appsense Limited Secure data management
US9215251B2 (en) 2013-09-11 2015-12-15 Appsense Limited Apparatus, systems, and methods for managing data security
JP6190518B2 (en) 2014-03-19 2017-08-30 日本電信電話株式会社 Analysis rule adjustment device, analysis rule adjustment system, analysis rule adjustment method, and analysis rule adjustment program
US9787685B2 (en) 2014-06-24 2017-10-10 Xiaomi Inc. Methods, devices and systems for managing authority
CN104125335B (en) * 2014-06-24 2017-08-25 小米科技有限责任公司 Right management method, apparatus and system
WO2023170635A2 (en) * 2022-03-10 2023-09-14 Orca Security LTD. System and methods for a machine-learning adaptive permission reduction engine
US10891816B2 (en) 2017-03-01 2021-01-12 Carrier Corporation Spatio-temporal topology learning for detection of suspicious access behavior
WO2018160560A1 (en) 2017-03-01 2018-09-07 Carrier Corporation Access control request manager based on learning profile-based access pathways
WO2018160407A1 (en) 2017-03-01 2018-09-07 Carrier Corporation Compact encoding of static permissions for real-time access control
US10764299B2 (en) * 2017-06-29 2020-09-01 Microsoft Technology Licensing, Llc Access control manager
US10831787B2 (en) * 2017-06-30 2020-11-10 Sap Se Security of a computer system
US11115421B2 (en) 2019-06-26 2021-09-07 Accenture Global Solutions Limited Security monitoring platform for managing access rights associated with cloud applications
US11501257B2 (en) * 2019-12-09 2022-11-15 Jpmorgan Chase Bank, N.A. Method and apparatus for implementing a role-based access control clustering machine learning model execution module
CN114981812A (en) * 2020-01-15 2022-08-30 华为技术有限公司 Secure and reliable data access

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6049797A (en) * 1998-04-07 2000-04-11 Lucent Technologies, Inc. Method, apparatus and programmed medium for clustering databases with categorical attributes

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHIDANANDA GOWDA K ET AL: "SYMBOLIC CLUSTERING USING A NEW DISSIMILARITY MEASURE" PATTERN RECOGNITION, PERGAMON PRESS INC. ELMSFORD, N.Y, US, vol. 24, no. 6, 1991, pages 567-578, XP000214973 ISSN: 0031-3203 *
TAKIO KURITA: "AN EFFICIENT AGGLOMERATIVE CLUSTERING ALGORITHM USING A HEAP" PATTERN RECOGNITION, PERGAMON PRESS INC. ELMSFORD, N.Y, US, vol. 24, no. 3, 1991, pages 205-209, XP000205242 ISSN: 0031-3203 *

Also Published As

Publication number Publication date
WO2002014988A2 (en) 2002-02-21
WO2002014989A8 (en) 2003-03-06
AU2001294110A1 (en) 2002-02-25
WO2002014988A8 (en) 2003-04-24
WO2002015122A3 (en) 2003-12-04
AU2001294083A1 (en) 2002-02-25
WO2002014989A2 (en) 2002-02-21
WO2002014987A2 (en) 2002-02-21
AU2001294084A1 (en) 2002-02-25
AU2001294089A1 (en) 2002-02-25
WO2002014987A8 (en) 2003-09-04

Similar Documents

Publication Publication Date Title
WO2002015122A2 (en) A system and method for a greedy pairwise clustering
US6182058B1 (en) Bayes rule based and decision tree hybrid classifier
JP3323180B2 (en) Decision tree changing method and data mining device
CN111667050B (en) Metric learning method, device, equipment and storage medium
US20080201340A1 (en) Decision tree construction via frequent predictive itemsets and best attribute splits
US20070274597A1 (en) Method and system for fuzzy clustering of images
CN105929690B (en) A kind of Flexible Workshop Robust Scheduling method based on decomposition multi-objective Evolutionary Algorithm
CN109829162A (en) A kind of text segmenting method and device
CN105740386A (en) Thesis search method and device based on sorting integration
Karypis Multi-constraint mesh partitioning for contact/impact computations
CN105205052A (en) Method and device for mining data
EP3452916A1 (en) Large scale social graph segmentation
CN112906865A (en) Neural network architecture searching method and device, electronic equipment and storage medium
CN110580252B (en) Space object indexing and query method under multi-objective optimization
CN114781688A (en) Method, device, equipment and storage medium for identifying abnormal data of business expansion project
CN111125329B (en) Text information screening method, device and equipment
CN108319626B (en) Object classification method and device based on name information
US20030149698A1 (en) System and method for positioning records in a database
CN109033746B (en) Protein compound identification method based on node vector
CN108830302B (en) Image classification method, training method, classification prediction method and related device
Lopes et al. Local and distributed machine learning for inter-hospital data utilization: an application for TAVI outcome prediction
CN108717551A (en) A kind of fuzzy hierarchy clustering method based on maximum membership degree
CN104809098A (en) Method and device for determining statistical model parameter based on expectation-maximization algorithm
KR102268570B1 (en) Apparatus and method for generating document cluster
KR102028487B1 (en) Document topic modeling apparatus and method, storage media storing the same

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PH PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP