WO2006123013A2 - A method for analyzing multiparameter data - Google Patents

A method for analyzing multiparameter data

Info

Publication number
WO2006123013A2
Authority
WO
WIPO (PCT)
Prior art keywords
elements
parameters
seed
coordinate system
group
Prior art date
Application number
PCT/FI2006/000160
Other languages
French (fr)
Inventor
Mika KORKEAMÄKI
Perttu Terho
Jussi Vaahtovuo
Original Assignee
Cyflo Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cyflo Oy
Publication of WO2006123013A2

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/40 Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Definitions

  • the invention relates to the management of information and the analysis of large amounts of information in different types of applications, such as, for example, flow cytometry and gene chip methods.
  • the method is not automated in any way, rather the user must himself adjust the directions of the parameters and seek suitable values. Secondly, automating the method is difficult.
  • the third problem is that elements may end up at the same point from several different routes (so-called pseudo-populations are possible), i.e. the route travelled by the elements is not without ambiguity.
  • a self-organizing map, i.e. SOM; T. Kohonen, Self-organizing Maps, New York, Springer-Verlag, 2001
  • SOM self-organizing map
  • a SOM does not, however, create a map suitable, for example, for analysis of flow cytometry data, rather it builds the map as a density figure of a given area. Additionally, understanding of a SOM is difficult because its mathematical basis is exceptionally challenging.
  • the object of the invention is to disclose a new type of visual data mining method for the management of large amounts of data and as a tool for analysis of information.
  • the object of the invention is to obviate the problems mentioned above.
  • the present invention presents an assumption-free and user-independent data mining and visualization method for the processing of multivariable research or measurement material into a more comprehensible form.
  • the material comprises a large number of elements, and each element contains at least one parameter, wherein each parameter has its value.
  • each element is placed at the beginning at an initial location in a coordinate system which in one embodiment of the invention is a two-dimensional XY-plane. Two elements are selected and the values of the parameters of these elements and the locations of the elements are compared. After this, force vectors are calculated such that to one element is directed a group of force vectors created by each of the other elements. Each of these force vectors is parallel with the connecting line between the two elements, i.e. the force is either an attracting or repelling force between the two elements.
  • the magnitude of the force vector is a function of the difference of the values between the parameters of the two elements being contemplated and the distance between them.
  • the second element is then exchanged until the first element has been compared to all of the other elements.
  • the other elements each cause a force vector directed to the first element, thus a sum force vector directed to the first element can be obtained by summation of all the force vectors directed to this element.
  • a sum force vector is calculated in a corresponding manner for all elements, or in other words, the first element is exchanged and the calculation described above is repeated.
  • each element is moved in the coordinate system by an amount equal to its sum force vector. After this, the iteration described above is repeated on the moved elements. Thus, new sum force vectors are obtained for the moved elements, according to which the elements are moved once again. Iterations are repeated until all force vectors are approximately null vectors or until a predefined number of iterations is reached. In this case, a state of balance has been found and the elements essentially no longer move. A large volume of data can be analysed more easily from this type of data group distributed into different populations.
  • the elements can at the beginning be placed in the coordinate system at the initial locations defined by some predefined algorithm or at randomly defined initial locations.
  • the difference of the values between the parameters of two elements is calculated by subtracting from each parameter value of the first element the corresponding parameter value of the second element and calculating an average of the absolute values of the differences.
  • One alternative manner of calculating the difference of the values is to divide the corresponding two values of the parameters instead of subtracting them, and to calculate an average of the quotients.
  • weighting can be laid on one or several parameters, or one or several elements, or one or several elements or parameters can be left out of the calculation of the difference of the values.
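The difference calculation described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the applicant's program code: the function name `relative_difference`, the `weights` argument and the choice to normalise by the sum of the weights are all hypothetical. With no weights given, the result is the average of the absolute differences, as in the text; a weight of zero leaves a parameter out of the calculation.

```python
def relative_difference(a, b, weights=None):
    """Weighted average of the absolute per-parameter differences of two elements."""
    if len(a) != len(b):
        raise ValueError("elements must have the same number of parameters")
    if weights is None:
        # Unweighted case: every parameter counts equally.
        weights = [1.0] * len(a)
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one parameter needs a non-zero weight")
    # Normalising by the sum of the weights is an assumption of this sketch.
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b)) / total_weight
```

For example, `relative_difference([1, 5, 9], [4, 5, 3])` averages the absolute differences 3, 0 and 6 and returns 3.0.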
  • the sum force vectors calculated can be scaled according to the set scaling factor prior to moving of the elements.
  • the direction of an individual force vector is set to differ from the direction of the connecting line between the two elements.
  • seed elements are selected from the entire group of elements which seed elements are brought into a mutual state of balance using the presented method. After this, the non-seed elements are brought into a state of balance with the seed elements such that the non-seed elements are compared only to the seed elements. In this case, the seed elements also remain stationary, i.e. in the final phase force vectors are not calculated to them.
  • the non-seed elements can initially be placed in the coordinate system either at the point of the average of the locations of the group of seed elements in a state of balance, at the point defined by the average of the extreme values in the direction of each axis of the group of seed elements in a state of balance or at the point of such a seed element where the difference of the values of the parameters of the element and the seed element is minimized, or in some other manner which is advantageous for the arranging of the elements such that it is expected for the number of necessary iterations to remain reasonable.
  • the elements of a large data group are first divided into sub-groups, each of which is first brought into an internal state of balance using the method according to the invention. After this, a sample of one or more elements is selected from each sub-group which can, for example, be the element closest to the centre of the population formed by the elements of the subgroup. These samples are then brought into a mutual state of balance using the method according to the invention. These balanced sample elements are selected as seed elements and the rest of the elements are brought into balance in relation to this group of the seed elements as described above in connection with the seed elements.
  • the XY-plane can be replaced using a polar coordinate system or yet a third dimension can be added to the XY-plane coordinate system in the form of a Z-axis.
  • the angles between the axes can be 90 degrees or alternatively the angle between at least one pair of axes in the coordinate system can be set at an angle value other than 90 degrees.
  • a method for data mining and visualization according to the present invention can be used, for example, in flow cytometry, gene chip methods, business economics, demographic information and information gathered for gallup polls as a tool for the analysis of data groups or in the analysis of any statistic material or measurement material.
  • the idea of the invention can be used in multivariable analysis in any field where this type of information can be measured or collected and where there is a desire to more exactly analyse or visualize.
  • the method according to the present invention can be implemented using a computer program where the program code implements the different steps of the data mining and visualization method according to the present invention.
  • the presented invention is particularly good in those cases where a lot of data has accumulated, but the researcher does not have exact information about what sort of finding he is seeking, rather he attempts to find something significant from the data, such as, for example, new correlations or connections between the parameters being studied.
  • the presented method offers an excellent solution.
  • In flow cytometry, a situation like that described above can arise, for example, where the researcher begins studying characteristics of particles, for example microbes (such as the light-refracting ability of different types of microbes from different angles), about which it is not known whether the parameters are significant or not.
  • the invention presented in this patent application obviates the disadvantages of methods of known art, as it is automatic and due to the attracting or repelling influences of force vectors superimpositions are prevented from occurring, i.e. the elements cannot end up at the same point by different routes, i.e. using this method, elements cannot form pseudo-populations.
  • the invention presented in this patent application handles slides (in other words, distorted and undefined populations) correctly, as it allows each point to find its own place.
  • the invention presented in this patent application allows the elements to be settled at the distance from one another that is necessary, and the area formed by the elements becomes exactly as large as the elements require. Additionally, the mathematical base of the invention now presented is simple and therefore the result is easily understood. Users are not then required to have special information, professional skill or ability in order to utilize the method of the invention.
  • An essential advantage of the present invention is that complicated information is easily and clearly changed into graphic form such that a large volume of information can be more easily analysed and by this means it is possible to make interesting and significant findings regarding a data group. Visualization of a data group is then facilitated by the method of the invention. Without the method of the invention an important finding, a connection between elements or other interdependence could remain undiscovered due to the complexity of the data group and the difficulty of interpreting it.
  • the number of dimensions used for representing the information can be reduced without loss of the data itself.
  • a large group of different XY-plane graphs can be condensed to one XY-plane for presentation such that the information is preserved.
  • An advantage of the invention is further that no hypotheses or assumptions need to be made about the material. Errors relating to these are then left out when using the method of the invention.
  • An essential advantage of the invention is also that its number of applications is quite high. In different areas of technology and different types of areas of life, as the result of research or measurements, large amounts of information are commonly created that must be analysed for making conclusions. A manner of data mining according to the present invention offers an excellent tool for this.
  • Fig. 1 shows a flow cytometry data analysis method according to known art using several two-dimensional graphs
  • Fig. 2 shows the method according to the present invention as a flow chart
  • Fig. 3 illustrates the balanced state of the points of a data group achieved by the method according to the present invention
  • Fig. 4 illustrates the manner of calculating force vectors between three elements according to the present invention
  • Fig. 5 illustrates a manner according to the present invention for achieving the optimal definition of the method by calculating the relative difference of two elements in an alternative manner.
  • the method presented in this patent application solves the problems described above by handling multiparametric information in a coordinate system such that the distance between individual parts or elements of a data group illustrates the difference between the parameters of the elements and using simple mathematical regularities the elements are caused to gather as illustrative groups.
  • the coordinate system in the embodiment presented subsequently is a two-dimensional XY-plane, but the inventive idea of the invention is also applicable in relation to other types of coordinate systems such as, for example, an XYZ coordinate system or a polar coordinate system.
  • this type of presentation is achieved using the following algorithm with reference in this connection to the flow chart shown in Fig. 2:
  • the force vector is calculated according to a given function 202.
  • the force vector is directed from the element either towards the second element or away from it. 5) It is checked whether the first element has already been compared to all other elements 203. If not all other elements have been gone through, the second element is exchanged 204 and the iteration 201, 202 is repeated. Steps 2, 3 and 4 are thus repeated for the element with respect to the third element 201, 202, and the force vector obtained from this is added to the previous force vector. Steps 2 to 4 are repeated until the element has been compared to all other elements. All the force vectors thus obtained and summed form the final sum force vector of the element 205.
  • the sum force vector for the new element is obtained.
  • a sum force vector will be calculated for the third and by continuing iteration, finally for all elements 205.
  • All elements are moved by an amount equal to their sum force vectors 208, after which it is checked, whether at least one element moved 209. If one moved, then after this two new elements are selected 211 from the moved element group as initial values for further processing.
  • Steps 2 to 7 are repeated 201 to 211 until the elements no longer move.
  • the elements have achieved a state of balance 212, in which the length of the sum vector of each element is virtually zero when checking during the phase 209.
  • In Fig. 3, the elements 30 have after a given number of iterations settled into a state of balance where the elements whose relative difference in relation to the parameters examined is small are gathered close to one another as a group or a population 31.
  • the elements of such a population have settled further from those types of elements and populations, from which the values of the parameters examined differ more. Relationships, proportions, difference, similarity, proximity, volume, uniformity, form, distribution or other measures of statistical mathematics of the populations 31 and/or the elements 30 thus formed can be easily and clearly analysed using a two-dimensional graph.
  • In step one of the method 200, both an X-value and a Y-value are given to each element.
  • the values can be given, for example, randomly, on the basis of some predefined algorithm or to each element can be given, for example, the same value from the coordinate system.
  • In step two 201, the value of each parameter of the first element is subtracted from the value of each corresponding parameter of the second element and the absolute values of these differences are calculated. These absolute values are added together and divided by the number of parameters. The relative difference of the parameters of the elements is thus obtained.
  • the difference of the parameters of the elements can naturally also be calculated in other ways, for example, the values of each parameter can be divided by the value of corresponding parameters of the second element. By calculating the average of these ratios, the difference of the parameters of the elements can be described.
  • weighting can be set on some parameter or some elements or, if needed, some parameter or element can be left out of the calculation or this/these can be included in the calculation at some other step of the analysis algorithm.
  • In step three 201, the distance between the elements in the coordinate system is calculated, for example, using the Pythagorean theorem.
  • In step four 202, the force vector 41 is calculated to the first element 40 towards the second element 42. This is done, for example, by subtracting the distance from the relative difference or by other calculations which will be contemplated further below. If the difference remains positive, the force vector has the length of the difference and points towards the second element. If the difference is negative, the force vector has the length of the difference and points directly away from the second element.
  • In step five, steps two, three and four 201, 202 are repeated for the first element with respect to the third element 43. It must be noted that the force vector 44 obtained from this and the force vector 41 obtained the previous time can point in different directions. These force vectors are summed, wherein the partial sum force vector 45 is obtained. The calculation of force vectors and their summation with the previous force vectors is continued for the first element 40 until the element 40 has been compared to all other elements 205.
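Steps three to five can be sketched in Python as follows. The names `pair_force` and `sum_force` are hypothetical, and the sketch makes two assumptions that the text leaves open. First, the signed magnitude is taken as distance minus relative difference, so that elements lying further apart than their dissimilarity attract one another and elements lying too close repel; this sign convention was chosen here because it gives a stable state of balance, and the wording of step four could be read the other way. Second, coincident elements, for which the connecting line has no direction, contribute no force.

```python
import math

def pair_force(pos_a, pos_b, difference):
    """Force vector acting on element A because of element B (XY-plane)."""
    dx = pos_b[0] - pos_a[0]
    dy = pos_b[1] - pos_a[1]
    distance = math.hypot(dx, dy)  # Pythagorean theorem, step three
    if distance == 0.0:
        # Direction undefined for coincident elements (assumption: no force).
        return (0.0, 0.0)
    # Assumed sign convention: a positive magnitude pulls A towards B.
    magnitude = distance - difference
    return (magnitude * dx / distance, magnitude * dy / distance)

def sum_force(pos_a, others):
    """Sum the force vectors that (position, difference) pairs exert on A."""
    fx = fy = 0.0
    for pos_b, difference in others:
        px, py = pair_force(pos_a, pos_b, difference)
        fx += px
        fy += py
    return (fx, fy)
```

With a first element at the origin, a second element at (3, 4) with relative difference 2 pulls it by three units along the connecting line, while a third element at (-6, 0) with relative difference 10 pushes it four units away; `sum_force` adds the two vectors into the partial sum force vector.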
  • steps two to five 201 to 207 are implemented for the other elements of the element group to be analysed. Using the method according to this patent application, a sum force vector 205 is calculated according to the same algorithm for each element using the principle presented above.
  • In step seven, the elements are moved 208 in the XY coordinate system in a direction defined by the sum force vector and by a distance defined by the length of the sum force vector.
  • new locations are calculated for the elements, wherein the distances and locations of all elements in relation to one another change, and due to this a new iteration round is needed.
  • the elements move little by little toward the state of balance, i.e. the length of the sum force vectors of the elements reduces and will in the end be zero, wherein a single element no longer moves 212.
  • Finding the state of balance of the elements can be sped up, for example, in step seven by dividing the force vectors by a suitable constant before moving the elements, wherein the movement of the elements is less forceful than movement according to the force vectors would otherwise be. In this way, the possibly extreme movement of the elements can be restrained and fewer iteration rounds are needed to achieve a state of balance, when a constant is prudently selected. Imprudent selection of a constant, such as a constant having too large a value, may, however, slow the movement of elements so much that, due to the slow movement, i.e. short drifting lengths of the elements, more iteration rounds will be needed.
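The whole iteration, including the scaling constant, can be sketched in Python as follows. This is a minimal illustration under assumptions, not the applicant's implementation: the function names, the random initial placement, the stopping tolerance and the default constant (the number of elements) are all hypothetical, and the signed magnitude is again taken as distance minus relative difference so that the state of balance is stable.

```python
import math
import random

def relative_difference(a, b):
    """Average absolute difference of the parameter values (step two)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def sum_force(i, positions, elements):
    """Sum force vector on element i caused by all other elements."""
    fx = fy = 0.0
    xi, yi = positions[i]
    for j, (xj, yj) in enumerate(positions):
        if j == i:
            continue
        dx, dy = xj - xi, yj - yi
        distance = math.hypot(dx, dy)
        if distance == 0.0:
            continue  # direction undefined; skipped in this sketch
        # Assumed sign convention: positive magnitude pulls i towards j.
        magnitude = distance - relative_difference(elements[i], elements[j])
        fx += magnitude * dx / distance
        fy += magnitude * dy / distance
    return fx, fy

def balance(elements, scale=None, max_iterations=500, tolerance=1e-6, seed=0):
    """Move all elements until the sum force vectors are virtually zero."""
    n = len(elements)
    rng = random.Random(seed)
    positions = [(rng.random(), rng.random()) for _ in range(n)]
    if n < 2:
        return positions
    if scale is None:
        scale = float(n)  # assumed default for the scaling constant
    for _ in range(max_iterations):
        forces = [sum_force(i, positions, elements) for i in range(n)]
        # All elements are moved together, each by its scaled sum force vector.
        positions = [(x + fx / scale, y + fy / scale)
                     for (x, y), (fx, fy) in zip(positions, forces)]
        if max(math.hypot(fx, fy) for fx, fy in forces) < tolerance:
            break  # state of balance reached
    return positions
```

Under this sign convention, dividing by the number of elements keeps the movement restrained: with only two elements and no scaling at all, each element would cover the whole mismatch by itself and the pair would overshoot on every round.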
  • step four where distance is subtracted from the relative difference
  • definition can be improved, for example, by calculating a value for the relative difference of all elements, for example, from a Gauss curve or from some other graph.
  • In Fig. 5, there is a difference on the X-axis 50 calculated using the formula of step three.
  • the value of the relative difference is replaced by the value of a function 51 like that presented in the figure, wherein from the Y-axis 52 of the graph is obtained a new value 54 for the difference which is in the exemplary case greater than the original value 53.
  • a situation can be achieved where elements further from one another in relative difference repel one another more strongly than in the situation where the relative difference is taken into the calculation of step three as such.
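One way to picture replacing the relative difference by the value of a function is the Python sketch below. The text only says that the value can be taken, for example, from a Gauss curve; the particular inverted Gauss curve used here, the function name `reshape_difference` and the constants `max_difference` and `sigma` are assumptions made for illustration and do not reproduce function 51 of Fig. 5.

```python
import math

def reshape_difference(difference, max_difference=1023.0, sigma=300.0):
    """Map a raw relative difference onto an inverted Gauss curve.

    The curve rises from 0 towards max_difference; both constants are
    hypothetical tuning values, not taken from the source.
    """
    return max_difference * (1.0 - math.exp(-(difference ** 2) / (2.0 * sigma ** 2)))
```

With these example constants a raw difference of 300 is mapped to a value above 300, so such elements are held further apart, while a raw difference of 50 is mapped to a value below 50, so similar elements are drawn closer together.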
  • each element is compared to all other elements in the data material, each iteration round requires many calculations, and desktop computers of the present day may not be capable of processing an information volume of, for example, one hundred thousand elements within a reasonable time.
  • This slowness due to the limited nature of the calculation power of present-day computers can be solved, for example, by selecting from a large number of elements a given amount of so-called seed elements which are first brought into mutual balance using phases
  • the rest of the elements can initially be placed in the XY coordinate system either by giving random X- and Y-values or by locating them in the coordinate system in a predefined manner, for example, by giving to all the same X- and Y-values in the middle of that part of the coordinate system where the balanced seed elements are located.
  • the values of the parameters of each element can be compared before the first iteration round to the values of the parameters of the seed elements in the same manner as is done in step two of the method, and a relative difference is calculated for the values. After this, the element is given the coordinates of its seed element, from which the element in question has the smallest relative difference.
  • although the elements are not necessarily in exactly the right place after placing, it is possible to get the elements placed near their final balanced locations. After this, with a relatively small number of iterations the elements are brought into their final state of balance.
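The seed-based placement can be sketched in Python as follows. The function names, the fixed iteration count and the division of the sum force vector by the number of seed elements are assumptions of this example; the sign convention (distance minus relative difference, positive meaning attraction) is likewise an assumption. The seed elements stay stationary and the non-seed element is compared only to them.

```python
import math

def relative_difference(a, b):
    """Average absolute difference of the parameter values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def place_at_nearest_seed(element, seeds, seed_positions):
    """Give the element the coordinates of the seed it differs from least."""
    nearest = min(range(len(seeds)),
                  key=lambda k: relative_difference(element, seeds[k]))
    return seed_positions[nearest]

def settle_against_seeds(element, start, seeds, seed_positions, iterations=100):
    """Move one non-seed element while the seed elements remain fixed."""
    x, y = start
    for _ in range(iterations):
        fx = fy = 0.0
        for seed, (sx, sy) in zip(seeds, seed_positions):
            dx, dy = sx - x, sy - y
            distance = math.hypot(dx, dy)
            if distance == 0.0:
                continue  # coincident with a seed: no direction, no force
            magnitude = distance - relative_difference(element, seed)
            fx += magnitude * dx / distance
            fy += magnitude * dy / distance
        # Scaling by the number of seeds is an assumed choice of constant.
        x += fx / len(seeds)
        y += fy / len(seeds)
    return (x, y)
```

Each non-seed element then costs one comparison per seed element and round instead of one per element of the whole data material, which is where the saving in calculation comes from.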
  • the elements of the data material can be split into groups (for example groups of 1000 elements), after which the elements of each group can be brought into an internal state of balance of its group using phases 200, 201, ..., 212 of the method described above. After this, these groups in a state of balance can be combined by processing an adequately characteristic sample or individual element from each group.
  • This type of characteristic element from each group can be, for example, the values of central locations of populations formed by the elements of each group.
  • Flow cytometry is also a good example of the development of a historical so-called single analyte and on the other hand from a single parameter method into a multiplexed system.
  • flow cytometry in one analysis it was possible to measure only one parameter, while at the present time, using flow cytometry, already more than ten parameters can be analysed simultaneously.
  • flow cytometry already just in the analysis of flow cytometry data, even though the information volume created in flow cytometry is just a fraction of the data material produced by many other methods and measurements.
  • one data file can contain hundreds of thousands of elements and for each element 4 to 15 parameters (for example the size, granularity, and different types of fluorescent labels of a particle) can be measured whose values can vary, for example, in the range 0 to 1023 or even greater variation ranges starting even from negative values.
  • the method in practice has applications in all fields where statistical science is needed.
  • the method is well suited.
  • statistical information is formed, in which there are many elements (people) as well as many parameters (questions).
  • an experiment can be mentioned, which studies the productional and microbiological effects of animal feeding and additive intervention.
  • the experiment can be related, for example, to poultry production.
  • feeding parameters were included in the experiment as one group of parameters that express the composition of the feed given to the animals.
  • additive interventions wheat, shelled oats, oats, soybean, oil, lysine and threonine (2-amino-3-hydroxybutanoic acid) were used as the feeding parameters group.
  • parameters relating to the living conditions of the animals and microbiological parameters linked to these can be examined.
  • Such as these are, for example, the dry-matter content of an intestinal content sample, the total bacterial content of an intestinal content sample, absolute bacterial numbers of desired different bacterial groups as well as the numerical value of a given microbiological index.
  • the group of desired output parameters can be defined.
  • Such as these are typically animal-related production parameters. These can be, for example, internal fat, weight of the breast meat, mortality, feed efficiency and growth.
  • the points of data are placed at their initial locations which are selected as desired or random locations. After this, it is determined how many different test feedings were used in the experiment. In the computer program according to the experiment, it is possible to select from the parameters mentioned above only the desired parameters whose influence it is desired to study. In this connection, the variables to be subsequently visualized can be selected which are all other feeding parameters except additive intervention. After this, an analysis method according to the invention can be started. In this case, the data points are arranged at new locations and populations are created, in this example four separate populations. It can thus be stated that in the experiment four different types of feeding were used and graphically from the presentation it can be perceived which feedings are more alike than others and, on the other hand, which feedings differ most from the others. The closer two populations of the graphic presentation are to one another, the more alike they are.
  • the computer program implemented using the analysis method of the invention offers useful characteristics for continuing analysis. Namely, now desired parameters can be removed from the production parameters to be analysed one at a time and analysis can be restarted. Using this exclusion principle, it can be discovered in what production parameters (one or more) there were differences between different populations. For example, in the analysis, it is discovered that internal fat and mortality vary between different populations. The graphic presentation also shows, for example, for what percentage of the animals the percentage of fat takes the greater of two possible values and for what percentage the smaller value. After this, in the case of the example, in addition to feeding parameters the additive intervention could now be included among the parameters. When analysis is started and the data points once again settle at new locations, it can then be seen that three animal populations divide in just the same manner as internal fat and mortality in the earlier analysis.

Description

METHOD FOR THE ANALYSIS OF MULTIPARAMETRIC INFORMATION
FIELD OF THE INVENTION
The invention relates to the management of information and the analysis of large amounts of information in different types of applications, such as, for example, flow cytometry and gene chip methods.
BACKGROUND OF THE INVENTION
In many research methods currently in use, significant amounts of information are gathered. There may be tens or hundreds of parameters to be measured, and the elements from which parameters are measured may number into the thousands, or even millions. New innovations are constantly being developed for the production and storage of this type of information volume. Large amounts of information like that described above are generated, for example, in the fields of natural science, social science and medicine. In practice, the analysis of all large amounts of information is currently done using computers.
Analysis of enormous amounts of information requires significant amounts of manpower, for example, statistical mathematicians and bio-information scientists. The methods used by most statistical mathematicians are on a theoretical level quite complicated and require a user to be exceptionally well versed in the field before he can efficiently make use of the methods. In many scientific fields, the scientist requires the assistance of a statistical mathematician well versed in the analysis of numerical materials. The conducting of scientific work itself, i.e. such things as planning and practical implementation of experiment arrangements are among the work duties of a scientist, but the statistical analysis of the results obtained must often be turned over to statistical mathematicians. A person trained, for example, in medicine would not have received much training in the methods of statistical mathematics, despite the fact that in their work medical researchers must handle large statistical materials. Many study programmes for bioscience research also omit statistical mathematics as a subject of study. It is usually not possible to include adequate amounts of statistical mathematics in the programmes of study for researchers in different fields.
Currently, flow cytometry data is mainly analysed by setting one parameter to the X-axis, another parameter to the Y-axis and by in this manner arranging the elements in a two-dimensional graph. Among the problems encountered is that in an XY-plane only two different parameters can be observed simultaneously. As the number of parameters is increased, the number of XY-graphs required increases dramatically. Because the desire is to present and examine the information produced by the analyses with regard to all parameters measured, one graph is needed for each pair of parameters, i.e. n(n-1)/2 graphs for n parameters. For example, the use of four parameters requires simultaneous interpretation of six XY-graphs and correspondingly the use of six parameters already requires use of fifteen XY-graphs (Fig. 1). It is easy to understand how a researcher would find it very difficult to clearly perceive information and the connections between different variables from these types of separate graphs. Interesting findings may remain undiscovered due to the difficulty of interpreting the data.
In known art, there is a method for spreading multiparametric information in two dimensions, by the name of star coordinates. The method is presented, for example, in reference publication US2002171646. In star coordinates, each parameter is given a direction and a maximum length. After this, the elements move forward along each parameter for a distance corresponding to the value of their parameter in the direction indicated by the parameter. By this means, multiparametric information can also be spread in two dimensions. There are, however, many problems associated with star coordinates. First of all, the method is not automated in any way, rather the user must himself adjust the directions of the parameters and seek suitable values. Secondly, automating the method is difficult. The third problem is that elements may end up at the same point from several different routes (so-called pseudo-populations are possible), i.e. the route travelled by the elements is not without ambiguity.
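The ambiguity of star coordinates can be illustrated with a small Python sketch. It follows the verbal description above rather than the cited publication itself; the function name `star_coordinates`, the axis directions and the normalisation by a per-parameter maximum value are assumptions made for the example.

```python
import math

def star_coordinates(values, angles_deg, max_values):
    """Place an element by walking along one axis direction per parameter."""
    x = y = 0.0
    for value, angle, max_value in zip(values, angles_deg, max_values):
        step = value / max_value  # fraction of the axis' maximum length
        x += step * math.cos(math.radians(angle))
        y += step * math.sin(math.radians(angle))
    return (x, y)

# Four hypothetical parameters with axes pointing right, up, left and down.
angles = [0.0, 90.0, 180.0, 270.0]
maxima = [100.0, 100.0, 100.0, 100.0]
a = star_coordinates([10, 0, 0, 0], angles, maxima)
b = star_coordinates([20, 0, 10, 0], angles, maxima)
```

Elements `a` and `b` have clearly different parameter values, yet both land at practically the same point, because the step along the third axis cancels part of the step along the first. This is the kind of pseudo-population the background refers to.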
Methods have also been developed for clustering of flow cytometry data, for example the so-called Autoklus (Schut et al., 1993), which groups the elements into groups, i.e. clusters. The method is efficient, and using it the populations that are distinctly separate from one another can easily be found. The problem with the method is, however, that the manner in which the data is presented is still a two-dimensional XY-figure, and although different clusters can, for example, be coloured with different colours, the different clusters may settle on top of one another in the graphs, and interpretation of the data once again becomes more difficult. Another problem is that when the program is used, each element must be placed as belonging to some cluster. It is, however, possible that some population of elements does not form a clear population; rather, the element population in question forms, in relation to some parameter, a so-called slide, i.e. the population is, for example, distorted or its kurtosis is great, i.e. the intensity distribution of the population has long tails. This type of formation cannot be correctly solved using clustering.
Using a self-organizing map (i.e. SOM; T. Kohonen, Self-organizing Maps, New York, Springer-Verlag, 2001), it is also possible to create two-dimensional maps from complicated information, maps in which different populations are set to some defined area in the same manner as in the invention presented in this patent application. Elements containing parameters close to one another are closer to one another on a self-organizing map than elements having parameters that are clearly different. A SOM does not, however, create a map suitable, for example, for analysis of flow cytometry data; rather, it builds the map as a density figure of a given area. Additionally, understanding of a SOM is difficult because its mathematical basis is exceptionally challenging.
Considering the situation described above, there is a clear need for a fast and efficient method for the analysis of multiparametric information. Using the method, it should be possible to get large amounts of information into such an easily-understood form that even a person without an exceptional knowledge of statistical mathematics would be capable of handling and understanding it.
OBJECT OF THE INVENTION
The object of the invention is to disclose a new type of visual data mining method for the management of large amounts of data and as a tool for analysis of information. In particular, the object of the invention is to obviate the problems mentioned above.
SUMMARY OF THE INVENTION
The present invention presents an assumption-free and user-independent data mining and visualization method for the processing of multivariable research or measurement material into a more comprehensible form. The material comprises a large number of elements, and each element contains at least one parameter, wherein each parameter has a value. In the present invention, each element is first placed at an initial location in a coordinate system, which in one embodiment of the invention is a two-dimensional XY-plane. Two elements are selected, and the values of the parameters of these elements and the locations of the elements are compared. After this, force vectors are calculated such that a group of force vectors, one created by each of the other elements, is directed to one element. Each of these force vectors is parallel with the connecting line between the two elements, i.e. the force is either an attracting or a repelling force between the two elements. The magnitude of the force vector is a function of the difference between the parameter values of the two elements being contemplated and of the distance between them. The second element is then exchanged until the first element has been compared to all of the other elements. Each of the other elements causes a force vector directed to the first element; thus a sum force vector directed to the first element is obtained by summation of all the force vectors directed to this element. A sum force vector is calculated in a corresponding manner for all elements, or in other words, the first element is exchanged and the calculation described above is repeated.
When the sum force vectors have been found for all elements, each element is moved in the coordinate system by an amount equal to its sum force vector. After this, the iteration described above is repeated on the moved elements. Thus, new sum force vectors are obtained for the moved elements, according to which the elements are moved once again. Iterations are repeated until all sum force vectors are approximately null vectors or until a predefined number of iterations is reached. When the sum force vectors have vanished, a state of balance has been found and the elements essentially no longer move. A large volume of data can be analysed more easily from this type of data group distributed into different populations.
In one embodiment of the present invention, relationships, proportions, difference, similarity, proximity, volume, uniformity, form, correlations of the parameters, distribution or other measures of statistical mathematics are analysed for the populations formed by the elements in the state of balance.
In one embodiment of the present invention, the elements can at the beginning be placed in the coordinate system at the initial locations defined by some predefined algorithm or at randomly defined initial locations.
In one embodiment of the present invention, the difference of the values between the parameters of two elements is calculated by subtracting from each parameter value of the first element the corresponding parameter value of the second element and calculating an average of the absolute values of the differences. One alternative manner of calculating the difference of the values is to divide the corresponding two values of the parameters instead of subtracting them, and to calculate an average of the quotients. In calculation of the difference of the values, weighting can be laid on one or several parameters or on one or several elements, or one or several elements or parameters can be left out of the calculation of the difference of the values.
In one embodiment of the present invention, the sum force vectors calculated can be scaled according to a set scaling factor prior to moving of the elements.
In one embodiment of the present invention, the direction of an individual force vector is set to differ from the direction of the connecting line between the two elements.
In one embodiment of the present invention, so-called seed elements are first selected from the entire group of elements, and these seed elements are brought into a mutual state of balance using the presented method. After this, the non-seed elements are brought into a state of balance with the seed elements such that the non-seed elements are compared only to the seed elements. In this case, the seed elements also remain stationary, i.e. in the final phase force vectors are not calculated for them. The non-seed elements can initially be placed in the coordinate system at the point of the average of the locations of the group of seed elements in a state of balance, at the point defined by the average of the extreme values in the direction of each axis of the group of seed elements in a state of balance, at the point of the seed element for which the difference between the parameter values of the element and the seed element is minimized, or in some other manner which is advantageous for the arranging of the elements such that the number of necessary iterations can be expected to remain reasonable.
In one embodiment of the present invention, the elements of a large data group are first divided into sub-groups, each of which is first brought into an internal state of balance using the method according to the invention. After this, a sample of one or more elements is selected from each sub-group; the sample can, for example, be the element closest to the centre of the population formed by the elements of the sub-group. These samples are then brought into a mutual state of balance using the method according to the invention. These balanced sample elements are selected as seed elements, and the rest of the elements are brought into balance in relation to this group of seed elements as described above in connection with the seed elements.
In one embodiment of the present invention, the XY-plane can be replaced by a polar coordinate system, or a third dimension can be added to the XY-plane coordinate system in the form of a Z-axis. The angles between the axes can be 90 degrees, or alternatively the angle between at least one pair of axes in the coordinate system can be set at a value other than 90 degrees.
A method for data mining and visualization according to the present invention can be used, for example, in flow cytometry, gene chip methods, business economics, demographic information and information gathered for gallup polls as a tool for the analysis of data groups, or in the analysis of any statistical material or measurement material. In general, the idea of the invention can be used in multivariable analysis in any field where this type of information can be measured or collected and where there is a desire to analyse or visualize it more exactly. The method according to the present invention can be implemented using a computer program where the program code implements the different steps of the data mining and visualization method according to the present invention.
The presented invention is particularly good in those cases where a lot of data has accumulated, but the researcher does not have exact information about what sort of finding he is seeking; rather, he attempts to find something significant from the data, such as, for example, new correlations or connections between the parameters being studied. For this purpose, the presented method offers an excellent solution. In flow cytometry, a situation like that described above can arise, for example, where the researcher begins studying such characteristics of particles, for example of microbes (for example the light-refracting ability of different types of microbes from different angles), about which it is not known whether the parameters are significant or not. As another example, mention can be made of the researching and defining of yet-unknown characteristics and connections of cells whose characteristics are insufficiently known, such as stem cells.
The invention presented in this patent application obviates the disadvantages of methods of known art, as it is automatic and, due to the attracting or repelling influences of force vectors, superimpositions are prevented from occurring, i.e. the elements cannot end up at the same point by different routes; using this method, elements cannot form pseudo-populations. The invention presented in this patent application handles slides (in other words, distorted and undefined populations) correctly, as it allows each point to find its own place.
Compared to a SOM, the invention presented in this patent application allows the elements to be settled at the distance from one another that is necessary, and the area formed by the elements becomes exactly as large as the elements require. Additionally, the mathematical base of the invention now presented is simple and therefore the result is easily understood. Users are not then required to have special information, professional skill or ability in order to utilize the method of the invention.
An essential advantage of the present invention is that complicated information is easily and clearly changed into graphic form such that large volumes of information can be more easily analysed, and by this means it is possible to make interesting and significant findings regarding a data group. Visualization of a data group is then facilitated by the method of the invention. Without the method of the invention, an important finding, a connection between elements or other interdependence could remain undiscovered due to the complexity of the data group and the difficulty of interpreting it.
It can then be said that using the data mining and visualization method according to the invention, the number of dimensions used for representing the information can be reduced without loss of the data itself. In other words, using the invention, for example, a large group of different XY-plane graphs can be condensed to one XY-plane for presentation such that the information is preserved.
An advantage of the invention is further that no hypotheses or assumptions need to be made about the material. Errors relating to these are then left out when using the method of the invention.
An essential advantage of the invention is also that its number of applications is quite high. In different areas of technology and in different areas of life, research or measurements commonly create large amounts of information that must be analysed for making conclusions. A manner of data mining according to the present invention offers an excellent tool for this.
LIST OF FIGURES
Fig. 1 shows a flow cytometry data analysis method according to known art using several two-dimensional graphs,
Fig. 2 shows the method according to the present invention as a flow chart,
Fig. 3 illustrates the balanced state of the points of a data group achieved by the method according to the present invention,
Fig. 4 illustrates the manner of calculating force vectors between three elements according to the present invention,
Fig. 5 illustrates a manner according to the present invention for achieving the optimal definition of the method by calculating the relative difference of two elements in an alternative manner.
DETAILED DESCRIPTION OF THE INVENTION
The method presented in this patent application solves the problems described above by handling multiparametric information in a coordinate system such that the distance between individual parts or elements of a data group illustrates the difference between the parameters of the elements, and, using simple mathematical regularities, the elements are caused to gather as illustrative groups. The coordinate system in the embodiment presented subsequently is a two-dimensional XY-plane, but the inventive idea of the invention is also applicable in relation to other types of coordinate systems such as, for example, an XYZ coordinate system or a polar coordinate system. In the present invention, this type of presentation is achieved using the following algorithm, with reference in this connection to the flow chart shown in Fig. 2:
1) The coordinates are given to all elements in the XY coordinate system 200.
2) The value of each parameter of the element is compared to the value of each parameter of the second element and from these comparisons the difference between the parameters of the elements is calculated 201.
3) The distance of the element from the second element is calculated 201.
4) Using the distance and the difference of the parameters, the force vector is calculated according to a given function 202. The force vector is directed from the element either towards the second element or away from it.
5) It is checked whether the first element has already been compared to all other elements 203. If not all other elements have been gone through, the second element is exchanged 204 and the iteration 201, 202 is repeated. Steps 2, 3 and 4 are thus repeated for the element with respect to the third element 201, 202, and the force vector obtained from this is added to the previous force vector. Steps 2 to 4 are repeated until the element has been compared to all other elements. All the force vectors thus obtained and summed form the final sum force vector of the element 205.
6) It is checked whether a sum force vector has already been calculated for all elements 206. If not, the first element is exchanged 207 and the presented iteration steps (steps 2 to 5) are repeated. Thus, the sum force vector for the new element is obtained. By exchanging the first element again 207, a sum force vector is calculated for the third element and, by continuing the iteration, finally for all elements 205.
7) All elements are moved by an amount equal to their sum force vectors 208, after which it is checked whether at least one element moved 209. If one moved, two new elements are then selected 211 from the moved element group as initial values for further processing.
8) Steps 2 to 7 are repeated 201 to 211 until the elements no longer move. In this case, the elements have achieved a state of balance 212, in which the length of the sum force vector of each element is virtually zero when checked during phase 209.
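Steps 1 to 8 can be sketched as a short program. This is a minimal illustration under stated assumptions and not the definitive implementation: the function names, the tolerance `tol`, the iteration cap `max_iter` and the default damping `scale = 1/n` are hypothetical, and the force is oriented so that elements lying further apart than their relative difference attract one another while elements lying closer repel one another.

```python
import math

def relative_difference(p, q):
    """Step 2: mean absolute difference of corresponding parameter values."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

def sum_force(i, params, pos):
    """Steps 2 to 5: sum the force vectors all other elements direct to element i."""
    fx = fy = 0.0
    for j in range(len(params)):
        if j == i:
            continue
        dx, dy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
        dist = math.hypot(dx, dy)  # step 3
        if dist == 0.0:
            continue  # direction undefined for coincident elements; skipped in this sketch
        # Step 4 (assumed orientation): positive pull attracts, negative pull repels.
        pull = dist - relative_difference(params[i], params[j])
        fx += pull * dx / dist
        fy += pull * dy / dist
    return fx, fy

def balance(params, pos, scale=None, tol=1e-9, max_iter=1000):
    """Steps 6 to 8: move all elements until no sum force vector remains."""
    n = len(params)
    scale = 1.0 / n if scale is None else scale  # hypothetical damping choice
    pos = [tuple(p) for p in pos]
    for iteration in range(max_iter):
        forces = [sum_force(i, params, pos) for i in range(n)]   # step 6
        if all(math.hypot(fx, fy) <= tol for fx, fy in forces):  # check 209
            return pos, iteration                                # balance 212
        pos = [(x + scale * fx, y + scale * fy)                  # step 7
               for (x, y), (fx, fy) in zip(pos, forces)]
    return pos, max_iter

# Two elements with relative difference 10, initially only 4 apart.
positions, rounds = balance([[0.0], [10.0]], [(0.0, 0.0), (4.0, 0.0)])
```

In this two-element example the elements settle exactly 10 apart after one move. The damping is a design choice of the sketch: with an undamped move (`scale=1.0`) the same two elements overshoot and return to their starting points on alternate rounds.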
Using the above-described algorithm of this method, very complex multivariable data materials can be easily and quickly analysed 213 by finding the so-called balanced state of the elements of the data materials, such as is shown as an example in Fig. 3. In Fig. 3, the elements 30 have, after a given number of iterations, settled into a state of balance where the elements whose relative difference in relation to the parameters examined is small are gathered close to one another as a group or a population 31. On the other hand, the elements of such a population have settled further from those elements and populations from which the values of the parameters examined differ more. Relationships, proportions, difference, similarity, proximity, volume, uniformity, form, distribution or other measures of statistical mathematics of the populations 31 and/or the elements 30 thus formed can be easily and clearly analysed using a two-dimensional graph.
In the following, the steps of the algorithm of the method are described in more detail and with further reference to Fig. 4 as an example.
In step one of the method 200, both an X-value and a Y-value are given to each element. The values can be given, for example, randomly or on the basis of some predefined algorithm, or each element can be given, for example, the same value from the coordinate system.
In step two 201, the value of each parameter of the first element is subtracted from the value of each corresponding parameter of the second element and the absolute values of these differences are calculated. These absolute values are added together and divided by the number of parameters. Thus is obtained the relative difference of the parameters of the elements. The difference of the parameters of the elements can naturally also be calculated in other ways; for example, the value of each parameter can be divided by the value of the corresponding parameter of the second element. By calculating the average of these ratios, the difference of the parameters of the elements can be described. It is obvious to a person skilled in the art that in calculating the difference of the parameters of the elements, if needed, weighting can be set on some parameter or some elements, or, if needed, some parameter or element can be left out of the calculation, or this/these can be included in the calculation at some other step of the analysis algorithm.
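The difference measures of step two can be sketched as follows. The function names, the `weights` and `skip` arguments and the use of a weighted mean are assumptions made for illustration; the text only states that weighting can be applied and that parameters can be left out.

```python
def relative_difference(p, q, weights=None, skip=()):
    """Mean absolute difference of corresponding parameter values.

    weights: optional per-parameter weights (hypothetical weighted mean).
    skip:    indices of parameters left out of the calculation.
    """
    if weights is None:
        weights = [1.0] * len(p)
    used = [k for k in range(len(p)) if k not in skip]
    total_weight = sum(weights[k] for k in used)
    return sum(weights[k] * abs(p[k] - q[k]) for k in used) / total_weight

def ratio_difference(p, q):
    """Alternative measure: average of the ratios of corresponding values.

    Assumes no parameter value of the second element is zero.
    """
    return sum(a / b for a, b in zip(p, q)) / len(p)

first = [10.0, 200.0, 30.0]
second = [20.0, 100.0, 30.0]

plain = relative_difference(first, second)                    # (10 + 100 + 0) / 3
without_second = relative_difference(first, second, skip=(1,))  # (10 + 0) / 2
ratios = ratio_difference(first[:2], second[:2])              # (0.5 + 2.0) / 2
```

Note that the ratio-based measure equals one for identical elements and depends on which element is taken as the divisor, so it would need its own force function; that pairing is not specified here.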
In step three 201, the distance between the elements in the coordinate system is calculated, for example, using the Pythagorean theorem.
In step four 202, the force vector 41 directed to the first element 40 on account of the second element 42 is calculated. This is done, for example, by subtracting the distance from the relative difference, or by other calculations which will be contemplated further below. If the result is positive, i.e. the elements are closer to one another than their relative difference, the force vector has a length equal to the result and points directly away from the second element. If the result is negative, the force vector has a length equal to the magnitude of the result and points towards the second element.
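A single force vector of step four can be sketched as below. The function name `force_vector` is hypothetical. The orientation used here is an assumption chosen so that the balance is stable: a positive result (elements closer together than their relative difference) pushes the first element away from the second and a negative result pulls it closer, which agrees with the later remark that elements with a larger relative difference repel one another more strongly.

```python
import math

def force_vector(pos_first, pos_second, difference):
    """Force directed to the first element on account of the second element."""
    dx = pos_second[0] - pos_first[0]
    dy = pos_second[1] - pos_first[1]
    distance = math.hypot(dx, dy)  # step three, Pythagorean theorem
    if distance == 0.0:
        return (0.0, 0.0)  # direction undefined; left at rest in this sketch
    result = difference - distance  # distance subtracted from the relative difference
    # Assumed orientation: positive result -> away from the second element.
    return (-result * dx / distance, -result * dy / distance)

# Elements 5 apart with relative difference 2: too far apart, so attraction.
attract = force_vector((0.0, 0.0), (3.0, 4.0), 2.0)
# Same locations with relative difference 10: too close, so repulsion.
repel = force_vector((0.0, 0.0), (3.0, 4.0), 10.0)
```

In the first case the vector has length 3 and points towards the second element; in the second it has length 5 and points directly away from it.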
In step five, steps two, three and four 201, 202 are repeated for the first element with respect to the third element 43. It must be noted that the force vector 44 obtained from this and the force vector 41 obtained the previous time can point in different directions. These force vectors are summed, wherein the partial sum force vector 45 is obtained. The calculation of force vectors and their summation with the previous force vectors is continued for the first element 40 until the element 40 has been compared to all other elements 205.
In step six, steps two to five 201 to 207 are implemented for the other elements of the element group to be analysed. Using the method according to this patent application, a sum force vector 205 is calculated according to the same algorithm for each element using the principle presented above.
In step seven, the elements are moved 208 in the XY coordinate system in a direction defined by the sum force vector and by a distance defined by the length of the sum force vector. In this method, in one round of calculation new locations are calculated for the elements, wherein the distances and locations of all elements in relation to one another change, and due to this a new iteration round is needed. In this manner, the elements move little by little toward the state of balance, i.e. the length of the sum force vectors of the elements reduces and will in the end be zero, wherein a single element no longer moves 212.
Finding the state of balance of the elements can be speeded up, for example, in step seven by dividing the force vectors by a suitable constant before moving the elements, wherein the movement of the elements is less forceful than movement according to the force vectors would otherwise be. In this way, the possibly extreme movement of the elements can be restrained, and fewer iteration rounds are needed to achieve a state of balance when the constant is prudently selected. Imprudent selection of the constant, such as a constant having too large a value, may, however, slow the movement of the elements so much that, due to the slow movement, i.e. the short drifting lengths of the elements, more iteration rounds will be needed.
Depending on the type and quality of the data material, to achieve optimal definition the calculation of step four, where distance is subtracted from the relative difference, can be advantageously implemented in an alternative manner suitable for each case. At the point of this calculation, definition can be improved, for example, by calculating a value for the relative difference of all elements, for example, from a Gauss curve or from some other graph. In Fig. 5, the X-axis 50 carries the relative difference calculated using the formula of step two. When the relative difference is known, the value of the relative difference is replaced by the value of a function 51 like that presented in the figure, wherein a new value 54 for the difference is obtained from the Y-axis 52 of the graph, which in the exemplary case is greater than the original value 53. In this case, a situation can be achieved where elements further from one another in relative difference repel one another more strongly than in the situation where the relative difference is taken into the calculation of step four as such.
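Replacing the relative difference by a value read from a curve can be sketched as follows. The particular curve, the function name `gauss_remap` and the parameters `ceiling` and `sigma` are hypothetical; the text only says that the new value can be taken from a Gauss curve or from some other graph.

```python
import math

def gauss_remap(difference, ceiling, sigma):
    """Read a new difference value from a Gauss-shaped curve.

    Hypothetical curve: 0 at difference 0, rising smoothly towards `ceiling`.
    """
    return ceiling * (1.0 - math.exp(-0.5 * (difference / sigma) ** 2))

original = 100.0
remapped = gauss_remap(original, ceiling=1000.0, sigma=100.0)

# The remapped value would then take the place of the relative difference
# in the calculation where the distance is subtracted from it.
```

With these assumed settings the difference of 100 is replaced by a larger value, so the two elements would repel one another more strongly, while very small differences are replaced by still smaller values.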
Because in the above-described method according to the present invention each element is compared to all other elements in the data material, each iteration round requires many calculations, and desktop computers of the present day may not be capable of processing an information volume of, for example, one hundred thousand elements within a reasonable time. This slowness due to the limited calculation power of present-day computers can be solved, for example, by selecting from a large number of elements a given amount of so-called seed elements, which are first brought into mutual balance using phases 200, 201, ..., 212 of the method described above. After this, the rest of the elements of the data material are placed in the coordinate system according to phase 200 of the algorithm, and these are compared during each iteration round only to the seed elements according to the algorithm described above. At this stage, a force vector is no longer calculated for the seed elements, i.e. they remain stationary. Finding the balance of the seed elements can require even hundreds of iterations, but due to the small number of seed elements (for example 1000 to 5000 pieces) the number of calculations remains so small that the seed elements are brought into balance in a short time. After this, placing the rest of the elements in the coordinate system requires significantly fewer iterations, as the seed elements have already settled in the correct place. The rest of the elements can initially be placed in the XY coordinate system either by giving random X- and Y-values or by locating them in the coordinate system in a predefined manner, for example, by giving to all the same X- and Y-values in the middle of that part of the coordinate system where the balanced seed elements are located.
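The second phase, in which the remaining elements are compared only to stationary seed elements, can be sketched as follows. The function names, the damping `1/len(seeds)`, the tolerance and the iteration cap are assumptions; the force orientation is the same assumed one in which elements lying too far apart attract and elements lying too close repel. Because a non-seed element is never compared to other non-seed elements, each one can be placed independently of the others.

```python
import math

def relative_difference(p, q):
    """Mean absolute difference of corresponding parameter values."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

def place_against_seeds(params, seed_params, seed_pos, tol=1e-9, max_iter=1000):
    """Bring one non-seed element into balance with the fixed seed elements."""
    n = len(seed_pos)
    # Initial location: the middle (mean) of the balanced seed locations.
    x = sum(p[0] for p in seed_pos) / n
    y = sum(p[1] for p in seed_pos) / n
    scale = 1.0 / n  # hypothetical damping of the sum force vector
    for _ in range(max_iter):
        fx = fy = 0.0
        for spar, (sx, sy) in zip(seed_params, seed_pos):
            dx, dy = sx - x, sy - y
            dist = math.hypot(dx, dy)
            if dist == 0.0:
                continue  # direction undefined; skipped in this sketch
            pull = dist - relative_difference(params, spar)
            fx += pull * dx / dist
            fy += pull * dy / dist
        if math.hypot(fx, fy) <= tol:
            break  # the element no longer moves
        x += scale * fx
        y += scale * fy
    return x, y  # the seed elements themselves were never moved

# Two balanced seeds, 10 apart, with a relative difference of 10.
seed_params = [[0.0], [10.0]]
seed_pos = [(0.0, 0.0), (10.0, 0.0)]
placed = place_against_seeds([2.0], seed_params, seed_pos)
```

The element with parameter value 2 settles at a distance of 2 from the first seed and 8 from the second. With s seed elements and m remaining elements, one round needs m times s comparisons instead of the roughly n times n comparisons of the full method.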
After iterating the locations of the seed elements and before beginning the iteration of the other elements, different types of means can also be used for placing the other elements in the coordinate system. For example, the values of the parameters of each element can be compared before the first iteration round to the values of the parameters of the seed elements in the same manner as is done in step two of the method, and a relative difference is calculated for the values. After this, the element is given the coordinates of the seed element from which it has the smallest relative difference. Although the elements are not necessarily in exactly the right place after placing, it is possible to get the elements placed near their final balanced locations. After this, the elements are brought into their final state of balance with a relatively small number of iterations.
One other alternative for avoiding slowness due to limited calculation power, and problems caused by poorly selected seed elements, is to seek mutual balance between the elements of the data material in portions and to find seed elements on the basis of averages from the entire material. The elements of the data material (for example one million elements) can be split into groups (for example groups of 1000 elements), after which the elements of each group can be brought into an internal state of balance of its group using phases 200, 201, ..., 212 of the method described above. After this, these groups in a state of balance can be combined by processing an adequately characteristic sample or individual element from each group. This type of characteristic element from each group can be, for example, the values of the central locations of the populations formed by the elements of each group. These group sample points are brought into balance using phases 200, 201, ..., 212 of the method described above. These sample points settled into balance function as proper seed elements, to which the values of the parameters of each element are compared using phases 200, 201, ..., 212 of the method described above, thus achieving a final state of balance for all elements.
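The bookkeeping around the seed elements can be sketched with three small helpers. All names are hypothetical. As a simplifying assumption, `central_element` treats a whole group as one population and picks the element closest to the mean location, whereas a group may in practice contain several populations.

```python
import math

def relative_difference(p, q):
    """Mean absolute difference of corresponding parameter values."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

def split_into_groups(items, group_size):
    """Split the data material into consecutive groups of at most group_size."""
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]

def central_element(positions):
    """Index of the element closest to the mean location of a balanced group."""
    cx = sum(p[0] for p in positions) / len(positions)
    cy = sum(p[1] for p in positions) / len(positions)
    return min(range(len(positions)),
               key=lambda i: math.hypot(positions[i][0] - cx, positions[i][1] - cy))

def nearest_seed_position(params, seed_params, seed_pos):
    """Initial location: coordinates of the seed with the smallest relative difference."""
    best = min(range(len(seed_params)),
               key=lambda k: relative_difference(params, seed_params[k]))
    return seed_pos[best]

groups = split_into_groups(list(range(5)), 2)
centre = central_element([(0.0, 0.0), (10.0, 0.0), (4.0, 0.0)])
start = nearest_seed_position([5.0, 5.0], [[0.0, 0.0], [6.0, 6.0]],
                              [(1.0, 1.0), (9.0, 9.0)])
```

Each group would be balanced internally before `central_element` is applied to its locations, and the selected elements would then be balanced among themselves before serving as seed elements.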
The volume of information produced by the methods of modern scientific research is enormous and it will continue to grow vigorously. This flood of information will be increased, inter alia, by the development of new, ever more powerful research methods and by the so-called multiplexed systems created along with the development of existing methods. In these types of methods, one analysis measures from one sample not just one variable or analyte, but several parameters simultaneously. Concentrations of several variables, for example of immunological messenger molecules, can be measured in the same analysis. Using so-called gene chip methods (microarray analysis of gene fragments), thousands of different variables, for example the state of the function of genes, can be measured in one analysis. Flow cytometry is also a good example of the development of a historically single-analyte, or single-parameter, method into a multiplexed system. In the early days of flow cytometry, it was possible to measure only one parameter in one analysis, while at the present time more than ten parameters can already be analysed simultaneously using flow cytometry. There is a clear need for the method presented in this patent application already in the analysis of flow cytometry data alone, even though the information volume created in flow cytometry is just a fraction of the data material produced by many other methods and measurements. Using flow cytometry as an example, it can be stated that one data file can contain hundreds of thousands of elements, and for each element 4 to 15 parameters (for example the size, granularity, and different types of fluorescent labels of a particle) can be measured, whose values can vary, for example, in the range 0 to 1023 or over even greater ranges, starting even from negative values.
In addition to the analysis of flow cytometry data, the method in practice has applications in all fields where statistical science is needed. For example, the method is well suited for the processing of demographical material. Using the method, it is possible, for example, to analyse gallup-type polls, in which people are asked their opinions regarding many different matters, for example, on a scale of 1 to 10. In this case, statistical information is formed, in which there are many elements (people) as well as many parameters (questions). For this type of data it is difficult to find "populations" without the method presented in this patent application.
As one more extensive example of the many applications for the invention, an experiment can be mentioned which studies the productional and microbiological effects of animal feeding and additive intervention. The experiment can be related, for example, to poultry production. Because feeding has a significant effect on the welfare and growth of poultry, feeding parameters that express the composition of the feed given to the animals were included in the experiment as one group of parameters. In the example, additive interventions, wheat, shelled oats, oats, soybean, oil, lysine and threonine (2-amino-3-hydroxybutanoic acid) were used as the feeding parameters group.
As another parameter group, parameters relating to the living conditions of the animals and microbiological parameters linked to these can be examined. Examples of these are the dry-matter content of an intestinal content sample, the total bacterial content of an intestinal content sample, absolute bacterial numbers of desired different bacterial groups as well as the numerical value of a given microbiological index.
In the case of the example, the group of desired output parameters can be defined. These are typically animal-related production parameters. They can be, for example, internal fat, weight of the breast meat, mortality, feed efficiency and growth.
The task of the experiment was then to analyse and visualize the effect of feeding and additive intervention on productional and microbiological characteristics of the animals. Using the method according to the present invention, analysis was implemented graphically using a computer program such that each point of data represents one animal.
First the points of data are placed at their initial locations, which are selected as desired or at random. After this, it is determined how many different test feedings were used in the experiment. In the computer program according to the experiment, it is possible to select from the parameters mentioned above only the desired parameters whose influence it is desired to study. In this connection, the variables to be subsequently visualized are selected; these are all feeding parameters except additive intervention. After this, an analysis method according to the invention can be started. In this case, the data points are arranged at new locations and populations are created, in this example four separate populations. It can thus be stated that four different types of feeding were used in the experiment, and from the graphic presentation it can be perceived which feedings are more alike than others and, on the other hand, which feedings differ most from the others. The closer two populations of the graphic presentation are to one another, the more alike they are.
Next, the connection between the feeding parameters and the production parameters is examined. All production parameters are therefore added to the previous selection. When the computer program is run again, the analysis yields seven animal populations at different distances from one another. From this it can be concluded that one or several of the production parameters separate the populations from one another. It can also be seen that in three of the feeding groups the production parameters vary (several populations within the feeding group), but in the fourth feeding group the production parameters remain the same (one population in this feeding group).
After this, the computer program implemented using the analysis method of the invention offers useful features for continuing the analysis. The desired parameters can now be removed from the production parameters under analysis one at a time, and the analysis can be restarted. Using this exclusion principle, it can be discovered in which production parameters (one or more) there were differences between the populations. For example, the analysis reveals that internal fat and mortality vary between populations. The graphic presentation also shows, for example, for what percentage of the animals the fat percentage takes the greater of two possible values and for what percentage the smaller value. After this, in the case of the example, the additive intervention could be included among the parameters in addition to the feeding parameters. When the analysis is started and the data points once again settle at new locations, it can be seen that three animal populations divide in just the same manner as they did for internal fat and mortality in the earlier analysis.
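The exclusion principle can be sketched in Python as follows. This is a simplified stand-in, not the graphical method itself: instead of running the layout and inspecting populations visually, it counts the distinct value combinations of the selected parameters and reports how that count changes when one production parameter is dropped. The function names, parameter names and data are hypothetical.

```python
def count_populations(animals, parameters):
    """Count distinct value combinations over the chosen parameters.

    Assumption: identical value combinations stand in for one population.
    """
    return len({tuple(animal[p] for p in parameters) for animal in animals})


def exclusion_scan(animals, base, candidates):
    """Drop each candidate parameter in turn and record the population count."""
    full = count_populations(animals, base + candidates)
    counts = {}
    for dropped in candidates:
        kept = [p for p in candidates if p != dropped]
        counts[dropped] = count_populations(animals, base + kept)
    return full, counts


# Hypothetical data: one feeding parameter and two production parameters.
animals = [
    {"feeding": 1, "fat": 0, "growth": 5},
    {"feeding": 1, "fat": 1, "growth": 5},
    {"feeding": 2, "fat": 0, "growth": 5},
    {"feeding": 2, "fat": 0, "growth": 5},
]

full, counts = exclusion_scan(animals, ["feeding"], ["fat", "growth"])
print(full, counts)
```

A parameter whose removal lowers the population count (here `fat`) is one that separates populations; a parameter whose removal leaves the count unchanged (here `growth`) does not.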
As the result of the analysis, it can then be concluded that the additive intervention has some influence on, or interdependence with, the differences in internal fat variation and mortality.
The analysis can proceed in a similar manner when it is desired to visually analyse the influence of any other parameter. In the previous example, it would be natural to include the microbiological parameters next.
The invention is not limited to only the embodiment examples presented above; rather, many variations are possible within the scope of the inventive idea defined by the claims.

Claims

1. An assumption-free and user-independent data mining and visualization method for the processing of multivariable research and measurement material into a more comprehensible form, in which the material comprises elements which each element contains at least one parameter with its value, and which method comprises the step of: placing each element in a coordinate system at an initial location; characterised in that the method further comprises the steps of: calculating the difference of the values of the parameters of each element to the values of corresponding parameters of other elements; calculating the distance of each element from the other elements; calculating the force vectors to the elements caused by every other element such that the magnitude of the force vector is a function of the difference of the values of the parameters of the two elements being contemplated and the distance between them, and that each force vector is parallel with the connecting line between the two elements being contemplated; calculating a sum force vector for each element by summation of the force vectors directed to the element in question; moving each element in the coordinate system by an amount equal to the calculated sum force vector; performing iteration under repeating the calculation of distances, calculation of force vectors, calculation of sum force vectors and movement of elements; and ending iteration when the elements essentially no longer move or when a predefined number of iterations is reached.
2. A data mining and visualization method according to claim 1, characterised in that the method further comprises the step of: analysing relationships, proportions, difference, similarity, proximity, volume, uniformity, form, correlations of the parameters, distribution or other measures of statistical mathematics for the populations formed by the elements in the state of balance.
3. A data mining and visualization method according to any one of preceding claims 1 to 2, characterised in that the method further comprises the step of: placing the elements in the coordinate system at the initial locations defined by some predefined algorithm or at randomly defined initial locations.
4. A data mining and visualization method according to any one of preceding claims 1 to 3, characterised in that the method further comprises the step of: calculating the difference of the values of the parameters of two elements by subtracting the value of each parameter of the second element from the corresponding value of each parameter of the first element, summing the absolute values of the differences and dividing the sum by the number of parameters.
5. A data mining and visualization method according to any one of preceding claims 1 to 4, characterised in that the method further comprises the step of: calculating the difference of the values of the parameters of two elements by dividing each value of the parameter of the first element by the corresponding value of the parameter of the second element and calculating the average of the quotients.
6. A data mining and visualization method according to any one of preceding claims 1 to 5, characterised in that the method further comprises the step of: weighting one or more parameters, or one or more elements in the calculation of the difference of the values of the parameters of two elements, or leaving one or more parameters or elements out of the calculation of the value difference.
7. A data mining and visualization method according to any one of preceding claims 1 to 6, characterised in that the method further comprises the steps of: setting a scaling factor, and scaling the calculated sum force vectors by a scaling factor prior to movement of the elements.
8. A data mining and visualization method according to any one of preceding claims 1 to 7, characterised in that the method further comprises the steps of: selecting a group of seed elements as a group of elements for processing which group of seed elements is a sub-group of all the elements; bringing the group of seed elements into a mutual state of balance; including non-seed elements to the group of elements to be processed; and bringing the non-seed elements into a state of balance with the seed elements such that the non-seed elements are compared only to the seed elements, and that force vectors are not calculated for the seed elements.
9. A data mining and visualization method according to any one of preceding claims 1 to 8, characterised in that the method further comprises the step of: defining as the initial locations of the non-seed elements in the coordinate system the average point of the locations of the group of points of the seed elements in a state of balance.
10. A data mining and visualization method according to any one of preceding claims 1 to 9, characterised in that the method further comprises the step of: defining as the initial locations of the non-seed elements in the coordinate system the point defined by the average of the extreme values in the direction of each axis of the group of seed elements in a state of balance.
11. A data mining and visualization method according to any one of preceding claims 1 to 10, characterised in that the method further comprises the step of: defining as the initial locations of the non-seed elements in the coordinate system such a point of the seed element where the difference of the values of the parameters of the element contemplated and the seed elements is minimized.
12. A data mining and visualization method according to any one of preceding claims 1 to 11, characterised in that the method further comprises the steps of: dividing the elements into sub-groups; bringing each sub-group into an internal state of balance; selecting from each sub-group a sample of one or more elements; bringing the sub-group samples into a mutual state of balance; and bringing all elements into a state of balance keeping the balanced samples as seed elements.
13. A data mining and visualization method according to claim 12, characterised in that the method further comprises the step of: selecting one element as the sample of the subgroup, which element is closest to the central location of the population formed by the elements of the sub-group.
14. A data mining and visualization method according to any one of preceding claims 1 to 13, characterised in that the method further comprises the step of: setting at least one force vector in a different direction from the direction of the connecting line between the two elements.
15. A data mining and visualization method according to any one of preceding claims 1 to 14, characterised in that the method further comprises the step of: setting an XY-plane as the coordinate system.
16. A data mining and visualization method according to any one of preceding claims 1 to 15, characterised in that the method further comprises the step of: setting an XYZ coordinate system as the coordinate system.
17. A data mining and visualization method according to any one of preceding claims 1 to 16, characterised in that the method further comprises the step of: setting a polar coordinate system as the coordinate system.
18. A data mining and visualization method according to any one of preceding claims 1 to 17, characterised in that the method further comprises the step of: setting a value of an angle between at least one pair of axes in the coordinate system at an angle value other than 90 degrees.
19. A data mining and visualization method according to any one of preceding claims 1 to 18, characterised in that the method is used in flow cytometry as a tool for the analysis of information.
20. A data mining and visualization method according to any one of preceding claims 1 to 19, characterised in that the method is used in gene chip methods as a tool for the analysis of information.
21. A data mining and visualization method according to any one of preceding claims 1 to 20, characterised in that the method is used as a tool for the analysis of demographical information.
22. A data mining and visualization method according to any one of preceding claims 1 to 21, characterised in that the method is used as a tool for the analysis of information gathered in gallup polls.
23. A computer program for the assumption-free and user-independent data mining and visualization of multivariable research or measurement material into a more comprehensible form, in which the material comprises elements which each element contains at least one parameter with its value, and which computer program comprises a program code that when run on an information processing device is arranged to execute the step of: situating each element in a coordinate system at their initial locations; characterised in that when run on an information processing device the program code is further arranged to execute the following steps of: calculating the difference of the values of the parameters of each element to the values of corresponding parameters of other elements; calculating the distance of each element from the other elements; calculating the force vectors to the elements caused by every other element such that the magnitude of the force vector is a function of the difference of the values of the parameters of the two elements being contemplated and the distance between them, and that each force vector is parallel with the connecting line between the two elements being contemplated; calculating a sum force vector for each element by summation of the force vectors directed to the element in question; performing iteration under repeating the calculation of distances, calculation of force vectors, calculation of sum force vectors and movement of elements; and ending iteration when the elements essentially no longer move or when a predefined number of iterations is reached.
24. A computer program according to claim 23, characterised in that when run on an information processing device the program code is further arranged to execute the step of: analysing relationships, proportions, difference, similarity, proximity, volume, uniformity, form, correlations of the parameters, distribution or other measures of statistical mathematics for the populations formed by the elements in the state of balance.
25. A computer program according to any one of preceding claims 23 to 24, characterised in that when run on an information processing device the program code is further arranged to execute the step of: placing the elements in the coordinate system at the initial locations defined by some predefined algorithm or at randomly defined initial locations.
26. A computer program according to any one of preceding claims 23 to 25, characterised in that when run on an information processing device the program code is further arranged to execute the step of: calculating the difference of the values of the parameters of two elements by subtracting the value of each parameter of the second element from the corresponding value of each parameter of the first element, summing the absolute values of the differences and dividing the sum by the number of parameters.
27. A computer program according to any one of preceding claims 23 to 26, characterised in that when run on an information processing device the program code is further arranged to execute the step of: calculating the difference of the values of the parameters of two elements by dividing each value of the parameter of the first element by the corresponding value of the parameter of the second element and calculating the average of the quotients.
28. A computer program according to any one of preceding claims 23 to 27, characterised in that when run on an information processing device the program code is further arranged to execute the step of: weighting one or more parameters, or one or more elements in the calculation of the difference of the values of the parameters of two elements, or leaving one or more parameters or elements out of the calculation of the value difference.
29. A computer program according to any one of preceding claims 23 to 28, characterised in that when run on an information processing device the program code is further arranged to execute steps: setting a scaling factor, and scaling calculated sum force vectors by a scaling factor prior to movement of the elements.
30. A computer program according to any one of preceding claims 23 to 29, characterised in that when run on an information processing device the program code is further arranged to execute steps: selecting a group of seed elements as a group of elements for processing which group of seed elements is a sub-group of all the elements; bringing the group of seed elements into a mutual state of balance; including non-seed elements to the group of elements to be processed; and bringing the non-seed elements into a state of balance with the seed elements such that the non-seed elements are compared only to the seed elements, and that force vectors are not calculated for the seed elements.
31. A computer program according to any one of preceding claims 23 to 30, characterised in that when run on an information processing device the program code is further arranged to execute the step of: defining as the initial locations of the non-seed elements in the coordinate system the average point of the locations of the group of points of the seed elements in a state of balance.
32. A computer program according to any one of preceding claims 23 to 31, characterised in that when run on an information processing device the program code is further arranged to execute the step of: defining as the initial locations of the non-seed elements in the coordinate system the point defined by the average of the extreme values in the direction of each axis of the group of seed elements in a state of balance.
33. A computer program according to any one of preceding claims 23 to 32, characterised in that when run on an information processing device the program code is further arranged to execute the step of: defining as the initial locations of the non-seed elements in the coordinate system such a point of the seed element where the difference of the values of the parameters of the element contemplated and the seed elements is minimized.
34. A computer program according to any one of preceding claims 23 to 33, characterised in that when run on an information processing device the program code is further arranged to execute steps: dividing the elements into sub-groups; bringing each sub-group into an internal state of balance; selecting from each sub-group a sample of one or more elements; bringing the sub-group samples into a mutual state of balance; and bringing all elements into a state of balance keeping the balanced samples as seed elements.
35. A computer program according to claim 34, characterised in that when run on an information processing device the program code is further arranged to execute the step of: selecting one element as the sample of the subgroup which element is closest to the central location of the population formed by the elements of the subgroup.
36. A computer program according to any one of preceding claims 23 to 35, characterised in that when run on an information processing device the program code is further arranged to execute the step of: setting at least one force vector in a different direction from the direction of the connecting line between the two elements.
37. A computer program according to any one of preceding claims 23 to 36, characterised in that when run on an information processing device the program code is further arranged to execute the step of: setting an XY-plane as the coordinate system.
38. A computer program according to any one of preceding claims 23 to 37, characterised in that when run on an information processing device the program code is further arranged to execute the step of: setting an XYZ coordinate system as the coordinate system.
39. A computer program according to any one of preceding claims 23 to 38, characterised in that when run on an information processing device the program code is further arranged to execute the step of: setting a polar coordinate system as the coordinate system.
40. A computer program according to any one of preceding claims 23 to 39, characterised in that when run on an information processing device the program code is further arranged to execute the step of: setting a value of an angle between at least one pair of axes in the coordinate system at an angle value other than 90 degrees.
41. A computer program according to any one of preceding claims 23 to 40, characterised in that the computer program is stored on media readable by an information processing device.
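The iteration recited in claims 1, 4 and 7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation. The claims only require that the force magnitude be a function of the parameter difference and the distance; the spring-like law used here (magnitude equals distance minus difference) is an assumption chosen for illustration, as are the function names, the default scaling factor and the stopping tolerance.

```python
import math


def mean_abs_difference(a, b):
    """Claim 4 style difference: mean of the absolute per-parameter differences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def layout(elements, positions, scale=0.1, tol=1e-6, max_iter=5000):
    """Move 2-D points until they essentially stop or max_iter is reached.

    Assumed force law (not from the source): a spring whose rest length is
    the parameter difference, so magnitude = distance - difference.
    """
    n = len(elements)
    pos = [list(p) for p in positions]
    # Parameter differences do not change during iteration, so compute them once.
    diff = [[mean_abs_difference(elements[i], elements[j]) for j in range(n)]
            for i in range(n)]
    for iteration in range(1, max_iter + 1):
        moves = []
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy)
                if dist == 0.0:
                    continue  # coincident points: direction undefined, skip
                magnitude = dist - diff[i][j]  # > 0 attracts, < 0 repels
                # The force is parallel with the line connecting the two elements.
                fx += magnitude * dx / dist
                fy += magnitude * dy / dist
            # Claim 7: scale the sum force vector before moving the element.
            moves.append((scale * fx, scale * fy))
        for p, (mx, my) in zip(pos, moves):
            p[0] += mx
            p[1] += my
        # Stop when no element moved more than the tolerance.
        if max(math.hypot(mx, my) for mx, my in moves) < tol:
            break
    return pos, iteration


# Hypothetical one-parameter elements: two identical, one different.
elements = [[0.0], [0.0], [10.0]]
start = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
final, iterations = layout(elements, start)
print(iterations, final)
```

With this assumed force law, elements with identical parameter values are drawn together and dissimilar elements settle at a separation close to their parameter difference, which corresponds to the formation of populations in the state of balance. Other force functions would satisfy the claim wording equally well.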
PCT/FI2006/000160 2005-05-19 2006-05-19 A method for analyzing multiparameter data WO2006123013A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20050534 2005-05-19
FI20050534A FI20050534L (en) 2005-05-19 2005-05-19 Procedure for analyzing information with several parameters

Publications (1)

Publication Number Publication Date
WO2006123013A2 (en) 2006-11-23

Family

ID=34630101

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2006/000160 WO2006123013A2 (en) 2005-05-19 2006-05-19 A method for analyzing multiparameter data

Country Status (2)

Country Link
FI (1) FI20050534L (en)
WO (1) WO2006123013A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8635694B2 (en) 2009-01-10 2014-01-21 Kaspersky Lab Zao Systems and methods for malware classification

Also Published As

Publication number Publication date
FI20050534A0 (en) 2005-05-19
FI20050534L (en) 2006-11-20

Similar Documents

Publication Publication Date Title
Welch et al. SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data
Uyeda et al. Comparative analysis of principal components can be misleading
Monteiro Multivariate regression models and geometric morphometrics: the search for causal factors in the analysis of shape
US10289802B2 (en) Spanning-tree progression analysis of density-normalized events (SPADE)
Diggins et al. Methods for discovery and characterization of cell subsets in high dimensional mass cytometry data
Bielejec et al. Inferring heterogeneous evolutionary processes through time: from sequence substitution to phylogeography
Calenge et al. The concept of animals' trajectories from a data analysis perspective
US8874412B2 (en) Method for discovering relationships in data by dynamic quantum clustering
US20160070950A1 (en) Method and system for automatically assigning class labels to objects
Wu et al. Interactive analysis of gene interactions using graphical Gaussian model
US20090299646A1 (en) System and method for biological pathway perturbation analysis
Carteron et al. Assessing the efficiency of clustering algorithms and goodness-of-fit measures using phytoplankton field data
US20140006447A1 (en) Generating epigenentic cohorts through clustering of epigenetic suprisal data based on parameters
Weaver et al. Using geometric morphometric visualizations of directional selection gradients to investigate morphological differentiation
WO2020147557A1 (en) Method and device for processing intestinal microorganism sequencing data, storage medium, and processor
Fernstad et al. Visual exploration of microbial populations
WO2006123013A2 (en) A method for analyzing multiparameter data
Fu et al. Mapping morphological shape as a high-dimensional functional curve
WO2018165530A1 (en) Method of constructing a reusable low-dimensionality map of high-dimensionality data
Xu et al. Statistical inference in partially observed stochastic compartmental models with application to cell lineage tracking of in vivo hematopoiesis
Lee et al. Supervised classification of flow cytometric samples via the Joint Clustering and Matching (JCM) procedure
Omiotek et al. An efficient method for analyzing measurement results on the example of thyroid ultrasound images
Balsor et al. A primer on high-dimensional data analysis workflows for studying visual cortex development and plasticity
US11113853B2 (en) Systems and methods for blending and aggregating multiple related datasets and rapidly generating a user-directed series of interactive 3D visualizations
Rodríguez-Casado et al. A priori groups based on Bhattacharyya distance and partitioning around medoids algorithm (PAM) with applications to metagenomics

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

NENP Non-entry into the national phase

Ref country code: RU

WWW Wipo information: withdrawn in national office

Country of ref document: RU

122 Ep: pct application non-entry in european phase

Ref document number: 06743528

Country of ref document: EP

Kind code of ref document: A2