WO1989010596A1 - Data analysis method and apparatus - Google Patents

Data analysis method and apparatus

Info

Publication number
WO1989010596A1
Authority
WO
WIPO (PCT)
Prior art keywords
eigenvectors
samples
classifiers
eigenvalues
measurements
Application number
PCT/GB1989/000461
Other languages
French (fr)
Inventor
David Stephen Hickey
David Jeremy Prendergast
Original Assignee
The Victoria University Of Manchester
Application filed by The Victoria University Of Manchester filed Critical The Victoria University Of Manchester
Publication of WO1989010596A1 publication Critical patent/WO1989010596A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features

Definitions

  • pattern recognition is often used in the context of recognising characteristics of two dimensional images, for example photographs, but the principles involved in recognising the characteristics of images are no different from the principles involved in recognising the characteristics of other data sets.
  • pattern recognition is now used in the context of artificial intelligence research to cover the processing of relatively large sets of data in a multi-dimensional space.
  • This research is often concerned with standard expert systems and neural networks.
  • rules are generated by direct coding from a knowledge engineer's interpretation of information provided by an expert. This process is expensive and time consuming and is sometimes impossible, for example when there is no available "expert" who can provide a reliable interpretation of the information to be processed. Attempts are being made to introduce learning algorithms into expert systems but these are limited in application. Expert systems also tend to fail if presented with a situation not covered explicitly by the "expert" information provided.
  • Neural networks have generated a great deal of excitement as it is inherent in their structure that they incorporate learning processes.
  • the ability of neural networks to build up a knowledge base without requiring expert instruction in the form of rigid rules makes such networks particularly suitable for pattern recognition applications .
  • the length of the overlap between the hip joint component and the bone and the thickness of the cement is of significance but it is very difficult to produce a formula which enables these and other variable features to be given the appropriate weighting so that a joint can be categorised as either "good” (having an acceptable probable life) or "bad” (having an unacceptably short probable life) .
  • Karhunen-Loeve transformation (hereinafter referred to as a KL transformation)
  • the KL transformation involves an eigenvalue and eigenvector analysis of the covariance matrix. Matrices of similar form to the covariance matrix leading to eigenvectors minimising the mean absolute deviation, entropy or chisquare parameters may be substituted for the covariance matrix which leads to a least squares approximation.
  • there are several other mathematical transformations which may be used as approximations to the KL transformation, for example the Fourier, Hartley and Hadamard expansions.
  • Orthogonal transformations of this form produce orthogonal components which correspond to the eigenvectors and the magnitudes of which correspond to the eigenvalues of a KL transformation.
  • the use of such approximation does however result in a degradation of the resultant classification.
  • a detailed explanation of the KL transformation and the significance of the eigenvalues and eigenvectors resulting from the transformation can be found in "Pattern Recognition: a Statistical Approach" by P.A. Devijver and J. Kittler, Prentice Hall International, Englewood Cliffs, New Jersey, U.S.A. 1982 (ISBN 013 6542360). The content of this publication is incorporated herein by reference.
  • an apparatus for determining from a set of data describing A samples each in terms of B parameters those parameters which are important in distinguishing one sample from another comprising a. means for performing a first expansion to determine the eigenvectors and eigenvalues of a covariance or related matrix D, where
  • D = Σ CᵀC / A (summed over the A samples) or an approximation thereto, and C is a parameter vector or matrix having B elements corresponding to the B parameters of one of the samples, b. means for selecting the parameter corresponding to the largest value (by magnitude) in the eigenvector of greatest eigenvalue to constitute a first selected parameter, c. means for selecting the parameter other than the first selected parameter corresponding to the largest value by magnitude in the eigenvector having a second highest eigenvalue to constitute a second selected parameter, and d.
  • the said selected parameters describing substantially all the intrinsic variations in the data set and therefore describing substantially all the features of the samples which are significant in distinguishing one sample from another.
  • the invention also provides an apparatus for representing common features of a plurality A of samples each described by a plurality B of measurements, the apparatus comprising a. means for performing a first expansion to determine the eigenvectors and eigenvalues of a covariance or related matrix D, where:
  • D = Σ CᵀC / A (summed over the A samples) or an approximation thereto, and C is a measurement vector or matrix having B elements corresponding to the B measurements of one of the samples, b. means for identifying eigenvectors resulting from the first expansion which have eigenvalues greater than a predetermined or data dependent limit to form a group of E identified eigenvectors c. means for selecting the measurement corresponding to the largest magnitude value in the identified eigenvector having the highest eigenvalue to constitute a first selected measurement, d. means for selecting the measurement other than the first selected measurement corresponding to the largest magnitude value in the identified eigenvector having the second highest eigenvalue to constitute a second selected measurement. e.
  • G = Σ FᵀF / A or an approximation thereto, and F is a reduced measurement vector having E elements corresponding to the E selected measurements of each of the A samples, the resulting data representation constituting a description of the said common features of the A samples.
  • the invention further provides an apparatus for classifying a test sample into one of two or more classes on the basis of known classifications of A samples, each sample being described by E measurements which represent features common to all of the samples, the apparatus comprising a. means for performing an expansion to determine the eigenvectors and eigenvalues of the covariance or related matrix G where
  • means for identifying eigenvectors resulting from the further expansion which have eigenvalues greater than a predetermined or data dependent limit to form a group of K identified eigenvectors e.
  • M = Σ LᵀL / A or an approximation thereto, and L is a reduced classifier vector corresponding to the K selected classifiers for each of the A samples i.
  • the invention still further provides an apparatus for classifying a test sample into one of two or more classes on the basis of known classifications of A samples, each sample being described by E measurements which represent features common to all of the samples, the apparatus comprising a. means for performing an expansion to determine the eigenvectors and eigenvalues of the covariance or related matrix G, where
  • G = Σ FᵀF / A or an approximation thereto and F is a measurement vector having E elements corresponding to the E measurements of each of the A samples b.
  • M = Σ LᵀL / A or an approximation thereto d.
  • the invention still further provides an apparatus for classifying a test sample into one of two or more classes on the basis of known classifications of A samples, the apparatus comprising a. means for applying classifiers to form a classifier space comprising the eigenvectors and eigenvalues of M where
  • M = Σ LᵀL / A or an approximation thereto b. means for applying a further set of P classifiers to form a further classification vector Q for each of the A samples, c. means for forming a hierarchically higher classification space which may be of the described reduced form and comprises the eigenvectors and eigenvalues of R, where R = Σ QᵀQ / A or an approximation thereto
  • the present invention also provides methods for implementation by the apparatus referred to above.
  • the present invention in effect enables the number of input values which need to be considered to be dramatically reduced, thereby immediately simplifying the problem.
  • the present invention also enables classifiers to be selected which optimise the classification process.
  • a classifier is any device separating a sample or set of samples into a set of defined categories or classes.
  • a relatively simple apparatus can be used to perform the necessary processing of the reduced data set. For example a simple personal computer can rapidly assess a complex image recognition problem of the type described above with reference to artificial hip joints once the problem has been simplified by selection of the appropriate hip joint measurement data and classifier data.
  • references above to "related" matrices to a covariance matrix are intended to incorporate matrices of similar form to the covariance matrix leading to eigenvectors minimising the mean absolute deviation, entropy or chisquare parameters which may be substituted for the covariance matrix which leads to a least squares approximation.
  • the selected measurements and parameters may be of vector or matrix form as appropriate from the context.
  • Fig. 1 illustrates a system for initial evaluation of input data
  • Fig. 2 illustrates a system for receiving outputs provided by the system of Fig. 1;
  • Fig. 3 illustrates a system for processing a reduced data set obtainable by the application of the systems illustrated in Figs. 1 and 2;
  • Fig. 4 illustrates a system for classifying a data representation obtainable from the system of Fig. 3; and Fig. 5 illustrates a system for recognising the applicability of a predetermined classification scheme to freshly presented data sets on the basis of "training" provided by sets of data of known classification.
  • a data set is presented for analysis, the data set relating to A samples each of B measurements, the measurements being arranged as parameter vectors C.
  • This data can therefore be represented as follows:
  • nth sample: C_n = a_n,1  a_n,2  a_n,3 ... a_n,B
  • Ath sample: C_A = a_A,1  a_A,2  a_A,3 ... a_A,B
  • the parameter vectors C are presented sequentially to the inputs of B input nodes 1.
  • Each input node is connected to each of an array of weighting adjustment devices 2 each connected in turn to a respective first summing node 3.
  • Each input to a weighting adjustment device 2 is multiplied by a weighting factor determined by inputs from an initial weighting control device 4 and an associated comparator 5.
  • the initial weighting control device 4 initially selects random weightings and thereafter takes no further part in the process.
  • the comparators 5 receive inputs from a "preset summed output" device 6 that sets up initial output conditions for the first summing nodes 3.
  • the comparators then compare the actual outputs of the respective summing nodes and control by feedback loops the respective weighting adjustment devices to increase or decrease all the weighting factors to cause the first summing node output to track the preset initial output conditions. All the weighting factors determined by one weighting control device are increased or decreased in accordance with the respective comparator output.
  • the weighting factor adjustment routine described above can rely on theories of back propagation of errors, or any similar technique for weight determination. Such theories are discussed in the following document: Rumelhart, D.E., G.E. Hinton and R.J. Williams. "Learning Internal Representations by Error Propagation.” In Parallel Distributed Processing. Explorations in the Microstructures of Cognition. Cambridge: MIT Press, 1986.
  • the initially selected output vector from the first summing nodes 3 (that is the vector made up by the output of the B summing nodes) is set by the preset summed outputs device 6 subject to the condition that the squared values of the elements of the output vector O approximate to the minimum entropy distribution describing the distribution of eigenvalues from a KL expansion and are output in magnitude order, largest first, smallest last, such that the output of the summing node on the lefthand side of Fig. 1 receives the largest element, the next to the left receives the next largest, and so on.
  • the preset output vectors only require a number of non-zero outputs corresponding to the number of degrees of freedom in the data to be analysed.
  • the preset summed output device can be arranged to output a range of vectors O with different and increasing distribution widths.
  • An ideal distribution width can thus be determined by running a training set of data through the system and selecting the distribution which corresponds to the determined number of degrees of freedom. This distribution selection can be achieved using a variance comparator 7.
  • This comparator receives the outputs of the summing nodes 3 and the inputs to the input nodes 1 and adjusts the distribution width of the output of the preset summed outputs device 6 such that the variance of the sample data input to the input nodes 1 is approximately equal to that of the summed outputs. This may be achieved if the weightings are adjusted so that the individual sample output O has a magnitude approximately equal to the magnitude of the corresponding measurement sample C.
  • the outputs of the first summing nodes represent a projection of the data to be analysed into a space defined by the KL expansion which has been effected by the processing described above. This processing leads to a set of weights that approximates to the corresponding entries in the KL eigenvectors. These outputs are then applied to an eigenvector selection device described below with reference to Fig. 2.
  • each output from the first summing nodes is applied to a respective threshold device 8 which effectively turns that output off if a threshold level, e.g. 0.1, is not achieved.
  • the total variance of the outputs applied to the threshold devices 8 is an approximation to the selected orthogonal output values. By selecting only those outputs with a variance greater than the preset value the number of degrees of freedom of the data is determined with a representation error proportional to the preset value.
  • the outputs of the threshold devices 8, O_1, O_2, O_3 etc., are thus either zero or have variances greater than 0.1 of the total variance.
  • the summation runs from x = 1, where x is the number of the summing node 9 counted from the left in Fig. 2.
  • Comparators 10 compare the outputs of the respective summing nodes 9. Numbering the comparators from the left in Fig. 2, comparator 1 finds the largest magnitude input and outputs the identity of the measurement contributing to that sum. Comparator 2 then finds the largest non-zero input to it, not counting the sum to which the first identified measurement has contributed, and outputs the identity of a second measurement. Comparator 3 finds the largest non-zero input to it, not counting the two measurements already identified. Thus the outputs of the comparators identify the measurements in the original data which make a significant contribution to that data. Many of the outputs of the comparators 10 will be zero because of the threshold devices 8. For example if the data exhibits 5 degrees of freedom, only the first five of the comparators will provide outputs.
  • the first comparator will always provide an output and thus the associated threshold device could be dispensed with.
  • Equally many of the summing devices and comparators to the right-hand side of Fig. 2 could also be dispensed with in most circumstances.
  • the selection without replacement described above could continue starting again sequentially at the first comparator 10 so that more than one input value, is chosen by each comparator but that no output value is chosen twice.
  • ar_1 will be identical to one of the original measurements, say a_4, ar_2 to another of the original measurements, and so on.
  • the reduced set of E measurements is then processed as described with reference to Fig. 3.
  • the inputs are initially normalised in normalisation devices 11. That is to say, the reduced data set is normalised to zero mean and unit variance.
  • the normalised inputs are then applied to respective input nodes 12 which in turn are each connected to every one of an array of E weighting adjustment devices 13.
  • the weighting adjustment devices provide inputs to summing devices 14 which in turn supply comparators 15.
  • the apparatus operates in a manner similar to that described with reference to Figs. 1 and 2 above, with an initial random selection of weightings to be applied by the weighting adjustment devices 13 and an initial output vector determined by a preset summed output device 16.
  • the initially selected output is however derived from a single minimum entropy distribution, that is the same distribution as the one selected as being of minimum width by the initial weighting control device described with reference to Fig. 1.
  • the outputs from the comparators 15 represent a projection of the data contained in the reduced number (E) of measurements into a space defined by the KL expansion which has been effected by the system illustrated in Fig. 3. This is a highly efficient form of data representation facilitating subsequent analysis of the represented data.
  • input nodes 17 receive E inputs from the system described with reference to Fig. 3. These inputs are the scalar dot products of measurement vectors F having E elements with the selected KL eigenvectors or approximations thereto.
  • Each input node 17 is connected to each of H classifiers 18.
  • the classifiers each apply classification criteria which may be selected from any of the many known methods of classification or any such method which might be developed. The nature of the individual classifiers does not affect the invention. As a simple example, one classifier might consider whether or not the first three inputs to it are positive, the rest negative, and if such a condition is found will provide a particular output. Of course most classifiers will apply more complex procedures. Details of various classifiers are given in the document by Devijver and Kittler described above.
  • Each classifier 18 produces a respective output CLO_1 to CLO_H, each representing the result of the respective classification method.
  • the outputs CLO_1 to CLO_H constitute a set of measurements of the ability of the selected classifiers to separate out the original data into sets. Some of the classifiers will be highly effective, others might be useless.
  • the most effective combination of classifiers can be selected by subjecting the outputs CLO_1 to CLO_H to a KL expansion of the form described above with respect to Figs. 1 and 2. That is to say, the inputs CLO_1, CLO_2 ... CLO_H would be applied in place of inputs a_1, a_2 etc. in Fig. 1. There would be H rather than B inputs.
  • the same variance comparison, threshold level check, weight modification to determine eigenvectors and eigenvalue comparison routine would be followed. The result would be a reduced set of K classifier outputs CLR_1, CLR_2 ... CLR_K.
  • the K classifiers could be obtained by a single operation as described with reference to Fig. 4.
  • a hierarchical system could be employed. For example if a set of H1 classifiers was applied to the initial classifier output producing a second classifier data vector containing H1 measurements, this H1 classifier vector could then be reduced to a set of K effective classifiers using the above described device.
  • the process of applying sets of classifiers followed by classifier selection can be continued until a preset classification accuracy or condition is achieved.
  • the data (that is the output of the system illustrated in Fig. 3 which is based on the reduced number of measurements) is run through the structure of Fig. 4 subject to the reduction of the number of classifiers in Fig. 4 from H to K. This is illustrated in Fig. 5.
  • the reduced number of classifier outputs CLO_1, CLO_2 ... CLO_K from device 18 is then applied to two KL expansion devices 19 and 20 arranged in parallel.
  • Each of these devices is essentially the same in structure as that illustrated in Fig. 1, initially set out with random weightings and the same preset minimum entropy distribution as that used in the system of Fig. 3.
  • the KL device 19 also receives a "true" input derived from a training set, the object of the exercise being to be able to classify new data into one of two classes on the basis of a previously performed analysis of equivalent data each sample of which falls into one of two classes.
  • the classifier outputs CLO_1 to CLO_K of the training set of data are run through both the KL devices 19 and 20.
  • the "true" condition data are also input to the righthand KL devices 19.
  • the weightings in the righthand device associate the true condition with each of the K classifier outputs.
  • the weightings set up in the KL device 19 are output to a summation device 21 which for each classifier sums the product of the weighting from the "true” input to each output with the corresponding weighting from the classifier to that output, that is to say it forms a scalar dot product between the "true” input weightings and the corresponding weightings for each classifier.
  • this sum is output to the lefthand KL device 20, each output being input to the corresponding classifier input as used for prior training and later classification.
  • the outputs from the lefthand KL device 20 are then the coordinates of the "true" line in the final classifier space.
  • each of these 37 measurements from each radiograph was arranged in a list which constituted a measurement vector.
  • each measurement vector had 37 elements.
  • 40 measurement vectors were produced, 20 from radiographs of successful prostheses and 20 from radiographs of failed prostheses. These 40 measurement vectors represented training data which would be used to train the system in recognising good and bad joints.
  • the training data was represented in optimal form by performing a KL transformation.
  • the transformation was obtained from a covariance matrix which can be represented as follows:
  • D = Σ CᵀC / A or an approximation thereto, where A is the number of samples (40 in the training set) and C is the 37 element training vector.
  • the data were normalised to zero expectation (zero mean) by subtraction of the corresponding mean values of the training set from each of the elements in the measurement vectors before calculation of the covariance matrix in accordance with the above equation by multiplication of each measurement vector by its transpose and summation of the resultant matrices.
  • the KL transformation was simply obtained by calculating the eigenvectors and eigenvalues of this covariance matrix.
  • the eigenvectors are the uncorrelated KL features and form an orthogonal vector space.
  • the eigenvalues are the sample variances of the measurement vectors projected onto the KL axes (eigenvectors) by formation of the corresponding scalar dot products.
  • the eigenvalues and eigenvectors of the covariance matrix were obtained using Householder's method to yield the tridiagonal form of the matrix followed by application of the QL algorithm. Details of the Householder and QL algorithms are given in the document: "Numerical Recipes: The Art of Scientific Computing" by William H. Press, Brian P. Flannery, Saul A. Teukolsky and William T. Vetterling (1986), Cambridge University Press, Cambridge, England, pages 335-381 (ISBN 0 521 308119).
  • the most important measurements were identified by reference to the magnitude of the eigenvalues associated with the eigenvectors.
  • the eigenvectors were arranged in order of eigenvalue magnitude and eigenvectors having an eigenvalue less than 0.1 were ignored.
  • the ignored eigenvectors were identified by their low eigenvalue as containing little or no further useful information assuming that the eigenvectors of greater eigenvalue were considered.
  • 11 eigenvectors had eigenvalues greater than 0.1
  • the measurement corresponding to the largest value in the eigenvector of greatest eigenvalue was selected to constitute a first selected parameter.
  • the measurement corresponding to the largest magnitude value in the eigenvector having the second highest eigenvalue was then identified. Assuming that the identified measurement did not correspond to the first selected measurement, the identified measurement was taken as a second selected parameter. If the measurement corresponding to the largest magnitude value in the eigenvector having the second highest eigenvalue corresponded to the first selected parameter, the measurement corresponding to the second largest magnitude value in the eigenvector having the second highest eigenvalue was selected as the second selected parameter. This selection of measurements without replacement on the basis of the largest eigenvector magnitude values was continued until 11 different measurements had been selected. These 11 measurements were taken to be the only measurements of significance and the other 26 measurements contained in the basic data set were thereafter completely ignored.
  • the 11 selected measurements were then arranged in the form of a reduced measurement vector in respect of each of the 40 samples, each reduced measurement vector having only 11 elements .
  • a second KL transformation was performed on the reduced measurement vectors.
  • the values obtained by projection (scalar product) of the reduced measurement vectors onto the new KL axes in this reduced KL measurement space were normalised by division with the corresponding eigenvalues of this reduced space.
  • the resulting data representation constituted a description of the common features of the 40 samples of relatively simple form in which many of the measurements initially considered are entirely excluded.
  • a set of 19 simple classifiers was applied in the reduced KL measurement space defined by the earlier transformation (the second transformation). The nature and number of these classifiers is of limited importance provided that in total they capture sufficient discriminative information.
  • the first 10 classifiers were based upon simple linear measures of separation. The first five of these were derived from Student's t values and the second five from Mann-Whitney t values. These t values were calculated from the projections (scalar products) of the measurement vectors onto the KL eigenvectors. Probabilities P(t) associated with these t values were calculated using the continued fraction representation of the incomplete beta function. Student's t values were rejected if the variance of the two class samples fell outside the 5% confidence level in a two tailed F-test, probabilities for the F values being calculated using the sum of two incomplete beta functions. The results were converted into coordinates Q for eight discriminative lines formed from Student's and Mann-Whitney t values using, for the ith eigenvector
  • the fifth and tenth classifiers were obtained using the binomial theorem probabilities associated with the independent Student and Mann-Whitney t probabilities.
  • "Nearest neighbour" classifiers were also used, the nearest 3, 5 or 7 neighbours being used on a simple voting system. The class assigned was that with the greatest number. The projection (scalar dot product) of the product of the nearest neighbour distance vectors onto the KL eigenvectors form class hypervolumes which were used to estimate class probabilities and provided a further 3 classifiers. The final 3 nearest neighbour classifiers were obtained using hypervolumes calculated from the square of the distances to the nearest neighbours. The results from all these classifiers were scaled to values between -1 and 1.
  • Results obtained from the application of these classifiers provided a second set of data that required evaluation.
  • a 19 element classification vector was formed.
  • the success or failure of the prostheses in the training set was labelled as 1 or -1 respectively and this value was added to the classification vector to give a 20 element classification vector which now contained a "true" element in addition to the 19 classifier elements.
  • Both the 19 and 20 element classification vectors were then evaluated using a third KL expansion, the expansion being as described above in the case of the original measurement vectors.
  • the vector spaces resulting from the third expansion are classifier spaces.
  • a selection without replacement was performed on the 20 element classification vector to give a reduced classification vector.
  • the true entry in the KL eigenvector was excluded from the selection.
  • selection without replacement was equivalent to that described above with regard to the selection of the 11 measurement values, that is to say the classifier corresponding to the largest value by magnitude in the identified eigenvector having the highest eigenvalue is selected to constitute a first selected classifier, the classifier other than the first selected classifier corresponding to the largest value of the eigenvector having the second highest eigenvalue is selected to constitute a second selected classifier, and so on.
  • the true element in the 20 element classification vector defines a true line in the final reduced classification space.
  • In the reduced classifier space, projection (by scalar product) onto the true line gives the required prediction and an estimate of its validity; a compact end-to-end sketch of this pipeline is given after this list.
  • This classification of new data requires relatively few arithmetic operations and can thus be achieved almost instantaneously using modest equipment, e.g. a personal computer.
  • the system described above is particularly powerful as although the initial selection of measurements to be taken into account and the selection of classifiers to be used is computationally intensive subsequent processing on the basis of the selected measurements and classifiers is not computationally intensive.
  • the end user can therefore be provided with an operating system which is cheap and easy to use.
  • the system can be updated periodically by a fresh analysis of the total available data to check that the appropriate measurements are being selected and the most effective classifiers are being used.
  • the system supplier can update the systems which are in use to take account of increases in the available data but so far as the end user is concerned the system is a simple tool which can be relied upon to analyse the available information on a systematic basis.
  • classifiers can be applied to the problem of identifying whether or not a particular data set can be separated into classes the natures of which are not known. This is known as the "clustering" problem.
  • the definition of classifier given above is extended to include devices that perform the well known statistical operation of clustering data into subsets. Simply by reducing the data set to the minimum and then applying such classifiers, selected without any appreciation of how relevant they may be, one can readily pick out the most relevant classifiers for segmenting the data into subsets, or come to the conclusion that the data is not separable into useful sub-classes.
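A compact end-to-end sketch of the training pipeline described in the example above. The script uses synthetic data in place of the radiograph measurements (40 samples of 37 measurements, 11 retained) and a trivial bank of stand-in classifiers instead of the 19 t-statistic and nearest-neighbour classifiers actually used; every threshold and classifier here is illustrative, not a value from the study.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the training data: 40 samples x 37 measurements,
# 20 labelled +1 ("good") and 20 labelled -1 ("bad").
A, B, E = 40, 37, 11
labels = np.r_[np.ones(20), -np.ones(20)]
X = rng.normal(size=(A, B))
X[:20, :5] += 1.0                                  # give the two classes some separation

def kl(V):
    """Eigenpairs (descending) of V^T V / A for zero-mean V."""
    w, U = np.linalg.eigh(V.T @ V / V.shape[0])
    order = np.argsort(w)[::-1]
    return w[order], U[:, order]

def pick(U, n):
    """Selection without replacement by largest-magnitude eigenvector entries."""
    chosen = []
    for k in range(n):
        for idx in np.argsort(-np.abs(U[:, k])):
            if idx not in chosen:
                chosen.append(int(idx))
                break
    return chosen

# 1. First KL expansion and selection of the 11 most significant measurements.
Xc = X - X.mean(axis=0)
_, U1 = kl(Xc)
selected = pick(U1, E)

# 2. Second KL expansion on the reduced measurement vectors.
F = Xc[:, selected]
w2, U2 = kl(F)
coords = (F @ U2) / np.maximum(w2, 1e-12)          # projections scaled by the eigenvalues

# 3. Stand-in classifiers applied in the reduced space: each simply scores a
#    sample by one of its reduced coordinates.
classifiers = [lambda Z, a=a: Z[:, a] for a in range(coords.shape[1])]
CLO = np.column_stack([c(coords) for c in classifiers])

# 4. Third KL expansion on the classifier outputs plus the "true" element,
#    selection of the most effective classifiers, and the true line.
I = np.column_stack([CLO, labels])
_, U3 = kl(I - I.mean(axis=0))
keep = pick(U3[:-1, :], 5)                         # exclude the true row from the selection
L = CLO[:, keep] - CLO[:, keep].mean(axis=0)
_, U4 = kl(L)
reduced = L @ U4
true_line = reduced.T @ labels                     # label-correlated direction (true line)
true_line /= np.linalg.norm(true_line)

# 5. Classify the training samples by projection onto the true line.
pred = np.sign(reduced @ true_line)
print("training agreement:", np.mean(pred == labels))
```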

Abstract

An apparatus for determining from a set of data describing A samples each in terms of B parameters those parameters which are important in distinguishing one sample from another. The apparatus comprises means for performing a first expansion to determine the eigenvectors and eigenvalues of a covariance or related matrix D, where D is as shown in formula (I) or an approximation thereto, and C is a parameter vector or matrix having B elements corresponding to the B parameters of one of the samples. The parameter corresponding to the largest value (by magnitude) in the eigenvector of greatest eigenvalue is selected to constitute a first selected parameter. The parameter other than the first selected parameter corresponding to the largest value by magnitude in the eigenvector having a second highest eigenvalue is then selected to constitute a second selected parameter, and further parameters other than those previously selected corresponding to the largest magnitude values in the eigenvectors having the third, fourth and consequent highest eigenvalues are selected to constitute the third, fourth and subsequent selected parameters until the eigenvalues of the remaining eigenvectors are too small to contain the further significant data. The selected parameters describe substantially all the intrinsic variations in the data set and therefore describe substantially all the features of the samples which are significant in distinguishing one sample from another.

Description

DATA ANALYSIS METHOD AND APPARATUS
There are many circumstances in which it is desirable to be able to analyse sets of data describing samples which have common but not identical features. The term "pattern recognition" is often used in the context of recognising characteristics of two dimensional images, for example photographs, but the principles involved in recognising the characteristics of images are no different from the principles involved in recognising the characteristics of other data sets. Although the following description is primarily concerned with the analysis of data initially recorded in the form of two dimensional images it will be appreciated that the same techniques can be applied to any problems where it is necessary to be able to recognise and evaluate features which are common to a plurality of samples.
The term pattern recognition is now used in the context of artificial intelligence research to cover the processing of relatively large sets of data in a multi-dimensional space. This research is often concerned with standard expert systems and neural networks. In the case of expert systems rules are generated by direct coding from a knowledge engineer's interpretation of information provided by an expert. This process is expensive and time consuming and is sometimes impossible, for example when there is no available "expert" who can provide a reliable interpretation of the information to be processed. Attempts are being made to introduce learning algorithms into expert systems but these are limited in application. Expert systems also tend to fail if presented with a situation not covered explicitly by the "expert" information provided.
Neural networks have generated a great deal of excitement as it is inherent in their structure that they incorporate learning processes. The ability of neural networks to build up a knowledge base without requiring expert instruction in the form of rigid rules makes such networks particularly suitable for pattern recognition applications . Unfortunately as the number of parameters used to describe a particular sample increases so does the level of complexity of the neural network required to recognise common features in a series of such samples and as a result there is little prospect of neural networks being able to solve the majority of pattern recognition problems in the foreseeable future.
To give an example of the sort of problem which can be confronted in pattern recognition systems, artificial replacement hip joints have been fitted to patients over a period of many years. Generally radiographs are taken immediately after the implant operation so that the surgeon responsible for the operation can assess the results of the operation. A surgeon obviously does not wish to send a patient away with an artificial hip joint that is likely to fail within the foreseeable future. Unfortunately although all artificial hip joints can be described in terms of a series of common features there are many variations in these features and it is difficult to determine which of these variations are significant in terms of expected hip joint life. For example an element of the hip joint will be received in the patient's bone and secured therein using a medical cement. The length of the overlap between the hip joint component and the bone and the thickness of the cement is of significance but it is very difficult to produce a formula which enables these and other variable features to be given the appropriate weighting so that a joint can be categorised as either "good" (having an acceptable probable life) or "bad" (having an unacceptably short probable life).
In one study of artificial hip joints, 73 different measurements were taken from each of a series of radiographs the success or failure of which over a 12 year period was known. These measurements included distances and angles of the joint prosthesis with respect to the femur and pelvis, and the thickness of fixation cement at several points. The surgeons and scientists responsible for assessing the joints on the basis of these measurements had to rely on educated guesses as to the relative importance of each of the measurements and this assessment required a great deal of experience and careful attention.
It is known that the ability to classify data such as that describing each of the various hip joints referred to above can be improved by subjecting that data to a mathematical transformation, for example a Karhunen-Loeve transformation (hereinafter referred to as a KL transformation). The KL transformation involves an eigenvalue and eigenvector analysis of the covariance matrix. Matrices of similar form to the covariance matrix leading to eigenvectors minimising the mean absolute deviation, entropy or chi-square parameters may be substituted for the covariance matrix which leads to a least squares approximation. There are several other mathematical transformations which may be used as approximations to the KL transformation, for example the Fourier, Hartley and Hadamard expansions. Orthogonal transformations of this form produce orthogonal components which correspond to the eigenvectors and the magnitudes of which correspond to the eigenvalues of a KL transformation. The use of such approximation does however result in a degradation of the resultant classification. A detailed explanation of the KL transformation and the significance of the eigenvalues and eigenvectors resulting from the transformation can be found in "Pattern Recognition: a Statistical Approach" by P.A. Devijver and J. Kittler, Prentice Hall International, Englewood Cliffs, New Jersey, U.S.A. 1982 (ISBN 013 6542360). The content of this publication is incorporated herein by reference.
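By way of illustration, the covariance-matrix eigendecomposition at the heart of the KL transformation can be written in a few lines of NumPy. This is a minimal sketch, not the implementation described in this document; it assumes the A samples are the rows of an array and uses numpy.linalg.eigh in place of the Householder/QL routines cited elsewhere in this document.

```python
import numpy as np

def kl_transform(X):
    """Karhunen-Loeve expansion of a data set.

    X: array of shape (A, B) -- A samples, each described by B measurements.
    Returns eigenvalues (descending), eigenvectors (as columns) and the
    projections (scalar products) of the samples onto the KL axes.
    """
    Xc = X - X.mean(axis=0)               # normalise to zero mean
    D = Xc.T @ Xc / X.shape[0]            # covariance matrix D = sum(C^T C) / A
    eigvals, eigvecs = np.linalg.eigh(D)  # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]     # largest eigenvalue first
    return eigvals[order], eigvecs[:, order], Xc @ eigvecs[:, order]
```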
The use of a KL transformation or an approximation thereto does improve the ability to classify, but unfortunately when the number of values input into the KL transformation becomes large, for example in excess of 20, the classification ability is often reduced. There are several reasons for this. It is very difficult to obtain input values (e.g. the hip joint measurements referred to above) that are not effectively dependent upon the rest of the data set. As the number of input values is increased so is the probability of correlation between nominally different input values. One is confronted in effect with a problem of diminishing returns. For example if 5 input values can give a 90% classification accuracy, adding a further 5 input values can only increase the classification accuracy by classifying the 10% of errors. Correlation with the initial 5 input values would however increase the noise in the original data. It has been generally observed that increasing the number of measurements above a certain number can result in a reduction of the original classification accuracy. Thus the KL transformation approach to data classification loses its effectiveness when the number of input values to be considered is large but unfortunately this is to be expected in complex pattern recognition applications.
It is an object of the present invention to provide a method and apparatus which can obviate or mitigate the problems outlined above.
According to the present invention there is provided an apparatus for determining from a set of data describing A samples each in terms of B parameters those parameters which are important in distinguishing one sample from another, the apparatus comprising a. means for performing a first expansion to determine the eigenvectors and eigenvalues of a covariance or related matrix D, where
D = Σ CᵀC / A (summed over the A samples) or an approximation thereto, and C is a parameter vector or matrix having B elements corresponding to the B parameters of one of the samples, b. means for selecting the parameter corresponding to the largest value (by magnitude) in the eigenvector of greatest eigenvalue to constitute a first selected parameter, c. means for selecting the parameter other than the first selected parameter corresponding to the largest value by magnitude in the eigenvector having a second highest eigenvalue to constitute a second selected parameter, and d. means for selecting further parameters other than those previously selected corresponding to the largest magnitude values in the eigenvectors having the third, fourth and consequent highest eigenvalues to constitute the third, fourth and subsequent selected parameters until the eigenvalues of the remaining eigenvectors are too small to contain the further significant data, the said selected parameters describing substantially all the intrinsic variations in the data set and therefore describing substantially all the features of the samples which are significant in distinguishing one sample from another.
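Steps b. to d. amount to a selection without replacement driven by the largest-magnitude entries of the leading eigenvectors. A minimal sketch of that selection, assuming the eigenvectors are supplied as columns already sorted by descending eigenvalue:

```python
import numpy as np

def select_parameters(eigvecs, n_select):
    """For each leading eigenvector in turn, take the not-yet-chosen parameter
    with the largest-magnitude entry (steps b. to d. of the apparatus)."""
    selected = []
    for k in range(n_select):
        for idx in np.argsort(-np.abs(eigvecs[:, k])):  # entries of the k-th eigenvector
            if idx not in selected:                     # skip parameters already chosen
                selected.append(int(idx))
                break
    return selected
```

Here n_select would be set to the number of eigenvectors whose eigenvalues are large enough to contain significant data.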
The invention also provides an apparatus for representing common features of a plurality A of samples each described by a plurality B of measurements, the apparatus comprising a. means for performing a first expansion to determine the eigenvectors and eigenvalues of a covariance or related matrix D, where:
D = Σ CᵀC / A or an approximation thereto, and C is a measurement vector or matrix having B elements corresponding to the B measurements of one of the samples, b. means for identifying eigenvectors resulting from the first expansion which have eigenvalues greater than a predetermined or data dependent limit to form a group of E identified eigenvectors c. means for selecting the measurement corresponding to the largest magnitude value in the identified eigenvector having the highest eigenvalue to constitute a first selected measurement, d. means for selecting the measurement other than the first selected measurement corresponding to the largest magnitude value in the identified eigenvector having the second highest eigenvalue to constitute a second selected measurement. e. means for sequentially selecting further measurements other than those previously selected corresponding to the largest values by magnitude in the identified eigenvectors having the third to the Eth highest eigenvalues to constitute the third to the Eth selected measurements f. means for performing a second expansion to determine the eigenvectors and eigenvalues of a covariance or related matrix G, where
G = Σ FᵀF / A or an approximation thereto, and F is a reduced measurement vector having E elements corresponding to the E selected measurements of each of the A samples, the resulting data representation constituting a description of the said common features of the A samples.
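A sketch of the second expansion on the reduced measurement vectors F, in the same style as the sketch above. Scaling the projections by the eigenvalues follows the worked example given earlier on this page; the small eps guard is an implementation assumption.

```python
import numpy as np

def reduced_representation(X, selected, eps=1e-12):
    """Second KL expansion on the reduced measurement vectors F.

    X: (A, B) original data; selected: indices of the E chosen measurements.
    Returns the eigenpairs of G = sum(F^T F)/A and the coordinates of each
    reduced vector on the new KL axes, divided by the corresponding eigenvalues.
    """
    F = X[:, selected]
    F = F - F.mean(axis=0)
    G = F.T @ F / F.shape[0]
    eigvals, eigvecs = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    coords = (F @ eigvecs) / np.maximum(eigvals, eps)   # projections scaled by eigenvalues
    return eigvals, eigvecs, coords
```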
The invention further provides an apparatus for classifying a test sample into one of two or more classes on the basis of known classifications of A samples, each sample being described by E measurements which represent features common to all of the samples, the apparatus comprising a. means for performing an expansion to determine the eigenvectors and eigenvalues of the covariance or related matrix G where
G = Σ FᵀF / A and F is a measurement vector having E elements corresponding to the E measurements of each of the A samples, b. means for applying a set of classifiers to scalar dot products of the selected measurement vectors and the resultant eigenvectors to define a classification vector I for each of the A measurement vectors F, the vector I having a number of elements corresponding to the number H of classifiers plus a "true" element identifying the class into which the respective sample falls c. means for performing a further expansion to determine the eigenvectors and eigenvalues of the covariance or related matrix J, where J = Σ IᵀI / A or an approximation thereto d. means for identifying eigenvectors resulting from the further expansion which have eigenvalues greater than a predetermined or data dependent limit to form a group of K identified eigenvectors e. means for selecting the classifier corresponding to the largest magnitude value in the identified eigenvector having the highest eigenvalue to constitute a first selected classifier; f. means for selecting the classifier other than the first selected classifier corresponding to the largest magnitude value in the identified eigenvector having the second highest eigenvalue to constitute a second selected classifier g. means for selecting further classifiers other than those previously selected corresponding to the largest magnitude values in the identified eigenvectors having the third to the Kth highest eigenvalues to constitute the third to the Kth selected classifiers h. means for performing a still further expansion to determine the eigenvectors and eigenvalues of the covariance or related matrix M where
M = Σ LᵀL / A or an approximation thereto, and L is a reduced classifier vector corresponding to the K selected classifiers for each of the A samples i. means for projecting a line or lines representing true elements identifying the class into which each of the A samples falls into the space defined by the covariance or related matrix M j. means for representing the E measurements of the test sample in a measurement space defined by the covariance matrix G by forming scalar dot products with the eigenvectors of the covariance matrix G, k. means for applying the K selected classifiers to the scalar products,
l. means for representing the result of the application of the K selected classifiers in the reduced classifier space defined by the covariance matrix M, and m. means for projecting the representation by scalar products onto the true line or lines of the reduced classifier space, the projection giving an indication of the class into which the said one set of data falls and an estimation of the validity of the indication.
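The classifier-space stages above can be pulled together into a compact sketch. Everything here is illustrative: classifiers are assumed to be plain functions returning one score per sample, and the "true line" is formed as the label-correlated direction in the reduced classifier space, a simplification of the arrangement of KL devices 19 to 21 described with reference to Fig. 5 elsewhere in this document.

```python
import numpy as np

def kl_axes(V):
    """Eigenvalues/eigenvectors (descending) of the covariance-style matrix V^T V / A."""
    w, U = np.linalg.eigh(V.T @ V / V.shape[0])
    order = np.argsort(w)[::-1]
    return w[order], U[:, order]

def pick(U, n):
    """Selection without replacement by largest-magnitude eigenvector entries."""
    chosen = []
    for k in range(n):
        for idx in np.argsort(-np.abs(U[:, k])):
            if idx not in chosen:
                chosen.append(int(idx))
                break
    return chosen

def train_classifier_space(coords, labels, classifiers, n_keep):
    """coords: (A, E) KL coordinates of the A training samples; labels: +1/-1
    'true' elements; classifiers: list of functions mapping (A, E) -> (A,) scores."""
    CLO = np.column_stack([c(coords) for c in classifiers])  # H classifier outputs per sample
    I = np.column_stack([CLO, labels])                       # classification vectors incl. 'true' element
    _, U = kl_axes(I - I.mean(axis=0))
    keep = pick(U[:-1, :], n_keep)                           # 'true' row excluded from the selection
    mean_keep = CLO[:, keep].mean(axis=0)
    L = CLO[:, keep] - mean_keep                             # reduced classifier vectors
    _, UM = kl_axes(L)
    true_line = (L @ UM).T @ labels                          # direction associated with the 'true' element
    true_line /= np.linalg.norm(true_line)
    return keep, mean_keep, UM, true_line

def classify(sample_coords, classifiers, keep, mean_keep, UM, true_line):
    """Classify one test sample given as KL coordinates of shape (E,)."""
    clo = np.array([c(sample_coords[None, :])[0] for c in classifiers])
    score = ((clo[keep] - mean_keep) @ UM) @ true_line       # projection onto the true line
    return 1 if score >= 0 else -1
```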
The invention still further provides an apparatus for classifying a test sample into one of two or more classes on the basis of known classifications of A samples, each sample being described by E measurements which represent features common to all of the samples, the apparatus comprising a. means for performing an expansion to determine the eigenvectors and eigenvalues of the covariance or related matrix G, where
G = Σ FᵀF / A or an approximation thereto and F is a measurement vector having E elements corresponding to the E measurements of each of the A samples b. means for forming a classifier vector L for each of the A samples from K classifiers, c. means for performing a further expansion to determine the eigenvectors and eigenvalues of the covariance or related matrix M, where
M = Σ LᵀL / A or an approximation thereto d. means for projecting a line or lines representing the true elements identifying the class into which each of the A samples falls into the space defined by the covariance or related matrix M, e. means for representing the E measurements of the test sample in the measurement space defined by the covariance matrix G by forming scalar dot products with the eigenvectors of the covariance matrix G f. means for applying the K selected classifiers to the scalar products g. means for representing the result of the application of the K selected classifiers in the reduced classifier space defined by the covariance matrix M and h. means for projecting the representation by scalar products onto the true line or lines of the reduced classifier space, the projection giving an indication of the class into which the said one set of data falls and an estimation of the validity of the indication.
The invention still further provides an apparatus for classifying a test sample into one of two or more classes on the basis of known classifications of A samples, the apparatus comprising a. means for applying classifiers to form a classifier space comprising the eigenvectors and eigenvalues of M where
M = Σ LᵀL / A or an approximation thereto b. means for applying a further set of P classifiers to form a further classification vector Q for each of the A samples, c. means for forming a hierarchically higher classification space which may be of the described reduced form and comprises the eigenvectors and eigenvalues of R where
R = Σ QᵀQ / A or an approximation thereto, and d. means for successively applying the classifiers in hierarchically higher classification spaces which may also be of the form reduced by the described sequential selection until a final classification onto a true line or lines defined in the last classification space is achieved.
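The hierarchical variant simply repeats the apply-classifiers / form-KL-space cycle. A schematic sketch, assuming each level's classifiers are functions of the previous level's coordinates and leaving out the stopping test on classification accuracy:

```python
import numpy as np

def hierarchical_spaces(coords, classifier_levels):
    """Successively apply sets of classifiers, re-expanding each level's outputs
    into a new KL classification space (the matrices M, R, ... above)."""
    for classifiers in classifier_levels:                  # one list of score functions per level
        Q = np.column_stack([c(coords) for c in classifiers])
        Q = Q - Q.mean(axis=0)
        w, U = np.linalg.eigh(Q.T @ Q / Q.shape[0])        # eigenpairs of Q^T Q / A
        coords = Q @ U[:, np.argsort(w)[::-1]]             # coordinates fed to the next level
    return coords
```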
The present invention also provides methods for implementation by the apparatus referred to above. The present invention in effect enables the number of input values which need to be considered to be dramatically reduced, thereby immediately simplifying the problem. The present invention also enables classifiers to be selected which optimise the classification process. For the purpose of the present invention a classifier is any device separating a sample or set of samples into a set of defined categories or classes. In a particular application once a system has "learnt" the appropriate input values and the appropriate classifiers a relatively simple apparatus can be used to perform the necessary processing of the reduced data set. For example a simple personal computer can rapidly assess a complex image recognition problem of the type described above with reference to artificial hip joints once the problem has been simplified by selection of the appropriate hip joint measurement data and classifier data.
References above to "related" matrices to a covariance matrix are intended to incorporate matrices of similar form to the covariance matrix leading to eigenvectors minimising the mean absolute deviation, entropy or chisquare parameters which may be substituted for the covariance matrix which leads to a least squares approximation. In the case when measurements are input as matrices other than vectors it is intended that the selected measurements and parameters may be of vector or matrix form as appropriate from the context.
An embodiment of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Fig. 1 illustrates a system for initial evaluation of input data;
Fig. 2 illustrates a system for receiving outputs provided by the system of Fig. 1;
Fig. 3 illustrates a system for processing a reduced data set obtainable by the application of the systems illustrated in Figs. 1 and 2;
Fig. 4 illustrates a system for classifying a data representation obtainable from the system of Fig. 3; and Fig. 5 illustrates a system for recognising the applicability of a predetermined classification scheme to freshly presented data sets on the basis of "training" provided by sets of data of known classification.
The application of the present invention to a problem expressed in generalised terms will be described with reference to Figs. 1 to 5. Specific examples of the application of systems described in general terms with reference to those drawings will then be given.
Referring to Fig. 1, a data set is presented for analysis, the data set relating to A samples each of B measurements, the measurements being arranged as parameter vectors C. This data can therefore be represented as follows:
1st sample: C_1 = a_1,1  a_1,2  a_1,3 ... a_1,B
2nd sample: C_2 = a_2,1  a_2,2  a_2,3 ... a_2,B
nth sample: C_n = a_n,1  a_n,2  a_n,3 ... a_n,B
Ath sample: C_A = a_A,1  a_A,2  a_A,3 ... a_A,B
As shown at the top of Fig. 1, the parameter vectors C are presented sequentially to the inputs of B input nodes 1. Each input node is connected to each of an array of weighting adjustment devices 2 each connected in turn to a respective first summing node 3. Each input to a weighting adjustment device 2 is multiplied by a weighting factor determined by inputs from an initial weighting control device 4 and an associated comparator 5.
The first summing nodes 3 each perform the following summation: O_x = Σ (b = 1 to B) a_b W_b,x, where W_b,x is the weighting factor by which input a_b is multiplied before application to a particular summing node, and x is the position of the summing node counting from the left in Fig. 1.
The initial weighting control device 4 initially selects random weightings and thereafter takes no further part in the process. The comparators 5 receive inputs from a "preset summed output" device 6 that sets up initial output conditions for the first summing nodes 3. The comparators then compare the actual outputs of the respective summing nodes and control by feedback loops the respective weighting adjustment devices to increase or decrease all the weighting factors to cause the first summing node output to track the preset initial output conditions. All the weighting factors determined by one weighting control device are increased or decreased in accordance with the respective comparator output. The weighting factor adjustment routine described above can rely on theories of back propagation of errors, or any similar technique for weight determination. Such theories are discussed in the following document: Rumelhart, D.E., G.E. Hinton and R.J. Williams. "Learning Internal Representations by Error Propagation." In Parallel Distributed Processing. Explorations in the Microstructures of Cognition. Cambridge: MIT Press, 1986.
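The adaptive arrangement described above arrives at weights approximating the KL eigenvectors. As an illustration only, the sketch below uses a different, well-known iterative rule (Sanger's generalized Hebbian algorithm) whose weights also converge to the leading KL eigenvectors; it is not the preset-output and feedback arrangement described here.

```python
import numpy as np

def sanger_kl_weights(X, n_components, lr=0.01, epochs=200, seed=0):
    """Iteratively estimate the leading KL eigenvectors of X (rows = samples)
    using Sanger's generalized Hebbian algorithm."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                                      # zero-mean data
    W = rng.normal(scale=0.1, size=(n_components, X.shape[1]))   # random initial weightings
    for _ in range(epochs):
        for x in Xc:
            y = W @ x                                            # outputs of the summing nodes
            W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W                                                     # rows approximate KL eigenvectors
```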
The initially selected output vector from the first summing nodes 3 (that is the vector made up by the output of the B summing nodes) is set by the preset summed outputs device 6 subject to the condition that the squared values of the elements of the output vector O approximate to the minimum entropy distribution describing the distribution of eigenvalues from a KL expansion and are output in magnitude order, largest first, smallest last, such that the output of the summing node on the lefthand side of Fig. 1 receives the largest element, the next to the left receives the next largest, and so on.
The preset output vectors only require a number of non-zero outputs corresponding to the number of degrees of freedom in the data to be analysed. As this might not be known initially, the preset summed output device can be arranged to output a range of vectors O with different and increasing distribution widths. An ideal distribution width can thus be determined by running a training set of data through the system and selecting the distribution which corresponds to the determined number of degrees of freedom. This distribution selection can be achieved using a variance comparator 7. This comparator receives the outputs of the summing nodes 3 and the inputs to the input nodes 1 and adjusts the distribution width of the output of the preset summed outputs device 6 such that the variance of the sample data input to the input nodes 1 is approximately equal to that of the summed outputs. This may be achieved if the weightings are adjusted so that the individual sample output O has a magnitude approximately equal to the magnitude of the corresponding measurement sample C.
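Determining the number of degrees of freedom then reduces to counting the components that carry more than a preset fraction of the total variance; a minimal sketch, assuming the component variances (the eigenvalues) are already available:

```python
import numpy as np

def degrees_of_freedom(variances, threshold=0.1):
    """Count components whose variance exceeds `threshold` of the total variance
    (the role played by the threshold devices 8 and variance comparator 7)."""
    fractions = np.asarray(variances) / np.sum(variances)
    return int(np.sum(fractions > threshold))
```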
The outputs of the first summing nodes represent a projection of the data to be analysed into a space defined by the KL expansion which has been effected by the processing described above. This processing leads to a set of weights that approximates to the corresponding entries in the KL eigenvectors. These outputs are then applied to an eigenvector selection device described below with reference to Fig. 2.
Each of the outputs from the first summing nodes (Fig. 1) is applied to a respective threshold device 8 which effectively turns that output off if a threshold level, e.g. 0.1, is not achieved. The total variance of the outputs applied to the threshold devices 8 is an approximation to the selected orthogonal output values. By selecting only those outputs with a variance greater than the preset value, the number of degrees of freedom of the data is determined with a representation error proportional to the preset value. The outputs of the threshold devices 8, OO_1, OO_2, OO_3 etc., are thus either zero or have variances greater than 0.1 of the total variance. These outputs OO_x are supplied to second summing nodes 9 in which they are multiplied by the measurements a_1, a_2, ... a_B and the products summed, x being the number of the summing node 9 counted from the left in Fig. 2.
One such sum is formed from each of the B measurements in each sample. The result is B outputs from each of the 2nd summing nodes 9.
Comparators 10 compare the outputs of the respective summing nodes 9. Numbering the comparators from the left in Fig. 2, comparator 1 finds the largest magnitude input and outputs the identity of the measurement, say a_n1, contributing to that sum. Comparator 2 then finds the largest non-zero input to it, not counting the sum to which a_n1 has contributed, the corresponding measurement being a_n2. Comparator 3 finds the largest non-zero input to it, not counting a_n1 or a_n2. Thus the outputs of the comparators identify the measurements in the original data which make a significant contribution to that data. Many of the outputs of the comparators 10 will be zero because of the threshold devices 8. For example, if the data exhibits 5 degrees of freedom, only the first five of the comparators will provide outputs. Of course, as any relevant data will always have at least one degree of freedom, the first comparator will always provide an output and thus the associated threshold device could be dispensed with. Equally, many of the summing devices and comparators to the right-hand side of Fig. 2 could also be dispensed with in most circumstances.
In some cases, as determined by the user, the selection without replacement described above could continue, starting again sequentially at the first comparator 10, so that more than one input value is chosen by each comparator but no output value is chosen twice.
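The selection without replacement can be sketched as follows, assuming the eigenvectors are held as the columns of a NumPy array and using illustrative names.

import numpy as np

# Minimal sketch of selection without replacement: for each eigenvector retained
# by the threshold (largest eigenvalue first), take the measurement with the
# largest-magnitude entry that has not already been chosen.
def select_measurements(eigvecs, eigvals, threshold=0.1):
    order = np.argsort(eigvals)[::-1]                    # eigenvectors, largest eigenvalue first
    kept = [i for i in order if eigvals[i] > threshold]  # degrees of freedom retained
    selected = []
    for i in kept:
        for m in np.argsort(np.abs(eigvecs[:, i]))[::-1]:
            if m not in selected:                        # skip measurements already chosen
                selected.append(m)
                break
    return selected                                      # indices of the significant measurements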
At this stage of the process, therefore, information has been derived which is of great value in itself: only the measurements identified by the outputs of the comparators 10 need to be considered, or even recorded at the basic data collection stage. Whatever data classification processes are to be subsequently applied, the identified data contains all the information potentially valuable to such classification. Assuming that the initial number of measurements B has been reduced as described above to a number E, these measurements could be identified as follows:
ar_1, ar_2, ... , ar_E
Of course, ar_1 will be identical to one of the original measurements, say a_4, ar_2 to another, and so on. The reduced set of E measurements is then processed as described with reference to Fig. 3. Referring to Fig. 3, the inputs are initially normalised in normalisation devices 11. That is to say, the reduced data set is normalised to zero mean and unit variance. The normalised inputs are then applied to respective input nodes 12 which in turn are each connected to every one of an array of E weighting adjustment devices 13.
The weighting adjustment devices provide inputs to summing devices 14 which in turn supply comparators 15.
The apparatus operates in a manner similar to that described with reference to Figs. 1 and 2 above, with an initial random selection of weightings to be applied by the weighting adjustment devices 13 and an initial output vector determined by a preset summed output device 16. The initially selected output is however derived from a single minimum entropy distribution, that is the same distribution as the one selected as being of minimum width by the initial weighting control device described with reference to Fig. 1.
The outputs from the comparators 15 represent a projection of the data contained in the reduced number (E) of measurements into a space defined by the KL expansion which has been effected by the system illustrated in Fig. 3. This is a highly efficient form of data representation facilitating subsequent analysis of the represented data.
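In outline, and assuming illustrative names, this stage amounts to the following sketch: normalise the reduced measurement set and project each sample onto the eigenvectors of the covariance of the normalised data.

import numpy as np

# Sketch of the Fig. 3 stage: normalise the reduced measurements to zero mean and
# unit variance, then project each sample onto the KL axes of the normalised data.
def reduced_kl_projection(reduced):                      # reduced: (A samples, E measurements)
    z = (reduced - reduced.mean(axis=0)) / reduced.std(axis=0)   # normalisation devices 11
    cov = z.T @ z / z.shape[0]                           # covariance of the reduced data
    eigvals, eigvecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
    return z @ eigvecs[:, ::-1], eigvals[::-1]           # projections, largest variance first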
Having derived the information of representation appearing at the outputs of the comparators 15 shown in Fig. 3, that data can then be classified as described with reference to Fig. 4.
Referring to Fig. 4, input nodes 17 receive E inputs from the system described with reference to Fig. 3. These inputs are the scalar dot products of measurement vectors F having E elements with the selected KL eigenvectors or approximations thereto. Each input node 17 is connected to each of H classifiers 18. The classifiers each apply classification criteria which may be selected from any of the many known methods of classification or any such method which might be developed. The nature of the individual classifiers does not affect the invention. As a simple example, one classifier might consider whether or not the first three inputs to it are positive and the rest negative, and if such a condition is found will provide a particular output. Of course, most classifiers will apply more complex procedures. Details of various classifiers are given in the document by Devijver and Kittler referred to above.
Each classifier 18 produces a respective output CLO_1 - CLO_H, each representing the result of the respective classification method. The outputs CLO_1 - CLO_H constitute a set of measurements of the ability of the selected classifiers to separate out the original data into sets. Some of the classifiers will be highly effective, others might be useless. The most effective combination of classifiers can be selected by subjecting the outputs CLO_1 - CLO_H to a KL expansion of the form described above with respect to Figs. 1 and 2. That is to say, the inputs CLO_1, CLO_2 ... CLO_H would be applied in place of inputs a_1, a_2 etc. in Fig. 1. There would be H rather than B inputs. The same variance comparison, threshold level check, weight modification to determine eigenvectors and eigenvalue comparison routine would be followed. The result would be a reduced set of K classifier outputs CLR_1, CLR_2 ... CLR_K.
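The same selection idea, applied to classifier outputs rather than raw measurements, can be sketched as follows; the names are illustrative and a direct eigen-decomposition stands in for the adaptive arrangement of Figs. 1 and 2.

import numpy as np

# Sketch of classifier selection: the H classifier outputs per sample are treated
# exactly like a measurement vector, and selection without replacement over the
# retained eigenvectors picks the K effective classifiers.
def select_classifiers(clo, threshold=0.1):              # clo: (A samples, H classifier outputs)
    z = clo - clo.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(z.T @ z / z.shape[0])
    order = [i for i in np.argsort(eigvals)[::-1] if eigvals[i] > threshold]
    chosen = []
    for i in order:
        for c in np.argsort(np.abs(eigvecs[:, i]))[::-1]:
            if c not in chosen:
                chosen.append(c)
                break
    return chosen                                        # indices of the K retained classifiers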
Thus, having followed through the routines described with reference to Figs. 1 to 4, not only have we identified which of the original measurements can be discarded, thereby simplifying subsequent processing and initial data collection, but we have also provided a mechanism for selecting an optimal set of classifiers. Any classifier which might assist in classification can be tried, and only the combination which proves effective needs to be retained.
Let us assume that the above classifier selection process has identified K effective classifiers. The K classifiers could be obtained by a single operation as described with reference to Fig. 4. Alternatively, a hierarchical system could be employed. For example, if a set of H1 classifiers was applied to the initial classifier output, producing a second classifier data vector containing H1 measurements, this H1 classifier vector could then be reduced to a set of K effective classifiers using the above described device. The process of applying sets of classifiers followed by classifier selection can be continued until a preset classification accuracy or condition is achieved.
Once the reduced classifier set K has been identified, the data (that is the output of the system illustrated in Fig. 3 which is based on the reduced number of measurements) is run through the structure of Fig. 4 subject to the reduction of the number of classifiers in Fig. 4 from H to K. This is illustrated in Fig. 5.
With regard to Fig. 5, the reduced number of classifier outputs CLO_1, CLO_2 ... CLO_K from device 18 is then applied to two KL expansion devices 19 and 20 arranged in parallel. Each of these devices is essentially the same in structure as that illustrated in Fig. 1, initially set up with random weightings and the same preset minimum entropy distribution as that used in the system of Fig. 3. The KL device 19 also receives a "true" input derived from a training set, the object of the exercise being to be able to classify new data into one of two classes on the basis of a previously performed analysis of equivalent data, each sample of which falls into one of two classes.
The classifier outputs CLO_1 - CLO_K of the training set of data are run through both the KL devices 19 and 20. The "true" condition data are also input to the righthand KL device 19. The weightings in the righthand device associate the true condition with each of the K classifier outputs.
The weightings set up in the KL device 19 are output to a summation device 21 which, for each classifier, sums the product of the weighting from the "true" input to each output with the corresponding weighting from the classifier to that output; that is to say, it forms a scalar dot product between the "true" input weightings and the corresponding weightings for each classifier. For each of the K classifiers this sum is output to the lefthand KL device 20, each output being input to the corresponding classifier input as used for prior training and later classification. The outputs from the lefthand KL device 20 are then the coordinates of the "true" line in the final classifier space.
Having run training data through the system of Fig. 5 as outlined above, the system is now "trained". A new sample of unknown "true" condition is then received for analysis. The reduced number of measurements is taken from the sample, fed through the system of Fig. 3 and the lefthand KL device 20 of Fig. 5, and the final output indicates whether or not that sample is "true". Thus, although during training large volumes of data have been processed, once the training process is complete computation is relatively simple. The end user only requires a simplified form of the device together with appropriate guidance as to the measurements to be made and the presentation of those measurements to the simpler device.
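In outline, the trained classification step can be pictured by the following simplified stand-in for the Fig. 5 arrangement; the +1/-1 labelling, the names and the use of a single correlation direction are assumptions of the sketch, not the arrangement itself.

import numpy as np

# Simplified stand-in: the "true" labels define a direction in the reduced
# classifier space; a new sample is classified by the sign of its projection.
def fit_true_line(classifier_vectors, labels):           # (A, K) outputs and (A,) labels of +1/-1
    mean = classifier_vectors.mean(axis=0)
    direction = (classifier_vectors - mean).T @ labels   # correlation of each classifier with "true"
    return mean, direction / np.linalg.norm(direction)   # unit vector along the "true" line

def classify(sample_vector, mean, true_line):
    score = (sample_vector - mean) @ true_line            # scalar projection onto the "true" line
    return (1 if score > 0 else -1), abs(score)            # predicted class and a crude validity measure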
In one practical application of the techniques outlined in general terms above, data was analysed which was in the form of measurements taken semi-automatically, using a personal computer having peripherals including an image frame store and image display device, from anterior-posterior radiographs of replacement hip joints, the radiographs being taken immediately after the completion of implant surgery. In theory 73 measurements should have been available from each of 123 radiographs, each measurement relating to a distance or an angle of the prosthesis with respect to the femur and pelvis and the thickness of fixation cement at several points. Measurements were available from 123 radiographs each showing a prosthesis the success or failure of which over a 12 year period was known. A substantial number of measurements were not available for all of the radiographs and accordingly 37 of the 73 measurements were used in the analysis. Each of these 37 measurements from each radiograph was arranged in a list which constituted a measurement vector. Thus each measurement vector had 37 elements. In order to analyse the data to distinguish between successful and failed prostheses, 40 measurement vectors were produced, 20 from radiographs of successful prostheses and 20 from radiographs of failed prostheses. These 40 measurement vectors represented training data which would be used to train the system in recognising good and bad joints.
The training data was represented in optimal form by performing a KL transformation. The transformation was obtained from a covariance matrix which can be represented as follows:
D = Σ Cᵀ C / A or an approximation thereto, where the summation is over the A samples, A is the number of samples (40 in the training set) and C is the 37 element training vector.
The data were normalised to zero expectation (zero mean) by subtracting the corresponding mean values of the training set from each of the elements in the measurement vectors before the covariance matrix was calculated in accordance with the above equation, by multiplying each measurement vector by its transpose and summing the resultant matrices.
The KL transformation was simply obtained by calculating the eigenvectors and eigenvalues of this covariance matrix. The eigenvectors are the uncorrelated KL features and form an orthogonal vector space. The eigenvalues are the sample variances of the measurement vectors projected onto the KL axes (eigenvectors) by formation of the corresponding scalar dot products. The eigenvalues and eigenvectors of the covariance matrix were obtained using Householder's method to yield the tridiagonal form of the matrix followed by application of the QL algorithm. Details of the Householder and QL algorithms are given in the document: "Numerical Recipes: The Art of Scientific Computing" by William H. Press, Brian P. Flannery, Saul A. Teukolsky and William T. Vetterling (1986), Cambridge University Press, Cambridge, England, pages 335-381 (ISBN 0 521 308119).
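For orientation, the same decomposition can be reproduced with a present-day symmetric eigensolver standing in for the Householder and QL routines; this is a sketch with illustrative names, not the original implementation.

import numpy as np

# Equivalent computation with a library symmetric eigensolver in place of
# Householder tridiagonalisation followed by the QL algorithm.
def kl_expansion(measurements):                          # measurements: (A samples, B measurements)
    centred = measurements - measurements.mean(axis=0)   # zero-mean normalisation
    D = centred.T @ centred / measurements.shape[0]      # covariance matrix D = Σ CᵀC / A
    eigvals, eigvecs = np.linalg.eigh(D)                 # KL eigenvalues and eigenvectors
    return eigvals[::-1], eigvecs[:, ::-1]               # ordered, largest eigenvalue first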
The most important measurements (that is, the measurements which contain most information) were identified by reference to the magnitude of the eigenvalues associated with the eigenvectors. The eigenvectors were arranged in order of eigenvalue magnitude and eigenvectors having an eigenvalue less than 0.1 were ignored. The ignored eigenvectors were identified by their low eigenvalue as containing little or no further useful information, assuming that the eigenvectors of greater eigenvalue were considered. Eleven eigenvectors had eigenvalues greater than 0.1.
The measurement corresponding to the largest value in the eigenvector of greatest eigenvalue was selected to constitute a first selected parameter. The measurement corresponding to the largest magnitude value in the eigenvector having the second highest eigenvalue was then identified. Assuming that the identified measurement did not correspond to the first selected measurement, the identified measurement was taken as a second selected parameter. If the measurement corresponding to the largest magnitude value in the eigenvector having the second highest eigenvalue corresponded to the first selected parameter, the measurement corresponding to the second largest magnitude value in the eigenvector having the second highest eigenvalue was selected as the second selected parameter. This selection of measurements on the basis of the largest eigenvector magnitude values was continued without replacement until 11 different measurements had been selected. These 11 measurements were taken to be the only measurements of significance and the other 26 measurements contained in the basic data set were thereafter completely ignored.
The 11 selected measurements were then arranged in the form of a reduced measurement vector in respect of each of the 40 samples, each reduced measurement vector having only 11 elements. A second KL transformation was performed on the reduced measurement vectors. The values obtained by projection (scalar product) of the reduced measurement vectors onto the new KL axes in this reduced KL measurement space were normalised by division with the corresponding eigenvalues of this reduced space.
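A sketch of this second transformation, assuming the reduced vectors are held in a NumPy array and using illustrative names:

import numpy as np

# Sketch of the second transformation: project the reduced (11-element) vectors
# onto the new KL axes and divide each projection by the corresponding eigenvalue.
def normalised_reduced_space(reduced):                   # reduced: (40 samples, 11 measurements)
    centred = reduced - reduced.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(centred.T @ centred / reduced.shape[0])
    projections = centred @ eigvecs                      # scalar products with the KL axes
    return projections / eigvals                         # normalisation by eigenvalue division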
The resulting data representation constituted a description of the common features of the 40 samples of relatively simple form in which many of the measurements initially considered are entirely excluded.
Having produced a simplified representation of the data, appropriate classifiers were applied to that data to classify subsequent samples the classification of which is not known, for example data in the form of radiographs for which no good/bad data are known.
A set of 19 simple classifiers was applied in the reduced KL measurement space defined by the earlier transformation (the second transformation). The nature and number of these classifiers is of limited importance provided that in total they capture sufficient discriminative information. The first 10 classifiers were based upon simple linear measures of separation. The first five of these were derived from Student's t values and the second five from Mann-Whitney t values. These t values were calculated from the projections (scalar products) of the measurement vectors onto the KL eigenvectors. Probabilities P(t) associated with these t values were calculated using the continued fraction representation of the incomplete beta function. Student's t values were rejected if the variance of the two class samples fell outside the 5% confidence level in a two-tailed F-test, the probabilities for the F values being calculated using the sum of two incomplete beta functions. The results were converted into coordinates Q for eight discriminative lines formed from Student's and Mann-Whitney t values using, for the ith eigenvector:
Q_1(t_i) = t_i
Q_2(P(t_i)) = 1 / P(t_i)
Q_3(t_i) = tanh(t_i)
Q_4(P(t_i)) = 1 - P(t_i)
The fifth and tenth classifiers were obtained using the binomial theorem probabilities associated with the independent Student and Mann-Whitney t probabilities.
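For a single KL axis, one set of such coordinates might be computed as in the following sketch, which substitutes SciPy's two-sample t-test for the incomplete beta function evaluation and uses illustrative names.

import numpy as np
from scipy import stats

# Sketch of one set of t-based coordinates for a single KL axis; scipy's
# two-sample t-test stands in for the continued-fraction beta calculation.
def t_classifier_coordinates(projections_good, projections_bad):
    t, p = stats.ttest_ind(projections_good, projections_bad)   # class projections on one axis
    return np.array([t, 1.0 / p, np.tanh(t), 1.0 - p])          # Q_1 .. Q_4 as listed above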
"Nearest neighbour" classifiers were also used, the nearest 3, 5 or 7 neighbours being used on a simple voting system. The class assigned was that with the greatest number. Tlxe projection (scalar dot product) of the product of the nearest neighbour distance vectors onto the KL eigenvectors form class hypervolumes which were used to estimate class probabilities and provided a further 3 classifiers. The final 3i nearest neighbour classifiers were obtained using hypervolumes calculated from the square of the distances to the nearest neighbours. The results from all these classifiers were scaled to values between -1 and 1.
Results obtained from the application of these classifiers provided a second set of data that required evaluation. For each measurement vector in the training set, a 19 element classification vector was formed. In addition, the success or failure of the prostheses in the training set was labelled as 1 or -1 respectively and this value was added to the classification vector to give a 20 element classification vector which now contained a "true" element in addition to the 19 classifier elements. Both the 19 and 20 element classification vectors were then evaluated using a third KL expansion, the expansion being as described above in the case of the original measurement vectors. The vector spaces resulting from the third expansion are classifier spaces. A selection without replacement was performed on the 20 element classification vector to give a reduced classification vector. The true entry in the KL eigenvector was excluded from the selection.
The "selection without replacement" was equivalent to that described above with regard to the selection of the 11 measurement values, that is to say the classifier corresponding to the largest value by magnitude in the identified eigenvector having the highest eigenvalue is selected to constitute a first selected classifier, the classifier other than the first selected classifier corresponding to the largest value of the eigenvector having the second highest eigenvalue is selected to constitute a second selected classifier, and so on.
Finally, an additional reduced classification space was obtained using the reduced classification vector in a KL expansion. This space was normalised by eigenvalue division as described above.
No modification to the program was made on the basis of its classification performance, to avoid overdetermining the system on the particular data set. Two standard approaches were used to estimate the classification performance. The first was straightforward classification of measurement vectors not used in the training set, i.e. new measurements from the 123 radiographs other than the 40 used for training. This first method can be expected to provide a conservative estimate of the true classification potential. A second, unbiased, leave-one-out estimate was based upon the training set alone. In the leave-one-out procedure each sample was removed from the training data and classified on the basis of the remaining training data. The data was tested for consistency by using different training sets with varied numbers of measurements from each group and comparing the selected measurements and the accuracy of classification achieved on a leave-one-out basis.
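The leave-one-out estimate can be sketched as follows, assuming the full training-and-classification pipeline is available as a callable (an assumption of this sketch) and that the vectors and labels are NumPy arrays.

import numpy as np

# Sketch of the leave-one-out estimate: each sample is held out in turn, the
# pipeline is rebuilt from the rest, and the held-out sample is classified.
def leave_one_out_accuracy(vectors, labels, train_and_classify):
    correct = 0
    for i in range(len(vectors)):
        keep = np.arange(len(vectors)) != i              # boolean mask excluding sample i
        predicted = train_and_classify(vectors[keep], labels[keep], vectors[i])
        correct += int(predicted == labels[i])
    return correct / len(vectors)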
Data represented in the reduced classifier space was classified by the above method with greater than 90% accuracy. In the leave-one-out approach the classification rate was 92% with the initial training set. Twenty-five new reduced measurement vectors were then presented and all were correctly classified. (This result would occur at random less than three times in 100 million attempts.) The dimensionality and measurements selected were consistent and all additional attempts to test the classification accuracy gave similar results. Examination of the classifier results indicated that the features that predispose to hip failure include body weight, cement distribution and prosthesis orientation. These results were explicable in terms of mechanical stress and consistent with medical knowledge.
The true element in the 20 element classification vector defines a true line in the final reduced classification space. To classify a set of measurements it is necessary only to take the reduced number of measurements (11) and represent them in the second normalised KL measurement space (by forming scalar products with the eigenvectors and dividing the result by the eigenvalue), to apply the reduced number of classifiers, and to represent the result in the reduced classifier space. In the reduced classifier space projection (by scalar product) onto the true line gives the required prediction and an estimate of its validity. This classification of new data requires relatively few arithmetic operations and can thus be achieved almost instantaneously using modest equipment, e.g. a personal computer.
The system described above is particularly powerful: although the initial selection of measurements to be taken into account and the selection of classifiers to be used is computationally intensive, subsequent processing on the basis of the selected measurements and classifiers is not. The end user can therefore be provided with an operating system which is cheap and easy to use. The system can be updated periodically by a fresh analysis of the total available data to check that the appropriate measurements are being selected and the most effective classifiers are being used. Thus the system supplier can update the systems which are in use to take account of increases in the available data, but so far as the end user is concerned the system is a simple tool which can be relied upon to analyse the available information on a systematic basis.
It will be appreciated that the techniques outlined above can be used to analyse data which can be represented in the form of measurements or other input value vectors and is not limited to image analysis as such.
It will also be appreciated that the present invention can be applied to the problem of identifying whether or not a particular data set can be separated into classes the natures of which are not known. This is known as the "clustering" problem. The definition of classifier given above is extended to include devices that perform the well known statistical operation of clustering data into subsets. Simply by reducing the data set to the minimum and then applying such classifiers, selected without any appreciation of how relevant they may be, one can readily pick out the most relevant classifiers for segmenting the data into subsets, or come to the conclusion that the data is not separable into useful sub-classes.

Claims

1. An apparatus for determining from a set of data describing A samples each in terms of B parameters those parameters which are important in distinguishing one sample from another, the apparatus comprising a. means for performing a first expansion to determine the eigenvectors and eigenvalues of a covariance or related matrix D, where
D = Σ Cᵀ C / A or an approximation thereto, the summation being over the A samples, and C is a parameter vector or matrix having B elements corresponding to the B parameters of one of the samples, b. means for selecting the parameter corresponding to the largest value (by magnitude) in the eigenvector of greatest eigenvalue to constitute a first selected parameter, c. means for selecting the parameter other than the first selected parameter corresponding to the largest value by magnitude in the eigenvector having the second highest eigenvalue to constitute a second selected parameter, and d. means for selecting further parameters other than those previously selected corresponding to the largest magnitude values in the eigenvectors having the third, fourth and subsequent highest eigenvalues to constitute the third, fourth and subsequent selected parameters until the eigenvalues of the remaining eigenvectors are too small to contain further significant data, the said selected parameters describing substantially all the intrinsic variations in the data set and therefore describing substantially all the features of the samples which are significant in distinguishing one sample from another.
2. A method for determining from a set of data describing A samples each in terms of B parameters those parameters which are important in distinguishing one sample from another, wherein a. the B parameters of each of the A samples are arranged in the form of parameter vectors or matrices C each having B elements; b. a first expansion is performed to determine the eigenvectors and eigenvalues of a covariance or related matrix D, where:
D = Σ Cᵀ C / A or an approximation thereto, the summation being over the A samples; c. the parameter corresponding to the largest value (by magnitude) in the eigenvector of greatest eigenvalue is selected to constitute a first selected parameter; d. the parameter other than the first selected parameter corresponding to the largest value by magnitude in the eigenvector having the second highest eigenvalue is selected to constitute a second selected parameter; and e. further parameters other than those previously selected corresponding to the largest magnitude values of the eigenvectors having the third, fourth and subsequent highest eigenvalues are selected sequentially to constitute the third, fourth and subsequent selected parameters until the eigenvalues of the remaining eigenvectors are too small to contain further significant data; the said selected parameters describing substantially all the intrinsic variations in the data set and therefore describing substantially all the features of the samples which are significant in distinguishing one sample from another.
3. A method according to claim 2, wherein, after the selection of the said fourth and subsequent highest eigenvalues, the parameter not previously selected corresponding to the largest value (by magnitude) in the eigenvector of greatest eigenvalue is selected to constitute a still further selected parameter, followed by the sequential selection of further parameters not previously selected corresponding to the largest values by magnitude in the eigenvectors corresponding to the second, third and subsequent highest eigenvalues, the process continuing until an accurate representation of the data is obtained.
4. An apparatus for representing common features of a plurality A of samples each described by a plurality B of measurements, the apparatus comprising a. means for performing a first expansion to determine the eigenvectors and eigenvalues of a covariance or related matrix D, where:
D = Σ Cᵀ C / A or an approximation thereto, the summation being over the A samples, and C is a measurement vector or matrix having B elements corresponding to the B measurements of one of the samples, b. means for identifying eigenvectors resulting from the first expansion which have eigenvalues greater than a predetermined or data dependent limit to form a group of E identified eigenvectors, c. means for selecting the measurement corresponding to the largest magnitude value in the identified eigenvector having the highest eigenvalue to constitute a first selected measurement, d. means for selecting the measurement other than the first selected measurement corresponding to the largest magnitude value in the identified eigenvector having the second highest eigenvalue to constitute a second selected measurement, e. means for sequentially selecting further measurements other than those previously selected corresponding to the largest values by magnitude in the identified eigenvectors having the third to the Eth highest eigenvalues to constitute the third to the Eth selected measurements, f. means for performing a second expansion to determine the eigenvectors and eigenvalues of a covariance or related matrix G, where
G = Σ Fᵀ F / A or an approximation thereto, the summation being over the A samples, and F is a reduced measurement vector having E elements corresponding to the E selected measurements of each of the A samples, the resulting data representation constituting a description of the said common features of the A samples.
5. A method for representing common features of a plurality A of samples each described by a plurality B of measurements, wherein a. the B measurements of each of the A samples are arranged in the form of measurement vectors or matrices C each having B elements; b. a first expansion is performed to determine the eigenvectors and eigenvalues of a covariance or related matrix D, where:
D = Σ Cᵀ C / A or an approximation thereto, the summation being over the A samples; c. eigenvectors resulting from the first expansion which have eigenvalues greater than a predetermined or data dependent limit are identified to form a group of E identified eigenvectors; d. the measurement corresponding to the largest magnitude value in the identified eigenvector having the highest eigenvalue is selected to constitute a first selected measurement; e. the measurement other than the first selected measurement corresponding to the largest magnitude value in the identified eigenvector having the second highest eigenvalue is selected to constitute a second selected measurement; f. further measurements other than those previously selected corresponding to the largest values by magnitude in the identified eigenvectors having the third to the Eth highest eigenvalues are selected sequentially to constitute the third to the Eth selected measurements; g. the E selected measurements of each of the A samples are arranged in the form of a reduced measurement vector F having E elements; h. a second expansion is performed to determine the eigenvectors and eigenvalues of a covariance or related matrix G, where
G = Σ Fᵀ F / A or an approximation thereto, the summation being over the A samples, the resulting data representation constituting a description of the said common features of the A samples.
6. A method according to claim 5, wherein, after the said further measurements are selected, further selections are made to contribute to the E selected measurements, the further selections including the selection of the parameter not previously selected corresponding to the largest value (by magnitude) in the eigenvector of greatest eigenvalue followed by the selection of further parameters not previously selected corresponding to the largest values by magnitude in the eigenvectors corresponding to the second, third and subsequent highest eigenvalues, the process continuing until an accurate representation of the data is obtained.
7. An apparatus for classifying a test sample into one of two or more classes on the basis of known classifications of A samples, each sample being described by E measurements which represent features common to all of the samples, the apparatus comprising a. means for performing an expansion to determine the eigenvectors and eigenvalues of the covariance or related matrix G where
G = Σ Fᵀ F / A and F is a measurement vector having E elements corresponding to the E measurements of each of the A samples, b. means for applying a set of classifiers to scalar dot products of the selected measurement vectors and the resultant eigenvectors to define a classification vector I for each of the A measurement vectors F, the vector I having a number of elements corresponding to the number H of classifiers plus a "true" element identifying the class into which the respective sample falls, c. means for performing a further expansion to determine the eigenvectors and eigenvalues of the covariance or related matrix J, where

J = Σ Iᵀ I / A
or an approximation thereto d. means for identifying eigenvectors resulting from the further expansion which have eigenvalues greater than a predetermined or data dependent limit to form a group of K identified eigenvectors e. means for selecting the classifier corresponding to the largest magnitude value in the identified eigenvector having the highest eigenvalue to constitute a first selected classifier; f. means for selecting the classifier other than the first selected classifier corresponding to the largest magnitude value in the identified eigenvector having the second highest eigenvalue to constitute a second selected classifier g. means for selecting further classifiers other than those previously selected corresponding to the largest magnitude values of the identified eigenvectors having the third to the Kth highest eigenvalues to constitute the third to the Kth selected classifiers h. means for performing a still further expansion to determine the eigenvectors and eigenvalues of the covariance or related matrix M where
M = Σ Lᵀ L / A or an approximation thereto, and L is a reduced classifier vector corresponding to the K selected classifiers for each of the A samples, i. means for projecting a line or lines representing true elements identifying the class into which each of the A samples falls into the space defined by the covariance or related matrix M, j. means for representing the E measurements of the test sample in a measurement space defined by the covariance matrix G by forming scalar dot products with the eigenvectors of the covariance matrix G, k. means for applying the K selected classifiers to the scalar products,
l. means for representing the result of the application of the K selected classifiers in the reduced classifier space defined by the covariance matrix M, and m. means for projecting the representation by scalar products onto the true line or lines of the reduced classifier space, the projection giving an indication of the class into which the said one set of data falls and an estimation of the validity of the indication.
8. A method for classifying a test sample into one of two or more classes on the basis of known classifications of A samples, each sample being described by E measurements which represent features common to all of the samples, wherein a. the E measurements of each of the A samples are arranged in the form of a measurement vector F having E elements, b. an expansion is performed to determine the eigenvectors and eigenvalues of the covariance or related matrix G, where
G = Σ Fᵀ F / A or an approximation thereto; c. a set of H classifiers is applied to scalar dot products of the selected measurement vectors and the resultant eigenvectors to define a classification vector I for each of the A measurement vectors F, the vector I having a number of elements corresponding to the number H of classifiers plus a "true" element identifying the class into which the respective sample falls; d. a further expansion is performed to determine the eigenvectors and eigenvalues of the covariance or related matrix J, where
J = Σ Iᵀ I / A or an approximation thereto; e. eigenvectors resulting from the further expansion which have eigenvalues greater than a predetermined or data dependent limit are identified to form a group of K identified eigenvectors; f. the classifier corresponding to the largest magnitude value in the identified eigenvector having the highest eigenvalue is selected to constitute a first selected classifier; g. the classifier other than the first selected classifier corresponding to the largest magnitude value in the identified eigenvector having the second highest eigenvalue is selected to constitute a second selected classifier; h. further classifiers other than those previously selected corresponding to the largest magnitude values in the identified eigenvectors having the third to the Kth highest eigenvalues are selected sequentially to constitute the third to the Kth selected classifiers; i. the K selected classifiers are arranged in the form of a reduced classifier vector L for each of the A samples; j. a still further expansion is performed to determine the eigenvectors and eigenvalues of the covariance or related matrix M, where
M = Σ Lᵀ L / A or an approximation thereto; k. a line or lines representing "true" elements identifying the class into which each of the A samples falls is projected into the space defined by the covariance or related matrix M;
l. the E measurements of the test sample are represented in the measurement space defined by the covariance matrix G by forming scalar dot products with the eigenvectors of the covariance matrix G; m. the K selected classifiers are applied to the scalar products; n. the result of the application of the K selected classifiers is represented in the reduced classifier space defined by the covariance matrix M; and o. the representation is projected by scalar products onto the "true" line or lines of the reduced classifier space, the projection giving an indication of the class into which the said one set of data falls and an estimation of the validity of the indication.
9. A method according to claim 8, wherein, after the said further classifiers are selected, further selections are made to contribute to the K selected classifiers, the further selections including the selection of the parameter not previously selected corresponding to the largest value (by magnitude) in the eigenvector of greatest eigenvalue followed by the selection of further parameters not previously selected corresponding to the largest values by magnitude in the eigenvectors corresponding to the second, third and subsequent highest eigenvalues, the process continuing until an accurate representation of the data is obtained.
10. A method according to claim 8, wherein the classifier space is further reduced in dimensionality by selecting the classifier or classifiers most correlated with the true lines by maximum scalar dot product and classifiers most correlated with the residual of the true line when the selected classifier or classifiers are subtracted.
11. An apparatus for classifying a test sample into one of two or more classes on the basis of known classifications of A samples, each sample being described by E measurements which represent features common to all of the samples, the apparatus comprising a. means for performing an expansion to determine the eigenvectors and eigenvalues of the covariance or related matrix G, where G = Σ Fᵀ F / A or an approximation thereto and F is a measurement vector having E elements corresponding to the E measurements of each of the A samples, b. means for forming a classifier vector L for each of the A samples from K classifiers, c. means for performing a further expansion to determine the eigenvectors and eigenvalues of the covariance or related matrix M, where
M = Σ Lᵀ L / A or an approximation thereto, d. means for projecting a line or lines representing the true elements identifying the class into which each of the A samples falls into the space defined by the covariance or related matrix M, e. means for representing the E measurements of the test sample in the measurement space defined by the covariance matrix G by forming scalar dot products with the eigenvectors of the covariance matrix G, f. means for applying the K selected classifiers to the scalar products, g. means for representing the result of the application of the K selected classifiers in the reduced classifier space defined by the covariance matrix M, and h. means for projecting the representation by scalar products onto the true line or lines of the reduced classifier space, the projection giving an indication of the class into which the said one set of data falls and an estimation of the validity of the indication.
12. A method for classifying a test sample into one of two or more classes on the basis of known classifications of A samples, each sample being described by E measurements which represent features common to all of the samples, wherein a. the E measurements of each of the A samples are arranged in the form of a measurement vector F having E elements; b. an expansion is performed to determine the eigenvectors and eigenvalues of the covariance or related matrix G, where
G = Σ Fᵀ F / A or an approximation thereto; c. K classifiers are arranged in the form of a classifier vector L for each of the A samples; d. a further expansion is performed to determine the eigenvectors and eigenvalues of the covariance or related matrix M, where
M = Σ Lᵀ L / A or an approximation thereto; e. a line or lines representing the "true" elements identifying the class into which each of the A samples falls is projected into the space defined by the covariance or related matrix M; f. the E measurements of the test sample are represented in the measurement space defined by the covariance matrix G by forming scalar dot products with the eigenvectors of the covariance matrix G; g. the K selected classifiers are applied to the scalar products; h. the result of the application of the K selected classifiers is represented in the reduced classifier space defined by the covariance matrix M; and i. the representation is projected by scalar products onto the "true" line or lines of the reduced classifier space, the projection giving an indication of the class into which the said one set of data falls and an estimation of the validity of the indication.
13. An apparatus for classifying a test sample into one of two or more classes on the basis of known classifications of A samples, the apparatus comprising a. means for applying classifiers to form a classifier space M where
M = Σ Lᵀ L / A or an approximation thereto, b. means for applying a further set of P classifiers to form a further classification vector Q for each of the A samples, c. means for forming a hierarchically higher classification space R where
R = Σ Qᵀ Q / A or an approximation thereto, and d. means for successively applying the classifiers in hierarchically higher classification spaces until a final classification onto a true line or lines defined in the last classification space is achieved.
14. A method for classifying a test sample into one of two or more classes on the basis of known classifications of A samples, wherein a. classifiers are applied to form a classifier space M where
M = Σ Lᵀ L / A or an approximation thereto; b. a further set of P classifiers is applied to form a further classification vector Q for each of the A samples; c. the classification vectors Q are used to form a hierarchically higher classification space R where
R = Σ Qᵀ Q / A or an approximation thereto; and d. the classifiers are applied successively in hierarchically higher classification spaces until a final classification onto a true line or lines defined in the last classification space is achieved.
PCT/GB1989/000461 1988-04-30 1989-05-02 Data analysis method and apparatus WO1989010596A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB8810357.7 1988-04-30
GB888810357A GB8810357D0 (en) 1988-04-30 1988-04-30 Data analysis method & apparatus

Publications (1)

Publication Number Publication Date
WO1989010596A1 true WO1989010596A1 (en) 1989-11-02

Family

ID=10636224

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1989/000461 WO1989010596A1 (en) 1988-04-30 1989-05-02 Data analysis method and apparatus

Country Status (3)

Country Link
EP (1) EP0419496A1 (en)
GB (1) GB8810357D0 (en)
WO (1) WO1989010596A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0382466A2 (en) * 1989-02-09 1990-08-16 Philip Morris Products Inc. Methods and apparatus for optically determining the acceptability of products
US5146510A (en) * 1989-02-09 1992-09-08 Philip Morris Incorporated Methods and apparatus for optically determining the acceptability of products
US5237621A (en) * 1991-08-08 1993-08-17 Philip Morris Incorporated Product appearance inspection methods and apparatus employing low variance filter
EP0601898A2 (en) * 1992-12-11 1994-06-15 Fujitsu Limited Adaptive input/output apparatus using selected sample data according to evaluation quantity
EP1452399A2 (en) * 2003-02-28 2004-09-01 Eaton Corporation System and method for selecting classifier attribute types

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1981000319A1 (en) * 1979-07-12 1981-02-05 Burroughs Corp Multi-font character recognition technique

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1981000319A1 (en) * 1979-07-12 1981-02-05 Burroughs Corp Multi-font character recognition technique

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
IEE Proceedings, Volume 134, part E, No. 5, September 1987, (Stevenage, Herts., GB), C. KIMPAN et al.: "Fine Classification of Printed Thai Character Recognition using the Karhunen-Loeve Expansion", pages 257-264 *
IEEE Transactions on Systems, Man and Cybernetics, Volume SMC-2, No. 4, July 1972, A. SOM et al.: "Sequential Pattern Classifier using Least-Mean-Squareerror Criterion", pages 439-443 *
Proceedings PRIP, 6-8 August 1979, Chicago, Illinois, IEEE, (New York, US), B.V. DASARATHY et al.: "Design of Composite Classifier Systems in Imperfectly Supervised Environments", pages 71-78 *
W.S. MEISEL: "Computer-Oriented Approaches to Pattern Recognition", 1972, Academic Press, (New York, US) pages 162-212 *
W.S. MEISEL: "Computer-Oriented Approaches to Pattern Recognition", 1972, Academic Press, (New York, US), pages 55-83 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0382466A2 (en) * 1989-02-09 1990-08-16 Philip Morris Products Inc. Methods and apparatus for optically determining the acceptability of products
EP0382466A3 (en) * 1989-02-09 1990-10-24 Philip Morris Products Inc. Methods and apparatus for optically determining the acceptability of products
US5046111A (en) * 1989-02-09 1991-09-03 Philip Morris Incorporated Methods and apparatus for optically determining the acceptability of products
US5146510A (en) * 1989-02-09 1992-09-08 Philip Morris Incorporated Methods and apparatus for optically determining the acceptability of products
US5165101A (en) * 1989-02-09 1992-11-17 Philip Morris Incoporated Methods and apparatus for optically determining the acceptability of products
US5189708A (en) * 1989-02-09 1993-02-23 Philip Morris Inc. Methods and apparatus for optically determining the acceptability of products
US5237621A (en) * 1991-08-08 1993-08-17 Philip Morris Incorporated Product appearance inspection methods and apparatus employing low variance filter
US5537670A (en) * 1991-08-08 1996-07-16 Philip Morris Incorporated Product appearance inspection methods and apparatus employing low variance filter
EP0601898A2 (en) * 1992-12-11 1994-06-15 Fujitsu Limited Adaptive input/output apparatus using selected sample data according to evaluation quantity
EP0601898A3 (en) * 1992-12-11 1994-11-23 Fujitsu Ltd Adaptive input/output apparatus using selected sample data according to evaluation quantity.
US5420810A (en) * 1992-12-11 1995-05-30 Fujitsu Limited Adaptive input/output apparatus using selected sample data according to evaluation quantity
EP1452399A2 (en) * 2003-02-28 2004-09-01 Eaton Corporation System and method for selecting classifier attribute types
EP1452399A3 (en) * 2003-02-28 2005-05-18 Eaton Corporation System and method for selecting classifier attribute types

Also Published As

Publication number Publication date
EP0419496A1 (en) 1991-04-03
GB8810357D0 (en) 1988-06-08

Similar Documents

Publication Publication Date Title
US7711157B2 (en) Artificial intelligence systems for identifying objects
US8005767B1 (en) System and method of classifying events
Ullman et al. A fragment-based approach to object representation and classification
KR20170053069A (en) A robust face recognition method for pose variations based on pose estimation
Olson Time and space efficient pose clustering
WO1989010596A1 (en) Data analysis method and apparatus
Douangnoulack et al. Building minimal classification rules for breast cancer diagnosis
Holz et al. Relative feature importance: A classifier-independent approach to feature selection
Mohamed et al. Neural network based techniques for estimating missing data in databases
Cootes et al. Locating objects of varying shape using statistical feature detectors
Güvenir et al. A Genetic Algorithm for Classification by Feature Partitioning.
Hasan et al. A machine learning approach on classifying orthopedic patients based on their biomechanical features
Petrou Learning in pattern recognition
Rosin et al. Combining evolutionary, connectionist, and fuzzy classification algorithms for shape analysis
Villmann et al. Fuzzy labeled self-organizing map with label-adjusted prototypes
Norhikmah et al. Implementation of 2dpca and som algorithms to determine sex according to lip shapes
Alsing et al. Convergence for receiver operating characteristic curves and the performance of neural networks
Galicki et al. Improving generalization capabilities of dynamic neural networks
Marshall et al. Generalization and exclusive allocation of credit in unsupervised category learning
Ma’sum et al. Automatic fetal organs detection and approximation in ultrasound image
Kappen et al. Learning active vision
Törmä Self-organizing neural networks in feature extraction
CN114266918A (en) Feature recognition machine learning optimization method based on soil classification
Feroz et al. Cluster detection in weak lensing surveys
Dias et al. Active Gaze Control for Foveal Scene Exploration

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): GB JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE FR GB IT LU NL SE

WWE Wipo information: entry into national phase

Ref document number: 1989905424

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1989905424

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1989905424

Country of ref document: EP