WO2010023334A1 - Method for reducing the dimensionality of data - Google Patents

Method for reducing the dimensionality of data

Info

Publication number
WO2010023334A1
Authority
WO
WIPO (PCT)
Prior art keywords
input
output
data
coordinates
space
Prior art date
Application number
PCT/ES2009/000383
Other languages
Spanish (es)
French (fr)
Inventor
Pascual Campoy Cervera
Original Assignee
Universidad Politécnica de Madrid
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Universidad Politécnica de Madrid filed Critical Universidad Politécnica de Madrid
Publication of WO2010023334A1 publication Critical patent/WO2010023334A1/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2137 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on criteria of topology preservation, e.g. multidimensional scaling or self-organising maps

Definitions

  • The object of the present invention is an industrial-application procedure for reducing the dimensionality of data, based on the generation of an output map that keeps the distance relation constant with respect to the data represented.
  • This invention relates to reducing the dimension of input data that lie approximately on a hyper-surface within the input space, so that their position can be determined by the non-linear coordinates that parameterize said surface. More specifically, this invention relates to methods that learn the distribution of the input data by generating output maps whose points represent said input data.
  • Multidimensional Scaling (MDS)
  • Isomap
  • Kernel PCA
  • Diffusion Maps
  • Multilayer Autoencoder
  • Locally Linear Embedding (LLE)
  • Self-Organizing Maps (SOM)
  • The procedure described in the present invention differs radically from the previous procedures: the data are entered only once (avoiding iterations over the same data to obtain a good solution), the desired criterion of correspondence between the distances of the reduced data and those of the original data is optimized at all times, and the dimension of the output space is obtained as a consequence of the procedure itself (it is not introduced a priori).
  • The main object of the present invention is to represent data of high dimensionality (in which each input datum is specified by a large set of numerical values) by other data of much smaller dimension (i.e. specified by far fewer values), such that the latter contain all the information necessary to reconstruct the former with a resolution fixed a priori.
  • This method has an initial phase, called learning or training, in which the relationship between any datum in the input space and its corresponding datum in the output space (and vice versa) is established, starting from a set of input data representative of all possible input data that may exist.
  • Once the learning phase is finished, the execution phase allows any datum in the input space to be represented by a datum in the output space, which can be determined by a very small number of numerical values, called the output space dimension.
  • More specifically, the learning phase of the present invention generates an output map in which each point of said map, called a node, has associated the coordinates of a point in the input space; additionally, each point of the map has a relative position with respect to the other points that form it, such that this relative position is determined by the variation of a limited set of values, which represent its relative coordinates and whose number constitutes the dimension of the output map.
  • The learning phase calculates iteratively, with the input data available up to a given moment, the coordinates associated with each node of the output map, verifying at every moment that the ratio between the distance between any two input data and the distance between their two representatives on the output map is kept as constant as possible for every pair of points in the input space, according to the mean-square-error criterion applied over all pairs of points in the input space.
  • This ratio is the only adjustable parameter of the method and is the inverse of the resolution with which the input data are represented and reconstructed. This parameter is fixed a priori and remains constant throughout the procedure.
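As a hedged illustration of this criterion (the patent's exact equation is not reproduced in this text, so the precise form below is an assumption), the mean-square deviation between input-space distances and R times the corresponding output-map distances could be computed as follows:

```python
from math import dist

def distance_ratio_error(inputs, outputs, R):
    """Mean squared deviation, over all pairs, between the input-space
    distance and R times the output-map distance of the representatives.
    A value of 0 means the ratio of distances equals R exactly for
    every pair, i.e. the criterion is perfectly satisfied."""
    n = len(inputs)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d_in = dist(inputs[i], inputs[j])     # distance in input space
            d_out = dist(outputs[i], outputs[j])  # distance on output map
            total += (d_in - R * d_out) ** 2
            pairs += 1
    return total / pairs
```

For three collinear input points spaced 2 apart and represented by map coordinates 0, 1 and 2, the error is zero when R = 2, since every pair of distances is then in the exact ratio R.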
  • During the execution phase, each point of the high-dimensional input space has as its representative on the output map the node whose associated coordinates are closest to its own, so that to indicate which point of the input space is meant, it suffices to indicate the coordinates on the output map of its representative node, which may be relative to the representative of another point in the input space. Therefore, a point of the input space is identified in the execution phase by identifying its representative on the output map, which requires only a limited number of coordinates, thereby reducing the dimensionality necessary to identify any point of the input space.
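A minimal sketch of this lookup, under the assumption that "closest" means Euclidean distance between the input datum and each node's input vector:

```python
from math import dist

def winning_node(x, node_input_vectors):
    """Return the index of the node whose input vector is nearest to x;
    that node is the representative ('winning node') of x on the map."""
    return min(range(len(node_input_vectors)),
               key=lambda m: dist(x, node_input_vectors[m]))
```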
  • The fundamental idea of the method is to ensure that the ratio between the distance on the output map between the representatives of two input data and the distance between said input data remains constant, thereby correctly parameterizing the hyper-surface on which the input data lie by means of a reduced set of coordinates, which represent the position of the nodes of the output map on said hyper-surface.
  • The dimension of the output-space data is obtained by the method itself, as a function of the input data in the learning phase and of the resolution with which the input data are to be represented. This characteristic distinguishes this method from other methods of the current state of the art, in which the dimension of the output data must be introduced a priori, externally to the method, so that said dimension of the output space may be either greater than sufficient or smaller than necessary to represent the input data.
  • Another important novelty of the presented method is the drastic saving of time in the learning phase when establishing the relationship between the input data and their representative data in the output space. This is because the method determines the representative datum of the output space, for a specific input datum in an intermediate iteration of the learning phase, by means of calculations in the output space (of reduced dimension) and not by calculations in the input space (of much greater dimensionality).
  • The proposed method aims to assign values only to a limited set of network nodes, namely those necessary to represent the input data entered in the learning phase, so that the connections between these chosen nodes involve the least possible number of dimensions of the network and, therefore, of coordinates needed to determine their position. This ensures that the position of a node in the network (which in turn represents a datum in the input space) is determined by the minimum possible set of values, which are its coordinates in that network.
  • Any new datum of high dimensionality is approximated by the generated datum closest to it, and this datum is in turn represented by the number of steps needed to reach it along each output dimension, counted either from an arbitrary datum chosen as origin or from the representative of the previous high-dimensional datum.
  • FIGURE 1.- Represents the structure of the input data (fig. 1A) and the output nodes (fig. 1B).
  • Each output node consists of two vectors: the first, called the input vector y_e(m), represents the position of the node in the input space, and the second, called the output vector y_s(m), represents the coordinates of said node on the output map.
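The two-vector node structure of Fig. 1B can be sketched as a simple record (the field names are illustrative, not from the patent):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    input_vector: List[float]   # y_e(m): position in the D-dimensional input space
    output_vector: List[float]  # y_s(m): coordinates on the low-dimensional output map
```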
  • FIGURE 2.- Represents the flow of actions of the learning phase of the method, in which the nodes of the output map are positioned in the input space by updating the input vectors of a series of nodes chosen to represent the input data entered.
  • FIGURE 3.- Graphically represents the effect of the actions described in Figure 2 on the input data and the nodes of the output map.
  • FIGURE 4.- Represents the flow of actions of the execution phase, in which an input datum is represented by a reduced set of coordinates, which may be either those of the output vector of the node that represents it, or the increment of these coordinates with respect to those of the output vector of the node representing the previous input datum.
  • FIGURE 5.- Represents the flow of actions necessary to reconstruct an input datum, once it is identified by a reduced set of coordinates of its representative node on the output map.
  • The method object of the present invention comprises at least the following subprocesses or phases: (i) a first learning phase (10), configured for the generation of an output map, into which data from the input space are successively introduced; from them the method calculates the nodes of the output space that represent them and updates the values of the input vectors of these nodes, placing them in the input space so as to represent a set of said data, as seen in Figures 2 and 3; and (ii) a second execution phase (20a, 20b), intended to use said output map.
  • said second phase in turn comprises two distinct stages:
  • a second reconstruction part (20b), consisting of the reconstruction of an input datum approximately equal to the original, starting from the reduced set of coordinates and from the map of output nodes generated in the first learning phase (10).
  • The first function of the method object of the invention consists in achieving a significant reduction in the dimensionality of the input data, that is, that the coordinates representing an input datum are few; on the other hand, the reconstructed input datum, after being represented by a reduced set of coordinates, must closely resemble the original input datum, that is, the distance between them in the input space must be small.
  • Both objectives are opposed to each other, given that the greater the data reduction, the lower the resolution of the reconstruction, and vice versa; the trade-off is regulated by a single parameter that represents the desired ratio between the distance in the input space between two data and the distance in the output space between the two nodes of the map that represent them.
  • This parameter is called the output map resolution; it is represented by the letter R and corresponds to the resolution with which the input data, represented by their output-map nodes, are reconstructed.
  • This output map resolution therefore indicates the distance chosen a priori between the input data represented by two neighboring nodes of the output map (i.e. separated by a unit in coordinates of the output map).
  • The second execution phase (20a, 20b) - dimensionality reduction (20a) and reconstruction (20b) - can take place after the end of the learning phase (10), so that the latter does not continue even if new input data become available; all the nodes of the output map used both for the dimensionality reduction (20a) of an input datum and for its reconstruction (20b) therefore remain fixed.
  • The execution phase (20a, 20b) can also take place at the same time as the learning phase (10), since at any intermediate step of that phase (10) there exists a structure of the output node map which can, consequently, be used for execution.
  • The first learning phase (10) includes, in turn, the following stages: (i) a first stage of reading a new input datum (11), which has D coordinates and which triggers the execution of the following stages of the first learning phase (10), after which it is decided whether a new input datum is introduced or said learning phase is terminated.
  • The first advantage is that the output node representing each input, which must be calculated in stage (iv), is already known in advance, further reducing its representation error (the distance between the chosen input and its representative) and therefore leading to better fulfilment of the relation of distances between the input space and the representatives in the output space.
  • the set ⁇ is greatly reduced, and therefore so is the total processing time.
  • The new set of chosen input data has only a single new element with respect to the set of the previous iteration, namely the input datum that was introduced as new in said previous iteration. This is the only datum of the set to which the significant computational reduction just explained cannot be applied when calculating its representative node of the output map: for this single point it is necessary to calculate the minimum distance to the input vectors of all the nodes of the output map.
  • A fifth stage of calculating the distances in the output map (15) between the representative nodes calculated in the fourth stage (14) and the new node that is to be created as representative of the new input datum entered, where these distances are all calculated in the output space, that is, they are distances between the output vectors of the nodes of the output map.
  • The calculation of these output distances is done simply by dividing each of the distances calculated in the third stage by the resolution R defined above.
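This division can be sketched directly: the input-space distances from the new datum to the representative nodes, scaled by 1/R, become the distances the new node should keep from those representatives on the output map (a lightweight reading of stages (14) and (15)):

```python
from math import dist

def target_output_distances(x_new, rep_input_vectors, R):
    """For each representative node, the distance from the new input
    datum to that node's input vector, divided by the resolution R,
    gives the distance the new node should have from that
    representative on the output map."""
    return [dist(x_new, v) / R for v in rep_input_vectors]
```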
  • The value of the input vector of the winning node is updated by moving it towards the new input datum, as is traditional in other learning algorithms, proportionally to the distance between the input datum and the input vector of the winning node.
  • the proportionality constant should decrease with the number of iterations to ensure the stability of the input vectors, as is usual in these update rules.
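The update rule described above can be sketched as follows; the 1/(1+t) decay schedule is a hypothetical choice, since the text only requires that the proportionality constant decrease with the number of iterations:

```python
def update_winner(winner_vec, x, t, alpha0=0.5):
    """Move the winning node's input vector toward the new input datum x,
    by a fraction alpha that shrinks as the iteration count t grows."""
    alpha = alpha0 / (1.0 + t)  # decreasing learning rate for stability
    return [w + alpha * (xi - w) for w, xi in zip(winner_vec, x)]
```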
  • The learning process can be continued iteratively (18) by successively introducing new input data, or terminated (19), whereupon the input vectors of the output nodes remain fixed and ready to be used in the execution phase (20a, 20b) for dimensionality reduction (20a).
  • For the learning to be good, the input data entered during the learning phase (10) must be, on the one hand, representative of the input space, that is, scattered throughout all the areas where data may appear; on the other hand, the amount of these data must be sufficiently dense, this density being related to the value of the resolution R: there must be input data every R units of distance in those areas of the input space where input data exist (i.e. on a hyper-surface of that space), it being convenient to have several times (up to an order of magnitude more) this number of input data.
  • If the number of available input data is only of the order of the value indicated in the previous paragraph, they must be entered several times during the learning phase, each complete pass over all available input data being called an epoch. If, on the contrary, the number of available input data is several times higher than the value indicated above, the learning phase can be terminated when all available input data have been entered only once. In practical cases it may happen that said desired number of input data is exceeded in some areas of the input space and not in others, in which case it is advisable to enter the input data of the less dense areas several times, until these values are exceeded a few times over.
  • The nodes that have updated the value of their input vector are those that constitute the so-called output map (Fig. 1B), in which each node is composed of an input vector y_e(m) and an output vector y_s(m), as indicated in Figure 1.
  • The input vectors y_e(1...m) serve to relate each output node to data in the input space and, reciprocally, to relate a set of data in the input space (those that are closer to this input vector than to any other input vector of another node) to this output node.
  • The output vectors y_s(1...m) are used to indicate the node in question, the coordinates being those of the node on the so-called output map (Fig. 1B).
  • The described method selects as the nodes constituting the output map those whose output-vector coordinates are the minimum necessary to identify each node representing the input data, the rest of the coordinates being null. This is achieved through the condition, enforced during the learning phase, that the ratio between the distance between the input vectors of two map nodes and the distance between their respective output vectors be kept as constant as possible for all nodes of the map, according to the mean-square-error criterion calculated in equation (1); this ratio is the so-called output map resolution R.
  • The fact that the output vectors of the map nodes have a high number of null components (or constant components, if the node chosen as representative of the first input datum is not the null vector) means that these coordinates need not be used to identify a specific node of the output map: only the non-zero coordinate values are used, thus achieving the objective of reduced dimensionality in identifying an output node and, therefore, the input datum that it represents.
  • The dimensionality reduction subprocess (20a) is schematized in Figure 4, in which there are two versions, depending on whether the coordinates of the output vector representing the input datum (24) are encoded, or whether the coordinates of the increment of said vector with respect to the output vector representing the previous datum (23) are encoded, the two options being indicated in said figure by a decision branch.
  • Said dimensionality reduction subprocess (20a) comprises, in turn, the following steps: (i) a first step (21) of calculating the output node representing the input datum to be characterized by a reduced-dimension datum, this output node being the one whose input vector is closest to said input datum, called the winning node.
  • A second stage (22) of calculating fractional coordinates with respect to neighbouring nodes, which may optionally be executed depending on the resolution with which the input datum is to be represented. If this resolution is of the order of the value of the parameter R used during the learning phase, it is not necessary to execute this stage, the coordinates of the winning node, which are integer values, being passed directly to the third stage. If, on the contrary, a higher resolution is desired, the coordinates on the output map are calculated as fractional values, in which each value indicates the proximity of the input datum to the winning node relative to the next (or previous) node of the output space in each of its non-zero dimensions.
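One plausible way to obtain such a fractional value (the text does not fix the formula, so this interpolation is an assumption) is to compare the datum's distances to the winner and to the neighbouring node along the dimension in question:

```python
from math import dist

def fractional_offset(x, winner_vec, neighbor_vec):
    """Fraction in [0, 1] locating x between the winner's and a
    neighbouring node's input vectors: 0 means exactly at the winner,
    1 exactly at the neighbour. Added to the winner's integer map
    coordinate, it refines the representation beyond resolution R."""
    d_w = dist(x, winner_vec)
    d_n = dist(x, neighbor_vec)
    return d_w / (d_w + d_n)
```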
  • Either the coordinates of the output vector of the representative node (24) are encoded, or the coordinates of the difference vector between said output vector and the output vector of the representative of the previous input datum (23).
  • The coordinates of both vectors have a high number of null components that need not be indicated, a fact used in the fourth coordinate-coding stage (24) to identify the node representing the input datum entered by means of a number of coordinates smaller than the dimension of the input space.
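Both options can be sketched together: delta-coding a node's output vector against the previous representative (option 23), then keeping only the non-zero components (the (index, value) pair format below is illustrative, not from the patent):

```python
def delta_code(curr, prev):
    """Difference between the current and previous representatives'
    output vectors; successive representatives tend to be close on the
    map, so most components of the difference are zero."""
    return [c - p for c, p in zip(curr, prev)]

def sparse_code(vec):
    """Keep only (index, value) pairs for non-zero components, so far
    fewer numbers than the full vector need to be stored."""
    return [(i, v) for i, v in enumerate(vec) if v != 0]
```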
  • The nodes of the output map, which represent data from the input space and are identified by a reduced number of coordinates as just described, can be introduced into the input-data reconstruction procedure (20b) outlined in Figure 5.
  • This process includes, in turn:
  • A fifth stage (35) interpolates the input vectors corresponding to the output nodes obtained, to calculate an input datum that is very similar to the original input datum represented by the output node introduced into this reconstruction procedure (20b).
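Sketched for a single map dimension (the multi-dimensional case would interpolate along each non-zero coordinate in turn; this is an assumption about how stage (35) operates):

```python
def reconstruct(frac, winner_input_vec, neighbor_input_vec):
    """Linearly interpolate between the input vectors of the winning
    node and its neighbour, according to the fractional coordinate
    frac, to recover an approximation of the original input datum."""
    return [w + frac * (n - w)
            for w, n in zip(winner_input_vec, neighbor_input_vec)]
```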

Abstract

Method for reducing the dimensionality of data which comprises at least the following stages: (i) a learning stage (10) for generating an output map from data relating to the input space, which is made up of nodes, by forming a structure with a size smaller than the input and by updating the values of the input vectors associated with these nodes; and (ii) an execution stage (20a,20b) which is intended to use said output map and in turn comprises two different stages: (a) reducing the size (20a) by representing an input data item using the coordinates of the output vector of the node which represents it; (b) reconstruction (20b) which involves reconstructing an input data item from the reduced set of coordinates which represents it and from the map of output nodes generated in the learning stage (10).

Description

Title: METHOD FOR REDUCING THE DIMENSIONALITY OF DATA

Object of the invention

The object of the present invention is an industrial-application procedure for reducing the dimensionality of data, based on the generation of an output map that keeps the distance relation constant with respect to the data represented.

Field of the invention

This invention relates to reducing the dimension of input data that lie approximately on a hyper-surface within the input space, so that their position can be determined by the non-linear coordinates that parameterize said surface. More specifically, this invention relates to methods that learn the distribution of the input data by generating output maps whose points represent said input data.
Background of the invention

The attempt to parameterize multidimensional data surfaces, so that the data can be represented by a reduced set of parameters, has been the subject of several previous works. The most classic and oldest of these is principal component analysis and its extension, linear discriminant analysis, whose great limitation is that it works correctly only when the data lie on a linear manifold (hyperplane), which makes it useless in a great many practical applications. There have subsequently been several works to parameterize the surface of the data (that is, for dimensionality reduction) when the data lie on a non-linear manifold, among which the following methods stand out, on which a majority of the methods currently employed can be considered to be based: Multidimensional Scaling (MDS), Isomap, Kernel PCA, Diffusion Maps, the Multilayer Autoencoder, Locally Linear Embedding (LLE) and Self-Organizing Maps (SOM). In the latter, the selection criterion is based on the distance in the input space, which requires updating the weights of the neighbouring units and leads to a long iterative process with invalid intermediate solutions. All these systems need the dimension of the non-linear manifold to be parameterized to be introduced as an input, which is a great inconvenience in practice and has to be solved beforehand by using other techniques that estimate the intrinsic dimensionality of the data (that is, of the manifold on which they lie), the two procedures being decoupled and unrelated.

The procedure described in the present invention differs radically from the previous procedures: the data are entered only once (avoiding iterations over the same data to obtain a good solution), the desired criterion of correspondence between the distances of the reduced data and those of the original data is optimized at all times, and the dimension of the output space is obtained as a consequence of the procedure itself (it is not introduced a priori).
Descripción de la invenciónDescription of the invention
La presente invención tiene como objeto principal representar datos con elevada dimensionalidad (en los que cada dato del entrada está especificado mediante un conjunto elevado de valores numéricos) mediante otros datos de dimensión mucho más reducida (i.e. especificado por muchos menos valores) de manera que estos últimos contienen toda la información necesaria para reconstruir los primeros con una resolución fijada a priori.The present invention has as its main object to represent data with high dimensionality (in which each input data is specified by a large set of numerical values) by other data of much smaller size (ie specified by much less values) so that these The latter contain all the information necessary to reconstruct the former with a resolution set a priori.
Este método tiene una fase inicial, denominada de aprendizaje o entrenamiento, en la que se establece la relación entre cualquier dato en el espacio de entrada y su correspondiente dato en el espacio de salida (y viceversa), partiendo para ello de un conjunto de datos de entrada representativos de todos los posibles datos de entrada que puedan existir. Una vez acabada la fase de aprendizaje, la fase de ejecución permite que cualquier dato en el espacio de entrada venga representado por un dato en el espacio de salida, que puede determinarse por un número muy reducido de valores numéricos, denominado dimensión del espacio de salida.This method has an initial phase, called learning or training, in which the relationship between any data in the input space and its corresponding data in the output space (and vice versa) is established, starting from a set of data input representative of all possible input data that may exist. Once the learning phase is finished, the execution phase allows any data in the input space to be represented by a data in the data space. output, which can be determined by a very small number of numerical values, called the output space dimension.
Más concretamente, la fase de aprendizaje de la presente invención genera un mapa de salida en el que cada punto de dicho mapa, denominado nodo, tiene asociado las coordenadas de un punto del espacio de entrada y adicionalmente cada punto del mapa tiene una posición relativa respecto a los otros puntos que lo forman, de maneara que esta posición relativa está determinada por la variación de un conjunto limitado de valores, que representan sus coordenadas relativas y cuyo número constituye la dimensión del mapa de salida.More specifically, the learning phase of the present invention generates an output map in which each point of said map, called a node, has associated the coordinates of a point in the input space and additionally each point of the map has a relative position with respect to to the other points that form it, in a way that this relative position is determined by the variation of a limited set of values, which represent their relative coordinates and whose number constitutes the dimension of the output map.
La fase de aprendizaje calcula de una forma iterativa, con los datos de entrada disponibles hasta un momento dado, las coordenadas asociadas a cada nodo del mapa de salida, de manera que se verifica en cada momento que la relación entre la distancia entre dos datos de entrada y la distancia entre sus dos representantes en el mapa de salida se mantiene lo más constante posible para cualquier pareja de puntos del espacio de entrada, según el criterio de error cuadrático medio aplicado sobre todas las parejas de puntos del espacio de entrada. Esta relación el único parámetro ajustable del método, que es el inverso de la resolución con la que representan y reconstruyen los datos de entrada. Este parámetro es fijado a priori y permanece constante durante todo el procedimiento.The learning phase calculates in an iterative way, with the input data available up to a given moment, the coordinates associated with each node of the output map, so that the relationship between the distance between two data from each other is verified at each moment input and the distance between its two representatives on the output map is kept as constant as possible for any pair of points in the input space, according to the criteria of the mean square error applied to all pairs of points in the input space. This relationship is the only adjustable parameter of the method, which is the inverse of the resolution with which they represent and reconstruct the input data. This parameter is set a priori and remains constant throughout the procedure.
Durante la fase de ejecución, cada punto del espacio de entrada, de dimensionalidad elevada, tiene como representante en el mapa de salida aquel nodo que tiene asociado las coordenadas más cercanas a las suyas, de manera que para indicar de qué punto del espacio de entrada se trata, basta con indicar las coordenadas en el mapa de salida de su nodo representante, pudiendo ser éstas relativas respecto al representante de otro punto del espacio de entrada. Por tanto, la identificación de un punto del espacio de entrada en la fase de ejecución se realiza mediante la identificación de su representante en el mapa de salida, lo que requiere sólo de un número limitado de coordenadas, reduciendo con ello la dimensionalidad necesaria para identificar cualquier punto del espacio de entrada. La idea fundamental del método, por tanto, es conseguir que la relación entre la distancia en el mapa de salida entre los representantes de dos datos de entrada y la distancia entre dichos datos de entrada permanezca constante, consiguiendo con ello parametrizar correctamente la hiper-superficie sobre la que se encuentran los datos de entrada mediante un conjunto reducido de coordenadas, que representan la posición de los nodos del mapa de salida sobre dicha hiper-superficie.During the execution phase, each point of the input space, of high dimensionality, has as a representative on the output map that node that has associated the closest coordinates to its own, so that to indicate from which point of the input space It is enough to indicate the coordinates on the output map of your representative node, which may be relative to the representative of another point in the input space. Therefore, the identification of a point of the entry space in the execution phase is done by identifying its representative on the output map, which requires only a limited number of coordinates, thereby reducing the dimensionality necessary to identify Any point of the input space. 
The fundamental idea of the method, therefore, is to ensure that the ratio between the distance on the output map between the representatives of two input data and the distance between said input data remains constant, thereby correctly parameterizing the hyper-surface on which the input data lie by means of a reduced set of coordinates, which represent the positions of the output-map nodes on said hyper-surface.
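As an illustration, the execution-phase lookup just described can be sketched as follows. This is a minimal sketch, not the patent's implementation: the node layout (a list of dictionaries with keys `y_e` and `y_s`), the function names, and the toy values are all assumptions.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length coordinate lists."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def represent(x, nodes):
    """Dimension reduction: return the output-map coordinates y_s of the
    node whose input vector y_e is closest to the input datum x."""
    winner = min(nodes, key=lambda n: euclid(x, n["y_e"]))
    return winner["y_s"]

# Toy map: three nodes of a 1-D output map embedded in a 3-D input space.
nodes = [
    {"y_e": [0.0, 0.0, 0.0], "y_s": [0]},
    {"y_e": [1.0, 1.0, 0.0], "y_s": [1]},
    {"y_e": [2.0, 2.0, 0.0], "y_s": [2]},
]
print(represent([0.9, 1.1, 0.1], nodes))  # nearest node is the middle one
```

A 3-D datum is thus reported with a single coordinate, at the cost of the quantization error between the datum and the winning node's input vector.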
An important novelty of the presented method is that the dimension of the output-space data is obtained by the method itself, as a function of the input data in the learning phase and of the resolution with which the input data are to be represented. This characteristic distinguishes this method from other methods of the current state of the art, in which the dimension of the output data must be supplied a priori, externally to the method, so that said output-space dimension may turn out to be either greater than sufficient or smaller than necessary to represent the input data.
Another important novelty of the presented method is the drastic saving of time in the learning phase when establishing the relationship between the input data and their representative data in the output space. This is because the method determines the representative datum of the output space, for a given input datum at an intermediate iteration of the learning phase, by means of calculations in the output space (of reduced dimension) and not by means of calculations in the input space (of much greater dimensionality).
It is initially assumed that an indefinitely large number of nodes is available and that these are connected to one another forming a network structure of the same dimension as the input space, so that the position of each node is determined by its coordinates in this network. In the learning phase the proposed method has as its final objective to assign values only to a limited set of network nodes, those necessary to represent the input data introduced during the learning phase, in such a way that the connections among these chosen nodes involve the smallest possible number of network dimensions, and therefore of coordinates needed to determine their position. This guarantees that the position of a network node (which in turn represents a datum in the input space) is determined by the minimum possible set of values, namely its coordinates in that network.
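The lazily populated network of nodes described above might be represented as follows. The patent does not prescribe any data layout; the class, field names, and values here are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class Node:
    y_s: Tuple[int, ...]                # coordinates on the output grid
    y_e: Optional[List[float]] = None   # position in the input space;
                                        # assigned only once the node wins

# The conceptually unlimited network is realised lazily: only the nodes
# actually needed to represent input data are ever instantiated, so the
# number of grid dimensions used stays as small as the data allow.
grid: Dict[Tuple[int, ...], Node] = {}
grid[(0, 0)] = Node(y_s=(0, 0), y_e=[0.5, 0.5, 0.5])
grid[(0, 1)] = Node(y_s=(0, 1), y_e=[0.5, 0.5, 0.9])
print(len(grid))  # only two of the potentially unlimited nodes exist
```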
Once this connection structure has been generated, any new high-dimensional datum is approximated by the generated datum closest to it, and this datum is in turn represented by the number of data needed to reach it along each output dimension, counted either from an arbitrary datum chosen as origin or from the representative of the previous high-dimensional datum.
Brief description of the figures
A series of drawings that help to better understand the invention, and that expressly relate to an embodiment of said invention presented as a non-limiting example thereof, is described very briefly below.
FIGURE 1.- Represents the structure of the input data (fig. 1A) and of the output nodes (fig. 1B). Each output node consists of two vectors: the first, called the input vector y.e(m), represents the position of the node in the input space; the second, called the output vector y.s(m), represents the coordinates of said node on the output map.
FIGURE 2.- Represents the flow of actions of the learning phase of the method, in which the nodes of the output map are positioned in the input space by updating the input vectors of a series of nodes chosen to represent the introduced input data.
FIGURE 3.- Graphically represents the effect of the actions described in figure 2 on the input data and the nodes of the output map.
FIGURE 4.- Represents the flow of actions of the execution phase, in which an input datum is represented by a reduced set of coordinates, which may alternatively be those of the output vector of the node that represents it, or the increment of these coordinates with respect to those of the output vector of the node representing the previous input datum.
FIGURE 5.- Represents the flow of actions necessary to reconstruct an input datum once it is identified by a reduced set of coordinates of the node of the output map that represents it.
Preferred embodiment of the invention
The method object of the present invention comprises at least the following sub-processes or phases: (i) a first learning phase (10), configured for the generation of an output map, in which data are successively introduced into the input space and, from them, the method computes the output-space nodes that represent them and updates the values of the input vectors of these nodes, which place them in the input space so as to represent a set of said data, as shown in figures 2 and 3; and
(ii) a second execution phase (20a, 20b), configured for the use of the output map generated in the first phase (10), in such a way that the dimension of the input data is reduced and they are subsequently reconstructed from the reduced-dimension data; this second phase in turn comprises two distinct stages:
(a) a first dimension-reduction part (20a), consisting of the representation of an input datum, which has a high number of coordinates D, by another datum with a reduced number of coordinates, corresponding to the output vector of the node that represents it, or to the increment with respect to the node representing the previous input;
(b) a second reconstruction part (20b), consisting of the reconstruction of an input datum approximately equal to the original, starting from the reduced set of coordinates and from the map of output nodes generated in the first learning phase (10).
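The reconstruction part (20b) amounts to a lookup in the opposite direction: from the reduced coordinates back to the stored input vector. A minimal sketch, with node structure and values assumed, not taken from the patent:

```python
def reconstruct(coords, nodes):
    """Reconstruction: given the reduced coordinates of a representative
    node, return its stored input vector y_e as the approximate datum."""
    for n in nodes:
        if n["y_s"] == coords:
            return n["y_e"]
    raise KeyError("no node with those output-map coordinates")

nodes = [
    {"y_e": [1.0, 2.0, 3.0], "y_s": [0]},
    {"y_e": [4.0, 5.0, 6.0], "y_s": [1]},
]
print(reconstruct([1], nodes))  # -> [4.0, 5.0, 6.0]
```

The reconstructed datum is only approximately equal to the original: the error is bounded by how densely the nodes cover the hyper-surface, i.e. by the resolution R discussed next.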
The first function of the method object of the invention is to achieve a significant reduction of the dimensionality of the input data, that is, that the coordinates representing an input datum be few and, on the other hand, that the reconstructed input datum, after being represented by a reduced set of coordinates, closely resemble the original input datum, that is, that the distance between them in the input space be small.
Both objectives are opposed to each other, given that the greater the data reduction, the lower the resolution of the reconstruction and vice versa; they are regulated by a single parameter representing the desired ratio between the distance in the input space between two data and the distance in the output space between the two map nodes that represent them. This parameter is called the resolution of the output map, is denoted by the letter R, and therefore represents the resolution with which the input data represented by their output-map nodes are reconstructed. This output-map resolution thus indicates the distance, chosen a priori, between the input data represented by two neighboring nodes of the output map (i.e. separated by one unit in output-map coordinates).
The second execution phase (20a, 20b), comprising dimensionality reduction (20a) and reconstruction (20b), can take place once the learning phase (10) has finished, in which case the latter does not continue even if new input data become available, so that all the nodes of the output map used both for the dimensionality reduction (20a) of an input datum and for its reconstruction (20b) remain fixed. Alternatively, the execution phase (20a, 20b) can also take place at the same time as the learning phase (10), since at any intermediate step of that phase (10) there exists a structure of the output-node map which can, consequently, be used for execution. In this latter case it must be borne in mind that the output map will change in the future, becoming ever better at reducing the dimension of the input datum (fewer output-node coordinates) and, above all, better at reconstructing the input datum (closer to its original value).
The actions that take place during the learning phase (10) are indicated in figure 2, and the graphic representation of these actions on the input data and the output nodes is schematized in figure 3.
Thus, as shown in figure 2, the first learning phase (10) in turn comprises the following stages: (i) A first stage of reading a new input datum (11), which has D coordinates, and which triggers the execution of the following stages of the first learning phase (10), after which it is decided whether a new input datum is introduced or said learning phase is terminated.
(ii) A second stage of selecting a set Ω of inputs (12), introduced earlier, where the most suitable set depends on the sequence in which the input data are introduced, as well as on the extent of the area of the input space in which the distance ratio R is to be fulfilled. If the sequence is random and said distance ratio R is to be fulfilled over the entire input space, any sufficiently large set of previous inputs is valid, and in this case the set formed by the P previous inputs introduced in the learning phase can be taken. The higher the number of inputs chosen, the better the method works, in the sense that the output-map nodes converge to their desired values with fewer input data; on the other hand, the method has to perform more operations for each new input datum and is therefore slower. In practice there is a value P of the number of inputs that should not be exceeded, since beyond it the improvement in convergence is almost negligible compared with the computational effort required; this practical value is of the order of ten raised to the dimension of the input space. As a variant of this choice of the set Ω of inputs, the set can be formed, instead of by actual inputs introduced earlier, by the input vectors associated with output nodes that have already been winners in previous iterations of the procedure. This choice of the input vectors of the set Ω has two important advantages. The first is that the output node representing each input, which must be computed in stage (iv), is already known in advance, additionally reducing its representation error (the distance between the chosen input and its representative) and therefore leading to a better fulfillment of the distance ratio between the input space and its representatives in the output space. The second is that the set Ω is greatly reduced, and with it the total processing time.
(iii) A third stage of calculating the distances in the input space (13) between all the data of the previously chosen set Ω and the newly introduced input datum.
(iv) A fourth stage of calculating the output-space nodes (14) that represent each of the inputs of the set Ω selected in the second stage (12), the node representing a given input datum being the one whose input vector is closest to the input datum it represents. To calculate these representative nodes of the output map it must be taken into account that, at each iteration, only a limited set of nodes exists on the output map, namely those whose input vectors have been assigned values in previous iterations. To save computational time in this fourth action it is important to note that the great majority of the input data of the set chosen in this iteration were also chosen in the previous iteration, and therefore their representative nodes were already calculated on the output map of the previous iteration; that map in turn differs from the current one in only one node, the single node updated in the previous iteration, called the previous winner. By this reasoning, for all the input data already chosen in the previous iteration it is only necessary to check whether the previous winning node is closer than the previous representative of said input datum, updating its representative to said winning node only if its distance to it is smaller. In the usual case in which the new chosen set of input data has only a single new element with respect to the set of the previous iteration, namely the input datum introduced as new in that previous iteration, this is the only datum of the set for which the important computational reduction just explained cannot be applied; for this single point it is necessary to compute the minimum distance to the input vectors of all the nodes of the output map.
This process is greatly simplified when the set Ω has been chosen as the inputs associated with output nodes that have won in previous iterations of the process, since in that case the output node associated with each chosen input is already known.
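The fourth-stage shortcut can be sketched as follows. The helper names and the data layout (a dict mapping node keys to input vectors) are assumptions for illustration; only the logic, comparing each stored representative against the single previous winner, comes from the text above.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length coordinate lists."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def refresh_representatives(omega, reps, prev_winner, nodes):
    """For inputs already present in the previous iteration, only check
    whether the single node updated last time (the previous winner) is
    now closer than each input's stored representative."""
    for i, x in enumerate(omega):
        if euclid(x, nodes[prev_winner]) < euclid(x, nodes[reps[i]]):
            reps[i] = prev_winner
    return reps

# nodes: key -> input vector y_e of that output node
nodes = {"a": [0.0, 0.0], "b": [2.0, 0.0], "w": [0.9, 0.0]}
omega = [[1.0, 0.0], [2.1, 0.0]]     # inputs already seen last iteration
reps = ["a", "b"]                    # their representatives last time
print(refresh_representatives(omega, reps, "w", nodes))
```

Each already-seen input thus costs two distance evaluations instead of one per existing node, which is where the claimed time saving comes from.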
(v) A fifth stage of calculating the distances on the output map (15) between the representative nodes calculated in the fourth stage (14) and the new node to be determined as representative of the newly introduced input datum, where these distances are all calculated in the output space, that is, they are distances between the output vectors of the nodes of the output map. These output distances are obtained simply by dividing each of the distances calculated in the third action by the resolution R defined above.
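In code, this fifth stage is a single division by the resolution (a sketch; the function name is assumed):

```python
def desired_output_distances(input_distances, R):
    """Convert input-space distances (third stage) into the target
    output-map distances by dividing by the resolution R."""
    return [d / R for d in input_distances]

print(desired_output_distances([1.0, 2.0, 3.0], 0.5))  # -> [2.0, 4.0, 6.0]
```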
(vi) A sixth stage of calculating the coordinates of the output vector of the node (16) that represents the new input datum introduced in this iteration, called the winning node. This output vector is calculated as the one located as closely as possible at the distances, calculated in the fifth stage (15), from each of the representative nodes, calculated in the fourth stage (14), of the previous input data chosen in the second stage (12). To calculate this position of the output vector, the difference between the distance to each representative node and the desired distance to said node, obtained in the fifth stage, is evaluated, finally choosing as output vector the one whose differences over all representative nodes are smallest (usually evaluated as the sum of the squares of said distance differences). The mathematical expression for this calculation is the following:

    y.s(g) = argmin over y.s of  Σ_{i ∈ Ω} ( ‖ y.s − y.s(g_i) ‖ − d.e(i) / R )²
where d_e(i) are the distances between the input datum just introduced and the previous inputs included in the set Ω chosen in the second stage (12), y.s(g_i) are the output vectors of the nodes representing those previous inputs, and y.s(g) are the coordinates of the output vector of the winning node for the current input datum. (vii) A seventh stage (17) of computing the input vector of the winning node calculated in the sixth stage (16). This input vector must be placed near the new input datum introduced in this iteration, so that it represents it; that is, so that if, in the execution phase, the same input datum is introduced again, the output node that represents it (the one whose input vector is closest) is this same node whose input vector is now being updated. It should be borne in mind that updates to the input vectors of this and other output nodes in later iterations of the learning phase may cause the output node representing this input datum to end up being a node other than its current representative. To update the input vector, it is taken into account whether this is the first time this output node has been a winner (that is, the first time its input vector is computed), or whether, on the contrary, its input vector has already been computed in some earlier iteration of this learning phase. In the first case, the input vector of the winning node is assigned the value of the new input datum introduced in this iteration, thereby ensuring its representativeness. In the second case, the input vector of the winning node is updated by moving it toward the new input datum, as is traditional in other learning algorithms, in proportion to the distance between the input datum and the input vector of the winning node.
The proportionality constant must decrease with the number of iterations to ensure the stability of the input vectors, as is usual in such update rules.
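As an illustration of this two-case update rule, a minimal Python sketch follows. The function name, the dictionary representation of the map, and the concrete decay schedule alpha0 / (1 + iteration) are assumptions made for the example; the patent only requires a proportionality constant that decreases with the iteration count.

```python
import numpy as np

def update_winner_input_vector(input_vectors, winner, x, iteration, alpha0=0.5):
    """Update the winning node's input vector for one learning iteration.

    input_vectors maps a node id to its input vector, or to None if the
    node has never been a winner. First win: adopt the datum exactly.
    Later wins: move toward the datum with a decaying learning rate.
    """
    x = np.asarray(x, dtype=float)
    if input_vectors.get(winner) is None:
        # First time this node wins: copy the datum, ensuring representativeness.
        input_vectors[winner] = x.copy()
    else:
        # Decreasing proportionality constant ensures stability (assumed schedule).
        alpha = alpha0 / (1.0 + iteration)
        input_vectors[winner] += alpha * (x - input_vectors[winner])
    return input_vectors[winner]
```

A first win assigns the datum verbatim; subsequent wins shift the stored vector a decreasing fraction of the way toward each new datum.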
Once the seventh stage (17) is finished, the state of the output map has been updated in the sense of having learned the new input datum introduced in this iteration; in fact, only the input vector of a single node of the map, the winning node for this input datum, has been updated. At this point the learning process can continue iteratively (18) through the successive introduction of new input data, or the learning process (10) can be terminated (19), whereupon the input vectors of the output nodes remain fixed and ready to be used in the execution phase (20a, 20b) for dimensionality reduction (20a).
For the learning phase 10 to be good (that is, for the generated output map to allow a good reduction of the dimension of the input data and a good reconstruction of them), the input data introduced during it must, on the one hand, be representative of the input space, that is, be scattered over all the regions of it where data may appear; on the other hand, the amount of data must be sufficiently dense. This data density is related to the value of the resolution R, in that there should be input data every R units of distance in those regions of the input space where input data occur (that is, on a hyper-surface of it), and it is convenient to have several times (up to an order of magnitude more) this number of input data.
If the number of available input data is only of the order of the value indicated in the preceding paragraph, the data must be introduced several times during the learning phase; each complete pass over all the available input data is called an epoch. If, on the contrary, the number of available input data is several times greater than that value, the learning phase can be considered finished once all the available input data have been introduced a single time. In practical cases it may happen that the desired number of input data is exceeded in some regions of the input space and not in others, in which case it is advisable to introduce the input data of the less dense regions several times, until the desired value is exceeded in them a few times over.
Once the learning phase 10 is finished, the nodes that have updated the value of their input vector are those that constitute the so-called output map (Fig. 1B), in which each node is composed of an input vector y.e(m) and an output vector y.s(m), as indicated in Figure 1. The input vectors y.e(1...m) serve to relate each output node to a datum in the input space and, conversely, to relate a set of data in the input space (those that are closer to this input vector than to the input vector of any other node) to this output node. The output vectors y.s(1...m) serve to identify the node in question, their values being the coordinates of the node in the so-called output map (Fig. 1B). Once the learning phase 10 has been executed, the described method has selected, as the nodes constituting the output map, those whose output-vector coordinates are the minimum necessary to identify each node representing the input data, the remaining coordinates being null. This follows from the condition, enforced during the learning phase, that the ratio between the distance between the input vectors of two nodes of the map and the distance between their respective output vectors remains as constant as possible for all the nodes of the map, according to the mean-square-error criterion computed in equation (1); this ratio is the so-called resolution R of the output map.
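In execution, locating the node that represents a datum is a nearest-neighbour search over the input vectors y.e of the map nodes. A minimal sketch, assuming the map is held as a list of (input vector, output coordinates) pairs; this representation and the function name are chosen for the example, not prescribed by the patent:

```python
import numpy as np

def winner_node(x, nodes):
    """Return the output coordinates y.s of the node whose input vector
    y.e is closest to the datum x. `nodes` is a list of (y_e, y_s) pairs."""
    x = np.asarray(x, dtype=float)
    best = min(nodes, key=lambda n: np.linalg.norm(x - np.asarray(n[0], dtype=float)))
    return best[1]
```

The same lookup is the first stage (21) of the dimensionality-reduction subprocess described below.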
The fact that the output vectors of the map nodes have a large number of null components (or constant components, if the node chosen to represent the first input datum is not the null vector) means that these coordinates need not be used to identify a specific node of the output map; only the non-zero coordinate values are used for that purpose, thereby achieving the objective of dimensionality reduction when identifying an output node and hence the input datum it represents.
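The saving described here amounts to sparse coding of the output vector: only the non-zero (index, value) pairs need to be stored or transmitted. A hypothetical sketch (helper names are invented for the example):

```python
def encode_sparse(y_s):
    """Keep only the non-zero coordinates as (index, value) pairs; the
    null coordinates shared by every node need not be indicated."""
    return [(i, v) for i, v in enumerate(y_s) if v != 0]

def decode_sparse(pairs, dim):
    """Restore the full output vector by re-inserting the null coordinates."""
    y_s = [0] * dim
    for i, v in pairs:
        y_s[i] = v
    return y_s
```

For an output vector with many null components, the pair list is much shorter than the full vector, which is the dimensionality-reduction effect the text describes.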
The dimensionality-reduction procedure 20a is outlined in Figure 4, of which there are two versions, depending on whether the coordinates of the output vector representing the input datum are coded (24), or whether the coordinates of the increment of that vector with respect to the output vector representing the previous datum are coded (23); the two versions are indicated in the figure by the alternative "relative position" (25). In many applications in which the sequence of input data is not random but follows a certain order, the second option is better for reducing the dimension of the output vector.
This dimensionality-reduction subprocess (20a) comprises, in turn, the following stages: (i) A first stage (21) of computing the output node that represents the input datum to be characterized by a reduced-dimension datum; this output node is the one whose input vector is closest to the input datum, and it is called the winning node.
(ii) A second stage (22) of computing the fractional coordinates with respect to the neighbouring nodes, which may optionally be executed depending on the resolution with which the input datum is to be represented. If this resolution is of the order of the value of the parameter R used during the learning phase, it is not necessary to execute this action, and the coordinates of the winning node, which are integer values, are passed directly to the third action. If, on the contrary, a higher resolution is desired, the coordinates in the output map are computed as fractional values, in which each value indicates the closeness of the input datum to the winning node relative to the next (or previous) node of the output space in each of its non-zero dimensions.
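One way to obtain such a fractional coordinate, given here only as an illustrative sketch since the patent leaves the exact interpolation open, is to project the datum onto the segment joining the winner's input vector to a neighbouring node's input vector along one output axis; the function name and the clamping to [0, 1] are assumptions:

```python
import numpy as np

def fractional_coordinate(x, e_winner, e_next, s_winner, axis):
    """Refine the winner's integer coordinate on one output axis.

    Projects x onto the segment from the winner's input vector e_winner to
    the neighbour's input vector e_next; the projection fraction t in [0, 1]
    is added to the winner's coordinate on that axis."""
    x, a, b = (np.asarray(v, dtype=float) for v in (x, e_winner, e_next))
    seg = b - a
    t = float(np.dot(x - a, seg) / np.dot(seg, seg))
    t = min(max(t, 0.0), 1.0)  # keep the refinement between the two nodes
    coords = list(s_winner)
    coords[axis] += t
    return coords
```

A datum one quarter of the way from the winner toward its neighbour thus receives the winner's coordinate plus 0.25 on that axis.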
Next there are two possibilities, depending on whether the coordinates of the output vector of the representative node are coded (24) or the coordinates of the difference vector between that output vector and the output vector of the representative of the previous input datum are coded (23). In both cases the vectors have a large number of null coordinates that need not be indicated, which is exploited in the fourth coordinate-coding stage (24) to identify the node representing the introduced input datum by means of a number of coordinates smaller than the dimension of the input space.
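The "relative position" variant can be sketched as delta coding of the output coordinates; for ordered input sequences most components of the difference are zero, so the sparse pair list shrinks further. Function names are invented for the example:

```python
def encode_delta(y_s_current, y_s_previous):
    """Code the increment with respect to the previous datum's
    representative, keeping only non-zero components."""
    diff = [c - p for c, p in zip(y_s_current, y_s_previous)]
    return [(i, v) for i, v in enumerate(diff) if v != 0]

def decode_delta(pairs, y_s_previous):
    """Recover the absolute output coordinates from the increment."""
    y_s = list(y_s_previous)
    for i, v in pairs:
        y_s[i] += v
    return y_s
```

When consecutive data map to nearby nodes, only the one or two axes that changed need to be transmitted.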
The nodes of the output map, which represent data of the input space and are identified by a reduced number of coordinates as just described, can be fed into the input-datum reconstruction procedure (20b) outlined in Figure 5. That process comprises, in turn:
(i) A first stage (31) of decoding the coordinates of the output vector, adding the necessary null coordinates. The absolute coordinates of the vector in the output map are then computed, executing if necessary the second stage (32), which adds the output vector corresponding to the previous input datum. (ii) A second stage (32) of computing the output vector of the representative node.
(iii) A third stage (33) of computing the integer-coordinate output vectors that surround the output vector computed above. (iv) A fourth stage (34) of identifying the output nodes corresponding to the computed integer-coordinate output vectors.
(v) Finally, a fifth stage (35) interpolates the input vectors corresponding to the output nodes obtained, thereby computing an input datum that is very similar to the original input datum represented by the output node introduced into this Reconstruction procedure (20b).
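Stages (iii) through (v) can be sketched together as multilinear interpolation between the surrounding integer nodes. This is an assumption made for the example: the patent specifies only that the surrounding input vectors are interpolated, and the `nodes` mapping and function name are invented here.

```python
import itertools
import numpy as np

def reconstruct(y_s, nodes):
    """Reconstruct an approximate input datum from (possibly fractional)
    output coordinates y_s. `nodes` maps an integer coordinate tuple to
    that node's input vector; corners absent from the map are skipped."""
    y_s = np.asarray(y_s, dtype=float)
    lo = np.floor(y_s).astype(int)      # surrounding integer coordinates
    frac = y_s - lo                     # interpolation weights per axis
    out = None
    for corner in itertools.product([0, 1], repeat=len(y_s)):
        key = tuple(int(c) for c in lo + np.array(corner))
        if key not in nodes:
            continue
        # Multilinear weight: product of per-axis fractions.
        w = float(np.prod([f if c else 1.0 - f for c, f in zip(corner, frac)]))
        v = w * np.asarray(nodes[key], dtype=float)
        out = v if out is None else out + v
    return out
```

With one output dimension and a fractional coordinate of 0.25, the reconstruction is one quarter of the way from the first node's input vector to the second's, mirroring the fractional coding in the reduction phase.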

Claims

1. Method for reducing the dimensionality of data, characterized in that it comprises at least the following subprocesses or phases: (i) a first learning phase (10), configured to generate an output map, in which data are successively introduced into the input space and, from them, the method computes the nodes of the output space that represent them and updates the values of the input vectors of those nodes, which place them in the input space so as to represent a set of those data; and (ii) a second execution phase (20a, 20b), configured to use the output map generated in the first phase (10) so as to reduce the dimension of the input data, and for their subsequent reconstruction from the reduced-dimension data; this second phase in turn comprises two distinct stages: (a) a first dimension-reduction part (20a), consisting in representing an input datum, which has a large number of coordinates D, by another datum with a reduced number of coordinates, corresponding to the output vector of the node that represents it or to the increment with respect to the node representing the previous input; (b) a second reconstruction part (20b), consisting in reconstructing an input datum approximately equal to the original, starting from the reduced set of coordinates and from the map of output nodes generated in the first learning phase (10); where, furthermore, said method is configured so that the output map has a fixed dimensional structure, in the sense that each node of this map is identified by its coordinates in that dimensional space, each node of the output map having associated with it a set of updatable values that represent the coordinates of a point in the dimensional input space, so that each point of the input space is represented by the node of the output map whose associated coordinates are closest to those of the point of the input space in question; in that only the values of a single point of the output space are updated, chosen according to a distance criterion in that output space; and in that the values of the points of the output map are updated iteratively for each introduced point of the input space, so that the representatives of the previously introduced points are those whose values are closest to each of those points at the current moment, at which the last point of the input space has been introduced.
2. Method according to claim 1, characterized in that, once the point of the output map that updates its values has been chosen and its values updated, the process proceeds iteratively with successive inputs.
3. Method according to any of the preceding claims, characterized in that the values associated with the points of the output map are such that the distances between two points of the input space and between their representatives in the output map keep a predefined fixed ratio, the inverse of the resolution, as closely as possible according to the mean-square-error criterion defined by the expression

y.s(g) = argmin_j Σ_{i∈Ω} ( ‖y.s(j) − y.s(g_i)‖ − d_e(i)/R )²

where d_e(i) are the distances between the introduced input datum and the previous inputs included in a set Ω chosen in the first learning phase (10), y.s(g_i) are the output vectors of the nodes representing those previous inputs, and y.s(g) are the coordinates of the output vector of the winning node for the current input datum.
4. Method according to any of claims 1-3, characterized in that the input datum is represented by fractional coordinates in the output map, computed from the closeness to the vectors associated with several nodes of said map.
PCT/ES2009/000383 2008-08-29 2009-07-20 Method for reducing the dimensionality of data WO2010023334A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ES200802521 2008-08-29
ESP200802521 2008-08-29

Publications (1)

Publication Number Publication Date
WO2010023334A1 true WO2010023334A1 (en) 2010-03-04

Family

ID=41720865

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/ES2009/000383 WO2010023334A1 (en) 2008-08-29 2009-07-20 Method for reducing the dimensionality of data

Country Status (1)

Country Link
WO (1) WO2010023334A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001071624A1 (en) * 2000-03-22 2001-09-27 3-Dimensional Pharmaceuticals, Inc. System, method, and computer program product for representing object relationships in a multidimensional space
US6526168B1 (en) * 1998-03-19 2003-02-25 The Regents Of The University Of California Visual neural classifier
WO2003107120A2 (en) * 2002-06-13 2003-12-24 3-Dimensional Pharmaceuticals, Inc. Methods, systems, and computer program products for representing object relationships in a multidimensional space
US20040022445A1 (en) * 2002-07-30 2004-02-05 International Business Machines Corporation Methods and apparatus for reduction of high dimensional data
US20040078351A1 (en) * 2000-12-12 2004-04-22 Pascual-Marqui Roberto Domingo Non-linear data mapping and dimensionality reduction system



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09809358

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09809358

Country of ref document: EP

Kind code of ref document: A1