WO2010023334A1 - Method for reducing the dimensionality of data - Google Patents

Method for reducing the dimensionality of data

Info

Publication number
WO2010023334A1
Authority
WO
WIPO (PCT)
Prior art keywords
input
output
data
coordinates
space
Prior art date
Application number
PCT/ES2009/000383
Other languages
Spanish (es)
French (fr)
Inventor
Pascual Campoy Cervera
Original Assignee
Universidad Politécnica de Madrid
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Universidad Politécnica de Madrid filed Critical Universidad Politécnica de Madrid
Publication of WO2010023334A1 publication Critical patent/WO2010023334A1/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2137 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on criteria of topology preservation, e.g. multidimensional scaling or self-organising maps

Definitions

  • The object of the present invention is an industrial-application procedure for reducing the dimensionality of data, based on the generation of an output map that keeps the distance relation constant with respect to the data represented.
  • This invention relates to reducing the dimension of input data that lie approximately on a hyper-surface within the input space, so that their position can be determined by the non-linear coordinates that parameterize said surface. More specifically, this invention relates to methods that learn the distribution of the input data by generating output maps whose points represent said input data.
  • Multidimensional Scaling (MDS)
  • Isomap
  • Kernel PCA
  • Diffusion Maps
  • Multilayer Autoencoder
  • Locally Linear Embedding (LLE)
  • Self-Organizing Maps (SOM)
  • The procedure described in the present invention differs radically from the previous procedures: the data are entered only once (avoiding iterations over the same data to obtain a good solution), the desired criterion of correspondence between the distances of the reduced data and those of the original data is optimized at all times, and the dimension of the output space is obtained as a consequence of the procedure itself (it is not introduced a priori).
  • The main object of the present invention is to represent data of high dimensionality (in which each input datum is specified by a large set of numerical values) by other data of much smaller dimension (i.e. specified by far fewer values), such that the latter contain all the information necessary to reconstruct the former with a resolution fixed a priori.
  • This method has an initial phase, called learning or training, in which the relationship between any datum in the input space and its corresponding datum in the output space (and vice versa) is established, starting from a set of input data representative of all possible input data that may exist.
  • Once the learning phase is finished, the execution phase allows any datum in the input space to be represented by a datum in the output space, which can be determined by a very small number of numerical values, called the output space dimension.
  • More specifically, the learning phase of the present invention generates an output map in which each point of said map, called a node, has associated the coordinates of a point in the input space; additionally, each point of the map has a relative position with respect to the other points that form it, such that this relative position is determined by the variation of a limited set of values, which represent its relative coordinates and whose number constitutes the dimension of the output map.
  • The learning phase calculates iteratively, with the input data available up to a given moment, the coordinates associated with each node of the output map, verifying at every moment that the ratio between the distance between any two input data and the distance between their two representatives on the output map is kept as constant as possible for every pair of points in the input space, according to the mean-square-error criterion applied over all pairs of points in the input space.
  • This ratio is the only adjustable parameter of the method and is the inverse of the resolution with which the input data are represented and reconstructed. This parameter is fixed a priori and remains constant throughout the procedure.
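As a hedged illustration of this criterion (the patent's exact equation is not reproduced in this text, so the precise form below is an assumption), the mean-square deviation between input-space distances and R times the corresponding output-map distances could be computed as follows:

```python
from math import dist

def distance_ratio_error(inputs, outputs, R):
    """Mean squared deviation, over all pairs, between the input-space
    distance and R times the output-map distance of the representatives.
    A value of 0 means the ratio of distances equals R exactly for
    every pair, i.e. the criterion is perfectly satisfied."""
    n = len(inputs)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d_in = dist(inputs[i], inputs[j])     # distance in input space
            d_out = dist(outputs[i], outputs[j])  # distance on output map
            total += (d_in - R * d_out) ** 2
            pairs += 1
    return total / pairs
```

For three collinear input points spaced 2 apart and represented by map coordinates 0, 1 and 2, the error is zero when R = 2, since every pair of distances is then in the exact ratio R.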
  • During the execution phase, each point of the high-dimensional input space has as its representative on the output map the node whose associated coordinates are closest to its own, so that to indicate which point of the input space is meant, it suffices to indicate the coordinates on the output map of its representative node, which may be relative to the representative of another point in the input space. Therefore, a point of the input space is identified in the execution phase by identifying its representative on the output map, which requires only a limited number of coordinates, thereby reducing the dimensionality necessary to identify any point of the input space.
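A minimal sketch of this lookup, under the assumption that "closest" means Euclidean distance between the input datum and each node's input vector:

```python
from math import dist

def winning_node(x, node_input_vectors):
    """Return the index of the node whose input vector is nearest to x;
    that node is the representative ('winning node') of x on the map."""
    return min(range(len(node_input_vectors)),
               key=lambda m: dist(x, node_input_vectors[m]))
```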
  • The fundamental idea of the method is to ensure that the ratio between the distance on the output map between the representatives of two input data and the distance between said input data remains constant, thereby correctly parameterizing the hyper-surface on which the input data lie by means of a reduced set of coordinates, which represent the position of the nodes of the output map on said hyper-surface.
  • The dimension of the output-space data is obtained by the method itself, as a function of the input data in the learning phase and of the resolution with which the input data are to be represented. This characteristic distinguishes this method from other methods of the current state of the art, in which the dimension of the output data must be introduced a priori, externally to the method, so that said dimension of the output space may be either greater than sufficient or smaller than necessary to represent the input data.
  • Another important novelty of the presented method is the drastic saving of time in the learning phase when establishing the relationship between the input data and their representative data in the output space. This is because the method determines the representative datum of the output space, for a specific input datum in an intermediate iteration of the learning phase, by means of calculations in the output space (of reduced dimension) and not by calculations in the input space (of much greater dimensionality).
  • The proposed method aims to assign values only to a limited set of network nodes, namely those necessary to represent the input data entered in the learning phase, so that the connections between these chosen nodes involve the least possible number of dimensions of the network and, therefore, of coordinates needed to determine their position. This ensures that the position of a node in the network (which in turn represents a datum in the input space) is determined by the minimum possible set of values, which are its coordinates in that network.
  • Any new datum of high dimensionality is approximated by the generated datum closest to it, and this datum is in turn represented by the number of steps needed to reach it along each output dimension, counted either from an arbitrary datum chosen as origin or from the representative of the previous high-dimensional datum.
  • FIGURE 1.- Represents the structure of the input data (fig. 1A) and the output nodes (fig. 1B).
  • Each output node consists of two vectors: the first, called the input vector y_e(m), represents the position of the node in the input space, and the second, called the output vector y_s(m), represents the coordinates of said node on the output map.
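The two-vector node structure of Fig. 1B can be sketched as a simple record (the field names are illustrative, not from the patent):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    input_vector: List[float]   # y_e(m): position in the D-dimensional input space
    output_vector: List[float]  # y_s(m): coordinates on the low-dimensional output map
```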
  • FIGURE 2.- Represents the flow of actions of the learning phase of the method, in which the nodes of the output map are positioned in the input space by updating the input vectors of a series of nodes chosen to represent the input data entered.
  • FIGURE 3.- Graphically represents the effect of the actions described in Figure 2 on the input data and the nodes of the output map.
  • FIGURE 4.- Represents the flow of actions of the execution phase, in which an input datum is represented by a reduced set of coordinates, which may be either those of the output vector of the node that represents it, or the increment of these coordinates with respect to those of the output vector of the node representing the previous input datum.
  • FIGURE 5.- Represents the flow of actions necessary to reconstruct an input datum, once it is identified by a reduced set of coordinates of its representative node on the output map.
  • The method object of the present invention comprises at least the following subprocesses or phases: (i) a first learning phase (10), configured for the generation of an output map, into which data from the input space are successively introduced; from them the method calculates the nodes of the output space that represent them and updates the values of the input vectors of these nodes, placing them in the input space so as to represent a set of said data, as seen in Figures 2 and 3; and (ii) a second execution phase (20a, 20b), intended to use said output map.
  • said second phase in turn comprises two distinct stages:
  • a second reconstruction part (20b), consisting of the reconstruction of an input datum approximately equal to the original, starting from the reduced set of coordinates and from the map of output nodes generated in the first learning phase (10).
  • The first function of the method object of the invention consists in achieving a significant reduction in the dimensionality of the input data, that is, that the coordinates representing an input datum are few; on the other hand, the reconstructed input datum, after being represented by a reduced set of coordinates, must closely resemble the original input datum, that is, the distance between them in the input space must be small.
  • Both objectives are opposed to each other, given that the greater the data reduction, the lower the resolution of the reconstruction, and vice versa; the trade-off is regulated by a single parameter that represents the desired ratio between the distance in the input space between two data and the distance in the output space between the two nodes of the map that represent them.
  • This parameter is called the output map resolution; it is represented by the letter R and corresponds to the resolution with which the input data, represented by their output-map nodes, are reconstructed.
  • This output map resolution therefore indicates the distance chosen a priori between the input data represented by two neighboring nodes of the output map (i.e. separated by a unit in coordinates of the output map).
  • The second execution phase (20a, 20b) - dimensionality reduction (20a) and reconstruction (20b) - can take place after the end of the learning phase (10), so that the latter does not continue even if new input data become available; all the nodes of the output map used both for the dimensionality reduction (20a) of an input datum and for its reconstruction (20b) therefore remain fixed.
  • The execution phase (20a, 20b) can also take place at the same time as the learning phase (10), since at any intermediate step of that phase (10) there exists a structure of the output node map which can, consequently, be used for execution.
  • The first learning phase (10) includes, in turn, the following stages: (i) a first stage of reading a new input datum (11), which has D coordinates and which triggers the execution of the following stages of the first learning phase (10), after which it is decided whether a new input datum is introduced or said learning phase is terminated.
  • The first advantage is that the output node representing each input, which must be calculated in stage (iv), is already known in advance, further reducing its representation error (the distance between the chosen input and its representative) and therefore leading to better fulfilment of the relation of distances between the input space and the representatives in the output space.
  • the set ⁇ is greatly reduced, and therefore so is the total processing time.
  • The new set of chosen input data has only a single new element with respect to the set of the previous iteration, namely the input datum that was introduced as new in said previous iteration. This is the only datum of the set to which the significant computational reduction just explained cannot be applied when calculating its representative node of the output map: for this single point it is necessary to calculate the minimum distance to the input vectors of all the nodes of the output map.
  • A fifth stage of calculating the distances in the output map (15) between the representative nodes calculated in the fourth stage (14) and the new node that is to be created as representative of the new input datum entered, where these distances are all calculated in the output space, that is, they are distances between the output vectors of the nodes of the output map.
  • The calculation of these output distances is done simply by dividing each of the distances calculated in the third stage by the resolution R defined above.
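This division can be sketched directly: the input-space distances from the new datum to the representative nodes, scaled by 1/R, become the distances the new node should keep from those representatives on the output map (a lightweight reading of stages (14) and (15)):

```python
from math import dist

def target_output_distances(x_new, rep_input_vectors, R):
    """For each representative node, the distance from the new input
    datum to that node's input vector, divided by the resolution R,
    gives the distance the new node should have from that
    representative on the output map."""
    return [dist(x_new, v) / R for v in rep_input_vectors]
```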
  • The value of the input vector of the winning node is updated by moving it towards the new input datum, as is traditional in other learning algorithms, proportionally to the distance between the input datum and the input vector of the winning node.
  • the proportionality constant should decrease with the number of iterations to ensure the stability of the input vectors, as is usual in these update rules.
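The update rule described above can be sketched as follows; the 1/(1+t) decay schedule is a hypothetical choice, since the text only requires that the proportionality constant decrease with the number of iterations:

```python
def update_winner(winner_vec, x, t, alpha0=0.5):
    """Move the winning node's input vector toward the new input datum x,
    by a fraction alpha that shrinks as the iteration count t grows."""
    alpha = alpha0 / (1.0 + t)  # decreasing learning rate for stability
    return [w + alpha * (xi - w) for w, xi in zip(winner_vec, x)]
```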
  • The learning process can be continued iteratively (18) by successively introducing new input data, or terminated (19), whereupon the input vectors of the output nodes remain fixed and ready to be used in the execution phase (20a, 20b) for dimensionality reduction (20a).
  • For the learning to be good, the input data entered during the learning phase (10) must be, on the one hand, representative of the input space, that is, scattered throughout all the areas where data may appear; on the other hand, the amount of these data must be sufficiently dense, this density being related to the value of the resolution R: there must be input data every R units of distance in those areas of the input space where input data exist (i.e. on a hyper-surface of that space), it being convenient to have several times (up to an order of magnitude more) this number of input data.
  • If the number of available input data is only of the order of the value indicated in the previous paragraph, they must be entered several times during the learning phase, each complete pass over all available input data being called an epoch. If, on the contrary, the number of available input data is several times higher than the value indicated above, the learning phase can be terminated when all available input data have been entered only once. In practical cases it may happen that said desired number of input data is exceeded in some areas of the input space and not in others, in which case it is advisable to enter the input data of the less dense areas several times, until these values are exceeded a few times over.
  • The nodes that have updated the value of their input vector are those that constitute the so-called output map (Fig. 1B), in which each node is composed of an input vector y_e(m) and an output vector y_s(m), as indicated in Figure 1.
  • The input vectors y_e(1...m) serve to relate each output node to data in the input space and, reciprocally, to relate a set of data in the input space (those that are closer to this input vector than to any other input vector of another node) to this output node.
  • The output vectors y_s(1...m) are used to indicate the node in question, the coordinates being those of the node on the so-called output map (Fig. 1B).
  • The described method selects as the nodes constituting the output map those whose output-vector coordinates are the minimum necessary to identify each node representing the input data, the rest of the coordinates being null. This is achieved through the condition, enforced during the learning phase, that the ratio between the distance between the input vectors of two map nodes and the distance between their respective output vectors be kept as constant as possible for all nodes of the map, according to the mean-square-error criterion calculated in equation (1); this ratio is the so-called output map resolution R.
  • The fact that the output vectors of the map nodes have a high number of null components (or constant components, if the node chosen as representative of the first input datum is not the null vector) means that these coordinates need not be used to identify a specific node of the output map: only the non-zero coordinate values are used, thus achieving the objective of reduced dimensionality in identifying an output node and, therefore, the input datum that it represents.
  • The dimensionality reduction subprocess (20a) is schematized in Figure 4, in which there are two versions, depending on whether the coordinates of the output vector representing the input datum (24) are encoded, or whether the coordinates of the increment of said vector with respect to the output vector representing the previous datum (23) are encoded, the two options being indicated in said figure by a decision branch.
  • Said dimensionality reduction subprocess (20a) comprises, in turn, the following steps: (i) a first step (21) of calculating the output node representing the input datum to be characterized by a reduced-dimension datum, this output node being the one whose input vector is closest to said input datum, called the winning node.
  • A second stage (22) of calculating fractional coordinates with respect to neighbouring nodes, which may optionally be executed depending on the resolution with which the input datum is to be represented. If this resolution is of the order of the value of the parameter R used during the learning phase, it is not necessary to execute this stage, the coordinates of the winning node, which are integer values, being passed directly to the third stage. If, on the contrary, a higher resolution is desired, the coordinates on the output map are calculated as fractional values, in which each value indicates the proximity of the input datum to the winning node relative to the next (or previous) node of the output space in each of its non-zero dimensions.
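One plausible way to obtain such a fractional value (the text does not fix the formula, so this interpolation is an assumption) is to compare the datum's distances to the winner and to the neighbouring node along the dimension in question:

```python
from math import dist

def fractional_offset(x, winner_vec, neighbor_vec):
    """Fraction in [0, 1] locating x between the winner's and a
    neighbouring node's input vectors: 0 means exactly at the winner,
    1 exactly at the neighbour. Added to the winner's integer map
    coordinate, it refines the representation beyond resolution R."""
    d_w = dist(x, winner_vec)
    d_n = dist(x, neighbor_vec)
    return d_w / (d_w + d_n)
```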
  • Either the coordinates of the output vector of the representative node (24) are encoded, or the coordinates of the difference vector between said output vector and the output vector of the representative of the previous input datum (23).
  • The coordinates of both vectors have a high number of null components that need not be indicated, a fact used in the fourth coordinate-coding stage (24) to identify the node representing the input datum entered by means of a number of coordinates smaller than the dimension of the input space.
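Both options can be sketched together: delta-coding a node's output vector against the previous representative (option 23), then keeping only the non-zero components (the (index, value) pair format below is illustrative, not from the patent):

```python
def delta_code(curr, prev):
    """Difference between the current and previous representatives'
    output vectors; successive representatives tend to be close on the
    map, so most components of the difference are zero."""
    return [c - p for c, p in zip(curr, prev)]

def sparse_code(vec):
    """Keep only (index, value) pairs for non-zero components, so far
    fewer numbers than the full vector need to be stored."""
    return [(i, v) for i, v in enumerate(vec) if v != 0]
```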
  • The nodes of the output map, which represent data from the input space and are identified by a reduced number of coordinates as just described, can be introduced into the input-data reconstruction procedure (20b) outlined in Figure 5.
  • This process includes, in turn:
  • A fifth stage (35) interpolates the input vectors corresponding to the output nodes obtained, to calculate an input datum that is very similar to the original input datum represented by the output node introduced into this reconstruction procedure (20b).
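Sketched for a single map dimension (the multi-dimensional case would interpolate along each non-zero coordinate in turn; this is an assumption about how stage (35) operates):

```python
def reconstruct(frac, winner_input_vec, neighbor_input_vec):
    """Linearly interpolate between the input vectors of the winning
    node and its neighbour, according to the fractional coordinate
    frac, to recover an approximation of the original input datum."""
    return [w + frac * (n - w)
            for w, n in zip(winner_input_vec, neighbor_input_vec)]
```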

Abstract

Method for reducing the dimensionality of data which comprises at least the following stages: (i) a learning stage (10) for generating an output map from data relating to the input space, which is made up of nodes, by forming a structure with a size smaller than the input and by updating the values of the input vectors associated with these nodes; and (ii) an execution stage (20a,20b) which is intended to use said output map and in turn comprises two different stages: (a) reducing the size (20a) by representing an input data item using the coordinates of the output vector of the node which represents it; (b) reconstruction (20b) which involves reconstructing an input data item from the reduced set of coordinates which represents it and from the map of output nodes generated in the learning stage (10).

Description

Title: METHOD FOR REDUCING THE DIMENSIONALITY OF DATA

Object of the invention

The object of the present invention is an industrial-application procedure for reducing the dimensionality of data, based on the generation of an output map that keeps the distance relation constant with respect to the data represented.

Field of the invention

This invention relates to reducing the dimension of input data that lie approximately on a hyper-surface within the input space, so that their position can be determined by the non-linear coordinates that parameterize said surface. More specifically, this invention relates to methods that learn the distribution of the input data by generating output maps whose points represent said input data.
Background of the invention

The attempt to parameterize multidimensional data surfaces, so that the data can be represented by a reduced set of parameters, has been the subject of several previous works. The most classic and oldest of these is principal component analysis and its extension, linear discriminant analysis, whose great limitation is that it works correctly only when the data lie on a linear manifold (hyperplane), which makes it useless in a great many practical applications. There have subsequently been several works to parameterize the surface of the data (that is, for dimensionality reduction) when the data lie on a non-linear manifold, among which the following methods stand out, on which a majority of the methods currently employed can be considered to be based: Multidimensional Scaling (MDS), Isomap, Kernel PCA, Diffusion Maps, the Multilayer Autoencoder, Locally Linear Embedding (LLE) and Self-Organizing Maps (SOM). In the latter, the selection criterion is based on the distance in the input space, which requires updating the weights of the neighbouring units and leads to a long iterative process with invalid intermediate solutions. All these systems need the dimension of the non-linear manifold to be parameterized to be introduced as an input, which is a great inconvenience in practice and has to be solved beforehand by using other techniques that estimate the intrinsic dimensionality of the data (that is, of the manifold on which they lie), the two procedures being decoupled and unrelated.

The procedure described in the present invention differs radically from the previous procedures: the data are entered only once (avoiding iterations over the same data to obtain a good solution), the desired criterion of correspondence between the distances of the reduced data and those of the original data is optimized at all times, and the dimension of the output space is obtained as a consequence of the procedure itself (it is not introduced a priori).
Descripción de la invenciónDescription of the invention
La presente invención tiene como objeto principal representar datos con elevada dimensionalidad (en los que cada dato del entrada está especificado mediante un conjunto elevado de valores numéricos) mediante otros datos de dimensión mucho más reducida (i.e. especificado por muchos menos valores) de manera que estos últimos contienen toda la información necesaria para reconstruir los primeros con una resolución fijada a priori.The present invention has as its main object to represent data with high dimensionality (in which each input data is specified by a large set of numerical values) by other data of much smaller size (ie specified by much less values) so that these The latter contain all the information necessary to reconstruct the former with a resolution set a priori.
Este método tiene una fase inicial, denominada de aprendizaje o entrenamiento, en la que se establece la relación entre cualquier dato en el espacio de entrada y su correspondiente dato en el espacio de salida (y viceversa), partiendo para ello de un conjunto de datos de entrada representativos de todos los posibles datos de entrada que puedan existir. Una vez acabada la fase de aprendizaje, la fase de ejecución permite que cualquier dato en el espacio de entrada venga representado por un dato en el espacio de salida, que puede determinarse por un número muy reducido de valores numéricos, denominado dimensión del espacio de salida.This method has an initial phase, called learning or training, in which the relationship between any data in the input space and its corresponding data in the output space (and vice versa) is established, starting from a set of data input representative of all possible input data that may exist. Once the learning phase is finished, the execution phase allows any data in the input space to be represented by a data in the data space. output, which can be determined by a very small number of numerical values, called the output space dimension.
Más concretamente, la fase de aprendizaje de la presente invención genera un mapa de salida en el que cada punto de dicho mapa, denominado nodo, tiene asociado las coordenadas de un punto del espacio de entrada y adicionalmente cada punto del mapa tiene una posición relativa respecto a los otros puntos que lo forman, de maneara que esta posición relativa está determinada por la variación de un conjunto limitado de valores, que representan sus coordenadas relativas y cuyo número constituye la dimensión del mapa de salida.More specifically, the learning phase of the present invention generates an output map in which each point of said map, called a node, has associated the coordinates of a point in the input space and additionally each point of the map has a relative position with respect to to the other points that form it, in a way that this relative position is determined by the variation of a limited set of values, which represent their relative coordinates and whose number constitutes the dimension of the output map.
La fase de aprendizaje calcula de una forma iterativa, con los datos de entrada disponibles hasta un momento dado, las coordenadas asociadas a cada nodo del mapa de salida, de manera que se verifica en cada momento que la relación entre la distancia entre dos datos de entrada y la distancia entre sus dos representantes en el mapa de salida se mantiene lo más constante posible para cualquier pareja de puntos del espacio de entrada, según el criterio de error cuadrático medio aplicado sobre todas las parejas de puntos del espacio de entrada. Esta relación el único parámetro ajustable del método, que es el inverso de la resolución con la que representan y reconstruyen los datos de entrada. Este parámetro es fijado a priori y permanece constante durante todo el procedimiento.The learning phase calculates in an iterative way, with the input data available up to a given moment, the coordinates associated with each node of the output map, so that the relationship between the distance between two data from each other is verified at each moment input and the distance between its two representatives on the output map is kept as constant as possible for any pair of points in the input space, according to the criteria of the mean square error applied to all pairs of points in the input space. This relationship is the only adjustable parameter of the method, which is the inverse of the resolution with which they represent and reconstruct the input data. This parameter is set a priori and remains constant throughout the procedure.
Durante la fase de ejecución, cada punto del espacio de entrada, de dimensionalidad elevada, tiene como representante en el mapa de salida aquel nodo que tiene asociado las coordenadas más cercanas a las suyas, de manera que para indicar de qué punto del espacio de entrada se trata, basta con indicar las coordenadas en el mapa de salida de su nodo representante, pudiendo ser éstas relativas respecto al representante de otro punto del espacio de entrada. Por tanto, la identificación de un punto del espacio de entrada en la fase de ejecución se realiza mediante la identificación de su representante en el mapa de salida, lo que requiere sólo de un número limitado de coordenadas, reduciendo con ello la dimensionalidad necesaria para identificar cualquier punto del espacio de entrada. La idea fundamental del método, por tanto, es conseguir que la relación entre la distancia en el mapa de salida entre los representantes de dos datos de entrada y la distancia entre dichos datos de entrada permanezca constante, consiguiendo con ello parametrizar correctamente la hiper-superficie sobre la que se encuentran los datos de entrada mediante un conjunto reducido de coordenadas, que representan la posición de los nodos del mapa de salida sobre dicha hiper-superficie.During the execution phase, each point of the input space, of high dimensionality, has as a representative on the output map that node that has associated the closest coordinates to its own, so that to indicate from which point of the input space It is enough to indicate the coordinates on the output map of your representative node, which may be relative to the representative of another point in the input space. Therefore, the identification of a point of the entry space in the execution phase is done by identifying its representative on the output map, which requires only a limited number of coordinates, thereby reducing the dimensionality necessary to identify Any point of the input space. 
The fundamental idea of the method, therefore, is to ensure that the ratio between the distance on the output map between the representatives of two input data and the distance between said input data remains constant, thereby correctly parameterizing the hyper-surface on which the input data lie by means of a reduced set of coordinates, which represent the positions of the output-map nodes on said hyper-surface.
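As an illustration, the execution-phase lookup just described can be sketched as follows. This is a minimal sketch, not the patent's implementation: the node layout (a list of dictionaries with keys `y_e` and `y_s`), the function names, and the toy values are all assumptions.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length coordinate lists."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def represent(x, nodes):
    """Dimension reduction: return the output-map coordinates y_s of the
    node whose input vector y_e is closest to the input datum x."""
    winner = min(nodes, key=lambda n: euclid(x, n["y_e"]))
    return winner["y_s"]

# Toy map: three nodes of a 1-D output map embedded in a 3-D input space.
nodes = [
    {"y_e": [0.0, 0.0, 0.0], "y_s": [0]},
    {"y_e": [1.0, 1.0, 0.0], "y_s": [1]},
    {"y_e": [2.0, 2.0, 0.0], "y_s": [2]},
]
print(represent([0.9, 1.1, 0.1], nodes))  # nearest node is the middle one
```

A 3-D datum is thus reported with a single coordinate, at the cost of the quantization error between the datum and the winning node's input vector.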
An important novelty of the presented method is that the dimension of the output-space data is obtained by the method itself, as a function of the input data in the learning phase and of the resolution with which the input data are to be represented. This characteristic distinguishes this method from other methods of the current state of the art, in which the dimension of the output data must be supplied a priori, externally to the method, so that said output-space dimension may turn out to be either greater than sufficient or smaller than necessary to represent the input data.
Another important novelty of the presented method is the drastic saving of time in the learning phase when establishing the relationship between the input data and their representative data in the output space. This is because the method determines the representative datum of the output space, for a given input datum at an intermediate iteration of the learning phase, by means of calculations in the output space (of reduced dimension) and not by means of calculations in the input space (of much greater dimensionality).
It is initially assumed that an indefinitely large number of nodes is available and that these are connected to one another forming a network structure of the same dimension as the input space, so that the position of each node is determined by its coordinates in this network. In the learning phase the proposed method has as its final objective to assign values only to a limited set of network nodes, those necessary to represent the input data introduced during the learning phase, in such a way that the connections among these chosen nodes involve the smallest possible number of network dimensions, and therefore of coordinates needed to determine their position. This guarantees that the position of a network node (which in turn represents a datum in the input space) is determined by the minimum possible set of values, namely its coordinates in that network.
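The lazily populated network of nodes described above might be represented as follows. The patent does not prescribe any data layout; the class, field names, and values here are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class Node:
    y_s: Tuple[int, ...]                # coordinates on the output grid
    y_e: Optional[List[float]] = None   # position in the input space;
                                        # assigned only once the node wins

# The conceptually unlimited network is realised lazily: only the nodes
# actually needed to represent input data are ever instantiated, so the
# number of grid dimensions used stays as small as the data allow.
grid: Dict[Tuple[int, ...], Node] = {}
grid[(0, 0)] = Node(y_s=(0, 0), y_e=[0.5, 0.5, 0.5])
grid[(0, 1)] = Node(y_s=(0, 1), y_e=[0.5, 0.5, 0.9])
print(len(grid))  # only two of the potentially unlimited nodes exist
```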
Once this connection structure has been generated, any new high-dimensional datum is approximated by the generated datum closest to it, and this datum is in turn represented by the number of data needed to reach it along each output dimension, counted either from an arbitrary datum chosen as origin or from the representative of the previous high-dimensional datum.
Brief description of the figures
A series of drawings that help to better understand the invention, and that expressly relate to an embodiment of said invention presented as a non-limiting example thereof, is described very briefly below.
FIGURE 1.- Represents the structure of the input data (fig. 1A) and of the output nodes (fig. 1B). Each output node consists of two vectors: the first, called the input vector y.e(m), represents the position of the node in the input space; the second, called the output vector y.s(m), represents the coordinates of said node on the output map.
FIGURE 2.- Represents the flow of actions of the learning phase of the method, in which the nodes of the output map are positioned in the input space by updating the input vectors of a series of nodes chosen to represent the introduced input data.
FIGURE 3.- Graphically represents the effect of the actions described in figure 2 on the input data and the nodes of the output map.
FIGURE 4.- Represents the flow of actions of the execution phase, in which an input datum is represented by a reduced set of coordinates, which may alternatively be those of the output vector of the node that represents it, or the increment of these coordinates with respect to those of the output vector of the node representing the previous input datum.
FIGURE 5.- Represents the flow of actions necessary to reconstruct an input datum once it is identified by a reduced set of coordinates of the node of the output map that represents it.
Preferred embodiment of the invention
The method object of the present invention comprises at least the following sub-processes or phases: (i) a first learning phase (10), configured for the generation of an output map, in which data are successively introduced into the input space and, from them, the method computes the output-space nodes that represent them and updates the values of the input vectors of these nodes, which place them in the input space so as to represent a set of said data, as shown in figures 2 and 3; and
(ii) a second execution phase (20a, 20b), configured for the use of the output map generated in the first phase (10), in such a way that the dimension of the input data is reduced and they are subsequently reconstructed from the reduced-dimension data; this second phase in turn comprises two distinct stages:
(a) a first dimension-reduction part (20a), consisting of the representation of an input datum, which has a high number of coordinates D, by another datum with a reduced number of coordinates, corresponding to the output vector of the node that represents it, or to the increment with respect to the node representing the previous input;
(b) a second reconstruction part (20b), consisting of the reconstruction of an input datum approximately equal to the original, starting from the reduced set of coordinates and from the map of output nodes generated in the first learning phase (10).
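The reconstruction part (20b) amounts to a lookup in the opposite direction: from the reduced coordinates back to the stored input vector. A minimal sketch, with node structure and values assumed, not taken from the patent:

```python
def reconstruct(coords, nodes):
    """Reconstruction: given the reduced coordinates of a representative
    node, return its stored input vector y_e as the approximate datum."""
    for n in nodes:
        if n["y_s"] == coords:
            return n["y_e"]
    raise KeyError("no node with those output-map coordinates")

nodes = [
    {"y_e": [1.0, 2.0, 3.0], "y_s": [0]},
    {"y_e": [4.0, 5.0, 6.0], "y_s": [1]},
]
print(reconstruct([1], nodes))  # -> [4.0, 5.0, 6.0]
```

The reconstructed datum is only approximately equal to the original: the error is bounded by how densely the nodes cover the hyper-surface, i.e. by the resolution R discussed next.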
The first function of the method object of the invention is to achieve a significant reduction of the dimensionality of the input data, that is, that the coordinates representing an input datum be few and, on the other hand, that the reconstructed input datum, after being represented by a reduced set of coordinates, closely resemble the original input datum, that is, that the distance between them in the input space be small.
Both objectives are opposed to each other, given that the greater the data reduction, the lower the resolution of the reconstruction and vice versa; they are regulated by a single parameter representing the desired ratio between the distance in the input space between two data and the distance in the output space between the two map nodes that represent them. This parameter is called the resolution of the output map, is denoted by the letter R, and therefore represents the resolution with which the input data represented by their output-map nodes are reconstructed. This output-map resolution thus indicates the distance, chosen a priori, between the input data represented by two neighboring nodes of the output map (i.e. separated by one unit in output-map coordinates).
The second execution phase (20a, 20b), comprising dimensionality reduction (20a) and reconstruction (20b), can take place once the learning phase (10) has finished, in which case the latter does not continue even if new input data become available, so that all the nodes of the output map used both for the dimensionality reduction (20a) of an input datum and for its reconstruction (20b) remain fixed. Alternatively, the execution phase (20a, 20b) can also take place at the same time as the learning phase (10), since at any intermediate step of that phase (10) there exists a structure of the output-node map which can, consequently, be used for execution. In this latter case it must be borne in mind that the output map will change in the future, becoming ever better at reducing the dimension of the input datum (fewer output-node coordinates) and, above all, better at reconstructing the input datum (closer to its original value).
The actions that take place during the learning phase (10) are indicated in figure 2, and the graphic representation of these actions on the input data and the output nodes is schematized in figure 3.
Thus, as shown in figure 2, the first learning phase (10) in turn comprises the following stages: (i) A first stage of reading a new input datum (11), which has D coordinates, and which triggers the execution of the following stages of the first learning phase (10), after which it is decided whether a new input datum is introduced or said learning phase is terminated.
(ii) A second stage of selecting a set Ω of inputs (12), introduced earlier, where the most suitable set depends on the sequence in which the input data are introduced, as well as on the extent of the area of the input space in which the distance ratio R is to be fulfilled. If the sequence is random and said distance ratio R is to be fulfilled over the entire input space, any sufficiently large set of previous inputs is valid, and in this case the set formed by the P previous inputs introduced in the learning phase can be taken. The higher the number of inputs chosen, the better the method works, in the sense that the output-map nodes converge to their desired values with fewer input data; on the other hand, the method has to perform more operations for each new input datum and is therefore slower. In practice there is a value P of the number of inputs that should not be exceeded, since beyond it the improvement in convergence is almost negligible compared with the computational effort required; this practical value is of the order of ten raised to the dimension of the input space. As a variant of this choice of the set Ω of inputs, the set can be formed, instead of by actual inputs introduced earlier, by the input vectors associated with output nodes that have already been winners in previous iterations of the procedure. This choice of the input vectors of the set Ω has two important advantages. The first is that the output node representing each input, which must be computed in stage (iv), is already known in advance, additionally reducing its representation error (the distance between the chosen input and its representative) and therefore leading to a better fulfillment of the distance ratio between the input space and its representatives in the output space. The second is that the set Ω is greatly reduced, and with it the total processing time.
(iii) A third stage of calculating the distances in the input space (13) between all the data of the previously chosen set Ω and the newly introduced input datum.
(iv) A fourth stage of calculating the output-space nodes (14) that represent each of the inputs of the set Ω selected in the second stage (12), the node representing a given input datum being the one whose input vector is closest to the input datum it represents. To calculate these representative nodes of the output map it must be taken into account that, at each iteration, only a limited set of nodes exists on the output map, namely those whose input vectors have been assigned values in previous iterations. To save computational time in this fourth action it is important to note that the great majority of the input data of the set chosen in this iteration were also chosen in the previous iteration, and therefore their representative nodes were already calculated on the output map of the previous iteration; that map in turn differs from the current one in only one node, the single node updated in the previous iteration, called the previous winner. By this reasoning, for all the input data already chosen in the previous iteration it is only necessary to check whether the previous winning node is closer than the previous representative of said input datum, updating its representative to said winning node only if its distance to it is smaller. In the usual case in which the new chosen set of input data has only a single new element with respect to the set of the previous iteration, namely the input datum introduced as new in that previous iteration, this is the only datum of the set for which the important computational reduction just explained cannot be applied; for this single point it is necessary to compute the minimum distance to the input vectors of all the nodes of the output map.
This process is greatly simplified when the set Ω has been chosen as the inputs associated with output nodes that have won in previous iterations of the process, since in that case the output node associated with each chosen input is already known.
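The fourth-stage shortcut can be sketched as follows. The helper names and the data layout (a dict mapping node keys to input vectors) are assumptions for illustration; only the logic, comparing each stored representative against the single previous winner, comes from the text above.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length coordinate lists."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def refresh_representatives(omega, reps, prev_winner, nodes):
    """For inputs already present in the previous iteration, only check
    whether the single node updated last time (the previous winner) is
    now closer than each input's stored representative."""
    for i, x in enumerate(omega):
        if euclid(x, nodes[prev_winner]) < euclid(x, nodes[reps[i]]):
            reps[i] = prev_winner
    return reps

# nodes: key -> input vector y_e of that output node
nodes = {"a": [0.0, 0.0], "b": [2.0, 0.0], "w": [0.9, 0.0]}
omega = [[1.0, 0.0], [2.1, 0.0]]     # inputs already seen last iteration
reps = ["a", "b"]                    # their representatives last time
print(refresh_representatives(omega, reps, "w", nodes))
```

Each already-seen input thus costs two distance evaluations instead of one per existing node, which is where the claimed time saving comes from.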
(v) A fifth stage of calculating the distances on the output map (15) between the representative nodes calculated in the fourth stage (14) and the new node to be determined as representative of the newly introduced input datum, where these distances are all calculated in the output space, that is, they are distances between the output vectors of the nodes of the output map. These output distances are obtained simply by dividing each of the distances calculated in the third action by the resolution R defined above.
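In code, this fifth stage is a single division by the resolution (a sketch; the function name is assumed):

```python
def desired_output_distances(input_distances, R):
    """Convert input-space distances (third stage) into the target
    output-map distances by dividing by the resolution R."""
    return [d / R for d in input_distances]

print(desired_output_distances([1.0, 2.0, 3.0], 0.5))  # -> [2.0, 4.0, 6.0]
```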
(vi) A sixth stage of calculating the coordinates of the output vector of the node (16) that represents the new input datum introduced in this iteration, called the winning node. This output vector is calculated as the one located as closely as possible at the distances, calculated in the fifth stage (15), from each of the representative nodes, calculated in the fourth stage (14), of the previous input data chosen in the second stage (12). To calculate this position of the output vector, the difference between the distance to each representative node and the desired distance to said node, obtained in the fifth stage, is evaluated, finally choosing as output vector the one whose differences over all representative nodes are smallest (usually evaluated as the sum of the squares of said distance differences). The mathematical expression for this calculation is the following:

    y.s(g) = argmin over y.s of  Σ_{i ∈ Ω} ( ‖ y.s − y.s(g_i) ‖ − d.e(i) / R )²
where d_e(i) are the distances between the input datum just introduced and the previous inputs included in the set Ω chosen in the second stage (12), y.s(g_i) are the output vectors of the nodes representing those previous inputs, and y.s(g) are the coordinates of the output vector of the winning node for the current input datum. (vii) A seventh stage (17) of computing the input vector of the winning node calculated in the sixth stage (16). This input vector must be placed near the new input datum introduced in this iteration, so that it represents it; that is, so that if, in the execution phase, the same input datum is introduced again, the output node that represents it (the one whose input vector is closest) is this same node whose input vector is now being updated. It should be borne in mind that updates to the input vectors of this and other output nodes in later iterations of the learning phase may cause the output node representing this input datum to end up being a node other than its current representative. To update the input vector, it is taken into account whether this is the first time this output node has been a winner (that is, the first time its input vector is computed), or whether, on the contrary, its input vector has already been computed in some earlier iteration of this learning phase. In the first case, the input vector of the winning node is assigned the value of the new input datum introduced in this iteration, thereby ensuring its representativeness. In the second case, the input vector of the winning node is updated by moving it toward the new input datum, as is traditional in other learning algorithms, in proportion to the distance between the input datum and the input vector of the winning node.
The proportionality constant must decrease with the number of iterations to ensure the stability of the input vectors, as is usual in such update rules.
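As an illustration of this two-case update rule, a minimal Python sketch follows. The function name, the dictionary representation of the map, and the concrete decay schedule alpha0 / (1 + iteration) are assumptions made for the example; the patent only requires a proportionality constant that decreases with the iteration count.

```python
import numpy as np

def update_winner_input_vector(input_vectors, winner, x, iteration, alpha0=0.5):
    """Update the winning node's input vector for one learning iteration.

    input_vectors maps a node id to its input vector, or to None if the
    node has never been a winner. First win: adopt the datum exactly.
    Later wins: move toward the datum with a decaying learning rate.
    """
    x = np.asarray(x, dtype=float)
    if input_vectors.get(winner) is None:
        # First time this node wins: copy the datum, ensuring representativeness.
        input_vectors[winner] = x.copy()
    else:
        # Decreasing proportionality constant ensures stability (assumed schedule).
        alpha = alpha0 / (1.0 + iteration)
        input_vectors[winner] += alpha * (x - input_vectors[winner])
    return input_vectors[winner]
```

A first win assigns the datum verbatim; subsequent wins shift the stored vector a decreasing fraction of the way toward each new datum.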
Once the seventh stage (17) is finished, the state of the output map has been updated in the sense of having learned the new input datum introduced in this iteration; in fact, only the input vector of a single node of the map, the winning node for this input datum, has been updated. At this point the learning process can continue iteratively (18) through the successive introduction of new input data, or the learning process (10) can be terminated (19), whereupon the input vectors of the output nodes remain fixed and ready to be used in the execution phase (20a, 20b) for dimensionality reduction (20a).
For the learning phase 10 to be good (that is, for the generated output map to allow a good reduction of the dimension of the input data and a good reconstruction of them), the input data introduced during it must, on the one hand, be representative of the input space, that is, be scattered over all the regions of it where data may appear; on the other hand, the amount of data must be sufficiently dense. This data density is related to the value of the resolution R, in that there should be input data every R units of distance in those regions of the input space where input data occur (that is, on a hyper-surface of it), and it is convenient to have several times (up to an order of magnitude more) this number of input data.
If the number of available input data is only of the order of the value indicated in the preceding paragraph, the data must be introduced several times during the learning phase; each complete pass over all the available input data is called an epoch. If, on the contrary, the number of available input data is several times greater than that value, the learning phase can be considered finished once all the available input data have been introduced a single time. In practical cases it may happen that the desired number of input data is exceeded in some regions of the input space and not in others, in which case it is advisable to introduce the input data of the less dense regions several times, until the desired value is exceeded in them a few times over.
Once the learning phase 10 is finished, the nodes that have updated the value of their input vector are those that constitute the so-called output map (Fig. 1B), in which each node is composed of an input vector y.e(m) and an output vector y.s(m), as indicated in Figure 1. The input vectors y.e(1...m) serve to relate each output node to a datum in the input space and, conversely, to relate a set of data in the input space (those that are closer to this input vector than to the input vector of any other node) to this output node. The output vectors y.s(1...m) serve to identify the node in question, their values being the coordinates of the node in the so-called output map (Fig. 1B). Once the learning phase 10 has been executed, the described method has selected, as the nodes constituting the output map, those whose output-vector coordinates are the minimum necessary to identify each node representing the input data, the remaining coordinates being null. This follows from the condition, enforced during the learning phase, that the ratio between the distance between the input vectors of two nodes of the map and the distance between their respective output vectors remains as constant as possible for all the nodes of the map, according to the mean-square-error criterion computed in equation (1); this ratio is the so-called resolution R of the output map.
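In execution, locating the node that represents a datum is a nearest-neighbour search over the input vectors y.e of the map nodes. A minimal sketch, assuming the map is held as a list of (input vector, output coordinates) pairs; this representation and the function name are chosen for the example, not prescribed by the patent:

```python
import numpy as np

def winner_node(x, nodes):
    """Return the output coordinates y.s of the node whose input vector
    y.e is closest to the datum x. `nodes` is a list of (y_e, y_s) pairs."""
    x = np.asarray(x, dtype=float)
    best = min(nodes, key=lambda n: np.linalg.norm(x - np.asarray(n[0], dtype=float)))
    return best[1]
```

The same lookup is the first stage (21) of the dimensionality-reduction subprocess described below.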
The fact that the output vectors of the map nodes have a large number of null components (or constant components, if the node chosen to represent the first input datum is not the null vector) means that these coordinates need not be used to identify a specific node of the output map; only the non-zero coordinate values are used for that purpose, thereby achieving the objective of dimensionality reduction when identifying an output node and hence the input datum it represents.
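The saving described here amounts to sparse coding of the output vector: only the non-zero (index, value) pairs need to be stored or transmitted. A hypothetical sketch (helper names are invented for the example):

```python
def encode_sparse(y_s):
    """Keep only the non-zero coordinates as (index, value) pairs; the
    null coordinates shared by every node need not be indicated."""
    return [(i, v) for i, v in enumerate(y_s) if v != 0]

def decode_sparse(pairs, dim):
    """Restore the full output vector by re-inserting the null coordinates."""
    y_s = [0] * dim
    for i, v in pairs:
        y_s[i] = v
    return y_s
```

For an output vector with many null components, the pair list is much shorter than the full vector, which is the dimensionality-reduction effect the text describes.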
The dimensionality-reduction procedure 20a is outlined in Figure 4, of which there are two versions, depending on whether the coordinates of the output vector representing the input datum are coded (24), or whether the coordinates of the increment of that vector with respect to the output vector representing the previous datum are coded (23); the two versions are indicated in the figure by the alternative "relative position" (25). In many applications in which the sequence of input data is not random but follows a certain order, the second option is better for reducing the dimension of the output vector.
This dimensionality-reduction subprocess (20a) comprises, in turn, the following stages: (i) A first stage (21) of computing the output node that represents the input datum to be characterized by a reduced-dimension datum; this output node is the one whose input vector is closest to the input datum, and it is called the winning node.
(ii) A second stage (22) of computing the fractional coordinates with respect to the neighbouring nodes, which may optionally be executed depending on the resolution with which the input datum is to be represented. If this resolution is of the order of the value of the parameter R used during the learning phase, it is not necessary to execute this action, and the coordinates of the winning node, which are integer values, are passed directly to the third action. If, on the contrary, a higher resolution is desired, the coordinates in the output map are computed as fractional values, in which each value indicates the closeness of the input datum to the winning node relative to the next (or previous) node of the output space in each of its non-zero dimensions.
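One way to obtain such a fractional coordinate, given here only as an illustrative sketch since the patent leaves the exact interpolation open, is to project the datum onto the segment joining the winner's input vector to a neighbouring node's input vector along one output axis; the function name and the clamping to [0, 1] are assumptions:

```python
import numpy as np

def fractional_coordinate(x, e_winner, e_next, s_winner, axis):
    """Refine the winner's integer coordinate on one output axis.

    Projects x onto the segment from the winner's input vector e_winner to
    the neighbour's input vector e_next; the projection fraction t in [0, 1]
    is added to the winner's coordinate on that axis."""
    x, a, b = (np.asarray(v, dtype=float) for v in (x, e_winner, e_next))
    seg = b - a
    t = float(np.dot(x - a, seg) / np.dot(seg, seg))
    t = min(max(t, 0.0), 1.0)  # keep the refinement between the two nodes
    coords = list(s_winner)
    coords[axis] += t
    return coords
```

A datum one quarter of the way from the winner toward its neighbour thus receives the winner's coordinate plus 0.25 on that axis.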
Next there are two possibilities, depending on whether the coordinates of the output vector of the representative node are coded (24) or the coordinates of the difference vector between that output vector and the output vector of the representative of the previous input datum are coded (23). In both cases the vectors have a large number of null coordinates that need not be indicated, which is exploited in the fourth coordinate-coding stage (24) to identify the node representing the introduced input datum by means of a number of coordinates smaller than the dimension of the input space.
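The "relative position" variant can be sketched as delta coding of the output coordinates; for ordered input sequences most components of the difference are zero, so the sparse pair list shrinks further. Function names are invented for the example:

```python
def encode_delta(y_s_current, y_s_previous):
    """Code the increment with respect to the previous datum's
    representative, keeping only non-zero components."""
    diff = [c - p for c, p in zip(y_s_current, y_s_previous)]
    return [(i, v) for i, v in enumerate(diff) if v != 0]

def decode_delta(pairs, y_s_previous):
    """Recover the absolute output coordinates from the increment."""
    y_s = list(y_s_previous)
    for i, v in pairs:
        y_s[i] += v
    return y_s
```

When consecutive data map to nearby nodes, only the one or two axes that changed need to be transmitted.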
The nodes of the output map, which represent data of the input space and are identified by a reduced number of coordinates as just described, can be fed into the input-datum reconstruction procedure (20b) outlined in Figure 5. That process comprises, in turn:
(i) A first stage (31) of decoding the coordinates of the output vector, adding the necessary null coordinates. The absolute coordinates of the vector in the output map are then computed, executing if necessary the second stage (32), which adds the output vector corresponding to the previous input datum. (ii) A second stage (32) of computing the output vector of the representative node.
(iii) A third stage (33) of computing the integer-coordinate output vectors that surround the output vector computed above. (iv) A fourth stage (34) of identifying the output nodes corresponding to the computed integer-coordinate output vectors.
(v) Finally, a fifth stage (35) interpolates the input vectors corresponding to the output nodes obtained, thereby computing an input datum that is very similar to the original input datum represented by the output node introduced into this Reconstruction procedure (20b).
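Stages (iii) through (v) can be sketched together as multilinear interpolation between the surrounding integer nodes. This is an assumption made for the example: the patent specifies only that the surrounding input vectors are interpolated, and the `nodes` mapping and function name are invented here.

```python
import itertools
import numpy as np

def reconstruct(y_s, nodes):
    """Reconstruct an approximate input datum from (possibly fractional)
    output coordinates y_s. `nodes` maps an integer coordinate tuple to
    that node's input vector; corners absent from the map are skipped."""
    y_s = np.asarray(y_s, dtype=float)
    lo = np.floor(y_s).astype(int)      # surrounding integer coordinates
    frac = y_s - lo                     # interpolation weights per axis
    out = None
    for corner in itertools.product([0, 1], repeat=len(y_s)):
        key = tuple(int(c) for c in lo + np.array(corner))
        if key not in nodes:
            continue
        # Multilinear weight: product of per-axis fractions.
        w = float(np.prod([f if c else 1.0 - f for c, f in zip(corner, frac)]))
        v = w * np.asarray(nodes[key], dtype=float)
        out = v if out is None else out + v
    return out
```

With one output dimension and a fractional coordinate of 0.25, the reconstruction is one quarter of the way from the first node's input vector to the second's, mirroring the fractional coding in the reduction phase.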

Claims

1. Method for reducing the dimensionality of data, characterized in that it comprises at least the following subprocesses or phases: (i) a first learning phase (10), configured to generate an output map, in which data are successively introduced into the input space and, from them, the method computes the nodes of the output space that represent them and updates the values of the input vectors of those nodes, which place them in the input space so as to represent a set of those data; and (ii) a second execution phase (20a, 20b), configured to use the output map generated in the first phase (10) so as to reduce the dimension of the input data, and for their subsequent reconstruction from the reduced-dimension data; this second phase in turn comprises two distinct stages: (a) a first dimension-reduction part (20a), consisting in representing an input datum, which has a large number of coordinates D, by another datum with a reduced number of coordinates, corresponding to the output vector of the node that represents it or to the increment with respect to the node representing the previous input; (b) a second reconstruction part (20b), consisting in reconstructing an input datum approximately equal to the original, starting from the reduced set of coordinates and from the map of output nodes generated in the first learning phase (10); where, furthermore, said method is configured so that the output map has a fixed dimensional structure, in the sense that each node of this map is identified by its coordinates in that dimensional space, each node of the output map having associated with it a set of updatable values that represent the coordinates of a point in the dimensional input space, so that each point of the input space is represented by the node of the output map whose associated coordinates are closest to those of the point of the input space in question; in that only the values of a single point of the output space are updated, chosen according to a distance criterion in that output space; and in that the values of the points of the output map are updated iteratively for each introduced point of the input space, so that the representatives of the previously introduced points are those whose values are closest to each of those points at the current moment, at which the last point of the input space has been introduced.
2. Method according to claim 1, characterized in that, once the point of the output map that updates its values has been chosen and its values updated, the process proceeds iteratively with successive inputs.
3. Method according to any of the preceding claims, characterized in that the values associated with the points of the output map are such that the distances between two points of the input space and between their representatives in the output map keep a predefined fixed ratio, the inverse of the resolution, as closely as possible according to the mean-square-error criterion defined by the expression

y.s(g) = argmin_j Σ_{i∈Ω} ( ‖y.s(j) − y.s(g_i)‖ − d_e(i)/R )²

where d_e(i) are the distances between the introduced input datum and the previous inputs included in a set Ω chosen in the first learning phase (10), y.s(g_i) are the output vectors of the nodes representing those previous inputs, and y.s(g) are the coordinates of the output vector of the winning node for the current input datum.
4. Method according to any of claims 1-3, characterized in that the input datum is represented by fractional coordinates in the output map, computed from the closeness to the vectors associated with several nodes of said map.
PCT/ES2009/000383 2008-08-29 2009-07-20 Method for reducing the dimensionality of data WO2010023334A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ES200802521 2008-08-29
ESP200802521 2008-08-29

Publications (1)

Publication Number Publication Date
WO2010023334A1 true WO2010023334A1 (en) 2010-03-04

Family

ID=41720865

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/ES2009/000383 WO2010023334A1 (en) 2008-08-29 2009-07-20 Method for reducing the dimensionality of data

Country Status (1)

Country Link
WO (1) WO2010023334A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001071624A1 (en) * 2000-03-22 2001-09-27 3-Dimensional Pharmaceuticals, Inc. System, method, and computer program product for representing object relationships in a multidimensional space
US6526168B1 (en) * 1998-03-19 2003-02-25 The Regents Of The University Of California Visual neural classifier
WO2003107120A2 (en) * 2002-06-13 2003-12-24 3-Dimensional Pharmaceuticals, Inc. Methods, systems, and computer program products for representing object relationships in a multidimensional space
US20040022445A1 (en) * 2002-07-30 2004-02-05 International Business Machines Corporation Methods and apparatus for reduction of high dimensional data
US20040078351A1 (en) * 2000-12-12 2004-04-22 Pascual-Marqui Roberto Domingo Non-linear data mapping and dimensionality reduction system



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09809358

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09809358

Country of ref document: EP

Kind code of ref document: A1