CN103500165A

CN103500165A - High-dimensional vector retrieval method combining clustering and dual key values

Info

Publication number: CN103500165A
Application number: CN201310365592.2A
Authority: CN
Inventors: 吕锐; 杨丽芳; 曹学会; 黄祥林; 成鹏; 龚昊; 史欣萍
Original assignee: XINHUA NEWS AGENCY; Communication University of China
Current assignee: XINHUA NEWS AGENCY; Communication University of China
Priority date: 2013-08-21
Filing date: 2013-08-21
Publication date: 2014-01-08
Anticipated expiration: 2033-08-21
Also published as: CN103500165B

Abstract

The invention provides a high-dimensional vector retrieval method that combines clustering with dual key values. The method proposes a clustering-based dual-key index structure, the CDKB-tree: a clustering algorithm first partitions the set of high-dimensional vectors into clusters, and a dual-key extended B+-tree is then built for each cluster; together these trees form the CDKB-tree. During retrieval, only the clusters that intersect the query range are searched. Clustering provides the first filtering stage, and the primary key and auxiliary key (the dual key values) provide two further stages of key-value filtering, so that similarity matching against the query vector is computed only for those high-dimensional vectors whose primary key and auxiliary key both fall within the search range. By relying on clustering and simple comparisons of the two key values, the index structure greatly reduces the amount of similarity-matching computation and greatly improves retrieval speed.

Description

A high-dimensional vector retrieval method combining clustering and dual key values

Technical field

The invention belongs to data processing fields such as multimedia information retrieval, intelligent information processing, and data mining, and specifically relates to a high-dimensional vector retrieval method combining clustering and dual key values.

Background art

With the development of computer and information technology, massive amounts of multimedia data have been produced. How to quickly find the required information in a massive multimedia database is a key problem in current multimedia database research. The traditional approach is to annotate multimedia data manually and then perform multimedia information retrieval through text retrieval. Manual annotation, however, involves a heavy workload and is highly subjective; for explosively growing multimedia data, fully manual annotation is not feasible. Content-based multimedia information retrieval techniques therefore need to be studied.

The technical route for content-based multimedia information retrieval is as follows. A feature transformation maps each multimedia item to a point in a high-dimensional space, namely a feature vector, which describes the multimedia object; the collection of these vectors forms the feature database. The feature vector of the query object is then extracted using the same feature transformation, and similarity search over the multimedia information is finally carried out by similarity matching between feature vectors. Similarity search over multimedia information is thus converted into the problem of finding, in a high-dimensional feature space, the set of points nearest to a given query point.

To find the set of points closest to a given query point in a high-dimensional space, the simplest and most intuitive method is a sequential scan: each feature (high-dimensional vector) in the feature database is matched for similarity against the query point in turn, and the best-matching feature points are returned as the retrieval result. The computation time of a sequential scan grows linearly with the number of features in the database and with the feature dimensionality; when the number of features is very large, a sequential scan cannot meet real-time requirements. The most common way to speed up retrieval is to use high-dimensional indexing techniques.
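As a point of comparison for the index structures discussed later, the sequential scan can be sketched in a few lines. This is a minimal illustration only; the function name `sequential_scan`, the use of Euclidean distance, and the sample data are assumptions made for the example and do not come from the patent.

```python
import math

def sequential_scan(features, query, radius):
    """Return the indices of all vectors within `radius` of `query`."""
    hits = []
    for i, vec in enumerate(features):
        # One full distance computation per stored vector: the cost is linear
        # in both the number of vectors and their dimensionality.
        if math.dist(vec, query) <= radius:
            hits.append(i)
    return hits

features = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
print(sequential_scan(features, (0.0, 0.0), 1.5))
```

Every stored vector is compared with the query, which is exactly the cost the indexing techniques aim to avoid.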

To manage massive collections of high-dimensional vectors, researchers have proposed a large number of index structures, among which the classic ones are the R-tree family, represented by the R-tree. The R-tree, proposed by Guttman in the 1980s, is an index structure designed for managing multidimensional rectangular data. It is a height-balanced tree in which each node is represented by the minimum bounding rectangle (MBR) of all the data in that node, and actual data appear only in leaf nodes. With extensions, this index structure can also be used to manage point data in high-dimensional spaces. During a query, the search proceeds downward from the root level to the leaf level. The minimum distance between the query vector and each node's MBR is computed to decide whether the query range intersects that node, which provides pruning: only subtrees that may contain results are searched, which speeds up retrieval. This index structure allows spatial overlap between nodes, which hurts its query efficiency. To improve on the R-tree, researchers successively proposed index structures such as the R+-tree, R*-tree, SS-tree, SR-tree, X-tree, and A-tree. As the feature dimensionality increases, however, the query efficiency of these tree index structures drops sharply, to the point of being worse than a sequential scan; this is the so-called "curse of dimensionality".

Besides tree structures, there are index structures based on a high-dimensional-to-one-dimensional transformation, for example the Pyramid technique, NB-tree, iDistance, and iMinMax. Such an index structure maps each high-dimensional vector to a one-dimensional value (called the key value) according to some rule and then manages these key values with a one-dimensional B+-tree, in whose leaf level the key values are stored in sorted order. At query time, the query key value of the query vector is first computed with the same transformation rule; the start and end positions of the key values to be searched are then determined from the query range; the high-dimensional vectors corresponding to these key values are scanned in turn, the similarity between the query vector and each of them is computed, and the most similar vectors are returned as the retrieval result. As the query procedure shows, the performance of such an index structure is in all cases better than or equal to that of a sequential scan, and extensive earlier experiments show that its performance degrades only slowly as dimensionality and data volume increase.
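The single-key scheme can be illustrated with a deliberately simplified sketch. It assumes that the key value is the Euclidean distance to one reference point, which is only one possible mapping rule and not the exact rule of any of the named structures, and it uses a sorted Python list searched with `bisect` as a stand-in for the leaf level of a B+-tree. All function names are hypothetical.

```python
import bisect
import math

def build_single_key_index(vectors, ref):
    """Map each vector to one key (distance to `ref`) and sort by that key."""
    entries = sorted((math.dist(v, ref), i) for i, v in enumerate(vectors))
    keys = [k for k, _ in entries]   # ordered keys, like a B+-tree leaf level
    ids = [i for _, i in entries]    # ids[j] is the vector behind keys[j]
    return keys, ids

def single_key_range_query(vectors, keys, ids, ref, query, radius):
    """Return (matching ids, number of candidates that needed a full distance)."""
    q_key = math.dist(query, ref)
    start = bisect.bisect_left(keys, q_key - radius)
    end = bisect.bisect_right(keys, q_key + radius)
    candidates = ids[start:end]
    hits = sorted(i for i in candidates if math.dist(vectors[i], query) <= radius)
    return hits, len(candidates)

vectors = [(1.0, 0.0), (0.0, 1.0), (3.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
ref = (0.0, 0.0)
keys, ids = build_single_key_index(vectors, ref)
print(single_key_range_query(vectors, keys, ids, ref, (1.0, 0.0), 0.5))
```

In this toy data set the vectors (1, 0) and (0, 1) share the same key, so both survive the key filter even though only one of them lies within the query radius.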

Index structures based on a high-dimensional-to-one-dimensional transformation, such as the Pyramid technique, NB-tree, iDistance, and iMinMax, prune by simple comparison of a single key value. They need no complex distance computations and offer relatively high retrieval efficiency, but the transformation to one dimension loses a large amount of information, so that different vectors may have the same one-dimensional key value. A single key value can therefore filter out only a small proportion of the data, the final similarity-matching step remains computationally expensive, and the query cost remains considerable.

Summary of the invention

The object of the invention is to propose a high-dimensional vector retrieval method combining clustering and dual key values. The method uses a clustering algorithm to partition the high-dimensional space into clusters and then maps each high-dimensional vector in each cluster to two one-dimensional key values. During a query, clustering filters out the clusters that do not intersect the query region; adding one more key-value filtering layer to each cluster then allows a further round of pruning by simple key-value comparison. This greatly reduces the computation of the final vector similarity matching and significantly speeds up queries.

The general idea of the invention is as follows. First, a clustering algorithm partitions the set of high-dimensional vectors into clusters. Two reference points are then chosen for each cluster, and each high-dimensional vector in the cluster is mapped to two one-dimensional key values, namely its distances to the cluster's two reference points. Within a cluster, the key value obtained from one particular reference point is uniformly taken as the primary key, and the other as the auxiliary key. Finally, a B+-tree is built for each cluster from that cluster's primary keys; each primary key in the leaf level of each B+-tree is bound to a pointer to its corresponding auxiliary key, and each auxiliary key is bound to a pointer to its corresponding high-dimensional vector. At retrieval time, only the clusters that intersect the query range need to be searched. In each such cluster, the same two reference points and the same mapping are used to map the query vector to a query primary key and a query auxiliary key. The primary-key search range in the cluster is determined from the query primary key and the query range, and the auxiliary-key search range from the query auxiliary key and the query range. Similarity matching against the query vector is then computed only for those high-dimensional vectors that pass the primary-key filter and whose auxiliary keys lie within the auxiliary-key search range; the most similar vectors are returned as the retrieval result.
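The text does not spell out how the search ranges follow from the query key and the query range. One standard reading, given here as an assumption and not as the patent's own formulation, uses the triangle inequality of the distance function. The symbols below are introduced only for this illustration.

```latex
% Assumed notation: d is a metric, o_1 and o_2 are the two reference points of a cluster,
% c and R are the cluster center and cluster radius, q is the query vector, r the query radius.
\begin{align*}
k_j(x) &= d(x, o_j), \qquad j \in \{1, 2\} \\
d(x, q) \le r \;&\Longrightarrow\; \lvert k_j(x) - k_j(q) \rvert \le r \\
\text{search range for key } j &: \; [\,k_j(q) - r,\; k_j(q) + r\,] \\
\text{a cluster can contain results only if } & d(q, c) \le R + r
\end{align*}
```

Under this reading, a vector whose primary or auxiliary key falls outside the corresponding interval cannot lie within the query radius, so discarding it without a full distance computation loses no results.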

Specific innovation: the high-dimensional space is partitioned into clusters, and two reference points are chosen so that each high-dimensional vector in each cluster yields two one-dimensional key values. Clustering plus two simple key-value comparisons greatly reduces the number of high-dimensional vectors that finally take part in similarity matching, which significantly speeds up queries.

The specific steps of the method of the invention are as follows.

(1) A clustering algorithm is used to partition the set of high-dimensional vectors into clusters, obtaining the cluster center and cluster radius of each cluster.

(2) A dual-key extended B+-tree is built for each cluster. For each cluster, the procedure is: two reference points are first chosen for the cluster, and each high-dimensional vector in the cluster is mapped to two one-dimensional key values given by its distances to these two reference points; the key value obtained from one particular reference point is uniformly taken as the primary key and the other as the auxiliary key; a B+-tree is then built for the cluster from its primary keys. Each primary key in the leaf level of this B+-tree is bound to a pointer to its corresponding auxiliary key, and each auxiliary key is bound to a pointer to its corresponding high-dimensional vector. All primary keys in the leaf level of the B+-tree form the primary-key layer, and all auxiliary keys form the auxiliary-key layer.

(3) The cluster center and cluster radius of each cluster are bound to a pointer to the dual-key extended B+-tree built for that cluster, forming the CDKB-tree.

(4) At retrieval time, the query range is used to filter out the clusters that do not intersect the query region, and each cluster that intersects the query range is searched. The search within each such cluster proceeds as follows: the same reference points and mapping are used to map the query vector to a query primary key and a query auxiliary key; the start and end positions of the search in the cluster's primary-key layer are determined from the query primary key and the query range; the auxiliary-key search range in the cluster's auxiliary-key layer is determined from the query auxiliary key and the query range; the primary-key layer is then scanned key by key from the start position to the end position, checking whether the auxiliary key corresponding to each primary key lies within the auxiliary-key search range; if it does, similarity matching is computed between the high-dimensional vector corresponding to that auxiliary key and the query vector. The high-dimensional vectors that satisfy the query range are returned as the retrieval result.
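The four steps can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the patent: the clusters are given in advance (no clustering algorithm is run), the primary key is the Euclidean distance to the cluster center and the auxiliary key is the Euclidean distance to the origin, parallel sorted lists stand in for the B+-tree leaf level and its pointers, and the search ranges are taken as the query key plus or minus the query radius. The names `Cluster`, `dual_keys`, `build_cluster`, and `range_search` are hypothetical.

```python
import bisect
import math
from dataclasses import dataclass

@dataclass
class Cluster:
    center: tuple
    radius: float
    primary: list    # primary keys, ascending (stand-in for the B+-tree leaf level)
    auxiliary: list  # auxiliary[i] is the auxiliary key bound to primary[i]
    vectors: list    # vectors[i] is the vector bound to auxiliary[i]

def dual_keys(vec, center):
    """Primary key: distance to the cluster center. Auxiliary key: distance to the origin."""
    origin = (0.0,) * len(center)
    return math.dist(vec, center), math.dist(vec, origin)

def build_cluster(members):
    """Steps (1)-(3) for one already-formed cluster: center, radius, and sorted key layers."""
    dim = len(members[0])
    center = tuple(sum(v[d] for v in members) / len(members) for d in range(dim))
    rows = sorted(dual_keys(v, center) + (v,) for v in members)
    primary = [pk for pk, _, _ in rows]
    # The largest primary key is the largest distance to the center, i.e. the radius.
    return Cluster(center, primary[-1], primary,
                   [ak for _, ak, _ in rows], [v for _, _, v in rows])

def range_search(clusters, query, r):
    """Step (4): cluster filter, primary-key filter, auxiliary-key filter, then exact check."""
    results = []
    for c in clusters:
        if math.dist(query, c.center) > c.radius + r:
            continue  # cluster does not intersect the query region
        q_pk, q_ak = dual_keys(query, c.center)
        start = bisect.bisect_left(c.primary, q_pk - r)
        end = bisect.bisect_right(c.primary, q_pk + r)
        for i in range(start, end):
            if abs(c.auxiliary[i] - q_ak) > r:
                continue  # rejected by the auxiliary key, no distance computed
            if math.dist(c.vectors[i], query) <= r:
                results.append(c.vectors[i])
    return results

groups = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)],
]
clusters = [build_cluster(g) for g in groups]
hits = range_search(clusters, (1.0, 1.0), 1.0)
print(sorted(hits))
```

In this toy run the second cluster is skipped entirely by the cluster filter, and within the first cluster the vector at the origin is rejected by the auxiliary key alone, so no full distance is computed for it.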

Further, the clustering algorithm in step (1) includes K-means clustering.

Further, choosing two reference points in step (2) includes choosing the origin and the cluster center as the reference points.

Further, the distance from a high-dimensional vector to the two reference points in step (2) may be the Euclidean distance or the city-block distance.
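The two distance options can be written out as follows. This is an illustrative sketch only; the function names and the `distance` parameter are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.dist(a, b)

def city_block(a, b):
    """City-block (L1, Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def dual_keys(vec, ref1, ref2, distance=euclidean):
    """Map a vector to its two key values using the chosen distance."""
    return distance(vec, ref1), distance(vec, ref2)

print(dual_keys((3.0, 4.0), (0.0, 0.0), (3.0, 0.0)))
print(dual_keys((3.0, 4.0), (0.0, 0.0), (3.0, 0.0), distance=city_block))
```

Both distances satisfy the triangle inequality, which is the property that key-range filtering of this kind typically relies on.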

Further, when a high-dimensional vector is inserted into the CDKB-tree of step (3), the distance from the vector to the center of each cluster is first computed, the cluster nearest to the vector is selected for the insertion, and its cluster radius is updated. The primary key and auxiliary key of the vector to be inserted are then obtained from its distances to the two reference points of that cluster, and the primary key value is used to locate the leaf node of that cluster's B+-tree into which it should be inserted. If this leaf node is not full, the primary key is inserted directly into the leaf node, the auxiliary key is inserted at the position corresponding to the primary key, and the feature vector to be inserted is placed at the position corresponding to the auxiliary key; the primary key is given a pointer to its auxiliary key, the auxiliary key is given a pointer to the inserted high-dimensional vector, and the corresponding key value in the leaf node's parent is updated. If this leaf node is full, it is handled as follows:

1) If the left or right sibling of this leaf node is not full, the primary key, auxiliary key, and high-dimensional vector to be inserted are inserted in combination with the left and right siblings, and the corresponding key value in the parent node is updated;

2) If its left and right siblings are both full, the leaf node is split directly, taking into account the primary key value of the vector to be inserted; the new leaf node produced by the split is inserted into the parent node, the auxiliary key and high-dimensional vector are inserted at their corresponding storage positions, and the corresponding key value in the parent node is updated. If the parent node is also full, the split propagates upward, and the corresponding key values are updated.
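The cluster-level part of the insertion (choosing the nearest cluster, updating its radius, computing the two keys, and placing them in primary-key order) can be sketched as below. The sketch does not model B+-tree nodes, so the sibling redistribution and node splitting of cases 1) and 2) are not shown; a sorted list takes the place of the leaf level. The dictionary layout, the choice of cluster center and origin as reference points, and the name `insert_vector` are assumptions made for the example.

```python
import bisect
import math

def insert_vector(clusters, vec):
    """Insert `vec` into the nearest cluster, keeping its key layers sorted by primary key."""
    # Pick the cluster whose center is closest to the new vector.
    c = min(clusters, key=lambda cl: math.dist(vec, cl["center"]))
    pk = math.dist(vec, c["center"])         # primary key: distance to the cluster center
    ak = math.dist(vec, (0.0,) * len(vec))   # auxiliary key: distance to the origin
    c["radius"] = max(c["radius"], pk)       # grow the radius if the vector lies outside
    pos = bisect.bisect_right(c["primary"], pk)
    # The three parallel lists play the role of the pointer chain
    # primary key -> auxiliary key -> high-dimensional vector.
    c["primary"].insert(pos, pk)
    c["auxiliary"].insert(pos, ak)
    c["vectors"].insert(pos, vec)
    return c

clusters = [
    {"center": (0.0, 0.0), "radius": 1.0, "primary": [], "auxiliary": [], "vectors": []},
    {"center": (10.0, 10.0), "radius": 1.0, "primary": [], "auxiliary": [], "vectors": []},
]
insert_vector(clusters, (3.0, 4.0))
insert_vector(clusters, (1.0, 0.0))
print(clusters[0]["primary"], clusters[0]["vectors"], clusters[0]["radius"])
```

Only the radius is updated here, following the text; the cluster center is left unchanged.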

Further, the retrieval in step (4) supports both range queries and k-nearest-neighbor (kNN) queries.

Further, the query range in step (4) is determined, for a range query, by the query radius; for a kNN query, it is determined by a query radius that is increased by a fixed step until the distance from the k-th nearest neighbor to the query vector is less than the query radius.
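A kNN query built on repeated range queries with a growing radius might look like the sketch below. The range query is passed in as a function so that the sketch stays self-contained; here a brute-force range query over a small list is used for demonstration. The `max_radius` bound, which guarantees termination when fewer than k vectors exist, and all names are assumptions made for the example.

```python
import math

def knn_by_expanding_range(range_search, query, k, step, max_radius):
    """Grow the query radius in multiples of `step` until k neighbors fall inside it."""
    n = 1
    hits = []
    while n * step <= max_radius:
        radius = n * step
        hits = sorted(range_search(query, radius), key=lambda v: math.dist(v, query))
        if len(hits) >= k:
            # The k-th nearest hit lies inside the current radius, so no closer
            # vector can exist outside it.
            return hits[:k]
        n += 1
    return hits  # fewer than k vectors within max_radius

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]

def brute_range(query, radius):
    return [v for v in data if math.dist(v, query) <= radius]

print(knn_by_expanding_range(brute_range, (0.0, 0.0), 2, 0.5, 10.0))
```

A small step wastes range queries on radii that return too few results, while a large step may return far more than k candidates; the patent leaves the step size open.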

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the invention and form a part of it. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1(a) is a flowchart of the method of the invention.

Fig. 1(b) is an example diagram of the index structure of the invention.

Fig. 2 is a block diagram of a range query on the index structure of the invention.

Fig. 3 is a block diagram of a kNN query on the index structure of the invention.

Detailed description of the embodiments

To make the technical problem to be solved and the technical solution of the invention clearer, specific embodiments of the invention are further described below with reference to the drawings and an embodiment.

The flowchart for building the index structure of the high-dimensional vector retrieval method combining clustering and dual key values provided by this embodiment of the invention is shown in Fig. 1(a):

First, a clustering algorithm partitions the set of high-dimensional vectors into clusters in space, yielding the clusters of high-dimensional data. Second, the cluster center and radius of each cluster are computed, and two reference points are chosen for each cluster. Next, the distances between each high-dimensional vector in each cluster and that cluster's two reference points are computed one by one, giving the two one-dimensional key values of each vector. Then, within each cluster, the key value obtained from one particular reference point is taken as the primary key and the other as the auxiliary key, and a B+-tree is built for the cluster from its primary keys; the auxiliary key and high-dimensional vector corresponding to each primary key are inserted at the corresponding auxiliary-key and vector storage positions, each primary key is bound to a pointer to its auxiliary key, and each auxiliary key is bound to a pointer to its high-dimensional vector, giving the dual-key extended B+-tree of each cluster. Finally, the cluster center and cluster radius of each cluster are bound to that cluster's dual-key extended B+-tree, forming the CDKB-tree index structure. (As shown in Fig. 1(b), the top layer is the cluster information layer; the middle consists of the B+-trees built from the primary keys of each cluster; the bottom consists of the auxiliary-key layer and the feature-vector layer, which store the auxiliary keys and the high-dimensional vectors. Each primary key in the leaf level of each B+-tree is bound to a pointer to its corresponding auxiliary key, and that auxiliary key is bound to a pointer to the corresponding high-dimensional vector.)

At retrieval time, the query vector and query range are used to decide whether each cluster intersects the query, and each cluster that intersects the query is searched further. The search method is as follows. First, using the same reference points and mapping rule, the distances between the query vector and the cluster's two reference points are computed, giving the query primary key and query auxiliary key for this cluster. Then, from the query primary key and the query range, the primary-key search range in the primary-key layer (that is, the B+-tree leaf level) of the cluster's dual-key extended B+-tree is determined, giving the scan start and end positions in the primary-key layer; likewise, from the query auxiliary key and the query range, the auxiliary-key search range in the auxiliary-key layer of the cluster's dual-key extended B+-tree is determined. Finally, the key values are scanned one by one from the scan start position to the end position of the primary-key layer (the primary-key search range), checking whether the auxiliary key corresponding to each primary key lies within the auxiliary-key search range; if it does, the distance between the high-dimensional vector corresponding to that auxiliary key and the query vector is computed. The high-dimensional vectors that satisfy the query are returned, giving the set of similar vectors.

The retrieval modes of the invention include range queries and kNN queries. The flowchart of the range query is shown in Fig. 2, and that of the kNN query in Fig. 3. As Fig. 3 shows, the kNN query is implemented by means of range queries.

The high-dimensional vectors above may be feature vectors of images, video, or audio.

It should be understood that the above description of the embodiment is relatively specific and should not therefore be regarded as limiting the scope of patent protection of the invention, which is defined by the claims.

Claims

1. A high-dimensional vector retrieval method combining clustering and dual key values, characterized in that the specific steps are as follows:

1) a clustering algorithm is used to partition the set of high-dimensional vectors into clusters, obtaining the cluster center and cluster radius of each cluster;

2) a dual-key extended B+-tree is built for each cluster, the procedure for each cluster being: two reference points are first chosen for the cluster, and each high-dimensional vector in the cluster is mapped to two one-dimensional key values given by its distances to these two reference points; the key value obtained from one particular reference point is uniformly taken as the primary key and the other as the auxiliary key; a B+-tree is then built for the cluster from its primary keys, each primary key in the leaf level of this B+-tree being bound to a pointer to its corresponding auxiliary key, and each auxiliary key being bound to a pointer to its corresponding high-dimensional vector; all primary keys in the leaf level of the B+-tree form the primary-key layer, and all auxiliary keys form the auxiliary-key layer;

3) the cluster center and cluster radius of each cluster are bound to a pointer to the dual-key extended B+-tree built for that cluster, forming the CDKB-tree;

4) at retrieval time, the query range is used to filter out the clusters that do not intersect the query region, and each cluster that intersects the query range is searched, the search within each intersecting cluster being: the same reference points and mapping are used to map the query vector to a query primary key and a query auxiliary key; the start and end positions of the search in the cluster's primary-key layer are determined from the query primary key and the query range; the auxiliary-key search range in the cluster's auxiliary-key layer is determined from the query auxiliary key and the query range; the primary-key layer is then scanned key by key from the start position to the end position, checking whether the auxiliary key corresponding to each primary key lies within the auxiliary-key search range; if it does, similarity matching is computed between the high-dimensional vector corresponding to that auxiliary key and the query vector; the high-dimensional vectors that satisfy the query range are returned as the retrieval result.

2. the method for claim 1, it is characterized in that: the clustering algorithm described in step 1 comprises the Kmeans cluster.

3. the method for claim 1 is characterized in that: choose two reference point described in step 2, comprise and can choose initial point and cluster centre is reference point.

4. the method for claim 1 is characterized in that: the high dimension vector described in step 2 can adopt Euclidean distance or city block distance to the distance of these two reference point.

5. the method for claim 1, it is characterized in that: when the CDKB-tree described in step 3 carries out the high dimension vector insertion, at first arrive the distance value of each cluster data cluster centre according to this high dimension vector, the nearest cluster data of this high dimension vector of selected distance is carried out update, upgrade cluster radius, then obtain being inserted into main key and the auxiliary key value of vector to the distance of two reference point of this cluster data according to being inserted into vector, locate it according to the size of this main key value and should be inserted into the corresponding B of this cluster data ⁺in a certain leaf node of-tree: if this leaf node less than, directly this main key value is inserted in this leaf node, its auxiliary key is inserted into the position that this main key is corresponding, be inserted into proper vector and be inserted into the position that this auxiliary key is corresponding, and make main key produce the pointer that points to its corresponding auxiliary key, its corresponding auxiliary key produces and points to the pointer that is inserted into high dimension vector, upgrades the key value that this leaf node father node is corresponding; If this leaf node is full, the mode of processing is as follows:

Step 1: if the left and right brotgher of node of this leaf node exist less than situation, in conjunction with its left and right brotgher of node, carry out the insertion of becoming owner of key, auxiliary key and high dimension vector to be inserted, and upgrade the key value that its father node is corresponding;

Step 2: if its left and right brotgher of node is all full, in conjunction with the main key value that is inserted into high dimension vector, directly this leaf node is divided, the new leaf node produced after division is inserted in its father node, its auxiliary key and high dimension vector are inserted into to the corresponding stored position simultaneously, upgrade the key value that its father node is corresponding, if father node is also full, fission process continues upwards to transmit, and upgrades corresponding key value.

6. the method for claim 1 is characterized in that: during being retrieved described in step 4, the retrieval mode of employing had both comprised that range query also comprised the k NN Query.

7. the method for claim 1, it is characterized in that: the query context described in step 4, for range query, by the inquiry radius, determined, for the k NN Query, be to be determined by the inquiry radius increased progressively by a certain step-length, until k neighbour is less than the inquiry radius to the distance value of query vector.