CN103514264A

CN103514264A - Method for searching high-dimensional vector combining clustering and city block distances

Info

Publication number: CN103514264A
Application number: CN201310365384.2A
Authority: CN
Inventors: 吕锐; 黄祥林; 陈明祥; 杨丽芳; 储达峰; 高庆; 魏海涛
Original assignee: XINHUA NEWS AGENCY; Communication University of China
Current assignee: XINHUA NEWS AGENCY; Communication University of China
Priority date: 2013-08-21
Filing date: 2013-08-21
Publication date: 2014-01-15

Abstract

The invention discloses a method for searching high-dimensional vectors that combines clustering with the city block distance. The method provides an index structure, the CBlockB-tree, that combines clustering and the city block distance: a clustering algorithm first partitions the set of high-dimensional vectors into clusters, and a BlockB-tree is then built for each cluster to form the CBlockB-tree. During a search, clustering filters out the clusters that do not intersect the query region, and comparing the one-dimensional key values obtained from the high-dimensional-to-one-dimensional transformation further reduces the amount of computation in the final vector similarity matching, which speeds up the search. The index structure also effectively supports match search with the simple and efficient city block distance.

Description

High-dimensional vector search method combining clustering and city block distance

Technical field

The invention belongs to data processing fields such as multimedia information retrieval, intelligent information processing, and data mining, and specifically relates to a high-dimensional vector search method that combines clustering and the city block distance.

Background art

With the development of computer and information technology, massive amounts of multimedia data are being produced, and quickly finding the required information in a massive multimedia database is a key problem in current multimedia database research. The traditional approach is to annotate the multimedia data manually and then retrieve it by text search. Manual annotation, however, is labor-intensive and highly subjective, and fully manual annotation is not feasible for explosively growing multimedia data, so content-based multimedia information retrieval techniques need to be studied.

The technical route to content-based multimedia information retrieval is as follows: a feature transform maps each multimedia object to a point in a high-dimensional space, its feature vector, which describes the object, and the feature vectors together form a feature database; the same feature transform then extracts the feature vector of the query object; finally, similarity matching between feature vectors realizes the similarity search over the multimedia information. Similarity search over multimedia information thus becomes the process of finding, in the high-dimensional feature space, the set of points nearest to a given query point.

To find the set of points closest to a given query point in a high-dimensional space, the simplest and most intuitive method is a sequential scan: each feature (high-dimensional vector) in the feature database is matched against the query point in turn, and the best-matching feature points are returned as the retrieval result. The computation time of a sequential scan grows linearly with the number of features in the database and with the feature dimension, so when the database holds a very large number of features, a sequential scan cannot meet real-time requirements. The most common way to speed up retrieval is to use high-dimensional indexing techniques.
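The sequential-scan baseline can be sketched as follows. This is a minimal illustration, not code from the patent: the function names `city_block` and `sequential_scan` are hypothetical, and the city block distance is assumed as the similarity measure because it is the metric the invention adopts later.

```python
def city_block(a, b):
    """City block (L1, Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sequential_scan(features, query, k):
    """Compare the query with every stored vector and keep the k closest.

    The cost is proportional to len(features) * dimension, which is the
    linear growth described in the text.
    """
    scored = [(city_block(v, query), i) for i, v in enumerate(features)]
    scored.sort()  # smallest distance first; ties broken by position
    return scored[:k]


# Hypothetical toy feature database of 2-dimensional vectors.
features = [[0, 0], [1, 2], [5, 5], [2, 1]]
nearest = sequential_scan(features, [1, 1], k=2)  # (distance, position) pairs
```

Every stored vector is touched once per query, which is why the index structures discussed next try to discard most vectors before any full distance is computed.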

To manage massive sets of high-dimensional vectors, researchers have proposed a large number of index structures, of which the classic ones are the R-tree family, represented by the R-tree. The R-tree was proposed by Guttman in the 1980s as an index structure for managing multidimensional rectangular data. It is a height-balanced tree in which each node is represented by the minimum bounding rectangle (MBR) of all the data in that node, and the actual data appear only in the leaf nodes. With extensions, this index structure can also manage point data in high-dimensional spaces. A query searches downward from the root level to the leaf level; the minimum distance between the query vector and each node's MBR is computed to decide whether the query range intersects the node, which prunes the search so that only subtrees that may contain results are visited and retrieval is accelerated. The R-tree allows nodes to overlap in space, which hurts its query efficiency. To improve on the R-tree, researchers successively proposed index structures such as the R⁺-tree, R*-tree, SS-tree, SR-tree, X-tree, and A-tree. As the feature dimension increases, however, the query efficiency of these tree index structures drops sharply, eventually falling below that of a sequential scan; this is the so-called "curse of dimensionality".

Besides tree structures, there are index structures based on a high-dimensional-to-one-dimensional transformation, for example the Pyramid technique, NB-tree, iDistance, and iMinMax. Such an index structure maps each high-dimensional vector to a one-dimensional value (called the key value) by some rule and then manages the key values with a one-dimensional B⁺-tree, in whose leaf level the key values are stored in sorted order. At query time, the query key value of the query vector is first computed with the same transformation rule; the start and end positions of the key values to be searched are then determined from the query range; finally, the high-dimensional vectors corresponding to these key values are scanned in turn, the similarity between the query vector and each of them is computed, and the most similar high-dimensional vectors are returned as the retrieval result. As the query process shows, the performance of such an index structure is in all cases better than or equal to that of a sequential scan, and extensive experiments in earlier work show that its performance degrades only slowly as the dimension and the data volume increase.

Index structures of this kind, such as the Pyramid technique, NB-tree, iDistance, and iMinMax, prune by a simple comparison of a single key value. Although they need no complicated distance computation and achieve high retrieval efficiency, the transformation from high dimension to one dimension loses a large amount of information, so different vectors may have the same one-dimensional key value. A single key value can therefore filter out only a small proportion of the data, the final similarity matching still requires a large amount of computation, and the query cost remains considerable. Moreover, most of the previously proposed index structures of this kind are based on the Euclidean distance as the matching metric, whereas the city block distance is one of the most commonly used metrics in high-dimensional vector similarity matching: it is simple to compute and offers high retrieval efficiency.

Summary of the invention

The object of the invention is to propose an index structure that combines clustering and the city block distance, the CBlockB-tree (short for Clustering BlockB-tree). Through clustering, this index structure filters out the clusters that do not intersect the query region; by comparing the key values obtained from the high-dimensional-to-one-dimensional transformation, it further reduces the amount of computation in the final vector similarity matching and speeds up the search for high-dimensional vectors. The index structure also effectively supports match search using the simple and efficient city block distance.

The overall idea of the invention is as follows. A clustering algorithm first partitions the set of high-dimensional vectors into clusters. A reference point is then chosen for each cluster, and each high-dimensional vector in the cluster is mapped to a one-dimensional key value, namely its city block distance to that reference point. Finally, a BlockB-tree is built for each cluster: the BlockB-tree organizes the one-dimensional key values of the cluster in a B⁺-tree, and each key value in the leaf level of the B⁺-tree is bound to a pointer to its corresponding high-dimensional vector. At retrieval time, only the clusters that intersect the query range need to be searched. In each such cluster, the query vector is mapped to a query key value using the same reference point and the city block distance; the query key value and the query range determine the start and end positions (the search range) of the key values to be searched in that cluster; finally, the feature vectors corresponding to the key values in the search range are scanned, the similarity between the query vector and each of them is computed, and the most similar vectors are returned as the retrieval result.
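Why both filters are safe can be seen from the triangle inequality. The derivation below is a sketch in notation that the patent itself does not define: $d_1$ is the city block distance in $n$ dimensions, $c_i$, $R_i$ and $O_i$ are the centre, radius and reference point of cluster $i$, $q$ is the query vector, and $r$ is the radius of the query range. It assumes that the cluster radius is measured with the same city block distance, which the text does not state.

```latex
\begin{aligned}
d_1(x, y) &= \sum_{j=1}^{n} \lvert x_j - y_j \rvert, \qquad
\mathrm{key}_i(p) = d_1(p, O_i) \\[4pt]
\text{cluster filter:}\quad
d_1(q, p) &\ge d_1(q, c_i) - d_1(p, c_i) \ge d_1(q, c_i) - R_i
\quad \text{for every } p \text{ in cluster } i, \\
&\text{so } d_1(q, c_i) - R_i > r \text{ implies that cluster } i \text{ contains no result.} \\[4pt]
\text{key filter:}\quad
\lvert \mathrm{key}_i(q) - \mathrm{key}_i(p) \rvert &\le d_1(q, p), \\
&\text{so } d_1(q, p) \le r \text{ implies }
\mathrm{key}_i(q) - r \le \mathrm{key}_i(p) \le \mathrm{key}_i(q) + r .
\end{aligned}
```

The key interval is only a necessary condition: vectors with different coordinates can share a key value, so every candidate inside the interval still needs a full distance computation.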

The specific innovations are as follows: the high-dimensional space is partitioned into clusters, a reference point is chosen for each cluster, and the city block distance from each high-dimensional vector in a cluster to the reference point of that cluster is used as its one-dimensional key value to build the index structure. Through clustering and simple key-value comparison, the index structure reduces the number of high-dimensional vectors that take part in the final similarity matching and so speeds up queries. Because the city block distance is the rule for the high-dimensional-to-one-dimensional transformation, the index structure directly supports match search of high-dimensional vectors under the city block distance.

The specific steps of the method are:

(1) A clustering algorithm partitions the set of high-dimensional vectors into clusters, and the cluster centre and cluster radius of each cluster are obtained.

(2) A reference point is chosen for each cluster. Each high-dimensional vector in the cluster is mapped, one by one, to a one-dimensional key value equal to its city block distance to the chosen reference point, and a BlockB-tree is built for the cluster from all of its high-dimensional vectors and their key values. The BlockB-tree manages the key values in its upper levels with a B⁺-tree index structure, and each key value in the leaf level is bound to a pointer to the corresponding high-dimensional vector.

(3) The cluster centre and cluster radius of each cluster are bound to a pointer to the BlockB-tree built for that cluster, forming the CBlockB-tree.

(4) At retrieval time, the query range is used to filter out the clusters that do not intersect the query region, and each cluster that intersects the query range is searched as follows: the city block distance between the query vector and the reference point of the cluster is computed to obtain the query key value in that cluster; the start and end positions of the key values to be searched are obtained from the query range and the query key value; the distances between the query vector and the high-dimensional vectors corresponding to these key values are computed in a scan; and the high-dimensional vectors within the query range are returned as the retrieval result.
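The four steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: a sorted list searched with `bisect` stands in for the B⁺-tree, the cluster centres are supplied by the caller instead of being computed by a clustering algorithm, the cluster centre doubles as the reference point, and the cluster radius is measured with the city block distance. All names (`build_index`, `range_query`, the dictionary keys) are hypothetical.

```python
import bisect


def city_block(a, b):
    """City block (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def build_index(vectors, centres):
    """Steps (1)-(3): partition, map to keys, keep one sorted key list per cluster."""
    clusters = [{"centre": c, "ref": c, "radius": 0, "entries": []} for c in centres]
    for v in vectors:
        # Assumption: nearest-centre assignment stands in for the clustering result.
        cl = min(clusters, key=lambda c: city_block(v, c["centre"]))
        cl["radius"] = max(cl["radius"], city_block(v, cl["centre"]))
        # The one-dimensional key is the distance to the cluster's reference point.
        cl["entries"].append((city_block(v, cl["ref"]), tuple(v)))
    for cl in clusters:
        cl["entries"].sort()  # sorted (key, vector) pairs play the role of the leaf level
        cl["keys"] = [k for k, _ in cl["entries"]]
    return clusters


def range_query(clusters, q, r):
    """Step (4): return (distance, vector) pairs within city block distance r of q."""
    results = []
    for cl in clusters:
        # Cluster filter: skip clusters that cannot intersect the query range.
        if city_block(q, cl["centre"]) - cl["radius"] > r:
            continue
        qkey = city_block(q, cl["ref"])
        lo = bisect.bisect_left(cl["keys"], qkey - r)   # start position
        hi = bisect.bisect_right(cl["keys"], qkey + r)  # end position
        for _, v in cl["entries"][lo:hi]:
            d = city_block(q, v)  # full distance only for surviving candidates
            if d <= r:
                results.append((d, v))
    return sorted(results)


# Hypothetical toy data: two well-separated groups of 2-dimensional vectors.
vectors = [(0, 0), (1, 1), (2, 0), (10, 10), (11, 9), (9, 12)]
index = build_index(vectors, centres=[(1, 0), (10, 10)])
hits = range_query(index, (10, 11), 2)
```

In this toy run the first cluster is discarded by the cluster filter alone, and only the three vectors of the second cluster reach the full distance computation.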

Further, the clustering algorithm in step 1 includes K-means clustering.

Further, choosing a reference point for each cluster in step 2 includes choosing the origin or the cluster centre as the reference point.

Still further, the retrieval in step 4 includes using a range query or a k-nearest-neighbor (kNN) query.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the invention and form part of it; the illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1(a) is a flowchart of the method of the invention.

Fig. 1(b) is an example diagram of the index structure CBlockB-tree of the invention.

Fig. 2 is a block diagram of a range query on the index structure CBlockB-tree of the invention.

Fig. 3 is a block diagram of a kNN query on the index structure CBlockB-tree of the invention.

Detailed description of the embodiments

To make the technical problem to be solved by the invention and its technical solution clearer, specific embodiments of the invention are further described below with reference to the drawings and examples.

Fig. 1(a) shows the flowchart for building the index structure of the high-dimensional vector search method combining clustering and the city block distance provided by an embodiment of the invention:

First, a clustering algorithm partitions the set of high-dimensional vectors into spatial clusters, yielding the high-dimensional data of each cluster. Second, the cluster centre and radius of each cluster are computed, and a reference point is chosen for each cluster. Third, the city block distance between each high-dimensional vector in a cluster and the reference point of that cluster is computed, one vector at a time, to obtain the one-dimensional key value of each vector. Then the high-dimensional vectors of each cluster and their key values are inserted one by one to build a BlockB-tree for the cluster; the BlockB-tree manages the key values in its upper levels with a B⁺-tree index structure, and each key value in the leaf level is bound to a pointer to the corresponding high-dimensional vector. Finally, the cluster centre and cluster radius of each cluster are bound to the BlockB-tree built for that cluster, forming the CBlockB-tree index structure. (As shown in Fig. 1(b), the top layer is the cluster information layer; the middle layer consists of the B⁺-trees built from the one-dimensional key values of each cluster; the bottom layer is the vector layer that stores the high-dimensional vectors; and each key value in the leaf level of each B⁺-tree is bound to a pointer to its corresponding high-dimensional vector.)

At retrieval time, the query vector and the query range are used to determine which clusters intersect the query, and each intersecting cluster is then searched further. The search within a cluster proceeds as follows: first, the city block distance between the query vector and the reference point of the cluster is computed to obtain the query key value for that cluster; second, the query key value and the query range determine the start and end positions (the search range) of the key values to be searched on the BlockB-tree of that cluster; finally, the key values are scanned one by one from the start position to the end position of the key value layer, the distance between the query vector and the high-dimensional vector corresponding to each key value is computed, and the vectors whose distance to the query vector lies within the query range are returned, giving the set of similar vectors.
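The one-by-one insertion used during construction, which claim 1 spells out as "select the nearest cluster, update its radius, compute the key value, place it in key order", can be sketched as below. The sketch makes several assumptions: a sorted Python list replaces the B⁺-tree, so the leaf-node handling (borrowing from a sibling, splitting) is not modelled; the distance to the cluster centres is taken to be the city block distance, which the text does not specify; and the origin is used as the reference point, one of the two options the patent names. The names `new_cluster` and `insert_vector` are hypothetical.

```python
import bisect


def city_block(a, b):
    """City block (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def new_cluster(centre, ref):
    """Empty cluster record: centre, reference point, radius, sorted keys, vectors."""
    return {"centre": centre, "ref": ref, "radius": 0, "keys": [], "vectors": []}


def insert_vector(clusters, v):
    """Insert v into its nearest cluster and return the key value it received."""
    # Select the cluster whose centre is closest to the vector.
    cl = min(clusters, key=lambda c: city_block(v, c["centre"]))
    # Grow the cluster radius if the new vector lies outside it.
    cl["radius"] = max(cl["radius"], city_block(v, cl["centre"]))
    # Map the vector to its one-dimensional key value.
    key = city_block(v, cl["ref"])
    # Locate the position by key size and keep keys and vectors aligned,
    # so that keys[i] "points to" vectors[i].
    pos = bisect.bisect_right(cl["keys"], key)
    cl["keys"].insert(pos, key)
    cl["vectors"].insert(pos, tuple(v))
    return key


# Hypothetical example: two clusters that share the origin as reference point.
clusters = [new_cluster((1, 0), (0, 0)), new_cluster((10, 10), (0, 0))]
for vec in [(2, 0), (0, 0), (11, 9), (1, 1)]:
    insert_vector(clusters, vec)
```

After these four insertions the first cluster holds three vectors in key order and the second holds one; the vectors (2, 0) and (1, 1) receive the same key value, which illustrates the information loss of the one-dimensional mapping.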

The retrieval modes of the invention include range queries and kNN queries. The flowchart of the range query is shown in Fig. 2, and that of the kNN query in Fig. 3. As Fig. 3 shows, the kNN query is implemented by means of range queries.
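One common way to build a kNN query on top of range queries is to repeat the range query with a growing radius until at least k vectors are found. The text here does not reproduce the exact procedure of Fig. 3, so the sketch below is an assumption: the initial radius `r0`, the growth factor, and the brute-force `range_query` that stands in for a range query on the index are all hypothetical.

```python
def city_block(a, b):
    """City block (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def range_query(data, q, r):
    """Brute-force stand-in for a range query on the index (sorted by distance)."""
    return sorted((city_block(q, v), v) for v in data if city_block(q, v) <= r)


def knn_query(data, q, k, r0=1.0, growth=2.0):
    """Answer a kNN query by repeating range queries with a growing radius."""
    k = min(k, len(data))  # cannot return more vectors than are stored
    r = r0
    while True:
        hits = range_query(data, q, r)
        # Once the range holds at least k vectors, the k nearest neighbours
        # all lie inside it, so the first k hits are the exact answer.
        if len(hits) >= k:
            return hits[:k]
        r *= growth  # enlarge the query range and try again


# Hypothetical toy data.
data = [(0, 0), (1, 2), (5, 5), (2, 1), (8, 8)]
neighbours = knn_query(data, (1, 1), k=3)
```

In this run the first range query (radius 1) finds two vectors and the second (radius 2) finds three, so the loop stops after two rounds. A small initial radius keeps the number of scanned candidates low at the price of more rounds.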

The high-dimensional vectors above may be feature vectors of images, video, or audio.

It should be understood that the above description of the embodiments is relatively specific and should not therefore be regarded as limiting the scope of patent protection of the invention, which is defined by the claims.

Claims

1. A high-dimensional vector search method combining clustering and the city block distance, characterized by the following steps:

1) a clustering algorithm partitions the set of high-dimensional vectors into clusters, and the cluster centre and cluster radius of each cluster are obtained;

2) a reference point is chosen for each cluster; each high-dimensional vector in the cluster is mapped, one by one, to a one-dimensional key value equal to its city block distance to the chosen reference point, and a BlockB-tree is built for the cluster from all of its high-dimensional vectors and their key values; the BlockB-tree manages the key values in its upper levels with a B⁺-tree index structure, and each key value in the leaf level is bound to a pointer to the corresponding high-dimensional vector;

3) the cluster centre and cluster radius of each cluster are bound to a pointer to the BlockB-tree built for that cluster, forming the CBlockB-tree; when a high-dimensional vector is inserted into the CBlockB-tree, the cluster nearest to the vector is first selected for the insertion according to the distance from the vector to the cluster centre of each cluster, and its cluster radius is updated; the one-dimensional key value of the vector to be inserted is then obtained as the city block distance from the vector to the reference point of that cluster; the leaf node of the BlockB-tree of that cluster into which the key value should be inserted is located according to the size of the key value; if that leaf node is not full, the key value is inserted into it directly, a pointer to the corresponding high-dimensional vector is generated, and the corresponding key value in its parent node is updated; if that leaf node is full, it is handled as follows:

31) if the left or right sibling of the leaf node is not full, the high-dimensional vector and key value to be inserted are inserted with the help of that sibling, and the corresponding key value in the parent node is updated;

32) if both the left and right siblings are full, the leaf node is split directly, taking in the high-dimensional vector and key value to be inserted; the new leaf node produced by the split is inserted into the parent node, and the corresponding key value in the parent node is updated; if the parent node is also full, the split propagates upward, and the corresponding key values are updated;

4) at retrieval time, the query range is used to filter out the clusters that do not intersect the query region, and each cluster that intersects the query range is searched as follows: the city block distance between the query vector and the reference point of the cluster is computed to obtain the query key value in that cluster; the start and end positions of the key values to be searched are obtained from the query range and the query key value; the distances between the query vector and the high-dimensional vectors corresponding to these key values are computed in a scan; and the high-dimensional vectors within the query range are returned as the retrieval result.

2. The method of claim 1, characterized in that the clustering algorithm in step 1 includes K-means clustering.

3. The method of claim 1, characterized in that choosing a reference point for each cluster in step 2 includes choosing the origin or the cluster centre as the reference point.

4. The method of claim 1, characterized in that the retrieval in step 4 includes using a range query or a kNN query.