CN103514264A - Method for searching high-dimensional vector combining clustering and city block distances - Google Patents


Info

Publication number
CN103514264A
Authority
CN
China
Prior art keywords
cluster
cluster data
key value
high dimension
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310365384.2A
Other languages
Chinese (zh)
Inventor
吕锐
黄祥林
陈明祥
杨丽芳
储达峰
高庆
魏海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XINHUA NEWS AGENCY
Communication University of China
Original Assignee
XINHUA NEWS AGENCY
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XINHUA NEWS AGENCY, Communication University of China
Priority to CN201310365384.2A
Publication of CN103514264A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/435 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41 Indexing; Data structures therefor; Storage structures

Abstract

The invention discloses a high-dimensional vector search method that combines clustering with the city block distance. The method introduces an index structure, the CBlockB-tree, that combines clustering and the city block distance: a clustering algorithm first partitions the set of high-dimensional vectors into clusters, and a BlockB-tree is then built for each cluster to form the CBlockB-tree. During a search, clustering filters out the clusters that do not intersect the query region, and comparing the one-dimensional key values obtained from the high-dimensional vectors further reduces the number of final vector similarity computations, which speeds up high-dimensional vector search. The index structure also directly supports match searching with the simple and efficient city block distance.

Description

High-dimensional vector search method combining clustering and city block distance
Technical field
The invention belongs to data processing fields such as multimedia information retrieval, intelligent information processing, and data mining, and specifically relates to a high-dimensional vector search method that combines clustering and the city block distance.
Background art
With the development of computers and information technology, massive amounts of multimedia data have been produced, and quickly finding the required information in a massive multimedia database is a key problem in current multimedia database research. The traditional approach is to annotate multimedia data manually and then retrieve it by text search. Manual annotation, however, is labor-intensive and highly subjective, and fully manual annotation is not feasible for explosively growing multimedia data, so content-based multimedia information retrieval techniques need to be studied.
Content-based multimedia information retrieval works as follows: a feature transformation maps each multimedia object to a point in a high-dimensional space, its feature vector, which describes the object, and the feature vectors together form a feature database. The same feature transformation then extracts the feature vector of the query object, and similarity matching between feature vectors implements the similarity search over the multimedia information. Similarity search over multimedia information thus becomes the problem of finding the set of points nearest to a given query point in a high-dimensional feature space.
The simplest and most intuitive way to find the points closest to a given query point in a high-dimensional space is a sequential scan: each feature (high-dimensional vector) in the feature database is matched against the query point in turn, and the best-matching feature points are returned as the retrieval result. The computation time of a sequential scan grows linearly with the number of features in the database and with the feature dimension, so when the feature database is very large a sequential scan cannot meet real-time requirements. The most common way to speed up retrieval is to use high-dimensional indexing techniques.
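As a baseline for comparison, the sequential scan described above can be sketched in a few lines. This is only an illustration: the function names `city_block` and `sequential_scan` and the sample data are assumptions introduced here, and the city block distance is used as the similarity measure because it is the metric this document focuses on.

```python
from typing import List, Sequence, Tuple


def city_block(a: Sequence[float], b: Sequence[float]) -> float:
    """City block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sequential_scan(features: List[Sequence[float]],
                    query: Sequence[float],
                    k: int) -> List[Tuple[float, int]]:
    """Compare the query with every stored feature and keep the k closest.

    Returns (distance, position in the feature list) pairs, nearest first.
    The cost is one distance computation per stored feature.
    """
    scored = [(city_block(f, query), i) for i, f in enumerate(features)]
    scored.sort()
    return scored[:k]


# Hypothetical two-dimensional feature database, for illustration only.
features = [(0.0, 0.0), (1.0, 2.0), (5.0, 5.0), (2.0, 1.0)]
nearest = sequential_scan(features, (1.0, 1.0), k=2)
```

Every query touches every feature, which is the linear cost that the index structures discussed next try to avoid.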
To manage massive sets of high-dimensional vectors, researchers have proposed a large number of index structures, the classic ones being the R-tree family represented by the R-tree. The R-tree was proposed by Guttman in the 1980s as an index structure for managing multidimensional rectangular data. It is a height-balanced tree in which each node is represented by the minimum bounding rectangle (MBR) of all the data in that node, and the actual data appear only in the leaf nodes. With extensions, this index structure can also manage point data in high-dimensional spaces. A query searches downward from the root level to the leaf level; the minimum distance between the query vector and each node's MBR determines whether the query range intersects a node, which allows pruning, so only subtrees that may contain results are searched and retrieval is faster. The R-tree allows the spaces of different nodes to overlap, which hurts its query efficiency. To improve on the R-tree, researchers have successively proposed index structures such as the R+-tree, R*-tree, SS-tree, SR-tree, X-tree, and A-tree. As the feature dimension increases, however, the query efficiency of these tree index structures drops sharply and can even fall below that of a sequential scan; this is the so-called "curse of dimensionality".
Besides tree structures, there are index structures based on a high-dimensional to one-dimensional transformation, for example the pyramid technique, NB-tree, iDistance, and iMinMax. Such an index structure maps each high-dimensional vector to a one-dimensional value (called a key value) according to some rule and then manages the key values with a one-dimensional B+-tree, in whose leaf level the key values are stored in sorted order. For a query, the same transformation rule first computes the query key value of the query vector; the query range then determines the start and end positions of the key values to be searched; the high-dimensional vectors corresponding to those key values are scanned in turn, their similarity to the query vector is computed, and the most similar vectors are returned as the retrieval result. It follows from this query procedure that the performance of such an index structure is in every case at least as good as that of a sequential scan, and extensive prior experiments show that its performance degrades only slowly as the dimension and data volume increase.
Index structures such as the pyramid technique, NB-tree, iDistance, and iMinMax prune by a simple comparison of a single key value. They need no complicated distance computation and retrieve efficiently, but the high-dimensional to one-dimensional transformation loses a large amount of information, so different vectors may share the same one-dimensional key value. A single key value can therefore filter out only a small fraction of the data, the final similarity matching still requires many operations, and the query cost remains considerable. Moreover, most of the previously proposed transformation-based index structures are built on the Euclidean distance, whereas the city block distance is one of the most commonly used metrics in high-dimensional vector similarity matching: it is simple to compute and gives high retrieval efficiency.
Summary of the invention
The object of the invention is to provide an index structure that combines clustering and the city block distance, the CBlockB-tree (short for Clustering BlockB-tree). Clustering lets this index structure filter out the clusters that do not intersect the query region, and comparing the key values obtained by the high-dimensional to one-dimensional transformation further reduces the number of final vector similarity computations, which speeds up high-dimensional vector search. The index structure also directly supports match searching with the simple and efficient city block distance.
The overall idea of the invention is as follows. A clustering algorithm first partitions the set of high-dimensional vectors into clusters. A reference point is then chosen for each cluster, and each high-dimensional vector in the cluster is mapped to a one-dimensional key value, namely its city block distance to that reference point. Finally, a BlockB-tree is built for each cluster: the BlockB-tree is a B+-tree over the one-dimensional key values of the cluster, in which each key value in the leaf level is bound to a pointer to its corresponding high-dimensional vector. At retrieval time, only the clusters that intersect the query range need to be searched. In each such cluster, the same reference point and the city block distance map the query vector to a query key value; the query key value and the query range determine the start and end positions (the search range) of the key values to search in that cluster; the feature vectors corresponding to the key values in the search range are scanned, their similarity to the query vector is computed, and the most similar vectors are returned as the retrieval result.
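Why a key-value interval and a cluster test are enough to avoid missing results can be seen from the triangle inequality. The derivation below is a sketch under assumptions: the symbols (query vector q, stored vector x, reference point o, cluster center c, cluster radius R, query radius r, dimension n) are introduced here for illustration, the text does not spell out the interval endpoints, and the cluster radius is assumed to be measured with the same city block distance d as the query.

```latex
\begin{aligned}
d(x, y) &= \sum_{i=1}^{n} |x_i - y_i|, \qquad \mathrm{key}(x) = d(x, o), \\
|\mathrm{key}(q) - \mathrm{key}(x)| &= |d(q, o) - d(x, o)| \le d(q, x), \\
d(q, x) \le r \;&\Longrightarrow\; \mathrm{key}(q) - r \le \mathrm{key}(x) \le \mathrm{key}(q) + r, \\
d(q, x) \le r \;&\Longrightarrow\; d(q, c) \le d(q, x) + d(x, c) \le r + R .
\end{aligned}
```

Under these assumptions, a cluster with d(q, c) > r + R cannot contain a result and can be skipped, and inside a searched cluster only key values in [key(q) - r, key(q) + r] need to be scanned. Vectors in that interval are only candidates, so their true distance to q still has to be computed.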
Specific innovations: the high-dimensional space is partitioned into clusters, a reference point is chosen for each cluster, the city block distance from each high-dimensional vector in a cluster to the cluster's reference point gives its one-dimensional key value, and the index structure is built from these key values. Through clustering and simple key-value comparison, this index structure reduces the number of high-dimensional vectors that take part in the final similarity matching and speeds up queries; and because the city block distance is the rule for the high-dimensional to one-dimensional transformation, the index structure directly supports match searching of high-dimensional vectors with the city block distance.
The specific steps of the method are:
(1) A clustering algorithm partitions the set of high-dimensional vectors into clusters, giving the cluster center and cluster radius of each cluster.
(2) A reference point is chosen for each cluster. Each high-dimensional vector in the cluster is mapped to a one-dimensional key value, its city block distance to the chosen reference point, and a BlockB-tree is built for the cluster from all of its high-dimensional vectors and their key values. The BlockB-tree manages the key values in its upper levels with a B+-tree index structure, and each key value in the leaf level is bound to a pointer to the corresponding high-dimensional vector.
(3) The cluster center and cluster radius of each cluster are bound to a pointer to the BlockB-tree built for that cluster, forming the CBlockB-tree.
(4) At retrieval time, the query range is used to filter out the clusters that do not intersect the query region, and every cluster that intersects the query range is searched as follows: the city block distance between the query vector and the cluster's reference point gives the query key value in that cluster; the query range and the query key value give the start and end positions of the key values to search; the high-dimensional vectors corresponding to those key values are scanned and their distances to the query vector computed; and the high-dimensional vectors within the query range are returned as the retrieval result.
Further, the clustering algorithm in step 1 includes K-means clustering.
Further, choosing a reference point for each cluster in step 2 includes choosing the origin or the cluster center as the reference point.
Further, the retrieval in step 4 includes range queries or k-nearest-neighbor (kNN) queries.
Brief description of the drawings
The drawings described here provide a further understanding of the invention and form part of it; the illustrative embodiments of the invention and their descriptions explain the invention and do not unduly limit it. In the drawings:
Fig. 1(a) is a flowchart of the method of the invention.
Fig. 1(b) is an example of the CBlockB-tree index structure of the invention.
Fig. 2 is a block diagram of a range query on the CBlockB-tree index structure of the invention.
Fig. 3 is a block diagram of a kNN query on the CBlockB-tree index structure of the invention.
Detailed description of the embodiments
To make the technical problem to be solved and the technical solution of the invention clearer, specific embodiments of the invention are described further below with reference to the drawings and examples.
Fig. 1(a) shows the flowchart for building the index structure of the high-dimensional vector search method combining clustering and city block distance provided by an embodiment of the invention:
First, a clustering algorithm partitions the set of high-dimensional vectors into clusters in space, giving the high-dimensional data of each cluster. Second, the cluster center and radius of each cluster are computed, and a reference point is chosen for each cluster. Third, the city block distance between each high-dimensional vector in a cluster and that cluster's reference point is computed, giving the one-dimensional key value of each high-dimensional vector. Then the high-dimensional vectors of each cluster and their key values are inserted one by one to build a BlockB-tree for each cluster; the BlockB-tree manages the key values in its upper levels with a B+-tree index structure, and each key value in the leaf level is bound to a pointer to the corresponding high-dimensional vector. Finally, the cluster center and cluster radius of each cluster are bound to the BlockB-tree built for that cluster, forming the CBlockB-tree index structure. (As shown in Fig. 1(b), the top layer is the cluster information layer, the middle layer consists of the B+-trees built from the one-dimensional key values of each cluster, and the bottom layer is the vector layer that stores the high-dimensional vectors; each key value in the leaf level of each B+-tree is bound to a pointer to its corresponding high-dimensional vector.)
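The construction steps above can be sketched as follows. This is a simplified illustration, not the patented implementation: the clustering result is assumed to be given as a ready-made partition (standing in for the output of, for example, K-means), the reference point is taken to be the cluster center (one of the two options the text names), the cluster radius is assumed to be measured with the city block distance, and a sorted Python list of key values stands in for the leaf level of the B+-tree. All names (`Cluster`, `build_cluster`, `build_index`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def city_block(a: Sequence[float], b: Sequence[float]) -> float:
    """City block (L1) distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


@dataclass
class Cluster:
    center: Vector          # cluster information layer
    radius: float           # largest distance from the center to a member
    reference: Vector       # point used to compute key values
    keys: List[float]       # sorted key values (stand-in for B+-tree leaves)
    vectors: List[Vector]   # vectors[i] is the vector bound to keys[i]


def build_cluster(points: List[Vector]) -> Cluster:
    """Build the per-cluster structure from the vectors of one cluster."""
    dim = len(points[0])
    center = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
    radius = max(city_block(p, center) for p in points)
    reference = center  # assumption: use the cluster center as reference point
    # Key value of a vector = its city block distance to the reference point.
    pairs = sorted((city_block(p, reference), p) for p in points)
    return Cluster(center, radius, reference,
                   [k for k, _ in pairs], [p for _, p in pairs])


def build_index(partition: List[List[Vector]]) -> List[Cluster]:
    """Build one structure per non-empty cluster of the partition."""
    return [build_cluster(points) for points in partition if points]


# Hypothetical pre-clustered data, for illustration only.
partition = [[(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)],
             [(10.0, 10.0), (12.0, 10.0)]]
index = build_index(partition)
```

In the first cluster two different vectors receive the same key value, which shows why the key comparison only narrows the candidates and the final distance computation is still needed.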
At retrieval time, the query vector and query range determine which clusters intersect the query, and each intersecting cluster is searched further as follows. First, the city block distance between the query vector and the cluster's reference point is computed, giving the query key value for the search in that cluster. Second, the query key value and the query range determine the start and end positions (the search range) of the key values on the cluster's BlockB-tree. Finally, the key values are scanned one by one from the start position to the end position of the key-value layer, the distance between the query vector and the high-dimensional vector corresponding to each key value is computed, and the vectors whose distance to the query vector lies within the query range are returned, giving the set of similar vectors.
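A range query of this kind can be sketched as below. The sketch makes several assumptions that the text does not state: each cluster is represented as a plain dictionary with a sorted `keys` list standing in for the B+-tree leaf level, the cluster radius is measured with the city block distance, a cluster is skipped when the distance from the query to its center exceeds the cluster radius plus the query radius, and the key interval searched is the query key value plus or minus the query radius. The function name `range_query` and the hand-built index are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Sequence, Tuple


def city_block(a: Sequence[float], b: Sequence[float]) -> float:
    """City block (L1) distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def range_query(index: List[Dict], query: Sequence[float],
                r: float) -> List[Tuple[float, tuple]]:
    """Return (distance, vector) pairs within city block distance r of query."""
    results = []
    for cluster in index:
        # Cluster filter: skip clusters that cannot intersect the query region.
        if city_block(query, cluster["center"]) > cluster["radius"] + r:
            continue
        # Map the query to a key value with the cluster's reference point.
        qkey = city_block(query, cluster["reference"])
        # Start and end positions of the key values that need to be scanned.
        start = bisect_left(cluster["keys"], qkey - r)
        end = bisect_right(cluster["keys"], qkey + r)
        # Candidates only: verify each one with the true distance.
        for vector in cluster["vectors"][start:end]:
            dist = city_block(query, vector)
            if dist <= r:
                results.append((dist, vector))
    return sorted(results)


# Hand-built two-cluster index; reference point = cluster center (assumption).
index = [
    {"center": (2.0, 1.0), "radius": 3.0, "reference": (2.0, 1.0),
     "keys": [2.0, 3.0, 3.0],
     "vectors": [(2.0, 3.0), (0.0, 0.0), (4.0, 0.0)]},
    {"center": (11.0, 10.0), "radius": 1.0, "reference": (11.0, 10.0),
     "keys": [1.0, 1.0],
     "vectors": [(10.0, 10.0), (12.0, 10.0)]},
]
hits = range_query(index, (1.0, 0.0), r=1.5)
```

In this example the second cluster is rejected by the cluster filter alone, and in the first cluster all three key values fall in the search interval but only one vector survives the final distance check.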
Retrieval in the invention supports range queries and kNN queries; Fig. 2 shows the flowchart of a range query and Fig. 3 the flowchart of a kNN query. As Fig. 3 shows, a kNN query is implemented by means of range queries.
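One common way to realize a kNN query through range queries is to repeat the range query with a growing radius until at least k results are found. The text does not describe how the radius is chosen, so the initial radius `r0` and the doubling rule below are assumptions, as are the dictionary-based index layout and the names `range_query` and `knn_query`.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Sequence, Tuple


def city_block(a: Sequence[float], b: Sequence[float]) -> float:
    """City block (L1) distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def range_query(index: List[Dict], query: Sequence[float],
                r: float) -> List[Tuple[float, tuple]]:
    """All (distance, vector) pairs within distance r, nearest first."""
    results = []
    for c in index:
        if city_block(query, c["center"]) > c["radius"] + r:
            continue  # cluster cannot intersect the query region
        qkey = city_block(query, c["reference"])
        start = bisect_left(c["keys"], qkey - r)
        end = bisect_right(c["keys"], qkey + r)
        for v in c["vectors"][start:end]:
            d = city_block(query, v)
            if d <= r:
                results.append((d, v))
    return sorted(results)


def knn_query(index: List[Dict], query: Sequence[float], k: int,
              r0: float = 1.0) -> List[Tuple[float, tuple]]:
    """k nearest neighbours via range queries with a doubling radius."""
    if r0 <= 0:
        raise ValueError("r0 must be positive")
    total = sum(len(c["keys"]) for c in index)
    k = min(k, total)  # cannot return more vectors than are stored
    r = r0
    while True:
        hits = range_query(index, query, r)
        # Every vector outside the radius is farther than every hit, so once
        # k hits exist the k nearest overall are among them.
        if len(hits) >= k:
            return hits[:k]
        r *= 2  # assumption: double the radius and try again


index = [
    {"center": (2.0, 1.0), "radius": 3.0, "reference": (2.0, 1.0),
     "keys": [2.0, 3.0, 3.0],
     "vectors": [(2.0, 3.0), (0.0, 0.0), (4.0, 0.0)]},
    {"center": (11.0, 10.0), "radius": 1.0, "reference": (11.0, 10.0),
     "keys": [1.0, 1.0],
     "vectors": [(10.0, 10.0), (12.0, 10.0)]},
]
neighbours = knn_query(index, (1.0, 0.0), k=2)
```

A small initial radius keeps early range queries cheap at the price of more iterations; a real implementation would likely reuse work between iterations instead of restarting each range query from scratch.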
The high-dimensional vectors above may be feature vectors of images, video, or audio.
It should be understood that the above description of the embodiments is fairly specific and should not be taken as limiting the scope of patent protection of the invention, which is defined by the claims.

Claims (4)

1. A high-dimensional vector search method combining clustering and city block distance, characterized by the following steps:
1) a clustering algorithm partitions the set of high-dimensional vectors into clusters, giving the cluster center and cluster radius of each cluster;
2) a reference point is chosen for each cluster; each high-dimensional vector in the cluster is mapped to a one-dimensional key value, its city block distance to the chosen reference point; a BlockB-tree is built for the cluster from all of its high-dimensional vectors and their key values; the BlockB-tree manages the key values in its upper levels with a B+-tree index structure, and each key value in the leaf level is bound to a pointer to the corresponding high-dimensional vector;
3) the cluster center and cluster radius of each cluster are bound to a pointer to the BlockB-tree built for that cluster, forming the CBlockB-tree; when a high-dimensional vector is inserted into the CBlockB-tree, the cluster nearest to the vector is first selected according to the distance from the vector to each cluster center, the vector is inserted into that cluster, and the cluster radius is updated; the one-dimensional key value of the vector is then obtained as the city block distance between the vector and the reference point of that cluster; the key value determines the leaf node of the cluster's BlockB-tree into which it should be inserted; if that leaf node is not full, the key value is inserted directly into it, a pointer to the corresponding high-dimensional vector is generated, and the corresponding key value in its parent node is updated; if the leaf node is full, it is handled as follows:
31) if the left or right sibling of the leaf node is not full, the high-dimensional vector and key value are inserted with the help of that sibling, and the corresponding key value in the parent node is updated;
32) if the left and right siblings are both full, the leaf node is split directly, taking the vector and key value to be inserted into account, the new leaf node produced by the split is inserted into its parent node, and the corresponding key value in the parent node is updated; if the parent node is also full, the split propagates upward and the corresponding key values are updated;
4) at retrieval time, the query range is used to filter out the clusters that do not intersect the query region, and every cluster that intersects the query range is searched as follows: the city block distance between the query vector and the cluster's reference point gives the query key value in that cluster; the query range and the query key value give the start and end positions of the key values to search; the high-dimensional vectors corresponding to those key values are scanned and their distances to the query vector computed; and the high-dimensional vectors within the query range are returned as the retrieval result.
2. The method of claim 1, characterized in that the clustering algorithm in step 1 includes K-means clustering.
3. The method of claim 1, characterized in that choosing a reference point for each cluster in step 2 includes choosing the origin or the cluster center as the reference point.
4. The method of claim 1, characterized in that the retrieval in step 4 includes range queries or kNN queries.
CN201310365384.2A 2013-08-21 2013-08-21 Method for searching high-dimensional vector combining clustering and city block distances Pending CN103514264A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310365384.2A CN103514264A (en) 2013-08-21 2013-08-21 Method for searching high-dimensional vector combining clustering and city block distances

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310365384.2A CN103514264A (en) 2013-08-21 2013-08-21 Method for searching high-dimensional vector combining clustering and city block distances

Publications (1)

Publication Number Publication Date
CN103514264A true CN103514264A (en) 2014-01-15

Family

ID=49896988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310365384.2A Pending CN103514264A (en) 2013-08-21 2013-08-21 Method for searching high-dimensional vector combining clustering and city block distances

Country Status (1)

Country Link
CN (1) CN103514264A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111309984A (en) * 2020-03-10 2020-06-19 支付宝(杭州)信息技术有限公司 Method and device for searching node vector from database by using index

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020147703A1 (en) * 2001-04-05 2002-10-10 Cui Yu Transformation-based method for indexing high-dimensional data for nearest neighbour queries
CN101211355A (en) * 2006-12-30 2008-07-02 中国科学院计算技术研究所 Image inquiry method based on clustering
CN101324904A (en) * 2008-07-04 2008-12-17 西安交通大学 High-dimension index structure technique of equipment failure cases based on distance measurement
CN102306202A (en) * 2011-09-30 2012-01-04 中国传媒大学 High-dimension vector rapid searching algorithm based on block distance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张海勤 等: "聚类金字塔树:一种新的高维空间数据索引方法", 《中国科学技术大学学报》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111309984A (en) * 2020-03-10 2020-06-19 支付宝(杭州)信息技术有限公司 Method and device for searching node vector from database by using index
CN111309984B (en) * 2020-03-10 2023-09-05 支付宝(杭州)信息技术有限公司 Method and device for retrieving node vector from database by index

Similar Documents

Publication Publication Date Title
CN102163218B (en) Graph-index-based graph database keyword vicinity searching method
CN102306202B (en) High-dimension vector rapid searching algorithm based on block distance
Chen et al. Efficient metric indexing for similarity search and similarity joins
CN107766433B (en) Range query method and device based on Geo-BTree
CN106897374B (en) Personalized recommendation method based on track big data nearest neighbor query
CN109033340A (en) A kind of searching method and device of the point cloud K neighborhood based on Spark platform
CN104933175B (en) Performance data correlation analysis method and performance monitoring system
Chen et al. Metric similarity joins using MapReduce
CN104391908B (en) Multiple key indexing means based on local sensitivity Hash on a kind of figure
CN111813778B (en) Approximate keyword storage and query method for large-scale road network data
CN103500165A (en) High-dimensional vector quantity search method combining clustering and double key values
Demiryurek et al. Indexing network voronoi diagrams
CN102156756A (en) Method for finding optimal path in road network based on graph embedding
CN102375827A (en) Method for fast loading versioned electricity network model database
Meng et al. DSTTMOD: A future trajectory based moving objects database
CN112035586A (en) Spatial range query method based on extensible learning index
CN109446293B (en) Parallel high-dimensional neighbor query method
CN114372058A (en) Spatial data management method and device, storage medium and block chain system
Abbasifard et al. Efficient indexing for past and current position of moving objects on road networks
CN114153821A (en) Electric quantity graph database construction and search method based on graph theory
Cho et al. A GPS trajectory map-matching mechanism with DTG big data on the HBase system
CN103365960A (en) Off-line searching method of structured data of electric power multistage dispatching management
CN103514264A (en) Method for searching high-dimensional vector combining clustering and city block distances
CN111782663A (en) Aggregation index structure and aggregation index method for improving aggregation query efficiency
CN103514263A (en) Building method and retrieval method by adoption of double-key-value high-dimensional index structure

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140115