MLR-Index: An Index Structure for Fast and Scalable Similarity Search in High Dimensions

Author(s):  
Rahul Malik ◽  
Sangkyum Kim ◽  
Xin Jin ◽  
Chandrasekar Ramachandran ◽  
Jiawei Han ◽  
...  
2016 ◽  
Vol 2016 ◽  
pp. 1-12
Author(s):  
Chunyan Shuai ◽  
Hengcheng Yang ◽  
Xin Ouyang ◽  
Siqi Li ◽  
Zheng Chen

In high-dimensional spaces, exact-match and similarity search at low computing and storage cost remain difficult research topics, and efficiency must be traded off against accuracy. In this paper, we propose a new structure, Similar-PBF-PHT, to represent the items of a high-dimensional set and to retrieve both exact and similar items. The Similar-PBF-PHT contains three parts: parallel Bloom filters (PBFs), parallel hash tables (PHTs), and a bitmatrix. Experiments show that the Similar-PBF-PHT is effective for membership queries and K-nearest neighbors (K-NN) search. For exact queries, the Similar-PBF-PHT has a low false positive probability (FPP) and acceptable memory costs. For K-NN queries, the average overall ratio and rank-i ratio under the Hamming distance are accurate, and the ratios under the Euclidean distance are acceptable. Retrieving exact and similar items costs CPU time rather than I/O, and the structure can handle different data formats, not only numerical values.
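The parallel Bloom filter component mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes one Bloom filter per dimension and models only that part: the PHTs and the bitmatrix are not represented. The class names (`BloomFilter`, `ParallelBloomFilters`), the SHA-256 based hashing, and the default sizes are all illustrative assumptions, not details taken from the paper.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter (illustrative; sizes and hashing are assumptions)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, value):
        # Derive num_hashes bit positions by salting SHA-256 with a seed byte.
        data = repr(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def __contains__(self, value):
        return all(self.bits[pos] for pos in self._positions(value))


class ParallelBloomFilters:
    """One Bloom filter per dimension (hypothetical layout, not the paper's)."""

    def __init__(self, num_dims, num_bits=1024, num_hashes=3):
        self.filters = [BloomFilter(num_bits, num_hashes) for _ in range(num_dims)]

    def add(self, item):
        if len(item) != len(self.filters):
            raise ValueError("item has the wrong number of dimensions")
        for bloom, attribute in zip(self.filters, item):
            bloom.add(attribute)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        if len(item) != len(self.filters):
            return False
        return all(attr in bloom for bloom, attr in zip(self.filters, item))
```

Because each dimension is hashed independently, attributes need not be numeric. Per-dimension filters on their own cannot tell which attribute values occurred together: after inserting `("a", 1)` and `("b", 2)`, the query `("a", 2)` is also reported as possibly present.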


2003 ◽  
Vol 03 (01) ◽  
pp. 3-29
Author(s):  
CHRISTIAN A. LANG ◽  
AMBUJ K. SINGH

The performance of nearest neighbor (NN) queries degrades noticeably with increasing dimensionality of the data due to reduced selectivity of high-dimensional data and an increased number of seek operations during NN-query execution. If the NN-radii were known in advance, the disk accesses could be reordered such that seek operations are minimized. We therefore propose a new way of estimating the NN-radius based on the fractal dimensionality and sampling. It is applicable to any page-based index structure. We show that the estimation error is considerably lower than for previous approaches. In the second part of the paper, we present two applications of this technique. We show how the radius estimates can be used to transform k-NN queries into at most two range queries, and how they can be used to reduce the number of page reads during all-NN queries. In both cases, we observe significant speedups over traditional techniques for synthetic and real-world data.
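The idea of answering a k-NN query with at most two range queries can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the radius estimate is passed in rather than derived from the fractal dimensionality, a linear scan stands in for a page-based index, and the second query uses a fallback radius (unbounded by default) that is an assumption of this sketch, since the abstract does not say how the second radius is chosen. The function names are hypothetical.

```python
import math


def range_query(points, query, radius):
    """Return all points within `radius` of `query` (linear scan stand-in)."""
    return [p for p in points if math.dist(p, query) <= radius]


def knn_via_range_queries(points, query, k, estimated_radius,
                          fallback_radius=math.inf):
    """Answer a k-NN query using one range query, or two if the first falls short.

    Returns the k nearest points and the number of range queries issued.
    """
    candidates = range_query(points, query, estimated_radius)
    queries_used = 1
    if len(candidates) < k:
        # The estimate was too small: repeat once with a radius assumed
        # large enough to cover at least k points.
        candidates = range_query(points, query, fallback_radius)
        queries_used = 2
    candidates.sort(key=lambda p: math.dist(p, query))
    return candidates[:k], queries_used
```

For example, with the points `(0, 0)`, `(1, 0)`, `(0, 2)`, `(5, 5)`, query `(0, 0)` and `k=2`, an estimated radius of 1.5 is answered by one range query, while an estimate of 0.5 triggers the second one.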

