Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search

Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points S and a radius parameter r > 0, the r-near neighbor (r-NN) problem asks for a data structure that, given any query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in light of individual fairness and equal opportunity: all points within distance r of the query should have the same probability of being returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality-sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee.
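To make the fairness requirement concrete, the following is a minimal sketch of the trivial linear-scan baseline, not the data structure studied in the paper: it collects every point within distance r of the query and returns one of them uniformly at random. The function name `fair_near_neighbor` and the choice of Euclidean distance are assumptions made for illustration.

```python
import math
import random


def fair_near_neighbor(points, query, r, rng=random):
    """Return a uniformly random point within distance r of query, or None.

    Hypothetical baseline: a full linear scan over the point set.
    Assumes Euclidean distance and points given as equal-length tuples.
    """
    # Every point in the r-ball around the query is a valid answer.
    candidates = [p for p in points if math.dist(p, query) <= r]
    if not candidates:
        return None
    # Uniform choice gives each near neighbor the same probability 1/len(candidates).
    return rng.choice(candidates)


if __name__ == "__main__":
    S = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
    print(fair_near_neighbor(S, (0.0, 0.0), 1.5))
```

The scan takes time linear in the size of S per query. The point of the problem statement is to obtain the same uniformity guarantee with a sublinear-time structure, which a plain LSH lookup does not give.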

Kernel Density Estimation through Density Constrained Near Neighbor Search

2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS) ◽

10.1109/focs46700.2020.00025 ◽

2020 ◽

Author(s):

Moses Charikar ◽

Michael Kapralov ◽

Navid Nouri ◽

Paris Siminelakis

Keyword(s):

Density Estimation ◽

Kernel Density Estimation ◽

Kernel Density ◽

Near Neighbor ◽

Neighbor Search

Privacy-Preserving near Neighbor Search via Sparse Coding with Ambiguation

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp39728.2021.9414115 ◽

2021 ◽

Author(s):

Behrooz Razeghi ◽

Sohrab Ferdowsi ◽

Dimche Kostadinov ◽

Flavio P. Calmon ◽

Slava Voloshynovskiy

Keyword(s):

Sparse Coding ◽

Privacy Preserving ◽

Near Neighbor ◽

Neighbor Search

A Supremum Norm Based Near Neighbor Search in High Dimensional Spaces

Computer Vision and Graphics - Lecture Notes in Computer Science ◽

10.1007/978-3-642-33564-8_72 ◽

2012 ◽

pp. 600-609

Author(s):

Nikolai Sergeev

Keyword(s):

Near Neighbor ◽

High Dimensional ◽

Supremum Norm ◽

Neighbor Search

Fast document summarization using locality sensitive hashing and memory access efficient node ranking

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v6i3.9030 ◽

2016 ◽

Vol 6 (3) ◽

pp. 945

Author(s):

Ercan Canhasi

Keyword(s):

Time Complexity ◽

Nearest Neighbor ◽

Linear Time ◽

Nearest Neighbor Search ◽

Memory Access ◽

Locality Sensitive Hashing ◽

Document Summarization ◽

Neighbor Search ◽

Node Ranking ◽

Similarity Graph

Text modeling and sentence selection are the fundamental steps of a typical extractive document summarization algorithm. The common text modeling method connects pairs of sentences based on their similarity. Even though it can effectively represent the sentence similarity graph of the given document(s), its major drawback is a time complexity of $O(n^2)$, where n is the number of sentences. This quadratic time complexity makes it impractical for large documents. In this paper we propose fast approximation algorithms for text modeling and sentence selection. Our text modeling algorithm reduces the time complexity to near-linear by rapidly finding the most similar sentences to form the sentence similarity graph. To do so, we use Locality-Sensitive Hashing, a fast algorithm for approximate nearest neighbor search. For the sentence selection step we propose a simple, memory-access-efficient node ranking method based on sequentially scanning only the neighborhood arrays. Experimentally, we show that sacrificing a rather small percentage of recall and precision in the quality of the produced summary can reduce the time complexity from quadratic to sub-linear. We see great potential for the proposed method in text summarization on mobile devices and in big text data summarization for the Internet of Things in the cloud. In our experiments, besides evaluating the presented method on the standard general and query multi-document summarization tasks, we also tested it on a few alternative summarization tasks, including general and query, timeline, and comparative summarization.
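The idea of using LSH to avoid comparing all sentence pairs can be sketched as follows. The abstract does not say which LSH family the paper uses, so this example assumes random-hyperplane hashing over simple term-frequency vectors; the function name `build_similarity_graph` and its parameters are hypothetical. Only sentences that share a bucket in at least one hash table become candidate edges of the similarity graph.

```python
import random
from collections import defaultdict


def build_similarity_graph(sentences, num_bits=8, num_tables=4, seed=0):
    """Return candidate edges (i, j), i < j, between likely-similar sentences.

    Hypothetical sketch: random-hyperplane LSH on term-frequency vectors.
    """
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}

    # Sparse term-frequency vector per sentence: {term_id: count}.
    vectors = []
    for s in sentences:
        vec = defaultdict(int)
        for w in s.lower().split():
            vec[index[w]] += 1
        vectors.append(vec)

    edges = set()
    for _ in range(num_tables):
        # One Gaussian random hyperplane per signature bit.
        planes = [[rng.gauss(0.0, 1.0) for _ in vocab] for _ in range(num_bits)]
        buckets = defaultdict(list)
        for sid, vec in enumerate(vectors):
            # Each bit records which side of a hyperplane the vector lies on.
            key = tuple(
                sum(plane[t] * c for t, c in vec.items()) >= 0 for plane in planes
            )
            buckets[key].append(sid)
        # Sentences sharing a bucket become candidate neighbors.
        for ids in buckets.values():
            for a in range(len(ids)):
                for b in range(a + 1, len(ids)):
                    edges.add((ids[a], ids[b]))
    return edges


if __name__ == "__main__":
    docs = ["the cat sat on the mat", "the cat sat on the mat", "stock prices fell"]
    print(sorted(build_similarity_graph(docs)))
```

Pairs inside a bucket are still enumerated exhaustively, so the saving depends on buckets staying small. More bits per signature shrink the buckets, and more tables recover neighbors that a single table misses.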

Near-Neighbor Search in Pattern Distance Spaces

Proceedings of the 2005 SIAM International Conference on Data Mining ◽

10.1137/1.9781611972757.66 ◽

2005 ◽

Author(s):

Haixun Wang ◽

Chang-Shing Perng ◽

Philip S. Yu

Keyword(s):

Near Neighbor ◽

Neighbor Search

Lower Bounds on Near Neighbor Search via Metric Expansion

2010 IEEE 51st Annual Symposium on Foundations of Computer Science ◽

10.1109/focs.2010.82 ◽

2010 ◽

Cited By ~ 22

Author(s):

Rina Panigrahy ◽

Kunal Talwar ◽

Udi Wieder

Keyword(s):

Lower Bounds ◽

Near Neighbor ◽

Neighbor Search

Compact projection: Simple and efficient near neighbor search with practical memory requirements

2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition ◽

10.1109/cvpr.2010.5539973 ◽

2010 ◽

Cited By ~ 24

Author(s):

Kerui Min ◽

Linjun Yang ◽

John Wright ◽

Lei Wu ◽

Xian-Sheng Hua ◽

...

Keyword(s):

Near Neighbor ◽

Neighbor Search

Reverse Query-Aware Locality-Sensitive Hashing for High-Dimensional Furthest Neighbor Search

2017 IEEE 33rd International Conference on Data Engineering (ICDE) ◽

10.1109/icde.2017.66 ◽

2017 ◽

Cited By ~ 2

Author(s):

Qiang Huang ◽

Jianlin Feng ◽

Qiong Fang

Keyword(s):

Locality Sensitive Hashing ◽

High Dimensional ◽

Neighbor Search

A Probabilistic Molecular Fingerprint for Big Data Settings

10.26434/chemrxiv.7176350.v1 ◽

2018 ◽

Author(s):

Daniel Probst ◽

Jean-Louis Reymond

Keyword(s):

Nearest Neighbor ◽

Nearest Neighbor Search ◽

Locality Sensitive Hashing ◽

Molecular Fingerprint ◽

Molecular Fingerprints ◽

Approximate Nearest Neighbor ◽

Neighbor Search ◽

Large Databases ◽

Nearest Neighbor Searches ◽

Extended Connectivity

Background: Among the various molecular fingerprints available to describe small organic molecules, ECFP4 (extended connectivity fingerprint, up to four bonds) performs best in benchmarking drug analog recovery studies because it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high-dimensional representations (≥1,024D) to perform well, so ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem or ZINC perform very slowly due to the curse of dimensionality. Results: Herein we report a new fingerprint, called MHFP6 (MinHash fingerprint, up to six bonds), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality-sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. Furthermore, MHFP6 outperforms ECFP4 in approximate nearest neighbor searches by two orders of magnitude in terms of speed, while decreasing the error rate. Conclusion: MHFP6 is a new molecular fingerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality-sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub (https://github.com/reymond-group/mhfp).
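The MinHash step described above can be illustrated with a generic sketch. This is not the MHFP6 implementation from the linked repository: the helper names, the hash construction, and the toy substructure strings are all assumptions for illustration. Each set of substructure strings is reduced to a fixed-length signature, and the fraction of matching signature positions estimates the Jaccard similarity of the two sets.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # large Mersenne prime used as the hash modulus


def base_hash(item):
    """Map a string to an integer in [0, PRIME) via SHA-1 (stable across runs)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PRIME


def minhash_signature(items, num_perm=128, seed=42):
    """MinHash signature of a non-empty collection of strings (hypothetical helper)."""
    rng = random.Random(seed)
    # Each (a, b) pair defines one hash function x -> (a * x + b) mod PRIME.
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_perm)]
    hashed = [base_hash(x) for x in set(items)]
    # The signature keeps the minimum hash value under each function.
    return [min((a * h + b) % PRIME for h in hashed) for a, b in coeffs]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two equal-length signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    # Toy stand-ins for substructure SMILES sets; not real MHFP6 output.
    mol_a = {"CC", "CCO", "C(C)O", "CO"}
    mol_b = {"CC", "CCO", "CCN", "CN"}
    print(estimate_jaccard(minhash_signature(mol_a), minhash_signature(mol_b)))
```

Because two signatures agree at a given position with probability equal to the Jaccard similarity of the underlying sets, bands of signature entries can serve directly as LSH bucket keys, which is what makes this kind of fingerprint compatible with LSH-based approximate search.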
