Stochastically Robust Personalized Ranking for LSH Recommendation Retrieval

SUMMARY: A crucial part of probabilistic roadmap planners is the nearest neighbor search, which is typically done by exact methods. Unfortunately, searching for neighbors can become a major performance bottleneck as the roadmap grows, especially in high-dimensional spaces. In this paper, we investigate how well approximate nearest neighbor search works with probabilistic roadmap planners. We propose a method based on locality-sensitive hashing and show that it can speed up the construction of the roadmap considerably without reducing the quality of the produced roadmap.
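The summary does not say which hash family the method uses. As a rough illustration only, the sketch below uses a generic quantized random-projection scheme for Euclidean distance; the class name `EuclideanLSH` and all parameters are hypothetical. Points that land in the same bucket as the query become neighbor candidates, so only those are compared exactly.

```python
import math
import random
from collections import defaultdict


class EuclideanLSH:
    """Hypothetical single-table LSH index using quantized random projections."""

    def __init__(self, dim, num_projections=4, bucket_width=1.0, seed=0):
        rng = random.Random(seed)
        self.width = bucket_width
        # One Gaussian direction and one random offset per projection.
        self.projections = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_projections)
        ]
        self.offsets = [rng.uniform(0.0, bucket_width) for _ in range(num_projections)]
        self.table = defaultdict(list)
        self.points = []

    def _key(self, point):
        # Quantize each projection into an integer cell; the tuple is the bucket id.
        return tuple(
            math.floor((sum(a * x for a, x in zip(proj, point)) + b) / self.width)
            for proj, b in zip(self.projections, self.offsets)
        )

    def add(self, point):
        self.points.append(point)
        self.table[self._key(point)].append(len(self.points) - 1)

    def query(self, point, k=5):
        # Only points in the query's bucket are compared exactly, so true
        # neighbors that fall into other buckets can be missed.
        candidates = self.table.get(self._key(point), [])
        return sorted(candidates, key=lambda i: math.dist(self.points[i], point))[:k]
```

In a roadmap planner, `add` would be called for each sampled configuration and `query` would replace the exact neighbor search when connecting a new sample. Practical systems usually keep several such tables to reduce misses.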


Fast document summarization using locality sensitive hashing and memory access efficient node ranking

International Journal of Electrical and Computer Engineering (IJECE) ◽ 10.11591/ijece.v6i3.9030 ◽ 2016 ◽ Vol 6 (3) ◽ pp. 945

Author(s): Ercan Canhasi

Keyword(s): Time Complexity ◽ Nearest Neighbor ◽ Linear Time ◽ Nearest Neighbor Search ◽ Memory Access ◽ Locality Sensitive Hashing ◽ Document Summarization ◽ Neighbor Search ◽ Node Ranking ◽ Similarity Graph

Text modeling and sentence selection are the fundamental steps of a typical extractive document summarization algorithm. The common text modeling method connects pairs of sentences based on their similarity. Even though it can effectively represent the sentence similarity graph of the given document(s), its major drawback is a time complexity of $O(n^2)$, where $n$ is the number of sentences. The quadratic time complexity makes it impractical for large documents. In this paper we propose fast approximation algorithms for text modeling and sentence selection. Our text modeling algorithm reduces the time complexity to near-linear time by rapidly finding the most similar sentences to form the sentence similarity graph. To do so we use Locality-Sensitive Hashing, a fast algorithm for approximate nearest neighbor search. For the sentence selection step we propose a simple memory-access-efficient node ranking method based on the idea of sequentially scanning only the neighborhood arrays. Experimentally, we show that sacrificing a rather small percentage of recall and precision in the quality of the produced summary can reduce the quadratic time complexity to sub-linear. We see great potential for the proposed method in text summarization for mobile devices and in big text data summarization for the Internet of Things in the cloud. In our experiments, besides evaluating the presented method on the standard general and query multi-document summarization tasks, we also tested it on a few alternative summarization tasks, including general and query, timeline, and comparative summarization.
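The two steps described above can be illustrated with a small sketch. It is not the paper's algorithm: it assumes a SimHash-style signature over words with banding to find candidate sentence pairs, and it ranks nodes simply by degree, computed by scanning each neighborhood once. The function names `simhash`, `similarity_graph` and `rank_by_degree` are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def simhash(words, bits=16):
    """Random-hyperplane style signature over a bag of words (assumed scheme)."""
    totals = [0] * bits
    for word in words:
        digest = int.from_bytes(hashlib.sha256(word.encode("utf-8")).digest(), "big")
        for j in range(bits):
            # Each word votes +1 or -1 on every signature bit.
            totals[j] += 1 if (digest >> j) & 1 else -1
    return tuple(t > 0 for t in totals)


def similarity_graph(sentences, bits=16, band_size=4):
    """Link sentences whose signatures agree on at least one whole band."""
    buckets = defaultdict(list)
    for idx, sentence in enumerate(sentences):
        sig = simhash(sentence.lower().split(), bits)
        for start in range(0, bits, band_size):
            buckets[(start, sig[start:start + band_size])].append(idx)
    neighbors = defaultdict(set)
    for members in buckets.values():
        # Pairs are formed only inside a bucket, never across all sentences.
        for a, b in combinations(members, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def rank_by_degree(neighbors, n):
    """Order sentence indices by neighborhood size, largest first."""
    return sorted(range(n), key=lambda i: -len(neighbors[i]))
```

The pairwise comparison is avoided because candidates come from shared buckets; the cost then depends on bucket sizes rather than on all $n^2$ pairs.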


Approximate Nearest Neighbor Search using a Single Space-filling Curve and Multiple Representations of the Data Points

18th International Conference on Pattern Recognition (ICPR'06) ◽ 10.1109/icpr.2006.275 ◽ 2006 ◽ Cited By ~ 20

Author(s): G. Mainar-Ruiz ◽ J. Perez-Cortes

Keyword(s): Nearest Neighbor ◽ Multiple Representations ◽ Nearest Neighbor Search ◽ Space Filling ◽ Approximate Nearest Neighbor Search ◽ Space Filling Curve ◽ Approximate Nearest Neighbor ◽ Neighbor Search ◽ Data Points ◽ Filling Curve


A Probabilistic Molecular Fingerprint for Big Data Settings

10.26434/chemrxiv.7176350.v1 ◽ 2018

Author(s): Daniel Probst ◽ Jean-Louis Reymond

Keyword(s): Nearest Neighbor ◽ Nearest Neighbor Search ◽ Locality Sensitive Hashing ◽ Molecular Fingerprint ◽ Molecular Fingerprints ◽ Approximate Nearest Neighbor ◽ Neighbor Search ◽ Large Databases ◽ Nearest Neighbor Searches ◽ Extended Connectivity

Background: Among the various molecular fingerprints available to describe small organic molecules, ECFP4 (extended connectivity fingerprint, up to four bonds) performs best in benchmarking drug analog recovery studies as it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high-dimensional representations (≥1,024D) to perform well, so ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem or ZINC perform very slowly due to the curse of dimensionality. Results: Herein we report a new fingerprint, called MHFP6 (MinHash fingerprint, up to six bonds), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. Furthermore, MHFP6 outperforms ECFP4 in approximate nearest neighbor searches by two orders of magnitude in terms of speed, while decreasing the error rate. Conclusion: MHFP6 is a new molecular fingerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub (https://github.com/reymond-group/mhfp).
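The MinHash step mentioned in the abstract can be illustrated generically. The sketch below is not the MHFP6 implementation: it applies a textbook MinHash to a set of strings and estimates Jaccard similarity from two signatures. The function names and the example strings (stand-ins for substructure SMILES) are assumptions for illustration.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1


def _base_hash(item):
    """Map a string to an integer below the prime via SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % MERSENNE_PRIME


def make_minhash(num_hashes=128, seed=42):
    """Build a MinHash function with num_hashes random linear hash functions."""
    rng = random.Random(seed)
    params = [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_hashes)
    ]

    def minhash(items):
        # items must be a non-empty collection of strings.
        base = [_base_hash(s) for s in items]
        return [min((a * h + b) % MERSENNE_PRIME for h in base) for a, b in params]

    return minhash


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Because each signature position matches with probability equal to the Jaccard similarity of the two sets, signatures can be split into bands and used as LSH keys, which is what makes approximate nearest neighbor search applicable.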


Fast algorithm for anchor graph hashing

Proceedings of the VLDB Endowment ◽ 10.14778/3447689.3447696 ◽ 2021 ◽ Vol 14 (6) ◽ pp. 916-928

Author(s): Yasuhiro Fujiwara ◽ Sekitoshi Kanai ◽ Yasutoshi Ida ◽ Atsutoshi Kumagai ◽ Naonori Ueda

Keyword(s): Clustering Algorithm ◽ Nearest Neighbor ◽ Nearest Neighbor Search ◽ Computation Cost ◽ Neighbor Search ◽ The Matrix ◽ Data Points ◽ Anchor Points ◽ Data Point ◽ Hash Codes

Anchor graph hashing is used in many applications such as cancer detection, web page classification, and drug discovery. It computes the hash codes from the eigenvectors of the matrix representing the similarities between data points and anchor points; anchors are the points that represent the data distribution. In performing an approximate nearest neighbor search, the hash codes of a query data point are determined by identifying its closest anchor points. Anchor graph hashing, however, incurs high computation cost since (1) the computation cost of obtaining the eigenvectors is quadratic in the number of anchor points, and (2) the similarities of the query data point to all the anchor points must be computed. Our proposal, Tridiagonal hashing, increases the efficiency of anchor graph hashing through two advances: (1) we apply a graph clustering algorithm to compute the eigenvectors from the tridiagonal matrix obtained from the similarities between data points and anchor points, and (2) we detect the anchor points closest to the query data point by using a dimensionality reduction approach. Experiments show that our approach is several orders of magnitude faster than the previous approaches. In addition, it yields higher search accuracy than the original anchor graph hashing approach.


Locality-Sensitive Hashing Techniques for Nearest Neighbor Search

International Journal of Fuzzy Logic and Intelligent Systems ◽ 10.5391/ijfis.2012.12.4.300 ◽ 2012 ◽ Vol 12 (4) ◽ pp. 300-307 ◽ Cited By ~ 11

Author(s): Keon Myung Lee

Keyword(s): Nearest Neighbor ◽ Nearest Neighbor Search ◽ Locality Sensitive Hashing ◽ Neighbor Search


MP-RW-LSH

Proceedings of the VLDB Endowment ◽ 10.14778/3484224.3484226 ◽ 2021 ◽ Vol 14 (13) ◽ pp. 3267-3280

Author(s): Huayi Wang ◽ Jingfan Meng ◽ Long Gong ◽ Jun Xu ◽ Mitsunori Ogihara

Keyword(s): Nearest Neighbor ◽ Edit Distance ◽ State Of The Art ◽ Hash Table ◽ Nearest Neighbor Search ◽ Locality Sensitive Hashing ◽ Algorithmic Problem ◽ Use Case ◽ Hash Tables ◽ Neighbor Search

Approximate Nearest Neighbor Search (ANNS) is a fundamental algorithmic problem, with numerous applications in many areas of computer science. Locality-Sensitive Hashing (LSH) is one of the most popular solution approaches for ANNS. A common shortcoming of many LSH schemes is that since they probe only a single bucket in a hash table, they need to use a large number of hash tables to achieve a high query accuracy. For ANNS-$L_2$, a multi-probe scheme was proposed to overcome this drawback by strategically probing multiple buckets in a hash table. In this work, we propose MP-RW-LSH, the first and so far only multi-probe LSH solution to ANNS in $L_1$ distance, and show that it achieves a better tradeoff between scalability and query efficiency than all existing LSH-based solutions. We also explain why a state-of-the-art ANNS-$L_1$ solution called Cauchy projection LSH (CP-LSH) is fundamentally not suitable for multi-probe extension. Finally, as a use case, we construct, using MP-RW-LSH as the underlying "ANNS-$L_1$ engine", a new ANNS-E (E for edit distance) solution that beats the state of the art.
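The general multi-probe idea, probing several buckets of one table instead of adding more tables, can be sketched as follows. This is a deliberately simplified illustration and not MP-RW-LSH: it assumes bucket keys are tuples of integers from quantized projections and probes the home bucket plus every bucket one step away in a single coordinate. Published multi-probe schemes instead order perturbations by how likely each bucket is to hold a near neighbor. The function names are hypothetical.

```python
def probe_sequence(home_key):
    """Home bucket first, then each bucket differing by +/-1 in one coordinate."""
    probes = [home_key]
    for i in range(len(home_key)):
        for delta in (-1, 1):
            probes.append(home_key[:i] + (home_key[i] + delta,) + home_key[i + 1:])
    return probes


def multiprobe_lookup(table, home_key):
    """Collect candidate ids from the home bucket and its one-step neighbors.

    `table` is assumed to map integer-tuple bucket keys to lists of ids.
    """
    found = []
    for key in probe_sequence(home_key):
        found.extend(table.get(key, []))
    return found
```

With a key of length m this visits 2m + 1 buckets per table, trading a few extra lookups for fewer tables in memory.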
