scholarly journals Stochastically Robust Personalized Ranking for LSH Recommendation Retrieval

2020 ◽  
Vol 34 (04) ◽  
pp. 4594-4601
Author(s):  
Dung D. Le ◽  
Hady W. Lauw

Locality Sensitive Hashing (LSH) has become one of the most commonly used approximate nearest neighbor search techniques to avoid the prohibitive cost of scanning through all data points. For recommender systems, LSH achieves efficient recommendation retrieval by encoding user and item vectors into binary hash codes, reducing the cost of exhaustively examining all the item vectors to identify the top-k items. However, conventional matrix factorization models may suffer from performance degeneration caused by randomly-drawn LSH hash functions, directly affecting the ultimate quality of the recommendations. In this paper, we propose a framework named øurmodel, which factors in the stochasticity of LSH hash functions when learning real-valued user and item latent vectors, eventually improving the recommendation accuracy after LSH indexing. Experiments on publicly available datasets show that the proposed framework not only effectively learns user's preferences for prediction, but also achieves high compatibility with LSH stochasticity, producing superior post-LSH indexing performances as compared to state-of-the-art baselines.

Robotica ◽  
2014 ◽  
Vol 33 (7) ◽  
pp. 1491-1506
Author(s):  
Mika T. Rantanen ◽  
Martti Juhola

SUMMARYA crucial part of probabilistic roadmap planners is the nearest neighbor search, which is typically done by exact methods. Unfortunately, searching the neighbors can become a major bottleneck for the performance. This can occur when the roadmap size grows especially in high-dimensional spaces. In this paper, we investigate how well the approximate nearest neighbor searching works with probabilistic roadmap planners. We propose a method that is based on the locality-sensitive hashing and show that it can speed up the construction of the roadmap considerably without reducing the quality of the produced roadmap.


Author(s):  
Ercan Canhasi

Text modeling and sentence selection are the fundamental steps of a typical extractive document summarization algorithm.   The common text modeling method connects a pair of sentences based on their similarities.   Even thought it can effectively represent the sentence similarity graph of given document(s) its big drawback is a large time complexity of $O(n^2)$, where n represents the number of sentences.   The quadratic time complexity makes it impractical for large documents.   In this paper we propose the fast approximation algorithms for the text modeling and the sentence selection.   Our text modeling algorithm reduces the time complexity to near-linear time by rapidly finding the most similar sentences to form the sentences similarity graph.   In doing so we utilized Locality-Sensitive Hashing, a fast algorithm for the approximate nearest neighbor search.   For the sentence selection step we propose a simple memory-access-efficient node ranking method based on the idea of scanning sequentially only the neighborhood arrays.    Experimentally, we show that sacrificing a rather small percentage of recall and precision in the quality of the produced summary can reduce the quadratic to sub-linear time complexity.   We see the big potential of proposed method in text summarization for mobile devices and big text data summarization for internet of things on cloud.   In our experiments, beside evaluating the presented method on the standard general and query multi-document summarization tasks, we also tested it on few alternative summarization tasks including general and query, timeline, and comparative summarization.


2018 ◽  
Author(s):  
Daniel Probst ◽  
Jean-Louis Reymond

<p><b>Background</b>: Among the various molecular fingerprints available to describe small organic molecules, ECFP4 (extended connectivity fingerprint, up to four bonds) performs best in benchmarking drug analog recovery studies as it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high dimensional representations (≥1,024D) to perform well, resulting in ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem or ZINC to perform very slowly due to the curse of dimensionality. <a></a><a></a></p> <p><b>Results</b>: Herein we report a new fingerprint, called MHFP6 (MinHash fingerprint, up to six bonds), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. Furthermore, MHFP6 outperforms ECFP4 in approximate nearest neighbor searches by two orders of magnitude in terms of speed, while decreasing the error rate. </p> <p><b>Conclusion</b><a></a><a>: MHFP6 is a new molecular fingerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub (</a><a href="https://github.com/reymond-group/mhfp">https://github.com/reymond-group/mhfp</a>).<a></a></p>


2021 ◽  
Vol 14 (6) ◽  
pp. 916-928
Author(s):  
Yasuhiro Fujiwara ◽  
Sekitoshi Kanai ◽  
Yasutoshi Ida ◽  
Atsutoshi Kumagai ◽  
Naonori Ueda

Anchor graph hashing is used in many applications such as cancer detection, web page classification, and drug discovery. It computes the hash codes from the eigenvectors of the matrix representing the similarities between data points and anchor points; anchors refer to the points representing the data distribution. In performing an approximate nearest neighbor search, the hash codes of a query data point are determined by identifying its closest anchor points. Anchor graph hashing, however, incurs high computation cost since (1) the computation cost of obtaining the eigenvectors is quadratic to the number of anchor points, and (2) the similarities of the query data point to all the anchor points must be computed. Our proposal, Tridiagonal hashing , increases the efficiency of anchor graph hashing because of its two advances: (1) we apply a graph clustering algorithm to compute the eigenvectors from the tridiagonal matrix obtained from the similarities between data points and anchor points, and (2) we detect anchor points closest to the query data point by using a dimensionality reduction approach. Experiments show that our approach is several orders of magnitude faster than the previous approaches. Besides, it yields high search accuracy than the original anchor graph hashing approach.


2018 ◽  
Author(s):  
Daniel Probst ◽  
Jean-Louis Reymond

<p><b>Background</b>: Among the various molecular fingerprints available to describe small organic molecules, ECFP4 (extended connectivity fingerprint, up to four bonds) performs best in benchmarking drug analog recovery studies as it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high dimensional representations (≥1,024D) to perform well, resulting in ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem or ZINC to perform very slowly due to the curse of dimensionality. <a></a><a></a></p> <p><b>Results</b>: Herein we report a new fingerprint, called MHFP6 (MinHash fingerprint, up to six bonds), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. Furthermore, MHFP6 outperforms ECFP4 in approximate nearest neighbor searches by two orders of magnitude in terms of speed, while decreasing the error rate. </p> <p><b>Conclusion</b><a></a><a>: MHFP6 is a new molecular fingerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub (</a><a href="https://github.com/reymond-group/mhfp">https://github.com/reymond-group/mhfp</a>).<a></a></p>


Author(s):  
Ercan Canhasi

Text modeling and sentence selection are the fundamental steps of a typical extractive document summarization algorithm.   The common text modeling method connects a pair of sentences based on their similarities.   Even thought it can effectively represent the sentence similarity graph of given document(s) its big drawback is a large time complexity of $O(n^2)$, where n represents the number of sentences.   The quadratic time complexity makes it impractical for large documents.   In this paper we propose the fast approximation algorithms for the text modeling and the sentence selection.   Our text modeling algorithm reduces the time complexity to near-linear time by rapidly finding the most similar sentences to form the sentences similarity graph.   In doing so we utilized Locality-Sensitive Hashing, a fast algorithm for the approximate nearest neighbor search.   For the sentence selection step we propose a simple memory-access-efficient node ranking method based on the idea of scanning sequentially only the neighborhood arrays.    Experimentally, we show that sacrificing a rather small percentage of recall and precision in the quality of the produced summary can reduce the quadratic to sub-linear time complexity.   We see the big potential of proposed method in text summarization for mobile devices and big text data summarization for internet of things on cloud.   In our experiments, beside evaluating the presented method on the standard general and query multi-document summarization tasks, we also tested it on few alternative summarization tasks including general and query, timeline, and comparative summarization.


2021 ◽  
Vol 14 (13) ◽  
pp. 3267-3280
Author(s):  
Huayi Wang ◽  
Jingfan Meng ◽  
Long Gong ◽  
Jun Xu ◽  
Mitsunori Ogihara

Approximate Nearest Neighbor Search (ANNS) is a fundamental algorithmic problem, with numerous applications in many areas of computer science. Locality-Sensitive Hashing (LSH) is one of the most popular solution approaches for ANNS. A common shortcoming of many LSH schemes is that since they probe only a single bucket in a hash table, they need to use a large number of hash tables to achieve a high query accuracy. For ANNS- L 2 , a multi-probe scheme was proposed to overcome this drawback by strategically probing multiple buckets in a hash table. In this work, we propose MP-RW-LSH, the first and so far only multi-probe LSH solution to ANNS in L 1 distance, and show that it achieves a better tradeoff between scalability and query efficiency than all existing LSH-based solutions. We also explain why a state-of-the-art ANNS -L 1 solution called Cauchy projection LSH (CP-LSH) is fundamentally not suitable for multi-probe extension. Finally, as a use case, we construct, using MP-RW-LSH as the underlying "ANNS- L 1 engine", a new ANNS-E (E for edit distance) solution that beats the state of the art.


Sign in / Sign up

Export Citation Format

Share Document