Fast document summarization using locality sensitive hashing and memory access efficient node ranking

Author(s):  
Ercan Canhasi

Text modeling and sentence selection are the fundamental steps of a typical extractive document summarization algorithm. The common text modeling method connects pairs of sentences based on their similarities. Although it can effectively represent the sentence similarity graph of the given document(s), its big drawback is a time complexity of $O(n^2)$, where n is the number of sentences. This quadratic time complexity makes it impractical for large documents. In this paper we propose fast approximation algorithms for both text modeling and sentence selection. Our text modeling algorithm reduces the time complexity to near-linear by rapidly finding the most similar sentences to form the sentence similarity graph. To do so, we utilize Locality-Sensitive Hashing (LSH), a fast algorithm for approximate nearest neighbor search. For the sentence selection step we propose a simple memory-access-efficient node ranking method based on sequentially scanning only the neighborhood arrays. Experimentally, we show that sacrificing a rather small percentage of recall and precision in the quality of the produced summary reduces the quadratic time complexity to sub-linear. We see great potential for the proposed method in text summarization on mobile devices and in big text data summarization for the internet of things in the cloud. In our experiments, besides evaluating the presented method on the standard general and query-focused multi-document summarization tasks, we also tested it on a few alternative summarization tasks, including general and query-focused, timeline, and comparative summarization.
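
Below is a minimal, self-contained Python sketch of the two ideas described in the abstract: building a sparse sentence similarity graph with MinHash signatures and LSH banding so that only colliding sentences are compared, and ranking nodes by sequentially scanning their neighborhood arrays. The helper names (minhash, lsh_graph, rank) and all parameter values are illustrative assumptions, not taken from the paper.

```python
import hashlib
from collections import defaultdict

def minhash(tokens, num_perm=64):
    # Tokens are assumed non-empty; each seed acts as a separate salted hash function.
    return [
        min(int(hashlib.sha1(f"{seed}|{t}".encode()).hexdigest(), 16) for t in tokens)
        for seed in range(num_perm)
    ]

def lsh_graph(sentences, bands=16, rows=4, threshold=0.3):
    """Connect sentence pairs that collide in at least one LSH band (approximate kNN)."""
    sigs = [minhash(set(s.lower().split())) for s in sentences]
    buckets = defaultdict(set)
    for i, sig in enumerate(sigs):
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(i)
    neighbors = defaultdict(set)
    for bucket in buckets.values():
        for i in bucket:
            for j in bucket:
                if i != j:
                    est = sum(x == y for x, y in zip(sigs[i], sigs[j])) / len(sigs[i])
                    if est >= threshold:       # estimated Jaccard similarity
                        neighbors[i].add(j)
    return {i: sorted(ns) for i, ns in neighbors.items()}

def rank(neighbors, n, iters=20, d=0.85):
    """PageRank-style scores computed by scanning only the neighborhood arrays."""
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - d) / n] * n
        for i, ns in neighbors.items():
            share = d * scores[i] / max(len(ns), 1)
            for j in ns:                        # sequential pass over i's neighbor array
                new[j] += share
        scores = new
    return scores
```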


Author(s):  
Levon Arsalanyan ◽  
Hayk Danoyan

The Nearest Neighbor search algorithm considered in this paper is the well-known Elias algorithm. It uses error-correcting codes to construct appropriate hash-coding schemes. These schemes preprocess the data into lists, each contained in a sphere centered at a codeword. The algorithm is first considered for perfect codes, so the spheres, and consequently the lists, do not intersect. Since such codes exist only for a limited set of parameters, the algorithm is also considered for some generalizations of perfect codes, in which the same data point may be contained in different lists. A formula for the time complexity of the algorithm is obtained for these cases, using the coset weight structures of the mentioned codes.
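
The following toy Python sketch illustrates the hash-coding idea with the perfect Hamming(7,4) code: every 7-bit point is stored in the list keyed by the unique codeword whose radius-1 sphere contains it, and a query decodes to that codeword and scans only the corresponding list. The helper names and the restriction to length-7 binary data are illustrative assumptions, not the paper's formulation.

```python
from collections import defaultdict

# Parity-check matrix of Hamming(7,4): column j is the binary expansion of j+1.
H = [[(j + 1) >> k & 1 for j in range(7)] for k in (2, 1, 0)]

def syndrome(v):
    return [sum(H[r][j] * v[j] for j in range(7)) % 2 for r in range(3)]

def nearest_codeword(v):
    """Decode v to the unique codeword within Hamming distance 1."""
    s = syndrome(v)
    pos = s[0] * 4 + s[1] * 2 + s[2]            # 0 means v is already a codeword
    c = list(v)
    if pos:
        c[pos - 1] ^= 1                          # flip the single erroneous bit
    return tuple(c)

def build_index(points):
    buckets = defaultdict(list)
    for p in points:
        buckets[nearest_codeword(p)].append(p)   # spheres (lists) do not intersect
    return buckets

def query(buckets, q):
    # Scan only the list whose sphere contains q; the answer is approximate,
    # since the true nearest neighbor may sit in another sphere.
    candidates = buckets.get(nearest_codeword(q), [])
    return min(candidates,
               key=lambda p: sum(a != b for a, b in zip(p, q)),
               default=None)
```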


2018 ◽  
Author(s):  
Daniel Probst ◽  
Jean-Louis Reymond

Background: Among the various molecular fingerprints available to describe small organic molecules, ECFP4 (extended connectivity fingerprint, up to four bonds) performs best in benchmarking drug analog recovery studies as it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high-dimensional representations (≥1,024D) to perform well, so ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem or ZINC are very slow due to the curse of dimensionality.
Results: Herein we report a new fingerprint, called MHFP6 (MinHash fingerprint, up to six bonds), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. Furthermore, MHFP6 outperforms ECFP4 in approximate nearest neighbor searches by two orders of magnitude in terms of speed, while decreasing the error rate.
Conclusion: MHFP6 is a new molecular fingerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub (https://github.com/reymond-group/mhfp).
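
As a rough illustration of the hashing step only, the Python sketch below MinHashes a set of substructure SMILES strings and estimates Jaccard similarity from the signatures. The substructure extraction itself (done in MHFP6 with a cheminformatics toolkit) is replaced here by hard-coded placeholder sets, and the hash construction is a generic MinHash, not the exact MHFP6 scheme.

```python
import hashlib

def minhash(shingles, num_perm=128):
    """Return a MinHash signature (num_perm minima) for a set of strings."""
    return [
        min(int(hashlib.sha1(f"{seed}|{s}".encode()).hexdigest(), 16) for s in shingles)
        for seed in range(num_perm)
    ]

def jaccard_estimate(sig_a, sig_b):
    """Estimated Jaccard similarity: fraction of positions where the minima agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical substructure sets for two molecules (placeholders, not real MHFP6 output).
mol_a = {"C", "CC", "CCO", "CO"}
mol_b = {"C", "CC", "CCN", "CN"}
print(jaccard_estimate(minhash(mol_a), minhash(mol_b)))
```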


2020 ◽  
Vol 34 (04) ◽  
pp. 4594-4601
Author(s):  
Dung D. Le ◽  
Hady W. Lauw

Locality Sensitive Hashing (LSH) has become one of the most commonly used approximate nearest neighbor search techniques to avoid the prohibitive cost of scanning through all data points. For recommender systems, LSH achieves efficient recommendation retrieval by encoding user and item vectors into binary hash codes, reducing the cost of exhaustively examining all the item vectors to identify the top-k items. However, conventional matrix factorization models may suffer from performance degradation caused by randomly drawn LSH hash functions, directly affecting the ultimate quality of the recommendations. In this paper, we propose a framework that factors in the stochasticity of LSH hash functions when learning real-valued user and item latent vectors, eventually improving the recommendation accuracy after LSH indexing. Experiments on publicly available datasets show that the proposed framework not only effectively learns users' preferences for prediction, but also achieves high compatibility with LSH stochasticity, producing superior post-LSH indexing performance compared to state-of-the-art baselines.
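
For context, here is a generic sign-random-projection LSH retrieval sketch over learned user and item latent vectors. It is not the proposed framework (which additionally accounts for hash stochasticity during training); the vectors, hyperplanes and parameter values are placeholders.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, n_items, n_bits = 32, 10000, 16

item_vecs = rng.normal(size=(n_items, d))     # stand-in for learned item factors
user_vec = rng.normal(size=d)                 # stand-in for a learned user factor
planes = rng.normal(size=(n_bits, d))         # random hyperplanes shared by users and items

def hash_code(v):
    """Binary hash: the sign pattern of the vector against each hyperplane."""
    return tuple((planes @ v > 0).astype(int))

# Index items by their hash code.
buckets = defaultdict(list)
for i, v in enumerate(item_vecs):
    buckets[hash_code(v)].append(i)

# Retrieve candidates from the user's bucket, then rank them exactly by inner product.
candidates = buckets.get(hash_code(user_vec), [])
top_k = sorted(candidates, key=lambda i: -item_vecs[i] @ user_vec)[:10]
print(top_k)
```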


2021 ◽  
Vol 14 (13) ◽  
pp. 3267-3280
Author(s):  
Huayi Wang ◽  
Jingfan Meng ◽  
Long Gong ◽  
Jun Xu ◽  
Mitsunori Ogihara

Approximate Nearest Neighbor Search (ANNS) is a fundamental algorithmic problem, with numerous applications in many areas of computer science. Locality-Sensitive Hashing (LSH) is one of the most popular solution approaches for ANNS. A common shortcoming of many LSH schemes is that since they probe only a single bucket in a hash table, they need to use a large number of hash tables to achieve a high query accuracy. For ANNS-$L_2$, a multi-probe scheme was proposed to overcome this drawback by strategically probing multiple buckets in a hash table. In this work, we propose MP-RW-LSH, the first and so far only multi-probe LSH solution to ANNS in $L_1$ distance, and show that it achieves a better tradeoff between scalability and query efficiency than all existing LSH-based solutions. We also explain why a state-of-the-art ANNS-$L_1$ solution called Cauchy projection LSH (CP-LSH) is fundamentally not suitable for multi-probe extension. Finally, as a use case, we construct, using MP-RW-LSH as the underlying "ANNS-$L_1$ engine", a new ANNS-E (E for edit distance) solution that beats the state of the art.
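
The sketch below illustrates the generic multi-probe idea in the classic ANNS-$L_2$ p-stable LSH setting: besides the home bucket, a few neighboring buckets reached by perturbing hash coordinates are probed. It is not MP-RW-LSH (which is built on random-walk hashes for $L_1$); all names and parameters are illustrative.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)
d, n, k, w = 16, 5000, 4, 4.0          # dimension, points, hash functions, bucket width

data = rng.normal(size=(n, d))
A = rng.normal(size=(k, d))             # Gaussian projections (2-stable, hence L2)
b = rng.uniform(0, w, size=k)

def hash_vec(v):
    return tuple(np.floor((A @ v + b) / w).astype(int))

table = defaultdict(list)
for i, v in enumerate(data):
    table[hash_vec(v)].append(i)

def multi_probe_query(q, max_probes=8):
    base = np.array(hash_vec(q))
    # Probe the home bucket plus buckets reachable by +/-1 perturbations of a few
    # hash coordinates (a simplified probing sequence, not an optimized one).
    probes = [np.zeros(k, dtype=int)]
    probes += [np.eye(k, dtype=int)[i] * s for i in range(k) for s in (1, -1)]
    candidates = set()
    for delta in probes[:max_probes]:
        candidates.update(table.get(tuple(base + delta), []))
    if not candidates:
        return None
    return min(candidates, key=lambda i: np.linalg.norm(data[i] - q))

print(multi_probe_query(rng.normal(size=d)))
```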


2010 ◽  
Vol 43 (2) ◽  
pp. 356-361 ◽  
Author(s):  
Herbert J. Bernstein ◽  
Paul A. Craig

The PGALRS (pseudo-Gaussian approximation to Lee–Richards surfaces) algorithm is discussed. By modeling electron density with unphysical pseudo-Gaussian atoms, the Lee–Richards surface can be approximated by a contour level of that density in time approximately linear in the number of atoms. Having that contour level, the atoms and residues closest to that surface can be identified in average time $O[n^{2/3}\log(n)]$ using a NearTree-based nearest neighbor search. If a high-quality Lee–Richards surface is required, then, as a final stage, one of the standard Lee–Richards algorithms can be used but considering only the previously identified surface residues; the typical cost is thereby reduced to $O[n^{2/3}\log(n)]$, making the overall average time for all the steps $O(n)$. For very large macromolecules, such a reduction in computational burden may be essential to being able to render a meaningful molecular surface. This approach extends the feasible range of application for existing molecular surface software, such as MSMS, to larger macromolecules, especially to macromolecules with more than 50 000 atoms, and can be used as a starting point for surface-based (as opposed to backbone-based) motif identification, e.g. using ProMol.
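
A very rough Python sketch of the underlying idea, not the PGALRS implementation: sum Gaussians centered on atoms over a coarse grid, take grid points near a contour level as an approximate surface, and flag atoms close to that surface for any subsequent exact step. Coordinates, widths and thresholds are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
atoms = rng.uniform(0, 10, size=(200, 3))       # hypothetical atom coordinates
sigma, spacing = 1.5, 1.0                        # illustrative Gaussian width and grid step

# Evaluate the pseudo-Gaussian density on a coarse grid.
axis = np.arange(0, 10 + spacing, spacing)
gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
grid = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
dist2 = ((grid[:, None, :] - atoms[None, :, :]) ** 2).sum(-1)
density = np.exp(-dist2 / (2 * sigma ** 2)).sum(axis=1)

# Grid points whose density is close to a chosen contour level approximate the surface.
level = np.quantile(density, 0.7)                # arbitrary contour level for the demo
near = np.abs(density - level) < 0.2 * level
surface_pts = grid[near] if near.any() else grid[[np.argmin(np.abs(density - level))]]

# Atoms within a couple of grid spacings of any surface point are flagged as surface atoms.
d_atom_surf = np.sqrt(((atoms[:, None, :] - surface_pts[None, :, :]) ** 2).sum(-1))
surface_atoms = np.where(d_atom_surf.min(axis=1) < 2 * spacing)[0]
print(len(surface_atoms), "of", len(atoms), "atoms flagged as near the approximate surface")
```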


2015 ◽  
Vol 09 (03) ◽  
pp. 307-331 ◽  
Author(s):  
Wei Zhang ◽  
Gongxuan Zhang ◽  
Yongli Wang ◽  
Zhaomeng Zhu ◽  
Tao Li

Nearest neighbor search is a key technique used in hierarchical clustering, and its computational complexity determines the performance of the hierarchical clustering algorithm. The time complexity of standard agglomerative hierarchical clustering is $O(n^3)$, while the time complexity of more advanced hierarchical clustering algorithms (such as nearest neighbor chain, SLINK and CLINK) is $O(n^2)$. This paper presents a new nearest neighbor search method called nearest neighbor boundary (NNB), which first divides a large dataset into independent subsets and then finds the nearest neighbor of each point within its subset. When NNB is used, the time complexity of hierarchical clustering can be reduced to $O(n \log^2 n)$. Based on NNB, we propose a fast hierarchical clustering algorithm called nearest-neighbor boundary clustering (NBC), and the proposed algorithm can be adapted to parallel and distributed computing frameworks. The experimental results demonstrate that our algorithm is practical for large datasets.
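
To illustrate the divide-then-search-locally idea (not the paper's exact NNB construction), the sketch below sorts points along one axis, partitions them into blocks, and searches each point's nearest neighbor only in its own block and the adjacent boundary blocks. Block size and data are placeholders, and the result is approximate because the true nearest neighbor can fall outside the searched blocks.

```python
import numpy as np

rng = np.random.default_rng(3)
points = rng.uniform(0, 100, size=(2000, 2))
block = 50                                    # points per block after sorting on x

order = np.argsort(points[:, 0])
sorted_pts = points[order]

def nearest_neighbor(i):
    """Nearest neighbor of sorted_pts[i], searched in its block and the adjacent ones."""
    lo = max(0, (i // block - 1) * block)
    hi = min(len(sorted_pts), (i // block + 2) * block)
    window = np.delete(np.arange(lo, hi), i - lo)       # candidate indices, excluding i
    d = np.linalg.norm(sorted_pts[window] - sorted_pts[i], axis=1)
    return window[np.argmin(d)]

print(nearest_neighbor(123))
```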


Robotica ◽  
2014 ◽  
Vol 33 (7) ◽  
pp. 1491-1506
Author(s):  
Mika T. Rantanen ◽  
Martti Juhola

SUMMARY: A crucial part of probabilistic roadmap planners is the nearest neighbor search, which is typically done by exact methods. Unfortunately, searching the neighbors can become a major bottleneck for the performance. This can occur when the roadmap size grows, especially in high-dimensional spaces. In this paper, we investigate how well the approximate nearest neighbor searching works with probabilistic roadmap planners. We propose a method that is based on the locality-sensitive hashing and show that it can speed up the construction of the roadmap considerably without reducing the quality of the produced roadmap.
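
A schematic Python sketch of a roadmap-construction loop in which candidate neighbors come from an approximate nearest neighbor query (here a single random-hyperplane hash bucket) rather than an exact search. The collision check is a stub and all parameters are placeholders, so this only outlines where the approximate query plugs in; it is not the planner evaluated in the paper.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(4)
dim, n_samples, n_bits, radius = 6, 2000, 10, 0.6

planes = rng.normal(size=(n_bits, dim))
buckets = defaultdict(list)                   # hash code -> list of node indices
nodes, edges = [], []

def code(q):
    return tuple((planes @ q > 0).astype(int))

def collision_free(a, b):
    return True                               # placeholder for a real local planner

for _ in range(n_samples):
    q = rng.uniform(-1, 1, size=dim)          # sampled configuration (assumed valid)
    idx = len(nodes)
    # Approximate neighbors: only nodes sharing q's hash bucket are examined.
    for j in buckets[code(q)]:
        if np.linalg.norm(nodes[j] - q) < radius and collision_free(nodes[j], q):
            edges.append((j, idx))
    nodes.append(q)
    buckets[code(q)].append(idx)

print(len(nodes), "nodes,", len(edges), "edges")
```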

