A Probabilistic Molecular Fingerprint for Big Data Settings

10.26434/chemrxiv.7176350 ◽

2018 ◽

Author(s):

Daniel Probst ◽

Jean-Louis Reymond

Keyword(s):

Nearest Neighbor ◽

Nearest Neighbor Search ◽

Locality Sensitive Hashing ◽

Molecular Fingerprint ◽

Molecular Fingerprints ◽

Approximate Nearest Neighbor ◽

Neighbor Search ◽

Large Databases ◽

Nearest Neighbor Searches ◽

Extended Connectivity

Background: Among the various molecular fingerprints available to describe small organic molecules, ECFP4 (extended connectivity fingerprint, up to four bonds) performs best in benchmarking drug analog recovery studies, as it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high-dimensional representations (≥1,024D) to perform well, so ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem or ZINC are very slow due to the curse of dimensionality. Results: Herein we report a new fingerprint, called MHFP6 (MinHash fingerprint, up to six bonds), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. Furthermore, MHFP6 outperforms ECFP4 in approximate nearest neighbor searches by two orders of magnitude in terms of speed, while decreasing the error rate. Conclusion: MHFP6 is a new molecular fingerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub (https://github.com/reymond-group/mhfp).
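The MinHash step mentioned in the abstract can be illustrated with a minimal sketch. This is not the MHFP6 implementation from the repository above: the function names, the SHA-1 base hash, the number of permutations and the linear hash family `(a*h + b) mod p` are all assumptions chosen for illustration, and the input is any set of strings standing in for the substructure SMILES of a molecule. The fraction of matching signature positions estimates the Jaccard similarity of the two underlying sets.

```python
import hashlib
import random

# Illustrative constants (assumptions, not taken from the MHFP6 source).
MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH = (1 << 32) - 1


def base_hash(token: str) -> int:
    """Map a string to a 32-bit integer with SHA-1 (stable across runs)."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def make_permutations(n_perm: int, seed: int = 42):
    """Draw n_perm (a, b) pairs defining hash functions h -> (a*h + b) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(n_perm)
    ]


def minhash(tokens, perms):
    """Return the MinHash signature of a non-empty collection of strings."""
    hashes = [base_hash(t) for t in set(tokens)]
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
        for a, b in perms
    ]


def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of positions where two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(a, b) -> float:
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

Two signatures are only comparable when they were built with the same permutations, so `make_permutations` should be called once and its result reused for every set. More permutations reduce the variance of the estimate at the cost of a longer signature.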

Optimal Load Factor for Approximate Nearest Neighbor Search under Exact Euclidean Locality Sensitive Hashing

International Journal of Computer Applications ◽

10.5120/12096-8258 ◽

2013 ◽

Vol 69 (21) ◽

pp. 22-31 ◽

Cited By ~ 1

Author(s):

Ruben Buaba ◽

Abdollah Homaifar ◽

Eric Kihn

Keyword(s):

Nearest Neighbor ◽

Nearest Neighbor Search ◽

Locality Sensitive Hashing ◽

Load Factor ◽

Approximate Nearest Neighbor Search ◽

Approximate Nearest Neighbor ◽

Neighbor Search ◽

Optimal Load

Bi-Level Locality Sensitive Hashing Index Based on Clustering

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.556-562.3804 ◽

2014 ◽

Vol 556-562 ◽

pp. 3804-3808

Author(s):

Peng Wang ◽

Dong Yin ◽

Tao Sun

Keyword(s):

Nearest Neighbor ◽

Nearest Neighbor Search ◽

Locality Sensitive Hashing ◽

Search Performance ◽

Hash Tables ◽

Approximate Nearest Neighbor Search ◽

Approximate Nearest Neighbor ◽

Neighbor Search ◽

Search Speed ◽

Locality Sensitive Hash

Locality sensitive hashing (LSH) is the most popular algorithm for approximate nearest neighbor search. Because LSH partitions the vector space uniformly while the distribution of vectors is usually non-uniform, it fits real datasets poorly and has limited search performance. In this paper, we propose a new bi-level locality sensitive hashing algorithm, which uses a two-level structure to perform approximate nearest neighbor search in high-dimensional spaces. In the first level, we train a number of cluster centers and use them to divide the dataset into many clusters, so that the vectors in each cluster have a near-uniform distribution. In the second level, we construct locality sensitive hash tables for each cluster. Given a query, we determine a few clusters that it belongs to with high probability, and then perform approximate nearest neighbor search in the corresponding locality sensitive hash tables. Experimental results on a dataset of 1,000,000 vectors show that the search speed can be increased by a factor of 48 compared to Euclidean locality sensitive hashing, while keeping high search precision.
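The two-level structure described in the abstract can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the abstract does not say how the cluster centers are trained, so plain k-means is assumed here, and the second level assumes hash functions of the form floor((a·v + b) / w) with Gaussian a, a common construction for Euclidean LSH. All class names, parameter names and default values are hypothetical.

```python
import math
import random
from collections import defaultdict


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(data, k, iters=10, seed=0):
    """Level 1 (assumed): plain k-means returning k cluster centers."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(data, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: sq_dist(v, centers[c]))
            groups[nearest].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers


class E2LSHTable:
    """One hash table keyed by n_hashes values of floor((a.v + b) / width)."""

    def __init__(self, dim, n_hashes, width, rng):
        self.width = width
        self.projs = [
            ([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, width))
            for _ in range(n_hashes)
        ]
        self.buckets = defaultdict(list)

    def key(self, v):
        return tuple(
            math.floor((sum(a * x for a, x in zip(proj, v)) + b) / self.width)
            for proj, b in self.projs
        )

    def add(self, idx, v):
        self.buckets[self.key(v)].append(idx)

    def lookup(self, v):
        return self.buckets.get(self.key(v), [])


class BiLevelLSH:
    """Level 1 routes vectors to clusters; level 2 keeps LSH tables per cluster."""

    def __init__(self, data, n_clusters=4, n_tables=8, n_hashes=4,
                 width=4.0, seed=0):
        rng = random.Random(seed)
        self.data = data
        dim = len(data[0])
        self.centers = kmeans(data, n_clusters, seed=seed)
        self.tables = [
            [E2LSHTable(dim, n_hashes, width, rng) for _ in range(n_tables)]
            for _ in range(n_clusters)
        ]
        for idx, v in enumerate(data):
            cluster = self._nearest_clusters(v, 1)[0]
            for table in self.tables[cluster]:
                table.add(idx, v)

    def _nearest_clusters(self, v, n):
        order = sorted(range(len(self.centers)),
                       key=lambda c: sq_dist(v, self.centers[c]))
        return order[:n]

    def query(self, q, k=1, n_probe=2):
        """Probe the n_probe closest clusters, then rank candidates exactly."""
        candidates = set()
        for cluster in self._nearest_clusters(q, n_probe):
            for table in self.tables[cluster]:
                candidates.update(table.lookup(q))
        return sorted(candidates, key=lambda i: sq_dist(q, self.data[i]))[:k]
```

In this sketch `n_probe` plays the role of the "few clusters" checked per query: raising it trades speed for recall. The bucket width and the number of tables would need tuning for any real dataset, and `query` may return fewer than `k` indices when few candidates collide with the query.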
