nearest neighbor searches
Recently Published Documents

Total documents: 28 (five years: 9)
H-index: 6 (five years: 0)

Informatics ◽  
2020 ◽  
Vol 7 (4) ◽  
pp. 38
Author(s):  
Hasan Aljabbouli ◽  
Abdullah Albizri ◽  
Antoine Harfouche

The K-means algorithm is a well-known and widely used clustering algorithm due to its simplicity and convergence properties. However, one drawback of the algorithm is its instability. This paper presents improvements to the K-means algorithm using a K-dimensional tree (Kd-tree) data structure. The proposed Kd-tree is used to improve the choice of initial cluster centers and to reduce the number of nearest neighbor searches required by the algorithm. The developed framework also includes an efficient center insertion technique, leading to an incremental operation that overcomes the instability problem of the K-means algorithm. The results of the proposed algorithm were compared with those obtained from the K-means algorithm, K-medoids, and K-means++ in an experiment using six different datasets. The results demonstrated that the proposed algorithm provides superior and more stable clustering solutions.
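To make the role of the Kd-tree concrete, here is a minimal sketch of a generic Kd-tree nearest-neighbor query, used to find the closest center for a data point in a K-means assignment step. It is not the authors' framework: the function names `build_kdtree` and `nearest`, the tuple-based tree layout, and the choice to index the centers rather than the data points are all assumptions made for illustration.

```python
from math import dist

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree as nested tuples: (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through the coordinates
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # median point splits the set
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None):
    """Return the (distance, point) pair in the tree closest to query."""
    if node is None:
        return best
    point, left, right = node
    d = dist(point, query)
    if best is None or d < best[0]:
        best = (d, point)
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if abs(diff) < best[0]:
        best = nearest(far, query, depth + 1, best)
    return best

# Hypothetical centers and data points, for illustration only.
centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
tree = build_kdtree(centers)
assigned = [nearest(tree, p)[1] for p in [(1.0, 2.0), (9.0, 8.0), (1.0, 9.0)]]
```

The pruning test in `nearest` is what saves work: whole subtrees are skipped whenever the splitting plane lies farther away than the best match found so far.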


2020 ◽  
Vol 34 (04) ◽  
pp. 4535-4544
Author(s):  
Inbal Lavi ◽  
Shai Avidan ◽  
Yoram Singer ◽  
Yacov Hel-Or

We introduce a binary embedding framework, called Proximity Preserving Code (PPC), which learns similarity and dissimilarity between data points to create a compact and affinity-preserving binary code. This code can be used to apply fast and memory-efficient approximation to nearest-neighbor searches. Our framework is flexible, enabling different proximity definitions between data points. In contrast to previous methods that extract binary codes based on unsigned graph partitioning, our system models the attractive and repulsive forces in the data by incorporating positive and negative graph weights. The proposed framework is shown to boil down to finding the minimal cut of a signed graph, a problem known to be NP-hard. We offer an efficient approximation and achieve superior results by constructing the code bit after bit. We show that the proposed approximation is superior to the commonly used spectral methods with respect to both accuracy and complexity. Thus, it is useful for many other problems that can be translated into signed graph cut.
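As a rough illustration of why compact binary codes make nearest-neighbor search fast and memory-efficient, the sketch below performs a brute-force Hamming-distance search over codes stored as Python integers. It does not implement PPC or its signed graph cut approximation; the codes are assumed to have been produced already by some embedding, and the names `hamming` and `hamming_knn` are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two codes stored as integers."""
    return bin(a ^ b).count("1")

def hamming_knn(codes, query, k=1):
    """Return the indices of the k codes closest to query in Hamming distance."""
    order = sorted(range(len(codes)), key=lambda i: hamming(codes[i], query))
    return order[:k]

# Hypothetical 8-bit codes; a real embedding would typically use more bits.
codes = [0b10110010, 0b10110011, 0b01001101, 0b11110000]
closest = hamming_knn(codes, 0b10110110, k=2)
```

Each comparison is a single XOR followed by a bit count, so the cost per candidate is tiny compared with a floating-point distance in the original feature space.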


2019 ◽  
Author(s):  
Alice Capecchi ◽  
Mahendra Awale ◽  
Daniel Probst ◽  
Jean-Louis Reymond

Seven million of the currently 94 million entries in the PubChem database break at least one of the four Lipinski constraints for oral bioavailability, 183,185 of which are also found in the ChEMBL database. These non-Lipinski PubChem (NLP) and ChEMBL (NLC) subsets are interesting because they contain new modalities that can display biological properties not accessible to small molecule drugs. Unfortunately, the current search tools in PubChem and ChEMBL are designed for small molecules and are not well suited to explore these subsets, which therefore remain poorly appreciated. Herein we report MXFP (macromolecule extended atom-pair fingerprint), a 217-D fingerprint tailored to analyze large molecules in terms of molecular shape and pharmacophores. We implement MXFP in two web-based applications, the first one to visualize NLP and NLC interactively using Faerun (http://faerun.gdb.tools/), the second one to perform MXFP nearest neighbor searches in NLP (http://similaritysearch.gdb.tools/). We show that these tools provide a meaningful insight into the diversity of large molecules in NLP and NLC. The interactive tools presented here are publicly available at http://gdb.unibe.ch and can be used freely to explore and better understand the diversity of non-Lipinski molecules in PubChem and ChEMBL.


2019 ◽  
Vol 76 (8) ◽  
pp. 6177-6194
Author(s):  
Jun-Hong Shen ◽  
Cheng-Jung Yu ◽  
Ching-Ta Lu ◽  
WenYen Lin ◽  
Neil Y. Yen ◽  
...  

2018 ◽  
Author(s):  
Daniel Probst ◽  
Jean-Louis Reymond

Background: Among the various molecular fingerprints available to describe small organic molecules, ECFP4 (extended connectivity fingerprint, up to four bonds) performs best in benchmarking drug analog recovery studies as it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high dimensional representations (≥1,024D) to perform well, so ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem or ZINC run very slowly because of the curse of dimensionality.
Results: Herein we report a new fingerprint, called MHFP6 (MinHash fingerprint, up to six bonds), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. Furthermore, MHFP6 outperforms ECFP4 in approximate nearest neighbor searches by two orders of magnitude in terms of speed, while decreasing the error rate.
Conclusion: MHFP6 is a new molecular fingerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub (https://github.com/reymond-group/mhfp).
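The MinHash step described above can be sketched in a few lines. This is a generic MinHash over a set of strings, not the MHFP6 implementation: the substructure strings below are hypothetical stand-ins for the SMILES that a cheminformatics toolkit would extract, and the salted BLAKE2b hashing, the signature length of 128, and the function names are assumptions chosen to keep the example standard-library only.

```python
import hashlib

def minhash(items, num_perm=128):
    """MinHash signature of a non-empty set of strings.

    Each of the num_perm salted hash functions contributes the smallest
    hash value it assigns to any item in the set.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")    # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for s in items))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical substructure sets; the true Jaccard similarity is 3/5.
mol_a = {"C", "CC", "CCO", "c1ccccc1"}
mol_b = {"C", "CC", "CCO", "CCN"}
similarity = estimate_jaccard(minhash(mol_a), minhash(mol_b))
```

Because two signatures agree at a position with probability equal to the Jaccard similarity of the underlying sets, fixed-length signatures like these can be bucketed by locality sensitive hashing, which is the property the abstract relies on for approximate search.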

