<p><b>Background</b>: Among the various molecular fingerprints available to describe
small organic molecules, ECFP4 (extended connectivity fingerprint, up to four
bonds) performs best in benchmarking drug analog recovery studies as it encodes
substructures with a high level of detail. Unfortunately, ECFP4 requires
high-dimensional representations (≥1,024D) to perform well, so ECFP4
nearest neighbor searches in very large databases such as GDB, PubChem or ZINC
are very slow due to the curse of dimensionality.</p>
<p><b>Results</b>: Herein we report a new fingerprint, called MHFP6 (MinHash
fingerprint, up to six bonds), which encodes detailed substructures using the
extended connectivity principle of ECFP in a fundamentally different manner,
increasing the performance of exact nearest neighbor searches in benchmarking
studies and enabling the application of locality sensitive hashing (LSH)
approximate nearest neighbor search algorithms. To describe a molecule, MHFP6
extracts the SMILES of all circular substructures around each atom up to a
diameter of six bonds and applies the MinHash method to the resulting set.
MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. Furthermore,
in approximate nearest neighbor searches, MHFP6 is two orders of magnitude
faster than ECFP4 while also decreasing the error rate.</p>
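<p>The MinHash step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the SHA-1 base hash, the linear hash family and the number of hash functions are all assumptions made for illustration, and the substructure strings are hand-written placeholders rather than real MHFP6 output. Extracting circular substructure SMILES from a molecule is not shown.</p>

```python
import hashlib
import random

# Assumption: a Mersenne prime modulus for the linear hash family (illustrative choice).
PRIME = 2**61 - 1


def make_permutations(n, seed=42):
    """Draw n random (a, b) pairs; each pair defines h -> (a*h + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def base_hash(text):
    """Map a string to a stable 32-bit integer (first 4 bytes of its SHA-1 digest)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:4], "little")


def minhash(shingles, perms):
    """Return the MinHash signature of a non-empty collection of strings."""
    hashes = [base_hash(s) for s in set(shingles)]
    if not hashes:
        raise ValueError("cannot MinHash an empty set")
    # For every hash function, keep the smallest value seen over the whole set.
    return [min((a * h + b) % PRIME for h in hashes) for a, b in perms]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical substructure strings, hand-written for this example only.
    mol_a = {"C", "CC", "CCO", "O", "CO"}
    mol_b = {"C", "CC", "CCN", "N", "CN"}

    perms = make_permutations(2048)
    sig_a = minhash(mol_a, perms)
    sig_b = minhash(mol_b, perms)

    exact = len(mol_a & mol_b) / len(mol_a | mol_b)
    print("exact Jaccard:    ", exact)
    print("MinHash estimate: ", estimate_jaccard(sig_a, sig_b))
```

<p>Because two signatures agree at a given position with probability equal to the Jaccard similarity of the underlying sets, fixed-length signatures of this kind can be compared directly or split into bands and indexed with locality sensitive hashing.</p>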
<p><b>Conclusion</b>: MHFP6 is a new molecular fingerprint, encoding circular
substructures, which outperforms ECFP4 for analog searches while allowing the
direct application of locality sensitive hashing algorithms. It should be well
suited for the analysis of large databases. The source code for MHFP6 is
available on GitHub (<a href="https://github.com/reymond-group/mhfp">https://github.com/reymond-group/mhfp</a>).</p>