The chemfp Project

2019 ◽  
Author(s):  
Andrew Dalke

This paper describes the 10 years of work and research results of the chemfp project, available from http://chemfp.com/. The project started as a way to promote the FPS format for cheminformatics fingerprint exchange. This is a line-oriented text format meant to be easy to read and write. It supports metadata such as the fingerprint type and data provenance. The chemfp package for Python was developed to provide basic command-line tools and a Python API for working with fingerprint data, because a format without useful tools will not be used.

The similarity search performance improved by an order of magnitude over the decade, due to careful implementation and effective use of CPU hardware, including AVX2 support for popcount calculations faster than the built-in POPCNT instruction. The implementation details for high-performance search have rarely been discussed in the literature. As a result, many tools and published papers use implementations which do not come close to the machine's capabilities.

This paper describes those details to help with future optimization efforts.

The most advanced version of chemfp evaluates about 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k=1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query, and the same search of the 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest similarity search tools available for CPUs. This appears to be several times faster than previously published work in the field, including papers which use much more sophisticated data structures.

A close analysis shows that nearly all earlier work assumed that the intersection popcount was the limiting performance factor, while on modern hardware uncompressed search is effectively memory-bandwidth limited. For example, AVX2 search is 10% faster with memory prefetching, and the popcount evaluation time is far smaller than the time to fetch a random location in main memory. It proved difficult to evaluate existing tool performance because, in the few cases where the tools were available, each used its own format, data sets, and search tasks.

This paper introduces the chemfp benchmark data set to help make head-to-head comparisons easier in the future, and to help promote the FPS format. The FPS format is slow for tasks like web server reloads and command-line scripting, so this paper also describes the FPB format, a binary application format designed for fast loading.
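
As an illustration of the computation the abstract refers to, here is a minimal sketch, assuming hex-encoded fingerprints like those carried in FPS records; it is not chemfp's API, and the 16-bit example fingerprints are purely hypothetical.

```python
# Minimal sketch (not chemfp's implementation): Tanimoto similarity between
# two hex-encoded fingerprints via popcounts of the AND and of each operand.
def tanimoto(hex_fp1: str, hex_fp2: str) -> float:
    a = int(hex_fp1, 16)
    b = int(hex_fp2, 16)
    intersection = bin(a & b).count("1")                      # popcount(a & b)
    union = bin(a).count("1") + bin(b).count("1") - intersection
    return intersection / union if union else 0.0

# Hypothetical 16-bit fingerprints, for illustration only.
print(tanimoto("f0f0", "f0a0"))   # 0.75
```

A production implementation works on packed byte arrays and uses hardware popcount instructions (POPCNT or AVX2), which is where the performance figures quoted above come from.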


Author(s):  
Bilegsaikhan Naidan ◽  
Magnus Lie Hetland

This article presents a new approximate index structure, the Bregman hyperplane tree, for indexing data under Bregman divergences. The aim is to decrease the number of distance computations required at query time by sacrificing some accuracy in the result. Experimental results on various high-dimensional data sets demonstrate that the proposed index structure performs comparably to the state-of-the-art Bregman ball tree in terms of search performance and result quality. Moreover, this method results in a speedup of well over an order of magnitude for index construction. The authors also apply their space-partitioning principle to the Bregman ball tree and obtain a new index structure for exact nearest neighbor search that is faster to build and slightly slower at query processing than the original.
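
For readers unfamiliar with the divergence family this index targets, the following is a minimal sketch (not the paper's code) of a Bregman divergence d_F(x, y) = F(x) - F(y) - <∇F(y), x - y>, shown for two standard generators F; all function names are illustrative.

```python
import numpy as np

# Generic Bregman divergence from a convex generator F and its gradient.
def bregman(x, y, F, grad_F):
    return F(x) - F(y) - np.dot(grad_F(y), x - y)

# F(x) = ||x||^2 yields the squared Euclidean distance.
sq_norm = lambda v: float(np.dot(v, v))
sq_norm_grad = lambda v: 2.0 * v

# F(x) = sum_i x_i log x_i yields the (generalized) Kullback-Leibler divergence.
neg_entropy = lambda v: float(np.sum(v * np.log(v)))
neg_entropy_grad = lambda v: np.log(v) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman(x, y, sq_norm, sq_norm_grad))          # equals ||x - y||^2
print(bregman(x, y, neg_entropy, neg_entropy_grad))  # equals KL(x || y) here
```

Most Bregman divergences are asymmetric and violate the triangle inequality, which is why specialized index structures such as the ones compared here are needed instead of ordinary metric trees.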


Author(s):  
Toshiro Ogita ◽  
◽  
Hidetomo Ichihashi ◽  
Akira Notsu ◽  
Katsuhiro Honda ◽  
...  

In many computer vision applications, nearest neighbor search in high-dimensional spaces is often the most time-consuming component, and few algorithms for these high-dimensional nearest neighbor search problems are faster than linear search. Approximate nearest neighbor search algorithms can play an important role in achieving significantly faster running times with relatively small errors. This paper considers an improvement to the PCA-tree nearest neighbor search algorithm [1] by employing nearest neighbor distance statistics. During the preprocessing phase of the PCA-tree nearest neighbor search algorithm, a data set is partitioned into clusters by successive use of Principal Component Analysis (PCA). The search performance is significantly improved if the data points are sorted by leaf node and the threshold value is updated each time a smaller distance is found. The threshold is updated by the ε-approximate nearest neighbor approach together with the fixed-threshold approach. Performance can be further improved by the annulus bound approach. Moreover, nearest neighbor distance statistics are employed to further improve the efficiency of the search algorithm, and several experimental results demonstrate how its efficiency is improved.
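
The following minimal sketch (not the authors' implementation) illustrates the threshold-shrinking, ε-approximate scan described above: candidates are visited leaf by leaf, the search radius shrinks whenever a closer point is found, and a leaf is skipped when its distance lower bound cannot beat the current threshold divided by (1 + ε). The leaf bounds here are hypothetical stand-ins for what a PCA-tree would provide.

```python
import numpy as np

def approx_nn(query, leaves, eps=0.1):
    """leaves: list of (lower_bound, points) pairs, visited in order."""
    best_dist, best_point = np.inf, None
    for lower_bound, points in leaves:
        if lower_bound * (1.0 + eps) >= best_dist:
            continue                      # eps-approximate pruning of the leaf
        for p in points:
            d = np.linalg.norm(query - p)
            if d < best_dist:             # shrink the threshold
                best_dist, best_point = d, p
    return best_point, best_dist

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 16))
# Stand-in leaves with trivial (zero) bounds; a real PCA-tree computes tighter ones.
leaves = [(0.0, chunk) for chunk in np.array_split(data, 8)]
print(approx_nn(data[0] + 0.01, leaves)[1])   # distance of the found neighbor
```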


2019 ◽  
Vol 11 (1) ◽  
Author(s):  
Andrew Dalke

The chemfp project has had four main goals: (1) promote the FPS format as a text-based exchange format for dense binary cheminformatics fingerprints, (2) develop a high-performance implementation of the BitBound algorithm that could be used as an effective baseline to benchmark new similarity search implementations, (3) experiment with funding a pure open source software project through commercial sales, and (4) publish the results and lessons learned as a guide for future implementors. The FPS format has had only minor success, though it did influence development of the FPB binary format, which is faster to load but more complex. Both are summarized. The chemfp benchmark and the no-cost/open source version of chemfp are proposed as a reference baseline to evaluate the effectiveness of other similarity search tools. They are used to evaluate the faster commercial version of chemfp, which can test 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k = 1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query. The same search of 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest CPU-based similarity search implementations. Modern CPUs are fast enough that memory bandwidth and latency are now important factors. Single-threaded search uses most of the available memory bandwidth. Sorting the fingerprints by popcount improves memory coherency, which, when combined with 4 OpenMP threads, makes it possible to construct an N × N similarity matrix for 1 million fingerprints in about 30 min. These observations may affect the interpretation of previous publications which assumed that search was strongly CPU bound. The chemfp project funding came from selling a purely open-source software product. Several product business models were tried, but none proved sustainable. Some of the experiences are discussed, in order to contribute to the ongoing conversation on the role of open source software in cheminformatics.
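
The BitBound pruning the abstract relies on can be stated in a few lines. The sketch below is an illustration, not chemfp's code: if the query has q bits set and a target has t bits set, the Tanimoto score can be at most min(q, t)/max(q, t), so for a threshold T only targets with popcount between ⌈Tq⌉ and ⌊q/T⌋ need to be examined; with the fingerprints sorted by popcount, those targets form one contiguous slice.

```python
import math

# Popcount range of candidate fingerprints for a Tanimoto threshold search.
def bitbound_popcount_range(query_popcount: int, threshold: float):
    low = math.ceil(threshold * query_popcount)
    high = math.floor(query_popcount / threshold)
    return low, high

print(bitbound_popcount_range(320, 0.75))   # (240, 426)
```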


2013 ◽  
Vol 321-324 ◽  
pp. 2165-2170
Author(s):  
Seung Hoon Lee ◽  
Jaek Wang Kim ◽  
Jae Dong Lee ◽  
Jee Hyong Lee

The nearest neighbor search in high-dimensional space is an important operation in many applications, such as data mining and multimedia databases. Evaluating similarity in high-dimensional space has a high computational cost, so index structures are frequently used to reduce it. Most of these index structures are built by partitioning the data set. However, partitioning approaches can fail to find the true nearest neighbor when it lies across a partition boundary. In this paper, we propose the Error Minimizing Partitioning (EMP) method with a novel tree structure that minimizes such failures. EMP divides the data into subsets while taking the distribution of the data into account. To partition a data set, the proposed method finds the line that minimizes the summation of distances to the data points. The method then finds the median of the data set along this line. Finally, the proposed method determines the partitioning hyperplane that passes through the median and is perpendicular to the line. We also present a comparative study between existing methods and the proposed method to verify its effectiveness.
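
Below is a minimal sketch of the partitioning step described above, under the assumption that the line minimizing the summation of distances is approximated by the first principal direction of the data: points are projected onto that direction and split at the median projection, which corresponds to a hyperplane through the median, perpendicular to the line. This is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

def split_by_hyperplane(points):
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]                      # first principal direction
    proj = centered @ direction            # signed position along the line
    median = np.median(proj)
    left = points[proj <= median]          # one side of the hyperplane
    right = points[proj > median]          # the other side
    return direction, median, left, right

rng = np.random.default_rng(1)
pts = rng.normal(size=(500, 8))
_, _, left, right = split_by_hyperplane(pts)
print(len(left), len(right))               # roughly equal-sized halves
```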


2019 ◽  
Vol 8 (2) ◽  
pp. 1164-1171

Data about entities or objects associated with geographical or location information is called spatial data. Spatial data helps identify and position anyone or anything anywhere in the world. Instances of different spatial features that are frequently found close together are called spatial co-located patterns. So far, spatial co-located patterns have been used only for the knowledge discovery process, but they could serve a wide variety of applications if analyzed more intensively. One such application is to use co-location pattern mining for context-aware search. Hence the main aim of this work is to extend K-Nearest Neighbor (KNN) querying to co-located instances for context-aware querying and location-based services (LBS). For this purpose, a co-located nearest neighbor search algorithm named “CONNEKT” is proposed. The co-located instances are mapped onto a k-dimensional tree (k-d tree) in order to make the querying process efficient. The algorithm is analyzed using a hypothetical data set generated through QGIS.
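
As a small sketch of the k-d tree step described above (the coordinates and query point are hypothetical, and this is not the CONNEKT implementation itself), co-located instances can be loaded into a k-d tree and queried for the k nearest neighbors of a location:

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical 2-D coordinates of co-located instances.
coords = np.array([[12.97, 77.59], [12.98, 77.60], [13.00, 77.58], [12.95, 77.62]])
tree = cKDTree(coords)                        # build the k-d tree once
dist, idx = tree.query([12.96, 77.60], k=3)   # 3 nearest co-located instances
print(idx, dist)
```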


Author(s):  
Varun Pandey ◽  
Alexander van Renen ◽  
Andreas Kipf ◽  
Alfons Kemper

Many applications today, like Uber, Yelp, and Tinder, rely on spatial data or locations from their users. These applications and services either build their own spatial data management systems or rely on existing solutions. JTS Topology Suite (JTS), its C++ port GEOS, Google S2, ESRI Geometry API, and Java Spatial Index (JSI) are some of the spatial processing libraries that these systems build upon. These applications and services depend on the indexing capabilities available in these libraries for high-performance spatial query processing. In this work, we compare these libraries qualitatively and quantitatively based on four different spatial queries using two real-world datasets. We also compare these libraries with an open-source implementation of the Vantage Point Tree, an index structure that has been well studied in image retrieval and nearest-neighbor search algorithms for high-dimensional data. We found that Vantage Point Trees are very competitive and even outperform the aforementioned libraries in two of the queries.
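
As a rough sketch of the vantage-point tree idea mentioned above (an illustration, not the open-source implementation that was benchmarked): each node stores a vantage point and the median distance to it, points closer than the median go to the inner child, the rest to the outer child, and a query descends the side containing it first, visiting the other side only if it could still hold a closer point.

```python
import random

def build_vptree(points, dist):
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return {"vp": vp, "mu": 0.0, "inner": None, "outer": None}
    mu = sorted(dist(vp, p) for p in rest)[len(rest) // 2]   # median distance
    inner = [p for p in rest if dist(vp, p) < mu]
    outer = [p for p in rest if dist(vp, p) >= mu]
    return {"vp": vp, "mu": mu,
            "inner": build_vptree(inner, dist),
            "outer": build_vptree(outer, dist)}

def nearest(node, q, dist, best=(float("inf"), None)):
    if node is None:
        return best
    d = dist(q, node["vp"])
    if d < best[0]:
        best = (d, node["vp"])
    near, far = ("inner", "outer") if d < node["mu"] else ("outer", "inner")
    best = nearest(node[near], q, dist, best)
    if abs(d - node["mu"]) < best[0]:         # other side may still be closer
        best = nearest(node[far], q, dist, best)
    return best

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
pts = [(random.random(), random.random()) for _ in range(1000)]
tree = build_vptree(pts, euclid)
print(nearest(tree, (0.5, 0.5), euclid))      # (distance, nearest point)
```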


2014 ◽  
Vol 556-562 ◽  
pp. 3804-3808
Author(s):  
Peng Wang ◽  
Dong Yin ◽  
Tao Sun

Locality sensitive hashing (LSH) is the most popular algorithm for approximate nearest neighbor search. Because LSH partitions the vector space uniformly while the distribution of vectors is usually non-uniform, it fits real datasets poorly and has limited search performance. In this paper, we propose a new bi-level locality sensitive hashing algorithm, which uses a two-level structure to perform approximate nearest neighbor search in high-dimensional spaces. In the first level, we train a number of cluster centers, then use them to divide the dataset into clusters so that the vectors in each cluster have a nearly uniform distribution. In the second level, we construct locality sensitive hash tables for each cluster. Given a query, we determine the few clusters that it belongs to with high probability, and then perform approximate nearest neighbor search in the corresponding locality sensitive hash tables. Experimental results on a dataset of 1,000,000 vectors show that the search speed can be increased by 48 times compared to Euclidean locality sensitive hashing, while keeping high search precision.
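
A minimal sketch of the two-level idea described above (not the authors' code): points are first assigned to the closest of a handful of cluster centers, and each cluster gets its own Euclidean-LSH table built from random projections h(x) = ⌊(a·x + b)/w⌋. The cluster centers, projection count, and bucket width below are arbitrary stand-ins, and only the single closest cluster is probed for brevity, whereas the paper probes a few.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_key(x, A, b, w):
    return tuple(np.floor((A @ x + b) / w).astype(int))

def build_index(data, centers, n_proj=8, w=1.0):
    index = []
    for _ in centers:                                 # one LSH table per cluster
        A = rng.normal(size=(n_proj, data.shape[1]))
        b = rng.uniform(0.0, w, size=n_proj)
        index.append({"A": A, "b": b, "w": w, "table": {}})
    assign = np.argmin(((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    for i, x in enumerate(data):
        t = index[assign[i]]
        t["table"].setdefault(lsh_key(x, t["A"], t["b"], t["w"]), []).append(i)
    return index

def query(q, data, centers, index):
    t = index[int(np.argmin(((centers - q) ** 2).sum(-1)))]   # closest cluster
    candidates = t["table"].get(lsh_key(q, t["A"], t["b"], t["w"]), [])
    return min(candidates, key=lambda i: np.linalg.norm(data[i] - q), default=None)

data = rng.normal(size=(2000, 16))
centers = data[:4]                          # stand-in for trained cluster centers
index = build_index(data, centers)
print(query(data[10], data, centers, index))   # a database point finds itself: 10
```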

