Improvement of PCA-Based Approximate Nearest Neighbor Search Using Distance Statistics

In many computer vision applications, nearest neighbor search in high-dimensional spaces is the most time-consuming component, and few algorithms solve these high-dimensional problems faster than linear search. Approximate nearest neighbor search algorithms can achieve significantly faster running times at the cost of relatively small errors. This paper improves the PCA-tree nearest neighbor search algorithm [1] by employing nearest neighbor distance statistics. During the preprocessing phase of the PCA-tree algorithm, a data set is partitioned into clusters by successive applications of Principal Component Analysis (PCA). Search performance improves significantly if the data points are sorted by leaf node and the threshold value is updated each time a smaller distance is found. The threshold is updated using the ε-approximate nearest neighbor approach together with the fixed-threshold approach. Performance can be further improved by the annulus bound approach. Finally, nearest neighbor distance statistics are employed to further improve the efficiency of the search algorithm, and several experimental results demonstrate the improvement.
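The idea of tightening a distance threshold during the search and relaxing it by a factor of (1 + ε) can be illustrated with a much simpler structure than a PCA-tree. The sketch below is not the algorithm of the paper: it assumes a single, caller-supplied unit direction (where the paper would use principal components), and the names `build_index` and `approx_nn` are hypothetical. It relies only on the fact that the difference of two projections onto a unit vector is a lower bound on the Euclidean distance between the points.

```python
import bisect
import math


def project(point, direction):
    """Dot product of a point with a direction vector."""
    return sum(a * b for a, b in zip(point, direction))


def build_index(points, direction):
    """Sort points by their projection onto a unit-length direction (assumed given)."""
    order = sorted(points, key=lambda p: project(p, direction))
    keys = [project(p, direction) for p in order]
    return keys, order


def approx_nn(query, keys, order, direction, eps=0.0):
    """Return (point, distance) with distance <= (1 + eps) * true nearest distance."""
    qk = project(query, direction)
    right = bisect.bisect_left(keys, qk)  # first point at or beyond the query
    left = right - 1
    best, best_point = math.inf, None
    while left >= 0 or right < len(order):
        gap_l = qk - keys[left] if left >= 0 else math.inf
        gap_r = keys[right] - qk if right < len(order) else math.inf
        # Visit the side whose next point has the smaller projection gap.
        if gap_l <= gap_r:
            gap, idx = gap_l, left
            left -= 1
        else:
            gap, idx = gap_r, right
            right += 1
        # Every unvisited point is at least `gap` away from the query,
        # so the search can stop once the relaxed bound reaches the threshold.
        if (1.0 + eps) * gap >= best:
            break
        d = math.dist(query, order[idx])
        if d < best:  # a smaller distance was found: update the threshold
            best, best_point = d, order[idx]
    return best_point, best
```

With `eps=0.0` the search is exact; a larger `eps` lets the loop terminate earlier while keeping the returned distance within the stated factor of the true nearest distance.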

Scalable Distributed Algorithm for Approximate Nearest Neighbor Search Problem in High Dimensional General Metric Spaces

10.1007/978-3-642-32153-5_10 ◽

2012 ◽

pp. 132-147 ◽

Cited By ~ 13

Author(s):

Yury Malkov ◽

Alexander Ponomarenko ◽

Andrey Logvinov ◽

Vladimir Krylov

Keyword(s):

Distributed Algorithm ◽

Metric Spaces ◽

Nearest Neighbor ◽

Nearest Neighbor Search ◽

High Dimensional ◽

Search Problem ◽

Approximate Nearest Neighbor Search ◽

Approximate Nearest Neighbor ◽

Neighbor Search

A Fast Approximate Nearest Neighbor Search Algorithm in the Hamming Space

IEEE Transactions on Pattern Analysis and Machine Intelligence ◽

10.1109/tpami.2012.170 ◽

2012 ◽

Vol 34 (12) ◽

pp. 2481-2488 ◽

Cited By ~ 31

Author(s):

Mani Malek Esmaeili ◽

R. K. Ward ◽

M. Fatourechi

Keyword(s):

Nearest Neighbor ◽

Search Algorithm ◽

Nearest Neighbor Search ◽

Approximate Nearest Neighbor Search ◽

Approximate Nearest Neighbor ◽

Neighbor Search ◽

Hamming Space

Approximate Nearest Neighbor Search on High Dimensional Data — Experiments, Analyses, and Improvement

IEEE Transactions on Knowledge and Data Engineering ◽

10.1109/tkde.2019.2909204 ◽

2020 ◽

Vol 32 (8) ◽

pp. 1475-1488 ◽

Cited By ~ 7

Author(s):

Wen Li ◽

Ying Zhang ◽

Yifang Sun ◽

Wei Wang ◽

Mingjie Li ◽

...

Keyword(s):

Nearest Neighbor ◽

High Dimensional Data ◽

Nearest Neighbor Search ◽

High Dimensional ◽

Approximate Nearest Neighbor Search ◽

Approximate Nearest Neighbor ◽

Neighbor Search

The chemfp Project

10.26434/chemrxiv.7877846.v1 ◽

2019 ◽

Author(s):

Andrew Dalke

Keyword(s):

Similarity Search ◽

High Performance ◽

Nearest Neighbor ◽

Main Memory ◽

Nearest Neighbor Search ◽

Search Performance ◽

Command Line ◽

Data Set ◽

Neighbor Search ◽

Order Of Magnitude

This paper describes the 10 years of work and research results of the chemfp project, available from http://chemfp.com/ . The project started as a way to promote the FPS format for cheminformatics fingerprint exchange. This is a line-oriented text format meant to be easy to read and write. It supports metadata such as the fingerprint type and data provenance. The chemfp package for Python was developed to provide the basic command-line tools and Python API for working with fingerprint data, because a format without useful tools will not be used. The similarity search performance improved by an order of magnitude over the decade, due to careful implementation and effective use of CPU hardware, including AVX2 support for faster popcount calculations than the built-in POPCNT instruction. The implementation details for high-performance search have rarely been discussed in the literature. As a result, many tools and published papers use implementations which are not close to the machine's capabilities. This paper describes those details to help with future optimization efforts. The most advanced version of chemfp evaluates about 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k=1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query and the same search of the 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest similarity search tools available for CPUs. This appears to be several times faster than previously published work in the field, including in papers which use much more sophisticated data structures. A close analysis shows that nearly all earlier work assumes that the intersection popcount was the limiting performance factor, while on modern hardware uncompressed search is effectively memory bandwidth limited. For example, AVX2 search is 10% faster when memory prefetching is used, and the popcount evaluation time is far faster than fetching a random location in main memory. It proved difficult to evaluate existing tool performance because in the few cases where the tools were available, each used its own format, data sets, and search tasks. This paper introduces the chemfp benchmark data set to help make head-to-head comparisons easier in the future, and to help promote the FPS format. The FPS format is slow for tasks like web server reloads and command-line scripting. This paper also describes the FPB format, which is a binary application format for fast loads.
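As a rough illustration of the two ingredients named in the abstract, a popcount-based Tanimoto score and BitBound pruning, the following sketch stores fingerprints as plain Python integers. It is not chemfp's API or implementation: the function names are hypothetical, it performs a threshold search rather than the k-nearest search mentioned above, and treating the Tanimoto of two all-zero fingerprints as 0 is an assumption. The pruning rule is the standard BitBound bound: a fingerprint with b bits set can reach a Tanimoto of at least t against a query with a bits set only if a·t ≤ b ≤ a/t.

```python
import math
from collections import defaultdict


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two bit fingerprints stored as ints."""
    union = popcount(a | b)
    # Assumption: two all-zero fingerprints score 0.0.
    return popcount(a & b) / union if union else 0.0


def build_popcount_index(fingerprints):
    """Group fingerprints by popcount so whole groups can be skipped."""
    index = defaultdict(list)
    for fp in fingerprints:
        index[popcount(fp)].append(fp)
    return index


def threshold_search(query, index, threshold):
    """Return (score, fingerprint) pairs with score >= threshold, best first.

    Requires 0 < threshold <= 1.
    """
    a = popcount(query)
    # BitBound: only popcounts in [a * t, a / t] can reach the threshold.
    # The small tolerance only widens the filter; the exact score check
    # below still decides which fingerprints are returned.
    lo = math.ceil(a * threshold - 1e-9)
    hi = math.floor(a / threshold + 1e-9)
    hits = []
    for count in range(lo, hi + 1):
        for fp in index.get(count, ()):
            score = tanimoto(query, fp)
            if score >= threshold:
                hits.append((score, fp))
    return sorted(hits, reverse=True)
```

In a pure-Python sketch like this the popcount dominates the running time, so it says nothing about the memory-bandwidth behaviour the abstract describes; it only shows which comparisons the bound allows a search to skip.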

An Error Minimizing Partitioning Method for the Nearest Neighbor Search

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.321-324.2165 ◽

2013 ◽

Vol 321-324 ◽

pp. 2165-2170

Author(s):

Seung Hoon Lee ◽

Jaek Wang Kim ◽

Jae Dong Lee ◽

Jee Hyong Lee

Keyword(s):

Nearest Neighbor ◽

Dimensional Space ◽

Computational Cost ◽

Nearest Neighbor Search ◽

High Dimensional ◽

Index Structures ◽

Cost Index ◽

Data Set ◽

High Dimensional Space ◽

Neighbor Search

Nearest neighbor search in high-dimensional space is an important operation in many applications, such as data mining and multimedia databases. Evaluating similarity in high-dimensional space is computationally expensive, so index structures are frequently used to reduce the cost. Most of these index structures are built by partitioning the data set. However, partitioning approaches can fail to find the true nearest neighbor because of the partitions themselves. In this paper, we propose the Error Minimizing Partitioning (EMP) method, with a novel tree structure that minimizes such failures. EMP divides the data into subsets while taking the distribution of the data into account. To partition a data set, the proposed method first finds the line that minimizes the sum of distances to the data points. It then finds the median of the data set. Finally, it determines the partitioning hyperplane that passes through the median and is perpendicular to the line. We also compare the proposed method with existing methods to verify its effectiveness.
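The three partitioning steps (fit a line, take the median, cut with a perpendicular hyperplane) might be sketched as follows. This is an illustration under stated assumptions, not the EMP method itself: the abstract does not say how the line is fitted, so the sketch substitutes the first principal direction through the centroid, which minimizes the sum of squared distances, and takes the median of the projections onto that line. The names `principal_direction` and `split` are hypothetical.

```python
import math
import statistics


def principal_direction(points, iterations=100):
    """Approximate the first principal direction by power iteration."""
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    centered = [[p[i] - mean[i] for i in range(dim)] for p in points]
    # Asymmetric start vector; a start exactly orthogonal to the principal
    # direction would stall the iteration, which this sketch does not handle.
    v = [1.0 / (i + 1) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        # Multiply v by the (unnormalized) covariance matrix: w = C^T (C v).
        scores = [sum(c[i] * v[i] for i in range(dim)) for c in centered]
        w = [sum(s * c[i] for s, c in zip(scores, centered)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def split(points):
    """Split points by the hyperplane {x : direction . x = offset}."""
    direction = principal_direction(points)
    proj = [sum(a * b for a, b in zip(p, direction)) for p in points]
    offset = statistics.median(proj)  # the hyperplane passes through the median
    left = [p for p, s in zip(points, proj) if s <= offset]
    right = [p for p, s in zip(points, proj) if s > offset]
    return direction, offset, left, right
```

Applying `split` recursively to `left` and `right` would yield a binary tree of roughly balanced cells, since the median puts about half of the points on each side of the hyperplane.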

Massive parallelization of approximate nearest neighbor search on KD-tree for high-dimensional image descriptor matching

Journal of Visual Communication and Image Representation ◽

10.1016/j.jvcir.2017.01.013 ◽

2017 ◽

Vol 44 ◽

pp. 106-115 ◽

Cited By ~ 3

Author(s):

Linjia Hu ◽

Saeid Nooshabadi

Keyword(s):

Nearest Neighbor ◽

Nearest Neighbor Search ◽

High Dimensional ◽

Approximate Nearest Neighbor Search ◽

Image Descriptor ◽

Dimensional Image ◽

Approximate Nearest Neighbor ◽

Neighbor Search ◽

Massive Parallelization

The chemfp Project

10.26434/chemrxiv.7877846.v2 ◽

2019 ◽

Author(s):

Andrew Dalke

Keyword(s):

Similarity Search ◽

High Performance ◽

Nearest Neighbor ◽

Main Memory ◽

Nearest Neighbor Search ◽

Search Performance ◽

Command Line ◽

Data Set ◽

Neighbor Search ◽

Order Of Magnitude

This paper describes the 10 years of work and research results of the chemfp project, available from http://chemfp.com/ . The project started as a way to promote the FPS format for cheminformatics fingerprint exchange. This is a line-oriented text format meant to be easy to read and write. It supports metadata such as the fingerprint type and data provenance. The chemfp package for Python was developed to provide the basic command-line tools and Python API for working with fingerprint data, because a format without useful tools will not be used.

The similarity search performance improved by an order of magnitude over the decade, due to careful implementation and effective use of CPU hardware, including AVX2 support for faster popcount calculations than the built-in POPCNT instruction. The implementation details for high-performance search have rarely been discussed in the literature. As a result, many tools and published papers use implementations which are not close to the machine's capabilities.

This paper describes those details to help with future optimization efforts.

The most advanced version of chemfp evaluates about 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k=1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query and the same search of the 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest similarity search tools available for CPUs. This appears to be several times faster than previously published work in the field, including in papers which use much more sophisticated data structures.

A close analysis shows that nearly all earlier work assumes that the intersection popcount was the limiting performance factor, while on modern hardware uncompressed search is effectively memory bandwidth limited. For example, AVX2 search is 10% faster when memory prefetching is used, and the popcount evaluation time is far faster than fetching a random location in main memory. It proved difficult to evaluate existing tool performance because in the few cases where the tools were available, each used its own format, data sets, and search tasks.

This paper introduces the chemfp benchmark data set to help make head-to-head comparisons easier in the future, and to help promote the FPS format. The FPS format is slow for tasks like web server reloads and command-line scripting. This paper also describes the FPB format, which is a binary application format for fast loads.
