Fast open modification spectral library searching through approximate nearest neighbor indexing

AbstractOpen modification searching (OMS) is a powerful search strategy that identifies peptides carrying any type of modification by allowing a modified spectrum to match against its unmodified variant by using a very wide precursor mass window. A drawback of this strategy, however, is that it leads to a large increase in search time. Although performing an open search can be done using existing spectral library search engines by simply setting a wide precursor mass window, none of these tools have been optimized for OMS, leading to excessive runtimes and suboptimal identification results. Here we present the ANN-SoLo tool for fast and accurate open spectral library searching. ANN-SoLo uses approximate nearest neighbor indexing to speed up OMS by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. This approach is combined with a cascade search strategy to maximize the number of identified unmodified and modified spectra while strictly controlling the false discovery rate, as well as a shifted dot product score to sensitively match modified spectra to their unmodified counterparts. ANN-SoLo achieves state-of-the-art performance in terms of speed and the number of identifications. On a previously published human cell line data set, ANN-SoLo confidently identifies more spectra than SpectraST or MSFragger and achieves a speedup of an order of magnitude compared to SpectraST.ANN-SoLo is implemented in Python and C++. It is freely available under the Apache 2.0 license athttps://github.com/bittremieux/ANN-SoLo.

Download Full-text

The Chemfp Project

10.26434/chemrxiv.7877846.v1 ◽

2019 ◽

Author(s):

Andrew Dalke

Keyword(s):

Similarity Search ◽

High Performance ◽

Nearest Neighbor ◽

Main Memory ◽

Nearest Neighbor Search ◽

Search Performance ◽

Command Line ◽

Data Set ◽

Neighbor Search ◽

Order Of Magnitude

<div>This paper describes the 10 years of work and research results of the chemfp project, available from http://chemfp.com/ . The project started as a way to promote the FPS format for cheminformatics fingerprint exchange. This is a line-oriented text format meant to be easy to read and write. It supports metadata such as the fingerprint type and data provenance.The chemfp package for Python was developed to provide the basic command-line tools and Python API for working with fingerprint data, because a format without useful tools will not be used. The similarity search performance improved by an order of magnitude over the decade, due to careful implementation and effective use of CPU hardware, including AVX2 support for faster popcount calculations than the built-in POPCNT instruction. The implementation details for high-performance search have rarely been discussed in the literature. As a result, many tools and published papers use implementations which are not close to the machine's capabilities. This paper describes those details to help with future optimization efforts. The most advanced version of chemfp evaluates about 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k=1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query and the same search of the 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest similarity search tools available for CPUs. This appears to be several times faster than previously published work in the field, including in papers which use much more sophisticated data structures. A close analysis shows that nearly all earlier work assumes that the intersection popcount was the limiting performance factor, while on modern hardware uncompressed search is effectively memory bandwidth limited. For example, AVX2 search is 10% faster when memory prefetching, and the popcount evaluation time is far faster than fetching a random location in main memory. It proved difficult to evaluate existing tool performance because in the few cases where the tools were available, each used its own format, data sets, and search tasks. This paper introduces the chemfp benchmark data set to help make head-to-head comparisons easier in the future, and to help promote the FPS format. The FPS format is slow for tasks like web server reloads and command-line scripting. This paper also describes the FPB format, which is a binary application format for fast loads. </div>

Download Full-text

The chemfp Project

10.26434/chemrxiv.7877846.v2 ◽

2019 ◽

Author(s):

Andrew Dalke

Keyword(s):

Similarity Search ◽

High Performance ◽

Nearest Neighbor ◽

Main Memory ◽

Nearest Neighbor Search ◽

Search Performance ◽

Command Line ◽

Data Set ◽

Neighbor Search ◽

Order Of Magnitude

<div>This paper describes the 10 years of work and research results of the chemfp project, available from http://chemfp.com/ . The project started as a way to promote the FPS format for cheminformatics fingerprint exchange. This is a line-oriented text format meant to be easy to read and write. It supports metadata such as the fingerprint type and data provenance.The chemfp package for Python was developed to provide the basic command-line tools and Python API for working with fingerprint data, because a format without useful tools will not be used. <br></div><div><br></div><div>The similarity search performance improved by an order of magnitude over the decade, due to careful implementation and effective use of CPU hardware, including AVX2 support for faster popcount calculations than the built-in POPCNT instruction. The implementation details for high-performance search have rarely been discussed in the literature. As a result, many tools and published papers use implementations which are not close to the machine's capabilities.</div><div><br></div><div>This paper describes those details to help with future optimization efforts.</div><div><br></div><div>The most advanced version of chemfp evaluates about 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k=1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query and the same search of the 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest similarity search tools available for CPUs. This appears to be several times faster than previously published work in the field, including in papers which use much more sophisticated data structures.</div><div><br></div><div>A close analysis shows that nearly all earlier work assumes that the intersection popcount was the limiting performance factor, while on modern hardware uncompressed search is effectively memory bandwidth limited. For example, AVX2 search is 10% faster when memory prefetching, and the popcount evaluation time is far faster than fetching a random location in main memory. It proved difficult to evaluate existing tool performance because in the few cases where the tools were available, each used its own format, data sets, and search tasks.</div><div><br></div><div>This paper introduces the chemfp benchmark data set to help make head-to-head comparisons easier in the future, and to help promote the FPS format. The FPS format is slow for tasks like web server reloads and command-line scripting. This paper also describes the FPB format, which is a binary application format for fast loads. </div>

Download Full-text

The chemfp Project

10.26434/chemrxiv.7877846 ◽

2019 ◽

Author(s):

Andrew Dalke

Keyword(s):

Similarity Search ◽

High Performance ◽

Nearest Neighbor ◽

Main Memory ◽

Nearest Neighbor Search ◽

Search Performance ◽

Command Line ◽

Data Set ◽

Neighbor Search ◽

Order Of Magnitude

<div>This paper describes the 10 years of work and research results of the chemfp project, available from http://chemfp.com/ . The project started as a way to promote the FPS format for cheminformatics fingerprint exchange. This is a line-oriented text format meant to be easy to read and write. It supports metadata such as the fingerprint type and data provenance.The chemfp package for Python was developed to provide the basic command-line tools and Python API for working with fingerprint data, because a format without useful tools will not be used. <br></div><div><br></div><div>The similarity search performance improved by an order of magnitude over the decade, due to careful implementation and effective use of CPU hardware, including AVX2 support for faster popcount calculations than the built-in POPCNT instruction. The implementation details for high-performance search have rarely been discussed in the literature. As a result, many tools and published papers use implementations which are not close to the machine's capabilities.</div><div><br></div><div>This paper describes those details to help with future optimization efforts.</div><div><br></div><div>The most advanced version of chemfp evaluates about 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k=1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query and the same search of the 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest similarity search tools available for CPUs. This appears to be several times faster than previously published work in the field, including in papers which use much more sophisticated data structures.</div><div><br></div><div>A close analysis shows that nearly all earlier work assumes that the intersection popcount was the limiting performance factor, while on modern hardware uncompressed search is effectively memory bandwidth limited. For example, AVX2 search is 10% faster when memory prefetching, and the popcount evaluation time is far faster than fetching a random location in main memory. It proved difficult to evaluate existing tool performance because in the few cases where the tools were available, each used its own format, data sets, and search tasks.</div><div><br></div><div>This paper introduces the chemfp benchmark data set to help make head-to-head comparisons easier in the future, and to help promote the FPS format. The FPS format is slow for tasks like web server reloads and command-line scripting. This paper also describes the FPB format, which is a binary application format for fast loads. </div>

Download Full-text

Improvement of PCA-Based Approximate Nearest Neighbor Search Using Distance Statistics

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2014.p0658 ◽

2014 ◽

Vol 18 (4) ◽

pp. 658-664

Author(s):

Toshiro Ogita ◽

◽

Hidetomo Ichihashi ◽

Akira Notsu ◽

Katsuhiro Honda ◽

...

Keyword(s):

Nearest Neighbor ◽

Search Algorithm ◽

Nearest Neighbor Search ◽

High Dimensional ◽

Search Performance ◽

Neighbor Distance ◽

Data Set ◽

Approximate Nearest Neighbor ◽

Neighbor Search ◽

Nearest Neighbor Distance

In many computer vision applications, nearest neighbor searching in high-dimensional spaces is often the most time consuming component and we have few algorithms for solving these high-dimensional nearest neighbor search problems that are faster than linear search. Approximately nearest neighbor search algorithms can play an important role in achieving significantly faster running times with relatively small errors. This paper considers the improvement of the PCA-tree nearest neighbor search algorithm [1] by employing nearest neighbor distance statistics. During the preprocessing phase of the PCA-tree nearest neighbor search algorithm, a data set is partitioned into clusters by successive use of Principal Component Analysis (PCA). The search performance is significantly improved if the data points are sorted by leaf node, and the threshold value is updated each time a smaller distance is found. The threshold is updated by the ε-approximate nearest neighbor approach together with the fixed-threshold approach. Performance can be further improved by the annulus bound approach. Moreover, nearest neighbor distance statistics is employed for further improving the efficiency of the search algorithm and the several experimental results are shown for demonstrating how its efficiency is improved.

Download Full-text

Fast Open Modification Spectral Library Searching through Approximate Nearest Neighbor Indexing

Journal of Proteome Research ◽

10.1021/acs.jproteome.8b00359 ◽

2018 ◽

Vol 17 (10) ◽

pp. 3463-3474 ◽

Cited By ~ 25

Author(s):

Wout Bittremieux ◽

Pieter Meysman ◽

William Stafford Noble ◽

Kris Laukens

Keyword(s):

Nearest Neighbor ◽

Spectral Library ◽

Approximate Nearest Neighbor

Download Full-text

Machine Learning Verdict of EEG Signals in Brain Computer Interface

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit1838114 ◽

2018 ◽

pp. 429-441

Author(s):

M. Jeyanthi ◽

C. Velayutham

Keyword(s):

Nearest Neighbor ◽

Technology Development ◽

Vital Role ◽

Svm Classifier ◽

K Nearest Neighbor ◽

Data Mining Technique ◽

Data Set ◽

Eeg Data ◽

Irrelevant Attributes

In Science and Technology Development BCI plays a vital role in the field of Research. Classification is a data mining technique used to predict group membership for data instances. Analyses of BCI data are challenging because feature extraction and classification of these data are more difficult as compared with those applied to raw data. In this paper, We extracted features using statistical Haralick features from the raw EEG data . Then the features are Normalized, Binning is used to improve the accuracy of the predictive models by reducing noise and eliminate some irrelevant attributes and then the classification is performed using different classification techniques such as Naïve Bayes, k-nearest neighbor classifier, SVM classifier using BCI dataset. Finally we propose the SVM classification algorithm for the BCI data set.

Download Full-text

Adaptive bit allocation hashing for approximate nearest neighbor search

Neurocomputing ◽

10.1016/j.neucom.2014.10.042 ◽

2015 ◽

Vol 151 ◽

pp. 719-728 ◽

Cited By ~ 4

Author(s):

Qin-Zhen Guo ◽

Zhi Zeng ◽

Shuwu Zhang

Keyword(s):

Nearest Neighbor ◽

Nearest Neighbor Search ◽

Bit Allocation ◽

Approximate Nearest Neighbor Search ◽

Approximate Nearest Neighbor ◽

Neighbor Search

Download Full-text

A graphical heuristic for reduction and partitioning of large datasets for scalable supervised training

Journal Of Big Data ◽

10.1186/s40537-019-0259-3 ◽

2019 ◽

Vol 6 (1) ◽

Cited By ~ 2

Author(s):

Sumedh Yadav ◽

Mathis Bode

Keyword(s):

Prediction Accuracy ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Large Datasets ◽

Test Results ◽

Accuracy Test ◽

Reasonable Proportion ◽

Speed Up ◽

Run Time ◽

Information Graph

Abstract A scalable graphical method is presented for selecting and partitioning datasets for the training phase of a classification task. For the heuristic, a clustering algorithm is required to get its computation cost in a reasonable proportion to the task itself. This step is succeeded by construction of an information graph of the underlying classification patterns using approximate nearest neighbor methods. The presented method consists of two approaches, one for reducing a given training set, and another for partitioning the selected/reduced set. The heuristic targets large datasets, since the primary goal is a significant reduction in training computation run-time without compromising prediction accuracy. Test results show that both approaches significantly speed-up the training task when compared against that of state-of-the-art shrinking heuristics available in LIBSVM. Furthermore, the approaches closely follow or even outperform in prediction accuracy. A network design is also presented for a partitioning based distributed training formulation. Added speed-up in training run-time is observed when compared to that of serial implementation of the approaches.

Download Full-text

Improving Approximate Nearest Neighbor Search through Learned Adaptive Early Termination

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data ◽

10.1145/3318464.3380600 ◽

2020 ◽

Author(s):

Conglong Li ◽

Minjia Zhang ◽

David G. Andersen ◽

Yuxiong He

Keyword(s):

Nearest Neighbor ◽

Nearest Neighbor Search ◽

Early Termination ◽

Approximate Nearest Neighbor Search ◽

Approximate Nearest Neighbor ◽

Neighbor Search

Download Full-text

Statistically rigorous calculations do not support common input and long-term synchronization of motor-unit firings

Journal of Neurophysiology ◽

10.1152/jn.00725.2013 ◽

2014 ◽

Vol 112 (11) ◽

pp. 2729-2744 ◽

Cited By ~ 11

Author(s):

Carlo J. De Luca ◽

Joshua C. Kline

Keyword(s):

Motor Unit ◽

Statistical Tests ◽

Vastus Lateralis ◽

Common Input ◽

Data Set ◽

Order Of Magnitude ◽

Score Method ◽

Surface Electromyographic Signal ◽

Measure Synchronization

Over the past four decades, various methods have been implemented to measure synchronization of motor-unit firings. In this work, we provide evidence that prior reports of the existence of universal common inputs to all motoneurons and the presence of long-term synchronization are misleading, because they did not use sufficiently rigorous statistical tests to detect synchronization. We developed a statistically based method (SigMax) for computing synchronization and tested it with data from 17,736 motor-unit pairs containing 1,035,225 firing instances from the first dorsal interosseous and vastus lateralis muscles—a data set one order of magnitude greater than that reported in previous studies. Only firing data, obtained from surface electromyographic signal decomposition with >95% accuracy, were used in the study. The data were not subjectively selected in any manner. Because of the size of our data set and the statistical rigor inherent to SigMax, we have confidence that the synchronization values that we calculated provide an improved estimate of physiologically driven synchronization. Compared with three other commonly used techniques, ours revealed three types of discrepancies that result from failing to use sufficient statistical tests necessary to detect synchronization. 1) On average, the z-score method falsely detected synchronization at 16 separate latencies in each motor-unit pair. 2) The cumulative sum method missed one out of every four synchronization identifications found by SigMax. 3) The common input assumption method identified synchronization from 100% of motor-unit pairs studied. SigMax revealed that only 50% of motor-unit pairs actually manifested synchronization.

Download Full-text