scholarly journals Fast open modification spectral library searching through approximate nearest neighbor indexing

2018 ◽  
Author(s):  
Wout Bittremieux ◽  
Pieter Meysman ◽  
William Stafford Noble ◽  
Kris Laukens

AbstractOpen modification searching (OMS) is a powerful search strategy that identifies peptides carrying any type of modification by allowing a modified spectrum to match against its unmodified variant by using a very wide precursor mass window. A drawback of this strategy, however, is that it leads to a large increase in search time. Although performing an open search can be done using existing spectral library search engines by simply setting a wide precursor mass window, none of these tools have been optimized for OMS, leading to excessive runtimes and suboptimal identification results. Here we present the ANN-SoLo tool for fast and accurate open spectral library searching. ANN-SoLo uses approximate nearest neighbor indexing to speed up OMS by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. This approach is combined with a cascade search strategy to maximize the number of identified unmodified and modified spectra while strictly controlling the false discovery rate, as well as a shifted dot product score to sensitively match modified spectra to their unmodified counterparts. ANN-SoLo achieves state-of-the-art performance in terms of speed and the number of identifications. On a previously published human cell line data set, ANN-SoLo confidently identifies more spectra than SpectraST or MSFragger and achieves a speedup of an order of magnitude compared to SpectraST.ANN-SoLo is implemented in Python and C++. It is freely available under the Apache 2.0 license athttps://github.com/bittremieux/ANN-SoLo.

2019 ◽  
Author(s):  
Andrew Dalke

<div>This paper describes the 10 years of work and research results of the chemfp project, available from http://chemfp.com/ . The project started as a way to promote the FPS format for cheminformatics fingerprint exchange. This is a line-oriented text format meant to be easy to read and write. It supports metadata such as the fingerprint type and data provenance.The chemfp package for Python was developed to provide the basic command-line tools and Python API for working with fingerprint data, because a format without useful tools will not be used. The similarity search performance improved by an order of magnitude over the decade, due to careful implementation and effective use of CPU hardware, including AVX2 support for faster popcount calculations than the built-in POPCNT instruction. The implementation details for high-performance search have rarely been discussed in the literature. As a result, many tools and published papers use implementations which are not close to the machine's capabilities. This paper describes those details to help with future optimization efforts. The most advanced version of chemfp evaluates about 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k=1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query and the same search of the 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest similarity search tools available for CPUs. This appears to be several times faster than previously published work in the field, including in papers which use much more sophisticated data structures. A close analysis shows that nearly all earlier work assumes that the intersection popcount was the limiting performance factor, while on modern hardware uncompressed search is effectively memory bandwidth limited. For example, AVX2 search is 10% faster when memory prefetching, and the popcount evaluation time is far faster than fetching a random location in main memory. It proved difficult to evaluate existing tool performance because in the few cases where the tools were available, each used its own format, data sets, and search tasks. This paper introduces the chemfp benchmark data set to help make head-to-head comparisons easier in the future, and to help promote the FPS format. The FPS format is slow for tasks like web server reloads and command-line scripting. This paper also describes the FPB format, which is a binary application format for fast loads. </div>


2019 ◽  
Author(s):  
Andrew Dalke

<div>This paper describes the 10 years of work and research results of the chemfp project, available from http://chemfp.com/ . The project started as a way to promote the FPS format for cheminformatics fingerprint exchange. This is a line-oriented text format meant to be easy to read and write. It supports metadata such as the fingerprint type and data provenance.The chemfp package for Python was developed to provide the basic command-line tools and Python API for working with fingerprint data, because a format without useful tools will not be used. <br></div><div><br></div><div>The similarity search performance improved by an order of magnitude over the decade, due to careful implementation and effective use of CPU hardware, including AVX2 support for faster popcount calculations than the built-in POPCNT instruction. The implementation details for high-performance search have rarely been discussed in the literature. As a result, many tools and published papers use implementations which are not close to the machine's capabilities.</div><div><br></div><div>This paper describes those details to help with future optimization efforts.</div><div><br></div><div>The most advanced version of chemfp evaluates about 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k=1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query and the same search of the 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest similarity search tools available for CPUs. This appears to be several times faster than previously published work in the field, including in papers which use much more sophisticated data structures.</div><div><br></div><div>A close analysis shows that nearly all earlier work assumes that the intersection popcount was the limiting performance factor, while on modern hardware uncompressed search is effectively memory bandwidth limited. For example, AVX2 search is 10% faster when memory prefetching, and the popcount evaluation time is far faster than fetching a random location in main memory. It proved difficult to evaluate existing tool performance because in the few cases where the tools were available, each used its own format, data sets, and search tasks.</div><div><br></div><div>This paper introduces the chemfp benchmark data set to help make head-to-head comparisons easier in the future, and to help promote the FPS format. The FPS format is slow for tasks like web server reloads and command-line scripting. This paper also describes the FPB format, which is a binary application format for fast loads. </div>


2019 ◽  
Author(s):  
Andrew Dalke

<div>This paper describes the 10 years of work and research results of the chemfp project, available from http://chemfp.com/ . The project started as a way to promote the FPS format for cheminformatics fingerprint exchange. This is a line-oriented text format meant to be easy to read and write. It supports metadata such as the fingerprint type and data provenance.The chemfp package for Python was developed to provide the basic command-line tools and Python API for working with fingerprint data, because a format without useful tools will not be used. <br></div><div><br></div><div>The similarity search performance improved by an order of magnitude over the decade, due to careful implementation and effective use of CPU hardware, including AVX2 support for faster popcount calculations than the built-in POPCNT instruction. The implementation details for high-performance search have rarely been discussed in the literature. As a result, many tools and published papers use implementations which are not close to the machine's capabilities.</div><div><br></div><div>This paper describes those details to help with future optimization efforts.</div><div><br></div><div>The most advanced version of chemfp evaluates about 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k=1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query and the same search of the 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest similarity search tools available for CPUs. This appears to be several times faster than previously published work in the field, including in papers which use much more sophisticated data structures.</div><div><br></div><div>A close analysis shows that nearly all earlier work assumes that the intersection popcount was the limiting performance factor, while on modern hardware uncompressed search is effectively memory bandwidth limited. For example, AVX2 search is 10% faster when memory prefetching, and the popcount evaluation time is far faster than fetching a random location in main memory. It proved difficult to evaluate existing tool performance because in the few cases where the tools were available, each used its own format, data sets, and search tasks.</div><div><br></div><div>This paper introduces the chemfp benchmark data set to help make head-to-head comparisons easier in the future, and to help promote the FPS format. The FPS format is slow for tasks like web server reloads and command-line scripting. This paper also describes the FPB format, which is a binary application format for fast loads. </div>


Author(s):  
Toshiro Ogita ◽  
◽  
Hidetomo Ichihashi ◽  
Akira Notsu ◽  
Katsuhiro Honda ◽  
...  

In many computer vision applications, nearest neighbor searching in high-dimensional spaces is often the most time consuming component and we have few algorithms for solving these high-dimensional nearest neighbor search problems that are faster than linear search. Approximately nearest neighbor search algorithms can play an important role in achieving significantly faster running times with relatively small errors. This paper considers the improvement of the PCA-tree nearest neighbor search algorithm [1] by employing nearest neighbor distance statistics. During the preprocessing phase of the PCA-tree nearest neighbor search algorithm, a data set is partitioned into clusters by successive use of Principal Component Analysis (PCA). The search performance is significantly improved if the data points are sorted by leaf node, and the threshold value is updated each time a smaller distance is found. The threshold is updated by the ε-approximate nearest neighbor approach together with the fixed-threshold approach. Performance can be further improved by the annulus bound approach. Moreover, nearest neighbor distance statistics is employed for further improving the efficiency of the search algorithm and the several experimental results are shown for demonstrating how its efficiency is improved.


2018 ◽  
Vol 17 (10) ◽  
pp. 3463-3474 ◽  
Author(s):  
Wout Bittremieux ◽  
Pieter Meysman ◽  
William Stafford Noble ◽  
Kris Laukens

Author(s):  
M. Jeyanthi ◽  
C. Velayutham

In Science and Technology Development BCI plays a vital role in the field of Research. Classification is a data mining technique used to predict group membership for data instances. Analyses of BCI data are challenging because feature extraction and classification of these data are more difficult as compared with those applied to raw data. In this paper, We extracted features using statistical Haralick features from the raw EEG data . Then the features are Normalized, Binning is used to improve the accuracy of the predictive models by reducing noise and eliminate some irrelevant attributes and then the classification is performed using different classification techniques such as Naïve Bayes, k-nearest neighbor classifier, SVM classifier using BCI dataset. Finally we propose the SVM classification algorithm for the BCI data set.


2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Sumedh Yadav ◽  
Mathis Bode

Abstract A scalable graphical method is presented for selecting and partitioning datasets for the training phase of a classification task. For the heuristic, a clustering algorithm is required to get its computation cost in a reasonable proportion to the task itself. This step is succeeded by construction of an information graph of the underlying classification patterns using approximate nearest neighbor methods. The presented method consists of two approaches, one for reducing a given training set, and another for partitioning the selected/reduced set. The heuristic targets large datasets, since the primary goal is a significant reduction in training computation run-time without compromising prediction accuracy. Test results show that both approaches significantly speed-up the training task when compared against that of state-of-the-art shrinking heuristics available in LIBSVM. Furthermore, the approaches closely follow or even outperform in prediction accuracy. A network design is also presented for a partitioning based distributed training formulation. Added speed-up in training run-time is observed when compared to that of serial implementation of the approaches.


2014 ◽  
Vol 112 (11) ◽  
pp. 2729-2744 ◽  
Author(s):  
Carlo J. De Luca ◽  
Joshua C. Kline

Over the past four decades, various methods have been implemented to measure synchronization of motor-unit firings. In this work, we provide evidence that prior reports of the existence of universal common inputs to all motoneurons and the presence of long-term synchronization are misleading, because they did not use sufficiently rigorous statistical tests to detect synchronization. We developed a statistically based method (SigMax) for computing synchronization and tested it with data from 17,736 motor-unit pairs containing 1,035,225 firing instances from the first dorsal interosseous and vastus lateralis muscles—a data set one order of magnitude greater than that reported in previous studies. Only firing data, obtained from surface electromyographic signal decomposition with >95% accuracy, were used in the study. The data were not subjectively selected in any manner. Because of the size of our data set and the statistical rigor inherent to SigMax, we have confidence that the synchronization values that we calculated provide an improved estimate of physiologically driven synchronization. Compared with three other commonly used techniques, ours revealed three types of discrepancies that result from failing to use sufficient statistical tests necessary to detect synchronization. 1) On average, the z-score method falsely detected synchronization at 16 separate latencies in each motor-unit pair. 2) The cumulative sum method missed one out of every four synchronization identifications found by SigMax. 3) The common input assumption method identified synchronization from 100% of motor-unit pairs studied. SigMax revealed that only 50% of motor-unit pairs actually manifested synchronization.


Sign in / Sign up

Export Citation Format

Share Document