CAMS-RS: Clustering Algorithm for Large-Scale Mass Spectrometry Data Using Restricted Search Space and Intelligent Random Sampling

Abstract. Background: Clustering of data produced by liquid chromatography coupled to mass spectrometry (LC-MS data) has recently gained interest as a way to extract meaningful chemical or biological patterns. However, recent instrumental pipelines deliver data whose size, dimensionality and expected number of clusters are too large to be processed by classical machine learning algorithms, so most state-of-the-art methods rely on single-pass linkage-based algorithms. Results: We propose a clustering algorithm that optimizes the powerful but computationally demanding kernel k-means objective function in a scalable way. As a result, it can process LC-MS data in an acceptable time on a multicore machine. To do so, we combine three essential features: a compressive data representation, the Nyström approximation and a hierarchical strategy. In addition, we propose new kernels based on optimal transport, which can be interpreted as intuitive similarity measures between chromatographic elution profiles. Conclusions: Our method, referred to as CHICKN, is evaluated on proteomics data produced in our lab, as well as on benchmark data from the literature. From a computational viewpoint, it is particularly efficient on raw LC-MS data. From a data analysis viewpoint, it provides clusters that differ from those produced by state-of-the-art methods, while achieving similar performance. This highlights the complementarity of algorithms built on different principles for extracting the best from complex LC-MS data.
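To make the idea of an optimal-transport similarity between elution profiles concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the kernel defined by the CHICKN authors: it assumes two profiles sampled on the same regular retention-time grid, uses the one-dimensional Wasserstein-1 distance (the area between the two cumulative distributions), and turns it into a similarity with an exponential. The function names and the `gamma` and `spacing` parameters are hypothetical.

```python
import math


def normalize(profile):
    """Scale a non-negative intensity profile so that it sums to 1."""
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile must have positive total intensity")
    return [x / total for x in profile]


def wasserstein_1d(p, q, spacing=1.0):
    """Wasserstein-1 distance between two profiles on the same regular grid.

    In one dimension this equals the area between the two cumulative
    distribution functions, so no transport plan has to be solved for.
    """
    if len(p) != len(q):
        raise ValueError("profiles must share the same grid")
    p, q = normalize(p), normalize(q)
    cdf_p = cdf_q = 0.0
    area = 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        area += abs(cdf_p - cdf_q) * spacing
    return area


def transport_kernel(p, q, gamma=1.0, spacing=1.0):
    """Similarity in (0, 1]: equals 1 for identical shapes, decays with distance.

    Assumption: an exponential of the negative distance, chosen for illustration.
    """
    return math.exp(-gamma * wasserstein_1d(p, q, spacing))


if __name__ == "__main__":
    peak = [0, 1, 4, 1, 0, 0]      # made-up elution profile
    shifted = [0, 0, 1, 4, 1, 0]   # same shape, eluting one grid step later
    print(wasserstein_1d(peak, shifted))
    print(transport_kernel(peak, shifted, gamma=0.5))
```

Because the distance measures how far intensity has to be moved along the time axis, a peak shifted by one grid step stays close to the original, whereas a pointwise comparison would treat the two as largely different.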

MSpectraAI: a powerful platform for deciphering proteome profiling of multi-tumor mass spectrometry data by using deep neural networks

BMC Bioinformatics, 2020, Vol 21 (1). DOI: 10.1186/s12859-020-03783-0

Author(s): Shisheng Wang, Hongwen Zhu, Hu Zhou, Jingqiu Cheng, Hao Yang

Keyword(s): Mass Spectrometry, Neural Networks, Large Scale, Deep Neural Networks, Spectral Feature, Mass Spectrometry Data, Learning Approaches, Proteomics Data, Proteome Profiling, Analytical Technique

Abstract. Background: Mass spectrometry (MS) has become a promising analytical technique for acquiring proteomics information to characterize biological samples. Nevertheless, most studies focus on the final proteins identified through a suite of algorithms that compare partial MS spectra with a sequence database, while the pattern recognition and classification of raw mass-spectrometric data remain unresolved. Results: We developed an open-source and comprehensive platform, named MSpectraAI, for analyzing large-scale MS data through deep neural networks (DNNs); this system involves spectral-feature swath extraction, classification, and visualization. Moreover, the platform allows users to create their own DNN model by using Keras. To evaluate this tool, we collected publicly available proteomics datasets of six tumor types (a total of 7,997,805 mass spectra) from the ProteomeXchange consortium and classified the samples based on spectral profiling. The results suggest that MSpectraAI can distinguish different types of samples based on the fingerprint spectrum and achieves better prediction accuracy at the MS1 level (average 0.967). Conclusion: This study deciphers proteome profiling of raw mass spectrometry data and broadens the application of deep learning methods to the classification and prediction of proteomics data from multi-tumor samples. MSpectraAI also performs better than other classical machine learning approaches.

Efficient exploratory clustering analyses in large-scale exploration processes

The VLDB Journal, 2021. DOI: 10.1007/s00778-021-00716-y

Author(s): Manuel Fritz, Michael Behringer, Dennis Tschechlov, Holger Schwarz

Keyword(s): Large Scale, Clustering Algorithm, Comprehensive Evaluation, State Of The Art, Clustering Algorithms, Search Space, Large Datasets, Search Spaces, Multiple Challenges, The One

Abstract. Clustering is a fundamental primitive in manifold applications. To achieve valuable results in exploratory clustering analyses, the parameters of the clustering algorithm have to be set appropriately, which is a tremendous pitfall. We observe multiple challenges for large-scale exploration processes. On the one hand, they require specific methods to efficiently explore large parameter search spaces. On the other hand, they often exhibit long runtimes, in particular when large datasets are analyzed using clustering algorithms with super-polynomial runtimes, which need to be executed repeatedly within exploratory clustering analyses. We address these challenges as follows: First, we present LOG-Means and show that it provides estimates for the number of clusters in sublinear time with respect to the defined search space, i.e., it provably requires fewer executions of a clustering algorithm than existing methods. Second, we demonstrate how to exploit fundamental characteristics of exploratory clustering analyses in order to significantly accelerate the (repetitive) execution of clustering algorithms on large datasets. Third, we show how these challenges can be tackled at the same time. To the best of our knowledge, this is the first work that simultaneously addresses the above-mentioned challenges. In our comprehensive evaluation, we show that our proposed methods significantly outperform state-of-the-art methods, thus especially supporting novice analysts in exploratory clustering analyses in large-scale exploration processes.

An efficient dynamic programming algorithm for phosphorylation site assignment of large-scale mass spectrometry data

2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops, 2012. DOI: 10.1109/bibmw.2012.6470210

Cited By ~ 7

Author(s): Fahad Saeed, Trairak Pisitkun, Jason D. Hoffert, Guanghui Wang, Marjan Gucek, ...

Keyword(s): Mass Spectrometry, Dynamic Programming, Large Scale, Phosphorylation Site, Dynamic Programming Algorithm, Mass Spectrometry Data, Programming Algorithm, Site Assignment

Large-scale mass spectrometry data combined with demographics analysis rapidly predicts methicillin resistance in Staphylococcus aureus

Briefings in Bioinformatics, 2020. DOI: 10.1093/bib/bbaa293

Author(s): Zhuo Wang, Hsin-Yao Wang, Chia-Ru Chung, Jorng-Tzong Horng, Jang-Jih Lu, ...

Keyword(s): Mass Spectrometry, Staphylococcus Aureus, Antibiotic Resistance, Large Scale, Operating Characteristic, Susceptibility Testing, Methicillin Resistance, Characteristic Curve, Mass Spectrometry Data, Treatment Efficiency

Abstract. Background: A mass spectrometry-based assessment of methicillin resistance in Staphylococcus aureus would have great potential for fast and effective prediction of antibiotic resistance. Owing to delays in traditional antibiotic susceptibility testing, methicillin-resistant S. aureus remains a serious threat to human health. Results: Here, by linking a 7-year longitudinal study of two cohorts in the Taiwan area, comprising over 20 000 individually resolved methicillin susceptibility testing results, we identify associations of methicillin resistance with demographics and mass spectrometry data. When combined, these associations allow for machine-learning-based predictions of methicillin resistance, with an area under the receiver operating characteristic curve of >0.85 in both the discovery (95% confidence interval [CI] 0.88–0.90) and replication (95% CI 0.84–0.86) populations. Conclusions: Our predictive model facilitates early detection of methicillin resistance in patients with S. aureus infection. This large-scale antibiotic resistance study has highlighted, in an unbiased manner, putative candidates that could improve trials of treatment efficiency and inform prescriptions.
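For readers unfamiliar with the metric reported above, the area under the receiver operating characteristic curve can be read as the probability that a randomly chosen resistant isolate receives a higher predicted score than a randomly chosen susceptible one. The Python sketch below computes it directly from that definition. The function name `roc_auc` and the toy labels and scores are made up for illustration; they are not the authors' code or data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from its pairwise-ranking definition.

    labels: 1 for the positive class (e.g. resistant), 0 for the negative class.
    scores: predicted scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data, not taken from the study.
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, scores))  # 0.75
```

The double loop is quadratic in the number of samples, which is fine for a small illustration; a rank-based formulation gives the same value more efficiently on large cohorts.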
