IP4M: an integrated platform for mass spectrometry-based metabolomics data mining

Abstract. Background: Metabolomics data analyses rely on bioinformatics tools. Many integrated multi-functional tools have been developed for untargeted metabolomics data processing and are widely used. Additional platforms are nonetheless desirable for both basic and advanced users. Results: IP4M, an integrated platform for mass spectrometry-based untargeted metabolomics data mining, was designed and developed. IP4M has 62 functions categorized into 8 modules, covering all the steps of metabolomics data mining, including raw data preprocessing (alignment, peak deconvolution, peak picking, and isotope filtering), peak annotation, peak table preprocessing, basic statistical description, classification and biomarker detection, correlation analysis, cluster and sub-cluster analysis, regression analysis, ROC analysis, pathway and enrichment analysis, and sample size and power analysis. Additionally, a KEGG-derived metabolic reaction database is embedded, and a series of ratio variables (product/substrate) can be generated to provide additional information on enzyme activity. A new method, GRaMM, for correlation analysis between metabolome and microbiome data is also provided. IP4M offers both a number of parameters for customized and refined analysis (for expert users) and 4 simplified workflows with few key parameters (for beginners who are unfamiliar with computational metabolomics). The performance of IP4M was evaluated and compared with existing computational platforms using two data sets derived from a standards mixture and two data sets derived from serum samples, acquired by GC–MS and LC–MS, respectively. Conclusion: IP4M is powerful, modularized, customizable and easy to use. It is a good choice for metabolomics data processing and analysis. Free versions for Windows, macOS, and Linux systems are provided.
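The product/substrate ratio variables mentioned in the abstract can be illustrated with a minimal sketch. It assumes a peak table stored as a dictionary of per-sample intensities and a list of (substrate, product) pairs; the function name `ratio_variables`, the metabolite names, and the intensities are hypothetical and are not taken from IP4M or its embedded reaction database.

```python
from typing import Dict, List, Optional, Tuple


def ratio_variables(
    peak_table: Dict[str, List[float]],
    reactions: List[Tuple[str, str]],
) -> Dict[str, List[Optional[float]]]:
    """Build product/substrate ratio variables for each reaction pair.

    Pairs with a metabolite missing from the peak table are skipped.
    A ratio is None where the substrate intensity is zero.
    """
    ratios: Dict[str, List[Optional[float]]] = {}
    for substrate, product in reactions:
        if substrate not in peak_table or product not in peak_table:
            continue  # cannot form a ratio without both metabolites
        name = f"{product}/{substrate}"
        ratios[name] = [
            p / s if s != 0 else None
            for p, s in zip(peak_table[product], peak_table[substrate])
        ]
    return ratios


# Hypothetical peak table: three samples per metabolite (illustrative values).
peak_table = {
    "citrate": [4.0, 8.0, 0.0],
    "isocitrate": [2.0, 2.0, 1.0],
}
# Hypothetical reaction list; the second pair is absent from the peak table.
reactions = [("citrate", "isocitrate"), ("pyruvate", "lactate")]

ratios = ratio_variables(peak_table, reactions)
print(ratios)
```

Returning `None` for a zero substrate intensity is one possible design choice; a real pipeline might instead impute missing values before forming ratios.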

An evolving computational platform for biological mass spectrometry: workflows, statistics and data mining with MASSyPup64

PeerJ ◽

10.7717/peerj.1401 ◽

2015 ◽

Vol 3 ◽

pp. e1401 ◽

Cited By ~ 13

Author(s):

Robert Winkler

Keyword(s):

Mass Spectrometry ◽

Data Mining ◽

Random Forest ◽

Data Processing ◽

Workflow Management ◽

Data Sets ◽

Targeted Metabolomics ◽

Biological Mass Spectrometry ◽

Data Analyses ◽

Mining Methods

In biological mass spectrometry, crude instrumental data need to be converted into meaningful theoretical models. Several data processing and data evaluation steps are required to arrive at the final results. These operations are often difficult to reproduce because they depend on highly specific computing platforms. This effect, known as ‘workflow decay’, can be diminished by using a standardized informatics infrastructure. We therefore compiled an integrated platform that contains ready-to-use tools and workflows for mass spectrometry data analysis. Apart from general unit operations, such as peak picking and identification of proteins and metabolites, we put a strong emphasis on the statistical validation of results and Data Mining. MASSyPup64 includes, e.g., the OpenMS/TOPPAS framework, the Trans-Proteomic-Pipeline programs, the ProteoWizard tools, X!Tandem, Comet and SpiderMass. The statistical computing language R is installed with packages for MS data analyses, such as XCMS/metaXCMS and MetabR. The R package Rattle provides user-friendly access to multiple Data Mining methods. Further, we added the non-conventional spreadsheet program teapot for editing large data sets and a command line tool for transposing large matrices. Individual programs, console commands and modules can be integrated using the Workflow Management System (WMS) Taverna. We explain the useful combination of the tools with practical examples: (1) a workflow for protein identification and validation, with subsequent Association Analysis of peptides, (2) cluster analysis and Data Mining in targeted Metabolomics, and (3) raw data processing, Data Mining and identification of metabolites in untargeted Metabolomics. Association Analyses reveal relationships between variables across different sample sets. We present their application for finding co-occurring peptides, which can be used for targeted proteomics, the discovery of alternative biomarkers and protein–protein interactions.
Models derived from Data Mining displayed higher robustness and accuracy for classifying sample groups in targeted Metabolomics than cluster analyses. Random Forest models provide not only predictive models, which can be deployed for new data sets, but also the variable importance. We demonstrate that the latter is especially useful for tracking down significant signals and affected pathways in untargeted Metabolomics. Thus, Random Forest modeling supports the unbiased search for relevant biological features in Metabolomics. Our results clearly demonstrate the importance of Data Mining methods for disclosing non-obvious information in biological mass spectrometry. The application of a Workflow Management System and the integration of all required programs and data in a consistent platform make the presented data analysis strategies reproducible for non-expert users. The simple remastering process and the Open Source licenses of MASSyPup64 (http://www.bioprocess.org/massypup/) enable the continuous improvement of the system.
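The idea of Association Analysis for co-occurring peptides can be sketched with the two standard rule measures, support and confidence. This is a generic illustration in plain Python, not the workflow shipped with MASSyPup64; the function name `association_rules`, the peptide labels, and the sample contents are hypothetical.

```python
from itertools import permutations
from typing import List, Set, Tuple


def association_rules(
    samples: List[Set[str]], min_support: float = 0.5
) -> List[Tuple[str, str, float, float]]:
    """Return (a, b, support, confidence) for every rule a -> b.

    support    = fraction of samples containing both a and b
    confidence = fraction of samples containing a that also contain b
    Rules whose support is below min_support are dropped.
    """
    n = len(samples)
    items = sorted(set().union(*samples))
    rules = []
    for a, b in permutations(items, 2):
        n_a = sum(1 for s in samples if a in s)
        n_ab = sum(1 for s in samples if a in s and b in s)
        support = n_ab / n
        if n_a == 0 or support < min_support:
            continue
        rules.append((a, b, support, n_ab / n_a))
    return rules


# Hypothetical peptide identifications in four sample sets.
samples = [
    {"pepA", "pepB", "pepC"},
    {"pepA", "pepB"},
    {"pepA", "pepC"},
    {"pepB"},
]

rules = association_rules(samples, min_support=0.5)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data, pepC never appears without pepA, so the rule pepC -> pepA has confidence 1.0, which is the kind of co-occurrence the abstract describes as a starting point for selecting alternative targets.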

An evolving computational platform for biological mass spectrometry: Work-flows, statistics and Data Mining with MASSyPup64

10.7287/peerj.preprints.1359v1 ◽

2015 ◽

Author(s):

Robert Winkler

Keyword(s):

Mass Spectrometry ◽

Data Mining ◽

Random Forest ◽

Data Processing ◽

Workflow Management ◽

Data Sets ◽

Targeted Metabolomics ◽

Biological Mass Spectrometry ◽

Data Analyses ◽

Mining Methods

In biological mass spectrometry, crude instrumental data need to be converted into meaningful theoretical models. Several data processing and data evaluation steps are required to arrive at the final results. These operations are often difficult to reproduce because they depend on highly specific computing platforms. This effect, known as ‘workflow decay’, can be diminished by using a standardized informatics infrastructure. We therefore compiled an integrated platform that contains ready-to-use tools and workflows for mass spectrometry data analysis. Apart from general unit operations, such as peak picking and identification of proteins and metabolites, we put a strong emphasis on the statistical validation of results and Data Mining. MASSyPup64 includes, e.g., the OpenMS/TOPPAS framework, the Trans-Proteomic-Pipeline programs, the ProteoWizard tools, X!Tandem, Comet and SpiderMass. The statistical computing language R is installed with packages for MS data analyses, such as XCMS/metaXCMS and MetabR. The R package Rattle provides user-friendly access to multiple Data Mining methods. Further, we added the non-conventional spreadsheet program teapot for editing large data sets and a command line tool for transposing large matrices. Individual programs, console commands and modules can be integrated using the Workflow Management System (WMS) Taverna. We explain the useful combination of the tools with practical examples: (1) a workflow for protein identification and validation, with subsequent Association Analysis of peptides, (2) cluster analysis and Data Mining in targeted Metabolomics, and (3) raw data processing, Data Mining and identification of metabolites in untargeted Metabolomics. Association Analyses reveal relationships between variables across different sample sets. We present their application for finding co-occurring peptides, which can be used for targeted proteomics, the discovery of alternative biomarkers and protein-protein interactions.
Models derived from Data Mining displayed higher robustness and accuracy for classifying sample groups in targeted Metabolomics than cluster analyses. Random Forest models provide not only predictive models, which can be deployed for new data sets, but also the variable importance. We demonstrate that the latter is especially useful for tracking down significant signals and affected pathways in untargeted Metabolomics. Thus, Random Forest modeling supports the unbiased search for relevant biological features in Metabolomics. Our results clearly demonstrate the importance of Data Mining methods for disclosing non-obvious information in biological mass spectrometry. The application of a Workflow Management System and the integration of all required programs and data in a consistent platform make the presented data analysis strategies reproducible for non-expert users. The simple remastering process and the Open Source licenses of MASSyPup64 (http://www.bioprocess.org/massypup/) enable the continuous improvement of the system.

An evolving computational platform for biological mass spectrometry: Work-flows, statistics and Data Mining with MASSyPup64

10.7287/peerj.preprints.1359 ◽

2015 ◽

Author(s):

Robert Winkler

Keyword(s):

Mass Spectrometry ◽

Data Mining ◽

Random Forest ◽

Data Processing ◽

Workflow Management ◽

Data Sets ◽

Targeted Metabolomics ◽

Biological Mass Spectrometry ◽

Data Analyses ◽

Mining Methods


Advancements in capturing and mining mass spectrometry data are transforming natural products research

Natural Product Reports ◽

10.1039/d1np00040c ◽

2021 ◽

Author(s):

Scott A. Jarmusch ◽

Justin J. J. van der Hooft ◽

Pieter C. Dorrestein ◽

Alan K. Jarmusch

Keyword(s):

Mass Spectrometry ◽

Data Mining ◽

Natural Products ◽

Data Analysis ◽

Community Participation ◽

Mass Spectrometry Data ◽

Metabolomics Data ◽

Analysis Tools ◽

Public Data ◽

Potential Use

This review covers the current and potential use of mass spectrometry-based metabolomics data mining in natural products. Public data, metadata, databases and data analysis tools are critical. The value and success of data mining rely on community participation.

Corrigendum to “Comprehensive evaluation of untargeted metabolomics data processing software in feature detection, quantification and discriminating marker selection” [ACA 1029, (2018) 50–57]

Analytica Chimica Acta ◽

10.1016/j.aca.2018.10.029 ◽

2018 ◽

Vol 1044 ◽

pp. 199

Author(s):

Zhucui Li ◽

Yan Lu ◽

Yufeng Guo ◽

Haijie Cao ◽

Qinhong Wang ◽

...

Keyword(s):

Data Processing ◽

Feature Detection ◽

Comprehensive Evaluation ◽

Untargeted Metabolomics ◽

Metabolomics Data ◽

Marker Selection ◽

Data Processing Software ◽

Processing Software

Reproducibility of mass spectrometry based metabolomics data

BMC Bioinformatics ◽

10.1186/s12859-021-04336-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Tusharkanti Ghosh ◽

Daisy Philtron ◽

Weiming Zhang ◽

Katerina Kechris ◽

Debashis Ghosh

Keyword(s):

Mass Spectrometry ◽

Biological Samples ◽

Real Data ◽

Maximal Rank ◽

Rank Statistic ◽

Chronic Obstructive ◽

Data Sets ◽

Metabolomics Data ◽

Nonparametric Approach ◽

True Value

Abstract. Background: Assessing the reproducibility of measurements is an important first step for improving the reliability of downstream analyses of high-throughput metabolomics experiments. We define a metabolite to be reproducible when it demonstrates consistency across replicate experiments. Similarly, metabolites that are not consistent across replicates can be labeled as irreproducible. In this work, we introduce and evaluate the use of (Ma)ximum (R)ank (R)eproducibility (MaRR) to examine reproducibility in mass spectrometry-based metabolomics experiments. We examine reproducibility across technical or biological samples in three different mass spectrometry metabolomics (MS-Metabolomics) data sets. Results: We apply MaRR, a nonparametric approach that detects the change from reproducible to irreproducible signals using a maximal rank statistic. The advantage of MaRR over model-based methods is that it does not make parametric assumptions about the underlying distributions or dependence structures of reproducible metabolites. Using three MS-Metabolomics data sets generated in the multi-center Genetic Epidemiology of Chronic Obstructive Pulmonary Disease (COPD) study, we applied the MaRR procedure after data processing to explore reproducibility across technical or biological samples. Under realistic settings of MS-Metabolomics data, the MaRR procedure effectively controls the False Discovery Rate (FDR) when there is a gradual reduction in correlation between replicate pairs for less highly ranked signals. Simulation studies also show that the MaRR procedure tends to have high power for detecting reproducible metabolites in most situations, except when the proportion of reproducible metabolites is small. Bias values for the simulations (i.e., the difference between the estimated and the true proportion of reproducible signals) are also close to zero.
The results from the real data show a higher level of reproducibility for technical replicates than for biological replicates across all three data sets. In summary, we demonstrate that the MaRR procedure can be adapted to various experimental designs, and that the nonparametric approach performs consistently well. Conclusions: This research was motivated by reproducibility, which has proven to be a major obstacle in the use of genomic findings to advance clinical practice. In this paper, we developed a data-driven approach to assess the reproducibility of MS-Metabolomics data sets. The methods described in this paper are implemented in the open-source R package marr, which is freely available from Bioconductor at http://bioconductor.org/packages/marr.
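The maximal rank statistic at the core of the approach can be illustrated with a small sketch. It assumes two replicate measurements of the same metabolites, ranks each replicate from highest to lowest intensity, and takes the larger of the two ranks per metabolite. Only this statistic is shown; the estimation of the change point between reproducible and irreproducible signals and the FDR control are not reproduced here. The function names and intensity values are hypothetical and are not the API of the marr package.

```python
from typing import List


def ranks_descending(values: List[float]) -> List[int]:
    """Rank values so that the largest value gets rank 1.

    Ties are broken by original position (sorted() is stable).
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def max_rank_statistic(rep1: List[float], rep2: List[float]) -> List[int]:
    """Per-metabolite maximum of the ranks from two replicates."""
    r1 = ranks_descending(rep1)
    r2 = ranks_descending(rep2)
    return [max(a, b) for a, b in zip(r1, r2)]


# Hypothetical intensities of four metabolites in two replicates.
replicate_1 = [900.0, 500.0, 100.0, 50.0]
replicate_2 = [850.0, 450.0, 40.0, 120.0]

max_ranks = max_rank_statistic(replicate_1, replicate_2)
print(max_ranks)
```

Metabolites that rank consistently high in both replicates keep a small maximum rank, whereas metabolites whose ranks disagree are pushed toward larger values, which is what lets a rank-based procedure separate the two groups without distributional assumptions.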

Comparison of Three Untargeted Data Processing Workflows for Evaluating LC-HRMS Metabolomics Data

Metabolites ◽

10.3390/metabo10090378 ◽

2020 ◽

Vol 10 (9) ◽

pp. 378 ◽

Cited By ~ 2

Author(s):

Selina Hemmer ◽

Sascha K. Manier ◽

Svenja Fischmann ◽

Folker Westphal ◽

Lea Wagmann ◽

...

Keyword(s):

Data Processing ◽

Open Source ◽

High Resolution Mass Spectrometry ◽

Liver Microsomes ◽

Reversed Phase ◽

Ease Of Use ◽

Untargeted Metabolomics ◽

Metabolomics Data ◽

Metabolomics Study ◽

High Flexibility

The evaluation of liquid chromatography high-resolution mass spectrometry (LC-HRMS) raw data is a crucial step in untargeted metabolomics studies to minimize false positive findings. A variety of commercial or open source software solutions are available for such data processing. This study aims to compare three different data processing workflows (Compound Discoverer 3.1, XCMS Online combined with MetaboAnalyst 4.0, and a manually programmed tool using R) to investigate LC-HRMS data of an untargeted metabolomics study. Simple but highly standardized datasets for evaluation were prepared by incubating pHLM (pooled human liver microsomes) with the synthetic cannabinoid A-CHMINACA. LC-HRMS analysis was performed using normal- and reversed-phase chromatography followed by full scan MS in positive and negative mode. MS/MS spectra of significant features were subsequently recorded in a separate run. The outcome of each workflow was evaluated by its number of significant features, peak shape quality, and the results of the multivariate statistics. Compound Discoverer, as an all-in-one solution, is characterized by its ease of use and therefore seems suitable for simple and small metabolomic studies. The two open source solutions allowed extensive customization but, particularly in the case of R, required advanced programming skills. Nevertheless, both provided high flexibility and may be suitable for more complex studies and questions.

MET-IDEA version 2.06; improved efficiency and additional functions for mass spectrometry-based metabolomics data processing

Metabolomics ◽

10.1007/s11306-012-0397-5 ◽

2012 ◽

Vol 8 (S1) ◽

pp. 105-110 ◽

Cited By ~ 20

Author(s):

Zhentian Lei ◽

Haiquan Li ◽

Junil Chang ◽

Patrick X. Zhao ◽

Lloyd W. Sumner

Keyword(s):

Mass Spectrometry ◽

Data Processing ◽

Metabolomics Data

Data mining in mass spectrometry-based proteomics studies

Science & Technology Development Journal - Engineering and Technology ◽

10.32508/stdjet.v2i4.483 ◽

2020 ◽

Vol 2 (4) ◽

pp. 258-276

Author(s):

Vu Anh Le ◽

Cam Quyen Thi Phan ◽

Thuy Huong Nguyen

Keyword(s):

Mass Spectrometry ◽

Data Mining ◽

Biomedical Research ◽

Protein Level ◽

Protein Identification ◽

Biomarker Discovery ◽

Data Sets ◽

Biological Processes ◽

Data Mining Techniques ◽

Analytical Technique

The post-genomic era comprises experimental and computational efforts to clarify and understand the function of genes and their products. Proteomic studies play a key role in this endeavour by complementing other functional genomics approaches. Proteomics encompasses the large-scale analysis of complex mixtures, including the identification and quantification of proteins expressed under different conditions and the determination of their properties, modifications and functions. Understanding how biological processes are regulated at the protein level is crucial to understanding the molecular basis of diseases and often informs their prevention, diagnosis and treatment. High-throughput technologies are widely used in proteomics to analyze thousands of proteins. Specifically, mass spectrometry (MS) is an analytical technique for characterizing biological samples and is increasingly used in protein studies because it supports targeted and nontargeted analyses with high performance. However, as large data sets are created, computational methods such as data mining techniques are required to analyze and interpret the relevant data. More specifically, the application of data mining techniques to large proteomic data sets can assist in many interpretations of data; it can reveal protein-protein interactions, improve protein identification, evaluate the experimental methods used and facilitate diagnosis and biomarker discovery. With the rapid advances in mass spectrometry devices and experimental methodologies, MS-based proteomics has become a reliable and necessary tool for elucidating biological processes at the protein level. Over the past decade, we have witnessed a great expansion of our knowledge of human diseases with the adoption of MS-based proteomic technologies, which has led to many interesting discoveries. Here, we review recent advances of data mining in MS-based proteomics in biomedical research.
Recent research in many fields shows that proteomics goes beyond the simple classification of proteins in biological systems and is finally reaching its initial potential as an essential tool to aid related disciplines, notably biomedical research. From here, there is great potential for data mining in MS-based proteomics to move beyond basic research into clinical research and diagnostics.

Data Processing Optimization in Untargeted Metabolomics of Urine Using Voigt Lineshape Model Non-Linear Regression Analysis

Metabolites ◽

10.3390/metabo11050285 ◽

2021 ◽

Vol 11 (5) ◽

pp. 285

Author(s):

Kristina E. Haslauer ◽

Philippe Schmitt-Kopplin ◽

Silke S. Heinzmann

Keyword(s):

Nmr Spectroscopy ◽

Data Processing ◽

Large Scale ◽

Linear Regression Analysis ◽

Untargeted Metabolomics ◽

Metabolomics Data ◽

Non Linear ◽

Data Processing Tool ◽

Processing Optimization ◽

Spectral Libraries

Nuclear magnetic resonance (NMR) spectroscopy is well established for addressing questions in large-scale untargeted metabolomics. Although several approaches to data processing and analysis are available, significant issues remain. NMR spectroscopy of urine generates information-rich but complex spectra in which signals often overlap. Furthermore, slight changes in pH and salt concentrations cause peak shifting, which, in combination with baseline irregularities, introduces uninformative noise into statistical analysis. In this work, a straightforward data processing tool addresses these problems by applying a non-linear curve fitting model based on the Voigt function line shape and integrating the underlying peak areas. This method allows a rapid untargeted analysis of urine metabolomics datasets without relying on time-consuming deconvolution based on 2D spectra or on information from spectral libraries. The approach is validated with spiking experiments, tested on a human urine ¹H NMR dataset against conventionally used methods, and aims to facilitate metabolomics data analysis.
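The link between a Voigt-type line shape and the integrated peak area can be sketched as follows. A true Voigt profile is the convolution of a Gaussian and a Lorentzian; this sketch swaps in the simpler pseudo-Voigt approximation (a weighted sum of the two), and it only evaluates and integrates a single peak rather than performing the non-linear regression described in the abstract. The function names and all parameter values are hypothetical.

```python
import math
from typing import List


def pseudo_voigt(x: float, center: float, fwhm: float, eta: float, area: float) -> float:
    """Pseudo-Voigt peak: eta * Lorentzian + (1 - eta) * Gaussian.

    Both components are unit-area and share the same full width at
    half maximum (fwhm), so `area` is the total peak area.
    """
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))  # Gaussian std dev
    gamma = fwhm / 2.0  # Lorentzian half width at half maximum
    d = x - center
    gauss = math.exp(-d * d / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))
    lorentz = gamma / (math.pi * (d * d + gamma * gamma))
    return area * (eta * lorentz + (1.0 - eta) * gauss)


def trapezoid_area(xs: List[float], ys: List[float]) -> float:
    """Numerically integrate y over x with the trapezoidal rule."""
    return sum(
        0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)
    )


# Hypothetical peak: centre 0.0, width 0.02, equal mixing, area 3.0.
step = 0.001
xs = [-10.0 + i * step for i in range(20001)]  # axis from -10 to 10
ys = [pseudo_voigt(x, center=0.0, fwhm=0.02, eta=0.5, area=3.0) for x in xs]

recovered = trapezoid_area(xs, ys)
print(f"recovered area: {recovered:.4f}")
```

The recovered area falls slightly short of the nominal value because the Lorentzian tails extend beyond any finite integration window, which is one reason a fitted area parameter can be preferable to direct numerical integration over a narrow region.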
