Reproducibility of mass spectrometry-based metabolomics data

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Tusharkanti Ghosh ◽  
Daisy Philtron ◽  
Weiming Zhang ◽  
Katerina Kechris ◽  
Debashis Ghosh

Abstract Background Assessing the reproducibility of measurements is an important first step for improving the reliability of downstream analyses of high-throughput metabolomics experiments. We define a metabolite to be reproducible when it demonstrates consistency across replicate experiments; metabolites that are not consistent across replicates can be labeled as irreproducible. In this work, we introduce and evaluate the use of (Ma)ximum (R)ank (R)eproducibility (MaRR) to examine reproducibility in mass spectrometry-based metabolomics experiments. We examine reproducibility across technical or biological samples in three different mass spectrometry metabolomics (MS-Metabolomics) data sets. Results We apply MaRR, a nonparametric approach that detects the change from reproducible to irreproducible signals using a maximal rank statistic. The advantage of using MaRR over model-based methods is that it does not make parametric assumptions on the underlying distributions or dependence structures of reproducible metabolites. Using three MS-Metabolomics data sets generated in the multi-center Genetic Epidemiology of Chronic Obstructive Pulmonary Disease (COPD) study, we applied the MaRR procedure after data processing to explore reproducibility across technical or biological samples. Under realistic settings of MS-Metabolomics data, the MaRR procedure effectively controls the False Discovery Rate (FDR) when there is a gradual reduction in correlation between replicate pairs for less highly ranked signals. Simulation studies also show that the MaRR procedure tends to have high power for detecting reproducible metabolites in most situations, except for smaller values of the proportion of reproducible metabolites. Bias values (i.e., the difference between the estimated and true proportions of reproducible signals) in the simulations are also close to zero. The results from the real data show a higher level of reproducibility for technical replicates than for biological replicates across all three data sets. In summary, we demonstrate that the MaRR procedure can be adapted to various experimental designs and that the nonparametric approach performs consistently well. Conclusions This research was motivated by reproducibility, which has proven to be a major obstacle in the use of genomic findings to advance clinical practice. In this paper, we developed a data-driven approach to assess the reproducibility of MS-Metabolomics data sets. The methods described in this paper are implemented in the open-source R package marr, which is freely available from Bioconductor at http://bioconductor.org/packages/marr.
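The exact MaRR estimator lives in the R package marr cited above; purely as an illustration of the maximal-rank idea, the Python sketch below ranks metabolites within each of two replicates, takes the per-metabolite maximum rank, and chooses a cutoff with a rough FDR heuristic (under irreproducibility, independent ranks give P(max rank <= k) ~ (k/n)^2). The function name, the heuristic, and the simulated data are illustrative assumptions, not the published procedure.

```python
# A minimal, illustrative sketch of the maximal-rank idea behind MaRR
# (not the Bioconductor `marr` implementation): reproducible metabolites
# should have small maximum ranks across replicates, while irreproducible
# metabolites behave like independent uniform ranks.
import numpy as np

def max_rank_reproducible(rep1, rep2, alpha=0.05):
    """Flag metabolites as reproducible using a max-rank heuristic.

    rep1, rep2 : 1-D arrays of metabolite abundances from paired replicates.
    alpha      : target false discovery rate (approximate).
    """
    n = len(rep1)
    # Rank 1 = highest abundance in each replicate.
    r1 = (-rep1).argsort().argsort() + 1
    r2 = (-rep2).argsort().argsort() + 1
    max_rank = np.maximum(r1, r2)

    # Under irreproducibility (independent ranks), P(max rank <= k) ~ (k/n)^2,
    # so ~n*(k/n)^2 null metabolites are expected below threshold k.
    best_k = 0
    for k in range(1, n + 1):
        discoveries = np.sum(max_rank <= k)
        expected_null = n * (k / n) ** 2
        if discoveries > 0 and expected_null / discoveries <= alpha:
            best_k = k
    return max_rank <= best_k

rng = np.random.default_rng(0)
signal = rng.lognormal(3, 1, 200)  # 200 reproducible metabolites, 800 noise-only
rep1 = np.concatenate([signal * rng.lognormal(0, 0.1, 200), rng.lognormal(3, 1, 800)])
rep2 = np.concatenate([signal * rng.lognormal(0, 0.1, 200), rng.lognormal(3, 1, 800)])
print(max_rank_reproducible(rep1, rep2).sum(), "metabolites flagged as reproducible")
```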

2018 ◽  
Vol 14 (1) ◽  
pp. 43-52 ◽  
Author(s):  
Nishith Kumar ◽  
Md. Aminul Hoque ◽  
Md. Shahjaman ◽  
S.M. Shahinul Islam ◽  
Md. Nurul Haque Mollah

Background: Metabolomics data generation and quantification differ from those of other types of molecular "omics" data in bioinformatics. Mass spectrometry (MS) based metabolomics data (gas chromatography mass spectrometry (GC-MS), liquid chromatography mass spectrometry (LC-MS), etc.) frequently contain missing values that make some quantitative analyses complex. Typically, metabolomics datasets contain 10% to 20% missing values, which originate from several sources, including analytical, computational, and biological causes. Imputation of missing values is therefore an important issue for further metabolomics data analysis. Objective: This paper introduces a new algorithm for missing value imputation in the presence of outliers for metabolomics data analysis. Method: Currently, the most well-known missing value imputation techniques for metabolomics data are k-nearest neighbours (kNN), random forest (RF) and zero imputation. However, these techniques are sensitive to outliers. In this paper, we propose an outlier-robust missing value imputation technique that minimizes a two-way empirical mean absolute error (MAE) loss function for imputing missing values in metabolomics data. Results: We investigated the performance of the proposed missing value imputation technique in comparison with the traditional imputation techniques using both simulated and real data, in the absence and presence of outliers. Conclusion: Results of both the simulated and real data analyses show that the proposed outlier-robust missing value imputation technique performs better than the traditional missing value imputation methods in both the absence and presence of outliers.
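The paper's two-way MAE minimization is not spelled out in the abstract; as a hedged sketch of the general idea, the code below fits an additive samples-by-metabolites model by median polish (medians minimize absolute error, so the fit resists outliers) and fills the missing cells from the fitted values. The function name and the refill-within-iterations scheme are assumptions for illustration, not the authors' algorithm.

```python
# Hedged sketch of outlier-robust imputation in the spirit of a two-way MAE
# loss: fit x_ij ~ overall + row_i + col_j by median polish, then fill
# missing cells from the fitted values.
import numpy as np

def robust_mae_impute(x, n_iter=20):
    """x: 2-D array (samples x metabolites) with np.nan for missing values."""
    x = x.astype(float).copy()
    missing = np.isnan(x)
    # Crude start: column medians (robust to outliers, unlike column means).
    col_start = np.nanmedian(x, axis=0)
    x[missing] = np.take(col_start, np.where(missing)[1])

    overall = np.nanmedian(x)
    row_eff = np.zeros(x.shape[0])
    col_eff = np.zeros(x.shape[1])
    for _ in range(n_iter):
        resid = x - overall - row_eff[:, None] - col_eff[None, :]
        row_eff += np.median(resid, axis=1)
        resid = x - overall - row_eff[:, None] - col_eff[None, :]
        col_eff += np.median(resid, axis=0)
        # Refill missing cells from the current robust additive fit.
        fitted = overall + row_eff[:, None] + col_eff[None, :]
        x[missing] = fitted[missing]
    return x
```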


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Dandan Liang ◽  
Quan Liu ◽  
Kejun Zhou ◽  
Wei Jia ◽  
Guoxiang Xie ◽  
...  

Abstract Background Metabolomics data analyses rely on the use of bioinformatics tools. Many integrated multi-functional tools have been developed for untargeted metabolomics data processing and have been widely used. More alternative platforms are expected for both basic and advanced users. Results The integrated mass spectrometry-based untargeted metabolomics data mining (IP4M) software was designed and developed. IP4M has 62 functions categorized into 8 modules, covering all the steps of metabolomics data mining, including raw data preprocessing (alignment, peak deconvolution, peak picking, and isotope filtering), peak annotation, peak table preprocessing, basic statistical description, classification and biomarker detection, correlation analysis, cluster and sub-cluster analysis, regression analysis, ROC analysis, pathway and enrichment analysis, and sample size and power analysis. Additionally, a KEGG-derived metabolic reaction database is embedded, and a series of ratio variables (product/substrate) can be generated, adding information on enzyme activity. A new method, GRaMM, for correlation analysis between metabolome and microbiome data is also provided. IP4M provides a large number of parameters for customized and refined analysis (for expert users) as well as 4 simplified workflows with a few key parameters (for beginners unfamiliar with computational metabolomics). The performance of IP4M was evaluated and compared with existing computational platforms using 2 data sets derived from a standards mixture and 2 data sets derived from serum samples, acquired by GC–MS and LC–MS, respectively. Conclusion IP4M is powerful, modularized, customizable and easy to use. It is a good choice for metabolomics data processing and analysis. Free versions for Windows, macOS, and Linux systems are provided.
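As a small illustration of the product/substrate ratio variables mentioned above (not IP4M code, and with a hypothetical metabolite pair standing in for the embedded KEGG-derived reaction database), ratio features could be derived from a peak table as follows.

```python
# Illustrative sketch only: append product/substrate ratio variables to a
# samples x metabolites peak table, given a list of reaction pairs.
import pandas as pd

def add_ratio_variables(peaks, reactions):
    """peaks: samples x metabolites intensity table (pd.DataFrame).
    reactions: list of (substrate, product) metabolite-name pairs."""
    out = peaks.copy()
    for substrate, product in reactions:
        if substrate in peaks.columns and product in peaks.columns:
            out[f"{product}/{substrate}"] = peaks[product] / peaks[substrate]
    return out

# Hypothetical example pair (phenylalanine -> tyrosine, hydroxylase activity).
peaks = pd.DataFrame({"phenylalanine": [10.0, 12.0], "tyrosine": [5.0, 9.0]})
print(add_ratio_variables(peaks, [("phenylalanine", "tyrosine")]))
```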


2017 ◽  
Author(s):  
Nathaniel G. Mahieu ◽  
Gary J. Patti

Summary When using liquid chromatography/mass spectrometry (LC/MS) to perform untargeted metabolomics, it is now routine to detect tens of thousands of features from biological samples. Poor understanding of the data, however, has complicated interpretation and masked the number of unique metabolites actually being measured in an experiment. Here we place an upper bound on the number of unique metabolites detected in Escherichia coli samples analyzed with one untargeted metabolomic method. We first group multiple features arising from the same analyte, which we call "degenerate features", using a context-driven annotation approach. Surprisingly, this analysis revealed thousands of previously unreported degeneracies that reduced the number of unique analytes to ~2,961. We then applied an orthogonal approach to remove non-biological features from the data by using 13C-based credentialing technology. This further reduced the number of unique analytes to less than 1,000.
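A minimal sketch of how degenerate features can be grouped: co-eluting features whose m/z values differ by known isotope or adduct spacings are merged into one analyte group. The shift list, tolerances, and function name below are illustrative assumptions and cover only a small slice of the context-driven annotation used in the paper.

```python
# Simplified grouping of "degenerate" LC/MS features (isotopes, adducts,
# in-source fragments of one analyte) by retention time and m/z differences.
import numpy as np

KNOWN_SHIFTS = [1.00336, 21.98194, 17.02655]  # 13C isotope, Na adduct, NH4 adduct (Da)

def group_degenerate_features(mz, rt, rt_tol=5.0, mz_tol=0.005):
    """mz, rt: arrays of feature m/z and retention time (s).
    Returns group labels; co-eluting features whose m/z values differ by a
    known shift share a label."""
    n = len(mz)
    labels = np.arange(n)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(rt[i] - rt[j]) > rt_tol:
                continue
            dmz = abs(mz[i] - mz[j])
            if any(abs(dmz - s) < mz_tol for s in KNOWN_SHIFTS):
                labels[labels == labels[j]] = labels[i]   # merge the two groups
    return labels

mz = np.array([180.063, 181.067, 202.045, 350.100])
rt = np.array([120.0, 120.5, 119.8, 300.0])
print(group_degenerate_features(mz, rt))   # first three features collapse to one analyte
```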


2021 ◽  
Author(s):  
Jakob Raymaekers ◽  
Peter J. Rousseeuw

Abstract Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box–Cox and Yeo–Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so the transformed data can be approximately normal in the center while a few outliers may deviate from it. It compares favorably to existing techniques in an extensive simulation study and on real data.
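As a hedged illustration of the idea (not the estimator proposed by the authors), the sketch below implements the Yeo–Johnson transformation and chooses its parameter by a trimmed, MAD-standardized comparison with normal quantiles, so that a few outliers cannot dominate the choice the way they do under maximum likelihood.

```python
# Sketch: pick the Yeo-Johnson parameter so that the *central* part of the
# transformed data looks normal, using a trimmed criterion instead of the
# outlier-sensitive maximum likelihood estimator.
import numpy as np
from scipy import stats

def yeo_johnson(x, lam):
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos, neg = x >= 0, x < 0
    if abs(lam) > 1e-12:
        out[pos] = ((x[pos] + 1) ** lam - 1) / lam
    else:
        out[pos] = np.log1p(x[pos])
    if abs(lam - 2) > 1e-12:
        out[neg] = -(((-x[neg] + 1) ** (2 - lam) - 1) / (2 - lam))
    else:
        out[neg] = -np.log1p(-x[neg])
    return out

def robust_lambda(x, lambdas=np.linspace(-2, 2, 81), trim=0.1):
    """Compare trimmed, MAD-standardized order statistics to normal quantiles."""
    n = len(x)
    lo, hi = int(trim * n), int((1 - trim) * n)
    probs = (np.arange(n) + 0.5) / n
    best_lam, best_crit = None, np.inf
    for lam in lambdas:
        y = np.sort(yeo_johnson(x, lam))
        scale = stats.median_abs_deviation(y, scale="normal")
        if scale <= 0:
            continue
        z = (y - np.median(y)) / scale
        crit = np.mean(np.abs(z[lo:hi] - stats.norm.ppf(probs[lo:hi])))
        if crit < best_crit:
            best_lam, best_crit = lam, crit
    return best_lam

x = np.concatenate([np.random.default_rng(1).lognormal(0, 0.5, 500), [50.0, 80.0]])
print(robust_lambda(x))   # the two gross outliers barely influence the chosen lambda
```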


Entropy ◽  
2020 ◽  
Vol 23 (1) ◽  
pp. 62
Author(s):  
Zhengwei Liu ◽  
Fukang Zhu

Thinning operators play an important role in the analysis of integer-valued autoregressive models, and the most widely used is binomial thinning. Inspired by the theory of extended Pascal triangles, a new thinning operator, named the extended binomial thinning operator, is introduced as a generalization of binomial thinning. Compared to the binomial thinning operator, the extended binomial thinning operator has two parameters and is more flexible in modeling. Based on the proposed operator, a new integer-valued autoregressive model is introduced, which can accurately and flexibly capture the dispersion features of count time series. Two-step conditional least squares (CLS) estimation is investigated for the innovation-free case, and conditional maximum likelihood estimation is also discussed. The asymptotic properties of the two-step CLS estimator are also obtained. Finally, three overdispersed or underdispersed real data sets are considered to illustrate the superior performance of the proposed model.
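The abstract does not give the two-parameter extended binomial thinning operator explicitly, so the sketch below shows the standard binomial-thinning INAR(1) model that it generalizes, together with the conditional least squares fit implied by E[X_t | X_{t-1}] = alpha*X_{t-1} + lambda; treat it as background illustration rather than the proposed model.

```python
# Standard binomial-thinning INAR(1) with a conditional least squares (CLS) fit.
import numpy as np

def simulate_inar1(alpha, lam, n, seed=0):
    """X_t = alpha o X_{t-1} + eps_t, binomial thinning, Poisson(lam) innovations."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n, dtype=int)
    x[0] = rng.poisson(lam / (1 - alpha))          # start near the stationary mean
    for t in range(1, n):
        survivors = rng.binomial(x[t - 1], alpha)  # binomial thinning of previous count
        x[t] = survivors + rng.poisson(lam)
    return x

def cls_estimate(x):
    """CLS uses E[X_t | X_{t-1}] = alpha * X_{t-1} + lam: a linear regression."""
    y, z = x[1:], x[:-1]
    alpha_hat = np.cov(z, y, bias=True)[0, 1] / np.var(z)
    lam_hat = y.mean() - alpha_hat * z.mean()
    return alpha_hat, lam_hat

x = simulate_inar1(alpha=0.6, lam=2.0, n=2000)
print(cls_estimate(x))   # roughly (0.6, 2.0)
```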


Econometrics ◽  
2021 ◽  
Vol 9 (1) ◽  
pp. 10
Author(s):  
Šárka Hudecová ◽  
Marie Hušková ◽  
Simos G. Meintanis

This article considers goodness-of-fit tests for bivariate INAR and bivariate Poisson autoregression models. The test statistics are based on an L2-type distance between two estimators of the probability generating function of the observations: one entirely nonparametric and the other semiparametric, computed under the corresponding null hypothesis. The asymptotic distribution of the proposed test statistics is derived under both the null hypotheses and alternatives, and consistency is proved. The case of testing bivariate generalized Poisson autoregression and the extension of the methods to dimensions higher than two are also discussed. The finite-sample performance of a parametric bootstrap version of the tests is illustrated via a series of Monte Carlo experiments. The article concludes with applications to real data sets and a discussion.
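The full bivariate INAR test is involved; as a simplified illustration of the L2-type probability-generating-function (PGF) comparison with a parametric bootstrap, the sketch below tests i.i.d. counts against a Poisson null by comparing the empirical PGF with the fitted Poisson PGF on a grid of u in [0, 1]. The univariate i.i.d. setting, the grid approximation, and the function names are simplifying assumptions, not the authors' statistics.

```python
# Toy L2-type PGF goodness-of-fit test for i.i.d. counts (Poisson null),
# calibrated by a parametric bootstrap.
import numpy as np

def empirical_pgf(x, u):
    return np.mean(np.power.outer(u, x), axis=1)   # (1/n) * sum_i u^{x_i}

def l2_pgf_statistic(x, u):
    lam_hat = x.mean()                              # estimator under the null
    model_pgf = np.exp(lam_hat * (u - 1))           # Poisson PGF
    return len(x) * np.mean((empirical_pgf(x, u) - model_pgf) ** 2)

def bootstrap_pvalue(x, n_boot=500, seed=0):
    rng = np.random.default_rng(seed)
    u = np.linspace(0, 1, 101)
    t_obs, lam_hat = l2_pgf_statistic(x, u), x.mean()
    t_boot = np.array([l2_pgf_statistic(rng.poisson(lam_hat, len(x)), u)
                       for _ in range(n_boot)])
    return np.mean(t_boot >= t_obs)

rng = np.random.default_rng(1)
print(bootstrap_pvalue(rng.poisson(3.0, 300)))               # Poisson data: large p-value
print(bootstrap_pvalue(rng.negative_binomial(2, 0.4, 300)))  # overdispersed: typically small
```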


Author(s):  
Riccardo Zecchi ◽  
Pietro Franceschi ◽  
Laura Tigli ◽  
Davide Amidani ◽  
Chiara Catozzi ◽  
...  

Abstract Corticosteroids such as budesonide can be effective in reducing topical inflammation in different organs. The therapeutic use of budesonide in respiratory diseases such as asthma, chronic obstructive pulmonary disease, and allergic rhinitis is well known. However, the pulmonary distribution of budesonide is not well understood, mainly due to the difficulty of tracing the molecule in lung samples without the addition of a label. In this paper, we present a matrix-assisted laser desorption/ionization (MALDI) mass spectrometry imaging protocol that can be used to visualize the pulmonary distribution of budesonide administered to a surfactant-depleted adult rabbit. Considering that budesonide is not easily ionized by MALDI, we developed an on-tissue derivatization method with Girard's reagent P followed by deposition of ferulic acid as the MALDI matrix. This sample preparation protocol proves to be a very effective strategy for raising sensitivity not only towards budesonide but also towards other corticosteroids, allowing us to track the distribution of the drug and quantify it in lung samples.


Information ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 202
Author(s):  
Louai Alarabi ◽  
Saleh Basalamah ◽  
Abdeltawab Hendawi ◽  
Mohammed Abdalla

The rapid spread of infectious diseases is a major public health problem. Recent developments in fighting these diseases have heightened the need for a contact tracing process. Contact tracing can be considered an ideal method for controlling the transmission of infectious diseases. The contact tracing process leads to diagnostic testing, treatment or self-isolation for suspected cases, and treatment for infected persons, which ultimately limits the spread of disease. This paper proposes a technique named TraceAll that traces all contacts exposed to an infected patient and produces a list of these contacts to be considered potentially infected patients. Initially, it considers the infected patient as the querying user and starts to fetch the contacts exposed to him. Secondly, it obtains all the trajectories that belong to objects that moved near the querying user. Next, it investigates these trajectories, considering the social distance and exposure period, to identify whether these objects may have become infected. The experimental evaluation of the proposed technique with real data sets illustrates the effectiveness of this solution. Comparative experiments confirm that TraceAll outperforms baseline methods by 40% in the efficiency of answering contact tracing queries.
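As a toy illustration of the kind of query TraceAll answers (ignoring its indexing and trajectory-fetching machinery), the sketch below scans aligned, timestamped trajectories and reports objects that stay within a social distance of the querying user for at least a given exposure period; the parameter values, sampling assumptions, and names are illustrative only.

```python
# Simplified contact-tracing query over aligned, timestamped trajectories.
from collections import defaultdict
from math import hypot

def find_contacts(trajectories, query_id, distance=2.0, exposure=900, step=60):
    """trajectories: dict id -> list of (t_seconds, x, y) sampled every `step` s."""
    query = {t: (x, y) for t, x, y in trajectories[query_id]}
    contacts, streak = set(), defaultdict(int)
    for oid, points in trajectories.items():
        if oid == query_id:
            continue
        for t, x, y in points:
            q = query.get(t)
            close = q is not None and hypot(x - q[0], y - q[1]) <= distance
            streak[oid] = streak[oid] + step if close else 0   # consecutive exposure time
            if streak[oid] >= exposure:
                contacts.add(oid)
    return contacts

trajectories = {
    "patient": [(t, 0.0, 0.0) for t in range(0, 3600, 60)],
    "a": [(t, 1.0, 0.5) for t in range(0, 3600, 60)],       # close for the whole hour
    "b": [(t, 50.0, 50.0) for t in range(0, 3600, 60)],     # never close
}
print(find_contacts(trajectories, "patient"))   # {'a'}
```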


Symmetry ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 474
Author(s):  
Abdulhakim A. Al-Babtain ◽  
Ibrahim Elbatal ◽  
Hazem Al-Mofleh ◽  
Ahmed M. Gemeay ◽  
Ahmed Z. Afify ◽  
...  

In this paper, we introduce a new flexible generator of continuous distributions called the transmuted Burr X-G (TBX-G) family to extend and increase the flexibility of the Burr X generator. The general statistical properties of the TBX-G family are derived. One special sub-model, the TBX-exponential distribution, is studied in detail. We discuss eight approaches for estimating the TBX-exponential parameters, and numerical simulations are conducted to compare the suggested approaches based on partial and overall ranks. Based on our study, the Anderson–Darling estimators are recommended for estimating the TBX-exponential parameters. Using two skewed real data sets from the engineering sciences, we illustrate the importance and flexibility of the TBX-exponential model compared with existing competing distributions.
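Since the TBX-exponential CDF is not reproduced in the abstract, the sketch below illustrates the recommended Anderson–Darling estimation approach on a plain exponential distribution: the parameter is chosen to minimize the Anderson–Darling distance between the data and the model CDF. The same recipe would apply once the TBX-exponential CDF is substituted; the function names are illustrative.

```python
# Anderson-Darling minimum-distance estimation, shown for an exponential model.
import numpy as np
from scipy.optimize import minimize_scalar

def ad_statistic(x, cdf):
    x = np.sort(x)
    n = len(x)
    u = np.clip(cdf(x), 1e-12, 1 - 1e-12)
    i = np.arange(1, n + 1)
    # A^2 = -n - (1/n) * sum (2i-1) [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))]
    return -n - np.mean((2 * i - 1) * (np.log(u) + np.log(1 - u[::-1])))

def ad_estimate_exponential(x):
    """Choose the rate minimizing the AD distance to the CDF 1 - exp(-rate*x)."""
    objective = lambda rate: ad_statistic(x, lambda v: 1 - np.exp(-rate * v))
    res = minimize_scalar(objective, bounds=(1e-6, 100.0), method="bounded")
    return res.x

x = np.random.default_rng(2).exponential(scale=2.0, size=500)   # true rate = 0.5
print(ad_estimate_exponential(x))   # approximately 0.5
```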

