Missing Value Imputation using XGboost for Label-Free Mass Spectrometry-Based Proteomics Data

10.1101/2021.04.08.438945 ◽

2021 ◽

Author(s):

Jian Song ◽

Changbin Yu

Keyword(s):

Mass Spectrometry ◽

High Performance ◽

Missing Values ◽

Mean Squared Error ◽

Learning Algorithm ◽

Pearson Correlation ◽

Data Matrix ◽

Label Free ◽

Proteomics Data ◽

Benchmark Datasets

Abstract: Label-free mass spectrometry-based proteomics data inevitably suffer from missing values, which prevent downstream analyses that need a complete data matrix. Our motivation is to introduce the state-of-the-art machine learning algorithm XGboost as an imputation method that can improve imputation accuracy. In practice, however, XGboost has many parameters that need to be tuned to deliver on its potential for high performance. Although cross-validation may find the best parameters, it is very time-consuming. As an alternative, we empirically determined the parameters for two kinds of XGboost base learners. To explore the robustness and performance of XGboost-based imputation with predetermined parameters, we conducted tests on three benchmark datasets. For comparison, six common imputation methods were also evaluated in terms of normalized root mean squared error and Pearson correlation coefficient. The comparative experimental results indicated that the XGboost-based imputation method using the linear base learner is competitive with or outperforms its competitors, including random forest-based imputation, achieving smaller imputation errors and better structure preservation under the empirical parameters for the three benchmark datasets.
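
The two evaluation criteria named in this abstract can be illustrated with a minimal sketch. The abstract does not state its exact formulas, so the NRMSE normalization used here (mean squared error divided by the variance of the true values, then square-rooted) is an assumption, as are the function names and the made-up toy numbers.

```python
import math


def nrmse(true_vals, imputed_vals):
    """NRMSE under an assumed normalization: sqrt(MSE / variance of the true values)."""
    n = len(true_vals)
    mean_true = sum(true_vals) / n
    mse = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    return math.sqrt(mse / var_true)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical held-out intensities and the values an imputation method produced.
held_out_true = [20.1, 22.4, 19.8, 25.0]
imputed = [20.5, 22.0, 20.3, 24.1]

print("NRMSE:", nrmse(held_out_true, imputed))
print("Pearson r:", pearson(held_out_true, imputed))
```

Under this normalization, a method that imputes the mean of the true values everywhere scores exactly 1, so values well below 1 indicate an improvement over that baseline.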

An IoT-Focused Intrusion Detection System Approach Based on Preprocessing Characterization for Cybersecurity Datasets

Sensors ◽

10.3390/s21020656 ◽

2021 ◽

Vol 21 (2) ◽

pp. 656

Author(s):

Xavier Larriva-Novo ◽

Víctor A. Villagrá ◽

Mario Vega-Barbas ◽

Diego Rivera ◽

Mario Sanz Rodrigo

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

High Performance ◽

Learning Algorithm ◽

Detection System ◽

Machine Learning Algorithms ◽

Statistical Characteristics ◽

Detection Techniques ◽

Traffic Characteristics ◽

Benchmark Datasets

Security in IoT networks is currently mandatory because of the large amount of data that has to be handled. These systems are vulnerable to several cybersecurity attacks, which are increasing in number and sophistication. For this reason, new intrusion detection techniques that are as accurate as possible for these scenarios have to be developed. Intrusion detection systems based on machine learning algorithms have already shown high performance in terms of accuracy. This research proposes the study and evaluation of several preprocessing techniques based on traffic categorization for a machine learning neural network algorithm. For its evaluation, this research uses two benchmark datasets, namely UGR16 and UNSW-NB15, and one of the most used datasets, KDD99. The preprocessing techniques were evaluated in accordance with scalar and normalization functions. All of these preprocessing models were applied to different sets of characteristics based on a categorization composed of four groups of features: basic connection features, content characteristics, statistical characteristics and, finally, a group composed of traffic-based features and connection direction-based traffic characteristics. The objective of this research is to evaluate this categorization by using various data preprocessing techniques to obtain the most accurate model. Our proposal shows that, by applying the categorization of network traffic and several preprocessing techniques, the accuracy can be enhanced by up to 45%. Preprocessing a specific group of characteristics yields greater accuracy, allowing the machine learning algorithm to correctly classify these parameters related to possible attacks.
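
As a rough illustration of what a scaling or normalization function does to a traffic feature, the sketch below applies min-max scaling and z-score standardization to one column. The abstract does not say which functions the authors used, so these two are common examples chosen as an assumption, and the feature values are hypothetical.

```python
import math


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical connection durations in seconds for four flows.
durations = [10.0, 20.0, 30.0, 400.0]
print(min_max_scale(durations))
print(z_score(durations))
```

Applying such functions per feature group, rather than to all features at once, is the kind of choice the categorization described above makes possible.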

NAguideR: performing and prioritizing missing value imputations for consistent bottom-up proteomic analyses

Nucleic Acids Research ◽

10.1093/nar/gkaa498 ◽

2020 ◽

Vol 48 (14) ◽

pp. e83-e83 ◽

Cited By ~ 1

Author(s):

Shisheng Wang ◽

Wenxue Li ◽

Liqiang Hu ◽

Jingqiu Cheng ◽

Hao Yang ◽

...

Keyword(s):

Mass Spectrometry ◽

Quantitative Proteomics ◽

Missing Values ◽

Protein Complexes ◽

Label Free ◽

Missing Value ◽

Imputation Methods ◽

Data Independent Acquisition ◽

Low Performance ◽

User Friendly

Abstract: Mass spectrometry (MS)-based quantitative proteomics experiments frequently generate data with missing values, which may profoundly affect downstream analyses. A wide variety of imputation methods have been established to deal with the missing-value issue. To date, however, there is a scarcity of efficient, systematic, and easy-to-handle tools that are tailored for the proteomics community. Herein, we developed a user-friendly and powerful stand-alone software, NAguideR, to enable implementation and evaluation of different missing value methods offered by 23 widely used missing-value imputation algorithms. NAguideR further evaluates data imputation results through classic computational criteria and, unprecedentedly, proteomic empirical criteria, such as quantitative consistency between different charge-states of the same peptide, different peptides belonging to the same proteins, and individual proteins participating in protein complexes and functional interactions. We applied NAguideR to three label-free proteomic datasets featuring peptide-level, protein-level, and phosphoproteomic variables respectively, all generated by data-independent acquisition mass spectrometry (DIA-MS) with substantial biological replicates. The results indicate that NAguideR is able to discriminate the optimal imputation methods that facilitate DIA-MS experiments from sub-optimal and low-performance algorithms. NAguideR further provides downloadable tables and figures supporting flexible data analysis and interpretation. NAguideR is freely available at http://www.omicsolution.org/wukong/NAguideR/ and the source code at https://github.com/wangshisheng/NAguideR/.

PolySTest: Robust statistical testing of proteomics data with missing values improves detection of biologically relevant features

10.1101/765818 ◽

2019 ◽

Cited By ~ 1

Author(s):

Veit Schwämmle ◽

Christina E Hagensen ◽

Adelina Rogowska-Wrzesinska ◽

Ole N. Jensen

Keyword(s):

Mass Spectrometry ◽

Large Scale ◽

Missing Values ◽

Statistical Tests ◽

Ground Truth ◽

Statistical Testing ◽

Molecular Networks ◽

Proteomics Data ◽

Biologically Relevant ◽

Data Browsing

Abstract: Statistical testing remains one of the main challenges for high-confidence detection of differentially regulated proteins or peptides in large-scale quantitative proteomics experiments by mass spectrometry. Statistical tests need to be sufficiently robust to deal with experiment-intrinsic data structures and variations and often also reduced feature coverage across different biological samples due to ubiquitous missing values. A robust statistical test provides accurate confidence scores of large-scale proteomics results, regardless of instrument platform, experimental protocol and software tools. However, the multitude of different combinations of experimental strategies, mass spectrometry techniques and informatics methods complicates the choice of appropriate statistical approaches. We address this challenge by introducing PolySTest, a user-friendly web service for statistical testing, data browsing and data visualization. We introduce a new method, Miss Test, that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. We demonstrate that PolySTest with integrated Miss Test achieves higher confidence and higher sensitivity for artificial and experimental proteomics data sets with known ground truth. Application of PolySTest to mass spectrometry-based large-scale proteomics data obtained from differentiating muscle cells resulted in the rescue of 10%-20% additional proteins in the identified molecular networks relevant to muscle differentiation. We conclude that PolySTest is a valuable addition to existing tools and instrument enhancements that improve coverage and depth of large-scale proteomics experiments. A fully functional demo version of PolySTest and Miss Test is available via http://computproteomics.bmb.sdu.dk/Apps/PolySTest.

Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data

10.1101/171967 ◽

2017 ◽

Cited By ~ 1

Author(s):

Runmin Wei ◽

Jingye Wang ◽

Mingming Su ◽

Erik Jia ◽

Tianlu Chen ◽

...

Keyword(s):

Mass Spectrometry ◽

Missing Values ◽

Pearson Correlation ◽

Imputation Accuracy ◽

Metabolomics Data ◽

Missing Value ◽

Sample Distribution ◽

Imputation Methods ◽

Missing Value Imputation ◽

Squared Error

Abstract

Introduction: Missing values exist widely in mass spectrometry (MS)-based metabolomics data. Various methods have been applied for handling missing values, but the selection of methods can significantly affect subsequent data analyses and interpretations. By definition, there are three types of missing values: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

Objectives: The aim of this study was to comprehensively compare common imputation methods for different types of missing values using two separate metabolomics data sets (977 and 198 serum samples respectively) to propose a strategy to deal with missing values in metabolomics studies.

Methods: Imputation methods included zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC). Normalized root mean squared error (NRMSE) and NRMSE-based sum of ranks (SOR) were applied to evaluate the imputation accuracy for MCAR/MAR and MNAR correspondingly. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes sum of squared error were used to evaluate the overall sample distribution. Student’s t-test followed by Pearson correlation analysis was conducted to evaluate the effect of imputation on univariate statistical analysis.

Results: Our findings demonstrated that RF imputation performed the best for MCAR/MAR and QRILC was the favored one for MNAR.

Conclusion: In combination with the “modified 80% rule”, we proposed a comprehensive strategy and developed a publicly accessible web tool for missing value imputation in metabolomics data.
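
The four simplest methods in this list (zero, half minimum, mean, median) can be sketched in a few lines. This is a minimal illustration rather than the study's implementation: it assumes missing values are encoded as `None`, that imputation is done per feature, and the function name and toy values are hypothetical.

```python
import statistics


def impute_feature(values, method):
    """Fill None entries of one feature with a single constant chosen by `method`."""
    observed = [v for v in values if v is not None]
    if method == "zero":
        fill = 0.0
    elif method == "half_min":
        # Half of the smallest observed value, a common choice for left-censored data.
        fill = min(observed) / 2
    elif method == "mean":
        fill = statistics.mean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


# Hypothetical metabolite intensities with one missing measurement.
feature = [4.0, None, 2.0, 9.0]
for name in ("zero", "half_min", "mean", "median"):
    print(name, impute_feature(feature, name))
```

The model-based methods in the list (RF, SVD, kNN, QRILC) use information from other features or from an assumed distribution and are not covered by this sketch.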

proDA: Probabilistic Dropout Analysis for Identifying Differentially Abundant Proteins in Label-Free Mass Spectrometry

10.21203/rs.3.rs-36351/v1 ◽

2020 ◽

Author(s):

Constantin Ahlmann-Eltze ◽

Simon Anders

Keyword(s):

Mass Spectrometry ◽

Quantitative Proteomics ◽

Statistical Power ◽

Missing Values ◽

Ad Hoc ◽

Linear Models ◽

Statistical Tests ◽

High Sensitivity ◽

Small Sample ◽

Label Free

Abstract: Protein mass spectrometry with label-free quantification (LFQ) is widely used for quantitative proteomics studies. Nevertheless, well-principled statistical inference procedures are still lacking, and most practitioners adopt methods from transcriptomics. These, however, cannot properly treat the principal complication of label-free proteomics, namely many non-randomly missing values. We present proDA, a method to perform statistical tests for differential abundance of proteins. It models missing values in an intensity-dependent probabilistic manner. proDA is based on linear models and thus suitable for complex experimental designs, and boosts statistical power for small sample sizes by using variance moderation. We show that the currently widely used methods based on ad hoc imputation schemes can report excessive false positives, and that proDA not only overcomes this serious issue but also offers high sensitivity. Thus, proDA fills a crucial gap in the toolbox of quantitative proteomics.
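
To make "intensity-dependent probabilistic" missingness concrete, the sketch below defines a dropout curve in which low intensities are more likely to be missing. The abstract gives no formula, so the sigmoidal shape (a normal CDF), the parameter names `midpoint` and `scale`, and the numbers are all assumptions; this is not proDA's API.

```python
import math


def dropout_probability(intensity, midpoint, scale):
    """Assumed dropout curve: P(missing) = Phi((midpoint - intensity) / scale).

    Phi is the standard normal CDF, so the probability is 0.5 at `midpoint`
    and rises toward 1 as the intensity falls below it.
    """
    z = (midpoint - intensity) / scale
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical log2 intensities around an assumed midpoint of 20.
for log_intensity in (16.0, 18.0, 20.0, 22.0, 24.0):
    p = dropout_probability(log_intensity, midpoint=20.0, scale=1.5)
    print(f"log2 intensity {log_intensity}: P(missing) = {p:.3f}")
```

A curve of this kind lets a model treat a missing value as evidence of low abundance instead of replacing it with a single imputed number.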

PolySTest: Robust Statistical Testing of Proteomics Data with Missing Values Improves Detection of Biologically Relevant Features

Molecular & Cellular Proteomics ◽

10.1074/mcp.ra119.001777 ◽

2020 ◽

Vol 19 (8) ◽

pp. 1396-1408 ◽

Cited By ~ 2

Author(s):

Veit Schwämmle ◽

Christina E. Hagensen ◽

Adelina Rogowska-Wrzesinska ◽

Ole N. Jensen

Keyword(s):

Mass Spectrometry ◽

Large Scale ◽

Missing Values ◽

Statistical Tests ◽

Ground Truth ◽

Statistical Testing ◽

Molecular Networks ◽

Proteomics Data ◽

Biologically Relevant ◽

Data Browsing

Statistical testing remains one of the main challenges for high-confidence detection of differentially regulated proteins or peptides in large-scale quantitative proteomics experiments by mass spectrometry. Statistical tests need to be sufficiently robust to deal with experiment-intrinsic data structures and variations and often also reduced feature coverage across different biological samples due to ubiquitous missing values. A robust statistical test provides accurate confidence scores of large-scale proteomics results, regardless of instrument platform, experimental protocol and software tools. However, the multitude of different combinations of experimental strategies, mass spectrometry techniques and informatics methods complicates the choice of appropriate statistical approaches. We address this challenge by introducing PolySTest, a user-friendly web service for statistical testing, data browsing and data visualization. We introduce a new method, Miss test, that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. We demonstrate that PolySTest with integrated Miss test achieves higher confidence and higher sensitivity for artificial and experimental proteomics data sets with known ground truth. Application of PolySTest to mass spectrometry-based large-scale proteomics data obtained from differentiating muscle cells resulted in the rescue of 10–20% additional proteins in the identified molecular networks relevant to muscle differentiation. We conclude that PolySTest is a valuable addition to existing tools and instrument enhancements that improve coverage and depth of large-scale proteomics experiments. A fully functional demo version of PolySTest and Miss test is available via http://computproteomics.bmb.sdu.dk/Apps/PolySTest.

IceR improves proteome coverage and data completeness in global and single-cell proteomics

Nature Communications ◽

10.1038/s41467-021-25077-6 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Mathias Kalxdorf ◽

Torsten Müller ◽

Oliver Stegle ◽

Jeroen Krijgsveld

Keyword(s):

Single Cell ◽

Large Scale ◽

Missing Values ◽

Peptide Identification ◽

Protein Quantification ◽

Developmental Trajectory ◽

Ion Current ◽

Label Free ◽

Proteomics Data ◽

Data Completeness

Abstract: Label-free proteomics by data-dependent acquisition enables the unbiased quantification of thousands of proteins; however, it notoriously suffers from high rates of missing values, thus prohibiting consistent protein quantification across large sample cohorts. To solve this, we here present IceR (Ion current extraction Re-quantification), an efficient and user-friendly quantification workflow that combines high identification rates of data-dependent acquisition with low missing value rates similar to data-independent acquisition. Specifically, IceR uses ion current information for a hybrid peptide identification propagation approach with superior quantification precision, accuracy, reliability and data completeness compared to other quantitative workflows. Applied to plasma and single-cell proteomics data, IceR enhanced the number of reliably quantified proteins, improved discriminability between single-cell populations, and allowed reconstruction of a developmental trajectory. IceR will be useful to improve performance of large-scale global as well as low-input proteomics applications, facilitated by its availability as an easy-to-use R package.

Label-Free Quantitative Mass Spectrometry Reveals a Panel of Differentially Expressed Proteins in Colorectal Cancer

BioMed Research International ◽

10.1155/2015/365068 ◽

2015 ◽

Vol 2015 ◽

pp. 1-13 ◽

Cited By ~ 10

Author(s):

Nai-Jun Fan ◽

Jiang-Ling Gao ◽

Yan Liu ◽

Wei Song ◽

Zhan-Yang Zhang ◽

...

Keyword(s):

Mass Spectrometry ◽

High Performance ◽

Cell Structure ◽

Protein Profiling ◽

Differentially Expressed ◽

Antimicrobial Protein ◽

Differentially Expressed Proteins ◽

Label Free ◽

Mucosal Tissues ◽

And Function

To identify potential biomarkers involved in colorectal cancer (CRC), a shotgun proteomic method was applied to identify soluble proteins in three CRCs and matched normal mucosal tissues using high-performance liquid chromatography and mass spectrometry. Label-free protein profiling of three CRCs and matched normal mucosal tissues was then conducted to quantify and compare proteins. Results showed that 67 of the 784 identified proteins were linked to CRC (28 upregulated and 39 downregulated). Gene Ontology and DAVID databases were searched to identify the location and function of differential proteins that were related to the biological processes of binding, cell structure, signal transduction, cell adhesion, and so on. Among the differentially expressed proteins, tropomyosin-3 (TPM3), endoplasmic reticulum resident protein 29 (ERp29), 18 kDa cationic antimicrobial protein (CAMP), and heat shock 70 kDa protein 8 (HSPA8) were verified to be upregulated in CRC tissue and seven cell lines through western blot analysis. Furthermore, the upregulation of TPM3, ERp29, CAMP, and HSPA8 was validated in 69 CRCs by immunohistochemistry (IHC) analysis. The combination of TPM3, ERp29, CAMP, and HSPA8 can distinguish CRC from matched normal mucosa with an accuracy of 73.2% using the IHC score. These results suggest that TPM3, ERp29, CAMP, and HSPA8 have great potential as IHC diagnostic biomarkers for CRC.

MSImpute: Imputation of label-free mass spectrometry peptides by low-rank approximation

10.1101/2020.08.12.248963 ◽

2020 ◽

Author(s):

Soroor Hediyeh-zadeh ◽

Andrew I. Webb ◽

Melissa J. Davis

Keyword(s):

Mass Spectrometry ◽

Data Analysis ◽

Missing Values ◽

Dynamic Range ◽

Dropout Rate ◽

Low Rank ◽

Proteomics Data ◽

Low Rank Approximation ◽

Formidable Challenge ◽

Rank Approximation

Abstract: Recent developments in mass spectrometry (MS) instruments and data acquisition modes have aided multiplexed, fast, reproducible and quantitative analysis of proteome profiles, yet missing values remain a formidable challenge for proteomics data analysis. The stochastic nature of sampling in Data Dependent Acquisition (DDA), suboptimal preprocessing of Data Independent Acquisition (DIA) runs and dynamic range limitation of MS instruments impede the reproducibility and accuracy of peptide quantification and can introduce systematic patterns of missingness that impact downstream analyses. Thus, imputation of missing values becomes an important element of data analysis. We introduce msImpute, an imputation method based on low-rank approximation, and compare it to six alternative imputation methods using public DDA and DIA datasets. We evaluate the performance of methods by determining the error of imputed values and accuracy of detection of differential expression. We also measure the post-imputation preservation of structures in the data at different levels of granularity. We develop a visual diagnostic to determine the nature of missingness in datasets based on peptides with high biological dropout rate and introduce a method to identify such peptides. Our findings demonstrate that msImpute performs well when data are missing at random and highlight the importance of prior knowledge about the nature of missing values in a dataset when selecting an imputation technique.
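
The general idea of imputing by low-rank approximation can be shown with a deliberately simple stand-in: fit a rank-1 model to the observed entries by alternating least squares, then fill the gaps from the fitted model. This is not the msImpute algorithm, whose details the abstract does not give; the function name, the `None` encoding of missing values, and the toy matrix are assumptions.

```python
def rank1_impute(matrix, iterations=200):
    """Fill None entries using a rank-1 fit u[i] * v[j] to the observed entries.

    Assumes every row and every column has at least one observed, non-zero entry.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Update each row factor by least squares over that row's observed entries.
        for i in range(n_rows):
            cols = [j for j in range(n_cols) if matrix[i][j] is not None]
            u[i] = sum(matrix[i][j] * v[j] for j in cols) / sum(v[j] ** 2 for j in cols)
        # Update each column factor the same way, holding the row factors fixed.
        for j in range(n_cols):
            rows = [i for i in range(n_rows) if matrix[i][j] is not None]
            v[j] = sum(matrix[i][j] * u[i] for i in rows) / sum(u[i] ** 2 for i in rows)
    # Keep observed values untouched; only missing cells take the model's prediction.
    return [
        [u[i] * v[j] if matrix[i][j] is None else matrix[i][j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy peptide-by-sample matrix whose observed entries follow an exact rank-1 pattern.
toy = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, None],
    [3.0, 6.0, 9.0],
]
for row in rank1_impute(toy):
    print(row)
```

Real data call for a higher rank and some form of regularization, which is where dedicated methods differ from this sketch.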
