Assessment of label-free quantification and missing value imputation for proteomics in non-human primates

Introduction: Reliable and effective label-free quantification (LFQ) analyses depend not only on the method of data acquisition in the mass spectrometer, but also on the downstream data processing, including software tools, query database, data normalization, and imputation. In non-human primates (NHP), LFQ is challenging because the query databases are limited: the genomes of these species are not comprehensively annotated. This invariably results in limited discovery of proteins and associated post-translational modifications (PTMs) and a higher fraction of missing data points. Identifying fewer proteins and PTMs because of database limitations can hide important and meaningful biological information, and missing data also limits downstream analyses (e.g., multivariate analyses), decreases statistical power, biases statistical inference, and makes biological interpretation of the data more challenging. In this study we attempted to address both issues: first, we used the MetaMorpheus proteomics search engine to counter the limits of NHP query databases and maximize the discovery of proteins and associated PTMs, and second, we evaluated different imputation methods for accurate data inference. Results: Using the MetaMorpheus proteomics search engine we obtained quantitative data for 1,622 proteins and 10,634 peptides including 58 different PTMs (biological, metal, and artifacts) across a diverse age range of NHP brain frontal cortex. However, among the 1,622 proteins identified, only 293 proteins were quantified across all samples with no missing values, emphasizing the importance of implementing an accurate and statistically valid imputation method to fill in missing data. In our imputation analysis we demonstrate that single imputation methods that borrow information from correlated proteins, such as Generalized Ridge Regression (GRR), Random Forest (RF), local least squares (LLS), and Bayesian Principal Component Analysis (BPCA), are able to estimate missing protein abundance values with high accuracy. Conclusions: Overall, this study offers a detailed comparative analysis of LFQ data generated in NHP and proposes strategies for improved LFQ in NHP proteomics data.
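
The abstract does not say how imputation accuracy was scored. One common way to compare imputers, sketched below as an assumption rather than the authors' procedure, is to hide a fraction of the observed values, impute them, and compute a normalized root-mean-square error (NRMSE) on the hidden cells. The function names, the toy log2 intensity matrix, and the row-mean baseline imputer are all hypothetical and stand in for methods such as GRR, RF, LLS, or BPCA.

```python
import math
import random


def nrmse(true_vals, imputed_vals):
    """RMSE of the imputed values, normalized by the std. dev. of the true values."""
    n = len(true_vals)
    mse = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / n
    mean_t = sum(true_vals) / n
    var_t = sum((t - mean_t) ** 2 for t in true_vals) / n
    if var_t == 0:
        raise ValueError("true values have zero variance; NRMSE is undefined")
    return math.sqrt(mse / var_t)


def mask_entries(matrix, fraction, seed=0):
    """Hide a random fraction of cells; return the masked copy and hidden positions."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = rng.sample(cells, max(1, int(fraction * len(cells))))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_row_mean(masked):
    """Baseline imputer (assumption): fill each gap with the mean of its row (protein)."""
    out = []
    for row in masked:
        observed = [v for v in row if v is not None]
        fill = sum(observed) / len(observed) if observed else 0.0
        out.append([fill if v is None else v for v in row])
    return out


def score_imputer(matrix, imputer, fraction=0.2, seed=0):
    """Mask known values, impute them, and score the imputer on the hidden cells only."""
    masked, hidden = mask_entries(matrix, fraction, seed)
    imputed = imputer(masked)
    truth = [matrix[i][j] for i, j in hidden]
    guess = [imputed[i][j] for i, j in hidden]
    return nrmse(truth, guess)


# Hypothetical complete matrix: rows are proteins, columns are samples (log2 intensity).
toy_log2 = [
    [20.1, 20.4, 19.8, 20.0, 20.3],
    [25.0, 25.2, 24.9, 25.1, 25.3],
    [18.0, 18.5, 17.9, 18.2, 18.1],
    [22.3, 22.0, 22.4, 22.1, 22.2],
]
print("NRMSE of row-mean baseline:", score_imputer(toy_log2, impute_row_mean))
```

Any imputer that accepts a matrix with `None` gaps and returns a filled matrix can be passed to `score_imputer`, so several methods can be ranked on the same masked cells. Random masking mimics values missing at random; abundance-dependent missingness would need a different masking rule.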

Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies

10.1101/260281 ◽

2018 ◽

Cited By ~ 2

Author(s):

Kieu Trinh Do ◽

Simone Wahl ◽

Johannes Raffler ◽

Sophie Molnos ◽

Michael Laimighofer ◽

...

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Statistical Power ◽

Missing Values ◽

Biological Evaluation ◽

List Type ◽

Robust Performance ◽

Metabolomics Data ◽

Imputation Methods ◽

Biochemical Pathways

Abstract. BACKGROUND: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in epidemiological studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation. METHODS: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established genetic metabolic quantitative trait loci. RESULTS: Run day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations (MICE) performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable. CONCLUSION: Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend that KNN-based imputation is performed on observations with variable pre-selection since it showed robust results in all evaluation schemes.
Key messages: Untargeted MS-based metabolomics data show missing values due to both batch-specific LOD-based and non-LOD-based effects. Statistical evaluation of multiple imputation methods was conducted on both simulated and real datasets. Biological evaluation on real data assessed the ability of imputation methods to preserve statistical inference of biochemical pathways and correctly estimate effects of genetic variants on metabolite levels. KNN-based imputation on observations with variable pre-selection and K = 10 showed robust performance for all data scenarios across all evaluation schemes.
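
A minimal sketch of KNN imputation on observations follows. It is a simplification and not the authors' implementation: the variable pre-selection step is omitted (all co-observed variables enter the distance), and the distance measure, the function name `knn_impute`, and the toy data are assumptions. Only the default `k=10` comes from the abstract.

```python
import math


def knn_impute(data, k=10):
    """Fill None entries with the mean over the k nearest samples (rows).

    Distance is the root-mean-square difference over variables observed in
    both samples; only samples that have the target variable observed can
    serve as neighbors.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no common variables: cannot compare these samples
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in data]  # copy so the input is not modified
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = sorted(
                (distance(row, other), other[j])
                for r, other in enumerate(data)
                if r != i and other[j] is not None
            )
            neighbors = [v for d, v in candidates[:k] if d < math.inf]
            if neighbors:  # leave the gap if no usable neighbor exists
                imputed[i][j] = sum(neighbors) / len(neighbors)
    return imputed


# Hypothetical data: rows are samples (observations), columns are metabolites.
samples = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [5.0, 6.0, 7.0],
    [1.0, 2.0, None],
]
filled = knn_impute(samples, k=2)
print(filled[3])
```

With `k=2` the last sample borrows from the two samples that resemble it most and ignores the distant third one. A pre-selection step would restrict the distance to variables correlated with the one being imputed.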

Computational Advances in the Label-free Quantification of Cancer Proteomics Data

Current Pharmaceutical Design ◽

10.2174/1381612824666181102125638 ◽

2019 ◽

Vol 24 (32) ◽

pp. 3842-3858 ◽

Cited By ~ 8

Author(s):

Jing Tang ◽

Yang Zhang ◽

Jianbo Fu ◽

Yunxia Wang ◽

Yi Li ◽

...

Keyword(s):

Large Scale ◽

Web Of Science ◽

Label Free ◽

Proteomics Data ◽

Dynamic Information ◽

Label Free Quantification ◽

Cancer Proteomics ◽

Future Direction ◽

Protein Alterations ◽

Free Quantification

Background: Because it provides quantitative and dynamic information on tumor genesis and development by directly profiling protein expression, proteomics has become intensely popular for characterizing the functional proteins driving malignant transformation, tracing the large-scale protein alterations induced by anticancer drugs, and discovering innovative targets and first-in-class drugs for oncologic disorders. Objective: Label-free quantification (LFQ) is frequently employed to quantify cancer proteomics data. However, low precision, poor reproducibility, and inaccuracy of LFQ have been recognized as the key “technical challenge” in the discovery of anticancer targets and drugs. In this paper, recent computational advances in LFQ for cancer proteomics were therefore systematically reviewed and analyzed. Methods: The PubMed and Web of Science databases were searched for label-free quantification approaches, cancer proteomics, and computational advances. Results: First, a variety of popular acquisition techniques and state-of-the-art quantification tools are systematically discussed and critically assessed. Then, processing approaches including transformation, normalization, filtering, and imputation are discussed, and their impacts on improving LFQ performance in cancer proteomics are evaluated. Finally, future directions for enhancing computation-based quantification techniques for cancer proteomics are proposed. Conclusion: The number of LFQ approaches has increased dramatically in recent years, which significantly enhances the diversity of possible quantification strategies for studying cancer proteomics.

ProtQuant: a tool for the label-free quantification of MudPIT proteomics data

BMC Bioinformatics ◽

10.1186/1471-2105-8-s7-s24 ◽

2007 ◽

Vol 8 (S7) ◽

Cited By ~ 42

Author(s):

Susan M Bridges ◽

G Bryce Magee ◽

Nan Wang ◽

W Paul Williams ◽

Shane C Burgess ◽

...

Keyword(s):

Label Free ◽

Proteomics Data ◽

Label Free Quantification ◽

Free Quantification

Label-free quantification with FDR-controlled match-between-runs

10.1101/2020.11.02.365437 ◽

2020 ◽

Author(s):

Fengchao Yu ◽

Sarah E. Haynes ◽

Alexey I. Nesvizhskii

Keyword(s):

Protein Identification ◽

Missing Values ◽

Superior Performance ◽

Label Free ◽

Experimental Conditions ◽

Mass Spectrometers ◽

Tandem Mass Spectra ◽

Label Free Quantification ◽

Statistical Confidence ◽

Free Quantification

Abstract: Missing values weaken the power of label-free quantitative proteomic experiments to uncover true quantitative differences between biological samples or experimental conditions. Match-between-runs (MBR) has become a common approach to mitigate the missing value problem, where peptides identified by tandem mass spectra in one run are transferred to another by inference based on m/z, charge state, retention time, and ion mobility when applicable. Though tolerances are used to ensure such transferred identifications are reasonably located and meet certain quality thresholds, little work has been done to evaluate the statistical confidence of MBR. Here, we present a mixture model-based approach to estimate the false discovery rate (FDR) of peptide and protein identification transfer, which we implement in the label-free quantification tool IonQuant. Using several benchmarking datasets generated on both Orbitrap and timsTOF mass spectrometers, we demonstrate that IonQuant with FDR-controlled MBR results in superior performance compared to MaxQuant. We further illustrate the need for FDR-controlled MBR in sparse datasets such as those from single-cell proteomics experiments.
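
The tolerance-based transfer step that the abstract describes can be sketched as follows. This is an illustration only, not IonQuant's algorithm: it covers m/z, charge state, and retention time matching, leaves out ion mobility, and does not attempt the mixture-model FDR estimation. The `Feature` class, the function `transfer_ids`, the tolerance values, and the example features are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    mz: float                      # mass-to-charge ratio
    charge: int                    # charge state
    rt: float                      # retention time in minutes
    peptide: Optional[str] = None  # None means no MS/MS identification


def transfer_ids(donor, acceptor, ppm_tol=10.0, rt_tol=0.5):
    """Copy peptide labels from identified donor features to unidentified
    acceptor features that agree in charge, m/z (ppm), and retention time.

    When several donor features fit, the one closest in retention time wins.
    Returns the list of acceptor features that received a transferred label.
    """
    transferred = []
    for feat in acceptor:
        if feat.peptide is not None:
            continue  # already identified by its own MS/MS spectrum
        best = None
        for ref in donor:
            if ref.peptide is None or ref.charge != feat.charge:
                continue
            ppm = abs(feat.mz - ref.mz) / ref.mz * 1e6
            drt = abs(feat.rt - ref.rt)
            if ppm <= ppm_tol and drt <= rt_tol and (best is None or drt < best[0]):
                best = (drt, ref.peptide)
        if best is not None:
            feat.peptide = best[1]
            transferred.append(feat)
    return transferred


# Hypothetical runs: the donor run has identifications, the acceptor run does not.
donor_run = [
    Feature(500.2500, 2, 30.0, "PEPTIDEK"),
    Feature(650.3300, 3, 45.0, "SAMPLER"),
]
acceptor_run = [
    Feature(500.2502, 2, 30.2),  # within both tolerances of the first donor feature
    Feature(650.3300, 2, 45.0),  # same m/z and time, but a different charge state
    Feature(700.0000, 2, 10.0),  # nothing comparable in the donor run
]
moved = transfer_ids(donor_run, acceptor_run)
print([(f.mz, f.peptide) for f in moved])
```

Every transfer accepted here is taken at face value, which is exactly the gap the abstract points to: tolerances alone give no estimate of how many transfers are false.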

Missing Data - Better "Not to Have Them", but What If You Do? (Part 1)

Marketing ZFP ◽

10.15358/0344-1369-2019-4-21 ◽

2019 ◽

Vol 41 (4) ◽

pp. 21-32

Author(s):

Dirk Temme ◽

Sarah Jensen

Keyword(s):

Missing Data ◽

Statistical Power ◽

Missing Values ◽

Graphical Representation ◽

Marketing Research ◽

Likelihood Estimation ◽

Parameter Estimates ◽

Full Information Maximum Likelihood ◽

Definition Of ◽

Traditional Approaches

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, this can lead to a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation analysis or full-information maximum likelihood estimation. Due to the available software, using these modern missing data methods does not pose a major obstacle. Still, their application requires a sound understanding of the prerequisites and limitations of these methods as well as a deeper understanding of the processes that have led to missing values in an empirical study. This article is Part 1 and first introduces Rubin’s classical definition of missing data mechanisms and an alternative, variable-based taxonomy, which provides a graphical representation. Secondly, a selection of visualization tools available in different R packages for the description and exploration of missing data structures is presented.
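
The practical consequence of different missing data mechanisms can be illustrated with a small simulation. The sketch below is an assumption-laden toy example, not taken from the article: it contrasts values that are missing completely at random (MCAR) with values that are missing because they are low (one form of missing not at random, MNAR), and compares the mean of the remaining observations with the true mean. The simulated variable, the threshold, and all function names are hypothetical.

```python
import random
import statistics


def make_mcar(values, p_missing, seed=0):
    """Delete each value with the same probability, regardless of its size (MCAR)."""
    rng = random.Random(seed)
    return [None if rng.random() < p_missing else v for v in values]


def make_mnar_below(values, threshold):
    """Delete every value below a threshold, so missingness depends on the value (MNAR)."""
    return [None if v < threshold else v for v in values]


def observed_mean(values):
    """Mean of the non-missing values, as listwise deletion would compute it."""
    return statistics.fmean([v for v in values if v is not None])


# Hypothetical complete variable: 1000 draws from a normal distribution.
rng = random.Random(42)
complete = [rng.gauss(10.0, 2.0) for _ in range(1000)]

true_mean = statistics.fmean(complete)
mcar_mean = observed_mean(make_mcar(complete, p_missing=0.3))
mnar_mean = observed_mean(make_mnar_below(complete, threshold=9.0))

print("true mean:", round(true_mean, 3))
print("mean after MCAR deletion:", round(mcar_mean, 3))
print("mean after MNAR deletion:", round(mnar_mean, 3))
```

Under MCAR the estimate loses precision but stays close to the true mean, whereas deleting low values shifts the mean upward, which is the kind of distorted parameter estimate the abstract warns about.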
