A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Datasets
Abstract
Missing values in proteomic data sets have real consequences for downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single method is best suited to a diverse range of data sets, and no clear strategy exists for evaluating imputation methods on large-scale DIA-MS data sets, especially at different levels of protein quantification. To navigate the imputation strategies available in the literature, we established a workflow to assess imputation methods on large-scale label-free DIA-MS data sets. We used two distinct DIA-MS data sets with real missing values, a dilution series data set and an independent data set of actual experimental samples, to evaluate eight imputation methods with multiple parameters at different levels of protein quantification. We found that imputation methods based on local structures within the data, such as local least squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, such as Bayesian principal component analysis (BPCA), performed well in our independent data set. We also found that imputation at the most basic level of protein quantification, the fragment level, improved both accuracy and the number of proteins quantified. Overall, this study indicates that the most suitable imputation method depends on the overall structure and correlations of proteins within the data set and can be identified with the workflow presented here.
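
To make the idea of comparing imputation methods concrete, the following is a minimal sketch and not the workflow used in this study. It swaps in a simpler technique: known values in a complete toy matrix are hidden at random, each candidate method fills them in, and the methods are ranked by normalized root mean squared error (NRMSE). The toy dilution-like data, the masking scheme, the NRMSE metric, and the function names (`column_mean_impute`, `knn_impute`, `nrmse`, `evaluate`) are all assumptions made for illustration. The study itself evaluates on real missing values and on methods such as LLS, RF, and BPCA, which are not implemented here.

```python
import math
import random


def column_mean_impute(matrix):
    """Fill each missing value (None) with the mean of its column."""
    n_cols = len(matrix[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]


def knn_impute(matrix, k=3):
    """Fill each missing value with the mean of that column over the k
    most similar rows (a simple local-structure method)."""
    fallback = column_mean_impute(matrix)
    result = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for i2, other in enumerate(matrix):
                if i2 == i or other[j] is None:
                    continue
                # RMS distance over columns observed in both rows.
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    dist = math.sqrt(sum(shared) / len(shared))
                    candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [val for _, val in candidates[:k]]
            # Fall back to the column mean if no comparable row exists.
            result[i][j] = (sum(nearest) / len(nearest)
                            if nearest else fallback[i][j])
    return result


def nrmse(truth, imputed, masked):
    """RMSE over the masked cells, divided by the SD of their true values."""
    true_vals = [truth[i][j] for i, j in masked]
    mse = sum((truth[i][j] - imputed[i][j]) ** 2
              for i, j in masked) / len(masked)
    mean_true = sum(true_vals) / len(true_vals)
    sd_true = math.sqrt(sum((t - mean_true) ** 2 for t in true_vals)
                        / len(true_vals))
    return math.sqrt(mse) / sd_true


def evaluate(complete, imputers, missing_fraction=0.1, seed=0):
    """Hide a random fraction of cells, impute, and score each method."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(complete))
             for j in range(len(complete[0]))]
    masked = rng.sample(cells, max(1, int(missing_fraction * len(cells))))
    masked_set = set(masked)
    holed = [[None if (i, j) in masked_set else v for j, v in enumerate(row)]
             for i, row in enumerate(complete)]
    return {name: nrmse(complete, impute(holed), masked)
            for name, impute in imputers.items()}


# Toy data (assumption): 40 proteins x 6 samples of log2 intensities, where
# each sample is one further two-fold dilution step plus small noise.
data_rng = random.Random(42)
offsets = [0.0, -1.0, -2.0, -3.0, -4.0, -5.0]
complete = []
for _ in range(40):
    base = data_rng.uniform(10.0, 20.0)
    complete.append([base + off + data_rng.gauss(0.0, 0.1) for off in offsets])

scores = evaluate(complete, {"column_mean": column_mean_impute,
                             "knn": knn_impute})
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: NRMSE = {score:.3f}")
```

Lower NRMSE indicates imputed values closer to the hidden truth. In practice, the two toy imputers would be replaced by established implementations, and the comparison would be repeated at each quantification level of interest.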