Missing Value Imputation Using XGBoost for Label-Free Mass Spectrometry-Based Proteomics Data
Abstract: Label-free mass spectrometry-based proteomics data inevitably suffer from missing values, which prevent downstream analyses that require a complete data matrix. Our motivation is to introduce the state-of-the-art machine learning algorithm XGBoost as an imputation method that can improve imputation accuracy. In practice, however, XGBoost has many parameters that need to be tuned to deliver its potential high performance. Although cross-validation may find the best parameters, it is very time-consuming. As an alternative, we empirically determined the parameters for two kinds of XGBoost base learners. To explore the robustness and performance of XGBoost-based imputation with predetermined parameters, we conducted tests on three benchmark datasets. For comparison, six common imputation methods were also evaluated in terms of normalized root mean squared error and Pearson correlation coefficient. The experimental results indicate that, under the empirical parameters, the XGBoost-based imputation method using the linear base learner is competitive with or outperforms its competitors, including random forest-based imputation, achieving smaller imputation errors and better structure preservation on the three benchmark datasets.
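The two evaluation criteria named above can be illustrated with a minimal sketch. It assumes the NRMSE definition commonly used in the imputation literature (root of the mean squared error over the masked entries divided by the variance of the true values there); the abstract itself does not spell out the formula. The function names `nrmse` and `pearson`, the simulated matrix, and the column-mean imputation are hypothetical stand-ins for illustration only, not the XGBoost-based method or the benchmark data.

```python
import numpy as np

def nrmse(x_true, x_imputed, missing_mask):
    """NRMSE over masked entries (assumed definition):
    sqrt(mean((true - imputed)^2) / var(true))."""
    t = x_true[missing_mask]
    p = x_imputed[missing_mask]
    return float(np.sqrt(np.mean((t - p) ** 2) / np.var(t)))

def pearson(x_true, x_imputed, missing_mask):
    """Pearson correlation between true and imputed values at masked entries."""
    t = x_true[missing_mask]
    p = x_imputed[missing_mask]
    return float(np.corrcoef(t, p)[0, 1])

# Simulated complete matrix (rows: proteins, columns: samples); hypothetical data.
rng = np.random.default_rng(0)
complete = rng.normal(loc=20.0, scale=2.0, size=(50, 6))

# Hide roughly 20% of the entries at random to mimic missing values.
mask = rng.random(complete.shape) < 0.2
observed = np.where(mask, np.nan, complete)

# Stand-in imputation: fill each missing entry with its column mean.
col_means = np.nanmean(observed, axis=0)
imputed = np.where(mask, col_means, observed)

print("NRMSE:", nrmse(complete, imputed, mask))
print("Pearson r:", pearson(complete, imputed, mask))
```

Under this definition, a lower NRMSE indicates smaller imputation error, and a Pearson coefficient closer to 1 indicates that the imputed values track the hidden true values more closely. Any imputation method can be scored the same way by replacing the column-mean step.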