A Benchmark for Data Imputation Methods

2021 ◽  
Vol 4 ◽  
Author(s):  
Sebastian Jäger ◽  
Arndt Allhorn ◽  
Felix Bießmann

With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). For machine learning (ML) applications, too, high data quality standards are crucial to ensure robust predictive performance and responsible use of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and, when not detected, can have a devastating impact on downstream ML applications. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are still lacking. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when either only the test data or both the training and test data are affected by missing values. Each imputation method is evaluated with respect to its imputation quality and the impact imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help researchers and engineers guide their selection of data preprocessing methods for automated data quality improvement.
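A minimal sketch of the benchmark's two evaluation axes, under assumed choices (scikit-learn's breast-cancer data, 20% MCAR missingness in the test split only, mean and kNN imputers standing in for the classical baselines), not the authors' actual experimental suite:

```python
# Evaluate imputers on (1) imputation RMSE against held-out true values and
# (2) downstream accuracy of a classifier applied to the imputed test data.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
mask = rng.random(X_test.shape) < 0.2        # 20% MCAR missingness in test data
X_test_miss = X_test.copy()
X_test_miss[mask] = np.nan

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
for name, imputer in [("mean", SimpleImputer()), ("knn", KNNImputer())]:
    imputer.fit(X_train)                      # train data here is complete
    X_imp = imputer.transform(X_test_miss)
    rmse = np.sqrt(np.mean((X_imp[mask] - X_test[mask]) ** 2))
    acc = clf.score(X_imp, y_test)            # downstream impact
    print(f"{name}: imputation RMSE={rmse:.3f}, downstream accuracy={acc:.3f}")
```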

2019 ◽  
Vol 35 (19) ◽  
pp. 3786-3793 ◽  
Author(s):  
Pietro Di Lena ◽  
Claudia Sala ◽  
Andrea Prodi ◽  
Christine Nardini

Abstract

Motivation: DNA methylation is a stable epigenetic mark with major implications in both physiological (development, aging) and pathological conditions (cancers and numerous diseases). Recent research involving methylation focuses on the development of molecular age estimation methods based on DNA methylation levels (mAge). An increasing number of studies indicate that divergences between mAge and chronological age may be associated with age-related diseases. Current advances in high-throughput technologies have allowed the characterization of DNA methylation levels throughout the human genome. However, experimental methylation profiles often contain multiple missing values that can affect the analysis of the data and also mAge estimation. Although several imputation methods exist, a major deficiency lies in their inability to cope with large datasets, such as DNA methylation chips. Specific methods for imputing missing methylation data are therefore needed.

Results: We present a simple and computationally efficient imputation method, methyLImp, based on linear regression. The rationale of the approach lies in the observation that methylation levels show a high degree of inter-sample correlation. We performed a comparative study of our approach with other imputation methods on DNA methylation data of healthy and disease samples from different tissues. Performance has been assessed both in terms of imputation accuracy and in terms of the impact imputed values have on mAge estimation. In comparison to existing methods, our linear regression model performs equally well or better, with good computational efficiency. The results of our analysis provide recommendations for accurate estimation of missing methylation values.

Availability and implementation: The R package methyLImp is freely available at https://github.com/pdilena/methyLImp.

Supplementary information: Supplementary data are available at Bioinformatics online.
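A toy sketch of the core idea, high inter-sample correlation enabling per-sample linear regression, on synthetic beta values; the published methyLImp algorithm differs in detail (e.g., it handles the [0, 1] range via a transformation), so this is an illustration, not the package:

```python
# Impute one sample's missing CpG values by regressing its observed values
# on the other (highly correlated) samples, then predicting the gaps.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# rows = CpG sites, columns = samples; shared signal mimics inter-sample correlation
base = rng.random((500, 1))
data = np.clip(base + 0.05 * rng.standard_normal((500, 6)), 0, 1)

target = 0                                   # sample with missing values
missing = rng.random(500) < 0.1              # ~10% of its CpGs are missing
X_other = np.delete(data, target, axis=1)    # the remaining samples as predictors

model = LinearRegression().fit(X_other[~missing], data[~missing, target])
imputed = np.clip(model.predict(X_other[missing]), 0, 1)
print("imputation RMSE:", np.sqrt(np.mean((imputed - data[missing, target]) ** 2)))
```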


Missing data is a problem prevalent in most of the datasets used for statistical analysis. In this study, we analyze the missing values in the ISBSG R1 2018 dataset, available from the International Software Benchmarking Standards Group (ISBSG), and address the problem through imputation, a machine learning technique that can increase the availability of data. We compare the performance of four imputation methods applied to this dataset: Classification and Regression Trees (CART), Polynomial Regression (PR), Predictive Mean Matching (PMM), and Random Forest (RF). Through imputation, we were able to increase data availability by four times. We also evaluated the performance of these methods against the original dataset without imputation using an ensemble of Linear Regression, Gradient Boosting, Random Forest, and ANN. Imputation using CART can increase the availability of the overall dataset, but only at the cost of some of the model's predictive capability. However, CART remains the option of choice for extending the usability of the data, since it retains rows that traditional methods would otherwise remove from the dataset. In our experiments, this approach increased the usability of the original dataset to 63%, with a 2 to 3% decrease in overall predictive performance.
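A hedged Python analogue of the availability argument: the study's CART/PMM/RF imputers come from the R mice/missForest ecosystem, so here scikit-learn's IterativeImputer with tree-based estimators stands in, on placeholder data rather than the ISBSG schema:

```python
# Compare rows usable under listwise deletion vs after tree-based imputation.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.random((300, 5))
X[rng.random(X.shape) < 0.3] = np.nan        # 30% of cells missing

print("rows usable under listwise deletion:",
      (~np.isnan(X).any(axis=1)).sum(), "of", len(X))

imputers = {
    "CART-like": IterativeImputer(estimator=DecisionTreeRegressor(max_depth=5),
                                  random_state=0),
    "RF-like": IterativeImputer(estimator=RandomForestRegressor(n_estimators=20),
                                random_state=0),
}
for name, imp in imputers.items():
    X_full = imp.fit_transform(X)            # every row is retained
    print(f"{name}: all {len(X_full)} rows usable after imputation")
```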


Author(s):  
Mingyi Liu ◽  
Ashok Dongre

Abstract Label-free shotgun proteomics is an important tool in biomedical research, where tandem mass spectrometry with data-dependent acquisition (DDA) is frequently used for protein identification and quantification. However, DDA datasets contain a significant number of missing values (MVs) that severely hinder proper analysis. Existing literature suggests that different imputation methods should be used for the two types of MVs: missing completely at random (MCAR) or missing not at random (MNAR). However, the simulated or biased datasets utilized by most such studies offer few clues about the composition, and thus the proper imputation, of MVs in real-life proteomic datasets. Moreover, the impact of imputation methods on downstream differential expression analysis, a critical goal for many biomedical projects, is largely undetermined. In this study, we investigated public DDA datasets of various tissue/sample types to determine the composition of MVs in them. We then developed simulated datasets that imitate the MV profile of real-life datasets. Using such datasets, we compared the impact of various popular imputation methods on the analysis of differentially expressed proteins. Finally, we make recommendations on which imputation method(s) to use for proteomic data beyond just DDA datasets.
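A sketch of the two missingness mechanisms at stake, simulated the way they are commonly modeled for DDA proteomics (MCAR uniform at random, MNAR intensity-dependent); the matrix and rates are illustrative assumptions, not the paper's simulated datasets:

```python
# MCAR removes intensities uniformly at random; MNAR preferentially removes
# low-abundance values, mimicking the instrument's detection limit.
import numpy as np

rng = np.random.default_rng(3)
log_intensity = rng.normal(loc=25, scale=3, size=(1000, 10))  # log2 intensities

# MCAR: every cell has the same 5% chance of being missing
mcar = log_intensity.copy()
mcar[rng.random(mcar.shape) < 0.05] = np.nan

# MNAR: missingness probability rises as intensity falls below a soft threshold
p_miss = 1 / (1 + np.exp(log_intensity - 22))   # logistic curve centered at 22
mnar = log_intensity.copy()
mnar[rng.random(mnar.shape) < p_miss] = np.nan

print("MCAR mean of observed:", np.nanmean(mcar))  # essentially unbiased
print("MNAR mean of observed:", np.nanmean(mnar))  # biased upward
```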


Author(s):  
Diwakar Tripathi ◽  
Alok Kumar Shukla ◽  
Ramchandra Reddy B. ◽  
Ghanshyam S. Bopche

Credit scoring is the process of estimating the risk associated with a credit product, and it directly affects the profitability of the lending industry. Financial institutions apply credit scoring periodically at various stages of the lending process. The main focus of this study is to improve the predictive performance of the credit scoring model. To this end, the study proposes a multi-layer hybrid credit scoring model. The first stage concerns pre-processing, which includes treatment of missing values, data transformation, and removal of irrelevant and noisy features, because these may affect the predictive performance of the model. The second stage applies various ensemble learning approaches such as Bagging and AdaBoost. The last layer applies an ensemble-classifier approach that combines three heterogeneous classifiers, namely random forest (RF), logistic regression (LR), and sequential minimal optimization (SMO), for classification. The proposed multi-layer model is validated on several real-world credit scoring datasets.
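A minimal sketch of the final layer's heterogeneous combination, assuming a stacking scheme and scikit-learn's SVC as a stand-in for Weka-style SMO (SMO is an SVM training algorithm); the dataset and preprocessing are placeholders:

```python
# Combine three heterogeneous classifiers via stacking, with a logistic
# regression meta-learner blending their predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

ensemble = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True, random_state=0)),  # SMO-like SVM
    ],
    final_estimator=LogisticRegression(),
)
print("CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```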


Author(s):  
Sumanth Yenduri ◽  
S. S. Iyengar

In this study, we compare the performance of four different imputation strategies, ranging from the commonly used Listwise Deletion to model-based approaches such as Maximum Likelihood, for enhancing completeness in incomplete software project data sets. We evaluate the impact of each of these methods by applying them to six real-time software project data sets, which are classified into different categories based on their inherent properties. The reliability of the data sets constructed using these techniques is further tested by building prediction models with stepwise regression. We report the experimental results and discuss the findings.
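A hedged illustration of the basic trade-off examined in such comparisons: listwise deletion shrinks the training set while imputation keeps every row. Mean imputation and the synthetic data below are illustrative stand-ins for the study's four strategies and its software project data sets:

```python
# Fit the same regression model after listwise deletion vs after imputation
# and compare fit quality on the true, complete data.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.random((200, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + 0.1 * rng.standard_normal(200)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.2] = np.nan   # 20% of cells missing

complete = ~np.isnan(X_miss).any(axis=1)     # listwise deletion keeps these rows
m_del = LinearRegression().fit(X_miss[complete], y[complete])

X_imp = SimpleImputer(strategy="mean").fit_transform(X_miss)
m_imp = LinearRegression().fit(X_imp, y)     # all rows retained

print("rows kept by listwise deletion:", complete.sum(), "of", len(X))
print("R^2 deletion:", m_del.score(X, y), " R^2 imputation:", m_imp.score(X, y))
```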


PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0254720
Author(s):  
Maritza Mera-Gaona ◽  
Ursula Neumann ◽  
Rubiel Vargas-Canas ◽  
Diego M. López

Handling missing values is a crucial step in preprocessing data for machine learning. Most available algorithms for feature selection, classification, or estimation assume complete datasets. Consequently, in many cases, the strategy for dealing with missing values is to use only instances with full data or to replace missing values with the mean, mode, median, or a constant value. Discarding incomplete samples or replacing missing values by such basic techniques usually biases subsequent analyses of the datasets.

Aim: To demonstrate the positive impact of multivariate imputation in the feature selection process on datasets with missing values.

Results: We compared the effects of the feature selection process using complete datasets, incomplete datasets with missingness rates between 5% and 50%, and datasets imputed by basic techniques and by multivariate imputation. The feature selection algorithms used are well-known methods. The results showed that datasets imputed by multivariate imputation yielded the best feature selection results, compared to datasets imputed by basic techniques or non-imputed incomplete datasets.

Conclusions: Considering the results obtained in the evaluation, applying multivariate imputation by chained equations (MICE) reduces bias in the feature selection process.
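A brief sketch of the evaluated pipeline, multivariate imputation followed by feature selection, assuming scikit-learn's IterativeImputer as a MICE-style imputer and univariate selection as the "well-known" selector; the dataset and missingness rate are placeholders:

```python
# Impute with a MICE-style multivariate imputer, then select features.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(5)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.2] = np.nan   # 20% MCAR missingness

X_imp = IterativeImputer(random_state=0).fit_transform(X_miss)
selector = SelectKBest(f_classif, k=10).fit(X_imp, y)
print("selected feature indices:", selector.get_support(indices=True))
```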


2020 ◽  
Author(s):  
James A Sanford ◽  
Yang Wang ◽  
Joshua R Hansen ◽  
Marina A Gritsenko ◽  
Karl K Weitz ◽  
...  

Abstract

Background: Global and phosphoproteome profiling has demonstrated great utility for the analysis of clinical specimens. One major barrier to the broad clinical application of proteomic profiling is the large amount of biological material required, particularly for phosphoproteomics: currently on the order of 25 mg wet tissue weight, depending on tissue type. For hematopoietic cancers such as acute myeloid leukemia (AML), the sample requirement is in excess of 10 million (1E7) peripheral blood mononuclear cells (PBMCs). Over the course of a prospective study, this requirement will certainly exceed what is obtainable from many of the individual patients/timepoints. For this reason, we were interested in examining the impact of differential peptide loading across multiplex channels on proteomic data quality.

Methods: We tested a range of channel loading amounts (20, 40, 100, 200, and 400 μg of tryptic peptides, or approximately the material obtainable from 5E5, 1E6, 2.5E6, 5E6, and 1E7 AML patient cells) to assess proteome coverage, quantification reproducibility, and accuracy in experiments utilizing isobaric tandem mass tag (TMT) labeling.

Results: As expected, fewer missing values are observed in TMT channels with higher peptide loading amounts than in those with lower loading. Moreover, channels with lower loading amounts show greater quantitative variability than channels with higher loading amounts. Statistical analysis of the differences in means among the five loading groups showed that the 20 μg loading group was significantly different from the 400 μg loading group, whereas no significant differences were detected among the 40, 100, 200, and 400 μg loading groups.

Conclusions: These data demonstrate the practical limits of loading differential quantities of peptides across channels in TMT multiplexes and provide a basis for designing the optimal clinical proteomics study when specimen quantities are limited.
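A sketch of the reported statistical comparison (one-way ANOVA across the five loading groups, with a pairwise check of the extremes) on synthetic per-channel values in which the 20 μg group is shifted and noisier; the numbers are illustrative, not the study's measurements:

```python
# One-way ANOVA for a difference in means among the five loading groups,
# followed by a pairwise test of the 20 ug vs 400 ug extremes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
groups = {
    20: rng.normal(0.90, 0.15, 30),   # lowest loading: shifted and noisier
    40: rng.normal(0.98, 0.10, 30),
    100: rng.normal(1.00, 0.08, 30),
    200: rng.normal(1.00, 0.07, 30),
    400: rng.normal(1.00, 0.06, 30),
}

f_stat, p_value = stats.f_oneway(*groups.values())
print(f"one-way ANOVA: F={f_stat:.2f}, p={p_value:.3g}")

t_stat, p_pair = stats.ttest_ind(groups[20], groups[400])
print(f"20 ug vs 400 ug: t={t_stat:.2f}, p={p_pair:.3g}")
```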


2021 ◽  
Author(s):  
Filomena Catapano ◽  
Stephan Buchert ◽  
Enkelejda Qamili ◽  
Thomas Nilsson ◽  
Jerome Bouffard ◽  
...  

Abstract. Swarm is ESA's (European Space Agency) first Earth observation constellation mission, launched in 2013 to study the geomagnetic field and its temporal evolution. Two Langmuir Probes (LPs) on board each of the three Swarm satellites provide very accurate measurements of plasma parameters, which contribute to the study of ionospheric plasma dynamics. To maintain high data quality for scientific and operational applications, the Swarm products are continuously monitored and validated via science-oriented diagnostics. This paper presents an overview of the data quality of the Swarm Langmuir Probe measurements. The data quality is assessed by analysing short and long data segments, the latter selected to be sufficiently long to capture the impact of solar activity. Langmuir Probe data have been validated through comparison with numerical models, other satellite missions, and ground observations. Based on the outcomes of the quality control and validation activities conducted by ESA, as well as scientific analysis and feedback provided by the user community, the Swarm products are regularly upgraded. In this paper we discuss the data quality improvements introduced with the latest baseline and how the data quality is influenced by the solar cycle. The main anomaly affecting the LP measurements is described, as well as possible improvements to be implemented in future baselines.


2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Kamran Mehrabani-Zeinabad ◽  
Marziyeh Doostfatemeh ◽  
Seyyed Mohammad Taghi Ayatollahi

Missing data is one of the most important causes of reduced classification accuracy. Many real datasets suffer from missing values, especially in the medical sciences. Imputation is a common way to deal with incomplete datasets. Various imputation methods can be applied, and the choice of the best method depends on dataset conditions such as sample size, missing percentage, and missing mechanism. A better solution, therefore, is to classify incomplete datasets without imputation and without any loss of information. To this end, the structure of the Bayesian additive regression trees (BART) model is improved with the Missingness Incorporated in Attributes (MIA) approach, which solves its inefficiency in handling missingness. The implementation of MIA within BART is named BART.m. As the abilities of BART.m in classifying incomplete datasets have not been investigated, this simulation-based study aims to provide such a resource. The results indicate that BART.m can be used even for datasets with 90% missing values and, more importantly, that it identifies irrelevant variables and removes them on its own. BART.m outperforms common models for classification with incomplete data in terms of both accuracy and computational time. Based on these properties, BART.m is a high-accuracy model for the classification of incomplete datasets that avoids additional assumptions and preprocessing steps.
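A hedged sketch of the MIA idea, letting the trees route missing values rather than imputing them. scikit-learn's HistGradientBoostingClassifier, which natively learns a per-split direction for NaNs, stands in here for BART.m, which is not reproduced:

```python
# Classify heavily incomplete data with no imputation and no row deletion:
# each tree split learns which branch samples with missing values should take.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(7)
X[rng.random(X.shape) < 0.4] = np.nan        # 40% of cells missing

clf = HistGradientBoostingClassifier(random_state=0)
print("CV accuracy without any imputation:",
      cross_val_score(clf, X, y, cv=5).mean())
```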

