A Multiple Imputation Approach for Handling Missing Data in Classification and Regression Trees

Danielle M. Rodgers; Ross Jacobucci; Kevin J. Grimm

doi:10.35566/jbds/v1n1/p6

A Multiple Imputation Approach for Handling Missing Data in Classification and Regression Trees

Journal of Behavioral Data Science ◽

10.35566/jbds/v1n1/p6 ◽

2021 ◽

Vol 1 (1) ◽

Author(s):

Danielle M. Rodgers ◽

Ross Jacobucci ◽

Kevin J. Grimm

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Missing Values ◽

Selection Process ◽

Superior Performance ◽

Machine Learning Technique ◽

Listwise Deletion ◽

Learning Technique ◽

Missing Data Techniques ◽

Imputation Approach

Decision trees (DTs) are a machine learning technique that searches the predictor space for the variable and observed value that lead to the best prediction when the data are split into two nodes based on that variable and splitting value. The algorithm repeats its search within each partition of the data until a stopping rule ends the search. Missing data can be problematic in DTs because an observation with a missing value on the chosen splitting variable cannot be placed into a node. Moreover, missing data can alter the selection process itself, because observations with missing values cannot be placed when candidate splits are evaluated. Simple missing data approaches (e.g., listwise deletion, majority rule, and surrogate split) have been implemented in DT algorithms; however, more sophisticated missing data techniques have not been thoroughly examined. We propose a modified multiple imputation approach to handling missing data in DTs, and compare this approach with simple missing data approaches as well as single imputation and multiple imputation with prediction averaging via Monte Carlo simulation. This study evaluated the performance of each missing data approach when data were MAR or MCAR. The proposed multiple imputation approach and surrogate splits had superior performance, with the proposed multiple imputation approach performing best in the more severe missing data conditions. We conclude with recommendations for handling missing data in DTs.
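The split search described in this abstract can be sketched as follows. This is a minimal illustration for a regression tree, not the authors' implementation: the names `sse` and `best_split`, the use of `None` as the missing-value marker, and the choice to simply leave out rows that are missing on the candidate variable are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

Row = List[Optional[float]]  # None marks a missing predictor value (assumed encoding)


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X: List[Row], y: List[float]) -> Tuple[int, float, float]:
    """Return (variable index, split value, SSE) of the best binary split.

    Rows with a missing value on the candidate variable are left out while
    that variable is evaluated, because they cannot be sent to either node.
    """
    best = (-1, float("nan"), float("inf"))
    for j in range(len(X[0])):
        # Keep only the rows where variable j is observed.
        observed = [(row[j], t) for row, t in zip(X, y) if row[j] is not None]
        for cut in sorted({x for x, _ in observed}):
            left = [t for x, t in observed if x <= cut]
            right = [t for x, t in observed if x > cut]
            if not left or not right:
                continue  # a split must produce two non-empty nodes
            score = sse(left) + sse(right)
            if score < best[2]:
                best = (j, cut, score)
    return best
```

Because each variable is scored only on the rows where it is observed, variables with many missing values are evaluated on fewer cases. That is one plausible way in which missingness can distort which variable gets selected.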


Proposing a missing data method for hospitality research on online customer reviews

International Journal of Contemporary Hospitality Management ◽

10.1108/ijchm-10-2017-0708 ◽

2018 ◽

Vol 30 (11) ◽

pp. 3250-3267

Author(s):

Jewoo Kim ◽

Jongho Im

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Missing Values ◽

Simulation Analysis ◽

Imputation Method ◽

The Other ◽

Online Review ◽

Online Data ◽

Content Type ◽

Missing Data Techniques

Purpose: The purpose of this paper is to introduce a new multiple imputation method that can effectively manage missing values in online review data, thereby allowing the online review analysis to yield valid results by using all available data. Design/methodology/approach: This study develops a missing data method based on the multivariate imputation chained equation to generate imputed values for online reviews. Sentiment analysis is used to incorporate customers’ textual opinions as the auxiliary information in the imputation procedures. To check the validity of the proposed imputation method, the authors apply this method to missing values of sub-ratings on hotel attributes in both the simulated and real Honolulu hotel review data sets. The estimation results are compared to those of different missing data techniques, namely, listwise deletion and conventional multiple imputation which does not consider text reviews. Findings: The findings from the simulation analysis show that the imputation method of the authors produces more efficient and less biased estimates compared to the other two missing data techniques when text reviews are possibly associated with the rating scores and response mechanism. When applying the imputation method to the real hotel review data, the findings show that the text sentiment-based propensity score can effectively explain the missingness of sub-ratings on hotel attributes, and the imputation method considering those propensity scores has better estimation results than the other techniques, as in the simulation analysis. Originality/value: This study extends multiple imputation to online data considering its spontaneous and unstructured nature. This new method helps make fuller use of the observed online data while avoiding potential missing data problems.


A data-driven missing value imputation approach for longitudinal datasets

Artificial Intelligence Review ◽

10.1007/s10462-021-09963-5 ◽

2021 ◽

Author(s):

Caio Ribeiro ◽

Alex A. Freitas

Keyword(s):

Missing Data ◽

Longitudinal Data ◽

Missing Values ◽

Error Rates ◽

Imputation Method ◽

Data Driven ◽

Missing Value ◽

Missing Value Imputation ◽

Human Ageing ◽

Imputation Approach

Longitudinal datasets of human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimations. However, there are many methods to estimate missing values, and no single method is the best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected, based on their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicabilities and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers generated with datasets prepared using each imputation method and a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on our results from both sets of experiments, we concluded that the proposed data-driven missing value imputation approach generally resulted in models with more accurate estimations for missing data and better performing classifiers, in longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data had very accurate estimations. This reinforces the idea that using the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that can be achieved through the proposed data-driven approach.
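The idea of ranking imputation methods per feature by their error on known values can be sketched roughly as below. This is only an illustration under stated assumptions: the abstract does not name its five methods or its error measure, so the two candidates (`mean`, `median`), the leave-one-out masking scheme, the mean absolute error, and the names `CANDIDATES` and `pick_imputer` are all hypothetical stand-ins.

```python
from statistics import mean, median
from typing import Callable, Dict, List, Optional

Imputer = Callable[[List[float]], float]

# Hypothetical candidate methods; the paper's own five methods are not listed here.
CANDIDATES: Dict[str, Imputer] = {"mean": mean, "median": median}


def pick_imputer(column: List[Optional[float]]) -> str:
    """Pick the candidate with the lowest leave-one-out error on known values.

    Each known value is hidden in turn, estimated from the remaining known
    values, and compared with the truth (mean absolute error, an assumption).
    """
    known = [v for v in column if v is not None]
    if len(known) < 2:
        raise ValueError("need at least two known values to rank methods")
    errors: Dict[str, float] = {}
    for name, impute in CANDIDATES.items():
        total = 0.0
        for k, hidden in enumerate(known):
            rest = known[:k] + known[k + 1:]
            total += abs(impute(rest) - hidden)
        errors[name] = total / len(known)
    # Ties are broken by insertion order of CANDIDATES.
    return min(errors, key=lambda name: errors[name])
```

Calling `pick_imputer` once per feature gives a feature-wise choice: a column with an extreme value, such as `[1.0, 2.0, 3.0, None, 100.0]`, would select the median, while a different column might select the mean.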


Comparison of Missing Data Infilling Mechanisms for Recovering a Real-World Single Station Streamflow Observation

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18168375 ◽

2021 ◽

Vol 18 (16) ◽

pp. 8375

Author(s):

Thelma Dede Baddoo ◽

Zhijia Li ◽

Samuel Nii Odai ◽

Kenneth Rodolphe Chabi Boni ◽

Isaac Kwesi Nooni ◽

...

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Real World ◽

Missing Values ◽

Total Error ◽

Extensive Study ◽

Error Measurement ◽

Missing Data Imputation ◽

Single Station ◽

Real World Datasets

Reconstructing missing streamflow data can be challenging when additional data are not available, and studies that impute real-world datasets in order to investigate how to ascertain the accuracy of imputation algorithms for such datasets are lacking. This study investigated the necessary complexity of missing data reconstruction schemes to obtain relevant results for a real-world single station streamflow observation to facilitate its further use. This investigation was implemented by applying different missing data infilling methods, spanning from univariate algorithms to multiple imputation methods designed for multivariate data, taking time as an explicit variable. The performance accuracy of these schemes was assessed using the total error measurement (TEM) and a localized error measurement (LEM) recommended in this study. The results show that univariate missing value algorithms, which are specially developed to handle univariate time series, provide satisfactory results, but the ones which provide the best results are usually time- and computationally intensive. Also, multiple imputation algorithms which consider the surrounding observed values and/or which can understand the characteristics of the data provide similar results to the univariate missing data algorithms and, in some cases, perform better without the added time and computational downsides when time is taken as an explicit variable. Furthermore, the LEM would be especially useful when the missing data are in specific portions of the dataset or where very large gaps of ‘missingness’ occur. Finally, proper handling of missing values of real-world hydroclimatic datasets depends on imputing and extensive study of the particular dataset to be imputed.


Best Practices for Addressing Missing Data through Multiple Imputation

10.31234/osf.io/uaezh ◽

2021 ◽

Author(s):

Adrienne D. Woods ◽

Pamela Davis-Kean ◽

Max Andrew Halvorson ◽

Kevin Michael King ◽

Jessica A. R. Logan ◽

...

Keyword(s):

Missing Data ◽

Best Practices ◽

Multiple Imputation ◽

Statistical Techniques ◽

Parameter Estimates ◽

Developmental Research ◽

Imputation Methods ◽

Listwise Deletion ◽

Highly Effective ◽

Practical Guidelines

A common challenge in developmental research is the amount of incomplete and missing data that occurs from respondents failing to complete tasks or questionnaires, as well as from disengaging from the study (i.e., attrition). This missingness can lead to biases in parameter estimates and, hence, in the interpretation of findings. These biases can be addressed through statistical techniques that adjust for missing data, such as multiple imputation. Although this technique is highly effective, it has not been widely adopted by developmental scientists because of barriers such as lack of training or misconceptions about imputation methods; many instead rely on software defaults such as listwise deletion. This manuscript is intended to provide practical guidelines for developmental researchers to follow when examining their data for missingness, making decisions about how to handle that missingness, and reporting the extent of missing data biases and specific multiple imputation procedures in publications.


Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices

10.1101/744789 ◽

2019 ◽

Author(s):

Ananya Bhattacharjee ◽

Md. Shamsuzzoha Bayzid

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Missing Data ◽

Phylogenetic Trees ◽

Large Scale ◽

Missing Values ◽

Gene Tree ◽

Estimation Methods ◽

Learning Technique ◽

Distance Matrices

Background: Due to the recent advances in sequencing technologies and species tree estimation methods capable of taking gene tree discordance into account, notable progress has been achieved in constructing large scale phylogenetic trees from genome wide data. However, substantial challenges remain in leveraging this huge amount of molecular data. One of the foremost among these challenges is the need for efficient tools that can handle missing data. Popular distance-based methods such as neighbor joining and UPGMA require that the input distance matrix does not contain any missing values. Results: We introduce two highly accurate machine learning based distance imputation techniques. One of our approaches is based on matrix factorization, and the other one is an autoencoder based deep learning technique. We evaluate these two techniques on a collection of simulated and biological datasets, and show that our techniques match or improve upon the best alternate techniques for distance imputation. Moreover, our proposed techniques can handle substantial amounts of missing data, to the extent where the best alternate methods fail. Conclusions: This study shows for the first time the power and feasibility of applying deep learning techniques for imputing distance matrices. The autoencoder based deep learning technique is highly accurate and scalable to large datasets. We have made these techniques freely available as cross-platform software (available at https://github.com/Ananya-Bhattacharjee/ImputeDistances).


Multiple imputation in big identifiable data for educational research: An example from the Brazilian education assessment system

Ensaio Avaliação e Políticas Públicas em Educação ◽

10.1590/s0104-40362020002802346 ◽

2020 ◽

Vol 28 (108) ◽

pp. 599-621

Author(s):

Maria Eugénia Ferrão ◽

Paula Prata ◽

Maria Teresa Gonzaga Alves

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Educational Research ◽

Missing Values ◽

Assessment System ◽

Policy And Practice ◽

Real World Data ◽

Missing Completely At Random ◽

Almost All ◽

Identifiable Data

Almost all quantitative studies in educational assessment, evaluation and educational research are based on incomplete data sets, which have been a problem for years without a single solution. The use of big identifiable data poses new challenges in dealing with missing values. In the first part of this paper, we present the state of the art of the topic in the Brazilian education scientific literature, and how researchers have dealt with missing data since the turn of the century. Next, we use open access software to analyze real-world data, the 2017 Prova Brasil, for several federation units to document how the naïve assumption of missing completely at random may substantially affect statistical conclusions, researcher interpretations, and subsequent implications for policy and practice. We conclude with straightforward suggestions for any education researcher on applying R routines to conduct the hypothesis test of missing completely at random and, if the null hypothesis is rejected, how to implement multiple imputation, which appears to be one of the most appropriate methods for handling missing data.


Comparison of Selected Multiple Imputation Methods for Continuous Variables – Preliminary Simulation Study Results

Acta Universitatis Lodziensis Folia oeconomica ◽

10.18778/0208-6018.339.05 ◽

2019 ◽

Vol 6 (339) ◽

pp. 73-98

Author(s):

Małgorzata Aleksandra Misztal

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Missing Values ◽

Imputation Accuracy ◽

Imputation Method ◽

Data Sets ◽

Continuous Variables ◽

Imputation Methods ◽

Study Results ◽

Almost All

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not related to any particular scientific domain; it arises in economics, sociology, education, behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete the incomplete cases. However, this leads to inefficient and biased analysis results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. In the paper, some selected multiple imputation methods were taken into account. Special attention was paid to using principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA‑based imputations as compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Then, missing values were imputed with the use of MICE, missForest and the PCA‑based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as a multiple imputation method providing the lowest rates of imputation errors for all types of missingness. PCA‑based imputation does not perform well in terms of accuracy.
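As a rough illustration of the accuracy measure named in this abstract, the NRMSE for the artificially deleted entries of one continuous variable can be computed as below. The abstract does not state its normalisation, so this sketch assumes one common convention: the mean squared error divided by the variance of the true values. The function name `nrmse` and the use of the population variance (so that imputing the mean of the true values scores exactly 1) are also assumptions.

```python
import math
from statistics import pvariance
from typing import Sequence


def nrmse(true_values: Sequence[float], imputed_values: Sequence[float]) -> float:
    """Normalised RMSE over the entries that were artificially set to missing.

    Assumed normalisation: sqrt(MSE / population variance of the true values).
    0.0 means perfect recovery; 1.0 matches imputing the mean of the true values.
    """
    if len(true_values) != len(imputed_values):
        raise ValueError("true and imputed values must have the same length")
    if len(true_values) < 2:
        raise ValueError("need at least two values to compute a variance")
    mse = sum((t - i) ** 2 for t, i in zip(true_values, imputed_values)) / len(true_values)
    variance = pvariance(true_values)
    if variance == 0:
        raise ValueError("true values are constant; NRMSE is undefined")
    return math.sqrt(mse / variance)
```

Under this convention, lower values indicate more accurate imputations, and values above 1 indicate a method doing worse than filling in the mean of the deleted values.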


Constructing bootstrap confidence intervals for principal component loadings in the presence of missing data: A multiple-imputation approach

British Journal of Mathematical and Statistical Psychology ◽

10.1111/j.2044-8317.2010.02006.x ◽

2010 ◽

Vol 64 (3) ◽

pp. 498-515 ◽

Cited By ~ 9

Author(s):

Joost R. van Ginkel ◽

Henk A. L. Kiers

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Confidence Intervals ◽

Principal Component ◽

Bootstrap Confidence Intervals ◽

Component Loadings ◽

Imputation Approach


Effectiveness of a Hybrid Deep Learning Model Integrated with a Hybrid Parameterisation Model in Decision-Making Analysis

Knowledge Innovation Through Intelligent Software Methodologies, Tools and Techniques - Frontiers in Artificial Intelligence and Applications ◽

10.3233/faia200551 ◽

2020 ◽

Author(s):

Masurah Mohamad ◽

Ali Selamat

Keyword(s):

Neural Network ◽

Machine Learning ◽

Decision Making ◽

Deep Learning ◽

Missing Data ◽

Machine Learning Technique ◽

Processing Times ◽

Learning Technique ◽

Neural Network Algorithm ◽

Deep Learning Model

Deep learning has recently gained the attention of many researchers in various fields. A new and emerging machine learning technique, it is derived from a neural network algorithm capable of analysing unstructured datasets without supervision. This study compared the effectiveness of the deep learning (DL) model vs. a hybrid deep learning (HDL) model integrated with a hybrid parameterisation model in handling complex and missing medical datasets as well as their performance in increasing classification. The results showed that 1) the DL model performed better on its own, 2) DL was able to analyse complex medical datasets even with missing data values, and 3) HDL also performed well and had faster processing times since it was integrated with a hybrid parameterisation model.


Improvement of random forest by multiple imputation applied to tower crane accident prediction with missing data

Engineering Construction & Architectural Management ◽

10.1108/ecam-07-2021-0606 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Ling Jiang ◽

Tingsheng Zhao ◽

Chuxuan Feng ◽

Wei Zhang

Keyword(s):

Missing Data ◽

Random Forest ◽

Prediction Model ◽

Multiple Imputation ◽

Incomplete Data ◽

Missing Values ◽

Critical Factors ◽

Content Type ◽

Accident Prediction ◽

Tower Crane

Purpose: This research is aimed at predicting tower crane accident phases with incomplete data. Design/methodology/approach: The tower crane accidents are collected for prediction model training. Random forest (RF) is used to conduct prediction. When there are missing values in the new inputs, they should be filled in advance. Nevertheless, it is difficult to collect complete data on construction sites. Thus, the authors use the multiple imputation (MI) method to improve RF. Finally, the prediction model is applied to a case study. Findings: The results show that multiple imputation RF (MIRF) can effectively predict tower crane accidents when the data are incomplete. This research provides the importance rank of tower crane safety factors. The critical factors should be the focus on site, because missing data seriously affect the prediction results. Also, the values of critical factors influence the safety of tower cranes. Practical implications: This research promotes the application of machine learning methods for accident prediction in actual projects. According to the onsite data, the authors can predict the accident phase of a tower crane. The results can be used for tower crane accident prevention. Originality/value: Previous studies have seldom predicted tower crane accidents, especially the phase of the accident. This research uses tower crane data collected on site to predict the phase of the tower crane accident. The incomplete data collection is considered in this research according to the actual situation.
