Multiple Imputation of Missing Values: Update

Author(s):  
Patrick Royston

This article describes a substantial update to mvis, which brings it more closely in line with the feature set of S. van Buuren and C. G. M. Oudshoorn's implementation of the MICE system in R and S-PLUS (for details, see http://www.multiple-imputation.com). To make a clear distinction from mvis, the principal program of the new Stata release is called ice. I will give details of how to use the new features and a practical illustrative example using real data. All the facilities of mvis are retained by ice. Some improvements to micombine for computing estimates from multiply imputed datasets are also described.
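ice is a Stata program, so no code accompanies this abstract; as a language-neutral illustration, here is a minimal Python sketch of the same chained-equations idea using scikit-learn's IterativeImputer (the data, column names, and number of imputations are assumptions, not taken from the article):

```python
# A minimal sketch of chained-equations (MICE-style) multiple imputation,
# analogous in spirit to ice; all data and settings here are illustrative.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "age": [23, 45, np.nan, 37, 52, 61],
    "bmi": [21.0, np.nan, 27.5, np.nan, 30.1, 24.8],
    "sbp": [118, 135, 142, np.nan, 150, 128],
})

# Each cycle regresses one incomplete variable on the others; with
# sample_posterior=True each run is a stochastic draw, so repeating the
# fit with different seeds yields m multiply imputed datasets.
m = 5
imputations = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=k).fit_transform(df),
        columns=df.columns,
    )
    for k in range(m)
]
```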

Author(s):  
Patrick Royston ◽  
John B. Carlin ◽  
Ian R. White

We present an update of mim, a program for managing multiply imputed datasets and performing inference (estimating parameters) using Rubin's rules for combining estimates from imputed datasets. The new features of particular importance are an option for estimating the Monte Carlo error (due to the sampling variability of the imputation process) in parameter estimates and in related quantities, and a general routine for combining any scalar estimate across imputations.
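mim itself is a Stata program; a minimal Python sketch of the scalar combining rules it applies (function name and the numbers below are illustrative) might look like this:

```python
# A minimal sketch of Rubin's rules for pooling a scalar estimate across
# m imputed datasets; inputs below are made-up numbers for illustration.
import numpy as np

def rubin_combine(estimates, variances):
    """Pool per-imputation point estimates and their squared standard errors."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    qbar = q.mean()                    # pooled point estimate
    ubar = u.mean()                    # within-imputation variance
    b = q.var(ddof=1)                  # between-imputation variance
    t = ubar + (1 + 1 / m) * b         # total variance (Rubin's rules)
    # A rough gauge of the Monte Carlo error in qbar (an assumption here,
    # not mim's exact routine) is sqrt(b / m).
    return qbar, t, np.sqrt(b / m)

qbar, t, mcse = rubin_combine([0.51, 0.48, 0.55, 0.50, 0.53],
                              [0.010, 0.011, 0.009, 0.010, 0.012])
print(f"pooled {qbar:.3f}, total var {t:.4f}, approx MC error {mcse:.4f}")
```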


2007 ◽  
Vol 37 (1) ◽  
pp. 83-117 ◽  
Author(s):  
Paul T. von Hippel

When fitting a generalized linear model—such as linear regression, logistic regression, or hierarchical linear modeling—analysts often wonder how to handle missing values of the dependent variable Y. If missing values have been filled in using multiple imputation, the usual advice is to use the imputed Y values in analysis. We show, however, that using imputed Ys can add needless noise to the estimates. Better estimates can usually be obtained using a modified strategy that we call multiple imputation, then deletion (MID). Under MID, all cases are used for imputation but, following imputation, cases with imputed Y values are excluded from the analysis. When there is something wrong with the imputed Y values, MID protects the estimates from the problematic imputations. And when the imputed Y values are acceptable, MID usually offers somewhat more efficient estimates than an ordinary MI strategy.
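A minimal Python sketch of the MID strategy (data, column names, and the OLS analysis model are illustrative assumptions, not the article's):

```python
# Multiple imputation, then deletion (MID): impute using all cases
# (including Y), then drop cases whose Y was originally missing before
# fitting the analysis model.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "y":  [1.2, np.nan, 0.7, 2.3, np.nan, 1.9, 1.4, np.nan],
    "x1": [0.5, 0.8, np.nan, 1.1, 0.9, 1.4, np.nan, 1.0],
    "x2": [3.0, 2.1, 2.7, np.nan, 2.9, 3.3, 2.5, 2.8],
})
y_observed = df["y"].notna().to_numpy()

m = 5
fits = []
for k in range(m):
    imp = pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=k).fit_transform(df),
        columns=df.columns,
    )
    analysis = imp[y_observed]                 # MID: drop imputed-Y cases
    X = sm.add_constant(analysis[["x1", "x2"]])
    fits.append(sm.OLS(analysis["y"], X).fit())
# Coefficients across fits would then be pooled with Rubin's rules.
```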


2017 ◽  
Vol 42 (4) ◽  
pp. 432-466 ◽  
Author(s):  
Stephen A. Mistler ◽  
Craig K. Enders

Multiple imputation methods can generally be divided into two broad frameworks: joint model (JM) imputation and fully conditional specification (FCS) imputation. JM draws missing values simultaneously for all incomplete variables using a multivariate distribution, whereas FCS imputes variables one at a time from a series of univariate conditional distributions. For single-level multivariate normal data, the two approaches have been shown to be equivalent, but less is known about their similarities and differences with multilevel data. This study examined four multilevel multiple imputation approaches: the JM approaches proposed by Schafer and Yucel and by Asparouhov and Muthén, and the FCS methods described by van Buuren and by Carpenter and Kenward. Analytic work and computer simulations showed that the Asparouhov and Muthén and the Carpenter and Kenward methods are the most flexible, as they produce imputations that preserve distinct within- and between-cluster covariance structures. As such, these approaches are applicable to random intercept models that posit level-specific relations among variables (e.g., contextual effects analyses, multilevel structural equation models). In contrast, the methods of Schafer and Yucel and of van Buuren are more restrictive and impose implicit equality constraints on functions of the within- and between-cluster covariance matrices. The analytic work and simulations underscore the conclusion that researchers should not expect to obtain the same results from alternative imputation routines. Rather, it is important to choose an imputation method that partitions variation in a manner consistent with the analysis model of interest. A real data analysis example illustrates the various approaches.
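To make the covariance language concrete, here is a minimal sketch in assumed notation (not the article's). For incomplete variables $y_{ij}$ (case $i$ in cluster $j$), a flexible multilevel imputation model uses the decomposition

$$ y_{ij} = \mu + u_j + \varepsilon_{ij}, \qquad u_j \sim N(0, \Sigma_B), \quad \varepsilon_{ij} \sim N(0, \Sigma_W), $$

with the between-cluster covariance matrix $\Sigma_B$ and the within-cluster covariance matrix $\Sigma_W$ estimated as distinct, unstructured matrices. The restrictive approaches described above implicitly constrain functions of $\Sigma_B$ and $\Sigma_W$, which is why they can distort analyses (e.g., contextual effects models) whose estimands depend on separating the two levels.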


2021 ◽  
Author(s):  
Rosa F Ropero ◽  
M Julia Flores ◽  
Rafael Rumí

Environmental data often present missing values or a lack of information that makes modelling tasks difficult. Under the framework of the SAICMA Research Project, a flood risk management system is modelled for an Andalusian Mediterranean catchment using information from the Andalusian Hydrological System. Hourly data were collected from October 2011 to September 2020 and present two issues:

- In the Guadarranque River, the dam level variable has no data from May to August 2020, probably because of sensor damage.
- No information about river level is collected in the lower part of the Guadiaro River, which makes it difficult to estimate flood risk in the coastal area.

To avoid removing the dam variable from the entire model (or dropping the missing months), or even abandoning the modelling of one river system, this abstract provides modelling solutions based on Bayesian networks (BNs) that overcome these limitations.

Guadarranque River: missing values. The dataset contains 75,687 observations of 6 continuous variables. BN regression models based on fixed structures (Naïve Bayes, NB, and Tree Augmented Naïve Bayes, TAN) were learnt from the complete dataset (until September 2019) with the aim of predicting the dam level variable as accurately as possible. A scenario was run with data from October 2019 to March 2020, and the prediction for the target variable was compared with the real data. Results show that both NB (RMSE: 6.29) and TAN (RMSE: 5.74) are able to predict the behaviour of the target variable.

In addition, a BN with an expert-elicited structure was learnt from the real data and from the two datasets with values imputed by NB and TAN. Models learnt from imputed data (NB: 3.33; TAN: 3.07) improve on the error of the model learnt from the real data (4.26).

Guadiaro River: lack of information. The dataset contains 73,636 observations of 14 continuous variables. Since the rainfall variables present a high percentage of zero values (over 94%), they were discretised by the equal-frequency method with 4 intervals. The aim is to predict flood risk in the coastal area, but no data are collected from this area. Thus, an unsupervised classification based on hybrid BNs was performed, in which the target variable classifies all observations into a set of homogeneous groups and gives, for each observation, the probability of belonging to each group. Results show a total of 3 groups:

- Group 0, "Normal situation": rainfall values equal to 0, and a very low mean river level.
- Group 1, "Storm situation": mean rainfall values over 0.3 mm, with all river level variables doubling their group 0 means.
- Group 2, "Extreme situation": both rainfall and river level means take the highest values, far from both previous groups.

Even though validation shows this methodology is able to identify extreme events, further work is needed. Data from the autumn-winter season (October 2020 to March 2021) will be used; including this new information, it will be possible to check whether the latest extreme events (the flooding event during December and the Filomena storm during January) are identified.
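As a concrete illustration of the Guadarranque gap-filling step, here is a minimal Python sketch that trains a model on the complete period and predicts dam level over the gap; the cited work uses Bayesian network regressors (NB, TAN), so the random forest, variable names, and data below are stand-in assumptions:

```python
# A minimal sketch of gap filling: learn dam level from concurrent variables
# on the complete period, then predict it over the sensor gap. Synthetic data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "rainfall": rng.gamma(0.2, 2.0, size=n),
    "river_level": rng.normal(2.0, 0.5, size=n),
})
df["dam_level"] = 10 + 3 * df["river_level"] + rng.normal(0, 0.3, size=n)
df.loc[700:800, "dam_level"] = np.nan          # simulate the sensor gap

train = df[df["dam_level"].notna()]
gap = df[df["dam_level"].isna()]
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(train[["rainfall", "river_level"]], train["dam_level"])
df.loc[gap.index, "dam_level"] = model.predict(gap[["rainfall", "river_level"]])
```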


2017 ◽  
Vol 33 (4) ◽  
pp. 1005-1019 ◽  
Author(s):  
Bronwyn Loong ◽  
Donald B. Rubin

Abstract Several statistical agencies have started to use multiply-imputed synthetic microdata to create public-use data for major surveys. The purpose is to protect the confidentiality of respondents' identities and sensitive attributes while allowing standard complete-data analyses of microdata. A key challenge faced by advocates of synthetic data is demonstrating that valid statistical inferences can be obtained from such synthetic data for non-confidential questions. Large discrepancies between observed-data and synthetic-data analytic results for such questions may arise because of uncongeniality; that is, differences in the types of inputs available to the imputer, who has access to the actual data, and to the analyst, who has access only to the synthetic data. Here, we discuss a simple, but possibly canonical, example of uncongeniality when using multiple imputation to create synthetic data, which specifically addresses the choices made by the imputer. An initial, unanticipated but not surprising, conclusion is that non-confidential design information used to impute synthetic data should be released with the synthetic data, to allow users of synthetic data to avoid possibly grossly conservative inferences.
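For context on what valid inference from synthetic data involves: with $m$ synthetic datasets yielding point estimates $q^{(i)}$ and variance estimates $v^{(i)}$, the combining rules commonly used for fully synthetic data (following Raghunathan, Reiter, and Rubin, 2003; notation assumed here, not the article's) are

$$ \bar{q}_m = \frac{1}{m}\sum_{i=1}^m q^{(i)}, \qquad b_m = \frac{1}{m-1}\sum_{i=1}^m \bigl(q^{(i)} - \bar{q}_m\bigr)^2, \qquad \bar{v}_m = \frac{1}{m}\sum_{i=1}^m v^{(i)}, $$

with variance estimator $T_s = (1 + m^{-1})\,b_m - \bar{v}_m$ for $\bar{q}_m$. Uncongeniality between imputer and analyst is precisely what can render such variance estimates grossly conservative, which is the concern the abstract raises.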


Biometrika ◽  
2016 ◽  
Vol 103 (1) ◽  
pp. 175-187 ◽  
Author(s):  
Jun Shao ◽  
Lei Wang

Abstract To estimate unknown population parameters based on data having nonignorable missing values with a semiparametric exponential tilting propensity, Kim & Yu (2011) assumed that the tilting parameter is known or can be estimated from external data, in order to avoid the identifiability issue. To remove this serious limitation on the methodology, we use an instrument, i.e., a covariate related to the study variable but unrelated to the missing data propensity, to construct some estimating equations. Because these estimating equations are semiparametric, we profile the nonparametric component using a kernel-type estimator and then estimate the tilting parameter based on the profiled estimating equations and the generalized method of moments. Once the tilting parameter is estimated, so is the propensity, and then other population parameters can be estimated using the inverse propensity weighting approach. Consistency and asymptotic normality of the proposed estimators are established. The finite-sample performance of the estimators is studied through simulation, and a real-data example is also presented.
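For orientation, the exponential tilting model can be written as follows (one common form; the notation is assumed here, not copied from the article). With $\delta$ the response indicator,

$$ f(y \mid x, \delta = 0) \;=\; f(y \mid x, \delta = 1)\,\frac{\exp(\gamma y)}{E\{\exp(\gamma Y) \mid x, \delta = 1\}}, $$

where the tilting parameter $\gamma$ indexes the degree of nonignorability ($\gamma = 0$ recovers missingness at random). The article's contribution is to estimate $\gamma$ from the sample itself, via an instrument and the generalized method of moments, rather than assuming it known or estimable only from external data.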


2019 ◽  
Author(s):  
Donna Coffman ◽  
Jiangxiu Zhou ◽  
Xizhen Cai

Abstract
Background: Causal effect estimation with observational data is subject to bias due to confounding, which is often controlled for using propensity scores. One unresolved issue in propensity score estimation is how to handle missing values in covariates.
Method: Several approaches have been proposed for handling covariate missingness, including multiple imputation (MI), multiple imputation with missingness pattern (MIMP), and treatment mean imputation. However, there are other potentially useful approaches that have not been evaluated, including single imputation (SI) + prediction error (PE), SI + PE + parameter uncertainty (PU), and generalized boosted modeling (GBM), which is a nonparametric approach for estimating propensity scores in which missing values are automatically handled in the estimation using a surrogate split method. To evaluate the performance of these approaches, a simulation study was conducted.
Results: Results suggested that SI+PE, SI+PE+PU, MI, and MIMP perform almost equally well and better than treatment mean imputation and GBM in terms of bias; however, MI and MIMP account for the additional uncertainty of imputing the missingness.
Conclusions: Applying GBM to the incomplete data and relying on the surrogate split approach resulted in substantial bias. Imputation prior to implementing GBM is recommended.
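A minimal Python sketch of the MI-then-propensity-score workflow compared above (synthetic data; the IPW mean-difference estimator and all names are illustrative assumptions):

```python
# Impute incomplete covariates m times, estimate a propensity score within
# each imputed dataset, and compute an IPW treatment effect per imputation.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["treat"] = rng.binomial(1, 0.5, size=n)
df["y"] = df["x1"] + 2 * df["treat"] + rng.normal(size=n)
df.loc[rng.random(n) < 0.2, "x2"] = np.nan   # induce covariate missingness

treated = df["treat"] == 1
effects = []
for k in range(5):                           # m = 5 imputations
    cov = pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=k)
        .fit_transform(df[["x1", "x2"]]),
        columns=["x1", "x2"],
    )
    ps = LogisticRegression().fit(cov, df["treat"]).predict_proba(cov)[:, 1]
    w = np.where(treated, 1 / ps, 1 / (1 - ps))          # IPW weights
    ate = (np.average(df.loc[treated, "y"], weights=w[treated])
           - np.average(df.loc[~treated, "y"], weights=w[~treated]))
    effects.append(ate)
# Per-imputation effects would be pooled with Rubin's rules in practice.
```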


Author(s):  
Byron C. Jaeger ◽  
Ryan Cantor ◽  
Venkata Sthanam ◽  
Rongbing Xie ◽  
James K. Kirklin ◽  
...  

Background: Risk prediction models play an important role in clinical decision making. When developing risk prediction models, practitioners often impute missing values to the mean. We evaluated the impact of applying other strategies to impute missing values on the prognostic accuracy of downstream risk prediction models, that is, models fitted to the imputed data. A secondary objective was to compare the accuracy of imputation methods based on artificially induced missing values. To complete these objectives, we used data from the Interagency Registry for Mechanically Assisted Circulatory Support. Methods: We applied 12 imputation strategies in combination with 2 different modeling strategies for mortality and transplant risk prediction following surgery to receive mechanical circulatory support. Model performance was evaluated using Monte-Carlo cross-validation and measured based on outcomes 6 months following surgery using the scaled Brier score, concordance index, and calibration error. We used Bayesian hierarchical models to compare model performance. Results: Multiple imputation with random forests emerged as a robust strategy to impute missing values, increasing model concordance by 0.0030 (25th–75th percentile: 0.0008–0.0052) compared with imputation to the mean for mortality risk prediction using a downstream proportional hazards model. The posterior probability that single and multiple imputation using random forests would improve concordance versus mean imputation was 0.464 and >0.999, respectively. Conclusions: Selecting an optimal strategy to impute missing values such as random forests and applying multiple imputation can improve the prognostic accuracy of downstream risk prediction models.
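A minimal Python sketch contrasting mean imputation with random forest imputation ahead of a downstream model (scikit-learn stands in for the authors' software; the variables and data are assumptions):

```python
# Mean imputation versus random-forest-based imputation before model fitting.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

X = pd.DataFrame({
    "age": [54, 61, np.nan, 70, 47, 66],
    "creatinine": [1.1, np.nan, 1.8, 2.0, np.nan, 1.4],
    "bilirubin": [0.8, 1.2, np.nan, 1.5, 0.9, 1.1],
})

# Baseline: impute each column to its mean.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Random forest imputation: each incomplete column is modelled from the
# others; varying the forest's seed across runs is one simple way to obtain
# multiple imputations.
rf_filled = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    random_state=0,
).fit_transform(X)
# A downstream risk model (e.g., a proportional hazards model) would then be
# fit to each filled dataset and compared on held-out concordance.
```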


Author(s):  
Thelma Dede Baddoo ◽  
Zhijia Li ◽  
Samuel Nii Odai ◽  
Kenneth Rodolphe Chabi Boni ◽  
Isaac Kwesi Nooni ◽  
...  

Reconstructing missing streamflow data can be challenging when additional data are not available, and studies that impute missing values in real-world datasets and investigate how to ascertain the accuracy of the imputation algorithms on such datasets are lacking. This study investigated the necessary complexity of missing data reconstruction schemes for a real-world single-station streamflow record, with the aim of obtaining results that facilitate its further use. The investigation applied a range of reconstruction schemes, from univariate algorithms to multiple imputation methods suited to multivariate data that take time as an explicit variable. The accuracy of these schemes was assessed using the total error measurement (TEM) and a localized error measurement (LEM) recommended in this study. The results show that univariate missing value algorithms, which are specially developed to handle univariate time series, provide satisfactory results, but the ones that provide the best results are usually time- and computationally intensive. Also, multiple imputation algorithms that consider the surrounding observed values and/or can learn the characteristics of the data provide results similar to those of the univariate algorithms and, in some cases, perform better without the added time and computational downsides when time is taken as an explicit variable. Furthermore, the LEM is especially useful when the missing data lie in specific portions of the dataset or where very large gaps of 'missingness' occur. Finally, proper handling of missing values in real-world hydroclimatic datasets depends on extensive study of the particular dataset to be imputed.
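A minimal Python sketch of the univariate-versus-time-aware contrast on a single-station series (synthetic data; the hour-of-day seasonal fill is one simple stand-in for the time-aware methods studied):

```python
# Linear-in-time interpolation versus an hour-of-day seasonal fill for a gap
# in a single-station streamflow series; data are synthetic.
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=240, freq="h")
flow = pd.Series(50 + 10 * np.sin(np.arange(240) / 24 * 2 * np.pi), index=idx)
flow.iloc[100:130] = np.nan                      # a 30-hour gap

linear = flow.interpolate(method="time")         # univariate, time-indexed
# Seasonal fill: replace each missing hour with the mean of that hour of day,
# one simple way to let the imputation reflect the series' daily structure.
hourly_means = flow.groupby(flow.index.hour).transform("mean")
seasonal = flow.fillna(hourly_means)
```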


Author(s):  
Miroslav Hudec ◽  
Miljan Vučetić ◽  
Mirko Vujošević

Data mining methods based on fuzzy logic have been developed recently and have become an increasingly important research area. In this chapter, the authors examine possibilities for discovering potentially useful knowledge from relational databases by integrating fuzzy functional dependencies and linguistic summaries. Both methods use fuzzy logic tools for data analysis and for acquiring and representing expert knowledge. Fuzzy functional dependencies can detect whether a dependency between two examined attributes exists across the whole database. If a dependency exists only between parts of the examined attributes' domains, fuzzy functional dependencies cannot detect its character. Linguistic summaries are a convenient method for revealing this kind of dependency. Using fuzzy functional dependencies and linguistic summaries in a complementary way can mine valuable information from relational databases. Mining the intensities of dependencies between database attributes can support decision making, reduce the number of attributes in databases, and estimate missing values. The proposed approach is evaluated in case studies using real data from official statistics. Strengths and weaknesses of the described methods are discussed. At the end of the chapter, topics for further research activities are outlined.
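A minimal Python sketch of evaluating one linguistic summary ("most districts have high unemployment") via the classical truth-degree calculation; the membership functions and data are illustrative assumptions, not the chapter's:

```python
# Truth degree of a linguistic summary: truth = mu_Q(mean(mu_S(x))), where
# mu_S is the summarizer ('high') and mu_Q the fuzzy quantifier ('most').
import numpy as np

def mu_high(x, low=5.0, high=15.0):
    """Fuzzy membership of 'high' as a linear ramp between low and high."""
    return np.clip((x - low) / (high - low), 0.0, 1.0)

def mu_most(p):
    """Fuzzy quantifier 'most' as a ramp over the proportion p in [0.3, 0.8]."""
    return np.clip((p - 0.3) / 0.5, 0.0, 1.0)

unemployment = np.array([4.2, 9.5, 12.1, 16.3, 7.8, 14.0])
truth = mu_most(mu_high(unemployment).mean())
print(f"truth degree of the summary: {truth:.2f}")
```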

