Multiple Imputation of Missing Values: New Features for mim

Author(s):  
Patrick Royston ◽  
John B. Carlin ◽  
Ian R. White

We present an update of mim, a program for managing multiply imputed datasets and performing inference (estimating parameters) using Rubin's rules for combining estimates from imputed datasets. The new features of particular importance are an option for estimating the Monte Carlo error (due to the sampling variability of the imputation process) in parameter estimates and in related quantities, and a general routine for combining any scalar estimate across imputations.
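As a rough illustration of the combination step described above, here is a minimal Python sketch of Rubin's rules for a scalar estimate, with the Monte Carlo error of the pooled estimate approximated as sqrt(B/m). This is a generic, language-agnostic sketch for orientation, not the Stata mim implementation.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine per-imputation scalar estimates via Rubin's rules.

    estimates, variances: length-m sequences of point estimates and their
    squared standard errors, one pair per imputed dataset.
    """
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    m = est.size
    qbar = est.mean()                    # pooled point estimate
    w = var.mean()                       # within-imputation variance
    b = est.var(ddof=1)                  # between-imputation variance
    total = w + (1 + 1 / m) * b          # total variance (Rubin's rules)
    # Approximate Monte Carlo error of the pooled estimate caused by
    # using a finite number of imputations (roughly sqrt(B/m)).
    mc_error = np.sqrt(b / m)
    return qbar, np.sqrt(total), mc_error

# toy usage: five imputations of a single regression coefficient
qbar, se, mce = pool_rubin([0.42, 0.45, 0.40, 0.43, 0.44],
                           [0.010, 0.011, 0.009, 0.010, 0.012])
```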

Author(s):  
Patrick Royston

This article describes a substantial update to mvis, which brings it more closely into line with the feature set of S. van Buuren and C. G. M. Oudshoorn's implementation of the MICE system in R and S-PLUS (for details, see http://www.multiple-imputation.com). To make a clear distinction from mvis, the principal program of the new Stata release is called ice. I give details of how to use the new features and present a practical illustrative example using real data. All the facilities of mvis are retained by ice. Some improvements to micombine for computing estimates from multiply imputed datasets are also described.
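The chained-equations idea behind ice and MICE can be sketched as follows; this is a deliberately simplified Python illustration for continuous variables (linear imputation models plus residual noise), not the ice code itself.

```python
import numpy as np

def chained_imputation(X, n_iter=10, rng=None):
    """Tiny sketch of imputation by chained equations for a numeric
    matrix X in which np.nan marks missing values.

    Each variable with missing values is regressed on all the others
    using the currently completed data, and its missing entries are
    replaced by predictions plus random residual noise.
    """
    rng = np.random.default_rng(rng)
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    # start from column-mean imputation
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid = X[obs, j] - A[obs] @ beta
            sigma = resid.std()
            pred = A[~obs] @ beta
            X[~obs, j] = pred + rng.normal(0.0, sigma, size=pred.shape)
    return X
```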


2007 ◽  
Vol 37 (1) ◽  
pp. 83-117 ◽  
Author(s):  
Paul T. von Hippel

When fitting a generalized linear model—such as linear regression, logistic regression, or hierarchical linear modeling—analysts often wonder how to handle missing values of the dependent variable Y. If missing values have been filled in using multiple imputation, the usual advice is to use the imputed Y values in analysis. We show, however, that using imputed Ys can add needless noise to the estimates. Better estimates can usually be obtained using a modified strategy that we call multiple imputation, then deletion (MID). Under MID, all cases are used for imputation but, following imputation, cases with imputed Y values are excluded from the analysis. When there is something wrong with the imputed Y values, MID protects the estimates from the problematic imputations. And when the imputed Y values are acceptable, MID usually offers somewhat more efficient estimates than an ordinary MI strategy.
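A minimal sketch of the MID workflow is given below, assuming hypothetical `impute_once` and `fit_model` routines supplied by the analyst; it shows only where the deletion step sits, and is not the authors' procedure.

```python
import numpy as np
import pandas as pd

def mi_then_delete(df, y_col, impute_once, fit_model, m=20):
    """Sketch of multiple imputation, then deletion (MID).

    Every case (including those with missing Y) is used for imputation,
    but cases whose Y was originally missing are dropped before the
    analysis model is fit. `impute_once` and `fit_model` are
    placeholders for the user's imputation and analysis routines.
    """
    y_observed = df[y_col].notna()           # remember which Ys were real
    estimates = []
    for _ in range(m):
        completed = impute_once(df)          # impute using every case...
        analysed = completed[y_observed]     # ...then delete imputed-Y cases
        estimates.append(fit_model(analysed))
    return np.mean(estimates, axis=0)        # pool the point estimates
```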


Methodology ◽  
2021 ◽  
Vol 17 (3) ◽  
pp. 205-230
Author(s):  
Kristian Kleinke ◽  
Markus Fritsch ◽  
Mark Stemmler ◽  
Jost Reinecke ◽  
Friedrich Lösel

Quantile regression (QR) is a valuable tool for data analysis and multiple imputation (MI) of missing values – especially when standard parametric modelling assumptions are violated. Yet, Monte Carlo simulations that systematically evaluate QR-based MI in a variety of different practically relevant settings are still scarce. In this paper, we evaluate the method regarding the imputation of ordinal data and compare the results with other standard and robust imputation methods. We then apply QR-based MI to an empirical dataset, where we seek to identify risk factors for corporal punishment of children by their fathers. We compare the modelling results with previously published findings based on complete cases. Our Monte Carlo results highlight the advantages of QR-based MI over fully parametric imputation models: QR-based MI yields unbiased statistical inferences across large parts of the conditional distribution, when parametric modelling assumptions, such as normal and homoscedastic error terms, are violated. Regarding risk factors for corporal punishment, our MI results support previously published findings based on complete cases. Our empirical results indicate that the identified “missing at random” processes in the investigated dataset are negligible.
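One cycle of quantile-regression-based imputation might be sketched as below, using statsmodels' QuantReg for a continuous incomplete variable; the authors' method for ordinal data differs in detail, so treat this only as an illustrative approximation.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg

def qr_impute_once(y, X, rng=None):
    """One imputation pass: for each missing y, draw a random quantile
    level u, fit a quantile regression of y on X at that level using the
    observed cases, and take the fitted value as the imputation."""
    rng = np.random.default_rng(rng)
    y = np.array(y, dtype=float)
    X = sm.add_constant(np.asarray(X, dtype=float))
    miss = np.isnan(y)
    for i in np.flatnonzero(miss):
        u = rng.uniform(0.05, 0.95)                  # avoid extreme tails
        fit = QuantReg(y[~miss], X[~miss]).fit(q=u)
        y[i] = float(X[i] @ fit.params)
    return y
```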


Methodology ◽  
2016 ◽  
Vol 12 (3) ◽  
pp. 75-88 ◽  
Author(s):  
Simon Grund ◽  
Oliver Lüdtke ◽  
Alexander Robitzsch

The analysis of variance (ANOVA) is frequently used to examine whether a number of groups differ on a variable of interest. The global hypothesis test of the ANOVA can be reformulated as a regression model in which all group differences are simultaneously tested against zero. Multiple imputation offers reliable and effective treatment of missing data; however, recommendations differ with regard to what procedures are suitable for pooling ANOVA results from multiply imputed datasets. In this article, we compared several procedures (known as D1, D2, and D3) using Monte Carlo simulations. Even though previous recommendations have advocated that D2 should be avoided in favor of D1 or D3, our results suggest that all procedures provide a suitable test of the ANOVA’s global null hypothesis in many plausible research scenarios. In more extreme settings, D1 was most reliable, whereas D2 and D3 suffered from different limitations. We provide guidelines on how the different methods can be applied in one- and two-factorial ANOVA designs and information about the conditions under which some procedures may perform better than others. Computer code is supplied for each method to be used in freely available statistical software.
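For orientation, a sketch of a D1-type pooled Wald test is given below, following the formulas usually attributed to Li, Raghunathan, and Rubin (1991); the denominator degrees of freedom use a commonly cited approximation, so check the result against a reference implementation before relying on it.

```python
import numpy as np
from scipy import stats

def pool_D1(est_list, cov_list):
    """D1-type pooled Wald test that all k parameters are zero.

    est_list : one length-k coefficient vector per imputed dataset.
    cov_list : the matching k x k within-imputation covariance matrices.
    """
    Q = np.asarray(est_list, dtype=float)               # shape (m, k)
    m, k = Q.shape
    qbar = Q.mean(axis=0)                                # pooled estimates
    Ubar = np.mean(np.asarray(cov_list, dtype=float), axis=0)
    B = np.atleast_2d(np.cov(Q, rowvar=False, ddof=1))   # between-imputation cov
    # average relative increase in variance due to missing data
    r = (1 + 1 / m) * np.trace(B @ np.linalg.inv(Ubar)) / k
    D1 = float(qbar @ np.linalg.inv(Ubar) @ qbar) / (k * (1 + r))
    # denominator degrees of freedom: a commonly used approximation
    t = k * (m - 1)
    nu = (4 + (t - 4) * (1 + (1 - 2 / t) / r) ** 2 if t > 4
          else t * (1 + 1 / k) * (1 + 1 / r) ** 2 / 2)
    return D1, stats.f.sf(D1, k, nu)
```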


2018 ◽  
Vol 28 (5) ◽  
pp. 1311-1327 ◽  
Author(s):  
Faisal M Zahid ◽  
Christian Heumann

Missing data is a common issue that can cause problems in estimation and inference in biomedical, epidemiological and social research. Multiple imputation is an increasingly popular approach for handling missing data. When a large number of covariates have missing values, existing multiple imputation software packages may not work properly and often produce errors. We propose a multiple imputation algorithm called mispr based on sequential penalized regression models. Each variable with missing values is assumed to have a different distributional form and is imputed with its own imputation model using the ridge penalty. When the number of predictors is large relative to the sample size, the quadratic penalty guarantees unique parameter estimates and leads to better predictions than the usual maximum likelihood estimation (MLE), with a good compromise between bias and variance. As a result, the proposed algorithm performs well and provides good imputed values even when many covariates have missing data and samples are small. The results are compared with the existing software packages mice, VIM and Amelia in simulation studies; the missing-at-random mechanism was the main assumption in the simulation study. The imputation performance of the proposed algorithm is evaluated with the mean squared imputation error and the mean absolute imputation error. The mean squared error (MSE), parameter estimates with their standard errors, and confidence intervals are also computed to compare performance in the regression context. The proposed algorithm is observed to be a good competitor to the existing algorithms, with smaller mean squared imputation error, mean absolute imputation error and mean squared error. Its performance becomes considerably better than that of the existing algorithms as the number of covariates increases, especially when the number of predictors approaches or exceeds the sample size. Two real-life datasets are also used to examine the performance of the proposed algorithm using simulations.
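The quadratic (ridge) penalty at the heart of this approach can be illustrated with a single imputation step for one continuous variable. The toy sketch below is not the mispr package, which fits a separate, distribution-appropriate model for each incomplete variable; it only shows why the penalty keeps the regression step well defined when predictors outnumber cases.

```python
import numpy as np

def ridge_impute_step(y, X, lam=1.0, rng=None):
    """Impute missing entries of a continuous y from predictors X with
    ridge regression: beta = (X'X + lam*I)^(-1) X'y on the observed
    cases, then predictions plus residual noise for the missing cases.
    The quadratic penalty keeps the solution unique even when the
    number of predictors approaches or exceeds the sample size.
    """
    rng = np.random.default_rng(rng)
    y = np.array(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    miss = np.isnan(y)
    Xo, yo = X[~miss], y[~miss]
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                  # leave the intercept unpenalized
    beta = np.linalg.solve(Xo.T @ Xo + penalty, Xo.T @ yo)
    sigma = (yo - Xo @ beta).std(ddof=1)
    y[miss] = X[miss] @ beta + rng.normal(0.0, sigma, size=miss.sum())
    return y
```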


Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, the result can be a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to readily available software, using these modern missing-data methods poses no major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations, as well as a deeper understanding of the processes that led to the missing values in a given study. This article, Part 1, first introduces Rubin’s classical definition of missing-data mechanisms and an alternative, variable-based taxonomy that provides a graphical representation. It then presents a selection of visualization tools from different R packages for describing and exploring missing-data structures.
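As a small, text-only analogue of the R exploration tools mentioned above, the following Python sketch tabulates the distinct missing-data patterns in a dataset, roughly what mice::md.pattern reports; it is an illustrative helper, not part of any of the packages discussed.

```python
import pandas as pd

def missingness_patterns(df):
    """Tabulate the distinct missing-data patterns in a data frame:
    each row of the result is one observed/missing pattern
    (1 = observed, 0 = missing) together with how many cases show it."""
    pattern = df.notna().astype(int)
    return (pattern.groupby(list(df.columns))
                   .size()
                   .rename("n_cases")
                   .reset_index()
                   .sort_values("n_cases", ascending=False))
```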


2008 ◽  
Vol 10 (2) ◽  
pp. 153-162 ◽  
Author(s):  
B. G. Ruessink

When a numerical model is to be used as a practical tool, its parameters should preferably be stable and consistent, that is, possess a small uncertainty and be time-invariant. Using data and predictions of alongshore mean currents flowing on a beach as a case study, this paper illustrates how parameter stability and consistency can be assessed using Markov chain Monte Carlo. Within a single calibration run, Markov chain Monte Carlo estimates the parameter posterior probability density function, its mode being the best-fit parameter set. Parameter stability is investigated by stepwise adding new data to a calibration run, while consistency is examined by calibrating the model on different datasets of equal length. The results for the present case study indicate that various tidal cycles with strong (say, >0.5 m/s) currents are required to obtain stable parameter estimates, and that the best-fit model parameters and the underlying posterior distribution are strongly time-varying. This inconsistent parameter behavior may reflect unresolved variability of the processes represented by the parameters, or may represent compensational behavior for temporal violations in specific model assumptions.
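The calibration machinery referred to here can be illustrated with a generic random-walk Metropolis sampler. The sketch below assumes only a user-supplied log-posterior function and is not tied to the paper's alongshore-current model.

```python
import numpy as np

def random_walk_metropolis(log_post, theta0, n_steps=5000,
                           step_sd=0.1, rng=None):
    """Generic random-walk Metropolis sampler.

    log_post(theta) returns the unnormalised log posterior density of
    the parameter vector theta given the calibration data. The chain
    approximates the parameter posterior; the visited point with the
    highest log_post value is a simple estimate of the posterior mode
    (the best-fit parameter set).
    """
    rng = np.random.default_rng(rng)
    theta = np.atleast_1d(np.array(theta0, dtype=float))
    lp = log_post(theta)
    chain = np.empty((n_steps, theta.size))
    for i in range(n_steps):
        proposal = theta + rng.normal(0.0, step_sd, size=theta.size)
        lp_new = log_post(proposal)
        if np.log(rng.uniform()) < lp_new - lp:   # accept/reject step
            theta, lp = proposal, lp_new
        chain[i] = theta
    return chain
```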


Author(s):  
M. K. Abu Husain ◽  
N. I. Mohd Zaki ◽  
M. B. Johari ◽  
G. Najafian

For an offshore structure, wind, wave, current, tide, ice and gravitational forces are all important sources of loading that exhibit a high degree of statistical uncertainty. The ability to predict the probability distribution of the response extreme values over the service life of the structure is essential for its safe and economical design. Many techniques have been introduced for evaluating the statistical properties of the response; in each case, sea-states are characterised by an appropriate water-surface elevation spectrum covering a wide range of frequencies. In practice, the most versatile and reliable technique for predicting the statistical properties of the response of an offshore structure to random wave loading is time-domain simulation. The conventional time simulation (CTS) procedure, commonly called the Monte Carlo time simulation method, is the best-known technique for predicting the short-term and long-term statistical properties of the response to random wave loading because it can account for various nonlinearities. However, it requires very long simulations to reduce the sampling variability to acceptable levels. In this paper, the effect of the sampling variability of the Monte Carlo technique is investigated.
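The sampling-variability issue can be illustrated with a toy Monte Carlo experiment: repeat a short simulation of a purely illustrative Gaussian response many times, extract an extreme-response estimate from each run, and examine how the estimates scatter across replications. The sketch below makes no attempt to model offshore wave loading.

```python
import numpy as np

rng = np.random.default_rng(1)

def extreme_from_one_simulation(n_points=36000, block=360):
    """One 'simulation': a toy Gaussian response record split into
    blocks; the 99th percentile of the block maxima stands in for an
    extreme-response estimate."""
    response = rng.standard_normal(n_points)
    block_maxima = response.reshape(-1, block).max(axis=1)
    return np.percentile(block_maxima, 99)

# Repeat the simulation many times to expose the sampling variability
# of the extreme-response estimate (its scatter across replications).
estimates = np.array([extreme_from_one_simulation() for _ in range(200)])
print("mean estimate:", estimates.mean(),
      "sampling std. dev.:", estimates.std(ddof=1))
```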


2017 ◽  
Vol 33 (4) ◽  
pp. 1005-1019 ◽  
Author(s):  
Bronwyn Loong ◽  
Donald B. Rubin

Several statistical agencies have started to use multiply-imputed synthetic microdata to create public-use data in major surveys. The purpose of doing this is to protect the confidentiality of respondents’ identities and sensitive attributes, while allowing standard complete-data analyses of microdata. A key challenge, faced by advocates of synthetic data, is demonstrating that valid statistical inferences can be obtained from such synthetic data for non-confidential questions. Large discrepancies between observed-data and synthetic-data analytic results for such questions may arise because of uncongeniality; that is, differences in the types of inputs available to the imputer, who has access to the actual data, and to the analyst, who has access only to the synthetic data. Here, we discuss a simple, but possibly canonical, example of uncongeniality when using multiple imputation to create synthetic data, which specifically addresses the choices made by the imputer. An initial, unanticipated but not surprising, conclusion is that non-confidential design information used to impute synthetic data should be released with the confidential synthetic data to allow users of synthetic data to avoid possible grossly conservative inferences.


2019 ◽  
Author(s):  
Donna Coffman ◽  
Jiangxiu Zhou ◽  
Xizhen Cai

Background: Causal effect estimation with observational data is subject to bias due to confounding, which is often controlled for using propensity scores. One unresolved issue in propensity score estimation is how to handle missing values in covariates. Method: Several approaches have been proposed for handling covariate missingness, including multiple imputation (MI), multiple imputation with missingness pattern (MIMP), and treatment mean imputation. However, there are other potentially useful approaches that have not been evaluated, including single imputation (SI) + prediction error (PE), SI+PE + parameter uncertainty (PU), and Generalized Boosted Modeling (GBM), a nonparametric approach for estimating propensity scores in which missing values are handled automatically during estimation by a surrogate-split method. A simulation study was conducted to evaluate the performance of these approaches. Results: The simulations suggested that SI+PE, SI+PE+PU, MI, and MIMP perform almost equally well and better than treatment mean imputation and GBM in terms of bias; however, MI and MIMP account for the additional uncertainty of imputing the missingness. Conclusions: Applying GBM to the incomplete data and relying on the surrogate-split approach resulted in substantial bias. Imputation prior to implementing GBM is recommended.
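A sketch of the MI-based branch of this comparison follows: impute the covariates several times (here an `X_list` of completed matrices is assumed to come from the analyst's imputation routine), fit a logistic propensity model to each, and average the inverse-probability-weighted effect estimates. This is an illustrative outline, not the authors' simulation code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mi_ipw_effect(X_list, treat, outcome):
    """Estimate an average treatment effect by inverse-probability
    weighting on each of several imputed covariate matrices, then
    average the point estimates across imputations.

    X_list : list of completed (imputed) covariate matrices, one per
             imputation; treat : 0/1 treatment array; outcome : outcomes.
    """
    effects = []
    for X in X_list:
        ps = LogisticRegression(max_iter=1000).fit(X, treat).predict_proba(X)[:, 1]
        w = np.where(treat == 1, 1 / ps, 1 / (1 - ps))   # IPW weights
        mu1 = np.average(outcome[treat == 1], weights=w[treat == 1])
        mu0 = np.average(outcome[treat == 0], weights=w[treat == 0])
        effects.append(mu1 - mu0)
    return float(np.mean(effects))
```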

