Evaluation of Four Multiple Imputation Methods for Handling Missing Binary Outcome Data in the Presence of an Interaction between a Dummy and a Continuous Variable

2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Sara Javadi ◽  
Abbas Bahrampour ◽  
Mohammad Mehdi Saber ◽  
Behshid Garrusi ◽  
Mohammad Reza Baneshi

Multiple imputation by chained equations (MICE) is the most common method for imputing missing data. In the MICE algorithm, imputation can be performed with a variety of parametric and nonparametric methods. By default, MICE implementations build imputation models that include variables as linear terms only, with no interactions, but omitting interaction terms may lead to biased results. We investigated, using simulated and real datasets, whether recursive partitioning creates appropriate variability between imputations and unbiased parameter estimates with appropriate confidence intervals. We compared four multiple imputation (MI) methods on a real and a simulated dataset: predictive mean matching with an interaction term in the imputation model (MICE-Interaction), classification and regression trees for specifying the imputation model (MICE-CART), the implementation of random forests in MICE (MICE-RF), and the MICE-Stratified method. We first selected secondary data and devised an experimental design of 40 scenarios (5 × 2 × 4) that differed by the rate of simulated missing data (10%, 20%, 30%, 40%, and 50%), the missingness mechanism (MAR and MCAR), and the imputation method (MICE-Interaction, MICE-CART, MICE-RF, and MICE-Stratified). We randomly drew 700 observations with replacement 300 times and then created the missing data in each draw. The evaluation was based on raw bias (RB) and five other measures, averaged over the repetitions. Next, in a simulation study, we generated 1,000 datasets of 700 observations each and created missing data once per dataset; the methods were evaluated with the same criteria as for the real data. We conclude that, when there is an interaction effect between a dummy and a continuous predictor, substantial gains are possible from using recursive partitioning for imputation rather than parametric methods, and that MICE-Interaction preserves interaction effects more efficiently and conveniently than the other methods.
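The paper's four methods are implemented in R's mice package; for orientation, a rough Python analogue of the tree-based variants (MICE-CART/MICE-RF) can be sketched with scikit-learn's IterativeImputer. Everything below — the simulated interaction data, the forest settings, and the use of a regressor on a binary outcome — is an illustrative assumption, not the authors' implementation:

```python
# A loose Python analogue of tree-based MICE (the paper uses R's `mice`).
# Trees can pick up the d*x interaction without it being written into the
# imputation model; varying random_state gives between-imputation
# variability, though tree predictions are not true posterior draws, and a
# faithful binary imputation would use classification trees with draws.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 700
x = rng.normal(size=n)                          # continuous predictor
d = rng.integers(0, 2, size=n)                  # dummy predictor
logit = -0.5 + 0.8 * x + 0.6 * d - 1.2 * d * x  # dummy-by-continuous interaction
y = rng.binomial(1, 1 / (1 + np.exp(-logit))).astype(float)
y[rng.random(n) < 0.3] = np.nan                 # ~30% MCAR missingness in the outcome

df = pd.DataFrame({"y": y, "x": x, "d": d})
m = 5                                           # number of imputations
imputed = [
    IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=100, random_state=k),
        random_state=k,
    ).fit_transform(df)
    for k in range(m)
]
```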

2021 ◽  
Vol 19 (1) ◽  
Author(s):  
Albee Ling ◽  
Maria Montez-Rath ◽  
Maya Mathur ◽  
Kris Kapphahn ◽  
Manisha Desai

Propensity score matching (PSM) has been widely used to mitigate confounding in observational studies, although complications arise when the covariates used to estimate the PS are only partially observed. Multiple imputation (MI) is a potential solution for handling missing covariates in the estimation of the PS. However, it is not clear how best to apply MI strategies in the context of PSM. We conducted a simulation study to compare the performance of popular non-MI missing data methods and various MI-based strategies under different missing data mechanisms. We found that commonly applied missing data methods resulted in biased and inefficient estimates, and we observed large variation in performance across MI-based strategies. Based on our findings, we recommend 1) estimating the PS after applying MI to impute missing confounders; 2) conducting PSM within each imputed dataset and then averaging the treatment effects to arrive at one summarized finding; 3) a bootstrap-based variance to account for the uncertainty of PS estimation, matching, and imputation; and 4) including key auxiliary variables in the imputation model.
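As a sketch of recommendations 1–3 — impute first, match within each imputed dataset, average the effects, and bootstrap the whole pipeline for variance — the fragment below uses a plain logistic-regression PS and 1:1 nearest-neighbour matching; the matcher, variable names, and pooling by a simple mean are illustrative assumptions, not the authors' code:

```python
# Hedged sketch: PS estimation and 1:1 matching inside one imputed dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def matched_effect(X, treat, outcome):
    """Estimate the PS by logistic regression, match each treated unit to
    its nearest control on the PS, and return the mean outcome difference."""
    ps = LogisticRegression(max_iter=1000).fit(X, treat).predict_proba(X)[:, 1]
    treated, control = np.where(treat == 1)[0], np.where(treat == 0)[0]
    nn = NearestNeighbors(n_neighbors=1).fit(ps[control, None])
    _, idx = nn.kneighbors(ps[treated, None])
    return outcome[treated].mean() - outcome[control[idx.ravel()]].mean()

# With `imputed_datasets` produced by the MI step (recommendation 1),
# each element being (X, treat, outcome):
# effects = [matched_effect(X, t, y) for X, t, y in imputed_datasets]
# pooled = np.mean(effects)            # recommendation 2
# For recommendation 3, bootstrap the entire impute-match-estimate pipeline.
```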


2021 ◽  
Author(s):  
Marcus Richard Waldman ◽  
Katherine E. Masyn

It is well established that omitting important variables related to the propensity for missingness can lead to biased parameter estimates and invalid inference. Nevertheless, researchers conducting a person-centered analysis ubiquitously adopt a full information maximum likelihood (FIML) approach, treating missing data in a manner that assumes the missingness is related only to the observed indicators and not to any external variables. Such an assumption is generally considered overly restrictive in the behavioral sciences, where the data are observational in nature. At the same time, previous research has discouraged the adoption of multiple imputation for missing data in person-centered analyses because traditional imputation models make a single-class assumption and do not reflect the multiple-group structure of data with latent subpopulations (Enders & Gottschall, 2011). However, more modern imputation models that rely on recursive partitioning do not impose a single-class structure on the data. Focusing on latent profile analysis, we demonstrate in simulations that, in samples of N = 1,200 or greater, recursive partitioning imputation algorithms can effectively incorporate external information from auxiliary variables to attenuate nonresponse bias better than FIML and multivariate normal imputation. Moreover, we find that recursive imputation models lead to confidence intervals with adequate coverage and recover posterior class probabilities better than alternative missing data strategies. Taken together, our findings point to the promise and potential of multiple imputation in person-centered analyses once remaining methodological gaps around pooling and class enumeration are filled.
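A minimal sketch of the strategy the simulations favour — tree-based MI that carries auxiliary variables, followed by a profile model per imputed dataset — using scikit-learn, with GaussianMixture standing in for a latent profile analysis. The naive averaging of posteriors below ignores label switching across imputations, exactly the kind of pooling gap the authors say remains open, so treat the whole fragment as an assumption:

```python
# Tree-based MI with auxiliary variables, then a mixture model per dataset.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.mixture import GaussianMixture

def tree_mi_profiles(df, indicator_cols, n_profiles, m=10):
    posteriors = []
    for k in range(m):
        imputer = IterativeImputer(
            estimator=ExtraTreesRegressor(n_estimators=100, random_state=k),
            random_state=k,
        )
        # Auxiliary columns stay in df so they inform the imputations,
        # but only the profile indicators enter the mixture model.
        filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
        gm = GaussianMixture(n_components=n_profiles, covariance_type="diag",
                             random_state=k).fit(filled[indicator_cols])
        posteriors.append(gm.predict_proba(filled[indicator_cols]))
    # Naive average of posterior class probabilities across imputations;
    # a real analysis must first align class labels across datasets.
    return np.mean(posteriors, axis=0)
```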


Author(s):  
Tra My Pham ◽  
Irene Petersen ◽  
James Carpenter ◽  
Tim Morris

ABSTRACT
Background: Ethnicity is an important factor to consider in health research because of its association with inequality in disease prevalence and in the utilisation of healthcare. Ethnicity recording has been incorporated in primary care electronic health records, and hence is available in large UK primary care databases such as The Health Improvement Network (THIN). However, since primary care data are routinely collected for clinical purposes, a large amount of data relevant for research, including ethnicity, is often missing. A popular approach for missing data is multiple imputation (MI). However, the conventional MI method, which assumes data are missing at random, does not give plausible estimates of the ethnicity distribution in THIN compared to the general UK population. This may be because ethnicity data in primary care are likely to be missing not at random.
Objectives: I propose a new MI method, termed 'weighted multiple imputation', to deal with data that are missing not at random in categorical variables.
Methods: Weighted MI combines MI and probability weights calculated using external data sources. Census summary statistics for ethnicity can be used to form weights in weighted MI such that the correct marginal ethnic breakdown is recovered in THIN. I conducted a simulation study to examine weighted MI when ethnicity data are missing not at random. In this simulation study, which resembled a THIN dataset, ethnicity was an independent variable in a survival model alongside other covariates. Weighted MI was compared to conventional MI and other traditional missing data methods, including complete case analysis and single imputation.
Results: While a small bias was still present in ethnicity coefficient estimates under weighted MI, it was less severe than under MI assuming missing at random. Complete case analysis and single imputation were inadequate for handling data that are missing not at random in ethnicity.
Conclusions: Although not a total cure, weighted MI represents a pragmatic approach with potential applications not only to ethnicity but also to other incomplete categorical health indicators in electronic health records.
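A small sketch of the core idea as read from the abstract — tilt the imputation model's predicted category probabilities by weights formed from census marginals, then draw the imputed categories. The exact weighting scheme below (marginal ratio followed by row renormalisation) is an assumption, not necessarily the author's algorithm:

```python
# Illustrative 'weighted MI' draw for one incomplete categorical variable.
import numpy as np

def weighted_draw(pred_probs, census_marginal, rng):
    """pred_probs: (n_missing, K) category probabilities from the imputation
    model; census_marginal: length-K target ethnic breakdown."""
    implied = pred_probs.mean(axis=0)              # marginal implied by the model
    w = np.asarray(census_marginal) / implied      # probability weights
    tilted = pred_probs * w                        # pull toward the census
    tilted /= tilted.sum(axis=1, keepdims=True)    # renormalise each row
    return np.array([rng.choice(len(p), p=p) for p in tilted])
```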


2019 ◽  
Author(s):  
Leili Tapak ◽  
Omid Hamidi ◽  
Majid Sadeghifar ◽  
Hassan Doosti ◽  
Ghobad Moradi

Abstract
Objectives: Zero-inflated proportion or rate data nested in clusters due to the sampling structure arise in many disciplines. Sometimes the rate response is not observed for some study units because of limitations such as failures in recording data (false negatives), and zeros are observed instead of the actual rate/proportion (low incidence). In this study, we propose a multilevel zero-inflated censored Beta regression model that can address zero-inflated rate data with low incidence.
Methods: We assume that the random effects are independent and normally distributed. The performance of the proposed approach was evaluated by application to a three-level real dataset and by a simulation study. We applied the proposed model to brucellosis diagnosis rate data to investigate the effects of climate and geographical position. For comparison, we also applied the standard zero-inflated censored Beta regression model, which does not account for correlation.
Results: The proposed model performed better than the standard zero-inflated censored Beta model by the AIC criterion. Height (p < 0.0001), temperature (p < 0.0001), and precipitation (p = 0.0006) significantly affected brucellosis rates, whereas precipitation was not statistically significant in the ZICBETA model (p = 0.385). The simulation study also showed that the maximum likelihood estimates were reasonable in terms of mean squared error.
Conclusions: The results showed that the proposed method can capture the correlations in the real dataset and yields accurate parameter estimates.
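The abstract does not reproduce the model, so the following is one plausible formulation reconstructed from its description (structural zeros plus false-negative censoring, with normal random intercepts for the nesting levels); read it as an assumption rather than the authors' exact specification:

$$ P(Y_{ijk}=0) \;=\; \pi_{ijk} + (1-\pi_{ijk})\,p_{ijk}, $$
$$ f(y_{ijk}) \;=\; (1-\pi_{ijk})(1-p_{ijk})\,\mathrm{Beta}\!\big(y_{ijk};\,\mu_{ijk}\phi,\,(1-\mu_{ijk})\phi\big), \qquad 0<y_{ijk}<1, $$

where $\pi_{ijk}$ is the structural zero-inflation probability, $p_{ijk}$ the probability that a true positive rate is censored to zero (the false-negative mechanism), $\mu_{ijk}$ the Beta mean and $\phi$ its precision, and $\operatorname{logit}(\pi_{ijk})$, $\operatorname{logit}(p_{ijk})$, and $\operatorname{logit}(\mu_{ijk})$ are each linear predictors containing covariates plus independent normal random intercepts for the two upper levels of the three-level structure.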


2021 ◽  
Author(s):  
Adrienne D. Woods ◽  
Pamela Davis-Kean ◽  
Max Andrew Halvorson ◽  
Kevin Michael King ◽  
Jessica A. R. Logan ◽  
...  

A common challenge in developmental research is the amount of incomplete and missing data that arises when respondents fail to complete tasks or questionnaires, or disengage from the study entirely (i.e., attrition). This missingness can bias parameter estimates and, hence, the interpretation of findings. These biases can be addressed through statistical techniques that adjust for missing data, such as multiple imputation. Although this technique is highly effective, developmental scientists have not widely adopted it, owing to barriers such as lack of training and misconceptions about imputation methods, and instead fall back on software defaults such as listwise deletion. This manuscript provides practical guidelines for developmental researchers to follow when examining their data for missingness, deciding how to handle that missingness, and reporting the extent of missing data biases and the specific multiple imputation procedures used in publications.
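One concrete piece of such guidelines is how results from the m imputed datasets are combined: Rubin's rules are the standard pooling step. A minimal implementation, with illustrative variable names:

```python
# Rubin's rules: pool one parameter's estimates across m imputed analyses.
import numpy as np

def rubin_pool(estimates, variances):
    """estimates, variances: length-m sequences of the per-imputation
    estimate and its squared standard error."""
    estimates = np.asarray(estimates)
    variances = np.asarray(variances)
    m = len(estimates)
    q_bar = estimates.mean()              # pooled point estimate
    w = variances.mean()                  # within-imputation variance
    b = estimates.var(ddof=1)             # between-imputation variance
    t = w + (1 + 1 / m) * b               # total variance
    return q_bar, np.sqrt(t)              # pooled estimate and standard error
```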


2013 ◽  
Vol 03 (05) ◽  
pp. 370-378 ◽  
Author(s):  
Jochen Hardt ◽  
Max Herke ◽  
Tamara Brian ◽  
Wilfried Laubach

2019 ◽  
Vol 80 (1) ◽  
pp. 41-66 ◽  
Author(s):  
Dexin Shi ◽  
Taehun Lee ◽  
Amanda J. Fairchild ◽  
Alberto Maydeu-Olivares

This study compares two missing data procedures in the context of ordinal factor analysis models: pairwise deletion (PD; the default setting in Mplus) and multiple imputation (MI). We examine which procedure yields parameter estimates and model fit indices closer to those obtained from complete data. The performance of PD and MI is compared under a wide range of conditions, including number of response categories, sample size, percentage of missingness, and degree of model misfit. Results indicate that both PD and MI yield parameter estimates similar to those from the analysis of complete data under conditions where the data are missing completely at random (MCAR). When the data are missing at random (MAR), PD parameter estimates are severely biased across the parameter combinations in the study. When the percentage of missingness is less than 50%, MI yields parameter estimates similar to those from complete data; however, the fit indices (i.e., χ2, RMSEA, and WRMR) suggest a worse fit than is observed in complete data. We recommend that applied researchers use MI when fitting ordinal factor models with missing data, and that they interpret model fit based on the TLI and CFI incremental fit indices.
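For intuition about what PD does at the level of the input correlation matrix, note that pandas computes pairwise-complete Pearson correlations by default, so the contrast with listwise deletion is one line each. This is only a loose analogue, since ordinal factor analysis in Mplus would operate on polychoric correlations:

```python
# Pairwise deletion (PD) vs listwise deletion on a correlation matrix.
import pandas as pd

def deletion_comparison(df: pd.DataFrame):
    pairwise = df.corr()             # each entry uses all pairwise-complete rows
    listwise = df.dropna().corr()    # only rows complete on every variable
    return pairwise, listwise
```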


1988 ◽  
Vol 255 (3) ◽  
pp. R353-R367 ◽  
Author(s):  
B. K. Slinker ◽  
S. A. Glantz

Physiologists often wish to compare the effects of several different treatments on a continuous variable of interest, which requires an analysis of variance. Analysis of variance, as presented in most statistics texts, generally requires that there be no missing data and often that each sample group be the same size. Unfortunately, these requirements are rarely satisfied, and investigators are confronted with the problem of how to analyze data that do not strictly fit the traditional analysis of variance paradigm. One can avoid these pitfalls by recasting the analysis of variance as a multiple linear regression problem. When there are no missing data, the results of a traditional analysis of variance and the corresponding multiple regression problem are identical; when the sample sizes are unequal or there are missing data, the regression formulation can analyze data that cannot be easily handled in the traditional paradigm, thus overcoming a practical computational limitation of traditional analysis of variance. The multiple linear regression approach is also more efficient: in one run of a statistics routine, the analysis of variance is done, one obtains estimates of the size of the treatment effects (rather than just an indication of whether such effects are present), and many of the pairwise multiple comparisons are performed (they are equivalent to t tests for significance of the regression parameter estimates). Finally, interaction between the different treatment factors is easier to interpret than in traditional analysis of variance.
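The equivalence the authors exploit is easy to reproduce with modern software: coding the treatments as dummy variables and fitting a regression returns the ANOVA table, the effect-size estimates, and pairwise contrasts in one pass, with no balanced-design requirement. A minimal sketch on illustrative data:

```python
# One-way ANOVA recast as a dummy-variable regression (unequal n is fine).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
df = pd.DataFrame({"treatment": np.repeat(["a", "b", "c"], [10, 14, 7])})
df["y"] = rng.normal(size=len(df)) + df["treatment"].map({"a": 0.0, "b": 1.0, "c": 0.5})

fit = smf.ols("y ~ C(treatment)", data=df).fit()
print(anova_lm(fit, typ=2))    # the ANOVA table
print(fit.params)              # treatment-effect sizes vs. the reference group
# A pairwise comparison, equivalent to a t test on regression parameters:
print(fit.t_test("C(treatment)[T.b] - C(treatment)[T.c] = 0"))
```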

