How handling missing data may impact conclusions: A comparison of six different imputation methods for categorical questionnaire data

2019 ◽  
Vol 7 ◽  
pp. 205031211882291 ◽  
Author(s):  
Marianne Riksheim Stavseth ◽  
Thomas Clausen ◽  
Jo Røislien

Objectives: Missing data is a recurrent issue in many fields of medical research, particularly in questionnaires. The aim of this article is to describe and compare six conceptually different multiple imputation methods, alongside the commonly used complete case analysis, and to explore whether the choice of methodology for handling missing data might impact clinical conclusions drawn from a regression model when data are categorical. Methods: In addition to the commonly used complete case analysis, we tested the following six imputation methods: multiple imputation using expectation–maximization with bootstrapping, multiple imputation using multiple correspondence analysis, multiple imputation using latent class analysis, multiple hot deck imputation and multivariate imputation by chained equations with two different model specifications: logistic regression and random forests. The methods were tested on real data from a questionnaire-based study in the Norwegian opioid maintenance treatment programme. Results: All methods performed relatively well when the sample size was large (n = 1000). For a smaller sample size (n = 200), the regression estimates depended heavily on the level of missingness. When the amount of missing data was ≥20%, complete case analysis, hot deck and random forests in particular had biased estimates with too low coverage. Multiple imputation using multiple correspondence analysis had the best overall performance. Conclusion: The choice of methodology for handling missing data has a significant impact on the clinical interpretation of the accompanying statistical analyses. With missing data, the choice of whether to impute or not, and the choice of imputation method, can influence clinical conclusions drawn from a regression model and should therefore be given sufficient consideration.
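
Every multiple imputation method compared above produces several completed datasets, and the regression estimates from each must be combined into one result. The abstract does not say how the authors pooled their estimates; the sketch below assumes the standard approach, Rubin's rules, and uses made-up numbers. The function name `pool_rubin` and the example values are illustrative only.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: the coefficient estimated in each imputed dataset
    variances: the squared standard error of that coefficient in each dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance between imputations
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Hypothetical log-odds estimates and squared standard errors from m = 3 imputations.
estimate, std_error = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(f"pooled estimate = {estimate:.3f}, pooled SE = {std_error:.3f}")
```

The between-imputation term is what separates multiple from single imputation: it widens the standard error to reflect uncertainty about the missing values themselves.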

2019 ◽  
Vol 76 (24) ◽  
pp. 2048-2052
Author(s):  
Sujita W Narayan ◽  
Kar Yu Ho ◽  
Jonathan Penm ◽  
Barbara Mintzes ◽  
Ardalan Mirzaei ◽  
...  

Abstract Purpose This study aimed to document the ways by which missing data were handled in clinical pharmacy research to provide an insight into the amount of attention paid to the importance of missing data in this field of research. Methods Our cross-sectional descriptive report evaluated 10 journals affiliated with pharmacy organizations in the United States, Canada, the United Kingdom, and Australia. Randomized controlled trials, cohort studies, case-control studies, and cross-sectional studies published in 2018 were included. The primary outcome measure was the proportion of studies that reported the handling of missing data in their methods or results. Results A total of 178 studies were included in the analysis. Of these, 19.7% (n = 35) mentioned missing data either in their methods (3.4%, n = 6), results (15.2%, n = 27), or in both sections (1.1%, n = 2). Only 4.5% (n = 8) of the studies mentioned how they handled missing data, the most common method being multiple imputation (n = 3), followed by indicator (n = 2), complete case analysis (n = 2), and simple imputation (n = 1). One study using multiple imputation and both studies using an indicator method also combined other strategies to account for missing data. One study only used complete case analysis for subgroup analysis, and the other study only used this method if a specific baseline variable was missing. Conclusions Very few studies in clinical pharmacy literature report any handling of missing data. This has the potential to lead to biased results. We advocate that researchers should report how missing data were handled to increase the transparency of findings and minimize bias.


Author(s):  
Tra My Pham ◽  
Irene Petersen ◽  
James Carpenter ◽  
Tim Morris

ABSTRACT Background: Ethnicity is an important factor to be considered in health research because of its association with inequality in disease prevalence and the utilisation of healthcare. Ethnicity recording has been incorporated in primary care electronic health records, and hence is available in large UK primary care databases such as The Health Improvement Network (THIN). However, since primary care data are routinely collected for clinical purposes, a large amount of data that are relevant for research including ethnicity is often missing. A popular approach for missing data is multiple imputation (MI). However, the conventional MI method assuming data are missing at random does not give plausible estimates of the ethnicity distribution in THIN compared to the general UK population. This might be due to the fact that ethnicity data in primary care are likely to be missing not at random. Objectives: I propose a new MI method, termed ‘weighted multiple imputation’, to deal with data that are missing not at random in categorical variables. Methods: Weighted MI combines MI and probability weights which are calculated using external data sources. Census summary statistics for ethnicity can be used to form weights in weighted MI such that the correct marginal ethnic breakdown is recovered in THIN. I conducted a simulation study to examine weighted MI when ethnicity data are missing not at random. In this simulation study which resembled a THIN dataset, ethnicity was an independent variable in a survival model alongside other covariates. Weighted MI was compared to the conventional MI and other traditional missing data methods including complete case analysis and single imputation. Results: While a small bias was still present in ethnicity coefficient estimates under weighted MI, it was less severe compared to MI assuming missing at random. Complete case analysis and single imputation were inadequate to handle data that are missing not at random in ethnicity. Conclusions: Although not a total cure, weighted MI represents a pragmatic approach that has potential applications not only in ethnicity but also in other incomplete categorical health indicators in electronic health records.
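
The idea of recovering a known marginal breakdown can be made concrete with simple arithmetic. If a fraction r of records lack the category, the target marginal share of each category equals (1 − r) × observed share + r × share among the imputed values. The sketch below solves that identity for the share the imputed values would need. It is not the weighted MI algorithm from the abstract, whose weight construction is not described there; the function name, category labels and proportions are all hypothetical.

```python
def implied_missing_distribution(observed, target, missing_fraction):
    """Shares each category must have among the missing values so that the
    completed data match an external target marginal distribution.

    observed: category -> share among records with the category recorded
    target:   category -> share in the external source (e.g. a census)
    missing_fraction: fraction of records with the category missing (0 < r < 1)
    """
    if not 0 < missing_fraction < 1:
        raise ValueError("missing_fraction must lie strictly between 0 and 1")
    implied = {}
    for category, target_share in target.items():
        observed_share = observed.get(category, 0.0)
        share = (target_share - (1 - missing_fraction) * observed_share) / missing_fraction
        if share < -1e-12:
            # The target cannot be reached, whatever values are imputed.
            raise ValueError(f"target unreachable for category {category!r}")
        implied[category] = max(share, 0.0)
    return implied


# Hypothetical proportions: 40% of records have no category recorded.
observed_shares = {"group_a": 0.90, "group_b": 0.10}
census_shares = {"group_a": 0.86, "group_b": 0.14}
needed = implied_missing_distribution(observed_shares, census_shares, 0.4)
print(needed)
```

In this made-up case the missing records would need about twice the observed share of `group_b`, which is the kind of gap an imputation model assuming missing at random cannot produce on its own.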


2020 ◽  
Vol 189 (12) ◽  
pp. 1583-1589
Author(s):  
Rachael K Ross ◽  
Alexander Breskin ◽  
Daniel Westreich

Abstract When estimating causal effects, careful handling of missing data is needed to avoid bias. Complete-case analysis is commonly used in epidemiologic analyses. Previous work has shown that covariate-stratified effect estimates from complete-case analysis are unbiased when missingness is independent of the outcome conditional on the exposure and covariates. Here, we assess the bias of complete-case analysis for adjusted marginal effects when confounding is present under various causal structures of missing data. We show that estimation of the marginal risk difference requires an unbiased estimate of the unconditional joint distribution of confounders and any other covariates required for conditional independence of missingness and outcome. The dependence of missing data on these covariates must be considered to obtain a valid estimate of the covariate distribution. If none of these covariates are effect-measure modifiers on the absolute scale, however, the marginal risk difference will equal the stratified risk differences and the complete-case analysis will be unbiased when the stratified effect estimates are unbiased. Estimation of unbiased marginal effects in complete-case analysis therefore requires close consideration of causal structure and effect-measure modification.
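
A small worked example may help show why effect-measure modification matters here. The sketch assumes, as the abstract states, that the stratum-specific risk differences are estimated without bias among complete cases, and that missingness depends only on a binary covariate Z. All numbers and function names are hypothetical.

```python
def complete_case_distribution(stratum_prob, prob_observed):
    """Covariate distribution among complete cases when the chance of being
    fully observed differs by stratum."""
    joint = [p * o for p, o in zip(stratum_prob, prob_observed)]
    total = sum(joint)
    return [j / total for j in joint]


def marginal_risk_difference(stratum_rd, stratum_prob):
    """Standardize stratum-specific risk differences over a covariate distribution."""
    return sum(rd * p for rd, p in zip(stratum_rd, stratum_prob))


population = [0.5, 0.5]     # P(Z = 0), P(Z = 1) in the full population
prob_observed = [1.0, 0.5]  # half of the Z = 1 records are incomplete
complete_cases = complete_case_distribution(population, prob_observed)

# Case 1: Z modifies the risk difference, so the distorted Z distribution biases the result.
modified_rd = [0.1, 0.3]
true_rd = marginal_risk_difference(modified_rd, population)
cc_rd = marginal_risk_difference(modified_rd, complete_cases)

# Case 2: no effect-measure modification, so the Z distribution does not matter.
constant_rd = [0.2, 0.2]
true_rd_constant = marginal_risk_difference(constant_rd, population)
cc_rd_constant = marginal_risk_difference(constant_rd, complete_cases)

print(f"with modification: true {true_rd:.3f}, complete case {cc_rd:.3f}")
print(f"without modification: true {true_rd_constant:.3f}, complete case {cc_rd_constant:.3f}")
```

With modification, the complete-case value differs from the true marginal risk difference because complete cases over-represent Z = 0. Without modification, the two agree.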


2019 ◽  
Vol 12 (1) ◽  
pp. 45-55
Author(s):  
Mwiche Musukuma ◽  
Brian Sonkwe ◽  
Isaac Fwemba ◽  
Patrick Musonda

Background: With the increase in the use of secondary data in epidemiological studies, the question of how to manage missing data has become more relevant. Our study applied imputation techniques to traumatic spinal cord injury data, a medical problem for which data are generally sporadic. Traumatic spinal cord injuries due to blunt force cause widespread physiological impairments and medical and non-medical problems. The effects of spinal cord injuries are a burden not only to the victims but also to their families and to the entire health system of a country. This study also evaluated the causes of traumatic spinal cord injuries in patients admitted to the University Teaching Hospital and factors associated with clinical complications in these patients. Methods: The study used data from medical records of patients who were admitted to the University Teaching Hospital in Lusaka, Zambia. Patients presenting with traumatic spinal cord injuries between 1st January 2013 and 31st December 2017 were part of the study. The data were first analysed using complete case analysis, then multiple imputation techniques were applied to account for the missing data. Thereafter, both descriptive and inferential analyses were performed on the imputed data. Results: During the study period, a total of 176 patients were identified as having suffered from spinal cord injuries. Road traffic accidents accounted for 56% (101) of the injuries. Clinical complications suffered by these patients included paralysis, death, bowel and bladder dysfunction and pressure sores, among others. Eighty-eight (50%) patients had paralysis. Patients with cervical spine injuries had 87% lower odds of suffering from clinical complications than patients with thoracic spine injuries (OR=0.13, 95% CI {0.08, 0.22}, p<.0001). Being paraplegic at discharge increased the odds of developing a clinical complication about eightfold (OR=8.01, 95% CI {2.74, 23.99}, p<.001). Undergoing an operation increased the odds of having a clinical complication (OR=3.71, 95% CI {1.99, 6.88}, p<.0001). A patient who presented with Frankel Grade C or E had a 96% reduction in the odds of having a clinical complication (OR=0.04, 95% CI {0.02, 0.09} and {0.02, 0.12}, respectively, p<.0001) compared to a patient who presented with Frankel Grade A. Conclusion: A comparison of estimates obtained from complete case analysis and from multiple imputation revealed that when there are many missing values, estimates obtained from complete case analysis are unreliable and lack power. Efforts should be made to use methods for dealing with missing values, such as multiple imputation techniques. The most common cause of traumatic spinal cord injuries was road traffic accidents. Findings suggest that paralysis had the greatest negative effect on clinical complications. As Frankel Grade increased from A to E, a patient became less likely to succumb to clinical complications. No evidence of an association was found between age or sex and developing a clinical complication.


2010 ◽  
Vol 29 (12) ◽  
pp. 1357-1357 ◽  
Author(s):  
Nicholas J. Horton ◽  
Ian R. White ◽  
James Carpenter

2021 ◽  
Vol 11 (6) ◽  
pp. 249-262
Author(s):  
Sachit Ganapathy ◽  
Binukumar Bhaskarapillai ◽  
Shailendra Dandge

Background: National Family Health Survey-4 (NFHS-4) revealed a significant improvement in the percentage of complete immunization attained in India. Even though determinants of immunization coverage in India have been addressed by some studies, the impact of missing data in such large-scale surveys has not previously been accounted for. The present study aimed to identify the potential factors associated with immunization coverage in India using complete case analysis (CCA) and multiple imputation by chained equations (MICE). Materials and methods: We created a dichotomous immunization variable based on the status of all the vaccines given to the child. All relevant variables were summarized using appropriate descriptive statistics along with the proportion of missingness. Further, the MICE procedure was performed to impute the missing values after assessing the missing data mechanism. Multiple logistic regression, accounting for the sampling weights, was used to estimate odds ratios (OR) and 95% confidence intervals (CI) for both the CCA and MICE analyses, and the results were compared. Results: The percentage of children under five years of age who had total immunization was 69%. Further, we observed that female sex and rural habitation were associated with higher odds of being immunized in both CCA and MICE. Moreover, wealth index, number of antenatal visits, checkup after delivery and place of birth played an important role in immunization coverage. Conclusion: MICE provided more precise risk estimates for potential factors associated with vaccination coverage than CCA, even though the major findings did not change, owing to the large sample size. Key words: Immunization, Health surveys, missing data, Logistic regression, complete case analysis, MICE.

