A Note on Listwise Deletion versus Multiple Imputation

This letter compares the performance of multiple imputation and listwise deletion using a simulation approach. The focus is on data that are “missing not at random” (MNAR), in which case both multiple imputation and listwise deletion are known to be biased. In these simulations, multiple imputation yields results that are frequently more biased, less efficient, and with worse coverage than listwise deletion when data are MNAR. This is the case even with very strong correlations between fully observed variables and variables with missing values, such that the data are very nearly “missing at random.” These results recommend caution when comparing the results from multiple imputation and listwise deletion, when the true data generating process is unknown.
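
The kind of comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the letter's actual simulation design: the sample size, the missingness rule, the single-predictor model, and the simplified ("improper") imputation step that redraws only residual noise are all assumptions made here for brevity.

```python
import random
import statistics


def ols(xs, ys):
    """Return (intercept, slope) of the least-squares line ys ~ xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def compare_methods(n=5000, beta=1.0, m=5, seed=0):
    """Estimate the slope of y on x when x is MNAR (hypothetical setup)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta * xi + rng.gauss(0, 1) for xi in x]

    # MNAR: the chance that x is missing depends on x itself
    # (80% missing when x > 0, 10% otherwise; illustrative values).
    observed = [rng.random() > (0.8 if xi > 0 else 0.1) for xi in x]

    # Listwise deletion: keep only rows where x is observed.
    cc_x = [xi for xi, o in zip(x, observed) if o]
    cc_y = [yi for yi, o in zip(y, observed) if o]
    _, listwise_slope = ols(cc_x, cc_y)

    # Simplified imputation model x ~ y, fitted on the complete cases.
    a, g = ols(cc_y, cc_x)
    resid = [xi - (a + g * yi) for xi, yi in zip(cc_x, cc_y)]
    resid_sd = statistics.pstdev(resid)

    slopes = []
    for _ in range(m):
        # Fill each missing x with a prediction plus random noise.
        x_imp = [
            xi if o else a + g * yi + rng.gauss(0, resid_sd)
            for xi, yi, o in zip(x, y, observed)
        ]
        slopes.append(ols(x_imp, y)[1])

    # Pool the m point estimates by averaging them.
    return listwise_slope, statistics.fmean(slopes)


lw, mi = compare_methods()
print(f"true slope 1.0 | listwise {lw:.3f} | imputed {mi:.3f}")
```

In this particular setup missingness depends only on the predictor, a case in which complete-case regression is known to remain consistent; other MNAR mechanisms can bias both methods, so the sketch shows only one corner of the comparison.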


Missing not at random in end of life care studies: multiple imputation and sensitivity analysis on data from the ACTION study

BMC Medical Research Methodology, 2021, Vol 21 (1), doi:10.1186/s12874-020-01180-y

Author(s): Giulia Carreras, Guido Miccinesi, Andrew Wilcock, Nancy Preston, ...

Keyword(s): Sensitivity Analysis; Multiple Imputation; End Of Life; End Of Life Care; Missing Values; Controlled Trial; Missing At Random; Life Care; Missing Not At Random; Cluster Randomized

Background: Missing data are common in end-of-life care studies, but there is still relatively little exploration of which is the best method to deal with them, and, in particular, if the missing at random (MAR) assumption is valid or missing not at random (MNAR) mechanisms should be assumed. In this paper we investigated this issue through a sensitivity analysis within the ACTION study, a multicenter cluster randomized controlled trial testing advance care planning in patients with advanced lung or colorectal cancer. Methods: Multiple imputation procedures under MAR and MNAR assumptions were implemented. Possible violation of the MAR assumption was addressed with reference to variables measuring quality of life and symptoms. The MNAR model assumed that patients with worse health were more likely to have missing questionnaires, making a distinction between single missing items, which were assumed to satisfy the MAR assumption, and missing values due to completely missing questionnaire for which a MNAR mechanism was hypothesized. We explored the sensitivity to possible departures from MAR on gender differences between key indicators and on simple correlations. Results: Up to 39% of follow-up data were missing. Results under MAR reflected that missingness was related to poorer health status. Correlations between variables, although very small, changed according to the imputation method, as well as the differences in scores by gender, indicating a certain sensitivity of the results to the violation of the MAR assumption. Conclusions: The findings confirmed the importance of undertaking this kind of analysis in end-of-life care studies.
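
One common way to encode an assumption of this kind is a delta adjustment: impute under MAR, then shift the imputed values for wholly missing questionnaires by a fixed amount. The abstract does not say that the ACTION analysis used this device, so the sketch below is only an illustration under that assumption. The function name, the toy scores, the delta values, and the normal draw that stands in for a real MAR imputation model are all hypothetical.

```python
import random
import statistics


def delta_adjusted_mean(scores, whole_form_missing, delta, seed=0):
    """Mean score after filling gaps and shifting some imputations by delta.

    scores: floats, with None where the value is missing.
    whole_form_missing: True where the entire questionnaire is missing
        (treated as MNAR); False for single missing items (treated as MAR).
    delta: shift applied only to imputations for wholly missing forms.
    """
    rng = random.Random(seed)
    observed = [s for s in scores if s is not None]
    mu, sd = statistics.fmean(observed), statistics.stdev(observed)

    completed = []
    for s, whole in zip(scores, whole_form_missing):
        if s is not None:
            completed.append(s)
        else:
            draw = rng.gauss(mu, sd)  # stand-in for a MAR imputation draw
            completed.append(draw + (delta if whole else 0.0))
    return statistics.fmean(completed)


# Toy quality-of-life scores; None marks a missing value.
scores = [62.0, 70.5, None, 55.0, None, 81.0, 66.5, None]
whole = [False, False, True, False, False, False, False, True]
for delta in (0.0, -5.0, -10.0):
    print(delta, round(delta_adjusted_mean(scores, whole, delta), 2))
```

Scanning a range of delta values shows how far an estimate moves as the assumed departure from MAR grows, which is the purpose of a sensitivity analysis.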


Data Missingness Patterns in Homicide Datasets: An Applied Test on a Primary Data Set

Violence and Victims, 2020, Vol 35 (4), pp. 589-614, doi:10.1891/vv-d-17-00189

Author(s): Melanie-Angela Neuilly, Ming-Li Hsieh, Alex Kigerl, Zachary K. Hamilton

Keyword(s): Missing Data; Multiple Imputation; Missing At Random; Primary Data; Random Pattern; Validity Of Results; Data Set; Missing Not At Random; Listwise Deletion; The Relationship

Research on homicide missing data conventionally posits a Missing At Random pattern despite the relationship between missing data and clearance. The latter, however, cannot be satisfactorily modeled using variables traditionally available in homicide datasets. For this reason, it has been argued that missingness in homicide data follows a Nonignorable pattern instead. The use of multiple imputation strategies, as recommended in the field for ignorable patterns, would therefore pose a threat to the validity of the results obtained. This study examines missing data mechanisms by using a set of primary data collected in New Jersey. After comparing Listwise Deletion, Multiple Imputation, Propensity Score Matching, and Log-Multiplicative Association Models, our findings underscore that data in homicide datasets are indeed Missing Not At Random.


Weighted multiple imputation of ethnicity data that are missing not at random in primary care databases

International Journal for Population Data Science, 2017, Vol 1 (1), doi:10.23889/ijpds.v1i1.54

Author(s): Tra My Pham, Irene Petersen, James Carpenter, Tim Morris

Keyword(s): Primary Care; Missing Data; Multiple Imputation; Simulation Study; Case Analysis; Missing At Random; Complete Case; Missing Not At Random; Health Records; Ethnicity Data

Background: Ethnicity is an important factor to be considered in health research because of its association with inequality in disease prevalence and the utilisation of healthcare. Ethnicity recording has been incorporated in primary care electronic health records, and hence is available in large UK primary care databases such as The Health Improvement Network (THIN). However, since primary care data are routinely collected for clinical purposes, a large amount of data that are relevant for research including ethnicity is often missing. A popular approach for missing data is multiple imputation (MI). However, the conventional MI method assuming data are missing at random does not give plausible estimates of the ethnicity distribution in THIN compared to the general UK population. This might be due to the fact that ethnicity data in primary care are likely to be missing not at random. Objectives: I propose a new MI method, termed ‘weighted multiple imputation’, to deal with data that are missing not at random in categorical variables. Methods: Weighted MI combines MI and probability weights which are calculated using external data sources. Census summary statistics for ethnicity can be used to form weights in weighted MI such that the correct marginal ethnic breakdown is recovered in THIN. I conducted a simulation study to examine weighted MI when ethnicity data are missing not at random. In this simulation study which resembled a THIN dataset, ethnicity was an independent variable in a survival model alongside other covariates. Weighted MI was compared to the conventional MI and other traditional missing data methods including complete case analysis and single imputation. Results: While a small bias was still present in ethnicity coefficient estimates under weighted MI, it was less severe compared to MI assuming missing at random. Complete case analysis and single imputation were inadequate to handle data that are missing not at random in ethnicity. Conclusions: Although not a total cure, weighted MI represents a pragmatic approach that has potential applications not only in ethnicity but also in other incomplete categorical health indicators in electronic health records.
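
The marginal constraint mentioned in the Methods can be illustrated with simple arithmetic. The sketch below is not the weighted MI algorithm itself, whose details are not given in the abstract. It only computes the category distribution that the missing records would need so that observed plus imputed data match an external target. The function name, the category labels, and all counts are hypothetical.

```python
def missing_category_probs(observed_counts, n_missing, target_props):
    """Distribution needed among missing records to hit target marginals.

    observed_counts: dict mapping category -> count among observed records.
    n_missing: number of records with the category missing (must be > 0).
    target_props: dict mapping category -> external (e.g. census) proportion.
    """
    n_total = sum(observed_counts.values()) + n_missing
    probs = {}
    for cat, target in target_props.items():
        # Records of this category still required to reach the target share.
        needed = target * n_total - observed_counts.get(cat, 0)
        if needed < 0:
            raise ValueError(f"observed count for {cat!r} already exceeds target")
        probs[cat] = needed / n_missing
    return probs


# Toy example: group B is under-recorded relative to the external target.
observed = {"A": 700, "B": 100}
target = {"A": 0.75, "B": 0.25}
print(missing_category_probs(observed, n_missing=200, target_props=target))
```

When the target proportions sum to one, the returned probabilities also sum to one. A target that the observed counts already overshoot cannot be reached by imputation alone, which is why the function raises an error in that case.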


Multiple Imputation Approaches Applied to the Missing Value Problem in Bottom-Up Proteomics

International Journal of Molecular Sciences, 2021, Vol 22 (17), pp. 9650, doi:10.3390/ijms22179650

Author(s): Miranda L. Gardner, Michael A. Freitas

Keyword(s): Multiple Imputation; Missing At Random; Data Sets; Proteomics Data; Missing Not At Random; Differential Abundance; Missing Value; Bottom Up; Missing Value Imputation; Impute Data

Analysis of differential abundance in proteomics data sets requires careful application of missing value imputation. Missing abundance values vary widely when performing comparisons across different sample treatments. For example, one would expect a consistent rate of “missing at random” (MAR) across batches of samples and varying rates of “missing not at random” (MNAR) depending on the inherent difference in sample treatments within the study. The missing value imputation strategy selected must therefore be the one that best accounts for both MAR and MNAR simultaneously. Two important issues must be considered when deciding on the appropriate missing value imputation strategy: (1) when it is appropriate to impute data; (2) how to choose a method that reflects the combinatorial manner of MAR and MNAR that occurs in an experiment. This paper provides an evaluation of missing value imputation strategies used in proteomics and presents a case for the use of hybrid left-censored missing value imputation approaches that can handle the MNAR problem common to proteomics data.


A framework for testing different imputation methods for tabular datasets

2019, doi:10.1101/773762

Author(s): Tabea Kossen, Michelle Livne, Vince I Madai, Ivana Galinovic, Dietmar Frey, ...

Keyword(s): Linear Model; Missing Values; Mean Squared Error; Missing At Random; Imputation Method; Similar Data; Missing Value; Imputation Methods; Listwise Deletion; Clinical Dataset

Background and purpose: Handling missing values is a prevalent challenge in the analysis of clinical data. The rise of data-driven models demands an efficient use of the available data. Methods to impute missing values are thus crucial. Here, we developed a publicly available framework to test different imputation methods and compared their impact in a typical stroke clinical dataset as a use case. Methods: A clinical dataset based on the 1000Plus stroke study with 380 patients with complete entries was used. 13 common clinical parameters including numerical and categorical values were selected. Missing values in a missing-at-random (MAR) and missing-completely-at-random (MCAR) fashion from 0% to 60% were simulated and subsequently imputed using the mean, hot-deck, multiple imputation by chained equations, the expectation maximization method, and listwise deletion. The performance was assessed by the root mean squared error, the absolute bias, and the performance of a linear model for discharge mRS prediction. Results: Listwise deletion was the worst performing method and started to be significantly worse than any imputation method from 2% (MAR) and 3% (MCAR) missing values onward. The underlying missing value mechanism seemed to have a crucial influence on the identified best performing imputation method. Consequently, no single imputation method outperformed all others. A significant performance drop of the linear model started from 11% (MAR+MCAR) and 18% (MCAR) missing values. Conclusions: In the presented case study of a typical clinical stroke dataset we confirmed that listwise deletion should be avoided for dealing with missing values. Our findings indicate that the underlying missing value mechanism and other dataset characteristics strongly influence the best choice of imputation method. For future studies with similar data structure, we thus suggest using the framework developed in this study to select the most suitable imputation method for a given dataset prior to analysis.
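
The evaluation loop described in the Methods (delete known values, impute, score against the truth) can be sketched as follows. This is not the authors' published framework. The helper names, the median-split MAR rule, the toy `age` and `severity` variables, and the use of mean imputation alone are assumptions chosen to keep the example short.

```python
import math
import random
import statistics


def mask_mcar(values, rate, rng):
    """Delete each entry with the same probability `rate`."""
    return [None if rng.random() < rate else v for v in values]


def mask_mar(values, driver, rate, rng):
    """Delete entries more often when the fully observed `driver` is high."""
    cut = statistics.median(driver)
    return [
        None if rng.random() < (1.5 * rate if d > cut else 0.5 * rate) else v
        for v, d in zip(values, driver)
    ]


def mean_impute(masked):
    """Replace every missing entry with the mean of the observed entries."""
    mu = statistics.fmean([v for v in masked if v is not None])
    return [mu if v is None else v for v in masked]


def rmse_on_missing(truth, masked, imputed):
    """Root mean squared error over the entries that were deleted."""
    errs = [(t - i) ** 2 for t, m, i in zip(truth, masked, imputed) if m is None]
    return math.sqrt(statistics.fmean(errs)) if errs else 0.0


# Toy data: two correlated variables for 380 simulated patients.
rng = random.Random(0)
age = [rng.gauss(70, 10) for _ in range(380)]
severity = [0.1 * a + rng.gauss(0, 2) for a in age]

for rate in (0.1, 0.3, 0.6):
    for name, masked in (
        ("MCAR", mask_mcar(severity, rate, rng)),
        ("MAR", mask_mar(severity, age, rate, rng)),
    ):
        score = rmse_on_missing(severity, masked, mean_impute(masked))
        print(f"{name} rate={rate:.1f} rmse={score:.2f}")
```

Other imputation methods can be compared by swapping them in for `mean_impute` while keeping the masks and the scoring function fixed, so that every method is judged on the same deleted entries.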


Handling missing data in an FFQ: multiple imputation and nutrient intake estimates

Public Health Nutrition, 2019, Vol 22 (8), pp. 1351-1360, doi:10.1017/s1368980019000168 (Cited By ~ 1)

Author(s): Mari Ichikawa, Akihiro Hosono, Yuya Tamai, Miki Watanabe, Kiyoshi Shibata, ...

Keyword(s): Missing Data; Multiple Imputation; Nutrient Intake; Missing Values; Personal Characteristics; Missing Not At Random; Food Items; Self Administered Questionnaire; Better Than

Objective: We aimed to examine missing data in FFQ and to assess the effects on estimating dietary intake by comparing between multiple imputation and zero imputation. Design: We used data from the Okazaki Japan Multi-Institutional Collaborative Cohort (J-MICC) study. A self-administered questionnaire including an FFQ was implemented at baseline (FFQ1) and 5-year follow-up (FFQ2). Missing values in FFQ2 were replaced by corresponding FFQ1 values, multiple imputation and zero imputation. Setting: A methodological sub-study of the Okazaki J-MICC study. Participants: Of a total of 7585 men and women aged 35–79 years at baseline, we analysed data for 5120 participants who answered all items in FFQ1 and at least 50% of items in FFQ2. Results: Among 5120 participants, the proportion of missing data was 3·7%. The increasing number of missing food items in FFQ2 varied with personal characteristics. Missing food items not eaten often in FFQ2 were likely to represent zero intake in FFQ1. Most food items showed that the observed proportion of zero intake was likely to be similar to the probability that the missing value is zero intake. Compared with FFQ1 values, multiple imputation had smaller differences of total energy and nutrient estimates, except for alcohol, than zero imputation. Conclusions: Our results indicate that missing values due to zero intake, namely missing not at random, in FFQ can be predicted reasonably well from observed data. Multiple imputation performed better than zero imputation for most nutrients and may be applied to FFQ data when missing is low.


Estimating Average Treatment Effects Utilizing Fractional Imputation when Confounders are Subject to Missingness

Journal of Causal Inference, 2020, Vol 8 (1), pp. 249-271, doi:10.1515/jci-2019-0024

Author(s): Nathan Corder, Shu Yang

Keyword(s): Multiple Imputation; Treatment Effects; Missing Values; Missing At Random; Average Treatment Effect; Asymptotically Normal; Average Treatment; Accuracy And Precision; Causal Treatment; Average Treatment Effects

The problem of missingness in observational data is ubiquitous. When the confounders are missing at random, multiple imputation is commonly used; however, the method requires congeniality conditions for valid inferences, which may not be satisfied when estimating average causal treatment effects. Alternatively, fractional imputation, proposed by Kim (2011), has been implemented to handle missing values in the regression context. In this article, we develop fractional imputation methods for estimating the average treatment effects with confounders missing at random. We show that the fractional imputation estimator of the average treatment effect is asymptotically normal, which permits a consistent variance estimate. Via a simulation study, we compare fractional imputation’s accuracy and precision with that of multiple imputation.


Multiple Imputation with Missing Indicators as Proxies for Unmeasured Variables: Simulation Study

2020, doi:10.21203/rs.3.rs-24268/v3

Author(s): Matthew Sperrin, Glen P. Martin

Keyword(s): Missing Data; Multiple Imputation; Simulation Study; Causal Effect; Missing At Random; Directed Acyclic Graphs; Missing Not At Random; Routinely Collected Health Data; Effect Estimation; Minimal Bias

Background: Within routinely collected health data, missing data for an individual might provide useful information in itself. This occurs, for example, in the case of electronic health records, where the presence or absence of data is informative. While the naive use of missing indicators to try to exploit such information can introduce bias, its use in conjunction with multiple imputation may unlock the potential value of missingness to reduce bias in causal effect estimation, particularly in missing not at random scenarios and where missingness might be associated with unmeasured confounders. Methods: We conducted a simulation study to determine when the use of a missing indicator, combined with multiple imputation, would reduce bias for causal effect estimation, under a range of scenarios including unmeasured variables, missing not at random, and missing at random mechanisms. We use directed acyclic graphs and structural models to elucidate a variety of causal structures of interest. We handled missing data using complete case analysis, and multiple imputation with and without missing indicator terms. Results: We find that multiple imputation combined with a missing indicator gives minimal bias for causal effect estimation in most scenarios. In particular the approach: 1) does not introduce bias in missing (completely) at random scenarios; 2) reduces bias in missing not at random scenarios where the missing mechanism depends on the missing variable itself; and 3) may reduce or increase bias when unmeasured confounding is present. Conclusion: In the presence of missing data, careful use of missing indicators, combined with multiple imputation, can improve causal effect estimation when missingness is informative, and is not detrimental when missingness is at random.


Using the CES-D scale in a large cohort study and dealing with missing data: Application to the French E3N cohort

European Psychiatry, 2011, Vol 26 (S2), pp. 572-572, doi:10.1016/s0924-9338(11)72279-9

Author(s): N. Resseguier, H. Verdoux, F. Clavel-Chapelon, X. Paoletti

Keyword(s): Sensitivity Analysis; Missing Data; Multiple Imputation; Missing Values; Large Population; Missing At Random; Population Based; Missing Value; Perform Sensitivity Analysis; The Impact

Introduction: The CES-D scale is commonly used to assess depressive symptoms (DS) in large population-based studies. Missing values in items of the scale may create biases. Objectives: To explore reasons for not completing items of the CES-D scale and to perform sensitivity analysis of the prevalence of DS to assess the impact of different missing data hypotheses. Methods: In 2005, 71412 women included in the French E3N cohort returned a questionnaire containing the CES-D scale. 45% presented at least one missing value in the scale. An interview study was carried out on a random sample of 204 participants to examine the different hypotheses for the missing value mechanism. The prevalence of DS was estimated according to different methods for handling missing values: complete case analysis, single imputation, multiple imputation under MAR (missing at random) and MNAR (missing not at random) assumptions. Results: The interviews showed that participants were not embarrassed to fill in questions about DS. Potential reasons for nonresponse were identified. MAR and MNAR hypotheses remained plausible and were explored. Among complete responders, the prevalence of DS was 26.1%. After multiple imputation under the MAR assumption, it was 28.6%, 29.8% and 31.7% among women presenting up to 4, up to 10 and up to 20 missing values, respectively. The estimates were robust after applying various scenarios of MNAR data for the sensitivity analysis. Conclusions: The CES-D scale can easily be used to assess DS in large cohorts. Multiple imputation under the MAR assumption allows missing values to be handled reliably.


Handling incomplete smoking history data in survival analysis

Statistical Methods in Medical Research, 2014, Vol 26 (2), pp. 707-723, doi:10.1177/0962280214556794 (Cited By ~ 5)

Author(s): Kyoji Furukawa, Dale L. Preston, Munechika Misumi, Harry M. Cullings

Keyword(s): Multiple Imputation; Missing Values; Atomic Bomb; Missing At Random; Smoking Initiation; Smoking History; Time Varying; Smoking Intensity; Atomic Bomb Survivors; Relevant Variables

While data are unavoidably missing or incomplete in most observational studies, consequences of mishandling such incompleteness in analysis are often overlooked. When time-varying information is collected irregularly and infrequently over a long period, even precisely obtained data may implicitly involve substantial incompleteness. Motivated by an analysis to quantitatively evaluate the effects of smoking and radiation on lung cancer risks among Japanese atomic-bomb survivors, we provide a unique application of multiple imputation to incompletely observed smoking histories under the assumption of missing at random. Predicting missing values for the age of smoking initiation and, given initiation, smoking intensity and cessation age, analyses can be based on complete, though partially imputed, smoking histories. A simulation study shows that multiple imputation appropriately conditioned on the outcome and other relevant variables can produce consistent estimates when data are missing at random. Our approach is particularly appealing in large cohort studies where a considerable amount of time-varying information is incomplete under a mechanism depending in a complex manner on other variables. In application to the motivating example, this approach is expected to reduce estimation bias that might be unavoidable in naive analyses, while keeping efficiency by retaining known information.
