Adjusting for selection bias due to missing data in electronic health records-based research

While electronic health records data provide unique opportunities for research, numerous methodological issues must be considered. Among these, selection bias due to incomplete/missing data has received far less attention than other issues. Unfortunately, standard missing data approaches (e.g., inverse-probability weighting and multiple imputation) generally fail to acknowledge the complex interplay of heterogeneous decisions made by patients, providers, and health systems that govern whether specific data elements in the electronic health records are observed. This, in turn, makes the missing-at-random assumption on which these standard approaches rely difficult to justify. In the clinical literature, the collection of decisions that gives rise to the observed data is referred to as the data provenance. Building on a recently proposed framework for modularizing the data provenance, we develop a general and scalable framework for estimation and inference with respect to regression models based on inverse-probability weighting that allows for a hierarchy of missingness mechanisms to better align with the complex nature of electronic health records data. We show that the proposed estimator is consistent and asymptotically Normal, derive the form of the asymptotic variance, and propose two consistent estimators. Simulations show that naïve application of standard methods may yield biased point estimates, that the proposed estimators have good small-sample properties, and that researchers may have to contend with a bias-variance trade-off as they consider how to handle missing data. The proposed methods are motivated by an ongoing electronic health records-based study of bariatric surgery.

Robust inference when combining inverse-probability weighting and multiple imputation to address missing data with application to an electronic health records-based study of bariatric surgery

The Annals of Applied Statistics ◽ 10.1214/20-aoas1386 ◽ 2021 ◽ Vol 15 (1)

Author(s): Tanayott Thaweethai ◽ David E. Arterburn ◽ Karen J. Coleman ◽ Sebastien Haneuse

Keyword(s): Bariatric Surgery ◽ Missing Data ◽ Electronic Health Records ◽ Multiple Imputation ◽ Inverse Probability Weighting ◽ Robust Inference ◽ Probability Weighting ◽ Health Records ◽ Inverse Probability ◽ Electronic Health

Inverse-probability weighting and multiple imputation for evaluating selection bias in the estimation of childhood obesity prevalence using data from electronic health records

BMC Medical Informatics and Decision Making ◽ 10.1186/s12911-020-1020-8 ◽ 2020 ◽ Vol 20 (1) ◽ Cited By ~ 3

Author(s): Carmen Sayon-Orea ◽ Conchi Moreno-Iribas ◽ Josu Delfrade ◽ Manuela Sanchez-Echenique ◽ Pilar Amiano ◽ ...

Keyword(s): Childhood Obesity ◽ Electronic Health Records ◽ Multiple Imputation ◽ Selection Bias ◽ Inverse Probability Weighting ◽ Probability Weighting ◽ Obesity Prevalence ◽ Health Records ◽ Inverse Probability ◽ Using Data

On weighting approaches for missing data

Statistical Methods in Medical Research ◽ 10.1177/0962280211403597 ◽ 2011 ◽ Vol 22 (1) ◽ pp. 14-30 ◽ Cited By ~ 36

Author(s): Lingling Li ◽ Changyu Shen ◽ Xiaochun Li ◽ James M Robins

Keyword(s): Missing Data ◽ Selection Bias ◽ Inverse Probability Weighting ◽ Probability Weighting ◽ Full Data ◽ Inverse Probability ◽ Intuitive Idea ◽ Complex Settings ◽ Conceptual Overview

We review the class of inverse probability weighting (IPW) approaches for the analysis of missing data under various missing data patterns and mechanisms. The IPW methods rely on the intuitive idea of creating a pseudo-population of weighted copies of the complete cases to remove selection bias introduced by the missing data. However, different weighting approaches are required depending on the missing data pattern and mechanism. We begin with a uniform missing data pattern (i.e. a scalar missing indicator indicating whether or not the full data is observed) to motivate the approach. We then generalise to more complex settings. Our goal is to provide a conceptual overview of existing IPW approaches and illustrate the connections and differences among these approaches.
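
The pseudo-population idea for a scalar missingness indicator can be illustrated with a small simulation. The following is a minimal sketch and is not taken from the reviewed paper: the data-generating model, the variable names, and the helper functions fit_logistic and ipw_mean are all hypothetical choices made for illustration. It assumes the outcome is missing at random given a fully observed covariate x and that the logistic model for the observation probability is correctly specified.

```python
import numpy as np


def fit_logistic(X, r, n_iter=25):
    """Fit P(R = 1 | X) by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (r - p)                          # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])   # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def ipw_mean(y_obs, r, pi_hat):
    """Weighted (Hajek-type) mean: each complete case counts 1 / pi_hat times."""
    w = r / pi_hat
    y_filled = np.where(r == 1, y_obs, 0.0)  # unobserved values get weight 0
    return np.sum(w * y_filled) / np.sum(w)


# Hypothetical simulation: the full-data mean of y is 2 by construction.
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
y = 2.0 + x + rng.normal(size=n)

# Scalar missingness indicator: y is more likely to be observed when x is large.
pi_true = 1.0 / (1.0 + np.exp(-(0.5 + x)))
r = rng.binomial(1, pi_true)
y_obs = np.where(r == 1, y, np.nan)

# Estimate the observation probabilities from the fully observed covariate.
X = np.column_stack([np.ones(n), x])
pi_hat = 1.0 / (1.0 + np.exp(-(X @ fit_logistic(X, r))))

cc_mean = y_obs[r == 1].mean()        # complete-case mean, subject to selection bias
ipw_est = ipw_mean(y_obs, r, pi_hat)  # pseudo-population mean

print(f"complete-case mean: {cc_mean:.3f}")
print(f"IPW mean:           {ipw_est:.3f}")
```

In this setup the complete-case mean overstates the full-data mean because observed subjects tend to have larger x, while the weighted mean should land close to 2. More complex missing data patterns would require different weights, as the abstract notes.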

Investigating Bias from Missing Data in an Electronic Health Records-Based Study of Weight Loss After Bariatric Surgery

Obesity Surgery ◽ 10.1007/s11695-021-05226-y ◽ 2021

Author(s): Lily Koffman ◽ Alexander W. Levis ◽ David Arterburn ◽ Karen J. Coleman ◽ Lisa J. Herrinton ◽ ...

Keyword(s): Bariatric Surgery ◽ Weight Loss ◽ Missing Data ◽ Electronic Health Records ◽ Health Records ◽ Electronic Health

Inverse probability weighting is an effective method to address selection bias during the analysis of high dimensional data

Genetic Epidemiology ◽ 10.1002/gepi.22418 ◽ 2021

Author(s): Patrick M. Carry ◽ Lauren A. Vanderlinden ◽ Fran Dong ◽ Teresa Buckner ◽ Elizabeth Litkowski ◽ ...

Keyword(s): Selection Bias ◽ High Dimensional Data ◽ Inverse Probability Weighting ◽ High Dimensional ◽ Probability Weighting ◽ Inverse Probability

Contextualizing selection bias in Mendelian randomization: how bad is it likely to be?

International Journal of Epidemiology ◽ 10.1093/ije/dyy202 ◽ 2018 ◽ Vol 48 (3) ◽ pp. 691-701 ◽ Cited By ~ 33

Author(s): Apostolos Gkatzionis ◽ Stephen Burgess

Keyword(s): Risk Factor ◽ Selection Bias ◽ Simulation Study ◽ Cardiovascular Mortality ◽ Mendelian Randomization ◽ Inverse Probability Weighting ◽ Probability Weighting ◽ Inverse Probability ◽ Type 1 Error

Background: Selection bias affects Mendelian randomization investigations when selection into the study sample depends on a collider between the genetic variant and confounders of the risk factor–outcome association. However, the relative importance of selection bias for Mendelian randomization compared with other potential biases is unclear. Methods: We performed an extensive simulation study to assess the impact of selection bias on a typical Mendelian randomization investigation. We considered inverse probability weighting as a potential method for reducing selection bias. Finally, we investigated whether selection bias may explain a recently reported finding that lipoprotein(a) is not a causal risk factor for cardiovascular mortality in individuals with previous coronary heart disease. Results: Selection bias had a severe impact on bias and Type 1 error rates in our simulation study, but only when selection effects were large. For moderate effects of the risk factor on selection, bias was generally small and Type 1 error rate inflation was not considerable. Inverse probability weighting ameliorated bias when the selection model was correctly specified, but increased bias when selection bias was moderate and the model was misspecified. In the example of lipoprotein(a), strong genetic associations and strong confounder effects on selection mean the reported null effect on cardiovascular mortality could plausibly be explained by selection bias. Conclusions: Selection bias can adversely affect Mendelian randomization investigations, but its impact is likely to be less than other biases. Selection bias is substantial when the effects of the risk factor and confounders on selection are particularly large.

Privacy by Data Provenance with Digital Watermarking - A Proof-of-Concept Implementation for Medical Services with Electronic Health Records

2010 Sixth International Conference on Intelligent Information Hiding and Multimedia Signal Processing ◽ 10.1109/iihmsp.2010.130 ◽ 2010 ◽ Cited By ~ 4

Author(s): Jeremie Tharaud ◽ Sven Wohlgemuth ◽ Isao Echizen ◽ Noboru Sonehara ◽ Gunter Muller ◽ ...

Keyword(s): Electronic Health Records ◽ Digital Watermarking ◽ Medical Services ◽ Data Provenance ◽ Proof Of Concept ◽ Health Records ◽ Electronic Health

Statistical inference for association studies using electronic health records: handling both selection bias and outcome misclassification

Biometrics ◽ 10.1111/biom.13400 ◽ 2020 ◽ Cited By ~ 1

Author(s): Lauren J. Beesley ◽ Bhramar Mukherjee

Keyword(s): Electronic Health Records ◽ Statistical Inference ◽ Selection Bias ◽ Association Studies ◽ Health Records ◽ Electronic Health

Learning About Missing Data Mechanisms in Electronic Health Records-based Research

Epidemiology ◽ 10.1097/ede.0000000000000393 ◽ 2016 ◽ Vol 27 (1) ◽ pp. 82-90 ◽ Cited By ~ 8

Author(s): Sebastien Haneuse ◽ Andy Bogart ◽ Ina Jazic ◽ Emily O. Westbrook ◽ Denise Boudreau ◽ ...

Keyword(s): Missing Data ◽ Electronic Health Records ◽ Health Records ◽ Electronic Health

Responsiveness-informed multiple imputation and inverse probability-weighting in cohort studies with missing data that are non-monotone or not missing at random

Statistical Methods in Medical Research ◽ 10.1177/0962280216628902 ◽ 2016 ◽ Vol 27 (2) ◽ pp. 352-363 ◽ Cited By ~ 8

Author(s): James C Doidge

Keyword(s): Missing Data ◽ Data Collection ◽ Cohort Studies ◽ Multiple Imputation ◽ Missing At Random ◽ Inverse Probability Weighting ◽ Probability Weighting ◽ Inverse Probability ◽ Not Missing At Random ◽ Over Time

Population-based cohort studies are invaluable to health research because of the breadth of data collection over time and the representativeness of their samples. However, they are especially prone to missing data, which can compromise the validity of analyses when data are not missing at random. Having many waves of data collection presents an opportunity for participants’ responsiveness to be observed over time, which may be informative about missing data mechanisms and thus useful as an auxiliary variable. Modern approaches to handling missing data, such as multiple imputation and maximum likelihood, can be difficult to implement with the large numbers of auxiliary variables and large amounts of non-monotone missing data that occur in cohort studies. Inverse probability-weighting can be easier to implement, but conventional wisdom has stated that it cannot be applied to non-monotone missing data. This paper describes two methods of applying inverse probability-weighting to non-monotone missing data, and explores the potential value of including measures of responsiveness in either inverse probability-weighting or multiple imputation. Simulation studies are used to compare methods and demonstrate that responsiveness in longitudinal studies can be used to mitigate bias induced by missing data, even when data are not missing at random.