Characterizing and Managing Missing Structured Data in Electronic Health Records

2017
Author(s):  
Brett K. Beaulieu-Jones
Daniel R. Lavage
John W. Snyder
Jason H. Moore
Sarah A. Pendergrass
...  

BACKGROUND Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results. OBJECTIVE The objective of this study was to demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered. METHODS We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling). RESULTS Our results showed that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation. CONCLUSIONS The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available.
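As a companion to the procedures summarized above, the sketch below shows one way to simulate a missing completely at random (MCAR) mask on a complete-case matrix and score an imputation against the held-out true values. It is a minimal illustration only, not the authors' released pipeline: the data are synthetic, and scikit-learn's IterativeImputer stands in for the MICE variants evaluated in the paper.

```python
# Minimal sketch (not the authors' released code): simulate MCAR missingness on a
# complete numeric matrix, impute with a chained-equations imputer, and measure
# error on the masked entries. IterativeImputer is used as a MICE-style stand-in.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Toy stand-in for a complete-case matrix of clinical lab measures (patients x labs).
complete = rng.normal(loc=100.0, scale=15.0, size=(500, 8))

# Missing completely at random: mask 20% of entries irrespective of their values.
mask = rng.random(complete.shape) < 0.20
observed = complete.copy()
observed[mask] = np.nan

# Chained-equations imputation: each column is iteratively modeled from the others.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
imputed = imputer.fit_transform(observed)

# Score only the entries that were masked out.
rmse = np.sqrt(np.mean((imputed[mask] - complete[mask]) ** 2))
print(f"RMSE on simulated MCAR entries: {rmse:.3f}")
```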


2019, Vol 6 (339), pp. 73-98
Author(s):  
Małgorzata Aleksandra Misztal

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not confined to any particular scientific domain; it arises in economics, sociology, education, the behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete the incomplete observations. However, this leads to inefficient and biased analysis results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. In this paper, selected multiple imputation methods were taken into account. Special attention was paid to using principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA-based imputations as compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Missing values were then imputed with MICE, missForest and the PCA-based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as the multiple imputation method providing the lowest imputation error rates for all types of missingness, whereas PCA-based imputation does not perform well in terms of accuracy.
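For reference, a hedged sketch of the normalised root mean square error used to score imputation accuracy is given below. It follows the common missForest-style definition (error normalised by the variance of the true values over the imputed entries), which may differ in detail from the exact formula used in the paper.

```python
# Hedged sketch of NRMSE over imputed entries (missForest-style normalisation by
# the variance of the true values); the paper's exact definition may differ.
import numpy as np

def nrmse(x_true: np.ndarray, x_imputed: np.ndarray, missing_mask: np.ndarray) -> float:
    """NRMSE computed only over entries that were missing and then imputed."""
    diff = x_true[missing_mask] - x_imputed[missing_mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(x_true[missing_mask])))
```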


2018, Vol 22 (4), pp. 391-409
Author(s):  
John M. Roberts
Aki Roberts
Tim Wadsworth

Incident-level homicide datasets such as the Supplementary Homicide Reports (SHR) commonly exhibit missing data. We evaluated multiple imputation methods (which produce multiple completed datasets, across which imputed values may vary) using unique data that included the actual values, drawn from police agency incident reports, of seemingly missing SHR entries. This permitted evaluation under a real, rather than assumed or simulated, missing data mechanism. We compared analytic results based on multiply imputed and actual data; multiple imputation quite successfully recovered the victim–offender relationship distributions and regression coefficients that hold in the actual data. The results are encouraging for users of multiple imputation, though it remains important to minimize the extent of missing information in the SHR and similar data.
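The sketch below illustrates, on synthetic data rather than the SHR, what "multiple completed datasets" means in practice: several imputations are drawn, the same regression is fitted to each, and the coefficients and their variances are pooled with Rubin's rules. The data, model, and imputer choice are all assumptions for illustration.

```python
# Illustrative multiple-imputation workflow (synthetic data, not the SHR analysis):
# draw several completed datasets, fit the same regression on each, and pool the
# estimates with Rubin's rules.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, p = 400, 3
X = rng.normal(size=(n, p))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.5, size=n)

X_obs = X.copy()
X_obs[rng.random(X.shape) < 0.15] = np.nan   # knock out 15% of predictor values

m = 5                                        # number of completed datasets
coefs, variances = [], []
for seed in range(m):
    X_imp = IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X_obs)
    model = LinearRegression().fit(X_imp, y)
    resid = y - model.predict(X_imp)
    sigma2 = resid @ resid / (n - p - 1)
    design = np.c_[np.ones(n), X_imp]
    cov = sigma2 * np.linalg.inv(design.T @ design)
    coefs.append(model.coef_)
    variances.append(np.diag(cov)[1:])       # drop the intercept term

coefs, variances = np.array(coefs), np.array(variances)
pooled_coef = coefs.mean(axis=0)                     # pooled point estimates
within = variances.mean(axis=0)                      # average within-imputation variance
between = coefs.var(axis=0, ddof=1)                  # between-imputation variance
pooled_se = np.sqrt(within + (1 + 1 / m) * between)  # Rubin's rules total variance
print(pooled_coef, pooled_se)
```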


2017, Vol 25 (3), pp. 951-959
Author(s):  
Gregor Stiglic
Primoz Kocbek
Nino Fijacko
Aziz Sheikh
Majda Pajnkihar

The increasing availability of data stored in electronic health records brings substantial opportunities for advancing patient care and population health. This is, however, fundamentally dependent on the completeness and quality of the data in these electronic health records. We sought to use electronic health record data to populate a risk prediction model for identifying patients with undiagnosed type 2 diabetes mellitus. We, however, found substantial (up to 90%) amounts of missing data in some healthcare centres. Attempts to impute these missing data, or to use a reduced dataset created by removing incomplete records, resulted in a major deterioration in the performance of the prediction model. By simulating missing and incomplete records in the predictive modelling process, this case study illustrates the substantial opportunities that are wasted when records are incomplete. Government and professional bodies need to prioritise efforts to address these data shortcomings in order to ensure that electronic health record data are maximally exploited for patient and population benefit.
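The kind of simulation this case study describes can be sketched as follows: as the per-feature missingness rate grows, complete-case filtering discards almost every record and the screening model can no longer be trained reliably. The dataset, model, and rates below are illustrative assumptions, not the diabetes model or EHR data from the study.

```python
# Rough illustration of how rising missingness erodes a prediction model when
# incomplete records are simply dropped. Synthetic data; rates are arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X, y = make_classification(n_samples=4000, n_features=10, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

for rate in (0.0, 0.1, 0.3, 0.5):
    X_missing = X_train.copy()
    X_missing[rng.random(X_train.shape) < rate] = np.nan
    complete_rows = ~np.isnan(X_missing).any(axis=1)
    if complete_rows.sum() < 50:   # too few complete cases left to fit anything useful
        print(f"rate={rate:.1f}: only {complete_rows.sum()} complete cases remain")
        continue
    clf = LogisticRegression(max_iter=1000).fit(X_missing[complete_rows], y_train[complete_rows])
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"rate={rate:.1f}: {complete_rows.sum()} complete cases, AUC={auc:.3f}")
```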


2018
Author(s):  
Kieu Trinh Do
Simone Wahl
Johannes Raffler
Sophie Molnos
Michael Laimighofer
...  

BACKGROUND Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in epidemiological studies. However, a systematic assessment of the various sources of missing values and of strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD), or at random, for instance as a consequence of sample preparation. METHODS We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and its ability to increase statistical power while preserving the strength of established metabolite quantitative trait loci. RESULTS Run day-dependent, LOD-based missing data accounted for most missing values in the metabolomics dataset. Although multiple imputation by chained equations (MICE) performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable. CONCLUSION Missing data in untargeted MS-based metabolomics occur for various reasons. Based on our results, we recommend that KNN-based imputation be performed on observations with variable pre-selection, since it showed robust results in all evaluation schemes. Key messages: Untargeted MS-based metabolomics data show missing values due to both batch-specific LOD-based and non-LOD-based effects. Statistical evaluation of multiple imputation methods was conducted on both simulated and real datasets. Biological evaluation on real data assessed the ability of imputation methods to preserve statistical inference of biochemical pathways and to correctly estimate effects of genetic variants on metabolite levels. KNN-based imputation on observations with variable pre-selection and K = 10 showed robust performance for all data scenarios across all evaluation schemes.
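A minimal sketch of the recommended approach follows: KNN imputation with K = 10 on log-scaled, standardised metabolite intensities. The toy data and the omission of the authors' variable pre-selection step (restricting each imputation to the most highly correlated metabolites) are simplifications.

```python
# Minimal KNN imputation sketch with K = 10 neighbours on scaled metabolite data.
# The variable pre-selection step described by the authors is omitted for brevity.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
intensities = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 30))  # toy metabolite matrix
intensities[rng.random(intensities.shape) < 0.10] = np.nan        # toy missingness

# Log-transform and standardise; StandardScaler ignores NaNs when fitting.
scaler = StandardScaler()
scaled = scaler.fit_transform(np.log(intensities))

imputed_scaled = KNNImputer(n_neighbors=10).fit_transform(scaled)
imputed = np.exp(scaler.inverse_transform(imputed_scaled))        # back to original scale
```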


2018, Vol 39 (1), pp. 88-117
Author(s):  
Julianne M. Edwards
W. Holmes Finch

Missing data is a common problem faced by psychometricians and measurement professionals. To address this issue, a number of techniques have been proposed for handling missing data in Item Response Theory applications. These include several types of data imputation methods (corrected item mean substitution imputation, response function imputation, multiple imputation, and the EM algorithm) as well as approaches that do not rely on the imputation of missing values (treating the item as not presented, or coding missing responses as incorrect or as fractionally correct). Of these methods, multiple imputation has demonstrated the best performance in prior research, yet it still produced relatively high mean absolute error (MAE) in model parameter estimates. Given this, the goal of this simulation study was to explore the performance of a set of potentially promising data imputation methods based on recursive partitioning. Results demonstrated that approaches combining multivariate imputation by chained equations and recursive partitioning algorithms yield data with relatively low estimation MAE for both item difficulty and item discrimination. Implications of these findings are discussed.
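One way to realise a "MICE plus recursive partitioning" imputer is sketched below, under the assumption that a tree-ensemble estimator inside scikit-learn's IterativeImputer is an acceptable analogue of the cart/rf methods in the R mice package; the IRT calibration step itself is not reproduced.

```python
# Hedged sketch: chained-equations imputation driven by a recursive-partitioning
# (tree ensemble) estimator, applied to toy dichotomous item responses. This is
# an analogue of MICE-with-trees, not the study's exact implementation.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(4)
responses = (rng.random((300, 20)) < 0.6).astype(float)  # toy 0/1 item responses
responses[rng.random(responses.shape) < 0.10] = np.nan   # toy ignorable missingness

imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=50, random_state=4),
    max_iter=5,
    random_state=4,
)
completed = (imputer.fit_transform(responses) > 0.5).astype(int)  # round back to 0/1
```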


2008, Vol 47 (01), pp. 8-13
Author(s):  
T. Dostálová
P. Hanzlíček
Z. Teuberová
M. Nagy
M. Pieš
...  

Summary Objectives: To identify support for structured data entry in an electronic health record application for forensic dentistry. Methods: The methods of structuring information in dentistry are described, and structured data entry in electronic health records for forensic dentistry is validated on several real cases using the interactive DentCross component. The connection of this component to the MUDR and MUDRLite electronic health records is described. Results: The use of the electronic health record MUDRLite and the interactive DentCross component to collect the dental information required by Interpol's standardized Disaster Victim Identification Form for possible victim identification is shown. Conclusions: The analysis of structured data entry for dentistry using the DentCross component connected to an electronic health record showed that the DentCross component can deliver a real service to dental care and support the identification of a person in forensic dentistry.


2016, Vol 23 (4), pp. 291-303
Author(s):  
Kostas Pantazos
Søren Lauesen
Søren Lippert

A health record database contains structured data fields that identify the patient, such as patient ID, patient name, e-mail and phone number. These data are fairly easy to de-identify, that is, replace with other identifiers. However, these data also occur in fields with doctors’ free-text notes written in an abbreviated style that cannot be analyzed grammatically. If we replace a word that looks like a name, but isn’t, we degrade readability and medical correctness. If we fail to replace it when we should, we degrade confidentiality. We de-identified an existing Danish electronic health record database, ending up with 323,122 patient health records. We had to invent many methods for de-identifying potential identifiers in the free-text notes. The de-identified health records should be used with caution for statistical purposes because we removed health records that were so special that they couldn’t be de-identified. Furthermore, we distorted geography by replacing zip codes with random zip codes.
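The core trade-off described above (replace too eagerly and you damage readability, too cautiously and you leak identity) can be illustrated with a toy dictionary-based replacement; the name list, pseudonym token, and note text below are hypothetical, not part of the authors' system.

```python
# Toy free-text de-identification heuristic: capitalised tokens found in a
# (hypothetical) name dictionary are replaced with a pseudonym. Real systems need
# many more rules to balance confidentiality against readability.
import re

KNOWN_NAMES = {"Jensen", "Hansen", "Nielsen"}   # hypothetical name dictionary
PSEUDONYM = "[NAME]"

def deidentify(note: str) -> str:
    def replace(match: re.Match) -> str:
        token = match.group(0)
        return PSEUDONYM if token in KNOWN_NAMES else token
    # Capitalised tokens, including Danish letters, are candidate names.
    return re.sub(r"[A-ZÆØÅ][a-zæøå]+", replace, note)

print(deidentify("Pt. seen by Dr. Jensen, amb. fu in 2 wks"))
```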


2017, Vol 8 (1), pp. 155-172
Author(s):  
Duygu Koçak

The present study aims to investigate the effects of model-based missing data methods on the guessing parameter in the case of ignorable missing data. For this purpose, data were generated under the three-parameter logistic (3PL) Item Response Theory model for sample sizes of 500, 1,000 and 3,000; missing values were then created under the missing at random and missing completely at random mechanisms at rates of 2%, 5% and 10%. These missing values were completed using the expectation-maximization (EM) algorithm and multiple imputation. For data sets with values missing completely at random, the performance of both the EM algorithm and multiple imputation depended on the rate of missingness. When the missing value rate was 2%, both methods performed well for all sample sizes; however, their estimates moved away from the reference point as the amount of missing data increased. It was also found that when the sample size was 3,000, estimates remained closer to the reference point even when the amount of missing data was high. Under the missing at random mechanism, both methods recovered the guessing parameter well when the amount of missing data was low, but this performance deteriorated considerably as the amount of missing data increased. Neither the EM algorithm nor multiple imputation handled the guessing parameter effectively under the missing at random mechanism.
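A sketch of the data-generation side of such a study is given below: dichotomous responses are simulated from a 3PL model and MCAR masks are applied at the 2%, 5% and 10% rates. The parameter values are illustrative, and the MAR condition, EM imputation, multiple imputation, and parameter-recovery comparison are not shown.

```python
# Illustrative 3PL response simulation with MCAR masks at the rates used above.
import numpy as np

rng = np.random.default_rng(5)
n_persons, n_items = 1000, 25
theta = rng.normal(size=n_persons)            # person ability
a = rng.uniform(0.8, 2.0, size=n_items)       # item discrimination
b = rng.normal(size=n_items)                  # item difficulty
c = rng.uniform(0.1, 0.25, size=n_items)      # guessing parameter

# 3PL: P(correct) = c + (1 - c) * logistic(a * (theta - b))
p = c + (1 - c) / (1 + np.exp(-a * (theta[:, None] - b)))
responses = (rng.random((n_persons, n_items)) < p).astype(float)

for rate in (0.02, 0.05, 0.10):
    masked = responses.copy()
    masked[rng.random(responses.shape) < rate] = np.nan   # MCAR mask
    print(f"target rate {rate:.0%}: observed missing {np.isnan(masked).mean():.2%}")
```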

