Characterizing and Managing Missing Structured Data in Electronic Health Records: Data Analysis (Preprint)

Author(s):  
Brett K Beaulieu-Jones ◽  
Daniel R Lavage ◽  
John W Snyder ◽  
Jason H Moore ◽  
Sarah A Pendergrass ◽  
...  

BACKGROUND: Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results. OBJECTIVE: The objective of this study was to demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered. METHODS: We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling). RESULTS: Our results showed that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation. CONCLUSIONS: The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available.
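To make the evaluation procedure concrete, here is a minimal sketch of the masking-and-scoring loop the abstract describes, using scikit-learn's IterativeImputer as a stand-in for a MICE-style imputer (the authors' actual pipeline and tooling may differ; the data here are simulated, not Geisinger lab values):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X_complete = rng.normal(size=(1000, 10))     # stand-in for complete-case lab panels

mask = rng.random(X_complete.shape) < 0.2    # MCAR: each cell missing with prob 0.2
X_missing = X_complete.copy()
X_missing[mask] = np.nan

imputer = IterativeImputer(random_state=0)   # chained-equations-style imputation
X_imputed = imputer.fit_transform(X_missing)

# Score only the cells that were masked, against their known true values.
rmse = np.sqrt(np.mean((X_imputed[mask] - X_complete[mask]) ** 2))
print(f"RMSE on masked cells: {rmse:.3f}")
```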

2017 ◽  
Author(s):  
Brett K. Beaulieu-Jones ◽  
Daniel R. Lavage ◽  
John W. Snyder ◽  
Jason H. Moore ◽  
Sarah A Pendergrass ◽  
...  

Abstract: Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. Here, we provide detailed procedures for when and how to conduct imputation of EHR data. We demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered. We analyzed clinical lab measures from 602,366 patients in the Geisinger Health System EHR. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness. Our results show that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation. The analyses described provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available.
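The distinction the abstract draws between single and multiple imputation matters for downstream inference. A hedged sketch of multiple imputation with Rubin's rules, using IterativeImputer with sample_posterior=True so that the m imputations differ (simulated data; not the authors' code):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
X_missing = np.where(rng.random(X.shape) < 0.2, np.nan, X)   # 20% MCAR

def pool(estimates, variances):
    """Rubin's rules: combine m point estimates and within-imputation variances."""
    m = len(estimates)
    q_bar = np.mean(estimates)                 # pooled point estimate
    w = np.mean(variances)                     # within-imputation variance
    b = np.var(estimates, ddof=1)              # between-imputation variance
    return q_bar, w + (1 + 1 / m) * b          # pooled estimate, total variance

m = 5
ests, variances = [], []
for seed in range(m):
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    X_m = imp.fit_transform(X_missing)
    col = X_m[:, 0]                            # example quantity: mean of column 0
    ests.append(col.mean())
    variances.append(col.var(ddof=1) / len(col))

q_bar, t = pool(ests, variances)
print(f"pooled mean: {q_bar:.3f}, total variance: {t:.5f}")
```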


2020 ◽  
Author(s):  
Pietro Di Lena ◽  
Claudia Sala ◽  
Andrea Prodi ◽  
Christine Nardini

Abstract Background: High-throughput technologies enable the cost-effective collection and analysis of DNA methylation data throughout the human genome. This naturally entails the management of missing values, which can complicate the analysis of the data. Several general and specific imputation methods are suitable for DNA methylation data. However, there are no detailed studies of their performance under different missing data mechanisms, (completely) at random or not, and different representations of DNA methylation levels (β-value and M-value). Results: We perform an extensive analysis of the imputation performance of seven imputation methods on simulated missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) methylation data. We further consider imputation performance on the two popular representations of methylation levels, the β-value and the M-value. Overall, β-values enable better imputation performance than M-values. Imputation accuracy is lower for mid-range β-values, while it is generally higher for values at the extremes of the β-value range. The distribution of MAR values is on average denser in the mid-range than the expected β-value distribution; as a consequence, MAR values are on average harder to impute. Conclusions: The results of the analysis provide guidelines for the most suitable imputation approaches for DNA methylation data under different representations of DNA methylation levels and different missing data mechanisms.
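For readers unfamiliar with the two representations, the standard conversions between β- and M-values are deterministic, so an imputation can be carried out in either space and mapped back. A small sketch (the clipping constant is our own numerical guard, not from the paper):

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """M = log2(beta / (1 - beta)); beta in (0, 1), M unbounded."""
    beta = np.clip(beta, eps, 1 - eps)   # guard against division by zero / log of zero
    return np.log2(beta / (1 - beta))

def m_to_beta(m):
    """Inverse map: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)

beta = np.array([0.05, 0.5, 0.95])
m = beta_to_m(beta)
print(m)             # approx [-4.25, 0.0, 4.25]
print(m_to_beta(m))  # recovers the original beta values
```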


Biometrika ◽  
2016 ◽  
Vol 103 (1) ◽  
pp. 175-187 ◽  
Author(s):  
Jun Shao ◽  
Lei Wang

Abstract To estimate unknown population parameters based on data having nonignorable missing values with a semiparametric exponential tilting propensity, Kim & Yu (2011) assumed that the tilting parameter is known or can be estimated from external data, in order to avoid the identifiability issue. To remove this serious limitation on the methodology, we use an instrument, i.e., a covariate related to the study variable but unrelated to the missing data propensity, to construct some estimating equations. Because these estimating equations are semiparametric, we profile the nonparametric component using a kernel-type estimator and then estimate the tilting parameter based on the profiled estimating equations and the generalized method of moments. Once the tilting parameter is estimated, so is the propensity, and then other population parameters can be estimated using the inverse propensity weighting approach. Consistency and asymptotic normality of the proposed estimators are established. The finite-sample performance of the estimators is studied through simulation, and a real-data example is also presented.
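As a reference point, a sketch of the exponential tilting model in our own notation (the paper's exact formulation and sign conventions may differ). With $\delta$ the response indicator, $y$ the study variable, and $x = (u, z)$ the covariates:

$$
\operatorname{logit}\{\pi(u, y)\} = g(u) + \gamma y,
\qquad
f(y \mid x, \delta = 0) = f(y \mid x, \delta = 1)\,
\frac{\exp(-\gamma y)}{E\{\exp(-\gamma y) \mid x, \delta = 1\}},
$$

where $\pi(u, y) = \mathrm{pr}(\delta = 1 \mid x, y)$ is the propensity, $g(\cdot)$ is the unspecified nonparametric component, and $\gamma$ is the tilting parameter. Because $y$ enters the propensity directly, the missingness is nonignorable; the instrument $z$ is excluded from the propensity yet remains related to $y$, which is what restores identifiability of $\gamma$ without external data.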


2021 ◽  
Vol 8 (3) ◽  
pp. 215-226
Author(s):  
Parisa Saeipourdizaj ◽  
Parvin Sarbakhsh ◽  
Akbar Gholampour

Background: In air quality studies, missing data are common, arising from causes such as instrument failure or human error. The approach used to deal with such missing data can affect the results of the analysis. The main aim of this study was to review the types of missingness mechanisms and imputation methods, apply a selection of them to the imputation of missing PM10 and O3 values in Tabriz, and compare their efficiency. Methods: The methods compared were the mean, the EM algorithm, regression, classification and regression trees, predictive mean matching (PMM), interpolation, the moving average, and K-nearest neighbors (KNN). PMM was investigated both with and without spatial and temporal dependencies in the model. Missing data were randomly simulated at rates of 10%, 20%, and 30%. The efficiency of the methods was compared using the coefficient of determination (R2), mean absolute error (MAE), and root mean square error (RMSE). Results: Across all indicators, interpolation, the moving average, and KNN had the best performance, in that order. PMM did not perform well, either with or without spatio-temporal information. Conclusion: Given that pollution measurements always depend on preceding and subsequent values, methods whose computations draw on neighboring observations performed better than the others; for pollutant data, these methods are therefore recommended.
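A hedged sketch of two of the compared methods on a simulated hourly PM10 series (not the study's code or data; the lag construction for KNN is our own choice):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(1)
t = np.arange(24 * 90)                                  # ~3 months of hourly readings
pm10 = 40 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 3, t.size)
s = pd.Series(pm10)
mask = rng.random(t.size) < 0.2                         # simulate 20% missing values
s_missing = s.mask(mask)

# Interpolation: fill gaps linearly between neighbouring observations.
s_interp = s_missing.interpolate(method="linear").ffill().bfill()

# KNN: impute from the 5 nearest rows, using lagged copies of the series as features.
frame = pd.DataFrame({f"lag{k}": s_missing.shift(k) for k in range(3)})
s_knn = pd.Series(KNNImputer(n_neighbors=5).fit_transform(frame)[:, 0])

for name, est in [("interpolation", s_interp), ("KNN", s_knn)]:
    rmse = float(np.sqrt(np.mean((est[mask] - s[mask]) ** 2)))
    print(f"{name}: MAE={mean_absolute_error(s[mask], est[mask]):.2f} "
          f"RMSE={rmse:.2f} R2={r2_score(s[mask], est[mask]):.3f}")
```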


2019 ◽  
Author(s):  
Bret Beheim ◽  
Quentin Atkinson ◽  
Joseph Bulbulia ◽  
Will M Gervais ◽  
Russell Gray ◽  
...  

Whitehouse, et al. have recently used the Seshat archaeo-historical databank to argue that beliefs in moralizing gods appear in world history only after the formation of complex “megasocieties” of around one million people. Inspection of the authors’ data, however, shows that 61% of Seshat data points on moralizing gods are missing values, mostly from smaller populations below one million people, and during the analysis the authors re-coded these data points to signify the absence of moralizing gods beliefs. When we confine the analysis only to the extant data or use various standard imputation methods, the reported finding is reversed: moralizing gods precede increases in social complexity. We suggest that the reported “megasociety threshold” for the emergence of moralizing gods is thus solely a consequence of the decision to re-code nearly two-thirds of Seshat data from unknown values to known absences of moralizing gods.
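A toy illustration of the coding decision at issue (hypothetical numbers, not Seshat data): treating unknowns as known absences mechanically lowers the estimated prevalence, whereas confining attention to extant data does not.

```python
import numpy as np
import pandas as pd

# Moralizing gods: 1 = recorded present, NaN = unknown (missing value).
g = pd.Series([1, 1, np.nan, np.nan, np.nan])
print("extant data only:", g.dropna().mean())   # 1.0 -> present wherever observed
print("NaN recoded to 0:", g.fillna(0).mean())  # 0.4 -> mostly 'absent' by construction
```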


2020 ◽  
Author(s):  
David N Borg ◽  
Robert Nguyen ◽  
Nicholas J Tierney

Missing data are often unavoidable. The reason values go missing, along with decisions about how missing data are handled (deleted or imputed), can have a profound effect on the validity and accuracy of study results. In this article, we aimed to: estimate the proportion of studies in football research that included a missing data statement, highlight several practices to avoid in relation to missing data, and provide recommendations for exploring, visualising and reporting missingness. Football-related articles published in 2019 were studied. A survey of 136 articles, sampled at random, was conducted to determine whether a missing data statement was included. As expected, the proportion of studies in football research that included a missing data statement was low, at only 11.0% (95% CI: 6.3% to 17.5%); suggesting that missingness is seldom considered by researchers. We recommend that researchers describe the number and percentage of missing values, including when there are no missing values. Exploratory analysis should be conducted to explore missing values, and visualisations describing missingness overall should be provided in the paper, or at least in supplementary materials. Missing values should almost always be imputed, and imputation methods should be explored to ensure they are appropriately representative. Researchers should consider these recommendations, and pay greater attention to missing data and their influence on research results.
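In the spirit of the reporting recommendation, a minimal pandas sketch of a per-variable missingness summary (the variable names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sprint_distance_m": [312.0, np.nan, 287.0, 301.0],
    "heart_rate_max":    [188.0, 191.0, np.nan, np.nan],
    "match_minutes":     [90.0, 90.0, 78.0, 90.0],
})

# Report counts and percentages, even for variables with no missing values.
summary = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": (100 * df.isna().mean()).round(1),
})
print(summary)
```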


2021 ◽  
Author(s):  
Eric Venner ◽  
Victoria Yi ◽  
David Murdock ◽  
Sara E. Kalla ◽  
Tsung-Jung Wu ◽  
...  

Abstract Purpose: Genomic medicine holds great promise for improving healthcare, but integrating searchable and actionable genetic data into electronic health records (EHRs) remains a challenge. Here, we describe Neptune, a system for managing the interaction between a clinical laboratory and an electronic health record system. Methods: We developed Neptune and applied it to two clinical sequencing projects that required report customization, variant reanalysis, and EHR integration. Results: Neptune enabled the analysis of data for over 15,000 clinical genomic reports and their generation and delivery to EHR systems. These projects demanded customizable clinical reports that contained a variety of genetic data types, including SNVs, CNVs, pharmacogenomics, and polygenic risk scores. Two variant reanalysis activities were also supported, highlighting this important workflow. Conclusions: Methods are needed for delivering structured genetic data to EHRs. This need extends beyond developing data formats to providing infrastructure that manages the reporting process itself. Neptune was successfully applied in two high-throughput clinical sequencing projects to build and deliver clinical reports to EHR systems. The software is open and available at https://gitlab.com/bcm-hgsc/neptune.


2019 ◽  
Vol 6 (339) ◽  
pp. 73-98
Author(s):  
Małgorzata Aleksandra Misztal

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not tied to any particular scientific domain; it arises in economics, sociology, education, the behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete them. However, this leads to ineffective and biased analysis results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. In the paper, selected multiple imputation methods were taken into account. Special attention was paid to using principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA-based imputations compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Then, missing values were imputed with MICE, missForest and the PCA-based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as a multiple imputation method providing the lowest imputation error rates for all types of missingness. PCA-based imputation does not perform well in terms of accuracy.
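As one way to reproduce this kind of comparison, a hedged sketch of a missForest-style imputer, approximated here by scikit-learn's IterativeImputer with a random forest estimator rather than the original R package, scored by NRMSE on the masked cells (normalised here by the standard deviation; definitions of NRMSE vary):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 8))
mask = rng.random(X.shape) < 0.3                 # 30% MCAR missingness
X_miss = np.where(mask, np.nan, X)

imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5, random_state=0,
)
X_hat = imp.fit_transform(X_miss)

# NRMSE on the masked cells, normalised by their standard deviation.
nrmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2)) / np.std(X[mask])
print(f"NRMSE: {nrmse:.3f}")
```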


2019 ◽  
Author(s):  
Tabea Kossen ◽  
Michelle Livne ◽  
Vince I Madai ◽  
Ivana Galinovic ◽  
Dietmar Frey ◽  
...  

Abstract Background and purpose: Handling missing values is a prevalent challenge in the analysis of clinical data. The rise of data-driven models demands an efficient use of the available data, so methods to impute missing values are crucial. Here, we developed a publicly available framework to test different imputation methods and compared their impact in a typical clinical stroke dataset as a use case. Methods: A clinical dataset based on the 1000Plus stroke study, with 380 patients having complete entries, was used. Thirteen common clinical parameters, including numerical and categorical values, were selected. Missing values were simulated in a missing-at-random (MAR) and missing-completely-at-random (MCAR) fashion at rates from 0% to 60%, then imputed using the mean, hot-deck imputation, multiple imputation by chained equations, and the expectation-maximization method, or handled by listwise deletion. Performance was assessed by the root mean squared error, the absolute bias, and the performance of a linear model for discharge mRS prediction. Results: Listwise deletion was the worst-performing method and became significantly worse than every imputation method from 2% (MAR) and 3% (MCAR) missing values onwards. The underlying missing-value mechanism seemed to have a crucial influence on the identified best-performing imputation method; consequently, no single imputation method outperformed all others. A significant performance drop of the linear model started from 11% (MAR+MCAR) and 18% (MCAR) missing values. Conclusions: In the presented case study of a typical clinical stroke dataset, we confirmed that listwise deletion should be avoided when dealing with missing values. Our findings indicate that the underlying missing-value mechanism and other dataset characteristics strongly influence the best choice of imputation method. For future studies with a similar data structure, we therefore suggest using the framework developed in this study to select the most suitable imputation method for a given dataset prior to analysis.
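A condensed sketch of the listwise-deletion-versus-imputation comparison on simulated data (not the 1000Plus dataset; mean imputation stands in for the richer methods tested):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n, beta = 380, np.array([0.8, -0.5, 0.3, 0.0, 0.2])
X = rng.normal(size=(n, beta.size))
y = X @ beta + rng.normal(0, 1, n)
X_miss = np.where(rng.random(X.shape) < 0.3, np.nan, X)    # 30% MCAR missingness

# Listwise deletion: fit only on rows with no missing values.
rows = ~np.isnan(X_miss).any(axis=1)
coef_del = LinearRegression().fit(X_miss[rows], y[rows]).coef_

# Mean imputation: fill in every cell, keep every row.
X_imp = SimpleImputer(strategy="mean").fit_transform(X_miss)
coef_imp = LinearRegression().fit(X_imp, y).coef_

print("rows kept after deletion:", int(rows.sum()), "of", n)
print("true coefficients:     ", beta)
print("listwise-deletion fit: ", coef_del.round(2))
print("mean-imputation fit:   ", coef_imp.round(2))
```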

