An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge

2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Xi Shi ◽  
Charlotte Prins ◽  
Gijs Van Pottelbergh ◽  
Pavlos Mamouris ◽  
Bert Vaes ◽  
...  

Abstract Background The use of Electronic Health Records (EHR) data in clinical research is increasing rapidly, but the abundance of data resources raises the challenge of data cleaning. Automating data cleaning can save considerable time. In addition, automated data cleaning tools built for other domains often process all variables uniformly, so they do not serve clinical data well, where variable-specific information must be considered. This paper proposes an automated data cleaning method for EHR data that takes clinical knowledge into consideration. Methods We used EHR data collected from primary care in Flanders, Belgium during 1994–2015. We constructed a Clinical Knowledge Database to store all the variable-specific information necessary for data cleaning. We applied fuzzy search to automatically detect and replace misspelled units, and performed unit conversion following the variable-specific conversion formula. Numeric values were then corrected and outliers detected using the clinical knowledge. In total, 52 clinical variables were cleaned, and the percentage of missing values (completeness) and the percentage of values within the normal range (correctness) before and after the cleaning process were compared. Results All variables were 100% complete before data cleaning. 42 variables had a drop of less than 1% in completeness, and 9 variables declined by 1–10%. Only 1 variable experienced a large decline in completeness (13.36%). All variables had more than 50% of values within the normal range after cleaning, of which 43 variables had a percentage higher than 70%. Conclusions We propose a general method for clinical variables that achieves a high degree of automation and is capable of dealing with large-scale data. This method greatly improved data-cleaning efficiency and removed technical barriers for non-technical users.

2018 ◽  
Vol 4 ◽  
pp. 205520761880465 ◽  
Author(s):  
Tim Robbins ◽  
Sarah N Lim Choi Keung ◽  
Sailesh Sankar ◽  
Harpal Randeva ◽  
Theodoros N Arvanitis

Introduction Electronic health records provide an unparalleled opportunity for the use of patient data that is routinely collected and stored, in order to drive research and develop an epidemiological understanding of disease. Diabetes, in particular, stands to benefit, being a data-rich, chronic-disease state. This article aims to provide an understanding of the extent to which the healthcare sector is using routinely collected and stored data to inform research and epidemiological understanding of diabetes mellitus. Methods Narrative literature review of articles published in both the medical and engineering informatics literature. Results There has been a significant increase in the number of published papers that utilise electronic health records as a direct data source for diabetes research. These articles consider a diverse range of research questions. Internationally, the secondary use of electronic health records as a research tool is most prominent in the USA. The barriers most commonly described in research studies include missing values and misclassification, alongside challenges of establishing the generalisability of results. Discussion Electronic health record research is an important and expanding area of healthcare research. Much of the research output remains in the form of conference abstracts and proceedings, rather than journal articles. There is enormous opportunity within the United Kingdom to develop these research methodologies, due to national patient identifiers. Such a healthcare context may enable UK researchers to overcome many of the barriers encountered elsewhere and thus to truly unlock the potential of electronic health records.


2020 ◽  
Vol 3 (1) ◽  
pp. 289-314 ◽  
Author(s):  
James M. Hoffman ◽  
Allen J. Flynn ◽  
Justin E. Juskewitch ◽  
Robert R. Freimuth

Pharmacogenomic information must be incorporated into electronic health records (EHRs) with clinical decision support in order to fully realize its potential to improve drug therapy. Supported by various clinical knowledge resources, pharmacogenomic workflows have been implemented in several healthcare systems. Little standardization exists across these efforts, however, which limits scalability both within and across clinical sites. Limitations in information standards, knowledge management, and the capabilities of modern EHRs remain challenges for the widespread use of pharmacogenomics in the clinic, but ongoing efforts are addressing these challenges. Although much work remains to use pharmacogenomic information more effectively within clinical systems, the experiences of pioneering sites and lessons learned from those programs may be instructive for other clinical areas beyond genomics. We present a vision of what can be achieved as informatics and data science converge to enable further adoption of pharmacogenomics in the clinic.


2021 ◽  
Vol 1 (3) ◽  
pp. 166-181
Author(s):  
Muhammad Adib Uz Zaman ◽  
Dongping Du

Electronic health records (EHRs) can be very difficult to analyze since they usually contain many missing values, yet a complete dataset is necessary to build an efficient predictive model. An EHR usually contains high-dimensional longitudinal time series data, and most commonly used imputation methods do not consider the temporal information embedded in it. Moreover, most time-dependent neural networks, such as recurrent neural networks (RNNs), inherently assume equally spaced time steps, which is often inappropriate for EHR data. This study presents a method using the gated recurrent unit (GRU), neural ordinary differential equations (ODEs), and Bayesian estimation to incorporate the temporal information and impute sporadically observed time series measurements in high-dimensional EHR data.
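The core difficulty the abstract raises — unequal gaps between observations — can be illustrated with a much simpler stand-in than the paper's GRU-ODE-Bayes model: carry the last observation forward, but decay it toward a population mean as the gap since the last measurement grows (the intuition behind decay-based imputers such as GRU-D). The time constant `tau` and the population `mean` below are illustrative assumptions, not parameters from the paper.

```python
import math

def decay_impute(times, values, tau=1.0, mean=0.0):
    """Impute missing entries (None) in an irregularly sampled series.

    The last observed value is decayed toward `mean` with weight
    exp(-gap/tau), so larger time gaps trust the stale observation less.
    Simplified stand-in for learned, ODE-based imputation; not the
    paper's actual model.
    """
    out, last_t, last_v = [], None, None
    for t, v in zip(times, values):
        if v is not None:
            last_t, last_v = t, v
            out.append(v)
        elif last_v is None:
            out.append(mean)  # nothing observed yet: fall back to the mean
        else:
            w = math.exp(-(t - last_t) / tau)  # weight shrinks with the gap
            out.append(w * last_v + (1 - w) * mean)
    return out
```

For instance, with observations at irregular times `[0.0, 0.5, 3.0]` and only the first value present, the imputed value at `t=0.5` stays close to the observation while the one at `t=3.0` has decayed most of the way back to the mean, which a fixed-step RNN cannot express.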


2016 ◽  
Vol 25 (01) ◽  
pp. 7-12 ◽  
Author(s):  
A. Wright ◽  
J. Ash ◽  
H. Singh ◽  
D. F. Sittig

Summary Although the health information technology industry has made considerable progress in the design, development, implementation, and use of electronic health records (EHRs), the lofty expectations of the early pioneers have not been met. In 2006, the Provider Order Entry Team at Oregon Health & Science University described a set of unintended adverse consequences (UACs), or unpredictable, emergent problems associated with computer-based provider order entry implementation, use, and maintenance. Many of these originally identified UACs have not been completely addressed or alleviated, some have evolved over time, and some new ones have emerged as EHRs became more widely available. The rapid increase in the adoption of EHRs, coupled with the changes in the types and attitudes of clinical users, has led to several new UACs, specifically: complete clinical information unavailable at the point of care; lack of innovations to improve system usability leading to frustrating user experiences; inadvertent disclosure of large amounts of patient-specific information; increased focus on computer-based quality measurement negatively affecting clinical workflows and patient-provider interactions; information overload from marginally useful computer-generated data; and a decline in the development and use of internally-developed EHRs. While each of these new UACs poses significant challenges to EHR developers and users alike, they also offer many opportunities. The challenge for clinical informatics researchers is to continue to refine our current systems while exploring new methods of overcoming these challenges and developing innovations to improve EHR interoperability, usability, security, functionality, clinical quality measurement, and information summarization and display.


Entropy ◽  
2020 ◽  
Vol 22 (10) ◽  
pp. 1154
Author(s):  
Jiwei Zhao ◽  
Chi Chen

We study how to conduct statistical inference in a regression model where the outcome variable is prone to missing values and the missingness mechanism is unknown. The model we consider might be a traditional setting or a modern high-dimensional setting where the sparsity assumption is usually imposed and regularization is commonly used. Motivated by the fact that the missingness mechanism, albeit usually treated as a nuisance, is difficult to specify correctly, we adopt the conditional likelihood approach so that the nuisance can be completely ignored throughout our procedure. We establish the asymptotic theory of the proposed estimator and develop an easy-to-implement algorithm via a data manipulation strategy. In particular, under the high-dimensional setting where regularization is needed, we propose a data perturbation method for post-selection inference. The proposed methodology is especially appealing when the true missingness mechanism tends to be missing not at random, e.g., patient-reported outcomes or real-world data such as electronic health records. The performance of the proposed method is evaluated by comprehensive simulation experiments as well as a study of the albumin level in the MIMIC-III database.
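Why an unknown, missing-not-at-random mechanism is dangerous can be shown with a small simulation (this illustrates the problem the abstract addresses, not the paper's conditional likelihood estimator): when the probability of being missing depends on the outcome's own value, a naive complete-case analysis is biased.

```python
import random

random.seed(0)

# Simulate an outcome whose missingness depends on its own value (MNAR):
# positive values are dropped 80% of the time, negatives are always kept.
y = [random.gauss(0.0, 1.0) for _ in range(100_000)]
observed = [v for v in y if random.random() > 0.8 * (v > 0)]

true_mean = sum(y) / len(y)                          # close to 0
complete_case_mean = sum(observed) / len(observed)   # pulled well below 0

# The complete-case estimate is biased downward because large values are
# systematically unobserved; correctly modeling such a mechanism is hard,
# which motivates approaches that sidestep it entirely.
```

Running this, the complete-case mean lands around -0.5 while the full-sample mean is near 0 — a bias no amount of extra data repairs, since the distortion is in which records are observed.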

