An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge

2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Xi Shi ◽  
Charlotte Prins ◽  
Gijs Van Pottelbergh ◽  
Pavlos Mamouris ◽  
Bert Vaes ◽  
...  

Abstract Background The use of Electronic Health Records (EHR) data in clinical research is increasing rapidly, but the abundance of data resources raises the challenge of data cleaning. Automating data cleaning can save considerable time. In addition, automated data cleaning tools built for other domains often process all variables uniformly, so they do not serve clinical data well, where variable-specific information must be considered. This paper proposes an automated data cleaning method for EHR data that takes clinical knowledge into consideration. Methods We used EHR data collected from primary care in Flanders, Belgium during 1994–2015. We constructed a Clinical Knowledge Database to store all the variable-specific information necessary for data cleaning. We applied fuzzy search to automatically detect and replace misspelled units, and performed unit conversion following the variable-specific conversion formula. Numeric values were then corrected and outliers detected using the clinical knowledge. In total, 52 clinical variables were cleaned, and the percentage of missing values (completeness) and the percentage of values within the normal range (correctness) before and after the cleaning process were compared. Results All variables were 100% complete before data cleaning. 42 variables had a drop of less than 1% in completeness, and 9 variables declined by 1–10%. Only 1 variable experienced a large decline in completeness (13.36%). All variables had more than 50% of values within the normal range after cleaning, of which 43 variables had a percentage higher than 70%. Conclusions We propose a general method for clinical variables that achieves a high degree of automation and is capable of dealing with large-scale data. This method greatly improved data-cleaning efficiency and removed technical barriers for non-technical users.

2018 ◽  
Vol 4 ◽  
pp. 205520761880465 ◽  
Author(s):  
Tim Robbins ◽  
Sarah N Lim Choi Keung ◽  
Sailesh Sankar ◽  
Harpal Randeva ◽  
Theodoros N Arvanitis

Introduction Electronic health records provide an unparalleled opportunity for the use of patient data that is routinely collected and stored, in order to drive research and develop an epidemiological understanding of disease. Diabetes, in particular, stands to benefit, being a data-rich, chronic-disease state. This article aims to provide an understanding of the extent to which the healthcare sector is using routinely collected and stored data to inform research and epidemiological understanding of diabetes mellitus. Methods Narrative literature review of articles published in both the medical and engineering informatics literature. Results There has been a significant increase in the number of published papers that utilise electronic health records as a direct data source for diabetes research. These articles consider a diverse range of research questions. Internationally, the secondary use of electronic health records as a research tool is most prominent in the USA. The barriers most commonly described in research studies include missing values and misclassification, alongside challenges of establishing the generalisability of results. Discussion Electronic health record research is an important and expanding area of healthcare research. Much of the research output remains in the form of conference abstracts and proceedings, rather than journal articles. There is enormous opportunity within the United Kingdom to develop these research methodologies, due to national patient identifiers. Such a healthcare context may enable UK researchers to overcome many of the barriers encountered elsewhere and thus to truly unlock the potential of electronic health records.


2020 ◽  
Vol 3 (1) ◽  
pp. 289-314 ◽  
Author(s):  
James M. Hoffman ◽  
Allen J. Flynn ◽  
Justin E. Juskewitch ◽  
Robert R. Freimuth

Pharmacogenomic information must be incorporated into electronic health records (EHRs) with clinical decision support in order to fully realize its potential to improve drug therapy. Supported by various clinical knowledge resources, pharmacogenomic workflows have been implemented in several healthcare systems. Little standardization exists across these efforts, however, which limits scalability both within and across clinical sites. Limitations in information standards, knowledge management, and the capabilities of modern EHRs remain challenges for the widespread use of pharmacogenomics in the clinic, but ongoing efforts are addressing these challenges. Although much work remains to use pharmacogenomic information more effectively within clinical systems, the experiences of pioneering sites and lessons learned from those programs may be instructive for other clinical areas beyond genomics. We present a vision of what can be achieved as informatics and data science converge to enable further adoption of pharmacogenomics in the clinic.


2021 ◽  
Vol 1 (3) ◽  
pp. 166-181
Author(s):  
Muhammad Adib Uz Zaman ◽  
Dongping Du

Electronic health records (EHRs) can be very difficult to analyze since they usually contain many missing values, yet a complete dataset is necessary to build an efficient predictive model. An EHR usually contains high-dimensional longitudinal time series data, and most commonly used imputation methods do not consider the temporal information embedded in it. Moreover, most time-dependent neural networks, such as recurrent neural networks (RNNs), inherently assume equally spaced time steps, which is often inappropriate for EHR data. This study presents a method using the gated recurrent unit (GRU), neural ordinary differential equations (ODEs), and Bayesian estimation to incorporate the temporal information and impute sporadically observed time series measurements in high-dimensional EHR data.
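The core difficulty the abstract raises — unequal gaps between observations — can be illustrated with a much simpler stand-in than the paper's GRU-ODE-Bayes model: carry the last observation forward, but decay it toward a population mean as the gap since the last measurement grows (the intuition behind decay-based imputers such as GRU-D). The time constant `tau` and the population `mean` below are illustrative assumptions, not parameters from the paper.

```python
import math

def decay_impute(times, values, tau=1.0, mean=0.0):
    """Impute missing entries (None) in an irregularly sampled series.

    The last observed value is decayed toward `mean` with weight
    exp(-gap/tau), so larger time gaps trust the stale observation less.
    Simplified stand-in for learned, ODE-based imputation; not the
    paper's actual model.
    """
    out, last_t, last_v = [], None, None
    for t, v in zip(times, values):
        if v is not None:
            last_t, last_v = t, v
            out.append(v)
        elif last_v is None:
            out.append(mean)  # nothing observed yet: fall back to the mean
        else:
            w = math.exp(-(t - last_t) / tau)  # weight shrinks with the gap
            out.append(w * last_v + (1 - w) * mean)
    return out
```

For instance, with observations at irregular times `[0.0, 0.5, 3.0]` and only the first value present, the imputed value at `t=0.5` stays close to the observation while the one at `t=3.0` has decayed most of the way back to the mean, which a fixed-step RNN cannot express.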


2016 ◽  
Vol 25 (01) ◽  
pp. 7-12 ◽  
Author(s):  
A. Wright ◽  
J. Ash ◽  
H. Singh ◽  
D. F. Sittig

Summary Although the health information technology industry has made considerable progress in the design, development, implementation, and use of electronic health records (EHRs), the lofty expectations of the early pioneers have not been met. In 2006, the Provider Order Entry Team at Oregon Health & Science University described a set of unintended adverse consequences (UACs), or unpredictable, emergent problems associated with computer-based provider order entry implementation, use, and maintenance. Many of these originally identified UACs have not been completely addressed or alleviated, some have evolved over time, and some new ones have emerged as EHRs became more widely available. The rapid increase in the adoption of EHRs, coupled with the changes in the types and attitudes of clinical users, has led to several new UACs, specifically: complete clinical information unavailable at the point of care; lack of innovations to improve system usability leading to frustrating user experiences; inadvertent disclosure of large amounts of patient-specific information; increased focus on computer-based quality measurement negatively affecting clinical workflows and patient-provider interactions; information overload from marginally useful computer-generated data; and a decline in the development and use of internally-developed EHRs. While each of these new UACs poses significant challenges to EHR developers and users alike, they also offer many opportunities. The challenge for clinical informatics researchers is to continue to refine our current systems while exploring new methods of overcoming these challenges and developing innovations to improve EHR interoperability, usability, security, functionality, clinical quality measurement, and information summarization and display.


Entropy ◽  
2020 ◽  
Vol 22 (10) ◽  
pp. 1154
Author(s):  
Jiwei Zhao ◽  
Chi Chen

We study how to conduct statistical inference in a regression model where the outcome variable is prone to missing values and the missingness mechanism is unknown. The model we consider might be a traditional setting or a modern high-dimensional setting where the sparsity assumption is usually imposed and regularization is commonly used. Motivated by the fact that the missingness mechanism, albeit usually treated as a nuisance, is difficult to specify correctly, we adopt the conditional likelihood approach so that the nuisance can be completely ignored throughout our procedure. We establish the asymptotic theory of the proposed estimator and develop an easy-to-implement algorithm via a data manipulation strategy. In particular, under the high-dimensional setting where regularization is needed, we propose a data perturbation method for post-selection inference. The proposed methodology is especially appealing when the true missingness mechanism tends to be missing not at random, e.g., patient-reported outcomes or real-world data such as electronic health records. The performance of the proposed method is evaluated by comprehensive simulation experiments as well as a study of the albumin level in the MIMIC-III database.
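Why an unknown, missing-not-at-random mechanism is dangerous can be shown with a small simulation (this illustrates the problem the abstract addresses, not the paper's conditional likelihood estimator): when the probability of being missing depends on the outcome's own value, a naive complete-case analysis is biased.

```python
import random

random.seed(0)

# Simulate an outcome whose missingness depends on its own value (MNAR):
# positive values are dropped 80% of the time, negatives are always kept.
y = [random.gauss(0.0, 1.0) for _ in range(100_000)]
observed = [v for v in y if random.random() > 0.8 * (v > 0)]

true_mean = sum(y) / len(y)                          # close to 0
complete_case_mean = sum(observed) / len(observed)   # pulled well below 0

# The complete-case estimate is biased downward because large values are
# systematically unobserved; correctly modeling such a mechanism is hard,
# which motivates approaches that sidestep it entirely.
```

Running this, the complete-case mean lands around -0.5 while the full-sample mean is near 0 — a bias no amount of extra data repairs, since the distortion is in which records are observed.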

