Increasing the Density of Laboratory Measures for Machine Learning Applications

2020 ◽  
Vol 10 (1) ◽  
pp. 103
Author(s):  
Vida Abedi ◽  
Jiang Li ◽  
Manu K. Shivakumar ◽  
Venkatesh Avula ◽  
Durgesh P. Chaudhary ◽  
...  

Background. The imputation of missingness is a key step in Electronic Health Records (EHR) mining, as it can significantly affect the conclusions derived from the downstream analysis in translational medicine. The missingness of laboratory values in EHR is not at random, yet imputation techniques tend to disregard this key distinction. Consequently, the development of an adaptive imputation strategy designed specifically for EHR is an important step in improving the data imbalance and enhancing the predictive power of modeling tools for healthcare applications. Method. We analyzed the laboratory measures derived from Geisinger’s EHR on patients in three distinct cohorts—patients tested for Clostridioides difficile (Cdiff) infection, patients with a diagnosis of inflammatory bowel disease (IBD), and patients with a diagnosis of hip or knee osteoarthritis (OA). We extracted Logical Observation Identifiers Names and Codes (LOINC) from which we excluded those with 75% or more missingness. The comorbidities, primary or secondary diagnosis, as well as active problem lists, were also extracted. The adaptive imputation strategy was designed based on a hybrid approach. The comorbidity patterns of patients were transformed into latent patterns and then clustered. Imputation was performed on a cluster of patients for each cohort independently to show the generalizability of the method. The results were compared with imputation applied to the complete dataset without incorporating the information from comorbidity patterns. Results. We analyzed a total of 67,445 patients (11,230 IBD patients, 10,000 OA patients, and 46,215 patients tested for C. difficile infection). We extracted 495 LOINC and 11,230 diagnosis codes for the IBD cohort, 8160 diagnosis codes for the Cdiff cohort, and 2042 diagnosis codes for the OA cohort based on the primary/secondary diagnosis and active problem list in the EHR. 
Overall, the most improvement from this strategy was observed when the laboratory measures had a higher level of missingness. The best root mean square error (RMSE) difference for each dataset was recorded as −35.5 for the Cdiff, −8.3 for the IBD, and −11.3 for the OA dataset. Conclusions. An adaptive imputation strategy designed specifically for EHR that uses complementary information from the clinical profile of the patient can be used to improve the imputation of missing laboratory values, especially when laboratory codes with high levels of missingness are included in the analysis.
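The comparison described in this abstract (imputation within comorbidity-derived clusters versus imputation on the whole dataset, scored by RMSE) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: simple mean imputation stands in for the authors' imputation method, the cluster labels are given rather than learned from latent comorbidity patterns, the toy values are invented, and the "RMSE difference" is taken to be cluster-wise RMSE minus whole-dataset RMSE, so that a negative value means improvement.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean square error between two equal-length arrays."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

def impute_global_mean(values, mask):
    """Fill masked entries with the mean of all observed entries."""
    filled = values.copy()
    filled[mask] = values[~mask].mean()
    return filled

def impute_cluster_mean(values, mask, clusters):
    """Fill masked entries with the observed mean of the same cluster."""
    filled = values.copy()
    for c in np.unique(clusters):
        in_cluster = clusters == c
        observed = in_cluster & ~mask
        # Fall back to the global observed mean if a cluster has no observed values
        source = values[observed] if observed.any() else values[~mask]
        filled[in_cluster & mask] = source.mean()
    return filled

# Hypothetical lab values for two patient clusters with different typical levels
truth = np.array([1.0, 1.2, 0.8, 1.0, 5.0, 5.2, 4.8, 5.0])
clusters = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# True marks entries hidden from the imputer and used only for scoring
mask = np.array([False, False, False, True, False, False, False, True])

rmse_global = rmse(impute_global_mean(truth, mask)[mask], truth[mask])
rmse_cluster = rmse(impute_cluster_mean(truth, mask, clusters)[mask], truth[mask])
rmse_difference = rmse_cluster - rmse_global  # negative favours cluster-wise imputation

print(f"global RMSE={rmse_global:.2f}, cluster RMSE={rmse_cluster:.2f}, "
      f"difference={rmse_difference:.2f}")
```

In this toy case the clusters have clearly different means, so the cluster-wise fill is much closer to the hidden values than the global fill. Real EHR data would need a held-out masking scheme and a stronger imputer.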

2020 ◽  
Vol 9 (3) ◽  
pp. 614
Author(s):  
Manuel Méndez-Bailón ◽  
Rodrigo Jiménez-García ◽  
Valentín Hernández-Barrera ◽  
Javier de Miguel-Díez ◽  
José M. de Miguel-Yanes ◽  
...  

Background: We aimed to (1) analyze time trends in the incidence and in-hospital outcomes of heart failure (HF) patients suffering Clostridioides difficile infection (CDI); (2) compare clinical characteristics of CDI patients between those with HF and matched non-HF patients; and (3) identify predictors of in-hospital mortality (IHM) among HF patients suffering CDI. Methods: Retrospective study using the Spanish National Hospital Discharge Database from 2001 to 2015. Patients of age ≥40 years with CDI were included. For each HF patient, we selected a non-HF patient matched by year, age, sex, and readmission status. Results: We found 44,695 patients hospitalized with CDI (15.46% with HF). HF patients had a higher incidence of CDI (202.05 vs. 145.09 per 100,000 hospitalizations) than patients without HF (adjusted IRR 1.35; 95% CI 1.31–1.40). IHM was significantly higher in patients with HF when CDI was coded as primary (18.39% vs. 7.63%; p < 0.001) and secondary diagnosis (21.12% vs. 14.76%; p < 0.001). Among HF patients, predictors of IHM were older age (OR 8.80; 95% CI 2.55–20.33 for ≥85 years old), more comorbidities (OR 1.68; 95% CI 1.12–2.53 for those with Charlson Comorbidity index ≥2), and severe CDI (OR 6.19; 95% CI 3.80–10.02). Conclusions: This research showed that the incidence of CDI was higher in HF than non-HF patients. HF is a risk factor for IHM after suffering CDI.
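As a quick check on the incidence figures above, the crude (unadjusted) incidence rate ratio can be computed directly from the two reported rates. This is only an arithmetic sketch: the abstract reports an adjusted IRR of 1.35, which comes from a model the abstract does not describe, so the crude ratio is not expected to match it exactly.

```python
# Incidence of CDI per 100,000 hospitalizations, as reported in the abstract
incidence_hf = 202.05
incidence_non_hf = 145.09

# Crude ratio of the two rates; the adjusted IRR in the abstract is 1.35
crude_irr = incidence_hf / incidence_non_hf
print(f"crude IRR = {crude_irr:.2f}")
```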


2019 ◽  
Vol 6 (Supplement_2) ◽  
pp. S338-S338
Author(s):  
Ryan H Rochat ◽  
Gail J Demmler-Harrison

Abstract. Background: The electronic medical record (EMR) has become a modern compendium of health information, from broad clinical assessments down to an individual’s heart rate. The wealth of information in these EMRs holds promise for clinical discovery and hypothesis generation. Unfortunately, as these systems have become more robust, mining them for relevant clinical information is hindered by the overall data architecture, and often requires the expertise of a clinical informatician to extract relevant data. However, as the information presented to the clinician through the digital workspace is derived from the core EMR database, the format is well structured and can be mined using text recognition and parsing scripts. Methods: Here we present a program which can parse output from Epic Hyperspace®, generating a relational database of clinical information. To facilitate ease of use, our protocol capitalizes on the familiarity of Microsoft Excel® as an intermediary for storing the raw output from the EMR, with data parsing and processing scripts written in SAS V9.4 (Cary, North Carolina). Results: As a proof of concept, we extracted the diagnosis codes and standard laboratories for 190 patients seen in our Congenital Cytomegalovirus Clinic at Texas Children’s Hospital in Houston, Texas. Manual extraction of these data into Microsoft Excel® took 1 hour, and the scripts to parse the data took less than 5 seconds to run. Data from these patients included: 3800 ICD-10 codes (along with their metadata) and 33,000 individual laboratory values. In total, more than 850,000 characters were extracted from the EMR using this technique. Manual review of 10 randomly selected charts found the data in perfect concordance with the EMR, a direct reflection of the fidelity of the parsing scripts. On average, an experienced user was able to enter three ICD-10 codes each minute, and six individual laboratory values per minute.
At best, this same process would have taken at least 110 hours using a conventional chart review technique. Conclusion: High-throughput data mining tools have the potential to improve the feasibility of studies dependent upon information stored in the EMR. When coupled with specific content knowledge, this approach can consolidate months of data collection into a day’s task. Disclosures: All authors: No reported disclosures.
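The "at least 110 hours" estimate follows from the counts and entry rates reported in the abstract. The short calculation below reproduces it, assuming that codes and laboratory values are entered one after the other at the stated average rates (the variable names are illustrative).

```python
# Figures reported in the abstract
icd10_codes = 3800          # ICD-10 codes extracted
lab_values = 33_000         # individual laboratory values extracted
codes_per_minute = 3        # manual entry rate for ICD-10 codes
labs_per_minute = 6         # manual entry rate for laboratory values

# Total manual entry time, assuming sequential entry at the average rates
manual_minutes = icd10_codes / codes_per_minute + lab_values / labs_per_minute
manual_hours = manual_minutes / 60
print(f"estimated manual entry time: {manual_hours:.1f} hours")
```

The result is a little under 113 hours, which is consistent with the abstract's lower bound of 110 hours.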


Author(s):  
Ibrahim Sahin ◽  
Canan Ersoy ◽  
Ilker Ercan ◽  
Melahat Dirican

Objective: Our aim was to use big data to analyze patients aged 18 and over who presented to our hospital and were diagnosed with primary hypothyroidism, by evaluating their laboratory and socio-demographic data. Clustering analysis was performed on the dataset to search for structure in the data. Methods: Based on ICD-10 diagnoses of hypothyroidism recorded in our hospital between 2005 and 2018, 130,159 patients aged 18 and over with E03 and E06 diagnosis codes were included in the study. Because drugs containing levothyroxine, used to treat primary hypothyroidism, affect measured hormone levels, we analysed TSH, fT3 and fT4 laboratory values at first diagnosis, before any treatment had been received, according to demographics. Patients with one or more missing laboratory values were excluded, and data of 2680 patients with complete data and TSH values above 4.94 mU/L were retained. Analysis was performed with the k-means clustering technique, with the data separated into two sets. k-means clustering was performed on the age, TSH, fT3 and fT4 variables. Cliff’s Delta effect size coefficients and confidence intervals were calculated to quantify the size of the differences. Results: A higher prevalence of primary hypothyroidism in females and a peak in hypothyroidism in the 4th to 5th decades in both genders were observed. In males, where ages were lower, fT3 and fT4 values were higher, whereas TSH values were lower. In females, where ages were lower, TSH values were higher, whereas fT4 values were lower. Conclusion: This is the first big data analysis of primary hypothyroidism carried out in our country. Despite the difficulties in implementation, studies like this are important for generating data in our country.
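Cliff's Delta, the effect size named in the Methods, compares two groups by counting how often a value in one group exceeds a value in the other. A minimal sketch using the standard definition is shown below. The two small groups are invented numbers standing in for a laboratory value in two clusters, not data from the study, and the confidence intervals mentioned in the abstract are not computed here.

```python
def cliffs_delta(xs, ys):
    """Cliff's Delta: (#pairs with x > y minus #pairs with x < y) / (n * m).

    Ranges from -1 (every x is below every y) to +1 (every x is above every y).
    """
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Hypothetical values for one laboratory measure in two clusters
cluster_a = [5, 6, 7]
cluster_b = [4, 5, 6]

delta = cliffs_delta(cluster_a, cluster_b)
print(f"Cliff's Delta = {delta:.3f}")
```

Because it relies only on pairwise ordering, the statistic does not assume normally distributed values, which suits skewed laboratory measures such as TSH.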


2021 ◽  
Author(s):  
Rohan Khera ◽  
Bobak Mortazavi ◽  
Veer Sangha ◽  
Frederick Warner ◽  
H Patrick Young ◽  
...  

Objective: Real-world data, including administrative claims and electronic health record (EHR) data, have been critical for rapid knowledge generation throughout the COVID-19 pandemic. Many studies relied on these data to identify cases and ascertain outcomes, commonly using diagnostic codes. However, to ensure high-quality results are delivered to guide clinical decision making, guide the public health response, and characterize the response to interventions, it is essential to establish the accuracy of these approaches for case identification of infections and hospitalizations. Methods: Real-world EHR data were obtained from the clinical data warehouse and computational health platform at a large academic health system that includes 5 regional hospitals in Connecticut and Rhode Island and their associated ambulatory practices. Demographic information, diagnosis codes, SARS-CoV-2 nucleic acid and antigen testing results, and visit data including discharge disposition were obtained from our OMOP common data model for all patients with either a positive SARS-CoV-2 test or ICD-10 diagnosis of COVID-19 (U07.1) between April 1, 2020 and March 1, 2021. Various computable phenotype definitions using combinations of test results and diagnostic codes were evaluated for their accuracy to identify SARS-CoV-2 infection and COVID-19 hospitalizations. The association with each phenotype was further compared with case volumes and, for hospitalizations, in-hospital mortality. We conducted a quantitative assessment with a manual chart review for a sample of 40 patients who had discordance between diagnostic code and laboratory result findings. Results: There were 69,423 individuals with either a diagnosis code or a laboratory diagnosis of a SARS-CoV-2 infection. Of these, 61,023 individuals had a principal or a secondary diagnosis code for COVID-19 and 50,355 had a positive SARS-CoV-2 PCR or antigen test.
Among those with a positive PCR, 38,506 (76.5%) also had a principal and 3449 (6.8%) a secondary diagnosis of COVID-19, but 8400 (16.7%) had no COVID-19 diagnosis in the medical record. Moreover, of the 61,023 patients who had a COVID-19 diagnosis, 19,068 (31.2%) did not have a positive laboratory test for SARS-CoV-2 in the EHR. In a manual chart review of a sample of these patients, we found that many had a COVID-19 diagnosis code added during healthcare encounters related to asymptomatic testing, either as part of a screening program or following exposure, but with negative subsequent test results. The positive predictive value (precision) and sensitivity (recall) of a COVID-19 diagnosis in the medical record for a positive SARS-CoV-2 PCR were 68.8% and 83.3%, respectively. Further, among 5,109 patients who were hospitalized with a principal diagnosis of COVID-19, 4843 (94.8%) had a positive SARS-CoV-2 PCR or antigen test within the 2 weeks preceding hospital admission or during hospitalization. In a random sample of 10 without a positive test during the index hospitalization selected for manual chart review, 7 (70.0%) had been tested at an outside laboratory before admission and the remaining had a strong clinical suspicion for COVID-19. In addition, 789 hospitalizations had a secondary diagnosis of COVID-19, of which 446 (56.5%) had a principal diagnosis that was consistent with severe clinical manifestation of COVID-19 (e.g., sepsis or respiratory failure). Compared with the cohort that had a principal diagnosis of COVID-19, those with a secondary diagnosis were more frequently male and White and had more than 2-fold higher in-hospital mortality (13.2% vs 28.0%, P<0.001). Conclusions: In a large integrated health system, COVID-19 diagnosis codes were not adequate for case identification and epidemiological surveillance of SARS-CoV-2 infection.
In contrast, a principal diagnosis of COVID-19 consistently identified hospitalized patients with the disease but missed nearly 10% of cases that presented with more severe manifestations of disease and had over 2-fold higher mortality. Data from the EHR can provide additional data elements compared to administrative claims alone, such as laboratory testing results, that can be used in conjunction with diagnostic codes to create more fine-tuned phenotypes that are designed for specific analytical use cases.
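The positive predictive value and sensitivity reported in the Results can be reproduced from the counts given in the abstract, treating a positive SARS-CoV-2 test as the reference standard and a principal or secondary COVID-19 diagnosis code as the classifier. The variable names below are illustrative; only the numbers come from the abstract.

```python
# Counts reported in the abstract
positive_test = 50_355        # patients with a positive PCR or antigen test
diagnosis_code = 61_023       # patients with a principal or secondary COVID-19 code
principal_and_positive = 38_506
secondary_and_positive = 3_449

# Patients with both a positive test and a diagnosis code (true positives)
true_positives = principal_and_positive + secondary_and_positive

ppv = true_positives / diagnosis_code          # precision
sensitivity = true_positives / positive_test   # recall

# Consistency checks against the other figures in the abstract
test_without_code = positive_test - true_positives      # reported as 8400
code_without_test = diagnosis_code - true_positives     # reported as 19,068
total_individuals = positive_test + diagnosis_code - true_positives  # reported as 69,423

print(f"PPV = {ppv:.1%}, sensitivity = {sensitivity:.1%}")
```

The derived counts of patients with only a test, only a code, and either one all agree with the figures stated in the abstract.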


Author(s):  
Mingkai Peng ◽  
Danielle Southern ◽  
Tyler Williamson ◽  
Hude Quan

Abstract. Objectives: Administrative health data including hospital discharge abstract data have been widely collected and analyzed for various purposes, including disease surveillance, case-mix costing, tracking healthcare system performance, policy-making and research. This study examined the coding validity of hypertension, diabetes, obesity and depression related to the presence of their co-existing conditions, death status and number of diagnosis codes in hospital discharge abstract data (DAD). Approach: We randomly selected around 4000 DAD records from four teaching hospitals in Alberta, Canada and reviewed their charts to extract 31 conditions listed in Charlson and Elixhauser comorbidity indices. Conditions associated with the four study conditions were identified through multivariable logistic regression. We examined the coding validity of the four study conditions related to whether their co-existing conditions were coded, whether the patient died in hospital and the total number of diagnosis codes recorded in a DAD record. Results: Hypertension, diabetes, obesity and depression are generally secondary diagnoses and their validity is affected by the coding of their co-existing conditions. The sensitivity for the four conditions increased as the total number of diagnosis codes in the record increased. The impact of death status on coding validity for the four conditions was minimal. Conclusion: The coding validity of conditions is closely related to their clinical importance and the complexity of patients’ case mix. We recommend mandatory coding of certain secondary diagnoses to meet the need of health research based on administrative health data.


Advancements in health informatics pave the way to explore new medical decision making systems which are characterized by an exponential evolution of knowledge. In the medical domain, disease prediction has become the centre of research with the increasing trend of healthcare applications. The predictive knowledge for the diagnosis of disease highly depends on the subjective knowledge of the experts. So the timely development of a disease prediction model is essential for patients and physicians to overcome the problem of medical distress. This paper explores a hybrid approach (Cooperative Ant Miner Genetic Algorithm) for classifying the medical data. Three benchmarked Type II diabetic datasets (US, PIMA, German) from the UCI machine learning repository were used to analyze the effectiveness of the disease prediction model. The devised classification algorithm with a Soft-Set approach was deployed in a Multi-Cloud environment for enhancing the storage and retrieval of data with reduced response and computation time. The cooperative classification algorithm in the cloud database distinguishes the diseased cases from the normal ones. The soft set theory analyzes the severity of the diseased cases by calculating the percentage of diabetic risk using soft intelligent rules and stores them in a separate knowledge base. Thus the proposed model serves as a suitable tool for eliciting and representing the expert’s decision which aids in prediction of Type II diabetic risk percentage leading to the timely treatment of patients.


2016 ◽  
Vol 23 (4) ◽  
pp. 260-267 ◽  
Author(s):  
Mingkai Peng ◽  
Danielle A Southern ◽  
Tyler Williamson ◽  
Hude Quan

This study examined the coding validity of hypertension, diabetes, obesity and depression related to the presence of their co-existing conditions, death status and the number of diagnosis codes in the hospital discharge abstract database. We randomly selected 4007 discharge abstract database records from four teaching hospitals in Alberta, Canada and reviewed their charts to extract 31 conditions listed in Charlson and Elixhauser comorbidity indices. Conditions associated with the four study conditions were identified through multivariable logistic regression. Coding validity (i.e. sensitivity, positive predictive value) of the four conditions was related to the presence of their associated conditions. Sensitivity increased with an increasing number of diagnosis codes. The impact of death on coding validity was minimal. Coding validity of conditions is closely related to their clinical importance and the complexity of patients’ case mix. We recommend mandatory coding of certain secondary diagnoses to meet the need of health research based on administrative health data.


2015 ◽  
Vol 3 (2) ◽  
pp. 121-131
Author(s):  
Ismail Khalid Kazmi ◽  
Lihua You ◽  
Jian Jun Zhang

Abstract. Organic modeling of 3D characters is a challenging task when it comes to correctly modeling the anatomy of the human body. Most sketch-based modeling tools available today for modeling organic models (humans, animals, creatures, etc.) focus on modeling base mesh models only and provide little or no support for adding details to the base mesh. We propose a hybrid approach which combines geometrical primitives such as generalized cylinders and cubes with Shape-from-Shading (SFS) algorithms to create plausible human character models from sketches. The results show that an artist can quickly create detailed character models from sketches by using this hybrid approach.

