Development of data-driven framework for automatically identifying patient cohorts from linked electronic health records

Author(s):  
Fabiola Fernández-Gutiérrez ◽  
Jonathan Kennedy ◽  
Roxanne Cooksey ◽  
Mark Atkinson ◽  
Ernest Choy ◽  
...  

ABSTRACT

Objectives: 1) To develop a fully data-driven framework for automatically identifying patients with a condition from routine electronic primary care records; 2) to identify informative codes (risk factors) for arthropathy conditions in primary care records that can accurately predict a diagnosis of those conditions in secondary care records.

Approach: This study linked routine primary and secondary care records in Wales, UK, held in the SAIL (Secure Anonymised Information Linkage) databank, with the secondary care records serving as the gold standard. We proposed to use machine learning techniques to extract patient information and identify cohorts with a condition from the large, high-dimensional linked dataset in the following phases: data preparation, carried out in a machine learning context; pre-selection of initial features, followed by ranking and selecting features into a meaningful subset using feature selection methods; and development of an identification algorithm that incorporates mechanisms for handling the imbalanced nature of the data. The data-driven framework was then validated on an independent dataset and compared with an existing algorithm that had been developed using expert clinician knowledge of arthropathy conditions.

Results: Rheumatoid arthritis (RA) and ankylosing spondylitis (AS) were used to demonstrate the feasibility of this framework. Linking primary care records with the secondary care rheumatology clinical system, we collected 9,657 patients, including 1,484 with RA and 204 with AS. The proposed framework identified various compact subsets of informative features (risk factors) from 43,100 potential Read codes. Applied to an independent test dataset, the framework achieved a classification accuracy and positive predictive value (PPV) of 86.19% and 88.46% respectively for RA, and 99.23% and 97.75% respectively for AS. These are comparable with the performance of the clinical knowledge-based method: an accuracy of 85.85% and a PPV of 85.28% for RA, and an accuracy of 97.86% and a PPV of 95.65% for AS.

Conclusion: The proposed data-driven framework provides a rapid and cost-effective way of reliably identifying patients with a medical condition from primary care data. It performed as well as the clinically derived algorithm. The framework is not intended to substitute for clinical expertise; instead, it provides a decision support tool for clinicians during their decision process, in particular the selection of patients for clinical trials.
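The two headline metrics in the results, classification accuracy and PPV, can both be read off a confusion matrix. The following is a minimal sketch of that arithmetic, not the authors' implementation; the function names and the small label vectors are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # predicted positive, truly positive
    fp: int  # predicted positive, truly negative
    tn: int  # predicted negative, truly negative
    fn: int  # predicted negative, truly positive


def count_outcomes(predicted: Sequence[int], actual: Sequence[int]) -> ConfusionCounts:
    """Tally the four confusion-matrix cells from 0/1 label vectors."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    return ConfusionCounts(tp, fp, tn, fn)


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all patients classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def ppv(c: ConfusionCounts) -> float:
    """Fraction of predicted-positive patients who truly have the condition."""
    if c.tp + c.fp == 0:
        raise ValueError("PPV is undefined when nothing is predicted positive")
    return c.tp / (c.tp + c.fp)


# Hypothetical labels: 1 = has the condition (gold standard), 0 = does not.
predicted_labels = [1, 1, 0, 0, 1, 0, 1, 0]
actual_labels = [1, 0, 0, 0, 1, 1, 1, 0]
counts = count_outcomes(predicted_labels, actual_labels)
```

Because the cohort is imbalanced (far fewer AS than non-AS patients), accuracy alone can look high even for a weak classifier, which is one reason PPV is reported alongside it.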

2020 ◽  
Vol 79 (Suppl 1) ◽  
pp. 897.2-897
Author(s):  
M. Maurits ◽  
T. Huizinga ◽  
M. Reinders ◽  
S. Raychaudhuri ◽  
E. Karlson ◽  
...  

Background: Heterogeneity in disease populations complicates the discovery of risk factors. To identify risk factors for subpopulations of diseases, we need analytical methods that can deal with unidentified disease subgroups.

Objectives: Inspired by successful approaches from the Big Data field, we developed a high-throughput approach to identify subpopulations within patients with heterogeneous, complex diseases using the wealth of information available in Electronic Medical Records (EMRs).

Methods: We extracted longitudinal healthcare-interaction records, coded by 1,853 PheCodes [1], of the 64,819 patients in Boston’s Partners Biobank. Using t-SNE [2] for dimensionality reduction, we created a 2D embedding of 32,424 of these patients (set A). We then identified distinct clusters in the t-SNE embedding using DBSCAN [3] and visualized the relative importance of individual PheCodes within them using specialized spectrographs. We replicated this procedure in the remaining 32,395 records (set B).

Results: Summary statistics of both sets were comparable (Table 1).

Table 1. Summary statistics of the total Partners Biobank dataset and the 2 partitions.

                      Set A         Set B         Total
Entries               12,200,311    12,177,131    24,377,442
Patients              32,424        32,395        64,819
Patient-years         369,546.33    368,597.92    738,144.2
Unique ICD codes      25,056        24,953        26,305
Unique PheCodes       1,851         1,853         1,853

We found 284 clusters in set A and 295 in set B; 63.4% of the clusters in set A could be mapped to a cluster in set B, with a median (range) correlation of 0.24 (0.03 – 0.58). Clusters represented similar yet distinct clinical phenotypes; e.g. patients diagnosed with “other headache syndrome” were separated into four distinct clusters characterized by migraines, neurofibromatosis, epilepsy or brain cancer, all resulting in patients presenting with headaches (Fig. 1 & 2). Though EMR databases tend to be noisy, our method was also able to differentiate misclassification from true cases; SLE patients with RA codes clustered separately from true RA cases.

Figure 1. Two-dimensional representation of set A generated using dimensionality reduction (t-SNE) and clustering (DBSCAN).

Figure 2. Phenotype Spectrographs (PheSpecs) of four clusters characterized by “Other headache syndromes”, driven by codes relating to migraine, epilepsy, neurofibromatosis or brain cancer.

Conclusion: We have shown that EMR data can be used to identify and visualize latent structure in patient categorizations, using an approach based on dimensionality reduction and clustering. Our method can identify misclassified patients as well as separate patients with similar problems into subsets with different associated medical problems. Our approach adds a new and powerful tool to aid in the discovery of novel risk factors in complex, heterogeneous diseases.

References:
[1] Denny, J.C. et al. Bioinformatics (2010)
[2] van der Maaten et al. Journal of Machine Learning Research (2008)
[3] Ester, M. et al. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996)

Disclosure of Interests: Marc Maurits: None declared, Thomas Huizinga Grant/research support from: Ablynx, Bristol-Myers Squibb, Roche, Sanofi, Consultant of: Ablynx, Bristol-Myers Squibb, Roche, Sanofi, Marcel Reinders: None declared, Soumya Raychaudhuri: None declared, Elizabeth Karlson: None declared, Erik van den Akker: None declared, Rachel Knevel: None declared
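The clustering step described in the methods, DBSCAN applied to a 2D embedding, can be sketched in a few lines. This is a simplified pure-Python illustration of the density-based idea from reference [3], assuming the embedding already exists; it is not the authors' pipeline. The points, `eps` and `min_pts` values are hypothetical, and a brute-force neighbour search like this would be far too slow for tens of thousands of patients.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
NOISE = -1  # label for points that belong to no cluster


def region_query(points: Sequence[Point], i: int, eps: float) -> List[int]:
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels: List[Optional[int]] = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return [label if label is not None else NOISE for label in labels]


# Hypothetical 2D embedding: two dense groups and one isolated point.
embedding = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (50.0, 50.0),
]
cluster_labels = dbscan(embedding, eps=1.5, min_pts=3)
```

DBSCAN does not need the number of clusters in advance and leaves sparse points unlabelled, which fits a setting where the number of patient subgroups is unknown.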


Water ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 1208
Author(s):  
Massimiliano Bordoni ◽  
Fabrizio Inzaghi ◽  
Valerio Vivaldi ◽  
Roberto Valentino ◽  
Marco Bittelli ◽  
...  

Soil water potential is a key factor in studying water dynamics in soil and in estimating the occurrence of natural hazards such as landslides. This parameter can be measured in the field or estimated through physically-based models, which are limited by the availability of effective input soil properties and by the need for preliminary calibration. Data-driven models based on machine learning techniques could overcome these gaps. The aim of this paper is to develop an innovative machine learning methodology to assess soil water potential trends and to implement them in models that predict shallow landslides. Monitoring data collected since 2012 from test-site slopes in Oltrepò Pavese (northern Italy) were used to build the models. Among the tested techniques, Random Forest models allowed an outstanding reconstruction of measured temporal trends in soil water potential. Each model is sensitive to meteorological and hydrological characteristics according to soil depth and soil features. The reliability of the proposed models was confirmed by the correct estimation of the days when shallow landslides were triggered in the study areas in December 2020, after implementing the modeled trends in a slope stability model, and by the correct choice of physically-based rainfall thresholds. These results confirm the potential of the developed methodology for estimating hydrological scenarios that could be used for decision-making purposes.


CNS Spectrums ◽  
2021 ◽  
Vol 26 (2) ◽  
pp. 167-168
Author(s):  
C. Brendan Montano ◽  
Mehul Patel ◽  
Rakesh Jain ◽  
Prakash S. Masand ◽  
Amanda Harrington ◽  
...  

Abstract

Introduction: Approximately 70% of patients with bipolar disorder (BPD) are initially misdiagnosed, resulting in significantly delayed diagnosis of 7–10 years on average. Misdiagnosis and diagnostic delay adversely affect health outcomes and lead to the use of inappropriate treatments. As depressive episodes and symptoms are the predominant symptom presentation in BPD, misdiagnosis as major depressive disorder (MDD) is common. Self-rated screening instruments for BPD exist but their length and reliance on past manic symptoms are barriers to implementation, especially in primary care settings where many of these patients initially present. We developed a brief, pragmatic bipolar I disorder (BPD-I) screening tool that not only screens for manic symptoms but also includes risk factors for BPD-I (eg, age of depression onset) to help clinicians reduce the misdiagnosis of BPD-I as MDD.

Methods: Existing questionnaires and risk factors were identified through a targeted literature search; a multidisciplinary panel of experts participated in 2 modified Delphi panels to select concepts thought to differentiate BPD-I from MDD. Individuals with self-reported BPD-I or MDD participated in cognitive debriefing interviews (N=12) to test and refine item wording. A multisite, cross-sectional, observational study was conducted to evaluate the screening tool’s predictive validity. Participants with clinical interview-confirmed diagnoses of BPD-I or MDD completed a draft 10-item screening tool and additional questionnaires/questions. Different combinations of item sets with various item permutations (eg, number of depressive episodes, age of onset) were simultaneously tested. The final combination of items and thresholds was selected based on multiple considerations including clinical validity, optimization of sensitivity and specificity, and pragmatism.

Results: A total of 160 clinical interviews were conducted; 139 patients had clinical interview-confirmed BPD-I (n=67) or MDD (n=72). The screening tool was reduced from 10 to 6 items based on item-level analysis. When 4 items or more were endorsed (yes) in this analysis sample, the sensitivity of this tool for identifying patients with BPD-I was 0.88 and specificity was 0.80; positive and negative predictive values were 0.80 and 0.88, respectively. These properties represent an improvement over the Mood Disorder Questionnaire, while using >50% fewer items.

Conclusion: This new 6-item BPD-I screening tool serves to differentiate BPD-I from MDD in patients with depressive symptoms. Use of this tool can provide real-world guidance to primary care practitioners on whether more comprehensive assessment for BPD-I is warranted. Use of a brief and valid tool provides an opportunity to reduce misdiagnosis, improve treatment selection, and enhance health outcomes in busy clinical practices.

Funding: AbbVie Inc.
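The scoring rule reported in the results (screen positive when 4 or more of 6 items are endorsed) and the way sensitivity and specificity follow from it can be sketched as below. This is an illustrative sketch only: the respondent data are invented, the function names are hypothetical, and the actual item content of the tool is not given in the abstract.

```python
from typing import List, Sequence, Tuple


def screen_positive(item_responses: Sequence[bool], threshold: int = 4) -> bool:
    """Screen positive when at least `threshold` items are endorsed (yes)."""
    return sum(item_responses) >= threshold


def sensitivity_specificity(
    screened: Sequence[bool], has_condition: Sequence[bool]
) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for s, c in zip(screened, has_condition) if s and c)
    fn = sum(1 for s, c in zip(screened, has_condition) if not s and c)
    tn = sum(1 for s, c in zip(screened, has_condition) if not s and not c)
    fp = sum(1 for s, c in zip(screened, has_condition) if s and not c)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case with and one without the condition")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical respondents: six yes/no item responses plus the
# interview-confirmed diagnosis (True = BPD-I, False = MDD).
respondents: List[Tuple[List[bool], bool]] = [
    ([True, True, True, True, False, False], True),      # 4 endorsed
    ([True, True, True, True, True, True], True),        # 6 endorsed
    ([True, False, True, False, False, False], True),    # 2 endorsed
    ([False, False, False, False, False, False], False), # 0 endorsed
    ([True, True, False, False, True, False], False),    # 3 endorsed
    ([True, True, True, True, True, False], False),      # 5 endorsed
    ([False, True, False, False, False, False], False),  # 1 endorsed
]
screened = [screen_positive(items) for items, _ in respondents]
diagnoses = [diagnosis for _, diagnosis in respondents]
sens, spec = sensitivity_specificity(screened, diagnoses)
```

Raising the threshold trades sensitivity for specificity, which is why the abstract describes threshold selection as an optimization over both.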


2021 ◽  
Author(s):  
Daniël den Heijer ◽  
Bernard Foing

The lunar south pole is of particular interest to researchers because of its unique geographical features. It contains craters whose interiors the near-constant sunlight does not reach. These craters are of enormous importance to human exploration of the Moon. This research aims to develop an identification algorithm, applied to LROC data, to characterize and identify potential regions of interest at the lunar south pole. Applications include locating lava tubes and skylights (and their surroundings), detecting craters for age estimation, and planning traverses for successive Artemis missions. These regions will be identified using machine learning techniques such as a deep convolutional neural network, which will be trained on labeled data and then used to identify and characterize new regions of interest.


2020 ◽  
Vol 20 (1) ◽  
Author(s):  
Georgios Kantidakis ◽  
Hein Putter ◽  
Carlo Lancia ◽  
Jacob de Boer ◽  
Andries E. Braat ◽  
...  

Abstract

Background: Predicting survival of recipients after liver transplantation is regarded as one of the most important challenges in contemporary medicine. Hence, improving on current prediction models is of great interest. There is currently a strong discussion in the medical field about machine learning (ML) and whether it has greater potential than traditional regression models when dealing with complex data. Criticism of ML relates to unsuitable performance measures and a lack of interpretability, which is important for clinicians.

Methods: In this paper, ML techniques such as random forests and neural networks are applied to a large dataset of 62,294 patients from the United States, with 97 predictors selected on clinical/statistical grounds from more than 600, to predict survival after transplantation. Of particular interest is also the identification of potential risk factors. A comparison is performed between 3 different Cox models (with all variables, backward selection and LASSO) and 3 machine learning techniques: a random survival forest and 2 partial logistic artificial neural networks (PLANNs). For PLANNs, novel extensions to their original specification are tested. Emphasis is placed on the advantages and pitfalls of each method and on the interpretability of the ML techniques.

Results: Well-established predictive measures from the survival field are employed (C-index, Brier score and Integrated Brier Score) and the strongest prognostic factors are identified for each model. The clinical endpoint is overall graft survival, defined as the time between transplantation and the date of graft failure or death. The random survival forest shows slightly better predictive performance than the Cox models based on the C-index. Neural networks show better performance than both the Cox models and the random survival forest based on the Integrated Brier Score at 10 years.

Conclusion: In this work, it is shown that machine learning techniques can be a useful tool for both prediction and interpretation in the survival context. Of the ML techniques examined here, PLANN with 1 hidden layer predicts survival probabilities the most accurately, being as well calibrated as the Cox model with all variables.

Trial registration: Retrospective data were provided by the Scientific Registry of Transplant Recipients under Data Use Agreement number 9477 for analysis of risk factors after liver transplantation.
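The C-index used to compare the models measures how often a model ranks two patients in the right order of failure. The sketch below is a simplified version of Harrell's concordance index, written for illustration rather than taken from the paper: it assumes higher risk scores mean earlier expected failure, skips pairs with tied event times, and uses invented follow-up data.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float], events: Sequence[bool], risk_scores: Sequence[float]
) -> float:
    """Simplified Harrell's C for right-censored survival data.

    A pair (i, j) is comparable when patient i had an observed event and
    a strictly shorter time than patient j. The pair is concordant when
    the model gave patient i the higher risk score; tied scores count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs; C-index is undefined")
    return concordant / comparable


# Hypothetical follow-up data: time in years, whether graft failure or
# death was observed (False = censored), and a model's predicted risk.
follow_up_years = [2.0, 4.0, 6.0, 8.0]
event_observed = [True, True, False, True]
predicted_risk = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(follow_up_years, event_observed, predicted_risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The C-index only assesses discrimination, which is why the abstract pairs it with the Brier score, a measure that also reflects calibration.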

