Classification of CT pulmonary angiography reports by presence, chronicity, and location of pulmonary embolism with natural language processing

Abstract: Electronic health records (EHRs) contain rich data for identifying and studying diabetes. Many phenotype algorithms have been developed to identify research subjects with type 2 diabetes (T2D), but very few accurately identify cases of type 1 diabetes (T1D) or rarer forms such as monogenic and atypical metabolic presentations. Polygenic risk scores (PRS) use common genomic variants to quantify disease risk and perform well for both T1D and T2D. In this study, we apply validated phenotyping algorithms to EHRs linked to a genomic biobank to understand the independent contribution of PRS to the classification of diabetes etiology and to generate additional novel markers that distinguish subtypes of diabetes in EHR data. Using a de-identified mirror of a medical center's electronic health record, we applied published algorithms for T1D and T2D to identify cases, and used natural language processing and chart review strategies to identify cases of maturity onset diabetes of the young (MODY) and other rarer presentations. This novel approach included additional data types such as medication sequencing, ratio and temporality of insulin and non-insulin agents, clinical genetic testing, and ratios of diagnostic codes. Chart review was performed to validate etiology. To calculate PRS, we used genome-wide genotyping from our BioBank, the de-identified biobank linking EHR to genomic data, applying coefficients of 65 published T1D SNPs and 76,996 T2D SNPs with PLINK in Caucasian subjects. In the dataset, we identified 82,238 cases of T2D but only 130 cases of T1D using the most cited published algorithms. Adding novel structured elements and natural language processing identified an additional 138 cases of T1D and distinguished 354 cases as MODY. Among over 90,000 subjects with genotyping data available, we included 72,624 Caucasian subjects, since the PRS coefficients were generated in Caucasian cohorts. Among those subjects, 248, 6,488, and 21 were identified as T1D, T2D, and MODY subjects, respectively, in our final PRS cohort. The T1D PRS discriminated significantly between cases and controls (Mann-Whitney p-value of 3.4e-17). The PRS for T2D did not significantly discriminate between cases and controls using published algorithms. The atypical case count was too low to calculate PRS discrimination. Calculation of the PRS was limited by the quality and inclusion of the variants available, and discrimination may improve in larger data sets. Additionally, blinded physician case review is ongoing to validate the novel classification scheme and to provide a gold standard for machine learning approaches that can be applied in validation sets.
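The two quantitative steps described in the abstract, scoring each subject with per-SNP coefficients and then comparing case and control scores with a Mann-Whitney test, can be illustrated with a minimal sketch. It assumes a PRS is a plain weighted sum of risk-allele dosages; it is not PLINK's implementation, and the SNP identifiers, weights, and function names are hypothetical placeholders rather than values from the study.

```python
from typing import Dict, Sequence


def polygenic_risk_score(dosages: Dict[str, float],
                         weights: Dict[str, float]) -> float:
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies).

    SNPs missing from the subject's genotype data are skipped, which is
    one simple (assumed) way to handle variants that failed quality control.
    """
    return sum(w * dosages[snp] for snp, w in weights.items() if snp in dosages)


def mann_whitney_u(cases: Sequence[float], controls: Sequence[float]) -> float:
    """U statistic for the case group.

    Counts (case, control) pairs where the case score is higher;
    tied pairs contribute 0.5.
    """
    u = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                u += 1.0
            elif case_score == control_score:
                u += 0.5
    return u


def auc_from_u(u: float, n_cases: int, n_controls: int) -> float:
    """U divided by the number of pairs equals the area under the ROC curve."""
    return u / (n_cases * n_controls)


# Hypothetical per-SNP coefficients (not taken from any published score).
example_weights = {"rs_a": 0.5, "rs_b": -0.2, "rs_c": 1.0}
# One subject's dosages; rs_c is absent and is therefore skipped.
example_subject = {"rs_a": 2, "rs_b": 1}
example_score = polygenic_risk_score(example_subject, example_weights)
```

An AUC near 0.5 would correspond to a score that does not separate cases from controls, while values closer to 1.0 indicate stronger discrimination. A p-value such as the one reported would additionally require the null distribution of U, which this sketch does not compute.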


Emotion Classification in Spanish: Exploring the Hard Classes

Information, Vol 12 (11), pp. 438, 2021. DOI: 10.3390/info12110438

Author(s): Aiala Rosá, Luis Chiruzzo

Keyword(s): Natural Language Processing, Natural Language, Sentiment Analysis, Language Processing, Emotion Classification

The study of affective language has seen numerous developments in Natural Language Processing in recent years, but the focus has been predominantly on Sentiment Analysis, a term usually used for the classification of texts according to their polarity or valence (positive vs. negative). The study of emotions such as joy, sadness, anger, and surprise has been much less developed and has fewer resources, both for English and for other languages such as Spanish. In this paper, we present the most relevant existing resources for the study of emotions, mainly for Spanish; we describe some heuristics for the union of two existing corpora of Spanish tweets; and, based on experiments in classifying tweets into seven categories (anger, disgust, fear, joy, sadness, surprise, and others), we analyze the most problematic classes.
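One common way to find the most problematic classes in a multi-class experiment is to compare per-class F1 scores. The sketch below assumes that approach; the abstract does not state which metric the authors used, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

# The seven categories named in the abstract.
LABELS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "others"]


def per_class_f1(gold: Sequence[str], pred: Sequence[str],
                 labels: Sequence[str]) -> Dict[str, float]:
    """F1 per label, computed as 2*TP / (2*TP + FP + FN).

    A label with no gold and no predicted instances gets 0.0.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def hardest_classes(scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k labels with the lowest F1, lowest first."""
    return sorted(scores, key=lambda label: scores[label])[:k]


# Toy gold and predicted labels, for illustration only.
toy_gold = ["joy", "joy", "anger", "fear"]
toy_pred = ["joy", "anger", "anger", "joy"]
toy_scores = per_class_f1(toy_gold, toy_pred, ["joy", "anger", "fear"])
```

Calling `hardest_classes(per_class_f1(gold, pred, LABELS))` on real predictions would list the categories the classifier handles worst under this metric.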
