Automatic Extraction and Classification of Patients’ Smoking Status from Free Text Using Natural Language Processing

Background: Tobacco smoking is widespread among HIV patients and is likely to contribute to the development of chronic diseases, yet accurate information on smoking is often difficult to obtain from observational data sources. We sought to validate a natural language processing (NLP) classifier to identify smokers among HIV and control cohorts and investigate whether HIV is an independent predictor of smoking and failure to quit smoking. Methods: We applied an NLP classifier developed within the Partners HealthCare System to electronic medical records for a cohort of HIV patients and a control cohort matched on age, gender, race, and number of clinical encounters. The NLP classifier searches free text notes for “tokens”, phrases containing smoking-related words, and assigns a smoking status (current, former, or non-smoker) to each token. We developed an algorithm for combining token classifications from a 12-month period into a single smoking status (current vs. non-smoker). We validated the yearly smoking status on a random sample of 500 patients (250 from each cohort) using as a gold standard a trained nurse medical record reviewer who assigned a tobacco smoking status to each patient for each calendar year observed. We calculated sensitivity, specificity and area under the receiver operating characteristic curve (AUC). Using NLP, we classified the full cohorts as ever versus never smokers and current versus non-smokers at last observation. We used logistic regression to assess HIV as a predictor of smoking and failure to quit (current smoking among ever smokers). Results: Smoking-related tokens were found in the records of 2926/3554 HIV and 7039/9601 non-HIV patients, providing 34,956 patient years of data. Using NLP to assign smoking status by year yielded sensitivity of 92.4, specificity of 86.2, and AUC of 0.89 (95% confidence interval [CI] 0.88-0.91). NLP assignment of ever versus never smoking status yielded sensitivity of 94.3, specificity of 73.4 and AUC of 0.84 (95% CI 0.81-0.87). Performance of the classifier did not vary by HIV status, gender, age, calendar year, or number of tokens/year. Ever and current smoking were more common in HIV patients than controls (54% vs. 44% and 42% vs. 30%, respectively, both P<0.001). In multivariate models adjusting for demographics, cardiovascular risk factors, coronary heart disease and history of psychiatric illness, HIV was an independent predictor of ever smoking (odds ratio [OR] 1.41, 95% CI 1.29-1.54, P<0.001), current smoking (OR 1.57, 95% CI 1.43-1.72, P<0.001), and failure to quit (OR 1.51, 95% CI 1.31-1.75, P<0.001). Conclusions: We validated a novel tool to ascertain smoking status from HIV observational cohort data. HIV was independently associated with both smoking and failure to quit smoking. These data underscore the need for aggressive smoking cessation strategies specific to HIV patients.
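The abstract does not say how token classifications are combined into a yearly status, so the following is only a minimal sketch under an assumed rule: a patient-year counts as "current" when at least half of its tokens are labelled "current". The function names, the threshold, and the toy data are hypothetical; the sensitivity and specificity formulas are the standard ones.

```python
from collections import Counter


def yearly_status(token_labels, threshold=0.5):
    """Collapse one patient-year of token labels into a single status.

    Assumed rule (not the authors' algorithm): the year is 'current' when
    the share of tokens labelled 'current' is at least `threshold`.
    """
    if not token_labels:
        return None  # no smoking-related tokens were found that year
    counts = Counter(token_labels)
    share_current = counts["current"] / len(token_labels)
    return "current" if share_current >= threshold else "non-smoker"


def sensitivity_specificity(predicted, gold, positive="current"):
    """Compare predicted yearly statuses against gold-standard review."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == positive and g == positive for p, g in pairs)
    fn = sum(p != positive and g == positive for p, g in pairs)
    tn = sum(p != positive and g != positive for p, g in pairs)
    fp = sum(p == positive and g != positive for p, g in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: token labels for four patient-years and the reviewer's status.
years = [
    ["current", "current", "former"],
    ["former", "non-smoker"],
    ["current", "former", "former"],
    ["non-smoker"],
]
gold = ["current", "non-smoker", "current", "non-smoker"]

predicted = [yearly_status(tokens) for tokens in years]
sens, spec = sensitivity_specificity(predicted, gold)
print(predicted, sens, spec)
```

In this toy example the third patient-year is missed because only one of its three tokens says "current", which illustrates how the choice of threshold trades sensitivity against specificity.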

Automate incidental findings in radiology reports using natural language processing and machine learning to identify and classify lung nodules.

Journal of Global Oncology ◽

10.1200/jgo.2019.5.suppl.49 ◽

2019 ◽

Vol 5 (suppl) ◽

pp. 49-49

Author(s):

Christi French ◽

Dax Kurbegov ◽

David R. Spigel ◽

Maciek Makowski ◽

Samantha Terker ◽

...

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Pulmonary Nodule ◽

Incidental Findings ◽

Free Text ◽

Radiology Reports

Background: Pulmonary nodule incidental findings challenge providers to balance resource efficiency and high clinical quality. Incidental findings tend to be under-evaluated, with studies reporting appropriate follow-up rates as low as 29%. The efficient identification of patients with high-risk nodules is foundational to ensuring appropriate follow-up and requires the clinical reading and classification of radiology reports. We tested the feasibility of automating this process with natural language processing (NLP) and machine learning (ML). Methods: In cooperation with Sarah Cannon, the Cancer Institute of HCA Healthcare, we conducted a series of experiments on 8,879 free-text, narrative CT radiology reports. A representative sample of health system ED, IP, and OP reports dated from Dec 2015 - April 2017 was divided into a development set for model training and validation, and a test set to evaluate model performance. A “Nodule Model” was trained to detect the reported presence of a pulmonary nodule and a rules-based “Size Model” was developed to extract the size of the nodule in millimeters. Reports were bucketed into three prediction groups: ≥ 6 mm, <6 mm, and no size indicated. Nodules were placed in a queue for follow-up if the nodule was predicted ≥ 6 mm, or if the nodule had no size indicated and the report contained the word “mass.” The Fleischner Society Guidelines and clinical review informed these definitions. Results: Precision and recall metrics were calculated for multiple model thresholds. A threshold was selected based on the validation set calculations and a success criterion of 90% queue precision was selected to minimize false positives. On the test dataset, the F1 measure of the entire pipeline was 72.9%, recall was 60.3%, and queue precision was 90.2%, exceeding success criteria. Conclusions: The experiments demonstrate the feasibility of technology to automate the detection and classification of pulmonary nodule incidental findings in radiology reports. This approach promises to improve healthcare quality by increasing the rate of appropriate lung nodule incidental finding follow-up and treatment without excessive labor or risking overutilization.
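The queueing rule described in the Methods (follow up when the predicted size is at least 6 mm, or when no size is given and the report mentions "mass") can be sketched as below. The regular expression and the helper names are assumptions made for illustration and are not the authors' Size Model; in particular, this sketch treats any measurement in the report as a nodule size, which a real rules-based extractor would need to refine.

```python
import re

# Hypothetical pattern: a number followed by "mm" or "cm".
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm)\b", re.IGNORECASE)


def largest_size_mm(report):
    """Return the largest measurement found in the report, in mm, or None."""
    sizes = []
    for value, unit in SIZE_PATTERN.findall(report):
        factor = 10.0 if unit.lower() == "cm" else 1.0
        sizes.append(float(value) * factor)
    return max(sizes) if sizes else None


def needs_follow_up(report, nodule_detected):
    """Apply the queue rule to a report the nodule detector has flagged."""
    if not nodule_detected:
        return False
    size = largest_size_mm(report)
    if size is None:
        # No size indicated: queue only when the report mentions "mass".
        return re.search(r"\bmass\b", report, re.IGNORECASE) is not None
    return size >= 6.0


print(needs_follow_up("8 mm nodule in the right upper lobe", True))
print(needs_follow_up("3 mm nodule, likely benign", True))
print(needs_follow_up("Nodular mass in the left lower lobe", True))
```

The `nodule_detected` flag stands in for the output of the trained Nodule Model, which is outside the scope of this sketch.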

Sentiment Analysis Techniques Applied to Raw-Text Data from a Csq-8 Questionnaire about Mindfulness in Times of COVID-19 to Improve Strategy Generation

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18126408 ◽

2021 ◽

Vol 18 (12) ◽

pp. 6408

Author(s):

Mario Jojoa Acosta ◽

Gema Castillo-Sánchez ◽

Begonya Garcia-Zapirain ◽

Isabel de la Torre Díez ◽

Manuel Franco-Martín

Keyword(s):

Health Care ◽

Natural Language Processing ◽

Natural Language ◽

Sentiment Analysis ◽

Transfer Learning ◽

Language Processing ◽

Health Care Professionals ◽

Ground Truth ◽

Relevant Information ◽

Free Text

The use of artificial intelligence in health care has grown quickly. In this sense, we present our work related to the application of Natural Language Processing techniques, as a tool to analyze the sentiment perception of users who answered two questions from the CSQ-8 questionnaires with raw Spanish free-text. Their responses are related to mindfulness, which is a novel technique used to control stress and anxiety caused by different factors in daily life. As such, we proposed an online course where this method was applied in order to improve the quality of life of health care professionals in COVID-19 pandemic times. We also carried out an evaluation of the satisfaction level of the participants involved, with a view to establishing strategies to improve future experiences. To automatically perform this task, we used Natural Language Processing (NLP) models such as swivel embedding, neural networks, and transfer learning, so as to classify the inputs into the following three categories: negative, neutral, and positive. Due to the limited amount of data available (86 records for the first question and 68 for the second), transfer learning techniques were required. The length of the text had no limit from the user’s standpoint, and our approach attained a maximum accuracy of 93.02% and 90.53%, respectively, based on ground truth labeled by three experts. Finally, we proposed a complementary analysis, using computer graphic text representation based on word frequency, to help researchers identify relevant information about the opinions with an objective approach to sentiment. The main conclusion drawn from this work is that applying NLP techniques to small amounts of data using transfer learning can achieve sufficient accuracy in the sentiment analysis and text classification stages.

Applying natural language processing and machine learning techniques to patient experience feedback: a systematic review

BMJ Health & Care Informatics ◽

10.1136/bmjhci-2020-100262 ◽

2021 ◽

Vol 28 (1) ◽

pp. e100262

Author(s):

Mustafa Khanbhai ◽

Patrick Anyadi ◽

Joshua Symons ◽

Kelsey Flott ◽

Ara Darzi ◽

...

Keyword(s):

Machine Learning ◽

Systematic Review ◽

Social Media ◽

Natural Language Processing ◽

Natural Language ◽

Patient Experience ◽

Language Processing ◽

Performance Metrics ◽

Free Text ◽

Patient Feedback

Objectives: Unstructured free-text patient feedback contains rich information, and analysing these data manually would require a lot of personnel resources which are not available in most healthcare organisations. To undertake a systematic review of the literature on the use of natural language processing (NLP) and machine learning (ML) to process and analyse free-text patient experience data. Methods: Databases were systematically searched to identify articles published between January 2000 and December 2019 examining NLP to analyse free-text patient feedback. Due to the heterogeneous nature of the studies, a narrative synthesis was deemed most appropriate. Data related to the study purpose, corpus, methodology, performance metrics and indicators of quality were recorded. Results: Nineteen articles were included. The majority (80%) of studies applied language analysis techniques on patient feedback from social media sites (unsolicited) followed by structured surveys (solicited). Supervised learning was frequently used (n=9), followed by unsupervised (n=6) and semisupervised (n=3). Comments extracted from social media were analysed using an unsupervised approach, and free-text comments held within structured surveys were analysed using a supervised approach. Reported performance metrics included the precision, recall and F-measure, with support vector machine and Naïve Bayes being the best performing ML classifiers. Conclusion: NLP and ML have emerged as an important tool for processing unstructured free text. Both supervised and unsupervised approaches have their role depending on the data source. With the advancement of data analysis tools, these techniques may be useful to healthcare organisations to generate insight from the volumes of unstructured free-text data.

Measuring Adoption of Patient Priorities-Aligned Care Using Natural Language Processing

Innovation in Aging ◽

10.1093/geroni/igaa057.592 ◽

2020 ◽

Vol 4 (Supplement_1) ◽

pp. 183-183

Author(s):

Javad Razjouyan ◽

Jennifer Freytag ◽

Edward Odom ◽

Lilian Dindo ◽

Aanand Naik

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Chart Review ◽

Group Analysis ◽

Intervention Group ◽

Multiple Chronic Conditions ◽

Free Text ◽

Term Care

Patient Priorities Care (PPC) is a model of care that aligns health care recommendations with priorities of older adults with multiple chronic conditions. Social workers (SW), after online training, document PPC in the patient’s electronic health record (EHR). Our goal is to identify free-text notes with PPC language using a natural language processing (NLP) model and to measure PPC adoption and effect on long term services and support (LTSS) use. Free-text notes from the EHR produced by trained SWs passed through a hybrid NLP model that utilized rule-based and statistical machine learning. NLP accuracy was validated against chart review. Patients who received PPC were propensity matched with patients not receiving PPC (control) on age, gender, BMI, Charlson comorbidity index, facility and SW. Changes in LTSS utilization over 6-month intervals were compared between groups with univariate analysis. Chart review indicated that 491 notes out of 689 had PPC language, and the NLP model reached a precision of 0.85, a recall of 0.90, an F1 of 0.87, and an accuracy of 0.91. Within-group analysis showed that the intervention group used LTSS 1.8 times more in the 6 months after the encounter than in the 6 months prior. Between-group analysis showed significantly higher LTSS utilization in the intervention group (p=0.012). An automated NLP model can be used to reliably measure the adoption of PPC by SWs. PPC seems to encourage use of LTSS that may delay time to long term care placement.
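The reported F1 can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below uses only the two figures given in the abstract; the function name is an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract: precision 0.85, recall 0.90.
f1 = f1_score(0.85, 0.90)
print(round(f1, 2))  # 0.87
```

The result, about 0.874, rounds to the 0.87 stated in the abstract.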

Natural language processing methods for knowledge management—Applying document clustering for fast search and grouping of engineering documents

Concurrent Engineering ◽

10.1177/1063293x20982973 ◽

2021 ◽

pp. 1063293X2098297

Author(s):

Ivar Örn Arnarsson ◽

Otto Frost ◽

Emil Gustavsson ◽

Mats Jirstrand ◽

Johan Malmqvist

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Domain Knowledge ◽

Clustering Algorithms ◽

Document Clustering ◽

Unstructured Data ◽

Free Text ◽

Engineering Change ◽

Engineering Documents

Product development companies collect data in the form of Engineering Change Requests for logged design issues, tests, and product iterations. These documents are rich in unstructured data (e.g. free text). Previous research affirms that product developers find that current IT systems lack capabilities to accurately retrieve relevant documents with unstructured data. In this research, we demonstrate a method using Natural Language Processing and document clustering algorithms to find structurally or contextually related documents from databases containing Engineering Change Request documents. The aim is to radically decrease the time needed to effectively search for related engineering documents, organize search results, and create labeled clusters from these documents by utilizing Natural Language Processing algorithms. A domain knowledge expert at the case company evaluated the results and confirmed that the algorithms we applied managed to find relevant document clusters given the queries tested.
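The abstract does not name the specific algorithms used, so the sketch below illustrates one common building block for finding related free-text documents: TF-IDF weighting followed by cosine similarity. This is an assumed stand-in and not the authors' method, and the three example documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) from whitespace-tokenized text."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented change-request snippets for illustration only.
docs = [
    "bracket cracks under vibration test",
    "cracks found in bracket after vibration",
    "update paint colour specification",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two snippets share several terms and score well above zero, while the third shares none and scores zero. A clustering algorithm would typically group documents using pairwise similarities of this kind.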

Development of algorithm for classification smoking status from unstructured bilingual electronic health records based on natural language processing (Preprint)

10.2196/preprints.26978 ◽

2021 ◽

Author(s):

Ye Seul Bae ◽

Kyung Hwan Kim ◽

Han Kyul Kim ◽

Sae Won Choi ◽

Taehoon Ko ◽

...

Keyword(s):

Natural Language Processing ◽

Electronic Health Records ◽

Natural Language ◽

Language Processing ◽

Smoking Status ◽

Svm Classifier ◽

Keyword Extraction ◽

Health Records ◽

Clinical Notes ◽

Electronic Health

BACKGROUND Smoking is a major risk factor and important variable for clinical research, but there are few studies on automatically obtaining smoking classification from unstructured bilingual electronic health records (EHR). OBJECTIVE We aim to develop an algorithm to classify smoking status based on unstructured EHRs using natural language processing (NLP). METHODS With acronym replacement and the Python package Soynlp, we normalize 4,711 bilingual clinical notes. Each EHR note was classified into 4 categories: current smokers, past smokers, never smokers, and unknown. Subsequently, SPPMI (Shifted Positive Pointwise Mutual Information) is used to vectorize words in the notes. By calculating cosine similarity between these word vectors, keywords denoting the same smoking status are identified. RESULTS Compared to other keyword extraction methods (word co-occurrence-, PMI-, and NPMI-based methods), our proposed approach improves keyword extraction precision by as much as 20.0%. These extracted keywords are used in classifying 4 smoking statuses from our bilingual clinical notes. Given an identical SVM classifier, the extracted keywords improve the F1 score by as much as 1.8% compared to those of the unigram and bigram Bag of Words. CONCLUSIONS Our study shows the potential of SPPMI in classifying smoking status from bilingual, unstructured EHRs. Our current findings show how smoking information can be easily acquired and used for clinical practice and research.
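As a rough illustration of the vectorization step, the sketch below computes SPPMI word vectors, where SPPMI(w, c) = max(PMI(w, c) - log k, 0), and compares them with cosine similarity. The co-occurrence window (the whole note), the shift k = 1, and the toy English notes are assumptions; the abstract does not state the authors' settings.

```python
import math
from collections import Counter, defaultdict


def sppmi_vectors(notes, shift_k=1):
    """Map each word to a sparse SPPMI vector over co-occurring words."""
    pair_counts = Counter()
    for tokens in notes:
        # Count ordered co-occurrences of distinct positions within a note.
        for i, w in enumerate(tokens):
            for j, c in enumerate(tokens):
                if i != j:
                    pair_counts[(w, c)] += 1
    total = sum(pair_counts.values())
    marginals = Counter()
    for (w, _), n in pair_counts.items():
        marginals[w] += n
    vectors = defaultdict(dict)
    for (w, c), n in pair_counts.items():
        pmi = math.log(n * total / (marginals[w] * marginals[c]))
        value = pmi - math.log(shift_k)
        if value > 0:  # keep only the positive part
            vectors[w][c] = value
    return vectors


def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy tokenized notes, invented for illustration.
notes = [
    ["patient", "quit", "smoking"],
    ["patient", "stopped", "smoking"],
    ["denies", "alcohol"],
]
vecs = sppmi_vectors(notes)
print(cosine(vecs["quit"], vecs["stopped"]))
print(cosine(vecs["quit"], vecs["denies"]))
```

Here "quit" and "stopped" appear in identical contexts and therefore get near-identical vectors, which is the property the keyword identification step relies on. Larger values of k discard weaker associations.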

An Evaluation of Patient Safety Event Report Categories Using Unsupervised Topic Modeling

Methods of Information in Medicine ◽

10.3414/me15-01-0010 ◽

2015 ◽

Vol 54 (04) ◽

pp. 338-345 ◽

Cited By ~ 10

Author(s):

A. Fong ◽

R. Ratwani

Keyword(s):

Patient Safety ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Topic Modeling ◽

Free Text ◽

Event Data ◽

Event Type ◽

Modeling Approach ◽

Safety Event

Summary. Objective: Patient safety event data repositories have the potential to dramatically improve safety if analyzed and leveraged appropriately. These safety event reports often consist of both structured data, such as general event type categories, and unstructured data, such as free text descriptions of the event. Analyzing these data, particularly the rich free text narratives, can be challenging, especially with tens of thousands of reports. To overcome the resource intensive manual review process of the free text descriptions, we demonstrate the effectiveness of using an unsupervised natural language processing approach. Methods: An unsupervised natural language processing technique, called topic modeling, was applied to a large repository of patient safety event data to identify topics, or themes, from the free text descriptions of the data. Entropy measures were used to evaluate and compare these topics to the general event type categories that were originally assigned by the event reporter. Results: Measures of entropy demonstrated that some topics generated from the unsupervised modeling approach aligned with the clinical general event type categories that were originally selected by the individual entering the report. Importantly, several new latent topics emerged that were not originally identified. The new topics provide additional insights into the patient safety event data that would not otherwise easily be detected. Conclusion: The topic modeling approach provides a method to identify topics or themes that may not be immediately apparent and has the potential to allow for automatic reclassification of events that are ambiguously classified by the event reporter.
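One way to read the entropy comparison is as the Shannon entropy of the reporter-assigned categories among the reports grouped under each topic: low entropy means the topic lines up with a single category, high entropy means it cuts across several. The sketch below assumes that interpretation; the abstract does not give the exact entropy measure used, and the category labels are invented.

```python
import math
from collections import Counter


def category_entropy(categories):
    """Shannon entropy (in bits) of the category labels within one topic."""
    counts = Counter(categories)
    total = len(categories)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Invented reporter-assigned categories for the reports in two topics.
topic_a = ["fall", "fall", "fall", "fall"]
topic_b = ["fall", "medication", "fall", "medication"]

print(category_entropy(topic_a))  # 0.0
print(category_entropy(topic_b))  # 1.0
```

Under this reading, a topic like `topic_a` aligns with an existing event type, while a topic with high entropy may point to a latent theme that the existing categories do not capture.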
