Development of a generalizable natural language processing pipeline to extract physician-reported pain from clinical reports: Generated using publicly-available datasets and tested on institutional clinical reports for cancer patients with bone metastases

BACKGROUND: The COVID-19 pandemic has created a pressing need to integrate information from disparate sources in order to assist decision makers. Social media is important in this respect; however, natural language processing methods are needed to make sense of the textual information it provides and to automate the processing of large amounts of data. Social media posts are often noisy, yet they may provide valuable insights into the severity and prevalence of the disease in the population. In particular, machine learning techniques for triage and diagnosis could allow for a better understanding of what social media may offer in this respect. OBJECTIVE: This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and other interested parties with additional information on the symptoms, severity, and prevalence of the disease. METHODS: The text processing pipeline first extracts COVID-19 symptoms and related concepts such as severity, duration, negations, and body parts from patients' posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are used separately to build support vector machine models that triage patients into three categories and diagnose them for COVID-19. RESULTS: We report macro- and micro-averaged F1 scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19 when the models are trained on human-labelled data. Our experimental results indicate that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. We also highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones. CONCLUSIONS: Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline, in order to provide additional information on the severity and prevalence of the disease through the eyes of social media.
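The abstract above reports both macro- and micro-averaged F1 scores. As a reminder of how the two averages differ, here is a minimal sketch in plain Python. The three triage label names and the toy predictions are hypothetical and are not taken from the study; the function name `f1_scores` is likewise only illustrative.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 values, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical three-way triage labels (assumed for illustration only).
y_true = ["stay_home", "stay_home", "stay_home", "stay_home", "see_gp", "hospital"]
y_pred = ["stay_home", "stay_home", "stay_home", "stay_home", "stay_home", "hospital"]
macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

In this toy example the rare `see_gp` class is never predicted, which pulls the macro average well below the micro average. This is why class-imbalanced tasks often report both figures.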

A natural language processing pipeline to advance the use of Twitter data for digital epidemiology of adverse pregnancy outcomes

Journal of Biomedical Informatics X, 2020, Vol 8, pp. 100076. DOI: 10.1016/j.yjbinx.2020.100076. Cited By ~ 1

Author(s): Ari Z. Klein, Haitao Cai, Davy Weissenbacher, Lisa D. Levine, Graciela Gonzalez-Hernandez

Keyword(s): Natural Language Processing, Natural Language, Language Processing, Pregnancy Outcomes, Adverse Pregnancy Outcomes, Processing Pipeline, Twitter Data, Digital Epidemiology, Adverse Pregnancy

Natural Language Processing to Assess End-of-Life Quality Indicators in Breast Cancer Patients with Leptomeningeal Disease (SA528C)

Journal of Pain and Symptom Management, 2019, Vol 57 (2), pp. 454-455. DOI: 10.1016/j.jpainsymman.2018.12.206

Author(s): Kate Brizzi, Charlotta Lindvall, Sophia Zupanc

Keyword(s): Breast Cancer, Natural Language Processing, Natural Language, End Of Life, Cancer Patients, Quality Indicators, Language Processing, Life Quality, Leptomeningeal Disease, Breast Cancer Patients

Natural language processing for automated quantification of bone metastases reported in free-text bone scintigraphy reports

Acta Oncologica, 2020, Vol 59 (12), pp. 1455-1460. DOI: 10.1080/0284186x.2020.1819563

Author(s): Olivier Q. Groot, Michiel E. R. Bongers, Aditya V. Karhade, Neal D. Kapoor, Brian P. Fenn, ...

Keyword(s): Natural Language Processing, Natural Language, Bone Metastases, Bone Scintigraphy, Language Processing, Free Text, Automated Quantification

Natural language processing and recurrent network models for identifying genomic mutation-associated cancer treatment change from patient progress notes

JAMIA Open, 2019, Vol 2 (1), pp. 139-149. DOI: 10.1093/jamiaopen/ooy061. Cited By ~ 9

Author(s): Meijian Guan, Samuel Cho, Robin Petro, Wei Zhang, Boris Pasche, ...

Keyword(s): Machine Learning, Natural Language Processing, Natural Language, Cancer Patients, Language Processing, Learning Algorithms, Machine Learning Algorithms, Free Text, Treatment Change, Progress Notes

Abstract. Objectives: Natural language processing (NLP) and machine learning approaches were used to build classifiers to identify genomic-related treatment changes in the free-text visit progress notes of cancer patients. Methods: We obtained 5889 deidentified progress reports (2439 words on average) for 755 cancer patients who had undergone clinical next-generation sequencing (NGS) testing at Wake Forest Baptist Comprehensive Cancer Center. An NLP system was implemented to process the free-text data and extract NGS-related information. Three types of recurrent neural network (RNN), namely gated recurrent unit, long short-term memory (LSTM), and bidirectional LSTM (LSTM_Bi), were applied to classify documents into treatment-change and no-treatment-change groups. Further, we compared the performance of the RNNs with that of 5 machine learning algorithms: Naive Bayes, K-Nearest Neighbor, Support Vector Machine, Random Forest, and Logistic Regression. Results: Our results suggested that, overall, RNNs outperformed traditional machine learning algorithms, and LSTM_Bi showed the best performance among the RNNs in terms of accuracy, precision, recall, and F1 score. In addition, pretrained word embeddings can improve the accuracy of LSTM by 3.4% and reduce the training time by more than 60%. Discussion and Conclusion: NLP and RNN-based text mining solutions have demonstrated advantages in information retrieval and document classification tasks for unstructured clinical progress notes.
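To make the kind of traditional baseline named in this abstract concrete, the following is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python. It is not the authors' implementation: the whitespace tokenization, the add-one smoothing, the function names `train_nb` and `predict_nb`, and the four toy notes are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on whitespace-tokenized documents."""
    class_counts = Counter(labels)       # number of documents per class
    word_counts = defaultdict(Counter)   # per-class token frequencies
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # ignore tokens never seen during training
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy notes (assumed data, not drawn from the study's corpus).
docs = [
    "switched to targeted therapy after ngs result",
    "started new inhibitor based on mutation",
    "continue current regimen no change",
    "stable disease continue same chemotherapy",
]
labels = ["change", "change", "no_change", "no_change"]

model = train_nb(docs, labels)
pred_change = predict_nb(model, "therapy switched because of mutation")
pred_same = predict_nb(model, "continue same regimen")
print(pred_change, pred_same)
```

A bag-of-words model like this ignores word order, which is one plausible reason sequence models such as the RNNs described above can do better on long progress notes.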
