A natural language processing pipeline to advance the use of Twitter data for digital epidemiology of adverse pregnancy outcomes

2020 ◽  
Vol 8 ◽  
pp. 100076 ◽  
Author(s):  
Ari Z. Klein ◽  
Haitao Cai ◽  
Davy Weissenbacher ◽  
Lisa D. Levine ◽  
Graciela Gonzalez-Hernandez
2021 ◽  
Author(s):  
Abul Hasan ◽  
Mark Levene ◽  
David Weston ◽  
Renate Fromson ◽  
Nicolas Koslover ◽  
...  

BACKGROUND The COVID-19 pandemic has created a pressing need to integrate information from disparate sources in order to assist decision makers. Social media is important in this respect; however, natural language processing methods are needed to make sense of the textual information it provides and to automate the processing of large amounts of data. Social media posts are often noisy, yet they may provide valuable insights into the severity and prevalence of the disease in the population. In particular, machine learning techniques for triage and diagnosis could allow a better understanding of what social media may offer in this respect.

OBJECTIVE This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and other interested parties with additional information on the symptoms, severity, and prevalence of the disease.

METHODS The text processing pipeline first extracts COVID-19 symptoms and related concepts, such as severity, duration, negations, and body parts, from patients' posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are applied separately to build support vector machine (SVM) models that triage patients into three categories and diagnose them for COVID-19.

RESULTS We report macro- and micro-averaged F1 scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19 when the models are trained on human-labelled data. Our experimental results indicate that similar performance can be achieved when the models are trained on labels predicted by the concept extraction and rule-based classifiers, thus yielding an end-to-end machine learning pipeline. We also highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones.

CONCLUSIONS Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline, providing additional information on the severity and prevalence of the disease through the lens of social media.
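
As a rough illustration of the final stage of such a pipeline, the sketch below trains an SVM triage classifier on concept-based feature vectors with scikit-learn. The concept features, triage coding, and toy posts are hypothetical stand-ins for the upstream CRF and rule-based outputs described in the abstract, not the authors' implementation.

```python
# Minimal sketch: concept-based vectors feeding an SVM triage classifier.
# The feature dictionaries and labels below are hypothetical; in the paper,
# vectors are built from CRF-extracted symptoms, modifiers, and relations.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

# Each post is represented by the concepts (and relations) extracted upstream.
posts = [
    {"symptom=cough": 1, "severity=mild": 1, "body_part=chest": 1},
    {"symptom=fever": 1, "symptom=loss_of_smell": 1, "severity=severe": 1},
    {"negation=no_symptoms": 1},
    {"symptom=fatigue": 1, "duration=two_weeks": 1},
]
# Hypothetical triage coding: 0 = self-care, 1 = see a doctor, 2 = emergency.
labels = [0, 2, 0, 1]

model = make_pipeline(DictVectorizer(sparse=True), LinearSVC())
model.fit(posts, labels)

pred = model.predict(posts)
print("Macro-F1:", f1_score(labels, pred, average="macro"))
print("Micro-F1:", f1_score(labels, pred, average="micro"))
```

The same vectorized representation can be reused for the diagnosis model by swapping in binary COVID-19 labels, which is why the pipeline builds the vectors once and trains the two classifiers separately.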


IoT ◽  
2020 ◽  
Vol 1 (2) ◽  
pp. 218-239 ◽  
Author(s):  
Ravikumar Patel ◽  
Kalpdrum Passi

In the derived approach, an analysis is performed on Twitter data for the 2014 World Cup soccer tournament held in Brazil to detect the sentiment of people throughout the world using machine learning techniques. The data are filtered and analyzed with natural language processing techniques, and sentiment polarity is calculated from the emotion words detected in the user tweets. The dataset is normalized for use by machine learning algorithms and prepared with natural language processing techniques such as word tokenization, stemming and lemmatization, part-of-speech (POS) tagging, named entity recognition (NER), and parsing to extract emotion words from the text of each tweet. This approach is implemented in the Python programming language with the Natural Language Toolkit (NLTK). A derived algorithm extracts emotion words using WordNet, selecting for each word the POS sense that is meaningful in the current context, and assigns sentiment polarity using the SentiWordNet dictionary or a lexicon-based method. The resulting polarity is then analyzed using naïve Bayes, support vector machine (SVM), K-nearest neighbor (KNN), and random forest machine learning algorithms and visualized on the Weka platform. Naïve Bayes gives the best accuracy of 88.17%, whereas random forest gives the best area under the receiver operating characteristic curve (AUC) of 0.97.
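
The lexicon-based polarity step can be sketched roughly as follows with NLTK and SentiWordNet. The POS mapping, first-sense scoring, and example tweets here are simplifying assumptions for illustration, not the derived algorithm itself.

```python
# Minimal sketch of lexicon-based tweet polarity with NLTK + SentiWordNet.
import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import sentiwordnet as swn, wordnet as wn
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet", "sentiwordnet", "omw-1.4"):
    nltk.download(pkg, quiet=True)

def to_wordnet_pos(treebank_tag):
    """Map Penn Treebank tags to the WordNet POS tags used by SentiWordNet."""
    if treebank_tag.startswith("J"):
        return wn.ADJ
    if treebank_tag.startswith("V"):
        return wn.VERB
    if treebank_tag.startswith("R"):
        return wn.ADV
    if treebank_tag.startswith("N"):
        return wn.NOUN
    return None

def tweet_polarity(text):
    """Sum positive minus negative SentiWordNet scores over the tweet's content words."""
    lemmatizer = WordNetLemmatizer()
    score = 0.0
    for token, tag in pos_tag(word_tokenize(text.lower())):
        pos = to_wordnet_pos(tag)
        if pos is None:
            continue
        lemma = lemmatizer.lemmatize(token, pos)
        synsets = list(swn.senti_synsets(lemma, pos))
        if synsets:  # take the first (most common) sense as a simple approximation
            score += synsets[0].pos_score() - synsets[0].neg_score()
    return score

print(tweet_polarity("What an amazing goal, brilliant match!"))   # positive
print(tweet_polarity("Terrible refereeing ruined the game."))     # negative
```

The resulting per-tweet polarity scores would then be exported as features for the naïve Bayes, SVM, KNN, and random forest classifiers evaluated in Weka.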


2008 ◽  
Vol 9 (Suppl 2) ◽  
pp. S10 ◽  
Author(s):  
Beatrice Alex ◽  
Claire Grover ◽  
Barry Haddow ◽  
Mijail Kabadjov ◽  
Ewan Klein ◽  
...  

2018 ◽  
Author(s):  
Zhou Yuan ◽  
Sean Finan ◽  
Jeremy Warner ◽  
Guergana Savova ◽  
Harry Hochheiser

Retrospective cancer research requires identifying patients who match both categorical and temporal inclusion criteria, often based on factors available only in clinical notes. Although natural language processing approaches for inferring higher-level concepts have shown promise for bringing structure to clinical texts, interpreting the results is often challenging, requiring users to move between abstracted representations and their constituent text elements. We discuss a qualitative inquiry into user tasks and goals, data elements, and models, resulting in an innovative natural language processing pipeline and a visual analytics tool designed to facilitate interpretation of patient summaries and identification of cohorts for retrospective research.
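
As a minimal illustration of combining categorical and temporal inclusion criteria, the sketch below filters NLP-derived patient events with pandas. The column names, concepts, and data are hypothetical and are not drawn from the authors' pipeline or tool.

```python
# Illustrative cohort filter over structured events extracted from clinical notes.
import pandas as pd

# One row per NLP-extracted event: patient, concept, and the note's date (toy data).
events = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3],
    "concept":    ["melanoma", "immunotherapy", "melanoma", "chemotherapy", "melanoma"],
    "date":       pd.to_datetime(["2015-02-01", "2015-06-10",
                                  "2016-03-15", "2015-01-20", "2017-07-01"]),
})

# Categorical criterion: a melanoma diagnosis appears in the notes.
diagnosed = events[events["concept"] == "melanoma"][["patient_id", "date"]]

# Temporal criterion: immunotherapy started after the diagnosis date.
treated = events[events["concept"] == "immunotherapy"][["patient_id", "date"]]
cohort = diagnosed.merge(treated, on="patient_id", suffixes=("_dx", "_tx"))
cohort = cohort[cohort["date_tx"] > cohort["date_dx"]]

print(cohort["patient_id"].unique())  # patients meeting both criteria -> [1]
```

A visual analytics layer, as described above, would let researchers inspect why a given patient did or did not satisfy such criteria by linking each event back to the underlying note text.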

