Developing an automatic pipeline for analyzing chatter about health services from social media: A case study for Medicaid

Objective Social media can be an effective but challenging resource for conducting close-to-real-time assessments of consumers’ perceptions about health services. Our objective was to develop and evaluate an automatic pipeline, involving natural language processing and machine learning, for automatically characterizing user-posted Twitter data about Medicaid. Material and Methods We collected Twitter data via the public API using Medicaid-related keywords (Corpus-1), and the website’s search option using agency-specific handles (Corpus-2). We manually labeled a sample of tweets into five pre-determined categories or other, and artificially increased the number of training posts from specific low-frequency categories. We trained and evaluated several supervised learning algorithms using manually labeled data, and applied the best-performing classifier to collected tweets for post-classification analyses assessing the utility of our methods. Results We collected 628,411 and 27,377 tweets for Corpus-1 and -2, respectively. We manually annotated 9,571 (Corpus-1: 8,180; Corpus-2: 1,391) tweets, using 7,923 (82.8%) for training and 1,648 (17.2%) for evaluation. A BERT-based (bidirectional encoder representations from transformers) classifier obtained the highest accuracies (83.9%, Corpus-1; 86.4%, Corpus-2), outperforming the second-best classifier (SVMs: 79.6%; 76.4%). Post-classification analyses revealed differing inter-corpora distributions of tweet categories, with political (63%) and consumer-feedback (43%) tweets being most frequent for Corpus-1 and -2, respectively. Discussion and Conclusion The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed pipeline presents a feasible solution for automatic categorization, and can be deployed and generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies (LINK_TO_BE_AVAILABLE).

Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid (Preprint)

10.2196/preprints.26616 ◽ 2020

Author(s): Yuan-Chi Yang ◽ Mohammed Ali Al-Garadi ◽ Whitney Bremer ◽ Jane M Zhu ◽ David Grande ◽ ...

Keyword(s): Social Media ◽ Health Services ◽ Language Processing ◽ Automatic System ◽ Short Term Memory ◽ The United States ◽ Support Vector ◽ K Nearest Neighbor ◽ Automatic Categorization ◽ Consumer Feedback

BACKGROUND The wide adoption of social media in daily life renders it a rich and effective resource for conducting near real-time assessments of consumers’ perceptions of health services. However, its use in these assessments can be challenging because of the vast amount of data and the diversity of content in social media chatter. OBJECTIVE This study aims to develop and evaluate an automatic system involving natural language processing and machine learning to automatically characterize user-posted Twitter data about health services using Medicaid, the single largest source of health coverage in the United States, as an example. METHODS We collected data from Twitter in two ways: via the public streaming application programming interface using Medicaid-related keywords (Corpus 1) and by using the website’s search option for tweets mentioning agency-specific handles (Corpus 2). We manually labeled a sample of tweets in 5 predetermined categories or other and artificially increased the number of training posts from specific low-frequency categories. Using the manually labeled data, we trained and evaluated several supervised learning algorithms, including support vector machine, random forest (RF), naïve Bayes, shallow neural network (NN), k-nearest neighbor, bidirectional long short-term memory, and bidirectional encoder representations from transformers (BERT). We then applied the best-performing classifier to the collected tweets for postclassification analyses to assess the utility of our methods. RESULTS We manually annotated 11,379 tweets (Corpus 1: 9179; Corpus 2: 2200) and used 7930 (69.7%) for training, 1449 (12.7%) for validation, and 2000 (17.6%) for testing. 
A classifier based on BERT obtained the highest accuracies (81.7%, Corpus 1; 80.7%, Corpus 2) and F1 scores on consumer feedback (0.58, Corpus 1; 0.90, Corpus 2), outperforming the second best classifiers in terms of accuracy (74.6%, RF on Corpus 1; 69.4%, RF on Corpus 2) and F1 score on consumer feedback (0.44, NN on Corpus 1; 0.82, RF on Corpus 2). Postclassification analyses revealed differing intercorpora distributions of tweet categories, with political (400778/628411, 63.78%) and consumer feedback (15073/27337, 55.14%) tweets being the most frequent for Corpus 1 and Corpus 2, respectively. CONCLUSIONS The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed system presents a feasible solution for automatic categorization and can be deployed and generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies.
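
The abstract says the number of training posts from low-frequency categories was artificially increased, but it does not say how. One common way to do this is random oversampling by duplication, sketched below in Python. This is an illustrative assumption, not the authors' method: the function name `oversample`, the `factor` parameter, and the example tweets are all hypothetical.

```python
import random
from collections import Counter


def oversample(examples, target_labels, factor, seed=0):
    """Return a shuffled copy of `examples` in which every label listed in
    `target_labels` appears `factor` times as often as in the input.

    `examples` is a list of (text, label) pairs. Extra items are drawn at
    random, with replacement, from the existing items of that label.
    NOTE: duplication is an assumed technique; the abstract does not name one.
    """
    rng = random.Random(seed)
    out = list(examples)
    for label in target_labels:
        pool = [ex for ex in examples if ex[1] == label]
        extra = (factor - 1) * len(pool)
        out.extend(rng.choice(pool) for _ in range(extra))
    rng.shuffle(out)
    return out


# Hypothetical toy training set: three frequent posts, one rare post.
train = [
    ("tweet a", "political"),
    ("tweet b", "political"),
    ("tweet c", "political"),
    ("tweet d", "consumer feedback"),
]
balanced = oversample(train, {"consumer feedback"}, factor=3)
print(Counter(label for _, label in balanced))
```

Oversampling of this kind should be applied to the training split only, so that duplicated posts cannot leak into validation or test data.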

Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid

Journal of Medical Internet Research ◽ 10.2196/26616 ◽ 2021 ◽ Vol 23 (5) ◽ pp. e26616

Author(s): Yuan-Chi Yang ◽ Mohammed Ali Al-Garadi ◽ Whitney Bremer ◽ Jane M Zhu ◽ David Grande ◽ ...

Keyword(s): Social Media ◽ Health Services ◽ Language Processing ◽ Automatic System ◽ Short Term Memory ◽ The United States ◽ Support Vector ◽ K Nearest Neighbor ◽ Automatic Categorization ◽ Consumer Feedback

Background The wide adoption of social media in daily life renders it a rich and effective resource for conducting near real-time assessments of consumers’ perceptions of health services. However, its use in these assessments can be challenging because of the vast amount of data and the diversity of content in social media chatter. Objective This study aims to develop and evaluate an automatic system involving natural language processing and machine learning to automatically characterize user-posted Twitter data about health services using Medicaid, the single largest source of health coverage in the United States, as an example. Methods We collected data from Twitter in two ways: via the public streaming application programming interface using Medicaid-related keywords (Corpus 1) and by using the website’s search option for tweets mentioning agency-specific handles (Corpus 2). We manually labeled a sample of tweets in 5 predetermined categories or other and artificially increased the number of training posts from specific low-frequency categories. Using the manually labeled data, we trained and evaluated several supervised learning algorithms, including support vector machine, random forest (RF), naïve Bayes, shallow neural network (NN), k-nearest neighbor, bidirectional long short-term memory, and bidirectional encoder representations from transformers (BERT). We then applied the best-performing classifier to the collected tweets for postclassification analyses to assess the utility of our methods. Results We manually annotated 11,379 tweets (Corpus 1: 9179; Corpus 2: 2200) and used 7930 (69.7%) for training, 1449 (12.7%) for validation, and 2000 (17.6%) for testing. 
A classifier based on BERT obtained the highest accuracies (81.7%, Corpus 1; 80.7%, Corpus 2) and F1 scores on consumer feedback (0.58, Corpus 1; 0.90, Corpus 2), outperforming the second best classifiers in terms of accuracy (74.6%, RF on Corpus 1; 69.4%, RF on Corpus 2) and F1 score on consumer feedback (0.44, NN on Corpus 1; 0.82, RF on Corpus 2). Postclassification analyses revealed differing intercorpora distributions of tweet categories, with political (400778/628411, 63.78%) and consumer feedback (15073/27337, 55.14%) tweets being the most frequent for Corpus 1 and Corpus 2, respectively. Conclusions The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed system presents a feasible solution for automatic categorization and can be deployed and generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies.
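
The percentages reported in the Results can be reproduced from the counts given in the abstract. The short Python check below uses only numbers that appear in the abstract; the helper name `pct` is an assumption introduced for illustration.

```python
def pct(part, whole, digits):
    """Percentage of `part` in `whole`, rounded to `digits` decimal places."""
    return round(100 * part / whole, digits)


# Counts taken from the abstract.
annotated = 11379
corpus_counts = {"Corpus 1": 9179, "Corpus 2": 2200}
splits = {"training": 7930, "validation": 1449, "testing": 2000}

# Both breakdowns should add up to the annotated total.
assert sum(corpus_counts.values()) == annotated
assert sum(splits.values()) == annotated

for name, n in splits.items():
    print(f"{name}: {pct(n, annotated, 1)}%")

# Most frequent predicted category in each corpus.
print("political, Corpus 1:", pct(400778, 628411, 2))
print("consumer feedback, Corpus 2:", pct(15073, 27337, 2))
```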

Analysing Discussions Around Rural Health on Twitter During the COVID-19 Pandemic

10.20944/preprints202109.0216.v1 ◽ 2021

Author(s): Wasim Ahmed ◽ Josep Vidal-Alaball ◽ Josep Maria Vilaseca Llobet

Keyword(s): Social Media ◽ Social Network ◽ Network Analysis ◽ Rural Health ◽ Language Processing ◽ Rural Areas ◽ Twitter Data ◽ Public Views ◽ Share Information ◽ Insight Into

Individuals from rural areas are increasingly using social media as a means of communication, receiving information, or actively complaining of inequalities and injustices. This study captured 57 days’ worth of Twitter data from June to August 2021 related to rural health. The study utilised social network analysis and natural language processing to analyse the data. It was found that Twitter served as a fruitful platform to raise awareness of problems faced by those living in rural areas. Overall, Twitter was utilised in rural areas to express complaints, to debate, and share information. Twitter could be leveraged as a powerful social listening tool for individuals and organisations who want to gain insight into public views around rural health.

Let’s play on Facebook: using sentiment analysis and social media metrics to measure the success of YouTube gamers’ post types

Personal and Ubiquitous Computing ◽ 10.1007/s00779-019-01361-7 ◽ 2019 ◽ Cited By ~ 2

Author(s): Flora Poecze ◽ Claus Ebster ◽ Christine Strauss

Keyword(s): Social Media ◽ Sentiment Analysis ◽ Language Processing ◽ Nearest Neighbor ◽ Consumer Feedback ◽ Youtube Videos ◽ The Masses ◽ Processing Techniques ◽ Facebook Pages ◽ Future Work

This paper discusses the analysis results of successful self-marketing techniques on Facebook pages in the cases of three YouTube gamers: PewDiePie, Markiplier, and Kwebbelkop. The research focus was to identify significant differences in terms of the gamers’ user-generated Facebook metrics and commentary sentiments. Analysis of variance (ANOVA) and k-nearest neighbor sentiment analysis were employed as core research methods. ANOVA of the classified post categories revealed that photos tended to show significantly more user-generated interactions than other post types, while, on the other hand, re-posted YouTube videos gained significantly fewer numbers in the retrieved metrics than other content types. K-nearest neighbor sentiment analysis pointed out underlying follower negativity in cases where user-generated activity was relatively low, thereby improving the understanding of the opinion of the masses previously hidden behind metrics such as the number of likes, comments, and shares. The paper at hand highlights the methodological design of the study as well as a detailed discussion of key findings and their implications, and future work. The results per se indicate the need to utilize natural language processing techniques to optimize brand communication on social media and highlight the importance of considering machine learning sentiment analysis techniques for a better understanding of consumer feedback.

Sentiment on Twitter Data Set using Recurrent Neural Network - Long Short Term Memory

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽ 10.35940/ijitee.k1244.09811s19 ◽ 2019 ◽ Vol 8 (11S) ◽ pp. 1206-1211

Keyword(s): Social Media ◽ Sentiment Analysis ◽ Language Processing ◽ Short Term Memory ◽ Short Term ◽ Data Set ◽ Term Memory ◽ Twitter Data ◽ The People ◽ Long Short Term Memory

Social media is a combination of different platforms where a huge amount of user-generated data is collected. People from various parts of the country express their opinions, reviews, feedback, and marketing strategies through social media such as Twitter, Facebook, Instagram, and YouTube. It is vital to explore and gather these data, analyze them, and consolidate people's views for better decision making. Sentiment analysis is a natural language processing technique for information extraction that identifies the user’s views. It is used to extract reviews and opinions about satisfaction with products, events, and people in order to understand current trends in products or user behavior. The paper reviews and analyses the existing general approaches and algorithms for sentiment analysis. The system proposed for sentiment analysis on a Twitter data set is Long Short Term Memory (LSTM), evaluated with a Naive Bayes approach.

Analysing Discussions Around Rural Health on Twitter During the COVID-19 Pandemic

10.20944/preprints202109.0216.v2 ◽ 2021

Author(s): Wasim Ahmed ◽ Josep Vidal-Alaball ◽ Josep Maria Vilaseca Llobet

Keyword(s): Social Media ◽ Social Network ◽ Network Analysis ◽ Rural Health ◽ Language Processing ◽ Rural Areas ◽ English Language ◽ Twitter Data ◽ Share Information ◽ Insight Into

Individuals from rural areas are increasingly using social media as a means of communication, receiving information, or actively complaining of inequalities and injustices. This study captured 57 days’ worth of Twitter data from June to August 2021 related to rural health using English language keywords. The study utilised social network analysis and natural language processing to analyse the data. It was found that Twitter served as a fruitful platform to raise awareness of problems faced by those living in rural areas. Overall, Twitter was utilised in rural areas to express complaints, to debate, and share information. Twitter could be leveraged as a powerful social listening tool for individuals and organisations who want to gain insight into popular narratives around rural health.

Location Analysis for Arabic COVID-19 Twitter Data Using Enhanced Dialect Identification Models

Applied Sciences ◽ 10.3390/app112311328 ◽ 2021 ◽ Vol 11 (23) ◽ pp. 11328

Author(s): Nader Essam ◽ Abdullah M. Moussa ◽ Khaled M. Elsayed ◽ Sherif Abdou ◽ Mohsen Rashwan ◽ ...

Keyword(s): Social Media ◽ Language Processing ◽ State Of The Art ◽ Weighted Average ◽ Arabic Language ◽ Arab Countries ◽ The State ◽ Identification Accuracy ◽ Language Models ◽ Twitter Data

The recent surge of social media networks has provided a channel to gather and publish vital medical and health information. The focal role of these networks has become more prominent in periods of crisis, such as the recent pandemic of COVID-19. These social networks have been the leading platform for broadcasting health news updates, precaution instructions, and governmental procedures. They also provide an effective means for gathering public opinion and tracking breaking events and stories. To achieve location-based analysis for social media input, the location information of the users must be captured. Most of the time, this information is either missing or hidden. For some languages, such as Arabic, the users’ location can be predicted from their dialects. The Arabic language has many local dialects for most Arab countries. Natural Language Processing (NLP) techniques have provided several approaches for dialect identification. The recent advanced language models using contextual-based word representations in the continuous domain, such as BERT models, have provided significant improvement for many NLP applications. In this work, we present our efforts to use BERT-based models to improve the dialect identification of Arabic text. We show the results of the developed models to recognize the source of the Arabic country, or the Arabic region, from Twitter data. Our results show 3.4% absolute enhancement in dialect identification accuracy on the regional level over the state-of-the-art result. When we excluded the Modern Standard Arabic (MSA) set, which is formal Arabic language, we achieved 3% absolute gain in accuracy between the three major Arabic dialects over the state-of-the-art level. Finally, we applied the developed models on a recently collected resource for COVID-19 Arabic tweets to recognize the source country from the users’ tweets. 
We achieved a weighted average accuracy of 97.36%, which suggests that the models could serve as a tool for policymakers to support country-level disaster-related activities.

Using a Machine Learning Approach to Monitor COVID-19 Vaccine Adverse Events (VAE) from Twitter Data

Vaccines ◽ 10.3390/vaccines10010103 ◽ 2022 ◽ Vol 10 (1) ◽ pp. 103

Author(s): Andrew T. Lian ◽ Jingcheng Du ◽ Lu Tang

Keyword(s): Machine Learning ◽ New York ◽ Social Media ◽ Adverse Effects ◽ Adverse Events ◽ Language Processing ◽ Twitter Data ◽ Personal Experiences ◽ Machine Learning Approach ◽ The Us

Social media can be used to monitor the adverse effects of vaccines. The goal of this project is to develop a machine learning and natural language processing approach to identify COVID-19 vaccine adverse events (VAE) from Twitter data. Based on COVID-19 vaccine-related tweets (1 December 2020–1 August 2021), we built a machine learning-based pipeline to identify tweets containing personal experiences with COVID-19 vaccinations and to extract and normalize VAE-related entities, including dose(s); vaccine types (Pfizer, Moderna, and Johnson & Johnson); and symptom(s) from tweets. We further analyzed the extracted VAE data based on the location, time, and frequency. We found that the four most populous states (California, Texas, Florida, and New York) in the US witnessed the most VAE discussions on Twitter. The frequency of Twitter discussions of VAE coincided with the progress of the COVID-19 vaccinations. Sore to touch, fatigue, and headache are the three most common adverse effects of all three COVID-19 vaccines in the US. Our findings demonstrate the feasibility of using social media data to monitor VAEs. To the best of our knowledge, this is the first study to identify COVID-19 vaccine adverse event signals from social media. It can be an excellent supplement to the existing vaccine pharmacovigilance systems.

Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set (Preprint)

10.2196/preprints.25314 ◽ 2020

Author(s): Ari Z Klein ◽ Arjun Magge ◽ Karen O'Connor ◽ Jesus Ivan Flores Amaro ◽ Davy Weissenbacher ◽ ...

Keyword(s): Language Processing ◽ State Level ◽ The United States ◽ Self Report ◽ Regular Expressions ◽ Data Set ◽ Processing Pipeline ◽ Twitter Data ◽ Complementary Resource ◽ Automatic Pipeline

BACKGROUND In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone. OBJECTIVE The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention. METHODS Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out “reported speech” (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020. RESULTS Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. 
Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state–level geolocations. CONCLUSIONS We have made the 13,714 tweets identified in this study, along with each tweet’s time stamp and US state–level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.

Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set

Journal of Medical Internet Research ◽ 10.2196/25314 ◽ 2021 ◽ Vol 23 (1) ◽ pp. e25314

Author(s): Ari Z Klein ◽ Arjun Magge ◽ Karen O'Connor ◽ Jesus Ivan Flores Amaro ◽ Davy Weissenbacher ◽ ...

Keyword(s): Language Processing ◽ State Level ◽ The United States ◽ Self Report ◽ Regular Expressions ◽ Data Set ◽ Processing Pipeline ◽ Twitter Data ◽ Complementary Resource ◽ Automatic Pipeline

Background In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone. Objective The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention. Methods Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out “reported speech” (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020. Results Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. 
Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state–level geolocations. Conclusions We have made the 13,714 tweets identified in this study, along with each tweet’s time stamp and US state–level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
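
The two evaluation measures named in the Results, F1-score and Cohen κ, can be sketched in a few lines of Python. The F1 check uses the precision and recall reported in the abstract. The abstract gives only the final κ value (0.77), so the annotator labels below are hypothetical toy data, and the function names are assumptions introduced for illustration.

```python
from collections import Counter


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Reported values: precision = 0.76 and recall = 0.76 give F1 = 0.76.
print(round(f1_score(0.76, 0.76), 2))

# Hypothetical toy annotations (1 = self-report, 0 = not a self-report).
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohen_kappa(annotator_1, annotator_2))
```

When precision and recall are equal, as reported here, F1 equals that shared value, which is why all three figures in the abstract coincide at 0.76.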
