Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set


10.2196/25314 ◽  
2021 ◽  
Vol 23 (1) ◽  
pp. e25314
Author(s):  
Ari Z Klein ◽  
Arjun Magge ◽  
Karen O'Connor ◽  
Jesus Ivan Flores Amaro ◽  
Davy Weissenbacher ◽  
...  

Background In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone. Objective The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention. Methods Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out “reported speech” (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020. Results Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state–level geolocations. Conclusions We have made the 13,714 tweets identified in this study, along with each tweet’s time stamp and US state–level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
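The pipeline described above chains handwritten regular expressions, a filter for reported speech, and a BERT-based classifier. The following is a minimal sketch of that general shape in Python, assuming the Hugging Face transformers library; the regex patterns, the reported-speech heuristic, and the model identifier are illustrative placeholders rather than the authors' actual resources.

```python
# Minimal sketch of the pipeline shape described in the abstract: keyword regexes to
# pre-filter tweets, a crude "reported speech" filter, and a BERT-based classifier.
# The patterns, heuristic, and model name below are hypothetical placeholders.
import re
from transformers import pipeline

# Illustrative (assumed) regexes for tweets suggesting potential exposure to COVID-19.
EXPOSURE_PATTERNS = [
    re.compile(r"\bI (may|might) have (been exposed to|caught) (covid|corona)", re.I),
    re.compile(r"\bmy (\w+ )?(symptoms|fever|cough)\b.*\b(covid|corona)", re.I),
]

def looks_like_reported_speech(text: str) -> bool:
    """Heuristic stand-in for the paper's reported-speech filter (quotations, headlines)."""
    return text.strip().startswith(('"', "\u201c")) or "http" in text.lower()

# Placeholder identifier for a binary classifier fine-tuned on the annotated tweets.
classifier = pipeline("text-classification", model="my-org/covid-self-report-bert")

def detect_self_reports(tweets):
    """Return (tweet, predicted label) pairs for tweets that survive both filters."""
    candidates = [
        t for t in tweets
        if any(p.search(t) for p in EXPOSURE_PATTERNS) and not looks_like_reported_speech(t)
    ]
    return [(t, classifier(t)[0]) for t in candidates]
```

Running something of this shape over a stream of geotagged tweets and keeping only predictions above a confidence threshold would mirror the deployment step the abstract describes.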


2019 ◽  
Vol 34 (4) ◽  
pp. 258-267
Author(s):  
Lisa Yamagishi ◽  
Olivia Erickson ◽  
Kelly Mazzei ◽  
Christine O'Neil ◽  
Khalid M. Kamal

OBJECTIVE: Evaluate opioid prescribing practices for older adults since the opioid crisis in the United States. DESIGN: Interrupted time-series analysis on a retrospective observational cohort study. SETTING: 176-bed skilled-nursing facility (SNF). PARTICIPANTS: Patients admitted to a long-term care facility with pain-related diagnoses between October 1, 2015, and March 31, 2017, were included. Residents discharged prior to 14 days were excluded. Of 392 residents, 258 met inclusion criteria with 313 admissions. MAIN OUTCOME MEASURE: Change in opioid prescribing frequency between two periods, Q1 to Q3 (Spring 2016) and Q4 to Q6, representing the pre- and post-government countermeasure periods, respectively. RESULTS: Opioid prescriptions for patients with pain-related diagnoses decreased during period 1 at a rate of -0.10% per quarter (95% confidence interval [CI] -0.85 to 0.85; P = 0.99), with the rate of decline increasing (-3.8% per quarter) between periods 1 and 2 (95% CI -0.23 to 0.15; P = 0.64). Opioid prescribing in the top International Classification of Diseases, Ninth Revision category, "Injury and Poisoning," decreased in frequency by 3.0% per quarter from Q1 to Q6 (95% CI -0.16 to 0.10; P = 0.54). Appropriateness of pain control was obtained from the Minimum Data Set version 3.0 "Percent of Residents Who Self-Report Moderate to Severe Pain (Short Stay)" measure; these results showed a significant increase in inadequacy of pain relief of 0.28% per quarter (95% CI 0.12 to 0.44; P = 0.009). CONCLUSION: The proportion of residents who self-report moderate to severe pain has increased significantly since October 2015. Opioid prescriptions may have decreased for elderly patients in SNFs since Spring 2016. Further investigation with a larger population and a wider time frame is warranted to further evaluate significance.
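The interrupted time-series design reported here is commonly implemented as a segmented regression with a level and slope change at the intervention quarter. The sketch below shows that model form in Python with statsmodels; the quarterly values are made-up placeholders, not the study's data.

```python
# Hedged sketch of a segmented (interrupted time-series) regression with a change
# point after Q3. The outcome values below are invented for illustration only.
import numpy as np
import statsmodels.api as sm

quarters = np.arange(1, 7)                                    # Q1..Q6
opioid_rate = np.array([62.0, 61.9, 61.8, 60.5, 56.8, 53.0])  # % prescribed, placeholder data
post = (quarters >= 4).astype(int)                            # 1 after the countermeasure (Q4-Q6)
time_since = np.clip(quarters - 3, 0, None)                   # quarters elapsed since the change

X = sm.add_constant(np.column_stack([quarters, post, time_since]))
fit = sm.OLS(opioid_rate, X).fit()
print(fit.params)                 # intercept, pre-slope, level change, slope change
print(fit.conf_int(alpha=0.05))   # 95% CIs, analogous to those quoted in the abstract
```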


2021 ◽  
Author(s):  
Monique B. Sager ◽  
Aditya M. Kashyap ◽  
Mila Tamminga ◽  
Sadhana Ravoori ◽  
Christopher Callison-Burch ◽  
...  

BACKGROUND Reddit, the fifth most popular website in the United States, boasts a large and engaged user base on its dermatology forums, where users crowdsource free medical opinions. Unfortunately, much of the advice provided is unvalidated and could lead to inappropriate care. Initial testing has shown that artificially intelligent bots can detect misinformation on Reddit forums and may be able to produce responses to posts containing misinformation. OBJECTIVE To analyze the ability of bots to find and respond to health misinformation on Reddit's dermatology forums in a controlled test environment. METHODS Using natural language processing techniques, we trained bots to target misinformation using relevant keywords and to post prefabricated responses. We compared the performance of different model architectures on a held-out test set. RESULTS Our models yielded test accuracies ranging from 95% to 100%, with a fine-tuned BERT model achieving the highest test accuracy. The bots were then able to post corrective prefabricated responses to misinformation. CONCLUSIONS Using a limited data set, bots had near-perfect ability to detect these examples of health misinformation within Reddit dermatology forums. Given that these bots can then post prefabricated responses, this technique may allow for the interception of misinformation. Providing correct information, even instantly, does not mean users will be receptive or find such interventions persuasive, however. Further work should investigate this strategy's effectiveness to inform future deployment of bots as a technique for combating health misinformation.
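A minimal sketch of the two-stage approach the abstract outlines, keyword targeting followed by a fine-tuned classifier that triggers a canned reply, is given below. The keywords, the model identifier, the label names, and the reply text are assumptions for illustration, not the authors' bots.

```python
# Sketch only: keyword filter plus a fine-tuned BERT classifier gating a prefabricated
# reply. Keywords, model name, labels, and reply text are hypothetical.
from transformers import pipeline

KEYWORDS = ("sunscreen", "accutane", "steroid")   # assumed trigger terms
CANNED_REPLY = "Please consult a board-certified dermatologist before acting on this advice."

# Placeholder identifier for a classifier fine-tuned on labeled forum posts.
detector = pipeline("text-classification", model="my-org/derm-misinfo-bert")

def respond_to_post(post_text: str):
    """Return a corrective reply for posts flagged as misinformation, else None."""
    if not any(k in post_text.lower() for k in KEYWORDS):
        return None                                # post not targeted by keywords
    prediction = detector(post_text)[0]
    if prediction["label"] == "MISINFORMATION" and prediction["score"] > 0.9:
        return CANNED_REPLY                        # the bot would post this reply
    return None
```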


2018 ◽  
Vol 45 (4) ◽  
pp. 441-459 ◽  
Author(s):  
Sue Thomas ◽  
Ryan Treffers ◽  
Nancy F. Berglas ◽  
Laurie Drabble ◽  
Sarah C. M. Roberts

As U.S. states legalize marijuana and as governmental attention is paid to the "opioid crisis," state policies pertaining to drug use during pregnancy are increasingly important. Little is known about the scope of state policies targeting drug use during pregnancy, how they have evolved, and how they compare to policies on alcohol use during pregnancy. Method: Our 46-year original data set of statutes and regulations in U.S. states covers the entirety of state-level legislation in this policy domain. Data were obtained through original legal research and from the National Institute on Alcohol Abuse and Alcoholism's Alcohol Policy Information System. Policies were analyzed individually as well as by classification as punitive toward or supportive of women. Results: The number of states with drug use during pregnancy policies has increased from 1 in 1974 to 43 in 2016. Policies started as punitive. By the mid- to late 1980s, supportive policies emerged, and mixed policy environments dominated in the 2000s. Overall, drug/pregnancy policy environments have become less supportive over time. Comparisons of drug laws to alcohol laws show that the policy trajectories started in opposite directions, but by 2016, the results were the same: punitive policies were more prevalent than supportive policies across states. Moreover, there is a great deal of overlap between drug use during pregnancy policies and alcohol/pregnancy policies. Conclusion: This study breaks new ground. More studies are needed that explore the effects of these policies on alcohol and other drug use by pregnant women and on birth outcomes.


2000 ◽  
Vol 60 (1) ◽  
pp. 42-66 ◽  
Author(s):  
Janet Currie ◽  
Joseph Ferrie

This article examines the effect of state-level legal innovations governing labor disputes in the late 1800s. This was a period of legal ferment in which worker organizations and employers actively lobbied state governments for changes in the rules governing labor disputes. Cross-state heterogeneity in the legal environment provides an unusual opportunity to investigate the effects of these laws. We use a unique data set with information on 12,965 strikes to show that most of these law changes had surprisingly little effect on strike incidence or outcomes. Important exceptions were maximum hours laws and the use of injunctions.


2020 ◽  
Author(s):  
Paiheng Xu ◽  
Mark Dredze ◽  
David A Broniatowski

BACKGROUND Social distancing is an important component of the response to the COVID-19 pandemic. Minimizing social interactions and travel reduces the rate at which the infection spreads and "flattens the curve" so that the medical system is better equipped to treat infected individuals. However, it remains unclear how the public will respond to these policies as the pandemic continues. OBJECTIVE The aim of this study is to present the Twitter Social Mobility Index, a measure of social distancing and travel derived from Twitter data. We used public geolocated Twitter data to measure how much users travel in a given week. METHODS We collected 469,669,925 tweets geotagged in the United States from January 1, 2019, to April 27, 2020. We analyzed the aggregated mobility variance of a total of 3,768,959 Twitter users at the city and state level from the start of the COVID-19 pandemic. RESULTS We found a large reduction (61.83%) in travel in the United States after the implementation of social distancing policies. However, the variance by state was high, ranging from 38.54% to 76.80%. The states that had not issued statewide social distancing orders as of the start of April ranked poorly in terms of travel reduction: Arkansas (45), Iowa (37), Nebraska (35), North Dakota (22), South Carolina (38), South Dakota (46), Oklahoma (50), Utah (14), and Wyoming (53). We are presenting our findings on the internet and will continue to update our analysis during the pandemic. CONCLUSIONS We observed larger travel reductions in states that were early adopters of social distancing policies and smaller changes in states without such policies. The results were also consistent with those based on other mobility data to a certain extent. Therefore, geolocated tweets are an effective way to track social distancing practices using a public resource, and this tracking may be useful as part of ongoing pandemic response planning.
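One plausible way to summarize weekly travel from geotagged tweets is to measure how far each user's tweets spread around that week's centroid; the sketch below illustrates this idea. It is an assumption-laden simplification, not the paper's exact index definition.

```python
# Illustrative weekly mobility summary: mean great-circle distance of a user's tweets
# from that week's centroid. A simplification, not the Twitter Social Mobility Index itself.
import math
from collections import defaultdict

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def weekly_mobility(tweets):
    """tweets: iterable of (user_id, iso_week, lat, lon) tuples.
    Returns {(user_id, iso_week): mean distance from that week's centroid}."""
    points = defaultdict(list)
    for user, week, lat, lon in tweets:
        points[(user, week)].append((lat, lon))
    index = {}
    for key, pts in points.items():
        clat = sum(p[0] for p in pts) / len(pts)
        clon = sum(p[1] for p in pts) / len(pts)
        index[key] = sum(haversine_km(lat, lon, clat, clon) for lat, lon in pts) / len(pts)
    return index
```

Averaging such per-user values by state and week, and comparing weeks before and after policy changes, would yield reduction percentages of the kind reported above.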


10.2196/25108 ◽  
2021 ◽  
Vol 23 (2) ◽  
pp. e25108
Author(s):  
Joanne Chen Lyu ◽  
Garving K Luli

Background The Centers for Disease Control and Prevention (CDC) is a national public health protection agency in the United States. With the escalating impact of the COVID-19 pandemic on society in the United States and around the world, the CDC has become one of the focal points of public discussion. Objective This study aims to identify the topics and overarching themes emerging from the public's COVID-19-related discussion about the CDC on Twitter and to provide insight into the public's concerns, focus of attention, perceptions of the CDC's current performance, and expectations of the CDC. Methods Tweets were downloaded from a large-scale COVID-19 Twitter chatter data set from March 11, 2020, when the World Health Organization declared COVID-19 a pandemic, to August 14, 2020. We used R (The R Foundation) to clean the tweets and retain tweets that contained any of five specific keywords (cdc, CDC, centers for disease control and prevention, CDCgov, and cdcgov), while eliminating all 91 tweets posted by the CDC itself. The final data set included in the analysis consisted of 290,764 unique tweets from 152,314 different users. We used R to perform the latent Dirichlet allocation algorithm for topic modeling. Results The Twitter data generated 16 topics that the public linked to the CDC when they talked about COVID-19. Among the topics, the most discussed was COVID-19 death counts, accounting for 12.16% (n=35,347) of the total 290,764 tweets in the analysis, followed by general opinions about the credibility of the CDC and other authorities and the CDC's COVID-19 guidelines, with over 20,000 tweets each. The 16 topics fell into four overarching themes: knowing the virus and the situation, policy and government actions, response guidelines, and general opinion about credibility. Conclusions Social media platforms, such as Twitter, provide valuable databases for public opinion. In a protracted pandemic, such as COVID-19, quickly and efficiently identifying the topics within the public discussion on Twitter would help public health agencies improve the next round of communication with the public.
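The topic-modeling step was carried out in R; the sketch below is a rough Python analogue using latent Dirichlet allocation, included only to make that stage concrete. The 16-topic setting comes from the abstract, while the toy documents and preprocessing choices are assumptions.

```python
# Rough Python analogue of the LDA step (the study itself used R). Documents below are
# placeholders; only the number of topics (16) is taken from the abstract.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cdc death counts rise again", "cdc guidelines on masks updated"]  # toy tweets

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
dtm = vectorizer.fit_transform(docs)                 # document-term count matrix

lda = LatentDirichletAllocation(n_components=16, random_state=0)
lda.fit(dtm)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-10:][::-1]]  # ten highest-weight words
    print(f"topic {k}: {' '.join(top)}")
```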


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Siawpeng Er ◽  
Shihao Yang ◽  
Tuo Zhao

The global spread of COVID-19, the disease caused by the novel coronavirus SARS-CoV-2, has posed a significant threat to mankind. As the COVID-19 situation continues to evolve, predicting localized disease severity is crucial for advanced resource allocation. This paper proposes a method named COURAGE (COUnty aggRegation mixup AuGmEntation) to generate short-term predictions of 2-week-ahead COVID-19-related deaths for each county in the United States, leveraging modern deep learning techniques. Specifically, our method adopts a self-attention model from natural language processing, known as the transformer model, to capture both short-term and long-term dependencies within the time series while enjoying computational efficiency. Our model solely utilizes publicly available information on COVID-19-related confirmed cases, deaths, community mobility trends, and demographic information, and can produce state-level predictions as an aggregation of the corresponding county-level predictions. Our numerical experiments demonstrate that our model achieves state-of-the-art performance among publicly available benchmark models.
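Mixup augmentation, one ingredient named in COURAGE, builds new training examples as convex combinations of pairs of inputs and their targets. The sketch below shows that operation for county-level time series; the array shapes and hyperparameters are assumptions, not the paper's configuration.

```python
# Generic mixup augmentation over county-level training examples. Shapes and the
# Beta(alpha, alpha) setting are illustrative assumptions.
import numpy as np

def mixup(features, targets, alpha=0.2, rng=None):
    """features: (n_counties, seq_len, n_vars); targets: (n_counties,) 2-week-ahead deaths."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = features.shape[0]
    lam = rng.beta(alpha, alpha, size=n)     # one mixing coefficient per example
    perm = rng.permutation(n)                # partner example for each county
    lam_x = lam[:, None, None]
    mixed_x = lam_x * features + (1 - lam_x) * features[perm]
    mixed_y = lam * targets + (1 - lam) * targets[perm]
    return mixed_x, mixed_y

# State-level predictions would then be obtained by summing the model's county-level
# predictions within each state, as the abstract describes.
```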


Social media comprises different platforms where a huge amount of user-generated data is collected. People from various parts of the country express their opinions, reviews, feedback, and marketing strategies through social media such as Twitter, Facebook, Instagram, and YouTube. It is vital to explore and gather these data, analyze them, and consolidate people's views for better decision making. Sentiment analysis is a natural language processing technique for information extraction that identifies users' views. It is used to extract reviews and opinions about satisfaction with products, events, and people in order to understand current trends in products and user behavior. The paper reviews and analyzes existing general approaches and algorithms for sentiment analysis. The proposed system performs sentiment analysis on a Twitter data set using a Long Short-Term Memory (LSTM) network and is evaluated against a Naive Bayes approach.
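The comparison described above pairs an LSTM sentiment classifier with a Naive Bayes baseline. The sketch below outlines both on toy data, assuming TensorFlow/Keras and scikit-learn; the vocabulary size, sequence length, and layer sizes are assumptions rather than the paper's settings.

```python
# Toy comparison of a Naive Bayes baseline and an LSTM sentiment classifier.
# Data, vocabulary size, and sequence length are placeholders.
import numpy as np
import tensorflow as tf
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["love this phone", "worst service ever"]   # placeholder tweets
labels = np.array([1, 0])                           # 1 = positive, 0 = negative

# Naive Bayes baseline on bag-of-words counts.
nb = MultinomialNB().fit(CountVectorizer().fit_transform(texts), labels)

# LSTM over integer-encoded tokens.
vectorize = tf.keras.layers.TextVectorization(max_tokens=10000, output_sequence_length=30)
vectorize.adapt(tf.constant(texts))
x = vectorize(tf.constant(texts))                   # shape (n_tweets, 30)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, labels, epochs=2, verbose=0)
```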

