Session details: Big Data and Social Media for Public Health Surveillance

Author(s):  
Patty Kostkova
2019 ◽  
Author(s):  
Joana M Barros ◽  
Jim Duggan ◽  
Dietrich Rebholz-Schuhmann

BACKGROUND Public health surveillance is based on the continuous and systematic collection, analysis, and interpretation of data. This informs the development of early warning systems to monitor epidemics and documents the impact of intervention measures. The introduction of digital data sources, and specifically sources available on the internet, has impacted the field of public health surveillance. New opportunities enabled by the underlying availability and scale of internet-based sources (IBSs) have paved the way for novel approaches for disease surveillance, exploration of health communities, and the study of epidemic dynamics. This field and approach is also known as infodemiology or infoveillance. OBJECTIVE This review aimed to assess research findings regarding the application of IBSs for public health surveillance (infodemiology or infoveillance). To achieve this, we have presented a comprehensive systematic literature review with a focus on these sources and their limitations, the diseases targeted, and commonly applied methods. METHODS A systematic literature review was conducted targeting publications between 2012 and 2018 that leveraged IBSs for public health surveillance, outbreak forecasting, disease characterization, diagnosis prediction, content analysis, and health-topic identification. The search results were filtered according to previously defined inclusion and exclusion criteria. RESULTS Spanning a total of 162 publications, we determined infectious diseases to be the preferred case study (108/162, 66.7%). Of the eight categories of IBSs (search queries, social media, news, discussion forums, websites, web encyclopedia, and online obituaries), search queries and social media were applied in 95.1% (154/162) of the reviewed publications. We also identified limitations in representativeness and biased user age groups, as well as high susceptibility to media events by search queries, social media, and web encyclopedias. 
CONCLUSIONS IBSs are a valuable proxy to study illnesses affecting the general population; however, it is important to characterize which diseases are best suited for the available sources; the literature shows that the level of engagement among online platforms can be a potential indicator. There is a necessity to understand the population’s online behavior; in addition, the exploration of health information dissemination and its content is significantly unexplored. With this information, we can understand how the population communicates about illnesses online and, in the process, benefit public health.


2021 ◽  
Author(s):  
Zahra Shakeri Hossein Abad ◽  
Gregory P. Butler ◽  
Wendy Thompson ◽  
Joon Lee

BACKGROUND Advances in automated data processing and machine learning (ML) models, together with the unprecedented growth in the number of social media users who publicly share and discuss health-related information, have made public health surveillance (PHS) one of the long-lasting social media applications. However, the existing PHS systems feeding on social media data have not been widely deployed in national surveillance systems, which appears to stem from practitioners’ and the public’s lack of trust in social media data. More robust and reliable datasets, over which supervised machine learning models can be trained and tested, are a significant step toward overcoming this hurdle. OBJECTIVE The health implications of daily behaviours (physical activity, sedentary behaviour, and sleep (PASS)), as an evergreen topic in PHS, are widely studied through traditional data sources such as surveillance surveys and administrative databases, which are often several months out of date by the time they are utilized, costly to collect, and thus limited in quantity and coverage. In this paper, we present LPHEADA, a multicountry and fully Labelled digital Public HEAlth DAtaset of tweets originating in Australia, Canada, the United Kingdom (UK), or the United States (US). METHODS We collected the data of this study from Twitter using the Twitter livestream application programming interface (API) between 28th November 2018 and 19th June 2020. To obtain PASS-related tweets for manual annotation, we iteratively used regular expressions, unsupervised natural language processing, domain-specific ontologies and linguistic analysis. We used Amazon Mechanical Turk (AMT) to label the collected data with self-reported PASS categories and implemented a quality control pipeline to monitor and manage the validity of crowd-generated labels. Moreover, we used ML, latent semantic analysis, linguistic analysis, and label inference analysis to validate different components of the dataset. 
RESULTS LPHEADA contains 366,405 crowd-generated labels (three labels per tweet) for 122,135 PASS-related tweets, labelled by 708 unique annotators on AMT. In addition to crowd-generated labels, LPHEADA provides details about the three critical components of any PHS system: place, time, and demographics (gender, age range) associated with each tweet. CONCLUSIONS Publicly available datasets for digital PASS surveillance are usually isolated and only provide labels for small subsets of the data. We believe that the novelty and comprehensiveness of the dataset provided in this study will help develop, evaluate, and deploy digital PASS surveillance systems. LPHEADA will be an invaluable resource for both public health researchers and practitioners.
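
The reported counts are internally consistent: 122,135 tweets with three labels each gives 366,405 labels. As an illustration of how three crowd labels per tweet might be reduced to a single label, the sketch below uses a simple majority vote. This is an assumption made for illustration only; the abstract mentions "label inference analysis" but does not describe the aggregation rule, and the function names and label values here are hypothetical.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label, or None when the top count is tied.

    Hypothetical helper: majority voting is an assumed aggregation rule,
    not the method described in the abstract.
    """
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    # A tie for first place (e.g. three different labels) yields no decision.
    if len(counts) > 1 and counts[1][1] == top_count:
        return None
    return top_label

def aggregate(labels_by_tweet):
    """Map each tweet ID to its majority label (None when annotators disagree)."""
    return {tweet_id: majority_label(labels)
            for tweet_id, labels in labels_by_tweet.items()}

# Illustrative, made-up annotations: three labels per tweet.
example = {
    "t1": ["physical_activity", "physical_activity", "sleep"],
    "t2": ["sleep", "sedentary", "physical_activity"],
}
result = aggregate(example)

# Consistency check on the figures reported in the abstract.
total_labels = 122_135 * 3
```

Tweets whose annotators fully disagree are left unresolved here rather than forced into a category; a real quality control pipeline would need its own rule for such cases.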


2018 ◽  
Vol 4 (3) ◽  
pp. 205630511879076 ◽  
Author(s):  
Daniel K. Cortese ◽  
Glen Szczypka ◽  
Sherry Emery ◽  
Shuai Wang ◽  
Elizabeth Hair ◽  
...  

Our research provides social scientists with areas of inquiry in tobacco-related health disparities in young adult women and opportunities for intervention, as Instagram may be a powerful tool for the public health surveillance of smoking behavior and social norms among young women. Social media has fundamentally changed how people engage with health-related information. Researchers increasingly turn to social media platforms for public health surveillance. Instagram is currently one of the fastest growing social networks, with over 53% of young adults (aged 18-29) using the platform, and young adult women comprise a significant user base. We conducted a content analysis of a sample of smoking imagery drawn from Instagram’s public Application Programming Interface (API). From August 2014 to July 2015, 18 popular tobacco- and e-cigarette-related text tags were used to collect 2.3 million image posts. Trained undergraduate coders (aged 21-29) coded 8,000 images (r = .91) by type of artifact, branding, number of persons, gender, age, ethnicity, and the presence of smoke. Approximately 71.5% of images were tobacco-relevant and informed our research. Images of cigarettes were the most popular (49%), followed by e-cigarettes (32.1%). “Selfies while smoking” was the dominant form of portrait expression, with 61.4% of images containing only one person, and of those, 65.7% contained images of women. The most common selfie was women engaged in “smoke play” (62.4%) that the viewer could interpret as “cool.” These “cool” images may counteract public health efforts to denormalize smoking, and young women are bearing the brunt of this under-the-radar tobacco advertising. Social media further normalizes tobacco use because positive images and brand messaging are easily seen and shared, and also operates as unpaid advertising on image-based platforms like Instagram. 
These findings portend a dangerous trend for young women in the absence of effective public health intervention strategies.


10.2196/13680 ◽  
2020 ◽  
Vol 22 (3) ◽  
pp. e13680 ◽  
Author(s):  
Joana M Barros ◽  
Jim Duggan ◽  
Dietrich Rebholz-Schuhmann

Background Public health surveillance is based on the continuous and systematic collection, analysis, and interpretation of data. This informs the development of early warning systems to monitor epidemics and documents the impact of intervention measures. The introduction of digital data sources, and specifically sources available on the internet, has impacted the field of public health surveillance. New opportunities enabled by the underlying availability and scale of internet-based sources (IBSs) have paved the way for novel approaches for disease surveillance, exploration of health communities, and the study of epidemic dynamics. This field and approach is also known as infodemiology or infoveillance. Objective This review aimed to assess research findings regarding the application of IBSs for public health surveillance (infodemiology or infoveillance). To achieve this, we have presented a comprehensive systematic literature review with a focus on these sources and their limitations, the diseases targeted, and commonly applied methods. Methods A systematic literature review was conducted targeting publications between 2012 and 2018 that leveraged IBSs for public health surveillance, outbreak forecasting, disease characterization, diagnosis prediction, content analysis, and health-topic identification. The search results were filtered according to previously defined inclusion and exclusion criteria. Results Spanning a total of 162 publications, we determined infectious diseases to be the preferred case study (108/162, 66.7%). Of the eight categories of IBSs (search queries, social media, news, discussion forums, websites, web encyclopedia, and online obituaries), search queries and social media were applied in 95.1% (154/162) of the reviewed publications. We also identified limitations in representativeness and biased user age groups, as well as high susceptibility to media events by search queries, social media, and web encyclopedias. 
Conclusions IBSs are a valuable proxy to study illnesses affecting the general population; however, it is important to characterize which diseases are best suited for the available sources; the literature shows that the level of engagement among online platforms can be a potential indicator. There is a necessity to understand the population’s online behavior; in addition, the exploration of health information dissemination and its content is significantly unexplored. With this information, we can understand how the population communicates about illnesses online and, in the process, benefit public health.


Author(s):  
Albert Park ◽  
Mike Conway

Objective: We aim to understand (1) the frequency of URL sharing and (2) the types of shared URLs among opioid-related discussions that take place on the social media platform Reddit. Introduction: Nearly 100 people per day die from opioid overdose in the United States. Further, prescription opioid abuse is assumed to be responsible for a 15-year increase in opioid overdose deaths [1]. However, with increasing use of social media comes increasing opportunity to seek and share information. For instance, 80% of Internet users obtain health information online [2], including from popular social interaction sites like Reddit (http://www.reddit.com), which had more than 82.5 billion page views in 2015 [3]. On Reddit, members often share information and include URLs to supplement it. Understanding the frequency of URL sharing and types of shared URLs can improve our knowledge of information seeking/sharing behaviors as well as domains of shared information on social media. Such knowledge has the potential to provide opportunities to improve public health surveillance practice. We use Reddit to track opioid-related discussions and then investigate types of shared URLs among Reddit members in those discussions. Methods: First, we use a dataset [4], made available on Reddit, that has been used in several informatics studies [5,6]. The dataset comprises 13,213,173 unique member IDs, 114,320,798 posts, and 1,659,361,605 associated comments that were made on 239,772 (including active and inactive) subreddits (i.e., sub-communities) from October 2007 to May 2015. Second, we identified 9 terms that are associated with opioids. The terms are 'opioid', 'opium', 'morphine', 'opiate', 'hydrocodone', 'oxycodone', 'fentanyl', 'heroin', and 'methadone'. Third, we preprocessed the entire dataset (i.e., converting text to lowercase and removing punctuation) and extracted discussions with opioid terms and their metadata (e.g., user ID, post ID) via a lexicon-based approach. 
Fourth, we extracted URLs using Python from these discussions, categorized the URLs by domain, and then visualized the results in a bubble chart [7]. Results: We extracted 1,121,187 posts/comments that were made by 328,179 unique member IDs from 8,892 subreddits. Of the 1,121,187 posts/comments, 82,639 posts/comments contained URLs (7.37%), and these posts consisted of 272,551 individual URLs and 138,206 unique URLs. The types of shared URLs in these opioid-related discussions are summarized in Figure 1. The color and size represent the type and size, respectively, of shared URLs. The ‘.com’ is in blue; ‘.org’ is in orange; and ‘.gov’ is in green. Conclusions: We present preliminary findings concerning the types of shared URLs in opioid-related discussions among Reddit members. Our initial results suggest that Reddit members openly discuss opioid-related issues and that URL sharing is a part of information sharing. Although members share many URLs from reliable information sources (e.g., ‘ncbi.nlm.nih.gov’, ‘wikipedia.org’, ‘nytimes.com’, ‘sciencedirect.com’), further investigation is needed concerning many of the ‘.com’ URLs, which have the potential to contain high and/or low quality information (e.g., ‘youtube.com’, ‘reddit.com’, ‘google.com’, ‘amazon.com’), to fully understand information seeking/sharing behaviors on social media and to identify opportunities, such as misinformation dissemination, for improving public health surveillance practice.
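
The four method steps above (lowercasing and punctuation removal, lexicon-based matching on the nine opioid terms, URL extraction, and grouping by domain) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the URL regular expression, the use of substring matching (so that plurals such as "opioids" also match), and all function names are hypothetical, and the sample posts are made up.

```python
import re
import string
from collections import Counter
from urllib.parse import urlparse

# The nine opioid terms listed in the abstract.
OPIOID_TERMS = ("opioid", "opium", "morphine", "opiate", "hydrocodone",
                "oxycodone", "fentanyl", "heroin", "methadone")

# Assumed URL pattern; the abstract does not say how URLs were detected.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")

# Punctuation is replaced by spaces (an assumption) so adjacent words do not merge.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def mentions_opioid(text):
    """Lexicon match on lowercased, punctuation-stripped text (substring match assumed)."""
    cleaned = text.lower().translate(_PUNCT_TO_SPACE)
    return any(term in cleaned for term in OPIOID_TERMS)

def extract_urls(text):
    """Find URLs in the raw text and trim trailing sentence punctuation."""
    return [u.rstrip(".,;:!?'\"") for u in URL_PATTERN.findall(text)]

def domain_of(url):
    """Return the lowercased host name without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def summarize(posts):
    """Count shared domains and top-level domains among opioid-related posts."""
    domains, tlds = Counter(), Counter()
    for text in posts:
        if not mentions_opioid(text):
            continue
        for url in extract_urls(text):
            domain = domain_of(url)
            if domain:
                domains[domain] += 1
                tlds["." + domain.rsplit(".", 1)[-1]] += 1
    return domains, tlds

# Made-up example posts, for illustration only.
posts = [
    "Heroin info: https://www.ncbi.nlm.nih.gov/pubmed/123.",
    "Cats! http://example.com/cat",
    "See (https://en.wikipedia.org/wiki/Fentanyl) and http://youtube.com/watch?v=1",
]
domains, tlds = summarize(posts)
```

Note that URLs are extracted from the raw text rather than the cleaned text, since removing punctuation first would destroy them. The per-domain and per-TLD counts are the kind of input a bubble chart like Figure 1 could be drawn from.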


2018 ◽  
Author(s):  
Youshan Zhang ◽  
Jon-Patrick Allem ◽  
Jennifer Beth Unger ◽  
Tess Boley Cruz

BACKGROUND Instagram, with millions of posts per day, can be used to inform public health surveillance targets and policies. However, current research using image-based data often relies on hand coding of images, which is time-consuming and costly, ultimately limiting the scope of the study. Current best practices in automated image classification (eg, support vector machine (SVM), backpropagation neural network, and artificial neural network) are limited in their capacity to accurately distinguish between objects within images. OBJECTIVE This study aimed to demonstrate how a convolutional neural network (CNN) can be used to extract unique features within an image and how SVM can then be used to classify the image. METHODS Images of waterpipes or hookah (an emerging tobacco product possessing similar harms to those of cigarettes) were collected from Instagram and used in the analyses (N=840). A CNN was used to extract unique features from images identified to contain waterpipes. An SVM classifier was built to distinguish between images with and without waterpipes. Methods for image classification were then compared to show how a CNN+SVM classifier could improve accuracy. RESULTS As the number of validated training images increased, the total number of extracted features increased. In addition, as the number of features learned by the SVM classifier increased, the average level of accuracy increased. Overall, 99.5% (418/420) of images classified were correctly identified as either hookah or nonhookah images. This level of accuracy was an improvement over earlier methods that used SVM, CNN, or bag-of-features alone. CONCLUSIONS A CNN extracts more features of images, allowing an SVM classifier to be better informed, resulting in higher accuracy compared with methods that extract fewer features. Future research can use this method to grow the scope of image-based studies. 
The methods presented here might help detect increases in the popularity of certain tobacco products over time on social media. By taking images of waterpipes from Instagram, we place our methods in a context that can be utilized to inform health researchers analyzing social media to understand user experience with emerging tobacco products and inform public health surveillance targets and policies.

