Probabilistic social learning improves the public’s judgments of news veracity

The digital spread of misinformation is one of the leading threats to democracy, public health, and the global economy. Popular strategies for mitigating misinformation include crowdsourcing, machine learning, and media literacy programs that require social media users to classify news in binary terms as either true or false. However, research on peer influence suggests that framing decisions in binary terms can amplify judgment errors and limit social learning, whereas framing decisions in probabilistic terms can reliably improve judgments. In this preregistered experiment, we compare online peer networks that collaboratively evaluated the veracity of news by communicating either binary or probabilistic judgments. Exchanging probabilistic estimates of news veracity substantially improved individual and group judgments, with the effect of eliminating polarization in news evaluation. By contrast, exchanging binary classifications reduced social learning and maintained polarization. The benefits of probabilistic social learning are robust to participants’ education, gender, race, income, religion, and partisanship.

Predicting Age Groups of Reddit Users Based on Posting Behavior and Metadata: Classification Model Development and Validation (Preprint)

10.2196/preprints.25807, 2020

Author(s): Robert Chew, Caroline Kery, Laura Baum, Thomas Bukowski, Annice Kim, ...

Keyword(s): Public Health, Machine Learning, Social Media, Age Groups, Demographic Characteristics, Machine Learning Algorithms, Classification Model, Age Group, Adult Group, Linguistic Patterns

BACKGROUND Social media are important for monitoring perceptions of public health issues and for educating target audiences about health; however, limited information about the demographics of social media users makes it challenging to identify conversations among target audiences and limits how well social media can be used for public health surveillance and education outreach efforts. Certain social media platforms provide demographic information on the followers of a user account, but this information is not always disclosed, so researchers have developed machine learning algorithms to predict social media users’ demographic characteristics, mainly for Twitter. To date, there has been limited research on predicting the demographic characteristics of Reddit users. OBJECTIVE We aimed to develop a machine learning algorithm that predicts the age segment of Reddit users, as either adolescents or adults, based on publicly available data. METHODS This study was conducted between January and September 2020 using publicly available Reddit posts as input data. We manually labeled Reddit users’ age by identifying and reviewing public posts in which Reddit users self-reported their age. We then collected sample posts, comments, and metadata for the labeled user accounts and created variables to capture linguistic patterns, posting behavior, and account details that would distinguish the adolescent age group (aged 13 to 20 years) from the adult age group (aged 21 to 54 years). We split the data into training (n=1660) and test (n=415) sets and performed 5-fold cross validation on the training set to select hyperparameters and perform feature selection. We ran multiple classification algorithms and tested the performance of the models (precision, recall, F1 score) in predicting the age segments of the users in the labeled data. To evaluate associations between each feature and the outcome, we calculated means and confidence intervals and compared the two age groups using 2-sample t tests for each transformed model feature. RESULTS The gradient boosted trees classifier performed the best, with an F1 score of 0.78. The test set precision and recall scores were 0.79 and 0.89, respectively, for the adolescent group (n=254) and 0.78 and 0.63, respectively, for the adult group (n=161). The most important feature in the model was the number of sentences per comment (permutation score: mean 0.100, SD 0.004). Members of the adolescent age group tended to have created accounts more recently, have higher proportions of submissions and comments in the r/teenagers subreddit, and post more in subreddits with higher subscriber counts than those in the adult group. CONCLUSIONS We created a Reddit age prediction algorithm with competitive accuracy using publicly available data, suggesting machine learning methods can help public health agencies identify age-related target audiences on Reddit. Our results also suggest that there are characteristics of Reddit users’ posting behavior, linguistic patterns, and account features that distinguish adolescents from adults.
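
As a rough check on how the per-group scores in this abstract relate to the headline F1 score of 0.78, the sketch below recomputes F1 for each age group from the reported precision and recall and then averages the two by test-set group size. That the headline figure is a support-weighted average is an assumption, not something the abstract states, and the function names are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(groups):
    """Average per-group F1, weighted by group size (support)."""
    total = sum(n for _, _, n in groups)
    return sum(f1_score(p, r) * n for p, r, n in groups) / total


# Reported test-set figures: (precision, recall, n) for each age group.
reported = [
    (0.79, 0.89, 254),  # adolescents
    (0.78, 0.63, 161),  # adults
]

print(round(weighted_f1(reported), 2))  # prints 0.78
```

Under this assumption the recomputed value agrees with the reported score. The per-group F1 values (about 0.84 for adolescents and 0.70 for adults) show that the classifier is noticeably better at recovering adolescents than adults.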

Suicide Risk and Protective Factors in Online Support Forum Posts: Annotation Scheme Development and Validation Study (Preprint)

10.2196/preprints.24471, 2020

Author(s): Stevie Chancellor, Steven A Sumner, Corinne David-Ferdon, Tahirah Ahmad, Munmun De Choudhury

Keyword(s): Public Health, Machine Learning, Social Media, Protective Factors, Suicide Risk, Risk And Protective Factors, Prior Work, Annotation Scheme, Social Media Data, Media Data

BACKGROUND Online communities provide support for individuals looking for help with suicidal ideation and crisis. As community data are increasingly used to devise machine learning models to infer who might be at risk, there have been limited efforts to identify both risk and protective factors in web-based posts. These annotations can enrich and augment computational assessment approaches to identify appropriate intervention points, which are useful to public health professionals and suicide prevention researchers. OBJECTIVE This qualitative study aims to develop a valid and reliable annotation scheme for evaluating risk and protective factors for suicidal ideation in posts in suicide crisis forums. METHODS We designed a valid, reliable, and clinically grounded process for identifying risk and protective markers in social media data. This scheme draws on prior work on construct validity and the social sciences of measurement. We then applied the scheme to annotate 200 posts from r/SuicideWatch—a Reddit community focused on suicide crisis. RESULTS We documented our results on producing an annotation scheme that is consistent with leading public health information coding schemes for suicide and advances attention to protective factors. Our study showed high internal validity, and we have presented results that indicate that our approach is consistent with findings from prior work. CONCLUSIONS Our work formalizes a framework that incorporates construct validity into the development of annotation schemes for suicide risk on social media. This study furthers the understanding of risk and protective factors expressed in social media data. This may help public health programming to prevent suicide and computational social science research and investigations that rely on the quality of labels for downstream machine learning tasks.

Suicide Risk and Protective Factors in Online Support Forum Posts: Annotation Scheme Development and Validation Study

JMIR Mental Health, 10.2196/24471, 2021, Vol 8 (11), pp. e24471

Author(s): Stevie Chancellor, Steven A Sumner, Corinne David-Ferdon, Tahirah Ahmad, Munmun De Choudhury

Keyword(s): Public Health, Machine Learning, Social Media, Protective Factors, Suicide Risk, Risk And Protective Factors, Prior Work, Annotation Scheme, Social Media Data, Media Data

Young people’s responses to COVID-19 as the pandemic spread across the world: lessons from real-time analysis of online viewer engagement with a pan-African COVID-19 miniseries (Preprint)

10.2196/preprints.30449, 2021

Author(s): Venetia Baker, Georgia Arnold, Sara Piot, Lesedi Thwala, Judith Glynn, ...

Keyword(s): Public Health, Social Media, Young People, Peer Influence, Online Community, Accurate Information, Psychosocial Needs, Real Time Analysis, The Usa, The Impact

BACKGROUND In April 2020, as cases of the novel coronavirus disease (COVID-19) spread across the globe, the MTV Staying Alive Foundation created the educational entertainment miniseries MTV Shuga: Alone Together. In 70 short episodes released daily on YouTube, Alone Together aimed to disseminate timely and accurate information to increase young people’s knowledge, motivation and actions to prevent COVID-19. OBJECTIVE We sought to identify young people’s perspectives on the global COVID-19 pandemic and national lockdowns by examining the words, conversations, experiences and emotions expressed on social media in response to the Alone Together episodes. We also assessed how viewers used the series and its online community as a source of support during the global pandemic. METHODS A total of 3,982 comments and 70 live chat conversations were extracted from YouTube between April and October 2020 and analysed using a data-led inductive thematic approach. Analyses were conducted within one week of episodes premiering. Aggregated demographic and geographical data were collected using YouTube Analytics. RESULTS The miniseries had a global reach across 5 continents, with a total of 7.7 million views across MTV Shuga platforms. The series had almost 1 million views over 70 episodes on YouTube and an average of 5,683 unique viewers per episode. The audience was predominantly young (65% aged 18-24 years) and female (85%). Across diverse countries such as Nigeria, Ghana, the USA and the UK, viewers believed that COVID-19 was serious and expressed that it was people’s social responsibility to follow public health measures. The series storylines about the impact of self-isolation on mental health, exposure to violence in lockdowns and restricted employment opportunities due to the pandemic resonated with young viewers. Tuning in to the miniseries provided viewers with reliable information, entertainment, and an online community during an isolating, confusing and worrying time. CONCLUSIONS During the first wave of COVID-19, young people from 53 countries connected on social media via the MTV miniseries. The analysis showed how digitally connected young people, predominantly young women, felt compelled to follow COVID-19 safety measures despite the pandemic’s impact on their psychosocial needs. Viewers used social media to reach out to fellow viewers for advice, solace, support and resources. Organisations, governments and individuals have been forced to innovate during the pandemic to ensure people can access services safely and remotely. This analysis showed that young people are receptive to receiving support from online communities and media services. Peer influence and support online can be a powerful public health tool as young people have a great capacity to influence each other and shape norms around public health. However, online services are not accessible to everyone, and COVID-19 has increased disparities between digitally connected and unconnected young people.

Physical activity, sedentary behaviour, and sleep on Twitter: A multicountry and fully labelled dataset for public health surveillance research (Preprint)

10.2196/preprints.32355, 2021

Author(s): Zahra Shakeri Hossein Abad, Gregory P. Butler, Wendy Thompson, Joon Lee

Keyword(s): Public Health, Physical Activity, Machine Learning, Social Media, Sedentary Behaviour, Public Health Surveillance, Health Surveillance, Surveillance Systems, Social Media Data, Media Data

BACKGROUND Advances in automated data processing and machine learning (ML) models, together with the unprecedented growth in the number of social media users who publicly share and discuss health-related information, have made public health surveillance (PHS) one of the long-lasting applications of social media. However, existing PHS systems that feed on social media data have not been widely deployed in national surveillance systems, which appears to stem from practitioners’ and the public’s lack of trust in social media data. Providing more robust and reliable datasets on which supervised machine learning models can be trained and tested is a significant step toward overcoming this hurdle. OBJECTIVE The health implications of daily behaviours (physical activity, sedentary behaviour, and sleep; PASS), an evergreen topic in PHS, are widely studied through traditional data sources such as surveillance surveys and administrative databases, which are often several months out of date by the time they are utilized, costly to collect, and thus limited in quantity and coverage. In this paper, we present LPHEADA, a multicountry and fully Labelled digital Public HEAlth DAtaset of tweets originating in Australia, Canada, the United Kingdom (UK), or the United States (US). METHODS We collected the data for this study from Twitter using the Twitter livestream application programming interface (API) between 28 November 2018 and 19 June 2020. To obtain PASS-related tweets for manual annotation, we iteratively used regular expressions, unsupervised natural language processing, domain-specific ontologies and linguistic analysis. We used Amazon Mechanical Turk (AMT) to label the collected data with self-reported PASS categories and implemented a quality control pipeline to monitor and manage the validity of crowd-generated labels. Moreover, we used ML, latent semantic analysis, linguistic analysis, and label inference analysis to validate different components of the dataset. RESULTS LPHEADA contains 366,405 crowd-generated labels (three labels per tweet) for 122,135 PASS-related tweets, labelled by 708 unique annotators on AMT. In addition to crowd-generated labels, LPHEADA provides details about the three critical components of any PHS system: place, time, and demographics (gender, age range) associated with each tweet. CONCLUSIONS Publicly available datasets for digital PASS surveillance are usually isolated and only provide labels for small subsets of the data. We believe that the novelty and comprehensiveness of the dataset provided in this study will help develop, evaluate, and deploy digital PASS surveillance systems. LPHEADA will be an invaluable resource for both public health researchers and practitioners.
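
The abstract does not say how the three crowd labels collected for each tweet are combined into a single label. One common baseline, shown here purely as an assumed illustration and not as the dataset authors' method, is a strict majority vote that leaves three-way disagreements unresolved. The tweet IDs and label names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_label(labels: List[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


# Hypothetical crowd labels: three annotators per tweet.
crowd_labels: Dict[str, List[str]] = {
    "tweet_1": ["physical_activity", "physical_activity", "sleep"],
    "tweet_2": ["sleep", "sedentary", "physical_activity"],
}

aggregated = {tid: majority_label(votes) for tid, votes in crowd_labels.items()}
print(aggregated)  # {'tweet_1': 'physical_activity', 'tweet_2': None}

# The reported totals are consistent with three labels per tweet.
assert 3 * 122_135 == 366_405
```

Returning None for ties keeps contested tweets visible, so they can be sent for further annotation or handled by a label inference model instead of being silently assigned a class.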

Tactics of news literacy: How young people access, evaluate, and engage with news on social media

New Media & Society, 10.1177/14614448211011447, 2021, pp. 146144482110114

Author(s): Joëlle Swart

Keyword(s): Social Media, Young People, Everyday Life, Media Literacy, Social Contexts, Literacy Programs, Situated Knowledge, Knowledge And Skills, News Literacy, Depth Interviews

Young people’s increasing dependence on social media for news demands increasing levels of news literacy, leading to a rise in media literacy programs that aim to support youth’s abilities to critically and mindfully navigate news. However, being news literate does not necessarily mean such knowledge and skills are applied in practice. This article starts from young people’s own news practices and experiences on social media to explore when news literacy becomes meaningful in the practice of everyday life. Based on in-depth interviews with 36 young people aged 16–22, it explores what strategies and tactics they employ to access, evaluate, or engage with news. It argues that such practices can be considered as expressions of news literacy, through which young people negotiate platform structures and norms taught in media education. Moreover, it reconceptualizes news literacy as a form of situated knowledge, emphasizing how platform and social contexts shape users’ attitudes, motivations, and perceptions of agency.

Identifying Key Target Audiences for Public Health Campaigns: Leveraging Machine Learning in the Case of Hookah Tobacco Smoking (Preprint)

10.2196/preprints.12443, 2018

Author(s): Kar-Hai Chu, Jason Colditz, Momin Malik, Tabitha Yates, Brian Primack

Keyword(s): Public Health, Machine Learning, Social Media, Language Processing, Tobacco Smoking, A Priori, Machine Learning Techniques, Systematic Research, Health Campaigns, Public Health Officials

BACKGROUND Hookah tobacco smoking (HTS) is a particularly important issue for public health professionals to address owing to its prevalence and deleterious health effects. Social media sites can be a valuable tool for public health officials to conduct informational health campaigns. Current social media platforms provide researchers with opportunities to better identify and target specific audiences and even individuals. However, we are not aware of systematic research attempting to identify audiences with mixed or ambivalent views toward HTS. OBJECTIVE The objective of this study was to (1) confirm previous research showing positively skewed HTS sentiment on Twitter using a larger dataset by leveraging machine learning techniques and (2) systematically identify individuals who exhibit mixed opinions about HTS via the Twitter platform and therefore represent key audiences for intervention. METHODS We prospectively collected tweets related to HTS from January to June 2016. We double-coded a subset of approximately 5000 randomly sampled tweets for sentiment toward HTS and used these data to train a machine learning classifier to assess the remaining approximately 556,000 HTS-related Twitter posts. Natural language processing software was used to extract linguistic features (ie, language-based covariates). The data were processed by machine learning tools and algorithms using R. Finally, we used the results to identify individuals who, because they had consistently posted both positive and negative content, might be ambivalent toward HTS and represent an ideal audience for intervention. RESULTS There were 561,960 HTS-related tweets: 373,911 were classified as positive and 183,139 were classified as negative. A set of 12,861 users met a priori criteria indicating that they posted both positive and negative tweets about HTS. CONCLUSIONS Sentiment analysis can allow researchers to identify audience segments on social media that demonstrate ambivalence toward key public health issues, such as HTS, and therefore represent ideal populations for intervention. Using large social media datasets can help public health officials to preemptively identify specific audience segments that would be most receptive to targeted campaigns.
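
The abstract does not spell out the a priori criteria used to flag users with mixed opinions, and the original analysis was carried out in R. The Python sketch below is therefore only an assumed illustration of the general idea: after each tweet has been classified, keep the users who have at least a minimum number of both positive and negative tweets. The `min_each` threshold and the sample data are hypothetical.

```python
from collections import defaultdict


def ambivalent_users(classified_tweets, min_each=1):
    """Return users with at least `min_each` positive and `min_each` negative tweets."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for user, sentiment in classified_tweets:
        if sentiment in ("positive", "negative"):
            counts[user][sentiment] += 1
    return sorted(
        user
        for user, c in counts.items()
        if c["positive"] >= min_each and c["negative"] >= min_each
    )


# Hypothetical (user, predicted sentiment) pairs from a sentiment classifier.
tweets = [
    ("user_a", "positive"),
    ("user_a", "negative"),
    ("user_b", "positive"),
    ("user_b", "positive"),
    ("user_c", "negative"),
]

print(ambivalent_users(tweets))  # ['user_a']
```

Raising `min_each` would restrict the result to users who post both kinds of content repeatedly, which is closer in spirit to the "consistently posted" wording in the abstract.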

Recent Advances in Using Natural Language Processing to Address Public Health Research Questions Using Social Media and Consumer-Generated Data

Yearbook of Medical Informatics, 10.1055/s-0039-1677918, 2019, Vol 28 (01), pp. 208-217, Cited By ~ 9

Author(s): Mike Conway, Mengke Hu, Wendy W. Chapman

Keyword(s): Public Health, Mental Health, Machine Learning, Social Media, Natural Language Processing, Language Processing, Online Health Communities, Machine Learning Methods, Health Communities, Health Applications

Objective: We present a narrative review of recent work on the utilisation of Natural Language Processing (NLP) for the analysis of social media (including online health communities) specifically for public health applications. Methods: We conducted a literature review of NLP research that utilised social media or online consumer-generated text for public health applications, focussing on the years 2016 to 2018. Papers were identified in several ways, including PubMed searches and the inspection of recent conference proceedings from the Association for Computational Linguistics (ACL), the Conference on Human Factors in Computing Systems (CHI), and the International AAAI (Association for the Advancement of Artificial Intelligence) Conference on Web and Social Media (ICWSM). Popular data sources included Twitter, Reddit, various online health communities, and Facebook. Results: In the recent past, communicable diseases (e.g., influenza, dengue) have been the focus of much social media-based NLP health research. However, mental health and substance use and abuse (including the use of tobacco, alcohol, marijuana, and opioids) have been the subject of an increasing volume of research in the 2016-2018 period. Associated with this trend, the use of lexicon-based methods remains popular given the availability of psychologically validated lexical resources suitable for mental health and substance abuse research. Finally, we found that in the period under review “modern” machine learning methods (i.e. deep neural-network-based methods), while increasing in popularity, remain less widely used than “classical” machine learning methods.

Leveraging Social Media Activity and Machine Learning for HIV and Substance Abuse Risk Assessment (Preprint)

10.2196/preprints.22042, 2020

Author(s): Anaelia Ovalle, Orpaz Goldstein, Mohammad Kachuee, Elizabeth Wu, Ian W Holloway, ...

Keyword(s): Public Health, Machine Learning, Social Media, Health Promotion, Substance Use, Health Interventions, Risk Scores, Social Media Data, Sex With Men, Media Data

BACKGROUND Online social media networks provide an abundance of diverse information that can be leveraged for data-driven applications across various social and physical sciences. One opportunity to utilize such data exists in the public health domain, where data collection is often constrained by organizational funding and limited user adoption. Furthermore, the efficacy of health interventions is often assessed using self-reported data, which are not always reliable. Health-promotion strategies for communities facing multiple vulnerabilities, such as men who have sex with men (MSM), can benefit from an automated system that not only determines health behavior risk but also suggests appropriate intervention targets. OBJECTIVE This study aimed to determine the value in leveraging social media interactions to identify health risk behavior for men who have sex with men. METHODS The Gay Social Networking Analysis Program (GSNAP) was created as a preliminary framework for intelligent online health-promotion intervention. The program consisted of a data collection system that automatically gathered social media data, health questionnaires, and clinical results for sexually transmitted diseases and drug tests for 51 participants over a 3-month period. Machine learning techniques were utilized to assess the relationship between social media messages and participants' offline sexual health and substance use biological outcomes. The F1 score, the harmonic mean of precision and recall, was used to evaluate each algorithm. Natural language processing techniques were employed to create health behavior risk scores from participant messages. RESULTS Across several machine learning algorithms, offline HIV, amphetamine, and methamphetamine use could be identified using only social media data, with the best model providing F1 scores of 82.6%, 85.9%, and 85.3%, respectively. Additionally, constructed risk scores were found to be reasonably comparable to risk scores adapted from the Centers for Disease Control and Prevention. CONCLUSIONS To our knowledge, our study is the first implementation and empirical evaluation of a social media-based public health intervention framework among MSM. We found that social media data are correlated with offline sexual health and substance use, verified through biological testing. The proof of concept and initial results validate that public health interventions can indeed use social media-based systems to successfully determine offline health risk behaviors. The findings demonstrate the promise of deploying a social media-based just-in-time adaptive intervention to target substance use and HIV risk behavior.

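Impact of the Coronavirus Pandemic on the Addiction Prevention Measures of the Federal Centre for Health Education (BZgA) [Auswirkungen der Corona-Pandemie auf die Maßnahmen zur Suchtprävention der Bundeszentrale für gesundheitliche Aufklärung (BZgA)]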

SUCHT - Zeitschrift für Wissenschaft und Praxis / Journal of Addiction Research and Practice, 10.1024/0939-5911/a000677, 2020, Vol 66 (5), pp. 259-264

Author(s): Michaela Goecke

Keyword(s): Public Health, Social Media, Young Adults (Junge Erwachsene), Bundeszentrale für gesundheitliche Aufklärung

Abstract. Background: The Federal Centre for Health Education (Bundeszentrale für gesundheitliche Aufklärung, BZgA) is a specialist federal authority responsible, among other things, for implementing national addiction prevention programmes. Its annual work programmes are agreed with the Federal Ministry of Health and, given their public health relevance, currently focus on prevention for the legal substances tobacco and alcohol. The primary target groups are adolescents and young adults, because risky consumption patterns can develop and become established in these groups. The BZgA’s prevention programmes comprise school-based offerings, web portals, social media, and print media such as information brochures. Current situation: The coronavirus pandemic has affected the BZgA’s addiction prevention work. This includes the thematic linking of addiction prevention with coronavirus-related topics and a change in the content of counselling needs, both by telephone and online. The contact restrictions imposed during the “lockdown” and the new conditions for in-person interaction have also changed addiction prevention. Interactive prevention offerings in schools were suspended, as were support for participatory campaigns in sports clubs and the delivery of peer programmes. In their place, the use of digital options came into new focus, both for delivering addiction prevention services and for cooperation and networking with the federal states. In the longer term, the coronavirus crisis may also become an opportunity for more digitalisation in addiction prevention.
