Suicide Risk and Protective Factors in Online Support Forum Posts: Annotation Scheme Development and Validation Study

Background Online communities provide support for individuals looking for help with suicidal ideation and crisis. As community data are increasingly used to devise machine learning models to infer who might be at risk, there have been limited efforts to identify both risk and protective factors in web-based posts. These annotations can enrich and augment computational assessment approaches to identify appropriate intervention points, which are useful to public health professionals and suicide prevention researchers. Objective This qualitative study aims to develop a valid and reliable annotation scheme for evaluating risk and protective factors for suicidal ideation in posts in suicide crisis forums. Methods We designed a valid, reliable, and clinically grounded process for identifying risk and protective markers in social media data. This scheme draws on prior work on construct validity and the social sciences of measurement. We then applied the scheme to annotate 200 posts from r/SuicideWatch—a Reddit community focused on suicide crisis. Results We documented our results on producing an annotation scheme that is consistent with leading public health information coding schemes for suicide and advances attention to protective factors. Our study showed high internal validity, and we have presented results that indicate that our approach is consistent with findings from prior work. Conclusions Our work formalizes a framework that incorporates construct validity into the development of annotation schemes for suicide risk on social media. This study furthers the understanding of risk and protective factors expressed in social media data. This may help public health programming to prevent suicide and computational social science research and investigations that rely on the quality of labels for downstream machine learning tasks.

Download Full-text

Suicide Risk and Protective Factors in Online Support Forum Posts: Annotation Scheme Development and Validation Study (Preprint)

10.2196/preprints.24471 ◽

2020 ◽

Author(s):

Stevie Chancellor ◽

Steven A Sumner ◽

Corinne David-Ferdon ◽

Tahirah Ahmad ◽

Munmun De Choudhury

Keyword(s):

Public Health ◽

Machine Learning ◽

Social Media ◽

Protective Factors ◽

Suicide Risk ◽

Risk And Protective Factors ◽

Prior Work ◽

Annotation Scheme ◽

Social Media Data ◽

Media Data

BACKGROUND Online communities provide support for individuals looking for help with suicidal ideation and crisis. As community data are increasingly used to devise machine learning models to infer who might be at risk, there have been limited efforts to identify both risk and protective factors in web-based posts. These annotations can enrich and augment computational assessment approaches to identify appropriate intervention points, which are useful to public health professionals and suicide prevention researchers. OBJECTIVE This qualitative study aims to develop a valid and reliable annotation scheme for evaluating risk and protective factors for suicidal ideation in posts in suicide crisis forums. METHODS We designed a valid, reliable, and clinically grounded process for identifying risk and protective markers in social media data. This scheme draws on prior work on construct validity and the social sciences of measurement. We then applied the scheme to annotate 200 posts from r/SuicideWatch—a Reddit community focused on suicide crisis. RESULTS We documented our results on producing an annotation scheme that is consistent with leading public health information coding schemes for suicide and advances attention to protective factors. Our study showed high internal validity, and we have presented results that indicate that our approach is consistent with findings from prior work. CONCLUSIONS Our work formalizes a framework that incorporates construct validity into the development of annotation schemes for suicide risk on social media. This study furthers the understanding of risk and protective factors expressed in social media data. This may help public health programming to prevent suicide and computational social science research and investigations that rely on the quality of labels for downstream machine learning tasks.

Download Full-text

Physical activity, sedentary behaviour, and sleep on Twitter: A multicountry and fully labelled dataset for public health surveillance research (Preprint)

10.2196/preprints.32355 ◽

2021 ◽

Author(s):

Zahra Shakeri Hossein Abad ◽

Gregory P. Butler ◽

Wendy Thompson ◽

Joon Lee

Keyword(s):

Public Health ◽

Physical Activity ◽

Machine Learning ◽

Social Media ◽

Sedentary Behaviour ◽

Public Health Surveillance ◽

Health Surveillance ◽

Surveillance Systems ◽

Social Media Data ◽

Media Data

BACKGROUND Advances in automated data processing and machine learning (ML) models, together with the unprecedented growth in the number of social media users who publicly share and discuss health-related information, have made public health surveillance (PHS) one of the long-lasting social media applications. However, the existing PHS systems feeding on social media data have not been widely deployed in national surveillance systems, which appears to stem from the lack of practitioners and the public’s trust in social media data. More robust and reliable datasets over which supervised machine learning models can be trained and tested reliably is a significant step toward overcoming this hurdle. OBJECTIVE The health implications of daily behaviours (physical activity, sedentary behaviour, and sleep (PASS)), as an evergreen topic in PHS, are widely studied through traditional data sources such as surveillance surveys and administrative databases, which are often several months out of date by the time they are utilized, costly to collect, and thus limited in quantity and coverage. In this paper, we present LPHEADA, a multicountry and fully Labelled digital Public HEAlth DAtaset of tweets originated in Australia, Canada, the United Kingdom (UK), or the United States (US). METHODS We collected the data of this study from Twitter using the Twitter livestream application programming interface (API) between 28th November 2018 to 19th June 2020. To obtain PASS-related tweets for manual annotation, we iteratively used regular expressions, unsupervised natural language processing, domain-specific ontologies and linguistic analysis. We used Amazon Mechanical Turk (AMT) to label the collected data to self-reported PASS categories and implemented a quality control pipeline to monitor and manage the validity of crow-generated labels. Moreover, we used ML, latent semantic analysis, linguistic analysis, and label inference analysis to validate different components of the dataset. RESULTS LPHEADA contains 366,405 crowd-generated labels (three labels per tweet) for 122,135 PASS-related tweets, labelled by 708 unique annotators on AMT. In addition to crowd-generated labels, LPHEADA provides details about the three critical components of any PHS system: place, time, and demographics (gender, age range) associated with each tweet. CONCLUSIONS Publicly available datasets for digital PASS surveillance are usually isolated and only provide labels for small subsets of the data. We believe that the novelty and comprehensiveness of the dataset provided in this study will help develop, evaluate, and deploy digital PASS surveillance systems. LPHEADA will be an invaluable resource for both public health researchers and practitioners.

Download Full-text

Leveraging Social Media Activity and Machine Learning for HIV and Substance Abuse Risk Assessment (Preprint)

10.2196/preprints.22042 ◽

2020 ◽

Author(s):

Anaelia Ovalle ◽

Orpaz Goldstein ◽

Mohammad Kachuee ◽

Elizabeth Wu ◽

Ian W Holloway ◽

...

Keyword(s):

Public Health ◽

Machine Learning ◽

Social Media ◽

Health Promotion ◽

Substance Use ◽

Health Interventions ◽

Risk Scores ◽

Social Media Data ◽

Sex With Men ◽

Media Data

BACKGROUND Online social media networks provide an abundance of diverse information that can be leveraged for data-driven applications across various social and physical sciences. One opportunity to utilize such data exists in the public health domain, where data collection is often constrained by organizational funding and limited user adoption. Furthermore, the efficacy of health interventions are often based on self-reported data, which is not always reliable. Health-promotion strategies for communities facing multiple vulnerabilities, such as men who have sex with men, can benefit from an automated system that not only determines health behavior risk but also suggests appropriate intervention targets. OBJECTIVE This study aimed to determine the value in leveraging social media interactions to identify health risk behavior for men who have sex with men. METHODS The Gay Social Networking Analysis Program (GSNAP) was created as a preliminary framework for intelligent online health-promotion intervention. The program consisted of a data collection system that automatically gathered social media data, health questionnaires, and clinical results for sexually transmitted diseases and drug tests across 51 participants over a 3-month period. Machine learning techniques were utilized to assess the relationship between social media messages and participants' offline sexual health and substance use biological outcomes. The F1 score, a weighted average of precision and recall, was used to evaluate each algorithm. Natural language processing techniques were employed to create health behavior risk scores from participant messages. RESULTS Across several machine learning algorithms, offline HIV, amphetamine, and methamphetamine use were able to be identified using only social media data, with the best model providing F1 scores of 82.6\%, 85.9\%, and 85.3\%, respectively. Additionally, constructed risk scores were found to be reasonably comparable to risk scores adapted from the Center for Disease Control. CONCLUSIONS To our knowledge, our study is the first implementation and empirical evaluation of a social-media based public health intervention framework in MSM. We found that social media data is correlated with offline sexual health and substance use, verified through biological testing. The proof of concept and initial results validate that public health interventions can indeed use social media-based systems to successfully determine offline health risk behaviors. The findings demonstrate the promise of deploying a social media-based just-in-time adaptive intervention to target substance use and HIV risk behavior.

Download Full-text

Hybrid features prediction model of movie quality using Multi-machine learning techniques for effective business resource planning

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-201844 ◽

2021 ◽

Vol 40 (5) ◽

pp. 9361-9382 ◽

Cited By ~ 1

Author(s):

Naeem Iqbal ◽

Rashid Ahmad ◽

Faisal Jamil ◽

Do-Hyeun Kim

Keyword(s):

Machine Learning ◽

Social Media ◽

Resource Planning ◽

Experimental Results ◽

Quality Prediction ◽

Classification Models ◽

Hybrid Features ◽

Social Media Data ◽

Media Data

Quality prediction plays an essential role in the business outcome of the product. Due to the business interest of the concept, it has extensively been studied in the last few years. Advancement in machine learning (ML) techniques and with the advent of robust and sophisticated ML algorithms, it is required to analyze the factors influencing the success of the movies. This paper presents a hybrid features prediction model based on pre-released and social media data features using multiple ML techniques to predict the quality of the pre-released movies for effective business resource planning. This study aims to integrate pre-released and social media data features to form a hybrid features-based movie quality prediction (MQP) model. The proposed model comprises of two different experimental models; (i) predict movies quality using the original set of features and (ii) develop a subset of features based on principle component analysis technique to predict movies success class. This work employ and implement different ML-based classification models, such as Decision Tree (DT), Support Vector Machines with the linear and quadratic kernel (L-SVM and Q-SVM), Logistic Regression (LR), Bagged Tree (BT) and Boosted Tree (BOT), to predict the quality of the movies. Different performance measures are utilized to evaluate the performance of the proposed ML-based classification models, such as Accuracy (AC), Precision (PR), Recall (RE), and F-Measure (FM). The experimental results reveal that BT and BOT classifiers performed accurately and produced high accuracy compared to other classifiers, such as DT, LR, LSVM, and Q-SVM. The BT and BOT classifiers achieved an accuracy of 90.1% and 89.7%, which shows an efficiency of the proposed MQP model compared to other state-of-art- techniques. The proposed work is also compared with existing prediction models, and experimental results indicate that the proposed MQP model performed slightly better compared to other models. The experimental results will help the movies industry to formulate business resources effectively, such as investment, number of screens, and release date planning, etc.

Download Full-text

Predicting ethnicity with data on personal names in Russia

10.31235/osf.io/wf6p4 ◽

2021 ◽

Author(s):

Alexey Bessudnov ◽

Denis Tarasov ◽

Viacheslav Panasovets ◽

Veronica Kostenko ◽

Ivan Smirnov ◽

...

Keyword(s):

Machine Learning ◽

Social Media ◽

Ethnic Groups ◽

Geographical Location ◽

Ethnic Relations ◽

Social Media Data ◽

Personal Names ◽

Learning Classifier ◽

Media Data

In this paper we develop a machine learning classifier that predicts perceived ethnicity from data on personal names for major ethnic groups populating Russia. We collect data from VK, the largest Russian social media website. Ethnicity has been determined from languages spoken by users and their geographical location, with the data manually cleaned by crowd workers. The classifier shows the accuracy of 0.82 for a scheme with 24 ethnic groups and 0.92 for 15 aggregated ethnic groups. It can be used for research on ethnicity and ethnic relations in Russia, in particular with VK and other social media data.

Download Full-text

Sentiment Analysis in Social Media using Machine Learning Techniques

Iraqi Journal of Science ◽

10.24996/ijs.2020.61.1.22 ◽

2020 ◽

pp. 193-201 ◽

Cited By ~ 1

Author(s):

Hayder A. Alatabi ◽

Ayad R. Abbas

Keyword(s):

Machine Learning ◽

Social Media ◽

Sentiment Analysis ◽

Machine Learning Techniques ◽

Great Success ◽

Social Media Data ◽

Learning Techniques ◽

The World ◽

Analysis System ◽

Media Data

Over the last period, social media achieved a widespread use worldwide where the statistics indicate that more than three billion people are on social media, leading to large quantities of data online. To analyze these large quantities of data, a special classification method known as sentiment analysis, is used. This paper presents a new sentiment analysis system based on machine learning techniques, which aims to create a process to extract the polarity from social media texts. By using machine learning techniques, sentiment analysis achieved a great success around the world. This paper investigates this topic and proposes a sentiment analysis system built on Bayesian Rough Decision Tree (BRDT) algorithm. The experimental results show the success of this system where the accuracy of the system is more than 95% on social media data.

Download Full-text

Communication Sentiment Analyzer using Machine Learning with Naive Bayes Bernoullinb

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a1610.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 5976-5979

Keyword(s):

Machine Learning ◽

Social Media ◽

Major Part ◽

Naive Bayes ◽

Naïve Bayes ◽

User Preferences ◽

Social Media Data ◽

Machine Learning Model ◽

The World ◽

Media Data

In this never-ending social media era it is estimated that over 5 billion people use smartphones. Out of these, there are over 1.5 billion active users in the world. In which we all are a major part and before opening our messages we all are curious about what message we have received. No doubt, we all always hope for a good message to be received. So Sentiment analysis on social media data has been seen by many as an effective tool to monitor user preferences and inclination. Finally, we propose a scalable machine learning model to analyze the polarity of a communicative text using Naive Bayes’ Bernoulli classifier. This paper works on only two polarities that is whether the sentence is positive or negative. Bernoulli classifier is used in this paper because it is best suited for binary inputs which in turn enhances the accuracy of up to 97%.

Download Full-text

Understanding COVID-19 Public Sentiment Towards Public Health Policies Using Social Media Data

10.1109/ghtc53159.2021.9612509 ◽

2021 ◽

Author(s):

Olivia Figueira ◽

Yuka Hatori ◽

Liying Liang ◽

Christine Chye ◽

Yuhong Liu

Keyword(s):

Public Health ◽

Social Media ◽

Health Policies ◽

Social Media Data ◽

Public Health Policies ◽

Public Sentiment ◽

Media Data

Download Full-text

An unsupervised machine learning model for discovering latent infectious diseases using social media data

Journal of Biomedical Informatics ◽

10.1016/j.jbi.2016.12.007 ◽

2017 ◽

Vol 66 ◽

pp. 82-94 ◽

Cited By ~ 43

Author(s):

Sunghoon Lim ◽

Conrad S. Tucker ◽

Soundar Kumara

Keyword(s):

Machine Learning ◽

Social Media ◽

Infectious Diseases ◽

Learning Model ◽

Unsupervised Machine Learning ◽

Social Media Data ◽

Machine Learning Model ◽

Media Data

Download Full-text

Comprehensive scoping review of health research using social media data

BMJ Open ◽

10.1136/bmjopen-2018-022931 ◽

2018 ◽

Vol 8 (12) ◽

pp. e022931 ◽

Cited By ~ 3

Author(s):

Joanna Taylor ◽

Claudia Pagliari

Keyword(s):

Public Health ◽

Social Media ◽

Scoping Review ◽

Ethical Issues ◽

Social Media Data ◽

Related Research ◽

Health Related Research ◽

Health Related ◽

Google Search ◽

Media Data

IntroductionThe rising popularity of social media, since their inception around 20 years ago, has been echoed in the growth of health-related research using data derived from them. This has created a demand for literature reviews to synthesise this emerging evidence base and inform future activities. Existing reviews tend to be narrow in scope, with limited consideration of the different types of data, analytical methods and ethical issues involved. There has also been a tendency for research to be siloed within different academic communities (eg, computer science, public health), hindering knowledge translation. To address these limitations, we will undertake a comprehensive scoping review, to systematically capture the broad corpus of published, health-related research based on social media data. Here, we present the review protocol and the pilot analyses used to inform it.MethodsA version of Arksey and O’Malley’s five-stage scoping review framework will be followed: (1) identifying the research question; (2) identifying the relevant literature; (3) selecting the studies; (4) charting the data and (5) collating, summarising and reporting the results. To inform the search strategy, we developed an inclusive list of keyword combinations related to social media, health and relevant methodologies. The frequency and variability of terms were charted over time and cross referenced with significant events, such as the advent of Twitter. Five leading health, informatics, business and cross-disciplinary databases will be searched: PubMed, Scopus, Association of Computer Machinery, Institute of Electrical and Electronics Engineers and Applied Social Sciences Index and Abstracts, alongside the Google search engine. There will be no restriction by date.Ethics and disseminationThe review focuses on published research in the public domain therefore no ethics approval is required. The completed review will be submitted for publication to a peer-reviewed, interdisciplinary open access journal, and conferences on public health and digital research.

Download Full-text