Applying Unsupervised and Supervised Machine Learning Methodologies in Social Media Textual Traffic Data

Exploring fake news identification using word and sentence embeddings

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189865 ◽

2021 ◽

pp. 1-8

Author(s):

V.T Priyanga ◽

J.P Sanjanasri ◽

Vijay Krishna Menon ◽

E.A Gopalakrishnan ◽

K.P Soman

Keyword(s):

Machine Learning ◽

Social Media ◽

Network Analysis ◽

Supervised Machine Learning ◽

Breeding Ground ◽

Fake News ◽

Data Set ◽

Highly Correlated ◽

Use Of Social Media ◽

The Liar

The widespread use of social media like Facebook, Twitter, Whatsapp, etc. has changed the way News is created and published; accessing news has become easy and inexpensive. However, the scale of usage and inability to moderate the content has made social media, a breeding ground for the circulation of fake news. Fake news is deliberately created either to increase the readership or disrupt the order in the society for political and commercial benefits. It is of paramount importance to identify and filter out fake news especially in democratic societies. Most existing methods for detecting fake news involve traditional supervised machine learning which has been quite ineffective. In this paper, we are analyzing word embedding features that can tell apart fake news from true news. We use the LIAR and ISOT data set. We churn out highly correlated news data from the entire data set by using cosine similarity and other such metrices, in order to distinguish their domains based on central topics. We then employ auto-encoders to detect and differentiate between true and fake news while also exploring their separability through network analysis.

Download Full-text

Towards scaling Twitter for digital epidemiology of birth defects

npj Digital Medicine ◽

10.1038/s41746-019-0170-5 ◽

2019 ◽

Vol 2 (1) ◽

Cited By ~ 4

Author(s):

Ari Z. Klein ◽

Abeed Sarker ◽

Davy Weissenbacher ◽

Graciela Gonzalez-Hernandez

Keyword(s):

Machine Learning ◽

Social Media ◽

Language Processing ◽

Birth Defects ◽

Birth Defect ◽

Learning Algorithms ◽

Class Imbalance ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Svm Classifier

Abstract Social media has recently been used to identify and study a small cohort of Twitter users whose pregnancies with birth defect outcomes—the leading cause of infant mortality—could be observed via their publicly available tweets. In this study, we exploit social media on a larger scale by developing natural language processing (NLP) methods to automatically detect, among thousands of users, a cohort of mothers reporting that their child has a birth defect. We used 22,999 annotated tweets to train and evaluate supervised machine learning algorithms—feature-engineered and deep learning-based classifiers—that automatically distinguish tweets referring to the user’s pregnancy outcome from tweets that merely mention birth defects. Because 90% of the tweets merely mention birth defects, we experimented with under-sampling and over-sampling approaches to address this class imbalance. An SVM classifier achieved the best performance for the two positive classes: an F1-score of 0.65 for the “defect” class and 0.51 for the “possible defect” class. We deployed the classifier on 20,457 unlabeled tweets that mention birth defects, which helped identify 542 additional users for potential inclusion in our cohort. Contributions of this study include (1) NLP methods for automatically detecting tweets by users reporting their birth defect outcomes, (2) findings that an SVM classifier can outperform a deep neural network-based classifier for highly imbalanced social media data, (3) evidence that automatic classification can be used to identify additional users for potential inclusion in our cohort, and (4) a publicly available corpus for training and evaluating supervised machine learning algorithms.

Download Full-text

Schizophrenia Detection Using Machine Learning Approach from Social Media Content

Sensors ◽

10.3390/s21175924 ◽

2021 ◽

Vol 21 (17) ◽

pp. 5924

Author(s):

Yi Ji Bae ◽

Midan Shim ◽

Won Hee Lee

Keyword(s):

Mental Health ◽

Machine Learning ◽

Social Media ◽

Mental Health Problems ◽

Negative Emotion ◽

Supervised Machine Learning ◽

Control Group ◽

Learning Approaches ◽

Linguistic Features ◽

Media Texts

Schizophrenia is a severe mental disorder that ranks among the leading causes of disability worldwide. However, many cases of schizophrenia remain untreated due to failure to diagnose, self-denial, and social stigma. With the advent of social media, individuals suffering from schizophrenia share their mental health problems and seek support and treatment options. Machine learning approaches are increasingly used for detecting schizophrenia from social media posts. This study aims to determine whether machine learning could be effectively used to detect signs of schizophrenia in social media users by analyzing their social media texts. To this end, we collected posts from the social media platform Reddit focusing on schizophrenia, along with non-mental health related posts (fitness, jokes, meditation, parenting, relationships, and teaching) for the control group. We extracted linguistic features and content topics from the posts. Using supervised machine learning, we classified posts belonging to schizophrenia and interpreted important features to identify linguistic markers of schizophrenia. We applied unsupervised clustering to the features to uncover a coherent semantic representation of words in schizophrenia. We identified significant differences in linguistic features and topics including increased use of third person plural pronouns and negative emotion words and symptom-related topics. We distinguished schizophrenic from control posts with an accuracy of 96%. Finally, we found that coherent semantic groups of words were the key to detecting schizophrenia. Our findings suggest that machine learning approaches could help us understand the linguistic characteristics of schizophrenia and identify schizophrenia or otherwise at-risk individuals using social media texts.

Download Full-text

Microblog credibility indicators regarding misinformation of genetically modified food on Weibo

PLoS ONE ◽

10.1371/journal.pone.0252392 ◽

2021 ◽

Vol 16 (6) ◽

pp. e0252392

Author(s):

Jiaojiao Ji ◽

Naipeng Chao ◽

Shitong Wei ◽

George A. Barnett

Keyword(s):

Machine Learning ◽

Social Media ◽

Science Communication ◽

Predictive Power ◽

Genetically Modified ◽

Public Understanding ◽

Supervised Machine Learning ◽

Genetically Modified Food ◽

The Public ◽

Gm Food

The considerable amount of misinformation on social media regarding genetically modified (GM) food will not only hinder public understanding but also mislead the public to make unreasoned decisions. This study discovered a new mechanism of misinformation diffusion in the case of GM food and applied a framework of supervised machine learning to identify effective credibility indicators for the misinformation prediction of GM food. Main indicators are proposed, including user identities involved in spreading information, linguistic styles, and propagation dynamics. Results show that linguistic styles, including sentiment and topics, have the dominant predictive power. In addition, among the user identities, engagement, and extroversion are effective predictors, while reputation has almost no predictive power in this study. Finally, we provide strategies that readers should be aware of when assessing the credibility of online posts and suggest improvements that Weibo can use to avoid rumormongering and enhance the science communication of GM food.

Download Full-text

Building and Annotating a Codeswitched Hate Speech Corpora

International Journal of Information Technology and Computer Science ◽

10.5815/ijitcs.2021.03.03 ◽

2021 ◽

Vol 13 (3) ◽

pp. 33-52

Author(s):

Edward Ombui ◽

◽

Lawrence Muchemi ◽

Peter Wagacha

Keyword(s):

Machine Learning ◽

Social Media ◽

Hate Speech ◽

Empirical Studies ◽

Presidential Campaign ◽

Supervised Machine Learning ◽

Annotation Scheme ◽

Speech Corpora ◽

Duplex Theory ◽

Speech Identification

Presidential campaign periods are a major trigger event for hate speech on social media in almost every country. A systematic review of previous studies indicates inadequate publicly available annotated datasets and hardly any evidence of theoretical underpinning for the annotation schemes used for hate speech identification. This situation stifles the development of empirically useful data for research, especially in supervised machine learning. This paper describes the methodology that was used to develop a multidimensional hate speech framework based on the duplex theory of hate [1] components that include distance, passion, commitment to hate, and hate as a story. Subsequently, an annotation scheme based on the framework was used to annotate a random sample of ~51k tweets from ~400k tweets that were collected during the August and October 2017 presidential campaign period in Kenya. This resulted in a goldstandard codeswitched dataset that could be used for comparative and empirical studies in supervised machine learning. The resulting classifiers trained on this dataset could be used to provide real-time monitoring of hate speech spikes on social media and inform data-driven decision-making by relevant security agencies in government.

Download Full-text

Evolution of corporate reputation during an evolving controversy

Journal of Communication Management ◽

10.1108/jcom-08-2018-0072 ◽

2019 ◽

Vol 23 (1) ◽

pp. 52-71 ◽

Cited By ~ 3

Author(s):

Siyoung Chung ◽

Mark Chong ◽

Jie Sheng Chua ◽

Jin Cheon Na

Keyword(s):

Machine Learning ◽

Social Media ◽

Sentiment Analysis ◽

Corporate Reputation ◽

Supervised Machine Learning ◽

Future Research ◽

Content Type ◽

Twitter Users ◽

Corporate Crisis ◽

The Impact

PurposeThe purpose of this paper is to investigate the evolution of online sentiments toward a company (i.e. Chipotle) during a crisis, and the effects of corporate apology on those sentiments.Design/methodology/approachUsing a very large data set of tweets (i.e. over 2.6m) about Company A’s food poisoning case (2015–2016). This case was selected because it is widely known, drew attention from various stakeholders and had many dynamics (e.g. multiple outbreaks, and across different locations). This study employed a supervised machine learning approach. Its sentiment polarity classification and relevance classification consisted of five steps: sampling, labeling, tokenization, augmentation of semantic representation, and the training of supervised classifiers for relevance and sentiment prediction.FindingsThe findings show that: the overall sentiment of tweets specific to the crisis was neutral; promotions and marketing communication may not be effective in converting negative sentiments to positive sentiments; a corporate crisis drew public attention and sparked public discussion on social media; while corporate apologies had a positive effect on sentiments, the effect did not last long, as the apologies did not remove public concerns about food safety; and some Twitter users exerted a significant influence on online sentiments through their popular tweets, which were heavily retweeted among Twitter users.Research limitations/implicationsEven with multiple training sessions and the use of a voting procedure (i.e. when there was a discrepancy in the coding of a tweet), there were some tweets that could not be accurately coded for sentiment. Aspect-based sentiment analysis and deep learning algorithms can be used to address this limitation in future research. This analysis of the impact of Chipotle’s apologies on sentiment did not test for a direct relationship. Future research could use manual coding to include only specific responses to the corporate apology. There was a delay between the time social media users received the news and the time they responded to it. Time delay poses a challenge to the sentiment analysis of Twitter data, as it is difficult to interpret which peak corresponds with which incident/s. This study focused solely on Twitter, which is just one of several social media sites that had content about the crisis.Practical implicationsFirst, companies should use social media as official corporate news channels and frequently update them with any developments about the crisis, and use them proactively. Second, companies in crisis should refrain from marketing efforts. Instead, they should focus on resolving the issue at hand and not attempt to regain a favorable relationship with stakeholders right away. Third, companies can leverage video, images and humor, as well as individuals with large online social networks to increase the reach and diffusion of their messages.Originality/valueThis study is among the first to empirically investigate the dynamics of corporate reputation as it evolves during a crisis as well as the effects of corporate apology on online sentiments. It is also one of the few studies that employs sentiment analysis using a supervised machine learning method in the area of corporate reputation and communication management. In addition, it offers valuable insights to both researchers and practitioners who wish to utilize big data to understand the online perceptions and behaviors of stakeholders during a corporate crisis.

Download Full-text

Building High Performance Explainable Machine Learning Models for Social Media-based Substance Use Prediction

International Journal of Artificial Intelligence Tools ◽

10.1142/s021821302060009x ◽

2020 ◽

Vol 29 (03n04) ◽

pp. 2060009

Author(s):

Tao Ding ◽

Fatema Hasan ◽

Warren K. Bickel ◽

Shimei Pan

Keyword(s):

Machine Learning ◽

Social Media ◽

Substance Use ◽

Human Behavior ◽

High Performance ◽

Supervised Machine Learning ◽

Learning Models ◽

Wide Range ◽

And Behavior ◽

Machine Learning Models

Social media contain rich information that can be used to help understand human mind and behavior. Social media data, however, are mostly unstructured (e.g., text and image) and a large number of features may be needed to represent them (e.g., we may need millions of unigrams to represent social media texts). Moreover, accurately assessing human behavior is often difficult (e.g., assessing addiction may require medical diagnosis). As a result, the ground truth data needed to train a supervised human behavior model are often difficult to obtain at a large scale. To avoid overfitting, many state-of-the-art behavior models employ sophisticated unsupervised or self-supervised machine learning methods to leverage a large amount of unsupervised data for both feature learning and dimension reduction. Unfortunately, despite their high performance, these advanced machine learning models often rely on latent features that are hard to explain. Since understanding the knowledge captured in these models is important to behavior scientists and public health providers, we explore new methods to build machine learning models that are not only accurate but also interpretable. We evaluate the effectiveness of the proposed methods in predicting Substance Use Disorders (SUD). We believe the methods we proposed are general and applicable to a wide range of data-driven human trait and behavior analysis applications.

Download Full-text

Enabling Rapid Classification of Social Media Communications During Crises

International Journal of Information Systems for Crisis Response and Management ◽

10.4018/ijiscram.2016070101 ◽

2016 ◽

Vol 8 (3) ◽

pp. 1-17 ◽

Cited By ~ 2

Author(s):

Muhammad Imran ◽

Prasenjit Mitra ◽

Jaideep Srivastava

Keyword(s):

Machine Learning ◽

Social Media ◽

Real Time ◽

Poor Performance ◽

Crisis Response ◽

Classification Performance ◽

Sudden Onset ◽

Supervised Machine Learning ◽

Online Information ◽

Real Time Analysis

The use of social media platforms such as Twitter by affected people during crises is considered a vital source of information for crisis response. However, rapid crisis response requires real-time analysis of online information. When a disaster happens, among other data processing techniques, supervised machine learning can help classify online information in real-time. However, scarcity of labeled data causes poor performance in machine training. Often labeled data from past event is available. Can past labeled data be reused to train classifiers? We study the usefulness of labeled data of past events. We observe the performance of our classifiers trained using different combinations of training sets obtained from past disasters. Moreover, we propose two approaches (target labeling and active learning) to boost classification performance of a learning scheme. We perform extensive experimentation on real crisis datasets and show the utility of past-labeled data to train machine learning classifiers to process sudden-onset crisis-related data in real-time.

Download Full-text

Rumor Detection on Twitter Using a Supervised Machine Learning Framework

Research Anthology on Fake News, Political Warfare, and Combatting the Spread of Misinformation ◽

10.4018/978-1-7998-7291-7.ch026 ◽

2021 ◽

pp. 452-465

Author(s):

Hardeo Kumar Thakur ◽

Anand Gupta ◽

Ayushi Bhardwaj ◽

Devanshi Verma

Keyword(s):

Machine Learning ◽

Social Networks ◽

Social Media ◽

Supervised Machine Learning ◽

Learning Framework ◽

Online Social Media ◽

Rumor Propagation ◽

Multiple Classification ◽

The Government ◽

Rumor Detection

This article describes how a rumor can be defined as a circulating unverified story or a doubtful truth. Rumor initiators seek social networks vulnerable to illimitable spread, therefore, online social media becomes their stage. Hence, this misinformation imposes colossal damage to individuals, organizations, and the government, etc. Existing work, analyzing temporal and linguistic characteristics of rumors seems to give ample time for rumor propagation. Meanwhile, with the huge outburst of data on social media, studying these characteristics for each tweet becomes spatially complex. Therefore, in this article, a two-fold supervised machine-learning framework is proposed that detects rumors by filtering and then analyzing their linguistic properties. This method attempts to automate filtering by training multiple classification algorithms with accuracy higher than 81.079%. Finally, using textual characteristics on the filtered data, rumors are detected. The effectiveness of the proposed framework is shown through extensive experiments on over 10,000 tweets.

Download Full-text

Characterizing and Identifying the Prevalence of Web-Based Misinformation Relating to Medication for Opioid Use Disorder: Machine Learning Approach

Journal of Medical Internet Research ◽

10.2196/30753 ◽

2021 ◽

Vol 23 (12) ◽

pp. e30753

Author(s):

Mai ElSherief ◽

Steven A Sumner ◽

Christopher M Jones ◽

Royal K Law ◽

Akadia Kacha-Ochana ◽

...

Keyword(s):

Machine Learning ◽

Social Media ◽

Large Scale ◽

Addiction Treatment ◽

Opioid Use Disorder ◽

Supervised Machine Learning ◽

Computational Techniques ◽

Opioid Use ◽

Web Based ◽

Health Communities

Background Expanding access to and use of medication for opioid use disorder (MOUD) is a key component of overdose prevention. An important barrier to the uptake of MOUD is exposure to inaccurate and potentially harmful health misinformation on social media or web-based forums where individuals commonly seek information. There is a significant need to devise computational techniques to describe the prevalence of web-based health misinformation related to MOUD to facilitate mitigation efforts. Objective By adopting a multidisciplinary, mixed methods strategy, this paper aims to present machine learning and natural language analysis approaches to identify the characteristics and prevalence of web-based misinformation related to MOUD to inform future prevention, treatment, and response efforts. Methods The team harnessed public social media posts and comments in the English language from Twitter (6,365,245 posts), YouTube (99,386 posts), Reddit (13,483,419 posts), and Drugs-Forum (5549 posts). Leveraging public health expert annotations on a sample of 2400 of these social media posts that were found to be semantically most similar to a variety of prevailing opioid use disorder–related myths based on representational learning, the team developed a supervised machine learning classifier. This classifier identified whether a post’s language promoted one of the leading myths challenging addiction treatment: that the use of agonist therapy for MOUD is simply replacing one drug with another. Platform-level prevalence was calculated thereafter by machine labeling all unannotated posts with the classifier and noting the proportion of myth-indicative posts over all posts. Results Our results demonstrate promise in identifying social media postings that center on treatment myths about opioid use disorder with an accuracy of 91% and an area under the curve of 0.9, including how these discussions vary across platforms in terms of prevalence and linguistic characteristics, with the lowest prevalence on web-based health communities such as Reddit and Drugs-Forum and the highest on Twitter. Specifically, the prevalence of the stated MOUD myth ranged from 0.4% on web-based health communities to 0.9% on Twitter. Conclusions This work provides one of the first large-scale assessments of a key MOUD-related myth across multiple social media platforms and highlights the feasibility and importance of ongoing assessment of health misinformation related to addiction treatment.

Download Full-text