Deep Learning for Identification of Alcohol on Social Media: Exploratory Analysis of Alcohol-Related Outcomes from Reddit and Twitter (Preprint)

2021 ◽  
Author(s):  
Benjamin Joseph Ricard ◽  
Saeed Hassanpour

BACKGROUND Many social media studies have explored the ability of thematic structures, such as hashtags and subreddits, to identify information related to a wide variety of mental health disorders. However, studies and models trained on specific themed communities are often difficult to apply to different social media platforms and related outcomes. A deep learning framework using thematic structures from Reddit and Twitter can have distinct advantages for studying alcohol abuse, particularly among youth, in the United States. OBJECTIVE This study proposes a new deep learning pipeline that uses thematic structures to identify alcohol-related content across different platforms. We applied our method on Twitter to determine the association between the prevalence of alcohol-related tweets and alcohol-related outcomes reported by the National Institute on Alcohol Abuse and Alcoholism (NIAAA), the Centers for Disease Control and Prevention's Behavioral Risk Factor Surveillance System (CDC BRFSS), County Health Rankings, and the North American Industry Classification System (NAICS). METHODS A Bidirectional Encoder Representations from Transformers (BERT) neural network learned to classify 1,302,524 Reddit posts as originating from either alcohol-related or control subreddits. The trained model identified 24 alcohol-related hashtags from an unlabeled dataset of 843,769 random tweets. Querying alcohol-related hashtags identified 25,558,846 alcohol-related tweets, including 790,544 location-specific (geotagged) tweets. We calculated the correlation of the prevalence of alcohol-related tweets with alcohol-related outcomes, controlling for confounding effects of age, sex, income, education, and self-reported race, as recorded by the 2013-2018 American Community Survey (ACS). RESULTS Here, we present a novel natural language processing pipeline, developed using Reddit's alcohol-related subreddits, that identifies highly specific alcohol-related Twitter hashtags. 
The prevalence of the identified hashtags contains interpretable information about alcohol consumption at both coarse (e.g., US state) and fine-grained (e.g., MMSA, county) geographic designations. CONCLUSIONS This approach can expand research and interventions on alcohol abuse and other behavioral health outcomes.
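The study's correlation step can be illustrated with a minimal sketch: computing the Pearson correlation between per-region tweet prevalence and an outcome rate. The figures below are hypothetical placeholders, not the study's data, and the pure-Python `pearson_r` is a stand-in for the authors' statistical analysis (which also controls for ACS covariates, omitted here).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-state figures: prevalence of alcohol-related tweets
# (per 1,000 geotagged tweets) vs. an alcohol-related outcome rate.
tweet_prevalence = [4.1, 6.3, 5.2, 7.8, 3.9]
outcome_rate     = [10.2, 14.8, 12.1, 16.5, 9.7]

print(round(pearson_r(tweet_prevalence, outcome_rate), 3))
```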

2021 ◽  
Author(s):  
Abul Hasan ◽  
Mark Levene ◽  
David Weston ◽  
Renate Fromson ◽  
Nicolas Koslover ◽  
...  

BACKGROUND The COVID-19 pandemic has created a pressing need for integrating information from disparate sources, in order to assist decision makers. Social media is important in this respect; however, to make sense of the textual information it provides and be able to automate the processing of large amounts of data, natural language processing methods are needed. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. In particular, machine learning techniques for triage and diagnosis could allow for a better understanding of what social media may offer in this respect. OBJECTIVE This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and other interested parties with additional information on the symptoms, severity and prevalence of the disease. METHODS The text processing pipeline first extracts COVID-19 symptoms and related concepts such as severity, duration, negations, and body parts from patients’ posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are applied separately to build support vector machine learning models to triage patients into three categories and diagnose them for COVID-19. RESULTS We report macro- and micro-averaged F1 scores in the ranges of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19 when the models are trained on human-labelled data. 
Our experimental results indicate that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. Also, we highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones. CONCLUSIONS Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline, in order to provide additional information on the severity and prevalence of the disease through the eyes of social media.
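The rule-based relation step can be sketched as a nearest-concept heuristic: attach each modifier (negation, severity) to the closest extracted symptom within a token window. This is an illustrative assumption about what such rules might look like, not the paper's actual algorithm; `link_modifiers`, the `WINDOW` size, and the example concepts are all hypothetical.

```python
# Toy rule: attach each modifier (negation/severity) to the nearest
# symptom concept within a fixed token-distance window.
WINDOW = 5

def link_modifiers(concepts):
    """concepts: list of (position, type, text) tuples, e.g.
    (0, 'negation', 'no') or (1, 'symptom', 'fever')."""
    symptoms = [c for c in concepts if c[1] == "symptom"]
    relations = []
    for pos, ctype, text in concepts:
        if ctype == "symptom":
            continue
        # Closest symptom by absolute token distance, if any exists.
        nearest = min(symptoms, key=lambda s: abs(s[0] - pos), default=None)
        if nearest is not None and abs(nearest[0] - pos) <= WINDOW:
            relations.append((text, ctype, nearest[2]))
    return relations

# "no fever but severe cough" -> concepts at token positions
concepts = [(0, "negation", "no"), (1, "symptom", "fever"),
            (3, "severity", "severe"), (4, "symptom", "cough")]
print(link_modifiers(concepts))
```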


2021 ◽  
Vol 6 (1) ◽  
Author(s):  
Solomon Akinboro ◽  
Oluwadamilola Adebusoye ◽  
Akintoye Onamade

Offensive content refers to messages that are socially unacceptable, including vulgar or derogatory messages. As the use of social media increases worldwide, social media administrators face the challenge of tackling offensive content to ensure clean, non-abusive conversations on the platforms they provide. This work organizes and describes recent techniques for the automated detection of offensive language in social media content, providing a structured overview of previous approaches, including the algorithms, methods, and main features used. Studies were selected from peer-reviewed articles on Google Scholar. Search terms included: profane words, natural language processing, multilingual context, hybrid methods for detecting profane words, and deep learning approaches for detecting profane words. Exclusions were made based on predefined criteria. The initial search returned 203 studies, of which 40 met the inclusion criteria: 6 were on natural language processing, 6 on deep learning approaches, 5 analysed hybrid approaches, 13 addressed multi-level or multilingual classification, and 10 covered other related methods. The limitations of previous efforts to detect offensive content are highlighted to aid future research in this area. Keywords: algorithm, offensive content, profane words, social media, texts


2017 ◽  
Vol 24 (4) ◽  
pp. 813-821 ◽  
Author(s):  
Anne Cocos ◽  
Alexander G Fiks ◽  
Aaron J Masino

Abstract Objective Social media is an important pharmacovigilance data source for adverse drug reaction (ADR) identification. Human review of social media data is infeasible due to data quantity, thus natural language processing techniques are necessary. Social media includes informal vocabulary and irregular grammar, which challenge natural language processing methods. Our objective is to develop a scalable, deep-learning approach that exceeds state-of-the-art ADR detection performance in social media. Materials and Methods We developed a recurrent neural network (RNN) model that labels words in an input sequence with ADR membership tags. The only input features are word-embedding vectors, which can be formed through task-independent pretraining or during ADR detection training. Results Our best-performing RNN model used pretrained word embeddings created from a large, non–domain-specific Twitter dataset. It achieved an approximate match F-measure of 0.755 for ADR identification on the dataset, compared to 0.631 for a baseline lexicon system and 0.65 for the state-of-the-art conditional random field model. Feature analysis indicated that semantic information in pretrained word embeddings boosted sensitivity and, combined with contextual awareness captured in the RNN, precision. Discussion Our model required no task-specific feature engineering, suggesting generalizability to additional sequence-labeling tasks. Learning curve analysis showed that our model reached optimal performance with fewer training examples than the other models. Conclusions ADR detection performance in social media is significantly improved by using a contextually aware model and word embeddings formed from large, unlabeled datasets. The approach reduces manual data-labeling requirements and is scalable to large social media datasets.
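The approximate-match criterion used in the evaluation above can be made concrete: a predicted ADR span counts as a hit if it overlaps any gold span, rather than matching its boundaries exactly. The sketch below is an illustrative assumption about that metric; the `approx_f1` helper and the span encoding are not the paper's evaluation code.

```python
def approx_f1(gold, pred):
    """Approximate-match F1: a predicted span is correct if it overlaps
    any gold span (and a gold span is recalled if any prediction
    overlaps it). Spans are (start, end) token offsets, end exclusive."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    tp_pred = sum(1 for p in pred if any(overlaps(p, g) for g in gold))
    tp_gold = sum(1 for g in gold if any(overlaps(g, p) for p in pred))
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(2, 4), (7, 9)]
pred = [(3, 5), (10, 12)]  # first prediction overlaps a gold span
print(round(approx_f1(gold, pred), 3))
```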


Author(s):  
Suvigya Jain

Abstract: The stock market has always been one of the most active fields of research; many companies and organizations have focused their efforts on finding better ways to predict market trends. The stock market is an instrument for measuring the performance of a company, and many have tried to develop methods that reduce risk for investors. Modern computing has made it possible to implement concepts like deep learning and natural language processing, leading to a revolution in forecasting market trends. Moreover, the democratization of knowledge about companies, made possible by the internet, has given stakeholders a means to learn about the assets they choose to invest in through news media and social media, while stock trading has become easier thanks to apps like Robinhood. Nowadays, every company has some kind of social media presence or is regularly covered by news media. This presence can foster growth by creating positive sentiment, but it can also cause losses by creating negative sentiment around public events. Our goal in this paper is to study the influence of news media and social media on market trends using sentiment analysis. Keywords: Deep Learning, Natural Language Processing, Stock Market, Sentiment analysis
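A minimal illustration of the headline-level sentiment scoring such a study might start from, using a tiny hand-built lexicon. The `POSITIVE`/`NEGATIVE` word lists and `headline_sentiment` are hypothetical stand-ins; the deep learning models the paper refers to are far richer.

```python
# Toy financial-sentiment lexicon (illustrative only).
POSITIVE = {"beat", "growth", "record", "surge", "profit"}
NEGATIVE = {"miss", "loss", "lawsuit", "recall", "decline"}

def headline_sentiment(text):
    """Score a headline in [-1, 1]: (positive - negative) / matched terms."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(headline_sentiment("Company posts record profit, beats forecasts"))
```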


The spread of fake news on online social media is a major public nuisance, and there is no state-of-the-art tool that can automatically detect whether a news item is fake or genuine. Hence, this paper analyses online social media and news feeds for the detection of fake news. The work proposes a solution using natural language processing and deep learning techniques for detecting fake news in online social media.


10.2196/20794 ◽  
2020 ◽  
Vol 6 (3) ◽  
pp. e20794
Author(s):  
Tim Ken Mackey ◽  
Jiawei Li ◽  
Vidya Purushothaman ◽  
Matthew Nali ◽  
Neal Shah ◽  
...  

Background The coronavirus disease (COVID-19) pandemic is perhaps the greatest global health challenge of the last century. Accompanying this pandemic is a parallel “infodemic,” including the online marketing and sale of unapproved, illegal, and counterfeit COVID-19 health products including testing kits, treatments, and other questionable “cures.” Enabling the proliferation of this content is the growing ubiquity of internet-based technologies, including popular social media platforms that now have billions of global users. Objective This study aims to collect, analyze, identify, and enable reporting of suspected fake, counterfeit, and unapproved COVID-19–related health care products from Twitter and Instagram. Methods This study is conducted in two phases beginning with the collection of COVID-19–related Twitter and Instagram posts using a combination of web scraping on Instagram and filtering the public streaming Twitter application programming interface for keywords associated with suspect marketing and sale of COVID-19 products. The second phase involved data analysis using natural language processing (NLP) and deep learning to identify potential sellers that were then manually annotated for characteristics of interest. We also visualized illegal selling posts on a customized data dashboard to enable public health intelligence. Results We collected a total of 6,029,323 tweets and 204,597 Instagram posts filtered for terms associated with suspect marketing and sale of COVID-19 health products from March to April for Twitter and February to May for Instagram. After applying our NLP and deep learning approaches, we identified 1271 tweets and 596 Instagram posts associated with questionable sales of COVID-19–related products. Generally, product introduction came in two waves, with the first consisting of questionable immunity-boosting treatments and a second involving suspect testing kits. 
We also detected a low volume of pharmaceuticals that have not been approved for COVID-19 treatment. Other major themes detected included products offered in different languages, various claims of product credibility, completely unsubstantiated products, unapproved testing modalities, and different payment and seller contact methods. Conclusions Results from this study provide initial insight into one front of the “infodemic” fight against COVID-19 by characterizing what types of health products, selling claims, and types of sellers were active on two popular social media platforms at earlier stages of the pandemic. This cybercrime challenge is likely to continue as the pandemic progresses and more people seek access to COVID-19 testing and treatment. This data intelligence can help public health agencies, regulatory authorities, legitimate manufacturers, and technology platforms better remove and prevent this content from harming the public.
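The keyword-filtering phase of such a pipeline can be sketched as a simple term match over a post stream. The `SUSPECT_TERMS` set and post structure below are illustrative assumptions, not the study's actual keyword list or the Twitter/Instagram data schema.

```python
# Hypothetical terms associated with suspect COVID-19 product marketing.
SUSPECT_TERMS = {"covid test kit", "corona cure", "immunity booster"}

def flag_suspect_posts(posts):
    """Yield posts whose text contains any suspect marketing term."""
    for post in posts:
        text = post["text"].lower()
        if any(term in text for term in SUSPECT_TERMS):
            yield post

posts = [
    {"id": 1, "text": "Buy your at-home COVID test kit now, DM me!"},
    {"id": 2, "text": "Stay home and wash your hands."},
]
print([p["id"] for p in flag_suspect_posts(posts)])
```

Downstream, flagged posts would go to the NLP/deep learning models and manual annotation described in the Methods, which this sketch does not attempt to reproduce.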



Author(s):  
Sarojini Yarramsetti ◽  
Anvar Shathik J ◽  
Renisha. P.S.

In today's digital world, sharing experiences, exploring knowledge, and posting thoughts are common activities for every individual, and social media networks such as Facebook and Twitter play a vital role in them. Many approaches to extracting sentiment features from social networks exist, and researchers have worked in this domain for years, but that work has been narrowed to estimating the opinions and sentiments expressed in the text of tweets and posts users publish on social networks or related web media. Many social networks also allow users to post voice tweets and voice messages, and these may contain harmful as well as normal and important content. This paper proposes a new methodology, the Intensive Deep Learning based Voice Estimation Principle (IDLVEP), which identifies the content of voice messages and extracts features based on natural language processing (NLP). Combining deep learning with NLP yields an efficient data-processing model for identifying sentiment features on social networking media, supporting sentiment-feature estimation for both text-based and voice-based tweets. NLP assists IDLVEP by extracting the voice content from an input message and producing raw text; based on that text, deep learning classifies messages as harmful or normal. 
User tweets are first divided into two categories, voice tweets and text tweets: NLP principles handle the transcription of voice tweets, and deep learning principles classify both the transcribed voice tweets and the text tweets. Social networks have two faces, supporting development while also providing an avenue for harmful activity; IDLVEP therefore identifies harmful content in user tweets and removes it intelligently using the proposed classification strategies. This paper concentrates on identifying sentiment features from user tweets to provide a harm-free social network environment for society.
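The voice/text routing described above can be sketched as a small dispatch function. `route_tweet` and the stand-in `transcribe`/`classify` callables are hypothetical illustrations of the pipeline's shape, not the IDLVEP implementation.

```python
def route_tweet(tweet, transcribe, classify):
    """Toy IDLVEP-style dispatch: voice tweets are transcribed to text
    first, then every tweet's text is classified as 'harmful' or
    'normal'. `transcribe` and `classify` are stand-ins for the paper's
    NLP and deep learning components."""
    text = transcribe(tweet["audio"]) if tweet["kind"] == "voice" else tweet["text"]
    return classify(text)

# Stand-in components for illustration only.
fake_transcribe = lambda audio: audio.decode()  # pretend speech-to-text
fake_classify = lambda text: "harmful" if "attack" in text else "normal"

print(route_tweet({"kind": "voice", "audio": b"plan the attack"},
                  fake_transcribe, fake_classify))
print(route_tweet({"kind": "text", "text": "lovely day"},
                  fake_transcribe, fake_classify))
```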

