scholarly journals Identifying vulgarity in Bengali social media textual content

2021 ◽  
Vol 7 ◽  
pp. e665
Author(s):  
Salim Sazzed

The presence of abusive and vulgar language in social media has become an issue of increasing concern in recent years. However, research pertaining to the prevalence and identification of vulgar language has remained largely unexplored in low-resource languages such as Bengali. In this paper, we provide the first comprehensive analysis on the presence of vulgarity in Bengali social media content. We develop two benchmark corpora consisting of 7,245 reviews collected from YouTube and manually annotate them into vulgar and non-vulgar categories. The manual annotation reveals the ubiquity of vulgar and swear words in Bengali social media content (i.e., in two corpora), ranging from 20% to 34%. To automatically identify vulgarity, we employ various approaches, such as classical machine learning (CML) classifiers, Stochastic Gradient Descent (SGD) optimizer, a deep learning (DL) based architecture, and lexicon-based methods. Although small in size, we find that the swear/vulgar lexicon is effective at identifying the vulgar language due to the high presence of some swear terms in Bengali social media. We observe that the performances of machine leanings (ML) classifiers are affected by the class distribution of the dataset. The DL-based BiLSTM (Bidirectional Long Short Term Memory) model yields the highest recall scores for identifying vulgarity in both datasets (i.e., in both original and class-balanced settings). Besides, the analysis reveals that vulgarity is highly correlated with negative sentiment in social media comments.

Author(s):  
T. V. Divya ◽  
Barnali Gupta Banik

Fake news detection on job advertisements has grabbed the attention of many researchers over past decade. Various classifiers such as Support Vector Machine (SVM), XGBoost Classifier and Random Forest (RF) methods are greatly utilized for fake and real news detection pertaining to job advertisement posts in social media. Bi-Directional Long Short-Term Memory (Bi-LSTM) classifier is greatly utilized for learning word representations in lower-dimensional vector space and learning significant words word embedding or terms revealed through Word embedding algorithm. The fake news detection is greatly achieved along with real news on job post from online social media is achieved by Bi-LSTM classifier and thereby evaluating corresponding performance. The performance metrics such as Precision, Recall, F1-score, and Accuracy are assessed for effectiveness by fraudulency based on job posts. The outcome infers the effectiveness and prominence of features for detecting false news. .


2021 ◽  
Vol 4 (1) ◽  
pp. 121-128
Author(s):  
A Iorliam ◽  
S Agber ◽  
MP Dzungwe ◽  
DK Kwaghtyo ◽  
S Bum

Social media provides opportunities for individuals to anonymously communicate and express hateful feelings and opinions at the comfort of their rooms. This anonymity has become a shield for many individuals or groups who use social media to express deep hatred for other individuals or groups, tribes or race, religion, gender, as well as belief systems. In this study, a comparative analysis is performed using Long Short-Term Memory and Convolutional Neural Network deep learning techniques for Hate Speech classification. This analysis demonstrates that the Long Short-Term Memory classifier achieved an accuracy of 92.47%, while the Convolutional Neural Network classifier achieved an accuracy of 92.74%. These results showed that deep learning techniques can effectively classify hate speech from normal speech.


2018 ◽  
Vol 2018 ◽  
pp. 1-8 ◽  
Author(s):  
Buzhou Tang ◽  
Jianglu Hu ◽  
Xiaolong Wang ◽  
Qingcai Chen

Social media in medicine, where patients can express their personal treatment experiences by personal computers and mobile devices, usually contains plenty of useful medical information, such as adverse drug reactions (ADRs); mining this useful medical information from social media has attracted more and more attention from researchers. In this study, we propose a deep neural network (called LSTM-CRF) combining long short-term memory (LSTM) neural networks (a type of recurrent neural networks) and conditional random fields (CRFs) to recognize ADR mentions from social media in medicine and investigate the effects of three factors on ADR mention recognition. The three factors are as follows: (1) representation for continuous and discontinuous ADR mentions: two novel representations, that is, “BIOHD” and “Multilabel,” are compared; (2) subject of posts: each post has a subject (i.e., drug here); and (3) external knowledge bases. Experiments conducted on a benchmark corpus, that is, CADEC, show that LSTM-CRF achieves better F-score than CRF; “Multilabel” is better in representing continuous and discontinuous ADR mentions than “BIOHD”; both subjects of comments and external knowledge bases are individually beneficial to ADR mention recognition. To the best of our knowledge, this is the first time to investigate deep neural networks to mine continuous and discontinuous ADRs from social media.


Information ◽  
2020 ◽  
Vol 11 (6) ◽  
pp. 312 ◽  
Author(s):  
Asma Baccouche ◽  
Sadaf Ahmed ◽  
Daniel Sierra-Sosa ◽  
Adel Elmaghraby

Identifying internet spam has been a challenging problem for decades. Several solutions have succeeded to detect spam comments in social media or fraudulent emails. However, an adequate strategy for filtering messages is difficult to achieve, as these messages resemble real communications. From the Natural Language Processing (NLP) perspective, Deep Learning models are a good alternative for classifying text after being preprocessed. In particular, Long Short-Term Memory (LSTM) networks are one of the models that perform well for the binary and multi-label text classification problems. In this paper, an approach merging two different data sources, one intended for Spam in social media posts and the other for Fraud classification in emails, is presented. We designed a multi-label LSTM model and trained it on the joint datasets including text with common bigrams, extracted from each independent dataset. The experiment results show that our proposed model is capable of identifying malicious text regardless of the source. The LSTM model trained with the merged dataset outperforms the models trained independently on each dataset.


2021 ◽  
pp. 016555152110077
Author(s):  
Şura Genç ◽  
Elif Surer

Clickbait is a strategy that aims to attract people’s attention and direct them to specific content. Clickbait titles, created by the information that is not included in the main content or using intriguing expressions with various text-related features, have become very popular, especially in social media. This study expands the Turkish clickbait dataset that we had constructed for clickbait detection in our proof-of-concept study, written in Turkish. We achieve a 48,060 sample size by adding 8859 tweets and release a publicly available dataset – ClickbaitTR – with its open-source data analysis library. We apply machine learning algorithms such as Artificial Neural Network (ANN), Logistic Regression, Random Forest, Long Short-Term Memory Network (LSTM), Bidirectional Long Short-Term Memory (BiLSTM) and Ensemble Classifier on 48,060 news headlines extracted from Twitter. The results show that the Logistic Regression algorithm has 85% accuracy; the Random Forest algorithm has a performance of 86% accuracy; the LSTM has 93% accuracy; the ANN has 93% accuracy; the Ensemble Classifier has 93% accuracy; and finally, the BiLSTM has 97% accuracy. A thorough discussion is provided for the psychological aspects of clickbait strategy focusing on curiosity and interest arousal. In addition to a successful clickbait detection performance and the detailed analysis of clickbait sentences in terms of language and psychological aspects, this study also contributes to clickbait detection studies with the largest clickbait dataset in Turkish.


Sign in / Sign up

Export Citation Format

Share Document