Sinhala Hate Speech Detection in Social Media using Text Mining and Machine learning

Social media services make it possible for an increasing number of people to express their opinion publicly. In this context, large amounts of hateful comments are published daily. The PHARM project aims at monitoring and modeling hate speech against refugees and migrants in Greece, Italy, and Spain. To this end, a web interface for creating and querying a multi-source database of hate speech-related content is implemented and evaluated. The selected sources include Twitter, YouTube, and Facebook comments and posts, as well as comments and articles from a selected list of websites. The interface allows users to search the existing database, scrape social media using keywords, annotate records through a dedicated platform, and contribute new content to the database. Furthermore, functionality for hate speech detection and sentiment analysis of texts is provided, making use of novel methods and machine learning models. The interface can be accessed online through a graphical user interface compatible with modern internet browsers. For the evaluation of the interface, a multifactor questionnaire was formulated to record the users’ opinions about the web interface and the corresponding functionality.

Hate Speech Detection Using Text Mining and Machine Learning

International Journal of Decision Support System Technology ◽

10.4018/ijdsst.286680 ◽

2022 ◽

Vol 14 (1) ◽

pp. 0-0

Keyword(s):

Machine Learning ◽

Text Mining ◽

Hate Speech ◽

Confusion Matrix ◽

Data Sets ◽

Speech Detection ◽

Machine Learning Classification ◽

Legal Implications ◽

Real People ◽

Violent Acts

Automatic hate speech detection on social media is becoming a major concern in modern countries. Indeed, hate speech towards people brings about violent acts and social chaos; hence the law prohibits it, and it carries moral and legal implications. It is crucial to distinguish hate speech from non-hate speech automatically and precisely, since this makes it easier to identify real people who represent a threat to society, as well as those who are wrongly regarded as hateful speakers. In this paper, we applied a complete text mining process and the Naïve Bayes machine learning classification algorithm to two different data sets (tweets_Num1 and tweets_Num2) taken from Twitter, to better classify tweets. The results show that our model performed well on several metrics derived from the confusion matrix, including accuracy, which reached 87.23% on the first dataset and 93.06% on the second.
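
As a rough illustration of the kind of pipeline this abstract describes, the sketch below trains a multinomial Naïve Bayes classifier with add-one smoothing and derives accuracy from the confusion matrix. It is not the authors' implementation: the tokenizer, the smoothing choice, the label names, and the toy sentences are all assumptions made for the example, and the data sets tweets_Num1 and tweets_Num2 are not reproduced here.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumed minimal preprocessing: lowercase and split on whitespace.
    return text.lower().split()

def train_naive_bayes(samples):
    """samples: list of (text, label) pairs. Returns the fitted counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def predict(model, text):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior plus sum of log likelihoods with add-one (Laplace) smoothing.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def confusion_matrix(model, samples, positive="hate"):
    # Returns (tp, fp, tn, fn) with `positive` as the positive class.
    tp = fp = tn = fn = 0
    for text, gold in samples:
        pred = predict(model, text)
        if pred == positive:
            if gold == positive:
                tp += 1
            else:
                fp += 1
        else:
            if gold == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical toy data, for illustration only.
TRAIN = [
    ("i hate you people", "hate"),
    ("you people are awful", "hate"),
    ("have a nice day", "neutral"),
    ("what a nice game", "neutral"),
]
TEST = [
    ("i hate awful people", "hate"),
    ("nice day for a game", "neutral"),
]

model = train_naive_bayes(TRAIN)
print(accuracy(*confusion_matrix(model, TEST)))
```

Precision, recall, and F1 can be computed from the same four counts, which is why the confusion matrix is a convenient single source for all the reported metrics.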

Advances in Machine Learning Algorithms for Hate Speech Detection in Social Media: A Review

IEEE Access ◽

10.1109/access.2021.3089515 ◽

2021 ◽

pp. 1-1

Author(s):

Nanlir Sallau Mullah ◽

Wan Mohd Nazmee Wan Zainon

Keyword(s):

Machine Learning ◽

Social Media ◽

Hate Speech ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Speech Detection

Lexicon-Based Indonesian Local Language Abusive Words Dictionary to Detect Hate Speech in Social Media

Journal of Information Systems Engineering and Business Intelligence ◽

10.20473/jisebi.6.1.9-17 ◽

2020 ◽

Vol 6 (1) ◽

pp. 9

Author(s):

Mardhiya Hayaty ◽

Sumarni Adi ◽

Anggit Dwi Hartanto

Keyword(s):

Machine Learning ◽

Social Media ◽

Random Sampling ◽

Hate Speech ◽

Sampling Technique ◽

Stratified Random Sampling ◽

Speech Detection ◽

Or Groups ◽

Machine Learning Approach ◽

Local Languages

Background: Hate speech is an expression directed at a person or a group of people that conveys hatred and/or anger toward them. On social media, users are free to express themselves by writing harsh words and sharing them with a group of people, which triggers divisions and conflicts between groups. Existing research on detecting hate speech in social media follows two approaches, machine learning-based and lexicon-based. The machine learning approach has a weakness: the manual labelling process, in which an annotator separates positive, negative, and neutral opinions, is time-consuming and tiring.

Objective: This study aims to produce a dictionary containing abusive words from local languages in Indonesia. The lexicon-based approach depends heavily on the language of the words in the dictionary. Indonesia has thousands of tribes with 2500 local languages, and 80% of the population of Indonesia uses local languages in communication, which makes detecting hate speech on social media a significant challenge.

Methods: Surveys of abusive words were conducted using proportionate stratified random sampling in 4 major tribes on the island of Java, namely Betawi, Sundanese, Javanese, and Madurese.

Results: The experiments produced a dictionary of 250 abusive words from 4 major Indonesian tribes for detecting hate speech in Indonesian social media using the lexicon-based approach.

Conclusion: A stratified random sampling technique was applied in 4 major Indonesian tribes to produce 250 abusive words for hate speech detection using the lexicon-based approach.
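
A lexicon-based detector of the kind described here can be sketched in a few lines. The example below is only an assumption about how such a dictionary might be applied: the lexicon entries are placeholders (the study's 250-word dictionary is not reproduced), and the function names and the `min_hits` threshold are hypothetical.

```python
import re

# Placeholder entries only; stand-ins for a real abusive-word dictionary.
ABUSIVE_LEXICON = {"badword1", "badword2"}

def find_abusive_terms(text, lexicon=ABUSIVE_LEXICON):
    # Lowercase the text, split it into word tokens, and keep lexicon matches.
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok in lexicon]

def is_flagged(text, lexicon=ABUSIVE_LEXICON, min_hits=1):
    # Flag a text when it contains at least `min_hits` dictionary words.
    return len(find_abusive_terms(text, lexicon)) >= min_hits

print(find_abusive_terms("You BadWord1, really!"))
print(is_flagged("good morning"))
```

Because the match is a plain dictionary lookup, coverage depends entirely on the words in the lexicon, which is the dependence on dictionary language that the abstract points out.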

Automatic Hate Speech Detection in English-Odia Code Mixed Social Media Data Using Machine Learning Techniques

Applied Sciences ◽

10.3390/app11188575 ◽

2021 ◽

Vol 11 (18) ◽

pp. 8575

Author(s):

Sudhir Kumar Mohapatra ◽

Srinivas Prasad ◽

Dwiti Krishna Bebarta ◽

Tapan Kumar Das ◽

Kathiravan Srinivasan ◽

...

Keyword(s):

Machine Learning ◽

Social Media ◽

Hate Speech ◽

Learning Algorithm ◽

Machine Learning Techniques ◽

Mixed Data ◽

Support Vector ◽

Speech Detection ◽

Detection Model ◽

Feature Based

Hate speech on social media may spread quickly through online users and may subsequently even escalate into local violence and heinous crimes. This paper proposes a hate speech detection model based on machine learning and text mining feature extraction techniques. In this study, the authors collected English-Odia code-mixed hate speech data from a public Facebook page and manually organized it into three classes. To build both binary and ternary datasets, the data are further converted into binary classes. The modeling of hate speech combines a machine learning algorithm with feature extraction. Support vector machine (SVM), naïve Bayes (NB) and random forest (RF) models were trained using the whole dataset, with features extracted as word unigrams, bigrams, trigrams, combined n-grams, term frequency-inverse document frequency (TF-IDF), combined n-grams weighted by TF-IDF, and word2vec, for both datasets. Using the two datasets, we developed two kinds of models with each feature: binary models and ternary models. The models based on SVM with word2vec achieved better performance than the NB and RF models for both the binary and ternary categories. The results reveal that the ternary models showed less confusion between hate and non-hate speech than the binary models.
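
The "combined n-grams weighted by TF-IDF" feature mentioned above can be illustrated with a small standard-library sketch. The weighting used here, relative term frequency times log(N / document frequency), is one common TF-IDF variant and is an assumption; the abstract does not state which formula or tokenizer the authors used, and the function names are hypothetical.

```python
import math
from collections import Counter

def word_ngrams(text, n_min=1, n_max=3):
    # Word-level n-grams from n_min to n_max, joined with spaces.
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def tfidf_vectors(docs, n_min=1, n_max=3):
    # One sparse vector (dict) per document: gram -> tf * idf.
    doc_grams = [Counter(word_ngrams(d, n_min, n_max)) for d in docs]
    doc_freq = Counter()
    for grams in doc_grams:
        doc_freq.update(grams.keys())  # count each gram once per document
    n_docs = len(docs)
    vectors = []
    for grams in doc_grams:
        total = sum(grams.values())
        vectors.append({
            gram: (count / total) * math.log(n_docs / doc_freq[gram])
            for gram, count in grams.items()
        })
    return vectors

print(word_ngrams("a b c", 1, 2))
print(tfidf_vectors(["good day", "bad day"], 1, 1))
```

A gram that occurs in every document gets weight zero under this variant, so the vectors emphasise the n-grams that distinguish one comment from another before they are passed to a classifier such as an SVM.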

Review Paper on Hate Speech Detection

Engineering and Technology Journal ◽

10.47191/etj/v6i12.05 ◽

2021 ◽

Vol 06 (12) ◽

Author(s):

Dr Ramakrishna Hegde ◽

Keyword(s):

Mental Health ◽

Machine Learning ◽

Social Media ◽

Deep Learning ◽

Review Paper ◽

Hate Speech ◽

Anxiety And Depression ◽

Speech Detection ◽

Speech Content ◽

The Way

This is a review paper on the topic “Hate Speech Detection”. One of the main disadvantages of social media is the way it is used to spread hate. This hate can affect an individual or a group in different ways, such as degrading their mental health and leading to anxiety and depression. This can lead to suicide or homicide. It is therefore very important to control how a platform can be used to spread a particular message. To do this, hate speech content has to be identified automatically, which can be done with the help of machine learning and deep learning techniques. We have reviewed a few papers that deal with different methodologies for detecting hate speech in a given text.

Bangla hate speech detection on social media using attention-based recurrent neural network

Journal of Intelligent Systems ◽

10.1515/jisys-2020-0060 ◽

2021 ◽

Vol 30 (1) ◽

pp. 578-591

Author(s):

Amit Kumar Das ◽

Abdullah Al Asif ◽

Anik Paul ◽

Md. Nur Hossain

Keyword(s):

Neural Network ◽

Machine Learning ◽

Social Media ◽

Hate Speech ◽

Negative Aspect ◽

Speech Detection ◽

Use Of Technology ◽

Machine Learning Model ◽

Bengali Language ◽

Facebook Pages

Hate speech has spread more rapidly through the daily use of technology and, most notably, through people sharing opinions or feelings on social media in a negative way. Although numerous works have addressed the detection of hate speech in English, German, and other languages, very few have been carried out in the context of the Bengali language, even though millions of people communicate on social media in Bengali. The few existing works need improvements in both accuracy and interpretability. This article proposes an encoder–decoder-based machine learning model, a popular tool in NLP, to classify users’ Bengali comments from Facebook pages. A dataset of 7,425 Bengali comments, consisting of seven distinct categories of hate speech, was used to train and evaluate the model. For extracting and encoding local features from the comments, 1D convolutional layers were used. Finally, attention-based, LSTM-based, and GRU-based decoders were used for predicting hate speech categories. Among the three encoder–decoder algorithms, the attention-based decoder obtained the best accuracy (77%).

YouTube based religious hate speech and extremism detection dataset with machine learning baselines

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-219264 ◽

2021 ◽

pp. 1-9

Author(s):

Noman Ashraf ◽

Abid Rafiq ◽

Sabur Butt ◽

Hafiz Muhammad Faisal Shehzad ◽

Grigori Sidorov ◽

...

Keyword(s):

Machine Learning ◽

Social Networking ◽

Social Networking Sites ◽

Nearest Neighbor ◽

Hate Speech ◽

Support Vector ◽

K Nearest Neighbor ◽

Speech Detection ◽

Supervised Learning Algorithms ◽

Youtube Videos

On YouTube, billions of videos are watched online and millions of short messages are posted each day. YouTube, along with other social networking sites, is used by individuals and extremist groups to spread hatred among users. In this paper, we consider religion as the most targeted domain for spreading hate speech among people of different religions. We present a methodology for the detection of religion-based hate videos on YouTube. Messages posted on YouTube videos generally express users’ opinions related to that video. We provide a novel dataset for religious hate speech detection on YouTube comments. The proposed methodology applies data mining techniques to comments extracted from religious videos in order to filter religion-oriented messages and detect the videos that are used for spreading hate. The supervised learning algorithms Support Vector Machine (SVM), Logistic Regression (LR), and k-Nearest Neighbor (k-NN) are used for baseline results.
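
Of the three baselines named above, k-NN is the simplest to sketch without external libraries. The example below is a minimal illustration, assuming bag-of-words vectors, cosine similarity, and majority voting; the abstract does not specify the features or distance measure the authors used, and the comments and label names are invented placeholders rather than items from their dataset.

```python
import math
from collections import Counter

def bow(text):
    # Bag-of-words counts over lowercase whitespace tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_predict(train, text, k=3):
    # Rank training comments by similarity and vote among the k nearest.
    query = bow(text)
    scored = sorted(
        ((cosine(query, bow(t)), label) for t, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy comments, for illustration only.
TRAIN = [
    ("those people should be banned", "hate"),
    ("ban those people now", "hate"),
    ("lovely sermon thank you", "not_hate"),
    ("thank you for sharing", "not_hate"),
    ("peaceful and lovely video", "not_hate"),
]

print(knn_predict(TRAIN, "those people again"))
print(knn_predict(TRAIN, "thank you lovely video"))
```

A video-level decision could then be built on top of the per-comment predictions, for instance by looking at the share of flagged comments per video, although the abstract does not say how the authors aggregate.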

Application of Machine Learning Techniques for Hate Speech Detection in Mobile Applications

2018 International Conference on Information Technologies (InfoTech) ◽

10.1109/infotech.2018.8510738 ◽

2018 ◽

Cited By ~ 2

Author(s):

Bujar Raufi ◽

Ildi Xhaferri

Keyword(s):

Machine Learning ◽

Mobile Applications ◽

Hate Speech ◽

Machine Learning Techniques ◽

Speech Detection ◽

Learning Techniques

Online Multilingual Hate Speech Detection: Experimenting with Hindi and English Social Media

10.20944/preprints202011.0646.v1 ◽

2020 ◽

Author(s):

Neeraj Vashistha ◽

Arkaitz Zubiaga

Keyword(s):

Social Media ◽

Hate Speech ◽

Model Performance ◽

Academic Community ◽

Human Interaction ◽

Superior Performance ◽

Competitive Performance ◽

Speech Detection ◽

Improve Model ◽

Use Of The Internet

The exponential increase in the use of the Internet and social media over the last two decades has changed human interaction. This has led to many positive outcomes, but at the same time it has brought risks and harms. Because the volume of harmful content online, such as hate speech, is not manageable by humans, interest in the academic community in investigating automated means for hate speech detection has increased. In this study, we analyse six publicly available datasets by combining them into a single homogeneous dataset and classifying the records into three classes: abusive, hateful, or neither. We create a baseline model and improve model performance scores using various optimisation techniques. After attaining a competitive performance score, we create a tool which identifies and scores a page with an effective metric in near-real time and uses the result as feedback to re-train our model. We demonstrate the competitive performance of our multilingual model on two languages, English and Hindi, with comparable or superior performance to most monolingual models.
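
Combining several datasets into one homogeneous dataset usually requires mapping each source's label scheme onto the shared classes. The sketch below shows one way this could look for the three target classes named in the abstract; the source names, raw labels, and mappings are hypothetical, since the abstract does not list the six datasets or their original annotation schemes.

```python
# Hypothetical per-source label mappings onto the three shared classes.
LABEL_MAPS = {
    "dataset_a": {"offensive": "abusive", "hate": "hateful", "none": "neither"},
    "dataset_b": {"1": "abusive", "2": "hateful", "0": "neither"},
}

def harmonize(records_by_source, label_maps=LABEL_MAPS):
    # Merge all sources into one list of records with a unified label field.
    merged = []
    for source, records in records_by_source.items():
        mapping = label_maps[source]
        for text, raw_label in records:
            merged.append({
                "text": text,
                "label": mapping[raw_label],
                "source": source,  # kept so per-source results stay traceable
            })
    return merged

# Placeholder records, for illustration only.
raw = {
    "dataset_a": [("example text one", "hate"), ("example text two", "none")],
    "dataset_b": [("example text three", "1")],
}

for row in harmonize(raw):
    print(row)
```

Keeping the `source` field makes it possible to check later whether the combined model performs evenly across the original datasets and languages.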
