Hate or Non-hate: Translation based hate speech identification in Code-Mixed Hinglish data set

Author(s): Shankar Biradar, Sunil Saumya, Arun Chauhan

Author(s): Edward Ombui, Lawrence Muchemi, Peter Wagacha

Presidential campaign periods are a major trigger for hate speech on social media in almost every country. A systematic review of previous studies indicates a shortage of publicly available annotated datasets and hardly any evidence of theoretical underpinning for the annotation schemes used in hate speech identification. This situation stifles the development of empirically useful data for research, especially in supervised machine learning. This paper describes the methodology used to develop a multidimensional hate speech framework based on the components of the duplex theory of hate [1], which include distance, passion, commitment to hate, and hate as a story. An annotation scheme based on this framework was then used to annotate a random sample of ~51k tweets drawn from ~400k tweets collected during the August and October 2017 presidential campaign period in Kenya. The result is a gold-standard code-switched dataset that can be used for comparative and empirical studies in supervised machine learning. Classifiers trained on this dataset could provide real-time monitoring of hate speech spikes on social media and inform data-driven decision-making by relevant security agencies in government.
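The abstract describes a sampling-and-annotation pipeline rather than a specific implementation. The following Python sketch (not the authors' code) only illustrates how a reproducible random sample of tweets might be drawn for manual labeling under a multidimensional scheme with fields for distance, passion, commitment, and hate-as-a-story; the file name, column names, and field types are assumptions.

```python
# Illustrative sketch only: sampling tweets for annotation and representing
# a multidimensional scheme inspired by the duplex theory of hate.
# File name and field names below are assumptions, not the authors' schema.
import csv
import random
from dataclasses import dataclass, asdict

@dataclass
class HateAnnotation:
    tweet_id: str
    text: str
    distance: int = 0     # negation-of-intimacy (distance) component
    passion: int = 0      # passion component
    commitment: int = 0   # commitment-to-hate component
    hate_story: str = ""  # "hate as a story" narrative label

def sample_for_annotation(path="tweets_400k.csv", k=51_000, seed=42):
    """Draw a reproducible random sample of tweets for manual annotation."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    random.seed(seed)
    sample = random.sample(rows, k=min(k, len(rows)))
    return [HateAnnotation(r["id"], r["text"]) for r in sample]

if __name__ == "__main__":
    batch = sample_for_annotation()
    print(len(batch), "tweets queued for annotation")
    print(asdict(batch[0]))
```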


2020, pp. 1-1
Author(s): Muhammad Usman Shahid Khan, Assad Abbas, Attiqa Rehman, Raheel Nawaz

Speech classification plays an important role in many domains, including medicine, voice synthesis, hate speech classification, and other custom applications. Conventional speech processing and classification techniques work on small datasets and yield lower classification accuracy. This paper introduces a neural network (NN) learning model for training on large datasets and classifying speech, based on critical feature analysis of speech spectrogram patterns and waveforms. The performance of the proposed training model was evaluated on a single CPU; it achieved 12-82% accuracy in just 5 epochs while continuously decreasing the loss over successive epochs. The method provides a learning-model framework for speech processing and classification on very large datasets.
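As a rough illustration of the kind of pipeline described above, the sketch below converts waveforms to log-spectrogram features and trains a small neural network for a handful of epochs on a CPU. The architecture, feature settings, and synthetic data are assumptions, not the paper's actual model.

```python
# Hedged sketch: spectrogram features + small NN classifier, a few epochs on CPU.
import numpy as np
from scipy.signal import spectrogram
from sklearn.neural_network import MLPClassifier

def waveform_to_features(wave, fs=16_000):
    """Log-magnitude spectrogram flattened into a fixed-length feature vector."""
    _, _, sxx = spectrogram(wave, fs=fs, nperseg=256, noverlap=128)
    return np.log1p(sxx).flatten()

# Synthetic stand-in data: 100 half-second clips, 4 speech classes.
rng = np.random.default_rng(0)
waves = rng.standard_normal((100, 8_000))
labels = rng.integers(0, 4, size=100)
X = np.stack([waveform_to_features(w, fs=16_000) for w in waves])

# max_iter here plays the role of the 5 training epochs mentioned in the abstract.
clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=5, verbose=True)
clf.fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```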


Sensors, 2021, Vol 21 (23), pp. 7859
Author(s): Fernando H. Calderón, Namrita Balani, Jherez Taylor, Melvyn Peignon, Yen-Hao Huang, ...

The permanent transition to online activity has brought with it a surge in hate speech discourse. This has prompted increased calls for automatic detection methods, most of which currently rely on a dictionary of hate speech words and supervised classification. This approach often falls short when dealing with newer words and phrases produced by online extremist communities. These code words are used with the aim of evading automatic detection systems. Code words are frequently used and have benign meanings in regular discourse; for instance, “skypes”, “googles”, “bing”, and “yahoos” are all examples of words that carry a hidden hate speech meaning. Such overlap presents a challenge to the traditional keyword approach of collecting data specific to hate speech. In this work, we first introduced a word embedding model that learns the hidden hate speech meaning of words. With this insight into code words, we developed a classifier that leverages linguistic patterns to reduce the impact of individual words. The proposed method was evaluated across three different datasets to test its generalizability. The empirical results show that the linguistic patterns approach outperforms the baselines and enables further analysis of hate speech expressions.
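A minimal sketch of the embedding idea, assuming gensim Word2Vec and toy corpora: train separate embeddings on an extremist community's posts and on neutral posts, then compare the nearest neighbours of candidate code words such as "skypes" and "googles". This is not the authors' released code; the corpora, hyperparameters, and comparison heuristic are illustrative assumptions.

```python
# Hedged sketch: community-specific embeddings expose the hidden usage of code words.
from gensim.models import Word2Vec

# Placeholder corpora; in practice these would be tokenized posts from an
# extremist community and from a neutral reference community.
hate_community_posts = [
    ["the", "skypes", "control", "the", "media"],
    ["ban", "the", "googles", "from", "our", "country"],
]
neutral_posts = [
    ["call", "me", "on", "skypes", "later"],
    ["just", "googles", "the", "answer"],
]

hate_model = Word2Vec(hate_community_posts, vector_size=50, window=3, min_count=1, epochs=50)
ref_model = Word2Vec(neutral_posts, vector_size=50, window=3, min_count=1, epochs=50)

# Diverging neighbourhoods across the two corpora flag a candidate code word.
for word in ["skypes", "googles"]:
    print(word, "in hate corpus:", hate_model.wv.most_similar(word, topn=3))
    print(word, "in reference corpus:", ref_model.wv.most_similar(word, topn=3))
```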


2021, Vol 11 (3), pp. 1294
Author(s): Krzysztof Fiok, Waldemar Karwowski, Edgar Gutierrez, Tameika Liciaga, Alessandro Belmonte, ...

Volcanoes of hate and disrespect erupt in societies, often with fatal consequences. To address this negative phenomenon, scientists have struggled to understand and analyze its roots and its linguistic expression, described as hate speech. As a result, it is now possible to automatically detect and counter hate speech in textual data that spreads rapidly, for example, in social media. Recently, however, another approach to tackling the roots of disrespect was proposed: promoting positive behavior instead of only penalizing hate and disrespect. In our study, we followed this approach and discovered that it is hard to find any textual datasets or studies discussing the automatic detection of respectful behaviors and their textual expressions. Therefore, we contribute probably one of the first human-annotated datasets that allows for supervised training of text analysis methods for the automatic detection of respectful messages. By choosing a dataset of tweets that already possessed sentiment annotations, we were also able to discuss the correlation of sentiment and respect. Finally, we provide a comparison of recent machine and deep learning text analysis methods and their performance, which allowed us to demonstrate that automatic detection of respectful messages in social media is feasible.
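To make the setup concrete, here is a hedged baseline sketch, assuming a CSV with hypothetical columns "text", "sentiment" (numeric score), and "respectful" (binary label): it measures the sentiment-respect correlation and trains a simple TF-IDF plus logistic regression classifier for respectful messages. The actual study compares a range of machine and deep learning methods; this is only a minimal stand-in.

```python
# Hedged sketch: sentiment-respect correlation and a baseline respect classifier.
import pandas as pd
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("respect_annotated_tweets.csv")  # hypothetical file and columns

# Correlation between the sentiment score and the binary respect label.
r, p = pearsonr(df["sentiment"], df["respectful"])
print(f"sentiment vs. respect: r={r:.2f}, p={p:.3g}")

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["respectful"], test_size=0.2, random_state=0, stratify=df["respectful"]
)
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```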


Author(s): Edward Ombui, Lawrence Muchemi, Peter Wagacha

This study examines the problem of hate speech identification in code-switched text from social media using a natural language processing approach. It explores different features in training nine models and empirically evaluates their predictiveness in identifying hate speech in a ~50k human-annotated dataset. The study espouses a novel, hierarchical approach to this challenge, employing Latent Dirichlet Allocation to generate topic models that help build a high-level psychosocial feature set that we refer to by the acronym PDC. PDC groups words with similar meanings into word families, which is significant in capturing code-switching during the preprocessing stage for supervised learning models. The high-level PDC features generated are based on a hate speech annotation framework [1] that is largely informed by the duplex theory of hate [2]. Results obtained from frequency-based models using the PDC features on the dataset, comprising tweets generated during the 2012 and 2017 presidential elections in Kenya, indicate an f-score of 83% (precision: 81%, recall: 85%) in identifying hate speech. The study is significant in that, first, it publicly shares a unique code-switched hate speech dataset that is valuable for comparative studies. Second, it provides a methodology for building a novel PDC feature set to identify nuanced forms of hate speech, camouflaged in code-switched data, that conventional methods could not adequately identify.
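A simplified sketch of the topic-model step, assuming scikit-learn's LatentDirichletAllocation: words are assigned to their dominant topic to form crude "word families", and each tweet is represented by family-level counts fed to a frequency-based classifier. The authors' actual PDC construction is not reproduced here; the toy code-switched examples, topic count, and classifier are illustrative assumptions.

```python
# Hedged sketch: LDA topics as word families feeding a frequency-based classifier.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

tweets = [                                # toy code-switched placeholders
    "wale watu ni wezi kabisa",
    "great rally today turnout imefika",
    "hatutaki hao kabila hapa",
]
labels = [1, 0, 1]                        # 1 = hate, 0 = non-hate (toy labels)

vec = CountVectorizer()
counts = vec.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Assign each vocabulary word to its dominant topic -> a crude "word family".
family_of_word = lda.components_.argmax(axis=0)

def family_features(doc_term_matrix):
    """Tweet-level features: token counts aggregated per word family."""
    dense = doc_term_matrix.toarray()
    feats = np.zeros((dense.shape[0], lda.n_components))
    for fam in range(lda.n_components):
        feats[:, fam] = dense[:, family_of_word == fam].sum(axis=1)
    return feats

clf = LogisticRegression().fit(family_features(counts), labels)
print(clf.predict(family_features(counts)))
```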

