Psychosocial Features for Identifying Hate Speech in Social Media Text

Author(s):  
Edward Ombui ◽  
Lawrence Muchemi ◽  
Peter Wagacha

This study uses natural language processing to identify hate speech in codeswitched social media text. It trains nine models and tests how well they recognize hate speech in a human-annotated dataset of 50k tweets. The article proposes a novel hierarchical approach that uses Latent Dirichlet Allocation to develop topic models, which in turn help build a high-level psychosocial feature set called PDC. PDC organizes words into word families, which helps capture codeswitching during preprocessing for supervised learning models. The PDC features are based on a hate speech annotation framework informed by the duplex theory of hate. Frequency-based models using the PDC features on tweets from the 2012 and 2017 Kenyan presidential elections yielded an f-score of 83 percent (precision: 81 percent, recall: 85 percent) in recognizing hate speech. The study is notable for two reasons. First, it publicly shares a rich codeswitched dataset for comparative studies. Second, it describes how to create a novel PDC feature set to detect subtle types of hate speech hidden in codeswitched data that previous approaches could not detect.
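The idea of grouping words into families before counting them can be illustrated with a minimal Python sketch. The family names, the English and Swahili entries, and the `family_counts` helper are all hypothetical and chosen only for illustration; they are not taken from the study's PDC lexicon or pipeline.

```python
import re
from collections import Counter

# Hypothetical word families: each family groups surface forms (English and
# Swahili here) that carry a similar meaning. Illustrative entries only,
# not the study's actual PDC lexicon.
WORD_FAMILIES = {
    "leave": {"leave", "go", "ondoka", "toka"},
    "thief": {"thief", "thieves", "mwizi", "wezi"},
}

# Invert the mapping once so each token resolves to its family directly.
TOKEN_TO_FAMILY = {
    token: family
    for family, tokens in WORD_FAMILIES.items()
    for token in tokens
}


def family_counts(text: str) -> Counter:
    """Count how often each word family occurs in a piece of text."""
    # Lowercase and keep alphabetic runs so punctuation does not block a match.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(
        TOKEN_TO_FAMILY[token] for token in tokens if token in TOKEN_TO_FAMILY
    )


# A codeswitched sentence: both languages contribute to the same families.
features = family_counts("Wezi hawa must go, ondoka sasa")
print(features)
```

Because "go" and "ondoka" map to the same family, a frequency-based model sees one feature regardless of which language the writer switched into.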

Author(s):  
Edward Ombui ◽  
Lawrence Muchemi ◽  
Peter Wagacha

This study examines the problem of hate speech identification in codeswitched text from social media using a natural language processing approach. It explores different features in training nine models and empirically evaluates their predictiveness in identifying hate speech in a ~50k human-annotated dataset. The study introduces a novel hierarchical approach that employs Latent Dirichlet Allocation to generate topic models, which help build a high-level psychosocial feature set that we abbreviate as PDC. PDC groups words of similar meaning into word families, which is significant in capturing codeswitching during the preprocessing stage for supervised learning models. The high-level PDC features are based on a hate speech annotation framework [1] that is largely informed by the duplex theory of hate [2]. Results obtained from frequency-based models using the PDC features on a dataset comprising tweets generated during the 2012 and 2017 presidential elections in Kenya indicate an f-score of 83% (precision: 81%, recall: 85%) in identifying hate speech. The study is significant in two respects. First, it publicly shares a unique codeswitched dataset for hate speech that is valuable for comparative studies. Second, it provides a methodology for building a novel PDC feature set to identify nuanced forms of hate speech, camouflaged in codeswitched data, which conventional methods could not adequately identify.
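The reported figures can be checked against each other. The sketch below assumes the f-score is the balanced F1 score, that is, the harmonic mean of precision and recall; the abstract does not state which F-measure was used, and the `f_score` helper is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (81% and 85%).
reported = f_score(0.81, 0.85)
print(f"{reported:.3f}")  # about 0.83, consistent with the reported 83%
```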


2019 ◽  
Vol 3 (1) ◽  
pp. 72
Author(s):  
Irfan Afandi

The humanitarian problems arising from the Industrial Revolution 4.0 are very complex and have reached a worrying stage. No one is untouched by its effects. High school students are active users of the products of the Industrial Revolution 4.0. The problems that arise in their use of social media include the demise of expertise, the dissemination of hate speech, and fabricated news. The teaching of Islamic education material should respond to this by providing normative guidance from the Qur'an and Hadith so that students can avoid these negative effects. One proposed solution is to integrate this material, using integrated learning models, into the themes already arranged in the school's learning policy. This integration must proceed through phases: presenting the learning information or materials, developing critical reasoning, and generating hypotheses and generalizations.


2021 ◽  
Vol 13 (3) ◽  
pp. 80
Author(s):  
Lazaros Vrysis ◽  
Nikolaos Vryzas ◽  
Rigas Kotsakis ◽  
Theodora Saridou ◽  
Maria Matsiola ◽  
...  

Social media services make it possible for an increasing number of people to express their opinions publicly. In this context, large amounts of hateful comments are published daily. The PHARM project aims at monitoring and modeling hate speech against refugees and migrants in Greece, Italy, and Spain. To this end, a web interface for creating and querying a multi-source database of hate speech-related content is implemented and evaluated. The selected sources include Twitter, YouTube, and Facebook comments and posts, as well as comments and articles from a selected list of websites. The interface allows users to search the existing database, scrape social media using keywords, annotate records through a dedicated platform, and contribute new content to the database. Furthermore, functionality for hate speech detection and sentiment analysis of texts is provided, making use of novel methods and machine learning models. The interface can be accessed online through a graphical user interface compatible with modern internet browsers. For the evaluation of the interface, a multifactor questionnaire was formulated to record the users’ opinions about the web interface and the corresponding functionality.


2021 ◽  
Author(s):  
Abul Hasan ◽  
Mark Levene ◽  
David Weston ◽  
Renate Fromson ◽  
Nicolas Koslover ◽  
...  

BACKGROUND The COVID-19 pandemic has created a pressing need for integrating information from disparate sources in order to assist decision makers. Social media is important in this respect; however, natural language processing methods are needed to make sense of the textual information it provides and to automate the processing of large amounts of data. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. In particular, machine learning techniques for triage and diagnosis could allow for a better understanding of what social media may offer in this respect. OBJECTIVE This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and other interested parties with additional information on the symptoms, severity and prevalence of the disease. METHODS The text processing pipeline first extracts COVID-19 symptoms and related concepts such as severity, duration, negations, and body parts from patients’ posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are applied separately to build support vector machine learning models to triage patients into three categories and diagnose them for COVID-19. RESULTS We report macro- and micro-averaged F1 scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19, when the models are trained on human-labelled data. Our experimental results indicate that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. Also, we highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones. CONCLUSIONS Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline, in order to provide additional information on the severity and prevalence of the disease through the eyes of social media.
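The difference between macro- and micro-averaged F1 scores in a three-class setting can be sketched as follows. The class names and the per-class counts are hypothetical and are not results from the study; the helpers only illustrate the standard definitions of the two averages.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one class.
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; equivalent to 2PR / (P + R)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_class: Dict[str, Counts]) -> float:
    """Average the per-class F1 scores, so every class weighs the same."""
    return sum(f1(*counts) for counts in per_class.values()) / len(per_class)


def micro_f1(per_class: Dict[str, Counts]) -> float:
    """Pool the counts over all classes first, so frequent classes dominate."""
    tp = sum(counts[0] for counts in per_class.values())
    fp = sum(counts[1] for counts in per_class.values())
    fn = sum(counts[2] for counts in per_class.values())
    return f1(tp, fp, fn)


# Hypothetical counts for three triage categories (not from the study).
counts = {
    "category_a": (90, 10, 5),
    "category_b": (5, 5, 15),
    "category_c": (40, 10, 5),
}

macro = macro_f1(counts)
micro = micro_f1(counts)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

With these counts the rare, poorly classified category pulls the macro average below the micro average, which is why the two figures are usually reported together.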


2021 ◽  
Author(s):  
Lucas Rodrigues ◽  
Antonio Jacob Junior ◽  
Fábio Lobato

Posts with defamatory content or hate speech are constantly found on social media. The results for readers are numerous, not restricted only to the psychological impact, but also to the growth of this social phenomenon. With the General Law on the Protection of Personal Data and the Marco Civil da Internet, service providers became responsible for the content in their platforms. Considering the importance of this issue, this paper aims to analyze the content published (news and comments) on the G1 News Portal with techniques based on data visualization and Natural Language Processing, such as sentiment analysis and topic modeling. The results show that even with most of the comments being neutral or negative and classified or not as hate speech, the majority of them were accepted by the users.


Author(s):  
Yanchun Sun ◽  
Hang Yin ◽  
Jiu Wen ◽  
Zhiyu Sun

Urban region functions are the types of potential activities in an urban region, such as residence, commerce, transportation, and entertainment. A service that mines urban region functions is of great value for various applications, including urban planning and transportation management. Many studies have been carried out to identify the functions of different regions, but few are based on social media text analysis. Considering that the semantic information embedded in social media texts is very useful for inferring an urban region’s main functions, we design a service that extracts human activities from Sina Weibo posts with location information ( www.weibo.com ; the largest Chinese microblogging system, similar to Twitter) and then describes a region’s main functions with a function vector based on those activities. First, we predefine a variety of human activities and use an urban function classification model to obtain the activities corresponding to each Weibo post. Second, we generate function vectors for urban regions, which make high-level tasks such as similar place recommendation straightforward. Finally, using the generated function vectors, we develop a Web application for querying urban region functions. We also conduct a case study among the urban regions in Beijing, and the experimental results demonstrate the feasibility of our method.
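A function vector and a similar-place lookup could look roughly like the Python sketch below. It assumes each post has already been labelled with one activity by some classifier, that a region's vector is the normalized activity frequency, and that similarity is measured with cosine similarity. None of these choices is confirmed by the abstract, and the region names and labels are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

# Activity types named in the abstract; the fixed order defines the vector layout.
ACTIVITIES = ["residence", "commerce", "transportation", "entertainment"]


def function_vector(post_activities: List[str]) -> List[float]:
    """Share of a region's posts assigned to each activity type."""
    counts = Counter(post_activities)
    total = sum(counts[activity] for activity in ACTIVITIES)
    if total == 0:
        return [0.0] * len(ACTIVITIES)
    return [counts[activity] / total for activity in ACTIVITIES]


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(name: str, regions: Dict[str, List[float]]) -> str:
    """Return the other region whose function vector is closest to `name`."""
    return max(
        (other for other in regions if other != name),
        key=lambda other: cosine_similarity(regions[name], regions[other]),
    )


# Hypothetical per-post activity labels for three regions.
regions = {
    "region_a": function_vector(["commerce", "commerce", "entertainment", "residence"]),
    "region_b": function_vector(["commerce", "entertainment", "commerce"]),
    "region_c": function_vector(["residence", "residence", "transportation"]),
}

print(most_similar("region_a", regions))
```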


2015 ◽  
Vol 2 (2) ◽  
pp. 34-52 ◽  
Author(s):  
Nwachukwu Andrew Egbunike ◽  
Noel Ihebuzor ◽  
Ngozi Onyechi

Social media is becoming increasingly important as a means for social engagement. In Nigeria, Twitter is employed to convey opinion and make commentary on matters ranging from football to politics. Tweets are also used to inform, advocate, recruit and even incite. Previous studies have shown that Twitter could be effective for political mobilization. However, there is a dearth of research on how Twitter has been used as a purveyor of neutral and/or hate speech in the Nigerian context. This study examined the nature of tweets in the immediate aftermath of the 2015 presidential election in Nigeria. The authors employed content analysis of 250 purposively selected tweets from the #Igbo hashtag, which trended between March 29 and 31, 2015. The tweets were then categorized into five explicit hate categories and one neutral category. Results revealed the dominance of three hate tweet types: derogatory, mocking and blaming. These findings were then discussed in light of earlier theories on the functionality of tweets and of voting patterns from an analysis of the election results.


Author(s):  
Bhushan R. Chincholkar

Sentiment analysis is one of the fastest-growing fields, with demand and potential benefits increasing every day. It aims to classify the polarity of a document through natural language processing and text analysis. With the help of the internet and modern technology, there has been tremendous growth in the amount of data. Each individual is in a position to express his or her own ideas freely on social media. All of this data can be analyzed to extract useful, high-quality information. In this paper, the focus is on cyber-hate classification based on public opinion or views, since the spread of hate speech on social media can have disruptive social impacts. In particular, we propose a modified approach with two-stage training for dealing with text ambiguity and classifying text into three sentiment classes (positive, negative, and neutral), and compare its performance with popular methods as well as some existing fuzzy approaches. We also compare the performance of the proposed approach with commonly used sentiment classifiers that are known to perform well in this task. The experimental results indicate that our modified approach performs marginally better than the other algorithms.


Author(s):  
Edward Ombui ◽  
Lawrence Muchemi ◽  
Peter Wagacha

Presidential campaign periods are a major trigger event for hate speech on social media in almost every country. A systematic review of previous studies indicates inadequate publicly available annotated datasets and hardly any evidence of theoretical underpinning for the annotation schemes used for hate speech identification. This situation stifles the development of empirically useful data for research, especially in supervised machine learning. This paper describes the methodology that was used to develop a multidimensional hate speech framework based on the components of the duplex theory of hate [1], which include distance, passion, commitment to hate, and hate as a story. Subsequently, an annotation scheme based on the framework was used to annotate a random sample of ~51k tweets from ~400k tweets that were collected during the August and October 2017 presidential campaign period in Kenya. This resulted in a gold-standard codeswitched dataset that could be used for comparative and empirical studies in supervised machine learning. Classifiers trained on this dataset could be used to provide real-time monitoring of hate speech spikes on social media and inform data-driven decision-making by relevant security agencies in government.

