Hindi-English Hate Speech Detection: Author Profiling, Debiasing, and Practical Perspectives

2020, Vol 34 (01), pp. 386-393
Author(s): Shivang Chopra, Ramit Sawhney, Puneet Mathur, Rajiv Ratn Shah

Code-switched text in linguistically diverse, low-resource languages is often semantically complex, and sophisticated methodologies for precisely detecting hate speech in such real-world data are lacking. In an attempt to bridge this gap, we introduce a three-tier pipeline that employs profanity modeling, deep graph embeddings, and author profiling to retrieve instances of hate speech in Hindi-English code-switched language (Hinglish) on social media platforms such as Twitter. Through extensive comparison against several baselines on two real-world datasets, we demonstrate how targeted hate embeddings combined with social network-based features outperform the state of the art, both quantitatively and qualitatively. Additionally, we present an expert-in-the-loop algorithm for bias elimination in the proposed model pipeline and study the prevalence and performance impact of debiasing. Finally, we discuss the computational, practical, ethical, and reproducibility aspects of deploying our pipeline across the Web.
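A minimal sketch of the kind of feature fusion the abstract describes (not the authors' pipeline): it assumes tweet-level targeted hate embeddings and author-graph embeddings are already available as dense vectors, and feeds their concatenation to a downstream classifier. All arrays below are random placeholders.

```python
# Sketch only: late fusion of hypothetical hate embeddings and author-graph
# features before a simple downstream classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_tweets = 500
hate_emb = rng.normal(size=(n_tweets, 300))    # placeholder targeted hate embeddings
author_emb = rng.normal(size=(n_tweets, 64))   # placeholder social-graph node embeddings
labels = rng.integers(0, 2, size=n_tweets)     # 1 = hate speech

features = np.hstack([hate_emb, author_emb])   # simple late fusion
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```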

Author(s): Patricia Chiril, Endang Wahyu Pamungkas, Farah Benamara, Véronique Moriceau, Viviana Patti

Abstract Hate speech and harassment are widespread in online communication, due to users' freedom and anonymity and the lack of regulation provided by social media platforms. Hate speech is topically focused (misogyny, sexism, racism, xenophobia, homophobia, etc.), and each specific manifestation of hate speech targets different vulnerable groups based on characteristics such as gender (misogyny, sexism), ethnicity, race, religion (xenophobia, racism, Islamophobia), sexual orientation (homophobia), and so on. Most automatic hate speech detection approaches cast the problem as a binary classification task without addressing either the topical focus or the target-oriented nature of hate speech. In this paper, we propose to tackle, for the first time, hate speech detection from a multi-target perspective. We leverage manually annotated datasets to investigate the problem of transferring knowledge across datasets with different topical focuses and targets. Our contribution is threefold: (1) we explore the ability of hate speech detection models to capture common properties from topic-generic datasets and transfer this knowledge to recognize specific manifestations of hate speech; (2) we experiment with the development of models to detect both topics (racism, xenophobia, sexism, misogyny) and hate speech targets, going beyond standard binary classification, to investigate how to detect hate speech at a finer level of granularity and how to transfer knowledge across different topics and targets; and (3) we study the impact of affective knowledge encoded in sentic computing resources (SenticNet, EmoSenticNet) and in semantically structured hate lexicons (HurtLex) in determining specific manifestations of hate speech. We experimented with different neural models, including multi-task approaches. Our study shows that: (1) training a model on a combination of training sets from several topic-specific datasets is more effective than training a model on a topic-generic dataset; (2) the multi-task approach outperforms a single-task model when detecting both the hatefulness of a tweet and its topical focus in the context of a multi-label classification approach; and (3) the models incorporating EmoSenticNet emotions, the first-level emotions of SenticNet, a blend of SenticNet and EmoSenticNet emotions, or affective features based on HurtLex obtained the best results. Our results demonstrate that multi-target hate speech detection from existing datasets is feasible, which is a first step towards hate speech detection for a specific topic/target when dedicated annotated data are missing. Moreover, we prove that domain-independent affective knowledge, injected into our models, helps finer-grained hate speech detection.
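A minimal sketch of the multi-task idea in the abstract, assuming a shared text encoder that yields sentence embeddings: one head predicts hatefulness (binary) and another predicts the topical focus as multi-label. Dimensions, topic names, and losses are illustrative, not the paper's exact architecture.

```python
# Sketch only: joint hatefulness + topic prediction on top of a shared encoder.
import torch
import torch.nn as nn

class MultiTargetHead(nn.Module):
    def __init__(self, enc_dim=768, n_topics=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(enc_dim, 256), nn.ReLU())
        self.hate_head = nn.Linear(256, 1)          # hateful vs. not
        self.topic_head = nn.Linear(256, n_topics)  # e.g. racism, xenophobia, sexism, misogyny

    def forward(self, sentence_emb):
        h = self.shared(sentence_emb)
        return self.hate_head(h), self.topic_head(h)

model = MultiTargetHead()
emb = torch.randn(8, 768)                            # stand-in for encoder output
hate_logits, topic_logits = model(emb)
loss = (nn.BCEWithLogitsLoss()(hate_logits.squeeze(1), torch.randint(0, 2, (8,)).float())
        + nn.BCEWithLogitsLoss()(topic_logits, torch.randint(0, 2, (8, 4)).float()))
loss.backward()
```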


2019, Vol 22 (1), pp. 69-80
Author(s): Stefanie Ullmann, Marcus Tomalin

Abstract In this paper we explore quarantining as a more ethical method for delimiting the spread of Hate Speech via online social media platforms. Currently, companies like Facebook, Twitter, and Google generally respond reactively to such material: offensive messages that have already been posted are reviewed by human moderators if complaints from users are received. The offensive posts are only removed subsequently, if the complaints are upheld, by which time they may already have caused the recipients psychological harm. In addition, this approach has frequently been criticised for delimiting freedom of expression, since it requires the service providers to elaborate and implement censorship regimes. In the last few years, an emerging generation of automatic Hate Speech detection systems has started to offer new strategies for dealing with this particular kind of offensive online material. Anticipating the future efficacy of such systems, the present article advocates an approach to online Hate Speech detection that is analogous to the quarantining of malicious computer software. If a given post is automatically classified as being harmful in a reliable manner, then it can be temporarily quarantined, and the direct recipients can receive an alert, which protects them from the harmful content in the first instance. The quarantining framework is an example of more ethical online safety technology that can be extended to the handling of Hate Speech. Crucially, it provides flexible options for obtaining a more justifiable balance between freedom of expression and appropriate censorship.
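A minimal sketch of the quarantining logic described above: if an (assumed) harm classifier flags a post with sufficient confidence, the post is held and the recipient alerted rather than the post being delivered directly. The threshold, function names, and keyword-based stand-in detector are all illustrative.

```python
# Sketch only: quarantine routing around a pluggable harm classifier.
QUARANTINE_THRESHOLD = 0.9  # hypothetical reliability threshold

def route_post(post_text, recipient, classify_harm_probability):
    """classify_harm_probability is any callable returning P(harmful)."""
    p_harm = classify_harm_probability(post_text)
    if p_harm >= QUARANTINE_THRESHOLD:
        return {"action": "quarantine",
                "alert": (f"A message sent to {recipient} was held as potentially harmful "
                          f"(score {p_harm:.2f}). View or discard?")}
    return {"action": "deliver", "recipient": recipient}

# Trivial keyword-based stand-in for a real Hate Speech detector.
print(route_post("you are vile", "alice", lambda t: 0.95 if "vile" in t else 0.05))
```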


2021, Vol 4
Author(s): Bradley Butcher, Vincent S. Huang, Christopher Robinson, Jeremy Reffin, Sema K. Sgaier, ...

Developing data-driven solutions that address real-world problems requires understanding the causes of these problems and how their interactions affect the outcome, often with only observational data available. Causal Bayesian Networks (BNs) have been proposed as a powerful method for discovering and representing causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in low- and middle-income countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against a ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate the idea of a "Causal Datasheet" that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, we developed a tool which can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. We recorded the results given by well-known structure learning algorithms and by a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score. These results were used to populate the Causal Datasheet, and recommendations could be made depending on whether the expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process. First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated the generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.
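A minimal sketch of the Causal Datasheet workflow, under simplifying assumptions: sample synthetic data from a known toy DAG, run a structure learner at several sample sizes, and report edge-recovery metrics that could be checked against user-defined thresholds. The naive correlation-threshold "learner" below is only a stand-in for the real algorithms the paper evaluates (e.g. OrderMCMC).

```python
# Sketch only: edge-recovery metrics vs. sample size on a toy ground-truth DAG.
import numpy as np

TRUE_EDGES = {("A", "B"), ("B", "C")}  # toy DAG: A -> B -> C

def simulate(n, rng):
    a = rng.normal(size=n)
    b = 0.8 * a + rng.normal(scale=0.5, size=n)
    c = 0.8 * b + rng.normal(scale=0.5, size=n)
    return {"A": a, "B": b, "C": c}

def naive_learner(data, thresh=0.5):
    """Stand-in for a real structure-learning algorithm."""
    names = list(data)
    edges = set()
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if abs(np.corrcoef(data[x], data[y])[0, 1]) > thresh:
                edges.add((x, y))  # skeleton edge, oriented arbitrarily
    return edges

rng = np.random.default_rng(0)
for n in (50, 500, 5000):
    learned = naive_learner(simulate(n, rng))
    tp = len(learned & TRUE_EDGES)
    precision = tp / max(len(learned), 1)
    recall = tp / len(TRUE_EDGES)
    print(f"n={n:5d}  precision={precision:.2f}  recall={recall:.2f}")
```

The datasheet idea is to tabulate such metrics across sample sizes and algorithms so a practitioner can see whether expected performance clears their threshold before committing to a study design.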


Rheumatology, 2021
Author(s): Suchitra Kataria, Vinod Ravindran

Abstract The diversity of diseases in rheumatology and the variability in disease prevalence necessitate greater data parity in disease presentation, treatment responses (including adverse events to drugs), and various comorbidities. Randomized controlled trials are the gold standard for drug development and performance evaluation. However, when a drug is used outside the controlled environment, outcomes may differ across patient populations. In this context, understanding the macro and micro changes involved in disease evolution and progression becomes important, as does harvesting and harnessing real-world data from various sources to generate real-world evidence. Digital tools relevant to rheumatology could be leveraged to obtain greater patient insights, richer information on disease progression and disease micro processes, and even earlier diagnosis of diseases. Since patients spend only a minuscule portion of their time in a hospital or clinic, using modern digital tools to generate realistic, bias-proof, real-world data in a non-invasive, patient-friendly manner becomes critical. In this review we have appraised different digital mediums and mechanisms for collecting real-world data and proposed digital care models for generating real-world evidence in rheumatology.


Informatics, 2021, Vol 8 (4), pp. 69
Author(s): Wassen Aldjanabi, Abdelghani Dahou, Mohammed A. A. Al-qaness, Mohamed Abd Elaziz, Ahmed Mohamed Helmi, ...

As social media platforms offer a medium for opinion expression, social phenomena such as hatred, offensive language, racism, and other forms of verbal violence have increased dramatically. These behaviors are not confined to specific countries, groups, or communities; they extend into people's everyday lives. This study investigates offensive and hate speech on Arabic social media with the aim of building an accurate offensive and hate speech detection system. More precisely, we develop a classification system for determining offensive and hate speech using a multi-task learning (MTL) model built on top of a pre-trained Arabic language model. We train the MTL model on the same task across multiple corpora that represent variation in offensive and hateful contexts, so that it learns both global and dataset-specific contextual representations. The developed MTL model achieved strong performance and outperformed existing models in the literature on three out of four datasets for Arabic offensive and hate speech detection.
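A minimal sketch of the cross-corpus multi-task setup described above (illustrative, not the authors' code): a shared projection over embeddings from a pre-trained language model, with one classification head per corpus, so the same offensive/hate task trained on several corpora shares a global representation while keeping dataset-specific heads. The encoder output and corpus names are placeholders.

```python
# Sketch only: shared representation with per-corpus heads for the same task.
import torch
import torch.nn as nn

class CrossCorpusMTL(nn.Module):
    def __init__(self, corpora, enc_dim=768):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(enc_dim, 256), nn.ReLU())
        self.heads = nn.ModuleDict({c: nn.Linear(256, 2) for c in corpora})

    def forward(self, emb, corpus):
        return self.heads[corpus](self.shared(emb))

model = CrossCorpusMTL(["corpus_a", "corpus_b"])   # placeholder corpus names
logits = model(torch.randn(4, 768), "corpus_a")    # batch drawn from one corpus
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (4,)))
loss.backward()
```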


Author(s): Shuji Hao, Peilin Zhao, Yong Liu, Steven C. H. Hoi, Chunyan Miao

Relative similarity learning (RSL) aims to learn similarity functions from data with relative constraints. Most previous algorithms developed for RSL are batch-based learning approaches, which suffer from poor scalability when dealing with real-world data arriving sequentially. These methods are also typically designed to learn a single similarity function for a specific task, and may therefore be sub-optimal for multi-task learning problems. To overcome these limitations, we propose a scalable RSL framework named OMTRSL (Online Multi-Task Relative Similarity Learning). Specifically, we first develop a simple yet effective online learning algorithm for multi-task relative similarity learning. We then propose an active learning algorithm to save labeling cost. The proposed algorithms not only enjoy theoretical guarantees, but also show high efficacy and efficiency in extensive experiments on real-world datasets.
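A minimal single-task sketch of online relative similarity learning (OMTRSL extends this style of update to multiple tasks with shared structure; the update rule below is a generic hinge-loss scheme, not necessarily the paper's exact one): a bilinear similarity s(x, y) = x^T M y is updated online from triplets (x, y, z) meaning "x is more similar to y than to z".

```python
# Sketch only: online updates of a bilinear similarity from relative constraints.
import numpy as np

def online_rsl(triplets, dim, lr=0.1, margin=1.0):
    M = np.eye(dim)                       # start from identity (dot-product-like)
    for x, y, z in triplets:
        violation = margin - x @ M @ y + x @ M @ z
        if violation > 0:                 # hinge: update only on violated triplets
            M += lr * (np.outer(x, y) - np.outer(x, z))
    return M

rng = np.random.default_rng(0)
dim = 5
triplets = []
for _ in range(200):                      # synthetic triplets: y is a noisy copy of x
    x = rng.normal(size=dim)
    triplets.append((x, x + 0.1 * rng.normal(size=dim), rng.normal(size=dim)))

M = online_rsl(triplets, dim)
```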


2021
Author(s): Andrew Lensen

When faced with a new dataset, most practitioners begin by performing exploratory data analysis to discover interesting patterns and characteristics within the data. Techniques such as association rule mining are commonly applied to uncover relationships between features (attributes) of the data. However, association rules are primarily designed for use on binary or categorical data, due to their basis in rule-based machine learning. A large proportion of real-world data is continuous in nature, and discretisation of such data leads to inaccurate and less informative association rules. In this paper, we propose an alternative approach called feature relationship mining (FRM), which uses a genetic programming approach to automatically discover symbolic relationships between continuous or categorical features in data. To the best of our knowledge, the proposed approach is the first symbolic approach with the goal of explicitly discovering relationships between features. Empirical testing on a variety of real-world datasets shows that the proposed method is able to find high-quality, simple feature relationships which can be easily interpreted and which provide clear and non-trivial insight into the data.
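A greatly simplified sketch of the idea of mining symbolic relationships between continuous features: random search over small expressions composed from a function set, scored by correlation with a third feature. The paper's method evolves such expressions with genetic programming; the random search, data, and fitness here are placeholders for illustration.

```python
# Sketch only: searching for a symbolic relationship x3 ~ f(x1, x2).
import numpy as np

rng = np.random.default_rng(0)
n = 1000
data = {"x1": rng.uniform(1, 5, n), "x2": rng.uniform(1, 5, n)}
data["x3"] = data["x1"] * data["x2"] + rng.normal(scale=0.1, size=n)  # hidden relationship

OPS = {"add": np.add, "sub": np.subtract, "mul": np.multiply, "div": np.divide}

def random_expression():
    op = rng.choice(list(OPS))
    a, b = rng.choice(["x1", "x2"], size=2, replace=True)
    return op, a, b

best = None
for _ in range(200):                       # random search stands in for GP evolution
    op, a, b = random_expression()
    values = OPS[op](data[a], data[b])
    fitness = abs(np.corrcoef(values, data["x3"])[0, 1])
    if best is None or fitness > best[0]:
        best = (fitness, f"{op}({a}, {b})")

print("best relationship for x3:", best[1], f"(|r| = {best[0]:.3f})")
```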


2021, Vol 21 (1)
Author(s): Rosy Tsopra, Xose Fernandez, Claudio Luchinat, Lilia Alberghina, Hans Lehrach, ...

Abstract
Background: Artificial intelligence (AI) has the potential to transform our healthcare systems significantly. New AI technologies based on machine learning approaches should play a key role in clinical decision-making in the future. However, their implementation in health care settings remains limited, mostly due to a lack of robust validation procedures. There is a need to develop reliable assessment frameworks for the clinical validation of AI. We present here an approach for assessing AI for predicting treatment response in triple-negative breast cancer (TNBC), using real-world data and molecular omics data from clinical data warehouses and biobanks.
Methods: The European "ITFoC (Information Technology for the Future Of Cancer)" consortium designed a framework for the clinical validation of AI technologies for predicting treatment response in oncology.
Results: This framework is based on seven key steps specifying: (1) the intended use of the AI, (2) the target population, (3) the timing of AI evaluation, (4) the datasets used for evaluation, (5) the procedures used for ensuring data safety (including data quality, privacy and security), (6) the metrics used for measuring performance, and (7) the procedures used to ensure that the AI is explainable. This framework forms the basis of a validation platform that we are building for the "ITFoC Challenge". This community-wide competition will make it possible to assess and compare AI algorithms for predicting the response to TNBC treatments with external real-world datasets.
Conclusions: The predictive performance and safety of AI technologies must be assessed in a robust, unbiased and transparent manner before their implementation in healthcare settings. We believe that the considerations of the ITFoC consortium will contribute to the safe transfer and implementation of AI in clinical settings, in the context of precision oncology and personalized care.
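A minimal sketch of how the seven validation steps could be captured as a structured checklist that a validation platform requires before accepting a submission. The field names and completeness check are illustrative, not the consortium's actual schema.

```python
# Sketch only: the seven ITFoC-style steps as a structured validation plan.
from dataclasses import dataclass, fields

@dataclass
class ValidationPlan:
    intended_use: str          # (1) the clinical question the AI answers
    target_population: str     # (2) e.g. patients with TNBC
    evaluation_timing: str     # (3) when in the care pathway the AI is evaluated
    evaluation_datasets: str   # (4) external real-world datasets used
    data_safety: str           # (5) data quality, privacy, and security procedures
    performance_metrics: str   # (6) metrics used to measure performance
    explainability: str        # (7) how predictions are made explainable

def is_complete(plan: ValidationPlan) -> bool:
    """Every step must be documented before validation can proceed."""
    return all(getattr(plan, f.name).strip() for f in fields(plan))
```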


2021
Author(s): Robert A Player, Angeline M Aguinaldo, Brian B Merritt, Lisa N Maszkiewicz, Oluwaferanmi E Adeyemo, ...

A major challenge in the field of metagenomics is the selection of the correct combination of sequencing platform and downstream metagenomic analysis algorithm, or classifier. Here, we present the Metagenomic Evaluation Tool Analyzer (META), which produces simulated data and facilitates platform and algorithm selection for any given metagenomic use case. META-generated in silico read data are modular, scalable, and reflect user-defined community profiles, while the downstream analysis is done using a variety of metagenomic classifiers. Reported results include information on resource utilization, time-to-answer, and performance. Real-world data can also be analyzed using selected classifiers and the results benchmarked against simulations. To test the utility of the META software, simulated data were compared to real-world viral and bacterial metagenomic samples run on four different sequencers and analyzed using 12 metagenomic classifiers. Lastly, we introduce the META Score: a unified, quantitative value which rates a classifier's ability to both identify and count taxa in a representative sample.
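The abstract does not define the META Score, so the sketch below is only one plausible way to fold "identify" and "count" into a single number: taxon-identification F1 multiplied by a penalty based on the L1 distance between relative abundance profiles. All names and numbers are illustrative.

```python
# Sketch only: a combined identification/counting score (not the actual META Score).
def combined_score(true_counts, pred_counts):
    taxa_true, taxa_pred = set(true_counts), set(pred_counts)
    tp = len(taxa_true & taxa_pred)
    precision = tp / max(len(taxa_pred), 1)
    recall = tp / max(len(taxa_true), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)

    total_true = sum(true_counts.values()) or 1
    total_pred = sum(pred_counts.values()) or 1
    # L1 distance between relative abundance profiles, in [0, 2]
    l1 = sum(abs(true_counts.get(t, 0) / total_true - pred_counts.get(t, 0) / total_pred)
             for t in taxa_true | taxa_pred)
    return f1 * (1 - l1 / 2)   # 1.0 = perfect identification and counting

truth = {"E. coli": 700, "S. aureus": 300}
calls = {"E. coli": 650, "S. aureus": 250, "B. subtilis": 100}
print(round(combined_score(truth, calls), 3))
```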


2019, bmjebm-2019-111226
Author(s): Duane Schulthess, Daniel Gassull, Amr Makady, Anna Ludlow, Brian Rothman, ...

With the increasing use of new regulatory tools, like the Food and Drug Administration's breakthrough designation, European health technology assessors (HTAs) face growing challenges in accurately assessing the long-term value and performance of chimeric antigen receptor T-cell (CAR-T) therapies, particularly for orphan conditions such as acute lymphoblastic leukaemia. The aim of this study was to demonstrate a novel methodology harnessing longitudinal real-world data, extracted from the electronic health records of a medical centre functioning as a clinical trial site, to develop an accurate analysis of the performance of CAR-T compared with the next-best treatment option, namely allogeneic haematopoietic cell transplant (HCT). The study population comprised 43 subjects in two cohorts: 29 who had undergone HCT treatment and 14 who had undergone CAR-T therapy. The 3-year relapse-free survival (RFS) probability was 46% (95% CI: 8% to 79%) in the CAR-T cohort and 68% (95% CI: 46% to 83%) in the HCT cohort. To explain the lower RFS probability in the CAR-T cohort compared with the HCT cohort, the authors hypothesised that the CAR-T cohort had a far higher level of disease burden. This was validated by log-rank test analysis (p=0.0001) and confirmed in conversations with practitioners at the study site. The authors are aware that the small populations in this study will be seen by some readers as limiting the generalisability of the findings. However, in consultation with many European HTAs and regulators, there is broad agreement that this methodology warrants further investigation with a larger study.
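A minimal sketch of the type of survival analysis the study reports, using the open-source lifelines library (my choice, not necessarily the study's tool) on fabricated toy follow-up data: Kaplan-Meier relapse-free survival for each cohort plus a log-rank comparison. None of the numbers below are the study's data.

```python
# Sketch only: Kaplan-Meier RFS curves and a log-rank test on toy cohorts.
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(1)
# Toy follow-up times in months and event indicators (1 = relapse/death).
t_hct, e_hct = rng.uniform(1, 36, 29), rng.integers(0, 2, 29)
t_cart, e_cart = rng.uniform(1, 36, 14), rng.integers(0, 2, 14)

kmf = KaplanMeierFitter()
kmf.fit(t_hct, event_observed=e_hct, label="HCT")
print("estimated 3-year RFS (HCT cohort):", kmf.predict(36))

result = logrank_test(t_hct, t_cart, event_observed_A=e_hct, event_observed_B=e_cart)
print("log-rank p-value:", result.p_value)
```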

