Towards Offensive Language Identification for Tamil Code-Mixed YouTube Comments and Posts

The pervasiveness of offensive content in social media has become an important reason for concern for online platforms. With the aim of improving online safety, a large number of studies applying computational models to identify such content have been published in the last few years, with promising results. The majority of these studies, however, deal with high-resource languages such as English due to the availability of datasets in these languages. Recent work has addressed offensive language identification from a low-resource perspective, exploring data augmentation strategies and trying to take advantage of existing multilingual pretrained models to cope with data scarcity in low-resource scenarios. In this work, we revisit the problem of low-resource offensive language identification by evaluating the performance of multilingual transformers in offensive language identification for languages spoken in India. We investigate languages from different families such as Indo-Aryan (e.g., Bengali, Hindi, and Urdu) and Dravidian (e.g., Tamil, Malayalam, and Kannada), creating important new technology for these languages. The results show that multilingual offensive language identification models perform better than monolingual models and that cross-lingual transformers show strong zero-shot and few-shot performance across languages.


LIIR at SemEval-2020 Task 12: A Cross-Lingual Augmentation Approach for Multilingual Offensive Language Identification

10.18653/v1/2020.semeval-1.274 ◽ 2020

Author(s): Erfan Ghadery ◽ Marie-Francine Moens

Keyword(s): Language Identification ◽ Cross Lingual ◽ Offensive Language


Deep Learning for predicting neutralities in Offensive Language Identification Dataset

Expert Systems with Applications ◽ 10.1016/j.eswa.2021.115458 ◽ 2021 ◽ pp. 115458

Author(s): Mayukh Sharma ◽ Ilanthenral Kandasamy ◽ Vasantha Kandasamy

Keyword(s): Deep Learning ◽ Language Identification ◽ Offensive Language


SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification

10.18653/v1/2021.findings-acl.80 ◽ 2021

Author(s): Sara Rosenthal ◽ Pepa Atanasova ◽ Georgi Karadzhov ◽ Marcos Zampieri ◽ Preslav Nakov

Keyword(s): Large Scale ◽ Language Identification ◽ Offensive Language


ConvAI at SemEval-2019 Task 6: Offensive Language Identification and Categorization with Perspective and BERT

10.18653/v1/s19-2102 ◽ 2019 ◽ Cited By ~ 2

Author(s): John Pavlopoulos ◽ Nithum Thain ◽ Lucas Dixon ◽ Ion Androutsopoulos

Keyword(s): Language Identification ◽ Offensive Language


Multilingual Offensive Language Identification for Low-resource Languages

ACM Transactions on Asian and Low-Resource Language Information Processing ◽ 10.1145/3457610 ◽ 2022 ◽ Vol 21 (1) ◽ pp. 1-13

Author(s): Tharindu Ranasinghe ◽ Marcos Zampieri

Keyword(s): Transfer Learning ◽ Hate Speech ◽ Training And Development ◽ Language Identification ◽ Shared Task ◽ Low Resource ◽ Government Organizations ◽ Cross Lingual ◽ Offensive Language ◽ Clear Majority

Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have recently been published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partly because most available annotated datasets contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in the TRAC-2 shared task [23], 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020 [58], 0.8568 F1 macro for Hindi in the HASOC 2019 shared task [27], and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.


The Titans at SemEval-2019 Task 6: Offensive Language Identification, Categorization and Target Identification

10.18653/v1/s19-2133 ◽ 2019

Author(s): Avishek Garain ◽ Arpan Basu

Keyword(s): Target Identification ◽ Language Identification ◽ Offensive Language
