An Evaluation of Multilingual Offensive Language Identification Methods for the Languages of India

Information, 2021, Vol 12 (8), pp. 306
Author(s): Tharindu Ranasinghe, Marcos Zampieri

The pervasiveness of offensive content in social media has become an important concern for online platforms. With the aim of improving online safety, a large number of studies applying computational models to identify such content have been published in the last few years, with promising results. The majority of these studies, however, deal with high-resource languages such as English due to the availability of datasets in these languages. Recent work has addressed offensive language identification from a low-resource perspective, exploring data augmentation strategies and trying to take advantage of existing multilingual pretrained models to cope with data scarcity in low-resource scenarios. In this work, we revisit the problem of low-resource offensive language identification by evaluating the performance of multilingual transformers in offensive language identification for languages spoken in India. We investigate languages from different families such as Indo-Aryan (e.g., Bengali, Hindi, and Urdu) and Dravidian (e.g., Tamil, Malayalam, and Kannada), creating important new technology for these languages. The results show that multilingual offensive language identification models perform better than monolingual models and that cross-lingual transformers show strong zero-shot and few-shot performance across languages.
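The workflow described above, fine-tuning a multilingual transformer on one language and applying it zero-shot or few-shot to another, can be sketched roughly as follows. This is a minimal illustration using the HuggingFace transformers and datasets libraries, with hypothetical CSV files containing "text" and "label" columns; it is not the authors' exact setup.

```python
# Sketch: fine-tune XLM-R for binary offensive-language classification on one
# language, then run inference on a test set in a different language.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical CSV files with columns "text" and "label" (0 = not offensive, 1 = offensive).
data = load_dataset("csv", data_files={"train": "hindi_train.csv", "test": "tamil_test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="offense-xlmr", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=data["train"])
trainer.train()

# Zero-shot-style evaluation: the model fine-tuned on one language is applied
# directly to test data in another language.
preds = trainer.predict(data["test"])
print(preds.predictions.argmax(axis=-1)[:10])
```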

Author(s): Tharindu Ranasinghe, Marcos Zampieri

Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have recently been published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partly because most available annotated datasets contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in the TRAC-2 shared task [23], 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020 [58], 0.8568 F1 macro for Hindi in the HASOC 2019 shared task [27], and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
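The evaluation reported above scores target-language predictions, produced by a model trained on English data, with macro F1. A small sketch of that scoring step is shown below, using scikit-learn; the label files and their one-label-per-line format are hypothetical placeholders.

```python
# Sketch: macro-F1 evaluation of predictions projected onto a target-language test set.
from sklearn.metrics import f1_score, classification_report

def load_labels(path):
    # One label per line, e.g. "OFF" or "NOT" (hypothetical format).
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

gold = load_labels("bengali_gold.txt")        # gold annotations for the target language
pred = load_labels("bengali_predictions.txt") # predictions from the English-trained model

print("Macro F1:", f1_score(gold, pred, average="macro"))
print(classification_report(gold, pred, digits=4))
```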


2021
Author(s): Saurabh Gaikwad, Tharindu Ranasinghe, Marcos Zampieri, Christopher M. Homan, ...

2021, Vol 2021, pp. 1-9
Author(s): Chenggang Mi, Shaolin Zhu, Rui Nie

Loanword identification has been studied in recent years to alleviate data sparseness in several natural language processing (NLP) tasks, such as machine translation and cross-lingual information retrieval. However, recent studies on this topic usually focus on high-resource languages (such as Chinese, English, and Russian); for low-resource languages such as Uyghur and Mongolian, loanword identification tends to perform worse due to limited resources and a lack of annotated data. To overcome this problem, we first propose a lexical constraint-based data augmentation method to generate training data for low-resource loanword identification; we then introduce a loanword identification model based on a log-linear RNN, which improves performance by incorporating features such as word-level embeddings, character-level embeddings, pronunciation similarity, and part-of-speech (POS) tags into one model. Experimental results on loanword identification in Uyghur (in this study, we mainly focus on Arabic, Chinese, Russian, and Turkish loanwords in Uyghur) show that our proposed method achieves the best performance compared with several strong baseline systems.
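As a rough illustration of the feature-combination idea described above, the toy PyTorch sketch below concatenates word embeddings, character-level embeddings, and extra per-token features (standing in for pronunciation similarity and POS) before a recurrent tagger. The vocabulary sizes and dimensions are assumptions, and the module does not reproduce the paper's log-linear RNN.

```python
import torch
import torch.nn as nn

class LoanwordTagger(nn.Module):
    def __init__(self, word_vocab=20000, char_vocab=200, n_extra=5, n_labels=5):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, 100)
        self.char_emb = nn.Embedding(char_vocab, 30)
        self.char_lstm = nn.LSTM(30, 25, batch_first=True, bidirectional=True)
        self.lstm = nn.LSTM(100 + 50 + n_extra, 128, batch_first=True, bidirectional=True)
        self.out = nn.Linear(256, n_labels)  # per-token label, e.g. loanword origin vs. native word

    def forward(self, words, chars, extra):
        # words: (B, T), chars: (B, T, C), extra: (B, T, n_extra)
        B, T, C = chars.shape
        w = self.word_emb(words)                              # (B, T, 100)
        c = self.char_emb(chars.view(B * T, C))               # (B*T, C, 30)
        _, (h, _) = self.char_lstm(c)                         # h: (2, B*T, 25)
        c = torch.cat([h[0], h[1]], dim=-1).view(B, T, 50)    # (B, T, 50)
        x = torch.cat([w, c, extra], dim=-1)                  # fused feature vector per token
        x, _ = self.lstm(x)
        return self.out(x)                                    # (B, T, n_labels)

# Smoke test with random inputs.
model = LoanwordTagger()
logits = model(torch.randint(0, 20000, (2, 7)),
               torch.randint(0, 200, (2, 7, 12)),
               torch.rand(2, 7, 5))
print(logits.shape)  # torch.Size([2, 7, 5])
```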


2020, Vol 8, pp. 109-124
Author(s): Shuyan Zhou, Shruti Rijhwani, John Wieting, Jaime Carbonell, Graham Neubig

Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on resources from Wikipedia have proven successful in the realm of relatively high-resource languages, but these do not extend well to low-resource languages with few, if any, Wikipedia pages. Recently, transfer learning methods have been shown to reduce the demand for resources in low-resource languages by utilizing resources in closely related languages, but the performance still lags far behind that of their high-resource counterparts. In this paper, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios. The methods are simple but effective: we experiment with our approach on seven XEL datasets and find that they yield an average gain of 16.9% in Top-30 gold candidate recall, compared with state-of-the-art baselines. Our improved model also yields an average gain of 7.9% in in-KB accuracy of end-to-end XEL.
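The headline metric above, Top-30 gold candidate recall, is simply the fraction of mentions whose gold KB entity appears among the top 30 generated candidates. A minimal sketch of that computation, with toy data structures, is shown below.

```python
def top_k_gold_recall(candidates, gold_entities, k=30):
    """candidates: ranked candidate-entity lists, one per mention.
    gold_entities: gold KB entity ids, aligned with candidates."""
    hits = sum(1 for cands, gold in zip(candidates, gold_entities)
               if gold in cands[:k])
    return hits / len(gold_entities)

# Toy example: three mentions; the gold entity is retrieved for two of them.
cands = [["Q1", "Q7", "Q3"], ["Q9"], ["Q2", "Q4"]]
gold = ["Q7", "Q5", "Q2"]
print(top_k_gold_recall(cands, gold, k=30))  # 0.666...
```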


Author(s): Zolzaya Byambadorj, Ryota Nishimura, Altangerel Ayush, Kengo Ohta, Norihide Kitaoka

Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target language data for training in all three approaches, and for generating the augmented data used for training in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data with 30 min of target language data and one using the entire 12 h of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset. Overall, we found that our proposed TTS system consisting of a spectrogram prediction network and a PWG neural vocoder was able to achieve reasonable performance using only 30 min of target language training data. We also found that by using 3 h of target language data both for training the model and for generating augmented data, our proposed TTS model was able to achieve performance very similar to that of the baseline model, which was trained with 12 h of target language data.
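One way to read the cross-lingual transfer step described above is: reuse the weights of a spectrogram-prediction network trained on a high-resource language, re-initialize the phoneme embeddings for the target language's inventory, and fine-tune on the small target-language set. The toy PyTorch sketch below illustrates that checkpoint-reuse pattern with a stand-in module; the module, file name, and phoneme counts are assumptions, not the authors' actual network or settings.

```python
import torch
import torch.nn as nn

class SpectrogramPredictor(nn.Module):
    def __init__(self, n_phonemes, emb_dim=256, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, emb_dim)
        self.encoder = nn.GRU(emb_dim, 256, batch_first=True)
        self.decoder = nn.Linear(256, n_mels)   # stand-in for the attention decoder

    def forward(self, phonemes):
        x = self.phoneme_emb(phonemes)
        x, _ = self.encoder(x)
        return self.decoder(x)

# 1. Build the model with the source-language phoneme inventory and load the
#    high-resource checkpoint (hypothetical path).
src_model = SpectrogramPredictor(n_phonemes=70)
# src_model.load_state_dict(torch.load("english_pretrained.pt"))

# 2. Re-create the embedding table for the target-language phoneme set, keeping
#    all other pretrained weights.
tgt_model = SpectrogramPredictor(n_phonemes=52)
state = {k: v for k, v in src_model.state_dict().items() if not k.startswith("phoneme_emb")}
tgt_model.load_state_dict(state, strict=False)

# 3. Fine-tune everything on the small target-language dataset (training loop omitted).
optimizer = torch.optim.Adam(tgt_model.parameters(), lr=1e-4)
print(sum(p.numel() for p in tgt_model.parameters()), "parameters ready for fine-tuning")
```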


2021, Vol 11 (5), pp. 1974
Author(s): Chanhee Lee, Kisu Yang, Taesun Whang, Chanjun Park, Andrew Matteson, ...

Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Even though language modeling is unsupervised, and collecting data for it is therefore relatively inexpensive, it is still a challenging process for languages with limited resources. This results in a great technological disparity between high- and low-resource languages for numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data-efficient training of pretrained language models. This is achieved by formulating language modeling of low-resource languages as a domain adaptation task using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at a sequence level. To evaluate our method, we post-train a RoBERTa model pretrained in English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves data efficiency by a factor of up to 32 compared to monolingual training.
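A simplified reading of the cross-lingual post-training idea above is: keep the transformer body pretrained on the high-resource language, swap in an embedding table sized for the target-language tokenizer, and train the language-specific parts first. The sketch below illustrates that staging with HuggingFace transformers; the Korean tokenizer name is an assumption and the recipe is a simplified illustration, not the authors' exact procedure.

```python
from transformers import RobertaForMaskedLM, AutoTokenizer

model = RobertaForMaskedLM.from_pretrained("roberta-base")
target_tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")  # assumed Korean tokenizer

# Swap in an embedding matrix sized for the target-language vocabulary.
model.resize_token_embeddings(len(target_tokenizer))

# Stage 1: freeze the reused transformer body and train only the (new)
# embeddings and the LM head on target-language text.
for name, param in model.named_parameters():
    param.requires_grad = ("embed" in name) or ("lm_head" in name)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable / total:.1%} of parameters in stage 1")
# Stage 2 (not shown): unfreeze the body and continue post-training on the
# low-resource corpus.
```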


2021, Vol 11 (22), pp. 10860
Author(s): Mengtao Sun, Hao Wang, Mark Pasquine, Ibrahim A. Hameed

Existing Sequence-to-Sequence (Seq2Seq) Neural Machine Translation (NMT) shows strong capability with High-Resource Languages (HRLs). However, this approach poses serious challenges when processing Low-Resource Languages (LRLs), because the model's expressiveness is limited by the small number of parallel sentence pairs available for training. This study uses adversarial and transfer learning techniques to mitigate the lack of sentence pairs in LRL corpora. We propose a new Low-resource, Adversarial, Cross-lingual (LAC) model for NMT. On the adversarial side, the LAC model consists of a generator and a discriminator: the generator is a Seq2Seq model that produces translations from the source to the target language, while the discriminator measures the gap between machine and human translations. In addition, we introduce transfer learning into the LAC model to help capture features from scarce resources, because some languages share the same subject-verb-object grammatical structure. Rather than using the entire pretrained LAC model, we separately utilize the pretrained generator and discriminator; the pretrained discriminator exhibited better performance in all experiments. Experimental results demonstrate that the LAC model achieves higher Bilingual Evaluation Understudy (BLEU) scores and has good potential to augment LRL translations.
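The adversarial setup described above pairs a Seq2Seq generator with a discriminator that separates machine from human translations. The toy PyTorch loop below sketches that alternating generator/discriminator training with stand-in modules and random tensors; it only illustrates the interaction pattern and is not the LAC model itself.

```python
import torch
import torch.nn as nn

emb_dim, hid, vocab = 64, 128, 1000
# Toy stand-ins: a "generator" that encodes source tokens, and a discriminator
# that scores a translation representation as human (1) or machine (0).
generator = nn.Sequential(nn.Embedding(vocab, emb_dim),
                          nn.GRU(emb_dim, hid, batch_first=True))
discriminator = nn.Sequential(nn.Linear(hid, 64), nn.ReLU(), nn.Linear(64, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

src = torch.randint(0, vocab, (8, 10))   # toy source-language batch
human_repr = torch.randn(8, hid)         # stand-in for encoded human translations

for step in range(3):
    states, _ = generator(src)           # (8, 10, hid)
    machine_repr = states[:, -1, :]      # stand-in for the machine translation
    # Discriminator update: human -> 1, machine -> 0 (generator detached).
    d_loss = bce(discriminator(human_repr), torch.ones(8, 1)) + \
             bce(discriminator(machine_repr.detach()), torch.zeros(8, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # Generator update: the usual translation loss is omitted here; the
    # adversarial term pushes machine outputs toward the "human" decision.
    g_loss = bce(discriminator(machine_repr), torch.ones(8, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    print(f"step {step}: d_loss={d_loss.item():.3f}  g_loss={g_loss.item():.3f}")
```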

