Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing

Author(s):  
Çağrı Çöltekin ◽  
Taraka Rama

Author(s):  
Tharindu Ranasinghe ◽  
Marcos Zampieri

Offensive content is pervasive in social media and a cause for concern for companies and government organizations. Several studies have recently been published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partially because most available annotated datasets contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions onto comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 macro F1 for Bengali in the TRAC-2 shared task [23], 0.8532 macro F1 for Danish and 0.8701 macro F1 for Greek in OffensEval 2020 [58], 0.8568 macro F1 for Hindi in the HASOC 2019 shared task [27], and 0.7513 macro F1 for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
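The transfer recipe the abstract describes can be sketched in a few lines. This is a toy illustration, not the authors' system: the multilingual encoder is simulated by drawing points from language-independent class regions of a shared vector space (the assumption that makes cross-lingual transfer work), and a logistic-regression head trained on "English" labels alone is then applied unchanged to another language.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a cross-lingual contextual encoder (e.g. a multilingual
# transformer): offensive (1) and non-offensive (0) texts from ANY language
# are assumed to land near language-independent class regions.
def embed(label, n, dim=16):
    centroid = np.ones(dim) if label == 1 else -np.ones(dim)
    return centroid + rng.normal(size=(n, dim))

# Labeled "English" data (the high-resource source language).
X_en = np.vstack([embed(1, 100), embed(0, 100)])
y_en = np.array([1] * 100 + [0] * 100)

# Train a logistic-regression head on the English embeddings only.
w, b = np.zeros(X_en.shape[1]), 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-np.clip(X_en @ w + b, -30, 30)))
    w -= 0.5 * X_en.T @ (p - y_en) / len(y_en)
    b -= 0.5 * (p - y_en).mean()

# Transfer: apply the same head, unchanged, to "Danish" embeddings
# without ever seeing a Danish label.
X_da = np.vstack([embed(1, 50), embed(0, 50)])
y_da = np.array([1] * 50 + [0] * 50)
pred = ((X_da @ w + b) > 0).astype(int)
acc = (pred == y_da).mean()
print(f"zero-shot accuracy on the target language: {acc:.2f}")
```

The design choice the sketch isolates: all language-specific work is pushed into the shared encoder, so the English-trained classification head transfers for free.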


Author(s):  
N. V. Remnev

The task of Native Language Identification (NLI) is to automatically recognize an author’s native language (L1) from texts written in a language that is non-native to the author. The NLI task has been studied in detail for English, with two shared tasks conducted in 2013 and 2017 using TOEFL English essays and essay samples as data. A small number of works have also addressed the NLI problem for other languages; for Russian, it was investigated by Ladygina (2017) and Remnev (2019). This paper discusses the use of approaches well established in the NLI Shared Task 2013 and 2017 competitions to recognize the author’s native language, as well as to recognize the type of speaker: learners of Russian versus Heritage Russian speakers. The native language identification task is also solved based on the types of errors specific to speakers of different languages. This study is data-driven and is made possible by the Russian Learner Corpus, developed by the Higher School of Economics (HSE) Learner Russian Research Group, on the basis of which the experiments are conducted.
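The workhorse features in the NLI Shared Task 2013 and 2017 systems were character n-grams fed to a linear classifier. A minimal stdlib sketch of that idea, with invented toy data: the transliterated "learner-Russian" sentences and the L1 labels below are fabricated caricatures in which each hypothetical L1 group leaves distinct surface traces, scored here with a multinomial Naive Bayes over character trigrams.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Invented toy training samples: each (hypothetical) L1 group produces
# characteristic spellings. Real NLI systems use the same kind of
# character n-gram evidence, just at corpus scale.
train = [
    ("ona idet v shkola", "L1_A"),
    ("on rabotat v shkola", "L1_A"),
    ("ona idyot v shkolu", "L1_B"),
    ("on rabotaet v shkole", "L1_B"),
]

# Per-class character-trigram counts (multinomial Naive Bayes statistics).
counts, totals = {}, Counter()
for text, l1 in train:
    grams = char_ngrams(text)
    counts.setdefault(l1, Counter()).update(grams)
    totals[l1] += len(grams)

vocab = {g for c in counts.values() for g in c}

def classify(text):
    # Add-one-smoothed log-likelihood under each class; the highest wins.
    best, best_lp = None, -math.inf
    for l1, c in counts.items():
        lp = sum(math.log((c[g] + 1) / (totals[l1] + len(vocab)))
                 for g in char_ngrams(text))
        if lp > best_lp:
            best, best_lp = l1, lp
    return best

print(classify("oni rabotat v shkola"))
```

The error-type features the paper mentions slot into the same framework: replace character n-grams with counts of annotated error categories per text and the classifier is unchanged.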


2021 ◽  
Author(s):  
Elizabeth Salesky ◽  
Badr M. Abdullah ◽  
Sabrina Mielke ◽  
Elena Klyachko ◽  
Oleg Serikov ◽  
...  

2014 ◽  
Author(s):  
Gokul Chittaranjan ◽  
Yogarshi Vyas ◽  
Kalika Bali ◽  
Monojit Choudhury

Information ◽  
2021 ◽  
Vol 12 (8) ◽  
pp. 306
Author(s):  
Tharindu Ranasinghe ◽  
Marcos Zampieri

The pervasiveness of offensive content in social media has become an important reason for concern for online platforms. With the aim of improving online safety, a large number of studies applying computational models to identify such content have been published in the last few years, with promising results. The majority of these studies, however, deal with high-resource languages such as English due to the availability of datasets in these languages. Recent work has addressed offensive language identification from a low-resource perspective, exploring data augmentation strategies and trying to take advantage of existing multilingual pretrained models to cope with data scarcity in low-resource scenarios. In this work, we revisit the problem of low-resource offensive language identification by evaluating the performance of multilingual transformers in offensive language identification for languages spoken in India. We investigate languages from different families such as Indo-Aryan (e.g., Bengali, Hindi, and Urdu) and Dravidian (e.g., Tamil, Malayalam, and Kannada), creating important new technology for these languages. The results show that multilingual offensive language identification models perform better than monolingual models and that cross-lingual transformers show strong zero-shot and few-shot performance across languages.
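The zero-shot versus few-shot contrast in the abstract can be illustrated with a toy numpy sketch (not the authors' experiments): "embeddings" for the target language are drawn from a shifted region of the shared space, so a head trained only on the source language is miscalibrated, and continuing training on just eight labeled target examples recovers most of the lost accuracy.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a multilingual transformer's sentence embeddings: classes
# are separable in a shared space, but the target language occupies a
# shifted region, so a source-only head transfers imperfectly.
def sample(label, n, shift=0.0, dim=8):
    centroid = (np.ones(dim) if label else -np.ones(dim)) + shift
    return centroid + rng.normal(size=(n, dim))

def fit(X, y, w=None, b=0.0, lr=0.5, steps=300):
    """Full-batch gradient descent for logistic regression."""
    w = np.zeros(X.shape[1]) if w is None else w.copy()
    for _ in range(steps):
        p = 1 / (1 + np.exp(-np.clip(X @ w + b, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def accuracy(w, b, X, y):
    return (((X @ w + b) > 0).astype(int) == y).mean()

# Source-language training data; shifted target-language test data.
X_src = np.vstack([sample(1, 100), sample(0, 100)])
y_src = np.array([1] * 100 + [0] * 100)
X_tgt = np.vstack([sample(1, 100, shift=1.5), sample(0, 100, shift=1.5)])
y_tgt = np.array([1] * 100 + [0] * 100)

w, b = fit(X_src, y_src)                    # train on the source only
zero_shot = accuracy(w, b, X_tgt, y_tgt)    # zero-shot transfer

# Few-shot: continue training on 8 labeled target-language examples.
X_few = np.vstack([sample(1, 4, shift=1.5), sample(0, 4, shift=1.5)])
y_few = np.array([1] * 4 + [0] * 4)
w2, b2 = fit(X_few, y_few, w=w, b=b)
few_shot = accuracy(w2, b2, X_tgt, y_tgt)

print(f"zero-shot: {zero_shot:.2f}  few-shot: {few_shot:.2f}")
```

The sketch mirrors the reported pattern: the cross-lingual head already works above chance with no target labels, and a handful of target-language examples corrects the remaining calibration gap.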


2014 ◽  
Author(s):  
Thamar Solorio ◽  
Elizabeth Blair ◽  
Suraj Maharjan ◽  
Steven Bethard ◽  
Mona Diab ◽  
...  

2016 ◽  
Author(s):  
Giovanni Molina ◽  
Fahad AlGhamdi ◽  
Mahmoud Ghoneim ◽  
Abdelati Hawwari ◽  
Nicolas Rey-Villamizar ◽  
...  

2017 ◽  
Author(s):  
Shervin Malmasi ◽  
Keelan Evanini ◽  
Aoife Cahill ◽  
Joel Tetreault ◽  
Robert Pugh ◽  
...  
