Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing

Author(s):  
Çağrı Çöltekin ◽  
Taraka Rama

Author(s):  
Tharindu Ranasinghe ◽  
Marcos Zampieri

Offensive content is pervasive in social media and a cause for concern for companies and government organizations. Several studies have recently been published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partially because most available annotated datasets contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions onto comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 macro F1 for Bengali in the TRAC-2 shared task [23], 0.8532 macro F1 for Danish and 0.8701 macro F1 for Greek in OffensEval 2020 [58], 0.8568 macro F1 for Hindi in the HASOC 2019 shared task [27], and 0.7513 macro F1 for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
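The transfer recipe the abstract describes can be sketched in a few lines. This is a toy illustration, not the authors' system: the multilingual encoder is simulated by drawing points from language-independent class regions of a shared vector space (the assumption that makes cross-lingual transfer work), and a logistic-regression head trained on "English" labels alone is then applied unchanged to another language.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a cross-lingual contextual encoder (e.g. a multilingual
# transformer): offensive (1) and non-offensive (0) texts from ANY language
# are assumed to land near language-independent class regions.
def embed(label, n, dim=16):
    centroid = np.ones(dim) if label == 1 else -np.ones(dim)
    return centroid + rng.normal(size=(n, dim))

# Labeled "English" data (the high-resource source language).
X_en = np.vstack([embed(1, 100), embed(0, 100)])
y_en = np.array([1] * 100 + [0] * 100)

# Train a logistic-regression head on the English embeddings only.
w, b = np.zeros(X_en.shape[1]), 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-np.clip(X_en @ w + b, -30, 30)))
    w -= 0.5 * X_en.T @ (p - y_en) / len(y_en)
    b -= 0.5 * (p - y_en).mean()

# Transfer: apply the same head, unchanged, to "Danish" embeddings
# without ever seeing a Danish label.
X_da = np.vstack([embed(1, 50), embed(0, 50)])
y_da = np.array([1] * 50 + [0] * 50)
pred = ((X_da @ w + b) > 0).astype(int)
acc = (pred == y_da).mean()
print(f"zero-shot accuracy on the target language: {acc:.2f}")
```

The design choice the sketch isolates: all language-specific work is pushed into the shared encoder, so the English-trained classification head transfers for free.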


Author(s):  
N. V. Remnev

The task of Native Language Identification (NLI) is to automatically recognize an author’s native language (L1) from texts written in a language that is non-native to the author. The NLI task has been studied in detail for English, with two shared tasks conducted in 2013 and 2017 using TOEFL English essays and essay samples as data. A small number of works have also addressed the NLI problem for other languages; for Russian, it was investigated by Ladygina (2017) and Remnev (2019). This paper discusses the use of approaches well established in the NLI Shared Task 2013 and 2017 competitions to recognize the author’s native language, as well as to recognize the type of speaker: learners of Russian versus Heritage Russian speakers. The native language identification task is also solved based on the types of errors specific to speakers of different languages. This study is data-driven and is made possible by the Russian Learner Corpus, developed by the Higher School of Economics (HSE) Learner Russian Research Group, on the basis of which the experiments are conducted.
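The workhorse features in the NLI Shared Task 2013 and 2017 systems were character n-grams fed to a linear classifier. A minimal stdlib sketch of that idea, with invented toy data: the transliterated "learner-Russian" sentences and the L1 labels below are fabricated caricatures in which each hypothetical L1 group leaves distinct surface traces, scored here with a multinomial Naive Bayes over character trigrams.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Invented toy training samples: each (hypothetical) L1 group produces
# characteristic spellings. Real NLI systems use the same kind of
# character n-gram evidence, just at corpus scale.
train = [
    ("ona idet v shkola", "L1_A"),
    ("on rabotat v shkola", "L1_A"),
    ("ona idyot v shkolu", "L1_B"),
    ("on rabotaet v shkole", "L1_B"),
]

# Per-class character-trigram counts (multinomial Naive Bayes statistics).
counts, totals = {}, Counter()
for text, l1 in train:
    grams = char_ngrams(text)
    counts.setdefault(l1, Counter()).update(grams)
    totals[l1] += len(grams)

vocab = {g for c in counts.values() for g in c}

def classify(text):
    # Add-one-smoothed log-likelihood under each class; the highest wins.
    best, best_lp = None, -math.inf
    for l1, c in counts.items():
        lp = sum(math.log((c[g] + 1) / (totals[l1] + len(vocab)))
                 for g in char_ngrams(text))
        if lp > best_lp:
            best, best_lp = l1, lp
    return best

print(classify("oni rabotat v shkola"))
```

The error-type features the paper mentions slot into the same framework: replace character n-grams with counts of annotated error categories per text and the classifier is unchanged.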


2021 ◽  
Author(s):  
Elizabeth Salesky ◽  
Badr M. Abdullah ◽  
Sabrina Mielke ◽  
Elena Klyachko ◽  
Oleg Serikov ◽  
...  

2014 ◽  
Author(s):  
Gokul Chittaranjan ◽  
Yogarshi Vyas ◽  
Kalika Bali ◽  
Monojit Choudhury

Information ◽  
2021 ◽  
Vol 12 (8) ◽  
pp. 306
Author(s):  
Tharindu Ranasinghe ◽  
Marcos Zampieri

The pervasiveness of offensive content in social media has become an important reason for concern for online platforms. With the aim of improving online safety, a large number of studies applying computational models to identify such content have been published in the last few years, with promising results. The majority of these studies, however, deal with high-resource languages such as English due to the availability of datasets in these languages. Recent work has addressed offensive language identification from a low-resource perspective, exploring data augmentation strategies and trying to take advantage of existing multilingual pretrained models to cope with data scarcity in low-resource scenarios. In this work, we revisit the problem of low-resource offensive language identification by evaluating the performance of multilingual transformers in offensive language identification for languages spoken in India. We investigate languages from different families such as Indo-Aryan (e.g., Bengali, Hindi, and Urdu) and Dravidian (e.g., Tamil, Malayalam, and Kannada), creating important new technology for these languages. The results show that multilingual offensive language identification models perform better than monolingual models and that cross-lingual transformers show strong zero-shot and few-shot performance across languages.
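The zero-shot versus few-shot contrast in the abstract can be illustrated with a toy numpy sketch (not the authors' experiments): "embeddings" for the target language are drawn from a shifted region of the shared space, so a head trained only on the source language is miscalibrated, and continuing training on just eight labeled target examples recovers most of the lost accuracy.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a multilingual transformer's sentence embeddings: classes
# are separable in a shared space, but the target language occupies a
# shifted region, so a source-only head transfers imperfectly.
def sample(label, n, shift=0.0, dim=8):
    centroid = (np.ones(dim) if label else -np.ones(dim)) + shift
    return centroid + rng.normal(size=(n, dim))

def fit(X, y, w=None, b=0.0, lr=0.5, steps=300):
    """Full-batch gradient descent for logistic regression."""
    w = np.zeros(X.shape[1]) if w is None else w.copy()
    for _ in range(steps):
        p = 1 / (1 + np.exp(-np.clip(X @ w + b, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def accuracy(w, b, X, y):
    return (((X @ w + b) > 0).astype(int) == y).mean()

# Source-language training data; shifted target-language test data.
X_src = np.vstack([sample(1, 100), sample(0, 100)])
y_src = np.array([1] * 100 + [0] * 100)
X_tgt = np.vstack([sample(1, 100, shift=1.5), sample(0, 100, shift=1.5)])
y_tgt = np.array([1] * 100 + [0] * 100)

w, b = fit(X_src, y_src)                    # train on the source only
zero_shot = accuracy(w, b, X_tgt, y_tgt)    # zero-shot transfer

# Few-shot: continue training on 8 labeled target-language examples.
X_few = np.vstack([sample(1, 4, shift=1.5), sample(0, 4, shift=1.5)])
y_few = np.array([1] * 4 + [0] * 4)
w2, b2 = fit(X_few, y_few, w=w, b=b)
few_shot = accuracy(w2, b2, X_tgt, y_tgt)

print(f"zero-shot: {zero_shot:.2f}  few-shot: {few_shot:.2f}")
```

The sketch mirrors the reported pattern: the cross-lingual head already works above chance with no target labels, and a handful of target-language examples corrects the remaining calibration gap.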


2014 ◽  
Author(s):  
Thamar Solorio ◽  
Elizabeth Blair ◽  
Suraj Maharjan ◽  
Steven Bethard ◽  
Mona Diab ◽  
...  

2016 ◽  
Author(s):  
Giovanni Molina ◽  
Fahad AlGhamdi ◽  
Mahmoud Ghoneim ◽  
Abdelati Hawwari ◽  
Nicolas Rey-Villamizar ◽  
...  

2017 ◽  
Author(s):  
Shervin Malmasi ◽  
Keelan Evanini ◽  
Aoife Cahill ◽  
Joel Tetreault ◽  
Robert Pugh ◽  
...  
