Oracle and Human Baselines for Native Language Identification

Author(s):  
Shervin Malmasi ◽  
Joel Tetreault ◽  
Mark Dras

2015 ◽
Vol 1 (2) ◽  
pp. 187-209 ◽  
Author(s):  
Kristopher Kyle ◽  
Scott A. Crossley ◽  
YouJin Kim

This study evaluates the impact of writing proficiency on native language identification (NLI), a topic that has important implications for the generalizability of NLI models and for detection-based arguments for cross-linguistic influence (CLI; Jarvis 2010, 2012). The study uses multinomial logistic regression to classify the first language (L1) group membership of essays at two proficiency levels based on systematic lexical and phrasal choices made by members of five L1 groups. The results indicate that lower proficiency essays are significantly easier to classify than higher proficiency essays, suggesting that lower proficiency writers who share an L1 make lexical and phrasal choices that are more similar to one another than do higher proficiency writers who share an L1. A close analysis of the findings also indicates that the relationship between NLI accuracy and proficiency differed across L1 groups.
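For readers unfamiliar with this classification setup, the minimal Python sketch below shows multinomial logistic regression over word n-gram counts, which here stand in for the lexical and phrasal choice features described above. The toy essays, L1 labels, and n-gram range are illustrative assumptions, not the study's data or feature set.

```python
# Minimal sketch: L1 classification with multinomial logistic regression over
# word n-gram counts. The essays and labels below are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

essays = [                       # hypothetical learner essays
    "I am agree with the statement because ...",
    "I am agree that students should ...",
    "On the other hand , it depends of the situation ...",
    "It depends of many factors , for example ...",
]
l1_labels = ["Spanish", "Spanish", "French", "French"]  # hypothetical L1 groups

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),  # word 1- to 3-grams as a proxy for lexical/phrasal choices
    LogisticRegression(max_iter=1000),    # softmax regression over the L1 groups
)
model.fit(essays, l1_labels)
print(model.predict(["I am agree , but it depends of the context ."]))
```

In the study's setting, the same pipeline would be trained and evaluated separately on the lower- and higher-proficiency subsets so that classification accuracy can be compared across proficiency levels.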


2020 ◽  
pp. 1-31
Author(s):  
Ilia Markov ◽  
Vivi Nastase ◽  
Carlo Strapparava

Native language identification (NLI), the task of automatically identifying the native language (L1) of persons based on their writings in the second language (L2), is based on the hypothesis that characteristics of L1 will surface and interfere in the production of texts in L2 to the extent that L1 is identifiable. We present an in-depth investigation of features that model a variety of linguistic phenomena potentially involved in native language interference in the context of the NLI task: the languages' structuring of information through punctuation usage, emotion expression in language, and similarities of form with the L1 vocabulary through the use of anglicized words, cognates, and other misspellings. The results of experiments with different combinations of features in a variety of settings allow us to quantify the native language interference value of these linguistic phenomena and show how robust they are in cross-corpus experiments and with respect to proficiency in L2. These experiments provide a deeper insight into the NLI task, showing how native language interference explains the gap between baseline, corpus-independent features and the state of the art, which relies on features and representations that indiscriminately cover a variety of linguistic phenomena.
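A hypothetical sketch of how corpus-independent feature groups of the kind described above (punctuation usage, emotion expression, spelling-based cues) could be encoded. The lexicon, the doubled-vowel heuristic, and the function name are assumptions for illustration, not the authors' feature set.

```python
# Hypothetical corpus-independent feature groups for NLI: punctuation usage,
# emotion-word ratio, and a crude spelling-based cue. All lists are toy examples.
from collections import Counter
import re
import string

EMOTION_WORDS = {"happy", "sad", "angry", "afraid", "surprised"}  # assumed toy lexicon

def extract_features(text: str) -> dict:
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    n = max(len(tokens), 1)
    feats = {}
    # Punctuation usage: relative frequency of each punctuation mark
    for mark, count in Counter(ch for ch in text if ch in string.punctuation).items():
        feats[f"punct_{mark}"] = count / n
    # Emotion expression: share of tokens drawn from the emotion lexicon
    feats["emotion_ratio"] = sum(t in EMOTION_WORDS for t in tokens) / n
    # Spelling-based cue: share of tokens with doubled vowels, a crude stand-in
    # for L1-influenced misspelling patterns
    feats["doubled_vowel_ratio"] = sum(bool(re.search(r"aa|ee|ii|oo|uu", t)) for t in tokens) / n
    return feats

print(extract_features("I am verry happy , but a littlle afraid ."))
```

Feature dictionaries of this form can be vectorized and passed to any standard classifier, which is what makes such features easy to test in cross-corpus settings.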


Author(s):  
N. V. Remnev ◽  

Native Language Identification (NLI) is the task of automatically recognizing an author's native language (L1) on the basis of texts written in a language that is non-native to the author. The NLI task has been studied in detail for English: two shared tasks were conducted in 2013 and 2017, using TOEFL English essays and essay samples as data. There is also a small number of works in which the NLI problem was solved for other languages; for Russian it was investigated by Ladygina (2017) and Remnev (2019). This paper discusses the use of approaches well established in the NLI Shared Task 2013 and 2017 competitions to recognize the author's native language, as well as to recognize the type of speaker: learners of Russian or heritage speakers of Russian. The native language identification task is also solved on the basis of the types of errors specific to different native languages. This study is data-driven and is made possible by the Russian Learner Corpus, developed by the Higher School of Economics (HSE) Learner Russian Research Group, on which the experiments are conducted.
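As a rough illustration of error-type-based classification, the sketch below counts error tags per text and feeds the counts to a simple classifier. The tag names, toy data, and choice of Naive Bayes are hypothetical; they do not reflect the Russian Learner Corpus annotation scheme or the author's actual method.

```python
# Hedged sketch: NLI from error-type annotations, assuming each learner text is
# paired with a list of error tags. Tags, data, and model are illustrative only.
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts_errors = [                          # hypothetical (error tags, L1 label) pairs
    (["case", "case", "aspect"], "English"),
    (["gender", "aspect"], "French"),
]
X = [Counter(tags) for tags, _ in texts_errors]   # error-type counts per text
y = [l1 for _, l1 in texts_errors]

model = make_pipeline(DictVectorizer(), MultinomialNB())
model.fit(X, y)
print(model.predict([Counter(["case", "aspect"])]))
```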


Author(s):  
Ria Ambrocio Sagum

The study developed a tool for identifying the native Filipino language of a given text. The languages identified were Cebuano, Kapampangan, and Pangasinan. The tool used a Markov chain model for language modeling over bags of words (a total of 35,144 words for Cebuano, 14,752 for Kapampangan, and 13,969 for Pangasinan) and a maximum likelihood decision rule for identifying the native language. The resulting Markov model was applied to one hundred fifty text files with a minimum length of ten words and a maximum length of fifty words. The evaluation shows a system accuracy of 86.25% and an F-score of 90.55%.
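The following minimal sketch illustrates the identification scheme described above: one word-bigram Markov model per language and a maximum-likelihood decision rule. The training phrases are placeholders rather than the study's corpora, and the add-one smoothing is an assumption the abstract does not specify.

```python
# Minimal sketch: per-language word-bigram Markov models with a maximum-likelihood
# decision rule. Training phrases are toy placeholders, not the study's data.
import math
from collections import Counter

def train_markov(sentences):
    """Collect bigram and unigram counts; smoothing is applied at query time."""
    bigrams, unigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return bigrams, unigrams

def log_likelihood(text, model, vocab_size):
    bigrams, unigrams = model
    words = ["<s>"] + text.lower().split()
    score = 0.0
    for prev, curr in zip(words, words[1:]):
        # add-one smoothed transition probability P(curr | prev)
        score += math.log((bigrams[(prev, curr)] + 1) / (unigrams[prev] + vocab_size))
    return score

corpora = {                      # toy training data per language (placeholders)
    "Cebuano": ["maayong buntag kanimo", "salamat kaayo"],
    "Kapampangan": ["mayap a abak keka", "dakal a salamat"],
    "Pangasinan": ["maabig ya kabuasan ed sika", "salamat na dakel"],
}
models = {lang: train_markov(sents) for lang, sents in corpora.items()}
vocab = {w for _, unigrams in models.values() for w in unigrams}

def identify(text):
    # maximum-likelihood decision: pick the language whose model scores the text highest
    return max(models, key=lambda lang: log_likelihood(text, models[lang], len(vocab)))

print(identify("salamat kaayo kanimo"))
```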


2017 ◽  
Author(s):  
Artur Kulmizev ◽  
Bo Blankers ◽  
Johannes Bjerva ◽  
Malvina Nissim ◽  
Gertjan van Noord ◽  
...  
