Native Language Identification using probabilistic graphical models

Author(s):  
Garrett Nicolai ◽  
Md Asadul Islam ◽  
Russ Greiner
2015

Author(s):  
Shervin Malmasi ◽  
Joel Tetreault ◽  
Mark Dras

2015 ◽  
Vol 1 (2) ◽  
pp. 187-209

Author(s):  
Kristopher Kyle ◽  
Scott A. Crossley ◽  
YouJin Kim

This study evaluates the impact of writing proficiency on native language identification (NLI), a topic that has important implications for the generalizability of NLI models and for detection-based arguments for cross-linguistic influence (CLI; Jarvis 2010, 2012). The study uses multinomial logistic regression to classify the first language (L1) group membership of essays at two proficiency levels based on systematic lexical and phrasal choices made by members of five L1 groups. The results indicate that lower proficiency essays are significantly easier to classify than higher proficiency essays, suggesting that lower proficiency writers who share an L1 make lexical and phrasal choices that are more similar to one another than do higher proficiency writers who share an L1. A close analysis of the findings also indicates that the relationship between NLI accuracy and proficiency differed across L1 groups.
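
A minimal sketch of the kind of multinomial logistic regression setup described above, written with scikit-learn; the feature choice (word uni/bigram counts) and the per-proficiency evaluation loop are assumptions for illustration, not the study's actual configuration.

```python
# Hypothetical sketch: multinomial logistic regression over lexical/phrasal
# count features for L1 classification, evaluated separately per proficiency
# band. Feature extraction and data loading are placeholders, not the study's
# actual setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def l1_classifier():
    """Word uni/bigram counts feeding a multinomial logistic regression
    (the default lbfgs solver fits a multinomial model for multi-class y)."""
    return make_pipeline(
        CountVectorizer(ngram_range=(1, 2), min_df=5),
        LogisticRegression(max_iter=1000),
    )


def accuracy_by_proficiency(essays, l1_labels, proficiency):
    """Cross-validated NLI accuracy computed separately per proficiency level."""
    results = {}
    for level in set(proficiency):
        idx = [i for i, p in enumerate(proficiency) if p == level]
        X = [essays[i] for i in idx]
        y = [l1_labels[i] for i in idx]
        results[level] = cross_val_score(l1_classifier(), X, y, cv=5).mean()
    return results
```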


Author(s):  
Arjun P. Athreya ◽  
Tanja Brückl ◽  
Elisabeth B. Binder ◽  
A. John Rush ◽  
Joanna Biernacka ◽  
...  

Abstract Heterogeneity in the clinical presentation of major depressive disorder and in response to antidepressants limits clinicians’ ability to accurately predict a specific patient’s eventual response to therapy. Validated depressive symptom profiles may be an important tool for identifying poor outcomes early in the course of treatment. To derive these symptom profiles, we first examined data from 947 depressed subjects treated with selective serotonin reuptake inhibitors (SSRIs) to delineate the heterogeneity of antidepressant response using probabilistic graphical models (PGMs). We then used unsupervised machine learning to identify specific depressive symptoms and thresholds of improvement at 4 weeks that were predictive of whether a patient would achieve remission, response, or nonresponse by 8 weeks. Four depressive symptoms (depressed mood, guilt feelings and delusion, work and activities, and psychic anxiety) and specific thresholds of change in each at 4 weeks predicted eventual outcome at 8 weeks of SSRI therapy with an average accuracy of 77% (p = 5.5E-08). The same four symptoms and prognostic thresholds derived from patients treated with SSRIs correctly predicted outcomes in 72% (p = 1.25E-05) of 1996 patients treated with other antidepressants in both inpatient and outpatient settings in independent publicly available datasets. These predictive accuracies were higher than the 53% accuracy for predicting SSRI response achieved using approaches that (i) incorporated only baseline clinical and sociodemographic factors, or (ii) used 4-week nonresponse status to predict likely outcomes at 8 weeks. The present findings suggest that PGMs providing interpretable predictions have the potential to enhance clinical treatment of depression and reduce the time burden associated with trials of ineffective antidepressants. Prospective trials examining this approach are forthcoming.
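
The abstract does not spell out the model internals; as a rough, purely illustrative sketch of the idea that per-symptom improvement thresholds at week 4 can predict the week-8 outcome, the snippet below uses a shallow decision tree as a stand-in for learned thresholds. The symptom names, data layout, and the choice of a decision tree are assumptions, not the study's PGM-based method.

```python
# Illustrative stand-in only: a shallow decision tree whose split points play
# the role of per-symptom improvement thresholds at week 4 that predict the
# week-8 outcome (remission / response / nonresponse). Symptom names, column
# layout, and the tree itself are assumptions, not the study's PGM pipeline.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

SYMPTOMS = ["depressed_mood", "guilt", "work_activities", "psychic_anxiety"]


def fit_threshold_model(week4_change, week8_outcome):
    """week4_change: (n_patients, 4) array of symptom change at week 4.
    week8_outcome: labels in {"remission", "response", "nonresponse"}."""
    model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20)
    model.fit(week4_change, week8_outcome)
    print(export_text(model, feature_names=SYMPTOMS))  # inspect the learned thresholds
    return model


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))  # placeholder symptom-change data
    y = rng.choice(["remission", "response", "nonresponse"], size=500)
    fit_threshold_model(X, y)
```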


2020 ◽  
pp. 1-31
Author(s):  
Ilia Markov ◽  
Vivi Nastase ◽  
Carlo Strapparava

Abstract Native language identification (NLI)—the task of automatically identifying the native language (L1) of persons based on their writings in the second language (L2)—is based on the hypothesis that characteristics of L1 will surface and interfere in the production of texts in L2 to the extent that L1 is identifiable. We present an in-depth investigation of features that model a variety of linguistic phenomena potentially involved in native language interference in the context of the NLI task: the languages’ structuring of information through punctuation usage, emotion expression in language, and similarities of form with the L1 vocabulary through the use of anglicized words, cognates, and other misspellings. The results of experiments with different combinations of features in a variety of settings allow us to quantify the native language interference value of these linguistic phenomena and show how robust they are in cross-corpus experiments and with respect to proficiency in L2. These experiments provide a deeper insight into the NLI task, showing how native language interference explains the gap between baseline, corpus-independent features, and the state of the art that relies on features/representations that cover (indiscriminately) a variety of linguistic phenomena.
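
As a rough illustration of how one family of these features (punctuation usage) can be turned into classifier input, the sketch below computes relative punctuation frequencies and fits a linear SVM; the feature set and classifier settings are assumptions and do not reproduce the paper's emotion or cognate/misspelling features.

```python
# Illustrative sketch: relative punctuation frequencies as NLI features for a
# linear SVM. Only the punctuation-usage feature family is mirrored here; the
# emotion and anglicized-word/cognate features are not implemented.
import string

import numpy as np
from sklearn.svm import LinearSVC

PUNCT = list(string.punctuation)


def punctuation_features(text):
    """Frequency of each punctuation mark per 1000 characters of text."""
    n = max(len(text), 1)
    return np.array([1000.0 * text.count(p) / n for p in PUNCT])


def build_matrix(texts):
    return np.vstack([punctuation_features(t) for t in texts])


def train_l1_classifier(texts, l1_labels):
    """Fit a linear SVM on punctuation-frequency features (illustration only)."""
    return LinearSVC().fit(build_matrix(texts), l1_labels)
```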


Author(s):  
Andrés Cano ◽  
Manuel Gómez-Olmedo ◽  
Serafín Moral ◽  
Cora B. Pérez-Ariza

Author(s):  
N. V. Remnev

Native Language Identification (NLI) is the task of automatically recognizing an author's native language (L1) based on texts written in a language that is non-native to the author. The NLI task has been studied in detail for English, and two shared tasks were conducted in 2013 and 2017 using TOEFL English essays and essay samples as data. A smaller number of works have addressed the NLI problem for other languages. For Russian, the NLI problem was investigated by Ladygina (2017) and Remnev (2019). This paper discusses the use of approaches that proved effective in the NLI Shared Task 2013 and 2017 competitions to recognize the author's native language, as well as to recognize the type of speaker: learners of Russian or Heritage Russian speakers. The native language identification task is also solved based on the types of errors specific to different languages. This study is data-driven and is made possible by the Russian Learner Corpus, developed by the Higher School of Economics (HSE) Learner Russian Research Group, on the basis of which the experiments are conducted.
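
The shared-task systems referred to above typically relied on character and word n-grams with linear classifiers; the sketch below shows such a baseline in scikit-learn. Corpus loading is left as a placeholder (the Russian Learner Corpus has to be obtained and parsed separately), and the error-type features mentioned in the abstract are not implemented.

```python
# Minimal sketch of a typical NLI shared-task baseline: character n-gram
# tf-idf features with a linear SVM. Data loading is a placeholder; the
# Russian Learner Corpus must be obtained and parsed separately.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def ngram_svm_baseline():
    return make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), sublinear_tf=True),
        LinearSVC(C=1.0),
    )


def evaluate(texts, labels, folds=5):
    """Cross-validated accuracy for predicting L1 (or learner vs. heritage speaker)."""
    return cross_val_score(ngram_svm_baseline(), texts, labels, cv=folds).mean()
```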

