Integrating learner corpora and natural language processing: A crucial step towards reconciling technological sophistication and pedagogical effectiveness

ReCALL ◽  
2007 ◽  
Vol 19 (3) ◽  
pp. 252-268 ◽  
Author(s):  
Sylviane Granger ◽  
Olivier Kraif ◽  
Claude Ponton ◽  
Georges Antoniadis ◽  
Virginie Zampa

Abstract Learner corpora, electronic collections of spoken or written data from foreign language learners, offer unparalleled access to many hitherto uncovered aspects of learner language, particularly in their error-tagged format. This article aims to demonstrate the role that the learner corpus can play in CALL, particularly when used in conjunction with web-based interfaces which provide flexible access to error-tagged corpora that have been enhanced with simple NLP techniques such as POS-tagging or lemmatization and linked to a wide range of learner and task variables such as mother tongue background or activity type. This new resource is of interest to three main types of users: teachers wishing to prepare pedagogical materials that target learners' attested difficulties; learners themselves, for editing or language awareness purposes; and NLP researchers, for whom it serves as a benchmark for testing automatic error detection systems.
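The kind of query interface described above can be sketched in a few lines. The annotation scheme below (error codes, variable names, tokens) is entirely hypothetical and stands in for the richer tagsets used in real error-tagged learner corpora:

```python
# Minimal sketch of querying an error-tagged, POS-tagged learner corpus by
# learner and task variables. All codes and values here are invented for
# illustration, not an actual corpus annotation scheme.

corpus = [
    {"token": "informations", "lemma": "information", "pos": "NOUN",
     "error": "GN",  # hypothetical code: noun-number error
     "l1": "French", "task": "argumentative essay"},
    {"token": "discuss", "lemma": "discuss", "pos": "VERB",
     "error": None, "l1": "French", "task": "argumentative essay"},
    {"token": "depends", "lemma": "depend", "pos": "VERB",
     "error": "XVPR",  # hypothetical code: verb + wrong preposition
     "l1": "Spanish", "task": "letter"},
]

def query(corpus, l1=None, error=None, pos=None):
    """Return tokens matching the given learner and error variables."""
    hits = []
    for tok in corpus:
        if l1 and tok["l1"] != l1:
            continue
        if error and tok["error"] != error:
            continue
        if pos and tok["pos"] != pos:
            continue
        hits.append(tok)
    return hits

noun_errors = query(corpus, l1="French", error="GN")
print([t["token"] for t in noun_errors])  # ['informations']
```

Cross-tabulating error tags with variables such as mother tongue is then a matter of composing such filters, which is what a web interface over the corpus would expose.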

Author(s):  
Yan Huang ◽  
Akira Murakami ◽  
Theodora Alexopoulou ◽  
Anna Korhonen

Abstract As large-scale learner corpora become increasingly available, it is vital that natural language processing (NLP) technology is developed to provide the rich linguistic annotations necessary for second language (L2) research. We present a system for automatically analyzing subcategorization frames (SCFs) for learner English. SCFs link lexis with morphosyntax, shedding light on the interplay between lexical and structural information in learner language. Moreover, SCFs are crucial to the study of a wide range of phenomena including individual verbs, verb classes and varying syntactic structures. To illustrate the usefulness of our system for learner corpus research and second language acquisition (SLA), we investigate how L2 learners diversify their use of SCFs in text and how this diversity changes with L2 proficiency.
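SCF diversity of the kind investigated here can be quantified with simple distributional measures. The SCF labels and per-proficiency observations below are invented toy data; the actual system extracts SCFs automatically from parsed learner text:

```python
# Toy sketch: measuring how learners diversify subcategorization frames (SCFs).
# Hypothetical SCF observations for the verb "give" at two proficiency levels.
from collections import Counter
from math import log2

scf_by_level = {
    "B1": ["NP", "NP", "NP", "NP_PPto", "NP"],
    "C1": ["NP", "NP_PPto", "NP_NP", "NP_PPto", "NP_NP", "NP"],
}

def scf_diversity(observations):
    """Type count and Shannon entropy of the SCF distribution."""
    counts = Counter(observations)
    total = sum(counts.values())
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return len(counts), entropy

for level, obs in scf_by_level.items():
    types, h = scf_diversity(obs)
    print(level, types, round(h, 2))
```

On this toy data the higher-proficiency sample uses more SCF types and has a more even (higher-entropy) distribution, which is the shape of result the diversity analysis is after.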


Author(s):  
A. S. Vyrenkova ◽  
I. Yu. Smirnov

Learner corpora serve as one of the most valuable sources of statistical data on learners' errors. For instance, data from foreign-language learner corpora can be used in Second Language Acquisition (SLA) research. However, the representativeness of a corpus depends strongly on the quality of its error markup, which is most frequently carried out manually and is thus a time-consuming and painstaking routine for the annotators. To ease the annotation process, additional tools such as spellcheckers are usually used. This paper focuses on developing a program for the automatic correction of derivational errors made by learners of Russian as a foreign language. Derivational errors, which are uncommon for adult native speakers of Russian (L1) but occur quite often in the written texts and speech of learners of Russian as a foreign language (L2) [Chernigovskaya, Gor, 2000], were chosen as the scope of our research because correcting such mistakes presents a formidable challenge for existing spellcheckers. Using data from the Russian Learner Corpus (http://www.web-corpora.net/RLC/), we tested two existing approaches to this kind of problem. The first, based on the finite-state automaton principle developed by Dickinson and Herring (2008), was tested as an algorithm for derivational error detection. The second, which relies on the Noisy Channel model of Brill and Moore (2000), was used to study error correction. After analyzing the effectiveness of these tests, we developed our own system for the autocorrection of derivational errors. In our program, the Dickinson and Herring algorithm is used as the word-formation error detection module. The Noisy Channel model was rejected; instead, we use the Continuous Bag of Words FastText model, based on Harris's (1954) distributional semantics theory. In addition, filtering rules have been developed for correcting frequent errors that the model is unable to handle.
To automatically restore the correct grammatical word form, a dictionary of word paradigms is used. Model results were validated on data from the Russian Learner Corpus.
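The detect-then-correct pipeline described above can be illustrated with a drastically simplified sketch. Here an out-of-lexicon check stands in for the Dickinson and Herring detection module, string-similarity ranking via `difflib` stands in for the FastText CBOW ranking, and the toy Latin-transliterated word list is invented, not the real Russian resources:

```python
# Simplified sketch of a derivational-error correction pipeline:
# 1) detect: flag forms absent from the lexicon;
# 2) correct: rank candidate replacements by surface similarity.
import difflib

# Toy transliterated "lexicon"; a real system uses full Russian lexical resources.
lexicon = ["pisatel", "pisat", "perepisat", "zapisat", "uchitel", "chitat"]

def detect_and_correct(word, lexicon, n=3):
    """Flag out-of-lexicon forms and propose ranked corrections."""
    if word in lexicon:
        return None  # no derivational error detected
    return difflib.get_close_matches(word, lexicon, n=n, cutoff=0.6)

# A learner's over-derived form is flagged and corrected to the attested noun.
print(detect_and_correct("pisatelnik", lexicon))
```

In the actual system the ranking step would come from embedding similarity rather than edit similarity, and a paradigm dictionary would then restore the grammatically appropriate word form.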


ICAME Journal ◽  
2014 ◽  
Vol 38 (1) ◽  
pp. 115-135 ◽  
Author(s):  
Ute Römer ◽  
Audrey Roberson ◽  
Matthew B. O’Donnell ◽  
Nick C. Ellis

Abstract This paper combines data from learner corpora and psycholinguistic experiments in an attempt to find out what advanced learners of English (first language backgrounds German and Spanish) know about a range of common verb-argument constructions (VACs), such as the ‘V about n’ construction (e.g. she thinks about chocolate a lot). Learners’ dominant verb-VAC associations are examined based on evidence retrieved from the German and Spanish subcomponents of ICLE and LINDSEI and collected in lexical production tasks in which participants complete VAC frames (e.g. ‘he ___ about the...’) with verbs that may fill the blank (e.g. talked, thought, wondered). The paper compares findings from the different data sets and highlights the value of linking corpus and experimental evidence in studying linguistic phenomena.
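Harvesting verb-VAC associations from corpus text can be sketched with a naive pattern match. The sentences below are invented examples, and a real study would work over parsed ICLE/LINDSEI data rather than raw strings:

```python
# Illustrative sketch: counting verbs that fill the 'V about n' frame.
# Naive assumption: the word directly preceding "about" occupies the verb slot.
import re
from collections import Counter

sentences = [
    "She thinks about chocolate a lot.",
    "He talked about the weather.",
    "They worried about the exam.",
    "We talked about music yesterday.",
]

pattern = re.compile(r"\b(\w+)\s+about\b", re.IGNORECASE)

verb_counts = Counter(
    m.group(1).lower() for s in sentences for m in pattern.finditer(s)
)
print(verb_counts.most_common())  # [('talked', 2), ('thinks', 1), ('worried', 1)]
```

The resulting frequency ranking is the corpus-side analogue of the production-task data, where the most strongly associated verbs are the ones participants produce first for the frame.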


2018 ◽  
Vol 9 (5) ◽  
pp. 1053
Author(s):  
Ehsan Alijanian ◽  
Saeed Ketabi ◽  
Ahmad Moinzadeh

Negotiation of meaning refers to the interactional work done by interlocutors to attain joint understanding when a communication difficulty arises. This study takes a qualitative perspective to examine, moment by moment, how participant utterances develop in interaction. Ten English as a foreign language learners in a language school in Iran were chosen to participate in a dictogloss activity in which they were required to describe a certain word. The interaction features in their lexical language-related episodes (LLREs) were analyzed. The results indicate that students use a wide range of interaction features in their collaborations. These features help learners build a scaffolding structure within the LLREs in which meaning is discovered. The use of interactive features fostered metalinguistic awareness and encouraged learners’ self-regulation.


Author(s):  
Zarah Weiss ◽  
Detmar Meurers

Abstract While traditionally linguistic complexity analysis of learner language is mostly based on essays, there is increasing interest in other task types. This is crucial for obtaining a broader empirical basis for characterizing language proficiency and highlights the need to advance our understanding of how task and learner properties interact in shaping the linguistic complexity of learner productions. It also makes it important to determine which complexity measures generalize well across which tasks. In this paper, we investigate the linguistic complexity of answers to reading comprehension questions written by foreign language learners of German at the college level. Analyzing the corpus with computational linguistic methods that identify a wide range of complexity features, we explore which linguistic complexity analyses can successfully be performed for such short answers, how learner proficiency impacts the results, how generalizable they are across different contexts, and how the quality of the underlying analysis impacts the results.
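Two of the simplest surface complexity measures in this family can be computed directly from the text. This is only a sketch of the idea; the full system computes a much wider feature set and relies on syntactic parsing, and the sample answer below is invented:

```python
# Minimal sketch of surface complexity measures for short learner answers:
# mean sentence length (in tokens) and type-token ratio.
import re

def complexity(text):
    """Return (mean sentence length, type-token ratio) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-zÄÖÜäöüß]+", text.lower())
    mean_sent_len = len(tokens) / len(sentences)
    type_token_ratio = len(set(tokens)) / len(tokens)
    return mean_sent_len, type_token_ratio

answer = "Der Mann liest das Buch. Das Buch ist sehr interessant."
msl, ttr = complexity(answer)
print(round(msl, 2), round(ttr, 2))
```

One reason short answers are challenging, as the abstract notes, is that length-sensitive measures like the type-token ratio are unstable on texts of only a few sentences, so not every essay-based measure transfers to this task type.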


2021 ◽  
pp. 162-177
Author(s):  
Antra Kļavinska

Several text corpora have been created in Latvia, including learner corpora. One of the latest projects is the Latvian Language Learner Corpus (LaVA), which contains the works of international students at Latvian higher education institutions who are learning Latvian as a foreign language. The texts are morphologically tagged automatically, and learner errors are tagged manually. A sufficient body of publications provides the theoretical basis for the creation of Latvian language learner corpora; however, there is a lack of studies or practical methodological guidelines on how such corpora can be applied, and there is little data about the use of text corpora in language acquisition. The aim of this study is to explain, from a theoretical perspective, the purposes for which learner corpus data may be used, and to illustrate the methodological groundwork with examples from the LaVA corpus. Analysis of the theoretical literature has demonstrated the functions and significance of learner corpora in research, and experience with the use of corpora in foreign language acquisition has been analysed. Examples of the use of the LaVA corpus as a didactic resource have been prepared using Corpus Linguistics methods. The study was conducted within the state research programme project “The Latvian Language”. After studying the functions of learner corpora from a theoretical perspective, it was concluded that the target audience of the LaVA corpus mainly includes teachers of Latvian as a foreign language (LATS), authors of teaching materials, and Latvian language learners. To make effective use of the LaVA corpus, it is important to have basic knowledge of Corpus Linguistics, an understanding of language theory, and an understanding of foreign language teaching methodology.
LATS teachers can use LaVA corpus data in the creation of curricula and teaching materials, in the preparation of language proficiency tests, etc. Using an inductive approach to language acquisition, language learners can also become language researchers and analyse the errors of other learners. Undeniably, the LaVA corpus can also be used in broader linguistic research, for example in contrastive interlanguage analysis, comparing the data of language learners with the data of native speakers or of different groups of language learners.


2020 ◽  
Vol 4 (2) ◽  
Author(s):  
Paul A. Malovrh ◽  
James F. Lee ◽  
Stephen Doherty ◽  
Alecia Nichols

The present study measured the effects of guided-inductive (GI) versus deductive computer-delivered instruction on the processing and retention of the Spanish true passive using a self-paced reading design. Fifty-four foreign language learners of Spanish participated in the study, which operationalised the guided-inductive and deductive approaches using an adaptation of the PACE model and processing instruction (PI), respectively. Results revealed that each experimental group significantly improved after the pedagogical intervention, and that the GI group outperformed the PI group in terms of accuracy on an immediate post-test. Differences between the groups, however, did not persist: at the delayed post-test, the groups performed the same. Additional analyses revealed that the GI group spent over twice as much time on task during instruction as the PI group, with no long-term advantage in processing, calling into question the pedagogical justification for implementing GI at the curricular level.


ReCALL ◽  
2001 ◽  
Vol 13 (1) ◽  
pp. 110-120 ◽  
Author(s):  
ANNE VANDEVENTER

Intelligent feedback on learners’ full written sentence productions requires the use of Natural Language Processing (NLP) tools and, in particular, of a diagnosis system. Most syntactic parsers, on which grammar checkers are based, are designed to parse grammatical sentences and/or native speaker productions. They are therefore not necessarily suitable for language learners. In this paper, we concentrate on the transformation of a French syntactic parser into a grammar checker geared towards intermediate to advanced learners of French. Several techniques are envisaged to allow the parser to handle ill-formed input, including constraint relaxation. By the very nature of this technique, parsers can generate complete analyses for ungrammatical sentences. Proper labelling of where the analysis has been able to proceed thanks to a specific constraint relaxation forms the basis of the error diagnosis. Parsers with relaxed constraints tend to produce more complete, although incorrect, analyses for grammatical sentences, and several complete analyses for ungrammatical sentences. This increased number of analyses per sentence has one major drawback: it slows down the system and requires more memory. An experiment was conducted to observe the behaviour of our parser in the context of constraint relaxation. Three specific constraints, agreement in number, gender, and person, were selected and relaxed in different combinations. A learner corpus was parsed with each combination. The evolution of the number of correct diagnoses and of parsing speed, among other factors, was monitored. We then evaluated, by comparing the results, whether large-scale constraint relaxation is a viable option to transform our syntactic parser into an efficient grammar checker for CALL.
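The core idea of constraint relaxation as error diagnosis can be shown in miniature: an agreement constraint is allowed to fail, and each failure is recorded as a diagnosis instead of blocking the analysis. The feature values below are invented toy data; a real grammar checker applies this over full syntactic analyses:

```python
# Toy illustration of constraint relaxation for error diagnosis: agreement
# constraints (number, gender) are checked but may be relaxed, and each
# relaxation needed to complete the analysis becomes an error diagnosis.

def check_agreement(det, noun, relaxable=("number", "gender")):
    """Return the list of agreement constraints that had to be relaxed."""
    diagnoses = []
    for feature in relaxable:
        if det[feature] != noun[feature]:
            diagnoses.append(f"{feature} agreement violated")
    return diagnoses

# 'la' (fem. sg.) with 'chevaux' (masc. pl.): two constraints must be relaxed.
det = {"form": "la", "number": "sg", "gender": "f"}
noun = {"form": "chevaux", "number": "pl", "gender": "m"}
print(check_agreement(det, noun))
```

The cost mentioned in the abstract also shows up here in principle: the more constraints are relaxable, the more candidate analyses survive, which is why relaxing all three agreement constraints at once multiplies the parser's work.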


Author(s):  
Agneta Marie-Louise Svalberg

Abstract English tense presents second/foreign language learners with considerable cognitive challenges and, it will be argued, grammars and textbooks are generally inadequate sources of knowledge of the tense system as system. A modified version of Reichenbach's (1947. Elements of Symbolic Logic. New York: Macmillan) tense model is then presented. The original model has been criticized for its inability to deal with temporal relationships in natural text (e.g. Declerck, R. 1986. From Reichenbach (1947) to Comrie (1985) and beyond. Towards a theory of tense. Lingua 70. 305–364; Declerck, R. 2015. Tense in English. Its structure and use in discourse. London: Routledge; Carroll, M., C. Von Stutterheim & W. Klein. 2003. Two ways of construing complex temporal structures. In F. Lenz (ed.), Deictic Conceptualisation of Time, Space and Person, 97–134. Amsterdam: Benjamins). It is argued here instead that speakers employ the limited choices the system provides creatively, to express a wide range of temporal and interpersonal relations in the real world. The tense-aspect and tense-modality interfaces are briefly discussed. A pedagogical Language Awareness approach (Svalberg, A. M-L. 2007. Language Awareness and Language Learning. Language Teaching 40(4). 287–308) is then illustrated, with the theoretical model as mediating artefact providing visual and metalinguistic scaffolding, allowing learners to investigate tense use in context while drawing on both intuitive understanding and conscious knowledge.


2019 ◽  
Vol 8 (2) ◽  
pp. 6111-6116

Digitization of local languages is gaining importance, and language processing tasks are becoming popular among linguists and IT professionals alike. Most people are most comfortable with their native mother tongue, and writing correctly spelled word forms on digital platforms is important for the future existence of a language. In this research work, Assamese, one of the Indian languages, is taken as the natural language processed in the experiments; research and development on Assamese is ongoing, and from the computational point of view the language is still in the development phase. In Assamese, there are some similar characters that are phonetically the same but have different glyphs; these characters or symbols often cause confusion for users while writing, and they are given special consideration in this research work. A list of 14 confusing character pairs of Assamese letters is taken for experimental purposes. In addition, this research work focuses on errors in Assamese words, which are checked using bigram and trigram models. Moreover, the proposed model also tries to find the erroneous character that causes the incorrectness and shows suggestions for that incorrect character. A score-based system is designed for the Assamese characters, and each character is assigned a score based on its probability of occurrence under the bigram and trigram language models. Different experiments are performed to check the correctness of Assamese words, and the proposed model achieves accuracy ranging from 81% to 86%. The error rate in Assamese can be reduced by using this model on any digital platform where a user can type in Assamese.
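The bigram part of the character-scoring idea can be sketched as follows: train bigram counts on correct word forms, then flag the character position whose surrounding bigrams are least attested. Toy Latin-alphabet words stand in for the Assamese script here, and smoothing and the trigram model are omitted:

```python
# Sketch of bigram-based character scoring for error localization:
# each character is scored by how often its left and right bigrams occur
# in training data; the lowest-scoring position is the likely error.
from collections import Counter

def train_bigrams(words):
    """Count character bigrams over words padded with ^ (start) and $ (end)."""
    counts = Counter()
    for w in words:
        padded = "^" + w + "$"
        for a, b in zip(padded, padded[1:]):
            counts[(a, b)] += 1
    return counts

def weakest_position(word, counts):
    """Index of the character whose surrounding bigrams are least attested."""
    padded = "^" + word + "$"
    scores = []
    for i, ch in enumerate(word):
        left = counts[(padded[i], ch)]
        right = counts[(ch, padded[i + 2])]
        scores.append(left + right)
    return min(range(len(word)), key=scores.__getitem__)

counts = train_bigrams(["kitap", "kitab", "kita", "kitar"])
print(weakest_position("kixap", counts))  # flags 'x' at index 2
```

Once the weakest position is found, suggestions can be generated by substituting characters (for Assamese, the 14 confusable pairs) at that position and re-scoring the word.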

