Error Annotation: Recently Published Documents

TOTAL DOCUMENTS: 15 (five years: 7)

H-INDEX: 2 (five years: 1)

2021, Vol 22 (2), pp. 143-157
Author(s):  
Erik Angelone

Abstract To date, the assessment of student translations has been largely based on configurations of error categories that address some facet of the translation product. Focal points of such product-oriented error annotation include language mechanics (punctuation, grammar, lexis and syntax, for example) and various kinds of transfer errors. In recent years, screen recording technology has opened new doors for empirically informing translation assessment from a more process-oriented perspective (Massey and Ehrensberger-Dow, 2014; Angelone, 2019). Screen recording holds particular promise for tracing errors documented in the product back to potential underlying triggers, namely the on-screen processes that co-occur with their emergence. Assessor observations made during screen recording analysis can give shape to process-oriented error categories that parallel and complement product-oriented categories. This paper proposes a series of empirically informed, process-oriented error categories that can be used for assessing translations in contexts where screen recordings are applied as a diagnostic tool. The categories are based on lexical and semantic patterns derived from a corpus-based analysis of think-aloud protocols documenting articulations made by assessors while commenting on errors in student translations as they watched screen recordings of the students’ work. It is hoped that these process-oriented error categories will contribute to a more robust means of assessing and classifying errors in translation.


2021, pp. 91-104
Author(s):  
Zoya I. Rezanova

The article presents a solution to one of the problems of special linguistic markup in the RuTuBiC corpus (the Russian Speech Corpus of Russian-Turkic Bilinguals): error annotation at the lexical level. The corpus includes three subcorpora representing the Russian speech of Shor-Russian, Tatar-Russian and Khakass-Russian bilinguals. The article presents solutions developed on the basis of all subcorpora; the illustrative contexts are drawn from the Shor-Russian subcorpus, recordings of interviews with 14 respondents, about 20 hours of sound. The recordings were made during expeditions to Shoria in 2017–2019. The respondents’ bilingualism is early natural bilingualism with dominance of the second language, Russian; their mother tongues are languages of the family heritage. The theoretical basis of the research was work on language contact at the lexical level. The proposed solutions differentiate lexemes fully mastered by the system of standard Russian from units with the status of borrowings from other subsystems of the national language and from other languages. In the latter case, linguistic and contextual features are distinguished that oppose lexical borrowing to code-switching. The typical errors singled out at the lexical level are: [LexId] – idiomatic expressions not fixed in the standard language (dialectal, vernacular, slang, etc.), which can also be Turkic calques; [LexSem] – general Russian words used in meanings different from those fixed in the normative sources; [LexSemAgr] – violations of lexical-semantic agreement norms. The units borrowed from the respondents’ mother tongue are located on a scale of transitions from nuclear to borderline cases.
The nuclear units, marked with the [Lex] tag, are dialectal units, vernacular words and other word usages outside the standard, as well as borrowings from the Turkic languages that are not included in dictionaries of standard Russian. On the border “to the left” are borrowings assimilated to different degrees. On the border “to the right” are non-assimilated borrowings and code-switches. The [CodeSw] tag marks code-switching, the insertion of mother-tongue elements into Russian speech. The author treats inserted whole statements as nuclear cases of code-switching, and single lexical insertions as transitional cases. Code-switching is evidenced by metatextual and properly linguistic, primarily phonetic, indicators. The speech of the RuTuBiC respondents contains an insignificant number of both lexical borrowings and cases of code-switching, which depends on the type of bilingualism. The typicality of metatextual marking of borrowings and code-switches is determined by the discursive, genre-related and thematic limitations of the corpus.
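The lexical-level tag scheme described in this abstract can be sketched as a small annotation routine. The tag names ([Lex], [LexId], [LexSem], [LexSemAgr], [CodeSw]) come from the abstract; the glosses are paraphrased from it, and the example utterance, token spans and function are invented for illustration.

```python
# Illustrative sketch of the RuTuBiC lexical-level tag scheme described above.
# Tag names and glosses follow the abstract; the example data are invented.

LEXICAL_TAGS = {
    "Lex": "non-standard lexeme: dialectism, vernacular word, or Turkic "
           "borrowing not fixed in dictionaries of standard Russian",
    "LexId": "idiomatic expression not fixed in the standard language "
             "(possibly a Turkic calque)",
    "LexSem": "general Russian word used in a non-normative meaning",
    "LexSemAgr": "violation of lexical-semantic agreement norms",
    "CodeSw": "code-switch: insertion of mother-tongue elements into "
              "Russian speech",
}

def annotate(tokens, spans):
    """Attach a tag and gloss to each (start, end, tag) span over a token list."""
    annotated = []
    for start, end, tag in spans:
        if tag not in LEXICAL_TAGS:
            raise ValueError(f"unknown tag: {tag}")
        annotated.append({"text": " ".join(tokens[start:end]),
                          "tag": tag,
                          "gloss": LEXICAL_TAGS[tag]})
    return annotated

# Invented example: token 3 is a single borrowed word (a transitional case),
# tokens 5-7 form a whole inserted statement (a nuclear case of code-switching).
tokens = ["w0", "w1", "w2", "BORROWED", "w4", "SW1", "SW2", "SW3"]
result = annotate(tokens, [(3, 4, "Lex"), (5, 8, "CodeSw")])
```

The scale from nuclear to borderline cases described in the abstract is not encoded here; a fuller scheme would presumably attach a degree-of-assimilation attribute to each span.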


Author(s):  
Raquel Amaro ◽  
Susana Correia ◽  
Carolina Gramacho ◽  
Amália Mendes

The automatic diagnosis and analysis of foreign language learners’ production can contribute to overcoming the linguistic barriers that hinder the integration of migrant populations. The richness and complexity of the phenomena observed in this context, and the multiplicity of objectives served by automatic analysis tools, demonstrate the inevitability of manual data annotation and the importance of producing versatile resources in order to maximize their usability. The present study therefore contrasts the needs of automatic diagnosis systems with the analysis of the phenomena reflected in the annotations for European Portuguese, based on COPLE2 and the corpus analysis conducted within the scope of the POR Nível project, and proposes an annotation system that includes both error annotation and the annotation of structures associated with complexity. The results highlight the need to enhance the usability of resources, to acknowledge their value and to promote the necessary investment in their development.


Author(s):  
Roberts Darģis ◽  
Ilze Auzin̦a ◽  
Kristīne Levāne-Petrova ◽  
Inga Kaija

This paper presents a detailed error annotation approach for morphologically rich languages. The approach is used to create the Latvian Language Learner corpus (LaVA), part of the ongoing project Development of Learner Corpus of Latvian: Methods, Tools and Applications. There is no need for an advanced multi-token error annotation schema, because the error-annotated texts are written by beginner-level (A1 and A2) learners, who use simple syntactic structures. The schema instead focuses on in-depth categorization of spelling and word formation errors. It will work best for languages with relatively free word order and rich morphology.
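As a toy illustration of the kind of spelling-error categorization such a schema involves, one very rough first cut for Latvian (whose diacritics, such as macrons, are a frequent source of learner errors) might separate pure diacritic errors from other misspellings. The function names and categories below are invented for illustration, not LaVA’s actual tag set.

```python
# Hypothetical first-cut spelling-error classifier: is the deviation from the
# target form only a matter of diacritics? Categories are invented, not LaVA's.
import unicodedata

def strip_diacritics(word):
    """Remove combining marks (e.g. Latvian macrons or carons) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def classify_spelling(written, target):
    """Classify a learner form against its target form."""
    if written == target:
        return "correct"
    if strip_diacritics(written) == strip_diacritics(target):
        return "diacritic-error"
    return "other-spelling-error"

print(classify_spelling("macas", "mācās"))   # diacritic-error
print(classify_spelling("skuola", "skola"))  # other-spelling-error
```

A real schema would go much further (word-formation errors, case endings, and so on), but the example shows why in-depth categorization below the token level pays off for a language like Latvian.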


Author(s):  
Jūratė Ruzaitė ◽  
Sigita Dereškevičiūtė ◽  
Viktorija Kavaliauskaitė-Vilkinienė ◽  
Eglė Krivickaitė-Leišienė

This paper is a work-in-progress report on error annotation in the Lithuanian Learner Corpus (LLC), which has been developed using the TEITOK environment. The LLC is the first electronic corpus of learner Lithuanian that represents learners of very diverse native language backgrounds and different proficiency levels. In this paper, we have a double aim: firstly, we present the structure of the corpus in its current state; and secondly, we describe the main principles, procedures, and challenges of error annotation in the LLC. The main types of errors that are tagged in this corpus and analysed in this paper are orthographic, lexical, and syntactic.


Author(s):  
Iria Del Rio ◽  
Amália Mendes

We present the general architecture of the error annotation system applied to the COPLE2 corpus, a learner corpus of Portuguese implemented on the TEITOK platform. We give a general overview of the corpus and of the TEITOK functionalities and describe how the error annotation is structured as a two-level system: first, a fully manual, token-based and coarse-grained annotation produces a rough classification of the errors into three categories, paired with multi-level information for POS and lemma; second, a multi-word and fine-grained standoff annotation is then semi-automatically produced on the basis of the first level. The token-based level has been applied to 47% of the total corpus. We compare our system with other proposals for error annotation, and discuss the fine-grained tag set and the experiments conducted to validate its applicability. An inter-annotator agreement (IAA) experiment was performed on the two stages of our system using Cohen’s kappa, and it achieved good results on both levels. We explore the possibilities offered by the token-level error annotation, POS and lemma to automatically generate the fine-grained error tags by applying conversion scripts. The model is planned in such a way as to reduce manual effort and rapidly increase the coverage of the error annotation over the full corpus. As the first learner corpus of Portuguese with error annotation, we expect COPLE2 to support new research in fields connected with Portuguese as a second/foreign language, such as Second Language Acquisition/Teaching or Computer-Assisted Learning.
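The conversion idea sketched in the abstract (a coarse token-level error tag, combined with POS and lemma, feeding scripts that emit fine-grained tags) can be illustrated roughly as follows. The coarse category names and the rule table are hypothetical placeholders, not COPLE2’s actual tag set.

```python
# Hypothetical sketch of deriving fine-grained error tags from a coarse
# token-level tag plus POS information, in the spirit of the conversion
# scripts described above. Tag names and rules are invented for illustration.

CONVERSION_RULES = {
    ("GRAM", "VERB"): "GRAM.verb-inflection",
    ("GRAM", "NOUN"): "GRAM.agreement",
    ("LEX", "NOUN"): "LEX.word-choice",
    ("ORT", "NOUN"): "ORT.spelling",
}

def fine_grained_tag(coarse_tag, pos):
    """Map (coarse error tag, POS) to a fine-grained tag; keep coarse as fallback."""
    return CONVERSION_RULES.get((coarse_tag, pos), coarse_tag)

# A token carrying first-level annotation: form, lemma, POS and coarse error tag.
token = {"form": "*fazi", "lemma": "fazer", "pos": "VERB", "err": "GRAM"}
print(fine_grained_tag(token["err"], token["pos"]))  # GRAM.verb-inflection
```

The fallback to the coarse tag mirrors why such a pipeline is semi-automatic: rules cover the regular cases cheaply, and anything the table cannot resolve is left for a human annotator.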


Informatics, 2019, Vol 6 (3), p. 41
Author(s):  
Jennifer Vardaro ◽  
Moritz Schaeffer ◽  
Silvia Hansen-Schirra

This study aims to analyse how translation experts from the German department of the European Commission’s Directorate-General for Translation (DGT) identify and correct different error categories in neural machine translated texts (NMT) and their post-edited versions (NMTPE). The term translation expert encompasses translator, post-editor and revisor. Even though we focus on neural machine-translated segments, translator and post-editor are used synonymously because of the combined workflow using CAT tools as well as machine translation. Only the distinction between post-editor, a DGT translation expert correcting the neural machine translation output, and revisor, a DGT translation expert correcting the post-edited version of that output, is important and is made clear whenever relevant. Using an automatic error annotation tool and a more fine-grained manual error annotation framework to identify characteristic error categories in the DGT texts, a corpus analysis revealed that quality assurance measures by post-editors and revisors of the DGT are most often necessary for lexical errors. More specifically, the corpus analysis showed that, if post-editors correct mistranslations, terminology errors or stylistic errors in an NMT sentence, revisors are likely to correct the same error type in the same post-edited sentence, suggesting that the DGT experts were primed by the NMT output. Subsequently, we designed a controlled eye-tracking and key-logging experiment to compare participants’ eye movements for test sentences containing the three identified error categories (mistranslations, terminology errors or stylistic errors) and for control sentences without errors. We examined the three error types’ effect on early (first fixation durations, first pass durations) and late eye movement measures (e.g., total reading time and regression path durations).
Linear mixed-effects regression models predict what kind of behaviour of the DGT experts is associated with the correction of different error types during the post-editing process.
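The early and late measures named above have conventional definitions in reading research; a sketch of how they might be computed from a fixation sequence is given below. The fixation data, region layout and function are invented for illustration, and real eye-tracking pipelines involve considerably more preprocessing (fixation detection, region mapping, outlier handling).

```python
# Sketch of standard reading measures, assuming each fixation is recorded as
# (region_index, duration_ms), with region indices increasing left to right.

def eye_measures(fixations, roi):
    """Compute reading measures for one region of interest (roi)."""
    first_fixation = 0   # duration of the very first fixation on the roi
    first_pass = 0       # fixations on the roi before it is first left
    total_reading = 0    # all fixations on the roi
    regression_path = 0  # everything from entering the roi until moving past it
    in_first_pass = False
    in_regression_path = False
    for region, duration in fixations:
        if region == roi:
            total_reading += duration
            if first_fixation == 0:
                first_fixation = duration
                in_first_pass = True
                in_regression_path = True
            if in_first_pass:
                first_pass += duration
        elif in_first_pass:
            in_first_pass = False  # roi left for the first time: first pass ends
        if in_regression_path:
            if region > roi:
                in_regression_path = False  # moved past the roi to the right
            else:
                regression_path += duration
    return {"first_fixation": first_fixation, "first_pass": first_pass,
            "total_reading_time": total_reading,
            "regression_path": regression_path}

# Invented fixation sequence over regions 0..3; region 2 holds the error.
fixations = [(0, 200), (1, 250), (2, 180), (1, 120), (2, 300), (3, 220)]
measures = eye_measures(fixations, 2)
```

In a design like the one described, such measures per participant and sentence would then be the dependent variables of the linear mixed-effects regression models, with error type as a fixed effect and participants (and items) as random effects.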



