Contrastive Linguistics, Translation, and Parallel Corpora

Semantics, contrastive linguistics and parallel corpora

In view of the ambiguity of the term “semantics”, the author shows the differences between traditional lexical semantics and contemporary semantics in the light of various semantic schools. She examines semantics differently in connection with contrastive studies, where the description must necessarily proceed from the meaning towards the linguistic form, whereas in traditional contrastive studies the description proceeded from the form towards the meaning. This requirement for theoretical contrastive studies necessitates the construction of a semantic interlanguage, rather than only singling out universal semantic categories expressed by various linguistic means. Such studies can be strongly supported by parallel corpora. However, to make them useful for linguists in manual and machine translation, as well as in the development of dictionaries, including online ones, we need not only formal, often automatic, annotation of texts, but also semantic annotation, which unfortunately is manual. In the article we focus on semantic annotation concerning time, aspect and quantification of names and predicates in the whole semantic structure of the sentence, using the example of the “Polish-Bulgarian-Russian parallel corpus”.

The Shifting of the Demonstrative Determiner in French and Dutch in Parallel Corpora: From Translation Mechanisms to Structural Differences

Meta Journal des traducteurs ◽

10.7202/1006186ar ◽

2011 ◽

Vol 56 (2) ◽

pp. 443-464 ◽

Cited By ~ 1

Author(s):

Gudrun Vanderbauwhede ◽

Piet Desmet ◽

Peter Lauwers

Keyword(s):

Noun Phrase ◽

Definite Article ◽

Personal Pronoun ◽

Corpus Study ◽

Parallel Corpora ◽

Parallel Corpus ◽

Structural Differences ◽

Contrastive Linguistics ◽

Underlying Mechanisms ◽

Different Levels

This paper focuses on translational shifts with respect to the demonstrative determiner in French and Dutch in parallel corpora. The paper aims to identify the types of translation shifts that occur systematically, and to explore the underlying mechanisms and semantic effects of this process. For this purpose, a well-balanced sub-corpus of the Dutch Parallel Corpus is used, making it possible to analyze both directions (French – Dutch and Dutch – French). In this corpus, 50% of the demonstrative determiners are translated by a demonstrative in the target text (in both directions). In 20% of the cases, the demonstrative is translated by a definite article, or vice versa, while 30% are translated by another grammatical element (e.g., indefinite determiner, adverb, personal pronoun) or vice versa. The parallel corpus study reveals that translational shifts with respect to French and Dutch demonstratives can be attributed to three different mechanisms: (1) translator preference related to translation universals at the level of the noun phrase (omissions, additions and reformulations of the noun phrase), (2) specific manifestations of translation universals within the noun phrase (syntagmatic and paradigmatic explicitation and implicitation involving demonstrative shifting) and (3) structural divergences between the French and Dutch demonstrative determiner systems (fixed expressions and semantic differences). This analysis demonstrates the usefulness of a detailed parallel corpus study, which clearly distinguishes between changes occurring at different levels, in accounting for divergent translations of the demonstrative determiner in different languages. To this end, several types of explanation drawn from various fields (such as translation studies and contrastive linguistics), must be considered.

The ACTRES parallel corpus: an English–Spanish translation corpus

Corpora ◽

10.3366/e1749503208000051 ◽

2008 ◽

Vol 3 (1) ◽

pp. 31-41 ◽

Cited By ~ 13

Author(s):

Marlén Izquierdo ◽

Knut Hofland ◽

Øystein Reigem

Keyword(s):

Information Technology ◽

Research Group ◽

Translation Studies ◽

Contrastive Analysis ◽

Spanish Translation ◽

Parallel Corpus ◽

Empirical Results ◽

Linguistic Research ◽

Actual Use ◽

The University

This paper describes the compilation of the ACTRES Parallel Corpus, an English–Spanish translation corpus built at the Department of Modern Languages at the University of León (Spain) by the ACTRES research group. The computerisation of the corpus was carried out in collaboration with Knut Hofland and Øystein Reigem, from the Department of Culture, Language and Information Technology, Aksis, at the UNIFOB/University of Bergen (Norway). The corpus is conceived as a powerful tool for cross-linguistic research in the fields of Contrastive Analysis and Descriptive Translation Studies. It was the need to bridge the gap between these disciplines and to extend applications that encouraged the building of a parallel corpus as a suitable tool to achieve these goals. This paper focusses on the practical aspects of building the corpus. A brief account of the research which prompted this endeavour precedes the description of this process. This paper is an account of the building of the ACTRES Parallel Corpus, so no empirical results from research done on the basis of the corpus are reported here. Concerning new insights drawn from the actual use of P-ACTRES in English–Spanish translation and contrastive projects, there is an extended bibliography at: http://actres.unileon.es/

Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation

Computational Intelligence and Neuroscience ◽

10.1155/2021/6682385 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Michael Adjeisah ◽

Guohua Liu ◽

Douglas Omwenga Nyabuga ◽

Richard Nuetey Nortey ◽

Jinling Song

Keyword(s):

Machine Translation ◽

Language Processing ◽

Training Data ◽

Target Language ◽

Similarity Metrics ◽

Mahalanobis Distances ◽

Parallel Corpora ◽

Parallel Corpus ◽

Low Resource ◽

Sentence Level

Scaling natural language processing (NLP) to low-resource languages to improve machine translation (MT) performance remains a challenge. This research contributes to the domain with low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often difficult to determine what a good-quality corpus looks like in low-resource conditions, particularly where the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic-parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we make extensive use of three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel corpus data demonstrate that injecting a pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach yields substantial gains in BLEU and TER scores.
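
As a rough illustration of the distance-based filtering mentioned in this abstract, the Python sketch below scores candidate sentence pairs by their squared Mahalanobis distance from a small trusted reference set and keeps those under a cut-off. The two features (a length ratio and a similarity score), the reference values, and the threshold of 6.0 are assumptions made for this example; the abstract does not state which features or settings the authors used.

```python
from statistics import mean


def fit_reference(points):
    """Estimate the centre and 2x2 sample covariance of 2-D feature vectors."""
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    n = len(points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def squared_mahalanobis(point, centre, cov):
    """Squared Mahalanobis distance of a 2-D point, using the closed-form
    inverse of the covariance matrix [[sxx, sxy], [sxy, syy]]."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx = point[0] - centre[0]
    dy = point[1] - centre[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def filter_pairs(candidates, centre, cov, threshold):
    """Return the indices of candidates whose distance is within the threshold."""
    return [
        i
        for i, feats in enumerate(candidates)
        if squared_mahalanobis(feats, centre, cov) <= threshold
    ]


if __name__ == "__main__":
    # Hypothetical (length ratio, similarity) features of trusted pairs.
    reference = [(1.0, 0.8), (1.2, 0.9), (0.8, 0.7), (1.0, 0.9), (1.0, 0.7)]
    # Hypothetical features of synthetic pairs to be filtered.
    candidates = [(1.1, 0.85), (3.0, 0.1), (1.0, 0.8)]
    centre, cov = fit_reference(reference)
    print(filter_pairs(candidates, centre, cov, threshold=6.0))
```

Estimating the centre and covariance from a trusted set, rather than from the candidates themselves, keeps badly aligned pairs from distorting the statistics they are judged against. That is a design choice of this sketch, not a claim about the paper.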

Arabic-English Parallel Corpus: A New Resource for Translation Training and Language Teaching

10.31235/osf.io/rek3w ◽

2017 ◽

Author(s):

Arab World English Journal ◽

Hind M. Alotaibi

Keyword(s):

Language Teaching ◽

Data Driven ◽

Text Segmentation ◽

Web Interface ◽

King Saud University ◽

Parallel Corpora ◽

Parallel Corpus ◽

Source Language ◽

User Friendly ◽

Ongoing Project

Parallel corpora can be defined as collections of aligned, translated texts of two or more languages. They play a major role in translation and contrastive studies, and are also becoming popular in translation training and language teaching, with the advent of the data-driven learning (DDL) approach. Despite their significance, however, Arabic seems to lack a satisfactory general-use parallel corpus resource. The literature describes few Arabic–English parallel corpora, and these few are usually inaccurate and/or expensive. Some are small in size, while others are restricted in terms of genre, failing to meet the requirements of many academics and researchers. This paper describes an ongoing project at the College of Languages and Translation, King Saud University, to compile a 10-million-word Arabic–English parallel corpus to be used as a resource for translation training and language teaching. The bidirectional corpus can be used to compare translated and source language and identify differences. The corpus has been manually verified at different stages, including translation, text segmentation, alignment, and file preparation; it is available as full-text in XML format and through a user-friendly web interface that provides a concordancer to support bilingual search queries and several filtering options.

Using ParaConc to extract bilingual terminology from parallel corpora: A case of English and Ndebele

Literator ◽

10.4102/lit.v37i2.1278 ◽

2016 ◽

Vol 37 (2) ◽

Author(s):

Ketiwe Ndhlovu

Keyword(s):

Media Law ◽

Parallel Corpora ◽

Parallel Corpus ◽

African Languages ◽

Bilingual Dictionary ◽

Bilingual Dictionaries ◽

Key Word ◽

Science And Education ◽

Frequency Feature

The development of African languages into languages of science and technology is dependent on action being taken to promote the use of these languages in specialised fields such as technology, commerce, administration, media, law, science and education, among others. One possible way of developing African languages is the compilation of specialised dictionaries (Chabata 2013). This article explores how parallel corpora can be interrogated using a bilingual concordancer (ParaConc) to extract bilingual terminology that can be used to create specialised bilingual dictionaries. An English–Ndebele Parallel Corpus was used as a resource and, through ParaConc, an alphabetic list was compiled from which headwords and possible translations were sought. These translations provided possible terms for entry in a bilingual dictionary. The frequency feature and ‘hot words’ tool in ParaConc were used to determine the suitability of terms for inclusion in the dictionary and to identify possible synonyms, respectively. Since parallel corpora are aligned and data are presented in context (Key Word in Context), it was possible to draw examples showing how headwords are used. This approach produced results quickly and accurately, whilst minimising the manual translation of terms. It was noted that the quality of the dictionary depends on the quality of the corpus; hence the need to create a representative and clean corpus must be emphasised. Although technology has multiple benefits in dictionary making, the research underscores the importance of collaboration between lexicographers, translators, subject experts and target communities so that representative dictionaries are created.
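
To make the Key Word in Context idea concrete, here is a minimal Python sketch of a bilingual concordance over sentence-aligned pairs, with a simple headword frequency count. It is not ParaConc and does not reproduce its interface. The function names, the context width, and the sample data are assumptions for illustration, and the target side uses placeholder strings rather than real Ndebele text.

```python
import re
from collections import Counter


def kwic(aligned_pairs, keyword, width=30):
    """Return (left context, match, right context, aligned target) for every
    whole-word, case-insensitive occurrence of keyword in the source side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for source, target in aligned_pairs:
        for match in pattern.finditer(source):
            left = source[max(0, match.start() - width):match.start()]
            right = source[match.end():match.end() + width]
            hits.append((left, match.group(0), right, target))
    return hits


def source_frequencies(aligned_pairs):
    """Count lower-cased word forms on the source side of the corpus."""
    counts = Counter()
    for source, _ in aligned_pairs:
        counts.update(re.findall(r"\w+", source.lower()))
    return counts


if __name__ == "__main__":
    # Hypothetical aligned data; targets are placeholders, not real translations.
    corpus = [
        ("The court heard the case.", "<target sentence 1>"),
        ("A case was filed in court.", "<target sentence 2>"),
        ("No match here.", "<target sentence 3>"),
    ]
    for left, word, right, target in kwic(corpus, "court", width=10):
        print(f"{left:>10}[{word}]{right:<10} | {target}")
    print(source_frequencies(corpus).most_common(3))
```

Because each hit carries its aligned target sentence, a lexicographer can read candidate equivalents in context. The frequency count gives a crude first signal of which headwords are worth examining.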

The Web as a Parallel Corpus

Computational Linguistics ◽

10.1162/089120103322711578 ◽

2003 ◽

Vol 29 (3) ◽

pp. 349-380 ◽

Cited By ~ 178

Author(s):

Philip Resnik ◽

Noah A. Smith

Keyword(s):

Language Processing ◽

Large Scale ◽

Structural Features ◽

Classification Performance ◽

Internet Archive ◽

Parallel Corpora ◽

Parallel Corpus ◽

Original Algorithm ◽

Parallel Text ◽

The Web

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.

Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Computational Linguistics ◽

10.1162/089120105775299168 ◽

2005 ◽

Vol 31 (4) ◽

pp. 477-504 ◽

Cited By ~ 104

Author(s):

Dragos Stefan Munteanu ◽

Daniel Marcu

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Translation System ◽

Parallel Corpora ◽

Parallel Corpus ◽

Scarce Resources ◽

Parallel Data ◽

Machine Translation System ◽

Novel Method ◽

Arabic And English

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.
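
The classification step described in this abstract can be sketched in a few lines of Python. A binary maximum entropy classifier is equivalent to logistic regression, so the example trains one by plain gradient descent on two hand-made features of a sentence pair: a length ratio and the share of source words with a lexicon translation in the target. These features, the toy English-French lexicon, the training pairs, and the hyperparameters are all assumptions for illustration; the abstract does not specify the features or data the authors used.

```python
import math


def features(src, tgt, lexicon):
    """Return [bias, length ratio, lexicon coverage] for a sentence pair."""
    s, t = src.lower().split(), tgt.lower().split()
    if not s or not t:
        return [1.0, 0.0, 0.0]
    length_ratio = min(len(s), len(t)) / max(len(s), len(t))
    covered = sum(1 for w in s if any(tr in t for tr in lexicon.get(w, ())))
    return [1.0, length_ratio, covered / len(s)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Probability that the pair described by x is a translation pair."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)))


def train(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression weights by batch gradient descent."""
    weights = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for xi, yi in zip(X, y):
            err = predict(weights, xi) - yi
            for j, v in enumerate(xi):
                grad[j] += err * v
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights


if __name__ == "__main__":
    # Toy lexicon and training pairs, invented for this example.
    lexicon = {
        "the": ("le", "la"), "cat": ("chat",), "sleeps": ("dort",),
        "dog": ("chien",), "eats": ("mange",),
    }
    pairs = [
        ("the cat sleeps", "le chat dort", 1),
        ("the dog eats", "le chien mange", 1),
        ("the cat sleeps", "le chien mange", 0),
        ("the dog eats", "il pleut", 0),
    ]
    X = [features(s, t, lexicon) for s, t, _ in pairs]
    y = [label for _, _, label in pairs]
    w = train(X, y)
    print(predict(w, features("the dog sleeps", "le chien dort", lexicon)))
```

In a realistic setting the classifier would be trained on many more pairs and richer features, and candidate pairs from comparable documents would be kept only when the predicted probability exceeds a chosen cut-off.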

Das deutsche Kopulaverb sein und seine thailändischen Entsprechungen

Linguistik Online ◽

10.13092/lo.91.4393 ◽

2018 ◽

Vol 91 (4) ◽

Author(s):

Korakoch Attaviriyanupap

Keyword(s):

Short Stories ◽

The Other ◽

Contrastive Analysis ◽

Parallel Corpus

Although German and Thai are typologically different from each other, both languages have copulative constructions. The verb sein is the most important copular verb in German. Thai does have literary equivalents for this German verb but they involve different verbs. However, only pen and khɯ: are usually considered Thai copular verbs. This study aims to compare the German verb sein in copulative constructions with pen and khɯ:. The contrastive analysis is based on a bidirectional parallel corpus consisting of 12 Thai and 13 German contemporary short stories and their translations into the other language. Three questions are to be answered: 1) Which forms are found in Thai as equivalents of the German copular verb sein? 2) Which linguistic elements in German occur as equivalents of the Thai copulative constructions with pen and khɯ:? 3) How can the use of copular verbs in German and in Thai be described? The results of this study show that the equivalents of the German copulative constructions with sein are not only pen and khɯ: but also many other constructions. At the same time, the Thai copular verbs are often used differently and may be expressed by various German constructions and, in particular, by punctuation.

The Bulgarian-Polish-Russian parallel corpus

Cognitive Studies | Études cognitives ◽

10.11649/cs.2011.015 ◽

2015 ◽

pp. 241-254

Author(s):

Maksim Duškin ◽

Joanna Satoła-Staśkowiak

Keyword(s):

Parallel Corpora ◽

Parallel Corpus ◽

Polish Literature ◽

Slavic Languages ◽

Eastern Group ◽

Language Studies ◽

Western Group ◽

Characteristic Features ◽

Academy Of Sciences ◽

Linguistic Material

The Semantics Laboratory Team of the Institute of Slavic Studies of the Polish Academy of Sciences is planning to begin work on the creation of a Bulgarian-Polish-Russian parallel corpus. The three selected languages represent the main groups of Slavic languages: Bulgarian represents the southern group, Polish the western group, and Russian the eastern group. Our project will be the first parallel corpus of these three languages. The planned corpus will be based on material dating from one period (the 20th century) and will be synchronic in nature. The project will not simply be the sum of separate corpora of the selected languages. One of the problems with creating multilingual parallel corpora is the differing proportions of translated texts between the selected languages: for example, Polish literature is often translated into Bulgarian, but not vice versa. Bulgarian, Russian and Polish differ typologically: Bulgarian is an analytic language, while Polish and Russian are synthetic. The parallel corpus should have compatible annotation, while taking into account the characteristic features of the selected languages. We hope that the Bulgarian-Polish-Russian parallel corpus will serve as a source of linguistic material for contrastive language studies and may prove to be a great help to linguists, translators, terminologists and students of linguistics. The results of our work will be available on the Internet.
