Contrastive Linguistics, Translation, and Parallel Corpora

2002 ◽  
Vol 43 (4) ◽  
pp. 602-615 ◽  
Author(s):  
Jarle Ebeling

This paper regards parallel corpora as suitable sources of data for investigating the differences and similarities between languages, and adopts the notion of translation equivalence as a methodology for contrastive analysis. It uses a bidirectional parallel corpus of Norwegian and English texts to examine the behaviour of presentative English there-constructions and their Norwegian equivalents, det-constructions, in original and translated English and in original and translated Norwegian respectively.

2014 ◽  
pp. 85-100
Author(s):  
Violetta Koseska

Semantics, contrastive linguistics and parallel corpora

In view of the ambiguity of the term “semantics”, the author shows the differences between traditional lexical semantics and contemporary semantics in the light of various semantic schools. She examines semantics differently in connection with contrastive studies, where the description must necessarily go from the meaning towards the linguistic form, whereas in traditional contrastive studies the description proceeded from the form towards the meaning. This requirement regarding theoretical contrastive studies necessitates the construction of a semantic interlanguage, rather than only singling out universal semantic categories expressed by various language means. Such studies can be strongly supported by parallel corpora. However, in order to make them useful for linguists in manual and computer translation, as well as in the development of dictionaries, including online ones, we need not only formal, often automatic, annotation of texts, but also semantic annotation, which is unfortunately manual. In the article we focus on semantic annotation concerning time, aspect and quantification of names and predicates in the whole semantic structure of the sentence, using the example of the “Polish-Bulgarian-Russian parallel corpus”.


2011 ◽  
Vol 56 (2) ◽  
pp. 443-464 ◽  
Author(s):  
Gudrun Vanderbauwhede ◽  
Piet Desmet ◽  
Peter Lauwers

This paper focuses on translational shifts with respect to the demonstrative determiner in French and Dutch in parallel corpora. The paper aims to identify the types of translation shifts that occur systematically, and to explore the underlying mechanisms and semantic effects of this process. For this purpose, a well-balanced sub-corpus of the Dutch Parallel Corpus is used, making it possible to analyze both directions (French – Dutch and Dutch – French). In this corpus, 50% of the demonstrative determiners are translated by a demonstrative in the target text (in both directions). In 20% of the cases, the demonstrative is translated by a definite article, or vice versa, while 30% are translated by another grammatical element (e.g., indefinite determiner, adverb, personal pronoun) or vice versa. The parallel corpus study reveals that translational shifts with respect to French and Dutch demonstratives can be attributed to three different mechanisms: (1) translator preference related to translation universals at the level of the noun phrase (omissions, additions and reformulations of the noun phrase), (2) specific manifestations of translation universals within the noun phrase (syntagmatic and paradigmatic explicitation and implicitation involving demonstrative shifting) and (3) structural divergences between the French and Dutch demonstrative determiner systems (fixed expressions and semantic differences). This analysis demonstrates the usefulness of a detailed parallel corpus study, which clearly distinguishes between changes occurring at different levels, in accounting for divergent translations of the demonstrative determiner in different languages. To this end, several types of explanation drawn from various fields (such as translation studies and contrastive linguistics), must be considered.


Corpora ◽  
2008 ◽  
Vol 3 (1) ◽  
pp. 31-41 ◽  
Author(s):  
Marlén Izquierdo ◽  
Knut Hofland ◽  
Øystein Reigem

This paper describes the compilation of the ACTRES Parallel Corpus, an English–Spanish translation corpus built at the Department of Modern Languages at the University of León (Spain) by the ACTRES research group. The computerisation of the corpus was carried out in collaboration with Knut Hofland and Øystein Reigem, from the Department of Culture, Language and Information Technology, Aksis, at the UNIFOB/University of Bergen (Norway). The corpus is conceived as a powerful tool for cross-linguistic research in the fields of Contrastive Analysis and Descriptive Translation Studies. It was the need to bridge the gap between these disciplines and to extend applications that encouraged the building of a parallel corpus as a suitable tool to achieve these goals. This paper focusses on the practical aspects of building the corpus. A brief account of the research which prompted this endeavour precedes the description of this process. This paper is an account of the building of the ACTRES Parallel Corpus, so no empirical results from research done on the basis of the corpus are reported here. Concerning new insights drawn from the actual use of P-ACTRES in English–Spanish translation and contrastive projects, there is an extended bibliography at: http://actres.unileon.es/


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Michael Adjeisah ◽  
Guohua Liu ◽  
Douglas Omwenga Nyabuga ◽  
Richard Nuetey Nortey ◽  
Jinling Song

Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains enigmatic. This research contributes to the domain on a low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often perplexing to learn and understand what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve the MT performance in such low-resource language pairs, we propose to expand the training data by injecting synthetic-parallel corpus obtained by translating a monolingual corpus from the target language based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair engaging squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we extensively use three different sentence-level similarity metrics after round-trip translation. Experimental results on a diverse amount of available parallel corpus demonstrate that injecting pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach exhibits tremendous developments in BLEU and TER scores.


Arab World English Journal ◽  
2017 ◽  
Author(s):  
Hind M. Alotaibi

Parallel corpora can be defined as collections of aligned, translated texts of two or more languages. They play a major role in translation and contrastive studies, and are also becoming popular in translation training and language teaching, with the advent of the data-driven learning (DDL) approach. Despite their significance, however, Arabic seems to lack a satisfactory general-use parallel corpus resource. The literature describes few Arabic–English parallel corpora, and these few are usually inaccurate and/or expensive. Some are small in size, while others are restricted in terms of genre, failing to meet the requirements of many academics and researchers. This paper describes an ongoing project at the College of Languages and Translation, King Saud University, to compile a 10-million-word Arabic–English parallel corpus to be used as a resource for translation training and language teaching. The bidirectional corpus can be used to compare translated and source language and identify differences. The corpus has been manually verified at different stages, including translation, text segmentation, alignment, and file preparation; it is available as full-text in XML format and through a user-friendly web interface that provides a concordancer to support bilingual search queries and several filtering options.


Literator ◽  
2016 ◽  
Vol 37 (1) ◽  
Author(s):  
Ketiwe Ndhlovu

The development of African languages into languages of science and technology is dependent on action being taken to promote the use of these languages in specialised fields such as technology, commerce, administration, media, law, science and education among others. One possible way of developing African languages is the compilation of specialised dictionaries (Chabata 2013). This article explores how parallel corpora can be interrogated using a bilingual concordancer (ParaConc) to extract bilingual terminology that can be used to create specialised bilingual dictionaries. An English–Ndebele Parallel Corpus was used as a resource and through ParaConc, an alphabetic list was compiled from which headwords and possible translations were sought. These translations provided possible terms for entry in a bilingual dictionary. The frequency feature and ‘hot words’ tool in ParaConc were used to determine the suitability of terms for inclusion in the dictionary and for identifying possible synonyms, respectively. Since parallel corpora are aligned and data are presented in context (Key Word in Context), it was possible to draw examples showing how headwords are used. Using this approach produced results quickly and accurately, whilst minimising the process of translating terms manually. It was noted that the quality of the dictionary is dependent on the quality of the corpus, hence the need to create a representative and clean corpus must be emphasised. Although technology has multiple benefits in dictionary making, the research underscores the importance of collaboration between lexicographers, translators, subject experts and target communities so that representative dictionaries are created.


2003 ◽  
Vol 29 (3) ◽  
pp. 349-380 ◽  
Author(s):  
Philip Resnik ◽  
Noah A. Smith

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.


2005 ◽  
Vol 31 (4) ◽  
pp. 477-504 ◽  
Author(s):  
Dragos Stefan Munteanu ◽  
Daniel Marcu

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.


2018 ◽  
Vol 91 (4) ◽  
Author(s):  
Korakoch Attaviriyanupap

Although German and Thai are typologically different from each other, both languages do have copulative constructions. The verb sein is the most important copular verb in German. Thai does have literary equivalents for this German verb but they involve different verbs. However, only pen and khɯ: are usually considered as Thai copular verbs. This study aims to compare the German verb sein in copulative constructions with pen and khɯ:. The contrastive analysis is based on a bidirectional parallel corpus consisting of 12 Thai and 13 German contemporary short stories and their translations into the other language. Three questions are to be answered: 1) Which forms are found in Thai as equivalents to the German copular verb sein? 2) Which linguistic elements in German occur as equivalents of the Thai copulative constructions with pen and khɯ:? 3) How can the use of copular verbs in German and in Thai be described? The results of this study show that the equivalents of the German copulative constructions with sein are not only pen and khɯ: but also many other constructions. At the same time, the Thai copular verbs are often used differently and may be expressed by various German constructions and, in particular, by punctuation.


2015 ◽  
pp. 241-254
Author(s):  
Maksim Duškin ◽  
Joanna Satoła-Staśkowiak

The Bulgarian-Polish-Russian parallel corpus

The Semantics Laboratory Team of the Institute of Slavic Studies of the Polish Academy of Sciences is planning to begin work on the creation of a Bulgarian-Polish-Russian parallel corpus. The three selected languages are representatives of the main groups of Slavic languages: Bulgarian represents the southern group of Slavic languages, Polish – the western group of Slavic languages, Russian – the eastern group of Slavic languages. Our project will be the first parallel corpus of these three languages. The planned corpus will be based on material dating from one period (the 20th century) and will be synchronic in nature. The project will not constitute the sum of the separate corpora of the selected languages. One of the problems with creating multilingual parallel corpora is the differing proportions of translated texts between the selected languages; for example, Polish literature is often translated into Bulgarian, but not vice versa. Bulgarian, Russian and Polish differ typologically: Bulgarian is an analytic language, while Polish and Russian are synthetic. The parallel corpus should have compatible annotation, while taking into account the characteristic features of the selected languages. We hope that the Bulgarian-Polish-Russian parallel corpus will serve as a source of linguistic material for contrastive language studies and may prove to be a big help for linguists, translators, terminologists and students of linguistics. The results of our work will be available on the Internet.

