Semantics, contrastive linguistics and parallel corpora

2014, pp. 85-100
Author(s): Violetta Koseska

In view of the ambiguity of the term “semantics”, the author shows the differences between traditional lexical semantics and contemporary semantics in the light of various semantic schools. She examines semantics differently in connection with contrastive studies, where the description must necessarily proceed from the meaning towards the linguistic form, whereas in traditional contrastive studies the description proceeded from the form towards the meaning. This requirement regarding theoretical contrastive studies necessitates the construction of a semantic interlanguage, rather than only singling out universal semantic categories expressed by various linguistic means. Such studies can be strongly supported by parallel corpora. However, to make them useful for linguists in manual and computer translation, as well as in the development of dictionaries, including online ones, we need not only formal, often automatic, annotation of texts, but also semantic annotation, which unfortunately is manual. In the article we focus on semantic annotation concerning time, aspect and quantification of names and predicates in the whole semantic structure of the sentence, using the example of the “Polish-Bulgarian-Russian parallel corpus”.

2011, Vol 56 (2), pp. 443-464
Author(s): Gudrun Vanderbauwhede, Piet Desmet, Peter Lauwers

This paper focuses on translational shifts with respect to the demonstrative determiner in French and Dutch in parallel corpora. The paper aims to identify the types of translation shifts that occur systematically, and to explore the underlying mechanisms and semantic effects of this process. For this purpose, a well-balanced sub-corpus of the Dutch Parallel Corpus is used, making it possible to analyze both directions (French – Dutch and Dutch – French). In this corpus, 50% of the demonstrative determiners are translated by a demonstrative in the target text (in both directions). In 20% of the cases, the demonstrative is translated by a definite article, or vice versa, while 30% are translated by another grammatical element (e.g., indefinite determiner, adverb, personal pronoun) or vice versa. The parallel corpus study reveals that translational shifts with respect to French and Dutch demonstratives can be attributed to three different mechanisms: (1) translator preference related to translation universals at the level of the noun phrase (omissions, additions and reformulations of the noun phrase), (2) specific manifestations of translation universals within the noun phrase (syntagmatic and paradigmatic explicitation and implicitation involving demonstrative shifting) and (3) structural divergences between the French and Dutch demonstrative determiner systems (fixed expressions and semantic differences). This analysis demonstrates the usefulness of a detailed parallel corpus study, which clearly distinguishes between changes occurring at different levels, in accounting for divergent translations of the demonstrative determiner in different languages. To this end, several types of explanation drawn from various fields (such as translation studies and contrastive linguistics), must be considered.


2002, Vol 43 (4), pp. 602-615
Author(s): Jarle Ebeling

This paper regards parallel corpora as suitable sources of data for investigating the differences and similarities between languages, and adopts the notion of translation equivalence as a methodology for contrastive analysis. It uses a bidirectional parallel corpus of Norwegian and English texts to examine the behaviour of presentative English there-constructions as well as the Norwegian equivalent det-constructions in original and translated English, and original and translated Norwegian, respectively.


2016, Vol 51, pp. 191-217
Author(s): Violetta Koseska-Toszewa, Roman Roszko

Slavic languages and the Lithuanian language in the Clarin-PL parallel corpora

The strategic scientific purpose of Clarin ERIC and Clarin-PL is to support humanities research in a multicultural and multilingual Europe. Polish researchers emphasise building a bridge between the Polish language and Polish language technologies on the one hand and other European languages and their language technologies on the other. So far, the Polish scientific community has mainly focused on Polish-English connections. Clarin-PL has therefore been developing the first and only multilingual corpora of the Polish language in conjunction with other Slavic languages and the Lithuanian language: the Polish-Bulgarian-Russian Parallel Corpus and the Polish-Lithuanian Parallel Corpus. The parallel corpora created by the ISS PAS Corpus Linguistics and Semantics Team break through the existing “canons” and give scientists access to interlinked multilingual language resources, in the first phase limited to the languages of the three Slavic groups and the Lithuanian language. In the article, the authors present very detailed information on their original system of semantic annotation of scope quantification in multilingual parallel corpora, hitherto unused in the subject literature. Because of the system’s broad scope and novelty, the semantic annotation is carried out manually. The identification of particular values of scope quantification in a sentence, and the attempts to record them presented here, are supported by long-term research conducted by an international team of linguists and computer scientists / mathematicians working on the quantification of names, time and aspect in natural languages.


Corpora, 2009, Vol 4 (2), pp. 189-208
Author(s): Yufang Qian, Scott Piao

In this paper, we propose a corpus annotation scheme and lexicon for Chinese kinship terms. We modify existing traditional Chinese kinship schemes into a comprehensive semantic field framework that covers kinship semantic categories in contemporary Chinese. The scheme is inspired by the Lancaster USAS (UCREL Semantic Analysis System) taxonomy, which contains categories for English kinship terms. We show how our scheme works with a Chinese kinship semantic lexicon which covers parents, siblings, marital relations, offspring and same-sex partnerships. The kinship lexicon was created through a pilot study involving the Lancaster University Mandarin Corpus. We foresee that our annotation scheme and lexicon will provide a framework and resource for the kinship annotation of Chinese corpora and corpus-based kinship studies.


Author(s): Darija Omrčen, Hrvoje Pečarić

Nicknaming of individual athletes and sports teams is a multifaceted phenomenon, the analysis of which reveals numerous reasons for choosing a particular name or nickname. The practice of nicknaming has become so embedded in the concept of sport that it requires exceptional attention from those who create these labels. The goal of this research was to analyse the semantic structure of boxers’ nicknames, i.e. the possible principles of their formation. To realize the research aims, 378 male boxers’ nicknames, predominantly in the English language, were collected. The nicknames were allocated to semantic categories according to the content area or areas they referred to. Counts and percentages were calculated for the nicknames in each subsample created with regard to the number of semantic categories used to create a boxer’s nickname and for the group of nicknames allocated to the miscellaneous group. Counts were calculated for all groups within each subsample.


2021, Vol 2021, pp. 1-10
Author(s): Michael Adjeisah, Guohua Liu, Douglas Omwenga Nyabuga, Richard Nuetey Nortey, Jinling Song

Scaling natural language processing (NLP) to low-resource languages to improve machine translation (MT) performance remains an open problem. This research contributes to the domain with a low-resource English-Twi translation approach based on filtered synthetic-parallel corpora. It is often hard to know what a good-quality corpus looks like in low-resource conditions, particularly where the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic-parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we make extensive use of three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel data demonstrate that injecting a pseudo-parallel corpus and extensive filtering with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach yields substantial gains in BLEU and TER scores.
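The filtering step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say which features the authors measure, so the two features used here (source and target token counts), the function names and the threshold are all assumptions chosen for illustration. The sketch computes the squared Mahalanobis distance of each sentence pair from the mean of the collection and keeps the pairs that lie close to it.

```python
from statistics import mean


def squared_mahalanobis_2d(points):
    """Squared Mahalanobis distance of each 2-D point to the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    # Entries of the sample covariance matrix (divided by n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^T * inverse(cov) * d, written out for a 2x2 covariance matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def filter_pairs(pairs, threshold):
    """Keep sentence pairs whose squared distance is at most `threshold`.

    Assumption: each pair is described only by its two token counts.
    """
    feats = [(float(len(s.split())), float(len(t.split()))) for s, t in pairs]
    dists = squared_mahalanobis_2d(feats)
    return [pair for pair, d in zip(pairs, dists) if d <= threshold]
```

A pair whose target is far longer than its source ends up far from the mean and is dropped. A real system would presumably use richer features and a threshold tuned on held-out data.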


Arab World English Journal, 2017
Author(s): Hind M. Alotaibi

Parallel corpora can be defined as collections of aligned, translated texts of two or more languages. They play a major role in translation and contrastive studies, and are also becoming popular in translation training and language teaching, with the advent of the data-driven learning (DDL) approach. Despite their significance, however, Arabic seems to lack a satisfactory general-use parallel corpus resource. The literature describes few Arabic–English parallel corpora, and these few are usually inaccurate and/or expensive. Some are small in size, while others are restricted in terms of genre, failing to meet the requirements of many academics and researchers. This paper describes an ongoing project at the College of Languages and Translation, King Saud University, to compile a 10-million-word Arabic–English parallel corpus to be used as a resource for translation training and language teaching. The bidirectional corpus can be used to compare translated and source language and identify differences. The corpus has been manually verified at different stages, including translation, text segmentation, alignment, and file preparation; it is available as full-text in XML format and through a user-friendly web interface that provides a concordancer to support bilingual search queries and several filtering options.


Literator, 2016, Vol 37 (1)
Author(s): Ketiwe Ndhlovu

The development of African languages into languages of science and technology is dependent on action being taken to promote the use of these languages in specialised fields such as technology, commerce, administration, media, law, science and education among others. One possible way of developing African languages is the compilation of specialised dictionaries (Chabata 2013). This article explores how parallel corpora can be interrogated using a bilingual concordancer (ParaConc) to extract bilingual terminology that can be used to create specialised bilingual dictionaries. An English–Ndebele Parallel Corpus was used as a resource and through ParaConc, an alphabetic list was compiled from which headwords and possible translations were sought. These translations provided possible terms for entry in a bilingual dictionary. The frequency feature and ‘hot words’ tool in ParaConc were used to determine the suitability of terms for inclusion in the dictionary and for identifying possible synonyms, respectively. Since parallel corpora are aligned and data are presented in context (Key Word in Context), it was possible to draw examples showing how headwords are used. Using this approach produced results quickly and accurately, whilst minimising the process of translating terms manually. It was noted that the quality of the dictionary is dependent on the quality of the corpus, hence the need for creating a representative and clean corpus needs to be emphasised. Although technology has multiple benefits in dictionary making, the research underscores the importance of collaboration between lexicographers, translators, subject experts and target communities so that representative dictionaries are created.
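The Key Word in Context display described in this abstract can be sketched in a few lines. This is not ParaConc and does not reproduce its features. It is a minimal illustration, with a hypothetical function name and a simple whitespace tokeniser, of how an aligned corpus lets each hit on a headword be shown together with its aligned target segment.

```python
def bilingual_kwic(aligned_pairs, keyword, width=3):
    """Return (left context, hit, right context, aligned target) tuples.

    `aligned_pairs` is a list of (source_segment, target_segment) strings.
    Matching is case-insensitive and ignores trailing punctuation.
    """
    needle = keyword.lower()
    hits = []
    for source, target in aligned_pairs:
        tokens = source.split()
        for i, token in enumerate(tokens):
            if token.lower().strip(".,;:!?") == needle:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right, target))
    return hits


if __name__ == "__main__":
    # Target segments are placeholders, not real translations.
    corpus = [
        ("The court heard the case today.", "<target segment 1>"),
        ("A new case was opened.", "<target segment 2>"),
    ]
    for left, hit, right, target in bilingual_kwic(corpus, "case", width=2):
        print(f"{left:>12} [{hit}] {right:<12} | {target}")
```

Counting how often a candidate equivalent appears in the returned target segments would give a rough frequency signal of the kind the abstract uses to judge whether a term belongs in the dictionary.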


2003, Vol 29 (3), pp. 349-380
Author(s): Philip Resnik, Noah A. Smith

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.
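The idea of judging candidate page pairs by the structural features of documents can be sketched as follows. This is a loose illustration, not the STRAND algorithm: the token scheme, the class and function names and the use of `difflib` for the comparison are assumptions. Each page is reduced to its sequence of markup tokens, with any non-empty text run collapsed to a single `TEXT` marker, and the two sequences are then compared.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagLinearizer(HTMLParser):
    """Collect a flat sequence of start tags, end tags and text markers."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # The wording is ignored; only the presence of text is recorded.
        if data.strip():
            self.tokens.append("TEXT")


def linearize(html):
    parser = TagLinearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def structural_similarity(html_a, html_b):
    """Similarity in [0, 1] between the markup skeletons of two pages."""
    matcher = SequenceMatcher(None, linearize(html_a), linearize(html_b), autojunk=False)
    return matcher.ratio()
```

Two pages that share a layout but differ in language score close to 1, which makes the score a plausible input feature for a classifier of the kind the abstract mentions. It does not establish that the pages are translations of each other.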


2005, Vol 31 (4), pp. 477-504
Author(s): Dragos Stefan Munteanu, Daniel Marcu

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.
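A binary maximum entropy classifier is equivalent to logistic regression, so the decision described in this abstract can be sketched with a few lines of gradient ascent. Everything specific here is an assumption made for illustration: the two features (lexicon coverage and length ratio), the toy English-French lexicon, the training pairs and the hyperparameters are not taken from the paper.

```python
import math


def features(src, tgt, lexicon):
    """Bias term, share of source words with a lexicon translation in tgt, length ratio."""
    s, t = src.lower().split(), tgt.lower().split()
    if not s or not t:
        return [1.0, 0.0, 0.0]
    coverage = sum(1 for w in s if lexicon.get(w) in t) / len(s)
    length_ratio = min(len(s), len(t)) / max(len(s), len(t))
    return [1.0, coverage, length_ratio]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    """Probability that the pair described by feature vector x is parallel."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train(examples, epochs=2000, lr=0.5):
    """Stochastic gradient ascent on the log-likelihood of (features, label) pairs."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            error = y - predict(weights, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights


# Toy lexicon and training data (hypothetical, for illustration only).
LEXICON = {"the": "le", "cat": "chat", "dog": "chien", "sleeps": "dort", "eats": "mange"}
TRAINING = [
    ("the cat sleeps", "le chat dort", 1),
    ("the dog eats", "le chien mange", 1),
    ("the cat sleeps", "le chien mange", 0),
    ("the dog eats", "il pleut", 0),
]


def train_toy_model():
    examples = [(features(s, t, LEXICON), y) for s, t, y in TRAINING]
    return train(examples)
```

Calling `predict(train_toy_model(), features(src, tgt, LEXICON))` returns a probability for a new pair. Pairs scoring above a chosen cut-off would be treated as parallel and added to the training data of the MT system.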

