Semantics, contrastive linguistics and parallel corpora

Cognitive Studies | Études cognitives ◽

10.11649/cs.2014.009 ◽

2014 ◽

pp. 85-100

Author(s):

Violetta Koseska

Keyword(s):

Lexical Semantics ◽

Semantic Annotation ◽

Semantic Structure ◽

Automatic Annotation ◽

Parallel Corpora ◽

Parallel Corpus ◽

Linguistic Form ◽

Semantic Categories ◽

Contrastive Linguistics

In view of the ambiguity of the term “semantics”, the author shows the differences between traditional lexical semantics and contemporary semantics in the light of various semantic schools. She examines semantics differently in connection with contrastive studies, where the description must necessarily proceed from the meaning towards the linguistic form, whereas in traditional contrastive studies the description proceeded from the form towards the meaning. This requirement regarding theoretical contrastive studies necessitates the construction of a semantic interlanguage, rather than only singling out universal semantic categories expressed by various linguistic means. Such studies can be strongly supported by parallel corpora. However, to make them useful for linguists in manual and computer translation, as well as in the development of dictionaries, including online ones, we need not only formal, often automatic, annotation of texts, but also semantic annotation, which is unfortunately manual. In the article we focus on semantic annotation concerning time, aspect and quantification of names and predicates in the whole semantic structure of the sentence, using the example of the “Polish-Bulgarian-Russian parallel corpus”.


The Shifting of the Demonstrative Determiner in French and Dutch in Parallel Corpora: From Translation Mechanisms to Structural Differences

Meta Journal des traducteurs ◽

10.7202/1006186ar ◽

2011 ◽

Vol 56 (2) ◽

pp. 443-464 ◽

Cited By ~ 1

Author(s):

Gudrun Vanderbauwhede ◽

Piet Desmet ◽

Peter Lauwers

Keyword(s):

Noun Phrase ◽

Definite Article ◽

Personal Pronoun ◽

Corpus Study ◽

Parallel Corpora ◽

Parallel Corpus ◽

Structural Differences ◽

Contrastive Linguistics ◽

Underlying Mechanisms ◽

Different Levels

This paper focuses on translational shifts with respect to the demonstrative determiner in French and Dutch in parallel corpora. The paper aims to identify the types of translation shifts that occur systematically, and to explore the underlying mechanisms and semantic effects of this process. For this purpose, a well-balanced sub-corpus of the Dutch Parallel Corpus is used, making it possible to analyze both directions (French – Dutch and Dutch – French). In this corpus, 50% of the demonstrative determiners are translated by a demonstrative in the target text (in both directions). In 20% of the cases, the demonstrative is translated by a definite article, or vice versa, while 30% are translated by another grammatical element (e.g., indefinite determiner, adverb, personal pronoun) or vice versa. The parallel corpus study reveals that translational shifts with respect to French and Dutch demonstratives can be attributed to three different mechanisms: (1) translator preference related to translation universals at the level of the noun phrase (omissions, additions and reformulations of the noun phrase), (2) specific manifestations of translation universals within the noun phrase (syntagmatic and paradigmatic explicitation and implicitation involving demonstrative shifting) and (3) structural divergences between the French and Dutch demonstrative determiner systems (fixed expressions and semantic differences). This analysis demonstrates the usefulness of a detailed parallel corpus study, which clearly distinguishes between changes occurring at different levels, in accounting for divergent translations of the demonstrative determiner in different languages. To this end, several types of explanation drawn from various fields (such as translation studies and contrastive linguistics), must be considered.


Contrastive Linguistics, Translation, and Parallel Corpora

Meta Journal des traducteurs ◽

10.7202/002692ar ◽

2002 ◽

Vol 43 (4) ◽

pp. 602-615 ◽

Cited By ~ 7

Author(s):

Jarle Ebeling

Keyword(s):

Contrastive Analysis ◽

Parallel Corpora ◽

Parallel Corpus ◽

Contrastive Linguistics

This paper regards parallel corpora as suitable sources of data for investigating the differences and similarities between languages, and adopts the notion of translation equivalence as a methodology for contrastive analysis. It uses a bidirectional parallel corpus of Norwegian and English texts to examine the behaviour of presentative English there-constructions as well as the Norwegian equivalent det-constructions in original and translated English, and original and translated Norwegian respectively.


Slavic languages and the Lithuanian language in the Clarin-PL parallel corpora (Języki słowiańskie i litewski w korpusach równoległych Clarin-PL)

Studia z Filologii Polskiej i Słowiańskiej ◽

10.11649/sfps.2016.011 ◽

2016 ◽

Vol 51 ◽

pp. 191-217

Author(s):

Violetta Koseska-Toszewa ◽

Roman Roszko

Keyword(s):

Corpus Linguistics ◽

Semantic Annotation ◽

Original System ◽

Language Resources ◽

Natural Languages ◽

Parallel Corpora ◽

Parallel Corpus ◽

Slavic Languages ◽

European Languages ◽

Polish Language

The strategic scientific purpose of Clarin ERIC and Clarin-PL is to support humanities research in a multicultural and multilingual Europe. Polish researchers emphasise building a bridge between the Polish language and Polish language technologies on the one hand and other European languages and their language technologies on the other. So far, the Polish scientific community has mainly focused on Polish-English connections. Clarin-PL has been developing the first and only multilingual corpora of the Polish language in conjunction with other Slavic languages and the Lithuanian language: the Polish-Bulgarian-Russian Parallel Corpus and the Polish-Lithuanian Parallel Corpus. The parallel corpora created by the ISS PAS Corpus Linguistics and Semantics Team break through the existing “canons” and give scientists access to interlinked multilingual language resources, limited in the first phase to the languages of the three Slavic groups and the Lithuanian language. In the article, the authors present very detailed information on their original system for the semantic annotation of scope quantification in multilingual parallel corpora, hitherto unused in the subject literature. Because of the system’s broad scope and originality, the semantic annotation is carried out manually. The identification of particular values of scope quantification in a sentence, and the attempts to record it presented here, are supported by long-term research conducted by an international team of linguists and computer scientists/mathematicians working on the quantification of names, time and aspect in natural languages.


The development of a semantic annotation scheme for Chinese kinship

Corpora ◽

10.3366/e1749503209000306 ◽

2009 ◽

Vol 4 (2) ◽

pp. 189-208 ◽

Cited By ~ 2

Author(s):

Yufang Qian ◽

Scott Piao

Keyword(s):

Semantic Analysis ◽

Semantic Annotation ◽

Corpus Annotation ◽

Semantic Field ◽

Same Sex ◽

Annotation Scheme ◽

Marital Relations ◽

Semantic Categories ◽

Analysis System ◽

Contemporary Chinese

In this paper, we propose a corpus annotation scheme and lexicon for Chinese kinship terms. We modify existing traditional Chinese kinship schemes into a comprehensive semantic field framework that covers kinship semantic categories in contemporary Chinese. The scheme is inspired by the Lancaster USAS (UCREL Semantic Analysis System) taxonomy, which contains categories for English kinship terms. We show how our scheme works with a Chinese kinship semantic lexicon which covers parents, siblings, marital relations, off-spring and same-sex partnerships. The kinship lexicon was created through a pilot study involving the Lancaster University Mandarin Corpus. We foresee that our annotation scheme and lexicon will provide a framework and resource for the kinship annotation of Chinese corpora and corpus-based kinship studies.


ANALYSIS OF MALE BOXER'S NICKNAMES

Journal of Teaching English for Specific and Academic Purposes ◽

10.22190/jtesap1801001o ◽

2018 ◽

Vol 6 (1) ◽

pp. 001

Author(s):

Darija Omrčen ◽

Hrvoje Pečarić

Keyword(s):

English Language ◽

Content Area ◽

Semantic Structure ◽

Sports Teams ◽

Semantic Categories

Nicknaming of individual athletes and sports teams is a multifaceted phenomenon, the analysis of which reveals numerous reasons for choosing a particular name or nickname. The practice of nicknaming has become so embedded in the concept of sport that it requires exceptional attention by those who create these labels. The goal of this research was to analyse the semantic structure of boxers’ nicknames, i.e. the possible principles of their formation. To realize the research aims, 378 male boxers’ nicknames, predominantly in the English language, were collected. The nicknames were allocated to semantic categories according to the content area or areas they referred to. Counts and percentages were calculated for the nicknames in each subsample created with regard to the number of semantic categories used to create a boxer’s nickname and for the group of nicknames allocated to the miscellaneous group. Counts were calculated for all groups within each subsample.


Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation

Computational Intelligence and Neuroscience ◽

10.1155/2021/6682385 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Michael Adjeisah ◽

Guohua Liu ◽

Douglas Omwenga Nyabuga ◽

Richard Nuetey Nortey ◽

Jinling Song

Keyword(s):

Machine Translation ◽

Language Processing ◽

Training Data ◽

Target Language ◽

Similarity Metrics ◽

Mahalanobis Distances ◽

Parallel Corpora ◽

Parallel Corpus ◽

Low Resource ◽

Sentence Level

Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains enigmatic. This research contributes to the domain with low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often perplexing to learn and understand what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve the MT performance in such low-resource language pairs, we propose to expand the training data by injecting a synthetic-parallel corpus obtained by translating a monolingual corpus from the target language based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we extensively use three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel data demonstrate that injecting a pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improve the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach shows substantial improvements in BLEU and TER scores.
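The Mahalanobis-distance filter mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional feature vector (source and target token counts), the function names and the threshold are all assumptions made for illustration, and the abstract does not say which features the paper uses. The squared distance is computed as d² = (x − μ)ᵀ Σ⁻¹ (x − μ), and pairs whose features lie far from the bulk of the corpus are dropped.

```python
from statistics import mean


def length_features(source, target):
    """Hypothetical 2-D feature vector: token counts of both sides."""
    return (len(source.split()), len(target.split()))


def fit_gaussian_2d(points):
    """Return the mean vector and 2x2 sample covariance of 2-D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def squared_mahalanobis(point, mu, cov):
    """Squared Mahalanobis distance of a 2-D point from mean mu."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mu[0], point[1] - mu[1]
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def filter_pairs(pairs, threshold):
    """Keep sentence pairs whose squared distance is at most threshold."""
    feats = [length_features(s, t) for s, t in pairs]
    mu, cov = fit_gaussian_2d(feats)
    return [
        pair
        for pair, f in zip(pairs, feats)
        if squared_mahalanobis(f, mu, cov) <= threshold
    ]
```

Calling `filter_pairs(pairs, threshold)` on a list of `(source, target)` strings returns the pairs judged close enough to the corpus-wide length pattern. In practice the threshold would be tuned on held-out data, and richer features than token counts would likely be needed.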


Arabic-English Parallel Corpus: A New Resource for Translation Training and Language Teaching

Arab World English Journal ◽

10.31235/osf.io/rek3w ◽

2017

Author(s):

Hind M. Alotaibi

Keyword(s):

Language Teaching ◽

Data Driven ◽

Text Segmentation ◽

Web Interface ◽

King Saud University ◽

Parallel Corpora ◽

Parallel Corpus ◽

Source Language ◽

User Friendly ◽

Ongoing Project

Parallel corpora can be defined as collections of aligned, translated texts of two or more languages. They play a major role in translation and contrastive studies, and are also becoming popular in translation training and language teaching, with the advent of the data-driven learning (DDL) approach. Despite their significance, however, Arabic seems to lack a satisfactory general-use parallel corpus resource. The literature describes few Arabic–English parallel corpora, and these few are usually inaccurate and/or expensive. Some are small in size, while others are restricted in terms of genre, failing to meet the requirements of many academics and researchers. This paper describes an ongoing project at the College of Languages and Translation, King Saud University, to compile a 10-million-word Arabic–English parallel corpus to be used as a resource for translation training and language teaching. The bidirectional corpus can be used to compare translated and source language and identify differences. The corpus has been manually verified at different stages, including translation, text segmentation, alignment, and file preparation; it is available as full-text in XML format and through a user-friendly web interface that provides a concordancer to support bilingual search queries and several filtering options.


Using ParaConc to extract bilingual terminology from parallel corpora: A case of English and Ndebele

Literator ◽

10.4102/lit.v37i2.1278 ◽

2016 ◽

Vol 37 (2) ◽

Author(s):

Ketiwe Ndhlovu

Keyword(s):

Media Law ◽

Parallel Corpora ◽

Parallel Corpus ◽

African Languages ◽

Bilingual Dictionary ◽

Bilingual Dictionaries ◽

Key Word ◽

Science And Education ◽

Frequency Feature

The development of African languages into languages of science and technology is dependent on action being taken to promote the use of these languages in specialised fields such as technology, commerce, administration, media, law, science and education, among others. One possible way of developing African languages is the compilation of specialised dictionaries (Chabata 2013). This article explores how parallel corpora can be interrogated using a bilingual concordancer (ParaConc) to extract bilingual terminology that can be used to create specialised bilingual dictionaries. An English–Ndebele Parallel Corpus was used as a resource and, through ParaConc, an alphabetic list was compiled from which headwords and possible translations were sought. These translations provided possible terms for entry in a bilingual dictionary. The frequency feature and ‘hot words’ tool in ParaConc were used to determine the suitability of terms for inclusion in the dictionary and to identify possible synonyms, respectively. Since parallel corpora are aligned and data are presented in context (Key Word in Context), it was possible to draw examples showing how headwords are used. Using this approach produced results quickly and accurately, whilst minimising the process of translating terms manually. It was noted that the quality of the dictionary is dependent on the quality of the corpus, hence the need to create a representative and clean corpus. Although technology has multiple benefits in dictionary making, the research underscores the importance of collaboration between lexicographers, translators, subject experts and target communities so that representative dictionaries are created.
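The Key Word in Context display and the frequency list described in this abstract can be approximated with a short standalone sketch. It does not use ParaConc or reproduce its interface: the function names, the whitespace tokenisation and the context window are assumptions for illustration, and the aligned target sentences are treated as opaque strings.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def normalise(token):
    """Lowercase a token and strip surrounding punctuation."""
    return token.strip(PUNCTUATION).lower()


def word_frequencies(aligned_pairs):
    """Count normalised source-side tokens across all aligned pairs."""
    counts = Counter()
    for source, _target in aligned_pairs:
        counts.update(normalise(tok) for tok in source.split())
    return counts


def kwic(aligned_pairs, keyword, window=3):
    """Return (left context, hit, right context, aligned target) rows."""
    rows = []
    key = keyword.lower()
    for source, target in aligned_pairs:
        tokens = source.split()
        for i, tok in enumerate(tokens):
            if normalise(tok) == key:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                rows.append((left, tok, right, target))
    return rows
```

A lexicographer could sort `word_frequencies(pairs)` alphabetically to get candidate headwords, then call `kwic(pairs, headword)` to see each occurrence next to its aligned translation and pick out candidate equivalents by hand.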


The Web as a Parallel Corpus

Computational Linguistics ◽

10.1162/089120103322711578 ◽

2003 ◽

Vol 29 (3) ◽

pp. 349-380 ◽

Cited By ~ 178

Author(s):

Philip Resnik ◽

Noah A. Smith

Keyword(s):

Language Processing ◽

Large Scale ◽

Structural Features ◽

Classification Performance ◽

Internet Archive ◽

Parallel Corpora ◽

Parallel Corpus ◽

Original Algorithm ◽

Parallel Text ◽

The Web

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.
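The idea of judging candidate page pairs by the structural features of their documents can be sketched as follows. This is a simplified illustration and not the STRAND system itself: it reduces each HTML page to a sequence of tag and text-chunk tokens and compares the two sequences with the standard-library `difflib`, whereas the article describes a supervised classifier over several structural features. The class and function names are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagLinearizer(HTMLParser):
    """Flatten an HTML page into a sequence of structural tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag}]")

    def handle_data(self, data):
        # Text content is language-dependent, so only its presence is kept.
        if data.strip():
            self.tokens.append("[CHUNK]")


def linearize(html):
    """Return the structural token sequence of an HTML string."""
    parser = TagLinearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def structural_similarity(html_a, html_b):
    """Similarity in [0, 1] of the two pages' markup sequences."""
    matcher = SequenceMatcher(
        None, linearize(html_a), linearize(html_b), autojunk=False
    )
    return matcher.ratio()
```

Two pages that are translations of each other often share nearly identical markup, so a score close to 1.0 marks the pair as a candidate for closer inspection. A score alone does not establish translational equivalence.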


Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Computational Linguistics ◽

10.1162/089120105775299168 ◽

2005 ◽

Vol 31 (4) ◽

pp. 477-504 ◽

Cited By ~ 104

Author(s):

Dragos Stefan Munteanu ◽

Daniel Marcu

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Translation System ◽

Parallel Corpora ◽

Parallel Corpus ◽

Scarce Resources ◽

Parallel Data ◽

Machine Translation System ◽

Novel Method ◽

Arabic And English

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.
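For a binary decision such as "parallel or not", a maximum entropy classifier is equivalent to logistic regression, which the sketch below trains with plain batch gradient descent. It is an illustration under stated assumptions and not the authors' system: the two features (length ratio and the share of source words with a lexicon translation on the target side), the toy lexicon format and all function names are hypothetical stand-ins for the features used in the article.

```python
import math


def pair_features(source, target, lexicon):
    """Hypothetical features: length ratio and lexicon coverage."""
    src = source.lower().split()
    tgt = target.lower().split()
    if not src or not tgt:
        return (0.0, 0.0)
    ratio = min(len(src), len(tgt)) / max(len(src), len(tgt))
    tgt_set = set(tgt)
    covered = sum(1 for word in src if lexicon.get(word) in tgt_set)
    return (ratio, covered / len(src))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probability_parallel(x, weights, bias):
    """Model probability that feature vector x is a translation pair."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(features, labels, epochs=5000, learning_rate=1.0):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = probability_parallel(x, weights, bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias
```

After `weights, bias = train(features, labels)`, a candidate pair can be accepted when `probability_parallel(pair_features(src, tgt, lexicon), weights, bias)` exceeds a chosen cut-off such as 0.5. A higher cut-off trades recall for precision in the extracted data.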
