On Semantic Annotation in Clarin-PL Parallel Corpora

2015 ◽  
pp. 211-236
Author(s):  
Violetta Koseska-Toszewa ◽  
Roman Roszko

On Semantic Annotation in Clarin-PL Parallel Corpora
In the article, the authors present a proposal for semantic annotation in the Clarin-PL parallel corpora: the Polish-Bulgarian-Russian and the Polish-Lithuanian ones. Semantic annotation of quantification is a novelty in developing sentence-level semantics in multilingual parallel corpora, which is why the annotation is manual. The authors hope it will be of interest to IT specialists working on automatic processing of the given natural languages. Semantic annotation as defined here will make contrastive studies of natural languages more efficient, which in turn will help verify the results of those studies, and will certainly improve human and machine translation.

2015 ◽  
pp. 67-78
Author(s):  
Violetta Koseska-Toszewa

About Certain Semantic Annotation in Parallel Corpora
The semantic notation analyzed in this work belongs to the second stream of semantic theories presented here: direct approach semantics. We used this stream in our work on the Bulgarian-Polish Contrastive Grammar. Our semantic notation distinguishes quantificational meanings of names and predicates, and indicates aspectual and temporal meanings of verbs. It relies on logical scope-based quantification and on the contemporary theory of processes known as “Petri nets”. Thanks to it, we can distinguish precisely between a language form and its content. For example, a perfective verb form has two meanings: an event, or a sequence of events and states finally ended with an event. An imperfective verb form also has two meanings: a state, or a sequence of states and events finally ended with a state. In turn, names are quantified universally or existentially when they are “undefined”, and uniquely (using the iota operator) when they are “defined”. It is worth emphasizing that not only names but also the predicate can be quantified, in which case quantification concerns time and aspect. This is a novelty in elaborating sentence-level semantics in parallel corpora. For this reason, our semantic notation is manual. We hope that it will raise the interest of computer scientists working on automatic methods for processing the given natural languages. Semantic annotation as defined in this work will facilitate contrastive studies of natural languages, which in turn will help verify the results of those studies, and will certainly facilitate human and machine translation.
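
The three kinds of name quantification mentioned in the abstract can be written in standard first-order notation. The block below is only an illustrative sketch: the symbols N (the property expressed by the name) and P (the predicate) are assumptions introduced here and are not the authors' own annotation scheme.

```latex
\begin{align*}
\text{universal (``undefined'' name):}\quad & \forall x\,\bigl(N(x) \rightarrow P(x)\bigr)\\
\text{existential (``undefined'' name):}\quad & \exists x\,\bigl(N(x) \wedge P(x)\bigr)\\
\text{unique (``defined'' name):}\quad & P\bigl(\iota x\, N(x)\bigr)
\end{align*}
```

Here the iota term \iota x\, N(x) denotes the one individual satisfying N, which corresponds to the unique reading of a “defined” name.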


2014 ◽  
pp. 13-20
Author(s):  
Ludmila Dimitrova ◽  
Violetta Koseska ◽  
Danuta Roszko ◽  
Roman Roszko

Trilingual aligned corpus – current state and new applications
This article describes the current state of a trilingual parallel corpus consisting of texts in two Slavic languages (Bulgarian and Polish) and one Baltic language (Lithuanian). The corpus contains original literary texts (fiction, novels, and short stories) in one of the three languages with translations into the other two, as well as texts in other languages translated into Bulgarian, Polish, and Lithuanian. Part of the texts is aligned at the sentence level. The authors propose a semantic annotation of verbs appearing in these aligned texts that will facilitate contrastive studies of natural languages. The theoretical background of the proposed semantic annotation is also briefly discussed.


2016 ◽  
Vol 51 ◽  
pp. 191-217
Author(s):  
Violetta Koseska-Toszewa ◽  
Roman Roszko

Slavic languages and the Lithuanian language in the Clarin-PL parallel corpora (Polish title: Języki słowiańskie i litewski w korpusach równoległych Clarin-PL)
The strategic scientific purpose of Clarin ERIC and Clarin-PL is to support humanities research in a multicultural and multilingual Europe. Polish researchers put the emphasis on building a bridge between the Polish language and Polish language technologies on the one hand, and other European languages and the language technologies developed for them on the other. So far, the Polish scientific community has mainly focused on Polish-English connections. Clarin-PL has been developing the first and only multilingual corpora of the Polish language in conjunction with other Slavic languages and the Lithuanian language: the Polish-Bulgarian-Russian Parallel Corpus and the Polish-Lithuanian Parallel Corpus. The parallel corpora created by the ISS PAS Corpus Linguistics and Semantics Team break through the existing “canons” and give scientists access to interlinked multilingual language resources, in the first phase limited to the languages of the three Slavic groups and the Lithuanian language. In the article, the authors present very detailed information on their original system of semantic annotation of scope quantification in multilingual parallel corpora, hitherto unused in the subject literature. Due to the system’s broad scope and originality, the semantic annotation is carried out manually. The identification of particular values of scope quantification in a sentence, and the attempts at recording it presented here, are supported by long-term research conducted by an international team of linguists and computer scientists / mathematicians working on the quantification of names, time and aspect in natural languages.


Informatica ◽  
2018 ◽  
Vol 29 (4) ◽  
pp. 693-710
Author(s):  
Algirdas Laukaitis ◽  
Darius Plikynas ◽  
Egidijus Ostasius

Electronics ◽  
2021 ◽  
Vol 10 (13) ◽  
pp. 1589
Author(s):  
Yongkeun Hwang ◽  
Yanghoon Kim ◽  
Kyomin Jung

Neural machine translation (NMT) is one of the text generation tasks that have improved significantly with the rise of deep neural networks. However, language-specific problems such as the translation of honorifics have received little attention. In this paper, we propose a context-aware NMT model to improve the translation of Korean honorifics. By exploiting information from the surrounding sentences, such as the relationship between speakers, our proposed model effectively manages the use of honorific expressions. Specifically, we utilize a novel encoder architecture that can represent the contextual information of the given input sentences. Furthermore, a context-aware post-editing (CAPE) technique is adopted to refine a set of inconsistent sentence-level honorific translations. To demonstrate the efficacy of the proposed method, honorific-labeled test data is required. Thus, we also design a heuristic that labels Korean sentences to distinguish between honorific and non-honorific styles. Experimental results show that our proposed method outperforms sentence-level NMT baselines both in overall translation quality and in honorific translation.


2014 ◽  
pp. 85-100
Author(s):  
Violetta Koseska

Semantics, contrastive linguistics and parallel corpora
In view of the ambiguity of the term “semantics”, the author shows the differences between traditional lexical semantics and contemporary semantics in the light of various semantic schools. She examines semantics differently in connection with contrastive studies, where the description must necessarily go from the meaning towards the linguistic form, whereas in traditional contrastive studies the description proceeded from the form towards the meaning. This requirement regarding theoretical contrastive studies necessitates the construction of a semantic interlanguage, rather than only singling out universal semantic categories expressed by various linguistic means. Such studies can be strongly supported by parallel corpora. However, in order to make them useful for linguists in manual and computer translation, as well as in the development of dictionaries, including online ones, we need not only formal, often automatic, annotation of texts, but also semantic annotation - which is unfortunately manual. In the article we focus on semantic annotation concerning time, aspect and quantification of names and predicates in the whole semantic structure of the sentence, using the “Polish-Bulgarian-Russian parallel corpus” as an example.


2021 ◽  
Vol 23 ◽  
pp. 421-440
Author(s):  
Enrique Javier Vercher García

This article proposes the existence and analyses the category of humanicity, understood as the way in which natural languages classify and express external reality in two large fields: the human sphere (which the speaker understands as belonging to human society, the area of life, customs, rituals, civilization and culture specific to human beings) and the natural sphere (the sphere of everything outside the human community, outside the area of influence of human civilization; that is, natural phenomena, flora and fauna in their wild, “undomesticated” or “uncivilised” state). The functional-semantic field of humanicity would be the set of resources of the different linguistic levels (phonetic-phonological, morphological, syntactic and lexical) of a given language for configuring the referents of reality and classifying them according to their category of humanicity (human sphere vs. natural sphere). Humanicity must therefore not be confused with well-known phenomena such as linguistic animacy or the morphosyntactic distinction between human and non-human.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Michael Adjeisah ◽  
Guohua Liu ◽  
Douglas Omwenga Nyabuga ◽  
Richard Nuetey Nortey ◽  
Jinling Song

Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains enigmatic. This research contributes to the domain on a low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often perplexing to learn and understand what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve the MT performance in such low-resource language pairs, we propose to expand the training data by injecting synthetic-parallel corpus obtained by translating a monolingual corpus from the target language based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair engaging squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we extensively use three different sentence-level similarity metrics after round-trip translation. Experimental results on a diverse amount of available parallel corpus demonstrate that injecting pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach exhibits tremendous developments in BLEU and TER scores.
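
The filtering step based on squared Mahalanobis distances can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say which features are measured per sentence pair, so the two features used here (a length ratio and a similarity score), the function names, and the threshold are all hypothetical, and the covariance is restricted to two dimensions to keep the example dependency-free.

```python
from statistics import mean


def squared_mahalanobis_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean.

    Uses the sample covariance matrix (divisor n - 1) and its closed-form
    2x2 inverse. The 2-D restriction is an assumption made for this sketch.
    """
    n = len(points)
    if n < 3:
        raise ValueError("need at least three points")
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = v^T S^{-1} v with S^{-1} = (1/det) * [[syy, -sxy], [-sxy, sxx]]
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def filter_pairs(pairs, features, threshold):
    """Keep the sentence pairs whose squared distance does not exceed threshold."""
    distances = squared_mahalanobis_2d(features)
    return [pair for pair, d in zip(pairs, distances) if d <= threshold]


if __name__ == "__main__":
    # Hypothetical features per pair: (length ratio, similarity score).
    features = [(1.0, 0.9), (1.1, 0.8), (0.9, 0.85), (1.0, 0.95), (3.0, 0.1)]
    pairs = ["pair1", "pair2", "pair3", "pair4", "pair5"]
    print(filter_pairs(pairs, features, threshold=3.0))
```

Pairs whose feature vector lies far from the bulk of the data, relative to the data's own spread and correlation, receive a large distance and are dropped. In practice the threshold would have to be tuned on held-out data.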


Philosophy ◽  
2000 ◽  
Vol 75 (1) ◽  
pp. 127-130 ◽  
Author(s):  
JC Beall

In his recent ‘A Prolegomenon to an Identity Theory of Truth’ (Vol. 74, 1999) Stewart Candlish discusses the so-called identity theory of truth. His aim in the article is to clear away initial difficulties that apparently stand in the way of developing the budding theory. There is one difficulty, however, that, by Candlish's lights, cannot be overcome, at least not easily. My aim in this paper is to help the identity theory by showing that, pace Candlish, the given difficulty is merely apparent. I do not ‘solve’ the alleged problem; I dissolve it. Dissolution, however, is solution enough.


Author(s):  
Eleonora Bilotta ◽  
Pietro Pantano

Structural models and patterns are vitally important for human beings. From birth, we base our emotional and cognitive representations of the external world on species-specific signals (the human face) and exploit these signals to structure our instinctive behavior. The creation of cognitive patterns to represent the world lies at the very heart of human cognition. It is this process that underlies our efficient use of signs, our ability to communicate with natural languages and to build cognitive artifacts, the way we organize the external world, and the way we organize external events in our memories and our flow of consciousness. Patterns are sometimes called schemas, or models, and discussed in terms of a gestalt (Piaget, 1960; 1970; Koelher, 1974). In the Middle Ages a pattern meant “the original proposed to imitation; the archetype; that which is to be copied; an exemplar” (from the On Line Etymology Dictionary). Modern use dates back to the eighteenth century. In 1977 Christopher Alexander introduced a new way of using the term in architecture. For Alexander, a pattern was a model used to encode and organize existing knowledge, avoiding the need to reinvent the knowledge every time it was needed. For Alexander a pattern was “a three part rule, which expresses a relation between a certain context, a problem, and a solution” (Alexander et al., 1977).

