Języki słowiańskie i litewski w korpusach równoległych Clarin-PL

2016 ◽  
Vol 51 ◽  
pp. 191-217
Author(s):  
Violetta Koseska-Toszewa ◽  
Roman Roszko

Slavic languages and the Lithuanian language in the Clarin-PL parallel corpora
The strategic scientific purpose of Clarin ERIC and Clarin-PL is to support research in the humanities in a multicultural and multilingual Europe. Polish researchers emphasise building a bridge between the Polish language and Polish language technologies on the one hand, and other European languages and the technologies developed for them on the other. So far, the Polish scientific community has focused mainly on Polish-English connections. Clarin-PL is therefore developing the first multilingual corpora to date that set Polish alongside other Slavic languages and Lithuanian: the Polish-Bulgarian-Russian Parallel Corpus and the Polish-Lithuanian Parallel Corpus. The parallel corpora created by the Corpus Linguistics and Semantics Team (ISS PAS) break with the existing “canons” and give researchers access to interlinked multilingual language resources, limited in the first phase to languages of the three Slavic groups and Lithuanian. In the article, the authors give a very detailed account of their original system for the semantic annotation of scope quantification in multilingual parallel corpora, applied here for the first time in the subject literature. Because of its broad scope and novelty, this semantic annotation is carried out manually. The identification of particular values of scope quantification in a sentence, and the attempts to record them presented here, are supported by long-term research conducted by an international team of linguists, mathematicians and computer scientists working on the quantification of names, time and aspect in natural languages.

2014 ◽  
pp. 85-100
Author(s):  
Violetta Koseska

Semantics, contrastive linguistics and parallel corpora
In view of the ambiguity of the term “semantics”, the author shows the differences between traditional lexical semantics and contemporary semantics in the light of various semantic schools. She treats semantics differently in connection with contrastive studies, where the description must necessarily proceed from the meaning towards the linguistic form, whereas in traditional contrastive studies the description proceeded from the form towards the meaning. This requirement of theoretical contrastive studies necessitates the construction of a semantic interlanguage, rather than merely singling out universal semantic categories expressed by various language means. Such studies can be strongly supported by parallel corpora. However, to make these corpora useful for linguists in human and machine translation, as well as in the development of dictionaries, including online ones, we need not only formal, often automatic, annotation of texts, but also semantic annotation, which unfortunately has to be done manually. In the article we focus on semantic annotation of time, aspect and the quantification of names and predicates in the whole semantic structure of the sentence, using the “Polish-Bulgarian-Russian parallel corpus” as an example.


2018 ◽  
Vol 4 ◽  
pp. 63-78
Author(s):  
Dorota Jagódzka

Polish auxiliary clitics constitute an interesting set of data which draws attention to cross-linguistic differences among Slavic languages. A general principle for clitic placement in Indo-European languages is the one described by Jacob Wackernagel in his 1892 work. He concluded that clitics appeared in the second position in the clause, after the first word in a sentence. This pattern was true to some degree in Old Church Slavonic and still holds for a number of contemporary Slavic languages with second-position clitics, e.g. Serbo-Croatian, Slovene, Czech and Slovak. Bulgarian and Macedonian have verb-adjacent pronominal clitics, and Polish has auxiliary clitics (Migdalski 2007, 2010, Pancheva 2005). The second-position tendency was also strong in older stages of Polish. In Modern Polish, auxiliary clitics attach most frequently to the l-participle. However, one of their unusual properties is the ability to choose almost any clausal element as their host. Polish auxiliary clitics can trigger morphophonological alternations on their hosts, which is an affix-like property; at the same time, they display clearly clitic-like behaviour when they attach freely to words of any lexical class. The aim of this paper is to present and analyze the morpho-syntactic properties of two kinds of auxiliary clitics: bound and free. The bound clitics carry person-number agreement markers for the past tense (the so-called ‘floating’ or ‘mobile’ inflections). The free clitic is the morpheme ‘by’, used for the conditional and subjunctive mood.


2015 ◽  
pp. 241-254
Author(s):  
Maksim Duškin ◽  
Joanna Satoła-Staśkowiak

The Bulgarian-Polish-Russian parallel corpus
The Semantics Laboratory Team of the Institute of Slavic Studies of the Polish Academy of Sciences is planning to begin work on the creation of a Bulgarian-Polish-Russian parallel corpus. The three selected languages represent the main groups of Slavic languages: Bulgarian represents the southern group, Polish the western group, and Russian the eastern group. Our project will be the first parallel corpus of these three languages. The planned corpus will be based on material dating from one period (the 20th century) and will be synchronic in nature. The project will not simply be the sum of separate corpora of the selected languages. One of the problems in creating multilingual parallel corpora is the differing proportion of translated texts between the selected languages; for example, Polish literature is often translated into Bulgarian, but not vice versa. Bulgarian, Russian and Polish differ typologically: Bulgarian is an analytic language, while Polish and Russian are synthetic. The parallel corpus should have compatible annotation, while taking into account the characteristic features of the selected languages. We hope that the Bulgarian-Polish-Russian parallel corpus will serve as a source of linguistic material for contrastive language studies and may prove to be a great help for linguists, translators, terminologists and students of linguistics. The results of our work will be available on the Internet.


2015 ◽  
Vol 49 (2) ◽  
Author(s):  
Natalia Levshina

This study investigates formal and functional variation in analytic causatives (ACs) in eighteen European languages from the Indo-European and Uralic language families. Employing the comparative concept approach, the paper presents a probabilistic semantic map of the main functions of ACs on the basis of a multilingual parallel corpus of film subtitles. This method enables us to detect common dimensions of semantic variation in ACs and to pinpoint cross-linguistic commonalities in the form–meaning mapping. The paper also presents three case studies, which test previous hypotheses about the grammaticalization clines in Romance and Germanic and facts of language contact between German and Slavic languages. The role of language contact is further explored in quantitative analyses that compare how the languages “carve up” the semantic space of causation. The results of this comparison suggest that frequently occurring semantically vague ACs may be regarded as a feature of Standard Average European.


Author(s):  
Oksana Novitska

The article analyzes the names of food and kitchen appliances borrowed from the Polish language and used in the sub-dialectal speech of the inhabitants of Pidhaitsi region. Their semantics, etymology, functioning and peculiarities of word-formation have been determined, as has the correlation of the surveyed sub-dialects with other European languages and Ukrainian sub-dialects. The study of the names of food and kitchen appliances in the sub-dialects of Pidhaitsi region suggests that the vocabulary of the sub-dialects is rich in lexis borrowed from other languages and in archaic elements, and has a close connection with the lexis of neighboring languages and their dialects. Borrowing from different languages took place at different times and was conditioned by a number of factors, the most important being the historical, socio-economic, political and cultural conditions of the development of Pidhaitsi region. Polish is one of the neighboring Slavic languages that had the most powerful influence on the Ukrainian language. Borrowing of foreign vocabulary was not always direct: through the medium of Polish, the sub-dialects of Pidhaitsi region acquired many borrowings from other European languages. The sub-dialects of Pidhaitsi region form the central part of the Naddnistriansk dialect and to a certain extent represent the characteristic features of the sub-dialects of the Naddnistriansk dialect within the southwestern dialect group.


2021 ◽  
pp. 016555152199275
Author(s):  
Juryong Cheon ◽  
Youngjoong Ko

Translation language resources, such as bilingual word lists and parallel corpora, are important factors affecting the effectiveness of cross-language information retrieval (CLIR) systems. When large domain-appropriate parallel corpora are not available, developing an effective CLIR system is particularly difficult. Furthermore, creating a large parallel corpus is costly and requires considerable effort. We therefore demonstrate the construction of parallel corpora from Wikipedia, as well as improved query translation for a CLIR system. To do so, we first constructed a bilingual dictionary, termed WikiDic. Then, we evaluated individual language resources and combinations of them in terms of their ability to extract parallel sentences; the combination of our proposed WikiDic with the translation probability from the Web’s bilingual example sentence pairs was found to be best suited to parallel sentence extraction. Finally, to evaluate the parallel corpus generated from this best combination of language resources, we compared its performance in query translation for CLIR to that of a manually created English–Korean parallel corpus. The corpus generated by our proposed method achieved better performance than the manually created corpus, demonstrating the effectiveness of the proposed method for automatic parallel corpus extraction. The method can be used to inform the construction of other parallel corpora from readily available language resources, and the parallel sentence extraction will naturally improve as Wikipedia continues to be used and its content develops.
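To make the idea of dictionary-based parallel sentence extraction concrete, here is a minimal sketch. It is not the authors' method: the scoring function (share of source tokens that have a dictionary translation in the candidate sentence), the threshold, the function names and the toy dictionary entries are all assumptions made for illustration, and the abstract does not specify how WikiDic is structured or combined with translation probabilities.

```python
def coverage_score(src_tokens, tgt_tokens, dictionary):
    """Fraction of source tokens with at least one dictionary translation
    present in the target sentence (a hypothetical scoring rule)."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for w in src_tokens if dictionary.get(w, set()) & tgt)
    return hits / len(src_tokens)


def extract_pairs(src_sents, tgt_sents, dictionary, threshold=0.5):
    """For each source sentence, keep the best-scoring target sentence
    if its score reaches the (assumed) threshold."""
    pairs = []
    for s in src_sents:
        best, best_score = None, 0.0
        for t in tgt_sents:
            score = coverage_score(s.lower().split(), t.lower().split(), dictionary)
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            pairs.append((s, best, best_score))
    return pairs


# Toy data, invented for illustration only (whitespace tokenisation,
# no morphological analysis).
toy_dictionary = {"cat": {"고양이"}, "sleeps": {"잔다"}}
pairs = extract_pairs(["cat sleeps"], ["고양이 잔다", "개 짖는다"], toy_dictionary)
```

A real system would need proper tokenisation and morphological analysis for Korean, and would probably weight dictionary hits by translation probability rather than counting them equally.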


2019 ◽  
Vol 55 (2) ◽  
pp. 469-490
Author(s):  
Krzysztof Wołk ◽  
Agnieszka Wołk ◽  
Krzysztof Marasek

Several natural languages have undergone a great deal of processing, but the problem of limited textual linguistic resources remains. The manual creation of parallel corpora by humans is rather expensive and time consuming, while the language data required for statistical machine translation (SMT) do not exist in adequate quantities for their statistical information to be used to initiate the research process. On the other hand, applying known approaches to build parallel resources from multiple sources, such as comparable or quasi-comparable corpora, is very complicated and provides rather noisy output, which later needs to be further processed and requires in-domain adaptation. To optimize the performance of comparable corpora mining algorithms, it is essential to use a quality parallel corpus for training a good data classifier. In this research, we have developed a methodology for generating an accurate parallel corpus (Czech-English, Polish-English) from monolingual resources by calculating the compatibility between the results of three machine translation systems. We have created translations of large, single-language resources by applying multiple translation systems and strictly measuring translation compatibility using rules based on the Levenshtein distance. The results produced by this approach were very favorable. The generated corpora successfully improved the quality of SMT systems and seem to be useful for many other natural language processing tasks.
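The compatibility check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the actual rules, so the token-level normalisation, the 0.8 threshold and the function names `similarity` and `translations_agree` are hypothetical. Only the Levenshtein distance itself is a standard algorithm.

```python
from itertools import combinations


def levenshtein(a, b):
    """Standard edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Token-level similarity in [0, 1]; this normalisation is an assumption."""
    ta, tb = a.lower().split(), b.lower().split()
    longest = max(len(ta), len(tb))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(ta, tb) / longest


def translations_agree(translations, threshold=0.8):
    """True if every pair of system outputs is at least `threshold` similar."""
    return all(similarity(a, b) >= threshold
               for a, b in combinations(translations, 2))
```

Under this sketch, a source sentence would be kept for the generated corpus only when the outputs of all three translation systems agree closely, for example `translations_agree([out_a, out_b, out_c])` with the three outputs as strings.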


2015 ◽  
pp. 397-411
Author(s):  
Wojciech Paweł Sosnowski ◽  
Pascal Bonnard

The Current Evolution of Slavic Languages in Central and Eastern Europe in the Context of the EU Multilingualism Policy
The respect for and protection of cultural and linguistic diversity have long been guaranteed in various international and European legislative acts. More recently, the European Union has also developed laws aimed at the preservation and promotion of multilingualism. Linguistic diversity was long seen as an obstacle to the effective functioning of EU institutions. Recently, however, it has come to be considered a valuable “heritage” of the EU. In our article, we present a brief overview of policies promoting multilingualism in Europe, and more specifically in the EU. We then compare this framework with the changes currently occurring in the modern Slavic languages of Central and Eastern Europe. The tendency of these languages towards increased analyticity is transforming these predominantly synthetic languages into more analytic ones. These observations have led us to the following question: What is the current state of modern Slavic languages, and how far may their evolution be addressed by policies promoting multilingualism? Our analysis consists of two parts: first, we scrutinised various European legislative acts promoting multilingualism; second, we analysed modern Slavic languages by means of the parallel corpora of chosen languages from the Common Language Resources and Technology Infrastructure project (including UNESCO and EU legislation, etc.).


2015 ◽  
pp. 67-78
Author(s):  
Violetta Koseska-Toszewa

About Certain Semantic Annotation in Parallel Corpora
The semantic notation analyzed in this work belongs to the second stream of semantic theories presented here, direct approach semantics. We used this stream in our work on the Bulgarian-Polish Contrastive Grammar. Our semantic notation distinguishes quantificational meanings of names and predicates, and indicates aspectual and temporal meanings of verbs. It relies on logical scope-based quantification and on the contemporary theory of processes known as “Petri nets”. Thanks to it, we can distinguish precisely between a language form and its content, e.g. a perfective verb form has two meanings: an event, or a sequence of events and states finally ended with an event. An imperfective verb form also has two meanings: a state, or a sequence of states and events finally ended with a state. In turn, names are quantified universally or existentially when they are “undefined”, and uniquely (using the iota operator) when they are “defined”. It is worth emphasizing that not only names but also the predicate can be quantified, in which case quantification concerns time and aspect. This is a novelty in elaborating sentence-level semantics in parallel corpora. For this reason, our semantic notation is applied manually. We hope that it will attract the interest of computer scientists working on automatic methods for processing the given natural languages. Semantic annotation as defined in this work will facilitate contrastive studies of natural languages, which in turn will help verify the results of those studies, and will certainly facilitate human and machine translation.
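The distinctions in the notation described above can be written down as a small data model. The following is a sketch only: the class, member and function names are hypothetical and do not reproduce the authors' actual annotation tags; it merely encodes the mapping stated in the abstract (perfective and imperfective forms each have two meanings, and “defined” names take the unique, iota quantification while “undefined” names take the universal or existential one).

```python
from dataclasses import dataclass
from enum import Enum


class AspectMeaning(Enum):
    EVENT = "event"
    SEQUENCE_ENDING_IN_EVENT = "sequence of events and states ending in an event"
    STATE = "state"
    SEQUENCE_ENDING_IN_STATE = "sequence of states and events ending in a state"


class Quantifier(Enum):
    UNIVERSAL = "universal"
    EXISTENTIAL = "existential"
    UNIQUE = "unique (iota)"


def aspect_candidates(perfective: bool) -> set:
    """Meanings a verb form may carry, per the abstract."""
    if perfective:
        return {AspectMeaning.EVENT, AspectMeaning.SEQUENCE_ENDING_IN_EVENT}
    return {AspectMeaning.STATE, AspectMeaning.SEQUENCE_ENDING_IN_STATE}


def quantifier_candidates(defined: bool) -> set:
    """Quantifiers a name may carry, per the abstract."""
    if defined:
        return {Quantifier.UNIQUE}
    return {Quantifier.UNIVERSAL, Quantifier.EXISTENTIAL}


@dataclass(frozen=True)
class NameAnnotation:
    token: str
    quantifier: Quantifier


def annotate_name(token: str, defined: bool, choice: Quantifier) -> NameAnnotation:
    """Record an annotator's manual choice, rejecting impossible values."""
    if choice not in quantifier_candidates(defined):
        raise ValueError(f"{choice} is not available for this name")
    return NameAnnotation(token, choice)
```

The form only narrows the set of candidate meanings; choosing among them still requires the annotator's judgement, which is consistent with the annotation being manual.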


2015 ◽  
pp. 211-236
Author(s):  
Violetta Koseska-Toszewa ◽  
Roman Roszko

On Semantic Annotation in Clarin-PL Parallel Corpora
In the article, the authors present a proposal for semantic annotation in the Clarin-PL parallel corpora: the Polish-Bulgarian-Russian and the Polish-Lithuanian corpus. Semantic annotation of quantification is a novelty in developing sentence-level semantics in multilingual parallel corpora. This is why our semantic annotation is manual. The authors hope it will be of interest to IT specialists working on automatic processing of the given natural languages. Semantic annotation defined in this way will make contrastive studies of natural languages more efficient, which in turn will help verify the results of those studies, and will certainly improve human and machine translation.

