POS-tagging a bilingual parallel corpus: methods and challenges

2017 ◽  
pp. 35-46 ◽  
Author(s):  
Irene Doval

This paper reviews the author’s experiences of tokenizing and POS tagging a bilingual parallel corpus, the PaGeS Corpus, consisting mostly of German and Spanish fictional texts. This is part of an ongoing process of annotating the corpus for part-of-speech information. This study discusses the specific problems encountered so far. On the one hand, tagging performance degrades significantly when applied to fictional data and, on the other, pre-existing annotation schemes are all language specific. To further improve accuracy during post-editing, the author has developed a common tagset and identified major error patterns.
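The idea of a common tagset across two language-specific annotation schemes can be sketched as a pair of mapping tables. This is a minimal illustration only: the German tags follow STTS-style names and the Spanish tags are illustrative labels, both lists are partial, and the coarse target labels are an assumption, not the common tagset developed for the PaGeS Corpus.

```python
# Illustrative only: both tag lists are partial, and the coarse labels on the
# right are an assumed target inventory, not the PaGeS common tagset.
GERMAN_TO_COMMON = {
    "NN": "NOUN", "NE": "PROPN", "VVFIN": "VERB",
    "ADJA": "ADJ", "APPR": "ADP", "ART": "DET",
}
SPANISH_TO_COMMON = {
    "NC": "NOUN", "NP": "PROPN", "VLfin": "VERB",
    "ADJ": "ADJ", "PREP": "ADP", "ART": "DET",
}


def to_common(tagged, mapping, unknown="X"):
    """Map (token, language-specific tag) pairs onto the shared tagset.

    Tags missing from the mapping are replaced by `unknown` so that gaps
    in the table are visible during post-editing.
    """
    return [(token, mapping.get(tag, unknown)) for token, tag in tagged]


german = [("Der", "ART"), ("Hund", "NN"), ("schläft", "VVFIN")]
spanish = [("El", "ART"), ("perro", "NC"), ("duerme", "VLfin")]

common_de = to_common(german, GERMAN_TO_COMMON)
common_es = to_common(spanish, SPANISH_TO_COMMON)
```

Once both sides of an aligned sentence pair carry the same labels, tag sequences can be compared directly, and any token mapped to the placeholder `X` flags a gap in the table.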

Author(s):  
Dany Amiot ◽  
Edwige Dugas

Word-formation encompasses a wide range of processes, among which we find derivation and compounding, two processes yielding productive patterns which enable the speaker to understand and to coin new lexemes. This article draws a distinction between two types of constituents (suffixes, combining forms, splinters, affixoids, etc.) on the one hand and word-formation processes (derivation, compounding, blending, etc.) on the other hand but also shows that a given constituent can appear in different word-formation processes. First, it describes prototypical derivation and compounding in terms of word-formation processes and of their constituents: Prototypical derivation involves a base lexeme, that is, a free lexical element belonging to a major part-of-speech category (noun, verb, or adjective) and, very often, an affix (e.g., Fr. laver_V ‘to wash’ > lavable_A ‘washable’), while prototypical compounding involves two lexemes (e.g., Eng. rain_N + fall_V > rainfall_N). The description of these prototypical phenomena provides a starting point for the description of other types of constituents and word-formation processes. There are indeed at least two phenomena which do not meet this description, namely, combining forms (henceforth CFs) and affixoids, and which therefore pose an interesting challenge to linguistic description, be it synchronic or diachronic. The distinction between combining forms and affixoids is not easy to establish and the definitions are often confusing, but productivity is a good criterion to distinguish them from each other, even if it does not answer all the questions raised by bound forms. In the literature, the notions of CF and affixoid are not unanimously agreed upon, especially that of affixoid. Yet this article stresses that they enable us to highlight, and even conceptualize, the gradual nature of linguistic phenomena, whether from a synchronic or a diachronic point of view.


2000 ◽  
Vol 18 (2) ◽  
pp. 91-109 ◽  
Author(s):  
Anne Sa'adah

Even as the tenth anniversary of the opening of the Berlin Wall was being celebrated, a scandal was beginning that seems destined to bring the Kohl era, however it is defined, to a close. My purpose in this article is to propose a framework for thinking about the broader political meaning and possible impact of the CDU’s difficulties. In this instance as in many others, I will argue, events in the Federal Republic are best understood if approached simultaneously from two angles. On the one hand, Germany remains bound to, if not necessarily by, its multiple experiences of dictatorship. Viewed in this context, events acquire meaning and significance as part of an ongoing process of democratization, or of an effort to “master” a past to some degree enduringly unmasterable. On the other hand, a half-century after its creation, the Federal Republic is an established democracy with a remarkable record of success and a predictable roster of problems. From this perspective, developments in Germany illustrate dilemmas and dysfunctions common across the advanced industrial democracies.


2016 ◽  
Vol 7 (2) ◽  
pp. 341-360 ◽  
Author(s):  
Thomas Burri

The autonomy robots enjoy is understood in different ways. On the one hand, a technical understanding of autonomy is firmly anchored in the present and concerned with what can be achieved now by means of code and programming; on the other hand, a philosophical understanding of robot autonomy looks into the future and tries to anticipate how robots will evolve in the years to come. The two understandings are at odds at times; occasionally they even clash. However, neither is necessarily truer than the other. Each is driven by certain real-life factors; each rests on its own justification. This article discusses these two “views of robot autonomy” in depth and witnesses them at work at two of the most relevant events of robotics in recent times, namely the DARPA Robotics Challenge, which took place in California in June 2015, and the ongoing process to address lethal autonomous weapons in humanitarian Geneva, which is spurred on by a “Campaign to Stop Killer Robots”.


Kalbotyra ◽  
2021 ◽  
Vol 74 ◽  
pp. 72-87
Author(s):  
Jan Goes

In this article we propose an alternative to the theories which subdivide the adjective into three major types (qualifier, relational, adjective of the third type), themselves subdivided into several subclasses. We believe instead that there is only one adjectival lexeme with different uses (unitary hypothesis). To do this, we start from the two ways of looking for the adjectival prototype: on the one hand, the abstract prototype built by accumulating criteria, on the other, the semantic prototype. We examine the behavior of occurrences of the abstract prototype (admirable, monumental) and the semantic prototype (grand) with respect to gradation, the attributive function (more specifically the place of the adjective) and the predicative function. The examples show not only that the two prototype models can be reconciled, but above all that the behavior and the meaning of any adjective depend in large part on the noun it qualifies, a result which confirms our unitary hypothesis. The syntactic-semantic dependence of the adjective on the supporting substantive is such that it can be concluded that the adjective is a syncategorematic part of speech, rather than a polysemous one.


2021 ◽  
Vol 11 (1) ◽  
pp. 145-161
Author(s):  
Hilde Hasselgård

This study compares sequences of noun and preposition in English and Norwegian using data from the English-Norwegian Parallel Corpus. One purpose is to test the use of sequences of part-of-speech tags as a search method for contrastive studies. The other is to investigate the functions and meanings of prepositional phrases in the position after a noun across the two languages. The comparison of original texts shows that the function of postmodifier is most frequent in both languages, with adverbial in second place. Other functions are rare. English has more postmodifiers and fewer adverbials than Norwegian. Furthermore, the prepositional phrases express locative meaning, in both functions, more frequently in Norwegian than in English. The study of translations reveals that the adverbials have congruent correspondences more often than postmodifiers, particularly in translations from English into Norwegian.
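Searching a tagged corpus by sequences of part-of-speech tags can be illustrated with a small sliding-window matcher. This is a sketch under stated assumptions: the function name `find_tag_sequences`, the coarse tag labels, and the toy sentence are all hypothetical and are not taken from the English-Norwegian Parallel Corpus or its query tools.

```python
def find_tag_sequences(tagged, pattern):
    """Return every run of tokens whose tags match `pattern` exactly.

    `tagged` is a list of (token, tag) pairs; `pattern` is a list of tags.
    """
    n = len(pattern)
    hits = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(tag == want for (_, tag), want in zip(window, pattern)):
            hits.append(tuple(token for token, _ in window))
    return hits


# Toy sentence with assumed coarse tags.
sentence = [
    ("the", "DET"), ("house", "NOUN"), ("on", "ADP"), ("the", "DET"),
    ("hill", "NOUN"), ("stood", "VERB"), ("in", "ADP"), ("silence", "NOUN"),
]

hits = find_tag_sequences(sentence, ["NOUN", "ADP"])
```

A tag-sequence search of this kind only retrieves candidates. Whether the prepositional phrase after the noun functions as a postmodifier or as an adverbial still has to be decided by inspecting each hit.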


2019 ◽  
Vol 8 (1) ◽  
pp. 39-66 ◽  
Author(s):  
Mali Satthachai ◽  
Dorothy Kenny

Scholarly interest in legislative translation has grown substantially over recent decades, with corpus-based approaches contributing to our understanding of the relationship between translated legislation and source texts, on the one hand, and translated and non-translated legislative texts in the target language, on the other. To date, however, most studies have been conducted on European languages. This study is part of a first attempt to use corpus techniques to explore legislative translation from English into Thai. Drawing on a purpose-built, 400,000-word, parallel corpus of international treaties translated from English into Thai, and a one million-word monolingual corpus of legislative texts originally written in Thai, we investigate how instances of deontic modality are translated into Thai. We analyse the modal strength of translations and conduct our inter-linguistic and intra-linguistic comparisons in the light of Biel’s (2014) concepts of equivalence and textual fit.


1996 ◽  
Vol 2 (4) ◽  
pp. 355-364 ◽  
Author(s):  
EVA EJERHED

The paper presents background and motivation for a processing model that segments discourse into units that are simple, non-nested clauses, prior to the recognition of clause internal phrasal constituents, and experimental results in support of this model. One set of results is derived from a statistical reanalysis of the Swedish empirical data in Strangert, Ejerhed and Huber 1993 concerning the linguistic structure of major prosodic units. The other set of results is derived from experiments in segmenting part of speech annotated Swedish text corpora into clauses, using a new clause segmentation algorithm. The clause segmented corpus data is taken from the Stockholm Umeå Corpus (SUC), 1 M words of Swedish texts from different genres, part of speech annotated by hand, and from the Umeå corpus DAGENS INDUSTRI 1993 (DI93), 5 M words of Swedish financial newspaper text, processed by fully automatic means consisting of tokenizing, lexical analysis, and probabilistic POS tagging. The results of these two experiments show that the proposed clause segmentation algorithm is 96% correct when applied to manually tagged text, and 91% correct when applied to probabilistically tagged text.


2014 ◽  
Vol 18 (5) ◽  
pp. 402-420
Author(s):  
Thomas Haipeter ◽  
Christine Slomka

Profit sharing is an increasingly important component of pay in Germany and is the main reason for the intensification of wage drift that has been observed in recent years. However, little is known about how widespread it is, how it is collectively regulated, the regulatory practices that accompany it and the effects those practices have on works councils and unions. These questions are analysed with respect to the development of profit sharing in the German metalworking industry, where it is relatively widespread. Our analysis shows that profit sharing is characterized by ambiguities and contradictions. On the one hand, it is contributing to the ongoing process of wage modernization; on the other hand, however, profit sharing is fuelling trends towards the fragmentation and financialization of pay. Moreover, although driving up actual wages, profit sharing is also a manifestation of an increasing redistribution of income between capital and labour, between firms and between different categories of employee.


2019 ◽  
Vol 112 (3) ◽  
pp. 843-860
Author(s):  
Isabel Grimm-Stadelmann

The anatomical and physiological treatise Περὶ τῆς τοῦ ἀνθρώπου κατασκευῆς is characterized by a peculiarity of medical terminology which is largely unknown from comparable texts: on the one hand, anatomical terms are put into relation with corresponding terms from poetic language; on the other hand, they are precisely defined by descriptions of objects of everyday use. The considerable discrepancy between the Greek original and its Latin translation is of particular interest against the background of the renaissance of Περὶ τῆς τοῦ ἀνθρώπου κατασκευῆς in the 16th century AD. The multiple versions of the Latin translation show that medical terminology in the Latin language was still in an ongoing process of development, for which reason many Greek anatomical terms were inserted untranslated into the Latin text for lack of adequate Latin equivalents. For this reason Περὶ τῆς τοῦ ἀνθρώπου κατασκευῆς plays a central role in the development of anatomical terminology, but also in its becoming more and more specific and precise.


2020 ◽  
Vol 2 (2) ◽  
pp. 71-83
Author(s):  
Mohammad Mursyit ◽  
Aji Prasetya Wibawa ◽  
Ilham Ari Elbaith Zaeni ◽  
Harits Ar Rosyid

Part-of-speech (POS) tagging is the process of automatically assigning a label to each word in a sentence. This study uses the Hidden Markov Model (HMM) algorithm for POS tagging and the Most Probable POS-Tag method to handle unknown words. The dataset consists of 10 short stories in Javanese, totalling 10,180 words annotated with a Javanese tagset. Two scenarios are tested. The first uses the HMM alone, with no treatment of unknown words. The second combines the HMM with Most Probable POS-Tag for unknown words. The first scenario yields an accuracy of 45.5% and the second an accuracy of 70.78%. Most Probable POS-Tag improves tagging accuracy but does not always assign the correct label. It can also eliminate the zero-valued probabilities that arise in HMM POS tagging. The results indicate that HMM-based POS tagging is influenced by the treatment of unknown words, the vocabulary, and the relationships between word labels in the dataset.
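The contrast between the two scenarios can be sketched with a small Viterbi decoder. This is a minimal sketch, not the study's implementation: the toy English training data, the function names, and the specific fallback rule (an unknown word is treated as if only the most frequent training tag could emit it) are all assumptions made for illustration. Transition probabilities are left unsmoothed.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Collect transition, emission and tag counts from tagged sentences."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    tag_counts = Counter()
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            tag_counts[tag] += 1
            prev = tag
    vocab = {word for words in emit.values() for word in words}
    return trans, emit, tag_counts, vocab


def viterbi(words, model, fallback=True):
    """Return the best tag path, or None if every path has probability 0."""
    if not words:
        return []
    trans, emit, tag_counts, vocab = model
    tags = list(tag_counts)
    most_probable = tag_counts.most_common(1)[0][0]

    def p_emit(tag, word):
        if word in vocab:
            return emit[tag][word] / tag_counts[tag]
        # Unknown word: assumed fallback gives all mass to the most frequent tag.
        return 1.0 if fallback and tag == most_probable else 0.0

    def p_trans(prev, tag):
        total = sum(trans[prev].values())
        return trans[prev][tag] / total if total else 0.0

    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (p_trans("<s>", t) * p_emit(t, words[0]), [t]) for t in tags}
    for word in words[1:]:
        step = {}
        for t in tags:
            prob, path = max(
                ((best[p][0] * p_trans(p, t) * p_emit(t, word), best[p][1])
                 for p in tags),
                key=lambda item: item[0],
            )
            step[t] = (prob, path + [t])
        best = step
    prob, path = max(best.values(), key=lambda item: item[0])
    return path if prob > 0 else None


# Toy training data (assumed); NOUN is the most frequent tag.
model = train([
    [("the", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("chases", "VERB"),
     ("the", "DET"), ("dog", "NOUN")],
    [("dogs", "NOUN"), ("sleep", "VERB")],
])

without_fallback = viterbi(["the", "bird", "sleeps"], model, fallback=False)
with_fallback = viterbi(["the", "bird", "sleeps"], model)
```

Without the fallback, the unseen word "bird" has zero emission probability under every tag, so every path collapses to zero and no tagging is returned. With the fallback, decoding succeeds, but an unknown word can only ever receive the single fallback tag, which is one way such a method can raise accuracy while still mislabelling some words.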

