Exploring and categorising the Arabic copula and auxiliary kāna through enhanced part-of-speech tagging

Corpora ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. 305-335
Author(s):  
Andrew Hardie ◽  
Wesam Ibrahim

Arabic syntax has yet to be studied in detail from a corpus-based perspective. The Arabic copula kāna (‘be’) also functions as an auxiliary, creating periphrastic tense–aspect constructions; but the literature on these functions is far from exhaustive. To analyse kāna within the one-million-word Corpus of Contemporary Arabic, part-of-speech tagging is applied to disambiguate the copula and the auxiliary at a high rate of accuracy, using novel, targeted enhancements to a previously described program that improves the accessibility for linguistic analysis of the output of Habash et al.’s [2012] MADA disambiguator for the Buckwalter Arabic morphological analyser. Concordances of both are extracted, and 10 percent samples (499 instances of copula kāna and 387 of auxiliary kāna) are analysed manually to identify surface-level grammatical patterns and meanings. This raw analysis is then systematised according to the main parameters of variation of the more general patterns; special descriptions are developed for specific, apparently fixed-form expressions (including two phraseologies which afford expression of verbal and adjectival modality). Overall, we uncover substantial new detail not mentioned in existing grammars (e.g., the quantitative predominance of the past imperfect construction over other uses of auxiliary kāna). These corpus-based findings have notable potential to inform and enhance not only grammatical descriptions but also the pedagogy of Arabic as a first or second/foreign language.
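The kind of contextual disambiguation the abstract describes can be illustrated with a toy rule: treat kāna as an auxiliary when an imperfect verb follows it (the periphrastic construction), and as a copula when a nominal or adjectival predicate follows. This is a hypothetical sketch with invented tag names, not the enhanced tagger the authors actually built.

```python
# Toy copula/auxiliary disambiguator for kāna. Tags such as "VERB_IMPF"
# and "KANA" are invented placeholders for illustration only.

def classify_kana(tokens, i, pos_tags):
    """Label tokens[i] (assumed to be a form of kāna) as 'AUX' if the
    next relevant token is an imperfect verb, otherwise as 'COP'."""
    for j in range(i + 1, len(tokens)):
        if pos_tags[j] == "VERB_IMPF":
            return "AUX"   # periphrastic tense-aspect construction
        if pos_tags[j] in ("VERB_PERF", "NOUN", "ADJ"):
            return "COP"   # a predicate follows directly
    return "COP"

# kāna yaktubu 'he was writing' -> auxiliary reading
print(classify_kana(["kāna", "yaktubu"], 0, ["KANA", "VERB_IMPF"]))
```

A real tagger would of course consult richer morphological analyses and intervening material; the point is only that the copula/auxiliary split is decidable from local context.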

Author(s):  
Mark Davies

Abstract Within the past decade several large, freely-available online corpora of Spanish and Portuguese have become available. With these new corpora, researchers of Spanish and Portuguese can now carry out the same type of corpus-based research that has been done for other languages (such as English) for years. This includes advanced research on morphological and syntactic variation (thanks to full functionality with substring searches, part of speech tagging, and lemmatization), semantics and pragmatics (via collocates, synonyms, customized word lists, and word comparisons), and historical changes and synchronic register variation (via architectures and interfaces that allow easy comparisons of frequency in different sections of the corpus).


ICAME Journal ◽  
2021 ◽  
Vol 45 (1) ◽  
pp. 37-86
Author(s):  
Jonathan Culpeper ◽  
Andrew Hardie ◽  
Jane Demmen ◽  
Jennifer Hughes ◽  
Matt Timperley

Abstract This article explores challenges in the corpus linguistic analysis of Shakespeare’s language, and Early Modern English more generally, with particular focus on elaborating possible solutions and the benefits they bring. An account of work that took place within the Encyclopedia of Shakespeare’s Language Project (2016–2019) is given, which discusses the development of the project’s data resources, specifically, the Enhanced Shakespearean Corpus. Topics covered include the composition of the corpus and its subcomponents; the structure of the XML markup; the design of the extensive character metadata; and the word-level corpus annotation, including spelling regularisation, part-of-speech tagging, lemmatisation and semantic tagging. The challenges that arise from each of these undertakings are not exclusive to a corpus-based treatment of Shakespeare’s plays but it is in the context of Shakespeare’s language that they are so severe as to seem almost insurmountable. The solutions developed for the Enhanced Shakespearean Corpus – often combining automated manipulation with manual interventions, and always principled – offer a way through.


Author(s):  
Bethan Siân Tovey

Part of speech tagging, labeling every token in a text with its grammatical category, is a complicated business. Natural language is messy, especially when that language consists of social-media conversations between bilinguals. The process can be done with or without human intervention, in a supervised or unsupervised manner, on a statistical basis or by the application of rules. Often, it involves a combination of these methods. It is, on the one hand, an obvious markup problem: mark up the tokens with appropriate grammatical categories. But it is also much richer than that. Theoretical problems that have been identified in the domain of markup can throw light on the problem of grammatical category disambiguation. Topics considered include subjectivity and objectivity, the semantics of tag sets, licensing of inference, proleptic and metaleptic markup, and the interesting characteristics of the Welsh “verbnoun”.
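The combination of statistical and rule-based methods mentioned in the abstract can be sketched minimally: a supervised unigram lexicon assigns each known word its most frequent tag, and hand-written affix rules act as a backoff for unknown words. The tagset, training data and rules below are invented for illustration.

```python
# Minimal supervised-plus-rules POS tagger: unigram statistics with a
# hand-written suffix-rule backoff for out-of-lexicon words.

from collections import Counter, defaultdict

def train_unigram(tagged_corpus):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def rule_backoff(word):
    """Hand-written fallback rules for unknown words (toy examples)."""
    if word.endswith("ing"):
        return "VERB"
    return "NOUN"

def tag(words, lexicon):
    return [lexicon.get(w.lower(), rule_backoff(w)) for w in words]

corpus = [("the", "DET"), ("cat", "NOUN"), ("sits", "VERB")]
lex = train_unigram(corpus)
print(tag(["the", "cat", "running"], lex))  # ['DET', 'NOUN', 'VERB']
```

Real systems layer many such components (and human post-editing), but the supervised/rule-based division of labour is the same.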


2019 ◽  
Vol 24 (1) ◽  
pp. 112-128
Author(s):  
Christian Griesinger ◽  
Michael Stolz

Abstract This paper sheds light on the possibilities and perspectives of linking digital editions of Medieval German texts to each other and to other digital resources. Furthermore, it discusses some of the internal and technical conditions necessary to render this linkage meaningful, such as lemmatisation, part-of-speech tagging and the use of standardised mark-up languages. Finally, the sustainability and reusability of digital editions are considered. Whereas in the past editions of medieval texts were conceived as rather isolated scholarly works by individual editors, nowadays the collaboration and cooperation of larger working groups is essential in editing projects. Due to the complexity of editions consisting of multiple textual layers, e.g. apparatus entries, annotations or facsimiles, the requirements for future digital editions have risen. The first approach to responding to these demands is to link the various textual layers to each other, enabling users to navigate between these layers in a sensible way. The second step is to link the edited text to other resources, such as online dictionaries or other editions, allowing complex research networks to be created. These goals are achieved by lemmatisation and other tagging methods, ensuring that the information is mapped to a normalised and idealised frame of reference. Common standards like Unicode or the TEI guidelines are of great importance for such purposes, as they assure the interchange and re-use of scientific data, as well as their sustainability.


Terminology ◽  
2001 ◽  
Vol 7 (1) ◽  
pp. 31-48 ◽  
Author(s):  
Jorge Vivaldi ◽  
Horacio Rodríguez

Two different reasons suggest that combining the performance of several term extractors could lead to an improvement in overall system accuracy. On the one hand, there is no clear agreement on whether to follow statistical, linguistic or hybrid approaches for (semi-) automatic term extraction. On the other hand, combining different knowledge sources (e.g. classifiers) has proved successful in improving the performance of individual sources on several NLP tasks (some of them closely related to or involved in term extraction), such as context-sensitive spelling correction, part-of-speech tagging, word sense disambiguation, parsing, text classification and filtering, etc. In this paper, we present a proposal for combining a number of different term extraction techniques in order to improve the accuracy of the resulting system. The approach has been applied to the domain of medicine for the Spanish language. A number of tests have been carried out with encouraging results.
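One simple instance of the classifier-combination idea the abstract discusses is majority voting over the individual extractors' accept/reject decisions for each candidate term. The extractor functions below are crude stand-ins, not the statistical or linguistic extractors the authors combined.

```python
# Combine several term extractors by majority vote. Each "extractor"
# is a predicate over a candidate term; the three below are toy
# stand-ins for statistical, linguistic and lexicon-based components.

def combine_by_vote(candidate, extractors, threshold=0.5):
    """Accept a candidate term if at least `threshold` of the
    extractors accept it."""
    votes = sum(1 for ex in extractors if ex(candidate))
    return votes / len(extractors) >= threshold

statistical = lambda t: len(t.split()) <= 3   # length filter stand-in
linguistic  = lambda t: t[0].isalpha()        # POS-pattern stand-in
in_lexicon  = lambda t: t.lower() in {"myocardial infarction"}

extractors = [statistical, linguistic, in_lexicon]
print(combine_by_vote("myocardial infarction", extractors))  # True
```

Weighted voting, or training a meta-classifier over the extractors' outputs, are the natural refinements of this scheme.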


Author(s):  
Nindian Puspa Dewi ◽  
Ubaidi Ubaidi

POS tagging is the foundation for developing text processing for a language. In this study, we investigate the influence of lexicon use and of word-morphology changes on determining the correct tag for a word. Rules based on word morphology, such as prefixes, suffixes and infixes, are commonly called lexical rules. This study applies lexical rules produced by a learner using the Brill Tagger algorithm. Madurese is a regional language spoken on Madura Island and several other islands in East Java. The object of this study is Madurese, which has far more affixation variation than Indonesian. In this study, the lexicon is used not only to look up Madurese base words but also as one stage of POS tag assignment. Experiments using the lexicon achieved an accuracy of 86.61%, whereas without the lexicon the accuracy reached only 28.95%. From this it can be concluded that the lexicon has a strong influence on POS tagging.
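A Brill-style lexical rule inspects a word's affixes and rewrites its current tag accordingly; a learner induces such rules from a tagged corpus. The sketch below shows only the rule-application step, with invented affixes and tags, not the rules actually learned for Madurese.

```python
# Apply Brill-style lexical rules: each rule fires when the word's
# current tag matches and the word carries a given affix. The affixes
# and tags here are hypothetical placeholders, not learned Madurese rules.

def apply_lexical_rules(word, tag, rules):
    """Rules are (affix, kind, from_tag, to_tag) tuples, applied in order."""
    for affix, kind, from_tag, to_tag in rules:
        if tag == from_tag and (
            (kind == "prefix" and word.startswith(affix))
            or (kind == "suffix" and word.endswith(affix))
        ):
            tag = to_tag
    return tag

rules = [
    ("ng", "prefix", "UNK", "VERB"),  # hypothetical: prefix ng- marks verbs
    ("an", "suffix", "UNK", "NOUN"),  # hypothetical: suffix -an marks nouns
]
print(apply_lexical_rules("makanan", "UNK", rules))  # NOUN
```

In a full Brill tagger these rules are ranked by how many tagging errors each one corrects on the training corpus, and applied after an initial lexicon lookup, which matches the lexicon-first pipeline the abstract describes.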


2021 ◽  
Vol 184 ◽  
pp. 148-155
Author(s):  
Abdul Munem Nerabie ◽  
Manar AlKhatib ◽  
Sujith Samuel Mathew ◽  
May El Barachi ◽  
Farhad Oroumchian
