Exploring and categorising the Arabic copula and auxiliary kāna through enhanced part-of-speech tagging

Corpora ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. 305-335
Author(s):  
Andrew Hardie ◽  
Wesam Ibrahim

Arabic syntax has yet to be studied in detail from a corpus-based perspective. The Arabic copula kāna (‘be’) also functions as an auxiliary, creating periphrastic tense–aspect constructions; but the literature on these functions is far from exhaustive. To analyse kāna within the one-million-word Corpus of Contemporary Arabic, part-of-speech tagging is applied to disambiguate the copula and the auxiliary at a high rate of accuracy, using novel, targeted enhancements to a previously described program that improves the accessibility for linguistic analysis of the output of Habash et al.’s [2012] MADA disambiguator for the Buckwalter Arabic morphological analyser. Concordances of both are extracted, and 10 percent samples (499 instances of copula kāna and 387 of auxiliary kāna) are analysed manually to identify surface-level grammatical patterns and meanings. This raw analysis is then systematised according to the main parameters of variation of the more general patterns; special descriptions are developed for specific, apparently fixed-form expressions (including two phraseologies which afford expression of verbal and adjectival modality). Overall, we uncover substantial new detail not mentioned in existing grammars (e.g., the quantitative predominance of the past imperfect construction over other uses of auxiliary kāna). These corpus-based findings have notable potential to inform and enhance not only grammatical descriptions but also the pedagogy of Arabic as a first or second/foreign language.
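The kind of contextual disambiguation the abstract describes can be illustrated with a toy rule: treat kāna as an auxiliary when an imperfect verb follows it (the periphrastic construction), and as a copula when a nominal or adjectival predicate follows. This is a hypothetical sketch with invented tag names, not the enhanced tagger the authors actually built.

```python
# Toy copula/auxiliary disambiguator for kāna. Tags such as "VERB_IMPF"
# and "KANA" are invented placeholders for illustration only.

def classify_kana(tokens, i, pos_tags):
    """Label tokens[i] (assumed to be a form of kāna) as 'AUX' if the
    next relevant token is an imperfect verb, otherwise as 'COP'."""
    for j in range(i + 1, len(tokens)):
        if pos_tags[j] == "VERB_IMPF":
            return "AUX"   # periphrastic tense-aspect construction
        if pos_tags[j] in ("VERB_PERF", "NOUN", "ADJ"):
            return "COP"   # a predicate follows directly
    return "COP"

# kāna yaktubu 'he was writing' -> auxiliary reading
print(classify_kana(["kāna", "yaktubu"], 0, ["KANA", "VERB_IMPF"]))
```

A real tagger would of course consult richer morphological analyses and intervening material; the point is only that the copula/auxiliary split is decidable from local context.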

Author(s):  
Mark Davies

Abstract Within the past decade several large, freely-available online corpora of Spanish and Portuguese have become available. With these new corpora, researchers of Spanish and Portuguese can now carry out the same type of corpus-based research that has been done for other languages (such as English) for years. This includes advanced research on morphological and syntactic variation (thanks to full functionality with substring searches, part of speech tagging, and lemmatization), semantics and pragmatics (via collocates, synonyms, customized word lists, and word comparisons), and historical changes and synchronic register variation (via architectures and interfaces that allow easy comparisons of frequency in different sections of the corpus).


ICAME Journal ◽  
2021 ◽  
Vol 45 (1) ◽  
pp. 37-86
Author(s):  
Jonathan Culpeper ◽  
Andrew Hardie ◽  
Jane Demmen ◽  
Jennifer Hughes ◽  
Matt Timperley

Abstract This article explores challenges in the corpus linguistic analysis of Shakespeare’s language, and Early Modern English more generally, with particular focus on elaborating possible solutions and the benefits they bring. An account of work that took place within the Encyclopedia of Shakespeare’s Language Project (2016–2019) is given, which discusses the development of the project’s data resources, specifically, the Enhanced Shakespearean Corpus. Topics covered include the composition of the corpus and its subcomponents; the structure of the XML markup; the design of the extensive character metadata; and the word-level corpus annotation, including spelling regularisation, part-of-speech tagging, lemmatisation and semantic tagging. The challenges that arise from each of these undertakings are not exclusive to a corpus-based treatment of Shakespeare’s plays but it is in the context of Shakespeare’s language that they are so severe as to seem almost insurmountable. The solutions developed for the Enhanced Shakespearean Corpus – often combining automated manipulation with manual interventions, and always principled – offer a way through.


Author(s):  
Bethan Siân Tovey

Part of speech tagging, labeling every token in a text with its grammatical category, is a complicated business. Natural language is messy, especially when that language consists of social-media conversations between bilinguals. The process can be done with or without human intervention, in a supervised or unsupervised manner, on a statistical basis or by the application of rules. Often, it involves a combination of these methods. It is, on the one hand, an obvious markup problem: mark up the tokens with appropriate grammatical categories. But it is also much richer than that. Theoretical problems that have been identified in the domain of markup can throw light on the problem of grammatical category disambiguation. Topics considered include subjectivity and objectivity, the semantics of tag sets, licensing of inference, proleptic and metaleptic markup, and the interesting characteristics of the Welsh “verbnoun”.
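The combination of statistical and rule-based methods mentioned in the abstract can be sketched minimally: a supervised unigram lexicon assigns each known word its most frequent tag, and hand-written affix rules act as a backoff for unknown words. The tagset, training data and rules below are invented for illustration.

```python
# Minimal supervised-plus-rules POS tagger: unigram statistics with a
# hand-written suffix-rule backoff for out-of-lexicon words.

from collections import Counter, defaultdict

def train_unigram(tagged_corpus):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def rule_backoff(word):
    """Hand-written fallback rules for unknown words (toy examples)."""
    if word.endswith("ing"):
        return "VERB"
    return "NOUN"

def tag(words, lexicon):
    return [lexicon.get(w.lower(), rule_backoff(w)) for w in words]

corpus = [("the", "DET"), ("cat", "NOUN"), ("sits", "VERB")]
lex = train_unigram(corpus)
print(tag(["the", "cat", "running"], lex))  # ['DET', 'NOUN', 'VERB']
```

Real systems layer many such components (and human post-editing), but the supervised/rule-based division of labour is the same.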


2019 ◽  
Vol 24 (1) ◽  
pp. 112-128
Author(s):  
Christian Griesinger ◽  
Michael Stolz

Abstract This paper sheds light on the possibilities and perspectives of linking digital editions of Medieval German texts to each other and to other digital resources. Furthermore, it discusses some of the internal and technical conditions necessary to render this linkage meaningful, such as lemmatisation, part-of-speech tagging and the use of standardised mark-up languages. Finally, the sustainability and reusability of digital editions are considered. Whereas in the past editions of medieval texts were conceived as rather isolated scholarly works by individual editors, nowadays the collaboration and cooperation of larger working groups is essential in editing projects. Due to the complexity of editions consisting of multiple textual layers, e.g. apparatus entries, annotations or facsimiles, the requirements for future digital editions have risen. The first approach to responding to these demands is to link the various textual layers to each other, enabling users to navigate between these layers in a sensible way. The second step is to link the edited text to other resources, such as online dictionaries or other editions, allowing complex research networks to be created. These goals are achieved by lemmatisation and other tagging methods, ensuring that the information is mapped to a normalised and idealised frame of reference. Common standards like Unicode or the TEI guidelines are of great importance for such purposes, as they assure the interchange and re-use of scientific data, as well as their sustainability.


Terminology ◽  
2001 ◽  
Vol 7 (1) ◽  
pp. 31-48 ◽  
Author(s):  
Jorge Vivaldi ◽  
Horacio Rodríguez

Two different reasons suggest that combining the performance of several term extractors could lead to an improvement in overall system accuracy. On the one hand, there is no clear agreement on whether to follow statistical, linguistic or hybrid approaches for (semi-) automatic term extraction. On the other hand, combining different knowledge sources (e.g. classifiers) has proved successful in improving the performance of individual sources on several NLP tasks (some of them closely related to or involved in term extraction), such as context-sensitive spelling correction, part-of-speech tagging, word sense disambiguation, parsing, text classification and filtering, etc. In this paper, we present a proposal for combining a number of different term extraction techniques in order to improve the accuracy of the resulting system. The approach has been applied to the domain of medicine for the Spanish language. A number of tests have been carried out with encouraging results.
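One simple instance of the classifier-combination idea the abstract discusses is majority voting over the individual extractors' accept/reject decisions for each candidate term. The extractor functions below are crude stand-ins, not the statistical or linguistic extractors the authors combined.

```python
# Combine several term extractors by majority vote. Each "extractor"
# is a predicate over a candidate term; the three below are toy
# stand-ins for statistical, linguistic and lexicon-based components.

def combine_by_vote(candidate, extractors, threshold=0.5):
    """Accept a candidate term if at least `threshold` of the
    extractors accept it."""
    votes = sum(1 for ex in extractors if ex(candidate))
    return votes / len(extractors) >= threshold

statistical = lambda t: len(t.split()) <= 3   # length filter stand-in
linguistic  = lambda t: t[0].isalpha()        # POS-pattern stand-in
in_lexicon  = lambda t: t.lower() in {"myocardial infarction"}

extractors = [statistical, linguistic, in_lexicon]
print(combine_by_vote("myocardial infarction", extractors))  # True
```

Weighted voting, or training a meta-classifier over the extractors' outputs, are the natural refinements of this scheme.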


Author(s):  
Nindian Puspa Dewi ◽  
Ubaidi Ubaidi

POS tagging is the foundation for developing text processing for a language. In this study, we investigate the influence of lexicon use and of word-morphology changes on determining the correct tag for a word. Rules based on word morphology, such as prefixes, suffixes and infixes, are commonly called lexical rules. This study applies lexical rules produced by a learner using the Brill Tagger algorithm. Madurese is a regional language spoken on Madura Island and several other islands in East Java. The object of this study is Madurese, which has far more affixation variation than Indonesian. In this study, the lexicon is used not only to look up Madurese base words but also as one stage of POS tag assignment. Experiments using the lexicon achieved an accuracy of 86.61%, whereas without the lexicon the accuracy reached only 28.95%. From this it can be concluded that the lexicon has a strong influence on POS tagging.
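A Brill-style lexical rule inspects a word's affixes and rewrites its current tag accordingly; a learner induces such rules from a tagged corpus. The sketch below shows only the rule-application step, with invented affixes and tags, not the rules actually learned for Madurese.

```python
# Apply Brill-style lexical rules: each rule fires when the word's
# current tag matches and the word carries a given affix. The affixes
# and tags here are hypothetical placeholders, not learned Madurese rules.

def apply_lexical_rules(word, tag, rules):
    """Rules are (affix, kind, from_tag, to_tag) tuples, applied in order."""
    for affix, kind, from_tag, to_tag in rules:
        if tag == from_tag and (
            (kind == "prefix" and word.startswith(affix))
            or (kind == "suffix" and word.endswith(affix))
        ):
            tag = to_tag
    return tag

rules = [
    ("ng", "prefix", "UNK", "VERB"),  # hypothetical: prefix ng- marks verbs
    ("an", "suffix", "UNK", "NOUN"),  # hypothetical: suffix -an marks nouns
]
print(apply_lexical_rules("makanan", "UNK", rules))  # NOUN
```

In a full Brill tagger these rules are ranked by how many tagging errors each one corrects on the training corpus, and applied after an initial lexicon lookup, which matches the lexicon-first pipeline the abstract describes.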


2021 ◽  
Vol 184 ◽  
pp. 148-155
Author(s):  
Abdul Munem Nerabie ◽  
Manar AlKhatib ◽  
Sujith Samuel Mathew ◽  
May El Barachi ◽  
Farhad Oroumchian
