A standard tag set expounding traditional morphological features for Arabic language part-of-speech tagging

2013 ◽  
Vol 6 (1) ◽  
pp. 43-99 ◽  
Author(s):  
Majdi Sawalha ◽  
Eric Atwell

The SALMA Morphological Features Tag Set (SALMA, Sawalha Atwell Leeds Morphological Analysis tag set for Arabic) captures long-established traditional morphological features of Arabic grammar in a compact yet transparent notation. First, we introduce Part-of-Speech tagging and tag set standards for English and other European languages, and then survey Arabic Part-of-Speech taggers and corpora, and long-established Arabic traditions in the analysis of morphology. A range of existing Arabic Part-of-Speech tag sets is illustrated and compared, and we review generic design criteria for corpus tag sets. For a morphologically rich language like Arabic, the Part-of-Speech tag set should be defined in terms of morphological features characterizing word structure. We describe the SALMA Tag Set in detail, explaining and illustrating each feature and its possible values. In our analysis, a tag consists of 22 characters; each position represents a feature, and the letter at that position represents a value or attribute of that morphological feature; the dash ‘-’ represents a feature not relevant to a given word. The first character shows the main Part of Speech: noun, verb, particle, punctuation, or Other (residual); the last two extend the traditional three classes to handle modern texts. ‘Noun’ in Arabic subsumes what are traditionally referred to in English as ‘noun’ and ‘adjective’. Characters 2, 3, and 4 represent subcategories: traditional Arabic grammar recognizes 34 subclasses of noun (character 2), 3 subclasses of verb (character 3), and 21 subclasses of particle (character 4). Others (residuals) and punctuation marks are represented in characters 5 and 6 respectively.
The next characters represent traditional morphological features: gender (7), number (8), person (9), inflectional morphology (10), case or mood (11), case and mood marks (12), definiteness (13), voice (14), emphasized and non-emphasized (15), transitivity (16), rational (17), declension and conjugation (18). Finally, there are four characters representing morphological information which is useful in Arabic text analysis, although not all linguists would count these as traditional features: unaugmented and augmented (19), number of root letters (20), verb root (21), types of nouns according to their final letters (22). The SALMA Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between and reuse of Arabic taggers and tagged corpora.
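The positional layout described above can be sketched in a few lines of Python. This is only an illustration: the position names are taken from the abstract, but the names `SALMA_POSITIONS` and `decode_tag` are hypothetical, and the abstract does not list the value letters, so the sketch returns the raw character for each relevant position rather than interpreting it.

```python
# Feature carried by each of the 22 positions, as listed in the abstract.
# Index 0 in this list corresponds to character 1 of the tag.
SALMA_POSITIONS = [
    "main part of speech",            # 1
    "noun subclass",                  # 2
    "verb subclass",                  # 3
    "particle subclass",              # 4
    "other (residual)",               # 5
    "punctuation mark",               # 6
    "gender",                         # 7
    "number",                         # 8
    "person",                         # 9
    "inflectional morphology",        # 10
    "case or mood",                   # 11
    "case and mood marks",            # 12
    "definiteness",                   # 13
    "voice",                          # 14
    "emphasized and non-emphasized",  # 15
    "transitivity",                   # 16
    "rational",                       # 17
    "declension and conjugation",     # 18
    "unaugmented and augmented",      # 19
    "number of root letters",         # 20
    "verb root",                      # 21
    "noun type by final letters",     # 22
]


def decode_tag(tag: str) -> dict[str, str]:
    """Map a 22-character tag to {feature name: raw value letter}.

    Positions holding '-' are skipped, since the dash marks a feature
    that is not relevant to the word.
    """
    if len(tag) != len(SALMA_POSITIONS):
        raise ValueError(f"expected 22 characters, got {len(tag)}")
    return {
        name: value
        for name, value in zip(SALMA_POSITIONS, tag)
        if value != "-"
    }


# The letters below are placeholders, not real SALMA value codes.
example = "ab" + "-" * 20
print(decode_tag(example))
```

A lookup table from value letters to their meanings could be layered on top once the letter inventory from the full paper is available.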

2021 ◽  
Vol 3 (32) ◽  
pp. 05-35
Author(s):  
Hashem Alsharif ◽  

There exist no corpora of Arabic nouns, and in any Arabic text nouns can be found in different forms. Tagging the nouns in an Arabic text makes it possible to determine whether each sentence starts with a noun or a verb. Part-of-Speech (POS) tagging is the task of labeling each word in a sentence with its appropriate category, called a tag (Noun, Verb, or Article). In this thesis, we attempt to tag non-vocalized Arabic text. The proposed POS tagger for Arabic text searches for each word of the text in our lists of verbs and articles; nouns are found by eliminating verbs and articles. Our hypothesis states that if a word in the text is not found in our lists, then it is a noun. These comparisons are made for each word in the text until all of them have been tagged. To apply our method, we prepared a list of Arabic articles and verbs, 112 million entries in total, which is used in our comparisons to test our hypothesis. To evaluate the proposed method, we used 78,245 pre-tagged words from "The Quranic Arabic Corpus" and compared our template-based tagging approach with AraMorph, a rule-based tagger, and the Stanford Log-linear Part-Of-Speech Tagger. AraMorph tagged 40% of the words correctly and the Stanford Log-linear Part-Of-Speech Tagger 68%, while our method tagged 68,501 words correctly (88%).
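The elimination idea in this abstract (look each word up in the verb and article lists, and call everything else a noun) can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names are hypothetical, the two word lists are tiny placeholders standing in for the large lists the abstract mentions, and checking verbs before articles is an assumption.

```python
def tag_by_elimination(words, verbs, articles):
    """Tag listed words as Verb or Article; default everything else to Noun."""
    tagged = []
    for word in words:
        if word in verbs:          # assumption: verbs are checked first
            tagged.append((word, "Verb"))
        elif word in articles:
            tagged.append((word, "Article"))
        else:                      # not in either list, so treated as a noun
            tagged.append((word, "Noun"))
    return tagged


def accuracy(correct, total):
    """Fraction of correctly tagged words."""
    return correct / total


# Placeholder lists for illustration only.
VERBS = {"قال", "كان"}
ARTICLES = {"في", "من"}

print(tag_by_elimination(["قال", "الولد", "في", "البيت"], VERBS, ARTICLES))

# The counts reported in the abstract: 68,501 correct out of 78,245 words.
print(f"{accuracy(68_501, 78_245):.1%}")
```

Using sets for the lists keeps each lookup constant-time on average, which matters when the lists are very large.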


2018 ◽  
Vol 7 (3.27) ◽  
pp. 125
Author(s):  
Ahmed H. Aliwy ◽  
Duaa A. Al_Raza

Part-of-speech (POS) tagging of Arabic words is a difficult and non-trivial task. It has been studied in detail for the last twenty years, and its performance affects many applications and tasks in natural language processing (NLP). Sentences in Arabic are very long compared with English sentences. This affects the tagging process for any approach that deals with a complete sentence at once, as a Hidden Markov Model (HMM) tagger does. In this paper, a new approach is suggested for using HMM and n-gram taggers to tag Arabic words in long sentences. The suggested approach is very simple and easy to implement. It is implemented on a data set of 1000 documents containing 526321 manually annotated tokens (including punctuation). The results show that the suggested approach has higher accuracy than the HMM and n-gram taggers: the F-measures were 0.888, 0.925, and 0.957 for n-grams, HMM, and the suggested approach respectively.


Author(s):  
Oscar Täckström ◽  
Dipanjan Das ◽  
Slav Petrov ◽  
Ryan McDonald ◽  
Joakim Nivre

We consider the construction of part-of-speech taggers for resource-poor languages. Recently, manually constructed tag dictionaries from Wiktionary and dictionaries projected via bitext have been used as type constraints to overcome the scarcity of annotated data in this setting. In this paper, we show that additional token constraints can be projected from a resource-rich source language to a resource-poor target language via word-aligned bitext. We present several models to this end; in particular a partially observed conditional random field model, where coupled token and type constraints provide a partial signal for training. Averaged across eight previously studied Indo-European languages, our model achieves a 25% relative error reduction over the prior state of the art. We further present successful results on seven additional languages from different families, empirically demonstrating the applicability of coupled token and type constraints across a diverse set of languages.
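For readers unfamiliar with the metric, relative error reduction compares the drop in error rate with the baseline error rate. The short sketch below shows the arithmetic; the function name is hypothetical and the error rates in the example are made-up illustrative values, not figures from the paper.

```python
def relative_error_reduction(baseline_error, new_error):
    """Return (baseline_error - new_error) / baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical example: error falls from 20% to 15% of tokens,
# i.e. accuracy rises from 80% to 85%.
reduction = relative_error_reduction(0.20, 0.15)
print(f"{reduction:.0%}")
```

Note that a 25% relative error reduction is not a 25-point gain in accuracy; in the hypothetical case above accuracy improves by only five points.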


Author(s):  
Shuhrat Mirziyatov ◽  

This article, devoted to the analysis of parts of speech in the works of Makhmud Zamakhshari, addresses the conjugation of verbs in the last chapter, “Tasrifu-l-af’al”, of the book “Mukaddamatu-l-adab”. The article emphasizes that the verb is an important part of speech in Arabic, that it is impossible to master the grammatical rules and categories without knowing its morphological features, and that some parts of speech, especially masdars and the degrees of adjectives, are formed from verbal roots. “Mukaddamatu-l-adab” states that verbs in Arabic are divided into verbs with three roots and verbs with four roots, and that the majority have three roots. Verbs with four roots, like verbs with three roots, are inflected with the help of the same suffixes and prefixes. The present tense forms, imperative forms, masdars, and participles are also formed by the same rules as for three-root verbs. Makhmud Zamakhshari defines doubled verbs as verbs of the three-root group in which the second and third roots consist of the same letter. He emphasizes that the hamza is a “healthy” letter, not a defective one, and that because of its complex pronunciation it is either replaced with another letter or sometimes omitted in pronunciation, which eases pronunciation. The writing and spelling of hamza has always been a difficult question of the language. Since Zamakhshari created his work for the quick study of Arabic and its grammar by non-Arab people, he did not go deeply into the essence of some difficult questions of the Arabic language. The scholar notes that personal endings are added to verbs of the active voice, gives sample conjugations of regular verbs in the past tense, and says that all regular verbs and verbs similar to regular verbs are conjugated in the above order.
In his work, Zamakhshari gave a sample of the conjugation of verbs in the passive voice and examples of adding personal endings to such verbs, as well as conjugations of regular verbs, verbs similar to regular verbs, and empty and defective verbs. The work not only gives verb conjugations but also provides exceptions to the rules, and it devotes a separate chapter to the interpretation of the imperative form in Arabic. The work explains that the imperative form is derived from verbs of the present-future tense. The article emphasizes that verbs of surprise are formed only from the first chapter of three-root verbs and that such forms are not formed from verbs expressing physical imperfection. Ways of expressing astonishment with doubled and defective verbs are commented on. In the part of the verb conjugation chapter devoted to the study of infinitives (masdars), the author dwells on the names of actions and on ways of forming masdars from empty verbs, defines active and passive participles, and gives examples of their formation. This chapter provides information on the formation of active and passive participles from the derived chapters and four-root verbs, and an interpretation of the superlative and comparative degrees of adjectives.


2017 ◽  
Vol 68 (2) ◽  
pp. 396-403
Author(s):  
Hana Žižková

Compound adverbs represent an interesting issue in terms of Automatic Morphological Analysis (AMA). The reason is that compound adverbs in Czech are expressions formed by compounding existing words that are different parts of speech without any change in their form. An indicative sign of compound adverbs is that they can always be decomposed again. Compound adverbs may be written as one word, but sometimes a multiword form coexists. A word that is originally a different part of speech gains an adverbial meaning and becomes an adverb. This article presents the results of a corpus probe aimed at mapping expressions that are demonstrably compound adverbs and were either not recognized by AMA or incorrectly tagged by AMA as another part of speech. Analysis of data obtained from the Czech National Corpus (ČNK) SYN v3 shows that the unrecognized and incorrectly tagged units can be divided into several groups. Based on knowledge of these groups, it is possible to refine part-of-speech tagging by AMA. The corpus probe examined units written in accordance with the current codification as well as substandard units.


Author(s):  
Nesreen Mohammad Alsharman ◽  
Inna V. Pivkina

This article describes a new method for generating extractive summaries directly via unigram and bigram extraction techniques. The methodology uses selective part-of-speech tagging to extract significant unigrams and bigrams from a set of sentences. Extracted unigrams and bigrams, along with other features, are used to build a final summary. A new selective rule-based part-of-speech tagging system is developed that concentrates on the parts of speech most important for summarization: noun, verb, and adjective. Other parts of speech, such as prepositions, articles, and adverbs, play a lesser role in determining the meaning of sentences; therefore, they are not considered when choosing significant unigrams and bigrams. The proposed method is tested on two problem domains: citations and the Opinosis data set. Results show that the proposed method performs better than the TextRank, LexRank, and Edmundson summarization methods. The proposed method is general enough to summarize texts from any domain.
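The selective extraction step can be sketched as a filter over already-tagged tokens. This is a hedged illustration rather than the authors' system: the tag names (`NOUN`, `VERB`, `ADJ`), the function name, and the rule that a bigram counts as significant only when both adjacent words carry a kept tag are all assumptions, since the abstract does not specify them.

```python
# Assumed tag labels for the three parts of speech the abstract keeps.
SIGNIFICANT_TAGS = {"NOUN", "VERB", "ADJ"}


def significant_ngrams(tagged_sentence):
    """Return (unigrams, bigrams) built only from nouns, verbs, and adjectives.

    tagged_sentence is a list of (word, tag) pairs in sentence order.
    """
    unigrams = [word for word, tag in tagged_sentence if tag in SIGNIFICANT_TAGS]
    # Assumption: a bigram is kept when both adjacent words are significant.
    bigrams = [
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged_sentence, tagged_sentence[1:])
        if t1 in SIGNIFICANT_TAGS and t2 in SIGNIFICANT_TAGS
    ]
    return unigrams, bigrams


sentence = [
    ("the", "DET"), ("small", "ADJ"), ("battery", "NOUN"),
    ("lasts", "VERB"), ("for", "ADP"), ("days", "NOUN"),
]
unigrams, bigrams = significant_ngrams(sentence)
print(unigrams)
print(bigrams)
```

The extracted units would then be scored together with the other features the abstract mentions to select summary sentences; that scoring step is not sketched here.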


2014 ◽  
Vol 519-520 ◽  
pp. 784-787
Author(s):  
Zhi Qiang Wu ◽  
Hong Zhi Yu ◽  
Shu Hui Wan

Part-of-speech tagging is basic work for Tibetan information processing; its results can be used in machine translation, speech synthesis, and so on. By studying Tibetan grammar and the classification of Tibetan parts of speech, we established a Tibetan part-of-speech tag set and tagged a corpus, then used CRFs to tag Tibetan parts of speech automatically. Experimental results show that part-of-speech tagging accuracy is 94.2% on the closed test set and 91.5% on the open test set.


2019 ◽  
Vol 1 (2) ◽  
pp. 23
Author(s):  
Mohamed Labidi

One of the important tasks in natural language processing is part-of-speech tagging. For the Arabic language there is a lot of existing work, but its performance does not rise to the required level, due to the complexity of the task and the characteristics of the Arabic language. In this work we study a combination of two different approaches to Arabic POS tagging. The first is maximum entropy-based, and the second is statistical/rule-based. Furthermore, we add a knowledge-based method to annotate Arabic particles. Our idea improves the accuracy rate: we went from almost 85% to almost 90% using our combined method, which seems promising.

