A Comparative Study of Arabic Part of Speech Taggers Using Literary Text Samples from Saudi Novels

Information ◽  
2021 ◽  
Vol 12 (12) ◽  
pp. 523
Author(s):  
Reyadh Alluhaibi ◽  
Tareq Alfraidi ◽  
Mohammad A. R. Abdeen ◽  
Ahmed Yatimi

Part of Speech (POS) tagging is one of the most common techniques used in natural language processing (NLP) applications and corpus linguistics. Various POS tagging tools have been developed for Arabic. These taggers differ in several aspects, such as their modeling techniques, tag sets, and training and testing data. In this paper we conduct a comparative study of five Arabic POS taggers, namely Stanford Arabic, CAMeL Tools, Farasa, MADAMIRA, and the Arabic Linguistic Pipeline (ALP), and examine their performance using text samples from Saudi novels. The testing data has been extracted from different novels that represent different types of narration. The main result we have obtained indicates that the ALP tagger performs better than the others in this particular case, and that Adjective is the most frequently mistagged POS type, compared to Noun and Verb.
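
A minimal sketch (not the authors' evaluation code) of the kind of per-tag error analysis reported above: each tagger's output is compared with gold-standard annotations and mistagged tokens are counted against their gold POS type. The tags and tokens below are placeholders, not the study's data.

```python
from collections import Counter

def per_tag_errors(gold, predicted):
    """gold, predicted: lists of (token, tag) pairs over the same token sequence."""
    errors = Counter()
    correct = 0
    for (_, g), (_, p) in zip(gold, predicted):
        if g == p:
            correct += 1
        else:
            errors[g] += 1            # count the error against the gold POS type
    return correct / len(gold), errors

accuracy, errors = per_tag_errors(
    [("w1", "NOUN"), ("w2", "ADJ"), ("w3", "VERB")],
    [("w1", "NOUN"), ("w2", "NOUN"), ("w3", "VERB")],
)
print(accuracy, errors.most_common())   # 0.666... [('ADJ', 1)]
```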

2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Imad Zeroual ◽  
Abdelhak Lakhouaja

Recently, data-driven approaches have increasingly demanded multilingual parallel resources, primarily in cross-language studies. To meet these demands, building multilingual parallel corpora has become the focus of many Natural Language Processing (NLP) research groups. Unlike monolingual corpora, the number of available multilingual parallel corpora is limited. In this paper, MulTed, a corpus of subtitles extracted from TEDx talks, is introduced. It is multilingual, Part of Speech (PoS) tagged, and bilingually sentence-aligned with English as a pivot language. This corpus is designed for many NLP applications in which sentence alignment, PoS tagging, and corpus size are influential, such as statistical machine translation, language recognition, and bilingual dictionary generation. Currently, the corpus has subtitles covering 1100 talks available in over 100 languages. The subtitles are classified by topic, such as Business, Education, and Sport. For the PoS tagging, TreeTagger, a language-independent PoS tagger, is used; then, to make the PoS tagging maximally useful, a mapping to a universal common tagset is performed. Finally, we believe that making the MulTed corpus available for public use can be a significant contribution to the literature of NLP and corpus linguistics, especially for under-resourced languages.
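
As an illustration of the tagset-mapping step, here is a minimal sketch that projects language-specific TreeTagger tags onto a universal tagset via a lookup table; the entries shown are a tiny illustrative subset, not the corpus' actual mapping.

```python
UNIVERSAL_MAP = {
    "NN": "NOUN", "NNS": "NOUN",        # English TreeTagger tags
    "VB": "VERB", "VBD": "VERB",
    "JJ": "ADJ",
    "NOM": "NOUN", "VER:pres": "VERB",  # French TreeTagger tags
}

def to_universal(tagged_tokens, default="X"):
    """Map (token, language-specific tag) pairs to (token, universal tag) pairs."""
    return [(tok, UNIVERSAL_MAP.get(tag, default)) for tok, tag in tagged_tokens]

print(to_universal([("talks", "NNS"), ("inspire", "VB")]))
# [('talks', 'NOUN'), ('inspire', 'VERB')]
```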


Electronics ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 1372
Author(s):  
Sanjanasri JP ◽  
Vijay Krishna Menon ◽  
Soman KP ◽  
Rajendran S ◽  
Agnieszka Wolk

Linguists have long focused on qualitative comparison of the semantics of different languages. Evaluating semantic interpretation across disparate language pairs such as English and Tamil is an even more formidable task than for Slavic languages. The concept of word embedding in Natural Language Processing (NLP) has provided a felicitous opportunity to quantify linguistic semantics. Multilingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of another. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe and FastText. A novel evaluation paradigm was devised for the generated embeddings to assess their effectiveness, using the original embeddings as ground truths. The transferability of the proposed model to other target languages was assessed via pre-trained Word2Vec embeddings for Hindi and Chinese. We empirically show that with a bilingual dictionary of a thousand words and a corresponding small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. Furthermore, we demonstrate the usability of the generated target embeddings in several NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI), bearing in mind that these are not the only possible applications.
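
To make the idea of a cross-lingual transfer function concrete, here is a minimal linear-baseline sketch (the paper itself uses deeper, data-efficient networks): a mapping W from the English embedding space to the Tamil space is fitted by least squares on a small bilingual dictionary and then applied to unseen English vectors. The arrays below are synthetic stand-ins for real embeddings.

```python
import numpy as np

def learn_mapping(src_vecs, tgt_vecs):
    """Least-squares solution of src_vecs @ W ≈ tgt_vecs over the dictionary word pairs."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

rng = np.random.default_rng(0)
en = rng.normal(size=(1000, 300))          # "English" vectors for 1,000 dictionary entries
ta = en @ rng.normal(size=(300, 300))      # synthetic "Tamil" counterparts
W = learn_mapping(en, ta)
projected = en[:5] @ W                     # project English vectors into the Tamil space
print(projected.shape)                     # (5, 300)
```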


Author(s):  
Necva Bölücü ◽  
Burcu Can

Part of speech (PoS) tagging is one of the fundamental syntactic tasks in Natural Language Processing, as it assigns a syntactic category (such as noun, verb, or adjective) to each word within a given sentence or context. Those syntactic categories can be used to further analyze sentence-level syntax (e.g., dependency parsing) and thereby extract the meaning of the sentence (e.g., semantic parsing). Various methods have been proposed for learning PoS tags in an unsupervised setting without using any annotated corpora. One of the widely used methods for the tagging problem is log-linear models. Initialization of the parameters in a log-linear model is crucial for inference, and different initialization techniques have been used so far. In this work, we present a log-linear model for PoS tagging that uses another fully unsupervised Bayesian model to initialize its parameters in a cascaded framework. We thus transfer knowledge between two different unsupervised models to improve the PoS tagging results, where the log-linear model benefits from the Bayesian model's expertise. We present results for Turkish, a morphologically rich language, and for English, a comparatively morphologically poor language, in a fully unsupervised framework. The results show that our framework outperforms other unsupervised models proposed for PoS tagging.
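
As a rough illustration of the cascaded initialisation idea (not the authors' implementation), the sketch below turns expected tag-word counts produced by an unsupervised Bayesian tagger (e.g. accumulated during Gibbs sampling) into initial emission feature weights for a log-linear model; the smoothing constant and feature naming are assumptions.

```python
import math
from collections import defaultdict

def init_emission_weights(expected_counts, alpha=0.1):
    """expected_counts: dict mapping (tag, word) -> expected count from the Bayesian model."""
    tag_totals = defaultdict(float)
    vocab = set()
    for (tag, word), c in expected_counts.items():
        tag_totals[tag] += c
        vocab.add(word)
    V = len(vocab)
    weights = {}
    for (tag, word), c in expected_counts.items():
        # log of the smoothed conditional probability p(word | tag) as the initial weight
        weights[("emit", tag, word)] = math.log((c + alpha) / (tag_totals[tag] + alpha * V))
    return weights

# The log-linear trainer would then start its unsupervised parameter updates from
# these weights rather than from a random point, transferring knowledge between the models.
```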


2020 ◽  
Vol 49 (4) ◽  
pp. 482-494
Author(s):  
Jurgita Kapočiūtė-Dzikienė ◽  
Senait Gebremichael Tesfagergish

Deep Neural Networks (DNNs) have proven especially successful in the area of Natural Language Processing (NLP), including Part-Of-Speech (POS) tagging, the process of mapping words to their corresponding POS labels depending on context. Despite recent developments in language technology, low-resourced languages, such as the East African language Tigrinya, have received too little attention. We investigate the effectiveness of Deep Learning (DL) solutions for the low-resourced Tigrinya language of the Northern-Ethiopic branch. We selected Tigrinya as the testbed example and tested state-of-the-art DL approaches seeking to build the most accurate POS tagger. We evaluated DNN classifiers (Feed-Forward Neural Network – FFNN, Long Short-Term Memory – LSTM, Bidirectional LSTM, and Convolutional Neural Network – CNN) on top of neural word2vec word embeddings with a small training corpus known as the Nagaoka Tigrinya Corpus. To determine the best DNN classifier type, architecture, and hyper-parameter set, both manual and automatic hyper-parameter tuning were performed. The BiLSTM method proved the most suitable for our task: it achieved the highest accuracy, equal to 92%, which is 65% above the random baseline.
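
A minimal sketch of a word-embedding BiLSTM tagger of the kind evaluated above, written with Keras; the vocabulary size, tagset size, and hyper-parameters are illustrative assumptions, and in the reported setup the embedding layer would be initialised with the pre-trained word2vec vectors learned from Tigrinya text.

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, n_tags, emb_dim = 20000, 20, 100   # placeholder sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),                     # padded sequences of word indices
    layers.Embedding(vocab_size, emb_dim, mask_zero=True),            # swap in word2vec weights here
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),    # BiLSTM over the sentence
    layers.TimeDistributed(layers.Dense(n_tags, activation="softmax")),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, ...) would then be run on the training corpus.
```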


2018 ◽  
Vol 2 (3) ◽  
pp. 157
Author(s):  
Ahmad Subhan Yazid ◽  
Agung Fatwanto

Indonesian holds a fundamental role in communication. Ambiguity, however, is a problem in its machine-learning implementations. In Natural Language Processing, Part of Speech (POS) tagging plays a role in reducing this problem. This study uses a rule-based method to determine the best word class for ambiguous words in Indonesian. The research follows several stages: knowledge inventory, algorithm design, implementation, testing, analysis, and conclusions. The primary data used is an Indonesian corpus developed by the language department of the Faculty of Computer Science, University of Indonesia. The data is then processed and presented descriptively according to certain rules and specifications. The result is a POS tagging algorithm comprising 71 rules, expressed in flowchart and descriptive sentence notation. According to the testing results, the algorithm correctly labels 92 of 100 tested words (92%). The results of the implementation are influenced by the availability of rules, word-class tagsets, and corpus data.
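
A minimal, hypothetical sketch of the rule-based idea described above: an ambiguous Indonesian word carries a set of candidate word classes, and hand-written context rules select one of them. The word, classes, and rules shown are illustrative examples, not part of the 71 rules of the actual algorithm.

```python
CANDIDATES = {"jalan": {"NOUN", "VERB"}}   # e.g. "jalan" as road (noun) vs. walk/go (verb)

def disambiguate(prev_tag, word):
    classes = CANDIDATES.get(word)
    if not classes:
        return "UNK"                       # unknown word: no candidate classes available
    if len(classes) == 1:
        return next(iter(classes))
    if prev_tag in {"DET", "PREP"}:        # example rule: after a determiner/preposition, prefer the noun reading
        return "NOUN"
    if prev_tag in {"AUX", "ADV"}:         # example rule: after an auxiliary/adverb, prefer the verb reading
        return "VERB"
    return "NOUN"                          # fall back to the most frequent class

print(disambiguate("PREP", "jalan"))       # NOUN
```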


2015 ◽  
Author(s):  
Abraham G Ayana

Natural Language Processing (NLP) refers to human-like language processing and is a discipline within the field of Artificial Intelligence (AI). However, the ultimate goal of research on Natural Language Processing is to parse and understand language, which has not yet been fully achieved. For this reason, much research in NLP has focused on intermediate tasks that make sense of some of the structure inherent in language without requiring complete understanding. One such task is part-of-speech tagging, or simply tagging. The lack of a standard part-of-speech tagger for Afaan Oromo is a major obstacle for researchers in the areas of machine translation, spell checking, dictionary compilation, and automatic sentence parsing and construction. Even though several works on POS tagging for Afaan Oromo have been done, the performance of the tagger has not yet been sufficiently improved. Hence, the aim of this thesis is to improve the lexical and transformation rules of Brill's tagger for Afaan Oromo POS tagging using a sufficiently large training corpus. Accordingly, Afaan Oromo literature on grammar and morphology was reviewed to understand the nature of the language and to identify possible tagsets. As a result, a tagset of 26 broad tags was identified, and 17,473 words from around 1,100 sentences containing 6,750 distinct words were tagged for training and testing purposes, of which 258 sentences were taken from previous work. Since only a few ready-made standard corpora exist, the manual tagging process required to prepare the corpus for this work was challenging; hence, it is recommended that a standard corpus be prepared. Transformation-based error-driven learning is adapted for Afaan Oromo part-of-speech tagging. Different experiments were conducted for the rule-based approach, taking 20% of the whole data for testing, and a comparison with the previously adapted Brill's tagger was made. The previously adapted Brill's tagger shows an accuracy of 80.08%, whereas the improved Brill's tagger shows an accuracy of 95.6%, an improvement of 15.52 percentage points. Hence, it is found that the size of the training corpus, the rule-generating system in the lexical rule learner, and the use of an Afaan Oromo HMM tagger as the initial-state tagger have a significant effect on the improvement of the tagger.
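
As a sketch of the transformation-based (Brill) approach referred to above, the snippet below trains a Brill tagger with NLTK's stock rule templates. It uses the Penn Treebank sample as a stand-in corpus, since the Afaan Oromo corpus is not publicly packaged, and the thesis itself uses its own lexical and transformation rules with an HMM initial-state tagger rather than the unigram tagger shown here.

```python
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, brill, brill_trainer

nltk.download("treebank", quiet=True)

# Stand-in corpus; replace with Afaan Oromo sentences in [(word, tag), ...] format.
tagged_sents = list(treebank.tagged_sents())
split = int(len(tagged_sents) * 0.8)                  # 80% training, 20% testing as in the thesis
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

initial = UnigramTagger(train_sents)                  # initial-state tagger (the thesis uses an HMM tagger)
trainer = brill_trainer.BrillTaggerTrainer(initial, brill.fntbl37(), trace=0)
tagger = trainer.train(train_sents, max_rules=200)    # learn up to 200 transformation rules

print(tagger.accuracy(test_sents))                    # token-level accuracy on held-out data
```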


Part-of-speech tagging is an initial step in the development of natural language processing (NLP) applications. POS tagging is a sequence labelling task in which each word (Wi) in a sentence is assigned a part-of-speech tag (Ti) as its label, i.e. (Wi/Ti … Wn/Tn). In this research project, part-of-speech tagging is performed for Hindi. Hindi is the fourth most popular language, spoken by approximately 4 billion people across the globe. Hindi is a free word-order and morphologically rich language; because of this, part-of-speech tagging is a very challenging task. In this paper we show the development of a POS tagger using a neural approach.


Author(s):  
Sunita Warjri ◽  
Partha Pakray ◽  
Saralin A. Lyngdoh ◽  
Arnab Kumar Maji

Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of a particular language and large amounts of data or corpora for feature engineering, which can lead to a good-performing tagger. Our main contribution in this research work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the present Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods have been used to experiment with the designed Khasi POS corpus. POS taggers based on BiLSTM, a combination of BiLSTM with CRF, and character-based embedding with BiLSTM are presented. The main challenges that computational linguistics faces in understanding and handling natural language are anticipated. In the presently designed corpus, we have tried to resolve the ambiguities of words with respect to their contextual usage, as well as the orthography problems that arise in the corpus. The designed Khasi corpus size is around 96,100 tokens and consists of 6,616 distinct words. Initially, when running the first few sets of data, around 41,000 tokens, in our experiments, the taggers were found to yield considerably accurate results. When the Khasi corpus size was increased to 96,100 tokens, we saw an increase in accuracy, and the analyses were more pertinent. As a result, an accuracy of 96.81% is achieved with the BiLSTM method, 96.98% with the BiLSTM-CRF technique, and 95.86% with character-based embedding with BiLSTM. To situate this work within NLP research on Khasi, we also present some existing POS taggers and other NLP works on the Khasi language for comparative purposes.
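
A minimal sketch, with hypothetical layer sizes, of how character-based embeddings can be combined with word embeddings in a BiLSTM tagger of the family described above; the CRF output layer used in one of the reported systems is omitted, and this is an illustration of the architecture, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CharWordTagger(nn.Module):
    def __init__(self, n_words, n_chars, n_tags,
                 word_dim=100, char_dim=30, char_hidden=25, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, hidden,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        b, s, c = char_ids.shape
        chars = self.char_emb(char_ids.view(b * s, c))        # character embeddings per word
        _, (h, _) = self.char_lstm(chars)                     # final states of the char BiLSTM
        char_feats = torch.cat([h[0], h[1]], dim=-1).view(b, s, -1)
        words = self.word_emb(word_ids)
        hidden, _ = self.word_lstm(torch.cat([words, char_feats], dim=-1))
        return self.out(hidden)                               # per-token tag scores

tagger = CharWordTagger(n_words=6616, n_chars=60, n_tags=30)  # tagset/char sizes are placeholders
scores = tagger(torch.zeros(2, 7, dtype=torch.long), torch.zeros(2, 7, 12, dtype=torch.long))
```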


Author(s):  
Ahmad T. Al-Taani ◽  
Fadi A. ALkhazaaleh

Background: Part of Speech (POS) tagging is the process of determining the suitable part of speech for each word in a given context, such as deciding whether a word is a verb, a noun, or a particle. POS tagging is an important preprocessing step in many Natural Language Processing (NLP) applications such as question answering, text summarization, and information retrieval. Objective: The performance of NLP applications depends on the accuracy of POS taggers, since assigning the right tags to the words in a sentence enables the application to work properly after tagging. Many approaches have been proposed for the Arabic language, but more investigation is needed to improve the efficiency of Arabic POS taggers. Method: In this study, we propose a supervised POS tagging system for the Arabic language using Particle Swarm Optimization (PSO) and Genetic Algorithms (GA) as well as a Hidden Markov Model (HMM). The tagging process is treated as an optimization problem and modelled as a swarm consisting of a group of particles, where each particle represents a sequence of tags. The PSO algorithm is applied to find the best sequence of tags, i.e. the correct tags of the sentence. The genetic operators crossover and mutation are used to compute the personal best, global best, and velocity of the PSO algorithm, and the HMM is used to compute the fitness of the particles in the swarm. Results: The performance of the proposed approach is evaluated on the KALIMAT dataset, which consists of 18 million words and a tag set of 45 tags covering all Arabic POS tags. The proposed tagger achieved an accuracy of 90.5%. Conclusion: Experimental results revealed that the proposed tagger achieved promising results compared to four existing approaches, which can identify only three tags: noun, verb, and particle. Also, the accuracy for some tags outperforms that achieved by the other approaches.
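
A minimal, hypothetical sketch of the tagging-as-PSO idea described above: each particle is a candidate tag sequence for one sentence, its fitness is an HMM-style probability (the emission and transition tables would be estimated from tagged training data), and GA-style crossover and mutation play the role of the velocity update. The tag names, probability tables, and parameters are illustrative, not the paper's.

```python
import random

TAGS = ["NOUN", "VERB", "PART"]

def hmm_fitness(words, tags, emit, trans):
    """HMM-style score: product of transition and emission probabilities (tiny smoothing for unseen events)."""
    score, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        score *= trans.get((prev, t), 1e-6) * emit.get((t, w), 1e-6)
        prev = t
    return score

def crossover(a, b):
    if len(a) < 2:
        return list(a)
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(tags, rate=0.1):
    return [random.choice(TAGS) if random.random() < rate else t for t in tags]

def pso_tag(words, emit, trans, n_particles=30, n_iter=50):
    particles = [[random.choice(TAGS) for _ in words] for _ in range(n_particles)]
    pbest = [list(p) for p in particles]
    for _ in range(n_iter):
        gbest = max(pbest, key=lambda p: hmm_fitness(words, p, emit, trans))
        for i, p in enumerate(particles):
            # GA operators stand in for the velocity update: mix the particle with
            # its personal best and the global best, then mutate.
            cand = mutate(crossover(crossover(p, pbest[i]), gbest))
            particles[i] = cand
            if hmm_fitness(words, cand, emit, trans) > hmm_fitness(words, pbest[i], emit, trans):
                pbest[i] = cand
    return max(pbest, key=lambda p: hmm_fitness(words, p, emit, trans))
```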


2004 ◽  
Vol 9 (1) ◽  
pp. 53-68 ◽  
Author(s):  
Montserrat Arévalo Rodríguez ◽  
Montserrat Civit Torruella ◽  
Maria Antònia Martí

In the field of corpus linguistics, Named Entity treatment includes the recognition and classification of different types of discursive elements such as proper names, dates, times, etc. These discursive elements play an important role in different Natural Language Processing applications and techniques, such as Information Retrieval, Information Extraction, translation memories, document routers, etc.

