POS tagging
Recently Published Documents





Sunita Warjri ◽  
Partha Pakray ◽  
Saralin A. Lyngdoh ◽  
Arnab Kumar Maji

Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of the particular language and large amounts of data or corpora for feature engineering, which together lead to good tagger performance. Our main contribution in this research work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the designed Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods have been used to experiment with the corpus. POS taggers based on BiLSTM, BiLSTM combined with CRF, and character-based embedding with BiLSTM are presented. The main challenges encountered in understanding and handling natural language from a computational linguistics perspective are anticipated. In the designed corpus, we have tried to resolve the ambiguities of words with respect to their contextual usage, as well as the orthography problems that arise in the POS corpus. The corpus contains around 96,100 tokens and 6,616 distinct words. Initially, on the first sets of data of around 41,000 tokens, the taggers were found to yield considerably accurate results. When the corpus was increased to 96,100 tokens, accuracy increased and the analyses became more pertinent. The resulting accuracy is 96.81% for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for character-based embedding with LSTM. To place this work in the context of NLP research on Khasi, we also present some recent POS taggers and other NLP work on the Khasi language for comparison.

Tatsuya Hiraoka ◽  
Sho Takase ◽  
Kei Uchiumi ◽  
Atsushi Keyaki ◽  
Naoaki Okazaki

We propose a method that attends to high-order relations among latent states to improve conventional HMMs, which focus only on the latest latent state because they assume the Markov property. To address the high-order relations, we apply an RNN to each sequence of latent states, because the RNN can represent the information of an arbitrary-length sequence with its cell, a fixed-size vector. However, the simplest way, which provides all latent sequences explicitly to the RNN, is intractable due to the combinatorial explosion of the search space of latent states. Thus, we modify the RNN to represent the history of latent states from the beginning of the sequence to the current state with a fixed number of RNN cells, equal to the number of possible states. We conduct experiments on unsupervised POS tagging and synthetic datasets. Experimental results show that the proposed method achieves better performance than previous methods. In addition, the results on the synthetic dataset indicate that the proposed method can capture the high-order relations.

2022 ◽  
Vol 14 (1) ◽  
pp. 0-0

POS (parts of speech) tagging, a vital step in diverse natural language processing (NLP) tasks, has not drawn much attention in the case of Odia, a computationally under-developed language. The proposed hybrid method suggests a robust POS tagger for Odia. Given the rich morphology of the language and the lack of sufficient annotated text corpora, a combination of machine learning and linguistic rules is adopted in building the tagger. The tagger is trained on a tagged text corpus from the tourism domain and obtains a perceptible improvement in results. An appreciable performance is also observed on news article texts from varied domains. Experiments on Odia show that the proposed algorithm outperforms existing methods such as rule-based, hidden Markov model (HMM), maximum entropy (ME), and conditional random field (CRF) taggers.

2021 ◽  
pp. 587-595
Alebachew Chiche ◽  
Hiwot Kadi ◽  
Tibebu Bekele

Natural language processing plays a great role in providing an interface for human-computer communication. It enables people to talk with the computer in their own natural language rather than machine language. This study presents a part-of-speech tagger that assigns a word class to each word in a given sentence or paragraph. Researchers have developed part-of-speech taggers for different languages such as English, Amharic, Afan Oromo, and Tigrigna. On the other hand, many other languages, such as Shekki’noono, do not have POS taggers. A POS tagger is incorporated as a basic component in most natural language processing tools, such as machine translation and information extraction. It is therefore necessary to develop a part-of-speech tagger for a language before advanced natural language applications can be built for it, because those applications enhance machine-to-machine, machine-to-human, and human-to-human communication. However, a POS tagger built for one language cannot be applied directly to another language. To develop the Shekki’noono POS tagger, we have used the stochastic Hidden Markov Model (HMM). For the study, we have used 1500 sentences collected from different sources such as newspapers (covering social, economic, and political aspects), modules, textbooks, radio programs, and bulletins. Language experts labeled each word in the collected sentences with its appropriate part of speech. In the experiments, the part-of-speech tagger is trained on the training sets using the Hidden Markov Model. The experiments showed that HMM-based POS tagging achieved 92.77% accuracy for Shekki’noono. The POS tagger model is also compared with previous HMM-based experiments in related work. As future work, the proposed approach can be evaluated on a larger corpus.
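
To make the HMM approach described above concrete, the following is a minimal sketch of supervised HMM tagging with Viterbi decoding. It is not the authors' implementation: the English toy corpus, the tag names, the `<s>` start symbol, and the add-one smoothing are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
import math

def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from (word, tag) sequences."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sent in tagged_sentences:
        prev = "<s>"  # assumed sentence-start symbol
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for a non-empty list of words."""
    trans, emit, tags, vocab = model
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 reserves mass for unknown words
    # Each column maps tag -> (best log score, backpointer to previous tag).
    best = [{t: (log_prob(trans["<s>"], t, n_tags)
                 + log_prob(emit[t], words[0], n_words), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans[p], t, n_tags), p) for p in tags
            )
            column[t] = (score + log_prob(emit[t], word, n_words), prev)
        best.append(column)
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# Invented toy corpus, for illustration only.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
]
model = train_hmm(corpus)
tagged = viterbi(["a", "dog", "sleeps"], model)
```

Here `tagged` holds one tag per input word. A real tagger would train on an expert-labeled corpus like the 1500 sentences mentioned above and would need a more careful treatment of unknown words.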

Electronics ◽  
2021 ◽  
Vol 11 (1) ◽  
pp. 56
Hongwei Li ◽  
Hongyan Mao ◽  
Jingzi Wang

Part-of-Speech (POS) tagging is one of the most important tasks in the field of natural language processing (NLP). The POS tag of a word depends not only on the word itself but also on its position, its surrounding words, and their POS tags. POS tagging can be an upstream task for other NLP tasks, further improving their performance. Therefore, it is important to improve the accuracy of POS tagging. In POS tagging, bidirectional Long Short-Term Memory (Bi-LSTM) is commonly used and achieves good performance. However, Bi-LSTM is not as powerful as Transformer in leveraging contextual information, since Bi-LSTM simply concatenates the contextual information from left-to-right and right-to-left. In this study, we propose a novel approach to POS tagging that improves accuracy. For each token, all possible POS tags are obtained without considering context, and then rules are applied to prune these candidate tags, which we call rule-based data preprocessing. In this way, the number of possible POS tags for most tokens can be reduced to one, and these tokens are considered to be correctly tagged. Finally, the POS tags of the remaining tokens are masked, and a Transformer-based model is used to predict only the masked POS tags, which enables it to leverage bidirectional contexts. Our experimental results show that our approach leads to better performance than other methods using Bi-LSTM.
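
The rule-based preprocessing step described above can be sketched roughly as follows. This is an illustrative assumption, not the paper's rule set: the lexicon `LEXICON`, the single rule `rule_after_determiner`, and the `[MASK]` placeholder are hypothetical, and the Transformer that would later predict the masked tags is omitted.

```python
MASK = "[MASK]"  # hypothetical placeholder for tags left to a downstream model

# Hypothetical context-free lexicon: every tag a token may take.
LEXICON = {
    "the": {"DET"},
    "old": {"ADJ"},
    "can": {"AUX", "NOUN", "VERB"},
    "rusts": {"NOUN", "VERB"},
}

def rule_after_determiner(candidates, i):
    """Hypothetical rule: right after an unambiguous DET, drop verbal readings."""
    if i > 0 and candidates[i - 1] == {"DET"}:
        pruned = candidates[i] - {"AUX", "VERB"}
        return pruned or candidates[i]  # never prune to an empty set
    return candidates[i]

RULES = [rule_after_determiner]

def preprocess(tokens):
    """Return one tag per token: a fixed tag if unambiguous, else MASK."""
    # Step 1: look up all candidate tags without considering context.
    candidates = [set(LEXICON.get(tok.lower(), set())) for tok in tokens]
    # Step 2: apply pruning rules left to right.
    for i in range(len(tokens)):
        for rule in RULES:
            candidates[i] = rule(candidates, i)
    # Step 3: keep tags reduced to a single candidate, mask everything else.
    return [next(iter(c)) if len(c) == 1 else MASK for c in candidates]

partial_tags = preprocess(["The", "can", "rusts"])
```

In this toy run the determiner and the noun after it are fixed by lookup and rule, while the still-ambiguous last token is masked. Only masked positions would be passed to the context-aware model.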

Information ◽  
2021 ◽  
Vol 12 (12) ◽  
pp. 523
Reyadh Alluhaibi ◽  
Tareq Alfraidi ◽  
Mohammad A. R. Abdeen ◽  
Ahmed Yatimi

Part of Speech (POS) tagging is one of the most common techniques used in natural language processing (NLP) applications and corpus linguistics. Various POS tagging tools have been developed for Arabic. These taggers differ in several aspects, such as their modeling techniques, tag sets, and training and testing data. In this paper we conduct a comparative study of five Arabic POS taggers, namely Stanford Arabic, CAMeL Tools, Farasa, MADAMIRA, and Arabic Linguistic Pipeline (ALP), and examine their performance using text samples from Saudi novels. The testing data has been extracted from different novels that represent different types of narration. The main result indicates that the ALP tagger performs better than the others in this particular case, and that Adjective is the most frequently mistagged POS type compared to Noun and Verb.
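
A comparison like the one above rests on per-tag error counts. The sketch below shows one simple way to compute them from aligned gold and predicted tag sequences. The function name `mistag_counts` and the sample tags are invented for illustration and do not reproduce the study's data or tooling.

```python
from collections import Counter

def mistag_counts(gold, predicted):
    """Map each gold tag to (number mistagged, number of occurrences)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted sequences must be aligned")
    errors = Counter(g for g, p in zip(gold, predicted) if g != p)
    totals = Counter(gold)
    return {tag: (errors[tag], totals[tag]) for tag in totals}

# Invented example tags, not taken from the study.
gold = ["NOUN", "ADJ", "VERB", "ADJ", "NOUN"]
predicted = ["NOUN", "NOUN", "VERB", "NOUN", "NOUN"]
per_tag = mistag_counts(gold, predicted)
worst = max(per_tag, key=lambda tag: per_tag[tag][0])
```

Reporting the count alongside the total occurrences of each tag helps separate tags that are hard to predict from tags that are simply frequent.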

2021 ◽  
Vol 2021 ◽  
pp. 1-13
Muhammad Yasir ◽  
Li Chen ◽  
Amna Khatoon ◽  
Muhammad Amir Malik ◽  
Fazeel Abid

Mixed script identification is a hindrance for automated natural language processing systems. Mixing cursive scripts of different languages is a challenge because NLP methods like POS tagging and word sense disambiguation suffer from noisy text. This study tackles the challenge of mixed script identification for a mixed-code dataset consisting of Roman Urdu, Hindi, Saraiki, Bengali, and English. The language identification model is trained using word vectorization and RNN variants. Moreover, through experimental investigation, different architectures are optimized for the task: Long Short-Term Memory (LSTM), Bidirectional LSTM, Gated Recurrent Unit (GRU), and Bidirectional Gated Recurrent Unit (Bi-GRU). Experimentation achieved the highest accuracy of 90.17 for Bi-GRU, applying learned word class features along with GloVe embeddings. This study also addresses issues related to multilingual environments, such as Roman words merged with English characters, generative spellings, and phonetic typing.

2021 ◽  
Vol 72 (2) ◽  
pp. 590-602
Kirill I. Semenov ◽  
Armine K. Titizian ◽  
Aleksandra O. Piskunova ◽  
Yulia O. Korotkova ◽  
Alena D. Tsvetkova

The article tackles the problems of linguistic annotation in the Chinese texts presented in the Ruzhcorp – Russian-Chinese Parallel Corpus of RNC, and the ways to solve them. Particular attention is paid to the processing of Russian loanwords. On the one hand, we present a theoretical comparison of the widespread standards of Chinese text processing. On the other hand, we describe our experiments in three fields: word segmentation, grapheme-to-phoneme conversion, and PoS-tagging, on specific corpus data that contains many transliterations and loanwords. As a result, we propose a preprocessing pipeline for the Chinese texts that will be implemented in Ruzhcorp.

Otman Maarouf ◽  
Rachid El Ayachi ◽  
Mohamed Biniz

Natural language processing (NLP) is a part of artificial intelligence that analyzes, understands, and transforms natural languages with computers in written and spoken settings. Part-of-speech (POS) tagging marks each word according to its role in the statement. We find in the literature that POS tagging has been applied to a few languages, in particular French and English. This paper investigates attention-based long short-term memory (LSTM) networks and a simple recurrent neural network (RNN) for Tifinagh POS tagging, compared with conditional random fields (CRF) and a decision tree. The attractiveness of LSTM networks is their strength in modeling long-distance dependencies. The experimental results show that LSTM networks perform better than the RNN, CRF, and decision tree, which achieve close performance.
