POS Tagging: Latest Research Papers

Part-of-Speech (POS) Tagging Using Deep Learning-Based Approaches on the Designed Khasi POS Corpus

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3488381 ◽

2022 ◽

Vol 21 (3) ◽

pp. 1-24

Author(s):

Sunita Warjri ◽

Partha Pakray ◽

Saralin A. Lyngdoh ◽

Arnab Kumar Maji

Keyword(s):

Deep Learning ◽

Natural Language ◽

Computational Linguistics ◽

Language Processing ◽

Research Work ◽

Pos Tagging ◽

Part Of Speech ◽

Corpus Size ◽

Increase In Accuracy ◽

Pos Tagger

Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of a particular language, along with large amounts of data or corpora for feature engineering, to achieve good tagger performance. Our main contribution in this research work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind has been developed, formally or otherwise. In the designed Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods have been used to experiment with the corpus. POS taggers based on BiLSTM, BiLSTM combined with CRF, and character-based embedding with BiLSTM are presented. The main challenges encountered in understanding and handling natural language from a computational linguistics perspective are anticipated. In the designed corpus, we have tried to resolve the ambiguity of words with respect to their contextual usage, as well as the orthography problems that arise in the corpus. The corpus contains around 96,100 tokens and 6,616 distinct words. In initial experiments on the first sets of data, around 41,000 tokens, the taggers yielded considerably accurate results. When the corpus was enlarged to 96,100 tokens, accuracy increased and the analyses became more pertinent. The BiLSTM method achieves an accuracy of 96.81%, BiLSTM with CRF 96.98%, and the character-based model with LSTM 95.86%. To situate this work within NLP research on Khasi, we also present some recent POS taggers and other NLP work on the Khasi language for comparison.
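The abstract reports corpus size in tokens and distinct words, and tagger quality as accuracy. A minimal sketch of how such figures are typically computed from a manually tagged corpus follows. The function names and the toy data (placeholder words `w1` to `w4` and tags `N`, `V`) are illustrative assumptions, not the authors' code, tagset, or data.

```python
def corpus_stats(tagged_sentences):
    """Return (token_count, distinct_word_count) for a tagged corpus.

    Each sentence is a list of (word, tag) pairs.
    """
    words = [word for sent in tagged_sentences for word, _tag in sent]
    return len(words), len(set(words))


def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    pairs = [(g, p) for gs, ps in zip(gold, predicted) for g, p in zip(gs, ps)]
    if not pairs:
        return 0.0
    correct = sum(1 for g, p in pairs if g == p)
    return correct / len(pairs)


# Toy data with placeholder words and tags (hypothetical, not from the Khasi corpus).
toy_corpus = [
    [("w1", "N"), ("w2", "V"), ("w3", "N")],
    [("w1", "N"), ("w4", "V")],
]
toy_gold = [["N", "V", "N"], ["N", "V"]]
toy_pred = [["N", "V", "V"], ["N", "V"]]

tokens, distinct = corpus_stats(toy_corpus)
accuracy = token_accuracy(toy_gold, toy_pred)
```

On the toy data this counts 5 tokens and 4 distinct words, and one wrong tag out of five gives an accuracy of 0.8.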


Recurrent Neural Hidden Markov Model for High-order Transition

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3476511 ◽

2022 ◽

Vol 21 (2) ◽

pp. 1-15

Author(s):

Tatsuya Hiraoka ◽

Sho Takase ◽

Kei Uchiumi ◽

Atsushi Keyaki ◽

Naoaki Okazaki

Keyword(s):

Markov Property ◽

Search Space ◽

Fixed Number ◽

High Order ◽

Arbitrary Length ◽

Pos Tagging ◽

Current State ◽

Order Relations ◽

History Of ◽

Synthetic Datasets

We propose a method that attends to high-order relations among latent states in order to improve conventional HMMs, which focus only on the latest latent state because they assume the Markov property. To address the high-order relations, we apply an RNN to each sequence of latent states, because an RNN can represent the information of an arbitrary-length sequence with its cell, a fixed-size vector. However, the simplest approach, which provides all latent sequences explicitly to the RNN, is intractable due to the combinatorial explosion of the search space of latent states. We therefore modify the RNN to represent the history of latent states from the beginning of the sequence to the current state with a fixed number of RNN cells, equal to the number of possible states. We conduct experiments on unsupervised POS tagging and on synthetic datasets. Experimental results show that the proposed method achieves better performance than previous methods. In addition, the results on the synthetic dataset indicate that the proposed method can capture the high-order relations.


A Modified Markov Based Maximum-entropy Model for POS Tagging of Odia Text

International Journal of Decision Support System Technology ◽

10.4018/ijdsst.286690 ◽

2022 ◽

Vol 14 (1) ◽

pp. 0-0

Keyword(s):

Maximum Entropy ◽

Language Processing ◽

Conditional Random Field ◽

Entropy Model ◽

Text Corpus ◽

Parts Of Speech ◽

Pos Tagging ◽

Linguistic Rules ◽

The Rich ◽

Pos Tagger

POS (parts of speech) tagging, a vital step in diverse natural language processing (NLP) tasks, has not drawn much attention in the case of Odia, a computationally under-developed language. This work proposes a robust hybrid POS tagger for Odia. Given the rich morphology of the language and the lack of a sufficient annotated text corpus, the tagger is built on a combination of machine learning and linguistic rules. The tagger is trained on a tagged text corpus from the tourism domain and obtains a perceptible improvement in results. Appreciable performance is also observed on news article texts from varied domains. In experiments on Odia, the proposed algorithm outperforms existing methods such as rule-based tagging, hidden Markov models (HMM), maximum entropy (ME), and conditional random fields (CRF).


Part of Speech (PoS) Tagging for Konkani Language Using HMM

ICT Systems and Sustainability - Lecture Notes in Networks and Systems ◽

10.1007/978-981-16-5987-4_61 ◽

2022 ◽

pp. 601-609

Author(s):

Annie Rajan ◽

Ambuja Salgaonkar

Keyword(s):

Pos Tagging ◽

Part Of Speech


A Hidden Markov Model-based Part of Speech Tagger for Shekki’noono Language

International Journal of Computing ◽

10.47839/ijc.20.4.2448 ◽

2021 ◽

pp. 587-595

Author(s):

Alebachew Chiche ◽

Hiwot Kadi ◽

Tibebu Bekele

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Markov Model ◽

Hidden Markov Model ◽

Language Processing ◽

Hidden Markov ◽

Parts Of Speech ◽

Pos Tagging ◽

Part Of Speech ◽

Pos Tagger

Natural language processing plays a great role in providing an interface for human-computer communication. It enables people to talk with the computer in their own language rather than machine language. This study presents a part-of-speech (POS) tagger that assigns a word class to each word in a given sentence or paragraph. Researchers have developed POS taggers for languages such as English, Amharic, Afan Oromo, and Tigrigna. Many other languages, including Shekki’noono, still have no POS tagger. A POS tagger is a basic component of most natural language processing tools, such as machine translation and information extraction. A language therefore needs a POS tagger before more advanced natural language applications can be built for it, and those applications enhance machine-to-machine, machine-to-human, and human-to-human communication. However, a POS tagger built for one language cannot be applied directly to another. To develop the Shekki’noono POS tagger, we used the stochastic Hidden Markov Model (HMM). For the study, we used 1500 sentences collected from different sources such as newspapers (covering social, economic, and political topics), modules, textbooks, radio programs, and bulletins. Language experts labeled each word in the collected sentences with its appropriate part of speech. The tagger was trained on the training sets using the Hidden Markov Model. In the experiments, HMM-based POS tagging achieved 92.77% accuracy for Shekki’noono. The POS tagger model is also compared with previous HMM-based experiments in related work. As future work, the proposed approach can be evaluated on a larger corpus.
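An HMM tagger of the kind described here is usually decoded with the Viterbi algorithm, which finds the most probable tag sequence given start, transition, and emission probabilities. The sketch below is a generic first-order version, not the authors' implementation. The function name `viterbi`, the `floor` smoothing constant, and the English toy probabilities are all assumptions made for illustration, and nothing here comes from the Shekki’noono data.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Most likely tag sequence for `words` under a first-order HMM.

    Probabilities are plain dicts; missing entries fall back to `floor`
    (an assumed, very crude stand-in for proper smoothing).
    """
    if not words:
        return []

    def lp(p):
        return math.log(max(p, floor))

    # best[i][t]: best log-probability of any tag path ending in tag t at position i
    best = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)) for t in tags}]
    back = []  # back[i][t]: best previous tag for tag t at position i + 1
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: best[-1][s] + lp(trans_p[s].get(t, 0)))
            scores[t] = (best[-1][prev] + lp(trans_p[prev].get(t, 0))
                         + lp(emit_p[t].get(word, 0)))
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model with made-up probabilities (hypothetical values, English for readability).
TAGS = ["N", "V"]
START = {"N": 0.8, "V": 0.2}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
EMIT = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"dogs": 0.05, "bark": 0.7}}

toy_path = viterbi(["dogs", "bark"], TAGS, START, TRANS, EMIT)
```

In a trained tagger the three probability tables would be estimated from counts over the expert-labeled sentences; here they are hand-set so that the toy sentence decodes to `["N", "V"]`.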


Part-of-Speech Tagging with Rule-Based Data Preprocessing and Transformer

Electronics ◽

10.3390/electronics11010056 ◽

2021 ◽

Vol 11 (1) ◽

pp. 56

Author(s):

Hongwei Li ◽

Hongyan Mao ◽

Jingzi Wang

Keyword(s):

Language Processing ◽

Short Term Memory ◽

Contextual Information ◽

Data Preprocessing ◽

Experimental Result ◽

Rule Based ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Novel Approach

Part-of-Speech (POS) tagging is one of the most important tasks in the field of natural language processing (NLP). POS tagging for a word depends not only on the word itself but also on its position, its surrounding words, and their POS tags. POS tagging can be an upstream task for other NLP tasks, further improving their performance. Therefore, it is important to improve the accuracy of POS tagging. In POS tagging, bidirectional Long Short-Term Memory (Bi-LSTM) is commonly used and achieves good performance. However, Bi-LSTM is not as powerful as Transformer in leveraging contextual information, since Bi-LSTM simply concatenates the contextual information from left-to-right and right-to-left. In this study, we propose a novel approach for POS tagging to improve the accuracy. For each token, all possible POS tags are obtained without considering context, and then rules are applied to prune out these possible POS tags, which we call rule-based data preprocessing. In this way, the number of possible POS tags of most tokens can be reduced to one, and they are considered to be correctly tagged. Finally, POS tags of the remaining tokens are masked, and a model based on Transformer is used to only predict the masked POS tags, which enables it to leverage bidirectional contexts. Our experimental result shows that our approach leads to better performance than other methods using Bi-LSTM.
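The preprocessing step described above (collect all candidate tags per token, prune them with rules, then mask whatever remains ambiguous) can be sketched as follows. This is an illustration under stated assumptions: the lexicon, the tag names, the `[MASK]` symbol, and the single rule `after_determiner_rule` are hypothetical, since the abstract does not list the actual rules, and the Transformer that would predict the masked tags is not shown.

```python
MASK = "[MASK]"  # assumed placeholder for tags left to the model


def candidate_tags(token, lexicon, all_tags):
    """All context-free candidate tags for a token; unknown words get every tag."""
    return set(lexicon.get(token.lower(), all_tags))


def prune(tokens, lexicon, all_tags, rules):
    """Apply each rule in turn to narrow the candidate tag sets."""
    cands = [candidate_tags(t, lexicon, all_tags) for t in tokens]
    for rule in rules:
        cands = rule(tokens, cands)
    return cands


def mask_unresolved(cands):
    """Keep the tag where exactly one candidate survives, else emit MASK."""
    return [next(iter(c)) if len(c) == 1 else MASK for c in cands]


def after_determiner_rule(tokens, cands):
    """Hypothetical rule: a word right after an unambiguous determiner is not a verb."""
    out = [set(c) for c in cands]
    for i in range(1, len(tokens)):
        if cands[i - 1] == {"DET"} and len(out[i]) > 1:
            out[i].discard("VERB")
    return out


# Toy lexicon and sentence (made up for illustration).
ALL_TAGS = {"DET", "NOUN", "VERB", "ADJ", "ADV"}
LEXICON = {
    "the": {"DET"},
    "run": {"NOUN", "VERB"},
    "was": {"VERB"},
    "long": {"ADJ", "ADV"},
}
toy_tokens = ["The", "run", "was", "long"]
toy_pruned = prune(toy_tokens, LEXICON, ALL_TAGS, [after_determiner_rule])
toy_masked = mask_unresolved(toy_pruned)
```

On the toy sentence the rule resolves "run" to `NOUN`, while "long" keeps two candidates and is masked, so only that position would be passed to the model for prediction.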


A Comparative Study of Arabic Part of Speech Taggers Using Literary Text Samples from Saudi Novels

Information ◽

10.3390/info12120523 ◽

2021 ◽

Vol 12 (12) ◽

pp. 523

Author(s):

Reyadh Alluhaibi ◽

Tareq Alfraidi ◽

Mohammad A. R. Abdeen ◽

Ahmed Yatimi

Keyword(s):

Comparative Study ◽

Language Processing ◽

Corpus Linguistics ◽

Pos Tagging ◽

Part Of Speech ◽

Testing Data ◽

Modeling Techniques ◽

Different Types ◽

And Training ◽

Better Than

Part of Speech (POS) tagging is one of the most common techniques used in natural language processing (NLP) applications and corpus linguistics. Various POS tagging tools have been developed for Arabic. These taggers differ in several aspects, such as their modeling techniques, tag sets, and training and testing data. In this paper we conduct a comparative study of five Arabic POS taggers, namely Stanford Arabic, CAMeL Tools, Farasa, MADAMIRA, and Arabic Linguistic Pipeline (ALP), examining their performance on text samples from Saudi novels. The testing data has been extracted from different novels that represent different types of narration. The main result indicates that the ALP tagger performs better than the others in this particular case, and that Adjective is the most frequently mistagged POS type compared with Noun and Verb.
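A finding such as "Adjective is the most frequently mistagged POS type" comes from counting errors per gold tag. A minimal sketch of that tally is given below. The function name `mistag_counts` and the toy tag sequences are assumptions for illustration and do not reproduce the study's data or any of the five taggers' output.

```python
from collections import Counter


def mistag_counts(gold_tags, predicted_tags):
    """Count, per gold tag, how often the tagger predicted something else."""
    errors = Counter()
    for gold, predicted in zip(gold_tags, predicted_tags):
        if gold != predicted:
            errors[gold] += 1
    return errors


# Toy gold and predicted sequences (hypothetical).
toy_gold_tags = ["NOUN", "ADJ", "VERB", "ADJ", "NOUN"]
toy_predicted = ["NOUN", "NOUN", "VERB", "NOUN", "VERB"]

toy_errors = mistag_counts(toy_gold_tags, toy_predicted)
worst_tag, worst_count = toy_errors.most_common(1)[0]
```

Running the same tally once per tagger on a shared gold-annotated sample gives directly comparable per-tag error profiles.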


Mixed Script Identification Using Automated DNN Hyperparameter Optimization

Computational Intelligence and Neuroscience ◽

10.1155/2021/8415333 ◽

2021 ◽

Vol 2021 ◽

pp. 1-13

Author(s):

Muhammad Yasir ◽

Li Chen ◽

Amna Khatoon ◽

Muhammad Amir Malik ◽

Fazeel Abid

Keyword(s):

Language Processing ◽

Short Term Memory ◽

Word Sense Disambiguation ◽

Word Sense ◽

Script Identification ◽

Pos Tagging ◽

Sense Disambiguation ◽

Noisy Text ◽

Gated Recurrent Unit ◽

Mixed Code

Mixed script identification is a hindrance for automated natural language processing systems. Mixing cursive scripts of different languages is a challenge because NLP methods such as POS tagging and word sense disambiguation suffer from noisy text. This study tackles the challenge of mixed script identification for a mixed-code dataset consisting of Roman Urdu, Hindi, Saraiki, Bengali, and English. The language identification model is trained using word vectorization and RNN variants. Through experimental investigation, different architectures are optimized for the task: Long Short-Term Memory (LSTM), Bidirectional LSTM, Gated Recurrent Unit (GRU), and Bidirectional Gated Recurrent Unit (Bi-GRU). The highest accuracy, 90.17, was achieved by Bi-GRU using learned word-class features together with GloVe embeddings. This study also addresses issues related to multilingual environments, such as Roman words merged with English characters, generative spellings, and phonetic typing.


Linguistic Annotation of Translated Chinese Texts: Coordinating Theory, Algorithms and Data

Journal of Linguistics/Jazykovedný casopis ◽

10.2478/jazcas-2021-0054 ◽

2021 ◽

Vol 72 (2) ◽

pp. 590-602

Author(s):

Kirill I. Semenov ◽

Armine K. Titizian ◽

Aleksandra O. Piskunova ◽

Yulia O. Korotkova ◽

Alena D. Tsvetkova ◽

...

Keyword(s):

Chinese Text ◽

Text Processing ◽

Word Segmentation ◽

The Other ◽

Pos Tagging ◽

Theoretical Comparison ◽

Linguistic Annotation ◽

Corpus Data ◽

Chinese Texts ◽

The One

The article tackles the problems of linguistic annotation in the Chinese texts presented in Ruzhcorp, the Russian-Chinese Parallel Corpus of RNC, and the ways to solve them. Particular attention is paid to the processing of Russian loanwords. On the one hand, we present a theoretical comparison of the widespread standards of Chinese text processing. On the other hand, we describe our experiments in three fields (word segmentation, grapheme-to-phoneme conversion, and PoS tagging) on specific corpus data that contains many transliterations and loanwords. As a result, we propose a preprocessing pipeline for Chinese texts that will be implemented in Ruzhcorp.


Amazigh part-of-speech tagging with machine learning and deep learning

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v24.i3.pp1814-1822 ◽

2021 ◽

Vol 24 (3) ◽

pp. 1814

Author(s):

Otman Maarouf ◽

Rachid El Ayachi ◽

Mohamed Biniz

Keyword(s):

Decision Tree ◽

Language Processing ◽

Conditional Random Fields ◽

Short Term Memory ◽

Long Distance ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

French And English ◽

Speech Tagging

Natural language processing (NLP) is a part of artificial intelligence that analyzes, understands, and transforms natural languages with computers in written and spoken settings. At that point in scripts. Part-of-speech (POS) tagging marks each word according to its grammatical category. We find in the literature that POS tagging has been applied to a few languages, in particular French and English. This paper investigates attention-based long short-term memory (LSTM) networks and a simple recurrent neural network (RNN) for Tifinagh POS tagging, compared with conditional random fields (CRF) and a decision tree. The attractiveness of LSTM networks is their strength in modeling long-distance dependencies. The experimental results show that LSTM networks perform better than RNN, CRF and decision tree that has a near performance.

