A Modified Markov Based Maximum-entropy Model for POS Tagging of Odia Text

POS (Parts of Speech) tagging, a vital step in diverse Natural Language Processing (NLP) tasks has not drawn much attention in case of Odia a computationally under-developed language. The proposed hybrid method suggests a robust POS tagger for Odia. Observing the rich morphology of the language and unavailability of sufficient annotated text corpus a combination of machine learning and linguistic rules is adopted in the building of the tagger. The tagger is trained on tagged text corpus from the domain of tourism and is capable of obtaining a perceptible improvement in the result. Also an appreciable performance is observed for news articles texts of varied domains. The performance of proposed algorithm experimenting on Odia language shows its manifestation in dominating over existing methods like rule based, hidden Markov model (HMM), maximum entropy (ME) and conditional random field (CRF).

Download Full-text

Advanced Tamil POS Tagger for Language Learners

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.j8886.0881019 ◽

2019 ◽

Vol 8 (10) ◽

pp. 741-745

Keyword(s):

Machine Translation ◽

Language Learners ◽

Language Processing ◽

Research Work ◽

Important Work ◽

Parts Of Speech ◽

Pos Tagging ◽

Pos Tagger ◽

The Given ◽

Speech Identification

In the emerging technology Natural Language Processing, machine translation is one of the important roles. The machine translation is translation of text in one language to another with the implementation of Machines. The research topic POS Tagging is one of the most basic and important work in Machine translation. POS tagging simply, we say that to assign the Parts of speech identification for each word in the given sentence. In my research work, I tried the POS Tagging for Tamil language. There may be some numerous research were done in the same topic. I have viewed this in different and very detailed implementation. Most of the detailed grammatical identifications are made for this proposed research. It is very useful to know the basic grammar in Tamil language

Download Full-text

A Hidden Markov Model-based Part of Speech Tagger for Shekki’noono Language

International Journal of Computing ◽

10.47839/ijc.20.4.2448 ◽

2021 ◽

pp. 587-595

Author(s):

Alebachew Chiche ◽

Hiwot Kadi ◽

Tibebu Bekele

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Markov Model ◽

Hidden Markov Model ◽

Language Processing ◽

Hidden Markov ◽

Parts Of Speech ◽

Pos Tagging ◽

Part Of Speech ◽

Pos Tagger

Natural language processing plays a great role in providing an interface for human-computer communication. It enables people to talk with the computer in their formal language rather than machine language. This study aims at presenting a Part of speech tagger that can assign word class to words in a given paragraph sentence. Some of the researchers developed parts of speech taggers for different languages such as English Amharic, Afan Oromo, Tigrigna, etc. On the other hand, many other languages do not have POS taggers like Shekki’noono language. POS tagger is incorporated in most natural language processing tools like machine translation, information extraction as a basic component. So, it is compulsory to develop a part of speech tagger for languages then it is possible to work with an advanced natural language application. Because those applications enhance machine to machine, machine to human, and human to human communications. Although, one language POS tagger cannot be directly applied for other languages POS tagger. With the purpose for developing the Shekki’noono POS tagger, we have used the stochastic Hidden Markov Model. For the study, we have used 1500 sentences collected from different sources such as newspapers (which includes social, economic, and political aspects), modules, textbooks, Radio Programs, and bulletins. The collected sentences are labeled by language experts with their appropriate parts of speech for each word. With the experiments carried out, the part of speech tagger is trained on the training sets using Hidden Markov model. As experiments showed, HMM based POS tagging has achieved 92.77 % accuracy for Shekki’noono. And the POS tagger model is compared with the previous experiments in related works using HMM. As a future work, the proposed approaches can be utilized to perform an evaluation on a larger corpus.

Download Full-text

Part-of-Speech Tagging via Deep Neural Networks for Northern-Ethiopic Languages

Information Technology And Control ◽

10.5755/j01.itc.49.4.26808 ◽

2020 ◽

Vol 49 (4) ◽

pp. 482-494

Author(s):

Jurgita Kapočiūtė-Dzikienė ◽

Senait Gebremichael Tesfagergish

Keyword(s):

Neural Network ◽

Neural Networks ◽

Language Processing ◽

Deep Neural Networks ◽

Short Term Memory ◽

Parameter Tuning ◽

Feed Forward Neural Network ◽

Pos Tagging ◽

Part Of Speech ◽

Pos Tagger

Deep Neural Networks (DNNs) have proven to be especially successful in the area of Natural Language Processing (NLP) and Part-Of-Speech (POS) tagging—which is the process of mapping words to their corresponding POS labels depending on the context. Despite recent development of language technologies, low-resourced languages (such as an East African Tigrinya language), have received too little attention. We investigate the effectiveness of Deep Learning (DL) solutions for the low-resourced Tigrinya language of the Northern-Ethiopic branch. We have selected Tigrinya as the testbed example and have tested state-of-the-art DL approaches seeking to build the most accurate POS tagger. We have evaluated DNN classifiers (Feed Forward Neural Network – FFNN, Long Short-Term Memory method – LSTM, Bidirectional LSTM, and Convolutional Neural Network – CNN) on a top of neural word2vec word embeddings with a small training corpus known as Nagaoka Tigrinya Corpus. To determine the best DNN classifier type, its architecture and hyper-parameter set both manual and automatic hyper-parameter tuning has been performed. BiLSTM method was proved to be the most suitable for our solving task: it achieved the highest accuracy equal to 92% that is 65% above the random baseline.

Download Full-text

CWPC_BiAtt: Character–Word–Position Combined BiLSTM-Attention for Chinese Named Entity Recognition

Information ◽

10.3390/info11010045 ◽

2020 ◽

Vol 11 (1) ◽

pp. 45 ◽

Cited By ~ 1

Author(s):

Shardrom Johnson ◽

Sherlock Shen ◽

Yuanchen Liu

Keyword(s):

Language Processing ◽

Short Term Memory ◽

Conditional Random Field ◽

Named Entity Recognition ◽

Attention Mechanism ◽

Entity Recognition ◽

Position Information ◽

Named Entity ◽

Pos Tagging ◽

Word Position

Usually taken as linguistic features by Part-Of-Speech (POS) tagging, Named Entity Recognition (NER) is a major task in Natural Language Processing (NLP). In this paper, we put forward a new comprehensive-embedding, considering three aspects, namely character-embedding, word-embedding, and pos-embedding stitched in the order we give, and thus get their dependencies, based on which we propose a new Character–Word–Position Combined BiLSTM-Attention (CWPC_BiAtt) for the Chinese NER task. Comprehensive-embedding via the Bidirectional Llong Short-Term Memory (BiLSTM) layer can get the connection between the historical and future information, and then employ the attention mechanism to capture the connection between the content of the sentence at the current position and that at any location. Finally, we utilize Conditional Random Field (CRF) to decode the entire tagging sequence. Experiments show that CWPC_BiAtt model we proposed is well qualified for the NER task on Microsoft Research Asia (MSRA) dataset and Weibo NER corpus. A high precision and recall were obtained, which verified the stability of the model. Position-embedding in comprehensive-embedding can compensate for attention-mechanism to provide position information for the disordered sequence, which shows that comprehensive-embedding has completeness. Looking at the entire model, our proposed CWPC_BiAtt has three distinct characteristics: completeness, simplicity, and stability. Our proposed CWPC_BiAtt model achieved the highest F-score, achieving the state-of-the-art performance in the MSRA dataset and Weibo NER corpus.

Download Full-text

Maximum Entropy Model and Conditional Random Field

Machine Learning for Multimedia Content Analysis ◽

10.1007/978-0-387-69942-4_9 ◽

2007 ◽

pp. 201-233

Keyword(s):

Random Field ◽

Maximum Entropy ◽

Conditional Random Field ◽

Maximum Entropy Model ◽

Entropy Model

Download Full-text

Fusion of Clustering Trigger-Pair Features for POS Tagging Based on Maximum Entropy Model

Journal of Computer Research and Development ◽

10.1360/crad20060212 ◽

2006 ◽

Vol 43 (2) ◽

pp. 268

Author(s):

Yan Zhao

Keyword(s):

Maximum Entropy ◽

Maximum Entropy Model ◽

Entropy Model ◽

Pos Tagging

Download Full-text

Verb Based Sentiment Research

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1289.0982s1119 ◽

2019 ◽

Vol 8 (2S11) ◽

pp. 2468-2471

Keyword(s):

Sentiment Analysis ◽

Research Work ◽

Online Reviews ◽

Parts Of Speech ◽

Pos Tagging ◽

New Strategy ◽

Semantic Orientation ◽

The Difference ◽

Pos Tagger ◽

Speech Tagging

Sentiment Analysis is one of the leading research work. This paper proposes a model for the description of verbs that provide a structure for developing sentiment analysis. The verbs are very significant language elements and they receive the attention of linguistic researchers. The text is processed for parts-of-speech tagging (POS tagging). With the help of POS tagger, the verbs from each sentence are extracted to show the difference in sentiment analysis values. The work includes performing parts-of-speech tagging to obtain verb words and implement TextBlob and VADER to find the semantic orientation to mine the opinion from the movie review. We achieved interesting results, which were assessed effectively for accuracy by considering with and without verb form words. The findings show that concerning verb words accuracy increases along with emotion words. This introduces a new strategy to classify online reviews using components of algorithms for parts-of-speech..

Download Full-text

Statistical Methods

10.1093/oxfordhb/9780199276349.013.0019 ◽

2012 ◽

Author(s):

Christer Samuelsson

Keyword(s):

Language Learning ◽

Maximum Entropy ◽

Statistical Methods ◽

Language Processing ◽

Mathematical Statistics ◽

Semantic Interpretation ◽

Entropy Model ◽

Heterogeneous Information ◽

Part Of Speech ◽

Maximum Entropy Modelling

Statistical methods now belong to mainstream natural language processing. They have been successfully applied to virtually all tasks within language processing and neighbouring fields, including part-of-speech tagging, syntactic parsing, semantic interpretation, lexical acquisition, machine translation, information retrieval, and information extraction and language learning. This article reviews mathematical statistics and applies it to language modelling problems, leading up to the hidden Markov model and maximum entropy model. The real strength of maximum-entropy modelling lies in combining evidence from several rules, each one of which alone might not be conclusive, but which taken together dramatically affect the probability. Maximum-entropy modelling allows combining heterogeneous information sources to produce a uniform probabilistic model where each piece of information is formulated as a feature. The key ideas of mathematical statistics are simple and intuitive, but tend to be buried in a sea of mathematical technicalities. Finally, the article provides mathematical detail related to the topic of discussion.

Download Full-text

Part-of-Speech (POS) Tagging Using Deep Learning-Based Approaches on the Designed Khasi POS Corpus

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3488381 ◽

2022 ◽

Vol 21 (3) ◽

pp. 1-24

Author(s):

Sunita Warjri ◽

Partha Pakray ◽

Saralin A. Lyngdoh ◽

Arnab Kumar Maji

Keyword(s):

Deep Learning ◽

Natural Language ◽

Computational Linguistics ◽

Language Processing ◽

Research Work ◽

Pos Tagging ◽

Part Of Speech ◽

Corpus Size ◽

Increase In Accuracy ◽

Pos Tagger

Part-of-speech (POS) tagging is one of the research challenging fields in natural language processing (NLP). It requires good knowledge of a particular language with large amounts of data or corpora for feature engineering, which can lead to achieving a good performance of the tagger. Our main contribution in this research work is the designed Khasi POS corpus. Till date, there has been no form of any kind of Khasi corpus developed or formally developed. In the present designed Khasi POS corpus, each word is tagged manually using the designed tagset. Methods of deep learning have been used to experiment with our designed Khasi POS corpus. The POS tagger based on BiLSTM, combinations of BiLSTM with CRF, and character-based embedding with BiLSTM are presented. The main challenges of understanding and handling Natural Language toward Computational linguistics to encounter are anticipated. In the presently designed corpus, we have tried to solve the problems of ambiguities of words concerning their context usage, and also the orthography problems that arise in the designed POS corpus. The designed Khasi corpus size is around 96,100 tokens and consists of 6,616 distinct words. Initially, while running the first few sets of data of around 41,000 tokens in our experiment the taggers are found to yield considerably accurate results. When the Khasi corpus size has been increased to 96,100 tokens, we see an increase in accuracy rate and the analyses are more pertinent. As results, accuracy of 96.81% is achieved for the BiLSTM method, 96.98% for BiLSTM with CRF technique, and 95.86% for character-based with LSTM. Concerning substantial research from the NLP perspectives for Khasi, we also present some of the recently existing POS taggers and other NLP works on the Khasi language for comparative purposes.

Download Full-text

POS Tagging and NER System for Kannada Using Conditional Random Fields

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2021100101 ◽

2021 ◽

Vol 11 (4) ◽

pp. 1-13

Author(s):

Arpitha Swamy ◽

Srinath S.

Keyword(s):

Random Fields ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Model Testing ◽

Entity Recognition ◽

Parts Of Speech ◽

Named Entity ◽

Pos Tagging ◽

Proper Nouns ◽

Pos Tagger

Parts-of-speech (POS) tagging is a method used to assign the POS tag for every word present in the text, and named entity recognition (NER) is a process to identify the proper nouns in the text and to classify the identified nouns into certain predefined categories. A POS tagger and a NER system for Kannada text have been proposed utilizing conditional random fields (CRFs). The dataset used for POS tagging consists of 147K tokens, where 103K tokens are used for training and the remaining tokens are used for testing. The proposed CRF model for POS tagging of Kannada text obtained 91.3% of precision, 91.6% of recall, and 91.4% of f-score values, respectively. To develop the NER system for Kannada, the data required is created manually using the modified tag-set containing 40 labels. The dataset used for NER system consists of 16.5K tokens, where 70% of the total words are used for training the model, and the remaining 30% of total words are used for model testing. The developed NER model obtained the 94% of precision, 93.9% of recall, and 93.9% of F1-measure values, respectively.

Download Full-text