Looking into the Operational Modalities Adopted in Some of the POS Tagging Tools in Identification of Contextual Part-of-Speech of Words in Texts

The chapter provides an eloquent account of the major methodologies and advances in the field of Natural Language Processing. The most popular models used over time for Natural Language Processing are discussed along with their applications to specific tasks. The chapter begins with the fundamental concepts of regular expressions (regex) and tokenization. It offers insight into text preprocessing and its methods, such as stemming, lemmatization, and stop word removal, followed by part-of-speech tagging and named entity recognition. Further, the chapter elaborates on the concept of word embedding, its various types, and some common frameworks such as word2vec, GloVe, and fastText. A brief description of classification algorithms used in Natural Language Processing is provided next, followed by neural networks and their advanced forms, such as Recursive Neural Networks and Seq2seq models, that are used in computational linguistics. A brief description of chatbots and Memory Networks concludes the chapter.

Part-of-Speech (POS) Tagging Using Deep Learning-Based Approaches on the Designed Khasi POS Corpus

ACM Transactions on Asian and Low-Resource Language Information Processing, DOI 10.1145/3488381, 2022, Vol 21 (3), pp. 1-24

Author(s): Sunita Warjri, Partha Pakray, Saralin A. Lyngdoh, Arnab Kumar Maji

Keyword(s): Deep Learning, Natural Language, Computational Linguistics, Language Processing, Research Work, Pos Tagging, Part Of Speech, Corpus Size, Increase In Accuracy, Pos Tagger

Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of a particular language and large amounts of data or corpora for feature engineering, which together lead to good tagger performance. Our main contribution in this research work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the designed Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods have been used to experiment with the corpus. POS taggers based on BiLSTM, on BiLSTM combined with CRF, and on character-based embedding with BiLSTM are presented. The main challenges of understanding and handling natural language from the perspective of computational linguistics are anticipated. In the designed corpus, we have tried to resolve the ambiguities of words with respect to their contextual usage, as well as the orthography problems that arise in the corpus. The corpus contains around 96,100 tokens and 6,616 distinct words. Initially, when running the first few sets of data of around 41,000 tokens, the taggers yielded considerably accurate results. When the corpus size was increased to 96,100 tokens, the accuracy rate increased and the analyses became more pertinent. An accuracy of 96.81% is achieved for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for the character-based model with LSTM. To situate this work within NLP research on Khasi, we also present some of the existing POS taggers and other NLP works on the Khasi language for comparison.
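To illustrate why adding a CRF layer on top of BiLSTM scores can change the predicted tags, the following is a minimal sketch of Viterbi decoding over per-token emission scores and tag-to-tag transition scores. It is not the authors' implementation: the tag names, the English example, and all score values are hypothetical, and the emission scores merely stand in for what a trained BiLSTM would output.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][j]: score of tag j at position t (stand-in for BiLSTM output).
    transitions[i][j]: score of moving from tag i to tag j (the CRF part).
    """
    n_tags = len(tags)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at position t.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [tags[i] for i in path]


# Hypothetical scores for the three tokens "the can rusts".
tags = ["DET", "NOUN", "VERB"]
emissions = [
    [2.0, 0.1, 0.1],  # "the"
    [0.1, 1.0, 1.2],  # "can": locally prefers VERB
    [0.1, 0.9, 1.0],  # "rusts"
]
transitions = [
    [-1.0, 1.0, -1.0],  # from DET
    [-1.0, 0.0, 1.0],   # from NOUN
    [0.5, 0.0, -1.0],   # from VERB
]

# Greedy decoding picks the best tag per token and ignores transitions.
greedy = [tags[max(range(len(tags)), key=lambda j: row[j])] for row in emissions]
decoded = viterbi(emissions, transitions, tags)
print(greedy)
print(decoded)
```

In this toy case the per-token choice tags "can" as a verb, while the sequence-level decoding prefers a noun after a determiner, which is the kind of contextual ambiguity the abstract describes.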

POS Tagging Bahasa Madura dengan Menggunakan Algoritma Brill Tagger [Madurese POS Tagging Using the Brill Tagger Algorithm]

Jurnal Teknologi Informasi dan Ilmu Komputer, DOI 10.25126/jtiik.2020722449, 2020, Vol 7 (6), pp. 1121

Author(s): Nindian Puspa Dewi, Ubaidi Ubaidi

Keyword(s): Natural Language Processing, Natural Language, Digital Media, Language Processing, Text Processing, Pos Tagging, Part Of Speech, Average Accuracy, Conducting Research, Regional Languages

Madurese (Bahasa Madura) is a regional language that is used not only on Madura Island but also in other areas such as the cities of Jember, Pasuruan, and Probolinggo. As a regional language, Madurese is increasingly being abandoned, especially among young people. Some of the causes are concerns about prestige and the difficulty of learning Madurese, which has a variety of dialects and speech levels. The declining use of Madurese could lead to its extinction as one of Indonesia's regional languages. An effort is therefore needed to preserve and maintain Madurese. One such effort is research on Madurese in the field of Natural Language Processing, so that in the future Madurese can be learned through digital media. Part-of-Speech (POS) tagging is the basis of text processing research, so a Madurese POS tagging application needs to be built for use in other Natural Language Processing research. In this study, POS tagging is built using the Brill Tagger algorithm with a corpus containing 10,535 Madurese words. POS tagging with the Brill Tagger assigns the appropriate word class to each word using lexical and contextual rules. The Brill Tagger is the algorithm with the best accuracy when applied to English, Indonesian, and several other languages. A series of experiments with several threshold values, without taking OOV (Out Of Vocabulary) words into account, shows an average accuracy of more than 80% with a highest accuracy of 86.67%; testing that takes OOV into account reaches an average accuracy of 67.74%. It can therefore be concluded that the Brill Tagger can be used for Madurese with a good level of accuracy.
Abstract: Bahasa Madura is a regional language that is used not only on Madura Island but also in other areas, such as several regions in Jember, Pasuruan, and Probolinggo. Today, Bahasa Madura is beginning to be abandoned, especially among young people. One reason is a sense of pride; another is that Bahasa Madura is quite difficult to learn because it has a variety of dialects and language levels. The reduced use of Bahasa Madura can lead to its extinction as one of the regional languages in Indonesia. Therefore, an effort is needed to maintain the Madurese language. One such effort is research on Madurese in the field of Natural Language Processing, so that in the future Madurese can be learned through digital media. Part of Speech (POS) tagging is the basis of text processing research, so a Madurese POS tagging application needs to be built for use in other Natural Language Processing research. This study uses the Brill Tagger with a corpus containing 10,535 words. POS tagging with the Brill Tagger algorithm assigns the appropriate word class to each word using lexical and contextual rules. The Brill Tagger is used because it is the algorithm with the best accuracy when implemented for English, Indonesian, and several other languages. The experimental results with the Brill Tagger show that the average accuracy obtained without OOV (Out Of Vocabulary) is 86.6%, with a highest accuracy of 86.94%, and the average accuracy for OOV words reached 67.22%. It can therefore be concluded that the Brill Tagger algorithm can also be used for Bahasa Madura with a good degree of accuracy.
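The two rule types mentioned above, lexical and contextual, can be sketched in a few lines of Python. This is a simplified illustration of the transformation-based idea, not the tagger used in the study: the training sentences are hypothetical English examples rather than Madurese data, the default tag for OOV words is an assumption, and the single contextual rule is written by hand, whereas a real Brill tagger learns its rules from tagging errors.

```python
from collections import Counter, defaultdict

# Toy training data (hypothetical English examples, not from the Madurese corpus).
TRAIN = [
    [("the", "DET"), ("can", "NOUN"), ("is", "VERB"), ("empty", "ADJ")],
    [("the", "DET"), ("can", "NOUN"), ("fell", "VERB")],
    [("they", "PRON"), ("can", "AUX"), ("swim", "VERB")],
]


def train_lexicon(sentences):
    """Lexical step: map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}


def initial_tags(words, lexicon, default="NOUN"):
    """Initial-state tagging: look up each word; OOV words get an assumed default tag."""
    return [lexicon.get(word, default) for word in words]


def apply_rules(tags, rules):
    """Contextual step: rewrite from_tag to to_tag when the previous tag is prev_tag."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags


# One hand-written contextual rule: NOUN becomes AUX after a pronoun.
RULES = [("NOUN", "AUX", "PRON")]

lexicon = train_lexicon(TRAIN)
words = ["they", "can", "swim"]
before = initial_tags(words, lexicon)
after = apply_rules(before, RULES)
print(before)
print(after)
```

The lexicon alone tags "can" with its most frequent tag, and the contextual rule then corrects it based on the preceding tag. The default-tag fallback also hints at why accuracy drops on OOV words: they receive a guess rather than a tag observed in training.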

Improving the Performance of Vietnamese–Korean Neural Machine Translation with Contextual Embedding

Applied Sciences, DOI 10.3390/app112311119, 2021, Vol 11 (23), pp. 11119

Author(s): Van-Hai Vu, Quang-Phuoc Nguyen, Ebipatei Victoria Tunyan, Cheol-Young Ock

Keyword(s): Natural Language, Machine Translation, Language Processing, Word Sense Disambiguation, Named Entity Recognition, Entity Recognition, Word Sense, Pos Tagging, Part Of Speech, Learning Machine

With the recent evolution of deep learning, machine translation (MT) models and systems are being steadily improved. However, research on MT in low-resource languages such as Vietnamese and Korean is still very limited. In recent years, a state-of-the-art context-based embedding model introduced by Google, bidirectional encoder representations from transformers (BERT), has begun to appear in neural MT (NMT) models in different ways to enhance the accuracy of MT systems. The BERT model for Vietnamese has been developed and has significantly improved natural language processing (NLP) tasks such as part-of-speech (POS) tagging, named-entity recognition, dependency parsing, and natural language inference. Our research experimented with applying the Vietnamese BERT model to provide POS tagging and morphological analysis (MA) for Vietnamese sentences, and applying word-sense disambiguation (WSD) for Korean sentences in our Vietnamese–Korean bilingual corpus. In the Vietnamese–Korean NMT system, with contextual embedding, the BERT model for Vietnamese is concurrently connected to both the encoder layers and the decoder layers of the NMT model. Experimental results assessed through BLEU, METEOR, and TER metrics show that contextual embedding significantly improves the quality of Vietnamese–Korean NMT.

Machine Learning in Natural Language Processing

Handbook of Research on Machine Learning Applications and Trends, DOI 10.4018/978-1-60566-766-9.ch014, 2010, pp. 302-324

Author(s): Marina Sokolova, Stan Szpakowicz

Keyword(s): Machine Learning, Natural Language Processing, Natural Language, Language Processing, Text Processing, Word Sense Disambiguation, Machine Learning Techniques, Word Sense, Part Of Speech, Applications Of Machine Learning

This chapter presents applications of machine learning techniques to traditional problems in natural language processing, including part-of-speech tagging, entity recognition and word-sense disambiguation. People usually solve such problems without difficulty, or at least do a very good job. Linguistics may suggest labour-intensive ways of manually constructing rule-based systems. It is, however, the easy availability of large collections of texts that has made machine learning a method of choice for processing volumes of data well above human capacity. One of the main purposes of text processing is information extraction and knowledge extraction of all kinds from such large text collections. The machine learning methods discussed in this chapter have stimulated wide-ranging research in natural language processing and helped build applications with serious deployment potential.

Improving Brill's tagger lexical and transformation rule for Afaan Oromo language

DOI 10.7287/peerj.preprints.1225v1, 2015

Author(s): Abraham G Ayana

Keyword(s): Natural Language Processing, Natural Language, Language Processing, Transformation Rule, Initial State, Training Corpus, Part Of Speech Tagging, Pos Tagging, Part Of Speech, Speech Tagging

Natural Language Processing (NLP) refers to human-like language processing, which places it as a discipline within the field of Artificial Intelligence (AI). The ultimate goal of NLP research is to parse and understand language, a goal that has not yet been fully achieved. For this reason, much research in NLP has focused on intermediate tasks that make sense of some of the structure inherent in language without requiring complete understanding. One such task is part-of-speech tagging, or simply tagging. The lack of a standard part-of-speech tagger for Afaan Oromo is a main obstacle for researchers in machine translation, spell checking, dictionary compilation, and automatic sentence parsing and construction. Even though several works have addressed POS tagging for Afaan Oromo, the performance of the tagger has not yet improved sufficiently. Hence, the aim of this thesis is to improve the lexical and transformation rules of Brill’s tagger for Afaan Oromo POS tagging with a sufficiently large training corpus. Accordingly, Afaan Oromo literature on grammar and morphology is reviewed to understand the nature of the language and to identify possible tagsets. As a result, 26 broad tagsets were identified, and 17,473 words from around 1,100 sentences containing 6,750 distinct words were tagged for training and testing purposes; 258 of these sentences are taken from the previous work. Since there are only a few ready-made standard corpora, the manual tagging process to prepare the corpus for this work was challenging, and it is therefore recommended that a standard corpus be prepared. Transformation-based error-driven learning is adapted for Afaan Oromo part-of-speech tagging. Different experiments are conducted for the rule-based approach, taking 20% of the whole data for testing. A comparison with the previously adapted Brill’s tagger is made.
The previously adapted Brill’s tagger shows an accuracy of 80.08%, whereas the improved Brill’s tagger shows an accuracy of 95.6%, an improvement of 15.52 percentage points. Hence, it is found that the size of the training corpus, the rule-generating system in the lexical rule learner, and, moreover, the use of an Afaan Oromo HMM tagger as the initial-state tagger have a significant effect on the improvement of the tagger.

Search for the Relation of Form and Function Using the ForFun Database

Prague Bulletin of Mathematical Linguistics, DOI 10.2478/pralin-2018-0003, 2018, Vol 110 (1), pp. 71-84

Author(s): Marie Mikulová, Eduard Bejček, Eva Hajičová, Jarmila Panevová

Keyword(s): Natural Language Processing, Natural Language, Language Processing, International Workshop, Form And Function, Form Function, Theoretical Linguistics, And Function, Annotated Corpora, Function Relation

Abstract: The aim of the contribution is to introduce a database of linguistic forms and their functions built using the multi-layer annotated corpora of Czech, the Prague Dependency Treebanks. The purpose of the Prague Database of Forms and Functions (ForFun) is to help linguists study the form-function relation, which we assume to be one of the principal tasks of both theoretical linguistics and natural language processing. We demonstrate possible uses of the ForFun database. This article is largely based on a paper presented at the 16th International Workshop on Treebanks and Linguistic Theories in Prague (Bejček et al., 2017).

QUASE: An Ontology-Based Domain Specific Natural Language Question Answering System

International Journal of Recent Technology and Engineering - 2, DOI 10.35940/ijrte.d6773.118419, 2019, Vol 8 (4), pp. 261-268

Keyword(s): Natural Language, Language Processing, Question Answering, Closed Domain, Question Answering System, Pos Tagging, Part Of Speech, Natural Language Question, Finite Set, Language Question

Since the early days, Question Answering (QA) has been an intuitive way for humans to understand concepts. Given its importance, it is introduced to children from a very early age, and they are encouraged to ask more and more questions. With progress in Machine Learning and ontological semantics, Natural Language Question Answering (NLQA) has gained popularity in recent years. In this paper, QUASE (QUestion Answering System for Education), a system for answering natural language questions, is proposed; it helps find the answer to any given question in a closed domain containing a finite set of documents. The QA system mainly focuses on factoid questions. QUASE uses a Question Taxonomy for Question Classification. Several Natural Language Processing techniques, such as Part of Speech (POS) tagging, Lemmatization, and Sentence Tokenization, are applied in document processing to make search better and faster. The DBPedia ontology is used to validate the candidate answers. With this system, learners can gain knowledge on their own by getting precise answers to questions asked in natural language instead of merely getting back a list of documents. The precision, recall, and F-measure metrics are used to evaluate the performance of answer type evaluation. The Mean Reciprocal Rank metric is used to evaluate the performance of the QA system. Our experiment shows significant improvement in classifying questions into the correct answer types over other methods, with approximately 91% accuracy, and better performance as a QA system in closed-domain search.
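The evaluation metrics named above follow standard definitions, which the following sketch computes on toy data. The answer-type labels, predictions, and ranked answer lists are hypothetical and are not taken from the QUASE experiments; the function names are likewise chosen for illustration.

```python
def precision_recall_f1(predicted, gold, label):
    """Precision, recall, and F-measure for one answer-type label."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p == label and g == label)
    fp = sum(1 for p, g in pairs if p == label and g != label)
    fn = sum(1 for p, g in pairs if p != label and g == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def mean_reciprocal_rank(ranked_answers, correct_answers):
    """Average of 1/rank of the first correct answer; 0 when it is not returned."""
    total = 0.0
    for candidates, correct in zip(ranked_answers, correct_answers):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == correct:
                total += 1.0 / rank
                break
    return total / len(ranked_answers)


# Hypothetical answer-type predictions for four questions.
predicted = ["PERSON", "DATE", "PERSON", "PLACE"]
gold = ["PERSON", "PERSON", "PERSON", "PLACE"]
precision, recall, f1 = precision_recall_f1(predicted, gold, "PERSON")

# Hypothetical ranked candidate answers for three questions.
ranked = [["a", "b"], ["b", "a"], ["c", "d"]]
correct = ["a", "a", "z"]
mrr = mean_reciprocal_rank(ranked, correct)

print(precision, recall, f1)
print(mrr)
```

Precision, recall, and F-measure judge the question classifier one label at a time, while Mean Reciprocal Rank rewards the QA system for placing the correct answer near the top of its ranked list.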

Attentive Tensor Product Learning

Proceedings of the AAAI Conference on Artificial Intelligence, DOI 10.1609/aaai.v33i01.33011344, 2019, Vol 33, pp. 1344-1351, Cited By ~ 1

Author(s): Qiuyuan Huang, Li Deng, Dapeng Wu, Chang Liu, Xiaodong He

Keyword(s): Deep Learning, Natural Language, Tensor Product, Language Processing, Short Term Memory, Feedforward Neural Networks, Grammatical Structure, Symbolic Model, Pos Tagging, Part Of Speech

This paper proposes a novel neural architecture — Attentive Tensor Product Learning (ATPL) — to represent grammatical structures of natural language in deep learning models. ATPL exploits Tensor Product Representations (TPR), a structured neural-symbolic model developed in cognitive science, to integrate deep learning with explicit natural language structures and rules. The key ideas of ATPL are: 1) unsupervised learning of role-unbinding vectors of words via the TPR-based deep neural network; 2) the use of attention modules to compute TPR; and 3) the integration of TPR with typical deep learning architectures including long short-term memory and feedforward neural networks. The novelty of our approach lies in its ability to extract the grammatical structure of a sentence by using role-unbinding vectors, which are obtained in an unsupervised manner. Our ATPL approach is applied to 1) image captioning, 2) part of speech (POS) tagging, and 3) constituency parsing of a natural language sentence. The experimental results demonstrate the effectiveness of the proposed approach in all these three natural language processing tasks.
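The binding and unbinding operations behind Tensor Product Representations can be shown with a small hand-built example. This sketch is not ATPL itself: ATPL learns its role-unbinding vectors without supervision and computes the TPR with attention modules, whereas here the filler and role vectors are hypothetical fixed values, and the roles are assumed to be orthonormal so that each role vector also serves as its own unbinding vector.

```python
def outer(filler, role):
    """Bind a filler to a role with an outer product."""
    return [[f * r for r in role] for f in filler]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def unbind(tpr, unbinding_vector):
    """Recover a filler by multiplying the TPR matrix with an unbinding vector."""
    return [sum(t * u for t, u in zip(row, unbinding_vector)) for row in tpr]


# Hypothetical filler (word) vectors and orthonormal role vectors.
fillers = {"cat": [1.0, 0.0, 2.0], "sleeps": [0.0, 3.0, 1.0]}
roles = {"subject": [1.0, 0.0], "verb": [0.0, 1.0]}

# The sentence representation is the sum of all filler-role bindings.
tpr = add(
    outer(fillers["cat"], roles["subject"]),
    outer(fillers["sleeps"], roles["verb"]),
)

recovered_subject = unbind(tpr, roles["subject"])
recovered_verb = unbind(tpr, roles["verb"])
print(recovered_subject)
print(recovered_verb)
```

Because the roles are orthonormal, unbinding returns each filler exactly. With learned, non-orthogonal roles the unbinding vectors differ from the role vectors, which is why the paper treats them as quantities to be learned.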
