Towards Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion

As a highly analytic language, Khmer presents considerable ambiguity in tokenization and part-of-speech (POS) tagging, which this study investigates. Specifically, a 20,000-sentence Khmer corpus with manual tokenization and POS-tagging annotation is released, the result of work carried out over the last 4 years. This is the largest morphologically annotated Khmer dataset as of 2020, when this article was prepared. Based on the annotated data, experiments were conducted to establish a comprehensive benchmark for automatic tokenization and POS tagging of Khmer. Specifically, a support vector machine, a conditional random field (CRF), a long short-term memory (LSTM)-based recurrent neural network, and an integrated LSTM-CRF model were investigated and discussed. The primary conclusion is that processing at the morpheme level is satisfactory for the provided data. However, it is intrinsically difficult to identify further grammatical constituents of compounds or phrases because of the complex analytic features of the language. Syntactic annotation and automatic parsing for Khmer are planned for the near future.
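One common way to hand tokenization to sequence models such as a CRF or an LSTM is to label every character as beginning (B) or continuing (I) a token. The abstract does not say how the authors frame the task, so the following is only a minimal sketch under that assumption; the function names and the Latin placeholder text are hypothetical.

```python
from typing import List


def tokens_to_labels(tokens: List[str]) -> List[str]:
    """Turn a manually tokenized sentence into one B/I label per character."""
    labels: List[str] = []
    for token in tokens:
        if not token:
            continue  # skip empty tokens so every label matches a character
        labels.extend(["B"] + ["I"] * (len(token) - 1))
    return labels


def labels_to_tokens(text: str, labels: List[str]) -> List[str]:
    """Rebuild tokens from unsegmented text and predicted B/I labels."""
    tokens: List[str] = []
    for ch, label in zip(text, labels):
        if label == "B" or not tokens:
            tokens.append(ch)   # start a new token
        else:
            tokens[-1] += ch    # extend the current token
    return tokens


# Placeholder Latin text standing in for an unsegmented sentence.
gold_tokens = ["ab", "c", "def"]
labels = tokens_to_labels(gold_tokens)
print(labels)
print(labels_to_tokens("".join(gold_tokens), labels))
```

Note that a Python `str` is indexed by code point, so for a script with combining marks a real system would probably label larger units than single code points.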

Bidirectional Long Short-Term Memory Network with a Conditional Random Field Layer for Uyghur Part-Of-Speech Tagging

Information ◽

10.3390/info8040157 ◽

2017 ◽

Vol 8 (4) ◽

pp. 157 ◽

Cited By ~ 6

Author(s):

Maihemuti Maimaiti ◽

Aishan Wumaier ◽

Kahaerjiang Abiderexiti ◽

Tuergen Yibulayin

Keyword(s):

Random Field ◽

Short Term Memory ◽

Conditional Random Field ◽

Short Term ◽

Term Memory ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Memory Network ◽

Long Short Term Memory ◽

Speech Tagging

Part-of-Speech Tagging Using Long Short Term Memory (LSTM): Amazigh Text Written in Tifinaghe Characters

Business Intelligence - Lecture Notes in Business Information Processing ◽

10.1007/978-3-030-76508-8_1 ◽

2021 ◽

pp. 3-17

Author(s):

Otman Maarouf ◽

Rachid El Ayachi

Keyword(s):

Short Term Memory ◽

Short Term ◽

Term Memory ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Long Short Term Memory ◽

Speech Tagging

Part of Speech Tagging for Indonesian Language using Bidirectional Long Short-Term Memory

2019 1st International Conference on Cybernetics and Intelligent System (ICORIS) ◽

10.1109/icoris.2019.8874871 ◽

2019 ◽

Cited By ~ 1

Author(s):

Dellon Handrata ◽

Christian Nathaniel Purwanto ◽

Fransisca Haryanti Chandra ◽

Joan Santoso ◽

Gunawan

Keyword(s):

Short Term Memory ◽

Short Term ◽

Term Memory ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Long Short Term Memory ◽

Speech Tagging

A Persian part of speech tagging system using the long short-term memory neural network

2020 6th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS) ◽

10.1109/icspis51611.2020.9349556 ◽

2020 ◽

Author(s):

Abbas Koochari ◽

Abdorreza Alavi Gharahbagh ◽

Vahid Hajihashemi

Keyword(s):

Neural Network ◽

Short Term Memory ◽

Short Term ◽

Term Memory ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Long Short Term Memory ◽

Speech Tagging ◽

Tagging System

Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss

10.18653/v1/p16-2067 ◽

2016 ◽

Cited By ~ 43

Author(s):

Barbara Plank ◽

Anders Søgaard ◽

Yoav Goldberg

Keyword(s):

Short Term Memory ◽

Memory Models ◽

Short Term ◽

Term Memory ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Long Short Term Memory ◽

Speech Tagging

Amazigh part-of-speech tagging with machine learning and deep learning

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v24.i3.pp1814-1822 ◽

2021 ◽

Vol 24 (3) ◽

pp. 1814

Author(s):

Otman Maarouf ◽

Rachid El Ayachi ◽

Mohamed Biniz

Keyword(s):

Decision Tree ◽

Language Processing ◽

Conditional Random Fields ◽

Short Term Memory ◽

Long Distance ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

French And English ◽

Speech Tagging

Natural language processing (NLP) is a branch of artificial intelligence that analyzes, understands, and transforms natural languages with computers, in both written and spoken settings. Part-of-speech (POS) tagging marks each word with its grammatical category according to its role in the statement. In the literature, POS tagging has been applied to only a few languages, in particular French and English. This paper investigates attention-based long short-term memory (LSTM) networks and a simple recurrent neural network (RNN) for Tifinagh POS tagging, compared with conditional random fields (CRF) and a decision tree. The appeal of LSTM networks is their strength in modeling long-distance dependencies. The experimental results show that LSTM networks perform better than the RNN, CRF, and decision tree, whose performance is close.

Lexical Rule and Lexicon Effect for Part of Speech Tagging Bahasa Madura

Matrik Jurnal Manajemen Teknik Informatika dan Rekayasa Komputer ◽

10.30812/matrik.v18i1.332 ◽

2018 ◽

Vol 18 (1) ◽

pp. 65-72

Author(s):

Nindian Puspa Dewi ◽

Ubaidi Ubaidi

Keyword(s):

Text Processing ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Speech Tagging ◽

Bahasa Indonesia

POS tagging is the foundation for developing text processing for a language. In this study we examine the effect of using a lexicon and of morphological word changes on determining the correct tag for a word. Rules based on word morphology, such as prefixes, suffixes, and infixes, are commonly called lexical rules. This study applies lexical rules produced by a learner using the Brill Tagger algorithm. Madurese (Bahasa Madura) is a regional language spoken on Madura Island and several other islands in East Java. The object of this study is Madurese, which has far more affixation variants than Indonesian (Bahasa Indonesia). In this study, the lexicon is used not only to look up Madurese base words but also as one of the stages of POS tagging. Tests using the lexicon reached an accuracy of 86.61%, whereas without the lexicon the accuracy reached only 28.95%. From this it can be concluded that the lexicon has a strong influence on POS tagging.
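The interplay of a lexicon and affix-based lexical rules can be sketched as a two-stage lookup: consult the lexicon first, and fall back to affix rules only for unknown words. This is a minimal illustration, not the paper's tagger; the lexicon entries, the affix rules, the tag names, and the function `tag_word` are all hypothetical, and English-like placeholder words are used instead of Madurese data.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: known base words and their tags.
LEXICON: Dict[str, str] = {"walk": "VERB", "house": "NOUN"}

# Hypothetical lexical rules: (affix kind, affix, tag to assign).
AFFIX_RULES: List[Tuple[str, str, str]] = [
    ("prefix", "re", "VERB"),
    ("suffix", "ness", "NOUN"),
]


def tag_word(word: str,
             lexicon: Dict[str, str] = LEXICON,
             rules: List[Tuple[str, str, str]] = AFFIX_RULES,
             default: str = "NOUN") -> str:
    """Tag one word: lexicon lookup first, then affix rules, then a default."""
    if word in lexicon:
        return lexicon[word]
    for kind, affix, tag in rules:
        if kind == "prefix" and word.startswith(affix):
            return tag
        if kind == "suffix" and word.endswith(affix):
            return tag
    return default


for w in ["house", "rewalk", "kindness", "xyz"]:
    print(w, tag_word(w))
```

With an empty lexicon every word falls through to the rules or the default tag, which is one plausible reason accuracy could drop sharply without a lexicon.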

Pengaruh Part of Speech Tagging Berbasis Aturan dan Distribusi Probabilitas Maximum Entropy untuk Bahasa Jawa Krama

Jurnal Buana Informatika ◽

10.24002/jbi.v7i4.764 ◽

2016 ◽

Vol 7 (4) ◽

Author(s):

Hafiz Ridha Pramudita ◽

Ema Utami ◽

Armadyah Amborowati

Keyword(s):

Maximum Entropy ◽

Syntactic Category ◽

Rule Based ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Speech Tagging ◽

Local Languages

Abstract. Javanese is one of the local languages of Indonesia and is used by most of the population of Indonesia. The language has a complex grammar because it embraces values of politeness, which are determined by the use of words conveying courtesy, known as raos alus (a sense of politeness). Every word in Javanese belongs to a certain part of speech, as in other languages. POS tagging is an important part of the field of Natural Language Processing (NLP). Part-of-speech (POS) tagging is the process of assigning a syntactic category, such as noun, verb, or adjective, to every word in a document or text. This study examined POS tagging with Maximum Entropy and rule-based methods for Javanese Krama (High Javanese), using the OpenNLP library to measure the maximum entropy. The results show that Maximum Entropy and rule-based methods can be used for POS tagging of Javanese Krama, with a highest accuracy of 97.67%.

Keywords: POS Tagging, NLP, Maximum Entropy, Rule Based, Javanese Krama Language

PENENTUAN KELAS KATA PADA PART OF SPEECH TAGGING KATA AMBIGU BAHASA INDONESIA

JISKA (Jurnal Informatika Sunan Kalijaga) ◽

10.14421/jiska.2018.23-05 ◽

2018 ◽

Vol 2 (3) ◽

pp. 157

Author(s):

Ahmad Subhan Yazid ◽

Agung Fatwanto

Keyword(s):

Language Processing ◽

Word Class ◽

Rule Based ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Ambiguous Words ◽

Computer Science Faculty ◽

Speech Tagging ◽

Bahasa Indonesia

Indonesian plays a fundamental role in communication. Ambiguity is a problem in its machine learning implementation. In Natural Language Processing, part-of-speech (POS) tagging helps reduce this problem. This study uses a rule-based method to determine the best word class for ambiguous words in Indonesian. The research follows several stages: knowledge inventory, algorithm design, implementation, testing, analysis, and conclusions. The initial data is an Indonesian corpus developed by the Language department of the Computer Science Faculty, Indonesia University. The data is then processed and presented descriptively following certain rules and specifications. The result is a POS tagging algorithm comprising 71 rules, expressed as flowcharts and descriptive sentence notation. In testing, the algorithm correctly labeled 92 of 100 tested words (92%). The implementation results are influenced by the availability of rules, word-class tagsets, and corpus data.

Improving Brill's tagger lexical and transformation rule for Afaan Oromo language

10.7287/peerj.preprints.1225v1 ◽

2015 ◽

Author(s):

Abraham G Ayana

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Transformation Rule ◽

Initial State ◽

Training Corpus ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Speech Tagging

Natural Language Processing (NLP) refers to human-like language processing, which makes it a discipline within the field of Artificial Intelligence (AI). The ultimate goal of NLP research is to parse and understand language, which has not yet been fully achieved. For this reason, much research in NLP has focused on intermediate tasks that make sense of some of the structure inherent in language without requiring complete understanding. One such task is part-of-speech tagging, or simply tagging. The lack of a standard part-of-speech tagger for Afaan Oromo is a main obstacle for researchers in machine translation, spell checking, dictionary compilation, and automatic sentence parsing and construction. Even though several works have addressed POS tagging for Afaan Oromo, the performance of the tagger has not yet been sufficiently improved. Hence, the aim of this thesis is to improve the lexical and transformation rules of Brill's tagger for Afaan Oromo POS tagging with a sufficiently large training corpus. Accordingly, Afaan Oromo literature on grammar and morphology was reviewed to understand the nature of the language and to identify possible tagsets. As a result, 26 broad tags were identified, and 17,473 words from around 1100 sentences, containing 6750 distinct words, were tagged for training and testing; 258 of these sentences were taken from previous work. Since only a few ready-made standard corpora exist, manually tagging a corpus for this work was challenging, and it is therefore recommended that a standard corpus be prepared. Transformation-based error-driven learning was adapted for Afaan Oromo part-of-speech tagging. Different experiments were conducted for the rule-based approach, taking 20% of the whole data for testing, and a comparison was made with the previously adapted Brill's tagger. The previously adapted Brill's tagger shows an accuracy of 80.08%, whereas the improved Brill's tagger shows an accuracy of 95.6%, an improvement of 15.52 percentage points. Hence, it is found that the size of the training corpus, the rule-generating system in the lexical rule learner, and, moreover, the use of an Afaan Oromo HMM tagger as the initial-state tagger have a significant effect on the improvement of the tagger.
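The general idea behind transformation-based tagging, an initial-state tagger followed by an ordered list of contextual rewrite rules, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's system: a plain lexicon lookup stands in for the HMM initial-state tagger, only one rule template (the previous tag) is supported, and the toy English data, the tag names, and the names `Rule`, `initial_tags`, `apply_rules`, and `accuracy` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Rule(NamedTuple):
    from_tag: str   # tag to be rewritten
    to_tag: str     # replacement tag
    prev_tag: str   # rule fires only when the preceding tag equals this


def initial_tags(words: List[str], lexicon: Dict[str, str],
                 default: str = "NOUN") -> List[str]:
    """Initial-state tagging: most likely tag per word, ignoring context."""
    return [lexicon.get(w, default) for w in words]


def apply_rules(tags: List[str], rules: List[Rule]) -> List[str]:
    """Apply transformation rules in order; each rule rewrites matching tags."""
    tags = list(tags)
    for rule in rules:
        # Find all matching positions first, then rewrite them together.
        hits = [i for i in range(1, len(tags))
                if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag]
        for i in hits:
            tags[i] = rule.to_tag
    return tags


def accuracy(pred: List[str], gold: List[str]) -> float:
    """Fraction of positions where the predicted tag equals the gold tag."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


words = ["they", "want", "to", "race"]
gold = ["PRON", "VERB", "PART", "VERB"]
lexicon = {"they": "PRON", "want": "VERB", "to": "PART", "race": "NOUN"}
rules = [Rule(from_tag="NOUN", to_tag="VERB", prev_tag="PART")]

start = initial_tags(words, lexicon)
final = apply_rules(start, rules)
print(accuracy(start, gold), accuracy(final, gold))
```

In a learner, candidate rules would be scored by how many errors they fix on the training corpus and the best one added to the list at each round; that training loop is omitted here.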
