Perkembangan Part-of-Speech Tagger Bahasa Indonesia (The Development of Indonesian Part-of-Speech Taggers)

2019 ◽  
Vol 2 (2) ◽  
pp. 34
Author(s):  
Mia Kamayani

The aim of this article is to review the literature on part-of-speech tagging (POS tagger) methods for Indonesian developed over the last 11 years (since 2008). The article can serve as a roadmap for Indonesian POS tagging and as a basis for future work to adopt a standard dataset and tagset as a benchmark for methods. Fifteen publications are discussed, covering the datasets, tagsets, and methods used for Indonesian POS tagging. The most widely used dataset, and the most likely candidate for a standard corpus, is the IDN Tagged Corpus, comprising more than 250,000 tokens. The Indonesian tagset has not yet been standardized, with label inventories varying from 16 to 37 tags. The most actively developed method, and the likely state of the art, is the neural network, in biLSTM and CRF variants, which so far yields the highest F1 and accuracy scores (>96%).

2020 ◽  
Vol 49 (4) ◽  
pp. 482-494
Author(s):  
Jurgita Kapočiūtė-Dzikienė ◽  
Senait Gebremichael Tesfagergish

Deep Neural Networks (DNNs) have proven especially successful in Natural Language Processing (NLP), including Part-of-Speech (POS) tagging—the process of mapping words to their corresponding POS labels depending on context. Despite recent advances in language technology, low-resourced languages, such as Tigrinya, an East African language of the Northern Ethiopic branch, have received too little attention. We investigate the effectiveness of Deep Learning (DL) solutions for the low-resourced Tigrinya language, selecting it as a testbed and evaluating state-of-the-art DL approaches in search of the most accurate POS tagger. We evaluated DNN classifiers (a Feed-Forward Neural Network (FFNN), Long Short-Term Memory (LSTM), bidirectional LSTM, and a Convolutional Neural Network (CNN)) on top of neural word2vec word embeddings, with the small Nagaoka Tigrinya Corpus as training data. To determine the best DNN classifier type, architecture, and hyper-parameter set, both manual and automatic hyper-parameter tuning were performed. The BiLSTM proved most suitable for the task, achieving the highest accuracy, 92%, which is 65% above the random baseline.


Repositor ◽  
2020 ◽  
Vol 2 (7) ◽  
pp. 897
Author(s):  
Dyah Anitia ◽  
Yuda Munarko ◽  
Yufis Azhar

In this study, we investigate a parser with a left-corner approach for Indonesian-language tweet data. The collection of 850 tweets was divided into three datasets: POS-tagger training data, parser training data, and test data. The left-corner parser combines two methods, top-down and bottom-up: top-down parsing is used to assign a part-of-speech tag to each word, and bottom-up parsing is used to recognize sentence structure. The top-down step uses a tagset of 23 tags, and the phrase types used to determine sentence structure are noun, verb, adjective, adverb, and prepositional phrases. The left-corner approach achieved 88.29% precision, 68.3% recall, and an F1 measure of 77.02%, exceeding the bottom-up approach, which achieved 68.79% precision, 47.12% recall, and an F1 measure of 55.9%. This is because the word classes assigned in the top-down step influence the sentence structure built in the bottom-up step.
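The two phases described above—top-down tag assignment followed by bottom-up phrase recognition—can be illustrated with a toy sketch. The lexicon, tags, and phrase rules below are hypothetical stand-ins, not the authors' grammar or tagset:

```python
# Toy illustration: dictionary lookup stands in for the top-down tagging step,
# and a greedy pattern matcher stands in for the bottom-up phrase step.

LEXICON = {  # word -> POS tag (toy Indonesian examples)
    "saya": "PRON", "makan": "VERB", "nasi": "NOUN",
    "goreng": "ADJ", "di": "ADP", "rumah": "NOUN",
}

# bottom-up phrase rules: tag sequence -> phrase label (longest patterns first)
PHRASE_RULES = [
    (("NOUN", "ADJ"), "NP"),   # e.g. "nasi goreng" -> NP
    (("ADP", "NOUN"), "PP"),   # e.g. "di rumah" -> PP
    (("VERB",), "VP"),
    (("NOUN",), "NP"),
    (("PRON",), "NP"),
]

def tag(words):
    """Top-down step: attach a POS tag to each word."""
    return [(w, LEXICON.get(w, "X")) for w in words]

def chunk(tagged):
    """Bottom-up step: greedily merge tag sequences into phrases."""
    phrases, i = [], 0
    while i < len(tagged):
        for pattern, label in PHRASE_RULES:
            n = len(pattern)
            if tuple(t for _, t in tagged[i:i + n]) == pattern:
                phrases.append((label, [w for w, _ in tagged[i:i + n]]))
                i += n
                break
        else:
            phrases.append(("X", [tagged[i][0]]))
            i += 1
    return phrases

print(chunk(tag("saya makan nasi goreng di rumah".split())))
# -> [('NP', ['saya']), ('VP', ['makan']), ('NP', ['nasi', 'goreng']), ('PP', ['di', 'rumah'])]
```

A real left-corner parser interleaves the two directions against a full CFG rather than running them as separate passes; the sketch only shows why tagging errors in the first step propagate into the phrase structure of the second.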


Author(s):  
Umrinderpal Singh ◽  
Vishal Goyal

A part-of-speech tagger assigns a tag to every word in an input sentence. The tags cover the parts of speech of a particular language—noun, pronoun, verb, adjective, conjunction, etc.—and may include subcategories of each. Part-of-speech tagging is a basic preprocessing task for most Natural Language Processing (NLP) applications, such as Information Retrieval, Machine Translation, and Grammar Checking, and belongs to the larger family of sequence-labeling problems. Part-of-speech tagging for Punjabi is not widely explored territory. We discuss rule-based and HMM-based part-of-speech taggers for Punjabi and compare the accuracies of the two approaches. The system was developed using 35 standard part-of-speech tags. We evaluate our system on unseen data with a state-of-the-art accuracy of 93.3%.
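The HMM approach compared above picks the tag sequence that maximizes transition and emission probabilities, typically via Viterbi decoding. A minimal sketch with toy English probabilities (hypothetical numbers, not the paper's Punjabi model):

```python
import math

# Toy HMM parameters (hypothetical, for illustration only).
TAGS = ["NOUN", "VERB"]
trans = {("<s>", "NOUN"): 0.7, ("<s>", "VERB"): 0.3,
         ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
         ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4}
emit = {("NOUN", "dog"): 0.6, ("NOUN", "runs"): 0.1,
        ("VERB", "dog"): 0.1, ("VERB", "runs"): 0.7}

def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    # best[t] = (log prob of best path ending in tag t, that path)
    best = {t: (math.log(trans[("<s>", t)] * emit.get((t, words[0]), 1e-6)), [t])
            for t in TAGS}
    for w in words[1:]:
        new = {}
        for t in TAGS:
            p, path = max(
                (best[prev][0] + math.log(trans[(prev, t)] * emit.get((t, w), 1e-6)),
                 best[prev][1])
                for prev in TAGS)
            new[t] = (p, path + [t])
        best = new
    return max(best.values())[1]

print(viterbi(["dog", "runs"]))  # -> ['NOUN', 'VERB']
```

Real taggers estimate these tables from a tagged corpus with smoothing; the `1e-6` floor here is a crude stand-in for unknown-word handling.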


Symmetry ◽  
2019 ◽  
Vol 11 (6) ◽  
pp. 785 ◽  
Author(s):  
Joohong Lee ◽  
Sangwoo Seo ◽  
Yong Suk Choi

Classifying semantic relations between entity pairs in sentences is an important task in natural language processing (NLP). Most previous models applied to relation classification rely on high-level lexical and syntactic features obtained by NLP tools such as WordNet, the dependency parser, part-of-speech (POS) tagger, and named entity recognizers (NER). In addition, state-of-the-art neural models based on attention mechanisms do not fully utilize information related to the entity, which may be the most crucial feature for relation classification. To address these issues, we propose a novel end-to-end recurrent neural model that incorporates an entity-aware attention mechanism with a latent entity typing (LET) method. Our model not only effectively utilizes entities and their latent types as features, but also builds word representations by applying self-attention based on symmetrical similarity of a sentence itself. Moreover, the model is interpretable by visualizing applied attention mechanisms. Experimental results obtained with the SemEval-2010 Task 8 dataset, which is one of the most popular relation classification tasks, demonstrate that our model outperforms existing state-of-the-art models without any high-level features.
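The self-attention used to build word representations can be illustrated in its generic scaled dot-product form (this is the standard mechanism, not the paper's entity-aware variant, and the 2-d vectors are made up):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(X):
    """Scaled dot-product self-attention over a list of word vectors.
    Simplified: queries = keys = values = X, with no learned projections."""
    d = len(X[0])
    out = []
    for q in X:
        # similarity of this word to every word in the sentence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        # each output vector is a convex combination of all input vectors
        out.append([sum(wi * v[j] for wi, v in zip(w, X)) for j in range(d)])
    return out

# three toy 2-d "word embeddings"
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = self_attention(X)
```

Because queries and keys are the same vectors here, the score matrix is symmetric, which matches the "symmetrical similarity of a sentence itself" phrasing above; the full model adds entity-aware terms and latent-type embeddings on top of this.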


Author(s):  
M. Bevza

We analyze neural network architectures that yield state-of-the-art results on the named entity recognition (NER) task and propose a new architecture for improving results even further. We have analyzed a number of ideas and approaches that researchers have used to achieve state-of-the-art results in a variety of NLP tasks, and in this work we present those we consider most likely to improve existing state-of-the-art NER solutions. The architecture is inspired by recent developments in language modeling and is based on a multi-task learning approach: we feed part-of-speech tags, produced by a state-of-the-art tagger, as input to the network, and also ask the network to produce those tags in addition to the main NER tags. In this way, knowledge is distilled from a strong part-of-speech tagger into our smaller network. We hypothesize that designing the neural network architecture in this way improves the generalizability of the system, and we provide arguments to support this statement.


2018 ◽  
Vol 5 ◽  
pp. 1-15 ◽  
Author(s):  
Robert Östling

Deep neural networks have advanced the state of the art in numerous fields, but they generally suffer from low computational efficiency and the level of improvement compared to more efficient machine learning models is not always significant. We perform a thorough PoS tagging evaluation on the Universal Dependencies treebanks, pitting a state-of-the-art neural network approach against UDPipe and our sparse structured perceptron-based tagger, efselab. In terms of computational efficiency, efselab is three orders of magnitude faster than the neural network model, while being more accurate than either of the other systems on 47 of 65 treebanks.


Author(s):  
Linshu Ouyang ◽  
Yongzheng Zhang ◽  
Hui Liu ◽  
Yige Chen ◽  
Yipeng Wang

Authorship verification is an important problem that has many applications. The state-of-the-art deep authorship verification methods typically leverage character-level language models to encode author-specific writing styles. However, they often fail to capture syntactic level patterns, leading to sub-optimal accuracy in cross-topic scenarios. Also, due to imperfect cross-author parameter sharing, it's difficult for them to distinguish author-specific writing style from common patterns, leading to data-inefficient learning. This paper introduces a novel POS-level (Part of Speech) gated RNN based language model to effectively learn the author-specific syntactic styles. The author-agnostic syntactic information obtained from the POS tagger pre-trained on large external datasets greatly reduces the number of effective parameters of our model, enabling the model to learn accurate author-specific syntactic styles with limited training data. We also utilize a gated architecture to learn the common syntactic writing styles with a small set of shared parameters and let the author-specific parameters focus on each author's special syntactic styles. Extensive experimental results show that our method achieves significantly better accuracy than state-of-the-art competing methods, especially in cross-topic scenarios (over 5% in terms of AUC-ROC).


2021 ◽  
Author(s):  
Emanuel Huber da Silva ◽  
Thiago Alexandre Salgueiro Pardo ◽  
Norton Trevisan Roman ◽  
Ariani Di Felippo

Automatically dealing with Natural Language User-Generated Content (UGC) is a challenging task of utmost importance, given the amount of information available over the web. We present in this paper an effort on building tokenization and Part of Speech (PoS) tagging systems for tweets in Brazilian Portuguese, following the guidelines of the Universal Dependencies (UD) project. We propose a rule-based tokenizer and the customization of current state-of-the-art UD-based tagging strategies for Portuguese, achieving a 98% f-score for tokenization, and a 95% f-score for PoS tagging. We also introduce DANTEStocks, the corpus of stock market tweets on which we base our work, presenting preliminary evidence of the multi-genre capacity of our PoS tagger.


2019 ◽  
Vol 2 (1) ◽  
pp. 6 ◽  
Author(s):  
Febyana Ramadhanti ◽  
Yudi Wibisono ◽  
Rosa Ariani Sukamto

A part-of-speech (PoS) tagger performs one of the basic tasks in natural language processing (NLP): marking each word in an input sentence with its word category (part of speech). The hidden Markov model (HMM) is a probabilistic PoS-tagging algorithm and therefore depends heavily on the training corpus. The limited coverage of the training corpus and the breadth of the Indonesian vocabulary give rise to the problem of out-of-vocabulary (OOV) words. This study compares a PoS tagger using an HMM with morphological analysis (HMM+AM) against an HMM tagger without it, using the same training and test corpora. The test corpus has a 30% OOV rate over 6,676 tokens (740 input sentences). The HMM-only system achieved an accuracy of 97.54%, while the HMM system with morphological analysis achieved the highest accuracy, 99.14%.
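The morphological-analysis idea for OOV words described above can be sketched as an affix-based fallback: when a word is missing from the training lexicon, guess its tag from Indonesian affixes instead of assigning a flat unknown-word probability. The rules and lexicon below are illustrative toys, not the paper's analyzer:

```python
# Toy affix heuristics for Indonesian (illustrative, not the paper's full
# morphological analyzer): rules are (prefix, suffix, tag), None = don't care.
AFFIX_RULES = [
    ("me", None, "VERB"),   # prefix me-  : active verb (e.g. membaca)
    (None, "kan", "VERB"),  # suffix -kan : verb (e.g. jalankan)
    (None, "an", "NOUN"),   # suffix -an  : noun (e.g. makanan)
]

LEXICON = {"makan": "VERB", "nasi": "NOUN"}  # toy training lexicon

def guess_tag(word):
    """Lexicon lookup with a morphological fallback for OOV words."""
    if word in LEXICON:
        return LEXICON[word]
    for prefix, suffix, tag in AFFIX_RULES:  # first matching rule wins
        if prefix and not word.startswith(prefix):
            continue
        if suffix and not word.endswith(suffix):
            continue
        return tag
    return "NOUN"  # default to an open class

print(guess_tag("membaca"))  # OOV, prefix me- -> VERB
print(guess_tag("makanan"))  # OOV, suffix -an -> NOUN
```

In an actual HMM+AM tagger, the analyzer would supply an emission-probability estimate for the OOV word rather than a hard tag, and the Viterbi decoder would still weigh it against the transition probabilities.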


Author(s):  
Nindian Puspa Dewi ◽  
Ubaidi Ubaidi

POS tagging is fundamental to developing text processing for a language. In this study, we examine the influence of lexicon use and word-morphology changes on determining the correct tag for a word. Rules based on word morphology, such as prefixes, suffixes, and infixes, are commonly called lexical rules. This study applies lexical rules produced by a learner using the Brill tagger algorithm. Madurese is a regional language spoken on Madura Island and several other islands in East Java, and as the object of this study it exhibits far more affixation variety than Indonesian. Here, the lexicon is used not only to look up Madurese root words but also as one stage of POS-tag assignment. Experiments using the lexicon reached an accuracy of 86.61%, whereas without the lexicon the accuracy was only 28.95%. From this we conclude that the lexicon strongly influences POS tagging.
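The lexical-rule learning described above can be sketched as counting which affixes reliably predict a tag. This is a simplified stand-in for the Brill tagger's lexical-rule learner, and the tagged items below are hypothetical Madurese-like examples, not the paper's data:

```python
from collections import Counter

# Toy tagged lexicon (hypothetical examples): ng- nasal prefix marks verbs,
# -na suffix marks possessed nouns.
CORPUS = [("ngakan", "VERB"), ("ngenom", "VERB"), ("ngala", "VERB"),
          ("bengkona", "NOUN"), ("romana", "NOUN"), ("motorna", "NOUN")]

def learn_lexical_rules(corpus, min_count=2):
    """Learn 'affix -> tag' rules by counting 2-character prefixes/suffixes,
    keeping the majority tag for affixes seen at least min_count times."""
    counts = Counter()
    for word, tag in corpus:
        counts[("prefix", word[:2], tag)] += 1
        counts[("suffix", word[-2:], tag)] += 1
    best = {}
    for (kind, affix, tag), c in counts.items():
        if c >= min_count and c > best.get((kind, affix), (0, None))[0]:
            best[(kind, affix)] = (c, tag)
    return {key: tag for key, (c, tag) in best.items()}

def tag_word(rules, word, default="NOUN"):
    """Apply learned rules; fall back to a default open-class tag."""
    return (rules.get(("prefix", word[:2]))
            or rules.get(("suffix", word[-2:]))
            or default)

rules = learn_lexical_rules(CORPUS)
print(tag_word(rules, "ngoca"))     # ng- prefix -> VERB
print(tag_word(rules, "sapedana"))  # -na suffix -> NOUN
```

The full Brill algorithm additionally learns contextual transformation rules that correct tags based on neighboring words, which this affix-only sketch omits.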

