Support Vector Machine for Chinese Part-Of-Speech Tagging in Speech Synthesis Systems

Author(s):  
Xiang Wang ◽  
Jianping Zhang ◽  
Yonghong Yan
2015 ◽  
Vol 48 ◽  
pp. 507-512 ◽  
Author(s):  
Bishwa Ranjan Das ◽  
Smrutirekha Sahoo ◽  
Chandra Sekhar Panda ◽  
Srikanta Patnaik

Author(s):  
Hour Kaing ◽  
Chenchen Ding ◽  
Masao Utiyama ◽  
Eiichiro Sumita ◽  
Sethserey Sam ◽  
...  

As a highly analytic language, Khmer has considerable ambiguities in tokenization and part-of-speech (POS) tagging, which this study investigates. Specifically, a 20,000-sentence Khmer corpus with manual tokenization and POS-tagging annotation is released after a series of work over the last 4 years. This is the largest morphologically annotated Khmer dataset as of 2020, when this article was prepared. Based on the annotated data, experiments were conducted to establish a comprehensive benchmark on the automatic processing of tokenization and POS tagging for Khmer. Specifically, a support vector machine, a conditional random field (CRF), a long short-term memory (LSTM)-based recurrent neural network, and an integrated LSTM-CRF model have been investigated and discussed. As a primary conclusion, processing at the morpheme level is satisfactory for the provided data. However, it is intrinsically difficult to identify further grammatical constituents of compounds or phrases because of the complex analytic features of the language. Syntactic annotation and automatic parsing for Khmer are planned for the near future.


2014 ◽  
Vol 519-520 ◽  
pp. 784-787
Author(s):  
Zhi Qiang Wu ◽  
Hong Zhi Yu ◽  
Shu Hui Wan

Tagging Tibetan parts of speech is a basic task in Tibetan information processing; the results can be used in machine translation, speech synthesis, and other applications. By studying Tibetan grammar and the classification of Tibetan parts of speech, we established a Tibetan part-of-speech tag set, tagged a corpus, and used conditional random fields (CRFs) to tag Tibetan parts of speech automatically. The experimental results show a part-of-speech tagging accuracy of 94.2% on the closed test set and 91.5% on the open test set.


Author(s):  
Nindian Puspa Dewi ◽  
Ubaidi Ubaidi

POS tagging is the foundation for developing text processing for a language. In this study, we examine how the use of a lexicon and the morphological changes of words affect the choice of the correct tag for a word. Rules based on word morphology, such as prefixes, suffixes, and infixes, are commonly called lexical rules. This study applies lexical rules learned with the Brill Tagger algorithm. Madurese is a regional language spoken on Madura Island and several other islands in East Java. The object of this study is Madurese, which has far more affixation variants than Indonesian. In this study, the lexicon is used not only to look up Madurese root words but also as one of the stages of POS tagging. Experiments using the lexicon reached an accuracy of 86.61%, whereas without the lexicon accuracy reached only 28.95%. From this it can be concluded that the lexicon has a strong influence on POS tagging.
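
The two-stage idea described in this abstract (lexicon lookup first, affix-based lexical rules as a fallback) can be illustrated with a minimal sketch. This is not the authors' implementation and it does not include Brill's rule learner; the rules here are written by hand. The lexicon entries, the affix rules, the tag names, and the function names `tag_word` and `tag_sentence` are all hypothetical, and English stand-in words are used instead of Madurese data.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon (English stand-ins, not data from the study).
LEXICON: Dict[str, str] = {"the": "DET", "dog": "NOUN", "runs": "VERB"}

# Hypothetical hand-written lexical rules: (affix, kind, tag), tried in order.
# A real Brill tagger would learn such rules from an annotated corpus.
LEXICAL_RULES: List[Tuple[str, str, str]] = [
    ("ly", "suffix", "ADV"),
    ("ness", "suffix", "NOUN"),
    ("un", "prefix", "ADJ"),
]

# Assumed fallback tag for words that match neither lexicon nor rules.
DEFAULT_TAG = "NOUN"


def tag_word(word: str, use_lexicon: bool = True) -> str:
    """Tag one word: lexicon lookup first, then affix rules, then default."""
    key = word.lower()
    if use_lexicon and key in LEXICON:
        return LEXICON[key]
    for affix, kind, tag in LEXICAL_RULES:
        if len(key) <= len(affix):
            continue  # the affix must leave a non-empty stem
        if kind == "suffix" and key.endswith(affix):
            return tag
        if kind == "prefix" and key.startswith(affix):
            return tag
    return DEFAULT_TAG


def tag_sentence(words: List[str], use_lexicon: bool = True) -> List[Tuple[str, str]]:
    """Tag each word of a pre-tokenized sentence independently."""
    return [(w, tag_word(w, use_lexicon)) for w in words]


if __name__ == "__main__":
    sentence = ["The", "dog", "runs", "quickly"]
    print(tag_sentence(sentence))                     # with lexicon
    print(tag_sentence(sentence, use_lexicon=False))  # rules and default only
```

Calling `tag_sentence` with `use_lexicon=False` shows, on this toy input, how words that only the lexicon can resolve fall back to the default tag, which is the kind of effect the reported accuracy gap points to.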

