Khmer POS Tagging Using Conditional Random Fields

Author(s):  
Sokunsatya Sangvat ◽  
Charnyote Pluempitiwiriyawej
2021 ◽  
Vol 11 (4) ◽  
pp. 1-13
Author(s):  
Arpitha Swamy ◽  
Srinath S.

Part-of-speech (POS) tagging assigns a POS tag to every word in a text, and named entity recognition (NER) identifies the proper nouns in a text and classifies them into predefined categories. A POS tagger and an NER system for Kannada text are proposed, both based on conditional random fields (CRFs). The POS tagging dataset consists of 147K tokens, of which 103K are used for training and the remaining tokens for testing. The proposed CRF model for POS tagging of Kannada text achieved a precision of 91.3%, a recall of 91.6%, and an F-score of 91.4%. For the NER system, the required data was created manually using a modified tag set containing 40 labels. The NER dataset consists of 16.5K tokens, of which 70% are used for training the model and the remaining 30% for testing. The NER model achieved a precision of 94%, a recall of 93.9%, and an F1-measure of 93.9%.
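As a quick consistency check on the figures reported in this abstract, the sketch below recomputes the F-score from the stated precision and recall. It assumes the reported F-score is the balanced F1 (the harmonic mean of precision and recall); the abstract names "F1-measure" only for the NER model, so applying the same formula to the POS result is an assumption. The function name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F1: the harmonic mean of precision and recall.

    Works on fractions or percentages, as long as both inputs use the same scale.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (percentages).
pos_f1 = f1_score(91.3, 91.6)  # POS tagger
ner_f1 = f1_score(94.0, 93.9)  # NER system

print(f"POS F1: {pos_f1:.2f}")
print(f"NER F1: {ner_f1:.2f}")
```

Both recomputed values fall within 0.1 percentage points of the reported 91.4% and 93.9%, which is what one would expect given that the published precision and recall are themselves rounded.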


2020 ◽  
Vol 26 (6) ◽  
pp. 677-690
Author(s):  
Kareem Darwish ◽  
Mohammed Attia ◽  
Hamdy Mubarak ◽  
Younes Samih ◽  
Ahmed Abdelali ◽  
...  

This work introduces robust multi-dialectal part-of-speech tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses conditional random fields (CRFs), while the second combines word- and character-based representations in a deep neural network with stacked layers of convolutional and recurrent networks with a CRF output layer. We successfully exploit a variety of features that help generalize our models, such as Brown clusters and stem templates. Also, we develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per-dialect results ranging between 90.2% and 95.4%. We obtained the results using a train/dev/test split of 70/10/20 for a data set of 350 tweets per dialect.
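To make the stated 70/10/20 split concrete, the following sketch works out how many tweets land in each partition for 350 tweets per dialect and four dialect groups. It assumes the split is applied to each dialect separately and at the tweet level, which the abstract does not state explicitly; the helper name `split_counts` is hypothetical.

```python
def split_counts(total: int, ratios: tuple = (70, 10, 20)) -> tuple:
    """Return (train, dev, test) sizes for integer percentage ratios.

    The test partition takes whatever remains, so the three sizes always sum to `total`.
    """
    if sum(ratios) != 100:
        raise ValueError("ratios must sum to 100")
    train = total * ratios[0] // 100
    dev = total * ratios[1] // 100
    test = total - train - dev
    return train, dev, test


TWEETS_PER_DIALECT = 350  # from the abstract
NUM_DIALECTS = 4          # Egyptian, Levantine, Gulf, Maghrebi

per_dialect = split_counts(TWEETS_PER_DIALECT)
combined = tuple(n * NUM_DIALECTS for n in per_dialect)

print("per dialect (train, dev, test):", per_dialect)
print("all dialects (train, dev, test):", combined)
```

Under these assumptions each dialect contributes 245 training, 35 development, and 70 test tweets, so a joint multi-dialectal model would see 980 training tweets in total.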


2011 ◽  
Vol 22 (8) ◽  
pp. 1897-1910
Author(s):  
Yun LIU ◽  
Zhi-Ping CAI ◽  
Ping ZHONG ◽  
Jian-Ping YIN ◽  
Jie-Ren CHENG

ROBOT ◽  
2010 ◽  
Vol 32 (3) ◽  
pp. 326-333
Author(s):  
Mingjun WANG ◽  
Jun ZHOU ◽  
Jun TU ◽  
Chengliang LIU
