Improved Term Weighting Factors for Keyword Extraction in Hierarchical Category Structure and Thai Text Classification

Author(s):  
Boonthida Chiraratanasopha ◽  
Thanaruk Theeramunkong ◽  
Salin Boonbrahm
2021 ◽  
Vol 40 (1) ◽  
pp. 57-82
Author(s):  
Boonthida Chiraratanasopha ◽  
Salin Boonbrahm ◽  
Thanaruk Theeramunkong

IEEE Access ◽  
2019 ◽  
Vol 7 ◽  
pp. 166578-166592
Author(s):  
Surender Singh Samant ◽  
N. L. Bhanu Murthy ◽  
Aruna Malapati

2020 ◽  
Vol 14 (4) ◽  
pp. 101076
Author(s):  
Turgut Dogan ◽  
Alper Kursat Uysal

2007 ◽  
Vol 01 (04) ◽  
pp. 421-439
Author(s):  
SAMER HASSAN ◽  
RADA MIHALCEA ◽  
CARMEN BANEA

This paper describes a new approach for estimating term weights in a document, and shows how the new weighting scheme can be used to improve the accuracy of a text classifier. The method uses term co-occurrence as a measure of dependency between word features. A random walk model is applied on a graph encoding words and co-occurrence dependencies, resulting in scores that represent a quantification of how a particular word feature contributes to a given context. Experiments performed on three standard classification datasets show that the new random walk based approach outperforms the traditional term frequency approach of feature weighting.
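The random-walk idea in the abstract above can be illustrated with a minimal sketch. It assumes a PageRank-style walk over an undirected, unweighted co-occurrence graph; the window size, damping factor, iteration count, and function names are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link two distinct words if they occur within `window` positions (assumed window)."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def random_walk_scores(graph, damping=0.85, iterations=50):
    """PageRank-style power iteration; returns one score per word."""
    nodes = list(graph)
    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbour passes on its score, split evenly over its own edges.
            incoming = sum(scores[m] / len(graph[m]) for m in graph[n])
            updated[n] = (1 - damping) / len(nodes) + damping * incoming
        scores = updated
    return scores


tokens = "the cat sat on the mat the cat ran".split()
scores = random_walk_scores(cooccurrence_graph(tokens))
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{score:.3f}")
```

In a classifier, scores like these would stand in for raw term frequency when building each document's feature vector; how the paper combines them with other factors is not stated in the abstract.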


Author(s):  
Boonthida Chiraratanasopha ◽  
Thanaruk Theeramunkong ◽  
Salin Boonbrahm

Automatic hierarchical text classification is a challenging and much-needed task, as hierarchical taxonomies proliferate with the growth of knowledge organization. The hierarchical structure identifies dependence relationships between categories, in which general and specific concepts within the tree can overlap. This paper presents the use of the frequency with which a term occurs in related categories of the hierarchical tree to aid document classification. Four extended term weightings of Relative Inverse Document Frequency (IDFr), based on a term's own category, its parent category, its sibling categories, and its child categories, are exploited to generate a classifier model using a centroid-based technique. In experiments on hierarchical text classification of Thai documents, IDFr achieved the best accuracy and F-measure, 53.65% and 50.80% respectively, with the Top-n feature set under family-based evaluation; these are 2.35% and 1.15% higher than TF-IDF in the same settings.
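For orientation, the centroid-based technique named in the abstract can be sketched as follows. This is a flat (non-hierarchical) illustration using plain TF-IDF, the baseline the abstract compares against; the IDFr weighting itself is not defined in this listing and is not reproduced here. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed inverse document frequency over tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(n / df[t]) + 1.0 for t in df}


def normalize(vec):
    """Scale a sparse vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else vec


def weight(doc, idf):
    """TF-IDF vector for one document; terms unseen in training are dropped."""
    return normalize({t: c * idf[t] for t, c in Counter(doc).items() if t in idf})


def train_centroids(docs, labels, idf):
    """One centroid per category: the normalized sum of its document vectors."""
    sums = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for t, v in weight(doc, idf).items():
            sums[label][t] += v
    return {label: normalize(dict(s)) for label, s in sums.items()}


def classify(doc, idf, centroids):
    """Assign the category whose centroid has the highest cosine similarity."""
    vec = weight(doc, idf)

    def cosine(centroid):
        return sum(v * centroid.get(t, 0.0) for t, v in vec.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))


docs = [["football", "goal", "match"], ["goal", "team", "match"],
        ["election", "vote", "party"], ["vote", "policy", "party"]]
labels = ["sports", "sports", "politics", "politics"]
idf = fit_idf(docs)
centroids = train_centroids(docs, labels, idf)
print(classify(["goal", "team"], idf, centroids))
```

A hierarchy-aware variant would replace `fit_idf` with a weighting that also looks at a term's frequency in parent, sibling, and child categories, as the abstract describes.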

