One text classification algorithm basing on tolerance rough set

Author(s):  
Yang Junchuan ◽  
Tang Yu ◽  
Hu Zhisong ◽  
Lao Jun
Author(s):  
Zhihua Wei ◽  
Duoqian Miao ◽  
Ruizhi Wang ◽  
Zhifei Zhang

Text representation is a prerequisite for document processing tasks such as information retrieval, text classification, and text clustering. It has been studied intensively in recent years, and many effective models have been designed. However, the performance of these models suffers from data sparseness. Existing smoothing techniques usually rely on statistical theory or linguistic information to assign a uniform distribution to absent words; they neither consider the real word distribution nor distinguish between words. This chapter proposes a method based on Tolerance Rough Set theory, a soft computing theory, that uses the upper and lower approximations from Rough Set theory to assign different values to absent words in different approximation regions. In theory, the algorithms can estimate a smoothing value for each absent word according to its relation to the words that are present. Text classification experiments using the Vector Space Model (VSM) and the Latent Dirichlet Allocation (LDA) model on public corpora show that the algorithms greatly improve the performance of the text representation model, especially on unbalanced corpora.
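The idea of giving absent words different values depending on their approximation region can be illustrated with a minimal Python sketch. The abstract does not specify the tolerance relation or the smoothing values, so everything concrete here is an assumption: the tolerance class of a term is taken to be the terms that co-occur with it in at least `theta` documents (a common choice in tolerance rough set models), and `boundary_weight` and `outside_weight` are placeholder values, not the chapter's estimates.

```python
from collections import Counter
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to the terms co-occurring with it in >= theta documents.

    Assumption: document-level co-occurrence defines the tolerance relation.
    """
    cooc = Counter()
    vocab = set()
    for doc in docs:
        terms = sorted(set(doc))
        vocab.update(terms)
        for a, b in combinations(terms, 2):
            cooc[(a, b)] += 1
    classes = {t: {t} for t in vocab}  # reflexive: every term tolerates itself
    for (a, b), n in cooc.items():
        if n >= theta:
            classes[a].add(b)  # symmetric relation
            classes[b].add(a)
    return classes


def approximations(doc, classes):
    """Return the (lower, upper) approximation of a document's term set."""
    present = set(doc)
    lower = {t for t, cls in classes.items() if cls <= present}
    upper = {t for t, cls in classes.items() if cls & present}
    return lower, upper


def smoothed_vector(doc, classes, boundary_weight=0.5, outside_weight=0.0):
    """Term vector in which absent words get region-dependent values.

    The two weights are hypothetical placeholders for illustration only.
    """
    counts = Counter(doc)
    _, upper = approximations(doc, classes)
    vec = {}
    for t in classes:
        if t in counts:
            vec[t] = float(counts[t])   # observed word: keep its count
        elif t in upper:
            vec[t] = boundary_weight    # absent, but related to a present word
        else:
            vec[t] = outside_weight     # absent and unrelated
    return vec


docs = [
    ["rough", "set", "approximation"],
    ["rough", "set", "tolerance"],
    ["text", "classification", "corpus"],
    ["text", "classification", "tolerance"],
]
classes = tolerance_classes(docs, theta=2)
vec = smoothed_vector(["rough", "approximation"], classes)
```

In this toy corpus, "set" is absent from the query document but falls in its upper approximation because it co-occurs with "rough" in two documents, so it receives the boundary weight, while unrelated absent words such as "text" receive the outside weight.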


2015 ◽  
Vol 10 (12) ◽  
pp. 195-206 ◽  
Author(s):  
Chunyong Yin ◽  
Jun Xiang ◽  
Hui Zhang ◽  
Zhichao Yin ◽  
Jin Wang
