An Improved Co-Similarity Measure for Document Clustering

Author(s):  
Syed Fawad Hussain ◽  
Gilles Bisson ◽  
Clement Grimal
2012 ◽  
Vol 24 (6) ◽  
pp. 1002-1013 ◽  
Author(s):  
Taiping Zhang ◽  
Yuan Yan Tang ◽  
Bin Fang ◽  
Yong Xiang

Author(s):  
YONGLI LIU ◽  
YUANXIN OUYANG ◽  
ZHANG XIONG

Document clustering is one of the most effective techniques for organizing documents in an unsupervised manner. This paper presents ICIB, an Incremental method for document Clustering based on Information Bottleneck theory. ICIB is designed to improve the accuracy and efficiency of document clustering and to resolve the problem that an arbitrary choice of document similarity measure can produce inaccurate clustering results. In our approach, document similarity is calculated using information bottleneck theory, and documents are grouped incrementally. The first document is selected at random and assigned to its own cluster; each remaining document is then processed in turn according to the mutual information loss that merging it with each existing cluster would introduce. If the minimum mutual information loss is below a given threshold, the document is added to its closest cluster; otherwise it forms a new cluster. Because this incremental process is low-precision and order-dependent, it cannot guarantee accurate clustering results on its own. Therefore, an improved sequential clustering algorithm (SIB) is proposed to adjust the intermediate clustering results. To test the effectiveness of the ICIB method, ten independent document subsets are constructed from the 20NewsGroup and Reuters-21578 corpora. Experimental results show that ICIB achieves higher accuracy and better time performance than the K-Means, AIB, and SIB algorithms.
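The incremental pass described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the exact loss formula, so the sketch assumes the merge cost commonly used in agglomerative information bottleneck clustering: the combined prior of the two clusters multiplied by their weighted Jensen-Shannon divergence. The function names, the uniform document prior, and the threshold value are all hypothetical choices made for illustration, and the SIB adjustment step is omitted.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn raw term counts into a distribution p(y|x); assumes a non-empty document."""
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def kl_divergence(p, q):
    """D(p || q) in bits; q must be positive wherever p is positive."""
    return sum(pv * math.log2(pv / q[t]) for t, pv in p.items() if pv > 0)


def merge_cost(w1, p1, w2, p2):
    """Mutual information lost by merging two clusters with priors w1 and w2.

    Assumed form (agglomerative IB): (w1 + w2) * weighted JS divergence.
    """
    total = w1 + w2
    a, b = w1 / total, w2 / total
    mix = {t: a * p1.get(t, 0.0) + b * p2.get(t, 0.0) for t in set(p1) | set(p2)}
    return total * (a * kl_divergence(p1, mix) + b * kl_divergence(p2, mix))


def incremental_cluster(docs, threshold):
    """Assign each document (a list of tokens) to a cluster, in the order given.

    Assumes a uniform prior p(x) = 1/n over documents. Shuffle `docs` first
    if a random starting document is wanted.
    """
    n = len(docs)
    clusters = []  # each cluster: {"sum": summed p(y|x) of its members, "size": int}
    labels = []
    for tokens in docs:
        p_doc = normalize(Counter(tokens))
        best, best_cost = None, math.inf
        for idx, c in enumerate(clusters):
            # p(y|c) is the average of the member distributions under a uniform prior.
            p_c = {t: v / c["size"] for t, v in c["sum"].items()}
            cost = merge_cost(1.0 / n, p_doc, c["size"] / n, p_c)
            if cost < best_cost:
                best, best_cost = idx, cost
        if best is not None and best_cost < threshold:
            # Loss is small enough: add the document to its closest cluster.
            clusters[best]["sum"].update(p_doc)
            clusters[best]["size"] += 1
        else:
            # Otherwise the document starts a new cluster.
            clusters.append({"sum": Counter(p_doc), "size": 1})
            best = len(clusters) - 1
        labels.append(best)
    return labels


docs = [["a", "a", "b"], ["a", "b", "a"], ["x", "y", "y"], ["y", "x", "y"]]
labels = incremental_cluster(docs, threshold=0.1)
```

Because each document is compared only against the clusters that exist when it arrives, the result depends on processing order, which is the weakness the abstract says the subsequent sequential adjustment is meant to correct.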

