Phrase based Clustering Scheme of Suffix Tree Document Clustering Model

2013 ◽  
Vol 63 (10) ◽  
pp. 30-37
Author(s):  
Anoop Kumar Jain ◽  
Satyam Maheshwari
Author(s):  
U. K. Sridevi ◽  
N. Nagaveni

Clustering helps find relevant content in a document collection and reduces the search space. Current clustering research emphasizes more efficient clustering methods without considering domain knowledge or the user's needs. In recent years, the semantics of documents have been utilized in document clustering. This work focuses on a clustering model in which an ontology approach is applied. The major challenge is to use background knowledge in the similarity measure. This paper presents an ontology-based document annotation and clustering system. A semi-automatic document annotation and concept weighting scheme is used to create an ontology-based knowledge base. The Particle Swarm Optimization (PSO) clustering algorithm can be applied to obtain the clustering solution. Clustering accuracy was computed before and after combining the ontology with the Vector Space Model (VSM). The proposed ontology-based framework gives better clustering performance than the traditional vector space model. The results obtained using the ontology were significant and promising.
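As a rough illustration of how background knowledge can enter the similarity measure, the sketch below extends a plain term-frequency vector with concept dimensions from a toy ontology and compares cosine similarity before and after. The `ONTOLOGY` mapping, the `concept_weight` parameter, and the sample documents are assumptions made for illustration; they are not taken from the paper, and the PSO clustering step is not shown.

```python
import math
from collections import Counter

# Hypothetical term -> concept mapping standing in for an ontology lookup.
ONTOLOGY = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "apple": "fruit",
    "banana": "fruit",
}

def term_vector(text):
    """Plain VSM representation: raw term frequencies."""
    return Counter(text.lower().split())

def ontology_vector(text, concept_weight=1.0):
    """Term frequencies plus weighted concept dimensions (assumed scheme)."""
    vec = Counter()
    for term, tf in term_vector(text).items():
        vec[term] += tf
        concept = ONTOLOGY.get(term)
        if concept is not None:
            # Terms sharing a concept now overlap on this extra dimension.
            vec["concept:" + concept] += concept_weight * tf
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as mappings."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = "red car"
doc_b = "red automobile"

sim_vsm = cosine(term_vector(doc_a), term_vector(doc_b))
sim_onto = cosine(ontology_vector(doc_a), ontology_vector(doc_b))
```

In this toy case the two documents share only the term "red" under the plain VSM, while the concept dimension lets "car" and "automobile" also count as overlap, so `sim_onto` comes out higher than `sim_vsm`.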


2008 ◽  
Vol 11 (2) ◽  
Author(s):  
Esteban Meneses ◽  
Oldemar Rodríguez-Rojas

Documents in HTML format have many features to analyze, from the terms in special sections to the phrases that appear throughout the document. However, it is important to decide which features contribute the most to separating documents by class. Given this information, a feature that is expensive to compute and does not contribute enough to the clustering process can be left out of the document representation. Using a novel representation model and the standard k-means algorithm, we found that terms in the body of the document contribute the most, followed by terms in other sections. Suffix trees contribute little in that scenario, while term order graphs influence the partition only slightly. We used four known datasets to support these conclusions.
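The idea of weighting per-section features and clustering them with standard k-means can be sketched as follows. The section names, the weights, the toy documents, and the fixed initial centroids are illustrative assumptions; the paper's actual representation model, which also covers suffix-tree phrases and term order graphs, is not reproduced here.

```python
# Assumed section weights; a weight of 0 drops that section's features.
SECTION_WEIGHTS = {"title": 0.5, "body": 1.0}

def build_vocab(docs):
    """Sorted list of (section, term) features seen in the collection."""
    return sorted({(section, term)
                   for doc in docs
                   for section, text in doc.items()
                   for term in text.lower().split()})

def vectorize(doc, vocab, weights=SECTION_WEIGHTS):
    """Weighted term-frequency vector over (section, term) features."""
    counts = {}
    for section, text in doc.items():
        for term in text.lower().split():
            key = (section, term)
            counts[key] = counts.get(key, 0) + 1
    return [weights.get(s, 0.0) * counts.get((s, t), 0) for s, t in vocab]

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, centroids, max_iter=20):
    """Standard Lloyd iterations starting from the given centroids."""
    labels = []
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda c: dist2(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels

docs = [
    {"title": "football scores", "body": "goal match team goal"},
    {"title": "match report", "body": "team goal league match"},
    {"title": "stock market", "body": "shares price market trade"},
    {"title": "market news", "body": "price shares bank trade"},
]
vocab = build_vocab(docs)
vectors = [vectorize(d, vocab) for d in docs]
# Deterministic start: first and third documents as initial centroids.
labels = kmeans(vectors, [list(vectors[0]), list(vectors[2])])
```

With these toy inputs the two sports pages and the two finance pages end up in separate clusters. Changing `SECTION_WEIGHTS` and re-running is one simple way to probe how much each section contributes to the partition.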

