Malay document clustering using complete linkage clustering technique with Cosine Coefficient

Author(s):  
Nurazzah Abd Rahman ◽  
Zainab Abu Bakar ◽  
Nurul Syeilla Syazhween Zulkefli
Author(s):  
M. Thangamani ◽  
P. Thangaraj

The increase in the number of documents has aggravated the difficulty of classifying those documents according to specific needs. Clustering analysis in a distributed environment is a thrust area in artificial intelligence and data mining. Its fundamental task is to utilize characters to compute the degree of related corresponding relationship between objects and to accomplish automatic classification without earlier knowledge. Document clustering utilizes clustering technique to gather the documents of high resemblance collectively by computing the documents resemblance. Recent studies have shown that ontologies are useful in improving the performance of document clustering. Ontology is concerned with the conceptualization of a domain into an individual identifiable format and machine-readable format containing entities, attributes, relationships, and axioms. By analyzing types of techniques for document clustering, a better clustering technique depending on Genetic Algorithm (GA) is determined. Non-Dominated Ranked Genetic Algorithm (NRGA) is used in this paper for clustering, which has the capability of providing a better classification result. The experiment is conducted in 20 newsgroups data set for evaluating the proposed technique. The result shows that the proposed approach is very effective in clustering the documents in the distributed environment.


Document clustering plays a central role in knowledge discovery and data mining by representing large data-sets into a certain number of data objects called clusters. Each cluster consists similar data objects in such a way that data objects in the same cluster are more similar and dissimilar to the data objects of other clusters. Document clustering technique for Gurmukhi script consists two phases namely: 1) Pre-processing phase 2) Processing phase. This paper concentrates pre-processing phase of document clustering technique for Gurmukhi script. The purpose of pre-processing phase is to convert unstructured text into structured text format. Various sub-phases of pre-processing phase are: segmentation, tokenization, removal of stop words, stemming, and normalization. The purpose of this paper is to present the significant role of pre-processing phase in an overall performance of document clustering technique for Gurmukhi script. The experimental results represent the significant role of pre-processing phase in terms of performance regarding assignment of data objects to the relevant clusters as well as in creation of meaningful cluster title list. .


2011 ◽  
Vol 7 (4) ◽  
pp. 26-46 ◽  
Author(s):  
M. Thangamani ◽  
P. Thangaraj

The increase in the number of documents has aggravated the difficulty of classifying those documents according to specific needs. Clustering analysis in a distributed environment is a thrust area in artificial intelligence and data mining. Its fundamental task is to utilize characters to compute the degree of related corresponding relationship between objects and to accomplish automatic classification without earlier knowledge. Document clustering utilizes clustering technique to gather the documents of high resemblance collectively by computing the documents resemblance. Recent studies have shown that ontologies are useful in improving the performance of document clustering. Ontology is concerned with the conceptualization of a domain into an individual identifiable format and machine-readable format containing entities, attributes, relationships, and axioms. By analyzing types of techniques for document clustering, a better clustering technique depending on Genetic Algorithm (GA) is determined. Non-Dominated Ranked Genetic Algorithm (NRGA) is used in this paper for clustering, which has the capability of providing a better classification result. The experiment is conducted in 20 newsgroups data set for evaluating the proposed technique. The result shows that the proposed approach is very effective in clustering the documents in the distributed environment.


2009 ◽  
Vol 30 (10) ◽  
pp. 870-876 ◽  
Author(s):  
Moreno Carullo ◽  
Elisabetta Binaghi ◽  
Ignazio Gallo

Sign in / Sign up

Export Citation Format

Share Document