Malay document clustering using complete linkage clustering technique with Cosine Coefficient

Document Clustering Technique by K-means Algorithm and PCA

The Journal of the Korean Institute of Information and Communication Engineering, 2014, Vol 18 (3), pp. 625-630

DOI: 10.6109/jkiice.2014.18.3.625

Cited By ~ 1

Author(s): Woosaeng Kim, Sooyoung Kim

Keyword(s): Document Clustering, Clustering Technique

Role of Pre-processing Phase in Document Clustering Technique for Gurmukhi Script

International Journal of Innovative Technology and Exploring Engineering - Special Issue, 2020, Vol 9 (3), pp. 3216-3220

DOI: 10.35940/ijitee.c9105.019320

Keyword(s): Significant Role, Document Clustering, Large Data, Data Sets, Similar Data, Clustering Technique, Two Phases, Data Objects, Gurmukhi Script

Document clustering plays a central role in knowledge discovery and data mining by organizing large data sets into a certain number of groups of data objects called clusters. Each cluster consists of similar data objects, such that data objects in the same cluster are more similar to one another and dissimilar to the data objects of other clusters. The document clustering technique for Gurmukhi script consists of two phases: 1) the pre-processing phase and 2) the processing phase. This paper concentrates on the pre-processing phase. The purpose of the pre-processing phase is to convert unstructured text into a structured text format. Its sub-phases are segmentation, tokenization, removal of stop words, stemming, and normalization. The purpose of this paper is to present the significant role of the pre-processing phase in the overall performance of the document clustering technique for Gurmukhi script. The experimental results demonstrate the significant role of the pre-processing phase in terms of performance, both in assigning data objects to the relevant clusters and in creating a meaningful cluster title list.
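The five pre-processing sub-phases named in the abstract can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the paper's implementation: the stop-word list, the suffix rules, and all function names are hypothetical, and the sample text is English because the abstract gives no Gurmukhi resources.

```python
import re
import unicodedata

# Hypothetical, English-only resources for illustration. A Gurmukhi system
# would need its own stop-word list, stemming rules, and tokenizer.
STOP_WORDS = {"the", "a", "of", "is", "are", "and", "in"}
SUFFIXES = ("ing", "ed", "s")


def segment(text):
    """Segmentation: split raw text into sentences on ., !, ? or the danda."""
    return [s.strip() for s in re.split(r"[.!?।]+", text) if s.strip()]


def tokenize(sentence):
    """Tokenization: split a sentence into runs of word characters."""
    return re.findall(r"\w+", sentence)


def remove_stop_words(tokens):
    """Stop-word removal: drop tokens found in the stop-word list."""
    return [t for t in tokens if t.casefold() not in STOP_WORDS]


def stem(token):
    """Stemming: strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(token):
    """Normalization: canonical Unicode form (NFC) and case folding."""
    return unicodedata.normalize("NFC", token).casefold()


def preprocess(text):
    """Turn unstructured text into a list of token lists, one per sentence."""
    structured = []
    for sentence in segment(text):
        tokens = remove_stop_words(tokenize(sentence))
        structured.append([normalize(stem(t)) for t in tokens])
    return structured


if __name__ == "__main__":
    print(preprocess("The cats are sleeping. Clustering groups documents!"))
```

The output is one token list per sentence, which is the kind of structured form a later processing phase could consume. The steps run in the order the abstract lists them; a real system might normalize earlier.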

Effective Fuzzy Ontology Based Distributed Document Using Non-Dominated Ranked Genetic Algorithm

International Journal of Intelligent Information Technologies, 2011, Vol 7 (4), pp. 26-46

DOI: 10.4018/jiit.2011100102

Cited By ~ 4

Author(s): M. Thangamani, P. Thangaraj

Keyword(s): Genetic Algorithm, Clustering Analysis, Document Clustering, Distributed Environment, Classification Result, Data Set, Clustering Technique, Machine Readable, Readable Format, Machine Readable Format

The growing number of documents has made it harder to classify them according to specific needs. Clustering analysis in a distributed environment is an active research area in artificial intelligence and data mining. Its fundamental task is to use the characteristics of objects to compute the degree of relatedness between them and to classify them automatically without prior knowledge. Document clustering applies a clustering technique to group highly similar documents together by computing the similarity between documents. Recent studies have shown that ontologies are useful in improving the performance of document clustering. An ontology conceptualizes a domain in an individually identifiable, machine-readable format containing entities, attributes, relationships, and axioms. By analyzing the types of techniques available for document clustering, a better clustering technique based on the Genetic Algorithm (GA) is determined. The Non-Dominated Ranked Genetic Algorithm (NRGA), which is capable of providing a better classification result, is used in this paper for clustering. The experiment is conducted on the 20 Newsgroups data set to evaluate the proposed technique. The results show that the proposed approach is very effective in clustering documents in the distributed environment.
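The general idea of grouping documents by computed similarity can be illustrated with a minimal sketch. It is not the NRGA-based or ontology-based method of the paper: it swaps in plain term-frequency vectors, cosine similarity, and a simple single-pass "leader" grouping. The function names, the threshold value, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def leader_cluster(docs, threshold=0.5):
    """Single-pass grouping: a document joins the first cluster whose
    leader (first member) is at least `threshold` similar, else it
    starts a new cluster. Returns lists of document indices."""
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    docs = [
        "genetic algorithm for clustering",
        "clustering with a genetic algorithm",
        "stock market news today",
        "market news and stock prices",
    ]
    print(leader_cluster(docs))  # prints [[0, 1], [2, 3]]
```

The result depends on document order and on the threshold, which is one reason more elaborate search strategies such as genetic algorithms are studied for this task.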
