Incorporating Biological Domain Knowledge into Cluster Validity Assessment

Author(s):  
Nadia Bolshakova ◽  
Francisco Azuaje ◽  
Pádraig Cunningham
2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Lopamudra Dey ◽  
Sanjay Chakraborty

Clustering is applied across many fields. Because clustering is an unsupervised process in data mining, properly evaluating its results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measurement. Different indices suit different problems, and the choice of index depends on the kind of data available. This paper first proposes a Canonical PSO based K-means clustering algorithm, then analyses some important clustering indices (intercluster, intracluster) and evaluates their effects on a real-time air pollution database and on the wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and hierarchical clustering algorithms. The paper also describes the nature of the clusters and compares the performance of these clustering algorithms according to the validity assessment. It identifies which of these algorithms is most desirable for producing compact clusters on these particular real-life datasets. It examines the behaviour of these clustering algorithms with respect to the validation indices and presents the evaluation results in mathematical and graphical form.
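As a rough illustration of how an intracluster measure (compactness) and an intercluster measure (separation) can be combined into one validity index, here is a minimal sketch of the Dunn index. The abstract does not say which indices the paper uses, so the choice of the Dunn index, the function names, and the toy points are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def max_intracluster_diameter(clusters):
    """Largest pairwise distance inside any single cluster (compactness)."""
    return max(
        (dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )


def min_intercluster_distance(clusters):
    """Smallest distance between points of two different clusters (separation)."""
    return min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )


def dunn_index(clusters):
    """Dunn index: separation divided by compactness; higher is better."""
    return min_intercluster_distance(clusters) / max_intracluster_diameter(clusters)


# Hypothetical toy data: two tight, well-separated groups.
good = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
# The same four points, partitioned badly.
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]

print("good partition:", dunn_index(good))
print("bad partition:", dunn_index(bad))
```

The compact, well-separated partition scores higher than the poor one, which is the behaviour a validity index is meant to capture when comparing the output of different clustering algorithms.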


2020 ◽  
Vol 25 (6) ◽  
pp. 755-769
Author(s):  
Noorullah R. Mohammed ◽  
Moulana Mohammed

Text data clustering organizes a set of text documents into the desired number of coherent and meaningful sub-clusters. Modeling the text documents in terms of derived topics is a vital task in text data clustering. Each tweet is treated as a text document, and various topic models are used to model the tweets. In existing topic models, the clustering tendency of tweets is initially assessed using Euclidean dissimilarity features. The cosine metric is better suited to an informative assessment, especially for text clustering. This paper therefore develops a novel cosine based external and internal validity assessment of cluster tendency to improve the computational efficiency of tweet data clustering. In the experiments, the tweet clustering results are evaluated using cluster validity indices. The experiments show that the cosine based internal and external validity metrics outperform the others on benchmark and Twitter-based datasets.
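The preference for cosine over Euclidean dissimilarity on text can be seen in a small sketch. The term-count vectors and the three-word vocabulary below are hypothetical, and this is not the paper's method, only an illustration of how the two measures can rank the same documents differently.

```python
from math import sqrt


def cosine_dissimilarity(u, v):
    """1 - cosine similarity; ignores vector length, compares direction only."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean(u, v):
    """Straight-line distance; sensitive to document length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical term counts over the vocabulary ("cluster", "index", "football").
short_tweet = (1, 1, 0)
long_tweet = (4, 4, 0)   # same topic mix as short_tweet, four times the words
other_tweet = (0, 1, 1)  # shares only one term with short_tweet

for name, other in (("long_tweet", long_tweet), ("other_tweet", other_tweet)):
    print(
        name,
        "cosine:", round(cosine_dissimilarity(short_tweet, other), 3),
        "euclidean:", round(euclidean(short_tweet, other), 3),
    )
```

Under Euclidean distance the short tweet looks closer to the off-topic tweet than to the longer tweet on the same topic, simply because of document length. Cosine dissimilarity reverses that ranking.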


2005 ◽  
Vol 26 (15) ◽  
pp. 2353-2363 ◽  
Author(s):  
Minho Kim ◽  
R.S. Ramakrishna

2004 ◽  
Vol 21 (4) ◽  
pp. 451-455 ◽  
Author(s):  
N. Bolshakova ◽  
F. Azuaje ◽  
P. Cunningham

2019 ◽  
Vol 12 (2) ◽  
pp. 103
Author(s):  
Udoinyang G. Inyang ◽  
Uduak A. Umoh ◽  
Ifeoma C. Nnaemeka ◽  
Samuel A. Robinson

The large size of student datasets has made it difficult to find patterns associated with students’ academic performance (AP) using conventional methods. This has increased the rate of drop-outs, graduands with a weak class of degree (CoD), and students who spend more than the minimum stipulated duration of studies. It is necessary to determine students’ AP using educational data mining (EDM) tools in order to identify, at an early stage of their studies, students who are likely to perform poorly. This paper explores k-means and the self-organizing map (SOM) for mining knowledge about the natural number of clusters in a student dataset and the association of the input features, using selected demographic, pre-admission and first-year performance attributes. Matlab 2015a was the programming environment, and the dataset consists of nine sets of computer science graduands. Cluster validity assessment with k-means discovered four (4) clusters, with the correlation metric yielding the highest mean silhouette value of 0.5912. SOM provided a hexagonal grid visual of feature component planes and scatter plots of each significant input attribute. The results show that the significant attributes were highly correlated with each other except entry mode (EM), indicating that the impact of EM on CoD varies with students irrespective of mode of admission. SOM also discovered four distinct clusters in the dataset: 7.7% belonging to cluster 1 (first class) and 25% to cluster 2 (2nd class Upper), while clusters 3 and 4 had a 35% proportion each. This validates the results of k-means, further confirms the importance of early detection of students’ AP, and confirms the effectiveness of SOM as a cluster validity tool. As further work, the labels from SOM will be associated with records in the dataset for association rule mining, supervised learning and prediction of students’ AP.
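The mean silhouette value under a correlation metric, as reported above, can be sketched in a few lines. The study itself used Matlab 2015a; the Python below is only an assumed, simplified equivalent, and the profile vectors, labels, and function names are hypothetical.

```python
from statistics import mean


def correlation_distance(u, v):
    """1 - Pearson correlation between two profiles (assumes neither is constant)."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return 1.0 - cov / (su * sv)


def mean_silhouette(points, labels, distance):
    """Average of s(i) = (b - a) / max(a, b) over all points."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(distance(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            # Singleton cluster, or no other cluster: score 0 (a convention assumed here).
            scores.append(0.0)
            continue
        a = mean(own)                                  # mean distance within own cluster
        b = min(mean(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


# Hypothetical score profiles: three rising and three falling.
profiles = [
    (1, 2, 3, 4), (2, 4, 6, 9), (1, 3, 4, 6),
    (4, 3, 2, 1), (9, 6, 4, 2), (6, 4, 3, 1),
]
by_trend = [0, 0, 0, 1, 1, 1]   # groups profiles by shape
alternating = [0, 1, 0, 1, 0, 1]  # mixes rising and falling profiles

print("by trend:", mean_silhouette(profiles, by_trend, correlation_distance))
print("alternating:", mean_silhouette(profiles, alternating, correlation_distance))
```

To estimate the natural number of clusters in the way the abstract describes, one would compute this mean silhouette for each candidate k and keep the k with the highest value.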

