Evaluation of h- and g-indices of Scientific Authors using Modified K-Means Clustering Algorithm

In this paper, we propose a modified K-means algorithm to assess the performance of scientific authors using their h- and g-index values. K-means suffers from poor computational scaling and efficiency, as the number of clusters has to be supplied by the user. Hence, in this work, we introduce a modification of the K-means algorithm that efficiently searches the data to cluster points by computing the sum of squares within each cluster, which lets the program select the most promising subset of classes for clustering. The proposed algorithm was tested on the IRIS and ZOO data sets as well as on our local data set comprising h- and g-indices, which are prominent markers of the scientific excellence of authors publishing papers in various national and international journals. The results of the analysis reveal that the modified K-means algorithm is much faster and outperforms the conventional algorithm in terms of clustering performance, measured by the data discrepancy factor.
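
For readers unfamiliar with the two indices named in this abstract, the following is a minimal sketch of their usual definitions. It is not taken from the paper: the function names and the sample citation counts are illustrative assumptions, and this variant caps the g-index at the number of papers.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations.

    Assumption: g is capped at the number of papers in the list.
    """
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, cites in enumerate(ranked, start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g


# Hypothetical citation counts for one author.
sample = [10, 8, 5, 4, 3]
pair = (h_index(sample), g_index(sample))  # one (h, g) point that a clustering step could consume
```

Each author then becomes a two-dimensional point (h, g), which is the kind of input the clustering step described above would operate on.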

Self-Adaptive K-Means Based on a Covering Algorithm

Complexity ◽ 10.1155/2018/7698274 ◽ 2018 ◽ Vol 2018 ◽ pp. 1-16 ◽ Cited By ~ 1

Author(s): Yiwen Zhang ◽ Yuanyuan Zhou ◽ Xing Guo ◽ Jintao Wu ◽ Qiang He ◽ ...

Keyword(s): Large Scale ◽ Clustering Algorithm ◽ Real Data ◽ Second Phase ◽ Data Sets ◽ Number Of Clusters ◽ Large Scale Data ◽ Long Time ◽ Two Phases ◽ Selection Of

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only produce efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization by the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA. The CA self-organizes and recognizes the number of clusters k based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of the CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the accuracy and efficiency of the C-K-means algorithm exceed those of the existing algorithms under both sequential and parallel conditions.
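
The second phase described above is the standard Lloyd iteration. A minimal sketch of that iteration follows, assuming the initial centers are already available; the covering algorithm of the first phase is not reproduced here, and the function name `lloyd` and its parameters are illustrative.

```python
import math


def lloyd(points, centers, max_iter=100):
    """Run Lloyd iterations from the given initial centers.

    Returns (centers, labels). The initial centers stand in for the output
    of the first phase, which this sketch does not implement.
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dim = len(old)
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels


# Hypothetical data: two well-separated groups and one starting center in each.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
final_centers, final_labels = lloyd(pts, [(0, 0), (10, 10)])
```

Because the number of centers passed in fixes k, whatever procedure supplies those centers also determines the number of clusters.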

A dynamic K-means clustering for data mining

Indonesian Journal of Electrical Engineering and Computer Science ◽ 10.11591/ijeecs.v13.i2.pp521-526 ◽ 2019 ◽ Vol 13 (2) ◽ pp. 521

Author(s): Md. Zakir Hossain ◽ Md. Nasim Akhtar ◽ R.B. Ahmad ◽ Mostafijur Rahman

Keyword(s): Data Mining ◽ Clustering Algorithm ◽ Large Data ◽ Threshold Value ◽ Specific Pattern ◽ Large Data Sets ◽ Data Sets ◽ Data Set ◽ Number Of Clusters ◽ Data Points

Data mining is the process of finding structure in large data sets. With this process, decision makers can make a particular decision for the further development of real-world problems. Several data clustering techniques are used in data mining to find a specific pattern in data. The K-means method is one of the familiar techniques for clustering large data sets. The K-means clustering method partitions the data set based on the assumption that the number of clusters is fixed. The main problem of this method is that if the number of clusters is chosen to be small, there is a higher probability of adding dissimilar items to the same group. On the other hand, if the number of clusters is chosen to be high, there is a higher chance of placing similar items in different groups. In this paper, we address this issue by proposing a new K-means clustering algorithm. The proposed method performs data clustering dynamically. It initially calculates a threshold value as a centroid of K-means, and the clusters are formed based on this value. At each iteration of K-means, if the Euclidean distance between two points is less than or equal to the threshold value, these two data points are placed in the same group. Otherwise, the proposed method creates a new cluster with the dissimilar data point. The results show that the proposed method outperforms the original K-means method.
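
The threshold rule in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the threshold is calculated, so it is passed in as a parameter, and each point is compared with the running centroid of every existing cluster rather than with individual points. The function name `threshold_clusters` is hypothetical.

```python
import math


def threshold_clusters(points, threshold):
    """Group points so that each joins the nearest cluster within `threshold`.

    A point farther than `threshold` from every existing centroid starts a
    new cluster, so the number of clusters is not fixed in advance.
    """
    centroids = []  # running mean of each cluster
    members = []    # points assigned to each cluster
    for p in points:
        best, best_dist = None, None
        for j, c in enumerate(centroids):
            d = math.dist(p, c)
            if best_dist is None or d < best_dist:
                best, best_dist = j, d
        if best is not None and best_dist <= threshold:
            # Similar enough: join the cluster and refresh its centroid.
            members[best].append(p)
            n = len(members[best])
            centroids[best] = tuple(
                sum(m[k] for m in members[best]) / n for k in range(len(p))
            )
        else:
            # Dissimilar point: open a new cluster around it.
            centroids.append(tuple(p))
            members.append([p])
    return members


# Hypothetical data and threshold.
groups = threshold_clusters([(0, 0), (1, 0), (10, 0), (11, 0)], threshold=2)
```

With this rule the result depends on the order in which points are visited, which is one reason the choice of threshold matters.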

An Improvement of K-Medoids Clustering Algorithm Based on Fixed Point Iteration

International Journal of Data Warehousing and Mining ◽ 10.4018/ijdwm.2020100105 ◽ 2020 ◽ Vol 16 (4) ◽ pp. 84-94

Author(s): Xiaodi Huang ◽ Minglun Ren ◽ Zhongfeng Hu

Keyword(s): Fixed Point ◽ Large Scale ◽ Clustering Algorithm ◽ Number Of Clusters ◽ Fixed Point Iteration ◽ Computational Overhead ◽ Partitioning Around Medoids ◽ Clustering Quality ◽ Conventional Algorithm

The K-medoids algorithm first selects data points at random as initial centers to form initial clusters. Then, following the PAM (partitioning around medoids) algorithm, the centers are sequentially replaced by all the remaining data points to find the result with the best inherent convergence. Since the PAM algorithm is an iterative ergodic strategy, its expensive computational overhead hinders its feasibility when the data size or the number of clusters is huge. The authors use fixed-point iteration to search for the optimal clustering centers and build an FPK-medoids (fixed point-based K-medoids) algorithm. By constructing fixed-point equations for each cluster, the problem of searching for optimal centers is converted into solving a set of equations in parallel. The experiment is carried out on six standard datasets, and the results show that the clustering efficiency of the proposed algorithm is significantly improved compared with the conventional algorithm. In addition, the clustering quality is markedly enhanced when handling problems with large-scale datasets or a large number of clusters.
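
To make the cost of the conventional approach concrete, here is a minimal sketch of a greedy PAM-style swap search. It illustrates the baseline that the abstract describes, not the FPK-medoids method, whose fixed-point equations are not given here. The function names and the sample data are assumptions.

```python
import math


def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def pam(points, medoids):
    """Greedy swap search starting from the given medoid indices.

    Every (medoid, non-medoid) swap is tried, and a swap is kept whenever it
    lowers the total cost. The search stops when no swap helps.
    """
    medoids = list(medoids)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:
                    medoids, best = trial, cost
                    improved = True
    return medoids, best


# Hypothetical data: two groups on a line, with a deliberately poor start.
data = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
found, cost = pam(data, [0, 1])
```

Each pass evaluates on the order of k times n candidate swaps, and each evaluation scans all n points against k medoids, which is the overhead the abstract refers to for large n or k.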

Rough ISODATA Algorithm

International Journal of Fuzzy System Applications ◽ 10.4018/ijfsa.2013100101 ◽ 2013 ◽ Vol 3 (4) ◽ pp. 1-14 ◽ Cited By ~ 2

Author(s): S. Sampath ◽ B. Ramya

Keyword(s): Clustering Algorithm ◽ Clustering Algorithms ◽ Real Life ◽ Vital Role ◽ Data Sets ◽ Clustering Method ◽ Data Set ◽ Number Of Clusters ◽ Real Life Data ◽ Nonparametric Statistical

Cluster analysis is a branch of data mining that plays a vital role in bringing out hidden information in databases. Clustering algorithms help medical researchers identify the presence of natural subgroups in a data set. Different types of clustering algorithms are available in the literature. The most popular among them is k-means clustering. Even though k-means clustering is a popular and widely used clustering method, its application requires knowledge of the number of clusters present in the given data set. Several solutions are available in the literature to overcome this limitation. The k-means clustering method creates a disjoint and exhaustive partition of the data set. However, in some situations one can come across objects that belong to more than one cluster. In this paper, a clustering algorithm is proposed that is capable of producing rough clusters automatically, without requiring the user to supply the number of clusters as input. The efficiency of the algorithm in detecting the number of clusters present in the data set has been studied with the help of some real-life data sets. Further, a nonparametric statistical analysis of the results of the experimental study has been carried out, using a rough version of the Davies-Bouldin index, in order to analyze the efficiency of the proposed algorithm in automatically detecting the number of clusters in the data set.
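
The abstract evaluates results with a rough version of the Davies-Bouldin index, whose definition is not given here. As a point of reference only, the sketch below computes the standard (crisp) Davies-Bouldin index, in which every object belongs to exactly one cluster. The function names and the sample partitions are assumptions, and the code expects at least two clusters with distinct centroids.

```python
import math


def centroid(cluster):
    """Coordinate-wise mean of the points in one cluster."""
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))


def davies_bouldin(clusters):
    """Standard (crisp) Davies-Bouldin index; lower values mean better separation."""
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each cluster's points to its centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    # For each cluster, take its worst (largest) similarity to any other cluster.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


# Hypothetical partitions of the same four points.
good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
poor = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
scores = (davies_bouldin(good), davies_bouldin(poor))
```

Computing the index for several candidate numbers of clusters and keeping the lowest value is one common way to use it for choosing that number.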

Stability-Based Validation of Clustering Solutions

Neural Computation ◽ 10.1162/089976604773717621 ◽ 2004 ◽ Vol 16 (6) ◽ pp. 1299-1323 ◽ Cited By ~ 248

Author(s): Tilman Lange ◽ Volker Roth ◽ Mikio L. Braun ◽ Joachim M. Buhmann

Keyword(s): Clustering Algorithm ◽ Group Structure ◽ Data Sets ◽ Expression Data ◽ Number Of Clusters ◽ Natural Group ◽ Exploratory Data ◽ Class Labels ◽ Validation Tool ◽ Real World Problems

Data clustering describes a set of frequently employed techniques in exploratory data analysis to extract “natural” group structure in data. Such groupings need to be validated to separate the signal in the data from spurious structure. In this context, finding an appropriate number of clusters is a particularly important model selection question. We introduce a measure of cluster stability to assess the validity of a cluster model. This stability measure quantifies the reproducibility of clustering solutions on a second sample, and it can be interpreted as a classification risk with regard to class labels produced by a clustering algorithm. The preferred number of clusters is determined by minimizing this classification risk as a function of the number of clusters. Convincing results are achieved on simulated as well as gene expression data sets. Comparisons to other methods demonstrate the competitive performance of our method and its suitability as a general validation tool for clustering solutions in real-world problems.

Determination of the Number of Clusters in a Data Set

Management Theories and Strategic Practices for Decision Making ◽ 10.4018/978-1-4666-2473-3.ch004 ◽ 2012 ◽ pp. 59-73

Author(s): Derrick S. Boone

Keyword(s): Clustering Algorithm ◽ Clustering Algorithms ◽ Monte Carlo Study ◽ Stopping Rules ◽ Data Sets ◽ Data Set ◽ Number Of Clusters ◽ True Number ◽ Clustering Criterion

The accuracy of “stopping rules” for determining the number of clusters in a data set is examined as a function of the underlying clustering algorithm being used. Using a Monte Carlo study, various stopping rules, used in conjunction with six clustering algorithms, are compared to determine which rule/algorithm combinations best recover the true number of clusters. The rules and algorithms are tested using disparately sized, artificially generated data sets that contained multiple numbers and levels of clusters, variables, noise, outliers, and elongated and unequally sized clusters. The results indicate that stopping rule accuracy depends on the underlying clustering algorithm being used. The cubic clustering criterion (CCC), when used in conjunction with mixture models or Ward’s method, recovers the true number of clusters more accurately than other rules and algorithms. However, the CCC was more likely than other stopping rules to report more clusters than are actually present. Implications are discussed.

AK-means: an automatic clustering algorithm based on K-means

Journal of Advanced Computer Science & Technology ◽ 10.14419/jacst.v4i2.4749 ◽ 2015 ◽ Vol 4 (2) ◽ pp. 231 ◽ Cited By ~ 1

Author(s): Omar Kettani ◽ Faical Ramdani ◽ Benaissa Tadili

Keyword(s): Data Mining ◽ Fast Algorithm ◽ Clustering Algorithm ◽ Data Sets ◽ Number Of Clusters ◽ Correct Number ◽ Standard Data ◽ Exact Number ◽ Automatic Clustering ◽ Clustering Problems

In data mining, K-means is a simple and fast algorithm for solving clustering problems, but it requires the user to provide the exact number of clusters (k) in advance, which is often not obvious. This paper therefore aims to overcome this problem by proposing a parameter-free algorithm for automatic clustering. It is based on successive, adequate restarts of the K-means algorithm. Experiments conducted on several standard data sets demonstrate that the proposed approach is effective and outperforms the related, well-known G-means algorithm in terms of clustering accuracy and estimation of the correct number of clusters.

Spatial Modification in the Parameters of Mountain Image Clustering Algorithm

Al-Nahrain Journal for Engineering Sciences ◽ 10.29194/njes.22010055 ◽ 2019 ◽ Vol 22 (1) ◽ pp. 55-58

Author(s): Nahla Ibraheem Jabbar

Keyword(s): Clustering Algorithm ◽ Spatial Information ◽ Clustering Algorithms ◽ Large Data ◽ Large Data Sets ◽ Image Clustering ◽ Optimum Number ◽ Data Sets ◽ Number Of Clusters ◽ Pixel Value

Our proposed method is used to overcome the drawbacks of computing parameter values in the mountain algorithm for image clustering. All existing clustering algorithms require parameter values to start the clustering process, and these algorithms have a major problem in computing those parameters. One of the well-known clustering algorithms is the mountain algorithm, which gives the expected number of clusters. In this paper, we present a new modification of mountain clustering called Spatial Modification in the Parameters of Mountain Image Clustering Algorithm. This modification uses the spatial information of the image by taking a window mask around each center pixel value to compute the distance between the pixel and its neighborhood, in order to estimate the values of the parameters σ and β that give a potential optimum number of clusters required in the image segmentation process. Our experiments show the ability of the proposed algorithm to perform brain image segmentation with good quality on large data sets.

Selection of K in K-means clustering

Proceedings of the Institution of Mechanical Engineers Part C Journal of Mechanical Engineering Science ◽ 10.1243/095440605x8298 ◽ 2005 ◽ Vol 219 (1) ◽ pp. 103-119 ◽ Cited By ~ 199

Author(s): D T Pham ◽ S S Dimov ◽ C D Nguyen

Keyword(s): Data Clustering ◽ Clustering Algorithm ◽ Data Sets ◽ Number Of Clusters ◽ Selection Of

The K-means algorithm is a popular data-clustering algorithm. However, one of its drawbacks is the requirement for the number of clusters, K, to be specified before the algorithm is applied. This paper first reviews existing methods for selecting the number of clusters for the algorithm. Factors that affect this selection are then discussed and a new measure to assist the selection is proposed. The paper concludes with an analysis of the results of using the proposed measure to determine the number of clusters for the K-means algorithm for different data sets.

An Extensional Clustering Algorithm of FCM Based on Intuitionistic Extension Index

Advanced Materials Research ◽ 10.4028/www.scientific.net/amr.490-495.1372 ◽ 2012 ◽ Vol 490-495 ◽ pp. 1372-1376

Author(s): Qing Feng Liu

Keyword(s): Iterative Algorithm ◽ Clustering Algorithm ◽ Experimental Results ◽ Data Sets ◽ Number Of Clusters ◽ Fcm Algorithm ◽ Object A ◽ Benchmark Data ◽ Degree Of Membership ◽ Fuzzy C Means Algorithm

The fuzzy C-means algorithm is an iterative algorithm in which the desired number of clusters C and the initial clustering seeds have to be predefined. The seeds are modified in each stage of the algorithm, and for each object a degree of membership in each of the clusters is estimated. In this paper, an extensional clustering algorithm of FCM based on an intuitionistic extension index, denoted the E-FCM algorithm, is proposed. Experimental results comparing the two algorithms on three benchmark data sets show that the E-FCM algorithm outperforms the FCM algorithm.
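
One stage of the conventional fuzzy C-means iteration described above can be sketched as follows. This covers plain FCM only, not the E-FCM variant, whose intuitionistic extension index is not defined in the abstract. The function names, the fuzzifier default m = 2, and the sample data are assumptions.

```python
import math


def update_memberships(points, centers, m=2.0):
    """Return u[i][j], the degree of membership of point i in cluster j."""
    power = 2.0 / (m - 1.0)
    u = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # A point sitting exactly on a seed belongs fully to that seed.
            hit = dists.index(0.0)
            u.append([1.0 if j == hit else 0.0 for j in range(len(centers))])
            continue
        u.append([
            1.0 / sum((dj / dk) ** power for dk in dists)
            for dj in dists
        ])
    return u


def update_centers(points, u, m=2.0):
    """Return the seeds as membership-weighted means of the points."""
    dim = len(points[0])
    centers = []
    for j in range(len(u[0])):
        weights = [row[j] ** m for row in u]
        total = sum(weights)
        centers.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ))
    return centers


# Hypothetical data with C = 2 predefined seeds.
pts = [(0, 0), (1, 0), (9, 0), (10, 0)]
seeds = [(0.5, 0), (9.5, 0)]
memberships = update_memberships(pts, seeds)
new_seeds = update_centers(pts, memberships)
```

Alternating these two updates until the memberships stop changing gives the usual FCM loop; both C and the starting seeds must be chosen beforehand, which is the limitation the abstract points out.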
