Automatic Genetic Fuzzy c-Means

2018 ◽  
Vol 29 (1) ◽  
pp. 529-539
Author(s):  
Khalid Jebari ◽  
Abdelaziz Elmoujahid ◽  
Aziz Ettouhami

Fuzzy c-means is an efficient algorithm that is widely used for data clustering. Nonetheless, when using this algorithm, the designer faces two crucial choices: choosing the optimal number of clusters and initializing the cluster centers. The two choices have a direct impact on the clustering outcome. This paper presents an improved algorithm called automatic genetic fuzzy c-means that evolves the number of clusters and provides the initial centroids. The proposed algorithm uses a genetic algorithm with a new crossover operator, a new mutation operator, and modified tournament selection; further, it defines a new fitness function based on three cluster validity indices. Real data sets are used to demonstrate the effectiveness, in terms of quality, of the proposed algorithm.
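To make the two choices named in this abstract concrete, here is a minimal sketch of plain fuzzy c-means in pure Python. It is the textbook algorithm, not the authors' genetic variant: the function name `fuzzy_c_means`, the fuzzifier default `m=2.0`, and the stopping tolerance are illustrative assumptions. The number of clusters and the initial centers both enter through the `centers` argument, which is exactly what the paper proposes to evolve automatically.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, iters=100, tol=1e-9):
    """Textbook fuzzy c-means (illustrative sketch, not the paper's method).

    points  : list of equal-length numeric tuples
    centers : initial cluster centers; len(centers) fixes the cluster count
    m       : fuzzifier (> 1); 2.0 is a common default, assumed here
    """
    centers = [list(c) for c in centers]
    dim = len(points[0])
    u = []
    for _ in range(iters):
        # Membership update: u[i][j] is proportional to d(x_i, v_j)^(-2/(m-1)).
        u = []
        for x in points:
            # Clamp distances so a point sitting on a center does not divide by zero.
            d = [max(math.dist(x, v), 1e-12) for v in centers]
            w = [dj ** (-2.0 / (m - 1.0)) for dj in d]
            total = sum(w)
            u.append([wj / total for wj in w])

        # Center update: mean of the points weighted by u[i][j]^m.
        new_centers = []
        for j in range(len(centers)):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            new_centers.append(
                [sum(wt * x[k] for wt, x in zip(weights, points)) / total
                 for k in range(dim)]
            )

        # Stop once no center moves more than `tol`.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, u
```

Calling `fuzzy_c_means(points, [(0.0, 0.0), (10.0, 10.0)])` runs the algorithm with two clusters; a different initial list can lead to a different partition, which is the sensitivity the abstract describes.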

2014 ◽  
Vol 37 (1) ◽  
pp. 141-157 ◽  
Author(s):  
Mariusz Łapczyński ◽  
Bartłomiej Jefmański

Making more accurate marketing decisions requires managers to build effective predictive models. Typically, these models specify the probability of a customer belonging to a particular category, group or segment. The analytical CRM categories refer to customers interested in starting cooperation with the company (acquisition models), customers who purchase additional products (cross- and up-sell models) or customers intending to resign from the cooperation (churn models). When building predictive models, researchers use analytical tools from various disciplines with an emphasis on their best performance. This article attempts to build a hybrid predictive model combining decision trees (C&RT algorithm) and cluster analysis (k-means). During the experiments, five different cluster validity indices and eight datasets were used. The performance of the models was evaluated using popular measures such as accuracy, precision, recall, G-mean, F-measure and lift in the first and in the second decile. The authors tried to find a connection between the number of clusters and the models' quality.


2021 ◽  
pp. 1-16
Author(s):  
Aikaterini Karanikola ◽  
Charalampos M. Liapis ◽  
Sotiris Kotsiantis

In short, clustering is the process of partitioning a given set of objects into groups containing highly related instances. This relation is determined by a specific distance metric with which the intra-cluster similarity is estimated. Finding an optimal number of such partitions is usually the key step in the entire process, yet a rather difficult one. Selecting an unsuitable number of clusters might lead to incorrect conclusions and, consequently, to wrong decisions: the term “optimal” is quite ambiguous. Furthermore, various inherent characteristics of the datasets, such as clusters that overlap or clusters containing subclusters, will most often increase the level of difficulty of the task. Thus, the methods used to detect similarities and the parameter selection of the partition algorithm have a major impact on the quality of the groups and the identification of their optimal number. Given that each dataset constitutes a rather distinct case, validity indices are indicators introduced to address the problem of selecting such an optimal number of clusters. In this work, an extensive set of well-known validity indices, based on the approach of the so-called relative criteria, is examined comparatively. A total of 26 cluster validation measures were investigated in two distinct case studies: one on real-world data and one on artificially generated data. To ensure a certain degree of difficulty, both real-world and generated data were selected to exhibit variations and inhomogeneity. Each of the indices is deployed under the schemes of 9 different clustering methods, which incorporate 5 different distance metrics. All results are presented in various explanatory forms.


2017 ◽  
Vol 65 (4) ◽  
pp. 359-365 ◽  
Author(s):  
Javier Senent-Aparicio ◽  
Jesús Soto ◽  
Julio Pérez-Sánchez ◽  
Jorge Garrido

One of the most important problems faced in hydrology is the estimation of flood magnitudes and frequencies in ungauged basins. Hydrological regionalisation is used to transfer information from gauged watersheds to ungauged watersheds. However, to obtain reliable results, the watersheds involved must have a similar hydrological behaviour. In this study, two different clustering approaches are used and compared to identify the hydrologically homogeneous regions. The Fuzzy C-Means algorithm (FCM), which is widely used for regionalisation studies, needs the calculation of cluster validity indices in order to determine the optimal number of clusters. The Fuzzy Minimals algorithm (FM), which presents an advantage compared with other fuzzy clustering algorithms, does not need to know the number of clusters a priori, so cluster validity indices are not used. A regional homogeneity test based on the L-moments approach is used to check the homogeneity of the regions identified by both cluster analysis approaches. The validation of the FM algorithm in deriving homogeneous regions for flood frequency analysis is illustrated through its application to data from the watersheds in Alto Genil (South Spain). According to the results, the FM algorithm is recommended for identifying the hydrologically homogeneous regions for regional frequency analysis.


Author(s):  
M. Arif Wani ◽  
Romana Riyaz

Purpose – The most commonly used approaches for cluster validation are based on indices, but the majority of the existing cluster validity indices do not work well on data sets of different complexities. The purpose of this paper is to propose a new cluster validity index (ARSD index) that works well on all types of data sets. Design/methodology/approach – The authors introduce a new compactness measure that depicts the typical behaviour of a cluster where more points are located around the centre and fewer points towards the outer edge of the cluster. A novel penalty function is proposed for determining the distinctness measure of clusters. A random linear search algorithm is employed to evaluate and compare the performance of the five commonly known validity indices and the proposed validity index. The values of the six indices are computed for all nc in the range (nc min, nc max) to obtain the optimal number of clusters present in a data set. The data sets used in the experiments include shaped, Gaussian-like and real data sets. Findings – Through extensive experimental study, the proposed validity index is found to be more consistent and reliable in indicating the correct number of clusters compared to other validity indices. This is experimentally demonstrated on 11 data sets where the proposed index has achieved better results. Originality/value – The originality of the research paper includes proposing a novel cluster validity index which is used to determine the optimal number of clusters present in data sets of different complexities.


Entropy ◽  
2020 ◽  
Vol 22 (11) ◽  
pp. 1200
Author(s):  
Ferdinando Di Martino ◽  
Salvatore Sessa

Two well-known drawbacks in fuzzy clustering are the requirement of assigning the number of clusters in advance and the random initialization of cluster centers. The quality of the final fuzzy clusters depends heavily on the initial choice of the number of clusters and the initialization of the clusters; it is therefore necessary to apply a validity index to measure the compactness and the separability of the final clusters and to run the clustering algorithm several times. We propose a new fuzzy C-means algorithm in which a validity index based on the concepts of maximum fuzzy energy and minimum fuzzy entropy is applied to find the optimal number of clusters and the initial cluster centers in order to obtain a good clustering quality, without increasing time consumption. We test our algorithm on UCI (University of California at Irvine) machine learning classification datasets, comparing the results with the ones obtained by using well-known validity indices and variations of fuzzy C-means that use optimization algorithms in the initialization phase. The comparison results show that our algorithm represents an optimal trade-off between the quality of clustering and the time consumption.


2021 ◽  
pp. 1-15
Author(s):  
R.M. Noorullah ◽  
Moulana Mohammed

Topic models have been widely used for building clusters of documents for more than a decade, yet problems still occur in choosing the optimal number of topics. The main problem is the lack of a stable metric of the quality of topics obtained during the construction of topic models. From an analysis of previous works, the authors found that most of the models used in determining the number of topics are non-parametric, with the quality of topics determined by using perplexity and coherence measures, and concluded that they are not applicable in solving this problem. In this paper, we used the parametric method, which is an extension of the traditional topic model with visual access tendency for visualization of the number of topics (clusters) to complement clustering and to choose the optimal number of topics based on the results of cluster validity indices. The developed hybrid topic models are demonstrated with different Twitter datasets on various topics in obtaining the optimal number of topics and in measuring the quality of clusters. The experimental results showed that the Visual Non-negative Matrix Factorization (VNMF) topic model performs well in determining the optimal number of topics with interactive visualization and in measuring the quality of clusters with validity indices.


Water ◽  
2020 ◽  
Vol 12 (5) ◽  
pp. 1372
Author(s):  
Nikhil Bhatia ◽  
Jency M. Sojan ◽  
Slobodon Simonovic ◽  
Roshan Srivastav

The delineation of precipitation regions aims to identify homogeneous zones in which the characteristics of the process are statistically similar. The regionalization process has three main components: (i) delineation of regions using clustering algorithms, (ii) determining the optimal number of regions using cluster validity indices (CVIs), and (iii) validation of regions for homogeneity using the L-moments ratio test. The identification of the optimal number of clusters will significantly affect the homogeneity of the regions. The objective of this study is to investigate the performance of the various CVIs in identifying the optimal number of clusters, which maximizes the homogeneity of the precipitation regions. The k-means clustering algorithm is adopted to delineate the regions using location-based attributes for two large areas of Canada, namely, the Prairies and the Great Lakes-St Lawrence lowlands (GL-SL) region. The seasonal precipitation data for 55 years (1951–2005) is derived using high-resolution ANUSPLIN gridded point data for Canada. The results indicate that the optimal number of clusters and the regional homogeneity depend on the CVI adopted. Among the 42 cluster indices considered, 15 outperform the others in identifying the homogeneous precipitation regions. The Dunn, Det_ratio and Trace(W⁻¹B) indices are found to be the best for all seasons in both regions.
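As an illustration of what a cluster validity index computes, the following is a minimal sketch of the Dunn index, one of the three indices this abstract reports as best. It uses the classic definition (smallest distance between points of different clusters divided by the largest within-cluster diameter); the abstract does not say which variant the study used, so that choice, and the function name `dunn_index`, are assumptions. Higher values indicate compact, well-separated clusters.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Classic Dunn index for a partition given as a list of point lists.

    Assumes at least two non-empty clusters. Illustrative sketch only.
    """
    # Largest diameter: maximum pairwise distance inside any single cluster.
    max_diameter = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    # Smallest separation: minimum distance between points of different clusters.
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # All-singleton partitions have zero diameter; treat them as unbounded.
    return min_separation / max_diameter if max_diameter > 0 else float("inf")
```

To choose the number of regions with such an index, one would cluster the data for each candidate k, evaluate the index on each resulting partition, and keep the k with the highest value.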


2017 ◽  
Vol 26 (3) ◽  
pp. 483-503 ◽  
Author(s):  
Vijay Kumar ◽  
Jitender Kumar Chhabra ◽  
Dinesh Kumar

Finding the optimal number of clusters and the appropriate partitioning of the given dataset are the two major challenges in clustering. For both of these, cluster validity indices are used. In this paper, seven widely used cluster validity indices, namely DB index, PS index, I index, XB index, FS index, K index, and SV index, have been developed based on line symmetry distance measures. These indices provide the measure of line symmetry present in the partitioning of the dataset. They are able to detect clusters of any shape or size in a given dataset, as long as the clusters possess the property of line symmetry. The performance of these indices is evaluated on three clustering algorithms: K-means, fuzzy C-means, and modified harmony search-based clustering (MHSC). The efficacy of symmetry-based validity indices on clustering algorithms is demonstrated on artificial and real-life datasets, six each, with the number of clusters varying from 2 to $\sqrt{n}$, where n is the total number of data points in the dataset. The experimental results reveal that the incorporation of line symmetry-based distance improves the capabilities of these existing validity indices in finding the appropriate number of clusters. These indices are compared with the point symmetric and original versions of the seven validity indices. The results also demonstrate that the MHSC technique performs better than other well-known clustering techniques. For real-life datasets, an analysis of variance is also performed.


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of the K-means. The first phase executes the CA. CA self-organizes and recognizes the number of clusters k based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms the existing algorithms in accuracy and efficiency under both sequential and parallel conditions.


Author(s):  
Chunhua Ren ◽  
Linfu Sun

The classic Fuzzy C-means (FCM) algorithm has limited clustering performance and is prone to misclassification of border points. This study offers a bi-directional FCM clustering ensemble approach that takes local information into account (LI_BIFCM) to overcome these challenges and increase clustering quality. First, various membership matrices are created after running FCM multiple times, based on the randomization of the initial cluster centers, and a vertical ensemble is performed using the maximum membership principle. Second, after each execution of FCM, multiple local membership matrices of the sample points are created using multiple K-nearest neighbors, and a horizontal ensemble is performed. Multiple horizontal ensembles can be created using multiple FCM clustering runs. Finally, the final clustering results are obtained by combining the vertical and horizontal clustering ensembles. Twelve data sets were chosen for testing from both synthetic and real data sources. In the experiments, LI_BIFCM outperformed four traditional clustering algorithms and three clustering ensemble algorithms. Furthermore, the final clustering results have a weak correlation with the bi-directional cluster ensemble parameters, indicating that the suggested technique is robust.

