A complete gradient clustering algorithm formed with kernel estimators

Author(s):  
Piotr Kulczycki ◽  
Małgorzata Charytanowicz

The aim of this paper is to provide a gradient clustering algorithm in its complete form, suitable for direct use without requiring deeper statistical knowledge. The values of all parameters are effectively calculated using optimizing procedures. Moreover, an illustrative analysis of the meaning of particular parameters is shown, followed by the effects resulting from possible modifications with respect to their primarily assigned optimal values. The proposed algorithm does not demand strict assumptions regarding the desired number of clusters, which allows the obtained number to be better suited to the real data structure. A feature specific to it is the possibility of influencing the proportion between the number of clusters in areas where data elements are dense and in areas where they are sparse. Finally, by detecting one-element clusters, the algorithm allows atypical elements to be identified, which enables their elimination or assignment to bigger clusters, thus increasing the homogeneity of the data set.
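The general idea of gradient clustering on a kernel density estimate can be sketched as follows. This is only an illustration under stated assumptions, not the authors' complete algorithm: it uses a Gaussian kernel with a hand-picked bandwidth `h` (the paper computes its parameters with optimizing procedures), takes mean-shift steps as the uphill move, and merges end points with a hypothetical rule `merge_tol`. The function names are invented for this example.

```python
import math


def mean_shift_step(x, data, h):
    """Move x one step uphill on a Gaussian kernel density estimate."""
    weights = [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * h * h))
        for p in data
    ]
    total = sum(weights)
    return tuple(
        sum(w * p[d] for w, p in zip(weights, data)) / total
        for d in range(len(x))
    )


def gradient_cluster(data, h, steps=200, merge_tol=None):
    """Assign each point to the density mode it climbs to.

    merge_tol is a hypothetical merging rule (default h / 2): end points
    closer than this are treated as the same mode.
    """
    merge_tol = h / 2 if merge_tol is None else merge_tol
    modes, labels = [], []
    for p in data:
        x = tuple(p)
        for _ in range(steps):
            x = mean_shift_step(x, data, h)
        for i, m in enumerate(modes):
            if math.dist(x, m) <= merge_tol:
                labels.append(i)
                break
        else:
            # No existing mode is close enough, so open a new cluster.
            modes.append(x)
            labels.append(len(modes) - 1)
    return labels, modes


# Two dense groups and one isolated point; the number of clusters is not given.
data = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,), (20.0,)]
labels, modes = gradient_cluster(data, h=0.5)
```

In this toy run the isolated point ends up in a cluster of its own, which mirrors the abstract's remark that one-element clusters can flag atypical elements.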

Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization by the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA. The CA self-organizes and recognizes the number of clusters k based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms the existing algorithms in accuracy and efficiency under both sequential and parallel conditions.
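The second phase described above, the Lloyd iteration, can be sketched as follows. The abstract does not specify the covering algorithm, so this sketch simply assumes that some first phase has already produced the initial centers. The function name `lloyd` and the choice to keep the old center of an empty cluster are assumptions made for illustration.

```python
import math


def lloyd(points, centers, max_iter=100):
    """Plain Lloyd iteration: alternate assignment and centroid update."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old)  # assumption: keep an empty cluster's center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
initial_centers = [(0, 0), (10, 10)]  # stand-in for the output of a first phase
labels, centers = lloyd(points, initial_centers)
```

Because k is fixed by the number of initial centers passed in, any method that proposes those centers also determines k for the Lloyd phase.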


2021 ◽  
Vol 18 (1) ◽  
pp. 34-57
Author(s):  
Weifeng Pan ◽  
Xinxin Xu ◽  
Hua Ming ◽  
Carl K. Chang

Mashup technology has become a promising way to develop and deliver applications on the web. Automatically organizing Mashups into functionally similar clusters helps improve the performance of Mashup discovery. Although there are many approaches to clustering Mashups, they focus solely on semantic similarities to guide the clustering process and are unable to utilize both the structural and semantic information in Mashup profiles. In this paper, a novel approach to clustering Mashups into groups is proposed, which integrates structural similarity and semantic similarity using fuzzy AHP (fuzzy analytic hierarchy process). The structural similarity is computed from usage histories between Mashups and Web APIs using the SimRank algorithm. The semantic similarity is computed from the descriptions and tags of Mashups using LDA (latent Dirichlet allocation). A clustering algorithm based on the genetic algorithm is employed to cluster Mashups. Comprehensive experiments are performed on a real data set collected from ProgrammableWeb. The results show the effectiveness of the approach when compared with two kinds of conventional approaches.


2013 ◽  
Vol 462-463 ◽  
pp. 438-442
Author(s):  
Ming Gu

A neural network with quadratic junctions is described. The structure, properties, and unsupervised learning rules of the neural network are discussed. An ART-based hierarchical clustering algorithm using this kind of neural network is suggested. The algorithm can determine the number of clusters and cluster the data. A 2-D artificial data set is used to illustrate the effectiveness of the proposed algorithm and compare it with the K-means algorithm.


2021 ◽  
Vol 3 (1) ◽  
pp. 1-7
Author(s):  
Yadgar Sirwan Abdulrahman

Clustering is one of the essential strategies in data analysis. In classical solutions, all features are assumed to contribute equally to the data clustering. In real data sets, however, some features are more important than others. As a result, essential features will have a more significant impact on identifying optimal clusters than other features. In this article, a fuzzy clustering algorithm with local automatic weighting is presented. The proposed algorithm has several advantages: 1) the feature weights are local, meaning that each cluster's weights differ from those of the other clusters; 2) the distance between samples is calculated using a non-Euclidean similarity criterion to reduce the effect of noise; 3) the feature weights are obtained adaptively during the learning process. In this study, mathematical analyses were carried out to obtain the cluster centers and the feature weights. Experiments on a range of data sets show the efficiency of the proposed algorithm compared to other algorithms with global and local feature weighting.


2021 ◽  
Vol 37 (1) ◽  
pp. 71-89
Author(s):  
Vu-Tuan Dang ◽  
Viet-Vu Vu ◽  
Hong-Quan Do ◽  
Thi Kieu Oanh Le

During the past few years, semi-supervised clustering has emerged as an interesting new direction in machine learning research. In a semi-supervised clustering algorithm, the clustering results can be significantly improved by using side information, which is available or collected from users. There are two main kinds of side information that can be learned in semi-supervised clustering algorithms: class labels (called seeds) and pairwise constraints. The first semi-supervised clustering algorithm was introduced in 2000, and since then many algorithms have been presented in the literature. However, it is not easy to use both types of side information in the same algorithm. To address this problem, this paper proposes a semi-supervised graph-based clustering algorithm, called MCSSGC, that uses both seeds and constraints in the clustering process. Moreover, we introduce a simple but efficient active learning method, named KMMFFQS, to collect constraints that can boost the performance of MCSSGC. To verify the effectiveness of the proposed algorithm, we conducted a series of experiments not only on real data sets from UCI but also on a document data set used in information extraction from Vietnamese documents. The results show that the proposed algorithm can significantly improve the clustering process compared to some recent algorithms.


2021 ◽  
Vol 6 (1) ◽  
pp. 41
Author(s):  
I Kadek Dwi Gandika Supartha ◽  
Adi Panca Saputra Iskandar

This study clusters data on STMIK STIKOM Indonesia alumni using the Fuzzy C-Means and Fuzzy Subtractive methods. Cluster validity is tested with the Modified Partition Coefficient (MPC) and Classification Entropy (CE) indices. Clustering is carried out with the aim of finding hidden patterns or information in a fairly large data set, considering that so far the alumni data at STMIK STIKOM Indonesia have not undergone a data mining process. According to the MPC and CE indices, the Fuzzy C-Means clustering algorithm has a higher level of validity than the Fuzzy Subtractive clustering algorithm, so Fuzzy C-Means can be said to cluster the alumni data better than the Fuzzy Subtractive method. The optimal number of clusters based on the CE and MPC validity indices is 5. The cluster with the best characteristics is the 1st cluster, which has 514 members (36.82% of the total alumni), an average GPA of 3.3617, an average study period of 7.8102 semesters, and an average TA work period of 4.9596 months.
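The two validity indices named above can be computed from a fuzzy membership matrix as sketched below. The sketch assumes the usual textbook definitions: the partition coefficient is the mean of squared memberships, MPC rescales it to the range 0 to 1, and CE is the mean membership entropy with the natural logarithm. The abstract does not state which variants or log base the study used, so these are assumptions. `u` is a hypothetical N-by-c matrix whose rows sum to 1.

```python
import math


def partition_coefficient(u):
    """PC: mean of squared memberships over all N samples."""
    return sum(m * m for row in u for m in row) / len(u)


def modified_partition_coefficient(u):
    """MPC = 1 - c / (c - 1) * (1 - PC); 1 is crisp, 0 is fully fuzzy."""
    c = len(u[0])
    return 1 - c / (c - 1) * (1 - partition_coefficient(u))


def classification_entropy(u):
    """CE: mean membership entropy; zero memberships contribute nothing."""
    return -sum(m * math.log(m) for row in u for m in row if m > 0) / len(u)


crisp = [[1.0, 0.0], [0.0, 1.0]]  # every sample belongs fully to one cluster
fuzzy = [[0.5, 0.5], [0.5, 0.5]]  # every sample is split evenly

mpc_crisp = modified_partition_coefficient(crisp)
mpc_fuzzy = modified_partition_coefficient(fuzzy)
ce_crisp = classification_entropy(crisp)
ce_fuzzy = classification_entropy(fuzzy)
```

Under these definitions a higher MPC and a lower CE both indicate a crisper partition, which is the sense in which one method can be ranked above another.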


Author(s):  
Md. Zakir Hossain ◽  
Md.Nasim Akhtar ◽  
R.B. Ahmad ◽  
Mostafijur Rahman

Data mining is the process of finding structure in large data sets. With this process, decision makers can make particular decisions for the further development of real-world problems. Several data clustering techniques are used in data mining for finding a specific pattern in data. The K-means method is one of the familiar clustering techniques for clustering large data sets. The K-means clustering method partitions the data set based on the assumption that the number of clusters is fixed. The main problem of this method is that if the number of clusters is chosen to be small, there is a higher probability of adding dissimilar items to the same group. On the other hand, if the number of clusters is chosen to be high, there is a higher chance of adding similar items to different groups. In this paper, we address this issue by proposing a new K-means clustering algorithm. The proposed method performs data clustering dynamically. It initially calculates a threshold value as a centroid of K-means, and the clusters are formed based on this value. At each iteration of K-means, if the Euclidean distance between two points is less than or equal to the threshold value, then these two data points will be in the same group. Otherwise, the proposed method will create a new cluster with the dissimilar data point. The results show that the proposed method outperforms the original K-means method.
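The threshold rule described above might be sketched as a single pass over the data. This is one loose reading of the abstract, not the authors' method: each point is compared with the nearest existing centroid rather than with individual points, and the threshold is passed in as a parameter because the abstract does not say how it is computed. The function name `threshold_clusters` is hypothetical.

```python
import math


def threshold_clusters(points, threshold):
    """Join the nearest cluster if within threshold, else start a new one."""
    centroids, members = [], []
    for p in points:
        if centroids:
            k = min(range(len(centroids)),
                    key=lambda i: math.dist(p, centroids[i]))
            if math.dist(p, centroids[k]) <= threshold:
                members[k].append(p)
                # Recompute the centroid as the mean of the cluster's members.
                centroids[k] = tuple(
                    sum(col) / len(members[k]) for col in zip(*members[k])
                )
                continue
        # No cluster is close enough, so this point seeds a new cluster.
        centroids.append(tuple(p))
        members.append([p])
    return members, centroids


points = [(0, 0), (1, 0), (10, 0), (11, 0), (50, 0)]
members, centroids = threshold_clusters(points, threshold=2)
```

The number of clusters here follows from the threshold and the data rather than being fixed in advance, and in this single-pass form the result can depend on the order of the points.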


2013 ◽  
Vol 3 (4) ◽  
pp. 1-14 ◽  
Author(s):  
S. Sampath ◽  
B. Ramya

Cluster analysis is a branch of data mining that plays a vital role in bringing out hidden information in databases. Clustering algorithms help medical researchers identify the presence of natural subgroups in a data set. Different types of clustering algorithms are available in the literature. The most popular among them is k-means clustering. Even though k-means clustering is popular and widely used, its application requires knowledge of the number of clusters present in the given data set. Several solutions are available in the literature to overcome this limitation. The k-means clustering method creates a disjoint and exhaustive partition of the data set. However, in some situations one can come across objects that belong to more than one cluster. In this paper, a clustering algorithm is proposed that is capable of producing rough clusters automatically, without requiring the user to supply the number of clusters as input. The efficiency of the algorithm in detecting the number of clusters present in the data set has been studied with the help of some real-life data sets. Further, a nonparametric statistical analysis of the results of the experimental study has been carried out to analyze the efficiency of the proposed algorithm in automatically detecting the number of clusters in the data set with the help of a rough version of the Davies-Bouldin index.


This research work proposes an integrated approach using fuzzy clustering to discover the optimal number of clusters. The proposed technique is an innovative clustering algorithm for marketing and could be used to determine the best groups of customers and of similar items and products. The new approach can independently determine the initial distribution of cluster centers. The task of finding the number of clusters is converted into the task of determining the size of the neural network, which is later translated into identifying the optimal groups of clusters. This approach has been tested on four business data sets and shows outstanding results compared to traditional approaches. The proposed method is able to find the expected number of clusters without any significant error. Further, we believe that this work has business value: it can increase market efficiency by identifying which groups of clusters are more cost-effective.

