Hard c-Means Using Quadratic Penalty-Vector Regularization for Uncertain Data

Author(s):  
Yasunori Endo  
Arisa Taniguchi  
Yukihiro Hamasuna  
...  

Clustering is an unsupervised classification technique for data analysis. In general, each datum in real space is transformed into a point in a pattern space so that clustering methods can be applied. Data often cannot be represented by a point, however, because of uncertainty, e.g., measurement error margins and missing values. In this paper, we introduce quadratic penalty-vector regularization to handle such uncertain data using Hard c-Means (HCM), which is one of the most typical clustering algorithms. First, we propose a new clustering algorithm called hard c-means using quadratic penalty-vector regularization for uncertain data (HCMP). Second, we propose sequential extraction hard c-means using quadratic penalty-vector regularization (SHCMP) to handle datasets whose cluster number is unknown. Finally, we verify the effectiveness of our proposed algorithms through numerical examples.
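To make the idea of a penalty vector concrete, the following is a minimal sketch, not the authors' HCMP algorithm. It assumes an objective of the form J = sum_k ||x_k + d_k - v_c(k)||^2 + sum_k w_k ||d_k||^2, where d_k is a penalty vector attached to datum x_k, v_c(k) is the center of the cluster x_k is assigned to, and w_k > 0 is a penalty weight. The objective, the update rules derived from it, and the names `hcmp_sketch`, `weights`, and `deltas` are all assumptions made for illustration; the abstract itself does not state them.

```python
import math


def hcmp_sketch(data, centers, weights, iters=20):
    """Alternating minimization of an ASSUMED objective:
        J = sum_k ||x_k + d_k - v_c(k)||^2 + sum_k w_k * ||d_k||^2
    This is an illustrative sketch, not the published HCMP algorithm.
    """
    dim = len(data[0])
    deltas = [[0.0] * dim for _ in data]      # penalty vectors d_k
    centers = [list(c) for c in centers]      # cluster centers v_i
    labels = [0] * len(data)

    for _ in range(iters):
        # Each datum is represented by the shifted point x_k + d_k.
        shifted = [[x + d for x, d in zip(xk, dk)]
                   for xk, dk in zip(data, deltas)]

        # Hard assignment: nearest center to the shifted point.
        labels = [min(range(len(centers)),
                      key=lambda i: math.dist(p, centers[i]))
                  for p in shifted]

        # Center update: mean of the shifted points in each cluster.
        for i in range(len(centers)):
            members = [p for p, lab in zip(shifted, labels) if lab == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(members)
                              for col in zip(*members)]

        # Penalty-vector update: setting dJ/dd_k = 0 under the assumed
        # objective gives d_k = -(x_k - v_c(k)) / (1 + w_k).
        deltas = [[-(x - v) / (1.0 + w) for x, v in zip(xk, centers[lab])]
                  for xk, lab, w in zip(data, labels, weights)]

    return labels, centers, deltas


data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers, deltas = hcmp_sketch(
    data, centers=[(0.0, 0.0), (10.0, 10.0)], weights=[1.0] * 4)
```

Under this assumed objective, a large w_k keeps d_k near zero, so the datum behaves like an ordinary point, while a small w_k lets the datum move far toward its cluster center, which is one way to express that it is more uncertain.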

Author(s):  
Naohiko Kinoshita  
Yasunori Endo  
Yukihiro Hamasuna  
...  

Clustering, a highly useful unsupervised classification technique, has been applied in many fields. When clustering is used to classify a set of objects, for example, it generally ignores any uncertainty in the objects, because uncertainty is difficult to model and handle. It is desirable, however, to handle individual objects as they are so that they can be classified more precisely. In this paper, we propose new clustering algorithms that handle objects having uncertainty by introducing penalty vectors. We show the theoretical relationship between our proposal and conventional algorithms, and we verify the effectiveness of our proposed algorithms through numerical examples.


Author(s):  
Yasunori Endo  
Tomoyuki Suzuki  
Naohiko Kinoshita  
Yukihiro Hamasuna  
...  

The fuzzy non-metric model (FNM) is a representative non-hierarchical clustering method. It is very useful because the belongingness, or membership degree, of each datum to each cluster can be calculated directly from the dissimilarities between data, and cluster centers are not used. However, the original FNM cannot handle data with uncertainty. In this study, we refer to data with uncertainty as "uncertain data," e.g., incomplete data or data that contain errors. Previously, a method based on the concept of a tolerance vector was proposed for handling uncertain data, and some clustering methods were constructed according to this concept, e.g., fuzzy c-means for data with tolerance. These methods can handle uncertain data in the framework of optimization. Thus, in the present study, we apply the concept to FNM. First, we propose a new clustering algorithm based on FNM using the concept of tolerance, which we refer to as the fuzzy non-metric model for data with tolerance. Second, we show that the proposed algorithm can handle incomplete data sets. Third, we verify the effectiveness of the proposed algorithm through comparisons with conventional methods for incomplete data sets in some numerical examples.


Author(s):  
Naohiko Kinoshita  
Yasunori Endo

Clustering is one of the most popular unsupervised classification methods. In this paper, we focus on rough clustering methods based on rough-set representation. Rough k-Means (RKM) is one of the rough clustering methods proposed by Lingras et al. The outputs of many clustering algorithms, including RKM, depend strongly on initial values, so we must evaluate the validity of the outputs. In the case of objective-based clustering algorithms, the objective function serves as the measure. It is difficult, however, to evaluate the output of RKM, which is not objective-based. To solve this problem, we propose new objective-based rough clustering algorithms and verify their usefulness through numerical examples.


Author(s):  
Yasunori Endo  
Yasushi Hasegawa  
Yukihiro Hamasuna  
Yuchi Kanzawa  
...  

Clustering, defined as an unsupervised data-analysis classification that transforms real-space information into data in pattern space and analyzes it, may require that data be represented by a set rather than by points because of data uncertainty, e.g., measurement error margin, data regarded as one point, or missing values. These data uncertainties have been represented as interval ranges, for which many clustering algorithms have been constructed. However, the lack of guidelines for selecting among the available distances in individual cases has made selection difficult and raised the need for ways to calculate dissimilarity between uncertain data without introducing a nearest-neighbor or other distance. The tolerance concept we propose represents uncertain data as a point with a tolerance vector, not as an interval. While this is convenient for handling uncertain data, tolerance-vector constraints make mathematical development difficult. We attempt to remove the tolerance-vector constraints using quadratic penalty-vector regularization, which is similar to the tolerance vector. We also propose clustering algorithms for uncertain data that consider optimization and obtain an optimal solution to handle uncertainty appropriately.


2021  
Author(s):  
Meskat Jahan ◽  
Mahmudul Hasan

In the big data era, clustering is one of the most popular data mining methods. The majority of clustering algorithms suffer from difficulties such as automatic determination of the cluster number, poor clustering precision, inconsistent clustering across datasets, and parameter dependence. A new fuzzy autonomous clustering solution named the Meskat-Mahmudul (MM) clustering algorithm is proposed to overcome the complexity of parameter-free automatic cluster number determination and to improve clustering accuracy. The MM clustering algorithm finds the exact number of clusters based on the Average Silhouette method in multivariate mixed-attribute datasets, including real-time gene expression datasets, and deals with missing values, noise, and outliers. The MM Extended K-Means (MMK) clustering algorithm is an enhancement of the K-Means algorithm that provides automatic cluster discovery and runtime cluster placement. Several validation methods are used to evaluate the clusters and to certify optimum cluster partitioning. Several datasets are used to compare the performance of the proposed algorithms with that of other algorithms in terms of time complexity and clustering efficiency. Finally, the MM and MMK clustering algorithms are found to be superior to conventional algorithms.


2021  
Vol 11 (1)  
Author(s):  
Baicheng Lyu ◽  
Wenhua Wu ◽  
Zhiqiang Hu

With the wide application of cluster analysis, the number of clusters is gradually increasing, as is the difficulty in selecting the indicators used to judge cluster numbers. Also, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes the connection between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and can reduce the number of adjusted parameters to a minimum. On the basis of the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. A different cutoff distance and cutoff density are assigned to each data cluster, which results in improved clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city light satellite images.


2021  
Vol 10 (4)  
pp. 2170-2180
Author(s):  
Untari N. Wisesty ◽  
Tati Rajab Mengko

This paper analyzes SARS-CoV-2 genome variation by comparing the results of genome clustering using several clustering algorithms and the distribution of sequences in each cluster. The clustering algorithms used are K-means, Gaussian mixture models, agglomerative hierarchical clustering, mean-shift clustering, and DBSCAN. However, clustering algorithms have a weakness in grouping data with very high dimensionality, such as genome data, so a dimensionality reduction process is needed. In this research, dimensionality reduction was carried out using principal component analysis (PCA) and an autoencoder method, with three models that produce 2, 10, and 50 features. The main contributions are the dimensionality reduction and clustering scheme for SARS-CoV-2 sequence data and the performance analysis of each experiment for each scheme and for the hyperparameters of each method. Based on the experimental results, PCA with the DBSCAN algorithm achieves the highest silhouette score of 0.8770 with three clusters when using two features. However, dimensionality reduction using the autoencoder needs more iterations to converge. In the testing process with Indonesian sequence data, more than half of the sequences fall into one cluster and the rest are distributed across the other two clusters.


Author(s):  
Ming Cao ◽  
Qinke Peng ◽  
Ze-Gang Wei ◽  
Fei Liu ◽  
Yi-Fan Hou

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment, to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust produces fewer clusters than other methods and that its SS is higher than that of other methods. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.


2021  
Author(s):  
Congming Shi ◽  
Bingtao Wei ◽  
Shoulin Wei ◽  
Wen Wang ◽  
Hai Liu ◽  
...  

Clustering, a traditional machine learning method, plays a significant role in data analysis. Most clustering algorithms depend on a predetermined exact number of clusters, whereas, in practice, the number of clusters is usually unpredictable. Although the Elbow method is one of the most commonly used methods to determine the optimal cluster number, the determination depends on the manual identification of the elbow points on the visualization curve. Thus, even experienced analysts cannot clearly identify the elbow point when the plotted curve is fairly smooth. To solve this problem, a new elbow point discriminant method is proposed to yield a statistical metric that estimates an optimal cluster number when clustering a dataset. First, the average degree of distortion obtained by the Elbow method is normalized to the range of 0 to 10. Second, the normalized results are used to calculate the cosine of the intersection angles between elbow points. Third, this calculated cosine and the arccosine are used to compute the intersection angles between elbow points. Finally, the index of the minimal intersection angle between elbow points is used as the estimated potential optimal cluster number. The experimental results based on simulated datasets and a well-known public dataset (the Iris dataset) demonstrate that the estimated optimal cluster number obtained by the proposed method is better than that obtained by the widely used Silhouette method.
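The four steps described in this abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: it assumes that the "intersection angle" at a candidate k is the angle between the two segments joining the point (k, distortion) to its left and right neighbors on the normalized curve, and the function names `normalize` and `estimate_cluster_number` are hypothetical.

```python
import math


def normalize(values, lo=0.0, hi=10.0):
    """Rescale values linearly to the range [lo, hi] (step 1)."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values]


def estimate_cluster_number(ks, distortions):
    """Return (k, angle) for the interior k with the smallest angle.

    ASSUMPTION: the angle at k is measured between the segments to the
    neighboring points (k-1, y) and (k+1, y) on the normalized curve.
    """
    if len(ks) != len(distortions) or len(ks) < 3:
        raise ValueError("need at least three (k, distortion) pairs")
    y = normalize(distortions)
    best_k, best_angle = None, math.inf
    for i in range(1, len(ks) - 1):
        # Vectors from the current point to its left and right neighbors.
        ax, ay = ks[i - 1] - ks[i], y[i - 1] - y[i]
        bx, by = ks[i + 1] - ks[i], y[i + 1] - y[i]
        # Step 2: cosine of the angle between the two vectors.
        cos = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        cos = max(-1.0, min(1.0, cos))  # guard against rounding error
        # Step 3: arccosine gives the angle itself.
        angle = math.acos(cos)
        # Step 4: keep the k with the sharpest bend (smallest angle).
        if angle < best_angle:
            best_k, best_angle = ks[i], angle
    return best_k, best_angle


# Made-up distortion values for k = 1..6, for illustration only.
ks = [1, 2, 3, 4, 5, 6]
distortions = [100.0, 40.0, 10.0, 8.0, 7.0, 6.5]
best_k, best_angle = estimate_cluster_number(ks, distortions)
```

On this made-up curve the sharpest bend is at k = 3, where the distortion stops dropping steeply. Because only interior points have two neighbors, the first and last k can never be selected under this reading.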


2021  
Author(s):  
Yizhang Wang ◽  
Di Wang ◽  
You Zhou ◽  
Chai Quek ◽  
Xiaofeng Zhang

Clustering is an important unsupervised knowledge acquisition method, which divides unlabeled data into different groups \cite{atilgan2021efficient,d2021automatic}. Different clustering algorithms make different assumptions about cluster formation. Thus, most clustering algorithms handle at least one particular type of data distribution well but may not handle other types of distributions well. For example, K-means identifies convex clusters well \cite{bai2017fast}, and DBSCAN is able to find clusters with similar densities \cite{DBSCAN}.

Therefore, most clustering methods may not work well on data distribution patterns that differ from the assumptions being made, or on a mixture of different distribution patterns. Taking DBSCAN as an example, it is sensitive to the loosely connected points between dense natural clusters, as illustrated in Figure~\ref{figconnect}. The density of the connecting points shown in Figure~\ref{figconnect} is different from that of the natural clusters on both ends. However, DBSCAN with fixed global parameter values may wrongly assign these connecting points and consider all the data points in Figure~\ref{figconnect} as one big cluster.

