An Ensemble Clusterer Framework based on Valid and Diverse Basic Small Clusters

Author(s):  
Tao Sun ◽  
Saeed Mashdour ◽  
Mohammad Reza Mahmoudi

Clustering ensemble is a relatively new problem in which the goal is to extract a single clustering out of a pool of base clusterings; the pool itself is often referred to as the ensemble. An ensemble is considered suitable if its members are diverse and each of them meets a minimum quality threshold. The method that maps an ensemble into an output partition (also called the consensus partition) is named the consensus function. The consensus function should find a consensus partition that the ensemble members agree on as much as possible. In this paper, a novel clustering ensemble framework is introduced that guarantees generation of a pool of base clusterings satisfying both conditions: diversity among ensemble members and high-quality members. A novel consensus function tailored to the framework's constraints is also introduced. We experimentally show that the proposed clustering ensemble framework is scalable, efficient and general. Comparing different base clustering algorithms, we show that our improved base clustering algorithm performs better; among different consensus functions, we show the effectiveness of ours. Finally, compared with the state of the art, we find that the proposed clustering ensemble framework is comparable or even better in terms of scalability and efficacy.
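The abstract does not spell out the consensus function itself. A widely used baseline (not necessarily the authors' choice) is the co-association matrix: count how often each pair of points is co-clustered across the ensemble, then merge pairs whose co-association exceeds a threshold. A minimal sketch, with illustrative function names:

```python
def co_association(partitions, n):
    """Fraction of base clusterings in which each pair of points co-clusters."""
    ca = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    ca[i][j] += 1.0 / len(partitions)
    return ca

def consensus(partitions, n, threshold=0.5):
    """Merge points whose co-association exceeds the threshold (single-link
    style), using a small union-find structure."""
    ca = co_association(partitions, n)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if ca[i][j] > threshold:
                parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(n)})
    relabel = {r: k for k, r in enumerate(roots)}
    return [relabel[find(i)] for i in range(n)]
```

For example, with the three base partitions `[[0,0,1,1], [0,0,0,1], [1,1,0,0]]` over four points, only the pairs (0,1) and (2,3) co-cluster in a majority of members, so the consensus is two clusters.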

2020 ◽  
Vol 10 (5) ◽  
pp. 1891 ◽  
Author(s):  
Huan Niu ◽  
Nasim Khozouie ◽  
Hamid Parvin ◽  
Hamid Alinejad-Rokny ◽  
Amin Beheshti ◽  
...  

Clustering ensemble refers to an approach in which a number of (usually weak) base clusterings are performed and their consensus clustering is used as the final clustering. Since democratic decisions tend to be better than dictatorial ones, it may seem obvious that ensemble decisions (here, clustering ensembles) are better than single-model decisions (here, a single clustering). However, it is not guaranteed that every ensemble is better than a single model. An ensemble is considered a better one if its members are valid and high-quality, and if they participate in constructing the consensus clustering in proportion to their qualities. In this paper, we propose a clustering ensemble framework that uses a simple clustering algorithm based on the k-medoids clustering algorithm. Our simple clustering algorithm guarantees that the discovered clusters are valid. It is also guaranteed that our clustering ensemble framework uses a mechanism that weights each discovered cluster according to its quality. To realize this mechanism, an auxiliary ensemble named the reference set is created by running several k-means clustering algorithms.
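The abstract does not specify how cluster quality is measured against the reference set. One plausible reading (an assumption, not the paper's exact measure) scores a discovered cluster by how often the reference k-means runs keep its member pairs together:

```python
def cluster_quality(cluster, reference_partitions):
    """Score one discovered cluster by how consistently the reference
    clusterings also co-cluster its member pairs. This pairwise-agreement
    score is an assumed stand-in for the paper's validity measure."""
    members = sorted(cluster)
    if len(members) < 2:
        return 0.0
    scores = []
    for labels in reference_partitions:
        pairs = agree = 0
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                pairs += 1
                if labels[members[a]] == labels[members[b]]:
                    agree += 1
        scores.append(agree / pairs)
    return sum(scores) / len(scores)
```

A cluster whose members the reference runs always separate scores 0; one they always keep intact scores 1, and the score can then weight that cluster's vote in the consensus.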


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Baicheng Lyu ◽  
Wenhua Wu ◽  
Zhiqiang Hu

Abstract: With the widespread application of cluster analysis, the number of clusters is gradually increasing, as is the difficulty of selecting judgment indicators for the number of clusters. Also, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes the connection between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and reduces the adjustable parameters to a minimum. On the basis of the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. A different cutoff distance and cutoff density are assigned to each data cluster, which improves clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city-light satellite images.
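The abstract does not detail BCALoD's density computation; the usual cutoff-distance local density from the density-based clustering literature it builds on looks roughly like this (a sketch, not the authors' code):

```python
import math

def local_density(points, d_c):
    """Hard-cutoff local density: rho_i = number of other points whose
    Euclidean distance to point i is below the cutoff distance d_c."""
    n = len(points)
    rho = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) < d_c:
                rho[i] += 1
    return rho
```

Per the abstract, BCALoD goes further by assigning a different cutoff distance and cutoff density to each cluster rather than using one global `d_c`.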


2015 ◽  
Vol 2015 ◽  
pp. 1-11 ◽  
Author(s):  
Chia-Yu Hsu

Wafer bin maps (WBMs) represent specific defect patterns that provide information for diagnosing the root causes of low yield in semiconductor manufacturing. In practice, most semiconductor engineers use subjective, time-consuming eyeball analysis to assess WBM patterns. Given shrinking feature sizes and increasing wafer sizes, many types of WBMs occur; thus, relying on human vision to judge defect patterns is complex, inconsistent, and unreliable. In this study, a clustering ensemble approach is proposed to bridge this gap, facilitating WBM pattern extraction and helping engineers recognize systematic defect patterns efficiently. The clustering ensemble approach not only generates diverse clusters in data space, but also integrates them in label space. First, the mountain function is used to transform the data according to pattern density. Subsequently, k-means and particle swarm optimization (PSO) clustering algorithms are used to generate diverse partitions and various label results. Finally, an adaptive resonance theory (ART) neural network is used to attain consensus partitions and perform the integration. An experiment was conducted to evaluate the effectiveness of the proposed WBM clustering ensemble approach. Several criteria, including sum of squared error, precision, recall, and F-measure, were used to evaluate the clustering results. The numerical results show that the proposed approach outperforms each individual clustering algorithm.
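The mountain function mentioned in the first step is a classic density-estimation transform (Yager and Filev): the potential at a candidate point is a sum of exponentially decaying contributions from all data samples. The Gaussian kernel and the `alpha` value below are illustrative choices, not necessarily the paper's exact formulation:

```python
import math

def mountain(candidate, data, alpha=0.5):
    """Mountain-style density potential at a candidate point: each data
    sample contributes exp(-alpha * squared distance), so dense regions
    produce high 'mountains'. Kernel shape and alpha are assumptions."""
    return sum(math.exp(-alpha * math.dist(candidate, x) ** 2) for x in data)
```

Evaluating this on a grid turns the raw wafer map into a smooth density surface, which the subsequent k-means and PSO clusterers then partition.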


2021 ◽  
Author(s):  
Manuel Fritz ◽  
Michael Behringer ◽  
Dennis Tschechlov ◽  
Holger Schwarz

Abstract: Clustering is a fundamental primitive in many applications. To achieve valuable results in exploratory clustering analyses, the parameters of the clustering algorithm have to be set appropriately, which is a tremendous pitfall. We observe multiple challenges for large-scale exploration processes. On the one hand, they require specific methods to efficiently explore large parameter search spaces. On the other hand, they often exhibit long runtimes, in particular when large datasets are analyzed with clustering algorithms of super-polynomial runtime, which repeatedly need to be executed within exploratory clustering analyses. We address these challenges as follows: First, we present LOG-Means and show that it provides estimates for the number of clusters in sublinear time with respect to the defined search space, i.e., it provably requires fewer executions of a clustering algorithm than existing methods. Second, we demonstrate how to exploit fundamental characteristics of exploratory clustering analyses to significantly accelerate the (repetitive) execution of clustering algorithms on large datasets. Third, we show how both challenges can be tackled at the same time. To the best of our knowledge, this is the first work that simultaneously addresses these challenges. Our comprehensive evaluation shows that the proposed methods significantly outperform state-of-the-art methods, especially supporting novice analysts in exploratory clustering analyses over large-scale exploration processes.
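LOG-Means' actual estimator is considerably more refined than what fits here; the following toy sketch only conveys the sublinear idea: bisect the k search range and descend into the half where the relative drop in clustering error (SSE) is larger, so only a logarithmic number of clustering runs is needed instead of one per candidate k:

```python
def log_means(sse, k_low, k_high):
    """Toy LOG-Means-style search (far simpler than the paper's method):
    repeatedly bisect [k_low, k_high] and keep the half whose SSE ratio
    between its endpoints is larger, i.e. where adding clusters still pays
    off most. sse(k) runs one base clustering and returns its error, so
    only O(log(k_high - k_low)) clustering executions are needed."""
    evaluated = {k_low: sse(k_low), k_high: sse(k_high)}
    while k_high - k_low > 1:
        mid = (k_low + k_high) // 2
        evaluated[mid] = sse(mid)
        left = evaluated[k_low] / evaluated[mid]
        right = evaluated[mid] / evaluated[k_high]
        if left >= right:
            k_high = mid   # the elbow lies in the lower half
        else:
            k_low = mid    # the elbow lies in the upper half
    return k_high
```

With a toy error curve that drops steeply up to k=5 and flattens afterwards, the search converges near the elbow after only six `sse` evaluations for a range of 15 candidates.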



2012 ◽  
Vol 235 ◽  
pp. 15-19 ◽
Author(s):  
Li Min Liu ◽  
Xiao Ping Fan ◽  
Yue Shan Xie

Clustering ensemble is known as an effective method to improve the robustness and stability of clustering analysis. Clustering ensemble solves the problem in two steps: first, generating a large set of clustering partitions with base clustering algorithms; second, combining them using a consensus function to obtain the final clustering result. The key element of clustering ensemble is a proper consensus function. Recent research has proposed using matrix factorization to solve clustering ensemble. In this paper, we first analyze some traditional matrix factorization algorithms; second, we propose a new consensus function using binary nonnegative matrix factorization (BMF) and give an optimization algorithm for BMF; lastly, we propose a new clustering ensemble framework and report experiments on datasets from the UCI Machine Learning Repository. The experiments show that the new algorithm is effective and that clustering performance can be significantly improved.
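The input to a matrix-factorization consensus function is the binary cluster-membership (hypergraph) matrix built from the ensemble: one column per base cluster, one row per point. Constructing it is straightforward (the BMF optimization itself is omitted here); BMF then seeks binary factors W and S with H ≈ W·S, where the rows of W give the consensus assignment:

```python
def membership_matrix(partitions, n):
    """Binary hypergraph matrix H (n points x total base clusters):
    H[i][c] = 1 iff point i belongs to base cluster c. This is the
    standard input representation for factorization-based consensus."""
    cols = []
    for labels in partitions:
        for c in sorted(set(labels)):
            cols.append([1 if labels[i] == c else 0 for i in range(n)])
    # transpose so that rows correspond to points
    return [list(row) for row in zip(*cols)]
```

Two base partitions of three points, for instance, yield a 3x4 binary matrix with one column per base cluster.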


2017 ◽  
Vol 15 (06) ◽  
pp. 1740006 ◽  
Author(s):  
Mohammad Arifur Rahman ◽  
Nathan LaPierre ◽  
Huzefa Rangwala ◽  
Daniel Barbara

Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across various clinical and ecological environments. Due to the large volume of random short sequences (reads) obtained from community sequencing, analyzing the diversity, abundance and functions of the different organisms within these communities is challenging. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined using state-of-the-art sequence clustering algorithms. This canopy clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole-metagenome benchmarks. We demonstrate the ability of our approach to determine meaningful Operational Taxonomic Units (OTUs) and observe a significant speedup in run time compared with different clustering algorithms. We also make our source code publicly available on GitHub.
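The paper compares three hashing schemes, none of which is reproduced here. As a simple illustration of the canopy idea, keying each read by its lexicographically smallest k-mer gives a cheap, single-pass canopy assignment: reads that share the key land in the same coarse group, which an expensive clusterer can then refine:

```python
def min_kmer_canopy(reads, k=4):
    """Assign each read to a canopy keyed by its lexicographically
    smallest k-mer (a minimizer-style stand-in for the paper's hashing
    schemes). Reads sharing a key become candidates for the same
    fine-grained cluster."""
    canopies = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        canopies.setdefault(min(kmers), []).append(read)
    return canopies
```

Because each read is touched once and only short keys are compared, the pass is linear in total sequence length, which is what makes canopy construction a viable pre-processing step for large metagenomes.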


Author(s):  
Katti Faceli ◽  
Andre C.P.L.F. de Carvalho ◽  
Marcilio C.P. de Souto

Clustering is an important tool for data exploration. Several clustering algorithms exist, and new ones are frequently proposed in the literature. These algorithms have been very successful in a large number of real-world problems. However, no clustering algorithm that optimizes only a single criterion is able to reveal all types of structure (homogeneous or heterogeneous) present in a dataset. To deal with this problem, several multi-objective clustering and cluster ensemble methods have been proposed in the literature, including our multi-objective clustering ensemble algorithm. In this chapter, we present an overview of these methods, which, to a great extent, are based on combining various aspects of traditional clustering algorithms.


Author(s):  
Xianjin Shi ◽  
Wanwan Wang ◽  
Chongsheng Zhang

Over the past few decades, a great many data clustering algorithms have been developed, including K-Means, DBSCAN, Bi-Clustering and spectral clustering. In recent years, two new data clustering algorithms have been proposed: affinity propagation (AP, 2007) and density peak based clustering (DP, 2014). In this work, we empirically compare the performance of these two recent clustering algorithms against the state of the art, using 6 external and 2 internal clustering validation metrics. Our experimental results on 16 public datasets show that the two latest clustering algorithms, AP and DP, do not always outperform DBSCAN. Therefore, to find the best clustering algorithm for a specific dataset, AP, DP and DBSCAN should all be considered. Moreover, we find that the comparison of different clustering algorithms depends closely on the clustering evaluation metrics adopted. For instance, under the Silhouette validation metric, the overall performance of K-Means is as good as that of AP and DP. This work provides important reference values for researchers and engineers who need to select an appropriate clustering algorithm for their specific application.
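Internal validation metrics such as the Silhouette coefficient mentioned above need no ground-truth labels. A compact reference implementation of the standard definition, for small datasets:

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient: for each point, s = (b - a) / max(a, b),
    where a is the mean distance to its own cluster and b is the smallest
    mean distance to any other cluster. Requires at least two clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    scores = []
    for i in range(len(points)):
        own = [j for j in clusters[labels[i]] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(sum(math.dist(points[i], points[j]) for j in m) / len(m)
                for lab, m in clusters.items() if lab != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(points)
```

Two tight, well-separated pairs of points score close to 1, reflecting a clearly good partition; as the abstract notes, different metrics can rank the same algorithms differently.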


2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Hongjie Wu ◽  
Haiou Li ◽  
Min Jiang ◽  
Cheng Chen ◽  
Qiang Lv ◽  
...  

Background. One critical issue in protein three-dimensional structure prediction using either ab initio or comparative modeling involves identification of high-quality protein structural models from generated decoys. Currently, clustering algorithms are widely used to identify near-native models; however, their performance depends on the conformational decoys, and, for some algorithms, accuracy declines as the decoy population increases. Results. Here, we propose two enhanced K-means clustering algorithms capable of robustly identifying high-quality protein structural models. The first employs the clustering algorithm SPICKER to determine the initial centroids for basic K-means clustering (SK-means), whereas the other employs squared distance to optimize the initial centroids (K-means++). Our results showed that SK-means and K-means++ were more robust than SPICKER alone, detecting 33 (59%) and 42 (75%) of 56 targets, respectively, with template modeling scores better than or equal to those of SPICKER. Conclusions. We observed that the classic K-means algorithm performs similarly to SPICKER, a widely used algorithm for protein-structure identification. Both SK-means and K-means++ demonstrated substantial improvements over SPICKER and classic K-means.
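The K-means++ seeding used in the second variant is the standard procedure of Arthur and Vassilvitskii: the first centroid is chosen uniformly at random, and each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen. A minimal sketch:

```python
import math
import random

def kmeans_pp_init(points, k, rng=None):
    """k-means++ seeding: spread the k initial centroids out by sampling
    each new one proportionally to its squared distance from the nearest
    centroid picked so far."""
    rng = rng or random.Random(0)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # squared distance of each point to its nearest current centroid
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
    return centroids
```

Far-apart groups almost always receive their own seed, which is exactly the property that makes the subsequent K-means iterations more robust than random initialization.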

