clustering criterion Latest Research Papers

Rough sets in graphs using similarity relations

<abstract><p>In this paper, we apply rough set theory to the study of graphs using the concept of orbits. Rough sets are based on a clustering criterion, and we take the similarity of vertices under automorphism as that criterion. We introduce an indiscernibility relation in terms of orbits and prove necessary and sufficient conditions under which the indiscernibility partitions remain the same when associated with different attribute sets. We show that automorphisms of the graph $ \mathcal{G} $ preserve the indiscernibility partitions. Further, we prove that for any graph $ \mathcal{G} $ with $ k $ orbits, any reduct $ \mathcal{R} $ consists of one element from $ k-1 $ orbits of the graph. We also study the rough membership functions for paths, cycles, complete graphs, and complete bipartite graphs. Moreover, we introduce essential sets and discernibility matrices induced by orbits of graphs and study their relationship. We also prove that every essential set consists of the union of any two orbits of the graph.</p></abstract>
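To make the orbit idea concrete, the following is a minimal sketch (not the authors' construction) that finds the orbits of a small graph by brute force: it enumerates every vertex permutation that maps edges to edges and groups vertices that can be mapped onto each other. The function names `automorphisms` and `orbits` are hypothetical, and the approach is only feasible for very small graphs.

```python
from itertools import permutations


def automorphisms(n, edges):
    """Return all permutations of vertices 0..n-1 that map every edge to an edge."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        # A vertex bijection that sends each edge into the edge set
        # maps the (finite) edge set onto itself, so it is an automorphism.
        if all(frozenset((perm[u], perm[v])) in edge_set for u, v in edges):
            result.append(perm)
    return result


def orbits(n, edges):
    """Partition the vertices into orbits of the automorphism group."""
    autos = automorphisms(n, edges)
    seen, parts = set(), []
    for v in range(n):
        if v not in seen:
            orbit = {perm[v] for perm in autos}
            parts.append(sorted(orbit))
            seen |= orbit
    return parts


# Path 0-1-2-3: the two end vertices are similar, as are the two inner vertices.
path_edges = [(0, 1), (1, 2), (2, 3)]
print(orbits(4, path_edges))

# Cycle 0-1-2-3-0: every vertex can be mapped to every other vertex.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(orbits(4, cycle_edges))
```

Vertices in the same orbit are exactly those the abstract treats as indiscernible under automorphism, so the path has two classes while the cycle has one.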

Distance-based clustering challenges for unbiased benchmarking studies

Scientific Reports ◽

10.1038/s41598-021-98126-1 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Michael C. Thrun

Keyword(s):

Cluster Analysis ◽

Clustering Algorithms ◽

Quality Measures ◽

High Dimensional ◽

Data Sets ◽

Algorithm Selection ◽

Box Plots ◽

Benchmark Datasets ◽

Clustering Criterion ◽

Comparison Measures

Abstract Benchmark datasets with predefined cluster structures and high-dimensional biomedical datasets outline the challenges of cluster analysis: clustering algorithms are limited in their clustering ability in the presence of clusters defining distance-based structures, resulting in a biased clustering solution. Data sets might not have cluster structures. Clustering yields arbitrary labels and often depends on the trial, leading to varying results. Moreover, recent research indicated that all partition comparison measures can yield the same results for different clustering solutions. Consequently, algorithm selection and parameter optimization by unsupervised quality measures (QM) are always biased and misleading. Only if the predefined structures happen to meet the particular clustering criterion and QM can the clusters be recovered. Results are presented based on 41 open-source algorithms which are particularly useful in biomedical scenarios. Furthermore, comparative analysis with mirrored density plots provides a significantly more detailed benchmark than that with the typically used box plots or violin plots.
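Two of the claims above can be illustrated with a small sketch: cluster labels are arbitrary, and a partition comparison measure can give the same score to different clustering solutions. The example uses the plain Rand index as one such measure; the choice of measure and the toy label vectors are assumptions made for illustration, not taken from the paper.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two partitions agree
    (both put the pair in one cluster, or both separate it)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


reference = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]   # same grouping, label names swapped
solution_x = [0, 0, 1, 1, 1, 1]  # point 2 moved to the other cluster
solution_y = [0, 0, 0, 0, 1, 1]  # point 3 moved to the other cluster

print(rand_index(reference, relabeled))
print(rand_index(reference, solution_x))
print(rand_index(reference, solution_y))
```

Swapping the label names leaves the score unchanged, and the two distinct solutions `solution_x` and `solution_y` receive the same score against the reference, so the score alone cannot tell them apart.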

Distance-Based Clustering Challenges for Unbiased Benchmarking Studies

10.21203/rs.3.rs-301361/v1 ◽

2021 ◽

Author(s):

Michael Thrun

Keyword(s):

Cluster Analysis ◽

Open Source ◽

Clustering Algorithms ◽

Cluster Structure ◽

Quality Measures ◽

High Dimensional ◽

Biomedical Data ◽

Algorithm Selection ◽

Box Plots ◽

Clustering Criterion

Abstract Benchmark datasets with predefined cluster structures and high-dimensional biomedical datasets outline the challenges of cluster analysis: clustering algorithms are limited in their clustering ability in the presence of clusters defining distance-based structures, resulting in a biased clustering solution. Data sets might not have cluster structures. Clustering yields arbitrary labels and often depends on the trial, leading to varying results. Moreover, recent research indicated that all partition comparison measures can yield the same results for different clustering solutions. Consequently, algorithm selection and parameter optimization by unsupervised quality measures (QM) are always biased and misleading. Only if the predefined structures happen to meet the particular clustering criterion and QM can the clusters be recovered by one of the 34 open-source algorithms which are particularly useful in biomedical scenarios. Furthermore, comparative analysis with mirrored density plots provides a significantly more detailed benchmark than that with the typically used box plots or violin plots.
Modern biomedical analysis techniques such as next-generation sequencing (NGS) have opened the door for complex high-dimensional data acquisition in medicine. For example, The Cancer Genome Atlas (TCGA) project provides open source cancer data for a worldwide community. The availability of such rich data sources, which enable discovering new insights into disease-related genetic mechanisms, is challenging for data analysts. Genome- or transcriptome-wide association studies may reveal novel disease-related genes (e.g., [1]), and virtual karyotyping by NGS-based low-coverage whole-genome sequencing may replace the conventional karyotyping technique 130 years after von Waldeyer described human chromosomes [2].
However, deciphering previously unknown relations and hierarchies in high-dimensional biological datasets remains a challenge for knowledge discovery, meaning that the identification of valid, novel, potentially useful, and ultimately understandable patterns in data (e.g., [3]) is a difficult task. A common first step is identifying clusters of objects that are likely to be functionally related or interact [4], which has provoked debates about the most suitable clustering approaches. However, the definition of a cluster remains a matter of ongoing discussion [5,6]. Therefore, clustering is restricted here to the task of separating data into similar groups (cf. [7,8]). Intuitively, relative relationships between high-dimensional data points are of interest to build up structures in data that a cluster analysis can identify. Therefore, it remains essential to evaluate the results of clustering algorithms and grasp the differences in the structures they can catch. Recent research on cluster analysis conveys the message that relevant and possibly previously unknown relationships in high-dimensional biological datasets can be discovered by employing optimization procedures and automatic pipelines for either benchmarking or algorithm selection (e.g., [4,9]). The state-of-the-art approach is to use one or more unsupervised indices for automatic evaluation, e.g., Wiwie et al. [4] suggest the following guidelines for biomedical data: "Use […] [hierarchical clustering*] or PAM. (2) Compute the silhouette values for clustering results using a broad range of parameter set variations. (3) Pick the result for the parameter set yielding the highest silhouette value" (*Restricted to UPGMA or average linking, see https://clusteval.sdu.dk/1/programs). Alternatively, the authors provide the possibility of using the internal Davies–Bouldin [10] and Dunn [11] indices.
This work demonstrates the pitfalls and challenges of such approaches; more precisely, it shows that
• parameter optimization on datasets without distance-based clusters,
• algorithm selection by unsupervised quality measures on biomedical data, and
• benchmarking clustering algorithms with first-order statistics, box plots, or a small number of trials
are biased and often not recommended.
Evidence for these pitfalls in cluster analysis is provided through the systematic and unbiased evaluation of 34 open source clustering algorithms with several bodies of data that possess clearly defined structures. These insights are particularly useful for knowledge discovery in biomedical scenarios. Select distance-based structures are consistently defined in artificial samples of data with specific pitfalls for clustering algorithms. Moreover, two natural datasets with investigated cluster structures are employed, and it is shown that the data reflect a true and valid empirical biomedical entity. This work shows that the limitations of clustering methods induced by their clustering criterion cannot be overcome by optimizing the algorithm parameters with a global criterion because such optimization can only reduce the variance but not the intrinsic bias. This limitation is outlined in two examples in which, by optimizing the quality measure of the Davies–Bouldin index [10], Dunn index [11] or Silhouette value [12], a specific cluster structure is imposed, but the clinically relevant cluster structures are not reproduced. The biases of conventional clustering algorithms are investigated on five artificially defined data structures and two high-dimensional datasets. Furthermore, a clustering algorithm's parameters can still be significantly optimized even if the dataset does not possess any distance-based cluster structure.
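The last point, that a parameter can be "optimized" by a quality measure even when the data have no cluster structure, can be sketched with a toy example. The code below is an illustrative assumption, not the paper's experimental setup: it computes the mean silhouette value for evenly spaced one-dimensional points cut into k contiguous blocks and then picks the k with the highest value, mirroring the quoted "pick the highest silhouette" guideline. The helper `mean_silhouette` is hypothetical and expects at least two clusters.

```python
def mean_silhouette(points, labels):
    """Mean silhouette value for 1-D points (needs at least two clusters).
    Singleton clusters contribute 0, following the usual convention."""
    clusters = {}
    for x, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(x)
    total = 0.0
    for x, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(x - y) for y in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Evenly spaced points: no distance-based cluster structure at all.
points = [float(i) for i in range(12)]
scores = {}
for k in (2, 3, 4):
    labels = [i * k // len(points) for i in range(len(points))]
    scores[k] = mean_silhouette(points, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Every candidate k yields a positive silhouette value and one of them is selected as "best", although the data were constructed without any clusters.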

Performance evaluation of similarity measures for K-means clustering algorithm

Bayero Journal of Pure and Applied Sciences ◽

10.4314/bajopas.v12i2.21 ◽

2021 ◽

Vol 12 (2) ◽

pp. 144-148

Author(s):

D. Usman ◽

S.F. Sani

Keyword(s):

Clustering Algorithm ◽

Similarity Measures ◽

High Dimensional ◽

Manhattan Distance ◽

Intrinsic Structure ◽

Data Points ◽

Data Objects ◽

Clustering Criterion ◽

High Dimensional Datasets ◽

Dimensional Domain

Clustering is a useful technique that organizes a large quantity of unordered data into a small number of meaningful and coherent clusters. Every clustering method is based on an index of similarity or dissimilarity between data points. The true intrinsic structure of the data can be described correctly by the similarity formula defined and embedded in the clustering criterion function. This paper uses squared Euclidean distance and Manhattan distance to investigate the best method for measuring similarity between data objects in sparse, high-dimensional domains: one that is fast, consistent, and capable of providing high-quality clustering results. The performance of these two methods is reported on simulated high-dimensional datasets.
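As a small illustration of why the choice of similarity measure matters, the sketch below assigns one point to its nearest centroid under each of the two distances named in the abstract. The centroids and the point are made-up values chosen so that the two measures disagree; this is not the authors' experiment.

```python
def squared_euclidean(p, q):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def nearest_centroid(point, centroids, distance):
    """Index of the centroid closest to `point` under the given distance."""
    return min(range(len(centroids)), key=lambda k: distance(point, centroids[k]))


centroids = [(0.0, 0.0), (2.0, 2.5)]
point = (3.0, 0.0)

print(nearest_centroid(point, centroids, squared_euclidean))
print(nearest_centroid(point, centroids, manhattan))
```

Squared Euclidean distance penalizes the single large coordinate gap to the first centroid more heavily, so it picks the second centroid, while Manhattan distance picks the first. The assignment step of K-means therefore depends on which measure is embedded in the criterion.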

Exploring patterns of corporate social responsibility using a complementary K-means clustering criterion

BuR - Business Research ◽

10.1007/s40685-019-00106-9 ◽

2020 ◽

Vol 13 (2) ◽

pp. 513-540 ◽

Cited By ~ 1

Author(s):

Zina Taran ◽

Boris Mirkin

Keyword(s):

Corporate Social Responsibility ◽

Social Responsibility ◽

Business Processes ◽

Research Effort ◽

Four Dimensions ◽

Corporate Social ◽

Small Clusters ◽

Clustering Criterion ◽

Single Focus

Abstract Companies’ objectives extend beyond mere profitability, to what is generally known as Corporate Social Responsibility (CSR). Empirical research on CSR typically concentrates on a limited number of aspects. We focus on the whole set of CSR activities to identify any structure in that set. In this analysis, we take data from 1850 of the largest international companies via the conventional MSCI database and focus on four major dimensions of CSR: Environment, Social/Stakeholder, Labor, and Governance. To identify any structure hidden in almost constant average values, we apply the popular technique of K-means clustering. When determining the number of clusters, which is especially difficult in the case at hand, we use an equivalent clustering criterion that is complementary to the square-error K-means criterion. Our use of this complementary criterion aims at obtaining clusters that are both large and farthest away from the center. We derive from this a method of extracting anomalous clusters one-by-one with a follow-up removal of small clusters. This method has allowed us to discover a rather impressive process of change from predominantly uniform patterns of CSR activities along the four dimensions in 2007 to predominantly single-focus patterns of CSR activities in 2012. This change may reflect the dynamics of increasingly interweaving and structuring CSR activities into business processes that are likely to be extended into the future.
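The idea of a criterion complementary to the square-error one can be sketched with the standard scatter decomposition, stated here as an assumption about what the abstract refers to: for data centered at the grand mean, the total scatter T splits into the within-cluster square error W and a term B that sums, over clusters, the cluster size times the squared distance of its centroid from the origin. Minimizing W is then equivalent to maximizing B, which rewards clusters that are large and far from the center. The function names are hypothetical.

```python
def center(data):
    """Subtract the grand mean from every row."""
    dim = len(data[0])
    mean = [sum(row[j] for row in data) / len(data) for j in range(dim)]
    return [[row[j] - mean[j] for j in range(dim)] for row in data]


def scatter_decomposition(data, labels):
    """Return (T, W, B) for the centered data, where T = W + B."""
    x = center(data)
    dim = len(x[0])
    groups = {}
    for row, lab in zip(x, labels):
        groups.setdefault(lab, []).append(row)
    total = sum(v * v for row in x for v in row)
    within = 0.0
    between = 0.0
    for rows in groups.values():
        c = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
        within += sum((r[j] - c[j]) ** 2 for r in rows for j in range(dim))
        # Complementary term: cluster size times squared centroid norm.
        between += len(rows) * sum(v * v for v in c)
    return total, within, between


data = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
t, w, b = scatter_decomposition(data, [0, 0, 1, 1])
print(t, w, b)
```

Because T is fixed by the data, any partition that lowers W raises B by the same amount, which is why the two criteria select the same partitions.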

A Generalized Multivariate Approach for Possibilistic Fuzzy C-Means Clustering

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s021848851850040x ◽

2018 ◽

Vol 26 (06) ◽

pp. 893-916 ◽

Cited By ~ 2

Author(s):

Bruno Almeida Pimentel ◽

Renata M. C. R. de Souza

Keyword(s):

Synthetic Data ◽

Data Sets ◽

Multivariate Approach ◽

Fuzzy C Means ◽

Special Cases ◽

Possibilistic Clustering ◽

Clustering Quality ◽

Fuzzy C Means Clustering ◽

Clustering Criterion ◽

Squared Euclidean Distance

Fuzzy c-Means (FCM) and Possibilistic c-Means (PCM) are the most popular algorithms of the fuzzy and possibilistic clustering approaches, respectively. A hybridization of these methods, called Possibilistic Fuzzy c-Means (PFCM), solves the noise sensitivity defect of FCM and overcomes the coincident clusters problem of PCM. Although PFCM has shown good performance in cluster detection, it does not consider that different variables can produce different membership and possibility degrees, which can improve the clustering quality, as has been done with the Multivariate Fuzzy c-Means (MFCM). This work presents a generalized multivariate approach for possibilistic fuzzy c-means clustering. This approach gives a general form for the clustering criterion of possibilistic fuzzy clustering, with membership and possibility degrees that differ by cluster and variable and a weighted squared Euclidean distance that takes the shape of clusters into account. Six multivariate clustering models (special cases) can be derived from this general form, and their properties are presented. Experiments with real and synthetic data sets validate the usefulness of the approach introduced in this paper using the special cases.
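The difference between membership and possibility degrees can be sketched with the textbook single-point update formulas for FCM and PCM. This is an illustration of the two classical ingredients only, not the generalized multivariate model of the paper; the fuzzifiers `m` and `eta`, the scale parameters `gammas`, and the distance values are assumed for the example, and all distances are taken to be strictly positive.

```python
def fcm_memberships(sq_dists, m=2.0):
    """FCM membership degrees of one point, given its squared distances
    to each centroid. The degrees always sum to 1 across clusters."""
    power = 1.0 / (m - 1.0)
    return [
        1.0 / sum((d_k / d_j) ** power for d_j in sq_dists)
        for d_k in sq_dists
    ]


def pcm_typicalities(sq_dists, gammas, eta=2.0):
    """PCM possibility (typicality) degrees of one point; each degree
    depends only on the distance to its own cluster and its scale gamma."""
    power = 1.0 / (eta - 1.0)
    return [1.0 / (1.0 + (d / g) ** power) for d, g in zip(sq_dists, gammas)]


near_point = [1.0, 9.0]       # squared distances to two centroids
outlier = [100.0, 100.0]      # far from both centroids

print(fcm_memberships(near_point))
print(fcm_memberships(outlier))
print(pcm_typicalities(outlier, gammas=[1.0, 1.0]))
```

The outlier still receives memberships of one half in each cluster under FCM because the degrees must sum to 1, which is the noise sensitivity mentioned above, while its possibility degrees are close to zero.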

Managing the complexity of new product development project from the perspectives of customer needs and entropy

Concurrent Engineering ◽

10.1177/1063293x18798001 ◽

2018 ◽

Vol 26 (4) ◽

pp. 328-340 ◽

Cited By ~ 4

Author(s):

Qing Yang ◽

Chen Shan ◽

Bin Jiang ◽

Na Yang ◽

Tao Yao

Keyword(s):

Product Development ◽

New Product Development ◽

New Product ◽

Development Project ◽

Customer Needs ◽

Design Structure ◽

Proposed Model ◽

Function Modules ◽

The Right ◽

Clustering Criterion

To successfully develop a complex product, a firm must answer two critical questions: how to “develop the right product” and how to “develop the product right.” Motivated by the real practice of Xiaomi’s new product development (NPD) projects, this article responds to these calls in the following ways. To design the right product, using an enhanced PageRank algorithm to investigate customer needs, NPD managers can select appropriate function modules of the new product to meet customers’ demand. To develop the new product in the right way, NPD managers should optimize the NPD organization. This article applies the multi-domain matrix (MDM) to identify the technical coordination dependency strength among different teams and then to measure the NPD organization’s complexity according to its entropy. By proposing the External Entropy of Cluster (EEC) and Internal Entropy of Cluster (IEC), we develop an entropy-based two-stage clustering criterion of design structure matrix (DSM) to optimize the NPD organization. The first-stage clustering criterion maximizes the added average dependency strength of DSM, and the second-stage clustering criterion minimizes the Weighted Total Entropy, including the IEC and EEC. An industrial example is provided to illustrate the proposed model. The results indicate that the clustered DSM can reduce the organization’s complexity.

Radical right parties and European economic integration: Evidence from the seventh European Parliament

European Union Politics ◽

10.1177/1465116518760241 ◽

2018 ◽

Vol 19 (2) ◽

pp. 321-343 ◽

Cited By ~ 3

Author(s):

Matteo Cavallaro ◽

David Flacher ◽

Massimo Angelo Zanetti

Keyword(s):

European Parliament ◽

Two Dimensions ◽

Radical Right ◽

Future Research ◽

The European Union ◽

Additive Trees ◽

Radical Right Parties ◽

Party Family ◽

Clustering Criterion

This article explores the differences in radical right parties' voting behaviour on economic matters at the European Parliament. As the literature highlights the heterogeneity of these parties in relation to their economic programmes, we test whether divergences survive the elections and translate into dissimilar voting patterns. Using voting records from the seventh term of the European Parliament, we show that radical right parties do not act as a consolidated party family. We then analyse the differences between radical right parties by means of different statistical methods (NOMINATE, Ward's clustering criterion, and additive trees) and find that these are described along two dimensions: the degree of opposition to the European Union and the classical left–right economic cleavage. We provide a classification of these parties comprising four groups: pro-welfare conditional, pro-market conditional, and rejecting. Our results indicate that radical right parties do not act as a party family at the European Parliament. This remains true regardless of the salience of the policy issues in their agendas. The article also derives streams for future research on the heterogeneity of radical right parties.

Optimalisasi K-MEDOID dalam Pengklasteran Mahasiswa Pelamar Beasiswa dengan CUBIC CLUSTERING CRITERION

Jurnal Teknologi dan Sistem Informasi ◽

10.25077/teknosi.v3i1.2017.211-218 ◽

2017 ◽

Vol 3 (1) ◽

pp. 211-218

Author(s):

Sofi Defiyanti ◽

Mohamad Jajuli ◽

Nurul Rohmawati

Keyword(s):

Clustering Criterion

A scholarship is one form of study assistance given to students. One such scholarship is the state-funded Bantuan Belajar Mahasiswa (BBM). Clustering the data of scholarship recipients helps determine which students are eligible, under consideration, or not eligible. This grouping makes it easier for the administration office to decide on scholarship recipients, particularly for the BBM scholarship. The grouping can be carried out with a partition-based clustering technique, namely the K-Medoids algorithm. The data to be clustered consist of the attributes credit units (SKS), grade point average (IPK), number of parental dependents, and parental income. The values in the data are diverse, and their ranges lie far apart from one another. Three scenarios were therefore run: (1) all data clustered with K-Medoids as obtained, (2) part of the data coded, and (3) all data coded. A Cubic Clustering Criterion (CCC) value was obtained for each of the three scenarios. The fully coded dataset shows a CCC value between 2 and 3, which indicates that it has good uniformity. This is because the values of every attribute are nearly the same.

Anomaly Detection in Network Traffic with a Relational Clustering Criterion

Lecture Notes in Computer Science - Geometric Science of Information ◽

10.1007/978-3-319-68445-1_15 ◽

2017 ◽

pp. 127-134

Author(s):

Damien Nogues

Keyword(s):

Anomaly Detection ◽

Network Traffic ◽

Clustering Criterion

clustering criterion
Recently Published Documents

Rough sets in graphs using similarity relations

Distance-based clustering challenges for unbiased benchmarking studies

Distance-Based Clustering Challenges for Unbiased Benchmarking Studies

Performance evaluation of similarity measures for K-means clustering algorithm

Exploring patterns of corporate social responsibility using a complementary K-means clustering criterion

A Generalized Multivariate Approach for Possibilistic Fuzzy C-Means Clustering

Managing the complexity of new product development project from the perspectives of customer needs and entropy

Radical right parties and European economic integration: Evidence from the seventh European Parliament

Optimalisasi K-MEDOID dalam Pengklasteran Mahasiswa Pelamar Beasiswa dengan CUBIC CLUSTERING CRITERION

Anomaly Detection in Network Traffic with a Relational Clustering Criterion
