Clustering assessment in weighted networks

PeerJ Computer Science ◽

10.7717/peerj-cs.600 ◽

2021 ◽

Vol 7 ◽

pp. e600

Author(s):

Argimiro Arratia ◽

Martí Renedo Mirambell

Keyword(s):

Community Structure ◽

Bootstrap Method ◽

Clustering Algorithms ◽

Ground Truth ◽

Scoring Functions ◽

Clustering Methods ◽

Weighted Networks ◽

Validation Methods ◽

Clustering Quality ◽

Clustering Validation

We provide a systematic approach to validate the results of clustering methods on weighted networks, in particular for the cases where the existence of a community structure is unknown. Our validation of clustering comprises a set of criteria for assessing their significance and stability. To test for cluster significance, we introduce a set of community scoring functions adapted to weighted networks, and systematically compare their values to those of a suitable null model. For this we propose a switching model to produce randomized graphs with weighted edges while maintaining the degree distribution constant. To test for cluster stability, we introduce a non parametric bootstrap method combined with similarity metrics derived from information theory and combinatorics. In order to assess the effectiveness of our clustering quality evaluation methods, we test them on synthetically generated weighted networks with a ground truth community structure of varying strength based on the stochastic block model construction. When applying the proposed methods to these synthetic ground truth networks’ clusters, as well as to other weighted networks with known community structure, these correctly identify the best performing algorithms, which suggests their adequacy for cases where the clustering structure is not known. We test our clustering validation methods on a varied collection of well known clustering algorithms applied to the synthetically generated networks and to several real world weighted networks. All our clustering validation methods are implemented in R, and will be released in the upcoming package clustAnalytics.
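The abstract describes the switching null model only in outline. As a rough illustration, the sketch below uses the classic double edge swap and lets each weight travel with its edge. That weight rule, the function names `degree` and `switch_edges`, and the edge-dictionary format are all assumptions made for this example; they are not the authors' method and not the clustAnalytics API. This variant keeps the number of neighbours of every node fixed, but not the sum of incident weights.

```python
import random


def degree(edges):
    """Count the neighbours of every node in an undirected edge dict."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def switch_edges(edges, n_swaps, seed=0):
    """Randomize a weighted graph with double edge swaps.

    `edges` maps (u, v) tuples with u < v to a weight. Each attempted swap
    turns edges a-b and c-d into a-d and c-b; the weight of a-b moves to a-d
    and the weight of c-d moves to c-b (an assumption of this sketch).
    """
    rng = random.Random(seed)
    edges = dict(edges)  # work on a copy
    for _ in range(n_swaps):
        e1, e2 = rng.sample(sorted(edges), 2)
        a, b = e1
        c, d = e2
        if rng.random() < 0.5:
            c, d = d, c  # choose one of the two possible rewirings
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop
        new1 = (min(a, d), max(a, d))
        new2 = (min(c, b), max(c, b))
        if new1 in edges or new2 in edges:
            continue  # would create a multi-edge
        w1 = edges.pop(e1)
        w2 = edges.pop(e2)
        edges[new1] = w1
        edges[new2] = w2
    return edges


ring = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 0.5,
        (3, 4): 1.5, (4, 5): 3.0, (0, 5): 2.5}
randomized = switch_edges(ring, n_swaps=20)
```

A scoring function evaluated on many such randomized graphs gives a reference distribution against which the score on the original network can be compared.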

Quantitative Measurement of Breast Tumors Using Intravoxel Incoherent Motion (IVIM) MR Images

Journal of Personalized Medicine ◽

10.3390/jpm11070656 ◽

2021 ◽

Vol 11 (7) ◽

pp. 656

Author(s):

Si-Wa Chan ◽

Wei-Hsuan Hu ◽

Yen-Chieh Ouyang ◽

Hsien-Chi Su ◽

Chin-Yao Lin ◽

...

Keyword(s):

Quantitative Measurement ◽

Hyperspectral Image ◽

Clustering Algorithms ◽

Ground Truth ◽

Breast Tumors ◽

Tumor Detection ◽

Intravoxel Incoherent Motion ◽

Detection Methods ◽

Clustering Methods ◽

Fuzzy C Means Clustering

Breast magnetic resonance imaging (MRI) is currently a widely used clinical examination tool. Recently, MR diffusion-related technologies, such as intravoxel incoherent motion diffusion weighted imaging (IVIM-DWI), have been extensively studied by breast cancer researchers and gradually adopted in clinical practice. In this study, we explored automatic tumor detection by IVIM-DWI. We considered the acquired IVIM-DWI data as a hyperspectral image cube and used a well-known hyperspectral subpixel target detection technique: constrained energy minimization (CEM). Two extended CEM methods—kernel CEM (K-CEM) and iterative CEM (I-CEM)—were employed to detect breast tumors. The K-means and fuzzy C-means clustering algorithms were also evaluated. The quantitative measurement results were compared to dynamic contrast-enhanced T1-MR imaging as ground truth. All four methods were successful in detecting tumors for all the patients studied. The clustering methods were found to be faster, but the CEM methods demonstrated better performance according to both the Dice and Jaccard metrics. These unsupervised tumor detection methods have the advantage of potentially eliminating operator variability. The quantitative results can be measured by using ADC, signal attenuation slope, D*, D, and PF parameters to classify tumors of mass, non-mass, cyst, and fibroadenoma types.
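The Dice and Jaccard metrics mentioned in the abstract both measure the overlap between a detected region and a reference region. A minimal sketch follows, assuming each region is given as a collection of voxel indices; the function name `dice_jaccard` and the input format are hypothetical and not taken from the paper.

```python
def dice_jaccard(detected, reference):
    """Return (Dice, Jaccard) overlap of two collections of voxel indices."""
    d, r = set(detected), set(reference)
    union = len(d | r)
    if union == 0:
        return 1.0, 1.0  # two empty regions are treated as a perfect match
    inter = len(d & r)
    dice = 2.0 * inter / (len(d) + len(r))
    jaccard = inter / union
    return dice, jaccard


dice, jaccard = dice_jaccard({1, 2, 3}, {2, 3, 4})
```

For this toy input the two regions share two of four distinct voxels, so the Jaccard index is 0.5 and the Dice coefficient is 2/3. The two are related by Dice = 2J / (1 + J), so they always rank methods in the same order.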

An improved ant algorithm with LDA-based representation for text document clustering

Journal of Information Science ◽

10.1177/0165551516638784 ◽

2016 ◽

Vol 43 (2) ◽

pp. 275-292 ◽

Cited By ~ 24

Author(s):

Aytug Onan ◽

Hasan Bulut ◽

Serdar Korukoglu

Keyword(s):

Clustering Algorithm ◽

Latent Dirichlet Allocation ◽

Clustering Algorithms ◽

Document Clustering ◽

Clustering Methods ◽

Initial Value ◽

Text Document ◽

Clustering Quality ◽

Text Features

Document clustering can be applied in document organisation and browsing, document summarisation and classification. The identification of an appropriate representation for textual documents is extremely important for the performance of clustering or classification algorithms. Textual documents suffer from the high dimensionality and irrelevancy of text features. Besides, conventional clustering algorithms suffer from several shortcomings, such as slow convergence and sensitivity to the initial value. To tackle the problems of conventional clustering algorithms, metaheuristic algorithms are frequently applied to clustering. In this paper, an improved ant clustering algorithm is presented, where two novel heuristic methods are proposed to enhance the clustering quality of ant-based clustering. In addition, the latent Dirichlet allocation (LDA) is used to represent textual documents in a compact and efficient way. The clustering quality of the proposed ant clustering algorithm is compared to the conventional clustering algorithms using 25 text benchmarks in terms of F-measure values. The experimental results indicate that the proposed clustering scheme outperforms the compared conventional and metaheuristic clustering methods for textual documents.

ClusterEnG: an interactive educational web resource for clustering and visualizing high-dimensional data

PeerJ Computer Science ◽

10.7717/peerj-cs.155 ◽

2018 ◽

Vol 4 ◽

pp. e155 ◽

Cited By ~ 3

Author(s):

Mohith Manjunath ◽

Yi Zhang ◽

Yeonsung Kim ◽

Steve H. Yeo ◽

Omar Sobh ◽

...

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Clustering Methods ◽

Web Interface ◽

Web Resource ◽

Interactive Visualizations ◽

Data Points ◽

Clustering Data ◽

Clustering Validation ◽

Intuitive Manner

Background: Clustering is one of the most common techniques in data analysis and seeks to group together data points that are similar in some measure. Although there are many computer programs available for performing clustering, a single web resource that provides several state-of-the-art clustering methods, interactive visualizations and evaluation of clustering results is lacking. Methods: ClusterEnG (acronym for Clustering Engine for Genomics) provides a web interface for clustering data and interactive visualizations including 3D views, data selection and zoom features. Eighteen clustering validation measures are also presented to aid the user in selecting a suitable algorithm for their dataset. ClusterEnG also aims at educating the user about the similarities and differences between various clustering algorithms and provides tutorials that demonstrate potential pitfalls of each algorithm. Conclusions: The web resource will be particularly useful to scientists who are not conversant with computing but want to understand the structure of their data in an intuitive manner. The validation measures facilitate the process of choosing a suitable clustering algorithm among the available options. ClusterEnG is part of a bigger project called KnowEnG (Knowledge Engine for Genomics) and is available at http://education.knoweng.org/clustereng.


Constraint-based discriminative dimension selection for high-dimensional stream clustering

International Journal of Advances in Intelligent Informatics ◽

10.26555/ijain.v4i3.271 ◽

2018 ◽

Vol 4 (3) ◽

pp. 167

Author(s):

Kitsana Waiyamai ◽

Thanapat Kangkachit

Keyword(s):

Data Streams ◽

Clustering Algorithm ◽

Expert Knowledge ◽

Clustering Algorithms ◽

Clustering Methods ◽

Dynamic Constraints ◽

Stream Clustering ◽

Clustering Quality ◽

Active Research ◽

Clustering Data

Clustering data streams is an active research topic in data mining. However, the runtime of existing stream clustering algorithms increases and their performance drops when the number of dimensions is large. The complexity of stream clustering methods also increases when they are applied to data with a large number of dimensions. To reduce the clustering complexity, one possible solution is to determine the appropriate subset of cluster dimensions via dimension projection. SED-Stream is an efficient clustering algorithm that supports high-dimensional data streams. The aim of this paper is to improve the performance of SED-Stream in terms of both clustering quality and execution time. To improve the clustering process, background or domain expert knowledge is integrated as “constraints” in SEDC-Stream. The new algorithm, SEDC-Stream, supports the evolving characteristics of dynamic constraints, which are activation, fading, outdating and prioritization. SEDC-Stream is able to reduce cluster splitting time and to place new incoming points in their suitable clusters. Compared to SED-Stream on three real-world stream datasets, SEDC-Stream achieves better clustering performance in terms of both purity and F-measure.
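Purity, one of the two evaluation measures named in the abstract, has a standard definition: each cluster is credited with its most frequent true class, and the credited points are divided by the total number of points. The sketch below implements that textbook definition; the function name `purity` and the toy labels are assumptions for illustration and do not come from the paper.

```python
from collections import Counter


def purity(cluster_labels, true_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    if len(cluster_labels) != len(true_labels):
        raise ValueError("label sequences must have the same length")
    per_cluster = {}
    for c, t in zip(cluster_labels, true_labels):
        per_cluster.setdefault(c, Counter())[t] += 1
    # Credit each cluster with the size of its largest true class.
    credited = sum(max(counts.values()) for counts in per_cluster.values())
    return credited / len(true_labels)


score = purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"])
```

In the toy input, cluster 0 holds two points of class "a" and one of class "b", and cluster 1 holds three points of class "b", so five of the six points are credited. Purity rises as the number of clusters grows, which is one reason it is usually reported together with a second measure such as the F-measure.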

Towards Expert-Inspired Automatic Criterion to Cut a Dendrogram for Real-Industrial Applications

10.3233/faia210140 ◽

2021 ◽

Author(s):

Shikha Suman ◽

Ashutosh Karna ◽

Karina Gibert

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithms ◽

Computational Cost ◽

Real Life ◽

Ground Truth ◽

Industrial Applications ◽

Underlying Structure ◽

Cluster Validity ◽

Cluster Validity Index ◽

Number Of Clusters

Hierarchical clustering is one of the most preferred choices for understanding the underlying structure of a dataset and defining typologies, with multiple applications in real life. Among the existing clustering algorithms, the hierarchical family is one of the most popular, as it makes it possible to understand the inner structure of the dataset and to find the number of clusters as an output, unlike popular methods such as k-means. One can adjust the granularity of the final clustering to the goals of the analysis. The number of clusters in a hierarchical method relies on the analysis of the resulting dendrogram. Experts have criteria to visually inspect the dendrogram and determine the number of clusters. Finding automatic criteria that imitate experts in this task is still an open problem. However, dependence on the expert to cut the tree is a limitation in real applications such as Industry 4.0 and additive manufacturing. This paper analyses several cluster validity indexes in the context of determining a suitable number of clusters in hierarchical clustering. A new Cluster Validity Index (CVI) is proposed that properly captures the implicit criteria used by experts when analyzing dendrograms. The proposal has been applied to a range of datasets and validated against expert ground truth, where it outperforms the state of the art and also significantly reduces the computational cost.

A semantic approach for text document clustering using frequent itemsets and WordNet

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.9.10220 ◽

2018 ◽

Vol 7 (2.18) ◽

pp. 102

Author(s):

Harsha Patil ◽

Ramjeevan Singh Thakur

Keyword(s):

Clustering Algorithms ◽

Document Clustering ◽

Knowledge Bases ◽

Experimental Result ◽

Semantic Approach ◽

Text Document ◽

Clustering Quality ◽

Ship Function ◽

Membership Score ◽

Specific Cluster

Document clustering is an unsupervised method for grouping documents into clusters on the basis of their similarity. A document is assigned to a specific cluster on the basis of its membership score, which is calculated through a membership function. However, many traditional clustering algorithms are based only on the bag-of-words (BOW) model, which ignores the semantic similarity between document and cluster. In this research, we consider the semantic association between cluster and text document when calculating the membership score of a document for a specific cluster. Several researchers are working on semantic aspects of document clustering to improve clustering performance. Many external knowledge bases, such as WordNet, Wikipedia and Lucene, are utilized for this purpose. The proposed approach exploits WordNet to improve the cluster membership function. The experimental results show that clustering quality improves significantly with the proposed semantic framework.

A Bi-directional Fuzzy C-Means Clustering Ensemble Algorithm Considering Local Information

International Journal of Computational Intelligence Systems ◽

10.1007/s44196-021-00014-z ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Chunhua Ren ◽

Linfu Sun

Keyword(s):

Clustering Algorithms ◽

Real Data ◽

Local Information ◽

Data Sets ◽

Clustering Ensemble ◽

K Nearest Neighbors ◽

Fuzzy C Means ◽

Clustering Quality ◽

Fuzzy C Means Clustering ◽

Fcm Clustering

The classic Fuzzy C-means (FCM) algorithm has limited clustering performance and is prone to misclassification of border points. This study offers a bi-directional FCM clustering ensemble approach that takes local information into account (LI_BIFCM) to overcome these challenges and increase clustering quality. First, various membership matrices are created after running FCM multiple times, based on the randomization of the initial cluster centers, and a vertical ensemble is performed using the maximum membership principle. Second, after each execution of FCM, multiple local membership matrices of the sample points are created using multiple K-nearest neighbors, and a horizontal ensemble is performed. Multiple horizontal ensembles can be created using multiple FCM clusterings. Finally, the final clustering results are obtained by combining the vertical and horizontal clustering ensembles. Twelve data sets were chosen for testing from both synthetic and real data sources. In the experiments, LI_BIFCM outperformed four traditional clustering algorithms and three clustering ensemble algorithms. Furthermore, the final clustering results have a weak correlation with the bi-directional cluster ensemble parameters, indicating that the suggested technique is robust.
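The maximum membership principle named in the abstract converts a fuzzy membership matrix into hard labels by assigning every sample to the cluster in which its membership is largest. The sketch below shows only that single step; the function name `harden` and the row-per-sample matrix layout are assumptions, and the ensemble procedure of LI_BIFCM itself is not reproduced here.

```python
def harden(memberships):
    """Apply the maximum membership principle to a fuzzy membership matrix.

    `memberships[i][k]` is the degree to which sample i belongs to cluster k.
    Ties are broken in favour of the lowest cluster index.
    """
    labels = []
    for row in memberships:
        # max() returns the first index that attains the largest membership.
        labels.append(max(range(len(row)), key=row.__getitem__))
    return labels


U = [
    [0.7, 0.3],  # clearly cluster 0
    [0.4, 0.6],  # clearly cluster 1
    [0.5, 0.5],  # border point: tie, resolved to cluster 0
]
hard_labels = harden(U)
```

The third row illustrates the border-point problem the abstract refers to: a sample with nearly equal memberships receives a hard label that depends on an arbitrary tie-breaking rule.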

RFCell: A Gene Selection Approach for scRNA-seq Clustering Based on Permutation and Random Forest

Frontiers in Genetics ◽

10.3389/fgene.2021.665843 ◽

2021 ◽

Vol 12 ◽

Author(s):

Yuan Zhao ◽

Zhao-Yu Fang ◽

Cui-Xiang Lin ◽

Chao Deng ◽

Yun-Pei Xu ◽

...

Keyword(s):

Random Forest ◽

Single Cell ◽

Gene Selection ◽

Clustering Algorithms ◽

Selection Methods ◽

Clustering Methods ◽

Cell Type ◽

Cell Type Specificity ◽

Random Forest Classification ◽

Forest Classification

In recent years, the application of single cell RNA-seq (scRNA-seq) has become more and more popular in fields such as biology and medical research. Analyzing scRNA-seq data can discover complex cell populations and infer single-cell trajectories in cell development. Clustering is one of the most important methods to analyze scRNA-seq data. In this paper, we focus on improving scRNA-seq clustering through gene selection, which also reduces the dimensionality of scRNA-seq data. Studies have shown that gene selection for scRNA-seq data can improve clustering accuracy. Therefore, it is important to select genes with cell type specificity. Gene selection not only helps to reduce the dimensionality of scRNA-seq data, but also can improve cell type identification in combination with clustering methods. Here, we proposed RFCell, a supervised gene selection method, which is based on permutation and random forest classification. We first use RFCell and three existing gene selection methods to select gene sets on 10 scRNA-seq data sets. Then, three classical clustering algorithms are used to cluster the cells obtained by these gene selection methods. We found that the gene selection performance of RFCell was better than other gene selection methods.

Comparison of dimensionality reduction and clustering methods for SARS-CoV-2 genome

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v10i4.2803 ◽

2021 ◽

Vol 10 (4) ◽

pp. 2170-2180

Author(s):

Untari N. Wisesty ◽

Tati Rajab Mengko

Keyword(s):

Dimensionality Reduction ◽

Dimensional Reduction ◽

Clustering Algorithm ◽

Sequence Data ◽

Clustering Algorithms ◽

Gaussian Mixture Models ◽

Reduction Process ◽

Principal Component ◽

Gaussian Mixture ◽

Clustering Methods

This paper analyzes SARS-CoV-2 genome variation by comparing the results of genome clustering using several clustering algorithms and the distribution of sequences in each cluster. The clustering algorithms used are K-means, Gaussian mixture models, agglomerative hierarchical clustering, mean-shift clustering, and DBSCAN. However, clustering algorithms have a weakness in grouping very high-dimensional data such as genome data, so a dimensionality reduction process is needed. In this research, dimensionality reduction was carried out using principal component analysis (PCA) and an autoencoder, with three models that produce 2, 10, and 50 features. The main contributions are the dimensionality reduction and clustering scheme for SARS-CoV-2 sequence data and the performance analysis of each experiment for each scheme and the hyperparameters of each method. Based on the experimental results, PCA combined with DBSCAN achieves the highest silhouette score of 0.8770, with three clusters, when using two features. However, dimensionality reduction using the autoencoder needs more iterations to converge. In the testing process with Indonesian sequence data, more than half of the sequences fall into one cluster and the rest are distributed across the other two clusters.
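The silhouette score used in the abstract to compare schemes has a standard definition: for each point, a is the mean distance to the other points of its own cluster, b is the smallest mean distance to the points of any other cluster, and the point's silhouette is (b - a) / max(a, b). The sketch below is a plain-Python version of that definition with Euclidean distance. The function name, the tuple-based input format and the toy data are assumptions for illustration; the paper's own pipeline is not reproduced.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance.

    `points` is a list of coordinate tuples and `labels` a list of cluster ids.
    Points in singleton clusters contribute 0, a common convention.
    """
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for p, c in zip(points, labels):
        own = clusters[c]
        if len(own) == 1:
            continue
        # Mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for k, other in clusters.items()
            if k != c
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_score = silhouette_score(toy_points, [0, 0, 1, 1])
```

Values close to 1 indicate compact, well-separated clusters, which is how a score such as the 0.8770 reported above is usually read. This version computes all pairwise distances and is therefore quadratic in the number of points.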

PRIVACY PRESERVING CLUSTERING BASED ON LINEAR APPROXIMATION OF FUNCTION

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v12i5.2914 ◽

2013 ◽

Vol 12 (5) ◽

pp. 3443-3451

Author(s):

Rajesh Pasupuleti ◽

Narsimha Gugulothu

Keyword(s):

Linear Approximation ◽

Clustering Algorithms ◽

Similarity Measures ◽

Privacy Preserving ◽

Distance Measures ◽

Clustering Methods ◽

Sensitive Data ◽

Processing Information ◽

Data Objects ◽

Approximation Of Function

Clustering analysis initiates a new direction in data mining that has a major impact in various domains, including machine learning, pattern recognition, image processing, information retrieval and bioinformatics. Current clustering techniques do not adequately address some of the requirements and have failed to standardize clustering algorithms so that they support all real applications. Many clustering methods depend on user-specified parameters, and the initial seeds of clusters are randomly selected by the user. In this paper, we propose a new clustering method based on linear approximation of a function: it obtains an overall idea of the behavior of the clustering function, picks the initial seeds of clusters as points on the linear approximation line, and then performs the clustering operations, unlike traditional clustering methods, which group data objects into clusters by using distance measures, similarity measures and statistical distributions. Experimental results, illustrated with an example on business data, show that clustering based on linear approximation yields good results in practice. The paper also explains privacy-preserving clustering of sensitive data objects.