clustering quality
Recently Published Documents

TOTAL DOCUMENTS: 175 (FIVE YEARS: 79)
H-INDEX: 9 (FIVE YEARS: 3)

2021 ◽  
Author(s):  
Andriana Manousidaki ◽  
Anna Little ◽  
Yuying Xie

Recent advances in single-cell technologies have enabled high-resolution characterization of tissue and cancer compositions. Although numerous tools for dimension reduction and clustering are available for single-cell data analysis, these methods often fail to simultaneously preserve local cluster structure and global data geometry. This article explores the application of power-weighted path metrics to the analysis of single-cell RNA data. Extensive experiments on single-cell RNA sequencing data sets confirm the usefulness of path metrics for dimension reduction and clustering. Distances between cells are measured in a data-driven way that is both density sensitive (decreasing distances across high-density regions) and respectful of the underlying data geometry. By combining path metrics with multidimensional scaling, we obtain a low-dimensional embedding of the data that preserves both the global geometry and the cluster structure. We evaluate the method for both clustering quality and geometric fidelity, and it outperforms other algorithms on a wide range of benchmarking data sets.
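As a self-contained illustration of how a power-weighted path metric behaves, the sketch below computes all-pairs path distances with edge costs raised to a power p. This is a common formulation of the idea, not the authors' implementation: points linked by chains of nearby points end up closer than their straight-line distance would suggest, which is the density-sensitive effect described above.

```python
import math

def power_path_metric(points, p=2.0):
    """All-pairs power-weighted path metric:
    d_p(x, y) = (min over paths of sum ||x_{i+1} - x_i||^p) ** (1/p),
    computed by Floyd-Warshall on the complete graph over `points`."""
    n = len(points)
    # Edge weights: Euclidean distance raised to the power p.
    d = [[math.dist(a, b) ** p for b in points] for a in points]
    # Shortest paths over the powered edge costs.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    # Take the 1/p root so the result is again a metric on the points.
    return [[d[i][j] ** (1.0 / p) for j in range(n)] for i in range(n)]
```

For three collinear points at 0, 1, and 2 with p = 2, the path through the middle point gives d(0, 2) = √2 < 2, illustrating the contraction of distances across dense regions.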


Algorithms ◽  
2021 ◽  
Vol 14 (12) ◽  
pp. 348
Author(s):  
Zahra Tayebi ◽  
Sarwan Ali ◽  
Murray Patterson

The widespread availability of large amounts of genomic data on the SARS-CoV-2 virus, a result of the COVID-19 pandemic, has created an opportunity for researchers to analyze the disease at a level of detail unmatched for any virus before it. On the one hand, this will help biologists, policymakers, and other authorities make timely and appropriate decisions to control the spread of the coronavirus. On the other hand, such studies will help to deal more effectively with any possible future pandemic. Since the SARS-CoV-2 virus comprises different variants, each with its own mutations, performing any analysis on such data becomes a difficult task given the size of the data. It is well known that much of the variation in the SARS-CoV-2 genome occurs disproportionately in the spike region of the genome sequence, the relatively short region that codes for the spike protein(s). In this paper, we propose a robust feature-vector representation of biological sequences that, when combined with an appropriate feature selection method, allows different downstream clustering approaches to perform well on a variety of measures. We apply this approach with an array of clustering techniques to cluster spike protein sequences in order to study the behavior of the different known variants, which are spreading at a very high rate throughout the world. We first use a k-mers-based approach to generate a fixed-length feature-vector representation of the spike sequences. We then show that, with the appropriate feature selection, we can efficiently and effectively cluster the spike sequences by variant. Using a publicly available set of SARS-CoV-2 spike sequences, we cluster these sequences with both hard and soft clustering methods and show that, with our feature selection methods, we achieve higher F1 scores for the clusters as well as better clustering quality metrics compared to baselines.
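To make the k-mers representation concrete, here is a minimal sketch of a fixed-length k-mer count vector over the 20-letter amino-acid alphabet. The function name and the choice to skip k-mers containing ambiguous residues are illustrative assumptions, not the paper's exact pipeline.

```python
from itertools import product

def kmer_vector(seq, k=2, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Fixed-length k-mer count vector: one slot per possible k-mer
    over the amino-acid alphabet, so every sequence maps to a vector
    of length len(alphabet) ** k regardless of sequence length.
    (Enumeration of all k-mers is only practical for small k.)"""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing ambiguous residues
            vec[index[kmer]] += 1
    return vec
```

Every spike sequence, whatever its length, is thus mapped into the same 400-dimensional space (for k = 2), which is what lets standard clustering algorithms run on the result.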


2021 ◽  
Vol 5 (5) ◽  
pp. 688-699
Author(s):  
Abas Hasanovich Lampezhev ◽  
Elena Yur`evna Linskaya ◽  
Aslan Adal`bievich Tatarkanov ◽  
Islam Alexandrovich Alexandrov

This study aims to develop a methodology for justifying medical diagnostic decisions based on the clustering of large volumes of statistical information stored in decision support systems. This aim is relevant because the analyzed medical data are often incomplete and inaccurate, which negatively affects the correctness of medical diagnoses and the subsequent choice of the most effective treatment actions. Clustering is an effective mathematical tool for extracting useful information under conditions of initial data uncertainty. The analysis showed that the most appropriate algorithm for this problem is based on fuzzy clustering and a fuzzy equivalence relation. The methods of the present study build on this algorithm to form a technique for analyzing large volumes of medical data in order to prepare a rationale for medical diagnostic decisions. The proposed methodology involves the sequential implementation of the following procedures: preliminary data preparation, selection of the purpose of cluster data analysis, determination of the form of results presentation, data normalization, selection of criteria for assessing solution quality, application of fuzzy data clustering, and evaluation of the sample and results and their use in further work. The fuzzy clustering quality criteria include the partition coefficient, the entropy separation criterion, the separation efficiency ratio, and the cluster power criterion. The novelty of these results lies in the fact that the proposed methodology makes it possible to work with clusters of arbitrary shape and missing centers, which is impossible with universal algorithms. Doi: 10.28991/esj-2021-01305
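Two of the quality criteria named above have standard closed forms. The sketch below shows the textbook partition coefficient and a partition-entropy measure for a fuzzy membership matrix U (clusters in rows, samples in columns); it follows the common definitions and is not taken from the paper itself.

```python
import math

def partition_coefficient(U):
    """Bezdek's partition coefficient: mean of squared memberships.
    Equals 1 for a crisp partition and 1/c for a maximally fuzzy
    partition into c clusters (higher is better)."""
    n = len(U[0])
    return sum(u ** 2 for row in U for u in row) / n

def partition_entropy(U):
    """Partition entropy: 0 for a crisp partition, log(c) for a
    maximally fuzzy one (lower is better)."""
    n = len(U[0])
    return -sum(u * math.log(u) for row in U for u in row if u > 0) / n
```

For example, a crisp 2-cluster partition gives coefficient 1 and entropy 0, while uniform memberships of 0.5 give coefficient 0.5 and entropy log 2.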


Author(s):  
Chunhua Ren ◽  
Linfu Sun

The classic Fuzzy C-means (FCM) algorithm has limited clustering performance and is prone to misclassifying border points. This study offers a bi-directional FCM clustering ensemble approach that takes local information into account (LI_BIFCM) to overcome these challenges and increase clustering quality. First, various membership matrices are created by running FCM multiple times with randomized initial cluster centers, and a vertical ensemble is performed using the maximum-membership principle. Second, after each execution of FCM, multiple local membership matrices of the sample points are created using multiple K-nearest neighbors, and a horizontal ensemble is performed; multiple FCM runs thus yield multiple horizontal ensembles. Finally, the final clustering results are obtained by combining the vertical and horizontal clustering ensembles. Twelve data sets were chosen for testing from both synthetic and real data sources. In the experiments, the clustering performance of LI_BIFCM surpassed that of four traditional clustering algorithms and three clustering ensemble algorithms. Furthermore, the final clustering results have only a weak correlation with the bi-directional cluster ensemble parameters, indicating that the suggested technique is robust.
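The maximum-membership step used in the vertical ensemble can be sketched as follows; this is an illustrative helper showing the standard defuzzification rule, not the authors' code. Each sample is simply assigned to the cluster in which its membership is largest.

```python
def max_membership_labels(U):
    """Maximum-membership principle: given a fuzzy membership matrix
    U (clusters in rows, samples in columns), assign each sample to
    the cluster with its largest membership value."""
    n = len(U[0])
    return [max(range(len(U)), key=lambda c: U[c][i]) for i in range(n)]
```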


2021 ◽  
Author(s):  
Lucas Ondel

This work investigates subspace non-parametric models for the task of learning a set of acoustic units from unlabeled speech recordings. We constrain the base measure of a Dirichlet-process mixture with a phonetic subspace (estimated from other source languages) to build an "educated prior", thereby forcing the learned acoustic units to resemble the phones of known source languages. Two types of models are proposed: (i) the Subspace HMM (SHMM), which assumes that the phonetic subspace is the same for every language, and (ii) the Hierarchical-Subspace HMM (H-SHMM), which relaxes this assumption and allows a language-specific subspace to be estimated on the unlabeled target data. These models are applied to three languages (English, Yoruba, and Mboshi) and compared with various competitive acoustic unit discovery baselines. Experimental results show that both subspace models outperform the other systems in terms of clustering quality and segmentation accuracy. Moreover, we observe that the H-SHMM provides results superior to those of the SHMM, supporting the idea that language-specific priors are preferable to language-agnostic priors for acoustic unit discovery.


2021 ◽  
Vol 15 ◽  
Author(s):  
Isaac Goicovich ◽  
Paulo Olivares ◽  
Claudio Román ◽  
Andrea Vázquez ◽  
Cyril Poupon ◽  
...  

Fiber clustering methods are typically used in brain research to study the organization of white matter bundles from large diffusion MRI tractography datasets. These methods enable exploratory bundle inspection using visualization and other techniques that require identifying brain white matter structures in individuals or a population. Some applications, such as real-time visualization and inter-subject clustering, need fast, high-quality intra-subject clustering algorithms. This work proposes a parallel algorithm for fiber clustering based on the FFClust algorithm, using a General-Purpose Graphics Processing Unit (GPGPU). The proposed GPGPU implementation exploits data parallelism using both the multicore and the fine-grained GPU parallelism present in commodity architectures, including current laptops and desktop computers. Our approach implements all FFClust steps in parallel, improving the execution time of each of them. In addition, it includes a parallel K-means++ implementation and defines a new K-means++ variant that reduces the impact of choosing outliers as initial centroids. The results show that our approach provides clustering quality very similar to that of FFClust while requiring an execution time of 3.5 s to process about a million fibers, a speedup of 11.5 times over FFClust.
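For reference, classic K-means++ seeding looks like the sketch below: each next centroid is drawn with probability proportional to its squared distance from the nearest already-chosen centroid. The paper's variant additionally damps the probability of selecting outliers, which this plain baseline does not do.

```python
import math
import random

def kmeans_pp_init(points, k, rng=random.Random(0)):
    """Classic K-means++ seeding (D^2 sampling). Because outliers
    have large squared distances, they are disproportionately likely
    to be chosen, which is the weakness the paper's variant targets."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Draw the next centroid proportionally to d2 (roulette wheel).
        r = rng.uniform(0.0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
    return centroids
```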


Author(s):  
Md. Zakir Hossain ◽  
Md. Jakirul Islam ◽  
Md. Waliur Rahman Miah ◽  
Jahid Hasan Rony ◽  
Momotaz Begum

The amount of data has been increasing exponentially in every sector, such as banking and securities, healthcare, education, manufacturing, consumer trade, transportation, and energy. Much of this data is noisy, arbitrary in shape, and riddled with outliers, and in such cases it is challenging to find the desired clusters using conventional clustering algorithms. DBSCAN is a popular clustering algorithm widely used for noisy, arbitrarily shaped, and outlier-laden data. However, its performance depends strongly on the proper selection of the cluster radius (Eps) and the minimum number of points (MinPts) required to form a cluster for the given dataset. For real-world clustering problems, selecting exact values of Eps and MinPts for unknown datasets is difficult. To address this, this paper proposes a dynamic DBSCAN algorithm that calculates suitable values for Eps and MinPts dynamically, thereby increasing the clustering quality for the given problem. This paper evaluates the performance of the dynamic DBSCAN algorithm on seven challenging datasets. The experimental results confirm the effectiveness of the dynamic DBSCAN algorithm compared with well-known clustering algorithms.
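One common way to derive Eps from the data is the k-distance heuristic sketched below: compute each point's distance to its MinPts-th nearest neighbour and take a summary statistic of those distances. This is offered only as an illustration of the general idea (with the median as the summary), not as the paper's exact rule.

```python
import math

def estimate_eps(points, min_pts=4):
    """k-distance heuristic for DBSCAN's Eps: for each point, find the
    distance to its min_pts-th nearest neighbour, then summarize the
    resulting distances (here with the median)."""
    kdists = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        kdists.append(dists[min_pts - 1])
    kdists.sort()
    return kdists[len(kdists) // 2]
```

In the classic formulation, these k-distances are plotted sorted and Eps is read off at the "elbow"; taking the median is a crude automatic stand-in for that visual step.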


2021 ◽  
Vol 11 (17) ◽  
pp. 8051
Author(s):  
Chengxiao Shen ◽  
Liping Qian ◽  
Ningning Yu

In the era of big data, face images captured in social media, forensic investigations, and similar settings generally lack labels, while the number of identities (clusters) may range from a few dozen to thousands. It is therefore of practical importance to cluster a large number of unlabeled face images into an accurate range of identities, or even the exact identities, avoiding labeling images by hand. Here, we propose adaptive facial imagery clustering that involves face representations, spectral clustering, and reinforcement learning (Q-learning). First, we use a deep convolutional neural network (DCNN) to generate face representations, and we adopt a spectral clustering model to construct a similarity matrix and obtain a clustering partition. Then, we use an internal evaluation measure (the Davies–Bouldin index) to evaluate the clustering quality. Finally, we adopt Q-learning as the feedback module to build a dynamic multiparameter debugging process. Experimental results on the ORL Face Database show the effectiveness of our method: it identifies an optimal number of 39 clusters, close to the actual number of 40, and achieves 99.2% clustering accuracy. Subsequent studies should focus on reducing the computational complexity of dealing with more face images.
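The Davies–Bouldin index used as the internal evaluation measure has a standard definition, sketched below: for each cluster, take the worst ratio of summed within-cluster scatters to between-centroid distance, then average over clusters; lower values indicate more compact, better-separated clusters.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index: mean over clusters i of
    max_{j != i} (s_i + s_j) / d(c_i, c_j), where c_i is the centroid
    of cluster i and s_i its mean distance to that centroid."""
    clusters = sorted(set(labels))
    cents, scat = {}, {}
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        cent = tuple(sum(x) / len(members) for x in zip(*members))
        cents[c] = cent
        scat[c] = sum(math.dist(p, cent) for p in members) / len(members)
    total = 0.0
    for i in clusters:
        total += max((scat[i] + scat[j]) / math.dist(cents[i], cents[j])
                     for j in clusters if j != i)
    return total / len(clusters)
```

In a feedback loop like the one described, Q-learning would adjust the clustering parameters (such as the number of clusters) in the direction that lowers this index.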


2021 ◽  
Vol 7 ◽  
pp. e679
Author(s):  
Kazuhisa Fujita

Spectral clustering (SC) is one of the most popular clustering methods and often outperforms traditional approaches. SC uses the eigenvectors of a Laplacian matrix calculated from a similarity matrix of a dataset. However, SC has serious drawbacks: the significant increase in time complexity caused by computing the eigenvectors, and the memory cost of storing the similarity matrix. To address these issues, I develop a new approximate spectral clustering method that uses the network generated by growing neural gas (GNG), called ASC with GNG in this study. ASC with GNG uses not only the reference vectors for vector quantization but also the topology of the network to extract the topological relationships between data points in a dataset. ASC with GNG calculates the similarity matrix from both the reference vectors and the topology of the network generated by GNG. By using the network generated from a dataset by GNG, ASC with GNG reduces the computational and space complexity and improves the clustering quality. In this study, I demonstrate that ASC with GNG effectively reduces the computational time. Moreover, this study shows that ASC with GNG provides clustering performance equal to or better than that of SC.
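The eigenvector step that dominates SC's cost can be sketched as follows, using the unnormalized Laplacian for simplicity. In ASC with GNG, the similarity matrix S would be built from the (much smaller) GNG network of reference vectors rather than from all data points, which is where the complexity savings come from; this sketch shows only the generic SC embedding.

```python
import numpy as np

def spectral_embedding(S, dim):
    """Spectral embedding: rows are the coordinates of each node in
    the space spanned by the `dim` eigenvectors of the unnormalized
    graph Laplacian L = D - S with the smallest eigenvalues.
    Clustering then typically runs k-means on these rows."""
    D = np.diag(S.sum(axis=1))
    L = D - S
    vals, vecs = np.linalg.eigh(L)  # L is symmetric: real spectrum
    return vecs[:, np.argsort(vals)[:dim]]
```

For a similarity matrix with two disconnected components, the two smallest eigenvalues are zero and the embedding rows coincide within each component, so the components separate trivially.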

