clustering quality
Recently Published Documents

TOTAL DOCUMENTS: 175 (FIVE YEARS: 79)
H-INDEX: 9 (FIVE YEARS: 3)

2021 ◽  
Author(s):  
Andriana Manousidaki ◽  
Anna Little ◽  
Yuying Xie

Recent advances in single-cell technologies have enabled high-resolution characterization of tissue and cancer compositions. Although numerous tools for dimension reduction and clustering are available for single-cell data analysis, these methods often fail to simultaneously preserve local cluster structure and global data geometry. This article explores the application of power-weighted path metrics to the analysis of single-cell RNA data. Extensive experiments on single-cell RNA sequencing data sets confirm the usefulness of path metrics for dimension reduction and clustering. Distances between cells are measured in a data-driven way that is both density sensitive (decreasing distances across high-density regions) and respectful of the underlying data geometry. By combining path metrics with multidimensional scaling, we obtain a low-dimensional embedding of the data that preserves both the global geometry and the cluster structure. We evaluate the method for both clustering quality and geometric fidelity, and it outperforms other algorithms on a wide range of benchmarking data sets.
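As a self-contained illustration of how a power-weighted path metric behaves, the sketch below computes all-pairs path distances with edge costs raised to a power p. This is a common formulation of the idea, not the authors' implementation: points linked by chains of nearby points end up closer than their straight-line distance would suggest, which is the density-sensitive effect described above.

```python
import math

def power_path_metric(points, p=2.0):
    """All-pairs power-weighted path metric:
    d_p(x, y) = (min over paths of sum ||x_{i+1} - x_i||^p) ** (1/p),
    computed by Floyd-Warshall on the complete graph over `points`."""
    n = len(points)
    # Edge weights: Euclidean distance raised to the power p.
    d = [[math.dist(a, b) ** p for b in points] for a in points]
    # Shortest paths over the powered edge costs.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    # Take the 1/p root so the result is again a metric on the points.
    return [[d[i][j] ** (1.0 / p) for j in range(n)] for i in range(n)]
```

For three collinear points at 0, 1, and 2 with p = 2, the path through the middle point gives d(0, 2) = √2 < 2, illustrating the contraction of distances across dense regions.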


Algorithms ◽  
2021 ◽  
Vol 14 (12) ◽  
pp. 348
Author(s):  
Zahra Tayebi ◽  
Sarwan Ali ◽  
Murray Patterson

The widespread availability of large amounts of genomic data on the SARS-CoV-2 virus, a result of the COVID-19 pandemic, has created an opportunity for researchers to analyze the disease at a level of detail unmatched for any virus before it. On the one hand, this will help biologists, policymakers, and other authorities make timely and appropriate decisions to control the spread of the coronavirus. On the other hand, such studies will help to deal more effectively with any possible future pandemic. Since the SARS-CoV-2 virus comprises different variants, each with its own mutations, performing any analysis on such data becomes a difficult task given the size of the data. It is well known that much of the variation in the SARS-CoV-2 genome occurs disproportionately in the spike region of the genome sequence, the relatively short region that codes for the spike protein(s). In this paper, we propose a robust feature-vector representation of biological sequences that, when combined with an appropriate feature selection method, allows different downstream clustering approaches to perform well on a variety of measures. We apply this approach with an array of clustering techniques to cluster spike protein sequences in order to study the behavior of the different known variants, which are spreading at a very high rate throughout the world. We first use a k-mers-based approach to generate a fixed-length feature-vector representation of the spike sequences. We then show that, with the appropriate feature selection, we can efficiently and effectively cluster the spike sequences by variant. Using a publicly available set of SARS-CoV-2 spike sequences, we cluster these sequences with both hard and soft clustering methods and show that, with our feature selection methods, we achieve higher F1 scores for the clusters as well as better clustering quality metrics compared to baselines.
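To make the k-mers representation concrete, here is a minimal sketch of a fixed-length k-mer count vector over the 20-letter amino-acid alphabet. The function name and the choice to skip k-mers containing ambiguous residues are illustrative assumptions, not the paper's exact pipeline.

```python
from itertools import product

def kmer_vector(seq, k=2, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Fixed-length k-mer count vector: one slot per possible k-mer
    over the amino-acid alphabet, so every sequence maps to a vector
    of length len(alphabet) ** k regardless of sequence length.
    (Enumeration of all k-mers is only practical for small k.)"""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing ambiguous residues
            vec[index[kmer]] += 1
    return vec
```

Every spike sequence, whatever its length, is thus mapped into the same 400-dimensional space (for k = 2), which is what lets standard clustering algorithms run on the result.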


2021 ◽  
Vol 5 (5) ◽  
pp. 688-699
Author(s):  
Abas Hasanovich Lampezhev ◽  
Elena Yur`evna Linskaya ◽  
Aslan Adal`bievich Tatarkanov ◽  
Islam Alexandrovich Alexandrov

This study aims to develop a methodology for justifying medical diagnostic decisions based on the clustering of large volumes of statistical information stored in decision support systems. This aim is relevant because the analyzed medical data are often incomplete and inaccurate, which negatively affects the correctness of medical diagnoses and the subsequent choice of the most effective treatment actions. Clustering is an effective mathematical tool for extracting useful information under conditions of initial data uncertainty. The analysis showed that the most appropriate algorithm for this problem is based on fuzzy clustering and a fuzzy equivalence relation. The methods of the present study build on this algorithm to form a technique for analyzing large volumes of medical data in order to prepare a rationale for medical diagnostic decisions. The proposed methodology involves the sequential implementation of the following procedures: preliminary data preparation, selection of the purpose of cluster data analysis, determination of the form of results presentation, data normalization, selection of criteria for assessing solution quality, application of fuzzy data clustering, and evaluation of the sample and results and their use in further work. The fuzzy clustering quality criteria include the partition coefficient, the entropy separation criterion, the separation efficiency ratio, and the cluster power criterion. The novelty of these results lies in the fact that the proposed methodology makes it possible to work with clusters of arbitrary shape and missing centers, which is impossible with universal algorithms. Doi: 10.28991/esj-2021-01305
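Two of the quality criteria named above have standard closed forms. The sketch below shows the textbook partition coefficient and a partition-entropy measure for a fuzzy membership matrix U (clusters in rows, samples in columns); it follows the common definitions and is not taken from the paper itself.

```python
import math

def partition_coefficient(U):
    """Bezdek's partition coefficient: mean of squared memberships.
    Equals 1 for a crisp partition and 1/c for a maximally fuzzy
    partition into c clusters (higher is better)."""
    n = len(U[0])
    return sum(u ** 2 for row in U for u in row) / n

def partition_entropy(U):
    """Partition entropy: 0 for a crisp partition, log(c) for a
    maximally fuzzy one (lower is better)."""
    n = len(U[0])
    return -sum(u * math.log(u) for row in U for u in row if u > 0) / n
```

For example, a crisp 2-cluster partition gives coefficient 1 and entropy 0, while uniform memberships of 0.5 give coefficient 0.5 and entropy log 2.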


Author(s):  
Chunhua Ren ◽  
Linfu Sun

The classic Fuzzy C-means (FCM) algorithm has limited clustering performance and is prone to misclassifying border points. This study offers a bi-directional FCM clustering ensemble approach that takes local information into account (LI_BIFCM) to overcome these challenges and increase clustering quality. First, various membership matrices are created by running FCM multiple times with randomized initial cluster centers, and a vertical ensemble is performed using the maximum-membership principle. Second, after each execution of FCM, multiple local membership matrices of the sample points are created using multiple K-nearest neighbors, and a horizontal ensemble is performed; multiple FCM runs thus yield multiple horizontal ensembles. Finally, the final clustering results are obtained by combining the vertical and horizontal clustering ensembles. Twelve data sets were chosen for testing from both synthetic and real data sources. In the experiments, the clustering performance of LI_BIFCM surpassed that of four traditional clustering algorithms and three clustering ensemble algorithms. Furthermore, the final clustering results have only a weak correlation with the bi-directional cluster ensemble parameters, indicating that the suggested technique is robust.
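The maximum-membership step used in the vertical ensemble can be sketched as follows; this is an illustrative helper showing the standard defuzzification rule, not the authors' code. Each sample is simply assigned to the cluster in which its membership is largest.

```python
def max_membership_labels(U):
    """Maximum-membership principle: given a fuzzy membership matrix
    U (clusters in rows, samples in columns), assign each sample to
    the cluster with its largest membership value."""
    n = len(U[0])
    return [max(range(len(U)), key=lambda c: U[c][i]) for i in range(n)]
```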


2021 ◽  
Author(s):  
Lucas Ondel

This work investigates subspace non-parametric models for the task of learning a set of acoustic units from unlabeled speech recordings. We constrain the base measure of a Dirichlet-process mixture with a phonetic subspace (estimated from other source languages) to build an "educated prior", thereby forcing the learned acoustic units to resemble the phones of known source languages. Two types of models are proposed: (i) the Subspace HMM (SHMM), which assumes that the phonetic subspace is the same for every language, and (ii) the Hierarchical-Subspace HMM (H-SHMM), which relaxes this assumption and allows a language-specific subspace to be estimated on the unlabeled target data. These models are applied to three languages (English, Yoruba, and Mboshi) and compared with various competitive acoustic unit discovery baselines. Experimental results show that both subspace models outperform the other systems in terms of clustering quality and segmentation accuracy. Moreover, we observe that the H-SHMM provides results superior to those of the SHMM, supporting the idea that language-specific priors are preferable to language-agnostic priors for acoustic unit discovery.


2021 ◽  
Vol 15 ◽  
Author(s):  
Isaac Goicovich ◽  
Paulo Olivares ◽  
Claudio Román ◽  
Andrea Vázquez ◽  
Cyril Poupon ◽  
...  

Fiber clustering methods are typically used in brain research to study the organization of white matter bundles from large diffusion MRI tractography datasets. These methods enable exploratory bundle inspection using visualization and other techniques that require identifying brain white matter structures in individuals or a population. Some applications, such as real-time visualization and inter-subject clustering, need fast, high-quality intra-subject clustering algorithms. This work proposes a parallel algorithm for fiber clustering based on the FFClust algorithm, using a General-Purpose Graphics Processing Unit (GPGPU). The proposed GPGPU implementation exploits data parallelism using both the multicore and the fine-grained GPU parallelism present in commodity architectures, including current laptops and desktop computers. Our approach implements all FFClust steps in parallel, improving the execution time of each of them. In addition, it includes a parallel K-means++ implementation and defines a new K-means++ variant that reduces the impact of choosing outliers as initial centroids. The results show that our approach provides clustering quality very similar to that of FFClust while requiring an execution time of 3.5 s to process about a million fibers, a speedup of 11.5 times over FFClust.
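For reference, classic K-means++ seeding looks like the sketch below: each next centroid is drawn with probability proportional to its squared distance from the nearest already-chosen centroid. The paper's variant additionally damps the probability of selecting outliers, which this plain baseline does not do.

```python
import math
import random

def kmeans_pp_init(points, k, rng=random.Random(0)):
    """Classic K-means++ seeding (D^2 sampling). Because outliers
    have large squared distances, they are disproportionately likely
    to be chosen, which is the weakness the paper's variant targets."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Draw the next centroid proportionally to d2 (roulette wheel).
        r = rng.uniform(0.0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
    return centroids
```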


Author(s):  
Md. Zakir Hossain ◽  
Md. Jakirul Islam ◽  
Md. Waliur Rahman Miah ◽  
Jahid Hasan Rony ◽  
Momotaz Begum

The amount of data has been increasing exponentially in every sector, such as banking and securities, healthcare, education, manufacturing, consumer trade, transportation, and energy. Much of this data is noisy, arbitrary in shape, and riddled with outliers, and in such cases it is challenging to find the desired clusters using conventional clustering algorithms. DBSCAN is a popular clustering algorithm widely used for noisy, arbitrarily shaped, and outlier-laden data. However, its performance depends strongly on the proper selection of the cluster radius (Eps) and the minimum number of points (MinPts) required to form a cluster for the given dataset. For real-world clustering problems, selecting exact values of Eps and MinPts for unknown datasets is difficult. To address this, this paper proposes a dynamic DBSCAN algorithm that calculates suitable values for Eps and MinPts dynamically, thereby increasing the clustering quality for the given problem. This paper evaluates the performance of the dynamic DBSCAN algorithm on seven challenging datasets. The experimental results confirm the effectiveness of the dynamic DBSCAN algorithm compared with well-known clustering algorithms.
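One common way to derive Eps from the data is the k-distance heuristic sketched below: compute each point's distance to its MinPts-th nearest neighbour and take a summary statistic of those distances. This is offered only as an illustration of the general idea (with the median as the summary), not as the paper's exact rule.

```python
import math

def estimate_eps(points, min_pts=4):
    """k-distance heuristic for DBSCAN's Eps: for each point, find the
    distance to its min_pts-th nearest neighbour, then summarize the
    resulting distances (here with the median)."""
    kdists = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        kdists.append(dists[min_pts - 1])
    kdists.sort()
    return kdists[len(kdists) // 2]
```

In the classic formulation, these k-distances are plotted sorted and Eps is read off at the "elbow"; taking the median is a crude automatic stand-in for that visual step.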


2021 ◽  
Vol 11 (17) ◽  
pp. 8051
Author(s):  
Chengxiao Shen ◽  
Liping Qian ◽  
Ningning Yu

In the era of big data, face images captured in social media, forensic investigations, and similar settings generally lack labels, while the number of identities (clusters) may range from a few dozen to thousands. It is therefore of practical importance to cluster a large number of unlabeled face images into an accurate range of identities, or even the exact identities, avoiding labeling images by hand. Here, we propose adaptive facial imagery clustering that involves face representations, spectral clustering, and reinforcement learning (Q-learning). First, we use a deep convolutional neural network (DCNN) to generate face representations, and we adopt a spectral clustering model to construct a similarity matrix and obtain a clustering partition. Then, we use an internal evaluation measure (the Davies–Bouldin index) to evaluate the clustering quality. Finally, we adopt Q-learning as the feedback module to build a dynamic multiparameter debugging process. Experimental results on the ORL Face Database show the effectiveness of our method: it identifies an optimal number of 39 clusters, close to the actual number of 40, and achieves 99.2% clustering accuracy. Subsequent studies should focus on reducing the computational complexity of dealing with more face images.
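The Davies–Bouldin index used as the internal evaluation measure has a standard definition, sketched below: for each cluster, take the worst ratio of summed within-cluster scatters to between-centroid distance, then average over clusters; lower values indicate more compact, better-separated clusters.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index: mean over clusters i of
    max_{j != i} (s_i + s_j) / d(c_i, c_j), where c_i is the centroid
    of cluster i and s_i its mean distance to that centroid."""
    clusters = sorted(set(labels))
    cents, scat = {}, {}
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        cent = tuple(sum(x) / len(members) for x in zip(*members))
        cents[c] = cent
        scat[c] = sum(math.dist(p, cent) for p in members) / len(members)
    total = 0.0
    for i in clusters:
        total += max((scat[i] + scat[j]) / math.dist(cents[i], cents[j])
                     for j in clusters if j != i)
    return total / len(clusters)
```

In a feedback loop like the one described, Q-learning would adjust the clustering parameters (such as the number of clusters) in the direction that lowers this index.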


2021 ◽  
Vol 7 ◽  
pp. e679
Author(s):  
Kazuhisa Fujita

Spectral clustering (SC) is one of the most popular clustering methods and often outperforms traditional approaches. SC uses the eigenvectors of a Laplacian matrix calculated from a similarity matrix of a dataset. However, SC has serious drawbacks: the significant increase in time complexity caused by computing the eigenvectors, and the memory cost of storing the similarity matrix. To address these issues, I develop a new approximate spectral clustering method that uses the network generated by growing neural gas (GNG), called ASC with GNG in this study. ASC with GNG uses not only the reference vectors for vector quantization but also the topology of the network to extract the topological relationships between data points in a dataset. ASC with GNG calculates the similarity matrix from both the reference vectors and the topology of the network generated by GNG. By using the network generated from a dataset by GNG, ASC with GNG reduces the computational and space complexity and improves the clustering quality. In this study, I demonstrate that ASC with GNG effectively reduces the computational time. Moreover, this study shows that ASC with GNG provides clustering performance equal to or better than that of SC.
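The eigenvector step that dominates SC's cost can be sketched as follows, using the unnormalized Laplacian for simplicity. In ASC with GNG, the similarity matrix S would be built from the (much smaller) GNG network of reference vectors rather than from all data points, which is where the complexity savings come from; this sketch shows only the generic SC embedding.

```python
import numpy as np

def spectral_embedding(S, dim):
    """Spectral embedding: rows are the coordinates of each node in
    the space spanned by the `dim` eigenvectors of the unnormalized
    graph Laplacian L = D - S with the smallest eigenvalues.
    Clustering then typically runs k-means on these rows."""
    D = np.diag(S.sum(axis=1))
    L = D - S
    vals, vecs = np.linalg.eigh(L)  # L is symmetric: real spectrum
    return vecs[:, np.argsort(vals)[:dim]]
```

For a similarity matrix with two disconnected components, the two smallest eigenvalues are zero and the embedding rows coincide within each component, so the components separate trivially.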

