A Hybrid Clustering Algorithm for Identifying Cell Types from Single-Cell RNA-Seq Data

Genes ◽  
2019 ◽  
Vol 10 (2) ◽  
pp. 98 ◽  
Author(s):  
Xiaoshu Zhu ◽  
Hong-Dong Li ◽  
Yunpei Xu ◽  
Lilu Guo ◽  
Fang-Xiang Wu ◽  
...  

Single-cell RNA sequencing (scRNA-seq) has recently brought new insight into cell differentiation processes and functional variation in cell subtypes from homogeneous cell populations. A lack of prior knowledge makes unsupervised machine learning methods, such as clustering, suitable for analyzing scRNA-seq data. However, there are several limitations to overcome, including high dimensionality, clustering result instability, and parameter adjustment complexity. In this study, we propose a method that combines structure entropy and the k nearest neighbor approach to identify cell subpopulations in scRNA-seq data. In contrast to existing clustering methods for identifying cell subtypes, minimizing structure entropy yields natural communities without specifying the number of clusters. To investigate the performance of our model, we applied it to eight scRNA-seq datasets and compared our method with three existing methods (nonnegative matrix factorization, single-cell interpretation via multikernel learning, and the structural entropy minimization principle). The experimental results showed that our approach achieves, on average, better performance on these datasets than the benchmark methods.
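The abstract outlines a graph-based workflow: reduce dimensionality, build a k-nearest-neighbor graph over cells, and let community detection determine the number of clusters. The sketch below illustrates that generic pipeline only; greedy modularity optimization stands in for the authors' structure-entropy objective, and the parameter values (50 principal components, k = 15) and the NetworkX ≥ 2.7 API are assumptions.

```python
# Minimal sketch of a generic graph-based scRNA-seq clustering pipeline:
# PCA -> KNN graph -> community detection with no preset cluster count.
# Greedy modularity is a stand-in for the structure-entropy objective.
import numpy as np
import networkx as nx
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph

def cluster_cells(expr, n_pcs=50, k=15):
    """expr: cells x genes matrix (log-normalized counts assumed)."""
    pcs = PCA(n_components=n_pcs).fit_transform(expr)
    adj = kneighbors_graph(pcs, n_neighbors=k, mode="connectivity")
    graph = nx.from_scipy_sparse_array(adj)   # NetworkX >= 2.7
    communities = nx.algorithms.community.greedy_modularity_communities(graph)
    labels = np.empty(expr.shape[0], dtype=int)
    for cid, cells in enumerate(communities):
        labels[list(cells)] = cid
    return labels  # one integer label per cell; the cluster count is discovered
```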

Author(s):  
Ming Tang ◽  
Yasin Kaymaz ◽  
Brandon L Logeman ◽  
Stephen Eichhorn ◽  
Zhengzheng S Liang ◽  
...  

Motivation: One major goal of single-cell RNA sequencing (scRNAseq) experiments is to identify novel cell types. With increasingly large scRNAseq datasets, unsupervised clustering methods can now produce detailed catalogues of transcriptionally distinct groups of cells in a sample. However, the interpretation of these clusters is challenging for both technical and biological reasons. Popular clustering algorithms are sensitive to parameter choices and can produce different clustering solutions with even small changes in the number of principal components used, the number of nearest neighbors, and the resolution parameter, among others. Results: Here, we present a set of tools to evaluate cluster stability by subsampling, which can guide parameter choice and aid in biological interpretation. The R package scclusteval and the accompanying Snakemake workflow implement all steps of the pipeline: subsampling the cells, repeating the clustering with Seurat, estimating cluster stability using the Jaccard similarity index, and providing rich visualizations. Availability and implementation: R package scclusteval: https://github.com/crazyhottommy/scclusteval; Snakemake workflow: https://github.com/crazyhottommy/pyflow_seuratv3_parameter; Tutorial: https://crazyhottommy.github.io/EvaluateSingleCellClustering/.
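The subsampling idea is straightforward to prototype outside the scclusteval/Seurat ecosystem. The sketch below is a rough Python analogue rather than the package's implementation: it repeatedly subsamples cells, re-clusters them (here with k-means, an assumption; Seurat uses graph-based clustering), and scores each original cluster by its best Jaccard match in the subsampled clustering; clusters with low mean Jaccard are unstable.

```python
# Sketch of cluster stability by subsampling, scored with the Jaccard index.
# Illustrative only; scclusteval implements this around Seurat in R.
import numpy as np
from sklearn.cluster import KMeans

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def cluster_stability(X, labels, n_rounds=20, frac=0.8, seed=0):
    """labels: original assignment (assumed 0..k-1); returns mean best-match Jaccard per cluster."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    k = len(np.unique(labels))
    scores = np.zeros((n_rounds, k))
    for r in range(n_rounds):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        sub = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        for c in range(k):
            original = idx[labels[idx] == c]   # cluster c restricted to the subsample
            scores[r, c] = max(jaccard(original, idx[sub == s]) for s in range(k))
    return scores.mean(axis=0)                 # low values flag unstable clusters
```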


2021 ◽  
Author(s):  
Maryam Zand ◽  
Jianhua Ruan

Single-cell RNA sequencing (scRNAseq) offers an unprecedented potential for scrutinizing complex biological systems at single-cell resolution. One of the most important applications of scRNAseq is to cluster cells into groups of similar expression profiles, which allows unsupervised identification of novel cell subtypes. While many clustering algorithms have been tested towards this goal, graph-based algorithms appear to be the most effective, due to their ability to accommodate the sparsity of the data as well as the complex topology of the cell population. An integral part of almost all such clustering methods is the construction of a k-nearest-neighbor (KNN) network, and the choice of k, implicitly or explicitly, can have a profound impact on the density distribution of the graph and the structure of the resulting clusters, as well as on the resolution of clusters that one can successfully identify from the data. In this work, we propose a fairly simple but robust approach to estimate the best k for constructing the KNN graph while simultaneously identifying the optimal clustering structure from the graph. Our method, named scQcut, employs a topology-based criterion to guide the construction of the KNN graph, and then applies an efficient modularity-based community discovery algorithm to predict robust cell clusters. The results obtained from applying scQcut to a large number of real and synthetic datasets demonstrated that scQcut, which does not require any user-tuned parameters, outperformed several popular state-of-the-art clustering methods in terms of clustering accuracy and the ability to correctly identify rare cell types. The promising results indicate that an accurate approximation of the parameter k, which determines the topology of the network, is a crucial element of a successful graph-based clustering method for recovering the final community structure of the cell population.
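One way to make the "estimate k while clustering" idea concrete is to scan candidate k values, cluster each resulting KNN graph, and keep the k whose partition scores best. The sketch below uses the modularity of a greedy partition as the selection score; scQcut's actual topology-based criterion and its Qcut community discovery algorithm are not reproduced, and the candidate grid is an assumption.

```python
# Sketch: choose k for the KNN graph by scanning candidates and keeping the
# graph whose community partition has the highest modularity. A generic
# stand-in for scQcut's topology-based criterion, not the method itself.
import networkx as nx
from sklearn.neighbors import kneighbors_graph

def pick_k(X, candidates=(5, 10, 15, 20, 30, 50)):
    best = None
    for k in candidates:
        g = nx.from_scipy_sparse_array(
            kneighbors_graph(X, n_neighbors=k, mode="connectivity"))
        parts = nx.algorithms.community.greedy_modularity_communities(g)
        q = nx.algorithms.community.modularity(g, parts)
        if best is None or q > best[1]:
            best = (k, q, parts)
    return best[0], best[2]   # chosen neighborhood size and its partition
```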


2017 ◽  
Author(s):  
Debajyoti Sinha ◽  
Akhilesh Kumar ◽  
Himanshu Kumar ◽  
Sanghamitra Bandyopadhyay ◽  
Debarka Sengupta

Droplet-based single-cell transcriptomics has recently enabled parallel screening of tens of thousands of single cells. Clustering methods that scale to such high-dimensional data without compromising accuracy are scarce. We exploit Locality Sensitive Hashing, an approximate nearest neighbor search technique, to develop a de novo clustering algorithm for large-scale single-cell data. On a number of real datasets, dropClust outperformed the existing best-practice methods in terms of execution time, clustering accuracy, and detectability of minor cell sub-types.
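The approximate nearest-neighbor primitive that dropClust builds on can be illustrated with random-hyperplane Locality Sensitive Hashing: points whose projections onto random hyperplanes share the same sign pattern fall into the same bucket, and only bucket-mates are compared exactly. The class below is a toy sketch of that primitive with assumed bucketing details; dropClust itself adds structure-preserving sampling and a dedicated clustering stage on top.

```python
# Toy random-hyperplane LSH for approximate nearest-neighbor search.
import numpy as np
from collections import defaultdict

class HyperplaneLSH:
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = defaultdict(list)

    def _hash(self, x):
        # sign pattern of projections onto the random hyperplanes
        return tuple((self.planes @ x > 0).astype(int))

    def index(self, X):
        for i, x in enumerate(X):
            self.buckets[self._hash(x)].append(i)

    def query(self, x, X, k=10):
        # candidates share a bucket; fall back to all points if the bucket is empty
        cand = np.fromiter(self.buckets.get(self._hash(x), range(len(X))), dtype=int)
        d = np.linalg.norm(X[cand] - x, axis=1)
        return cand[np.argsort(d)[:k]]
```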


2009 ◽  
Vol 03 (04) ◽  
pp. 399-419
Author(s):  
ASLI CELIKYILMAZ

Unsupervised spectral clustering methods can yield good performance when identifying crisp clusters with low complexity, since the learning algorithm does not rely on finding the local minima of an objective function but rather uses spectral properties of the graph. Nonetheless, the performance of such approaches is usually affected by their uncertain parameters. Using the underlying structure of a general spectral clustering method, this paper introduces a new soft-link spectral clustering algorithm that identifies clusters based on a fuzzy k-nearest neighbor approach. We construct a soft weight matrix of a graph by identifying the upper and lower boundaries of the learning parameters of the similarity function, specifically the fuzzifier parameter (fuzziness) of the Fuzzy k-Nearest Neighbor algorithm. The algorithm allows perturbations of the graph Laplacian during the learning stage through changes in these learning parameters. Through empirical analysis on an artificial dataset and a real textual entailment dataset, we demonstrate that our initial hypothesis of implementing soft links for spectral clustering can improve the classification performance of the final outcome.
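A rough sketch of the soft-link idea: instead of a binary k-NN adjacency, each edge carries a fuzzy k-NN style membership weight, and the resulting soft affinity matrix is handed to spectral clustering. The fuzzifier value m, the normalization, and the fixed number of clusters below are illustrative assumptions; the paper's upper/lower fuzzifier bounds and Laplacian perturbation scheme are not reproduced.

```python
# Sketch: a soft (fuzzy) k-NN affinity matrix fed to spectral clustering.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.neighbors import NearestNeighbors

def soft_knn_affinity(X, k=10, m=2.0):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)                 # column 0 is the point itself
    W = np.zeros((len(X), len(X)))
    for i in range(len(X)):
        d = dist[i, 1:] + 1e-12
        w = d ** (-2.0 / (m - 1.0))              # fuzzy k-NN style weights
        W[i, idx[i, 1:]] = w / w.sum()
    return np.maximum(W, W.T)                    # symmetrize the soft links

# Example usage (n_clusters is illustrative):
# labels = SpectralClustering(n_clusters=3, affinity="precomputed").fit_predict(
#     soft_knn_affinity(X))
```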


2015 ◽  
pp. 125-138 ◽  
Author(s):  
I. V. Goncharenko

In this article we propose a new method of non-hierarchical cluster analysis using the k-nearest-neighbor graph and discuss it with respect to vegetation classification. The method of k-nearest neighbor (k-NN) classification was originally developed in 1951 (Fix, Hodges, 1951). Later the term "k-NN graph" and a few k-NN clustering algorithms appeared (Cover, Hart, 1967; Brito et al., 1997). In biology, k-NN is used in the analysis of protein structures and genome sequences. Most k-NN clustering algorithms first build an "excessive" graph, a so-called hypergraph, and then truncate it to subgraphs by partitioning and coarsening the hypergraph. We developed a different, "upward" clustering strategy that assembles one cluster after another. Until now, graph-based cluster analysis has not been considered for the classification of vegetation datasets.
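The "upward" assembly idea, growing one cluster at a time rather than partitioning a large hypergraph, can be illustrated on a mutual k-NN graph as below. This is a generic sketch under assumed rules (breadth-first expansion over mutual neighbors), not the assembly criteria of the original method.

```python
# Generic illustration of upward clustering on a mutual k-NN graph: grow one
# cluster at a time by breadth-first expansion over mutual neighbors.
import numpy as np
from collections import deque
from sklearn.neighbors import NearestNeighbors

def mutual_knn_clusters(X, k=5):
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1][:, 1:]
    neigh = [set(row) for row in idx]
    mutual = [{j for j in neigh[i] if i in neigh[j]} for i in range(len(X))]
    labels = -np.ones(len(X), dtype=int)
    cluster = 0
    for seed in range(len(X)):
        if labels[seed] != -1:
            continue
        queue = deque([seed])
        while queue:                    # assemble one full cluster before moving on
            i = queue.popleft()
            if labels[i] != -1:
                continue
            labels[i] = cluster
            queue.extend(j for j in mutual[i] if labels[j] == -1)
        cluster += 1
    return labels
```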


Author(s):  
Emma Dann ◽  
Neil C. Henderson ◽  
Sarah A. Teichmann ◽  
Michael D. Morgan ◽  
John C. Marioni

2021 ◽  
Vol 25 (6) ◽  
pp. 1453-1471
Author(s):  
Chunhua Tang ◽  
Han Wang ◽  
Zhiwen Wang ◽  
Xiangkun Zeng ◽  
Huaran Yan ◽  
...  

Most density-based clustering algorithms suffer from difficult parameter setting, high time complexity, poor noise recognition, and weak clustering on datasets with uneven density. To solve these problems, this paper proposes the FOP-OPTICS algorithm (Finding of the Ordering Peaks Based on OPTICS), a substantial improvement of OPTICS (Ordering Points To Identify the Clustering Structure). The proposed algorithm finds the demarcation point (DP) from the augmented cluster-ordering generated by OPTICS and uses the reachability-distance of the DP as the neighborhood radius eps of its corresponding cluster. It thereby overcomes the weakness of most algorithms in clustering datasets with uneven densities. By computing the distance to the k-nearest neighbor of each point, it reduces the time complexity of OPTICS; by calculating density-mutation points within the clusters, it can efficiently recognize noise. The experimental results show that FOP-OPTICS has the lowest time complexity and outperforms other algorithms in parameter setting and noise recognition.
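The core of the approach, working from the reachability ordering that OPTICS produces and converting a demarcation value back into a DBSCAN-style eps, can be sketched with scikit-learn. The percentile heuristic below is an assumed stand-in for the paper's demarcation-point rule, which it does not reproduce.

```python
# Sketch: derive an eps from the OPTICS reachability ordering and extract
# clusters with it. The percentile is a stand-in for the paper's DP rule.
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

def fop_optics_like(X, min_samples=5):
    opt = OPTICS(min_samples=min_samples).fit(X)
    reach = opt.reachability_[opt.ordering_]
    finite = reach[np.isfinite(reach)]
    eps = np.percentile(finite, 90)      # assumed heuristic, not the paper's DP rule
    labels = cluster_optics_dbscan(
        reachability=opt.reachability_,
        core_distances=opt.core_distances_,
        ordering=opt.ordering_,
        eps=eps,
    )
    return labels                        # -1 marks points treated as noise
```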


2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Cheng Lu ◽  
Shiji Song ◽  
Cheng Wu

The Affinity Propagation (AP) algorithm is an effective algorithm for clustering analysis, but it is not directly applicable to the case of incomplete data. In view of the prevalence of missing data and the uncertainty of missing attributes, we put forward a modified AP clustering algorithm based on K-nearest neighbor intervals (KNNI) for incomplete data. Based on an Improved Partial Data Strategy, the proposed algorithm estimates the KNNI representation of missing attributes by using the attribute distribution information of the available data. The similarity function is then adapted to handle the interval data, so that the improved AP algorithm becomes applicable to incomplete data. Experiments on several UCI datasets show that the proposed algorithm achieves impressive clustering results.
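A simplified sketch of the KNN-interval idea: each missing attribute is replaced by a [min, max] interval taken from the k nearest neighbors found on the observed attributes, and Affinity Propagation then runs on a similarity derived from an interval distance. The distance handling of partially observed rows and the fallback to column means are simplifying assumptions, not the paper's Improved Partial Data Strategy.

```python
# Sketch: KNN intervals for missing attributes + Affinity Propagation on a
# precomputed similarity built from an interval distance.
import numpy as np
from sklearn.cluster import AffinityPropagation

def knn_interval_ap(X, k=5):
    """X: 2-D array with np.nan marking missing entries."""
    n = len(X)
    lo, hi = X.copy(), X.copy()
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        obs = ~miss
        # neighbor search on this row's observed attributes (NaNs elsewhere count as 0)
        d = np.nansum((X[:, obs] - X[i, obs]) ** 2, axis=1)
        d[i] = np.inf
        neigh = np.argsort(d)[:k]
        vals = X[np.ix_(neigh, np.where(miss)[0])]
        lo[i, miss] = np.nanmin(vals, axis=0)
        hi[i, miss] = np.nanmax(vals, axis=0)
    # fallback for attributes still missing after the neighbor pass
    col_means = np.nanmean(X, axis=0)
    lo = np.where(np.isnan(lo), col_means, lo)
    hi = np.where(np.isnan(hi), col_means, hi)
    # interval squared distance -> negative similarity for Affinity Propagation
    D2 = ((lo[:, None, :] - lo[None, :, :]) ** 2
          + (hi[:, None, :] - hi[None, :, :]) ** 2).sum(axis=2)
    return AffinityPropagation(affinity="precomputed").fit_predict(-D2)
```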


2019 ◽  
Author(s):  
Suhas Srinivasan ◽  
Nathan T. Johnson ◽  
Dmitry Korkin

Single-cell RNA sequencing (scRNA-seq) is a recent technology that enables fine-grained discovery of cellular subtypes and specific cell states. It routinely uses machine learning methods, such as feature learning, clustering, and classification, to assist in uncovering novel information from scRNA-seq data. However, current methods are not well suited to deal with the substantial amount of noise created by the experiments or with the variation arising from differences among cells of the same type. Here, we develop a new hybrid approach, Deep Unsupervised Single-cell Clustering (DUSC), that integrates feature generation based on a deep learning architecture with a model-based clustering algorithm to find a compact and informative representation of the single-cell transcriptomic data and to generate robust clusters. We also include a technique to estimate an efficient number of latent features in the deep learning model. Our method outperforms both classical and state-of-the-art feature learning and clustering methods, approaching the accuracy of supervised learning. The method is freely available to the community and will hopefully facilitate our understanding of the cellular atlas of living organisms as well as provide the means to improve patient diagnostics and treatment.
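The hybrid recipe, deep feature generation followed by model-based clustering, can be sketched as a small dense autoencoder whose latent codes are clustered with a Gaussian mixture. DUSC's own architecture, its latent-dimension estimator, and its training regime are not reproduced; the layer sizes, epoch count, and cluster number below are assumptions.

```python
# Rough sketch: autoencoder feature learning + Gaussian mixture clustering
# in the latent space. Illustrative of the hybrid idea only, not DUSC itself.
import torch
import torch.nn as nn
from sklearn.mixture import GaussianMixture

def dusc_like(X, latent=32, epochs=100, n_clusters=8):
    X = torch.as_tensor(X, dtype=torch.float32)
    d = X.shape[1]
    enc = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, latent))
    dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, d))
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(epochs):                          # full-batch training for brevity
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(X)), X)
        loss.backward()
        opt.step()
    with torch.no_grad():
        Z = enc(X).numpy()                           # compact latent representation
    return GaussianMixture(n_components=n_clusters).fit_predict(Z)
```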


2019 ◽  
Vol 9 (17) ◽  
pp. 3484
Author(s):  
Shuai Han ◽  
Heng Li ◽  
Mingchao Li ◽  
Timothy Rose

Hammering rocks of different strengths produces different sounds. Geological engineers often use this method to approximate the strengths of rocks in geological surveys. The method is quick and convenient but subjective. Inspired by this problem, we present a new, non-destructive method for measuring the surface strengths of rocks based on a deep neural network (DNN) and spectrogram analysis. First, all hammering sounds are transformed into spectrograms, and a clustering algorithm is presented to filter out outlier spectrograms automatically. One of the most advanced image classification DNNs, Inception-ResNet-v2, is then re-trained with the spectrograms. The results show that the training accuracy reaches 94.5%. Following this, three regression algorithms, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest (RF), are adopted to fit the relationship between the outputs of the DNN and the strength values. The tests show that KNN has the highest fitting accuracy and SVM the strongest generalization ability. The strengths (represented by rebound values) of almost all samples can be predicted within an error of [−5, 5]. Overall, the proposed method has great potential to support efficient rock strength measurement in the field.
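The final regression stage, mapping the DNN's outputs to rebound values with the three regressors the paper compares, can be sketched with scikit-learn. The feature matrix is assumed to come from the re-trained spectrogram classifier, and the hyperparameters below are illustrative rather than the paper's settings.

```python
# Sketch of the regression comparison: SVM, KNN, and RF fitted from DNN
# outputs (features) to rebound values, scored by cross-validated MAE.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

def compare_regressors(features, rebound_values):
    models = {
        "SVM": SVR(kernel="rbf", C=10.0),
        "KNN": KNeighborsRegressor(n_neighbors=5),
        "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    }
    return {
        name: cross_val_score(m, features, rebound_values,
                              scoring="neg_mean_absolute_error", cv=5).mean()
        for name, m in models.items()
    }
```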

