Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor

2018 ◽  
Vol 9 (1) ◽  
Author(s):  
Megan Crow ◽  
Anirban Paul ◽  
Sara Ballouz ◽  
Z. Josh Huang ◽  
Jesse Gillis
Author(s):  
Yinlei Hu ◽  
Bin Li ◽  
Falai Chen ◽  
Kun Qu

Abstract Unsupervised clustering is a fundamental step of single-cell RNA sequencing data analysis. This issue has inspired several clustering methods to classify cells in single-cell RNA sequencing data. However, accurate prediction of the cell clusters remains a substantial challenge. In this study, we propose a new algorithm for single-cell RNA sequencing data clustering based on Sparse Optimization and low-rank matrix factorization (scSO). We applied our scSO algorithm to analyze multiple benchmark datasets and showed that the cluster number predicted by scSO was close to the number of reference cell types and that most cells were correctly classified. Our scSO algorithm is available at https://github.com/QuKunLab/scSO. Overall, this study demonstrates a potent cell clustering approach that can help researchers distinguish cell types in single-cell RNA sequencing data.


2019 ◽  
Vol 21 (5) ◽  
pp. 1581-1595 ◽  
Author(s):  
Xinlei Zhao ◽  
Shuang Wu ◽  
Nan Fang ◽  
Xiao Sun ◽  
Jue Fan

Abstract Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable for an increasingly large number of scRNA-seq data sets due to complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adopted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some light on potential directions for future improvement of classification tools.
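The ensemble voting mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say how votes are combined or how ties are broken, so this sketch assumes a plain per-cell majority vote and assumes that a tie for first place yields the label "unassigned". The function name and the example labels are hypothetical and are not taken from the benchmarked tools.

```python
from collections import Counter


def majority_vote(predictions, unassigned="unassigned"):
    """Combine per-cell labels from several classifiers by majority vote.

    predictions: list of label lists, one list per tool, all of equal length.
    A tie for the most frequent label yields `unassigned` (an assumption of
    this sketch, not a rule stated in the abstract).
    """
    n_cells = len(predictions[0])
    if any(len(labels) != n_cells for labels in predictions):
        raise ValueError("every tool must label the same number of cells")

    consensus = []
    for labels in zip(*predictions):  # all tools' labels for one cell
        ranked = Counter(labels).most_common()
        top_label, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        consensus.append(unassigned if tied else top_label)
    return consensus


# Hypothetical labels from three tools for three cells.
tool_a = ["T cell", "B cell", "NK cell"]
tool_b = ["T cell", "B cell", "monocyte"]
tool_c = ["T cell", "monocyte", "T cell"]
consensus = majority_vote([tool_a, tool_b, tool_c])
```

Here the third cell receives three different labels, so the vote is tied and the cell is left unassigned rather than given an arbitrary label.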


2017 ◽  
Author(s):  
Luke Zappia ◽  
Belinda Phipson ◽  
Alicia Oshlack

Abstract As single-cell RNA sequencing technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here we present the Splatter Bioconductor package for simple, reproducible and well-documented simulation of single-cell RNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types or differentiation paths.
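To make the gamma-Poisson idea concrete, the following is a minimal sketch of a gamma-Poisson count simulation for a single cell population. It is not the Splat model or the Splatter API: the function names, the parameter values and the two-level structure (a gamma-distributed mean per gene, then a gamma-distributed rate per cell around that mean) are assumptions chosen for illustration only.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) value with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_counts(n_genes, n_cells, mean_shape=0.6, mean_scale=3.0,
                    bcv=0.4, seed=0):
    """Return an n_genes x n_cells matrix of gamma-Poisson counts.

    All parameter defaults are hypothetical illustration values.
    """
    rng = random.Random(seed)
    cell_shape = 1.0 / bcv ** 2  # gamma shape that gives the chosen variability
    counts = []
    for _ in range(n_genes):
        # Gene-level mean expression drawn from a gamma distribution.
        gene_mean = rng.gammavariate(mean_shape, mean_scale)
        row = []
        for _ in range(n_cells):
            # Cell-level rate: gamma with expectation equal to gene_mean.
            rate = rng.gammavariate(cell_shape, gene_mean / cell_shape)
            # Observed count: Poisson around the cell-level rate.
            row.append(poisson(rng, rate))
        counts.append(row)
    return counts


counts = simulate_counts(n_genes=5, n_cells=4, seed=1)
```

Fixing the seed makes the simulated matrix reproducible, which is the property the abstract emphasizes for simulation code.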


2018 ◽  
Author(s):  
Aaron T. L. Lun ◽  
Samantha Riesenfeld ◽  
Tallulah Andrews ◽  
Tomas Gomes ◽  
John C. Marioni ◽  
...  

Abstract Droplet-based single-cell RNA sequencing protocols have dramatically increased the throughput and efficiency of single-cell transcriptomics studies. A key computational challenge when processing these data is to distinguish libraries for real cells from empty droplets. Existing methods for cell calling set a minimum threshold on the total unique molecular identifier (UMI) count for each library, which indiscriminately discards cell libraries with low UMI counts. Here, we describe a new statistical method for calling cells from droplet-based data, based on detecting significant deviations from the expression profile of the ambient solution. Using simulations, we demonstrate that our method has greater power than existing approaches for detecting cell libraries with low UMI counts, while controlling the false discovery rate among detected cells. We also apply our method to real data, where we show that the use of our method results in the retention of distinct cell types that would otherwise have been discarded.


GigaScience ◽  
2019 ◽  
Vol 8 (10) ◽  
Author(s):  
Yun-Ching Chen ◽  
Abhilash Suresh ◽  
Chingiz Underbayev ◽  
Clare Sun ◽  
Komudi Singh ◽  
...  

Abstract Background: In single-cell RNA-sequencing analysis, clustering cells into groups and differentiating cell groups by differentially expressed (DE) genes are 2 separate steps for investigating cell identity. However, the ability to differentiate between cell groups could be affected by clustering. This interdependency often creates a bottleneck in the analysis pipeline, requiring researchers to repeat these 2 steps multiple times by setting different clustering parameters to identify a set of cell groups that are more differentiated and biologically relevant. Findings: To accelerate this process, we have developed IKAP, an algorithm to identify major cell groups and improve differentiating cell groups by systematically tuning parameters for clustering. We demonstrate that, with default parameters, IKAP successfully identifies major cell types such as T cells, B cells, natural killer cells, and monocytes in 2 peripheral blood mononuclear cell datasets and recovers major cell types in a previously published mouse cortex dataset. These major cell groups identified by IKAP present more distinguishing DE genes compared with cell groups generated by different combinations of clustering parameters. We further show that cell subtypes can be identified by recursively applying IKAP within identified major cell types, thereby delineating cell identities in a multi-layered ontology. Conclusions: By tuning the clustering parameters to identify major cell groups, IKAP greatly improves the automation of single-cell RNA-sequencing analysis to produce distinguishing DE genes and refine cell ontology using single-cell RNA-sequencing data.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Bobby Ranjan ◽  
Florian Schmidt ◽  
Wenjie Sun ◽  
Jinyu Park ◽  
Mohammad Amin Honardoost ◽  
...  

Abstract Background Clustering is a crucial step in the analysis of single-cell data. Clusters identified in an unsupervised manner are typically annotated to cell types based on differentially expressed genes. In contrast, supervised methods use a reference panel of labelled transcriptomes to guide both clustering and cell type identification. Supervised and unsupervised clustering approaches have their distinct advantages and limitations. Therefore, they can lead to different but often complementary clustering results. Hence, a consensus approach leveraging the merits of both clustering paradigms could result in a more accurate clustering and a more precise cell type annotation. Results We present scConsensus, an R framework for generating a consensus clustering by (1) integrating results from both unsupervised and supervised approaches and (2) refining the consensus clusters using differentially expressed genes. The value of our approach is demonstrated on several existing single-cell RNA sequencing datasets, including data from sorted PBMC sub-populations. Conclusions scConsensus combines the merits of unsupervised and supervised approaches to partition cells with better cluster separation and homogeneity, thereby increasing our confidence in detecting distinct cell types. scConsensus is implemented in R and is freely available on GitHub at https://github.com/prabhakarlab/scConsensus.


2020 ◽  
Author(s):  
Bobby Ranjan ◽  
Florian Schmidt ◽  
Wenjie Sun ◽  
Jinyu Park ◽  
Mohammad Amin Honardoost ◽  
...  

Clustering is a crucial step in the analysis of single-cell data. Clusters identified using unsupervised clustering are typically annotated to cell types based on differentially expressed genes. In contrast, supervised methods use a reference panel of labelled transcriptomes to guide both clustering and cell type identification. Supervised and unsupervised clustering strategies have their distinct advantages and limitations. Therefore, they can lead to different but often complementary clustering results. Hence, a consensus approach leveraging the merits of both clustering paradigms could result in a more accurate clustering and a more precise cell type annotation. We present scConsensus, an R framework for generating a consensus clustering by (i) integrating the results from both unsupervised and supervised approaches and (ii) refining the consensus clusters using differentially expressed (DE) genes. The value of our approach is demonstrated on several existing single-cell RNA sequencing datasets, including data from sorted PBMC sub-populations. scConsensus is freely available on GitHub at https://github.com/prabhakarlab/scConsensus.


2017 ◽  
Author(s):  
Lihua Zhang ◽  
Shihua Zhang

Abstract Single-cell RNA-sequencing (scRNA-seq) is a recent breakthrough technology, which paves the way for measuring RNA levels at single cell resolution to study precise biological functions. One of the main challenges when analyzing scRNA-seq data is the presence of zeros or dropout events, which may mislead downstream analyses. To compensate for the dropout effect, several methods have been developed to impute gene expression since the first Bayesian-based method was proposed in 2016. However, these methods have shown very diverse characteristics in terms of model hypothesis and imputation performance. Thus, large-scale comparison and evaluation of these methods is urgently needed. To this end, we compared eight imputation methods, evaluated their power in recovering original real data, and performed broad analyses to explore their effects on clustering cell types, detecting differentially expressed genes, and reconstructing lineage trajectories in the context of both simulated and real data. Simulated datasets and case studies highlight that no single method performs best in all situations. Some defects of these methods, such as scalability, robustness and unavailability in some situations, need to be addressed in future studies.


iScience ◽  
2020 ◽  
Vol 23 (3) ◽  
pp. 100882 ◽  
Author(s):  
Xin Shao ◽  
Jie Liao ◽  
Xiaoyan Lu ◽  
Rui Xue ◽  
Ni Ai ◽  
...  

Blood ◽  
2018 ◽  
Vol 132 (Supplement 1) ◽  
pp. 995-995
Author(s):  
Vincent-Philippe Lavallee ◽  
Elham Azizi ◽  
Vaidotas Kiseliovas ◽  
Ignas Masilionis ◽  
Linas Mazutis ◽  
...  

Abstract Introduction: Acute myeloid leukemia (AML) evolution is a multistep process in which cells evolve from hematopoietic stem and progenitor cells (HSPCs) that acquire genetic anomalies, such as chromosomal rearrangements and mutations, which define distinct subgroups. Mutations in Nucleophosmin 1 (NPM1), which occur in ~30% of patients, are the most frequent subgroup-defining mutations in AML and appear to be a late driver event in this disease. Bulk RNA-sequencing studies have identified differentially expressed genes between AML subgroups, but they are uninformative about the composition of cell types populating each sample. Large-scale single-cell RNA sequencing (scRNA-seq) technologies now enable a detailed characterization of intratumoral heterogeneity, and could help to better understand the stepwise evolution from normal to malignant cells. Methods: Twelve primary human AML specimens from MSKCC and Quebec Leukemia Cell Bank, including 8 with NPM1 mutations, were included in this cohort. Cells were subjected to scRNA-seq using 10X Genomics Chromium Single Cell 3' protocols and libraries were sequenced on Illumina HiSeq or NovaSeq platforms. FASTQ files were processed using the SEQC pipeline (Azizi E et al, Cell 2018), resulting in a carefully filtered count matrix of > 100,000 single cells (4877 to 11532 cells per sample). Results: Using Euclidean distance metrics and t-Distributed Stochastic Neighbor Embedding (t-SNE) visualization, we explored the phenotypic overlap between samples and showed that leukemia cells from different patients were mostly dissimilar, suggesting inter-sample heterogeneity. However, samples with similar morphology and similar NPM1 mutational status were phenotypically closer (Fig A), as anticipated from bulk RNA-sequencing data (TCGA, NEJM 2013).
We partitioned cells into distinct clusters using Phenograph (Levine J et al, Cell 2015) (Fig B) and measured the diversity of samples per cluster using Shannon's entropy metric, revealing that mature cell types (B/plasma cells, T/NK and erythroid cells, Fig C), presumably excluded from the tumor bulk, are transcriptionally similar across samples. Most notably, the next most diverse cluster (C36), comprising 438 cells from 11/12 samples, contains cells with a HSPC-like phenotype, as suggested by i) highest correlation of the centroid of this cluster with HSC1 (lin-/CD133+/CD34dim) population from sorted bulk RNA-sequencing data (Novershtern N et al, Cell 2011), and ii) marked GSEA enrichment for stem cell signatures (top enrichment: Jaatinen_hematopoeitic_stem_cell_up, NES = 9.04, FDR q-val = 0). To study the extent to which NPM1 or other mutations drive heterogeneity in leukemia populations, we interrogated 3'-derived single-cell sequences for all recurrent mutations in AML and found that NPM1 gene has unique features (e.g. relatively high single-cell expression and 3' localization) that allow specific identification of mutations in 5 to 34% of cells per mutated sample. To control for the high frequency of false negatives caused by dropouts in scRNA-seq data, we normalized the abundance of mutated vs wild-type cells to provide an estimation of mutation frequency in different cell types (Fig D). As expected, NPM1 mutations were rare in B and T/NK lymphoid cells (also observed using RT-qPCR in sorted populations by Dvorakova D et al, Leuk Lymphoma 2013) and were found in the majority of leukemia and myeloid cells. Interestingly, these mutations were detected at various frequencies in erythroid cells, suggesting that NPM1 mutations are acquired in cells with different lineage commitment in different patients. 
Most notably, the HSPC-like cluster C36 also contained a subpopulation of cells that have acquired NPM1 mutations and are transcriptionally different from wild-type cells. Conclusion: This study presents a first comprehensive single-cell map of primary AML, and the first 3'-based interrogation of mutations in single cells. It led to the identification of phenotypically distinct cells presenting a HSPC-like expression profile which were sub-clonally harboring NPM1 mutations, providing the means to identify deregulated genes in these important leukemia subpopulations. Disclosures Levine: Epizyme: Patents & Royalties; Celgene: Consultancy, Research Funding; Janssen: Consultancy, Honoraria; Isoplexis: Equity Ownership; C4 Therapeutics: Equity Ownership; Prelude: Research Funding; Gilead: Honoraria; Imago: Equity Ownership; Novartis: Consultancy; Roche: Consultancy, Research Funding; Loxo: Consultancy, Equity Ownership; Qiagen: Equity Ownership, Membership on an entity's Board of Directors or advisory committees.
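The abstract above measures the diversity of samples per cluster with Shannon's entropy. A minimal sketch of that calculation follows, assuming each cell carries one cluster label and one sample label and that entropy is taken over the per-cluster sample proportions in bits. The function name and the toy labels are hypothetical; the abstract does not specify the logarithm base or any normalization for unequal sample sizes.

```python
import math
from collections import Counter, defaultdict


def sample_entropy_per_cluster(cluster_ids, sample_ids):
    """Shannon entropy (bits) of the sample composition of each cluster.

    cluster_ids and sample_ids are parallel sequences, one entry per cell.
    Higher entropy means the cluster mixes cells from more samples more evenly.
    """
    composition = defaultdict(Counter)
    for cluster, sample in zip(cluster_ids, sample_ids):
        composition[cluster][sample] += 1

    entropies = {}
    for cluster, counts in composition.items():
        total = sum(counts.values())
        entropies[cluster] = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
    return entropies


# Toy example: C1 mixes four samples evenly, C2 comes from a single sample.
clusters = ["C1", "C1", "C1", "C1", "C2", "C2"]
samples = ["A", "B", "C", "D", "A", "A"]
entropies = sample_entropy_per_cluster(clusters, samples)
```

In this toy input the evenly mixed cluster reaches the maximum of log2(4) = 2 bits, while the single-sample cluster has zero entropy, mirroring the contrast the abstract draws between shared and patient-specific clusters.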

