Benchmarking UMI-based single-cell RNA-seq preprocessing workflows

Yue You; Luyi Tian; Shian Su; Xueyi Dong; Jafar S. Jabbari; Peter F. Hickey; Matthew E. Ritchie

doi:10.1186/s13059-021-02552-3

Benchmarking UMI-based single-cell RNA-seq preprocessing workflows

Genome Biology ◽

10.1186/s13059-021-02552-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yue You ◽

Luyi Tian ◽

Shian Su ◽

Xueyi Dong ◽

Jafar S. Jabbari ◽

...

Keyword(s):

Single Cell ◽

Ground Truth ◽

Rna Seq ◽

Clustering Methods ◽

Biological Complexity ◽

Cell Type ◽

Analysis Process ◽

Single Cell Rna Sequencing ◽

Downstream Analysis ◽

Almost All

Abstract Background: Single-cell RNA-sequencing (scRNA-seq) technologies and associated analysis methods have rapidly developed in recent years. This includes preprocessing methods, which assign sequencing reads to genes to create count matrices for downstream analysis. While several packaged preprocessing workflows have been developed to provide users with convenient tools for handling this process, how they compare to one another and how they influence downstream analysis have not been well studied. Results: Here, we systematically benchmark the performance of 10 end-to-end preprocessing workflows (Cell Ranger, Optimus, salmon alevin, alevin-fry, kallisto bustools, dropSeqPipe, scPipe, zUMIs, celseq2, and scruff) using datasets yielding different biological complexity levels generated by CEL-Seq2 and 10x Chromium platforms. We compare these workflows in terms of their quantification properties directly and their impact on normalization and clustering by evaluating the performance of different method combinations. While the scRNA-seq preprocessing workflows compared vary in their detection and quantification of genes across datasets, after downstream analysis with performant normalization and clustering methods, almost all combinations produce clustering results that agree well with the known cell type labels that provided the ground truth in our analysis. Conclusions: In summary, the choice of preprocessing method was found to be less important than other steps in the scRNA-seq analysis process. Our study comprehensively compares common scRNA-seq preprocessing workflows and summarizes their characteristics to guide workflow users.
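
The abstract above judges each workflow combination by how well its clusters agree with known cell type labels, but it does not name the agreement metric. As a minimal sketch, assuming the adjusted Rand index (ARI) as one common choice for this kind of comparison, agreement between a clustering and ground-truth labels could be computed as follows. The function name and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells.

    Assumption: ARI is used here only as an example agreement metric;
    the abstract does not state which metric the authors used.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)

    # Contingency counts: cells sharing a (true label, cluster) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of cell pairs placed together under each grouping.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    total = comb(n, 2)

    # Expected agreement under random labeling with the same group sizes.
    expected = sum_true * sum_pred / total if total else 0.0
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. a single group in both labelings.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


# Toy example: cluster IDs differ from the cell type names,
# but the grouping of cells is identical.
cell_types = ["A", "A", "B", "B"]
clusters = [1, 1, 0, 0]
score = adjusted_rand_index(cell_types, clusters)
```

A score near 1 indicates that the clusters recover the known cell types, while a score near 0 indicates agreement no better than chance. The metric ignores the cluster names themselves, which is why it suits comparisons across workflows whose cluster IDs are arbitrary.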


Benchmarking UMI-based single cell RNA-sequencing preprocessing workflows

10.1101/2021.06.17.448895 ◽

2021 ◽

Author(s):

Yue You ◽

Luyi Tian ◽

Shian Su ◽

Xueyi Dong ◽

Jafar Sheikh Jabbari ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Rapid Development ◽

Ground Truth ◽

Clustering Methods ◽

Analysis Process ◽

Single Cell Rna Sequencing ◽

Systematic Biases ◽

Downstream Analysis ◽

Almost All

Single-cell RNA sequencing (scRNA-seq) technologies and associated analysis methods have undergone rapid development in recent years. This includes methods for data preprocessing, which assign sequencing reads to genes to create count matrices for downstream analysis. Several packaged preprocessing workflows have been developed that aim to provide users with convenient tools for handling this process. How different preprocessing workflows compare to one another and influence downstream analysis has been less well studied. Here, we systematically benchmark the performance of 9 end-to-end preprocessing workflows (Cell Ranger, Optimus, salmon alevin, kallisto bustools, dropSeqPipe, scPipe, zUMIs, celseq2 and scruff) using datasets with varying levels of biological complexity generated on the CEL-Seq2 and 10x Chromium platforms. We compare these workflows in terms of their quantification properties directly and their impact on normalization and clustering by evaluating the performance of different method combinations. We find that lowly expressed genes are discordant between workflows and observe that some workflows have systematic biases towards particular classes of genomic features. While the scRNA-seq preprocessing workflows compared varied in their detection and quantification of genes across datasets, after downstream analysis with performant normalization and clustering methods, almost all combinations produced clustering results that agreed well with the known cell type labels that provided the ground truth in our analysis. In summary, the choice of preprocessing method was found to be less influential than other steps in the scRNA-seq analysis process. Our study comprehensively compares common scRNA-seq preprocessing workflows and summarizes their characteristics to guide workflow users.


Explainability methods for differential gene analysis of single cell RNA-seq clustering models

10.1101/2021.11.15.468416 ◽

2021 ◽

Author(s):

Madalina Ciortan ◽

Matthieu Defrance

Keyword(s):

Single Cell ◽

Traditional Approach ◽

Ground Truth ◽

Rna Seq ◽

Clustering Methods ◽

Numerous Model ◽

Cell Class ◽

Gradient Based ◽

Differential Gene ◽

Downstream Analysis

Single-cell RNA sequencing (scRNA-seq) produces transcriptomic profiling for individual cells. Due to the lack of cell-class annotations, scRNA-seq is routinely analyzed with unsupervised clustering methods. Because these methods are typically limited to producing clustering predictions (that is, assignment of cells to clusters of similar cells), numerous model agnostic differential expression (DE) libraries have been proposed to identify the genes expressed differently in the detected clusters, as needed in the downstream analysis. In parallel, the advancements in neural networks (NN) brought several model-specific explainability methods to identify salient features based on gradients, eliminating the need for external models. We propose a comprehensive study to compare the performance of dedicated DE methods, with that of explainability methods typically used in machine learning, both model agnostic (such as SHAP, permutation importance) and model-specific (such as NN gradient-based methods). The DE analysis is performed on the results of 3 state-of-the-art clustering methods based on NNs. Our results on 36 simulated datasets indicate that all analyzed DE methods have limited agreement between them and with ground-truth genes. The gradients method outperforms the traditional DE methods, which encourages the development of NN-based clustering methods to provide an out-of-the-box DE capability. Employing DE methods on the input data preprocessed by clustering method outperforms the traditional approach of using the original count data, albeit still performing worse than gradient-based methods.


Critical downstream analysis steps for single-cell RNA sequencing data

Briefings in Bioinformatics ◽

10.1093/bib/bbab105 ◽

2021 ◽

Author(s):

Zilong Zhang ◽

Feifei Cui ◽

Chen Lin ◽

Lingling Zhao ◽

Chunyu Wang ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Noisy Data ◽

Single Cell Level ◽

Cell Type ◽

Sequencing Data ◽

Cell Level ◽

Bioinformatics Tool ◽

Single Cell Rna Sequencing ◽

Downstream Analysis

Abstract Single-cell RNA sequencing (scRNA-seq) has enabled us to study biological questions at the single-cell level. Currently, many analysis tools are available to better utilize these relatively noisy data. In this review, we summarize the most widely used methods for critical downstream analysis steps (i.e. clustering, trajectory inference, cell-type annotation and integrating datasets). The advantages and limitations are comprehensively discussed, and we provide suggestions for choosing proper methods in different situations. We hope this paper will be useful for scRNA-seq data analysts and bioinformatics tool developers.


scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1820006116 ◽

2019 ◽

Vol 116 (20) ◽

pp. 9775-9784 ◽

Cited By ~ 38

Author(s):

Yingxin Lin ◽

Shila Ghazanfar ◽

Kevin Y. X. Wang ◽

Johann A. Gagnon-Bartsch ◽

Kitty K. Lo ◽

...

Keyword(s):

Factor Analysis ◽

Data Integration ◽

Single Cell ◽

Rna Seq ◽

Cell Type ◽

Large Collection ◽

Single Cell Rna Sequencing ◽

Development Trajectory ◽

Biological Discovery ◽

Public Datasets

Concerted examination of multiple collections of single-cell RNA sequencing (RNA-seq) data promises further biological insights that cannot be uncovered with individual datasets. Here we present scMerge, an algorithm that integrates multiple single-cell RNA-seq datasets using factor analysis of stably expressed genes and pseudoreplicates across datasets. Using a large collection of public datasets, we benchmark scMerge against published methods and demonstrate that it consistently provides improved cell type separation by removing unwanted factors; scMerge can also enhance biological discovery through robust data integration, which we show through the inference of development trajectory in a liver dataset collection.


MarkerCount: A stable, count-based cell type identifier for single cell RNA-Seq experiments

10.21203/rs.3.rs-418249/v1 ◽

2021 ◽

Author(s):

Hanbyeol Kim ◽

Joongho Lee ◽

Keunsoo Kang ◽

Seokhyun Yoon

Keyword(s):

Gene Expression ◽

Single Cell ◽

Cell Types ◽

Batch Effect ◽

Expression Level ◽

Rna Seq ◽

Cell Type ◽

Stable Performance ◽

Downstream Analysis

Abstract Cell type identification is a key step in the downstream analysis of single cell RNA-seq experiments. Indispensable information for this is gene expression, which is used to cluster cells, train the model and set rejection thresholds. The problem is that expression levels are subject to batch effects arising from different platforms and preprocessing. We present MarkerCount, which uses the number of markers expressed, regardless of their expression level, to initially identify cell types and then reassigns cell types on a cluster basis. MarkerCount works in both reference-based and marker-based modes, where the latter utilizes only existing lists of markers, while the former requires a pre-annotated dataset to train the model. The performance was evaluated and compared with existing identifiers, both marker-based and reference-based, that can be customized with publicly available datasets and marker DB. The results show that MarkerCount provides stable performance compared with other reference-based and marker-based cell type identifiers.
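
The two-step idea in the abstract above (count the markers expressed regardless of level, then reassign per cluster) can be illustrated with a toy sketch. This is not the MarkerCount implementation: the function names, the "unassigned" label, the nonzero-count rule and the majority vote are all assumptions made for illustration, and the published method includes details (such as rejection thresholds) that are not modeled here.

```python
from collections import Counter


def count_markers(cell_counts, markers):
    """Number of marker genes with a nonzero count in one cell.

    Expression level is ignored: a count of 1 and a count of 100
    both contribute exactly one expressed marker.
    """
    return sum(1 for gene in markers if cell_counts.get(gene, 0) > 0)


def initial_cell_type(cell_counts, marker_sets):
    """Pick the cell type whose marker list has the most expressed genes."""
    scores = {
        cell_type: count_markers(cell_counts, markers)
        for cell_type, markers in marker_sets.items()
    }
    best = max(scores, key=scores.get)
    # Hypothetical rule: leave the cell unassigned if no marker is expressed.
    return best if scores[best] > 0 else "unassigned"


def reassign_by_cluster(initial_types, clusters):
    """Replace each cell's label with the majority label of its cluster."""
    votes = {}
    for cell_type, cluster in zip(initial_types, clusters):
        votes.setdefault(cluster, Counter())[cell_type] += 1
    majority = {c: v.most_common(1)[0][0] for c, v in votes.items()}
    return [majority[c] for c in clusters]


# Toy data: per-cell gene counts, hypothetical marker lists, cluster IDs.
marker_sets = {"T cell": ["CD3D", "CD3E"], "B cell": ["CD79A", "MS4A1"]}
cells = [
    {"CD3D": 5, "CD3E": 1},
    {"CD3D": 1, "CD79A": 0},
    {"MS4A1": 9, "CD79A": 2},
    {},
]
clusters = [0, 0, 1, 0]

initial = [initial_cell_type(c, marker_sets) for c in cells]
final = reassign_by_cluster(initial, clusters)
```

In this toy run the last cell expresses no markers and starts as unassigned, but it inherits the majority label of its cluster in the second step. Counting expressed markers rather than summing expression is what makes the first step insensitive to platform-dependent scaling of counts.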


SCDC: bulk gene expression deconvolution by multiple single-cell RNA sequencing references

Briefings in Bioinformatics ◽

10.1093/bib/bbz166 ◽

2020 ◽

Cited By ~ 13

Author(s):

Meichen Dong ◽

Aatish Thennavan ◽

Eugene Urrutia ◽

Yun Li ◽

Charles M Perou ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Specific Gene ◽

Rna Seq ◽

Cell Type ◽

Mixed Cell ◽

Single Cell Rna Sequencing

Abstract Recent advances in single-cell RNA sequencing (scRNA-seq) enable characterization of transcriptomic profiles with single-cell resolution and circumvent averaging artifacts associated with traditional bulk RNA sequencing (RNA-seq) data. Here, we propose SCDC, a deconvolution method for bulk RNA-seq that leverages cell-type specific gene expression profiles from multiple scRNA-seq reference datasets. SCDC adopts an ENSEMBLE method to integrate deconvolution results from different scRNA-seq datasets that are produced in different laboratories and at different times, implicitly addressing the problem of batch-effect confounding. SCDC is benchmarked against existing methods using both in silico generated pseudo-bulk samples and experimentally mixed cell lines, whose known cell-type compositions serve as ground truths. We show that SCDC outperforms existing methods with improved accuracy of cell-type decomposition under both settings. To illustrate how the ENSEMBLE framework performs in complex tissues under different scenarios, we further apply our method to a human pancreatic islet dataset and a mouse mammary gland dataset. SCDC returns results that are more consistent with experimental designs and that reproduce more significant associations between cell-type proportions and measured phenotypes.


Reference Transcriptomes of Porcine Peripheral Immune Cells Created Through Bulk and Single-Cell RNA Sequencing

Frontiers in Genetics ◽

10.3389/fgene.2021.689406 ◽

2021 ◽

Vol 12 ◽

Author(s):

Juber Herrera-Uribe ◽

Jayne E. Wiarda ◽

Sathesh K. Sivasankaran ◽

Lance Daharsh ◽

Haibo Liu ◽

...

Keyword(s):

T Cells ◽

Single Cell ◽

Immune Cells ◽

Peripheral Blood ◽

Cell Types ◽

Cell Populations ◽

Rna Seq ◽

Cell Type ◽

Transcriptomic Data ◽

Single Cell Rna Sequencing

Pigs are a valuable human biomedical model and an important protein source supporting global food security. The transcriptomes of peripheral blood immune cells in pigs were defined at the bulk cell-type and single cell levels. First, eight cell types were isolated in bulk from peripheral blood mononuclear cells (PBMCs) by cell sorting, representing Myeloid, NK cells and specific populations of T and B-cells. Transcriptomes for each bulk population of cells were generated by RNA-seq with 10,974 expressed genes detected. Pairwise comparisons between cell types revealed specific expression, while enrichment analysis identified 1,885 to 3,591 significantly enriched genes across all 8 cell types. Gene Ontology analysis for the top 25% of significantly enriched genes (SEG) showed high enrichment of biological processes related to the nature of each cell type. Comparison of gene expression indicated highly significant correlations between pig cells and corresponding human PBMC bulk RNA-seq data available in Haemopedia. Second, higher resolution of distinct cell populations was obtained by single-cell RNA-sequencing (scRNA-seq) of PBMC. Seven PBMC samples were partitioned and sequenced that produced 28,810 single cell transcriptomes distributed across 36 clusters and classified into 13 general cell types including plasmacytoid dendritic cells (DC), conventional DCs, monocytes, B-cell, conventional CD4 and CD8 αβ T-cells, NK cells, and γδ T-cells. Signature gene sets from the human Haemopedia data were assessed for relative enrichment in genes expressed in pig cells and integration of pig scRNA-seq with a public human scRNA-seq dataset provided further validation for similarity between human and pig data. The sorted porcine bulk RNAseq dataset informed classification of scRNA-seq PBMC populations; specifically, an integration of the datasets showed that the pig bulk RNAseq data helped define the CD4CD8 double-positive T-cell populations in the scRNA-seq data. 
Overall, the data provides deep and well-validated transcriptomic data from sorted PBMC populations and the first single-cell transcriptomic data for porcine PBMCs. This resource will be invaluable for annotation of pig genes controlling immunogenetic traits as part of the porcine Functional Annotation of Animal Genomes (FAANG) project, as well as further study of, and development of new reagents for, porcine immunology.


DSAVE: Detection of misclassified cells in single-cell RNA-Seq data

PLoS ONE ◽

10.1371/journal.pone.0243360 ◽

2020 ◽

Vol 15 (12) ◽

pp. e0243360

Author(s):

Johan Gustafsson ◽

Jonathan Robinson ◽

Juan S. Inda-Díaz ◽

Elias Björnson ◽

Rebecka Jörnsten ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Cell Types ◽

Rna Seq ◽

Cell Type ◽

Log Likelihood ◽

Single Cell Rna Sequencing ◽

Cell Transcriptome ◽

Average Gene ◽

Single Cell Transcriptome

Single-cell RNA sequencing has become a valuable tool for investigating cell types in complex tissues, where clustering of cells enables the identification and comparison of cell populations. Although many studies have sought to develop and compare different clustering approaches, a deeper investigation into the properties of the resulting populations is lacking. Specifically, the presence of misclassified cells can influence downstream analyses, highlighting the need to assess subpopulation purity and to detect such cells. We developed DSAVE (Down-SAmpling based Variation Estimation), a method to evaluate the purity of single-cell transcriptome clusters and to identify misclassified cells. The method utilizes down-sampling to eliminate differences in sampling noise and uses a log-likelihood based metric to help identify misclassified cells. In addition, DSAVE estimates the number of cells needed in a population to achieve a stable average gene expression profile within a certain gene expression range. We show that DSAVE can be used to find potentially misclassified cells that are not detectable by similar tools and reveal the cause of their divergence from the other cells, such as differing cell state or cell type. With the growing use of single-cell RNA-seq, we foresee that DSAVE will be an increasingly useful tool for comparing and purifying subpopulations in single-cell RNA-Seq datasets.


Detecting cell-type-specific allelic expression imbalance by integrative analysis of bulk and single-cell RNA sequencing data

10.1101/2020.08.26.267815 ◽

2020 ◽

Author(s):

Jiaxin Fan ◽

Xuran Wang ◽

Rui Xiao ◽

Mingyao Li

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Cell Types ◽

Allelic Expression ◽

Rna Seq ◽

Allelic Expression Imbalance ◽

Cell Type ◽

Single Cell Rna Sequencing ◽

Cell Type Specific ◽

Different Cell Types

Abstract Allelic expression imbalance (AEI), quantified by the relative expression of two alleles of a gene in a diploid organism, can help explain phenotypic variations among individuals. Traditional methods detect AEI using bulk RNA sequencing (RNA-seq) data, a data type that averages out cell-to-cell heterogeneity in gene expression across cell types. Since the patterns of AEI may vary across different cell types, it is desirable to study AEI in a cell-type-specific manner. Although this can be achieved by single-cell RNA sequencing (scRNA-seq), it requires full-length transcript to be sequenced in single cells of a large number of individuals, which are still cost prohibitive to generate. To overcome this limitation and utilize the vast amount of existing disease relevant bulk tissue RNA-seq data, we developed BSCET, which enables the characterization of cell-type-specific AEI in bulk RNA-seq data by integrating cell type composition information inferred from a small set of scRNA-seq samples, possibly obtained from an external dataset. By modeling covariate effect, BSCET can also detect genes whose cell-type-specific AEI are associated with clinical factors. Through extensive benchmark evaluations, we show that BSCET correctly detected genes with cell-type-specific AEI and differential AEI between healthy and diseased samples using bulk RNA-seq data. BSCET also uncovered cell-type-specific AEIs that were missed in bulk data analysis when the directions of AEI are opposite in different cell types. We further applied BSCET to two pancreatic islet bulk RNA-seq datasets, and detected genes showing cell-type-specific AEI that are related to the progression of type 2 diabetes. Since bulk RNA-seq data are easily accessible, BSCET provided a convenient tool to integrate information from scRNA-seq data to gain insight on AEI with cell type resolution. Results from such analysis will advance our understanding of cell type contributions in human diseases.

Author Summary: Detection of allelic expression imbalance (AEI), a phenomenon where the two alleles of a gene differ in their expression magnitude, is a key step towards the understanding of phenotypic variations among individuals. Existing methods detect AEI using bulk RNA sequencing (RNA-seq) data and ignore AEI variations among different cell types. Although single-cell RNA sequencing (scRNA-seq) has enabled the characterization of cell-to-cell heterogeneity in gene expression, the high costs have limited its application in AEI analysis. To overcome this limitation, we developed BSCET to characterize cell-type-specific AEI using the widely available bulk RNA-seq data by integrating cell-type composition information inferred from scRNA-seq samples. Since the degree of AEI may vary with disease phenotypes, we further extended BSCET to detect genes whose cell-type-specific AEIs are associated with clinical factors. Through extensive benchmark evaluations and analyses of two pancreatic islet bulk RNA-seq datasets, we demonstrated BSCET’s ability to refine bulk-level AEI to cell-type resolution, and to identify genes whose cell-type-specific AEIs are associated with the progression of type 2 diabetes. With the vast amount of easily accessible bulk RNA-seq data, we believe BSCET will be a valuable tool for elucidating cell type contributions in human diseases.


Detecting cell-type-specific allelic expression imbalance by integrative analysis of bulk and single-cell RNA sequencing data

PLoS Genetics ◽

10.1371/journal.pgen.1009080 ◽

2021 ◽

Vol 17 (3) ◽

pp. e1009080

Author(s):

Jiaxin Fan ◽

Xuran Wang ◽

Rui Xiao ◽

Mingyao Li

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Cell Types ◽

Allelic Expression ◽

Rna Seq ◽

Allelic Expression Imbalance ◽

Cell Type ◽

Single Cell Rna Sequencing ◽

Cell Type Specific ◽

Different Cell Types

Allelic expression imbalance (AEI), quantified by the relative expression of two alleles of a gene in a diploid organism, can help explain phenotypic variations among individuals. Traditional methods detect AEI using bulk RNA sequencing (RNA-seq) data, a data type that averages out cell-to-cell heterogeneity in gene expression across cell types. Since the patterns of AEI may vary across different cell types, it is desirable to study AEI in a cell-type-specific manner. Although this can be achieved by single-cell RNA sequencing (scRNA-seq), it requires full-length transcript to be sequenced in single cells of a large number of individuals, which are still cost prohibitive to generate. To overcome this limitation and utilize the vast amount of existing disease relevant bulk tissue RNA-seq data, we developed BSCET, which enables the characterization of cell-type-specific AEI in bulk RNA-seq data by integrating cell type composition information inferred from a small set of scRNA-seq samples, possibly obtained from an external dataset. By modeling covariate effect, BSCET can also detect genes whose cell-type-specific AEI are associated with clinical factors. Through extensive benchmark evaluations, we show that BSCET correctly detected genes with cell-type-specific AEI and differential AEI between healthy and diseased samples using bulk RNA-seq data. BSCET also uncovered cell-type-specific AEIs that were missed in bulk data analysis when the directions of AEI are opposite in different cell types. We further applied BSCET to two pancreatic islet bulk RNA-seq datasets, and detected genes showing cell-type-specific AEI that are related to the progression of type 2 diabetes. Since bulk RNA-seq data are easily accessible, BSCET provided a convenient tool to integrate information from scRNA-seq data to gain insight on AEI with cell type resolution. Results from such analysis will advance our understanding of cell type contributions in human diseases.
