Adapted single-cell consensus clustering (adaSC3)

Abstract: The analysis of single-cell RNA sequencing data is of great importance in health research. It challenges data scientists but has enormous potential in the context of personalized medicine. The clustering of single cells aims to detect different subgroups of cell populations within a patient in a data-driven manner. Some comparison studies identify single-cell consensus clustering (SC3), proposed by Kiselev et al. (Nat Methods 14(5):483–486, 2017), as the best method for classifying single-cell RNA sequencing data. SC3 includes Laplacian eigenmaps and a principal component analysis (PCA). Our proposal, unsupervised adapted single-cell consensus clustering (adaSC3), replaces the linear PCA with diffusion maps, a non-linear method that takes the transition of single cells into account. We investigate the accuracy of adaSC3 on the data sets from the original SC3 publication as well as in a simulation study. A comparison of adaSC3 with SC3, and with related algorithms based on further alternative dimension reduction techniques, shows that adaSC3 performs quite convincingly.
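
To make the phrase "transition of single cells" concrete, the following is a minimal sketch of the first step of a diffusion map: a Gaussian kernel on pairwise cell distances, row-normalised into a Markov transition matrix. It is not the adaSC3 implementation. The function name `transition_matrix`, the bandwidth `sigma`, and the toy profiles are assumptions made for illustration, and the later eigen-decomposition that yields the embedding is omitted.

```python
import math

def transition_matrix(cells, sigma=1.0):
    """Row-normalised Gaussian kernel.

    Entry [i][j] approximates the probability of moving from cell i
    to cell j in one diffusion step (sigma is a hypothetical bandwidth).
    """
    n = len(cells)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(cells[i], cells[j]))
            kernel[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    # Divide each row by its sum so that every row is a probability distribution.
    return [[k / sum(row) for k in row] for row in kernel]

# Toy expression profiles (hypothetical): two similar cells and one distant cell.
cells = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
P = transition_matrix(cells)
```

In this toy example a step from the first cell to its near neighbour is far more likely than a step to the distant cell, which is the kind of non-linear neighbourhood structure a linear PCA does not model directly.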

A Manifold Proximal Linear Method for Sparse Spectral Clustering with Application to Single-Cell RNA Sequencing Data Analysis

INFORMS Journal on Optimization, 2021. DOI: 10.1287/ijoo.2021.0064

Author(s): Zhongruo Wang, Bingyuan Liu, Shixiang Chen, Shiqian Ma, Lingzhou Xue, ...

Keyword(s): Data Analysis, Single Cell, RNA Sequencing, Spectral Clustering, Optimization Problem, Linear Method, Data Sets, Sequencing Data, Adopted Model, Single Cell RNA Sequencing

Spectral clustering is one of the fundamental unsupervised learning methods and is widely used in data analysis. Sparse spectral clustering (SSC) imposes sparsity on spectral clustering, which improves the interpretability of the model. One widely adopted model for SSC in the literature is an optimization problem over the Stiefel manifold with a nonsmooth and nonconvex objective. Such an optimization problem is very challenging to solve. Existing methods usually solve its convex relaxation or smooth its nonsmooth objective using certain smoothing techniques. They therefore do not solve the original formulation of SSC. In this paper, we propose a manifold proximal linear method (ManPL) that solves the original SSC formulation without altering the model. We also extend the algorithm to solve multiple-kernel SSC problems, for which an alternating ManPL algorithm is proposed. Convergence and iteration complexity results of the proposed methods are established. We demonstrate the advantage of our proposed methods over existing methods via clustering of several data sets, including University of California Irvine and single-cell RNA sequencing data sets.
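
For orientation, one commonly cited form of the SSC model is given below. The abstract does not state the formulation, so this is an assumption rather than a quotation from the paper: $L$ denotes a normalized graph Laplacian of the $n$ data points, $k$ the number of clusters, $\lambda > 0$ a sparsity weight, and $\lVert \cdot \rVert_1$ the entrywise $\ell_1$ norm.

```latex
\min_{U \in \mathbb{R}^{n \times k}} \;
  \langle L,\, U U^{\top} \rangle \;+\; \lambda \,\lVert U U^{\top} \rVert_{1}
\quad \text{subject to} \quad U^{\top} U = I_k
```

Under this reading, the constraint $U^{\top} U = I_k$ defines the Stiefel manifold, the $\ell_1$ term is the nonsmooth part, and its composition with the quadratic map $U \mapsto U U^{\top}$ makes the objective nonconvex in $U$.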

scds: Computational Annotation of Doublets in Single Cell RNA Sequencing Data

2019. DOI: 10.1101/564021. Cited by ~4

Author(s): Abha S Bais, Dennis Kostka

Keyword(s): Single Cell, RNA Sequencing, Binary Classification, Single Cells, Computational Cost, Software Tool, Original Data, Data Sets, Sequencing Data, Single Cell RNA Sequencing

Abstract. Motivation: Single cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. Specifically, high-throughput approaches that employ micro-fluidics in combination with unique molecular identifiers (UMIs) are capable of assaying many thousands of cells per experiment and are rapidly becoming commonplace. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets is present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study’s conclusions, and therefore computational strategies for the identification of doublets are needed. Here we present single cell doublet scoring (scds), a software tool for the in silico identification of doublets in scRNA-seq data. Results: With scds, we propose two new and complementary approaches for doublet identification: co-expression based doublet scoring (cxds) and binary classification based doublet scoring (bcds). The co-expression based approach, cxds, utilizes binarized (absence/presence) gene expression data, employs a binomial model for the co-expression of pairs of genes, and yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from the original data. We apply our methods and existing doublet identification approaches to four data sets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, but at comparably little computational cost. We also find appreciable differences between methods and across data sets, and that no approach dominates all others, and we believe there is room for improvement in computational doublet identification as more data with experimental annotations become available. In the meantime, scds presents a scalable, competitive approach that allows for doublet annotations in thousands of cells in a matter of seconds. Availability and Implementation: scds is implemented as an R package and freely available at https://github.com/kostkalab/[email protected]
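
The idea behind bcds, training a classifier to separate artificial doublets from the original cells, can be sketched as follows. This is not the scds code. It assumes that an artificial doublet is formed by adding the count vectors of two randomly chosen cells, and the function name `make_artificial_doublets`, the seed, and the toy counts are hypothetical. The classifier itself is left out.

```python
import random

def make_artificial_doublets(counts, n_doublets, seed=0):
    """Simulate doublets by summing the counts of two distinct random cells.

    Assumption for illustration: a doublet's profile is the gene-wise sum
    of the two cells that were captured together.
    """
    rng = random.Random(seed)
    doublets = []
    for _ in range(n_doublets):
        i, j = rng.sample(range(len(counts)), 2)  # two distinct cell indices
        doublets.append([a + b for a, b in zip(counts[i], counts[j])])
    return doublets

# Toy count matrix (hypothetical): 4 cells x 3 genes.
counts = [[3, 0, 1], [0, 2, 0], [1, 1, 4], [0, 0, 5]]
doublets = make_artificial_doublets(counts, n_doublets=3)

# Training set for a binary classifier: 0 = original cell, 1 = artificial doublet.
features = counts + doublets
labels = [0] * len(counts) + [1] * len(doublets)
```

A classifier fitted to `features` and `labels` would then score each original cell by how doublet-like it looks; which classifier and which genes to use are choices this sketch does not address.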

Software Benchmark—Classification Tree Algorithms for Cell Atlases Annotation Using Single-Cell RNA-Sequencing Data

Microbiology Research, 2021, Vol 12 (2), pp. 317-334. DOI: 10.3390/microbiolres12020022

Author(s): Omar Alaqeeli, Li Xing, Xuekui Zhang

Keyword(s): Single Cell, RNA Sequencing, Classification Tree, Area Under The Curve, Data Sets, Sequencing Data, Single Cell RNA Sequencing, Tree Algorithms, R Packages

The classification tree is a widely used machine learning method. It has multiple implementations as R packages: rpart, ctree, evtree, tree and C5.0. The details of these implementations differ, and hence their performance varies from one application to another. We are interested in their performance in the classification of cells using single-cell RNA-sequencing data. In this paper, we conducted a benchmark study using 22 single-cell RNA-sequencing data sets. Using cross-validation, we compare the packages' prediction performance based on their Precision, Recall, F1-score and Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score and AUC; C5.0 prefers more complex trees; and tree is consistently much faster than the others, although its complexity is often higher.
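
As a reminder of how three of the reported metrics relate, here is a small sketch that computes Precision, Recall and F1-score for one cell type from true and predicted labels. The function name and the toy labels are hypothetical and are not taken from the benchmark.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (hypothetical): "T" = T cell, "B" = B cell.
y_true = ["T", "T", "B", "B", "T", "B"]
y_pred = ["T", "B", "B", "T", "T", "B"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="T")
```

With two true positives, one false positive and one false negative, all three values equal 2/3 here. In a cross-validated benchmark these quantities would be computed per fold and then averaged.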

Quality assessment of single-cell RNA sequencing data by coverage skewness analysis

2019. DOI: 10.1101/2019.12.31.890269

Author(s): Imad Abugessaisa, Shuhei Noguchi, Melissa Cardon, Akira Hasegawa, Kazuhide Watanabe, ...

Keyword(s): Quality Assessment, Single Cell, RNA Sequencing, Single Cells, Assessment Method, Poor Quality, Sequencing Data, Single Cell RNA Sequencing, Gene Coverage, The Impact

Abstract: Analysis and interpretation of single-cell RNA-sequencing (scRNA-seq) experiments are compromised by the presence of poor quality cells. For meaningful analyses, such poor quality cells should be excluded to avoid biases and large variation. However, no clear guidelines exist. We introduce SkewC, a novel quality-assessment method to identify poor quality single cells in scRNA-seq experiments. The method is based on the assessment of gene coverage for each single cell and uses its skewness as a quality measure. To validate the method, we investigated the impact of poor quality cells on downstream analyses and compared biological differences between typical and poor quality cells. Moreover, we measured the ratio of intergenic expression, which suggests genomic contamination, and the foreign organism contamination of single-cell samples. SkewC was tested on 37,993 single cells generated by 15 scRNA-seq protocols. We envision SkewC as an indispensable QC method to be incorporated into scRNA-seq experiments to preclude the possibility of scRNA-seq data misinterpretation.
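
One way to read "skewness of gene coverage" is as the third standardised moment of position along the gene body, weighted by read coverage. The sketch below follows that reading. It is an assumption made for illustration and not the SkewC implementation, and the function name and toy coverage profiles are hypothetical.

```python
def coverage_skewness(coverage):
    """Skewness of gene-body position, weighted by coverage at each position.

    coverage[i] is the read coverage in the i-th bin from 5' to 3'.
    A value near 0 indicates a symmetric profile; a negative value indicates
    that coverage piles up at the 3' end with a tail towards the 5' end.
    """
    total = sum(coverage)
    if total == 0:
        return 0.0
    positions = range(len(coverage))
    mean = sum(p * c for p, c in zip(positions, coverage)) / total
    m2 = sum(c * (p - mean) ** 2 for p, c in zip(positions, coverage)) / total
    m3 = sum(c * (p - mean) ** 3 for p, c in zip(positions, coverage)) / total
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

# Toy coverage profiles over 10 gene-body bins (hypothetical).
uniform_cell = [10] * 10
biased_cell = [1, 1, 1, 2, 2, 3, 5, 10, 20, 40]

skew_uniform = coverage_skewness(uniform_cell)
skew_biased = coverage_skewness(biased_cell)
```

Here the uniform profile gives a skewness of zero, while the profile concentrated at the 3' end gives a clearly negative value. How such values are turned into a keep-or-discard decision is not specified in the abstract.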

Evaluation of single-cell classifiers for single-cell RNA sequencing data sets

Briefings in Bioinformatics, 2019, Vol 21 (5), pp. 1581-1595. DOI: 10.1093/bib/bbz096. Cited by ~6

Author(s): Xinlei Zhao, Shuang Wu, Nan Fang, Xiao Sun, Jue Fan

Keyword(s): Single Cell, RNA Sequencing, Reference Data, Predictive Accuracy, Cell Types, Superior Performance, Marker Genes, Data Sets, Sequencing Data, Single Cell RNA Sequencing

Abstract: Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable for an increasingly large number of scRNA-seq data sets due to complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adopted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some light on potential directions for future improvement of classification tools.
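
The "simple ensemble voting" mentioned above can be illustrated with a majority vote over per-cell predictions. The abstract does not describe the exact voting rule, so the strict-majority threshold `min_agreement`, the fallback to 'unassigned', and the tool names below are assumptions for this sketch.

```python
from collections import Counter

def ensemble_vote(predictions, min_agreement=0.5):
    """Majority vote across tools, one consensus label per cell.

    predictions: list of label lists, one list per tool, aligned by cell.
    A cell is 'unassigned' unless more than min_agreement of the tools agree.
    """
    consensus = []
    for cell_labels in zip(*predictions):
        label, votes = Counter(cell_labels).most_common(1)[0]
        if votes / len(cell_labels) > min_agreement:
            consensus.append(label)
        else:
            consensus.append("unassigned")
    return consensus

# Toy predictions from three hypothetical tools for three cells.
tool_a = ["T cell", "B cell", "NK cell"]
tool_b = ["T cell", "B cell", "T cell"]
tool_c = ["T cell", "Monocyte", "B cell"]
consensus = ensemble_vote([tool_a, tool_b, tool_c])
```

The first two cells receive the majority label, while the third, on which all tools disagree, is left 'unassigned'. That kind of disagreement is where a novel cell type might be hiding.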

Estimating the Allele-Specific Expression of SNVs From 10× Genomics Single-Cell RNA-Sequencing Data

Genes, 2020, Vol 11 (3), pp. 240. DOI: 10.3390/genes11030240. Cited by ~2

Author(s): Prashant N. M., Hongyu Liu, Pavlos Bousounis, Liam Spurr, Nawaf Alomran, ...

Keyword(s): Single Cell, RNA Sequencing, Single Cells, Sequencing Data, Specific Expression, Single Nucleotide, Healthy Donors, Allele Expression, Single Cell RNA Sequencing, Allele Specific

With the recent advances in single-cell RNA-sequencing (scRNA-seq) technologies, the estimation of allele expression from single cells is becoming increasingly reliable. Allele expression is both quantitative and dynamic and is an essential component of the genomic interactome. Here, we systematically estimate the allele expression from heterozygous single nucleotide variant (SNV) loci using scRNA-seq data generated on the 10× Genomics Chromium platform. We analyzed 26,640 human adipose-derived mesenchymal stem cells (from three healthy donors), sequenced to an average of 150K sequencing reads per cell (more than 4 billion scRNA-seq reads in total). High-quality SNV calls assessed in our study contained approximately 15% exonic and >50% intronic loci. To analyze the allele expression, we estimated the expressed variant allele fraction (VAF_RNA) from SNV-aware alignments and analyzed its variance and distribution (mono- and bi-allelic) at different minimum sequencing read thresholds. Our analysis shows that when assessing positions covered by a minimum of three unique sequencing reads, over 50% of the heterozygous SNVs show bi-allelic expression, while at a threshold of 10 reads, nearly 90% of the SNVs are bi-allelic. In addition, our analysis demonstrates the feasibility of scVAF_RNA estimation from current scRNA-seq datasets and shows that the 3′-based library generation protocol of 10× Genomics scRNA-seq data can be informative in SNV-based studies, including analyses of transcriptional kinetics.
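
The expressed variant allele fraction is conventionally the number of reads carrying the variant allele divided by the total number of reads covering the locus. The sketch below applies a minimum-read threshold as described in the abstract. The cutoffs used to call a locus mono- or bi-allelic (`low`, `high`) and the toy read counts are hypothetical and are not values from the study.

```python
def vaf_rna(n_var, n_ref, min_reads=3):
    """Variant allele fraction at one SNV locus in one cell.

    Returns None when the locus is covered by fewer than min_reads reads.
    """
    total = n_var + n_ref
    if total < min_reads:
        return None
    return n_var / total

def classify(vaf, low=0.1, high=0.9):
    """Label a locus from its VAF; low/high are hypothetical cutoffs."""
    if vaf is None:
        return "not assessed"
    return "bi-allelic" if low <= vaf <= high else "mono-allelic"

# Toy loci (hypothetical): (variant reads, reference reads).
loci = {"snv1": (4, 6), "snv2": (0, 12), "snv3": (1, 1)}
calls = {name: classify(vaf_rna(v, r, min_reads=3)) for name, (v, r) in loci.items()}
```

Raising `min_reads` discards thinly covered loci such as `snv3`. This is consistent with the abstract's observation that the share of bi-allelic calls rises with the read threshold, since a locus with few reads can look mono-allelic by chance.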

DrivAER: Identification of driving transcriptional programs in single-cell RNA sequencing data

2019. DOI: 10.1101/864165

Author(s): Lukas M. Simon, Fangfang Yan, Zhongming Zhao

Keyword(s): Single Cell, RNA Sequencing, Disease Status, Data Sets, Sequencing Data, Functional Interpretation, Recent Success, Gene Sets, Single Cell RNA Sequencing, Cellular Maps

Abstract: Single cell RNA sequencing (scRNA-seq) unfolds complex transcriptomic data sets into detailed cellular maps. Despite recent success, there is a pressing need for specialized methods tailored towards the functional interpretation of these cellular maps. Here, we present DrivAER, a machine learning approach that scores annotated gene sets based on their relevance to user-specified outcomes such as pseudotemporal ordering or disease status. We demonstrate that DrivAER extracts the key driving pathways and transcription factors that regulate complex biological processes from scRNA-seq data.

SNV identification from single-cell RNA sequencing data

Human Molecular Genetics, 2019, Vol 28 (21), pp. 3569-3583. DOI: 10.1093/hmg/ddz207. Cited by ~3

Author(s): Patricia M Schnepp, Mengjie Chen, Evan T Keller, Xiang Zhou

Keyword(s): DNA Sequencing, Single Cell, RNA Sequencing, Single Cells, Specific Gene, Sequencing Data, Single Nucleotide Variants, Single Cell RNA Sequencing, Sequencing Studies, Genomic Regions

Abstract: Integrating single-cell RNA sequencing (scRNA-seq) data with genotypes obtained from DNA sequencing studies facilitates the detection of functional genetic variants underlying cell type-specific gene expression variation. Unfortunately, most existing scRNA-seq studies do not come with DNA sequencing data; thus, being able to call single nucleotide variants (SNVs) from scRNA-seq data alone can provide crucial and complementary information for the detection of functional SNVs, maximizing the potential of existing scRNA-seq studies. Here, we perform extensive analyses to evaluate the utility of two SNV calling pipelines (GATK and Monovar), originally designed for SNV calling in either bulk or single-cell DNA sequencing data. In both pipelines, we examined various parameter settings to determine the accuracy of the final SNV call set and provide practical recommendations for applied analysts. We found that combining all reads from the single cells and following GATK Best Practices resulted in the highest number of SNVs identified with a high concordance. In individual single cells, Monovar resulted in better quality SNVs, even though none of the pipelines analyzed is capable of calling a reasonable number of SNVs with high accuracy. In addition, we found that SNV calling quality varies across different functional genomic regions. Our results open doors for novel ways to leverage the use of scRNA-seq for the future investigation of SNV function.

scConsensus: combining supervised and unsupervised clustering for cell type identification in single-cell RNA sequencing data

BMC Bioinformatics, 2021, Vol 22 (1). DOI: 10.1186/s12859-021-04028-4

Author(s): Bobby Ranjan, Florian Schmidt, Wenjie Sun, Jinyu Park, Mohammad Amin Honardoost, ...

Keyword(s): Single Cell, RNA Sequencing, Differentially Expressed Genes, Cell Types, Unsupervised Clustering, Differentially Expressed, Consensus Clustering, Cell Type, Sequencing Data, Single Cell RNA Sequencing

Abstract. Background: Clustering is a crucial step in the analysis of single-cell data. Clusters identified in an unsupervised manner are typically annotated to cell types based on differentially expressed genes. In contrast, supervised methods use a reference panel of labelled transcriptomes to guide both clustering and cell type identification. Supervised and unsupervised clustering approaches have their distinct advantages and limitations. Therefore, they can lead to different but often complementary clustering results. Hence, a consensus approach leveraging the merits of both clustering paradigms could result in a more accurate clustering and a more precise cell type annotation. Results: We present scConsensus, an R framework for generating a consensus clustering by (1) integrating results from both unsupervised and supervised approaches and (2) refining the consensus clusters using differentially expressed genes. The value of our approach is demonstrated on several existing single-cell RNA sequencing datasets, including data from sorted PBMC sub-populations. Conclusions: scConsensus combines the merits of unsupervised and supervised approaches to partition cells with better cluster separation and homogeneity, thereby increasing our confidence in detecting distinct cell types. scConsensus is implemented in R and is freely available on GitHub at https://github.com/prabhakarlab/scConsensus.
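
Step (1), integrating an unsupervised and a supervised labelling, can be pictured with a contingency table of the two label sets. The sketch below is written in Python for illustration only (scConsensus itself is an R framework) and does not reproduce its algorithm: the rule that combinations with fewer than `min_cells` cells fall back to the cluster's dominant cell type is an assumption, and the refinement by differentially expressed genes in step (2) is omitted.

```python
from collections import Counter

def consensus_labels(unsupervised, supervised, min_cells=2):
    """Combine cluster ids and cell-type labels into consensus labels.

    Rare (cluster, cell type) combinations are reassigned to the dominant
    cell type of their cluster; min_cells is a hypothetical threshold.
    """
    # Contingency table: number of cells per (cluster, cell type) pair.
    table = Counter(zip(unsupervised, supervised))
    # Dominant cell type per cluster (first encountered wins ties).
    dominant = {}
    for (cluster, cell_type), n in table.items():
        if cluster not in dominant or n > table[(cluster, dominant[cluster])]:
            dominant[cluster] = cell_type
    labels = []
    for cluster, cell_type in zip(unsupervised, supervised):
        if table[(cluster, cell_type)] < min_cells:
            cell_type = dominant[cluster]
        labels.append(f"{cluster}_{cell_type}")
    return labels

# Toy labels for seven cells (hypothetical).
unsupervised = ["c1", "c1", "c1", "c2", "c2", "c2", "c2"]
supervised = ["T", "T", "B", "B", "B", "NK", "NK"]
consensus = consensus_labels(unsupervised, supervised)
```

In this toy case cluster `c2` is split into `c2_B` and `c2_NK` because both cell types are well represented, while the single `B` call inside `c1` is absorbed into `c1_T`.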

scds: computational annotation of doublets in single-cell RNA sequencing data

Bioinformatics, 2019. DOI: 10.1093/bioinformatics/btz698. Cited by ~3

Author(s): Abha S Bais, Dennis Kostka

Keyword(s): Single Cell, RNA Sequencing, Binary Classification, Single Cells, Computational Cost, Original Data, R Package, Supplementary Information, Sequencing Data, Single Cell RNA Sequencing

Abstract. Motivation: Single-cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets is present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study’s conclusions, and therefore computational strategies for the identification of doublets are needed. Results: With scds, we propose two new approaches for in silico doublet identification: co-expression based doublet scoring (cxds) and binary classification based doublet scoring (bcds). The co-expression based approach, cxds, utilizes binarized (absence/presence) gene expression data and, employing a binomial model for the co-expression of pairs of genes, yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from original data. We apply our methods and existing computational doublet identification approaches to four datasets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, at comparably little computational cost. We observe appreciable differences between methods and across datasets and that no approach dominates all others. In summary, scds presents a scalable, competitive approach that allows for doublet annotation of datasets with thousands of cells in a matter of seconds. Availability and implementation: scds is implemented as a Bioconductor R package (doi: 10.18129/B9.bioc.scds). Supplementary information: Supplementary data are available at Bioinformatics online.
