scds: computational annotation of doublets in single-cell RNA sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btz698 ◽

2019 ◽

Cited By ~ 3

Author(s):

Abha S Bais ◽

Dennis Kostka

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Binary Classification ◽

Single Cells ◽

Computational Cost ◽

Original Data ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Single Cell Rna Sequencing

Abstract. Motivation: Single-cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets is present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study’s conclusions, and therefore computational strategies for the identification of doublets are needed. Results: With scds, we propose two new approaches for in silico doublet identification: co-expression based doublet scoring (cxds) and binary classification based doublet scoring (bcds). The co-expression based approach, cxds, utilizes binarized (absence/presence) gene expression data and, employing a binomial model for the co-expression of pairs of genes, yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from original data. We apply our methods and existing computational doublet identification approaches to four datasets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, at comparably little computational cost. We observe appreciable differences between methods and across datasets and that no approach dominates all others. In summary, scds presents a scalable, competitive approach that allows for doublet annotation of datasets with thousands of cells in a matter of seconds. Availability and implementation: scds is implemented as a Bioconductor R package (doi: 10.18129/B9.bioc.scds). Supplementary information: Supplementary data are available at Bioinformatics online.
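The co-expression idea behind cxds can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the scds implementation: the pair score used here (the negative log of a binomial upper-tail probability for mutually exclusive expression) and the rule that a cell's score sums the scores of the gene pairs it co-expresses are simplifications, and all function names are hypothetical.

```python
import math
from itertools import combinations


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    tail = sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))
    return min(1.0, tail)  # guard against floating-point overshoot


def coexpression_doublet_scores(counts):
    """counts: list of cells, each a list of per-gene counts (hypothetical layout)."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    # Binarize expression: 1 if the gene is detected in the cell, else 0.
    present = [[1 if c > 0 else 0 for c in cell] for cell in counts]
    freq = [sum(cell[g] for cell in present) / n_cells for g in range(n_genes)]

    pair_score = {}
    for i, j in combinations(range(n_genes), 2):
        # Under independence, probability that exactly one of the two genes is detected.
        p_excl = freq[i] * (1 - freq[j]) + freq[j] * (1 - freq[i])
        k_excl = sum(1 for cell in present if cell[i] != cell[j])
        # Pairs that are mutually exclusive more often than expected score high.
        tail = binom_upper_tail(k_excl, n_cells, p_excl)
        pair_score[(i, j)] = -math.log(max(tail, 1e-300))

    # A cell accumulates the scores of all gene pairs it co-expresses.
    return [
        sum(s for (i, j), s in pair_score.items() if cell[i] and cell[j])
        for cell in present
    ]
```

In a toy dataset with two cell types that each express one marker gene, a cell expressing both markers receives the highest score, because that marker pair is otherwise almost never co-expressed.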


scds: Computational Annotation of Doublets in Single Cell RNA Sequencing Data

10.1101/564021 ◽

2019 ◽

Cited By ~ 4

Author(s):

Abha S Bais ◽

Dennis Kostka

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Binary Classification ◽

Single Cells ◽

Computational Cost ◽

Software Tool ◽

Original Data ◽

Data Sets ◽

Sequencing Data ◽

Single Cell Rna Sequencing

Abstract. Motivation: Single cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. Specifically, high-throughput approaches that employ micro-fluidics in combination with unique molecular identifiers (UMIs) are capable of assaying many thousands of cells per experiment and are rapidly becoming commonplace. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets is present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study’s conclusions, and therefore computational strategies for the identification of doublets are needed. Here we present single cell doublet scoring (scds), a software tool for the in silico identification of doublets in scRNA-seq data. Results: With scds, we propose two new and complementary approaches for doublet identification: co-expression based doublet scoring (cxds) and binary classification based doublet scoring (bcds). The co-expression based approach, cxds, utilizes binarized (absence/presence) gene expression data and employs a binomial model for the co-expression of pairs of genes, yielding interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from the original data. We apply our methods and existing doublet identification approaches to four data sets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, but at comparably little computational cost. We also find appreciable differences between methods and across data sets and that no approach dominates all others, and we believe there is room for improvement in computational doublet identification as more data with experimental annotations become available. In the meantime, scds presents a scalable, competitive approach that allows for doublet annotations in thousands of cells in a matter of seconds. Availability and implementation: scds is implemented as an R package and freely available at https://github.com/kostkalab/[email protected]
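The classification-based idea (bcds) can be sketched in Python as follows. This is a minimal illustration, not the scds implementation: it assumes artificial doublets are formed by summing the counts of two randomly chosen cells, and it swaps in a plain logistic regression on log-transformed counts as the binary classifier. The function names, the learning rate and the number of epochs are hypothetical choices.

```python
import math
import random


def make_artificial_doublets(cells, n_doublets, rng):
    """Sum the counts of two distinct, randomly chosen cells per artificial doublet."""
    doublets = []
    for _ in range(n_doublets):
        a, b = rng.sample(range(len(cells)), 2)
        doublets.append([x + y for x, y in zip(cells[a], cells[b])])
    return doublets


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent for logistic regression (stand-in classifier)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def classification_doublet_scores(cells, seed=0):
    """Score each original cell by its predicted probability of being an artificial doublet."""
    rng = random.Random(seed)
    artificial = make_artificial_doublets(cells, len(cells), rng)

    def features(cell):
        return [math.log1p(v) for v in cell]

    X = [features(c) for c in cells + artificial]
    y = [0] * len(cells) + [1] * len(artificial)  # 1 marks artificial doublets
    w, b = fit_logistic(X, y)
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, features(c))) + b) for c in cells]
```

Original cells that resemble the artificial doublets receive high scores, which is what makes the classifier output usable as a doublet score.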


schex avoids overplotting for large single-cell RNA-sequencing datasets

Bioinformatics ◽

10.1093/bioinformatics/btz907 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2291-2292 ◽

Cited By ~ 1

Author(s):

Saskia Freytag ◽

Ryan Lister

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Single Cell Rna Sequencing

Abstract. Summary: Due to the scale and sparsity of single-cell RNA-sequencing data, traditional plots can obscure vital information. Our R package schex overcomes this by implementing hexagonal binning, which has the additional advantages of improving speed and reducing storage for resulting plots. Availability and implementation: schex is freely available from Bioconductor via http://bioconductor.org/packages/release/bioc/html/schex.html and its development version can be accessed on GitHub via https://github.com/SaskiaFreytag/schex. Supplementary information: Supplementary data are available at Bioinformatics online.
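Hexagonal binning itself can be sketched in a few lines of Python. The sketch below is a generic illustration of the technique, not schex's R implementation: it assigns each 2D point (for example, a cell's coordinates in a reduced-dimension plot) to the nearest centre of a hexagonal lattice, built from two offset rectangular grids. The function name `hex_bin` and the `width` parameter are hypothetical.

```python
import math
from collections import Counter


def hex_bin(points, width):
    """Count points per hexagon; hexagon centres are `width` apart horizontally.

    Bins are keyed by (lattice, i, j), where lattice 0 is the base grid and
    lattice 1 is the grid shifted by half a step in both directions.
    """
    dy = width * math.sqrt(3)  # vertical spacing within each rectangular grid
    counts = Counter()
    for x, y in points:
        # Nearest centre on the base grid.
        i1, j1 = round(x / width), round(y / dy)
        d1 = (x - i1 * width) ** 2 + (y - j1 * dy) ** 2
        # Nearest centre on the shifted grid.
        i2, j2 = math.floor(x / width), math.floor(y / dy)
        d2 = (x - (i2 + 0.5) * width) ** 2 + (y - (j2 + 0.5) * dy) ** 2
        key = (0, i1, j1) if d1 <= d2 else (1, i2, j2)
        counts[key] += 1
    return counts
```

Plotting one hexagon per bin, coloured by the count or by a summary of gene expression, avoids drawing every cell as an individual point, so dense regions no longer hide the points beneath them.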


EnImpute: imputing dropout events in single-cell RNA-sequencing data via ensemble learning

Bioinformatics ◽

10.1093/bioinformatics/btz435 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4827-4829 ◽

Cited By ~ 6

Author(s):

Xiao-Fei Zhang ◽

Le Ou-Yang ◽

Shuo Yang ◽

Xing-Ming Zhao ◽

Xiaohua Hu ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Ensemble Learning ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Single Cell Rna Sequencing ◽

The Individual ◽

Downstream Analysis ◽

Shiny Application

Abstract. Summary: Imputation of dropout events that may mislead downstream analyses is a key step in analyzing single-cell RNA-sequencing (scRNA-seq) data. We develop EnImpute, an R package that introduces an ensemble learning method for imputing dropout events in scRNA-seq data. EnImpute combines the results obtained from multiple imputation methods to generate a more accurate result. A Shiny application is developed to provide easier implementation and visualization. Experimental results show that EnImpute outperforms the individual state-of-the-art methods in almost all situations. EnImpute is useful for correcting noisy scRNA-seq data before performing downstream analysis. Availability and implementation: The R package and Shiny application are available through GitHub at https://github.com/Zhangxf-ccnu/EnImpute. Supplementary information: Supplementary data are available at Bioinformatics online.
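One simple way to combine the outputs of several imputation methods is an entry-wise trimmed mean, sketched below in Python. The abstract does not state the combination rule, so the trimmed mean, the `trim` fraction and the function names are assumptions made for illustration only.

```python
def trimmed_mean(values, trim=0.2):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    if not 0.0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # number of values dropped at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def ensemble_impute(matrices, trim=0.2):
    """Combine same-shaped imputed matrices (lists of rows) entry by entry."""
    n_rows = len(matrices[0])
    n_cols = len(matrices[0][0])
    return [
        [trimmed_mean([m[r][c] for m in matrices], trim) for c in range(n_cols)]
        for r in range(n_rows)
    ]
```

Trimming limits the influence of a single method that produces an extreme value for a given gene and cell, which is one motivation for combining methods in the first place.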


SPsimSeq: semi-parametric simulation of bulk and single-cell RNA-sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa105 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3276-3278 ◽

Cited By ~ 2

Author(s):

Alemu Takele Assefa ◽

Jo Vandesompele ◽

Olivier Thas

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Real Data ◽

Simulation Method ◽

R Package ◽

Supplementary Information ◽

Expression Data ◽

Sequencing Data ◽

Wide Range ◽

Single Cell Rna Sequencing

Abstract. Summary: SPsimSeq is a semi-parametric simulation method to generate bulk and single-cell RNA-sequencing data. It is designed to simulate gene expression data with maximal retention of the characteristics of real data. It is reasonably flexible to accommodate a wide range of experimental scenarios, including different sample sizes, biological signals (differential expression) and confounding batch effects. Availability and implementation: The R package and associated documentation are available from https://github.com/CenterForStatistics-UGent/SPsimSeq. Supplementary information: Supplementary data are available at Bioinformatics online.


Quality assessment of single-cell RNA sequencing data by coverage skewness analysis

10.1101/2019.12.31.890269 ◽

2019 ◽

Author(s):

Imad Abugessaisa ◽

Shuhei Noguchi ◽

Melissa Cardon ◽

Akira Hasegawa ◽

Kazuhide Watanabe ◽

...

Keyword(s):

Quality Assessment ◽

Single Cell ◽

Rna Sequencing ◽

Single Cells ◽

Assessment Method ◽

Poor Quality ◽

Sequencing Data ◽

Single Cell Rna Sequencing ◽

Gene Coverage ◽

The Impact

Abstract. Analysis and interpretation of single-cell RNA-sequencing (scRNA-seq) experiments are compromised by the presence of poor quality cells. For meaningful analyses, such poor quality cells should be excluded to avoid biases and large variation. However, no clear guidelines exist. We introduce SkewC, a novel quality-assessment method to identify poor quality single cells in scRNA-seq experiments. The method is based on the assessment of gene coverage for each single cell and its skewness as a quality measure. To validate the method, we investigated the impact of poor quality cells on downstream analyses and compared biological differences between typical and poor quality cells. Moreover, we measured the ratio of intergenic expression, suggesting genomic contamination, and foreign organism contamination of single-cell samples. SkewC was tested on 37,993 single cells generated by 15 scRNA-seq protocols. We envision SkewC as an indispensable QC method to be incorporated into scRNA-seq experiments to preclude the possibility of scRNA-seq data misinterpretation.
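A coverage-skewness measure of this kind can be sketched in Python as follows. The sketch assumes that each cell's gene coverage is summarized as read coverage over equally spaced positions along the gene body (5' to 3'), and that skewness is the standardized third moment of position weighted by coverage. Both are assumptions for illustration and may differ from how SkewC computes its measure; the function name is hypothetical.

```python
def coverage_skewness(coverage):
    """Skewness of gene-body position, weighted by coverage at each position."""
    total = sum(coverage)
    if total == 0:
        raise ValueError("coverage profile has no reads")
    positions = range(1, len(coverage) + 1)
    mean = sum(p * c for p, c in zip(positions, coverage)) / total
    m2 = sum(c * (p - mean) ** 2 for p, c in zip(positions, coverage)) / total
    m3 = sum(c * (p - mean) ** 3 for p, c in zip(positions, coverage)) / total
    if m2 == 0:
        return 0.0  # all coverage at a single position
    return m3 / m2 ** 1.5
```

A symmetric profile gives a skewness near zero, while coverage piled up toward the 3' end gives a negative value, so cells whose profiles deviate strongly from the bulk of the dataset can be flagged for inspection.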


Estimating the Allele-Specific Expression of SNVs From 10× Genomics Single-Cell RNA-Sequencing Data

Genes ◽

10.3390/genes11030240 ◽

2020 ◽

Vol 11 (3) ◽

pp. 240 ◽

Cited By ~ 2

Author(s):

Prashant N. M. ◽

Hongyu Liu ◽

Pavlos Bousounis ◽

Liam Spurr ◽

Nawaf Alomran ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Single Cells ◽

Sequencing Data ◽

Specific Expression ◽

Single Nucleotide ◽

Healthy Donors ◽

Allele Expression ◽

Single Cell Rna Sequencing ◽

Allele Specific

With the recent advances in single-cell RNA-sequencing (scRNA-seq) technologies, the estimation of allele expression from single cells is becoming increasingly reliable. Allele expression is both quantitative and dynamic and is an essential component of the genomic interactome. Here, we systematically estimate the allele expression from heterozygous single nucleotide variant (SNV) loci using scRNA-seq data generated on the 10× Genomics Chromium platform. We analyzed 26,640 human adipose-derived mesenchymal stem cells (from three healthy donors), sequenced to an average of 150K sequencing reads per cell (more than 4 billion scRNA-seq reads in total). High-quality SNV calls assessed in our study contained approximately 15% exonic and >50% intronic loci. To analyze the allele expression, we estimated the expressed variant allele fraction (VAF_RNA) from SNV-aware alignments and analyzed its variance and distribution (mono- and bi-allelic) at different minimum sequencing read thresholds. Our analysis shows that when assessing positions covered by a minimum of three unique sequencing reads, over 50% of the heterozygous SNVs show bi-allelic expression, while at a threshold of 10 reads, nearly 90% of the SNVs are bi-allelic. In addition, our analysis demonstrates the feasibility of scVAF_RNA estimation from current scRNA-seq datasets and shows that the 3′-based library generation protocol of 10× Genomics scRNA-seq data can be informative in SNV-based studies, including analyses of transcriptional kinetics.
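The expressed variant allele fraction and the effect of a minimum-read threshold can be illustrated with a short Python sketch. The fraction is computed as variant reads over all reads covering the position. The rule that calls a site mono-allelic only when the fraction is exactly 0 or 1 (adjustable through a hypothetical `margin` parameter) is an assumption for illustration, as are the function names.

```python
def expressed_vaf(ref_reads, alt_reads, min_reads=3):
    """Variant allele fraction from RNA reads, or None below the coverage threshold."""
    total = ref_reads + alt_reads
    if total < min_reads:
        return None  # too few reads to estimate the fraction
    return alt_reads / total


def allelic_status(vaf, margin=0.0):
    """Label a site 'mono-allelic' if only one allele is observed, else 'bi-allelic'."""
    if vaf <= margin or vaf >= 1.0 - margin:
        return "mono-allelic"
    return "bi-allelic"
```

With few reads, a heterozygous site can appear mono-allelic by chance, which is consistent with the reported rise in bi-allelic calls as the threshold moves from three to 10 reads.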


VASC: dimension reduction and visualization of single cell RNA sequencing data by deep variational autoencoder

10.1101/199315 ◽

2017 ◽

Cited By ~ 6

Author(s):

Dongfang Wang ◽

Jin Gu

Keyword(s):

Dimension Reduction ◽

Single Cell ◽

Rna Sequencing ◽

Original Data ◽

Marker Genes ◽

Single Cell Level ◽

Sequencing Data ◽

Cell Level ◽

Variational Autoencoder ◽

Single Cell Rna Sequencing

Abstract. Single cell RNA sequencing (scRNA-seq) is a powerful technique for analyzing transcriptomic heterogeneity at the single cell level. Finding an effective low-dimensional representation and visualization of the original data is an important step in studying cell sub-populations and lineages based on scRNA-seq data. scRNA-seq data are much noisier than traditional bulk RNA-seq data: at the single cell level, the transcriptional fluctuations are much larger than the average of a cell population, and the low amount of RNA transcripts increases the rate of technical dropout events. In this study, we propose VASC (deep Variational Autoencoder for scRNA-seq data), a deep multi-layer generative model, for the unsupervised dimension reduction and visualization of scRNA-seq data. It can explicitly model the dropout events and find the nonlinear hierarchical feature representations of the original data. Tested on twenty datasets, VASC shows superior performance in most cases and broader dataset compatibility compared with four state-of-the-art dimension reduction methods. In a case study of pre-implantation embryos, VASC successfully re-establishes the cell dynamics and identifies several candidate marker genes associated with early embryo development.
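Two generic building blocks of any variational autoencoder with a Gaussian latent space can be written in a few lines of Python: the reparameterization step that samples a latent code, and the KL divergence between the encoder's Gaussian and a standard normal prior. This is a sketch of the standard formulation only. It says nothing about VASC's network architecture or its dropout model, and the function names are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
```

During training, a reconstruction term is combined with this KL term. The low-dimensional `mu` vectors of the trained encoder are then what a method of this kind would use for visualization.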


SPsimSeq: semi-parametric simulation of bulk and single cell RNA sequencing data

10.1101/677740 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alemu Takele Assefa ◽

Jo Vandesompele ◽

Olivier Thas

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Empirical Distribution ◽

Supplementary Information ◽

Rna Seq ◽

Sequencing Data ◽

Actual Distribution ◽

Wide Range ◽

Single Cell Rna Sequencing

Summary: SPsimSeq is a semi-parametric simulation method for bulk and single cell RNA sequencing data. It simulates data from a good estimate of the actual distribution of a given real RNA-seq dataset. In contrast to existing approaches that assume a particular data distribution, our method constructs an empirical distribution of gene expression data from a given source RNA-seq experiment to faithfully capture the characteristics of real data. Importantly, our method can be used to simulate a wide range of scenarios, such as single or multiple biological groups, systematic variations (e.g. confounding batch effects), and different sample sizes. It can also be used to simulate different gene expression units resulting from different library preparation protocols, such as read counts or UMI counts. Availability and implementation: The R package and associated documentation are available from https://github.com/CenterForStatistics-UGent/SPsimSeq. Supplementary information: Supplementary data are available at bioRχiv online.
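The general idea of simulating from an empirical distribution, rather than from an assumed parametric one, can be illustrated with a small Python sketch. It uses inverse-transform sampling from a linearly interpolated empirical quantile function of one gene's observed values. This is a deliberately simple stand-in and does not reproduce SPsimSeq's density estimation; the function names are hypothetical.

```python
import random


def empirical_quantile(sorted_values, u):
    """Linearly interpolated empirical quantile for u in [0, 1]."""
    pos = u * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def simulate_gene(observed, n_samples, rng):
    """Draw new expression values for one gene from its empirical distribution."""
    ordered = sorted(observed)
    return [empirical_quantile(ordered, rng.random()) for _ in range(n_samples)]
```

Because the simulated values are drawn from the shape of the source data, features such as skew or multimodality carry over without being specified by hand. Values outside the observed range cannot occur in this simple version.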


SNV identification from single-cell RNA sequencing data

Human Molecular Genetics ◽

10.1093/hmg/ddz207 ◽

2019 ◽

Vol 28 (21) ◽

pp. 3569-3583 ◽

Cited By ~ 3

Author(s):

Patricia M Schnepp ◽

Mengjie Chen ◽

Evan T Keller ◽

Xiang Zhou

Keyword(s):

Dna Sequencing ◽

Single Cell ◽

Rna Sequencing ◽

Single Cells ◽

Specific Gene ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Cell Rna Sequencing ◽

Sequencing Studies ◽

Genomic Regions

Abstract. Integrating single-cell RNA sequencing (scRNA-seq) data with genotypes obtained from DNA sequencing studies facilitates the detection of functional genetic variants underlying cell type-specific gene expression variation. Unfortunately, most existing scRNA-seq studies do not come with DNA sequencing data; thus, being able to call single nucleotide variants (SNVs) from scRNA-seq data alone can provide crucial and complementary information and enable the detection of functional SNVs, maximizing the potential of existing scRNA-seq studies. Here, we perform extensive analyses to evaluate the utility of two SNV calling pipelines (GATK and Monovar), originally designed for SNV calling in either bulk or single-cell DNA sequencing data. In both pipelines, we examined various parameter settings to determine the accuracy of the final SNV call set and provide practical recommendations for applied analysts. We found that combining all reads from the single cells and following GATK Best Practices resulted in the highest number of SNVs identified with a high concordance. In individual single cells, Monovar resulted in better quality SNVs, even though none of the pipelines analyzed is capable of calling a reasonable number of SNVs with high accuracy. In addition, we found that SNV calling quality varies across different functional genomic regions. Our results open doors for novel ways to leverage the use of scRNA-seq for the future investigation of SNV function.


Normalizing single-cell RNA sequencing data with internal spike-in-like genes

10.1101/2020.07.10.198077 ◽

2020 ◽

Author(s):

Li Lin ◽

Minfang Song ◽

Yong Jiang ◽

Xiaojing Zhao ◽

Haopeng Wang ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Single Cells ◽

Sequencing Depth ◽

Sequencing Data ◽

Crucial Step ◽

Single Cell Rna Sequencing ◽

Whole Transcriptome

Abstract. Normalization with respect to sequencing depth is a crucial step in single-cell RNA sequencing preprocessing. Most methods normalize data using the whole transcriptome, based on the assumption that the majority of the transcriptome remains constant, and are unable to detect drastic changes of the transcriptome. Here, we develop an algorithm (ISnorm) that uses a small fraction of constantly expressed genes as internal spike-ins to normalize single cell RNA sequencing data. We demonstrate in several case study datasets that the transcriptome of single cells may undergo drastic changes, and that accounting for such heterogeneity with ISnorm improves the performance of downstream analyses.
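The normalization step can be sketched in Python once a set of constantly expressed genes is available. The sketch assumes that set is given (selecting it is the substantive part of the algorithm and is not reproduced here) and computes each cell's size factor from its counts on those genes only. The function names and data layout are hypothetical.

```python
def spike_in_like_size_factors(counts, reference_genes):
    """Per-cell size factors from counts on the reference (spike-in-like) genes only.

    counts: list of cells, each a list of per-gene counts.
    reference_genes: indices of genes assumed to be constantly expressed.
    Cells with zero counts on all reference genes would need separate handling.
    """
    totals = [sum(cell[g] for g in reference_genes) for cell in counts]
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]


def normalize(counts, size_factors):
    """Divide every count in a cell by that cell's size factor."""
    return [[c / sf for c in cell] for cell, sf in zip(counts, size_factors)]
```

Because the size factor ignores all other genes, a cell whose remaining transcriptome is strongly up- or down-regulated keeps that difference after normalization, whereas scaling by total counts would partly remove it.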
