EnImpute: imputing dropout events in single-cell RNA-sequencing data via ensemble learning

Xiao-Fei Zhang; Le Ou-Yang; Shuo Yang; Xing-Ming Zhao; Xiaohua Hu; Hong Yan

doi:10.1093/bioinformatics/btz435

EnImpute: imputing dropout events in single-cell RNA-sequencing data via ensemble learning

Bioinformatics ◽

10.1093/bioinformatics/btz435 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4827-4829 ◽

Cited By ~ 6

Author(s):

Xiao-Fei Zhang ◽

Le Ou-Yang ◽

Shuo Yang ◽

Xing-Ming Zhao ◽

Xiaohua Hu ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Ensemble Learning ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Single Cell Rna Sequencing ◽

The Individual ◽

Downstream Analysis ◽

Shiny Application

Abstract Summary Imputation of dropout events that may mislead downstream analyses is a key step in analyzing single-cell RNA-sequencing (scRNA-seq) data. We develop EnImpute, an R package that introduces an ensemble learning method for imputing dropout events in scRNA-seq data. EnImpute combines the results obtained from multiple imputation methods to generate a more accurate result. A Shiny application is developed to provide easier implementation and visualization. Experiment results show that EnImpute outperforms the individual state-of-the-art methods in almost all situations. EnImpute is useful for correcting the noisy scRNA-seq data before performing downstream analysis. Availability and implementation The R package and Shiny application are available through Github at https://github.com/Zhangxf-ccnu/EnImpute. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

schex avoids overplotting for large single-cell RNA-sequencing datasets

Bioinformatics ◽

10.1093/bioinformatics/btz907 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2291-2292 ◽

Cited By ~ 1

Author(s):

Saskia Freytag ◽

Ryan Lister

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Single Cell Rna Sequencing

Abstract Summary Due to the scale and sparsity of single-cell RNA-sequencing data, traditional plots can obscure vital information. Our R package schex overcomes this by implementing hexagonal binning, which has the additional advantages of improving speed and reducing storage for resulting plots. Availability and implementation schex is freely available from Bioconductor via http://bioconductor.org/packages/release/bioc/html/schex.html and its development version can be accessed on GitHub via https://github.com/SaskiaFreytag/schex. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

scds: computational annotation of doublets in single-cell RNA sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btz698 ◽

2019 ◽

Cited By ~ 3

Author(s):

Abha S Bais ◽

Dennis Kostka

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Binary Classification ◽

Single Cells ◽

Computational Cost ◽

Original Data ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Single Cell Rna Sequencing

Abstract Motivation Single-cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets is present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study’s conclusions, and therefore computational strategies for the identification of doublets are needed. Results With scds, we propose two new approaches for in silico doublet identification: Co-expression based doublet scoring (cxds) and binary classification based doublet scoring (bcds). The co-expression based approach, cxds, utilizes binarized (absence/presence) gene expression data and, employing a binomial model for the co-expression of pairs of genes, yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from original data. We apply our methods and existing computational doublet identification approaches to four datasets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, at comparably little computational cost. We observe appreciable differences between methods and across datasets and that no approach dominates all others. In summary, scds presents a scalable, competitive approach that allows for doublet annotation of datasets with thousands of cells in a matter of seconds. Availability and implementation scds is implemented as a Bioconductor R package (doi: 10.18129/B9.bioc.scds). Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

SPsimSeq: semi-parametric simulation of bulk and single-cell RNA-sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa105 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3276-3278 ◽

Cited By ~ 2

Author(s):

Alemu Takele Assefa ◽

Jo Vandesompele ◽

Olivier Thas

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Real Data ◽

Simulation Method ◽

R Package ◽

Supplementary Information ◽

Expression Data ◽

Sequencing Data ◽

Wide Range ◽

Single Cell Rna Sequencing

Abstract Summary SPsimSeq is a semi-parametric simulation method to generate bulk and single-cell RNA-sequencing data. It is designed to simulate gene expression data with maximal retention of the characteristics of real data. It is reasonably flexible to accommodate a wide range of experimental scenarios, including different sample sizes, biological signals (differential expression) and confounding batch effects. Availability and implementation The R package and associated documentation is available from https://github.com/CenterForStatistics-UGent/SPsimSeq. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

scPipe: a flexible R/Bioconductor preprocessing pipeline for single-cell RNA-sequencing data

10.1101/175927 ◽

2017 ◽

Cited By ~ 8

Author(s):

Luyi Tian ◽

Shian Su ◽

Xueyi Dong ◽

Daniela Amann-Zalcenstein ◽

Christine Biben ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

R Package ◽

Statistical Testing ◽

Complex Data ◽

Sequencing Data ◽

Single Cell Rna Sequencing ◽

Reproducible Analysis ◽

Downstream Analysis ◽

Effective Visualization

AbstractSingle-cell RNA sequencing (scRNA-seq) technology allows researchers to profile the transcriptomes of thousands of cells simultaneously. Protocols that incorpo-rate both designed and random barcodes have greatly increased the throughput of scRNA-seq, but give rise to a more complex data structure. There is a need for new tools that can handle the various barcoding strategies used by different protocols and exploit this information for quality assessment at the sample-level and provide effective visualization of these results in preparation for higher-level analyses.To this end, we developed scPipe, a R/Bioconductor package that integrates barcode demultiplexing, read alignment, UMI-aware gene-level quantification and quality control of raw sequencing data generated by multiple 3-prime-end sequencing protocols that include CEL-seq, MARS-seq, Chromium 10X and Drop-seq. scPipe produces a count matrix that is essential for downstream analysis along with an HTML report that summarises data quality. These results can be used as input for downstream analyses including normalization, visualization and statistical testing. scPipe performs this processing in a few simple R commands, promoting reproducible analysis of single-cell data that is compatible with the emerging suite of scRNA-seq analysis tools available in R/Bioconductor. The scPipe R package is available for download from https://www.bioconductor.org/packages/scPipe.

Download Full-text

Critical downstream analysis steps for single-cell RNA sequencing data

Briefings in Bioinformatics ◽

10.1093/bib/bbab105 ◽

2021 ◽

Author(s):

Zilong Zhang ◽

Feifei Cui ◽

Chen Lin ◽

Lingling Zhao ◽

Chunyu Wang ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Noisy Data ◽

Single Cell Level ◽

Cell Type ◽

Sequencing Data ◽

Cell Level ◽

Bioinformatics Tool ◽

Single Cell Rna Sequencing ◽

Downstream Analysis

Abstract Single-cell RNA sequencing (scRNA-seq) has enabled us to study biological questions at the single-cell level. Currently, many analysis tools are available to better utilize these relatively noisy data. In this review, we summarize the most widely used methods for critical downstream analysis steps (i.e. clustering, trajectory inference, cell-type annotation and integrating datasets). The advantages and limitations are comprehensively discussed, and we provide suggestions for choosing proper methods in different situations. We hope this paper will be useful for scRNA-seq data analysts and bioinformatics tool developers.

Download Full-text

EnTSSR: a weighted ensemble learning method to impute single-cell RNA sequencing data

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2021.3110850 ◽

2021 ◽

pp. 1-1

Author(s):

Fan Lu ◽

Yilong Lin ◽

Chongbin Yuan ◽

Xiao-Fei Zhang ◽

Le Ou-Yang

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Ensemble Learning ◽

Learning Method ◽

Sequencing Data ◽

Single Cell Rna Sequencing

Download Full-text

SPsimSeq: semi-parametric simulation of bulk and single cell RNA sequencing data

10.1101/677740 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alemu Takele Assefa ◽

Jo Vandesompele ◽

Olivier Thas

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Empirical Distribution ◽

Supplementary Information ◽

Rna Seq ◽

Sequencing Data ◽

Actual Distribution ◽

Wide Range ◽

Single Cell Rna Sequencing

SummarySPsimSeq is a semi-parametric simulation method for bulk and single cell RNA sequencing data. It simulates data from a good estimate of the actual distribution of a given real RNA-seq dataset. In contrast to existing approaches that assume a particular data distribution, our method constructs an empirical distribution of gene expression data from a given source RNA-seq experiment to faithfully capture the data characteristics of real data. Importantly, our method can be used to simulate a wide range of scenarios, such as single or multiple biological groups, systematic variations (e.g. confounding batch effects), and different sample sizes. It can also be used to simulate different gene expression units resulting from different library preparation protocols, such as read counts or UMI counts.Availability and implementationThe R package and associated documentation is available from https://github.com/CenterForStatistics-UGent/SPsimSeq.Supplementary informationSupplementary data are available at bioRχiv online.

Download Full-text

Cytosplore-Transcriptomics: a scalable inter-active framework for single-cell RNA sequencing data analysis

10.1101/2020.12.11.421883 ◽

2020 ◽

Author(s):

Tamim Abdelaal ◽

Jeroen Eggermont ◽

Thomas Höllt ◽

Ahmed Mahfouz ◽

Marcel J.T. Reinders ◽

...

Keyword(s):

Data Analysis ◽

Single Cell ◽

Rna Sequencing ◽

Visual Exploration ◽

Sequencing Data ◽

Interactive Analysis ◽

Levels Of Detail ◽

Single Cell Rna Sequencing ◽

Downstream Analysis ◽

Low Dimensional

SummaryThe ever-increasing number of analyzed cells in Single-cell RNA sequencing (scRNA-seq) experiments imposes several challenges on the data analysis. Current analysis methods lack scalability to large datasets hampering interactive visual exploration of the data. We present Cytosplore-Transcriptomics, a framework to analyze scRNA-seq data, including data preprocessing, visualization and downstream analysis. At its core, it uses a hierarchical, manifold preserving representation of the data that allows the inspection and annotation of scRNA-seq data at different levels of detail. Consequently, Cytosplore-Transcriptomics provides interactive analysis of the data using low-dimensional visualizations that scales to millions of cells.AvailabilityCytosplore-Transcriptomics can be freely downloaded from [email protected]

Download Full-text

SoupX removes ambient RNA contamination from droplet based single-cell RNA sequencing data

10.1101/303727 ◽

2018 ◽

Cited By ~ 58

Author(s):

Matthew D Young ◽

Sam Behjati

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Expression Profiles ◽

List Type ◽

Sequencing Data ◽

Sequencing Technologies ◽

Single Cell Rna Sequencing ◽

Biological Interpretation ◽

Downstream Analysis ◽

Rna Contamination

AbstractBackgroundDroplet based single-cell RNA sequence analyses assume all acquired RNAs are endogenous to cells. However, any cell free RNAs contained within the input solution are also captured by these assays. This sequencing of cell free RNA constitutes a background contamination that confounds the biological interpretation of single-cell transcriptomic data.ResultsWe demonstrate that contamination from this ‘soup’ of cell free RNAs is ubiquitous, with experiment-specific variations in composition and magnitude. We present a method, SoupX, for quantifying the extent of the contamination and estimating ‘background corrected’ cell expression profiles that seamlessly integrate with existing downstream analysis tools. Applying this method to several datasets using multiple droplet sequencing technologies, we demonstrate that its application improves biological interpretation of otherwise misleading data, as well as improving quality control metrics.ConclusionsWe present ‘SoupX’, a tool for removing ambient RNA contamination from droplet based single cell RNA sequencing experiments. This tool has broad applicability and its application can improve the biological utility of existing and future data sets.Key PointsThe signal from droplet based single cell RNA sequencing is ubiquitously contaminated by capture of ambient mRNA.SoupX is a method to quantify the abundance of these ambient mRNAs and remove them.Correcting for ambient mRNA contamination improves biological interpretation.

Download Full-text

Comprehensive generation, visualization, and reporting of quality control metrics for single-cell RNA sequencing data

10.1101/2020.11.16.385328 ◽

2020 ◽

Author(s):

Rui Hong ◽

Yusuke Koga ◽

Shruthi Bandyadka ◽

Anastasia Leshchyk ◽

Zhe Wang ◽

...

Keyword(s):

Quality Control ◽

Single Cell ◽

Rna Sequencing ◽

Sequencing Data ◽

Programming Environments ◽

Marker Selection ◽

Quality Control Metrics ◽

Single Cell Rna Sequencing ◽

Standard Quality ◽

Downstream Analysis

AbstractPerforming comprehensive quality control is necessary to remove technical or biological artifacts in single-cell RNA sequencing (scRNA-seq) data. Artifacts in the scRNA-seq data, such as doublets or ambient RNA, can also hinder downstream clustering and marker selection and need to be assessed. While several algorithms have been developed to perform various quality control tasks, they are only available in different packages across various programming environments. No standardized workflow has been developed to streamline the generation and reporting of all quality control metrics from these tools. We have built an easy-to-use pipeline, named SCTK-QC, in the singleCellTK package that generates a comprehensive set of quality control metrics from a plethora of packages for quality control. We are able to import data from several preprocessing tools including CellRanger, STARSolo, BUSTools, dropEST, Optimus, and SEQC. Standard quality control metrics for each cell are calculated including the total number of UMIs, total number of genes detected, and the percentage of counts mapping to predefined gene sets such as mitochondrial genes. Doublet detection algorithms employed include scrublet, scds, doubletCells, and doubletFinder. DecontX is used to identify contamination in each individual cell. To make the data accessible in downstream analysis workflows, the results can be exported to common data structures in R and Python or to text files for use in any generic workflow. Overall, this pipeline will streamline and standardize quality control analyses for single cell RNA-seq data across different platforms.

Download Full-text