scholarly journals scDoc: correcting drop-out events in single-cell RNA-seq data

2020 ◽  
Vol 36 (15) ◽  
pp. 4233-4239
Author(s):  
Di Ran ◽  
Shanshan Zhang ◽  
Nicholas Lytal ◽  
Lingling An

Abstract Motivation Single-cell RNA-sequencing (scRNA-seq) has become an important tool to unravel cellular heterogeneity, discover new cell (sub)types, and understand cell development at single-cell resolution. However, one major challenge to scRNA-seq research is the presence of ‘drop-out’ events, which usually is due to extremely low mRNA input or the stochastic nature of gene expression. In this article, we present a novel single-cell RNA-seq drop-out correction (scDoc) method, imputing drop-out events by borrowing information for the same gene from highly similar cells. Results scDoc is the first method that directly involves drop-out information to accounting for cell-to-cell similarity estimation, which is crucial in scRNA-seq drop-out imputation but has not been appropriately examined. We evaluated the performance of scDoc using both simulated data and real scRNA-seq studies. Results show that scDoc outperforms the existing imputation methods in reference to data visualization, cell subpopulation identification and differential expression detection in scRNA-seq data. Availability and implementation R code is available at https://github.com/anlingUA/scDoc. Supplementary information Supplementary data are available at Bioinformatics online.

2019 ◽  
Author(s):  
Di Ran ◽  
Shanshan Zhang ◽  
Nicholas Lytal ◽  
Lingling An

AbstractSingle-cell RNA sequencing (scRNA-seq) has become an important tool to unravel cellular heterogeneity, discover new cell types, and understand cell development at single-cell resolution. However, one major challenge to scRNA-seq research is the presence of “drop-out” events, which usually is due to extremely low mRNA input or the stochastic nature of gene expression. In this paper, we present a novel Single-Cell RNA-seq Drop-Out Correction (scDoc) method, imputing drop-out events by borrowing information for the same gene from highly similar cells. scDoc is the first method that involves drop-out information to account for cell-to-cell similarity estimation, which is crucial in scRNA-seq drop-out imputation but has not been appropriately examined. We evaluated the performance of scDoc using both simulated data and real scRNA-seq studies. Results show that scDoc can impute the drop-out events more accurately and robustly; specifically, it outperforms all available imputation methods in reference to data visualization, cell subpopulation identification, and differential expression detection in scRNA-seq data.


Author(s):  
Congting Ye ◽  
Qian Zhou ◽  
Xiaohui Wu ◽  
Chen Yu ◽  
Guoli Ji ◽  
...  

Abstract Motivation Alternative polyadenylation (APA) plays a key post-transcriptional regulatory role in mRNA stability and functions in eukaryotes. Single cell RNA-seq (scRNA-seq) is a powerful tool to discover cellular heterogeneity at gene expression level. Given 3′ enriched strategy in library construction, the most commonly used scRNA-seq protocol—10× Genomics enables us to improve the study resolution of APA to the single cell level. However, currently there is no computational tool available for investigating APA profiles from scRNA-seq data. Results Here, we present a package scDAPA for detecting and visualizing dynamic APA from scRNA-seq data. Taking bam/sam files and cell cluster labels as inputs, scDAPA detects APA dynamics using a histogram-based method and the Wilcoxon rank-sum test, and visualizes candidate genes with dynamic APA. Benchmarking results demonstrated that scDAPA can effectively identify genes with dynamic APA among different cell groups from scRNA-seq data. Availability and implementation The scDAPA package is implemented in Shell and R, and is freely available at https://scdapa.sourceforge.io. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
Marmar Moussa ◽  
Ion I. Măndoiu

AbstractOne of the most notable challenges in single cell RNA-Seq data analysis is the so called drop-out effect, where only a fraction of the transcriptome of each cell is captured. The random nature of drop-outs, however, makes it possible to consider imputation methods as means of correcting for drop-outs. In this paper we study some existing scRNA-Seq imputation methods and propose a novel iterative imputation approach based on efficiently computing highly similar cells. We then present the results of a comprehensive assessment of existing and proposed methods on real scRNA-Seq datasets with varying per cell sequencing depth.


Author(s):  
Giacomo Baruzzo ◽  
Ilaria Patuzzi ◽  
Barbara Di Camillo

Abstract Motivation Single cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only few scRNA-seq count data simulators are available, often showing poor or not demonstrated similarity with real data. Results In this article we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim allows to generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably or better than one of the most used scRNA-seq simulator, Splat. In particular, SPARSim simulated count matrices well resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/? q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information Supplementary data are available at Bioinformatics online.


F1000Research ◽  
2019 ◽  
Vol 7 ◽  
pp. 1740 ◽  
Author(s):  
Tallulah S. Andrews ◽  
Martin Hemberg

Background: Single-cell RNA-seq is a powerful tool for measuring gene expression at the resolution of individual cells.  A challenge in the analysis of this data is the large amount of zero values, representing either missing data or no expression. Several imputation approaches have been proposed to address this issue, but they generally rely on structure inherent to the dataset under consideration they may not provide any additional information, hence, are limited by the information contained therein and the validity of their assumptions. Methods: We evaluated the risk of generating false positive or irreproducible differential expression when imputing data with six different methods. We applied each method to a variety of simulated datasets as well as to permuted real single-cell RNA-seq datasets and consider the number of false positive gene-gene correlations and differentially expressed genes. Using matched 10X and Smart-seq2 data we examined whether cell-type specific markers were reproducible across datasets derived from the same tissue before and after imputation. Results: The extent of false-positives introduced by imputation varied considerably by method. Data smoothing based methods, MAGIC, knn-smooth and dca, generated many false-positives in both real and simulated data. Model-based imputation methods typically generated fewer false-positives but this varied greatly depending on the diversity of cell-types in the sample. All imputation methods decreased the reproducibility of cell-type specific markers, although this could be mitigated by selecting markers with large effect size and significance. Conclusions: Imputation of single-cell RNA-seq data introduces circularity that can generate false-positive results. Thus, statistical tests applied to imputed data should be treated with care. Additional filtering by effect size can reduce but not fully eliminate these effects. Of the methods we considered, SAVER was the least likely to generate false or irreproducible results, thus should be favoured over alternatives if imputation is necessary.


2020 ◽  
Vol 36 (11) ◽  
pp. 3588-3589 ◽  
Author(s):  
Kaiyi Zhu ◽  
Dimitris Anastassiou

Abstract Summary We developed 2DImpute, an imputation method for correcting false zeros (known as dropouts) in single-cell RNA-sequencing (scRNA-seq) data. It features preventing excessive correction by predicting the false zeros and imputing their values by making use of the interrelationships between both genes and cells in the expression matrix. We showed that 2DImpute outperforms several leading imputation methods by applying it on datasets from various scRNA-seq protocols. Availability and implementation The R package of 2DImpute is freely available at GitHub (https://github.com/zky0708/2DImpute). Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Irzam Sarfraz ◽  
Muhammad Asif ◽  
Joshua D Campbell

Abstract Motivation R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to perform subsetting of the data matrix before applying down-stream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance. Results To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, the ExperimentSubset is a flexible container for the efficient management of subsets. Availability and implementation ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and Github: https://github.com/campbio/ExperimentSubset. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Yang Xu ◽  
Priyojit Das ◽  
Rachel Patton McCord

Abstract Motivation Deep learning approaches have empowered single-cell omics data analysis in many ways and generated new insights from complex cellular systems. As there is an increasing need for single cell omics data to be integrated across sources, types, and features of data, the challenges of integrating single-cell omics data are rising. Here, we present an unsupervised deep learning algorithm that learns discriminative representations for single-cell data via maximizing mutual information, SMILE (Single-cell Mutual Information Learning). Results Using a unique cell-pairing design, SMILE successfully integrates multi-source single-cell transcriptome data, removing batch effects and projecting similar cell types, even from different tissues, into the shared space. SMILE can also integrate data from two or more modalities, such as joint profiling technologies using single-cell ATAC-seq, RNA-seq, DNA methylation, Hi-C, and ChIP data. When paired cells are known, SMILE can integrate data with unmatched feature, such as genes for RNA-seq and genome wide peaks for ATAC-seq. Integrated representations learned from joint profiling technologies can then be used as a framework for comparing independent single source data. Supplementary information Supplementary data are available at Bioinformatics online. The source code of SMILE including analyses of key results in the study can be found at: https://github.com/rpmccordlab/SMILE.


2017 ◽  
Author(s):  
Zhun Miao ◽  
Ke Deng ◽  
Xiaowo Wang ◽  
Xuegong Zhang

AbstractSummaryThe excessive amount of zeros in single-cell RNA-seq data include “real” zeros due to the on-off nature of gene transcription in single cells and “dropout” zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package DEsingle which employed Zero-Inflated Negative Binomial model to estimate the proportion of real and dropout zeros and to define and detect 3 types of DE genes in single-cell RNA-seq data with higher accuracy.Availability and ImplementationThe R package DEsingle is freely available at https://github.com/miaozhun/DEsingle and is under Bioconductor’s consideration [email protected] informationSupplementary data are available at bioRxiv online.


Sign in / Sign up

Export Citation Format

Share Document