scholarly journals RUV-III-NB: Normalization of single cell RNA-seq Data

2021 ◽  
Author(s):  
Agus Salim ◽  
Ramyar Molania ◽  
Jianan Wang ◽  
Alysha De Livera ◽  
Rachel Thijssen ◽  
...  

Motivation: Despite numerous methodological advances, the normalization of single cell RNA-seq (scRNA-seq) data remains a challenging task. The performance of different methods can vary greatly across datasets. Part of the reason for this is the different kinds of unwanted variation, including library size, batch and cell cycle effects, and the association of these with the biology embodied in the cells. A normalization method that does not explicitly take into account cell biology risks removing some of the signal of interest. Here we propose RUV-III-NB, a statistical method that can be used to adjust counts for library size and batch effects. The method uses the concept of pseudo-replicates to ensure that relevant features of the unwanted variation are only inferred from cells with the same biology and return adjusted sequencing count as output. Results: Using five publicly available datasets that encompass different technological platforms, kinds of biology and levels of association between biology and unwanted variation, we show that RUV-III-NB manages to remove library size and batch effects, strengthen biological signals, improve differential expression analyses, and lead to results exhibiting greater concordance with independent datasets of the same kind. The performance of RUV-III-NB is consistent across the five datasets and is not sensitive to the number of factors assumed to contribute to the unwanted variation. It also shows promise for removing other kinds of unwanted variation such as platform effects. Availability: The method is implemented as a publicly available R package available from https://github.com/limfuxing/ruvIIInb. Contact: [email protected], [email protected] Supplementary information: Online Supplementary Methods

2019 ◽  
Vol 35 (24) ◽  
pp. 5155-5162 ◽  
Author(s):  
Chengzhong Ye ◽  
Terence P Speed ◽  
Agus Salim

Abstract Motivation Dropout is a common phenomenon in single-cell RNA-seq (scRNA-seq) data, and when left unaddressed it affects the validity of the statistical analyses. Despite this, few current methods for differential expression (DE) analysis of scRNA-seq data explicitly model the process that gives rise to the dropout events. We develop DECENT, a method for DE analysis of scRNA-seq data that explicitly and accurately models the molecule capture process in scRNA-seq experiments. Results We show that DECENT demonstrates improved DE performance over existing DE methods that do not explicitly model dropout. This improvement is consistently observed across several public scRNA-seq datasets generated using different technological platforms. The gain in improvement is especially large when the capture process is overdispersed. DECENT maintains type I error well while achieving better sensitivity. Its performance without spike-ins is almost as good as when spike-ins are used to calibrate the capture model. Availability and implementation The method is implemented as a publicly available R package available from https://github.com/cz-ye/DECENT. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Irzam Sarfraz ◽  
Muhammad Asif ◽  
Joshua D Campbell

Abstract Motivation R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to perform subsetting of the data matrix before applying down-stream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance. Results To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, the ExperimentSubset is a flexible container for the efficient management of subsets. Availability and implementation ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and Github: https://github.com/campbio/ExperimentSubset. Supplementary information Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Zhun Miao ◽  
Ke Deng ◽  
Xiaowo Wang ◽  
Xuegong Zhang

AbstractSummaryThe excessive amount of zeros in single-cell RNA-seq data include “real” zeros due to the on-off nature of gene transcription in single cells and “dropout” zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package DEsingle which employed Zero-Inflated Negative Binomial model to estimate the proportion of real and dropout zeros and to define and detect 3 types of DE genes in single-cell RNA-seq data with higher accuracy.Availability and ImplementationThe R package DEsingle is freely available at https://github.com/miaozhun/DEsingle and is under Bioconductor’s consideration [email protected] informationSupplementary data are available at bioRxiv online.


Author(s):  
Massimo Andreatta ◽  
Santiago J Carmona

Abstract Summary STACAS is a computational method for the identification of integration anchors in the Seurat environment, optimized for the integration of single-cell (sc) RNA-seq datasets that share only a subset of cell types. We demonstrate that by (i) correcting batch effects while preserving relevant biological variability across datasets, (ii) filtering aberrant integration anchors with a quantitative distance measure and (iii) constructing optimal guide trees for integration, STACAS can accurately align scRNA-seq datasets composed of only partially overlapping cell populations. Availability and implementation Source code and R package available at https://github.com/carmonalab/STACAS; Docker image available at https://hub.docker.com/repository/docker/mandrea1/stacas_demo.


2019 ◽  
Vol 35 (24) ◽  
pp. 5306-5308
Author(s):  
Qi Liu ◽  
Quanhu Sheng ◽  
Jie Ping ◽  
Marisol Adelina Ramirez ◽  
Ken S Lau ◽  
...  

Abstract Summary Single cell RNA sequencing is a revolutionary technique to characterize inter-cellular transcriptomics heterogeneity. However, the data are noise-prone because gene expression is often driven by both technical artifacts and genuine biological variations. Proper disentanglement of these two effects is critical to prevent spurious results. While several tools exist to detect and remove low-quality cells in one single cell RNA-seq dataset, there is lack of approach to examining consistency between sample sets and detecting systematic biases, batch effects and outliers. We present scRNABatchQC, an R package to compare multiple sample sets simultaneously over numerous technical and biological features, which gives valuable hints to distinguish technical artifact from biological variations. scRNABatchQC helps identify and systematically characterize sources of variability in single cell transcriptome data. The examination of consistency across datasets allows visual detection of biases and outliers. Availability and implementation scRNABatchQC is freely available at https://github.com/liuqivandy/scRNABatchQC as an R package. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (10) ◽  
pp. 3115-3123 ◽  
Author(s):  
Teng Fei ◽  
Tianwei Yu

Abstract Motivation Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data. Results We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods. Availability and implementation The R package is available at github.com/tengfei-emory/scBatch. The code to generate results and figures in this article is available at github.com/tengfei-emory/scBatch-paper-scripts. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Giacomo Baruzzo ◽  
Ilaria Patuzzi ◽  
Barbara Di Camillo

Abstract Motivation Single cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only few scRNA-seq count data simulators are available, often showing poor or not demonstrated similarity with real data. Results In this article we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim allows to generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably or better than one of the most used scRNA-seq simulator, Splat. In particular, SPARSim simulated count matrices well resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/? q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Bo Wang ◽  
Daniele Ramazzotti ◽  
Luca De Sano ◽  
Junjie Zhu ◽  
Emma Pierson ◽  
...  

AbstractMotivationWe here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better a visualization.Availability and ImplementationSIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package on [email protected] or [email protected] InformationSupplementary data are available at Bioinformatics online.


Author(s):  
Massimo Andreatta ◽  
Santiago J. Carmona

AbstractComputational tools for the integration of single-cell transcriptomics data are designed to correct batch effects between technical replicates or different technologies applied to the same population of cells. However, they have inherent limitations when applied to heterogeneous sets of data with moderate overlap in cell states or sub-types. STACAS is a package for the identification of integration anchors in the Seurat environment, optimized for the integration of datasets that share only a subset of cell types. We demonstrate that by i) correcting batch effects while preserving relevant biological variability across datasets, ii) filtering aberrant integration anchors with a quantitative distance measure, and iii) constructing optimal guide trees for integration, STACAS can accurately align scRNA-seq datasets composed of only partially overlapping cell populations. We anticipate that the algorithm will be a useful tool for the construction of comprehensive single-cell atlases by integration of the growing amount of single-cell data becoming available in public repositories.Code availabilityR package:https://github.com/carmonalab/STACASDocker image:https://hub.docker.com/repository/docker/mandrea1/stacas_demo


2021 ◽  
Author(s):  
Federico Agostinis ◽  
Chiara Romualdi ◽  
Gabriele Sales ◽  
Davide Risso

Summary: We present NewWave, a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA sequencing data. To achieve scalability, NewWave uses mini-batch optimization and can work with out-of-memory data, enabling users to analyze datasets with millions of cells. Availability and implementation: NewWave is implemented as an open-source R package available through the Bioconductor project at https://bioconductor.org/packages/NewWave/ Supplementary information: Supplementary data are available at Bioinformatics online.


Sign in / Sign up

Export Citation Format

Share Document