A statistical simulator scDesign for rational scRNA-seq experimental design

Wei Vivian Li; Jingyi Jessica Li

doi:10.1093/bioinformatics/btz321

A statistical simulator scDesign for rational scRNA-seq experimental design

Bioinformatics ◽

10.1093/bioinformatics/btz321 ◽

2019 ◽

Vol 35 (14) ◽

pp. i41-i50 ◽

Cited By ~ 9

Author(s):

Wei Vivian Li ◽

Jingyi Jessica Li

Keyword(s):

Biological Sciences ◽

Gene Expression ◽

Experimental Design ◽

Method Development ◽

Cell Types ◽

R Package ◽

Computational Method ◽

Supplementary Information ◽

Reduction Methods ◽

Sequencing Platforms

Abstract Motivation Single-cell RNA sequencing (scRNA-seq) has revolutionized biological sciences by revealing genome-wide gene expression levels within individual cells. However, a critical challenge faced by researchers is how to optimize the choices of sequencing platforms, sequencing depths and cell numbers in designing scRNA-seq experiments, so as to balance the exploration of the depth and breadth of transcriptome information. Results Here we present a flexible and robust simulator, scDesign, the first statistical framework for researchers to quantitatively assess practical scRNA-seq experimental design in the context of differential gene expression analysis. In addition to experimental design, scDesign also assists computational method development by generating high-quality synthetic scRNA-seq datasets under customized experimental settings. In an evaluation based on 17 cell types and 6 different protocols, scDesign outperformed four state-of-the-art scRNA-seq simulation methods and led to rational experimental design. In addition, scDesign demonstrates reproducibility across biological replicates and independent studies. We also discuss the performance of multiple differential expression and dimension reduction methods based on the protocol-dependent scRNA-seq data generated by scDesign. scDesign is expected to be an effective bioinformatic tool that assists rational scRNA-seq experimental design and comparison of scRNA-seq computational methods based on specific research goals. Availability and implementation We have implemented our method in the R package scDesign, which is freely available at https://github.com/Vivianstats/scDesign. Supplementary information Supplementary data are available at Bioinformatics online.
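
The trade-off between cell number and sequencing depth described in this abstract can be illustrated with a toy simulation. The sketch below is not scDesign's model or API: it assumes a fixed total read budget, a single gene whose counts follow a gamma-Poisson (negative binomial) distribution, and a Welch t-statistic as a stand-in for a differential expression test. All function names, the expression fraction, the dispersion and the fold change are hypothetical choices made for illustration only.

```python
import math
import random
import statistics


def design_grid(total_reads, cell_numbers):
    """Pair each candidate cell number with the per-cell depth a fixed budget allows."""
    return [(n, total_reads // n) for n in cell_numbers]


def poisson(rng, lam):
    """Knuth's method for a Poisson draw; adequate for the small means used here."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_gene(rng, n_cells, mean, dispersion=0.5):
    """Gamma-Poisson (negative binomial) counts of one gene across n_cells cells."""
    shape = 1.0 / dispersion
    return [poisson(rng, rng.gammavariate(shape, mean / shape)) for _ in range(n_cells)]


def welch_t(a, b):
    """Welch t-statistic for the difference in mean counts (b minus a)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se


def compare_designs(total_reads, cell_numbers, fraction=1e-3, fold_change=2.0, seed=0):
    """Return (n_cells, depth, t) per design; half of the cells go to each group."""
    rng = random.Random(seed)
    rows = []
    for n_cells, depth in design_grid(total_reads, cell_numbers):
        base = depth * fraction  # assumed expected count of the gene per cell
        group_a = simulate_gene(rng, n_cells // 2, base)
        group_b = simulate_gene(rng, n_cells // 2, base * fold_change)
        rows.append((n_cells, depth, welch_t(group_a, group_b)))
    return rows


for n_cells, depth, t in compare_designs(2_000_000, [200, 1000, 4000]):
    print(f"{n_cells:>5} cells x {depth:>6} reads/cell -> t = {t:.2f}")
```

Comparing the statistic across rows shows how the same read budget can be spent on few deeply sequenced cells or many shallowly sequenced ones; a realistic assessment would need a simulator fitted to real data, which is the role the abstract assigns to scDesign.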


A statistical simulator scDesign for rational scRNA-seq experimental design

10.1101/437095 ◽

2018 ◽

Author(s):

Wei Vivian Li ◽

Jingyi Jessica Li

Keyword(s):

Biological Sciences ◽

Gene Expression ◽

Experimental Design ◽

Method Development ◽

Cell Types ◽

R Package ◽

Computational Method ◽

Statistical Framework ◽

Reduction Methods ◽

Sequencing Platforms

Abstract Motivation Single-cell RNA-sequencing (scRNA-seq) has revolutionized biological sciences by revealing genome-wide gene expression levels within individual cells. However, a critical challenge faced by researchers is how to optimize the choices of sequencing platforms, sequencing depths, and cell numbers in designing scRNA-seq experiments, so as to balance the exploration of the depth and breadth of transcriptome information. Results Here we present a flexible and robust simulator, scDesign, the first statistical framework for researchers to quantitatively assess practical scRNA-seq experimental design in the context of differential gene expression analysis. In addition to experimental design, scDesign also assists computational method development by generating high-quality synthetic scRNA-seq datasets under customized experimental settings. In an evaluation based on 17 cell types and six different protocols, scDesign outperformed four state-of-the-art scRNA-seq simulation methods and led to rational experimental design. In addition, scDesign demonstrates reproducibility across biological replicates and independent studies. We also discuss the performance of multiple differential expression and dimension reduction methods based on the protocol-dependent scRNA-seq data generated by scDesign. scDesign is expected to be an effective bioinformatic tool that assists rational scRNA-seq experiment design based on specific research goals and compares various scRNA-seq computational methods. Availability We have implemented our method in the R package scDesign, which is freely available at https://github.com/Vivianstats/[email protected]


Simphony: simulating large-scale, rhythmic data

PeerJ ◽

10.7717/peerj.6985 ◽

2019 ◽

Vol 7 ◽

pp. e6985 ◽

Cited By ~ 5

Author(s):

Jordan M. Singer ◽

Darwin Y. Fu ◽

Jacob J. Hughey

Keyword(s):

Experimental Design ◽

Large Scale ◽

Method Development ◽

Negative Binomial ◽

Simulated Data ◽

R Package ◽

General Purpose ◽

Computational Method ◽

Multiple Time ◽

Multiple Time Points

Simulated data are invaluable for assessing a computational method’s ability to distinguish signal from noise. Although many biological systems show rhythmicity, there is no general-purpose tool to simulate large-scale, rhythmic data. Here we present Simphony, an R package for simulating data from experiments in which the abundances of rhythmic and non-rhythmic features (e.g., genes) are measured at multiple time points in multiple conditions. Simphony has parameters for specifying experimental design and each feature’s rhythmic properties (e.g., amplitude and phase). In addition, Simphony can sample measurements from Gaussian and negative binomial distributions, the latter of which approximates read counts from RNA-seq data. We show an example of using Simphony to evaluate the accuracy of rhythm detection. Our results suggest that Simphony will aid experimental design and computational method development. Simphony is thoroughly documented and freely available at https://github.com/hugheylab/simphony.
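
As a rough illustration of the kind of data this abstract describes, the sketch below draws measurements of one feature at multiple time points from either a Gaussian or a negative binomial distribution around a cosine-shaped mean. It does not use or reproduce the Simphony R package: the function names, the cosine parameterization, the 24-hour default period and the dispersion value are assumptions made for this example, and the negative binomial is sampled as a gamma-Poisson mixture.

```python
import math
import random


def rhythmic_mean(t, base, amplitude, phase, period=24.0):
    """Cosine-shaped expected abundance at time t; the peak falls at t == phase."""
    return base + amplitude * math.cos(2 * math.pi * (t - phase) / period)


def poisson(rng, lam):
    """Knuth's method for a Poisson draw; adequate for the small means used here."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_feature(times, base, amplitude, phase, family="gaussian",
                     sd=1.0, dispersion=0.1, seed=0):
    """Simulate one feature over the given time points.

    family="gaussian" adds normal noise with standard deviation sd.
    family="negbinom" draws counts and requires base > amplitude so that
    the mean stays positive. amplitude=0 gives a non-rhythmic feature.
    """
    rng = random.Random(seed)
    values = []
    for t in times:
        mu = rhythmic_mean(t, base, amplitude, phase)
        if family == "gaussian":
            values.append(rng.gauss(mu, sd))
        elif family == "negbinom":
            shape = 1.0 / dispersion
            values.append(poisson(rng, rng.gammavariate(shape, mu / shape)))
        else:
            raise ValueError(f"unknown family: {family}")
    return values


times = list(range(0, 48, 4))  # two days sampled every 4 hours
abundances = simulate_feature(times, base=8.0, amplitude=2.0, phase=6.0)
counts = simulate_feature(times, base=20.0, amplitude=10.0, phase=6.0, family="negbinom")
print(abundances)
print(counts)
```

Simulated series of this kind, with known amplitude and phase, are what a rhythm detection method can be scored against.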


CAGExploreR: an R package for the analysis and visualization of promoter dynamics across multiple experiments

Bioinformatics ◽

10.1093/bioinformatics/btu125 ◽

2014 ◽

Vol 30 (8) ◽

pp. 1183-1184 ◽

Cited By ~ 4

Author(s):

Emmanuel Dimont ◽

Oliver Hofmann ◽

Shannan J Ho Sui ◽

Alistair R R Forrest ◽

Hideya Kawaji ◽

...

Keyword(s):

Gene Expression ◽

Real Data ◽

Cell Types ◽

R Package ◽

Supplementary Information ◽

Promoter Regions ◽

Powerful Approach ◽

Alternate Promoter ◽

Cap Analysis ◽

Insight Into

Summary Alternate promoter usage is an important molecular mechanism for generating RNA and protein diversity. Cap Analysis Gene Expression (CAGE) is a powerful approach for revealing the multiplicity of transcription start site (TSS) events across experiments and conditions. An understanding of the dynamics of TSS choice across these conditions requires both sensitive quantification and comparative visualization. We have developed CAGExploreR, an R package to detect and visualize changes in the use of specific TSS in wider promoter regions in the context of changes in overall gene expression when comparing different CAGE samples. These changes provide insight into the modification of transcript isoform generation and regulatory network alterations associated with cell types and conditions. CAGExploreR is based on the FANTOM5 and MPromDb promoter set definitions but can also work with user-supplied regions. The package compares multiple CAGE libraries simultaneously. Supplementary Materials describe methods in detail, and a vignette demonstrates a workflow with a real data example. Availability and implementation: The package is freely available under the MIT license from CRAN (http://cran.r-project.org/web/packages/CAGExploreR). Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


A computational method to aid the design and analysis of single cell RNA-seq experiments for cell type identification

10.1101/247114 ◽

2018 ◽

Cited By ~ 1

Author(s):

Douglas Abrams ◽

Parveen Kumar ◽

R. Krishna Murthy Karuturi ◽

Joshy George

Keyword(s):

Experimental Design ◽

Single Cell ◽

Single Cells ◽

Cell Types ◽

Cell Number ◽

Fold Change ◽

Computational Method ◽

Marker Genes ◽

Cell Type ◽

Estimate Sample Size

Abstract Background The advent of single cell RNA sequencing (scRNA-seq) enabled researchers to study transcriptomic activity within individual cells and identify inherent cell types in the sample. Although numerous computational tools have been developed to analyze single cell transcriptomes, there are no published studies or analytical packages available to guide experimental design and to devise a suitable analysis procedure for cell type identification. Results We have developed an empirical methodology to address this important gap in single cell experimental design and analysis and implemented it in an easy-to-use tool called SCEED (Single Cell Empirical Experimental Design and analysis). With SCEED, users can choose a variety of combinations of tools for analysis, conduct performance analysis of analytical procedures and choose the best procedure, and estimate the sample size (number of cells to be profiled) required for a given analytical procedure at varying levels of cell type rarity and other experimental parameters. Using SCEED, we examined 3 single cell algorithms using 48 simulated single cell datasets that were generated for varying numbers of cell types and their proportions, number of genes expressed per cell, number of marker genes and their fold change, and number of single cells successfully profiled in the experiment. Conclusions Based on our study, we found that when marker genes are expressed at a fold change of 4 or more relative to the rest of the genes, either the Seurat or the Simlr algorithm can be used to analyze a single cell dataset for any number of single cells isolated (a minimum of 1000 single cells was tested). However, when marker genes are expected to be upregulated by a fold change of only up to 2, the choice of single cell algorithm depends on the number of single cells isolated and the proportion of the rare cell type to be identified. In conclusion, our work allows the assessment of various single cell methods and also aids in examining single cell experimental design.


iMIRAGE: an R package to impute microRNA expression using protein-coding genes

Bioinformatics ◽

10.1093/bioinformatics/btz939 ◽

2019 ◽

Vol 36 (8) ◽

pp. 2608-2610

Author(s):

Aritro Nath ◽

Jeremy Chang ◽

R Stephanie Huang

Keyword(s):

Gene Expression ◽

Small RNAs ◽

Transcriptional Regulators ◽

R Package ◽

Machine Learning Algorithms ◽

Supplementary Information ◽

Expression Data ◽

Protein Coding ◽

Altered Protein ◽

Independent Test

Abstract Summary MicroRNAs (miRNAs) are critical post-transcriptional regulators of gene expression. Due to challenges in accurate profiling of small RNAs, a vast majority of public transcriptome datasets lack reliable miRNA profiles. However, the biological consequence of miRNA activity in the form of altered protein-coding gene (PCG) expression can be captured using machine-learning algorithms. Here, we present iMIRAGE (imputed miRNA activity from gene expression), a convenient tool to predict miRNA expression using PCG expression of the test datasets. The iMIRAGE package provides an integrated workflow for normalization and transformation of miRNA and PCG expression data, along with the option to utilize predicted miRNA targets to impute miRNA activity from independent test PCG datasets. Availability and implementation The iMIRAGE package for R, along with package documentation and vignette, is available at https://aritronath.github.io/iMIRAGE/index.html. Supplementary information Supplementary data are available at Bioinformatics online.


Dissecting differential signals in high-throughput data from complex tissues

Bioinformatics ◽

10.1093/bioinformatics/btz196 ◽

2019 ◽

Vol 35 (20) ◽

pp. 3898-3905 ◽

Cited By ~ 5

Author(s):

Ziyi Li ◽

Zhijin Wu ◽

Peng Jin ◽

Hao Wu

Keyword(s):

High Throughput ◽

Cell Types ◽

R Package ◽

Supplementary Information ◽

Simulation Studies ◽

Clinical Practices ◽

High Throughput Data ◽

Heterogeneous Samples ◽

Cell Type Specific ◽

Different Cell Types

Abstract Motivation Samples from clinical practices are often mixtures of different cell types. The high-throughput data obtained from these samples are thus mixed signals. The cell mixture brings complications to data analysis, and will lead to biased results if not properly accounted for. Results We develop a method to model the high-throughput data from mixed, heterogeneous samples, and to detect differential signals. Our method allows flexible statistical inference for detecting a variety of cell-type specific changes. Extensive simulation studies and analyses of two real datasets demonstrate the favorable performance of our proposed method compared with existing ones serving a similar purpose. Availability and implementation The proposed method is implemented as an R package and is freely available on GitHub (https://github.com/ziyili20/TOAST). Supplementary information Supplementary data are available at Bioinformatics online.
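
The idea that a bulk measurement is a mixture of cell-type-specific signals can be made concrete with a two-cell-type toy example. The sketch below is not the TOAST implementation: it assumes that cell-type proportions are known, that the observed signal is the proportion-weighted sum of two cell-type means, and that those means can be recovered by ordinary least squares within each group. The function names and numbers are hypothetical.

```python
def mix(proportions, m1, m2):
    """Noise-free bulk signal y = p * m1 + (1 - p) * m2 for each sample."""
    return [p * m1 + (1 - p) * m2 for p in proportions]


def fit_two_cell_types(proportions, signals):
    """Least-squares estimates (m1, m2) for y = p * m1 + (1 - p) * m2.

    Rewriting the model as y = m2 + p * (m1 - m2) turns it into a simple
    linear regression on p. The proportions must not all be equal.
    """
    n = len(proportions)
    mean_p = sum(proportions) / n
    mean_y = sum(signals) / n
    sxx = sum((p - mean_p) ** 2 for p in proportions)
    sxy = sum((p - mean_p) * (y - mean_y) for p, y in zip(proportions, signals))
    slope = sxy / sxx                    # estimates m1 - m2
    intercept = mean_y - slope * mean_p  # estimates m2
    return intercept + slope, intercept


proportions = [0.2, 0.4, 0.6, 0.8]       # assumed known fraction of cell type 1
controls = mix(proportions, m1=10.0, m2=5.0)
cases = mix(proportions, m1=14.0, m2=5.0)  # change confined to cell type 1

m1_ctrl, m2_ctrl = fit_two_cell_types(proportions, controls)
m1_case, m2_case = fit_two_cell_types(proportions, cases)
print("cell type 1 difference:", m1_case - m1_ctrl)
print("cell type 2 difference:", m2_case - m2_ctrl)
```

In this noise-free setting the fit attributes the whole difference to cell type 1, which a comparison of the raw mixed signals alone would not reveal.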


STACAS: Sub-Type Anchor Correction for Alignment in Seurat to integrate single-cell RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btaa755 ◽

2020 ◽

Cited By ~ 1

Author(s):

Massimo Andreatta ◽

Santiago J Carmona

Keyword(s):

Single Cell ◽

Distance Measure ◽

Source Code ◽

Cell Types ◽

R Package ◽

Computational Method ◽

Biological Variability ◽

RNA Seq ◽

Batch Effects ◽

Guide Trees

Abstract Summary STACAS is a computational method for the identification of integration anchors in the Seurat environment, optimized for the integration of single-cell (sc) RNA-seq datasets that share only a subset of cell types. We demonstrate that by (i) correcting batch effects while preserving relevant biological variability across datasets, (ii) filtering aberrant integration anchors with a quantitative distance measure and (iii) constructing optimal guide trees for integration, STACAS can accurately align scRNA-seq datasets composed of only partially overlapping cell populations. Availability and implementation Source code and R package available at https://github.com/carmonalab/STACAS; Docker image available at https://hub.docker.com/repository/docker/mandrea1/stacas_demo.


MTTFsite: cross-cell type TF binding site prediction by using multi-task learning

Bioinformatics ◽

10.1093/bioinformatics/btz451 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5067-5077 ◽

Cited By ~ 4

Author(s):

Jiyun Zhou ◽

Qin Lu ◽

Lin Gui ◽

Ruifeng Xu ◽

Yunfei Long ◽

...

Keyword(s):

Gene Expression ◽

Histone Modification ◽

Prediction Method ◽

Cell Types ◽

Supplementary Information ◽

Learning Approaches ◽

Cell Type ◽

Baseline Method ◽

Common Features ◽

Task Learning

Abstract Motivation The prediction of transcription factor binding sites (TFBSs) is crucial for gene expression analysis. Supervised learning approaches for TFBS predictions require large amounts of labeled data. However, many TFs of certain cell types either do not have sufficient labeled data or do not have any labeled data. Results In this paper, a multi-task learning framework (called MTTFsite) is proposed to address the lack of labeled data by leveraging labeled data available across cell types. The proposed MTTFsite contains a shared CNN to learn common features for all cell types and a private CNN for each cell type to learn private features. The common features are intended to help predict TFBSs for all cell types, especially those cell types that lack labeled data. MTTFsite is evaluated on 241 cell type TF pairs and compared with a baseline method that does not use any multi-task learning model and a fully shared multi-task model that uses only a shared CNN and does not use private CNNs. For cell types with insufficient labeled data, results show that MTTFsite performs better than the baseline method and the fully shared model on more than 89% of pairs. For cell types without any labeled data, MTTFsite outperforms the baseline method and the fully shared model on more than 80% and 93% of pairs, respectively. A novel gene expression prediction method (called TFChrome) using both MTTFsite and histone modification features is also presented. Results show that TFBSs predicted by MTTFsite alone can achieve good performance. When MTTFsite is combined with histone modification features, a significant 5.7% performance improvement is obtained. Availability and implementation The resource and executable code are freely available at http://hlt.hitsz.edu.cn/MTTFsite/ and http://www.hitsz-hlt.com:8080/MTTFsite/. Supplementary information Supplementary data are available at Bioinformatics online.


DataRemix: a universal data transformation for optimal inference from gene expression datasets

Bioinformatics ◽

10.1093/bioinformatics/btaa745 ◽

2020 ◽

Cited By ~ 1

Author(s):

Weiguang Mao ◽

Javad Rahimikollu ◽

Ryan Hausler ◽

Maria Chikina

Keyword(s):

Gene Expression ◽

R Package ◽

Supplementary Information ◽

eQTL Analysis ◽

Thompson Sampling ◽

Normalization Methods ◽

Special Cases ◽

Biological Signals ◽

Gene Correlation ◽

Value Decomposition

Abstract Motivation RNA-seq technology provides unprecedented power in the assessment of transcription abundance and can be used to perform a variety of downstream tasks such as inference of gene-correlation networks and eQTL discovery. However, raw gene expression values have to be normalized for nuisance biological variation and technical covariates, and different normalization strategies can lead to dramatically different results in the downstream study. Results We describe a generalization of singular value decomposition-based reconstruction for which the common techniques of whitening, rank-k approximation and removing the top k principal components are special cases. Our simple three-parameter transformation, DataRemix, can be tuned to reweigh the contribution of hidden factors and reveal otherwise hidden biological signals. In particular, we demonstrate that the method can effectively prioritize biological signals over noise without leveraging external dataset-specific knowledge, and can outperform normalization methods that make explicit use of known technical factors. We also show that DataRemix can be efficiently optimized via a Thompson sampling approach, which makes it feasible for computationally expensive objectives such as eQTL analysis. Finally, we apply our method to the Religious Orders Study and Memory and Aging Project dataset, and we report what to our knowledge is the first replicable trans-eQTL effect in human brain. Availability and implementation DataRemix is an R package which is freely available at GitHub (https://github.com/wgmao/DataRemix). Supplementary information Supplementary data are available at Bioinformatics online.
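
The three special cases named in this abstract can be written out in terms of the singular value decomposition. The block below is a notational sketch only: it assumes a centered expression matrix X of rank r, and the symbols are chosen for this example. It does not reproduce the three-parameter DataRemix transformation itself, whose exact form is given in the paper.

```latex
% Thin SVD of the centered expression matrix X with rank r
X = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top}

% Whitening: every singular value is replaced by 1
X_{\mathrm{white}} = U V^{\top} = \sum_{i=1}^{r} u_i v_i^{\top}

% Rank-k approximation: keep only the k largest components
X_{k} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top}

% Removing the top k principal components: keep the remainder
X_{-k} = X - X_{k} = \sum_{i=k+1}^{r} \sigma_i \, u_i v_i^{\top}
```

Each case amounts to a different reweighting of the same singular components, which is the sense in which a single tunable transformation can cover all three.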


Integrating biological knowledge and gene expression data using pathway-guided random forests: a benchmarking study

Bioinformatics ◽

10.1093/bioinformatics/btaa483 ◽

2020 ◽

Vol 36 (15) ◽

pp. 4301-4308

Author(s):

Stephan Seifert ◽

Sven Gundlach ◽

Olaf Junge ◽

Silke Szymczak

Keyword(s):

Gene Expression ◽

Computational Models ◽

Hybrid Approach ◽

Disease Status ◽

R Package ◽

Gene Expression Omnibus ◽

Functional Enrichment ◽

Supplementary Information ◽

Biological Knowledge ◽

Functional Relationships

Abstract Motivation High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets. Results The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The competing methods hunting and learner of functional enrichment should be used when low numbers of relevant pathways are expected or the most strongly associated pathways are of interest. The hybrid approach synthetic features is not recommended because of its high false discovery rate. Availability and implementation An R package providing functions for data analysis and simulation is available at GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality controlled experimental datasets downloaded from Gene Expression Omnibus (GEO). Supplementary information Supplementary data are available at Bioinformatics online.
