A benchmark for RNA-seq deconvolution analysis under dynamic testing environments

Abstract. Background: Deconvolution analyses have been widely used to track compositional alterations of cell types in gene expression data. Although a large number of novel methods have been developed, due to a lack of understanding of the effects of modeling assumptions and tuning parameters, it is challenging for researchers to select an optimal deconvolution method suitable for the targeted biological conditions. Results: To systematically reveal the pitfalls and challenges of deconvolution analyses, we investigate the impact of several technical and biological factors including simulation model, quantification unit, component number, weight matrix, and unknown content by constructing three benchmarking frameworks. These frameworks cover comparative analysis of 11 popular deconvolution methods under 1766 conditions. Conclusions: We provide new insights to researchers for future application, standardization, and development of deconvolution tools on RNA-seq data.
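
To make the setting concrete, the following is a minimal sketch of the linear mixing model that reference-based deconvolution commonly assumes: a bulk sample is simulated as a weighted sum of cell-type profiles, and the weights are then re-estimated and scored. It is not one of the 11 benchmarked methods. The function names (`mix`, `deconvolve`, `rmse`), the toy signature matrix, and the projected-gradient solver are all assumptions chosen for illustration.

```python
import math

def mix(signature, weights):
    """Simulate one bulk sample: each gene is a weighted sum of cell-type profiles."""
    return [sum(s * w for s, w in zip(row, weights)) for row in signature]

def deconvolve(signature, bulk, steps=5000):
    """Estimate non-negative weights by projected gradient descent on squared error.

    Hypothetical helper for illustration; real tools use more elaborate models.
    """
    n_types = len(signature[0])
    # Step size 1 / ||S||_F^2 is a conservative bound that keeps the descent stable.
    lr = 1.0 / sum(s * s for row in signature for s in row)
    w = [1.0 / n_types] * n_types
    for _ in range(steps):
        residual = [p - b for p, b in zip(mix(signature, w), bulk)]
        grad = [sum(row[k] * r for row, r in zip(signature, residual))
                for k in range(n_types)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, wk - lr * gk) for wk, gk in zip(w, grad)]
    total = sum(w)
    # Rescale so the estimates can be read as proportions.
    return [wk / total for wk in w] if total > 0 else w

def rmse(estimated, truth):
    """Root-mean-square error between estimated and true proportions."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimated, truth)) / len(truth))

# Toy signature matrix: 4 genes (rows) by 3 cell types (columns); values are made up.
signature = [[10.0, 1.0, 1.0],
             [1.0, 10.0, 1.0],
             [1.0, 1.0, 10.0],
             [5.0, 5.0, 5.0]]
true_weights = [0.6, 0.3, 0.1]
estimate = deconvolve(signature, mix(signature, true_weights))
print(estimate, rmse(estimate, true_weights))
```

In a benchmark of this kind, the factors named in the abstract would correspond to changing how `mix` generates data (simulation model, weight matrix, component number) or to withholding a column of the signature from `deconvolve` (unknown content).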

A comparative study of deconvolution methods for RNA-seq data under a dynamic testing landscape

10.1101/2020.12.09.418640 ◽

2020 ◽

Author(s):

Haijing Jin ◽

Zhandong Liu

Keyword(s):

Gene Expression ◽

Comparative Analysis ◽

Dynamic Testing ◽

Cell Types ◽

Future Application ◽

Expression Data ◽

Rna Seq ◽

Biological Factors ◽

The Impact ◽

Unit Component

Abstract. Deconvolution analyses have been widely used to track compositional alterations of cell types in gene expression data. Even though numerous novel methods have been developed in recent years, researchers still have difficulty selecting an optimal deconvolution method because comprehensive benchmarks have not kept pace with the newly developed methods. To systematically reveal the pitfalls and challenges of deconvolution analyses, we studied the impact of several technical and biological factors such as simulation model, quantification unit, component number, weight matrix, and unknown content by constructing three benchmarking frameworks that cover comparative analysis of 11 popular deconvolution methods under 1,766 conditions. We hope this study can provide new insights to researchers for future application, standardization, and development of deconvolution tools on RNA-seq data.

IKAP - Identifying K mAjor cell Population groups in single-cell RNA-seq analysis

10.1101/596817 ◽

2019 ◽

Author(s):

Yun-Ching Chen ◽

Abhilash Suresh ◽

Chingiz Underbayev ◽

Clare Sun ◽

Komudi Singh ◽

...

Keyword(s):

Single Cell ◽

Cell Population ◽

Cell Types ◽

Marker Genes ◽

Rna Seq ◽

Population Groups ◽

Tuning Parameters ◽

Multiple Datasets ◽

Cell Groups ◽

Cell Ontology

Abstract. In single-cell RNA-seq analysis, clustering cells into groups and differentiating cell groups by marker genes are two separate steps for investigating cell identity. However, clustering results greatly affect the ability to differentiate between cell groups. We develop IKAP, an algorithm that identifies major cell groups and improves their differentiation by tuning the clustering parameters. Using multiple datasets, we demonstrate that IKAP improves identification of major cell types and facilitates cell ontology curation.

Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data

Genome Biology ◽

10.1186/s13059-021-02568-9 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Kayla A. Johnson ◽

Arjun Krishnan

Keyword(s):

Gene Expression ◽

Expression Data ◽

Rna Seq ◽

Functional Relationships ◽

Gene Coexpression ◽

Transformation Methods ◽

Network Transformation ◽

Almost All ◽

Coexpression Networks ◽

The Impact

Abstract. Background: Constructing gene coexpression networks is a powerful approach for analyzing high-throughput gene expression data towards module identification, gene function prediction, and disease-gene prioritization. While optimal workflows for constructing coexpression networks, including good choices for data pre-processing, normalization, and network transformation, have been developed for microarray-based expression data, such well-tested choices do not exist for RNA-seq data. Almost all studies that compare data processing and normalization methods for RNA-seq focus on the end goal of determining differential gene expression. Results: Here, we present a comprehensive benchmarking and analysis of 36 different workflows, each with a unique set of normalization and network transformation methods, for constructing coexpression networks from RNA-seq datasets. We test these workflows on both large, homogeneous datasets and small, heterogeneous datasets from various labs. We analyze the workflows in terms of aggregate performance, individual method choices, and the impact of multiple dataset experimental factors. Our results demonstrate that between-sample normalization has the biggest impact, with counts adjusted by size factors producing networks that most accurately recapitulate known tissue-naive and tissue-aware gene functional relationships. Conclusions: Based on this work, we provide concrete recommendations on robust procedures for building an accurate coexpression network from an RNA-seq dataset. In addition, researchers can examine all the results in great detail at https://krishnanlab.github.io/RNAseq_coexpression to make appropriate choices for coexpression analysis based on the experimental factors of their RNA-seq dataset.
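
As a rough illustration of what "counts adjusted by size factors" can look like, the sketch below estimates one size factor per sample and divides the raw counts by it. The abstract does not say how the size factors are estimated, so the median-of-ratios estimator used here is an illustrative stand-in rather than the workflow the authors evaluated. The function names `size_factors` and `adjust` and the toy counts are hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors (assumed estimator, chosen for illustration).

    counts: list of genes, each a list of raw counts with one entry per sample.
    """
    n_samples = len(counts[0])
    # Keep only genes with nonzero counts in every sample so the logs are defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of each gene's geometric mean across samples (a pseudo-reference sample).
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - m for row, m in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

def adjust(counts, factors):
    """Divide each sample's counts by that sample's size factor."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]

# Toy data: the second sample was sequenced twice as deeply as the first.
raw = [[10, 20], [100, 200], [50, 100], [0, 0]]
factors = size_factors(raw)
print(factors)
print(adjust(raw, factors))
```

After adjustment, the two samples in the toy data agree gene by gene, which is the property a between-sample normalization is meant to restore before gene-gene correlations are computed.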

Spectrum: Fast density-aware spectral clustering for single and multi-omic data

10.1101/636639 ◽

2019 ◽

Cited By ~ 1

Author(s):

Christopher R. John ◽

David Watson ◽

Michael Barnes ◽

Costantino Pitzalis ◽

Myles J. Lewis

Keyword(s):

Spectral Clustering ◽

Personalised Medicine ◽

Cell Types ◽

Expression Data ◽

Rna Seq ◽

Diffusion Technique ◽

Eigenvector Analysis ◽

And Performance ◽

And Diffusion ◽

Omic Data

Abstract. Clustering of single or multi-omic data is key to developing personalised medicine and identifying new cell types. We present Spectrum, a fast spectral clustering method for single and multi-omic expression data. Spectrum is flexible and performs well on single-cell RNA-seq data. The method uses a new density-aware kernel that adapts to data scale and density. It uses a tensor product graph data integration and diffusion technique to reveal underlying structures and reduce noise. We developed a powerful method of eigenvector analysis to determine the number of clusters. Benchmarking Spectrum on 21 datasets demonstrated improvements in runtime and performance relative to other state-of-the-art methods.

CDSeqR: fast complete deconvolution for gene expression data from bulk tissues

BMC Bioinformatics ◽

10.1186/s12859-021-04186-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Kai Kang ◽

Caizhi Huang ◽

Yuanyuan Li ◽

David M. Umbach ◽

Leping Li

Keyword(s):

Gene Expression ◽

Cell Types ◽

Biological Tissues ◽

Specific Gene ◽

Specific Cell ◽

Specific Information ◽

Expression Data ◽

Rna Seq ◽

Cell Type ◽

Cell Type Specific

Abstract. Background: Biological tissues consist of heterogeneous populations of cells. Because gene expression patterns from bulk tissue samples reflect the contributions from all cells in the tissue, understanding the contribution of individual cell types to the overall gene expression in the tissue is fundamentally important. We recently developed a computational method, CDSeq, that can simultaneously estimate both sample-specific cell-type proportions and cell-type-specific gene expression profiles using only bulk RNA-Seq counts from multiple samples. Here we present an R implementation of CDSeq (CDSeqR) with significant performance improvement over the original implementation in MATLAB and an added new function to aid cell type annotation. The R package would be of interest for the broader R community. Results: We developed a novel strategy to substantially improve computational efficiency in both speed and memory usage. In addition, we designed and implemented a new function for annotating the CDSeq-estimated cell types using single-cell RNA sequencing (scRNA-seq) data. This function allows users to readily interpret and visualize the CDSeq-estimated cell types. It further allows users to annotate CDSeq-estimated cell types using marker genes. We carried out additional validations of the CDSeqR software using synthetic and real cell mixtures, and real bulk RNA-seq data from the Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project. Conclusions: The existing bulk RNA-seq repositories, such as TCGA and GTEx, provide enormous resources for better understanding changes in transcriptomics and human diseases. They are also potentially useful for studying cell–cell interactions in the tissue microenvironment. Bulk level analyses neglect tissue heterogeneity, however, and hinder investigation of cell-type-specific expression. The CDSeqR package may aid in silico dissection of bulk expression data, enabling researchers to recover cell-type-specific information.

Integrating multiomics longitudinal data to reconstruct networks underlying lung development

AJP Lung Cellular and Molecular Physiology ◽

10.1152/ajplung.00554.2018 ◽

2019 ◽

Vol 317 (5) ◽

pp. L556-L568 ◽

Cited By ~ 7

Author(s):

Jun Ding ◽

Farida Ahangari ◽

Celia R. Espinoza ◽

Divya Chhabra ◽

Teodora Nicola ◽

...

Keyword(s):

Lung Development ◽

Regulatory Networks ◽

Cell Types ◽

Data Sets ◽

Rna Seq ◽

Key Events ◽

The Impact ◽

Laser Capture ◽

Key Pathways ◽

Predetermined Time

A comprehensive understanding of the dynamic regulatory networks that govern postnatal alveolar lung development is still lacking. To construct such a model, we profiled mRNA, microRNA, DNA methylation, and proteomics of developing murine alveoli isolated by laser capture microdissection at 14 predetermined time points. We developed a detailed comprehensive and interactive model that provides information about the major expression trajectories, the regulators of specific key events, and the impact of epigenetic changes. Intersecting the model with single-cell RNA-Seq data led to the identification of active pathways in multiple or individual cell types. We then constructed a similar model for human lung development by profiling time-series human omics data sets. Several key pathways and regulators are shared between the reconstructed models. We experimentally validated the activity of a number of predicted regulators, leading to new insights about the regulation of innate immunity during lung development.

Generation and network analysis of an RNA-seq transcriptional atlas for the rat.

10.1101/2021.11.07.467633 ◽

2021 ◽

Author(s):

Kim Summers ◽

Stephen J. Bush ◽

Chunlei Wu ◽

David A Hume

Keyword(s):

Cell Types ◽

Rat Tissues ◽

Biological Processes ◽

Expression Data ◽

Rna Seq ◽

Cell Type ◽

Gene Correlation ◽

Cell Type Specific ◽

Critical Interpretation ◽

Public Repositories

The laboratory rat is an important model for biomedical research. To generate a comprehensive rat transcriptomic atlas, we curated and downloaded 7700 rat RNA-seq datasets from public repositories, down-sampled them to a common depth and quantified expression. Data from 590 rat tissues and cells, averaged from each Bioproject, can be visualised and queried at http://biogps.org/ratatlas. Gene correlation network (GCN) analysis revealed clusters of transcripts that were tissue or cell-type restricted and contained transcription factors implicated in lineage determination. Other clusters were enriched for transcripts associated with biological processes. Many of these clusters overlap with previous data from analysis of other species whilst some (e.g. expressed specifically in immune cells, retina/pineal gland, pituitary and germ cells) are unique to these data. GCN on large subsets of the data related specifically to liver, nervous system, kidney, musculoskeletal system and cardiovascular system enabled deconvolution of cell-type specific signatures. The approach is extensible and the dataset can be used as a point of reference from which to analyse the transcriptomes of cell types and tissues that have not yet been sampled. Sets of strictly co-expressed transcripts provide a resource for critical interpretation of single cell RNA-seq data.

Discovering cell types using manifold learning and enhanced visualization of single-cell RNA-Seq data

Scientific Reports ◽

10.1038/s41598-021-03613-0 ◽

2022 ◽

Vol 12 (1) ◽

Author(s):

Akram Vasighizaker ◽

Saiteja Danda ◽

Luis Rueda

Keyword(s):

Dimensionality Reduction ◽

Single Cell ◽

Cell Types ◽

Gene Set Enrichment Analysis ◽

Rna Seq ◽

Reduction Techniques ◽

Non Linear ◽

Dimensionality Reduction Techniques ◽

Linear Dimensionality Reduction ◽

The Impact

Abstract. Identifying relevant disease modules such as target cell types is a significant step for studying diseases. High-throughput single-cell RNA-Seq (scRNA-seq) technologies have advanced in recent years, enabling researchers to investigate cells individually and understand their biological mechanisms. Computational techniques such as clustering are the most suitable approach in scRNA-seq data analysis when the cell types have not been well characterized. These techniques can be used to identify a group of genes that belong to a specific cell type based on their similar gene expression patterns. However, due to the sparsity and high dimensionality of scRNA-seq data, classical clustering methods are not efficient. Therefore, the use of non-linear dimensionality reduction techniques to improve clustering results is crucial. We introduce a method that identifies representative clusters of different cell types by combining non-linear dimensionality reduction techniques and clustering algorithms. We assess the impact of different dimensionality reduction techniques combined with the clustering of thirteen publicly available scRNA-seq datasets of different tissues, sizes, and technologies. We further performed gene set enrichment analysis to evaluate the proposed method’s performance. Our results show that modified locally linear embedding combined with independent component analysis yields overall the best performance relative to the existing unsupervised methods across different datasets.

Integrative single-cell and bulk RNA-seq analysis in human retina identified cell type-specific composition and gene expression changes for age-related macular degeneration

10.1101/768143 ◽

2019 ◽

Author(s):

Yafei Lyu ◽

Randy Zauhar ◽

Nico Dana ◽

Christianne E. Strang ◽

Kui Wang ◽

...

Keyword(s):

Gene Expression ◽

Macular Degeneration ◽

Single Cell ◽

Cell Types ◽

Age Related Macular Degeneration ◽

Peripheral Retina ◽

Rna Seq ◽

Cell Type ◽

Age Related ◽

The Impact

Age-related macular degeneration (AMD) preferentially affects distinct cell types and topographic regions in retina. To characterize the impact of AMD on gene expression changes across retinal cell types and regions, we generated both single-cell RNA-seq (scRNA-seq) and bulk RNA-seq data from macular and peripheral retina in postmortem human donors with and without AMD. The scRNA-seq data revealed 11 major cell types with many previously reported AMD risk genes showing substantial cell type and region specificity. Cell type proportional changes with advancing AMD stage were significant for Müller glia, rods, astrocytes, microglia and endothelium.

AutoGeneS: Automatic gene selection using multi-objective optimization for RNA-seq deconvolution

10.1101/2020.02.21.940650 ◽

2020 ◽

Cited By ~ 5

Author(s):

Hananeh Aliee ◽

Fabian Theis

Keyword(s):

Single Cell ◽

Prior Knowledge ◽

Gene Selection ◽

Ground Truth ◽

Cell Types ◽

Cellular Heterogeneity ◽

Marker Genes ◽

Rna Seq ◽

Cell Type ◽

The Impact

Abstract. Tissues are complex systems of interacting cell types. Knowing cell-type proportions in a tissue is very important to identify which cells or cell types are targeted by a disease or perturbation. When measuring such responses using RNA-seq, bulk RNA-seq masks cellular heterogeneity. Hence, several computational methods have been proposed to infer cell-type proportions from bulk RNA samples. Their performance with noisy reference profiles highly depends on the set of genes undergoing deconvolution. These genes are often selected based on prior knowledge or a single-criterion test that might not be useful to dissect closely correlated cell types. In this work, we introduce AutoGeneS, a tool that automatically extracts informative genes and reveals the cellular heterogeneity of bulk RNA samples. AutoGeneS requires no prior knowledge about marker genes and selects genes by simultaneously optimizing multiple criteria: minimizing the correlation and maximizing the distance between cell types. It can be applied to reference profiles from various sources like single-cell experiments or sorted cell populations. Results from human samples of peripheral blood illustrate that AutoGeneS outperforms other methods. Our results also highlight the impact of our approach on analyzing bulk RNA samples with noisy single-cell reference profiles and closely correlated cell types. Ground truth cell proportions analyzed by flow cytometry confirmed the accuracy of the predictions of AutoGeneS in identifying cell-type proportions. AutoGeneS is available for use via a standalone Python package (https://github.com/theislab/AutoGeneS).
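
The two criteria named in the abstract can be sketched for a candidate gene subset as follows. This is not the AutoGeneS package or its API. The abstract does not describe the optimizer, so the sketch simply enumerates every subset of a fixed size and keeps the non-dominated ones, which is only feasible for toy inputs. The function names, the choice to sum absolute Pearson correlations and Euclidean distances over all cell-type pairs, and the example profiles are assumptions.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def objectives(profiles, genes):
    """Score a gene subset: (total |correlation|, total distance) over cell-type pairs.

    profiles: one expression vector per cell type, indexed by gene.
    """
    sub = [[p[g] for g in genes] for p in profiles]
    corr = dist = 0.0
    for a, b in combinations(sub, 2):
        corr += abs(pearson(a, b))
        dist += math.dist(a, b)
    return corr, dist

def pareto_front(profiles, n_genes, k):
    """Brute-force search: keep size-k subsets not dominated on both criteria."""
    scored = [(genes, objectives(profiles, genes))
              for genes in combinations(range(n_genes), k)]
    front = []
    for genes, (c, d) in scored:
        # Another subset dominates if it is no worse on both and better on one.
        dominated = any(c2 <= c and d2 >= d and (c2 < c or d2 > d)
                        for _, (c2, d2) in scored)
        if not dominated:
            front.append((genes, c, d))
    return front

# Toy reference: 3 cell types by 4 genes; values are made up.
reference = [[1, 2, 3, 4], [3, 2, 1, 5], [0, 5, 0, 5]]
for genes, corr, dist in pareto_front(reference, n_genes=4, k=3):
    print(genes, corr, dist)
```

Because the two criteria can conflict, the result is a set of trade-off subsets rather than a single winner; choosing one subset from that set is a separate decision that this sketch leaves open.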
