An Efficient and Flexible Method for Deconvoluting Bulk RNA-Seq Data with Single-Cell RNA-Seq Data

Estimating cell-type compositions for complex diseases is an important step in investigating cellular heterogeneity, understanding disease etiology, and potentially facilitating early disease diagnosis and prevention. Here, we developed a computational statistical method, referred to as Multi-Omics Matrix Factorization (MOMF), to estimate the cell-type compositions of bulk RNA sequencing (RNA-seq) data by leveraging cell type-specific gene expression levels from single-cell RNA sequencing (scRNA-seq) data. MOMF not only directly models the count nature of gene expression data, but also effectively accounts for the uncertainty of cell type-specific mean gene expression levels. We demonstrate the benefits of MOMF through three real data applications: glioblastoma (GBM), colorectal cancer (CRC) and type II diabetes (T2D) studies. MOMF is able to accurately estimate disease-related cell type proportions, namely those of oligodendrocyte progenitor cells and macrophages, which are strongly associated with survival in GBM and CRC, respectively.

SCDC: bulk gene expression deconvolution by multiple single-cell RNA sequencing references

Briefings in Bioinformatics ◽

10.1093/bib/bbz166 ◽

2020 ◽

Cited By ~ 13

Author(s):

Meichen Dong ◽

Aatish Thennavan ◽

Eugene Urrutia ◽

Yun Li ◽

Charles M Perou ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

RNA Sequencing ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Specific Gene ◽

RNA Seq ◽

Cell Type ◽

Mixed Cell ◽

Single Cell RNA Sequencing

Abstract Recent advances in single-cell RNA sequencing (scRNA-seq) enable characterization of transcriptomic profiles with single-cell resolution and circumvent averaging artifacts associated with traditional bulk RNA sequencing (RNA-seq) data. Here, we propose SCDC, a deconvolution method for bulk RNA-seq that leverages cell-type specific gene expression profiles from multiple scRNA-seq reference datasets. SCDC adopts an ENSEMBLE method to integrate deconvolution results from different scRNA-seq datasets that are produced in different laboratories and at different times, implicitly addressing the problem of batch-effect confounding. SCDC is benchmarked against existing methods using both in silico generated pseudo-bulk samples and experimentally mixed cell lines, whose known cell-type compositions serve as ground truths. We show that SCDC outperforms existing methods with improved accuracy of cell-type decomposition under both settings. To illustrate how the ENSEMBLE framework performs in complex tissues under different scenarios, we further apply our method to a human pancreatic islet dataset and a mouse mammary gland dataset. SCDC returns results that are more consistent with experimental designs and that reproduce more significant associations between cell-type proportions and measured phenotypes.
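The abstract does not say how the ENSEMBLE weights are chosen, so the following is only a minimal sketch of the general idea, assuming two references and a weight picked by grid search to minimize the L1 distance between the observed bulk profile and the blended predicted profile. The function names (`predict_bulk`, `ensemble_two_references`) and the toy numbers are hypothetical and are not taken from the SCDC software.

```python
def predict_bulk(signature, proportions):
    """Predicted bulk profile: signature (genes x cell types) times proportions."""
    return [sum(s * p for s, p in zip(row, proportions)) for row in signature]


def l1_error(observed, predicted):
    """Sum of absolute differences between two expression profiles."""
    return sum(abs(o - p) for o, p in zip(observed, predicted))


def ensemble_two_references(bulk, sig_a, prop_a, sig_b, prop_b, steps=20):
    """Grid-search a weight w in [0, 1] for reference A (1 - w goes to reference B).

    Assumption: the weight is scored by how well the blended predicted bulk
    profile reproduces the observed one, because true proportions are unknown.
    """
    pred_a = predict_bulk(sig_a, prop_a)
    pred_b = predict_bulk(sig_b, prop_b)
    best_w, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blended = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]
        err = l1_error(bulk, blended)
        if err < best_err:
            best_w, best_err = w, err
    combined = [best_w * a + (1 - best_w) * b for a, b in zip(prop_a, prop_b)]
    return best_w, combined


# Toy example (hypothetical numbers): 3 genes, 2 cell types, 2 references.
sig_a = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
prop_a = [0.3, 0.7]  # proportions estimated with reference A
sig_b = [[8.0, 1.0], [1.0, 8.0], [4.0, 6.0]]
prop_b = [0.6, 0.4]  # proportions estimated with reference B
bulk = predict_bulk(sig_a, prop_a)  # bulk sample generated from reference A

weight_a, combined = ensemble_two_references(bulk, sig_a, prop_a, sig_b, prop_b)
print(weight_a, combined)
```

In this toy case the bulk sample is generated from reference A, so the search puts all of the weight on A. With real data the weight would typically fall between the two references.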

Implication of specific retinal cell-type involvement and gene expression changes in AMD progression using integrative analysis of single-cell and bulk RNA-seq profiling

Scientific Reports ◽

10.1038/s41598-021-95122-3 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Yafei Lyu ◽

Randy Zauhar ◽

Nicholas Dana ◽

Christianne E. Strang ◽

Jian Hu ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

RNA Sequencing ◽

Age Related Macular Degeneration ◽

Specific Gene ◽

Cell Type ◽

Adult Human ◽

Single Cell RNA Sequencing ◽

Cell Type Specific ◽

Cell Data

Abstract Age-related macular degeneration (AMD) is a blinding eye disease with no unifying theme for its etiology. We used single-cell RNA sequencing to analyze the transcriptomes of ~93,000 cells from the macula and peripheral retina from two adult human donors and bulk RNA sequencing from fifteen adult human donors with and without AMD. Analysis of our single-cell data identified 267 cell-type-specific genes. Comparison of macula and peripheral retinal regions found no cell-type differences but did identify 50 differentially expressed genes (DEGs) with about 1/3 expressed in cones. Integration of our single-cell data with bulk RNA sequencing data from normal and AMD donors showed compositional changes more pronounced in macula in rods, microglia, endothelium, Müller glia, and astrocytes in the transition from normal to advanced AMD. KEGG pathway analysis of our normal vs. advanced AMD eyes identified enrichment in complement and coagulation pathways, antigen presentation, tissue remodeling, and signaling pathways including PI3K-Akt, NOD-like, Toll-like, and Rap1. These results showcase the use of single-cell RNA sequencing to infer cell-type compositional and cell-type-specific gene expression changes in intact bulk tissue and provide a foundation for investigating molecular mechanisms of retinal disease that lead to new therapeutic targets.

Bayesian estimation of cell-type-specific gene expression per bulk sample with prior derived from single-cell data

10.1101/2020.08.05.238949 ◽

2020 ◽

Cited By ~ 1

Author(s):

Jiebiao Wang ◽

Kathryn Roeder ◽

Bernie Devlin

Keyword(s):

Gene Expression ◽

Single Cell ◽

RNA Sequencing ◽

Cell Types ◽

Autism Spectrum ◽

Specific Cell ◽

Expression Data ◽

Cell Type ◽

Disease Etiology ◽

Cell Type Specific

Abstract When assessed over a large number of samples, bulk RNA sequencing provides reliable data for gene expression at the tissue level. Single-cell RNA sequencing (scRNA-seq) deepens those analyses by evaluating gene expression at the cellular level. Both data types lend insights into disease etiology. With current technologies, however, scRNA-seq data are known to be noisy. Moreover, constrained by costs, scRNA-seq data are typically generated from a relatively small number of subjects, which limits their utility for some analyses, such as identification of gene expression quantitative trait loci (eQTLs). To address these issues while maintaining the unique advantages of each data type, we develop a Bayesian method (bMIND) to integrate bulk and scRNA-seq data. With a prior derived from scRNA-seq data, we propose to estimate sample-level cell-type-specific (CTS) expression from bulk expression data. The CTS expression enables large-scale sample-level downstream analyses, such as detecting CTS differentially expressed genes (DEGs) and eQTLs. Through simulations, we demonstrate that bMIND improves the accuracy of sample-level CTS expression estimates and power to discover CTS-DEGs when compared to existing methods. To further our understanding of two complex phenotypes, autism spectrum disorder and Alzheimer’s disease, we apply bMIND to gene expression data of relevant brain tissue to identify CTS-DEGs. Our results complement findings for CTS-DEGs obtained from snRNA-seq studies, replicating certain DEGs in specific cell types while nominating other novel genes in those cell types. Finally, we calculate CTS-eQTLs for eleven brain regions by analyzing GTEx V8 data, creating a new resource for biological insights.

SCDC: Bulk Gene Expression Deconvolution by Multiple Single-Cell RNA Sequencing References

10.1101/743591 ◽

2019 ◽

Cited By ~ 1

Author(s):

Meichen Dong ◽

Aatish Thennavan ◽

Eugene Urrutia ◽

Yun Li ◽

Charles M. Perou ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

RNA Sequencing ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Specific Gene ◽

RNA Seq ◽

Cell Type ◽

Mixed Cell ◽

Single Cell RNA Sequencing

Abstract Recent advances in single-cell RNA sequencing (scRNA-seq) enable characterization of transcriptomic profiles with single-cell resolution and circumvent averaging artifacts associated with traditional bulk RNA sequencing (RNA-seq) data. Here, we propose SCDC, a deconvolution method for bulk RNA-seq that leverages cell-type specific gene expression profiles from multiple scRNA-seq reference datasets. SCDC adopts an ENSEMBLE method to integrate deconvolution results from different scRNA-seq datasets that are produced in different laboratories and at different times, implicitly addressing the problem of batch-effect confounding. SCDC is benchmarked against existing methods using both in silico generated pseudo-bulk samples and experimentally mixed cell lines, whose known cell-type compositions serve as ground truths. We show that SCDC outperforms existing methods with improved accuracy of cell-type decomposition under both settings. To illustrate how the ENSEMBLE framework performs in complex tissues under different scenarios, we further apply our method to a human pancreatic islet dataset and a mouse mammary gland dataset. SCDC returns results that are more consistent with experimental designs and that reproduce more significant associations between cell-type proportions and measured phenotypes.
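The in silico pseudo-bulk benchmark mentioned in the abstract can be illustrated with a short sketch: per-gene counts from annotated single cells are summed into one artificial bulk sample, and the fraction of cells of each type serves as the known composition. This is an illustration of the general construction under that assumption, not the benchmarking code used by the authors. The function name `make_pseudo_bulk` and the toy counts are hypothetical.

```python
from collections import Counter


def make_pseudo_bulk(cells):
    """Build one pseudo-bulk sample from annotated single cells.

    cells: list of (cell_type, counts) pairs, where counts holds per-gene counts.
    Returns the summed per-gene counts and the true cell-type proportions.
    """
    n_genes = len(cells[0][1])
    bulk = [0] * n_genes
    for _, counts in cells:
        for gene_index, count in enumerate(counts):
            bulk[gene_index] += count
    type_counts = Counter(cell_type for cell_type, _ in cells)
    proportions = {t: n / len(cells) for t, n in type_counts.items()}
    return bulk, proportions


# Toy input (hypothetical counts): four cells, three genes.
cells = [
    ("alpha", [5, 0, 1]),
    ("alpha", [4, 1, 0]),
    ("alpha", [6, 0, 2]),
    ("beta", [0, 7, 2]),
]
pseudo_bulk, truth = make_pseudo_bulk(cells)
print(pseudo_bulk, truth)
```

A deconvolution method can then be scored by comparing its estimated proportions for `pseudo_bulk` against `truth`.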

Sources of Variation in Cell-Type RNA-Seq Profiles

10.21203/rs.2.23415/v1 ◽

2020 ◽

Cited By ~ 1

Author(s):

Johan Gustafsson ◽

Felix Held ◽

Jonathan Robinson ◽

Elias Björnson ◽

Rebecka Jörnsten ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Specific Gene ◽

RNA Seq ◽

Cell Type ◽

Specific Gene Expression ◽

Cell Type Specific ◽

Technical Factors

Abstract Background Cell-type specific gene expression profiles are needed for many computational methods operating on bulk RNA-Seq samples, such as deconvolution of cell-type fractions and digital cytometry. However, the gene expression profile of a cell type can vary substantially due to both technical factors and biological differences in cell state and surroundings, reducing the efficacy of such methods. Here, we investigated which factors contribute most to this variation. Results We evaluated different normalization methods, quantified the magnitude of variation introduced by different sources, and examined the differences between UMI-based single-cell RNA-Seq and bulk RNA-Seq. We applied methods such as random forest regression to a collection of publicly available bulk and single-cell RNA-Seq datasets containing B and T cells, and found that the technical variation across laboratories is of the same magnitude as the biological variation across cell types. Tissue of origin and cell subtype are less important but still substantial factors, while the difference between individuals is relatively small. We also show that much of the differences between UMI-based single-cell and bulk RNA-Seq methods can be explained by the number of read duplicates per mRNA molecule in the single-cell sample. Conclusions Our work shows the importance of either matching or correcting for technical factors when creating cell-type specific gene expression profiles that are to be used together with bulk samples.

Sources of Variation in Cell-Type RNA-Seq Profiles

10.21203/rs.2.23415/v2 ◽

2020 ◽

Author(s):

Johan Gustafsson ◽

Felix Held ◽

Jonathan Robinson ◽

Elias Björnson ◽

Rebecka Jörnsten ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Specific Gene ◽

RNA Seq ◽

Cell Type ◽

Specific Gene Expression ◽

Cell Type Specific ◽

Technical Factors

Abstract Cell-type specific gene expression profiles are needed for many computational methods operating on bulk RNA-Seq samples, such as deconvolution of cell-type fractions and digital cytometry. However, the gene expression profile of a cell type can vary substantially due to both technical factors and biological differences in cell state and surroundings, reducing the efficacy of such methods. Here, we investigated which factors contribute most to this variation. We evaluated different normalization methods, quantified the variance explained by different factors, evaluated the effect on deconvolution of cell type fractions, and examined the differences between UMI-based single-cell RNA-Seq and bulk RNA-Seq. We investigated a collection of publicly available bulk and single-cell RNA-Seq datasets containing B and T cells, and found that the technical variation across laboratories is substantial, even for genes specifically selected for deconvolution, and has a confounding effect on deconvolution. Tissue of origin is also a substantial factor, highlighting the challenge of applying cell type profiles derived from blood on mixtures from other tissues. We also show that much of the differences between UMI-based single-cell and bulk RNA-Seq methods can be explained by the number of read duplicates per mRNA molecule in the single-cell sample. Our work shows the importance of either matching or correcting for technical factors when creating cell-type specific gene expression profiles that are to be used together with bulk samples.

Deep autoencoder enables interpretable tissue-adaptive deconvolution and cell-type-specific gene analysis

10.1101/2021.10.26.465846 ◽

2021 ◽

Author(s):

Yanshuo Chen ◽

Yixuan Wang ◽

Yuelong Chen ◽

Yumeng Wei ◽

Yunxiang Li ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Clinical Data ◽

Gene Set Enrichment Analysis ◽

Specific Gene ◽

RNA Seq ◽

Cell Type ◽

Specific Gene Expression ◽

Wide Range ◽

Cell Type Specific

Abstract Single-cell RNA-seq has become a powerful tool for researchers to study biologically significant characteristics at explicitly high resolution, but its application on emerging data is currently limited by its intrinsic techniques. Here, we introduce TAPE, a deep learning method that connects bulk RNA-seq and single-cell RNA-seq to balance the demands of big data and precision. By taking advantage of constructing an interpretable decoder and training under a unique scheme, TAPE can predict cell-type fractions and cell-type-specific gene expression tissue-adaptively. Compared with existing methods on several benchmarking datasets, TAPE is more accurate (up to 40% performance improvement on the real bulk data) and faster than the previous methods. It is sensitive enough to provide biologically meaningful predictions. For example, only TAPE can predict the tendency of increasing monocyte-to-lymphocyte ratio (MLR) in COVID-19 patients from mild to serious symptoms, whose estimated indices are consistent with laboratory data. More importantly, through the analysis of clinical data, TAPE shows its ability to predict cell-type-specific gene expression profiles with biological significance. Combining with single-sample gene set enrichment analysis (ssGSEA), TAPE also provides valuable clues for people to investigate the immune response in different virus-infected patients. We believe that TAPE will enable and accelerate the precise analysis of high-throughput clinical data in a wide range.

A novel method for predicting cell abundance based on single-cell RNA-seq data

BMC Bioinformatics ◽

10.1186/s12859-021-04187-4 ◽

2021 ◽

Vol 22 (S9) ◽

Author(s):

Jiajie Peng ◽

Lu Han ◽

Xuequn Shang

Keyword(s):

Gene Expression ◽

Least Squares ◽

Single Cell ◽

Specific Gene ◽

RNA Seq ◽

Cell Type ◽

Gene Expression Matrix ◽

Cell Type Specific ◽

Novel Method ◽

Signature Matrix

Abstract Background It is important to understand the composition of cell type and its proportion in intact tissues, as changes in certain cell types are the underlying cause of disease in humans. Although compositions of cell type and ratios can be obtained by single-cell sequencing, single-cell sequencing is currently expensive and cannot be applied in clinical studies involving a large number of subjects. Therefore, it is useful to apply the bulk RNA-Seq dataset and the single-cell RNA dataset to deconvolute and obtain the cell type composition in the tissue. Results By analyzing the existing cell population prediction methods, we found that most of the existing methods need the cell-type-specific gene expression profile as the input of the signature matrix. However, in real applications, it is not always possible to find an available signature matrix. To solve this problem, we proposed a novel method, named DCap, to predict cell abundance. DCap is a deconvolution method based on non-negative least squares. DCap considers the weight resulting from measurement noise of bulk RNA-seq and calculation error of single-cell RNA-seq data, during the calculation process of non-negative least squares and performs the weighted iterative calculation based on least squares. By weighting the bulk tissue gene expression matrix and single-cell gene expression matrix, DCap minimizes the measurement error of bulk RNA-Seq and also reduces errors resulting from differences in the number of expressed genes in the same type of cells in different samples. Evaluation test shows that DCap performs better in cell type abundance prediction than existing methods. Conclusion DCap solves the deconvolution problem using weighted non-negative least squares to predict cell type abundance in tissues. DCap has better prediction results and does not need to prepare a signature matrix that gives the cell-type-specific gene expression profile in advance. By using DCap, we can better study the changes in cell proportion in diseased tissues and provide more information on the follow-up treatment of diseases.
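To make the weighted non-negative least squares idea concrete, here is a minimal sketch that solves the problem by projected gradient descent using only the Python standard library. It is not the DCap implementation: the per-gene weights are supplied as fixed inputs (DCap's iterative reweighting scheme is not reproduced), and the function names and toy matrices are hypothetical.

```python
def weighted_nnls(signature, bulk, weights, iterations=2000):
    """Minimize sum_g w_g * (bulk_g - sum_k signature[g][k] * x_k)^2 subject to x >= 0.

    Uses projected gradient descent; this solver choice is an assumption of the
    sketch, picked to avoid third-party dependencies.
    """
    n_types = len(signature[0])
    # Trace bound on the largest eigenvalue of 2 * S^T W S gives a safe step size.
    lipschitz = 2.0 * sum(w * sum(s * s for s in row) for row, w in zip(signature, weights))
    step = 1.0 / lipschitz
    x = [0.0] * n_types
    for _ in range(iterations):
        residuals = [
            sum(s * xk for s, xk in zip(row, x)) - y
            for row, y in zip(signature, bulk)
        ]
        gradient = [
            2.0 * sum(w * r * row[k] for row, r, w in zip(signature, residuals, weights))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, xk - step * g) for xk, g in zip(x, gradient)]
    return x


def to_proportions(x):
    """Rescale non-negative coefficients so that they sum to one."""
    total = sum(x)
    return [v / total for v in x] if total > 0 else list(x)


# Toy example (hypothetical numbers): 4 genes, 2 cell types, noiseless bulk.
signature = [[10.0, 1.0], [1.0, 10.0], [5.0, 5.0], [2.0, 8.0]]
true_props = [0.25, 0.75]
bulk = [sum(s * p for s, p in zip(row, true_props)) for row in signature]
weights = [1.0, 1.0, 0.5, 2.0]  # per-gene weights, assumed given

estimate = to_proportions(weighted_nnls(signature, bulk, weights))
print(estimate)
```

Because the toy bulk profile is noiseless, any positive weights recover the same proportions. The weights only change the answer when genes disagree, which is the situation where down-weighting noisy genes helps.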

A Unified Statistical Framework for Single Cell and Bulk Sequencing Data

10.1101/206532 ◽

2017 ◽

Cited By ~ 1

Author(s):

Lingxue Zhu ◽

Jing Lei ◽

Bernie Devlin ◽

Kathryn Roeder

Keyword(s):

Gene Expression ◽

Single Cell ◽

Cell Types ◽

Accurate Estimation ◽

Specific Gene ◽

RNA Seq ◽

Cell Type ◽

Cell Type Specific ◽

Different Cell Types ◽

Cell Data

Recent advances in technology have enabled the measurement of RNA levels for individual cells. Compared to traditional tissue-level bulk RNA-seq data, single cell sequencing yields valuable insights about gene expression profiles for different cell types, which is potentially critical for understanding many complex human diseases. However, developing quantitative tools for such data remains challenging because of high levels of technical noise, especially the “dropout” events. A “dropout” happens when the RNA for a gene fails to be amplified prior to sequencing, producing a “false” zero in the observed data. In this paper, we propose a Unified RNA-Sequencing Model (URSM) for both single cell and bulk RNA-seq data, formulated as a hierarchical model. URSM borrows the strength from both data sources and carefully models the dropouts in single cell data, leading to a more accurate estimation of cell type specific gene expression profile. In addition, URSM naturally provides inference on the dropout entries in single cell data that need to be imputed for downstream analyses, as well as the mixing proportions of different cell types in bulk samples. We adopt an empirical Bayes approach, where parameters are estimated using the EM algorithm and approximate inference is obtained by Gibbs sampling. Simulation results illustrate that URSM outperforms existing approaches both in correcting for dropouts in single cell data, as well as in deconvolving bulk samples. We also demonstrate an application to gene expression data on fetal brains, where our model successfully imputes the dropout genes and reveals cell type specific expression patterns.
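The distinction between a "false" zero and a genuine zero can be illustrated with a much simpler model than the hierarchical one described in the abstract. The sketch below assumes a zero-inflated Poisson model, in which a gene drops out with probability `dropout_prob` and is otherwise observed as a Poisson count with mean `mean_expression`. Under that assumption, Bayes' rule gives the probability that an observed zero is a dropout. The function and parameter names are hypothetical, and this is not the URSM model or its inference procedure.

```python
import math


def dropout_posterior(dropout_prob, mean_expression):
    """P(dropout | observed count is zero) under a zero-inflated Poisson model.

    A zero arises either from a dropout (probability dropout_prob) or from a
    genuine Poisson zero (probability (1 - dropout_prob) * exp(-mean_expression)).
    """
    p_true_zero = (1.0 - dropout_prob) * math.exp(-mean_expression)
    return dropout_prob / (dropout_prob + p_true_zero)


# The more highly a gene is expressed, the more likely an observed zero is a dropout.
for mean in (0.0, 1.0, 10.0):
    print(mean, dropout_posterior(0.3, mean))
```

For a gene with zero mean expression the data carry no information and the posterior equals the prior dropout probability. For a highly expressed gene an observed zero is almost certainly a dropout, which is why such entries are natural candidates for imputation.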
