BayesDeBulk: A Flexible Bayesian Algorithm for the Deconvolution of Bulk Tumor Data

2021 ◽  
Author(s):  
Francesca Petralia ◽  
Anna P Calinawan ◽  
Song Feng ◽  
Sara JC Gosline ◽  
Pietro Pugliese ◽  
...  

Characterizing the tumor microenvironment is crucial to improving responsiveness to immunotherapy and developing new therapeutic strategies. The fraction of different cell-types in the tumor microenvironment can be estimated from transcriptomic profiling of bulk tumor data via deconvolution algorithms. One class of such algorithms, known as reference-based, relies on a reference signature containing gene expression data for various cell-types. The limitation of these methods is that such a signature is derived from the gene expression of pure cell-types, which might not be consistent with the transcriptomic profiles of solid tumors. Reference-free methods, on the other hand, usually require only a set of cell-specific markers to perform deconvolution; however, once the different components have been estimated from the data, labeling them can be problematic. To overcome these limitations, we propose BayesDeBulk, a new reference-free Bayesian method for bulk deconvolution based on gene expression data. Given a list of markers expressed in each cell-type (cell-specific markers), a repulsive prior is placed on the mean of gene expression in different cell-types to ensure that cell-specific markers are upregulated in a particular component. In contrast to existing reference-free methods, the labeling of the different components is decided a priori through the repulsive prior. Furthermore, the advantage over reference-based algorithms is that the cell fractions and the gene expression of the different cells are estimated from the data simultaneously. Given its flexibility, BayesDeBulk can be used to perform bulk deconvolution beyond transcriptomic data, based on other data types such as proteomic profiles or the integration of transcriptomic and proteomic profiles.
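The mixing model that bulk deconvolution inverts can be illustrated with a small sketch. This is not the BayesDeBulk implementation: the function names, the toy numbers, and the hard "marker is strictly highest in its own cell type" check are assumptions made for illustration, standing in loosely for the role the repulsive prior plays in the abstract.

```python
# Toy sketch (assumed names and data, not the authors' code).
# Bulk expression is modeled as a fraction-weighted sum of cell-type profiles.

def mix(fractions, profiles):
    """Return bulk[g] = sum_k fractions[k] * profiles[k][g]."""
    n_genes = len(profiles[0])
    return [
        sum(f * profile[g] for f, profile in zip(fractions, profiles))
        for g in range(n_genes)
    ]


def markers_upregulated(profiles, markers):
    """Check that each marker gene is strictly highest in its own cell type.

    `markers` maps a cell-type index to the gene indices listed as its
    cell-specific markers. This hard check is only a stand-in for the
    soft constraint a repulsive prior would encode.
    """
    for k, genes in markers.items():
        for g in genes:
            for j, other in enumerate(profiles):
                if j != k and profiles[k][g] <= other[g]:
                    return False
    return True


# Two hypothetical cell types, three genes (gene 2 is not a marker).
profiles = [[10.0, 1.0, 5.0], [2.0, 8.0, 5.0]]
bulk = mix([0.25, 0.75], profiles)
print(bulk)
print(markers_upregulated(profiles, {0: [0], 1: [1]}))
```

A deconvolution method works in the opposite direction: it observes only `bulk` across many samples and estimates the fractions and, in the reference-free case, the profiles as well.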

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ikko Mito ◽  
Hideyuki Takahashi ◽  
Reika Kawabata-Iwakawa ◽  
Shota Ida ◽  
Hiroe Tada ◽  
...  

Head and neck squamous carcinoma (HNSCC) is highly infiltrated by immune cells, including tumor-infiltrating lymphocytes and myeloid lineage cells. In the tumor microenvironment, tumor cells orchestrate a highly immunosuppressive microenvironment by secreting immunosuppressive mediators, expressing immune checkpoint ligands, and downregulating human leukocyte antigen expression. In the present study, we aimed to comprehensively profile the immune microenvironment of HNSCC using gene expression data obtained from a public database. We calculated enrichment scores of 33 immune cell types based on gene expression data of HNSCC tissues and adjacent non-cancer tissues. Based on these scores, we performed unsupervised clustering and identified three immune signatures: cold, lymphocyte, and myeloid/dendritic cell (DC). We then compared the clinical and biological features of the three signatures. Among HNSCC and non-cancer tissues, human papillomavirus (HPV)-positive HNSCCs exhibited the highest scores in various immune cell types, including CD4+ T cells, CD8+ T cells, B cells, plasma cells, basophils, and their subpopulations. The proportions of HPV-positive tumors, oropharyngeal cancers, early T tumors, and N factor positive cases were significantly higher in the lymphocyte signature than in the other signatures. The lymphocyte signature showed the longest overall survival (OS), especially in HPV-positive patients, whereas the myeloid/DC signature demonstrated the shortest OS in these patients. Gene set enrichment analysis revealed the upregulation of several pathways related to inflammatory and proinflammatory responses in the lymphocyte signature. The expression of PRF1, IFNG, GZMB, CXCL9, CXCL10, PDCD1, LAG3, CTLA4, HAVCR2, and TIGIT was highest in the lymphocyte signature. Meanwhile, the expression of the PD-1 ligand genes CD274 and PDCD1LG2 was highest in the myeloid/DC signature. Our findings reveal a transcriptomic landscape of the immune microenvironment that closely reflects the clinical and biological significance of HNSCC, indicating that molecular profiling of the immune microenvironment can be employed to develop novel biomarkers and precision immunotherapies for HNSCC.


2015 ◽  
Vol 47 (6) ◽  
pp. 232-239 ◽  
Author(s):  
Gustav Holmgren ◽  
Nidal Ghosheh ◽  
Xianmin Zeng ◽  
Yalda Bogestål ◽  
Peter Sartipy ◽  
...  

Reference genes, often referred to as housekeeping genes (HKGs), are frequently used to normalize gene expression data based on the assumption that they are expressed at a constant level in the cells. However, several studies have shown that the expression levels of HKGs may vary considerably across cell types. In a previous study of human embryonic stem cells (hESCs) subjected to spontaneous differentiation, we observed that the expression of commonly used HKGs varied to a degree that rendered them inappropriate as reference genes under those experimental settings. Here we present a substantially extended study of the HKG signature in human pluripotent stem cells (hPSCs), including nine global gene expression datasets from both hESCs and human induced pluripotent stem cells, obtained during directed differentiation toward endoderm, mesoderm, and ectoderm derivatives. Sets of stably expressed genes were compiled, and a handful of genes (e.g., EID2, ZNF324B, CAPN10, and RABEP2) were identified as generally applicable reference genes in hPSCs across all cell lines and experimental conditions. The stability of the gene expression profiles was confirmed by reverse transcription quantitative PCR analysis. Taken together, the current results suggest that differentiating hPSCs have a distinct HKG signature, which in some aspects differs from that of somatic cell types, and underscore the necessity of validating the stability of reference genes under the actual experimental setup used. In addition, the novel putative HKGs identified in this study can preferentially be used for normalization of gene expression data obtained from differentiating hPSCs.
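One simple way to screen candidate reference genes for stable expression is to rank them by coefficient of variation across samples. The abstract does not say which stability criterion the study used, so the metric, the function name, and the placeholder gene names and values below are all assumptions for illustration.

```python
from statistics import mean, pstdev


def rank_by_cv(expression):
    """Rank genes from most to least stable by coefficient of variation.

    `expression` maps a gene name to its expression values across samples.
    CV = population standard deviation / mean; lower means more stable.
    """
    cvs = {gene: pstdev(values) / mean(values) for gene, values in expression.items()}
    return sorted(cvs, key=cvs.get)


# Hypothetical genes and values, not data from the study.
expression = {
    "GENE_A": [10.0, 10.0, 10.0],
    "GENE_B": [5.0, 10.0, 15.0],
    "GENE_C": [9.0, 10.0, 11.0],
}
print(rank_by_cv(expression))
```

Whatever metric is used, the abstract's point stands: a candidate that ranks as stable in one experimental setup should be re-checked in the setup where it will actually serve as a normalizer.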


2017 ◽  
Author(s):  
Hilary K. Finucane ◽  
Yakir A. Reshef ◽  
Verneri Anttila ◽  
Kamil Slowikowski ◽  
Alexander Gusev ◽  
...  

Genetics can provide a systematic approach to discovering the tissues and cell types relevant for a complex disease or trait. Identifying these tissues and cell types is critical for following up on non-coding allelic function, developing ex-vivo models, and identifying therapeutic targets. Here, we analyze gene expression data from several sources, including the GTEx and PsychENCODE consortia, together with genome-wide association study (GWAS) summary statistics for 48 diseases and traits with an average sample size of 169,331, to identify disease-relevant tissues and cell types. We develop and apply an approach that uses stratified LD score regression to test whether disease heritability is enriched in regions surrounding genes with the highest specific expression in a given tissue. We detect tissue-specific enrichments at FDR < 5% for 34 diseases and traits across a broad range of tissues that recapitulate known biology. In our analysis of traits with observed central nervous system enrichment, we detect an enrichment of neurons over other brain cell types for several brain-related traits, enrichment of inhibitory over excitatory neurons for bipolar disorder but excitatory over inhibitory neurons for schizophrenia and body mass index, and enrichments in the cortex for schizophrenia and in the striatum for migraine. In our analysis of traits with observed immunological enrichment, we identify enrichments of T cells for asthma and eczema, B cells for primary biliary cirrhosis, and myeloid cells for Alzheimer's disease, which we validated with independent chromatin data. Our results demonstrate that our polygenic approach is a powerful way to leverage gene expression data for interpreting GWAS signal.


eLife ◽  
2017 ◽  
Vol 6 ◽  
Author(s):  
Julien Racle ◽  
Kaat de Jonge ◽  
Petra Baumgaertner ◽  
Daniel E Speiser ◽  
David Gfeller

Immune cells infiltrating tumors can have an important impact on tumor progression and response to therapy. We present an efficient algorithm to simultaneously estimate the fraction of cancer and immune cell types from bulk tumor gene expression data. Our method integrates novel gene expression profiles from each major non-malignant cell type found in tumors, renormalization based on cell-type-specific mRNA content, and the ability to consider uncharacterized and possibly highly variable cell types. Feasibility is demonstrated by validation with flow cytometry, immunohistochemistry and single-cell RNA-Seq analyses of human melanoma and colorectal tumor specimens. Altogether, our work not only improves accuracy but also broadens the scope of absolute cell fraction predictions from tumor gene expression data, and provides a novel experimental benchmark for immunogenomics analyses in cancer research (http://epic.gfellerlab.org).
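The renormalization step mentioned above can be sketched as follows: a deconvolution fit yields the share of mRNA contributed by each cell type, and dividing by each type's mRNA content per cell before rescaling converts those shares into cell fractions. This is a minimal sketch, not the EPIC package's code; the function name, cell-type labels, and numbers are hypothetical.

```python
def renormalize_by_mrna_content(mrna_fractions, mrna_per_cell):
    """Convert mRNA proportions into cell fractions.

    Each mRNA proportion is divided by the (relative) amount of mRNA a
    single cell of that type contributes, then the results are rescaled
    so that they sum to 1.
    """
    scaled = {
        cell: mrna_fractions[cell] / mrna_per_cell[cell]
        for cell in mrna_fractions
    }
    total = sum(scaled.values())
    return {cell: value / total for cell, value in scaled.items()}


# Hypothetical example: both types contribute half of the mRNA, but a
# type_b cell carries only a quarter as much mRNA as a type_a cell.
cell_fractions = renormalize_by_mrna_content(
    {"type_a": 0.5, "type_b": 0.5},
    {"type_a": 1.0, "type_b": 0.25},
)
print(cell_fractions)
```

In this toy case the low-mRNA cell type ends up with the larger cell fraction, which is the bias the renormalization is meant to correct.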


2019 ◽  
Vol 3 (s1) ◽  
pp. 2-2
Author(s):  
Megan C Hollister ◽  
Jeffrey D. Blume

OBJECTIVES/SPECIFIC AIMS: To examine and compare the claims in Bzdok, Altman, and Krzywinski under a broader set of conditions by using unbiased methods of comparison. To explore how to accurately use various machine learning and traditional statistical methods in large-scale translational research by estimating their accuracy statistics. Then we will identify the methods with the best performance characteristics. METHODS/STUDY POPULATION: We conducted a simulation study with a microarray of gene expression data. We maintained the original structure proposed by Bzdok, Altman, and Krzywinski. The structure for gene expression data includes a total of 40 genes from 20 people, of whom 10 are phenotype positive and 10 are phenotype negative. In order to find a statistical difference, 25% of the genes were set to be dysregulated across phenotype. This dysregulation forced the positive and negative phenotypes to have different mean population expressions. Additional variance was included to simulate genetic variation across the population. We also allowed for within-person correlation across genes, which was not done in the original simulations. The following methods were used to determine the number of dysregulated genes in the simulated data set: unadjusted p-values, Benjamini-Hochberg adjusted p-values, Bonferroni adjusted p-values, random forest importance levels, neural net prediction weights, and second-generation p-values. RESULTS/ANTICIPATED RESULTS: Results vary depending on whether a pre-specified significance level is used or the top 10 ranked values are taken. When all methods are given the same prior information of 10 dysregulated genes, the Benjamini-Hochberg adjusted p-values and the second-generation p-values generally outperform all other methods. We were not able to reproduce or validate the finding that random forest importance levels via a machine learning algorithm outperform classical methods. Almost uniformly, the machine learning methods did not yield improved accuracy statistics, and they depend heavily on the a priori chosen number of dysregulated genes. DISCUSSION/SIGNIFICANCE OF IMPACT: In this context, machine learning methods do not outperform standard methods. Because of this and their additional complexity, machine learning approaches would not be preferable. Of all the approaches, the second-generation p-value appears to offer significant benefit for the cost of a priori defining a region of trivially null effect sizes. The choice of an analysis method for large-scale translational data is critical to the success of any statistical investigation, and our simulations clearly highlight the various tradeoffs among the available methods.
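Two of the classical methods compared above, Bonferroni and Benjamini-Hochberg adjustment, are short enough to sketch directly. The code below is a plain-Python illustration with made-up p-values, not the study's simulation code; the function names are assumptions.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure).

    The p-value with ascending rank r is scaled by m / r, and a running
    minimum taken from the largest rank downward keeps the adjusted
    values monotone in the original ordering of the p-values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four hypothetical genes.
pvals = [0.01, 0.04, 0.03, 0.20]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Benjamini-Hochberg adjusted values are never larger than the Bonferroni ones for the same input, which reflects the weaker error criterion it controls (false discovery rate rather than family-wise error rate).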


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Bárbara Andrade Barbosa ◽  
Saskia D. van Asten ◽  
Ji Won Oh ◽  
Arantza Farina-Sarasqueta ◽  
Joanne Verheij ◽  
...  

Deconvolution of bulk gene expression profiles into their cellular components is pivotal to portraying a tissue's complex cellular make-up, such as that of the tumor microenvironment. However, the inherently variable nature of gene expression requires a comprehensive statistical model and reliable prior knowledge of individual cell types, which can be obtained from single-cell RNA sequencing. We introduce BLADE (Bayesian Log-normAl Deconvolution), a unified Bayesian framework to estimate both cellular composition and gene expression profiles for each cell type. Unlike previous comprehensive statistical approaches, BLADE can handle > 20 types of cells owing to efficient variational inference. In an intensive evaluation with > 700 simulated and real datasets, BLADE demonstrated enhanced robustness against gene expression variability and better completeness than conventional methods, in particular in reconstructing the gene expression profiles of each cell type. In summary, BLADE is a powerful tool to unravel heterogeneous cellular activity in complex biological systems from standard bulk gene expression data.


2020 ◽  
Author(s):  
Bárbara Andrade Barbosa ◽  
Saskia van Asten ◽  
Ji-won Oh ◽  
Arantza Fariña-Sarasqueta ◽  
Joanne Verheij ◽  
...  

High-resolution deconvolution of bulk gene expression profiles is pivotal to characterizing the complex cellular make-up of tissues, such as the tumor microenvironment. Single-cell RNA-seq provides reliable prior knowledge for deconvolution; however, a comprehensive statistical model is required to use it efficiently because of the inherently variable nature of gene expression. We introduce BLADE (Bayesian Log-normAl Deconvolution), a comprehensive probabilistic framework to estimate both the cellular make-up and the gene expression profiles of each cell type in each sample. Unlike previous comprehensive statistical approaches, BLADE can handle >20 cell types thanks to efficient variational inference. In an intensive evaluation using >700 datasets, BLADE showed enhanced robustness against gene expression variability and better completeness than conventional methods, in particular in reconstructing the gene expression profiles of each cell type. Overall, BLADE is a powerful tool to unravel heterogeneous cellular activity in complex biological systems based on standard bulk gene expression data.

