Complete deconvolution of DNA methylation signals from complex tissues: a geometric approach

Author(s):  
Weiwei Zhang ◽  
Hao Wu ◽  
Ziyi Li

Abstract Motivation It is common practice in epigenetics research to profile DNA methylation in tissue samples, which are usually mixtures of different cell types. To properly account for the mixture, estimating cell compositions has been recognized as an important first step. Many methods have been developed for quantifying cell compositions from DNA methylation data, but most have limited applicability because they require reference profiles or prior information that is often unavailable. Results We develop Tsisal, a novel complete deconvolution method that accurately estimates cell compositions from DNA methylation data without any prior knowledge of cell types or their proportions. Tsisal is a full pipeline that estimates the number of cell types and the cell compositions and identifies cell-type-specific CpG sites. It can also assign cell type labels when a full or partial reference panel is available. Extensive simulation studies and analyses of seven real data sets demonstrate the favorable performance of the proposed method compared with existing deconvolution methods that serve a similar purpose. Availability The proposed method Tsisal is implemented as part of the R/Bioconductor package TOAST at https://bioconductor.org/packages/TOAST. Contact [email protected] and [email protected]. Supplementary information Supplementary data are available at Bioinformatics online.

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Hanyu Zhang ◽  
Ruoyi Cai ◽  
James Dai ◽  
Wei Sun

Abstract We introduce a new computational method named EMeth to estimate cell type proportions using DNA methylation data. EMeth is a reference-based method that requires cell type-specific DNA methylation data from relevant cell types. EMeth improves on existing reference-based methods by detecting the CpGs whose DNA methylation is inconsistent with the deconvolution model and reducing their contributions to cell type decomposition. Another novel feature of EMeth is that it allows a cell type with known proportions but unknown reference and estimates its methylation. This is motivated by studies of methylation in tumor cells, where bulk tumor samples include tumor cells as well as other cell types such as infiltrating immune cells, and the tumor cell proportion can be estimated from copy number data. Using simulated data and in silico mixtures, we demonstrate that EMeth delivers more accurate estimates of cell type proportions than several other methods. Applications in cancer studies show that the proportions of T regulatory cells estimated from DNA methylation have the expected associations with mutation load and survival time, while the estimates from gene expression miss such associations.
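To make the reference-based setting concrete, the following is a minimal sketch and not EMeth itself. It assumes only two cell types and that the bulk beta-value at each CpG is the proportion-weighted average of the two reference profiles. The function name and the toy numbers are hypothetical.

```python
def estimate_proportion(bulk, ref_a, ref_b):
    """Least-squares estimate of the fraction p of cell type A, assuming
    bulk[i] = p * ref_a[i] + (1 - p) * ref_b[i] at every CpG i.
    (Illustrative two-cell-type model; not the EMeth algorithm.)"""
    num = sum((m - b) * (a - b) for m, a, b in zip(bulk, ref_a, ref_b))
    den = sum((a - b) ** 2 for a, b in zip(ref_a, ref_b))
    if den == 0:
        raise ValueError("reference profiles are identical; p is not identifiable")
    # Clip to the valid range for a proportion.
    return min(1.0, max(0.0, num / den))


# Toy reference profiles (beta-values) at four CpGs; values are made up.
ref_a = [0.9, 0.1, 0.8, 0.2]
ref_b = [0.1, 0.7, 0.3, 0.6]
true_p = 0.3
bulk = [true_p * a + (1 - true_p) * b for a, b in zip(ref_a, ref_b)]

p_hat = estimate_proportion(bulk, ref_a, ref_b)
```

With noise-free toy data the estimate recovers the mixing fraction used to build `bulk`. Handling CpGs that violate the mixture model, which is the point of EMeth, would require down-weighting individual terms in the two sums.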


2019 ◽  
Author(s):  
Lara Nonell ◽  
Juan R González

Abstract DNA methylation plays an important role in the development and progression of disease. Beta-values are the standard methylation measures. Different statistical methods have been proposed to assess differences in methylation between conditions. However, most of them do not completely account for the distribution of beta-values. The simplex distribution can accommodate beta-value data. We hypothesize that the simplex is a flexible distribution that is able to model methylation data. To test our hypothesis, we conducted several analyses using four real data sets obtained from microarray and sequencing technologies. Standard data distributions were studied and modelled in comparison to the simplex. In addition, simulations were conducted in different scenarios encompassing several distribution assumptions, regression models and sample sizes. Finally, we compared DNA methylation between females and males in order to benchmark the assessed methodologies under different scenarios. According to the results obtained from the simulations and real data analyses, DNA methylation data are concordant with the simplex distribution in many situations. Simplex regression models work well in data sets with small sample sizes. However, when sample size increases, other models such as beta regression or even linear regression can be employed to assess group comparisons and obtain unbiased results. Based on these results, we provide some practical recommendations for analyzing methylation data: 1) use data sets of at least 10 samples per studied condition for microarray data sets or 30 for NGS data sets, 2) apply a simplex or beta regression model for microarray data, 3) apply a linear model in any other case.
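As an illustration of the distribution discussed above, the sketch below evaluates the standard simplex density on (0, 1) with mean `mu` and dispersion `sigma`, and checks numerically that it integrates to about one. The function names and parameter values are chosen here for illustration and do not come from the paper.

```python
import math


def simplex_pdf(y, mu, sigma):
    """Density of the simplex distribution with mean mu and dispersion
    sigma**2, for y strictly between 0 and 1."""
    if not (0.0 < y < 1.0 and 0.0 < mu < 1.0 and sigma > 0.0):
        raise ValueError("need 0 < y < 1, 0 < mu < 1 and sigma > 0")
    # Unit deviance d(y; mu) of the simplex distribution.
    deviance = (y - mu) ** 2 / (y * (1 - y) * mu ** 2 * (1 - mu) ** 2)
    norm = math.sqrt(2 * math.pi * sigma ** 2 * (y * (1 - y)) ** 3)
    return math.exp(-deviance / (2 * sigma ** 2)) / norm


def integrate_pdf(mu, sigma, n=20000):
    """Midpoint-rule integral of the density over (0, 1)."""
    h = 1.0 / n
    return sum(simplex_pdf((i + 0.5) * h, mu, sigma) for i in range(n)) * h


# Hypothetical parameters: a CpG with mean beta-value 0.3.
total = integrate_pdf(mu=0.3, sigma=0.5)
```

Because the support is the open unit interval, beta-values of exactly 0 or 1 would have to be shifted slightly before such a density can be evaluated.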


Author(s):  
Yun Zhang ◽  
Jonavelle Cuerdo ◽  
Marc K Halushka ◽  
Matthew N McCall

Abstract Variable cellular composition of tissue samples represents a significant challenge for the interpretation of genomic profiling studies. Substantial effort has been devoted to modeling and adjusting for compositional differences when estimating differential expression between sample types. However, relatively little attention has been given to the effect of tissue composition on co-expression estimates. In this study, we illustrate the effect of variable cell-type composition on correlation-based network estimation and provide a mathematical decomposition of the tissue-level correlation. We show that a class of deconvolution methods developed to separate tumor and stromal signatures can be applied to two-component cell-type mixtures. In simulated and real data, we identify conditions in which a deconvolution approach would be beneficial. Our results suggest that uncorrelated cell-type-specific markers are ideally suited to deconvolute both the expression and co-expression patterns of an individual cell type. We provide a Shiny application for users to interactively explore the effect of cell-type composition on correlation-based co-expression estimation for any cell types of interest.


Author(s):  
Richard Meier ◽  
Emily Nissen ◽  
Devin C. Koestler

Abstract Statistical methods that allow for cell type specific DNA methylation (DNAm) analyses based on bulk-tissue methylation data have great potential to improve our understanding of human disease and have created unprecedented opportunities for new insights using the wealth of publicly available bulk-tissue methylation data. These methodologies involve incorporating interaction terms formed between the phenotypes/exposures of interest and proportions of the cell types underlying the bulk-tissue sample used for DNAm profiling. Despite growing interest in such “interaction-based” methods, there has been no comprehensive assessment of how variability in the cellular landscape across study samples affects their performance. To answer this question, we used numerous publicly available whole-blood DNAm data sets along with extensive simulation studies and evaluated the performance of interaction-based approaches in detecting cell-specific methylation effects. Our results show that low cell proportion variability results in large estimation error and low statistical power for detecting cell-specific effects of DNAm. Further, we found that many studies targeting methylation profiling in whole blood may be at risk of being underpowered due to low variability in the cellular landscape across study samples. Finally, we discuss guidelines for researchers seeking to conduct studies utilizing interaction-based approaches to help ensure that their studies are adequately powered.
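The interaction terms described above can be sketched as follows. This is a minimal illustration that assumes one common form of such models, in which each sample contributes its cell-type proportions plus the products of those proportions with the phenotype. The function name and the toy samples are hypothetical, and the exact parameterization used by any specific method may differ.

```python
from statistics import pstdev


def design_row(proportions, phenotype):
    """Regressors for one sample: the cell-type proportions followed by
    the proportion-by-phenotype interaction terms (assumed model form)."""
    return list(proportions) + [p * phenotype for p in proportions]


# Toy data: (proportions of three cell types, binary phenotype) per sample.
samples = [
    ([0.6, 0.3, 0.1], 1.0),
    ([0.5, 0.3, 0.2], 0.0),
    ([0.7, 0.2, 0.1], 1.0),
]

design = [design_row(p, x) for p, x in samples]

# Per-cell-type spread of the proportions across samples; a small spread
# leaves little information for estimating that cell type's effect.
spread = [pstdev(col) for col in zip(*(p for p, _ in samples))]
```

A cell type whose entry in `spread` is close to zero has an interaction column that is nearly proportional to the phenotype itself, which is one way to see why low proportion variability reduces power.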


2021 ◽  
Author(s):  
Wei Zhang ◽  
Hanwen Xu ◽  
Rong Qiao ◽  
Bixi Zhong ◽  
Xianglin Zhang ◽  
...  

Quantifying cell proportions, especially for rare cell types, is of great value for tracking signals related to certain phenotypes or diseases. Although some methods have been proposed to infer cell proportions from multi-component bulk data, they are substantially less effective for estimating rare cell type proportions because they are highly sensitive to feature outliers and collinearity. Here we propose a new deconvolution algorithm named ARIC to estimate cell type proportions from bulk gene expression or DNA methylation data. ARIC utilizes a novel two-step marker selection strategy, including component-wise condition number-based feature collinearity elimination and adaptive removal of outlier markers. This strategy can systematically obtain effective markers that ensure robust and precise proportion prediction based on weighted ν-support vector regression. We show that ARIC can estimate fractions accurately in both DNA methylation and gene expression data from different experiments. Taken together, ARIC is a promising tool for solving the deconvolution problem of bulk data where rare components are of vital importance.


2019 ◽  
Vol 35 (14) ◽  
pp. i154-i163 ◽  
Author(s):  
Lisa Handl ◽  
Adrin Jalali ◽  
Michael Scherer ◽  
Ralf Eggeling ◽  
Nico Pfeifer

Abstract Motivation Predictive models are a powerful tool for solving complex problems in computational biology. They are typically designed to predict or classify data coming from the same unknown distribution as the training data. In many real-world settings, however, uncontrolled biological or technical factors can lead to a distribution mismatch between datasets acquired at different times, causing model performance to deteriorate on new data. A common additional obstacle in computational biology is scarce data with many more features than samples. To address these problems, we propose a method for unsupervised domain adaptation that is based on a weighted elastic net. The key idea of our approach is to compare dependencies between inputs in training and test data and to increase the cost of differently behaving features in the elastic net regularization term. In doing so, we encourage the model to assign a higher importance to features that are robust and behave similarly across domains. Results We evaluate our method both on simulated data with varying degrees of distribution mismatch and on real data, considering the problem of age prediction based on DNA methylation data across multiple tissues. Compared with a non-adaptive standard model, our approach substantially reduces errors on samples with a mismatched distribution. On real data, we achieve far lower errors on cerebellum samples, a tissue which is not part of the training data and poorly predicted by standard models. Our results demonstrate that unsupervised domain adaptation is possible for applications in computational biology, even with many more features than samples. Availability and implementation Source code is available at https://github.com/PfeiferLabTue/wenda. Supplementary information Supplementary data are available at Bioinformatics online.
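The key idea above, raising the regularization cost of features that behave differently across domains, can be sketched with a per-feature weighted elastic net penalty. This is an illustration under assumptions: the penalty form, the function name and the weights below are chosen here for clarity and are not taken from the wenda source code.

```python
def weighted_elastic_net_penalty(coefs, weights, lam, alpha):
    """Penalty lam * (alpha * sum_j w_j |b_j|
                      + (1 - alpha) / 2 * sum_j w_j b_j**2).
    Larger w_j makes feature j more expensive to use (assumed form)."""
    if len(coefs) != len(weights):
        raise ValueError("coefs and weights must have the same length")
    l1 = sum(w * abs(b) for b, w in zip(coefs, weights))
    l2 = sum(w * b * b for b, w in zip(coefs, weights))
    return lam * (alpha * l1 + 0.5 * (1 - alpha) * l2)


coefs = [0.5, -1.0, 2.0]

# All features treated alike (standard elastic net).
uniform = weighted_elastic_net_penalty(coefs, [1.0, 1.0, 1.0], lam=0.1, alpha=0.5)

# Hypothetical case: the third feature behaves differently in the test
# domain, so its weight is raised and the same coefficients cost more.
adapted = weighted_elastic_net_penalty(coefs, [1.0, 1.0, 4.0], lam=0.1, alpha=0.5)
```

With uniform weights the expression reduces to the usual elastic net penalty, so the weights are the only place where information about the test domain enters.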


2019 ◽  
Author(s):  
Han Jing ◽  
Shijie C. Zheng ◽  
Charles E. Breeze ◽  
Stephan Beck ◽  
Andrew E. Teschendorff

Abstract For cost and logistical reasons, Epigenome-Wide-Association Studies (EWAS) are normally performed in complex tissues, resulting in average DNA methylation profiles over potentially many different cell-types, which can obscure important cell-type specific associations with disease. Identifying the specific cell-types that are altered is a key hurdle for elucidating causal pathways to disease, and consequently statistical algorithms have recently emerged that aim to address this challenge. Comparisons between these algorithms are of great interest, yet here we find that the main comparative study so far was substantially biased and potentially misleading. By using this study as an example, we highlight some of the key issues that need to be considered to ensure that future comparisons between methods are more objective.


2018 ◽  
Author(s):  
Yun Zhang ◽  
Jonavelle Cuerdo ◽  
Marc K Halushka ◽  
Matthew N McCall

Variable cellular composition of tissue samples represents a significant challenge for the interpretation of genomic profiling studies. Substantial effort has been devoted to modeling and adjusting for compositional differences when estimating differential expression between sample types. However, relatively little attention has been given to the effect of tissue composition on co-expression estimates. In this study, we illustrate the effect of variable cell type composition on correlation-based network estimation and provide a mathematical decomposition of the tissue-level correlation. We show that a class of deconvolution methods developed to separate tumor and stromal signatures can be applied to two-component cell type mixtures. In simulated and real data, we identify conditions in which a deconvolution approach would be beneficial. Our results suggest that uncorrelated cell-type-specific markers are ideally suited to deconvolute both the expression and co-expression patterns of an individual cell type. Finally, we provide a Shiny application for users to interactively explore the effect of cell type composition on correlation-based co-expression estimation for any cell types of interest.


2020 ◽  
Author(s):  
Guanjue Xiang ◽  
Belinda M. Giardine ◽  
Shaun Mahony ◽  
Yu Zhang ◽  
Ross C Hardison

Abstract Summary Epigenetic modifications reflect key aspects of transcriptional regulation, and many epigenomic data sets have been generated under many biological contexts to provide insights into regulatory processes. However, the technical noise in epigenomic data sets and the many dimensions (features) examined make it challenging to effectively extract biologically meaningful inferences from these data sets. We developed a package that reduces noise while normalizing the epigenomic data by a novel normalization method, followed by integrative dimensional reduction by learning and assigning epigenetic states. This package, called S3V2-IDEAS, can be used to identify epigenetic states for multiple features, or identify signal intensity states and a master peak list across different cell types for a single feature. We illustrate the outputs and performance of S3V2-IDEAS using 137 epigenomics data sets from the VISION project that provides ValIdated Systematic IntegratiON of epigenomic data in hematopoiesis. Availability and implementation The S3V2-IDEAS pipeline is freely available as open source software released under an MIT license at: https://github.com/guanjue/[email protected], [email protected] Supplementary information S3V2-IDEAS-bioinfo-supplementary-materials.pdf


2014 ◽  
Vol 13s4 ◽  
pp. CIN.S13980 ◽  
Author(s):  
Eugene Andrés Houseman ◽  
Tan A. Ince

Historically, breast cancer classification has relied on prognostic subtypes. Thus, unlike hematopoietic cancers, breast tumor classification lacks phylogenetic rationale. The feasibility of phylogenetic classification of breast tumors has recently been demonstrated based on estrogen receptor (ER), androgen receptor (AR), vitamin D receptor (VDR) and Keratin 5 expression. Four hormonal states (HR0–3) comprising 11 cellular subtypes of breast cells have been proposed. This classification scheme has been shown to have relevance to clinical prognosis. We examine the implications of such phylogenetic classification on DNA methylation of both breast tumors and normal breast tissues by applying recently developed deconvolution algorithms to three DNA methylation data sets archived on Gene Expression Omnibus. We propose that breast tumors arising from a particular cell-of-origin essentially magnify the epigenetic state of their original cell type. We demonstrate that DNA methylation of tumors manifests patterns consistent with cell-specific epigenetic states, that these states correspond roughly to previously posited normal breast cell types, and that estimates of proportions of the underlying cell types are predictive of tumor phenotypes. Taken together, these findings suggest that the epigenetics of breast tumors is ultimately based on the underlying phylogeny of normal breast tissue.

