Complete deconvolution of DNA methylation signals from complex tissues: a geometric approach

Bioinformatics ◽

10.1093/bioinformatics/btaa930 ◽

2020 ◽

Author(s):

Weiwei Zhang ◽

Hao Wu ◽

Ziyi Li

Keyword(s):

DNA Methylation ◽

Geometric Approach ◽

Real Data ◽

Cell Types ◽

Supplementary Information ◽

Data Sets ◽

Methylation Data ◽

Cell Type ◽

Tissue Samples ◽

Different Cell Types

Abstract Motivation: It is common practice in epigenetics research to profile DNA methylation in tissue samples, which are usually mixtures of different cell types. To properly account for the mixture, estimating cell compositions has been recognized as an important first step. Many methods have been developed for quantifying cell compositions from DNA methylation data, but most have limited applicability because they depend on reference or prior information that is often lacking. Results: We develop Tsisal, a novel complete deconvolution method that accurately estimates cell compositions from DNA methylation data without any prior knowledge of cell types or their proportions. Tsisal is a full pipeline that estimates the number of cell types, estimates cell compositions, and identifies cell-type-specific CpG sites. It can also assign cell type labels when a full or partial reference panel is available. Extensive simulation studies and analyses of seven real data sets demonstrate the favorable performance of our proposed method compared with existing deconvolution methods serving a similar purpose. Availability: The proposed method Tsisal is implemented as part of the R/Bioconductor package TOAST at https://bioconductor.org/packages/TOAST. Contact: [email protected] and [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
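
The mixture model that complete deconvolution methods of this kind usually start from can be written compactly. The block below is only a sketch: the symbols Y, M, P, m, n and K are notation assumed here for illustration and are not taken from the abstract, and it states the generic linear mixing model, not the Tsisal algorithm itself.

```latex
% Y: observed methylation (m CpG sites x n samples)
% M: unknown cell-type-specific profiles (m CpG sites x K cell types)
% P: unknown cell proportions (K cell types x n samples)
\[
  Y \approx M P, \qquad
  0 \le M_{jk} \le 1, \qquad
  P_{ki} \ge 0, \qquad
  \sum_{k=1}^{K} P_{ki} = 1 \quad \text{for every sample } i .
\]
```

Under these constraints each sample (a column of Y) is a convex combination of the K columns of M, so the samples lie, up to noise, inside a simplex whose vertices are the pure cell-type profiles. That picture is one way to read the "geometric approach" of the title; in the complete (reference-free) setting K, M and P all have to be estimated from Y alone.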

EMeth: An EM algorithm for cell type decomposition based on DNA methylation data

Scientific Reports ◽

10.1038/s41598-021-84864-9 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Hanyu Zhang ◽

Ruoyi Cai ◽

James Dai ◽

Wei Sun

Keyword(s):

DNA Methylation ◽

Tumor Cells ◽

T Regulatory Cells ◽

Simulated Data ◽

Cell Types ◽

Computational Method ◽

Methylation Data ◽

Cell Type ◽

A Cell ◽

Type Decomposition

Abstract We introduce a new computational method named EMeth to estimate cell type proportions using DNA methylation data. EMeth is a reference-based method that requires cell type-specific DNA methylation data from relevant cell types. EMeth improves on the existing reference-based methods by detecting the CpGs whose DNA methylation is inconsistent with the deconvolution model and reducing their contributions to cell type decomposition. Another novel feature of EMeth is that it allows a cell type with known proportions but unknown reference and estimates its methylation. This is motivated by the study of methylation in tumor cells: bulk tumor samples include tumor cells as well as other cell types such as infiltrating immune cells, and the tumor cell proportion can be estimated from copy number data. We demonstrate that EMeth delivers more accurate estimates of cell type proportions than several other methods using simulated data and in silico mixtures. Applications in cancer studies show that the proportions of T regulatory cells estimated by DNA methylation have expected associations with mutation load and survival time, while the estimates from gene expression miss such associations.
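
To make the idea of reference-based deconvolution with down-weighted CpGs concrete, here is a minimal sketch for the simplest case of two cell types. It is not the EM algorithm of EMeth: the function name `estimate_proportion`, the toy reference profiles, and the hand-picked weights are all assumptions made for illustration, and the model is a plain weighted least-squares fit.

```python
def estimate_proportion(bulk, ref_a, ref_b, weights=None):
    """Weighted least-squares estimate of the fraction of cell type A.

    Assumed model for each CpG i (illustrative only):
        bulk[i] = p * ref_a[i] + (1 - p) * ref_b[i] + noise
    A weight of 0 removes a CpG from the fit; smaller weights reduce its influence.
    """
    if weights is None:
        weights = [1.0] * len(bulk)
    num = 0.0
    den = 0.0
    for y, a, b, w in zip(bulk, ref_a, ref_b, weights):
        diff = a - b                      # how strongly this CpG separates A from B
        num += w * (y - b) * diff
        den += w * diff * diff
    if den == 0.0:
        raise ValueError("reference profiles do not differ; proportion is not identifiable")
    return min(1.0, max(0.0, num / den))  # clip to the valid range [0, 1]


# Toy reference methylation levels (hypothetical values) for five CpGs.
ref_a = [0.9, 0.1, 0.8, 0.2, 0.5]
ref_b = [0.1, 0.7, 0.3, 0.6, 0.4]
true_p = 0.3
bulk = [true_p * a + (1 - true_p) * b for a, b in zip(ref_a, ref_b)]
bulk[0] = 0.95  # one CpG that violates the mixing model

naive = estimate_proportion(bulk, ref_a, ref_b)
robust = estimate_proportion(bulk, ref_a, ref_b, weights=[0.0, 1.0, 1.0, 1.0, 1.0])
print(naive, robust)
```

In this toy setup the single inconsistent CpG pulls the unweighted estimate well away from the true proportion, while giving it zero weight recovers the true value. A method such as EMeth would have to learn which CpGs to down-weight from the data; here the weights are supplied by hand.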

Are methylation beta-values simplex distributed?

10.1101/753459 ◽

2019 ◽

Author(s):

Lara Nonell ◽

Juan R González

Keyword(s):

DNA Methylation ◽

Sample Size ◽

Microarray Data ◽

Regression Models ◽

Real Data ◽

Small Sample ◽

Beta Regression ◽

Data Sets ◽

Methylation Data ◽

Simplex Distribution

Abstract DNA methylation plays an important role in the development and progression of disease. Beta-values are the standard methylation measures. Different statistical methods have been proposed to assess differences in methylation between conditions. However, most of them do not completely account for the distribution of beta-values. The simplex distribution can accommodate beta-value data. We hypothesize that the simplex is a flexible distribution that is able to model methylation data. To test our hypothesis, we conducted several analyses using four real data sets obtained from microarray and sequencing technologies. Standard data distributions were studied and modelled in comparison to the simplex. In addition, simulations were conducted in different scenarios encompassing several distribution assumptions, regression models and sample sizes. Finally, we compared DNA methylation between females and males in order to benchmark the assessed methodologies under different scenarios. According to the results obtained from the simulations and real data analyses, DNA methylation data are concordant with the simplex distribution in many situations. Simplex regression models work well in data sets with small sample sizes. However, when sample size increases, other models such as beta regression or even linear regression can be employed to assess group comparisons and obtain unbiased results. Based on these results, we provide some practical recommendations for analyzing methylation data: 1) use data sets of at least 10 samples per studied condition for microarray data sets or 30 for NGS data sets, 2) apply a simplex or beta regression model for microarray data, 3) apply a linear model in any other case.
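
For readers unfamiliar with the simplex distribution, the sketch below evaluates its density on (0, 1) and checks numerically that it integrates to about one. It uses the standard form from the dispersion-model literature with mean `mu` and dispersion `sigma2`; whether the authors use exactly this parameterization is an assumption, and the helper names `simplex_pdf` and `integrate_pdf` are made up for this example.

```python
import math


def simplex_pdf(y, mu, sigma2):
    """Density of the simplex distribution with mean mu and dispersion sigma2."""
    if not 0.0 < y < 1.0:
        return 0.0
    # Unit deviance: squared distance from the mean, inflated near 0 and 1.
    deviance = (y - mu) ** 2 / (y * (1.0 - y) * mu ** 2 * (1.0 - mu) ** 2)
    norm = math.sqrt(2.0 * math.pi * sigma2 * (y * (1.0 - y)) ** 3)
    return math.exp(-deviance / (2.0 * sigma2)) / norm


def integrate_pdf(mu, sigma2, n_bins=20000):
    """Midpoint-rule integral of the density over (0, 1)."""
    width = 1.0 / n_bins
    return sum(simplex_pdf((i + 0.5) * width, mu, sigma2) for i in range(n_bins)) * width


total = integrate_pdf(mu=0.3, sigma2=1.0)
print(round(total, 3))
```

Because the support is the open unit interval, the density can be applied directly to beta-values without transforming them, which is what makes it a candidate model for methylation data.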

The effect of tissue composition on gene co-expression

Briefings in Bioinformatics ◽

10.1093/bib/bbz135 ◽

2019 ◽

Cited By ~ 6

Author(s):

Yun Zhang ◽

Jonavelle Cuerdo ◽

Marc K Halushka ◽

Matthew N McCall

Keyword(s):

Expression Patterns ◽

Real Data ◽

Cell Types ◽

Tissue Level ◽

Tissue Composition ◽

Cell Type ◽

Tissue Samples ◽

Cell Type Composition ◽

Type Composition ◽

Component Cell

Abstract Variable cellular composition of tissue samples represents a significant challenge for the interpretation of genomic profiling studies. Substantial effort has been devoted to modeling and adjusting for compositional differences when estimating differential expression between sample types. However, relatively little attention has been given to the effect of tissue composition on co-expression estimates. In this study, we illustrate the effect of variable cell-type composition on correlation-based network estimation and provide a mathematical decomposition of the tissue-level correlation. We show that a class of deconvolution methods developed to separate tumor and stromal signatures can be applied to two component cell-type mixtures. In simulated and real data, we identify conditions in which a deconvolution approach would be beneficial. Our results suggest that uncorrelated cell-type-specific markers are ideally suited to deconvolute both the expression and co-expression patterns of an individual cell type. We provide a Shiny application for users to interactively explore the effect of cell-type composition on correlation-based co-expression estimation for any cell types of interest.
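
The effect described in this abstract can be reproduced with a small simulation. The sketch below is an illustration under assumed settings (two cell types, two genes, Gaussian expression values, uniformly distributed mixing proportions), not the authors' analysis: within each cell type the two genes are drawn independently, yet their tissue-level correlation becomes large once the cell-type proportion varies across samples.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def tissue_correlation(n_samples, vary_composition, seed=0):
    """Correlation of two genes measured in bulk mixtures of cell types A and B."""
    rng = random.Random(seed)
    gene1, gene2 = [], []
    for _ in range(n_samples):
        # Fraction of cell type A in this sample.
        p = rng.uniform(0.1, 0.9) if vary_composition else 0.5
        # Within each cell type the two genes are independent draws.
        g1_a, g2_a = rng.gauss(10.0, 1.0), rng.gauss(10.0, 1.0)  # A: both genes high
        g1_b, g2_b = rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)    # B: both genes low
        gene1.append(p * g1_a + (1.0 - p) * g1_b)
        gene2.append(p * g2_a + (1.0 - p) * g2_b)
    return pearson(gene1, gene2)


print("fixed composition:  ", tissue_correlation(2000, vary_composition=False))
print("varying composition:", tissue_correlation(2000, vary_composition=True))
```

With a fixed composition the bulk correlation stays near zero, whereas with varying composition both genes track the proportion of cell type A and appear strongly co-expressed, even though no co-expression exists inside either cell type.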

Low variability in the underlying cellular landscape adversely affects the performance of interaction-based approaches for conducting cell-specific analyses of DNA methylation in bulk samples

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2021-0004 ◽

2021 ◽

Vol 0 (0) ◽

Author(s):

Richard Meier ◽

Emily Nissen ◽

Devin C. Koestler

Keyword(s):

DNA Methylation ◽

Whole Blood ◽

Statistical Power ◽

Tissue Sample ◽

Estimation Error ◽

Cell Types ◽

Data Sets ◽

Methylation Data ◽

Interaction Terms ◽

Bulk Tissue

Abstract Statistical methods that allow for cell type specific DNA methylation (DNAm) analyses based on bulk-tissue methylation data have great potential to improve our understanding of human disease and have created unprecedented opportunities for new insights using the wealth of publicly available bulk-tissue methylation data. These methodologies involve incorporating interaction terms formed between the phenotypes/exposures of interest and proportions of the cell types underlying the bulk-tissue sample used for DNAm profiling. Despite growing interest in such “interaction-based” methods, there has been no comprehensive assessment of how variability in the cellular landscape across study samples affects their performance. To answer this question, we used numerous publicly available whole-blood DNAm data sets along with extensive simulation studies and evaluated the performance of interaction-based approaches in detecting cell-specific methylation effects. Our results show that low cell proportion variability results in large estimation error and low statistical power for detecting cell-specific effects of DNAm. Further, we found that many studies targeting methylation profiling in whole blood may be at risk of being underpowered due to low variability in the cellular landscape across study samples. Finally, we discuss guidelines for researchers seeking to conduct studies utilizing interaction-based approaches to help ensure that their studies are adequately powered.
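
A generic form of such an interaction-based model is sketched below. The notation (y_i, p_ik, x_i, mu_k, beta_k) is assumed here for illustration and does not come from the abstract; individual methods differ in their details.

```latex
% y_i    : bulk methylation of sample i at one CpG
% p_{ik} : proportion of cell type k in sample i
% x_i    : phenotype or exposure of sample i
% \mu_k  : baseline methylation of cell type k
% \beta_k: cell-type-specific effect of the phenotype
\[
  y_i \;=\; \sum_{k=1}^{K} p_{ik}\,\bigl(\mu_k + \beta_k x_i\bigr) + \varepsilon_i
      \;=\; \sum_{k=1}^{K} \mu_k\, p_{ik} \;+\; \sum_{k=1}^{K} \beta_k\,(p_{ik}\, x_i) \;+\; \varepsilon_i .
\]
```

The cell-type-specific effects are the coefficients of the interaction regressors p_ik x_i. If the proportions hardly vary across samples, so that p_ik is close to a constant c_k, then every interaction regressor is close to c_k x_i and the K regressors become nearly collinear. The individual effects are then poorly separated, which is consistent with the large estimation error and low power reported in the abstract.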

ARIC: Accurate and robust inference of cell type proportions from bulk gene expression or DNA methylation data

10.1101/2021.04.02.438149 ◽

2021 ◽

Author(s):

Wei Zhang ◽

Hanwen Xu ◽

Rong Qiao ◽

Bixi Zhong ◽

Xianglin Zhang ◽

...

Keyword(s):

Gene Expression ◽

DNA Methylation ◽

Cell Types ◽

Selection Strategy ◽

Support Vector ◽

Methylation Data ◽

Cell Type ◽

Promising Tool ◽

Marker Selection ◽

Bulk Data

Quantifying cell proportions, especially for rare cell types, is of great value for tracking signals related to certain phenotypes or diseases. Although some methods have been proposed to infer cell proportions from multi-component bulk data, they are substantially less effective for estimating rare cell type proportions because they are highly sensitive to feature outliers and collinearity. Here we propose a new deconvolution algorithm named ARIC to estimate cell type proportions from bulk gene expression or DNA methylation data. ARIC utilizes a novel two-step marker selection strategy, consisting of component-wise condition number-based elimination of feature collinearity and adaptive removal of outlier markers. This strategy can systematically obtain effective markers that ensure a robust and precise weighted ν-support vector regression-based proportion prediction. We show that ARIC can estimate fractions accurately in both DNA methylation and gene expression data from different experiments. Taken together, ARIC is a promising tool for solving the deconvolution problem of bulk data in which rare components are of vital importance.

Weighted elastic net for unsupervised domain adaptation with application to age prediction from DNA methylation data

Bioinformatics ◽

10.1093/bioinformatics/btz338 ◽

2019 ◽

Vol 35 (14) ◽

pp. i154-i163 ◽

Cited By ~ 1

Author(s):

Lisa Handl ◽

Adrin Jalali ◽

Michael Scherer ◽

Ralf Eggeling ◽

Nico Pfeifer

Keyword(s):

DNA Methylation ◽

Computational Biology ◽

Domain Adaptation ◽

Real Data ◽

Elastic Net ◽

Training Data ◽

Supplementary Information ◽

Methylation Data ◽

Unsupervised Domain Adaptation ◽

Age Prediction

Abstract Motivation: Predictive models are a powerful tool for solving complex problems in computational biology. They are typically designed to predict or classify data coming from the same unknown distribution as the training data. In many real-world settings, however, uncontrolled biological or technical factors can lead to a distribution mismatch between datasets acquired at different times, causing model performance to deteriorate on new data. A common additional obstacle in computational biology is scarce data with many more features than samples. To address these problems, we propose a method for unsupervised domain adaptation that is based on a weighted elastic net. The key idea of our approach is to compare dependencies between inputs in training and test data and to increase the cost of differently behaving features in the elastic net regularization term. In doing so, we encourage the model to assign a higher importance to features that are robust and behave similarly across domains. Results: We evaluate our method both on simulated data with varying degrees of distribution mismatch and on real data, considering the problem of age prediction based on DNA methylation data across multiple tissues. Compared with a non-adaptive standard model, our approach substantially reduces errors on samples with a mismatched distribution. On real data, we achieve far lower errors on cerebellum samples, a tissue which is not part of the training data and poorly predicted by standard models. Our results demonstrate that unsupervised domain adaptation is possible for applications in computational biology, even with many more features than samples. Availability and implementation: Source code is available at https://github.com/PfeiferLabTue/wenda. Supplementary information: Supplementary data are available at Bioinformatics online.
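
The idea of raising the regularization cost for features that behave differently across domains can be written as an elastic net with feature-specific penalty weights. The objective below is a generic sketch; the symbols (beta, lambda, alpha, w_j) and the exact scaling are assumptions for illustration, and the way the weights are derived from the training and test data is not specified here.

```latex
% X: n x d training inputs, y: training targets (e.g. age)
% w_j >= 0: penalty weight of feature j, larger for features that behave
%           differently between training and test domains
\[
  \hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  \;+\; \lambda \sum_{j=1}^{d} w_j
  \Bigl( \alpha\,\lvert \beta_j \rvert + \tfrac{1-\alpha}{2}\,\beta_j^{2} \Bigr).
\]
```

Setting all w_j = 1 gives the ordinary elastic net. Increasing w_j for a feature makes a nonzero coefficient on it more expensive, so the fitted model leans on features with small weights, that is, those judged to behave similarly in both domains.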

Calling differential DNA methylation at cell-type resolution: an objective status-quo

10.1101/822940 ◽

2019 ◽

Cited By ~ 1

Author(s):

Han Jing ◽

Shijie C. Zheng ◽

Charles E. Breeze ◽

Stephan Beck ◽

Andrew E. Teschendorff

Keyword(s):

DNA Methylation ◽

Association Studies ◽

Cell Types ◽

Status Quo ◽

Specific Cell ◽

Cell Type ◽

Causal Pathways ◽

Key Issues ◽

Cell Type Specific ◽

Different Cell Types

Abstract Due to cost and logistical reasons, Epigenome-Wide-Association Studies (EWAS) are normally performed in complex tissues, resulting in average DNA methylation profiles over potentially many different cell-types, which can obscure important cell-type specific associations with disease. Identifying the specific cell-types that are altered is a key hurdle for elucidating causal pathways to disease, and consequently statistical algorithms have recently emerged that aim to address this challenge. Comparisons between these algorithms are of great interest, yet here we find that the main comparative study so far was substantially biased and potentially misleading. By using this study as an example, we highlight some of the key issues that need to be considered to ensure that future assessments between methods are more objective.

The effect of tissue composition on gene co-expression

10.1101/492223 ◽

2018 ◽

Author(s):

Yun Zhang ◽

Jonavelle Cuerdo ◽

Marc K Halushka ◽

Matthew N McCall

Keyword(s):

Expression Patterns ◽

Real Data ◽

Cell Types ◽

Tissue Level ◽

Tissue Composition ◽

Cell Type ◽

Tissue Samples ◽

Cell Type Composition ◽

Type Composition ◽

Component Cell

Variable cellular composition of tissue samples represents a significant challenge for the interpretation of genomic profiling studies. Substantial effort has been devoted to modeling and adjusting for compositional differences when estimating differential expression between sample types. However, relatively little attention has been given to the effect of tissue composition on co-expression estimates. In this study, we illustrate the effect of variable cell type composition on correlation-based network estimation and provide a mathematical decomposition of the tissue-level correlation. We show that a class of deconvolution methods developed to separate tumor and stromal signatures can be applied to two component cell type mixtures. In simulated and real data, we identify conditions in which a deconvolution approach would be beneficial. Our results suggest that uncorrelated cell type specific markers are ideally suited to deconvolute both the expression and co expression patterns of an individual cell type. Finally, we provide a Shiny application for users to interactively explore the effect of cell type composition on correlation-based co-expression estimation for any cell types of interest.

S3V2-IDEAS: a package for normalizing, denoising and integrating epigenomic datasets across different cell types

10.1101/2020.09.08.287920 ◽

2020 ◽

Author(s):

Guanjue Xiang ◽

Belinda M. Giardine ◽

Shaun Mahony ◽

Yu Zhang ◽

Ross C Hardison

Keyword(s):

Cell Types ◽

Supplementary Information ◽

Data Sets ◽

Multiple Features ◽

Regulatory Processes ◽

Single Feature ◽

And Performance ◽

The Many ◽

Key Aspects ◽

Different Cell Types

Abstract Summary: Epigenetic modifications reflect key aspects of transcriptional regulation, and many epigenomic data sets have been generated under many biological contexts to provide insights into regulatory processes. However, the technical noise in epigenomic data sets and the many dimensions (features) examined make it challenging to effectively extract biologically meaningful inferences from these data sets. We developed a package that reduces noise while normalizing the epigenomic data by a novel normalization method, followed by integrative dimensional reduction by learning and assigning epigenetic states. This package, called S3V2-IDEAS, can be used to identify epigenetic states for multiple features, or identify signal intensity states and a master peak list across different cell types for a single feature. We illustrate the outputs and performance of S3V2-IDEAS using 137 epigenomics data sets from the VISION project that provides ValIdated Systematic IntegratiON of epigenomic data in hematopoiesis. Availability and implementation: The S3V2-IDEAS pipeline is freely available as open source software released under an MIT license at: https://github.com/guanjue/[email protected], [email protected] Supplementary information: S3V2-IDEAS-bioinfo-supplementary-materials.pdf

Normal Cell-Type Epigenetics and Breast Cancer Classification: A Case Study of Cell Mixture–Adjusted Analysis of DNA Methylation Data from Tumors

Cancer Informatics ◽

10.4137/cin.s13980 ◽

2014 ◽

Vol 13s4 ◽

pp. CIN.S13980 ◽

Cited By ~ 8

Author(s):

Eugene Andrés Houseman ◽

Tan A. Ince

Keyword(s):

Breast Cancer ◽

DNA Methylation ◽

Breast Tumors ◽

Cell Types ◽

Normal Breast ◽

Cancer Classification ◽

Methylation Data ◽

Cell Type ◽

Phylogenetic Classification ◽

Breast Cancer Classification

Historically, breast cancer classification has relied on prognostic subtypes. Thus, unlike hematopoietic cancers, breast tumor classification lacks phylogenetic rationale. The feasibility of phylogenetic classification of breast tumors has recently been demonstrated based on estrogen receptor (ER), androgen receptor (AR), vitamin D receptor (VDR) and Keratin 5 expression. Four hormonal states (HR0–3) comprising 11 cellular subtypes of breast cells have been proposed. This classification scheme has been shown to have relevance to clinical prognosis. We examine the implications of such phylogenetic classification on DNA methylation of both breast tumors and normal breast tissues by applying recently developed deconvolution algorithms to three DNA methylation data sets archived on Gene Expression Omnibus. We propose that breast tumors arising from a particular cell-of-origin essentially magnify the epigenetic state of their original cell type. We demonstrate that DNA methylation of tumors manifests patterns consistent with cell-specific epigenetic states, that these states correspond roughly to previously posited normal breast cell types, and that estimates of proportions of the underlying cell types are predictive of tumor phenotypes. Taken together, these findings suggest that the epigenetics of breast tumors is ultimately based on the underlying phylogeny of normal breast tissue.
