COCOA: Coordinate covariation analysis of epigenetic heterogeneity

Abstract: A key challenge in epigenetics is to determine the biological significance of epigenetic variation among individuals. Here, we present Coordinate Covariation Analysis (COCOA), a computational framework that uses covariation of epigenetic signals across individuals and a database of region sets to annotate epigenetic heterogeneity. COCOA is the first such tool for DNA methylation data and can also analyze any epigenetic signal with genomic coordinates. We demonstrate COCOA’s utility by analyzing DNA methylation, ATAC-seq, and multi-omic data in supervised and unsupervised analyses, showing that COCOA provides new understanding of inter-sample epigenetic variation. COCOA is available as a Bioconductor R package (http://bioconductor.org/packages/COCOA).
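The idea of scoring region sets by covariation can be illustrated with a toy sketch. It is not the COCOA package's API or its exact scoring method. It assumes a supervised setting in which each feature's signal is correlated with a sample-level target variable, and a region set is scored by the mean absolute correlation of the features it overlaps. All names (`pearson`, `region_set_score`, the toy coordinates and region sets) are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def overlaps(chrom, pos, regions):
    """True if the coordinate falls inside any (chrom, start, end) region."""
    return any(c == chrom and start <= pos < end for c, start, end in regions)


def region_set_score(features, signal, target, regions):
    """Mean absolute feature-target correlation over features inside the region set."""
    scores = [
        abs(pearson(values, target))
        for (chrom, pos), values in zip(features, signal)
        if overlaps(chrom, pos, regions)
    ]
    return mean(scores) if scores else 0.0


# Toy data: 4 features (e.g. CpGs) measured in 4 samples.
features = [("chr1", 100), ("chr1", 250), ("chr1", 900), ("chr2", 50)]
signal = [
    [0.1, 0.2, 0.8, 0.9],
    [0.2, 0.1, 0.9, 0.8],
    [0.5, 0.4, 0.5, 0.6],
    [0.9, 0.1, 0.8, 0.2],
]
target = [0, 0, 1, 1]  # hypothetical sample phenotype

region_sets = {
    "set_A": [("chr1", 50, 300)],
    "set_B": [("chr1", 800, 1000), ("chr2", 0, 100)],
}

scores = {
    name: region_set_score(features, signal, target, regions)
    for name, regions in region_sets.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

In this toy example `set_A` ranks first because both of its features track the target closely. A real analysis would also need a significance estimate, for instance from permuting sample labels.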



Genome Biology ◽

10.1186/s13059-020-02139-4 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

John T. Lawson ◽

Jason P. Smith ◽

Stefan Bekiranov ◽

Francine E. Garrett-Bakelman ◽

Nathan C. Sheffield

Keyword(s):

DNA Methylation ◽

Biological Significance ◽

Epigenetic Variation ◽

Methylation Data ◽

Computational Framework ◽

Link Type ◽

Epigenetic Signal ◽

Omic Data ◽

Covariation Analysis



funtooNorm: an R package for normalization of DNA methylation data when there are multiple cell or tissue types

Bioinformatics ◽

10.1093/bioinformatics/btv615 ◽

2015 ◽

Vol 32 (4) ◽

pp. 593-595 ◽

Cited By ~ 12

Author(s):

Kathleen Oros Klein ◽

Stepan Grinek ◽

Sasha Bernatsky ◽

Luigi Bouchard ◽

Antonio Ciampi ◽

...

Keyword(s):

DNA Methylation ◽

R Package ◽

Methylation Data ◽

Multiple Cell


Faculty Opinions recommendation of BioMethyl: an R package for biological interpretation of DNA methylation data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.735158095.793557064 ◽

2019 ◽

Author(s):

John Holloway

Keyword(s):

DNA Methylation ◽

R Package ◽

Methylation Data ◽

Biological Interpretation


Accurate ethnicity prediction from placental DNA methylation data

10.1101/618470 ◽

2019 ◽

Author(s):

Victor Yuan ◽

E Magda Price ◽

Giulia F Del Gobbo ◽

Sara Mostafavi ◽

Brian Cox ◽

...

Keyword(s):

DNA Methylation ◽

Population Stratification ◽

Association Studies ◽

R Package ◽

Continuous Variable ◽

Methylation Data ◽

Ancestry Informative Markers ◽

Genome Wide ◽

Highly Correlated ◽

Ethnicity Classification

Abstract Background: The influence of genetics on variation in DNA methylation (DNAme) is well documented. Yet confounding from population stratification is often unaccounted for in DNAme association studies. Existing approaches to address confounding by population stratification using DNAme data may not generalize to populations or tissues outside those in which they were developed. To aid future placental DNAme studies in assessing population stratification, we developed an ethnicity classifier, PlaNET (Placental DNAme Elastic Net Ethnicity Tool), using five cohorts with Infinium Human Methylation 450k BeadChip array (HM450k) data from placental samples; the classifier is also compatible with the newer EPIC platform. Results: Data from 509 placental samples were used to develop PlaNET and show that it accurately predicts (accuracy = 0.938, kappa = 0.823) major classes of self-reported ethnicity/race (African: n = 58, Asian: n = 53, Caucasian: n = 389), and produces ethnicity probabilities that are highly correlated with genetic ancestry inferred from genome-wide SNP arrays (>2.5 million SNPs) and ancestry informative markers (n = 50 SNPs). PlaNET’s ethnicity classification relies on 1860 HM450k microarray sites, and over half of these were linked to nearby genetic polymorphisms (n = 955). Our placental-optimized method outperforms existing approaches in assessing population stratification in placental samples from individuals of Asian, African, and Caucasian ethnicities. Conclusion: PlaNET provides an improved approach to address population stratification in placental DNAme association studies. The method can be applied to predict ethnicity as a discrete or continuous variable and will be especially useful when self-reported ethnicity information is missing and genotyping markers are unavailable. PlaNET is available as an R package at https://github.com/wvictor14/planet.
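The abstract reports both accuracy and Cohen's kappa. The following sketch shows how the two are computed from a confusion matrix and why kappa is typically lower when classes are imbalanced. The counts are made-up toy values, not the PlaNET results, and `accuracy_and_kappa` is a hypothetical helper rather than part of the package.

```python
def accuracy_and_kappa(confusion):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    total = sum(sum(row) for row in confusion)
    k = len(confusion)
    observed = sum(confusion[i][i] for i in range(k)) / total
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, given the marginal class frequencies.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical 3-class example with one dominant class (toy counts only).
toy_confusion = [
    [8, 1, 1],
    [1, 7, 2],
    [0, 2, 28],
]
accuracy, kappa = accuracy_and_kappa(toy_confusion)
print(f"accuracy={accuracy:.3f}, kappa={kappa:.3f}")
```

Because kappa subtracts the agreement that the class frequencies alone would produce, it gives a more conservative summary than raw accuracy when one class dominates the sample.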


EpiSmokEr: a robust classifier to determine smoking status from DNA methylation data

Epigenomics ◽

10.2217/epi-2019-0206 ◽

2019 ◽

Vol 11 (13) ◽

pp. 1469-1486 ◽

Cited By ~ 7

Author(s):

Sailalitha Bollepalli ◽

Tellervo Korhonen ◽

Jaakko Kaprio ◽

Simon Anders ◽

Miina Ollikainen

Keyword(s):

Machine Learning ◽

DNA Methylation ◽

Whole Blood ◽

Smoking Status ◽

Threshold Value ◽

R Package ◽

Prediction Performance ◽

Methylation Data ◽

Future Studies ◽

Never Smokers

Aim: Smoking strongly influences DNA methylation, with current and never smokers exhibiting different methylation profiles. Methods: To advance the practical applicability of the smoking-associated methylation signals, we used machine learning methodology to train a classifier for smoking status prediction. Results: We show the prediction performance of our classifier on three independent whole-blood datasets demonstrating its robustness and global applicability. Furthermore, we examine the reasons for biologically meaningful misclassifications through comprehensive phenotypic evaluation. Conclusion: The major contribution of our classifier is its global applicability without a need for users to determine a threshold value for each dataset to predict the smoking status. We provide an R package, EpiSmokEr (Epigenetic Smoking status Estimator), facilitating the use of our classifier to predict smoking status in future studies.


AlphaBeta: Computational inference of epimutation rates and spectra from high-throughput DNA methylation data in plants

10.1101/862243 ◽

2019 ◽

Cited By ~ 2

Author(s):

Yadollah Shahryary ◽

Aikaterini Symeonidi ◽

Rashmi R. Hazarika ◽

Johanna Denkena ◽

Talha Mubeen ◽

...

Keyword(s):

DNA Methylation ◽

High Throughput ◽

Cytosine Methylation ◽

R Package ◽

Plant Evolution ◽

Computational Method ◽

Age Dating ◽

Methylation Data ◽

Somatic Development ◽

Wide Range

Abstract Introduction: Heritable changes in cytosine methylation can arise stochastically in plant genomes independently of DNA sequence alterations. These so-called ‘spontaneous epimutations’ appear to be a byproduct of imperfect DNA methylation maintenance during mitotic or meiotic cell divisions. Accurate estimates of the rate and spectrum of these stochastic events are necessary to quantify how epimutational processes shape methylome diversity in the context of plant evolution, development and aging. Method: Here we describe AlphaBeta, a computational method for estimating epimutation rates and spectra from pedigree-based high-throughput DNA methylation data. The approach requires that the topology of the pedigree is known, which is typically the case in the experimental construction of mutation accumulation lines (MA-lines) in sexually or clonally reproducing species. However, this method also works for inferring somatic epimutation rates in long-lived perennials, such as trees, using leaf methylomes and coring data as input. In this case, we treat the tree branching structure as an intra-organismal phylogeny of somatic lineages and leverage information about the epimutational history of each branch. Results: To illustrate the method, we applied AlphaBeta to multi-generational data from selfing- and asexually-derived MA-lines in Arabidopsis and dandelion, as well as to intra-generational leaf methylome data of a single poplar tree. Our results show that the epimutation landscape in plants is deeply conserved across angiosperm species, and that heritable epimutations originate mainly during somatic development, rather than from DNA methylation reinforcement errors during sexual reproduction. Finally, we also provide the first evidence that DNA methylation data, in conjunction with statistical epimutation models, can be used as a molecular clock for age-dating trees. Conclusion: AlphaBeta facilitates unprecedented quantitative insights into epimutational processes in a wide range of plant systems. Software implementing our method is available as a Bioconductor R package at http://bioconductor.org/packages/3.10/bioc/html/AlphaBeta.html


BioMethyl: an R package for biological interpretation of DNA methylation data

Bioinformatics ◽

10.1093/bioinformatics/btz137 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3635-3641 ◽

Cited By ~ 8

Author(s):

Yue Wang ◽

Jennifer M Franks ◽

Michael L Whitfield ◽

Chao Cheng

Keyword(s):

Gene Expression ◽

DNA Methylation ◽

Expression Profiles ◽

R Package ◽

Biological Pathways ◽

The Cancer Genome Atlas ◽

Supplementary Information ◽

Cancer Type ◽

Methylation Data ◽

Biological Interpretation

Abstract Motivation: The accumulation of publicly available DNA methylation datasets has resulted in the need for tools to interpret the specific cellular phenotypes in bulk tissue data. Current approaches use either single differentially methylated CpG sites or differentially methylated regions that map to genes. However, these approaches may introduce biases in downstream analyses of biological interpretation because of the variability in gene length. There is a lack of approaches to interpret DNA methylation effectively. Therefore, we have developed computational models to provide biological interpretation of relevant gene sets using DNA methylation data in the context of The Cancer Genome Atlas. Results: We illustrate that Biological interpretation of DNA Methylation (BioMethyl) utilizes the complete DNA methylation data for a given cancer type to reflect corresponding gene expression profiles and performs pathway enrichment analyses, providing unique biological insight. Using breast cancer as an example, BioMethyl shows high consistency in the identification of enriched biological pathways from DNA methylation data compared to the results calculated from RNA sequencing data. We find that 12 out of 14 pathways identified by BioMethyl are shared with those identified using RNA-seq data, with a Jaccard score of 0.8 for estrogen receptor (ER) positive samples. For ER negative samples, three pathways are shared between the two enrichments, with a slightly lower similarity (Jaccard score = 0.6). Using BioMethyl, we can successfully identify those hidden biological pathways in DNA methylation data when a gene expression profile is lacking. Availability and implementation: The BioMethyl R package is freely available in the GitHub repository (https://github.com/yuewangpanda/BioMethyl). Supplementary information: Supplementary data are available at Bioinformatics online.
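The Jaccard score quoted above is the size of the intersection of two pathway sets divided by the size of their union. The sketch below works through one configuration consistent with the reported figures. The abstract does not state how many pathways the RNA-seq analysis returned, so the value of 13 is an assumption chosen because it reproduces 0.8, and the pathway names are placeholders.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# 14 pathways from the methylation-based analysis (placeholder names).
methylation_pathways = {f"pathway_{i}" for i in range(14)}

# Assumed RNA-seq result: the 12 shared pathways plus 1 pathway unique to RNA-seq.
rnaseq_pathways = {f"pathway_{i}" for i in range(12)} | {"rnaseq_only_pathway"}

shared = methylation_pathways & rnaseq_pathways
score = jaccard(methylation_pathways, rnaseq_pathways)
print(len(shared), round(score, 3))
```

Under this assumption the union holds 14 + 13 - 12 = 15 pathways, so the score is 12 / 15 = 0.8.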


Revisiting genetic artifacts on DNA methylation microarrays exposes novel biological implications

Genome Biology ◽

10.1186/s13059-021-02484-y ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Benjamin Planterose Jiménez ◽

Manfred Kayser ◽

Athina Vidaki

Keyword(s):

DNA Methylation ◽

Paradigm Shift ◽

Monozygotic Twins ◽

R Package ◽

Probe Design ◽

Methylation Data ◽

Health And Disease ◽

Specific Regulation ◽

Novel Strategy ◽

Methylation Quantitative Trait Loci

Abstract Background: Illumina DNA methylation microarrays enable epigenome-wide analysis and are widely used for the discovery of novel DNA methylation variation in health and disease. However, the microarrays’ probe design cannot fully consider the vast human genetic diversity, leading to genetic artifacts. Distinguishing genuine from artifactual genetic influence is of particular relevance in the study of DNA methylation heritability and methylation quantitative trait loci. But despite its importance, current strategies to account for genetic artifacts are lagging due to a limited mechanistic understanding of how such artifacts operate. Results: To address this, we develop and benchmark UMtools, an R package containing novel methods for the quantification and qualification of genetic artifacts based on fluorescence intensity signals. With our approach, we model and validate known SNPs/indels on a genetically controlled dataset of monozygotic twins, and we estimate minor allele frequency from DNA methylation data and empirically detect variants not included in dbSNP. Moreover, we identify examples where genetic artifacts interact with each other or with imprinting, X-inactivation, or tissue-specific regulation. Finally, we propose a novel strategy based on co-methylation that can discern between genetic artifacts and genuine genomic influence. Conclusions: We provide an atlas to navigate through the huge diversity of genetic artifacts encountered on DNA methylation microarrays. Overall, our study sets the ground for a paradigm shift in the study of the genetic component of epigenetic variation in DNA methylation microarrays.


AB0210 ACREULAR: AN R PACKAGE FOR THE CALCULATION AND VISUALISATION OF ACR/EULAR RELATED RHEUMATOID ARTHRITIS MEASURES

Annals of the Rheumatic Diseases ◽

10.1136/annrheumdis-2020-eular.2326 ◽

2020 ◽

Vol 79 (Suppl 1) ◽

pp. 1405.1-1406

Author(s):

F. Morton ◽

J. Nijjar ◽

C. Goodyear ◽

D. Porter

Keyword(s):

Rheumatoid Arthritis ◽

Functional Status ◽

Rheumatic Diseases ◽

Web Application ◽

R Package ◽

Diagnostic Classification ◽

Microsoft Excel ◽

Link Type ◽

Large Joint ◽

Programming Skills

Background: The American College of Rheumatology (ACR) and the European League Against Rheumatism (EULAR) individually and collaboratively have produced/recommended diagnostic classification, response and functional status criteria for a range of different rheumatic diseases. While there are a number of different resources available for performing these calculations individually, currently there are no tools available that we are aware of to easily calculate these values for whole patient cohorts. Objectives: To develop a new software tool, which will enable both data analysts and also researchers and clinicians without programming skills to calculate ACR/EULAR related measures for a number of different rheumatic diseases. Methods: Criteria that had been developed by ACR and/or EULAR and approved for the diagnostic classification, measurement of treatment response and functional status in patients with rheumatoid arthritis were identified. Methods were created using the R programming language to allow the calculation of these criteria, which were incorporated into an R package. Additionally, an R/Shiny web application was developed to enable the calculations to be performed via a web browser using data presented as CSV or Microsoft Excel files. Results: acreular is a freely available, open source R package (downloadable from https://github.com/fragla/acreular) that facilitates the calculation of ACR/EULAR related RA measures for whole patient cohorts. Measures, such as the ACR/EULAR (2010) RA classification criteria, can be determined using precalculated values for each component (small/large joint counts, duration in days, normal/abnormal acute-phase reactants, negative/low/high serology classification) or by providing “raw” data (small/large joint counts, onset/assessment dates, ESR/CRP and CCP/RF laboratory values). Other measures, including EULAR response and ACR20/50/70 response, can also be calculated by providing the required information. The accompanying web application is included as part of the R package but is also externally hosted at https://fragla.shinyapps.io/shiny-acreular. This enables researchers and clinicians without any programming skills to easily calculate these measures by uploading either a Microsoft Excel or CSV file containing their data. Furthermore, the web application allows the incorporation of additional study covariates, enabling the automatic calculation of multigroup comparative statistics and the visualisation of the data through a number of different plots, both of which can be downloaded. Figure 1. The Data tab following the upload of data. Criteria are calculated by selecting the appropriate checkbox. Figure 2. A density plot of DAS28 scores grouped by ACR/EULAR 2010 RA classification. Statistical analysis has been performed and shows a significant difference in DAS28 score between the two groups. Conclusion: The acreular R package facilitates the easy calculation of ACR/EULAR RA related disease measures for whole patient cohorts. Calculations can be performed either from within R or by using the accompanying web application, which also enables the graphical visualisation of data and the calculation of comparative statistics. We plan to further develop the package by adding additional RA related criteria and by adding ACR/EULAR related measures for other rheumatic disorders. Disclosure of Interests: Fraser Morton: None declared, Jagtar Nijjar Shareholder of: GlaxoSmithKline plc, Consultant of: Janssen Pharmaceuticals UK, Employee of: GlaxoSmithKline plc, Paid instructor for: Janssen Pharmaceuticals UK, Speakers bureau: Janssen Pharmaceuticals UK, AbbVie, Carl Goodyear: None declared, Duncan Porter: None declared
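As an illustration of scoring from precalculated components, here is a minimal Python sketch of the ACR/EULAR 2010 RA classification score. It is not the acreular package's code or API. It assumes the commonly published point scheme (joint involvement 0 to 5, serology 0 to 3, acute-phase reactants 0 to 1, symptom duration 0 to 1, with a total of 6 or more classified as definite RA) and a duration threshold of 6 weeks, taken here as 42 days. Function and parameter names are hypothetical, and the sketch is for illustration only, not clinical use.

```python
SEROLOGY_POINTS = {"negative": 0, "low": 2, "high": 3}


def joint_points(small_joints, large_joints):
    """Points for joint involvement from counts of involved small and large joints."""
    if small_joints + large_joints > 10 and small_joints >= 1:
        return 5  # more than 10 joints, at least one of them small
    if small_joints >= 4:
        return 3  # 4-10 small joints
    if small_joints >= 1:
        return 2  # 1-3 small joints
    if large_joints >= 2:
        return 1  # 2-10 large joints
    return 0      # at most one large joint


def classify_ra(small_joints, large_joints, serology, abnormal_apr, duration_days):
    """Return (score, meets_threshold) under the assumed 2010 point scheme."""
    score = (
        joint_points(small_joints, large_joints)
        + SEROLOGY_POINTS[serology]
        + (1 if abnormal_apr else 0)          # abnormal CRP or ESR
        + (1 if duration_days >= 42 else 0)   # symptoms for 6 weeks or longer
    )
    return score, score >= 6


# Two hypothetical patients.
patient_a = classify_ra(5, 0, "high", True, 60)
patient_b = classify_ra(0, 1, "negative", False, 10)
print(patient_a, patient_b)
```

Applying such a function row by row over a cohort table is the kind of whole-cohort calculation the abstract describes. The published criteria also carry eligibility conditions (for example, at least one joint with clinical synovitis not better explained by another disease) that this sketch does not model.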


MUREN: a robust and multi-reference approach of RNA-seq transcript normalization

BMC Bioinformatics ◽

10.1186/s12859-021-04288-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yance Feng ◽

Lei M. Li

Keyword(s):

Biological Significance ◽

Housekeeping Genes ◽

R Package ◽

Data Sets ◽

Statistical Regression ◽

RNA-Seq ◽

Least Trimmed Squares ◽

Standard Data ◽

Wide Range ◽

Multiple References

Abstract Background: Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable. Results: We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. The pairwise intermediates are then integrated based on a linear model that adjusts for the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt robust least trimmed squares regression in pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. The goodness of normalization emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by a single-cell data set of the cell cycle. MUREN is implemented as an R package. The code, under the GPL-3 license, is available on GitHub (github.com/hippo-yf/MUREN) and on conda (anaconda.org/hippo-yf/r-muren). Conclusions: MUREN performs RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations be used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
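To show why least trimmed squares (LTS) suits this setting, the sketch below estimates a single location shift between a sample and a reference from per-gene log-ratios. It is a simplified, location-only LTS, not MUREN's actual regression. In one dimension the LTS estimate is the mean of the contiguous window of `h` sorted values with the smallest sum of squared deviations. The data and the name `lts_location` are hypothetical.

```python
def lts_location(values, h):
    """Location-only least trimmed squares estimate.

    Returns the mean of the h-value subset with the smallest sum of squared
    deviations; in 1-D that subset is always contiguous in sorted order.
    """
    if not 1 <= h <= len(values):
        raise ValueError("h must be between 1 and len(values)")
    xs = sorted(values)
    best_ss, best_mean = None, None
    for i in range(len(xs) - h + 1):
        window = xs[i:i + h]
        m = sum(window) / h
        ss = sum((v - m) ** 2 for v in window)
        if best_ss is None or ss < best_ss:
            best_ss, best_mean = ss, m
    return best_mean


# Hypothetical per-gene log-ratios: most genes share a shift near 0.5,
# while three strongly up-regulated genes act as asymmetric outliers.
log_ratios = [0.4, 0.5, 0.6, 0.5, 0.45, 0.55, 3.0, 3.2, 2.9]

shift = lts_location(log_ratios, h=6)
naive_shift = sum(log_ratios) / len(log_ratios)
print(round(shift, 3), round(naive_shift, 3))
```

The trimmed estimate stays at the shift shared by the majority of genes, whereas the ordinary mean is pulled toward the differentially expressed ones. This matches the abstract's stated goal of preserving asymmetric differentiation rather than normalizing it away.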
