Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models

Abstract Background: Histone modification constitutes a basic mechanism for the genetic regulation of gene expression. In the early 2000s, a powerful technique emerged that couples chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq). This technique provides a direct survey of the DNA regions associated with these modifications. To realize the full potential of this technique, increasingly sophisticated statistical algorithms have been developed or adapted to analyze the massive amount of data it generates. Many of these algorithms were built around natural assumptions, such as using the Poisson distribution to model the noise in the count data. In this work we start from these natural assumptions and show that it is possible to improve upon them. Results: Our comparisons on seven reference datasets of histone modifications (H3K36me3 & H3K4me3) suggest that natural assumptions are not always realistic under application conditions. We show that the unconstrained multiple changepoint detection model with alternative noise assumptions and supervised learning of the penalty parameter reduces the over-dispersion exhibited by count data. These models, implemented in the R package CROCS (https://github.com/aLiehrmann/CROCS), detect peaks more accurately than algorithms that rely on natural assumptions. Conclusion: The segmentation models we propose can benefit researchers in the field of epigenetics by providing new high-quality peak prediction tracks for H3K36me3 and H3K4me3 histone modifications.


Increased Peak Detection Accuracy in Over-dispersed ChIP-seq Data With Supervised Segmentation Models

10.21203/rs.3.rs-152657/v1 ◽

2021 ◽

Author(s):

Arnaud Liehrmann ◽

Guillem Rigaill ◽

Toby Dylan Hocking

Keyword(s):

Histone Modifications ◽

Count Data ◽

High Throughput Sequencing ◽

Genetic Regulation ◽

Regulation Of Gene Expression ◽

Basic Mechanism ◽

Detection Accuracy ◽

Full Potential ◽

Detection Model ◽

Over Dispersion

Abstract Background: Histone modification constitutes a basic mechanism for the genetic regulation of gene expression. In the early 2000s, a powerful technique emerged that couples chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq). This technique provides a direct survey of the DNA regions associated with these modifications. To realize the full potential of this technique, increasingly sophisticated statistical algorithms have been developed or adapted to analyze the massive amount of data it generates. Many of these algorithms were built around natural assumptions, such as using the Poisson distribution to model the noise in the count data. In this work we start from these natural assumptions and show that it is possible to improve upon them. Results: Our comparisons on seven reference datasets of histone modifications (H3K36me3 & H3K4me3) suggest that natural assumptions are not always realistic under application conditions. We show that the unconstrained multiple changepoint detection model with alternative noise assumptions and supervised learning of the penalty parameter reduces the over-dispersion exhibited by count data. This model detects peaks more accurately than algorithms that rely on natural assumptions. Conclusion: The segmentation model we propose can benefit researchers in the field of epigenetics by providing new high-quality peak prediction tracks for H3K36me3 and H3K4me3 histone modifications.
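To make the idea of a penalized multiple changepoint model concrete, the following is a minimal sketch of optimal partitioning with a Poisson loss, i.e. the "natural assumption" the abstract starts from. It is not the CROCS implementation: the function names are hypothetical, the penalty is passed in by hand rather than learned from labels, and the quadratic-time dynamic program is chosen for readability only.

```python
import math

def poisson_cost(cum, i, j):
    """Poisson negative log-likelihood (up to a constant) of counts[i:j] with one shared mean."""
    total = cum[j] - cum[i]
    if total == 0:
        return 0.0
    mean = total / (j - i)
    # n * mean - total * log(mean), where n * mean equals total
    return total - total * math.log(mean)

def segment(counts, penalty):
    """Return changepoint indices minimising total cost plus `penalty` per changepoint."""
    n = len(counts)
    cum = [0]
    for c in counts:
        cum.append(cum[-1] + c)
    best = [math.inf] * (n + 1)   # best[j]: optimal penalised cost of counts[:j]
    best[0] = -penalty            # so that the first segment is not penalised
    last = [0] * (n + 1)          # last[j]: start index of the final segment of counts[:j]
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + poisson_cost(cum, i, j) + penalty
            if cost < best[j]:
                best[j] = cost
                last[j] = i
    # Backtrack from the end to recover the segment starts.
    changes = []
    j = n
    while j > 0:
        i = last[j]
        if i > 0:
            changes.append(i)
        j = i
    return sorted(changes)

# Toy coverage profile: background, an enriched region, background again.
coverage = [1, 0, 2, 1, 1, 20, 22, 19, 21, 20, 1, 2, 0, 1, 1]
changepoints = segment(coverage, penalty=5.0)
```

A larger penalty yields fewer changepoints, which is why the choice of penalty matters; the abstract describes learning it in a supervised way, whereas here it is simply an input.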


Sophisticated Conversations between Chromatin and Chromatin Remodelers, and Dissonances in Cancer

International Journal of Molecular Sciences ◽

10.3390/ijms22115578 ◽

2021 ◽

Vol 22 (11) ◽

pp. 5578

Author(s):

Cedric R. Clapier

Keyword(s):

Gene Expression ◽

Regulation Of Gene Expression ◽

Structural Diversity ◽

Nucleosome Positioning ◽

Basic Mechanism ◽

Progressive Increase ◽

Chromatin Organization ◽

Dna Translocation ◽

Dynamic Regulation ◽

Cooperative Action

The establishment and maintenance of genome packaging into chromatin contribute to defining specific cellular identity and function. Dynamic regulation of chromatin organization and nucleosome positioning is critical to all DNA transactions, in particular the regulation of gene expression, and involves the cooperative action of sequence-specific DNA-binding factors, histone-modifying enzymes, and remodelers. Remodelers are molecular machines that generate various chromatin landscapes, adjust nucleosome positioning, and alter DNA accessibility by using ATP binding and hydrolysis to perform DNA translocation, which is highly regulated through sophisticated structural and functional conversations with nucleosomes. In this review, I first present the functional and structural diversity of remodelers, while emphasizing the basic mechanism of DNA translocation, the common regulatory aspects, and the hand-in-hand progressive increase in complexity of the regulatory conversations between remodelers and nucleosomes that accompanies the increase in challenges of remodeling processes. Next, I examine how, through nucleosome positioning, remodelers guide the regulation of gene expression. Finally, I explore various aspects of how alterations/mutations in remodelers introduce dissonance into the conversations between remodelers and nucleosomes, modify chromatin organization, and contribute to oncogenesis.


kataegis: an R package for identification and visualization of the genomic localized hypermutation regions using high-throughput sequencing

BMC Genomics ◽

10.1186/s12864-021-07696-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Xue Lin ◽

Yingying Hua ◽

Shuanglin Gu ◽

Li Lv ◽

Xingyu Li ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Somatic Mutations ◽

R Package ◽

Frequency Of Occurrence ◽

Link Type ◽

Genomic Landscape ◽

One Step ◽

Flanking Regions

Abstract Background: Localized hypermutation regions have been found in cancer genomes and are reported to be related to prognosis. This localized hypermutation differs markedly from ordinary somatic mutation in frequency of occurrence and genomic density. It resembles a “violent storm” of mutations, which is what the Greek word “kataegis” means. Results: There is a need for a lightweight, simple-to-use toolkit to identify and visualize localized hypermutation regions in the genome, so we developed the R package “kataegis”. The package identifies genomic hypermutation regions in three steps: i) read in the variation files in standard formats; ii) calculate the inter-mutational distances; iii) identify the hypermutation regions with appropriate parameters. One further step visualizes the nucleotide contents and spectra of both the foci and flanking regions, and the genomic landscape of these regions. Conclusions: The kataegis package is available on Bioconductor/GitHub (https://github.com/flosalbizziae/kataegis) and provides a lightweight, simple-to-use toolkit for quickly identifying and visualizing genomic hypermutation regions.
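Steps ii) and iii) of the abstract above can be illustrated with a short sketch. It is written in Python for illustration and does not reproduce the kataegis R package: the function names are hypothetical, and the thresholds `max_gap` and `min_mutations` are illustrative assumptions rather than the package's parameters.

```python
def intermutation_distances(positions):
    """Distances between consecutive mutation positions on a single chromosome."""
    pos = sorted(positions)
    return [b - a for a, b in zip(pos, pos[1:])]

def hypermutation_regions(positions, max_gap=1000, min_mutations=6):
    """Return (start, end, n_mutations) for each run of closely spaced mutations.

    A run grows while the distance to the next mutation is at most `max_gap`;
    it is reported only if it contains at least `min_mutations` mutations.
    """
    pos = sorted(positions)
    regions = []
    run = pos[:1]
    for p in pos[1:]:
        if p - run[-1] <= max_gap:
            run.append(p)
        else:
            if run and len(run) >= min_mutations:
                regions.append((run[0], run[-1], len(run)))
            run = [p]
    if run and len(run) >= min_mutations:
        regions.append((run[0], run[-1], len(run)))
    return regions

# Toy example: seven tightly clustered mutations flanked by two isolated ones.
mutations = [100, 5000, 5200, 5300, 5350, 5800, 6100, 6500, 90000]
gaps = intermutation_distances(mutations)
foci = hypermutation_regions(mutations)
```

In practice the positions would come from a variant file (step i), grouped by chromosome and sample before the distances are computed.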


Taming the Wild West of Molecular Tools Application in Aquatic Research and Biomonitoring

Biodiversity Information Science and Standards ◽

10.3897/biss.3.37215 ◽

2019 ◽

Vol 3 ◽

Cited By ~ 1

Author(s):

Alexander Weigand ◽

Agnès Bouchez ◽

Pieter Boets ◽

Kat Bruce ◽

Fedor Ciampor ◽

...

Keyword(s):

Working Group ◽

High Throughput Sequencing ◽

Group Structure ◽

Time Budget ◽

Good Practice ◽

Dna Barcode ◽

Environmental Parameters ◽

Full Potential ◽

Molecular Tools ◽

Wild West

Modern high-throughput sequencing technologies are becoming a game changer in many fields of aquatic research and biomonitoring. To unfold their full potential, however, the independent development of approaches has to be streamlined. This discussion must be fuelled by stakeholders and practitioners, and scientific results must be collaboratively filtered to identify the most promising avenues. Furthermore, aspects such as time, budget, skills and the application context have to be considered, and good practice strategies finally communicated to target audiences. Since 2016, the EU COST Action DNAqua-Net has been taming the wild west of molecular tools application in aquatic research and biomonitoring. After nucleating available knowledge by forming a highly international and transdisciplinary network of scientists, stakeholders, practitioners and enterprises, fields of high methodological diversity were identified. Relevant aspects are currently being ground-truthed, thereby reducing the plethora of pipelines, parameters and protocols to a subset of good practices or standardisations. To effectively bridge the science-application interface, the very same network is used to disseminate results (Leese et al. 2018). The internal working group structure of DNAqua-Net is used to provide an overview of existing methodological fields of diversity in DNA-based aquatic biomonitoring:
WG1 - DNA Barcode References: Different marker systems are targeted for the same organism group. Even when the same molecular marker is investigated, different primer pairs are frequently applied for DNA metabarcoding. Both aspects challenge the further development of high-quality and complete DNA barcode reference libraries (Weigand et al. 2019).
WG2 - Biotic Indices & Metrics: Index systems are developed from molecular data in various ways: from the estimation of species' biomass (as a proxy for abundance) from sequence reads, to the correlation of presence/absence data of molecular operational taxonomic units (MOTUs) with environmental parameters (Pawlowski et al. 2018).
WG3 - Field & Lab Protocols: Using environmental DNA (eDNA) metabarcoding as an example, diverse sampling techniques based on varying water volumes, different filter systems and collection devices, as well as a multitude of laboratory protocols for PCR, replication and sequencing, are considered.
WG4 - Data Analysis & Storage: During the process of MOTU identification, varying threshold values and conceptually different pipelines are used, potentially impacting the final list of MOTUs or species retrieved. Furthermore, routine storage concepts for big biodiversity data are still in development and some sample types (e.g. eDNA) have no sophisticated metadata descriptions.
WG5 - Implementation Strategy & Legal Issues: The working group picks up collaboratively filtered good practice strategies and generates room for discussions at the science-policy interface (Hering et al. 2018). The CEN working group WG28 "DNA methods" has been initiated and the development of standardisations is fostered.


An R Package for Divergence Analysis of Omics Data

10.1101/720391 ◽

2019 ◽

Author(s):

Wikum Dinalankara ◽

Qian Ke ◽

Donald Geman ◽

Luigi Marchionni

Keyword(s):

High Throughput Sequencing ◽

R Package ◽

The Cancer Genome Atlas ◽

High Dimensional ◽

Omics Data ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Ternary Code ◽

Cancer Genome Atlas ◽

Level Analysis

Abstract: Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample-level analysis, and is applicable to many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with sample high-throughput sequencing data from The Cancer Genome Atlas.
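The univariate ternary coding described above can be sketched in a few lines. This is an illustration in Python, not the divergence package's API: the function names are hypothetical, and the baseline interval used here (the range of the baseline values, optionally trimmed) is an assumed stand-in for however the package estimates its baseline.

```python
import math

def baseline_interval(values, trim=0.0):
    """Interval spanned by the baseline values of one feature.

    `trim` (assumed to be below 0.5) drops that fraction of values from each tail.
    """
    s = sorted(values)
    k = int(math.floor(trim * len(s)))
    return s[k], s[len(s) - 1 - k]

def ternary_code(profile, baseline):
    """Map each feature of one sample to -1, 0 or +1 relative to its baseline interval."""
    codes = {}
    for feature, value in profile.items():
        low, high = baseline_interval(baseline[feature])
        if value < low:
            codes[feature] = -1   # below the baseline range
        elif value > high:
            codes[feature] = 1    # above the baseline range
        else:
            codes[feature] = 0    # within the baseline range
    return codes

# Toy example: two features measured in three baseline samples and one test sample.
baseline = {"g1": [1.0, 2.0, 3.0], "g2": [10.0, 12.0, 11.0]}
sample = {"g1": 5.0, "g2": 10.5}
codes = ternary_code(sample, baseline)
```

A binary code follows by collapsing -1 and +1 into a single "divergent" value.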


On Using Local Ancestry to Characterize the Genetic Architecture of Human Phenotypes: Genetic Regulation of Gene Expression in Multiethnic or Admixed Populations as a Model

10.1101/483107 ◽

2018 ◽

Cited By ~ 1

Author(s):

Yizhen Zhong ◽

Minoli Perera ◽

Eric R. Gamazon

Keyword(s):

Gene Expression ◽

Complex Traits ◽

Genetic Architecture ◽

Genetic Regulation ◽

Regulation Of Gene Expression ◽

Type I ◽

Eqtl Mapping ◽

Entire Genome ◽

Local Ancestry ◽

Heritability Estimation

Abstract Background: Understanding the nature of the genetic regulation of gene expression promises to advance our understanding of the genetic basis of disease. However, the methodological impact of the use of local ancestry on high-dimensional omics analyses in admixed populations, including most prominently expression quantitative trait loci (eQTL) mapping and trait heritability estimation, remains critically underexplored. Results: Here we develop a statistical framework that characterizes the relationships among the determinants of the genetic architecture of an important class of molecular traits. We estimate the trait variance explained by ancestry using local admixture relatedness between individuals. Using National Institute of General Medical Sciences (NIGMS) and Genotype-Tissue Expression (GTEx) datasets, we show that use of local ancestry can substantially improve eQTL mapping and heritability estimation and characterize the sparse versus polygenic component of gene expression in admixed and multiethnic populations, respectively. Using simulations of diverse genetic architectures to estimate trait heritability and the level of confounding, we show improved accuracy given individual-level data and evaluate a summary-statistics-based approach. Furthermore, we provide a computationally efficient approach to local ancestry analysis in eQTL mapping while increasing control of type I and type II error over traditional approaches. Conclusion: Our study has important methodological implications for genetic analysis of omics traits across a range of genomic contexts, from a single variant to a prioritized region to the entire genome. Our findings highlight the importance of using local ancestry to better characterize the heritability of complex traits and to more accurately map genetic associations.


hypeR: An R Package for Geneset Enrichment Workflows

10.1101/656637 ◽

2019 ◽

Cited By ~ 1

Author(s):

Anthony Federico ◽

Stefano Monti

Keyword(s):

High Throughput Sequencing ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Wide Audience ◽

Popular Method ◽

Link Type ◽

High Throughput Sequencing Data ◽

One Stop ◽

Recent Version

Abstract Summary: Geneset enrichment is a popular method for annotating high-throughput sequencing data. Existing tools fall short in providing the flexibility to tackle the varied challenges researchers face in such analyses, particularly when analyzing many signatures across multiple experiments. We present a comprehensive R package for geneset enrichment workflows that offers multiple enrichment, visualization, and sharing methods in addition to novel features such as hierarchical geneset analysis and built-in markdown reporting. hypeR is a one-stop solution to performing geneset enrichment for a wide audience and range of use cases. Availability and implementation: The most recent version of the package is available at https://github.com/montilab/hypeR. Supplementary information: Comprehensive documentation and tutorials are available at https://montilab.github.io/hypeR-docs.
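For readers unfamiliar with geneset enrichment, one common way to score the overlap between a signature and a geneset is a hypergeometric (over-representation) test. The sketch below shows that test in plain Python; it is an assumption chosen for illustration, the abstract does not state which tests hypeR implements, and the function name and inputs are hypothetical.

```python
from math import comb

def hypergeom_pvalue(signature, geneset, background_size):
    """Probability of an overlap at least as large as observed, under random sampling.

    Assumes both `signature` and `geneset` are subsets of a background of
    `background_size` genes.
    """
    sig, gs = set(signature), set(geneset)
    k = len(sig & gs)            # observed overlap
    n, K, N = len(sig), len(gs), background_size
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Toy example: 3 signature genes, a 4-gene geneset, a background of 10 genes.
p = hypergeom_pvalue({"A", "B", "C"}, {"A", "B", "D", "E"}, background_size=10)
```

When many genesets are tested against one signature, the resulting p-values would normally be adjusted for multiple testing before interpretation.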


Faculty Opinions recommendation of Dynamic genetic regulation of gene expression during cellular differentiation.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.736057096.793563104 ◽

2019 ◽

Author(s):

Robert Copeland

Keyword(s):

Gene Expression ◽

Cellular Differentiation ◽

Genetic Regulation ◽

Regulation Of Gene Expression


How (Epi)Genetic Regulation of the LIM-Domain Protein FHL2 Impacts Multifactorial Disease

Cells ◽

10.3390/cells10102611 ◽

2021 ◽

Vol 10 (10) ◽

pp. 2611

Author(s):

Jayron J. Habibe ◽

Maria P. Clemente-Olivo ◽

Carlie J. de Vries

Keyword(s):

Gene Expression ◽

Genetic Variation ◽

Current Knowledge ◽

Genetic Regulation ◽

Regulation Of Gene Expression ◽

Lim Domain ◽

Multifactorial Diseases ◽

Lim Domains ◽

Age Related

Susceptibility to complex pathological conditions such as obesity, type 2 diabetes and cardiovascular disease is highly variable among individuals and arises from specific changes in gene expression in combination with external factors. The regulation of gene expression is determined by genetic variation (SNPs) and epigenetic marks that are influenced by environmental factors. Aging is a major risk factor for many multifactorial diseases and is increasingly associated with changes in DNA methylation, leading to differences in gene expression. Four and a half LIM domains 2 (FHL2) is a key regulator of intracellular signal transduction pathways and the FHL2 gene is consistently found as one of the top hyper-methylated genes upon aging. Remarkably, FHL2 expression increases with methylation. This was demonstrated in relevant metabolic tissues: white adipose tissue, pancreatic β-cells, and skeletal muscle. In this review, we provide an overview of the current knowledge on regulation of FHL2 by genetic variation and epigenetic DNA modification, and the potential consequences for age-related complex multifactorial diseases.


debar, a sequence-by-sequence denoiser for COI-5P DNA barcode data

10.1101/2021.01.04.425285 ◽

2021 ◽

Author(s):

Cameron M. Nugent ◽

Tyler A. Elliott ◽

Sujeevan Ratnasingham ◽

Paul D. N. Hebert ◽

Sarah J. Adamowicz

Keyword(s):

High Throughput Sequencing ◽

Dna Barcode ◽

R Package ◽

Error Rates ◽

Real World Data ◽

Species Discovery ◽

Consensus Sequences ◽

In Silico Studies ◽

Coi Sequences

Abstract: DNA barcoding and metabarcoding are now widely used to advance species discovery and biodiversity assessments. High-throughput sequencing (HTS) has expanded the volume and scope of these analyses, but elevated error rates introduce noise into sequence records that can inflate estimates of biodiversity. Denoising of barcode and metabarcode data, that is, the separation of biological signal from instrument (technical) noise, currently employs abundance-based methods that do not capitalize on the highly conserved structure of the cytochrome c oxidase subunit I (COI) region employed as the animal barcode. This manuscript introduces debar, an R package that utilizes a profile hidden Markov model to denoise indel errors in COI sequences introduced by instrument error. In silico studies demonstrated that debar recognized 95% of artificially introduced indels in COI sequences. When applied to real-world data, debar reduced indel errors in circular consensus sequences obtained with the Sequel platform by 75%, and those generated on the Ion Torrent S5 by 94%. The false correction rate was less than 0.1%, indicating that debar is receptive to the majority of true COI variation in the animal kingdom. In conclusion, the debar package improves DNA barcode and metabarcode workflows by aiding the generation of more accurate sequences, which in turn aids the characterization of species diversity.
