Crunch: Integrated processing and modeling of ChIP-seq data in terms of regulatory motifs

Although it has become routine for experimental groups to apply ChIP-seq technology to quantitatively characterize the genome-wide binding of transcription factors (TFs), computational analysis procedures remain far from standardized, making it difficult to meaningfully compare ChIP-seq results across experiments. In addition, while genome-wide binding patterns must ultimately be determined by local constellations of binding sites in the DNA, current analysis is typically limited to a standard search for enriched motifs in ChIP-seq peaks. Here we present Crunch, a completely automated computational method that performs all ChIP-seq analysis from quality control through read mapping and peak detecting, and integrates comprehensive modeling of the ChIP signal in terms of known and novel binding motifs, quantifying the contribution of each motif, and annotating which combinations of motifs explain each binding peak. Applying Crunch to 128 ChIP-seq datasets from the ENCODE project we find that TFs naturally separate into ‘solitary TFs’, for which a single motif explains the ChIP-peaks, and ‘co-binding TFs’ for which multiple motifs co-occur within peaks. Moreover, for most datasets the motifs that Crunch identified de novo outperform known motifs and both the set of co-binding motifs and the top motif of solitary TFs are consistent across experiments and cell lines. Crunch is implemented as a web server (crunch.unibas.ch), enabling standardized analysis of any collection of ChIP-seq datasets by simply uploading raw sequencing data. Results are provided both in a graphical interface and as downloadable files.


Transposable element expression in tumors is associated with immune infiltration and increased antigenicity

Nature Communications ◽

10.1038/s41467-019-13035-2 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 16

Author(s):

Yu Kong ◽

Christopher M. Rose ◽

Ashley A. Cass ◽

Alexander G. Williams ◽

Martine Darwish ◽

...

Keyword(s):

Dna Methylation ◽

De Novo ◽

Computational Method ◽

The Cancer Genome Atlas ◽

Potential Consequence ◽

Sequencing Data ◽

Antiviral Responses ◽

Genome Wide ◽

Cancer Genome Atlas ◽

Demethylation Agent

Profound global loss of DNA methylation is a hallmark of many cancers. One potential consequence of this is the reactivation of transposable elements (TEs) which could stimulate the immune system via cell-intrinsic antiviral responses. Here, we develop REdiscoverTE, a computational method for quantifying genome-wide TE expression in RNA sequencing data. Using The Cancer Genome Atlas database, we observe increased expression of over 400 TE subfamilies, of which 262 appear to result from a proximal loss of DNA methylation. The most recurrent TEs are among the evolutionarily youngest in the genome, predominantly expressed from intergenic loci, and associated with antiviral or DNA damage responses. Treatment of glioblastoma cells with a demethylation agent results in both increased TE expression and de novo presentation of TE-derived peptides on MHC class I molecules. Therapeutic reactivation of tumor-specific TEs may synergize with immunotherapy by inducing inflammation and the display of potentially immunogenic neoantigens.


Transposable Element Expression in Tumors is Associated with Immune Infiltration and Increased Antigenicity

10.1101/388215 ◽

2018 ◽

Cited By ~ 1

Author(s):

Yu Kong ◽

Chris Rose ◽

Ashley A. Cass ◽

Martine Darwish ◽

Steve Lianoglou ◽

...

Keyword(s):

Dna Methylation ◽

Transposable Element ◽

De Novo ◽

Computational Method ◽

The Cancer Genome Atlas ◽

Sequencing Data ◽

Genome Wide ◽

Dna Damage Responses ◽

Cancer Genome Atlas ◽

Demethylation Agent

Profound loss of DNA methylation is a well-recognized hallmark of cancer. Given its role in silencing transposable elements (TEs), we hypothesized that extensive TE expression occurs in tumors with highly demethylated DNA. We developed REdiscoverTE, a computational method for quantifying genome-wide TE expression in RNA sequencing data. Using The Cancer Genome Atlas database, we observed increased expression of over 400 TE subfamilies, of which 262 appeared to result from a proximal loss of DNA methylation. The most recurrent TEs were among the evolutionarily youngest in the genome, predominantly expressed from intergenic loci, and associated with antiviral or DNA damage responses. Treatment of glioblastoma cells with a demethylation agent resulted in both increased TE expression and de novo presentation of TE-derived peptides on MHC class I molecules. Therapeutic reactivation of tumor-specific TEs may synergize with immunotherapy by inducing both inflammation and the display of potentially immunogenic neoantigens.

One Sentence Summary: Transposable element expression in tumors is associated with increased immune response and provides tumor-associated antigens.


Next generation sequencing allows deeper analysis and understanding of genomes and transcriptomes including aspects to fertility

Reproduction Fertility and Development ◽

10.1071/rd10247 ◽

2011 ◽

Vol 23 (1) ◽

pp. 75 ◽

Cited By ~ 7

Author(s):

Thomas Werner

Keyword(s):

Next Generation Sequencing ◽

Transcriptional Control ◽

Target Genes ◽

De Novo ◽

Alternative Promoters ◽

Next Generation ◽

Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Generation Sequencing

Reproduction and fertility are controlled by specific events naturally linked to oocytes, testes and early embryonal tissues. A significant part of these events involves gene expression, especially transcriptional control and alternative transcription (alternative promoters and alternative splicing). While methods to analyse such events for carefully predetermined target genes are well established, until recently no methodology existed to extend such analyses into a genome-wide de novo discovery process. With the arrival of next generation sequencing (NGS) it becomes possible to attempt genome-wide discovery in genomic sequences as well as whole transcriptomes at a single nucleotide level. This not only allows identification of the primary changes (e.g. alternative transcripts) but also helps to elucidate the regulatory context that leads to the induction of transcriptional changes. This review discusses the basics of the new technological and scientific concepts arising from NGS, prominent differences from microarray-based approaches and several aspects of its application to reproduction and fertility research. These concepts will then be illustrated in an application example of NGS sequencing data analysis involving postimplantation endometrium tissue from cows.


Learning sequence patterns of AGO-sRNA affinity from high-throughput sequencing libraries to improve in silico functional small RNA detection and classification in plants

10.1101/173575 ◽

2017 ◽

Cited By ~ 1

Author(s):

Lionel Morgado ◽

Ritsert C. Jansen ◽

Frank Johannes

Keyword(s):

Small Rna ◽

High Throughput Sequencing ◽

De Novo ◽

Support Vector ◽

Sequencing Data ◽

Learning Sequence ◽

Rna Detection ◽

Binding Motifs ◽

Regulatory Pathways ◽

Vector Machines

The loading of small RNA (sRNA) into Argonaute (AGO) complexes is a crucial step in all regulatory pathways identified so far in plants that depend on such non-coding sequences. Important transcriptional and post-transcriptional silencing mechanisms can be activated depending on the specific AGO protein to which sRNA bind. It is known that sRNA-AGO associations are at least partly encoded in the sRNA primary structure, but the sequence features that drive this association have not been fully explored. Here we train support vector machines (SVMs) on sRNA sequencing data obtained from AGO-immunoprecipitation experiments to identify features that determine sRNA affinity to specific AGOs. Our SVMs reveal that AGO affinity is strongly determined by complex k-mers in the 5’ and 3’ ends of sRNA, in addition to well-known features such as sRNA length and the base composition of the first nucleotide. Moreover, we find that these k-mers tend to overlap known transcription factor (TF) binding motifs, thus highlighting a close interplay between TF and sRNA-mediated transcriptional regulation. We embedded the learned SVMs in a computational pipeline that can be used for de novo functional classification of sRNA sequences. This tool, called SAILS, is provided as a web portal accessible at http://sails.eu.nu.
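
The kind of features named in this abstract (sRNA length, first nucleotide, and k-mers at the 5’ and 3’ ends) can be illustrated with a minimal Python sketch. It is not the SAILS implementation: the function name `srna_features`, the k-mer size `k=3`, and the end window of 8 nucleotides are assumptions chosen for illustration, and no classifier is trained here.

```python
from collections import Counter


def srna_features(seq, k=3, window=8):
    """Build a toy feature dictionary for one sRNA sequence.

    k and window are illustrative assumptions, not values from the paper.
    """
    if not seq:
        raise ValueError("empty sequence")
    # Normalise to upper-case RNA alphabet.
    seq = seq.upper().replace("T", "U")
    features = Counter()
    # Well-known features mentioned in the abstract.
    features["length"] = len(seq)
    features["first_nt=" + seq[0]] = 1
    # k-mer counts restricted to each end of the molecule.
    five_prime = seq[:window]
    three_prime = seq[-window:]
    for i in range(len(five_prime) - k + 1):
        features["5p:" + five_prime[i:i + k]] += 1
    for i in range(len(three_prime) - k + 1):
        features["3p:" + three_prime[i:i + k]] += 1
    return dict(features)


if __name__ == "__main__":
    # Arbitrary 20-nt example sequence.
    for name, value in sorted(srna_features("UGACAGAAGAGAGUGAGCAC").items()):
        print(name, value)
```

A dictionary of this shape could be vectorised and passed to any SVM library; which encoding the authors actually used is not stated in the abstract.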


HAPDeNovo: a haplotype-based approach for filtering and phasing de novo mutations in linked read sequencing data

10.1101/220830 ◽

2017 ◽

Cited By ~ 1

Author(s):

Xin Zhou ◽

Serafim Batzoglou ◽

Arend Sidow ◽

Lu Zhang

Keyword(s):

False Positive ◽

De Novo ◽

False Positives ◽

Sequencing Data ◽

De Novo Mutations ◽

Congenital Diseases ◽

Genome Wide ◽

Next Generation Sequencing Ngs ◽

Ngs Data ◽

Haplotype Information

Background: De novo mutations (DNMs) are associated with neurodevelopmental and congenital diseases, and their detection can contribute to understanding disease pathogenicity. However, accurate detection is challenging because of their small number relative to the genome-wide false positives in next generation sequencing (NGS) data. Software tools such as DeNovoGear and TrioDeNovo have been developed to detect DNMs, but at good sensitivity they still produce many false positive calls. Results: To address this challenge, we develop HAPDeNovo, a program that leverages phasing information from linked read sequencing to remove false positive DNMs from candidate lists generated by DNM-detection tools. Short reads from each phasing block are allocated to each of the two haplotypes, followed by generating a haploid genotype for each putative DNM. HAPDeNovo removes variants that are called as heterozygous in one of the haplotypes because they are almost certainly false positives. Our experiments on 10X Chromium linked read sequencing trio data reveal that HAPDeNovo eliminates 80% to 99% of false positives regardless of how large the candidate DNM set is. Conclusions: HAPDeNovo leverages the haplotype information from linked read sequencing to remove spurious false positive DNMs effectively, and it increases accuracy of DNM detection dramatically without sacrificing sensitivity.
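
The filtering rule described in the Results can be sketched in a few lines of Python. This is only an illustration of the idea, not HAPDeNovo's code: the per-haplotype `(ref_reads, alt_reads)` input layout, the function names, and the `min_frac` threshold are all assumptions.

```python
def haploid_call(ref_reads, alt_reads, min_frac=0.2):
    """Call a haploid genotype from the reads assigned to one haplotype.

    min_frac is an assumed threshold: a haplotype counts as 'het' when
    both alleles are supported by at least this fraction of its reads.
    """
    total = ref_reads + alt_reads
    if total == 0:
        return "nocall"
    alt_frac = alt_reads / total
    if alt_frac < min_frac:
        return "ref"
    if alt_frac > 1 - min_frac:
        return "alt"
    # A single haplotype should carry one allele; a mixture is suspicious.
    return "het"


def filter_candidates(candidates):
    """Keep candidate DNMs for which neither haplotype looks heterozygous."""
    kept = []
    for cand in candidates:
        calls = [haploid_call(*cand["hap1"]), haploid_call(*cand["hap2"])]
        if "het" not in calls:
            kept.append(cand)
    return kept


if __name__ == "__main__":
    # Hypothetical read counts as (ref_reads, alt_reads) per haplotype.
    candidates = [
        {"id": "a", "hap1": (0, 12), "hap2": (15, 0)},  # clean: alt / ref
        {"id": "b", "hap1": (7, 6), "hap2": (14, 0)},   # hap1 is mixed
    ]
    print([c["id"] for c in filter_candidates(candidates)])
```

Candidate "b" is dropped because reads assigned to a single haplotype support both alleles, which is the signature the abstract treats as an almost certain false positive.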


Genome Wide Variant Analysis of Simplex Autism Families with an Integrative Clinical-Bioinformatics Pipeline

10.1101/019208 ◽

2015 ◽

Author(s):

Laura T Jiménez-Barrón ◽

Jason A O'Rawe ◽

Yiyang Wu ◽

Margaret Yoon ◽

Han Fang ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

De Novo ◽

Autism Spectrum ◽

Repetitive Behaviors ◽

Whole Genome ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Bioinformatics Tools ◽

Genome Wide

Autism spectrum disorders (ASD) are a group of developmental disabilities that affect social interaction and communication and are characterized by repetitive behaviors. There is now a large body of evidence that suggests a complex role of genetics in ASD, in which many different loci are involved. Although many current population scale genomic studies have been demonstrably fruitful, these studies generally focus on analyzing a limited part of the genome or use a limited set of bioinformatics tools. These limitations preclude the analysis of genome-wide perturbations that may contribute to the development and severity of ASD-related phenotypes. To overcome these limitations, we have developed and utilized an integrative clinical and bioinformatics pipeline for generating a more complete and reliable set of genomic variants for downstream analyses. Our study focuses on the analysis of three simplex autism families consisting of one affected child, unaffected parents, and one unaffected sibling. All members were clinically evaluated and widely phenotyped. Genotyping arrays and whole genome sequencing were performed on each member, and the resulting sequencing data were analyzed using a variety of available bioinformatics tools. We searched for rare variants of putative functional impact that were found to be segregating according to de novo, autosomal recessive, X-linked, mitochondrial and compound heterozygote transmission models. The resulting candidate variants included three small heterozygous CNVs, a rare heterozygous de novo nonsense mutation in MYBBP1A located within exon 1, and a novel de novo missense variant in LAMB3. Our work demonstrates how more comprehensive analyses that include rich clinical data and whole genome sequencing data can generate reliable results for use in downstream investigations. We are moving to implement our framework for the analysis and study of larger cohorts of families, where statistical rigor can accompany genetic findings.


Genome-wide prediction of topoisomerase IIβ binding by architectural factors and chromatin accessibility

10.1101/2020.03.23.003277 ◽

2020 ◽

Author(s):

Pedro Manuel Martínez-García ◽

Miguel García-Torres ◽

Federico Divina ◽

José Terrón-Bautista ◽

Irene Delgado-Sainz ◽

...

Keyword(s):

Machine Learning ◽

Developmental Disorders ◽

Topoisomerase Ii ◽

Catalytic Mechanism ◽

De Novo ◽

Deep Understanding ◽

Genome Integrity ◽

Sequencing Data ◽

Genome Wide ◽

Genome Dynamics

DNA topoisomerase II-β (TOP2B) is fundamental to remove topological problems linked to DNA metabolism and 3D chromatin architecture, but its cut-and-reseal catalytic mechanism can accidentally cause DNA double-strand breaks (DSBs) that can seriously compromise genome integrity. Understanding the factors that determine the genome-wide distribution of TOP2B is therefore not only essential for a complete knowledge of genome dynamics and organization, but also for the implications of TOP2-induced DSBs in the origin of oncogenic translocations and other types of chromosomal rearrangements. Here, we conduct a machine-learning approach for the prediction of TOP2B binding sites using publicly available sequencing data. We achieve highly accurate predictions, with accessible chromatin and architectural factors being the most informative features. Strikingly, TOP2B is sufficiently explained by only three features: DNase I hypersensitivity, CTCF and cohesin binding, for which genome-wide data are widely available. Based on this, we develop a predictive model for TOP2B genome-wide binding that can be used across cell lines and species, and generate virtual probability tracks that accurately mirror experimental ChIP-seq data. Our results deepen our knowledge on how the accessibility and 3D organization of chromatin determine TOP2B function, and constitute a proof of principle regarding the in silico prediction of sequence-independent chromatin-binding factors.

Author summary: Type II DNA topoisomerases (TOP2) are a double-edged sword. They solve topological problems in the form of supercoiling, knots and tangles that inevitably accompany genome metabolism, but they do so at the cost of transiently cleaving DNA, with the risk that this entails for genome integrity, and the serious consequences for human health, such as neurodegeneration, developmental disorders or predisposition to cancer. A comprehensive analysis of TOP2 distribution throughout the genome is therefore essential for a deep understanding of its function and regulation, and how this can affect genome dynamics and stability. Here, we use machine learning to thoroughly explore genome-wide binding of TOP2B, a vertebrate TOP2 paralog that has been linked to genome organization and cancer-associated translocations. Our analysis shows that TOP2B-DNA binding can be accurately predicted exclusively using information on DNA accessibility and binding of genome-architecture factors. We show that such information is enough to generate virtual maps of TOP2B binding along the genome, which we validate with de novo experimental data. Our results highlight the importance of TOP2B for accessibility and 3D organization of chromatin, and show that computationally predicted TOP2 maps can be accurately obtained using minimal publicly available datasets, opening the door for their use in different organisms, cell types and conditions with experimental and/or clinical relevance.


Genome wide efficiency profiling reveals modulation of maintenance and de novo methylation by Tets

10.1101/2020.08.06.236307 ◽

2020 ◽

Author(s):

Pascal Giehr ◽

Charalampos Kyriakopoulos ◽

Karl Nordström ◽

Abduhlrahman Salhab ◽

Fabian Müller ◽

...

Keyword(s):

Dna Methylation ◽

Molecular Mechanisms ◽

De Novo ◽

Epigenetic Modification ◽

Embryonic Stem ◽

Regulatory Elements ◽

Sequencing Data ◽

Reduced Representation ◽

Genome Wide ◽

Global And Local

Background: DNA methylation is an essential epigenetic modification which is set and maintained by DNA methyl transferases (Dnmts) and removed via active and passive mechanisms involving Tet mediated oxidation. While the molecular mechanisms of these enzymes are well studied, their interplay on shaping cell specific methylomes remains less well understood. In our work we model the activities of Tets and Dnmts at single CpGs across the genome using a novel type of high resolution sequencing data. Results: To accurately measure 5mC and 5hmC levels at single CpGs we developed RRHPoxBS, a reduced representation hairpin oxidative bisulfite sequencing approach. Using this method we mapped the methylomes and hydroxymethylomes of wild type and Tet triple knockout mouse embryonic stem cells. These comprehensive datasets were then used to develop an extended Hidden Markov model allowing us i) to determine the symmetrical methylation and hydroxymethylation state at millions of individual CpGs, and ii) to infer the maintenance and de novo methylation efficiencies of Dnmts and the hydroxylation efficiencies of Tets at individual CpG positions. We find that Tets exhibit their highest activity around unmethylated regulatory elements, i.e. active promoters and enhancers. Furthermore, we find that Tets’ presence has a profound effect on the global and local maintenance and de novo methylation activities by the Dnmts, not only substantially contributing to a universal demethylation of the genome but also shaping the overall methylation landscape. Conclusions: Our analysis demonstrates that a fine tuned and locally controlled interplay between Tets and Dnmts is important to modulate de novo and maintenance activities of Dnmts across the genome. Tet activities contribute to DNA methylation patterning in the following ways: they oxidize 5mC, they locally shield DNA from accidental de novo methylation, and at the same time they modulate maintenance and de novo methylation efficiencies of Dnmts across the genome.


Genome-wide profiling of heritable and de novo STR variations

10.1101/077727 ◽

2016 ◽

Cited By ~ 7

Author(s):

Thomas Willems ◽

Dina Zielinski ◽

Assaf Gordon ◽

Melissa Gymrek ◽

Yaniv Erlich

Keyword(s):

Tandem Repeats ◽

High Throughput Sequencing ◽

De Novo ◽

Genetic Diseases ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Genome Wide ◽

A Genome ◽

Short Tandem

Short tandem repeats (STRs) are highly variable elements that play a pivotal role in multiple genetic diseases, population genetics applications, and forensic casework. However, STRs have proven problematic to genotype from high-throughput sequencing data. Here, we describe HipSTR, a novel haplotype-based method for robustly genotyping, haplotyping, and phasing STRs from whole genome sequencing data and report a genome-wide analysis and validation of de novo STR mutations.


ExpansionHunter Denovo: A computational method for locating known and novel repeat expansions in short-read sequencing data

10.1101/863035 ◽

2019 ◽

Author(s):

Egor Dolzhenko ◽

Mark F. Bennett ◽

Phillip A. Richmond ◽

Brett Trost ◽

Sai Chen ◽

...

Keyword(s):

Tandem Repeats ◽

Simulated Data ◽

Computational Method ◽

Detection Methods ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Monogenic Disorders ◽

Genome Wide ◽

Repeat Expansions

Expansions of short tandem repeats are responsible for over 40 monogenic disorders, and undoubtedly many more pathogenic repeat expansions (REs) remain to be discovered. Existing methods for detecting REs in short-read sequencing data require predefined repeat catalogs. However, recent discoveries have emphasized the need for detection methods that do not require candidate repeats to be specified in advance. To address this need, we introduce ExpansionHunter Denovo, an efficient catalog-free method for genome-wide detection of REs. Analysis of real and simulated data shows that our method can identify large expansions of 41 out of 44 pathogenic repeats, including nine recently reported non-reference REs not discoverable via existing methods. ExpansionHunter Denovo is freely available at https://github.com/Illumina/ExpansionHunterDenovo
