Unsupervised learning of DNA sequence features using a convolutional restricted Boltzmann machine

2017 ◽  
Author(s):  
Wolfgang Kopp ◽  
Roman Schulte-Sasse

Abstract: Transcription factors (TFs) are important contributors to gene regulation. They bind specifically to short DNA stretches known as transcription factor binding sites (TFBSs), which are contained in regulatory regions (e.g. promoters), and thereby influence a target gene’s expression level. Computational biology has contributed substantially to understanding regulatory regions by developing numerous tools, including tools for de novo motif discovery. While those tools primarily focus on determining and studying TFBSs, the surrounding sequence context is often given less attention. In this paper, we attempt to fill this gap by adopting a so-called convolutional restricted Boltzmann machine (cRBM) that captures redundant features from the DNA sequences. The model uses an unsupervised learning approach to derive a rich, yet interpretable, description of the entire sequence context. We evaluated the cRBM on a range of publicly available ChIP-seq peak regions and investigated its capability to summarize heterogeneous sets of regulatory sequences in comparison with MEME-ChIP, a popular motif discovery tool. In summary, our method yields a considerably more accurate description of the sequence composition than MEME-ChIP, providing a summary of both strong TF motifs and subtle low-complexity features.
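The convolutional step that a cRBM applies to DNA can be illustrated with a minimal sketch. It assumes a one-hot encoding over A, C, G, T and a single hand-set filter; the filter values, the bias, and the function names are hypothetical and are not taken from the authors' model, which learns its filters from data.

```python
import math

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of four 0/1 values in A, C, G, T order."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def hidden_activation_probs(seq, weights, bias):
    """Slide one filter along the sequence.

    Returns sigmoid(bias + filter . window) for every full-width window,
    which is the form a hidden-unit activation takes in an RBM-style model.
    """
    encoded = one_hot(seq)
    width = len(weights)
    probs = []
    for start in range(len(encoded) - width + 1):
        total = bias
        for offset in range(width):
            for channel in range(4):
                total += weights[offset][channel] * encoded[start + offset][channel]
        probs.append(1.0 / (1.0 + math.exp(-total)))
    return probs


# Hypothetical filter that favours the 4-mer "TATA" (columns: A, C, G, T).
TATA_FILTER = [
    [-1.0, -1.0, -1.0, 2.0],  # prefers T
    [2.0, -1.0, -1.0, -1.0],  # prefers A
    [-1.0, -1.0, -1.0, 2.0],  # prefers T
    [2.0, -1.0, -1.0, -1.0],  # prefers A
]

probs = hidden_activation_probs("GCTATAGC", TATA_FILTER, bias=-4.0)
best = max(range(len(probs)), key=probs.__getitem__)  # index of strongest window
```

In this toy sequence the window with the highest activation is the one that spells TATA; a trained model would hold many such filters, including ones for low-complexity context.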

2018 ◽  
Author(s):  
Niklas Bruse ◽  
Simon J. van Heeringen

Abstract. Background: Transcription factors (TFs) bind to specific DNA sequences, TF motifs, in cis-regulatory sequences and control the expression of the diverse transcriptional programs encoded in the genome. The concerted action of TFs within the chromatin context enables precise temporal and spatial expression patterns. To understand how TFs control gene expression it is essential to model TF binding. TF motif information can help to interpret the exact role of individual regulatory elements, for instance to predict the functional impact of non-coding variants. Findings: Here we present GimmeMotifs, a comprehensive computational framework for TF motif analysis. Compared to the previously published version, this release adds a whole range of new functionality and analysis methods. It now includes tools for de novo motif discovery, motif scanning and sequence analysis, motif clustering, calculation of performance metrics and visualization. Included with GimmeMotifs is a non-redundant database of clustered motifs. Compared to other motif databases, this collection of motifs shows competitive performance in discriminating bound from unbound sequences. Using our de novo motif discovery pipeline we find large differences in performance between de novo motif finders on ChIP-seq data. Using an ensemble method such as the one implemented in GimmeMotifs will generally result in improved motif identification compared to a single motif finder. Finally, we demonstrate maelstrom, a new ensemble method that enables comparative analysis of TF motifs between multiple high-throughput sequencing experiments, such as ChIP-seq or ATAC-seq. Using a collection of ~200 H3K27ac ChIP-seq data sets we identify TFs that play a role in hematopoietic differentiation and lineage commitment. Conclusion: GimmeMotifs is a fully-featured and flexible framework for TF motif analysis. It contains both command-line tools and a Python API and is freely available at: https://github.com/vanheeringen-lab/gimmemotifs.


2006 ◽  
Vol 26 (22) ◽  
pp. 8623-8638 ◽  
Author(s):  
Smitha P. Sripathy ◽  
Jessica Stevens ◽  
David C. Schultz

ABSTRACT KAP1/TIF1β is proposed to be a universal corepressor protein for the KRAB zinc finger protein (KRAB-zfp) superfamily of transcriptional repressors. To characterize the role of KAP1 and KAP1-interacting proteins in transcriptional repression, we investigated the regulation of stably integrated reporter transgenes by hormone-responsive KRAB and KAP1 repressor proteins. Here, we demonstrate that depletion of endogenous KAP1 levels by small interfering RNA (siRNA) significantly inhibited KRAB-mediated transcriptional repression of a chromatin template. Similarly, reduction in cellular levels of HP1α/β/γ and SETDB1 by siRNA attenuated KRAB-KAP1 repression. We also found that direct tethering of KAP1 to DNA was sufficient to repress transcription of an integrated transgene. This activity is absolutely dependent upon the interaction of KAP1 with HP1 and on an intact PHD finger and bromodomain of KAP1, suggesting that these domains function cooperatively in transcriptional corepression. The achievement of the repressed state by wild-type KAP1 involves decreased recruitment of RNA polymerase II, reduced levels of histone H3 K9 acetylation and H3K4 methylation, an increase in histone occupancy, enrichment of trimethyl histone H3K9, H3K36, and histone H4K20, and HP1 deposition at proximal regulatory sequences of the transgene. A KAP1 protein containing a mutation of the HP1 binding domain failed to induce any change in the histone modifications associated with DNA sequences of the transgene, implying that HP1-directed nuclear compartmentalization is required for transcriptional repression by the KRAB/KAP1 repression complex. The combination of these data suggests that KAP1 functions to coordinate activities that dynamically regulate changes in histone modifications and deposition of HP1 to establish a de novo microenvironment of heterochromatin, which is required for repression of gene transcription by KRAB-zfps.


2013 ◽  
Vol 9 (4) ◽  
pp. 412-424 ◽  
Author(s):  
Qiang Yu ◽  
Hongwei Huo ◽  
Yipu Zhang ◽  
Hongzhi Guo ◽  
Haitao Guo

2010 ◽  
Vol 08 (02) ◽  
pp. 219-246 ◽  
Author(s):  
ARVIND RAO ◽  
DAVID J. STATES ◽  
ALFRED O. HERO ◽  
JAMES DOUGLAS ENGEL

Gene regulation in eukaryotes involves a complex interplay between the proximal promoter and distal genomic elements (such as enhancers) which work in concert to drive precise spatiotemporal gene expression. The experimental localization and characterization of gene regulatory elements is a very complex and resource-intensive process. The computational identification of regulatory regions that confer spatiotemporally specific tissue-restricted expression of a gene is thus an important challenge for computational biology. One of the most popular strategies for enhancer localization from DNA sequence is the use of conservation-based prefiltering and, more recently, the use of canonical (transcription factor motifs) or de novo tissue-specific sequence motifs. However, there is an ongoing effort in the computational biology community to further improve the fidelity of enhancer predictions from sequence data by integrating other, complementary genomic modalities. In this work, we propose a framework that complements existing methodologies for prospective enhancer identification. The methods in this work are derived from two key insights: (i) that chromatin modification signatures can discriminate proximal and distally located regulatory regions and (ii) that promoter-enhancer cross-talk (as assayed in 3C/5C experiments) might have implications in the search for regulatory sequences that co-operate with the promoter to yield tissue-restricted, gene-specific expression.


2018 ◽  
Vol 16 (01) ◽  
pp. 1740012 ◽  
Author(s):  
Oleg V. Vishnevsky ◽  
Andrey V. Bocharnikov ◽  
Nikolay A. Kolchanov

The development of chromatin immunoprecipitation sequencing (ChIP-seq) technology has revolutionized the genetic analysis of the basic mechanisms underlying transcription regulation and led to the accumulation of information about a huge number of DNA sequences. Many web services are currently available for de novo motif discovery in datasets containing information about DNA/protein binding. The enormous diversity of motifs makes finding them challenging. To avoid these difficulties, researchers use various stochastic approaches. Unfortunately, the efficiency of motif discovery programs declines dramatically as the query set size increases. As a result, only a fraction of the top “peak” ChIP-Seq segments can be analyzed, or the area of analysis must be narrowed. Thus, motif discovery in massive datasets remains a challenging issue. The Argo_CUDA (Compute Unified Device Architecture) web service is designed to process massive DNA data. It is a program for the detection of degenerate oligonucleotide motifs of fixed length written in the 15-letter IUPAC code. Argo_CUDA is a fully exhaustive approach based on high-performance GPU technologies. Compared with the existing motif discovery web services, Argo_CUDA shows good prediction quality on simulated sets. The analysis of ChIP-Seq sequences revealed motifs that correspond to known transcription factor binding sites.
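To make "degenerate oligonucleotide motifs written in the 15-letter IUPAC code" concrete, here is a minimal sketch of the matching step only. It is not Argo_CUDA's implementation: the exhaustive enumeration of candidate motifs and the GPU parallelism are not shown, and the function names and toy sequences are hypothetical.

```python
# The 15 IUPAC nucleotide symbols and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(motif, window):
    """True if every base in the window is allowed by the motif symbol at that position."""
    return all(base in IUPAC[symbol] for symbol, base in zip(motif, window))


def count_sequences_with_motif(motif, sequences):
    """Count how many sequences contain at least one occurrence of a fixed-length motif."""
    width = len(motif)
    hits = 0
    for seq in sequences:
        seq = seq.upper()
        if any(matches(motif, seq[i:i + width])
               for i in range(len(seq) - width + 1)):
            hits += 1
    return hits


# Toy "peak" sequences (made up for illustration).
toy_peaks = ["ACGTGACC", "TTCATGAA", "GGGGGGGG"]
n_hits = count_sequences_with_motif("CRTG", toy_peaks)  # R = A or G
```

An exhaustive search would repeat this count for every candidate motif of the chosen length, which is why the cost grows quickly with both motif length and dataset size.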


2020 ◽  
Vol 17 (171) ◽  
pp. 20200600 ◽  
Author(s):  
Ibrahim Sultan ◽  
Vincent Fromion ◽  
Sophie Schbath ◽  
Pierre Nicolas

Automatic de novo identification of the main regulons of a bacterium from genome and transcriptome data remains a challenge. To address this task, we propose a statistical model that can use information on exact positions of the transcription start sites and condition-dependent expression profiles. The central idea of this model is to improve the probabilistic representation of the promoter DNA sequences by incorporating covariates summarizing expression profiles (e.g. coordinates in projection spaces or hierarchical clustering trees). A dedicated trans-dimensional Markov chain Monte Carlo algorithm adjusts the width and palindromic properties of the corresponding position-weight matrices, the number of parameters to describe exact position relative to the transcription start site, and chooses the expression covariates relevant for each motif. All parameters are estimated simultaneously, for many motifs and many expression covariates. The method is applied to a dataset of transcription start sites and expression profiles available for Listeria monocytogenes. The results validate the approach and provide a new global view of the transcription regulatory network of this important pathogen. Remarkably, a previously unreported motif is found in promoter regions of ribosomal protein genes, suggesting a role in the regulation of growth.
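The two motif properties mentioned above, a position-weight matrix (PWM) and its palindromic symmetry, can be sketched as follows. This is an illustrative example only: it assumes a uniform background and a made-up PWM, and it does not reproduce the authors' model or their MCMC sampler.

```python
import math

BASES = "ACGT"


def log_odds_score(pwm, site, background=0.25):
    """Sum of log2(p(base at position) / background) over the site."""
    return sum(math.log2(column[BASES.index(base)] / background)
               for column, base in zip(pwm, site))


def reverse_complement_pwm(pwm):
    """Reverse the column order and swap A<->T, C<->G within each column.

    With columns stored in A, C, G, T order, reversing a column performs the swap.
    """
    return [list(reversed(column)) for column in reversed(pwm)]


def is_palindromic(pwm, tol=1e-9):
    """True if the PWM equals its own reverse complement (within tol)."""
    rc = reverse_complement_pwm(pwm)
    return all(abs(p - q) <= tol
               for col, rc_col in zip(pwm, rc)
               for p, q in zip(col, rc_col))


# Made-up PWM favouring the palindromic site GATC (rows: positions; columns: A, C, G, T).
toy_pwm = [
    [0.1, 0.1, 0.7, 0.1],  # G
    [0.7, 0.1, 0.1, 0.1],  # A
    [0.1, 0.1, 0.1, 0.7],  # T
    [0.1, 0.7, 0.1, 0.1],  # C
]

score_consensus = log_odds_score(toy_pwm, "GATC")
score_other = log_odds_score(toy_pwm, "CCCC")
palindromic = is_palindromic(toy_pwm)
```

A palindromic PWM needs only about half as many free parameters as an unconstrained one, which is one reason a sampler might treat the symmetry as a property to be chosen per motif.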


2019 ◽  
Author(s):  
Ibrahim Sultan ◽  
Vincent Fromion ◽  
Sophie Schbath ◽  
Pierre Nicolas

Abstract: Automatic de novo identification of the main regulons of a bacterium from genome and transcriptome data remains a challenge. To address this task, we propose a statistical model of promoter DNA sequences that can use information on exact positions of the transcription start sites and condition-dependent expression profiles. Two main novelties are to allow overlaps between motif occurrences and to incorporate covariates summarising expression profiles (e.g. coordinates in projection spaces or hierarchical clustering trees). All parameters are estimated using a dedicated trans-dimensional Markov chain Monte Carlo algorithm that adjusts, simultaneously, for many motifs and many expression covariates: the width and palindromic properties of the corresponding position-weight matrices, the number of parameters to describe position with respect to the transcription start site, and the choice of relevant expression covariates. A data-set of transcription start sites and expression profiles available for Listeria monocytogenes is analysed. The results validate the approach and provide a new global view of the transcription regulatory network of this important model food-borne pathogen. A previously unreported motif that may play an important role in the regulation of growth was found in promoter regions of ribosomal protein genes.


2018 ◽  
Author(s):  
Leslie A. Mitchell ◽  
Laura H. McCulloch ◽  
Sudarshan Pinglay ◽  
Henri Berger ◽  
Nazario Bosco ◽  
...  

Abstract: Design and large-scale synthesis of DNA has been applied to the functional study of viral and microbial genomes. New and expanded technology development is required to unlock the transformative potential of such bottom-up approaches to the study of larger mammalian genomes. Two major challenges include assembling and delivering long DNA sequences. Here we describe a pipeline for de novo DNA assembly and delivery that enables functional evaluation of mammalian genes on the length scale of 100 kb. The DNA assembly step is supported by an integrated robotic workcell. We assembled the 101 kb human HPRT1 gene in yeast, delivered it to mouse embryonic stem cells, and showed expression of the human protein from its full-length gene. This pipeline provides a framework for producing systematic, designer variants of any mammalian gene locus for functional evaluation in cells. Significance Statement: Mammalian genomes consist of a tiny proportion of relatively well-characterized coding regions and vast swaths of poorly characterized “dark matter” containing critical but much less well-defined regulatory sequences. Given the dominant role of noncoding DNA in common human diseases and traits, the interconnectivity of regulatory elements, and the importance of genomic context, de novo design, assembly, and delivery can enable large-scale manipulation of these elements on a locus scale. Here we outline a pipeline for de novo assembly, delivery and expression of mammalian genes replete with native regulatory sequences. We expect this pipeline will be useful for dissecting the function of non-coding sequence variation in mammalian genomes.


2017 ◽  
Author(s):  
Dennis Wylie ◽  
Hans A. Hofmann ◽  
Boris V. Zemelman

Abstract. Motivation: We set out to develop an algorithm that can mine differential gene expression data to identify candidate cell type-specific DNA regulatory sequences. Differential expression is usually quantified as a continuous score (fold-change, test-statistic, p-value) comparing biological classes. Unlike existing approaches, our de novo strategy, termed SArKS, applies nonparametric kernel smoothing to uncover promoter motifs that correlate with elevated differential expression scores. SArKS detects motifs by smoothing sequence scores over sequence similarity. A second round of smoothing over spatial proximity reveals multi-motif domains (MMDs). Discovered motifs can then be merged or extended based on adjacency within MMDs. False positive rates are estimated and controlled by permutation testing. Results: We applied SArKS to published gene expression data representing distinct neocortical neuron classes in M. musculus and interneuron developmental states in H. sapiens. When benchmarked against several existing algorithms for correlative motif discovery using a cross-validation procedure, SArKS identified larger motif sets that formed the basis for regression models with higher correlative power. Availability: https://github.com/denniscwylie/[email protected] information appended to document.
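The combination of smoothing and permutation testing described above can be illustrated with a deliberately simplified sketch. It assumes a uniform kernel applied along a single one-dimensional ordering of positions and a permutation test on the maximum smoothed score. SArKS itself smooths over sequence similarity and then spatial proximity in ways the abstract does not spell out, so none of this should be read as the authors' algorithm; all names and the toy scores are hypothetical.

```python
import random


def smooth(scores, width):
    """Mean of each full window of `width` consecutive scores (uniform kernel)."""
    return [sum(scores[i:i + width]) / width
            for i in range(len(scores) - width + 1)]


def permutation_p_value(scores, width, n_perm=200, seed=0):
    """Compare the largest smoothed score with its null distribution under shuffling.

    Returns (observed maximum, empirical p-value with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = max(smooth(scores, width))
    shuffled = list(scores)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between score and position
        if max(smooth(shuffled, width)) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Toy scores: a block of five adjacent high-scoring positions among twenty.
toy_scores = [0.0] * 8 + [1.0] * 5 + [0.0] * 7
observed, p_value = permutation_p_value(toy_scores, width=5)
```

Shuffling keeps the set of scores but destroys their arrangement, so a clustered block of high scores yields a smoothed maximum that random arrangements rarely reach.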

