Prediction of regulatory motifs from human Chip-sequencing data using a deep learning framework

2019 ◽  
Vol 47 (15) ◽  
pp. 7809-7824 ◽  
Author(s):  
Jinyu Yang ◽  
Anjun Ma ◽  
Adam D Hoppe ◽  
Cankun Wang ◽  
Yang Li ◽  
...  

Abstract The identification of transcription factor binding sites and cis-regulatory motifs is a frontier where the rules governing protein–DNA binding are being revealed. Here, we developed a new method (DEep Sequence and Shape mOtif or DESSO) for cis-regulatory motif prediction using deep neural networks and the binomial distribution model. DESSO outperformed existing tools, including DeepBind, in predicting motifs in 690 human ENCODE ChIP-sequencing datasets. Furthermore, the deep-learning framework of DESSO expanded motif discovery beyond the state of the art by allowing the identification of known and new protein–protein–DNA tethering interactions in human transcription factors (TFs). Specifically, 61 putative tethering interactions were identified among the 100 TFs expressed in the K562 cell line. In this work, the power of DESSO was further expanded by integrating the detection of DNA shape features. We found that shape information has strong predictive power for TF–DNA binding and provides new putative shape motif information for human TFs. Thus, DESSO improves the identification and structural analysis of TF binding sites by integrating the complexities of DNA binding into a deep-learning framework.
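The abstract names a binomial distribution model but does not say how it is applied. As a minimal sketch of one common use of a binomial model in motif analysis (an assumption, not DESSO's documented procedure), the snippet below computes the upper-tail probability of seeing at least `hits` motif-containing sequences among `n_peaks` sequences when each would contain the motif by chance with probability `background_rate`. All names and numbers here are hypothetical.

```python
from math import comb


def binomial_sf(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    # Sum the exact binomial probabilities from k up to n.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical inputs, chosen only for illustration (not taken from the paper).
n_peaks = 20           # number of peak sequences examined
hits = 8               # sequences containing a candidate motif instance
background_rate = 0.1  # assumed chance of a motif instance in a background sequence

p_value = binomial_sf(hits, n_peaks, background_rate)
print(f"enrichment p-value: {p_value:.2e}")
```

A small tail probability suggests the candidate motif occurs more often in the peaks than the assumed background rate would explain.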

2018 ◽  
Author(s):  
Jinyu Yang ◽  
Adam D. Hoppe ◽  
Bingqiang Liu ◽  
Qin Ma

Abstract Identification of transcription factor binding sites (TFBSs) and cis-regulatory motifs (motifs for short) from genomics datasets provides a powerful view of the rules governing the interactions between TFs and DNA. Existing motif prediction methods, however, are limited by high false positive rates in TFBS identification, contributions from non-sequence-specific binding, and complex and indirect binding mechanisms. High-throughput next-generation sequencing data provide unprecedented opportunities to overcome these difficulties, as they provide multiple whole-genome-scale measurements of TF binding information. Uncovering this information brings new computational and modeling challenges in high-dimensional data mining and heterogeneous data integration. To improve the accuracy of TFBS identification and novel motif prediction in the human genome, we developed an advanced computational technique based on deep learning (DL) and high-performance computing, named DESSO. DESSO utilizes a deep neural network and the binomial distribution to optimize motif prediction. Our results showed that DESSO outperformed existing tools in predicting distinct motifs from the 690 in vivo ENCODE ChIP-Sequencing (ChIP-Seq) datasets for 161 human TFs in 91 cell lines. We also found that protein-protein interactions (PPIs) are prevalent among human TFs, and a total of 61 potential tethering binding events were identified among the 100 TFs in the K562 cell line. To further expand DESSO’s deep-learning capabilities, we included DNA shape features and found that (i) shape information has strong predictive power for TF-DNA binding specificity; and (ii) it aided in the identification of the shape motifs recognized by human TFs, which in turn contributed to the interpretation of TF-DNA binding in the absence of sequence recognition. DESSO and the analyses it enabled will continue to improve our understanding of how gene expression is controlled by TFs and the complexities of DNA binding.
The source code and the predicted motifs and TFBSs from the 690 ENCODE TF ChIP-Seq datasets are freely available at the DESSO web server: http://bmbl.sdstate.edu/DESSO.


RNA Biology ◽  
2018 ◽  
Vol 15 (12) ◽  
pp. 1468-1476 ◽  
Author(s):  
Fan Wang ◽  
Pranik Chainani ◽  
Tommy White ◽  
Jin Yang ◽  
Yu Liu ◽  
...  

2013 ◽  
Vol 11 (01) ◽  
pp. 1340006 ◽  
Author(s):  
JAN GRAU ◽  
JENS KEILWAGEN ◽  
ANDRÉ GOHR ◽  
IVAN A. PAPONOV ◽  
STEFAN POSCH ◽  
...  

DNA-binding proteins are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in target regions of genomic DNA. However, de-novo discovery of these binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not yet been solved satisfactorily. Here, we present a detailed description and analysis of the de-novo motif discovery tool Dispom, which has been developed for finding binding sites of DNA-binding proteins that are differentially abundant in a set of target regions compared to a set of control regions. Two additional features of Dispom are its capability of modeling positional preferences of binding sites and adjusting the length of the motif in the learning process. Dispom yields an increased prediction accuracy compared to existing tools for de-novo motif discovery, suggesting that the combination of searching for differentially abundant motifs, inferring their positional distributions, and adjusting the motif lengths is beneficial for de-novo motif discovery. When applying Dispom to promoters of auxin-responsive genes and those of ABI3 target genes from Arabidopsis thaliana, we identify relevant binding motifs with pronounced positional distributions. These results suggest that learning motifs, their positional distributions, and their lengths by a discriminative learning principle may aid motif discovery from ChIP-chip and gene expression data. We make Dispom freely available as part of Jstacs, an open-source Java library that is tailored to statistical sequence analysis. To facilitate extensions of Dispom, we describe its implementation using Jstacs in this manuscript. In addition, we provide a stand-alone application of Dispom at http://www.jstacs.de/index.php/Dispom for instant use.


2021 ◽  
Vol 22 (11) ◽  
pp. 5521
Author(s):  
Lei Deng ◽  
Hui Wu ◽  
Xuejun Liu ◽  
Hui Liu

Predicting in vivo protein–DNA binding sites is a challenging but pressing task in a variety of fields like drug design and development. Most promoters contain a number of transcription factor (TF) binding sites, but only a small minority has been identified by biochemical experiments that are time-consuming and laborious. To tackle this challenge, many computational methods have been proposed to predict TF binding sites from DNA sequence. Although previous methods have achieved remarkable performance in the prediction of protein–DNA interactions, there is still considerable room for improvement. In this paper, we present a hybrid deep learning framework, termed DeepD2V, for transcription factor binding site prediction. First, we construct the input matrix with an original DNA sequence and its three kinds of variant sequences, including its inverse, complementary, and complementary inverse sequence. A sliding window of size k with a specific stride is used to obtain the k-mer representation of the input sequences. Next, we use word2vec to obtain a pre-trained k-mer word distributed representation model. Finally, the probability of protein–DNA binding is predicted by using a recurrent and convolutional neural network. The experimental results on 50 public ChIP-seq benchmark datasets demonstrate the superior performance and robustness of DeepD2V. Moreover, we verify that the performance of DeepD2V using word2vec-based k-mer distributed representation is better than that with one-hot encoding, and that the integrated framework of both convolutional neural network (CNN) and bidirectional LSTM (bi-LSTM) outperforms the CNN or the bi-LSTM model when used alone. The source code of DeepD2V is available at the GitHub repository.
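The first two preprocessing steps described above (building the three variant sequences and sliding a window of size k with a given stride) can be sketched in a few lines. This is an illustrative sketch only, not the DeepD2V source: the function names `sequence_variants` and `kmers` are hypothetical, and the word2vec embedding and the neural network are deliberately left out.

```python
from __future__ import annotations

# Translation table for the DNA complement (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def sequence_variants(seq: str) -> list[str]:
    """Return [original, inverse, complementary, complementary inverse]."""
    seq = seq.upper()
    inverse = seq[::-1]
    complementary = seq.translate(COMPLEMENT)
    complementary_inverse = complementary[::-1]
    return [seq, inverse, complementary, complementary_inverse]


def kmers(seq: str, k: int, stride: int = 1) -> list[str]:
    """Split seq into k-mers using a sliding window of size k and the given stride."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Toy sequence chosen for illustration.
for variant in sequence_variants("ACGTT"):
    print(variant, kmers(variant, k=3, stride=1))
```

Each list of k-mers plays the role of a "sentence" of k-mer "words", which is the form of input a word2vec-style model expects.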


2020 ◽  
Vol 118 (2) ◽  
pp. e2021171118
Author(s):  
Gi Bae Kim ◽  
Ye Gao ◽  
Bernhard O. Palsson ◽  
Sang Yup Lee

A transcription factor (TF) is a sequence-specific DNA-binding protein that modulates the transcription of a set of particular genes, and thus regulates gene expression in the cell. TFs have commonly been predicted by analyzing sequence homology with the DNA-binding domains of TFs already characterized. Thus, TFs that do not show homologies with the reported ones are difficult to predict. Here we report the development of a deep learning-based tool, DeepTFactor, that predicts whether a protein in question is a TF. DeepTFactor uses a convolutional neural network to extract features of a protein. It showed high performance in predicting TFs of both eukaryotic and prokaryotic origins, resulting in F1 scores of 0.8154 and 0.8000, respectively. Analysis of the gradients of prediction score with respect to input suggested that DeepTFactor detects DNA-binding domains and other latent features for TF prediction. DeepTFactor predicted 332 candidate TFs in Escherichia coli K-12 MG1655. Among them, 84 candidate TFs belong to the y-ome, which is a collection of genes that lack experimental evidence of function. We experimentally validated the results of DeepTFactor prediction by further characterizing genome-wide binding sites of three predicted TFs, YqhC, YiaU, and YahB. Furthermore, we made available the list of 4,674,808 TFs predicted from 73,873,012 protein sequences in 48,346 genomes. DeepTFactor will serve as a useful tool for predicting TFs, which is necessary for understanding the regulatory systems of organisms of interest. We provide DeepTFactor as a stand-alone program, available at https://bitbucket.org/kaistsystemsbiology/deeptfactor.
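For readers unfamiliar with the metric reported above, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the counts are hypothetical and are not taken from the DeepTFactor evaluation.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration: 30 true positives,
# 10 false positives, 10 false negatives.
print(f1_score(tp=30, fp=10, fn=10))
```

Because F1 ignores true negatives, it remains informative when TFs are a small minority of all proteins screened.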


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
John-William Sidhom ◽  
H. Benjamin Larman ◽  
Drew M. Pardoll ◽  
Alexander S. Baras

Abstract Deep learning algorithms have been utilized to achieve enhanced performance in pattern-recognition tasks. The ability to learn complex patterns in data has tremendous implications in immunogenomics. T-cell receptor (TCR) sequencing assesses the diversity of the adaptive immune system and allows for modeling its sequence determinants of antigenicity. We present DeepTCR, a suite of unsupervised and supervised deep learning methods able to model highly complex TCR sequencing data by learning a joint representation of a TCR by its CDR3 sequences and V/D/J gene usage. We demonstrate the utility of deep learning to provide an improved ‘featurization’ of the TCR across multiple human and murine datasets, including improved classification of antigen-specific TCRs and extraction of antigen-specific TCRs from noisy single-cell RNA-Seq and T-cell culture-based assays. Our results highlight the flexibility and capacity for deep neural networks to extract meaningful information from complex immunogenomic data for both descriptive and predictive purposes.


2020 ◽  
Vol 22 (Supplement_3) ◽  
pp. iii316-iii316
Author(s):  
Tatsuya Ozawa ◽  
Syuzo Kaneko ◽  
Mutsumi Takadera ◽  
Eric Holland ◽  
Ryuji Hamamoto ◽  
...  

Abstract A majority of supratentorial ependymomas are associated with a recurrent C11orf95-RELA fusion (RELAFUS). The presence of RELA as one component of RELAFUS has led to the suggestion that NF-kB activity is involved in ependymoma formation and is thus a viable therapeutic target in these tumors. However, the oncogenic role of the other component, C11orf95, in tumorigenesis has not yet been determined. In this study, to clarify the molecular mechanism underlying tumorigenesis by RELAFUS, we performed RELAFUS-ChIP-Seq analysis in cultured cells expressing the RELAFUS protein. Genomic profiling of RELAFUS binding sites pinpointed the transcriptional target genes directly regulated by RELAFUS. We then identified a unique DNA binding motif of RELAFUS, different from the canonical NF-kB motif, in de novo motif discovery analysis. Significant responsiveness of RELAFUS, but not RELA, to the motif was confirmed in a reporter assay. An N-terminal portion of C11orf95 was sufficient to localize to the nucleus and recognize the unique motif. Interestingly, the RELAFUS peaks concomitant with the unique motif were identified around the transcription start sites of the RELAFUS target genes, as previously reported. These observations suggest that C11orf95 may serve as a key determinant of the DNA binding sites of RELAFUS, thereby inducing aberrant gene expression necessary for ependymoma formation. Our results will give insights into the development of new ependymoma therapies.


2016 ◽  
Author(s):  
Chao Ren ◽  
Hebing Chen ◽  
Feng Liu ◽  
Hao Li ◽  
Xiaochen Bo ◽  
...  

Accurately identifying binding sites of transcription factors (TFs) is crucial to understanding the mechanisms of transcriptional regulation and human disease. We present incorporating Find Occurrence of Regulatory Motifs (iFORM), an easy-to-use tool for scanning DNA sequences with TF motifs described as position weight matrices (PWMs). iFORM achieves higher accuracy and sensitivity by integrating the results from five classical motif discovery programs based on Fisher's combined probability test. We have used iFORM to provide accurate results on a variety of data in the ENCODE Project and the NIH Roadmap Epigenomics Project, and have demonstrated its utility to further understand individual roles of functional elements. iFORM can be freely accessed at https://github.com/wenjiegroup/iFORM.
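Fisher's combined probability test merges k independent p-values into the statistic X = -2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below implements the textbook test using only the standard library; it is not iFORM's code, and the five p-values are hypothetical stand-ins for the outputs of five scanning programs at one candidate site.

```python
from math import exp, log
from typing import Sequence


def fisher_combined_p(p_values: Sequence[float]) -> float:
    """Combine independent p-values with Fisher's method.

    The chi-square survival function with an even number of degrees of
    freedom (2k) has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!,
    so no external statistics library is needed.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half = -sum(log(p) for p in p_values)  # equals X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i  # (X/2)^i / i!
        total += term
    return exp(-half) * total


# Hypothetical p-values from five programs (illustrative only).
print(fisher_combined_p([0.04, 0.10, 0.03, 0.20, 0.07]))
```

The test assumes the inputs are independent; p-values from programs scanning the same sequence with the same PWM are likely correlated, so the combined value should be read with that caveat in mind.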


2017 ◽  
Author(s):  
Babita Singh ◽  
Juan L. Trincado ◽  
PJ Tatlow ◽  
Stephen R. Piccolo ◽  
Eduardo Eyras

Abstract A major challenge in cancer research is to determine the biological and clinical significance of somatic mutations in non-coding regions. This has been studied in terms of recurrence, functional impact, and association with individual regulatory sites, but the combinatorial contribution of mutations to common RNA regulatory motifs has not been explored. We developed a new method, MIRA, to perform the first comprehensive study of significantly mutated regions (SMRs) affecting binding sites for RNA-binding proteins (RBPs) in cancer. Extracting signals related to RNA-related selection processes and using RNA sequencing data from the same samples, we identified alterations in RNA expression and splicing linked to mutations on RBP binding sites. We found SRSF10 and MBNL1 motifs in introns, HNRPLL motifs at 5’ UTRs, as well as 5’ and 3’ splice-site motifs, among others, with specific mutational patterns that disrupt the motif and impact RNA processing. MIRA facilitates the integrative analysis of multiple genome sites that operate collectively through common RBPs and can aid in the interpretation of non-coding variants in cancer. MIRA is available at https://github.com/comprna/mira.

