Computational Approaches for Transcription Factor Binding Prediction

Abstract: Understanding gene expression will require understanding where regulatory factors bind genomic DNA. The frequently used sequence-based motifs of protein-DNA binding are not predictive on their own, since a genome contains many more candidate binding sites than are actually bound, and transcription factors of the same family share similar DNA-binding motifs. Traditionally, these motifs depict only sequence and neglect DNA shape. Since shape may contribute non-linearly and combinatorially to binding, machine learning approaches ought to be able to predict transcription factor binding better. Here we show that a random forest machine learning approach, which incorporates the 3D shape of DNA, enhances binding prediction for all 216 tested Arabidopsis thaliana transcription factors and improves the resolution of differential binding by transcription factor family members which share the same binding motif. We observed that DNA shape features were weighted individually for each transcription factor, even if the factors shared the same binding sequence.
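To make the limitation of sequence-only motifs concrete, the following minimal sketch scans a sequence with a position weight matrix and reports every window that scores above a threshold. It is an illustration only, not the method of any paper listed here: the count matrix, the uniform background, the threshold, the test sequence, and the names `log_odds_matrix` and `scan` are all assumptions made up for this example.

```python
import math

# Hypothetical 4-position count matrix (one dict per motif position).
# The counts are invented for illustration and do not describe a real motif.
COUNTS = [
    {"A": 1, "C": 17, "G": 1, "T": 1},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 17, "G": 1, "T": 1},
    {"A": 1, "C": 1, "G": 17, "T": 1},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def log_odds_matrix(counts, background=BACKGROUND):
    """Convert per-position counts to log2 odds against the background."""
    matrix = []
    for column in counts:
        total = sum(column.values())
        matrix.append(
            {base: math.log2((n / total) / background) for base, n in column.items()}
        )
    return matrix


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


PWM = log_odds_matrix(COUNTS)
SEQUENCE = "TTCACGTGAACACGTT"  # invented example sequence
HITS = scan(SEQUENCE, PWM, threshold=4.0)

for start, window, score in HITS:
    print(start, window, round(score, 2))
```

The scan returns every motif match and has no way of telling which of them is actually bound, which is the gap that additional features such as DNA shape are meant to narrow.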

BinDNase: a discriminatory approach for transcription factor binding prediction using DNase I hypersensitivity data

Bioinformatics ◽ 10.1093/bioinformatics/btv294 ◽ 2015 ◽ Vol 31 (17) ◽ pp. 2852-2859 ◽ Cited By ~ 43

Author(s): Juhani Kähärä ◽ Harri Lähdesmäki

Keyword(s): Transcription Factor ◽ Transcription Factor Binding ◽ Dnase I ◽ Binding Prediction ◽ Factor Binding ◽ Dnase I Hypersensitivity

Genome Wide Approaches to Identify Protein-DNA Interactions

Current Medicinal Chemistry ◽ 10.2174/0929867325666180530115711 ◽ 2020 ◽ Vol 26 (42) ◽ pp. 7641-7654 ◽ Cited By ~ 1

Author(s): Tao Ma ◽ Zhenqing Ye ◽ Liguo Wang

Keyword(s): Transcription Factor ◽ Binding Sites ◽ Target Genes ◽ Rapid Development ◽ Transcription Factor Binding Sites ◽ Transcription Factor Binding ◽ Computational Approaches ◽ Factor Binding ◽ Genome Wide ◽ Chip Chip

Background: Transcription factors are DNA-binding proteins that play key roles in many fundamental biological processes. Unraveling their interactions with DNA is essential to identify their target genes and understand the regulatory network. Genome-wide identification of their binding sites became feasible thanks to recent progress in experimental and computational approaches. ChIP-chip, ChIP-seq, and ChIP-exo are three widely used techniques to demarcate genome-wide transcription factor binding sites. Objective: This review aims to provide an overview of these three techniques, including their experimental procedures, computational approaches, and popular analytic tools. Conclusion: ChIP-chip, ChIP-seq, and ChIP-exo have been the major techniques to study genome-wide in vivo protein-DNA interaction. Due to the rapid development of next-generation sequencing technology, array-based ChIP-chip is deprecated and ChIP-seq has become the most widely used technique to identify transcription factor binding sites genome-wide. The newly developed ChIP-exo further improves the spatial resolution to the single-nucleotide level. Numerous tools have been developed to analyze ChIP-chip, ChIP-seq, and ChIP-exo data. However, different programs may employ different mechanisms or underlying algorithms, so each inherently carries its own set of statistical assumptions and biases. Choosing the most appropriate analytic program for a given experiment therefore requires careful consideration. Moreover, most programs offer only a command-line interface, so installing and using them requires basic computational expertise in Unix/Linux.

Large-scale transcription factor binding prediction

Nature Methods ◽ 10.1038/nmeth.3156 ◽ 2014 ◽ Vol 11 (11) ◽ pp. 1091-1091

Keyword(s): Transcription Factor ◽ Large Scale ◽ Transcription Factor Binding ◽ Binding Prediction ◽ Factor Binding

Local DNA shape is a general principle of transcription factor binding specificity in Arabidopsis thaliana

10.1101/2020.09.29.318923 ◽ 2020

Author(s): Janik Sielemann ◽ Donat Wulf ◽ Romy Schmidt ◽ Andrea Bräutigam

Keyword(s): Arabidopsis Thaliana ◽ Transcription Factor ◽ Transcription Factors ◽ Affinity Purification ◽ Transcription Factor Binding ◽ Binding Motif ◽ Binding Prediction ◽ Factor Binding ◽ Binding Behavior ◽ A Genome

A genome encodes two types of information, the “what can be made” and the “when and where”. The “what” is mostly proteins, which perform the majority of functions within living organisms, and the “when and where” is the regulatory information that encodes when and where proteins are made. Currently, it is possible to efficiently predict the majority of the protein content of a genome but nearly impossible to predict the transcriptional regulation. This regulation is based upon the interaction between transcription factors and genomic sequences at the site of binding motifs [1,2,3]. Information contained within the motif is necessary to predict transcription factor binding; however, it is not sufficient [4]. Peaks detected in amplified DNA affinity purification sequencing (ampDAP-seq) and the motifs derived from them only partially overlap in the genome [3], indicating that the sequence holds information beyond the binding motif. Here we show that a random forest machine learning approach which incorporates the 3D shape of DNA improved the area under the precision-recall curve for binding prediction for all 216 tested Arabidopsis thaliana transcription factors. The method resolved differential binding of transcription factor family members which share the same binding motif. The models correctly predicted the binding behavior of novel, not-in-genome motif sequences. Understanding transcription factor binding as a combination of motif sequence and motif shape brings us closer to predicting gene expression from promoter sequence.
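The abstract reports its improvement as area under the precision-recall curve. As a rough sketch of how such a number can be computed, the block below implements the step-wise form of that area (often called average precision) in plain Python. The function name `average_precision`, the labels, and both score lists are invented toy values chosen to show the calculation; they are not results from the paper and do not reflect how its authors computed the metric.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 for a bound site, 0 for an unbound site.
    scores: higher means the model considers the site more likely bound.
    Tied scores are not handled specially in this sketch.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Each recovered positive raises recall by 1 / positives;
            # weight that step by the precision at this rank.
            area += (true_positives / rank) / positives
    return area


# Invented toy data: six candidate sites, three of them bound.
LABELS = [1, 0, 1, 0, 0, 1]
SCORES_A = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]  # hypothetical weaker ranking
SCORES_B = [0.9, 0.3, 0.8, 0.2, 0.1, 0.7]  # hypothetical ranking with all bound sites on top

AP_A = average_precision(LABELS, SCORES_A)
AP_B = average_precision(LABELS, SCORES_B)
print(round(AP_A, 3), round(AP_B, 3))
```

Because bound sites are typically rare among motif matches, a precision-recall summary of this kind penalises false positives more visibly than a metric that also rewards correctly rejected negatives.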
