Bioinformatics of Transcription Factor Binding Prediction

Abstract: Understanding gene expression will require understanding where regulatory factors bind genomic DNA. The frequently used sequence-based motifs of protein-DNA binding are not predictive, since a genome contains many more candidate binding sites than are actually bound, and transcription factors of the same family share similar DNA-binding motifs. Traditionally, these motifs depict only sequence and neglect DNA shape. Since shape may contribute non-linearly and combinatorially to binding, machine learning approaches ought to predict transcription factor binding better. Here we show that a random forest machine learning approach that incorporates the 3D shape of DNA enhances binding prediction for all 216 tested Arabidopsis thaliana transcription factors and improves the resolution of differential binding by transcription factor family members that share the same binding motif. We observed that DNA shape features were individually weighted for each transcription factor, even if they shared the same binding sequence.
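As a rough illustration of what "sequence plus shape" input to a classifier could look like, the sketch below builds one feature vector per candidate site by concatenating a one-hot encoding of the bases with a per-position track. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, and `toy_shape_track` is only a placeholder (windowed GC fraction) standing in for real shape parameters, which would come from a dedicated DNA shape prediction tool.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a flat list with four 0/1 values per base."""
    vec = []
    for base in seq.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def toy_shape_track(seq, window=5):
    """PLACEHOLDER for a shape feature (assumption, not a real shape parameter).

    Returns the GC fraction of every `window`-wide slice, only to show
    where a per-position shape value would enter the feature vector.
    """
    seq = seq.upper()
    return [
        sum(c in "GC" for c in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def encode_site(seq):
    """Concatenate sequence features and (placeholder) shape features."""
    return one_hot(seq) + toy_shape_track(seq)


# Two hypothetical sites sharing the core CACGTG but differing in the flanks.
site_a = encode_site("AACACGTGAA")
site_b = encode_site("GCCACGTGGC")
print(len(site_a), site_a != site_b)
```

Vectors of this kind, one per candidate site together with a bound or unbound label, are the sort of input a random forest classifier could be trained on. The point of the example is only that two sites with an identical core motif can still differ in their flank-dependent features.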


BinDNase: a discriminatory approach for transcription factor binding prediction using DNase I hypersensitivity data

Bioinformatics ◽ 10.1093/bioinformatics/btv294 ◽ 2015 ◽ Vol 31 (17) ◽ pp. 2852-2859 ◽ Cited By ~ 43

Author(s): Juhani Kähärä ◽ Harri Lähdesmäki

Keyword(s): Transcription Factor ◽ Transcription Factor Binding ◽ Dnase I ◽ Binding Prediction ◽ Factor Binding ◽ Dnase I Hypersensitivity


Large-scale transcription factor binding prediction

Nature Methods ◽ 10.1038/nmeth.3156 ◽ 2014 ◽ Vol 11 (11) ◽ pp. 1091-1091

Keyword(s): Transcription Factor ◽ Large Scale ◽ Transcription Factor Binding ◽ Binding Prediction ◽ Factor Binding


Local DNA shape is a general principle of transcription factor binding specificity in Arabidopsis thaliana

10.1101/2020.09.29.318923 ◽ 2020

Author(s): Janik Sielemann ◽ Donat Wulf ◽ Romy Schmidt ◽ Andrea Bräutigam

Keyword(s): Arabidopsis Thaliana ◽ Transcription Factor ◽ Transcription Factors ◽ Affinity Purification ◽ Transcription Factor Binding ◽ Binding Motif ◽ Binding Prediction ◽ Factor Binding ◽ Binding Behavior ◽ A Genome

A genome encodes two types of information: the “what can be made” and the “when and where”. The “what” consists mostly of proteins, which perform the majority of functions within living organisms, and the “when and where” is the regulatory information that encodes when and where proteins are made. Currently, it is possible to efficiently predict the majority of the protein content of a genome but nearly impossible to predict its transcriptional regulation. This regulation is based upon the interaction between transcription factors and genomic sequences at the sites of binding motifs [1,2,3]. The information contained within the motif is necessary to predict transcription factor binding; however, it is not sufficient [4]. Peaks detected in amplified DNA affinity purification sequencing (ampDAP-seq) and the motifs derived from them only partially overlap in the genome [3], indicating that the sequence holds information beyond the binding motif. Here we show that a random forest machine learning approach that incorporates 3D DNA shape improved the area under the precision-recall curve for binding prediction for all 216 tested Arabidopsis thaliana transcription factors. The method resolved differential binding of transcription factor family members that share the same binding motif. The models correctly predicted the binding behavior of novel, not-in-genome motif sequences. Understanding transcription factor binding as a combination of motif sequence and motif shape brings us closer to predicting gene expression from promoter sequence.
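The area under the precision-recall curve used as the metric above can be estimated with average precision: rank the candidate sites by predicted score and average the precision at each rank where a truly bound site appears. The following is a minimal sketch, assuming binary labels (1 = bound, 0 = unbound). The helper name `average_precision` and all numbers are illustrative toy values, not results from the paper, and tied scores are simply kept in input order.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: iterable of 0/1 ground-truth values (1 = bound site).
    scores: predicted scores, higher meaning more likely bound.
    """
    labels = list(labels)
    scores = list(scores)
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank sites from highest to lowest score (stable for ties).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_pos = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            precision_sum += true_pos / rank  # precision at this recall step
    return precision_sum / n_pos


# Toy data (made up): six candidate motif matches, two of them truly bound.
labels = [1, 1, 0, 0, 0, 0]
motif_only_scores = [0.9, 0.4, 0.8, 0.7, 0.2, 0.1]
motif_plus_shape_scores = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1]

print(average_precision(labels, motif_only_scores))        # 0.75
print(average_precision(labels, motif_plus_shape_scores))  # 1.0
```

Precision-recall metrics suit this problem because bound sites are rare among motif matches: a model that ranks unbound matches above bound ones is penalised directly, whereas the large number of true negatives does not inflate the score.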
