Ultra-fast and accurate motif finding in large ChIP-seq datasets reveals transcription factor binding patterns

ABSTRACT The availability of a large volume of chromatin immunoprecipitation followed by sequencing (ChIP-seq) datasets for various transcription factors (TF) has provided an unprecedented opportunity to identify all functional TF binding motifs clustered in the enhancers in genomes. However, the progress has been largely hindered by the lack of a highly efficient and accurate tool that is fast enough to find not only the target motifs, but also cooperative motifs contained in very large ChIP-seq datasets with a binding peak length of typical enhancers (∼1,000 bp). To circumvent this hurdle, we herein present an ultra-fast and highly accurate motif-finding algorithm, ProSampler, with automatic motif length detection. ProSampler first identifies significant k-mers in the dataset and combines highly similar significant k-mers to form preliminary motifs. ProSampler then merges preliminary motifs with subtle similarity using a novel graph-based Gibbs sampler to find core motifs. Finally, ProSampler extends the core motifs by applying a two-proportion z-test to the flanking positions to identify motifs longer than k. As the number of preliminary motifs is much smaller than that of k-mers in a dataset, we greatly reduce the search space of the Gibbs sampler compared with conventional ones. By storing flanking sequences in a hash table, we avoid extensive I/O and the necessity of examining all lengths of motifs in an interval. When evaluated on both synthetic and real ChIP-seq datasets, ProSampler runs orders of magnitude faster than the fastest existing tools while more accurately discovering primary motifs as well as cooperative motifs than do the best existing tools. Using ProSampler, we revealed previously unknown complex motif occurrence patterns in large ChIP-seq datasets, thereby providing insights into the mechanisms of cooperative TF binding for gene transcriptional regulation. Therefore, by allowing fast and accurate mining of the entire ChIP-seq datasets, ProSampler can greatly facilitate the efforts to identify the entire cis-regulatory code in genomes.
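
The abstract names a two-proportion z-test (used there on flanking positions) and a step that selects significant k-mers, without giving formulas. As a rough illustration only, the sketch below applies the standard pooled two-proportion z statistic to k-mer presence counts in a peak set versus a background set. The function names, the background set, and the cutoff `z_cut` are assumptions made for this example; this is not ProSampler's implementation.

```python
import math
from collections import Counter


def kmer_presence_counts(seqs, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for x1/n1 versus x2/n2."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0  # both proportions are 0 or both are 1: no evidence
    return (x1 / n1 - x2 / n2) / se


def significant_kmers(fg, bg, k, z_cut=1.96):
    """Return {k-mer: z} for k-mers enriched in fg relative to bg.

    z_cut is a hypothetical one-sided cutoff chosen for illustration.
    """
    fg_counts = kmer_presence_counts(fg, k)
    bg_counts = kmer_presence_counts(bg, k)
    enriched = {}
    for kmer, x1 in fg_counts.items():
        z = two_proportion_z(x1, len(fg), bg_counts.get(kmer, 0), len(bg))
        if z > z_cut:
            enriched[kmer] = z
    return enriched
```

Calling `significant_kmers(peaks, background, 6)` on two lists of DNA strings returns only those 6-mers whose presence rate in the peaks clearly exceeds their rate in the background; a real tool would also need to handle both strands and multiple testing.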

ProSampler: an ultrafast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery

Bioinformatics ◽

10.1093/bioinformatics/btz290 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4632-4639 ◽

Cited By ~ 1

Author(s):

Yang Li ◽

Pengyu Ni ◽

Shaoqiang Zhang ◽

Guojun Li ◽

Zhengchang Su

Keyword(s):

Transcription Factors ◽

Gibbs Sampler ◽

Binding Sites ◽

Motif Discovery ◽

Source Code ◽

Motif Finding ◽

Supplementary Information ◽

Highly Efficient ◽

Motif Finder ◽

Motif Finding Algorithm

Abstract. Motivation: The availability of numerous ChIP-seq datasets for transcription factors (TF) has provided an unprecedented opportunity to identify all TF binding sites in genomes. However, the progress has been hindered by the lack of a highly efficient and accurate tool to find not only the target motifs, but also cooperative motifs in very big datasets. Results: We herein present an ultrafast and accurate motif-finding algorithm, ProSampler, based on a novel numeration method and Gibbs sampler. ProSampler runs orders of magnitude faster than the fastest existing tools while often more accurately identifying motifs of both the target TFs and cooperators. Thus, ProSampler can greatly facilitate the efforts to identify the entire cis-regulatory code in genomes. Availability and implementation: Source code and binaries are freely available for download at https://github.com/zhengchangsulab/prosampler. It was implemented in C++ and supported on Linux, macOS and MS Windows platforms. Supplementary information: Supplementary materials are available at Bioinformatics online.

A Fast Cluster Motif Finding Algorithm for ChIP-Seq Data Sets

BioMed Research International ◽

10.1155/2015/218068 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10 ◽

Cited By ~ 4

Author(s):

Yipu Zhang ◽

Ping Wang

Keyword(s):

High Throughput ◽

Motif Discovery ◽

Large Scale ◽

High Throughput Sequencing ◽

Es Cells ◽

Motif Finding ◽

Data Sets ◽

Data Set ◽

Binding Motifs ◽

Motif Finding Algorithm

The new high-throughput technique ChIP-seq, which couples chromatin immunoprecipitation with high-throughput sequencing, has extended the identification of transcription factor binding locations to genome-wide regions. However, most existing motif discovery algorithms are time-consuming and limited in their ability to identify binding motifs in ChIP-seq data, which are typically large in scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify (l, d) motifs in large-scale ChIP-seq data sets. It is inspired by the emerging substrings mining strategy: it finds the enriched substrings and then searches the neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif is not bound by the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq data sets from mouse ES cells. The whole detection of the real binding motifs and the processing of full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is well suited to (l, d) motif finding in ChIP-seq data; it also demonstrates better performance than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
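
For readers unfamiliar with the term, an (l, d) motif is a pattern of length l whose instances differ from it in at most d positions. The following is a minimal sketch of that definition only, using a brute-force window scan; the function names are hypothetical and this is not FCmotif's substring-mining procedure.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def ld_instances(seq, motif, d):
    """All (start, window) pairs in seq within Hamming distance d of motif."""
    l = len(motif)
    return [
        (i, seq[i:i + l])
        for i in range(len(seq) - l + 1)
        if hamming(seq[i:i + l], motif) <= d
    ]


def support(seqs, motif, d):
    """Number of sequences holding at least one (l, d) instance of motif."""
    return sum(1 for s in seqs if ld_instances(s, motif, d))
```

Because `support` counts sequences with at least one instance and allows sequences with none, it does not assume exactly one occurrence per sequence.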

SIOMICS: a novel approach for systematic identification of motifs in ChIP-seq data

Nucleic Acids Research ◽

10.1093/nar/gkt1288 ◽

2013 ◽

Vol 42 (5) ◽

pp. e35-e35 ◽

Cited By ~ 15

Author(s):

Jun Ding ◽

Haiyan Hu ◽

Xiaoman Li

Keyword(s):

Motif Discovery ◽

De Novo ◽

Data Sets ◽

Random Data ◽

Data Set ◽

Binding Motifs ◽

Gene Transcriptional Regulation ◽

Novel Approach ◽

De Novo Motif Discovery ◽

Systematic Identification

Abstract The identification of transcription factor binding motifs is important for the study of gene transcriptional regulation. Chromatin immunoprecipitation (ChIP) followed by massively parallel sequencing (ChIP-seq) provides an unprecedented opportunity to discover binding motifs. Computational methods have been developed to identify motifs from ChIP-seq data, but they encounter several problems. For example, existing methods are often not scalable to the large number of sequences obtained from ChIP-seq peak regions. Some methods rely heavily on well-annotated motifs even though the number of known motifs is limited. To simplify the problem, de novo motif discovery methods often neglect underrepresented motifs in ChIP-seq peak regions. To address these issues, we developed a novel approach called SIOMICS for de novo motif discovery from ChIP-seq data. Tested on 13 ChIP-seq data sets, SIOMICS identified motifs of many known and new cofactors. Tested on 13 simulated random data sets, SIOMICS discovered no motif in any data set. Compared with two recently developed methods for motif discovery, SIOMICS shows advantages in terms of speed, the number of known cofactor motifs predicted in experimental data sets and the number of false motifs predicted in random data sets. The SIOMICS software is freely available at http://eecs.ucf.edu/∼xiaoman/SIOMICS/SIOMICS.html.

Efficient Online Transcription Factor Binding Site Adjustment by Integrating Transitive Graph Projection with MoRAine 2.0

Journal of Integrative Bioinformatics ◽

10.1515/jib-2010-117 ◽

2010 ◽

Vol 7 (3) ◽

Cited By ~ 1

Author(s):

Tobias Wittkop ◽

Sven Rahmann ◽

Jan Baumbach

Keyword(s):

Transcription Factor ◽

Binding Sites ◽

High Accuracy ◽

Web Tool ◽

Transitive Graph ◽

Binding Motifs ◽

Factor Binding Site ◽

Regulatory Interactions ◽

Flanking Sequences ◽

Gene Regulatory

Summary: We investigated the problem of imprecisely determined prokaryotic transcription factor (TF) binding sites (TFBSs). We found that the identification and reinvestigation of questionable binding motifs may result in improved models of these motifs. Subsequent model-based predictions of gene regulatory interactions may be performed with increased accuracy when the TFBSs annotation underlying these models has been re-adjusted. We present MoRAine 2.0, a significantly improved version of MoRAine. It can automatically identify cases of unfavorable TFBS strand annotations and imprecisely determined TFBS positions. With release 2.0, we close the gap between reasonable running time and high accuracy. Furthermore, it requires only minimal input from the user: (1) the input TFBS sequences and (2) the length of the flanking sequences. Conclusions: MoRAine 2.0 is an easy-to-use, integrated, and publicly available web tool for the re-annotation of questionable TFBSs. It can be used online or downloaded as a stand-alone version from http://moraine.cebitec.uni-bielefeld.de.
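
To make the notion of an unfavorable strand annotation concrete, the sketch below flips a binding site to its reverse complement when that orientation agrees better with a chosen reference site. This is a naive stand-in written for illustration: the abstract does not describe MoRAine's actual procedure, the function names and the single-reference heuristic are assumptions, and position shifts using flanking sequences are not covered.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(site):
    """Reverse complement of a DNA string."""
    return site.translate(_COMPLEMENT)[::-1]


def mismatches(a, b):
    """Mismatch count between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def orient_to_reference(sites, reference):
    """Keep each site or flip it, whichever is closer to the reference."""
    oriented = []
    for s in sites:
        rc = reverse_complement(s)
        if mismatches(s, reference) <= mismatches(rc, reference):
            oriented.append(s)
        else:
            oriented.append(rc)
    return oriented
```

All sites are assumed to have the same length as the reference; ties keep the annotated strand.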

Developing a motif finding algorithm using Suffix Tree and Hash Table

2020 23rd International Conference on Computer and Information Technology (ICCIT) ◽

10.1109/iccit51783.2020.9392729 ◽

2020 ◽

Author(s):

Mohammad Zahedul Islam ◽

Sumit Chowdhury ◽

Mohammad Asif Khan

Keyword(s):

Suffix Tree ◽

Hash Table ◽

Motif Finding ◽

Motif Finding Algorithm

Particle Swarm Optimization Based on a Novel Evaluation of Diversity

Algorithms ◽

10.3390/a14020029 ◽

2021 ◽

Vol 14 (2) ◽

pp. 29

Author(s):

Haohao Zhou ◽

Xiangzhi Wei

Keyword(s):

Particle Swarm Optimization ◽

Hash Table ◽

Particle Swarm ◽

Search Space ◽

Test Suite ◽

Information Compression ◽

Swarm Optimization ◽

The Cost ◽

Exploration Exploitation

In this paper, we propose a particle swarm optimization variant based on a novel evaluation of diversity (PSO-ED). By a novel encoding of the sub-space of the search space and the hash table technique, the diversity of the swarm can be evaluated efficiently without any information compression. This paper proposes a notion of exploration degree based on the diversity of the swarm in the exploration, exploitation, and convergence states to characterize the degree of demand for the dispersion of the swarm. Further, a disturbance update mode is proposed to help the particles jump to the promising regions while reducing the cost of function evaluations for poor particles. The effectiveness of PSO-ED is validated on the CEC2015 test suite by comparison with seven popular PSO variants; out of 12 benchmark functions, PSO-ED achieves six best results for both 10-D and 30-D.
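
One simple way to picture a hash-table-based diversity measure is to map each particle to the grid cell (sub-space) it occupies and count the distinct occupied cells. The sketch below does exactly that; the uniform grid, the `bins` parameter, and the ratio used as the diversity score are assumptions for illustration and should not be read as the encoding or the measure defined in the PSO-ED paper.

```python
def cell_key(position, lower, upper, bins):
    """Encode a position as a tuple of per-dimension grid indices."""
    key = []
    for x, lo, hi in zip(position, lower, upper):
        idx = int((x - lo) / (hi - lo) * bins)
        key.append(min(max(idx, 0), bins - 1))  # clamp boundary values
    return tuple(key)


def swarm_diversity(positions, lower, upper, bins):
    """Fraction of particles that sit in distinct grid cells (0 to 1]."""
    occupied = {}  # hash table: cell key -> number of particles in cell
    for p in positions:
        k = cell_key(p, lower, upper, bins)
        occupied[k] = occupied.get(k, 0) + 1
    return len(occupied) / len(positions)
```

Each particle costs one key computation and one dictionary update, so the evaluation is linear in the swarm size regardless of how many cells the grid has.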

Isolation and molecular characterization of the human CD34 gene

Blood ◽

10.1182/blood.v79.9.2296.bloodjournal7992296 ◽

1992 ◽

Vol 79 (9) ◽

pp. 2296-2302 ◽

Cited By ~ 1

Author(s):

XY He ◽

VP Antao ◽

D Basila ◽

JC Marx ◽

BR Davis

Keyword(s):

Transcription Initiation ◽

Transmembrane Protein ◽

Regulatory Sequences ◽

Type I ◽

Hematopoietic Stem ◽

Binding Motifs ◽

Coding Sequences ◽

Human Cd34 ◽

Flanking Sequences ◽

Rapid Amplification

The human CD34 surface antigen is selectively expressed on hematopoietic stem/progenitor cells, suggesting that it plays an essential role in early hematopoiesis. Using a 1.5-kb partial human CD34 cDNA sequence, RNA-polymerase chain reaction (PCR), and rapid amplification of cDNA ends (RACE) methods, we cloned and sequenced the full-length (2.65 kb) cDNA. The cDNA encodes a type I transmembrane protein with no obvious homology to other known proteins. The entire CD34 gene of 28 kb was cloned, and the coding sequences mapped to eight exons. Mapping of the 5′ termini of mRNAs by 5′-RACE and RNAase protection analyses has indicated that the human CD34 gene uses multiple transcription initiation sites. Analysis of the upstream regulatory sequences revealed the absence of TATA and CAAT box sequences, and the presence of myb, myc, and ets-like DNA binding motifs. We have identified significant homology between human and mouse CD34 genes in 5′ and 3′ untranslated regions, amino acid coding sequences, and 5′ flanking sequences. This investigation of the CD34 gene should facilitate study of the function and regulation of this stem cell antigen.

Putative bovine topological association domains and CTCF binding motifs can reduce the search space for causative regulatory variants of complex traits

10.1101/242792 ◽

2018 ◽

Author(s):

Min Wang ◽

Timothy P Hancock ◽

Amanda J. Chamberlain ◽

Christy J. Vander Jagt ◽

Jennie E Pryce ◽

...

Keyword(s):

Gene Expression ◽

Complex Traits ◽

Bovine Genome ◽

Search Space ◽

Ctcf Binding ◽

P Value ◽

Trna Genes ◽

Specific Expression ◽

Binding Motifs ◽

Regulatory Variants

Abstract. Background: Topological association domains (TADs) are chromosomal domains characterised by frequent internal DNA-DNA interactions. The transcription factor CTCF binds to conserved DNA sequence patterns called CTCF binding motifs to either prohibit or facilitate chromosomal interactions. TADs and CTCF binding motifs control gene expression, but they are not yet well defined in the bovine genome. In this paper, we sought to improve the annotation of bovine TADs and CTCF binding motifs, and assess whether the new annotation can reduce the search space for cis-regulatory variants. Results: We used genomic synteny to map TADs and CTCF binding motifs from humans, mice, dogs and macaques to the bovine genome. We found that our mapped TADs exhibited the same hallmark properties as those sourced from experimental data, such as housekeeping genes, tRNA genes, CTCF binding motifs, SINEs, H3K4me3 and H3K27ac. Then we showed that runs of genes with the same pattern of allele-specific expression (ASE) (either favouring paternal or maternal allele) were often located in the same TAD or between the same conserved CTCF binding motifs. Analyses of variance showed that when averaged across all bovine tissues tested, TADs explained 14% of ASE variation (standard deviation, SD: 0.056), while CTCF explained 27% (SD: 0.078). Furthermore, we showed that the quantitative trait loci (QTLs) associated with gene expression variation (eQTLs) or ASE variation (aseQTLs), which were identified from mRNA transcripts from 141 lactating cows’ white blood and milk cells, were highly enriched at putative bovine CTCF binding motifs. The most significant aseQTL and eQTL for each genic target were located within the same TAD as the gene more often than expected (Chi-Squared test P-value ≤ 0.001). Conclusions: Our results suggest that genomic synteny can be used to functionally annotate conserved transcriptional components, and provides a tool to reduce the search space for causative regulatory variants in the bovine genome.

Isolation and molecular characterization of the human CD34 gene

Blood ◽

10.1182/blood.v79.9.2296.2296 ◽

1992 ◽

Vol 79 (9) ◽

pp. 2296-2302 ◽

Cited By ~ 39

Author(s):

XY He ◽

VP Antao ◽

D Basila ◽

JC Marx ◽

BR Davis

Keyword(s):

Transcription Initiation ◽

Transmembrane Protein ◽

Regulatory Sequences ◽

Type I ◽

Hematopoietic Stem ◽

Binding Motifs ◽

Coding Sequences ◽

Human Cd34 ◽

Flanking Sequences ◽

Rapid Amplification

Abstract The human CD34 surface antigen is selectively expressed on hematopoietic stem/progenitor cells, suggesting that it plays an essential role in early hematopoiesis. Using a 1.5-kb partial human CD34 cDNA sequence, RNA-polymerase chain reaction (PCR), and rapid amplification of cDNA ends (RACE) methods, we cloned and sequenced the full-length (2.65 kb) cDNA. The cDNA encodes a type I transmembrane protein with no obvious homology to other known proteins. The entire CD34 gene of 28 kb was cloned, and the coding sequences mapped to eight exons. Mapping of the 5′ termini of mRNAs by 5′-RACE and RNAase protection analyses has indicated that the human CD34 gene uses multiple transcription initiation sites. Analysis of the upstream regulatory sequences revealed the absence of TATA and CAAT box sequences, and the presence of myb, myc, and ets-like DNA binding motifs. We have identified significant homology between human and mouse CD34 genes in 5′ and 3′ untranslated regions, amino acid coding sequences, and 5′ flanking sequences. This investigation of the CD34 gene should facilitate study of the function and regulation of this stem cell antigen.

Finding Transcription Factor Binding Motifs for Coregulated Genes by Combining Sequence Overrepresentation with Cross-Species Conservation

Journal of Probability and Statistics ◽

10.1155/2012/830575 ◽

2012 ◽

Vol 2012 ◽

pp. 1-18 ◽

Cited By ~ 2

Author(s):

Hui Jia ◽

Jinming Li

Keyword(s):

Transcription Factor ◽

Related Species ◽

Transcription Factor Binding ◽

Transcriptional Factors ◽

Transcriptional Factor ◽

Motif Finding ◽

Closely Related Species ◽

Binding Motifs ◽

Factor Binding ◽

Transcription Factor Binding Motifs

Novel computational methods for finding transcription factor binding motifs have long been sought due to the tedious work of identifying them experimentally. However, the current prevailing methods yield a large number of false positive predictions due to the short, variable nature of transcriptional factor binding sites (TFBSs). We propose here a method that combines sequence overrepresentation and cross-species sequence conservation to detect TFBSs in upstream regions of a given set of coregulated genes. We applied the method to 35 S. cerevisiae transcriptional factors with known DNA binding motifs (with the support of orthologous sequences from genomes of S. mikatae, S. bayanus, and S. paradoxus), and the proposed method outperformed the single-genome-based motif finding methods MEME and AlignACE as well as the multiple-genome-based methods PHYME and Footprinter for the majority of these transcriptional factors. Compared with the prevailing motif finding software, our method has some advantages in finding transcriptional factor binding motifs for potential coregulated genes if the gene upstream sequences of multiple closely related species are available. Although we used yeast genomes to assess our method in this study, it might also be applied to other organisms if suitable related species are available and the upstream sequences of coregulated genes can be obtained for the multiple closely related species.
