Ultra-fast and accurate motif finding in large ChIP-seq datasets reveals transcription factor binding patterns

2018 ◽  
Author(s):  
Yang Li ◽  
Pengyu Ni ◽  
Shaoqiang Zhang ◽  
Guojun Li ◽  
Zhengchang Su

ABSTRACT The availability of a large volume of chromatin immunoprecipitation followed by sequencing (ChIP-seq) datasets for various transcription factors (TF) has provided an unprecedented opportunity to identify all functional TF binding motifs clustered in the enhancers in genomes. However, the progress has been largely hindered by the lack of a highly efficient and accurate tool that is fast enough to find not only the target motifs, but also cooperative motifs contained in very large ChIP-seq datasets with a binding peak length of typical enhancers (∼ 1,000 bp). To circumvent this hurdle, we herein present an ultra-fast and highly accurate motif-finding algorithm, ProSampler, with automatic motif length detection. ProSampler first identifies significant k-mers in the dataset and combines highly similar significant k-mers to form preliminary motifs. ProSampler then merges preliminary motifs with subtle similarity using a novel graph-based Gibbs sampler to find core motifs. Finally, ProSampler extends the core motifs by applying a two-proportion z-test to the flanking positions to identify motifs longer than k. As the number of preliminary motifs is much smaller than that of k-mers in a dataset, we greatly reduce the search space of the Gibbs sampler compared with conventional ones. By storing flanking sequences in a hash table, we avoid extensive I/O and the necessity of examining all lengths of motifs in an interval. When evaluated on both synthetic and real ChIP-seq datasets, ProSampler runs orders of magnitude faster than the fastest existing tools while more accurately discovering primary motifs as well as cooperative motifs than do the best existing tools. Using ProSampler, we revealed previously unknown complex motif occurrence patterns in large ChIP-seq datasets, thereby providing insights into the mechanisms of cooperative TF binding for gene transcriptional regulation. Therefore, by allowing fast and accurate mining of the entire ChIP-seq datasets, ProSampler can greatly facilitate the efforts to identify the entire cis-regulatory code in genomes.
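
As a rough illustration of the first step described above (identifying significant k-mers), the following Python sketch scores each k-mer with a two-proportion z-test. The abstract names this test only for the flank-extension step and does not say how k-mer significance is scored, so applying it here is an assumption. The function names, the use of a separate background sequence set, and the cutoff value are likewise hypothetical and are not taken from the ProSampler source code.

```python
import math
from collections import Counter


def count_kmer_presence(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each k-mer is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for x1/n1 versus x2/n2."""
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        return 0.0  # both proportions are 0 or both are 1
    return (p1 - p2) / se


def significant_kmers(foreground, background, k, z_cutoff):
    """Return {k-mer: z} for k-mers enriched in foreground (assumed scoring)."""
    fg = count_kmer_presence(foreground, k)
    bg = count_kmer_presence(background, k)
    enriched = {}
    for kmer, x1 in fg.items():
        z = two_proportion_z(x1, len(foreground), bg.get(kmer, 0), len(background))
        if z >= z_cutoff:
            enriched[kmer] = z
    return enriched
```

In this sketch a k-mer present in 50 of 100 peak sequences and 25 of 100 background sequences gets a z statistic of about 3.65. How the background set is built and which cutoff is appropriate are left open here.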

2019 ◽  
Vol 35 (22) ◽  
pp. 4632-4639 ◽  
Author(s):  
Yang Li ◽  
Pengyu Ni ◽  
Shaoqiang Zhang ◽  
Guojun Li ◽  
Zhengchang Su

Abstract Motivation: The availability of numerous ChIP-seq datasets for transcription factors (TF) has provided an unprecedented opportunity to identify all TF binding sites in genomes. However, the progress has been hindered by the lack of a highly efficient and accurate tool to find not only the target motifs, but also cooperative motifs in very big datasets. Results: We herein present an ultrafast and accurate motif-finding algorithm, ProSampler, based on a novel numeration method and Gibbs sampler. ProSampler runs orders of magnitude faster than the fastest existing tools while often more accurately identifying motifs of both the target TFs and cooperators. Thus, ProSampler can greatly facilitate the efforts to identify the entire cis-regulatory code in genomes. Availability and implementation: Source code and binaries are freely available for download at https://github.com/zhengchangsulab/prosampler. It was implemented in C++ and supported on Linux, macOS and MS Windows platforms. Supplementary information: Supplementary materials are available at Bioinformatics online.


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Yipu Zhang ◽  
Ping Wang

The high-throughput technique ChIP-seq, which couples chromatin immunoprecipitation with high-throughput sequencing, has extended the identification of transcription factor binding locations to the whole genome. However, most existing motif discovery algorithms are time-consuming and poorly suited to identifying binding motifs in ChIP-seq data, which are typically large in scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify (l, d) motifs in large-scale ChIP-seq datasets. It is inspired by the emerging substrings mining strategy: it first finds enriched substrings and then searches their neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif does not follow the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq datasets from mouse ES cells. The detection of the real binding motifs and the processing of full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is advantageous for (l, d) motif finding in ChIP-seq data; it also demonstrates better performance than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
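
For readers unfamiliar with the (l, d) formulation, an (l, d) motif is a pattern of length l whose instances differ from it in at most d positions. The Python sketch below illustrates only that neighborhood search and the counting step that precedes building a PWM. It is not the FCmotif implementation; the function names and the choice of a single seed substring are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_instances(seed, sequences, d):
    """Collect every window within Hamming distance d of the seed."""
    l = len(seed)
    instances = []
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            window = seq[i:i + l]
            if hamming(window, seed) <= d:
                instances.append(window)
    return instances


def position_counts(instances, alphabet="ACGT"):
    """Per-position base counts, the raw material for a PWM."""
    if not instances:
        return []
    length = len(instances[0])
    counts = [{base: 0 for base in alphabet} for _ in range(length)]
    for inst in instances:
        for pos, base in enumerate(inst):
            if base in counts[pos]:
                counts[pos][base] += 1
    return counts
```

With the seed ACGT and d = 1, the sequences TTACGTTT and GGACCTGG yield the instances ACGT and ACCT, so the third position of the count matrix records one G and one C. Converting counts to a PWM would additionally need pseudocounts and a background model, which this sketch leaves out.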


2013 ◽  
Vol 42 (5) ◽  
pp. e35-e35 ◽  
Author(s):  
Jun Ding ◽  
Haiyan Hu ◽  
Xiaoman Li

Abstract The identification of transcription factor binding motifs is important for the study of gene transcriptional regulation. The chromatin immunoprecipitation (ChIP), followed by massive parallel sequencing (ChIP-seq) experiments, provides an unprecedented opportunity to discover binding motifs. Computational methods have been developed to identify motifs from ChIP-seq data, while at the same time encountering several problems. For example, existing methods are often not scalable to the large number of sequences obtained from ChIP-seq peak regions. Some methods heavily rely on well-annotated motifs even though the number of known motifs is limited. To simplify the problem, de novo motif discovery methods often neglect underrepresented motifs in ChIP-seq peak regions. To address these issues, we developed a novel approach called SIOMICS to de novo discover motifs from ChIP-seq data. Tested on 13 ChIP-seq data sets, SIOMICS identified motifs of many known and new cofactors. Tested on 13 simulated random data sets, SIOMICS discovered no motif in any data set. Compared with two recently developed methods for motif discovery, SIOMICS shows advantages in terms of speed, the number of known cofactor motifs predicted in experimental data sets and the number of false motifs predicted in random data sets. The SIOMICS software is freely available at http://eecs.ucf.edu/∼xiaoman/SIOMICS/SIOMICS.html.


2010 ◽  
Vol 7 (3) ◽  
Author(s):  
Tobias Wittkop ◽  
Sven Rahmann ◽  
Jan Baumbach

Summary We investigated the problem of imprecisely determined prokaryotic transcription factor (TF) binding sites (TFBSs). We found that the identification and reinvestigation of questionable binding motifs may result in improved models of these motifs. Subsequent model-based predictions of gene regulatory interactions may be performed with increased accuracy when the TFBSs annotation underlying these models has been re-adjusted. We present MoRAine 2.0, a significantly improved version of MoRAine. It can automatically identify cases of unfavorable TFBS strand annotations and imprecisely determined TFBS positions. With release 2.0, we close the gap between reasonable running time and high accuracy. Furthermore, it requires only minimal input from the user: (1) the input TFBS sequences and (2) the length of the flanking sequences. Conclusions: MoRAine 2.0 is an easy-to-use, integrated, and publicly available web tool for the re-annotation of questionable TFBSs. It can be used online or downloaded as a stand-alone version from http://moraine.cebitec.uni-bielefeld.de.


Algorithms ◽  
2021 ◽  
Vol 14 (2) ◽  
pp. 29
Author(s):  
Haohao Zhou ◽  
Xiangzhi Wei

In this paper, we propose a particle swarm optimization variant based on a novel evaluation of diversity (PSO-ED). By a novel encoding of the sub-space of the search space and the hash table technique, the diversity of the swarm can be evaluated efficiently without any information compression. This paper proposes a notion of exploration degree based on the diversity of the swarm in the exploration, exploitation, and convergence states to characterize the degree of demand for the dispersion of the swarm. Further, a disturbance update mode is proposed to help the particles jump to the promising regions while reducing the cost of function evaluations for poor particles. The effectiveness of PSO-ED is validated on the CEC2015 test suite by comparison with seven popular PSO variants; out of 12 benchmark functions, PSO-ED achieves six best results for both 10-D and 30-D.
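
The abstract does not spell out the sub-space encoding, so the following Python sketch shows only one plausible way a hash table could be used to measure swarm diversity. It assumes that each dimension is split into equal-width bins, that a particle's cell index tuple serves as the hash key, and that diversity is the fraction of distinct occupied cells. All of these choices, and the function names, are illustrative assumptions rather than the PSO-ED definition.

```python
def cell_of(position, lower, upper, bins):
    """Map a position to the index tuple of the grid cell that contains it."""
    cell = []
    for x, lo, hi in zip(position, lower, upper):
        frac = (x - lo) / (hi - lo)
        # Clamp so that points on or outside the bounds map to an edge cell.
        idx = min(bins - 1, max(0, int(frac * bins)))
        cell.append(idx)
    return tuple(cell)


def occupied_cell_diversity(swarm, lower, upper, bins):
    """Fraction of particles that sit in distinct cells (assumed measure)."""
    occupied = {}  # hash table: cell index tuple -> particle count
    for position in swarm:
        key = cell_of(position, lower, upper, bins)
        occupied[key] = occupied.get(key, 0) + 1
    return len(occupied) / len(swarm)
```

Because each particle is hashed once, the cost grows linearly with swarm size rather than quadratically as with pairwise distances, which is the kind of efficiency a hash-based evaluation is meant to provide.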


Blood ◽  
1992 ◽  
Vol 79 (9) ◽  
pp. 2296-2302 ◽  
Author(s):  
XY He ◽  
VP Antao ◽  
D Basila ◽  
JC Marx ◽  
BR Davis

The human CD34 surface antigen is selectively expressed on hematopoietic stem/progenitor cells, suggesting that it plays an essential role in early hematopoiesis. Using a 1.5-kb partial human CD34 cDNA sequence, RNA-polymerase chain reaction (PCR), and rapid amplification of cDNA ends (RACE) methods, we cloned and sequenced the full-length (2.65 kb) cDNA. The cDNA encodes a type I transmembrane protein with no obvious homology to other known proteins. The entire CD34 gene of 28 kb was cloned, and the coding sequences mapped to eight exons. Mapping of the 5′ termini of mRNAs by 5′-RACE and RNAase protection analyses has indicated that the human CD34 gene uses multiple transcription initiation sites. Analysis of the upstream regulatory sequences revealed the absence of TATA and CAAT box sequences, and the presence of myb, myc, and ets-like DNA binding motifs. We have identified significant homology between human and mouse CD34 genes in 5′ and 3′ untranslated regions, amino acid coding sequences, and 5′ flanking sequences. This investigation of the CD34 gene should facilitate study of the function and regulation of this stem cell antigen.


2018 ◽  
Author(s):  
Min Wang ◽  
Timothy P Hancock ◽  
Amanda J. Chamberlain ◽  
Christy J. Vander Jagt ◽  
Jennie E Pryce ◽  
...  

Abstract Background: Topological association domains (TADs) are chromosomal domains characterised by frequent internal DNA-DNA interactions. The transcription factor CTCF binds to conserved DNA sequence patterns called CTCF binding motifs to either prohibit or facilitate chromosomal interactions. TADs and CTCF binding motifs control gene expression, but they are not yet well defined in the bovine genome. In this paper, we sought to improve the annotation of bovine TADs and CTCF binding motifs, and assess whether the new annotation can reduce the search space for cis-regulatory variants. Results: We used genomic synteny to map TADs and CTCF binding motifs from humans, mice, dogs and macaques to the bovine genome. We found that our mapped TADs exhibited the same hallmark properties as those sourced from experimental data, such as housekeeping genes, tRNA genes, CTCF binding motifs, SINEs, H3K4me3 and H3K27ac. Then we showed that runs of genes with the same pattern of allele-specific expression (ASE) (either favouring paternal or maternal allele) were often located in the same TAD or between the same conserved CTCF binding motifs. Analyses of variance showed that when averaged across all bovine tissues tested, TADs explained 14% of ASE variation (standard deviation, SD: 0.056), while CTCF explained 27% (SD: 0.078). Furthermore, we showed that the quantitative trait loci (QTLs) associated with gene expression variation (eQTLs) or ASE variation (aseQTLs), which were identified from mRNA transcripts from 141 lactating cows’ white blood and milk cells, were highly enriched at putative bovine CTCF binding motifs. The most significant aseQTL and eQTL for each genic target were located within the same TAD as the gene more often than expected (Chi-Squared test P-value ≤ 0.001). Conclusions: Our results suggest that genomic synteny can be used to functionally annotate conserved transcriptional components, and provides a tool to reduce the search space for causative regulatory variants in the bovine genome.


2012 ◽  
Vol 2012 ◽  
pp. 1-18 ◽  
Author(s):  
Hui Jia ◽  
Jinming Li

Novel computational methods for finding transcription factor binding motifs have long been sought because identifying them experimentally is tedious. However, the current prevailing methods yield a large number of false positive predictions due to the short, variable nature of transcriptional factor binding sites (TFBSs). We propose here a method that combines sequence overrepresentation and cross-species sequence conservation to detect TFBSs in upstream regions of a given set of coregulated genes. We applied the method to 35 S. cerevisiae transcriptional factors with known DNA binding motifs (with the support of orthologous sequences from the genomes of S. mikatae, S. bayanus, and S. paradoxus), and the proposed method outperformed the single-genome-based motif finding methods MEME and AlignACE as well as the multiple-genome-based methods PHYME and Footprinter for the majority of these transcriptional factors. Compared with the prevailing motif finding software, our method has some advantages in finding transcriptional factor binding motifs for potentially coregulated genes if the gene upstream sequences of multiple closely related species are available. Although we used yeast genomes to assess our method in this study, it might also be applied to other organisms if suitable related species are available and the upstream sequences of coregulated genes can be obtained for the multiple closely related species.

