ProSampler: an ultrafast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery

2019 ◽  
Vol 35 (22) ◽  
pp. 4632-4639 ◽  
Author(s):  
Yang Li ◽  
Pengyu Ni ◽  
Shaoqiang Zhang ◽  
Guojun Li ◽  
Zhengchang Su

Abstract Motivation The availability of numerous ChIP-seq datasets for transcription factors (TF) has provided an unprecedented opportunity to identify all TF binding sites in genomes. However, the progress has been hindered by the lack of a highly efficient and accurate tool to find not only the target motifs, but also cooperative motifs in very big datasets. Results We herein present an ultrafast and accurate motif-finding algorithm, ProSampler, based on a novel numeration method and Gibbs sampler. ProSampler runs orders of magnitude faster than the fastest existing tools while often more accurately identifying motifs of both the target TFs and cooperators. Thus, ProSampler can greatly facilitate the efforts to identify the entire cis-regulatory code in genomes. Availability and implementation Source code and binaries are freely available for download at https://github.com/zhengchangsulab/prosampler. It was implemented in C++ and supported on Linux, macOS and MS Windows platforms. Supplementary information Supplementary materials are available at Bioinformatics online.

2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Yipu Zhang ◽  
Ping Wang

The new high-throughput technique ChIP-seq, which couples chromatin immunoprecipitation with high-throughput sequencing, has extended the identification of transcription factor binding locations to the whole genome. However, most existing motif discovery algorithms are time-consuming and poorly suited to identifying binding motifs in ChIP-seq data, which are typically large in scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify (l, d) motifs in large-scale ChIP-seq datasets. It is inspired by the emerging substrings mining strategy: it first finds enriched substrings and then searches their neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif does not follow the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq datasets from mouse ES cells. Detection of the real binding motifs and processing of full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is advantageous for (l, d) motif finding in ChIP-seq data; it also performs better than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
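The two steps named in this abstract (pick an enriched substring, then gather its neighborhood instances into a matrix) can be illustrated with a minimal sketch. This is not FCmotif itself: the seed is supplied by hand instead of being mined, the neighborhood is defined here as Hamming distance at most d, and the function names `hamming`, `find_instances`, and `count_matrix` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_instances(seqs, seed, d):
    """Collect every l-mer in seqs within Hamming distance d of the seed."""
    l = len(seed)
    hits = []
    for s in seqs:
        for i in range(len(s) - l + 1):
            window = s[i:i + l]
            if hamming(window, seed) <= d:
                hits.append(window)
    return hits


def count_matrix(instances, alphabet="ACGT"):
    """Per-position base counts for a list of aligned, equal-length instances."""
    l = len(instances[0])
    matrix = []
    for i in range(l):
        column = [inst[i] for inst in instances]
        matrix.append({base: column.count(base) for base in alphabet})
    return matrix


# Toy sequences and a hand-picked seed (assumptions for illustration only).
seqs = ["TTACGTAA", "GGACGAAC", "CCTCGTAG"]
seed = "ACGT"
instances = find_instances(seqs, seed, d=1)
pfm = count_matrix(instances)
print(instances)  # the (l, d) neighbors of the seed, in scan order
print(pfm)        # one dict of base counts per motif position
```

Dividing each column of counts by the number of instances (optionally with pseudocounts) would turn the count matrix into a PWM-style frequency matrix.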


Blood ◽  
2010 ◽  
Vol 116 (21) ◽  
pp. 3870-3870
Author(s):  
Eirini Trompouki ◽  
Teresa V. Bowman ◽  
Lee N Lawton ◽  
Zi Peng Fan ◽  
Anthony DiBiase ◽  
...  

Abstract 3870 The BMP and WNT signaling pathways are two highly conserved signaling pathways that cooperate in many developmental processes, ultimately through alteration of transcription via SMAD and TCF transcription factors. These pathways elicit pleiotropic outcomes across cell types, yet only a few cell-specific direct target genes are known for the signaling transcription factors that mitigate these effects. We took a genome-wide approach to define the binding sites of BMP and WNT-directed transcription factors in different hematopoietic lineages. Using heat-shock inducible transgenic fish lines that overexpress BMP2 or WNT8, we demonstrated accelerated marrow recovery following irradiation. Irradiation recovery was blunted by heat shock induced overexpression of the respective inhibitors Chordin and DKK1. Similar to the zebrafish regeneration results, competitive transplants with mouse bone marrow treated with the WNT agonist BIO led to enhanced chimerism. Inhibition of BMP diminished peripheral blood contribution even in the presence of WNT stimulation, suggesting a conserved and cell intrinsic interaction for these signaling pathways in adult stress hematopoiesis. To examine potential target genes that could account for the synergy, we performed chromatin immunoprecipitation with WNT- and BMP-activated transcription factors followed by sequencing (ChIP-seq) in K562 cells. ChIP-seq was performed with TCF7L2/TCF4, a mediator of the WNT pathway, and SMAD1, a mediator of the BMP signaling pathway, and >2000 binding sites were identified for each factor. Motif discovery revealed that the DNA sequences bound by TCF7L2 and SMAD1 were not only enriched for TCF and SMAD binding elements, respectively, but were also enriched for a GATA motif.
Comparison of the TCF7L2 and SMAD1 bound genes with published ChIP-Seq data for GATA1 and GATA2 in K562 cells revealed that both signaling factors bind more than 40% of GATA1 bound genes and greater than 70% of GATA2 bound genes. Ingenuity and GSEA analysis revealed that genes important for erythropoiesis were among the genes co-bound by these factors. To evaluate the effect of cell lineage on signaling factor binding, ChIP-seq of TCF7L2 and SMAD1 in U937, a monocytic leukemia cell line, was performed. Motif discovery of sequences bound in U937 found enrichment for an ETS motif, which is bound by the key myeloid transcription factor Pu.1. In addition, TCF7L2 and SMAD1 bound genes in U937 overlapped genes bound by C/EBPalpha in U937 by greater than 70%. These genes are implicated in monocytic development. The overlap of binding between TCF7L2 in K562 and U937 was less than 15% and the overlap of SMAD1 binding sites between the cell lines was less than 10%, indicating a substantial influence of cell lineage on transcription factor binding. Confirmation of cell type selective binding of TCF7L2 and SMAD1 in vivo was accomplished by ChIP of the transcription factors in zebrafish nucleated erythrocytes. Binding of TCF7L2 and SMAD1 in these cells showed that these factors co-bind with GATA1 in many genes with established roles in erythropoiesis. Together our data suggest the co-binding of WNT- and BMP-specific transcription factors with master regulators of each hematopoietic cell type results in regulation of distinct blood genes based on lineage. (First two authors contributed equally to this work) Disclosures: Zon: FATE, Inc.: Consultancy, Equity Ownership, Membership on an entity's Board of Directors or advisory committees, Patents & Royalties; Stemgent: Consultancy, Equity Ownership, Membership on an entity's Board of Directors or advisory committees.


2018 ◽  
Vol 35 (16) ◽  
pp. 2774-2782 ◽  
Author(s):  
Alice Cheng ◽  
Charles E Grant ◽  
William S Noble ◽  
Timothy L Bailey

Abstract Motivation Post-translational modifications (PTMs) of proteins are associated with many significant biological functions and can be identified in high throughput using tandem mass spectrometry. Many PTMs are associated with short sequence patterns called ‘motifs’ that help localize the modifying enzyme. Accordingly, many algorithms have been designed to identify these motifs from mass spectrometry data. Accurate statistical confidence estimates for discovered motifs are critically important for proper interpretation and in the design of downstream experimental validation. Results We describe a method for assigning statistical confidence estimates to PTM motifs, and we demonstrate that this method provides accurate P-values on both simulated and real data. Our methods are implemented in MoMo, a software tool for discovering motifs among sets of PTMs that we make available as a web server and as downloadable source code. MoMo re-implements the two most widely used PTM motif discovery algorithms—motif-x and MoDL—while offering many enhancements. Relative to motif-x, MoMo offers improved statistical confidence estimates and more accurate calculation of motif scores. The MoMo web server offers more proteome databases, more input formats, larger inputs and longer running times than the motif-x web server. Finally, our study demonstrates that the confidence estimates produced by motif-x are inaccurate. This inaccuracy stems in part from the common practice of drawing ‘background’ peptides from an unshuffled proteome database. Our results thus suggest that many of the papers that use motif-x to find motifs may be reporting results that lack statistical support. Availability and implementation The MoMo web server and source code are provided at http://meme-suite.org. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Yichao Li ◽  
Yating Liu ◽  
David Juedes ◽  
Frank Drews ◽  
Razvan Bunescu ◽  
...  

Abstract Motivation De novo motif discovery algorithms find statistically over-represented sequence motifs that may function as transcription factor binding sites. Current methods often report large numbers of motifs, making it difficult to perform further analyses and experimental validation. The motif selection problem seeks to identify a minimal set of putative regulatory motifs that characterize sequences of interest (e.g. ChIP-Seq binding regions). Results In this study, the motif selection problem is mapped to variants of the set cover problem that are solved via tabu search and by relaxed integer linear programming (RILP). The algorithms are employed to analyze 349 ChIP-Seq experiments from the ENCODE project, yielding a small number of high-quality motifs that represent putative binding sites of primary factors and cofactors. Specifically, when compared with the motifs reported by Kheradpour and Kellis, the set cover-based algorithms produced motif sets covering 35% more peaks for 11 TFs and identified 4 more putative cofactors for 6 TFs. Moreover, a systematic evaluation using nested cross-validation revealed that the RILP algorithm selected fewer motifs and was able to cover 6% more peaks and 3% fewer background regions, which reduced the error rate by 7%. Availability and implementation The source code of the algorithms and all the datasets are available at https://github.com/YichaoOU/Set_cover_tools. Supplementary information Supplementary data are available at Bioinformatics online.
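To make the mapping from motif selection to set cover concrete, here is a small sketch using the classic greedy set cover heuristic. This is a swapped-in technique for illustration, not the tabu search or RILP solvers the abstract describes, and it ignores background regions entirely. The function name `greedy_motif_cover` and the toy motif and peak identifiers are hypothetical.

```python
def greedy_motif_cover(motif_hits, peaks):
    """Pick motifs one at a time, each time taking the motif that covers
    the most still-uncovered peaks (greedy set cover heuristic).

    motif_hits maps a motif name to the set of peak ids it occurs in.
    Returns the chosen motifs and any peaks no motif could cover.
    """
    uncovered = set(peaks)
    chosen = []
    while uncovered:
        best = max(motif_hits, key=lambda m: len(motif_hits[m] & uncovered))
        gained = motif_hits[best] & uncovered
        if not gained:  # the remaining peaks are not covered by any motif
            break
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical toy data: which peaks each candidate motif occurs in.
motif_hits = {
    "motif_A": {1, 2, 3, 4},
    "motif_B": {3, 4, 5},
    "motif_C": {5, 6},
    "motif_D": {1, 6},
}
selected, missed = greedy_motif_cover(motif_hits, peaks=range(1, 8))
print(selected, missed)
```

The greedy rule gives a quick baseline; an exact or relaxed integer program, as in the abstract, can additionally penalize motifs that match background regions.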


2016 ◽  
Author(s):  
Chao Ren ◽  
Hebing Chen ◽  
Feng Liu ◽  
Hao Li ◽  
Xiaochen Bo ◽  
...  

Accurately identifying binding sites of transcription factors (TFs) is crucial to understanding the mechanisms of transcriptional regulation and human disease. We present incorporating Find Occurrence of Regulatory Motifs (iFORM), an easy-to-use tool for scanning DNA sequences with TF motifs described as position weight matrices (PWMs). iFORM achieves higher accuracy and sensitivity by integrating the results from five classical motif discovery programs based on Fisher's combined probability test. We have used iFORM to provide accurate results on a variety of data in the ENCODE Project and the NIH Roadmap Epigenomics Project, and have demonstrated its utility for further understanding the individual roles of functional elements. iFORM can be freely accessed at https://github.com/wenjiegroup/iFORM.
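Fisher's combined probability test, the integration step this abstract names, can be sketched in a few lines. The sketch covers only the combination of p-values; how iFORM derives per-program p-values is not stated here, so the inputs below are made-up numbers, and the function name `fisher_combined_p` is hypothetical. The method assumes the p-values are independent.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the survival function has a closed form:
        P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    so only the standard library is needed.
    """
    k = len(p_values)
    if k == 0 or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("need at least one p-value in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half_x) * total


# Made-up p-values for one candidate site, one per scanning program.
combined = fisher_combined_p([0.04, 0.10, 0.03, 0.20, 0.07])
print(combined)
```

For production use, `scipy.stats.combine_pvalues` with `method="fisher"` performs the same combination and also supports other methods.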


2019 ◽  
Vol 35 (18) ◽  
pp. 3287-3293 ◽  
Author(s):  
Vu Ngo ◽  
Mengchi Wang ◽  
Wei Wang

Abstract Motivation Increasing evidence has shown that nucleotide modifications such as methylation and hydroxymethylation on cytosine would greatly impact the binding of transcription factors (TFs). However, there is a lack of motif finding algorithms with the function to search for motifs with modified bases. In this study, we expand on our previous motif finding pipeline Epigram to provide systematic de novo motif discovery and performance evaluation on methylated DNA motifs. Results mEpigram outperforms both MEME and DREME on finding modified motifs in simulated data that mimics various motif enrichment scenarios. Furthermore we were able to identify methylated motifs in Arabidopsis DNA affinity purification sequencing (DAP-seq) data that were previously demonstrated to contain such motifs. When applied to TF ChIP-seq and DNA methylome data in H1 and GM12878, our method successfully identified novel methylated motifs that can be recognized by the TFs or their co-factors. We also observed spacing constraint between the canonical motif of the TF and the newly discovered methylated motifs, which suggests operative recognition of these cis-elements by collaborative proteins. Availability and implementation The mEpigram program is available at http://wanglab.ucsd.edu/star/mEpigram. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Pavel Beran ◽  
Dagmar Stehlíková ◽  
Stephen P Cohen ◽  
Vladislav Čurn

Abstract Summary Searching for amino acid or nucleic acid sequences unique to one organism may be challenging depending on size of the available datasets. K-mer elimination by cross-reference (KEC) allows users to quickly and easily find unique sequences by providing target and non-target sequences. Due to its speed, it can be used for datasets of genomic size and can be run on desktop or laptop computers with modest specifications. Availability and implementation KEC is freely available for non-commercial purposes. Source code and executable binary files compiled for Linux, Mac and Windows can be downloaded from https://github.com/berybox/KEC. Supplementary information Supplementary data are available at Bioinformatics online.
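The general idea of eliminating k-mers by cross-reference can be sketched as set subtraction. This is an illustration of the concept only, not KEC's implementation: it assumes a candidate must occur in every target sequence, handles a single strand, and keeps everything in memory. The function names `kmers` and `unique_kmers` and the toy sequences are hypothetical.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(targets, non_targets, k):
    """k-mers present in every target sequence and in no non-target sequence."""
    # Start from k-mers shared by all targets (an assumption of this sketch).
    shared = set.intersection(*(kmers(t, k) for t in targets))
    # Eliminate anything that also appears in a non-target sequence.
    for seq in non_targets:
        shared -= kmers(seq, k)
    return shared


# Toy sequences for illustration only.
targets = ["ATGCGTAC", "GGATGCGT"]
non_targets = ["TTGCGTAA"]
result = unique_kmers(targets, non_targets, k=4)
print(result)
```

Because the sets hold plain strings, the same functions work for amino acid sequences as well as nucleotides.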


BMC Biology ◽  
2021 ◽  
Vol 19 (1) ◽  
Author(s):  
Alexandre Z. Daly ◽  
Lindsey A. Dudley ◽  
Michael T. Peel ◽  
Stephen A. Liebhaber ◽  
Stephen C. J. Parker ◽  
...  

Abstract Background The pituitary gland is a neuroendocrine organ containing diverse cell types specialized in secreting hormones that regulate physiology. Pituitary thyrotropes produce thyroid-stimulating hormone (TSH), a critical factor for growth and maintenance of metabolism. The transcription factors POU1F1 and GATA2 have been implicated in thyrotrope fate, but the transcriptomic and epigenomic landscapes of these neuroendocrine cells have not been characterized. The goal of this work was to discover transcriptional regulatory elements that drive thyrotrope fate. Results We identified the transcription factors and epigenomic changes in chromatin that are associated with differentiation of POU1F1-expressing progenitors into thyrotropes using cell lines that represent an undifferentiated Pou1f1 lineage progenitor (GHF-T1) and a committed thyrotrope line that produces TSH (TαT1). We compared RNA-seq, ATAC-seq, histone modification (H3K27Ac, H3K4Me1, and H3K27Me3), and POU1F1 binding in these cell lines. POU1F1 binding sites are commonly associated with bZIP transcription factor consensus binding sites in GHF-T1 cells and Helix-Turn-Helix (HTH) or basic Helix-Loop-Helix (bHLH) factors in TαT1 cells, suggesting that these classes of transcription factors may recruit or cooperate with POU1F1 binding at unique sites. We validated enhancer function of novel elements we mapped near Cga, Pitx1, Gata2, and Tshb by transfection in TαT1 cells. Finally, we confirmed that an enhancer element near Tshb can drive expression in thyrotropes of transgenic mice, and we demonstrate that GATA2 enhances Tshb expression through this element. Conclusion These results extend the ENCODE multi-omic profiling approach to the pituitary gland, which should be valuable for understanding pituitary development and disease pathogenesis.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Sara Lago ◽  
Matteo Nadai ◽  
Filippo M. Cernilogar ◽  
Maryam Kazerani ◽  
Helena Domíniguez Moreno ◽  
...  

Abstract Cell identity is maintained by activation of cell-specific gene programs, regulated by epigenetic marks, transcription factors and chromatin organization. DNA G-quadruplex (G4)-folded regions in cells were reported to be associated with either increased or decreased transcriptional activity. By G4-ChIP-seq/RNA-seq analysis on liposarcoma cells we confirmed that G4s in promoters are invariably associated with high transcription levels in open chromatin. Comparing G4 presence, location and transcript levels in liposarcoma cells to available data on keratinocytes, we showed that the same promoter sequences of the same genes in the two cell lines had different G4-folding states: high transcript levels were consistently associated with G4-folding. Transcription factors AP-1 and SP1, whose binding sites were the most significantly represented in G4-folded sequences, coimmunoprecipitated with their G4-folded promoters. Thus, G4s and their associated transcription factors cooperate to determine cell-specific transcriptional programs, so that G4s emerge strongly as new epigenetic regulators of the transcription machinery.

