Detection and Employment of Biological Sequence Motifs

Author(s):  
Marjan Trutschl ◽  
Phillip C. S. R. Kilgore ◽  
Rona S. Scott ◽  
Christine E. Birdwell ◽  
Urška Cvek

Biological sequence motifs are short nucleotide or amino acid sequences that are biologically significant and are attractive to scientists because they are usually highly conserved and have structural and regulatory implications. In this chapter, the authors show practical applications of these data with examples, followed by a review of the algorithms, techniques, and tools. They address the nature of motifs and describe several methods for de novo motif discovery, covering the algorithms based on Gibbs sampling, expectation maximization, Bayesian inference, covariance models, and discriminative learning. The authors present the tools and their requirements in order to weigh their individual benefits and challenges. Since interpreting a large set of results can pose significant challenges, they discuss several methods for handling data, ranging from visualization to integration into pipelines and curated databases.

Author(s):  
Yichao Li ◽  
Yating Liu ◽  
David Juedes ◽  
Frank Drews ◽  
Razvan Bunescu ◽  
...  

Motivation: De novo motif discovery algorithms find statistically over-represented sequence motifs that may function as transcription factor binding sites. Current methods often report large numbers of motifs, making it difficult to perform further analyses and experimental validation. The motif selection problem seeks to identify a minimal set of putative regulatory motifs that characterize sequences of interest (e.g. ChIP-Seq binding regions). Results: In this study, the motif selection problem is mapped to variants of the set cover problem that are solved via tabu search and by relaxed integer linear programming (RILP). The algorithms are employed to analyze 349 ChIP-Seq experiments from the ENCODE project, yielding a small number of high-quality motifs that represent putative binding sites of primary factors and cofactors. Specifically, when compared with the motifs reported by Kheradpour and Kellis, the set cover-based algorithms produced motif sets covering 35% more peaks for 11 TFs and identified 4 more putative cofactors for 6 TFs. Moreover, a systematic evaluation using nested cross-validation revealed that the RILP algorithm selected fewer motifs and was able to cover 6% more peaks and 3% fewer background regions, which reduced the error rate by 7%. Availability and implementation: The source code of the algorithms and all the datasets are available at https://github.com/YichaoOU/Set_cover_tools. Supplementary information: Supplementary data are available at Bioinformatics online.
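
To make the set cover framing concrete, here is a minimal sketch in Python. It uses the classic greedy heuristic for set cover, which is a swapped-in technique chosen for brevity and is not the tabu search or RILP method of the paper. The function name, the motif names, and the peak ids are all hypothetical.

```python
def greedy_motif_cover(motif_hits, peaks):
    """Greedy set cover over peaks (illustrative only, not tabu search or RILP).

    motif_hits: dict mapping motif name -> set of peak ids containing that motif.
    peaks: iterable of all peak ids that should be covered.
    Returns (motif names in pick order, set of peaks that ended up covered).
    """
    all_peaks = set(peaks)
    uncovered = set(all_peaks)
    selected = []
    if not motif_hits:
        return selected, set()
    while uncovered:
        # Pick the motif covering the most still-uncovered peaks.
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(motif_hits), key=lambda m: len(motif_hits[m] & uncovered))
        gain = motif_hits[best] & uncovered
        if not gain:
            break  # the remaining peaks contain none of the candidate motifs
        selected.append(best)
        uncovered -= gain
    return selected, all_peaks - uncovered


# Toy input: hypothetical motifs and peak ids. Peak 7 has no motif hit.
hits = {
    "motif_A": {1, 2, 3, 4},
    "motif_B": {3, 4, 5},
    "motif_C": {5, 6},
    "motif_D": {1, 6},
}
selected, covered = greedy_motif_cover(hits, peaks={1, 2, 3, 4, 5, 6, 7})
print(selected, sorted(covered))
```

On this toy input two of the four motifs suffice to cover every peak that can be covered, which is the kind of reduction the motif selection problem is after. A fuller treatment would also penalize motifs that hit background regions, as the abstract describes.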


BMC Genomics ◽  
2018 ◽  
Vol 19 (1) ◽  
Author(s):  
Louis T. Dang ◽  
Markus Tondl ◽  
Man Ho H. Chiu ◽  
Jerico Revote ◽  
Benedict Paten ◽  
...  

Biotechnology ◽  
2019 ◽  
pp. 1069-1085
Author(s):  
Andrei Lihu ◽  
Ștefan Holban

De novo motif discovery is essential in understanding the cis-regulatory processes that play a role in gene expression. Finding unknown patterns of unknown lengths in massive amounts of data has long been a major challenge in computational biology. Because algorithms for motif prediction have always suffered from low performance, there is a constant effort to find better techniques. Evolutionary methods, including swarm intelligence algorithms, have been applied with limited success to motif prediction. However, recently developed methods, such as the Fireworks Algorithm (FWA), which simulates the explosion process of fireworks, may show better prospects. This paper describes a motif finding algorithm based on FWA that maximizes the Kullback-Leibler divergence between candidate solutions and the background noise. Following the terminology of FWA's framework, the candidate motifs are fireworks that generate additional sparks (i.e. derived motifs) in their neighborhood. During the iterations, better sparks can replace the fireworks, as the Fireworks Motif Finder (FW-MF) assumes a one occurrence per sequence mode. The results obtained on a standard benchmark for promoter analysis show that our proof of concept is promising.
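
The objective named in this abstract, Kullback-Leibler divergence between a candidate motif and the background, can be sketched as follows. This is only an assumed form of the score (per-column divergence summed over the motif, uniform background, add-one pseudocounts); the abstract does not give the exact formula used by FW-MF, and the function name and example sites are hypothetical.

```python
import math


def motif_kl_divergence(sites, background=None, pseudocount=1.0):
    """Sum over motif columns of KL(column frequencies || background), in bits.

    sites: equal-length DNA strings over the uppercase alphabet A, C, G, T,
           one candidate site per input sequence.
    background: dict base -> probability; uniform if omitted (an assumption).
    """
    alphabet = "ACGT"
    if background is None:
        background = {b: 0.25 for b in alphabet}
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = 0.0
    for j in range(width):
        # Pseudocounts keep every frequency positive, so log2 is always defined.
        counts = {b: pseudocount for b in alphabet}
        for s in sites:
            counts[s[j]] += 1
        n = sum(counts.values())
        for b in alphabet:
            p = counts[b] / n
            total += p * math.log2(p / background[b])
    return total


# Hypothetical candidate solutions: one fully conserved, one mostly random.
conserved = ["TATAAT", "TATAAT", "TATAAT", "TATAAT"]
mixed = ["TATAAT", "GCCGTA", "ACGTCG", "CGTACC"]
print(motif_kl_divergence(conserved), motif_kl_divergence(mixed))
```

A conserved set of sites scores higher than a mixed one, so a search procedure that maximizes this quantity is pushed toward alignments that differ strongly from the background composition.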


2015 ◽  
Vol 6 (3) ◽  
pp. 24-40 ◽  
Author(s):  
Andrei Lihu ◽  
Ștefan Holban



2020 ◽  
Vol 36 (9) ◽  
pp. 2905-2906 ◽  
Author(s):  
Kevin R Shieh ◽  
Christina Kratschmer ◽  
Keith E Maier ◽  
John M Greally ◽  
Matthew Levy ◽  
...  

Summary: High-throughput sequencing can enhance the analysis of aptamer libraries generated by the Systematic Evolution of Ligands by EXponential enrichment. Robust analysis of the resulting sequenced rounds is best implemented by determining a ranked consensus of reads following processing by multiple aptamer detection algorithms. While several such approaches have been developed to this end, their installation and implementation are problematic. We developed AptCompare, a cross-platform program that combines six of the most widely used analytical approaches for the identification of RNA aptamer motifs and uses a simple weighted ranking to order the candidate aptamers, all driven within the same GUI-enabled environment. We demonstrate AptCompare’s performance by identifying the top-ranked candidate aptamers from a previously published selection experiment in our laboratory, with follow-up bench assays demonstrating good correspondence between the sequences’ rankings and their binding affinities. Availability and implementation: The source code and pre-built virtual machine images are freely available at https://bitbucket.org/shiehk/aptcompare. Supplementary information: Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Yuchun Guo ◽  
Kevin Tian ◽  
Haoyang Zeng ◽  
Xiaoyun Guo ◽  
David Kenneth Gifford

The representation and discovery of transcription factor (TF) sequence binding specificities is critical for understanding gene regulatory networks and interpreting the impact of disease-associated non-coding genetic variants. We present a novel TF binding motif representation, the K-mer Set Memory (KSM), which consists of a set of aligned k-mers that are over-represented at TF binding sites, and a new method called KMAC for de novo discovery of KSMs. We find that KSMs more accurately predict in vivo binding sites than position weight matrix models (PWMs) and other more complex motif models across a large set of ChIP-seq experiments. KMAC also identifies correct motifs in more experiments than four state-of-the-art motif discovery methods. In addition, KSM-derived features outperform both PWM and deep learning model-derived sequence features in predicting differential regulatory activities of expression quantitative trait loci (eQTL) alleles. Finally, we have applied KMAC to 1488 ENCODE TF ChIP-seq datasets and created a public resource of KSM and PWM motifs. We expect that the KSM representation and KMAC method will be valuable in characterizing TF binding specificities and in interpreting the effects of non-coding genetic variations.
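
The notion of k-mers that are over-represented at binding sites can be illustrated with a small Python sketch. It only ranks k-mers by how much more often they occur in bound sequences than in background sequences; it does not align k-mers into a KSM and is not the KMAC algorithm. The function names, the ratio threshold, the pseudocount scheme, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def kmer_presence(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def enriched_kmers(pos, neg, k, min_ratio=2.0, pseudocount=1.0):
    """Return (k-mer, enrichment ratio) pairs, most enriched first.

    The ratio compares the smoothed fraction of bound sequences (pos) that
    contain the k-mer with the smoothed fraction of background sequences (neg).
    """
    pos_counts = kmer_presence(pos, k)
    neg_counts = kmer_presence(neg, k)
    out = []
    for kmer, c in pos_counts.items():
        pos_frac = (c + pseudocount) / (len(pos) + 2 * pseudocount)
        neg_frac = (neg_counts[kmer] + pseudocount) / (len(neg) + 2 * pseudocount)
        ratio = pos_frac / neg_frac
        if ratio >= min_ratio:
            out.append((kmer, ratio))
    # Sort by descending ratio, then alphabetically for a stable order.
    return sorted(out, key=lambda t: (-t[1], t[0]))


# Toy data: every "bound" sequence carries GATAA, no background sequence does.
bound = ["AAGATAAGG", "CCGATAACC", "TTGATAATT", "GGGATAAGC"]
background = ["ACGTACGTA", "TTTTCCCCG", "GGCCAATTG", "CATGCATGC"]
ranked = enriched_kmers(bound, background, k=5)
print(ranked[:3])
```

The shared 5-mer ranks first on this toy input, while k-mers that appear in only one or two bound sequences score lower. Real data would need a statistical test and handling of reverse complements, both of which are omitted here.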


2015 ◽  
Author(s):  
Bong-Hyun Kim ◽  
Jiali Zhuang ◽  
Jie Wang ◽  
Zhiping Weng

Summary: High-throughput sequencing technologies such as ChIP-seq have deepened our understanding of many biological processes. De novo motif search is one of the key downstream computational analyses following ChIP-seq experiments, and several algorithms have been proposed for this purpose. However, most web-based systems do not perform independent filtering or enrichment analyses to ensure the quality of the discovered motifs. Here, we developed a web server, Factorbook Motif Pipeline, based on an algorithm used in analyzing ENCODE consortium ChIP-seq datasets. It performs comprehensive analysis on the set of peaks detected from a ChIP-seq experiment: (i) de novo motif discovery; (ii) independent composition and bias analyses; and (iii) matching to the annotated motifs. The statistical tests employed in our pipeline provide a reliable measure of confidence as to how significant the motifs reported in the discovery step are. Availability: Factorbook Motif Pipeline source code is accessible through the following URL: https://github.com/joshuabhk/factorbook-motif-pipeline

