Biological Sequence Motif Discovery Using motif-x

A motif is an over-represented pattern in biological sequences. Motif discovery is a major challenge in bioinformatics, and pattern mismatches make motif mining very difficult. Brute-force approaches to this problem take time exponential in the motif length. In this paper, the authors discuss a Recursive-Brute Force (RBF) algorithm whose average-case time complexity is exponential in the number of allowed mutations rather than in the motif length. The rise of multi-core architectures motivates parallelizing the algorithm, which the authors implement in two ways. A multi-threaded version (OMP-RBF) is implemented using OpenMP. OMP-RBF suffers serious performance degradation due to heap contention, and the authors investigate several solutions to that problem. The second implementation, MPI-RBF, is based on MPI; its efficient handling of data locality boosts its scalability. The authors show that the MPI approach outperforms OpenMP on such computationally intensive, memory-intensive, and communication-less problems.
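To make the role of the mismatch budget concrete, the following Python sketch enumerates, by recursion, every string within a given Hamming distance of a k-mer and keeps the candidates that occur (with mismatches) in every input sequence. It is an illustrative brute-force search under stated assumptions, not the authors' RBF algorithm or its parallel versions: the function names `neighbors` and `find_motifs`, the DNA alphabet, and the "present in every sequence" criterion are all choices made for this example.

```python
def neighbors(kmer, d, alphabet="ACGT"):
    """Return all strings within Hamming distance d of kmer (hypothetical helper)."""
    if d == 0 or not kmer:
        return {kmer}
    head, tail = kmer[0], kmer[1:]
    # Keep the first character: the whole mismatch budget passes to the tail.
    out = {head + s for s in neighbors(tail, d, alphabet)}
    # Mutate the first character: one mismatch is spent here.
    for c in alphabet:
        if c != head:
            out |= {c + s for s in neighbors(tail, d - 1, alphabet)}
    return out


def find_motifs(seqs, k, d):
    """Return k-mers that occur, with at most d mismatches, in every sequence.

    Assumption for this sketch: a motif must be present in all sequences.
    """
    per_seq = []
    for s in seqs:
        cands = set()
        for i in range(len(s) - k + 1):
            cands |= neighbors(s[i:i + k], d, "ACGT")
        per_seq.append(cands)
    return set.intersection(*per_seq) if per_seq else set()


if __name__ == "__main__":
    print(sorted(find_motifs(["AAGT", "ACCA"], 3, 1)))
```

The neighborhood of a single k-mer grows with the number of allowed mismatches d, which is where the dependence on mutations rather than on a full enumeration of all candidate motifs of length k comes from in this kind of search.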


Top-Down Motif Discovery in Biological Sequence Datasets by Genetic Algorithm

2006 International Conference on Hybrid Information Technology ◽

10.1109/ichit.2006.253597 ◽

2006 ◽

Cited By ~ 1

Author(s):

Ulas BALOGLU ◽

Mehmet KAYA

Keyword(s):

Genetic Algorithm ◽

Motif Discovery ◽

Biological Sequence ◽

Top Down


Bayesian Modeling and Inference for Sequence Motif Discovery

Bayesian Inference for Gene Expression and Proteomics ◽

10.1017/cbo9780511584589.017 ◽

2009 ◽

pp. 309-332

Author(s):

Mayetri Gupta ◽

Jun S. Liu ◽

Marina Vannucci

Keyword(s):

Bayesian Modeling ◽

Motif Discovery ◽

Sequence Motif


XSTREME: Comprehensive motif analysis of biological sequence datasets

10.1101/2021.09.02.458722 ◽

2021 ◽

Cited By ~ 1

Author(s):

Charles E. Grant ◽

Timothy L. Bailey

Keyword(s):

Motif Discovery ◽

De Novo ◽

Positional Distribution ◽

Enrichment Analysis ◽

Biological Sequence ◽

Motif Analysis ◽

Web Based ◽

Fully Integrated ◽

Commercial Use ◽

Motif Enrichment

XSTREME is a web-based tool for performing comprehensive motif discovery and analysis in DNA, RNA or protein sequences, as well as in sequences in user-defined alphabets. It is designed for both very large and very small datasets. XSTREME is similar to the MEME-ChIP tool, but expands upon its capabilities in several ways. Like MEME-ChIP, XSTREME performs two types of de novo motif discovery, and also performs motif enrichment analysis of the input sequences using databases of known motifs. Unlike MEME-ChIP, which ranks motifs based on their enrichment in the centers of the input sequences, XSTREME uses enrichment anywhere in the sequences for this purpose. Consequently, XSTREME is more appropriate for motif-based analysis of sequences regardless of how the motifs are distributed within the sequences. XSTREME uses the MEME and STREME algorithms for motif discovery, and the recently developed SEA algorithm for motif enrichment analysis. The interactive HTML output produced by XSTREME includes highly accurate motif significance estimates, plots of the positional distribution of each motif, and histograms of the number of motif matches in each sequence. XSTREME is easy to use via its web server at https://meme-suite.org, and is fully integrated with the widely used MEME Suite of sequence analysis tools, which can be freely downloaded at the same web site for non-commercial use.


Protein Sequence Motif Discovery on Distributed Supercomputer

Advances in Grid and Pervasive Computing - Lecture Notes in Computer Science ◽

10.1007/978-3-540-68083-3_24 ◽

2008 ◽

pp. 232-243 ◽

Cited By ~ 3

Author(s):

Santan Challa ◽

Parimala Thulasiraman

Keyword(s):

Protein Sequence ◽

Motif Discovery ◽

Sequence Motif


Detection and Employment of Biological Sequence Motifs

Big Data Analytics in Bioinformatics and Healthcare - Advances in Bioinformatics and Biomedical Engineering ◽

10.4018/978-1-4666-6611-5.ch005 ◽

2015 ◽

pp. 86-116

Author(s):

Marjan Trutschl ◽

Phillip C. S. R. Kilgore ◽

Rona S. Scott ◽

Christine E. Birdwell ◽

Urška Cvek

Keyword(s):

Motif Discovery ◽

De Novo ◽

Amino Acid Sequences ◽

Discriminative Learning ◽

Large Set ◽

Sequence Motifs ◽

Biological Sequence ◽

Practical Applications ◽

De Novo Motif Discovery ◽

Covariance Models

Biological sequence motifs are short nucleotide or amino acid sequences that are biologically significant. They are attractive to scientists because they are usually highly conserved and have structural and regulatory implications. In this chapter, the authors show practical applications of these data, followed by a review of the algorithms, techniques, and tools. They address the nature of motifs and describe several methods for de novo motif discovery, covering algorithms based on Gibbs sampling, expectation maximization, Bayesian inference, covariance models, and discriminative learning. The authors present the tools and their requirements in order to weigh their individual benefits and challenges. Since interpreting a large set of results can pose significant challenges, they discuss several methods for handling data, ranging from visualization to integration into pipelines and curated databases. Additionally, the authors show practical applications of these data with examples.


MOTIF DISCOVERY WITH DATA MINING IN 3D PROTEIN STRUCTURE DATABASES: DISCOVERY, VALIDATION AND PREDICTION OF THE U-SHAPE ZINC BINDING ("HUF-ZINC") MOTIF

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720013400088 ◽

2013 ◽

Vol 11 (01) ◽

pp. 1340008 ◽

Cited By ~ 3

Author(s):

SEBASTIAN MAURER-STROH ◽

HE GAO ◽

HAO HAN ◽

LIES BAETEN ◽

JOOST SCHYMKOWITZ ◽

...

Keyword(s):

Data Mining ◽

Metal Ion ◽

Motif Discovery ◽

Enzymatic Catalysis ◽

3D Structure ◽

Structural Motif ◽

Zinc Binding ◽

Sequence Motif ◽

3D Protein Structure ◽

Binding Motifs

Data mining in protein databases, which are derived from more fundamental protein 3D structure and sequence databases, has considerable untapped potential for the discovery of relationships between sequence motifs, structural motifs and function, as the finding of the U-shape (Huf-Zinc) motif, originally a small student project, exemplifies. The metal ion zinc is critically involved in universal biological processes, ranging from protein-DNA complexes and transcription regulation to enzymatic catalysis and metabolic pathways. Proteins have evolved a series of motifs to specifically recognize and bind zinc ions. Many of these, the so-called zinc fingers, are structurally independent globular domains with discontinuous binding motifs made up of residues mostly far apart in sequence. Through a systematic approach starting from the BRIX structure fragment database, we discovered that there exists another predictable subset of zinc-binding motifs that not only have a conserved continuous sequence pattern but also share a characteristic local conformation, despite being included in totally different overall folds. While this does not allow general prediction of all Zn binding motifs, an HMM-based web server, Huf-Zinc, is available for prediction of these novel, as well as conventional, zinc finger motifs in protein sequences. The Huf-Zinc web server can be freely accessed at http://mendel.bii.a-star.edu.sg/METHODS/hufzinc/.


Regmex, Motif analysis in ranked lists of sequences

10.1101/035956 ◽

2016 ◽

Cited By ~ 3

Author(s):

Morten Muhlig Nielsen ◽

Paula Tataru ◽

Tobias Madsen ◽

Asger Hobolth ◽

Jakob Skou Pedersen

Keyword(s):

Motif Discovery ◽

Markov Models ◽

Rank Correlation ◽

R Package ◽

Analysis Tool ◽

Biological Sequence ◽

Motif Analysis ◽

Biological Sequence Analysis ◽

Brownian Bridges ◽

Biological Functionality

Motif analysis has long been an important method for characterizing biological functionality, and the current growth of sequencing-based genomics experiments further extends its potential. These diverse experiments often generate sequence lists ranked by some functional property. There is therefore a growing need for motif analysis methods that can exploit this coupled data structure and be tailored to specific biological questions. Here, we present a motif analysis tool, Regmex (REGular expression Motif EXplorer), which offers several methods to identify overrepresented motifs in a ranked list of sequences. Regmex uses regular expressions to define motifs or families of motifs and embedded Markov models to calculate exact probabilities for motif observations in sequences. Motif enrichment is optionally evaluated using random walks, Brownian bridges, or modified rank-based statistics. These features make Regmex well suited for a range of biological sequence analysis problems related to motif discovery. We demonstrate different usage scenarios, including rank correlation of microRNA binding sites co-occurring with a U-rich motif. The method is available as an R package.
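As a rough illustration of the idea of scoring a regular-expression motif along a ranked list, the Python sketch below marks which sequences match a pattern and builds a running-sum walk that rises on a match and falls on a non-match, returning to zero at the end of the list. This is a simplified stand-in, not Regmex's method or its R interface: it ignores the embedded Markov model probabilities described above, and the function name `enrichment_walk`, the step sizes, and the example sequences are assumptions made for this example.

```python
import re


def enrichment_walk(ranked_seqs, pattern):
    """Running-sum walk over a ranked list (hypothetical helper, not Regmex).

    Each sequence that matches the regular expression moves the walk up by
    1/n_hit; each non-matching sequence moves it down by 1/n_miss, so the
    walk ends at zero. A large peak early in the walk suggests the motif is
    concentrated near the top of the ranking.
    """
    motif = re.compile(pattern)
    hits = [1 if motif.search(s) else 0 for s in ranked_seqs]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        # No contrast between matching and non-matching sequences.
        return [0.0] * len(hits)
    walk, pos = [], 0.0
    for h in hits:
        pos += 1.0 / n_hit if h else -1.0 / n_miss
        walk.append(pos)
    return walk


if __name__ == "__main__":
    # Made-up RNA sequences, ranked from most to least affected.
    ranked = ["AUUUA", "UUUUG", "ACGCG", "GCGCA"]
    print(enrichment_walk(ranked, "U{3}"))
```

Assessing whether the peak of such a walk is significant requires a null model for the walk, which is the part that the random-walk and Brownian-bridge statistics mentioned in the abstract address and that this sketch leaves out.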


MotiMul: A significant discriminative sequence motif discovery algorithm with multiple testing correction

10.1101/2020.08.21.261024 ◽

2020 ◽

Author(s):

Koichi Mori ◽

Haruka Ozaki ◽

Tsukasa Fukunaga

Keyword(s):

Multiple Testing ◽

Statistical Power ◽

Motif Discovery ◽

Hypothesis Test ◽

Error Rates ◽

Statistical Hypothesis ◽

Sequence Motif ◽

Sequence Motifs ◽

Statistical Hypothesis Testing ◽

Multiple Testing Correction

Sequence motifs play essential roles in intermolecular interactions such as DNA-protein interactions. The discovery of novel sequence motifs is therefore crucial for revealing gene functions. Various bioinformatics tools have been developed for finding sequence motifs, but until now there has been no software based on statistical hypothesis testing with statistically sound multiple testing correction. Existing software therefore could not control the type-1 error rate. This is because, in the sequence motif discovery problem, conventional multiple testing correction methods produce very low statistical power due to overly strict correction. We developed MotiMul, which comprehensively finds significant sequence motifs using statistically sound multiple testing correction. Our key idea is the application of Tarone’s correction, which improves the statistical power of the hypothesis test by ignoring hypotheses that never become statistically significant. For the efficient enumeration of the significant sequence motifs, we integrated a variant of the PrefixSpan algorithm with Tarone’s correction. Simulation and empirical dataset analysis showed that MotiMul is a powerful method for finding biologically meaningful sequence motifs. The source code of MotiMul is freely available at https://github.com/ko-ichimo-ri/MotiMul.
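The idea behind Tarone's correction can be sketched in a few lines of Python. The example assumes a one-sided Fisher's exact test on whether a motif occurs more often in a positive set of sequences than in the rest, and computes, for each candidate motif, the smallest p-value it could ever attain given only how many sequences contain it. Hypotheses whose minimum attainable p-value exceeds the corrected threshold can never be significant and are not counted. This is a minimal sketch of the general procedure, not MotiMul's implementation: the function names, the choice of test, and the example counts are assumptions, and the PrefixSpan-based enumeration is omitted.

```python
from math import comb


def min_p_value(x, n_pos, n):
    """Smallest one-sided Fisher p-value for a motif found in x of n sequences.

    The most extreme table puts as many of the x occurrences as possible
    into the n_pos positive sequences (hypergeometric probability of that table).
    """
    a = min(x, n_pos)
    return comb(n_pos, a) * comb(n - n_pos, x - a) / comb(n, x)


def tarone_threshold(supports, n_pos, n, alpha=0.05):
    """Return (k, alpha / k) following Tarone's procedure.

    k is the smallest integer such that at most k hypotheses have a minimum
    attainable p-value <= alpha / k; only those hypotheses need correcting for.
    """
    p_mins = [min_p_value(x, n_pos, n) for x in supports]
    k = 1
    while sum(p <= alpha / k for p in p_mins) > k:
        k += 1
    return k, alpha / k


if __name__ == "__main__":
    # Made-up example: 20 sequences, 10 positive, five candidate motifs
    # occurring in 1, 2, 8, 10 and 10 sequences respectively.
    k, threshold = tarone_threshold([1, 2, 8, 10, 10], n_pos=10, n=20)
    print(k, threshold)
```

In this made-up example a Bonferroni correction would divide alpha by all five candidates, whereas the procedure above divides by a smaller factor because the two rarest motifs could never reach significance.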
