RANDOMIZED ALGORITHMS FOR MOTIF DETECTION

Motivation: Motif detection for DNA sequences has many important applications in biological studies, e.g. locating binding sites and regulatory signals, designing genetic probes, etc. In this paper, we propose a randomized algorithm, design an improved EM algorithm and combine them to form a software tool. Results: (1) We design a randomized algorithm for the consensus pattern problem. We can show that, with high probability, our randomized algorithm finds a pattern in polynomial time with cost error at most ε × l for each string, where l is the length of the motif and ε can be any positive number given by the user. (2) We design an improved EM algorithm that outperforms the original EM algorithm. (3) We develop a software tool, MotifDetector, that uses our randomized algorithm to find good seeds and uses the improved EM algorithm to do local search. We compare MotifDetector with Buhler and Tompa's PROJECTION, which is considered to be the best known software for motif detection. Simulations show that MotifDetector is slower than PROJECTION when the pattern length is relatively small, and outperforms PROJECTION when the pattern length becomes large. Availability: It is available for free at , subject to copyright restrictions.
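
To make the consensus pattern objective concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes the usual formulation in which the cost of a candidate pattern is the sum, over all sequences, of the Hamming distance to the closest window of the same length. The seed search shown is a plain sampling heuristic swapped in for illustration; the function names and the parameters `r` and `trials` are hypothetical, and the sketch carries none of the ε × l guarantee stated in the abstract.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_window(pattern, seq):
    """Return (distance, start) of the window of seq closest to pattern."""
    l = len(pattern)
    return min((hamming(pattern, seq[i:i + l]), i)
               for i in range(len(seq) - l + 1))


def consensus_cost(pattern, seqs):
    """Sum over all sequences of the distance to the closest window."""
    return sum(best_window(pattern, s)[0] for s in seqs)


def column_majority(words):
    """Column-wise majority string of equal-length words."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*words))


def random_seed_search(seqs, l, r=3, trials=200, rng=None):
    """Illustrative heuristic: sample r random windows, take their
    column-wise majority as a candidate, and keep the cheapest candidate.
    Assumes every sequence has length at least l."""
    rng = rng or random.Random(0)
    best = None
    for _ in range(trials):
        sample = []
        for _ in range(r):
            s = rng.choice(seqs)
            i = rng.randrange(len(s) - l + 1)
            sample.append(s[i:i + l])
        cand = column_majority(sample)
        cost = consensus_cost(cand, seqs)
        if best is None or cost < best[0]:
            best = (cost, cand)
    return best
```

A candidate returned by `random_seed_search` could then serve as the starting point for a local search such as EM, which is the role the abstract assigns to the seeds.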


Identification of Distinguishing Motifs

International Journal of Knowledge Discovery in Bioinformatics ◽

10.4018/jkdb.2010070104 ◽

2010 ◽

Vol 1 (3) ◽

pp. 53-67

Author(s):

Wangsen Feng ◽

Lusheng Wang

Keyword(s):

Dna Sequences ◽

Target Identification ◽

Randomized Algorithm ◽

Probe Design ◽

Single Group ◽

Biological Studies ◽

Drug Target Identification ◽

Diagnostic Probe ◽

The One ◽

The Given

Motif identification for DNA sequences has many important applications in biological studies, including diagnostic probe design, locating binding sites and regulatory signals, and potential drug target identification. There are two versions of the problem: the Single Group and the Two Groups. Here, the occurrences of the motif in the given sequences have errors. Currently, most existing programs can only handle the single group case. Moreover, most of these programs do not allow indels (insertions and deletions) in the occurrences of the motif. In this paper, the authors propose a randomized algorithm for the single group problem that can handle indels in the occurrences of the motif. Finally, an algorithm for the two groups problem is given, along with extensive simulations evaluating the algorithms.
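
To illustrate what allowing indels in a motif occurrence means, here is a short Python sketch of the classic dynamic program for approximate string matching under edit distance. It is a standard textbook technique used here for illustration, not the randomized algorithm proposed in the paper; the function name and return convention are assumptions.

```python
def best_indel_match(motif, seq):
    """Return (edit_distance, end) for the substring of seq closest to motif.

    D[i][j] is the smallest edit distance between motif[:i] and any
    substring of seq ending at position j. D[0][j] = 0 lets an occurrence
    start anywhere in seq. Only one column is kept in memory at a time.
    """
    m = len(motif)
    prev = list(range(m + 1))      # column j = 0: D[i][0] = i
    best = (prev[m], 0)
    for j, c in enumerate(seq, start=1):
        cur = [0] * (m + 1)        # D[0][j] = 0: free start position
        for i in range(1, m + 1):
            cur[i] = min(
                prev[i - 1] + (motif[i - 1] != c),  # match or substitution
                prev[i] + 1,                        # extra character in seq
                cur[i - 1] + 1,                     # motif character missing
            )
        if cur[m] < best[0]:       # keep the first position with minimum cost
            best = (cur[m], j)
        prev = cur
    return best
```

For example, the motif `ACGT` matches `TTAGTAA` with distance 1 because the occurrence `AGT` has lost the `C`, a case that a pure Hamming-distance comparison over fixed-length windows would score less favourably.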


Identification of Distinguishing Motifs

Computational Knowledge Discovery for Bioinformatics Research ◽

10.4018/978-1-4666-1785-8.ch001 ◽

2013 ◽

pp. 1-14

Author(s):

Wangsen Feng ◽

Lusheng Wang

Keyword(s):

Dna Sequences ◽

Target Identification ◽

Randomized Algorithm ◽

Probe Design ◽

Single Group ◽

Biological Studies ◽

Drug Target Identification ◽

Diagnostic Probe ◽

The One ◽

The Given


A randomized algorithm for aligning DNA sequences to reference genomes

2013 IEEE 3rd International Conference on Computational Advances in Bio and medical Sciences (ICCABS) ◽

10.1109/iccabs.2013.6629197 ◽

2013 ◽

Cited By ~ 1

Author(s):

Nam S. Vo ◽

Quang Tran ◽

Nobal Niraula ◽

Vinhthuy Phan

Keyword(s):

Dna Sequences ◽

Randomized Algorithm ◽

Reference Genomes


DBSCAN-SWA: an integrated tool for rapid prophage detection and annotation

10.1101/2020.07.12.199018 ◽

2020 ◽

Author(s):

Rui Gan ◽

Fengxia Zhou ◽

Yu Si ◽

Han Yang ◽

Chuangeng Chen ◽

...

Keyword(s):

Dna Sequences ◽

Bacterial Infections ◽

Software Tool ◽

High Specificity ◽

Bacterial Genomes ◽

Bacterial Host ◽

Accurate Identification ◽

Bacterial Dna ◽

Intracellular Form ◽

User Friendly

Summary: As an intracellular form of a bacteriophage in the bacterial host genome, a prophage is usually integrated into bacterial DNA with high specificity and contributes to horizontal gene transfer (HGT). Phage therapy has been widely applied, for example, using phages to kill bacteria to treat pathogenic and resistant bacterial infections. Therefore, it is necessary to develop effective tools for the fast and accurate identification of prophages. Here, we introduce DBSCAN-SWA, a command line software tool developed to predict prophage regions of bacterial genomes. DBSCAN-SWA runs faster than any previous tool. Importantly, it has great detection power based on analysis using 184 manually curated prophages, with a recall of 85% compared with Phage_Finder (63%), VirSorter (74%) and PHASTER (82%) for raw DNA sequences. DBSCAN-SWA also provides user-friendly visualizations including a circular prophage viewer and interactive DataTables. Availability and implementation: DBSCAN-SWA is implemented in Python3 and is freely available under an open source GPLv2 license from https://github.com/HIT-ImmunologyLab/DBSCAN-SWA/.


EMBL2checklists: A Python package to facilitate the user-friendly submission of plant DNA barcoding sequences to ENA

10.1101/435644 ◽

2018 ◽

Author(s):

Michael Gruenstaeudl ◽

Yannick Hartmaring

Keyword(s):

Dna Barcoding ◽

Dna Sequence ◽

Dna Sequences ◽

Sequence Data ◽

Software Tool ◽

Plant Dna ◽

Dna Sequence Data ◽

User Friendly ◽

Common Plant ◽

Python Package

Background: The submission of DNA sequences to public sequence databases is an essential, but insufficiently automated step in the process of generating and disseminating novel DNA sequence data. Despite the centrality of database submissions to biological research, the range of available software tools that facilitate the preparation of sequence data for database submissions is low, especially for sequences generated via plant DNA barcoding. Current submission procedures can be complex and prohibitively time expensive for any but a small number of input sequences. A user-friendly software tool is needed that streamlines the file preparation for database submissions of DNA sequences that are commonly generated in plant DNA barcoding. Methods: A Python package was developed that converts DNA sequences from the common EMBL and GenBank flat file formats to submission-ready, tab-delimited spreadsheets (so-called “checklists”) for a subsequent upload to the public sequence database of the European Nucleotide Archive (ENA). The software tool, titled “EMBL2checklists”, automatically converts DNA sequences, their annotation features, and associated metadata into the idiosyncratic format of marker-specific ENA checklists and, thus, generates output that can be uploaded via the interactive Webin submission system of ENA. Results: EMBL2checklists provides a simple, platform-independent tool that automates the conversion of common plant DNA barcoding sequences into easily editable spreadsheets that require no further processing but their upload to ENA via the interactive Webin submission system. The software is equipped with an intuitive graphical as well as an efficient command-line interface for its operation. The utility of the software is illustrated by its application in the submission of DNA sequences of two recent plant phylogenetic investigations and one fungal metagenomic study. Discussion: EMBL2checklists bridges the gap between common software suites for DNA sequence assembly and annotation and the interactive data submission process of ENA. It represents an easy-to-use solution for plant biologists without bioinformatics expertise to generate submission-ready checklists from common plant DNA sequence data. It allows the post-processing of checklists as well as work-sharing during the submission process and solves a critical bottleneck in the effort to increase participation in public data sharing.


Characterization of the hypothetical proteins of Human Papillomavirus DNA consensus sequence

Research Journal of Biotechnology ◽

10.25303/167rjbt19721 ◽

2021 ◽

Vol 16 (7) ◽

pp. 197-202

Author(s):

Suruchi Jamkhedkar

Keyword(s):

Genetic Algorithm ◽

Human Papillomavirus ◽

Amino Acid ◽

High Risk ◽

Amino Acid Sequence ◽

Drug Development ◽

Dna Sequences ◽

Consensus Sequence ◽

Software Tool ◽

Molecular Techniques

The diagnosis of HPV infection is generally carried out using immunological and molecular techniques based on high-risk to probable high-risk HPV strains. The aim of this work is to generate a global representation of HPV strains for diagnosis and drug development. In this work, all the complete genomic DNA sequences of registered Human Papillomavirus (HPV) strains available in NCBI GenBank were used to obtain a consensus sequence of HPV using the Genetic Algorithm. The consensus DNA sequence was translated using the ExPASy software tool. In all, the six longest amino acid frames were selected from the six translated frames. The amino acid sequence identity was determined using the BLAST tool. The six amino acid sequences were identified as E1, E2, E6, E7, L1 and L2. The homology modeling method (Modeller Software Tool) was used to determine the secondary structure of these six identified primary amino acid sequences. The percentage of similarity ranged from 24% in L2 to 100% in E7 and L1. The functions of these structural domains were also determined from PDB databank, InterProScan and CATH. Hence, the consensus sequence built using a genetic algorithm is representative of the HPV genome and can be used for diagnostics and drug development purposes.


MetaGenomeThreader: A Software Tool for Predicting Genes in DNA-Sequences of Metagenome Projects

Methods in Molecular Biology - Metagenomics ◽

10.1007/978-1-60761-823-2_23 ◽

2010 ◽

pp. 325-338 ◽

Cited By ~ 1

Author(s):

David J. Schmitz-Hübsch ◽

Stefan Kurtz

Keyword(s):

Dna Sequences ◽

Software Tool


cisExpress: motif detection in DNA sequences

Bioinformatics ◽

10.1093/bioinformatics/btt366 ◽

2013 ◽

Vol 29 (17) ◽

pp. 2203-2205 ◽

Cited By ~ 16

Author(s):

Martin Triska ◽

David Grocutt ◽

James Southern ◽

Denis J. Murphy ◽

Tatiana Tatarinova

Keyword(s):

Dna Sequences ◽

Motif Detection


Graph-Based Problem Explorer: A Software Tool to Support Algorithm Design Learning While Solving the Salesperson Problem

Mathematics ◽

10.3390/math8091595 ◽

2020 ◽

Vol 8 (9) ◽

pp. 1595

Author(s):

Aura Hernández-Sabaté ◽

Lluís Albarracín ◽

F. Javier Sánchez

Keyword(s):

Computer Literacy ◽

Design Research ◽

Analysis Of Algorithms ◽

Algorithm Design ◽

Software Tool ◽

Point Of View ◽

Real Problem ◽

Educational Design ◽

Computational Point ◽

Algorithmic Techniques

In this article, we present a sequence of activities in the form of a project in order to promote learning on the design and analysis of algorithms. The project is based on the resolution of a real problem, the salesperson problem, and it is theoretically grounded on the fundamentals of mathematical modelling. In order to support the students’ work, a multimedia tool, called Graph-based Problem Explorer (GbPExplorer), has been designed and refined to promote the development of computer literacy in engineering and science university students. This tool incorporates several modules to allow coding different algorithmic techniques for solving the salesperson problem. Based on educational design research over five years, we observe that working with GbPExplorer during the project provides students with the possibility of representing the situation to be studied in the form of graphs and analyzing them from a computational point of view.
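
As an example of the kind of algorithmic technique students might code for the salesperson problem, here is a brief Python sketch of the nearest-neighbour heuristic. The abstract does not say which techniques GbPExplorer covers, so the choice of heuristic, the function names and the point representation are all assumptions made for illustration.

```python
import math


def tour_length(points, order):
    """Length of the closed tour that visits points in the given order."""
    n = len(order)
    return sum(math.dist(points[order[k]], points[order[(k + 1) % n]])
               for k in range(n))


def nearest_neighbour_tour(points, start=0):
    """Greedy tour: from the current point, always move to the closest
    unvisited point (ties broken by the lower index)."""
    unvisited = set(range(len(points))) - {start}
    order = [start]
    while unvisited:
        here = points[order[-1]]
        nxt = min(unvisited, key=lambda k: (math.dist(here, points[k]), k))
        order.append(nxt)
        unvisited.remove(nxt)
    return order
```

The heuristic runs in quadratic time and gives no optimality guarantee, which makes it a convenient baseline against which exact or improved methods can be compared.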


4SpecID: Reference DNA Libraries Auditing and Annotation System for Forensic Applications

Genes ◽

10.3390/genes12010061 ◽

2021 ◽

Vol 12 (1) ◽

pp. 61

Author(s):

Luís Neto ◽

Nádia Pinto ◽

Alberto Proença ◽

António Amorim ◽

Eduardo Conde-Sousa

Keyword(s):

Dna Sequences ◽

Forensic Genetics ◽

Software Tool ◽

Taxonomic Assignment ◽

Specific Data ◽

Annotation System ◽

Dna Libraries ◽

Best Execution ◽

User Friendly

Forensic genetics is a fast-growing field that frequently requires DNA-based taxonomy, namely when the evidence consists of parts of specimens, often highly processed in food, potions, or ointments. Reference DNA sequence libraries, such as BOLD or GenBank, are imperative tools for taxonomic assignment, particularly when morphology is inadequate for classification. The auditing and curation of these datasets require reliable mechanisms, preferably with automated data preprocessing. Software tools were developed to grade these datasets considering the number of records as the primary criterion, which is not compliant with forensic standards, where the priority is validation from independent sources. 4SpecID is an efficient and freely available software tool developed to audit and annotate reference libraries, specifically designed for forensic applications. Its intuitive, user-friendly interface accesses virtually any database and includes specific data mining functions tuned for the widespread BOLD repositories. The tool was evaluated on a MacBook laptop and a dual-Xeon server with a large BOLD dataset (Culicidae, 36,115 records), and the best execution time to grade the dataset on the laptop was 0.28 s. Datasets of the Bovidae and Felidae families were used to evaluate the quality of the tool and the relevance of independent sources validation.
