Minimally-overlapping words for sequence similarity search

Bioinformatics ◽ 10.1093/bioinformatics/btaa1054 ◽ 2020

Author(s):

Martin C Frith ◽ Laurent Noé ◽ Gregory Kucherov

Keyword(s):

Similarity Search ◽ Sequence Similarity ◽ Random Sequence ◽ Supplementary Information ◽ Sequence Similarity Search ◽ Supplementary Data ◽ Huge Data ◽ Open Questions ◽ Seeding Method ◽ Genetic Sequences

Abstract

Motivation: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via “seeds”: simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.

Results: Here we study a simple sparse-seeding method: using seeds at positions of certain “words” (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed “minimizer” sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.

Availability and Implementation: Software to design and test minimally-overlapping words is freely available at https://gitlab.com/mcfrith/noverlap.

Supplementary information: Supplementary data are available at Bioinformatics online.
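The sparse-seeding idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' noverlap software; the function names `word_positions` and `min_gap` and the test sequences are hypothetical, chosen only for illustration. The sketch lists the positions where a word from a chosen set starts, and shows that the example words ac, at, gc, gt can never start at two adjacent positions (the second letter of each word is c or t, and no word begins with c or t), whereas a self-overlapping word such as aa can clump.

```python
def word_positions(seq, words):
    """Return the start positions in seq where any word from words occurs."""
    seq = seq.lower()
    wordset = {w.lower() for w in words}
    lengths = {len(w) for w in wordset}
    # A slice that runs past the end of seq is shorter than the word,
    # so it cannot match and needs no special handling.
    return [i for i in range(len(seq))
            if any(seq[i:i + k] in wordset for k in lengths)]


def min_gap(positions):
    """Smallest distance between consecutive positions (None if fewer than two)."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return min(gaps) if gaps else None


# Words from the abstract's example: occurrences cannot overlap each other.
sparse = word_positions("acgtatgcaacgt", {"ac", "at", "gc", "gt"})
print(sparse, min_gap(sparse))    # [0, 2, 4, 6, 9, 11] 2

# A self-overlapping word, for contrast: occurrences clump together.
clumped = word_positions("aaaa", {"aa"})
print(clumped, min_gap(clumped))  # [0, 1, 2] 1
```

In a sparse-seeding scheme of this kind, seeds would be considered only at the returned positions; how to choose the word set well is the question the paper addresses.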

Minimally-overlapping words for sequence similarity search

10.1101/2020.07.24.220616 ◽ 2020

Author(s):

Martin C. Frith ◽ Laurent Noé ◽ Gregory Kucherov

Keyword(s):

Big Data ◽ Similarity Search ◽ Sequence Similarity ◽ Random Sequence ◽ Sequence Similarity Search ◽ Huge Data ◽ Open Questions ◽ Seeding Method ◽ Genetic Sequences

Abstract

Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via “seeds”: simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.

Here we study a simple sparse-seeding method: using seeds at positions of certain “words” (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed “minimizer” sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.

IGLOSS: iterative gapless local similarity search

Bioinformatics ◽ 10.1093/bioinformatics/btz086 ◽ 2019 ◽ Vol 35 (18) ◽ pp. 3491-3492

Author(s):

Braslav Rabar ◽ Maja Zagorščak ◽ Strahil Ristov ◽ Martin Rosenzweig ◽ Pavle Goldstein

Keyword(s):

Parameter Estimation ◽ Similarity Search ◽ Sequence Similarity ◽ Web Server ◽ Supplementary Information ◽ Local Similarity ◽ Supplementary Data ◽ Matching Algorithm ◽ Local Sequence ◽ Sequence Patterns

Abstract

Summary: Searching for local sequence patterns is one of the basic tasks in bioinformatics. Sequence patterns might have structural, functional or some other relevance, and numerous methods have been developed to detect and analyze them. These methods often depend on the wealth of information already collected. The explosion in the number of newly available sequences calls for novel methods to explore local sequence similarity. We have developed a new method for iterative motif scanning that will look for ungapped sequence patterns similar to a submitted query. Using careful parameter estimation and an adaptation of a fast string-matching algorithm, the method performs significantly better in this context than the existing software.

Availability and implementation: The IGLOSS web server is available at http://compbioserv.math.hr/igloss/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Database-integrated genome screening (DIGS): exploring genomes heuristically using sequence similarity search tools and a relational database

10.1101/246835 ◽ 2018 ◽ Cited By ~ 11

Author(s):

Henan Zhu ◽ Tristan Dennis ◽ Joseph Hughes ◽ Robert J. Gifford

Keyword(s):

DNA Sequences ◽ Similarity Search ◽ Sequence Similarity ◽ Genomic Diversity ◽ Biological Information ◽ Supplementary Information ◽ Sequence Similarity Search ◽ Genome Screening ◽ Mammalian Genomes ◽ Endogenous Viral Elements

Abstract

A significant fraction of most genomes is comprised of DNA sequences that have been incompletely investigated. This genomic ‘dark matter’ contains a wealth of useful biological information that can be recovered by systematically screening genomes in silico using sequence similarity search tools. Specialized computational tools are required to implement these screens efficiently. Here, we describe the database-integrated genome-screening (DIGS) tool: a computational framework for performing these investigations. To demonstrate, we screen mammalian genomes for endogenous viral elements (EVEs) derived from the Filoviridae, Parvoviridae, Circoviridae and Bornaviridae families, identifying numerous novel elements in addition to those that have been described previously. The DIGS tool provides a simple, robust framework for implementing a broad range of heuristic, sequence analysis-based explorations of genomic diversity.

Availability: http://giffordlabcvr.github.io/DIGS-tool/

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

Filtering Redundancies For Sequence Similarity Search Programs

Journal of Biomolecular Structure and Dynamics ◽ 10.1080/07391102.2005.10507020 ◽ 2005 ◽ Vol 22 (4) ◽ pp. 487-492

Author(s):

Hubert Cantalloube ◽ Jacques Chomilier ◽ Sylvain Chiusa ◽ Mathieu Lonquety ◽ Jean-Louis Spadoni ◽ ...

Keyword(s):

Similarity Search ◽ Sequence Similarity ◽ Sequence Similarity Search

Rfam 14: expanded coverage of metagenomic, viral and microRNA families

Nucleic Acids Research ◽ 10.1093/nar/gkaa1047 ◽ 2020 ◽ Vol 49 (D1) ◽ pp. D192-D200 ◽ Cited By ~ 2

Author(s):

Ioanna Kalvari ◽ Eric P Nawrocki ◽ Nancy Ontiveros-Palacios ◽ Joanna Argasinska ◽ Kevin Lamkiewicz ◽ ...

Keyword(s):

Similarity Search ◽ Sequence Similarity ◽ Sequence Similarity Search ◽ Covariance Model ◽ RNA Sequences ◽ Multiple Sequence ◽ The Family ◽ Recent Developments ◽ Community Contribution ◽ Website Features

Abstract

Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.

PROCAIN server for remote protein sequence similarity search

Bioinformatics ◽ 10.1093/bioinformatics/btp346 ◽ 2009 ◽ Vol 25 (16) ◽ pp. 2076-2077 ◽ Cited By ~ 5

Author(s):

Y. Wang ◽ R. I. Sadreyev ◽ N. V. Grishin

Keyword(s):

Protein Sequence ◽ Similarity Search ◽ Sequence Similarity ◽ Sequence Similarity Search ◽ Protein Sequence Similarity

Powerful Sequence Similarity Search Methods and In-Depth Manual Analyses Can Identify Remote Homologs in Many Apparently "Orphan" Viral Proteins

Journal of Virology ◽ 10.1128/jvi.02595-13 ◽ 2013 ◽ Vol 88 (1) ◽ pp. 10-20 ◽ Cited By ~ 56

Author(s):

D. B. Kuchibhatla ◽ W. A. Sherman ◽ B. Y. W. Chung ◽ S. Cook ◽ G. Schneider ◽ ...

Keyword(s):

Similarity Search ◽ Sequence Similarity ◽ Viral Proteins ◽ Sequence Similarity Search ◽ Search Methods ◽ Remote Homologs

Efficient Algorithm for Sequence Similarity Search Based on Reference Indexing

Journal of Software ◽ 10.3724/sp.j.1001.2010.03610 ◽ 2010 ◽ Vol 21 (4) ◽ pp. 718-731 ◽ Cited By ~ 3

Author(s):

Dong-Bo DAI ◽ Yun XIONG ◽ Yang-Yong ZHU

Keyword(s):

Efficient Algorithm ◽ Similarity Search ◽ Sequence Similarity ◽ Sequence Similarity Search

Efficient filtration of sequence similarity search through singular value decomposition

Proceedings. Fourth IEEE Symposium on Bioinformatics and Bioengineering ◽ 10.1109/bibe.2004.1317371 ◽ 2004 ◽ Cited By ~ 2

Author(s):

S. Alireza Aghili ◽ O.D. Sahin ◽ D. Agrawal ◽ Amr El Abbadi

Keyword(s):

Singular Value Decomposition ◽ Similarity Search ◽ Sequence Similarity ◽ Singular Value ◽ Sequence Similarity Search ◽ Value Decomposition

Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases

International Journal of Computer Applications ◽ 10.5120/13516-1295 ◽ 2013 ◽ Vol 78 (9) ◽ pp. 13-17

Author(s):

Robinson Silvester.A ◽ J. Cruz Antony ◽ M. Pratheepa

Keyword(s):

DNA Sequence ◽ Similarity Search ◽ Sequence Similarity ◽ Sequence Similarity Search ◽ Sequence Databases
