sequence similarity search Latest Research Papers

Minimally-overlapping words for sequence similarity search

Abstract Motivation: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via “seeds”: simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. Results: Here we study a simple sparse-seeding method: using seeds at positions of certain “words” (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed “minimizer” sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it. Availability and Implementation: Software to design and test minimally-overlapping words is freely available at https://gitlab.com/mcfrith/noverlap. Supplementary information: Supplementary data are available at Bioinformatics online.

Rfam 14: expanded coverage of metagenomic, viral and microRNA families

Nucleic Acids Research ◽

10.1093/nar/gkaa1047 ◽

2020 ◽

Vol 49 (D1) ◽

pp. D192-D200 ◽

Cited By ~ 2

Author(s):

Ioanna Kalvari ◽

Eric P Nawrocki ◽

Nancy Ontiveros-Palacios ◽

Joanna Argasinska ◽

Kevin Lamkiewicz ◽

...

Keyword(s):

Similarity Search ◽

Sequence Similarity ◽

Sequence Similarity Search ◽

Covariance Model ◽

RNA Sequences ◽

Multiple Sequence ◽

The Family ◽

Recent Developments ◽

Community Contribution ◽

Website Features

Abstract Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.

Minimally-overlapping words for sequence similarity search

10.1101/2020.07.24.220616 ◽

2020 ◽

Author(s):

Martin C. Frith ◽

Laurent Noé ◽

Gregory Kucherov

Keyword(s):

Big Data ◽

Similarity Search ◽

Sequence Similarity ◽

Random Sequence ◽

Sequence Similarity Search ◽

Huge Data ◽

Open Questions ◽

Seeding Method ◽

Genetic Sequences

Abstract Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via “seeds”: simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. Here we study a simple sparse-seeding method: using seeds at positions of certain “words” (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed “minimizer” sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.
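The word-based sparse seeding described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' noverlap software; the function name `word_positions` and the example sequence are hypothetical. It only shows how the example word set (ac, at, gc, gt) selects a subset of positions, and why those positions cannot be adjacent: every word starts with a or g and ends with c or t, so no word can begin at the second letter of another.

```python
def word_positions(seq, words):
    """Return the start positions in seq where any of the given words occurs."""
    seq = seq.lower()
    lengths = {len(w) for w in words}
    hits = []
    for i in range(len(seq)):
        # A position is a seed candidate if some word starts exactly here.
        if any(seq[i:i + k] in words for k in lengths):
            hits.append(i)
    return hits


# Example word set taken from the abstract; the sequence is made up.
WORDS = {"ac", "at", "gc", "gt"}
positions = word_positions("ggatcacgtta", WORDS)
gaps = [b - a for a, b in zip(positions, positions[1:])]
print(positions, gaps)
```

With this word set, consecutive selected positions are always at least two apart, which is a small-scale picture of the "anti-clumped" behaviour the abstract mentions.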

Recommendation with Temporal Dynamics Based on Sequence Similarity Search

Algorithms and Architectures for Parallel Processing - Lecture Notes in Computer Science ◽

10.1007/978-3-030-60239-0_47 ◽

2020 ◽

pp. 690-704

Author(s):

Guang Yang ◽

Xiaoguang Hong ◽

Zhaohui Peng

Keyword(s):

Similarity Search ◽

Temporal Dynamics ◽

Sequence Similarity ◽

Sequence Similarity Search

Unaligned Sequence Similarity Search Using Deep Learning

2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm47256.2019.8983072 ◽

2019 ◽

Author(s):

James K. Senter ◽

Taylor M. Royalty ◽

Andrew D. Steen ◽

Amir Sadovnik

Keyword(s):

Deep Learning ◽

Similarity Search ◽

Sequence Similarity ◽

Sequence Similarity Search

A Simplified Description of Child Tables for Sequence Similarity Search

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2018.2796064 ◽

2018 ◽

Vol 15 (6) ◽

pp. 2067-2073

Author(s):

Martin C. Frith ◽

Anish M. S. Shrestha

Keyword(s):

Similarity Search ◽

Sequence Similarity ◽

Sequence Similarity Search

Database-integrated genome screening (DIGS): exploring genomes heuristically using sequence similarity search tools and a relational database

10.1101/246835 ◽

2018 ◽

Cited By ~ 11

Author(s):

Henan Zhu ◽

Tristan Dennis ◽

Joseph Hughes ◽

Robert J. Gifford

Keyword(s):

DNA Sequences ◽

Similarity Search ◽

Sequence Similarity ◽

Genomic Diversity ◽

Biological Information ◽

Supplementary Information ◽

Sequence Similarity Search ◽

Genome Screening ◽

Mammalian Genomes ◽

Endogenous Viral Elements

Abstract A significant fraction of most genomes is comprised of DNA sequences that have been incompletely investigated. This genomic ‘dark matter’ contains a wealth of useful biological information that can be recovered by systematically screening genomes in silico using sequence similarity search tools. Specialized computational tools are required to implement these screens efficiently. Here, we describe the database-integrated genome-screening (DIGS) tool: a computational framework for performing these investigations. To demonstrate, we screen mammalian genomes for endogenous viral elements (EVEs) derived from the Filoviridae, Parvoviridae, Circoviridae and Bornaviridae families, identifying numerous novel elements in addition to those that have been described previously. The DIGS tool provides a simple, robust framework for implementing a broad range of heuristic, sequence analysis-based explorations of genomic diversity. Availability: http://giffordlabcvr.github.io/DIGS-tool/ Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Alignment-Free Approaches for Predicting Novel Nuclear Mitochondrial Segments (NUMTs) in the Human Genome

10.1101/239053 ◽

2017 ◽

Author(s):

Wentian Li ◽

Jerome Freudenberg ◽

Jan Freudenberg

Keyword(s):

Human Genome ◽

Similarity Search ◽

Sequence Similarity ◽

Local Alignment ◽

Sequence Similarity Search ◽

Alignment Free ◽

Search Tool ◽

Manhattan Plot ◽

Mitochondrial Origin ◽

Jensen Shannon Divergence

Abstract The nuclear human genome harbors sequences of mitochondrial origin, indicating an ancestral transfer of DNA from the mitogenome. Several Nuclear Mitochondrial Segments (NUMTs) have been detected by alignment-based sequence similarity search, as implemented in the Basic Local Alignment Search Tool (BLAST). Identifying NUMTs is important for the comprehensive annotation and understanding of the human genome. Here we explore the possibility of detecting NUMTs in the human genome by alignment-free sequence similarity search, such as comparison of k-mer (k-tuple, k-gram, oligo of length k) distributions. We find that when k=6 or larger, the k-mer approach and BLAST search produce almost identical results, e.g., they detect the same set of NUMTs longer than 3kb. However, when k=5 or k=4, certain signals are only detected by the alignment-free approach, and these may indicate as yet unrecognized, and potentially more ancestral, NUMTs. We introduce a “Manhattan plot” style representation of NUMT predictions across the genome, which are calculated based on the reciprocal of the Jensen-Shannon divergence between the nuclear and mitochondrial k-mer frequencies. Further inspection of the k-mer-based NUMT predictions, however, shows that most of them contain long-terminal-repeat (LTR) annotations, whereas BLAST-based NUMT predictions do not. Thus, similarity of the mitogenome to LTR sequences is recognized, which we validate by finding the mitochondrial k-mer distribution closer to those for transposable sequences and, specifically, close to some types of LTR.
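The alignment-free comparison described in this abstract rests on two ingredients: k-mer frequency distributions and the Jensen-Shannon divergence between them. The following Python sketch shows one plain way to compute both. It is an illustration only, not the authors' pipeline; the function names `kmer_freqs` and `jensen_shannon`, the base-2 logarithm, and the toy sequences are assumptions made here.

```python
from collections import Counter
from math import log2


def kmer_freqs(seq, k):
    """Relative frequencies of all overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2, so between 0 and 1) of two distributions."""
    divergence = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            divergence += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * log2(qi / mi)
    return divergence


# Toy sequences (made up): a window and a reference to compare at k = 4.
window = kmer_freqs("acgtacgtacgtacgt", 4)
reference = kmer_freqs("acgtacgaacgtaggt", 4)
print(jensen_shannon(window, reference))
```

A small divergence means the two k-mer distributions are similar, which is why the abstract scores candidate regions by the reciprocal of the divergence; a real scan would also need to handle a divergence of exactly zero before taking that reciprocal.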

The HMMER Web Server for Protein Sequence Similarity Search

Current Protocols in Bioinformatics ◽

10.1002/cpbi.40 ◽

2017 ◽

Vol 60 (1) ◽

Cited By ~ 26

Author(s):

Ananth Prakash ◽

Matt Jeffryes ◽

Alex Bateman ◽

Robert D. Finn

Keyword(s):

Protein Sequence ◽

Similarity Search ◽

Sequence Similarity ◽

Web Server ◽

Sequence Similarity Search ◽

Protein Sequence Similarity

A Massively Parallel Sequence Similarity Search for Metagenomic Sequencing Data

International Journal of Molecular Sciences ◽

10.3390/ijms18102124 ◽

2017 ◽

Vol 18 (10) ◽

pp. 2124 ◽

Cited By ~ 2

Author(s):

Masanori Kakuta ◽

Shuji Suzuki ◽

Kazuki Izawa ◽

Takashi Ishida ◽

Yutaka Akiyama

Keyword(s):

Similarity Search ◽

Sequence Similarity ◽

Massively Parallel ◽

Metagenomic Sequencing ◽

Sequence Similarity Search ◽

Sequencing Data
