Rfam 14: expanded coverage of metagenomic, viral and microRNA families

Abstract

Rfam is a database of RNA families in which each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs and on viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families, starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.

Sequence similarity search, Multiple Sequence Alignment, Model Selection, Distance Matrix and Phylogeny Reconstruction

Protocol Exchange, 10.1038/protex.2013.065, 2013, Cited By ~ 12

Author(s): Felix Bast

Keyword(s): Model Selection, Sequence Alignment, Multiple Sequence Alignment, Similarity Search, Sequence Similarity, Distance Matrix, Phylogeny Reconstruction, Sequence Similarity Search, Multiple Sequence, Alignment Model

Filtering Redundancies For Sequence Similarity Search Programs

Journal of Biomolecular Structure and Dynamics, 10.1080/07391102.2005.10507020, 2005, Vol 22 (4), pp. 487-492

Author(s): Hubert Cantalloube, Jacques Chomilier, Sylvain Chiusa, Mathieu Lonquety, Jean-Louis Spadoni, ...

Keyword(s): Similarity Search, Sequence Similarity, Sequence Similarity Search

Minimally-overlapping words for sequence similarity search

Bioinformatics, 10.1093/bioinformatics/btaa1054, 2020

Author(s): Martin C Frith, Laurent Noé, Gregory Kucherov

Keyword(s): Similarity Search, Sequence Similarity, Random Sequence, Supplementary Information, Sequence Similarity Search, Supplementary Data, Huge Data, Open Questions, Seeding Method, Genetic Sequences

Abstract

Motivation: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via “seeds”: simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.

Results: Here we study a simple sparse-seeding method: using seeds at positions of certain “words” (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed “minimizer” sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.

Availability and Implementation: Software to design and test minimally-overlapping words is freely available at https://gitlab.com/mcfrith/noverlap.

Supplementary information: Supplementary data are available at Bioinformatics online.
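The word-based sparse seeding described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' software: the function names `seed_positions` and `gaps` are hypothetical, and the comparison word set `CLUMPY_WORDS` is an assumption chosen here only to contrast with the example words ac, at, gc, gt quoted above. The sketch picks seed positions wherever one of the words begins and reports the spacing between consecutive seeds.

```python
import random


def seed_positions(seq, words):
    """Return the start positions in seq where any word in words begins."""
    seq = seq.lower()
    return [i for i in range(len(seq))
            if any(seq.startswith(w, i) for w in words)]


def gaps(positions):
    """Distances between consecutive seed positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Word set quoted in the abstract. Every word starts with a/g and ends
# with c/t, so no word can begin one base after another word begins.
SPARSE_WORDS = ("ac", "at", "gc", "gt")

# Comparison set chosen for this sketch only (an assumption, not from the
# paper): same number of words, but they can overlap, e.g. "aa" in "aaa".
CLUMPY_WORDS = ("aa", "ac", "ca", "cc")

if __name__ == "__main__":
    rng = random.Random(0)
    seq = "".join(rng.choice("acgt") for _ in range(10_000))
    for name, words in (("sparse", SPARSE_WORDS), ("clumpy", CLUMPY_WORDS)):
        pos = seed_positions(seq, words)
        print(name, "seeds:", len(pos), "smallest gap:", min(gaps(pos)))
```

Both word sets select positions at a similar rate in a uniform random sequence, but only the overlapping set can place seeds at adjacent positions, which is one simple way to see the clumping that the abstract contrasts with minimally-overlapping words.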
