Alignment-Free Approaches for Predicting Novel Nuclear Mitochondrial Segments (NUMTs) in the Human Genome

Abstract. The human nuclear genome harbors sequences of mitochondrial origin, indicating an ancestral transfer of DNA from the mitogenome. Several Nuclear Mitochondrial Segments (NUMTs) have been detected by alignment-based sequence similarity search, as implemented in the Basic Local Alignment Search Tool (BLAST). Identifying NUMTs is important for the comprehensive annotation and understanding of the human genome. Here we explore the possibility of detecting NUMTs in the human genome by alignment-free sequence similarity search based on the distributions of k-mers (k-tuples, k-grams, oligos of length k). We find that when k=6 or larger, the k-mer approach and BLAST search produce almost identical results, e.g., both detect the same set of NUMTs longer than 3 kb. However, when k=5 or k=4, certain signals are detected only by the alignment-free approach, and these may indicate as yet unrecognized, and potentially more ancestral, NUMTs. We introduce a “Manhattan plot” style representation of NUMT predictions across the genome, which are calculated based on the reciprocal of the Jensen-Shannon divergence between the nuclear and mitochondrial k-mer frequencies. Further inspection of the k-mer-based NUMT predictions shows, however, that most of them contain long-terminal-repeat (LTR) annotations, whereas BLAST-based NUMT predictions do not. Thus, a similarity of the mitogenome to LTR sequences is recognized, which we validate by showing that the mitochondrial k-mer distribution is closer to those of transposable sequences and, specifically, to some types of LTR.
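
The scoring idea in this abstract (the reciprocal of the Jensen-Shannon divergence between two k-mer frequency distributions) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the base-2 logarithm, the skipping of k-mers that contain non-ACGT characters, and the small `eps` guard against division by zero are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import log2


def kmer_freqs(seq, k):
    """Relative frequencies of all 4**k DNA k-mers, counted over overlapping windows."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Assumption: k-mers containing characters other than A, C, G, T are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute nothing.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def similarity_score(window, reference, k=4, eps=1e-9):
    """Reciprocal of the JS divergence; eps is a hypothetical guard for identical inputs."""
    d = js_divergence(kmer_freqs(window, k), kmer_freqs(reference, k))
    return 1.0 / (d + eps)
```

Under these assumptions, one would call `similarity_score(window, mitogenome, k=4)` for successive windows of a nuclear chromosome and plot the scores by genomic position, which is the kind of signal a Manhattan-style plot would display.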

RAFTS3: Rapid Alignment-Free Tool for Sequence Similarity Search

10.1101/055269 ◽

2016 ◽

Cited By ~ 11

Author(s):

Ricardo Assunção Vialle ◽

Fábio de Oliveira Pedrosa ◽

Vinicius Almir Weiss ◽

Dieval Guizelini ◽

Juliana Helena Tibaes ◽

...

Keyword(s):

Similarity Search ◽

Sequence Similarity ◽

Biological Data ◽

Amino Acid Residues ◽

Binary Matrix ◽

Protein Databases ◽

Alignment Free ◽

Large Databases ◽

Search Tool ◽

Time Required

Abstract. Background: Similarity search of a given protein sequence against a database is an essential task in genome analysis. Sequence alignment is the most used method to perform such analysis. Although this approach is efficient, the time required to perform searches against large databases is always a challenge. Alignment-free techniques offer alternatives for comparing sequences without the need for alignment. Results: Here we developed RAFTS3, a fast protein similarity search tool that utilizes a filter step for candidate selection based on shared k-mers and a comparison measure using a binary matrix of co-occurrence of amino acid residues. RAFTS3 performed searches many times faster than those with BLASTp against large protein databases, such as NR, Pfam or UniRef, with a small loss of sensitivity depending on the similarity degree of the sequences. Conclusions: RAFTS3 is a new alternative for fast comparison of protein sequences, genome annotation and biological data mining. The source code and the standalone files for Windows and Linux platforms are available at: https://sourceforge.net/projects/rafts3/
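
A filter step based on shared k-mers, as described in this abstract, can be sketched roughly as follows in Python. This is only an illustration of the general idea, not the RAFTS3 implementation: the inverted-index layout, the function names, and the `min_shared` threshold are assumptions, and the co-occurrence binary-matrix comparison is not reproduced here because the abstract does not specify it.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Set of distinct overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(db, k):
    """Inverted index: k-mer -> names of database sequences containing it."""
    index = defaultdict(set)
    for name, seq in db.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def candidates(query, index, k, min_shared=1):
    """Database sequences sharing at least min_shared k-mers with the query,
    ordered by the number of shared k-mers (most first)."""
    hits = Counter()
    for km in kmers(query, k):
        for name in index.get(km, ()):
            hits[name] += 1
    return [(name, n) for name, n in hits.most_common() if n >= min_shared]


# Toy database with made-up sequences, for illustration only.
db = {"a": "MKVLAAGIV", "b": "MKVLTTTTT", "c": "QQQQQQ"}
index = build_index(db, k=3)
print(candidates("MKVLAAG", index, k=3))  # [('a', 5), ('b', 2)]
```

Only the sequences returned by such a filter would then be passed to a more expensive comparison measure, which is where the time saving over a full alignment search comes from.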

BEAUTY: an enhanced BLAST-based search tool that integrates multiple biological information resources into sequence similarity search results.

Genome Research ◽

10.1101/gr.5.2.173 ◽

1995 ◽

Vol 5 (2) ◽

pp. 173-184 ◽

Cited By ~ 154

Author(s):

K C Worley ◽

B A Wiese ◽

R F Smith

Keyword(s):

Similarity Search ◽

Sequence Similarity ◽

Information Resources ◽

Biological Information ◽

Sequence Similarity Search ◽

Search Results ◽

Search Tool

Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins

Scientific Reports ◽

10.1038/s41598-021-81063-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Dimitri Boeckaerts ◽

Michiel Stock ◽

Bjorn Criel ◽

Hans Gerstmans ◽

Bernard De Baets ◽

...

Keyword(s):

Machine Learning ◽

Predictive Model ◽

Receptor Binding ◽

Bacterial Infections ◽

Sequence Data ◽

Sequence Similarity ◽

Area Under The Curve ◽

Local Alignment ◽

Search Tool ◽

Different Levels

Abstract. Nowadays, bacteriophages are increasingly considered as an alternative treatment for a variety of bacterial infections in cases where classical antibiotics have become ineffective. However, characterizing the host specificity of phages remains a labor- and time-intensive process. In order to alleviate this burden, we have developed a new machine-learning-based pipeline to predict bacteriophage hosts based on annotated receptor-binding protein (RBP) sequence data. We focus on predicting bacterial hosts from the ESKAPE group, Escherichia coli, Salmonella enterica and Clostridium difficile. We compare the performance of our predictive model with that of the widely used Basic Local Alignment Search Tool (BLAST). Our best-performing predictive model reaches Precision-Recall Area Under the Curve (PR-AUC) scores between 73.6 and 93.8% for different levels of sequence similarity in the collected data. Our model reaches a performance comparable to that of BLASTp when sequence similarity in the data is high and starts outperforming BLASTp when sequence similarity drops below 75%. Therefore, our machine learning methods can be especially useful in settings in which sequence similarity to other known sequences is low. Predicting the hosts of novel metagenomic RBP sequences could extend our toolbox to tune the host spectrum of phages or phage tail-like bacteriocins by swapping RBPs.

Filtering Redundancies For Sequence Similarity Search Programs

Journal of Biomolecular Structure and Dynamics ◽

10.1080/07391102.2005.10507020 ◽

2005 ◽

Vol 22 (4) ◽

pp. 487-492

Author(s):

Hubert Cantalloube ◽

Jacques Chomilier ◽

Sylvain Chiusa ◽

Mathieu Lonquety ◽

Jean-Louis Spadoni ◽

...

Keyword(s):

Similarity Search ◽

Sequence Similarity ◽

Sequence Similarity Search

Rfam 14: expanded coverage of metagenomic, viral and microRNA families

Nucleic Acids Research ◽

10.1093/nar/gkaa1047 ◽

2020 ◽

Vol 49 (D1) ◽

pp. D192-D200 ◽

Cited By ~ 2

Author(s):

Ioanna Kalvari ◽

Eric P Nawrocki ◽

Nancy Ontiveros-Palacios ◽

Joanna Argasinska ◽

Kevin Lamkiewicz ◽

...

Keyword(s):

Similarity Search ◽

Sequence Similarity ◽

Sequence Similarity Search ◽

Covariance Model ◽

Rna Sequences ◽

Multiple Sequence ◽

The Family ◽

Recent Developments ◽

Community Contribution ◽

Website Features

Abstract. Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.

Minimally-overlapping words for sequence similarity search

Bioinformatics ◽

10.1093/bioinformatics/btaa1054 ◽

2020 ◽

Author(s):

Martin C Frith ◽

Laurent Noé ◽

Gregory Kucherov

Keyword(s):

Similarity Search ◽

Sequence Similarity ◽

Random Sequence ◽

Supplementary Information ◽

Sequence Similarity Search ◽

Supplementary Data ◽

Huge Data ◽

Open Questions ◽

Seeding Method ◽

Genetic Sequences

Abstract. Motivation: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via “seeds”: simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. Results: Here we study a simple sparse-seeding method: using seeds at positions of certain “words” (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed “minimizer” sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it. Availability and Implementation: Software to design and test minimally-overlapping words is freely available at https://gitlab.com/mcfrith/noverlap. Supplementary information: Supplementary data are available at Bioinformatics online.
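
The sparse-seeding idea in this abstract (placing seeds only where certain words occur) can be illustrated with a short Python sketch. It is not the noverlap software: the function names are hypothetical, and the overlap check below is a simplified suffix-prefix test that may be narrower than the paper's definition of minimal overlap.

```python
def seed_positions(seq, words):
    """Start positions in seq where any of the chosen (lower-case) words occurs."""
    seq = seq.lower()
    lengths = {len(w) for w in words}
    return [i for i in range(len(seq))
            if any(seq[i:i + n] in words for n in lengths)]


def words_can_overlap(words):
    """True if a proper suffix of one word equals a prefix of another (or the same)
    word, i.e. two occurrences could start closer together than a full word length.
    Simplified check for illustration only."""
    for u in words:
        for v in words:
            for n in range(1, min(len(u), len(v))):
                if u[-n:] == v[:n]:
                    return True
    return False


words = {"ac", "at", "gc", "gt"}       # example words quoted in the abstract
print(seed_positions("acgtacca", words))  # [0, 2, 4]
print(words_can_overlap(words))           # False
```

In this example the words all start with a or g and end with c or t, so no occurrence can begin inside another one, and seed positions cannot be adjacent. A set such as {"aa"} fails the check, since occurrences of "aa" can clump together in runs of a.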
