An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction

Abstract. Background: Alignment-free methods for sequence comparison have become popular in many bioinformatics applications, particularly for estimating the sequence similarity measures used to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as those of multiple-sequence-alignment-based methods for the reconstruction of phylogenetic trees. Since computing ACSk takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that approximate ACSk have been introduced. Results: In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values than previously published linear-time greedy heuristics. Using four real datasets containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime and demonstrate its applicability to phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods while being comparable in its applications to phylogeny reconstruction. Conclusions: Our method produces a better approximation of ACSk and is applicable to the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language, and the source code is available at https://github.com/srirampc/adyar-rs.
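To make the approximated quantity concrete, here is a minimal brute-force sketch of ACS and ACSk. It assumes the usual definition: for each position i of one sequence, take the length of the longest substring starting at i that occurs in the other sequence with at most k mismatches, then average over all positions. This is not the linear-time heuristic described in the abstract, nor the adyar-rs implementation; the function names are hypothetical and the running time is far worse than that of the exact algorithm.

```python
def longest_k_mismatch_match(x, y, i, k):
    """Length of the longest substring of x starting at i that matches
    some substring of y with at most k mismatches (brute force)."""
    best = 0
    for j in range(len(y)):
        mismatches = 0
        length = 0
        # Extend the pairing x[i + length] / y[j + length] as far as allowed.
        while i + length < len(x) and j + length < len(y):
            if x[i + length] != y[j + length]:
                mismatches += 1
                if mismatches > k:
                    break
            length += 1
        best = max(best, length)
    return best


def acs_k(x, y, k=0):
    """Average, over all positions of x, of the longest k-mismatch match in y.
    With k = 0 this is the plain average common substring length (ACS)."""
    if not x:
        raise ValueError("x must be non-empty")
    total = sum(longest_k_mismatch_match(x, y, i, k) for i in range(len(x)))
    return total / len(x)


if __name__ == "__main__":
    print(acs_k("ACGTAC", "ACTTAC", k=0))
    print(acs_k("ACGTAC", "ACTTAC", k=1))
```

Allowing more mismatches can only lengthen each match, so the value for k = 1 is never smaller than the value for k = 0 on the same pair of sequences.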


Prot-SpaM: Fast alignment-free phylogeny reconstruction based on whole-proteome sequences

10.1101/306142 ◽

2018 ◽

Cited By ~ 3

Author(s):

Chris-Andre Leimeister ◽

Jendrik Schellhorn ◽

Svenja Schöbel ◽

Michael Gerth ◽

Christoph Bleidorn ◽

...

Keyword(s):

Word Frequency ◽

Sequence Comparison ◽

Phylogenetic Trees ◽

Sequence Similarity ◽

Source Code ◽

Active Area ◽

Genomic Sequences ◽

Phylogeny Reconstruction ◽

High Quality ◽

Alignment Free

Abstract. Word-based or ‘alignment-free’ sequence comparison has become an active area of research in bioinformatics. While previous word-frequency approaches calculated rough measures of sequence similarity or dissimilarity, some new alignment-free methods are able to accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Herein, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation of this approach is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can be used to calculate high-quality phylogenetic trees from whole-proteome sequences in a matter of seconds or minutes and often outperforms other alignment-free approaches. The source code of our software is available through GitHub: https://github.com/jschellh/ProtSpaM
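The idea behind spaced-word matches can be sketched as follows. The sketch assumes a binary pattern in which '1' marks a match position and '0' a don't-care position: two windows form a spaced-word match when they agree at every match position. The function names and the pattern are hypothetical, and the score-based filtering step that gives Filtered Spaced Word Matches its name is not shown. This is an illustration, not the Prot-SpaM implementation.

```python
from collections import defaultdict


def spaced_words(seq, pattern):
    """Map each spaced word to the start positions where it occurs in seq."""
    match_pos = [k for k, c in enumerate(pattern) if c == "1"]
    index = defaultdict(list)
    for i in range(len(seq) - len(pattern) + 1):
        word = "".join(seq[i + k] for k in match_pos)
        index[word].append(i)
    return index


def spaced_word_matches(a, b, pattern):
    """All (i, j) such that the windows at a[i:] and b[j:] agree at every
    match position of the pattern."""
    index_b = spaced_words(b, pattern)
    matches = []
    for word, positions_a in spaced_words(a, pattern).items():
        for i in positions_a:
            for j in index_b.get(word, []):
                matches.append((i, j))
    return sorted(matches)


def dont_care_mismatch_rate(a, b, pattern, matches):
    """Fraction of don't-care positions that differ across the given matches."""
    dont_care = [k for k, c in enumerate(pattern) if c == "0"]
    total = len(dont_care) * len(matches)
    if total == 0:
        return 0.0
    differing = sum(a[i + k] != b[j + k] for i, j in matches for k in dont_care)
    return differing / total


if __name__ == "__main__":
    hits = spaced_word_matches("ACGT", "ATGA", "101")
    print(hits, dont_care_mismatch_rate("ACGT", "ATGA", "101", hits))
```

The don't-care positions are what carry the signal: because they are not required to agree, the rate at which they differ can serve as raw material for a distance estimate.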


SWeeP: representing large biological sequences datasets in compact vectors

Scientific Reports ◽

10.1038/s41598-019-55627-4 ◽

2020 ◽

Vol 10 (1) ◽

Cited By ~ 2

Author(s):

Camilla Reginatto De Pierri ◽

Ricardo Voyceik ◽

Letícia Graziela Costa Santos de Mattos ◽

Mariane Gonçalves Kulik ◽

Josué Oliveira Camargo ◽

...

Keyword(s):

Phylogenetic Trees ◽

Computational Cost ◽

Large Data ◽

Data Sets ◽

Biological Sequences ◽

Biological Sequence ◽

Sequence Comparisons ◽

Alignment Free ◽

Higher Dimensional ◽

Sequence Representation

Abstract. Vectorial and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here, we present “Spaced Words Projection (SWeeP)”, a method for representing biological sequences using relatively small vectors while preserving intersequence comparability. SWeeP scans the sequences for spaced words and generates indices from them to create a high-dimensional vector, which is later projected onto a smaller, randomly oriented orthonormal basis. We constructed phylogenetic trees for all organisms with mitochondrial and bacterial protein data in the NCBI database. SWeeP quickly built complete and accurate trees for these organisms at low computational cost. We compared SWeeP to other alignment-free methods, and SWeeP was 10 to 100 times faster than the other techniques. A tool to build SWeeP vectors is available at https://sourceforge.net/projects/spacedwordsprojection/.
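The two stages named in the abstract, indexing spaced words into a high-dimensional vector and projecting it onto a small random orthonormal basis, can be illustrated with a minimal pure-Python sketch. The mask "101", the presence/absence encoding, the target dimension and all function names are assumptions made for illustration; they are not taken from the SWeeP tool.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
CODE = {aa: n for n, aa in enumerate(ALPHABET)}


def spaced_word_vector(seq, mask="101"):
    """High-dimensional presence/absence vector over all spaced words."""
    keep = [k for k, c in enumerate(mask) if c == "1"]
    vec = [0.0] * (len(ALPHABET) ** len(keep))
    for i in range(len(seq) - len(mask) + 1):
        letters = [seq[i + k] for k in keep]
        if any(c not in CODE for c in letters):
            continue  # skip windows containing non-standard residues
        index = 0
        for c in letters:
            index = index * len(ALPHABET) + CODE[c]
        vec[index] = 1.0
    return vec


def random_orthonormal_rows(n_rows, dim, seed=0):
    """Random orthonormal vectors built by Gram-Schmidt on Gaussian draws."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n_rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-12:
            rows.append([a / norm for a in v])
    return rows


def project(vec, rows):
    """Coordinates of vec in the low-dimensional orthonormal basis."""
    return [sum(a * b for a, b in zip(vec, r)) for r in rows]


if __name__ == "__main__":
    basis = random_orthonormal_rows(4, len(ALPHABET) ** 2, seed=1)
    print(project(spaced_word_vector("ACDEFG"), basis))
```

Because every sequence is projected with the same basis, the compact vectors of different sequences remain comparable with one another.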


An Alignment-free Heuristic for Fast Sequence Comparisons with Applications to Phylogeny Reconstruction

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB '18 ◽

10.1145/3233547.3233648 ◽

2018 ◽

Author(s):

Jodh Pannu ◽

Sriram P. Chockalingam ◽

Sharma V. Thankachan ◽

Srinivas Aluru

Keyword(s):

Phylogeny Reconstruction ◽

Sequence Comparisons ◽

Alignment Free


Contrastive learning on protein embeddings enlightens midnight zone at lightning speed

10.1101/2021.11.14.468528 ◽

2021 ◽

Author(s):

Michael Heinzinger ◽

Maria Littmann ◽

Ian Sillitoe ◽

Nicola Bordin ◽

Christine Orengo ◽

...

Keyword(s):

Structure Prediction ◽

Sequence Similarity ◽

3D Structure ◽

Three Dimensional ◽

Hierarchical Classification ◽

Language Models ◽

Sequence Alignments ◽

Sequence Comparisons ◽

Multiple Sequence ◽

3D Structures

Thanks to the recent advances in protein three-dimensional (3D) structure prediction, in particular through AlphaFold 2 and RoseTTAFold, the abundance of protein 3D information will explode over the next year(s). Expert resources based on 3D structures such as SCOP and CATH have been organizing the complex sequence-structure-function relations into a hierarchical classification schema. Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), which transfers annotations from a protein with experimentally known annotation to a query without annotation. Here, we presented a novel approach that expands the concept of HBI from a low-dimensional sequence-distance lookup to the level of a high-dimensional embedding-based annotation transfer (EAT). Secondly, we introduced a novel solution using single protein sequence representations from protein Language Models (pLMs), so-called embeddings (Prose, ESM-1b, ProtBERT, and ProtT5), as input to contrastive learning, by which a new set of embeddings was created that optimized constraints captured by hierarchical classifications of protein 3D structures. These new embeddings (dubbed ProtTucker) clearly improved what was historically referred to as threading or fold recognition. Thereby, the new embeddings enabled the intrusion into the midnight zone of protein comparisons, i.e., the region in which the level of pairwise sequence similarity is akin to that of random relations and is therefore hard to navigate by HBI methods. Cautious benchmarking showed that ProtTucker reached much further than advanced sequence comparisons without the need to compute alignments, allowing it to be orders of magnitude faster. Code is available at https://github.com/Rostlab/EAT.
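Embedding-based annotation transfer can be pictured as a nearest-neighbour lookup in embedding space, and contrastive learning as an objective that pulls same-class embeddings together. The sketch below assumes Euclidean distance and a triplet margin loss; the abstract states neither choice, so both are illustrative assumptions, and the function names and labels are hypothetical rather than taken from the EAT code base.

```python
from math import dist


def triplet_loss(anchor, positive, negative, margin=1.0):
    """One common contrastive objective (assumed here): the anchor should be
    closer to the same-class positive than to the negative by a margin."""
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)


def transfer_annotation(query, lookup):
    """Return the label of the nearest lookup embedding and its distance.

    lookup is a list of (embedding, label) pairs with known annotations.
    """
    embedding, label = min(lookup, key=lambda entry: dist(query, entry[0]))
    return label, dist(query, embedding)


if __name__ == "__main__":
    # Toy 2-D embeddings; real pLM embeddings have on the order of 1000 dimensions.
    lookup = [((0.0, 0.0), "class-A"), ((1.0, 1.0), "class-B")]
    print(transfer_annotation((0.9, 0.8), lookup))
```

The returned distance can act as a rough reliability signal: a query far from every lookup embedding gives little reason to trust the transferred label.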


A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies

Research Ideas and Outcomes ◽

10.3897/rio.5.e36178 ◽

2019 ◽

Vol 5 ◽

Cited By ~ 18

Author(s):

Alexis Criscuolo

Keyword(s):

Phylogenetic Trees ◽

Species Trees ◽

Sequence Alignments ◽

Multiple Sequence ◽

Internal Branch ◽

Multiple Sequence Alignments ◽

Fast Running ◽

Alignment Free ◽

Multiple Threads ◽

Genome Assemblies

This paper describes a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools. For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution. These pairwise evolutionary distances are then used to infer a phylogenetic tree and assess a confidence support for each internal branch. Analyses of both simulated and real genome datasets show that this bioinformatics procedure allows accurate phylogenetic trees to be reconstructed with fast running times, especially when launched on multiple threads. Implemented in a publicly available script, named JolyTree, this procedure is a useful approach for quickly inferring species trees without the burden and potential biases of multiple sequence alignments.
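The step that turns a pairwise dissimilarity into an estimated number of substitution events can be illustrated with a Jukes-Cantor-style correction. This is a stand-in chosen for illustration: the abstract does not say which transformation JolyTree applies, and the function name and the default parameter b = 3/4 (equal nucleotide frequencies) are assumptions.

```python
from math import inf, log


def corrected_distance(p, b=0.75):
    """Estimated substitutions per site from an observed dissimilarity p.

    Uses d = -b * ln(1 - p / b); with b = 3/4 this is the Jukes-Cantor
    correction. Saturated dissimilarities (p >= b) map to infinity.
    """
    if p < 0:
        raise ValueError("dissimilarity must be non-negative")
    if p >= b:
        return inf
    return -b * log(1.0 - p / b)


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.3):
        print(p, corrected_distance(p))
```

The correction always returns a value at least as large as p, since an observed difference can hide several substitutions at the same site.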


PyMod 3: a complete suite for structural bioinformatics in PyMOL

Bioinformatics ◽

10.1093/bioinformatics/btaa849 ◽

2020 ◽

Author(s):

Giacomo Janson ◽

Alessandro Paiardini

Keyword(s):

Phylogenetic Trees ◽

Model Building ◽

Sequence Similarity ◽

Structural Bioinformatics ◽

Structure Alignment ◽

Supplementary Information ◽

Large Set ◽

Loop Modeling ◽

Multiple Sequence ◽

Wide Range

Abstract. Summary: The PyMod project is designed to act as a fully integrated interface between the popular molecular graphics viewer PyMOL and some of the most frequently used tools for structural bioinformatics, e.g. BLAST, HMMER, Clustal, MUSCLE, PSIPRED, DOPE and MODELLER. Here we report its latest release, PyMod 3, which has been completely renewed with a graphical interface written in PyQt, to make it compatible with the most recent PyMOL versions, and has been extended with a large set of new functionalities compared to its predecessor, i.e. PyMod 2. Starting from the amino acid sequence of a target protein, users can take advantage of PyMod 3 to carry out all the steps of the homology modeling process (i.e. template searching, target–template sequence alignment, model building and quality assessment). Additionally, the integrated tools in PyMod 3 may also be used alone, in order to extend PyMOL with a wide range of capabilities. Sequence similarity searches, multiple sequence/structure alignment building, phylogenetic trees and evolutionary conservation analyses, domain parsing, single/multiple chains and loop modeling can be performed in the PyMod 3/PyMOL environment. Availability and implementation: A cross-platform PyMod 3 installer package for Windows, Linux and Mac OS X and a complete user guide with tutorials are available at https://github.com/pymodproject/pymod. Supplementary information: Supplementary data are available at Bioinformatics online.


Sequence similarity search, Multiple Sequence Alignment, Model Selection, Distance Matrix and Phylogeny Reconstruction

Protocol Exchange ◽

10.1038/protex.2013.065 ◽

2013 ◽

Cited By ~ 12

Author(s):

Felix Bast

Keyword(s):

Model Selection ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Similarity Search ◽

Sequence Similarity ◽

Distance Matrix ◽

Phylogeny Reconstruction ◽

Sequence Similarity Search ◽

Multiple Sequence ◽

Alignment Model


PVTree: A Sequential Pattern Mining Method for Alignment Independent Phylogeny Reconstruction

Genes ◽

10.3390/genes10020073 ◽

2019 ◽

Vol 10 (2) ◽

pp. 73 ◽

Cited By ~ 3

Author(s):

Yongyong Kang ◽

Xiaofei Yang ◽

Jiadong Lin ◽

Kai Ye

Keyword(s):

Phylogenetic Tree ◽

Phylogenetic Trees ◽

Pattern Mining ◽

Distance Matrix ◽

Parameter Tuning ◽

Sequential Pattern Mining ◽

Sequential Pattern ◽

Binary Representation ◽

Multiple Sequence ◽

Alignment Free

A phylogenetic tree is essential to understanding evolution and is usually constructed through multiple sequence alignment, which carries a heavy computational burden and requires sophisticated parameter tuning. Recently, alignment-free methods based on k-mer profiles or common substrings have provided alternative ways to construct phylogenetic trees. However, most of these methods ignore the global similarities between sequences or some specific valuable features, e.g., patterns that are frequent across the whole dataset. To make further improvements, we propose an alignment-free algorithm based on sequential pattern mining, in which each sequence is converted into a binary representation of the sequential patterns found among the sequences. The phylogenetic tree is then constructed by clustering a distance matrix calculated from the pattern vectors. To increase accuracy for highly divergent sequences, we take pattern weights into account and filter out redundant sub-patterns. Results on both simulated and real data demonstrate that our method outperforms other alignment-free methods, especially for large sequence sets with low similarity.
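The binary representation and the distance matrix can be sketched as follows, assuming that a sequential pattern is an ordered and possibly gapped subsequence and that the mined patterns are already given. The pattern list, the normalised Hamming distance and the function names are hypothetical; the mining, weighting, redundancy filtering and clustering steps of PVTree are not shown.

```python
def contains_pattern(seq, pattern):
    """True if the symbols of pattern occur in seq in order (gaps allowed)."""
    remaining = iter(seq)
    # Each membership test consumes the iterator up to the matched symbol,
    # so later symbols must appear after earlier ones.
    return all(symbol in remaining for symbol in pattern)


def pattern_vector(seq, patterns):
    """Binary vector: 1 where the sequence contains the pattern, else 0."""
    return [1 if contains_pattern(seq, p) else 0 for p in patterns]


def distance_matrix(seqs, patterns):
    """Pairwise normalised Hamming distance between pattern vectors."""
    if not patterns:
        raise ValueError("at least one pattern is required")
    vectors = [pattern_vector(s, patterns) for s in seqs]
    n = len(patterns)
    return [[sum(a != b for a, b in zip(u, v)) / n for v in vectors]
            for u in vectors]


if __name__ == "__main__":
    patterns = ["AC", "GT", "TT"]  # hypothetical mined patterns
    for row in distance_matrix(["ACGT", "AGCT", "TTTT"], patterns):
        print(row)
```

A matrix of this form could then be passed to any distance-based tree-building method, such as neighbour joining or hierarchical clustering.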


Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification

Bioinformatics ◽

10.1093/bioinformatics/btv006 ◽

2015 ◽

Vol 31 (9) ◽

pp. 1396-1404 ◽

Cited By ~ 23

Author(s):

Ivan Borozan ◽

Stuart Watt ◽

Vincent Ferretti

Keyword(s):

Sequence Similarity ◽

Similarity Measures ◽

Sequence Classification ◽

Biological Sequence ◽

Alignment Free


Fast and accurate large multiple sequence alignments using root-to-leave regressive computation

10.1101/490235 ◽

2018 ◽

Cited By ~ 2

Author(s):

Edgar Garriga ◽

Paolo Di Tommaso ◽

Cedrik Magis ◽

Ionas Erb ◽

Hafid Laayouni ◽

...

Keyword(s):

Linear Time ◽

Scale Up ◽

Approximate Solutions ◽

Biological Sequences ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Multiple Alignments ◽

Genomic Analyses ◽

Alignment Problem

Abstract. Inferences derived from large multiple alignments of biological sequences are critical to many areas of biology, including evolution, genomics, biochemistry, and structural biology. However, the complexity of the alignment problem imposes the use of approximate solutions. The most common is the progressive algorithm, which starts by aligning the most similar sequences, incorporating the remaining ones following the order imposed by a guide-tree. We developed and validated on protein sequences a regressive algorithm that works the other way around, aligning first the most dissimilar sequences. Our algorithm produces more accurate alignments than non-regressive methods, especially on datasets larger than 10,000 sequences. By design, it can run any existing alignment method in linear time, thus allowing the scale-up required for extremely large genomic analyses. One Sentence Summary: Initiating alignments with the most dissimilar sequences allows slow and accurate methods to be used on large datasets.
