An Alignment-free Heuristic for Fast Sequence Comparisons with Applications to Phylogeny Reconstruction

Abstract Background Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACSk takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that can approximate ACSk have been introduced. Results In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. Conclusions Our method produces a better approximation for ACSk and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.
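
To make the quantity being approximated concrete, the following is a minimal Python sketch of a naive, brute-force ACSk computation. It assumes the usual reading of the measure: for every start position in one sequence, take the longest substring that matches some substring of the other sequence with at most k mismatches, then average those lengths. The function names `longest_match_k` and `acs_k` are hypothetical, and this is neither the linear-time heuristic described in the abstract nor the adyar-rs implementation.

```python
def longest_match_k(a, b, i, j, k):
    """Length of the longest common prefix of a[i:] and b[j:] with at most k mismatches."""
    length = 0
    mismatches = 0
    while i + length < len(a) and j + length < len(b):
        if a[i + length] != b[j + length]:
            mismatches += 1
            if mismatches > k:
                break  # the (k+1)-th mismatch ends the match
        length += 1
    return length


def acs_k(a, b, k=0):
    """Naive average common substring length of a against b, allowing k mismatches.

    Brute force over all start positions in both sequences (assumed definition,
    for illustration only); far slower than the exact or heuristic algorithms.
    """
    if not a or not b:
        return 0.0
    total = 0
    for i in range(len(a)):
        total += max(longest_match_k(a, b, i, j, k) for j in range(len(b)))
    return total / len(a)


if __name__ == "__main__":
    print(acs_k("ACGT", "ACGT"))       # k = 0 reduces to plain ACS
    print(acs_k("AAAA", "CCCC", k=1))  # one mismatch allowed per match
```

With k = 0 the function reduces to the plain ACS measure. The measure is not symmetric in its two arguments, so a distance built on it would normally combine both directions; that step is left out here.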


Whole-Genome k-mer Topic Modeling Associates Bacterial Families

Genes ◽

10.3390/genes11020197 ◽

2020 ◽

Vol 11 (2) ◽

pp. 197

Author(s):

Ernesto Borrayo ◽

Isaias May-Canche ◽

Omar Paredes ◽

J. Alejandro Morales ◽

Rebeca Romo-Vázquez ◽

...

Keyword(s):

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Hierarchical Classification ◽

Whole Genome Sequence ◽

Whole Genome ◽

Sequence Comparisons ◽

Alignment Free ◽

Biological Phenomena ◽

Topic Distribution ◽

Genome Comparisons

Alignment-free k-mer-based algorithms in whole genome sequence comparisons remain an ongoing challenge. Here, we explore the possibility of using Topic Modeling for organism whole-genome comparisons. We analyzed 30 complete genomes from three bacterial families by topic modeling. For this, each genome was considered as a document and 13-mer nucleotide representations as words. Latent Dirichlet allocation was used as the probabilistic modeling of the corpus. We were able to identify the topic distribution among analyzed genomes, which is highly consistent with traditional hierarchical classification. It is possible that topic modeling may be applied to establish relationships between genome composition and biological phenomena.
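
The "genome as document, 13-mer as word" framing can be sketched in a few lines of Python. The example below only builds the bag-of-words count matrix that a topic model such as latent Dirichlet allocation would take as input; fitting the model itself is left out. The function names `kmer_words` and `bag_of_words` and the toy sequences are hypothetical, and overlapping k-mers on a single strand are an assumption, since the abstract does not say how the words were extracted.

```python
from collections import Counter


def kmer_words(genome, k=13):
    """Split a genome string into overlapping k-mer 'words' (assumed extraction scheme)."""
    genome = genome.upper()
    return [genome[i:i + k] for i in range(len(genome) - k + 1)]


def bag_of_words(genomes, k=13):
    """Build a document-term count matrix from a {name: sequence} mapping.

    Returns the sorted k-mer vocabulary and, per genome, a list of counts
    aligned with that vocabulary.
    """
    counts = {name: Counter(kmer_words(seq, k)) for name, seq in genomes.items()}
    vocab = sorted(set().union(*counts.values()))
    matrix = {name: [c[word] for word in vocab] for name, c in counts.items()}
    return vocab, matrix


if __name__ == "__main__":
    # Toy sequences and a small k, purely for illustration.
    toy = {"g1": "ACGTACGT", "g2": "ACGTTTTT"}
    vocab, matrix = bag_of_words(toy, k=4)
    print(vocab)
    print(matrix)
```

Each row of the resulting matrix plays the role of one document in the corpus.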


Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis

Briefings in Bioinformatics ◽

10.1093/bib/bbt052 ◽

2013 ◽

Vol 15 (6) ◽

pp. 890-905 ◽

Cited By ~ 79

Author(s):

O. Bonham-Carter ◽

J. Steele ◽

D. Bastola

Keyword(s):

Sequence Comparisons ◽

Genetic Sequence ◽

Word Analysis ◽

Alignment Free


Sensitivity of A Frequency-Based Alignment-Free Approach For Phylogeny Reconstruction

10.21203/rs.3.rs-1174825/v1 ◽

2021 ◽

Author(s):

Yana Hrytsenko ◽

Noah M. Daniels ◽

Rachel S. Schwartz

Keyword(s):

Computational Efficiency ◽

Sequence Data ◽

Genome Structure ◽

Branch Length ◽

Phylogeny Reconstruction ◽

Genome Data ◽

Alignment Free ◽

The Face ◽

Efficient Alternative ◽

Parameter Ranges

Abstract Background: Phylogenies enrich our understanding of how genes, genomes, and species evolve. Traditionally, alignment-based methods are used to construct phylogenies from genetic sequence data; however, this process can be time-consuming when analyzing the large amounts of genomic data available today. Additionally, these analyses face challenges due to differences in genome structure, synteny, and the need to identify similarities in the face of repeated substitutions resulting in loss of phylogenetic information contained in the sequence. Alignment Free (AF) approaches using k-mers (short subsequences) can be an efficient alternative due to their indifference to positional rearrangements in a sequence. However, these approaches may be sensitive to k-mer length and the distance between samples. Results: In this paper, we analyzed the sensitivity of an AF approach based on k-mer frequencies to these challenges using cosine and Euclidean distance metrics for both assembled genomes and unassembled sequencing reads. Quantification of the sensitivity of this AF approach for phylogeny reconstruction to branch length and k-mer length provides a better understanding of the necessary parameter ranges for accurate phylogeny reconstruction. Our results show that a frequency-based AF approach can result in accurate phylogeny reconstruction when using whole genomes, but not stochastically sequenced reads, so long as longer k-mers are used. Conclusions: In this study, we have shown an AF approach for phylogeny reconstruction is robust in analyzing assembled genome data for a range of numbers of substitutions using longer k-mers. Using simulated reads randomly selected from the genome by the Illumina sequencer had a detrimental effect on phylogeny estimation. Additionally, filtering out infrequent k-mers improved the computational efficiency of the method while preserving the accuracy of the results thus suggesting the feasibility of using only a subset of data to improve computational efficiency in cases where large sets of genome-scale data are analyzed.
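
As a rough illustration of a frequency-based AF comparison of this kind, the Python sketch below builds k-mer frequency profiles, optionally drops infrequent k-mers, and compares two profiles with Euclidean and cosine distances. The function names, the `min_count` threshold parameter, and the choice to renormalize frequencies after filtering are all assumptions made for the example; the abstract does not specify these details.

```python
import math
from collections import Counter


def kmer_freqs(seq, k, min_count=1):
    """Relative k-mer frequencies of seq, keeping only k-mers seen at least min_count times."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kept = {word: c for word, c in counts.items() if c >= min_count}
    total = sum(kept.values())
    # Renormalizing after filtering is an assumption of this sketch.
    return {word: c / total for word, c in kept.items()} if total else {}


def euclidean_distance(p, q):
    """Euclidean distance between two sparse frequency profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in keys))


def cosine_distance(p, q):
    """One minus the cosine similarity of two sparse frequency profiles."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm if norm else 1.0


if __name__ == "__main__":
    a = kmer_freqs("ACGTACGTAC", k=3)
    b = kmer_freqs("ACGTTCGTAC", k=3)
    print(euclidean_distance(a, b), cosine_distance(a, b))
```

A pairwise distance matrix built this way could then be passed to a standard distance-based tree method.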


An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data

BMC Genomics ◽

10.1186/s12864-015-1647-5 ◽

2015 ◽

Vol 16 (1) ◽

Cited By ~ 68

Author(s):

Huan Fan ◽

Anthony R. Ives ◽

Yann Surget-Groba ◽

Charles H. Cannon

Keyword(s):

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Phylogeny Reconstruction ◽

Next Generation ◽

Sequencing Data ◽

Alignment Free ◽

Generation Sequencing


Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage

BMC Bioinformatics ◽

10.1186/s12859-019-3205-7 ◽

2019 ◽

Vol 20 (S20) ◽

Cited By ~ 6

Author(s):

Anna-Katharina Lau ◽

Svenja Dörrer ◽

Chris-André Leimeister ◽

Christoph Bleidorn ◽

Burkhard Morgenstern

Keyword(s):

Biomedical Research ◽

Real World ◽

Medical Diagnostics ◽

High Accuracy ◽

Phylogeny Reconstruction ◽

Strain Typing ◽

Bacterial Genomes ◽

Sequencing Coverage ◽

Alignment Free ◽

Low Coverage

Abstract Background In many fields of biomedical research, it is important to estimate phylogenetic distances between taxa based on low-coverage sequencing reads. Major applications are, for example, phylogeny reconstruction, species identification from small sequencing samples, or bacterial strain typing in medical diagnostics. Results We adapted our previously developed software program Filtered Spaced-Word Matches (FSWM) for alignment-free phylogeny reconstruction to take unassembled reads as input; we call this implementation Read-SpaM. Conclusions Test runs on simulated reads from semi-artificial and real-world bacterial genomes show that our approach can estimate phylogenetic distances with high accuracy, even for large evolutionary distances and for very low sequencing coverage.


Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences

GigaScience ◽

10.1093/gigascience/giy148 ◽

2018 ◽

Vol 8 (3) ◽

Cited By ~ 8

Author(s):

Chris-Andre Leimeister ◽

Jendrik Schellhorn ◽

Svenja Dörrer ◽

Michael Gerth ◽

Christoph Bleidorn ◽

...

Keyword(s):

Phylogeny Reconstruction ◽

Alignment Free


Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage

10.1101/550632 ◽

2019 ◽

Cited By ~ 2

Author(s):

Anna Katharina Lau ◽

Chris-André Leimeister ◽

Burkhard Morgenstern

Keyword(s):

Biomedical Research ◽

Bacterial Strain ◽

Medical Diagnostics ◽

High Accuracy ◽

Phylogeny Reconstruction ◽

Strain Typing ◽

Bacterial Genomes ◽

Sequencing Coverage ◽

Alignment Free ◽

Low Coverage

Abstract In many fields of biomedical research, it is important to estimate phylogenetic distances between taxa based on low-coverage sequencing reads. Major applications are, for example, phylogeny reconstruction, species identification from small sequencing samples, or bacterial strain typing in medical diagnostics. Herein, we adapt our previously developed software program Filtered Spaced-Word Matches (FSWM) for alignment-free phylogeny reconstruction to work on unassembled reads; we call this implementation Read-SpaM. Test runs on simulated reads from bacterial genomes show that our approach can estimate phylogenetic distances with high accuracy, even for large evolutionary distances and for very low sequencing coverage.


Prot-SpaM: Fast alignment-free phylogeny reconstruction based on whole-proteome sequences

10.1101/306142 ◽

2018 ◽

Cited By ~ 3

Author(s):

Chris-Andre Leimeister ◽

Jendrik Schellhorn ◽

Svenja Schöbel ◽

Michael Gerth ◽

Christoph Bleidorn ◽

...

Keyword(s):

Word Frequency ◽

Sequence Comparison ◽

Phylogenetic Trees ◽

Sequence Similarity ◽

Source Code ◽

Active Area ◽

Genomic Sequences ◽

Phylogeny Reconstruction ◽

High Quality ◽

Alignment Free

Abstract Word-based or ‘alignment-free’ sequence comparison has become an active area of research in bioinformatics. While previous word-frequency approaches calculated rough measures of sequence similarity or dissimilarity, some new alignment-free methods are able to accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Herein, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation of this approach is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can be used to calculate high-quality phylogenetic trees from whole-proteome sequences in a matter of seconds or minutes and often outperforms other alignment-free approaches. The source code of our software is available through GitHub: https://github.com/jschellh/ProtSpaM


Reconstructing phylogeny from reduced-representation genome sequencing data without assembly or alignment

10.1101/225623 ◽

2017 ◽

Author(s):

Huan Fan ◽

Anthony R. Ives ◽

Yann Surget-Groba

Keyword(s):

Missing Data ◽

Genome Sequencing ◽

Population Level ◽

Model Organisms ◽

Phylogeny Reconstruction ◽

Sequencing Data ◽

Reduced Representation ◽

Real Dataset ◽

Data Alignment ◽

Alignment Free

Abstract Although genome sequencing is becoming cheaper and faster, reducing the quantity of data by only sequencing part of the genome lowers both sequencing costs and computational burdens. One popular genome-reduction approach is restriction site associated DNA sequencing, or RADseq. RADseq was initially designed for studying genetic variation across genomes, usually at the population level, and it has also proved to be suitable for interspecific phylogeny reconstruction. RADseq data pose challenges for standard phylogenomic methods, however, due to incomplete coverage of the genome and large amounts of missing data. Alignment-free methods are both efficient and accurate for phylogenetic reconstructions with whole genomes and are especially practical for non-model organisms; nonetheless, alignment-free methods have only been applied with whole genome sequences. Here, we test a full-genome assembly and alignment-free method, AAF, in application to RADseq data and propose two procedures for reads selection to remove missing data. We validate these methods using both simulations and a real dataset. Reads selection improved the accuracy of phylogenetic construction in every simulated scenario and the real dataset, making AAF comparable to or better than alignment-based methods, with a much lower computational burden. We also investigated the sources of missing data in RADseq and their effects on phylogeny reconstruction using AAF. The AAF pipeline modified for RADseq data, phyloRAD, is available on GitHub (https://github.com/fanhuan/phyloRAD).
