Computing nearest neighbour interchange distances between ranked phylogenetic trees

Lena Collienne; Alex Gavryushkin

doi:10.1007/s00285-021-01567-5

Journal of Mathematical Biology ◽

10.1007/s00285-021-01567-5 ◽

2021 ◽

Vol 82 (1-2) ◽

Author(s):

Lena Collienne ◽

Alex Gavryushkin

Keyword(s):

Cancer Research ◽

Computational Complexity ◽

Phylogenetic Tree ◽

Shortest Path ◽

Phylogenetic Trees ◽

Shortest Paths ◽

Nearest Neighbour ◽

Tree Inference ◽

Subtree Prune And Regraft ◽

Comparison Algorithms

Abstract: Many popular algorithms for searching the space of leaf-labelled (phylogenetic) trees are based on tree rearrangement operations. Under any such operation, the problem is reduced to searching a graph where vertices are trees and (undirected) edges are given by pairs of trees connected by one rearrangement operation (sometimes called a move). Most popular are the classical nearest neighbour interchange, subtree prune and regraft, and tree bisection and reconnection moves. The problem of computing distances, however, is $\mathbf{NP}$-hard in each of these graphs, making tree inference and comparison algorithms challenging to design in practice. Although ranked phylogenetic trees are one of the central objects of interest in applications such as cancer research, immunology, and epidemiology, the computational complexity of the shortest path problem for these trees remained unsolved for decades. In this paper, we settle this problem for the ranked nearest neighbour interchange operation by establishing that the complexity depends on the weight difference between the two types of tree rearrangements (rank moves and edge moves), and varies from quadratic, which is the lowest possible complexity for this problem, to $\mathbf{NP}$-hard, which is the highest. In particular, our result provides the first example of a phylogenetic tree rearrangement operation for which shortest paths, and hence the distance, can be computed efficiently. Specifically, our algorithm scales to trees with tens of thousands of leaves (and likely hundreds of thousands if implemented efficiently).
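To make the "graph of trees" picture concrete, the following is a minimal sketch of a brute-force breadth-first search over ranked trees, with rank moves and edge (NNI) moves both given weight 1. It is not the efficient algorithm from the paper: it enumerates trees exhaustively and is only usable for a handful of leaves. The representation (a tuple of clusters, one per internal node, ordered by rank) and the names `make_tree`, `children`, `neighbours`, and `rnni_distance` are assumptions made for this illustration.

```python
from collections import deque


def make_tree(*clusters):
    """Build a ranked tree as a tuple of leaf clusters ordered by rank.

    Entry i is the set of leaves below the internal node of rank i + 1.
    (Illustrative representation, not taken from the paper.)
    """
    return tuple(frozenset(c) for c in clusters)


def children(tree, i):
    """Return the two child clusters of the internal node at index i."""
    cluster = tree[i]
    # The highest-ranked earlier cluster inside this one is a direct child.
    for j in range(i - 1, -1, -1):
        if tree[j] < cluster:
            return tree[j], cluster - tree[j]
    # No internal node below: both children are leaves.
    a, b = sorted(cluster)
    return frozenset([a]), frozenset([b])


def neighbours(tree):
    """List all ranked trees one move away from `tree`."""
    result = []
    for i in range(len(tree) - 1):
        lower, upper = tree[i], tree[i + 1]
        if lower < upper:
            # An edge joins the two nodes: two edge (NNI) moves are possible.
            a, b = children(tree, i)
            c = upper - lower  # the other child of the upper node
            for new in (a | c, b | c):
                result.append(tree[:i] + (new,) + tree[i + 1:])
        else:
            # No edge between them: a rank move swaps their ranks.
            result.append(tree[:i] + (upper, lower) + tree[i + 2:])
    return result


def rnni_distance(start, goal):
    """Shortest-path length between two ranked trees by exhaustive BFS."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return dist[current]
        for nxt in neighbours(current):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    raise ValueError("goal is not a ranked tree on the same leaf set")


if __name__ == "__main__":
    t2 = make_tree("cd", "ab", "abcd")
    t3 = make_tree("ab", "abc", "abcd")
    # One rank move followed by one edge move: distance 2.
    print(rnni_distance(t2, t3))
```

The number of ranked trees grows super-exponentially with the number of leaves, which is why exhaustive search of this kind does not scale and why a polynomial-time algorithm for this graph is notable.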

Breaking bud: probing the scalability limits of phylogenetic network inference methods

10.1101/056572 ◽

2016 ◽

Author(s):

Hussein A Hejase ◽

Kevin J Liu

Keyword(s):

Phylogenetic Tree ◽

Phylogenetic Trees ◽

Network Inference ◽

State Of The Art ◽

Probabilistic Inference ◽

Phylogenetic Network ◽

Main Memory ◽

Tree Inference ◽

Dataset Size ◽

Inference Methods

Abstract: Background: Branching events in phylogenetic trees reflect strictly bifurcating and/or multifurcating speciation and splitting events. In the presence of gene flow, a phylogeny cannot be described by a tree but is instead a directed acyclic graph known as a phylogenetic network. Both phylogenetic trees and networks are typically reconstructed using computational analysis of multi-locus sequence data. The advent of high-throughput sequencing technologies has brought about two main scalability challenges: (1) dataset size in terms of the number of taxa and (2) the evolutionary divergence of the taxa in a study. The impact of both dimensions of scale on phylogenetic tree inference has been well characterized by recent studies; in contrast, the scalability limits of phylogenetic network inference methods are largely unknown. In this study, we quantify the performance of state-of-the-art phylogenetic network inference methods on large-scale datasets using empirical data sampled from natural mouse populations and synthetic data capturing a wide range of evolutionary scenarios. Results: We find that, as in the case of phylogenetic tree inference, the performance of leading network inference methods is negatively impacted by both dimensions of dataset scale. In general, we found that topological accuracy degrades as the number of taxa increases; a similar effect was observed with increased sequence mutation rate. The most accurate methods were probabilistic inference methods which maximize either likelihood under coalescent-based models or pseudo-likelihood approximations to the model likelihood. Furthermore, probabilistic inference methods with optimization criteria which did not make use of gene tree root and/or branch length information performed best, a result that runs contrary to widely held assumptions in the literature. The improved accuracy obtained with probabilistic inference methods comes at a computational cost in terms of runtime and main memory usage, which quickly become prohibitive as dataset size grows past thirty taxa. Conclusions: We conclude that the state of the art of phylogenetic network inference lags well behind the scope of current phylogenomic studies. New algorithmic development is critically needed to address this methodological gap.

Joint Alignment and Tree Inference

10.1101/2021.09.28.462230 ◽

2021 ◽

Author(s):

Jūlija Pečerska ◽

Manuel Gil ◽

Maria Anisimova

Keyword(s):

Computational Complexity ◽

Maximum Likelihood ◽

Phylogenetic Tree ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Combinatorial Optimisation ◽

Simultaneous Inference ◽

Inference Process ◽

Multiple Sequence ◽

Tree Inference

Multiple sequence alignment and phylogenetic tree inference are connected problems that are often solved as independent steps in the inference process. Several attempts at simultaneous inference have been made; however, the currently available methods are greatly limited by their computational complexity and can only handle small datasets. In this manuscript we introduce a combinatorial optimisation approach that will allow us to resolve the circularity of the problem and efficiently infer both alignments and trees under maximum likelihood.

Geometry of Ranked Nearest Neighbour Interchange Space of Phylogenetic Trees

10.1101/2019.12.19.883603 ◽

2019 ◽

Author(s):

Lena Collienne ◽

Kieran Elmes ◽

Mareike Fischer ◽

David Bryant ◽

Alex Gavryushkin

Keyword(s):

Monte Carlo ◽

Markov Chain ◽

Markov Chain Monte Carlo ◽

Maximum Likelihood ◽

Phylogenetic Trees ◽

Search Space ◽

Nearest Neighbour ◽

Adjacency Relation ◽

Inference Algorithms ◽

Tree Inference

Abstract: In this paper we study the graph of ranked phylogenetic trees where the adjacency relation is given by a local rearrangement of the tree structure. Our work is motivated by tree inference algorithms, such as maximum likelihood and Markov Chain Monte Carlo methods, where the geometry of the search space plays a central role for efficiency and practicality of optimisation and sampling. We hence focus on understanding the geometry of the space (graph) of ranked trees, the so-called ranked nearest neighbour interchange (RNNI) graph. We find the radius and diameter of the space exactly, improving the best previously known estimates. Since the RNNI graph is a generalisation of the classical nearest neighbour interchange (NNI) graph to ranked phylogenetic trees, we compare geometric and algorithmic properties of the two graphs. Surprisingly, we discover that both geometric and algorithmic properties of RNNI and NNI are quite different. For example, we establish convexity of certain natural subspaces in RNNI which are not convex in NNI. Our results suggest that the complexity of computing distances in the two graphs is different.

Dynamic Shortest Paths Methods for the Time-Dependent TSP

Algorithms ◽

10.3390/a14010021 ◽

2021 ◽

Vol 14 (1) ◽

pp. 21

Author(s):

Christoph Hansknecht ◽

Imke Joormann ◽

Sebastian Stiller

Keyword(s):

Column Generation ◽

Traveling Salesman Problem ◽

Shortest Path ◽

Valid Inequalities ◽

Shortest Paths ◽

Traveling Salesman ◽

Time Dependent ◽

Full Generality ◽

Branching Rule ◽

The Traveling Salesman Problem

The time-dependent traveling salesman problem (TDTSP) asks for a shortest Hamiltonian tour in a directed graph where (asymmetric) arc-costs depend on the time the arc is entered. With traffic data abundantly available, methods to optimize routes with respect to time-dependent travel times are widely desired. This holds in particular for the traveling salesman problem, which is a cornerstone of logistic planning. In this paper, we devise column-generation-based IP methods to solve the TDTSP in full generality, both for arc- and path-based formulations. The algorithmic key is a time-dependent shortest path problem, which arises from the pricing problem of the column generation and is of independent interest: namely, to find paths in a time-expanded graph that are acyclic in the underlying (non-expanded) graph. As this problem is computationally too costly, we price over the set of paths that contain no cycles of length k. In addition, we devise several families of valid inequalities, primal heuristics, a propagation method, and a branching rule, all tailored to the TDTSP. Combining these with the time-dependent shortest path pricing, we provide, to our knowledge, the first elaborate method to solve the TDTSP in general and with fully general time-dependence. We also provide results on complexity and approximability of the TDTSP. In computational experiments on randomly generated instances, we are able to solve the large majority of small instances (20 nodes) to optimality, while closing about two thirds of the remaining gap of the large instances (40 nodes) after one hour of computation.
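As a rough illustration of what "arc-costs depend on the time the arc is entered" means for shortest paths, here is a minimal sketch of a time-dependent variant of Dijkstra's algorithm that computes an earliest arrival time. It is a simplification and not the pricing routine described in the abstract: it ignores the cycle restrictions and assumes every arc is FIFO (entering an arc later never yields an earlier arrival). The graph, the travel-time functions, and the name `earliest_arrival` are hypothetical.

```python
import heapq


def earliest_arrival(graph, source, target, depart):
    """Earliest arrival time at `target` when leaving `source` at `depart`.

    `graph` maps each node to a list of (neighbour, travel_time) pairs, where
    travel_time(t) is the traversal time of the arc when entered at time t.
    Correctness relies on the FIFO assumption stated above.
    """
    best = {source: depart}
    heap = [(depart, source)]
    while heap:
        t, node = heapq.heappop(heap)
        if node == target:
            return t
        if t > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, travel_time in graph.get(node, []):
            arrival = t + travel_time(t)
            if arrival < best.get(nxt, float("inf")):
                best[nxt] = arrival
                heapq.heappush(heap, (arrival, nxt))
    return None  # target unreachable


# Hypothetical instance: arc A->B becomes slow from time 8 onwards.
GRAPH = {
    "A": [("B", lambda t: 5 if t < 8 else 9), ("C", lambda t: 7)],
    "B": [("D", lambda t: 1)],
    "C": [("D", lambda t: 1)],
}

if __name__ == "__main__":
    print(earliest_arrival(GRAPH, "A", "D", depart=0))  # 6, via B
    print(earliest_arrival(GRAPH, "A", "D", depart=8))  # 16, via C
```

The same origin and destination get different best routes depending on the departure time, which is the feature that separates the time-dependent setting from the static one.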

Origin of the European avian-like swine influenza viruses

Journal of General Virology ◽

10.1099/vir.0.068569-0 ◽

2014 ◽

Vol 95 (11) ◽

pp. 2372-2376 ◽

Cited By ~ 10

Author(s):

Andi Krumbholz ◽

Jeannette Lange ◽

Andreas Sauerbrei ◽

Marco Groth ◽

Matthias Platzer ◽

...

Keyword(s):

Avian Influenza ◽

Phylogenetic Trees ◽

Sequence Data ◽

Swine Influenza ◽

H1n1 Influenza ◽

Molecular Data ◽

Influenza Viruses ◽

Time Resolved ◽

Genotype Constellation ◽

Tree Inference

The avian-like swine influenza viruses emerged in 1979 in Belgium and Germany. Thereafter, they spread through many European swine-producing countries, replaced the circulating classical swine H1N1 influenza viruses, and became endemic. Serological and subsequent molecular data indicated an avian source, but details remained obscure due to a lack of relevant avian influenza virus sequence data. Here, the origin of the European avian-like swine influenza viruses was analysed using a collection of 16 European swine H1N1 influenza viruses sampled in 1979–1981 in Germany, the Netherlands, Belgium, Italy and France, as well as several contemporaneous avian influenza viruses of various serotypes. The phylogenetic trees suggested a triple reassortant with a unique genotype constellation. Time-resolved maximum clade credibility trees indicated times to the most recent common ancestors of 34–46 years (before 2008) depending on the RNA segment and the method of tree inference.

Phylogenetic tree inference on PC architectures with AxML/PAxML

Proceedings International Parallel and Distributed Processing Symposium ◽

10.1109/ipdps.2003.1213296 ◽

2004 ◽

Cited By ~ 2

Author(s):

A.P. Stamatakis ◽

T. Ludwig

Keyword(s):

Phylogenetic Tree ◽

Tree Inference

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.

The Effect of Route-choice Strategy on Transit Travel Time Estimates

10.31235/osf.io/3r4p6 ◽

2019 ◽

Author(s):

Nate Wessel ◽

Steven Farber

Keyword(s):

Travel Time ◽

Shortest Path ◽

Imperfect Information ◽

Optimal Path ◽

Shortest Paths ◽

Route Choice ◽

Travel Times ◽

Selection Strategies ◽

Time Estimates ◽

Average Travel Time

Estimates of travel time by public transit often rely on the calculation of a shortest path between two points for a given departure time. Such shortest paths are time-dependent and not always stable from one moment to the next. Given that actual transit passengers necessarily have imperfect information about the system, their route selection strategies are heuristic and cannot be expected to achieve optimal travel times for all possible departures. Thus, an algorithm that returns optimal travel times at all moments will tend to underestimate real travel times, all else being equal. While several researchers have noted this issue, none have yet measured the extent of the problem. This study observes and measures this effect by contrasting two alternative heuristic routing strategies to a standard shortest-path calculation. The Toronto Transit Commission is used as a case study, and we model actual transit operations for the agency over the course of a normal week with archived AVL data transformed into a retrospective GTFS dataset. Travel times are estimated using two alternative route-choice assumptions: 1) habitual selection of the itinerary with the best average travel time and 2) dynamic choice of the next-departing route in a predefined choice set. It is shown that most trips present passengers with a complex choice among competing itineraries and that the choice of itinerary at any given moment of departure may entail substantial travel time risk relative to the optimal outcome. In the context of accessibility modelling, where travel times are typically considered as a distribution, the optimal path method is observed in aggregate to underestimate travel time by about 3-4 minutes at the median and 6-7 minutes at the 90th percentile for a typical trip.
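The contrast between the optimal path and the two heuristic strategies can be sketched on a toy example. The code below is only an illustration under invented assumptions: two hypothetical itineraries `X` and `Y` with made-up schedules and ride times, travel time defined as waiting time plus in-vehicle time, and a one-hour analysis window. It does not reproduce the study's data or method.

```python
from bisect import bisect_left
from statistics import mean

# Hypothetical itineraries: (sorted departure minutes, in-vehicle minutes).
ITINERARIES = {
    "X": (list(range(0, 121, 10)), 20),
    "Y": (list(range(5, 126, 15)), 18),
}
WINDOW = range(0, 60)  # departure minutes analysed


def wait(name, t):
    """Minutes until the next departure of itinerary `name` at or after t."""
    departures, _ = ITINERARIES[name]
    return departures[bisect_left(departures, t)] - t


def travel_time(name, t):
    """Waiting time plus ride time when arriving at the stop at minute t."""
    return wait(name, t) + ITINERARIES[name][1]


def optimal(t):
    """Shortest-path assumption: perfect information at every minute."""
    return min(travel_time(name, t) for name in ITINERARIES)


# Strategy 1: always use the itinerary with the best average travel time.
HABIT = min(ITINERARIES, key=lambda n: mean(travel_time(n, t) for t in WINDOW))


def habitual(t):
    return travel_time(HABIT, t)


def next_departing(t):
    """Strategy 2: board whichever itinerary in the choice set leaves first."""
    return travel_time(min(ITINERARIES, key=lambda n: wait(n, t)), t)


if __name__ == "__main__":
    for label, strategy in [("optimal", optimal), ("habitual", habitual),
                            ("next-departing", next_departing)]:
        print(label, round(mean(strategy(t) for t in WINDOW), 2))
```

Because the optimal strategy takes the minimum over all itineraries at every minute, its travel time can never exceed that of either heuristic, so averaging over departures shows the kind of underestimate the abstract describes.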

A MODIFIED GENETIC ALGORITHM FOR FINDING FUZZY SHORTEST PATHS IN UNCERTAIN NETWORKS

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprsarchives-xli-b2-299-2016 ◽

2016 ◽

Vol XLI-B2 ◽

pp. 299-304 ◽

Cited By ~ 4

Author(s):

A. A. Heidari ◽

M. R. Delavar

Keyword(s):

Genetic Algorithm ◽

Shortest Path ◽

Shortest Paths ◽

The Body ◽

Shortest Path Problems ◽

Conventional Procedure ◽

Path Lengths ◽

The Cost ◽

Uncertain Networks ◽

Exploration Exploitation

In realistic network analysis, there are several uncertainties in the measurement and computation of arcs and vertices. These uncertainties should also be considered when solving the shortest path problem (SPP), owing to the inherent fuzziness of expert knowledge. In this paper, we investigate the SPP under uncertainty to evaluate our modified genetic strategy. We improve the performance of the genetic algorithm (GA) on a class of shortest path problems on networks with vague arc weights. The solutions of the uncertain SPP with fuzzy path lengths are examined and compared in detail. As a robust metaheuristic, the GA is modified and evaluated on the fuzzy SPP (FSPP) with uncertain arcs. For this purpose, a dynamic operation is first implemented to enrich the exploration/exploitation patterns of the conventional procedure and mitigate the premature convergence of the GA. Then, the modified GA (MGA) strategy is used to solve the FSPP. The results of the proposed strategy are compared to those of the GA with regard to cost, quality of paths, and CPU time. Numerical instances are provided to demonstrate the success of the proposed MGA-FSPP strategy in comparison with the GA. The simulations show that the proposed technique not only outperforms the GA but also effectively improves the quality of the paths. The results indicate that the proposed MGA is preferable in terms of the quality measures considered. The results also demonstrate that the proposed method can be used efficiently to handle the FSPP in uncertain networks.

Analysis of SARS-CoV-2 nucleocapsid protein sequence variations in ASEAN countries

Medical Journal of Indonesia ◽

10.13181/mji.oa.215304 ◽

2021 ◽

Author(s):

Mochammad Rajasa Mukti Negara ◽

Ita Krissanti ◽

Gita Widya Pradini

Keyword(s):

Phylogenetic Tree ◽

Phylogenetic Trees ◽

Protein Sequences ◽

Reference Sequence ◽

N Protein ◽

Asean Country ◽

Sequence Variations ◽

Complete Sequences ◽

Asean Countries ◽

Global Initiative

BACKGROUND Nucleocapsid (N) protein is one of the four structural proteins of SARS-CoV-2; it is known to be more conserved than spike protein and is highly immunogenic. This study aimed to analyze the variation of the SARS-CoV-2 N protein sequences in ASEAN countries, including Indonesia. METHODS Complete sequences of SARS-CoV-2 N protein from each ASEAN country were obtained from the Global Initiative on Sharing All Influenza Data (GISAID), while the reference sequence was obtained from GenBank. All sequences collected from December 2019 to March 2021 were grouped into clades according to GISAID, and two representative isolates were chosen from each clade for the analysis. The sequences were aligned by MUSCLE, and phylogenetic trees were built using MEGA-X software based on the nucleotide and translated AA sequences. RESULTS 98 isolates of complete N protein genes from ASEAN countries were analyzed. The nucleotides of all isolates were 97.5% conserved. Of 31 nucleotide changes, 22 led to amino acid (AA) substitutions; thus, the AA sequences were 94.5% conserved. The phylogenetic trees of nucleotide and AA sequences show similar branches. Nucleotide variations in clade O (C28311T); clade GR (28881–28883 GGG>AAC); and clade GRY (28881–28883 GGG>AAC and C28977T) lead to specific branches corresponding to the clade within both trees. CONCLUSIONS The N protein sequences of SARS-CoV-2 across ASEAN countries are highly conserved. Most isolates were closely related to the reference sequence originating from China, except the isolates representing clades O, GR, and GRY, which formed specific branches in the phylogenetic tree.
