Systematic Exploration of the High Likelihood Set of Phylogenetic Tree Topologies

Abstract Bayesian Markov chain Monte Carlo explores tree space slowly, in part because it frequently returns to the same tree topology. An alternative strategy would be to explore tree space systematically, and never return to the same topology. In this article, we present an efficient parallelized method to map out the high likelihood set of phylogenetic tree topologies via systematic search, which we show to be a good approximation of the high posterior set of tree topologies on the data sets analyzed. Here, “likelihood” of a topology refers to the tree likelihood for the corresponding tree with optimized branch lengths. We call this method “phylogenetic topographer” (PT). The PT strategy is very simple: starting in a number of local topology maxima (obtained by hill-climbing from random starting points), explore out using local topology rearrangements, only continuing through topologies that are better than some likelihood threshold below the best observed topology. We show that the normalized topology likelihoods are a useful proxy for the Bayesian posterior probability of those topologies. By using a nonblocking hash table keyed on unique representations of tree topologies, we avoid visiting topologies more than once across all concurrent threads exploring tree space. We demonstrate that PT can be used directly to approximate a Bayesian consensus tree topology. When combined with an accurate means of evaluating per-topology marginal likelihoods, PT gives an alternative procedure for obtaining Bayesian posterior distributions on phylogenetic tree topologies.
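The search strategy described in this abstract can be illustrated with a minimal single-threaded sketch. It is not the authors' implementation: the function names, the toy integer "topologies", the toy neighbor function, and the toy log-likelihood are all assumptions made for illustration, and an ordinary Python set stands in for the nonblocking hash table shared across threads.

```python
import math
from collections import deque


def explore_high_likelihood_set(starts, neighbors, log_lik, delta):
    """Visit each topology at most once, expanding only those whose
    log-likelihood is within `delta` of the best value seen so far."""
    queue = deque(dict.fromkeys(starts))  # de-duplicated starting points
    visited = set(queue)                  # stand-in for the shared hash table
    scores = {}
    best = -math.inf
    while queue:
        topo = queue.popleft()
        ll = log_lik(topo)
        best = max(best, ll)
        if ll < best - delta:
            continue                      # below threshold: do not explore through it
        scores[topo] = ll
        for nb in neighbors(topo):
            if nb not in visited:
                visited.add(nb)
                queue.append(nb)
    # The best value may have risen during the search, so apply the final threshold.
    return {t: s for t, s in scores.items() if s >= best - delta}


def normalize(scores):
    """Turn log-likelihoods into normalized weights (a posterior proxy)."""
    m = max(scores.values())
    weights = {t: math.exp(s - m) for t, s in scores.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


# Toy example (hypothetical): topologies are the integers 0..10 on a line,
# rearrangement neighbors are i-1 and i+1, and the optimum sits at 5.
def toy_neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j <= 10]


def toy_log_lik(i):
    return -float((i - 5) ** 2)


high = explore_high_likelihood_set([5], toy_neighbors, toy_log_lik, delta=4.5)
probs = normalize(high)
print(sorted(high), probs[5])
```

In a real analysis, `log_lik` would optimize branch lengths for each topology and `neighbors` would enumerate local rearrangements; both are the expensive parts that this sketch leaves abstract.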

Empirical Analysis of Phylogenetic Quasi-Terraces

10.1101/810309 ◽

2019 ◽

Author(s):

Paula Breitling ◽

Alexandros Stamatakis ◽

Olga Chernomor ◽

Ben Bettisworth ◽

Lukasz Reszczynski

Keyword(s):

Phylogenetic Tree ◽

Search Algorithms ◽

Data Sets ◽

Tree Search ◽

Significance Tests ◽

Analogous Structure ◽

Phylogenetic Studies ◽

Log Likelihood ◽

Tree Space ◽

Nearest Neighborhood

Abstract Terraces in phylogenetic tree space are, among other things, important for the design of tree space search strategies. While the phenomenon of phylogenetic terraces is already known for unlinked partition models on partitioned phylogenomic data sets, it has not yet been studied whether an analogous structure is present under linked and scaled partition models. To this end, we analyze aspects such as the log-likelihood distributions, likelihood-based significance tests, and nearest neighborhood interchanges on the trees residing on a terrace and compare their distributions among unlinked, linked, and scaled partition models. Our study shows that a terrace-like structure exists under linked and scaled partition models as well. We denote this phenomenon as quasi-terrace. Therefore, quasi-terraces should be taken into account in the design of tree search algorithms as well as when reporting results on ‘the’ final tree topology in empirical phylogenetic studies.

Two C++ Libraries for Counting Trees on a Phylogenetic Terrace

10.1101/211276 ◽

2017 ◽

Cited By ~ 2

Author(s):

R. Biczok ◽

P. Bozsoky ◽

P. Eisenmann ◽

J. Ernst ◽

T. Ribizel ◽

...

Keyword(s):

Maximum Likelihood ◽

Phylogenetic Tree ◽

Phylogenetic Inference ◽

Source Codes ◽

Likelihood Score ◽

Order Of Magnitude ◽

Tree Topologies ◽

Tree Space ◽

Bayesian Phylogenetic Inference ◽

Counting Trees

Abstract Motivation: The presence of terraces in phylogenetic tree space, that is, a potentially large number of distinct tree topologies that have exactly the same analytical likelihood score, was first described by Sanderson et al. (2011). However, popular software tools for maximum likelihood and Bayesian phylogenetic inference do not yet routinely report whether inferred phylogenies reside on a terrace. We believe this is due to the unavailability of an efficient library implementation to (i) determine if a tree resides on a terrace, (ii) calculate how many trees reside on a terrace, and (iii) enumerate all trees on a terrace. Results: In our bioinformatics programming practical we developed two efficient and independent C++ implementations of the SUPERB algorithm by Constantinescu and Sankoff (1995) for counting and enumerating the trees on a terrace. Both implementations yield exactly the same results, are more than one order of magnitude faster, and require one order of magnitude less memory than a previous third-party Python implementation. Availability: The source codes are available under GNU GPL at https://github.com/[email protected]

Markov Katana: a Novel Method for Bayesian Resampling of Parameter Space Applied to Phylogenetic Trees

10.1101/250951 ◽

2018 ◽

Author(s):

Stephen T. Pollard ◽

Kenji Fukushima ◽

Zhengyuan O. Wang ◽

Todd A. Castoe ◽

David D. Pollock

Keyword(s):

Phylogenetic Tree ◽

Phylogenetic Trees ◽

Statistical Approach ◽

Evolutionary Model ◽

Complex Model ◽

Tree Topology ◽

Added Value ◽

General Idea ◽

Branch Lengths ◽

Tree Space

Abstract Phylogenetic inference requires a means to search phylogenetic tree space. This is usually achieved using progressive algorithms that propose and test small alterations in the current tree topology and branch lengths. Current programs search tree topology space using branch-swapping algorithms, but proposals do not discriminate well between swaps likely to succeed or fail. When applied to datasets with many taxa, the huge number of possible topologies slows these programs dramatically. To overcome this, we developed a statistical approach for proposal generation in Bayesian analysis, and evaluated its applicability for the problem of searching phylogenetic tree space. The general idea of the approach, which we call ‘Markov katana’, is to make proposals based on a heuristic algorithm using bootstrapped subsets of the data. Such proposals induce an unintended sampling distribution that must be determined and removed to generate posterior estimates, but the cost of this extra step can in principle be small compared to the added value of more efficient parameter exploration in Markov chain Monte Carlo analyses. Our prototype application uses the simple neighbor-joining distance heuristic on data subsets to propose new reasonably likely phylogenetic trees (including topologies and branch lengths). The evolutionary model used to generate distances in our prototype was far simpler than the more complex model used to evaluate the likelihood of phylogenies based on the full dataset. This prototype implementation indicates that the Markov katana approach could be easily incorporated into existing phylogenetic search programs and may prove a useful alternative in conjunction with existing methods. The general features of this statistical approach may also prove useful in disciplines other than phylogenetics. We demonstrate that this method can be used to efficiently estimate a Bayesian posterior.

19 Dubious Ways to Compute the Marginal Likelihood of a Phylogenetic Tree Topology

Systematic Biology ◽

10.1093/sysbio/syz046 ◽

2019 ◽

Vol 69 (2) ◽

pp. 209-220 ◽

Cited By ~ 2

Author(s):

Mathieu Fourment ◽

Andrew F Magee ◽

Chris Whidden ◽

Arman Bilge ◽

Frederick A Matsen ◽

...

Keyword(s):

Marginal Likelihood ◽

Real Data ◽

Tree Topology ◽

Model Parameters ◽

Data Sets ◽

Posterior Density ◽

Computational Burden ◽

Marginal Likelihoods ◽

Tree Topologies ◽

First Time

Abstract The marginal likelihood of a model is a key quantity for assessing the evidence provided by the data in support of a model. The marginal likelihood is the normalizing constant for the posterior density, obtained by integrating the product of the likelihood and the prior with respect to model parameters. Thus, the computational burden of computing the marginal likelihood scales with the dimension of the parameter space. In phylogenetics, where we work with tree topologies that are high-dimensional models, standard approaches to computing marginal likelihoods are very slow. Here, we study methods to quickly compute the marginal likelihood of a single fixed tree topology. We benchmark the speed and accuracy of 19 different methods to compute the marginal likelihood of phylogenetic topologies on a suite of real data sets under the JC69 model. These methods include several new ones that we develop explicitly to solve this problem, as well as existing algorithms that we apply to phylogenetic models for the first time. Altogether, our results show that the accuracy of these methods varies widely, and that accuracy does not necessarily correlate with computational burden. Our newly developed methods are orders of magnitude faster than standard approaches, and in some cases, their accuracy rivals the best established estimators.
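The quantity this abstract refers to can be written out explicitly. The notation below is an assumption chosen for illustration and does not come from the paper: D is the sequence data, τ a fixed tree topology, and θ the continuous parameters conditional on τ (under JC69, essentially the branch lengths).

```latex
\[
  p(D \mid \tau) \;=\; \int p(D \mid \tau, \theta)\, p(\theta \mid \tau)\, \mathrm{d}\theta ,
  \qquad
  p(\theta \mid D, \tau) \;=\; \frac{p(D \mid \tau, \theta)\, p(\theta \mid \tau)}{p(D \mid \tau)} .
\]
\[
  p(\tau \mid D) \;=\; \frac{p(D \mid \tau)\, p(\tau)}{\sum_{\tau'} p(D \mid \tau')\, p(\tau')} .
\]
```

The first line shows the marginal likelihood as the normalizing constant of the per-topology posterior density. The second line shows why per-topology marginal likelihoods are the ingredient needed to turn a set of candidate topologies into a posterior distribution over topologies.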

Rule Extraction from Decision Trees Ensembles: New Algorithms Based on Heuristic Search and Sparse Group Lasso Methods

International Journal of Information Technology & Decision Making ◽

10.1142/s0219622017500055 ◽

2017 ◽

Vol 16 (06) ◽

pp. 1707-1727 ◽

Cited By ~ 9

Author(s):

Morteza Mashayekhi ◽

Robin Gras

Keyword(s):

Decision Trees ◽

Predictive Accuracy ◽

Weight Vector ◽

Rule Extraction ◽

Group Lasso ◽

Hill Climbing ◽

Data Sets ◽

Sparse Group Lasso ◽

Rule Set ◽

Interpretable Models

Decision trees are examples of easily interpretable models whose predictive accuracy is normally low. In comparison, decision tree ensembles (DTEs) such as random forest (RF) exhibit high predictive accuracy while being regarded as black-box models. We propose three new rule extraction algorithms from DTEs. The RF[Formula: see text]DHC method, a hill climbing method with downhill moves (DHC), is used to search for a rule set that decreases the number of rules dramatically. In the RF[Formula: see text]SGL and RF[Formula: see text]MSGL methods, the sparse group lasso (SGL) method, and the multiclass SGL (MSGL) method are employed respectively to find a sparse weight vector corresponding to the rules generated by RF. Experimental results with 24 data sets show that the proposed methods outperform similar state-of-the-art methods, in terms of human comprehensibility, by greatly reducing the number of rules and limiting the number of antecedents in the retained rules, while preserving the same level of accuracy.

Bayesian Tip-Dated Phylogenetics in Paleontology: Topological Effects and Stratigraphic Fit

Systematic Biology ◽

10.1093/sysbio/syaa057 ◽

2020 ◽

Author(s):

Benedict King

Keyword(s):

Phylogenetic Analysis ◽

Bayesian Methods ◽

Diversification Rate ◽

Evolutionary Relationships ◽

Data Sets ◽

Tree Model ◽

Phylogenetic Methods ◽

History Of ◽

Tree Topologies ◽

Birth Death

Abstract The incorporation of stratigraphic data into phylogenetic analysis has a long history of debate but is not currently standard practice for paleontologists. Bayesian tip-dated (or morphological clock) phylogenetic methods have returned these arguments to the spotlight, but how tip dating affects the recovery of evolutionary relationships has yet to be fully explored. Here I show, through analysis of several data sets with multiple phylogenetic methods, that topologies produced by tip dating are outliers as compared to topologies produced by parsimony and undated Bayesian methods, which retrieve broadly similar trees. Unsurprisingly, trees recovered by tip dating have better fit to stratigraphy than trees recovered by other methods under both the Gap Excess Ratio (GER) and the Stratigraphic Completeness Index (SCI). This is because trees with better stratigraphic fit are assigned a higher likelihood by the fossilized birth-death tree model. However, the degree to which the tree model favors tree topologies with high stratigraphic fit metrics is modulated by the diversification dynamics of the group under investigation. In particular, when net diversification rate is low, the tree model favors trees with a higher GER compared to when net diversification rate is high. Differences in stratigraphic fit and tree topology between tip dating and other methods are concentrated in parts of the tree with weaker character signal, as shown by successive deletion of the most incomplete taxa from two data sets. These results show that tip dating incorporates stratigraphic data in an intuitive way, with good stratigraphic fit an expectation that can be overturned by strong evidence from character data. [fossilized birth-death; fossils; missing data; morphological clock; morphology; parsimony; phylogenetics.]

Sampling phylogenetic tree space with the generalized Gibbs sampler

Cladistics ◽

10.1111/cla.12093 ◽

2014 ◽

Vol 31 (4) ◽

pp. 438-440

Author(s):

Jonathan M. Keith

Keyword(s):

Phylogenetic Tree ◽

Gibbs Sampler ◽

Tree Space

Extension of Colijn-Plazotta tree shape distance metric to unrooted trees

10.1101/506022 ◽

2018 ◽

Author(s):

Alexey Anatolievich Morozov

Keyword(s):

Phylogenetic Tree ◽

Euclidean Distance ◽

Tree Topology ◽

Proof Of Concept ◽

Distance Metric ◽

Tree Shape ◽

Single Pass ◽

The Difference ◽

Shape Distance ◽

Labeling Scheme

The Colijn-Plazotta tree shape labeling scheme describes an arbitrary phylogenetic tree topology by recursively labeling all nodes from tips to root with integers. The multisets of these labels can then be used to estimate the difference between topologies using, e.g., Euclidean distance. In this work I propose an extension of the labeling scheme (and thus a distance metric) to unrooted trees, which is achieved by labeling all rooted subtrees within a given tree. To avoid exhaustively enumerating the subtrees, the labels are collected into a dependency graph and calculated in a single pass. A proof-of-concept implementation is available at https://github.com/synedraacus/metrics.
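As a rough illustration of the rooted labeling that this work extends, the sketch below labels a rooted binary tree from tips to root and compares label multisets by Euclidean distance. It does not implement the unrooted extension or the dependency graph, and it is unrelated to the linked repository. The labeling rule used here (tips get 1; a node whose children carry labels k ≥ j gets k(k−1)/2 + j + 1) is the rule of the original rooted scheme as I understand it and is not stated in the abstract; the nested-tuple tree encoding and the function names are assumptions.

```python
import math
from collections import Counter


def cp_label(tree, labels):
    """Label a rooted binary tree given as nested 2-tuples (tips are any
    non-tuple value), appending every node's label to `labels`."""
    if not isinstance(tree, tuple):
        labels.append(1)                    # tips are labeled 1
        return 1
    left, right = tree
    a = cp_label(left, labels)
    b = cp_label(right, labels)
    k, j = max(a, b), min(a, b)
    label = k * (k - 1) // 2 + j + 1        # assumed rooted labeling rule
    labels.append(label)
    return label


def label_multiset(tree):
    """Multiset (Counter) of the labels of all nodes in the tree."""
    labels = []
    cp_label(tree, labels)
    return Counter(labels)


def shape_distance(t1, t2):
    """Euclidean distance between the label-count vectors of two trees."""
    c1, c2 = label_multiset(t1), label_multiset(t2)
    return math.sqrt(sum((c1[x] - c2[x]) ** 2 for x in set(c1) | set(c2)))


caterpillar = ((("A", "B"), "C"), "D")
balanced = (("A", "B"), ("C", "D"))
print(label_multiset(caterpillar), label_multiset(balanced))
print(shape_distance(caterpillar, balanced))
```

Tip names play no role in the labels, so the distance depends only on tree shape, which is the point of the scheme.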

Distance preserving dimension reduction with local-topology based scaling for improved classification of Biomedical data-sets

10.1101/2019.12.27.889337 ◽

2019 ◽

Author(s):

Karaj Khosla ◽

Indra Prakash Jha ◽

Vibhor Kumar

Keyword(s):

Dimension Reduction ◽

Data Sets ◽

Biomedical Data ◽

Improve Performance ◽

Distance Information ◽

Low Dimension ◽

Data Points ◽

Reduced Dimension ◽

Local Topology

Abstract Dimension reduction is often used in the analysis of high-dimensional biomedical data-sets, for procedures such as classification or outlier detection. To improve the performance of such data-mining steps, preserving both distance information and local topology among data-points could be more useful than giving priority to visualisation in low dimension. Therefore, we introduce topology preserving distance scaling (TPDS) to augment dimension reduction methods meant to reproduce distance information from the higher dimension. Our approach involves distance inflation to preserve local topology and avoid collapse during distance-preservation-based optimisation. Applying TPDS to diverse biomedical data-sets revealed that, besides providing better visualisation than typical distance-preserving methods, TPDS leads to better classification of data points in reduced dimension. For data-sets with outliers, the TPDS approach also proves useful for achieving better convergence, even with purely distance-preserving methods.

Locality-Sensitive Hashing for Information Retrieval System on Multiple GPGPU Devices

Applied Sciences ◽

10.3390/app10072539 ◽

2020 ◽

Vol 10 (7) ◽

pp. 2539 ◽

Cited By ~ 1

Author(s):

Toan Nguyen Mau ◽

Yasushi Inoguchi

Keyword(s):

Big Data ◽

Information Retrieval ◽

Retrieval System ◽

Hash Table ◽

Information Retrieval System ◽

Main Memory ◽

Locality Sensitive Hashing ◽

Data Sets ◽

Similar Data ◽

Data Set

It is challenging to build a real-time information retrieval system, especially for systems with high-dimensional big data. To structure big data, many hashing algorithms that map similar data items to the same bucket to advance the search have been proposed. Locality-Sensitive Hashing (LSH) is a common approach for reducing the number of dimensions of a data set, by using a family of hash functions and a hash table. The LSH hash table is an additional component that supports the indexing of hash values (keys) for the corresponding data/items. We previously proposed the Dynamic Locality-Sensitive Hashing (DLSH) algorithm with a dynamically structured hash table, optimized for storage in the main memory and General-Purpose computation on Graphics Processing Units (GPGPU) memory. This supports the handling of constantly updated data sets, such as songs, images, or text databases. The DLSH algorithm works effectively with data sets that are updated with high frequency and is compatible with parallel processing. However, the use of a single GPGPU device for processing big data is inadequate, due to the small memory capacity of GPGPU devices. When using multiple GPGPU devices for searching, we need an effective search algorithm to balance the jobs. In this paper, we propose an extension of DLSH for big data sets using multiple GPGPUs, in order to increase the capacity and performance of the information retrieval system. Different search strategies on multiple DLSH clusters are also proposed to adapt our parallelized system. With significant results in terms of performance and accuracy, we show that DLSH can be applied to real-life dynamic database systems.
