Disjoint Tree Mergers for Large-Scale Maximum Likelihood Tree Estimation

The estimation of phylogenetic trees for individual genes or multi-locus datasets is a basic part of considerable biological research. In order to enable large trees to be computed, Disjoint Tree Mergers (DTMs) have been developed; these methods operate by dividing the input sequence dataset into disjoint sets, constructing trees on each subset, and then combining the subset trees (using auxiliary information) into a tree on the full dataset. DTMs have been used to advantage for multi-locus species tree estimation, enabling highly accurate species trees at reduced computational effort, compared to leading species tree estimation methods. Here, we evaluate the feasibility of using DTMs to improve the scalability of maximum likelihood (ML) gene tree estimation to large numbers of input sequences. Our study shows distinct differences between the three selected ML codes—RAxML-NG, IQ-TREE 2, and FastTree 2—and shows that good DTM pipeline design can provide advantages over these ML codes on large datasets.

Download Full-text

Phylogenomic species tree estimation in the presence of incomplete lineage sorting and horizontal gene transfer

10.1101/023168 ◽

2015 ◽

Cited By ~ 1

Author(s):

Ruth Davidson ◽

Pranjal Vachaspati ◽

Siavash Mirarab ◽

Tandy Warnow

Keyword(s):

Maximum Likelihood ◽

Gene Transfer ◽

Horizontal Gene Transfer ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Estimation Methods ◽

Species Trees ◽

Lineage Sorting ◽

Tree Estimation

Background: Species tree estimation is challenged by gene tree heterogeneity resulting from biological processes such as duplication and loss, hybridization, incomplete lineage sorting (ILS), and horizontal gene transfer (HGT). Mathematical theory about reconstructing species trees in the presence of HGT alone or ILS alone suggests that quartet-based species tree methods (known to be statistically consistent under ILS, or under bounded amounts of HGT) might be effective techniques for estimating species trees when both HGT and ILS are present. Results: We evaluated several publicly available coalescent-based methods and concatenation under maximum likelihood on simulated datasets with moderate ILS and varying levels of HGT. Our study shows that two quartet-based species tree estimation methods (ASTRAL-2 and weighted Quartets MaxCut) are both highly accurate, even on datasets with high rates of HGT. In contrast, although NJst and concatenation using maximum likelihood are highly accurate under low HGT, they are less robust to high HGT rates. Conclusion: Our study shows that quartet-based species-tree estimation methods can be highly accurate under the presence of both HGT and ILS. The study suggests the possibility that some quartet-based methods might be statistically consistent under phylogenomic models of gene tree heterogeneity with both HGT and ILS. Keywords: phylogenomics; HGT; ILS; summary methods; concatenation

Download Full-text

FastMulRFS: fast and accurate species tree estimation under generic gene duplication and loss models

Bioinformatics ◽

10.1093/bioinformatics/btaa444 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i57-i65 ◽

Cited By ~ 3

Author(s):

Erin K Molloy ◽

Tandy Warnow

Keyword(s):

Gene Duplication ◽

Species Tree ◽

Supplementary Information ◽

Biological Research ◽

Generic Model ◽

Species Trees ◽

Basic Part ◽

Gene Duplication And Loss ◽

Tree Estimation ◽

Scalable Methods

Abstract Motivation Species tree estimation is a basic part of biological research but can be challenging because of gene duplication and loss (GDL), which results in genes that can appear more than once in a given genome. All common approaches in phylogenomic studies either reduce available data or are error-prone, and thus, scalable methods that do not discard data and have high accuracy on large heterogeneous datasets are needed. Results We present FastMulRFS, a polynomial-time method for estimating species trees without knowledge of orthology. We prove that FastMulRFS is statistically consistent under a generic model of GDL when adversarial GDL does not occur. Our extensive simulation study shows that FastMulRFS matches the accuracy of MulRF (which tries to solve the same optimization problem) and has better accuracy than prior methods, including ASTRAL-multi (the only method to date that has been proven statistically consistent under GDL), while being much faster than both methods. Availability and impementation FastMulRFS is available on Github (https://github.com/ekmolloy/fastmulrfs). Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

wQFM: Statistically Consistent Genome-scale Species Tree Estimation from Weighted Quartets

10.1101/2020.11.30.403352 ◽

2020 ◽

Author(s):

Mahim Mahbub ◽

Zahin Wahab ◽

Rezwana Reaz ◽

M. Saifur Rahman ◽

Md. Shamsuzzoha Bayzid

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Accurate Method ◽

Species Tree ◽

Estimation Methods ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Tree Estimation ◽

Source Form

AbstractMotivationSpecies tree estimation from genes sampled from throughout the whole genome is complicated due to the gene tree-species tree discordance. Incomplete lineage sorting (ILS) is one of the most frequent causes for this discordance, where alleles can coexist in populations for periods that may span several speciation events. Quartet-based summary methods for estimating species trees from a collection of gene trees are becoming popular due to their high accuracy and statistical guarantee under ILS. Generating quartets with appropriate weights, where weights correspond to the relative importance of quartets, and subsequently amalgamating the weighted quartets to infer a single coherent species tree allows for a statistically consistent way of estimating species trees. However, handling weighted quartets is challenging.ResultsWe propose wQFM, a highly accurate method for species tree estimation from multi-locus data, by extending the quartet FM (QFM) algorithm to a weighted setting. wQFM was assessed on a collection of simulated and real biological datasets, including the avian phylogenomic dataset which is one of the largest phylogenomic datasets to date. We compared wQFM with wQMC, which is the best alternate method for weighted quartet amalgamation, and with ASTRAL, which is one of the most accurate and widely used coalescent-based species tree estimation methods. Our results suggest that wQFM matches or improves upon the accuracy of wQMC and ASTRAL.AvailabilitywQFM is available in open source form at https://github.com/Mahim1997/wQFM-2020.

Download Full-text

FastMulRFS: Fast and accurate species tree estimation under generic gene duplication and loss models

10.1101/835553 ◽

2019 ◽

Cited By ~ 2

Author(s):

Erin K. Molloy ◽

Tandy Warnow

Keyword(s):

Gene Duplication ◽

Species Tree ◽

Biological Research ◽

Generic Model ◽

Species Trees ◽

Basic Part ◽

Heterogeneous Datasets ◽

Gene Duplication And Loss ◽

Tree Estimation ◽

Scalable Methods

AbstractMotivationSpecies tree estimation is a basic part of biological research but can be challenging because of gene duplication and loss (GDL), which results in genes that can appear more than once in a given genome. All common approaches in phylogenomic studies either reduce available data or are error-prone, and thus, scalable methods that do not discard data and have high accuracy on large heterogeneous datasets are needed.ResultsWe present FastMulRFS, a polynomial-time method for estimating species trees without knowledge of orthology. We prove that FastMulRFS is statistically consistent under a generic model of GDL when adversarial GDL does not occur. Our extensive simulation study shows that FastMulRFS matches the accuracy of MulRF (which tries to solve the same optimization problem) and has better accuracy than prior methods, including ASTRAL-multi (the only method to date that has been proven statistically consistent under GDL), while being much faster than both methods.AvailabilityFastMulRFS is available on Github (https://github.com/ekmolloy/fastmulrfs).

Download Full-text

Multispecies Coalescent: Theory and Applications in Phylogenetics

Annual Review of Ecology Evolution and Systematics ◽

10.1146/annurev-ecolsys-012121-095340 ◽

2021 ◽

Vol 52 (1) ◽

Author(s):

Siavash Mirarab ◽

Luay Nakhleh ◽

Tandy Warnow

Keyword(s):

Incomplete Lineage Sorting ◽

Species Tree ◽

Phylogenetic Networks ◽

Biological Research ◽

Annual Review ◽

Publication Date ◽

Gene Trees ◽

Species Trees ◽

Basic Part ◽

Tree Estimation

Species tree estimation is a basic part of many biological research projects, ranging from answering basic evolutionary questions (e.g., how did a group of species adapt to their environments?) to addressing questions in functional biology. Yet, species tree estimation is very challenging, due to processes such as incomplete lineage sorting, gene duplication and loss, horizontal gene transfer, and hybridization, which can make gene trees differ from each other and from the overall evolutionary history of the species. Over the last 10–20 years, there has been tremendous growth in methods and mathematical theory for estimating species trees and phylogenetic networks, and some of these methods are now in wide use. In this survey, we provide an overview of the current state of the art, identify the limitations of existing methods and theory, and propose additional research problems and directions. Expected final online publication date for the Annual Review of Ecology, Evolution, and Systematics, Volume 52 is November 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Download Full-text

Why concatenation fails in the anomaly zone

10.1101/116509 ◽

2017 ◽

Cited By ~ 2

Author(s):

Fábio K. Mendes ◽

Matthew W. Hahn

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Estimation Methods ◽

Species Trees ◽

Lineage Sorting ◽

Common Gene ◽

Tree Estimation ◽

Genome Scale ◽

Future Work

AbstrctGenome-scale sequencing has been of great benefit in recovering species trees, but has not provided final answers. Despite the rapid accumulation of molecular sequences, resolving short and deep branches of the tree of life has remained a challenge, and has prompted the development of new strategies that can make the best use of available data. One such strategy – the concatenation of gene alignments – can be successful when coupled with many tree estimation methods, but has also been shown to fail when there are high levels of incomplete lineage sorting. Here, we focus on the failure of likelihood-based methods in retrieving a rooted, asymmetric four-taxon species tree from concatenated data when the species tree is in or near the anomaly zone – a region of parameter space where the most common gene tree does not match the species tree because of incomplete lineage sorting. First, we use coalescent theory to prove that most informative sites will support the species tree in the anomaly zone, and that as a consequence maximum-parsimony succeeds in recovering the species tree from concatenated data. We further show that maximum-likelihood tree estimation from concatenated data fails both inside and outside the anomaly zone, and that this failure is unconnected to the frequency of the most common gene tree. We provide support for a hypothesis that likelihood-based methods fail in and near the anomaly zone because discordant sites on the species tree have a lower likelihood than those that are discordant on alternative topologies. Our results confirm and extend previous reports of the failure and success of likelihood- and parsimony-based methods, and highlight avenues for future work improving the performance of methods aimed at recovering species tree.

Download Full-text

The Perfect Storm: Gene Tree Estimation Error, Incomplete Lineage Sorting, and Ancient Gene Flow Explain the Most Recalcitrant Ancient Angiosperm Clade, Malpighiales

Systematic Biology ◽

10.1093/sysbio/syaa083 ◽

2020 ◽

Author(s):

Liming Cai ◽

Zhenxiang Xi ◽

Emily Moriarty Lemmon ◽

Alan R Lemmon ◽

Austin Mast ◽

...

Keyword(s):

Gene Flow ◽

Incomplete Lineage Sorting ◽

Estimation Error ◽

Gene Tree ◽

Species Tree ◽

Flowering Plant ◽

Estimation Methods ◽

Lineage Sorting ◽

Tree Estimation ◽

Perfect Storm

Abstract The genomic revolution offers renewed hope of resolving rapid radiations in the Tree of Life. The development of the multispecies coalescent (MSC) model and improved gene tree estimation methods can better accommodate gene tree heterogeneity caused by incomplete lineage sorting (ILS) and gene tree estimation error stemming from the short internal branches. However, the relative influence of these factors in species tree inference is not well understood. Using anchored hybrid enrichment, we generated a data set including 423 single-copy loci from 64 taxa representing 39 families to infer the species tree of the flowering plant order Malpighiales. This order includes nine of the top ten most unstable nodes in angiosperms, which have been hypothesized to arise from the rapid radiation during the Cretaceous. Here, we show that coalescent-based methods do not resolve the backbone of Malpighiales and concatenation methods yield inconsistent estimations, providing evidence that gene tree heterogeneity is high in this clade. Despite high levels of ILS and gene tree estimation error, our simulations demonstrate that these two factors alone are insufficient to explain the lack of resolution in this order. To explore this further, we examined triplet frequencies among empirical gene trees and discovered some of them deviated significantly from those attributed to ILS and estimation error, suggesting gene flow as an additional and previously unappreciated phenomenon promoting gene tree variation in Malpighiales. Finally, we applied a novel method to quantify the relative contribution of these three primary sources of gene tree heterogeneity and demonstrated that ILS, gene tree estimation error, and gene flow contributed to 10.0%, 34.8%, and 21.4% of the variation, respectively. Together, our results suggest that a perfect storm of factors likely influence this lack of resolution, and further indicate that recalcitrant phylogenetic relationships like the backbone of Malpighiales may be better represented as phylogenetic networks. Thus, reducing such groups solely to existing models that adhere strictly to bifurcating trees greatly oversimplifies reality, and obscures our ability to more clearly discern the process of evolution.

Download Full-text

Phylogenomic terraces: presence and implication in species tree estimation from gene trees

10.1101/2020.04.19.048843 ◽

2020 ◽

Author(s):

Ishrat Tanzila Farah ◽

Md Muktadirul Islam ◽

Kazi Tasnim Zinat ◽

Atif Hasan Rahman ◽

Md Shamsuzzoha Bayzid

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Deep Coalescence ◽

Tree Estimation ◽

Tree Space ◽

Multiple Species

AbstractSpecies tree estimation from multi-locus dataset is extremely challenging, especially in the presence of gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Summary methods have been developed which estimate gene trees and then combine the gene trees to estimate a species tree by optimizing various optimization scores. In this study, we have formalized the concept of “phylogenomic terraces” in the species tree space, where multiple species trees with distinct topologies may have exactly the same optimization score (quartet score, extra lineage score, etc.) with respect to a collection of gene trees. We investigated the presence and implication of terraces in species tree estimation from multi-locus data by taking ILS into account. We analyzed two of the most popular ILS-aware optimization criteria: maximize quartet consistency (MQC) and minimize deep coalescence (MDC). Methods based on MQC are provably statistically consistent, whereas MDC is not a consistent criterion for species tree estimation. Our experiments, on a collection of dataset simulated under ILS, indicate that MDC-based methods may achieve competitive or identical quartet consistency score as MQC but could be significantly worse than MQC in terms of tree accuracy – demonstrating the presence and affect of phylogenomic terraces. This is the first known study that formalizes the concept of phylogenomic terraces in the context of species tree estimation from multi-locus data, and reports the presence and implications of terraces in species tree estimation under ILS.

Download Full-text

Complexity of the simplest species tree problem

Molecular Biology and Evolution ◽

10.1093/molbev/msab009 ◽

2021 ◽

Author(s):

Tianqi Zhu ◽

Ziheng Yang

Keyword(s):

Maximum Likelihood ◽

Estimation Error ◽

Gene Tree ◽

Species Tree ◽

Ancestral Population ◽

Statistical Efficiency ◽

Multispecies Coalescent ◽

Full Likelihood ◽

Tree Methods ◽

Tree Estimation

Abstract The multispecies coalescent (MSC) model provides a natural framework for species tree estimation accounting for gene-tree conflicts. While a number of species tree methods under the MSC have been suggested and evaluated using simulation, their statistical properties remain poorly understood. Here we use mathematical analysis aided by computer simulation to examine the identifiability, consistency, and efficiency of different species tree methods in the case of three species and three sequences under the molecular clock. We consider four major species-tree methods including concatenation, two-step, independent-sites maximum likelihood (ISML) and maximum likelihood (ML). We develop approximations that predict that the probit transform of the species tree estimation error decreases linearly with the square root of the number of loci. Even in this simplest case major differences exist among the methods. Fulllikelihood methods are considerably more efficient than summary methods such as concatenation and two-step. They also provide estimates of important parameters such as species divergence times and ancestral population sizes while these parameters are not identifiable by summary methods. Our results highlight the need to improve the statistical efficiency of summary methods and the computational efficiency of full likelihood methods of species tree estimation.

Download Full-text

PRANC: ML species tree estimation from the ranked gene trees under coalescence

Bioinformatics ◽

10.1093/bioinformatics/btaa605 ◽

2020 ◽

Vol 36 (18) ◽

pp. 4819-4821

Author(s):

Anastasiia Kim ◽

James H Degnan

Keyword(s):

Maximum Likelihood ◽

Gene Tree ◽

Species Tree ◽

Supplementary Information ◽

Supplementary Data ◽

Gene Trees ◽

Multispecies Coalescent ◽

Branch Lengths ◽

Tree Topologies ◽

Tree Estimation

Abstract Summary PRANC computes the Probabilities of RANked gene tree topologies under the multispecies coalescent. A ranked gene tree is a gene tree accounting for the temporal ordering of internal nodes. PRANC can also estimate the maximum likelihood (ML) species tree from a sample of ranked or unranked gene tree topologies. It estimates the ML tree with estimated branch lengths in coalescent units. Availability and implementation PRANC is written in C++ and freely available at github.com/anastasiiakim/PRANC. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text