To Include or Not to Include: The Impact of Gene Filtering on Species Tree Estimation Methods

Abstract Species tree estimation from loci sampled from multiple genomes is now common, but is challenged by the heterogeneity across the genome due to multiple processes, such as gene duplication and loss, horizontal gene transfer, and incomplete lineage sorting. Although methods for estimating species trees have been developed that address gene tree heterogeneity due to incomplete lineage sorting, many of these methods operate by combining estimated gene trees and are hence vulnerable to gene tree quality. There is also the added concern that missing data, which is frequently encountered in genome-scale datasets, will impact species tree estimation. Our study addresses the impact of gene filtering on species trees inferred from multi-gene datasets. We address these questions using a large and heterogeneous collection of simulated datasets both with and without missing data. We compare several established coalescent-based methods (ASTRAL, ASTRID, MP-EST, and SVDquartets within PAUP*) as well as unpartitioned concatenation using maximum likelihood (RAxML). Our study shows that gene tree error and missing data impact all methods (and some methods degrade more than others), but the degree of incomplete lineage sorting and gene tree estimation error impacts the absolute and relative performance of methods as well as their response to gene filtering strategies. We find that filtering genes based on the degree of missing data is either neutral or else reduces the accuracy of all five methods examined, and so is not recommended. Filtering genes based on gene tree estimation error shows somewhat different trends. Under low levels of incomplete lineage sorting, removing genes with high gene tree estimation error can improve the accuracy of summary methods, but only if not too many genes are removed. Otherwise, filtering genes tends to increase error, especially under high levels of incomplete lineage sorting.
Hence, while filtering genes based on missing data is not recommended, there are conditions under which removing high error gene trees can improve species tree estimation. This study provides insights into prior studies and suggests approaches for analyzing phylogenomic datasets.
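
To make concrete what "filtering genes based on the degree of missing data" can mean, the following is a minimal sketch, not the procedure used in the study. It assumes that missing data is measured as the fraction of taxa absent from a gene; the function names, the `genes` layout, and the threshold are all hypothetical.

```python
def missing_taxa_fraction(present_taxa, all_taxa):
    """Fraction of the full taxon set that is absent from one gene."""
    return 1.0 - len(set(present_taxa) & set(all_taxa)) / len(all_taxa)


def filter_by_missing(genes, all_taxa, max_missing):
    """Keep genes whose missing-taxa fraction is at most `max_missing`.

    `genes` maps a gene name to the set of taxa sampled for that gene
    (a hypothetical layout chosen for illustration).
    """
    return {
        name: taxa
        for name, taxa in genes.items()
        if missing_taxa_fraction(taxa, all_taxa) <= max_missing
    }


all_taxa = {"A", "B", "C", "D"}
genes = {
    "g1": {"A", "B", "C", "D"},  # nothing missing
    "g2": {"A", "B"},            # half of the taxa missing
    "g3": {"A", "B", "C"},       # one quarter of the taxa missing
}
kept = filter_by_missing(genes, all_taxa, max_missing=0.25)
print(sorted(kept))  # ['g1', 'g3']
```

The abstract reports that this kind of filter was neutral or harmful in the simulations, so the sketch is only meant to show the operation being evaluated, not to recommend it.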


The Impact of Cross-Species Gene Flow on Species Tree Estimation

10.1101/820019 ◽

2019 ◽

Cited By ~ 1

Author(s):

Xiyun Jiao ◽

Tomáš Flouri ◽

Bruce Rannala ◽

Ziheng Yang

Keyword(s):

Gene Flow ◽

Sequence Data ◽

Majority Vote ◽

Gene Tree ◽

Species Tree ◽

Likelihood Method ◽

Estimation Methods ◽

Vote Method ◽

Tree Estimation ◽

The Impact

ABSTRACT Recent analyses of genomic sequence data suggest cross-species gene flow is common in both plants and animals, posing challenges to species tree inference. We examine the levels of gene flow needed to mislead species tree estimation with three species and either episodic introgressive hybridization or continuous migration between an outgroup and one ingroup species. Several species tree estimation methods are examined, including the majority-vote method based on the most common gene tree topology (with either the true or reconstructed gene trees used), the UPGMA method based on the average sequence distances (or average coalescent times) between species, and the full-likelihood method based on multi-locus sequence data. Our results suggest that the majority-vote method is more robust to gene flow than the UPGMA method and both are more robust than likelihood assuming a multispecies coalescent (MSC) model with no cross-species gene flow. A small amount of introgression or migration can mislead species tree methods if the species diverged through speciation events separated by short time intervals. Estimates of parameters under the MSC with gene flow suggest the Anopheles gambiae African mosquito species complex is an example where gene flow greatly impacts species phylogeny.
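
The majority-vote idea for three species can be sketched as follows. This is an illustrative sketch only: it assumes each gene tree has already been reduced to the pair of species it places as sisters, and the function name and input layout are hypothetical rather than taken from the paper.

```python
from collections import Counter


def majority_vote(sister_pairs, species=("A", "B", "C")):
    """Return the three-species topology supported by the most gene trees.

    Each element of `sister_pairs` is the pair of species that one gene
    tree places as sisters. Ties are broken by first occurrence, which is
    an arbitrary choice made for this sketch.
    """
    counts = Counter(frozenset(pair) for pair in sister_pairs)
    best_pair, _ = counts.most_common(1)[0]
    other = (set(species) - best_pair).pop()  # the species outside the sister pair
    x, y = sorted(best_pair)
    return f"(({x},{y}),{other})"


# Ten gene trees: six support (A,B), three support (B,C), one supports (A,C).
gene_trees = [("A", "B")] * 6 + [("B", "C")] * 3 + [("A", "C")]
print(majority_vote(gene_trees))  # ((A,B),C)
```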


The Perfect Storm: Gene Tree Estimation Error, Incomplete Lineage Sorting, and Ancient Gene Flow Explain the Most Recalcitrant Ancient Angiosperm Clade, Malpighiales

Systematic Biology ◽

10.1093/sysbio/syaa083 ◽

2020 ◽

Author(s):

Liming Cai ◽

Zhenxiang Xi ◽

Emily Moriarty Lemmon ◽

Alan R Lemmon ◽

Austin Mast ◽

...

Keyword(s):

Gene Flow ◽

Incomplete Lineage Sorting ◽

Estimation Error ◽

Gene Tree ◽

Species Tree ◽

Flowering Plant ◽

Estimation Methods ◽

Lineage Sorting ◽

Tree Estimation ◽

Perfect Storm

Abstract The genomic revolution offers renewed hope of resolving rapid radiations in the Tree of Life. The development of the multispecies coalescent (MSC) model and improved gene tree estimation methods can better accommodate gene tree heterogeneity caused by incomplete lineage sorting (ILS) and gene tree estimation error stemming from the short internal branches. However, the relative influence of these factors in species tree inference is not well understood. Using anchored hybrid enrichment, we generated a data set including 423 single-copy loci from 64 taxa representing 39 families to infer the species tree of the flowering plant order Malpighiales. This order includes nine of the top ten most unstable nodes in angiosperms, which have been hypothesized to arise from the rapid radiation during the Cretaceous. Here, we show that coalescent-based methods do not resolve the backbone of Malpighiales and concatenation methods yield inconsistent estimations, providing evidence that gene tree heterogeneity is high in this clade. Despite high levels of ILS and gene tree estimation error, our simulations demonstrate that these two factors alone are insufficient to explain the lack of resolution in this order. To explore this further, we examined triplet frequencies among empirical gene trees and discovered some of them deviated significantly from those attributed to ILS and estimation error, suggesting gene flow as an additional and previously unappreciated phenomenon promoting gene tree variation in Malpighiales. Finally, we applied a novel method to quantify the relative contribution of these three primary sources of gene tree heterogeneity and demonstrated that ILS, gene tree estimation error, and gene flow contributed to 10.0%, 34.8%, and 21.4% of the variation, respectively. 
Together, our results suggest that a perfect storm of factors likely influence this lack of resolution, and further indicate that recalcitrant phylogenetic relationships like the backbone of Malpighiales may be better represented as phylogenetic networks. Thus, reducing such groups solely to existing models that adhere strictly to bifurcating trees greatly oversimplifies reality, and obscures our ability to more clearly discern the process of evolution.
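
The triplet-frequency reasoning can be illustrated with a short sketch. Under the MSC with ILS alone, the two minor (discordant) triplet topologies are expected at equal frequency, so a strong imbalance between them is one possible signal of gene flow. The code below is a simple one-degree-of-freedom chi-square check written for illustration. It is not the procedure used in the paper, and the function name and example counts are hypothetical.

```python
import math


def minor_triplet_asymmetry(counts):
    """Test whether the two minor triplet topologies are equally frequent.

    `counts` holds the number of gene trees supporting each of the three
    possible topologies for one species triplet. The most frequent
    topology is assumed to match the species tree (an assumption of this
    sketch). Returns the chi-square statistic (1 d.f.) and its p-value.
    """
    _, minor1, minor2 = sorted(counts, reverse=True)
    n = minor1 + minor2
    if n == 0:
        return 0.0, 1.0
    stat = (minor1 - minor2) ** 2 / n
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


stat, p_value = minor_triplet_asymmetry((200, 130, 70))
print(stat)            # 18.0
print(p_value < 0.001)  # True
```

A balanced triplet such as `(200, 100, 100)` gives a statistic of 0 and a p-value of 1, which is the pattern expected from ILS alone.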


TreeMerge: a new method for improving the scalability of species tree estimation methods

Bioinformatics ◽

10.1093/bioinformatics/btz344 ◽

2019 ◽

Vol 35 (14) ◽

pp. i417-i426 ◽

Cited By ~ 7

Author(s):

Erin K Molloy ◽

Tandy Warnow

Keyword(s):

Large Scale ◽

Species Tree ◽

New Method ◽

Divide And Conquer ◽

Supplementary Information ◽

Estimation Methods ◽

Running Time ◽

Tree Estimation ◽

Computationally Intensive ◽

A Minor

Abstract Motivation: At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n^5) running time for datasets with n species. Results: Here we present a new method called ‘TreeMerge’ that improves on NJMerge in two ways: it is guaranteed to return a tree and it has dramatically faster running time within the same divide-and-conquer framework, only O(n^2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets (that they would otherwise fail on), when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All. Availability and implementation: TreeMerge is publicly available on Github (http://github.com/ekmolloy/treemerge). Supplementary information: Supplementary data are available at Bioinformatics online.


wQFM: Statistically Consistent Genome-scale Species Tree Estimation from Weighted Quartets

10.1101/2020.11.30.403352 ◽

2020 ◽

Author(s):

Mahim Mahbub ◽

Zahin Wahab ◽

Rezwana Reaz ◽

M. Saifur Rahman ◽

Md. Shamsuzzoha Bayzid

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Accurate Method ◽

Species Tree ◽

Estimation Methods ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Tree Estimation ◽

Source Form

Abstract Motivation: Species tree estimation from genes sampled from throughout the whole genome is complicated due to the gene tree-species tree discordance. Incomplete lineage sorting (ILS) is one of the most frequent causes for this discordance, where alleles can coexist in populations for periods that may span several speciation events. Quartet-based summary methods for estimating species trees from a collection of gene trees are becoming popular due to their high accuracy and statistical guarantee under ILS. Generating quartets with appropriate weights, where weights correspond to the relative importance of quartets, and subsequently amalgamating the weighted quartets to infer a single coherent species tree allows for a statistically consistent way of estimating species trees. However, handling weighted quartets is challenging. Results: We propose wQFM, a highly accurate method for species tree estimation from multi-locus data, by extending the quartet FM (QFM) algorithm to a weighted setting. wQFM was assessed on a collection of simulated and real biological datasets, including the avian phylogenomic dataset which is one of the largest phylogenomic datasets to date. We compared wQFM with wQMC, which is the best alternate method for weighted quartet amalgamation, and with ASTRAL, which is one of the most accurate and widely used coalescent-based species tree estimation methods. Our results suggest that wQFM matches or improves upon the accuracy of wQMC and ASTRAL. Availability: wQFM is available in open source form at https://github.com/Mahim1997/wQFM-2020.
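
The step of generating weighted quartets from gene trees can be sketched as below. The sketch assumes that a quartet's weight is the number of gene trees that induce it, which is one natural reading of "relative importance" but is an assumption here. It covers only the weighting step, not the wQFM amalgamation algorithm. Gene trees are represented as nested tuples of leaf names, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def clades(tree):
    """Return (leaf set of `tree`, leaf sets of all its internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def induced_quartet(tree_clades, four):
    """Return the quartet split on `four` displayed by the tree, or None."""
    four = frozenset(four)
    for clade in tree_clades:
        inside = clade & four
        if len(inside) == 2:  # this clade separates two taxa from the other two
            return frozenset([inside, four - inside])
    return None  # the tree leaves these four taxa unresolved


def weighted_quartets(gene_trees):
    """Weight each quartet by the number of gene trees that induce it."""
    weights = Counter()
    for tree in gene_trees:
        leaves, tree_clades = clades(tree)
        for four in combinations(sorted(leaves), 4):
            quartet = induced_quartet(tree_clades, four)
            if quartet is not None:
                weights[quartet] += 1
    return weights


gene_trees = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]
weights = weighted_quartets(gene_trees)
ab_cd = frozenset([frozenset("AB"), frozenset("CD")])
ac_bd = frozenset([frozenset("AC"), frozenset("BD")])
print(weights[ab_cd], weights[ac_bd])  # 2 1
```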


The performance of coalescent-based species tree estimation methods under models of missing data

BMC Genomics ◽

10.1186/s12864-018-4619-8 ◽

2018 ◽

Vol 19 (S5) ◽

Cited By ~ 20

Author(s):

Michael Nute ◽

Jed Chou ◽

Erin K. Molloy ◽

Tandy Warnow

Keyword(s):

Missing Data ◽

Species Tree ◽

Estimation Methods ◽

Tree Estimation


The Impact of Cross-Species Gene Flow on Species Tree Estimation

Systematic Biology ◽

10.1093/sysbio/syaa001 ◽

2020 ◽

Vol 69 (5) ◽

pp. 830-847 ◽

Cited By ~ 1

Author(s):

Xiyun Jiao ◽

Tomáš Flouri ◽

Bruce Rannala ◽

Ziheng Yang

Keyword(s):

Gene Flow ◽

Sequence Data ◽

Majority Vote ◽

Gene Tree ◽

Species Tree ◽

Likelihood Method ◽

Estimation Methods ◽

Vote Method ◽

Multispecies Coalescent ◽

Tree Estimation

Abstract Recent analyses of genomic sequence data suggest cross-species gene flow is common in both plants and animals, posing challenges to species tree estimation. We examine the levels of gene flow needed to mislead species tree estimation with three species and either episodic introgressive hybridization or continuous migration between an outgroup and one ingroup species. Several species tree estimation methods are examined, including the majority-vote method based on the most common gene tree topology (with either the true or reconstructed gene trees used), the UPGMA method based on the average sequence distances (or average coalescent times) between species, and the full-likelihood method based on multilocus sequence data. Our results suggest that the majority-vote method based on gene tree topologies is more robust to gene flow than the UPGMA method based on coalescent times and both are more robust than likelihood assuming a multispecies coalescent (MSC) model with no cross-species gene flow. Comparison of the continuous migration model with the episodic introgression model suggests that a small amount of gene flow per generation can cause drastic changes to the genetic history of the species and mislead species tree methods, especially if the species diverged through radiative speciation events. Estimates of parameters under the MSC with gene flow suggest that African mosquito species in the Anopheles gambiae species complex constitute such an example of extreme impact of gene flow on species phylogeny. [IM; introgression; migration; MSci; multispecies coalescent; species tree.]


The Perfect Storm: Gene Tree Estimation Error, Incomplete Lineage Sorting, and Ancient Gene Flow Explain the Most Recalcitrant Ancient Angiosperm Clade, Malpighiales

10.1101/2020.05.26.112318 ◽

2020 ◽

Author(s):

Liming Cai ◽

Zhenxiang Xi ◽

Emily Moriarty Lemmon ◽

Alan R. Lemmon ◽

Austin Mast ◽

...

Keyword(s):

Gene Flow ◽

Incomplete Lineage Sorting ◽

Estimation Error ◽

Gene Tree ◽

Species Tree ◽

Flowering Plant ◽

Estimation Methods ◽

Lineage Sorting ◽

Tree Estimation ◽

Perfect Storm

ABSTRACT The genomic revolution offers renewed hope of resolving rapid radiations in the Tree of Life. The development of the multispecies coalescent (MSC) model and improved gene tree estimation methods can better accommodate gene tree heterogeneity caused by incomplete lineage sorting (ILS) and gene tree estimation error stemming from the short internal branches. However, the relative influence of these factors in species tree inference is not well understood. Using anchored hybrid enrichment, we generated a data set including 423 single-copy loci from 64 taxa representing 39 families to infer the species tree of the flowering plant order Malpighiales. This order alone includes nine of the top ten most unstable nodes in angiosperms, and the recalcitrant relationships along the backbone of the order have been hypothesized to arise from the rapid radiation during the Cretaceous. Here, we show that coalescent-based methods do not resolve the backbone of Malpighiales and concatenation methods yield inconsistent estimations, providing evidence that gene tree heterogeneity is high in this clade. Despite high levels of ILS and gene tree estimation error, our simulations demonstrate that these two factors alone are insufficient to explain the lack of resolution in this order. To explore this further, we examined triplet frequencies among empirical gene trees and discovered some of them deviated significantly from those attributed to ILS and estimation error, suggesting gene flow as an additional and previously unappreciated phenomenon promoting gene tree variation in Malpighiales. Finally, we applied a novel method to quantify the relative contribution of these three primary sources of gene tree heterogeneity and demonstrated that ILS, gene tree estimation error, and gene flow contributed to 15%, 52%, and 32% of the variation, respectively.
Together, our results suggest that a perfect storm of factors likely influence this lack of resolution, and further indicate that recalcitrant phylogenetic relationships like the backbone of Malpighiales may be better represented as phylogenetic networks. Thus, reducing such groups solely to existing models that adhere strictly to bifurcating trees greatly oversimplifies reality, and obscures our ability to more clearly discern the process of evolution.


Quantifying the impact of an inference model in Bayesian phylogenetics

10.1101/2019.12.17.879098 ◽

2019 ◽

Cited By ~ 1

Author(s):

Richèl J.C. Bilderbeek ◽

Giovanni Laudanno ◽

Rampal S. Etienne

Keyword(s):

Phylogenetic Trees ◽

R Package ◽

Species Tree ◽

Joint Estimation ◽

List Type ◽

Inference Model ◽

Bayesian Phylogenetics ◽

Character Sequences ◽

Tree Estimation ◽

The Impact

Summary Phylogenetic trees are currently routinely reconstructed from an alignment of character sequences (usually nucleotide sequences). Bayesian tools, such as MrBayes, RevBayes and BEAST2, have gained much popularity over the last decade, as they allow joint estimation of the posterior distribution of the phylogenetic trees and the parameters of the underlying inference model. An important ingredient of these Bayesian approaches is the species tree prior. In principle, the Bayesian framework allows for comparing different tree priors, which may elucidate the macroevolutionary processes underlying the species tree. In practice, however, only macroevolutionary models that allow for fast computation of the prior probability are used. The question is how accurate the tree estimation is when the real macroevolutionary processes are substantially different from those assumed in the tree prior. Here we present pirouette, a free and open-source R package that assesses the inference error made by Bayesian phylogenetics for a given macroevolutionary diversification model. pirouette makes use of BEAST2, but its philosophy applies to any Bayesian phylogenetic inference tool. We describe pirouette’s usage providing full examples in which we interrogate a model for its power to describe another. Last, we discuss the results obtained by the examples and their interpretation.


ASTRID: Accurate Species TRees from Internode Distances

10.1101/023036 ◽

2015 ◽

Cited By ~ 1

Author(s):

Pranjal Vachaspati ◽

Tandy Warnow

Keyword(s):

Good Accuracy ◽

Incomplete Lineage Sorting ◽

Current Method ◽

Species Tree ◽

Estimation Methods ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Tree Estimation ◽

Source Form

Background: Incomplete lineage sorting (ILS), modelled by the multi-species coalescent (MSC), is known to create discordance between gene trees and species trees, and lead to inaccurate species tree estimations unless appropriate methods are used to estimate the species tree. While many statistically consistent methods have been developed to estimate the species tree in the presence of ILS, only ASTRAL-2 and NJst have been shown to have good accuracy on large datasets. Yet, NJst is generally slower and less accurate than ASTRAL-2, and cannot run on some datasets. Results: We have redesigned NJst to enable it to run on all datasets, and we have expanded its design space so that it can be used with different distance-based tree estimation methods. The resultant method, ASTRID, is statistically consistent under the MSC model, and has accuracy that is competitive with ASTRAL-2. Furthermore, ASTRID is much faster than ASTRAL-2, completing in minutes on some datasets for which ASTRAL-2 used hours. Conclusions: ASTRID is a new coalescent-based method for species tree estimation that is competitive with the best current method in terms of accuracy, while being much faster. ASTRID is available in open source form on github.
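
The internode-distance idea behind NJst and ASTRID can be sketched as follows. This is a simplified illustration, not the ASTRID implementation. It assumes the internode distance between two leaves is the number of internal nodes on the path between them, that gene trees are given as nested tuples with no degree-two root, and that each pair is averaged over only those gene trees containing both leaves. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def internode_distances(tree):
    """Map each leaf pair to the number of internal nodes on its path."""
    dist = {}

    def walk(node):
        # Returns {leaf: internal nodes from `node` down to the leaf, inclusive}.
        if isinstance(node, str):
            return {node: 0}
        child_maps = [walk(child) for child in node]
        # `node` is the lowest common ancestor of leaves in different children.
        for left, right in combinations(child_maps, 2):
            for x, dx in left.items():
                for y, dy in right.items():
                    dist[frozenset((x, y))] = dx + dy + 1
        return {leaf: d + 1 for m in child_maps for leaf, d in m.items()}

    walk(tree)
    return dist


def average_internode_distances(gene_trees):
    """Average each pairwise distance over the gene trees containing the pair."""
    totals, counts = defaultdict(int), defaultdict(int)
    for tree in gene_trees:
        for pair, d in internode_distances(tree).items():
            totals[pair] += d
            counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


gene_trees = [
    ("A", "B", ("C", "D")),  # unrooted AB|CD
    ("A", "C", ("B", "D")),  # unrooted AC|BD
]
avg = average_internode_distances(gene_trees)
print(avg[frozenset("AB")])  # 1.5
print(avg[frozenset("AD")])  # 2.0
```

The resulting matrix would then be passed to a distance-based tree estimation method, which is the part of the design space the abstract says ASTRID expands.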
