Split Probabilities and Species Tree Inference Under the Multispecies Coalescent Model

Elizabeth S. Allman; James H. Degnan; John A. Rhodes

doi:10.1007/s11538-017-0363-5

MSCquartets 1.0: Quartet methods for species trees and networks under the multispecies coalescent model in R

doi:10.1101/2020.05.01.073361

2020

Author(s): John A. Rhodes; Hector Baños; Jonathan D. Mitchell; Elizabeth S. Allman

Keyword(s): Network Inference; Incomplete Lineage Sorting; R Package; Species Tree; Species Trees; Lineage Sorting; Coalescent Model; Multispecies Coalescent; Tree Inference; Species Tree Inference

Abstract: MSCquartets is an R package for species tree hypothesis testing, inference of species trees, and inference of species networks under the Multispecies Coalescent model of incomplete lineage sorting. Inputs for these analyses are collections of metric or topological locus trees, which are then summarized by the quartets displayed on them. Results of hypothesis tests at user-supplied levels are displayed in a simplex plot by color-coded points. The package includes the QDC and WQDC algorithms for topological and metric species tree inference, and the NANUQ algorithm for level-1 topological species network inference, all of which give statistically consistent estimators under the model.
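The quartet summary described in this abstract can be illustrated with a small sketch. It is written in Python rather than R and does not use or reproduce the MSCquartets API; all function names and the example counts are hypothetical. It assumes the standard result that, for a four-taxon species tree ab|cd with internal branch length x in coalescent units, the concordant quartet has probability 1 - (2/3)e^{-x} and each discordant quartet has probability (1/3)e^{-x}.

```python
import math
from collections import Counter

TOPOLOGIES = ("ab|cd", "ac|bd", "ad|bc")

def msc_quartet_probs(x):
    """Expected quartet probabilities for species tree ab|cd whose
    internal branch has length x (coalescent units) under the MSC."""
    discordant = math.exp(-x) / 3.0
    return {"ab|cd": 1.0 - 2.0 * discordant,
            "ac|bd": discordant,
            "ad|bc": discordant}

def quartet_frequencies(gene_quartets):
    """Summarize one quartet topology per locus as relative frequencies."""
    counts = Counter(gene_quartets)
    total = sum(counts.values())
    return {q: counts.get(q, 0) / total for q in TOPOLOGIES}

def estimate_internal_branch(p_major):
    """Invert p = 1 - (2/3) e^{-x}; only defined for 1/3 < p < 1."""
    if not (1.0 / 3.0 < p_major < 1.0):
        raise ValueError("major quartet frequency must lie in (1/3, 1)")
    return -math.log(1.5 * (1.0 - p_major))

# Made-up counts for 100 loci, for illustration only.
observed = ["ab|cd"] * 60 + ["ac|bd"] * 22 + ["ad|bc"] * 18
freqs = quartet_frequencies(observed)
x_hat = estimate_internal_branch(freqs["ab|cd"])
print(freqs, round(x_hat, 3))
```

Under the model the two discordant frequencies are expected to be equal, which is the kind of symmetry the package's hypothesis tests examine; the sketch only computes the summary and a point estimate, not a test.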


MSCquartets 1.0: Quartet methods for species trees and networks under the multispecies coalescent model in R

Bioinformatics

doi:10.1093/bioinformatics/btaa868

2020

Author(s): John A Rhodes; Hector Baños; Jonathan D Mitchell; Elizabeth S Allman

Keyword(s): Network Inference; Incomplete Lineage Sorting; R Package; Species Tree; Supplementary Information; Species Trees; Lineage Sorting; Coalescent Model; Multispecies Coalescent; Tree Inference

Abstract: Summary: MSCquartets is an R package for species tree hypothesis testing, inference of species trees, and inference of species networks under the Multispecies Coalescent model of incomplete lineage sorting and its network analog. Inputs for these analyses are collections of metric or topological locus trees, which are then summarized by the quartets displayed on them. Results of hypothesis tests at user-supplied levels are displayed in a simplex plot by color-coded points. The package implements the QDC and WQDC algorithms for topological and metric species tree inference, and the NANUQ algorithm for level-1 topological species network inference, all of which give statistically consistent estimators under the model. Availability: MSCquartets is available through the Comprehensive R Archive Network: https://CRAN.R-project.org/package=MSCquartets. Supplementary information: Supplementary materials, including example data and analyses, are incorporated into the package.


Efficient Bayesian species tree inference under the multispecies coalescent

Systematic Biology

doi:10.1093/sysbio/syw119

2017

pp. syw119

Cited By ~ 15

Author(s): Bruce Rannala; Ziheng Yang

Keyword(s): Species Tree; Multispecies Coalescent; Tree Inference; Species Tree Inference


Species Tree Inference under the Multispecies Coalescent on Data with Paralogs is Accurate

doi:10.1101/498378

2018

Cited By ~ 10

Author(s): Zhi Yan; Peng Du; Matthew W. Hahn; Luay Nakhleh

Keyword(s): Incomplete Lineage Sorting; Single Copy; Species Tree; Biological Data; Lineage Sorting; Multispecies Coalescent; Gene Copies; Tree Inference; Inference Methods; Species Tree Inference

Abstract: The multispecies coalescent (MSC) has emerged as a powerful and desirable framework for species tree inference in phylogenomic studies. Under this framework, the data for each locus is assumed to consist of orthologous, single-copy genes, and heterogeneity across loci is assumed to be due to incomplete lineage sorting (ILS). These assumptions have led biologists who use ILS-aware inference methods, whether based directly on the MSC or proven to be statistically consistent under it (collectively referred to here as MSC-based methods), to exclude all loci that are present in more than a single copy in any of the studied genomes. Furthermore, such analyses entail orthology assignment to avoid the potential of hidden paralogy in the data. The question we seek to answer in this study is: What happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two biological data sets, we show that running such methods on data with paralogs provides very accurate results, either by treating all gene copies within a family as alleles from multiple individuals or by randomly selecting one copy per species. Our results have significant implications for the use of MSC-based phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci, thus greatly increasing the amount of data that can be used. [Multispecies coalescent; incomplete lineage sorting; gene duplication and loss; orthology; paralogy.]
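The two ways of handling multi-copy gene families named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the data layout (a list of species and copy-identifier pairs), the function names, and the example identifiers are all assumptions made for the example.

```python
import random
from collections import defaultdict

def group_copies(gene_family):
    """Group copy identifiers by species.
    gene_family is a list of (species, copy_id) pairs (hypothetical layout)."""
    by_species = defaultdict(list)
    for species, copy_id in gene_family:
        by_species[species].append(copy_id)
    return dict(by_species)

def copies_as_alleles(gene_family):
    """Strategy 1: keep every copy, treating the copies within a species
    as if they were alleles sampled from multiple individuals."""
    return group_copies(gene_family)

def one_copy_per_species(gene_family, rng):
    """Strategy 2: keep a single, randomly chosen copy for each species."""
    grouped = group_copies(gene_family)
    return {sp: rng.choice(copies) for sp, copies in sorted(grouped.items())}

# Made-up gene family with paralogs in species A and C.
family = [("A", "A_1"), ("A", "A_2"), ("B", "B_1"),
          ("C", "C_1"), ("C", "C_2"), ("C", "C_3")]

rng = random.Random(0)  # seeded only so that reruns are repeatable
alleles = copies_as_alleles(family)
sampled = one_copy_per_species(family, rng)
print(alleles)
print(sampled)
```

Either output would then be passed, locus by locus, to an MSC-based species tree method; that downstream step is outside the scope of this sketch.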


Practical Speedup of Bayesian Inference of Species Phylogenies by Restricting the Space of Gene Trees

doi:10.1101/770784

2019

Author(s): Yaxuan Wang; Huw A. Ogilvie; Luay Nakhleh

Keyword(s): Bayesian Inference; Species Tree; Biological Data; Small Data; Data Sets; Gene Trees; Multispecies Coalescent; Tree Inference; The Individual; Species Tree Inference

Abstract: Species tree inference from multi-locus data has emerged as a powerful paradigm in the post-genomic era, both in terms of the accuracy of the species tree it produces as well as in terms of elucidating the processes that shaped the evolutionary history. Bayesian methods for species tree inference are desirable in this area as they have been shown to yield accurate estimates, but also to naturally provide measures of confidence in those estimates. However, the heavy computational requirements of Bayesian inference have limited the applicability of such methods to very small data sets. In this paper, we show that the computational efficiency of Bayesian inference under the multispecies coalescent can be improved in practice by restricting the space of the gene trees explored during the random walk, without sacrificing accuracy as measured by various metrics. The idea is to first infer constraints on the trees of the individual loci in the form of unresolved gene trees, and then to restrict the sampler to consider only resolutions of the constrained trees. We demonstrate the improvements gained by such an approach on both simulated and biological data.


Practical Speedup of Bayesian Inference of Species Phylogenies by Restricting the Space of Gene Trees

Molecular Biology and Evolution

doi:10.1093/molbev/msaa045

2020

Vol 37 (6)

pp. 1809-1818

Author(s): Yaxuan Wang; Huw A Ogilvie; Luay Nakhleh

Keyword(s): Bayesian Inference; Species Tree; Biological Data; Small Data; Data Sets; Gene Trees; Multispecies Coalescent; Tree Inference; The Individual; Species Tree Inference

Abstract: Species tree inference from multilocus data has emerged as a powerful paradigm in the postgenomic era, both in terms of the accuracy of the species tree it produces as well as in terms of elucidating the processes that shaped the evolutionary history. Bayesian methods for species tree inference are desirable in this area as they have been shown not only to yield accurate estimates, but also to naturally provide measures of confidence in those estimates. However, the heavy computational requirements of Bayesian inference have limited the applicability of such methods to very small data sets. In this article, we show that the computational efficiency of Bayesian inference under the multispecies coalescent can be improved in practice by restricting the space of the gene trees explored during the random walk, without sacrificing accuracy as measured by various metrics. The idea is to first infer constraints on the trees of the individual loci in the form of unresolved gene trees, and then to restrict the sampler to consider only resolutions of the constrained trees. We demonstrate the improvements gained by such an approach on both simulated and biological data.


Consistency of SVDQuartets and Maximum Likelihood for Coalescent-based Species Tree Estimation

doi:10.1101/523050

2019

Cited By ~ 2

Author(s): Matthew Wascher; Laura Kubatko

Keyword(s): Maximum Likelihood; Species Tree; Gene Trees; Data Types; Coalescent Model; Multispecies Coalescent; Consistency Results; True Tree; Tree Inference; Statistical Consistency

Abstract: Numerous methods for inferring species-level phylogenies under the coalescent model have been proposed within the last 20 years, and debates continue about the relative strengths and weaknesses of these methods. One desirable property of a phylogenetic estimator is that of statistical consistency, which means intuitively that as more data are collected, the probability that the estimated tree has the same topology as the true tree goes to 1. To date, consistency results for species tree inference under the multispecies coalescent have been derived only for summary statistics methods, such as ASTRAL and MP-EST. These methods have been found to be consistent given true gene trees, but may be inconsistent when gene trees are estimated from data for loci of finite length (Roch et al., 2019). Here we consider the question of statistical consistency for four taxa for SVDQuartets for general data types, as well as for the maximum likelihood (ML) method in the case in which the data are a collection of sites generated under the multispecies coalescent model such that the sites are conditionally independent given the species tree (we call these data Coalescent Independent Sites (CIS) data). We show that SVDQuartets is statistically consistent for all data types (i.e., for both CIS data and for multilocus data), and we derive its rate of convergence. We additionally show that ML is consistent for CIS data under the JC69 model, and discuss why a proof for the more general multilocus case is difficult. Finally, we compare the performance of maximum likelihood and SVDQuartets using simulation for both data types.
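The intuitive definition of statistical consistency given in this abstract can be written compactly. The notation is an assumption introduced here, not taken from the paper: T denotes the true species tree topology and \hat{T}_n the topology estimated from n units of data (sites or loci).

```latex
% Statistical consistency of a species tree estimator:
% the probability of recovering the true topology tends to 1
% as the amount of data n grows without bound.
\[
  \lim_{n \to \infty} \Pr\bigl(\hat{T}_n = T\bigr) = 1
\]
```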


Species Tree Inference with BPP Using Genomic Sequences and the Multispecies Coalescent

Molecular Biology and Evolution

doi:10.1093/molbev/msy147

2018

Vol 35 (10)

pp. 2585-2593

Cited By ~ 68

Author(s): Tomáš Flouri; Xiyun Jiao; Bruce Rannala; Ziheng Yang

Keyword(s): Species Tree; Genomic Sequences; Multispecies Coalescent; Tree Inference; Species Tree Inference


On the effects of selection and mutation on species tree inference

doi:10.1101/2020.09.08.288183

2020

Author(s): Matthew Wascher; Laura S. Kubatko

Keyword(s): Negative Impact; Species Tree; Multispecies Coalescent; Coalescent Process; Fisher Model; Tree Inference; Substantial Bias; Genome Scale; Species Tree Inference; Scale Data

Abstract: A common question that arises when inferring species-level phylogenies from genome-scale data is whether selection acting on certain parts of the genome could create a bias in the inferred phylogeny. While most methods for species tree inference currently assume the multispecies coalescent (MSC), all methods that we are aware of utilize only the neutral coalescent process. If selection is in fact present, failure to adequately model it could introduce substantial bias. We work toward rigorously addressing this question using mathematical theory by deriving a version of the coalescent including selection and mutation as a limiting approximation of the Wright-Fisher model with selection and mutation, and showing that it can be used to closely approximate the distribution of coalescent times in the presence of selection and mutation. We confirm the adequacy of the approximation with a simulation study, and discuss its implications for species tree inference. Our results show that in a general class containing many cases of interest, selection has only a small impact on the coalescent process, and ignoring selection when it is present does not have a substantial negative impact on inference of the species tree topology.


Impact of Ghost Introgression on Coalescent-based Species Tree Inference and Estimation of Divergence Time

doi:10.1101/2022.01.11.475787

2022

Author(s): XiaoXu Pang; Da-Yong Zhang

Keyword(s): Incomplete Lineage Sorting; Divergence Time; Species Tree; Gene Trees; Species Trees; Lineage Sorting; Multispecies Coalescent; Tree Inference; Tree Methods; The Impact

The species studied in any evolutionary investigation generally constitute a very small proportion of all the species currently existing or that have gone extinct. It is therefore likely that introgression, which is widespread across the tree of life, involves "ghosts," i.e., unsampled, unknown, or extinct lineages. However, the impact of ghost introgression on estimations of species trees has been rarely studied and is thus poorly understood. In this study, we use mathematical analysis and simulations to examine the robustness of species tree methods based on a multispecies coalescent model under gene flow sourcing from an extant or ghost lineage. We found that very low levels of extant or ghost introgression can result in anomalous gene trees (AGTs) on three-taxon rooted trees if accompanied by strong incomplete lineage sorting (ILS). In contrast, even massive introgression, with more than half of the recipient genome descending from the donor lineage, may not necessarily lead to AGTs. In cases involving an ingroup lineage (defined as one that diverged no earlier than the most basal species under investigation) acting as the donor of introgression, the time of root divergence among the investigated species was either underestimated or remained unaffected, but for the cases of outgroup ghost lineages acting as donors, the divergence time was generally overestimated. Under many conditions of ingroup introgression, the stronger the ILS was, the higher was the accuracy of estimating the time of root divergence, although the topology of the species tree is more prone to be biased by the effect of introgression.
