Efficient Bayesian species tree inference under the multispecies coalescent

2017 ◽  
pp. syw119 ◽  
Author(s):  
Bruce Rannala ◽  
Ziheng Yang
2018 ◽  
Author(s):  
Zhi Yan ◽  
Peng Du ◽  
Matthew W. Hahn ◽  
Luay Nakhleh

Abstract The multispecies coalescent (MSC) has emerged as a powerful and desirable framework for species tree inference in phylogenomic studies. Under this framework, the data for each locus are assumed to consist of orthologous, single-copy genes, and heterogeneity across loci is assumed to be due to incomplete lineage sorting (ILS). These assumptions have led biologists who use ILS-aware inference methods, whether based directly on the MSC or proven to be statistically consistent under it (collectively referred to here as MSC-based methods), to exclude all loci that are present in more than a single copy in any of the studied genomes. Furthermore, such analyses entail orthology assignment to avoid the potential for hidden paralogy in the data. The question we seek to answer in this study is: What happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two biological data sets, we show that running such methods on data with paralogs provides very accurate results, either by treating all gene copies within a family as alleles from multiple individuals or by randomly selecting one copy per species. Our results have significant implications for the use of MSC-based phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci, thus greatly increasing the amount of data that can be used. [Multispecies coalescent; incomplete lineage sorting; gene duplication and loss; orthology; paralogy.]
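
The two ways of handling multi-copy gene families named in this abstract can be illustrated with a minimal Python sketch. The input format (a list of `(species, copy_id)` pairs) and the function names are hypothetical and chosen only for illustration; they are not taken from the authors' software.

```python
import random
from collections import defaultdict


def group_copies_by_species(gene_family):
    """Group gene copy identifiers by species.

    gene_family is a list of (species, copy_id) pairs (hypothetical format).
    """
    by_species = defaultdict(list)
    for species, copy_id in gene_family:
        by_species[species].append(copy_id)
    return dict(by_species)


def sample_one_copy_per_species(gene_family, rng):
    """Strategy 1: randomly keep a single copy for each species."""
    by_species = group_copies_by_species(gene_family)
    return {species: rng.choice(copies) for species, copies in by_species.items()}


def treat_copies_as_individuals(gene_family):
    """Strategy 2: keep every copy and map it to its species,
    as if the copies were alleles sampled from multiple individuals."""
    return {copy_id: species for species, copy_id in gene_family}


# A toy gene family with duplicated copies in two species.
family = [
    ("human", "h1"), ("human", "h2"),
    ("chimp", "c1"),
    ("gorilla", "g1"), ("gorilla", "g2"),
]
sampled = sample_one_copy_per_species(family, random.Random(0))
allele_map = treat_copies_as_individuals(family)
```

Either result could then be passed to an MSC-based method: `sampled` yields one sequence per species, while `allele_map` plays the role of an allele-to-species mapping.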


2019 ◽  
Author(s):  
Yaxuan Wang ◽  
Huw A. Ogilvie ◽  
Luay Nakhleh

Abstract Species tree inference from multi-locus data has emerged as a powerful paradigm in the post-genomic era, both in terms of the accuracy of the species tree it produces as well as in terms of elucidating the processes that shaped the evolutionary history. Bayesian methods for species tree inference are desirable in this area as they have been shown not only to yield accurate estimates, but also to naturally provide measures of confidence in those estimates. However, the heavy computational requirements of Bayesian inference have limited the applicability of such methods to very small data sets. In this paper, we show that the computational efficiency of Bayesian inference under the multispecies coalescent can be improved in practice by restricting the space of the gene trees explored during the random walk, without sacrificing accuracy as measured by various metrics. The idea is to first infer constraints on the trees of the individual loci in the form of unresolved gene trees, and then to restrict the sampler to consider only resolutions of the constrained trees. We demonstrate the improvements gained by such an approach on both simulated and biological data.


2020 ◽  
Vol 37 (6) ◽  
pp. 1809-1818
Author(s):  
Yaxuan Wang ◽  
Huw A Ogilvie ◽  
Luay Nakhleh

Abstract Species tree inference from multilocus data has emerged as a powerful paradigm in the postgenomic era, both in terms of the accuracy of the species tree it produces as well as in terms of elucidating the processes that shaped the evolutionary history. Bayesian methods for species tree inference are desirable in this area as they have been shown not only to yield accurate estimates, but also to naturally provide measures of confidence in those estimates. However, the heavy computational requirements of Bayesian inference have limited the applicability of such methods to very small data sets. In this article, we show that the computational efficiency of Bayesian inference under the multispecies coalescent can be improved in practice by restricting the space of the gene trees explored during the random walk, without sacrificing accuracy as measured by various metrics. The idea is to first infer constraints on the trees of the individual loci in the form of unresolved gene trees, and then to restrict the sampler to consider only resolutions of the constrained trees. We demonstrate the improvements gained by such an approach on both simulated and biological data.
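
To give a rough sense of how much a constraint tree can shrink the gene tree space, the sketch below counts the binary resolutions of an unresolved rooted tree. It relies on the standard result that a node with k children can be resolved into (2k-3)!! rooted binary shapes, so the total is the product over internal nodes. The nested-tuple tree format and function names are assumptions made for illustration, and this is not the authors' sampler.

```python
def double_factorial(m):
    """Return m!! = m * (m - 2) * (m - 4) * ..., with the empty product equal to 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def count_leaves(tree):
    """Count the leaves of a nested-tuple tree whose leaves are strings."""
    if isinstance(tree, str):
        return 1
    return sum(count_leaves(child) for child in tree)


def count_resolutions(tree):
    """Count the rooted binary trees that refine a possibly unresolved tree."""
    if isinstance(tree, str):
        return 1
    k = len(tree)
    # A node with k children admits (2k - 3)!! binary resolutions.
    total = double_factorial(2 * k - 3)
    for child in tree:
        total *= count_resolutions(child)
    return total


# A toy constraint tree: a four-way polytomy at the root and a trifurcation below it.
constraint = (("a", "b", "c"), ("d", "e"), "f", "g")
n_resolutions = count_resolutions(constraint)
n_unconstrained = double_factorial(2 * count_leaves(constraint) - 3)
```

For this toy constraint, the sampler would only need to consider `n_resolutions` rooted binary topologies instead of all `n_unconstrained` topologies on the same seven leaves.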


2018 ◽  
Vol 35 (10) ◽  
pp. 2585-2593 ◽  
Author(s):  
Tomáš Flouri ◽  
Xiyun Jiao ◽  
Bruce Rannala ◽  
Ziheng Yang

2020 ◽  
Author(s):  
John A. Rhodes ◽  
Hector Baños ◽  
Jonathan D. Mitchell ◽  
Elizabeth S. Allman

Abstract MSCquartets is an R package for species tree hypothesis testing, inference of species trees, and inference of species networks under the Multispecies Coalescent model of incomplete lineage sorting. The input for these analyses is a collection of metric or topological locus trees, which are then summarized by the quartets displayed on them. Results of hypothesis tests at user-supplied levels are displayed in a simplex plot by color-coded points. The package includes the QDC and WQDC algorithms for topological and metric species tree inference, and the NANUQ algorithm for level-1 topological species network inference, all of which give statistically consistent estimators under the model.
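
The idea of summarizing locus trees by the quartets displayed on them can be sketched in Python as follows. MSCquartets itself is an R package, and this sketch neither uses nor mirrors its API; the nested-tuple tree format and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def clades(tree):
    """Return (leaf_set, clade_sets) for a nested-tuple tree with string leaves.

    clade_sets holds the leaf set below every internal node.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def quartet_topology(clade_sets, quartet):
    """Return the split {{a, b}, {c, d}} displayed for a quartet, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clade_sets:
        inside = clade & q
        if len(inside) == 2:
            return frozenset([inside, q - inside])
    return None


def quartet_counts(trees, taxa):
    """Tally, for each set of four taxa, how many trees display each topology."""
    counts = {q: Counter() for q in combinations(sorted(taxa), 4)}
    for tree in trees:
        leaves, clade_sets = clades(tree)
        for q in counts:
            if not set(q) <= leaves:
                continue  # skip trees that lack one of the four taxa
            topology = quartet_topology(clade_sets, q)
            if topology is not None:
                counts[q][topology] += 1
    return counts


locus_trees = [
    ((("a", "b"), "c"), "d"),
    ((("a", "b"), "c"), "d"),
    ((("a", "c"), "b"), "d"),
]
counts = quartet_counts(locus_trees, {"a", "b", "c", "d"})
```

Tables of this kind, one count per quartet topology, are the sort of summary on which quartet-based hypothesis tests and inference methods operate.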


2020 ◽  
Author(s):  
Matthew Wascher ◽  
Laura S. Kubatko

Abstract A common question that arises when inferring species-level phylogenies from genome-scale data is whether selection acting on certain parts of the genome could create a bias in the inferred phylogeny. While most methods for species tree inference currently assume the multispecies coalescent (MSC), all methods that we are aware of utilize only the neutral coalescent process. If selection is in fact present, failure to adequately model it could introduce substantial bias. We work toward rigorously addressing this question using mathematical theory by deriving a version of the coalescent including selection and mutation as a limiting approximation of the Wright-Fisher model with selection and mutation, and showing that it can be used to closely approximate the distribution of coalescent times in the presence of selection and mutation. We confirm the adequacy of the approximation with a simulation study, and discuss its implications for species tree inference. Our results show that in a general class containing many cases of interest, selection has only a small impact on the coalescent process, and ignoring selection when it is present does not have a substantial negative impact on inference of the species tree topology.
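
For reference, the neutral baseline against which such a coalescent with selection would be compared can be simulated in a few lines. The sketch below draws coalescent times under the standard neutral (Kingman) coalescent, where the waiting time while k lineages remain is exponential with rate k(k-1)/2 in coalescent units. It does not implement the selection model derived in the paper, and the function names are hypothetical.

```python
import random


def simulate_coalescent_times(n, rng):
    """Simulate cumulative coalescence times for n lineages under the neutral coalescent.

    Times are in coalescent units; while k lineages remain, the waiting
    time to the next coalescence is exponential with rate k * (k - 1) / 2.
    """
    times = []
    t = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2
        t += rng.expovariate(rate)
        times.append(t)
    return times


def expected_tmrca(n):
    """Expected time to the most recent common ancestor: 2 * (1 - 1/n)."""
    return 2 * (1 - 1 / n)


rng = random.Random(1)
reps = 20000
mean_tmrca = sum(simulate_coalescent_times(5, rng)[-1] for _ in range(reps)) / reps
```

Comparing `mean_tmrca` with `expected_tmrca(5)` is a quick sanity check on the simulation; a model with selection would be judged by how far its coalescent time distribution departs from this neutral one.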


2022 ◽  
Author(s):  
XiaoXu Pang ◽  
Da-Yong Zhang

The species studied in any evolutionary investigation generally constitute a very small proportion of all the species currently existing or that have gone extinct. It is therefore likely that introgression, which is widespread across the tree of life, involves "ghosts," i.e., unsampled, unknown, or extinct lineages. However, the impact of ghost introgression on the estimation of species trees has rarely been studied and is thus poorly understood. In this study, we use mathematical analysis and simulations to examine the robustness of species tree methods based on a multispecies coalescent model under gene flow originating from an extant or ghost lineage. We found that very low levels of extant or ghost introgression can result in anomalous gene trees (AGTs) on three-taxon rooted trees if accompanied by strong incomplete lineage sorting (ILS). In contrast, even massive introgression, with more than half of the recipient genome descending from the donor lineage, may not necessarily lead to AGTs. In cases involving an ingroup lineage (defined as one that diverged no earlier than the most basal species under investigation) acting as the donor of introgression, the time of root divergence among the investigated species was either underestimated or remained unaffected, but in cases of outgroup ghost lineages acting as donors, the divergence time was generally overestimated. Under many conditions of ingroup introgression, the stronger the ILS was, the higher the accuracy of estimating the time of root divergence was, although the topology of the species tree was more prone to being biased by the effect of introgression.
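
As background for the three-taxon result, the sketch below computes the standard gene tree topology probabilities under the MSC without any gene flow, for a species tree ((A,B),C) whose internal branch has length T in coalescent units. It does not model introgression, and the function name is hypothetical.

```python
import math


def three_taxon_gene_tree_probs(internal_branch_length):
    """Gene tree topology probabilities under the MSC for species tree ((A,B),C).

    internal_branch_length is in coalescent units; no gene flow is modeled.
    """
    # Probability that the A and B lineages fail to coalesce on the internal branch.
    p_no_coalescence = math.exp(-internal_branch_length)
    return {
        "((A,B),C)": 1 - (2 / 3) * p_no_coalescence,  # matches the species tree
        "((A,C),B)": p_no_coalescence / 3,
        "((B,C),A)": p_no_coalescence / 3,
    }


strong_ils = three_taxon_gene_tree_probs(0.1)
weak_ils = three_taxon_gene_tree_probs(3.0)
```

In this baseline the concordant topology is always at least as probable as either discordant one, so an anomalous gene tree on three taxa requires some additional process, which is consistent with the abstract attributing such cases to introgression combined with strong ILS.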


2021 ◽  
Author(s):  
Benoit Morel ◽  
Paul Schade ◽  
Sarah Lutteropp ◽  
Tom A. Williams ◽  
Gergely J. Szöllösi ◽  
...  

Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modelling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated datasets we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large datasets at the same time. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31612 gene families in one hour using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.

