A Simulation Study to Examine the Information Content in Phylogenomic Data Sets under the Multispecies Coalescent Model

Abstract We use computer simulation to examine the information content in multilocus data sets for inference under the multispecies coalescent model. Inference problems considered include estimation of evolutionary parameters (such as species divergence times, population sizes, and cross-species introgression probabilities), species tree estimation, and species delimitation based on Bayesian comparison of delimitation models. We found that the number of loci is the most influential factor for almost all inference problems examined. Although the number of sequences per species does not appear to be important to species tree estimation, it is very influential to species delimitation. Increasing the number of sites and the per-site mutation rate both increase the mutation rate for the whole locus and these have the same effect on estimation of parameters, but the sequence length has a greater effect than the per-site mutation rate for species tree estimation. We discuss the computational costs when the data size increases and provide guidelines concerning the subsampling of genomic data to enable the application of full-likelihood methods of inference.

The BPP program for species tree estimation and species delimitation

Current Zoology ◽

10.1093/czoolo/61.5.854 ◽

2015 ◽

Vol 61 (5) ◽

pp. 854-865 ◽

Cited By ~ 314

Author(s):

Ziheng Yang

Keyword(s):

Species Delimitation ◽

Genomic Sequence ◽

Sequence Data ◽

Species Tree ◽

Joint Analysis ◽

Coalescent Model ◽

Species Phylogeny ◽

Multispecies Coalescent ◽

Tree Estimation ◽

Nuclear Loci

Abstract This paper provides an overview and a tutorial of the BPP program, which is a Bayesian MCMC program for analyzing multi-locus genomic sequence data under the multispecies coalescent model. An example dataset of five nuclear loci from the East Asian brown frogs is used to illustrate four different analyses, including estimation of species divergence times and population size parameters under the multispecies coalescent model on a fixed species phylogeny (A00), species tree estimation when the assignment and species delimitation are fixed (A01), species delimitation using a fixed guide tree (A10), and joint species delimitation and species-tree estimation or unguided species delimitation (A11). For the joint analysis (A11), two new priors are introduced, which assign uniform probabilities for the different numbers of delimited species, which may be useful when assignment, species delimitation, and species phylogeny are all inferred in one joint analysis. The paper ends with a discussion of the assumptions, the strengths and weaknesses of the BPP analysis.
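
The four analysis labels in this abstract can be read as two on/off switches: the first digit indicates whether species delimitation is inferred, and the second whether the species tree is inferred. The Python sketch below only illustrates that reading. The names `Analysis` and `decode_analysis` and the field names are hypothetical and are not part of the BPP program.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    # Hypothetical field names, chosen for this illustration only.
    infer_delimitation: bool   # first digit of the label
    infer_species_tree: bool   # second digit of the label


def decode_analysis(label: str) -> Analysis:
    """Split a label such as 'A10' into its two on/off switches."""
    if len(label) != 3 or label[0] != "A" or any(c not in "01" for c in label[1:]):
        raise ValueError(f"unrecognized analysis label: {label!r}")
    return Analysis(label[1] == "1", label[2] == "1")


for name in ("A00", "A01", "A10", "A11"):
    print(name, decode_analysis(name))
```

Under this reading, A00 fixes both the delimitation and the phylogeny and estimates only parameters, while A11 infers both jointly.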

Challenges in Species Tree Estimation Under the Multispecies Coalescent Model

Genetics ◽

10.1534/genetics.116.190173 ◽

2016 ◽

Vol 204 (4) ◽

pp. 1353-1368 ◽

Cited By ~ 77

Author(s):

Bo Xu ◽

Ziheng Yang

Keyword(s):

Species Tree ◽

Coalescent Model ◽

Multispecies Coalescent ◽

Tree Estimation

MSCquartets 1.0: Quartet methods for species trees and networks under the multispecies coalescent model in R

Bioinformatics ◽

10.1093/bioinformatics/btaa868 ◽

2020 ◽

Author(s):

John A Rhodes ◽

Hector Baños ◽

Jonathan D Mitchell ◽

Elizabeth S Allman

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

R Package ◽

Species Tree ◽

Supplementary Information ◽

Species Trees ◽

Lineage Sorting ◽

Coalescent Model ◽

Multispecies Coalescent ◽

Tree Inference

Abstract Summary: MSCquartets is an R package for species tree hypothesis testing, inference of species trees, and inference of species networks under the Multispecies Coalescent model of incomplete lineage sorting and its network analog. Input for these analyses are collections of metric or topological locus trees which are then summarized by the quartets displayed on them. Results of hypothesis tests at user-supplied levels are displayed in a simplex plot by color-coded points. The package implements the QDC and WQDC algorithms for topological and metric species tree inference, and the NANUQ algorithm for level-1 topological species network inference, all of which give statistically consistent estimators under the model. Availability: MSCquartets is available through the Comprehensive R Archive Network: https://CRAN.R-project.org/package=MSCquartets. Supplementary information: Supplementary materials, including example data and analyses, are incorporated into the package.
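
To make "summarized by the quartets displayed on them" concrete, the Python sketch below tabulates, for every set of four taxa, which unrooted quartet topology each locus tree displays. It is not the MSCquartets API (the package is written in R). Trees are assumed to be rooted and encoded as nested tuples of taxon names, and all function names are hypothetical.

```python
from itertools import combinations


def leaf_set(tree):
    """Return the set of taxon names below a node (a str is a leaf)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every subtree, including the whole tree."""
    yield leaf_set(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def displayed_quartet(tree, four):
    """Return the split {pair, other pair} displayed for four taxa, or None."""
    four = frozenset(four)
    for clade in clades(tree):
        pair = clade & four
        if len(pair) == 2:          # a clade separating two taxa from the rest
            return frozenset([pair, four - pair])
    return None                     # the four taxa are unresolved on this tree


def quartet_counts(trees, taxa):
    """Count how many trees display each resolved quartet topology."""
    counts = {}
    for four in combinations(sorted(taxa), 4):
        for tree in trees:
            split = displayed_quartet(tree, four)
            if split is not None:
                counts[split] = counts.get(split, 0) + 1
    return counts


locus_trees = [
    ((("a", "b"), "c"), "d"),
    (("a", "c"), ("b", "d")),
    ((("a", "b"), "d"), "c"),
]
for split, n in quartet_counts(locus_trees, "abcd").items():
    print(sorted(sorted(side) for side in split), n)
```

Counts of this kind, one triple per four-taxon set, are the sort of summary on which quartet-based tests and inference operate.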

A Bayesian Implementation of the Multispecies Coalescent Model with Introgression for Phylogenomic Analysis

Molecular Biology and Evolution ◽

10.1093/molbev/msz296 ◽

2019 ◽

Vol 37 (4) ◽

pp. 1211-1223 ◽

Cited By ~ 8

Author(s):

Tomáš Flouri ◽

Xiyun Jiao ◽

Bruce Rannala ◽

Ziheng Yang

Keyword(s):

Gene Flow ◽

Genomic Sequence ◽

Sequence Data ◽

Incomplete Lineage Sorting ◽

Mosquito Species ◽

Phylogenomic Analysis ◽

Data Sets ◽

Lineage Sorting ◽

Coalescent Model ◽

Multispecies Coalescent

Abstract Recent analyses suggest that cross-species gene flow or introgression is common in nature, especially during species divergences. Genomic sequence data can be used to infer introgression events and to estimate the timing and intensity of introgression, providing an important means to advance our understanding of the role of gene flow in speciation. Here, we implement the multispecies-coalescent-with-introgression model, an extension of the multispecies-coalescent model to incorporate introgression, in our Bayesian Markov chain Monte Carlo program Bpp. The multispecies-coalescent-with-introgression model accommodates deep coalescence (or incomplete lineage sorting) and introgression and provides a natural framework for inference using genomic sequence data. Computer simulation confirms the good statistical properties of the method, although hundreds or thousands of loci are typically needed to estimate introgression probabilities reliably. Reanalysis of data sets from the purple cone spruce confirms the hypothesis of homoploid hybrid speciation. We estimated the introgression probability using the genomic sequence data from six mosquito species in the Anopheles gambiae species complex, which varies considerably across the genome, likely driven by differential selection against introgressed alleles.

The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets

Systematic Biology ◽

10.1093/sysbio/syaa008 ◽

2020 ◽

Vol 69 (4) ◽

pp. 795-812 ◽

Cited By ~ 1

Author(s):

Xiaodong Jiang ◽

Scott V Edwards ◽

Liang Liu

Keyword(s):

Data Analysis ◽

Model Validation ◽

Bayesian Model ◽

Statistical Tests ◽

Gc Content ◽

Data Sets ◽

Gene Trees ◽

Coalescent Model ◽

Multispecies Coalescent ◽

Substitution Models

Abstract A statistical framework of model comparison and model validation is essential to resolving the debates over concatenation and coalescent models in phylogenomic data analysis. A set of statistical tests are here applied and developed to evaluate and compare the adequacy of substitution, concatenation, and multispecies coalescent (MSC) models across 47 phylogenomic data sets collected across the tree of life. Tests for substitution models and the concatenation assumption of topologically congruent gene trees suggest that a poor fit of substitution models, rejected by 44% of loci, and concatenation models, rejected by 38% of loci, is widespread. Logistic regression shows that the proportions of GC content and informative sites are both negatively correlated with the fit of substitution models across loci. Moreover, a substantial violation of the concatenation assumption of congruent gene trees is consistently observed across six major groups (birds, mammals, fish, insects, reptiles, and others, including other invertebrates). In contrast, among those loci adequately described by a given substitution model, the proportion of loci rejecting the MSC model is 11%, significantly lower than those rejecting the substitution and concatenation models. Although conducted on reduced data sets due to computational constraints, Bayesian model validation and comparison both strongly favor the MSC over concatenation across all data sets; the concatenation assumption of congruent gene trees rarely holds for phylogenomic data sets with more than 10 loci. Thus, for large phylogenomic data sets, model comparisons are expected to consistently and more strongly favor the coalescent model over the concatenation model. We also found that loci rejecting the MSC have little effect on species tree estimation. Our study reveals the value of model validation and comparison in phylogenomic data analysis, as well as the need for further improvements of multilocus models and computational tools for phylogenetic inference. [Bayes factor; Bayesian model validation; coalescent prior; congruent gene trees; independent prior; Metazoa; posterior predictive simulation.]

Assessing the Impacts of Positive Selection on Coalescent-Based Species Tree Estimation and Species Delimitation

Systematic Biology ◽

10.1093/sysbio/syy034 ◽

2018 ◽

Vol 67 (6) ◽

pp. 1076-1090 ◽

Cited By ~ 8

Author(s):

Richard H Adams ◽

Drew R Schield ◽

Daren C Card ◽

Todd A Castoe

Keyword(s):

Positive Selection ◽

Species Delimitation ◽

Species Tree ◽

Tree Estimation

Practical Speedup of Bayesian Inference of Species Phylogenies by Restricting the Space of Gene Trees

10.1101/770784 ◽

2019 ◽

Author(s):

Yaxuan Wang ◽

Huw A. Ogilvie ◽

Luay Nakhleh

Keyword(s):

Bayesian Inference ◽

Species Tree ◽

Biological Data ◽

Small Data ◽

Data Sets ◽

Gene Trees ◽

Multispecies Coalescent ◽

Tree Inference ◽

The Individual ◽

Species Tree Inference

Abstract Species tree inference from multi-locus data has emerged as a powerful paradigm in the post-genomic era, both in terms of the accuracy of the species tree it produces as well as in terms of elucidating the processes that shaped the evolutionary history. Bayesian methods for species tree inference are desirable in this area as they have been shown to yield accurate estimates, but also to naturally provide measures of confidence in those estimates. However, the heavy computational requirements of Bayesian inference have limited the applicability of such methods to very small data sets. In this paper, we show that the computational efficiency of Bayesian inference under the multispecies coalescent can be improved in practice by restricting the space of the gene trees explored during the random walk, without sacrificing accuracy as measured by various metrics. The idea is to first infer constraints on the trees of the individual loci in the form of unresolved gene trees, and then to restrict the sampler to consider only resolutions of the constrained trees. We demonstrate the improvements gained by such an approach on both simulated and biological data.
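
A rough sense of how much an unresolved constraint tree can shrink the search space comes from counting its binary resolutions. The Python sketch below does only that counting and is not the authors' sampler. It assumes rooted trees encoded as nested tuples of taxon names and uses the standard result that k labeled leaves admit (2k-3)!! rooted binary topologies, so a node with k children can be resolved in (2k-3)!! ways. The function names are hypothetical.

```python
def rooted_binary_topologies(k):
    """Number of rooted binary topologies on k labeled leaves: (2k-3)!!."""
    result = 1
    for odd in range(3, 2 * k - 2, 2):
        result *= odd
    return result


def count_leaves(tree):
    """Number of taxa below a node (a str is a leaf)."""
    return 1 if isinstance(tree, str) else sum(count_leaves(c) for c in tree)


def count_resolutions(tree):
    """Number of fully binary trees compatible with an unresolved tree."""
    if isinstance(tree, str):
        return 1
    total = rooted_binary_topologies(len(tree))   # ways to resolve this node
    for child in tree:
        total *= count_resolutions(child)         # independent across subtrees
    return total


# Illustrative constraint: two polytomies (the root and the clade a, b, c).
constraint = (("a", "b", "c"), ("d", "e"), "f")
print(count_resolutions(constraint))                       # constrained space
print(rooted_binary_topologies(count_leaves(constraint)))  # unconstrained space
```

For this six-taxon example the constraint leaves 9 candidate topologies out of 945.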

Split Probabilities and Species Tree Inference Under the Multispecies Coalescent Model

Bulletin of Mathematical Biology ◽

10.1007/s11538-017-0363-5 ◽

2017 ◽

Vol 80 (1) ◽

pp. 64-103 ◽

Cited By ~ 3

Author(s):

Elizabeth S. Allman ◽

James H. Degnan ◽

John A. Rhodes

Keyword(s):

Species Tree ◽

Coalescent Model ◽

Multispecies Coalescent ◽

Tree Inference ◽

Species Tree Inference

Complexity of the simplest species tree problem

Molecular Biology and Evolution ◽

10.1093/molbev/msab009 ◽

2021 ◽

Author(s):

Tianqi Zhu ◽

Ziheng Yang

Keyword(s):

Maximum Likelihood ◽

Estimation Error ◽

Gene Tree ◽

Species Tree ◽

Ancestral Population ◽

Statistical Efficiency ◽

Multispecies Coalescent ◽

Full Likelihood ◽

Tree Methods ◽

Tree Estimation

Abstract The multispecies coalescent (MSC) model provides a natural framework for species tree estimation accounting for gene-tree conflicts. While a number of species tree methods under the MSC have been suggested and evaluated using simulation, their statistical properties remain poorly understood. Here we use mathematical analysis aided by computer simulation to examine the identifiability, consistency, and efficiency of different species tree methods in the case of three species and three sequences under the molecular clock. We consider four major species-tree methods including concatenation, two-step, independent-sites maximum likelihood (ISML) and maximum likelihood (ML). We develop approximations that predict that the probit transform of the species tree estimation error decreases linearly with the square root of the number of loci. Even in this simplest case major differences exist among the methods. Full-likelihood methods are considerably more efficient than summary methods such as concatenation and two-step. They also provide estimates of important parameters such as species divergence times and ancestral population sizes while these parameters are not identifiable by summary methods. Our results highlight the need to improve the statistical efficiency of summary methods and the computational efficiency of full likelihood methods of species tree estimation.
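
The predicted scaling can be written as probit(error) = a - b * sqrt(L), where L is the number of loci. The Python sketch below shows what that form implies numerically. The intercept and slope are made-up values chosen for illustration and are not estimates from the paper, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # the probit is the inverse CDF of this distribution


def predicted_error(n_loci, intercept, slope):
    """Error probability whose probit equals intercept - slope * sqrt(n_loci)."""
    return STD_NORMAL.cdf(intercept - slope * sqrt(n_loci))


# Hypothetical coefficients, for illustration only.
INTERCEPT, SLOPE = 0.0, 0.1

for n_loci in (25, 100, 400):
    error = predicted_error(n_loci, INTERCEPT, SLOPE)
    probit = STD_NORMAL.inv_cdf(error)
    print(n_loci, round(error, 4), round(probit, 3))
```

Because the probit is linear in sqrt(L), quadrupling the number of loci doubles the distance of the probit from its intercept. In this form a method's efficiency is captured by its slope.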

PRANC: ML species tree estimation from the ranked gene trees under coalescence

Bioinformatics ◽

10.1093/bioinformatics/btaa605 ◽

2020 ◽

Vol 36 (18) ◽

pp. 4819-4821

Author(s):

Anastasiia Kim ◽

James H Degnan

Keyword(s):

Maximum Likelihood ◽

Gene Tree ◽

Species Tree ◽

Supplementary Information ◽

Supplementary Data ◽

Gene Trees ◽

Multispecies Coalescent ◽

Branch Lengths ◽

Tree Topologies ◽

Tree Estimation

Abstract Summary: PRANC computes the Probabilities of RANked gene tree topologies under the multispecies coalescent. A ranked gene tree is a gene tree accounting for the temporal ordering of internal nodes. PRANC can also estimate the maximum likelihood (ML) species tree from a sample of ranked or unranked gene tree topologies. It estimates the ML tree with estimated branch lengths in coalescent units. Availability and implementation: PRANC is written in C++ and freely available at github.com/anastasiiakim/PRANC. Supplementary information: Supplementary data are available at Bioinformatics online.
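
The distinction between ranked and unranked topologies can be illustrated by counting how many temporal orderings of internal nodes a single unranked topology admits. The Python sketch below does this for rooted binary trees encoded as nested pairs of taxon names. It illustrates the definition only and does not compute the probabilities that PRANC computes. The function name is hypothetical, and the count uses the standard hook-length formula for orderings in which every node is older than its descendants.

```python
from math import factorial


def count_rankings(tree):
    """Number of rankings (orderings of internal nodes) of a rooted binary tree."""
    subtree_sizes = []   # internal-node count of the subtree at each internal node

    def visit(node):
        if isinstance(node, str):      # a leaf contributes no internal node
            return 0
        left, right = node
        size = 1 + visit(left) + visit(right)
        subtree_sizes.append(size)
        return size

    n_internal = visit(tree)
    denominator = 1
    for size in subtree_sizes:
        denominator *= size
    # Hook-length formula: m! divided by the product of subtree sizes.
    return factorial(n_internal) // denominator


balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
print(count_rankings(balanced), count_rankings(caterpillar))
```

The balanced four-taxon topology has two rankings because either cherry can coalesce first, whereas the caterpillar topology has only one.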
