The Multispecies Coalescent Model Outperforms Concatenation across Diverse Phylogenomic Data Sets

2019 ◽  
Author(s):  
Xiaodong Jiang ◽  
Scott V. Edwards ◽  
Liang Liu

ABSTRACT A statistical framework of model comparison and model validation is essential to resolving the debates over concatenation and coalescent models in phylogenomic data analysis. A set of statistical tests are here applied and developed to evaluate and compare the adequacy of substitution, concatenation, and multispecies coalescent (MSC) models across 47 phylogenomic data sets collected across tree of life. Tests for substitution models and the concatenation assumption of topologically concordant gene trees suggest that a poor fit of substitution models (44% of loci rejecting the substitution model) and concatenation models (38% of loci rejecting the hypothesis of topologically congruent gene trees) is widespread. Logistic regression shows that the proportions of GC content and informative sites are both negatively correlated with the fit of substitution models across loci. Moreover, a substantial violation of the concatenation assumption of congruent gene trees is consistently observed across 6 major groups (birds, mammals, fish, insects, reptiles, and others, including other invertebrates). In contrast, among those loci adequately described by a given substitution model, the proportion of loci rejecting the MSC model is 11%, significantly lower than those rejecting the substitution and concatenation models, and Bayesian model comparison strongly favors the MSC over concatenation across all data sets. Species tree inference suggests that loci rejecting the MSC have little effect on species tree estimation. Due to computational constraints, the Bayesian model validation and comparison analyses were conducted on the reduced data sets. A complete analysis of phylogenomic data requires the development of efficient algorithms for phylogenetic inference. Nevertheless, the concatenation assumption of congruent gene trees rarely holds for phylogenomic data with more than 10 loci. 
Thus, for large phylogenomic data sets, model comparison analyses are expected to consistently and more strongly favor the coalescent model over the concatenation model. Our analysis reveals the value of model validation and comparison in phylogenomic data analysis, as well as the need for further improvements of multilocus models and computational tools for phylogenetic inference.

2020 ◽  
Vol 69 (4) ◽  
pp. 795-812 ◽  
Author(s):  
Xiaodong Jiang ◽  
Scott V Edwards ◽  
Liang Liu

Abstract A statistical framework of model comparison and model validation is essential to resolving the debates over concatenation and coalescent models in phylogenomic data analysis. A set of statistical tests are here applied and developed to evaluate and compare the adequacy of substitution, concatenation, and multispecies coalescent (MSC) models across 47 phylogenomic data sets collected across tree of life. Tests for substitution models and the concatenation assumption of topologically congruent gene trees suggest that a poor fit of substitution models, rejected by 44% of loci, and concatenation models, rejected by 38% of loci, is widespread. Logistic regression shows that the proportions of GC content and informative sites are both negatively correlated with the fit of substitution models across loci. Moreover, a substantial violation of the concatenation assumption of congruent gene trees is consistently observed across six major groups (birds, mammals, fish, insects, reptiles, and others, including other invertebrates). In contrast, among those loci adequately described by a given substitution model, the proportion of loci rejecting the MSC model is 11%, significantly lower than those rejecting the substitution and concatenation models. Although conducted on reduced data sets due to computational constraints, Bayesian model validation and comparison both strongly favor the MSC over concatenation across all data sets; the concatenation assumption of congruent gene trees rarely holds for phylogenomic data sets with more than 10 loci. Thus, for large phylogenomic data sets, model comparisons are expected to consistently and more strongly favor the coalescent model over the concatenation model. We also found that loci rejecting the MSC have little effect on species tree estimation. 
Our study reveals the value of model validation and comparison in phylogenomic data analysis, as well as the need for further improvements of multilocus models and computational tools for phylogenetic inference. [Bayes factor; Bayesian model validation; coalescent prior; congruent gene trees; independent prior; Metazoa; posterior predictive simulation.]


AoB Plants ◽  
2020 ◽  
Vol 12 (3) ◽  
Author(s):  
Nannie L Persson ◽  
Ingrid Toresen ◽  
Heidi Lie Andersen ◽  
Jenny E E Smedmark ◽  
Torsten Eriksson

Abstract The genus Potentilla (Rosaceae) has been subjected to several phylogenetic studies, but resolving its evolutionary history has proven challenging. Previous analyses recovered six, informally named, groups: the Argentea, Ivesioid, Fragarioides, Reptans, Alba and Anserina clades, but the relationships among some of these clades differ between data sets. The Reptans clade, which includes the type species of Potentilla, has been noticed to shift position between plastid and nuclear ribosomal data sets. We studied this incongruence by analysing four low-copy nuclear markers, in addition to chloroplast and nuclear ribosomal data, with a set of Bayesian phylogenetic and Multispecies Coalescent (MSC) analyses. A selective taxon removal strategy demonstrated that the included representatives from the Fragarioides clade, P. dickinsii and P. fragarioides, were the main sources of the instability seen in the trees. The Fragarioides species showed different relationships in each gene tree, and were only supported as a monophyletic group in a single marker when the Reptans clade was excluded from the analysis. The incongruences could not be explained by allopolyploidy, but rather by homoploid hybridization, incomplete lineage sorting or taxon sampling effects. When P. dickinsii and P. fragarioides were removed from the data set, a fully resolved, supported backbone phylogeny of Potentilla was obtained in the MSC analysis. Additionally, indications of autopolyploid origins of the Reptans and Ivesioid clades were discovered in the low-copy gene trees.


2019 ◽  
Author(s):  
Yaxuan Wang ◽  
Huw A. Ogilvie ◽  
Luay Nakhleh

Abstract Species tree inference from multi-locus data has emerged as a powerful paradigm in the post-genomic era, both in terms of the accuracy of the species tree it produces as well as in terms of elucidating the processes that shaped the evolutionary history. Bayesian methods for species tree inference are desirable in this area as they have been shown to yield accurate estimates, but also to naturally provide measures of confidence in those estimates. However, the heavy computational requirements of Bayesian inference have limited the applicability of such methods to very small data sets. In this paper, we show that the computational efficiency of Bayesian inference under the multispecies coalescent can be improved in practice by restricting the space of the gene trees explored during the random walk, without sacrificing accuracy as measured by various metrics. The idea is to first infer constraints on the trees of the individual loci in the form of unresolved gene trees, and then to restrict the sampler to consider only resolutions of the constrained trees. We demonstrate the improvements gained by such an approach on both simulated and biological data.


2020 ◽  
Vol 37 (6) ◽  
pp. 1809-1818
Author(s):  
Yaxuan Wang ◽  
Huw A Ogilvie ◽  
Luay Nakhleh

Abstract Species tree inference from multilocus data has emerged as a powerful paradigm in the postgenomic era, both in terms of the accuracy of the species tree it produces as well as in terms of elucidating the processes that shaped the evolutionary history. Bayesian methods for species tree inference are desirable in this area as they have been shown not only to yield accurate estimates, but also to naturally provide measures of confidence in those estimates. However, the heavy computational requirements of Bayesian inference have limited the applicability of such methods to very small data sets. In this article, we show that the computational efficiency of Bayesian inference under the multispecies coalescent can be improved in practice by restricting the space of the gene trees explored during the random walk, without sacrificing accuracy as measured by various metrics. The idea is to first infer constraints on the trees of the individual loci in the form of unresolved gene trees, and then to restrict the sampler to consider only resolutions of the constrained trees. We demonstrate the improvements gained by such an approach on both simulated and biological data.


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Xing-Xing Shen ◽  
Yuanning Li ◽  
Chris Todd Hittinger ◽  
Xue-xin Chen ◽  
Antonis Rokas

Abstract Phylogenetic trees are essential for studying biology, but their reproducibility under identical parameter settings remains unexplored. Here, we find that 3515 (18.11%) IQ-TREE-inferred and 1813 (9.34%) RAxML-NG-inferred maximum likelihood (ML) gene trees are topologically irreproducible when executing two replicates (Run1 and Run2) for each of 19,414 gene alignments in 15 animal, plant, and fungal phylogenomic datasets. Notably, coalescent-based ASTRAL species phylogenies inferred from Run1 and Run2 sets of individual gene trees are topologically irreproducible for 9/15 phylogenomic datasets, whereas concatenation-based phylogenies inferred twice from the same supermatrix are reproducible. Our simulations further show that irreproducible phylogenies are more likely to be incorrect than reproducible phylogenies. These results suggest that a considerable fraction of single-gene ML trees may be irreproducible. Increasing reproducibility in ML inference will benefit from providing analyses’ log files, which contain typically reported parameters (e.g., program, substitution model, number of tree searches) but also typically unreported ones (e.g., random starting seed number, number of threads, processor type).
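
The Run1/Run2 comparison described above can be illustrated with a minimal Python sketch. It assumes trees are given as nested tuples of leaf names over the same taxon set (a hypothetical representation chosen for brevity, not the format used by IQ-TREE, RAxML-NG, or the authors), and it treats two trees as topologically identical when they induce the same set of non-trivial bipartitions, ignoring root position and branch lengths.

```python
def splits(tree):
    """Return the non-trivial bipartitions of an unrooted tree.

    `tree` is a nested tuple of leaf-name strings (hypothetical format).
    Each bipartition is stored as the side that excludes a reference leaf,
    so the result does not depend on where the tuple is rooted.
    """
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    ref = min(all_leaves)  # arbitrary but deterministic reference leaf
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        side = all_leaves - below if ref in below else below
        # keep only splits with at least two leaves on each side
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    walk(tree)
    return found


def same_topology(tree_a, tree_b):
    """True when both trees induce the same set of bipartitions."""
    return splits(tree_a) == splits(tree_b)


def irreproducible_fraction(run1, run2):
    """Fraction of loci whose Run1 and Run2 trees differ in topology."""
    differing = sum(not same_topology(a, b) for a, b in zip(run1, run2))
    return differing / len(run1)


run1 = [(("A", "B"), ("C", "D")), (("A", "B"), ("C", "D"))]
run2 = [("A", ("B", ("C", "D"))), (("A", "C"), ("B", "D"))]
print(irreproducible_fraction(run1, run2))
```

In this toy input the first locus differs only in rooting and counts as reproducible, while the second locus groups the taxa differently and counts as irreproducible.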


2019 ◽  
Vol 487 (3) ◽  
pp. 3644-3649 ◽  
Author(s):  
Haochen Wang ◽  
Stephen R Taylor ◽  
Michele Vallisneri

ABSTRACT Gravitational-wave data analysis demands sophisticated statistical noise models in a bid to extract highly obscured signals from data. In Bayesian model comparison, we choose among a landscape of models by comparing their marginal likelihoods. However, this computation is numerically fraught and can be sensitive to arbitrary choices in the specification of parameter priors. In Bayesian cross validation, we characterize the fit and predictive power of a model by computing the Bayesian posterior of its parameters in a training data set, and then use that posterior to compute the averaged likelihood of a different testing data set. The resulting cross-validation scores are straightforward to compute; they are insensitive to prior tuning; and they penalize unnecessarily complex models that overfit the training data at the expense of predictive performance. In this article, we discuss cross validation in the context of pulsar-timing-array data analysis, and we exemplify its application to simulated pulsar data (where it successfully selects the correct spectral index of a stochastic gravitational-wave background), and to a pulsar data set from the NANOGrav 11-yr release (where it convincingly favours a model that represents a transient feature in the interstellar medium). We argue that cross validation offers a promising alternative to Bayesian model comparison, and we discuss its use for gravitational-wave detection, by selecting or refuting models that include a gravitational-wave component.
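
The cross-validation recipe in this abstract (fit a posterior on training data, then average the likelihood of separate testing data over that posterior) can be sketched on a toy problem. The example below is an illustration only: it assumes a Gaussian model with known noise level and a conjugate Gaussian prior on the mean, which has nothing to do with the pulsar-timing noise models used in the paper, and all function names are hypothetical.

```python
import math
import random


def gaussian_loglik(data, mu, sigma):
    """Log-likelihood of data under N(mu, sigma^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )


def posterior_mean_var(train, sigma, prior_mean, prior_sd):
    """Conjugate normal-normal update for the mean, with sigma known."""
    precision = 1 / prior_sd ** 2 + len(train) / sigma ** 2
    mean = (prior_mean / prior_sd ** 2 + sum(train) / sigma ** 2) / precision
    return mean, 1 / precision


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


def cv_score(train, test, sigma=1.0, prior_mean=0.0, prior_sd=10.0,
             draws=2000, seed=0):
    """Log of the test-set likelihood averaged over the training posterior."""
    rng = random.Random(seed)
    mean, var = posterior_mean_var(train, sigma, prior_mean, prior_sd)
    samples = [rng.gauss(mean, math.sqrt(var)) for _ in range(draws)]
    return log_mean_exp([gaussian_loglik(test, mu, sigma) for mu in samples])


train = [5.0, 5.2, 4.8]
test = [5.1, 4.9]
free_mean_score = cv_score(train, test)             # model with a fitted mean
fixed_mean_score = gaussian_loglik(test, 0.0, 1.0)  # model with the mean pinned at 0
print(free_mean_score > fixed_mean_score)
```

The score is computed by Monte Carlo averaging over posterior draws, so its exact value depends on the seed and the number of draws; only the comparison between candidate models is meant to be informative here.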


Author(s):  
Elizabeth S. Allman ◽  
Jonathan D. Mitchell ◽  
John A. Rhodes

Abstract A simple graphical device, the simplex plot of quartet concordance factors, is introduced to aid in the exploration of a collection of gene trees on a common set of taxa. A single plot summarizes all gene tree discord, and allows for visual comparison to the expected discord from the multispecies coalescent model (MSC) of incomplete lineage sorting on a species tree. A formal statistical procedure is described that can quantify the deviation from expectation for each subset of four taxa, suggesting when the data is not in accord with the MSC, and thus either gene tree inference error is substantial or a more complex model such as that on a network may be required. If the collection of gene trees appears to be in accord with the MSC, the plots may reveal when substantial incomplete lineage sorting is present and coalescent based species tree inference is preferred over concatenation approaches. Applications to both simulated and empirical multilocus data sets illustrate the insights provided.
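
To make the idea of comparing observed and expected quartet concordance factors concrete, here is a small Python sketch. It relies on the standard MSC result for a four-taxon species tree with internal branch length t (in coalescent units): the topology matching the species tree has probability 1 - (2/3)e^(-t) and each of the two discordant topologies has probability (1/3)e^(-t). The function names, the topology labels, and the choice of triangle corners are assumptions made for illustration and are not taken from the authors' software.

```python
import math
from collections import Counter

TOPOLOGIES = ("AB|CD", "AC|BD", "AD|BC")  # hypothetical labels for the three quartet trees


def expected_cfs(t):
    """Expected concordance factors under the MSC when the species tree is AB|CD."""
    minor = math.exp(-t) / 3.0
    return (1.0 - 2.0 * minor, minor, minor)


def empirical_cfs(quartet_topologies):
    """Observed proportions of the three quartet topologies among gene trees."""
    counts = Counter(quartet_topologies)
    total = sum(counts[k] for k in TOPOLOGIES)
    return tuple(counts[k] / total for k in TOPOLOGIES)


def simplex_xy(cf):
    """Map a concordance-factor triple to 2D coordinates inside a triangle.

    Corners are placed at (0, 0), (1, 0) and (0.5, sqrt(3)/2).
    """
    _, second, third = cf
    return (second + 0.5 * third, third * math.sqrt(3) / 2)


observed = empirical_cfs(["AB|CD"] * 6 + ["AC|BD"] * 2 + ["AD|BC"] * 2)
print(observed, simplex_xy(observed))
print(expected_cfs(1.0), simplex_xy(expected_cfs(1.0)))
```

Under the MSC the two minor concordance factors are equal, so points for MSC-consistent quartets should fall near the line segments of the simplex where two coordinates coincide; the formal test described in the abstract quantifies departures from that expectation.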


2018 ◽  
Author(s):  
Adam D. Leaché ◽  
Tianqi Zhu ◽  
Bruce Rannala ◽  
Ziheng Yang

Abstract Recent simulation studies examining the performance of Bayesian species delimitation as implemented in the BPP program have suggested that BPP may detect population splits but not species divergences and that it tends to over-split when data of many loci are analyzed. Here we confirm several of these results and provide their mathematical justifications. We point out that the distinction between population and species splits made in the protracted speciation model has no influence on the generation of gene trees and sequence data, which explains why no method can use such data to distinguish between population splits and speciation. We suggest that the protracted speciation model is unrealistic and its mechanism for assigning species status contradicts prevailing taxonomic practice. We confirm the suggestion, based on simulation, that in the case of speciation with gene flow, Bayesian model selection as implemented in BPP tends to detect population splits when the amount of data (the number of loci) increases so over-splitting is a legitimate concern. We discuss the use of a recently proposed empirical genealogical divergence index (gdi) for species delimitation and illustrate that parameter estimates produced by a full likelihood analysis as implemented in BPP provide much more reliable inference under the gdi than the approximate method PHRAPL. We suggest that the Bayesian model-selection approach is useful for identifying sympatric cryptic species while Bayesian parameter estimation under the multispecies coalescent can be used to implement empirical criteria for determining species status among allopatric populations.
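
The abstract mentions the genealogical divergence index without stating it. As commonly defined in the species-delimitation literature, gdi = 1 - e^(-2τ/θ), where τ is the divergence time and θ the population size parameter of the population in question, both as estimated under the MSC. The Python sketch below computes it from point estimates. The cut-offs of 0.2 and 0.7 are the rule-of-thumb values usually quoted for this index and are an assumption here, as are the function names; none of this is taken from the BPP program itself.

```python
import math


def gdi(tau, theta):
    """Genealogical divergence index: 1 - exp(-2 * tau / theta)."""
    if tau < 0 or theta <= 0:
        raise ValueError("tau must be non-negative and theta positive")
    return 1.0 - math.exp(-2.0 * tau / theta)


def classify(value, low=0.2, high=0.7):
    """Apply rule-of-thumb thresholds (assumed, not from the abstract)."""
    if value < low:
        return "single species"
    if value > high:
        return "distinct species"
    return "ambiguous"


for tau, theta in [(0.0001, 0.01), (0.005, 0.01), (0.01, 0.002)]:
    value = gdi(tau, theta)
    print(tau, theta, round(value, 3), classify(value))
```

In practice the index would be evaluated over posterior samples of τ and θ rather than single point estimates, so that its uncertainty can be reported alongside the classification.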

