Statistical Consistency of Coalescent-Based Species Tree Methods Under Models of Missing Data

Author(s):  
Michael Nute ◽  
Jed Chou

Author(s):  
Daniel M Portik ◽  
John J Wiens

Abstract Alignment is a crucial issue in molecular phylogenetics because different alignment methods can potentially yield very different topologies for individual genes. But it is unclear if the choice of alignment methods remains important in phylogenomic analyses, which incorporate data from hundreds or thousands of genes. For example, problematic biases in alignment might be multiplied across many loci, whereas alignment errors in individual genes might become irrelevant. The issue of alignment trimming (i.e., removing poorly aligned regions or missing data from individual genes) is also poorly explored. Here, we test the impact of 12 different combinations of alignment and trimming methods on phylogenomic analyses. We compare these methods using published phylogenomic data from ultraconserved elements (UCEs) from squamate reptiles (lizards and snakes), birds, and tetrapods. We compare the properties of alignments generated by different alignment and trimming methods (e.g., length, informative sites, missing data). We also test whether these data sets can recover well-established clades when analyzed with concatenated (RAxML) and species-tree methods (ASTRAL-III), using the full data (~5000 loci) and subsampled data sets (10% and 1% of loci). We show that different alignment and trimming methods can significantly impact various aspects of phylogenomic data sets (e.g., length, informative sites). However, these different methods generally had little impact on the recovery and support values for well-established clades, even across very different numbers of loci. Nevertheless, our results suggest several “best practices” for alignment and trimming. Intriguingly, the choice of phylogenetic methods impacted the phylogenetic results most strongly, with concatenated analyses recovering significantly more well-established clades (with stronger support) than the species-tree analyses.
[Alignment; concatenated analysis; phylogenomics; sequence length heterogeneity; species-tree analysis; trimming]



2014 ◽  
Vol 81 ◽  
pp. 221-231 ◽  
Author(s):  
R. Alexander Pyron ◽  
Catriona R. Hendry ◽  
Vincent M. Chou ◽  
Emily M. Lemmon ◽  
Alan R. Lemmon ◽  
...  


2022 ◽  
Author(s):  
XiaoXu Pang ◽  
Da-Yong Zhang

The species studied in any evolutionary investigation generally constitute a very small proportion of all the species currently existing or that have gone extinct. It is therefore likely that introgression, which is widespread across the tree of life, involves "ghosts," i.e., unsampled, unknown, or extinct lineages. However, the impact of ghost introgression on estimations of species trees has been rarely studied and is thus poorly understood. In this study, we use mathematical analysis and simulations to examine the robustness of species tree methods based on a multispecies coalescent model under gene flow sourcing from an extant or ghost lineage. We found that very low levels of extant or ghost introgression can result in anomalous gene trees (AGTs) on three-taxon rooted trees if accompanied by strong incomplete lineage sorting (ILS). In contrast, even massive introgression, with more than half of the recipient genome descending from the donor lineage, may not necessarily lead to AGTs. In cases involving an ingroup lineage (defined as one that diverged no earlier than the most basal species under investigation) acting as the donor of introgression, the time of root divergence among the investigated species was either underestimated or remained unaffected, but for the cases of outgroup ghost lineages acting as donors, the divergence time was generally overestimated. Under many conditions of ingroup introgression, the stronger the ILS was, the higher was the accuracy of estimating the time of root divergence, although the topology of the species tree is more prone to be biased by the effect of introgression.
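The anomalous-gene-tree (AGT) condition described above can be illustrated with a minimal sketch. It is not the authors' derivation: it assumes one sampled lineage per species, an extant ingroup donor, and a simple two-parental-tree mixture in which a fraction `gamma` of loci follow the alternative tree ((B,C),A) instead of the species tree ((A,B),C). The names `gamma`, `t_major`, and `t_minor` (internal branch lengths in coalescent units) are hypothetical, introduced only for this illustration.

```python
import math


def three_taxon_probs(t):
    """Rooted three-taxon gene-tree probabilities under the multispecies coalescent.

    t: internal branch length in coalescent units.
    Returns (P[matching topology], P[each of the two other topologies]).
    """
    p_disc = math.exp(-t) / 3.0
    return 1.0 - 2.0 * p_disc, p_disc


def mixture_probs(gamma, t_major, t_minor):
    """Gene-tree probabilities when a fraction gamma of loci follow ((B,C),A).

    Assumption (not from the source): loci are a mixture of two parental trees,
    ((A,B),C) with internal branch t_major and ((B,C),A) with internal branch t_minor.
    """
    match_major, disc_major = three_taxon_probs(t_major)
    match_minor, disc_minor = three_taxon_probs(t_minor)
    return {
        "AB|C": (1 - gamma) * match_major + gamma * disc_minor,
        "BC|A": (1 - gamma) * disc_major + gamma * match_minor,
        "AC|B": (1 - gamma) * disc_major + gamma * disc_minor,
    }


def has_anomalous_gene_tree(gamma, t_major, t_minor):
    """True if some non-matching topology is more probable than AB|C."""
    probs = mixture_probs(gamma, t_major, t_minor)
    return max(probs["BC|A"], probs["AC|B"]) > probs["AB|C"]
```

Under these assumptions, `has_anomalous_gene_tree(0.05, 0.01, 2.0)` (little introgression, strong ILS) reports an AGT, whereas `has_anomalous_gene_tree(0.6, 5.0, 0.0)` (heavy introgression, weak ILS on the major tree) does not, which matches the qualitative pattern stated in the abstract.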



PLoS ONE ◽  
2012 ◽  
Vol 7 (3) ◽  
pp. e32066 ◽  
Author(s):  
Mariana Mateos ◽  
Luis A. Hurtado ◽  
Carlos A. Santamaria ◽  
Vincent Leignel ◽  
Danièle Guinot


2018 ◽  
Author(s):  
Stephen A. Smith ◽  
Nathanael Walker-Hale ◽  
Joseph F. Walker ◽  
Joseph W. Brown

Abstract Studies have demonstrated that pervasive gene tree conflict underlies several important phylogenetic relationships where different species tree methods produce conflicting results. Here, we present a means of dissecting the phylogenetic signal for alternative resolutions within a dataset in order to resolve recalcitrant relationships and, importantly, identify what the dataset is unable to resolve. These procedures extend upon methods for isolating conflict and concordance involving specific candidate relationships and can be used to identify systematic error and disambiguate sources of conflict among species tree inference methods. We demonstrate these on a large phylogenomic plant dataset. Our results support the placement of Amborella as sister to the remaining extant angiosperms, Gnetales as sister to pines, and the monophyly of extant gymnosperms. Several other contentious relationships, including the resolution of relationships within the bryophytes and the eudicots, remain uncertain given the low number of supporting gene trees. To address whether concatenation of filtered genes amplified phylogenetic signal for relationships, we implemented a combinatorial heuristic to test combinability of genes. We found that nested conflicts limited the ability of data filtering methods to fully ameliorate conflicting signal amongst gene trees. These analyses confirmed that the underlying conflicting signal does not support broad concatenation of genes. Our approach provides a means of dissecting a specific dataset to address deep phylogenetic relationships while also identifying the inferential boundaries of the dataset.



2019 ◽  
Vol 139 ◽  
pp. 106539 ◽  
Author(s):  
John Gatesy ◽  
Daniel B. Sloan ◽  
Jessica M. Warren ◽  
Richard H. Baker ◽  
Mark P. Simmons ◽  
...  


2014 ◽  
Vol 81 ◽  
pp. 242-257 ◽  
Author(s):  
Andrew W. Thompson ◽  
Ricardo Betancur-R. ◽  
Hernán López-Fernández ◽  
Guillermo Ortí


Author(s):  
Tianqi Zhu ◽  
Ziheng Yang

Abstract The multispecies coalescent (MSC) model provides a natural framework for species tree estimation accounting for gene-tree conflicts. While a number of species tree methods under the MSC have been suggested and evaluated using simulation, their statistical properties remain poorly understood. Here we use mathematical analysis aided by computer simulation to examine the identifiability, consistency, and efficiency of different species tree methods in the case of three species and three sequences under the molecular clock. We consider four major species-tree methods including concatenation, two-step, independent-sites maximum likelihood (ISML) and maximum likelihood (ML). We develop approximations that predict that the probit transform of the species tree estimation error decreases linearly with the square root of the number of loci. Even in this simplest case, major differences exist among the methods. Full-likelihood methods are considerably more efficient than summary methods such as concatenation and two-step. They also provide estimates of important parameters such as species divergence times and ancestral population sizes while these parameters are not identifiable by summary methods. Our results highlight the need to improve the statistical efficiency of summary methods and the computational efficiency of full-likelihood methods of species tree estimation.
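The relationship between estimation error and the number of loci can be explored with a small Monte Carlo sketch. This is an illustration under simplifying assumptions, not any of the four methods evaluated in the abstract: it assumes true (error-free) rooted gene-tree topologies for three species and picks the most frequent topology as the species tree. The function names and the parameter `t` (internal branch length in coalescent units) are hypothetical.

```python
import math
import random
from statistics import NormalDist


def gene_tree_probs(t):
    """Rooted three-taxon gene-tree topology probabilities under the MSC.

    Returns (P[matches species tree], P[alternative 1], P[alternative 2]).
    """
    p_disc = math.exp(-t) / 3.0
    return (1.0 - 2.0 * p_disc, p_disc, p_disc)


def majority_vote_error(t, n_loci, n_reps=2000, seed=0):
    """Estimate how often the most frequent gene-tree topology is wrong.

    Ties between the matching topology and an alternative count as errors.
    """
    rng = random.Random(seed)
    probs = gene_tree_probs(t)
    errors = 0
    for _ in range(n_reps):
        counts = [0, 0, 0]
        for topo in rng.choices((0, 1, 2), weights=probs, k=n_loci):
            counts[topo] += 1
        if counts[0] <= max(counts[1], counts[2]):
            errors += 1
    return errors / n_reps


def probit(p, eps=1e-6):
    """Inverse standard-normal CDF, clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return NormalDist().inv_cdf(p)
```

Plotting `probit(majority_vote_error(t, n))` against `math.sqrt(n)` for a range of `n` is one way to check, for this toy estimator only, whether the trend looks roughly linear as the abstract predicts for the methods it analyzes.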



2019 ◽  
Author(s):  
Lucas Santiago Barrientos ◽  
Jeffrey W. Streicher ◽  
Elizabeth C. Miller ◽  
Marcio R. Pie ◽  
John J. Wiens ◽  
...  

Abstract Background: Terraranae is a large clade of New World direct-developing frogs that includes 3–5 families and >1,000 described species, encompassing ~15% of all named frog species. The relationships among major groups of terraranan frogs have been highly contentious, including conflicts among three recent phylogenomic studies utilizing 95, 389, and 2,214 nuclear loci, respectively. In this paper, we re-evaluate relationships within Terraranae using a novel genomic dataset for 16 ingroup species representing most terraranan families and subfamilies. Results: The preferred data matrix consisted of 2,665 nuclear loci from ultraconserved elements (UCEs), with a total of 743,419 aligned base pairs and 57% missing data. Concatenated likelihood analyses and coalescent-based species-tree analyses both recovered strong statistical support for the following relationships among terraranan families: (Brachycephalidae, (Eleutherodactylidae, (Craugastoridae + “Strabomantidae”))). Our placement of Brachycephalidae agrees with two previous phylogenomic studies but conflicts with another. Our results place Strabomantis (of the Strabomantidae) with (or within) Craugastor (Craugastoridae) rather than with other strabomantid genera, rendering Strabomantidae paraphyletic with respect to Craugastoridae. Conclusions: Our results suggest that Strabomantidae should be placed in the synonymy of the older Craugastoridae. Furthermore, our results suggest that Pristimantinae is paraphyletic with respect to Holoadeninae and should be subsumed into the older Holoadeninae. We also found that using matrices of UCE loci with less missing data (and concomitantly fewer loci) generally decreased support for most nodes on the tree. Overall, our results help resolve controversial relationships within one of the largest clades of frogs, with a dataset containing ~7 times more loci than previous studies focused on this clade.



2019 ◽  
Author(s):  
Matthew Wascher ◽  
Laura Kubatko

Abstract Numerous methods for inferring species-level phylogenies under the coalescent model have been proposed within the last 20 years, and debates continue about the relative strengths and weaknesses of these methods. One desirable property of a phylogenetic estimator is that of statistical consistency, which means intuitively that as more data are collected, the probability that the estimated tree has the same topology as the true tree goes to 1. To date, consistency results for species tree inference under the multispecies coalescent have been derived only for summary statistics methods, such as ASTRAL and MP-EST. These methods have been found to be consistent given true gene trees, but may be inconsistent when gene trees are estimated from data for loci of finite length (Roch et al., 2019). Here we consider the question of statistical consistency for four taxa for SVDQuartets for general data types, as well as for the maximum likelihood (ML) method in the case in which the data are a collection of sites generated under the multispecies coalescent model such that the sites are conditionally independent given the species tree (we call these data Coalescent Independent Sites (CIS) data). We show that SVDQuartets is statistically consistent for all data types (i.e., for both CIS data and for multilocus data), and we derive its rate of convergence. We additionally show that ML is consistent for CIS data under the JC69 model, and discuss why a proof for the more general multilocus case is difficult. Finally, we compare the performance of maximum likelihood and SVDQuartets using simulation for both data types.
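The intuitive definition of statistical consistency given above can be written compactly. The notation is an assumption introduced here for illustration, not taken from the paper: $T^{*}$ denotes the true species tree topology, $\hat{T}_n$ the topology estimated from $n$ units of data (sites or loci), and $\Pr$ the probability under the data-generating model.

```latex
% Statistical consistency of a species tree estimator (notation assumed):
% the probability of recovering the true topology tends to 1 as data grow.
\lim_{n \to \infty} \Pr\!\left( \hat{T}_n = T^{*} \right) = 1
```

Read this way, an estimator is inconsistent if there are model conditions under which this probability stays bounded away from 1 however large $n$ becomes.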


