Identifiability of speciation times under the multispecies coalescent

Mapping Intimacies ◽

10.1101/2020.11.24.396424 ◽

2020 ◽

Author(s):

Laura Kubatko ◽

Julia Chifman

Keyword(s):

Sequence Data ◽

Hypothesis Test ◽

Gene Trees ◽

Species Trees ◽

Computationally Efficient ◽

Coalescent Model ◽

Sequencing Technologies ◽

Multispecies Coalescent ◽

Branch Lengths ◽

And Performance

Abstract The advent of rapid and inexpensive sequencing technologies has necessitated the development of computationally efficient methods for analyzing sequence data for many genes simultaneously in a phylogenetic framework. The coalescent process is the most commonly used model for linking the underlying genealogies of individual genes with the global species-level phylogeny, but inference under the coalescent model is computationally daunting in the typical inference frameworks (e.g., the likelihood and Bayesian frameworks) due to the dimensionality of the space of both gene trees and species trees. Here we consider estimation of the branch lengths in a fixed species tree, and show that these branch lengths are identifiable. We also show that in the case of four taxa simple estimators for the branch lengths can be derived based on observed site pattern frequencies. Properties of these estimators, such as their asymptotic variances and large-sample distributions, are examined, and performance of the estimators is assessed using simulation. Finally, we use these estimators to develop a hypothesis test that can be used to delimit species under the coalescent model.


Gene Tree Discord, Simplex Plots, and Statistical Tests under the Coalescent

Systematic Biology ◽

10.1093/sysbio/syab008 ◽

2021 ◽

Author(s):

Elizabeth S Allman ◽

Jonathan D Mitchell ◽

John A Rhodes

Keyword(s):

Incomplete Lineage Sorting ◽

Hypothesis Test ◽

Statistical Tests ◽

Gene Tree ◽

Complex Model ◽

Species Tree ◽

Gene Trees ◽

Lineage Sorting ◽

Coalescent Model ◽

Multispecies Coalescent

Abstract A simple graphical device, the simplex plot of quartet concordance factors, is introduced to aid in the exploration of a collection of gene trees on a common set of taxa. A single plot summarizes all gene tree discord and allows for visual comparison to the expected discord from the multispecies coalescent model (MSC) of incomplete lineage sorting on a species tree. A formal statistical procedure is described that can quantify the deviation from expectation for each subset of four taxa, suggesting when the data are not in accord with the MSC, and thus that either gene tree inference error is substantial or a more complex model such as that on a network may be required. If the collection of gene trees is in accord with the MSC, the plots reveal when substantial incomplete lineage sorting is present. Applications to both simulated and empirical multilocus data sets illustrate the insights provided. [Gene tree discordance; hypothesis test; multispecies coalescent model; quartet concordance factor; simplex plot; species tree].
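
The comparison to "expected discord" in the abstract above can be made concrete with a small sketch. It assumes the standard four-taxon MSC result that, for an internal branch of length x in coalescent units, the gene tree quartet matching the species tree has probability 1 - (2/3)e^(-x) and each of the two discordant quartets has probability (1/3)e^(-x). The function names and the choice of triangle vertices are illustrative assumptions, not taken from the paper or from any package.

```python
import math
from typing import Sequence, Tuple


def expected_quartet_cfs(internal_branch: float) -> Tuple[float, float, float]:
    """Expected quartet concordance factors under the MSC on a 4-taxon tree.

    internal_branch is the internal branch length in coalescent units.
    Returns (concordant, discordant_1, discordant_2).
    """
    if internal_branch < 0:
        raise ValueError("branch length must be non-negative")
    minor = math.exp(-internal_branch) / 3.0
    return (1.0 - 2.0 * minor, minor, minor)


def empirical_cfs(counts: Sequence[int]) -> Tuple[float, float, float]:
    """Turn counts of the three quartet topologies into proportions."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no gene trees counted")
    c1, c2, c3 = counts
    return (c1 / total, c2 / total, c3 / total)


def simplex_xy(cf: Sequence[float]) -> Tuple[float, float]:
    """Map a point (p1, p2, p3) with p1 + p2 + p3 = 1 into the plane.

    Hypothetical layout: the three topologies sit at the vertices
    (0, 0), (1, 0) and (0.5, sqrt(3)/2) of an equilateral triangle.
    """
    _, p2, p3 = cf
    return (p2 + 0.5 * p3, (math.sqrt(3) / 2.0) * p3)


if __name__ == "__main__":
    for x in (0.0, 0.5, 2.0):
        cf = expected_quartet_cfs(x)
        print(x, cf, simplex_xy(cf))
```

With an internal branch of length zero all three topologies are equally likely and the point falls at the centroid of the triangle; longer branches move the expected point toward the vertex of the concordant topology.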


Polymorphism-aware species trees with advanced mutation models, bootstrap and rate heterogeneity

10.1101/483479 ◽

2018 ◽

Author(s):

Dominik Schrempf ◽

Bui Quang Minh ◽

Arndt von Haeseler ◽

Carolin Kosiol

Keyword(s):

Molecular Phylogenetics ◽

Sequence Data ◽

Phylogenetic Analyses ◽

Large Data ◽

Population Data ◽

Rate Variation ◽

Gene Trees ◽

Species Trees ◽

Multispecies Coalescent ◽

Branch Support

Abstract Molecular phylogenetics has neglected polymorphisms within present and ancestral populations for a long time. Recently, multispecies coalescent based methods have increased in popularity; however, their application is limited to a small number of species and individuals. We introduced a polymorphism-aware phylogenetic model (PoMo), which overcomes this limitation and scales well with the increasing amount of sequence data while accounting for present and ancestral polymorphisms. PoMo circumvents handling of gene trees and directly infers species trees from allele frequency data. Here, we extend the PoMo implementation in IQ-TREE and integrate search for the statistically best-fit mutation model, the ability to infer mutation rate variation across sites, and assessment of branch support values. We exemplify an analysis of a hundred species with ten haploid individuals each, showing that PoMo can perform inference on large data sets. While PoMo is more accurate than standard substitution models applied to concatenated alignments, it is almost as fast. We also provide bmm-simulate, a software package that allows simulation of sequences evolving under PoMo. The new options consolidate the value of PoMo for phylogenetic analyses with population data.


Retroposon Insertions within a Multispecies Coalescent Framework Suggest that Ratite Phylogeny is not in the ‘Anomaly Zone’

10.1101/643296 ◽

2019 ◽

Cited By ~ 3

Author(s):

Mark S. Springer ◽

John Gatesy

Keyword(s):

Phylogenetic Analysis ◽

Sequence Data ◽

Gene Tree ◽

Reconstruction Error ◽

Species Tree ◽

Neutral Evolution ◽

Gene Trees ◽

Tree Reconstruction ◽

Multispecies Coalescent ◽

Branch Lengths

Abstract Summary coalescence methods were developed to address the negative impacts of incomplete lineage sorting on species tree estimation with concatenation. Coalescence methods are statistically consistent if certain requirements are met including no intralocus recombination, neutral evolution, and no gene tree reconstruction error. However, the assumption of no intralocus recombination may not hold for many DNA sequence data sets, and neutral evolution is not the rule for genetic markers that are commonly employed in phylogenomic coalescence analyses. Most importantly, the assumption of no gene tree reconstruction error is routinely violated, especially for rapid radiations that are deep in the Tree of Life. With the sequencing of complete genomes and novel pipelines, phylogenetic analysis of retroposon insertions has emerged as a valuable alternative to sequence-based phylogenetic analysis. Retroposon insertions avoid or reduce several problems that beset analysis of sequence data with summary coalescence methods: 1) intralocus recombination is avoided because retroposon insertions are singular evolutionary events, 2) neutral evolution is approximated in many cases, and 3) gene tree reconstruction errors are rare because retroposons have low rates of homoplasy. However, the analysis of retroposons within a multispecies coalescent framework has not been realized. Here, we propose a simple workaround in which a retroposon insertion matrix is first transformed into a series of incompletely resolved gene trees. Next, the program ASTRAL is used to estimate a species tree in the statistically consistent framework of the multispecies coalescent. The inferred species tree includes support scores at all nodes and internal branch lengths in coalescent units.
As a test case, we analyzed a retroposon dataset for palaeognath birds (ratites and tinamous) with ASTRAL and compared the resulting species tree to an MP-EST species tree for the same clade derived from thousands of sequence-based gene trees. The MP-EST species tree suggests an empirical case of the ‘anomaly zone’ with three very short internal branches at the base of Palaeognathae, and as predicted for anomaly zone conditions, the MP-EST species tree differs from the most common gene tree. Although identical in topology to the MP-EST tree, the ASTRAL species tree based on retroposons shows branch lengths that are much longer and incompatible with anomaly zone conditions. Simulation of gene trees from the retroposon-based species tree reveals that the most common gene tree matches the species tree. We contend that the wide discrepancies in branch lengths between sequence-based and retroposon-based species trees are explained by the greater accuracy of retroposon gene trees (bipartitions) relative to sequence-based gene trees. Coalescence analysis of retroposon data provides a promising alternative to the status quo by reducing gene tree reconstruction error that can have large impacts on both branch length estimates and evolutionary interpretations.
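
The workaround described in the abstract above, turning each retroposon insertion into an incompletely resolved gene tree, might look roughly like the following sketch. The encoding is an assumption made for illustration: each character is a mapping from taxon name to "1" (insertion present), "0" (absent) or "?" (missing), and each informative character becomes a Newick string with a single bipartition. The function name and input layout are hypothetical and are not taken from the authors' pipeline or from ASTRAL.

```python
from typing import Dict, List, Optional


def insertion_to_newick(states: Dict[str, str]) -> Optional[str]:
    """Convert one presence/absence character into a one-bipartition tree.

    Taxa scored '1' are grouped in a single clade, taxa scored '0' are
    attached directly at the root, and taxa scored '?' are left out.
    Returns None when the character defines no non-trivial bipartition.
    """
    present = sorted(t for t, s in states.items() if s == "1")
    absent = sorted(t for t, s in states.items() if s == "0")
    if len(present) < 2 or len(absent) < 2:
        return None  # uninformative: fewer than two taxa on one side
    return "((" + ",".join(present) + ")," + ",".join(absent) + ");"


def matrix_to_trees(matrix: List[Dict[str, str]]) -> List[str]:
    """Convert every informative character in the matrix to Newick."""
    trees = (insertion_to_newick(char) for char in matrix)
    return [t for t in trees if t is not None]


if __name__ == "__main__":
    # Made-up toy matrix with placeholder taxon names.
    toy = [
        {"A": "1", "B": "1", "C": "0", "D": "0", "E": "?"},
        {"A": "1", "B": "0", "C": "0", "D": "0", "E": "0"},
    ]
    for tree in matrix_to_trees(toy):
        print(tree)
```

The resulting list of Newick strings, one per line, is the kind of gene tree file a summary method could take as input. The second toy character is dropped because an insertion found in a single taxon carries no grouping information.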


Impact of Ghost Introgression on Coalescent-based Species Tree Inference and Estimation of Divergence Time

10.1101/2022.01.11.475787 ◽

2022 ◽

Author(s):

XiaoXu Pang ◽

Da-Yong Zhang

Keyword(s):

Incomplete Lineage Sorting ◽

Divergence Time ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Multispecies Coalescent ◽

Tree Inference ◽

Tree Methods ◽

The Impact

The species studied in any evolutionary investigation generally constitute a very small proportion of all the species currently existing or that have gone extinct. It is therefore likely that introgression, which is widespread across the tree of life, involves "ghosts," i.e., unsampled, unknown, or extinct lineages. However, the impact of ghost introgression on estimations of species trees has been rarely studied and is thus poorly understood. In this study, we use mathematical analysis and simulations to examine the robustness of species tree methods based on a multispecies coalescent model under gene flow sourcing from an extant or ghost lineage. We found that very low levels of extant or ghost introgression can result in anomalous gene trees (AGTs) on three-taxon rooted trees if accompanied by strong incomplete lineage sorting (ILS). In contrast, even massive introgression, with more than half of the recipient genome descending from the donor lineage, may not necessarily lead to AGTs. In cases involving an ingroup lineage (defined as one that diverged no earlier than the most basal species under investigation) acting as the donor of introgression, the time of root divergence among the investigated species was either underestimated or remained unaffected, but for the cases of outgroup ghost lineages acting as donors, the divergence time was generally overestimated. Under many conditions of ingroup introgression, the stronger the ILS was, the higher was the accuracy of estimating the time of root divergence, although the topology of the species tree is more prone to be biased by the effect of introgression.


MSCquartets 1.0: Quartet methods for species trees and networks under the multispecies coalescent model in R

Bioinformatics ◽

10.1093/bioinformatics/btaa868 ◽

2020 ◽

Author(s):

John A Rhodes ◽

Hector Baños ◽

Jonathan D Mitchell ◽

Elizabeth S Allman

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

R Package ◽

Species Tree ◽

Supplementary Information ◽

Species Trees ◽

Lineage Sorting ◽

Coalescent Model ◽

Multispecies Coalescent ◽

Tree Inference

Abstract Summary: MSCquartets is an R package for species tree hypothesis testing, inference of species trees, and inference of species networks under the Multispecies Coalescent model of incomplete lineage sorting and its network analog. Input for these analyses are collections of metric or topological locus trees which are then summarized by the quartets displayed on them. Results of hypothesis tests at user-supplied levels are displayed in a simplex plot by color-coded points. The package implements the QDC and WQDC algorithms for topological and metric species tree inference, and the NANUQ algorithm for level-1 topological species network inference, all of which give statistically consistent estimators under the model. Availability: MSCquartets is available through the Comprehensive R Archive Network: https://CRAN.R-project.org/package=MSCquartets. Supplementary information: Supplementary materials, including example data and analyses, are incorporated into the package.


The BPP program for species tree estimation and species delimitation

Current Zoology ◽

10.1093/czoolo/61.5.854 ◽

2015 ◽

Vol 61 (5) ◽

pp. 854-865 ◽

Cited By ~ 314

Author(s):

Ziheng Yang

Keyword(s):

Species Delimitation ◽

Genomic Sequence ◽

Sequence Data ◽

Species Tree ◽

Joint Analysis ◽

Coalescent Model ◽

Species Phylogeny ◽

Multispecies Coalescent ◽

Tree Estimation ◽

Nuclear Loci

Abstract This paper provides an overview and a tutorial of the BPP program, which is a Bayesian MCMC program for analyzing multi-locus genomic sequence data under the multispecies coalescent model. An example dataset of five nuclear loci from the East Asian brown frogs is used to illustrate four different analyses, including estimation of species divergence times and population size parameters under the multispecies coalescent model on a fixed species phylogeny (A00), species tree estimation when the assignment and species delimitation are fixed (A01), species delimitation using a fixed guide tree (A10), and joint species delimitation and species-tree estimation or unguided species delimitation (A11). For the joint analysis (A11), two new priors are introduced, which assign uniform probabilities for the different numbers of delimited species, which may be useful when assignment, species delimitation, and species phylogeny are all inferred in one joint analysis. The paper ends with a discussion of the assumptions, the strengths and weaknesses of the BPP analysis.


A Bayesian Implementation of the Multispecies Coalescent Model with Introgression for Phylogenomic Analysis

Molecular Biology and Evolution ◽

10.1093/molbev/msz296 ◽

2019 ◽

Vol 37 (4) ◽

pp. 1211-1223 ◽

Cited By ~ 8

Author(s):

Tomáš Flouri ◽

Xiyun Jiao ◽

Bruce Rannala ◽

Ziheng Yang

Keyword(s):

Gene Flow ◽

Genomic Sequence ◽

Sequence Data ◽

Incomplete Lineage Sorting ◽

Mosquito Species ◽

Phylogenomic Analysis ◽

Data Sets ◽

Lineage Sorting ◽

Coalescent Model ◽

Multispecies Coalescent

Abstract Recent analyses suggest that cross-species gene flow or introgression is common in nature, especially during species divergences. Genomic sequence data can be used to infer introgression events and to estimate the timing and intensity of introgression, providing an important means to advance our understanding of the role of gene flow in speciation. Here, we implement the multispecies-coalescent-with-introgression model, an extension of the multispecies-coalescent model to incorporate introgression, in our Bayesian Markov chain Monte Carlo program Bpp. The multispecies-coalescent-with-introgression model accommodates deep coalescence (or incomplete lineage sorting) and introgression and provides a natural framework for inference using genomic sequence data. Computer simulation confirms the good statistical properties of the method, although hundreds or thousands of loci are typically needed to estimate introgression probabilities reliably. Reanalysis of data sets from the purple cone spruce confirms the hypothesis of homoploid hybrid speciation. We estimated the introgression probability using the genomic sequence data from six mosquito species in the Anopheles gambiae species complex, which varies considerably across the genome, likely driven by differential selection against introgressed alleles.


The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets

Systematic Biology ◽

10.1093/sysbio/syaa008 ◽

2020 ◽

Vol 69 (4) ◽

pp. 795-812 ◽

Cited By ~ 1

Author(s):

Xiaodong Jiang ◽

Scott V Edwards ◽

Liang Liu

Keyword(s):

Data Analysis ◽

Model Validation ◽

Bayesian Model ◽

Statistical Tests ◽

Gc Content ◽

Data Sets ◽

Gene Trees ◽

Coalescent Model ◽

Multispecies Coalescent ◽

Substitution Models

Abstract A statistical framework of model comparison and model validation is essential to resolving the debates over concatenation and coalescent models in phylogenomic data analysis. A set of statistical tests are here applied and developed to evaluate and compare the adequacy of substitution, concatenation, and multispecies coalescent (MSC) models across 47 phylogenomic data sets collected across the tree of life. Tests for substitution models and the concatenation assumption of topologically congruent gene trees suggest that a poor fit of substitution models, rejected by 44% of loci, and concatenation models, rejected by 38% of loci, is widespread. Logistic regression shows that the proportions of GC content and informative sites are both negatively correlated with the fit of substitution models across loci. Moreover, a substantial violation of the concatenation assumption of congruent gene trees is consistently observed across six major groups (birds, mammals, fish, insects, reptiles, and others, including other invertebrates). In contrast, among those loci adequately described by a given substitution model, the proportion of loci rejecting the MSC model is 11%, significantly lower than those rejecting the substitution and concatenation models. Although conducted on reduced data sets due to computational constraints, Bayesian model validation and comparison both strongly favor the MSC over concatenation across all data sets; the concatenation assumption of congruent gene trees rarely holds for phylogenomic data sets with more than 10 loci. Thus, for large phylogenomic data sets, model comparisons are expected to consistently and more strongly favor the coalescent model over the concatenation model. We also found that loci rejecting the MSC have little effect on species tree estimation.
Our study reveals the value of model validation and comparison in phylogenomic data analysis, as well as the need for further improvements of multilocus models and computational tools for phylogenetic inference. [Bayes factor; Bayesian model validation; coalescent prior; congruent gene trees; independent prior; Metazoa; posterior predictive simulation.]


Species Identification by Bayesian Fingerprinting: A Powerful Alternative to DNA Barcoding

10.1101/041608 ◽

2016 ◽

Cited By ~ 3

Author(s):

Ziheng Yang ◽

Bruce Rannala

Keyword(s):

Dna Barcoding ◽

Species Identification ◽

Model Comparison ◽

Sequence Data ◽

Single Species ◽

Species Variation ◽

Distance Threshold ◽

Coalescent Model ◽

Multispecies Coalescent ◽

Bayesian Model Comparison

A number of methods have been developed to use genetic sequence data to identify and delineate species. Some methods are based on heuristics, such as DNA barcoding which is based on a sequence-distance threshold, while others use Bayesian model comparison under the multispecies coalescent model. Here we use mathematical analysis and computer simulation to demonstrate large differences in statistical performance of species identification between DNA barcoding and Bayesian inference under the multispecies coalescent model as implemented in the bpp program. We show that a fixed genetic-distance threshold as used in DNA barcoding is problematic for delimiting species, even if the threshold is "optimized", because different species have different population sizes and different divergence times, and therefore display different amounts of intra-species versus inter-species variation. In contrast, bpp can reliably delimit species in such situations with only one locus and rarely supports a wrong assignment with high posterior probability. While under-sampling or rare specimens may pose problems for heuristic methods, bpp can delimit species with high power when multi-locus data are used, even if the species is represented by a single specimen. Finally we demonstrate that bpp may be powerful for delimiting cryptic species using specimens that are misidentified as a single species in the barcoding library.
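
The fixed sequence-distance threshold that the abstract above criticizes can be illustrated with a minimal sketch. It assumes two pre-aligned sequences of equal length, uses the uncorrected p-distance, and applies an arbitrary cutoff of 0.03 chosen only for illustration. The function names and the cutoff are hypothetical; none of this reproduces the bpp analysis or any particular barcoding tool.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or ambiguity code are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (x, y)
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x in "ACGT" and y in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def same_species_by_threshold(
    query: str, reference: str, threshold: float = 0.03
) -> bool:
    """Heuristic call: same species if the distance is within the cutoff.

    The default threshold is an illustrative assumption, not a recommendation.
    """
    return p_distance(query, reference) <= threshold


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    qry = "ACGTACGTAA"
    print(p_distance(qry, ref), same_species_by_threshold(qry, ref))
```

Because the cutoff is the same for every pair of species, the rule cannot adapt to differences in population size or divergence time, which is the weakness the abstract points to.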


CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

Abstract Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2- to 10^4-fold reduction of storage space, CRAFT performs fast queries on gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
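
To see why k-mer frequency vectors become costly for large k, the following sketch builds sparse k-mer frequency profiles and compares them with a cosine distance. It illustrates plain alignment-free comparison only. It is not CRAFT's learned embedding, and the function names and the choice of cosine distance are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict

DNA = set("ACGT")


def kmer_frequencies(seq: str, k: int) -> Dict[str, float]:
    """Relative frequencies of all k-mers made only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= DNA
    )
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def cosine_distance(f: Dict[str, float], g: Dict[str, float]) -> float:
    """One minus the cosine similarity of two sparse frequency vectors."""
    dot = sum(v * g.get(kmer, 0.0) for kmer, v in f.items())
    norm_f = math.sqrt(sum(v * v for v in f.values()))
    norm_g = math.sqrt(sum(v * v for v in g.values()))
    if norm_f == 0.0 or norm_g == 0.0:
        raise ValueError("empty k-mer profile")
    return 1.0 - dot / (norm_f * norm_g)


if __name__ == "__main__":
    a = kmer_frequencies("ACGTACGTACGTTTGA", 3)
    b = kmer_frequencies("ACGTACGAACGTTTGA", 3)
    print(cosine_distance(a, b))
    # A dense vector needs one entry per possible k-mer, i.e. 4**k entries.
    print(4 ** 12)
```

A dense profile has 4^k entries, so k = 12 already means more than 16 million values per sequence, which is the storage pressure that motivates a compact representation.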
