Co-estimating Reticulate Phylogenies and Gene Trees from Multi-locus Sequence Data

Abstract The multispecies network coalescent (MSNC) is a stochastic process that captures how gene trees grow within the branches of a phylogenetic network. Coupling the MSNC with a stochastic mutational process that operates along the branches of the gene trees gives rise to a generative model of how multiple loci from within and across species evolve in the presence of both incomplete lineage sorting (ILS) and reticulation (e.g., hybridization). We report on a Bayesian method for sampling the parameters of this generative model, including the species phylogeny, gene trees, divergence times, and population sizes, from DNA sequences of multiple independent loci. We demonstrate the utility of our method by analyzing simulated data and reanalyzing three biological data sets. Our results demonstrate the significance of not only co-estimating species phylogenies and gene trees, but also accounting for reticulation and ILS simultaneously. In particular, we show that when gene flow occurs, our method accurately estimates the evolutionary histories, coalescence times, and divergence times. Tree inference methods, on the other hand, underestimate divergence times and overestimate coalescence times when the evolutionary history is reticulate. While the MSNC corresponds to an abstract model of “intermixture,” we study the performance of the model and method on simulated data generated under a gene flow model. We show that the method accurately infers the most recent time at which gene flow occurs. Finally, we demonstrate the application of the new method to a 106-locus yeast data set. [Multispecies network coalescent; reticulation; incomplete lineage sorting; phylogenetic network; Bayesian inference; RJMCMC.]

Maximum Parsimony Inference of Phylogenetic Networks in the Presence of Polyploid Complexes

10.1101/2020.09.28.317651 ◽

2020 ◽

Author(s):

Zhi Yan ◽

Zhen Cao ◽

Yushu Liu ◽

Luay Nakhleh

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Biological Data ◽

Phylogenetic Networks ◽

Data Sets ◽

Gene Trees ◽

Polyploid Species ◽

Lineage Sorting ◽

Work Done

Abstract Phylogenetic networks provide a powerful framework for modeling and analyzing reticulate evolutionary histories. While polyploidy has been shown to be prevalent not only in plants but also in other groups of eukaryotic species, most work done thus far on phylogenetic network inference assumes diploid hybridization. These inference methods have been applied, with varying degrees of success, to data sets with polyploid species, even though polyploidy violates the mathematical assumptions underlying these methods. Statistical methods were developed recently for handling specific types of polyploids, and so were parsimony methods that could handle polyploidy more generally while excluding processes such as incomplete lineage sorting. In this paper, we introduce a new method for inferring most parsimonious phylogenetic networks on data that include polyploid species. Taking gene trees as input, the method seeks a phylogenetic network that minimizes deep coalescences while accounting for polyploidy. The method could also infer trees, thus potentially distinguishing between auto- and allo-polyploidy. We demonstrate the performance of the method on both simulated and biological data. The inference method as well as a method for evaluating given phylogenetic networks are implemented and publicly available in the PhyloNet software package.

A Divide-and-Conquer Method for Scalable Phylogenetic Network Inference from Multi-locus Data

10.1101/587725 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jiafan Zhu ◽

Xinhao Liu ◽

Huw A. Ogilvie ◽

Luay K. Nakhleh

Keyword(s):

Large Scale ◽

Network Inference ◽

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Biological Data ◽

Phylogenetic Networks ◽

Divide And Conquer ◽

Lineage Sorting ◽

Step Method ◽

Sequence Alignments

Abstract Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting (ILS). However, these methods can only handle a small number of loci from a handful of genomes. In this paper, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological data sets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference. We implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet).
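
The abstract above mentions a Hitting Set formulation solved with a simple heuristic but does not spell out either. As a rough illustration only, the following sketch shows a generic greedy heuristic for the classical Hitting Set problem (choose a small set of elements so that every given set contains at least one chosen element). The function name `greedy_hitting_set` and the input format are assumptions made for this example; this is not the formulation or heuristic implemented in PhyloNet.

```python
from typing import Hashable, Iterable, List, Set


def greedy_hitting_set(collection: Iterable[Set[Hashable]]) -> List[Hashable]:
    """Greedy heuristic for Hitting Set (hypothetical helper, not PhyloNet code).

    Repeatedly picks the element that occurs in the largest number of sets
    that have not been hit yet, until every non-empty set is hit.
    """
    # Empty sets can never be hit, so they are ignored.
    remaining = [set(s) for s in collection if s]
    chosen: List[Hashable] = []
    while remaining:
        # Count how many un-hit sets each element occurs in.
        counts = {}
        for s in remaining:
            for x in s:
                counts[x] = counts.get(x, 0) + 1
        # Most frequent element first; ties broken by repr for determinism.
        best = min(counts, key=lambda x: (-counts[x], repr(x)))
        chosen.append(best)
        # Drop every set that the new element hits.
        remaining = [s for s in remaining if best not in s]
    return chosen


example_sets = [{"A", "B"}, {"B", "C"}, {"C", "D"}]
hitting = greedy_hitting_set(example_sets)
```

The greedy rule is not guaranteed to return a minimum hitting set in general, but it is a common baseline for this NP-hard problem.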

Practical Aspects of Phylogenetic Network Analysis Using PhyloNet

10.1101/746362 ◽

2019 ◽

Author(s):

Zhen Cao ◽

Xinhao Liu ◽

Huw A. Ogilvie ◽

Zhi Yan ◽

Luay Nakhleh

Keyword(s):

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Synthetic Data ◽

Simulated Data ◽

Single Species ◽

Phylogenetic Networks ◽

Lineage Sorting ◽

Data Set ◽

Types Of Information ◽

Analyze Data

Abstract Phylogenetic networks extend trees to enable simultaneous modeling of both vertical and horizontal evolutionary processes. PhyloNet is a software package that has been under constant development for over 10 years and includes a wide array of functionalities for inferring and analyzing phylogenetic networks. These functionalities differ in terms of the input data they require, the criteria and models they employ, and the types of information they allow one to infer about the networks beyond their topologies. Furthermore, PhyloNet includes functionalities for simulating synthetic data on phylogenetic networks, quantifying the topological differences between phylogenetic networks, and evaluating evolutionary hypotheses given in the form of phylogenetic networks. In this paper, we use a simulated data set to illustrate the use of several of PhyloNet’s functionalities and make recommendations on how to analyze data sets and interpret the results when using these functionalities. All inference methods that we illustrate are incomplete lineage sorting (ILS) aware; that is, they account for the potential of ILS in the data while inferring the phylogenetic network. While the models do not include gene duplication and loss, we discuss how the methods can be used to analyze data in the presence of polyploidy. The concept of species is irrelevant for the computational analyses enabled by PhyloNet in that species-individuals mappings are user-defined. Consequently, none of the functionalities in PhyloNet deals with the task of species delimitation. In this sense, the data being analyzed could come from different individuals within a single species, in which case population structure along with potential gene flow is inferred (assuming the data has sufficient signal), or from different individuals sampled from different species, in which case the species phylogeny is being inferred.

Maximum Parsimony Inference of Phylogenetic Networks in the Presence of Polyploid Complexes

Systematic Biology ◽

10.1093/sysbio/syab081 ◽

2021 ◽

Author(s):

Zhi Yan ◽

Zhen Cao ◽

Yushu Liu ◽

Huw A Ogilvie ◽

Luay Nakhleh

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Phylogenetic Network ◽

Biological Data ◽

Phylogenetic Networks ◽

Data Sets ◽

Polyploid Species ◽

Lineage Sorting ◽

Work Done

Abstract Phylogenetic networks provide a powerful framework for modeling and analyzing reticulate evolutionary histories. While polyploidy has been shown to be prevalent not only in plants but also in other groups of eukaryotic species, most work done thus far on phylogenetic network inference assumes diploid hybridization. These inference methods have been applied, with varying degrees of success, to data sets with polyploid species, even though polyploidy violates the mathematical assumptions underlying these methods. Statistical methods were developed recently for handling specific types of polyploids, and so were parsimony methods that could handle polyploidy more generally while excluding processes such as incomplete lineage sorting. In this paper, we introduce a new method for inferring most parsimonious phylogenetic networks on data that include polyploid species. Taking gene tree topologies as input, the method seeks a phylogenetic network that minimizes deep coalescences while accounting for polyploidy. We demonstrate the performance of the method on both simulated and biological data. The inference method as well as a method for evaluating evolutionary hypotheses in the form of phylogenetic networks are implemented and publicly available in the PhyloNet software package.
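
To make the "minimizes deep coalescences" criterion concrete, here is a minimal sketch of how the deep coalescence cost (number of extra lineages) can be counted in the simplest setting: a species tree rather than a network, and exactly one sampled allele per species. The nested-tuple tree encoding and the function names are assumptions made for this example; the method described above additionally handles reticulations and polyploidy, which this sketch does not attempt.

```python
def leaf_set(tree):
    """Return the set of leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def count_maximal_clades(gene_tree, cluster):
    """Count maximal gene-tree clades whose leaves all belong to `cluster`.

    This equals the number of gene lineages that leave the species-tree
    branch sitting above `cluster` (one allele per species assumed).
    """
    if leaf_set(gene_tree) <= cluster:
        return 1
    if isinstance(gene_tree, str):
        return 0
    return sum(count_maximal_clades(child, cluster) for child in gene_tree)


def extra_lineages(species_tree, gene_tree):
    """Sum over all non-root species-tree branches of (lineages leaving - 1)."""
    total = 0

    def visit(node, is_root):
        nonlocal total
        if not is_root:
            k = count_maximal_clades(gene_tree, leaf_set(node))
            total += max(k - 1, 0)  # branches with no sampled lineage add nothing
        if not isinstance(node, str):
            for child in node:
                visit(child, False)

    visit(species_tree, True)
    return total


species = (("A", "B"), "C")
concordant_cost = extra_lineages(species, (("A", "B"), "C"))
discordant_cost = extra_lineages(species, (("A", "C"), "B"))
```

In this example the concordant gene tree fits the species tree with no extra lineages, while the discordant one forces two lineages to leave the branch above (A, B), giving a cost of one.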

Phylogenetic Trees and Networks Can Serve as Powerful and Complementary Approaches for Analysis of Genomic Data

Systematic Biology ◽

10.1093/sysbio/syz056 ◽

2019 ◽

Vol 69 (3) ◽

pp. 593-601 ◽

Cited By ~ 9

Author(s):

Christopher Blair ◽

Cécile Ané

Keyword(s):

Gene Flow ◽

Phylogenetic Trees ◽

Evolutionary History ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Phylogenetic Network ◽

Genomic Data ◽

Point Of View ◽

Phylogenetic Networks ◽

Lineage Sorting

Abstract Genomic data have had a profound impact on nearly every biological discipline. In systematics and phylogenetics, the thousands of loci that are now being sequenced can be analyzed under the multispecies coalescent model (MSC) to explicitly account for gene tree discordance due to incomplete lineage sorting (ILS). However, the MSC assumes no gene flow post divergence, calling for additional methods that can accommodate this limitation. Explicit phylogenetic network methods have emerged, which can simultaneously account for ILS and gene flow by representing evolutionary history as a directed acyclic graph. In this point of view, we highlight some of the strengths and limitations of phylogenetic networks and argue that tree-based inference should not be blindly abandoned in favor of networks simply because they represent more parameter rich models. Attention should be given to model selection of reticulation complexity, and the most robust conclusions regarding evolutionary history are likely obtained when combining tree- and network-based inference.

A phylotranscriptome study using silica gel-dried leaf tissues produces an updated robust phylogeny of Ranunculaceae

10.1101/2021.07.29.454256 ◽

2021 ◽

Author(s):

Jian He ◽

Rudan Lyu ◽

Yike Luo ◽

Jiamin Xiao ◽

Lei Xie ◽

...

Keyword(s):

Silica Gel ◽

Incomplete Lineage Sorting ◽

Rna Extraction ◽

Phylogenetic Reconstruction ◽

Phylogenetic Network ◽

Molecular Dating ◽

Limiting Factor ◽

Gene Trees ◽

Lineage Sorting ◽

Leaf Tissues

The utility of transcriptome data in plant phylogenetics has gained popularity in recent years. However, because RNA degrades much more easily than DNA, the logistics of obtaining fresh tissues has become a major limiting factor for widely applying this method. Here, we used Ranunculaceae to test whether silica-dried plant tissues could be used for RNA extraction and subsequent phylogenomic studies. We sequenced 27 transcriptomes, 21 from silica gel-dried (SD-samples) and six from liquid nitrogen-preserved (LN-samples) leaf tissues, and downloaded 27 additional transcriptomes from GenBank. Our results showed that although the LN-samples produced slightly better reads than the SD-samples, there were no significant differences in RNA quality and quantity, assembled contig lengths and numbers, and BUSCO comparisons between the two treatments. Using these data, we conducted phylogenomic analyses, including concatenated- and coalescent-based phylogenetic reconstruction, molecular dating, coalescent simulation, phylogenetic network estimation, and whole genome duplication (WGD) inference. The resulting phylogeny was consistent with previous studies, with higher resolution and statistical support. The 11 core Ranunculaceae tribes grouped into two chromosome type clades (T- and R-types), with high support. Discordance among gene trees is likely due to hybridization and introgression, ancient genetic polymorphism, and incomplete lineage sorting. Our results strongly support one ancient hybridization event within the R-type clade and three WGD events in Ranunculales. Evolution of the three Ranunculaceae chromosome types is likely not directly related to WGD events. By clearly resolving the Ranunculaceae phylogeny, we demonstrated that SD-samples can be used for RNA-seq and phylotranscriptomic studies of angiosperms.

Bears in a Forest of Gene Trees: Phylogenetic Inference Is Complicated by Incomplete Lineage Sorting and Gene Flow

Molecular Biology and Evolution ◽

10.1093/molbev/msu186 ◽

2014 ◽

Vol 31 (8) ◽

pp. 2004-2017 ◽

Cited By ~ 75

Author(s):

Verena E. Kutschera ◽

Tobias Bidon ◽

Frank Hailer ◽

Julia L. Rodi ◽

Steven R. Fain ◽

...

Keyword(s):

Gene Flow ◽

Incomplete Lineage Sorting ◽

Phylogenetic Inference ◽

Gene Trees ◽

Lineage Sorting

The genomic view of diversification

10.1101/413427 ◽

2018 ◽

Author(s):

Julie Marin ◽

Guillaume Achaz ◽

Anton Crombach ◽

Amaury Lambert

Keyword(s):

Gene Flow ◽

Incomplete Lineage Sorting ◽

Molecular Data ◽

Species Tree ◽

Data Sets ◽

Gene Trees ◽

Lineage Sorting ◽

Parameters Tuning ◽

History Of ◽

New Framework

Abstract Evolutionary relationships between species are traditionally represented in the form of a tree, the species tree. Its reconstruction from molecular data is hindered by frequent conflicts between gene genealogies. Usually, these disagreements are explained by incomplete lineage sorting (ILS) due to random coalescences of gene lineages inside the edges of the species tree. This paradigm, the multi-species coalescent (MSC), is constantly violated by the ubiquitous presence of gene flow, leading to incongruences between gene trees that cannot be explained by ILS alone. Here we argue instead in favor of a vision acknowledging the importance of gene flow and where gene histories shape the species tree rather than the opposite. We propose a new framework for modeling the joint evolution of gene and species lineages relaxing the hierarchy between the species tree and gene trees. We implement this framework in two mathematical models called the gene-based diversification models (GBD): 1) GBD-forward following all evolving genomes and 2) GBD-backward based on coalescent theory. They feature four parameters tuning colonization, gene flow, genetic drift and genetic differentiation. We propose a quick inference method based on differences between gene trees. Applied to two empirical data-sets prone to gene flow, we find a better support for the GBD model than for the MSC model. Along with the increasing awareness of the extent of gene flow, this work shows the importance of considering the richer signal contained in genomic histories, rather than in the mere species tree, to better apprehend the complex evolutionary history of species.

Speciation genes are more likely to have discordant gene trees

10.1101/244822 ◽

2018 ◽

Cited By ~ 1

Author(s):

Richard J. Wang ◽

Matthew W. Hahn

Keyword(s):

Gene Flow ◽

Reproductive Isolation ◽

Incomplete Lineage Sorting ◽

Gene Trees ◽

Species Trees ◽

Species Relationships ◽

Lineage Sorting ◽

Epistatic Interactions ◽

Speciation Event ◽

History Of

Abstract Speciation genes are responsible for reproductive isolation between species. By directly participating in the process of speciation, the genealogies of isolating loci have been thought to more faithfully represent species trees. The unique properties of speciation genes may provide valuable evolutionary insights and help determine the true history of species divergence. Here, we formally analyze whether genealogies from loci participating in Dobzhansky-Muller (DM) incompatibilities are more likely to be concordant with the species tree under incomplete lineage sorting (ILS). Individual loci differ stochastically from the true history of divergence with a predictable frequency due to ILS, and these expectations, combined with the DM model of intrinsic reproductive isolation from epistatic interactions, can be used to examine the probability of concordance at isolating loci. Contrary to existing verbal models, we find that reproductively isolating loci that follow the DM model are often more likely to have discordant gene trees. These results are dependent on the pattern of isolation observed between three species, the time between speciation events, and the time since the last speciation event. Results supporting a higher probability of discordance are found for both derived-derived and derived-ancestral DM pairs, and regardless of whether incompatibilities are allowed or prohibited from segregating in the same population. Our overall results suggest that DM loci are unlikely to be especially useful for reconstructing species relationships, even in the presence of gene flow between incipient species, and may in fact be positively misleading.
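
The "predictable frequency" of discordance due to ILS has a well-known closed form in the simplest neutral case: three species, one sampled lineage per species, and an internal species-tree branch of length t in coalescent units. The sketch below computes that baseline only; it does not model DM incompatibilities, and the function name is an assumption made for this example.

```python
import math


def triple_topology_probs(t):
    """Gene tree topology probabilities for a rooted three-species tree.

    t is the internal branch length in coalescent units. With probability
    exp(-t) the two sister lineages fail to coalesce on that branch, after
    which each of the three topologies is equally likely.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    discordant_each = math.exp(-t) / 3.0
    return {
        "concordant": 1.0 - 2.0 * discordant_each,
        "discordant_each": discordant_each,
    }


probs_short = triple_topology_probs(0.0)
probs_long = triple_topology_probs(10.0)
```

With t = 0 all three topologies are equally likely (probability 1/3 each), and the concordant topology approaches probability 1 as the internal branch grows.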

Unifying Gene Duplication, Loss, and Coalescence on Phylogenetic Networks

10.1101/589655 ◽

2019 ◽

Cited By ~ 3

Author(s):

Peng Du ◽

Huw A. Ogilvie ◽

Luay Nakhleh

Keyword(s):

Gene Duplication ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Phylogenetic Network ◽

Biological Data ◽

Phylogenetic Networks ◽

Lineage Sorting ◽

Evolutionary Processes ◽

Domains Of Life ◽

Gene Duplication And Loss

Abstract Statistical methods were recently introduced for inferring phylogenetic networks under the multispecies network coalescent, thus accounting for both reticulation and incomplete lineage sorting. Two evolutionary processes that are ubiquitous across all three domains of life, but are not accounted for by those methods, are gene duplication and loss (GDL). In this work, we devise a three-piece model (phylogenetic network, locus network, and gene tree) that unifies all the aforementioned processes into a single model of how genes evolve in the presence of ILS, GDL, and introgression within the branches of a phylogenetic network. To illustrate the power of this model, we develop an algorithm for estimating the parameters of a phylogenetic network topology under this unified model. The algorithm consists of a set of moves that allow for stochastic search through the parameter space. The challenges with developing such moves stem from the intricate dependencies among the three pieces of the model. We demonstrate the application of the model and the accuracy of the algorithm on simulated as well as biological data. Our work adds to the biologist’s toolbox of methods for phylogenomic inference by accounting for more complex evolutionary processes.
