A Divide-and-Conquer Method for Scalable Phylogenetic Network Inference from Multi-locus Data

AbstractReticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting (ILS). However, these methods can only handle a small number of loci from a handful of genomes.In this paper, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological data sets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference.We implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet)[email protected]

Download Full-text

A divide-and-conquer method for scalable phylogenetic network inference from multilocus data

Bioinformatics ◽

10.1093/bioinformatics/btz359 ◽

2019 ◽

Vol 35 (14) ◽

pp. i370-i378 ◽

Cited By ~ 5

Author(s):

Jiafan Zhu ◽

Xinhao Liu ◽

Huw A Ogilvie ◽

Luay K Nakhleh

Keyword(s):

Large Scale ◽

Network Inference ◽

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Phylogenetic Networks ◽

Divide And Conquer ◽

Supplementary Information ◽

Lineage Sorting ◽

Step Method ◽

Sequence Alignments

Abstract Motivation Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting. However, these methods can only handle a small number of loci from a handful of genomes. Results In this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference. Availability and implementation We implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet). Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Maximum Parsimony Inference of Phylogenetic Networks in the Presence of Polyploid Complexes

10.1101/2020.09.28.317651 ◽

2020 ◽

Author(s):

Zhi Yan ◽

Zhen Cao ◽

Yushu Liu ◽

Luay Nakhleh

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Biological Data ◽

Phylogenetic Networks ◽

Data Sets ◽

Gene Trees ◽

Polyploid Species ◽

Lineage Sorting ◽

Work Done

AbstractPhylogenetic networks provide a powerful framework for modeling and analyzing reticulate evolutionary histories. While polyploidy has been shown to be prevalent not only in plants but also in other groups of eukaryotic species, most work done thus far on phylogenetic network inference assumes diploid hybridization. These inference methods have been applied, with varying degrees of success, to data sets with polyploid species, even though polyploidy violates the mathematical assumptions underlying these methods. Statistical methods were developed recently for handling specific types of polyploids and so were parsimony methods that could handle polyploidy more generally yet while excluding processes such as incomplete lineage sorting. In this paper, we introduce a new method for inferring most parsimonious phylogenetic networks on data that include polyploid species. Taking gene trees as input, the method seeks a phylogenetic network that minimizes deep coalescences while accounting for polyploidy. The method could also infer trees, thus potentially distinguishing between auto- and allo-polyploidy. We demonstrate the performance of the method on both simulated and biological data. The inference method as well as a method for evaluating given phylogenetic networks are implemented and publicly available in the PhyloNet software package.

Download Full-text

Maximum Parsimony Inference of Phylogenetic Networks in the Presence of Polyploid Complexes

Systematic Biology ◽

10.1093/sysbio/syab081 ◽

2021 ◽

Author(s):

Zhi Yan ◽

Zhen Cao ◽

Yushu Liu ◽

Huw A Ogilvie ◽

Luay Nakhleh

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Phylogenetic Network ◽

Biological Data ◽

Phylogenetic Networks ◽

Data Sets ◽

Polyploid Species ◽

Lineage Sorting ◽

Work Done

Abstract Phylogenetic networks provide a powerful framework for modeling and analyzing reticulate evolutionary histories. While polyploidy has been shown to be prevalent not only in plants but also in other groups of eukaryotic species, most work done thus far on phylogenetic network inference assumes diploid hybridization. These inference methods have been applied, with varying degrees of success, to data sets with polyploid species, even though polyploidy violates the mathematical assumptions underlying these methods. Statistical methods were developed recently for handling specific types of polyploids and so were parsimony methods that could handle polyploidy more generally yet while excluding processes such as incomplete lineage sorting. In this paper, we introduce a new method for inferring most parsimonious phylogenetic networks on data that include polyploid species. Taking gene tree topologies as input, the method seeks a phylogenetic network that minimizes deep coalescences while accounting for polyploidy. We demonstrate the performance of the method on both simulated and biological data. The inference method as well as a method for evaluating evolutionary hypotheses in the form of phylogenetic networks are implemented and publicly available in the PhyloNet software package.

Download Full-text

NetRAX: Accurate and Fast Maximum Likelihood Phylogenetic Network Inference⋆

10.1101/2021.08.30.458194 ◽

2021 ◽

Author(s):

Sarah Lutteropp ◽

Céline Scornavacca ◽

Alexey M. Kozlov ◽

Benoit Morel ◽

Alexandros Stamatakis

Keyword(s):

Maximum Likelihood ◽

Network Inference ◽

Likelihood Function ◽

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Phylogenetic Networks ◽

Small Data ◽

Lineage Sorting ◽

Sequence Alignments ◽

Multiple Sequence

AbstractPhylogenetic networks are used to represent non-treelike evolutionary scenarios. Current, actively developed approaches for phylogenetic network inference jointly account for non-treelike evolution and incomplete lineage sorting (ILS). Unfortunately, this induces a very high computational complexity. Hence, current tools can only analyze small data sets.We present NetRAX, a tool for maximum likelihood inference of phylogenetic networks in the absence of incomplete lineage sorting. Our tool leverages state-of-the-art methods for efficiently computing the phylogenetic likelihood function on trees, and extends them to phylogenetic networks via the notion of “displayed trees”. NetRAX can infer maximum likelihood phylogenetic networks from partitioned multiple sequence alignments and returns the inferred networks in Extended Newick format.On simulated data, our results show a very low relative difference in BIC score and a near-zero unrooted softwired cluster distance to the true, simulated networks. With NetRAX, a network inference on a partitioned alignment with 8, 000 sites, 30 taxa, and 3 reticulations completes within a few minutes on a standard laptop.Our implementation is available under the GNU General Public License v3.0 at https://github.com/lutteropp/NetRAX.

Download Full-text

Inferring Phylogenetic Networks Using PhyloNet

10.1101/238071 ◽

2017 ◽

Cited By ~ 1

Author(s):

Dingqiao Wen ◽

Yun Yu ◽

Jiafan Zhu ◽

Luay Nakhleh

Keyword(s):

Bayesian Inference ◽

Maximum Likelihood ◽

Maximum Parsimony ◽

Software Package ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Phylogenetic Network ◽

Phylogenetic Networks ◽

Lineage Sorting ◽

Sequence Alignments

AbstractPhyloNet was released in 2008 as a software package for representing and analyzing phylogenetic networks. At the time of its release, the main functionalities in PhyloNet consisted of measures for comparing network topologies and a single heuristic for reconciling gene trees with a species tree. Since then, PhyloNet has grown significantly. The software package now includes a wide array of methods for inferring phylogenetic networks from data sets of unlinked loci while accounting for both reticulation (e.g., hybridization) and incomplete lineage sorting. In particular, PhyloNet now allows for maximum parsimony, maximum likelihood, and Bayesian inference of phylogenetic networks from gene tree estimates. Furthermore, Bayesian inference directly from sequence data (sequence alignments or bi-allelic markers) is implemented. Maximum parsimony is based on an extension of the “minimizing deep coalescences” criterion to phylogenetic networks, whereas maximum likelihood and Bayesian inference are based on the multispecies network coalescent. All methods allow for multiple individuals per species. As computing the likelihood of a phylogenetic network is computationally hard, PhyloNet allows for evaluation and inference of networks using a pseudo-likelihood measure. PhyloNet summarizes the results of the various analyses, and generates phylogenetic networks in the extended Newick format that is readily viewable by existing visualization software, [phylogenetic networks; reticulation; incomplete lineage sorting; multispecies network coalescent; Bayesian inference; maximum likelihood; maximum parsimony.]

Download Full-text

Unifying Gene Duplication, Loss, and Coalescence on Phylogenetic Networks

10.1101/589655 ◽

2019 ◽

Cited By ~ 3

Author(s):

Peng Du ◽

Huw A. Ogilvie ◽

Luay Nakhleh

Keyword(s):

Gene Duplication ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Phylogenetic Network ◽

Biological Data ◽

Phylogenetic Networks ◽

Lineage Sorting ◽

Evolutionary Processes ◽

Domains Of Life ◽

Gene Duplication And Loss

AbstractStatistical methods were recently introduced for inferring phylogenetic networks under the multispecies network coalescent, thus accounting for both reticulation and incomplete lineage sorting. Two evolutionary processes that are ubiquitous across all three domains of life, but are not accounted for by those methods, are gene duplication and loss (GDL).In this work, we devise a three-piece model—phylogenetic network, locus network, and gene tree—that unifies all the aforementioned processes into a single model of how genes evolve in the presence of ILS, GDL, and introgression within the branches of a phylogenetic network. To illustrate the power of this model, we develop an algorithm for estimating the parameters of a phylogenetic network topology under this unified model. The algorithm consists of a set of moves that allow for stochastic search through the parameter space. The challenges with developing such moves stem from the intricate dependencies among the three pieces of the model. We demonstrate the application of the model and the accuracy of the algorithm on simulated as well as biological data.Our work adds to the biologist’s toolbox of methods for phylogenomic inference by accounting for more complex evolutionary processes.

Download Full-text

A test statistic to quantify treelikeness in phylogenetics

10.1101/2021.02.16.431544 ◽

2021 ◽

Author(s):

Caitlin Cherryh ◽

Bui Quang Minh ◽

Rob Lanfear

Keyword(s):

Evolutionary History ◽

Incomplete Lineage Sorting ◽

Phylogenetic Analyses ◽

Phylogenetic Network ◽

Parametric Bootstrap ◽

Lineage Sorting ◽

Test Statistic ◽

Sequence Alignments ◽

Wide Range ◽

History Of

AbstractMost phylogenetic analyses assume that the evolutionary history of an alignment (either that of a single locus, or of multiple concatenated loci) can be described by a single bifurcating tree, the so-called the treelikeness assumption. Treelikeness can be violated by biological events such as recombination, introgression, or incomplete lineage sorting, and by systematic errors in phylogenetic analyses. The incorrect assumption of treelikeness may then mislead phylogenetic inferences. To quantify and test for treelikeness in alignments, we develop a test statistic which we call the tree proportion. This statistic quantifies the proportion of the edge weights in a phylogenetic network that are represented in a bifurcating phylogenetic tree of the same alignment. We extend this statistic to a statistical test of treelikeness using a parametric bootstrap. We use extensive simulations to compare tree proportion to a range of related approaches. We show that tree proportion successfully identifies non-treelikeness in a wide range of simulation scenarios, and discuss its strengths and weaknesses compared to other approaches. The power of the tree-proportion test to reject non-treelike alignments can be lower than some other approaches, but these approaches tend to be limited in their scope and/or the ease with which they can be interpreted. Our recommendation is to test treelikeness of sequence alignments with both tree proportion and mosaic methods such as 3Seq. The scripts necessary to replicate this study are available at https://github.com/caitlinch/treelikeness

Download Full-text

Disentangling Sources of Gene Tree Discordance in Phylogenomic Data Sets: Testing Ancient Hybridizations in Amaranthaceae s.l

Systematic Biology ◽

10.1093/sysbio/syaa066 ◽

2020 ◽

Cited By ~ 1

Author(s):

Diego F Morales-Briones ◽

Gudrun Kadereit ◽

Delphine T Tefarikis ◽

Michael J Moore ◽

Stephen A Smith ◽

...

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Phylogenetic Network ◽

Species Tree ◽

Data Sets ◽

Species Trees ◽

Lineage Sorting ◽

Data Set ◽

Gene Tree Discordance

Abstract Gene tree discordance in large genomic data sets can be caused by evolutionary processes such as incomplete lineage sorting and hybridization, as well as model violation, and errors in data processing, orthology inference, and gene tree estimation. Species tree methods that identify and accommodate all sources of conflict are not available, but a combination of multiple approaches can help tease apart alternative sources of conflict. Here, using a phylotranscriptomic analysis in combination with reference genomes, we test a hypothesis of ancient hybridization events within the plant family Amaranthaceae s.l. that was previously supported by morphological, ecological, and Sanger-based molecular data. The data set included seven genomes and 88 transcriptomes, 17 generated for this study. We examined gene-tree discordance using coalescent-based species trees and network inference, gene tree discordance analyses, site pattern tests of introgression, topology tests, synteny analyses, and simulations. We found that a combination of processes might have generated the high levels of gene tree discordance in the backbone of Amaranthaceae s.l. Furthermore, we found evidence that three consecutive short internal branches produce anomalous trees contributing to the discordance. Overall, our results suggest that Amaranthaceae s.l. might be a product of an ancient and rapid lineage diversification, and remains, and probably will remain, unresolved. This work highlights the potential problems of identifiability associated with the sources of gene tree discordance including, in particular, phylogenetic network methods. Our results also demonstrate the importance of thoroughly testing for multiple sources of conflict in phylogenomic analyses, especially in the context of ancient, rapid radiations. We provide several recommendations for exploring conflicting signals in such situations. [Amaranthaceae; gene tree discordance; hybridization; incomplete lineage sorting; phylogenomics; species network; species tree; transcriptomics.]

Download Full-text

Co-estimating Reticulate Phylogenies and Gene Trees from Multi-locus Sequence Data

10.1101/095539 ◽

2016 ◽

Cited By ~ 6

Author(s):

Dingqiao Wen ◽

Luay Nakhleh

Keyword(s):

Gene Flow ◽

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Simulated Data ◽

Biological Data ◽

Generative Model ◽

Divergence Times ◽

Gene Trees ◽

Lineage Sorting ◽

Coalescence Times

AbstractThe multispecies network coalescent (MSNC) is a stochastic process that captures how gene trees grow within the branches of a phylogenetic network. Coupling the MSNC with a stochastic mutational process that operates along the branches of the gene trees gives rise to a generative model of how multiple loci from within and across species evolve in the presence of both incomplete lineage sorting (ILS) and reticulation (e.g., hybridization). We report on a Bayesian method for sampling the parameters of this generative model, including the species phylogeny, gene trees, divergence times, and population sizes, from DNA sequences of multiple independent loci. We demonstrate the utility of our method by analyzing simulated data and reanalyzing three biological data sets. Our results demonstrate the significance of not only co-estimating species phylogenies and gene trees, but also accounting for reticulation and ILS simultaneously. In particular, we show that when gene flow occurs, our method accurately estimates the evolutionary histories, coalescence times, and divergence times. Tree inference methods, on the other hand, underestimate divergence times and overestimate coalescence times when the evolutionary history is reticulate. While the MSNC corresponds to an abstract model of “intermixture,” we study the performance of the model and method on simulated data generated under a gene flow model. We show that the method accurately infers the most recent time at which gene flow occurs. Finally, we demonstrate the application of the new method to a 106-locus yeast data set. [Multispecies network coalescent; reticulation; incomplete lineage sorting; phylogenetic network; Bayesian inference; RJMCMC.]

Download Full-text

Practical Aspects of Phylogenetic Network Analysis Using PhyloNet

10.1101/746362 ◽

2019 ◽

Author(s):

Zhen Cao ◽

Xinhao Liu ◽

Huw A. Ogilvie ◽

Zhi Yan ◽

Luay Nakhleh

Keyword(s):

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Synthetic Data ◽

Simulated Data ◽

Single Species ◽

Phylogenetic Networks ◽

Lineage Sorting ◽

Data Set ◽

Types Of Information ◽

Analyze Data

AbstractPhylogenetic networks extend trees to enable simultaneous modeling of both vertical and horizontal evolutionary processes. PhyloNet is a software package that has been under constant development for over 10 years and includes a wide array of functionalities for inferring and analyzing phylogenetic networks. These functionalities differ in terms of the input data they require, the criteria and models they employ, and the types of information they allow to infer about the networks beyond their topologies. Furthermore, PhyloNet includes functionalities for simulating synthetic data on phylogenetic networks, quantifying the topological differences between phylogenetic networks, and evaluating evolutionary hypotheses given in the form of phylogenetic networks.In this paper, we use a simulated data set to illustrate the use of several of PhyloNet’s functionalities and make recommendations on how to analyze data sets and interpret the results when using these functionalities. All inference methods that we illustrate are incomplete lineage sorting (ILS) aware; that is, they account for the potential of ILS in the data while inferring the phylogenetic network. While the models do not include gene duplication and loss, we discuss how the methods can be used to analyze data in the presence of polyploidy.The concept of species is irrelevant for the computational analyses enabled by PhyloNet in that species-individuals mappings are user-defined. Consequently, none of the functionalities in PhyloNet deals with the task of species delimitation. In this sense, the data being analyzed could come from different individuals within a single species, in which case population structure along with potential gene flow is inferred (assuming the data has sufficient signal), or from different individuals sampled from different species, in which case the species phylogeny is being inferred.

Download Full-text