GeneRax: A tool for species tree-aware maximum likelihood based gene family tree inference under gene duplication, transfer, and loss

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species tree-aware methods also leverage information from a putative species tree. However, only a few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data pre-processing (e.g., computing bootstrap trees), and rely on approximations and heuristics that limit the degree of tree space exploration. Here we present GeneRax, the first maximum likelihood species tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss, relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared to competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson-Foulds distance. On empirical datasets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1099 Cyanobacteria families in eight minutes on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax.


GeneRax: A Tool for Species-Tree-Aware Maximum Likelihood-Based Gene Family Tree Inference under Gene Duplication, Transfer, and Loss

Molecular Biology and Evolution ◽

10.1093/molbev/msaa141 ◽

2020 ◽

Vol 37 (9) ◽

pp. 2763-2774 ◽

Cited By ~ 5

Author(s):

Benoit Morel ◽

Alexey M Kozlov ◽

Alexandros Stamatakis ◽

Gergely J Szöllősi

Keyword(s):

Maximum Likelihood ◽

Phylogenetic Trees ◽

Large Scale ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Homologous Gene ◽

Sequence Alignments ◽

Full Likelihood ◽

True Tree

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species-tree-aware methods also leverage information from a putative species tree. However, only a few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data preprocessing (e.g., computing bootstrap trees) and rely on approximations and heuristics that limit the degree of tree space exploration. Here, we present GeneRax, the first maximum likelihood species-tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss, relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared with competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson–Foulds distance. On empirical data sets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1,099 Cyanobacteria families in 8 min on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax (last accessed June 17, 2020).
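The accuracy measure named in this abstract, the relative Robinson–Foulds distance, can be illustrated with a minimal sketch. It is not GeneRax code. It assumes trees are encoded as nested Python tuples of leaf names, uses a rooted, clade-based variant of the distance for simplicity, and normalizes by the total number of non-trivial clades in both trees. The function names `clades` and `relative_rf` are hypothetical.

```python
def clades(tree):
    """Collect the non-trivial clades (as frozensets of leaf names) of a
    rooted tree given as nested tuples, e.g. (("A", "B"), "C")."""
    found = set()

    def leaves(node):
        # A leaf is a plain string; an internal node is a tuple of children.
        if isinstance(node, str):
            return frozenset([node])
        leaf_set = frozenset().union(*(leaves(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found


def relative_rf(tree_a, tree_b):
    """Rooted Robinson-Foulds distance divided by its maximum possible value
    (assumed normalization: total number of non-trivial clades)."""
    clades_a, clades_b = clades(tree_a), clades(tree_b)
    total = len(clades_a) + len(clades_b)
    if total == 0:
        return 0.0
    return len(clades_a ^ clades_b) / total


# Hypothetical five-leaf example: the inferred tree misplaces leaf "C".
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
print(relative_rf(true_tree, inferred_tree))
```

A value of 0 means the two topologies agree on every clade, and a value of 1 means they share none. Published benchmarks often use the unrooted, bipartition-based form instead, so numbers from this sketch are not directly comparable to those in the paper.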


Faculty Opinions recommendation of Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1163036.623685 ◽

2009 ◽

Author(s):

Oliver Pybus

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Sequence Alignments


SpeciesRax: A tool for maximum likelihood species tree inference from gene family trees under duplication, transfer, and loss.

10.1101/2021.03.29.437460 ◽

2021 ◽

Author(s):

Benoit Morel ◽

Paul Schade ◽

Sarah Lutteropp ◽

Tom A. Williams ◽

Gergely J. Szöllősi ◽

...

Keyword(s):

Maximum Likelihood ◽

Gene Family ◽

Gene Families ◽

Species Tree ◽

Likelihood Method ◽

Gene Trees ◽

Informative Signal ◽

Tree Inference ◽

Species Tree Inference ◽

Family Trees

Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families have the potential to leverage paralogy as informative signal. At present, no widely adopted inference method exists for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modelling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated datasets, we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large datasets. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31,612 gene families in one hour using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.


Evaluation of phylogenetic reconstruction methods using bacterial whole genomes: a simulation based study

Wellcome Open Research ◽

10.12688/wellcomeopenres.14265.2 ◽

2018 ◽

Vol 3 ◽

pp. 33 ◽

Cited By ~ 18

Author(s):

John A. Lees ◽

Michelle Kendall ◽

Julian Parkhill ◽

Caroline Colijn ◽

Stephen D. Bentley ◽

...

Keyword(s):

Maximum Likelihood ◽

Phylogenetic Reconstruction ◽

Simulated Data ◽

Real Data ◽

Tree Topology ◽

Computational Time ◽

Advantages And Disadvantages ◽

Phylogenetic Reconstructions ◽

Reconstruction Methods ◽

True Tree

Background: Phylogenetic reconstruction is a necessary first step in many analyses which use whole genome sequence data from bacterial populations. There are many available methods to infer phylogenies, and these have various advantages and disadvantages, but few unbiased comparisons of the range of approaches have been made. Methods: We simulated data from a defined 'true tree' using a realistic evolutionary model. We built phylogenies from this data using a range of methods, and compared reconstructed trees to the true tree using two measures, noting the computational time needed for different phylogenetic reconstructions. We also used real data from Streptococcus pneumoniae alignments to compare individual core gene trees to a core genome tree. Results: We found that, as expected, maximum likelihood trees from good quality alignments were the most accurate, but also the most computationally intensive. Using less accurate phylogenetic reconstruction methods, we were able to obtain results of comparable accuracy; we found that approximate results can rapidly be obtained using genetic distance based methods. In real data we found that highly conserved core genes, such as those involved in translation, gave an inaccurate tree topology, whereas genes involved in recombination events gave inaccurate branch lengths. We also show a tree-of-trees, relating the results of different phylogenetic reconstructions to each other. Conclusions: We recommend three approaches, depending on requirements for accuracy and computational time. For the most accurate tree, use of either RAxML or IQ-TREE with an alignment of variable sites produced by mapping to a reference genome is best. Quicker approaches that do not perform full maximum likelihood optimisation may be useful for many analyses requiring a phylogeny, as generating a high quality input alignment is likely to be the major limiting factor of accurate tree topology. We have publicly released our simulated data and code to enable further comparisons.


Complexity of the simplest species tree problem

Molecular Biology and Evolution ◽

10.1093/molbev/msab009 ◽

2021 ◽

Author(s):

Tianqi Zhu ◽

Ziheng Yang

Keyword(s):

Maximum Likelihood ◽

Estimation Error ◽

Gene Tree ◽

Species Tree ◽

Ancestral Population ◽

Statistical Efficiency ◽

Multispecies Coalescent ◽

Full Likelihood ◽

Tree Methods ◽

Tree Estimation

Abstract The multispecies coalescent (MSC) model provides a natural framework for species tree estimation accounting for gene-tree conflicts. While a number of species tree methods under the MSC have been suggested and evaluated using simulation, their statistical properties remain poorly understood. Here we use mathematical analysis aided by computer simulation to examine the identifiability, consistency, and efficiency of different species tree methods in the case of three species and three sequences under the molecular clock. We consider four major species-tree methods including concatenation, two-step, independent-sites maximum likelihood (ISML) and maximum likelihood (ML). We develop approximations that predict that the probit transform of the species tree estimation error decreases linearly with the square root of the number of loci. Even in this simplest case, major differences exist among the methods. Full-likelihood methods are considerably more efficient than summary methods such as concatenation and two-step. They also provide estimates of important parameters such as species divergence times and ancestral population sizes while these parameters are not identifiable by summary methods. Our results highlight the need to improve the statistical efficiency of summary methods and the computational efficiency of full likelihood methods of species tree estimation.
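The scaling relation stated in this abstract (the probit transform of the estimation error decreasing linearly with the square root of the number of loci) can be made concrete with a small sketch. The intercept and slope below are hypothetical placeholders, not values from the paper, and the function names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1


def probit(p):
    """Probit transform: the inverse of the standard normal CDF."""
    return STD_NORMAL.inv_cdf(p)


def predicted_error(n_loci, intercept, slope):
    """Error probability whose probit is intercept - slope * sqrt(n_loci)."""
    return STD_NORMAL.cdf(intercept - slope * sqrt(n_loci))


# Hypothetical coefficients, chosen only to illustrate the shape of the curve.
INTERCEPT = -0.5
SLOPE = 0.1

for n_loci in (25, 100, 400):
    error = predicted_error(n_loci, INTERCEPT, SLOPE)
    print(n_loci, round(error, 4), round(probit(error), 3))
```

Under this form, quadrupling the number of loci doubles the slope term in the probit, so a method with a larger slope needs far fewer loci to reach the same error, which is one way to read the efficiency comparison in the abstract.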


Exact Solutions for Species Tree Inference from Discordant Gene Trees

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720013420055 ◽

2013 ◽

Vol 11 (05) ◽

pp. 1342005 ◽

Cited By ~ 16

Author(s):

Wen-Chieh Chang ◽

Paweł Górecki ◽

Oliver Eulenstein

Keyword(s):

Exact Solutions ◽

Gene Tree ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Data Sets ◽

Gene Trees ◽

Species Trees ◽

Worst Case

Phylogenetic analysis has to overcome the grand challenge of inferring accurate species trees from evolutionary histories of gene families (gene trees) that are discordant with the species tree along whose branches they have evolved. Two well-studied approaches to cope with this challenge are to solve either biologically informed gene tree parsimony (GTP) problems under gene duplication, gene loss, and deep coalescence, or the classic RF supertree problem that does not rely on any biological model. Despite the potential of these problems to infer credible species trees, they are NP-hard. Therefore, these problems are addressed by heuristics that typically lack any provable accuracy and precision. We describe fast dynamic programming algorithms that solve the GTP problems and the RF supertree problem exactly, and demonstrate that our algorithms can solve instances with data sets consisting of as many as 22 taxa. Extensions of our algorithms can also report the number of all optimal species trees, as well as the trees themselves. To better assess the quality of the resulting species trees that best fit the given gene trees, we also compute the worst case species trees, their numbers, and optimization score for each of the computational problems. Finally, we demonstrate the performance of our exact algorithms using empirical and simulated data sets, and analyze the quality of heuristic solutions for the studied problems by contrasting them with our exact solutions.


Rapid and Accurate Large-Scale Coestimation of Sequence Alignments and Phylogenetic Trees

Science ◽

10.1126/science.1171243 ◽

2009 ◽

Vol 324 (5934) ◽

pp. 1561-1564 ◽

Cited By ~ 366

Author(s):

K. Liu ◽

S. Raghavan ◽

S. Nelesen ◽

C. R. Linder ◽

T. Warnow

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Sequence Alignments


Comparative transcriptomic analysis reveals conserved transcriptional programs underpinning organogenesis and reproduction in land plants

10.1101/2020.10.29.361501 ◽

2020 ◽

Author(s):

Irene Julca ◽

Camilla Ferrari ◽

María Flores-Tornero ◽

Sebastian Proost ◽

Ann-Cathrin Lindner ◽

...

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Expression Profiles ◽

Gene Families ◽

Male Reproduction ◽

Land Plants ◽

Easy Access ◽

Specific Gene ◽

Male Gametes ◽

Organ Specific

Abstract The evolution of plant organs, including leaves, stems, roots, and flowers, mediated the explosive radiation of land plants, which shaped the biosphere and allowed the establishment of terrestrial animal life. Furthermore, the fertilization products of angiosperms, seeds, serve as the basis for most of our food. The evolution of organs and immobile gametes required the coordinated acquisition of novel gene functions, the co-option of existing genes, and the development of novel regulatory programs. However, our knowledge of these events is limited, as no large-scale analyses of genomic and transcriptomic data have been performed for land plants. To remedy this, we have generated gene expression atlases for various organs and gametes of 10 plant species comprising bryophytes, vascular plants, gymnosperms, and flowering plants. Comparative analysis of the atlases identified hundreds of organ- and gamete-specific gene families and revealed that most of the specific transcriptomes are significantly conserved. Interestingly, the appearance of organ-specific gene families does not coincide with the corresponding organ’s appearance, suggesting that co-option of existing genes is the main mechanism for evolving new organs. In contrast to female gametes, male gametes showed a high number and conservation of specific genes, suggesting that male reproduction is highly specialized. The expression atlas capturing pollen development revealed numerous transcription factors and kinases essential for pollen biogenesis and function. To provide easy access to the expression atlases and these comparative analyses, we provide an online database, www.evorepro.plant.tools, that allows the exploration of expression profiles, organ-specific genes, phylogenetic trees, co-expression networks, and others.


Consistency of SVDQuartets and Maximum Likelihood for Coalescent-based Species Tree Estimation

10.1101/523050 ◽

2019 ◽

Cited By ~ 2

Author(s):

Matthew Wascher ◽

Laura Kubatko

Keyword(s):

Maximum Likelihood ◽

Species Tree ◽

Gene Trees ◽

Data Types ◽

Coalescent Model ◽

Multispecies Coalescent ◽

Consistency Results ◽

True Tree ◽

Tree Inference ◽

Statistical Consistency

Abstract Numerous methods for inferring species-level phylogenies under the coalescent model have been proposed within the last 20 years, and debates continue about the relative strengths and weaknesses of these methods. One desirable property of a phylogenetic estimator is that of statistical consistency, which means intuitively that as more data are collected, the probability that the estimated tree has the same topology as the true tree goes to 1. To date, consistency results for species tree inference under the multispecies coalescent have been derived only for summary statistics methods, such as ASTRAL and MP-EST. These methods have been found to be consistent given true gene trees, but may be inconsistent when gene trees are estimated from data for loci of finite length (Roch et al., 2019). Here we consider the question of statistical consistency for four taxa for SVDQuartets for general data types, as well as for the maximum likelihood (ML) method in the case in which the data are a collection of sites generated under the multispecies coalescent model such that the sites are conditionally independent given the species tree (we call these data Coalescent Independent Sites (CIS) data). We show that SVDQuartets is statistically consistent for all data types (i.e., for both CIS data and for multilocus data), and we derive its rate of convergence. We additionally show that ML is consistent for CIS data under the JC69 model, and discuss why a proof for the more general multilocus case is difficult. Finally, we compare the performance of maximum likelihood and SVDQuartets using simulation for both data types.


Distribution of gene tree histories under the coalescent model with gene flow

10.1101/023937 ◽

2015 ◽

Author(s):

Yuan Tian ◽

Laura Kubatko

Keyword(s):

Gene Flow ◽

Maximum Likelihood ◽

Gene Tree ◽

Species Tree ◽

Tree Topology ◽

Effective Population ◽

Sequence Alignments ◽

Data Set ◽

Coalescent Model ◽

Population Sizes

We propose a coalescent model for three species that allows gene flow between both pairs of sister populations. The model is designed to analyze multilocus genomic sequence alignments, with one sequence sampled from each of the three species. The model is formulated using a Markov chain representation, which allows use of matrix exponentiation to compute analytical expressions for the probability density of gene tree genealogies. The gene tree history distribution as well as the gene tree topology distribution under this coalescent model with gene flow are then calculated via numerical integration. We analyze the model to compare the distributions of gene tree topologies and gene tree histories for species trees with differing effective population sizes and gene flow rates. Our results suggest conditions under which the species tree and associated parameters are not identifiable from the gene tree topology distribution when gene flow is present, but indicate that the gene tree history distribution may identify the species tree and associated parameters. Thus, the gene tree history distribution can be used to infer parameters such as the ancestral effective population sizes and the rates of gene flow in a maximum likelihood (ML) framework. We conduct computer simulations to evaluate the performance of our method in estimating these parameters, and we apply our method to an Afrotropical mosquito data set (Fontaine et al., 2015) to demonstrate the usefulness of our method for the analysis of empirical data. Key words: coalescent, gene flow, migration, hybridization, gene tree, topology, history, maximum likelihood, speciation.
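The general technique this abstract relies on, representing lineage states as a continuous-time Markov chain and obtaining transition probabilities by matrix exponentiation, can be sketched with a toy example. This is not the authors' three-species model. The three states, the rate values, and the helper names `mat_mul` and `mat_exp` are all assumptions made for illustration, and the matrix exponential is computed with a plain scaled Taylor series rather than a numerical library.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(rate_matrix, t, terms=20, squarings=10):
    """Approximate exp(rate_matrix * t) by a Taylor series on a scaled-down
    matrix, followed by repeated squaring to undo the scaling."""
    n = len(rate_matrix)
    scale = t / (2 ** squarings)
    scaled = [[rate_matrix[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Toy chain for two sampled lineages (hypothetical states and rates):
#   state 0: lineages in different populations
#   state 1: lineages in the same population
#   state 2: lineages have coalesced (absorbing)
MIGRATION_RATE = 0.5
COALESCENCE_RATE = 1.0
Q = [
    [-MIGRATION_RATE, MIGRATION_RATE, 0.0],
    [MIGRATION_RATE, -(MIGRATION_RATE + COALESCENCE_RATE), COALESCENCE_RATE],
    [0.0, 0.0, 0.0],
]

P = mat_exp(Q, 2.0)
print("P(coalesced by t=2 | start in different populations):", P[0][2])
print("P(coalesced by t=2 | start in the same population):", P[1][2])
```

Each row of `P` is a probability distribution over states at time `t`, so the last column gives the cumulative probability of coalescence. Differentiating or integrating such entries is what makes analytical gene tree densities tractable in chains of this kind.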
