Taxon sampling unequally affects individual nodes in a phylogenetic tree: consequences for model gene tree construction in SwissTree

Abstract Medium to large phylogenetic gene trees constructed from datasets of different species density and taxonomic range are rarely topologically consistent because of missing phylogenetic signal, non-phylogenetic signal and error. In this study, we first use simulations to show that taxon sampling unequally affects nodes in a gene tree, which likely contributes to controversial conclusions from taxon sampling experiments and contradicting species phylogenies such as for the boreoeutherians. Hence, because it is unlikely that a large gene tree can be reconstructed correctly based on a single optimized dataset, we take a two-step approach for the construction of model gene trees. First, stable and unstable clades are identified by comparing phylogenetic trees inferred from multiple datasets and data types (nucleotide, amino acid, codon) from the same gene family. Subsequently, data subsets are optimized for the analysis of individual uncertain clades. Results are summarized in the form of a model tree that illustrates the evolutionary relationship of gene loci. A case study shows how a seemingly complex gene phylogeny becomes increasingly consistent with the reference species tree by attentive taxon sampling and subtree analysis. The procedure is progressively introduced to SwissTree (http://swisstree.vital-it.ch), a resource of high confidence model gene (locus) trees. Finally, we demonstrate the usefulness of SwissTree for orthology benchmarking.
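
The first step described in the abstract above, separating stable from unstable clades by comparing trees inferred from several datasets, can be illustrated with a minimal sketch. This is not the SwissTree procedure itself. It assumes rooted trees written as nested tuples over the same set of leaves, and the function names (`clades`, `clade_support`) and the toy taxa are hypothetical.

```python
from collections import Counter

def clades(tree):
    """Collect every clade (as a frozenset of leaf names) in a rooted tree.

    Assumption: a tree is a nested tuple whose leaves are strings.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found

def clade_support(trees):
    """Fraction of input trees in which each clade appears."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree))
    return {clade: n / len(trees) for clade, n in counts.items()}

# Toy example: three trees for the same four taxa, e.g. one per data type.
trees = [
    ((("human", "mouse"), "dog"), "chicken"),
    ((("human", "dog"), "mouse"), "chicken"),
    ((("human", "mouse"), "dog"), "chicken"),
]
support = clade_support(trees)
stable = {c for c, s in support.items() if s == 1.0}    # present in every tree
unstable = {c for c, s in support.items() if s < 1.0}   # candidates for subtree analysis
```

In this toy input the mammal clade is recovered by every tree, whereas the human+mouse and human+dog groupings conflict and would be the ones singled out for a dedicated, re-sampled analysis.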

Phylogenomic Discordance in the Eared Seals is best explained by Incomplete Lineage Sorting following Explosive Radiation in the Southern Hemisphere

Systematic Biology ◽

10.1093/sysbio/syaa099 ◽

2020 ◽

Author(s):

Fernando Lopes ◽

Larissa R Oliveira ◽

Amanda Kessler ◽

Yago Beux ◽

Enrique Crespo ◽

...

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Molecular Data ◽

Species Tree ◽

Gene Trees ◽

Lineage Sorting ◽

Data Types ◽

High Coverage ◽

Sea Lions ◽

The Family

Abstract The phylogeny and systematics of fur seals and sea lions (Otariidae) have long been studied with diverse data types, including an increasing amount of molecular data. However, only a few phylogenetic relationships have reached acceptance because of strong gene-tree/species-tree discordance. Divergence time estimates in the group also vary largely between studies. These uncertainties impeded the understanding of the biogeographical history of the group, such as when and how trans-equatorial dispersal and subsequent speciation events occurred. Here we used high-coverage genome-wide sequencing for 14 of the 15 species of Otariidae to elucidate the phylogeny of the family and its bearing on the taxonomy and biogeographical history. Despite extreme topological discordance among gene trees, we found a fully supported species tree that agrees with the few well-accepted relationships and establishes monophyly of the genus Arctocephalus. Our data support a relatively recent trans-hemispheric dispersal at the base of a southern clade, which rapidly diversified into six major lineages between 3 and 2.5 Ma. Otaria diverged first, followed by Phocarctos and then four major lineages within Arctocephalus. However, we found Zalophus to be non-monophyletic, with California (Z. californianus) and Steller sea lions (Eumetopias jubatus) grouping closer than the Galapagos sea lion (Z. wollebaeki), with evidence for introgression between the two genera. Overall, the high degree of genealogical discordance was best explained by incomplete lineage sorting resulting from quasi-simultaneous speciation within the southern clade, with introgression playing a subordinate role in explaining the incongruence among and within prior phylogenetic studies of the family.

Phylogenetic conflicts, combinability, and deep phylogenomics in plants

10.1101/371930 ◽

2018 ◽

Cited By ~ 1

Author(s):

Stephen A. Smith ◽

Nathanael Walker-Hale ◽

Joseph F. Walker ◽

Joseph W. Brown

Keyword(s):

Phylogenetic Relationships ◽

Phylogenetic Signal ◽

Gene Tree ◽

Species Tree ◽

Gene Trees ◽

Data Filtering ◽

Tree Inference ◽

Tree Methods ◽

Inference Methods ◽

Species Tree Inference

Abstract Studies have demonstrated that pervasive gene tree conflict underlies several important phylogenetic relationships where different species tree methods produce conflicting results. Here, we present a means of dissecting the phylogenetic signal for alternative resolutions within a dataset in order to resolve recalcitrant relationships and, importantly, identify what the dataset is unable to resolve. These procedures extend upon methods for isolating conflict and concordance involving specific candidate relationships and can be used to identify systematic error and disambiguate sources of conflict among species tree inference methods. We demonstrate these on a large phylogenomic plant dataset. Our results support the placement of Amborella as sister to the remaining extant angiosperms, Gnetales as sister to pines, and the monophyly of extant gymnosperms. Several other contentious relationships, including the resolution of relationships within the bryophytes and the eudicots, remain uncertain given the low number of supporting gene trees. To address whether concatenation of filtered genes amplified phylogenetic signal for relationships, we implemented a combinatorial heuristic to test combinability of genes. We found that nested conflicts limited the ability of data filtering methods to fully ameliorate conflicting signal amongst gene trees. These analyses confirmed that the underlying conflicting signal does not support broad concatenation of genes. Our approach provides a means of dissecting a specific dataset to address deep phylogenetic relationships while also identifying the inferential boundaries of the dataset.

Mathematical Understanding of Sequence Alignment and Phylogenetic Algorithms: A Comprehensive Review of Methods

10.21203/rs.3.rs-105281/v1 ◽

2020 ◽

Author(s):

Rashid Saif ◽

Sadia Nadeem ◽

Ali Iftekhar ◽

Alishba Khaliq ◽

Saeeda Zia

Keyword(s):

Phylogenetic Tree ◽

Sequence Alignment ◽

Phylogenetic Trees ◽

Evolutionary Relationship ◽

Biological Sequences ◽

Pairwise Sequence Alignment ◽

Phylogenetic Tree Construction ◽

Local Sequence Alignment ◽

Tree Construction ◽

Local Sequence

Abstract Context: Pairwise sequence alignment is one of the ways to arrange two biological sequences (proteins or nucleic acids) to identify regions of resemblance that may suggest a functional, structural, and/or evolutionary relationship between the sequences. There are two strategies in pairwise sequence alignment: local sequence alignment (Smith-Waterman algorithm) and global sequence alignment (Needleman-Wunsch algorithm). In local sequence alignment, two sequences that may or may not be related are aligned to find regions of local similarity within large sequences, whereas in global sequence alignment, two sequences of similar length are aligned over their full length to identify conserved regions. Similarities and divergence between biological sequences identified by sequence alignment also have to be rationalized and visualized in the form of phylogenetic trees. The phylogenetic tree construction methods are divided into distance-based and character-based methods. Evidence Acquisition: In this article, different algorithms of sequence alignment and phylogenetic tree construction were studied with examples and compared to establish the best among them and to look into the background of these methods for a better understanding of computational phylogenetics. Conclusions: Pairwise sequence alignment is a very important part of bioinformatics, used to compare biological sequences and find similarities among them. The alignment data are visualized through a phylogenetic tree diagram that shows the evolutionary history among organisms. Phylogenetic trees are constructed through various methods: some are easier but do not provide accurate evolutionary data, whereas others provide accurate evolutionary distances among organisms but are computationally expensive.
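
The difference between the two strategies named in the abstract above can be shown with a minimal dynamic-programming sketch that returns only the optimal score. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name `alignment_score` are assumptions chosen for illustration, not values taken from the review.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal pairwise alignment score.

    local=False: Needleman-Wunsch (global, end-to-end alignment).
    local=True:  Smith-Waterman (best-scoring pair of substrings).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment pays for leading gaps; local alignment starts at 0.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            score[i][j] = cell

    return best if local else score[-1][-1]

global_score = alignment_score("AC", "AGC")                       # one gap between two matches
local_score = alignment_score("TTACGTT", "GGACGGG", local=True)   # shared core "ACG"
```

The only structural differences are the boundary initialisation, the floor at zero, and where the final score is read from: the bottom-right cell for the global case, the matrix maximum for the local case.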

Four myriapod relatives – but who are sisters? No end to debates on relationships among the four major myriapod subgroups

BMC Evolutionary Biology ◽

10.1186/s12862-020-01699-0 ◽

2020 ◽

Vol 20 (1) ◽

Cited By ~ 1

Author(s):

Nikolaus U. Szucsich ◽

Daniela Bartel ◽

Alexander Blanke ◽

Alexander Böhm ◽

Alexander Donath ◽

...

Keyword(s):

Phylogenetic Relationships ◽

Phylogenetic Trees ◽

Morphological Characters ◽

Morphological Evidence ◽

Taxon Sampling ◽

Transcriptome Data ◽

Data Types ◽

Data Set ◽

Phylogenomic Study ◽

Quartet Topology

Abstract Background Phylogenetic relationships among the myriapod subgroups Chilopoda, Diplopoda, Symphyla and Pauropoda are still not robustly resolved. The first phylogenomic study covering all subgroups resolved phylogenetic relationships congruently with morphological evidence but is in conflict with most previously published phylogenetic trees based on diverse molecular data. Outgroup choice and long-branch attraction effects were stated as possible explanations for these incongruencies. In this study, we addressed these issues by extending the myriapod and outgroup taxon sampling using transcriptome data. Results We generated new transcriptome data of 42 panarthropod species, including all four myriapod subgroups and additional outgroup taxa. Our taxon sampling was complemented by published transcriptome and genome data resulting in a supermatrix covering 59 species. We compiled two data sets, the first with a full coverage of genes per species (292 single-copy protein-coding genes), the second with a less stringent coverage (988 genes). We inferred phylogenetic relationships among myriapods using different data types, tree inference, and quartet computation approaches. Our results unambiguously support monophyletic Mandibulata and Myriapoda. Our analyses clearly showed that there is strong signal for a single unrooted topology, but that the position of the internal root is sensitive to the choice of outgroups. However, we observe strong evidence for a clade Pauropoda+Symphyla, as well as for a clade Chilopoda+Diplopoda. Conclusions Our best quartet topology is incongruent with current morphological phylogenies which were supported in another phylogenomic study. AU tests and quartet mapping reject the quartet topology congruent with trees inferred with morphological characters. Moreover, quartet mapping shows that confounding signal present in the data set is sufficient to explain the weak signal for the quartet topology derived from morphological characters. Although outgroup choice affects results, our study could narrow possible trees to derivatives of a single quartet topology. For highly disputed relationships, we propose to apply a series of tests (AU and quartet mapping), since the results of such tests make it possible to narrow down possible relationships and to rule out confounding signal.

Orthology clusters from gene trees with Possvm

10.1101/2021.05.03.442399 ◽

2021 ◽

Author(s):

Xavier Grau-Bové ◽

Arnau Sebé-Pedrós

Keyword(s):

Gene Family ◽

Phylogenetic Trees ◽

Clustering Algorithm ◽

Gene Tree ◽

Gene Family Evolution ◽

Gene Trees ◽

Orthologous Genes ◽

Gene Annotations ◽

Species Overlap ◽

Markov Clustering

Possvm (Phylogenetic Ortholog Sorting with Species oVerlap and MCL) is a tool that automates the process of classifying clusters of orthologous genes from precomputed phylogenetic trees. It identifies orthology relationships between genes using the species overlap algorithm to infer taxonomic information from the gene tree topology, and then uses the Markov Clustering Algorithm (MCL) to identify orthology clusters and provide annotated gene family classifications. Our benchmarking shows that this approach, when provided with accurate phylogenies, is able to identify manually curated orthogroups with high precision and recall. Overall, Possvm automates the routine process of gene tree inspection and annotation in a highly interpretable manner, and provides reusable outputs that can be used to obtain phylogeny-informed gene annotations and inform comparative genomics and gene family evolution analyses.
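
The species overlap idea mentioned in the abstract above can be sketched in a few lines: an internal node whose two child subtrees share no species is treated as a speciation, and genes on opposite sides of it are called orthologs; a node whose children share a species is treated as a duplication. The sketch below is a simplified illustration, not Possvm's implementation. It assumes a strictly binary rooted gene tree written as nested tuples, omits the subsequent MCL clustering step, and uses hypothetical names (`orthologs_by_species_overlap`, the `species_gene` labels).

```python
from itertools import product

def orthologs_by_species_overlap(tree, species_of):
    """Return the set of orthologous gene pairs in a binary rooted gene tree.

    tree:       nested 2-tuples with gene names (strings) as leaves (assumption).
    species_of: function mapping a gene name to its species.
    """
    pairs = set()

    def walk(node):
        if isinstance(node, str):
            return [node]
        left, right = (walk(child) for child in node)
        left_species = {species_of(g) for g in left}
        right_species = {species_of(g) for g in right}
        if not (left_species & right_species):
            # No shared species: speciation node, so cross-pairs are orthologs.
            pairs.update(frozenset(p) for p in product(left, right))
        # Shared species: duplication node, cross-pairs are paralogs (not recorded).
        return left + right

    walk(tree)
    return pairs

# Toy gene family: a duplication in the human/mouse ancestor after the split from fly.
tree = ((("human_A", "mouse_A"), ("human_B", "mouse_B")), "fly_AB")
pairs = orthologs_by_species_overlap(tree, species_of=lambda gene: gene.split("_")[0])
```

Here the A and B copies are each orthologous between human and mouse, all four are orthologous to the single fly gene, and the A/B cross-pairs are paralogs. A graph of such pairs is the kind of input a clustering step would then partition into orthology groups.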

Phylogenomic relationship and evolutionary insights of sweet potato viruses from the western highlands of Kenya

PeerJ ◽

10.7717/peerj.5254 ◽

2018 ◽

Vol 6 ◽

pp. e5254 ◽

Cited By ~ 4

Author(s):

James M. Wainaina ◽

Elijah Ateka ◽

Timothy Makori ◽

Monica A. Kehoe ◽

Laura M. Boykin

Keyword(s):

Coat Protein ◽

Sweet Potato ◽

Field Isolate ◽

Gene Tree ◽

Evolutionary Relationship ◽

Sub Saharan Africa ◽

Whole Genome ◽

Gene Trees ◽

Species Trees ◽

Potato Viruses

Sweet potato is a major food security crop within sub-Saharan Africa, where 90% of Africa's production occurs. One of the major limitations of sweet potato production is viral infection. In this study, we used a combination of whole genome sequences from a field isolate obtained from Kenya and those available in GenBank. Sequences of four sweet potato viruses, Sweet potato feathery mottle virus (SPFMV), Sweet potato virus C (SPVC), Sweet potato chlorotic stunt virus (SPCSV), and Sweet potato chlorotic fleck virus (SPCFV), were obtained from the Kenyan sample. SPFMV sequences both from this study and from GenBank were found to be recombinant. Recombination breakpoints were found within the NIa-Pro, coat protein and P1 genes. The SPCSV, SPVC, and SPCFV viruses from this study were non-recombinant. Bayesian phylogenomic relationships across whole genome trees showed variation in the number of well-supported clades; within SPCSV (RNA1 and RNA2) and SPFMV two well-supported clades (I and II) were resolved. The SPCFV tree resolved three well-supported clades (I–III) while four well-supported clades were resolved in SPVC (I–IV). Similar clades were resolved within the coalescent species trees. However, there were disagreements between the clades resolved in the gene trees compared to those from the whole genome tree and coalescent species trees. The coat protein gene tree of SPCSV and SPCFV nevertheless resolved similar clades to the genome and coalescent species tree, while this was not the case in SPFMV and SPVC. In addition, we report variation in selective pressure within sites of individual genes across all four viruses; overall all viruses were under purifying selection. We report the first complete genomes of SPFMV, SPVC, SPCFV, and a partial SPCSV from Kenya as a mixed infection in one sample. Our findings provide a snapshot of the evolutionary relationship of sweet potato viruses (SPFMV, SPVC, SPCFV, and SPCSV) from Kenya and assess whether selection pressure has an effect on their evolution.

Phylogenomic Discordance in the Eared Seals is best explained by Incomplete Lineage Sorting following Explosive Radiation in the Southern Hemisphere

10.1101/2020.08.11.246108 ◽

2020 ◽

Author(s):

Fernando Lopes ◽

Larissa R. Oliveira ◽

Amanda Kessler ◽

Yago Beux ◽

Enrique Crespo ◽

...

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Molecular Data ◽

Species Tree ◽

Gene Trees ◽

Lineage Sorting ◽

Data Types ◽

High Coverage ◽

Sea Lions ◽

The Family

Abstract The phylogeny and systematics of fur seals and sea lions (Otariidae) have long been studied with diverse data types, including an increasing amount of molecular data. However, only a few phylogenetic relationships have reached acceptance because of strong gene-tree/species-tree discordance. Divergence time estimates in the group also vary largely between studies. These uncertainties impeded the understanding of the biogeographical history of the group, such as when and how trans-equatorial dispersal and subsequent speciation events occurred. Here we used high-coverage genome-wide sequencing for 14 of the 15 species of Otariidae to elucidate the phylogeny of the family and its bearing on the taxonomy and biogeographical history. Despite extreme topological discordance among gene trees, we found a fully supported species tree that agrees with the few well-accepted relationships and establishes monophyly of the genus Arctocephalus. Our data support a relatively recent trans-hemispheric dispersal at the base of a southern clade, which rapidly diversified into six major lineages between 3 and 2.5 Ma. Otaria diverged first, followed by Phocarctos and then four major lineages within Arctocephalus. However, we found Zalophus to be non-monophyletic, with California (Z. californianus) and Steller sea lions (Eumetopias jubatus) grouping closer than the Galapagos sea lion (Z. wollebaeki), with evidence for introgression between the two genera. Overall, the high degree of genealogical discordance was best explained by incomplete lineage sorting resulting from quasi-simultaneous speciation within the southern clade, with introgression playing a subordinate role in explaining the incongruence among and within prior phylogenetic studies of the family.

Consistency of SVDQuartets and Maximum Likelihood for Coalescent-Based Species Tree Estimation

Systematic Biology ◽

10.1093/sysbio/syaa039 ◽

2020 ◽

Vol 70 (1) ◽

pp. 33-48 ◽

Cited By ~ 1

Author(s):

Matthew Wascher ◽

Laura Kubatko

Keyword(s):

Maximum Likelihood ◽

Gene Tree ◽

Species Tree ◽

Gene Trees ◽

Data Types ◽

Multispecies Coalescent ◽

Consistency Results ◽

True Tree ◽

Tree Inference ◽

Statistical Consistency

Abstract Numerous methods for inferring species-level phylogenies under the coalescent model have been proposed within the last 20 years, and debates continue about the relative strengths and weaknesses of these methods. One desirable property of a phylogenetic estimator is that of statistical consistency, which means intuitively that as more data are collected, the probability that the estimated tree has the same topology as the true tree goes to 1. To date, consistency results for species tree inference under the multispecies coalescent (MSC) have been derived only for summary statistics methods, such as ASTRAL and MP-EST. These methods have been found to be consistent given true gene trees but may be inconsistent when gene trees are estimated from data for loci of finite length. Here, we consider the question of statistical consistency for four taxa for SVDQuartets for general data types, as well as for the maximum likelihood (ML) method in the case in which the data are a collection of sites generated under the MSC model such that the sites are conditionally independent given the species tree (we call these data coalescent independent sites [CIS] data). We show that SVDQuartets is statistically consistent for all data types (i.e., for both CIS data and for multilocus data), and we derive its rate of convergence. We additionally show that ML is consistent for CIS data under the JC69 model and discuss why a proof for the more general multilocus case is difficult. Finally, we compare the performance of ML and SVDQuartets using simulation for both data types. [Consistency; gene tree; maximum likelihood; multilocus data; phylogenetic inference; species tree; SVDQuartets.]

Using allele frequencies and geographic subdivision to reconstruct gene trees within a species: molecular variance parsimony.

Genetics ◽

10.1093/genetics/136.1.343 ◽

1994 ◽

Vol 136 (1) ◽

pp. 343-359

Author(s):

L Excoffier ◽

P E Smouse

Keyword(s):

Spanning Trees ◽

Gene Tree ◽

Search Space ◽

Gene Trees ◽

Data Types ◽

Molecular Variance ◽

Population Statistics ◽

Null Distributions ◽

Dna Restriction ◽

Definition Of

Abstract We formalize the use of allele frequency and geographic information for the construction of gene trees at the intraspecific level and extend the concept of evolutionary parsimony to molecular variance parsimony. The central principle is to consider a particular gene tree as a variable to be optimized in the estimation of a given population statistic. We propose three population statistics that are related to variance components and that are explicit functions of phylogenetic information. The methodology is applied in the context of minimum spanning trees (MSTs) and human mitochondrial DNA restriction data, but could be extended to accommodate other tree-making procedures, as well as other data types. We pursue optimal trees by heuristic optimization over a search space of more than 1.29 billion MSTs. This very large number of equally parsimonious trees underlines the lack of resolution of conventional parsimony procedures. This lack of resolution is highlighted by the observation that equally parsimonious trees yield very different estimates of population genetic diversity and genetic structure, as shown by null distributions of the population statistics, obtained by evaluation of 10,000 random MSTs. We propose a non-parametric test for the similarity between any two trees, based on the distribution of a weighted coevolutionary correlation. The ability to test for tree relatedness leads to the definition of a class of solutions instead of a single solution. Members of the class share virtually all of the critical internal structure of the tree but differ in the placement of singleton branch tips.

Phylogenetic Signal, Congruence, and Uncertainty across Bacteria and Archaea

Molecular Biology and Evolution ◽

10.1093/molbev/msab254 ◽

2021 ◽

Author(s):

Carolina A Martinez-Gutierrez ◽

Frank O Aylward

Keyword(s):

Phylogenetic Trees ◽

Phylogenetic Signal ◽

Phylogenetic Reconstruction ◽

Sister Group ◽

Tree Of Life ◽

Marker Genes ◽

Sequence Composition ◽

Tree Construction ◽

Taxonomic Groups ◽

The Impact

Abstract Reconstruction of the Tree of Life is a central goal in biology. Although numerous novel phyla of bacteria and archaea have recently been discovered, inconsistent phylogenetic relationships are routinely reported, and many inter-phylum and inter-domain evolutionary relationships remain unclear. Here, we benchmark different marker genes often used in constructing multidomain phylogenetic trees of bacteria and archaea and present a set of marker genes that perform best for multidomain trees constructed from concatenated alignments. We use recently-developed Tree Certainty metrics to assess the confidence of our results and to obviate the complications of traditional bootstrap-based metrics. Given the vastly disparate number of genomes available for different phyla of bacteria and archaea, we also assessed the impact of taxon sampling on multidomain tree construction. Our results demonstrate that biases between the representation of different taxonomic groups can dramatically impact the topology of resulting trees. Inspection of our highest-quality tree supports the division of most bacteria into Terrabacteria and Gracilicutes, with Thermatogota and Synergistota branching earlier from these superphyla. This tree also supports the inclusion of the Patescibacteria within the Terrabacteria as a sister group to the Chloroflexota instead of as a basal-branching lineage. For the Archaea, our tree supports three monophyletic lineages (DPANN, Euryarchaeota, and TACK/Asgard), although we note the basal placement of the DPANN may still represent an artifact caused by biased sequence composition. Our findings provide a robust and standardized framework for multidomain phylogenetic reconstruction that can be used to evaluate inter-phylum relationships and assess uncertainty in conflicting topologies of the Tree of Life.
