Phylogenomic subsampling and the search for phylogenetically reliable loci

2021 ◽  
Author(s):  
Nicolás Mongiardino Koch

Abstract Phylogenomic subsampling is a procedure by which small sets of loci are selected from large genome-scale datasets and used for phylogenetic inference. This step is often motivated either by computational limitations associated with the use of complex inference methods or by the need to test the robustness of phylogenetic results by discarding loci that are deemed potentially misleading. Although many alternative methods of phylogenomic subsampling have been proposed, little effort has gone into comparing their behavior across different datasets. Here, I calculate multiple gene properties for a range of phylogenomic datasets spanning animal, fungal and plant clades, uncovering a remarkable predictability in their patterns of covariance. I also show how these patterns provide a means for ordering loci by both their rate of evolution and their relative phylogenetic usefulness. This method of retrieving phylogenetically useful loci is found to be among the top performing when compared to alternative subsampling protocols. Relatively common approaches such as minimizing potential sources of systematic bias or increasing the clock-likeness of the data are found to fare worse than selecting loci at random. Likewise, the general utility of rate-based subsampling is found to be limited: loci evolving at both low and high rates are among the least effective, and even those evolving at optimal rates can still differ widely in usefulness. This study shows that many common subsampling approaches introduce unintended effects in off-target gene properties, and proposes an alternative multivariate method that simultaneously optimizes phylogenetic signal while controlling for known sources of bias.
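The idea of ordering loci along an axis of covariance among gene properties can be illustrated with a small sketch. This is not the author's implementation: it assumes that a principal component analysis of standardized gene properties is an acceptable stand-in for the multivariate method, and the function name `order_loci_by_pc`, the `orient_by` parameter, and the two example properties are all hypothetical.

```python
import numpy as np


def order_loci_by_pc(properties, component=0, orient_by=0):
    """Order loci by their score on one principal component.

    properties: (n_loci, n_properties) array of per-locus gene properties.
    component:  index of the principal component used for ordering.
    orient_by:  index of a property whose loading is forced to be positive,
                because the sign of a principal axis is otherwise arbitrary.
    Assumes no property is constant across loci (non-zero variance).
    """
    X = np.asarray(properties, dtype=float)
    # Standardize each property so that no single scale dominates.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # PCA via SVD of the standardized matrix; rows of Vt are the axes.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    axis = Vt[component]
    if axis[orient_by] < 0:
        axis = -axis
    scores = Z @ axis
    explained = S**2 / np.sum(S**2)
    # Highest-scoring loci come first.
    order = np.argsort(-scores)
    return order, scores, explained


# Hypothetical toy data: rows are loci, columns are two made-up properties
# (for example, a signal proxy and a support proxy).
toy = [[0.1, 10.0], [0.2, 20.0], [0.3, 30.0], [0.4, 40.0]]
order, scores, explained = order_loci_by_pc(toy, component=0, orient_by=1)
```

A subsample would then be taken from the front of `order`. Which component tracks evolutionary rate and which tracks usefulness has to be checked against the loadings for each dataset; the sketch does not decide that.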

IMA Fungus ◽  
2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Claudio G. Ametrano ◽  
Felix Grewe ◽  
Pedro W. Crous ◽  
Stephen B. Goodwin ◽  
Chen Liang ◽  
...  

Abstract Dothideomycetes is the most diverse fungal class in Ascomycota and includes species with a wide range of lifestyles. Previous multilocus studies have investigated the taxonomic and evolutionary relationships of these taxa but often failed to resolve early diverging nodes and frequently generated inconsistent placements of some clades. Here, we use a phylogenomic approach to resolve relationships in Dothideomycetes, focusing on two genera of melanized, extremotolerant rock-inhabiting fungi, Lichenothelia and Saxomyces, that have been suggested to be early diverging lineages. We assembled phylogenomic datasets from newly sequenced (4) and previously available genomes (238) of 242 taxa. We explored the influence of tree inference methods, supermatrix vs. coalescent-based species tree, and the impact of varying amounts of genomic data. Overall, our phylogenetic reconstructions provide consistent and well-supported topologies for Dothideomycetes, recovering Lichenothelia and Saxomyces among the earliest diverging lineages in the class. In addition, many of the major lineages within Dothideomycetes are recovered as monophyletic, and the phylogenomic approach implemented strongly supports their relationships. Ancestral character state reconstruction suggests that the rock-inhabiting lifestyle is ancestral within the class.


2018 ◽  
Author(s):  
Stephen A. Smith ◽  
Nathanael Walker-Hale ◽  
Joseph F. Walker ◽  
Joseph W. Brown

Abstract Studies have demonstrated that pervasive gene tree conflict underlies several important phylogenetic relationships where different species tree methods produce conflicting results. Here, we present a means of dissecting the phylogenetic signal for alternative resolutions within a dataset in order to resolve recalcitrant relationships and, importantly, identify what the dataset is unable to resolve. These procedures extend methods for isolating conflict and concordance involving specific candidate relationships and can be used to identify systematic error and disambiguate sources of conflict among species tree inference methods. We demonstrate these on a large phylogenomic plant dataset. Our results support the placement of Amborella as sister to the remaining extant angiosperms, Gnetales as sister to pines, and the monophyly of extant gymnosperms. Several other contentious relationships, including the resolution of relationships within the bryophytes and the eudicots, remain uncertain given the low number of supporting gene trees. To address whether concatenation of filtered genes amplified phylogenetic signal for relationships, we implemented a combinatorial heuristic to test combinability of genes. We found that nested conflicts limited the ability of data filtering methods to fully ameliorate conflicting signal amongst gene trees. These analyses confirmed that the underlying conflicting signal does not support broad concatenation of genes. Our approach provides a means of dissecting a specific dataset to address deep phylogenetic relationships while also identifying the inferential boundaries of the dataset.


2018 ◽  
Author(s):  
Akanksha Pandey ◽  
Edward L. Braun

Abstract Phylogenomics has revolutionized the study of evolutionary relationships. However, genome-scale data have not been able to resolve all relationships in the tree of life. This could reflect the poor fit of the models used to analyze heterogeneous datasets; that heterogeneity is likely to have many explanations. However, it seems reasonable to hypothesize that the different patterns of selection on proteins based on their structures might represent a source of heterogeneity. To test that hypothesis, we developed an efficient pipeline to divide phylogenomic datasets that comprise proteins into subsets based on secondary structure and relative solvent accessibility. We then tested whether amino acids in different structural environments had different signals for the deepest branches in the metazoan tree of life. Sites located in different structural environments did support distinct tree topologies. The most striking difference in phylogenetic signal reflected relative solvent accessibility; analyses of sites on the surface of proteins yielded a tree that placed ctenophores sister to all other animals whereas sites buried inside proteins yielded a tree with a sponge-ctenophore clade. These differences in phylogenetic signal were not ameliorated when we repeated our analyses using the site-heterogeneous CAT model, a mixture model that is often used for analyses of protein datasets. In fact, analyses using the CAT model actually resulted in rearrangements that are unlikely to represent evolutionary history. These results provide striking evidence that it will be necessary to achieve a better understanding of the constraints due to protein structure to improve phylogenetic estimation.
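Dividing an alignment by structural environment comes down to selecting columns according to a per-site label. The sketch below covers only the solvent-accessibility split and is not the authors' pipeline: the function name, the dictionary-based alignment format, and the 0.25 cutoff are assumptions made for illustration, and the per-column accessibility values are taken as given.

```python
def split_sites_by_accessibility(alignment, rsa, threshold=0.25):
    """Split alignment columns into buried and exposed sub-alignments.

    alignment: dict mapping taxon name to an aligned protein sequence.
    rsa:       per-column relative solvent accessibility, one value per column.
    threshold: hypothetical cutoff; columns below it are treated as buried.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(rsa)}:
        raise ValueError("each sequence needs exactly one residue per RSA value")

    buried_cols = [i for i, value in enumerate(rsa) if value < threshold]
    exposed_cols = [i for i, value in enumerate(rsa) if value >= threshold]

    def take(cols):
        # Keep only the requested columns, preserving their original order.
        return {taxon: "".join(seq[i] for i in cols)
                for taxon, seq in alignment.items()}

    return take(buried_cols), take(exposed_cols)


# Toy example with made-up sequences and accessibility values.
toy_alignment = {"taxon_A": "MKTL", "taxon_B": "MRTI"}
toy_rsa = [0.05, 0.60, 0.30, 0.10]
buried, exposed = split_sites_by_accessibility(toy_alignment, toy_rsa)
```

Each sub-alignment can then be analyzed separately, so that any difference in the resulting topologies can be attributed to the structural environment of the sites.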


2017 ◽  
Author(s):  
Arun N. Prasanna ◽  
Daniel Gerber ◽  
Teeratas Kijpornyongpan ◽  
M. Catherine Aime ◽  
Vinson Doyle ◽  
...  

Abstract Resolving deep divergences in the fungal tree of life remains a challenging task even for analyses of genome-scale phylogenetic datasets. Relationships between Basidiomycota subphyla, the rusts (Pucciniomycotina), smuts (Ustilaginomycotina) and mushroom-forming fungi (Agaricomycotina), represent a particularly challenging situation that has posed problems for both traditional multigene and genome-scale phylogenetic studies. Here, we address basal Basidiomycota relationships using three different phylogenomic datasets and both concatenated and gene tree-based analyses, and examine the contribution of several potential sources of uncertainty, including fast-evolving sites, putative long-branch taxa, model violation and missing data. We inferred conflicting results with different datasets and under different models. Fast-evolving sites and oversimplified models of amino acid substitution favored the grouping of smuts with mushroom-forming fungi, often leading to maximal bootstrap support in both concatenation and Astral analyses. The most conserved datasets grouped rusts with mushroom-forming fungi, although this relationship proved labile, sensitive to model choice, different data subsets and missing data. Excluding putative long-branch taxa, genes with the highest proportions of missing data and/or genes with strong signal failed to reveal a consistent trend toward one or the other topology, suggesting that additional sources of conflict are also at play. Our analyses suggest that topologies uniting smuts with mushroom-forming fungi can arise as a result of inappropriate modeling of amino acid sites that might be prone to systematic bias. While concatenated analyses yielded strong but conflicting support, individual gene trees mostly provided poor support for any resolution of rusts, smuts and mushroom-forming fungi, suggesting that the true Basidiomycota tree might be in a part of tree space that is difficult to access using both concatenation and gene tree-based approaches. Thus, basal Basidiomycota relationships remain unresolved and might represent a phylogenetic problem that remains contentious even in the genomic era.


2019 ◽  
Vol 69 (1) ◽  
pp. 17-37 ◽  
Author(s):  
Arun N Prasanna ◽  
Daniel Gerber ◽  
Teeratas Kijpornyongpan ◽  
M Catherine Aime ◽  
Vinson P Doyle ◽  
...  

Abstract Resolving deep divergences in the tree of life is challenging even for analyses of genome-scale phylogenetic data sets. Relationships between Basidiomycota subphyla, the rusts and allies (Pucciniomycotina), smuts and allies (Ustilaginomycotina), and mushroom-forming fungi and allies (Agaricomycotina) were found particularly recalcitrant both to traditional multigene and genome-scale phylogenetics. Here, we address basal Basidiomycota relationships using concatenated and gene tree-based analyses of various phylogenomic data sets to examine the contribution of several potential sources of bias. We evaluate the contribution of biological causes (hard polytomy, incomplete lineage sorting) versus unmodeled evolutionary processes and factors that exacerbate their effects (e.g., fast-evolving sites and long-branch taxa) to inferences of basal Basidiomycota relationships. Bayesian Markov Chain Monte Carlo and likelihood mapping analyses reject the hard polytomy with confidence. In concatenated analyses, fast-evolving sites and oversimplified models of amino acid substitution favored the grouping of smuts with mushroom-forming fungi, often leading to maximal bootstrap support in both concatenation and coalescent analyses. On the contrary, the most conserved data subsets grouped rusts and allies with mushroom-forming fungi, although this relationship proved labile, sensitive to model choice, to different data subsets and to missing data. Excluding putative long-branch taxa, genes with high proportions of missing data and/or with strong signal failed to reveal a consistent trend toward one or the other topology, suggesting that additional sources of conflict are at play. While concatenated analyses yielded strong but conflicting support, individual gene trees mostly provided poor support for any resolution of rusts, smuts, and mushroom-forming fungi, suggesting that the true Basidiomycota tree might be in a part of tree space that is difficult to access using both concatenation and gene tree-based approaches. Inference-based assessments of absolute model fit strongly reject best-fit models for the vast majority of genes, indicating a poor fit of even the most commonly used models. While this is consistent with previous assessments of site-homogeneous models of amino acid evolution, this does not appear to be the sole source of confounding signal. Our analyses suggest that topologies uniting smuts with mushroom-forming fungi can arise as a result of inappropriate modeling of amino acid sites that might be prone to systematic bias. We speculate that improved models of sequence evolution could shed more light on basal splits in the Basidiomycota, which, for now, remain unresolved despite the use of whole genome data.


2020 ◽  
Vol 85 (2) ◽  
pp. 272-279
Author(s):  
Mengting Gong ◽  
Xi Zhang ◽  
Yaru Wang ◽  
Guiyan Mao ◽  
Yangqi Ou ◽  
...  

ABSTRACT AGO2 is the only member of the mammalian Ago protein family that possesses catalytic activity, and it plays a central role in gene silencing. Recent studies have reported that multiple gene silencing factors, including AGO2, function in the nucleus. Understanding the molecular mechanisms by which gene silencing factors function in the nucleus helps to clarify the roles of gene silencing in pretranslational regulation of gene expression. Here, using Co-IP and GST pull-down assays, we report that AGO2 interacts with DDX21 indirectly, in an RNA-dependent manner, and that the 2 proteins form nuclear foci in immunofluorescence experiments. We found that DDX21 up-regulated the protein level of AGO2 and, through its indirect interaction with AGO2, participated in the AGO2-associated alternative splicing of the target gene SMN2, which produced different SMN2 transcripts at different expression levels. This study lays an important experimental foundation for further analysis of the nuclear functions of gene silencing components.


2021 ◽  
Author(s):  
Softya Sebastian ◽  
Swarup Roy

Genome-scale network inference is essential to understand comprehensive interaction patterns. Current methods are limited to the reconstruction of small to moderate-size networks. The most obvious alternative is to propose a novel method, or to alter existing methods, so as to leverage parallel computing paradigms. A few attempts have also been made to re-engineer existing methods by executing selected iterative steps concurrently. In this paper, we propose a generic framework that leverages parallel computing without re-engineering the original methods. The proposed framework uses state-of-the-art methods as a black box to infer sub-networks of the segmented data matrix. A simple merger based on preferential attachment was designed to generate the global network by merging the sub-networks. Fifteen (15) inference methods were considered for experimentation. Qualitative and speedup analyses were carried out using DREAM challenge networks. The proposed framework was implemented on all 15 inference methods using large expression matrices. The results were promising, as we could infer large networks in a reasonable time without compromising the qualitative aspects of the original (serial) algorithm. CLR, the top performer, was then used to infer the network from the expression profiles of an Alzheimer's disease (AD) affected mouse model consisting of 45,101 genes. We have also highlighted a few hub genes from the network that are functionally related to various diseases.


2017 ◽  
Author(s):  
Florent Mazel ◽  
Arne Mooers ◽  
Giulio Valentino Dalla Riva ◽  
Matthew W. Pennell

Abstract For decades, academic biologists have advocated for making conservation decisions in light of evolutionary history. Specifically, they suggest that policymakers should prioritize conserving phylogenetically diverse assemblages. The most prominent argument is that conserving phylogenetic diversity (PD) will also conserve diversity in traits and features (functional diversity; FD), which may be valuable for a number of reasons. The claim that PD-maximized (‘maxPD’) sets of taxa will also have high FD is often taken at face value and in cases where researchers have actually tested it, they have done so by measuring the phylogenetic signal in ecologically important functional traits. The rationale is that if traits closely mirror phylogeny, then saving the maxPD set of taxa will tend to maximize FD and if traits do not have phylogenetic structure, then saving the maxPD set of taxa will be no better at capturing FD than criteria that ignore PD. Here, we suggest that measuring the phylogenetic signal in traits is uninformative for evaluating the effectiveness of using PD in conservation. We evolve traits under several different models and, for the first time, directly compare the FD of a set of taxa that maximize PD to the FD of a random set of the same size. Under many common models of trait evolution and tree shapes, conserving the maxPD set of taxa will conserve more FD than conserving a random set of the same size. However, this result cannot be generalized to other classes of models. We find that under biologically plausible scenarios, using PD to select species can actually lead to less FD compared to a random set. Critically, this can occur even when there is phylogenetic signal in the traits. Predicting exactly when we expect using PD to be a good strategy for conserving FD is challenging, as it depends on complex interactions between tree shape and the assumptions of the evolutionary model. Nonetheless, if our goal is to maintain trait diversity, the fact that conserving taxa based on PD will not reliably conserve at least as much FD as choosing randomly raises serious concerns about the general utility of PD in conservation.


2017 ◽  
Author(s):  
Xiaofan Zhou ◽  
Sarah Lutteropp ◽  
Lucas Czech ◽  
Alexandros Stamatakis ◽  
Moritz von Looz ◽  
...  

Abstract Incongruence, or topological conflict, is prevalent in genome-scale data sets but relatively few measures have been developed to quantify it. Internode Certainty (IC) and related measures were recently introduced to explicitly quantify the level of incongruence of a given internode (or internal branch) among a set of phylogenetic trees and complement regular branch support statistics in assessing the confidence of the inferred phylogenetic relationships. Since most phylogenomic studies contain data partitions (e.g., genes) with missing taxa and IC scores stem from the frequencies of bipartitions (or splits) on a set of trees, the calculation of IC scores requires adjusting the frequencies of bipartitions from these partial gene trees. However, when the proportion of missing data is high, current approaches that adjust bipartition frequencies in partial gene trees tend to overestimate IC scores and alternative adjustment approaches differ substantially from each other in their scores. To overcome these issues, we developed three new measures for calculating internode certainty that are based on the frequencies of quartets, which naturally apply to both comprehensive and partial trees. Our comparison of these new quartet-based measures to previous bipartition-based measures on simulated data shows that: 1) on comprehensive trees, both types of measures yield highly similar IC scores; 2) on partial trees, quartet-based measures generate more accurate IC scores; and 3) quartet-based measures are more robust to the absence of phylogenetic signal and errors in the phylogenetic relationships to be assessed. Additionally, analysis of 15 empirical phylogenomic data sets using our quartet-based measures suggests that numerous relationships remain unresolved despite the availability of genome-scale data. Finally, we provide an efficient open-source implementation of these quartet-based measures in the program QuartetScores, which is freely available at https://github.com/algomaus/QuartetScores.
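For orientation, the bipartition-based IC score that the abstract contrasts with the new measures can be sketched from two counts. This assumes the commonly used entropy-style definition, in which the frequencies of the reference bipartition and of its most prevalent conflicting bipartition are normalized to sum to one; the function name and the negative-sign convention for an outvoted reference are assumptions here, and the three quartet-based measures implemented in QuartetScores are not reproduced.

```python
import math


def internode_certainty(support, conflict):
    """Bipartition-based IC from two gene tree counts.

    support:  number of trees containing the reference bipartition.
    conflict: number of trees containing its most prevalent conflicting
              bipartition.
    Returns a value of 1 when there is no conflict and 0 when both
    bipartitions are equally frequent. By the convention assumed here, the
    score is negated when the conflicting bipartition is the more frequent.
    """
    total = support + conflict
    if total <= 0:
        raise ValueError("at least one tree must support either bipartition")
    ic = 1.0
    for count in (support, conflict):
        p = count / total
        if p > 0:
            # Each term is non-positive and lowers the score toward 0.
            ic += p * math.log2(p)
    return -ic if conflict > support else ic


well_supported = internode_certainty(90, 10)   # about 0.53
evenly_split = internode_certainty(50, 50)     # 0.0
```

With partial gene trees the two counts are not directly observable, which is the adjustment problem the abstract describes and the motivation for counting quartets instead.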

