An investigation of irreproducibility in maximum likelihood phylogenetic inference

2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Xing-Xing Shen ◽  
Yuanning Li ◽  
Chris Todd Hittinger ◽  
Xue-xin Chen ◽  
Antonis Rokas

Abstract: Phylogenetic trees are essential for studying biology, but their reproducibility under identical parameter settings remains unexplored. Here, we find that 3515 (18.11%) IQ-TREE-inferred and 1813 (9.34%) RAxML-NG-inferred maximum likelihood (ML) gene trees are topologically irreproducible when executing two replicates (Run1 and Run2) for each of 19,414 gene alignments in 15 animal, plant, and fungal phylogenomic datasets. Notably, coalescent-based ASTRAL species phylogenies inferred from Run1 and Run2 sets of individual gene trees are topologically irreproducible for 9/15 phylogenomic datasets, whereas concatenation-based phylogenies inferred twice from the same supermatrix are reproducible. Our simulations further show that irreproducible phylogenies are more likely to be incorrect than reproducible phylogenies. These results suggest that a considerable fraction of single-gene ML trees may be irreproducible. Increasing reproducibility in ML inference will benefit from providing analyses’ log files, which contain typically reported parameters (e.g., program, substitution model, number of tree searches) but also typically unreported ones (e.g., random starting seed number, number of threads, processor type).
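To make "topologically irreproducible" concrete, the following is a minimal sketch, not the authors' pipeline, of one common way to test whether two replicate trees share a topology: compare their sets of non-trivial leaf bipartitions, ignoring branch lengths and rooting. The function names (`parse_newick`, `bipartitions`, `same_topology`) and the example Newick strings are hypothetical, and the parser assumes simple unquoted labels. The last lines recompute the percentages reported in the abstract from its own counts.

```python
# Illustrative sketch only: names and example trees are assumptions,
# not taken from the study.

def parse_newick(text):
    """Parse a Newick string into nested tuples of leaf names.

    Branch lengths and internal labels (e.g. support values) are dropped.
    Assumes unquoted labels without whitespace or parentheses.
    """
    s = "".join(text.split()).rstrip(";")
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos].split(":")[0]

    def read_node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1                      # consume "("
            children = [read_node()]
            while s[pos] == ",":
                pos += 1                  # consume ","
                children.append(read_node())
            pos += 1                      # consume ")"
            read_label()                  # skip internal label / branch length
            return tuple(children)
        return read_label()

    return read_node()


def bipartitions(tree):
    """Return (leaf set, set of non-trivial splits) for a parsed tree."""
    leaves = set()
    clades = []

    def walk(node):
        if isinstance(node, str):
            leaves.add(node)
            return frozenset([node])
        clade = frozenset().union(*[walk(child) for child in node])
        clades.append(clade)
        return clade

    walk(tree)
    all_leaves = frozenset(leaves)
    splits = set()
    for clade in clades:
        rest = all_leaves - clade
        if len(clade) > 1 and len(rest) > 1:
            # Store the split as an unordered pair so rooting does not matter.
            splits.add(frozenset([clade, rest]))
    return all_leaves, splits


def same_topology(newick_a, newick_b):
    """True if both trees have the same leaves and the same splits."""
    return bipartitions(parse_newick(newick_a)) == bipartitions(parse_newick(newick_b))


# Hypothetical Run1/Run2 trees for one gene alignment.
run1 = "((A:0.1,B:0.2):0.05,(C:0.1,D:0.3):0.05,E:0.2);"
run2 = "(E,(D,C),(B,A));"        # same topology, different order and lengths
run3 = "((A,C),(B,D),E);"        # different topology
print(same_topology(run1, run2), same_topology(run1, run3))

# Recompute the reported percentages from the counts given in the abstract.
n_alignments = 19_414
for program, n_irreproducible in [("IQ-TREE", 3515), ("RAxML-NG", 1813)]:
    print(program, round(100 * n_irreproducible / n_alignments, 2))
```

Comparing split sets treats the trees as unrooted, so two outputs that differ only in child order, rooting, or branch lengths count as the same topology.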

2019 ◽  
Author(s):  
Xiaodong Jian ◽  
Scott V. Edwards ◽  
Liang Liu

Abstract: A statistical framework of model comparison and model validation is essential to resolving the debates over concatenation and coalescent models in phylogenomic data analysis. A set of statistical tests is here applied and developed to evaluate and compare the adequacy of substitution, concatenation, and multispecies coalescent (MSC) models across 47 phylogenomic data sets collected across the tree of life. Tests for substitution models and the concatenation assumption of topologically concordant gene trees suggest that a poor fit of substitution models (44% of loci rejecting the substitution model) and concatenation models (38% of loci rejecting the hypothesis of topologically congruent gene trees) is widespread. Logistic regression shows that the proportions of GC content and informative sites are both negatively correlated with the fit of substitution models across loci. Moreover, a substantial violation of the concatenation assumption of congruent gene trees is consistently observed across 6 major groups (birds, mammals, fish, insects, reptiles, and others, including other invertebrates). In contrast, among those loci adequately described by a given substitution model, the proportion of loci rejecting the MSC model is 11%, significantly lower than those rejecting the substitution and concatenation models, and Bayesian model comparison strongly favors the MSC over concatenation across all data sets. Species tree inference suggests that loci rejecting the MSC have little effect on species tree estimation. Due to computational constraints, the Bayesian model validation and comparison analyses were conducted on the reduced data sets. A complete analysis of phylogenomic data requires the development of efficient algorithms for phylogenetic inference. Nevertheless, the concatenation assumption of congruent gene trees rarely holds for phylogenomic data with more than 10 loci. Thus, for large phylogenomic data sets, model comparison analyses are expected to consistently and more strongly favor the coalescent model over the concatenation model. Our analysis reveals the value of model validation and comparison in phylogenomic data analysis, as well as the need for further improvements of multilocus models and computational tools for phylogenetic inference.


2019 ◽  
Author(s):  
Angie M. Macias ◽  
David M. Geiser ◽  
Jason E. Stajich ◽  
Piotr Łukasik ◽  
Claudio Veloso ◽  
...  

Abstract: The fungal genus Massospora (Zoopagomycota: Entomophthorales) includes more than a dozen obligate, sexually transmissible pathogenic species that infect cicadas (Hemiptera) worldwide. At least two species are known to produce psychoactive compounds during infection, which has garnered considerable interest for this enigmatic genus. As with many Entomophthorales, the evolutionary relationships and host associations of Massospora spp. are not well understood. The acquisition of M. diceroproctae from Arizona, M. tettigatis from Chile, and M. platypediae from California and Colorado provided an opportunity to conduct molecular phylogenetic analyses and morphological studies to investigate if these fungi represent a monophyletic group and delimit species boundaries. In a three-locus phylogenetic analysis including the D1–D2 domains of the nuclear 28S rRNA gene (28S), elongation factor 1 alpha-like (EFL), and beta-tubulin (BTUB), Massospora was resolved in a strongly supported monophyletic group containing four well-supported genealogically exclusive lineages, based on two of three methods of phylogenetic inference. There was incongruence among the single-gene trees: two methods of phylogenetic inference recovered trees with either the same topology as the 3-gene concatenated tree (EFL), or a basal polytomy (28S, BTUB). Massospora levispora and M. platypediae isolates formed a single lineage in all analyses and are synonymized here as M. levispora. Massospora diceroproctae was sister to M. cicadina in all three single-gene trees and on an extremely long branch relative to the other Massospora, and even the outgroup taxa, which may reflect an accelerated rate of molecular evolution and/or incomplete taxon sampling. The results of the morphological study presented here indicate that spore measurements may not be phylogenetically or diagnostically informative. Despite recent advances in understanding the ecology of Massospora, much about its host range and diversity remains unexplored. The emerging phylogenetic framework can provide a foundation for exploring co-evolutionary relationships with cicada hosts and the evolution of behavior-altering compounds.


2021 ◽  
Author(s):  
Bryan Thornlow ◽  
Cheng Ye ◽  
Nicola De Maio ◽  
Jakob McBroome ◽  
Angie S. Hinrichs ◽  
...  

Abstract: Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 datasets do not fit this mould. There are currently over 5 million sequenced SARS-CoV-2 genomes in public databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an “online” approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between Likelihood and Parsimony approaches to phylogenetic inference. Maximum Likelihood (ML) methods are more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare. Therefore, it may be that approaches based on Maximum Parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger datasets. Here, we evaluate the performance of de novo and online phylogenetic approaches, and ML and MP frameworks, for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimizations produce more accurate SARS-CoV-2 phylogenies than do ML optimizations. Since MP is thousands of times faster than presently available implementations of ML and online phylogenetics is faster than de novo, we therefore propose that, in the context of comprehensive genomic epidemiology of SARS-CoV-2, MP online phylogenetics approaches should be favored.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e4783 ◽  
Author(s):  
Yuanmeng Miles Zhang ◽  
Julia Stigenberg ◽  
Jacqueline Hope Meyer ◽  
Barbara Jo-Anne Sharanowski

Background: Parasitic wasps in the family Braconidae are important regulators of insect pests, particularly in forest and agroecosystems. Within Braconidae, wasps in the tribe Euphorini (Euphorinae) attack economically damaging plant bugs (Miridae) that are major pests of field and vegetable crops. However, the evolutionary relationships of this tribe have been historically problematic. Most generic concepts have been based on ambiguous morphological characters which often leads to misidentification, complicating their use in biological control. Methods: Using a combination of three genes (COI, 28S, and CAD) and 80 taxa collected worldwide, we conducted Bayesian inference using MrBayes, and maximum likelihood analyses using RAxML and IQ-Tree on individual gene trees as well as the concatenated dataset. Results: The monophyly of the tribe Euphorini and the two genera Peristenus and Leiophron were confirmed using maximum likelihood and Bayesian inference. The subgeneric classifications of Leiophron sensu lato were not supported, and the monotypic genus Mama was also not supported. Discussion: Euphoriella, Euphoriana, Euphorus, and Mama syn. n. have been synonymized under Leiophron. Mama mariae syn. n. was placed as a junior synonym of Leiophron reclinator. The generic concepts of Peristenus and Leiophron were refined to reflect the updated phylogeny. Further, we discuss the need for revising Euphorini given the number of undescribed species within the tribe.


2015 ◽  
Vol 112 (7) ◽  
pp. 2058-2063 ◽  
Author(s):  
Marc Hellmuth ◽  
Nicolas Wieseke ◽  
Marcus Lechner ◽  
Hans-Peter Lenhof ◽  
Martin Middendorf ◽  
...  

Phylogenomics heavily relies on well-curated sequence data sets that comprise, for each gene, exclusively 1:1 orthologs. Paralogs are treated as a dangerous nuisance that has to be detected and removed. We show here that this severe restriction of the data sets is not necessary. Building upon recent advances in mathematical phylogenetics, we demonstrate that gene duplications convey meaningful phylogenetic information and allow the inference of plausible phylogenetic trees, provided orthologs and paralogs can be distinguished with a degree of certainty. Starting from tree-free estimates of orthology, cograph editing can sufficiently reduce the noise to find correct event-annotated gene trees. The information of gene trees can then directly be translated into constraints on the species trees. Although the resolution is very poor for individual gene families, we show that genome-wide data sets are sufficient to generate fully resolved phylogenetic trees, even in the presence of horizontal gene transfer.


Author(s):  
David A. Spade

Abstract: Maximum likelihood is a common method of estimating a phylogenetic tree based on a set of genetic data. However, models of evolution for certain types of genetic data are highly flawed in their specification, and this misspecification can have an adverse impact on phylogenetic inference. Our attention here is focused on extending an existing class of models for estimating phylogenetic trees from discrete morphological characters. The main advance of this work is a model that allows unequal equilibrium frequencies in the estimation of phylogenetic trees from discrete morphological character data using likelihood methods. Possible extensions of the proposed model will also be discussed.


2021 ◽  
Author(s):  
Anna Cho ◽  
Denis V. Tikhonenkov ◽  
Elisabeth Hehenberger ◽  
Anna Karnkowska ◽  
Patrick J. Keeling

Stramenopiles are a diverse but relatively well-studied eukaryotic supergroup with considerable genomic information available (Sibbald and Archibald, 2017). Nevertheless, the relationships between major stramenopile subgroups remain unresolved, in part due to a lack of data from small nanoflagellates that make up much of the genetic diversity of the group. This is most obvious in Bigyromonadea, which is one of four major stramenopile subgroups but is represented by a single transcriptome. To examine the diversity of Bigyromonadea and how the lack of data affects the tree, we generated transcriptomes from seven novel bigyromonad species described in this study: Develocauda condao, Develocanicus komovi, Develocanicus vyazemskyi, Cubaremonas variflagellatum, Pirsonia chemainus, Feodosia pseudopoda, and Koktebelia satura. Both maximum likelihood and Bayesian phylogenomic trees based on a 247-gene matrix recovered a monophyletic Bigyromonadea that includes two diverse subgroups, Developea and Pirsoniales, that were not previously related based on single gene trees. Maximum likelihood analyses show Bigyromonadea related to oomycetes, whereas Bayesian analyses and topology testing were inconclusive. We observed similarities between the novel bigyromonad species and motile zoospores of oomycetes in morphology and the ability to self-aggregate. Rare formation of pseudopods and fused cells was also observed, traits that are also found in members of labyrinthulomycetes, another group of osmotrophic stramenopiles. Furthermore, we report the first case of eukaryovory in the flagellated stages of Pirsoniales. These analyses reveal new diversity of Bigyromonadea, and altogether suggest that their monophyly with oomycetes, collectively known as Pseudofungi, is the most likely topology of the stramenopile tree.


Plant Disease ◽  
2020 ◽  
Author(s):  
Madison Julia McCulloch ◽  
Shanice Edwards ◽  
Harrison Inocencio ◽  
Franklin Machado ◽  
Etta Nuckles ◽  
...  

Fungi in the genus Colletotrichum cause apple, blueberry, and strawberry fruit rots, which can result in significant losses. Accurate identification is important because species differ in aggressiveness, fungicide sensitivity, and other factors affecting management. Multiple Colletotrichum species can cause similar symptoms on the same host, while more than one fruit type can be infected by a single Colletotrichum species. Mixed-fruit orchards may facilitate cross-infection, with significant management implications. Colletotrichum isolates from small fruits in Kentucky orchards were characterized and compared with apple isolates by using a combination of morphotyping, sequencing of voucher loci and whole genomes, and cross-inoculation assays. Seven morphotypes representing two species complexes (C. acutatum and C. gloeosporioides) were identified. Morphotypes corresponded with phylogenetic species C. fioriniae, C. fructicola, C. nymphaeae, and C. siamense, identified by TUB2 and GAPDH barcodes. Phylogenetic trees built from nine single gene sequences matched barcoding results with one exception, later determined to belong to an undescribed species. Comparison of single gene trees with representative whole genome sequences revealed that CHS and ApMat were the most informative for diagnosis of fruit rot species and individual morphotypes within the C. acutatum or C. gloeosporioides complexes, respectively. All blueberry isolates belonged to C. fioriniae, and most strawberry isolates were C. nymphaeae, with a few C. siamense and C. fioriniae also recovered. All three species cause fruit rot on apples in Kentucky. Cross-inoculation assays on detached apple, blueberry, and strawberry fruits showed that all species were pathogenic on all three hosts, but with species-specific differences in aggressiveness.


2020 ◽  
Vol 18 ◽  
Author(s):  
Yin Yueqi ◽  
Zhou Ying ◽  
Lu Jing ◽  
Guo Hongxiong ◽  
Chen Jianshuang ◽  
...  

Background: CRF01_AE and CRF07_BC are the two major HIV-1 virus strains circulating in China. The proportion of dominant subtypes (CRF01_AE and CRF07_BC) among MSM in Jiangsu province was over 80%. A large number of URFs have been found in China in recent years. Objective: This study aimed to report on novel HIV-1 recombinants. Method: We constructed phylogenetic trees using the maximum likelihood (ML) method with 1000 bootstrap replicates in IQ-TREE 1.6.8 software and determined recombination breakpoints using SimPlot 3.5.1. Results: We identified a novel, second-generation HIV-1 recombinant (JS020202) between CRF01_AE and CRF07_BC. Analysis of the near full-length genome (NFLG) showed at least 8 breakpoints within the viral genome, a pattern that differed from any previously identified CRF or URF around the world. Conclusion: The novel and diverse CRF01_AE/07_BC recombinants point to the growing complexity of HIV-1 genetics. The emergence of diverse recombinant strains should be monitored continuously.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Juan C. Muñoz-Escalante ◽  
Andreu Comas-García ◽  
Sofía Bernal-Silva ◽  
Daniel E. Noyola

Abstract: Respiratory syncytial virus (RSV) is a major cause of respiratory infections and is classified in two main groups, RSV-A and RSV-B, with multiple genotypes within each of them. For RSV-B, more than 30 genotypes have been described, without consensus on their definition. The lack of genotype assignment criteria has a direct impact on the understanding of viral evolution and on the development of viral detection methods and vaccine design. Here we analyzed the totality of complete RSV-B G gene ectodomain sequences published in GenBank until September 2018 (n = 2190), including 478 complete genome sequences, using maximum likelihood and Bayesian phylogenetic analyses, as well as intergenotypic and intragenotypic distance matrices, in order to generate a systematic genotype assignment. Individual RSV-B genes were also assessed using maximum likelihood phylogenetic analyses, and multiple sequence alignments were used to identify molecular markers associated with specific genotypes. Analyses of the complete G gene ectodomain region, sequence clustering patterns, and the presence of molecular markers of each individual gene indicate that the 37 previously described genotypes can be classified into fifteen distinct genotypes: BA, BA-C, BA-CC, CB1-THB, GB1-GB4, GB6, JAB1-NZB2, SAB1, SAB2, SAB4, URU2 and a novel early circulating genotype characterized in the present study and designated GB0.

