Online Phylogenetics using Parsimony Produces Slightly Better Trees and is Dramatically More Efficient for Large SARS-CoV-2 Phylogenies than de novo and Maximum-Likelihood Approaches

2021 ◽  
Author(s):  
Bryan Thornlow ◽  
Cheng Ye ◽  
Nicola De Maio ◽  
Jakob McBroome ◽  
Angie S. Hinrichs ◽  
...  

Abstract: Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 datasets do not fit this mould. There are currently over 5 million sequenced SARS-CoV-2 genomes in public databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an “online” approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between Likelihood and Parsimony approaches to phylogenetic inference. Maximum Likelihood (ML) methods are more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare. Therefore, it may be that approaches based on Maximum Parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger datasets. Here, we evaluate the performance of de novo and online phylogenetic approaches, and ML and MP frameworks, for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimizations produce more accurate SARS-CoV-2 phylogenies than do ML optimizations. Since MP is thousands of times faster than presently available implementations of ML and online phylogenetics is faster than de novo, we therefore propose that, in the context of comprehensive genomic epidemiology of SARS-CoV-2, MP online phylogenetics approaches should be favored.
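
To give a feel for why parsimony scoring is computationally light, the following is a minimal sketch of Fitch's small-parsimony count for a single alignment site on a fixed rooted binary tree. It is an illustration only: the abstract does not describe the authors' implementation, and the tree shape, sample names (`s1` to `s4`), and bases below are invented for the example.

```python
def fitch_score(tree, states):
    """Minimum number of base changes needed at one site on a fixed tree.

    tree:   nested 2-tuples; a leaf is a sample name (str).
    states: dict mapping each sample name to its observed base.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: the observed base
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                         # children agree: no change needed
            return common
        changes += 1                       # children disagree: one change
        return left | right

    visit(tree)
    return changes


# Hypothetical four-sample tree and one site; only s3 carries a G.
tree = (("s1", "s2"), ("s3", "s4"))
site = {"s1": "A", "s2": "A", "s3": "G", "s4": "A"}
print(fitch_score(tree, site))  # 1
```

The score is obtained in a single pass over the tree using set intersections and unions. With densely sampled genomes, most sites change at most once along any branch, which is the situation in which the abstract expects parsimony to be sufficiently accurate.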

2002 ◽  
Vol 33 (4) ◽  
pp. 361-386 ◽  
Author(s):  
Vest Pedersen

Abstract: The phylogenetics of 40 taxa of European bumblebees were analysed based on PCR-amplified and directly sequenced DNA from one region of the mitochondrial gene Cytochrome Oxidase I (1046 bp) and for 26 taxa from two regions in the nuclear gene Elongation Factor 1α (1056 bp). The sequences were aligned to the corresponding sequences in the honey bee. Phylogenetic analyses based on parsimony, as well as maximum likelihood, indicate that the bumblebees can be separated into several well-supported clades. Most of the terminal clades correspond very well with the clades known from former phylogenetic analyses based on morphology and recognized as the subgenera: Mendacibombus, Confusibombus, Psithyrus, Thoracobombus, Megabombus, Rhodobombus, Kallobombus, Alpinobombus, Subterraneobombus, Alpigenobombus, Pyrobombus, Bombus and Melanobombus. All the cuckoo bumblebees form a well-supported clade, the subgenus Psithyrus, within the true bumblebees. All the analyses place Kallobombus as the most basal taxon, in contradiction to former analyses. The other deeper nodes of the phylogenetic trees, which are weakly supported, deviate significantly from formerly published trees - especially the trees based on mtCO-I. Presumably, the reasons are that multiple hits and the strong bias of the bases A and T blur the relationships in the deepest part of the trees. Analyses of the region in mtCO-I show a very strong A+T bias (A+T = 75%), which also indicates preferences in the use of codons with A or T in third positions. In closely related entities, there is only a weak transversion bias (A+T). In the studied regions in EF 1-α, no nucleotide bias is observed. The observed differences in bases between the investigated taxa are relatively small and the gene is too conserved to solve all the questions that the analyses of the deeper nodes using mtCO-I raise.


Author(s):  
Xianding Deng ◽  
Wei Gu ◽  
Scot Federman ◽  
Louis du Plessis ◽  
Oliver G. Pybus ◽  
...  

Abstract: The COVID-19 pandemic caused by the novel coronavirus SARS-CoV-2 has spread globally, resulting in >300,000 reported cases worldwide as of March 21st, 2020. Here we investigate the genetic diversity and genomic epidemiology of SARS-CoV-2 in Northern California using samples from returning travelers, cruise ship passengers, and cases of community transmission with unclear infection sources. Virus genomes were sampled from 29 patients diagnosed with COVID-19 infection from Feb 3rd through Mar 15th. Phylogenetic analyses revealed at least 8 different SARS-CoV-2 lineages, suggesting multiple independent introductions of the virus into the state. Virus genomes from passengers on two consecutive excursions of the Grand Princess cruise ship clustered with those from an established epidemic in Washington State, including the WA1 genome representing the first reported case in the United States on January 19th. We also detected evidence for presumptive transmission of SARS-CoV-2 lineages from one community to another. These findings suggest that cryptic transmission of SARS-CoV-2 in Northern California to date is characterized by multiple transmission chains that originate via distinct introductions from international and interstate travel, rather than widespread community transmission of a single predominant lineage. Rapid testing and contact tracing, social distancing, and travel restrictions are measures that will help to slow SARS-CoV-2 spread in California and other regions of the USA.


2021 ◽  
Vol 9 (12) ◽  
pp. 2609
Author(s):  
Atia Basheer ◽  
Imran Zahoor

The present study aims to investigate the genomic variability and epidemiology of SARS-CoV-2 in Pakistan along with its role in the spread and severity of infection during the three waves of COVID-19. A total of 453 genomic sequences of Pakistani SARS-CoV-2 were retrieved from GISAID and subjected to MAFFT-based alignment and a QC check, which resulted in the removal of 53 samples. The remaining 400 samples were subjected to Pangolin-based genomic lineage identification. To infer our SARS-CoV-2 time-scaled and divergence phylogenetic trees, 3804 selected global reference sequences plus 400 Pakistani samples were used for the Nextstrain analysis, with Wuhan/Hu-1/2019 as the reference genome. Finally, a maximum likelihood-based phylogenetic tree was built using Nextstrain, and a coverage map was created using Nextclade. Using the amino acid substitutions, maximum likelihood phylogenetic trees were developed for each wave separately. Our results reveal the circulation of 29 lineages, belonging to the following seven clades: G, GH, GR, GRY, L, O, and S in the three waves. From the first wave, 16 genomic lineages of SARS-CoV-2 were identified, with B.1 (24.7%), B.1.36 (18.8%), and B.1.471 (18.8%) as the most prevalent lineages, respectively. The second wave data showed 18 lineages, 10 of which overlapped with the first wave, suggesting that those variants could not be contained during the first wave. In this wave, a new lineage, AE.4, was reported from Pakistan for the very first time in the world. However, B.1.36 (17.8%), B.1.36.31 (11.9%), B.1.1.7 (8.5%), and B.1.1.1 (5.9%) were the major lineages in the second wave. Third wave data showed the presence of nine lineages, with Alpha/B.1.1.7 (72.7%), Beta/B.1.351 (12.99%), and Delta/B.1.617.2 (10.39%) as the most predominant variants. It is suggested that these VOCs should be contained at the earliest in order to prevent any devastating outbreak of SARS-CoV-2 in the country.


2007 ◽  
Vol 20 (4) ◽  
pp. 287 ◽  
Author(s):  
Michael J. Sanderson

Broad availability of molecular sequence data allows construction of phylogenetic trees with 1000s or even 10 000s of taxa. This paper reviews methodological, technological and empirical issues raised in phylogenetic inference at this scale. Numerous algorithmic and computational challenges have been identified surrounding the core problem of reconstructing large trees accurately from sequence data, but many other obstacles, both upstream and downstream of this step, are less well understood. Before phylogenetic analysis, data must be generated de novo or extracted from existing databases, compiled into blocks of homologous data with controlled properties, aligned, examined for the presence of gene duplications or other kinds of complicating factors, and finally, combined with other evidence via supermatrix or supertree approaches. After phylogenetic analysis, confidence assessments are usually reported, along with other kinds of annotations, such as clade names, or annotations requiring additional inference procedures, such as trait evolution or divergence time estimates. Prospects for partial automation of large-tree construction are also discussed, as well as risks associated with ‘outsourcing’ phylogenetic inference beyond the systematics community.


2019 ◽  
Author(s):  
Daniel Edler ◽  
Johannes Klein ◽  
Alexandre Antonelli ◽  
Daniele Silvestro

Abstract: RaxmlGUI is a graphical user interface to RAxML, one of the most popular and widely used software for phylogenetic inference using maximum likelihood. Here we present raxmlGUI 2.0, a complete rewrite of the GUI which seamlessly integrates RAxML binaries for all major operating systems with an intuitive graphical front-end to set up and run phylogenetic analyses. Our program offers automated pipelines for analyses that require multiple successive calls of RAxML and built-in functions to concatenate alignment files while automatically specifying the appropriate partition settings. In addition to RAxML 8.x, raxmlGUI 2.0 also supports the new RAxML Next Generation. RaxmlGUI facilitates phylogenetic analyses by coupling an intuitive interface with the unmatched performance of RAxML.


2021 ◽  
Author(s):  
Gunnar Stoddard ◽  
Allison Black ◽  
Patrick Ayscue ◽  
Dan Lu ◽  
Jack Kamm ◽  
...  

Abstract: During the COVID-19 pandemic within the United States, much of the responsibility for diagnostic testing and epidemiologic response has relied on the action of county-level departments of public health. Here we describe the integration of genomic surveillance into epidemiologic response within Humboldt County, a rural county in northwest California. Through a collaborative effort, 853 whole SARS-CoV-2 genomes were generated, representing ∼58% of the 1,449 SARS-CoV-2-positive cases detected in Humboldt County as of mid-March 2021. Phylogenetic analysis of these data was used to develop a comprehensive understanding of SARS-CoV-2 introductions to the county and to support contact tracing and epidemiologic investigations of all large outbreaks in the county. In the case of an outbreak on a commercial farm, viral genomic data were used to validate reported epidemiologic links and link additional cases within the community who did not report a farm exposure to the outbreak. During a separate outbreak within a skilled nursing facility, genomic surveillance data were used to rule out the putative index case, detect the emergence of an independent Spike:N501Y substitution, and verify that the outbreak had been brought under control. These use cases demonstrate how developing genomic surveillance capacity within local public health departments can support timely and responsive deployment of genomic epidemiology for surveillance and outbreak response based on local needs and priorities.


2015 ◽  
Author(s):  
Lucas Czech ◽  
Jaime Huerta-Cepas ◽  
Alexandros Stamatakis

Abstract: Phylogenetic trees are routinely visualized to present and interpret the evolutionary relationships of species. Virtually all empirical evolutionary data studies contain a visualization of the inferred tree with branch support values. Ambiguous semantics in tree file formats can lead to erroneous tree visualizations and therefore to incorrect interpretations of phylogenetic analyses. Here, we discuss problems that can and do arise when displaying branch values on trees after re-rooting. Branch values are typically stored as node labels in the widely-used Newick tree format. However, such values are attributes of branches. Storing them as node labels can therefore yield errors when re-rooting trees. This depends on the mostly implicit semantics that tools deploy to interpret node labels. We reviewed 10 tree viewers and 10 bioinformatics toolkits that can display and re-root trees. We found that 14 out of 20 of these tools do not permit users to select the semantics of node labels. Thus, unaware users might obtain incorrect results when rooting trees inferred by common phylogenetic inference programs. We illustrate such incorrect mappings for several test cases and real examples taken from the literature. This review has already led to improvements and workarounds in 8 of the tested tools. We suggest tools should provide an option that explicitly forces users to define the semantics of node labels.
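
The re-rooting problem described in this abstract can be reproduced on a toy tree. The sketch below is a hedged illustration, not code from any of the reviewed tools: the tree `((A,B)X,(C,(D,E)Z)Y)R`, the support values, and all function names are made up. It assumes the common reading in which a node label is the support of the branch directly above that node.

```python
def reroot(parent, new_root):
    """Reverse the parent pointers on the path from new_root up to the old root."""
    new_parent = dict(parent)
    new_parent.pop(new_root, None)        # the new root has no parent
    child, node = new_root, parent.get(new_root)
    while node is not None:
        nxt = parent.get(node)
        new_parent[node] = child          # flip this edge
        child, node = node, nxt
    return new_parent


def branch_support(parent, labels):
    """Read each node label as the support of the branch above that node."""
    return {frozenset((c, p)): labels[c] for c, p in parent.items() if c in labels}


def labels_from_support(parent, support):
    """Write branch supports back as labels on the child end of each branch."""
    return {c: support[frozenset((c, p))]
            for c, p in parent.items() if frozenset((c, p)) in support}


# Hypothetical tree ((A,B)X,(C,(D,E)Z)Y)R stored as a child -> parent map.
parent = {"A": "X", "B": "X", "X": "R", "C": "Y", "Z": "Y",
          "D": "Z", "E": "Z", "Y": "R"}
labels = {"X": 95, "Y": 70, "Z": 60}      # made-up support values

true_support = branch_support(parent, labels)   # branch Y-Z has support 60
rerooted = reroot(parent, "Z")

# Naive: keep the labels on the same nodes after re-rooting.
naive = branch_support(rerooted, labels)
# Branch-aware: carry supports as branch attributes, then relabel the nodes.
fixed_labels = labels_from_support(rerooted, true_support)

print(naive[frozenset(("Y", "Z"))])   # 70: the value has moved to the wrong branch
print(fixed_labels)                   # {'X': 95, 'Y': 60, 'R': 70}
```

In the naive case the branch between Y and Z inherits the value 70, which belonged to the branch between R and Y, while the R-Y branch is left without a value. Treating the values as attributes of branches keeps every value on its branch regardless of where the root is placed.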


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Xing-Xing Shen ◽  
Yuanning Li ◽  
Chris Todd Hittinger ◽  
Xue-xin Chen ◽  
Antonis Rokas

Abstract: Phylogenetic trees are essential for studying biology, but their reproducibility under identical parameter settings remains unexplored. Here, we find that 3515 (18.11%) IQ-TREE-inferred and 1813 (9.34%) RAxML-NG-inferred maximum likelihood (ML) gene trees are topologically irreproducible when executing two replicates (Run1 and Run2) for each of 19,414 gene alignments in 15 animal, plant, and fungal phylogenomic datasets. Notably, coalescent-based ASTRAL species phylogenies inferred from Run1 and Run2 sets of individual gene trees are topologically irreproducible for 9/15 phylogenomic datasets, whereas concatenation-based phylogenies inferred twice from the same supermatrix are reproducible. Our simulations further show that irreproducible phylogenies are more likely to be incorrect than reproducible phylogenies. These results suggest that a considerable fraction of single-gene ML trees may be irreproducible. Increasing reproducibility in ML inference will benefit from providing analyses’ log files, which contain typically reported parameters (e.g., program, substitution model, number of tree searches) but also typically unreported ones (e.g., random starting seed number, number of threads, processor type).
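
The abstract does not state how the two runs were compared. One common way to decide whether two trees share a topology is to compare their sets of non-trivial bipartitions (splits), which is the basis of the Robinson-Foulds distance. The sketch below assumes that approach; the nested-tuple trees and taxon names are invented for illustration.

```python
def leaf_set(node):
    """All leaf names below a node (a leaf is a str, an internal node a tuple)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def splits(tree):
    """Non-trivial bipartitions of the leaf set induced by internal branches."""
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        other = all_leaves - clade
        if len(clade) > 1 and len(other) > 1:   # skip trivial splits
            found.add(frozenset((clade, other)))
        for child in node:
            visit(child)

    visit(tree)
    return found


# Hypothetical gene trees from two replicate runs on the same alignment.
run1 = (("a", "b"), ("c", ("d", "e")))
run2 = (("a", "c"), ("b", ("d", "e")))
same_as_run1 = ("a", ("b", ("c", ("d", "e"))))   # same topology, different rooting

print(splits(run1) == splits(run2))          # False: topologies differ
print(splits(run1) == splits(same_as_run1))  # True: rooting does not matter
print(len(splits(run1) ^ splits(run2)))      # 2 splits are not shared
```

Because each split is stored as an unordered pair of leaf sets, the comparison ignores the position of the root and the order of children, so only genuine topological differences are counted.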


Phytotaxa ◽  
2015 ◽  
Vol 231 (3) ◽  
pp. 271 ◽  
Author(s):  
Kasun Madhusanka Thambugala ◽  
Yu Chunfang ◽  
Erio Camporesi ◽  
Ali H. Bahkali ◽  
Zuo Yi Liu ◽  
...  

Didymosphaeria spartii was collected from dead branches of Spartium junceum in Italy. Multi-gene phylogenetic analyses of ITS, 18S and 28S nrDNA sequence data were carried out using maximum likelihood and Bayesian analysis. The resulting phylogenetic trees showed this to be a new genus in a well-supported clade in Massarinaceae. A new genus Pseudodidymosphaeria is therefore introduced to accommodate this species based on molecular phylogeny and morphology. An illustrated account is provided for the new genus with its asexual morph and the new taxon is compared with Massarina and Didymosphaeria.


2021 ◽  
Author(s):  
Atia Basheer ◽  
Imran Zahoor

The present study aims to investigate the genomic variability and epidemiology of SARS-CoV-2 in Pakistan along with their role in the spread and severity of infection during the three waves of COVID-19. A total of 453 genomic sequences of Pakistani SARS-CoV-2 were retrieved from GISAID and subjected to MAFFT-based alignment and a QC check, which resulted in the removal of 53 samples. The remaining 400 samples were subjected to Pangolin-based genomic lineage identification. To infer our SARS-CoV-2 time-scaled and divergence phylogenetic trees, 3,804 selected global reference sequences plus 400 Pakistani samples were used for the Nextstrain analysis, with Wuhan/Hu-1/2019 as the reference genome. Finally, a maximum likelihood-based phylogenetic tree was built using Nextstrain, and a coverage map was created using Nextclade. Using the amino acid substitutions, maximum likelihood phylogenetic trees were developed for each wave separately. Our results reveal the circulation of 29 lineages, belonging to the following 7 clades: G, GH, GR, GRY, L, O, and S in the three waves. From the first wave, 16 genomic lineages of SARS-CoV-2 were identified, with B.1 (24.7%), B.1.36 (18.8%), and B.1.471 (18.8%) as the most prevalent lineages, respectively. The second wave data showed 18 lineages, 10 of which overlapped with the first wave, suggesting that those variants could not be contained during the first wave. In this wave, a new lineage, AE.4, was reported from Pakistan for the very first time in the world. However, B.1.36 (17.8%), B.1.36.31 (11.9%), B.1.1.7 (8.5%), and B.1.1.1 (5.9%) were the major lineages in the second wave. Third wave data showed the presence of 9 lineages, with Alpha/B.1.1.7 (72.7%), Beta/B.1.351 (12.99%), and Delta/B.1.617.2 (10.39%) as the most predominant variants. It is suggested that these VOCs should be contained at the earliest in order to prevent any devastating outbreak of SARS-CoV-2 in the country.

