Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies

PeerJ ◽

10.7717/peerj.11950 ◽

2021 ◽

Vol 9 ◽

pp. e11950

Author(s):

Jason W. Shapiro ◽

Catherine Putonti

Keyword(s):

Phylogenetic Trees ◽

Markov Models ◽

Gene Families ◽

Gene Clusters ◽

Automated Analysis ◽

Single Copy ◽

Bootstrap Support ◽

Homing Endonucleases ◽

Sequence Alignments ◽

Selfish Genetic Element

Background: A pangenome is the collection of all genes found in a set of related genomes. For microbes, these genomes are often different strains of the same species, and the pangenome offers a means to compare gene content variation with differences in phenotypes, ecology, and phylogenetic relatedness. Though most frequently applied to bacteria, there is growing interest in adapting pangenome analysis to bacteriophages. However, working with phage genomes presents new challenges. First, most phage families are under-sampled, and homologous genes in related viruses can be difficult to identify. Second, homing endonucleases and intron-like sequences may be present, resulting in fragmented gene calls. Each of these issues can reduce the accuracy of standard pangenome analysis tools. Methods: We developed an R pipeline called Rephine.r that takes as input the gene clusters produced by an initial pangenomics workflow. Rephine.r then proceeds in two primary steps. First, it identifies three common causes of fragmented gene calls: (1) indels creating early stop codons and new start codons; (2) interruption by a selfish genetic element; and (3) splitting at the ends of the reported genome. Fragmented genes are then fused to create new sequence alignments. In tandem, Rephine.r searches for distant homologs separated into different gene families using Hidden Markov Models. Significant hits are used to merge families into larger clusters. A final round of fragment identification is then run, and results may be used to infer single-copy core genomes and phylogenetic trees. Results: We applied Rephine.r to three well-studied phage groups: the Tevenvirinae (e.g., T4), the Studiervirinae (e.g., T7), and the Pbunaviruses (e.g., PB1). In each case, Rephine.r recovered additional members of the single-copy core genome and increased the overall bootstrap support of the phylogeny.
The Rephine.r pipeline is provided through GitHub (https://www.github.com/coevoeco/Rephine.r) as a single script for automated analysis and with utility functions to assist in building single-copy core genomes and predicting the sources of fragmented genes.
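The family-merging step described above can be pictured as finding connected components among gene families linked by significant HMM hits. The following is a minimal Python sketch of that idea, not the Rephine.r implementation: the hit tuples, the `merge_families` name, and the `1e-5` E-value cutoff are all illustrative assumptions that do not come from the abstract.

```python
from collections import defaultdict


def merge_families(families, hits, evalue_cutoff=1e-5):
    """Merge gene families linked by significant profile-HMM hits.

    families: iterable of family IDs (must include every ID used in hits).
    hits: iterable of (query_family, target_family, evalue) tuples;
          this simplified hit table is a hypothetical format.
    evalue_cutoff: assumed significance threshold, chosen for illustration.
    Returns a sorted list of clusters, each a sorted list of family IDs.
    """
    parent = {f: f for f in families}

    def find(f):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for query, target, evalue in hits:
        if evalue <= evalue_cutoff and query != target:
            root_q, root_t = find(query), find(target)
            if root_q != root_t:
                parent[root_t] = root_q  # union the two clusters

    clusters = defaultdict(list)
    for f in parent:
        clusters[find(f)].append(f)
    return sorted(sorted(members) for members in clusters.values())


families = ["GC_1", "GC_2", "GC_3", "GC_4"]
hits = [("GC_1", "GC_2", 1e-20), ("GC_2", "GC_3", 1e-8), ("GC_3", "GC_4", 0.5)]
merged = merge_families(families, hits)
```

Because merging is transitive, GC_1 and GC_3 end up in the same cluster even though no hit links them directly, while the weak GC_3 to GC_4 hit is ignored.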

Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies

10.1101/2021.04.26.441508 ◽

2021 ◽

Author(s):

Jason W. Shapiro ◽

Catherine Putonti

Keyword(s):

Phylogenetic Trees ◽

Markov Models ◽

Gene Families ◽

Gene Clusters ◽

Automated Analysis ◽

Single Copy ◽

Bootstrap Support ◽

Homing Endonucleases ◽

Sequence Alignments ◽

Selfish Genetic Element

Abstract Background: A pangenome is the collection of all genes found in a set of related genomes. For microbes, these genomes are often different strains of the same species, and the pangenome offers a means to compare gene content variation with differences in phenotypes, ecology, and phylogenetic relatedness. Though most frequently applied to bacteria, there is growing interest in adapting pangenome analysis to bacteriophages. However, working with phage genomes presents new challenges. First, most phage families are under-sampled, and homologous genes in related viruses can be difficult to identify. Second, homing endonucleases and intron-like sequences may be present, resulting in fragmented gene calls. Each of these issues can reduce the accuracy of standard pangenome analysis tools. Methods: We developed an R pipeline called Rephine.r that takes as input the gene clusters produced by an initial pangenomics workflow. Rephine.r then proceeds in two primary steps. First, it identifies three common causes of fragmented gene calls: 1) indels creating early stop codons and new start codons; 2) interruption by a selfish genetic element; and 3) splitting at the ends of the reported genome. Fragmented genes are then fused to create new sequence alignments. In tandem, Rephine.r searches for distant homologs separated into different gene families using Hidden Markov Models. Significant hits are used to merge families into larger clusters. A final round of fragment identification is then run, and results may be used to infer single-copy core genomes and phylogenetic trees. Results: We applied Rephine.r to three well-studied phage groups: the Tevenvirinae (e.g. T4), the Studiervirinae (e.g. T7), and the Pbunaviruses (e.g. PB1). In each case, Rephine.r recovered additional members of the single-copy core genome and increased the overall bootstrap support of the phylogeny.
The Rephine.r pipeline is provided through GitHub (https://www.github.com/coevoeco/Rephine.r) as a single script for automated analysis and with utility functions and a walkthrough for researchers with specific use cases for each type of correction.
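To illustrate why fusing fragmented gene calls can recover members of the single-copy core genome, here is a small Python sketch. It assumes a hypothetical table of per-genome gene-call counts for each family; the function name, family IDs, and genome labels are invented for the example and are not taken from Rephine.r.

```python
def single_copy_core(copy_counts, genomes):
    """Return the families that occur exactly once in every genome.

    copy_counts: dict mapping family ID -> dict of genome -> number of
                 gene calls assigned to that family (hypothetical input).
    genomes: list of all genome names in the analysis.
    """
    return sorted(
        family
        for family, counts in copy_counts.items()
        if all(counts.get(genome, 0) == 1 for genome in genomes)
    )


genomes = ["phage_A", "phage_B", "phage_C"]

# Before correction: a gene in phage_B is split into two calls,
# so its family looks multi-copy and drops out of the core.
before = {
    "GC_10": {"phage_A": 1, "phage_B": 2, "phage_C": 1},
    "GC_11": {"phage_A": 1, "phage_B": 1, "phage_C": 1},
}

# After fusing the two fragments into one gene call.
after = {
    "GC_10": {"phage_A": 1, "phage_B": 1, "phage_C": 1},
    "GC_11": {"phage_A": 1, "phage_B": 1, "phage_C": 1},
}

core_before = single_copy_core(before, genomes)
core_after = single_copy_core(after, genomes)
```

In this toy case the core grows from one family to two once the fragmented call is fused, which mirrors the kind of gain the abstract reports.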

Whole genome sequencing reveals the genomic diversity, taxonomic classification, and evolutionary relationships of the genus Nocardia

PLoS Neglected Tropical Diseases ◽

10.1371/journal.pntd.0009665 ◽

2021 ◽

Vol 15 (8) ◽

pp. e0009665

Author(s):

Shuai Xu ◽

Zhenpeng Li ◽

Yuanming Huang ◽

Lichao Han ◽

Yanlin Che ◽

...

Keyword(s):

Genetic Diversity ◽

Phylogenetic Trees ◽

Gene Families ◽

Single Copy ◽

Taxonomic Classification ◽

Genomic Diversity ◽

Evolutionary Relationships ◽

Taxonomic Structure ◽

Pan Genome ◽

Dipeptidyl Aminopeptidase

Nocardia is a complex and diverse genus of aerobic actinomycetes that cause complex clinical presentations that are difficult to diagnose because they are poorly understood. To date, the genetic diversity, evolution, and taxonomic structure of the genus Nocardia are still unclear. In this study, we investigated the pan-genome of 86 Nocardia type strains to clarify their genetic diversity. Our study revealed an open pan-genome for Nocardia containing 265,836 gene families, with about 99.7% of the pan-genome being variable. Horizontal gene transfer appears to have been an important evolutionary driver of genetic diversity shaping the Nocardia genome and may have caused historical taxonomic confusion from other taxa (primarily Rhodococcus, Skermania, Aldersonia, and Mycobacterium). Based on single-copy gene families, we established a high-accuracy phylogenomic approach for Nocardia using 229 genome sequences. Furthermore, we found 28 potentially new species and reclassified 16 strains. Finally, by comparing the topology between a phylogenomic tree and 384 phylogenetic trees (from 384 single-copy genes from the core genome), we identified a novel locus for inferring the phylogeny of this genus. The dapb1 gene, which encodes dipeptidyl aminopeptidase BI, was far superior to commonly used markers for Nocardia and yielded a topology almost identical to that of genome-based phylogeny. In conclusion, the present study provides insights into the genetic diversity, contributes a robust framework for the taxonomic classification, and elucidates the evolutionary relationships of Nocardia. This framework should facilitate the development of rapid tests for the species identification of highly variable species and has given new insight into the behavior of this genus.

GeneRax: A tool for species tree-aware maximum likelihood based gene family tree inference under gene duplication, transfer, and loss

10.1101/779066 ◽

2019 ◽

Cited By ~ 3

Author(s):

Benoit Morel ◽

Alexey M. Kozlov ◽

Alexandros Stamatakis ◽

Gergely J. Szöllősi

Keyword(s):

Maximum Likelihood ◽

Phylogenetic Trees ◽

Large Scale ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Homologous Gene ◽

Sequence Alignments ◽

Full Likelihood ◽

True Tree

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species tree-aware methods also leverage information from a putative species tree. However, only few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data pre-processing (e.g., computing bootstrap trees), and rely on approximations and heuristics that limit the degree of tree space exploration. Here we present GeneRax, the first maximum likelihood species tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared to competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson-Foulds distance. On empirical datasets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1099 Cyanobacteria families in eight minutes on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax.

GeneRax: A Tool for Species-Tree-Aware Maximum Likelihood-Based Gene Family Tree Inference under Gene Duplication, Transfer, and Loss

Molecular Biology and Evolution ◽

10.1093/molbev/msaa141 ◽

2020 ◽

Vol 37 (9) ◽

pp. 2763-2774 ◽

Cited By ~ 5

Author(s):

Benoit Morel ◽

Alexey M Kozlov ◽

Alexandros Stamatakis ◽

Gergely J Szöllősi

Keyword(s):

Maximum Likelihood ◽

Phylogenetic Trees ◽

Large Scale ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Homologous Gene ◽

Sequence Alignments ◽

Full Likelihood ◽

True Tree

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species-tree-aware methods also leverage information from a putative species tree. However, only few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data preprocessing (e.g., computing bootstrap trees) and rely on approximations and heuristics that limit the degree of tree space exploration. Here, we present GeneRax, the first maximum likelihood species-tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared with competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson–Foulds distance. On empirical data sets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1,099 Cyanobacteria families in 8 min on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax (last accessed June 17, 2020).
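The relative Robinson-Foulds distance used as the accuracy measure above can be sketched in a few lines of Python. This is an illustrative calculation on unrooted topologies, not GeneRax code: trees are assumed to be given as nested tuples of leaf names, and the helper names are hypothetical. The distance is the number of non-trivial bipartitions found in only one of the two trees, divided by the total number of non-trivial bipartitions in both.

```python
def _leaf_sets(tree, out):
    """Append the leaf set below every internal node to `out`; return this node's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(_leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Return (non-trivial bipartitions, all leaves) for a nested-tuple tree."""
    clades = []
    all_leaves = _leaf_sets(tree, clades)
    ref = min(all_leaves)  # reference leaf used to pick one canonical side
    result = set()
    for clade in clades:
        # Represent each split by the side that does not contain `ref`.
        side = all_leaves - clade if ref in clade else clade
        # Skip trivial splits (empty, single leaf, or all but one leaf).
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result, all_leaves


def relative_rf(tree_a, tree_b):
    """Relative Robinson-Foulds distance in [0, 1] between two topologies."""
    splits_a, leaves_a = bipartitions(tree_a)
    splits_b, leaves_b = bipartitions(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must have identical leaf sets")
    total = len(splits_a) + len(splits_b)
    return len(splits_a ^ splits_b) / total if total else 0.0


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
d_same = relative_rf(t1, t1)
d_diff = relative_rf(t1, t2)
```

Here `t1` and `t2` share the split separating D and E from the rest but disagree on the other one, so half of their bipartitions differ.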

Using evolutionary Expectation Maximization to estimate indel rates

Bioinformatics ◽

10.1093/bioinformatics/bti177 ◽

2005 ◽

Vol 21 (10) ◽

pp. 2294-2300 ◽

Cited By ~ 21

Author(s):

Ian Holmes

Keyword(s):

Em Algorithm ◽

Hidden Markov Models ◽

Expectation Maximization ◽

Phylogenetic Trees ◽

Markov Models ◽

Hidden Markov ◽

Biological Sequence ◽

Sequence Alignments ◽

Multiple Sequence ◽

Stochastic Grammars

Abstract Motivation: The Expectation Maximization (EM) algorithm, in the form of the Baum–Welch algorithm (for hidden Markov models) or the Inside-Outside algorithm (for stochastic context-free grammars), is a powerful way to estimate the parameters of stochastic grammars for biological sequence analysis. To use this algorithm for multiple-sequence evolutionary modelling, it would be useful to apply the EM algorithm to estimate not only the probability parameters of the stochastic grammar, but also the instantaneous mutation rates of the underlying evolutionary model (to facilitate the development of stochastic grammars based on phylogenetic trees, also known as Statistical Alignment). Recently, we showed how to do this for the point substitution component of the evolutionary process; here, we extend these results to the indel process. Results: We present an algorithm for maximum-likelihood estimation of insertion and deletion rates from multiple sequence alignments, using EM, under the single-residue indel model owing to Thorne, Kishino and Felsenstein (the ‘TKF91’ model). The algorithm converges extremely rapidly, gives accurate results on simulated data that are an improvement over parsimonious estimates (which are shown to underestimate the true indel rate), and gives plausible results on experimental data (coronavirus envelope domains). Owing to the algorithm's close similarity to the Baum–Welch algorithm for training hidden Markov models, it can be used in an ‘unsupervised’ fashion to estimate rates for unaligned sequences, or estimate several sets of rates for sequences with heterogeneous rates. Availability: Software implementing the algorithm and the benchmark is available under GPL from http://www.biowiki.org/

Phylogenetic analysis reveals a low rate of homologous recombination in negative-sense RNA viruses

Journal of General Virology ◽

10.1099/vir.0.19277-0 ◽

2003 ◽

Vol 84 (10) ◽

pp. 2691-2703 ◽

Cited By ~ 180

Author(s):

Elizabeth R. Chare ◽

Ernest A. Gould ◽

Edward C. Holmes

Keyword(s):

Homologous Recombination ◽

Phylogenetic Trees ◽

Rna Viruses ◽

Rift Valley Fever Virus ◽

Fever Virus ◽

Hantaan Virus ◽

Bootstrap Support ◽

Influenza B ◽

Sequence Alignments ◽

Negative Sense

Recombination is increasingly seen as an important means of shaping genetic diversity in RNA viruses. However, observed recombination frequencies vary widely among those viruses studied to date, with only sporadic occurrences reported in RNA viruses with negative-sense genomes. To determine the extent of homologous recombination in negative-sense RNA viruses, phylogenetic analyses of 79 gene sequence alignments from 35 negative-sense RNA viruses (a total of 2154 sequences) were carried out. Powerful evidence was found for recombination, in the form of incongruent phylogenetic trees between different gene regions, in only five sequences from Hantaan virus, Mumps virus and Newcastle disease virus. This is the first report of recombination in these viruses. More tentative evidence for recombination, where conflicting phylogenetic trees were observed (but were without strong bootstrap support) and/or where putative recombinant regions were very short, was found in three alignments from La Crosse virus and Puumala virus. Finally, patterns of sequence variation compatible with the action of recombination, but not definitive evidence for this process, were observed in a further ten viruses: Canine distemper virus, Crimean-Congo haemorrhagic fever virus, Influenza A virus, Influenza B virus, Influenza C virus, Lassa virus, Pirital virus, Rabies virus, Rift Valley Fever virus and Vesicular stomatitis virus. The possibility of recombination in these viruses should be investigated further. Overall, this study reveals that rates of homologous recombination in negative-sense RNA viruses are very much lower than those of mutation, with many viruses seemingly clonal on current data. Consequently, recombination rate is unlikely to be a trait that is set by natural selection to create advantageous or purge deleterious mutations.

Machine-learning classification suggests that many alphaproteobacterial prophages may instead be gene transfer agents

10.1101/697243 ◽

2019 ◽

Author(s):

Roman Kogay ◽

Taylor B. Neely ◽

Daniel P. Birnbaum ◽

Camille R. Hankel ◽

Migun Shakya ◽

...

Keyword(s):

Gene Transfer ◽

Phylogenetic Trees ◽

Rhodobacter Capsulatus ◽

Gene Clusters ◽

Support Vector ◽

Sequence Alignments ◽

Bacterial Populations ◽

Machine Learning Classification ◽

Bona Fide ◽

Transfer Agents

Abstract Many of the sequenced bacterial and archaeal genomes encode regions of viral provenance. Yet, not all of these regions encode bona fide viruses. Gene transfer agents (GTAs) are thought to be former viruses that are now maintained in genomes of some bacteria and archaea and are hypothesized to enable exchange of DNA within bacterial populations. In Alphaproteobacteria, genes homologous to the ‘head-tail’ gene cluster that encodes structural components of the Rhodobacter capsulatus GTA (RcGTA) are found in many taxa, even if they are only distantly related to Rhodobacter capsulatus. Yet, in most genomes available in GenBank RcGTA-like genes have annotations of typical viral proteins, and therefore are not easily distinguished from their viral homologs without additional analyses. Here, we report a ‘support vector machine’ classifier that quickly and accurately distinguishes RcGTA-like genes from their viral homologs by capturing the differences in the amino acid composition of the encoded proteins. Our open-source classifier is implemented in Python and can be used to scan homologs of the RcGTA genes in newly sequenced genomes. The classifier can also be trained to identify other types of GTAs, or even to detect other elements of viral ancestry. Using the classifier trained on a manually curated set of homologous viruses and GTAs, we detected RcGTA-like ‘head-tail’ gene clusters in 57.5% of the 1,423 examined alphaproteobacterial genomes. We also demonstrated that more than half of the in silico prophage predictions are instead likely to be GTAs, suggesting that in many alphaproteobacterial genomes the RcGTA-like elements remain unrecognized. Data deposition: Sequence alignments and phylogenetic trees are available in a FigShare repository at DOI 10.6084/m9.figshare.8796419. The Python source code of the described classifier and additional scripts used in the analyses are available via a GitHub repository at https://github.com/ecg-lab/GTA-Hunter-v1
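As a rough illustration of what "amino acid composition" means as a classifier input, the Python sketch below turns a protein sequence into a 20-element vector of residue frequencies. It is an assumption-laden example rather than the published classifier: the abstract does not specify the exact feature set, the `aa_composition` name is hypothetical, and no SVM training is shown.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(protein):
    """Return the fraction of each standard amino acid in a protein sequence.

    Characters outside the 20 standard residues (e.g. 'X' or '*') are ignored.
    """
    counts = Counter(ch for ch in protein.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


features = aa_composition("MKKA")
lysine_fraction = features[AMINO_ACIDS.index("K")]
```

Vectors like `features`, one per protein and labeled as GTA or virus, are the kind of fixed-length numeric input a support vector machine can be trained on.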

Faculty Opinions recommendation of Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1163036.623685 ◽

2009 ◽

Author(s):

Oliver Pybus

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Sequence Alignments

Analysis of the chromosomal clustering of Fusarium-responsive wheat genes uncovers new players in the defence against head blight disease

Scientific Reports ◽

10.1038/s41598-021-86362-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Alexandre Perochon ◽

Harriet R. Benbow ◽

Katarzyna Ślęczka-Brady ◽

Keshav B. Malla ◽

Fiona M. Doohan

Keyword(s):

Map Kinases ◽

Gene Families ◽

Gene Clusters ◽

Head Blight ◽

Orphan Gene ◽

Metabolic Genes ◽

Eukaryotic Gene ◽

Blight Disease ◽

Genomic Regions ◽

Eukaryotic Genomes

Abstract There is increasing evidence that some functionally related, co-expressed genes cluster within eukaryotic genomes. We present a novel pipeline that delineates such eukaryotic gene clusters. Using this tool for bread wheat, we uncovered 44 clusters of genes that are responsive to the fungal pathogen Fusarium graminearum. As expected, these Fusarium-responsive gene clusters (FRGCs) included metabolic gene clusters, many of which are associated with disease resistance, but hitherto not described for wheat. However, the majority of the FRGCs are non-metabolic, many of which contain clusters of paralogues, including those implicated in plant disease responses, such as glutathione transferases, MAP kinases, and germin-like proteins. Twenty of the FRGCs encode nonhomologous, non-metabolic genes (including defence-related genes). One of these clusters includes the characterised Fusarium resistance orphan gene, TaFROG. Eight of the FRGCs map within six FHB resistance loci. One small QTL on chromosome 7D (4.7 Mb) encodes eight Fusarium-responsive genes, five of which are within a FRGC. This study provides a new tool to identify genomic regions enriched in genes responsive to specific traits of interest and, applied herein, it highlighted gene families, genetic loci and biological pathways of importance in the response of wheat to disease.

The Chloroplast Phylogenomics and Systematics of Zoysia (Poaceae)

Plants ◽

10.3390/plants10081517 ◽

2021 ◽

Vol 10 (8) ◽

pp. 1517

Author(s):

Se-Hwan Cheon ◽

Min-Ah Woo ◽

Sangjin Jo ◽

Young-Kee Kim ◽

Ki-Joong Kim

Keyword(s):

Northeast Asia ◽

Single Copy ◽

Rrna Genes ◽

Bootstrap Support ◽

Trna Genes ◽

Protein Coding ◽

Tropical Regions ◽

Relationship Of ◽

The Relationship ◽

Simple Sequence

The genus Zoysia Willd. (Chloridoideae) is widely distributed from the temperate regions of Northeast Asia (including China, Japan, and Korea) to the tropical regions of Southeast Asia. Among these, four species (Zoysia japonica Steud., Zoysia sinica Hance, Zoysia tenuifolia Thiele, and Zoysia macrostachya Franch. & Sav.) are naturally distributed in the Korean Peninsula. In this study, we report the complete plastome sequences of these Korean Zoysia species (NCBI acc. nos. MF953592, MF967579~MF967581). The length of Zoysia plastomes ranges from 135,854 to 135,904 bp, and the plastomes have a typical quadripartite structure, which consists of a pair of inverted repeat regions (20,962~20,966 bp) separated by a large (81,348~81,392 bp) and a small (12,582~12,586 bp) single-copy region. In terms of gene order and structure, Zoysia plastomes are similar to the typical plastomes of Poaceae. The plastomes encode 110 genes, of which 76 are protein-coding genes, 30 are tRNA genes, and four are rRNA genes. Fourteen genes contain single introns and one gene has two introns. Three evolutionary hotspot spacer regions (atpB~rbcL, rps16~rps3, and rpl32~trnL-UAG) were recognized among six analyzed Zoysia species. The high divergences in the atpB~rbcL spacer and rpl16~rpl3 region are primarily due to the differences in base substitutions and indels. In contrast, the high divergence between rpl32~trnL-UAG spacers is due to a small inversion with a pair of 22 bp stem and an 11 bp loop. Simple sequence repeats (SSRs) were identified in 59 different locations in Z. japonica, 63 in Z. sinica, 62 in Z. macrostachya, and 63 in Z. tenuifolia plastomes. Phylogenetic analysis showed that the Zoysia (Zoysiinae) forms a monophyletic group, which is sister to Sporobolus (Sporobolinae), with 100% bootstrap support. Within the Zoysia clade, the relationship of (Z. sinica, Z. japonica), (Z. tenuifolia, Z. matrella), (Z. macrostachya, Z. macrantha) was suggested.
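The quadripartite structure described above implies a simple length identity: total plastome length equals the large single-copy region plus the small single-copy region plus two copies of the inverted repeat. The short Python check below applies it to the lower bound of each reported range. Pairing these three lower bounds is an assumption, since the abstract does not state that they belong to the same plastome, and the function name is illustrative.

```python
def plastome_length(lsc, ssc, ir):
    """Total length of a quadripartite plastome in bp: LSC + SSC + two IR copies."""
    return lsc + ssc + 2 * ir


# Lower bounds of the ranges reported in the abstract (bp).
shortest = plastome_length(lsc=81_348, ssc=12_582, ir=20_962)
```

The result, 135,854 bp, matches the shortest plastome length reported in the abstract.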