Whole genome phylogenies reflect long-tailed distributions of recombination rates in many bacterial species

Abstract: Although homologous recombination is accepted to be common in bacteria, so far it has been challenging to accurately quantify its impact on genome evolution within bacterial species. Here we introduce methods that use the statistics of single-nucleotide polymorphism (SNP) splits in the core genome alignment of a set of strains to show that, for many bacterial species, recombination dominates genome evolution. Each genomic locus has been overwritten so many times by recombination that it is impossible to reconstruct the clonal phylogeny and, instead of a consensus phylogeny, the phylogeny typically changes many thousands of times along the core genome alignment. We also show how SNP splits can be used to quantify the relative rates with which different subsets of strains have recombined in the past. We find that virtually every strain has a unique pattern of frequencies with which its lineages have recombined with those of other strains, and that the relative rates with which different subsets of strains share SNPs follow long-tailed distributions. Our findings show that bacterial populations are neither clonal nor freely recombining, but structured such that recombination rates between different lineages vary along a continuum spanning several orders of magnitude, with a unique pattern of rates for each lineage. Thus, rather than reflecting clonal ancestry, whole genome phylogenies reflect these long-tailed distributions of recombination rates.

Whole genome phylogenies reflect the distributions of recombination rates for many bacterial species

eLife ◽ 10.7554/elife.65366 ◽ 2021 ◽ Vol 10

Author(s): Thomas Sakoparnig ◽ Chris Field ◽ Erik van Nimwegen

Keyword(s): Bacterial Species ◽ Genome Alignment ◽ Whole Genome ◽ Allele Sharing ◽ Nucleotide Polymorphism ◽ Recombination Rates ◽ Single Nucleotide ◽ Bacterial Populations ◽ New Methods ◽ Alignment Analysis

Although recombination is accepted to be common in bacteria, for many species robust phylogenies with well-resolved branches can be reconstructed from whole genome alignments of strains, and these are generally interpreted to reflect clonal relationships. Using new methods based on the statistics of single-nucleotide polymorphism (SNP) splits, we show that this interpretation is incorrect. For many species, each locus has recombined many times along its line of descent, and instead of many loci supporting a common phylogeny, the phylogeny changes many thousands of times along the genome alignment. Analysis of the patterns of allele sharing among strains shows that bacterial populations cannot be approximated as either clonal or freely recombining, but are structured such that recombination rates between lineages vary over several orders of magnitude, with a unique pattern of rates for each lineage. Thus, rather than reflecting clonal ancestry, whole genome phylogenies reflect distributions of recombination rates.
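
To make the idea of a SNP split concrete, the following Python sketch treats each biallelic alignment column as a bipartition of the strains and applies the standard pairwise compatibility (four-gamete) check: two splits can be placed on a single tree only if at least one of the four intersections of their sides is empty. This is an illustrative toy, not the authors' method; the function names and the example columns are assumptions made for this example, and gaps or ambiguous bases are not handled.

```python
from typing import FrozenSet, List, Optional


def snp_split(column: str) -> Optional[FrozenSet[int]]:
    """Return the strain indices whose allele differs from that of strain 0.

    Returns None when the column is not biallelic and so defines no split.
    Simplification: every distinct character counts as an allele.
    """
    if len(set(column)) != 2:
        return None
    return frozenset(i for i, base in enumerate(column) if base != column[0])


def compatible(a: FrozenSet[int], b: FrozenSet[int], n_strains: int) -> bool:
    """Four-gamete check: compatible iff one of the four intersections is empty."""
    everyone = frozenset(range(n_strains))
    parts = (a & b, a - b, b - a, everyone - (a | b))
    return any(len(part) == 0 for part in parts)


def count_adjacent_conflicts(columns: List[str]) -> int:
    """Count neighbouring pairs of SNP splits that cannot share one tree."""
    n_strains = len(columns[0]) if columns else 0
    splits = [s for s in map(snp_split, columns) if s is not None]
    return sum(
        not compatible(x, y, n_strains) for x, y in zip(splits, splits[1:])
    )


# Toy alignment columns for four strains (hypothetical data).
toy_columns = ["AAGG", "AGAG", "AAAG", "AAAA"]
conflicts = count_adjacent_conflicts(toy_columns)
```

Each conflict between neighbouring splits indicates that the two sites cannot be explained by one tree without repeated mutation, so the count gives a rough sense of how often the local phylogeny must change along an alignment.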

A Genomic Distance Based on MUM Indicates Discontinuity between Most Bacterial Species and Genera

Journal of Bacteriology ◽ 10.1128/jb.01202-08 ◽ 2008 ◽ Vol 191 (1) ◽ pp. 91-99 ◽ Cited By ~ 115

Author(s): Marc Deloger ◽ Meriem El Karoui ◽ Marie-Agnès Petit

Keyword(s): DNA Sequences ◽ DNA Content ◽ Core Genome ◽ Biological Diversity ◽ Bacterial Species ◽ Genomic Distance ◽ The Core ◽ Intraspecies Diversity ◽ Genome Level ◽ Definition Of

ABSTRACT The fundamental unit of biological diversity is the species. However, a remarkable extent of intraspecies diversity in bacteria was discovered by genome sequencing, and it reveals the need to develop clear criteria to group strains within a species. Two main types of analyses used to quantify intraspecies variation at the genome level are the average nucleotide identity (ANI), which detects the DNA conservation of the core genome, and the DNA content, which calculates the proportion of DNA shared by two genomes. Both estimates are based on BLAST alignments for the definition of DNA sequences common to the genome pair. Interestingly, however, results using these methods on intraspecies pairs are not well correlated. This prompted us to develop a genomic-distance index taking into account both criteria of diversity, which are based on DNA maximal unique matches (MUM) shared by two genomes. The values, called MUMi, for MUM index, correlate better with the ANI than with the DNA content. Moreover, the MUMi groups strains in a way that is congruent with routinely used multilocus sequence-typing trees, as well as with ANI-based trees. We used the MUMi to determine the relatedness of all available genome pairs at the species and genus levels. Our analysis reveals a certain consistency in the current notion of bacterial species, in that the bulk of intraspecies and intragenus values are clearly separable. It also confirms that some species are much more diverse than most. As the MUMi is fast to calculate, it offers the possibility of measuring genome distances on the whole database of available genomes.

chewBBACA: A complete suite for gene-by-gene schema creation and strain identification

10.1101/173146 ◽ 2017 ◽ Cited By ~ 5

Author(s): Mickael Silva ◽ Miguel Machado ◽ Diogo N. Silva ◽ Mirko Rossi ◽ Jacob Moran-Gilad ◽ ...

Keyword(s): Open Source ◽ Core Genome ◽ Bacterial Species ◽ Outbreak Detection ◽ Strain Identification ◽ List Type ◽ Whole Genome ◽ Link Type ◽ The Creation ◽ Allele Calling

ABSTRACT Gene-by-gene approaches are becoming increasingly popular in bacterial genomic epidemiology and outbreak detection. However, there is a lack of open-source scalable software for schema definition and allele calling for these methodologies. The chewBBACA suite was designed to assist users in the creation and evaluation of novel whole-genome or core-genome gene-by-gene typing schemas and subsequent allele calling in bacterial strains of interest. The software can run on a laptop or on high-performance clusters, making it useful for both small laboratories and large reference centers. chewBBACA is available at https://github.com/B-UMMI/chewBBACA or as a docker image at https://hub.docker.com/r/ummidock/chewbbaca/.

DATA SUMMARY Assembled genomes used for the tutorial were downloaded from NCBI in August 2016 by selecting those submitted as Streptococcus agalactiae taxon or sub-taxa. All the assemblies have been deposited as a zip file in FigShare (https://figshare.com/s/9cbe1d422805db54cd52), where a file with the original ftp link for each NCBI directory is also available. Code for the chewBBACA suite is available at https://github.com/B-UMMI/chewBBACA while the tutorial example is found at https://github.com/B-UMMI/chewBBACA_tutorial. I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.

IMPACT STATEMENT The chewBBACA software offers a computational solution for the creation, evaluation and use of whole genome (wg) and core genome (cg) multilocus sequence typing (MLST) schemas. It allows researchers to develop wg/cgMLST schemes for any bacterial species from a set of genomes of interest. The alleles identified by chewBBACA correspond to potential coding sequences, possibly offering insights into the correspondence between the genetic variability identified and phenotypic variability. The software performs allele calling in a matter of seconds to minutes per strain on a laptop but is easily scalable for the analysis of large datasets of hundreds of thousands of strains using multiprocessing options. The chewBBACA software thus provides an efficient and freely available open source solution for gene-by-gene methods. Moreover, the ability to perform these tasks locally is desirable when the submission of raw data to a central repository or web services is hindered by data protection policies or ethical or legal concerns.

First Steps in the Analysis of Prokaryotic Pan-Genomes

Bioinformatics and Biology Insights ◽ 10.1177/1177932220938064 ◽ 2020 ◽ Vol 14 ◽ pp. 117793222093806

Author(s): Sávio Souza Costa ◽ Luís Carlos Guimarães ◽ Artur Silva ◽ Siomar Castro Soares ◽ Rafael Azevedo Baraúna

Keyword(s): Genome Analysis ◽ Core Genome ◽ Bacterial Species ◽ Genomic Analysis ◽ Gene Families ◽ Specific Group ◽ The Core ◽ Pan Genome ◽ Research Areas ◽ Key Concepts

The pan-genome is defined as the set of orthologous and unique genes of a specific group of organisms. It is composed of the core genome, the accessory genome, and species- or strain-specific genes. The pan-genome is considered open or closed based on the alpha value of Heaps' law. In an open pan-genome, the number of gene families continues to increase as new genomes are added to the analysis, while in a closed pan-genome, the number of gene families does not increase considerably. The first step of a pan-genome analysis is the homogenization of genome annotation: the same software, such as GeneMark or RAST, should be used to annotate all genomes. Subsequently, software packages such as BPGA, GET_HOMOLOGUES, and PGAP, among others, are used to calculate the pan-genome. This review presents these initial steps for those who want to perform a pan-genome analysis, explaining key concepts of the area. Furthermore, we present the pan-genomic analysis of 9 bacterial species. These are the species with the highest number of genomes deposited in GenBank. We also show the influence of the identity and coverage parameters on the prediction of orthologous and paralogous genes. Finally, we cite the perspectives of several research areas where pan-genome analysis can be used to answer important issues.
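
The open/closed criterion can be illustrated with a small Python sketch. It assumes the commonly used power-law form n(N) = kappa * N^(-alpha) for the number of new gene families found when the N-th genome is added, and the usual convention that alpha <= 1 indicates an open pan-genome and alpha > 1 a closed one. The function names and the synthetic counts are hypothetical, and dedicated tools such as BPGA perform this fit with their own procedures.

```python
import math
from typing import List, Tuple


def fit_heaps_law(new_genes: List[float]) -> Tuple[float, float]:
    """Fit n(N) = kappa * N**(-alpha) by least squares in log-log space.

    new_genes[i] is the number of new gene families found when the
    (i + 1)-th genome is added; at least two positive values are required.
    """
    xs = [math.log(i + 1) for i in range(len(new_genes))]
    ys = [math.log(v) for v in new_genes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    kappa = math.exp(mean_y - slope * mean_x)
    return kappa, -slope  # alpha is the negated log-log slope


def classify_pan_genome(alpha: float) -> str:
    """Assumed convention: alpha <= 1 is open, alpha > 1 is closed."""
    return "open" if alpha <= 1.0 else "closed"


# Synthetic counts generated from kappa = 500, alpha = 0.6 (made-up values).
synthetic = [500 * n ** -0.6 for n in range(1, 21)]
kappa, alpha = fit_heaps_law(synthetic)
label = classify_pan_genome(alpha)
```

With real data the counts are noisy and depend on the order in which genomes are added, so the fit is usually repeated over many random orderings before alpha is interpreted.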

Heterogeneity among estimates of the core genome and pan-genome in different pneumococcal populations

10.1101/133991 ◽ 2017 ◽ Cited By ~ 5

Author(s): Andries J van Tonder ◽ James E Bray ◽ Keith A Jolley ◽ Sigríður J Quirk ◽ Gunnsteinn Haraldsson ◽ ...

Keyword(s): Bacterial Population ◽ Core Genome ◽ Bacterial Species ◽ Essential Point ◽ Genetic Lineages ◽ The Core ◽ Pan Genome ◽ Single Dataset ◽ Genomic Regions ◽ Core Genes

Abstract Background: Understanding the structure of a bacterial population is essential in order to understand bacterial evolution, or which genetic lineages cause disease, or the consequences of perturbations to the bacterial population. Estimating the core genome, the genes common to all or nearly all strains of a species, is an essential component of such analyses. The size and composition of the core genome varies by dataset, but our hypothesis was that variation between different collections of the same bacterial species should be minimal. To test this, the genome sequences of 3,121 pneumococci recovered from healthy individuals in Reykjavik (Iceland), Southampton (United Kingdom), Boston (USA) and Maela (Thailand) were analysed. Results: The analyses revealed a ‘supercore’ genome (genes shared by all 3,121 pneumococci) of only 303 genes, although 461 additional core genes were shared by pneumococci from Reykjavik, Southampton and Boston. Overall, the size and composition of the core genomes and pan-genomes among pneumococci recovered in Reykjavik, Southampton and Boston were very similar, but pneumococci from Maela were distinctly different. Inspection of the pan-genome of Maela pneumococci revealed several >25 Kb sequence regions that were homologous to genomic regions found in other bacterial species. Conclusions: Some subsets of the global pneumococcal population are highly heterogeneous and thus our hypothesis was rejected. This is an essential point of consideration before generalising the findings from a single dataset to the wider pneumococcal population.
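
The distinction between per-collection core genomes and a "supercore" can be sketched in Python as simple set arithmetic over gene presence. This is a toy illustration under stated assumptions: genes are already clustered into shared identifiers, the `core_genome` helper and its `min_fraction` parameter are hypothetical, and the isolates and genes below are invented rather than taken from the study.

```python
from collections import Counter
from typing import Dict, Set


def core_genome(genomes: Dict[str, Set[str]], min_fraction: float = 1.0) -> Set[str]:
    """Return the genes present in at least min_fraction of the genomes."""
    counts = Counter(gene for genes in genomes.values() for gene in genes)
    needed = min_fraction * len(genomes)
    return {gene for gene, count in counts.items() if count >= needed}


# Hypothetical collections: isolate name -> set of gene identifiers.
site_a = {"a1": {"g1", "g2", "g3"}, "a2": {"g1", "g2", "g4"}}
site_b = {"b1": {"g1", "g3", "g5"}, "b2": {"g1", "g5"}}

core_a = core_genome(site_a)
core_b = core_genome(site_b)
# Pooling every isolate gives the genes shared by all of them.
supercore = core_genome({**site_a, **site_b})
```

With the strict threshold, the supercore equals the intersection of the per-collection cores, so a single divergent collection is enough to shrink it sharply; lowering `min_fraction` gives the "nearly all strains" variant of the definition.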

Global genomic similarity and core genome sequence diversity of the Streptococcus genus as a toolkit to identify closely related bacterial species in complex environments

PeerJ ◽ 10.7717/peerj.6233 ◽ 2019 ◽ Vol 6 ◽ pp. e6233 ◽ Cited By ~ 4

Author(s): Hugo R. Barajas ◽ Miguel F. Romero ◽ Shamayim Martínez-Sánchez ◽ Luis D. Alcaraz

Keyword(s): 16S rRNA ◽ 16S rRNA Gene ◽ Core Genome ◽ Bacterial Species ◽ Genomic Diversity ◽ Comparative Genomic ◽ rRNA Gene ◽ Gene Phylogeny ◽ The Core ◽ Core Proteins

Background: The Streptococcus genus is relevant to both public health and food safety because of its ability to cause pathogenic infections. It is well-represented (>100 genomes) in publicly available databases. Streptococci are ubiquitous, with multiple sources of isolation, from human pathogens to dairy products. The Streptococcus genus has traditionally been classified by morphology, serum types, the 16S ribosomal RNA (rRNA) gene, and multi-locus sequence types subject to in-depth comparative genomic analysis. Methods: Core and pan-genomes described the genomic diversity of 108 strains belonging to 16 Streptococcus species. The core genome nucleotide diversity was calculated and compared to phylogenomic distances within the genus Streptococcus. The core genome was also used as a resource to recruit metagenomic fragment reads from streptococci-dominated environments. A conventional 16S rRNA gene phylogeny reconstruction was used as a reference against which to compare the average nucleotide identity (ANI) and genome similarity score (GSS) dendrograms. Results: The core genome, in this work, consists of 404 proteins that are shared by all 108 Streptococcus strains. The average identity of the pairwise compared core proteins decreases in proportion to lower GSS scores across species. The GSS dendrogram recovers most of the clades in the 16S rRNA gene phylogeny while distinguishing between 16S polytomies (unresolved nodes). The GSS is a distance metric that can reflect evolutionary history by comparing orthologous proteins. Additionally, GSS proved to be the most useful metric for genus and species comparisons, where ANI metrics failed due to false positives when comparing different species. Discussion: Understanding genomic variability and species relatedness is the goal of tools like GSS, which makes use of the maximum pairwise shared orthologous sequences for its calculation. It allows for long evolutionary distances (above species) to be included because it uses amino acid alignment scores, rather than nucleotides, and normalizes by positive matches. Newly sequenced species and strains could be easily placed into GSS dendrograms to infer overall genomic relatedness. The GSS is not restricted to ubiquitous conservancy of gene features; thus, it reflects the mosaic structure and dynamism of gene acquisition and loss in bacterial genomes.

EnteroBase: Hierarchical clustering of 100,000s of bacterial genomes into species/sub-species and populations

10.1101/2022.01.11.475882 ◽ 2022

Author(s): Mark Achtman ◽ Zhemin Zhou ◽ Jane Charlesworth ◽ Laura A. Baxter

Keyword(s): Core Genome ◽ Bacterial Species ◽ Automated Identification ◽ Bacterial Genomes ◽ Bacterial Populations ◽ Vertical Inheritance ◽ Definition Of ◽ Taxonomic Assignments ◽ Species Specific ◽ Sequence Types

The definition of bacterial species is traditionally a taxonomic issue, while bacterial populations are defined using population genetics. These assignments are species specific and depend on the practitioner. Legacy multilocus sequence typing is commonly used to identify sequence types (STs) and clusters (ST Complexes). However, these approaches are not adequate for the millions of genomic sequences from bacterial pathogens that have been generated since 2012. EnteroBase (http://enterobase.warwick.ac.uk) automatically clusters core genome MLST alleles into hierarchical clusters (HierCC) after assembling annotated draft genomes from short read sequences. HierCC clusters span core sequence diversity from the species level down to individual transmission chains. Here we evaluate the ability of HierCC to correctly assign 100,000s of genomes to the species/subspecies and population levels for Salmonella, Clostridioides, Yersinia, Vibrio and Streptococcus. HierCC assignments were more consistent with maximum-likelihood super-trees of core SNPs or presence/absence of accessory genes than classical taxonomic assignments or 95% ANI. However, neither HierCC nor ANI was uniformly consistent with the classical taxonomy of Streptococcus. HierCC was also consistent with legacy eBGs/ST Complexes in Salmonella or Escherichia and revealed differences in vertical inheritance of O serogroups. Thus, EnteroBase HierCC supports the automated identification of and assignment to species/subspecies and populations for multiple genera.

Rapid Core-Genome Alignment and Visualization for Thousands of Intraspecific Microbial Genomes

10.1101/007351 ◽ 2014 ◽ Cited By ~ 6

Author(s): Todd J. Treangen ◽ Brian D. Ondov ◽ Sergey Koren ◽ Adam M. Phillippy

Keyword(s): Open Source ◽ Phylogenetic Trees ◽ Core Genome ◽ Real Data ◽ Genome Alignment ◽ Whole Genome ◽ Microbial Genomes ◽ Microbial Strains ◽ Visualization Tools ◽ Whole Genome Alignment

Though many microbial species or clades now have hundreds of sequenced genomes, existing whole-genome alignment methods do not efficiently handle comparisons on this scale. Here we present the Harvest suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial strains. Harvest includes Parsnp, a fast core-genome multi-aligner, and Gingr, a dynamic visual platform. Combined they provide interactive core-genome alignments, variant calls, recombination detection, and phylogenetic trees. Using simulated and real data we demonstrate that our approach exhibits unrivaled speed while maintaining the accuracy of existing methods. The Harvest suite is open-source and freely available from: http://github.com/marbl/harvest.

The Genomics of Streptococcus Pneumoniae Carriage Isolates from UK Children and Their Household Contacts, Pre-PCV7 to Post-PCV13

Genes ◽ 10.3390/genes10090687 ◽ 2019 ◽ Vol 10 (9) ◽ pp. 687 ◽ Cited By ~ 2

Author(s): Sheppard ◽ Groves ◽ Andrews ◽ Litt ◽ Fry ◽ ...

Keyword(s): Streptococcus pneumoniae ◽ Core Genome ◽ Genomic Data ◽ Genome Alignment ◽ Whole Genome ◽ Genetic Lineages ◽ Household Contacts ◽ Serotype Replacement ◽ Maximum Likelihood Phylogeny ◽ Sequence Types

We used whole genome sequencing (WGS) analysis to investigate the population structure of 877 Streptococcus pneumoniae isolates from five carriage studies in UK households, conducted in 2002 (N = 346), 2010 (N = 127), 2013 (N = 153), 2016 (N = 187) and 2018 (N = 64) and covering the period from pre-PCV7 to post-PCV13 implementation. The genomic lineages seen in the population were determined using multi-locus sequence typing (MLST) and PopPUNK (Population Partitioning Using Nucleotide K-mers), which was used for local and global comparisons. A Roary core genome alignment of all the carriage genomes was used to investigate phylogenetic relationships between the lineages. The results showed an influx of previously undetected sequence types after vaccination, associated with non-vaccine serotypes. A small number of lineages persisted throughout, associated with both non-vaccine and vaccine types (such as ST199), or that could be an example of serotype switching from vaccine to non-vaccine types (ST177). Serotype 3 persisted throughout the study years, represented by ST180 and Global Pneumococcal Sequencing Cluster (GPSC) 12; the local PopPUNK analysis and core genome maximum likelihood phylogeny separated them into two clades, one of which is only seen in later study years. The genomic data showed that serotype replacement in the carriage studies was mostly due to a change in genotype as well as serotype, but that some important genetic lineages, previously associated with vaccine types, persisted.
