PANINI: Pangenome Neighbor Identification for Bacterial Populations

ABSTRACT The standard workhorse for genomic analysis of the evolution of bacterial populations is phylogenetic modelling of mutations in the core genome. However, in the current era of population genomics, a notable amount of information about evolutionary and transmission processes in diverse populations can be lost unless the accessory genome is also taken into consideration. Here we introduce PANINI, a computationally scalable method for identifying the neighbours for each isolate in a data set using unsupervised machine learning with stochastic neighbour embedding. PANINI is browser-based and integrates with the Microreact platform for rapid online visualisation and exploration of both core and accessory genome evolutionary signals together with relevant epidemiological, geographic, temporal and other metadata. Several case studies with single- and multi-clone pneumococcal populations are presented to demonstrate its ability to identify biologically important signals from gene content data. PANINI is available at http://panini.wgsa.net/ and code at http://gitlab.com/cgps/panini
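
As a rough illustration of what identifying neighbours from gene content means, the sketch below takes a small presence/absence matrix and reports each isolate's closest other isolate. It is not PANINI's implementation and it does not perform stochastic neighbour embedding; the isolate names, the matrix values and the choice of Hamming distance are all assumptions made for illustration.

```python
# Hypothetical presence/absence matrix (an assumption, not data from the paper):
# each isolate maps to a 0/1 vector over the same accessory genes.
presence = {
    "isolate_A": [1, 1, 0, 0, 1],
    "isolate_B": [1, 1, 0, 1, 1],
    "isolate_C": [0, 0, 1, 1, 0],
    "isolate_D": [0, 0, 1, 0, 0],
}


def hamming(a, b):
    """Number of genes whose presence/absence differs between two isolates."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbours(matrix):
    """Map each isolate to its closest other isolate by gene-content distance."""
    result = {}
    for name, profile in matrix.items():
        candidates = [
            (hamming(profile, other_profile), other_name)
            for other_name, other_profile in matrix.items()
            if other_name != name
        ]
        # min() picks the smallest distance; ties fall back to name order.
        result[name] = min(candidates)[1]
    return result


neighbours = nearest_neighbours(presence)
```

In this toy matrix the isolates fall into two pairs with near-identical gene content, which is the kind of signal a core-genome phylogeny alone could miss.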

Why? – Successful Pseudomonas aeruginosa clones with a focus on clone C

FEMS Microbiology Reviews ◽

10.1093/femsre/fuaa029 ◽

2020 ◽

Vol 44 (6) ◽

pp. 740-762

Author(s):

Changhan Lee ◽

Jens Klockgether ◽

Sebastian Fischer ◽

Janja Trcek ◽

Burkhard Tümmler ◽

...

Keyword(s):

Pseudomonas Aeruginosa ◽

Core Genome ◽

Genomic Island ◽

Protein Quality ◽

Temperature Tolerance ◽

Protein Homeostasis ◽

Gene Products ◽

Accessory Genome ◽

The Core ◽

Conserved Core

ABSTRACT The environmental species Pseudomonas aeruginosa thrives in a variety of habitats. Within the epidemic population structure of P. aeruginosa, highly successful clones that are equally capable of succeeding in the environment and the human host occasionally arise. Framed by a highly conserved core genome, individual members of successful clones are characterized by a high variability in their accessory genome. The abundance of successful clones might be founded in specific features of the core genome or, although not mutually exclusive, in the variability of the accessory genome. In clone C, one of the most predominant clones, the plasmid pKLC102 and the PACGI-1 genomic island are two ubiquitous accessory genetic elements. The conserved transmissible locus of protein quality control (TLPQC) at the border of PACGI-1 is a unique horizontally transferred compository element, which codes predominantly for stress-related cargo gene products such as those involved in protein homeostasis. As a hallmark, most TLPQC xenologues possess a core genome equivalent. With elevated temperature tolerance as a characteristic of clone C strains, the unique P. aeruginosa and clone C specific disaggregase ClpG is a major contributor to tolerance. As other successful clones, such as PA14, do not encode the TLPQC locus, ubiquitous denominators of success, if existing, need to be identified.

Geography Shapes the Population Genomics of Salmonella enterica Dublin

Genome Biology and Evolution ◽

10.1093/gbe/evz158 ◽

2019 ◽

Vol 11 (8) ◽

pp. 2220-2231 ◽

Cited By ~ 3

Author(s):

Gavin J Fenske ◽

Anil Thachil ◽

Patrick L McDonough ◽

Amy Glaser ◽

Joy Scaria

Keyword(s):

Population Structure ◽

Salmonella Enterica ◽

Core Genome ◽

Population Genomics ◽

Negative Impact ◽

Antibiotic Resistance Genes ◽

Virulence Plasmid ◽

Ancestral State ◽

Secretion Systems ◽

Data Set

Abstract Salmonella enterica serotype Dublin (S. Dublin) is a bovine-adapted serotype that can cause serious systemic infections in humans. Despite the increasing prevalence of human infections and the negative impact on agricultural processes, little is known about the population structure of the serotype. To this end, we compiled a manually curated data set comprising 880 S. Dublin genomes. Core genome phylogeny and ancestral state reconstruction revealed that region-specific clades dominate the global population structure of S. Dublin. Strains of S. Dublin in the UK are genomically distinct from US, Brazilian, and African strains. The geographical partitioning impacts the composition of the core genome as well as the ancillary genome. Antibiotic resistance genes are almost exclusively found in US genomes and are mediated by an IncA/C2 plasmid. Phage content and the S. Dublin virulence plasmid were strongly conserved in the serotype. Comparison of S. Dublin to a closely related serotype, S. enterica serotype Enteritidis, revealed that S. Dublin contains 82 serotype-specific genes that are not found in S. Enteritidis. Said genes encode metabolic functions involved in the uptake and catabolism of carbohydrates and virulence genes associated with type VI secretion systems and fimbria assembly, respectively.
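
The serotype-specific gene comparison can be pictured as a set operation over gene content. The sketch below is only an illustration: the abstract does not state the criterion used, so it assumes a gene counts as specific when it occurs in every genome of the target serotype and in no genome of the comparison serotype. The gene names and genomes are invented.

```python
def serotype_specific(target_genomes, other_genomes):
    """Genes found in every target genome and in no genome of the other serotype.

    Both arguments are non-empty lists of sets of gene identifiers.
    The all/none criterion is an assumption made for this sketch.
    """
    shared_by_target = set.intersection(*target_genomes)
    seen_in_other = set.union(*other_genomes)
    return shared_by_target - seen_in_other


# Invented gene content for two genomes per serotype.
dublin = [
    {"geneA", "geneB", "geneC", "geneX"},
    {"geneA", "geneB", "geneX", "geneY"},
]
enteritidis = [
    {"geneA", "geneC"},
    {"geneA", "geneY"},
]

specific = serotype_specific(dublin, enteritidis)
```

A looser criterion, such as presence in most rather than all target genomes, would change the count, which is why the threshold should be reported alongside any such number.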

First Steps in the Analysis of Prokaryotic Pan-Genomes

Bioinformatics and Biology Insights ◽

10.1177/1177932220938064 ◽

2020 ◽

Vol 14 ◽

pp. 117793222093806

Author(s):

Sávio Souza Costa ◽

Luís Carlos Guimarães ◽

Artur Silva ◽

Siomar Castro Soares ◽

Rafael Azevedo Baraúna

Keyword(s):

Genome Analysis ◽

Core Genome ◽

Bacterial Species ◽

Genomic Analysis ◽

Gene Families ◽

Specific Group ◽

The Core ◽

Pan Genome ◽

Research Areas ◽

Key Concepts

The pan-genome is defined as the set of orthologous and unique genes of a specific group of organisms. The pan-genome is composed of the core genome, accessory genome, and species- or strain-specific genes. The pan-genome is considered open or closed based on the alpha value of Heaps' law. In an open pan-genome, the number of gene families will continuously increase with the addition of new genomes to the analysis, while in a closed pan-genome, the number of gene families will not increase considerably. The first step of a pan-genome analysis is the homogenization of genome annotation. The same software, such as GeneMark or RAST, should be used to annotate all genomes. Subsequently, several software tools are used to calculate the pan-genome, such as BPGA, GET_HOMOLOGUES, and PGAP, among others. This review presents all these initial steps for those who want to perform a pan-genome analysis, explaining key concepts of the area. Furthermore, we present the pan-genomic analysis of 9 bacterial species. These are the species with the highest number of genomes deposited in GenBank. We also show the influence of the identity and coverage parameters on the prediction of orthologous and paralogous genes. Finally, we cite the perspectives of several research areas where pan-genome analysis can be used to answer important issues.
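
The open/closed distinction can be made concrete with a small fit. The abstract gives neither the formula nor the threshold, so the sketch below assumes the power-law form commonly used for this purpose, n = kappa * N^(-alpha), where n is the number of new gene families found when the N-th genome is added, together with the usual convention that alpha <= 1 indicates an open pan-genome and alpha > 1 a closed one. The counts are invented.

```python
import math


def fit_alpha(new_genes_per_genome):
    """Least-squares fit of log(n) = log(kappa) - alpha * log(N).

    new_genes_per_genome[i] is the number of new gene families seen when
    genome number i + 2 is added (counting starts at the second genome).
    All counts must be positive so that the logarithm is defined.
    """
    xs = [math.log(i + 2) for i in range(len(new_genes_per_genome))]
    ys = [math.log(n) for n in new_genes_per_genome]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


def classify(alpha):
    """Assumed convention: alpha <= 1 means open, alpha > 1 means closed."""
    return "open" if alpha <= 1 else "closed"


# Invented counts that follow n = 500 * N ** -0.5 exactly, for genomes 2..11.
counts = [500 * N ** -0.5 for N in range(2, 12)]
alpha = fit_alpha(counts)
status = classify(alpha)
```

Real new-gene counts depend on the order in which genomes are added, so in practice the fit is usually made on counts averaged over many random orderings.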

Identification of Nitrogen Fixation Genes in Lactococcus Isolated from Maize Using Population Genomics and Machine Learning

Microorganisms ◽

10.3390/microorganisms8122043 ◽

2020 ◽

Vol 8 (12) ◽

pp. 2043

Author(s):

Shawn M. Higdon ◽

Bihua C. Huang ◽

Alan B. Bennett ◽

Bart C. Weimer

Keyword(s):

Machine Learning ◽

Nitrogen Fixation ◽

Genome Wide Association Study ◽

Population Genomics ◽

Genomic Analysis ◽

Oxidation Reduction ◽

Genome Wide ◽

Biological Nitrogen ◽

Carbohydrate Catabolism ◽

Comparative Population

Sierra Mixe maize is a landrace variety from Oaxaca, Mexico, that utilizes nitrogen derived from the atmosphere via an undefined nitrogen fixation mechanism. The diazotrophic microbiota associated with the plant’s mucilaginous aerial root exudate composed of complex carbohydrates was previously identified and characterized by our group where we found 23 lactococci capable of biological nitrogen fixation (BNF) without containing any of the proposed essential genes for this trait (nifHDKENB). To determine the genes in Lactococcus associated with this phenotype, we selected 70 lactococci from the dairy industry that are not known to be diazotrophic to conduct a comparative population genomic analysis. This showed that the diazotrophic lactococcal genomes were distinctly different from the dairy isolates. Examining the pangenome followed by genome-wide association study and machine learning identified genes with the functions needed for BNF in the maize isolates that were absent from the dairy isolates. Many of the putative genes received an ‘unknown’ annotation, which led to the domain analysis of the 135 homologs. This revealed genes with molecular functions needed for BNF, including mucilage carbohydrate catabolism, glycan-mediated host adhesion, iron/siderophore utilization, and oxidation/reduction control. This is the first report of this pathway in this organism to underpin BNF. Consequently, we proposed a model needed for BNF in lactococci that plausibly accounts for BNF in the absence of the nif operon in this organism.

A Genome-Based Model to Predict the Virulence of Pseudomonas aeruginosa Isolates

mBio ◽

10.1128/mbio.01527-20 ◽

2020 ◽

Vol 11 (4) ◽

Author(s):

Nathan B. Pincus ◽

Egon A. Ozer ◽

Jonathan P. Allen ◽

Marcus Nguyen ◽

James J. Davis ◽

...

Keyword(s):

Machine Learning ◽

Pseudomonas Aeruginosa ◽

Core Genome ◽

Learning Models ◽

Single Nucleotide Variants ◽

Bacterial Genomes ◽

Content Type ◽

Accessory Genome ◽

A Genome ◽

Machine Learning Models

ABSTRACT Variation in the genome of Pseudomonas aeruginosa, an important pathogen, can have dramatic impacts on the bacterium’s ability to cause disease. We therefore asked whether it was possible to predict the virulence of P. aeruginosa isolates based on their genomic content. We applied a machine learning approach to a genetically and phenotypically diverse collection of 115 clinical P. aeruginosa isolates using genomic information and corresponding virulence phenotypes in a mouse model of bacteremia. We defined the accessory genome of these isolates through the presence or absence of accessory genomic elements (AGEs), sequences present in some strains but not others. Machine learning models trained using AGEs were predictive of virulence, with a mean nested cross-validation accuracy of 75% using the random forest algorithm. However, individual AGEs did not have a large influence on the algorithm’s performance, suggesting instead that virulence predictions are derived from a diffuse genomic signature. These results were validated with an independent test set of 25 P. aeruginosa isolates whose virulence was predicted with 72% accuracy. Machine learning models trained using core genome single-nucleotide variants and whole-genome k-mers also predicted virulence. Our findings are a proof of concept for the use of bacterial genomes to predict pathogenicity in P. aeruginosa and highlight the potential of this approach for predicting patient outcomes. IMPORTANCE Pseudomonas aeruginosa is a clinically important Gram-negative opportunistic pathogen. P. aeruginosa shows a large degree of genomic heterogeneity both through variation in sequences found throughout the species (core genome) and through the presence or absence of sequences in different isolates (accessory genome). P. aeruginosa isolates also differ markedly in their ability to cause disease. In this study, we used machine learning to predict the virulence level of P. aeruginosa isolates in a mouse bacteremia model based on genomic content. We show that both the accessory and core genomes are predictive of virulence. This study provides a machine learning framework to investigate relationships between bacterial genomes and complex phenotypes such as virulence.

Evolutionary Dynamics Based on Comparative Genomics of Pathogenic Escherichia coli Lineages Harboring Polyketide Synthase (pks) Island

mBio ◽

10.1128/mbio.03634-20 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Arya Suresh ◽

Sabiha Shaik ◽

Ramani Baddam ◽

Amit Ranjan ◽

Shamsul Qumar ◽

...

Keyword(s):

Comparative Genomics ◽

Maximum Likelihood ◽

High Throughput ◽

Core Genome ◽

Polyketide Synthase ◽

Clinical Implications ◽

Data Set ◽

Content Type ◽

E Coli ◽

The Core

ABSTRACT The genotoxin colibactin is a secondary metabolite produced by the polyketide synthase (pks) island harbored by extraintestinal pathogenic E. coli (ExPEC) and other members of the Enterobacteriaceae that has been increasingly reported to have critical implications in human health. The present study entails a high-throughput whole-genome comparison and phylogenetic analysis of such pathogenic E. coli isolates to gain insights into the patterns of distribution, horizontal transmission, and evolution of the island. For the current study, 23 pks-positive ExPEC genomes were newly sequenced, and their virulome and resistome profiles indicated a preponderance of virulence encoding genes and a reduced number of genes for antimicrobial resistance. In addition, 4,090 E. coli genomes from the public domain were also analyzed for large-scale screening for pks-positive genomes, out of which a total of 530 pks-positive genomes were studied to understand the subtype-based distribution pattern(s). The pks island showed a significant association with the B2 phylogroup (82.2%) and a high prevalence in sequence type 73 (ST73; n = 179) and ST95 (n = 110) and the O6:H1 (n = 110) serotype. Maximum-likelihood (ML) phylogeny of the core genome and intergenic regions (IGRs) of the ST95 model data set, which was selected because it had both pks-positive and pks-negative genomes, displayed clustering in relation to their carriage of the pks island. Prevalence patterns of genes encoding RM systems in the pks-positive and pks-negative genomes were also analyzed to determine their potential role in pks island acquisition and the maintenance capability of the genomes. Further, the maximum-likelihood phylogeny based on the core genome and pks island sequences from 247 genomes with an intact pks island demonstrated horizontal gene transfer of the island across sequence types and serotypes, with few exceptions. 
This study vitally contributes to understanding of the lineages and subtypes that have a higher propensity to harbor the pks island-encoded genotoxin with possible clinical implications. IMPORTANCE Extraintestinal pathologies caused by highly virulent strains of E. coli amount to clinical implications with high morbidity and mortality rates. Pathogenic E. coli strains are evolving with the horizontal acquisition of mobile genetic elements, including pathogenicity islands such as the pks island, which produces the genotoxin colibactin, resulting in severe clinical outcomes, including colorectal cancer progression. The current study encompasses high-throughput comparative genomics and phylogenetic analyses to address the questions pertaining to the acquisition and evolution pattern of the genomic island in different E. coli subtypes. It is crucial to gain insights into the distribution, transfer, and maintenance of pathogenic islands, as they harbor multiple virulence genes involved in pathogenesis and clinical implications of the infection.

Gene-gene relationships in an Escherichia coli accessory genome are linked to function and mobility

Microbial Genomics ◽

10.1099/mgen.0.000650 ◽

2021 ◽

Vol 7 (9) ◽

Author(s):

Rebecca J. Hall ◽

Fiona J. Whelan ◽

Elizabeth A. Cummins ◽

Christopher Connor ◽

Alan McNally ◽

...

Keyword(s):

Escherichia Coli ◽

Complex Network ◽

Type Species ◽

Core Genome ◽

Mobile Genetic Elements ◽

Dynamic Nature ◽

Content Type ◽

Accessory Genome ◽

The Core ◽

Sequence Types

The pangenome contains all genes encoded by a species, with the core genome present in all strains and the accessory genome in only a subset. Coincident gene relationships are expected within the accessory genome, where the presence or absence of one gene is influenced by the presence or absence of another. Here, we analysed the accessory genome of an Escherichia coli pangenome consisting of 400 genomes from 20 sequence types to identify genes that display significant co-occurrence or avoidance patterns with one another. We present a complex network of genes that are either found together or that avoid one another more often than would be expected by chance, and show that these relationships vary by lineage. We demonstrate that genes co-occur by function, and that several highly connected gene relationships are linked to mobile genetic elements. We find that genes are more likely to co-occur with, rather than avoid, another gene in the accessory genome. This work furthers our understanding of the dynamic nature of prokaryote pangenomes and implicates both function and mobility as drivers of gene relationships.
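
The idea of genes being found together, or avoiding one another, more often than expected by chance can be illustrated with a simple pairwise comparison. The sketch below is not the method used in the study, which the abstract does not describe. It assumes independence between genes as the null model, uses a hypergeometric upper tail as one possible significance measure, and ignores the lineage structure that a real analysis would need to account for. The presence/absence vectors are invented.

```python
from math import comb


def cooccurrence(gene_a, gene_b):
    """Compare observed co-occurrence of two genes with the independence expectation.

    gene_a and gene_b are equal-length lists of 0/1 values across genomes.
    Returns (observed, expected, p_upper), where p_upper is the hypergeometric
    probability of at least `observed` shared genomes arising by chance.
    """
    n = len(gene_a)
    a = sum(gene_a)
    b = sum(gene_b)
    observed = sum(x and y for x, y in zip(gene_a, gene_b))
    expected = a * b / n
    p_upper = sum(
        comb(a, k) * comb(n - a, b - k) for k in range(observed, min(a, b) + 1)
    ) / comb(n, b)
    return observed, expected, p_upper


# Invented presence/absence vectors across eight genomes.
gene_1 = [1, 1, 1, 1, 0, 0, 0, 0]
gene_2 = [1, 1, 1, 1, 0, 0, 0, 0]  # always found with gene_1
gene_3 = [0, 0, 0, 0, 1, 1, 1, 1]  # never found with gene_1

together = cooccurrence(gene_1, gene_2)
apart = cooccurrence(gene_1, gene_3)
```

Here gene_1 and gene_2 share all four genomes against an expectation of two, while gene_1 and gene_3 share none. Avoidance would be tested with the lower tail instead of the upper one.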

An Escherichia coli ST131 pangenome atlas reveals population structure and evolution across 4,071 isolates

10.1101/719583 ◽

2019 ◽

Cited By ~ 1

Author(s):

Arun Gonzales Decano ◽

Tim Downing

Keyword(s):

Escherichia Coli ◽

Population Structure ◽

Population Genomics ◽

De Novo ◽

Beta Lactam ◽

Illumina Hiseq ◽

Accessory Genome ◽

Esbl Gene ◽

The Core ◽

Genomic Study

Abstract Escherichia coli ST131 is a major cause of infection with extensive antimicrobial resistance (AMR) facilitated by widespread beta-lactam antibiotic use. This drug pressure has driven extended-spectrum beta-lactamase (ESBL) gene acquisition and evolution in pathogens, so a clearer resolution of ST131’s origin, adaptation and spread is essential. E. coli ST131’s ESBL genes are typically embedded in mobile genetic elements (MGEs) that aid transfer to new plasmid or chromosomal locations, which are mobilised further by plasmid conjugation and recombination, resulting in a flexible ESBL, MGE and plasmid composition with a conserved core genome. We used population genomics to trace the evolution of AMR in ST131 more precisely by extracting all available high-quality Illumina HiSeq read libraries to investigate 4,071 globally-sourced genomes, the largest ST131 collection examined so far. We applied rigorous quality-control, genome de novo assembly and ESBL gene screening to resolve ST131’s population structure across three genetically distinct Clades (A, B, C) and abundant subclades from the dominant Clade C. We reconstructed their evolutionary relationships across the core and accessory genomes using published reference genomes, long read assemblies and k-mer-based methods to contextualise pangenome diversity. The three main C subclades have co-circulated globally at relatively stable frequencies over time, suggesting that they attained an equilibrium after their origin and initial rapid spread. This contrasted with their ESBL genes, which had stronger patterns across time, geography and subclade, and were located at distinct locations across the chromosomes and plasmids between isolates. Within the three C subclades, the core and accessory genome diversity levels were not correlated due to plasmid and MGE activity, unlike patterns between the three main clades, A, B and C.
This population genomic study highlights the dynamic nature of the accessory genomes in ST131, suggesting that surveillance should anticipate genetically variable outbreaks with broader antibiotic resistance levels. Our findings emphasise the potential of evolutionary pangenomics to improve our understanding of AMR gene transfer, adaptation and transmission to discover accessory genome changes linked to novel subtypes.

Genomic Analysis of Curtobacterium flaccumfaciens Reveals the Differences Between Pathogenic and Nonpathogenic Strains

10.21203/rs.3.rs-335330/v1 ◽

2021 ◽

Author(s):

Qingde Li ◽

Lianjun Sun

Keyword(s):

Core Genome ◽

Genomic Analysis ◽

Pectate Lyase ◽

Economic Losses ◽

Positive Bacterium ◽

Citrate Cycle ◽

The Core ◽

Genomic Level ◽

Genes Encoding ◽

Genome Phylogeny

Abstract Purpose Curtobacterium flaccumfaciens is a Gram-positive bacterium which has been isolated from different plants and abiotic environments. Curtobacterium flaccumfaciens pv. flaccumfaciens (Cff) is a pathogenic bacterium that infects legumes, causing great economic losses. At the genomic level, the metabolic and phylogenetic characteristics, and differences in pathogenicity between pathogenic and nonpathogenic C. flaccumfaciens strains have not been analyzed in detail. Methods Therefore, in order to discuss the differences in genome, phylogeny, gene function and mobile genetic elements between pathogenic and nonpathogenic strains, pangenomics and comparative genomics were used in this study to analyze 12 C. flaccumfaciens strains. Result The pangenome of C. flaccumfaciens is open. Phylogenetic analysis showed that there was no correlation between the phylogeny and pathogenicity of C. flaccumfaciens. KAAS annotation of the core genome shows that the citrate cycle was incomplete. In addition, gene island analysis of the three pathogenicity-related genes encoding pectate lyase, serine protease and cellulases showed that they only existed in the Cffs and LMG3645 strains. LMG3645 might be a pathogenic strain. Conclusion This study clearly and reliably revealed the differences between the pathogenic and nonpathogenic strains of C. flaccumfaciens at the genomic level, and paves the way for further research on its pathogenicity.

Biogeography and Microscale Diversity Shape the Biosynthetic Potential of Fungus-growing Ant-associated Pseudonocardia

10.1101/545640 ◽

2019 ◽

Cited By ~ 5

Author(s):

Bradon R. McDonald ◽

Marc G. Chevrette ◽

Jonathan L. Klassen ◽

Heidi A. Horn ◽

Eric J. Caldera ◽

...

Keyword(s):

Genetic Diversity ◽

Core Genome ◽

Genomic Analysis ◽

Barro Colorado Island ◽

Bacterial Populations ◽

Ant Colonies ◽

Geographic Patterns ◽

Population Genomic ◽

Close Proximity ◽

Fungus Growing Ants

Abstract The geographic and phylogenetic scale of ecologically relevant microbial diversity is still poorly understood. Using a model mutualism, fungus-growing ants and their defensive bacterial associate Pseudonocardia, we analyzed genetic diversity and biosynthetic potential in 46 strains isolated from ant colonies in a 20 km transect near Barro Colorado Island in Panama. Despite an average pairwise core genome similarity of greater than 99%, population genomic analysis revealed several distinct bacterial populations matching ant host geographic distribution. We identified both genetic diversity signatures and divergent genes distinct to each lineage. We also identified natural product biosynthesis clusters specific to isolation locations. These geographic patterns were observable despite the populations living in close proximity to each other and provide evidence of ongoing genetic exchange. Our results add to the growing body of literature suggesting that variation in traits of interest can be found at extremely fine phylogenetic scales.
