ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses

2020 ◽  
Author(s):  
Natasha Pavlovikj ◽  
Joao Carlos Gomes-Neto ◽  
Jitender S. Deogun ◽  
Andrew K. Benson

Whole Genome Sequence (WGS) data from bacterial species is used for a variety of applications ranging from basic microbiological research to diagnostics and epidemiological surveillance. The availability of WGS data from hundreds of thousands of individual isolates of individual microbial species poses a tremendous opportunity for discovery and hypothesis-generating research into ecology and evolution of these microorganisms. Limited scalability and user-friendliness of existing pipelines for population-scale inquiry, however, restrict applications of systematic, population-scale approaches. Here, we present ProkEvo, an automated, scalable, and open-source framework for bacterial population genomics analyses using WGS data. ProkEvo was specifically developed to achieve the following goals: 1) Automation and scaling of complex combinations of computational analyses for many thousands of bacterial genomes from inputs of raw Illumina paired-end sequence reads; 2) Use of workflow management systems (WMS) such as Pegasus WMS to ensure reproducibility, scalability, modularity, fault-tolerance, and robust file management throughout the process; 3) Use of high-performance and high-throughput computational platforms; 4) Generation of hierarchical population-based genotypes at different scales of resolution based on combinations of multi-locus and Bayesian statistical approaches for classification; 5) Detection of antimicrobial resistance (AMR) genes, putative virulence factors, and plasmids from curated databases and association with genotypic classifications; and 6) Production of pan-genome annotations and data compilation that can be utilized for downstream analysis. The scalability of ProkEvo was measured with two datasets comprising significantly different numbers of input genomes (one with ~2,400 genomes, and the second with ~23,000 genomes). Depending on the dataset and the computational platform used, the running time of ProkEvo varied from ~3-26 days.
ProkEvo can be used with virtually any bacterial species and the Pegasus WMS facilitates addition or removal of programs from the workflow or modification of options within them. All the dependencies of ProkEvo can be distributed via conda environment or Docker image. To demonstrate versatility of the ProkEvo platform, we performed population-based analyses from available genomes of three distinct pathogenic bacterial species as individual case studies (three serovars of Salmonella enterica, as well as Campylobacter jejuni and Staphylococcus aureus). The specific case studies used reproducible Python and R scripts documented in Jupyter Notebooks and collectively illustrate how hierarchical analyses of population structures, genotype frequencies, and distribution of specific gene functions can be used to generate novel hypotheses about the evolutionary history and ecological characteristics of specific populations of each pathogen. Collectively, our study shows that ProkEvo presents a viable option for scalable, automated analyses of bacterial populations with powerful applications for basic microbiology research, clinical microbiological diagnostics, and epidemiological surveillance.

PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11376
Author(s):  
Natasha Pavlovikj ◽  
Joao Carlos Gomes-Neto ◽  
Jitender S. Deogun ◽  
Andrew K. Benson

Whole Genome Sequence (WGS) data from bacterial species is used for a variety of applications ranging from basic microbiological research to diagnostics and epidemiological surveillance. The availability of WGS data from hundreds of thousands of individual isolates of individual microbial species poses a tremendous opportunity for discovery and hypothesis-generating research into ecology and evolution of these microorganisms. Limited flexibility, scalability, and user-friendliness of existing pipelines for population-scale inquiry, however, restrict applications of systematic, population-scale approaches. Here, we present ProkEvo, an automated, scalable, reproducible, and open-source framework for bacterial population genomics analyses using WGS data. ProkEvo was specifically developed to achieve the following goals: (1) Automation and scaling of complex combinations of computational analyses for many thousands of bacterial genomes from inputs of raw Illumina paired-end sequence reads; (2) Use of workflow management systems (WMS) such as Pegasus WMS to ensure reproducibility, scalability, modularity, fault-tolerance, and robust file management throughout the process; (3) Use of high-performance and high-throughput computational platforms; (4) Generation of hierarchical population structure analyses, based on combinations of multi-locus and Bayesian statistical approaches for classification, for ecological and epidemiological inquiries; (5) Association of antimicrobial resistance (AMR) genes, putative virulence factors, and plasmids from curated databases with the hierarchically related genotypic classifications; and (6) Production of pan-genome annotations and data compilation that can be utilized for downstream analysis such as identification of population-specific genomic signatures. The scalability of ProkEvo was measured with two datasets comprising significantly different numbers of input genomes (one with ~2,400 genomes, and the second with ~23,000 genomes).
Depending on the dataset and the computational platform used, the running time of ProkEvo varied from ~3-26 days. ProkEvo can be used with virtually any bacterial species, and the Pegasus WMS uniquely facilitates addition or removal of programs from the workflow or modification of options within them. To demonstrate versatility of the ProkEvo platform, we performed hierarchical population structure analyses from available genomes of three distinct pathogenic bacterial species as individual case studies. The specific case studies illustrate how hierarchical analyses of population structures, genotype frequencies, and distribution of specific gene functions can be integrated into an analysis. Collectively, our study shows that ProkEvo presents a practical, viable option for scalable, automated analyses of bacterial populations with direct applications for basic microbiology research, clinical microbiological diagnostics, and epidemiological surveillance.


Open Biology ◽  
2015 ◽  
Vol 5 (1) ◽  
pp. 140133 ◽  
Author(s):  
Nitin Kumar ◽  
Ganesh Lad ◽  
Elisa Giuntini ◽  
Maria E. Kaye ◽  
Piyachat Udomwong ◽  
...  

Biological species may remain distinct because of genetic isolation or ecological adaptation, but these two aspects do not always coincide. To establish the nature of the species boundary within a local bacterial population, we characterized a sympatric population of the bacterium Rhizobium leguminosarum by genomic sequencing of 72 isolates. Although all strains have 16S rRNA typical of R. leguminosarum, they fall into five genospecies by the criterion of average nucleotide identity (ANI). Many genes, on plasmids as well as the chromosome, support this division: recombination of core genes has been largely within genospecies. Nevertheless, variation in ecological properties, including symbiotic host range and carbon-source utilization, cuts across these genospecies, so that none of these phenotypes is diagnostic of genospecies. This phenotypic variation is conferred by mobile genes. The genospecies meet the Mayr criteria for biological species in respect of their core genes, but do not correspond to coherent ecological groups, so periodic selection may not be effective in purging variation within them. The population structure is incompatible with traditional ‘polyphasic taxonomy’ that requires bacterial species to have both phylogenetic coherence and distinctive phenotypes. More generally, genomics has revealed that many bacterial species share adaptive modules by horizontal gene transfer, and we envisage a more consistent taxonomic framework that explicitly recognizes this. Significant phenotypes should be recognized as ‘biovars’ within species that are defined by core gene phylogeny.
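As a rough illustration of grouping isolates "by the criterion of average nucleotide identity", the sketch below clusters isolates whose pairwise ANI meets a cutoff. The abstract does not state the cutoff or the clustering rule; the 95% default, the single-linkage rule, the isolate names, and the ANI values are all assumptions made for this toy example.

```python
def group_by_ani(names, ani, threshold=95.0):
    """Single-linkage grouping: two isolates share a group if a chain of
    pairs with ANI >= threshold connects them (threshold is an assumption)."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), value in ani.items():
        if value >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical isolates and ANI percentages, not data from the study.
NAMES = ["s1", "s2", "s3", "s4", "s5"]
ANI = {
    ("s1", "s2"): 98.7, ("s1", "s3"): 97.9, ("s2", "s3"): 98.1,
    ("s4", "s5"): 99.0,
    ("s1", "s4"): 93.2, ("s2", "s4"): 92.8, ("s3", "s5"): 93.5,
}

if __name__ == "__main__":
    print(group_by_ani(NAMES, ANI))
```

Single linkage is the simplest choice here; a stricter rule (for example, requiring every pair in a group to exceed the cutoff) could split groups that this version merges.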


2021 ◽  
Author(s):  
Natasha Pavlovikj ◽  
Joao Carlos Gomes-Neto ◽  
Jitender S. Deogun ◽  
Andrew K. Benson

Epidemiological surveillance of bacterial pathogens requires real-time data analysis with a fast turn-around, while aiming at generating two main outcomes: 1) species-level identification; and 2) variant mapping at different levels of genotypic resolution for population-based tracking, in addition to predicting traits such as antimicrobial resistance (AMR). With the recent advances and continual dissemination of whole-genome sequencing technologies, large-scale population-based genotyping of bacterial pathogens has become possible. Since bacterial populations often present a high degree of clonality in the genomic backbone (i.e., low genetic diversity), the choice of genotyping scheme can even facilitate the understanding of ancestral relationships and can be used for prediction of co-inherited traits such as AMR. Multi-locus sequence typing (MLST) fits that purpose and can identify sequence types (ST) based on seven ubiquitous genome-scattered loci that aid in genotyping isolates beneath the species level. ST-based mapping also standardizes genotyping across laboratories and can be consistently used worldwide. However, ST-based algorithms, when using Illumina paired-end sequences, often rely on genome assembly prior to classification. That hinders rapid genotyping and scalability, which are essential aspects of genomic epidemiology. stringMLST is a kmer-based ST method with the capacity to solve both hurdles. Yet, a comprehensive scalable comparison of its use in contrast to a standard MLST program for a wide array of phylogenetically divergent public health-relevant bacterial pathogens is lacking. Herein, we first demonstrated that stringMLST is a fast tool that can be deployed for ST-based epidemiological inquiries of bacterial populations.
Additionally, we systematically evaluated and showed the impact of genome-intrinsic and -extrinsic features, as well as the optimal kmer length in maximizing the performance of stringMLST on a species-by-species basis, and highlighted a few instances where this program may not be applicable in its current format. Furthermore, we integrated stringMLST as part of our freely available and scalable hierarchical-based population genomics platform called ProkEvo. Besides facilitating automatable and reproducible bacterial population-guided analysis, ProkEvo now offers a rapidly deployable genomic epidemiology tool for ST mapping, with specific guidance on how to optimize its performance, that can be widely applied by microbiological laboratories and epidemiological agencies.
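To make the idea of assembly-free, kmer-based sequence typing concrete, here is a minimal sketch of the general technique: index the kmers of known alleles, let read kmers vote, and look up the resulting allele profile. It is not stringMLST's actual implementation. The two-locus scheme (real MLST schemes use seven loci), the allele sequences, the reads, and the ST label are all hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(alleles, k):
    """Map each kmer to the set of (locus, allele_id) pairs containing it."""
    index = {}
    for locus, variants in alleles.items():
        for allele_id, seq in variants.items():
            for kmer in set(kmers(seq, k)):
                index.setdefault(kmer, set()).add((locus, allele_id))
    return index


def call_alleles(reads, index, k):
    """Per locus, pick the allele supported by the most read kmers.
    Ties are broken by whichever allele was counted first."""
    votes = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            for hit in index.get(kmer, ()):
                votes[hit] += 1
    best = {}
    for (locus, allele_id), count in votes.items():
        if locus not in best or count > best[locus][1]:
            best[locus] = (allele_id, count)
    return {locus: allele_id for locus, (allele_id, _) in best.items()}


def sequence_type(profile, st_table):
    """Look up the ST for an allele profile; None if the profile is unknown."""
    return st_table.get(tuple(sorted(profile.items())))


# Toy scheme: hypothetical loci, alleles, reads, and ST label.
ALLELES = {
    "locusA": {1: "ACGTACGTAA", 2: "ACGTACCTAA"},
    "locusB": {1: "TTGACCGGTA", 2: "TTGACAGGTA"},
}
ST_TABLE = {(("locusA", 2), ("locusB", 1)): "ST-toy-7"}
READS = ["GTACCTAA", "TTGACCGG", "ACCGGTA"]
K = 5

if __name__ == "__main__":
    profile = call_alleles(READS, build_index(ALLELES, K), K)
    print(profile, sequence_type(profile, ST_TABLE))
```

The choice of `K` matters even in this toy: shorter kmers are shared by more alleles and blur the vote, which is one way to read the abstract's point about tuning kmer length per species.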


2021 ◽  
Author(s):  
John A Lees ◽  
Gerry Tonkin-Hill ◽  
Zhirong Yang ◽  
Jukka Corander

In less than a decade, population genomics of microbes has progressed from the effort of sequencing dozens of strains to thousands, or even tens of thousands of strains in a single study. There are now hundreds of thousands of genomes available even for a single bacterial species and the number of genomes is expected to continue to increase at an accelerated pace given the advances in sequencing technology and widespread genomic surveillance initiatives. This explosion of data calls for innovative methods to enable rapid exploration of the structure of a population based on different data modalities, such as multiple sequence alignments, assemblies and estimates of gene content across different genomes. Here we present Mandrake, an efficient implementation of a dimensional reduction method tailored for the needs of large-scale population genomics. Mandrake is capable of visualising population structure from millions of whole genomes and we illustrate its usefulness with several data sets representing major pathogens. Our method is freely available both as an analysis pipeline (https://github.com/johnlees/mandrake) and as a browser-based interactive application (https://gtonkinhill.github.io/mandrake-web/).


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Victoria N. Parikh

The rich tradition of cardiovascular genomics has placed the field in prime position to extend our knowledge toward a genome-first approach to diagnosis and therapy. Population-scale genomic data has enabled exponential improvements in our ability to adjudicate variant pathogenicity based on allele rarity, and there has been a significant effort to employ these sizeable data in the investigation of rare disease. Certainly, population genomics data has great potential to aid the development of a genome-first approach to Mendelian cardiovascular disease, but its use in clinical and investigative decision making is limited by the characteristics of the populations studied and the evolutionary constraints on human Mendelian variation. To truly empower clinicians and patients, the successful implementation of a genome-first approach to rare cardiovascular disease will require the nuanced incorporation of population-based discovery with detailed investigation of rare disease cohorts and prospective variant evaluation.


Viruses ◽  
2021 ◽  
Vol 13 (5) ◽  
pp. 749
Author(s):  
Julia Butt ◽  
Rajagopal Murugan ◽  
Theresa Hippchen ◽  
Sylvia Olberg ◽  
Monique van Straaten ◽  
...  

The emerging SARS-CoV-2 pandemic entails an urgent need for specific and sensitive high-throughput serological assays to assess SARS-CoV-2 epidemiology. We, therefore, aimed at developing a fluorescent-bead based SARS-CoV-2 multiplex serology assay for detection of antibody responses to the SARS-CoV-2 proteome. Proteins of the SARS-CoV-2 proteome and protein N of SARS-CoV-1 and common cold Coronaviruses (ccCoVs) were recombinantly expressed in E. coli or HEK293 cells. Assay performance was assessed in a COVID-19 case cohort (n = 48 hospitalized patients from Heidelberg) as well as n = 85 age- and sex-matched pre-pandemic controls from the ESTHER study. Assay validation included comparison with home-made immunofluorescence and commercial enzyme-linked immunosorbent (ELISA) assays. A sensitivity of 100% (95% CI: 86–100%) was achieved in COVID-19 patients 14 days post symptom onset with dual sero-positivity to SARS-CoV-2 N and the receptor-binding domain of the spike protein. The specificity obtained with this algorithm was 100% (95% CI: 96–100%). Antibody responses to ccCoVs N were abundantly high and did not correlate with those to SARS-CoV-2 N. Inclusion of additional SARS-CoV-2 proteins as well as separate assessment of immunoglobulin (Ig) classes M, A, and G allowed for explorative analyses regarding disease progression and course of antibody response. This newly developed SARS-CoV-2 multiplex serology assay achieved high sensitivity and specificity to determine SARS-CoV-2 sero-positivity. Its high throughput ability allows epidemiologic SARS-CoV-2 research in large population-based studies. Inclusion of additional pathogens into the panel as well as separate assessment of Ig isotypes will furthermore allow addressing research questions beyond SARS-CoV-2 sero-prevalence.
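The reported specificity interval can be sanity-checked with a short calculation. The abstract does not say which interval method was used; the sketch below assumes an exact (Clopper-Pearson) binomial interval and assumes that all n = 85 controls entered the specificity estimate. When every one of n samples is classified correctly, the lower limit reduces to (alpha/2)^(1/n) and the upper limit is 100%.

```python
def exact_lower_bound_all_correct(n, confidence=0.95):
    """Clopper-Pearson lower limit for a proportion when all n
    observations are successes: solve p**n = alpha / 2 for p."""
    alpha = 1.0 - confidence
    return (alpha / 2.0) ** (1.0 / n)


N_CONTROLS = 85  # pre-pandemic controls stated in the abstract

if __name__ == "__main__":
    lower = exact_lower_bound_all_correct(N_CONTROLS)
    print(f"lower 95% limit with {N_CONTROLS}/{N_CONTROLS} correct: {lower:.1%}")
```

Under these assumptions the lower limit rounds to 96%, in line with the reported interval. The same check is not possible for the sensitivity figure, because the abstract does not give the number of patients at 14 days post symptom onset.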


2021 ◽  
Vol 7 (7) ◽  
pp. eabe5054
Author(s):  
Qianxin Wu ◽  
Chenqu Suo ◽  
Tom Brown ◽  
Tengyao Wang ◽  
Sarah A. Teichmann ◽  
...  

We present INSIGHT [isothermal NASBA (nucleic acid sequence–based amplification) sequencing–based high-throughput test], a two-stage coronavirus disease 2019 testing strategy, using a barcoded isothermal NASBA reaction. It combines point-of-care diagnosis with next-generation sequencing, aiming to achieve population-scale testing. Stage 1 allows a quick decentralized readout for early isolation of presymptomatic or asymptomatic patients. It gives results within 1 to 2 hours, using either fluorescence detection or a lateral flow readout, while simultaneously incorporating sample-specific barcodes. The same reaction products from potentially hundreds of thousands of samples can then be pooled and used in a highly multiplexed sequencing–based assay in stage 2. This second stage confirms the near-patient testing results and facilitates centralized data collection. The 95% limit of detection is <50 copies of viral RNA per reaction. INSIGHT is suitable for further development into a rapid home-based, point-of-care assay and is potentially scalable to the population level.


Circulation ◽  
2015 ◽  
Vol 132 (suppl_3) ◽  
Author(s):  
Martin I Sigurdsson ◽  
Mahyar Heydarpour ◽  
Louis Saddic ◽  
Tzuu-Wang Chang ◽  
Stanton K Shernan ◽  
...  

Introduction: The majority of information on the genetic background of atrial fibrillation (AF) results from genomic DNA variant analysis without consideration of tissue expression. Hypothesis: Analysis of tissue-specific gene expression in left atrium (LA) can further understanding of the molecular mechanism of identified AF risk variants, and identify novel genes and gene variants associated with AF. Methods: We isolated mRNA from samples of the LA free wall taken during mitral valve surgery in 62 Caucasian individuals. Gene expression in the LA was compared between patients who did and did not have post-operative AF (poAF) using high-throughput RNA sequencing. Using genotypes of 1.4 million single nucleotide polymorphisms (SNP), we performed cis expression quantitative trait loci (eQTL) analysis, correlating gene expression of each gene with the genotypes of adjacent (<1Mbp) SNPs. Results: We identified 23 differentially expressed genes in the LA of patients with poAF, including three potassium channel genes (KCNA7, KCNH8 and KCNK17). The largest expression difference was in LOC645323, a long non-coding RNA. The expression of PITX2, ZFHX3 and KCNN3, previously shown to be associated with AF, did not differ between patients with and without poAF. We identified 12,476 cis eQTL relationships in the LA, several of which included genetic regions and genes previously associated with AF. We confirmed an eQTL relationship between rs3744029 genotype and the expression of MYOZ1. Furthermore, we describe a novel eQTL relationship between rs6795970 genotype and the expression of the SCN10A gene. Conclusions: We have analysed the human LA expression via high-throughput RNA sequencing, and identified novel genes and gene variants likely involved in the molecular pathophysiology of AF.
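The cis-eQTL step described in the Methods (pairing each gene with SNPs closer than 1 Mbp and correlating expression with genotype) can be sketched as follows. This is only an illustration of the windowing and correlation idea, not the authors' pipeline: it assumes one reference position per gene, genotypes coded as 0/1/2 allele counts, and a plain Pearson correlation with no covariates or multiple-testing correction. Gene names, SNP names, coordinates, and values are hypothetical.

```python
from math import sqrt

WINDOW_BP = 1_000_000  # "adjacent (<1Mbp)" in the abstract


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def cis_pairs(genes, snps, window=WINDOW_BP):
    """Yield (gene, snp) pairs on the same chromosome within the window."""
    for gene, (g_chrom, g_pos) in genes.items():
        for snp, (s_chrom, s_pos) in snps.items():
            if g_chrom == s_chrom and abs(g_pos - s_pos) < window:
                yield gene, snp


def cis_eqtl_scan(genes, snps, expression, genotypes):
    """Correlate genotype with expression for every cis gene-SNP pair."""
    return {
        (gene, snp): pearson(genotypes[snp], expression[gene])
        for gene, snp in cis_pairs(genes, snps)
    }


# Hypothetical toy data for six individuals.
GENES = {"geneA": ("chr1", 5_000_000), "geneB": ("chr2", 1_000_000)}
SNPS = {
    "snp1": ("chr1", 5_400_000),  # within 1 Mbp of geneA
    "snp2": ("chr1", 7_000_000),  # same chromosome, too far from geneA
    "snp3": ("chr2", 1_200_000),  # within 1 Mbp of geneB
}
GENOTYPES = {
    "snp1": [0, 0, 1, 1, 2, 2],
    "snp2": [2, 0, 1, 0, 2, 1],
    "snp3": [0, 1, 2, 0, 1, 2],
}
EXPRESSION = {
    "geneA": [1.0, 1.2, 2.1, 1.9, 3.0, 3.2],
    "geneB": [2.0, 2.1, 1.9, 2.2, 2.0, 2.1],
}

if __name__ == "__main__":
    for pair, r in cis_eqtl_scan(GENES, SNPS, EXPRESSION, GENOTYPES).items():
        print(pair, round(r, 3))
```

With 1.4 million SNPs the nested loop in `cis_pairs` would be too slow; a real analysis would sort positions per chromosome and scan with an interval lookup, but the pairing rule stays the same.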

