ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses

PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11376
Author(s):  
Natasha Pavlovikj ◽  
Joao Carlos Gomes-Neto ◽  
Jitender S. Deogun ◽  
Andrew K. Benson

Whole Genome Sequence (WGS) data from bacterial species is used for a variety of applications ranging from basic microbiological research to diagnostics and epidemiological surveillance. The availability of WGS data from hundreds of thousands of individual isolates of individual microbial species presents a tremendous opportunity for discovery and hypothesis-generating research into the ecology and evolution of these microorganisms. The limited flexibility, scalability, and user-friendliness of existing pipelines for population-scale inquiry, however, constrain applications of systematic, population-scale approaches. Here, we present ProkEvo, an automated, scalable, reproducible, and open-source framework for bacterial population genomics analyses using WGS data. ProkEvo was specifically developed to achieve the following goals: (1) Automation and scaling of complex combinations of computational analyses for many thousands of bacterial genomes from inputs of raw Illumina paired-end sequence reads; (2) Use of workflow management systems (WMS) such as Pegasus WMS to ensure reproducibility, scalability, modularity, fault-tolerance, and robust file management throughout the process; (3) Use of high-performance and high-throughput computational platforms; (4) Generation of hierarchical-based population structure analysis based on combinations of multi-locus and Bayesian statistical approaches for classification for ecological and epidemiological inquiries; (5) Association of antimicrobial resistance (AMR) genes, putative virulence factors, and plasmids from curated databases with the hierarchically-related genotypic classifications; and (6) Production of pan-genome annotations and data compilation that can be utilized for downstream analysis such as identification of population-specific genomic signatures. The scalability of ProkEvo was measured with two datasets comprising significantly different numbers of input genomes (one with ~2,400 genomes, and the second with ~23,000 genomes). Depending on the dataset and the computational platform used, the running time of ProkEvo varied from ~3 to 26 days. ProkEvo can be used with virtually any bacterial species, and the Pegasus WMS uniquely facilitates addition or removal of programs from the workflow or modification of options within them. To demonstrate the versatility of the ProkEvo platform, we performed hierarchical-based population structure analyses from available genomes of three distinct pathogenic bacterial species as individual case studies. The specific case studies illustrate how hierarchical analyses of population structures, genotype frequencies, and distribution of specific gene functions can be integrated into an analysis. Collectively, our study shows that ProkEvo presents a practical, viable option for scalable, automated analyses of bacterial populations with direct applications for basic microbiology research, clinical microbiological diagnostics, and epidemiological surveillance.

2020 ◽  
Author(s):  
Natasha Pavlovikj ◽  
Joao Carlos Gomes-Neto ◽  
Jitender S. Deogun ◽  
Andrew K. Benson

Abstract: Whole Genome Sequence (WGS) data from bacterial species is used for a variety of applications ranging from basic microbiological research to diagnostics and epidemiological surveillance. The availability of WGS data from hundreds of thousands of individual isolates of individual microbial species presents a tremendous opportunity for discovery and hypothesis-generating research into the ecology and evolution of these microorganisms. The limited scalability and user-friendliness of existing pipelines for population-scale inquiry, however, constrain applications of systematic, population-scale approaches. Here, we present ProkEvo, an automated, scalable, and open-source framework for bacterial population genomics analyses using WGS data. ProkEvo was specifically developed to achieve the following goals: 1) Automation and scaling of complex combinations of computational analyses for many thousands of bacterial genomes from inputs of raw Illumina paired-end sequence reads; 2) Use of workflow management systems (WMS) such as Pegasus WMS to ensure reproducibility, scalability, modularity, fault-tolerance, and robust file management throughout the process; 3) Use of high-performance and high-throughput computational platforms; 4) Generation of hierarchical population-based genotypes at different scales of resolution based on combinations of multi-locus and Bayesian statistical approaches for classification; 5) Detection of antimicrobial resistance (AMR) genes, putative virulence factors, and plasmids from curated databases and association with genotypic classifications; and 6) Production of pan-genome annotations and data compilation that can be utilized for downstream analysis. The scalability of ProkEvo was measured with two datasets comprising significantly different numbers of input genomes (one with ~2,400 genomes, and the second with ~23,000 genomes). Depending on the dataset and the computational platform used, the running time of ProkEvo varied from ~3 to 26 days. ProkEvo can be used with virtually any bacterial species, and the Pegasus WMS facilitates addition or removal of programs from the workflow or modification of options within them. All the dependencies of ProkEvo can be distributed via a conda environment or a Docker image. To demonstrate the versatility of the ProkEvo platform, we performed population-based analyses from available genomes of three distinct pathogenic bacterial species as individual case studies (three serovars of Salmonella enterica, as well as Campylobacter jejuni and Staphylococcus aureus). The specific case studies used reproducible Python and R scripts documented in Jupyter Notebooks and collectively illustrate how hierarchical analyses of population structures, genotype frequencies, and distribution of specific gene functions can be used to generate novel hypotheses about the evolutionary history and ecological characteristics of specific populations of each pathogen. Collectively, our study shows that ProkEvo presents a viable option for scalable, automated analyses of bacterial populations with powerful applications for basic microbiology research, clinical microbiological diagnostics, and epidemiological surveillance.


Forests ◽  
2019 ◽  
Vol 10 (8) ◽  
pp. 681 ◽  
Author(s):  
Huiquan Zheng ◽  
Dehuo Hu ◽  
Ruping Wei ◽  
Shu Yan ◽  
Runhui Wang

Knowledge of population diversity and structure is of fundamental importance for conifer breeding programs. In this study, we concentrated on the development and application of high-density single nucleotide polymorphism (SNP) markers through a high-throughput sequencing technique termed specific-locus amplified fragment sequencing (SLAF-seq) for the economically important conifer tree species Chinese fir (Cunninghamia lanceolata). Based on SLAF-seq, we successfully established a high-density SNP panel consisting of 108,753 genomic SNPs from Chinese fir. This SNP panel helped us gain insight into the genetic base of the Chinese fir advanced breeding population of 221 genotypes, including its genetic variation, relationships, diversity, and population structure. Overall, the present population appears to have considerable genetic variability. Most (94.15%) of the variability was attributed to the genetic differentiation of genotypes; very limited (5.85%) variation occurred at the population (sub-origin set) level. Correspondingly, low FST (0.0285–0.0990) values were seen for the sub-origin sets. When viewing the genetic structure of the population regardless of its sub-origin sets, the SNP data revealed a new picture in which the advanced Chinese fir breeding population could be divided into four genetic sets, as evidenced by the phylogenetic tree and population structure analysis results, albeit with some differences in membership of the corresponding sets (cluster vs. group). The data also suggested that all the genetic sets were admixed clades, revealing a complex relationship among the genotypes of this population. With a stepwise pruning procedure, we captured a core collection (core 0.650) harboring 143 genotypes that maintains all the alleles, diversity, and specific genetic structure of the whole population. This generalist core is valuable for the Chinese fir advanced breeding program and for further genetic/genomic studies.


Open Biology ◽  
2015 ◽  
Vol 5 (1) ◽  
pp. 140133 ◽  
Author(s):  
Nitin Kumar ◽  
Ganesh Lad ◽  
Elisa Giuntini ◽  
Maria E. Kaye ◽  
Piyachat Udomwong ◽  
...  

Biological species may remain distinct because of genetic isolation or ecological adaptation, but these two aspects do not always coincide. To establish the nature of the species boundary within a local bacterial population, we characterized a sympatric population of the bacterium Rhizobium leguminosarum by genomic sequencing of 72 isolates. Although all strains have 16S rRNA typical of R. leguminosarum, they fall into five genospecies by the criterion of average nucleotide identity (ANI). Many genes, on plasmids as well as the chromosome, support this division: recombination of core genes has been largely within genospecies. Nevertheless, variation in ecological properties, including symbiotic host range and carbon-source utilization, cuts across these genospecies, so that none of these phenotypes is diagnostic of genospecies. This phenotypic variation is conferred by mobile genes. The genospecies meet the Mayr criteria for biological species in respect of their core genes, but do not correspond to coherent ecological groups, so periodic selection may not be effective in purging variation within them. The population structure is incompatible with traditional ‘polyphasic taxonomy’ that requires bacterial species to have both phylogenetic coherence and distinctive phenotypes. More generally, genomics has revealed that many bacterial species share adaptive modules by horizontal gene transfer, and we envisage a more consistent taxonomic framework that explicitly recognizes this. Significant phenotypes should be recognized as ‘biovars’ within species that are defined by core gene phylogeny.
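The ANI criterion mentioned in this abstract can be illustrated with a small sketch. It groups strains into candidate genospecies by linking any pair whose ANI meets a threshold. The 0.95 default, the function name `cluster_by_ani`, and the single-linkage grouping are assumptions made for illustration only; the abstract does not state the threshold or the clustering method the authors used.

```python
from typing import Dict, List, Tuple


def cluster_by_ani(
    names: List[str],
    ani: Dict[Tuple[str, str], float],
    threshold: float = 0.95,  # assumed cut-off, not taken from the abstract
) -> List[List[str]]:
    """Group strains by single-linkage on pairwise ANI (hypothetical helper)."""
    # Union-find: every strain starts as its own group.
    parent = {name: name for name in names}

    def find(x: str) -> str:
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of any two strains whose ANI meets the threshold.
    for (a, b), value in ani.items():
        if value >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Collect the members of each group and return them in a stable order.
    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())
```

Because the grouping is single-linkage, two strains can land in the same group through an intermediate strain even when their own pairwise ANI falls below the threshold; a stricter rule (for example, complete linkage) would be a reasonable alternative.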


2021 ◽  
Author(s):  
John A Lees ◽  
Gerry Tonkin-Hill ◽  
Zhirong Yang ◽  
Jukka Corander

In less than a decade, population genomics of microbes has progressed from the effort of sequencing dozens of strains to thousands, or even tens of thousands of strains in a single study. There are now hundreds of thousands of genomes available even for a single bacterial species and the number of genomes is expected to continue to increase at an accelerated pace given the advances in sequencing technology and widespread genomic surveillance initiatives. This explosion of data calls for innovative methods to enable rapid exploration of the structure of a population based on different data modalities, such as multiple sequence alignments, assemblies and estimates of gene content across different genomes. Here we present Mandrake, an efficient implementation of a dimensional reduction method tailored for the needs of large-scale population genomics. Mandrake is capable of visualising population structure from millions of whole genomes and we illustrate its usefulness with several data sets representing major pathogens. Our method is freely available both as an analysis pipeline (https://github.com/johnlees/mandrake) and as a browser-based interactive application (https://gtonkinhill.github.io/mandrake-web/).


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Linlin Zhao ◽  
Fangyuan Qu ◽  
Na Song ◽  
Zhiqiang Han ◽  
Tianxiang Gao ◽  
...  

Background: Understanding the genetic structure and local adaptive evolutionary mechanisms of marine organisms is crucial for the management of biological resources. As an ecologically and commercially important small shallow-sea fish, Collichthys lucidus plays a vital role in the structure and functioning of marine ecosystem processes. C. lucidus has been shown to have an obvious population structure. Therefore, it is an ideal candidate for investigating population differentiation and local adaptation under heterogeneous environmental pressure. Results: A total of 184,708 high-quality single nucleotide polymorphisms (SNPs) were identified and applied to elucidate the fine-scale genetic structure and local thermal adaptation of 8 C. lucidus populations. Population structure analysis based on all SNPs indicated that the northern group and southern group of C. lucidus are strongly differentiated. Moreover, 314 SNPs were found to be significantly associated with temperature variation, and annotations of genes containing temperature-related SNPs suggested that these genes are involved in material (protein, lipid, and carbohydrate) metabolism and immune responses. Conclusion: The high genetic differentiation of the 8 C. lucidus populations may have been caused by long-term geographic isolation during the glacial period. Moreover, we suspect that variation in the genes associated with material (protein, lipid, and carbohydrate) metabolism and immune responses is critical for adaptation to spatially heterogeneous temperatures in natural C. lucidus populations. In conclusion, this study could help us determine how C. lucidus populations will respond to future rises in ocean temperature.


2020 ◽  
Author(s):  
TEWODROS TESFAYE NEGASH ◽  
KASSAHUN TESFAYE ◽  
GEMECHU KENENI WAKEYO ◽  
CATHRINE ZIYOMO

Background: Sesame is an important oil crop widely cultivated in Africa and Asia. Characterization of the genetic diversity and population structure of sesame genotypes from these continents can be used to design breeding methods. In the present study, 300 sesame genotypes, comprising 209 local collections, 75 exotic collections, and 16 released varieties provided by the Ethiopian Biodiversity Institute and research centers, were used. Results: The panel was genotyped using two ultra-high-throughput diversity array technology (DArT) marker types (silicoDArT and SNP). Both marker types were used to assess the genetic diversity and population structure of sesame germplasm. A total of 6115 silicoDArT and 6474 SNP markers were reported, of which 5002 silicoDArT and 4638 SNP markers were retained after screening with quality control parameters. The average polymorphic information content values of silicoDArT and SNP markers were 0.07 and 0.08, respectively. For further analysis, the allele frequency at each SNP site was calculated and the sites were filtered using a minor allele frequency (MAF) threshold of 0.01, leaving 2997 high-quality SNPs evenly distributed across the whole genome for subsequent analysis. All genotypes used in this study originated from eight geographical origins. The genetic diversity analysis showed that the average nucleotide diversity of the panel was 0.14. When the genotypes were grouped by geographical origin, the African collections as a whole, excluding the Ethiopian collection, were more diverse (0.21) than the Asian collections; when Africa was further partitioned, the North African collection (0.23) was more diverse than the others. At the continent level, however, Asia (0.17) was more diverse than Africa (0.14). The genetic distance among the sesame populations ranged from 0.015 to 0.394, with an average of 0.165. The sesame populations were clustered into four groups. The structure analysis divided the panel into four subgroups, and 21 genotypes were classified as admixed. These results indicate that genotypes from the same origin did not cluster according to their country of origin. Conclusions: The genetic diversity and population structure revealed in this study should guide future research in designing association studies and in the systematic utilization of the genetic variation characterizing the sesame panel.
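The two per-marker statistics named in this abstract, minor allele frequency (MAF) and polymorphic information content (PIC), can be sketched for a biallelic SNP as follows. This is a minimal illustration, assuming diploid genotypes coded as 0/1/2 copies of the alternate allele with `None` for missing calls, and using the standard biallelic PIC formula 1 - (p² + q²) - 2p²q². The function names and the genotype encoding are hypothetical, and this is not presented as the authors' pipeline.

```python
from typing import Dict, List, Optional, Tuple

Genotypes = List[Optional[int]]  # 0, 1, or 2 alternate alleles; None = missing


def allele_frequencies(genotypes: Genotypes) -> Tuple[float, float]:
    """Return (reference, alternate) allele frequencies for one SNP."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes at this site")
    # Each diploid individual carries two alleles.
    q = sum(called) / (2 * len(called))
    return 1.0 - q, q


def minor_allele_frequency(genotypes: Genotypes) -> float:
    """Frequency of the rarer of the two alleles."""
    return min(allele_frequencies(genotypes))


def pic_biallelic(genotypes: Genotypes) -> float:
    """Polymorphic information content of a biallelic marker."""
    p, q = allele_frequencies(genotypes)
    return 1.0 - (p * p + q * q) - 2.0 * p * p * q * q


def filter_by_maf(snps: Dict[str, Genotypes], threshold: float = 0.01) -> Dict[str, Genotypes]:
    """Keep only SNPs whose MAF is at least the threshold."""
    return {
        name: genotypes
        for name, genotypes in snps.items()
        if minor_allele_frequency(genotypes) >= threshold
    }
```

For a biallelic marker PIC peaks at 0.375 when both alleles are equally frequent, so average values such as the 0.07 and 0.08 reported above correspond to markers whose minor alleles are mostly rare.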


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Xiaoting Xia ◽  
Shunjin Zhang ◽  
Huaju Zhang ◽  
Zijing Zhang ◽  
Ningbo Chen ◽  
...  

Background: Native cattle breeds are an important source of genetic variation because they might carry alleles that enable them to adapt to local environments and tough feeding conditions. Jiaxian Red, a Chinese native cattle breed, is reported to have originated from crossbreeding between taurine and indicine cattle; their history as a draft and meat animal dates back at least 30 years. Using whole-genome sequencing (WGS) data of 30 animals from the core breeding farm, we investigated the genetic diversity, population structure and genomic regions under selection of Jiaxian Red cattle. Furthermore, we used 131 published genomes of world-wide cattle to characterize the genomic variation of Jiaxian Red cattle. Results: The population structure analysis revealed that Jiaxian Red cattle harboured ancestry from East Asian taurine (0.493), Chinese indicine (0.379), European taurine (0.095) and Indian indicine (0.033). Three methods (nucleotide diversity, linkage disequilibrium decay and runs of homozygosity) implied relatively high genomic diversity in Jiaxian Red cattle. We used θπ, CLR, FST and XP-EHH methods to look for candidate signatures of positive selection in Jiaxian Red cattle. A total of 171 (θπ and CLR) and 17 (FST and XP-EHH) shared genes were identified using different detection strategies. Functional annotation analysis revealed that these genes are potentially responsible for growth and feed efficiency (CCSER1), meat quality traits (ROCK2, PPP1R12A, CYB5R4, EYA3, PHACTR1), fertility (RFX4, SRD5A2) and immune system response (SLAMF1, CD84 and SLAMF6). Conclusion: We provide a comprehensive overview of sequence variations in Jiaxian Red cattle genomes. Selection signatures were detected in genomic regions that are possibly related to economically important traits in Jiaxian Red cattle. We observed a high level of genomic diversity and low inbreeding in Jiaxian Red cattle. These results provide a basis for further resource protection and breeding improvement of this breed.


2021 ◽  
Vol 134 (5) ◽  
pp. 1343-1362
Author(s):  
Alex C. Ogbonna ◽  
Luciano Rogerio Braatz de Andrade ◽  
Lukas A. Mueller ◽  
Eder Jorge de Oliveira ◽  
Guillaume J. Bauchet

Key message: Brazilian cassava diversity was characterized through population genetics and clustering approaches, highlighting contrasted genetic groups and spatial genetic differentiation. Abstract: Cassava (Manihot esculenta Crantz) is a major staple root crop of the tropics, originating from the Amazonian region. In this study, 3354 cassava landraces and modern breeding lines from the Embrapa Cassava Germplasm Bank (CGB) were characterized. All individuals were subjected to genotyping-by-sequencing (GBS), identifying 27,045 single-nucleotide polymorphisms (SNPs). Identity-by-state and population structure analyses revealed a unique set of 1536 individuals and 10 distinct genetic groups with heterogeneous linkage disequilibrium (LD). On this basis, a density of 1300–4700 SNP markers was selected for large-effect quantitative trait loci (QTL) detection. Identified genetic groups were further characterized for population genetics parameters including minor allele frequency (MAF), observed heterozygosity ($H_o$), effective population size estimate ($\widehat{N}_e$) and polymorphism information content (PIC). Selection footprints and introgressions of M. glaziovii were detected. Spatial population structure analysis revealed five ancestral populations related to distinct Brazilian ecoregions. Estimation of historical relationships among identified populations suggests an early population split from Amazonian to Atlantic forest and Caatinga ecoregions and active gene flows. This study provides a thorough genetic characterization of ex situ germplasm resources from cassava’s center of origin, South America, with results shedding light on Brazilian cassava characteristics and its biogeographical landscape. These findings support and facilitate the use of genetic resources in modern breeding programs including implementation of association mapping and genomic selection strategies.
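Observed heterozygosity, one of the population parameters listed in this abstract, is simple to compute per locus, and comparing it with the heterozygosity expected under Hardy-Weinberg proportions is a common sanity check. The sketch below assumes diploid biallelic genotypes coded as 0/1/2 alternate-allele counts with `None` for missing calls; the function names are hypothetical and the code is not drawn from the authors' analysis.

```python
from typing import List, Optional

Genotypes = List[Optional[int]]  # 0, 1, or 2 alternate alleles; None = missing


def _called(genotypes: Genotypes) -> List[int]:
    """Drop missing calls and fail loudly if nothing is left."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes at this locus")
    return called


def observed_heterozygosity(genotypes: Genotypes) -> float:
    """Fraction of called individuals that are heterozygous (coded 1)."""
    called = _called(genotypes)
    return sum(1 for g in called if g == 1) / len(called)


def expected_heterozygosity(genotypes: Genotypes) -> float:
    """Heterozygosity expected under Hardy-Weinberg proportions, 2pq."""
    called = _called(genotypes)
    q = sum(called) / (2 * len(called))  # alternate-allele frequency
    return 2.0 * q * (1.0 - q)
```

A locus where the observed value falls well below the expected one, for example genotypes `[0, 0, 2, 2]`, points to a deficit of heterozygotes, which can arise from inbreeding or from population substructure.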


2021 ◽  
Vol 7 (7) ◽  
pp. eabe5054
Author(s):  
Qianxin Wu ◽  
Chenqu Suo ◽  
Tom Brown ◽  
Tengyao Wang ◽  
Sarah A. Teichmann ◽  
...  

We present INSIGHT [isothermal NASBA (nucleic acid sequence–based amplification) sequencing–based high-throughput test], a two-stage coronavirus disease 2019 testing strategy, using a barcoded isothermal NASBA reaction. It combines point-of-care diagnosis with next-generation sequencing, aiming to achieve population-scale testing. Stage 1 allows a quick decentralized readout for early isolation of presymptomatic or asymptomatic patients. It gives results within 1 to 2 hours, using either fluorescence detection or a lateral flow readout, while simultaneously incorporating sample-specific barcodes. The same reaction products from potentially hundreds of thousands of samples can then be pooled and used in a highly multiplexed sequencing–based assay in stage 2. This second stage confirms the near-patient testing results and facilitates centralized data collection. The 95% limit of detection is <50 copies of viral RNA per reaction. INSIGHT is suitable for further development into a rapid home-based, point-of-care assay and is potentially scalable to the population level.

