Peer Review #2 of "The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes (v0.1)"

Background. As whole genome sequence data from bacterial isolates becomes cheaper to generate, computational methods are needed to correlate sequence data with biological observations. Here we present the large-scale BLAST score ratio (LS-BSR) pipeline, which rapidly compares the genetic content of hundreds to thousands of bacterial genomes, and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed. This matrix can be easily parsed in order to identify genetic relationships between bacterial genomes. Although pipelines have been published that group peptides by sequence similarity, no other software performs the large-scale, flexible, full-genome comparative analyses carried out by LS-BSR. Results. To demonstrate the utility of the method, the LS-BSR pipeline was tested on 96 Escherichia coli and Shigella genomes; the pipeline ran in 163 minutes using 16 processors, which is a greater than 7-fold speedup compared to using a single processor. The BSR values for each CDS, which indicate a relative level of relatedness, were then mapped to each genome on an independent core genome single nucleotide polymorphism (SNP) based phylogeny. Comparisons were then used to identify clade specific CDS markers and validate the LS-BSR pipeline based on molecular markers that delineate between classical E. coli pathogenic variant (pathovar) designations. Scalability tests demonstrated that the LS-BSR pipeline can process 1,000 E. coli genomes in ~60h using 16 processors. Conclusions. LS-BSR is an open-source, parallel implementation of the BSR algorithm, enabling rapid comparison of the genetic content of large numbers of genomes. The results of the pipeline can be used to identify specific markers between user-defined phylogenetic groups, and to identify the loss and/or acquisition of genetic information between bacterial isolates. Taxa-specific genetic markers can then be translated into clinical diagnostics, or can be used to identify broadly conserved putative therapeutic candidates.

Download Full-text

The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes

10.7287/peerj.preprints.220 ◽

2014 ◽

Author(s):

Jason W Sahl ◽

Greg Caporaso ◽

David A Rasko ◽

Paul S Keim

Keyword(s):

Large Scale ◽

Sequence Data ◽

Parallel Implementation ◽

Genetic Relationships ◽

Clinical Diagnostics ◽

Whole Genome Sequence ◽

Bacterial Isolates ◽

Bacterial Genomes ◽

E Coli ◽

Blast Score

Background. As whole genome sequence data from bacterial isolates becomes cheaper to generate, computational methods are needed to correlate sequence data with biological observations. Here we present the large-scale BLAST score ratio (LS-BSR) pipeline, which rapidly compares the genetic content of hundreds to thousands of bacterial genomes, and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed. This matrix can be easily parsed in order to identify genetic relationships between bacterial genomes. Although pipelines have been published that group peptides by sequence similarity, no other software performs the large-scale, flexible, full-genome comparative analyses carried out by LS-BSR. Results. To demonstrate the utility of the method, the LS-BSR pipeline was tested on 96 Escherichia coli and Shigella genomes; the pipeline ran in 163 minutes using 16 processors, which is a greater than 7-fold speedup compared to using a single processor. The BSR values for each CDS, which indicate a relative level of relatedness, were then mapped to each genome on an independent core genome single nucleotide polymorphism (SNP) based phylogeny. Comparisons were then used to identify clade specific CDS markers and validate the LS-BSR pipeline based on molecular markers that delineate between classical E. coli pathogenic variant (pathovar) designations. Scalability tests demonstrated that the LS-BSR pipeline can process 1,000 E. coli genomes in ~60h using 16 processors. Conclusions. LS-BSR is an open-source, parallel implementation of the BSR algorithm, enabling rapid comparison of the genetic content of large numbers of genomes. The results of the pipeline can be used to identify specific markers between user-defined phylogenetic groups, and to identify the loss and/or acquisition of genetic information between bacterial isolates. Taxa-specific genetic markers can then be translated into clinical diagnostics, or can be used to identify broadly conserved putative therapeutic candidates.

Download Full-text

The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes

PeerJ ◽

10.7717/peerj.332 ◽

2014 ◽

Vol 2 ◽

pp. e332 ◽

Cited By ~ 134

Author(s):

Jason W. Sahl ◽

J. Gregory Caporaso ◽

David A. Rasko ◽

Paul Keim

Keyword(s):

Large Scale ◽

Bacterial Genomes ◽

Blast Score

Download Full-text

Consistent Metagenome-Derived Metrics Verify and Delineate Bacterial Species Boundaries

mSystems ◽

10.1128/msystems.00731-19 ◽

2020 ◽

Vol 5 (1) ◽

Cited By ~ 14

Author(s):

Matthew R. Olm ◽

Alexander Crits-Christoph ◽

Spencer Diamond ◽

Adi Lavy ◽

Paula B. Matheus Carnevali ◽

...

Keyword(s):

Bacterial Diversity ◽

Ribosomal Proteins ◽

Large Scale ◽

Bacterial Species ◽

Bacterial Genome ◽

16S Rrna Genes ◽

Rrna Genes ◽

Species Discrimination ◽

Bacterial Genomes ◽

Discrimination Power

ABSTRACT Longstanding questions relate to the existence of naturally distinct bacterial species and genetic approaches to distinguish them. Bacterial genomes in public databases form distinct groups, but these databases are subject to isolation and deposition biases. To avoid these biases, we compared 5,203 bacterial genomes from 1,457 environmental metagenomic samples to test for distinct clouds of diversity and evaluated metrics that could be used to define the species boundary. Bacterial genomes from the human gut, soil, and the ocean all exhibited gaps in whole-genome average nucleotide identities (ANI) near the previously suggested species threshold of 95% ANI. While genome-wide ratios of nonsynonymous and synonymous nucleotide differences (dN/dS) decrease until ANI values approach ∼98%, two methods for estimating homologous recombination approached zero at ∼95% ANI, supporting breakdown of recombination due to sequence divergence as a species-forming force. We evaluated 107 genome-based metrics for their ability to distinguish species when full genomes are not recovered. Full-length 16S rRNA genes were least useful, in part because they were underrecovered from metagenomes. However, many ribosomal proteins displayed both high metagenomic recoverability and species discrimination power. Taken together, our results verify the existence of sequence-discrete microbial species in metagenome-derived genomes and highlight the usefulness of ribosomal genes for gene-level species discrimination. IMPORTANCE There is controversy about whether bacterial diversity is clustered into distinct species groups or exists as a continuum. To address this issue, we analyzed bacterial genome databases and reports from several previous large-scale environment studies and identified clear discrete groups of species-level bacterial diversity in all cases. Genetic analysis further revealed that quasi-sexual reproduction via horizontal gene transfer is likely a key evolutionary force that maintains bacterial species integrity. We next benchmarked over 100 metrics to distinguish these bacterial species from each other and identified several genes encoding ribosomal proteins with high species discrimination power. Overall, the results from this study provide best practices for bacterial species delineation based on genome content and insight into the nature of bacterial species population genetics.

Download Full-text

Dynamics of Genome Architecture in Rhizobium sp. Strain NGR234

Journal of Bacteriology ◽

10.1128/jb.184.1.171-176.2002 ◽

2002 ◽

Vol 184 (1) ◽

pp. 171-176 ◽

Cited By ~ 62

Author(s):

Patrick Mavingui ◽

Margarita Flores ◽

Xianwu Guo ◽

Guillermo Dávila ◽

Xavier Perret ◽

...

Keyword(s):

Large Scale ◽

Insertion Sequence ◽

Biological Significance ◽

Genome Architecture ◽

Bacterial Genomes ◽

Symbiotic Plasmid ◽

Sequence Elements ◽

Dynamic Structures ◽

Genome Analyses ◽

Insertion Sequence Elements

ABSTRACT Bacterial genomes are usually partitioned in several replicons, which are dynamic structures prone to mutation and genomic rearrangements, thus contributing to genome evolution. Nevertheless, much remains to be learned about the origins and dynamics of the formation of bacterial alternative genomic states and their possible biological consequences. To address these issues, we have studied the dynamics of the genome architecture in Rhizobium sp. strain NGR234 and analyzed its biological significance. NGR234 genome consists of three replicons: the symbiotic plasmid pNGR234a (536,165 bp), the megaplasmid pNGR234b (>2,000 kb), and the chromosome (>3,700 kb). Here we report that genome analyses of cell siblings showed the occurrence of large-scale DNA rearrangements consisting of cointegrations and excisions between the three replicons. As a result, four new genomic architectures have emerged. Three consisted of the cointegrates between two replicons: chromosome-pNGR234a, chromosome-pNGR234b, and pNGR234a-pNGR234b. The other consisted of a cointegrate of the three replicons (chromosome-pNGR234a-pNGR234b). Cointegration and excision of pNGR234a with either the chromosome or pNGR234b were studied and found to proceed via a Campbell-type mechanism, mediated by insertion sequence elements. We provide evidence showing that changes in the genome architecture did not alter the growth and symbiotic proficiency of Rhizobium derivatives.

Download Full-text

SkewIT: The Skew Index Test for large-scale GC Skew analysis of bacterial genomes

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008439 ◽

2020 ◽

Vol 16 (12) ◽

pp. e1008439

Author(s):

Jennifer Lu ◽

Steven L. Salzberg

Keyword(s):

Large Scale ◽

Analysis Tool ◽

Index Test ◽

Bacterial Genomes ◽

Phylogenetic Groups ◽

Bacterial Phyla ◽

Link Type ◽

Gc Skew ◽

A Genome ◽

Web App

GC skew is a phenomenon observed in many bacterial genomes, wherein the two replication strands of the same chromosome contain different proportions of guanine and cytosine nucleotides. Here we demonstrate that this phenomenon, which was first discovered in the mid-1990s, can be used today as an analysis tool for the 15,000+ complete bacterial genomes in NCBI’s Refseq library. In order to analyze all 15,000+ genomes, we introduce a new method, SkewIT (Skew Index Test), that calculates a single metric representing the degree of GC skew for a genome. Using this metric, we demonstrate how GC skew patterns are conserved within certain bacterial phyla, e.g. Firmicutes, but show different patterns in other phylogenetic groups such as Actinobacteria. We also discovered that outlier values of SkewIT highlight potential bacterial mis-assemblies. Using our newly defined metric, we identify multiple mis-assembled chromosomal sequences in previously published complete bacterial genomes. We provide a SkewIT web app https://jenniferlu717.shinyapps.io/SkewIT/ that calculates SkewI for any user-provided bacterial sequence. The web app also provides an interactive interface for the data generated in this paper, allowing users to further investigate the SkewI values and thresholds of the Refseq-97 complete bacterial genomes. Individual scripts for analysis of bacterial genomes are provided in the following repository: https://github.com/jenniferlu717/SkewIT.

Download Full-text