The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes

Background. As whole genome sequence data from bacterial isolates becomes cheaper to generate, computational methods are needed to correlate sequence data with biological observations. Here we present the large-scale BLAST score ratio (LS-BSR) pipeline, which rapidly compares the genetic content of hundreds to thousands of bacterial genomes, and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed. This matrix can be easily parsed in order to identify genetic relationships between bacterial genomes. Although pipelines have been published that group peptides by sequence similarity, no other software performs the large-scale, flexible, full-genome comparative analyses carried out by LS-BSR. Results. To demonstrate the utility of the method, the LS-BSR pipeline was tested on 96 Escherichia coli and Shigella genomes; the pipeline ran in 163 minutes using 16 processors, which is a greater than 7-fold speedup compared to using a single processor. The BSR values for each CDS, which indicate a relative level of relatedness, were then mapped to each genome on an independent core genome single nucleotide polymorphism (SNP) based phylogeny. Comparisons were then used to identify clade specific CDS markers and validate the LS-BSR pipeline based on molecular markers that delineate between classical E. coli pathogenic variant (pathovar) designations. Scalability tests demonstrated that the LS-BSR pipeline can process 1,000 E. coli genomes in ~60h using 16 processors. Conclusions. LS-BSR is an open-source, parallel implementation of the BSR algorithm, enabling rapid comparison of the genetic content of large numbers of genomes. The results of the pipeline can be used to identify specific markers between user-defined phylogenetic groups, and to identify the loss and/or acquisition of genetic information between bacterial isolates. Taxa-specific genetic markers can then be translated into clinical diagnostics, or can be used to identify broadly conserved putative therapeutic candidates.

Download Full-text

Peer Review #2 of "The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes (v0.1)"

10.7287/peerj.332v0.1/reviews/2 ◽

2014 ◽

Keyword(s):

Peer Review ◽

Large Scale ◽

Bacterial Genomes ◽

Blast Score

Download Full-text

The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes

10.7287/peerj.preprints.220 ◽

2014 ◽

Author(s):

Jason W Sahl ◽

Greg Caporaso ◽

David A Rasko ◽

Paul S Keim

Keyword(s):

Large Scale ◽

Sequence Data ◽

Parallel Implementation ◽

Genetic Relationships ◽

Clinical Diagnostics ◽

Whole Genome Sequence ◽

Bacterial Isolates ◽

Bacterial Genomes ◽

E Coli ◽

Blast Score

Background. As whole genome sequence data from bacterial isolates becomes cheaper to generate, computational methods are needed to correlate sequence data with biological observations. Here we present the large-scale BLAST score ratio (LS-BSR) pipeline, which rapidly compares the genetic content of hundreds to thousands of bacterial genomes, and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed. This matrix can be easily parsed in order to identify genetic relationships between bacterial genomes. Although pipelines have been published that group peptides by sequence similarity, no other software performs the large-scale, flexible, full-genome comparative analyses carried out by LS-BSR. Results. To demonstrate the utility of the method, the LS-BSR pipeline was tested on 96 Escherichia coli and Shigella genomes; the pipeline ran in 163 minutes using 16 processors, which is a greater than 7-fold speedup compared to using a single processor. The BSR values for each CDS, which indicate a relative level of relatedness, were then mapped to each genome on an independent core genome single nucleotide polymorphism (SNP) based phylogeny. Comparisons were then used to identify clade specific CDS markers and validate the LS-BSR pipeline based on molecular markers that delineate between classical E. coli pathogenic variant (pathovar) designations. Scalability tests demonstrated that the LS-BSR pipeline can process 1,000 E. coli genomes in ~60h using 16 processors. Conclusions. LS-BSR is an open-source, parallel implementation of the BSR algorithm, enabling rapid comparison of the genetic content of large numbers of genomes. The results of the pipeline can be used to identify specific markers between user-defined phylogenetic groups, and to identify the loss and/or acquisition of genetic information between bacterial isolates. Taxa-specific genetic markers can then be translated into clinical diagnostics, or can be used to identify broadly conserved putative therapeutic candidates.

Download Full-text

Consistent Metagenome-Derived Metrics Verify and Delineate Bacterial Species Boundaries

mSystems ◽

10.1128/msystems.00731-19 ◽

2020 ◽

Vol 5 (1) ◽

Cited By ~ 14

Author(s):

Matthew R. Olm ◽

Alexander Crits-Christoph ◽

Spencer Diamond ◽

Adi Lavy ◽

Paula B. Matheus Carnevali ◽

...

Keyword(s):

Bacterial Diversity ◽

Ribosomal Proteins ◽

Large Scale ◽

Bacterial Species ◽

Bacterial Genome ◽

16S Rrna Genes ◽

Rrna Genes ◽

Species Discrimination ◽

Bacterial Genomes ◽

Discrimination Power

ABSTRACT Longstanding questions relate to the existence of naturally distinct bacterial species and genetic approaches to distinguish them. Bacterial genomes in public databases form distinct groups, but these databases are subject to isolation and deposition biases. To avoid these biases, we compared 5,203 bacterial genomes from 1,457 environmental metagenomic samples to test for distinct clouds of diversity and evaluated metrics that could be used to define the species boundary. Bacterial genomes from the human gut, soil, and the ocean all exhibited gaps in whole-genome average nucleotide identities (ANI) near the previously suggested species threshold of 95% ANI. While genome-wide ratios of nonsynonymous and synonymous nucleotide differences (dN/dS) decrease until ANI values approach ∼98%, two methods for estimating homologous recombination approached zero at ∼95% ANI, supporting breakdown of recombination due to sequence divergence as a species-forming force. We evaluated 107 genome-based metrics for their ability to distinguish species when full genomes are not recovered. Full-length 16S rRNA genes were least useful, in part because they were underrecovered from metagenomes. However, many ribosomal proteins displayed both high metagenomic recoverability and species discrimination power. Taken together, our results verify the existence of sequence-discrete microbial species in metagenome-derived genomes and highlight the usefulness of ribosomal genes for gene-level species discrimination. IMPORTANCE There is controversy about whether bacterial diversity is clustered into distinct species groups or exists as a continuum. To address this issue, we analyzed bacterial genome databases and reports from several previous large-scale environment studies and identified clear discrete groups of species-level bacterial diversity in all cases. Genetic analysis further revealed that quasi-sexual reproduction via horizontal gene transfer is likely a key evolutionary force that maintains bacterial species integrity. We next benchmarked over 100 metrics to distinguish these bacterial species from each other and identified several genes encoding ribosomal proteins with high species discrimination power. Overall, the results from this study provide best practices for bacterial species delineation based on genome content and insight into the nature of bacterial species population genetics.

Download Full-text

Dynamics of Genome Architecture in Rhizobium sp. Strain NGR234

Journal of Bacteriology ◽

10.1128/jb.184.1.171-176.2002 ◽

2002 ◽

Vol 184 (1) ◽

pp. 171-176 ◽

Cited By ~ 62

Author(s):

Patrick Mavingui ◽

Margarita Flores ◽

Xianwu Guo ◽

Guillermo Dávila ◽

Xavier Perret ◽

...

Keyword(s):

Large Scale ◽

Insertion Sequence ◽

Biological Significance ◽

Genome Architecture ◽

Bacterial Genomes ◽

Symbiotic Plasmid ◽

Sequence Elements ◽

Dynamic Structures ◽

Genome Analyses ◽

Insertion Sequence Elements

ABSTRACT Bacterial genomes are usually partitioned in several replicons, which are dynamic structures prone to mutation and genomic rearrangements, thus contributing to genome evolution. Nevertheless, much remains to be learned about the origins and dynamics of the formation of bacterial alternative genomic states and their possible biological consequences. To address these issues, we have studied the dynamics of the genome architecture in Rhizobium sp. strain NGR234 and analyzed its biological significance. NGR234 genome consists of three replicons: the symbiotic plasmid pNGR234a (536,165 bp), the megaplasmid pNGR234b (>2,000 kb), and the chromosome (>3,700 kb). Here we report that genome analyses of cell siblings showed the occurrence of large-scale DNA rearrangements consisting of cointegrations and excisions between the three replicons. As a result, four new genomic architectures have emerged. Three consisted of the cointegrates between two replicons: chromosome-pNGR234a, chromosome-pNGR234b, and pNGR234a-pNGR234b. The other consisted of a cointegrate of the three replicons (chromosome-pNGR234a-pNGR234b). Cointegration and excision of pNGR234a with either the chromosome or pNGR234b were studied and found to proceed via a Campbell-type mechanism, mediated by insertion sequence elements. We provide evidence showing that changes in the genome architecture did not alter the growth and symbiotic proficiency of Rhizobium derivatives.

Download Full-text

SkewIT: The Skew Index Test for large-scale GC Skew analysis of bacterial genomes

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008439 ◽

2020 ◽

Vol 16 (12) ◽

pp. e1008439

Author(s):

Jennifer Lu ◽

Steven L. Salzberg

Keyword(s):

Large Scale ◽

Analysis Tool ◽

Index Test ◽

Bacterial Genomes ◽

Phylogenetic Groups ◽

Bacterial Phyla ◽

Link Type ◽

Gc Skew ◽

A Genome ◽

Web App

GC skew is a phenomenon observed in many bacterial genomes, wherein the two replication strands of the same chromosome contain different proportions of guanine and cytosine nucleotides. Here we demonstrate that this phenomenon, which was first discovered in the mid-1990s, can be used today as an analysis tool for the 15,000+ complete bacterial genomes in NCBI’s Refseq library. In order to analyze all 15,000+ genomes, we introduce a new method, SkewIT (Skew Index Test), that calculates a single metric representing the degree of GC skew for a genome. Using this metric, we demonstrate how GC skew patterns are conserved within certain bacterial phyla, e.g. Firmicutes, but show different patterns in other phylogenetic groups such as Actinobacteria. We also discovered that outlier values of SkewIT highlight potential bacterial mis-assemblies. Using our newly defined metric, we identify multiple mis-assembled chromosomal sequences in previously published complete bacterial genomes. We provide a SkewIT web app https://jenniferlu717.shinyapps.io/SkewIT/ that calculates SkewI for any user-provided bacterial sequence. The web app also provides an interactive interface for the data generated in this paper, allowing users to further investigate the SkewI values and thresholds of the Refseq-97 complete bacterial genomes. Individual scripts for analysis of bacterial genomes are provided in the following repository: https://github.com/jenniferlu717/SkewIT.

Download Full-text

A Universal, Genomewide GuideFinder for CRISPR/Cas9 Targeting in Microbial Genomes

mSphere ◽

10.1128/msphere.00086-20 ◽

2020 ◽

Vol 5 (1) ◽

Author(s):

Michelle Spoto ◽

Changhui Guan ◽

Elizabeth Fleming ◽

Julia Oh

Keyword(s):

Gene Function ◽

Large Scale ◽

Essential Gene ◽

Bacterial Species ◽

Bacterial Genome ◽

Model Organisms ◽

Design Parameters ◽

Bacterial Genomes ◽

Wide Range ◽

User Friendly

ABSTRACT The CRISPR/Cas system has significant potential to facilitate gene editing in a variety of bacterial species. CRISPR interference (CRISPRi) and CRISPR activation (CRISPRa) represent modifications of the CRISPR/Cas9 system utilizing a catalytically inactive Cas9 protein for transcription repression and activation, respectively. While CRISPRi and CRISPRa have tremendous potential to systematically investigate gene function in bacteria, few programs are specifically tailored to identify guides in draft bacterial genomes genomewide. Furthermore, few programs offer open-source code with flexible design parameters for bacterial targeting. To address these limitations, we created GuideFinder, a customizable, user-friendly program that can design guides for any annotated bacterial genome. GuideFinder designs guides from NGG protospacer-adjacent motif (PAM) sites for any number of genes by the use of an annotated genome and FASTA file input by the user. Guides are filtered according to user-defined design parameters and removed if they contain any off-target matches. Iteration with lowered parameter thresholds allows the program to design guides for genes that did not produce guides with the more stringent parameters, one of several features unique to GuideFinder. GuideFinder can also identify paired guides for targeting multiplicity, whose validity we tested experimentally. GuideFinder has been tested on a variety of diverse bacterial genomes, finding guides for 95% of genes on average. Moreover, guides designed by the program are functionally useful—focusing on CRISPRi as a potential application—as demonstrated by essential gene knockdown in two staphylococcal species. Through the large-scale generation of guides, this open-access software will improve accessibility to CRISPR/Cas studies of a variety of bacterial species. IMPORTANCE With the explosion in our understanding of human and environmental microbial diversity, corresponding efforts to understand gene function in these organisms are strongly needed. CRISPR/Cas9 technology has revolutionized interrogation of gene function in a wide variety of model organisms. Efficient CRISPR guide design is required for systematic gene targeting. However, existing tools are not adapted for the broad needs of microbial targeting, which include extraordinary species and subspecies genetic diversity, the overwhelming majority of which is characterized by draft genomes. In addition, flexibility in guide design parameters is important to consider the wide range of factors that can affect guide efficacy, many of which can be species and strain specific. We designed GuideFinder, a customizable, user-friendly program that addresses the limitations of existing software and that can design guides for any annotated bacterial genome with numerous features that facilitate guide design in a wide variety of microorganisms.

Download Full-text

The Distribution of Tryptophan-Dependent Indole-3-Acetic Acid Synthesis Pathways in Bacteria Unraveled by Large-Scale Genomic Analysis

Molecules ◽

10.3390/molecules24071411 ◽

2019 ◽

Vol 24 (7) ◽

pp. 1411 ◽

Cited By ~ 12

Author(s):

Pengfan Zhang ◽

Tao Jin ◽

Sunil Kumar Sahu ◽

Jin Xu ◽

Qiong Shi ◽

...

Keyword(s):

Acetic Acid ◽

Large Scale ◽

Growth Promotion ◽

Plant Root ◽

Genomic Analysis ◽

Acid Synthesis ◽

Metagenomic Data ◽

Bacterial Genomes ◽

Bacterial Phyla ◽

Indole 3 Acetic Acid

Bacterial indole-3-acetic acid (IAA), an effector molecule in microbial physiology, plays an important role in plant growth-promotion. Here, we comprehensively analyzed about 7282 prokaryotic genomes representing diverse bacterial phyla, combined with root-associated metagenomic data to unravel the distribution of tryptophan-dependent IAA synthesis pathways and to quantify the IAA synthesis-related genes in the plant root environments. We found that 82.2% of the analyzed bacterial genomes were potentially capable of synthesizing IAA from tryptophan (Trp) or intermediates. Interestingly, several phylogenetically diverse bacteria showed a preferential tendency to utilize different pathways and tryptamine and indole-3-pyruvate pathways are most prevalent in bacteria. About 45.3% of the studied genomes displayed multiple coexisting pathways, constituting complex IAA synthesis systems. Furthermore, root-associated metagenomic analyses revealed that rhizobacteria mainly synthesize IAA via indole-3-acetamide (IAM) and tryptamine (TMP) pathways and might possess stronger IAA synthesis abilities than bacteria colonizing other environments. The obtained results refurbished our understanding of bacterial IAA synthesis pathways and provided a faster and less labor-intensive alternative to physiological screening based on genome collections. The better understanding of IAA synthesis among bacterial communities could maximize the utilization of bacterial IAA to augment the crop growth and physiological function.

Download Full-text

Platanus_B: an accurate de novo assembler for bacterial genomes using an iterative error-removal process

DNA Research ◽

10.1093/dnares/dsaa014 ◽

2020 ◽

Vol 27 (3) ◽

Cited By ~ 1

Author(s):

Rei Kajitani ◽

Dai Yoshimura ◽

Yoshitoshi Ogura ◽

Yasuhiro Gotoh ◽

Tetsuya Hayashi ◽

...

Keyword(s):

High Resolution ◽

Large Scale ◽

De Novo ◽

Bacterial Genomes ◽

Long Reads ◽

Wide Range ◽

Long Read ◽

Phylogenomic Analyses ◽

Error Removal ◽

Iterative Error

Abstract De novo assembly of short DNA reads remains an essential technology, especially for large-scale projects and high-resolution variant analyses in epidemiology. However, the existing tools often lack sufficient accuracy required to compare closely related strains. To facilitate such studies on bacterial genomes, we developed Platanus_B, a de novo assembler that employs iterations of multiple error-removal algorithms. The benchmarks demonstrated the superior accuracy and high contiguity of Platanus_B, in addition to its ability to enhance the hybrid assembly of both short and nanopore long reads. Although the hybrid strategies for short and long reads were effective in achieving near full-length genomes, we found that short-read-only assemblies generated with Platanus_B were sufficient to obtain ≥90% of exact coding sequences in most cases. In addition, while nanopore long-read-only assemblies lacked fine-scale accuracies, inclusion of short reads was effective in improving the accuracies. Platanus_B can, therefore, be used for comprehensive genomic surveillances of bacterial pathogens and high-resolution phylogenomic analyses of a wide range of bacteria.

Download Full-text