The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes

2014 ◽  
Author(s):  
Jason W Sahl ◽  
Greg Caporaso ◽  
David A Rasko ◽  
Paul S Keim

Background. As whole genome sequence data from bacterial isolates becomes cheaper to generate, computational methods are needed to correlate sequence data with biological observations. Here we present the large-scale BLAST score ratio (LS-BSR) pipeline, which rapidly compares the genetic content of hundreds to thousands of bacterial genomes, and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed. This matrix can be easily parsed in order to identify genetic relationships between bacterial genomes. Although pipelines have been published that group peptides by sequence similarity, no other software performs the large-scale, flexible, full-genome comparative analyses carried out by LS-BSR. Results. To demonstrate the utility of the method, the LS-BSR pipeline was tested on 96 Escherichia coli and Shigella genomes; the pipeline ran in 163 minutes using 16 processors, which is a greater than 7-fold speedup compared to using a single processor. The BSR values for each CDS, which indicate a relative level of relatedness, were then mapped to each genome on an independent core genome single nucleotide polymorphism (SNP) based phylogeny. Comparisons were then used to identify clade-specific CDS markers and validate the LS-BSR pipeline based on molecular markers that delineate classical E. coli pathogenic variant (pathovar) designations. Scalability tests demonstrated that the LS-BSR pipeline can process 1,000 E. coli genomes in ~60 h using 16 processors. Conclusions. LS-BSR is an open-source, parallel implementation of the BSR algorithm, enabling rapid comparison of the genetic content of large numbers of genomes. The results of the pipeline can be used to identify specific markers between user-defined phylogenetic groups, and to identify the loss and/or acquisition of genetic information between bacterial isolates. Taxa-specific genetic markers can then be translated into clinical diagnostics, or can be used to identify broadly conserved putative therapeutic candidates.
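
As a rough illustration of the CDS-by-genome matrix described in this abstract, the following Python sketch computes BLAST score ratios and picks out group-specific markers. It assumes the conventional definition of the BSR (the bit score of a CDS aligned against a genome, divided by the bit score of that CDS aligned against itself). The function names, bit scores, genome names and the 0.8/0.4 thresholds are hypothetical illustration values and are not taken from the LS-BSR code.

```python
def bsr_matrix(self_scores, hit_scores, genomes):
    """Return {cds: {genome: BSR}} from raw bit scores.

    self_scores: {cds: bit score of the CDS aligned against itself}
    hit_scores:  {cds: {genome: best bit score of the CDS in that genome}}
    genomes:     list of genome names forming the matrix columns
    """
    matrix = {}
    for cds, ref_score in self_scores.items():
        hits = hit_scores.get(cds, {})
        # A genome with no reported hit gets a bit score of 0, i.e. a BSR of 0.0.
        matrix[cds] = {g: hits.get(g, 0.0) / ref_score for g in genomes}
    return matrix


def group_specific(matrix, in_group, out_group, present=0.8, absent=0.4):
    """CDSs with BSR >= present in every in-group genome and < absent in every out-group genome."""
    markers = []
    for cds, row in matrix.items():
        conserved = all(row[g] >= present for g in in_group)
        missing = all(row[g] < absent for g in out_group)
        if conserved and missing:
            markers.append(cds)
    return sorted(markers)


# Hypothetical bit scores for two CDSs across three genomes.
self_scores = {"cds_1": 500.0, "cds_2": 200.0}
hit_scores = {
    "cds_1": {"genome_A": 500.0, "genome_B": 490.0, "genome_C": 480.0},
    "cds_2": {"genome_A": 200.0, "genome_B": 190.0, "genome_C": 50.0},
}
genomes = ["genome_A", "genome_B", "genome_C"]

matrix = bsr_matrix(self_scores, hit_scores, genomes)
markers = group_specific(matrix, in_group=["genome_A", "genome_B"], out_group=["genome_C"])
print(markers)  # ['cds_2']
```

In this toy input, cds_1 is conserved in all three genomes and so is not a marker, while cds_2 is well conserved in genomes A and B but scores only 0.25 in genome C.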


Author(s):  
Pamela Wiener ◽  
Christelle Robert ◽  
Abulgasim Ahbara ◽  
Mazdak Salavati ◽  
Ayele Abebe ◽  
...  

Great progress has been made over recent years in the identification of selection signatures in the genomes of livestock species. This work has primarily been carried out in commercial breeds for which the dominant selection pressures are associated with artificial selection. As agriculture and food security are likely to be strongly affected by climate change, a better understanding of environment-imposed selection on agricultural species is warranted. Ethiopia is an ideal setting to investigate environmental adaptation in livestock due to its wide variation in geo-climatic characteristics and the extensive genetic and phenotypic variation of its livestock. Here, we identified over three million single nucleotide variants across 12 Ethiopian sheep populations and applied landscape genomics approaches to investigate the association between these variants and environmental variables. Our results suggest that environmental adaptation for precipitation-related variables is stronger than that related to altitude or temperature, consistent with large-scale meta-analyses of selection pressure across species. The set of genes showing association with environmental variables was enriched for genes highly expressed in human blood and nerve tissues. There was also evidence of enrichment for genes associated with high-altitude adaptation although no strong association was identified with hypoxia-inducible-factor (HIF) genes. One of the strongest altitude-related signals was for a collagen gene, consistent with previous studies of high-altitude adaptation. Several altitude-associated genes also showed evidence of adaptation with temperature, suggesting a relationship between responses to these environmental factors. These results provide a foundation to investigate further the effects of climatic variables on small ruminant populations.


Mobile DNA ◽  
2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Jerilyn A. Walker ◽  
Vallmer E. Jordan ◽  
Jessica M. Storer ◽  
Cody J. Steely ◽  
...  

Background. Baboons (genus Papio) and geladas (Theropithecus gelada) are now generally recognized as close phylogenetic relatives, though morphologically quite distinct and generally classified in separate genera. Primate-specific Alu retrotransposons are well-established genomic markers for the study of phylogenetic and population genetic relationships. We previously reported a computational reconstruction of Papio phylogeny using large-scale whole genome sequence (WGS) analysis of Alu insertion polymorphisms. Recently, high coverage WGS was generated for Theropithecus gelada. The objective of this study was to apply the high-throughput “poly-Detect” method to computationally determine the number of Alu insertion polymorphisms shared by T. gelada and Papio, and vice versa, by each individual Papio species and T. gelada. Secondly, we performed locus-specific polymerase chain reaction (PCR) assays on a diverse DNA panel to complement the computational data. Results. We identified 27,700 Alu insertions from T. gelada WGS that were also present among six Papio species, with nearly half (12,956) remaining unfixed among 12 Papio individuals. Similarly, each of the six Papio species had species-indicative Alu insertions that were also present in T. gelada. In general, P. kindae shared more insertion polymorphisms with T. gelada than did any of the other five Papio species. PCR-based genotype data provided additional support for the computational findings. Conclusions. Our discovery that several thousand Alu insertion polymorphisms are shared by T. gelada and Papio baboons suggests a much more permeable reproductive barrier between the two genera than previously suspected. Their intertwined evolution likely involves a long history of admixture, gene flow and incomplete lineage sorting.


2019 ◽  
Vol 20 (S15) ◽  
Author(s):  
Jinhong Shi ◽  
Yan Yan ◽  
Matthew G. Links ◽  
Longhai Li ◽  
Jo-Anne R. Dillon ◽  
...  

Background. Antimicrobial resistance (AMR) is a major threat to global public health because it makes standard treatments ineffective and contributes to the spread of infections. It is important to understand AMR’s biological mechanisms for the development of new drugs and more rapid and accurate clinical diagnostics. The increasing availability of whole-genome SNP (single nucleotide polymorphism) information, obtained from whole-genome sequence data, along with AMR profiles provides an opportunity to use feature selection in machine learning to find AMR-associated mutations. This work describes the use of a supervised feature selection approach using deep neural networks to detect AMR-associated genetic factors from whole-genome SNP data. Results. The proposed method, DNP-AAP (deep neural pursuit – average activation potential), was tested on a Neisseria gonorrhoeae dataset with paired whole-genome sequence data and resistance profiles to five commonly used antibiotics including penicillin, tetracycline, azithromycin, ciprofloxacin, and cefixime. The results show that DNP-AAP can effectively identify known AMR-associated genes in N. gonorrhoeae, and also provide a list of candidate genomic features (SNPs) that might lead to the discovery of novel AMR determinants. Logistic regression classifiers were built with the identified SNPs and the prediction AUCs (area under the curve) for penicillin, tetracycline, azithromycin, ciprofloxacin, and cefixime were 0.974, 0.969, 0.949, 0.994, and 0.976, respectively. Conclusions. DNP-AAP can effectively identify known AMR-associated genes in N. gonorrhoeae. It also provides a list of candidate genes and intergenic regions that might lead to novel AMR factor discovery. More generally, DNP-AAP can be applied to AMR analysis of any bacterial species with genomic variants and phenotype data. It can serve as a useful screening tool for microbiologists to generate genetic candidates for further lab experiments.
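
For readers unfamiliar with the AUC figures quoted in this abstract, the short Python sketch below shows one common way to compute an area under the ROC curve from classifier scores: the fraction of resistant/susceptible pairs in which the resistant isolate receives the higher score, with ties counted as one half. The function name, labels and scores are hypothetical illustration values; the sketch does not reproduce DNP-AAP or the study's evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for resistant, 0 for susceptible
    scores: predicted probability of resistance, e.g. from a logistic regression
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # resistant isolate ranked above susceptible one
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for six isolates.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 0.889
```

The pairwise loop is quadratic in the number of isolates, which is fine for a small worked example; a rank-based implementation would be preferable for large datasets.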


2016 ◽  
Author(s):  
Paolo Devanna ◽  
Xiaowei Sylvia Chen ◽  
Joses Ho ◽  
Dario Gajewski ◽  
Alessandro Gialluisi ◽  
...  

Next generation sequencing has opened the way for the large scale interrogation of cohorts at the whole exome, or whole genome level. Currently, the field largely focuses on potential disease causing variants that fall within coding sequences and that are predicted to cause protein sequence changes, generally discarding non-coding variants. However, non-coding DNA makes up ~98% of the genome and contains a range of sequences essential for controlling the expression of protein coding genes. Thus, potentially causative non-coding variation is currently being overlooked. To address this, we have designed an approach to assess variation in one class of non-coding regulatory DNA: the 3′UTRome. Variants in the 3′UTR region of genes are of particular interest because 3′UTRs are responsible for modulating protein expression levels via their interactions with microRNAs. Furthermore, they are amenable to large scale analysis as 3′UTR-microRNA interactions are based on complementary base pairing and as such can be predicted in silico at the genome-wide level. We report a strategy for identifying and functionally testing variants in microRNA binding sites within the 3′UTRome and demonstrate the efficacy of this pipeline in a cohort of language impaired children. Using whole exome sequence data from 43 probands, we extracted variants that lay within 3′UTR microRNA binding sites. We identified a common variant (SNP) in a microRNA binding site and found this SNP to be associated with an endophenotype of language impairment (non-word repetition). We showed that this variant disrupted microRNA regulation in cells and was linked to altered gene expression in the brain, suggesting it may represent a risk factor contributing to specific language impairment (SLI). This work demonstrates that biologically relevant variants are currently being under-investigated despite the wealth of next-generation sequencing data available and presents a simple strategy for interrogating non-coding regions of the genome. We propose that this strategy should be routinely applied to whole exome and whole genome sequence data in order to broaden our understanding of how non-coding genetic variation underlies complex phenotypes such as neurodevelopmental disorders.
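
To make the idea of predicting binding sites from complementary base pairing concrete, the following Python sketch looks for 3′UTR sites that pair with a microRNA seed and checks whether a single-nucleotide variant removes one. It assumes the common convention that the seed is nucleotides 2 to 8 of the microRNA; the abstract does not state which prediction rule or tool the authors used. The function names and both sequences are made up for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}


def seed_site(mirna):
    """DNA motif in the 3'UTR that pairs with miRNA nucleotides 2-8 (assumed seed)."""
    seed = mirna.upper()[1:8]
    # Reverse complement of the seed, written as DNA.
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_sites(utr, mirna):
    """Start positions (0-based) of exact seed-match sites in the UTR."""
    motif = seed_site(mirna)
    utr = utr.upper().replace("U", "T")
    last_start = len(utr) - len(motif)
    return [i for i in range(last_start + 1) if utr[i:i + len(motif)] == motif]


def disrupted_sites(utr, mirna, pos, alt):
    """Seed-match sites present in the reference UTR but lost after the variant."""
    before = find_sites(utr, mirna)
    mutated = utr[:pos] + alt + utr[pos + 1:]
    after = find_sites(mutated, mirna)
    return [site for site in before if site not in after]


# Hypothetical microRNA and 3'UTR fragment.
mirna = "UACGGAUCCAGUUGACCUAGCA"
utr = "TTTAAGATCCGTAACC"

print(seed_site(mirna))                      # GATCCGT
print(find_sites(utr, mirna))                # [5]
print(disrupted_sites(utr, mirna, 8, "T"))   # [5]
```

A variant outside the seed-match window, such as position 1 in this fragment, leaves the site intact and returns an empty list.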


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Blanca M. Perez-Sepulveda ◽  
Darren Heavens ◽  
Caisey V. Pulford ◽  
Alexander V. Predeus ◽  
Ross Low ◽  
...  

We have developed an efficient and inexpensive pipeline for streamlining large-scale collection and genome sequencing of bacterial isolates. Evaluation of this method involved a worldwide research collaboration focused on the model organism Salmonella enterica, the 10KSG consortium. Following the optimization of a logistics pipeline that involved shipping isolates as thermolysates in ambient conditions, the project assembled a diverse collection of 10,419 isolates from low- and middle-income countries. The genomes were sequenced using the LITE pipeline for library construction, with a total reagent cost of less than USD$10 per genome. Our method can be applied to other large bacterial collections to underpin global collaborations.


2015 ◽  
Author(s):  
Danielle Ingle ◽  
Mary Valcanis ◽  
Alex Kuzevski ◽  
Marija Tauschek ◽  
Michael Inouye ◽  
...  

The lipopolysaccharide (O) and flagellar (H) surface antigens of Escherichia coli are targets for serotyping that have traditionally been used to identify pathogenic lineages of E. coli. As serotyping has several limitations, public health reference laboratories are increasingly moving towards whole genome sequencing (WGS) for the rapid characterisation of bacterial isolates. Here we present a method to rapidly and accurately serotype E. coli isolates from raw, short read sequence data, leveraging the known genetic basis for the biosynthesis of O- and H-antigens. Our approach bypasses the need for de novo genome assembly by directly screening WGS reads against a curated database of alleles linked to known E. coli O-groups and H-types (the EcOH database) using the software package SRST2. We validated our approach by comparing in silico results with those obtained via serological phenotyping of 197 enteropathogenic (EPEC) isolates. We also demonstrated the utility of our method to characterise enterotoxigenic E. coli (ETEC) and the uropathogenic E. coli (UPEC) epidemic clone ST131, and for in silico serotyping of foodborne outbreak-related isolates in the public GenomeTrakr database.


2021 ◽  
Vol 12 ◽  
Author(s):  
Steven P.T. Hooton ◽  
Alexander C.W. Pritchard ◽  
Karishma Asiani ◽  
Charlotte J. Gray-Hammerton ◽  
Dov J. Stekel ◽  
...  

Salmonella Typhimurium carrying the multidrug resistance (MDR) plasmid pMG101 was isolated from three burns patients in Boston, United States, in 1973. pMG101 was transferable into other Salmonella spp. and Escherichia coli hosts and carried what was a novel and unusual combination of AMR genes and silver resistance. Previously published short-read DNA sequence of pMG101 showed that it was a 183.5 Kb IncHI plasmid, where a Tn7-mediated transposition of pco/sil resistance genes into the chromosome of the E. coli K-12 J53 host strain had occurred. We noticed differences in streptomycin resistance and plasmid size between two stocks of E. coli K-12 J53 pMG101 we possessed, which had been obtained from two different laboratories (pMG101-A and pMG101-B). Long-read sequencing (PacBio) of the two strains unexpectedly revealed plasmid and chromosomal rearrangements in both. pMG101-A is a non-transmissible 383 Kb closed-circular plasmid consisting of an IncHI2 plasmid sequence fused to an IncFI/FIIA plasmid. pMG101-B is a mobile closed-circular 154 Kb IncFI/FIIA plasmid. Sequence identity of pMG101-B with the fused IncFI/IncFIIA region of pMG101-A was >99%. Assembled host sequence reads of pMG101-B showed Tn7-mediated transposition of pco/sil into the E. coli J53 chromosome between yhiM and yhiN. Long-read sequence data in combination with laboratory experiments have demonstrated large-scale changes in pMG101. Loss of conjugation function and movement of resistance genes into the chromosome suggest that even under long-term laboratory storage, mobile genetic elements such as transposons and insertion sequences can drive the evolution of plasmids and host. This study emphasises the importance of utilising long-read sequencing technologies of plasmids and host strains at the earliest opportunity.

