GeneMates: an R package for Detecting Horizontal Gene Co-transfer between Bacteria Using Gene-gene Associations Controlled for Population Structure

Abstract: Antimicrobial resistance (AMR) in bacteria has been a global threat to public health for decades. A well-known driving force for the emergence, evolution and dissemination of genetic AMR determinants in bacterial populations is horizontal gene transfer, which is frequently mediated by mobile genetic elements (MGEs). Some MGEs can capture, maintain, and rearrange multiple AMR genes in a donor bacterium before moving them into recipients, giving rise to a phenomenon called horizontal gene co-transfer (HGcoT). This physical linkage or co-localisation between mobile AMR genes is of particular concern because it facilitates rapid dissemination of multidrug resistance within and across bacterial populations, providing opportunities for co-selection of AMR genes and limiting our therapeutic options. The study of HGcoT can benefit from large-scale whole-genome sequencing (WGS) data. However, most published studies of HGcoT consider only simple co-occurrence measures, which can be confounded by strong bacterial population structure due to clonal reproduction, leading to spurious associations. To address this issue, we present GeneMates, an R package implementing a network approach to identifying HGcoT using WGS data. The package enables users to test for associations between the presence-absence of bacterial genes using univariate linear mixed models that control for population structure based on core-genome variation. Furthermore, when physical distances between genes of interest are measurable in bacterial genomes, users can evaluate distance consistency to further support their inference of putative horizontally co-transferred genes, whose co-occurrence cannot be completely explained by the population structure. We demonstrate how this package can be used to identify co-transferred AMR genes and recover known MGEs from Escherichia coli and Salmonella Typhimurium WGS data. GeneMates is accessible at github.com/wanyuac/GeneMates.
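To make the confounding problem concrete, here is a minimal sketch of how clonal population structure can create a spurious co-occurrence signal. It does not use GeneMates or a linear mixed model: it swaps in a much simpler technique, a Mantel-Haenszel odds ratio stratified by clade, purely for illustration. The isolate counts, clade labels, and function names are all made up for this example.

```python
from collections import Counter


def two_by_two(pairs):
    """Count (both, x only, y only, neither) from (x, y) presence pairs."""
    counts = Counter(pairs)
    return counts[(1, 1)], counts[(1, 0)], counts[(0, 1)], counts[(0, 0)]


def pooled_odds_ratio(isolates):
    """Naive co-occurrence odds ratio that ignores population structure."""
    a, b, c, d = two_by_two([(x, y) for _, x, y in isolates])
    return (a * d) / (b * c)  # assumes no zero cell in this toy data


def mantel_haenszel_odds_ratio(isolates):
    """Odds ratio stratified by clade (a stand-in for structure control)."""
    strata = {}
    for clade, x, y in isolates:
        strata.setdefault(clade, []).append((x, y))
    numerator = denominator = 0.0
    for pairs in strata.values():
        a, b, c, d = two_by_two(pairs)
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


def make_clade(name, both, x_only, y_only, neither):
    """Build hypothetical isolates as (clade, gene_x, gene_y) tuples."""
    return ([(name, 1, 1)] * both + [(name, 1, 0)] * x_only
            + [(name, 0, 1)] * y_only + [(name, 0, 0)] * neither)


# Clade A carries both genes often, clade B rarely; within each clade
# the two genes are statistically independent (odds ratio of 1).
isolates = make_clade("A", 12, 4, 3, 1) + make_clade("B", 1, 3, 4, 12)

pooled = pooled_odds_ratio(isolates)
stratified = mantel_haenszel_odds_ratio(isolates)
print(f"pooled OR = {pooled:.2f}, clade-stratified OR = {stratified:.2f}")
```

In this toy data the pooled odds ratio suggests an association that disappears once isolates are compared within clades. A mixed model with a relatedness-based random effect, as described in the abstract, addresses the same issue without requiring discrete clade labels.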

Efficient inference of recent and ancestral recombination within bacterial populations

10.1101/059642 ◽

2016 ◽

Cited By ~ 1

Author(s):

Rafal Mostowy ◽

Nicholas J. Croucher ◽

Cheryl P. Andam ◽

Jukka Corander ◽

William P. Hanage ◽

...

Keyword(s):

Genetic Material ◽

Computational Cost ◽

Bacterial Genomes ◽

Accurate Identification ◽

Bacterial Populations ◽

Recombination Hotspots ◽

Whole Genomes ◽

Multiple Species ◽

Bacterial Genes ◽

Insight Into

Abstract: Prokaryotic evolution is affected by horizontal transfer of genetic material through recombination. Inference of an evolutionary tree of bacteria thus relies on accurate identification of the population genetic structure and recombination-derived mosaicism. Rapidly growing databases represent a challenge for computational methods to detect recombinations in bacterial genomes. We introduce a novel algorithm called fastGEAR which identifies lineages in diverse microbial alignments, and recombinations between them and from external origins. The algorithm detects both recent recombinations (affecting a few isolates) and ancestral recombinations between detected lineages (affecting entire lineages), thus providing insight into recombinations affecting deep branches of the phylogenetic tree. In simulations, fastGEAR had comparable power to detect recent recombinations and outstanding power to detect the ancestral ones, compared to state-of-the-art methods, often with a fraction of the computational cost. We demonstrate the utility of the method by analysing a collection of 616 whole genomes of the recombinogenic pathogen Streptococcus pneumoniae, for which the method provided a high-resolution view of recombination across the genome. We examined in detail the penicillin-binding genes across the Streptococcus genus, demonstrating previously undetected genetic exchanges between different species at these three loci. Hence, fastGEAR can be readily applied to investigate mosaicism in bacterial genes across multiple species. Finally, fastGEAR correctly identified many known recombination hotspots and pointed to potential new ones. Matlab code and Linux/Windows executables are available at https://users.ics.aalto.fi/~pemartti/fastGEAR/

Consistent Metagenome-Derived Metrics Verify and Delineate Bacterial Species Boundaries

mSystems ◽

10.1128/msystems.00731-19 ◽

2020 ◽

Vol 5 (1) ◽

Cited By ~ 14

Author(s):

Matthew R. Olm ◽

Alexander Crits-Christoph ◽

Spencer Diamond ◽

Adi Lavy ◽

Paula B. Matheus Carnevali ◽

...

Keyword(s):

Bacterial Diversity ◽

Ribosomal Proteins ◽

Large Scale ◽

Bacterial Species ◽

Bacterial Genome ◽

16S Rrna Genes ◽

Rrna Genes ◽

Species Discrimination ◽

Bacterial Genomes ◽

Discrimination Power

ABSTRACT Longstanding questions relate to the existence of naturally distinct bacterial species and genetic approaches to distinguish them. Bacterial genomes in public databases form distinct groups, but these databases are subject to isolation and deposition biases. To avoid these biases, we compared 5,203 bacterial genomes from 1,457 environmental metagenomic samples to test for distinct clouds of diversity and evaluated metrics that could be used to define the species boundary. Bacterial genomes from the human gut, soil, and the ocean all exhibited gaps in whole-genome average nucleotide identities (ANI) near the previously suggested species threshold of 95% ANI. While genome-wide ratios of nonsynonymous and synonymous nucleotide differences (dN/dS) decrease until ANI values approach ∼98%, two methods for estimating homologous recombination approached zero at ∼95% ANI, supporting breakdown of recombination due to sequence divergence as a species-forming force. We evaluated 107 genome-based metrics for their ability to distinguish species when full genomes are not recovered. Full-length 16S rRNA genes were least useful, in part because they were underrecovered from metagenomes. However, many ribosomal proteins displayed both high metagenomic recoverability and species discrimination power. Taken together, our results verify the existence of sequence-discrete microbial species in metagenome-derived genomes and highlight the usefulness of ribosomal genes for gene-level species discrimination.

IMPORTANCE There is controversy about whether bacterial diversity is clustered into distinct species groups or exists as a continuum. To address this issue, we analyzed bacterial genome databases and reports from several previous large-scale environment studies and identified clear discrete groups of species-level bacterial diversity in all cases. Genetic analysis further revealed that quasi-sexual reproduction via horizontal gene transfer is likely a key evolutionary force that maintains bacterial species integrity. We next benchmarked over 100 metrics to distinguish these bacterial species from each other and identified several genes encoding ribosomal proteins with high species discrimination power. Overall, the results from this study provide best practices for bacterial species delineation based on genome content and insight into the nature of bacterial species population genetics.

Evidence for a large-scale population structure among accessions of Arabidopsis thaliana: possible causes and consequences for the distribution of linkage disequilibrium

Molecular Ecology ◽

10.1111/j.1365-294x.2006.02865.x ◽

2006 ◽

Vol 15 (6) ◽

pp. 1507-1517 ◽

Cited By ~ 90

Author(s):

MARIE-FRANCE OSTROWSKI ◽

JACQUES DAVID ◽

SYLVAIN SANTONI ◽

HEATHER MCKHANN ◽

XAVIER REBOUD ◽

...

Keyword(s):

Arabidopsis Thaliana ◽

Population Structure ◽

Linkage Disequilibrium ◽

Large Scale ◽

Scale Population

Dynamics of Genome Architecture in Rhizobium sp. Strain NGR234

Journal of Bacteriology ◽

10.1128/jb.184.1.171-176.2002 ◽

2002 ◽

Vol 184 (1) ◽

pp. 171-176 ◽

Cited By ~ 62

Author(s):

Patrick Mavingui ◽

Margarita Flores ◽

Xianwu Guo ◽

Guillermo Dávila ◽

Xavier Perret ◽

...

Keyword(s):

Large Scale ◽

Insertion Sequence ◽

Biological Significance ◽

Genome Architecture ◽

Bacterial Genomes ◽

Symbiotic Plasmid ◽

Sequence Elements ◽

Dynamic Structures ◽

Genome Analyses ◽

Insertion Sequence Elements

ABSTRACT Bacterial genomes are usually partitioned into several replicons, which are dynamic structures prone to mutation and genomic rearrangements, thus contributing to genome evolution. Nevertheless, much remains to be learned about the origins and dynamics of the formation of bacterial alternative genomic states and their possible biological consequences. To address these issues, we have studied the dynamics of the genome architecture in Rhizobium sp. strain NGR234 and analyzed its biological significance. The NGR234 genome consists of three replicons: the symbiotic plasmid pNGR234a (536,165 bp), the megaplasmid pNGR234b (>2,000 kb), and the chromosome (>3,700 kb). Here we report that genome analyses of cell siblings showed the occurrence of large-scale DNA rearrangements consisting of cointegrations and excisions between the three replicons. As a result, four new genomic architectures have emerged. Three consisted of cointegrates between two replicons: chromosome-pNGR234a, chromosome-pNGR234b, and pNGR234a-pNGR234b. The other consisted of a cointegrate of the three replicons (chromosome-pNGR234a-pNGR234b). Cointegration and excision of pNGR234a with either the chromosome or pNGR234b were studied and found to proceed via a Campbell-type mechanism, mediated by insertion sequence elements. We provide evidence showing that changes in the genome architecture did not alter the growth and symbiotic proficiency of Rhizobium derivatives.

SkewIT: The Skew Index Test for large-scale GC Skew analysis of bacterial genomes

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008439 ◽

2020 ◽

Vol 16 (12) ◽

pp. e1008439

Author(s):

Jennifer Lu ◽

Steven L. Salzberg

Keyword(s):

Large Scale ◽

Analysis Tool ◽

Index Test ◽

Bacterial Genomes ◽

Phylogenetic Groups ◽

Bacterial Phyla ◽

Link Type ◽

Gc Skew ◽

A Genome ◽

Web App

GC skew is a phenomenon observed in many bacterial genomes, wherein the two replication strands of the same chromosome contain different proportions of guanine and cytosine nucleotides. Here we demonstrate that this phenomenon, which was first discovered in the mid-1990s, can be used today as an analysis tool for the 15,000+ complete bacterial genomes in NCBI’s Refseq library. In order to analyze all 15,000+ genomes, we introduce a new method, SkewIT (Skew Index Test), that calculates a single metric representing the degree of GC skew for a genome. Using this metric, we demonstrate how GC skew patterns are conserved within certain bacterial phyla, e.g. Firmicutes, but show different patterns in other phylogenetic groups such as Actinobacteria. We also discovered that outlier values of SkewIT highlight potential bacterial mis-assemblies. Using our newly defined metric, we identify multiple mis-assembled chromosomal sequences in previously published complete bacterial genomes. We provide a SkewIT web app https://jenniferlu717.shinyapps.io/SkewIT/ that calculates SkewI for any user-provided bacterial sequence. The web app also provides an interactive interface for the data generated in this paper, allowing users to further investigate the SkewI values and thresholds of the Refseq-97 complete bacterial genomes. Individual scripts for analysis of bacterial genomes are provided in the following repository: https://github.com/jenniferlu717/SkewIT.
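As a rough illustration of the underlying quantity, the sketch below computes the conventional windowed GC skew, (G - C) / (G + C), and its running sum. This is not the SkewI metric or the SkewIT code, whose exact definition is not given in the abstract; the function names, the window size, and the toy sequence are assumptions made for this example.

```python
def gc_skew(seq, window=1000):
    """Return (G - C) / (G + C) for each consecutive window of seq."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window with no G or C has undefined skew; report 0.0 here.
        values.append((g - c) / (g + c) if g + c else 0.0)
    return values


def cumulative_skew(values):
    """Running sum of windowed skew values."""
    total, out = 0.0, []
    for value in values:
        total += value
        out.append(total)
    return out


# Toy sequence: a G-rich half followed by a C-rich half, mimicking the
# switch in skew sign between the two replication strands.
toy = "GGGCAT" * 10 + "CCCGAT" * 10
skew = gc_skew(toy, window=30)
running = cumulative_skew(skew)
print(skew)     # [0.5, 0.5, -0.5, -0.5]
print(running)  # [0.5, 1.0, 0.5, 0.0]
```

The turning point of the running sum marks where the skew changes sign, which is the kind of pattern a single-number summary of skew would be expected to capture.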

Peer Review #1 of "The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes (v0.1)"

10.7287/peerj.332v0.1/reviews/1 ◽

2014 ◽

Author(s):

NJ Loman

Keyword(s):

Peer Review ◽

Large Scale ◽

Bacterial Genomes ◽

Blast Score

Feasibility and Evaluation of a Large-Scale External Validation Approach for Patient-Level Prediction in an International Data Network: Validation of models predicting stroke in female patients newly diagnosed with atrial fibrillation.

10.21203/rs.2.11750/v2 ◽

2020 ◽

Author(s):

Jenna Marie Reps ◽

Ross Williams ◽

Seng Chan You ◽

Thomas Falconer ◽

Evan Minty ◽

...

Keyword(s):

Atrial Fibrillation ◽

Large Scale ◽

Data Science ◽

Prediction Models ◽

External Validation ◽

Scale Up ◽

R Package ◽

Prognostic Models ◽

Healthcare Data ◽

Patient Level

Abstract Objective: To demonstrate how the Observational Healthcare Data Science and Informatics (OHDSI) collaborative network and standardization can be utilized to scale-up external validation of patient-level prediction models by enabling validation across a large number of heterogeneous observational healthcare datasets. Materials & Methods: Five previously published prognostic models (ATRIA, CHADS2, CHADS2VASC, Q-Stroke and Framingham) that predict future risk of stroke in patients with atrial fibrillation were replicated using the OHDSI frameworks. A network study was run that enabled the five models to be externally validated across nine observational healthcare datasets spanning three countries and five independent sites. Results: The five existing models were able to be integrated into the OHDSI framework for patient-level prediction and they obtained mean c-statistics ranging between 0.57-0.63 across the 6 databases with sufficient data to predict stroke within 1 year of initial atrial fibrillation diagnosis for females with atrial fibrillation. This was comparable with existing validation studies. The validation network study was run across nine datasets within 60 days once the models were replicated. An R package for the study was published at https://github.com/OHDSI/StudyProtocolSandbox/tree/master/ExistingStrokeRiskExternalValidation. Discussion: This study demonstrates the ability to scale up external validation of patient-level prediction models using a collaboration of researchers and a data standardization that enable models to be readily shared across data sites. External validation is necessary to understand the transportability or reproducibility of a prediction model, but without collaborative approaches it can take three or more years for a model to be validated by one independent researcher. Conclusion: In this paper we show it is possible to both scale-up and speed-up external validation by showing how validation can be done across multiple databases in less than 2 months. We recommend that researchers developing new prediction models use the OHDSI network to externally validate their models.
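For readers unfamiliar with the c-statistic reported above, the following minimal sketch computes it for a binary outcome as the fraction of (event, non-event) pairs in which the event received the higher predicted risk, counting ties as half. It is not the OHDSI implementation; the function name and the toy risks and outcomes are invented for illustration.

```python
def c_statistic(risks, outcomes):
    """Concordance for binary outcomes; ties in risk count as 0.5."""
    event_risks = [r for r, o in zip(risks, outcomes) if o == 1]
    non_event_risks = [r for r, o in zip(risks, outcomes) if o == 0]
    pairs = len(event_risks) * len(non_event_risks)
    if pairs == 0:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for r_event in event_risks:
        for r_non in non_event_risks:
            if r_event > r_non:
                score += 1.0
            elif r_event == r_non:
                score += 0.5
    return score / pairs


# Hypothetical predicted 1-year stroke risks and observed outcomes.
risks = [0.9, 0.4, 0.6, 0.2, 0.4]
outcomes = [1, 0, 1, 0, 1]
c = c_statistic(risks, outcomes)
print(round(c, 3))  # 0.917
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect ranking, which puts the reported range of 0.57-0.63 in context.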

Optimizing smartphone-based canopy hemispherical photography

10.1101/2021.03.17.435793 ◽

2021 ◽

Author(s):

Gastón Mauro Díaz

Keyword(s):

Large Scale ◽

Forest Canopy ◽

Accuracy Assessment ◽

Low Cost ◽

R Package ◽

Coefficient Of Determination ◽

Native Forest ◽

Hemispherical Photography ◽

Area Index ◽

Plant Area Index

1) Hemispherical photography (HP) is a long-standing tool for forest canopy characterization. Low-cost fisheye lenses can now convert smartphones into highly portable HP equipment; however, they cannot be used under all conditions because HP is sensitive to illumination. To obtain sound results outside diffuse light conditions, a deep-learning-based system needs to be developed. A ready-to-use alternative is the multiscale color-based binarization algorithm, but it provides only moderate-quality results, and only for open forests. To overcome this limitation, I propose coupling it with the model-based local thresholding algorithm. I call this coupling the MBCB approach. 2) The methods presented here are part of the R package CAnopy IMage ANalysis (caiman), which I am developing. The accuracy of the new MBCB approach was assessed with data from a pine plantation and a broadleaf native forest. 3) For plant area index calculation, the coefficient of determination (R^2) was greater than 0.7 and the root mean square error (RMSE) was lower than 20 %. 4) The results suggest that the new MBCB approach allows the calculation of unbiased canopy metrics from smartphone-based HP acquired in sunlight conditions, even for closed canopies. This facilitates large-scale and opportunistic sampling with hemispherical photography.
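The two accuracy measures quoted in point 3 can be sketched as follows. The abstract does not say how R^2 was obtained or how RMSE was normalised to a percentage, so this example assumes R^2 = 1 - SS_res / SS_tot and an RMSE expressed relative to the mean reference value. The function names and the plant area index values are hypothetical.

```python
import math


def r_squared(reference, estimate):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


def rmse(reference, estimate):
    """Root mean square error between reference and estimate."""
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


def relative_rmse_percent(reference, estimate):
    """RMSE as a percentage of the mean reference (assumed convention)."""
    mean_ref = sum(reference) / len(reference)
    return 100.0 * rmse(reference, estimate) / mean_ref


# Hypothetical plant area index values: reference vs. smartphone-based.
reference_pai = [2.0, 3.0, 4.0, 5.0]
estimated_pai = [2.5, 2.5, 4.5, 4.5]
r2 = r_squared(reference_pai, estimated_pai)
rel = relative_rmse_percent(reference_pai, estimated_pai)
print(f"R^2 = {r2:.2f}, relative RMSE = {rel:.1f} %")
```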

Implementation of the Omega (ω) Index to detect large-scale systematic cheating

10.35542/osf.io/exwkp ◽

2019 ◽

Author(s):

Alvin Vista

Keyword(s):

Standardized Testing ◽

Large Scale ◽

Type I Error ◽

R Package ◽

Statistical Testing ◽

System Level ◽

Control Group ◽

Type I ◽

Data Contamination ◽

Cheating Detection

Cheating detection is an important issue in standardized testing, especially in large-scale settings. Statistical approaches are often computationally intensive and require specialised software. We present a two-stage approach that quickly filters suspected groups using statistical testing on an IRT-based answer-copying index. We also present an approach to mitigate data contamination and improve the performance of the index. The computation of the index was implemented through a modified version of an open source R package, thus enabling wider access to the method. Using data from PIRLS 2011 (N=64,232), we conduct a simulation to demonstrate our approach. Type I error was well controlled and no control group was falsely flagged for cheating, while 16 (combined n=12,569) of the 18 (combined n=14,149) simulated groups were detected. Implications for system-level cheating detection and further improvements of the approach are discussed.

The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes

10.7287/peerj.preprints.220v1 ◽

2014 ◽

Author(s):

Jason W Sahl ◽

Greg Caporaso ◽

David A Rasko ◽

Paul S Keim

Keyword(s):

Large Scale ◽

Sequence Data ◽

Parallel Implementation ◽

Genetic Relationships ◽

Clinical Diagnostics ◽

Whole Genome Sequence ◽

Bacterial Isolates ◽

Bacterial Genomes ◽

E Coli ◽

Blast Score

Background. As whole genome sequence data from bacterial isolates becomes cheaper to generate, computational methods are needed to correlate sequence data with biological observations. Here we present the large-scale BLAST score ratio (LS-BSR) pipeline, which rapidly compares the genetic content of hundreds to thousands of bacterial genomes, and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed. This matrix can be easily parsed in order to identify genetic relationships between bacterial genomes. Although pipelines have been published that group peptides by sequence similarity, no other software performs the large-scale, flexible, full-genome comparative analyses carried out by LS-BSR. Results. To demonstrate the utility of the method, the LS-BSR pipeline was tested on 96 Escherichia coli and Shigella genomes; the pipeline ran in 163 minutes using 16 processors, which is a greater than 7-fold speedup compared to using a single processor. The BSR values for each CDS, which indicate a relative level of relatedness, were then mapped to each genome on an independent core genome single nucleotide polymorphism (SNP) based phylogeny. Comparisons were then used to identify clade specific CDS markers and validate the LS-BSR pipeline based on molecular markers that delineate between classical E. coli pathogenic variant (pathovar) designations. Scalability tests demonstrated that the LS-BSR pipeline can process 1,000 E. coli genomes in ~60h using 16 processors. Conclusions. LS-BSR is an open-source, parallel implementation of the BSR algorithm, enabling rapid comparison of the genetic content of large numbers of genomes. The results of the pipeline can be used to identify specific markers between user-defined phylogenetic groups, and to identify the loss and/or acquisition of genetic information between bacterial isolates. Taxa-specific genetic markers can then be translated into clinical diagnostics, or can be used to identify broadly conserved putative therapeutic candidates.
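The matrix described above can be pictured with a small sketch. It assumes the usual definition of the BLAST score ratio, the bit score of a CDS aligned against a genome divided by the bit score of that CDS aligned against itself, and fills in 0.0 where no hit exists. It is not the LS-BSR code: the function name, the CDS and genome identifiers, and the bit scores are all hypothetical.

```python
def bsr_matrix(self_scores, hit_scores, genomes):
    """Build {cds: {genome: BSR}} from BLAST bit scores.

    self_scores maps each CDS to its self-alignment bit score.
    hit_scores maps (cds, genome) to the best bit score in that genome;
    a missing key is treated as "no hit" and yields a BSR of 0.0.
    """
    matrix = {}
    for cds, self_score in self_scores.items():
        matrix[cds] = {
            genome: hit_scores.get((cds, genome), 0.0) / self_score
            for genome in genomes
        }
    return matrix


# Hypothetical bit scores for two CDSs across two genomes.
self_scores = {"cdsA": 500.0, "cdsB": 200.0}
hit_scores = {
    ("cdsA", "genome1"): 500.0,  # identical copy, BSR 1.0
    ("cdsA", "genome2"): 250.0,  # divergent copy, BSR 0.5
    ("cdsB", "genome1"): 180.0,  # near-identical copy, BSR 0.9
}
matrix = bsr_matrix(self_scores, hit_scores, ["genome1", "genome2"])
for cds, row in matrix.items():
    print(cds, row)
```

Rows with values near 1.0 in one group of genomes and near 0.0 in another are the kind of pattern that would flag a clade-specific marker.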
