Socru: Typing of genome level order and orientation in bacteria

Mapping Intimacies ◽

10.1101/543702 ◽

2019 ◽

Author(s):

Andrew J. Page ◽

Gemma C. Langridge

Keyword(s):

Large Scale ◽

Genome Rearrangement ◽

Bacterial Species ◽

Level Structure ◽

Typing Method ◽

Long Read ◽

Multiple Copies ◽

Ribosomal Operons ◽

Genome Level ◽

Species Specific

Abstract Summary: Genome rearrangements occur in bacteria between repeat sequences and impact growth and gene expression. Homologous recombination can occur between ribosomal operons, which are found in multiple copies in many bacteria. Inversion between indirect repeats and excision/translocation between direct repeats enable structural genome rearrangement. To identify what these rearrangements are by sequencing, reads of several thousand bases are required to span the ribosomal operons. With long read sequencing aiding the routine generation of complete bacterial assemblies, we have developed socru, a typing method for the order and orientation of genome fragments between ribosomal operons, defined against species-specific baselines. It allows for a single identifier to convey the order and orientation of genome level structure and 434 of the most common bacterial species are supported. Additionally, socru can be used to identify large scale misassemblies. Availability and implementation: Socru is written in Python 3, runs on Linux and OSX systems and is available under the open source license GNU GPL 3 from https://github.com/quadram-institute-bioscience/[email protected]

socru: typing of genome-level order and orientation around ribosomal operons in bacteria

Microbial Genomics ◽

10.1099/mgen.0.000396 ◽

2020 ◽

Vol 6 (7) ◽

Cited By ~ 2

Author(s):

Andrew J. Page ◽

Emma V. Ainsworth ◽

Gemma C. Langridge

Keyword(s):

Structural Changes ◽

Bacterial Species ◽

Level Structure ◽

Typing Method ◽

Single Nucleotide ◽

Long Read ◽

Multiple Copies ◽

Ribosomal Operons ◽

Genome Level ◽

Level Order

Rearrangements of large genome fragments occur in bacteria between repeat sequences and can impact on growth and gene expression. Homologous recombination resulting in inversion between indirect repeats and excision/translocation between direct repeats enables these structural changes. One form of rearrangement occurs around ribosomal operons, found in multiple copies across many bacteria, but identification of these rearrangements by sequencing requires reads of several thousand bases to span the ribosomal operons. With long-read sequencing aiding the routine generation of complete bacterial assemblies, we have developed socru, a typing method for the order and orientation of genome fragments between ribosomal operons. It allows for a single identifier to convey the order and orientation of genome-level structure and we have successfully applied this typing to 433 of the most common bacterial species. In a focused analysis, we observed the presence of multiple structural genotypes in nine bacterial pathogens, underscoring the importance of routinely assessing this form of variation alongside traditional single-nucleotide polymorphism (SNP) typing.
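To make the idea of a single identifier for order and orientation more concrete, the following is a minimal sketch, not socru's actual implementation. It assumes a circular chromosome whose fragments between ribosomal operons have already been numbered against a baseline; the names `Fragment`, `canonicalise`, and `pattern`, and the `1-2'-3` notation (a prime marking an inverted fragment), are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (fragment number, True if inverted vs. the baseline).
Fragment = Tuple[int, bool]


def canonicalise(arrangement: List[Fragment]) -> List[Fragment]:
    """Rotate (and, if needed, flip) a circular arrangement so that
    fragment 1 comes first and is in the forward orientation."""
    numbers = [number for number, _ in arrangement]
    start = numbers.index(1)
    rotated = arrangement[start:] + arrangement[:start]
    if rotated[0][1]:
        # Fragment 1 is inverted, so read the circle from the other strand:
        # reverse the order and flip every orientation (same molecule).
        flipped = [(number, not inverted) for number, inverted in reversed(rotated)]
        # After the reversal fragment 1 sits last; rotate it back to the front.
        rotated = flipped[-1:] + flipped[:-1]
    return rotated


def pattern(arrangement: List[Fragment]) -> str:
    """Return one string that encodes both fragment order and orientation."""
    return "-".join(
        f"{number}'" if inverted else str(number)
        for number, inverted in canonicalise(arrangement)
    )


if __name__ == "__main__":
    # The same circular genome, written from a different starting point.
    print(pattern([(3, False), (4, False), (1, False), (2, True)]))
```

Because the arrangement is normalised before it is printed, two assemblies of the same structure that start at different positions, or were assembled on opposite strands, yield the same string. Such a string could then be mapped to a short label through a species-specific lookup table.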

Long-Read-Sequenced Reference Genomes of the Seven Major Lineages of Enterotoxigenic Escherichia Coli (ETEC) Circulating in Modern Time

10.21203/rs.3.rs-237525/v1 ◽

2021 ◽

Author(s):

Astrid von Mentzer ◽

Grace A Blackwell ◽

Derek Pickard ◽

Christine J Boinett ◽

Enrique Joffré ◽

...

Keyword(s):

Escherichia Coli ◽

Large Scale ◽

Bacterial Species ◽

Enterotoxigenic Escherichia Coli ◽

Poor Countries ◽

Depth Analysis ◽

Sequencing Studies ◽

Long Read ◽

To Come ◽

Reference Genomes

Abstract Enterotoxigenic Escherichia coli (ETEC) is an enteric pathogen responsible for the majority of diarrheal cases worldwide. ETEC infections are estimated to cause 80,000 fatalities per year, with the highest rates of burden, ca 75 million cases per year, amongst children under five years of age in resource-poor countries. It is also the leading cause of diarrhoea in travellers. Previous large-scale sequencing studies have found seven major ETEC lineages currently in circulation worldwide. We used PacBio long-read sequencing combined with Illumina sequencing to create high-quality complete reference genomes for each of the major lineages with manually curated chromosomes and plasmids. We confirm that the major ETEC lineages all harbour conserved plasmids that have been associated with their respective background genomes for decades and that the plasmids and chromosomes of ETEC are both crucial for ETEC virulence and success as pathogens. The in-depth analysis of gene content, synteny and correct annotations of plasmids will elucidate other plasmids with and without virulence factors in related bacterial species. These reference genomes allow for fast and accurate comparison between different ETEC strains, and these data will form the foundation of ETEC genomics research for years to come.

Novel canine high-quality metagenome-assembled genomes, prophages, and host-associated plasmids by long-read metagenomics together with Hi-C proximity ligation

10.1101/2021.07.02.450895 ◽

2021 ◽

Author(s):

Anna Cusco ◽

Daniel Perez ◽

Joaquim Vines ◽

Norma Fabregas ◽

Olga Francino

Keyword(s):

Mobile Genetic Elements ◽

Bacterial Host ◽

High Quality ◽

Short Read ◽

Genetic Elements ◽

Proximity Ligation ◽

Long Read ◽

Ribosomal Operons ◽

Species Specific ◽

Public Datasets

Long-read metagenomics facilitates the assembly of high-quality metagenome-assembled genomes (HQ MAGs) out of complex microbiomes. It provides highly contiguous assemblies by spanning repetitive regions, complete ribosomal genes, and mobile genetic elements. Hi-C proximity ligation data bins the long contigs and their associated extra-chromosomal elements to their bacterial host. Here, we characterized a canine fecal sample combining a long-read metagenomics assembly with Hi-C data, and further correcting frameshift errors. We retrieved 27 HQ MAGs and seven medium-quality (MQ) MAGs considering MIMAG criteria. All the long-read canine MAGs improved previous short-read MAGs from public datasets regarding contiguity of the assembly, presence, and completeness of the ribosomal operons, and presence of canonical tRNAs. This trend was also observed when comparing to representative genomes from a pure culture (short-read assemblies). Moreover, Hi-C data linked six potential plasmids to their bacterial hosts. Finally, we identified 51 bacteriophages integrated into their bacterial host, providing novel host information for eight viral clusters that included Gut Phage Database viral genomes. Even though three viral clusters were species-specific, most of them presented a broader host range. In conclusion, long-read metagenomics retrieved long contigs harboring complete assembled ribosomal operons, prophages, and other mobile genetic elements. Hi-C binned together the long contigs into HQ and MQ MAGs, some of them representing closely related species. Long-read metagenomics and Hi-C proximity ligation are likely to become a comprehensive approach to HQ MAGs discovery and assignment of extra-chromosomal elements to their bacterial host.

Long-read-sequenced reference genomes of the seven major lineages of enterotoxigenic Escherichia coli (ETEC) circulating in modern time

Scientific Reports ◽

10.1038/s41598-021-88316-2 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Astrid von Mentzer ◽

Grace A. Blackwell ◽

Derek Pickard ◽

Christine J. Boinett ◽

Enrique Joffré ◽

...

Keyword(s):

Escherichia Coli ◽

Large Scale ◽

Bacterial Species ◽

Enterotoxigenic Escherichia Coli ◽

Poor Countries ◽

Depth Analysis ◽

Sequencing Studies ◽

Long Read ◽

To Come ◽

Reference Genomes

Abstract Enterotoxigenic Escherichia coli (ETEC) is an enteric pathogen responsible for the majority of diarrheal cases worldwide. ETEC infections are estimated to cause 80,000 deaths annually, with the highest rates of burden, ca 75 million cases per year, amongst children under 5 years of age in resource-poor countries. It is also the leading cause of diarrhoea in travellers. Previous large-scale sequencing studies have found seven major ETEC lineages currently in circulation worldwide. We used PacBio long-read sequencing combined with Illumina sequencing to create high-quality complete reference genomes for each of the major lineages with manually curated chromosomes and plasmids. We confirm that the major ETEC lineages all harbour conserved plasmids that have been associated with their respective background genomes for decades, suggesting that the plasmids and chromosomes of ETEC are both crucial for ETEC virulence and success as pathogens. The in-depth analysis of gene content, synteny and correct annotations of plasmids will elucidate other plasmids with and without virulence factors in related bacterial species. These reference genomes allow for fast and accurate comparison between different ETEC strains, and these data will form the foundation of ETEC genomics research for years to come.

Taxonomic resolution of the ribosomal RNA operon in bacteria: implications for its use with long-read sequencing

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqz016 ◽

2019 ◽

Vol 2 (1) ◽

Cited By ~ 2

Author(s):

Leonardo de Oliveira Martins ◽

Andrew J Page ◽

Alison E Mather ◽

Ian G Charles

Keyword(s):

Phylogenetic Signal ◽

Taxonomic Diversity ◽

Taxonomic Resolution ◽

Bacterial Cells ◽

Species Classification ◽

Sequencing Technologies ◽

Ribosomal Operon ◽

Long Read ◽

Multiple Copies ◽

Ribosomal Operons

Abstract DNA barcoding through the use of amplified regions of the ribosomal operon, such as the 16S gene, is a routine method to gain an overview of the microbial taxonomic diversity within a sample without the need to isolate and culture the microbes present. However, bacterial cells usually have multiple copies of this ribosomal operon, and choosing the ‘wrong’ copy could provide a misleading species classification. While this presents less of a problem for well-characterized organisms with large sequence databases to interrogate, it is a significant challenge for lesser known organisms with unknown copy number and diversity. Using the entire length of the ribosomal operon, which encompasses the 16S, 23S, 5S and internal transcribed spacer regions, should provide greater taxonomic resolution but has not been well explored. Here, we use publicly available reference genomes and explore the theoretical boundaries when using concatenated genes and the full-length ribosomal operons, which has been made possible by the development and uptake of long-read sequencing technologies. We quantify the issues of both copy choice and operon length in a phylogenetic context to demonstrate that longer regions improve the phylogenetic signal while maintaining taxonomic accuracy.

Long-read-sequenced reference genomes of the seven major lineages of enterotoxigenic Escherichia coli (ETEC) circulating in modern time

10.1101/2020.07.16.203430 ◽

2020 ◽

Author(s):

Astrid von Mentzer ◽

Grace Blackwell ◽

Derek Pickard ◽

Christine J. Boinett ◽

Enrique Joffré ◽

...

Keyword(s):

Escherichia Coli ◽

Large Scale ◽

Bacterial Species ◽

Enterotoxigenic Escherichia Coli ◽

Poor Countries ◽

Depth Analysis ◽

Sequencing Studies ◽

Long Read ◽

To Come ◽

Reference Genomes

Abstract Background: Enterotoxigenic Escherichia coli (ETEC) is an enteric pathogen responsible for the majority of diarrheal cases worldwide. ETEC infections are estimated to cause 80,000 fatalities per year, with the highest burden, ca 75 million cases per year, amongst children under five years of age in resource poor countries. It is also the leading cause of diarrhoea in travellers. Previous large-scale sequencing studies have found seven major ETEC lineages currently in circulation world-wide. Methods: Here we have used PacBio long read sequencing in combination with Illumina sequencing to create high quality complete reference genomes for each of these lineages with manually curated chromosomes and plasmids. The plasmids carrying ETEC virulence genes were compared to other available long-read sequenced ETEC strains using blastn. Results: The ETEC reference strains harbour between two and five plasmids, including virulence, antibiotic resistance and phage-plasmids. The virulence plasmids carrying the colonization factors are highly conserved as shown by comparison with plasmids with other ETEC strains and confirm that the plasmids and chromosomes of ETEC are both crucial for ETEC virulence and superiority as pathogens. Conclusion: We confirm that the major ETEC lineages all harbor conserved plasmids that have been associated to their respective background genomes for decades. The in-depth analysis of gene content and order and correct annotations of plasmids will help to elucidate other plasmids with and without virulence factors in related bacterial species. These reference genomes allow for rapid and accurate comparison between different ETEC strains and these data will form the foundation of ETEC genomics research for years to come.

FlsnRNA-seq: protoplasting-free full-length single-nucleus RNA profiling in plants

Genome Biology ◽

10.1186/s13059-021-02288-0 ◽

2021 ◽

Vol 22 (1) ◽

Cited By ~ 2

Author(s):

Yanping Long ◽

Zhijian Liu ◽

Jinbu Jia ◽

Weipeng Mo ◽

Liang Fang ◽

...

Keyword(s):

Single Cell ◽

Cell Walls ◽

Large Scale ◽

Full Length ◽

Cell Level ◽

Root Cells ◽

Rna Profiling ◽

Different Types ◽

Long Read ◽

Single Nucleus

Abstract The broad application of single-cell RNA profiling in plants has been hindered by the prerequisite of protoplasting that requires digesting the cell walls from different types of plant tissues. Here, we present a protoplasting-free approach, flsnRNA-seq, for large-scale full-length RNA profiling at a single-nucleus level in plants using isolated nuclei. Combined with 10x Genomics and Nanopore long-read sequencing, we validate the robustness of this approach in Arabidopsis root cells and the developing endosperm. Sequencing results demonstrate that it allows for uncovering alternative splicing and polyadenylation-related RNA isoform information at the single-cell level, which facilitates characterizing cell identities.

Comparative genomics suggests a taxonomic revision of the Staphylococcus cohnii species complex

Genome Biology and Evolution ◽

10.1093/gbe/evab020 ◽

2021 ◽

Author(s):

Anna Lavecchia ◽

Matteo Chiara ◽

Caterina De Virgilio ◽

Caterina Manzari ◽

Carlo Pazzani ◽

...

Keyword(s):

Comparative Genomics ◽

Species Complex ◽

Large Scale ◽

Genomic Sequence ◽

Bacterial Species ◽

Taxonomic Revision ◽

Distinct Species ◽

The Novel ◽

Staphylococcus Cohnii ◽

Taxonomic Assignments

Abstract Staphylococcus cohnii (SC), a coagulase-negative bacterium, was first isolated in 1975 from human skin. Early phenotypic analyses led to the delineation of two subspecies (subsp.), Staphylococcus cohnii subsp. cohnii (SCC) and Staphylococcus cohnii subsp. urealyticus (SCU). SCC was considered to be specific to humans whereas SCU apparently demonstrated a wider host range, from lower primates to humans. The type strains ATCC 29974 and ATCC 49330 have been designated for SCC and SCU, respectively. Comparative analysis of 66 complete genome sequences—including a novel SC isolate—revealed unexpected patterns within the SC complex, both in terms of genomic sequence identity and gene content, highlighting the presence of 3 phylogenetically distinct groups. Based on our observations, and on the current guidelines for taxonomic classification for bacterial species, we propose a revision of the SC species complex. We suggest that SCC and SCU should be regarded as two distinct species: SC and SU (Staphylococcus urealyticus), and that two distinct subspecies, SCC and SCB (SC subsp. barensis, represented by the novel strain isolated in Bari) should be recognized within SC. Furthermore, since large scale comparative genomics studies recurrently suggest inconsistencies or conflicts in taxonomic assignments of bacterial species, we believe that the approach proposed here might be considered for more general application.

Consistent Metagenome-Derived Metrics Verify and Delineate Bacterial Species Boundaries

mSystems ◽

10.1128/msystems.00731-19 ◽

2020 ◽

Vol 5 (1) ◽

Cited By ~ 14

Author(s):

Matthew R. Olm ◽

Alexander Crits-Christoph ◽

Spencer Diamond ◽

Adi Lavy ◽

Paula B. Matheus Carnevali ◽

...

Keyword(s):

Bacterial Diversity ◽

Ribosomal Proteins ◽

Large Scale ◽

Bacterial Species ◽

Bacterial Genome ◽

16S Rrna Genes ◽

Rrna Genes ◽

Species Discrimination ◽

Bacterial Genomes ◽

Discrimination Power

ABSTRACT Longstanding questions relate to the existence of naturally distinct bacterial species and genetic approaches to distinguish them. Bacterial genomes in public databases form distinct groups, but these databases are subject to isolation and deposition biases. To avoid these biases, we compared 5,203 bacterial genomes from 1,457 environmental metagenomic samples to test for distinct clouds of diversity and evaluated metrics that could be used to define the species boundary. Bacterial genomes from the human gut, soil, and the ocean all exhibited gaps in whole-genome average nucleotide identities (ANI) near the previously suggested species threshold of 95% ANI. While genome-wide ratios of nonsynonymous and synonymous nucleotide differences (dN/dS) decrease until ANI values approach ∼98%, two methods for estimating homologous recombination approached zero at ∼95% ANI, supporting breakdown of recombination due to sequence divergence as a species-forming force. We evaluated 107 genome-based metrics for their ability to distinguish species when full genomes are not recovered. Full-length 16S rRNA genes were least useful, in part because they were underrecovered from metagenomes. However, many ribosomal proteins displayed both high metagenomic recoverability and species discrimination power. Taken together, our results verify the existence of sequence-discrete microbial species in metagenome-derived genomes and highlight the usefulness of ribosomal genes for gene-level species discrimination.
IMPORTANCE There is controversy about whether bacterial diversity is clustered into distinct species groups or exists as a continuum. To address this issue, we analyzed bacterial genome databases and reports from several previous large-scale environment studies and identified clear discrete groups of species-level bacterial diversity in all cases. Genetic analysis further revealed that quasi-sexual reproduction via horizontal gene transfer is likely a key evolutionary force that maintains bacterial species integrity. We next benchmarked over 100 metrics to distinguish these bacterial species from each other and identified several genes encoding ribosomal proteins with high species discrimination power. Overall, the results from this study provide best practices for bacterial species delineation based on genome content and insight into the nature of bacterial species population genetics.
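As an illustration of how a 95% ANI threshold can delineate species-level groups, here is a minimal sketch that applies single-linkage grouping to precomputed pairwise ANI values. It is not the pipeline used in the study. The function name `species_clusters`, the input layout (a dictionary keyed by genome pairs), and the example values in the usage note are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def species_clusters(
    genomes: Iterable[str],
    ani: Dict[Tuple[str, str], float],
    threshold: float = 95.0,
) -> List[Set[str]]:
    """Group genomes so that any pair with ANI >= threshold shares a cluster
    (single linkage, implemented with a small union-find structure)."""
    genomes = list(genomes)
    parent = {genome: genome for genome in genomes}

    def find(genome: str) -> str:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[genome] != genome:
            parent[genome] = parent[parent[genome]]
            genome = parent[genome]
        return genome

    for (first, second), value in ani.items():
        if value >= threshold:
            parent[find(first)] = find(second)

    groups: Dict[str, Set[str]] = {}
    for genome in genomes:
        groups.setdefault(find(genome), set()).add(genome)
    # Sort clusters by their members so the output order is deterministic.
    return sorted(groups.values(), key=lambda members: sorted(members))
```

With hypothetical values such as `{("A", "B"): 98.7, ("B", "C"): 96.1, ("A", "C"): 94.8, ("C", "D"): 82.0}`, genomes A, B, and C fall into one cluster through the chain A-B-C even though A and C alone are below the threshold, while D stays separate. This chaining is a property of single linkage; a stricter criterion (for example, requiring every pair in a cluster to exceed the threshold) would split such groups.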

A Genomic Distance Based on MUM Indicates Discontinuity between Most Bacterial Species and Genera

Journal of Bacteriology ◽

10.1128/jb.01202-08 ◽

2008 ◽

Vol 191 (1) ◽

pp. 91-99 ◽

Cited By ~ 115

Author(s):

Marc Deloger ◽

Meriem El Karoui ◽

Marie-Agnès Petit

Keyword(s):

Dna Sequences ◽

Dna Content ◽

Core Genome ◽

Biological Diversity ◽

Bacterial Species ◽

Genomic Distance ◽

The Core ◽

Intraspecies Diversity ◽

Genome Level ◽

Definition Of

ABSTRACT The fundamental unit of biological diversity is the species. However, a remarkable extent of intraspecies diversity in bacteria was discovered by genome sequencing, and it reveals the need to develop clear criteria to group strains within a species. Two main types of analyses used to quantify intraspecies variation at the genome level are the average nucleotide identity (ANI), which detects the DNA conservation of the core genome, and the DNA content, which calculates the proportion of DNA shared by two genomes. Both estimates are based on BLAST alignments for the definition of DNA sequences common to the genome pair. Interestingly, however, results using these methods on intraspecies pairs are not well correlated. This prompted us to develop a genomic-distance index taking into account both criteria of diversity, which are based on DNA maximal unique matches (MUM) shared by two genomes. The values, called MUMi, for MUM index, correlate better with the ANI than with the DNA content. Moreover, the MUMi groups strains in a way that is congruent with routinely used multilocus sequence-typing trees, as well as with ANI-based trees. We used the MUMi to determine the relatedness of all available genome pairs at the species and genus levels. Our analysis reveals a certain consistency in the current notion of bacterial species, in that the bulk of intraspecies and intragenus values are clearly separable. It also confirms that some species are much more diverse than most. As the MUMi is fast to calculate, it offers the possibility of measuring genome distances on the whole database of available genomes.