PaSiT: a novel approach based on short-oligonucleotide frequencies for efficient bacterial identification and typing

Gleb Goussarov; Ilse Cleenwerck; Mohamed Mysara; Natalie Leys; Pieter Monsieurs; Guillaume Tahon; Aurélien Carlier; Peter Vandamme; Rob Van Houdt

doi:10.1093/bioinformatics/btz964

Bioinformatics, 2020, Vol 36 (8), pp. 2337-2344
doi:10.1093/bioinformatics/btz964
Cited By ~ 1

Author(s): Gleb Goussarov, Ilse Cleenwerck, Mohamed Mysara, Natalie Leys, Pieter Monsieurs, ...

Keyword(s): Large Scale, Bacterial Species, Supplementary Information, Nucleotide Identity, Average Nucleotide Identity, Bacterial Genomes, Short Oligonucleotide, Novel Approach, Novel Method, Alignment Step

Abstract

Motivation: One of the most widespread methods used in taxonomy studies to distinguish between strains or taxa is the calculation of average nucleotide identity. It requires a computationally expensive alignment step and is therefore not suitable for large-scale comparisons. Short oligonucleotide-based methods offer a faster alternative, but at the expense of accuracy. Here, we aim to address this shortcoming by providing software that implements a novel method based on short-oligonucleotide frequencies to compute inter-genomic distances.

Results: Our tetranucleotide and hexanucleotide implementations, which were optimized based on a taxonomically well-defined set of over 200 newly sequenced bacterial genomes, are as accurate as the short oligonucleotide-based method TETRA and average nucleotide identity, for identifying bacterial species and strains, respectively. Moreover, the lightweight nature of this method makes it applicable for large-scale analyses.

Availability and implementation: The method introduced here was implemented, together with other existing methods, in GenDisCal, dependency-free software written in C and available as source code from https://github.com/LM-UGent/GenDisCal. The software supports multithreading and has been tested on Windows and Linux (CentOS). In addition, a Java-based graphical user interface that acts as a wrapper for the software is also available.

Supplementary information: Supplementary data are available at Bioinformatics online.
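
To make the idea of an alignment-free, frequency-based distance concrete, here is a minimal Python sketch. It is not the PaSiT method and not GenDisCal code: the abstract does not describe the actual signature or distance function, so the plain relative k-mer frequencies and the Euclidean distance used here are assumptions chosen only for illustration, as are the function names and toy sequences.

```python
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k=4):
    """Relative frequencies of all 4**k DNA k-mers, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing N or other symbols are skipped
            counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[w] / total for w in kmers]


def euclidean_distance(p, q):
    """Euclidean distance between two frequency vectors (illustrative choice)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy sequences standing in for genomes (hypothetical data).
genome_a = "ATGCGATACGCTTGAGGCTAACGT" * 10
genome_b = "ATGCGATACGCTTGAGGCTAACGT" * 10
genome_c = "TTTTAAAATTTTAAAACCCCGGGG" * 10

d_same = euclidean_distance(kmer_frequencies(genome_a), kmer_frequencies(genome_b))
d_diff = euclidean_distance(kmer_frequencies(genome_a), kmer_frequencies(genome_c))
print(d_same, d_diff)
```

Setting k=6 gives the hexanucleotide variant of the same sketch. Because each genome is reduced to a fixed-length vector once, pairwise comparison costs are independent of genome length, which is why this family of methods scales to large collections.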

Consistent Metagenome-Derived Metrics Verify and Delineate Bacterial Species Boundaries

mSystems, 2020, Vol 5 (1)
doi:10.1128/msystems.00731-19
Cited By ~ 14

Author(s): Matthew R. Olm, Alexander Crits-Christoph, Spencer Diamond, Adi Lavy, Paula B. Matheus Carnevali, ...

Keyword(s): Bacterial Diversity, Ribosomal Proteins, Large Scale, Bacterial Species, Bacterial Genome, 16S rRNA Genes, rRNA Genes, Species Discrimination, Bacterial Genomes, Discrimination Power

ABSTRACT: Longstanding questions relate to the existence of naturally distinct bacterial species and genetic approaches to distinguish them. Bacterial genomes in public databases form distinct groups, but these databases are subject to isolation and deposition biases. To avoid these biases, we compared 5,203 bacterial genomes from 1,457 environmental metagenomic samples to test for distinct clouds of diversity and evaluated metrics that could be used to define the species boundary. Bacterial genomes from the human gut, soil, and the ocean all exhibited gaps in whole-genome average nucleotide identities (ANI) near the previously suggested species threshold of 95% ANI. While genome-wide ratios of nonsynonymous and synonymous nucleotide differences (dN/dS) decrease until ANI values approach ∼98%, two methods for estimating homologous recombination approached zero at ∼95% ANI, supporting breakdown of recombination due to sequence divergence as a species-forming force. We evaluated 107 genome-based metrics for their ability to distinguish species when full genomes are not recovered. Full-length 16S rRNA genes were least useful, in part because they were underrecovered from metagenomes. However, many ribosomal proteins displayed both high metagenomic recoverability and species discrimination power. Taken together, our results verify the existence of sequence-discrete microbial species in metagenome-derived genomes and highlight the usefulness of ribosomal genes for gene-level species discrimination.

IMPORTANCE: There is controversy about whether bacterial diversity is clustered into distinct species groups or exists as a continuum. To address this issue, we analyzed bacterial genome databases and reports from several previous large-scale environment studies and identified clear discrete groups of species-level bacterial diversity in all cases. Genetic analysis further revealed that quasi-sexual reproduction via horizontal gene transfer is likely a key evolutionary force that maintains bacterial species integrity. We next benchmarked over 100 metrics to distinguish these bacterial species from each other and identified several genes encoding ribosomal proteins with high species discrimination power. Overall, the results from this study provide best practices for bacterial species delineation based on genome content and insight into the nature of bacterial species population genetics.
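
As an illustration of how a 95% ANI threshold separates genomes into species-level groups, the following Python sketch clusters genomes by single linkage over precomputed pairwise ANI values. The genome names and ANI values are invented for the example, and single-linkage grouping with a union-find structure is an assumption made here for simplicity; the abstract does not say which clustering procedure the authors used.

```python
def cluster_by_ani(genomes, ani, threshold=95.0):
    """Group genomes whose pairwise ANI is at or above the threshold (single linkage)."""
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), value in ani.items():
        if value >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise ANI values (percent); note the gap between ~82% and ~96%.
genomes = ["g1", "g2", "g3", "g4"]
ani = {
    ("g1", "g2"): 98.7,
    ("g1", "g3"): 82.0,
    ("g1", "g4"): 80.9,
    ("g2", "g3"): 81.5,
    ("g2", "g4"): 81.0,
    ("g3", "g4"): 96.1,
}

clusters = cluster_by_ani(genomes, ani)
print(clusters)  # [['g1', 'g2'], ['g3', 'g4']]
```

When the data show a genuine gap in ANI values around the threshold, as the abstract reports, the resulting groups are insensitive to the exact cutoff chosen within that gap.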

A large-scale evaluation of algorithms to calculate average nucleotide identity

Antonie van Leeuwenhoek, 2017, Vol 110 (10), pp. 1281-1286
doi:10.1007/s10482-017-0844-4
Cited By ~ 853

Author(s): Seok-Hwan Yoon, Sung-min Ha, Jeongmin Lim, Soonjae Kwon, Jongsik Chun

Keyword(s): Large Scale, Nucleotide Identity, Average Nucleotide Identity, Scale Evaluation

Nocardioides donggukensis sp. nov. and Hyunsoonleella aquatilis sp. nov., isolated from Jeongbang Waterfall on Jeju Island

International Journal of Systematic and Evolutionary Microbiology, 2021, Vol 71 (12)
doi:10.1099/ijsem.0.005176

Author(s): Inhyup Kim, Geeta Chhetri, Jiyoun Kim, Minchung Kang, Yoonseop So, ...

Keyword(s): Type Strain, Type Species, Bacterial Species, Jeju Island, Nucleotide Identity, rRNA Gene, Average Nucleotide Identity, Phenotypic Data, Content Type, Link Type

Two bacterial strains, designated MJB4T and SJ7T, were isolated from water samples collected from Jeongbang Falls on Jeju Island, Republic of Korea. Phylogenetic analysis of 16S rRNA gene sequences indicated that the two strains belonged to the genera Nocardioides and Hyunsoonleella, owing to their high similarities to Nocardioides jensenii DSM 29641T (97.5 %) and Hyunsoonleella rubra FA042T (96.3 %), respectively. These values are much lower than the gold standard for bacterial species (98.7 %). The average nucleotide identity values between strains MJB4T, SJ7T and the reference strains, Nocardioides jensenii DSM 29641T, Nocardioides daejeonensis MJ31T and Hyunsoonleella flava T58T were 77.2, 75.9 and 75.4 %, respectively. Strains MJB4T and SJ7T and the type strains of the species involved in system incidence have average nucleotide identity and average amino acid identity values of 60.1–82.6 %, below the threshold for the species boundary (95–96 %), which confirms that strains MJB4T and SJ7T represent two new species of the genera Nocardioides and Hyunsoonleella, respectively. Based on phylogenetic and phenotypic data, strains MJB4T and SJ7T are considered to represent novel species of the genera Nocardioides and Hyunsoonleella, respectively, for which the names Nocardioides donggukensis sp. nov. (type strain MJB4T=KACC 21724T=NBRC 114402T) and Hyunsoonleella aquatilis sp. nov. (type strain SJ7T=KACC 21715T=NBRC 114486T) have been proposed.

A Universal, Genomewide GuideFinder for CRISPR/Cas9 Targeting in Microbial Genomes

mSphere, 2020, Vol 5 (1)
doi:10.1128/msphere.00086-20

Author(s): Michelle Spoto, Changhui Guan, Elizabeth Fleming, Julia Oh

Keyword(s): Gene Function, Large Scale, Essential Gene, Bacterial Species, Bacterial Genome, Model Organisms, Design Parameters, Bacterial Genomes, Wide Range, User Friendly

ABSTRACT: The CRISPR/Cas system has significant potential to facilitate gene editing in a variety of bacterial species. CRISPR interference (CRISPRi) and CRISPR activation (CRISPRa) represent modifications of the CRISPR/Cas9 system utilizing a catalytically inactive Cas9 protein for transcription repression and activation, respectively. While CRISPRi and CRISPRa have tremendous potential to systematically investigate gene function in bacteria, few programs are specifically tailored to identify guides in draft bacterial genomes genomewide. Furthermore, few programs offer open-source code with flexible design parameters for bacterial targeting. To address these limitations, we created GuideFinder, a customizable, user-friendly program that can design guides for any annotated bacterial genome. GuideFinder designs guides from NGG protospacer-adjacent motif (PAM) sites for any number of genes by the use of an annotated genome and FASTA file input by the user. Guides are filtered according to user-defined design parameters and removed if they contain any off-target matches. Iteration with lowered parameter thresholds allows the program to design guides for genes that did not produce guides with the more stringent parameters, one of several features unique to GuideFinder. GuideFinder can also identify paired guides for targeting multiplicity, whose validity we tested experimentally. GuideFinder has been tested on a variety of diverse bacterial genomes, finding guides for 95% of genes on average. Moreover, guides designed by the program are functionally useful, with CRISPRi as the focal application, as demonstrated by essential gene knockdown in two staphylococcal species. Through the large-scale generation of guides, this open-access software will improve accessibility to CRISPR/Cas studies of a variety of bacterial species.

IMPORTANCE: With the explosion in our understanding of human and environmental microbial diversity, corresponding efforts to understand gene function in these organisms are strongly needed. CRISPR/Cas9 technology has revolutionized interrogation of gene function in a wide variety of model organisms. Efficient CRISPR guide design is required for systematic gene targeting. However, existing tools are not adapted for the broad needs of microbial targeting, which include extraordinary species and subspecies genetic diversity, the overwhelming majority of which is characterized by draft genomes. In addition, flexibility in guide design parameters is important to consider the wide range of factors that can affect guide efficacy, many of which can be species and strain specific. We designed GuideFinder, a customizable, user-friendly program that addresses the limitations of existing software and that can design guides for any annotated bacterial genome with numerous features that facilitate guide design in a wide variety of microorganisms.
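
The first step described above, locating candidate guides next to NGG PAM sites, can be sketched in a few lines of Python. This is not GuideFinder's code: the function names, the 20-nt guide length, the forward-strand-only scan and the toy target sequence are all assumptions for illustration, and the off-target filtering and iterative threshold relaxation mentioned in the abstract are omitted.

```python
import re


def find_ngg_guides(seq, guide_len=20):
    """Return (start, guide, pam) for each forward-strand NGG site with a full-length guide 5' of it."""
    seq = seq.upper()
    guides = []
    # A lookahead is used so that overlapping PAM sites are all reported.
    for match in re.finditer(r"(?=([ACGT]GG))", seq):
        pam_start = match.start()
        if pam_start >= guide_len:  # need guide_len bases upstream of the PAM
            guide = seq[pam_start - guide_len:pam_start]
            guides.append((pam_start - guide_len, guide, match.group(1)))
    return guides


def gc_fraction(guide):
    """GC content, one example of a user-tunable design parameter."""
    return (guide.count("G") + guide.count("C")) / len(guide)


# Hypothetical 40-nt target region.
target = "ATGACCGTTACGATCGATTGCAAGCTTGGCATCGTACCGG"
guides = find_ngg_guides(target)
for start, guide, pam in guides:
    print(start, guide, pam, round(gc_fraction(guide), 2))
```

A fuller version would also scan the reverse complement and discard guides with matches elsewhere in the genome, which is where the user-defined parameters described in the abstract would apply.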

Network-based characterization of disease–disease relationships in terms of drugs and therapeutic targets

Bioinformatics, 2020, Vol 36 (Supplement_1), pp. i516-i524
doi:10.1093/bioinformatics/btaa439

Author(s): Midori Iida, Michio Iwata, Yoshihiro Yamanishi

Keyword(s): Large Scale, Expression Patterns, Therapeutic Targets, Molecular Networks, Supplementary Information, New Associations, Disease States, Molecular Features, Novel Approach

Abstract

Motivation: Disease states are distinguished from each other in terms of differing clinical phenotypes, but characteristic molecular features are often common to various diseases. Similarities between diseases can be explained by characteristic gene expression patterns. However, most disease–disease relationships remain uncharacterized.

Results: In this study, we proposed a novel approach for network-based characterization of disease–disease relationships in terms of drugs and therapeutic targets. We performed large-scale analyses of omics data and molecular interaction networks for 79 diseases, including adrenoleukodystrophy, leukaemia, Alzheimer's disease, asthma, atopic dermatitis, breast cancer, cystic fibrosis and inflammatory bowel disease. We quantified disease–disease similarities based on proximities of abnormally expressed genes in various molecular networks, and showed that similarities between diseases could be explained by characteristic molecular network topologies. Furthermore, we developed a kernel matrix regression algorithm to predict the commonalities of drugs and therapeutic targets among diseases. Our comprehensive prediction strategy indicated many new associations among phenotypically diverse diseases.

Supplementary information: Supplementary data are available at Bioinformatics online.

Learning for Tail Label Data: A Label-Specific Feature Approach

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019
doi:10.24963/ijcai.2019/533
Cited By ~ 2

Author(s): Tong Wei, Wei-Wei Tu, Yu-Feng Li

Keyword(s): Clustering Analysis, Large Scale, Feature Representation, Supplementary Information, High Dimensional, Learning Problem, Label Data, Label Correlations, Novel Method, Sparse Features

Tail label data (TLD) is prevalent in real-world tasks, and large-scale multi-label learning (LMLL) is its major learning scheme. Previous LMLL studies typically need to additionally take into account extensive head label data (HLD), and thus fail to guide the learning behavior of TLD. In many applications such as recommender systems, however, predicting tail labels is essential, since it provides important supplementary information. We call this kind of problem tail label learning. In this paper, we propose a novel method for the tail label learning problem. Based on the observation that the raw feature representation in LMLL data usually benefits HLD and may not be suitable for TLD, we construct effective and rich label-specific features by exploring the labeled data distribution and leveraging label correlations. Specifically, we employ clustering analysis to explore discriminative features for each tail label, replacing the original high-dimensional and sparse features. In addition, due to the scarcity of positive examples of TLD, we encode knowledge from HLD by exploiting label correlations to enhance the label-specific features. Experimental results verify the superiority of the proposed method in terms of performance on TLD.

Novel Approach to Quantitative Detection of Specific rRNA in a Microbial Community, Using Catalytic DNA

Applied and Environmental Microbiology, 2005, Vol 71 (8), pp. 4879-4884
doi:10.1128/aem.71.8.4879-4884.2005
Cited By ~ 12

Author(s): Hikaru Suenaga, Rui Liu, Yuko Shiramasa, Takahiro Kanagawa

Keyword(s): Microbial Community, 16S rRNA, Bacterial Species, Quantitative Detection, Catalytic Function, Novel Approach, Specific Manner, Novel Method, Species Specific, Wastewater Treatment Systems

ABSTRACT We developed a novel method for the quantitative detection of the 16S rRNA of a specific bacterial species in the microbial community by using deoxyribozyme (DNAzyme), which possesses the catalytic function to cleave RNA in a sequence-specific manner. A mixture of heterogeneous 16S rRNA containing the target 16S rRNA was incubated with a species-specific DNAzyme. The cleaved target 16S rRNA was separated from the intact 16S rRNA by electrophoresis, and then their amounts were compared for the quantitative detection of target 16S rRNA. This method was used to determine the abundance of the 16S rRNA of a filamentous bacterium, Sphaerotilus natans, in activated sludge, which is a microbial mixture used in wastewater treatment systems. The result indicated that this DNAzyme-based approach would be applicable to actual microbial communities.

A high-resolution genomic composition-based method with the ability to distinguish similar bacterial organisms

BMC Genomics, 2019, Vol 20 (1)
doi:10.1186/s12864-019-6119-x
Cited By ~ 1

Author(s): Yizhuang Zhou, Wenting Zhang, Huixian Wu, Kai Huang, Junfei Jin

Keyword(s): High Resolution, Pearson Correlation, Bacterial Species, Real Data, Species Differentiation, Similar Species, Novel Approach, Identical Composition, Novel Method, Species Specific

Abstract

Background: Genomic composition has been found to be species specific and is used to differentiate bacterial species. To date, almost no published composition-based approaches are able to distinguish between most closely related organisms, including intra-genus species and intra-species strains. Thus, it is necessary to develop a novel approach to address this problem.

Results: Here, we initially determine that the “tetranucleotide-derived z-value Pearson correlation coefficient” (TETRA) approach is representative of other published statistical methods. Then, we devise a novel method called “Tetranucleotide-derived Z-value Manhattan Distance” (TZMD) and compare it with the TETRA approach. Our results show that TZMD reflects the maximal genome difference, while TETRA does not in most conditions, demonstrating in theory that TZMD provides improved resolution. Additionally, our analysis of real data shows that TZMD improves species differentiation and clearly differentiates similar organisms, including similar species belonging to the same genospecies, subspecies and intraspecific strains, most of which cannot be distinguished by TETRA. Furthermore, TZMD is able to determine clonal strains with the TZMD = 0 criterion, which intrinsically encompasses identical composition, high average nucleotide identity and high percentage of shared genomes.

Conclusions: Our extensive assessment demonstrates that TZMD has high resolution. This study is the first to propose a composition-based method for differentiating bacteria at the strain level and to demonstrate that composition is also strain specific. TZMD is a powerful tool and the first easy-to-use approach for differentiating clonal and non-clonal strains. Therefore, as the first composition-based algorithm for strain typing, TZMD will facilitate bacterial studies in the future.
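
The contrast between a correlation-based score and a Manhattan distance can be seen on a toy example. The Python sketch below compares the two measures on short made-up z-value vectors; real tetranucleotide z-value vectors have 256 entries, and the exact z-value computation and any normalization used by TETRA or TZMD are not given in the abstract, so this only illustrates the general property that a Pearson correlation can equal 1 for vectors that are not identical, whereas a Manhattan distance of 0 requires identical vectors.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def manhattan(x, y):
    """Sum of absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical z-value vectors: z_b is z_a scaled by two.
z_a = [1.0, -2.0, 0.5, 3.0]
z_b = [2.0, -4.0, 1.0, 6.0]

r = pearson(z_a, z_b)
d = manhattan(z_a, z_b)
print(r, d)
```

Here the correlation treats the two vectors as indistinguishable while the Manhattan distance is clearly nonzero, which is consistent with the abstract's argument that a distance-based measure can resolve genomes that a correlation-based one cannot.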

Bacteria and Metabolic Potential in Karst Caves Revealed by Intensive Bacterial Cultivation and Genome Assembly

Applied and Environmental Microbiology, 2021, Vol 87 (6)
doi:10.1128/aem.02440-20

Author(s): Hai-Zhen Zhu, Zhi-Feng Zhang, Nan Zhou, Cheng-Ying Jiang, Bao-Jun Wang, ...

Keyword(s): Large Scale, Bacterial Species, Biogeochemical Cycling, Metagenomic Data, Karst Cave, Bacterial Genomes, Metabolic Potential, Karst Caves, Microbial Resources, Cave Ecosystems

ABSTRACT: Karst caves are widely distributed subsurface systems, and the microbiomes therein are proposed to be the driving force for cave evolution and biogeochemical cycling. In past years, culture-independent studies on the microbiomes of cave systems have been conducted, yet intensive microbial cultivation is still needed to validate the sequence-derived hypothesis and to disclose the microbial functions in cave ecosystems. In this study, the microbiomes of two karst caves in Guizhou Province in southwest China were examined. A total of 3,562 bacterial strains were cultivated from rock, water, and sediment samples, and 329 species (including 14 newly described species) of 102 genera were found. We created a cave bacterial genome collection of 218 bacterial genomes from a karst cave microbiome through the extraction of 204 database-derived genomes and de novo sequencing of 14 new bacterial genomes. The cultivated genome collection obtained in this study and the metagenome data from previous studies were used to investigate the bacterial metabolism and potential involvement in the carbon, nitrogen, and sulfur biogeochemical cycles in the cave ecosystem. New N2-fixing Azospirillum and alkane-oxidizing Oleomonas species were documented in the karst cave microbiome. Two pcaIJ clusters of the β-ketoadipate pathway that were abundant in both the cultivated microbiomes and the metagenomic data were identified, and their representatives from the cultivated bacterial genomes were functionally demonstrated. This large-scale cultivation of a cave microbiome represents the most intensive collection of cave bacterial resources to date and provides valuable information and diverse microbial resources for future cave biogeochemical research.

IMPORTANCE: Karst caves are oligotrophic environments that are dark and humid and have a relatively stable annual temperature. The diversity of bacteria and their metabolisms are crucial for understanding the biogeochemical cycling in cave ecosystems. We integrated large-scale bacterial cultivation with metagenomic data mining to explore the compositions and metabolisms of the microbiomes in two karst cave systems. Our results reveal the presence of a highly diversified cave bacterial community, and 14 new bacterial species were described and their genomes sequenced. In this study, we obtained the most intensive collection of cultivated microbial resources from karst caves to date and predicted the various important routes for the biogeochemical cycling of elements in cave ecosystems.

A universal, genome-wide guide finder for CRISPR/Cas9 targeting in microbial genomes

2017
doi:10.1101/194241

Author(s): Michelle Spoto, Elizabeth Fleming, Julia Oh

Keyword(s): Large Scale, Essential Gene, Bacterial Species, Bacterial Genome, Design Parameters, Bacterial Genomes, Microbial Genomes, Genome Wide, Cas9 Protein, User Friendly

Abstract

Background: The CRISPR/Cas system has significant potential to facilitate gene editing in a variety of bacterial species. CRISPR interference (CRISPRi) and CRISPR activation (CRISPRa) represent modifications of the CRISPR/Cas9 system utilizing a catalytically inactive Cas9 protein for transcription repression or activation, respectively. While CRISPRi and CRISPRa have tremendous potential to systematically investigate gene function in bacteria, no pan-bacterial, genome-wide tools exist for guide discovery. We have created Guide Finder: a customizable, user-friendly program that can design guides for any annotated bacterial genome.

Results: Guide Finder designs guides from NGG PAM sites for any number of genes using an annotated genome and FASTA file input by the user. Guides are filtered according to user-defined design parameters and removed if they contain any off-target matches. Iteration with lowered parameter thresholds allows the program to design guides for genes that did not produce guides with the more stringent parameters, a feature unique to Guide Finder. Guide Finder has been tested on a variety of diverse bacterial genomes, on average finding guides for 95% of genes. Moreover, guides designed by the program are functionally useful, with CRISPRi as the focal application, as demonstrated by essential gene knockdown in two staphylococcal species.

Conclusions: Through the large-scale generation of guides, this open-access software will improve accessibility to CRISPR/Cas studies for a variety of bacterial species.
