MG-MLST: Characterizing the Microbiome at the Strain Level in Metagenomic Data

The microbiome plays an important role in human physiology. The composition of the human microbiome has been described at the phylum, class, genus, and species levels; however, it is largely unknown at the strain level. The importance of strain-level differences in microbial communities has been increasingly recognized in understanding disease associations. Current methods for identifying strain populations often require deep metagenomic sequencing and a comprehensive set of reference genomes. In this study, we developed a method, metagenomic multi-locus sequence typing (MG-MLST), to determine strain-level composition in a microbial community by combining high-throughput sequencing with multi-locus sequence typing (MLST). We used a commensal bacterium, Propionibacterium acnes, as an example to test the ability of MG-MLST to identify the strain composition. Using simulated communities, MG-MLST accurately predicted the strain populations in all samples. We further validated the method using MLST gene amplicon libraries and metagenomic shotgun sequencing data of clinical skin samples. MG-MLST yielded strain compositions consistent with those obtained from nearly full-length 16S rRNA clone libraries and metagenomic shotgun sequencing analysis. When comparing strain-level differences between acne and healthy skin microbiomes, we demonstrated that strains of RT2/6 were highly associated with healthy skin, consistent with previous findings. In summary, MG-MLST provides a quantitative analysis of the strain populations in the microbiome, including their diversity and richness. It can be applied to microbiome studies to reveal strain-level differences between groups, which are critical in many microorganism-related diseases.
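To make the idea of a quantitative strain profile concrete, the following is a minimal sketch, not the MG-MLST implementation. It assumes each read has already been assigned to a sequence type, and the labels `ST_A`, `ST_B`, and `ST_C` are hypothetical. It computes relative abundances, richness, and the Shannon diversity index.

```python
import math
from collections import Counter


def strain_composition(read_assignments):
    """Return the relative abundance of each sequence type.

    `read_assignments` is a list with one sequence-type label per read.
    """
    counts = Counter(read_assignments)
    total = sum(counts.values())
    return {st: n / total for st, n in counts.items()}


def richness(abundances):
    """Number of sequence types with non-zero abundance."""
    return sum(1 for p in abundances.values() if p > 0)


def shannon_diversity(abundances):
    """Shannon index H = -sum(p * ln p) over non-zero abundances."""
    return -sum(p * math.log(p) for p in abundances.values() if p > 0)


# Hypothetical per-read assignments for one sample (illustrative only).
reads = ["ST_A"] * 50 + ["ST_B"] * 30 + ["ST_C"] * 20
composition = strain_composition(reads)

print(composition)
print("richness:", richness(composition))
print("Shannon diversity:", round(shannon_diversity(composition), 3))
```

Comparing such profiles between sample groups (for example, acne versus healthy skin) is the kind of downstream analysis the abstract describes.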

Accurate and sensitive detection of microbial eukaryotes from whole metagenome shotgun sequencing

Microbiome ◽

10.1186/s40168-021-01015-y ◽

2021 ◽

Vol 9 (1) ◽

Author(s):

Abigail L. Lind ◽

Katherine S. Pollard

Keyword(s):

Gene Families ◽

Shotgun Sequencing ◽

Metagenomic Data ◽

Marker Genes ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Dna And Rna ◽

Paired Samples ◽

Microbial Eukaryotes ◽

Conserved Gene

Abstract. Background: Microbial eukaryotes are found alongside bacteria and archaea in natural microbial systems, including host-associated microbiomes. While microbial eukaryotes are critical to these communities, they are challenging to study with shotgun sequencing techniques and are therefore often excluded. Results: Here, we present EukDetect, a bioinformatics method to identify eukaryotes in shotgun metagenomic sequencing data. Our approach uses a database of 521,824 universal marker genes from 241 conserved gene families, which we curated from 3713 fungal, protist, non-vertebrate metazoan, and non-streptophyte archaeplastida genomes and transcriptomes. EukDetect has a broad taxonomic coverage of microbial eukaryotes, performs well on low-abundance and closely related species, and is resilient against bacterial contamination in eukaryotic genomes. Using EukDetect, we describe the spatial distribution of eukaryotes along the human gastrointestinal tract, showing that fungi and protists are present in the lumen and mucosa throughout the large intestine. We discover that there is a succession of eukaryotes that colonize the human gut during the first years of life, mirroring patterns of developmental succession observed in gut bacteria. By comparing DNA and RNA sequencing of paired samples from human stool, we find that many eukaryotes continue active transcription after passage through the gut, though some do not, suggesting they are dormant or nonviable. We analyze metagenomic data from the Baltic Sea and find that eukaryotes differ across locations and salinity gradients. Finally, we observe eukaryotes in Arabidopsis leaf samples, many of which are not identifiable from public protein databases. Conclusions: EukDetect provides an automated and reliable way to characterize eukaryotes in shotgun sequencing datasets from diverse microbiomes. We demonstrate that it enables discoveries that would be missed or clouded by false positives with standard shotgun sequence analysis. EukDetect will greatly advance our understanding of how microbial eukaryotes contribute to microbiomes.

Binnacle: Using Scaffolds to Improve the Contiguity and Quality of Metagenomic Bins

Frontiers in Microbiology ◽

10.3389/fmicb.2021.638561 ◽

2021 ◽

Vol 12 ◽

Author(s):

Harihara Subrahmaniam Muralidharan ◽

Nidhi Shah ◽

Jacquelyn S. Meisel ◽

Mihai Pop

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Mobile Elements ◽

Shotgun Sequencing ◽

Strain Level ◽

Level Variation ◽

Sequencing Data ◽

Sequencing Errors ◽

Complete Genomes

High-throughput sequencing has revolutionized the field of microbiology; however, reconstructing complete genomes of organisms from whole metagenomic shotgun sequencing data remains a challenge. Recovered genomes are often highly fragmented, due to uneven abundances of organisms, repeats within and across genomes, sequencing errors, and strain-level variation. To address the fragmented nature of metagenomic assemblies, scientists rely on a process called binning, which clusters together contigs inferred to originate from the same organism. Existing binning algorithms use oligonucleotide frequencies and contig abundance (coverage) within and across samples to group together contigs from the same organism. However, these algorithms often miss short contigs and contigs from regions with unusual coverage or DNA composition characteristics, such as mobile elements. Here, we propose that information from assembly graphs can assist current strategies for metagenomic binning. We use MetaCarvel, a metagenomic scaffolding tool, to construct assembly graphs where contigs are nodes and edges are inferred based on paired-end reads. We developed a tool, Binnacle, that extracts information from the assembly graphs and clusters scaffolds into comprehensive bins. Binnacle also provides wrapper scripts to integrate with existing binning methods. The Binnacle pipeline can be found on GitHub (https://github.com/marbl/binnacle). We show that binning graph-based scaffolds, rather than contigs, improves the contiguity and quality of the resulting bins, and captures a broader set of the genes of the organisms being reconstructed.
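The graph view described above (contigs as nodes, paired-end links as edges) can be illustrated with a minimal sketch. This is not MetaCarvel's or Binnacle's algorithm: real scaffolding also uses read orientation, insert sizes, and repeat handling. Here contigs are simply grouped into connected components, and the contig names and links are hypothetical.

```python
from collections import defaultdict


def connected_components(contigs, links):
    """Group contigs that are joined, directly or indirectly, by links.

    `contigs` is a list of contig names; `links` is a list of (a, b) pairs,
    each standing in for paired-end evidence connecting two contigs.
    """
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in contigs:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited contig.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


# Hypothetical contigs and paired-end links (illustrative only).
contigs = ["c1", "c2", "c3", "c4", "c5"]
links = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
groups = connected_components(contigs, links)
print(groups)
```

Grouping linked contigs before binning means a short contig with unusual coverage can travel with its neighbors, which is the intuition behind binning scaffolds rather than individual contigs.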

Insights into the Human Virome Using CRISPR Spacers from Microbiomes

Viruses ◽

10.3390/v10090479 ◽

2018 ◽

Vol 10 (9) ◽

pp. 479 ◽

Cited By ~ 5

Author(s):

Claudio Hidalgo-Cantabrana ◽

Rosemary Sanozky-Dawes ◽

Rodolphe Barrangou

Keyword(s):

Pathogenic Bacteria ◽

Human Microbiome ◽

Adaptive Immune System ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Adaptive Immune ◽

Associated Proteins ◽

Health And Disease ◽

Generation Sequencing

Owing to advances in next-generation sequencing over the past decade, our understanding of the human microbiome and its relationship to health and disease has increased dramatically. Yet our insights into the human virome, and its interplay with important microbes that impact human health, are relatively limited. Prokaryotic and eukaryotic viruses are present throughout the human body, comprising a large and diverse population which influences several niches and impacts our health at various body sites. The presence of prokaryotic viruses such as phages has been documented at many different body sites, with the human gut being the richest ecological niche. Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and associated proteins constitute the adaptive immune system of bacteria, which protects against invasive nucleic acids. CRISPR-Cas systems function by uptake and integration of foreign genetic element sequences into the CRISPR array, which constitutes a genomic archive of iterative vaccination events. Consequently, CRISPR spacers can be investigated to reconstruct interplay between viruses and bacteria, and metagenomic sequencing data can be exploited to provide insights into host-phage interactions within a niche. Here, we show how the CRISPR spacer content of commensal and pathogenic bacteria can be used to find evidence of their phage exposure. This framework opens new opportunities for investigating host-virus dynamics in metagenomic data, and highlights the need to dedicate more effort to virome sampling and sequencing.
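Because spacers are copies of foreign sequence, a spacer that matches a phage genome is evidence of past exposure. The sketch below illustrates that idea with exact substring matching on both strands. It is an assumption-laden toy, not the authors' workflow: real spacers are typically around 30 bp, real analyses usually rely on alignment tools that tolerate mismatches, and all sequences and names here are made up.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    complement = str.maketrans("ACGT", "TGCA")
    return seq.translate(complement)[::-1]


def spacer_hits(spacers, phage_genomes):
    """Map each spacer ID to the phage genomes containing it on either strand."""
    hits = {}
    for spacer_id, spacer in spacers.items():
        rc = reverse_complement(spacer)
        hits[spacer_id] = [
            name
            for name, genome in phage_genomes.items()
            if spacer in genome or rc in genome
        ]
    return hits


# Hypothetical toy sequences (far shorter than real spacers and genomes).
phage_genomes = {
    "phageA": "TTTACGGATCCATTT",
    "phageB": "CCCCTTGAGAACCCC",
}
spacers = {
    "s1": "ACGGATCCA",  # present in phageA on the forward strand
    "s2": "TTCTCAA",    # reverse complement is present in phageB
    "s3": "GGGGGGGG",   # no match
}
hits = spacer_hits(spacers, phage_genomes)
print(hits)
```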

Evaluation of the CosmosID Bioinformatics Platform for Prosthetic Joint-Associated Sonicate Fluid Shotgun Metagenomic Data Analysis

Journal of Clinical Microbiology ◽

10.1128/jcm.01182-18 ◽

2018 ◽

Vol 57 (2) ◽

Cited By ~ 8

Author(s):

Qun Yan ◽

Yu Mi Wi ◽

Matthew J. Thoendel ◽

Yash S. Raval ◽

Kerryl E. Greenwood-Quaintance ◽

...

Keyword(s):

Antibiotic Resistance ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Antibacterial Resistance ◽

Sequencing Data ◽

Bacterial Detection ◽

Shotgun Metagenomic Sequencing ◽

Prosthetic Joint ◽

Validation Set ◽

Fluid Culture

ABSTRACT We previously demonstrated that shotgun metagenomic sequencing can detect bacteria in sonicate fluid, providing a diagnosis of prosthetic joint infection (PJI). A limitation of the approach that we used is that data analysis was time-consuming and specialized bioinformatics expertise was required, both of which are barriers to routine clinical use. Fortunately, automated commercial analytic platforms that can interpret shotgun metagenomic data are emerging. In this study, we evaluated the CosmosID bioinformatics platform using shotgun metagenomic sequencing data derived from 408 sonicate fluid samples from our prior study with the goal of evaluating the platform vis-à-vis bacterial detection and antibiotic resistance gene detection for predicting staphylococcal antibacterial susceptibility. Samples were divided into a derivation set and a validation set, each consisting of 204 samples; results from the derivation set were used to establish cutoffs, which were then tested in the validation set for identifying pathogens and predicting staphylococcal antibacterial resistance. Metagenomic analysis detected bacteria in 94.8% (109/115) of sonicate fluid culture-positive PJIs and 37.8% (37/98) of sonicate fluid culture-negative PJIs. Metagenomic analysis showed sensitivities ranging from 65.7 to 85.0% for predicting staphylococcal antibacterial resistance. In conclusion, the CosmosID platform has the potential to provide fast, reliable bacterial detection and identification from metagenomic shotgun sequencing data derived from sonicate fluid for the diagnosis of PJI. Strategies for metagenomic detection of antibiotic resistance genes for predicting staphylococcal antibacterial resistance need further development.
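The detection rates quoted above follow directly from the reported counts. As a small worked check (the function name is illustrative, not part of any platform):

```python
def detection_rate(detected, total):
    """Percentage of samples in which bacteria were detected."""
    return 100.0 * detected / total


# Counts as reported in the abstract.
culture_positive = detection_rate(109, 115)
culture_negative = detection_rate(37, 98)

print(f"culture-positive PJIs: {culture_positive:.1f}%")
print(f"culture-negative PJIs: {culture_negative:.1f}%")
```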

Utilizing the VirIdAl Pipeline to Search for Viruses in the Metagenomic Data of Bat Samples

Viruses ◽

10.3390/v13102006 ◽

2021 ◽

Vol 13 (10) ◽

pp. 2006

Author(s):

Anna Y Budkina ◽

Elena V Korneenko ◽

Ivan A Kotov ◽

Daniil A Kiselev ◽

Ilya V Artyushin ◽

...

Keyword(s):

Large Scale ◽

High Throughput Sequencing ◽

Metagenomic Data ◽

Sequencing Data ◽

Viral Pathogens ◽

Genomic Databases ◽

Bioinformatic Pipeline ◽

Viral Genomes ◽

Sequencing Technologies ◽

Viral Screening

According to various estimates, only a small percentage of existing viruses have been discovered, and even fewer are represented in genomic databases. High-throughput sequencing technologies are developing rapidly, enabling large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms have yet to be assigned specific loci for identification. This problem particularly impedes viral screening, due to vast heterogeneity in viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, alignment-based methods were unable to identify the taxon for a large proportion of reads, so we applied other approaches and showed that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in studies of viral diversity, and therefore necessitates the use of combined approaches, including those based on machine learning methods.

Towards end-to-end disease prediction from raw metagenomic data

10.1101/2020.10.29.360297 ◽

2020 ◽

Author(s):

Maxence Queyrel ◽

Edi Prifti ◽

Jean-Daniel Zucker

Keyword(s):

Dna Sequences ◽

Real Life ◽

Multiple Instance Learning ◽

Disease Classification ◽

Metagenomic Data ◽

Numerical Representation ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

End To End ◽

Bioinformatics Workflows

Abstract: Analysis of the human microbiome using metagenomic sequencing data has shown a strong ability to discriminate between various human diseases. Raw metagenomic sequencing data require multiple complex and computationally heavy bioinformatics steps prior to data analysis. Such data contain millions of short sequences read from the fragmented DNA sequences and are stored as fastq files. Conventional processing pipelines consist of multiple steps including quality control, filtering, and alignment of sequences against genomic catalogs (genes, species, taxonomic levels, functional pathways, etc.). These pipelines are complex to use, time consuming and rely on a large number of parameters that often introduce variability and impact the estimation of the microbiome elements. Recent studies have demonstrated that training Deep Neural Networks directly from raw sequencing data is a promising approach to bypass some of the challenges associated with mainstream bioinformatics pipelines. Most of these methods use the concept of word and sentence embeddings that create a meaningful and numerical representation of DNA sequences, while extracting features and reducing the dimensionality of the data. In this paper we present an end-to-end approach that classifies patients into disease groups directly from raw metagenomic reads: metagenome2vec. This approach is composed of four steps: (i) generating a vocabulary of k-mers and learning their numerical embeddings; (ii) learning DNA sequence (read) embeddings; (iii) identifying the genome from which the sequence is most likely to come; and (iv) training a multiple instance learning classifier which predicts the phenotype based on the vector representation of the raw data. An attention mechanism is applied in the network so that the model can be interpreted, assigning a weight to the influence of the prediction for each genome. Using two public real-life datasets as well as a simulated one, we demonstrated that this original approach achieved very high performance, comparable to that of state-of-the-art methods applied directly to data processed through mainstream bioinformatics workflows. These results are encouraging for this proof-of-concept work. We believe that with further dedication, the DNN models have the potential to surpass mainstream bioinformatics workflows in disease classification tasks.
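Step (i) of the approach, building a k-mer vocabulary, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the metagenome2vec code: the function names, the toy reads, and the choice of k = 3 are all hypothetical, and the embedding learning itself is omitted.

```python
def kmers(read, k):
    """Split a read into its overlapping k-mers, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_vocabulary(reads, k):
    """Assign an integer ID to each distinct k-mer, in order of first appearance."""
    vocab = {}
    for read in reads:
        for kmer in kmers(read, k):
            if kmer not in vocab:
                vocab[kmer] = len(vocab)
    return vocab


# Hypothetical toy reads; real reads are much longer and k is typically larger.
reads = ["ACGTAC", "GTACGT"]
vocab = build_vocabulary(reads, 3)

# Each read becomes a sequence of token IDs, analogous to words in a sentence.
encoded = [[vocab[km] for km in kmers(read, 3)] for read in reads]
print(vocab)
print(encoded)
```

Treating k-mers as words and reads as sentences is what allows word-embedding techniques to be applied to DNA, with the integer IDs serving as indices into an embedding table.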

CasCollect: targeted assembly of CRISPR-associated operons from high-throughput sequencing data

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa063 ◽

2020 ◽

Vol 2 (3) ◽

Author(s):

Joshua D Podlevsky ◽

Corey M Hudson ◽

Jerilyn A Timlin ◽

Kelly P Williams

Keyword(s):

High Throughput Sequencing ◽

Markov Models ◽

Adaptive Immune System ◽

Metagenomic Data ◽

Sequencing Data ◽

Bacteriophage Therapy ◽

High Throughput Sequencing Data ◽

Assembly Pipeline ◽

Cas Genes ◽

Cas Proteins

Abstract CRISPR arrays and CRISPR-associated (Cas) proteins comprise a widespread adaptive immune system in bacteria and archaea. These systems function as a defense against exogenous parasitic mobile genetic elements that include bacteriophages, plasmids and foreign nucleic acids. With the continuous spread of antibiotic resistance, knowledge of pathogen susceptibility to bacteriophage therapy is becoming more critical. Additionally, gene-editing applications would benefit from the discovery of new cas genes with favorable properties. While next-generation sequencing has produced staggering quantities of data, transitioning from raw sequencing reads to the identification of CRISPR/Cas systems has remained challenging. This is especially true for metagenomic data, which has the highest potential for identifying novel cas genes. We report a comprehensive computational pipeline, CasCollect, for the targeted assembly and annotation of cas genes and CRISPR arrays—even isolated arrays—from raw sequencing reads. Benchmarking our targeted assembly pipeline demonstrates significantly improved timing by almost two orders of magnitude compared with conventional assembly and annotation, while retaining the ability to detect CRISPR arrays and cas genes. CasCollect is a highly versatile pipeline and can be used for targeted assembly of any specialty gene set, reconfigurable for user provided Hidden Markov Models and/or reference nucleotide sequences.

phylogenize: correcting for phylogeny reveals genes associated with microbial distributions

Bioinformatics ◽

10.1093/bioinformatics/btz722 ◽

2019 ◽

Vol 36 (4) ◽

pp. 1289-1290

Author(s):

Patrick H Bradley ◽

Katherine S Pollard

Keyword(s):

Community Composition ◽

Human Microbiome ◽

Human Microbiome Project ◽

Shotgun Sequencing ◽

Supplementary Information ◽

Phylogenetic Comparative Methods ◽

Supplementary Data ◽

Sequencing Data ◽

Phylogenetic Regression ◽

Project Data

Abstract. Summary: Phylogenetic comparative methods are powerful but presently under-utilized ways to identify microbial genes underlying differences in community composition. These methods help to identify functionally important genes because they test for associations beyond those expected when related microbes occupy similar environments. We present phylogenize, a pipeline with web, QIIME 2 and R interfaces that allows researchers to perform phylogenetic regression on 16S amplicon and shotgun sequencing data and to visualize results. phylogenize applies broadly to both host-associated and environmental microbiomes. Using Human Microbiome Project and Earth Microbiome Project data, we show that phylogenize draws similar conclusions from 16S versus shotgun sequencing and reveals both known and candidate pathways associated with host colonization. Availability and implementation: phylogenize is available at https://phylogenize.org and https://bitbucket.org/pbradz/phylogenize. Supplementary information: Supplementary data are available at Bioinformatics online.

Synthetic Sequencing Standards: A Guide to Database Choice for Rumen Microbiota Amplicon Sequencing Analysis

Frontiers in Microbiology ◽

10.3389/fmicb.2020.606825 ◽

2020 ◽

Vol 11 ◽

Author(s):

Paul E. Smith ◽

Sinead M. Waters ◽

Ruth Gómez Expósito ◽

Hauke Smidt ◽

Ciara A. Carberry ◽

...

Keyword(s):

High Throughput Sequencing ◽

Cost Effective ◽

Amplicon Sequencing ◽

Gas Production ◽

Reference Database ◽

Specific Reference ◽

Sequencing Analysis ◽

Sequencing Data ◽

Rumen Microbiota ◽

Reference Databases

Our understanding of complex microbial communities, such as those residing in the rumen, has drastically advanced through the use of high throughput sequencing (HTS) technologies. Indeed, with the use of barcoded amplicon sequencing, it is now cost effective and computationally feasible to identify individual rumen microbial genera associated with ruminant livestock nutrition, genetics, performance and greenhouse gas production. However, across all disciplines of microbial ecology, there is currently little reporting of the use of internal controls for validating HTS results. Furthermore, there is little consensus on the most appropriate reference database for analyzing rumen microbiota amplicon sequencing data. Therefore, in this study, a synthetic rumen-specific sequencing standard was used to assess the effects of database choice on results obtained from rumen microbial amplicon sequencing. Four DADA2 reference training sets (RDP, SILVA, GTDB, and RefSeq + RDP) were compared to assess their ability to correctly classify sequences included in the rumen-specific sequencing standard. In addition, two thresholds of phylogenetic bootstrapping, 50 and 80, were applied to investigate the effect of increasing stringency. Sequence classification differences were apparent amongst the databases. For example, the classification of Clostridium differed between all databases, thus highlighting the need for a consistent approach to nomenclature amongst different reference databases. It is hoped that the effect of database choice on taxonomic classification observed in this study will encourage research groups across various microbial disciplines to develop and routinely use their own microbiome-specific reference standard to validate analysis pipelines and database choice.
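The effect of raising the bootstrap threshold from 50 to 80 can be illustrated with a minimal sketch. It assumes that a classifier reports one bootstrap support value per taxonomic rank and that an assignment is truncated at the first rank falling below the threshold. This is an illustration of thresholding in general, not DADA2 code, and the lineage and support values are hypothetical.

```python
def apply_min_bootstrap(taxonomy, bootstraps, min_boot):
    """Keep ranks until support first drops below `min_boot`; blank the rest.

    `taxonomy` and `bootstraps` are parallel lists ordered from the highest
    rank (e.g. kingdom) to the lowest (e.g. genus).
    """
    result = []
    truncated = False
    for name, support in zip(taxonomy, bootstraps):
        if truncated or support < min_boot:
            truncated = True
            result.append(None)
        else:
            result.append(name)
    return result


# Hypothetical lineage and per-rank bootstrap support (illustrative only).
lineage = ["Bacteria", "Firmicutes", "Clostridia",
           "Clostridiales", "Clostridiaceae", "Clostridium"]
support = [100, 98, 95, 90, 72, 61]

at_50 = apply_min_bootstrap(lineage, support, 50)
at_80 = apply_min_bootstrap(lineage, support, 80)
print(at_50)
print(at_80)
```

Under these assumed values, the stricter threshold leaves the family and genus unassigned, which is the kind of stringency trade-off the two thresholds are meant to probe.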

A high-resolution pipeline for 16S-sequencing identifies bacterial strains in human microbiome

10.1101/565572 ◽

2019 ◽

Cited By ~ 1

Author(s):

Igor Segota ◽

Tao Long

Keyword(s):

Bacterial Species ◽

Human Microbiome ◽

Amplicon Sequencing ◽

R Package ◽

Strain Level ◽

Sequencing Data ◽

Bacterial Strains ◽

16S Sequencing ◽

16S Amplicon Sequencing ◽

Sequencing Data Analysis

We developed a High-resolution Microbial Analysis Pipeline (HiMAP) for 16S amplicon sequencing data analysis, aiming at bacterial species- or strain-level identification from the human microbiome to enable experimental validation of the causal effects of the associated bacterial strains on health and disease. HiMAP achieved higher accuracy in identifying species in a human microbiome mock community than other pipelines. HiMAP identified the majority of the species detected by whole genome shotgun sequencing using MetaPhlAn2, with strain-level resolution wherever possible, and reported comparable relative abundances. HiMAP is an open-source R package available at https://github.com/taolonglab/himap.
