Accurate and sensitive detection of microbial eukaryotes from whole metagenome shotgun sequencing

Abstract. Background: Microbial eukaryotes are found alongside bacteria and archaea in natural microbial systems, including host-associated microbiomes. While microbial eukaryotes are critical to these communities, they are challenging to study with shotgun sequencing techniques and are therefore often excluded. Results: Here, we present EukDetect, a bioinformatics method to identify eukaryotes in shotgun metagenomic sequencing data. Our approach uses a database of 521,824 universal marker genes from 241 conserved gene families, which we curated from 3713 fungal, protist, non-vertebrate metazoan, and non-streptophyte archaeplastida genomes and transcriptomes. EukDetect has a broad taxonomic coverage of microbial eukaryotes, performs well on low-abundance and closely related species, and is resilient against bacterial contamination in eukaryotic genomes. Using EukDetect, we describe the spatial distribution of eukaryotes along the human gastrointestinal tract, showing that fungi and protists are present in the lumen and mucosa throughout the large intestine. We discover that there is a succession of eukaryotes that colonize the human gut during the first years of life, mirroring patterns of developmental succession observed in gut bacteria. By comparing DNA and RNA sequencing of paired samples from human stool, we find that many eukaryotes continue active transcription after passage through the gut, though some do not, suggesting they are dormant or nonviable. We analyze metagenomic data from the Baltic Sea and find that eukaryotes differ across locations and salinity gradients. Finally, we observe eukaryotes in Arabidopsis leaf samples, many of which are not identifiable from public protein databases. Conclusions: EukDetect provides an automated and reliable way to characterize eukaryotes in shotgun sequencing datasets from diverse microbiomes. We demonstrate that it enables discoveries that would be missed or clouded by false positives with standard shotgun sequence analysis. EukDetect will greatly advance our understanding of how microbial eukaryotes contribute to microbiomes.


Accurate and sensitive detection of microbial eukaryotes from metagenomic shotgun sequencing

10.1101/2020.07.22.216580 ◽

2020 ◽

Author(s):

Abigail L. Lind ◽

Katherine S. Pollard

Keyword(s):

Shotgun Sequencing ◽

Marker Genes ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Dna And Rna ◽

Paired Samples ◽

Microbial Eukaryotes ◽

Shotgun Metagenomic Sequencing ◽

Active Transcription ◽

Eukaryotic Genomes

Abstract. Microbial eukaryotes are found alongside bacteria and archaea in natural microbial systems, including host-associated microbiomes. While microbial eukaryotes are critical to these communities, they are often not included in metagenomic analyses. Here we present EukDetect, a bioinformatics method to identify eukaryotes in shotgun metagenomic sequencing data. Our approach uses a database of universal marker genes, which we curated from all 2,571 currently available fungal, protist, worm and other diverse eukaryotic genomes. EukDetect is accurate and sensitive, has a broad taxonomic coverage of microbial eukaryotes, and is resilient against bacterial contamination in eukaryotic genomes. Using EukDetect, we describe the spatial distribution of eukaryotes along the human gastrointestinal tract, showing that fungi and protists are present in the lumen and mucosa throughout the large intestine. We discover that there is a succession of eukaryotes that colonize the human gut during the first years of life, similar to patterns of developmental succession observed in gut bacteria. By comparing DNA and RNA sequencing of paired samples from human stool, we find that many eukaryotes continue active transcription after passage through the gut, while others do not, suggesting they are dormant or nonviable. Finally, we observe eukaryotes in Arabidopsis leaf samples, many of which are not identifiable from public protein databases. Thus, EukDetect provides an automated and reliable way to characterize eukaryotes in shotgun sequencing datasets from diverse microbiomes.


Comprehensive discovery of CRISPR-targeted terminally redundant sequences in the human gut metagenome: Viruses, plasmids, and more

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009428 ◽

2021 ◽

Vol 17 (10) ◽

pp. e1009428

Author(s):

Ryota Sugimoto ◽

Luca Nishimura ◽

Phuong Thanh Nguyen ◽

Jumpei Ito ◽

Nicholas F. Parrish ◽

...

Keyword(s):

De Novo ◽

Sequence Similarity ◽

Metagenomic Data ◽

Marker Genes ◽

Biological Entity ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Human Gut ◽

Protein Coding ◽

Viral Sequences

Viruses are the most numerous biological entity, existing in all environments and infecting all cellular organisms. Compared with cellular life, the evolution and origin of viruses are poorly understood; viruses are enormously diverse, and most lack sequence similarity to cellular genes. To uncover viral sequences without relying on either reference viral sequences from databases or marker genes that characterize specific viral taxa, we developed an analysis pipeline for virus inference based on clustered regularly interspaced short palindromic repeats (CRISPR). CRISPR is a prokaryotic nucleic acid restriction system that stores the memory of previous exposure. Our protocol can infer CRISPR-targeted sequences, including viruses, plasmids, and previously uncharacterized elements, and predict their hosts using unassembled short-read metagenomic sequencing data. By analyzing human gut metagenomic data, we extracted 11,391 terminally redundant CRISPR-targeted sequences, which are likely complete circular genomes. The sequences included 2,154 tailed-phage genomes, together with 257 complete crAssphage genomes, 11 genomes larger than 200 kilobases, 766 genomes of Microviridae species, 56 genomes of Inoviridae species, and 95 previously uncharacterized circular small genomes that have no reliably predicted protein-coding gene. We predicted the host(s) of approximately 70% of the discovered genomes at the taxonomic level of phylum by linking protospacers to taxonomically assigned CRISPR direct repeats. These results demonstrate that our protocol is efficient for de novo inference of CRISPR-targeted sequences and their host prediction.
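The abstract treats terminal redundancy (the same bases at both ends of an assembled sequence) as a sign of a complete circular genome. The Python sketch below illustrates that idea only; it is not the authors' pipeline. The function names, the exact-match criterion, and the `min_len` threshold are assumptions made for illustration.

```python
def terminal_redundancy(seq: str, min_len: int = 10) -> int:
    """Return the length of the longest prefix that is repeated exactly
    at the end of seq, or 0 if no repeat of at least min_len exists.
    Exact matching is an illustrative assumption."""
    max_len = len(seq) // 2  # prefix and suffix must not overlap
    for n in range(max_len, min_len - 1, -1):
        if seq[:n] == seq[-n:]:
            return n
    return 0


def trim_to_circular(seq: str, min_len: int = 10) -> str:
    """Drop the repeated end so the circular genome is represented once."""
    n = terminal_redundancy(seq, min_len)
    return seq[:-n] if n else seq


# Toy example: a 14-base "genome" assembled with its first 5 bases repeated.
unit = "ATGCGTACGTTAGC"
contig = unit + unit[:5]
repeat_len = terminal_redundancy(contig, min_len=5)
trimmed = trim_to_circular(contig, min_len=5)
print(repeat_len, trimmed == unit)
```

Real assemblies would need to tolerate sequencing errors in the repeat; the exact comparison here keeps the example short.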


MG-MLST: Characterizing the Microbiome at the Strain Level in Metagenomic Data

Microorganisms ◽

10.3390/microorganisms8050684 ◽

2020 ◽

Vol 8 (5) ◽

pp. 684

Author(s):

Nathanael J. Bangayan ◽

Baochen Shi ◽

Jerry Trinh ◽

Emma Barnard ◽

Gabriela Kasimatis ◽

...

Keyword(s):

High Throughput Sequencing ◽

Human Microbiome ◽

Shotgun Sequencing ◽

Strain Level ◽

Multi Locus Sequence Typing ◽

Metagenomic Data ◽

Sequencing Analysis ◽

Metagenomic Sequencing ◽

Healthy Skin ◽

Sequencing Data

The microbiome plays an important role in human physiology. The composition of the human microbiome has been described at the phylum, class, genus, and species levels; however, it is largely unknown at the strain level. The importance of strain-level differences in microbial communities has been increasingly recognized in understanding disease associations. Current methods for identifying strain populations often require deep metagenomic sequencing and a comprehensive set of reference genomes. In this study, we developed a method, metagenomic multi-locus sequence typing (MG-MLST), to determine strain-level composition in a microbial community by combining high-throughput sequencing with multi-locus sequence typing (MLST). We used a commensal bacterium, Propionibacterium acnes, as an example to test the ability of MG-MLST to identify the strain composition. Using simulated communities, MG-MLST accurately predicted the strain populations in all samples. We further validated the method using MLST gene amplicon libraries and metagenomic shotgun sequencing data of clinical skin samples. MG-MLST yielded strain composition results consistent with those obtained from nearly full-length 16S rRNA clone libraries and metagenomic shotgun sequencing analysis. When comparing strain-level differences between acne and healthy skin microbiomes, we demonstrated that strains of RT2/6 were highly associated with healthy skin, consistent with previous findings. In summary, MG-MLST provides a quantitative analysis of the strain populations in the microbiome with diversity and richness. It can be applied to microbiome studies to reveal strain-level differences between groups, which are critical in many microorganism-related diseases.


De novo virus inference and host prediction from metagenome using CRISPR spacers

10.1101/2020.09.04.282665 ◽

2020 ◽

Author(s):

Ryota Sugimoto ◽

Luca Nishimura ◽

Phuong Nguyen Thanh ◽

Jumpei Ito ◽

Nicholas F. Parrish ◽

...

Keyword(s):

De Novo ◽

Sequence Similarity ◽

Metagenomic Data ◽

Marker Genes ◽

Biological Entity ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Human Gut ◽

Cellular Genes ◽

Viral Sequences

Abstract. Viruses are the most numerous biological entity, existing in all environments and infecting all cellular organisms. Compared with cellular life, the evolution and origin of viruses are poorly understood; viruses are enormously diverse and most lack sequence similarity to cellular genes. To uncover viral sequences without relying on either reference viral sequences from databases or marker genes known to characterize specific viral taxa, we developed an analysis pipeline for virus inference based on clustered regularly interspaced short palindromic repeats (CRISPR). CRISPR is a prokaryotic nucleic acid restriction system that stores memory of previous exposure. Our protocol can infer viral sequences targeted by CRISPR and predict their hosts using unassembled short-read metagenomic sequencing data. Analysing human gut metagenomic data, we extracted 11,391 terminally redundant CRISPR-targeted sequences, which are likely complete circular genomes of viruses or plasmids. The sequences include 257 complete crAssphage family genomes, 11 genomes larger than 200 kilobases, 766 genomes of Microviridae species, 114 genomes of Inoviridae species and many entirely novel genomes of unknown taxa. We predicted the host(s) of approximately 70% of discovered genomes by linking protospacers to taxonomically assigned CRISPR direct repeats. These results indicate that our protocol is efficient for de novo inference of viral genomes and host prediction. In addition, we investigated the origin of the diversity-generating retroelement (DGR) locus of the crAssphage family. Phylogenetic analysis and gene locus comparisons indicate that DGR is orthologous in human gut crAssphages and shares a common ancestor with baboon-derived crAssphage; however, the locus has likely been lost in multiple lineages recently.


Evaluation of the CosmosID Bioinformatics Platform for Prosthetic Joint-Associated Sonicate Fluid Shotgun Metagenomic Data Analysis

Journal of Clinical Microbiology ◽

10.1128/jcm.01182-18 ◽

2018 ◽

Vol 57 (2) ◽

Cited By ~ 8

Author(s):

Qun Yan ◽

Yu Mi Wi ◽

Matthew J. Thoendel ◽

Yash S. Raval ◽

Kerryl E. Greenwood-Quaintance ◽

...

Keyword(s):

Antibiotic Resistance ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Antibacterial Resistance ◽

Sequencing Data ◽

Bacterial Detection ◽

Shotgun Metagenomic Sequencing ◽

Prosthetic Joint ◽

Validation Set ◽

Fluid Culture

ABSTRACT We previously demonstrated that shotgun metagenomic sequencing can detect bacteria in sonicate fluid, providing a diagnosis of prosthetic joint infection (PJI). A limitation of the approach that we used is that data analysis was time-consuming and specialized bioinformatics expertise was required, both of which are barriers to routine clinical use. Fortunately, automated commercial analytic platforms that can interpret shotgun metagenomic data are emerging. In this study, we evaluated the CosmosID bioinformatics platform using shotgun metagenomic sequencing data derived from 408 sonicate fluid samples from our prior study with the goal of evaluating the platform vis-à-vis bacterial detection and antibiotic resistance gene detection for predicting staphylococcal antibacterial susceptibility. Samples were divided into a derivation set and a validation set, each consisting of 204 samples; results from the derivation set were used to establish cutoffs, which were then tested in the validation set for identifying pathogens and predicting staphylococcal antibacterial resistance. Metagenomic analysis detected bacteria in 94.8% (109/115) of sonicate fluid culture-positive PJIs and 37.8% (37/98) of sonicate fluid culture-negative PJIs. Metagenomic analysis showed sensitivities ranging from 65.7 to 85.0% for predicting staphylococcal antibacterial resistance. In conclusion, the CosmosID platform has the potential to provide fast, reliable bacterial detection and identification from metagenomic shotgun sequencing data derived from sonicate fluid for the diagnosis of PJI. Strategies for metagenomic detection of antibiotic resistance genes for predicting staphylococcal antibacterial resistance need further development.
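The detection rates quoted in this abstract follow directly from the counts given in parentheses. A minimal Python check of that arithmetic is shown here; the helper name `percent` is ours and is not part of the CosmosID platform.

```python
def percent(hits: int, total: int) -> float:
    """Return hits/total as a percentage rounded to one decimal place."""
    return round(100 * hits / total, 1)


# Counts taken from the abstract above.
culture_positive = percent(109, 115)  # detections among culture-positive PJIs
culture_negative = percent(37, 98)    # detections among culture-negative PJIs

print(culture_positive, culture_negative)
```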


MetaDEGalaxy: Galaxy workflow for differential abundance analysis of 16s metagenomic data

F1000Research ◽

10.12688/f1000research.18866.2 ◽

2019 ◽

Vol 8 ◽

pp. 726

Author(s):

Mike W.C. Thang ◽

Xin-Yi Chua ◽

Gareth Price ◽

Dominique Gorse ◽

Matt A. Field

Keyword(s):

Microbial Communities ◽

Sequence Data ◽

Metagenomic Data ◽

Marker Genes ◽

Metagenomic Sequencing ◽

Differential Analysis ◽

Biomedical Sciences ◽

Metagenomic Sequence ◽

Differential Abundance ◽

Differential Abundance Analysis

Metagenomic sequencing is an increasingly common tool in environmental and biomedical sciences. While software for detailing the composition of microbial communities using 16S rRNA marker genes is relatively mature, increasingly researchers are interested in identifying changes exhibited within microbial communities under differing environmental conditions. In order to gain maximum value from metagenomic sequence data, we must improve the existing analysis environment by providing accessible and scalable computational workflows able to generate reproducible results. Here we describe a complete end-to-end open-source metagenomics workflow running within Galaxy for 16S differential abundance analysis. The workflow accepts 454 or Illumina sequence data (either overlapping or non-overlapping paired-end reads) and outputs lists of the operational taxonomic units (OTUs) exhibiting the greatest change under differing conditions. A range of analysis steps and graphing options are available, giving users a high level of control over their data and analyses. Additionally, users are able to input complex sample-specific metadata, which can be incorporated into differential analysis and used for grouping/colouring within graphs. Detailed tutorials containing sample data and existing workflows are available for three different input types: overlapping and non-overlapping read pairs as well as pre-generated Biological Observation Matrix (BIOM) files. Using the Galaxy platform we developed MetaDEGalaxy, a complete metagenomics differential abundance analysis workflow. MetaDEGalaxy is designed for bench scientists working with 16S data who are interested in comparative metagenomics. MetaDEGalaxy builds on momentum within the wider Galaxy metagenomics community with the hope that more tools will be added as existing methods mature.


Towards end-to-end disease prediction from raw metagenomic data

10.1101/2020.10.29.360297 ◽

2020 ◽

Author(s):

Maxence Queyrel ◽

Edi Prifti ◽

Jean-Daniel Zucker

Keyword(s):

Dna Sequences ◽

Real Life ◽

Multiple Instance Learning ◽

Disease Classification ◽

Metagenomic Data ◽

Numerical Representation ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

End To End ◽

Bioinformatics Workflows

Abstract. Analysis of the human microbiome using metagenomic sequencing data has demonstrated high ability in discriminating various human diseases. Raw metagenomic sequencing data require multiple complex and computationally heavy bioinformatics steps prior to data analysis. Such data contain millions of short sequences read from the fragmented DNA sequences and are stored as fastq files. Conventional processing pipelines consist of multiple steps including quality control, filtering, and alignment of sequences against genomic catalogs (genes, species, taxonomic levels, functional pathways, etc.). These pipelines are complex to use, time-consuming and rely on a large number of parameters that often introduce variability and impact the estimation of the microbiome elements. Recent studies have demonstrated that training Deep Neural Networks directly from raw sequencing data is a promising approach to bypass some of the challenges associated with mainstream bioinformatics pipelines. Most of these methods use the concept of word and sentence embeddings that create a meaningful and numerical representation of DNA sequences, while extracting features and reducing the dimensionality of the data. In this paper we present an end-to-end approach that classifies patients into disease groups directly from raw metagenomic reads: metagenome2vec. This approach is composed of four steps: (i) generating a vocabulary of k-mers and learning their numerical embeddings; (ii) learning DNA sequence (read) embeddings; (iii) identifying the genome from which the sequence is most likely to come; and (iv) training a multiple instance learning classifier which predicts the phenotype based on the vector representation of the raw data. An attention mechanism is applied in the network so that the model can be interpreted, assigning a weight to the influence of the prediction for each genome. Using two public real-life datasets as well as a simulated one, we demonstrated that this original approach reached very high performance, comparable with state-of-the-art methods applied directly to processed data through mainstream bioinformatics workflows. These results are encouraging for this proof-of-concept work. We believe that with further dedication, the DNN models have the potential to surpass mainstream bioinformatics workflows in disease classification tasks.
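Step (i) of the approach, building a k-mer vocabulary, can be sketched in Python as follows. This is an illustrative sketch, not the metagenome2vec code: the function names, the stride of 1, and the `min_count` filter are assumptions, and the embedding-learning part of the step is omitted.

```python
from collections import Counter
from typing import Dict, Iterable, List


def kmers(read: str, k: int) -> List[str]:
    """Split a read into overlapping k-mers (stride 1, an assumption)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_vocabulary(reads: Iterable[str], k: int, min_count: int = 1) -> Dict[str, int]:
    """Map every k-mer seen at least min_count times to an integer index."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    kept = sorted(km for km, c in counts.items() if c >= min_count)
    return {km: idx for idx, km in enumerate(kept)}


# Toy reads; real input would be parsed from fastq files.
reads = ["ACGTAC", "GTACGT"]
vocab = build_vocabulary(reads, k=3)

# Each read becomes a sequence of vocabulary indices, the form an
# embedding model would consume.
encoded = [[vocab[km] for km in kmers(r, 3)] for r in reads]
print(vocab, encoded)
```

Sorting the k-mers before indexing makes the vocabulary deterministic across runs, which helps when embeddings are trained and reused separately.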


Harnessing the strategy of metagenomics for exploring the intestinal microecology of sable (Martes zibellina), the national first-level protected animal

10.21203/rs.3.rs-28506/v3 ◽

2020 ◽

Author(s):

Jiakuo Yan ◽

Xiaoyang Wu ◽

Jun Chen ◽

Yao Chen ◽

Honghai Zhang

Keyword(s):

Information Processing ◽

Complex Structure ◽

Intestinal Flora ◽

Metagenomic Library ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Illumina Hiseq ◽

Martes Zibellina ◽

Gene Functions

Abstract. Sable (Martes zibellina), a member of family Mustelidae, order Carnivora, is primarily distributed in the cold northern zone of Eurasia. The purpose of this study was to explore the intestinal flora of the sable by metagenomic library-based techniques. Libraries were sequenced on an Illumina HiSeq 4000 instrument. The effective sequencing data of each sample was above 6,000 M, and the ratio of clean reads to raw reads was over 98%. The total ORF length was approximately 603,031, equivalent to 347.36 Mbp. We investigated gene functions with the KEGG database and identified 7,140 KEGG ortholog (KO) groups comprising 129,788 genes across all of the samples. We selected a subset of genes with the highest abundances to construct cluster heat maps. From the results of the KEGG metabolic pathway annotations, we acquired information on gene functions, as represented by the categories of metabolism, environmental information processing, genetic information processing, cellular processes and organismal systems. We then investigated gene function with the CAZy database and identified functional carbohydrate hydrolases corresponding to genes in the intestinal microorganisms of sable. This finding is consistent with the fact that the sable is adapted to cold environments and requires a large amount of energy to maintain its metabolic activity. We also investigated gene functions with the eggNOG database; the main functions of genes included gene duplication, recombination and repair, transport and metabolism of amino acids, and transport and metabolism of carbohydrates. In this study, we attempted to identify the complex structure of the microbial population of sable based on metagenomic sequencing methods, which use whole metagenomic data, and to map the obtained sequences to known genes or pathways in existing databases, such as CAZy, KEGG, and eggNOG. We then explored the genetic composition and functional diversity of the microbial community based on the mapped functional categories.


BiomeSeq: A Tool for the Characterization of Animal Microbiomes from Metagenomic Data

10.21203/rs.3.rs-842545/v1 ◽

2021 ◽

Author(s):

Kelly A. Mulholland ◽

Calvin L. Keeler

Keyword(s):

Relative Abundance ◽

Performance Metrics ◽

Complete Characterization ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Microbial Composition ◽

Additional Species ◽

User Friendly

Abstract. Background: The complete characterization of a microbiome is critical in elucidating the complex ecology of the microbial composition within healthy and diseased animals. Many microbiome studies characterize only the bacterial component, for which there are several well-developed sequencing methods, bioinformatics tools and databases available. The lack of comprehensive bioinformatics workflows and databases has limited efforts to characterize the other components existing in a microbiome. BiomeSeq is a tool for the analysis of the complete animal microbiome using metagenomic sequencing data. With its comprehensive workflow and customizable parameters and microbial databases, BiomeSeq can rapidly quantify the viral, fungal, bacteriophage and bacterial components of a sample and produce informative tables for analysis. Results: Simulated datasets were constructed, which contained known abundances of microbial sequences, and several performance metrics were analyzed, including correlation of predicted abundance with known abundance, root mean square error and rate of speed. BiomeSeq demonstrated high precision (average of 99.52%) and sensitivity (average of 93.01%). BiomeSeq was employed in detecting and quantifying the respiratory microbiome of a commercial poultry broiler flock throughout its grow-out cycle from hatching to processing and successfully processed 780 million reads. For each microbial species detected, BiomeSeq calculated the normalized abundance, percent relative abundance, and coverage as well as the diversity for each sample. Rate of speed for each step in the pipeline, precision and accuracy were calculated to examine BiomeSeq’s performance using in silico sequencing datasets. When compared to bacterial results generated by the commonly used 16S rRNA sequencing method, BiomeSeq detected the same most abundant bacteria, including Gallibacterium, Corynebacterium and Staphylococcus, as well as several additional species. Conclusions: BiomeSeq provides for the detection and quantification of the microbiome from next-generation metagenomic sequencing data. This tool is implemented in a user-friendly container that requires one command and generates a table containing taxonomical information for each microbe detected. It also determines normalized abundance, percent relative abundance, genome coverage and sample diversity calculations for each sample.
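The performance metrics named in this abstract (precision, sensitivity, root mean square error, percent relative abundance) can be illustrated with their standard textbook definitions. The Python sketch below assumes those definitions; the abstract does not state how BiomeSeq computes them, and the function names and toy taxa are hypothetical.

```python
import math
from typing import Dict, Set, Tuple


def precision_sensitivity(predicted: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Precision = TP / predicted taxa; sensitivity = TP / true taxa."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    sensitivity = tp / len(truth) if truth else 0.0
    return precision, sensitivity


def rmse(predicted: Dict[str, float], known: Dict[str, float]) -> float:
    """Root mean square error over the union of taxa (missing taxa count as 0)."""
    taxa = set(predicted) | set(known)
    squared = sum((predicted.get(t, 0.0) - known.get(t, 0.0)) ** 2 for t in taxa)
    return math.sqrt(squared / len(taxa))


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw read counts to percent relative abundance."""
    total = sum(counts.values())
    return {taxon: 100 * c / total for taxon, c in counts.items()}


# Toy simulated community with known abundances (percent).
known = {"taxon_a": 50.0, "taxon_b": 30.0, "taxon_c": 20.0}
predicted = {"taxon_a": 50.0, "taxon_b": 30.0, "taxon_d": 20.0}

prec, sens = precision_sensitivity(set(predicted), set(known))
error = rmse(predicted, known)
percents = relative_abundance({"taxon_a": 3, "taxon_b": 1})
print(prec, sens, error, percents)
```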


Conserved bacterial genomes from two geographically distinct peritidal stromatolite formations shed light on potential functional guilds

10.1101/818625 ◽

2019 ◽

Author(s):

Samantha C. Waterworth ◽

Eric W. Isemonger ◽

Evan R. Rees ◽

Rosemary A. Dorrington ◽

Jason C. Kwan

Keyword(s):

Microbial Mats ◽

Bacterial Species ◽

Species Conservation ◽

Cumulative Effect ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Space Forms ◽

Nitrogenous Compounds ◽

Shark Bay

SUMMARY: Stromatolites are complex microbial mats that form lithified layers and ancient forms are the oldest evidence of life on earth, dating back over 3.4 billion years. Modern stromatolites are relatively rare but may provide clues about the function and evolution of their ancient counterparts. In this study, we focus on peritidal stromatolites occurring at Cape Recife and Schoenmakerskop on the southeastern South African coastline. Using assembled shotgun metagenomic data we obtained 183 genomic bins, of which the most dominant taxa were from the Cyanobacteriia class (Cyanobacteria phylum), with lower but notable abundances of bacteria classified as Alphaproteobacteria, Gammaproteobacteria and Bacteroidia. We identified functional gene sets in bacterial species conserved across two geographically distinct stromatolite formations, which may promote carbonate precipitation through the reduction of nitrogenous compounds and possible production of calcium ions. We propose that an abundance of extracellular alkaline phosphatases may lead to the formation of phosphatic deposits within these stromatolites. We conclude that the cumulative effect of several conserved bacterial species drives accretion in these two stromatolite formations. ORIGINALITY-SIGNIFICANCE: Peritidal stromatolites are unique among stromatolite formations as they grow at the dynamic interface of calcium carbonate-rich groundwater and coastal marine waters. The peritidal space forms a relatively unstable environment and the factors that influence the growth of these peritidal structures are not well understood. To our knowledge, this is the first comparative study that assesses species conservation within the microbial communities of two geographically distinct peritidal stromatolite formations. We assessed the potential functional roles of these communities using genomic bins clustered from metagenomic sequencing data. We identified several conserved bacterial species across the two sites and hypothesize that their genetic functional potential may be important in the formation of peritidal stromatolites. We contrasted these findings against a well-studied site in Shark Bay, Australia and show that, unlike these hypersaline formations, archaea do not play a major role in peritidal stromatolite formation. Furthermore, bacterial nitrogen and phosphate metabolisms of conserved species may be driving factors behind lithification in peritidal stromatolites.
