SHOGUN: a modular, accurate and scalable framework for microbiome quantification

2020 ◽  
Vol 36 (13) ◽  
pp. 4088-4090 ◽  
Author(s):  
Benjamin Hillmann ◽  
Gabriel A Al-Ghalith ◽  
Robin R Shields-Cutler ◽  
Qiyun Zhu ◽  
Rob Knight ◽  
...  

Abstract Summary: The software pipeline SHOGUN profiles known taxonomic and gene abundances of short-read shotgun metagenomics sequencing data. The pipeline is scalable, modular and flexible. Data analysis and transformation steps can be run individually or together in an automated workflow. Users can easily create new reference databases and can select one of three DNA alignment tools, ranging from ultra-fast low-RAM k-mer-based database search to fully exhaustive gapped DNA alignment, to best fit their analysis needs and computational resources. The pipeline includes an implementation of a published method for taxonomy assignment disambiguation with empirical Bayesian redistribution. The software is installable via the conda resource management framework, has plugins for the QIIME2 and QIITA packages and produces both taxonomy and gene abundance profile tables with a single command, thus promoting convenient and reproducible metagenomics research. Availability and implementation: https://github.com/knights-lab/SHOGUN.
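The empirical Bayesian redistribution step mentioned above can be illustrated with a toy calculation: reads that align equally well to several taxa are redistributed in proportion to the counts already assigned unambiguously. A minimal sketch follows; the function name, data structures and uniform fallback are illustrative assumptions, not SHOGUN's actual implementation.

```python
from collections import Counter

def redistribute_ambiguous(unambiguous_counts, ambiguous_hits):
    """Redistribute ambiguously aligned reads among candidate taxa in proportion
    to the unambiguous evidence already seen for each taxon.
    A simplified illustration only; SHOGUN's published method differs in detail."""
    totals = Counter(unambiguous_counts)
    for candidates in ambiguous_hits:          # candidates: taxa a read hit equally well
        support = [unambiguous_counts.get(t, 0) for t in candidates]
        denom = sum(support)
        if denom == 0:
            for t in candidates:
                totals[t] += 1.0 / len(candidates)   # no prior evidence: split evenly
        else:
            for t, s in zip(candidates, support):
                totals[t] += s / denom               # weight by empirical prior
    return totals

# toy example: two reads hit both taxA and taxB equally well
print(redistribute_ambiguous({"taxA": 90, "taxB": 10},
                             [("taxA", "taxB"), ("taxA", "taxB")]))
```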

2020 ◽  
Author(s):  
Megan Sarah Beaudry ◽  
Jincheng Wang ◽  
Troy Kieran ◽  
Jesse Thomas ◽  
Natalia Juliana Bayona-Vasquez ◽  
...  

Environmental microbial diversity is often investigated from a molecular perspective using 16S ribosomal RNA (rRNA) gene amplicons and shotgun metagenomics. While amplicon methods are fast, low-cost, and have curated reference databases, they can suffer from amplification bias and are limited in genomic scope. In contrast, shotgun metagenomic methods sample more genomic regions with fewer sequence acquisition biases. However, shotgun metagenomic sequencing is much more expensive (even with moderate sequencing depth) and computationally challenging. Here, we develop a set of 16S rRNA sequence capture baits that offer a potential middle ground with the advantages from both approaches for investigating microbial communities. These baits cover the diversity of all 16S rRNA sequences available in the Greengenes (v. 13.5) database, with no sequence having <80% sequence similarity to at least one bait for all segments of 16S. The use of our baits provides comparable results to 16S amplicon libraries and shotgun metagenomic libraries when assigning taxonomic units from 16S sequences within the metagenomic reads. We demonstrate that 16S rRNA capture baits can be used on a range of microbial samples (i.e., mock communities and rodent fecal samples) to increase the proportion of 16S rRNA sequences (average >400-fold) and decrease analysis time to obtain consistent community assessments. Furthermore, our study reveals that bioinformatic methods used to analyze sequencing data may have a greater influence on estimates of community composition than the library preparation method used, likely due in part to the extent and curation of the reference databases considered.
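The reported average >400-fold increase in the proportion of 16S rRNA sequences is a ratio of proportions between the captured and standard shotgun libraries from the same sample. A minimal sketch of that calculation, using made-up read counts, is shown below.

```python
def fold_enrichment(n_16s_capture, n_total_capture, n_16s_shotgun, n_total_shotgun):
    """Fold increase in the proportion of 16S rRNA reads after bait capture,
    relative to a standard shotgun library from the same sample."""
    p_capture = n_16s_capture / n_total_capture
    p_shotgun = n_16s_shotgun / n_total_shotgun
    return p_capture / p_shotgun

# hypothetical read counts for one sample
print(fold_enrichment(n_16s_capture=850_000, n_total_capture=1_000_000,
                      n_16s_shotgun=2_000, n_total_shotgun=1_000_000))  # -> 425.0
```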


2021 ◽  
Vol 12 ◽  
Author(s):  
Megan S. Beaudry ◽  
Jincheng Wang ◽  
Troy J. Kieran ◽  
Jesse Thomas ◽  
Natalia J. Bayona-Vásquez ◽  
...  

Environmental microbial diversity is often investigated from a molecular perspective using 16S ribosomal RNA (rRNA) gene amplicons and shotgun metagenomics. While amplicon methods are fast, low-cost, and have curated reference databases, they can suffer from amplification bias and are limited in genomic scope. In contrast, shotgun metagenomic methods sample more genomic regions with fewer sequence acquisition biases, but are much more expensive (even with moderate sequencing depth) and computationally challenging. Here, we develop a set of 16S rRNA sequence capture baits that offer a potential middle ground with the advantages from both approaches for investigating microbial communities. These baits cover the diversity of all 16S rRNA sequences available in the Greengenes (v. 13.5) database, with no sequence having <78% sequence identity to at least one bait for all segments of 16S. The use of our baits provides comparable results to 16S amplicon libraries and shotgun metagenomic libraries when assigning taxonomic units from 16S sequences within the metagenomic reads. We demonstrate that 16S rRNA capture baits can be used on a range of microbial samples (i.e., mock communities and rodent fecal samples) to increase the proportion of 16S rRNA sequences (average >400-fold) and decrease analysis time to obtain consistent community assessments. Furthermore, our study reveals that bioinformatic methods used to analyze sequencing data may have a greater influence on estimates of community composition than the library preparation method used, likely due in part to the extent and curation of the reference databases considered. Thus, enriching existing aliquots of shotgun metagenomic libraries and obtaining modest numbers of reads from them offers an efficient orthogonal method for assessment of bacterial community composition.
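The published bait set is described as covering Greengenes 13.5 so that every 16S sequence has at least 78% identity to some bait. The greedy sketch below illustrates that kind of coverage criterion; the k-mer containment score is only a cheap stand-in for the alignment-based identity that real bait design would use, and the function names and parameters are assumptions.

```python
def kmer_set(seq, k=21):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def greedy_bait_selection(reference_seqs, min_identity=0.78, k=21):
    """Greedy sketch of bait-set design: add a sequence as a new bait whenever no
    existing bait is similar enough to it. K-mer containment is used as a cheap
    stand-in for alignment-based percent identity."""
    baits = []
    bait_kmers = []
    for seq in reference_seqs:
        kmers = kmer_set(seq, k)
        covered = any(len(kmers & b) / max(len(kmers), 1) >= min_identity
                      for b in bait_kmers)
        if not covered:
            baits.append(seq)
            bait_kmers.append(kmers)
    return baits

# toy sequences: the second differs from the first by one base, the third is unrelated
refs = ["ACGTACGTACGTACGTACGTACGTA", "ACGTACGTACGTACGTACGTACGTT", "TTTTGGGGCCCCAAAATTTTGGGGC"]
print(len(greedy_bait_selection(refs, min_identity=0.78, k=5)))  # 2 baits cover these toys
```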


2017 ◽  
Author(s):  
Zhemin Zhou ◽  
Nina Luhmann ◽  
Nabil-Fareed Alikhan ◽  
Christopher Quince ◽  
Mark Achtman

Abstract Exploring the genetic diversity of microbes within the environment through metagenomic sequencing first requires classifying these reads into taxonomic groups. Current methods compare these sequencing data with existing biased and limited reference databases. Several recent evaluation studies demonstrate that current methods either lack sufficient sensitivity for species-level assignments or suffer from false positives, overestimating the number of species in the metagenome. Both are especially problematic for the identification of low-abundance microbial species, e.g. detecting pathogens in ancient metagenomic samples. We present a new method, SPARSE, which improves taxonomic assignments of metagenomic reads. SPARSE balances existing biased reference databases by grouping reference genomes into similarity-based hierarchical clusters, implemented as an efficient incremental data structure. SPARSE assigns reads to these clusters using a probabilistic model, which specifically penalizes non-specific mappings of reads from unknown sources and hence reduces false-positive assignments. Our evaluation on simulated datasets from two recent evaluation studies demonstrated the improved precision of SPARSE in comparison to other methods for species-level classification. In a third simulation, our method successfully differentiated multiple co-existing Escherichia coli strains from the same sample. In real archaeological datasets, SPARSE identified ancient pathogens with ≤0.02% abundance, consistent with published findings that required additional sequencing data. In these datasets, other methods either missed targeted pathogens or reported non-existent ones. SPARSE and all evaluation scripts are available at https://github.com/zheminzhou/SPARSE.
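A toy version of the idea of penalizing non-specific mappings can be written as a posterior over reference clusters plus an explicit 'unknown source' category whose weight grows with the amount of near-best, ambiguous evidence. The sketch below is only a conceptual illustration, not the probabilistic model published for SPARSE.

```python
import math

def assign_read(cluster_scores, unknown_penalty=1.0):
    """Toy posterior over reference clusters plus an explicit 'unknown source'
    category. The 'unknown' weight grows with the amount of near-best, non-specific
    evidence, so ambiguously mapping reads lose probability mass to 'unknown'.
    This only loosely mimics the penalty described for SPARSE; it is not the
    published model."""
    weights = {c: math.exp(score) for c, score in cluster_scores.items()}
    best = max(weights.values())
    # evidence spread across non-best clusters is treated as support for an unknown source
    weights["unknown"] = unknown_penalty * (sum(weights.values()) - best)
    total = sum(weights.values())
    return {c: round(w / total, 3) for c, w in weights.items()}

print(assign_read({"cluster_A": 5.0, "cluster_B": 1.0}))                     # specific: cluster_A dominates
print(assign_read({"cluster_A": 5.0, "cluster_B": 4.9, "cluster_C": 4.8}))   # non-specific: mass shifts to 'unknown'
```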


2019 ◽  
Author(s):  
H. Soon Gweon ◽  
Liam P. Shaw ◽  
Jeremy Swann ◽  
Nicola De Maio ◽  
Manal AbuOun ◽  
...  

Abstract Background Shotgun metagenomics is increasingly used to characterise microbial communities, particularly for the investigation of antimicrobial resistance (AMR) in different animal and environmental contexts. There are many different approaches for inferring the taxonomic composition and AMR gene content of complex community samples from shotgun metagenomic data, but there has been little work establishing the optimum sequencing depth, data processing and analysis methods for these samples. In this study we used shotgun metagenomics and sequencing of cultured isolates from the same samples to address these issues. We sampled three potential environmental AMR gene reservoirs (pig caeca, river sediment, effluent) and sequenced samples with shotgun metagenomics at high depth (∼200 million reads per sample). Alongside this, we cultured single-colony isolates of Enterobacteriaceae from the same samples and used hybrid sequencing (short- and long-reads) to create high-quality assemblies for comparison to the metagenomic data. To automate data processing, we developed an open-source software pipeline, ‘ResPipe’. Results Taxonomic profiling was much more stable to sequencing depth than AMR gene content. 1 million reads per sample was sufficient to achieve <1% dissimilarity to the full taxonomic composition. However, at least 80 million reads per sample were required to recover the full richness of different AMR gene families present in the sample, and additional allelic diversity of AMR genes was still being discovered in effluent at 200 million reads per sample. Normalising the number of reads mapping to AMR genes using gene length and an exogenous spike of Thermus thermophilus DNA substantially changed the estimated gene abundance distributions. While the majority of genomic content from cultured isolates from effluent was recoverable using shotgun metagenomics, this was not the case for pig caeca or river sediment. Conclusions Sequencing depth and profiling method can critically affect the profiling of polymicrobial animal and environmental samples with shotgun metagenomics. Both sequencing of cultured isolates and shotgun metagenomics can recover substantial diversity that is not identified using the other methods. Particular consideration is required when inferring AMR gene content or presence by mapping metagenomic reads to a database. ResPipe, the open-source software pipeline we have developed, is freely available (https://gitlab.com/hsgweon/ResPipe).
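The spike-in normalisation described above amounts to expressing reads per base of an AMR gene relative to reads per base of the Thermus thermophilus spike. A minimal sketch of that kind of calculation, with hypothetical counts and lengths, is shown below; ResPipe's exact formula may differ.

```python
def normalised_abundance(gene_reads, gene_length_bp, spike_reads, spike_length_bp):
    """Length-normalised AMR gene abundance expressed relative to an exogenous spike:
    (reads per base of gene) / (reads per base of spike). A sketch of the kind of
    normalisation described; not ResPipe's own code."""
    gene_density = gene_reads / gene_length_bp
    spike_density = spike_reads / spike_length_bp
    return gene_density / spike_density

# hypothetical counts: 1,500 reads on a 1,200 bp gene; 300,000 reads on the spike reference
print(normalised_abundance(1_500, 1_200, 300_000, 1_900_000))  # spike length is a made-up placeholder
```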


2021 ◽  
Author(s):  
Yiheng Hu ◽  
Laszlo Irinyi ◽  
Minh Thuy Vi Hoang ◽  
Tavish Eenjes ◽  
Abigail Graetz ◽  
...  

Background: The kingdom Fungi is crucial for life on Earth and is highly diverse. Yet fungi are challenging to characterize. They can be difficult to culture and may be morphologically indistinct in culture. They can have complex genomes of over 1 Gb in size and are still underrepresented in whole genome sequence databases. Overall, their description and analysis lag far behind those of other microbes such as bacteria. At the same time, classification of species via high-throughput sequencing without prior purification is increasingly becoming the norm for pathogen detection, microbiome studies, and environmental monitoring. However, standardized procedures for characterizing unknown fungi from complex sequencing data have not yet been established. Results: We compared different metagenomics sequencing and analysis strategies for the identification of fungal species. Using two fungal mock communities of 44 phylogenetically diverse species, we compared species classification and community composition analysis pipelines using shotgun metagenomics and amplicon sequencing data generated from both short and long read sequencing technologies. We show that regardless of the sequencing methodology used, the highest accuracy of species identification was achieved by sequence alignment against a fungi-specific database. During the assessment of classification algorithms, we found that applying cut-offs to the query coverage of each read or contig significantly improved the classification accuracy and community composition analysis without significant data loss. Conclusion: Overall, our study expands the toolkit for identifying fungi by improving sequence-based fungal classification, and provides a practical guide for the design of metagenomics analyses.
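The query-coverage cut-off described above can be expressed as a simple filter on classification hits: discard any read or contig whose aligned fraction falls below a threshold. The sketch below is illustrative only; the hit fields and the 0.8 threshold are assumptions rather than the values used in the study.

```python
def filter_by_query_coverage(hits, min_coverage=0.8):
    """Keep only classification hits whose aligned fraction of the query read/contig
    meets a coverage cut-off. Field names and threshold are illustrative."""
    kept = []
    for hit in hits:  # hit: dict with query_length and alignment_length
        coverage = hit["alignment_length"] / hit["query_length"]
        if coverage >= min_coverage:
            kept.append(hit)
    return kept

reads = [{"query": "read1", "query_length": 500, "alignment_length": 480, "taxon": "Fusarium"},
         {"query": "read2", "query_length": 500, "alignment_length": 120, "taxon": "Aspergillus"}]
print(filter_by_query_coverage(reads))  # read2 is discarded as a low-coverage, likely spurious hit
```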


2021 ◽  
Author(s):  
Renato R. M. Oliveira ◽  
Raissa L S Silva ◽  
Gisele L. Nunes ◽  
Guilherme Oliveira

DNA metabarcoding is an emerging monitoring method capable of assessing biodiversity from environmental samples (eDNA). Advances in computational tools have been required due to the increasing volume of next-generation sequencing data. Tools for DNA metabarcoding analysis, such as MOTHUR, QIIME, Obitools, and mBRAVE, have been widely used in ecological studies. However, these tools can be difficult to use with customized reference databases. Here we present PIMBA, a PIpeline for MetaBarcoding Analysis, which allows the use of customized databases, as well as other reference databases used by the software mentioned here. PIMBA is an open-source and user-friendly pipeline that consolidates all analyses in just three command lines.
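Custom reference databases for metabarcoding pipelines are typically supplied as a FASTA file paired with a taxonomy table. The sketch below shows that generic packaging step; the file names and two-column taxonomy format are assumptions, so the PIMBA documentation should be consulted for the exact layout it expects.

```python
# Minimal sketch of packaging a custom reference database as the paired
# FASTA + taxonomy table that metabarcoding pipelines commonly accept.
# File names and the two-column taxonomy format are assumptions.
records = {
    "seq001": ("ACGT" * 100, "k__Fungi;p__Ascomycota;g__Fusarium;s__oxysporum"),
    "seq002": ("TTGA" * 100, "k__Fungi;p__Basidiomycota;g__Puccinia;s__graminis"),
}

with open("custom_db.fasta", "w") as fasta, open("custom_db_tax.tsv", "w") as tax:
    for seq_id, (sequence, lineage) in records.items():
        fasta.write(f">{seq_id}\n{sequence}\n")
        tax.write(f"{seq_id}\t{lineage}\n")
```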


2019 ◽  
Vol 14 (1) ◽  
Author(s):  
H. Soon Gweon ◽  
Liam P. Shaw ◽  
Jeremy Swann ◽  
Nicola De Maio ◽  
...  

Abstract Background Shotgun metagenomics is increasingly used to characterise microbial communities, particularly for the investigation of antimicrobial resistance (AMR) in different animal and environmental contexts. There are many different approaches for inferring the taxonomic composition and AMR gene content of complex community samples from shotgun metagenomic data, but there has been little work establishing the optimum sequencing depth, data processing and analysis methods for these samples. In this study we used shotgun metagenomics and sequencing of cultured isolates from the same samples to address these issues. We sampled three potential environmental AMR gene reservoirs (pig caeca, river sediment, effluent) and sequenced samples with shotgun metagenomics at high depth (~ 200 million reads per sample). Alongside this, we cultured single-colony isolates of Enterobacteriaceae from the same samples and used hybrid sequencing (short- and long-reads) to create high-quality assemblies for comparison to the metagenomic data. To automate data processing, we developed an open-source software pipeline, ‘ResPipe’. Results Taxonomic profiling was much more stable to sequencing depth than AMR gene content. 1 million reads per sample was sufficient to achieve < 1% dissimilarity to the full taxonomic composition. However, at least 80 million reads per sample were required to recover the full richness of different AMR gene families present in the sample, and additional allelic diversity of AMR genes was still being discovered in effluent at 200 million reads per sample. Normalising the number of reads mapping to AMR genes using gene length and an exogenous spike of Thermus thermophilus DNA substantially changed the estimated gene abundance distributions. While the majority of genomic content from cultured isolates from effluent was recoverable using shotgun metagenomics, this was not the case for pig caeca or river sediment. Conclusions Sequencing depth and profiling method can critically affect the profiling of polymicrobial animal and environmental samples with shotgun metagenomics. Both sequencing of cultured isolates and shotgun metagenomics can recover substantial diversity that is not identified using the other methods. Particular consideration is required when inferring AMR gene content or presence by mapping metagenomic reads to a database. ResPipe, the open-source software pipeline we have developed, is freely available (https://gitlab.com/hsgweon/ResPipe).
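The depth-dependence of AMR gene family richness reported here can be explored with a simple rarefaction-style subsampling of read assignments. The sketch below uses toy read labels and is not part of ResPipe; it only illustrates why observed richness keeps rising until rare gene families are finally sampled.

```python
import random

def gene_family_richness(read_assignments, depth, seed=1):
    """Subsample reads to a given depth and count how many distinct AMR gene
    families are still observed. A rarefaction-style check of how richness
    depends on sequencing depth; not ResPipe's own code."""
    rng = random.Random(seed)
    subsample = rng.sample(read_assignments, min(depth, len(read_assignments)))
    return len({family for family in subsample if family is not None})

# read_assignments: one entry per read, either an AMR gene family name or None (toy labels)
reads = ["blaTEM"] * 50 + ["tetW"] * 5 + ["ermB"] * 1 + [None] * 10_000
for depth in (100, 1_000, 10_000):
    print(depth, gene_family_richness(reads, depth))
```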


Genome ◽  
2020 ◽  
Vol 63 (11) ◽  
pp. 577-581
Author(s):  
Davoud Torkamaneh ◽  
Jérôme Laroche ◽  
François Belzile

Genotyping-by-sequencing (GBS) is a rapid, flexible, low-cost, and robust genotyping method that simultaneously discovers variants and calls genotypes within a broad range of samples. These characteristics make GBS an excellent tool for many applications and research questions, from conservation biology to functional genomics, in both model and non-model species. Continued improvement of GBS relies on a more comprehensive understanding of data analysis, development of fast and efficient bioinformatics pipelines, accurate missing-data imputation, and active post-release support. Here, we present the second generation of Fast-GBS (v2.0) that offers several new options (e.g., processing paired-end reads and imputation of missing data) and features (e.g., summary statistics of genotypes) to improve the GBS data analysis process. The performance assessment analysis showed that Fast-GBS v2.0 outperformed other available analytical pipelines, such as GBS-SNP-CROP and Gb-eaSy. Fast-GBS v2.0 provides an analysis platform that can be run with different types of sequencing data and modest computational resources, and allows for missing-data imputation for various species in different contexts.
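Missing-data imputation is one of the new options highlighted above. As a deliberately naive illustration of the task, the sketch below fills missing genotype calls with the most common observed call at each locus; Fast-GBS v2.0 uses its own, more sophisticated imputation, so this is not its method.

```python
import numpy as np

def impute_by_mode(genotypes):
    """Fill missing genotype calls (coded -1) with the most common observed call
    at each locus. A naive placeholder for the imputation step, not Fast-GBS's method."""
    genotypes = genotypes.copy()
    for j in range(genotypes.shape[1]):                 # loop over loci (columns)
        column = genotypes[:, j]
        observed = column[column >= 0]
        if observed.size == 0:
            continue                                     # nothing to impute from
        values, counts = np.unique(observed, return_counts=True)
        column[column < 0] = values[np.argmax(counts)]   # mode of observed calls
    return genotypes

# rows = samples, columns = loci; 0/1/2 = genotype dosage, -1 = missing
g = np.array([[0, 2, -1],
              [0, -1, 1],
              [-1, 2, 1]])
print(impute_by_mode(g))
```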


2020 ◽  
Vol 11 ◽  
Author(s):  
Paul E. Smith ◽  
Sinead M. Waters ◽  
Ruth Gómez Expósito ◽  
Hauke Smidt ◽  
Ciara A. Carberry ◽  
...  

Our understanding of complex microbial communities, such as those residing in the rumen, has drastically advanced through the use of high-throughput sequencing (HTS) technologies. Indeed, with the use of barcoded amplicon sequencing, it is now cost-effective and computationally feasible to identify individual rumen microbial genera associated with ruminant livestock nutrition, genetics, performance and greenhouse gas production. However, across all disciplines of microbial ecology, there is currently little reporting of the use of internal controls for validating HTS results. Furthermore, there is little consensus on the most appropriate reference database for analyzing rumen microbiota amplicon sequencing data. Therefore, in this study, a synthetic rumen-specific sequencing standard was used to assess the effects of database choice on results obtained from rumen microbial amplicon sequencing. Four DADA2 reference training sets (RDP, SILVA, GTDB, and RefSeq + RDP) were compared to assess their ability to correctly classify sequences included in the rumen-specific sequencing standard. In addition, two thresholds of phylogenetic bootstrapping, 50 and 80, were applied to investigate the effect of increasing stringency. Sequence classification differences were apparent amongst the databases. For example, the classification of Clostridium differed between all databases, thus highlighting the need for a consistent approach to nomenclature amongst different reference databases. It is hoped that the effect of database choice on taxonomic classification observed in this study will encourage research groups across various microbial disciplines to develop and routinely use their own microbiome-specific reference standard to validate analysis pipelines and database choice.
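Comparing how the same standard sequences are classified under different reference training sets reduces to flagging sequences whose genus-level calls disagree. The sketch below shows that comparison on toy data; the database names match those in the study, but the example calls and data structures are illustrative assumptions.

```python
def discordant_assignments(assignments_by_db):
    """Given per-database taxonomy assignments for the same set of standard
    sequences, report the sequences whose genus-level call differs between
    databases. Field names and toy data are illustrative."""
    databases = list(assignments_by_db)
    seq_ids = assignments_by_db[databases[0]].keys()
    discordant = {}
    for seq_id in seq_ids:
        calls = {db: assignments_by_db[db].get(seq_id) for db in databases}
        if len(set(calls.values())) > 1:
            discordant[seq_id] = calls
    return discordant

# hypothetical genus-level calls for two standard sequences across three reference training sets
calls = {
    "RDP":   {"std_01": "Clostridium", "std_02": "Prevotella"},
    "SILVA": {"std_01": "Clostridium_sensu_stricto_1", "std_02": "Prevotella"},
    "GTDB":  {"std_01": "Clostridium", "std_02": "Prevotella"},
}
print(discordant_assignments(calls))  # std_01 is flagged for inconsistent nomenclature
```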


2016 ◽  
Author(s):  
PJ Tatlow ◽  
Stephen R. Piccolo

Abstract Public compendia of raw sequencing data are now measured in petabytes. Accordingly, it is becoming infeasible for individual researchers to transfer these data to local computers. Recently, the National Cancer Institute funded an initiative to explore opportunities and challenges of working with molecular data in cloud-computing environments. With data in the cloud, it becomes possible for scientists to take their tools to the data and thereby avoid large data transfers. It also becomes feasible to scale computing resources to the needs of a given analysis. To evaluate this concept, we quantified transcript-expression levels for 12,307 RNA-Sequencing samples from the Cancer Cell Line Encyclopedia and The Cancer Genome Atlas. We used two cloud-based configurations to process the data and examined the performance and cost profiles of each configuration. Using “preemptible virtual machines”, we processed the samples for as little as $0.09 (USD) per sample. In total, we processed the TCGA samples (n=11,373) for only $1,065.49 and simultaneously processed thousands of samples at a time. As the samples were being processed, we collected detailed performance metrics, which helped us to track the duration of each processing step and to identify computational resources used at different stages of sample processing. Although the computational demands of reference alignment and expression quantification have decreased considerably, there remains a critical need for researchers to optimize preprocessing steps (e.g., sorting, converting, and trimming sequencing reads). We have created open-source Docker containers that include all the software and scripts necessary to process such data in the cloud and to collect performance metrics. The processed data are available in tabular format and in Google's BigQuery database (see https://osf.io/gqrz9).
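Collecting per-step performance metrics and working out the per-sample cost are both simple to sketch. The example below times a list of placeholder processing steps and divides the reported total TCGA cost by the number of samples; the step names are assumptions, not the authors' actual pipeline stages.

```python
import time

def run_with_metrics(steps):
    """Run a list of (name, callable) pipeline steps and record wall-clock time per
    step, the kind of per-step performance metric collected in the study.
    Step names below are illustrative placeholders."""
    metrics = {}
    for name, step in steps:
        start = time.perf_counter()
        step()
        metrics[name] = time.perf_counter() - start
    return metrics

def sort_reads():  time.sleep(0.1)   # placeholders for real preprocessing steps
def trim_reads():  time.sleep(0.1)
def quantify():    time.sleep(0.2)

print(run_with_metrics([("sort", sort_reads), ("trim", trim_reads), ("quantify", quantify)]))

# reported total cost divided by the number of TCGA samples gives the per-sample cost
print(f"${1065.49 / 11373:.3f} per sample")  # ≈ $0.094
```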

