MetaDEGalaxy: Galaxy workflow for differential abundance analysis of 16s metagenomic data

F1000Research ◽

10.12688/f1000research.18866.1 ◽

2019 ◽

Vol 8 ◽

pp. 726 ◽

Cited By ~ 1

Author(s):

Mike W.C. Thang ◽

Xin-Yi Chua ◽

Gareth Price ◽

Dominique Gorse ◽

Matt A. Field

Keyword(s):

Microbial Communities ◽

Sequence Data ◽

Metagenomic Data ◽

Marker Genes ◽

Metagenomic Sequencing ◽

Differential Analysis ◽

Biomedical Sciences ◽

Metagenomic Sequence ◽

Differential Abundance ◽

Differential Abundance Analysis

Metagenomic sequencing is an increasingly common tool in environmental and biomedical sciences, yet analysis workflows remain immature relative to those in other fields, such as DNASeq and RNASeq analysis pipelines. While software for detailing the composition of microbial communities using 16S rRNA marker genes is constantly improving, researchers are increasingly interested in identifying changes exhibited within microbial communities under differing environmental conditions. In order to gain maximum value from metagenomic sequence data we must improve the existing analysis environment by providing accessible and scalable computational workflows able to generate reproducible results. Here we describe a complete end-to-end open-source metagenomics workflow running within Galaxy for 16S differential abundance analysis. The workflow accepts 454 or Illumina sequence data (either overlapping or non-overlapping paired-end reads) and outputs lists of the operational taxonomic units (OTUs) exhibiting the greatest change under differing conditions. A range of analysis steps and graphing options are available, giving users a high level of control over their data and analyses. Additionally, users are able to input complex sample-specific metadata, which can be incorporated into differential analysis and used for grouping / colouring within graphs. Detailed tutorials containing sample data and existing workflows are available for three different input types: overlapping and non-overlapping read pairs as well as pre-generated Biological Observation Matrix (BIOM) files. Using the Galaxy platform we developed MetaDEGalaxy, a complete metagenomics differential abundance analysis workflow. MetaDEGalaxy is designed for bench scientists working with 16S data who are interested in comparative metagenomics. MetaDEGalaxy builds on momentum within the wider Galaxy metagenomics community with the hope that more tools will be added as existing methods mature.
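To make the idea of "OTUs exhibiting the greatest change under differing conditions" concrete, the following is a minimal sketch that ranks OTUs by the log2 fold change of their mean relative abundance between two conditions. It is not the statistical method used by MetaDEGalaxy, which the abstract does not specify; the function name `rank_otus_by_fold_change`, the input layout, and the pseudocount are all assumptions made for illustration.

```python
import math


def rank_otus_by_fold_change(counts, groups, pseudocount=1.0):
    """Rank OTUs by absolute log2 fold change between two conditions.

    counts: dict mapping OTU id -> dict mapping sample id -> read count
    groups: dict mapping sample id -> condition label (exactly two labels)
    pseudocount: hypothetical smoothing value that avoids log of zero
    """
    labels = sorted(set(groups.values()))
    if len(labels) != 2:
        raise ValueError("expected exactly two conditions")
    baseline, contrast = labels

    # Total reads per sample, used to convert counts to relative abundance.
    totals = {s: sum(counts[o].get(s, 0) for o in counts) for s in groups}
    n_otus = len(counts)

    ranked = []
    for otu, per_sample in counts.items():
        means = {}
        for label in labels:
            samples = [s for s in groups if groups[s] == label]
            rel = [
                (per_sample.get(s, 0) + pseudocount)
                / (totals[s] + pseudocount * n_otus)
                for s in samples
            ]
            means[label] = sum(rel) / len(rel)
        ranked.append((otu, math.log2(means[contrast] / means[baseline])))

    # Largest change first, regardless of direction.
    ranked.sort(key=lambda item: abs(item[1]), reverse=True)
    return ranked
```

A positive value means the OTU is relatively more abundant in the condition whose label sorts second alphabetically. A real analysis would add a statistical test and multiple-testing correction on top of this ranking.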

Fast functional annotation of metagenomic shotgun data by DNA alignment to a microbial gene catalog

10.1101/120402 ◽

2017 ◽

Author(s):

Stuart M. Brown ◽

Yuhan Hao ◽

Hao Chen ◽

Bobby P. Laungani ◽

Thahmina A. Ali ◽

...

Keyword(s):

Functional Annotation ◽

Sequence Data ◽

Human Microbiome ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Alternative Analysis ◽

Metagenomic Sequence ◽

Shotgun Metagenomics ◽

Gene Functions ◽

Dna Alignment

Abstract Background Metagenomic shotgun sequencing is becoming increasingly popular to study microbes associated with the human body and in environmental samples. A key goal of shotgun metagenomic sequencing is to identify gene functions and metabolic pathways that differ between samples or conditions. However, current methods to identify function in the large number of reads in a high-throughput sequence data file rely on the computationally intensive and low stringency approach of mapping each read to a generic database of proteins or reference microbial genomes. Results We have developed an alternative analysis approach for shotgun metagenomic sequence data utilizing Bowtie2 DNA-DNA alignment of the reads to a database of well annotated genes compiled from human microbiome data. This method is rapid, and provides high stringency matches (>90% DNA sequence identity) of shotgun metagenomics reads to genes with annotated functions. We demonstrate the use of this method with synthetic data, Human Microbiome Project shotgun metagenomic data sets, and data from a study of liver disease. Differentially abundant KEGG gene functions can be detected in these experiments. Conclusions Functional annotation of metagenomic shotgun sequence reads can be accomplished by rapid DNA-DNA matching to a custom database of microbial sequences using the Bowtie2 sequence alignment tool. This method can be used for a variety of microbiome studies and allows functional analysis which is otherwise computationally demanding. This rapid annotation method is freely available as a Galaxy workflow within a Docker image.
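As an illustration of what a ">90% DNA sequence identity" filter on aligned reads might look like, here is a minimal sketch that estimates percent identity from a SAM record's CIGAR string and `NM:i` (edit distance) tag, both of which Bowtie2 writes. The identity definition used here (aligned columns minus edit distance, divided by aligned columns) and the helper names are assumptions; the abstract does not say how the authors computed identity.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def percent_identity(cigar, edit_distance):
    """Percent identity over alignment columns (matches, mismatches, indels)."""
    columns = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MID=X")
    if columns == 0:
        raise ValueError("CIGAR string has no aligned columns")
    return 100.0 * (columns - edit_distance) / columns


def passes_identity(sam_line, threshold=90.0):
    """Return True if an aligned SAM record exceeds the identity threshold."""
    fields = sam_line.rstrip("\n").split("\t")
    cigar = fields[5]
    if cigar == "*":  # unaligned read
        return False
    edit_distance = None
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NM:i:"):
            edit_distance = int(tag[5:])
    if edit_distance is None:
        return False
    return percent_identity(cigar, edit_distance) > threshold
```

For example, a read with CIGAR `100M` and `NM:i:5` scores 95% and passes, while one with `NM:i:12` scores 88% and is rejected.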

Microbe Finder (MiFi®): Implementation of an Interactive Pathogen Detection Tool in Metagenomic Sequence Data

Plants ◽

10.3390/plants10020250 ◽

2021 ◽

Vol 10 (2) ◽

pp. 250

Author(s):

Andres S. Espindola ◽

Kitty F. Cardwell

Keyword(s):

High Throughput ◽

Web Application ◽

Pathogen Detection ◽

High Throughput Sequencing ◽

Sequence Data ◽

Metagenomic Data ◽

Diagnostic Tools ◽

Metagenomic Sequencing ◽

Metagenomic Sequence ◽

Metagenomic Sequence Data

Agricultural high throughput diagnostics need to be fast, accurate and have multiplexing capacity. Metagenomic sequencing is being widely evaluated for plant and animal diagnostics. Bioinformatic analysis of metagenomic sequence data has been a bottleneck for diagnostic analysis due to the size of the data files. Most available tools for analyzing high-throughput sequencing (HTS) data require that the user have computer coding skills and access to high-performance computing. To overcome constraints to most sequencing-based diagnostic pipelines today, we have developed Microbe Finder (MiFi®). MiFi® is a web application for quick detection and identification of known pathogen species/strains in raw, unassembled HTS metagenomic data. HTS-based diagnostic tools developed through MiFi® must pass rigorous validation, which is outlined in this manuscript. MiFi® allows researchers to collaborate in the development and validation of HTS-based diagnostic assays using MiProbe™, a platform used for developing pathogen-specific e-probes. Validated e-probes are made available to diagnosticians through MiDetect™. Here we describe the e-probe development, curation and validation process of MiFi® using grapevine pathogens as a model system. MiFi® can be used with any pathosystem and HTS platform after e-probes have been validated.

Deep Sequencing of a Dimethylsulfoniopropionate-Degrading Gene (dmdA) by Using PCR Primer Pairs Designed on the Basis of Marine Metagenomic Data

Applied and Environmental Microbiology ◽

10.1128/aem.01258-09 ◽

2009 ◽

Vol 76 (2) ◽

pp. 609-617 ◽

Cited By ~ 41

Author(s):

Vanessa A. Varaljay ◽

Erinn C. Howard ◽

Shulei Sun ◽

Mary Ann Moran

Keyword(s):

Sequence Data ◽

Gene Clusters ◽

Amino Acid Identity ◽

Metagenomic Data ◽

Data Sets ◽

Free Living ◽

Metagenomic Sequence ◽

Primer Sets ◽

Design And Testing ◽

Pcr Primer

ABSTRACT In silico design and testing of environmental primer pairs with metagenomic data are beneficial for capturing a greater proportion of the natural sequence heterogeneity in microbial functional genes, as well as for understanding limitations of existing primer sets that were designed from more restricted sequence data. PCR primer pairs targeting 10 environmental clades and subclades of the dimethylsulfoniopropionate (DMSP) demethylase protein, DmdA, were designed using an iterative bioinformatic approach that took advantage of thousands of dmdA sequences captured in marine metagenomic data sets. Using the bioinformatically optimized primers, dmdA genes were amplified from composite free-living coastal bacterioplankton DNA (from 38 samples over 5 years and two locations) and sequenced using 454 technology. An average of 6,400 amplicons per primer pair represented more than 700 clusters of environmental dmdA sequences across all primers, with clusters defined conservatively at >90% nucleotide sequence identity (∼95% amino acid identity). Degenerate and inosine-based primers did not perform better than specific primer pairs in determining dmdA richness and sometimes captured a lower degree of richness of sequences from the same DNA sample. A comparison of dmdA sequences in free-living versus particle-associated bacteria in southeastern U.S. coastal waters showed that sequence richness in some dmdA subgroups differed significantly between size fractions, though most gene clusters were shared (52 to 91%) and most sequences were affiliated with the shared clusters (∼90%). The availability of metagenomic sequence data has significantly enhanced the design of quantitative PCR primer pairs for this key functional gene, providing robust access to the capabilities and activities of DMSP demethylating bacteria in situ.
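The notion of clusters "defined conservatively at >90% nucleotide sequence identity" can be sketched with a simple greedy, centroid-based clustering. This is only an illustration under strong assumptions: sequences are taken to be pre-aligned and of equal length so that identity is a position-by-position comparison, and the function names are hypothetical. The abstract does not state which clustering tool or identity definition the authors used.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_cluster(seqs, threshold=0.90):
    """Assign each sequence to the first centroid it matches above threshold.

    The first sequence that matches no existing centroid starts a new
    cluster and becomes that cluster's centroid.
    """
    centroids = []
    clusters = []
    for seq in seqs:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) > threshold:
                clusters[i].append(seq)
                break
        else:
            centroids.append(seq)
            clusters.append([seq])
    return clusters
```

Because the result of greedy clustering depends on input order, production tools usually sort sequences (for example by length or abundance) before clustering.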

Comparison of errors between a differential and a classical abundance analysis

Canadian Journal of Physics ◽

10.1139/cjp-2016-0877 ◽

2017 ◽

Vol 95 (9) ◽

pp. 855-857

Author(s):

Henrique Reggiani ◽

Jorge Meléndez

Keyword(s):

Measurement Errors ◽

Signal To Noise Ratio ◽

High Signal ◽

Differential Analysis ◽

Analysis Method ◽

Chemical Abundances ◽

Differential Abundance ◽

Low Metallicity ◽

Low Levels ◽

Differential Abundance Analysis

The differential abundance analysis method can improve the precision of stellar chemical abundances. The method compares the equivalent widths of a certain line in a star with the same line in a star considered to be a standard representative of its class, using high resolution and high signal to noise ratio spectra. The method has achieved great results by reducing the measurement errors to unprecedentedly low levels. However, to date, there has not been a consistent analysis on the actual improvements of this method when compared to a classical analysis in metal-poor stars. Here we present a comparison between the errors of a classical stellar analysis and a differential analysis among low-metallicity stars.

Searching more genomic sequence with less memory for fast and accurate metagenomic profiling

10.1101/036681 ◽

2016 ◽

Author(s):

Shea N Gardner ◽

Sasha K Ames ◽

Maya B Gokhale ◽

Tom R Slezak ◽

Jonathan Allen

Keyword(s):

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Low Cost ◽

False Negative ◽

Human Microbiome ◽

Human Microbiome Project ◽

Metagenomic Data ◽

Reference Database ◽

Metagenomic Sequence

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in large scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library) which can be run with 24 GB of DRAM memory, an amount available on many clusters, or with 16 GB DRAM plus a 24 GB low cost commodity flash drive (NVRAM), a cost effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis for 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution of reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.

Censcyt: censored covariates in differential abundance analysis in cytometry

BMC Bioinformatics ◽

10.1186/s12859-021-04125-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Reto Gerber ◽

Mark D. Robinson

Keyword(s):

Error Control ◽

Proportional Hazards ◽

Proportional Hazards Model ◽

Cox Proportional Hazards ◽

Cox Proportional Hazards Model ◽

Measured Quantity ◽

Differential Analysis ◽

Differential Abundance ◽

Study Results ◽

Differential Abundance Analysis

Abstract Background Innovations in single cell technologies have led to a flurry of datasets and computational tools to process and interpret them, including analyses of cell composition changes and transition in cell states. The diffcyt workflow for differential discovery in cytometry data consists of several steps, including preprocessing, cell population identification and differential testing for an association with a binary or continuous covariate. However, the commonly measured quantity of survival time in clinical studies often results in a censored covariate where classical differential testing is inapplicable. Results To overcome this limitation, multiple methods to directly include censored covariates in differential abundance analysis were examined with the use of simulation studies and a case study. Results show that multiple imputation based methods offer on-par performance with the Cox proportional hazards model in terms of sensitivity and error control, while offering flexibility to account for covariates. The tested methods are implemented in the package censcyt as an extension of diffcyt and are available at https://bioconductor.org/packages/censcyt. Conclusion Methods for the direct inclusion of a censored variable as a predictor in GLMMs are a valid alternative to classical survival analysis methods, such as the Cox proportional hazards model, while allowing for more flexibility in the differential analysis.

Rapid profiling of the preterm infant gut microbiota using nanopore sequencing aids pathogen diagnostics

10.1101/180406 ◽

2017 ◽

Cited By ~ 13

Author(s):

Richard M. Leggett ◽

Cristina Alcon-Giner ◽

Darren Heavens ◽

Shabhonam Caim ◽

Thomas C. Brook ◽

...

Keyword(s):

Gut Microbiota ◽

Real Time ◽

Time Course ◽

Sequence Data ◽

Clinical Samples ◽

Metagenomic Sequencing ◽

Time Analysis ◽

Metagenomic Sequence ◽

Real Time Analysis ◽

Increased Risk

ABSTRACT The Oxford Nanopore MinION sequencing platform offers near real time analysis of DNA reads as they are generated, which makes the device attractive for in-field or clinical deployment, e.g. rapid diagnostics. We used the MinION platform for shotgun metagenomic sequencing and analysis of gut-associated microbial communities; firstly, we used a 20-species human microbiota mock community to demonstrate how Nanopore metagenomic sequence data can be reliably and rapidly classified. Secondly, we profiled faecal microbiomes from preterm infants at increased risk of necrotising enterocolitis and sepsis. In a single-patient time course, we captured the diversity of the immature gut microbiota and observed how its complexity changes over time in response to interventions, i.e. probiotic, antibiotics and episodes of suspected sepsis. Finally, we performed ‘real-time’ runs from sample to analysis using faecal samples of critically ill infants and of healthy infants receiving probiotic supplementation. Real-time analysis was facilitated by our new NanoOK RT software package which analysed sequences as they were generated. We reliably identified potentially pathogenic taxa (i.e. Klebsiella pneumoniae and Enterobacter cloacae) and their corresponding antimicrobial resistance (AMR) gene profiles within as little as one hour of sequencing. Antibiotic treatment decisions may be rapidly modified in response to these AMR profiles, which we validated using pathogen isolation, whole genome sequencing and antibiotic susceptibility testing. Our results demonstrate that our pipeline can process clinical samples to a rich dataset able to inform tailored patient antimicrobial treatment in less than 5 hours.

Ontology-Enriched Specifications Enabling Findable, Accessible, Interoperable, and Reusable Marine Metagenomic Datasets in Cyberinfrastructure Systems

Frontiers in Microbiology ◽

10.3389/fmicb.2021.765268 ◽

2021 ◽

Vol 12 ◽

Author(s):

Kai L. Blumberg ◽

Alise J. Ponsero ◽

Matthew Bomhoff ◽

Elisha M. Wood-Charlson ◽

Edward F. DeLong ◽

...

Keyword(s):

Microbial Communities ◽

Sequence Data ◽

Technical Specification ◽

Global Ocean ◽

Metagenomic Data ◽

Use Of Data ◽

Machine Systems ◽

Future Direction ◽

Meta Analyses ◽

Machine Readable

Marine microbial ecology requires the systematic comparison of biogeochemical and sequence data to analyze environmental influences on the distribution and variability of microbial communities. With ever-increasing quantities of metagenomic data, there is a growing need to make datasets Findable, Accessible, Interoperable, and Reusable (FAIR) across diverse ecosystems. FAIR data is essential to developing analytical frameworks that integrate microbiological, genomic, ecological, oceanographic, and computational methods. Although community standards defining the minimal metadata required to accompany sequence data exist, they haven’t been consistently used across projects, precluding interoperability. Moreover, these data are not machine-actionable or discoverable by cyberinfrastructure systems. By making ‘omic and physicochemical datasets FAIR to machine systems, we can enable sequence data discovery and reuse based on machine-readable descriptions of environments or physicochemical gradients. In this work, we developed a novel technical specification for dataset encapsulation for the FAIR reuse of marine metagenomic and physicochemical datasets within cyberinfrastructure systems. This includes using Frictionless Data Packages enriched with terminology from environmental and life-science ontologies to annotate measured variables, their units, and the measurement devices used. This approach was implemented in Planet Microbe, a cyberinfrastructure platform and marine metagenomic web-portal. Here, we discuss the data properties built into the specification to make global ocean datasets FAIR within the Planet Microbe portal. We additionally discuss the selection of, and contributions to marine-science ontologies used within the specification. Finally, we use the system to discover data by which to answer various biological questions about environments, physicochemical gradients, and microbial communities in meta-analyses. 
This work represents a future direction in marine metagenomic research by proposing a specification for FAIR dataset encapsulation that, if adopted within cyberinfrastructure systems, would automate the discovery, exchange, and re-use of data needed to answer broader reaching questions than originally intended.

Comprehensive discovery of CRISPR-targeted terminally redundant sequences in the human gut metagenome: Viruses, plasmids, and more

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009428 ◽

2021 ◽

Vol 17 (10) ◽

pp. e1009428

Author(s):

Ryota Sugimoto ◽

Luca Nishimura ◽

Phuong Thanh Nguyen ◽

Jumpei Ito ◽

Nicholas F. Parrish ◽

...

Keyword(s):

De Novo ◽

Sequence Similarity ◽

Metagenomic Data ◽

Marker Genes ◽

Biological Entity ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Human Gut ◽

Protein Coding ◽

Viral Sequences

Viruses are the most numerous biological entity, existing in all environments and infecting all cellular organisms. Compared with cellular life, the evolution and origin of viruses are poorly understood; viruses are enormously diverse, and most lack sequence similarity to cellular genes. To uncover viral sequences without relying on either reference viral sequences from databases or marker genes that characterize specific viral taxa, we developed an analysis pipeline for virus inference based on clustered regularly interspaced short palindromic repeats (CRISPR). CRISPR is a prokaryotic nucleic acid restriction system that stores the memory of previous exposure. Our protocol can infer CRISPR-targeted sequences, including viruses, plasmids, and previously uncharacterized elements, and predict their hosts using unassembled short-read metagenomic sequencing data. By analyzing human gut metagenomic data, we extracted 11,391 terminally redundant CRISPR-targeted sequences, which are likely complete circular genomes. The sequences included 2,154 tailed-phage genomes, together with 257 complete crAssphage genomes, 11 genomes larger than 200 kilobases, 766 genomes of Microviridae species, 56 genomes of Inoviridae species, and 95 previously uncharacterized circular small genomes that have no reliably predicted protein-coding gene. We predicted the host(s) of approximately 70% of the discovered genomes at the taxonomic level of phylum by linking protospacers to taxonomically assigned CRISPR direct repeats. These results demonstrate that our protocol is efficient for de novo inference of CRISPR-targeted sequences and their host prediction.
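A "terminally redundant" sequence is one whose start repeats at its end, which is why such contigs are likely complete circular genomes. The check can be sketched as a search for the longest prefix that also appears as a suffix. This is a simplified illustration requiring exact matches; the function name and the `min_overlap` default are assumptions, and the abstract does not describe how the authors' pipeline detects terminal redundancy.

```python
def terminal_redundancy(seq, min_overlap=10):
    """Length of the longest exact prefix that is also a suffix.

    Only overlaps of at least min_overlap bases and at most half the
    sequence length are considered. Returns 0 if none is found.
    """
    seq = seq.upper()
    max_len = len(seq) // 2
    # Try the longest candidate first so the first hit is the longest.
    for k in range(max_len, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0
```

A contig built as `repeat + body + repeat` returns the repeat length, provided no longer coincidental overlap exists. A practical implementation would tolerate sequencing errors by allowing mismatches in the overlap.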
