Dadasnake, a Snakemake implementation of DADA2 to process amplicon sequencing data for microbial ecology

Abstract Background Amplicon sequencing of phylogenetic marker genes, e.g., 16S, 18S, or ITS ribosomal RNA sequences, is still the most commonly used method to determine the composition of microbial communities. Microbial ecologists often have expert knowledge on their biological question and data analysis in general, and most research institutes have computational infrastructures to use the bioinformatics command line tools and workflows for amplicon sequencing analysis, but requirements of bioinformatics skills often limit the efficient and up-to-date use of computational resources. Results We present dadasnake, a user-friendly, 1-command Snakemake pipeline that wraps the preprocessing of sequencing reads and the delineation of exact sequence variants by using the favorably benchmarked and widely used DADA2 algorithm with a taxonomic classification and the post-processing of the resultant tables, including hand-off in standard formats. The suitability of the provided default configurations is demonstrated using mock community data from bacteria and archaea, as well as fungi. Conclusions By use of Snakemake, dadasnake makes efficient use of high-performance computing infrastructures. Easy user configuration guarantees flexibility of all steps, including the processing of data from multiple sequencing platforms. It is easy to install dadasnake via conda environments. dadasnake is available at https://github.com/a-h-b/dadasnake.

dadasnake, a Snakemake implementation of DADA2 to process amplicon sequencing data for microbial ecology

10.1101/2020.05.17.095679 ◽

2020 ◽

Author(s):

Christina Weißbecker ◽

Beatrix Schnabel ◽

Anna Heintz-Buschart

Keyword(s):

High Performance ◽

Expert Knowledge ◽

Amplicon Sequencing ◽

Marker Genes ◽

Sequencing Analysis ◽

Sequencing Data ◽

Hand Off ◽

Sequencing Platforms ◽

Computational Resources ◽

User Friendly

Abstract Background Amplicon sequencing of phylogenetic marker genes, e.g. 16S, 18S or ITS rRNA sequences, is still the most commonly used method to determine the composition of microbial communities. Microbial ecologists often have expert knowledge on their biological question and data analysis in general, and most research institutes have computational infrastructures to employ the bioinformatics command line tools and workflows for amplicon sequencing analysis, but requirements of bioinformatics skills often limit the efficient and up-to-date use of computational resources. Results dadasnake wraps pre-processing of sequencing reads, delineation of exact sequence variants using the favorably benchmarked, widely used DADA2 algorithm, taxonomic classification and post-processing of the resultant tables, and hand-off in standard formats, into a user-friendly, one-command Snakemake pipeline. The suitability of the provided default configurations is demonstrated using mock-community data from bacteria and archaea, as well as fungi. Conclusions By use of Snakemake, dadasnake makes efficient use of high-performance computing infrastructures. Easy user configuration guarantees flexibility of all steps, including the processing of data from multiple sequencing platforms. dadasnake facilitates easy installation via conda environments. dadasnake is available at https://github.com/a-h-b/dadasnake.

ASAP 2: a pipeline and web server to analyze marker gene amplicon sequencing data automatically and consistently

BMC Bioinformatics ◽

10.1186/s12859-021-04555-0 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Renmao Tian ◽

Behzad Imanian

Keyword(s):

Statistical Tests ◽

Marker Gene ◽

Web Server ◽

Amplicon Sequencing ◽

Marker Genes ◽

Sequence Variant ◽

Complex Data ◽

Sequencing Analysis ◽

Sequencing Data ◽

Link Type

Abstract Background Amplicon sequencing of marker genes such as 16S rDNA has been widely used to survey and characterize microbial communities. However, the complex data analyses have required many interfering manual steps, often leading to inconsistencies in results. Results Here, we have developed a pipeline, amplicon sequence analysis pipeline 2 (ASAP 2), to automate and glide through the processes without the usual manual inspections and user’s interference, for instance, in the detection of barcode orientation, selection of the high-quality region of reads, determination of resampling depth, and many more. The pipeline integrates all the analytical processes such as importing data, demultiplexing, summarizing read profiles, trimming quality, denoising, removing chimeric sequences and making the feature table, among others. The pipeline accepts multiple file formats as input, including multiplexed or demultiplexed, paired-end or single-end, barcode inside or outside, and raw or intermediate data (e.g. feature table). The outputs include taxonomic classification, alpha/beta diversity, community composition, ordination analysis and statistical tests. ASAP 2 supports merging multiple sequencing runs, which helps integrate and compare data from different sources (public databases and collaborators). Conclusions Our pipeline minimizes hands-on interference and runs amplicon sequence variant (ASV)-based amplicon sequencing analysis automatically and consistently. Our web server assists researchers who have no access to a high-performance computer (HPC) or have limited bioinformatics skills. The pipeline and web server can be accessed at https://github.com/tianrenmaogithub/asap2 and https://hts.iit.edu/asap2, respectively.

Workstation benchmark of Spark Capable Genome Analysis ToolKit 4 Variant Calling

10.1101/2020.05.17.101105 ◽

2020 ◽

Author(s):

Marcus H. Hansen ◽

Anita T. Simonsen ◽

Hans B. Ommen ◽

Charlotte G. Nyvold

Keyword(s):

Dna Sequencing ◽

Genome Analysis ◽

High Speed ◽

High Performance ◽

Variant Calling ◽

Amplicon Sequencing ◽

Targeted Sequencing ◽

Sequencing Analysis ◽

Genome Analysis Toolkit ◽

Order Of Magnitude

Abstract Background Rapid and practical DNA-sequencing processing has become essential for modern biomedical laboratories, especially in the field of cancer, pathology and genetics. While sequencing turn-over time has been, and still is, a bottleneck in research and diagnostics, the field of bioinformatics is moving at a rapid pace – both in terms of hardware and software development. Here, we benchmarked the local performance of three of the most important Spark-enabled Genome Analysis Toolkit 4 (GATK4) tools in a targeted sequencing workflow: duplicate marking, base quality score recalibration (BQSR) and variant calling on targeted DNA sequencing using a modest hyperthreading 12-core single CPU and a high-speed PCI express solid-state drive. Results Compared to the previous GATK version, the performance of Spark-enabled BQSR and HaplotypeCaller is shifted towards a more efficient usage of the available cores on the CPU and outperforms the earlier GATK3.8 version with an order of magnitude reduction in processing time to analysis-ready variants, whereas MarkDuplicateSpark was found to be thrice as fast. Furthermore, HaplotypeCallerSpark and BQSRPipelineSpark were significantly faster than the equivalent GATK4 standard tools with a combined ∼86% reduction in execution time, reaching a median rate of ten million processed bases per second, and duplicate marking was reduced ∼42%. The called variants were found to be in close agreement between the Spark and non-Spark versions, with an overall concordance of 98%. In this setup, the tools were also highly efficient when compared with execution on a small 72 virtual CPU/18-node Google Cloud cluster. Conclusion In conclusion, GATK4 offers practical parallelization possibilities for DNA sequence processing, and the Spark-enabled tools optimize performance and utilization of local CPUs. Spark-utilizing GATK variant calling is several times faster than previous GATK3.8 multithreading with the same multi-core, single-CPU configuration. The improved opportunities for parallel computations not only hold implications for high-performance clusters, but also for modest laboratory or research workstations for targeted sequencing analysis, such as exome, panel or amplicon sequencing.

Natrix: a Snakemake-based workflow for processing, clustering, and taxonomically assigning amplicon sequencing reads

BMC Bioinformatics ◽

10.1186/s12859-020-03852-4 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Marius Welzel ◽

Anja Lange ◽

Dominik Heider ◽

Michael Schwarz ◽

Bernd Freisleben ◽

...

Keyword(s):

High Throughput Sequencing ◽

Workflow Management ◽

Amplicon Sequencing ◽

Version Control ◽

Marker Genes ◽

Sequencing Data ◽

Taxonomic Assignment ◽

Ecological Processes ◽

Link Type ◽

User Friendly

Abstract Background Sequencing of marker genes amplified from environmental samples, known as amplicon sequencing, allows us to resolve some of the hidden diversity and elucidate evolutionary relationships and ecological processes among complex microbial communities. The analysis of large numbers of samples at high sequencing depths generated by high throughput sequencing technologies requires efficient, flexible, and reproducible bioinformatics pipelines. Only a few existing workflows can be run in a user-friendly, scalable, and reproducible manner on different computing devices using an efficient workflow management system. Results We present Natrix, an open-source bioinformatics workflow for preprocessing raw amplicon sequencing data. The workflow contains all analysis steps from quality assessment, read assembly, dereplication, chimera detection, split-sample merging, sequence representative assignment (OTUs or ASVs) to the taxonomic assignment of sequence representatives. The workflow is written using Snakemake, a workflow management engine for developing data analysis workflows. In addition, Conda is used for version control. Thus, Snakemake ensures reproducibility and Conda offers version control of the utilized programs. The encapsulation of rules and their dependencies support hassle-free sharing of rules between workflows and easy adaptation and extension of existing workflows. Natrix is freely available on GitHub (https://github.com/MW55/Natrix) or as a Docker container on DockerHub (https://hub.docker.com/r/mw55/natrix). Conclusion Natrix is a user-friendly and highly extensible workflow for processing Illumina amplicon data.

Synthetic Sequencing Standards: A Guide to Database Choice for Rumen Microbiota Amplicon Sequencing Analysis

Frontiers in Microbiology ◽

10.3389/fmicb.2020.606825 ◽

2020 ◽

Vol 11 ◽

Author(s):

Paul E. Smith ◽

Sinead M. Waters ◽

Ruth Gómez Expósito ◽

Hauke Smidt ◽

Ciara A. Carberry ◽

...

Keyword(s):

High Throughput Sequencing ◽

Cost Effective ◽

Amplicon Sequencing ◽

Gas Production ◽

Reference Database ◽

Specific Reference ◽

Sequencing Analysis ◽

Sequencing Data ◽

Rumen Microbiota ◽

Reference Databases

Our understanding of complex microbial communities, such as those residing in the rumen, has drastically advanced through the use of high throughput sequencing (HTS) technologies. Indeed, with the use of barcoded amplicon sequencing, it is now cost effective and computationally feasible to identify individual rumen microbial genera associated with ruminant livestock nutrition, genetics, performance and greenhouse gas production. However, across all disciplines of microbial ecology, there is currently little reporting of the use of internal controls for validating HTS results. Furthermore, there is little consensus of the most appropriate reference database for analyzing rumen microbiota amplicon sequencing data. Therefore, in this study, a synthetic rumen-specific sequencing standard was used to assess the effects of database choice on results obtained from rumen microbial amplicon sequencing. Four DADA2 reference training sets (RDP, SILVA, GTDB, and RefSeq + RDP) were compared to assess their ability to correctly classify sequences included in the rumen-specific sequencing standard. In addition, two thresholds of phylogenetic bootstrapping, 50 and 80, were applied to investigate the effect of increasing stringency. Sequence classification differences were apparent amongst the databases. For example the classification of Clostridium differed between all databases, thus highlighting the need for a consistent approach to nomenclature amongst different reference databases. It is hoped the effect of database on taxonomic classification observed in this study, will encourage research groups across various microbial disciplines to develop and routinely use their own microbiome-specific reference standard to validate analysis pipelines and database choice.
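The effect of the two bootstrap thresholds mentioned above can be illustrated with a minimal sketch. The lineage, the confidence values, and the `apply_threshold` helper are hypothetical and are not taken from the study. The sketch only assumes that a rank is kept when its bootstrap support meets the threshold and that all deeper ranks are dropped once one rank fails.

```python
from typing import List, Optional, Tuple

# (rank name, assigned taxon, bootstrap support in percent)
RankCall = Tuple[str, str, int]


def apply_threshold(calls: List[RankCall], min_boot: int) -> List[Optional[str]]:
    """Keep taxa whose bootstrap support is >= min_boot.

    Assumption for this sketch: once one rank falls below the threshold,
    every deeper rank is reported as unassigned (None) as well.
    """
    kept: List[Optional[str]] = []
    failed = False
    for _rank, taxon, boot in calls:
        if failed or boot < min_boot:
            failed = True
            kept.append(None)
        else:
            kept.append(taxon)
    return kept


# Hypothetical classification of a single sequence (illustrative values only).
example: List[RankCall] = [
    ("Phylum", "Firmicutes", 100),
    ("Class", "Clostridia", 97),
    ("Order", "Clostridiales", 85),
    ("Family", "Clostridiaceae", 72),
    ("Genus", "Clostridium", 55),
]

for threshold in (50, 80):
    print(threshold, apply_threshold(example, threshold))
```

With these made-up values, the threshold of 50 retains the assignment down to genus, while the threshold of 80 stops at order, which is the kind of stringency effect the comparison is designed to expose.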

Optimization of 16S amplicon analysis using mock communities: implications for estimating community diversity

10.7287/peerj.preprints.2196v2 ◽

2016 ◽

Cited By ~ 1

Author(s):

Andrew Krohn ◽

Bo Stevens ◽

Adam Robbins-Pianka ◽

Matthew Belus ◽

Gerard J Allan ◽

...

Keyword(s):

Amplicon Sequencing ◽

Community Diversity ◽

Accurate Estimation ◽

Marker Genes ◽

Sequencing Data ◽

Mock Community ◽

Data Set ◽

Environmental Diversity ◽

Quality Filtering ◽

Mock Communities

Diversity of complex microbial communities can be rapidly assessed by community amplicon sequencing of marker genes (e.g., 16S), often yielding many thousands of DNA sequences per sample. However, analysis of community amplicon sequencing data requires multiple computational steps which affect the outcome of a final data set. Here we use mock communities to describe the effects of parameter adjustments for raw sequence quality filtering, picking operational taxonomic units (OTUs), taxonomic assignment, and OTU table filtering as implemented in QIIME 1.9.1. We demonstrate a workflow optimization based upon this exploration which we also apply to environmental samples. We found that quality filtering of raw data and filtering of OTU tables had large effects on observed OTU diversity. While all taxonomy assigners performed with similar accuracy, an appropriate choice of similarity threshold for defining OTUs depended on the method used for OTU picking. Our “default” analysis in QIIME overestimated mock community diversity by at least a factor of ten, compared to the optimized analysis which correctly characterized the taxonomic composition of the mock communities while still overestimating OTU diversity by about a factor of two. Though observed relative abundances of mock community member taxa were approximately correct, most were still represented by multiple OTUs. Low-frequency OTUs conspecific to constituent mock community taxa were characterized by multiple substitution and indel errors and the presence of a low quality base call resulting in sequence truncation during quality filtering. Low quality base calls were observed at “G” positions most of the time, and were also associated with a preceding “TTT” trinucleotide motif. Environmental diversity estimates were reduced by about 40% from 2508 to 1533 OTUs when comparing output from the default and optimized workflows. We attribute this reduction in observed diversity to the removal of erroneous sequences from the data set. Our results indicate that both strict quality filtering of raw sequencing data and careful filtering of raw OTU tables are important steps for accurate estimation of microbial community diversity.
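The reported reduction in environmental OTU counts can be checked with a few lines of arithmetic. The two counts come from the abstract above; the `percent_reduction` helper is only an illustrative name.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return 100.0 * (before - after) / before


# OTU counts reported for the default and the optimized workflow.
default_otus = 2508
optimized_otus = 1533

reduction = percent_reduction(default_otus, optimized_otus)
print(f"{default_otus - optimized_otus} OTUs removed, {reduction:.1f}% reduction")
```

The result is just under 39%, consistent with the "about 40%" stated in the abstract.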

Natrix: A Snakemake-based workflow for processing, clustering, and taxonomically assigning amplicon sequencing reads

10.1101/2020.09.23.309864 ◽

2020 ◽

Author(s):

Marius Welzel ◽

Anja Lange ◽

Dominik Heider ◽

Michael Schwarz ◽

Bernd Freisleben ◽

...

Keyword(s):

High Throughput Sequencing ◽

Workflow Management ◽

Amplicon Sequencing ◽

Version Control ◽

Marker Genes ◽

Sequencing Data ◽

Taxonomic Assignment ◽

Ecological Processes ◽

Sequencing Technologies ◽

User Friendly

Abstract Sequencing of marker genes amplified from environmental samples, known as amplicon sequencing, allows us to resolve some of the hidden diversity and elucidate evolutionary relationships and ecological processes among complex microbial communities. The analysis of large numbers of samples at high sequencing depths generated by high throughput sequencing technologies requires efficient, flexible, and reproducible bioinformatics pipelines. Only a few existing workflows can be run in a user-friendly, scalable, and reproducible manner on different computing devices using an efficient workflow management system. We present Natrix, an open-source bioinformatics workflow for preprocessing raw amplicon sequencing data. The workflow contains all analysis steps from quality assessment, read assembly, dereplication, chimera detection, split-sample merging, sequence representative assignment (OTUs or ASVs) to the taxonomic assignment of sequence representatives. The workflow is written using Snakemake, a workflow management engine for developing data analysis workflows. In addition, Conda is used for version control. Thus, Snakemake ensures reproducibility and Conda offers version control of the utilized programs. The encapsulation of rules and their dependencies support hassle-free sharing of rules between workflows and easy adaptation and extension of existing workflows. Natrix is freely available on GitHub (https://github.com/MW55/Natrix).

Comparative performance of the GenoLab M and NovaSeq 6000 sequencing platforms for transcriptome and LncRNA analysis

BMC Genomics ◽

10.1186/s12864-021-08150-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yongfeng Liu ◽

Ran Han ◽

Letian Zhou ◽

Mingjie Luo ◽

Lidong Zeng ◽

...

Keyword(s):

Next Generation Sequencing ◽

High Performance ◽

Next Generation ◽

Sequencing Data ◽

Comparative Performance ◽

Sequencing Platform ◽

Next Generation Sequencing Platform ◽

Alternatively Spliced ◽

Sequencing Platforms ◽

Generation Sequencing

Abstract Background GenoLab M is a recently established next-generation sequencing platform from GeneMind Biosciences. Presently, Illumina sequencers are the globally leading sequencing platform in the next-generation sequencing market. Here, we present the first report to compare the transcriptome and LncRNA sequencing data of the GenoLab M sequencer to the NovaSeq 6000 platform in various types of analysis. Results We tested 16 libraries in three species using various library kits from different companies. We compared the data quality, gene expression, alternatively spliced (AS) events, single nucleotide polymorphism (SNP), and insertions–deletions (InDel) between the two sequencing platforms. The data suggested that the platforms have comparable sensitivity and accuracy in terms of quantification of gene expression levels with technical compatibility. Conclusions GenoLab M is a promising next-generation sequencing platform for transcriptomics and LncRNA studies with high performance at low costs.

Effects of Dietary Modified Bazhen on Reproductive Performance, Immunity, Breast Milk Microbes, and Metabolome Characterization of Sows

Frontiers in Microbiology ◽

10.3389/fmicb.2021.758224 ◽

2021 ◽

Vol 12 ◽

Author(s):

Jian Geng ◽

Weicheng Jin ◽

Jingyou Hao ◽

Mohan Huo ◽

Yuefeng Zhang ◽

...

Keyword(s):

T Cells ◽

Breast Milk ◽

High Performance ◽

Interleukin 2 ◽

Serum Levels ◽

Amplicon Sequencing ◽

Control Group ◽

Rrna Gene ◽

Standard Diet ◽

Sequencing Data

Bazhen is a classic prescription used for the prevention of qi and blood deficiency. The present study aimed to investigate the effects of dietary supplementation with modified Bazhen powder (MBP) on sows during lactation. Forty pure-bred Yorkshire sows on day 100 of gestation were randomly fed a standard diet supplemented with 20 g MBP per sow per day (MBP group) or without (control group) during -14 to 7 days relative to parturition. Results showed that the serum levels of interleukin 2 (IL-2), immunoglobulin A (IgA), and IgG were higher, whereas IL-10 level was lower in sows fed with MBP diet than in controls on day 7 postpartum. A significantly elevated proportion of serum CD4+ T cells and a slight increase in the ratio of CD4+ to CD8+ T cells in the MBP group were also observed. Furthermore, MBP supplementation improved gastrointestinal function of postpartum sows, evidenced by increased levels of motilin, gastrin, and nitric oxide. Ultra high-performance liquid chromatography combined with a quadrupole time of flight and tandem mass spectrometer identified a total of 21 absorbed milk components. 16S rRNA gene amplicon sequencing data revealed that the microbiota diversity of the colostrum and transitional milk in the MBP group was increased. At the genus level, relative abundances of Enterococcus and Anaerostipes were significantly lower in the MBP group on day 0 of lactation. Metabolomic analysis showed that 38 metabolites were upregulated, and 41 metabolites were downregulated in the transitional milk; 31 metabolites were upregulated and 8 metabolites were downregulated in the colostrum in response to MBP. Metabolic pathways, protein digestion and absorption, and biosynthesis of amino acids were enriched in the colostrum and transitional milk. Our findings provide new insights into the beneficial effects of MBP, highlighted by the changes to the microbiota and metabolomic profile of breast milk from sows fed with an MBP-supplemented diet. Thus, MBP should be considered as a potential dietary supplement for lactating sows in pork production.

Comparative Performance of the Genolab M and Novaseq 6000 Sequencing Platforms for Transcriptome and LncRNA Analysis

10.21203/rs.3.rs-900102/v1 ◽

2021 ◽

Author(s):

Yongfeng Liu ◽

Ran Han ◽

Letian Zhou ◽

Mingjie Luo ◽

Lidong Zeng ◽

...

Keyword(s):

Next Generation Sequencing ◽

High Performance ◽

Next Generation ◽

Sequencing Data ◽

Comparative Performance ◽

Sequencing Platform ◽

Genes Expression ◽

Alternatively Spliced ◽

Sequencing Platforms ◽

Generation Sequencing

Abstract Background: GenoLab M is a recently established next-generation sequencing platform from GeneMind Biosciences. Presently, Illumina sequencers are the globally leading sequencing platform in the next-generation sequencing market. Here, we present the first report to compare the transcriptome and LncRNA sequencing data of the GenoLab M sequencer to the NovaSeq 6000 platform in various types of analysis. Results: We tested 16 libraries in three species using various library kits from different companies. We compared the data quality, gene expression, alternatively spliced (AS) events, single nucleotide polymorphism (SNP), and insertions–deletions (InDel) between the two sequencing platforms. The data suggested that the platforms have comparable sensitivity and accuracy in terms of quantification of gene expression levels with technical compatibility. Conclusions: GenoLab M is a promising sequencing platform for transcriptomics and LncRNA studies with high performance at low costs.
