Synthetic Sequencing Standards: A Guide to Database Choice for Rumen Microbiota Amplicon Sequencing Analysis

2020 · Vol 11
Author(s): Paul E. Smith, Sinead M. Waters, Ruth Gómez Expósito, Hauke Smidt, Ciara A. Carberry, ...

Our understanding of complex microbial communities, such as those residing in the rumen, has advanced drastically through the use of high throughput sequencing (HTS) technologies. Indeed, with the use of barcoded amplicon sequencing, it is now cost-effective and computationally feasible to identify individual rumen microbial genera associated with ruminant livestock nutrition, genetics, performance and greenhouse gas production. However, across all disciplines of microbial ecology, there is currently little reporting of the use of internal controls for validating HTS results. Furthermore, there is little consensus on the most appropriate reference database for analyzing rumen microbiota amplicon sequencing data. Therefore, in this study, a synthetic rumen-specific sequencing standard was used to assess the effects of database choice on results obtained from rumen microbial amplicon sequencing. Four DADA2 reference training sets (RDP, SILVA, GTDB, and RefSeq + RDP) were compared to assess their ability to correctly classify sequences included in the rumen-specific sequencing standard. In addition, two phylogenetic bootstrapping thresholds, 50 and 80, were applied to investigate the effect of increasing stringency. Sequence classification differences were apparent among the databases. For example, the classification of Clostridium differed between all databases, highlighting the need for a consistent approach to nomenclature among different reference databases. It is hoped that the effect of database choice on taxonomic classification observed in this study will encourage research groups across various microbial disciplines to develop and routinely use their own microbiome-specific reference standard to validate analysis pipelines and database choice.
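Conceptually, the validation step reduces to asking whether each reference training set returns the same genus for each sequence in the synthetic standard. A minimal Python sketch of such a comparison follows; the CSV file names and column headers are assumptions for illustration, not outputs of the study itself.

```python
# Hypothetical comparison of genus-level calls across DADA2 training sets.
# Assumes one taxonomy table per database with columns "seq_id" and "genus";
# these names are illustrative, not the study's actual files.
import csv
from collections import defaultdict

DATABASES = ["rdp", "silva", "gtdb", "refseq_rdp"]

def load_genus_calls(path):
    """Map each synthetic-standard sequence ID to its assigned genus."""
    with open(path, newline="") as fh:
        return {row["seq_id"]: row["genus"] or None for row in csv.DictReader(fh)}

calls = {db: load_genus_calls(f"taxonomy_{db}_boot50.csv") for db in DATABASES}

# Tally how many distinct genus calls each standard sequence received.
agreement = defaultdict(int)
for seq_id in calls[DATABASES[0]]:
    distinct = {calls[db].get(seq_id) for db in DATABASES}
    agreement[len(distinct)] += 1  # 1 = all databases agree

for n, count in sorted(agreement.items()):
    print(f"{count} sequences received {n} distinct genus call(s)")
```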

2021 · Vol 11 (1)
Author(s): Yasemin Guenay-Greunke, David A. Bohan, Michael Traugott, Corinna Wallinger

High-throughput sequencing platforms are increasingly being used for targeted amplicon sequencing because they enable cost-effective sequencing of large sample sets. For meaningful interpretation of targeted amplicon sequencing data and comparison between studies, it is critical that bioinformatic analyses do not introduce artefacts and rely on detailed protocols to ensure that all methods are properly performed and documented. The analysis of large sample sets and the use of predefined indexes create challenges, such as adjusting the sequencing depth across samples and taking sequencing errors or index hopping into account. However, the potential biases these factors introduce into high-throughput amplicon sequencing data sets, and how they may be overcome, have rarely been addressed. Using the example of a nested metabarcoding analysis of 1920 carabid beetle regurgitates to assess plant feeding, we investigated: (i) the variation in sequencing depth of individually tagged samples and the effect of library preparation on the data output; (ii) the influence of sequencing errors within index regions and their consequences for demultiplexing; and (iii) the effect of index hopping. Our results demonstrate that, despite library quantification, large variation in read counts and sequencing depth occurred among samples, and that correctly setting the sequencing error rate in bioinformatic software is essential for accurate adapter/primer trimming and demultiplexing. Moreover, setting an index hopping threshold to avoid incorrect assignment of samples is highly recommended.
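The demultiplexing issues in points (ii) and (iii) can be made concrete with a small sketch: tolerate a limited number of mismatches when matching index reads, discard ambiguous matches, and apply a minimum read-count threshold so that low-level cross-sample signal, as expected from index hopping, is not treated as a real detection. All index sequences and the threshold below are invented for illustration.

```python
# Toy demultiplexer: 1-mismatch index matching plus an index-hopping
# read-count threshold. Indexes and threshold are invented examples.
from collections import Counter

SAMPLE_INDEXES = {"ACGTACGT": "sample_01", "TGCATGCA": "sample_02"}
HOP_THRESHOLD = 100  # discard sample calls supported by fewer reads

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def assign_sample(index_read: str, max_mismatches: int = 1):
    """Return the unique sample within the mismatch budget, else None."""
    hits = [sample for idx, sample in SAMPLE_INDEXES.items()
            if hamming(index_read, idx) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None  # ambiguous reads unassigned

index_reads = ["ACGTACGT", "ACGTACGA", "TGCATGCA", "GGGGGGGG"]
counts = Counter(assign_sample(r) for r in index_reads)
kept = {s: n for s, n in counts.items() if s is not None and n >= HOP_THRESHOLD}
print(counts, kept)
```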


2019
Author(s): Julian Regalado, Derek S. Lundberg, Oliver Deusch, Sonja Kersten, Talia Karasov, ...

Microorganisms from all domains of life establish associations with plants. Although some harm the plant, others antagonize pathogens or prime the plant immune system, acquire nutrients, tune plant hormone levels, or perform additional services. Most culture-independent plant microbiome research has focused on amplicon sequencing of 16S rDNA and/or the internal transcribed spacer (ITS) of rDNA loci, but the decreasing cost of high-throughput sequencing has made shotgun metagenome sequencing increasingly accessible. Here, we describe shotgun sequencing of 275 wild Arabidopsis thaliana leaf microbiomes from southwest Germany, with additional bacterial 16S rDNA and eukaryotic ITS1 amplicon data from 176 of these samples. The shotgun data were dominated by bacterial sequences, with eukaryotes contributing only a minority of reads. For shotgun and amplicon data, microbial membership showed weak associations with both site of origin and plant genotype, both of which were highly confounded in this dataset. There was large variation among microbiomes, with one extreme comprising samples of low complexity and a high load of microorganisms typical of infected plants, and the other extreme being samples of high complexity and a low microbial load. We use the metagenome data, which captures the ratio of bacterial to plant DNA in leaves of wild plants, to scale the 16S rDNA amplicon data such that they reflect absolute bacterial abundance. We show that this cost-effective hybrid strategy overcomes compositionality problems in amplicon data and leads to fundamentally different conclusions about microbiome community assembly.
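The scaling step the authors describe can be illustrated with a short sketch: treat the per-sample ratio of bacterial to plant reads in the shotgun data as a microbial load factor and multiply the 16S relative abundances by it. Sample names and counts below are invented.

```python
# Hedged sketch of the hybrid scaling idea: the bacteria-to-plant read
# ratio from shotgun data acts as a per-sample microbial load factor
# that converts 16S relative abundances into approximate absolute
# abundances. All numbers and names are invented for illustration.
shotgun_reads = {  # per sample: reads classified as bacterial vs. plant
    "plant_A": {"bacterial": 2_000_000, "plant": 8_000_000},
    "plant_B": {"bacterial": 500_000, "plant": 9_500_000},
}
amplicon_rel = {  # per sample: 16S relative abundances (sum to 1)
    "plant_A": {"Pseudomonas": 0.60, "Sphingomonas": 0.40},
    "plant_B": {"Pseudomonas": 0.10, "Sphingomonas": 0.90},
}

absolute = {}
for sample, taxa in amplicon_rel.items():
    load = shotgun_reads[sample]["bacterial"] / shotgun_reads[sample]["plant"]
    absolute[sample] = {taxon: rel * load for taxon, rel in taxa.items()}

# plant_A carries roughly 4x the microbial load of plant_B, so similar
# relative abundances translate into very different absolute abundances.
print(absolute)
```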


2020 · Vol 21 (1)
Author(s): Marius Welzel, Anja Lange, Dominik Heider, Michael Schwarz, Bernd Freisleben, ...

Background: Sequencing of marker genes amplified from environmental samples, known as amplicon sequencing, allows us to resolve some of the hidden diversity and elucidate evolutionary relationships and ecological processes among complex microbial communities. The analysis of large numbers of samples at high sequencing depths generated by high throughput sequencing technologies requires efficient, flexible, and reproducible bioinformatics pipelines. Only a few existing workflows can be run in a user-friendly, scalable, and reproducible manner on different computing devices using an efficient workflow management system. Results: We present Natrix, an open-source bioinformatics workflow for preprocessing raw amplicon sequencing data. The workflow contains all analysis steps from quality assessment, read assembly, dereplication, chimera detection, split-sample merging, and sequence representative assignment (OTUs or ASVs) to the taxonomic assignment of sequence representatives. The workflow is written in Snakemake, a workflow management engine for developing data analysis workflows, and Conda is used for version control: Snakemake ensures reproducibility, while Conda offers version control of the utilized programs. The encapsulation of rules and their dependencies supports hassle-free sharing of rules between workflows and easy adaptation and extension of existing workflows. Natrix is freely available on GitHub (https://github.com/MW55/Natrix) or as a Docker container on DockerHub (https://hub.docker.com/r/mw55/natrix). Conclusion: Natrix is a user-friendly and highly extensible workflow for processing Illumina amplicon data.
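To illustrate the Snakemake/Conda combination the abstract describes, here is a minimal example Snakefile in the same spirit; the rule, file paths, and environment file are invented and are not Natrix's actual rules.

```python
# Example Snakefile (Snakemake rules are written in a Python-based DSL).
# Rule names, paths, and the Conda environment are illustrative only.
SAMPLES = ["sample_01", "sample_02"]

rule all:
    input:
        expand("results/{sample}_filtered.fastq.gz", sample=SAMPLES)

rule quality_filter:
    input:
        "raw/{sample}.fastq.gz"
    output:
        "results/{sample}_filtered.fastq.gz"
    conda:
        "envs/qc.yaml"  # pins the exact tool versions for reproducibility
    shell:
        "fastp -i {input} -o {output}"
```

Run with `snakemake --use-conda --cores 4`, Snakemake resolves the dependency graph from the file names, while Conda builds the pinned environment before the rule executes.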


2015
Author(s): Manolis Maragkakis, Panagiotis Alexiou, Zissimos Mourelatos

Background: High throughput sequencing (HTS) has become one of the primary experimental tools used to extract genomic information from biological samples. Bioinformatics tools are continuously being developed for the analysis of HTS data. Beyond some well-defined core analyses, such as quality control or genomic alignment, the consistent development of custom tools and the representation of sequencing data in organized computational structures and entities remain a challenging effort for bioinformaticians. Results: In this work, we present GenOO [jee-noo], an open-source, object-oriented (OO) Perl framework specifically developed for the design and implementation of HTS analysis tools. GenOO models biological entities such as genes and transcripts as Perl objects and includes relevant modules, attributes and methods that allow for the manipulation of high throughput sequencing data. GenOO integrates these elements in a simple and transparent way, which allows for the creation of complex analysis pipelines while minimizing the overhead for the researcher. GenOO has been designed with flexibility in mind and has an easily extendable modular structure with minimal requirements for external tools and libraries. As an example of the framework's capabilities and usability, we present a short and simple walkthrough of a custom use case in HTS analysis. Conclusions: GenOO is a high-quality software tool that can be efficiently used for advanced HTS analyses. It has been used to develop several custom analysis tools, leading to a number of published works. Using GenOO as a core development module can greatly benefit users by reducing the overhead and complexity of managing HTS data and biological entities.
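GenOO itself is written in Perl, so the following is only a language-neutral illustration (rendered in Python) of the pattern the abstract describes: biological entities modeled as objects whose attributes and methods encapsulate common HTS bookkeeping. The class and method names are invented, not GenOO's API.

```python
# Minimal Python analogue of the object-oriented pattern described above:
# genes and transcripts as objects with domain-specific methods. Names
# are invented for illustration and do not mirror GenOO's Perl modules.
from dataclasses import dataclass, field

@dataclass
class Transcript:
    name: str
    chromosome: str
    exon_starts: list[int]
    exon_ends: list[int]

    @property
    def exonic_length(self) -> int:
        return sum(end - start for start, end in zip(self.exon_starts, self.exon_ends))

@dataclass
class Gene:
    name: str
    transcripts: list[Transcript] = field(default_factory=list)

    def longest_transcript(self) -> Transcript:
        return max(self.transcripts, key=lambda t: t.exonic_length)

gene = Gene("geneX", [Transcript("tx1", "chr2L", [100, 500], [200, 900])])
print(gene.longest_transcript().exonic_length)  # 500
```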


2019
Author(s): Kevin H.-C. Wei, Aditya Mantha, Doris Bachtrog

Recombination is the exchange of genetic material between homologous chromosomes via physical crossovers. Pioneered by T. H. Morgan and A. Sturtevant over a century ago, methods to estimate recombination rate and genetic distance require scoring large numbers of recombinant individuals between molecular or visible markers. While high throughput sequencing methods have allowed for genome-wide crossover detection producing high resolution maps, such methods rely on large numbers of individually sequenced recombinants and are therefore difficult to scale. Here, we present a simple and scalable method to infer near chromosome-wide recombination rates from marker-selected pools, and the corresponding analytical software MarSuPial. Rather than genotyping individuals from recombinant backcrosses, we bulk sequence marker-selected pools to infer the allele frequency decay around the selected locus; since the number of recombinant individuals increases proportionally with the genetic distance from the selected locus, the allele frequency across the chromosome can be used to estimate the genetic distance and recombination rate. We mathematically demonstrate the relationship between allele frequency attenuation, recombinant fraction, genetic distance, and recombination rate in marker-selected pools. Based on available chromosome-wide recombination rate models of Drosophila, we simulated read counts and determined that nonlinear local regressions (LOESS) produce robust estimates despite the high noise inherent to sequencing data. To empirically validate this approach, we show that (single) marker-selected pools closely recapitulate genetic distances inferred from scoring recombinants between double markers. We theoretically determine how secondary loci with viability effects can modulate the allele frequency decay and how to account for such effects directly from the data. We generated recombination maps for three wild-derived strains, which correlate strongly with previous genome-wide measurements. Interestingly, amidst extensive recombination rate variation, multiple regions of the genomes show elevated rates across all strains. Lastly, we apply this method to estimate chromosome-wide crossover interference. Altogether, we find that marker-selected pooling is a simple and cost-effective method for broad recombination rate estimates. Although it does not identify individual crossover events, it can generate near chromosome-wide recombination maps from as little as one or two libraries.
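The central relationship can be sketched numerically. Among the F1-derived chromatids in a backcross pool selected for a marker, the frequency of the selected parental allele at a linked site is f(d) = 1 - r(d), where the recombinant fraction r follows a map function such as Haldane's r = (1 - e^(-2d))/2 for distance d in Morgans. The simulation below (invented parameters, LOESS via statsmodels rather than the authors' MarSuPial code) shows how noisy pooled allele frequencies can be smoothed and inverted back to genetic distance.

```python
# Simulate allele frequency decay around a selected marker and recover
# genetic distance with LOESS. Depth, grid, and seed are invented; this
# is a sketch of the idea, not MarSuPial's implementation.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
d = np.linspace(0.0, 0.5, 200)                # genetic distance (Morgans)
r = (1 - np.exp(-2 * d)) / 2                  # Haldane map function
true_freq = 1 - r                             # selected-allele frequency

depth = 50                                    # simulated reads per site
obs = rng.binomial(depth, true_freq) / depth  # noisy pooled frequencies

fit = lowess(obs, d, frac=0.3)                # columns: sorted d, smoothed f
est_r = np.clip(1 - fit[:, 1], 0.0, 0.499)    # invert f = 1 - r
est_d = -np.log(1 - 2 * est_r) / 2            # invert Haldane's function
```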


2020
Author(s): Evangelos A. Dimopoulos, Alberto Carmagnini, Irina M. Velsko, Christina Warinner, Greger Larson, ...

Identification of specific species in metagenomic samples is critical for several key applications, yet many available tools require large amounts of computational power and are often prone to false positive identifications. Here we describe High-AccuracY and Scalable Taxonomic Assignment of MetagenomiC data (HAYSTAC), which can estimate the probability that a specific taxon is present in a metagenome. HAYSTAC provides a user-friendly tool to construct databases, based on publicly available genomes, that are used for competitive read mapping. It then uses a novel Bayesian framework to infer the abundance and statistical support for each species identification, and provides per-read species classification. Unlike other methods, HAYSTAC is specifically designed to efficiently handle both ancient and modern DNA data, as well as incomplete reference databases, making it possible to run highly accurate hypothesis-driven analyses (i.e., assessing the presence of a specific species) on variably sized reference databases while dramatically improving processing speeds. We tested the performance and accuracy of HAYSTAC using simulated Illumina libraries, both with and without ancient DNA damage, and compared the results to other currently available methods (i.e., Kraken2/Bracken, MALT/HOPS, and Sigma). HAYSTAC identified fewer false positives than both Kraken2/Bracken and MALT in all simulations, and fewer than Sigma in simulations of ancient data. It uses less memory than both Kraken2/Bracken and MALT during database construction and sample analysis. Lastly, we used HAYSTAC to search for specific pathogens in two published ancient metagenomic datasets, demonstrating how it can be applied to empirical datasets. HAYSTAC is available from https://github.com/antonisdim/HAYSTAC.
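The hypothesis-driven flavor of the analysis can be conveyed with a toy calculation; this is emphatically not HAYSTAC's actual model, just a minimal Bayesian comparison of a "species present" against a "species absent" explanation for the reads assigned to a target taxon, with all rates invented.

```python
# Toy Bayesian presence test (not HAYSTAC's model): compare binomial
# likelihoods of the target-assigned read count under invented
# "present" and "absent" assignment rates, in log space for stability.
from math import lgamma, log, exp

def log_binom_pmf(k: int, n: int, p: float) -> float:
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_comb + k * log(p) + (n - k) * log(1 - p)

n_reads = 10_000      # classified reads in the sample
k_target = 120        # reads competitively assigned to the target species
p_present = 0.01      # assumed assignment rate if the species is present
p_absent = 0.001      # assumed spurious rate (e.g. misassignment) if absent

log_bf = (log_binom_pmf(k_target, n_reads, p_present)
          - log_binom_pmf(k_target, n_reads, p_absent))
posterior = 1.0 / (1.0 + exp(-log_bf))  # P(present | data), 50:50 prior
print(f"log Bayes factor = {log_bf:.1f}, posterior = {posterior:.4f}")
```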


PeerJ · 2016 · Vol 4 · pp. e1839
Author(s): Tom O. Delmont, A. Murat Eren

High-throughput sequencing provides a fast and cost-effective means to recover genomes of organisms from all domains of life. However, adequate curation of assembly results against potential contamination by non-target organisms requires advanced bioinformatics approaches and practices. Here, we re-analyzed the sequencing data generated for the tardigrade Hypsibius dujardini, and created a holistic display of the eukaryotic genome assembly using DNA data originating from two groups and eleven sequencing libraries. By using bacterial single-copy genes, k-mer frequencies, and coverage values of scaffolds, we could identify and characterize multiple near-complete bacterial genomes from the raw assembly, and curate a 182 Mbp draft genome for H. dujardini supported by RNA-Seq data. Our results indicate that most contaminant scaffolds were assembled from Moleculo long-read libraries, and that most of these contaminants differed between library preparations. Our re-analysis shows that visualization and curation of eukaryotic genome assemblies can benefit from tools designed to address the needs of today's microbiologists, who are constantly challenged by the difficulties associated with the identification of distinct microbial genomes in complex environmental metagenomes.
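A compressed sketch of two of the signals used in that curation, tetranucleotide composition and coverage, is below; the toy scaffolds and cutoffs are invented, and the authors' actual workflow combines more evidence, including bacterial single-copy genes.

```python
# Toy contamination screen: tetranucleotide frequency vectors plus mean
# coverage per scaffold; scaffolds far from the composition centroid or
# with outlier coverage are flagged. Sequences, coverages, and cutoffs
# are invented; real curation uses additional lines of evidence.
from itertools import product
import numpy as np

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
K_INDEX = {k: i for i, k in enumerate(KMERS)}

def tetra_freq(seq: str) -> np.ndarray:
    counts = np.zeros(len(KMERS))
    for i in range(len(seq) - 3):
        idx = K_INDEX.get(seq[i:i + 4])
        if idx is not None:          # skip windows containing N, etc.
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total else counts

scaffolds = {
    "scaffold_1": "ACGTTGCAGGCC" * 100,
    "scaffold_2": "ACGTTGCAGGCA" * 100,
    "scaffold_3": "ATATATATATAT" * 100,   # compositional outlier
}
coverage = {"scaffold_1": 30.0, "scaffold_2": 35.0, "scaffold_3": 250.0}

profiles = np.array([tetra_freq(s) for s in scaffolds.values()])
dist = np.linalg.norm(profiles - profiles.mean(axis=0), axis=1)
cov_median = np.median(list(coverage.values()))
for (name, cov), kdist in zip(coverage.items(), dist):
    flagged = kdist > 2 * np.median(dist) or cov > 5 * cov_median
    print(name, f"kmer_dist={kdist:.3f}", f"cov={cov}", "FLAG" if flagged else "ok")
```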


2021 · Vol 3 (4)
Author(s): Rajesh Detroja, Alessandro Gorohovski, Olawumi Giwa, Gideon Baum, Milana Frenkel-Morgenstern

Fusion genes or chimeras typically comprise sequences from two different genes. The chimeric RNAs of such joined sequences often serve as cancer drivers. Identifying such driver fusions in a given cancer or complex disease is important for diagnosis and treatment. The advent of next-generation sequencing technologies, such as DNA-Seq or RNA-Seq, together with the development of suitable computational tools, has made the global identification of chimeras in tumors possible. However, testing of over 20 computational methods showed them to be limited in terms of chimera prediction sensitivity, specificity, and accurate quantification of junction reads. These shortcomings motivated us to develop the first 'reference-based' approach, termed ChiTaH (Chimeric Transcripts from High-throughput sequencing data). ChiTaH uses 43,466 non-redundant known human chimeras as a reference database to map sequencing reads and to accurately identify chimeric reads. We benchmarked ChiTaH and four other methods for identifying human chimeras, leveraging both simulated and real sequencing datasets. ChiTaH was found to be the most accurate and fastest method for identifying known human chimeras in both cases. Moreover, ChiTaH uncovered heterogeneity of the BCR-ABL1 chimera in both bulk and single cells of the K-562 cell line, which was confirmed experimentally.
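The reference-based logic can be reduced to a toy example: represent each known chimera by the sequence around its fusion junction and count reads that span the junction with a minimum anchor on both sides. The junction sequence, anchor length, and exact-match lookup below are invented simplifications; ChiTaH maps reads against its 43,466-chimera database with a real aligner.

```python
# Toy junction-spanning read counter. The reference sequence and anchor
# are invented; a real pipeline would use alignment, not exact matching.
JUNCTIONS = {
    # name: (sequence around the fusion point, 0-based junction offset)
    "BCR-ABL1_example": ("ACCTGGATTTAAGCAGAGTTCAAAAGCCCTTCAGCGG", 18),
}
ANCHOR = 10  # min bases a read must cover on each side of the junction

def spans_junction(read: str, ref: str, junction: int) -> bool:
    pos = ref.find(read)
    if pos == -1:
        return False
    return pos <= junction - ANCHOR and pos + len(read) >= junction + ANCHOR

reads = ["TTAAGCAGAGTTCAAAAGCCCTT", "ACCTGGATTTAAG"]
for name, (ref, jpos) in JUNCTIONS.items():
    hits = sum(spans_junction(r, ref, jpos) for r in reads)
    print(f"{name}: {hits} junction-spanning read(s)")
```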


PeerJ · 2016 · Vol 4 · pp. e1612
Author(s): Zachery T. Lewis, Jasmine C.C. Davis, Jennifer T. Smilowitz, J. Bruce German, Carlito B. Lebrilla, ...

Infant fecal samples are commonly studied to investigate the impacts of breastfeeding on the development of the microbiota and subsequent health effects. Comparisons of infants living in different geographic regions and environmental contexts are needed to aid our understanding of evolutionarily-selected milk adaptations. However, the preservation of fecal samples from individuals in remote locales until they can be processed can be a challenge. Freeze-drying (lyophilization) offers a cost-effective way to preserve some biological samples for transport and analysis at a later date. Currently, it is unknown what, if any, biases are introduced into various analyses by the freeze-drying process. Here, we investigated how freeze-drying affected analysis of two relevant and intertwined aspects of infant fecal samples: marker gene amplicon sequencing of the bacterial community and the fecal oligosaccharide profile (undigested human milk oligosaccharides). No differences were discovered between the fecal oligosaccharide profiles of wet and freeze-dried samples. The marker gene sequencing data showed an increase in the proportional representation of Bacteroides and a decrease in the detection of bifidobacteria and members of class Bacilli after freeze-drying. This sample treatment bias may be related to the cell morphology of these different taxa (Gram status). However, these effects did not overwhelm the natural variation among individuals, as the community data still grouped strongly by subject and not by freeze-drying status. We also found that compensating for sample concentration during freeze-drying, while unnecessary, was not detrimental. Freeze-drying may therefore be an acceptable method of sample preservation and mass reduction for some studies of microbial ecology and milk glycan analysis.

