Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data

Abstract

Background: The accuracy of microbial community surveys based on marker-gene and metagenomic sequencing (MGS) suffers from the presence of contaminants (DNA sequences not truly present in the sample). Contaminants come from various sources, including reagents. Appropriate laboratory practices can reduce contamination, but do not eliminate it. Here we introduce decontam (https://github.com/benjjneb/decontam), an open-source R package that implements a statistical classification procedure that identifies contaminants in MGS data based on two widely reproduced patterns: contaminants appear at higher frequencies in low-concentration samples, and are often found in negative controls.

Results: decontam classified amplicon sequence variants (ASVs) in a human oral dataset consistently with prior microscopic observations of the microbial taxa inhabiting that environment and previous reports of contaminant taxa. In metagenomics and marker-gene measurements of a dilution series, decontam substantially reduced technical variation arising from different sequencing protocols. The application of decontam to two recently published datasets corroborated and extended their conclusions that little evidence existed for an indigenous placenta microbiome, and that some low-frequency taxa seemingly associated with preterm birth were contaminants.

Conclusions: decontam improves the quality of metagenomic and marker-gene sequencing by identifying and removing contaminant DNA sequences. decontam integrates easily with existing MGS workflows, and allows researchers to generate more accurate profiles of microbial communities at little to no additional cost.
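
The first pattern named in the abstract, that contaminants appear at higher frequencies in low-concentration samples, can be illustrated with a small sketch. The Python below is not decontam's implementation (decontam is an R package). It is a simplified illustration that assumes a contaminant's frequency scales inversely with total DNA concentration, while a truly present sequence's frequency does not depend on concentration. The function names, the decision rule, and the example numbers are all hypothetical.

```python
import math

def log_slope(concentrations, frequencies):
    """Least-squares slope of log(frequency) against log(concentration)."""
    xs = [math.log(c) for c in concentrations]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def looks_like_contaminant(concentrations, frequencies):
    """Hypothetical rule: flag a sequence whose log-log slope is closer
    to -1 (inverse scaling) than to 0 (no dependence on concentration)."""
    slope = log_slope(concentrations, frequencies)
    return abs(slope + 1.0) < abs(slope)

# Made-up data: total DNA concentration per sample (arbitrary units).
conc = [1.0, 2.0, 5.0, 10.0, 50.0]
# A sequence whose frequency falls as concentration rises.
contaminant_freq = [0.5 / c for c in conc]
# A sequence whose frequency is roughly constant across samples.
resident_freq = [0.20, 0.22, 0.19, 0.21, 0.20]

print(looks_like_contaminant(conc, contaminant_freq))
print(looks_like_contaminant(conc, resident_freq))
```

A real classifier would need a statistical test rather than a hard slope comparison, and would also use the second pattern (presence in negative controls) where controls are available.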

Consistent and correctable bias in metagenomic sequencing experiments

10.1101/559831 ◽

2019 ◽

Cited By ~ 9

Author(s):

Michael R. McLaren ◽

Amy D. Willis ◽

Benjamin J. Callahan

Keyword(s):

Marker Gene ◽

PCR Amplification ◽

rRNA Gene ◽

Metagenomic Sequencing ◽

Shotgun Metagenomics ◽

Biological Communities ◽

Metagenomics Data ◽

Specific Factors ◽

Relative Abundances ◽

True Values

Abstract: Measurements of biological communities by marker-gene and metagenomic sequencing are biased: The measured relative abundances of taxa or their genes are systematically distorted from their true values because each step in the experimental workflow preferentially detects some taxa over others. Bias can lead to qualitatively incorrect conclusions and makes measurements from different protocols quantitatively incomparable. A rigorous understanding of bias is therefore essential. Here we propose, test, and apply a simple mathematical model of how bias distorts marker-gene and metagenomics measurements: Bias multiplies the true relative abundances within each sample by taxon- and protocol-specific factors that describe the different efficiencies with which taxa are detected by the workflow. Critically, these factors are consistent across samples with different compositions, allowing bias to be estimated and corrected. We validate this model in 16S rRNA gene and shotgun metagenomics data from bacterial communities with defined compositions. We use it to reason about the effects of bias on downstream statistical analyses, finding that analyses based on taxon ratios are less sensitive to bias than analyses based on taxon proportions. Finally, we demonstrate how this model can be used to quantify bias from samples of defined composition, to partition bias into steps such as DNA extraction and PCR amplification, and to correct biased measurements. Our model improves on previous models by providing a better fit to experimental data and by providing a composition-independent approach to analyzing, measuring, and correcting bias.
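
The multiplicative model described in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the efficiency values, the sample compositions, and the function names are hypothetical. It shows why ratios are less sensitive to bias than proportions: the ratio between two taxa is distorted by the same factor in every sample, while the error in a single taxon's proportion depends on what else is in the sample.

```python
def normalize(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def apply_bias(true_props, efficiencies):
    """Multiply each taxon's true proportion by its efficiency, then renormalize."""
    return normalize([p * e for p, e in zip(true_props, efficiencies)])

def ratio_error(true_props, observed, i, j):
    """Fold error in the observed ratio of taxon i to taxon j."""
    return (observed[i] / observed[j]) / (true_props[i] / true_props[j])

# Hypothetical relative detection efficiencies for taxa A, B, C.
efficiencies = [1.0, 4.0, 0.5]

# Two hypothetical samples with different true compositions.
sample_1 = [0.5, 0.25, 0.25]
sample_2 = [0.1, 0.1, 0.8]

observed_1 = apply_bias(sample_1, efficiencies)
observed_2 = apply_bias(sample_2, efficiencies)

# The A:B ratio is off by the same factor (efficiency of A / efficiency of B)
# in both samples.
print(ratio_error(sample_1, observed_1, 0, 1))
print(ratio_error(sample_2, observed_2, 0, 1))

# The proportion of A is underestimated in one sample and overestimated
# in the other, so its error is composition-dependent.
print(observed_1[0] / sample_1[0], observed_2[0] / sample_2[0])
```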

PanGIA: A Metagenomics Analytical Framework for Routine Biosurveillance and Clinical Pathogen Detection

10.1101/2020.04.20.051813 ◽

2020 ◽

Author(s):

Po-E Li ◽

Joseph A. Russell ◽

David Yarmosh ◽

Alan G. Shteyman ◽

Kyle Parker ◽

...

Keyword(s):

Reference Genome ◽

Query Sequence ◽

Marker Gene ◽

Control Sample ◽

Analytical Framework ◽

Ease Of Use ◽

Shotgun Metagenomics ◽

Link Type ◽

Metagenomics Data ◽

Real World Datasets

Abstract: Metagenomics is emerging as an important tool in biosurveillance, public health, and clinical applications. However, ease-of-use for execution and data analysis remains a barrier-of-entry to the adoption of metagenomics in applied health and forensics settings. In addition, these venues often have more stringent requirements for reporting, accuracy, and precision than the traditional ecological research role of the technology. Here, we present PanGIA (Pan-Genomics for Infectious Agents), a novel bioinformatics analysis platform for hosting, processing, analyzing, and reporting shotgun metagenomics data of complex samples suspected of containing one or more pathogens. PanGIA was developed to address gaps that often preclude clinicians, medical technicians, forensics personnel, or other non-expert end-users from the routine application of metagenomics for pathogen identification. Though primarily designed to detect pathogenic microorganisms within clinical and environmental metagenomics data, PanGIA also serves as an analytical framework for microbial community profiling and comparative metagenomics. To provide statistical confidence in PanGIA’s taxonomic assignments, the system provides two independent estimations of probability for species and strain level detection. First, PanGIA integrates coverage data with ‘uniqueness’ information mapped across each reference genome for a stand-alone determination of confidence for each query sequence at each taxonomy level. Second, if a negative-control sample is provided, PanGIA compares this sample with a corresponding experimental unknown sample and determines a measure of confidence associated with ‘detection above background’. An integrated graphical user interface allows interactive interrogation and enables users to summarize multiple sample results by confidence score, normalized read abundance, reference genome linear coverage, depth-of-coverage, RPKM, and other metrics to detect specific organisms-of-interest. Comparison testing of the PanGIA algorithm against a number of recent k-mer, read-mapping, and marker-gene based taxonomy classifiers across various real-world datasets with spiked targets shows superior mean positive predictive value, sensitivity, and specificity. PanGIA can process a five million paired-end read dataset in under 1 hour on commodity computational hardware. The source code and documentation are publicly available at https://github.com/LANL-Bioinformatics/PanGIA or https://github.com/mriglobal/PanGIA. The database for PanGIA can be downloaded from ftp://bioinformatics.mriglobal.org/. The full GUI-based PanGIA analysis environment is available in a Docker container and can be installed from https://hub.docker.com/r/poeli/pangia/.
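
For readers unfamiliar with the summary metrics listed above, the sketch below computes three of them using their commonly used textbook definitions. This is an assumption-laden illustration, not PanGIA's code: PanGIA's own definitions may differ in detail, and the function names and input numbers are hypothetical.

```python
def rpkm(mapped_reads, reference_length_bp, total_reads):
    """Reads per kilobase of reference per million reads in the dataset."""
    return mapped_reads * 1e9 / (reference_length_bp * total_reads)

def depth_of_coverage(per_base_depth):
    """Mean number of reads covering each reference position."""
    return sum(per_base_depth) / len(per_base_depth)

def linear_coverage(per_base_depth):
    """Fraction of reference positions covered by at least one read."""
    covered = sum(1 for d in per_base_depth if d > 0)
    return covered / len(per_base_depth)

# Made-up example: 500 reads map to a 5,000 bp reference
# out of 5,000,000 reads in the dataset.
print(rpkm(500, 5000, 5_000_000))

# Made-up per-base depths for a toy 5 bp reference.
toy_depth = [0, 2, 2, 0, 4]
print(depth_of_coverage(toy_depth))
print(linear_coverage(toy_depth))
```

Reporting linear coverage alongside depth helps separate reads spread evenly across a genome from reads piled onto one short region.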

Taxonomic classification method for metagenomics based on core protein families with Core-Kaiju

Nucleic Acids Research ◽

10.1093/nar/gkaa568 ◽

2020 ◽

Vol 48 (16) ◽

pp. e93-e93

Author(s):

Anna Tovo ◽

Peter Menzel ◽

Anders Krogh ◽

Marco Cosentino Lagomarsino ◽

Samir Suweis

Keyword(s):

Core Protein ◽

Marker Gene ◽

Protein Domain ◽

Classification Method ◽

Shotgun Metagenomics ◽

Novel Approach ◽

True Number ◽

Metagenomics Data ◽

Reference Databases ◽

Mock Communities

Abstract: Characterizing species diversity and composition of bacteria hosted by biota is revolutionizing our understanding of the role of symbiotic interactions in ecosystems. Determining microbiome diversity implies the assignment of individual reads to taxa by comparison to reference databases. Although computational methods aimed at identifying microbial taxa are available, it is well known that inferences using different methods can vary widely depending on various biases. In this study, we first apply and compare different bioinformatics methods based on 16S ribosomal RNA gene and shotgun sequencing to three mock communities of bacteria, of which the compositions are known. We show that none of these methods can infer both the true number of taxa and their abundances. We thus propose a novel approach, named Core-Kaiju, which combines the power of shotgun metagenomics data with a more focused marker gene classification method similar to 16S, but based on emergent statistics of core protein domain families. We then test the proposed method on various mock communities and show that Core-Kaiju reliably predicts both number of taxa and abundances. Finally, we apply our method to human gut samples, showing how Core-Kaiju may give a more accurate ecological characterization and a fresh view of real microbiomes.

Taxonomic classification method for metagenomics based on core protein families with Core-Kaiju

10.1101/2020.01.08.898395 ◽

2020 ◽

Author(s):

Anna Tovo ◽

Peter Menzel ◽

Anders Krogh ◽

Marco Cosentino Lagomarsino ◽

Samir Suweis

Keyword(s):

Core Protein ◽

Marker Gene ◽

Taxonomic Classification ◽

Protein Domain ◽

Classification Method ◽

Shotgun Metagenomics ◽

Novel Approach ◽

True Number ◽

Metagenomics Data ◽

Mock Communities

Abstract: Characterizing species diversity and composition of bacteria hosted by biota is revolutionizing our understanding of the role of symbiotic interactions in ecosystems. However, determining microbiome diversity implies the classification of taxa composition within the sampled community, which is often done via the assignment of individual reads to taxa by comparison to reference databases. Although computational methods aimed at identifying microbial taxa are available, it is well known that inferences using different methods can vary widely depending on various biases. In this study, we first apply and compare different bioinformatics methods based on 16S ribosomal RNA gene and whole genome shotgun sequencing for taxonomic classification to three small mock communities of bacteria, of which the compositions are known. We show that none of these methods can infer both the true number of taxa and their abundances. We thus propose a novel approach, named Core-Kaiju, which combines the power of shotgun metagenomics data with a more focused marker gene classification method similar to 16S, but based on emergent statistics of core protein domain families. We then test the proposed method on the three small mock communities and also on medium- and highly complex mock community datasets taken from the Critical Assessment of Metagenome Interpretation challenge. We show that Core-Kaiju reliably predicts both number of taxa and abundance of the analysed mock bacterial communities. Finally, we apply our method to human gut samples, showing how Core-Kaiju may give a more accurate ecological characterization and a fresh view of real microbiomes.

Consistent and correctable bias in metagenomic sequencing experiments

eLife ◽

10.7554/elife.46923 ◽

2019 ◽

Vol 8 ◽

Cited By ~ 42

Author(s):

Michael R McLaren ◽

Amy D Willis ◽

Benjamin J Callahan

Keyword(s):

Experimental Data ◽

Bacterial Communities ◽

Marker Gene ◽

rRNA Gene ◽

Metagenomic Sequencing ◽

Shotgun Metagenomics ◽

Biological Communities ◽

Experimental Bias ◽

Metagenomics Data ◽

Or Gene

Marker-gene and metagenomic sequencing have profoundly expanded our ability to measure biological communities. But the measurements they provide differ from the truth, often dramatically, because these experiments are biased toward detecting some taxa over others. This experimental bias makes the taxon or gene abundances measured by different protocols quantitatively incomparable and can lead to spurious biological conclusions. We propose a mathematical model for how bias distorts community measurements based on the properties of real experiments. We validate this model with 16S rRNA gene and shotgun metagenomics data from defined bacterial communities. Our model better fits the experimental data despite being simpler than previous models. We illustrate how our model can be used to evaluate protocols, to understand the effect of bias on downstream statistical analyses, and to measure and correct bias given suitable calibration controls. These results illuminate new avenues toward truly quantitative and reproducible metagenomics measurements.
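
The idea of measuring and correcting bias with a calibration control can be sketched as follows. This is a minimal illustration, not the authors' method as implemented: it assumes bias acts as a per-taxon multiplicative factor followed by renormalization, that a single control of known composition containing every taxon is available, and that measurements are noise-free. All names and numbers are hypothetical.

```python
def normalize(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def estimate_efficiencies(known_props, observed_props):
    """Per-taxon efficiencies relative to the first taxon, from one control.
    Requires every taxon to be present in the control (no zero proportions)."""
    raw = [o / k for o, k in zip(observed_props, known_props)]
    return [r / raw[0] for r in raw]

def correct(observed_props, efficiencies):
    """Undo the bias: divide by the efficiencies, then renormalize."""
    return normalize([o / e for o, e in zip(observed_props, efficiencies)])

# Hypothetical calibration control: an even mix of three taxa, and the
# proportions a biased protocol reports for it.
control_known = [1 / 3, 1 / 3, 1 / 3]
control_observed = normalize([1.0, 4.0, 0.5])

efficiencies = estimate_efficiencies(control_known, control_observed)

# Hypothetical biased measurement of a sample whose true composition
# is [0.1, 0.1, 0.8] under the same protocol.
sample_observed = normalize([0.1 * 1.0, 0.1 * 4.0, 0.8 * 0.5])
sample_corrected = correct(sample_observed, efficiencies)
print(sample_corrected)
```

Only relative efficiencies can be recovered from proportions, which is why the estimate is expressed relative to a reference taxon. With real, noisy data one would combine several controls rather than rely on a single measurement.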
