PanGIA: A Metagenomics Analytical Framework for Routine Biosurveillance and Clinical Pathogen Detection

ABSTRACT Metagenomics is emerging as an important tool in biosurveillance, public health, and clinical applications. However, ease of use for execution and data analysis remains a barrier to entry for the adoption of metagenomics in applied health and forensics settings. In addition, these venues often have more stringent requirements for reporting, accuracy, and precision than the traditional ecological research role of the technology. Here, we present PanGIA (Pan-Genomics for Infectious Agents), a novel bioinformatics analysis platform for hosting, processing, analyzing, and reporting shotgun metagenomics data of complex samples suspected of containing one or more pathogens. PanGIA was developed to address gaps that often preclude clinicians, medical technicians, forensics personnel, or other non-expert end-users from the routine application of metagenomics for pathogen identification. Though primarily designed to detect pathogenic microorganisms within clinical and environmental metagenomics data, PanGIA also serves as an analytical framework for microbial community profiling and comparative metagenomics. To provide statistical confidence in PanGIA’s taxonomic assignments, the system provides two independent estimations of probability for species- and strain-level detection. First, PanGIA integrates coverage data with ‘uniqueness’ information mapped across each reference genome for a stand-alone determination of confidence for each query sequence at each taxonomy level. Second, if a negative-control sample is provided, PanGIA compares this sample with a corresponding experimental unknown sample and determines a measure of confidence associated with ‘detection above background’.
An integrated graphical user interface allows interactive interrogation and enables users to summarize multiple sample results by confidence score, normalized read abundance, reference genome linear coverage, depth-of-coverage, RPKM, and other metrics to detect specific organisms-of-interest. Comparison testing of the PanGIA algorithm against a number of recent k-mer, read-mapping, and marker-gene based taxonomy classifiers across various real-world datasets with spiked targets shows superior mean positive predictive value, sensitivity, and specificity. PanGIA can process a five million paired-end read dataset in under 1 hour on commodity computational hardware. The source code and documentation are publicly available at https://github.com/LANL-Bioinformatics/PanGIA or https://github.com/mriglobal/PanGIA. The database for PanGIA can be downloaded from ftp://bioinformatics.mriglobal.org/. The full GUI-based PanGIA analysis environment is available in a Docker container and can be installed from https://hub.docker.com/r/poeli/pangia/.
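The per-reference metrics named in the abstract (RPKM, linear coverage, depth-of-coverage) can be illustrated with a minimal sketch using their standard textbook definitions. This is not PanGIA's code; the function names and the interval representation (half-open `(start, end)` alignments on one reference) are assumptions made for illustration.

```python
def rpkm(mapped_reads, reference_length_bp, total_reads):
    """Reads per kilobase of reference per million reads in the sample."""
    return mapped_reads * 1e9 / (reference_length_bp * total_reads)


def linear_coverage(intervals, reference_length_bp):
    """Fraction of reference positions covered by at least one alignment.

    `intervals` holds half-open (start, end) alignments (hypothetical input format).
    """
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip bases already counted
        if end > start:
            covered += end - start
            current_end = end
    return covered / reference_length_bp


def depth_of_coverage(intervals, reference_length_bp):
    """Mean number of aligned bases per reference position."""
    return sum(end - start for start, end in intervals) / reference_length_bp


if __name__ == "__main__":
    alignments = [(0, 100), (50, 150), (300, 400)]  # made-up alignments
    print(linear_coverage(alignments, 1000))
    print(depth_of_coverage(alignments, 1000))
    print(rpkm(500, 5_000_000, 10_000_000))
```

Linear coverage counts each covered base once, while depth counts overlapping alignments repeatedly, so a reference with a few deeply covered regions can show high depth but low linear coverage.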

Consistent and correctable bias in metagenomic sequencing experiments

10.1101/559831 ◽

2019 ◽

Cited By ~ 9

Author(s):

Michael R. McLaren ◽

Amy D. Willis ◽

Benjamin J. Callahan

Keyword(s):

Marker Gene ◽

PCR Amplification ◽

rRNA Gene ◽

Metagenomic Sequencing ◽

Shotgun Metagenomics ◽

Biological Communities ◽

Metagenomics Data ◽

Specific Factors ◽

Relative Abundances ◽

True Values

Abstract Measurements of biological communities by marker-gene and metagenomic sequencing are biased: The measured relative abundances of taxa or their genes are systematically distorted from their true values because each step in the experimental workflow preferentially detects some taxa over others. Bias can lead to qualitatively incorrect conclusions and makes measurements from different protocols quantitatively incomparable. A rigorous understanding of bias is therefore essential. Here we propose, test, and apply a simple mathematical model of how bias distorts marker-gene and metagenomics measurements: Bias multiplies the true relative abundances within each sample by taxon- and protocol-specific factors that describe the different efficiencies with which taxa are detected by the workflow. Critically, these factors are consistent across samples with different compositions, allowing bias to be estimated and corrected. We validate this model in 16S rRNA gene and shotgun metagenomics data from bacterial communities with defined compositions. We use it to reason about the effects of bias on downstream statistical analyses, finding that analyses based on taxon ratios are less sensitive to bias than analyses based on taxon proportions. Finally, we demonstrate how this model can be used to quantify bias from samples of defined composition, to partition bias into steps such as DNA extraction and PCR amplification, and to correct biased measurements. Our model improves on previous models by providing a better fit to experimental data and by providing a composition-independent approach to analyzing, measuring, and correcting bias.
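The multiplicative model described in this abstract can be made concrete with a small numerical sketch. The efficiencies and sample compositions below are invented for illustration, and the function names are hypothetical; the sketch only assumes what the abstract states, namely that observed proportions are the true proportions multiplied by per-taxon factors and then renormalized.

```python
def normalize(values):
    """Rescale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def apply_bias(true_props, efficiencies):
    """Observed proportions: true proportions times per-taxon efficiencies."""
    return normalize([p * e for p, e in zip(true_props, efficiencies)])


def correct_bias(observed_props, efficiencies):
    """Invert the model when the efficiencies are known (e.g. from controls)."""
    return normalize([o / e for o, e in zip(observed_props, efficiencies)])


efficiencies = [1.0, 4.0, 0.5]   # hypothetical detection efficiencies
sample_a = [0.2, 0.3, 0.5]       # made-up true compositions
sample_b = [0.6, 0.1, 0.3]

obs_a = apply_bias(sample_a, efficiencies)
obs_b = apply_bias(sample_b, efficiencies)

# Fold error in the ratio taxon1/taxon0: identical in both samples.
ratio_error_a = (obs_a[1] / obs_a[0]) / (sample_a[1] / sample_a[0])
ratio_error_b = (obs_b[1] / obs_b[0]) / (sample_b[1] / sample_b[0])

# Fold error in the proportion of taxon0: depends on the rest of the sample.
prop_error_a = obs_a[0] / sample_a[0]
prop_error_b = obs_b[0] / sample_b[0]

if __name__ == "__main__":
    print(ratio_error_a, ratio_error_b)
    print(prop_error_a, prop_error_b)
```

Under these assumptions the ratio error equals the ratio of the two efficiencies regardless of composition, whereas the proportion error changes from sample to sample, which is the sense in which ratio-based analyses are less sensitive to bias.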

Taxonomic classification method for metagenomics based on core protein families with Core-Kaiju

Nucleic Acids Research ◽

10.1093/nar/gkaa568 ◽

2020 ◽

Vol 48 (16) ◽

pp. e93-e93

Author(s):

Anna Tovo ◽

Peter Menzel ◽

Anders Krogh ◽

Marco Cosentino Lagomarsino ◽

Samir Suweis

Keyword(s):

Core Protein ◽

Marker Gene ◽

Protein Domain ◽

Classification Method ◽

Shotgun Metagenomics ◽

Novel Approach ◽

True Number ◽

Metagenomics Data ◽

Reference Databases ◽

Mock Communities

Abstract Characterizing species diversity and composition of bacteria hosted by biota is revolutionizing our understanding of the role of symbiotic interactions in ecosystems. Determining microbiome diversity implies the assignment of individual reads to taxa by comparison to reference databases. Although computational methods aimed at identifying microbial taxa are available, it is well known that inferences using different methods can vary widely depending on various biases. In this study, we first apply and compare different bioinformatics methods based on 16S ribosomal RNA gene and shotgun sequencing to three mock communities of bacteria, of which the compositions are known. We show that none of these methods can infer both the true number of taxa and their abundances. We thus propose a novel approach, named Core-Kaiju, which combines the power of shotgun metagenomics data with a more focused marker gene classification method similar to 16S, but based on emergent statistics of core protein domain families. We then test the proposed method on various mock communities and show that Core-Kaiju reliably predicts both the number of taxa and their abundances. Finally, we apply our method to human gut samples, showing how Core-Kaiju may give a more accurate ecological characterization and a fresh view on real microbiomes.

Taxonomic classification method for metagenomics based on core protein families with Core-Kaiju

10.1101/2020.01.08.898395 ◽

2020 ◽

Author(s):

Anna Tovo ◽

Peter Menzel ◽

Anders Krogh ◽

Marco Cosentino Lagomarsino ◽

Samir Suweis

Keyword(s):

Core Protein ◽

Marker Gene ◽

Taxonomic Classification ◽

Protein Domain ◽

Classification Method ◽

Shotgun Metagenomics ◽

Novel Approach ◽

True Number ◽

Metagenomics Data ◽

Mock Communities

ABSTRACT Characterizing species diversity and composition of bacteria hosted by biota is revolutionizing our understanding of the role of symbiotic interactions in ecosystems. However, determining microbiome diversity implies the classification of taxa composition within the sampled community, which is often done via the assignment of individual reads to taxa by comparison to reference databases. Although computational methods aimed at identifying microbial taxa are available, it is well known that inferences using different methods can vary widely depending on various biases. In this study, we first apply and compare different bioinformatics methods based on 16S ribosomal RNA gene and whole genome shotgun sequencing for taxonomic classification to three small mock communities of bacteria, of which the compositions are known. We show that none of these methods can infer both the true number of taxa and their abundances. We thus propose a novel approach, named Core-Kaiju, which combines the power of shotgun metagenomics data with a more focused marker gene classification method similar to 16S, but based on emergent statistics of core protein domain families. We then test the proposed method on the three small mock communities and also on medium- and highly complex mock community datasets taken from the Critical Assessment of Metagenome Interpretation challenge. We show that Core-Kaiju reliably predicts both the number of taxa and the abundance of the analysed mock bacterial communities. Finally, we apply our method to human gut samples, showing how Core-Kaiju may give a more accurate ecological characterization and a fresh view on real microbiomes.

DEgenes Hunter - A Flexible R Pipeline for Automated RNA-seq Studies in Organisms without Reference Genome

Genomics and Computational Biology ◽

10.18547/gcb.2017.vol3.iss3.e31 ◽

2017 ◽

Vol 3 (3) ◽

pp. 31 ◽

Cited By ~ 8

Author(s):

Isabel González Gayte ◽

Rocío Bautista Moreno ◽

Pedro Seoane Zonjic ◽

M. Gonzalo Claros

Keyword(s):

Differentially Expressed Genes ◽

Reference Genome ◽

Real Data ◽

Ease Of Use ◽

Differentially Expressed ◽

P Value ◽

RNA-Seq ◽

Functional Interpretation ◽

Differential Gene ◽

Real World Datasets

Differential gene expression analysis based on RNA-seq is widely used. Bioinformatics skills are required since no algorithm is appropriate for all experimental designs. Moreover, when working with organisms without a reference genome, functional analysis is less than straightforward in most situations. DEgenes Hunter, an attempt to automate the process, is based on two independent scripts, one for differential expression and one for functional interpretation. Based on replicates, the R script decides which of the edgeR, DESeq2, NOISeq and limma algorithms are appropriate. It performs quality control calculations and provides the prevalent, most reliable set of differentially expressed genes, and lists all other possible candidates for further functional interpretation. It also provides a combined P-value that allows ranking of differentially expressed genes. It has been tested with synthetic and real-world datasets, showing in both cases ease of use and reliable results. With real data, DEgenes Hunter offers straightforward functional interpretation.

Consistent and correctable bias in metagenomic sequencing experiments

eLife ◽

10.7554/elife.46923 ◽

2019 ◽

Vol 8 ◽

Cited By ~ 42

Author(s):

Michael R McLaren ◽

Amy D Willis ◽

Benjamin J Callahan

Keyword(s):

Experimental Data ◽

Bacterial Communities ◽

Marker Gene ◽

rRNA Gene ◽

Metagenomic Sequencing ◽

Shotgun Metagenomics ◽

Biological Communities ◽

Experimental Bias ◽

Metagenomics Data ◽

Or Gene

Marker-gene and metagenomic sequencing have profoundly expanded our ability to measure biological communities. But the measurements they provide differ from the truth, often dramatically, because these experiments are biased toward detecting some taxa over others. This experimental bias makes the taxon or gene abundances measured by different protocols quantitatively incomparable and can lead to spurious biological conclusions. We propose a mathematical model for how bias distorts community measurements based on the properties of real experiments. We validate this model with 16S rRNA gene and shotgun metagenomics data from defined bacterial communities. Our model better fits the experimental data despite being simpler than previous models. We illustrate how our model can be used to evaluate protocols, to understand the effect of bias on downstream statistical analyses, and to measure and correct bias given suitable calibration controls. These results illuminate new avenues toward truly quantitative and reproducible metagenomics measurements.

PPIT: an R package for inferring microbial taxonomy from nifH sequences

Bioinformatics ◽

10.1093/bioinformatics/btab100 ◽

2021 ◽

Author(s):

Bennett J Kapili ◽

Anne E Dekas

Keyword(s):

Gene Transfer ◽

Horizontal Gene Transfer ◽

Query Sequence ◽

Marker Gene ◽

R Package ◽

Supplementary Information ◽

Marker Genes ◽

Pairwise Identity ◽

Metabolic Marker ◽

Microbial Taxonomy

Abstract Motivation: Linking microbial community members to their ecological functions is a central goal of environmental microbiology. When assigned taxonomy, amplicon sequences of metabolic marker genes can suggest such links, thereby offering an overview of the phylogenetic structure underpinning particular ecosystem functions. However, inferring microbial taxonomy from metabolic marker gene sequences remains a challenge, particularly for the frequently sequenced nitrogen fixation marker gene, nitrogenase reductase (nifH). Horizontal gene transfer in recent nifH evolutionary history can confound taxonomic inferences drawn from the pairwise identity methods used in existing software. Other methods for inferring taxonomy are not standardized and require manual inspection that is difficult to scale. Results: We present Phylogenetic Placement for Inferring Taxonomy (PPIT), an R package that infers microbial taxonomy from nifH amplicons using both phylogenetic and sequence identity approaches. After users place query sequences on a reference nifH gene tree provided by PPIT (n = 6317 full-length nifH sequences), PPIT searches the phylogenetic neighborhood of each query sequence and attempts to infer microbial taxonomy. An inference is drawn only if references in the phylogenetic neighborhood (1) are taxonomically consistent and (2) share sufficient pairwise identity with the query, thereby avoiding erroneous inferences due to known horizontal gene transfer events. We find that PPIT returns a higher proportion of correct taxonomic inferences than BLAST-based approaches at the cost of fewer total inferences. We demonstrate PPIT on deep-sea sediment and find that Deltaproteobacteria are the most abundant potential diazotrophs. Using this dataset we show that emending PPIT inferences based on visual inspection of query sequence placement can achieve taxonomic inferences for nearly all sequences in a query set. We additionally discuss how users can apply PPIT to the analysis of other marker genes. Availability: PPIT is freely available to non-commercial users at https://github.com/bkapili/ppit. Installation includes a vignette that demonstrates package use and reproduces the nifH amplicon analysis discussed here. The raw nifH amplicon sequence data have been deposited in the GenBank, EMBL, and DDBJ databases under BioProject number PRJEB37167. Supplementary information: Supplementary data are available at Bioinformatics online.
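The two-condition decision rule in the Results can be sketched as follows. This is a Python illustration of the logic only, not the PPIT R package or its API: the function name, the `(taxon, identity)` input format, the 0.9 threshold, and the choice to require every neighbor to pass the identity check are all assumptions.

```python
from typing import Optional, Sequence, Tuple


def infer_taxon(
    neighbors: Sequence[Tuple[str, float]],
    min_identity: float = 0.9,  # hypothetical cutoff, not a PPIT default
) -> Optional[str]:
    """Return a taxon only when the phylogenetic neighborhood supports it.

    `neighbors` lists (taxon, pairwise identity to the query) for the reference
    sequences near the query's placement on the tree.
    """
    if not neighbors:
        return None
    taxa = {taxon for taxon, _ in neighbors}
    # Condition 1: the neighborhood must be taxonomically consistent.
    if len(taxa) != 1:
        return None
    # Condition 2: the references must be similar enough to the query.
    if any(identity < min_identity for _, identity in neighbors):
        return None
    return next(iter(taxa))
```

Returning `None` rather than a best guess mirrors the trade-off the abstract reports: fewer total inferences in exchange for a higher proportion of correct ones.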

Identity and compatibility of reference genome resources

10.1101/2021.03.15.435425 ◽

2021 ◽

Author(s):

Michał Stolarczyk ◽

Bingjie Xue ◽

Nathan C. Sheffield

Keyword(s):

Coordinate System ◽

Genome Analysis ◽

Reference Genome ◽

Reference Data ◽

Coordinate Systems ◽

Link Type ◽

Novel Approach ◽

Many Sources ◽

Parent Child Relationships ◽

Parent Child

Genome analysis relies on reference data like sequences, feature annotations, and aligner indexes. These data can be found in many versions from many sources, making it challenging to identify and assess compatibility among them. For example, how can you determine which indexes are derived from identical raw sequence files, or which annotations share a compatible coordinate system? Here, we describe a novel approach to establish identity and compatibility of reference genome resources. We approach this with three advances: First, we derive unique identifiers for each resource; second, we record parent-child relationships among resources; and third, we describe recursive identifiers that determine identity as well as compatibility of coordinate systems and sequence names. These advances facilitate portability, reproducibility, and re-use of genome reference data. Availability: https://refgenie.databio.org
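A rough sketch of how content-derived, recursive identifiers could separate "same sequences" from "same names" is given below. It is not refgenie's actual algorithm: the use of SHA-256, the dictionary layout, and every function name are assumptions chosen only to illustrate the three ideas in the abstract.

```python
import hashlib


def digest(text: str) -> str:
    """Content-derived identifier (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def collection_identifiers(named_seqs: dict) -> dict:
    """Build identifiers for a set of named sequences, level by level."""
    # Level 1: one digest per sequence, independent of its name and case.
    seq_digests = sorted(digest(seq.upper()) for seq in named_seqs.values())
    content_id = digest(",".join(seq_digests))
    names_id = digest(",".join(sorted(named_seqs)))
    # Level 2: the top-level identifier is a digest of the lower-level digests.
    return {
        "content": content_id,
        "names": names_id,
        "top": digest(content_id + ":" + names_id),
    }


def derived_identifier(parent_id: str, recipe: str) -> str:
    """Identifier for a child resource (e.g. an index) tied to its parent."""
    return digest(parent_id + ":" + recipe)


if __name__ == "__main__":
    ucsc_style = collection_identifiers({"chr1": "ACGT", "chr2": "GGCC"})
    plain_style = collection_identifiers({"1": "acgt", "2": "ggcc"})
    print(ucsc_style["content"] == plain_style["content"])  # same sequences
    print(ucsc_style["names"] == plain_style["names"])      # different names
```

Comparing the lower-level digests shows in what respect two resources agree, while the top-level digest answers whether they are identical overall.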

Metagenomic Sequences of Three Drinking Water and Two Shower Hose Biofilm Samples Treated with or without Copper-Silver Ionization

Microbiology Resource Announcements ◽

10.1128/mra.01220-19 ◽

2020 ◽

Vol 9 (3) ◽

Author(s):

Anke Stüken ◽

Thomas H. A. Haverkamp

Keyword(s):

Drinking Water ◽

Microbial Communities ◽

Tap Water ◽

Data Sets ◽

Sequence Analyses ◽

Illumina HiSeq ◽

Shotgun Metagenomics ◽

Premise Plumbing ◽

Plumbing Systems ◽

Metagenomics Data

We announce five shotgun metagenomics data sets from two Norwegian premise plumbing systems. The samples were shotgun sequenced on two lanes of an Illumina HiSeq 3000 instrument (THRUplex chemistry, 151 bp, paired-end reads), providing an extensive resource for sequence analyses of tap water and biofilm microbial communities.

RefKA: A fast and efficient long-read genome assembly approach for large and complex genomes

10.1101/2020.04.17.035287 ◽

2020 ◽

Author(s):

Yuxuan Yuan ◽

Philipp E. Bayer ◽

Robyn Anderson ◽

HueyTyng Lee ◽

Chon-Kit Kenneth Chan ◽

...

Keyword(s):

Genome Assembly ◽

Chinese Spring ◽

Complete Genome ◽

Reference Genome ◽

Computing Time ◽

Link Type ◽

Recent Advances ◽

Long Read ◽

Genome Assemblies

Abstract Recent advances in long-read sequencing have the potential to produce more complete genome assemblies using sequence reads which can span repetitive regions. However, overlap-based assembly methods routinely used for these data require significant computing time and resources. Here, we have developed RefKA, a reference-based approach for long-read genome assembly. This approach relies on breaking up a closely related reference genome into bins, aligning k-mers unique to each bin with PacBio reads, and then assembling each bin in parallel followed by a final bin-stitching step. During benchmarking, we assembled the wheat Chinese Spring (CS) genome using publicly available PacBio reads in parallel in 168 wall hours on a 250 CPU system. The maximum RAM used was 300 Gb and the computing time was 42,000 CPU hours. The approach opens applications for the assembly of other large and complex genomes with much-reduced computing requirements. The RefKA pipeline is available at https://github.com/AppliedBioinformatics/RefKA
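The binning idea can be sketched on toy data: split a reference into bins, keep the k-mers that occur in exactly one bin, and assign each read to the bin whose unique k-mers it shares most. This is an illustration under simplifying assumptions (fixed-size bins, exact k-mer matches, no handling of bin boundaries or sequencing errors), not the RefKA pipeline; all names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers_per_bin(reference, bin_size, k):
    """For each fixed-size bin, the k-mers found in that bin and no other."""
    bins = [reference[i:i + bin_size] for i in range(0, len(reference), bin_size)]
    bin_kmers = [kmers(b, k) for b in bins]
    # Count in how many bins each k-mer appears.
    counts = Counter(km for ks in bin_kmers for km in ks)
    return [{km for km in ks if counts[km] == 1} for ks in bin_kmers]


def assign_read(read, unique_sets, k):
    """Index of the bin sharing the most unique k-mers with the read, or None."""
    read_kmers = kmers(read, k)
    votes = [len(read_kmers & unique) for unique in unique_sets]
    best = max(range(len(votes)), key=votes.__getitem__)
    return best if votes[best] > 0 else None
```

Once reads are partitioned this way, each bin can be assembled independently, which is what allows the parallel execution reported above; the stated 42,000 CPU hours matches 168 wall hours multiplied by 250 CPUs.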

Whisper: Read sorting allows robust mapping of sequencing data

10.1101/240358 ◽

2017 ◽

Author(s):

Sebastian Deorowicz ◽

Agnieszka Debudaj-Grabysz ◽

Adam Gudyś ◽

Szymon Grabowski

Keyword(s):

Reference Genome ◽

Variant Calling ◽

Real Data ◽

Supplementary Information ◽

Sequencing Data ◽

Suffix Arrays ◽

Link Type ◽

Mapping Tool ◽

Reverse Complement ◽

Comparable Accuracy

Abstract Motivation: Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. Mistakes made at this computationally challenging stage cannot be recovered easily. Results: We present Whisper, an accurate and high-performance mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism, as well as storing temporary data on disk, results in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known Bowtie2 and BWA-MEM tools at a comparable accuracy (validated in a variant calling pipeline). Availability: Whisper is available for free from https://github.com/refresh-bio/Whisper or http://sun.aei.polsl.pl/REFRESH/Whisper/ Contact: [email protected] Supplementary information: Supplementary data are available at the publisher's Web site.
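The core idea of sorting reads and looking them up in suffix arrays of the reference and its reverse complement can be sketched for exact matches only. This toy version uses a naive suffix-array construction and ignores mismatches, indels, parallelism, and disk storage, so it should be read as an illustration of the data structure rather than of Whisper's algorithm; all function names are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def build_suffix_array(text):
    """Naive construction: start positions of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, suffix_array, pattern):
    """All start positions where pattern occurs in text, via binary search."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # leftmost suffix whose m-length prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(suffix_array) and text[suffix_array[lo]:suffix_array[lo] + m] == pattern:
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


def map_reads(reference, reads):
    """Map each distinct read to (forward-strand position, strand) hits."""
    rc = reverse_complement(reference)
    sa_fwd = build_suffix_array(reference)
    sa_rc = build_suffix_array(rc)
    results = {}
    # Sorted reads are looked up in lexicographic order, the same order as the
    # suffix array; here the sort is only illustrative.
    for read in sorted(set(reads)):
        fwd = [(pos, "+") for pos in find_exact(reference, sa_fwd, read)]
        # Convert reverse-complement coordinates back to the forward strand.
        rev = [(len(reference) - pos - len(read), "-")
               for pos in find_exact(rc, sa_rc, read)]
        results[read] = sorted(fwd + rev)
    return results
```

Because both the reads and the suffixes are in lexicographic order, consecutive lookups land in neighboring regions of the array, which is the property a production mapper can exploit for speed.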
