Accurate Reconstruction of Microbial Strains from Metagenomic Sequencing Using Representative Reference Genomes

Abstract. Exploring the genetic diversity of microbes within the environment through metagenomic sequencing first requires classifying the sequencing reads into taxonomic groups. Current methods compare these sequencing data with existing reference databases, which are biased and limited. Several recent evaluation studies demonstrate that current methods either lack sufficient sensitivity for species-level assignments or suffer from false positives, overestimating the number of species in the metagenome. Both are especially problematic for the identification of low-abundance microbial species, e.g., detecting pathogens in ancient metagenomic samples. We present a new method, SPARSE, which improves taxonomic assignments of metagenomic reads. SPARSE balances existing biased reference databases by grouping reference genomes into similarity-based hierarchical clusters, implemented as an efficient incremental data structure. SPARSE assigns reads to these clusters using a probabilistic model, which specifically penalizes non-specific mappings of reads from unknown sources and hence reduces false-positive assignments. Our evaluation on simulated datasets from two recent evaluation studies demonstrated the improved precision of SPARSE in comparison to other methods for species-level classification. In a third simulation, our method successfully differentiated multiple co-existing Escherichia coli strains from the same sample. In real archaeological datasets, SPARSE identified ancient pathogens with ≤ 0.02% abundance, consistent with published findings that required additional sequencing data. In these datasets, other methods either missed targeted pathogens or reported non-existent ones. SPARSE and all evaluation scripts are available at https://github.com/zheminzhou/SPARSE.
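
The idea of grouping reference genomes into similarity-based hierarchical clusters that can be updated incrementally can be illustrated with a minimal sketch. This is not the SPARSE implementation: the k-mer Jaccard similarity, the two thresholds, and the names `kmers`, `jaccard`, `Node` and `insert_genome` are all assumptions chosen for illustration. Each new genome descends the hierarchy, joining the first cluster whose representative is similar enough at that level, or founding a new cluster otherwise.

```python
def kmers(seq, k=4):
    """Return the set of k-mers in a sequence (hypothetical similarity basis)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class Node:
    """One cluster, represented by the first genome assigned to it."""

    def __init__(self, name, ks):
        self.name = name
        self.ks = ks
        self.children = []  # finer-grained clusters nested in this one


def insert_genome(roots, name, seq, thresholds=(0.5, 0.9)):
    """Add one genome without re-clustering the existing ones.

    Thresholds run from coarse to fine and are illustrative values only.
    Returns the representative name of the cluster joined at each level.
    """
    ks = kmers(seq)
    path = []
    siblings = roots
    for thr in thresholds:
        match = next((n for n in siblings if jaccard(ks, n.ks) >= thr), None)
        if match is None:
            match = Node(name, ks)  # found a new cluster at this level
            siblings.append(match)
        path.append(match.name)
        siblings = match.children
    return tuple(path)


roots = []
p1 = insert_genome(roots, "g1", "ACGTACGTAC")
p2 = insert_genome(roots, "g2", "ACGTACGTAC")  # identical to g1
p3 = insert_genome(roots, "g3", "TTTTTTTTTT")  # unrelated
p4 = insert_genome(roots, "g4", "ACGTACGTTT")  # related to g1, but not closely
```

In this toy run, `g4` shares the coarse cluster of `g1` but founds its own fine cluster, so an over-represented group of near-identical genomes collapses into a single representative at the finer level.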

Identification of core and rare species in metagenome samples based on shotgun metagenomic sequencing, Fourier transforms and spectral comparisons

ISME Communications ◽

10.1038/s43705-021-00010-6 ◽

2021 ◽

Vol 1 (1) ◽

Author(s):

Marie-Madlen Pust ◽

Burkhard Tümmler

Keyword(s):

Microbial Communities ◽

Rare Species ◽

Fourier Transforms ◽

Species Level ◽

Reference Frequency ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Discrete Fourier Transforms ◽

False Discovery ◽

Shotgun Metagenomic Sequencing

Abstract. In shotgun metagenomic sequencing applications, low signal-to-noise ratios may complicate species-level differentiation of genetically similar core species and impede high-confidence detection of rare species. However, core and rare species can take pivotal roles in their habitats and should hence be studied as one entity to gain insights into the total potential of microbial communities in terms of taxonomy and functionality. Here, we offer a solution towards increased species-level specificity and decreased false discovery and omission rates of core and rare species in complex metagenomic samples by introducing the rare species identifier (raspir) tool. The Python software is based on discrete Fourier transforms and spectral comparisons of biological and reference frequency signals obtained from real and ideal distributions of short DNA reads mapping towards circular reference genomes. Simulation-based testing of raspir enabled the detection of rare species with genome coverages of less than 0.2%. Species-level differentiation of rare Escherichia coli and Shigella spp., as well as the clear delineation between human Streptococcus spp., was feasible with low false discovery (1.3%) and omission rates (13%). Publicly available human placenta sequencing data were reanalysed with raspir. Raspir was unable to identify placental microbial communities, reinforcing the sterile womb paradigm.
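
The comparison of a real and an ideal read distribution in the frequency domain can be sketched as follows. This is a simplified illustration, not raspir's algorithm: the binning scheme, the Euclidean distance between magnitude spectra, and the function names `bin_counts`, `magnitude_spectrum` and `spectral_distance` are assumptions. The intuition is that reads spread evenly around a circular genome produce a spectrum concentrated at frequency zero, whereas reads piled up in one region (as cross-mapping from a related species might produce) spread energy across the other frequencies.

```python
import cmath
import math


def bin_counts(positions, genome_length, n_bins):
    """Count read start positions per bin along a circular genome."""
    counts = [0] * n_bins
    for p in positions:
        counts[(p % genome_length) * n_bins // genome_length] += 1
    return counts


def magnitude_spectrum(counts):
    """Magnitudes of the discrete Fourier transform of a normalised signal."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no reads to transform")
    n = len(counts)
    signal = [c / total for c in counts]
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def spectral_distance(observed_counts, ideal_counts):
    """Euclidean distance between two magnitude spectra (an assumed metric)."""
    obs = magnitude_spectrum(observed_counts)
    ideal = magnitude_spectrum(ideal_counts)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(obs, ideal)))


genome_length, n_bins = 1000, 10
ideal = [1] * n_bins  # reads spread evenly over the genome
even = bin_counts(range(0, 1000, 10), genome_length, n_bins)
clumped = bin_counts(range(0, 100), genome_length, n_bins)

d_even = spectral_distance(even, ideal)        # close to 0
d_clumped = spectral_distance(clumped, ideal)  # clearly larger
```

A naive O(n²) transform is used here only to keep the sketch dependency-free; a fast Fourier transform would be the usual choice for real data.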

Improved microbial community characterization of 16S rRNA via metagenome hybridization capture enrichment

10.1101/2020.12.18.423101 ◽

2020 ◽

Author(s):

Megan Sarah Beaudry ◽

Jincheng Wang ◽

Troy Kieran ◽

Jesse Thomas ◽

Natalia Juliana Bayona-Vasquez ◽

...

Keyword(s):

16S rRNA ◽

Sequence Similarity ◽

rRNA Gene ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

16S rRNA Sequence ◽

Shotgun Metagenomics ◽

Reference Databases ◽

rRNA Sequences ◽

16S rRNA Sequences

Environmental microbial diversity is often investigated from a molecular perspective using 16S ribosomal RNA (rRNA) gene amplicons and shotgun metagenomics. While amplicon methods are fast, low-cost, and have curated reference databases, they can suffer from amplification bias and are limited in genomic scope. In contrast, shotgun metagenomic methods sample more genomic regions with fewer sequence acquisition biases. However, shotgun metagenomic sequencing is much more expensive (even with moderate sequencing depth) and computationally challenging. Here, we develop a set of 16S rRNA sequence capture baits that offer a potential middle ground with the advantages from both approaches for investigating microbial communities. These baits cover the diversity of all 16S rRNA sequences available in the Greengenes (v. 13.5) database, with no sequence having < 80% sequence similarity to at least one bait for all segments of 16S. The use of our baits provides comparable results to 16S amplicon libraries and shotgun metagenomic libraries when assigning taxonomic units from 16S sequences within the metagenomic reads. We demonstrate that 16S rRNA capture baits can be used on a range of microbial samples (i.e., mock communities and rodent fecal samples) to increase the proportion of 16S rRNA sequences (average >400-fold) and decrease analysis time to obtain consistent community assessments. Furthermore, our study reveals that the bioinformatic methods used to analyze sequencing data may have a greater influence on estimates of community composition than the library preparation method used, likely due in part to the extent and curation of the reference databases considered.

Accurate Reconstruction of Microbial Strains from Metagenomic Sequencing Using Representative Reference Genomes

Lecture Notes in Computer Science - Research in Computational Molecular Biology ◽

10.1007/978-3-319-89929-9_15 ◽

2018 ◽

pp. 225-240 ◽

Cited By ~ 6

Author(s):

Zhemin Zhou ◽

Nina Luhmann ◽

Nabil-Fareed Alikhan ◽

Christopher Quince ◽

Mark Achtman

Keyword(s):

Metagenomic Sequencing ◽

Microbial Strains ◽

Reference Genomes ◽

Accurate Reconstruction

Biogeography of Heterotrophic Flagellate Populations Indicates the Presence of Generalist and Specialist Taxa in the Arctic Ocean

Applied and Environmental Microbiology ◽

10.1128/aem.02737-14 ◽

2015 ◽

Vol 81 (6) ◽

pp. 2137-2148 ◽

Cited By ~ 17

Author(s):

Mary Thaler ◽

Connie Lovejoy

Keyword(s):

Arctic Ocean ◽

Water Column ◽

High Throughput Sequencing ◽

Species Level ◽

The Arctic ◽

Heterotrophic Flagellate ◽

Sequencing Data ◽

Content Type ◽

The Arctic Ocean ◽

Taxonomic Groups

Abstract. Heterotrophic marine flagellates (HF) are ubiquitous in the world's oceans and represented in nearly all branches of the domain Eukaryota. However, the factors determining distributions of major taxonomic groups are poorly known. The Arctic Ocean is a good model environment for examining the distribution of functionally similar but phylogenetically diverse HF because the physical oceanography and annual ice cycles result in distinct environments that could select for microbial communities or favor specific taxa. We reanalyzed new and previously published high-throughput sequencing data from multiple studies in the Arctic Ocean to identify broad patterns in the distribution of individual taxa. HF accounted for fewer than 2% to over one-half of the reads from the water column and for up to 60% of reads from ice, which was dominated by Cryothecomonas. In the water column, many HF phylotypes belonging to Telonemia and Picozoa, uncultured marine stramenopiles (MAST), and choanoflagellates were geographically widely distributed. However, for two groups in particular, Telonemia and Cryothecomonas, some species-level taxa showed more restricted distributions. For example, several phylotypes of Telonemia favored open waters with lower nutrients such as the Canada Basin and offshore of the Mackenzie Shelf. In summary, we found that while some Arctic HF were successful over a range of conditions, others could be specialists that occur under particular conditions. We conclude that tracking species-level diversity in HF not only is feasible but also provides a potential tool for understanding the responses of marine microbial ecosystems to rapidly changing ice regimes.

Profiling microbial strains in urban environments using metagenomic sequencing data

Biology Direct ◽

10.1186/s13062-018-0211-z ◽

2018 ◽

Vol 13 (1) ◽

Cited By ~ 12

Author(s):

Moreno Zolfo ◽

Francesco Asnicar ◽

Paolo Manghi ◽

Edoardo Pasolli ◽

Adrian Tett ◽

...

Keyword(s):

Urban Environments ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Microbial Strains

PStrain: an iterative microbial strains profiling algorithm for shotgun metagenomic sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa1056 ◽

2020 ◽

Author(s):

Shuai Wang ◽

Yiqi Jiang ◽

Shuaicheng Li

Keyword(s):

Optimization Method ◽

Supplementary Information ◽

Marker Genes ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Shotgun Metagenomic Sequencing ◽

Genotype Frequencies ◽

Microbial Strains ◽

And Control ◽

First Time

Abstract. Motivation: The microbial community plays an essential role in human diseases and physiological activities. The functions of microbes can differ due to strain-level differences in the genome sequences. Shotgun metagenomic sequencing makes it practical to profile the strains in microbial communities. However, current methods are underdeveloped because sequences are highly similar among strains. We observe that strain genotypes at the same single nucleotide variant (SNV) locus can be inferred from the genotype frequencies. Also, variants at different loci covered by the same reads can provide evidence that they reside on the same strain. Results: These insights inspire us to design PStrain, an optimization method that utilizes genotype frequencies and the reads which cover multiple SNV loci to profile strains iteratively based on SNVs in a set of MetaPhlAn2 marker genes. Compared to the state-of-the-art methods, PStrain, on average, improved the performance of inferring strain abundances and genotypes by 87.75% and 59.45%, respectively. We have applied the PStrain package to a dataset with two cohorts of colorectal cancer (CRC) and found that the sequences of Bacteroides coprocola strains are significantly different between CRC and control samples, which is the first report of the potential role of B. coprocola in the gut microbiota of CRC. Availability and implementation: https://github.com/wshuai294/PStrain. Supplementary information: Supplementary data are available at Bioinformatics online.
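
The two kinds of evidence described above, per-locus genotype frequencies and alleles observed together on the same read, can be tallied with a short sketch. This is only the counting step under assumed inputs, not PStrain's optimization method: the read representation (a dict from SNV position to observed allele) and the names `allele_frequencies` and `linked_allele_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(reads):
    """Per-locus allele frequencies from reads given as {locus: allele} dicts."""
    counts = {}
    for read in reads:
        for locus, allele in read.items():
            counts.setdefault(locus, Counter())[allele] += 1
    return {
        locus: {a: n / sum(c.values()) for a, n in c.items()}
        for locus, c in counts.items()
    }


def linked_allele_counts(reads):
    """Count how often two alleles at different loci occur on the same read."""
    links = Counter()
    for read in reads:
        for (l1, a1), (l2, a2) in combinations(sorted(read.items()), 2):
            links[(l1, a1, l2, a2)] += 1
    return links


# Toy input: each read lists the alleles it shows at the SNV loci it covers.
reads = [
    {100: "A", 150: "G"},
    {100: "A", 150: "G"},
    {100: "T", 150: "C"},
    {100: "A"},
]
freqs = allele_frequencies(reads)
links = linked_allele_counts(reads)
```

Here allele A at locus 100 and allele G at locus 150 both appear at majority frequency and are seen together on two reads, while the combination A with C is never observed, which is the sort of signal that could support placing A and G on the same strain.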

Characterizing and Evaluating the Zoonotic Potential of Novel Viruses Discovered in Vampire Bats

Viruses ◽

10.3390/v13020252 ◽

2021 ◽

Vol 13 (2) ◽

pp. 252

Author(s):

Laura M. Bergner ◽

Nardus Mollentze ◽

Richard J. Orton ◽

Carlos Tello ◽

Alice Broos ◽

...

Keyword(s):

Machine Learning ◽

Phylogenetic Analyses ◽

Human Infection ◽

Machine Learning Algorithms ◽

Zoonotic Potential ◽

Metagenomic Sequencing ◽

Learning Models ◽

Sequencing Data ◽

Vampire Bats ◽

Machine Learning Models

The contemporary surge in metagenomic sequencing has transformed knowledge of viral diversity in wildlife. However, evaluating which newly discovered viruses pose sufficient risk of infecting humans to merit detailed laboratory characterization and surveillance remains largely speculative. Machine learning algorithms have been developed to address this imbalance by ranking the relative likelihood of human infection based on viral genome sequences, but are not yet routinely applied to viruses at the time of their discovery. Here, we characterized viral genomes detected through metagenomic sequencing of feces and saliva from common vampire bats (Desmodus rotundus) and used these data as a case study in evaluating zoonotic potential using molecular sequencing data. Of 58 detected viral families, including 17 which infect mammals, the only known zoonosis detected was rabies virus; however, additional genomes were detected from the families Hepeviridae, Coronaviridae, Reoviridae, Astroviridae and Picornaviridae, all of which contain human-infecting species. In phylogenetic analyses, novel vampire bat viruses most frequently grouped with other bat viruses that are not currently known to infect humans. In agreement, machine learning models built from only phylogenetic information ranked all novel viruses similarly, yielding little insight into zoonotic potential. In contrast, genome composition-based machine learning models estimated different levels of zoonotic potential, even for closely related viruses, categorizing one out of four detected hepeviruses and two out of three picornaviruses as having high priority for further research. We highlight the value of evaluating zoonotic potential beyond ad hoc consideration of phylogeny and provide surveillance recommendations for novel viruses in a wildlife host which has frequent contact with humans and domestic animals.

METAMVGL: a multi-view graph-based metagenomic contig binning algorithm by integrating assembly and paired-end graphs

BMC Bioinformatics ◽

10.1186/s12859-021-04284-4 ◽

2021 ◽

Vol 22 (S10) ◽

Author(s):

Zhenmiao Zhang ◽

Lu Zhang

Keyword(s):

De Novo ◽

Label Propagation ◽

Next Generation Sequencing Data ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Fecal Samples ◽

Microbial Genomes ◽

Metagenome Assembly ◽

High Chance ◽

Mock Communities

Abstract. Background: Due to the complexity of microbial communities, de novo assembly on next generation sequencing data is commonly unable to produce complete microbial genomes. Metagenome assembly binning thus becomes an essential step that groups the fragmented contigs into clusters representing microbial genomes, based on the contigs' nucleotide compositions and read depths. These features work well on long contigs, but are not stable for short ones. Contigs can be linked by sequence overlap (assembly graph) or by the paired-end reads aligned to them (PE graph), where linked contigs have a high chance of being derived from the same cluster. Results: We developed METAMVGL, a multi-view graph-based metagenomic contig binning algorithm that integrates both assembly and PE graphs. It could strikingly rescue the short contigs and correct the binning errors from dead ends. METAMVGL learns the two graphs' weights automatically and predicts the contig labels in a uniform multi-view label propagation framework. In experiments, we observed that METAMVGL made use of significantly more high-confidence edges from the combined graph and linked dead ends to the main graph. It also outperformed many state-of-the-art contig binning algorithms, including MaxBin2, MetaBAT2, MyCC, CONCOCT, SolidBin and GraphBin, on the metagenomic sequencing data from simulation, two mock communities and Sharon infant fecal samples. Conclusions: Our findings demonstrate that METAMVGL markedly improves short contig binning and outperforms the other existing contig binning tools on the metagenomic sequencing data from simulation, mock communities and infant fecal samples.
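
Label propagation over two combined contig graphs can be sketched in a few lines. This is a deliberately simplified illustration, not METAMVGL: the view weights are fixed by hand here, whereas the abstract states that METAMVGL learns them automatically, and the function name `propagate_labels`, the weighted-vote rule and the toy contig names are assumptions.

```python
from collections import defaultdict


def propagate_labels(edges_by_view, view_weights, seed_labels, max_rounds=10):
    """Spread bin labels from labelled contigs to unlabelled neighbours.

    edges_by_view: {view_name: [(contig_u, contig_v), ...]}
    view_weights:  {view_name: weight}, fixed here for illustration
    seed_labels:   {contig: bin_label} for contigs already binned
    """
    # Merge the views into one weighted, undirected adjacency map.
    adjacency = defaultdict(dict)
    for view, edges in edges_by_view.items():
        w = view_weights[view]
        for u, v in edges:
            adjacency[u][v] = adjacency[u].get(v, 0.0) + w
            adjacency[v][u] = adjacency[v].get(u, 0.0) + w

    labels = dict(seed_labels)
    for _ in range(max_rounds):
        updates = {}
        for node in adjacency:
            if node in labels:
                continue
            votes = defaultdict(float)
            for nbr, w in adjacency[node].items():
                if nbr in labels:
                    votes[labels[nbr]] += w
            if votes:
                # Heaviest label wins; ties go to the alphabetically first label.
                updates[node] = max(sorted(votes), key=votes.get)
        if not updates:
            break
        labels.update(updates)  # synchronous update after each round
    return labels


edges_by_view = {
    "assembly": [("c1", "c2"), ("c2", "c3")],
    "paired_end": [("c3", "c4"), ("c5", "c4")],
}
view_weights = {"assembly": 1.0, "paired_end": 0.5}
bins = propagate_labels(edges_by_view, view_weights, {"c1": "binA", "c5": "binB"})
```

In this toy graph, contig `c3` has neighbours in both bins, and the more heavily weighted assembly edge decides its label.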

Reference flow: reducing reference bias using multiple population genomes

Genome Biology ◽

10.1186/s13059-020-02229-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Nae-Chyun Chen ◽

Brad Solomon ◽

Taher Mun ◽

Sheila Iyer ◽

Ben Langmead

Keyword(s):

Genetic Variation ◽

Reference Genome ◽

Alignment Method ◽

Sequencing Data ◽

Computational Overhead ◽

Reference Flow ◽

Multiple Population ◽

Reference Bias ◽

Flow Alignment ◽

Reference Genomes

Abstract. Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.

SCAPP: an algorithm for improved plasmid assembly in metagenomes

Microbiome ◽

10.1186/s40168-021-01068-z ◽

2021 ◽

Vol 9 (1) ◽

Author(s):

David Pellow ◽

Alvah Zorea ◽

Maraike Probst ◽

Ori Furman ◽

Arik Segal ◽

...

Keyword(s):

Bacterial Species ◽

Bacterial Genome ◽

Biological Knowledge ◽

Assessment Procedure ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Human Gut ◽

Double Stranded DNA ◽

Wide Range ◽

Python Package

Abstract. Background: Metagenomic sequencing has led to the identification and assembly of many new bacterial genome sequences. These bacteria often contain plasmids: usually small, circular double-stranded DNA molecules that may transfer across bacterial species and confer antibiotic resistance. These plasmids are generally less studied and understood than their bacterial hosts. Part of the reason is a lack of computational tools for analyzing plasmids in metagenomic samples. Results: We developed SCAPP (Sequence Contents-Aware Plasmid Peeler), an algorithm and tool to assemble plasmid sequences from metagenomic sequencing. SCAPP builds on some key ideas from the Recycler algorithm while improving plasmid assemblies by integrating biological knowledge about plasmids. We compared the performance of SCAPP to Recycler and metaplasmidSPAdes on simulated metagenomes, real human gut microbiome samples, and a human gut plasmidome dataset that we generated. We also created plasmidome and metagenome data from the same cow rumen sample and used the parallel sequencing data to create a novel assessment procedure. Overall, SCAPP outperformed Recycler and metaplasmidSPAdes across this wide range of datasets. Conclusions: SCAPP is an easy-to-use Python package that enables the assembly of full plasmid sequences from metagenomic samples. It outperformed existing metagenomic plasmid assemblers in most cases and assembled novel and clinically relevant plasmids in samples we generated, such as a human gut plasmidome. SCAPP is open-source software available from: https://github.com/Shamir-Lab/SCAPP.
