Characterization and Simulation of Metagenomic Nanopore Sequencing Data with Meta-NanoSim

Author(s):  
Chen Yang ◽  
Theodora Lo ◽  
Ka Ming Nip ◽  
Saber Hafezqorani ◽  
René L Warren ◽  
...  

Abstract Background: Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, sequencing platform-specific challenges, including high base-call error rate, non-uniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical tools, such as microbial abundance estimation and metagenome assembly algorithms. When developing and testing bioinformatics tools and pipelines, the use of simulated datasets with characteristics that are true to the sequencing platform under evaluation is a cost-effective way to provide a ground truth and assess the performance in a controlled environment. Results: Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. It improves upon state-of-the-art methods for microbial abundance estimation through a base-level quantification algorithm. Meta-NanoSim can simulate complex microbial communities composed of both linear and circular genomes, and can stream reference genomes directly from online servers. Simulated datasets showed high congruence with experimental data in terms of read length, error profiles, and abundance levels. We demonstrate that Meta-NanoSim simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenome assembly benchmarking task. Conclusions: The Meta-NanoSim characterization module investigates read features including chimeric information and abundance levels, while the simulation module simulates large and complex multi-sample microbial communities with different abundance profiles. All trained models and the software are freely accessible on GitHub at https://github.com/bcgsc/NanoSim.
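To illustrate why base-level quantification can differ from read counting on reads of non-uniform length, here is a minimal Python sketch. It is not Meta-NanoSim's actual algorithm, which the abstract does not specify; the function name `relative_abundance`, its `by` parameter, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def relative_abundance(alignments, by="bases"):
    """Return per-genome relative abundance from (genome_id, aligned_bases) pairs.

    by="reads" counts each alignment once; by="bases" weights it by aligned length.
    This is an illustrative simplification, not the published method.
    """
    if by not in ("reads", "bases"):
        raise ValueError("by must be 'reads' or 'bases'")
    totals = defaultdict(int)
    for genome_id, aligned_bases in alignments:
        totals[genome_id] += aligned_bases if by == "bases" else 1
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {g: n / grand_total for g, n in totals.items()}


# Hypothetical toy data: two short reads from genome A, one long read from genome B.
toy = [("A", 1000), ("A", 1000), ("B", 6000)]
read_level = relative_abundance(toy, by="reads")
base_level = relative_abundance(toy, by="bases")
```

On this toy input, read counting favours genome A (two of three reads), whereas base counting favours genome B (6,000 of 8,000 aligned bases), which is the kind of discrepancy that long, variable-length reads can introduce.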

2020 ◽  
Author(s):  
Yoonjee Kang ◽  
Denis Thieffry ◽  
Laura Cantini

Abstract Networks are powerful tools to represent and investigate biological systems. The development of algorithms inferring regulatory interactions from functional genomics data has been an active area of research. With the advent of single-cell RNA-seq (scRNA-seq) data, numerous methods specifically designed to take advantage of single-cell datasets have been proposed. However, published benchmarks on single-cell network inference are mostly based on simulated data. Once applied to real data, these benchmarks take into account only a small set of genes and only compare the inferred networks with an imposed ground truth. Here, we benchmark four single-cell network inference methods based on their reproducibility, i.e. their ability to infer similar networks when applied to two independent datasets for the same biological condition. We tested each of these methods on real data from three biological conditions: human retina, T-cells in colorectal cancer, and human hematopoiesis. GENIE3 proved to be the most reproducible algorithm, independently of the single-cell sequencing platform, the cell type annotation system, the number of cells constituting the dataset, or the thresholding applied to the links of the inferred networks. To ensure the reproducibility of this benchmark study and ease its extension, we implemented all the analyses in scNET, a Jupyter notebook available at https://github.com/ComputationalSystemsBiology/scNET.
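The notion of reproducibility described above (similar networks inferred from two independent datasets, after thresholding the links) can be sketched in Python as follows. The abstract does not state which similarity measure the study uses; the Jaccard index over the top-k edges, the function names `top_edges` and `reproducibility`, and the toy edge scores are assumptions chosen for illustration.

```python
def top_edges(scores, k):
    """Return the set of the k highest-weight edges.

    scores maps (regulator, target) pairs to an inferred link weight.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def reproducibility(scores_a, scores_b, k):
    """Jaccard index between the top-k edge sets of two inferred networks."""
    a, b = top_edges(scores_a, k), top_edges(scores_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical link weights inferred from two independent datasets.
net_1 = {("g1", "g2"): 0.9, ("g1", "g3"): 0.5, ("g2", "g3"): 0.1}
net_2 = {("g1", "g2"): 0.8, ("g2", "g3"): 0.7, ("g1", "g3"): 0.2}
score = reproducibility(net_1, net_2, k=2)
```

Here the two top-2 edge sets share one edge out of three distinct edges, so the score is 1/3. Varying `k` corresponds to varying the threshold applied to the links.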


2017 ◽  
Author(s):  
Tslil Gabrieli ◽  
Hila Sharim ◽  
Yael Michaeli ◽  
Yuval Ebenstein

Abstract Variations in the genetic code, from single point mutations to large structural or copy number alterations, influence susceptibility, onset, and progression of genetic diseases and tumor transformation. Next-generation sequencing analysis is unable to reliably capture aberrations larger than the typical sequencing read length of several hundred bases. Long-read, single-molecule sequencing methods such as SMRT and nanopore sequencing can address larger variations, but require costly whole genome analysis. Here we describe a method for isolation and enrichment of a large genomic region of interest for targeted analysis based on Cas9 excision of two sites flanking the target region and isolation of the excised DNA segment by pulsed field gel electrophoresis. The isolated target remains intact and is ideally suited for optical genome mapping and long-read sequencing at high coverage. In addition, analysis is performed directly on native genomic DNA that retains genetic and epigenetic composition without amplification bias. This method enables detection of mutations and structural variants as well as detailed analysis by generation of hybrid scaffolds composed of optical maps and sequencing data at a fraction of the cost of whole genome sequencing.


2021 ◽  
Author(s):  
Chen Yang ◽  
Theodora Lo ◽  
Ka Ming Nip ◽  
Saber Hafezqorani ◽  
Rene L Warren ◽  
...  

Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, platform-specific challenges, including high base-call error rate, non-uniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical tools. Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. Further, Meta-NanoSim improves upon state-of-the-art methods on microbial abundance estimation through a base-level quantification algorithm. We demonstrate that Meta-NanoSim simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenomic assembly benchmarking task.


Author(s):  
Christian Brandt ◽  
Erik Bongcam-Rudloff ◽  
Bettina Müller

Abstract Anaerobic digestion (AD) has long been a critical technology for green energy, but the majority of the microorganisms involved are unknown and not cultivable, which makes abundance tracking difficult. Developments in nanopore sequencing make it a promising approach for monitoring microbial communities via metagenomic sequencing. For reliable monitoring of AD via long reads, a robust protocol for obtaining less fragmented, high-quality DNA, while preserving bacterial composition, was established. Samples from 20 different biogas/wastewater reactors were investigated and a median of 20 Gb sequencing data per flow cell was retrieved for each reactor. Using the GTDB index allowed sufficient characterisation of the abundance of bacteria and archaea in biogas reactors. A dramatic improvement (1.8- to 13-fold increase) in taxonomic classification was achieved using the GTDB-based index compared with the RefSeq index. Ongoing efforts in GTDB to achieve more phylogenetically coherent taxonomic species definitions, including meta-assembled genomes, give a clear advantage over conventional classification databases such as RefSeq. Unlike conventional 16S rRNA studies, metagenomic read classification allows abundance of the unknown microbial fraction to be monitored.


2020 ◽  
Author(s):  
Christian Brandt ◽  
Erik Bongcam-Rudloff ◽  
Bettina Müller

Abstract Background: Anaerobic digestion (AD) has long been a critical technology for green energy, but the majority of the microorganisms involved are unknown and are currently not cultivable, which makes abundance tracking difficult. Developments in nanopore long-read sequencing make it a promising approach for monitoring microbial communities via metagenomic sequencing. For reliable monitoring of AD via long reads, a robust protocol for obtaining less fragmented, high-quality DNA, while preserving bacteria and archaea composition, was established. Results: Samples from 20 different biogas/wastewater reactors were investigated, and a median of 20.5 Gb sequencing data per nanopore flow cell was retrieved for each reactor using the developed DNA isolation protocol. The nanopore sequencing data was compared against Illumina sequencing data while using different taxonomic indices for read classifications. The Genome Taxonomy Database (GTDB) index allowed sufficient characterisation of the abundance of bacteria and archaea in biogas reactors with a dramatic improvement (1.8- to 13-fold increase) in taxonomic classification compared to the RefSeq index. Both technologies performed similarly in taxonomic read classification, with a slight advantage for Illumina with regard to the total proportion of classified reads. However, nanopore sequencing data revealed a higher genus richness after classification. Conclusion: Metagenomic read classification via nanopore provides a promising approach to monitor the abundance of taxa present in a microbial AD community, as an alternative to 16S rRNA studies or Illumina sequencing.


2020 ◽  
Author(s):  
Kaiyuan Zhu ◽  
Welles Robinson ◽  
Alejandro A. Schäffer ◽  
Junyan Xu ◽  
Eytan Ruppin ◽  
...  

Abstract The identification and quantification of microbial abundance at the species or strain level from sequencing data is crucial for our understanding of human health and disease. Existing approaches for microbial abundance estimation either use accurate but computationally expensive alignment-based approaches for species-level estimation or less accurate but computationally fast alignment-free approaches that fail to classify many reads accurately at the species or strain level. Here we introduce CAMMiQ, a novel combinatorial solution to the microbial identification and abundance estimation problem, which performs better than the best used tools on simulated and real datasets with respect to the number of correctly classified reads (i.e., specificity) by an order of magnitude and resolves possible mixtures of similar genomes. As we demonstrate, CAMMiQ can better distinguish between single cells deliberately infected with distinct Salmonella strains and sequenced using scRNA-seq reads than alternative approaches. We also demonstrate that CAMMiQ is more accurate than the best used approaches on a variety of synthetic genomic read data involving some of the most challenging bacterial genomes derived from the NCBI RefSeq database; it can distinguish not only distinct species but also closely related strains of bacteria. The key methodological innovation of CAMMiQ is its use of arbitrary-length, doubly-unique substrings, i.e. substrings that appear in (exactly) two genomes in the input database, instead of fixed-length, unique substrings. To resolve the ambiguity in the genomic origin of doubly-unique substrings, CAMMiQ employs a combinatorial optimization formulation, which can be solved surprisingly quickly. CAMMiQ's index consists of a sparsified subset of the shortest unique and doubly-unique substrings of each genome in the database, within a user-specified length range, and as such it is fairly compact.
In short, CAMMiQ offers more accurate genomic identification and abundance estimation than the best used alternatives while using similar computational resources. Availability: https://github.com/algo-cancer/CAMMiQ
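The distinction between unique and doubly-unique substrings can be made concrete with a small Python sketch. It deliberately simplifies the approach described above: it uses fixed-length k-mers rather than arbitrary-length substrings, ignores reverse complements, builds no sparsified index, and performs no combinatorial optimization. The function names and toy genomes are assumptions for illustration, not part of CAMMiQ.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_kmers(genomes, k):
    """Split k-mers into those found in exactly one genome and in exactly two.

    genomes maps a genome name to its sequence. Returns (unique, doubly_unique),
    where unique maps k-mer -> genome name and doubly_unique maps
    k-mer -> sorted pair of genome names.
    """
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)
    unique = {km: next(iter(g)) for km, g in owners.items() if len(g) == 1}
    doubly_unique = {km: tuple(sorted(g)) for km, g in owners.items() if len(g) == 2}
    return unique, doubly_unique


# Hypothetical toy database of three very short "genomes".
toy_genomes = {"g1": "ACGTA", "g2": "CGTAC", "g3": "TTTTT"}
unique, doubly_unique = classify_kmers(toy_genomes, k=3)
```

In this toy database, `ACG`, `TAC`, and `TTT` each identify a single genome, while `CGT` and `GTA` are shared by `g1` and `g2` only. A read containing `CGT` therefore narrows its origin to two candidates, which is the ambiguity the abstract says is resolved by optimization.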


2014 ◽  
Author(s):  
Christopher W. Beitel ◽  
Lutz Froenicke ◽  
Jenna M. Lang ◽  
Ian F. Korf ◽  
Richard W. Michelmore ◽  
...  

Metagenomics is a valuable tool for the study of microbial communities but has been limited by the difficulty of “binning” the resulting sequences into groups corresponding to the individual species and strains that constitute the community. Moreover, there are presently no methods to track the flow of mobile DNA elements such as plasmids through communities or to determine which of these are co-localized within the same cell. We address these limitations by applying Hi-C, a technology originally designed for the study of three-dimensional genome structure in eukaryotes, to measure the cellular co-localization of DNA sequences. We leveraged Hi-C data generated from a synthetic metagenome sample to accurately cluster metagenome assembly contigs into groups that contain nearly complete genomes of each species. The Hi-C data also reliably associated plasmids with the chromosomes of their host and with each other. We further demonstrated that Hi-C data provides a long-range signal of strain-specific genotypes, indicating such data may be useful for high-resolution genotyping of microbial populations. Our work demonstrates that Hi-C sequencing data provide valuable information for metagenome analyses that are not currently obtainable by other methods. This metagenomic Hi-C method could facilitate future studies of the fine-scale population structure of microbes, as well as studies of how antibiotic resistance plasmids (or other genetic elements) mobilize in microbial communities. The method is not limited to microbiology; the genetic architecture of other heterogeneous populations of cells could also be studied with this technique.
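The idea of grouping contigs by Hi-C co-localization signal can be sketched in Python as below. The abstract does not name the clustering algorithm used; this sketch substitutes a simple connected-components grouping (union-find) over contig pairs whose link count meets a threshold. The function name `cluster_contigs`, the `min_links` parameter, and the toy link counts are all hypothetical.

```python
def cluster_contigs(contigs, links, min_links):
    """Group contigs connected by at least min_links Hi-C read pairs.

    links maps (contig_a, contig_b) to the number of Hi-C pairs joining them.
    Returns a sorted list of sorted contig groups.
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Follow parent pointers to the group root, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), n in links.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data: four chromosomal contigs and one plasmid contig (p1).
toy_contigs = ["c1", "c2", "c3", "c4", "p1"]
toy_links = {("c1", "c2"): 40, ("c3", "c4"): 35, ("c2", "c3"): 2, ("p1", "c4"): 20}
bins = cluster_contigs(toy_contigs, toy_links, min_links=10)
```

With a threshold of 10, the weak `c2`/`c3` link is ignored, so the contigs fall into two groups, and the plasmid contig `p1` is placed with the group containing `c4`, mirroring the plasmid-to-host association described above.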

