Characterization and simulation of metagenomic nanopore sequencing data with Meta-NanoSim

Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, platform-specific challenges, including high base-call error rate, non-uniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical tools. Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. Further, Meta-NanoSim improves upon state-of-the-art methods on microbial abundance estimation through a base-level quantification algorithm. We demonstrate that Meta-NanoSim simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenomic assembly benchmarking task.


Rapid Mycobacterium tuberculosis spoligotyping from uncorrected long reads using Galru

10.1101/2020.05.31.126490 ◽

2020 ◽

Author(s):

Andrew J. Page ◽

Nabil-Fareed Alikhan ◽

Michael Strinden ◽

Thanh Le Viet ◽

Timofey Skvortsov

Keyword(s):

Mycobacterium Tuberculosis ◽

State Of The Art ◽

Sequence Data ◽

Human Pathogen ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Long Reads ◽

Long Read

Abstract Spoligotyping of Mycobacterium tuberculosis provides a subspecies classification of this major human pathogen. Spoligotypes can be predicted from short read genome sequencing data; however, no methods exist for long read sequence data such as from Nanopore or PacBio. We present a novel software package, Galru, which can rapidly detect the spoligotype of a Mycobacterium tuberculosis sample from as little as a single uncorrected long read. It allows for near real-time spoligotyping from long read data as it is being sequenced, giving rapid sample typing. We compare it to the existing state-of-the-art software and find that its results are identical to those obtained from short read sequencing data. Galru is freely available from https://github.com/quadram-institute-bioscience/galru under the GPLv3 open source licence.
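As an illustration of what a spoligotype looks like as data, the sketch below converts a 43-spacer presence/absence pattern into the 15-digit octal code commonly used to report spoligotypes. It assumes the usual convention (14 triplets read as octal digits, plus a final digit for the 43rd spacer); the function name `spoligotype_to_octal` is hypothetical, and this is not a description of how Galru itself works.

```python
def spoligotype_to_octal(pattern: str) -> str:
    """Convert a 43-character presence/absence pattern to a 15-digit octal code.

    `pattern` uses '1' for a spacer that is present and '0' for one that is
    absent. The encoding convention is an assumption stated in the lead-in.
    """
    if len(pattern) != 43 or set(pattern) - {"0", "1"}:
        raise ValueError("expected exactly 43 characters, each '0' or '1'")
    # Spacers 1-42: fourteen groups of three bits, each read as one octal digit.
    digits = [str(int(pattern[i:i + 3], 2)) for i in range(0, 42, 3)]
    # Spacer 43 is appended unchanged as a single 0/1 digit.
    digits.append(pattern[42])
    return "".join(digits)


if __name__ == "__main__":
    # A pattern with every spacer present.
    print(spoligotype_to_octal("1" * 43))
```

A pattern in which only the first spacer is present yields a leading `4` followed by zeros, which shows how each triplet maps to a single digit.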


Evaluation of Germline Structural Variant Calling Methods for Nanopore Sequencing Data

Frontiers in Genetics ◽

10.3389/fgene.2021.761791 ◽

2021 ◽

Vol 12 ◽

Author(s):

Davide Bolognini ◽

Alberto Magi

Keyword(s):

Variant Calling ◽

Research Report ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Factors Affecting ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Sequencing Studies ◽

Long Read

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performance of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges and provide insights into precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.
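To make the precision and recall figures concrete, here is a minimal sketch of how an SV callset might be scored against a truth set. The matching rule (same chromosome, same SV type, breakpoint within `max_dist` bases) and all names (`SV`, `evaluate_callset`, `max_dist`) are illustrative assumptions, not the criteria used by the EViNCe pipeline.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str   # chromosome name
    pos: int     # breakpoint position
    svtype: str  # e.g. "DEL", "INS", "DUP"


def evaluate_callset(calls, truth, max_dist=500):
    """Return (precision, recall) for a list of calls against a truth list.

    Each truth SV can be matched by at most one call. The 500-base
    tolerance is a hypothetical default chosen for illustration.
    """
    unmatched = list(truth)
    tp = 0
    for call in calls:
        hit = next(
            (t for t in unmatched
             if t.chrom == call.chrom
             and t.svtype == call.svtype
             and abs(t.pos - call.pos) <= max_dist),
            None,
        )
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp = len(calls) - tp   # calls with no matching truth SV
    fn = len(unmatched)    # truth SVs that no call matched
    precision = tp / (tp + fp) if calls else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    truth = [SV("chr1", 1000, "DEL"), SV("chr1", 5000, "INS"), SV("chr2", 300, "DUP")]
    calls = [SV("chr1", 1100, "DEL"), SV("chr1", 9000, "INS")]
    print(evaluate_callset(calls, truth))
```

In the usage example, one of two calls matches a truth SV and two of three truth SVs are missed, so precision is 1/2 and recall is 1/3.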


Nanopore sequencing data analysis: state of the art, applications and challenges

Briefings in Bioinformatics ◽

10.1093/bib/bbx062 ◽

2017 ◽

Cited By ~ 20

Author(s):

Alberto Magi ◽

Roberto Semeraro ◽

Alessandra Mingrino ◽

Betti Giusti ◽

Romina D’Aurizio

Keyword(s):

Data Analysis ◽

State Of The Art ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Sequencing Data Analysis


Abundance tracking by long-read nanopore sequencing of complex microbial communities in samples from 20 different biogas/wastewater plants

10.21203/rs.2.17734/v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Christian Brandt ◽

Erik Bongcam-Rudloff ◽

Bettina Müller

Keyword(s):

Microbial Communities ◽

Fold Increase ◽

Green Energy ◽

Dramatic Improvement ◽

Metagenomic Sequencing ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Long Reads ◽

Long Read ◽

Water Reactors

Abstract Anaerobic digestion (AD) has long been a critical technology for green energy, but the majority of the microorganisms involved are unknown and not cultivable, which makes abundance tracking difficult. Developments in nanopore sequencing make it a promising approach for monitoring microbial communities via metagenomic sequencing. For reliable monitoring of AD via long reads, a robust protocol for obtaining less fragmented, high-quality DNA, while preserving bacterial composition, was established. Samples from 20 different biogas/waste-water reactors were investigated and a median of 20 Gb sequencing data per flow cell were retrieved for each reactor. Using the GTDB index allowed sufficient characterisation of abundance of bacteria and archaea in biogas reactors. A dramatic improvement (1.8- to 13-fold increase) in taxonomic classification was achieved using the GTDB-based index compared with the RefSeq index. Ongoing efforts in GTDB to achieve more phylogenetically coherent taxonomic species definitions, including meta-assembled genomes, give a clear advantage over conventional classification databases such as RefSeq. Unlike conventional 16S rRNA studies, metagenomic read classification allows abundance of the unknown microbial fraction to be monitored.


Characterization and Simulation of Metagenomic Nanopore Sequencing Data with Meta-NanoSim

10.21203/rs.3.rs-1125389/v1 ◽

2021 ◽

Author(s):

Chen Yang ◽

Theodora Lo ◽

Ka Ming Nip ◽

Saber Hafezqorani ◽

René L Warren ◽

...

Keyword(s):

Microbial Communities ◽

Simulated Data ◽

Ground Truth ◽

Read Length ◽

Abundance Estimation ◽

Nanopore Sequencing ◽

Microbial Abundance ◽

Sequencing Data ◽

Sequencing Platform ◽

Metagenome Assembly

Abstract Background: Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, sequencing platform-specific challenges, including high base-call error rate, non-uniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical tools, such as microbial abundance estimation and metagenome assembly algorithms. When developing and testing bioinformatics tools and pipelines, the use of simulated datasets with characteristics that are true to the sequencing platform under evaluation is a cost-effective way to provide a ground truth and assess the performance in a controlled environment. Results: Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. It improves upon state-of-the-art methods for microbial abundance estimation through a base-level quantification algorithm. Meta-NanoSim can simulate complex microbial communities composed of both linear and circular genomes, and can stream reference genomes from online servers directly. Simulated datasets showed high congruence with experimental data in terms of read length, error profiles, and abundance levels. We demonstrate that Meta-NanoSim simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenome assembly benchmarking task. Conclusions: The Meta-NanoSim characterization module investigates read features including chimeric information and abundance levels, while the simulation module simulates large and complex multi-sample microbial communities with different abundance profiles. All trained models and the software are freely accessible at GitHub: https://github.com/bcgsc/NanoSim.
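The idea of quantifying abundance at the base level, as opposed to counting reads, can be sketched as follows. This is a simplified illustration under the assumption that each alignment record contributes its number of aligned bases to the genome it maps to; the function name `base_level_abundance` and the input layout are hypothetical, and the sketch is not the Meta-NanoSim algorithm itself.

```python
from collections import defaultdict


def base_level_abundance(alignments):
    """Estimate relative abundance (in percent) from aligned bases per genome.

    `alignments` is an iterable of (genome_name, aligned_bases) pairs. A
    chimeric read may appear several times, once per aligned segment.
    """
    totals = defaultdict(int)
    for genome, aligned_bases in alignments:
        totals[genome] += aligned_bases
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    # Normalise by total aligned bases rather than by number of reads.
    return {genome: 100.0 * bases / grand_total for genome, bases in totals.items()}


if __name__ == "__main__":
    records = [("A", 3000), ("B", 1000), ("A", 1000), ("B", 5000)]
    print(base_level_abundance(records))
```

In the usage example both genomes receive two alignments, so a read-count estimate would give 50% each, whereas weighting by aligned bases gives 40% for A and 60% for B. With non-uniform read lengths the two estimates can diverge in this way.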


NanoSpring: reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach

10.1101/2021.06.09.447198 ◽

2021 ◽

Author(s):

Qingxi Meng ◽

Shubham Chandak ◽

Yifan Zhu ◽

Tsachy Weissman

Keyword(s):

State Of The Art ◽

Lossless Compression ◽

General Purpose ◽

Quality Score ◽

Nanopore Sequencing ◽

The Past ◽

Genome Data ◽

Sequencing Technologies ◽

Efficient Storage ◽

Long Reads

Motivation: The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files. Previous work ENANO focuses mostly on quality score compression and does not achieve significant gains for the compression of read sequences over general-purpose compressors. RENANO achieves significantly better compression for read sequences but is limited to aligned data with a reference available. Results: We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring achieves close to 3x improvement in compression over state-of-the-art reference-free compressors. The computational requirements of NanoSpring are practical, although it uses more time and memory during compression than previous tools to achieve the compression gains. Availability: NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring.


ELECTOR: Evaluator for long reads correction methods

10.1101/512889 ◽

2019 ◽

Cited By ~ 1

Author(s):

Camille Marchet ◽

Pierre Morisse ◽

Lolita Lecompte ◽

Arnaud Lefebvre ◽

Thierry Lecroq ◽

...

Keyword(s):

Error Correction ◽

State Of The Art ◽

Error Rates ◽

Sequencing Data ◽

Third Generation Sequencing ◽

Long Reads ◽

Wide Range ◽

Downstream Processes ◽

Generation Sequencing

Abstract Motivation: In the last few years, the error rates of third generation sequencing data have been capped above 5%, including many insertions and deletions. As a result, an increasing number of long read correction methods have been proposed to reduce the noise in these sequences. Whether hybrid or self-correction methods, there exist multiple approaches to correct long reads. As the quality of the error correction has a huge impact on downstream processes, methods that evaluate error correction tools with precise and reliable statistics are a crucial need. Since error correction is often a resource bottleneck in long read pipelines, a key feature of assessment methods is efficiency, in order to allow the fast comparison of different tools. Results: We propose ELECTOR, a reliable and efficient tool to evaluate long read correction, that enables the evaluation of hybrid and self-correction methods. Our tool provides a complete and relevant set of metrics to assess the read quality improvement after correction and scales to large datasets. ELECTOR is directly compatible with a wide range of state-of-the-art error correction tools, using either simulated or real long reads. We show that ELECTOR displays a wider range of metrics than the state-of-the-art tool, LRCstats, and additionally substantially decreases the runtime needed for assessment on all the studied datasets. Availability: ELECTOR is available at https://github.com/kamimrcht/[email protected] or [email protected]


DNA methylation-calling tools for Oxford Nanopore sequencing: a survey and human epigenome-wide evaluation

Genome Biology ◽

10.1186/s13059-021-02510-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yang Liu ◽

Wojciech Rosikiewicz ◽

Ziwei Pan ◽

Nathaniel Jillette ◽

Ping Wang ◽

...

Keyword(s):

Dna Methylation ◽

Single Molecule ◽

Evaluation Criteria ◽

Systematic Evaluation ◽

Whole Genome ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Long Read ◽

Genome Scale ◽

Analytical Tools

Abstract Background: Nanopore long-read sequencing technology greatly expands the capacity of long-range, single-molecule DNA-modification detection. A growing number of analytical tools have been developed to detect DNA methylation from nanopore sequencing reads. Here, we assess the performance of different methylation-calling tools to provide a systematic evaluation to guide researchers performing human epigenome-wide studies. Results: We compare seven analytic tools for detecting DNA methylation from nanopore long-read sequencing data generated from human natural DNA at a whole-genome scale. We evaluate the per-read and per-site performance of CpG methylation prediction across different genomic contexts, CpG site coverage, and computational resources consumed by each tool. The seven tools exhibit different performances across the evaluation criteria. We show that methylation prediction at regions with discordant DNA methylation patterns, intergenic regions, low CG density regions, and repetitive regions shows room for improvement across all tools. Furthermore, we demonstrate that 5hmC levels at least partly contribute to the discrepancy between bisulfite and nanopore sequencing. Lastly, we provide an online DNA methylation database (https://nanome.jax.org) to display the DNA methylation levels detected by nanopore sequencing and bisulfite sequencing data across different genomic contexts. Conclusions: Our study is the first systematic benchmark of computational methods for detection of mammalian whole-genome DNA modifications in nanopore sequencing. We provide a broad foundation for cross-platform standardization and an evaluation of analytical tools designed for genome-scale modified base detection using nanopore sequencing.


Spacemake: processing and analysis of large-scale spatial transcriptomics data

10.1101/2021.11.07.467598 ◽

2021 ◽

Author(s):

Tamas Ryszard Sztanka-Toth ◽

Marvin Jens ◽

Nikos Karaiskos ◽

Nikolaus Rajewsky

Keyword(s):

Large Scale ◽

Modular Design ◽

State Of The Art ◽

Sequencing Data ◽

Unified Framework ◽

Tissue Sections ◽

Long Reads ◽

Rna Biology ◽

Transcriptomics Data ◽

Downstream Analysis

Spatial sequencing methods are gaining popularity within RNA biology studies. State-of-the-art techniques can read mRNA expression levels from tissue sections and at the same time register information about the original locations of the molecules in the tissue. The resulting datasets are processed and analyzed by accompanying software which, however, is incompatible across inputs from different technologies. Here, we present spacemake, a modular, robust and scalable spatial transcriptomics pipeline built in snakemake and python. Spacemake is designed to handle all major spatial transcriptomics datasets and can be readily configured to run on other technologies. It can process and analyze several samples in parallel, even if they stem from different experimental methods. Spacemake's unified framework enables reproducible data processing from raw sequencing data to automatically generated downstream analysis reports. Moreover, spacemake is built with a modular design and offers additional functionality such as sample merging, saturation analysis and analysis of long reads as separate modules. In addition, spacemake employs novoSpaRc to integrate spatial and single-cell transcriptomics data, resulting in increased gene counts for the spatial dataset. Spacemake is open-source, extendable and can be readily integrated with existing computational workflows.


Haplotype Threading: Accurate Polyploid Phasing from Long Reads

10.1101/2020.02.04.933523 ◽

2020 ◽

Cited By ~ 2

Author(s):

Sven D. Schrinner ◽

Rebecca Serra Mari ◽

Jana Ebler ◽

Mikko Rautiainen ◽

Lancelot Seillier ◽

...

Keyword(s):

Scoring Function ◽

Simulated Data ◽

Real Data ◽

Error Rates ◽

Sequencing Data ◽

Data Set ◽

Current State ◽

Long Reads ◽

History Of ◽

Genomic Regions

Abstract Resolving genomes at haplotype level is crucial for understanding the evolutionary history of polyploid species and for designing advanced breeding strategies. As a highly complex computational problem, polyploid phasing still presents considerable challenges, especially in regions of collapsing haplotypes. We present WhatsHap polyphase, a novel two-stage approach that addresses these challenges by (i) clustering reads using a position-dependent scoring function and (ii) threading the haplotypes through the clusters by dynamic programming. We demonstrate on a simulated data set that this results in accurate haplotypes with switch error rates that are around three times lower than those obtainable by the current state-of-the-art and even around seven times lower in regions of collapsing haplotypes. Using a real data set comprising long and short read tetraploid potato sequencing data we show that WhatsHap polyphase is able to phase the majority of the potato genes after error correction, which enables the assembly of local genomic regions of interest at haplotype level. Our algorithm is implemented as part of the widely used open source tool WhatsHap and ready to be included in production settings.
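The switch error rate mentioned above can be illustrated in its simplest, diploid form. The sketch below assumes a single true haplotype and a single predicted haplotype, each given as a string of 0/1 alleles over the same heterozygous sites; the polyploid definition used to evaluate WhatsHap polyphase is more involved and is not reproduced here. The function name `switch_error_rate` is hypothetical.

```python
def switch_error_rate(truth: str, predicted: str) -> float:
    """Diploid switch error rate between a true and a predicted haplotype.

    A switch is counted wherever the agreement status (match or mismatch)
    changes between two consecutive sites. The rate is the number of
    switches divided by the number of adjacent site pairs.
    """
    if len(truth) != len(predicted):
        raise ValueError("haplotypes must cover the same sites")
    if len(truth) < 2:
        return 0.0
    # True where the predicted allele disagrees with the truth at a site.
    flipped = [t != p for t, p in zip(truth, predicted)]
    # Count changes of agreement status between neighbouring sites.
    switches = sum(a != b for a, b in zip(flipped, flipped[1:]))
    return switches / (len(truth) - 1)


if __name__ == "__main__":
    print(switch_error_rate("000000", "000111"))
```

In the usage example the prediction follows the true haplotype for three sites and then jumps to the other haplotype, giving one switch over five adjacent pairs. A prediction that is the exact complement of the truth has no switches, since in a diploid it corresponds to the other correctly phased haplotype.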
