Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap

Mapping Intimacies ◽

10.1101/020719 ◽

2015 ◽

Cited By ~ 1

Author(s):

Ivan Sovic ◽

Mile Sikic ◽

Andreas Wilm ◽

Shannon Nicole Fenlon ◽

Swaine Chen ◽

...

Keyword(s):

Human Genome ◽

Variant Calling ◽

Error Rates ◽

Nanopore Sequencing ◽

Structural Variants ◽

Specific Identification ◽

Long Reads ◽

Long Read ◽

Specific Error ◽

Very High

Exploiting the power of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. We present the first nanopore read mapper (GraphMap) that uses a read-funneling paradigm to robustly handle variable error rates and fast graph traversal to align long reads with speed and very high precision (>95%). Evaluation on MinION sequencing datasets against short- and long-read mappers indicates that GraphMap increases mapping sensitivity by at least 15-80%. GraphMap alignments are the first to demonstrate consensus calling with <1 error in 100,000 bases, variant calling on the human genome with 76% improvement in sensitivity over the next best mapper (BWA-MEM), precise detection of structural variants from 100 bp to 4 kbp in length, and species- and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.

Evaluation of Germline Structural Variant Calling Methods for Nanopore Sequencing Data

Frontiers in Genetics ◽

10.3389/fgene.2021.761791 ◽

2021 ◽

Vol 12 ◽

Author(s):

Davide Bolognini ◽

Alberto Magi

Keyword(s):

Variant Calling ◽

Research Report ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Factors Affecting ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Sequencing Studies ◽

Long Read

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already been proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performances of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges and provide insights into precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.
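
The precision and recall figures mentioned in this abstract follow the usual definitions over true-positive, false-positive, and false-negative calls. The sketch below is only an illustration of that arithmetic; the function name and the example counts are hypothetical and are not taken from the EViNCe pipeline.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from call counts.

    tp: calls that match a truth-set SV
    fp: calls with no match in the truth set
    fn: truth-set SVs that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p, r, f = precision_recall_f1(tp=80, fp=20, fn=20)
    print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

How a call is judged to "match" a truth-set SV (breakpoint tolerance, size similarity, genotype agreement) is a separate decision that this sketch does not cover.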

Rapid multi-locus sequence typing direct from uncorrected long reads using Krocus

10.1101/259150 ◽

2018 ◽

Author(s):

Andrew J. Page ◽

Jacqueline A. Keane

Keyword(s):

Error Rates ◽

Multi Locus Sequence Typing ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Standard Tool ◽

Long Read ◽

Sequence Types ◽

Very High

Genome sequencing is rapidly being adopted in reference labs and hospitals for bacterial outbreak investigation and diagnostics where time is critical. Seven-gene multi-locus sequence typing is a standard tool for broadly classifying samples into sequence types, which in many cases allows a sample to be ruled in or out of an outbreak, or general characteristics of a bacterial strain to be inferred. Long-read sequencing technologies, such as those from PacBio or Oxford Nanopore, can produce read data within minutes of an experiment starting, unlike short-read sequencing technologies, which require many hours or days. However, the error rates of raw uncorrected long-read data are very high. We present Krocus, which can predict a sequence type directly from uncorrected long reads and which was designed to consume read data as it is produced, providing results in minutes. It is the only tool which can do this from uncorrected long reads. We tested Krocus on over 600 samples sequenced using long-read sequencing technologies from PacBio and Oxford Nanopore. It provides sequence types on average within 90 seconds, with a sensitivity of 94% and specificity of 97%, directly from uncorrected raw sequence reads. The software is written in Python and is available under the open source license GNU GPL version 3.

Benchmarking Long-Read Assemblers for Genomic Analyses of Bacterial Pathogens Using Oxford Nanopore Sequencing

International Journal of Molecular Sciences ◽

10.3390/ijms21239161 ◽

2020 ◽

Vol 21 (23) ◽

pp. 9161

Author(s):

Zhao Chen ◽

David L. Erickson ◽

Jianghong Meng

Keyword(s):

Virulence Genes ◽

Bacterial Pathogens ◽

Error Rates ◽

Nanopore Sequencing ◽

Long Reads ◽

Oxford Nanopore ◽

Genomic Analyses ◽

Long Read ◽

Genome Analyses ◽

Assembly Algorithms

Oxford Nanopore sequencing can be used to achieve complete bacterial genomes. However, the error rates of Oxford Nanopore long reads are higher than those of Illumina short reads. Long-read assemblers using a variety of assembly algorithms have been developed to overcome this deficiency, but they have not been benchmarked for genomic analyses of bacterial pathogens using Oxford Nanopore long reads. In this study, the long-read assemblers Canu, Flye, Miniasm/Racon, Raven, Redbean, and Shasta were therefore benchmarked using Oxford Nanopore long reads of bacterial pathogens. Ten species were tested for mediocre- and low-quality simulated reads, and 10 species were tested for real reads. Raven was the most robust assembler, obtaining complete and accurate genomes. All Miniasm/Racon and Raven assemblies of mediocre-quality reads provided accurate antimicrobial resistance (AMR) profiles, while the Raven assembly of Klebsiella variicola with low-quality reads was the only assembly with an accurate AMR profile among all assemblers and species. All assemblers functioned well for predicting virulence genes using mediocre-quality and real reads, whereas only the Raven assemblies of low-quality reads had accurate numbers of virulence genes. Regarding multilocus sequence typing (MLST), Miniasm/Racon was the most effective assembler for mediocre-quality reads, while only the Raven assemblies of Escherichia coli O157:H7 and K. variicola with low-quality reads showed positive MLST results. Miniasm/Racon and Raven were the best performers for MLST using real reads. The Miniasm/Racon and Raven assemblies showed accurate phylogenetic inference. For the pan-genome analyses, Raven was the strongest assembler for simulated reads, whereas Miniasm/Racon and Raven performed the best for real reads. Overall, the most robust and accurate assembler was Raven, closely followed by Miniasm/Racon.

Rapid multi-locus sequence typing direct from uncorrected long reads using Krocus

PeerJ ◽

10.7717/peerj.5233 ◽

2018 ◽

Vol 6 ◽

pp. e5233 ◽

Cited By ~ 6

Author(s):

Andrew J. Page ◽

Jacqueline A. Keane

Keyword(s):

Error Rates ◽

Multi Locus Sequence Typing ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Standard Tool ◽

Sample Data ◽

Long Read ◽

Sequence Types ◽

Very High

Genome sequencing is rapidly being adopted in reference labs and hospitals for bacterial outbreak investigation and diagnostics where time is critical. Seven-gene multi-locus sequence typing is a standard tool for broadly classifying samples into sequence types (STs), which in many cases allows a sample to be ruled out of an outbreak, or general characteristics of a bacterial strain to be inferred. Long-read sequencing technologies, such as those from Oxford Nanopore, can produce read data within minutes of an experiment starting, unlike short-read sequencing technologies, which require many hours or days. However, the error rates of raw uncorrected long-read data are very high. We present Krocus, which can predict an ST directly from uncorrected long reads and which was designed to consume read data as it is produced, providing results in minutes. It is the only tool which can do this from uncorrected long reads. We tested Krocus on over 700 isolates sequenced using long-read sequencing technologies from Pacific Biosciences and Oxford Nanopore. It provides STs for isolates on average within 90 s, with a sensitivity of 94% and specificity of 97% on real sample data, directly from uncorrected raw sequence reads. The software is written in Python and is available under the open source license GNU GPL version 3.

Nanopore sequencing enables high-resolution analysis of resistance determinants and mobile elements in the human gut microbiome

10.1101/456905 ◽

2018 ◽

Cited By ~ 6

Author(s):

Denis Bertrand ◽

Jim Shaw ◽

Manesh Kalathiappan ◽

Amanda Hui Qi Ng ◽

Senthil Muthiah ◽

...

Keyword(s):

Complex Dynamics ◽

Human Microbiome ◽

Abundant Species ◽

Error Rates ◽

Nanopore Sequencing ◽

High Quality ◽

Short Read ◽

Long Reads ◽

Long Read

The analysis of information-rich whole-metagenome datasets acquired from complex microbial communities is often restricted by the fragmented nature of assembly from short-read sequencing. The availability of long reads from third-generation sequencing technologies (e.g. PacBio or Oxford Nanopore) can help improve assembly quality in principle, but high error rates and low throughput have limited their application in metagenomics. In this work, we describe the first hybrid metagenomic assembler which combines the advantages of short- and long-read technologies, providing an order of magnitude improvement in contiguity compared to short-read assemblies, and high base-pair level accuracy. The proposed approach (OPERA-MS) integrates a novel assembly-based metagenome clustering technique with an exact scaffolding algorithm that can efficiently assemble repeat-rich sequences. Based on evaluations with defined in vitro communities and virtual gut microbiomes, we show that it is possible to assemble near-complete genomes from metagenomes with as little as 9× long-read coverage, thus enabling high-quality assembly of lowly abundant species (<1%). Furthermore, OPERA-MS’s fine-grained clustering is able to deconvolute and assemble multiple genomes of the same species in a single sample, allowing us to study the complex dynamics of the human microbiome at the sub-species level. Applying nanopore sequencing to gut metagenomes of patients undergoing antibiotic treatment, we show that long reads can be obtained from stool samples in clinical studies to produce more meaningful metagenomic assemblies (up to 200× improvement over short-read assemblies), including the closed assembly of >80 putative plasmid/phage sequences and a 263 kbp jumbo phage. Our results highlight that high-quality hybrid assemblies provide an unprecedented view of the gut resistome in these patients, including strain dynamics and identification of novel plasmid sequences.

Vulcan: Improved long-read mapping and structural variant calling via dual-mode alignment

10.1101/2021.05.29.446291 ◽

2021 ◽

Author(s):

Yilei Fu ◽

Medhat Mahmoud ◽

Viginesh Vaibhav Muraliraman ◽

Fritz J Sedlazeck ◽

Todd J Treangen

Keyword(s):

Human Genome ◽

Variant Calling ◽

Dual Mode ◽

Read Mapping ◽

Structural Variant ◽

Long Reads ◽

Oxford Nanopore ◽

Mutational Hotspots ◽

Long Read ◽

High Level

Background: Long-read sequencing has enabled unprecedented surveys of structural variation across the entire human genome. To maximize the potential of long-read sequencing in this context, novel mapping methods have emerged that have primarily focused on either speed or accuracy. Various heuristics and scoring schemas have been implemented in widely-used read mappers (minimap2 and NGMLR) to optimize for speed or accuracy, which have variable performance across different genomic regions and for specific structural variants. Our hypothesis is that constraining read mapping to the use of a single gap penalty across distinct mutational hotspots reduces read alignment accuracy and impedes structural variant detection. Findings: We tested our hypothesis by implementing a read mapping pipeline called Vulcan that uses two distinct gap penalty modes, which we refer to as dual-mode alignment. The high-level idea is that Vulcan leverages the computed normalized edit distance of the mapped reads via e.g. minimap2 to identify poorly aligned reads and realigns them using the more accurate yet computationally more expensive long read mapper (NGMLR). In support of our hypothesis, we show Vulcan improves the alignments for Oxford Nanopore Technology (ONT) long-reads for both simulated and real datasets. These improvements, in turn, lead to improved accuracy for structural variant calling performance on human genome datasets compared to either of the read mapping methods alone. Conclusions: Vulcan is the first long-read mapping framework that combines two distinct gap penalty modes, resulting in improved structural variant recall and precision. Vulcan is open-source and available under the MIT License at https://gitlab.com/treangenlab/vulcan
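
The selection step described in this abstract (use the normalized edit distance of first-pass alignments to pick out poorly aligned reads for realignment) can be sketched as follows. This is a minimal illustration and not Vulcan's code: the `Alignment` record, the normalization by aligned length, and the fixed `threshold` are all assumptions made for the example, and the actual cutoff rule used by Vulcan may differ.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one first-pass alignment."""
    read_id: str
    edit_distance: int   # e.g. the NM tag of a SAM record
    aligned_length: int  # number of aligned read bases


def normalized_edit_distance(aln: Alignment) -> float:
    """Edit distance per aligned base; unaligned reads score 1.0 (worst)."""
    if aln.aligned_length == 0:
        return 1.0
    return aln.edit_distance / aln.aligned_length


def split_for_realignment(alignments, threshold=0.2):
    """Partition alignments into (kept, to_realign).

    Reads whose normalized edit distance exceeds `threshold` (an
    illustrative value, not a published default) are flagged for a
    second pass with a slower, more accurate mapper.
    """
    kept, to_realign = [], []
    for aln in alignments:
        if normalized_edit_distance(aln) > threshold:
            to_realign.append(aln)
        else:
            kept.append(aln)
    return kept, to_realign


if __name__ == "__main__":
    first_pass = [
        Alignment("read1", edit_distance=5, aligned_length=100),
        Alignment("read2", edit_distance=30, aligned_length=100),
        Alignment("read3", edit_distance=0, aligned_length=0),
    ]
    kept, redo = split_for_realignment(first_pass)
    print("kept:", [a.read_id for a in kept])
    print("realign:", [a.read_id for a in redo])
```

The point of the two-pass design is that only the flagged subset pays the cost of the slower mapper, while well-aligned reads keep their fast first-pass alignment.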

Merfin: improved variant filtering and polishing via k-mer validation

10.1101/2021.07.16.452324 ◽

2021 ◽

Author(s):

Giulio Formenti ◽

Arang Rhie ◽

Brian P Walenz ◽

Francoise Thibaud-Nissen ◽

Kishwar Shafin ◽

...

Keyword(s):

Human Genome ◽

Variant Calling ◽

Read Mapping ◽

Mapping Algorithm ◽

Copy Numbers ◽

Long Reads ◽

Variant Filtering ◽

Long Read ◽

Finishing Tool

Read mapping and variant calling approaches have been widely used for accurate genotyping and improving consensus quality assembled from noisy long reads. Variant calling accuracy relies heavily on the read quality, the precision of the read mapping algorithm and variant caller, and the criteria adopted to filter the calls. However, it is impossible to define a single set of optimal parameters, as they vary depending on the quality of the read set, the variant caller of choice, and the quality of the unpolished assembly. To overcome this issue, we have devised a new tool called Merfin (k-mer based finishing tool), a k-mer based variant filtering algorithm for improved genotyping and polishing. Merfin evaluates the accuracy of a call based on expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller internal score. Moreover, we introduce novel assembly quality and completeness metrics that account for the expected genomic copy numbers. Merfin significantly increased the precision of a variant call and reduced frameshift errors when applied to PacBio HiFi, PacBio CLR, or Nanopore long read based assemblies. We demonstrate the utility while polishing the first complete human genome, a fully phased human genome, and non-human high-quality genomes.
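
The core idea of judging a call by the k-mers it creates, rather than by alignment or caller scores, can be illustrated with a deliberately simplified sketch. It is not Merfin's algorithm: Merfin works with expected k-mer multiplicities, whereas this toy version only asks whether applying a variant reduces the number of k-mers that are absent from the reads. All function names are hypothetical, and reverse complements and sequencing-error k-mers are ignored.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer occurring in the given sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmers_spanning(seq, start, end, k):
    """k-mers of `seq` that overlap the half-open interval [start, end).

    For an empty interval (start == end) this returns the k-mers that
    span the junction at `start`.
    """
    lo = max(0, start - k + 1)
    hi = min(len(seq) - k, end - 1)
    return [seq[i:i + k] for i in range(lo, hi + 1)]


def apply_variant(ref, pos, ref_allele, alt_allele):
    """Return `ref` with `ref_allele` at 0-based `pos` replaced by `alt_allele`."""
    if ref[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("reference allele does not match the sequence")
    return ref[:pos] + alt_allele + ref[pos + len(ref_allele):]


def missing_kmers(seq, start, end, read_kmers, k):
    """Number of k-mers overlapping [start, end) that never occur in the reads."""
    return sum(1 for km in kmers_spanning(seq, start, end, k) if read_kmers[km] == 0)


def variant_is_supported(ref, pos, ref_allele, alt_allele, read_kmers, k):
    """Keep a call only if it lowers the count of read-unsupported k-mers."""
    alt_seq = apply_variant(ref, pos, ref_allele, alt_allele)
    before = missing_kmers(ref, pos, pos + len(ref_allele), read_kmers, k)
    after = missing_kmers(alt_seq, pos, pos + len(alt_allele), read_kmers, k)
    return after < before


if __name__ == "__main__":
    k = 4
    reads = ["ACGTAGGTTT"]    # toy read set carrying a G at position 5
    assembly = "ACGTACGTTT"   # toy assembly with a C at position 5
    read_kmers = count_kmers(reads, k)
    print(variant_is_supported(assembly, 5, "C", "G", read_kmers, k))
```

In this toy example the C-to-G call is kept because every k-mer around the corrected base is present in the reads, while applying the opposite change to an already correct sequence would be rejected.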

Vulcan: Improved long-read mapping and structural variant calling via dual-mode alignment

GigaScience ◽

10.1093/gigascience/giab063 ◽

2021 ◽

Vol 10 (9) ◽

Author(s):

Yilei Fu ◽

Medhat Mahmoud ◽

Viginesh Vaibhav Muraliraman ◽

Fritz J Sedlazeck ◽

Todd J Treangen

Keyword(s):

Human Genome ◽

Variant Calling ◽

Dual Mode ◽

Read Mapping ◽

Structural Variant ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

High Level ◽

Improved Accuracy

Background: Long-read sequencing has enabled unprecedented surveys of structural variation across the entire human genome. To maximize the potential of long-read sequencing in this context, novel mapping methods have emerged that have primarily focused on either speed or accuracy. Various heuristics and scoring schemas have been implemented in widely used read mappers (minimap2 and NGMLR) to optimize for speed or accuracy, which have variable performance across different genomic regions and for specific structural variants. Our hypothesis is that constraining read mapping to the use of a single gap penalty across distinct mutational hot spots reduces read alignment accuracy and impedes structural variant detection. Findings: We tested our hypothesis by implementing a read-mapping pipeline called Vulcan that uses two distinct gap penalty modes, which we refer to as dual-mode alignment. The high-level idea is that Vulcan leverages the computed normalized edit distance of the mapped reads via minimap2 to identify poorly aligned reads and realigns them using the more accurate yet computationally more expensive long-read mapper (NGMLR). In support of our hypothesis, we show that Vulcan improves the alignments for Oxford Nanopore Technology long reads for both simulated and real datasets. These improvements, in turn, lead to improved accuracy for structural variant calling performance on human genome datasets compared to either of the read-mapping methods alone. Conclusions: Vulcan is the first long-read mapping framework that combines two distinct gap penalty modes for improved structural variant recall and precision. Vulcan is open-source and available under the MIT License at https://gitlab.com/treangenlab/vulcan.

Ratatosk: hybrid error correction of long reads enables accurate variant calling and assembly

Genome Biology ◽

10.1186/s13059-020-02244-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Guillaume Holley ◽

Doruk Beyter ◽

Helga Ingimundardottir ◽

Peter L. Møller ◽

Snædis Kristmundsdottir ◽

...

Keyword(s):

Error Correction ◽

Human Genome ◽

Error Rate ◽

Variant Calling ◽

High Error Rate ◽

Sequencing Data ◽

Short Read ◽

Long Reads ◽

Median Error ◽

Long Read

A major challenge with long-read sequencing data is its high error rate of up to 15%. We present Ratatosk, a method to correct long reads with short-read data. We demonstrate on 5 human genome trios that Ratatosk reduces the error rate of long reads 6-fold on average, with a median error rate as low as 0.22%. SNP calls in Ratatosk-corrected reads are nearly 99% accurate, and indel calling accuracy is increased by up to 37%. An assembly of Ratatosk-corrected reads from an Ashkenazi individual yields a contig N50 of 45 Mbp and fewer misassemblies than an assembly of PacBio HiFi reads.
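
For readers unfamiliar with the contig N50 quoted in this abstract: it is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal sketch of that calculation follows; the contig lengths in the example are made up and have nothing to do with the assembly reported here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    if not lengths:
        return 0
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # only reached if every length is zero


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(n50([2, 3, 4, 5, 6]))
```

Because N50 depends only on contig lengths, a high value does not by itself indicate a correct assembly, which is why the abstract reports misassemblies alongside it.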

Faculty Opinions recommendation of Nanopore sequencing and assembly of a human genome with ultra-long reads.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.732576831.793542122 ◽

2018 ◽

Author(s):

James Coker

Keyword(s):

Human Genome ◽

Nanopore Sequencing ◽

Long Reads