Vulcan: Improved long-read mapping and structural variant calling via dual-mode alignment

Abstract

Background: Long-read sequencing has enabled unprecedented surveys of structural variation across the entire human genome. To maximize the potential of long-read sequencing in this context, novel mapping methods have emerged that focus primarily on either speed or accuracy. Widely used read mappers (minimap2 and NGMLR) implement various heuristics and scoring schemes to optimize for speed or accuracy, and their performance varies across genomic regions and for specific structural variants. Our hypothesis is that constraining read mapping to a single gap penalty across distinct mutational hotspots reduces read alignment accuracy and impedes structural variant detection.

Findings: We tested our hypothesis by implementing a read-mapping pipeline called Vulcan that uses two distinct gap penalty modes, which we refer to as dual-mode alignment. The high-level idea is that Vulcan uses the normalized edit distance of reads mapped with minimap2 to identify poorly aligned reads and realigns them with the more accurate yet computationally more expensive long-read mapper NGMLR. In support of our hypothesis, we show that Vulcan improves the alignments of Oxford Nanopore Technologies (ONT) long reads for both simulated and real datasets. These improvements, in turn, lead to more accurate structural variant calling on human genome datasets compared to either read mapper alone.

Conclusions: Vulcan is the first long-read mapping framework that combines two distinct gap penalty modes for improved structural variant recall and precision. Vulcan is open-source and available under the MIT License at https://gitlab.com/treangenlab/vulcan.
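The read-selection step described in the abstract can be illustrated with a short sketch. This is not Vulcan's code: the `Alignment` record, the function names, and the use of a percentile cutoff on the normalized edit distance are assumptions made for illustration, and a real pipeline would read these values from the alignment records produced by minimap2.

```python
import math
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical minimal alignment record (not a Vulcan data structure)."""
    read_id: str
    edit_distance: int   # e.g. the NM tag of a SAM record
    aligned_length: int  # number of aligned bases


def normalized_edit_distance(aln):
    """Edit distance divided by aligned length; higher means a worse alignment."""
    if aln.aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return aln.edit_distance / aln.aligned_length


def split_for_realignment(alignments, percentile=90.0):
    """Split alignments into (kept, to_realign).

    Assumption: reads whose normalized edit distance lies above the given
    percentile (nearest-rank method) are treated as poorly aligned and are
    handed to the second, more accurate mapper.
    """
    if not alignments:
        return [], []
    scores = sorted(normalized_edit_distance(a) for a in alignments)
    rank = max(1, math.ceil(percentile * len(scores) / 100))
    cutoff = scores[rank - 1]
    kept = [a for a in alignments if normalized_edit_distance(a) <= cutoff]
    to_realign = [a for a in alignments if normalized_edit_distance(a) > cutoff]
    return kept, to_realign
```

In this sketch the reads in `to_realign` would be passed to NGMLR, and the two sets of alignments would then be merged before structural variant calling.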


Vulcan: Improved long-read mapping and structural variant calling via dual-mode alignment

DOI: 10.1101/2021.05.29.446291, 2021

Author(s): Yilei Fu, Medhat Mahmoud, Viginesh Vaibhav Muraliraman, Fritz J Sedlazeck, Todd J Treangen

Keyword(s): Human Genome, Variant Calling, Dual Mode, Read Mapping, Structural Variant, Long Reads, Oxford Nanopore, Mutational Hotspots, Long Read, High Level



Merfin: improved variant filtering and polishing via k-mer validation

DOI: 10.1101/2021.07.16.452324, 2021

Author(s): Giulio Formenti, Arang Rhie, Brian P Walenz, Francoise Thibaud-Nissen, Kishwar Shafin, ...

Keyword(s): Human Genome, Variant Calling, Read Mapping, Mapping Algorithm, Copy Numbers, Long Reads, Variant Filtering, Long Read, Finishing Tool

Read mapping and variant calling approaches have been widely used for accurate genotyping and for improving the consensus quality of assemblies built from noisy long reads. Variant calling accuracy relies heavily on the read quality, the precision of the read mapping algorithm and variant caller, and the criteria adopted to filter the calls. However, it is impossible to define a single set of optimal parameters, as they vary depending on the quality of the read set, the variant caller of choice, and the quality of the unpolished assembly. To overcome this issue, we have devised a new tool called Merfin (k-mer based finishing tool), a k-mer based variant filtering algorithm for improved genotyping and polishing. Merfin evaluates the accuracy of a call based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and of the variant caller's internal score. Moreover, we introduce novel assembly quality and completeness metrics that account for the expected genomic copy numbers. Merfin significantly increased the precision of variant calls and reduced frameshift errors when applied to assemblies based on PacBio HiFi, PacBio CLR, or Nanopore long reads. We demonstrate its utility by polishing the first complete human genome, a fully phased human genome, and non-human high-quality genomes.
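The idea of judging a call by the k-mers it creates can be sketched in a few lines. This is a deliberately simplified stand-in, not Merfin's algorithm: it only checks whether k-mers are present or absent in the reads instead of comparing against an expected multiplicity, and every function name and the acceptance rule are assumptions for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmers_spanning(seq, start, end, k):
    """Return the k-mers of seq that overlap the interval [start, end)."""
    lo = max(0, start - k + 1)
    hi = min(len(seq) - k, end - 1)
    return [seq[i:i + k] for i in range(lo, hi + 1)]


def missing_kmers(seq, start, end, read_counts, k):
    """Number of k-mers over [start, end) that never occur in the reads."""
    return sum(1 for km in kmers_spanning(seq, start, end, k)
               if read_counts[km] == 0)


def accept_variant(ref, pos, ref_allele, alt_allele, read_counts, k):
    """Accept a call if applying it does not increase read-unsupported k-mers.

    Simplification: presence/absence only, no expected copy number.
    """
    alt_seq = ref[:pos] + alt_allele + ref[pos + len(ref_allele):]
    before = missing_kmers(ref, pos, pos + len(ref_allele), read_counts, k)
    after = missing_kmers(alt_seq, pos, pos + len(alt_allele), read_counts, k)
    return after <= before
```

Because the decision uses only k-mer counts from the reads, it does not depend on alignment quality or on the variant caller's own score, which is the property the abstract highlights.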


Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap

DOI: 10.1101/020719, 2015, Cited By ~ 1

Author(s): Ivan Sovic, Mile Sikic, Andreas Wilm, Shannon Nicole Fenlon, Swaine Chen, ...

Keyword(s): Human Genome, Variant Calling, Error Rates, Nanopore Sequencing, Structural Variants, Specific Identification, Long Reads, Long Read, Specific Error, Very High

Exploiting the power of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. We present the first nanopore read mapper (GraphMap) that uses a read-funneling paradigm to robustly handle variable error rates and fast graph traversal to align long reads with speed and very high precision (>95%). Evaluation on MinION sequencing datasets against short- and long-read mappers indicates that GraphMap increases mapping sensitivity by at least 15-80%. GraphMap alignments are the first to demonstrate consensus calling with <1 error in 100,000 bases, variant calling on the human genome with a 76% improvement in sensitivity over the next best mapper (BWA-MEM), precise detection of structural variants from 100 bp to 4 kbp in length, and species- and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.


Evaluation of Germline Structural Variant Calling Methods for Nanopore Sequencing Data

Frontiers in Genetics, DOI: 10.3389/fgene.2021.761791, 2021, Vol 12

Author(s): Davide Bolognini, Alberto Magi

Keyword(s): Variant Calling, Research Report, Nanopore Sequencing, Sequencing Data, Factors Affecting, Sequencing Technologies, Long Reads, Oxford Nanopore, Sequencing Studies, Long Read

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performance of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges, and we provide insights into the precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.
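Precision and recall of an SV callset, as mentioned above, can be computed once calls have been matched against a truth set. The sketch below is an illustration under stated assumptions and is not taken from the EViNCe pipeline: SVs are represented as hypothetical `(type, chromosome, position)` tuples, and a call counts as a true positive if an unmatched truth SV of the same type lies within `max_dist` bases on the same chromosome.

```python
def match_calls(calls, truth, max_dist=500):
    """Greedy one-to-one matching of called SVs against a truth set.

    Each SV is a (sv_type, chrom, pos) tuple. The 500 bp tolerance is an
    arbitrary illustrative default, not a value from the report.
    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(truth)
    tp = 0
    for sv_type, chrom, pos in calls:
        for candidate in unmatched:
            same_kind = candidate[0] == sv_type and candidate[1] == chrom
            if same_kind and abs(candidate[2] - pos) <= max_dist:
                unmatched.remove(candidate)  # each truth SV is used once
                tp += 1
                break
    fp = len(calls) - tp
    fn = len(unmatched)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Standard definitions; returns 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Dedicated benchmarking tools usually also compare SV length and sequence; this sketch matches on type and position only.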


A Long-Read Sequencing Approach for Direct Haplotype Phasing in Clinical Settings

International Journal of Molecular Sciences, DOI: 10.3390/ijms21239177, 2020, Vol 21 (23), pp. 9177

Author(s): Simone Maestri, Maria Giovanna Maturo, Emanuela Cosentino, Luca Marcolungo, Barbara Iadarola, ...

Keyword(s): Diagnostic Testing, Variant Calling, Clinical Settings, Sequencing Data, Sequencing Platform, Variant Discovery, Long Reads, Oxford Nanopore, Long Read, Second Generation Sequencing

The reconstruction of individual haplotypes can facilitate the interpretation of disease risks; however, high costs and technical challenges still hinder their assessment in clinical settings. Second-generation sequencing is the gold standard for variant discovery but, because it produces short reads covering small genomic regions, allows only indirect haplotyping based on statistical methods. In contrast, third-generation methods such as the nanopore sequencing platform developed by Oxford Nanopore Technologies (ONT) generate long reads that can be used for direct haplotyping, with fewer drawbacks. However, robust standards for variant phasing in ONT-based target resequencing efforts are not yet available. In this study, we presented a streamlined proof-of-concept workflow for variant calling and phasing based on ONT data in a clinically relevant 12-kb region of the APOE locus, a hotspot for variants and haplotypes associated with aging-related diseases and longevity. Starting with sequencing data from simple amplicons of the target locus, we demonstrated that ONT data allow for reliable single-nucleotide variant (SNV) calling and phasing from as few as 60 reads, although the recognition of indels is less efficient. Even so, we identified the best combination of ONT read sets (600) and software (BWA/Minimap2 and HapCUT2) that enables full haplotype reconstruction when both SNVs and indels have been identified previously using a highly accurate sequencing platform. In conclusion, we established a rapid and inexpensive workflow for variant phasing based on ONT long reads. This workflow allows for the analysis of multiple samples in parallel and can easily be implemented in routine clinical practice, including diagnostic testing.


precisionFDA Truth Challenge V2: Calling variants from short- and long-reads in difficult-to-map regions

DOI: 10.1101/2020.11.13.380741, 2020, Cited By ~ 2

Author(s): Nathan D. Olson, Justin Wagner, Jennifer McDaniel, Sarah H. Stephens, Samuel T. Westreich, ...

Keyword(s): Machine Learning, Variant Calling, Learning Approaches, Sequencing Technologies, Innovative Methods, Long Reads, Oxford Nanopore, Long Read, Recent Developments, Genomic Regions

Summary: The precisionFDA Truth Challenge V2 aimed to assess the state of the art of variant calling in difficult-to-map regions and the Major Histocompatibility Complex (MHC). Starting with FASTQ files, 20 challenge participants applied their variant calling pipelines and submitted 64 variant callsets for one or more sequencing technologies (~35X Illumina, ~35X PacBio HiFi, and ~50X Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with the new GIAB benchmark sets and genome stratifications. Challenge submissions included a number of innovative methods for all three technologies, with graph-based and machine-learning methods scoring best for short-read and long-read datasets, respectively. New methods outperformed the 2016 Truth Challenge winners, and new machine-learning approaches combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.


SVIM: Structural Variant Identification using Mapped Long Reads

DOI: 10.1101/494096, 2018, Cited By ~ 2

Author(s): David Heller, Martin Vingron

Keyword(s): Single Molecule, Simulated Data, Structural Variants, Human Phenotype, Structural Variant, Small Indels, Sequencing Technologies, Long Reads, Oxford Nanopore, Long Read

Abstract

Motivation: Structural variants are defined as genomic variants larger than 50 bp. They have been shown to affect more bases in any given genome than SNPs or small indels. Additionally, they have a great impact on human phenotype and diversity and have been linked to numerous diseases. Due to their size and association with repeats, they are difficult to detect by shotgun sequencing, especially when based on short reads. Long-read, single-molecule sequencing technologies like those offered by Pacific Biosciences or Oxford Nanopore Technologies produce reads with a length of several thousand base pairs. Despite the higher error rate and sequencing cost, long-read sequencing offers many advantages for the detection of structural variants. Yet, available software tools still do not fully exploit the possibilities.

Results: We present SVIM, a tool for the sensitive detection and precise characterization of structural variants from long-read data. SVIM consists of three components for the collection, clustering, and combination of structural variant signatures from read alignments. It discriminates five different variant classes, including similar types such as tandem and interspersed duplications and novel element insertions. SVIM is unique in its capability of extracting both the genomic origin and destination of duplications. It compares favorably with existing tools in evaluations on simulated data and real datasets from PacBio and Nanopore sequencing machines.

Availability and implementation: The source code and executables of SVIM are available on GitHub: github.com/eldariont/svim. SVIM has been implemented in Python 3 and published on bioconda and the Python Package Index.
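To make the notion of a structural variant signature in a read alignment concrete, the sketch below extracts large insertions and deletions from a single CIGAR string. It is not SVIM's code: the function name, the tuple layout, and the 50 bp default threshold (taken from the definition in the abstract) are assumptions for illustration, and it covers only signatures inside one alignment, not those found between the split alignments of a read.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_signatures(ref_start, cigar, min_size=50):
    """Collect large insertions and deletions from one alignment.

    ref_start is the 0-based reference position where the alignment begins.
    Returns a list of (sv_type, ref_pos, length) tuples.
    """
    signatures = []
    ref_pos = ref_start
    for length_str, op in CIGAR_OP.findall(cigar):
        length = int(length_str)
        if op == "D" and length >= min_size:
            signatures.append(("DEL", ref_pos, length))
        elif op == "I" and length >= min_size:
            signatures.append(("INS", ref_pos, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return signatures
```

A caller would then cluster such signatures across many reads before reporting a variant, which corresponds to the clustering and combination components named in the abstract.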


Correcting palindromes in long reads after whole-genome amplification

DOI: 10.1101/173872, 2017, Cited By ~ 1

Author(s): Sven Warris, Elio Schijlen, Henri van de Geest, Rahulsimham Vegesna, Thamara Hesselink, ...

Keyword(s): Whole Genome Amplification, De Novo, Single Cells, Whole Genome, Read Mapping, Genome Amplification, Long Reads, Oxford Nanopore, Long Read, Real World Datasets

Abstract: Next-generation sequencing requires sufficient DNA to be available. If the amount is limited, whole-genome amplification is applied to generate additional DNA. Such amplification often results in many chimeric DNA fragments, in particular artificial palindromic sequences, which limit the usefulness of long reads from technologies such as PacBio and Oxford Nanopore. Here, we present Pacasus, a tool for correcting such errors in long reads. We demonstrate on two real-world datasets that it markedly improves subsequent read mapping and de novo assembly, yielding results similar to those that would be obtained with non-amplified DNA. With Pacasus, long-read technologies become readily available for sequencing targets with very small amounts of DNA, such as single cells or even single chromosomes.


Structural variants identified by Oxford Nanopore PromethION sequencing of the human genome

DOI: 10.1101/434118, 2018, Cited By ~ 9

Author(s): De Coster Wouter, De Roeck Arne, De Pooter Tim, D’Hert Svenn, De Rijk Peter, ...

Keyword(s): Variant Calling, High Sensitivity, Structural Variants, Computationally Efficient, Sequencing Platform, Structural Variant, Oxford Nanopore, Sequencing Studies, Long Read

Abstract: We sequenced the Yoruban NA19240 genome on the long-read sequencing platform Oxford Nanopore PromethION for benchmarking and evaluation of recently published aligners and structural variant calling tools. In this work, we determine precision and recall, present high-confidence and high-sensitivity variant call sets, and discuss optimal parameters. The aligner Minimap2 and the structural variant caller Sniffles are both the most accurate and the most computationally efficient tools in our study. We describe our scalable workflow for identification, annotation, and characterization of tens of thousands of structural variants from long-read genome sequencing of an individual or population. By discussing the results for this genome, we provide an approximation of what can be expected in future long-read sequencing studies aiming for structural variant identification.


Ratatosk: hybrid error correction of long reads enables accurate variant calling and assembly

Genome Biology, DOI: 10.1186/s13059-020-02244-4, 2021, Vol 22 (1)

Author(s): Guillaume Holley, Doruk Beyter, Helga Ingimundardottir, Peter L. Møller, Snædis Kristmundsdottir, ...

Keyword(s): Error Correction, Human Genome, Error Rate, Variant Calling, High Error Rate, Sequencing Data, Short Read, Long Reads, Median Error, Long Read

Abstract: A major challenge of long-read sequencing data is their high error rate of up to 15%. We present Ratatosk, a method to correct long reads with short-read data. We demonstrate on 5 human genome trios that Ratatosk reduces the error rate of long reads 6-fold on average, with a median error rate as low as 0.22%. SNP calls in Ratatosk-corrected reads are nearly 99% accurate, and indel calling accuracy is increased by up to 37%. An assembly of Ratatosk-corrected reads from an Ashkenazi individual yields a contig N50 of 45 Mbp and fewer misassemblies than an assembly of PacBio HiFi reads.
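For readers unfamiliar with the contig N50 statistic quoted above, the following sketch shows the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name is illustrative and the code is unrelated to Ratatosk itself.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    return 0
```

For example, contigs of lengths 2, 3, 4, 5, and 6 sum to 20; the two longest (6 and 5) already cover 11 of those bases, so the N50 is 5.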
