Fast alignment of reads to a variation graph with application to SNP detection

Abstract Sequencing technologies has provided the basis of most modern genome sequencing studies due to its high base-level accuracy and relatively low cost. One of the most demanding step is mapping reads to the human reference genome. The reliance on a single reference human genome could introduce substantial biases in downstream analyses. Pangenomic graph reference representations offer an attractive approach for storing genetic variations. Moreover, it is possible to include known variants in the reference in order to make read mapping, variant calling, and genotyping variant-aware. Only recently a framework for variation graphs, vg [Garrison E, Adam MN, Siren J, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875–9], have improved variation-aware alignment and variant calling in general. The major bottleneck of vg is its high cost of reads mapping to a variation graph. In this paper we study the problem of SNP calling on a variation graph and we present a fast reads alignment tool, named VG SNP-Aware. VG SNP-Aware is able align reads exactly to a variation graph and detect SNPs based on these aligned reads. The results show that VG SNP-Aware can efficiently map reads to a variation graph with a speedup of 40× with respect to vg and similar accuracy on SNPs detection.

Download Full-text

Evaluation of Germline Structural Variant Calling Methods for Nanopore Sequencing Data

Frontiers in Genetics ◽

10.3389/fgene.2021.761791 ◽

2021 ◽

Vol 12 ◽

Author(s):

Davide Bolognini ◽

Alberto Magi

Keyword(s):

Variant Calling ◽

Research Report ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Factors Affecting ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Sequencing Studies ◽

Long Read

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already been proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performances of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges and provide insights into precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.

Download Full-text

LRez: C ++ API and toolkit for analyzing and managing Linked-Reads data

Bioinformatics Advances ◽

10.1093/bioadv/vbab022 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Claire Lemaitre ◽

Fabrice Legeai

Keyword(s):

Genome Assembly ◽

Low Cost ◽

Variant Calling ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

Dna Molecule ◽

Sequencing Technologies ◽

Wide Range ◽

Genomic Regions

Abstract Motivation Linked-Reads technologies combine both the high-quality and low cost of short-reads sequencing and long-range information, through the use of barcodes tagging reads which originate from a common long DNA molecule. This technology has been employed in a broad range of applications including genome assembly, phasing and scaffolding, as well as structural variant calling. However, to date, no tool or API dedicated to the manipulation of Linked-Reads data exist. Results We introduce LRez, a C ++ API and toolkit which allows easy management of Linked-Reads data. LRez includes various functionalities, for computing numbers of common barcodes between genomic regions, extracting barcodes from BAM files, as well as indexing and querying BAM, FASTQ and gzipped FASTQ files to quickly fetch all reads or alignments containing a given barcode. LRez is compatible with a wide range of Linked-Reads sequencing technologies, and can thus be used in any tool or pipeline requiring barcode processing or indexing, in order to improve their performances. Availability and implementation LRez is implemented in C ++, supported on Unix-based platforms, and available under AGPL-3.0 License at https://github.com/morispi/LRez, and as a bioconda module. Supplementary information Supplementary data are available at Bioinformatics Advances

Download Full-text

Reliable variant calling during runtime of Illumina sequencing

10.1101/387662 ◽

2018 ◽

Cited By ~ 1

Author(s):

Tobias P. Loka ◽

Simon H. Tausch ◽

Bernhard Y. Renard

Keyword(s):

Real Time ◽

Disease Outbreaks ◽

Variant Calling ◽

Read Mapping ◽

Mapping Algorithm ◽

Infectious Disease Outbreaks ◽

Whole Exome ◽

Similar Accuracy ◽

Post Hoc

AbstractThe sequential paradigm of data acquisition and analysis in next-generation sequencing leads to high turnaround times for the generation of interpretable results. We combined a novel real-time read mapping algorithm with fast variant calling to obtain reliable variant calls still during the sequencing process. Thereby, our new algorithm allows for accurate read mapping results for intermediate cycles and supports large reference genomes such as the complete human reference. This enables the combination of real-time read mapping results with complex follow-up analysis. In this study, we showed the accuracy and scalability of our approach by applying real-time read mapping and variant calling to seven publicly available human whole exome sequencing datasets. Thereby, up to 89% of all detected SNPs were already identified after 40 sequencing cycles while showing similar precision as at the end of sequencing. Final results showed similar accuracy to those of conventional post-hoc analysis methods. When compared to standard routines, our live approach enables considerably faster interventions in clinical applications and infectious disease outbreaks. Besides variant calling, our approach can be adapted for a plethora of other mapping-based analyses.

Download Full-text

Comparison of Read Mapping and Variant Calling Tools for the Analysis of Plant NGS Data

Plants ◽

10.3390/plants9040439 ◽

2020 ◽

Vol 9 (4) ◽

pp. 439 ◽

Cited By ~ 3

Author(s):

Hanna Marie Schilbert ◽

Andreas Rempel ◽

Boas Pucker

Keyword(s):

High Throughput Sequencing ◽

Performance Metrics ◽

Model Organism ◽

Variant Calling ◽

Reference Sequence ◽

Read Mapping ◽

The Past ◽

Sequencing Technologies ◽

Plant Sciences ◽

Ngs Data

High-throughput sequencing technologies have rapidly developed during the past years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research. When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.

Download Full-text

Comparison of read mapping and variant calling tools for the analysis of plant NGS data

10.1101/2020.03.10.986059 ◽

2020 ◽

Author(s):

Hanna Marie Schilbert ◽

Andreas Rempel ◽

Boas Pucker

Keyword(s):

High Throughput Sequencing ◽

Model Organism ◽

Variant Calling ◽

Reference Sequence ◽

Read Mapping ◽

The Past ◽

Sequencing Technologies ◽

Plant Sciences ◽

Ngs Data ◽

Real Plant

AbstractHigh-throughput sequencing technologies have rapidly developed during the past years and became an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research. When looking at different performance metrices, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.

Download Full-text

Reliable variant calling during runtime of Illumina sequencing

Scientific Reports ◽

10.1038/s41598-019-52991-z ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Tobias P. Loka ◽

Simon H. Tausch ◽

Bernhard Y. Renard

Keyword(s):

Real Time ◽

Disease Outbreaks ◽

Variant Calling ◽

Read Mapping ◽

Mapping Algorithm ◽

Infectious Disease Outbreaks ◽

Whole Exome ◽

Similar Accuracy ◽

Post Hoc

Abstract The sequential paradigm of data acquisition and analysis in next-generation sequencing leads to high turnaround times for the generation of interpretable results. We combined a novel real-time read mapping algorithm with fast variant calling to obtain reliable variant calls still during the sequencing process. Thereby, our new algorithm allows for accurate read mapping results for intermediate cycles and supports large reference genomes such as the complete human reference. This enables the combination of real-time read mapping results with complex follow-up analysis. In this study, we showed the accuracy and scalability of our approach by applying real-time read mapping and variant calling to seven publicly available human whole exome sequencing datasets. Thereby, up to 89% of all detected SNPs were already identified after 40 sequencing cycles while showing similar precision as at the end of sequencing. Final results showed similar accuracy to those of conventional post-hoc analysis methods. When compared to standard routines, our live approach enables considerably faster interventions in clinical applications and infectious disease outbreaks. Besides variant calling, our approach can be adapted for a plethora of other mapping-based analyses.

Download Full-text

Sensitive alignment using paralogous sequence variants improves long-read mapping and variant calling in segmental duplications

Nucleic Acids Research ◽

10.1093/nar/gkaa829 ◽

2020 ◽

Vol 48 (19) ◽

pp. e114-e114

Author(s):

Timofey Prodanov ◽

Vikas Bansal

Keyword(s):

Probabilistic Method ◽

Variant Calling ◽

Sequence Variants ◽

Segmental Duplications ◽

High Confidence ◽

Read Mapping ◽

Paralogous Sequence ◽

Short Read ◽

Sequencing Technologies ◽

Long Read

Abstract The ability to characterize repetitive regions of the human genome is limited by the read lengths of short-read sequencing technologies. Although long-read sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies can potentially overcome this limitation, long segmental duplications with high sequence identity pose challenges for long-read mapping. We describe a probabilistic method, DuploMap, designed to improve the accuracy of long-read mapping in segmental duplications. It analyzes reads mapped to segmental duplications using existing long-read aligners and leverages paralogous sequence variants (PSVs)—sequence differences between paralogous sequences—to distinguish between multiple alignment locations. On simulated datasets, DuploMap increased the percentage of correctly mapped reads with high confidence for multiple long-read aligners including Minimap2 (74.3–90.6%) and BLASR (82.9–90.7%) while maintaining high precision. Across multiple whole-genome long-read datasets, DuploMap aligned an additional 8–21% of the reads in segmental duplications with high confidence relative to Minimap2. Using DuploMap-aligned PacBio circular consensus sequencing reads, an additional 8.9 Mb of DNA sequence was mappable, variant calling achieved a higher F1 score and 14 713 additional variants supported by linked-read data were identified. Finally, we demonstrate that a significant fraction of PSVs in segmental duplications overlaps with variants and adversely impacts short-read variant calling.

Download Full-text

Sensitive alignment using paralogous sequence variants improves long read mapping and variant calling in segmental duplications

10.1101/2020.07.15.202929 ◽

2020 ◽

Cited By ~ 1

Author(s):

Timofey Prodanov ◽

Vikas Bansal

Keyword(s):

Probabilistic Method ◽

Variant Calling ◽

Sequence Variants ◽

Segmental Duplications ◽

High Confidence ◽

Read Mapping ◽

Paralogous Sequence ◽

Short Read ◽

Sequencing Technologies ◽

Long Read

AbstractThe ability to characterize repetitive regions of the human genome is limited by the read lengths of short-read sequencing technologies. Although long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore can potentially overcome this limitation, long segmental duplications with high sequence identity pose challenges for long-read mapping. We describe a probabilistic method, DuploMap, designed to improve the accuracy of long read mapping in segmental duplications. It analyzes reads mapped to segmental duplications using existing long-read aligners and leverages paralogous sequence variants (PSVs) – sequence differences between paralogous sequences – to distinguish between multiple alignment locations. On simulated datasets, Duplomap increased the percentage of correctly mapped reads with high confidence for multiple long-read aligners including Minimap2 (74.3% to 90.6%) and BLASR (82.9% to 90.7%) while maintaining high precision. Across multiple whole-genome long-read datasets, DuploMap aligned an additional 8-21% of the reads in segmental duplications with high confidence relative to Minimap2. Using Duplomap aligned PacBio CCS reads, an additional 8.9 Mbp of DNA sequence was mappable, variant calling achieved a higher F1-score and 14,713 additional variants supported by linked-read data were identified. Finally, we demonstrate that a significant fraction of PSVs in segmental duplications overlap with variants and adversely impact short-read variant calling.

Download Full-text

MALVA: genotyping by Mapping-free ALlele detection of known VAriants

10.1101/575126 ◽

2019 ◽

Cited By ~ 1

Author(s):

Giulia Bernardini ◽

Paola Bonizzoni ◽

Luca Denti ◽

Marco Previtali ◽

Alexander Schönhuth

Keyword(s):

Variant Calling ◽

Genome Project ◽

Human Populations ◽

Genome Data ◽

Variant Discovery ◽

Sequencing Technologies ◽

Order Of Magnitude ◽

Speed Up ◽

Similar Accuracy ◽

Genomic Regions

AbstractThe amount of genetic variation discovered and characterized in human populations is huge, and is growing rapidly with the widespread availability of modern sequencing technologies. Such a great deal of variation data, that accounts for human diversity, leads to various challenging computational tasks, including variant calling and genotyping of newly sequenced individuals. The standard pipelines for addressing these problems include read mapping, which is a computationally expensive procedure. A few mapping-free tools were proposed in recent years to speed up the genotyping process. While such tools have highly efficient run-times, they focus on isolated, bi-allelic SNPs, providing limited support for multi-allelic SNPs, indels, and genomic regions with high variant density.To address these issues, we introduceMALVA, a fast and lightweight mapping-free method to genotype an individual directly from a sample of reads.MALVAis the first mapping-free tool that is able to genotype multi-allelic SNPs and indels, even in high density genomic regions, and to effectively handle a huge number of variants such as those provided by the 1000 Genome Project. An experimental evaluation on whole-genome data shows thatMALVArequires one order of magnitude less time to genotype a donor than alignment-based pipelines, providing similar accuracy. Remarkably, on indels,MALVAprovides even better results than the most widely adopted variant discovery tools.

Download Full-text

Analyzing Low-Level mtDNA Heteroplasmy—Pitfalls and Challenges from Bench to Benchmarking

International Journal of Molecular Sciences ◽

10.3390/ijms22020935 ◽

2021 ◽

Vol 22 (2) ◽

pp. 935

Author(s):

Federica Fazzini ◽

Liane Fendt ◽

Sebastian Schönherr ◽

Lukas Forer ◽

Bernd Schöpf ◽

...

Keyword(s):

Variant Calling ◽

Illumina Miseq ◽

Bioinformatic Analysis ◽

Low Level ◽

Sequencing Technologies ◽

Laboratory Setup ◽

Dna Mixture ◽

The Individual ◽

Highly Sensitive Detection ◽

Sensitivity Specificity

Massive parallel sequencing technologies are promising a highly sensitive detection of low-level mutations, especially in mitochondrial DNA (mtDNA) studies. However, processes from DNA extraction and library construction to bioinformatic analysis include several varying tasks. Further, there is no validated recommendation for the comprehensive procedure. In this study, we examined potential pitfalls on the sequencing results based on two-person mtDNA mixtures. Therefore, we compared three DNA polymerases, six different variant callers in five mixtures between 50% and 0.5% variant allele frequencies generated with two different amplification protocols. In total, 48 samples were sequenced on Illumina MiSeq. Low-level variant calling at the 1% variant level and below was performed by comparing trimming and PCR duplicate removal as well as six different variant callers. The results indicate that sensitivity, specificity, and precision highly depend on the investigated polymerase but also vary based on the analysis tools. Our data highlight the advantage of prior standardization and validation of the individual laboratory setup with a DNA mixture model. Finally, we provide an artificial heteroplasmy benchmark dataset that can help improve somatic variant callers or pipelines, which may be of great interest for research related to cancer and aging.

Download Full-text