Dysgu: efficient structural variant calling using short or long reads

Mapping Intimacies ◽

10.1101/2021.05.28.446147 ◽

2021 ◽

Author(s):

Duncan M Baird ◽

Kez Cleal

Keyword(s):

Structural Variation ◽

State Of The Art ◽

Variant Calling ◽

High Sensitivity ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Low Coverage ◽

Paired End Sequencing

Structural variation (SV) plays a fundamental role in genome evolution and can underlie inherited or acquired diseases such as cancer. Long-read sequencing technologies have led to improvements in the characterization of structural variants (SVs), although paired-end sequencing offers better scalability. Here, we present dysgu, which calls SVs or indels using paired-end or long reads. Dysgu detects signals from alignment gaps and from discordant and supplementary mappings, and generates consensus contigs before classifying events using machine learning. Additional SVs are identified by remapping anomalous sequences. Dysgu outperforms existing state-of-the-art tools using paired-end or long reads, offering high sensitivity and precision whilst being among the fastest tools to run. We find that combining low-coverage paired-end and long reads is competitive in performance with long reads at higher coverage.
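One of the signal types mentioned above, alignment gaps, can be illustrated with a minimal sketch. It scans a SAM-style CIGAR string for insertions or deletions above a size threshold. This is not dysgu's implementation: the function name `gap_signals`, the `min_size` default and the output tuple layout are assumptions made for illustration only.

```python
import re

# SAM CIGAR operations: a length followed by one of MIDNSHP=X
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def gap_signals(ref_start, cigar, min_size=30):
    """Return (type, ref_pos, length) for each insertion or deletion
    in the CIGAR string that is at least min_size bases long.

    ref_start is the reference position of the first aligned base.
    The name, threshold and tuple layout are hypothetical."""
    signals = []
    ref_pos = ref_start
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "D" and length >= min_size:
            signals.append(("DEL", ref_pos, length))
        elif op == "I" and length >= min_size:
            signals.append(("INS", ref_pos, length))
        # Only these operations consume reference bases
        if op in "MDN=X":
            ref_pos += length
    return signals

print(gap_signals(100, "50M60D20M40I10M5S"))
```

A real caller would also use discordant and supplementary mappings, which this sketch leaves out.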


SVNN: an efficient PacBio-specific pipeline for structural variations calling using neural networks

BMC Bioinformatics ◽

10.1186/s12859-021-04184-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Shaya Akbarinejad ◽

Mostafa Hadadian Nejad Yousefi ◽

Maziar Goudarzi

Keyword(s):

Structural Variation ◽

State Of The Art ◽

Detection Methods ◽

Sequencing Error ◽

Structural Variations ◽

High Coverage ◽

Long Reads ◽

Long Read ◽

Sensitivity Improvement ◽

Low Coverage

Background: Once aligned, long reads can be a useful source of information for identifying the type and position of structural variations. However, due to the high sequencing error of long reads, long-read structural variation detection methods are far from precise in low-coverage cases. To be accurate, they need high-coverage data, which in turn results in an extremely time-consuming pipeline, especially in the alignment phase. Therefore, it is of utmost importance to have a structural variation calling pipeline that is both fast and precise for low-coverage data. Results: In this paper, we present SVNN, a fast yet accurate structural variation calling pipeline for PacBio long reads that takes raw reads as input and detects structural variants larger than 50 bp. Our pipeline utilizes state-of-the-art long-read aligners, namely NGMLR and Minimap2, and structural variation callers, namely Sniffles and SVIM. We found that by using a neural network, we can extract features from Minimap2 output to detect a subset of reads that provide useful information for structural variation detection. By mapping only this subset with NGMLR, which is far slower than Minimap2 but better serves downstream structural variation detection, we can increase the sensitivity in an efficient way. As a result of using multiple tools intelligently, SVNN achieves up to 20 percentage points of sensitivity improvement in comparison with state-of-the-art methods and is three times faster than a naive combination of state-of-the-art tools at almost the same accuracy. Conclusion: Since the prohibitive costs of high-coverage data have impeded long-read applications, SVNN provides users with a much faster structural variation detection platform for PacBio reads with high precision and sensitivity in low-coverage scenarios.


Evaluation of Germline Structural Variant Calling Methods for Nanopore Sequencing Data

Frontiers in Genetics ◽

10.3389/fgene.2021.761791 ◽

2021 ◽

Vol 12 ◽

Author(s):

Davide Bolognini ◽

Alberto Magi

Keyword(s):

Variant Calling ◽

Research Report ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Factors Affecting ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Sequencing Studies ◽

Long Read

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already been proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performances of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges and provide insights into precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.
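The precision and recall mentioned above can be sketched as follows. This is a simplified illustration and not the EViNCe pipeline: the function name `evaluate`, the greedy one-to-one matching rule and the 500 bp `max_dist` tolerance are all assumptions.

```python
def evaluate(calls, truth, max_dist=500):
    """Compute (precision, recall, F1) for SV calls against a truth set.

    calls and truth are lists of (chrom, pos, svtype) tuples. A call is a
    true positive if an unused truth SV of the same type lies on the same
    chromosome within max_dist bp. The matching rule is a hypothetical
    simplification of what benchmarking tools do."""
    unmatched = list(truth)
    tp = 0
    for chrom, pos, svtype in calls:
        for t in unmatched:
            if t[0] == chrom and t[2] == svtype and abs(t[1] - pos) <= max_dist:
                unmatched.remove(t)  # each truth SV can be matched only once
                tp += 1
                break
    fp = len(calls) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if calls else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

truth = [("chr1", 1000, "DEL"), ("chr1", 5000, "INS"), ("chr2", 300, "DEL")]
calls = [("chr1", 1100, "DEL"), ("chr1", 9000, "INS")]
print(evaluate(calls, truth))
```

Real benchmarks typically also compare SV length and, for genotyping, the called genotype. Those checks are omitted here.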


A benchmark of structural variation detection by long reads through a realistic simulated model

10.1101/2020.12.25.424397 ◽

2020 ◽

Author(s):

Nicolas Dierckxsens ◽

Tong Li ◽

Joris R. Vermeesch ◽

Zhi Xie

Keyword(s):

Structural Variation ◽

Rapid Evolution ◽

Detection Methods ◽

Sequencing Data ◽

Simulated Model ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Sequencing Platforms ◽

The Impact

Despite the rapid evolution of new sequencing technologies, structural variation detection remains poorly ascertained. The high discrepancy between the results of structural variant analysis programs makes it difficult to assess their performance on real datasets. Accurate simulations of structural variation distributions and sequencing data of the human genome are crucial for the development and benchmarking of new tools. To gain better insight into the detection of structural variation with long sequencing reads, we created a realistic simulated model to thoroughly compare SV detection methods and the impact of the chosen sequencing technology and sequencing depth. To achieve this, we developed Sim-it, a straightforward tool for the simulation of both structural variation and long-read data. These simulations from Sim-it revealed the strengths and weaknesses of currently available structural variation callers and long-read sequencing platforms. Our findings were also supported by the latest structural variation benchmark set developed by the GIAB Consortium. With these findings, we developed a new method (combiSV) that can combine the results from five different SV callers into a superior call set with increased recall and precision. Both Sim-it and combiSV are open source and can be downloaded at https://github.com/ndierckx/.
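The idea of combining several call sets into one can be illustrated with a simple support-based vote. This is only a sketch of the general technique and not combiSV's algorithm, which the abstract does not describe. The function name `consensus_calls` and the `min_support` and `max_dist` parameters are hypothetical.

```python
def consensus_calls(callsets, min_support=2, max_dist=500):
    """Merge SV calls from several callers and keep those reported by at
    least min_support distinct callers.

    callsets maps a caller name to a list of (chrom, pos, svtype) tuples.
    Calls of the same type on the same chromosome are chained into one
    cluster while consecutive positions are within max_dist bp.
    Returns (chrom, representative_pos, svtype, n_callers) tuples."""
    records = sorted(
        (chrom, svtype, pos, caller)
        for caller, calls in callsets.items()
        for chrom, pos, svtype in calls
    )
    clusters = []
    for chrom, svtype, pos, caller in records:
        last = clusters[-1] if clusters else None
        if (last is not None and last["key"] == (chrom, svtype)
                and pos - last["positions"][-1] <= max_dist):
            last["positions"].append(pos)
            last["callers"].add(caller)
        else:
            clusters.append({"key": (chrom, svtype),
                             "positions": [pos],
                             "callers": {caller}})
    merged = []
    for c in clusters:
        if len(c["callers"]) >= min_support:
            # positions are already sorted; take the middle one as representative
            rep = c["positions"][len(c["positions"]) // 2]
            merged.append((c["key"][0], rep, c["key"][1], len(c["callers"])))
    return merged

callsets = {
    "caller_a": [("chr1", 1000, "DEL"), ("chr1", 8000, "INS")],
    "caller_b": [("chr1", 1050, "DEL")],
    "caller_c": [("chr1", 1020, "DEL"), ("chr1", 8100, "INS")],
}
print(consensus_calls(callsets, min_support=3))
```

Raising `min_support` trades recall for precision. A production merger would weight callers by their per-type accuracy rather than counting them equally.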


Long-read-based Human Genomic Structural Variation Detection with cuteSV

10.1101/780700 ◽

2019 ◽

Cited By ~ 1

Author(s):

Tao Jiang ◽

Bo Liu ◽

Yue Jiang ◽

Junyi Li ◽

Yan Gao ◽

...

Keyword(s):

Structural Variation ◽

High Sensitivity ◽

Structural Variations ◽

Genomic Structural Variation ◽

Long Reads ◽

Detection Approach ◽

Refinement Method ◽

Long Read ◽

Human Genomic ◽

And Performance

Long-read sequencing enables the comprehensive discovery of structural variations (SVs). However, it is still non-trivial to achieve high sensitivity and performance simultaneously due to the complex SV characteristics implied by noisy long reads. Therefore, we propose cuteSV, a sensitive, fast and scalable long-read-based SV detection approach. cuteSV uses tailored methods to collect the signatures of various types of SVs and employs a clustering-and-refinement method to analyze the signatures to implement sensitive SV detection. Benchmarks on real PacBio and ONT datasets demonstrate that cuteSV has better yields and scalability than state-of-the-art tools. cuteSV is available at https://github.com/tjiangHIT/cuteSV.


Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing

Nature Communications ◽

10.1038/s41467-019-12493-y ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 26

Author(s):

Peter Edge ◽

Vikas Bansal

Keyword(s):

Single Molecule ◽

Variant Calling ◽

Small Scale ◽

Whole Genome ◽

Limited Information ◽

Single Nucleotide Variants ◽

Pacific Biosciences ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read

Whole-genome sequencing using sequencing technologies such as Illumina enables the accurate detection of small-scale variants but provides limited information about haplotypes and variants in repetitive regions of the human genome. Single-molecule sequencing (SMS) technologies such as Pacific Biosciences and Oxford Nanopore generate long reads that can potentially address the limitations of short-read sequencing. However, the high error rate of SMS reads makes it challenging to detect small-scale variants in diploid genomes. We introduce a variant calling method, Longshot, which leverages the haplotype information present in SMS reads to accurately detect and phase single-nucleotide variants (SNVs) in diploid genomes. We demonstrate that Longshot achieves very high accuracy for SNV detection using whole-genome Pacific Biosciences data, outperforms existing variant calling methods, and enables variant detection in duplicated regions of the genome that cannot be mapped using short reads.


Economic Genome Assembly from Low Coverage Illumina and Nanopore Data

10.1101/2020.02.07.939454 ◽

2020 ◽

Author(s):

Thomas Gatter ◽

Sarah von Löhneysen ◽

Polina Drozdova ◽

Tom Hartmann ◽

Peter F. Stadler

Keyword(s):

Genomic Sequence ◽

State Of The Art ◽

Fruit Fly ◽

Computational Effort ◽

Maximum Weight ◽

New Approach ◽

Short Read ◽

Long Reads ◽

Long Read ◽

Low Coverage

We describe a new approach to assemble genomes from a combination of low-coverage short and long reads. LazyBastard starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs, which are then reduced to a long-read overlap graph G. Edges are removed from G to obtain first a consistent orientation and then a DAG. Using heuristics based on properties of proper interval graphs, contigs are extracted as maximum weight paths. These are translated into genomic sequence only in the final step. A prototype implementation of LazyBastard, written entirely in Python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes compared to state-of-the-art pipelines but also requires much less computational effort. Funding: RSF / Helmholtz Association 18-44-06201; Deutscher Akademischer Austauschdienst, DFG STA 850/19-2 within SPP 1738; German Federal Ministry of Education and Research 031A538A, de.NBI-RBC
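Extracting a maximum weight path from a DAG, as mentioned above, is a standard dynamic program over a topological order. The sketch below shows that textbook technique only. It is not LazyBastard's heuristic, and the function name `max_weight_path`, the edge-list input format and the assumption of non-negative weights are illustrative choices.

```python
from collections import defaultdict, deque

def max_weight_path(edges):
    """Return (weight, path) of a maximum weight path in a DAG.

    edges is an iterable of (u, v, weight) with non-negative weights.
    Raises ValueError if the graph contains a cycle."""
    adj = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        adj[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0, []

    # Kahn's algorithm for a topological order
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")

    # best[n] is the weight of the heaviest path ending at n
    best = {n: 0 for n in nodes}
    prev = {n: None for n in nodes}
    for u in order:
        for v, w in adj[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u

    # Trace back from the node with the heaviest incoming path
    node = max(order, key=lambda n: best[n])
    total = best[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return total, path[::-1]

edges = [("a", "b", 3), ("b", "c", 2), ("a", "c", 4), ("c", "d", 1)]
print(max_weight_path(edges))
```

In an assembly setting the nodes would be reads and the weights overlap scores. To obtain several contigs, the extraction would be repeated after removing the chosen path.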


Symphonizing pileup and full-alignment for deep learning-based long-read variant calling

10.1101/2021.12.29.474431 ◽

2021 ◽

Author(s):

Zhenxian Zheng ◽

Shumin Li ◽

Junhao Su ◽

Amy Wing-Sze Leung ◽

Tak-Wah Lam ◽

...

Keyword(s):

Deep Learning ◽

State Of The Art ◽

Variant Calling ◽

The Other ◽

Snp Calling ◽

Long Reads ◽

Pile Up ◽

Long Read

Deep learning-based variant callers are becoming the standard and have achieved superior SNP calling performance using long reads. In this paper, we present Clair3, which makes the best of two major method categories: pile-up calling handles most variant candidates with speed, and full-alignment tackles complicated candidates to maximize precision and recall. Clair3 ran faster than any of the other state-of-the-art variant callers and performed the best, especially at lower coverage.


precisionFDA Truth Challenge V2: Calling variants from short- and long-reads in difficult-to-map regions

10.1101/2020.11.13.380741 ◽

2020 ◽

Cited By ~ 2

Author(s):

Nathan D. Olson ◽

Justin Wagner ◽

Jennifer McDaniel ◽

Sarah H. Stephens ◽

Samuel T. Westreich ◽

...

Keyword(s):

Machine Learning ◽

Variant Calling ◽

Learning Approaches ◽

Sequencing Technologies ◽

Innovative Methods ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Recent Developments ◽

Genomic Regions

The precisionFDA Truth Challenge V2 aimed to assess the state-of-the-art of variant calling in difficult-to-map regions and the Major Histocompatibility Complex (MHC). Starting with FASTQ files, 20 challenge participants applied their variant calling pipelines and submitted 64 variant callsets for one or more sequencing technologies (~35X Illumina, ~35X PacBio HiFi, and ~50X Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with the new GIAB benchmark sets and genome stratifications. Challenge submissions included a number of innovative methods for all three technologies, with graph-based and machine-learning methods scoring best for short-read and long-read datasets, respectively. New methods out-performed the 2016 Truth Challenge winners, and new machine-learning approaches combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.


deSAMBA: fast and accurate classification of metagenomics long reads with sparse approximate matches

10.1101/736777 ◽

2019 ◽

Author(s):

Gaoyang Li ◽

Bo Liu ◽

Yadong Wang

Keyword(s):

State Of The Art ◽

Supplementary Information ◽

Alignment Algorithm ◽

Classification Approach ◽

Fast Speed ◽

Sequencing Technologies ◽

Long Reads ◽

Good Classification ◽

Long Read

Summary: Long-read sequencing technologies are promising for metagenomics studies. However, there is still a lack of read classification tools that can quickly and accurately identify the taxonomies of noisy long reads, which is a bottleneck to the use of long-read sequencing. Herein, we propose deSAMBA, a tailored long-read classification approach that uses a novel sparse approximate match block (SAMB)-based pseudo-alignment algorithm. Benchmarks on real datasets demonstrate that deSAMBA simultaneously achieves high speed and good classification yields, outperforming state-of-the-art tools, and has great potential for cutting-edge metagenomics studies. Availability and implementation: https://github.com/hitbc/deSAMBA.


The application of Nanopore sequencing for variant calling on the human mitochondrial DNA

Biological Communications ◽

10.21638/spbu03.2021.202 ◽

2021 ◽

Vol 66 (2) ◽

Author(s):

Anton Shikov ◽

Viktoriya Tsay ◽

Mikhail Fedyakov ◽

Yuri Eismont ◽

Alena Rudnik ◽

...

Keyword(s):

Mitochondrial Dna ◽

False Negative ◽

Variant Calling ◽

Illumina Miseq ◽

Routine Practice ◽

Jaccard Coefficient ◽

High Coverage ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read

The emergence of long-read sequencing technologies has made a revolutionary step in genome biology and medicine. However, long reads are characterized by a relatively high error rate, impairing their usage for variant calling as a part of routine practice. Thus, we here examine different popular variant callers on long-read sequences of the human mitochondrial genome, convenient in terms of small size and easily obtained high coverage. The sequencing of mitochondrial DNA from 8 patients was conducted via Illumina (MiSeq) and the Oxford Nanopore platform (MinION), with the former utilized as a gold standard when evaluating variant calling’s accuracy. We used a conventional GATK3-BWA-based pipeline for paired-end reads and Guppy basecaller coupled with minimap2 for MinION data, respectively. We then compared the outputs of Clairvoyante, Nanopolish, GATK3, Longshot, DeepVariant, and Varscan tools applied on long-read alignments by analyzing false-positive and false-negative rates. While for most callers, raw signals represented false positives due to homopolymeric errors, Nanopolish demonstrated both high similarity (Jaccard coefficient of 0.82) and a comparable number of calls with the Illumina data (140 vs. 154) with the best performance according to AUC (area under ROC curve, 0.953) as well. In sum, our results, despite being obtained from a small dataset, provide evidence that sufficient coverage coupled with an optimal pipeline could make long reads of mitochondrial DNA applicable for variant calling.
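The Jaccard coefficient used above to compare callsets is the size of the intersection divided by the size of the union. A minimal sketch follows. The variant key layout `(chrom, pos, ref, alt)` and the example variants are hypothetical and are not data from the study.

```python
def jaccard(calls_a, calls_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| between two variant sets.

    Each variant is a hashable key such as (chrom, pos, ref, alt).
    Two empty sets are treated as identical (coefficient 1.0)."""
    a, b = set(calls_a), set(calls_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical example variants, for illustration only
illumina = {("chrM", 73, "A", "G"), ("chrM", 263, "A", "G"), ("chrM", 750, "A", "G")}
nanopore = {("chrM", 73, "A", "G"), ("chrM", 263, "A", "G"), ("chrM", 1438, "A", "G")}
print(jaccard(illumina, nanopore))
```

Because the key includes both alleles, a call at the right position with the wrong alternate allele counts as a mismatch.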
