SVNN: an efficient PacBio-specific pipeline for structural variations calling using neural networks

Abstract

Background: Once aligned, long reads can be a useful source of information for identifying the type and position of structural variations. However, because of the high sequencing error rate of long reads, long-read structural variation detection methods are far from precise in low-coverage cases. To be accurate, they need high-coverage data, which in turn makes the pipeline extremely time-consuming, especially in the alignment phase. It is therefore important to have a structural variation calling pipeline that is both fast and precise for low-coverage data.

Results: In this paper, we present SVNN, a fast yet accurate structural variation calling pipeline for PacBio long reads that takes raw reads as input and detects structural variants larger than 50 bp. Our pipeline uses state-of-the-art long-read aligners, namely NGMLR and Minimap2, and structural variation callers, namely Sniffles and SVIM. We found that a neural network can use features extracted from Minimap2 output to detect a subset of reads that provide useful information for structural variation detection. By mapping only this subset with NGMLR, which is far slower than Minimap2 but better serves downstream structural variation detection, we can increase sensitivity efficiently. As a result of using multiple tools intelligently, SVNN achieves up to 20 percentage points of sensitivity improvement over state-of-the-art methods and is three times faster than a naive combination of state-of-the-art tools that achieves almost the same accuracy.

Conclusion: Because the prohibitive cost of high-coverage data has impeded long-read applications, SVNN provides users with a much faster structural variation detection platform for PacBio reads, with high precision and sensitivity in low-coverage scenarios.
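The read-selection step described above can be illustrated with a minimal sketch. It assumes Minimap2 output in PAF format (query name, length, start, and end in the first four columns) and replaces the neural network with a hypothetical hand-written rule, so it shows only the shape of the idea, not SVNN's actual classifier or features. The function names, the `min_cov` threshold, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def read_features(paf_lines):
    """Summarise PAF records per read: (number of alignments, aligned fraction)."""
    stats = defaultdict(lambda: {"n_aln": 0, "aligned": 0, "length": 0})
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        name, qlen, qstart, qend = fields[0], int(fields[1]), int(fields[2]), int(fields[3])
        entry = stats[name]
        entry["n_aln"] += 1
        entry["aligned"] += qend - qstart
        entry["length"] = qlen
    return {
        name: (e["n_aln"], min(1.0, e["aligned"] / e["length"]))
        for name, e in stats.items()
    }


def select_reads(features, min_cov=0.9):
    """Hypothetical stand-in for the classifier: keep split or partly aligned reads."""
    return sorted(
        name for name, (n_aln, cov) in features.items() if n_aln > 1 or cov < min_cov
    )


# Toy PAF records (12 mandatory columns); values are invented for illustration.
paf = [
    "r1\t1000\t0\t1000\t+\tchr1\t5000\t100\t1100\t950\t1000\t60",
    "r2\t1000\t0\t400\t+\tchr1\t5000\t2000\t2400\t380\t400\t60",
    "r2\t1000\t400\t1000\t+\tchr1\t5000\t3000\t3600\t570\t600\t60",
    "r3\t1000\t0\t500\t-\tchr1\t5000\t4000\t4500\t480\t500\t30",
]
selected = select_reads(read_features(paf))
print(selected)
```

In this toy input, `r1` aligns end to end in a single piece and is skipped, while the split read `r2` and the half-aligned read `r3` are passed on for slower realignment.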

Dysgu: efficient structural variant calling using short or long reads

10.1101/2021.05.28.446147, 2021

Author(s): Duncan M Baird, Kez Cleal

Keyword(s): Structural Variation, State Of The Art, Variant Calling, High Sensitivity, Sequencing Technologies, Long Reads, Long Read, Low Coverage, Paired End Sequencing

Structural variation (SV) plays a fundamental role in genome evolution and can underlie inherited or acquired diseases such as cancer. Long-read sequencing technologies have led to improvements in the characterization of structural variants (SVs), although paired-end sequencing offers better scalability. Here, we present dysgu, which calls SVs or indels using paired-end or long reads. Dysgu detects signals from alignment gaps, discordant and supplementary mappings, and generates consensus contigs, before classifying events using machine learning. Additional SVs are identified by remapping of anomalous sequences. Dysgu outperforms existing state-of-the-art tools using paired-end or long-reads, offering high sensitivity and precision whilst being among the fastest tools to run. We find that combining low coverage paired-end and long-reads is competitive in terms of performance with long-reads at higher coverage values.
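One of the signals mentioned above, alignment gaps, can be read directly from a CIGAR string. The sketch below is an illustration under stated assumptions, not dysgu's implementation: the function name and the `min_size` threshold are hypothetical, and only large insertions and deletions within a single alignment are reported.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_gaps(ref_start, cigar, min_size=30):
    """Return (type, reference position, size) for large gaps in one alignment."""
    events = []
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "D" and length >= min_size:
            events.append(("DEL", pos, length))
        elif op == "I" and length >= min_size:
            events.append(("INS", pos, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            pos += length
    return events


events = cigar_gaps(1000, "50S200M120D300M75I100M")
print(events)
```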

A benchmark of structural variation detection by long reads through a realistic simulated model

10.1101/2020.12.25.424397, 2020

Author(s): Nicolas Dierckxsens, Tong Li, Joris R. Vermeesch, Zhi Xie

Keyword(s): Structural Variation, Rapid Evolution, Detection Methods, Sequencing Data, Simulated Model, Sequencing Technologies, Long Reads, Long Read, Sequencing Platforms, The Impact

Abstract: Despite the rapid evolution of new sequencing technologies, structural variation detection remains poorly ascertained. The high discrepancy between the results of structural variant analysis programs makes it difficult to assess their performance on real datasets. Accurate simulations of structural variation distributions and sequencing data of the human genome are crucial for the development and benchmarking of new tools. To gain better insight into the detection of structural variation with long sequencing reads, we created a realistic simulated model to thoroughly compare SV detection methods and the impact of the chosen sequencing technology and sequencing depth. To achieve this, we developed Sim-it, a straightforward tool for the simulation of both structural variation and long-read data. These simulations from Sim-it revealed the strengths and weaknesses of currently available structural variation callers and long-read sequencing platforms. Our findings were also supported by the latest structural variation benchmark set developed by the GIAB Consortium. With these findings, we developed a new method (combiSV) that can combine the results from five different SV callers into a superior call set with increased recall and precision. Both Sim-it and combiSV are open source and can be downloaded at https://github.com/ndierckx/.
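The idea of combining several callers' results into one call set can be sketched as a simple voting scheme. This is not combiSV's actual algorithm, which the abstract does not describe; the function name, the `max_dist` and `min_support` parameters, and the caller names are hypothetical, and calls are reduced to a chromosome, a position, and a type.

```python
def merge_calls(callsets, max_dist=100, min_support=2):
    """Keep SVs reported by at least min_support callers.

    callsets maps a caller name to a list of (chrom, pos, svtype) tuples.
    Returns (chrom, representative pos, svtype, number of supporting callers).
    """
    # Sort so that nearby calls of the same type on the same chromosome are adjacent.
    calls = sorted(
        (chrom, svtype, pos, caller)
        for caller, items in callsets.items()
        for chrom, pos, svtype in items
    )
    groups = []
    for chrom, svtype, pos, caller in calls:
        last = groups[-1] if groups else None
        if last and last["key"] == (chrom, svtype) and pos - last["positions"][-1] <= max_dist:
            last["positions"].append(pos)
            last["callers"].add(caller)
        else:
            groups.append({"key": (chrom, svtype), "positions": [pos], "callers": {caller}})
    merged = []
    for group in groups:
        if len(group["callers"]) >= min_support:
            positions = group["positions"]
            # Middle element of the sorted positions (upper median for even counts).
            representative = positions[len(positions) // 2]
            chrom, svtype = group["key"]
            merged.append((chrom, representative, svtype, len(group["callers"])))
    return merged


callsets = {
    "callerA": [("chr1", 1000, "DEL"), ("chr1", 5000, "INS")],
    "callerB": [("chr1", 1020, "DEL")],
    "callerC": [("chr1", 990, "DEL"), ("chr1", 5050, "INS"), ("chr2", 300, "DUP")],
}
merged = merge_calls(callsets)
print(merged)
```

The duplication on `chr2` is reported by a single caller and is dropped, which is how a voting scheme trades a little recall for precision.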

Long-read-based Human Genomic Structural Variation Detection with cuteSV

10.1101/780700, 2019, Cited By ~ 1

Author(s): Tao Jiang, Bo Liu, Yue Jiang, Junyi Li, Yan Gao, ...

Keyword(s): Structural Variation, High Sensitivity, Structural Variations, Genomic Structural Variation, Long Reads, Detection Approach, Refinement Method, Long Read, Human Genomic, And Performance

Abstract: Long-read sequencing enables the comprehensive discovery of structural variations (SVs). However, it is still non-trivial to achieve high sensitivity and performance simultaneously because of the complex SV characteristics implied by noisy long reads. Therefore, we propose cuteSV, a sensitive, fast and scalable long-read-based SV detection approach. cuteSV uses tailored methods to collect the signatures of various types of SVs and employs a clustering-and-refinement method to analyze the signatures and achieve sensitive SV detection. Benchmarks on real PacBio and ONT datasets demonstrate that cuteSV has better yields and scalability than state-of-the-art tools. cuteSV is available at https://github.com/tjiangHIT/cuteSV.
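A generic clustering-and-refinement pass over SV signatures might look like the sketch below. It is a simplified illustration, not cuteSV's method: signatures are reduced to (position, length) pairs, and the `max_gap`, `min_support`, and `size_tol` parameters are hypothetical.

```python
from statistics import median


def cluster_signatures(signatures, max_gap=50):
    """Group (pos, length) signatures whose sorted positions lie within max_gap."""
    clusters = []
    for pos, length in sorted(signatures):
        if clusters and pos - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append((pos, length))
        else:
            clusters.append([(pos, length)])
    return clusters


def refine(cluster, min_support=3, size_tol=0.3):
    """Drop size outliers, then report (pos, length, support) or None."""
    typical = median(length for _, length in cluster)
    kept = [(p, l) for p, l in cluster if abs(l - typical) <= size_tol * typical]
    if len(kept) < min_support:
        return None
    return (
        round(median(p for p, _ in kept)),
        round(median(l for _, l in kept)),
        len(kept),
    )


# Toy signatures: three consistent reads, one size outlier, and a weak second site.
signatures = [(1000, 300), (1010, 310), (995, 290), (1020, 900), (4000, 60), (4005, 62)]
calls = [c for c in map(refine, cluster_signatures(signatures)) if c is not None]
print(calls)
```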

Economic Genome Assembly from Low Coverage Illumina and Nanopore Data

10.1101/2020.02.07.939454, 2020

Author(s): Thomas Gatter, Sarah von Löhneysen, Polina Drozdova, Tom Hartmann, Peter F. Stadler

Keyword(s): Genomic Sequence, State Of The Art, Fruit Fly, Computational Effort, Maximum Weight, New Approach, Short Read, Long Reads, Long Read, Low Coverage

Abstract: We describe a new approach to assemble genomes from a combination of low-coverage short and long reads. LazyBastard starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs, which is then reduced to a long-read overlap graph G. Edges are removed from G to obtain first a consistent orientation and then a DAG. Using heuristics based on properties of proper interval graphs, contigs are extracted as maximum weight paths. These are translated into genomic sequence only in the final step. A prototype implementation of LazyBastard, written entirely in Python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes than state-of-the-art pipelines but also requires much less computational effort.

Funding: RSF / Helmholtz Association 18-44-06201; Deutscher Akademischer Austauschdienst, DFG STA 850/19-2 within SPP 1738; German Federal Ministry of Education and Research 031A538A, de.NBI-RBC
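Extracting a maximum weight path from a DAG, the final graph step mentioned above, can be done with dynamic programming over a topological order. The sketch below shows the textbook technique on a toy graph and assumes positive edge weights; it is not LazyBastard's code, and the heuristics based on proper interval graphs are not modeled.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def max_weight_path(edges):
    """edges maps (u, v) to a positive weight in a DAG. Returns (weight, path)."""
    # Build a node -> predecessors map, which is the form TopologicalSorter expects.
    preds = {}
    for u, v in edges:
        preds.setdefault(v, set()).add(u)
        preds.setdefault(u, set())
    best = {}  # node -> (weight of the best path ending here, predecessor on it)
    for node in TopologicalSorter(preds).static_order():
        best[node] = (0, None)
        for p in preds[node]:
            candidate = best[p][0] + edges[(p, node)]
            if candidate > best[node][0]:
                best[node] = (candidate, p)
    # Trace back from the node where the heaviest path ends.
    end = max(best, key=lambda n: best[n][0])
    path = [end]
    while best[path[-1]][1] is not None:
        path.append(best[path[-1]][1])
    return best[end][0], path[::-1]


edges = {("a", "b"): 3, ("b", "c"): 2, ("a", "c"): 4, ("c", "d"): 5, ("b", "d"): 1}
weight, path = max_weight_path(edges)
print(weight, path)
```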

A long reads-based de-novo assembly of the genome of the Arlee homozygous line reveals chromosomal rearrangements in rainbow trout

G3 Genes|Genome|Genetics, 10.1093/g3journal/jkab052, 2021

Author(s): Guangtu Gao, Susana Magadan, Geoffrey C Waldbieser, Ramey C Youngblood, Paul A Wheeler, ...

Keyword(s): Rainbow Trout, Chromosome Number, Genome Assembly, De Novo Assembly, De Novo, Sequence Data, Structural Variations, High Coverage, Haploid Chromosome Number, Long Reads

Abstract: Currently, there is still a need to improve the contiguity of the rainbow trout reference genome and to use multiple genetic backgrounds that will represent the genetic diversity of this species. The Arlee doubled haploid line originated from a domesticated hatchery strain that was originally collected from the northern California coast. The Canu pipeline was used to generate the Arlee line genome de-novo assembly from high-coverage PacBio long-read sequence data. The assembly was further improved with Bionano optical maps and Hi-C proximity ligation sequence data to generate 32 major scaffolds corresponding to the karyotype of the Arlee line (2N = 64). It is composed of 938 scaffolds with an N50 of 39.16 Mb and a total length of 2.33 Gb, of which ∼95% was in 32 chromosome sequences with only 438 gaps between contigs and scaffolds. In rainbow trout the haploid chromosome number can vary from 29 to 32. In the Arlee karyotype the haploid chromosome number is 32 because chromosomes Omy04, 14 and 25 are divided into six acrocentric chromosomes. Additional structural variations identified in the Arlee genome included the major inversions on chromosomes Omy05 and Omy20 and 15 additional smaller inversions that will require further validation. This is also the first rainbow trout genome assembly that includes a scaffold with the sex-determination gene (sdY) in the chromosome Y sequence. The utility of this genome assembly is demonstrated through the improved annotation of the duplicated genome loci that harbor the IGH genes on chromosomes Omy12 and Omy13.
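The N50 statistic quoted above is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. A minimal sketch of the calculation follows; the lengths are toy values, not the Arlee scaffold lengths.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


toy_lengths = [80, 70, 50, 40, 30, 20]  # total 290, so half is 145
result = n50(toy_lengths)
print(result)
```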

Rapid Mycobacterium tuberculosis spoligotyping from uncorrected long reads using Galru

10.1101/2020.05.31.126490, 2020

Author(s): Andrew J. Page, Nabil-Fareed Alikhan, Michael Strinden, Thanh Le Viet, Timofey Skvortsov

Keyword(s): Mycobacterium Tuberculosis, State Of The Art, Sequence Data, Human Pathogen, Sequencing Data, Short Read, Short Read Sequencing, Long Reads, Long Read

Abstract: Spoligotyping of Mycobacterium tuberculosis provides a subspecies classification of this major human pathogen. Spoligotypes can be predicted from short-read genome sequencing data; however, no methods exist for long-read sequence data such as that from Nanopore or PacBio. We present a novel software package, Galru, which can rapidly detect the spoligotype of a Mycobacterium tuberculosis sample from as little as a single uncorrected long read. It allows near real-time spoligotyping from long-read data as it is being sequenced, giving rapid sample typing. We compare it with existing state-of-the-art software and find that its results are identical to those obtained from short-read sequencing data. Galru is freely available from https://github.com/quadram-institute-bioscience/galru under the GPLv3 open source licence.

HapCHAT: Adaptive haplotype assembly for efficiently leveraging high coverage in long reads

10.1101/170225, 2017

Author(s): Stefano Beretta, Murray D Patterson, Simone Zaccaria, Gianluca Della Vedova, Paola Bonizzoni

Keyword(s): Error Rate, Feasible Solution, State Of The Art, High Coverage, Haplotype Blocks, Haplotype Assembly, Current State, Error Corrections, Long Reads, Computational Resources

Abstract

Background: Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes owing to their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue that may be mitigated with larger sets of reads when this error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads deal only with limited coverages.

Results: Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverages. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This allows our method to significantly reduce the required computational resources, making it possible to consider datasets with higher coverages. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis on sequencing reads with up to 60× coverage reveals improvements in accuracy and recall achieved by considering a higher coverage with lower runtimes.

Conclusions: Our method leverages the long-range information of sequencing reads, which makes it possible to obtain assembled haplotypes fragmented into fewer unphased haplotype blocks. At the same time, our method is also able to deal with higher coverages to better correct the errors in the original reads and to obtain more accurate haplotypes as a result.

Availability: HapCHAT is available at http://hapchat.algolab.eu under the GPL license.

A benchmark of structural variation detection by long reads through a realistic simulated model

Genome Biology, 10.1186/s13059-021-02551-4, 2021, Vol 22 (1)

Author(s): Nicolas Dierckxsens, Tong Li, Joris R. Vermeesch, Zhi Xie

Keyword(s): Structural Variation, New Method, Sequencing Data, Simulated Model, Long Reads, Long Read, Sequencing Platforms

Abstract: Accurate simulations of structural variation distributions and sequencing data are crucial for the development and benchmarking of new tools. We develop Sim-it, a straightforward tool for the simulation of both structural variation and long-read data. These simulations from Sim-it reveal the strengths and weaknesses of currently available structural variation callers and long-read sequencing platforms. With these findings, we develop a new method (combiSV) that can combine the results from structural variation callers into a superior call set with increased recall and precision, which is also observed for the latest structural variation benchmark set developed by the GIAB Consortium.

LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants

10.1101/2021.09.09.459623, 2021

Author(s): Jyun-Hong Lin, Liang-Chi Chen, Shu-Qi Yu, Yao-Ting Huang

Keyword(s): Variant Calling, Cost Effective, Nucleotide Polymorphisms, Structural Variations, Single Nucleotide, Chromosome Conformation, Long Reads, Cost Effective Approach, Long Read, Microbial Strains

Abstract: Long-read phasing has been used for reconstructing diploid genomes, improving variant calling, and resolving microbial strains in metagenomics. However, the phasing blocks of existing methods are broken by large structural variations (SVs), and the efficiency is unsatisfactory for population-scale phasing. This paper presents an ultra-fast algorithm, LongPhase, which can simultaneously phase single nucleotide polymorphisms (SNPs) and SVs of a human genome in ∼10-20 minutes, 10x faster than the state-of-the-art WhatsHap and Margin. In particular, LongPhase produces much larger phased blocks at almost chromosome level with only long reads (N50 = 26 Mbp). We demonstrate that LongPhase combined with Nanopore is a cost-effective approach for providing chromosome-scale phasing without the need for additional trios, chromosome-conformation, and single-cell strand-seq data.

cloudSPAdes: assembly of synthetic long reads using de Bruijn graphs

Bioinformatics, 10.1093/bioinformatics/btz349, 2019, Vol 35 (14), pp. i61-i70, Cited By ~ 4

Author(s): Ivan Tolstoganov, Anton Bankevich, Zhoutao Chen, Pavel A Pevzner

Keyword(s): Narrow Range, State Of The Art, Supplementary Information, De Bruijn Graph, Hybrid Assembly, De Bruijn Graphs, Long Reads, Long Read, De Bruijn, New Applications

Abstract

Motivation: The recently developed barcoding-based synthetic long read (SLR) technologies have already found many applications in genome assembly and analysis. However, although some new barcoding protocols are emerging and the range of SLR applications is being expanded, the existing SLR assemblers are optimized for a narrow range of parameters and are not easily extendable to new barcoding technologies and new applications such as metagenomics or hybrid assembly.

Results: We describe the algorithmic challenge of the SLR assembly and present a cloudSPAdes algorithm for SLR assembly that is based on analyzing the de Bruijn graph of SLRs. We benchmarked cloudSPAdes across various barcoding technologies/applications and demonstrated that it improves on the state-of-the-art SLR assemblers in accuracy and speed.

Availability and implementation: Source code and installation manual for cloudSPAdes are available at https://github.com/ablab/spades/releases/tag/cloudspades-paper.

Supplementary Information: Supplementary data are available at Bioinformatics online.
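For readers unfamiliar with the underlying data structure, the sketch below builds a plain de Bruijn graph from reads: every k-mer contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix. It is a textbook construction only. Barcode information, which is central to SLR assembly, is not modeled, and the function name and toy read are assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix; repeats keep multiplicity.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


graph = de_bruijn(["ACGTAC"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```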
