Sparc: a sparsity-based consensus algorithm for long erroneous sequencing reads

PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e2016 ◽  
Author(s):  
Chengxi Ye ◽  
Zhanshan (Sam) Ma

Motivation. The third generation sequencing (3GS) technology generates long sequences of thousands of bases. However, its current error rates are estimated in the range of 15–40%, significantly higher than those of the prevalent next generation sequencing (NGS) technologies (less than 1%). Fundamental bioinformatics tasks such as de novo genome assembly and variant calling require high-quality sequences that need to be extracted from these long but erroneous 3GS sequences. Results. We describe a versatile and efficient linear complexity consensus algorithm, Sparc, to facilitate de novo genome assembly. Sparc builds a sparse k-mer graph using a collection of sequences from a targeted genomic region. The heaviest path, which approximates the most likely genome sequence, is searched through a sparsity-induced reweighted graph as the consensus sequence. Sparc supports using NGS and 3GS data together, which leads to significant improvements in both cost efficiency and computational efficiency. Experiments with Sparc show that our algorithm can efficiently provide high-quality consensus sequences using both PacBio and Oxford Nanopore sequencing technologies. With only 30× PacBio data, Sparc can reach a consensus with error rate <0.5%. With the more challenging Oxford Nanopore data, Sparc can also achieve a similar error rate when combined with NGS data. Compared with the existing approaches, Sparc calculates the consensus with higher accuracy and uses approximately 80% less memory and time. The source code is available for download at https://github.com/yechengxi/Sparc.
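
The abstract above describes the sparse k-mer graph and heaviest-path consensus only at a high level. The sketch below is a minimal, illustrative Python rendering of those two ideas on toy, pre-aligned reads; it is not the authors' implementation (Sparc additionally skips bases between k-mers and reweights edges to induce sparsity), and the function names are hypothetical.

```python
from collections import defaultdict

def build_kmer_graph(reads, k=3):
    """Toy position-anchored k-mer graph: nodes are (offset, k-mer) pairs,
    and each edge between consecutive k-mers is weighted by read support."""
    edges = defaultdict(int)                      # (node_u, node_v) -> weight
    for read in reads:
        for i in range(len(read) - k):
            u = (i, read[i:i + k])
            v = (i + 1, read[i + 1:i + 1 + k])
            edges[(u, v)] += 1
    return edges

def heaviest_path(edges):
    """Dynamic programming over the DAG (edges processed in offset order)
    to find the path with the largest total edge weight."""
    best = defaultdict(int)                       # node -> best score reaching it
    back = {}                                     # node -> predecessor on that path
    for (u, v), w in sorted(edges.items(), key=lambda e: e[0][0][0]):
        if best[u] + w > best[v]:
            best[v] = best[u] + w
            back[v] = u
    node = max(best, key=best.get)                # endpoint of the heaviest path
    path = [node]
    while path[-1] in back:
        path.append(back[path[-1]])
    return path[::-1]

def consensus(path):
    """Spell the consensus: the first k-mer plus the last base of each successor."""
    return path[0][1] + "".join(kmer[-1] for _, kmer in path[1:])

reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC"]  # toy 'aligned' reads
print(consensus(heaviest_path(build_kmer_graph(reads))))  # ACGTACGTAC
```

The minority variant in the third read creates a lightly weighted detour in the graph, and the dynamic program follows the better-supported branch, which is the intuition behind consensus-by-heaviest-path.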

2019 ◽  
Author(s):  
David Porubsky ◽  
Peter Ebert ◽  
Peter A. Audano ◽  
Mitchell R. Vollger ◽  
William T. Harvey ◽  
...  

The prevailing genome assembly paradigm is to “collapse” parental haplotypes into a single consensus sequence. Here, we leverage the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing (Strand-seq)1,2 and combine them with high-fidelity (HiFi) long sequencing reads3 in a novel reference-free workflow for diploid de novo genome assembly. Employing this strategy, we produce completely phased de novo genome assemblies separately for each haplotype of a single individual of Puerto Rican origin (HG00733) in the absence of parental data. The assemblies are accurate (QV > 40) and highly contiguous (contig N50 > 25 Mbp), with low switch error rates (0.4%), providing fully phased single-nucleotide variants (SNVs), indels, and structural variants (SVs). A comparison of Oxford Nanopore and PacBio phased assemblies identifies 150 regions that are preferential sites of contig breaks irrespective of sequencing technology or phasing algorithms.
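
For context on the quality metrics quoted above (QV > 40, contig N50 > 25 Mbp), the snippet below shows how contig N50 and a Phred-scaled quality value (QV) are conventionally computed. It is a generic illustration, not part of the authors' workflow, and the example contig lengths are made up.

```python
import math

def contig_n50(lengths):
    """N50: the contig length at which half of the total assembly size
    is contained in contigs of that length or longer."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

def quality_value(per_base_error_rate):
    """Phred-scaled consensus accuracy: QV = -10 * log10(error rate)."""
    return -10 * math.log10(per_base_error_rate)

print(contig_n50([30_000_000, 25_000_000, 10_000_000, 5_000_000]))  # 25000000
print(round(quality_value(1e-4), 1))  # QV 40 corresponds to ~1 error per 10 kb
```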


2015 ◽  
Author(s):  
Chengxi Ye ◽  
Sam Ma

Motivation: The third generation sequencing (3GS) technology generates long sequences of thousands of bases. However, its error rates are estimated in the range of 15–40%, much higher than those of the previous generation (approximately 1%). Fundamental tasks such as genome assembly and variant calling require us to obtain high-quality sequences from these long erroneous sequences. Results: In this paper we describe a versatile and efficient linear complexity consensus algorithm, Sparc, that builds a sparse k-mer graph using a collection of sequences from the same genomic region. The heaviest path approximates the most likely genome sequence (consensus) and is sought through a sparsity-induced reweighted graph. Experiments show that our algorithm can efficiently provide high-quality consensus sequences with error rate <0.5% using both PacBio and Oxford Nanopore sequencing technologies. Compared with the existing approaches, Sparc calculates the consensus with higher accuracy, uses approximately 80% less memory, and is roughly 5× faster. Availability: The source code is available for download at http://sourceforge.net/p/sparc-consensus/code/ and a testing dataset is available at https://www.dropbox.com/sh/trng8vdaeqywx1e/AAASJesLVAJZcbORkU9f4LuBa?dl=0 (copy the link into a browser if clicking it directly fails).


2018 ◽  
Author(s):  
Ou Wang ◽  
Robert Chin ◽  
Xiaofang Cheng ◽  
Michelle Ka Wu ◽  
Qing Mao ◽  
...  

Obtaining accurate sequences from long DNA molecules is very important for genome assembly and other applications. Here we describe single tube long fragment read (stLFR), a technology that enables this at low cost. It is based on adding the same barcode sequence to sub-fragments of the original long DNA molecule (DNA co-barcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically non-redundant co-barcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique co-barcoding of over 8 million 20–300 kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high-quality variant calling and phasing into contigs with N50 up to 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high-quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.
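
The 3.6 billion figure above comes from a combinatorial (split-pool) barcoding scheme. The toy sketch below illustrates the counting argument, assuming three pools of 1,536 barcode segments each; that pool size is an assumption chosen to match the quoted total, not a detail stated in the abstract.

```python
from itertools import product

# If each bead receives one segment from each of three pools, the number of
# distinct composite barcodes is the product of the pool sizes.
pool_sizes = (1536, 1536, 1536)          # assumed pool sizes, not from the abstract
unique_barcodes = 1
for size in pool_sizes:
    unique_barcodes *= size
print(f"{unique_barcodes:,}")             # 3,623,878,656  (~3.6 billion)

# Enumerating composite barcodes from tiny example pools:
pool_a, pool_b, pool_c = ["AAC", "AGT"], ["CCA", "CTG"], ["GGA", "GTC"]
for combo in product(pool_a, pool_b, pool_c):
    print("".join(combo))                 # 2 * 2 * 2 = 8 composite barcodes
```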


2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 23-24
Author(s):  
Kimberly M Davenport ◽  
Derek M Bickhart ◽  
Kim Worley ◽  
Shwetha C Murali ◽  
Noelle Cockett ◽  
...  

Abstract Sheep are an important agricultural species used for both food and fiber in the United States and globally. A high-quality reference genome enhances the ability to discover genetic and biological mechanisms influencing important traits, such as meat and wool quality. The rapid advances in genome assembly algorithms and emergence of increasingly long sequence read lengths provide the opportunity for an improved de novo assembly of the sheep reference genome. Tissue was collected postmortem from an adult Rambouillet ewe selected by USDA-ARS for the Ovine Functional Annotation of Animal Genomes project. Short-read (55x coverage), long-read PacBio (75x coverage), and Hi-C data from this ewe were retrieved from public databases. We generated an additional 50x coverage of Oxford Nanopore data and assembled the combined long-read data with canu v1.9. The assembled contigs were polished with Nanopolish v0.12.5 and scaffolded using Hi-C data with Salsa v2.2. Gaps were filled with PBsuite v15.8.24 and polished with Nanopolish v0.12.5, followed by removal of duplicate contigs with PurgeDups v1.0.1. Chromosomes were oriented by identifying centromeres and telomeres with RepeatMasker v4.1.1, indicating a need to reverse the orientation of chromosome 11 relative to Oar_rambouillet_v1.0. Final polishing was performed with two rounds of a pipeline consisting of freebayes v1.3.1 to call variants, Merfin to validate them, and BCFtools to generate the consensus fasta. The ARS-UI_Ramb_v2.0 assembly has improved continuity (contig N50 of 43.19 Mb), with 19-fold and 38-fold decreases in the number of scaffolds compared with Oar_rambouillet_v1.0 and Oar_v4.0, respectively. ARS-UI_Ramb_v2.0 has greater per-base accuracy and fewer insertions and deletions identified from mapped RNA sequence than previous assemblies. This significantly improved reference assembly, publicly available at NCBI GenBank under accession GCA_016772045, will optimize the functional annotation of the sheep genome and facilitate improved mapping accuracy of genetic variant and expression data for traits relevant to the sheep industry.
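
The orientation step mentioned above (reversing chromosome 11) amounts to replacing a scaffold with its reverse complement. A minimal, generic Python helper for that operation is shown purely for illustration; the assembly itself was produced with the tools named in the abstract.

```python
# Translation table covering the standard bases plus N, upper and lower case.
COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

# Reorienting a scaffold simply means replacing it with its reverse complement.
print(reverse_complement("ACCGTTN"))  # NAACGGT
```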


2020 ◽  
Author(s):  
John M. Sutton ◽  
Janna L. Fierst

Summary: High-quality reference genome sequences are the core of modern genomics. Oxford Nanopore Technologies (ONT) produces inexpensive DNA sequences in excess of 100,000 nucleotides, but error rates remain >10%, and assembling these sequences, particularly for eukaryotes, is a non-trivial problem. To date there has been no comprehensive attempt to establish experimental design for ONT genome sequencing and assembly. Here, we simulate ONT and Illumina DNA sequence reads for Escherichia coli, Caenorhabditis elegans, Arabidopsis thaliana, and Drosophila melanogaster. We quantify the influence of sequencing coverage, assembly software, and experimental design on de novo genome assembly and error correction to predict the optimum sequencing strategy for these organisms. We show proof of concept using real ONT data generated for the nematode Caenorhabditis remanei. ONT sequencing is inexpensive and accessible, and our quantitative results will be helpful for a broad array of researchers seeking guidance for de novo genome assembly projects.
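
Because the study is framed around sequencing coverage, a small generic helper relating sequencing yield, genome size, and depth of coverage is included below for orientation. The genome sizes are rough approximations chosen for illustration, and the code is not the authors' simulation pipeline.

```python
def expected_coverage(total_bases, genome_size_bp):
    """Depth of coverage delivered by a given sequencing yield."""
    return total_bases / genome_size_bp

def required_yield(genome_size_bp, target_coverage):
    """Total sequenced bases needed to reach a target depth of coverage."""
    return genome_size_bp * target_coverage

# Approximate genome sizes, used purely for illustration.
genomes = {"E. coli": 4.6e6, "C. elegans": 100e6, "D. melanogaster": 140e6}
for name, size in genomes.items():
    gb = required_yield(size, target_coverage=60) / 1e9
    print(f"{name}: ~{gb:.1f} Gb of reads for 60x coverage")

print(expected_coverage(total_bases=6e9, genome_size_bp=100e6))  # 60.0x depth
```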


2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Juliane C Dohm ◽  
Philipp Peters ◽  
Nancy Stralis-Pavese ◽  
Heinz Himmelbauer

Abstract Third-generation sequencing technologies provided by Pacific Biosciences and Oxford Nanopore Technologies generate read lengths in the scale of kilobasepairs. However, these reads display high error rates, and correction steps are necessary to realize their great potential in genomics and transcriptomics. Here, we compare properties of PacBio and Nanopore data and assess correction methods by Canu, MARVEL and proovread in various combinations. We found total error rates of around 13% in the raw datasets. PacBio reads showed a high rate of insertions (around 8%) whereas Nanopore reads showed similar rates for substitutions, insertions and deletions of around 4% each. In data from both technologies, the errors were uniformly distributed along reads apart from noisy 5′ ends, and homopolymers appeared among the most over-represented k-mers relative to a reference. Consensus correction using read overlaps reduced error rates to about 1% when using Canu or MARVEL after patching. The lowest error rate in Nanopore data (0.45%) was achieved by applying proovread on MARVEL-patched data including Illumina short-reads, and the lowest error rate in PacBio data (0.42%) was the result of Canu correction with minimap2 alignment after patching. Our study provides valuable insights and benchmarks regarding long-read data and correction methods.
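
The per-class error rates reported above are typically tallied from read-to-reference alignments. The sketch below shows one simple convention for doing so from aggregate alignment operation counts; the counts are hypothetical and merely mimic an insertion-dominated, PacBio-like profile, not the study's data.

```python
def error_rates(matches, substitutions, insertions, deletions):
    """Per-class error rates relative to the aligned reference length
    (one simple convention among several used in the literature)."""
    aligned = matches + substitutions + deletions   # reference bases covered
    return {
        "substitution": substitutions / aligned,
        "insertion": insertions / aligned,
        "deletion": deletions / aligned,
        "total": (substitutions + insertions + deletions) / aligned,
    }

# Hypothetical operation counts for a batch of aligned reads.
for name, rate in error_rates(870_000, 20_000, 80_000, 30_000).items():
    print(f"{name}: {rate:.1%}")   # e.g. insertion: 8.7%, total: 14.1%
```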


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Nathan LaPierre ◽  
Rob Egan ◽  
Wei Wang ◽  
Zhong Wang

Abstract Background: Long-read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently, Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. The limited sensitivity of existing read-based error correction methods can cause large-scale mis-assemblies in the assembled genomes, motivating further innovation in this area. Results: Here we developed a Convolutional Neural Network (CNN) based method, called MiniScrub, for identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments to minimize their interference in the downstream assembly process. MiniScrub first generates read-to-read overlaps via MiniMap2, then encodes the overlaps into images, and finally builds CNN models to predict low-quality segments. Applying MiniScrub to real-world control datasets under several different parameters, we show that it robustly improves read quality and improves read error correction in the metagenome setting. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. Conclusions: MiniScrub is able to robustly improve read quality of Oxford Nanopore reads, especially in the metagenome setting, making it useful for downstream applications such as de novo assembly. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub.
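
As a rough illustration of the “scrubbing” step described above, and only that step (the overlap encoding and CNN are omitted), the hypothetical sketch below removes read segments whose predicted quality falls below a cutoff and splits the read at each removed segment. The segment size, cutoff, and per-segment qualities are invented inputs, not MiniScrub's actual parameters.

```python
def scrub_read(read, segment_quality, segment_size=48, cutoff=0.8):
    """Keep only segments whose predicted quality meets the cutoff,
    splitting the read wherever a low-quality segment is removed."""
    kept_pieces, current = [], []
    for i, quality in enumerate(segment_quality):
        segment = read[i * segment_size:(i + 1) * segment_size]
        if quality >= cutoff:
            current.append(segment)
        elif current:
            kept_pieces.append("".join(current))   # close the current piece
            current = []
    if current:
        kept_pieces.append("".join(current))
    return kept_pieces

# Toy example: a 5-segment read whose middle segment is predicted low quality.
read = "A" * 48 * 5
print([len(p) for p in scrub_read(read, [0.9, 0.95, 0.3, 0.9, 0.85])])  # [96, 96]
```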


2018 ◽  
Author(s):  
Nathan LaPierre ◽  
Rob Egan ◽  
Wei Wang ◽  
Zhong Wang

Abstract: Long-read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently, Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. Many methods for resolving these errors require access to reference genomes or high-fidelity short reads, which are often not available. De novo error correction modules are available, often as part of assembly tools, but large-scale errors still remain in resulting assemblies, motivating further innovation in this area. We developed a novel Convolutional Neural Network (CNN) based method, called MiniScrub, for de novo identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments. MiniScrub first generates read-to-read alignments with MiniMap, then encodes the alignments into images, and finally builds CNN models to predict low-quality segments that could be scrubbed based on a customized quality cutoff. Applying MiniScrub to real-world control datasets under several different parameters, we show that it robustly improves read quality. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub


Gigabyte ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-26
Author(s):  
John M. Sutton ◽  
Joshua D. Millwood ◽  
A. Case McCormack ◽  
Janna L. Fierst

High-quality reference genome sequences are the core of modern genomics. Oxford Nanopore Technologies (ONT) produces inexpensive DNA sequences, but with high error rates that make sequence assembly and analysis difficult as genome size and complexity increase. Robust experimental design is necessary for ONT genome sequencing and assembly, but few studies have addressed eukaryotic organisms. Here, we present novel results using simulated and empirical ONT and DNA libraries to identify best practices for sequencing and assembly for several model species. We find that the unique error structure of ONT libraries causes errors to accumulate and assembly statistics to plateau as sequence depth increases. High-quality assembled eukaryotic sequences require high-molecular-weight DNA extractions that increase sequence read length, and computational protocols that reduce error through pre-assembly correction and read selection. Our quantitative results will be helpful for researchers seeking guidance for de novo assembly projects.
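
One common form of the pre-assembly read selection mentioned above is to keep only the longest reads up to a target coverage. The sketch below illustrates that generic heuristic; it is not necessarily the exact protocol used in the study.

```python
def select_longest_reads(read_lengths, genome_size_bp, target_coverage):
    """Greedy read selection: take the longest reads first until the
    selected bases reach the requested depth of coverage."""
    budget = genome_size_bp * target_coverage
    selected, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= budget:
            break
        selected.append(length)
        total += length
    return selected

reads = [5_000, 42_000, 18_000, 90_000, 7_500, 63_000]
print(select_longest_reads(reads, genome_size_bp=100_000, target_coverage=2))
# -> [90000, 63000, 42000, 18000]  (stops once ~200 kb of the longest reads is kept)
```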

