Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

Abstract: Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either PacBio or Oxford Nanopore technologies, and achieves a contig NG50 of greater than 21 Mbp on both human and Drosophila melanogaster PacBio datasets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
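Illustrative note: the abstract above reports continuity as contig NG50. The sketch below shows how that statistic is commonly computed. The function names and the toy contig lengths are assumptions made for illustration and are not taken from Canu.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50: the length L such that contigs of length >= L
    together cover at least half of the (estimated) genome size.

    Returns 0 when the contigs never reach half of genome_size.
    """
    half = genome_size / 2
    running_total = 0
    # Walk the contigs from longest to shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


def n50(contig_lengths):
    """N50 is the same statistic measured against the assembly's own total length."""
    return ng50(contig_lengths, sum(contig_lengths))


# Toy assembly (hypothetical lengths, arbitrary units).
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))        # measured against the 290 assembled units
print(ng50(contigs, 400))  # measured against an assumed genome size of 400
```

Because NG50 is normalised by genome size rather than assembly size, it allows assemblies of different total length to be compared on the same footing.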


De novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm

10.1101/463463 ◽

2018 ◽

Cited By ~ 8

Author(s):

Kristoffer Sahlin ◽

Paul Medvedev

Keyword(s):

Clustering Algorithm ◽

De Novo ◽

Substantial Improvement ◽

Error Rates ◽

Reconstruction Algorithms ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Transcript Reconstruction ◽

Oxford Nanopore Technologies

Abstract: Long-read sequencing of transcripts with PacBio Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (in order to scale) and makes use of quality values (in order to handle variable error rates). We test isONclust on three simulated and five biological datasets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches in terms of overall accuracy and/or scalability to large datasets. Our tool is available at https://github.com/ksahlin/isONclust.
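Illustrative note: the abstract above describes a clustering that is greedy and uses quality values. The toy sketch below shows one way those two ideas can fit together. It is not isONclust's actual algorithm. The k-mer overlap criterion, the `min_shared` threshold and all function names are assumptions chosen for illustration.

```python
def kmer_set(seq, k):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mean_error_prob(quals):
    """Mean per-base error probability from Phred scores (p = 10^(-Q/10))."""
    return sum(10 ** (-q / 10) for q in quals) / len(quals)


def greedy_cluster(reads, k=4, min_shared=0.5):
    """Greedy single-pass clustering (toy version).

    reads: list of (name, sequence, phred_qualities).
    Reads are visited from most to least reliable, so each cluster's
    representative is its highest-quality read. A read joins the first
    cluster whose representative contains at least `min_shared` of the
    read's k-mers; otherwise it founds a new cluster.
    """
    ordered = sorted(reads, key=lambda r: mean_error_prob(r[2]))
    representatives = []  # k-mer set of each cluster's founding read
    clusters = []         # read names per cluster
    for name, seq, _quals in ordered:
        ks = kmer_set(seq, k)
        for idx, rep in enumerate(representatives):
            if ks and len(ks & rep) / len(ks) >= min_shared:
                clusters[idx].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            representatives.append(ks)
            clusters.append([name])
    return clusters


# Hypothetical reads: r1 and r2 differ by one base, r3 is unrelated.
reads = [
    ("r3", "TTTTGGGGAAAA", [10] * 12),
    ("r2", "ACGTACGTGGCA", [20] * 12),
    ("r1", "ACGTACGTGGCC", [30] * 12),
]
print(greedy_cluster(reads))
```

Each read is compared only against cluster representatives rather than against every other read, which is what keeps a greedy scheme of this kind scalable.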


Overlap detection on long, error-prone sequencing reads via smooth q-gram

Bioinformatics ◽

10.1093/bioinformatics/btaa252 ◽

2020 ◽

Vol 36 (19) ◽

pp. 4838-4845

Author(s):

Yan Song ◽

Haixu Tang ◽

Haoyu Zhang ◽

Qin Zhang

Keyword(s):

Single Molecule ◽

De Novo ◽

Error Rates ◽

Supplementary Information ◽

Sequencing Error ◽

Fragment Assembly ◽

Detection Algorithms ◽

Third Generation Sequencing ◽

Oxford Nanopore ◽

Assembly Algorithms

Motivation: Third generation sequencing techniques, such as the Single Molecule Real Time technique from PacBio and the MinION technique from Oxford Nanopore, can generate long, error-prone sequencing reads which pose new challenges for fragment assembly algorithms. In this paper, we study the overlap detection problem for error-prone reads, which is the first and most critical step in de novo fragment assembly. We observe that none of the state-of-the-art methods achieves ideal accuracy for overlap detection (precision and recall are both relatively low) due to the high sequencing error rates, especially when the overlap lengths between reads are relatively short (e.g. <2000 bases). This limitation appears inherent to these algorithms due to their usage of q-gram-based seeds under the seed-extension framework. Results: We propose smooth q-gram, a variant of q-gram that captures q-gram pairs within small edit distances, and design a novel algorithm for detecting overlapping reads using smooth q-gram-based seeds. We implemented the algorithm and tested it on both PacBio and Nanopore sequencing datasets. Our benchmarking results demonstrated that our algorithm outperforms the existing q-gram-based overlap detection algorithms, especially for reads with relatively short overlapping lengths. Availability and implementation: The source code of our implementation in C++ is available at https://github.com/FIGOGO/smoothq. Supplementary information: Supplementary data are available at Bioinformatics online.
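Illustrative note: the abstract above attributes the accuracy limit to exact q-gram seeds. The sketch below shows plain exact q-gram seeding between two reads and how a single substitution removes several seeds at once. It does not implement smooth q-gram. The function names and the toy sequences are assumptions made for illustration.

```python
def qgram_positions(seq, q):
    """Index every q-gram of seq by its start positions."""
    index = {}
    for i in range(len(seq) - q + 1):
        index.setdefault(seq[i:i + q], []).append(i)
    return index


def shared_seeds(a, b, q):
    """Return (i, j) pairs where the q-gram at a[i:] exactly equals the one at b[j:]."""
    index_b = qgram_positions(b, q)
    seeds = []
    for i in range(len(a) - q + 1):
        for j in index_b.get(a[i:i + q], []):
            seeds.append((i, j))
    return seeds


read = "ACGTTGCAAGCT"
noisy = "ACGTTGGAAGCT"  # same read with one substitution at position 6

print(len(shared_seeds(read, read, 4)))   # every q-gram matches itself
print(shared_seeds(read, noisy, 4))       # q-grams covering position 6 are lost
```

One substitution invalidates up to q exact seeds, so at high error rates short overlaps may keep too few seeds to be detected, which is the weakness the abstract describes.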


MECAT: an ultra-fast mapping, error correction and de novo assembly tool for single-molecule sequencing reads

10.1101/089250 ◽

2016 ◽

Cited By ~ 2

Author(s):

Chuan-Le Xiao ◽

Ying Chen ◽

Shang-qian Xie ◽

Kai-Ning Chen ◽

Yan Wang ◽

...

Keyword(s):

Error Correction ◽

Single Molecule ◽

De Novo ◽

Computational Cost ◽

Pairwise Alignment ◽

Global Alignment ◽

Chinese Han ◽

Celera Assembler ◽

Reference Quality ◽

Molecular Sequencing

Abstract: The high computational cost of current assembly methods for long, noisy single-molecule sequencing (SMS) reads has prevented them from assembling large genomes. We introduce an ultra-fast alignment method based on a novel global alignment score. For large human SMS data, our method is 7X faster than MHAP for pairwise alignment and 15X faster than BLASR for reference mapping. We develop a Mapping, Error Correction and de novo Assembly Tool (MECAT) by integrating our new alignment and error correction methods with the Celera Assembler. MECAT is capable of producing high-quality de novo assemblies of large genomes from SMS reads with low computational cost. MECAT produces reference-quality assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, and Drosophila melanogaster, and reconstructs the human CHM1 genome with 15% longer NG50 in only 7600 CPU core hours using 54X SMS reads and a Chinese Han genome in 19200 CPU core hours using 102X SMS reads.


De novo genome assembly of the olive fruit fly (Bactrocera oleae) developed through a combination of linked-reads and long-read technologies

10.1101/505040 ◽

2018 ◽

Cited By ~ 1

Author(s):

Haig Djambazian ◽

Anthony Bayega ◽

Konstantina T. Tsoumani ◽

Efthimia Sagri ◽

Maria-Eleni Gregoriou ◽

...

Keyword(s):

Y Chromosome ◽

De Novo ◽

Fruit Fly ◽

Bactrocera Oleae ◽

Olive Fruit Fly ◽

Olive Fruit ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Oxford Nanopore Technologies

Abstract: Long-read sequencing has greatly contributed to the generation of high quality assemblies, albeit at a high cost. It is also not always clear how to combine sequencing platforms. We sequenced the genome of the olive fruit fly (Bactrocera oleae), the most important pest in the olive fruits agribusiness industry, using Illumina short-reads, mate-pairs, 10x Genomics linked-reads, Pacific Biosciences (PacBio), and Oxford Nanopore Technologies (ONT). The 10x linked-reads assembly gave the most contiguous assembly with an N50 of 2.16 Mb. Scaffolding the linked-reads assembly using long-reads from ONT gave a more contiguous assembly with scaffold N50 of 4.59 Mb. We also present the most extensive transcriptome datasets of the olive fly derived from different tissues and stages of development. Finally, we used the Chromosome Quotient method to identify Y-chromosome scaffolds and show that the long-reads based assembly generates a highly contiguous Y-chromosome assembly. JR is a member of the MinION Access Program (MAP) and has received free-of-charge flow cells and sequencing kits from Oxford Nanopore Technologies for other projects. JR has had no other financial support from ONT. AB has received reimbursement for travel costs associated with attending Nanopore Community meeting 2018, a meeting organized by Oxford Nanopore Technologies.


Full-coverage sequencing of HIV-1 provirus from a reference plasmid

10.1101/611848 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alejandro R. Gener

Keyword(s):

Single Molecule ◽

Integration Site ◽

Full Length ◽

Rapid PCR ◽

Oxford Nanopore ◽

Long Read ◽

Reference Plasmid ◽

Nanopore DNA Sequencing ◽

Oxford Nanopore Technologies ◽

HIV-1

Objective(s): To evaluate nanopore DNA sequencing for sequencing full-length HIV-1 provirus. Design: I used nanopore sequencing to sequence full-length HIV-1 from a plasmid (pHXB2). Methods: pHXB2 plasmid was processed with the Rapid PCR-Barcoding library kit and sequenced on the MinION sequencer (Oxford Nanopore Technologies, Oxford, UK). Raw fast5 reads were converted into fastq (base called) with Albacore, Guppy, and FlipFlop base callers. Reads were first aligned to the reference with BWA-MEM to evaluate sample coverage manually. Reads were then assembled with Canu into contigs, and contigs manually finished in SnapGene. Results: I sequenced full-length HXB2 HIV-1 from 5’ to 3’ LTR (100%), with median per-base coverage of over 9000x in one 12-barcoded experiment on a single MinION flow cell. The longest HIV-spanning read to date was generated, at a length of 11,487 bases, which included full-length HIV-1 and plasmid backbone on either side. At least 20 variants were discovered in pHXB2 compared to reference. Conclusions: The MinION sequencer performed as expected, covering full-length HIV. The discovery of variants in a dogmatic reference plasmid demonstrates the need for single-molecule sequence verification moving forward. These results illustrate the utility of long read sequencing to advance the study of HIV at single integration site resolution.


Genomic Surveillance for Antimicrobial Resistance in Mannheimia haemolytica Using Nanopore Single Molecule Sequencing Technology

10.1101/395087 ◽

2018 ◽

Author(s):

Alexander Lim ◽

Bryan Naidenov ◽

Haley Bates ◽

Karyn Willyerd ◽

Timothy Snider ◽

...

Keyword(s):

Antibiotic Resistance ◽

Antimicrobial Resistance ◽

Single Molecule ◽

Resistant Strain ◽

De Novo ◽

Gene Annotation ◽

Cost Effective ◽

Multidrug Resistant ◽

Oxford Nanopore ◽

Long Read

Abstract: Disruptive innovations in long-range, cost-effective direct template nucleic acid sequencing are transforming clinical and diagnostic medicine. A multidrug resistant strain and a pan-susceptible strain of Mannheimia haemolytica, isolated from pneumonic bovine lung samples, were respectively sequenced at 146x and 111x coverage with Oxford Nanopore Technologies MinION. De novo assembly produced a complete genome for the non-resistant strain and a nearly complete assembly for the drug resistant strain. Functional annotation using RAST (Rapid Annotations using Subsystems Technology), CARD (Comprehensive Antibiotic Resistance Database) and ResFinder databases identified genes conferring resistance to different classes of antibiotics including beta lactams, tetracyclines, lincosamides, phenicols, aminoglycosides, sulfonamides and macrolides. Antibiotic resistance phenotypes of the M. haemolytica strains were confirmed with minimum inhibitory concentration (MIC) assays. The sequencing capacity of highly portable MinION devices was verified by sub-sampling sequencing reads; potential for antimicrobial resistance determined by identification of resistance genes in the draft assemblies with as little as 5,437 MinION reads corresponded to all classes of MIC assays. The resulting quality assemblies and AMR gene annotation highlight the efficiency of ultra long-read, whole-genome sequencing (WGS) as a valuable tool in diagnostic veterinary medicine.


Benchmarking of long-read assemblers for prokaryote whole genome sequencing

F1000Research ◽

10.12688/f1000research.21782.1 ◽

2019 ◽

Vol 8 ◽

pp. 2138 ◽

Cited By ~ 15

Author(s):

Ryan R. Wick ◽

Kathryn E. Holt

Keyword(s):

Data Sets ◽

Computationally Efficient ◽

Short Read Sequencing ◽

Oxford Nanopore ◽

Long Read ◽

Sequencing Platforms ◽

Computational Resources ◽

Assembly Algorithms ◽

Oxford Nanopore Technologies ◽

Multiple Assembly

Background: Data sets from long-read sequencing platforms (Oxford Nanopore Technologies and Pacific Biosciences) allow for most prokaryote genomes to be completely assembled – one contig per chromosome or plasmid. However, the high per-read error rate of long-read sequencing necessitates different approaches to assembly than those used for short-read sequencing. Multiple assembly tools (assemblers) exist, which use a variety of algorithms for long-read assembly. Methods: We used 500 simulated read sets and 120 real read sets to assess the performance of six long-read assemblers (Canu, Flye, Miniasm/Minipolish, Raven, Redbean and Shasta) across a wide variety of genomes and read parameters. Assemblies were assessed on their structural accuracy/completeness, sequence identity, contig circularisation and computational resources used. Results: Canu v1.9 produced moderately reliable assemblies but had the longest runtimes of all assemblers tested. Flye v2.6 was more reliable and did particularly well with plasmid assembly. Miniasm/Minipolish v0.3 was the only assembler which consistently produced clean contig circularisation. Raven v0.0.5 was the most reliable for chromosome assembly, though it did not perform well on small plasmids and had circularisation issues. Redbean v2.5 and Shasta v0.3.0 were computationally efficient but more likely to produce incomplete assemblies. Conclusions: Of the assemblers tested, Flye, Miniasm/Minipolish and Raven performed best overall. However, no single tool performed well on all metrics, highlighting the need for continued development on long-read assembly algorithms.


Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

10.1101/008003 ◽

2014 ◽

Cited By ~ 13

Author(s):

Konstantin Berlin ◽

Sergey Koren ◽

Chen-Shan Chin ◽

James Drake ◽

Jane M Landolin ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Locality Sensitive Hashing ◽

Model Organisms ◽

SMRT Sequencing ◽

High Coverage ◽

Celera Assembler ◽

Single Molecule Sequencing ◽

Long Reads ◽

Long Read

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
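Illustrative note: the abstract above describes overlapping reads with probabilistic, locality-sensitive hashing (MinHash). The sketch below shows the basic MinHash idea of estimating the Jaccard similarity of two k-mer sets from short sketches. It is a minimal illustration and not MHAP's implementation. The seeded BLAKE2 hashing scheme, the parameter values and the function names are all assumptions.

```python
import hashlib


def kmers(seq, k):
    """All distinct k-mers of a sequence (seq must have at least k bases)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; each seed acts as a separate hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=5, num_hashes=64):
    """For each of num_hashes hash functions, keep the minimum hash over all k-mers."""
    ks = kmers(seq, k)
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """The fraction of sketch positions that agree estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)


# Hypothetical reads: b overlaps most of a, c shares no k-mers with a.
a = "ACGTTGCAAGCTTAGGCTAACGGATTACA"
b = "TTGCAAGCTTAGGCTAACGGATTACAGGT"
c = "CCCCCCCCCCCCCCCCCCCCCCCCCCCCC"

sa, sb, sc = (minhash_sketch(s) for s in (a, b, c))
print(estimate_jaccard(sa, sb))  # high for overlapping reads
print(estimate_jaccard(sa, sc))  # zero when no k-mers are shared
```

Comparing fixed-size sketches instead of full k-mer sets is what makes all-against-all overlap detection tractable for large read sets.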


De-novo Assembly of Limnospira fusiformis Using Ultra-Long Reads

Frontiers in Microbiology ◽

10.3389/fmicb.2021.657995 ◽

2021 ◽

Vol 12 ◽

Author(s):

McKenna Hicks ◽

Thuy-Khanh Tran-Dao ◽

Logan Mulroney ◽

David L. Bernick

Keyword(s):

Phylogenetic Analysis ◽

Type Strain ◽

Reference Genome ◽

De Novo ◽

Illumina MiSeq ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Oxford Nanopore Technologies ◽

rDNA Analysis

The Limnospira genus is a recently established clade that is economically important due to its worldwide use in biotechnology and agriculture. This genus includes organisms that were reclassified from Arthrospira, which are commercially marketed as “Spirulina.” Limnospira are photoautotrophic organisms that are widely used for research in nutrition, medicine, bioremediation, and biomanufacturing. Despite its widespread use, there is no closed genome for the Limnospira genus, and no reference genome for the type strain, Limnospira fusiformis. In this work, the L. fusiformis genome was sequenced using Oxford Nanopore Technologies MinION and assembled using only ultra-long reads (>35 kb). This assembly was polished with Illumina MiSeq reads sourced from an axenic L. fusiformis culture; axenicity was verified via microscopy and rDNA analysis. Ultra-long read sequencing resulted in a 6.42 Mb closed genome assembled as a single contig with no plasmid. Phylogenetic analysis placed L. fusiformis in the Limnospira clade; some Arthrospira were also placed in this clade, suggesting a misclassification of these strains. This work provides a fully closed and accurate reference genome for the economically important type strain, L. fusiformis. We also present a rapid axenicity method to isolate L. fusiformis. These contributions enable future biotechnological development of L. fusiformis by way of genetic engineering.


Benchmarking Long-Read Assemblers for Genomic Analyses of Bacterial Pathogens Using Oxford Nanopore Sequencing

International Journal of Molecular Sciences ◽

10.3390/ijms21239161 ◽

2020 ◽

Vol 21 (23) ◽

pp. 9161

Author(s):

Zhao Chen ◽

David L. Erickson ◽

Jianghong Meng

Keyword(s):

Virulence Genes ◽

Bacterial Pathogens ◽

Error Rates ◽

Nanopore Sequencing ◽

Long Reads ◽

Oxford Nanopore ◽

Genomic Analyses ◽

Long Read ◽

Genome Analyses ◽

Assembly Algorithms

Oxford Nanopore sequencing can be used to achieve complete bacterial genomes. However, Oxford Nanopore long reads have higher error rates than Illumina short reads. Long-read assemblers using a variety of assembly algorithms have been developed to overcome this deficiency, but they have not been benchmarked for genomic analyses of bacterial pathogens using Oxford Nanopore long reads. In this study, the long-read assemblers Canu, Flye, Miniasm/Racon, Raven, Redbean, and Shasta were therefore benchmarked using Oxford Nanopore long reads of bacterial pathogens. Ten species were tested for mediocre- and low-quality simulated reads, and 10 species were tested for real reads. Raven was the most robust assembler, obtaining complete and accurate genomes. All Miniasm/Racon and Raven assemblies of mediocre-quality reads provided accurate antimicrobial resistance (AMR) profiles, while the Raven assembly of Klebsiella variicola with low-quality reads was the only assembly with an accurate AMR profile among all assemblers and species. All assemblers functioned well for predicting virulence genes using mediocre-quality and real reads, whereas only the Raven assemblies of low-quality reads had accurate numbers of virulence genes. Regarding multilocus sequence typing (MLST), Miniasm/Racon was the most effective assembler for mediocre-quality reads, while only the Raven assemblies of Escherichia coli O157:H7 and K. variicola with low-quality reads showed positive MLST results. Miniasm/Racon and Raven were the best performers for MLST using real reads. The Miniasm/Racon and Raven assemblies showed accurate phylogenetic inference. For the pan-genome analyses, Raven was the strongest assembler for simulated reads, whereas Miniasm/Racon and Raven performed the best for real reads. Overall, the most robust and accurate assembler was Raven, closely followed by Miniasm/Racon.
