Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

2018 ◽  
Vol 20 (4) ◽  
pp. 1542-1559 ◽  
Author(s):  
Damla Senol Cali ◽  
Jeremie S Kim ◽  
Saugata Ghose ◽  
Can Alkan ◽  
Onur Mutlu

Abstract Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, the high error rates of the technology pose a challenge to generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they must overcome these high error rates. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. Understanding where current tools fall short is essential for developing better ones. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) The choice of tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) The read-to-read overlap finding tools GraphMap and Minimap perform similarly in terms of accuracy, but Minimap has lower memory usage and is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for a quick initial assembly, and polishing can then be applied on top of it to increase accuracy, which leads to a faster overall assembly. (4) The state-of-the-art polishing tool Racon generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Moreover, the bottlenecks we have identified can help developers improve current tools or build new ones that are both accurate and fast, in order to overcome the high error rates of nanopore sequencing technology.
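As an illustration of the overlap–assembly–polish pipeline evaluated in this abstract, the sketch below chains an all-vs-all overlapper, Miniasm and Racon through Python's subprocess module. It is a minimal sketch, not the authors' benchmark scripts: it assumes minimap2 (the successor of the Minimap tool studied here), miniasm and racon are installed and on PATH, and the file names and thread count are illustrative.

```python
import subprocess

READS = "reads.fastq"   # illustrative input: raw nanopore reads
THREADS = "8"

def run(cmd, stdout_path):
    """Run an external command, writing its stdout to a file."""
    with open(stdout_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)

# 1) All-vs-all read overlaps (minimap2 used here as a stand-in for Minimap).
run(["minimap2", "-x", "ava-ont", "-t", THREADS, READS, READS], "overlaps.paf")

# 2) Fast (but error-prone) assembly of the overlap graph with Miniasm.
run(["miniasm", "-f", READS, "overlaps.paf"], "assembly.gfa")

# 3) Extract draft contig sequences from the GFA ('S' lines carry the sequences).
with open("assembly.gfa") as gfa, open("assembly.fasta", "w") as fa:
    for line in gfa:
        if line.startswith("S"):
            parts = line.rstrip("\n").split("\t")
            fa.write(f">{parts[1]}\n{parts[2]}\n")

# 4) Map reads back to the draft and run one polishing round with Racon.
run(["minimap2", "-x", "map-ont", "-t", THREADS, "assembly.fasta", READS], "map.paf")
run(["racon", "-t", THREADS, READS, "map.paf", "assembly.fasta"], "polished.fasta")
```

Additional Racon rounds, or a signal-level polisher such as Nanopolish, would trade extra runtime for accuracy, which mirrors the accuracy–performance trade-off described above.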

PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e6902 ◽  
Author(s):  
Simon Roux ◽  
Gareth Trubl ◽  
Danielle Goudeau ◽  
Nandita Nath ◽  
Estelle Couradeau ◽  
...  

Background Metagenomics has transformed our understanding of microbial diversity across ecosystems, with recent advances enabling de novo assembly of genomes from metagenomes. These metagenome-assembled genomes are critical to provide ecological, evolutionary, and metabolic context for all the microbes and viruses yet to be cultivated. Metagenomes can now be generated from nanogram to subnanogram amounts of DNA. However, these libraries require several rounds of PCR amplification before sequencing, and recent data suggest these typically yield smaller and more fragmented assemblies than regular metagenomes. Methods Here we evaluate de novo assembly methods of 169 PCR-amplified metagenomes, including 25 for which an unamplified counterpart is available, to optimize specific assembly approaches for PCR-amplified libraries. We first evaluated coverage bias by mapping reads from PCR-amplified metagenomes onto reference contigs obtained from unamplified metagenomes of the same samples. Then, we compared different assembly pipelines in terms of assembly size (number of bp in contigs ≥ 10 kb) and error rates to evaluate which are the best suited for PCR-amplified metagenomes. Results Read mapping analyses revealed that the depth of coverage within individual genomes is significantly more uneven in PCR-amplified datasets versus unamplified metagenomes, with regions of high depth of coverage enriched in short inserts. This enrichment scales with the number of PCR cycles performed, and is presumably due to preferential amplification of short inserts. Standard assembly pipelines are confounded by this type of coverage unevenness, so we evaluated other assembly options to mitigate these issues. We found that a pipeline combining read deduplication and an assembly algorithm originally designed to recover genomes from libraries generated after whole genome amplification (single-cell SPAdes) frequently improved assembly of contigs ≥10 kb by 10 to 100-fold for low input metagenomes. Conclusions PCR-amplified metagenomes have enabled scientists to explore communities traditionally challenging to describe, including some with extremely low biomass or from which DNA is particularly difficult to extract. Here we show that a modified assembly pipeline can lead to an improved de novo genome assembly from PCR-amplified datasets, and enables a better genome recovery from low input metagenomes.
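As a sketch of the kind of pipeline the authors converge on (read deduplication followed by assembly in single-cell mode), the snippet below removes exact duplicate read pairs and then invokes SPAdes with its --sc flag. This is an illustration under assumptions, not the paper's exact workflow: the file names are placeholders, the deduplication shown is a naive exact match on the mate sequences, and production pipelines typically rely on dedicated deduplication tools.

```python
import subprocess

def read_fastq(path):
    """Yield (header, seq, plus, qual) records from an uncompressed FASTQ file."""
    with open(path) as fh:
        while True:
            rec = [fh.readline().rstrip("\n") for _ in range(4)]
            if not rec[0]:
                return
            yield tuple(rec)

def dedup_pairs(r1_in, r2_in, r1_out, r2_out):
    """Keep only the first occurrence of each exact (mate1, mate2) sequence pair."""
    seen = set()
    with open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        for rec1, rec2 in zip(read_fastq(r1_in), read_fastq(r2_in)):
            key = (rec1[1], rec2[1])
            if key in seen:
                continue
            seen.add(key)
            o1.write("\n".join(rec1) + "\n")
            o2.write("\n".join(rec2) + "\n")

# Placeholder file names; a real run would point at the PCR-amplified library.
dedup_pairs("R1.fastq", "R2.fastq", "R1.dedup.fastq", "R2.dedup.fastq")

# Assemble with SPAdes in single-cell mode to better tolerate uneven coverage.
subprocess.run(
    ["spades.py", "--sc", "-1", "R1.dedup.fastq", "-2", "R2.dedup.fastq",
     "-o", "spades_sc_out"],
    check=True,
)
```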


2017 ◽  
Author(s):  
Maximilian H.-W. Schmidt ◽  
Alexander Vogel ◽  
Alisandra K. Denton ◽  
Benjamin Istace ◽  
Alexandra Wormit ◽  
...  

Recent updates in sequencing technology have made it possible to obtain gigabases of sequence data from a single flow cell. Prior to this update, nanopore sequencing technology was mainly used to analyze and assemble microbial samples [1-3]. Here, we describe the generation of a comprehensive nanopore sequencing dataset with a median fragment size of 11,979 bp for the wild tomato species Solanum pennellii, which has an estimated genome size of ca. 1.0 to 1.1 Gb. We describe its genome assembly to a contig N50 of 2.5 Mb using a pipeline comprising Canu [4] pre-processing and a subsequent assembly with SMARTdenovo. We show that the resulting nanopore-based de novo genome reconstruction is structurally highly similar to that of the reference S. pennellii LA716 genome [5] but has a high error rate, caused mostly by deletions in homopolymers. After polishing the assembly with Illumina short-read data, we obtained an error rate of <0.02% when assessed against the same Illumina data. More importantly, we obtained a gene completeness of 96.53%, which slightly surpasses that of the reference S. pennellii genome [5]. Taken together, our data indicate that such long-read sequencing data can be used to affordably sequence and assemble Gb-sized diploid plant genomes. Raw data is available at http://www.plabipd.de/portal/solanum-pennellii and has been deposited as PRJEB19787.
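A minimal sketch of the two-stage strategy described here (Canu read correction followed by SMARTdenovo assembly), driven from Python. The flags are illustrative and version-dependent (for example, newer Canu releases use -nanopore instead of -nanopore-raw), the file names are placeholders, and the downstream short-read polishing step is not shown.

```python
import subprocess

RAW_READS = "ont_reads.fastq"   # placeholder: raw nanopore reads
GENOME_SIZE = "1.1g"            # estimated S. pennellii genome size

# 1) Error-correct the raw long reads with Canu (correction stage only).
subprocess.run(
    ["canu", "-correct", "-p", "spenn", "-d", "canu_correct",
     f"genomeSize={GENOME_SIZE}", "-nanopore-raw", RAW_READS],
    check=True,
)

# Canu writes corrected reads to <dir>/<prefix>.correctedReads.fasta.gz
# (decompress first if your SMARTdenovo build does not read gzipped input).
CORRECTED = "canu_correct/spenn.correctedReads.fasta.gz"

# 2) Assemble the corrected reads with SMARTdenovo (-c 1 also runs the consensus
#    step), then build the Makefile it generates.
with open("spenn_sd.mak", "w") as mak:
    subprocess.run(["smartdenovo.pl", "-p", "spenn_sd", "-c", "1", CORRECTED],
                   stdout=mak, check=True)
subprocess.run(["make", "-f", "spenn_sd.mak"], check=True)
```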


Author(s):  
Guangtu Gao ◽  
Susana Magadan ◽  
Geoffrey C Waldbieser ◽  
Ramey C Youngblood ◽  
Paul A Wheeler ◽  
...  

Abstract Currently, there is still a need to improve the contiguity of the rainbow trout reference genome and to use multiple genetic backgrounds that represent the genetic diversity of this species. The Arlee doubled haploid line originated from a domesticated hatchery strain originally collected from the northern California coast. The Canu pipeline was used to generate a de novo assembly of the Arlee line genome from high-coverage PacBio long-read sequence data. The assembly was further improved with Bionano optical maps and Hi-C proximity ligation sequence data to generate 32 major scaffolds corresponding to the karyotype of the Arlee line (2N = 64). It is composed of 938 scaffolds with an N50 of 39.16 Mb and a total length of 2.33 Gb, of which ∼95% is in 32 chromosome sequences with only 438 gaps between contigs and scaffolds. In rainbow trout, the haploid chromosome number can vary from 29 to 32; in the Arlee karyotype it is 32 because chromosomes Omy04, 14 and 25 are divided into six acrocentric chromosomes. Additional structural variations identified in the Arlee genome include the major inversions on chromosomes Omy05 and Omy20 and 15 smaller inversions that will require further validation. This is also the first rainbow trout genome assembly that includes a scaffold with the sex-determination gene (sdY) in the chromosome Y sequence. The utility of this genome assembly is demonstrated through the improved annotation of the duplicated genome loci that harbor the IGH genes on chromosomes Omy12 and Omy13.
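Contiguity here is summarized by the scaffold N50 (39.16 Mb). As a quick reference for that metric, a small illustrative helper (not the authors' code) that computes N50 from a list of scaffold lengths:

```python
def n50(lengths):
    """Return N50: the length L such that scaffolds of length >= L
    together cover at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy example: total = 100 bp, cumulative sums 40 then 70 reach the halfway
# point at the second scaffold, so N50 = 30.
print(n50([40, 30, 20, 10]))  # prints 30
```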


2020 ◽  
Vol 9 (6) ◽  
pp. 385-393
Author(s):  
Arvid Niemeyer ◽  
Lucia Rottmair ◽  
Cornelius Neumann ◽  
Cornelius Möckel

Abstract Light not only enables humans to perceive their surroundings, but also influences their sleep–wake cycle, mood, concentration and performance. Targeted use of these so-called nonvisual effects could also make a positive contribution in automobiles by keeping passengers alert, minimizing error rates or boosting attention in general. Since construction space in vehicle interiors is scarce, this study compared the influence of differently sized light panels, and thus solid angles, on nonvisual effects. In a counterbalanced order, 32 volunteers were exposed to three lighting conditions in the morning: baseline (12 lx, 2200 K), small (200 lx, 6500 K, 0.05 sr) and large (200 lx, 6500 K, 0.44 sr). During each 60-min session, alertness, concentration and working memory were assessed before and during light exposure. Data analysis revealed no significant main effect of light or measurement point, nor any interaction between light and measurement point.


2003 ◽  
Vol 93 (2) ◽  
pp. 219-228 ◽  
Author(s):  
Béatrice Denoyes-Rothan ◽  
Guy Guérin ◽  
Christophe Délye ◽  
Barbara Smith ◽  
Dror Minz ◽  
...  

Ninety-five isolates of Colletotrichum, including 81 isolates of C. acutatum (62 from strawberry) and 14 isolates of C. gloeosporioides (13 from strawberry), were characterized by various molecular methods and pathogenicity tests. Results based on random amplified polymorphic DNA (RAPD) polymorphism and internal transcribed spacer 2 (ITS2) sequence data provided clear genetic evidence of two subgroups in C. acutatum. The first subgroup, characterized as CA-clonal, included only isolates from strawberry and exhibited identical RAPD patterns and nearly identical ITS2 sequences. A larger genetic group, CA-variable, included isolates from various hosts and exhibited variable RAPD patterns and divergent ITS2 sequences. Within the C. acutatum population isolated from strawberry, the CA-clonal group is prevalent in Europe (54 isolates of 62). A subset of European C. acutatum isolates from strawberry, representing the CA-clonal and CA-variable groups, was assigned to two pathogenicity groups. No correlation could be drawn between genetic and pathogenicity groups. On the basis of molecular data, it is proposed that the CA-clonal subgroup contains closely related, highly virulent C. acutatum isolates that may have developed host specialization to strawberry. C. gloeosporioides isolates from Europe, which were rarely observed, were either slightly pathogenic or nonpathogenic on strawberry. The absence of correlation between genetic polymorphism and geographical origin in Colletotrichum spp. suggests a worldwide dissemination of isolates, probably through international plant exchanges.


mSystems ◽  
2018 ◽  
Vol 3 (3) ◽  
Author(s):  
Gabriel A. Al-Ghalith ◽  
Benjamin Hillmann ◽  
Kaiwei Ang ◽  
Robin Shields-Cutler ◽  
Dan Knights

ABSTRACT Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads together into longer or higher-quality contigs. Many tools for each step exist, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the relevant information about how their data were generated, such as the expected overlap for paired-end sequences or the type of adaptors used, to make informed choices. This increasing complexity and nuance demand a pipeline that combines existing steps together in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced “shizen”), which aims to simplify quality control of short-read data for the end user by predicting the presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control. IMPORTANCE Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). Quality control protocols typically require applying this background knowledge when selecting and executing numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.
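To make one of the QC steps such a pipeline automates concrete, here is a small illustrative example of 3'-end quality trimming of a single FASTQ record. It is not SHI7's implementation; the Phred+33 offset and the Q20 threshold are assumptions that a real pipeline would detect from the data or expose as parameters.

```python
def trim_3prime(seq: str, qual: str, threshold: int = 20, phred_offset: int = 33):
    """Trim low-quality bases from the 3' end of a read.

    Bases are removed from the right while their Phred score
    (ASCII value minus phred_offset) is below the threshold.
    """
    cut = len(seq)
    while cut > 0 and ord(qual[cut - 1]) - phred_offset < threshold:
        cut -= 1
    return seq[:cut], qual[:cut]

# Example: the last two bases ('#' = Q2) fall below Q20 and are trimmed.
seq, qual = trim_3prime("ACGTACGT", "IIIIII##")
print(seq, qual)  # ACGTAC IIIIII
```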


1999 ◽  
Vol 339 (3) ◽  
pp. 767-773 ◽  
Author(s):  
Romain R. VIVÈS ◽  
David A. PYE ◽  
Markku SALMIVIRTA ◽  
John J. HOPWOOD ◽  
Ulf LINDAHL ◽  
...  

The biological activity of heparan sulphate (HS) and heparin largely depends on internal oligosaccharide sequences that provide specific binding sites for an extensive range of proteins. Identification of such structures is crucial for the complete understanding of glycosaminoglycan (GAG)-protein interactions. We describe here a simple method of sequence analysis relying on the specific tagging of the sugar reducing end by ³H radiolabelling, the combination of chemical scission and specific enzymic digestion to generate intermediate fragments, and the analysis of the generated products by strong-anion-exchange HPLC. We present full sequence data on microgram quantities of four unknown oligosaccharides (three HS-derived hexasaccharides and one heparin-derived octasaccharide) which illustrate the utility and relative simplicity of the technique. The results clearly show that it is also possible to read sequences of inhomogeneous preparations. Application of this technique to biologically active oligosaccharides should accelerate progress in the understanding of HS and heparin structure-function relationships and provide new insights into the primary structure of these polysaccharides.

