Long-reads are revolutionizing 20 years of insect genome sequencing

Mapping Intimacies ◽

10.1101/2021.02.14.431146 ◽

2021 ◽

Author(s):

Scott Hotaling ◽

John S. Sproul ◽

Jacqueline Heckenhauer ◽

Ashlyn Powell ◽

Amanda M. Larracuente ◽

...

Keyword(s):

Drosophila Melanogaster ◽

Nuclear Genome ◽

Sequencing Technology ◽

Assembly Quality ◽

Insect Genome ◽

Long Reads ◽

Long Read ◽

Field Perspective ◽

Genome Assemblies ◽

The Impact

The first insect genome (Drosophila melanogaster) was published two decades ago. Today, nuclear genome assemblies are available for a staggering 601 different insects representing 20 orders. Here, we analyzed the best assembly for each insect and provide a “state of the field” perspective, emphasizing taxonomic representation, assembly quality, gene completeness, and sequencing technology. We show that while genomic efforts have been biased towards specific groups (e.g., Diptera), assemblies are generally contiguous with gene regions intact. Most notable, however, has been the impact of long-read sequencing; assemblies that incorporate long-reads are ∼48x more contiguous than those that do not.

Download Full-text

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (kilobases to megabases order) and then, using efficient algorithms, provide high quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture between the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads and generally, they inspect the nucleotide one by one, and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we proposed Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long-reads) to polish genome assemblies and in particular assemblies of diploid and heterozygous genomes.

Download Full-text

A highly contiguous nuclear genome assembly of the mandarinfish Synchiropus splendidus (Syngnathiformes: Callionymidae)

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab306 ◽

2021 ◽

Author(s):

Martin Stervander ◽

William A Cresko

Keyword(s):

Genome Assembly ◽

Nuclear Genome ◽

Phylogenetic Position ◽

Blue Color ◽

Long Reads ◽

The Family ◽

Commercially Important ◽

Genomic Resource ◽

Genome Assemblies ◽

Important Fish

Abstract The fish order Syngnathiformes has been referred to as a collection of misfit fishes, comprising commercially important fish such as red mullets as well as the highly diverse seahorses, pipefishes, and seadragons—the well-known family Syngnathidae, with their unique adaptations including male pregnancy. Another ornate member of this order is the species mandarinfish. No less than two types of chromatophores have been discovered in the spectacularly colored mandarinfish: the cyanophore (producing blue color) and the dichromatic cyano-erythrophore (producing blue and red). The phylogenetic position of mandarinfish in Syngnathiformes, and their promise of additional genetic discoveries beyond the chromatophores, made mandarinfish an appealing target for whole genome sequencing. We used linked sequences to create synthetic long reads, producing a highly contiguous genome assembly for the mandarinfish. The genome assembly comprises 483 Mbp (longest scaffold 29 Mbp), has an N50 of 12 Mbp, and an L50 of 14 scaffolds. The assembly completeness is also high, with 92.6% complete, 4.4% fragmented, and 2.9% missing out of 4,584 BUSCO genes found in ray-finned fishes. Outside the family Syngnathidae, the mandarinfish represents one of the most contiguous syngnathiform genome assemblies to date. The mandarinfish genomic resource will likely serve as a high-quality outgroup to syngnathid fish, and furthermore for research on the genomic underpinnings of the evolution of novel pigmentation.

Download Full-text

TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats

Bioinformatics ◽

10.1093/bioinformatics/btaa440 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i75-i83 ◽

Cited By ~ 5

Author(s):

Alla Mikheenko ◽

Andrey V Bzikadze ◽

Alexey Gurevich ◽

Karen H Miga ◽

Pavel A Pevzner

Keyword(s):

Quality Assessment ◽

Chromosome Segregation ◽

Tandem Repeats ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Cellular Processes ◽

Long Reads ◽

Long Read ◽

Eukaryotic Genomes

Abstract Motivation Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there are no tools for their quality assessment. Moreover, since the mapping of error-prone reads to ETRs remains an open problem, it is not clear how to polish draft ETR assemblies. Results To address these problems, we developed the TandemTools software that includes the TandemMapper tool for mapping reads to ETRs and the TandemQUAST tool for polishing ETR assemblies and their quality assessment. We demonstrate that TandemTools not only reveals errors in ETR assemblies but also improves the recently generated assemblies of human centromeres. Availability and implementation https://github.com/ablab/TandemTools. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

A comprehensive evaluation of long read error correction methods

BMC Genomics ◽

10.1186/s12864-020-07227-0 ◽

2020 ◽

Vol 21 (S6) ◽

Author(s):

Haowen Zhang ◽

Chirag Jain ◽

Srinivas Aluru

Keyword(s):

Error Correction ◽

Hybrid Methods ◽

Comprehensive Evaluation ◽

Data Sets ◽

Short Reads ◽

Long Reads ◽

Long Read ◽

Read Error Correction ◽

Downstream Analysis ◽

The Impact

Abstract Background Third-generation single molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to asses these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. Results In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research. Conclusions Despite the high error rate of long reads, the state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in terms of correction quality and computing resource usage. When choosing tools for use, practitioners are suggested to be careful with a few correction tools that discard reads, and check the effect of error correction tools on downstream analysis. Our evaluation code is available as open-source at https://github.com/haowenz/LRECE.

Download Full-text

QAlign: Aligning nanopore reads accurately using current-level modeling

10.1101/862813 ◽

2019 ◽

Author(s):

Dhaivat Joshi ◽

Shunfu Mao ◽

Sreeram Kannan ◽

Suhas Diggavi

Keyword(s):

Reference Genome ◽

Genomic Analysis ◽

Vital Role ◽

High Error Rate ◽

Sequencing Technology ◽

Long Reads ◽

A Genome ◽

Long Read ◽

Nanopore Sequencer ◽

Sequencing Process

AbstractMotivationEfficient and accurate alignment of DNA / RNA sequence reads to each other or to a reference genome / transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Utilizing the noise and error characteristics inherent in the sequencing process properly can play a vital role in constructing a robust aligner. In this paper, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome / transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running it through a sequence aligner.ResultsWe show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2%, 2.5% and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and 82.6% to 90% in two real datasets.Availabilityhttps://github.com/joshidhaivat/QAlign.git

Download Full-text

Impact of Chimera-less Long Reads on Metagenomics of Human Gut Viromes Treated With Multiple Displacement Amplification

10.21203/rs.3.rs-58640/v1 ◽

2020 ◽

Author(s):

Yuya Kiguchi ◽

Suguru Nishijima ◽

Naveen Kumar ◽

Masahira Hattori ◽

Wataru Suda

Keyword(s):

De Novo ◽

Multiple Displacement Amplification ◽

Read Length ◽

Human Gut ◽

Average Read Length ◽

Sequencing Technologies ◽

Genomic Complexity ◽

Long Reads ◽

Long Read ◽

The Impact

Abstract Background: The ecological and biological features of the indigenous phage community (virome) in the human gut microbiome are poorly understood, possibly due to many fragmented contigs and fewer complete genomes based on conventional short-read metagenomics. Long-read sequencing technologies have attracted attention as an alternative approach to reconstruct long and accurate contigs from microbial communities. However, the impact of long-read metagenomics on human gut virome analysis has not been well evaluated. Results: Here we present chimera-less PacBio long-read metagenomics of multiple displacement amplification (MDA)-treated human gut virome DNA. The method included the development of a novel bioinformatics tool, SACRA (Split Amplified Chimeric Read Algorithm), which efficiently detects and splits numerous chimeric reads in PacBio reads from the MDA-treated virome samples. SACRA treatment of PacBio reads from five samples markedly reduced the average chimera ratio from 72 to 1.5%, generating chimera-less PacBio reads with an average read-length of 1.8 kb. De novo assembly of the chimera-less long reads generated contigs with an average N50 length of 11.1 kb, whereas those of MiSeq short reads from the same samples were 0.7 kb, dramatically improving contig extension. Alignment of both contig sets generated 378 high-quality merged contigs (MCs) composed of the minimum scaffolds of 434 MiSeq and 637 PacBio contigs, respectively, and also identified numerous MiSeq short fragmented contigs ≤500 bp additionally aligned to MCs, which possibly originated from a small fraction of MiSeq chimeric reads. The alignment also revealed that fragmentations of the scaffolded MiSeq contigs were caused primarily by genomic complexity of the community, including local repeats, hypervariable regions, and highly conserved sequences in and between the phage genomes. We identified 142 complete and near-complete phage genomes including 108 novel genomes, varying from 5 to 185 kb in length, the majority of which were predicted to be Microviridae phages including several variants with homologous but distinct genomes, which were fragmented in MiSeq contigs. Conclusions: Long-read metagenomics coupled with SACRA provides an improved method to reconstruct accurate and extended phage genomes from MDA-treated virome samples of the human gut, and potentially from other environmental virome samples.

Download Full-text

Complete genome assemblies for three variants of the Wolbachia endosymbiont of Drosophila melanogaster

10.1101/729061 ◽

2019 ◽

Author(s):

Preston J. Basting ◽

Casey M. Bergman

Keyword(s):

Drosophila Melanogaster ◽

Complete Genome ◽

Shotgun Sequencing ◽

Sequencing Data ◽

Wolbachia Pipientis ◽

Long Read ◽

Wolbachia Endosymbiont ◽

Genome Assemblies

ABSTRACTHere we report genome assemblies for three strains of Wolbachia pipientis assembled from unenriched, unfiltered long-read shotgun sequencing data of geographically distinct strains of Drosophila melanogaster. Our simple methodology can be applied to long-read datasets of other Wolbachia-infected species to produce complete assemblies for this important model symbiont.

Download Full-text

A new method for long-read sequencing of animal mitochondrial genomes: application to the identification of equine mitochondrial DNA variants

10.1101/2019.12.20.884486 ◽

2019 ◽

Author(s):

Sophie Dhorne-Pollet ◽

Eric Barrey ◽

Nicolas Pollet

Keyword(s):

Mitochondrial Dna ◽

Nuclear Dna ◽

Multiple Displacement Amplification ◽

Mitochondrial Genomes ◽

Sequencing Technology ◽

Long Reads ◽

Selective Elimination ◽

Long Read ◽

Variant Analysis ◽

Mitochondrial Dna Variants

AbstractBackgroundWe present here an approach to sequence whole mitochondrial genomes using nanopore long-read sequencing. Our method relies on the selective elimination of nuclear DNA using an exonuclease treatment and on the amplification of circular mitochondrial DNA using a multiple displacement amplification step.ResultsWe optimized each preparative step to obtain a 100 million-fold enrichment of horse mitochondrial DNA relative to nuclear DNA. We sequenced these amplified mitochondrial DNA using nanopore sequencing technology and obtained mitochondrial DNA reads that represented up to half of the sequencing output. The sequence reads were 2.3 kb of mean length and provided an even coverage of the mitochondrial genome. Long-reads spanning half or more of the whole mtDNA provided a coverage that varied between 118X and 488X. Finally, we identified SNPs with a precision of 98.1%; recall of 85.2% and a F1-score of 0.912.ConclusionsOur analyses show that our method to amplify mtDNA and to sequence it using the nanopore technology is usable for mitochondrial DNA variant analysis. With minor modifications, this approach could easily be applied to other large circular DNA molecules.

Download Full-text

Telomere-to-telomere gapless chromosomes of banana using nanopore sequencing

Communications Biology ◽

10.1038/s42003-021-02559-3 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Caroline Belser ◽

Franc-Christophe Baurens ◽

Benjamin Noel ◽

Guillaume Martin ◽

Corinne Cruaud ◽

...

Keyword(s):

Musa Acuminata ◽

Genetic Maps ◽

Nanopore Sequencing ◽

Genome Coverage ◽

Long Reads ◽

Oxford Nanopore ◽

A Genome ◽

Long Read ◽

Genome Assemblies ◽

First Time

AbstractLong-read technologies hold the promise to obtain more complete genome assemblies and to make them easier. Coupled with long-range technologies, they can reveal the architecture of complex regions, like centromeres or rDNA clusters. These technologies also make it possible to know the complete organization of chromosomes, which remained complicated before even when using genetic maps. However, generating a gapless and telomere-to-telomere assembly is still not trivial, and requires a combination of several technologies and the choice of suitable software. Here, we report a chromosome-scale assembly of a banana genome (Musa acuminata) generated using Oxford Nanopore long-reads. We generated a genome coverage of 177X from a single PromethION flowcell with near 17X with reads longer than 75 kbp. From the 11 chromosomes, 5 were entirely reconstructed in a single contig from telomere to telomere, revealing for the first time the content of complex regions like centromeres or clusters of paralogous genes.

Download Full-text

Evaluating nanopore sequencing data processing pipelines for structural variation identification

Genome Biology ◽

10.1186/s13059-019-1858-1 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 9

Author(s):

Anbo Zhou ◽

Timothy Lin ◽

Jinchuan Xing

Keyword(s):

Detection Accuracy ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Sequencing Technology ◽

Structural Variations ◽

Human Genomes ◽

Data Assessment ◽

Machine Learning Approach ◽

Long Read ◽

The Impact

Abstract Background Structural variations (SVs) account for about 1% of the differences among human genomes and play a significant role in phenotypic variation and disease susceptibility. The emerging nanopore sequencing technology can generate long sequence reads and can potentially provide accurate SV identification. However, the tools for aligning long-read data and detecting SVs have not been thoroughly evaluated. Results Using four nanopore datasets, including both empirical and simulated reads, we evaluate four alignment tools and three SV detection tools. We also evaluate the impact of sequencing depth on SV detection. Finally, we develop a machine learning approach to integrate call sets from multiple pipelines. Overall SV callers’ performance varies depending on the SV types. For an initial data assessment, we recommend using aligner minimap2 in combination with SV caller Sniffles because of their speed and relatively balanced performance. For detailed analysis, we recommend incorporating information from multiple call sets to improve the SV call performance. Conclusions We present a workflow for evaluating aligners and SV callers for nanopore sequencing data and approaches for integrating multiple call sets. Our results indicate that additional optimizations are needed to improve SV detection accuracy and sensitivity, and an integrated call set can provide enhanced performance. The nanopore technology is improving, and the sequencing community is likely to grow accordingly. In turn, better benchmark call sets will be available to more accurately assess the performance of available tools and facilitate further tool development.

Download Full-text