scholarly journals QAlign: Aligning nanopore reads accurately using current-level modeling

2019 ◽  
Author(s):  
Dhaivat Joshi ◽  
Shunfu Mao ◽  
Sreeram Kannan ◽  
Suhas Diggavi

AbstractMotivationEfficient and accurate alignment of DNA / RNA sequence reads to each other or to a reference genome / transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Utilizing the noise and error characteristics inherent in the sequencing process properly can play a vital role in constructing a robust aligner. In this paper, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome / transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running it through a sequence aligner.ResultsWe show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2%, 2.5% and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and 82.6% to 90% in two real datasets.Availabilityhttps://github.com/joshidhaivat/QAlign.git

Author(s):  
Dhaivat Joshi ◽  
Shunfu Mao ◽  
Sreeram Kannan ◽  
Suhas Diggavi

Abstract Motivation Efficient and accurate alignment of DNA/RNA sequence reads to each other or to a reference genome/transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Utilizing the noise and error characteristics inherent in the sequencing process properly can play a vital role in constructing a robust aligner. In this article, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome/transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running it through a sequence aligner. Results We show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2, 2.5 and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and 82.6% to 90% in two real datasets. Availability and implementation https://github.com/joshidhaivat/QAlign.git. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Author(s):  
Sam Kovaka ◽  
Aleksey V. Zimin ◽  
Geo M. Pertea ◽  
Roham Razaghi ◽  
Steven L. Salzberg ◽  
...  

AbstractRNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new computational methods to handle the high error rate of long-read sequencing technology, which previous assemblers could not tolerate. It also offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of assemblies. On 33 short-read datasets from humans and two plant species, StringTie2 is 47.3% more precise and 3.9% more sensitive than Scallop. On multiple long read datasets, StringTie2 on average correctly assembles 8.3 and 2.6 times as many transcripts as FLAIR and Traphlor, respectively, with substantially higher precision. StringTie2 is also faster and has a smaller memory footprint than all comparable tools.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Edwin A. Solares ◽  
Yuan Tao ◽  
Anthony D. Long ◽  
Brandon S. Gaut

Abstract Background Despite marked recent improvements in long-read sequencing technology, the assembly of diploid genomes remains a difficult task. A major obstacle is distinguishing between alternative contigs that represent highly heterozygous regions. If primary and secondary contigs are not properly identified, the primary assembly will overrepresent both the size and complexity of the genome, which complicates downstream analysis such as scaffolding. Results Here we illustrate a new method, which we call HapSolo, that identifies secondary contigs and defines a primary assembly based on multiple pairwise contig alignment metrics. HapSolo evaluates candidate primary assemblies using BUSCO scores and then distinguishes among candidate assemblies using a cost function. The cost function can be defined by the user but by default considers the number of missing, duplicated and single BUSCO genes within the assembly. HapSolo performs hill climbing to minimize cost over thousands of candidate assemblies. We illustrate the performance of HapSolo on genome data from three species: the Chardonnay grape (Vitis vinifera), with a genome of 490 Mb, a mosquito (Anopheles funestus; 200 Mb) and the Thorny Skate (Amblyraja radiata; 2650 Mb). Conclusions HapSolo rapidly identified candidate assemblies that yield improvements in assembly metrics, including decreased genome size and improved N50 scores. Contig N50 scores improved by 35%, 9% and 9% for Chardonnay, mosquito and the thorny skate, respectively, relative to unreduced primary assemblies. The benefits of HapSolo were amplified by down-stream analyses, which we illustrated by scaffolding with Hi-C data. We found, for example, that prior to the application of HapSolo, only 52% of the Chardonnay genome was captured in the largest 19 scaffolds, corresponding to the number of chromosomes. After the application of HapSolo, this value increased to ~ 84%. The improvements for the mosquito’s largest three scaffolds, representing the number of chromosomes, were from 61 to 86%, and the improvement was even more pronounced for thorny skate. We compared the scaffolding results to assemblies that were based on PurgeDups for identifying secondary contigs, with generally superior results for HapSolo.


2021 ◽  
Author(s):  
Xinxin Yi ◽  
Jing Liu ◽  
Shengcai Chen ◽  
Hao Wu ◽  
Min Liu ◽  
...  

Cultivated soybean (Glycine max) is an important source for protein and oil. Many elite cultivars with different traits have been developed for different conditions. Each soybean strain has its own genetic diversity, and the availability of more high-quality soybean genomes can enhance comparative genomic analysis for identifying genetic underpinnings for its unique traits. In this study, we constructed a high-quality de novo assembly of an elite soybean cultivar Jidou 17 (JD17) with chromsome contiguity and high accuracy. We annotated 52,840 gene models and reconstructed 74,054 high-quality full-length transcripts. We performed a genome-wide comparative analysis based on the reference genome of JD17 with three published soybeans (WM82, ZH13 and W05) , which identified five large inversions and two large translocations specific to JD17, 20,984 - 46,912 PAVs spanning 13.1 - 46.9 Mb in size, and 5 - 53 large PAV clusters larger than 500kb. 1,695,741 - 3,664,629 SNPs and 446,689 - 800,489 Indels were identified and annotated between JD17 and them. Symbiotic nitrogen fixation (SNF) genes were identified and the effects from these variants were further evaluated. It was found that the coding sequences of 9 nitrogen fixation-related genes were greatly affected. The high-quality genome assembly of JD17 can serve as a valuable reference for soybean functional genomics research.


Plants ◽  
2019 ◽  
Vol 8 (8) ◽  
pp. 270 ◽  
Author(s):  
Yun Gyeong Lee ◽  
Sang Chul Choi ◽  
Yuna Kang ◽  
Kyeong Min Kim ◽  
Chon-Sik Kang ◽  
...  

The whole genome sequencing (WGS) has become a crucial tool in understanding genome structure and genetic variation. The MinION sequencing of Oxford Nanopore Technologies (ONT) is an excellent approach for performing WGS and it has advantages in comparison with other Next-Generation Sequencing (NGS): It is relatively inexpensive, portable, has simple library preparation, can be monitored in real-time, and has no theoretical limits on reading length. Sorghum bicolor (L.) Moench is diploid (2n = 2x = 20) with a genome size of about 730 Mb, and its genome sequence information is released in the Phytozome database. Therefore, sorghum can be used as a good reference. However, plant species have complex and large genomes when compared to animals or microorganisms. As a result, complete genome sequencing is difficult for plant species. MinION sequencing that produces long-reads can be an excellent tool for overcoming the weak assembly of short-reads generated from NGS by minimizing the generation of gaps or covering the repetitive sequence that appears on the plant genome. Here, we conducted the genome sequencing for S. bicolor cv. BTx623 while using the MinION platform and obtained 895,678 reads and 17.9 gigabytes (Gb) (ca. 25× coverage of reference) from long-read sequence data. A total of 6124 contigs (covering 45.9%) were generated from Canu, and a total of 2661 contigs (covering 50%) were generated from Minimap and Miniasm with a Racon through a de novo assembly using two different tools and mapped assembled contigs against the sorghum reference genome. Our results provide an optimal series of long-read sequencing analysis for plant species while using the MinION platform and a clue to determine the total sequencing scale for optimal coverage that is based on various genome sizes.


2016 ◽  
Author(s):  
Eric Disdero ◽  
Jonathan Filée

AbstractMotivationPopulation genomic analysis of transposable elements has greatly benefited from recent advances of sequencing technologies. However, the propensity of transposable elements to nest in highly repeated regions of genomes limits the efficiency of bioinformatic tools when short read sequences technology is used.ResultsLoRTE is the first tool able to use PacBio long read sequences to identify transposon deletions and insertions between a reference genome and genomes of different strains or populations. Tested against Drosophila melanogaster PacBio datasets, LoRTE appears to be a reliable and broadly applicable tools to study the dynamic and evolutionary impact of transposable elements using low coverage, long read sequences.Availability and ImplementationLoRTE is available at http://www.egce.cnrs-gif.fr/?p=6422. It is written in Python 2.7 and only requires the NCBI BLAST + package. LoRTE can be used on standard computer with limited RAM resources and reasonable running time even with large [email protected]


2014 ◽  
Author(s):  
Rajiv C McCoy ◽  
Ryan W Taylor ◽  
Timothy A Blauwkamp ◽  
Joanna L Kelley ◽  
Michael Kertesz ◽  
...  

High-throughput DNA sequencing technologies have revolutionized genomic analysis, including thede novoassembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, in part due to the presence of dispersed repeats which introduce ambiguity during genome reconstruction. Transposable elements (TEs) can be particularly problematic, especially for TE families exhibiting high sequence identity, high copy number, or present in complex genomic arrangements. While TEs strongly affect genome function and evolution, most currentde novoassembly approaches cannot resolve long, identical, and abundant families of TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly parallel library preparation and local assembly of short read data and achieve lengths of 1.5-18.5 Kbp with an extremely low error rate (∼0.03% per base). To test the utility of this technology, we sequenced and assembled the genome of the model organismDrosophila melanogaster(reference genome strainy;cn,bw,sp) achieving an N50 contig size of 69.7 Kbp and covering 96.9% of the euchromatic chromosome arms of the current reference genome. TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recovered and accurately placed 4,229 (77.8%) of the 5,434 of annotated transposable elements with perfect identity to the current reference genome. As TEs are ubiquitous features of genomes of many species, TruSeq synthetic long- reads, and likely other methods that generate long reads, offer a powerful approach to improvede novoassemblies of whole genomes.


F1000Research ◽  
2015 ◽  
Vol 4 ◽  
pp. 17 ◽  
Author(s):  
Ron Ammar ◽  
Tara A. Paton ◽  
Dax Torti ◽  
Adam Shlien ◽  
Gary D. Bader

Haplotypes are often critical for the interpretation of genetic laboratory observations into medically actionable findings. Current massively parallel DNA sequencing technologies produce short sequence reads that are often unable to resolve haplotype information. Phasing short read data typically requires supplemental statistical phasing based on known haplotype structure in the population or parental genotypic data. Here we demonstrate that the MinION nanopore sequencer is capable of producing very long reads to resolve both variants and haplotypes of HLA-A, HLA-B and CYP2D6 genes important in determining patient drug response in sample NA12878 of CEPH/UTAH pedigree 1463, without the need for statistical phasing. Long read data from a single 24-hour nanopore sequencing run was used to reconstruct haplotypes, which were confirmed by HapMap data and statistically phased Complete Genomics and Sequenom genotypes. Our results demonstrate that nanopore sequencing is an emerging standalone technology with potential utility in a clinical environment to aid in medical decision-making.


2017 ◽  
Author(s):  
Brian M. Forde ◽  
Lauren J. McAllister ◽  
James C. Paton ◽  
Adrienne W. Paton ◽  
Scott A. Beatson

AbstractShiga toxigenicEscherichia coli(STEC) are important food-borne pathogens and a major cause of haemorrhagic colitis and haemolytic-uremic syndrome (HUS) worldwide. In 1995 a severe HUS outbreak in Adelaide occurred. A recent genomic analysis of STEC O111:H-strains 95JB1 and 95NR1 from this outbreak found that the more virulent isolate, 95NR1, harboured two additional copies of the Shiga toxin 2 (Stx2) genes although the structure of the Stx2-converting prophages could not be fully resolved due to the fragmented assembly. In this study we have used Pacific Biosciences (PacBio) single molecule real-time (SMRT) long read sequencing to characterise the complete epigenome (genome and methylome) of 95JB1 and 95NR1. Using long reads we completely resolved the structure of two, tandemly inserted, stx2-converting phage in 95NR1. Our analysis of the methylome of 95NR1 and 95JB1 identified hemi-methylation of a novel motif (5’-CTGCm6AG-3’) in more than 4000 sites in the 95NR1 genome. These sites were entirely unmethalyted in the 95JB1, including at least 180 potential promoter regions that could explain regulatory differences between the strains. We identified a Type IIG methyltransferase encoded in both genomes in association with three additional genes in an operon-like arrangement. IS1203mediated disruption of this operon in 95JB1 is the likely cause of the observed differential patterns of methylation between 95NR1 and 95JB1. This study demonstrates the enormous potential of PacBio SMRT sequencing to resolve complex prophage regions and reveal the genetic and epigenetic heterogeneity within a clonal population of bacteria.


2018 ◽  
Author(s):  
Theodore S. Kalbfleisch ◽  
Edward S. Rice ◽  
Michael S. DePriest ◽  
Brian P. Walenz ◽  
Matthew S. Hestand ◽  
...  

AbstractEquCab2, a high-quality reference genome for the domestic horse, was released in 2007. Since then, it has served as the foundation for nearly all genomic work done in equids. Recent advances in genomic sequencing technology and computational assembly methods have allowed scientists to improve reference assemblies of large animal and plant genomes in terms of contiguity and composition. In 2014, the equine genomics research community began a project to improve the reference sequence for the horse, building upon the solid foundation of EquCab2 and incorporating new short-read data, long-read data, and proximity ligation data. The result, EquCab3, is presented here. The count of non-N bases in the incorporated chromosomes is improved from 2.33Gb in EquCab2 to 2.41Gb from EquCab3. Contiguity has also been improved nearly 40-fold with a contig N50 of 4.5Mb and scaffold contiguity enhanced to where all but one of the 32 chromosomes is comprised of a single scaffold.


Sign in / Sign up

Export Citation Format

Share Document