NanoSpring: reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach

2021 ◽  
Author(s):  
Qingxi Meng ◽  
Shubham Chandak ◽  
Yifan Zhu ◽  
Tsachy Weissman

Motivation: The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files. Previous work ENANO focuses mostly on quality score compression and does not achieve significant gains for the compression of read sequences over general-purpose compressors. RENANO achieves significantly better compression for read sequences but is limited to aligned data with a reference available. Results: We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring achieves close to 3x improvement in compression over state-of-the-art reference-free compressors. The computational requirements of NanoSpring are practical, although it uses more time and memory during compression than previous tools to achieve the compression gains. Availability: NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring.

2018 ◽  
Vol 35 (15) ◽  
pp. 2674-2676 ◽  
Author(s):  
Shubham Chandak ◽  
Kedar Tatwawadi ◽  
Idoia Ochoa ◽  
Mikel Hernaez ◽  
Tsachy Weissman

Abstract Motivation High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack one or more crucial properties, such as support for variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression. Results In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina’s NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. Availability and implementation SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. Supplementary information Supplementary data are available at Bioinformatics online.
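The abstract above describes FASTQ data as three kinds of content: reads, quality values and read identifiers. As a minimal sketch of that record layout only (not SPRING's algorithm), the following Python separates a FASTQ file into those three streams. The function name `split_fastq_streams` is hypothetical, and the code assumes the common four-line record form with no line-wrapped sequences.

```python
import io


def split_fastq_streams(handle):
    """Split four-line FASTQ records into identifier, read and quality lists.

    Assumption: each record is exactly four lines (header, sequence,
    '+' separator, qualities); wrapped sequences are not handled.
    """
    ids, seqs, quals = [], [], []
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        # A valid record starts with '@' and has one quality value per base.
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record")
        ids.append(header[1:].rstrip("\n"))
        seqs.append(seq)
        quals.append(qual)
    return ids, seqs, quals


example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n!!\n")
ids, seqs, quals = split_fastq_streams(example)
```

Each of the three lists has different statistics (free-form text, a four-letter alphabet, and a small set of quality symbols), which is one way to see why a general-purpose compressor applied to the interleaved file leaves redundancy unexploited.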


2021 ◽  
Vol 12 ◽  
Author(s):  
Davide Bolognini ◽  
Alberto Magi

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already been proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performances of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges and provide insights into precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.


F1000Research ◽  
2015 ◽  
Vol 4 ◽  
pp. 17 ◽  
Author(s):  
Ron Ammar ◽  
Tara A. Paton ◽  
Dax Torti ◽  
Adam Shlien ◽  
Gary D. Bader

Haplotypes are often critical for the interpretation of genetic laboratory observations into medically actionable findings. Current massively parallel DNA sequencing technologies produce short sequence reads that are often unable to resolve haplotype information. Phasing short read data typically requires supplemental statistical phasing based on known haplotype structure in the population or parental genotypic data. Here we demonstrate that the MinION nanopore sequencer is capable of producing very long reads to resolve both variants and haplotypes of HLA-A, HLA-B and CYP2D6 genes important in determining patient drug response in sample NA12878 of CEPH/UTAH pedigree 1463, without the need for statistical phasing. Long read data from a single 24-hour nanopore sequencing run was used to reconstruct haplotypes, which were confirmed by HapMap data and statistically phased Complete Genomics and Sequenom genotypes. Our results demonstrate that nanopore sequencing is an emerging standalone technology with potential utility in a clinical environment to aid in medical decision-making.


2021 ◽  
Author(s):  
Chen Yang ◽  
Theodora Lo ◽  
Ka Ming Nip ◽  
Saber Hafezqorani ◽  
Rene L Warren ◽  
...  

Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, platform-specific challenges, including high base-call error rate, non-uniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical tools. Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. Further, Meta-NanoSim improves upon state-of-the-art methods on microbial abundance estimation through a base-level quantification algorithm. We demonstrate that Meta-NanoSim simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenomic assembly benchmarking task.


2021 ◽  
Author(s):  
Alexandre Wagner Silva Hilsdorf ◽  
Marcela Uliano-Silva ◽  
Luiz Lehmann Coutinho ◽  
Horácio Montenegro ◽  
Vera Maria Fonseca Almeida-Val ◽  
...  

Abstract: Colossoma macropomum, known as “tambaqui”, is the largest Characiformes fish in the Amazon River Basin and a leading species in Brazilian aquaculture and fisheries. Good quality meat and great adaptability to culture systems are some of its remarkable farming features. To support studies into the genetics and genomics of the tambaqui, we have produced the first high-quality genome for the species. We combined Illumina and PacBio sequencing technologies to generate a reference genome, assembled with 39X coverage of long reads and polished to a QV=36 with 130X coverage of short reads. The genome was assembled into 1,269 scaffolds totaling 1,221,847,006 bases, with a scaffold N50 of 40 Mb; 93% of all assembled bases were placed in the largest 54 scaffolds, which correspond to the diploid karyotype of the tambaqui. Furthermore, the NCBI Annotation Pipeline annotated genes, pseudogenes, and non-coding transcripts using the RefSeq database as evidence, guaranteeing a high-quality annotation. A Genome Data Viewer for the tambaqui was produced, which benefits any group interested in exploring unique genomic features of the species. The availability of a highly accurate genome assembly for tambaqui provides the foundation for novel insights about ecological and evolutionary facets and is a helpful resource for aquaculture purposes.
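The scaffold N50 quoted in this abstract is a standard assembly contiguity statistic: the length of the scaffold at which the running total, taken from the longest scaffold downward, first reaches half of all assembled bases. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative assumptions, not data from the tambaqui assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running total to at least
        # half of the assembly defines the N50.
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one length")


# Toy lengths for illustration only (hypothetical values).
toy_scaffolds = [10, 8, 5, 3, 2]
toy_n50 = n50(toy_scaffolds)
```

For the toy list the total is 28, so the threshold is 14; the two longest scaffolds (10 and 8) already cover it, and the N50 is therefore 8.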


1980 ◽  
Vol 33 (3) ◽  
pp. 482-500
Author(s):  
T. Gray ◽  
T. G. Thorne

This paper, which was presented at an Ordinary Meeting of the Institute in London on 21 November 1979, with S. Ratcliffe in the Chair, traces the evolution of airborne doppler navigation systems over the past two decades and discusses the treatment of errors introduced when flying over water. Both authors are with Decca Radar Ltd.

The object of this paper is to describe briefly the major developments which have taken place in doppler navigation systems during the past 15 years or so, to indicate the current state of the art and to examine in some detail the behaviour of doppler systems when flying over water. The year 1960 is taken as a starting point since by that time commercial doppler systems were established. A typical general purpose system of that generation (Fig. 1) consisted of the following units:

1. An antenna, probably mechanically stabilized.
2. A transmitter/receiver using either pulse or FMCW modulation.
3. A tracking system.
4. An analog computer.
5. A display of ground-speed and drift-angle.
6. A display of present position.


Insects ◽  
2021 ◽  
Vol 12 (7) ◽  
pp. 591
Author(s):  
Hasiba Asma ◽  
Marc S. Halfon

An ever-growing number of insect genomes is being sequenced across the evolutionary spectrum. Comprehensive annotation of not only genes but also regulatory regions is critical for reaping the full benefits of this sequencing. Driven by developments in sequencing technologies and in both empirical and computational discovery strategies, the past few decades have witnessed dramatic progress in our ability to identify cis-regulatory modules (CRMs), sequences such as enhancers that play a major role in regulating transcription. Nevertheless, providing a timely and comprehensive regulatory annotation of newly sequenced insect genomes is an ongoing challenge. We review here the methods being used to identify CRMs in both model and non-model insect species, and focus on two tools that we have developed, REDfly and SCRMshaw. These resources can be paired together in a powerful combination to facilitate insect regulatory annotation over a broad range of species, with an accuracy equal to or better than that of other state-of-the-art methods.


2019 ◽  
Author(s):  
Gaoyang Li ◽  
Bo Liu ◽  
Yadong Wang

Abstract Summary: Long-read sequencing technologies are promising for metagenomics studies. However, there is still a lack of read classification tools that can quickly and accurately identify the taxonomies of noisy long reads, which is a bottleneck to the use of long-read sequencing. Herein, we propose deSAMBA, a tailored long-read classification approach that uses a novel sparse approximate match block (SAMB)-based pseudo-alignment algorithm. Benchmarks on real datasets demonstrate that deSAMBA simultaneously achieves high speed and good classification yields, outperforming state-of-the-art tools, and has considerable potential for cutting-edge metagenomics studies. Availability and Implementation: https://github.com/hitbc/deSAMBA.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Minhyeok Cho ◽  
Albert No

Abstract Background Advances in sequencing technology have drastically reduced sequencing costs. As a result, the amount of sequencing data is increasing explosively. Since FASTQ files (the standard sequencing data format) are huge, there is a need for efficient compression of FASTQ files, especially of quality scores. Several quality score compression algorithms have recently been proposed, mainly focused on lossy compression to boost the compression rate further. However, for clinical applications and archiving purposes, lossy compression cannot replace lossless compression. One of the main challenges for lossless compression is time complexity: it takes thousands of seconds to compress a 1 GB file. Also, there are desired features for compression algorithms, such as random access. Therefore, there is a need for a fast lossless compressor with a reasonable compression rate and random access functionality. Results This paper proposes a Fast and Concurrent Lossless Quality scores Compressor (FCLQC) that supports random access and achieves a lower running time based on concurrent programming. Experimental results reveal that FCLQC is significantly faster than the baseline compressors on compression and decompression, at the expense of compression ratio. Compared to LCQS (a baseline quality score compression algorithm), FCLQC shows at least a 31x compression speed improvement in all settings, with a degradation in compression ratio of up to 13.58% (8.26% on average). Compared to general-purpose compressors (such as 7-zip), FCLQC shows 3x faster compression speed while having better compression ratios, by at least 2.08% (4.69% on average). Moreover, the speed of random access decompression also outperforms the others. The concurrency of FCLQC is implemented using Rust; the performance gain increases near-linearly with the number of threads.
Conclusion The superiority of compression and decompression speed makes FCLQC a practical lossless quality score compressor candidate for speed-sensitive applications of DNA sequencing data. FCLQC is available at https://github.com/Minhyeok01/FCLQC and is freely available for non-commercial usage.

