FCLQC: fast and concurrent lossless quality scores compressor

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Minhyeok Cho ◽  
Albert No

Abstract Background Advances in sequencing technology have drastically reduced sequencing costs. As a result, the amount of sequencing data is increasing explosively. Since FASTQ files (the standard sequencing data format) are huge, there is a need for efficient compression of FASTQ files, especially of the quality scores. Several quality score compression algorithms have recently been proposed, mainly focused on lossy compression to further boost the compression rate. However, for clinical applications and archiving purposes, lossy compression cannot replace lossless compression. One of the main challenges for lossless compression is time complexity: compressing a 1 GB file can take thousands of seconds. Random access is another desirable feature for compression algorithms. Therefore, there is a need for a fast lossless compressor with a reasonable compression rate and random-access functionality. Results This paper proposes a Fast and Concurrent Lossless Quality scores Compressor (FCLQC) that supports random access and achieves a low running time through concurrent programming. Experimental results reveal that FCLQC is significantly faster than the baseline compressors on both compression and decompression, at the expense of compression ratio. Compared to LCQS (the baseline quality score compression algorithm), FCLQC shows at least a 31x compression speed improvement in all settings, with a compression-ratio degradation of at most 13.58% (8.26% on average). Compared to general-purpose compressors (such as 7-zip), FCLQC compresses 3x faster while achieving compression ratios that are better by at least 2.08% (4.69% on average). Moreover, its random-access decompression speed also outperforms the others. The concurrency of FCLQC is implemented in Rust; the performance gain increases near-linearly with the number of threads. Conclusion The superiority of its compression and decompression speed makes FCLQC a practical lossless quality score compressor candidate for speed-sensitive applications of DNA sequencing data. FCLQC is available at https://github.com/Minhyeok01/FCLQC and is free for non-commercial use.
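
FCLQC itself is implemented in Rust and its exact block format is not described here; the following Python sketch only illustrates the general pattern the abstract describes: splitting the quality-score stream into blocks, compressing blocks concurrently, and keeping a per-block index so any block can be decompressed independently (random access). zlib is a stand-in codec, not FCLQC's coder, and the block size and thread count are arbitrary illustrative choices.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1 << 20  # 1 MiB of quality scores per block (illustrative choice)

def _compress_block(block: bytes) -> bytes:
    # Stand-in codec; FCLQC uses its own coder, not zlib.
    # zlib releases the GIL while compressing, so threads give real parallelism here.
    return zlib.compress(block, 6)

def compress_quality_stream(data: bytes, threads: int = 4):
    """Compress `data` block by block in parallel and build a random-access index."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        compressed = list(pool.map(_compress_block, blocks))
    index, offset = [], 0
    for chunk in compressed:            # (offset, length) per block enables random access
        index.append((offset, len(chunk)))
        offset += len(chunk)
    return b"".join(compressed), index

def decompress_block(payload: bytes, index, block_id: int) -> bytes:
    """Decompress a single block without touching the rest of the file."""
    offset, length = index[block_id]
    return zlib.decompress(payload[offset:offset + length])
```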

Author(s):  
Kamal Al-Khayyat ◽  
Imad Al-Shaikhli ◽  
Mohamad Al-Hagery

This paper examines a particular case of data compression in which a second compression algorithm removes the redundancy that arises when edge-based compression algorithms compress (previously compressed) pixelated images. The newly created redundancy can be removed using another round of compression. This work used JPEG-LS as an example of an edge-based compression algorithm for compressing pixelated images. The output of this process was subjected to another round of compression using a more robust but slower compressor (PAQ8f). The compression ratio of the second compression was, on average, 18%, which is high for random data. The results of the second compression were superior to lossy JPEG: on the data set used, lossy JPEG needs to sacrifice about 10% on average to reach the nearly lossless compression ratios of the two successive compressions. To generalize the results, fast general-purpose compression algorithms (7z, bz2, and Gzip) were also used.
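
As an illustration of how a second-pass figure like the reported 18% is measured (this is not the authors' pipeline), the sketch below computes the fraction of size saved by recompressing already-compressed data; lzma stands in for PAQ8f, and the input is assumed to be the output of the edge-based first stage (JPEG-LS in the paper).

```python
import lzma

def second_pass_saving(first_pass_output: bytes) -> float:
    """Fraction of size saved by recompressing already-compressed data.

    `first_pass_output` is assumed to be the edge-based first-stage output
    (JPEG-LS in the paper); lzma stands in for the stronger-but-slower
    second-stage compressor (PAQ8f in the paper).
    """
    second = lzma.compress(first_pass_output, preset=9)
    return 1.0 - len(second) / len(first_pass_output)
```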


Author(s):  
Felix Hanau ◽  
Hannes Röst ◽  
Idoia Ochoa

Abstract Motivation Mass spectrometry data, used for proteomics and metabolomics analyses, have seen considerable growth in recent years. Aiming to reduce the associated storage costs, dedicated compression algorithms for mass spectrometry (MS) data have been proposed, such as MassComp and MSNumpress. However, these algorithms focus on either lossless or lossy compression, respectively, and do not exploit the additional redundancy existing across scans contained in a single file. We introduce mspack, a compression algorithm for MS data that exploits this additional redundancy and that supports both lossless and lossy compression, as well as the mzML and the legacy mzXML formats. mspack applies several lossless preprocessing transforms and optional lossy transforms with a configurable error, followed by the general-purpose compressors gzip or bsc, to achieve a higher compression ratio. Results We tested mspack on several datasets generated by commonly used mass spectrometry instruments. When used with the bsc compression backend, mspack achieves on average 76% smaller file sizes for lossless compression and 94% smaller file sizes for lossy compression, as compared to the original files. Lossless mspack achieves 10-60% lower file sizes than MassComp, and lossy mspack compresses 36-60% better than the lossy MSNumpress for the same error, while exhibiting comparable accuracy and running time. Availability mspack is implemented in C++ and freely available at https://github.com/fhanau/mspack under the Apache license. Supplementary information Supplementary data are available at Bioinformatics online.
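
The key idea of error-bounded lossy transforms followed by a general-purpose backend can be sketched as below. This is not mspack's actual transform set (those exploit cross-scan redundancy and use the bsc backend); it is a minimal illustration of quantizing values to a configurable absolute error and then applying a gzip-style compressor.

```python
import struct
import zlib

def compress_values(values, abs_error: float = 0.0) -> bytes:
    """Optionally quantize to a configurable absolute error, then pack and deflate."""
    if abs_error > 0:
        step = 2 * abs_error
        # Each reconstructed value differs from the original by at most abs_error.
        values = [round(v / step) * step for v in values]
    packed = struct.pack(f"<{len(values)}d", *values)
    return zlib.compress(packed, 9)

def decompress_values(blob: bytes):
    packed = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(packed) // 8}d", packed))
```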


BMC Genomics ◽  
2019 ◽  
Vol 20 (S10) ◽  
Author(s):  
Tao Tang ◽  
Yuansheng Liu ◽  
Buzhong Zhang ◽  
Benyue Su ◽  
Jinyan Li

Abstract Background The rapid development of Next-Generation Sequencing technologies enables sequencing genomes at low cost. The dramatically increasing amount of sequencing data has created a pressing need for efficient compression algorithms. Reference-based compression algorithms have exhibited outstanding performance in compressing single genomes. However, for the more challenging and more useful problem of compressing a large collection of n genomes, straightforward application of these reference-based algorithms suffers from a series of issues such as difficult reference selection and large performance variation. Results We propose an efficient clustering-based reference selection algorithm for reference-based compression within separate clusters of the n genomes. This method clusters the genomes into subsets of highly similar genomes using the MinHash sketch distance and uses the centroid sequence of each cluster as the reference genome for reference-based compression of the remaining genomes in that cluster. A final reference is then selected from these reference genomes for the compression of the remaining reference genomes. Our method significantly improved the performance of state-of-the-art compression algorithms on large-scale human and rice genome databases containing thousands of genome sequences. The compression ratio gain can reach 20-30% in most cases for the datasets from NCBI, the 1000 Genomes Project and the 3000 Rice Genomes Project. The best improvement boosts the performance from 351.74-fold to 443.51-fold compression. Conclusions The compression ratio of reference-based compression on large-scale genome datasets can be improved via reference selection by applying appropriate data preprocessing and clustering methods. Our algorithm provides an efficient way to compress large genome databases.
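
The MinHash sketch distance mentioned in the abstract can be computed as in the following sketch (an illustration with generic choices of k-mer size, sketch size and hash function, not the authors' implementation): each genome is reduced to its smallest k-mer hashes, and the distance between two genomes is one minus the Jaccard similarity estimated from the sketches.

```python
import hashlib

def minhash_sketch(seq: str, k: int = 21, sketch_size: int = 1000):
    """Return the sketch_size smallest 64-bit k-mer hashes of a sequence."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.md5(seq[i:i + k].encode()).digest()  # md5 is just an illustrative hash
        hashes.add(int.from_bytes(digest[:8], "little"))
    return sorted(hashes)[:sketch_size]

def sketch_distance(s1, s2) -> float:
    """1 - Jaccard similarity estimated from the bottom of the merged sketches."""
    a, b = set(s1), set(s2)
    merged = sorted(a | b)[:max(len(s1), len(s2))]
    shared = sum(1 for h in merged if h in a and h in b)
    return 1.0 - (shared / len(merged) if merged else 0.0)
```

Pairwise sketch distances of this kind can then be fed to any standard distance-based clustering method to group highly similar genomes before reference selection.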


2018 ◽  
Vol 35 (15) ◽  
pp. 2674-2676 ◽  
Author(s):  
Shubham Chandak ◽  
Kedar Tatwawadi ◽  
Idoia Ochoa ◽  
Mikel Hernaez ◽  
Tsachy Weissman

Abstract Motivation High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to completely exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack support for one or more crucial properties, such as support for variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression. Results In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long-read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole-genome human FASTQ from Illumina’s NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. Availability and implementation SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Miaoshan Lu ◽  
Shaowei An ◽  
Ruimin Wang ◽  
Jinyin Wang ◽  
Changbin Yu

Abstract With the increasing precision of mass spectrometers and the emergence of data-independent acquisition (DIA), file sizes are growing rapidly. Beyond the widely used open format mzML (Deutsch 2008), near-lossless and lossless compression algorithms and formats have emerged. The data precision is often related to the instrument and the subsequent processing algorithms. Unlike storage-oriented formats, which focus more on lossless compression and compression rate, computation-oriented formats focus as much on decoding speed and disk read strategy as on compression rate. Here we describe "Aird", an open-source, computation-oriented format with controllable precision, flexible indexing strategies and a high compression rate. Aird uses JavaScript Object Notation (JSON) for metadata storage, and multiple indexing and reordered storage strategies for faster random reading of data. Aird also provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data compression. Compared with Zlib alone, m/z data is about 65% smaller in Aird and takes merely 33% of the decoding time. Availability The Aird SDK is written in Java, which allows researchers to access mass spectrometry data efficiently. It is available at https://github.com/Propro-Studio/Aird-SDK. AirdPro, which converts vendor files into Aird files, is available at https://github.com/Propro-Studio/AirdPro
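
ZDPD chains delta coding, PForDelta bit packing and Zlib; the sketch below (a simplification that omits the PForDelta stage) shows why the combination helps: m/z values within a scan are sorted, so after scaling to integers their deltas are small and highly compressible.

```python
import struct
import zlib

def zd_compress(mz_values, scale: int = 100000) -> bytes:
    """Scale sorted m/z floats to integers, delta-encode, then Zlib-compress."""
    ints = sorted(round(v * scale) for v in mz_values)
    if not ints:
        return zlib.compress(b"")
    deltas = [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]
    packed = struct.pack(f"<{len(deltas)}q", *deltas)
    return zlib.compress(packed, 6)

def zd_decompress(blob: bytes, scale: int = 100000):
    packed = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(packed) // 8}q", packed)
    values, acc = [], 0
    for d in deltas:
        acc += d
        values.append(acc / scale)
    return values
```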


Author(s):  
Gody Mostafa ◽  
Abdelhalim Zekry ◽  
Hatem Zakaria

When transmitting data in digital communication, it is desirable that the number of transmitted bits be as small as possible, so many techniques are used to compress the data. Lempel-Ziv is one of the most commonly used lossless data compression algorithms. In this paper, a Lempel-Ziv data compression algorithm was implemented through VHDL coding. The work in this paper is devoted to improving the compression rate, space saving, and utilization of the Lempel-Ziv algorithm using a systolic array approach. The developed design is validated with VHDL simulations using Xilinx ISE 14.5 and synthesized on a Virtex-6 FPGA chip. The results show that our design is efficient in providing high compression rates and space-saving percentages as well as improved utilization. Throughput is increased by 50% and the design area is decreased by more than 23%, with a high compression ratio compared to comparable previous designs.
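
For readers unfamiliar with the algorithm being put into hardware, the sketch below shows a plain software LZ77 variant (illustrative only, not the paper's VHDL design): the inner search for the longest match inside a sliding window is the step that a systolic array can carry out in parallel, roughly one window position per processing element.

```python
def lz77_compress(data: bytes, window: int = 255, min_match: int = 3):
    """Greedy LZ77: emit (offset, length, next_byte) triples."""
    i, out = 0, []
    while i < len(data):
        best_len = best_off = 0
        # This candidate-position loop is the part a systolic array parallelizes in hardware.
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data) and data[j + length] == data[i + length]
                   and length < 255):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:
            nxt = data[i + best_len] if i + best_len < len(data) else None
            out.append((best_off, best_len, nxt))
            i += best_len + 1
        else:
            out.append((0, 0, data[i]))
            i += 1
    return out

def lz77_decompress(triples) -> bytes:
    out = bytearray()
    for off, length, nxt in triples:
        for _ in range(length):
            out.append(out[-off])
        if nxt is not None:
            out.append(nxt)
    return bytes(out)
```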


2021 ◽  
Author(s):  
Qingxi Meng ◽  
Shubham Chandak ◽  
Yifan Zhu ◽  
Tsachy Weissman

Motivation: The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, provide results in real time and produce long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files. The previous tool ENANO focuses mostly on quality score compression and does not achieve significant gains for the compression of read sequences over general-purpose compressors. RENANO achieves significantly better compression for read sequences but is limited to aligned data with a reference available. Results: We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high-quality nanopore datasets, NanoSpring achieves close to a 3x improvement in compression over state-of-the-art reference-free compressors. The computational requirements of NanoSpring are practical, although it uses more time and memory during compression than previous tools to achieve the compression gains. Availability: NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring.


2018 ◽  
Vol 7 (2.31) ◽  
pp. 69 ◽  
Author(s):  
G Murugesan ◽  
Rosario Gilmary

Text files occupy a substantial amount of memory or disk space, and transmission of these files across a network requires a considerable amount of bandwidth. Compression procedures are particularly advantageous in telecommunications and information technology because they allow devices to transmit or store the same amount of data in fewer bits. Text compression techniques segment an English passage by observing patterns and provide alternative symbols for larger patterns of text. Compression algorithms are used to reduce the volume of stored information and the associated data storage expenditure. Compressing large collections of information can also lead to improvements in retrieval time. Novel lossless compression algorithms have been introduced for better compression ratios. In this work, the various existing compression mechanisms that are specific to compressing text files and deoxyribonucleic acid (DNA) sequence files are analyzed. Their performance is compared in terms of compression ratio, time taken to compress/decompress the sequence, and file size. In the proposed work, the input file is converted to DNA format and then a DNA compression procedure is applied.
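
The abstract does not spell out the DNA compression procedure; a common baseline for DNA sequence data, sketched below as an assumption rather than the paper's exact method, is 2-bit packing of the four bases (A, C, G, T), which alone yields a 4:1 reduction over 8-bit ASCII before any entropy coding.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack_dna(seq: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for j, base in enumerate(seq[i:i + 4]):
            byte |= CODES[base] << (2 * j)
        out.append(byte)
    return bytes(out)

def unpack_dna(packed: bytes, length: int) -> str:
    """Invert pack_dna; `length` recovers the original number of bases."""
    bases = [BASES[(byte >> (2 * j)) & 3] for byte in packed for j in range(4)]
    return "".join(bases[:length])
```

Real DNA compressors additionally handle ambiguous bases such as N and usually follow packing with an entropy coder.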


Author(s):  
Gerald Schaefer ◽  
Roman Starosolski

In this article, we present experiments aimed at identifying a suitable compression algorithm for colour retina images (Schaefer & Starosolski, 2006). Such an algorithm, in order to prove useful in a real-life PACS, should not only reduce the file size of the images significantly but also be fast enough, both for compression and decompression. Furthermore, it should be covered by international standards such as ISO standards and, in particular for medical imaging, the Digital Imaging and Communications in Medicine (DICOM) standard (Mildenberger, Eichelberg, & Martin, 2002; National Electrical Manufacturers Association, 2004). For our study, we therefore selected those compression algorithms that are supported in DICOM, namely TIFF PackBits (Adobe Systems Inc., 1995), Lossless JPEG (Langdon, Gulati, & Seiler, 1992), JPEG-LS (ISO/IEC, 1999), and JPEG2000 (ISO/IEC, 2002). For comparison, we also included CALIC (Wu, 1997), which is often employed for benchmarking compression algorithms. All algorithms were evaluated in terms of compression ratio, which describes the reduction in file size, and speed. For speed, we consider both the time it takes to encode an image (compression speed) and to decode it (decompression speed), as both are relevant within a PACS.
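
The evaluation protocol, compression ratio plus compression and decompression speed, is straightforward to reproduce; the sketch below uses zlib purely as a stand-in codec to show how the three metrics are measured (the article itself compares the DICOM-supported codecs listed above plus CALIC).

```python
import time
import zlib

def benchmark_codec(image_bytes: bytes) -> dict:
    """Measure compression ratio, compression speed and decompression speed."""
    t0 = time.perf_counter()
    compressed = zlib.compress(image_bytes, 9)   # stand-in for the codecs under test
    t1 = time.perf_counter()
    zlib.decompress(compressed)
    t2 = time.perf_counter()
    megabytes = len(image_bytes) / 1e6
    return {
        "compression_ratio": len(image_bytes) / len(compressed),
        "compression_speed_MB_per_s": megabytes / (t1 - t0),
        "decompression_speed_MB_per_s": megabytes / (t2 - t1),
    }
```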


2014 ◽  
Vol 12 (06) ◽  
pp. 1442002 ◽  
Author(s):  
Idoia Ochoa ◽  
Mikel Hernaez ◽  
Tsachy Weissman

With the release of the latest Next-Generation Sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing the whole genome of a human is expected to drop to a mere $1000. This milestone in sequencing history marks the era of affordable sequencing of individuals and opens the doors to personalized medicine. Accordingly, unprecedented volumes of genomic data will require storage for processing. There will be a dire need not only to compress aligned data, but also to generate compressed files that can be fed directly to downstream applications to facilitate analysis of and inference on the data. Several approaches to this challenge have been proposed in the literature; however, the focus thus far has been on the low-coverage regime and most of the suggested compressors are not based on effective modeling of the data. We demonstrate the benefit of data modeling for compressing aligned reads. Specifically, we show that, by working with data models designed for the aligned data, we can improve considerably over the best compression ratio achieved by previously proposed algorithms. Our results indicate that the Pareto-optimal barrier for compression rate and speed claimed by Bonfield and Mahoney (2013) [Bonfield JK and Mahoney MV, Compression of FASTQ and SAM format sequencing data, PLOS ONE, 8(3):e59190, 2013] does not apply for high-coverage aligned data. Furthermore, our improved compression ratio is achieved by splitting the data in a manner conducive to operations in the compressed domain by downstream applications.
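
The final point, splitting aligned data so that downstream tools can operate in the compressed domain, can be illustrated with the hedged sketch below (generic field names, not the authors' data models): each field of an aligned record goes into its own stream, positions are delta-coded because they are sorted, and each stream is compressed independently so it can also be decoded independently.

```python
import zlib

def compress_aligned_records(records):
    """records: iterable of (position, cigar, sequence, quality) tuples.

    Fields are split into separate streams; each homogeneous stream compresses
    better than interleaved records and can be decompressed on its own.
    """
    streams = {"pos": [], "cigar": [], "seq": [], "qual": []}
    previous_position = 0
    for position, cigar, sequence, quality in records:
        streams["pos"].append(str(position - previous_position))  # delta-code sorted positions
        previous_position = position
        streams["cigar"].append(cigar)
        streams["seq"].append(sequence)
        streams["qual"].append(quality)
    return {name: zlib.compress("\n".join(values).encode(), 9)
            for name, values in streams.items()}
```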

