mspack: efficient lossless and lossy mass spectrometry data compression

Author(s):  
Felix Hanau ◽  
Hannes Röst ◽  
Idoia Ochoa

Abstract Motivation Mass spectrometry data, used for proteomics and metabolomics analyses, have grown considerably in recent years. Aiming at reducing the associated storage costs, dedicated compression algorithms for mass spectrometry (MS) data have been proposed, such as MassComp and MSNumpress. However, these algorithms focus on either lossless or lossy compression, respectively, and do not exploit the additional redundancy existing across scans contained in a single file. We introduce mspack, a compression algorithm for MS data that exploits this additional redundancy and supports both lossless and lossy compression, as well as the mzML and the legacy mzXML formats. mspack applies several lossless preprocessing transforms and optional lossy transforms with a configurable error, followed by the general-purpose compressors gzip or bsc, to achieve a higher compression ratio. Results We tested mspack on several datasets generated by commonly used MS instruments. When used with the bsc compression backend, mspack achieves on average 76% smaller file sizes for lossless compression and 94% smaller file sizes for lossy compression, as compared to the original files. Lossless mspack achieves 10-60% smaller file sizes than MassComp, and lossy mspack compresses 36-60% better than the lossy MSNumpress for the same error, while exhibiting comparable accuracy and running time. Availability mspack is implemented in C++ and freely available at https://github.com/fhanau/mspack under the Apache license. Supplementary information Supplementary data are available at Bioinformatics online.
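
The following is a minimal sketch of the cross-scan redundancy idea described above, assuming consecutive scans share most of their m/z grid; zlib stands in for the gzip/bsc backends, and none of this reflects mspack's actual transforms.

```python
import struct
import zlib

def pack(values):
    """Serialize a list of unsigned 32-bit ints to little-endian bytes."""
    return struct.pack(f"<{len(values)}I", *values)

# Simulated m/z grids: consecutive scans share most of their axis.
base = [10_000 + 7 * i for i in range(5_000)]
scans = [[v + s for v in base] for s in range(50)]

raw = b"".join(pack(s) for s in scans)

# Cross-scan delta: store the first scan as-is and every later scan as
# its difference from the previous one (small, highly repetitive ints).
delta_scans = [scans[0]] + [
    [(cur - prev) & 0xFFFFFFFF for prev, cur in zip(p, c)]
    for p, c in zip(scans, scans[1:])
]
delta = b"".join(pack(s) for s in delta_scans)

print("zlib on raw bytes  :", len(zlib.compress(raw, 9)))
print("zlib on delta bytes:", len(zlib.compress(delta, 9)))
```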

2020 ◽  
Author(s):  
Miaoshan Lu ◽  
Shaowei An ◽  
Ruimin Wang ◽  
Jinyin Wang ◽  
Changbin Yu

ABSTRACT With the precision of mass spectrometers increasing and the emergence of data-independent acquisition (DIA), file sizes are growing rapidly. Beyond the widely used open format mzML (Deutsch 2008), near-lossless and lossless compression algorithms and formats have emerged. The data precision is often related to the instrument and subsequent processing algorithms. Unlike storage-oriented formats, which focus more on lossless compression and compression rate, computation-oriented formats focus as much on decoding speed and disk read strategy as on compression rate. Here we describe "Aird", an open-source and computation-oriented format with controllable precision, flexible indexing strategies, and a high compression rate. Aird uses JavaScript Object Notation (JSON) for metadata storage, together with multiple indexing and reordered storage strategies for faster random reads. Aird also provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data compression. Compared with Zlib alone, m/z data are about 65% smaller in Aird and take merely 33% of the decoding time. Availability The Aird SDK is written in Java, allowing researchers to access mass spectrometry data efficiently. It is available at https://github.com/Propro-Studio/Aird-SDK. AirdPro, which converts vendor files into Aird files, is available at https://github.com/Propro-Studio/AirdPro
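
A simplified illustration of the ZDPD concept as described: integerize m/z values, delta-encode, pack into small codes, and only then apply zlib. The varint packing below is a stand-in assumption for the real PFor-delta bit packing.

```python
import zlib

def varint(n):
    """Encode a non-negative int as a variable-length byte sequence."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

# m/z values scaled to integers at a fixed precision (here 1e-4 Da).
mz = sorted(100.0 + 0.01 * i + (i % 7) * 1e-4 for i in range(100_000))
ints = [round(v * 10_000) for v in mz]

plain = b"".join(varint(v) for v in ints)
deltas = [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]
packed = b"".join(varint(d) for d in deltas)

print("zlib only       :", len(zlib.compress(plain, 9)))
print("diff+varint+zlib:", len(zlib.compress(packed, 9)))
```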


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Minhyeok Cho ◽  
Albert No

Abstract Background Advances in sequencing technology have drastically reduced sequencing costs, and as a result the amount of sequencing data is growing explosively. Since FASTQ files (the standard sequencing data format) are huge, there is a need for efficient compression of FASTQ files, especially of quality scores. Several quality score compression algorithms have recently been proposed, mainly focused on lossy compression to further boost the compression rate. However, for clinical applications and archiving purposes, lossy compression cannot replace lossless compression. One of the main challenges for lossless compression is running time: compressing a 1 GB file can take thousands of seconds. Random access is another desirable feature. There is therefore a need for a fast lossless compressor with a reasonable compression rate and random access functionality. Results This paper proposes a Fast and Concurrent Lossless Quality scores Compressor (FCLQC) that supports random access and achieves low running times through concurrent programming. Experimental results reveal that FCLQC is significantly faster than the baseline compressors at both compression and decompression, at the expense of compression ratio. Compared to LCQS (the baseline quality score compression algorithm), FCLQC is at least 31× faster in all settings, with a compression-ratio degradation of at most 13.58% (8.26% on average). Compared to general-purpose compressors (such as 7-zip), FCLQC compresses 3× faster while achieving compression ratios better by at least 2.08% (4.69% on average). Moreover, the speed of random access decompression also outperforms the others. The concurrency of FCLQC is implemented in Rust; the performance gain increases near-linearly with the number of threads. Conclusion The superior compression and decompression speed makes FCLQC a practical lossless quality score compressor candidate for speed-sensitive applications of DNA sequencing data. FCLQC is available at https://github.com/Minhyeok01/FCLQC and is freely available for non-commercial usage.
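
A sketch of FCLQC's two headline features, block-level concurrency and random access, using Python's multiprocessing as a stand-in for the Rust implementation; the block size and index layout are assumptions, not FCLQC's format.

```python
import zlib
from multiprocessing import Pool

BLOCK = 1 << 16  # 64 KiB of quality scores per independently coded block

def compress_block(chunk: bytes) -> bytes:
    return zlib.compress(chunk, 6)

def main():
    qualities = (b"IIIIIHHGG#FFFFFHHHHHIIII" * 100_000)[: 8 * BLOCK]
    blocks = [qualities[i:i + BLOCK] for i in range(0, len(qualities), BLOCK)]

    with Pool() as pool:                 # blocks compress in parallel
        comp = pool.map(compress_block, blocks)

    index = [len(c) for c in comp]       # per-block sizes enable seeking
    # Random access: decompress only block 5, touching no other block.
    assert zlib.decompress(comp[5]) == blocks[5]
    print(f"{len(qualities)} -> {sum(index)} bytes in {len(blocks)} blocks")

if __name__ == "__main__":
    main()
```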


Author(s):  
Marcela Aguilera Flores ◽  
Iulia M Lazar

Abstract Summary The ‘Unknown Mutation Analysis (XMAn)’ database is a compilation of Homo sapiens mutated peptides in FASTA format, constructed to facilitate the identification of protein sequence alterations by tandem mass spectrometry. The database comprises 2 539 031 non-redundant mutated entries from 17 599 proteins, of which 2 377 103 are missense and 161 928 are nonsense mutations. It can be used in conjunction with search engines that identify peptide amino acid sequences by matching experimental tandem mass spectrometry data to theoretical sequences from a database. Availability and implementation XMAn v2 can be accessed from github.com/lazarlab/XMAnv2. Supplementary information Supplementary data are available at Bioinformatics online.
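
As a hypothetical usage sketch, a search pipeline could subset XMAn v2 before a database search by streaming the FASTA and keeping only missense entries; the header tag checked below is an assumption for illustration, not the documented XMAn v2 annotation scheme.

```python
def fasta_records(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def write_missense(src, dst):
    """Copy only entries whose header mentions a missense mutation."""
    with open(dst, "w") as out:
        for header, seq in fasta_records(src):
            if "missense" in header.lower():   # assumed header tag
                out.write(f"{header}\n{seq}\n")
```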


2018 ◽  
Vol 35 (15) ◽  
pp. 2674-2676 ◽  
Author(s):  
Shubham Chandak ◽  
Kedar Tatwawadi ◽  
Idoia Ochoa ◽  
Mikel Hernaez ◽  
Tsachy Weissman

Abstract Motivation High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to fully exploit the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most lack one or more crucial features, such as support for variable-length reads, scalability to high-coverage datasets, pairing-preserving compression or lossless compression. Results In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long-read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina’s NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. Availability and implementation SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. Supplementary information Supplementary data are available at Bioinformatics online.
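
A minimal illustration of the structure argument above: splitting a FASTQ file into its three homogeneous streams (identifiers, bases, qualities) and compressing each separately often already beats compressing the interleaved file. SPRING's actual gains come from far stronger read reordering and assembly-like modeling; this sketch assumes a well-formed 4-line-per-record FASTQ.

```python
import gzip
import itertools

def split_streams(path):
    """Separate a FASTQ file into its ID, sequence and quality streams."""
    ids, seqs, quals = [], [], []
    with open(path, "rb") as fh:
        for rec in itertools.zip_longest(*[fh] * 4):  # 4 lines per record
            ids.append(rec[0])
            seqs.append(rec[1])
            quals.append(rec[3])
    return b"".join(ids), b"".join(seqs), b"".join(quals)

def compare(path):
    whole = gzip.compress(open(path, "rb").read(), 9)
    parts = [gzip.compress(s, 9) for s in split_streams(path)]
    print("interleaved:", len(whole), " split:", sum(len(p) for p in parts))
```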


2016 ◽  
Vol 2016 ◽  
pp. 1-7 ◽  
Author(s):  
Pamela Vinitha Eric ◽  
Gopakumar Gopalakrishnan ◽  
Muralikrishnan Karunakaran

This paper proposes a seed-based lossless compression algorithm for DNA sequences that uses a substitution method similar to the Lempel-Ziv compression scheme. The proposed method exploits the repeat structures inherent in DNA sequences by creating an offline dictionary that contains all such repeats along with the details of mismatches. By ensuring that only promising mismatches are allowed, the method achieves a compression ratio that is on par with or better than existing lossless DNA sequence compression algorithms.
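
A toy version of the seed-and-substitute idea under stated assumptions (fixed seed length, a bounded number of mismatches, no self-overlapping copies); the paper's method builds an offline dictionary of repeats, whereas this sketch uses a greedy online scan for brevity.

```python
SEED = 12          # assumed seed length
MAX_MISMATCH = 2   # assumed mismatch budget per copy

def encode(seq):
    """Emit ('copy', ref, length, mismatches) or ('literal', base) tokens."""
    out, pos, seen = [], 0, {}
    while pos < len(seq):
        seed = seq[pos:pos + SEED]
        ref = seen.get(seed)
        if ref is not None:
            length, mm = SEED, []
            while pos + length < len(seq) and ref + length < pos:
                if seq[ref + length] != seq[pos + length]:
                    if len(mm) == MAX_MISMATCH:
                        break
                    mm.append((length, seq[pos + length]))
                length += 1
            out.append(("copy", ref, length, mm))
            pos += length
        else:
            if len(seed) == SEED:
                seen.setdefault(seed, pos)
            out.append(("literal", seq[pos]))
            pos += 1
    return out

print(encode("ACGTACGTGGATCCACGTACGTGGATCCAACGTACGTGGATCCA"))
```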


2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Miaoshan Lu ◽  
Shaowei An ◽  
Ruimin Wang ◽  
Jinyin Wang ◽  
Changbin Yu

Abstract Background As the precision of mass spectrometry (MS) increases, MS file sizes grow rapidly. Beyond the widely used open format mzML, near-lossless and lossless compression algorithms and formats have emerged for scenarios with different precision requirements. The data precision is often related to the instrument and subsequent processing algorithms. Unlike storage-oriented formats, which focus more on the lossless compression rate, computation-oriented formats concentrate as much on decoding speed as on compression rate. Results Here we introduce "Aird", an open-source and computation-oriented format with controllable precision, flexible indexing strategies, and a high compression rate. Aird provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data. Compared with Zlib alone, m/z data are about 55% smaller in Aird on average. Thanks to the high-speed decoding and encoding performance of the single-instruction-multiple-data (SIMD) technology used in ZDPD, Aird takes merely 33% of the decoding time of Zlib. We downloaded seven datasets from ProteomeXchange and MetaboLights, acquired on different SCIEX, Thermo, and Agilent instruments, converted the raw data into the mzML, mgf, and mz5 file formats with MSConvert, and compared them with the Aird format. Aird uses JavaScript Object Notation (JSON) for metadata storage. The Aird SDK is written in Java, and AirdPro, a GUI client for converting vendor files, is written in C#. They are freely available at https://github.com/CSi-Studio/Aird-SDK and https://github.com/CSi-Studio/AirdPro. Conclusions With the innovation of MS acquisition modes, the characteristics of MS data are also constantly changing. New data features can enable more effective compression methods and new index modes that achieve high search performance, and MS data storage will become increasingly professional and customized. ZDPD exploits multiple digital features of MS data, and researchers can also use it in other formats such as mzML. Aird is designed to be a computation-oriented data format with high scalability, a high compression rate, and fast decoding speed.
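
A sketch of the computation-oriented layout both Aird papers emphasize: compressed scans written contiguously plus a JSON index of (offset, length) pairs, so a random read needs one seek and one decompress. The field names and layout here are illustrative, not the Aird specification.

```python
import json
import zlib

# Fake per-scan payloads standing in for encoded m/z and intensity data.
scans = [bytes((i * j) % 251 for j in range(20_000)) for i in range(100)]

blob, index = bytearray(), []
for s in scans:
    comp = zlib.compress(s, 6)
    index.append({"offset": len(blob), "length": len(comp)})
    blob += comp

# JSON metadata carries the index, as in a computation-oriented format.
meta = json.dumps({"scanCount": len(scans), "index": index})

# Random access: decode scan 42 without touching any other scan.
entry = json.loads(meta)["index"][42]
scan42 = zlib.decompress(bytes(blob[entry["offset"]:entry["offset"] + entry["length"]]))
assert scan42 == scans[42]
```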


Author(s):  
Kamal Al-Khayyat ◽  
Imad Al-Shaikhli ◽  
Mohamad Al-Hagery

This paper examines a particular case of data compression in which a compression algorithm removes the redundancy that arises when edge-based compression algorithms compress (previously compressed) pixelated images. This newly created redundancy can be removed by another round of compression. The work uses JPEG-LS as an example of an edge-based compression algorithm for compressing pixelated images. The output of this process was subjected to a second round of compression using a more powerful but slower compressor (PAQ8f). The compression ratio of the second compression was, on average, about 18%, which is high for random data. The results of the two successive compressions were superior to lossy JPEG: on the dataset used, lossy JPEG had to sacrifice about 10% on average to approach the compression ratios of the two successive lossless compressions. To generalize the results, fast general-purpose compression algorithms (7z, bz2, and Gzip) were also used.
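
A sketch of the double-compression experiment using only general-purpose codecs (JPEG-LS and PAQ8f are external tools): compress a block-upscaled "pixelated" image once, then feed the output to a second, different compressor and measure whether anything is left to gain. The image and codec pairing are assumptions for illustration.

```python
import bz2
import zlib

# Fake pixelated image: 8x8 blocks of constant gray values.
W = H = 256
BLOCK = 8
pixels = bytes(((x // BLOCK) * 31 + (y // BLOCK) * 17) % 256
               for y in range(H) for x in range(W))

first = zlib.compress(pixels, 9)    # stands in for JPEG-LS
second = bz2.compress(first, 9)     # stands in for PAQ8f

print("raw   :", len(pixels))
print("pass 1:", len(first))
print("pass 2:", len(second), "(smaller means pass 1 left redundancy)")
```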


14C differs from other nuclides measured by accelerator mass spectrometry (AMS) in that an extensive database of dates already exists; AMS dates should therefore have comparable accuracy, and the measurement of isotopic ratios to better than 1%, which was an important technical goal, has been achieved. The main advantage of being able to date samples 1000 times smaller than previously lies in the extra selectivity that can be employed, and this is reflected in the results and applications. Selection can apply at several levels: from objects that formerly contained too little carbon, to the choice of archaeological material, to the extraction of specific chemical compounds from a complex environmental sample. This is particularly useful in removing uncertainty regarding the validity of a date, since a given sample may comprise carbon atoms from different sources, each with its own 14C ‘age’. Examples from archaeological and environmental research illustrating these points are given. 14C dating by AMS differs from conventional radiocarbon dating in having the potential to measure much lower levels of 14C, and therefore should double the time span of the method. This potential has not yet been realized because of sample contamination effects, and work in progress to reduce these is described.
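
A back-of-the-envelope check of the "double the time span" claim: radiocarbon age follows t = (T_half / ln 2) * ln(R0 / R), so lowering the smallest measurable 14C/12C ratio from roughly 1e-3 to 1e-6 of the modern level about doubles the reachable age. The detection limits below are illustrative round numbers, not instrument specifications.

```python
import math

T_HALF = 5730.0  # 14C half-life in years

def max_age(min_ratio):
    """Oldest datable age for a given smallest measurable 14C ratio."""
    return T_HALF / math.log(2) * math.log(1.0 / min_ratio)

for r in (1e-3, 1e-6):
    print(f"detection limit {r:.0e} of modern -> ~{max_age(r):,.0f} yr")
```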


2020 ◽  
Vol 36 (16) ◽  
pp. 4423-4431 ◽  
Author(s):  
Wenbo Xu ◽  
Yan Tian ◽  
Siye Wang ◽  
Yupeng Cui

Abstract Motivation The classification of high-throughput protein data based on mass spectrometry (MS) is of great practical significance in medical diagnosis. MS data are generally characterized by high dimensionality, which inevitably leads to prohibitive computational cost. To address this problem, one-bit compressed sensing (CS), an extreme case of quantized CS, has been employed on MS data to select important features and reduce the dimension. Although it achieves a remarkable reduction in computational complexity, the current one-bit CS method neither accounts for the unavoidable noise contained in MS datasets nor exploits the inherent structure of the underlying MS data. Results We propose two feature selection (FS) methods based on one-bit CS to deal with the noise and the underlying block-sparsity features, respectively. In the first method, the FS problem is modeled as a perturbed one-bit CS problem, where the perturbation represents the noise in the MS data. By iterating between perturbation refinement and FS, this method selects the significant features from noisy data. The second method formulates the problem as a perturbed one-bit block CS problem and selects the features block by block. The block extraction is motivated by the fact that the significant features selected by the first method tend to cluster in groups. Experiments show that the two proposed methods achieve better classification performance on real MS data than the existing method, and that the second outperforms the first. Availability and implementation The source code of our methods is available at: https://github.com/tianyan8023/OBCS. Supplementary information Supplementary data are available at Bioinformatics online.
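
A minimal recovery loop for standard one-bit CS via binary iterative hard thresholding (BIHT), the baseline the paper builds on; the authors' perturbed and block variants add noise modeling on top of this and are not reproduced here.

```python
import numpy as np

def biht(y, A, k, iters=200, tau=1.0):
    """Recover a k-sparse unit vector from one-bit measurements y = sign(Ax)."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        x = x + (tau / m) * A.T @ (y - np.sign(A @ x))  # gradient step
        keep = np.argsort(np.abs(x))[-k:]               # hard threshold
        mask = np.zeros(n)
        mask[keep] = 1.0
        x *= mask
        x /= np.linalg.norm(x) + 1e-12  # sign measurements lose scale
    return x

rng = np.random.default_rng(0)
n, m, k = 200, 400, 5
x_true = np.zeros(n)
x_true[rng.choice(n, k, False)] = rng.standard_normal(k)
x_true /= np.linalg.norm(x_true)
A = rng.standard_normal((m, n))
y = np.sign(A @ x_true)             # one-bit (sign-only) measurements

x_hat = biht(y, A, k)
print("support recovered:",
      set(np.flatnonzero(x_hat)) == set(np.flatnonzero(x_true)))
```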


2019 ◽  
Author(s):  
Sebastian Deorowicz

Abstract Motivation The amount of genomic data that needs to be stored is huge. It is therefore not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect, and the best tools produce quite large archives. Results We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on ideas from the well-known prediction by partial matching and dynamic Markov coder algorithms from the world of general-purpose compressors. The compression ratios are often tens of percent better than those offered by the state-of-the-art tools. Availability and Implementation https://github.com/refresh-bio/ Contact [email protected] Supplementary information Supplementary data are available at the publisher’s Web site.
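
A sketch of the context-modeling idea behind PPM/DMC-style compressors such as FQSqueezer: an order-k model predicts each base from its k predecessors, and the Shannon entropy of those predictions bounds the achievable size. Real coders pair such models with arithmetic coding and much richer contexts; the order and smoothing below are assumptions.

```python
import math
from collections import Counter, defaultdict

def order_k_bits(seq, k=4):
    """Estimate the coded size of seq under an adaptive order-k model."""
    counts = defaultdict(Counter)
    bits = 0.0
    for i in range(len(seq)):
        ctx = seq[max(0, i - k):i]
        seen = counts[ctx]
        total = sum(seen.values()) + 4     # +1 smoothing over A, C, G, T
        p = (seen[seq[i]] + 1) / total
        bits += -math.log2(p)
        seen[seq[i]] += 1                  # update the model adaptively
    return bits

reads = "ACGTACGTACGGACGTACGTACGT" * 500
print(f"order-4 model: {order_k_bits(reads) / len(reads):.3f} bits/base "
      f"(raw encoding: 2.000)")
```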

