Aird: A computation-oriented mass spectrometry data format enables higher compression ratio and less decoding time

ABSTRACT: As mass spectrometer precision increases and data-independent acquisition (DIA) becomes common, file sizes are growing rapidly. Beyond the widely used open format mzML (Deutsch 2008), near-lossless and lossless compression algorithms and formats have emerged. The required data precision often depends on the instrument and on the subsequent processing algorithms. Unlike storage-oriented formats, which focus more on lossless compression and compression rate, computation-oriented formats focus as much on decoding speed and disk read strategy as on compression rate. Here we describe “Aird”, an open-source, computation-oriented format with controllable precision, flexible indexing strategies and a high compression rate. Aird uses JavaScript Object Notation (JSON) for metadata storage, and multiple indexing and reordered storage strategies for faster random reads. Aird also provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data compression. Compared with Zlib only, m/z data size is about 65% lower in Aird, and decoding takes only 33% of the time. Availability: Aird SDK is written in Java, which allows scholars to access mass spectrometry data efficiently. It is available at https://github.com/Propro-Studio/Aird-SDK. AirdPro, which converts vendor files into Aird files, is available at https://github.com/Propro-Studio/AirdPro

Aird: a computation-oriented mass spectrometry data format enables a higher compression ratio and less decoding time

BMC Bioinformatics ◽

10.1186/s12859-021-04490-0 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Miaoshan Lu ◽

Shaowei An ◽

Ruimin Wang ◽

Jinyin Wang ◽

Changbin Yu

Keyword(s):

Mass Spectrometry ◽

Data Storage ◽

High Speed ◽

Lossless Compression ◽

Mass Spectrometry Data ◽

Compression Rate ◽

Search Performance ◽

Data Format ◽

Link Type ◽

Decoding Speed

Abstract Background: As the precision of mass spectrometry (MS) increases, MS file sizes grow rapidly. Beyond the widely used open format mzML, near-lossless and lossless compression algorithms and formats have emerged for scenarios with different precision requirements. The required data precision often depends on the instrument and on the subsequent processing algorithms. Unlike storage-oriented formats, which focus more on lossless compression rate, computation-oriented formats concentrate as much on decoding speed as on compression rate. Results: Here we introduce “Aird”, an open-source, computation-oriented format with controllable precision, flexible indexing strategies, and a high compression rate. Aird provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data. Compared with Zlib only, m/z data size is about 55% lower in Aird on average. Thanks to the high-speed decoding and encoding of the single instruction, multiple data (SIMD) technology used in ZDPD, Aird takes only 33% of the decoding time of Zlib. We downloaded seven datasets from ProteomeXchange and MetaboLights, acquired on different SCIEX, Thermo, and Agilent instruments. We then converted the raw data into the mzML, mgf, and mz5 file formats with MSConvert and compared them with the Aird format. Aird uses JavaScript Object Notation for metadata storage. Aird-SDK is written in Java, and AirdPro is a GUI client for vendor file conversion written in C#. They are freely available at https://github.com/CSi-Studio/Aird-SDK and https://github.com/CSi-Studio/AirdPro. Conclusions: As MS acquisition modes evolve, the characteristics of MS data are constantly changing. New data features can enable more effective compression methods and new index modes that achieve high search performance. MS data storage will also become more specialized and customized. ZDPD exploits multiple numerical features of MS data, and researchers can also use it in other formats such as mzML. Aird is designed to be a computation-oriented data format with high scalability, a high compression rate, and fast decoding speed.
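The "Zlib-Diff" part of the ZDPD name can be illustrated with a minimal sketch. It assumes m/z values are sorted in ascending order within a spectrum and are stored as integers at a chosen precision; the function names, the default precision of 1e-4, and the 32-bit layout are illustrative assumptions, not the Aird-SDK API. The PforDelta and SIMD stages described in the abstract are deliberately omitted, so this is not the authors' compressor.

```python
import struct
import zlib


def encode_mz(mz_values, precision=1e-4):
    """Integerize ascending m/z values, delta-encode them, then zlib-compress.

    Assumes the input is sorted ascending, so every gap fits an unsigned int.
    """
    # Controllable precision: keep m/z as integer multiples of `precision`.
    ints = [round(mz / precision) for mz in mz_values]
    # Store the first value as-is and every later value as the gap to its predecessor.
    deltas = [b - a for a, b in zip([0] + ints, ints)]
    raw = struct.pack(f"<{len(deltas)}I", *deltas)
    return zlib.compress(raw)


def decode_mz(blob, precision=1e-4):
    """Invert encode_mz: decompress, then rebuild values with a running sum."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 4}I", raw)
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total * precision)
    return values
```

Because neighbouring m/z values are close together, the gaps are small integers with many repeated byte patterns, which is the kind of input a general-purpose compressor such as zlib tends to handle well.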

mzMLb: a future-proof raw mass spectrometry data format based on standards-compliant mzML and optimized for speed and storage requirements

10.1101/2020.02.13.947218 ◽

2020 ◽

Author(s):

Ranjeet S. Bhamber ◽

Andris Jankevics ◽

Eric W Deutsch ◽

Andrew R Jones ◽

Andrew W Dowsey

Keyword(s):

Mass Spectrometry ◽

Mass Spectrometry Data ◽

File Size ◽

Data Format ◽

File Access ◽

Data Interchange ◽

File Formats ◽

Reference Implementation ◽

And Storage ◽

Access Efficiency

Abstract: With ever-increasing amounts of data produced by mass spectrometry (MS) proteomics and metabolomics, and the sheer volume of samples now analyzed, the need for a common open format possessing both file size efficiency and faster read/write speeds has become paramount to drive the next generation of data analysis pipelines. The Proteomics Standards Initiative (PSI) has established a clear and precise XML representation for data interchange, mzML, receiving substantial uptake; nevertheless, storage and file access efficiency has not been the main focus. We propose an HDF5 file format ‘mzMLb’ that is optimised for both read/write speed and storage of the raw mass spectrometry data. We provide extensive validation of write speed, random read speed and storage size, demonstrating a flexible format that with or without compression is faster than all existing approaches in virtually all cases, while with compression, is comparable in size to proprietary vendor file formats. Since our approach uniquely preserves the XML encoding of the metadata, the format implicitly supports future versions of mzML and is straightforward to implement: mzMLb’s design adheres to both HDF5 and NetCDF4 standard implementations, which allows it to be easily utilised by third parties due to their widespread programming language support. A reference implementation within the established ProteoWizard toolkit is provided.

mspack: efficient lossless and lossy mass spectrometry data compression

Bioinformatics ◽

10.1093/bioinformatics/btab636 ◽

2021 ◽

Author(s):

Felix Hanau ◽

Hannes Röst ◽

Idoia Ochoa

Keyword(s):

Mass Spectrometry ◽

Lossless Compression ◽

Lossy Compression ◽

General Purpose ◽

Mass Spectrometry Data ◽

Supplementary Information ◽

Compression Algorithms ◽

Single File ◽

Comparable Accuracy ◽

Better Than

Abstract Motivation: Mass spectrometry data, used for proteomics and metabolomics analyses, have seen considerable growth in recent years. To reduce the associated storage costs, dedicated compression algorithms for mass spectrometry (MS) data have been proposed, such as MassComp and MSNumpress. However, these algorithms focus on either lossless or lossy compression, respectively, and do not exploit the additional redundancy existing across scans contained in a single file. We introduce mspack, a compression algorithm for MS data that exploits this additional redundancy and that supports both lossless and lossy compression, as well as the mzML and the legacy mzXML formats. mspack applies several lossless preprocessing transforms and optional lossy transforms with a configurable error, followed by the general-purpose compressors gzip or bsc, to achieve a higher compression ratio. Results: We tested mspack on several datasets generated by commonly used mass spectrometry instruments. When used with the bsc compression backend, mspack achieves on average 76% smaller file sizes for lossless compression and 94% smaller file sizes for lossy compression, as compared to the original files. Lossless mspack achieves 10-60% lower file sizes than MassComp, and lossy mspack compresses 36-60% better than the lossy MSNumpress for the same error, while exhibiting comparable accuracy and running time. Availability: mspack is implemented in C++ and freely available at https://github.com/fhanau/mspack under the Apache license. Supplementary information: Supplementary data are available at Bioinformatics online.
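The idea of a lossy transform with a configurable error followed by a general-purpose compressor can be sketched as below. This uses a plain uniform quantizer as a stand-in, which is an assumption for illustration only; it is not mspack's actual transform, and the function names and the 64-bit integer layout are hypothetical.

```python
import gzip
import struct


def lossy_compress(values, max_abs_error):
    """Quantize floats to a grid of width 2 * max_abs_error, then gzip the integers."""
    step = 2.0 * max_abs_error
    # Rounding to the nearest grid point moves each value by at most step / 2.
    quantized = [round(v / step) for v in values]
    raw = struct.pack(f"<{len(quantized)}q", *quantized)
    return gzip.compress(raw)


def lossy_decompress(blob, max_abs_error):
    """Rebuild approximate values from the quantized integers."""
    step = 2.0 * max_abs_error
    raw = gzip.decompress(blob)
    quantized = struct.unpack(f"<{len(raw) // 8}q", raw)
    return [q * step for q in quantized]
```

A larger `max_abs_error` maps more inputs onto the same integer, which generally makes the byte stream more repetitive and therefore easier for gzip to shrink, at the cost of accuracy.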

mzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements

Journal of Proteome Research ◽

10.1021/acs.jproteome.0c00192 ◽

2020 ◽

Vol 20 (1) ◽

pp. 172-183

Author(s):

Ranjeet S. Bhamber ◽

Andris Jankevics ◽

Eric W. Deutsch ◽

Andrew R. Jones ◽

Andrew W. Dowsey

Keyword(s):

Mass Spectrometry ◽

Mass Spectrometry Data ◽

Data Format ◽

And Storage

MassComp, a lossless compressor for mass spectrometry data

10.1101/542894 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ruochen Yang ◽

Xi Chen ◽

Idoia Ochoa

Keyword(s):

Mass Spectrometry ◽

Numerical Data ◽

Mass Spectrometry Data ◽

Compression Algorithms ◽

Compression Performance ◽

Average Improvement ◽

The Family ◽

Efficient Representation ◽

Cost Efficient ◽

Biology Research

Background: Mass spectrometry (MS) is a widely used technique in biology research, and has become key in proteomics and metabolomics analyses. As a result, the amount of MS data has increased significantly in recent years. For example, the MS repository MassIVE contains more than 123 TB of data. Somewhat surprisingly, these data are stored uncompressed, hence incurring a significant storage cost. Efficient representation of these data is therefore paramount to lessen the burden of storage and facilitate its dissemination. Results: We present MassComp, a lossless compressor optimized for the numerical (m/z)-intensity pairs that account for most of the MS data. We tested MassComp on several MS datasets and show that it delivers on average a 46% reduction in the size of the numerical data, and up to 89%. These results correspond to an average improvement of more than 27% when compared to the general compressor gzip and of 40% when compared to the state-of-the-art numerical compressor FPC. When tested on entire files retrieved from the MassIVE repository, MassComp achieves on average a 59% size reduction. MassComp is written in C++ and freely available at https://github.com/iochoa/MassComp. Conclusions: The compression performance of MassComp demonstrates its potential to significantly reduce the footprint of MS data, and shows the benefits of designing specialized compression algorithms tailored to MS data. MassComp is an addition to the family of omics compression algorithms designed to lessen the storage burden and facilitate the exchange and dissemination of omics data.

Compression of text files using genomic code compression algorithm

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.31.13399 ◽

2018 ◽

Vol 7 (2.31) ◽

pp. 69 ◽

Cited By ~ 1

Author(s):

G Murugesan ◽

Rosario Gilmary

Keyword(s):

Data Storage ◽

Compression Ratio ◽

Deoxyribonucleic Acid ◽

Lossless Compression ◽

Text Compression ◽

Input File ◽

File Size ◽

Compression Algorithms ◽

Code Compression ◽

Massive Cluster

Text files use a substantial amount of memory or disk space, and transmitting them across a network requires considerable bandwidth. Compression is particularly advantageous in telecommunications and information technology because it lets devices transmit or store the same amount of data in fewer bits. Text compression techniques section the English passage by observing patterns and provide alternative symbols for larger patterns of text. Compression algorithms are used to reduce the storage footprint of large volumes of information and the associated data storage cost. Compressing a large cluster of information can also improve retrieval time. Novel lossless compression algorithms have been introduced for better compression ratios. In this work, various existing compression mechanisms specific to text files and deoxyribonucleic acid (DNA) sequence files are analyzed. Performance is compared in terms of compression ratio, time taken to compress/decompress the sequence, and file size. In the proposed work, the input file is converted to DNA format and a DNA compression procedure is then applied.
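The abstract does not say how the input file is converted to DNA format. One plausible reading, shown here purely as an assumption, maps every two bits of the UTF-8 bytes to one of the four nucleotides, so each byte becomes four bases. The mapping order and the function names are hypothetical, and the subsequent DNA compression step is not shown.

```python
BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def text_to_dna(text):
    """Encode UTF-8 bytes as nucleotides, two bits per base (four bases per byte)."""
    out = []
    for byte in text.encode("utf-8"):
        # Walk the byte from its most significant bit pair to its least.
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_text(dna):
    """Invert text_to_dna by packing each group of four bases back into one byte."""
    data = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        data.append(byte)
    return data.decode("utf-8")
```

Note that this conversion on its own makes the representation four times longer in characters; any size reduction would have to come from the DNA compression procedure applied afterwards.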

FCLQC: fast and concurrent lossless quality scores compressor

BMC Bioinformatics ◽

10.1186/s12859-021-04516-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Minhyeok Cho ◽

Albert No

Keyword(s):

Compression Ratio ◽

Random Access ◽

Lossless Compression ◽

Lossy Compression ◽

General Purpose ◽

Quality Score ◽

Compression Rate ◽

Sequencing Data ◽

Compression Algorithms ◽

Compression Speed

Abstract Background: Advances in sequencing technology have drastically reduced sequencing costs. As a result, the amount of sequencing data is increasing explosively. Since FASTQ files (the standard sequencing data format) are huge, there is a need for efficient compression of FASTQ files, especially of quality scores. Several quality score compression algorithms have recently been proposed, mainly focused on lossy compression to boost the compression rate further. However, for clinical applications and archiving purposes, lossy compression cannot replace lossless compression. One of the main challenges for lossless compression is time complexity: compressing a 1 GB file can take thousands of seconds. There are also desired features for compression algorithms, such as random access. Therefore, there is a need for a fast lossless compressor with a reasonable compression rate and random access functionality. Results: This paper proposes a Fast and Concurrent Lossless Quality scores Compressor (FCLQC) that supports random access and achieves a lower running time based on concurrent programming. Experimental results reveal that FCLQC is significantly faster than the baseline compressors on compression and decompression, at the expense of compression ratio. Compared to LCQS (a baseline quality score compression algorithm), FCLQC shows at least a 31x compression speed improvement in all settings, with a degradation in compression ratio of up to 13.58% (8.26% on average). Compared to general-purpose compressors (such as 7-zip), FCLQC shows 3x faster compression speed while achieving better compression ratios, by at least 2.08% (4.69% on average). Moreover, the speed of random access decompression also outperforms the others. The concurrency of FCLQC is implemented using Rust; the performance gain increases near-linearly with the number of threads. Conclusion: The superior compression and decompression speed makes FCLQC a practical lossless quality score compressor candidate for speed-sensitive applications of DNA sequencing data. FCLQC is available at https://github.com/Minhyeok01/FCLQC and is freely available for non-commercial usage.
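The combination of concurrency and random access described above can be sketched with independently compressed blocks. This is an illustration under stated assumptions only: it uses zlib and a Python thread pool rather than FCLQC's Rust implementation or its actual coder, and the function names, block size, and worker count are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(data, block_size=1 << 16, workers=4):
    """Split data into fixed-size blocks and compress each one independently."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Blocks share no state, so they can be compressed in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, blocks))


def read_block(compressed_blocks, index):
    """Random access: decompress a single block without touching the others."""
    return zlib.decompress(compressed_blocks[index])
```

Compressing blocks independently is what allows both the parallelism and the random access; the usual trade-off is a somewhat lower compression ratio, because no block can reuse context from its neighbours.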
