Aird: a computation-oriented mass spectrometry data format enables a higher compression ratio and less decoding time

Miaoshan Lu; Shaowei An; Ruimin Wang; Jinyin Wang; Changbin Yu

doi:10.1186/s12859-021-04490-0

Aird: a computation-oriented mass spectrometry data format enables a higher compression ratio and less decoding time

BMC Bioinformatics ◽

10.1186/s12859-021-04490-0 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Miaoshan Lu ◽

Shaowei An ◽

Ruimin Wang ◽

Jinyin Wang ◽

Changbin Yu

Keyword(s):

Mass Spectrometry ◽

Data Storage ◽

High Speed ◽

Lossless Compression ◽

Mass Spectrometry Data ◽

Compression Rate ◽

Search Performance ◽

Data Format ◽

Link Type ◽

Decoding Speed

Abstract:

Background: As the precision of mass spectrometry (MS) increases, MS file sizes grow rapidly. Beyond the widely used open format mzML, near-lossless and lossless compression algorithms and formats have emerged for scenarios with different precision requirements. The required data precision often depends on the instrument and on the subsequent processing algorithms. Unlike storage-oriented formats, which focus mainly on the lossless compression rate, computation-oriented formats give decoding speed as much weight as the compression rate.

Results: Here we introduce “Aird”, an open-source, computation-oriented format with controllable precision, flexible indexing strategies, and a high compression rate. Aird provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data. Compared with Zlib alone, the m/z data size is about 55% lower in Aird on average. Owing to the high-speed decoding and encoding of the single instruction, multiple data (SIMD) technology used in ZDPD, Aird takes only 33% of the decoding time of Zlib. We downloaded seven datasets from ProteomeXchange and MetaboLights, acquired on different SCIEX, Thermo, and Agilent instruments. We then converted the raw data into the mzML, mgf, and mz5 file formats with MSConvert and compared them with the Aird format. Aird uses JavaScript Object Notation for metadata storage. Aird-SDK is written in Java, and AirdPro is a GUI client, written in C#, for converting vendor files. They are freely available at https://github.com/CSi-Studio/Aird-SDK and https://github.com/CSi-Studio/AirdPro.

Conclusions: As MS acquisition modes evolve, the characteristics of MS data also keep changing. New data features can enable more effective compression methods and new index modes that achieve high search performance. MS data storage will also become more specialized and customized. ZDPD exploits multiple numerical features of MS data, and researchers can also use it in other formats such as mzML. Aird is designed to become a computation-oriented data format with high scalability, a high compression rate, and fast decoding speed.
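The abstract names the ZDPD stages (Zlib, Diff, PforDelta) without showing why differencing helps. The following is a minimal Python sketch of the "Diff then Zlib" idea only, not the authors' ZDPD implementation: the PforDelta bit-packing and SIMD steps are omitted, and the precision value, function names, and synthetic m/z values are all assumptions made for illustration.

```python
import struct
import zlib

PRECISION = 1e-4  # hypothetical precision: keep four decimal places of m/z


def to_fixed_point(mz_values, precision=PRECISION):
    """Round m/z floats to integers at the chosen precision."""
    return [round(mz / precision) for mz in mz_values]


def delta_encode(ints):
    """Keep the first value, then store each successive difference."""
    if not ints:
        return []
    return [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]


def delta_decode(deltas):
    """Rebuild the original integers with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack(ints):
    """Serialize integers as little-endian signed 32-bit values."""
    return struct.pack(f"<{len(ints)}i", *ints)


def compress_plain(ints):
    """Baseline: zlib applied directly to the packed integers."""
    return zlib.compress(pack(ints))


def compress_delta(ints):
    """Difference the integers first, then apply zlib."""
    return zlib.compress(pack(delta_encode(ints)))


def decompress_delta(blob):
    """Invert compress_delta: inflate, unpack, then undo the differencing."""
    raw = zlib.decompress(blob)
    return delta_decode(struct.unpack(f"<{len(raw) // 4}i", raw))


# Synthetic, ascending m/z values standing in for one spectrum (not real data).
mz = [100.0 + 0.0137 * i + 0.0001 * (i % 7) for i in range(5000)]
ints = to_fixed_point(mz)

plain_size = len(compress_plain(ints))
delta_size = len(compress_delta(ints))
print(f"zlib only: {plain_size} bytes, delta + zlib: {delta_size} bytes")
```

Because m/z values within a spectrum are sorted, the differences are small and repetitive, which is what gives a general-purpose compressor more to work with. The sizes printed here depend entirely on the synthetic input and say nothing about the figures reported in the abstract.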

Aird: A computation-oriented mass spectrometry data format enables higher compression ratio and less decoding time

10.1101/2020.10.14.338921 ◽

2020 ◽

Author(s):

Miaoshan Lu ◽

Shaowei An ◽

Ruimin Wang ◽

Jinyin Wang ◽

Changbin Yu

Keyword(s):

Mass Spectrometry ◽

Lossless Compression ◽

Mass Spectrometry Data ◽

Compression Rate ◽

File Size ◽

Data Format ◽

Compression Algorithms ◽

Link Type ◽

Processing Algorithms ◽

Data Independence

Abstract:

As mass spectrometer precision increases and data-independent acquisition (DIA) emerges, file sizes are growing rapidly. Beyond the widely used open format mzML (Deutsch 2008), near-lossless and lossless compression algorithms and formats have emerged. The required data precision often depends on the instrument and on the subsequent processing algorithms. Unlike storage-oriented formats, which focus mainly on lossless compression and compression rate, computation-oriented formats give decoding speed and disk read strategy as much weight as compression rate. Here we describe “Aird”, an open-source, computation-oriented format with controllable precision, flexible indexing strategies, and a high compression rate. Aird uses JavaScript Object Notation (JSON) for metadata storage, together with multiple indexing and reordered storage strategies for faster random reads. Aird also provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data compression. Compared with Zlib alone, the m/z data size is about 65% lower in Aird, and decoding takes only 33% of the time.

Availability: Aird SDK is written in Java and allows researchers to access mass spectrometry data efficiently. It is available at https://github.com/Propro-Studio/Aird-SDK. AirdPro, which converts vendor files into Aird files, is available at https://github.com/Propro-Studio/AirdPro.
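The combination of JSON metadata and indexing for random reads can be illustrated with a small Python sketch. It is only an assumption about the general pattern (a JSON index of byte ranges over a binary body), not Aird's actual schema: the field names `scan`, `start`, and `end`, the function names, and the payloads are hypothetical.

```python
import json
import zlib


def write_blocks(spectra):
    """Concatenate compressed spectra and build a JSON index of byte ranges."""
    body = bytearray()
    index = []
    for scan_id, payload in spectra:
        blob = zlib.compress(payload)
        # Record where this spectrum's compressed bytes sit in the body.
        index.append({"scan": scan_id,
                      "start": len(body),
                      "end": len(body) + len(blob)})
        body += blob
    return bytes(body), json.dumps({"index": index})


def read_scan(body, index_json, scan_id):
    """Decode one spectrum from its byte range without decoding the others."""
    for entry in json.loads(index_json)["index"]:
        if entry["scan"] == scan_id:
            return zlib.decompress(body[entry["start"]:entry["end"]])
    raise KeyError(scan_id)


# Placeholder payloads standing in for encoded spectra (not real data).
spectra = [(1, b"spectrum-one" * 10),
           (2, b"spectrum-two" * 10),
           (3, b"spectrum-three" * 10)]
body, index_json = write_blocks(spectra)
print(read_scan(body, index_json, 2)[:12])
```

Keeping the index as separate, human-readable JSON means a reader can locate any spectrum by offset and decompress just that slice, which is the property a computation-oriented format needs for random access.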

mzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements

Journal of Proteome Research ◽

10.1021/acs.jproteome.0c00192 ◽

2020 ◽

Vol 20 (1) ◽

pp. 172-183

Author(s):

Ranjeet S. Bhamber ◽

Andris Jankevics ◽

Eric W. Deutsch ◽

Andrew R. Jones ◽

Andrew W. Dowsey

Keyword(s):

Mass Spectrometry ◽

Mass Spectrometry Data ◽

Data Format ◽

And Storage

mzMLb: a future-proof raw mass spectrometry data format based on standards-compliant mzML and optimized for speed and storage requirements

10.1101/2020.02.13.947218 ◽

2020 ◽

Author(s):

Ranjeet S. Bhamber ◽

Andris Jankevics ◽

Eric W Deutsch ◽

Andrew R Jones ◽

Andrew W Dowsey

Keyword(s):

Mass Spectrometry ◽

Mass Spectrometry Data ◽

File Size ◽

Data Format ◽

File Access ◽

Data Interchange ◽

File Formats ◽

Reference Implementation ◽

And Storage ◽

Access Efficiency

Abstract:

With ever-increasing amounts of data produced by mass spectrometry (MS) proteomics and metabolomics, and the sheer volume of samples now analyzed, the need for a common open format possessing both file size efficiency and faster read/write speeds has become paramount to drive the next generation of data analysis pipelines. The Proteomics Standards Initiative (PSI) has established a clear and precise XML representation for data interchange, mzML, which has received substantial uptake; nevertheless, storage and file access efficiency have not been the main focus. We propose an HDF5 file format ‘mzMLb’ that is optimised for both read/write speed and storage of the raw mass spectrometry data. We provide extensive validation of write speed, random read speed and storage size, demonstrating a flexible format that, with or without compression, is faster than all existing approaches in virtually all cases, while with compression it is comparable in size to proprietary vendor file formats. Since our approach uniquely preserves the XML encoding of the metadata, the format implicitly supports future versions of mzML and is straightforward to implement: mzMLb’s design adheres to both HDF5 and NetCDF4 standard implementations, which allows it to be easily utilised by third parties due to their widespread programming language support. A reference implementation within the established ProteoWizard toolkit is provided.

Classification of high-speed gas chromatography–mass spectrometry data by principal component analysis coupled with piecewise alignment and feature selection

Journal of Chromatography A ◽

10.1016/j.chroma.2006.06.087 ◽

2006 ◽

Vol 1129 (1) ◽

pp. 111-118 ◽

Cited By ~ 32

Author(s):

Nathanial E. Watson ◽

Matthew M. VanWingerden ◽

Karisa M. Pierce ◽

Bob W. Wright ◽

Robert E. Synovec

Keyword(s):

Mass Spectrometry ◽

Principal Component Analysis ◽

Gas Chromatography ◽

Feature Selection ◽

High Speed ◽

Principal Component ◽

Component Analysis ◽

Mass Spectrometry Data ◽

Gas Chromatography Mass Spectrometry

mspack: efficient lossless and lossy mass spectrometry data compression

Bioinformatics ◽

10.1093/bioinformatics/btab636 ◽

2021 ◽

Author(s):

Felix Hanau ◽

Hannes Röst ◽

Idoia Ochoa

Keyword(s):

Mass Spectrometry ◽

Lossless Compression ◽

Lossy Compression ◽

General Purpose ◽

Mass Spectrometry Data ◽

Supplementary Information ◽

Compression Algorithms ◽

Single File ◽

Comparable Accuracy ◽

Better Than

Abstract:

Motivation: Mass spectrometry data, used for proteomics and metabolomics analyses, have grown considerably in recent years. To reduce the associated storage costs, dedicated compression algorithms for mass spectrometry (MS) data have been proposed, such as MassComp and MSNumpress. However, these algorithms focus on either lossless or lossy compression, respectively, and do not exploit the additional redundancy existing across scans contained in a single file. We introduce mspack, a compression algorithm for MS data that exploits this additional redundancy and that supports both lossless and lossy compression, as well as the mzML and the legacy mzXML formats. mspack applies several lossless preprocessing transforms and optional lossy transforms with a configurable error, followed by the general-purpose compressors gzip or bsc to achieve a higher compression ratio.

Results: We tested mspack on several datasets generated by commonly used mass spectrometry instruments. When used with the bsc compression backend, mspack achieves on average 76% smaller file sizes for lossless compression and 94% smaller file sizes for lossy compression, as compared to the original files. Lossless mspack achieves 10-60% lower file sizes than MassComp, and lossy mspack compresses 36-60% better than the lossy MSNumpress, for the same error, while exhibiting comparable accuracy and running time.

Availability: mspack is implemented in C++ and freely available at https://github.com/fhanau/mspack under the Apache license.

Supplementary information: Supplementary data are available at Bioinformatics online.

154: Integration of TPSA and High-Throughput Mass Spectrometry Data Improves Prostate Cancer Prediction

The Journal of Urology ◽

10.1016/s0022-5347(18)30419-1 ◽

2007 ◽

Vol 177 (4S) ◽

pp. 52-53

Author(s):

Stefano Ongarello ◽

Eberhard Steiner ◽

Regina Achleitner ◽

Isabel Feuerstein ◽

Birgit Stenzel ◽

...

Keyword(s):

Prostate Cancer ◽

Mass Spectrometry ◽

High Throughput ◽

Mass Spectrometry Data ◽

Cancer Prediction

Rapid Online Buffer Exchange: A Method for Screening of Proteins, Protein Complexes, and Cell Lysates by Native Mass Spectrometry

10.26434/chemrxiv.8792177 ◽

2019 ◽

Author(s):

Zachary VanAernum ◽

Florian Busch ◽

Benjamin J. Jones ◽

Mengxuan Jia ◽

Zibo Chen ◽

...

Keyword(s):

Mass Spectrometry ◽

High Speed ◽

Structural Information ◽

Protein Complexes ◽

High Sensitivity ◽

Native Mass Spectrometry ◽

Structural Features ◽

Consumer Products ◽

Cell Lysates ◽

Protein Expression And Purification

It is important to assess the identity and purity of proteins and protein complexes during and after protein purification to ensure that samples are of sufficient quality for further biochemical and structural characterization, as well as for use in consumer products, chemical processes, and therapeutics. Native mass spectrometry (nMS) has become an important tool in protein analysis due to its ability to retain non-covalent interactions during measurements, making it possible to obtain protein structural information with high sensitivity and at high speed. Interferences from the presence of non-volatiles are typically alleviated by offline buffer exchange, which is time-consuming and difficult to automate. We provide a protocol for rapid online buffer exchange (OBE) nMS to directly screen structural features of pre-purified proteins, protein complexes, or clarified cell lysates. Information obtained by OBE nMS can be used for fast (<5 min) quality control and can further guide protein expression and purification optimization.

Nonparametric Pre-Processing Methods and Inference Tools for Analyzing Time-of-Flight Mass Spectrometry Data.

Current Analytical Chemistry ◽

10.2174/157341107780361718 ◽

2007 ◽

Vol 3 (2) ◽

pp. 127-147 ◽

Cited By ~ 8

Author(s):

Anestis Antoniadis ◽

Jeremie Bigot ◽

Sophie Lambert-Lacroix ◽

Frederique Letue

Keyword(s):

Mass Spectrometry ◽

Time Of Flight ◽

Mass Spectrometry Data ◽

Processing Methods ◽

Flight Mass Spectrometry

Ultra‐Fast Retroactive Processing by MetAlign of Liquid‐Chromatography High‐Resolution Full‐Scan Orbitrap Mass Spectrometry Data in WADA Human Urine Sample Monitoring Program

Rapid Communications in Mass Spectrometry ◽

10.1002/rcm.9141 ◽

2021 ◽

Author(s):

Safa Khelifi ◽

Khadija Saad ◽

Ariadni Vonaparti ◽

Souhila Mahieddine ◽

Sofia Salama ◽

...

Keyword(s):

Mass Spectrometry ◽

Liquid Chromatography ◽

High Resolution ◽

Urine Sample ◽

Human Urine ◽

Monitoring Program ◽

Mass Spectrometry Data ◽

Human Urine Sample ◽

Orbitrap Mass Spectrometry ◽

Full Scan

Interlaboratory Comparison of Untargeted Mass Spectrometry Data Uncovers Underlying Causes for Variability

Journal of Natural Products ◽

10.1021/acs.jnatprod.0c01376 ◽

2021 ◽

Author(s):

Trevor N. Clark ◽

Joëlle Houriet ◽

Warren S. Vidar ◽

Joshua J. Kellogg ◽

Daniel A. Todd ◽

...

Keyword(s):

Mass Spectrometry ◽

Interlaboratory Comparison ◽

Mass Spectrometry Data ◽

Underlying Causes
