scholarly journals mzMLb: a future-proof raw mass spectrometry data format based on standards-compliant mzML and optimized for speed and storage requirements

2020 ◽  
Author(s):  
Ranjeet S. Bhamber ◽  
Andris Jankevics ◽  
Eric W Deutsch ◽  
Andrew R Jones ◽  
Andrew W Dowsey

AbstractWith ever-increasing amounts of data produced by mass spectrometry (MS) proteomics and metabolomics, and the sheer volume of samples now analyzed, the need for a common open format possessing both file size efficiency and faster read/write speeds has become paramount to drive the next generation of data analysis pipelines. The Proteomics Standards Initiative (PSI) has established a clear and precise XML representation for data interchange, mzML, receiving substantial uptake; nevertheless, storage and file access efficiency has not been the main focus. We propose an HDF5 file format ‘mzMLb’ that is optimised for both read/write speed and storage of the raw mass spectrometry data. We provide extensive validation of write speed, random read speed and storage size, demonstrating a flexible format that with or without compression is faster than all existing approaches in virtually all cases, while with compression, is comparable in size to proprietary vendor file formats. Since our approach uniquely preserves the XML encoding of the metadata, the format implicitly supports future versions of mzML and is straightforward to implement: mzMLb’s design adheres to both HDF5 and NetCDF4 standard implementations, which allows it to be easily utilised by third parties due to their widespread programming language support. A reference implementation within the established ProteoWizard toolkit is provided.

2020 ◽  
Author(s):  
Miaoshan Lu ◽  
Shaowei An ◽  
Ruimin Wang ◽  
Jinyin Wang ◽  
Changbin Yu

ABSTRACTWith the precision of mass spectrometer going higher and the emergence of data independence acquisition (DIA), the file size is increasing rapidly. Beyond the widely-used open format mzML (Deutsch 2008), near-lossless or lossless compression algorithms and formats have emerged. The data precision is often related to the instrument and subsequent processing algorithms. Unlike storage-oriented formats, which focusing more on lossless compression and compression rate, computation-oriented formats focus as much on decoding speed and disk read strategy as compression rate. Here we describe “Aird", an opensource and computation-oriented format with controllable precision, flexible indexing strategies and high compression rate. Aird uses JavaScript Object Notation (JSON) for metadata storage, multiple indexing, and reordered storage strategies for higher speed of data randomly reading. Aird also provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data compression. Compared with Zlib only, m/z data size is about 65% lower in Aird, and merely takes 33% decoding time.AvailabilityAird SDK is written in Java, which allow scholars to access mass spectrometry data efficiently. It is available at https://github.com/Propro-Studio/Aird-SDK AirdPro can convert vendor files into Aird files, which is available at https://github.com/Propro-Studio/AirdPro


2020 ◽  
Vol 20 (1) ◽  
pp. 172-183
Author(s):  
Ranjeet S. Bhamber ◽  
Andris Jankevics ◽  
Eric W. Deutsch ◽  
Andrew R. Jones ◽  
Andrew W. Dowsey

2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Miaoshan Lu ◽  
Shaowei An ◽  
Ruimin Wang ◽  
Jinyin Wang ◽  
Changbin Yu

Abstract Background With the precision of the mass spectrometry (MS) going higher, the MS file size increases rapidly. Beyond the widely-used open format mzML, near-lossless or lossless compression algorithms and formats emerged in scenarios with different precision requirements. The data precision is often related to the instrument and subsequent processing algorithms. Unlike storage-oriented formats, which focus more on lossless compression rate, computation-oriented formats concentrate as much on decoding speed as the compression rate. Results Here we introduce “Aird”, an opensource and computation-oriented format with controllable precision, flexible indexing strategies, and high compression rate. Aird provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data. Compared with Zlib only, m/z data size is about 55% lower in Aird average. With the high-speed decoding and encoding performance of the single instruction multiple data technology used in the ZDPD, Aird merely takes 33% decoding time compared with Zlib. We have downloaded seven datasets from ProteomeXchange and Metabolights. They are from different SCIEX, Thermo, and Agilent instruments. Then we convert the raw data into mzML, mgf, and mz5 file formats by MSConvert and compare them with Aird format. Aird uses JavaScript Object Notation for metadata storage. Aird-SDK is written in Java, and AirdPro is a GUI client for vendor file converting written in C#. They are freely available at https://github.com/CSi-Studio/Aird-SDK and https://github.com/CSi-Studio/AirdPro. Conclusions With the innovation of MS acquisition mode, MS data characteristics are also constantly changing. New data features can bring more effective compression methods and new index modes to achieve high search performance. The MS data storage mode will also become professional and customized. ZDPD uses multiple MS digital features, and researchers also can use it in other formats like mzML. Aird is designed to become a computing-oriented data format with high scalability, compression rate, and fast decoding speed.


2019 ◽  
Author(s):  
Wout Bittremieux

AbstractGiven the wide diversity in applications of biological mass spectrometry, custom data analyses are often needed to fully interpret the results of an experiment. Such bioinformatics scripts necessarily include similar basic functionality to read mass spectral data from standard file formats, process it, and visualize it. Rather than having to reimplement this functionality, to facilitate this task, spectrum_utils is a Python package for mass spectrometry data processing and visualization. Its high-level functionality enables developers to quickly prototype ideas for computational mass spectrometry projects in only a few lines of code. Notably, the data processing functionality is highly optimized for computational efficiency to be able to deal with the large volumes of data that are generated during mass spectrometry experiments. The visualization functionality makes it possible to easily produce publication-quality figures as well as interactive spectrum plots for inclusion on web pages. spectrum_utils is available for Python 3.6+, includes extensive online documentation and examples, and can be easily installed using conda. It is freely available as open source under the Apache 2.0 license at https://github.com/bittremieux/spectrum_utils.


2007 ◽  
Vol 177 (4S) ◽  
pp. 52-53
Author(s):  
Stefano Ongarello ◽  
Eberhard Steiner ◽  
Regina Achleitner ◽  
Isabel Feuerstein ◽  
Birgit Stenzel ◽  
...  

2007 ◽  
Vol 3 (2) ◽  
pp. 127-147 ◽  
Author(s):  
Anestis Antoniadis ◽  
Jeremie Bigot ◽  
Sophie Lambert-Lacroix ◽  
Frederique Letue

Author(s):  
Trevor N. Clark ◽  
Joëlle Houriet ◽  
Warren S. Vidar ◽  
Joshua J. Kellogg ◽  
Daniel A. Todd ◽  
...  

Author(s):  
In Kwon Choi ◽  
Eroma Abeysinghe ◽  
Eric Coulter ◽  
Suresh Marru ◽  
Marlon Pierce ◽  
...  

Sign in / Sign up

Export Citation Format

Share Document