Brotli Compressed Data Format

Author(s):  
J. Alakuijala ◽  
Z. Szabadka
Author(s):  
T. Kawashima ◽  
T. Igarashi ◽  
R. Hines ◽  
M. Ogawa

Author(s):  
Oscar Miguel-Hurtado ◽  
Luis Mengibar-Pozo ◽  
Michael G. Lorenz ◽  
Richard Guest

2018 ◽  
Author(s):  
Claudio Albert ◽  
Tom Paridaens ◽  
Jan Voges ◽  
Daniel Naro ◽  
Junaid J. Ahmad ◽  
...  

Abstract
The MPEG-G standardization initiative is a coordinated international effort to specify a compressed data format that enables large-scale genomic data to be processed, transported and shared. The standard consists of a set of specifications (i.e., a book) describing: i) a normative format syntax, and ii) a normative decoding process to retrieve the information coded in a compliant file or bitstream. Such a decoding process enables the use of leading-edge compression technologies that have exhibited significant compression gains over the formats currently used for storage of unaligned and aligned sequencing reads. Additionally, the standard provides a wealth of much-needed functionality, such as selective access, data aggregation, application programming interfaces to the compressed data, standard interfaces to support data protection mechanisms, support for streaming, and a procedure to assess the conformance of implementations. ISO/IEC is engaged in supporting the maintenance and availability of the standard specification, which guarantees the longevity of applications using MPEG-G. Finally, the standard ensures interoperability and integration with existing genomic information processing pipelines by providing support for conversion from the FASTQ/SAM/BAM file formats.
In this paper we provide an overview of the MPEG-G specification, with particular focus on the main advantages and novel functionality it offers. As the standard only specifies the decoding process, encoding performance, in terms of both speed and compression ratio, can vary depending on the specific encoder implementation, and will likely improve during the lifetime of MPEG-G. Hence, the performance statistics provided here are only indicative baseline examples of the technologies included in the standard.
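The conversion path from FASTQ mentioned in the abstract starts from a simple four-line record structure. As a hedged illustration of that input format (this is generic FASTQ handling, not any part of the MPEG-G specification or a real converter), a minimal reader might look like this:

```python
# Minimal FASTQ reader: each record is four lines --
# "@" + read id, the base sequence, a "+" separator, and
# per-base quality scores (Phred+33 ASCII encoding).
# Illustrative sketch only; real converters also handle
# multi-line sequences, gzip input, and validation.

def read_fastq(lines):
    """Yield (read_id, sequence, qualities) tuples from FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                      # "+" separator line (ignored)
        qual = next(it).strip()
        yield header.strip().lstrip("@"), seq, qual

records = list(read_fastq([
    "@read1",
    "ACGT",
    "+",
    "IIII",
]))
print(records)  # [('read1', 'ACGT', 'IIII')]
```

A compressing encoder would typically split these three streams (identifiers, bases, qualities) and code each with a model suited to its statistics, which is one reason format-aware compressors outperform generic ones on sequencing data.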


2021 ◽  
Vol 17 (10) ◽  
pp. e1009524
Author(s):  
Shian Su ◽  
Quentin Gouil ◽  
Marnie E. Blewitt ◽  
Dianne Cook ◽  
Peter F. Hickey ◽  
...  

A key benefit of long-read nanopore sequencing technology is the ability to detect modified DNA bases, such as 5-methylcytosine. The lack of R/Bioconductor tools for effective visualization of nanopore methylation profiles between samples from different experimental groups led us to develop the NanoMethViz R package. Our software can handle methylation output generated by a range of methylation callers and manages large datasets using a compressed data format. To fully explore the methylation patterns in a dataset, NanoMethViz allows plotting of data at various resolutions. At the sample level, we use dimensionality reduction to examine the relationships between methylation profiles in an unsupervised way. We visualize methylation profiles of classes of features, such as genes or CpG islands, by scaling them to relative positions and aggregating their profiles. At the finest resolution, we visualize methylation patterns across individual reads along the genome using spaghetti plots and heatmaps, allowing users to explore particular genes or genomic regions of interest. In summary, our software makes the handling of methylation signals more convenient, expands the visualization options for nanopore data, and works seamlessly with existing methylation analysis tools available in the Bioconductor project. Our software is available at https://bioconductor.org/packages/NanoMethViz.
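The feature-level aggregation described above, scaling each feature (e.g., a gene) to a common relative coordinate axis and averaging methylation calls across features, can be sketched in plain Python. The function below is a hypothetical illustration of the technique, not the NanoMethViz API:

```python
# Hedged sketch of feature-scaled aggregation (not the NanoMethViz API):
# each methylation call is (genomic_position, methylated: bool); each
# feature (e.g., a gene) is a (start, end) interval. Positions inside a
# feature are mapped to relative coordinates in [0, 1), binned, and the
# methylation rate is averaged per bin across all features.

def aggregate_profiles(features, calls, n_bins=10):
    """Return per-bin mean methylation rate across all features."""
    totals = [0] * n_bins
    counts = [0] * n_bins
    for start, end in features:
        width = end - start
        for pos, methylated in calls:
            if start <= pos < end:
                rel = (pos - start) / width              # scale to [0, 1)
                b = min(int(rel * n_bins), n_bins - 1)   # assign a bin
                totals[b] += int(methylated)
                counts[b] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]

# Two "genes" of different lengths share one relative axis.
profile = aggregate_profiles(
    features=[(100, 200), (1000, 1400)],
    calls=[(105, True), (195, False), (1010, True), (1390, True)],
    n_bins=2,
)
print(profile)  # [1.0, 0.5]
```

The point of the rescaling step is that genes of very different lengths contribute to the same relative positions, so a single averaged profile (e.g., promoter-proximal vs. gene-body methylation) can be drawn over thousands of features.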


2021 ◽  
Vol 14 (11) ◽  
pp. 2627-2641
Author(s):  
Michael Kuchnik ◽  
George Amvrosiadis ◽  
Virginia Smith

Deep learning accelerators efficiently train over vast and growing amounts of data, placing a newfound burden on commodity networks and storage devices. A common approach to conserving bandwidth involves resizing or compressing data prior to training. We introduce Progressive Compressed Records (PCRs), a data format that uses compression to reduce the overhead of fetching and transporting data, effectively reducing the training time required to achieve a target accuracy. PCRs deviate from previous storage formats by combining progressive compression with an efficient storage layout to view a single dataset at multiple fidelities, all without adding to the total dataset size. We implement PCRs and evaluate them on a range of datasets, training tasks, and hardware architectures. Our work shows that: (i) the amount of compression a dataset can tolerate exceeds 50% of the original encoding for many DL training tasks; (ii) it is possible to automatically and efficiently select appropriate compression levels for a given task; and (iii) PCRs enable tasks to readily access compressed data at runtime, utilizing as little as half the training bandwidth and thus potentially doubling training speed.
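The core progressive idea, reading only a prefix of the stored record to obtain a lower-fidelity view of every sample, can be illustrated with a simple bit-plane layout. This is a generic sketch of progressive encoding under assumed 8-bit values, not the paper's actual on-disk format:

```python
# Generic progressive-layout sketch (not the PCR on-disk format):
# 8-bit values are stored most-significant bit-plane first, so reading
# only the first k planes yields a coarse approximation of every value,
# while reading all 8 planes reconstructs the data exactly.

def encode_planes(values):
    """Split 8-bit values into 8 bit-planes, MSB plane first."""
    return [[(v >> (7 - p)) & 1 for v in values] for p in range(8)]

def decode_planes(planes):
    """Reconstruct values from however many planes were read."""
    n = len(planes[0])
    values = [0] * n
    for p, plane in enumerate(planes):
        for i, bit in enumerate(plane):
            values[i] |= bit << (7 - p)
    return values

data = [200, 17, 96]
planes = encode_planes(data)
coarse = decode_planes(planes[:2])   # prefix read: top 2 bits only
exact = decode_planes(planes)        # full read: lossless
print(coarse, exact)  # [192, 0, 64] [200, 17, 96]
```

Because the planes are ordered by significance, a trainer that stops reading after a prefix still sees the coarse structure of every sample at a fraction of the bandwidth, analogous to how progressive JPEG scans refine an image.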

