Brotli Compressed Data Format

Author(s):  
J. Alakuijala ◽  
Z. Szabadka
Author(s):  
T. Kawashima ◽  
T. Igarashi ◽  
R. Hines ◽  
M. Ogawa

Author(s):  
Oscar Miguel-Hurtado ◽  
Luis Mengibar-Pozo ◽  
Michael G. Lorenz ◽  
Richard Guest

2018 ◽  
Author(s):  
Claudio Albert ◽  
Tom Paridaens ◽  
Jan Voges ◽  
Daniel Naro ◽  
Junaid J. Ahmad ◽  
...  

Abstract
The MPEG-G standardization initiative is a coordinated international effort to specify a compressed data format that enables large-scale genomic data to be processed, transported and shared. The standard consists of a set of specifications (i.e., a book) describing: i) a normative format syntax, and ii) a normative decoding process to retrieve the information coded in a compliant file or bitstream. Such a decoding process enables the use of leading-edge compression technologies that have exhibited significant compression gains over the formats currently used for storage of unaligned and aligned sequencing reads. Additionally, the standard provides a wealth of much-needed functionality, such as selective access, data aggregation, application programming interfaces to the compressed data, standard interfaces to support data protection mechanisms, support for streaming, and a procedure to assess the conformance of implementations. ISO/IEC is engaged in supporting the maintenance and availability of the standard specification, which guarantees the longevity of applications using MPEG-G. Finally, the standard ensures interoperability and integration with existing genomic information processing pipelines by providing support for conversion from the FASTQ/SAM/BAM file formats.
In this paper we provide an overview of the MPEG-G specification, with particular focus on the main advantages and novel functionality it offers. As the standard only specifies the decoding process, encoding performance, in terms of both speed and compression ratio, can vary depending on the specific encoder implementation, and will likely improve during the lifetime of MPEG-G. Hence, the performance statistics provided here are only indicative baseline examples of the technologies included in the standard.
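The conversion path from FASTQ mentioned in the abstract starts from a simple four-line record structure. As a hedged illustration of that input format (this is generic FASTQ handling, not any part of the MPEG-G specification or a real converter), a minimal reader might look like this:

```python
# Minimal FASTQ reader: each record is four lines --
# "@" + read id, the base sequence, a "+" separator, and
# per-base quality scores (Phred+33 ASCII encoding).
# Illustrative sketch only; real converters also handle
# multi-line sequences, gzip input, and validation.

def read_fastq(lines):
    """Yield (read_id, sequence, qualities) tuples from FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                      # "+" separator line (ignored)
        qual = next(it).strip()
        yield header.strip().lstrip("@"), seq, qual

records = list(read_fastq([
    "@read1",
    "ACGT",
    "+",
    "IIII",
]))
print(records)  # [('read1', 'ACGT', 'IIII')]
```

A compressing encoder would typically split these three streams (identifiers, bases, qualities) and code each with a model suited to its statistics, which is one reason format-aware compressors outperform generic ones on sequencing data.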


2021 ◽  
Vol 17 (10) ◽  
pp. e1009524
Author(s):  
Shian Su ◽  
Quentin Gouil ◽  
Marnie E. Blewitt ◽  
Dianne Cook ◽  
Peter F. Hickey ◽  
...  

A key benefit of long-read nanopore sequencing technology is the ability to detect modified DNA bases, such as 5-methylcytosine. The lack of R/Bioconductor tools for effective visualization of nanopore methylation profiles between samples from different experimental groups led us to develop the NanoMethViz R package. Our software can handle methylation output generated by a range of methylation callers and manages large datasets using a compressed data format. To fully explore the methylation patterns in a dataset, NanoMethViz allows plotting of data at various resolutions. At the sample level, we use dimensionality reduction to examine the relationships between methylation profiles in an unsupervised way. We visualize methylation profiles of classes of features, such as genes or CpG islands, by scaling them to relative positions and aggregating their profiles. At the finest resolution, we visualize methylation patterns across individual reads along the genome using spaghetti plots and heatmaps, allowing users to explore particular genes or genomic regions of interest. In summary, our software makes the handling of methylation signals more convenient, expands the visualization options for nanopore data, and works seamlessly with existing methylation analysis tools available in the Bioconductor project. Our software is available at https://bioconductor.org/packages/NanoMethViz.
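The feature-level aggregation described above, scaling each feature (e.g., a gene) to a common relative coordinate axis and averaging methylation calls across features, can be sketched in plain Python. The function below is a hypothetical illustration of the technique, not the NanoMethViz API:

```python
# Hedged sketch of feature-scaled aggregation (not the NanoMethViz API):
# each methylation call is (genomic_position, methylated: bool); each
# feature (e.g., a gene) is a (start, end) interval. Positions inside a
# feature are mapped to relative coordinates in [0, 1), binned, and the
# methylation rate is averaged per bin across all features.

def aggregate_profiles(features, calls, n_bins=10):
    """Return per-bin mean methylation rate across all features."""
    totals = [0] * n_bins
    counts = [0] * n_bins
    for start, end in features:
        width = end - start
        for pos, methylated in calls:
            if start <= pos < end:
                rel = (pos - start) / width              # scale to [0, 1)
                b = min(int(rel * n_bins), n_bins - 1)   # assign a bin
                totals[b] += int(methylated)
                counts[b] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]

# Two "genes" of different lengths share one relative axis.
profile = aggregate_profiles(
    features=[(100, 200), (1000, 1400)],
    calls=[(105, True), (195, False), (1010, True), (1390, True)],
    n_bins=2,
)
print(profile)  # [1.0, 0.5]
```

The point of the rescaling step is that genes of very different lengths contribute to the same relative positions, so a single averaged profile (e.g., promoter-proximal vs. gene-body methylation) can be drawn over thousands of features.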


2021 ◽  
Vol 14 (11) ◽  
pp. 2627-2641
Author(s):  
Michael Kuchnik ◽  
George Amvrosiadis ◽  
Virginia Smith

Deep learning accelerators efficiently train over vast and growing amounts of data, placing a newfound burden on commodity networks and storage devices. A common approach to conserving bandwidth involves resizing or compressing data prior to training. We introduce Progressive Compressed Records (PCRs), a data format that uses compression to reduce the overhead of fetching and transporting data, effectively reducing the training time required to achieve a target accuracy. PCRs deviate from previous storage formats by combining progressive compression with an efficient storage layout to view a single dataset at multiple fidelities, all without adding to the total dataset size. We implement PCRs and evaluate them on a range of datasets, training tasks, and hardware architectures. Our work shows that: (i) the amount of compression a dataset can tolerate exceeds 50% of the original encoding for many DL training tasks; (ii) it is possible to automatically and efficiently select appropriate compression levels for a given task; and (iii) PCRs enable tasks to readily access compressed data at runtime, utilizing as little as half the training bandwidth and thus potentially doubling training speed.
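The core progressive idea, reading only a prefix of the stored record to obtain a lower-fidelity view of every sample, can be illustrated with a simple bit-plane layout. This is a generic sketch of progressive encoding under assumed 8-bit values, not the paper's actual on-disk format:

```python
# Generic progressive-layout sketch (not the PCR on-disk format):
# 8-bit values are stored most-significant bit-plane first, so reading
# only the first k planes yields a coarse approximation of every value,
# while reading all 8 planes reconstructs the data exactly.

def encode_planes(values):
    """Split 8-bit values into 8 bit-planes, MSB plane first."""
    return [[(v >> (7 - p)) & 1 for v in values] for p in range(8)]

def decode_planes(planes):
    """Reconstruct values from however many planes were read."""
    n = len(planes[0])
    values = [0] * n
    for p, plane in enumerate(planes):
        for i, bit in enumerate(plane):
            values[i] |= bit << (7 - p)
    return values

data = [200, 17, 96]
planes = encode_planes(data)
coarse = decode_planes(planes[:2])   # prefix read: top 2 bits only
exact = decode_planes(planes)        # full read: lossless
print(coarse, exact)  # [192, 0, 64] [200, 17, 96]
```

Because the planes are ordered by significance, a trainer that stops reading after a prefix still sees the coarse structure of every sample at a fraction of the bandwidth, analogous to how progressive JPEG scans refine an image.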

