compressed data structures Latest Research Papers

Lossless Indexing with Counting de Bruijn Graphs

10.1101/2021.11.09.467907 ◽

2021 ◽

Author(s):

Mikhail Karasikov ◽

Harun Mustafa ◽

Gunnar Rätsch ◽

André Kahles

Keyword(s):

High Throughput Sequencing ◽

Sparse Matrices ◽

Rna Seq ◽

Sequencing Data ◽

De Bruijn Graphs ◽

High Throughput Sequencing Data ◽

Alignment Algorithms ◽

Compressed Data Structures ◽

De Bruijn ◽

Public Repositories

High-throughput sequencing data is rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in solving the experiment discovery problem and building compressed representations of annotated de Bruijn graphs where k-mer sets can be efficiently indexed and interactively queried. However, approaches for representing and retrieving other quantitative attributes such as gene expression or genome positions in a general manner have yet to be developed. In this work, we propose the concept of Counting de Bruijn graphs generalizing the notion of annotated (or colored) de Bruijn graphs. Counting de Bruijn graphs supplement each node-label relation with one or many attributes (e.g., a k-mer count or its positions in genome). To represent them, we first observe that many schemes for the representation of compressed binary matrices already support the rank operation on the columns or rows, which can be used to define an inherent indexing of any additional quantitative attributes. Based on this property, we generalize these schemes and introduce a new approach for representing non-binary sparse matrices in compressed data structures. Finally, we notice that relation attributes are often easily predictable from a node's local neighborhood in the graph. Notable examples are genome positions shifting by 1 for neighboring nodes in the graph, or expression levels that are often shared across neighbors. We exploit this regularity of graph annotations and apply an invertible delta-like coding to achieve better compression. We show that Counting de Bruijn graphs index k-mer counts from 2,652 human RNA-Seq read sets in representations over 8-fold smaller and yet faster to query compared to state-of-the-art bioinformatics tools. Furthermore, Counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip -9 for human Illumina RNA-Seq and 57% smaller for PacBio HiFi sequencing of viral samples. A complete joint searchable index of all viral PacBio SMRT reads from NCBI's SRA (152,884 read sets, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, they generate a lossless and fully queryable index that is 4.4-fold smaller compared to the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools employing de Bruijn graphs and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed and fully searchable graph-based sequence indexes.

Download Full-text

Engineering Practical Lempel-Ziv Tries

Journal of Experimental Algorithmics ◽

10.1145/3481638 ◽

2021 ◽

Vol 26 (1) ◽

pp. 1-47

Author(s):

Diego Arroyuelo ◽

Rodrigo Cánovas ◽

Johannes Fischer ◽

Dominik Köppl ◽

Marvin Löbel ◽

...

Keyword(s):

Factor Structure ◽

Data Structures ◽

Independent Interest ◽

New Techniques ◽

Memory Space ◽

Compressed Data Structures ◽

Regular Factor ◽

Factorization Algorithms ◽

Compressed Data

The Lempel-Ziv 78 ( LZ78 ) and Lempel-Ziv-Welch ( LZW ) text factorizations are popular, not only for bare compression but also for building compressed data structures on top of them. Their regular factor structure makes them computable within space bounded by the compressed output size. In this article, we carry out the first thorough study of low-memory LZ78 and LZW text factorization algorithms, introducing more efficient alternatives to the classical methods, as well as new techniques that can run within less memory space than the necessary to hold the compressed file. Our results build on hash-based representations of tries that may have independent interest.

Download Full-text

Compressed Data Structures for Binary Relations in Practice

IEEE Access ◽

10.1109/access.2020.2970983 ◽

2020 ◽

Vol 8 ◽

pp. 25949-25963

Author(s):

Carlos Quijada Fuentes ◽

Miguel R. Penabad ◽

Susana Ladra ◽

Gilberto Gutierrez Retamal

Keyword(s):

Data Structures ◽

Binary Relations ◽

Compressed Data Structures ◽

Compressed Data

Download Full-text

Integrated querying and version control of context-specific biological networks

Database ◽

10.1093/database/baaa018 ◽

2020 ◽

Vol 2020 ◽

Cited By ~ 1

Author(s):

Tyler Cowman ◽

Mustafa Coşkun ◽

Ananth Grama ◽

Mehmet Koyutürk

Keyword(s):

Data Structures ◽

Biological Networks ◽

Computational Cost ◽

Tissue Type ◽

Large Network ◽

Integrated Network ◽

Compressed Data Structures ◽

Tree Data ◽

Context Specific ◽

Tree Data Structure

Abstract Motivation Biomolecular data stored in public databases is increasingly specialized to organisms, context/pathology and tissue type, potentially resulting in significant overhead for analyses. These networks are often specializations of generic interaction sets, presenting opportunities for reducing storage and computational cost. Therefore, it is desirable to develop effective compression and storage techniques, along with efficient algorithms and a flexible query interface capable of operating on compressed data structures. Current graph databases offer varying levels of support for network integration. However, these solutions do not provide efficient methods for the storage and querying of versioned networks. Results We present VerTIoN, a framework consisting of novel data structures and associated query mechanisms for integrated querying of versioned context-specific biological networks. As a use case for our framework, we study network proximity queries in which the user can select and compose a combination of tissue-specific and generic networks. Using our compressed version tree data structure, in conjunction with state-of-the-art numerical techniques, we demonstrate real-time querying of large network databases. Conclusion Our results show that it is possible to support flexible queries defined on heterogeneous networks composed at query time while drastically reducing response time for multiple simultaneous queries. The flexibility offered by VerTIoN in composing integrated network versions opens significant new avenues for the utilization of ever increasing volume of context-specific network data in a broad range of biomedical applications. Availability and Implementation VerTIoN is implemented as a C++ library and is available at http://compbio.case.edu/omics/software/vertion and https://github.com/tjcowman/vertion Contact [email protected]

Download Full-text

Compressed Data Structures for Astronomical Content-Aware Resource Search

2019 38th International Conference of the Chilean Computer Science Society (SCCC) ◽

10.1109/sccc49216.2019.8966420 ◽

2019 ◽

Author(s):

Mauricio Araya ◽

Diego Arroyuelo ◽

Camilo Saldias ◽

Mauricio Solar

Keyword(s):

Data Structures ◽

Resource Search ◽

Compressed Data Structures ◽

Compressed Data ◽

Content Aware

Download Full-text

Applications of Non-Uniquely Decodable Codes to Privacy-Preserving High-Entropy Data Representation

Algorithms ◽

10.3390/a12040078 ◽

2019 ◽

Vol 12 (4) ◽

pp. 78

Author(s):

Muhammed Oğuzhan Külekci ◽

Yasin Öztürk

Keyword(s):

Original Data ◽

Data Representation ◽

Code Word ◽

Direct Access ◽

Privacy Concerns ◽

High Entropy ◽

Compressed Data Structures ◽

Encryption Schemes ◽

Boundary Information ◽

Compressed Data

Non-uniquely-decodable (non-UD) codes can be defined as the codes that cannot be uniquely decoded without additional disambiguation information. These are mainly the class of non–prefix–free codes, where a code-word can be a prefix of other(s), and thus, the code-word boundary information is essential for correct decoding. Due to their inherent unique decodability problem, such non-UD codes have not received much attention except a few studies, in which using compressed data structures to represent the disambiguation information efficiently had been previously proposed. It had been shown before that the compression ratio can get quite close to Huffman/Arithmetic codes with an additional capability of providing direct access in compressed data, which is a missing feature in the regular Huffman codes. In this study we investigate non-UD codes in another dimension addressing the privacy of the high-entropy data. We particularly focus on such massive volumes, where typical examples are encoded video or similar multimedia files. Representation of such a volume with non–UD coding creates two elements as the disambiguation information and the payload, where decoding the original data from these elements becomes hard when one of them is missing. We make use of this observation for privacy concerns. and study the space consumption as well as the hardness of that decoding. We conclude that non-uniquely-decodable codes can be an alternative to selective encryption schemes that aim to secure only part of the data when data is huge. We provide a freely available software implementation of the proposed scheme as well.

Download Full-text

Compressed data structures for bi-objective {0,1}-knapsack problems

Computers & Operations Research ◽

10.1016/j.cor.2017.08.008 ◽

2018 ◽

Vol 89 ◽

pp. 82-93

Author(s):

Pedro Correia ◽

Luís Paquete ◽

José Rui Figueira

Keyword(s):

Data Structures ◽

Knapsack Problems ◽

Compressed Data Structures ◽

Compressed Data

Download Full-text

Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes

Genome Research ◽

10.1101/gr.211748.116 ◽

2016 ◽

Vol 27 (2) ◽

pp. 300-309 ◽

Cited By ~ 13

Author(s):

Dirk D. Dolle ◽

Zhicheng Liu ◽

Matthew Cotten ◽

Jared T. Simpson ◽

Zamin Iqbal ◽

...

Keyword(s):

Data Structures ◽

Human Genomes ◽

Compressed Data Structures ◽

Compressed Data

Download Full-text

Algorithms for the compression of genomic big data

10.7287/peerj.preprints.2176 ◽

2016 ◽

Author(s):

Nicola Prezza ◽

Alberto Policriti

Keyword(s):

Data Structures ◽

Dynamic Data Structures ◽

Dynamic Data ◽

Working Space ◽

Input Size ◽

Desktop Computers ◽

Compressed Data Structures ◽

Text Collections ◽

Two Measures ◽

Burrows Wheeler Transform

Motivations. Building the Burrows-Wheeler transform (BWT) and computing the Lempel-Ziv parsing (LZ77) of huge collections of genomes is becoming an important task in bioinformatic analyses as these datasets often need to be compressed and indexed prior to analysis. Given that the sizes of such datasets often exceed RAM capacity of common machines however, standard algorithms cannot be used to solve this problem as they require a working space at least linear in the input size. One way to solve this problem is to exploit the intrinsic compressibility of such datasets: two genomes from the same species share most of their information (often more than 99%), so families of genomes can be considerably compressed. A solution to the above problem could therefore be that of designing algorithms working in compressed working space, i.e. algorithms that stream the input from disk and require in RAM a space that is proportional to the size of the compressed text. Methods. In this work we present algorithms and data structures to compress and index text in compressed working space. These results build upon compressed dynamic data structures, a sub-field of compressed data structures research that is lately receiving a lot of attention. We focus on two measures of compressibility: the empirical entropy H of the text and the number r of equal-letter runs in the BWT of the text. We show how to build the BWT and LZ77 using only O(Hn) and (rlog n) working space, n being the size of the collection. For the case of repetitive text collections (such as sets of genomes from the same species), this considerably improves the working space required by state-of-the art algorithms in the literature. The algorthms and data structures here discussed have all been implemented in a public C++ library, available at github.com/nicolaprezza/DYNAMIC. The library includes dynamic gap-encoded bitvectors, run-length encoded (RLE) strings, and RLE FM-indexes. Results. We conclude with an overview of the experimental results that we obtained running our algorithms on highly repetitive genomic datasets. As expected, our solutions require only a small fraction of the working space used by solutions working in non-compressed space, making it feasible to compute BWT and LZ77 of huge collections of genomes even on desktop computers with small amounts of RAM available. As a downside of using complex dynamic data structures however, running times are still not practical so improvements such as parallelization may be needed in order to make these solutions fully practical.

Download Full-text

Algorithms for the compression of genomic big data

10.7287/peerj.preprints.2176v1 ◽

2016 ◽

Author(s):

Nicola Prezza ◽

Alberto Policriti

Keyword(s):

Data Structures ◽

Dynamic Data Structures ◽

Dynamic Data ◽

Working Space ◽

Input Size ◽

Desktop Computers ◽

Compressed Data Structures ◽

Text Collections ◽

Two Measures ◽

Burrows Wheeler Transform

Motivations. Building the Burrows-Wheeler transform (BWT) and computing the Lempel-Ziv parsing (LZ77) of huge collections of genomes is becoming an important task in bioinformatic analyses as these datasets often need to be compressed and indexed prior to analysis. Given that the sizes of such datasets often exceed RAM capacity of common machines however, standard algorithms cannot be used to solve this problem as they require a working space at least linear in the input size. One way to solve this problem is to exploit the intrinsic compressibility of such datasets: two genomes from the same species share most of their information (often more than 99%), so families of genomes can be considerably compressed. A solution to the above problem could therefore be that of designing algorithms working in compressed working space, i.e. algorithms that stream the input from disk and require in RAM a space that is proportional to the size of the compressed text. Methods. In this work we present algorithms and data structures to compress and index text in compressed working space. These results build upon compressed dynamic data structures, a sub-field of compressed data structures research that is lately receiving a lot of attention. We focus on two measures of compressibility: the empirical entropy H of the text and the number r of equal-letter runs in the BWT of the text. We show how to build the BWT and LZ77 using only O(Hn) and (rlog n) working space, n being the size of the collection. For the case of repetitive text collections (such as sets of genomes from the same species), this considerably improves the working space required by state-of-the art algorithms in the literature. The algorthms and data structures here discussed have all been implemented in a public C++ library, available at github.com/nicolaprezza/DYNAMIC. The library includes dynamic gap-encoded bitvectors, run-length encoded (RLE) strings, and RLE FM-indexes. Results. We conclude with an overview of the experimental results that we obtained running our algorithms on highly repetitive genomic datasets. As expected, our solutions require only a small fraction of the working space used by solutions working in non-compressed space, making it feasible to compute BWT and LZ77 of huge collections of genomes even on desktop computers with small amounts of RAM available. As a downside of using complex dynamic data structures however, running times are still not practical so improvements such as parallelization may be needed in order to make these solutions fully practical.

Download Full-text

compressed data structures
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

Lossless Indexing with Counting de Bruijn Graphs

Engineering Practical Lempel-Ziv Tries

Compressed Data Structures for Binary Relations in Practice

Integrated querying and version control of context-specific biological networks

Compressed Data Structures for Astronomical Content-Aware Resource Search

Applications of Non-Uniquely Decodable Codes to Privacy-Preserving High-Entropy Data Representation

Compressed data structures for bi-objective {0,1}-knapsack problems

Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes

Algorithms for the compression of genomic big data

Algorithms for the compression of genomic big data

Export Citation Format

compressed data structuresRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

Lossless Indexing with Counting de Bruijn Graphs

Engineering Practical Lempel-Ziv Tries

Compressed Data Structures for Binary Relations in Practice

Integrated querying and version control of context-specific biological networks

Compressed Data Structures for Astronomical Content-Aware Resource Search

Applications of Non-Uniquely Decodable Codes to Privacy-Preserving High-Entropy Data Representation

Compressed data structures for bi-objective {0,1}-knapsack problems

Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes

Algorithms for the compression of genomic big data

Algorithms for the compression of genomic big data

compressed data structures
Recently Published Documents