compressed index Latest Research Papers

Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment

Computational pan-genomics utilizes information from multiple individual genomes in large-scale comparative analysis. Genetic variation between case-controls, ethnic groups, or species can be discovered thoroughly using pan-genomes of such subpopulations. Whole-genome sequencing (WGS) data volumes are growing rapidly, making genomic data compression and indexing methods very important. Although space-efficient compression and indexing methods for repetitive sequences exist, the deployed compression methods are often sequential, computationally time-consuming, and do not provide efficient sequence alignment performance on vast collections of genomes such as pan-genomes. To perform rapid analytics on ever-growing genomics data, compression and indexing methods have to exploit distributed and parallel computing more efficiently. Instead of strict genome data compression methods, we focus on the efficient construction of a compressed index for pan-genomes. A compressed hybrid-index enables fast sequence alignments to several genomes at once while shrinking the index size significantly compared to traditional indexes. We propose a scalable distributed compressed hybrid-indexing method for large genomic data sets enabling pan-genome-based sequence search and read alignment capabilities. We show the scalability of our tool, DHPGIndex, by executing experiments in a distributed Apache Spark-based computing cluster comprising 448 cores distributed over 26 nodes. The experiments have been performed both with human and bacterial genomes. DHPGIndex built a BLAST index for an n = 250 human pan-genome with an 870:1 compression ratio (CR) in 342 minutes and a Bowtie2 index with 157:1 CR in 397 minutes. For an n = 1,000 human pan-genome, the BLAST index was built in 1520 minutes with 532:1 CR and the Bowtie2 index in 1938 minutes with 76:1 CR. Bowtie2 aligned 14.6 GB of paired-end reads to the compressed (n = 1,000) index in 31.7 minutes on a single node. Compressing the n = 13,375,031 (488 GB) GenBank database to a BLAST index resulted in a CR of 62:1 in 575 minutes. BLASTing 189,864 CRISPR-Cas9 gRNA target sequences (23 MB in total) to the compressed index of the human pan-genome (n = 1,000) finished in 45 minutes on a single node. 30 MB of mixed bacterial sequences (n = 599) were BLASTed to the compressed index of the 488 GB GenBank database (n = 13,375,031) in 26 minutes on 25 nodes. 78 MB of mixed sequences (n = 4,167) were BLASTed to the compressed index of an 18 GB E. coli sequence database (n = 745,409) in 5.4 minutes on a single node.

Pan-genomic Matching Statistics for Targeted Nanopore Sequencing

10.1101/2021.03.23.436610 ◽ 2021

Author(s): Omar Ahmed ◽ Massimiliano Rossi ◽ Sam Kovaka ◽ Michael Schatz ◽ Travis Gagie ◽ ...

Keyword(s): Targeted Sequencing ◽ Nanopore Sequencing ◽ Target Dna ◽ Specific Strain ◽ Reference Databases ◽ Novel Method ◽ Compressed Index ◽ Similar Accuracy ◽ Reference Genomes ◽ Index Size

Nanopore sequencing is an increasingly powerful tool for genomics. Recently, computational advances have allowed nanopores to sequence in a targeted fashion; as the sequencer emits data, software can analyze the data in real time and signal the sequencer to eject "non-target" DNA molecules. We present a novel method called SPUMONI, which enables rapid and accurate targeted sequencing with the help of efficient pangenome indexes. SPUMONI uses a compressed index to rapidly generate exact or approximate matching statistics (half-maximal exact matches) in a streaming fashion. When used to target a specific strain in a mock community, SPUMONI has accuracy similar to minimap2 when both are run against an index containing many strains per species. However, SPUMONI is 12 times faster than minimap2. SPUMONI's index and peak memory footprint are also 15 and 4 times smaller than minimap2's, respectively. These improvements become even more pronounced with even larger reference databases; SPUMONI's index size scales sublinearly with the number of reference genomes included. This could enable accurate targeted sequencing even in the case where the targeted strains have not necessarily been sequenced or assembled previously. SPUMONI is open source software available from https://github.com/oma219/spumoni.
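
To make the term "matching statistics" concrete, the following is a minimal sketch that computes them by brute force. It is not SPUMONI's algorithm, which works on a compressed index in a streaming fashion; the function name `matching_statistics` and the example strings are hypothetical and chosen only for illustration. Here MS[i] is taken to be the length of the longest prefix of `pattern[i:]` that occurs anywhere in `text`.

```python
def matching_statistics(pattern: str, text: str) -> list[int]:
    """Return MS, where MS[i] is the length of the longest prefix of
    pattern[i:] that occurs somewhere in text (naive substring scan)."""
    ms = []
    length = 0
    for i in range(len(pattern)):
        # If pattern[i-1 : i-1+L] occurs in text, then pattern[i : i-1+L]
        # occurs too, so the previous value minus one is a safe start.
        length = max(length - 1, 0)
        # Extend the match one character at a time while it still occurs.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


reference = "ACGTACGT"   # stands in for an indexed reference collection
read = "CGTTAC"          # stands in for a sequenced read
print(matching_statistics(read, reference))
```

Long values suggest that the read shares long exact substrings with the reference, while a run of short values suggests it does not. How such values are turned into an eject or keep decision is outside the scope of this sketch.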

Accelerating Maximal-Exact-Match Seeding with Enumerated Radix Trees

10.1101/2020.03.23.003897 ◽ 2020

Author(s): Arun Subramaniyan ◽ Jack Wadden ◽ Kush Goliya ◽ Nathan Ozog ◽ Xiao Wu ◽ ...

Keyword(s): Data Structure ◽ Human Genome ◽ Memory Capacity ◽ Index Structure ◽ Memory Bandwidth ◽ Exact Match ◽ Genome Sequence Analysis ◽ Read Alignment ◽ Major Bottleneck ◽ Compressed Index

Motivation: Read alignment is a time-consuming step in genome sequence analysis. In the read alignment software BWA-MEM and the recently published faster version BWA-MEM2, the seeding step is a major bottleneck, for instance, contributing 38% to the overall execution time in BWA-MEM2 when aligning single-end whole human genome reads from the Platinum Genomes dataset. This is because both BWA-MEM and BWA-MEM2 use a compressed index structure called the FMD-Index, which results in high memory bandwidth requirements for seeding, primarily due to its character-by-character processing of reads.

Results: We propose a memory bandwidth-aware data structure for maximal-exact-match seeding called Enumerated Radix Tree (ERT). ERT trades off memory capacity to improve seeding performance (∼60 GB index for human genome). Together with optimizations to the seeding algorithm and mate-rescue step, ERT when integrated into BWA-MEM2 speeds up overall read alignment by 1.28× and provides up to 2.1× higher seeding performance while guaranteeing identical output to the original software. Furthermore, we prototype an FPGA implementation of ERT on Amazon EC2 F1 cloud and observe 1.6× higher seeding throughput over a 48-thread optimized CPU-ERT implementation.

Availability and implementation: https://github.com/arun-sub/[email protected], [email protected]
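
The character-by-character processing mentioned above can be illustrated with backward search on a plain FM-index. This is a minimal sketch under simplifying assumptions: it builds an uncompressed, unidirectional FM-index with a naive suffix sort, not the bidirectional FMD-Index used by BWA-MEM and BWA-MEM2, and not the ERT. The names `build_fm_index`, `occ`, and `count_occurrences` are hypothetical.

```python
def build_fm_index(text: str):
    """Build the BWT and C table of text (naive construction, for illustration)."""
    text += "$"  # sentinel, assumed smaller than every other character
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    # c_table[ch] = number of characters in text strictly smaller than ch
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return bwt, c_table


def occ(bwt: str, ch: str, i: int) -> int:
    """Number of occurrences of ch in bwt[:i] (a real index samples this)."""
    return bwt[:i].count(ch)


def count_occurrences(pattern: str, bwt: str, c_table: dict) -> int:
    """Backward search: one interval update per pattern character."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        # Each step touches the index at two unrelated positions (lo and hi).
        lo = c_table[ch] + occ(bwt, ch, lo)
        hi = c_table[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt, c_table = build_fm_index("ACGTACGT")
print(count_occurrences("ACG", bwt, c_table))
```

Every character of the read triggers lookups at positions that are generally far apart in the index, which gives some intuition for why this style of search is described as demanding on memory bandwidth.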

GraCT: A Grammar-based Compressed Index for Trajectory Data

Information Sciences ◽ 10.1016/j.ins.2019.01.035 ◽ 2019 ◽ Vol 483 ◽ pp. 106-135 ◽ Cited By ~ 2

Author(s): Nieves R. Brisaboa ◽ Adrián Gómez-Brandón ◽ Gonzalo Navarro ◽ José R. Paramá

Keyword(s): Trajectory Data ◽ Compressed Index

DREAM-Yara: An exact read mapper for very large databases with short update time

10.1101/256354 ◽ 2018 ◽ Cited By ~ 1

Author(s): Temesgen Hailemariam Dadi ◽ Enrico Siragusa ◽ Vitor C. Piro ◽ Andreas Andrusch ◽ Enrico Seiler ◽ ...

Keyword(s): Computing Time ◽ Search Time ◽ Bloom Filter ◽ Bloom Filters ◽ Fast Search ◽ Approximate Search ◽ Large Sets ◽ Large Databases ◽ Very Large Databases ◽ Compressed Index

Motivation: Mapping-based approaches have become limited in their application to very large sets of references since computing an FM-index for very large databases (e.g. > 10 GB) has become a bottleneck. This affects many analyses that need such an index as an essential step for approximate matching of the NGS reads to reference databases. For instance, in typical metagenomics analysis, the size of the reference sequences has become prohibitive to compute a single full-text index on standard machines. Even on large memory machines, computing such an index takes about one day of computing time. As a result, updates of indices are rarely performed. Hence, it is desirable to create an alternative way of indexing while preserving fast search times.

Results: To solve the index construction and update problem we propose the DREAM (Dynamic seaRchablE pArallel coMpressed index) framework and provide an implementation. The first main contribution is the introduction of approximate search distributor directories via a novel use of Bloom filters. We combine several Bloom filters to form an interleaved Bloom filter and use this new data structure to quickly exclude reads for parts of the databases where they cannot match. This allows us to keep the databases in several indices which can be easily rebuilt if parts are updated while maintaining a fast search time. The second main contribution is an implementation of DREAM-Yara, a distributed version of a fully sensitive read mapper under the DREAM [email protected]://gitlab.com/pirovc/dream_yara/
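
The idea of combining several Bloom filters into an interleaved Bloom filter can be sketched as follows. This is a toy illustration under stated assumptions, not the DREAM-Yara implementation: the class `InterleavedBloomFilter`, the helper `kmers`, the SHA-256 based hashing, the filter sizes, and the example sequences are all hypothetical. Each database part ("bin") gets its own Bloom filter, and the filters are stored row-wise so that one lookup returns one bit per bin.

```python
import hashlib


class InterleavedBloomFilter:
    """Toy interleaved Bloom filter: one Bloom filter per database bin,
    stored row-wise so a single lookup yields one bit for every bin."""

    def __init__(self, num_bins: int, bits_per_bin: int, num_hashes: int = 3):
        self.num_bins = num_bins
        self.bits_per_bin = bits_per_bin
        self.num_hashes = num_hashes
        # rows[p] is an integer bit vector; bit b is position p of bin b's filter.
        self.rows = [0] * bits_per_bin

    def _positions(self, kmer: str):
        # Derive num_hashes positions by hashing the k-mer with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits_per_bin

    def add(self, kmer: str, bin_index: int) -> None:
        for pos in self._positions(kmer):
            self.rows[pos] |= 1 << bin_index

    def candidate_bins(self, kmer: str) -> list[int]:
        # AND the rows together: a bin survives only if all its bits are set.
        hits = (1 << self.num_bins) - 1
        for pos in self._positions(kmer):
            hits &= self.rows[pos]
        return [b for b in range(self.num_bins) if (hits >> b) & 1]


def kmers(sequence: str, k: int) -> list[str]:
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


ibf = InterleavedBloomFilter(num_bins=2, bits_per_bin=1024)
for bin_index, reference in enumerate(["ACGTACGTGA", "TTTTGGGGCC"]):
    for kmer in kmers(reference, 4):
        ibf.add(kmer, bin_index)

print(ibf.candidate_bins("ACGT"))
```

A bin that is absent from the result cannot contain the k-mer, so a read mapper only needs to search the indices of the remaining bins. False positives are possible, as with any Bloom filter, but false negatives are not.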

FM-index of alignment: A compressed index for similar strings

Theoretical Computer Science ◽ 10.1016/j.tcs.2015.08.008 ◽ 2016 ◽ Vol 638 ◽ pp. 159-170 ◽ Cited By ~ 13

Author(s): Joong Chae Na ◽ Hyunjoon Kim ◽ Heejin Park ◽ Thierry Lecroq ◽ Martine Léonard ◽ ...

Keyword(s): Compressed Index

Development of a Novel Compressed Index-Query Web Search Engine Model

Network and Communication Technology Innovations for Web and IT Advancement ◽ 10.4018/978-1-4666-2157-2.ch019 ◽ 2014 ◽ pp. 275-293

Author(s): Hussein Al-Bahadili ◽ Saif Al-Saab

Keyword(s): Data Compression ◽ Search Engine ◽ Web Search ◽ Server Side ◽ Engine Model ◽ Web Search Engine ◽ Index Index ◽ Compression Layer ◽ Compressed Index ◽ Space Requirements

In this paper, the authors present a description of a new Web search engine model, the compressed index-query (CIQ) Web search engine model. This model incorporates two bit-level compression layers implemented at the back-end processor (server) side. The first layer resides after the indexer and acts as a second compression layer to generate a double compressed index (index compressor). The second layer resides after the query parser and compresses queries (query compressor) to enable bit-level compressed index-query search. The data compression algorithm used in this model is the Hamming codes-based data compression (HCDC) algorithm, an asymmetric, lossless, bit-level algorithm that permits CIQ search. The different components of the new Web model are implemented in a prototype CIQ test tool (CIQTT), which is used as a test bench to validate the accuracy and integrity of the retrieved data and to evaluate the performance of the proposed model. The test results demonstrate that the proposed CIQ model reduces disk space requirements and searching time by more than 24%, and attains 100% agreement when compared with an uncompressed model.

Isosurface Extraction Method Based on Compressed Index in Second-order Tetrahedron

Journal of Mechanical Engineering ◽ 10.3901/jme.2014.19.166 ◽ 2014 ◽ Vol 50 (19) ◽ pp. 166

Author(s): Hedan LIU

Keyword(s): Extraction Method ◽ Second Order ◽ Isosurface Extraction ◽ Compressed Index

A Compressed Index for Hamming Distances

10.1007/978-3-319-11988-5_11 ◽ 2014 ◽ pp. 113-126

Author(s): Francisco Santoyo ◽ Edgar Chávez ◽ Eric S. Téllez

Keyword(s): Compressed Index

Using the Sadakane Compressed Suffix Tree to Solve the All-Pairs Suffix-Prefix Problem

BioMed Research International ◽ 10.1155/2014/745298 ◽ 2014 ◽ Vol 2014 ◽ pp. 1-11 ◽ Cited By ~ 3

Author(s): Maan Haj Rachid ◽ Qutaibah Malluhi ◽ Mohamed Abouelhoda

Keyword(s): Parallel Algorithm ◽ Suffix Tree ◽ De Novo ◽ Efficient Solutions ◽ Experimental Results ◽ Matching Problem ◽ De Novo Genome Assembly ◽ Large Size ◽ String Processing ◽ Compressed Index

The all-pairs suffix-prefix matching problem is a basic problem in string processing. It has an application in the de novo genome assembly task, which is one of the major bioinformatics problems. Due to the large size of the input data, it is crucial to use fast and space-efficient solutions. In this paper, we present a space-economical solution to this problem using the generalized Sadakane compressed suffix tree. Furthermore, we present a parallel algorithm to provide more speed for shared memory computers. Our sequential and parallel algorithms are optimized by exploiting features of the Sadakane compressed index data structure. Experimental results show that our solution based on Sadakane's compressed index consumes significantly less space than the ones based on noncompressed data structures like the suffix tree and the enhanced suffix array. Our experimental results also show that our parallel algorithm is efficient and scales well with an increasing number of processors.
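
For readers unfamiliar with the problem statement, the following is a minimal brute-force sketch of all-pairs suffix-prefix matching. It is not the compressed suffix tree solution described in the abstract, and it makes no attempt at space or time efficiency; the function names and the example strings are hypothetical. For every ordered pair of distinct strings it reports the length of the longest suffix of the first that is also a prefix of the second.

```python
def longest_suffix_prefix(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b.
    In this sketch a whole-string overlap counts as a valid match."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_pairs_suffix_prefix(strings: list[str]) -> dict[tuple[int, int], int]:
    """Map every ordered pair (i, j), i != j, to its longest overlap length."""
    return {
        (i, j): longest_suffix_prefix(s, t)
        for i, s in enumerate(strings)
        for j, t in enumerate(strings)
        if i != j
    }


reads = ["ACGT", "GTAC", "ACAA"]
for pair, overlap in sorted(all_pairs_suffix_prefix(reads).items()):
    print(pair, overlap)
```

The brute-force version compares every pair directly, so its cost grows quadratically with the number of strings. That growth is the reason index-based solutions such as the one in the abstract matter for assembly-sized inputs.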

compressed index
Recently Published Documents

Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment

Pan-genomic Matching Statistics for Targeted Nanopore Sequencing

Accelerating Maximal-Exact-Match Seeding with Enumerated Radix Trees

GraCT: A Grammar-based Compressed Index for Trajectory Data

DREAM-Yara: An exact read mapper for very large databases with short update time

FM-index of alignment: A compressed index for similar strings

Development of a Novel Compressed Index-Query Web Search Engine Model

Isosurface Extraction Method Based on Compressed Index in Second-order Tetrahedron

A Compressed Index for Hamming Distances

Using the Sadakane Compressed Suffix Tree to Solve the All-Pairs Suffix-Prefix Problem
