compressed index
Recently Published Documents



PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0255260
Author(s):  
Altti Ilari Maarala ◽  
Ossi Arasalo ◽  
Daniel Valenzuela ◽  
Veli Mäkinen ◽  
Keijo Heljanko

Computational pan-genomics utilizes information from multiple individual genomes in large-scale comparative analysis. Genetic variation between cases and controls, ethnic groups, or species can be discovered thoroughly using pan-genomes of such subpopulations. Whole-genome sequencing (WGS) data volumes are growing rapidly, making genomic data compression and indexing methods very important. Despite current space-efficient repetitive sequence compression and indexing methods, the deployed compression methods are often sequential, computationally time-consuming, and do not provide efficient sequence alignment performance on vast collections of genomes such as pan-genomes. To perform rapid analytics on the ever-growing genomics data, data compression and indexing methods have to exploit distributed and parallel computing more efficiently. Instead of strict genome data compression methods, we focus on the efficient construction of a compressed index for pan-genomes. A compressed hybrid index enables fast sequence alignments to several genomes at once while shrinking the index size significantly compared to traditional indexes. We propose a scalable distributed compressed hybrid-indexing method for large genomic data sets enabling pan-genome-based sequence search and read alignment capabilities. We show the scalability of our tool, DHPGIndex, by executing experiments in a distributed Apache Spark-based computing cluster comprising 448 cores distributed over 26 nodes. The experiments have been performed with both human and bacterial genomes. DHPGIndex built a BLAST index for an n = 250 human pan-genome with an 870:1 compression ratio (CR) in 342 minutes and a Bowtie2 index with a 157:1 CR in 397 minutes. For an n = 1,000 human pan-genome, the BLAST index was built in 1520 minutes with a 532:1 CR and the Bowtie2 index in 1938 minutes with a 76:1 CR. Bowtie2 aligned 14.6 GB of paired-end reads to the compressed (n = 1,000) index in 31.7 minutes on a single node. Compressing the GenBank database (n = 13,375,031; 488 GB) to a BLAST index resulted in a CR of 62:1 in 575 minutes. BLASTing 189,864 CRISPR-Cas9 gRNA target sequences (23 MB in total) against the compressed index of the human pan-genome (n = 1,000) finished in 45 minutes on a single node. 30 MB of mixed bacterial sequences (n = 599) were BLASTed against the compressed index of the 488 GB GenBank database (n = 13,375,031) in 26 minutes on 25 nodes. 78 MB of mixed sequences (n = 4,167) were BLASTed against the compressed index of an 18 GB E. coli sequence database (n = 745,409) in 5.4 minutes on a single node.
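
As a rough worked example of what the reported compression ratios imply, the sketch below converts an input size and a CR into an index size. It assumes that CR means uncompressed input size divided by index size, which the abstract does not state explicitly; the function name is hypothetical.

```python
def implied_index_size_gb(raw_size_gb: float, compression_ratio: float) -> float:
    """Index size implied by a compression ratio.

    Assumption (not stated in the abstract): CR = raw input size / index size.
    """
    return raw_size_gb / compression_ratio


# GenBank figures quoted above: 488 GB of input at a 62:1 CR.
genbank_index_gb = implied_index_size_gb(488, 62)
print(f"{genbank_index_gb:.2f} GB")
```

Under that assumption, the 488 GB GenBank collection would correspond to a BLAST index of a little under 8 GB.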


2021 ◽  
Author(s):  
Omar Ahmed ◽  
Massimiliano Rossi ◽  
Sam Kovaka ◽  
Michael Schatz ◽  
Travis Gagie ◽  
...  

Nanopore sequencing is an increasingly powerful tool for genomics. Recently, computational advances have allowed nanopores to sequence in a targeted fashion: as the sequencer emits data, software can analyze the data in real time and signal the sequencer to eject "non-target" DNA molecules. We present a novel method called SPUMONI, which enables rapid and accurate targeted sequencing with the help of efficient pangenome indexes. SPUMONI uses a compressed index to rapidly generate exact or approximate matching statistics (half-maximal exact matches) in a streaming fashion. When used to target a specific strain in a mock community, SPUMONI has accuracy similar to that of minimap2 when both are run against an index containing many strains per species. However, SPUMONI is 12 times faster than minimap2. SPUMONI's index and peak memory footprint are also 15 and 4 times smaller than minimap2's, respectively. These improvements become even more pronounced with larger reference databases; SPUMONI's index size scales sublinearly with the number of reference genomes included. This could enable accurate targeted sequencing even when the targeted strains have not been sequenced or assembled previously. SPUMONI is open source software available from https://github.com/oma219/spumoni.
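
To make the term "matching statistics" concrete, here is a minimal sketch of its definition: for each position i of a read, the length of the longest prefix of read[i:] that occurs somewhere in the reference. This is a naive substring scan written only for illustration; it is not SPUMONI's method, which computes these values from a compressed index. The function name is an assumption.

```python
def matching_statistics(reference: str, read: str) -> list[int]:
    """ms[i] = length of the longest prefix of read[i:] occurring in reference.

    Naive illustration of the definition; a real tool would use an index.
    """
    ms = []
    for i in range(len(read)):
        length = 0
        # Extend the match one character at a time while it still occurs.
        while i + length < len(read) and read[i:i + length + 1] in reference:
            length += 1
        ms.append(length)
    return ms


print(matching_statistics("GATTACA", "TACAG"))  # [4, 3, 2, 1, 1]
```

Long values suggest that the read resembles the reference, while consistently short values suggest a "non-target" molecule.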


2020 ◽  
Author(s):  
Arun Subramaniyan ◽  
Jack Wadden ◽  
Kush Goliya ◽  
Nathan Ozog ◽  
Xiao Wu ◽  
...  

Motivation: Read alignment is a time-consuming step in genome sequence analysis. In the read alignment software BWA-MEM and the recently published faster version BWA-MEM2, the seeding step is a major bottleneck, for instance, contributing 38% to the overall execution time in BWA-MEM2 when aligning single-end whole human genome reads from the Platinum Genomes dataset. This is because both BWA-MEM and BWA-MEM2 use a compressed index structure called the FMD-Index, which results in high memory bandwidth requirements for seeding, primarily due to its character-by-character processing of reads. Results: We propose a memory bandwidth-aware data structure for maximal-exact-match seeding called Enumerated Radix Tree (ERT). ERT trades off memory capacity to improve seeding performance (∼60 GB index for human genome). Together with optimizations to the seeding algorithm and mate-rescue step, ERT when integrated into BWA-MEM2 speeds up overall read alignment by 1.28× and provides up to 2.1× higher seeding performance while guaranteeing identical output to the original software. Furthermore, we prototype an FPGA implementation of ERT on Amazon EC2 F1 cloud and observe 1.6× higher seeding throughput over a 48-thread optimized CPU-ERT implementation. Availability and implementation: https://github.com/arun-sub/[email protected], [email protected]
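
The "character-by-character processing" mentioned above can be illustrated with backward search on a plain FM-index, sketched below. This is a simplified, unidirectional toy (the FMD-Index used by BWA-MEM is bidirectional), built with naive suffix sorting and uncompressed rank counting; the function names are assumptions. Each loop iteration consumes one pattern character and performs two rank lookups, which is where the scattered memory accesses come from on a genome-sized index.

```python
def build_fm_index(text: str):
    """Return (bwt, c) for text, using naive suffix sorting (toy sizes only)."""
    text += "$"  # sentinel, assumed smaller than every other character
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    # c[ch] = number of characters in text strictly smaller than ch
    c, total = {}, 0
    for ch in sorted(set(text)):
        c[ch] = total
        total += text.count(ch)
    return bwt, c


def count_occurrences(bwt: str, c: dict, pattern: str) -> int:
    """Count occurrences of pattern via backward search, one character per step."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in c:
            return 0
        # Two rank queries per character; a real index answers these from
        # sampled occurrence tables rather than by scanning the BWT.
        lo = c[ch] + bwt[:lo].count(ch)
        hi = c[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


bwt, c = build_fm_index("GATTACA")
print(count_occurrences(bwt, c, "TA"))  # 1
```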


2019 ◽  
Vol 483 ◽  
pp. 106-135 ◽  
Author(s):  
Nieves R. Brisaboa ◽  
Adrián Gómez-Brandón ◽  
Gonzalo Navarro ◽  
José R. Paramá

2018 ◽  
Author(s):  
Temesgen Hailemariam Dadi ◽  
Enrico Siragusa ◽  
Vitor C. Piro ◽  
Andreas Andrusch ◽  
Enrico Seiler ◽  
...  

Motivation: Mapping-based approaches have become limited in their application to very large sets of references since computing an FM-index for very large databases (e.g. > 10 GB) has become a bottleneck. This affects many analyses that need such an index as an essential step for approximate matching of the NGS reads to reference databases. For instance, in typical metagenomics analysis, the size of the reference sequences has become prohibitive to compute a single full-text index on standard machines. Even on large memory machines, computing such an index takes about one day of computing time. As a result, updates of indices are rarely performed. Hence, it is desirable to create an alternative way of indexing while preserving fast search times. Results: To solve the index construction and update problem we propose the DREAM (Dynamic seaRchablE pArallel coMpressed index) framework and provide an implementation. The first main contribution is the introduction of approximate search distributor directories via a novel use of Bloom filters. We combine several Bloom filters to form an interleaved Bloom filter and use this new data structure to quickly exclude reads for parts of the databases where they cannot match. This allows us to keep the databases in several indices which can be easily rebuilt if parts are updated while maintaining a fast search time. The second main contribution is an implementation of DREAM-Yara, a distributed version of a fully sensitive read mapper under the DREAM [email protected]://gitlab.com/pirovc/dream_yara/
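
The interleaved Bloom filter idea can be sketched as follows: instead of storing one Bloom filter per database part (bin), the filters are interleaved so that each hash slot holds one bit per bin, and a single lookup answers "which bins might contain this k-mer?" for all bins at once. The class below is a minimal sketch under assumed names, sizes, and hash functions; it is not the DREAM-Yara implementation.

```python
import hashlib


class InterleavedBloomFilter:
    """Toy interleaved Bloom filter: rows[s] packs slot s of every bin's filter."""

    def __init__(self, num_bins: int, num_slots: int, num_hashes: int = 3):
        self.num_bins = num_bins
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        # Bit b of rows[s] is slot s of the Bloom filter belonging to bin b.
        self.rows = [0] * num_slots

    def _slots(self, kmer: str):
        # Derive num_hashes slot positions by seeding SHA-256 (an assumption).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_slots

    def add(self, kmer: str, bin_id: int) -> None:
        for slot in self._slots(kmer):
            self.rows[slot] |= 1 << bin_id

    def query(self, kmer: str) -> list[int]:
        """Return bins that may contain kmer (false positives are possible)."""
        hits = (1 << self.num_bins) - 1
        for slot in self._slots(kmer):
            hits &= self.rows[slot]  # AND the per-slot bin vectors together
        return [b for b in range(self.num_bins) if (hits >> b) & 1]


ibf = InterleavedBloomFilter(num_bins=4, num_slots=1024)
ibf.add("ACGT", bin_id=2)
print(2 in ibf.query("ACGT"))  # True: Bloom filters have no false negatives
```

A read whose k-mers are absent from a bin can then be skipped for that bin's index, which is the exclusion step the abstract describes.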


2016 ◽  
Vol 638 ◽  
pp. 159-170 ◽  
Author(s):  
Joong Chae Na ◽  
Hyunjoon Kim ◽  
Heejin Park ◽  
Thierry Lecroq ◽  
Martine Léonard ◽  
...  

Author(s):  
Hussein Al-Bahadili ◽  
Saif Al-Saab

In this paper, the authors present a description of a new Web search engine model, the compressed index-query (CIQ) Web search engine model. This model incorporates two bit-level compression layers implemented at the back-end processor (server) side. The first layer resides after the indexer and acts as a second compression layer to generate a double compressed index (index compressor). The second layer resides after the query parser and compresses the query (query compressor) to enable bit-level compressed index-query search. The data compression algorithm used in this model is the Hamming codes-based data compression (HCDC) algorithm, an asymmetric, lossless, bit-level algorithm that permits CIQ search. The different components of the new Web model are implemented in a prototype CIQ test tool (CIQTT), which is used as a test bench to validate the accuracy and integrity of the retrieved data and evaluate the performance of the proposed model. The test results demonstrate that the proposed CIQ model reduces disk space requirements and searching time by more than 24%, and attains 100% agreement when compared with an uncompressed model.


Author(s):  
Francisco Santoyo ◽  
Edgar Chávez ◽  
Eric S. Téllez

2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Maan Haj Rachid ◽  
Qutaibah Malluhi ◽  
Mohamed Abouelhoda

The all-pairs suffix-prefix matching problem is a basic problem in string processing. It has an application in the de novo genome assembly task, which is one of the major bioinformatics problems. Due to the large size of the input data, it is crucial to use fast and space-efficient solutions. In this paper, we present a space-economical solution to this problem using the generalized Sadakane compressed suffix tree. Furthermore, we present a parallel algorithm to provide more speed on shared-memory computers. Our sequential and parallel algorithms are optimized by exploiting features of the Sadakane compressed index data structure. Experimental results show that our solution based on Sadakane's compressed index consumes significantly less space than the ones based on noncompressed data structures like the suffix tree and the enhanced suffix array. Our experimental results also show that our parallel algorithm is efficient and scales well with an increasing number of processors.
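
For readers unfamiliar with the problem statement, the sketch below is a brute-force reference for all-pairs suffix-prefix matching: for every ordered pair of distinct strings, it finds the longest suffix of the first that is also a prefix of the second. It only illustrates the definition; the paper's solution uses a compressed suffix tree rather than this pairwise comparison, and the function name is an assumption.

```python
def all_pairs_suffix_prefix(strings: list[str]) -> list[list[int]]:
    """overlap[i][j] = length of the longest suffix of strings[i] that is a
    prefix of strings[j], for i != j (diagonal entries are left at 0)."""
    n = len(strings)
    overlap = [[0] * n for _ in range(n)]
    for i, left in enumerate(strings):
        for j, right in enumerate(strings):
            if i == j:
                continue
            # Try the longest candidate overlap first and stop at the first hit.
            for length in range(min(len(left), len(right)), 0, -1):
                if left.endswith(right[:length]):
                    overlap[i][j] = length
                    break
    return overlap


reads = ["ACGT", "GTAC", "TACG"]
print(all_pairs_suffix_prefix(reads))  # [[0, 2, 1], [2, 0, 3], [3, 1, 0]]
```

In an assembly setting, these overlap lengths would serve as edge weights between reads.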

