Compact Data Structures to Represent and Query Data Warehouses into Main Memory

Abstract: Background: Design of valid, high-quality primers is essential for qPCR experiments. MRPrimer is a powerful MapReduce-based pipeline that combines primer design for target sequences with homology tests on off-target sequences. It takes an entire sequence DB as input and returns all feasible and valid primer pairs existing in the DB. Because primers designed by MRPrimer are effective in qPCR analysis, it has been widely used to develop online design tools and to build primer databases. However, MRPrimer is too slow to keep up with the exponentially growing sizes of sequence DBs and thus must be improved. Results: We develop a fast GPU-based pipeline for primer design (GPrimer) that takes the same input and returns the same output as MRPrimer. MRPrimer consists of seven MapReduce steps, two of which are very time-consuming. GPrimer significantly improves the speed of those two steps by exploiting the computational power of GPUs. In particular, it designs data structures for coalesced memory access on the GPU and for workload balancing among GPU threads, and it copies those data structures between main memory and GPU memory in a streaming fashion. For the human RefSeq DB, GPrimer running on a single machine with 4 GPUs achieves a speedup of 57 times over all steps and a speedup of 557 times for the most time-consuming step, compared with MRPrimer running on a cluster of six machines. Conclusions: We propose a GPU-based pipeline for primer design that takes an entire sequence DB as input and returns all feasible and valid primer pairs existing in the DB at once, without an additional step using BLAST-like tools. The software is available at https://github.com/qhtjrmin/GPrimer.git.


Efficient Simulation of Large-Scale P2P Networks: Compact Data Structures

15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing (PDP'07) ◽

10.1109/pdp.2007.41 ◽

2007 ◽

Cited By ~ 3

Author(s):

Andreas Binzenhofer ◽

Tobias Hossfeld ◽

Gerald Kunzmann ◽

Kolja Eger

Keyword(s):

Data Structures ◽

Large Scale ◽

P2p Networks ◽

Efficient Simulation ◽

Compact Data Structures


When Edge Computing Meets Compact Data Structures

10.1109/ieeecloudsummit52029.2021.00013 ◽

2021 ◽

Author(s):

Zheng Li ◽

Diego Seco ◽

Jose Fuentes-Sepulveda

Keyword(s):

Data Structures ◽

Edge Computing ◽

Compact Data Structures


Compact Data Structures for Location-Based Forwarding in NDN Networks

2018 IEEE International Conference on Communications Workshops (ICC Workshops) ◽

10.1109/iccw.2018.8403578 ◽

2018 ◽

Cited By ~ 5

Author(s):

Yoshiki Kurihara ◽

Yuki Koizumi ◽

Toru Hasegawa

Keyword(s):

Data Structures ◽

Compact Data Structures


Preface – Compact Data Structures

Journal of Discrete Algorithms ◽

10.1016/j.jda.2017.04.002 ◽

2017 ◽

Vol 43 ◽

pp. 1 ◽

Cited By ~ 1

Author(s):

Travis Gagie

Keyword(s):

Data Structures ◽

Compact Data Structures


E-ETL: Framework for Managing Evolving ETL Workflows

Foundations of Computing and Decision Sciences ◽

10.2478/fcds-2013-0005 ◽

2013 ◽

Vol 38 (2) ◽

pp. 131-142 ◽

Cited By ~ 1

Author(s):

Artur Wojciechowski

Keyword(s):

Open Source ◽

Data Structures ◽

Structural Changes ◽

Data Sources ◽

Data Warehouses ◽

High Importance ◽

External Data

Abstract: Data warehouses integrate external data sources (EDSs), which very often change their data structures (schemas). In many cases, such changes cause an already deployed ETL workflow to execute erroneously. Because structural changes of EDSs are frequent, automatic repair of an ETL workflow after such changes is of high importance. This paper presents a framework, called E-ETL, for handling the evolution of an ETL layer. When changes in EDSs are detected, the fragment of the ETL workflow that interacts with the changed EDSs is repaired. The proposed framework was developed as a module external to a standard commercial or open-source ETL engine, accessing the engine through its API. The innovation of this framework consists in: (1) the algorithms for semi-automatic repair of an ETL workflow and (2) its ability to interact with various ETL engines that provide an API.


Storing Set Families More Compactly with Top ZDDs

Algorithms ◽

10.3390/a14060172 ◽

2021 ◽

Vol 14 (6) ◽

pp. 172

Author(s):

Kotaro Matsuda ◽

Shuhei Denzumi ◽

Kunihiko Sadakane

Keyword(s):

Data Structures ◽

Binary Decision Diagrams ◽

Real Data ◽

Directed Acyclic Graphs ◽

Main Memory ◽

Large Set ◽

Decision Diagrams ◽

Binary Decision ◽

Acyclic Graphs

Zero-suppressed Binary Decision Diagrams (ZDDs) are data structures for representing set families in a compressed form. With ZDDs, many valuable operations on set families can be done in time polynomial in the ZDD size. In some cases, however, the ZDDs representing large set families become too large to store in main memory. This paper proposes the top ZDD, a novel representation of ZDDs that uses less space than existing ones. The top ZDD extends the top tree, which compresses trees, to compress directed acyclic graphs by sharing identical subgraphs. We prove that navigational operations on ZDDs can be done in time poly-logarithmic in the ZDD size, and show that there exist set families for which the size of the top ZDD is exponentially smaller than that of the ZDD. We also show experimentally that our top ZDDs have smaller sizes than ZDDs for real data.
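To make the notion of "representing set families in a compressed form" concrete, the following is a minimal Python sketch of a plain ZDD. It does not implement the top ZDD from the paper. The class name `ZDD` and its methods are hypothetical names chosen for illustration, and the builder assumes every set in the family is a subset of the given variable list. It shows the two standard ZDD reduction rules: a node whose 1-edge leads to the empty family is skipped (zero suppression), and identical subgraphs are stored once.

```python
from itertools import combinations


class ZDD:
    """Illustrative ZDD manager (hypothetical API, not the paper's top ZDD).

    Node ids are ints: 0 is the empty family, 1 is the family {emptyset}.
    """

    def __init__(self):
        self.nodes = [None, None]  # slots 0 and 1 are the two terminals
        self.unique = {}           # (var, lo, hi) -> node id, shares equal subgraphs

    def node(self, var, lo, hi):
        if hi == 0:                # zero-suppression: drop nodes whose 1-edge is empty
            return lo
        key = (var, lo, hi)
        if key not in self.unique:
            self.unique[key] = len(self.nodes)
            self.nodes.append(key)
        return self.unique[key]

    def from_sets(self, sets, variables):
        """Build a ZDD for `sets`, assuming each set is a subset of `variables`."""
        def build(family, i):
            if not family:
                return 0
            if i == len(variables):
                return 1           # only the empty set can remain here
            v = variables[i]
            lo = frozenset(s for s in family if v not in s)
            hi = frozenset(s - {v} for s in family if v in s)
            return self.node(v, build(lo, i + 1), build(hi, i + 1))
        return build(frozenset(frozenset(s) for s in sets), 0)

    def count(self, root):
        """Number of sets in the family rooted at `root`."""
        if root < 2:
            return root
        _, lo, hi = self.nodes[root]
        return self.count(lo) + self.count(hi)


# Usage: the power set of 4 elements has 16 sets but needs only 4 internal nodes.
variables = [0, 1, 2, 3]
power_set = [set(c) for r in range(5) for c in combinations(variables, r)]
zdd = ZDD()
root = zdd.from_sets(power_set, variables)
print(zdd.count(root), len(zdd.unique))
```

The power-set example illustrates why sharing matters: both children of every node are the same subgraph, so the diagram grows linearly while the family grows exponentially. The builder itself is deliberately naive and enumerates the family explicitly, so it is only suitable for small inputs.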


Review of Compact Data Structures - a practical approach by Gonzalo Navarro

ACM SIGACT News ◽

10.1145/3289137.3289140 ◽

2018 ◽

Vol 49 (3) ◽

pp. 9-13

Author(s):

László Kozma

Keyword(s):

Data Structures ◽

Practical Approach ◽

Compact Data Structures


CoroBase

Proceedings of the VLDB Endowment ◽

10.14778/3430915.3430932 ◽

2020 ◽

Vol 14 (3) ◽

pp. 431-444

Author(s):

Yongjun He ◽

Jiacheng Lu ◽

Tianzheng Wang

Keyword(s):

Data Structures ◽

Main Memory ◽

Data Prefetching ◽

Backward Compatibility ◽

Transaction Models ◽

Main Memory Database ◽

Hide Data ◽

Rich Data ◽

Software Prefetching ◽

Database Engine

Data stalls are a major overhead in main-memory database engines due to the use of pointer-rich data structures. Lightweight coroutines ease the implementation of software prefetching to hide data stalls by overlapping computation and asynchronous data prefetching. Prior solutions, however, mainly focused on (1) individual components and operations and (2) intra-transaction batching that requires interface changes, breaking backward compatibility. It was not clear how they apply to a full database engine and how much end-to-end benefit they bring under various workloads. This paper presents CoroBase, a main-memory database engine that tackles these challenges with a new coroutine-to-transaction paradigm. Coroutine-to-transaction models transactions as coroutines and thus enables inter-transaction batching, avoiding application changes but retaining the benefits of prefetching. We show that on a 48-core server, CoroBase can perform close to 2x better for read-intensive workloads and remain competitive for workloads that inherently do not benefit from software prefetching.
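The idea of modeling transactions as coroutines so that a scheduler can interleave them can be sketched with Python generators. This is an illustration only, not CoroBase's implementation: Python has no software prefetch instruction, so each `yield` merely marks the point where a real engine would issue a prefetch and suspend. The names `transaction`, `run_batch`, `store`, and `log` are hypothetical.

```python
from collections import deque


def transaction(txn_id, keys, store, log):
    """A read-only transaction written as a coroutine (generator)."""
    total = 0
    for k in keys:
        log.append((txn_id, "prefetch", k))
        yield                      # suspend here; a real engine would prefetch store[k]
        total += store[k]          # resume: the data is assumed to be in cache now
        log.append((txn_id, "read", k))
    return total


def run_batch(txns):
    """Round-robin over a batch of transactions until all have finished."""
    results = {}
    ready = deque(txns.items())
    while ready:
        txn_id, coro = ready.popleft()
        try:
            next(coro)             # run until the next suspension point
            ready.append((txn_id, coro))
        except StopIteration as done:
            results[txn_id] = done.value
    return results


# Usage: two transactions are interleaved without changing how each one is written.
store = {"a": 1, "b": 2, "c": 3, "d": 4}
log = []
results = run_batch({
    1: transaction(1, ["a", "b"], store, log),
    2: transaction(2, ["c", "d"], store, log),
})
print(results)
print(log[:3])
```

Each transaction body reads like ordinary sequential code, and the batching happens in the scheduler across transactions. That is the property the abstract describes as avoiding application changes while retaining the benefit of overlapping computation with data fetches.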
