SetSketch

2021 ◽  
Vol 14 (11) ◽  
pp. 2244-2257
Author(s):  
Otmar Ertl

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting distinct elements with very little space, MinHash is suited to the fast comparison of sets, as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases.
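For readers unfamiliar with the mechanics, here is a minimal sketch of plain MinHash (not the paper's SetSketch; the hash construction and signature size are illustrative choices, not anything prescribed by the source), showing how matching signature minima estimate Jaccard similarity:

```python
# Minimal MinHash sketch: one seeded hash function per signature position,
# keep the minimum hash per set; the fraction of matching minima between two
# signatures is an unbiased estimate of the Jaccard similarity.
import hashlib

def minhash_signature(items, num_perm=128):
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(),
                                person=seed.to_bytes(16, "little")).digest()[:8],
                "big")
            for item in items))
    return sig

def jaccard_estimate(sig_a, sig_b):
    # P(min hashes agree) = |A ∩ B| / |A ∪ B|
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"apple", "banana", "cherry", "date"})
b = minhash_signature({"banana", "cherry", "date", "elderberry"})
print(jaccard_estimate(a, b))  # close to the true Jaccard similarity of 3/5
```

Per the abstract, SetSketch spans both use cases with a single mergeable state: cardinality estimation as with HyperLogLog, and joint-quantity estimation as with MinHash.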

2019 ◽  
Vol 13 (2) ◽  
pp. 227-236
Author(s):  
Tetsuo Shibuya

A data structure is called succinct if its asymptotic space requirement matches the original data size. The development of succinct data structures is an important factor in dealing with explosively increasing big data. Moreover, ever-wider varieties of big data have recently been produced in various fields, and there is a substantial need for the development of more application-specific succinct data structures. In this study, we review recently proposed application-oriented succinct data structures motivated by big data applications in three different fields: privacy-preserving computation in cryptography, genome assembly in bioinformatics, and work space reduction for compressed communications.
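To give a concrete flavour, here is a toy rank bit vector, a standard succinct building block rather than one of the application-specific structures surveyed; the block counts conceptually add only o(n) bits of auxiliary index on top of the n-bit input (Python's object representation is of course not bit-packed):

```python
# Toy succinct building block: rank queries on a bit vector with precomputed
# popcounts at block boundaries, so each query scans at most one block.
class RankBitVector:
    BLOCK = 64  # bits per block; real designs use two-level superblock/block schemes

    def __init__(self, bits):
        self.bits = bits
        # prefix popcounts at block boundaries: the auxiliary index
        self.block_rank = [0]
        for i in range(0, len(bits), self.BLOCK):
            self.block_rank.append(self.block_rank[-1] + sum(bits[i:i + self.BLOCK]))

    def rank1(self, i):
        """Number of 1-bits in bits[0:i]."""
        b = i // self.BLOCK
        return self.block_rank[b] + sum(self.bits[b * self.BLOCK:i])

bv = RankBitVector([1, 0, 1, 1, 0, 1] * 50)
print(bv.rank1(10))  # 1-bits among the first 10 positions -> 7
```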


2021 ◽  
Author(s):  
Danila Piatov ◽  
Sven Helmer ◽  
Anton Dignös ◽  
Fabio Persia

We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen’s relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
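The underlying plane-sweep idea can be pictured with a short sketch. This is a generic overlap join only; the paper's Timeline Index, gapless hash map, and support for the full range of Allen's relationships are not reproduced here:

```python
# Plane-sweep overlap join: sort all endpoints; when an interval starts, it
# overlaps exactly the currently active intervals of the other relation.
# Intervals are half-open (start, end) tuples, so ends sort before starts at ties.
def overlap_join(R, S):
    events = []
    for side, rel in (("R", R), ("S", S)):
        for iv in rel:
            events.append((iv[0], 1, side, iv))  # start event
            events.append((iv[1], 0, side, iv))  # end event
    events.sort()
    active = {"R": set(), "S": set()}
    out = []
    for _, is_start, side, iv in events:
        if is_start:
            other = "S" if side == "R" else "R"
            for jv in active[other]:
                out.append((iv, jv) if side == "R" else (jv, iv))
            active[side].add(iv)
        else:
            active[side].discard(iv)
    return out

R = [(1, 5), (4, 9)]
S = [(3, 6), (8, 10)]
print(overlap_join(R, S))  # [((1,5),(3,6)), ((4,9),(3,6)), ((4,9),(8,10))]
```

Each pair is reported exactly once, at the start event of whichever interval begins later.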


Sensors ◽  
2021 ◽  
Vol 21 (24) ◽  
pp. 8241
Author(s):  
Mitko Aleksandrov ◽  
Sisi Zlatanova ◽  
David J. Heslop

Voxel-based data structures, algorithms, frameworks, and interfaces have been used in computer graphics and many other applications for decades. There is a general need for adequate digital representations, such as voxels, that secure unified data structures, multi-resolution options, robust validation procedures, and flexible algorithms for different 3D tasks. In this review, we evaluate the most common properties and algorithms for the voxelisation of 2D and 3D objects, presenting many voxelisation algorithms and their characteristics for points, lines, triangles, surfaces, and solids as geometric primitives. For lines, we identify three groups of algorithms, of which the first two achieve different voxelisation connectivity while the third voxelises curves. Surface voxelisation is generally preferable to solid voxelisation, as it can be achieved faster and requires less memory if voxels are stored sparsely. We also evaluate the available voxel data structures, splitting them into static and dynamic grids according to how frequently the data structure is updated. Static grids are dominated by SVO-based data structures focusing on memory footprint reduction and attribute preservation, with SVDAG and SSVDAG the most advanced methods. The state-of-the-art dynamic voxel data structure is NanoVDB, which is superior to the rest in terms of speed as well as support for out-of-core processing and data management, the key to handling large, dynamically changing scenes. To our knowledge, this is the first review evaluating the available voxelisation algorithms for different geometric primitives as well as voxel data structures.
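As a flavour of the primitive operation, here is a deliberately naive line voxeliser by dense sampling; it is none of the connectivity-guaranteeing algorithms the review classifies, and the step size and unit-grid layout are illustrative assumptions:

```python
# Naive segment voxelisation: sample the segment at half-voxel steps and
# collect the integer grid cells hit; crude, but illustrates the primitive.
import math

def voxelise_segment(p0, p1, h=1.0):
    """Return the set of integer voxel coordinates touched by sampling p0->p1."""
    length = math.dist(p0, p1)
    steps = max(1, int(length / (h * 0.5)))  # half-voxel steps to limit gaps
    voxels = set()
    for k in range(steps + 1):
        t = k / steps
        voxels.add(tuple(int(math.floor((a + t * (b - a)) / h))
                         for a, b in zip(p0, p1)))
    return voxels

print(sorted(voxelise_segment((0.2, 0.1, 0.0), (3.7, 2.9, 1.5))))
```

The algorithms surveyed in the review replace this sampling with exact traversals that guarantee a chosen connectivity (e.g. 6- or 26-connected chains).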


Algorithms ◽  
2019 ◽  
Vol 12 (8) ◽  
pp. 166
Author(s):  
Md. Anisuzzaman Siddique ◽  
Hao Tian ◽  
Mahboob Qaosar ◽  
Yasuhiko Morimoto

The skyline query and its variants are useful functions in the early stages of a knowledge-discovery process. These queries select a set of important objects that are better than the other objects in the dataset. In order to handle big data, such knowledge-discovery queries must be computed in parallel distributed environments. In this paper, we consider an efficient parallel algorithm for the “K-skyband query” and the “top-k dominating query”, two popular variants of the skyline query. We propose a method for computing both queries simultaneously in MapReduce, a popular parallel distributed framework for processing “big data” problems. Our extensive evaluation results validate the effectiveness and efficiency of the proposed algorithm on both real and synthetic datasets.
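The dominance relation at the core of both queries is easy to state. The following single-machine sketch computes a K-skyband by brute force, with smaller values taken as better; the paper's contribution is the parallel MapReduce formulation, which this does not show:

```python
# Dominance test and brute-force K-skyband (k = 1 is the classic skyline).
def dominates(p, q):
    """p dominates q: p is no worse in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def k_skyband(points, k):
    """Points dominated by fewer than k other points."""
    return [p for p in points
            if sum(dominates(q, p) for q in points) < k]

pts = [(1, 9), (3, 3), (4, 4), (2, 6), (9, 1)]
print(k_skyband(pts, 1))  # the skyline
print(k_skyband(pts, 2))  # the 2-skyband additionally admits (4, 4)
```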


2020 ◽  
Vol 10 (6) ◽  
pp. 1915
Author(s):  
Tianqi Zheng ◽  
Zhibin Zhang ◽  
Xueqi Cheng

Hash tables are the fundamental data structure for analytical database workloads such as aggregation, joining, set filtering, and record deduplication. Hash table performance differs drastically depending on what kind of data is being processed and on the mix of inserts, lookups, and deletes. In this paper, we address some common use cases of hash tables: aggregating and joining over arbitrary string data. We designed a new hash table, SAHA, which is tightly integrated with modern analytical databases and optimized for string data with the following advantages: (1) it inlines short strings and saves hash values for long strings only; (2) it uses special memory loading techniques for quick dispatching and hash computation; and (3) it utilizes vectorized processing to batch hashing operations. Our evaluation results reveal that SAHA outperforms state-of-the-art hash tables, including Google’s SwissTable and Facebook’s F14Table, by one to five times in analytical workloads. It has been merged into the ClickHouse database and shows promising results in production.
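The first design point, inlining short strings and caching hash values only for long ones, can be sketched as follows. This is a loose Python illustration of the idea as the abstract describes it; the length threshold, probing scheme, and slot layout are assumptions, not SAHA's actual design:

```python
# Sketch: short keys are compared directly in the slot; long keys keep a
# cached hash so a probe can usually reject without touching the string bytes.
INLINE_MAX = 8  # illustrative length threshold for "short"

class Slot:
    __slots__ = ("key", "saved_hash", "value")
    def __init__(self, key, value):
        self.key = key
        self.saved_hash = hash(key) if len(key) > INLINE_MAX else None
        self.value = value

    def matches(self, key, h):
        if self.saved_hash is not None:      # long key: cheap hash check first
            return self.saved_hash == h and self.key == key
        return self.key == key               # short key: direct inline compare

class StringTable:
    def __init__(self, capacity=64):         # fixed size; real tables resize
        self.slots = [None] * capacity

    def insert(self, key, value):
        h = hash(key)
        i = h % len(self.slots)
        while self.slots[i] is not None and not self.slots[i].matches(key, h):
            i = (i + 1) % len(self.slots)     # linear probing on collision
        self.slots[i] = Slot(key, value)

    def get(self, key):
        h = hash(key)
        i = h % len(self.slots)
        while self.slots[i] is not None:
            if self.slots[i].matches(key, h):
                return self.slots[i].value
            i = (i + 1) % len(self.slots)
        return None

t = StringTable()
t.insert("ad", 1)
t.insert("a-much-longer-identifier", 2)
print(t.get("ad"), t.get("a-much-longer-identifier"))  # 1 2
```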


2019 ◽  
Vol 3 (1) ◽  
pp. 19
Author(s):  
Michael Kaufmann

Many big data projects are technology-driven and thus expensive and inefficient. It is often unclear how to exploit existing data resources and map data, systems, and analytics results to actual use cases. Existing big data reference models are mostly either technological or business-oriented in nature, but do not consistently align both aspects. To address this issue, a reference model for big data management is proposed that operationalizes value creation from big data by linking business targets with technical implementation. The purpose of this model is to provide a goal- and value-oriented framework to effectively map and plan purposeful big data systems aligned with a clear value proposition. Based on an epistemic model that conceptualizes big data management as a cognitive system, the solution space of data value creation is divided into five layers: preparation, analysis, interaction, effectuation, and intelligence. To operationalize the model, each of these layers is subdivided into corresponding business and IT aspects to create a link from use cases to technological implementation. The resulting reference model, the big data management canvas, can be applied to classify and extend existing big data applications and to derive and plan new big data solutions, visions, and strategies for future projects. To validate the model in the context of existing information systems, the paper describes three cases of big data management in existing companies.
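The layer structure described above can be written down directly. In this minimal sketch, the layer names and the business/IT split come from the abstract, while the field contents are hypothetical placeholders for an imagined use case:

```python
# The five canvas layers from the abstract, each split into a business aspect
# and an IT aspect; the example entries are invented for illustration.
from dataclasses import dataclass

@dataclass
class CanvasLayer:
    business: str  # use-case / value side of the layer
    it: str        # technical implementation side of the layer

canvas = {
    "preparation":  CanvasLayer("collect customer touchpoints", "ingest pipeline, data lake"),
    "analysis":     CanvasLayer("churn-risk scoring",           "model training jobs"),
    "interaction":  CanvasLayer("alerts for account managers",  "dashboard / API"),
    "effectuation": CanvasLayer("retention campaigns",          "CRM integration"),
    "intelligence": CanvasLayer("learn which offers work",      "feedback-loop storage"),
}
```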


Author(s):  
Ranjit Biswas

The homogeneous data structure ‘train' and the heterogeneous data structure ‘atrain' are fundamental, powerful, dynamic, and flexible data structures, and the first introduced exclusively for big data. Given the explosive momentum of big data, ‘Data Structures for Big Data' should therefore be regarded as a new subject in big data science, not merely a new topic. Building on the notions of train and atrain, the author introduces further data structures for programmers working with big data: the homogeneous stacks ‘train stack' and ‘rT-coach stack', the heterogeneous stacks ‘atrain stack' and ‘rA-coach stack', the homogeneous queues ‘train queue' and ‘rT-coach queue', the heterogeneous queues ‘atrain queue' and ‘rA-coach queue', the homogeneous binary trees ‘train binary tree' and ‘rT-coach binary tree', the heterogeneous binary trees ‘atrain binary tree' and ‘rA-coach binary tree', the homogeneous trees ‘train tree' and ‘rT-coach tree', and the heterogeneous trees ‘atrain tree' and ‘rA-coach tree', enriching the subject ‘Data Structures for Big Data' for big data science.
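The abstract does not define the train formally; one loose outside reading pictures a homogeneous train as a linked list of fixed-capacity ‘coaches', akin to an unrolled linked list. The sketch below follows that reading and is hypothetical, not the author's specification:

```python
# Hypothetical reading of a homogeneous 'train': a linked list of
# fixed-capacity coaches, appended to at the tail coach.
class Coach:
    def __init__(self, capacity=4):
        self.data = []
        self.capacity = capacity
        self.next = None

class Train:
    def __init__(self):
        self.head = self.tail = Coach()

    def append(self, item):
        if len(self.tail.data) == self.tail.capacity:  # coach full: hook on a new one
            self.tail.next = Coach()
            self.tail = self.tail.next
        self.tail.data.append(item)

    def __iter__(self):
        c = self.head
        while c:
            yield from c.data
            c = c.next

t = Train()
for x in range(10):
    t.append(x)
print(list(t))  # [0, 1, ..., 9]
```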


2015 ◽  
pp. 423-434
Author(s):  
Sebastian Dippl ◽  
Michael C. Jaeger ◽  
Achim Luhn ◽  
Alexandra Shulman-Peleg ◽  
Gil Vernik

While it is common to use storage in a cloud-based manner, the question of true interoperability is rarely fully addressed. This question becomes even more relevant because the steadily growing amount of data that needs to be stored will soon exceed the capacity of any single system in terms of resources, availability, and network throughput. The logical conclusion is that a network of systems must be created that can cope with the requirements of big data applications and data-deluge scenarios. This chapter shows how federation and interoperability fit into a cloud storage scenario. The authors examine the challenges that federation imposes on autonomous, heterogeneous, and distributed cloud systems, and present approaches that help deal with the special requirements introduced by the VISION Cloud use cases from the healthcare, media, telecommunications, and enterprise domains. Finally, the authors give an overview of how VISION Cloud addresses these requirements in its research scenarios and architecture.

