SetSketch

2021 ◽  
Vol 14 (11) ◽  
pp. 2244-2257
Author(s):  
Otmar Ertl

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting distinct elements with very little space, MinHash is suited to the fast comparison of sets, as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases.
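For readers unfamiliar with the mechanics, here is a minimal sketch of plain MinHash (not the paper's SetSketch; the hash construction and signature size are illustrative choices, not anything prescribed by the source), showing how matching signature minima estimate Jaccard similarity:

```python
# Minimal MinHash sketch: one seeded hash function per signature position,
# keep the minimum hash per set; the fraction of matching minima between two
# signatures is an unbiased estimate of the Jaccard similarity.
import hashlib

def minhash_signature(items, num_perm=128):
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(),
                                person=seed.to_bytes(16, "little")).digest()[:8],
                "big")
            for item in items))
    return sig

def jaccard_estimate(sig_a, sig_b):
    # P(min hashes agree) = |A ∩ B| / |A ∪ B|
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"apple", "banana", "cherry", "date"})
b = minhash_signature({"banana", "cherry", "date", "elderberry"})
print(jaccard_estimate(a, b))  # close to the true Jaccard similarity of 3/5
```

Per the abstract, SetSketch spans both use cases with a single mergeable state: cardinality estimation as with HyperLogLog, and joint-quantity estimation as with MinHash.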

2019 ◽  
Vol 13 (2) ◽  
pp. 227-236
Author(s):  
Tetsuo Shibuya

A data structure is called succinct if its asymptotic space requirement matches the original data size. The development of succinct data structures is an important factor in dealing with explosively increasing big data. Moreover, ever-wider varieties of big data have recently been produced in various fields, and there is a substantial need for the development of more application-specific succinct data structures. In this study, we review recently proposed application-oriented succinct data structures motivated by big data applications in three different fields: privacy-preserving computation in cryptography, genome assembly in bioinformatics, and work space reduction for compressed communications.
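To give a concrete flavour, here is a toy rank bit vector, a standard succinct building block rather than one of the application-specific structures surveyed; the block counts conceptually add only o(n) bits of auxiliary index on top of the n-bit input (Python's object representation is of course not bit-packed):

```python
# Toy succinct building block: rank queries on a bit vector with precomputed
# popcounts at block boundaries, so each query scans at most one block.
class RankBitVector:
    BLOCK = 64  # bits per block; real designs use two-level superblock/block schemes

    def __init__(self, bits):
        self.bits = bits
        # prefix popcounts at block boundaries: the auxiliary index
        self.block_rank = [0]
        for i in range(0, len(bits), self.BLOCK):
            self.block_rank.append(self.block_rank[-1] + sum(bits[i:i + self.BLOCK]))

    def rank1(self, i):
        """Number of 1-bits in bits[0:i]."""
        b = i // self.BLOCK
        return self.block_rank[b] + sum(self.bits[b * self.BLOCK:i])

bv = RankBitVector([1, 0, 1, 1, 0, 1] * 50)
print(bv.rank1(10))  # 1-bits among the first 10 positions -> 7
```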


2021 ◽  
Author(s):  
Danila Piatov ◽  
Sven Helmer ◽  
Anton Dignös ◽  
Fabio Persia

We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen’s relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
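The underlying plane-sweep idea can be pictured with a short sketch. This is a generic overlap join only; the paper's Timeline Index, gapless hash map, and support for the full range of Allen's relationships are not reproduced here:

```python
# Plane-sweep overlap join: sort all endpoints; when an interval starts, it
# overlaps exactly the currently active intervals of the other relation.
# Intervals are half-open (start, end) tuples, so ends sort before starts at ties.
def overlap_join(R, S):
    events = []
    for side, rel in (("R", R), ("S", S)):
        for iv in rel:
            events.append((iv[0], 1, side, iv))  # start event
            events.append((iv[1], 0, side, iv))  # end event
    events.sort()
    active = {"R": set(), "S": set()}
    out = []
    for _, is_start, side, iv in events:
        if is_start:
            other = "S" if side == "R" else "R"
            for jv in active[other]:
                out.append((iv, jv) if side == "R" else (jv, iv))
            active[side].add(iv)
        else:
            active[side].discard(iv)
    return out

R = [(1, 5), (4, 9)]
S = [(3, 6), (8, 10)]
print(overlap_join(R, S))  # [((1,5),(3,6)), ((4,9),(3,6)), ((4,9),(8,10))]
```

Each pair is reported exactly once, at the start event of whichever interval begins later.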


Sensors ◽  
2021 ◽  
Vol 21 (24) ◽  
pp. 8241
Author(s):  
Mitko Aleksandrov ◽  
Sisi Zlatanova ◽  
David J. Heslop

Voxel-based data structures, algorithms, frameworks, and interfaces have been used in computer graphics and many other applications for decades. There is a general need for adequate digital representations, such as voxels, that secure unified data structures, multi-resolution options, robust validation procedures, and flexible algorithms for different 3D tasks. In this review, we evaluate the most common properties and algorithms for the voxelisation of 2D and 3D objects, presenting many voxelisation algorithms and their characteristics for points, lines, triangles, surfaces, and solids as geometric primitives. For lines, we identify three groups of algorithms, of which the first two achieve different voxelisation connectivity while the third voxelises curves. Surface voxelisation is generally preferable to solid voxelisation, as it can be achieved faster and requires less memory if voxels are stored sparsely. We also evaluate the available voxel data structures, splitting them into static and dynamic grids according to how frequently the data structure is updated. Static grids are dominated by SVO-based data structures focusing on memory footprint reduction and attribute preservation, with SVDAG and SSVDAG the most advanced methods. The state-of-the-art dynamic voxel data structure is NanoVDB, which is superior to the rest in terms of speed as well as support for out-of-core processing and data management, the key to handling large, dynamically changing scenes. To our knowledge, this is the first review evaluating the available voxelisation algorithms for different geometric primitives as well as voxel data structures.
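As a flavour of the primitive operation, here is a deliberately naive line voxeliser by dense sampling; it is none of the connectivity-guaranteeing algorithms the review classifies, and the step size and unit-grid layout are illustrative assumptions:

```python
# Naive segment voxelisation: sample the segment at half-voxel steps and
# collect the integer grid cells hit; crude, but illustrates the primitive.
import math

def voxelise_segment(p0, p1, h=1.0):
    """Return the set of integer voxel coordinates touched by sampling p0->p1."""
    length = math.dist(p0, p1)
    steps = max(1, int(length / (h * 0.5)))  # half-voxel steps to limit gaps
    voxels = set()
    for k in range(steps + 1):
        t = k / steps
        voxels.add(tuple(int(math.floor((a + t * (b - a)) / h))
                         for a, b in zip(p0, p1)))
    return voxels

print(sorted(voxelise_segment((0.2, 0.1, 0.0), (3.7, 2.9, 1.5))))
```

The algorithms surveyed in the review replace this sampling with exact traversals that guarantee a chosen connectivity (e.g. 6- or 26-connected chains).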


Algorithms ◽  
2019 ◽  
Vol 12 (8) ◽  
pp. 166
Author(s):  
Md. Anisuzzaman Siddique ◽  
Hao Tian ◽  
Mahboob Qaosar ◽  
Yasuhiko Morimoto

The skyline query and its variants are useful functions in the early stages of a knowledge-discovery process. These queries select a set of important objects that are better than the other objects in the dataset. In order to handle big data, such knowledge-discovery queries must be computed in parallel distributed environments. In this paper, we consider an efficient parallel algorithm for the “K-skyband query” and the “top-k dominating query”, two popular variants of the skyline query. We propose a method for computing both queries simultaneously in MapReduce, a popular parallel distributed framework for processing “big data” problems. Our extensive evaluation results validate the effectiveness and efficiency of the proposed algorithm on both real and synthetic datasets.
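The dominance relation at the core of both queries is easy to state. The following single-machine sketch computes a K-skyband by brute force, with smaller values taken as better; the paper's contribution is the parallel MapReduce formulation, which this does not show:

```python
# Dominance test and brute-force K-skyband (k = 1 is the classic skyline).
def dominates(p, q):
    """p dominates q: p is no worse in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def k_skyband(points, k):
    """Points dominated by fewer than k other points."""
    return [p for p in points
            if sum(dominates(q, p) for q in points) < k]

pts = [(1, 9), (3, 3), (4, 4), (2, 6), (9, 1)]
print(k_skyband(pts, 1))  # the skyline
print(k_skyband(pts, 2))  # the 2-skyband additionally admits (4, 4)
```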


2020 ◽  
Vol 10 (6) ◽  
pp. 1915
Author(s):  
Tianqi Zheng ◽  
Zhibin Zhang ◽  
Xueqi Cheng

Hash tables are the fundamental data structure for analytical database workloads such as aggregation, joining, set filtering, and record deduplication. Hash table performance differs drastically depending on what kind of data is being processed and on the mix of inserts, lookups, and deletes. In this paper, we address some common use cases of hash tables: aggregating and joining over arbitrary string data. We designed a new hash table, SAHA, which is tightly integrated with modern analytical databases and optimized for string data with the following advantages: (1) it inlines short strings and saves hash values for long strings only; (2) it uses special memory loading techniques for quick dispatching and hash computation; and (3) it utilizes vectorized processing to batch hashing operations. Our evaluation results reveal that SAHA outperforms state-of-the-art hash tables, including Google’s SwissTable and Facebook’s F14Table, by one to five times in analytical workloads. It has been merged into the ClickHouse database and shows promising results in production.
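The first design point, inlining short strings and caching hash values only for long ones, can be sketched as follows. This is a loose Python illustration of the idea as the abstract describes it; the length threshold, probing scheme, and slot layout are assumptions, not SAHA's actual design:

```python
# Sketch: short keys are compared directly in the slot; long keys keep a
# cached hash so a probe can usually reject without touching the string bytes.
INLINE_MAX = 8  # illustrative length threshold for "short"

class Slot:
    __slots__ = ("key", "saved_hash", "value")
    def __init__(self, key, value):
        self.key = key
        self.saved_hash = hash(key) if len(key) > INLINE_MAX else None
        self.value = value

    def matches(self, key, h):
        if self.saved_hash is not None:      # long key: cheap hash check first
            return self.saved_hash == h and self.key == key
        return self.key == key               # short key: direct inline compare

class StringTable:
    def __init__(self, capacity=64):         # fixed size; real tables resize
        self.slots = [None] * capacity

    def insert(self, key, value):
        h = hash(key)
        i = h % len(self.slots)
        while self.slots[i] is not None and not self.slots[i].matches(key, h):
            i = (i + 1) % len(self.slots)     # linear probing on collision
        self.slots[i] = Slot(key, value)

    def get(self, key):
        h = hash(key)
        i = h % len(self.slots)
        while self.slots[i] is not None:
            if self.slots[i].matches(key, h):
                return self.slots[i].value
            i = (i + 1) % len(self.slots)
        return None

t = StringTable()
t.insert("ad", 1)
t.insert("a-much-longer-identifier", 2)
print(t.get("ad"), t.get("a-much-longer-identifier"))  # 1 2
```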


2019 ◽  
Vol 3 (1) ◽  
pp. 19
Author(s):  
Michael Kaufmann

Many big data projects are technology-driven and thus expensive and inefficient. It is often unclear how to exploit existing data resources and map data, systems, and analytics results to actual use cases. Existing big data reference models are mostly either technological or business-oriented in nature, but do not consistently align both aspects. To address this issue, a reference model for big data management is proposed that operationalizes value creation from big data by linking business targets with technical implementation. The purpose of this model is to provide a goal- and value-oriented framework to effectively map and plan purposeful big data systems aligned with a clear value proposition. Based on an epistemic model that conceptualizes big data management as a cognitive system, the solution space of data value creation is divided into five layers: preparation, analysis, interaction, effectuation, and intelligence. To operationalize the model, each of these layers is subdivided into corresponding business and IT aspects to create a link from use cases to technological implementation. The resulting reference model, the big data management canvas, can be applied to classify and extend existing big data applications and to derive and plan new big data solutions, visions, and strategies for future projects. To validate the model in the context of existing information systems, the paper describes three cases of big data management in existing companies.
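The layer structure described above can be written down directly. In this minimal sketch, the layer names and the business/IT split come from the abstract, while the field contents are hypothetical placeholders for an imagined use case:

```python
# The five canvas layers from the abstract, each split into a business aspect
# and an IT aspect; the example entries are invented for illustration.
from dataclasses import dataclass

@dataclass
class CanvasLayer:
    business: str  # use-case / value side of the layer
    it: str        # technical implementation side of the layer

canvas = {
    "preparation":  CanvasLayer("collect customer touchpoints", "ingest pipeline, data lake"),
    "analysis":     CanvasLayer("churn-risk scoring",           "model training jobs"),
    "interaction":  CanvasLayer("alerts for account managers",  "dashboard / API"),
    "effectuation": CanvasLayer("retention campaigns",          "CRM integration"),
    "intelligence": CanvasLayer("learn which offers work",      "feedback-loop storage"),
}
```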


Author(s):  
Ranjit Biswas

The homogeneous data structure ‘train' and the heterogeneous data structure ‘atrain' are fundamental, powerful, dynamic, and flexible data structures, and the first introduced exclusively for big data. Given the explosive momentum of big data, ‘Data Structures for Big Data' should therefore be regarded as a new subject in big data science, not merely a new topic. Building on the notions of train and atrain, the author introduces further data structures for programmers working with big data: the homogeneous stacks ‘train stack' and ‘rT-coach stack', the heterogeneous stacks ‘atrain stack' and ‘rA-coach stack', the homogeneous queues ‘train queue' and ‘rT-coach queue', the heterogeneous queues ‘atrain queue' and ‘rA-coach queue', the homogeneous binary trees ‘train binary tree' and ‘rT-coach binary tree', the heterogeneous binary trees ‘atrain binary tree' and ‘rA-coach binary tree', the homogeneous trees ‘train tree' and ‘rT-coach tree', and the heterogeneous trees ‘atrain tree' and ‘rA-coach tree', enriching the subject ‘Data Structures for Big Data' for big data science.
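The abstract does not define the train formally; one loose outside reading pictures a homogeneous train as a linked list of fixed-capacity ‘coaches', akin to an unrolled linked list. The sketch below follows that reading and is hypothetical, not the author's specification:

```python
# Hypothetical reading of a homogeneous 'train': a linked list of
# fixed-capacity coaches, appended to at the tail coach.
class Coach:
    def __init__(self, capacity=4):
        self.data = []
        self.capacity = capacity
        self.next = None

class Train:
    def __init__(self):
        self.head = self.tail = Coach()

    def append(self, item):
        if len(self.tail.data) == self.tail.capacity:  # coach full: hook on a new one
            self.tail.next = Coach()
            self.tail = self.tail.next
        self.tail.data.append(item)

    def __iter__(self):
        c = self.head
        while c:
            yield from c.data
            c = c.next

t = Train()
for x in range(10):
    t.append(x)
print(list(t))  # [0, 1, ..., 9]
```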


2015 ◽  
pp. 423-434
Author(s):  
Sebastian Dippl ◽  
Michael C. Jaeger ◽  
Achim Luhn ◽  
Alexandra Shulman-Peleg ◽  
Gil Vernik

While it is common to use storage in a cloud-based manner, the question of true interoperability is rarely fully addressed. This question becomes even more relevant because the steadily growing amount of data that needs to be stored will soon exceed the capacity of any single system in terms of resources, availability, and network throughput. The logical conclusion is that a network of systems must be created that can cope with the requirements of big data applications and data-deluge scenarios. This chapter shows how federation and interoperability fit into a cloud storage scenario. The authors examine the challenges that federation imposes on autonomous, heterogeneous, and distributed cloud systems, and present approaches that help deal with the special requirements introduced by the VISION Cloud use cases from the healthcare, media, telecommunications, and enterprise domains. Finally, the authors give an overview of how VISION Cloud addresses these requirements in its research scenarios and architecture.

