Introducing Data Structures for Big Data

Author(s):  
Ranjit Biswas

The homogeneous data structure ‘train' and the heterogeneous data structure ‘atrain' are fundamental, powerful, dynamic, and flexible data structures, and the first data structures introduced exclusively for big data. Given the explosive momentum of big data, ‘Data Structures for Big Data' should therefore be regarded as a new subject in big data science, not merely a new topic. Building on the notions of train and atrain, the author introduces further data structures useful to programmers working with big data: the homogeneous stacks ‘train stack' and ‘rT-coach stack', the heterogeneous stacks ‘atrain stack' and ‘rA-coach stack', the homogeneous queues ‘train queue' and ‘rT-coach queue', the heterogeneous queues ‘atrain queue' and ‘rA-coach queue', the homogeneous binary trees ‘train binary tree' and ‘rT-coach binary tree', the heterogeneous binary trees ‘atrain binary tree' and ‘rA-coach binary tree', the homogeneous trees ‘train tree' and ‘rT-coach tree', and the heterogeneous trees ‘atrain tree' and ‘rA-coach tree', enriching the subject ‘Data Structures for Big Data' for big data science.

Author(s):  
Ranjit Biswas

The data structure “r-Train” (“Train” in short), where r is a natural number, is a new kind of powerful, robust data structure that stores homogeneous data dynamically and flexibly, in particular large volumes of data. A train cannot, however, store heterogeneous data (by heterogeneous data, the authors mean data of various datatypes). In fact, the classical data structures (e.g., array, linked list) can store and handle homogeneous data only, not heterogeneous data. The advanced data structure “r-Atrain” (“Atrain” in short) is logically almost analogous to the r-train but has a more advanced construction that accommodates heterogeneous data of large volumes. The train can be viewed as a special case of the atrain. It is important to note that neither of these two new data structures is a competitor of the other. By default, any heterogeneous data structure can also work as a homogeneous one; however, for a huge volume of homogeneous data, the train is more suitable than the atrain, whereas for heterogeneous data the atrain is suitable and the train is not applicable. The natural number r is chosen in advance and fixed by the programmer, depending on the problem under consideration and on the organization or industry for which the problem is posed.
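
The coach-based organization described above resembles an unrolled linked list: a chain of fixed-size blocks (“coaches”) of r homogeneous slots. The following minimal Python sketch is illustrative only, under that reading — the class and method names are ours, not the paper's — and shows how insertion appends into the tail coach and attaches a new coach once it is full.

```python
class Coach:
    """One coach: a fixed block of r homogeneous slots plus a link onward."""
    def __init__(self, r):
        self.data = [None] * r   # slots; None marks a free slot
        self.count = 0           # number of occupied slots
        self.next = None         # link to the next coach, if any

class RTrain:
    """Sketch of an r-train-like structure: a linked chain of coaches of capacity r."""
    def __init__(self, r):
        self.r = r
        self.head = Coach(r)
        self.tail = self.head

    def insert(self, value):
        # Append into the tail coach; attach a fresh coach when it is full.
        if self.tail.count == self.r:
            new_coach = Coach(self.r)
            self.tail.next = new_coach
            self.tail = new_coach
        self.tail.data[self.tail.count] = value
        self.tail.count += 1

    def __iter__(self):
        coach = self.head
        while coach is not None:
            for i in range(coach.count):
                yield coach.data[i]
            coach = coach.next

t = RTrain(3)
for x in [10, 20, 30, 40, 50]:
    t.insert(x)
print(list(t))  # [10, 20, 30, 40, 50]
```

Because each coach is a contiguous array, a suitably chosen r trades pointer overhead against wasted slots, which is why r is fixed per problem.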


2020, Vol. 9 (2), pp. 25-36
Author(s):
Necmi Gürsakal
Ecem Ozkan
Fırat Melih Yılmaz
Deniz Oktay

Interest in data science has been increasing in recent years. Data science, encompassing mathematics, statistics, big data, machine learning, and deep learning, can be considered the intersection of statistics, mathematics, and computer science. Although debate continues about the core area of data science, the subject is in great demand. Universities face a high demand for data science and are trying to meet it by opening postgraduate and doctoral programs. Because the field is new, there are significant differences between the data science programs universities offer. Moreover, since the subject is close to statistics, data science programs are often opened within statistics departments, which also contributes to the differences between programs. In this article, we summarize developments in data science education worldwide and in Turkey specifically, and discuss how data science education should be structured at the graduate level.


2021, Vol. 14 (11), pp. 2244-2257
Author(s):  
Otmar Ertl

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting distinct elements with very little space, MinHash is suited to the fast comparison of sets, as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or Hyper-MinHash, where in many cases it even outperforms the corresponding state-of-the-art estimators.
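
As background to the set-comparison use case, the classic MinHash estimate of Jaccard similarity can be sketched in a few lines of Python. This illustrates plain MinHash, not SetSketch itself, and the hashing scheme (salted built-in hash()) is a simplification for demonstration.

```python
import random

def minhash_signature(items, num_hashes=128, seed=42):
    """Plain MinHash: keep the minimum salted hash per hash function."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    # hash((salt, x)) plays the role of the k-th hash function h_k(x).
    return [min(hash((s, x)) for x in items) for s in salts]

def jaccard_estimate(sig_a, sig_b):
    """Fraction of positions where the minima agree estimates |A∩B| / |A∪B|."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

A = set(range(0, 100))
B = set(range(50, 150))   # true Jaccard similarity = 50/150 ≈ 0.33
est = jaccard_estimate(minhash_signature(A), minhash_signature(B))
```

With 128 hash functions the estimate typically lands within a few percentage points of the true similarity; the sketch size, not the set size, determines the comparison cost.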


2019, Vol. 3 (122), pp. 78-90
Author(s):
Olena Ihorivna Syrotkina
Mykhailo Oleksandrovych Aleksieiev
Iryna Mykhailivna Udovyk

This article addresses the creation of mathematical methods for optimizing time and computing resources when processing “big data.” One way of approaching this problem is the creation of NoSQL systems, whose advantages are the flexibility of their data models as well as the possibility of horizontal scaling, parallel processing, and speed in obtaining results. From the viewpoint of “big data” analysis, other methods have also been developed, such as machine learning, artificial intelligence, distributed processing of streams and events, and visual data research technology.

The aim of the research is to develop mathematical methods for processing “big data” based on a system analysis of the properties of the data structure “m-tuples based on ordered sets of arbitrary cardinality (OSAC).” This data structure is the Boolean (power set) of a basis set of cardinality n, ordered by right-side enumeration of the elements, from the lower boundary of the possible index value for each element of the tuple to the upper one. We formulated certain properties of the investigated data structure, which follow from the rules of logic by which the structure is formed, and described mathematical methods based on these properties. Boolean graphs are illustrated with drawings in which the outlined vertices correspond to the declared properties of the given data structure. We derived analytical dependencies to determine these Boolean elements; they do not require executing algorithms that implement the operations of intersection, union, and membership, because the desired result is already determined by the properties.

The properties of the data structure in question allow some interdependencies between m-tuples to be determined from their location in the structure, which is given by a pair of indices (j, m), without executing computing algorithms. In this case, the time estimate for obtaining results improves from a cubic O(n³) to a linear O(n) dependency.
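
The paper's OSAC indexing is specific to its own construction, but the general idea — locating an m-tuple analytically from indices rather than by search — can be illustrated with the well-known combinatorial number system, in which the lexicographic rank of an m-element tuple over an ordered base set of cardinality n is computed directly in time linear in n. The function name below is ours, purely for illustration.

```python
from itertools import combinations
from math import comb

def rank(tup, n):
    """Lexicographic rank of an m-element tuple over the base set {0, ..., n-1},
    computed analytically (combinatorial number system) instead of by search."""
    m = len(tup)
    r, prev = 0, -1
    for i, c in enumerate(tup):
        # Count all tuples that branch off with a smaller value at position i.
        for v in range(prev + 1, c):
            r += comb(n - 1 - v, m - 1 - i)
        prev = c
    return r

# The analytic rank matches the position in full lexicographic enumeration.
n, m = 5, 3
assert all(rank(t, n) == j for j, t in enumerate(combinations(range(n), m)))
```

The point of the illustration is the same as in the article: a tuple's position follows from a closed-form dependency on its indices, so no enumeration or set-operation algorithm needs to run.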


Author(s):  
Luca Barbaglia
Sergio Consoli
Sebastiano Manzan
Diego Reforgiato Recupero
Michaela Saisana
...  

Abstract This chapter is an introduction to the use of data science technologies in the fields of economics and finance. The explosion in computation and information technology over the past decade has made vast amounts of data available in various domains, which has been referred to as Big Data. In economics and finance in particular, tapping into these data brings research and business closer together, as data generated in ordinary economic activity can be used towards effective and personalized models. In this context, the recent use of data science technologies in economics and finance provides mutual benefits to both scientists and professionals, improving forecasting and nowcasting for several kinds of applications. This chapter introduces the subject through underlying technical challenges such as data handling and protection, modeling, integration, and interpretation. It also outlines some of the common issues in economic modeling with data science technologies and surveys the relevant big data management and analytics solutions, motivating the use of data science methods in economics and finance.


2021, Vol. 73 (1), pp. 134-141
Author(s):
A.R. Baidalina
S.A. Boranbayev

The article discusses ways of programming algorithms for complex data structures in Python. Knowledge of these structures and the corresponding algorithms is necessary when choosing the best methods for developing various kinds of software. When studying the subject "Algorithms and Data Structures", it is important to understand the essence of data structures, because adapting a data structure to a specific problem requires an understanding of its essence and its algorithms. Examples are given of programming algorithms related to dynamic lists and binary search trees in the currently widely used Python language. Depth-first and breadth-first graph traversal algorithms are implemented clearly and efficiently using the Python dictionary.
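
As an illustration of the last point, depth-first and breadth-first traversal over a graph stored as a Python dictionary (vertex mapped to a list of neighbours) can be written as follows; the example graph is ours, not taken from the article.

```python
from collections import deque

# Graph as a Python dictionary: vertex -> list of neighbours.
graph = {
    'A': ['B', 'C'],
    'B': ['D'],
    'C': ['D', 'E'],
    'D': ['E'],
    'E': [],
}

def dfs(graph, start):
    """Depth-first traversal using an explicit stack."""
    visited, order, stack = set(), [], [start]
    while stack:
        v = stack.pop()
        if v not in visited:
            visited.add(v)
            order.append(v)
            # Push neighbours in reverse so they are explored in listed order.
            stack.extend(reversed(graph[v]))
    return order

def bfs(graph, start):
    """Breadth-first traversal using a queue."""
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return order

print(dfs(graph, 'A'))  # ['A', 'B', 'D', 'E', 'C']
print(bfs(graph, 'A'))  # ['A', 'B', 'C', 'D', 'E']
```

The dictionary gives O(1) average-time neighbour lookup, which is what makes both traversals read so directly in Python.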


2019, Vol. 13 (2), pp. 227-236
Author(s):  
Tetsuo Shibuya

Abstract A data structure is called succinct if its asymptotic space requirement matches the original data size. The development of succinct data structures is an important factor in dealing with explosively increasing big data. Moreover, a wider variety of big data has recently been produced in various fields, and there is a substantial need for the development of more application-specific succinct data structures. In this study, we review recently proposed application-oriented succinct data structures motivated by big data applications in three different fields: privacy-preserving computation in cryptography, genome assembly in bioinformatics, and work-space reduction for compressed communications.
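
To make the idea concrete, a classic building block of succinct data structures is the rank-support bit vector: cumulative popcounts are precomputed per block so that rank queries run in constant time over small auxiliary arrays (real succinct designs use a two-level block scheme to keep the overhead to o(n) bits). The toy Python version below, with names of our choosing, illustrates only the principle.

```python
class RankBitVector:
    """Toy rank-support structure over a bit array: per-block cumulative
    popcounts make rank1 queries cheap. Illustrative, not a production design."""
    BLOCK = 64

    def __init__(self, bits):
        self.bits = bits
        # Cumulative count of 1-bits at the start of every block.
        self.block_ranks = [0]
        for i in range(0, len(bits), self.BLOCK):
            self.block_ranks.append(
                self.block_ranks[-1] + sum(bits[i:i + self.BLOCK]))

    def rank1(self, i):
        """Number of 1-bits in bits[0:i]: one table lookup plus a short scan."""
        b, off = divmod(i, self.BLOCK)
        start = b * self.BLOCK
        return self.block_ranks[b] + sum(self.bits[start:start + off])

bv = RankBitVector([1, 0, 1, 1, 0, 1] * 40)
assert bv.rank1(6) == 4  # four 1-bits among the first six positions
```

Rank (and its inverse, select) over such bit vectors underpins many of the application-specific structures the survey covers, from compressed text indexes to succinct de Bruijn graphs.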


2021, Vol. 9
Author(s):  
Viktor Sebestyén
Tímea Czvetkó
János Abonyi

The aim of this paper is to provide an overview of the interrelationship between data science and climate studies, and to describe how sustainability and climate issues can be managed using Big Data tools. Climate-related Big Data articles are analyzed and categorized, revealing an increasing number of data-driven solutions in specific areas, while broad integrative analyses receive less focus. Our major objective is to highlight the potential of the System of Systems (SoS) approach, as the synergies between diverse disciplines and research ideas must be explored to gain a comprehensive overview of the issue. Data and systems science enables large amounts of heterogeneous data to be integrated and simulation models to be developed while considering socio-environmental interrelations in parallel. The improved knowledge integration offered by System of Systems thinking, or climate computing, is demonstrated by analysing the possible inter-linkages of the latest Big Data application papers. The analysis highlights how data and models focusing on specific areas of sustainability can be bridged to study the complex problems of climate change.


Author(s):  
Shaveta Bhatia

The epoch of big data presents many opportunities for development in the areas of data science, biomedical research, cyber security, and cloud computing. Big data has gained wide popularity, but it also invites many challenges affecting its security and privacy. Various types of threats and attacks, such as data leakage, unauthorized third-party access, viruses, and vulnerabilities, stand against the security of big data. This paper discusses these security threats and approaches to addressing them in the fields of biomedical research, cyber security, and cloud computing.

