Distributed classification for imbalanced big data in distributed environments

The skyline query and its variants are useful in the early stages of a knowledge-discovery process. They select a set of important objects that are better than the other, common objects in the dataset. To handle big data, such knowledge-discovery queries must be computed in parallel distributed environments. In this paper, we consider an efficient parallel algorithm for the “K-skyband query” and the “top-k dominating query”, which are popular variants of the skyline query. We propose a method for computing both queries simultaneously in MapReduce, a popular parallel distributed framework for processing “big data” problems. Our extensive evaluation results validate the effectiveness and efficiency of the proposed algorithm on both real and synthetic datasets.
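The two queries named in the abstract can be illustrated with a small sequential sketch. It is not the paper's MapReduce algorithm; it only shows what the queries return, under the commonly used definitions, which are an assumption here because the abstract does not spell them out: a point dominates another if it is no worse in every dimension and strictly better in at least one (smaller values are taken as better), the K-skyband holds the points dominated by fewer than K others, and the top-k dominating query returns the k points that dominate the most others. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """True if p is no worse than q everywhere and strictly better somewhere.

    Assumption: smaller values are better in every dimension.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def k_skyband(points: Sequence[Point], k: int) -> List[Point]:
    """Points dominated by fewer than k other points (k = 1 gives the skyline)."""
    return [q for q in points if sum(dominates(p, q) for p in points) < k]


def top_k_dominating(points: Sequence[Point], k: int) -> List[Point]:
    """The k points that dominate the largest number of other points.

    Ties are broken by input order because Python's sort is stable.
    """
    scored = [(sum(dominates(p, q) for q in points), p) for p in points]
    scored.sort(key=lambda pair: -pair[0])
    return [p for _, p in scored[:k]]


points = [(1, 1), (2, 2), (3, 3), (1, 3), (3, 1)]
skyline = k_skyband(points, 1)         # [(1, 1)]
two_skyband = k_skyband(points, 2)     # [(1, 1), (2, 2), (1, 3), (3, 1)]
best_two = top_k_dominating(points, 2) # [(1, 1), (2, 2)]
```

Both functions compare every pair of points, so they take quadratic time. That cost is the reason a parallel formulation matters for large datasets.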

A General Overview of Privacy-Preserving Big Data Management and Analytics Models, Methods and Techniques in Specific Domains: Static and Dynamic Distributed Environments

2018 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2018.8621882 ◽

2018 ◽

Author(s):

Alfredo Cuzzocrea ◽

Carlo Mastroianni

Keyword(s):

Big Data ◽

Data Management ◽

Privacy Preserving ◽

Distributed Environments ◽

General Overview ◽

Methods And Techniques

Service Level Agreements in Cloud Computing and Big Data

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v5i1.pp158-165 ◽

2015 ◽

Vol 5 (1) ◽

pp. 158

Author(s):

K. Radha ◽

B.Thirumala Rao ◽

Shaik Masthan Babu ◽

K.Thirupathi Rao ◽

V.Krishna Reddy ◽

...

Keyword(s):

Cloud Computing ◽

Big Data ◽

Service Level ◽

Distributed Environments ◽

Service Level Agreements ◽

Research Directions ◽

Ongoing Work

Nowadays, most industries hold large volumes of data, ranging from terabytes to petabytes. Organizations are looking for ways to handle this growth. Enterprises are using cloud deployments to address big data and analytics, with attention to the interaction between cloud and big data. This paper presents big data issues and research directions for the ongoing work on processing big data in distributed environments.

Proceedings of the International Workshop on Big Data in Emergent Distributed Environments

10.1145/3460866 ◽

2021 ◽

Keyword(s):

Big Data ◽

International Workshop ◽

Distributed Environments

The Experimental Study of Performance Impairment of Big Data Processing in Dynamic and Opportunistic Environments

Journal of Communications ◽

10.12720/jcm.15.11.776-789 ◽

2020 ◽

pp. 776-789

Author(s):

Wei Li ◽

William W. Guo

Keyword(s):

Big Data ◽

Impact Strength ◽

Data Processing ◽

Impact Factors ◽

Distributed Environments ◽

Distributed Environment ◽

Big Data Processing ◽

Performance Impairment ◽

Result Analysis ◽

The Impact

In contrast to HPC clusters, when big data is processed in a distributed environment, particularly a dynamic and opportunistic one, the overall performance is inevitably impaired and even bottlenecked by the dynamics of the overlay and the opportunism of the computing nodes. The dynamics and opportunism are caused by the churn and unreliability of a generic distributed environment, and they cannot be ignored or avoided. Understanding the impact factors, their impact strength, and the relevance between these impacts is the foundation of potential optimization. This paper presents the research background, methodology, and results by reasoning about the necessity of distributed environments for big data processing, scrutinizing the dynamics and opportunism of distributed environments, classifying impact factors, proposing evaluation metrics, and carrying out a series of intensive experiments. The result analysis provides important insights into the impact strength of the factors and the relevance of impact across the factors. The results aim to pave the way for future optimization or avoidance of potential bottlenecks for big data processing in distributed environments.

SetSketch

Proceedings of the VLDB Endowment ◽

10.14778/3476249.3476276 ◽

2021 ◽

Vol 14 (11) ◽

pp. 2244-2257

Author(s):

Otmar Ertl

Keyword(s):

Big Data ◽

Data Structure ◽

Data Structures ◽

Similarity Search ◽

State Of The Art ◽

Use Cases ◽

Distributed Environments ◽

Jaccard Similarity ◽

Big Data Applications ◽

Better Than

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or Hyper-MinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases.
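The properties the abstract highlights (a commutative and idempotent insert, a mergeable state, and Jaccard estimation) can be seen in a minimal MinHash sketch. This is plain MinHash, not SetSketch, and it is a sketch under stated assumptions: the class name, the choice of salted BLAKE2b as the hash family, and the default of 256 registers are all illustrative choices that do not come from the paper.

```python
import hashlib
from typing import List


class MinHashSketch:
    """Minimal MinHash: one register per hash function, each holding a minimum."""

    def __init__(self, num_hashes: int = 256) -> None:
        self.num_hashes = num_hashes
        # Sentinel larger than any 64-bit hash value, meaning "register empty".
        self.mins: List[int] = [2**64] * num_hashes

    def _hash(self, item: str, index: int) -> int:
        # Assumption: salted BLAKE2b stands in for a family of independent hashes.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=index.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little")

    def insert(self, item: str) -> None:
        # Taking a minimum makes insertion commutative and idempotent.
        for i in range(self.num_hashes):
            self.mins[i] = min(self.mins[i], self._hash(item, i))

    def merge(self, other: "MinHashSketch") -> "MinHashSketch":
        # Element-wise minimum yields the sketch of the union of both sets.
        if self.num_hashes != other.num_hashes:
            raise ValueError("sketches must use the same number of hashes")
        merged = MinHashSketch(self.num_hashes)
        merged.mins = [min(a, b) for a, b in zip(self.mins, other.mins)]
        return merged

    def jaccard(self, other: "MinHashSketch") -> float:
        # The fraction of matching registers estimates the Jaccard similarity.
        if self.num_hashes != other.num_hashes:
            raise ValueError("sketches must use the same number of hashes")
        matches = sum(a == b for a, b in zip(self.mins, other.mins))
        return matches / self.num_hashes
```

Because `merge` only takes minima, sketches built independently on different nodes can be combined in any order, which is what makes this kind of structure convenient in distributed environments. The estimate from `jaccard` is noisy, and its standard error shrinks roughly with the square root of the number of registers.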
