Optimizations for filter-based join algorithms in MapReduce

2021 ◽  
pp. 1-18
Author(s):  
Salahaldeen Rababa ◽  
Amer Al-Badarneh

Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for processing large-scale data, but it has limitations when processing heterogeneous datasets, because a large number of redundant intermediate records are transferred through the network. Several filtering techniques have been developed to improve join performance, but they require multiple MapReduce jobs to process the input datasets. To address this issue, adaptive filter-based join algorithms are presented in this paper. Specifically, three join algorithms are introduced that perform filter creation and redundant-record elimination within a single MapReduce job. A cost analysis of the introduced join algorithms shows that the I/O cost is reduced compared to the state-of-the-art filter-based join algorithms. The performance of the join algorithms was evaluated in terms of the total execution time and the total amount of I/O data transferred. The experimental results show that the adaptive Bloom join, semi-adaptive intersection Bloom join, and adaptive intersection Bloom join decrease the total execution time by 30%, 25%, and 35%, respectively, and reduce the total amount of I/O data transferred by 18%, 25%, and 50%, respectively.
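
A minimal sketch of the general idea behind a Bloom join (an illustration of the technique, not the authors' single-job adaptive algorithms): a Bloom filter built from the join keys of the smaller dataset is used to discard non-matching records of the larger dataset before the shuffle phase. The `BloomFilter` class and the sample datasets here are hypothetical.

```python
import hashlib

class BloomFilter:
    """A simple Bloom filter over string join keys."""
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Build the filter from the smaller dataset's join keys...
small = [("k1", "a"), ("k2", "b")]
large = [("k1", "x"), ("k3", "y"), ("k2", "z")]

bf = BloomFilter()
for key, _ in small:
    bf.add(key)

# ...then, while mapping over the larger dataset, emit only records
# whose key may exist in the smaller dataset.  False positives are
# possible, so the reduce phase still performs the exact join; true
# non-matches are filtered before they cross the network.
candidates = [(k, v) for k, v in large if k in bf]
print(candidates)  # [('k1', 'x'), ('k2', 'z')]
```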

Author(s):  
Surabhi Kumari

Abstract: MPC (multi-party computation) is a general cryptographic technique for carrying out computations while preserving the privacy of the inputs. MPC allows a group of parties to jointly evaluate a function without revealing their private inputs or outputs in plaintext. Privacy-preserving voting, arithmetic computation, and large-scale data processing are just a few of the applications of MPC. From a system perspective, each MPC party can run on a single computing node. The parties' computing nodes may be homogeneous or heterogeneous; nevertheless, the distributed workloads of MPC protocols are always homogeneous (symmetric). In this paper, we investigate the system performance of a representative MPC framework and a collection of MPC applications. On homogeneous and heterogeneous compute nodes, we describe the complete online calculation workflow of a state-of-the-art MPC protocol and examine the fundamental cause of its stall time and performance limitation. Keywords: Cloud Computing, IoT, MPC, Amazon Service, Virtualization.
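
As background for how MPC parties can compute jointly without seeing each other's inputs, here is a minimal sketch of additive secret sharing over a prime field, a building block of many MPC protocols (an illustration only, not the specific protocol the paper studies; the modulus choice is arbitrary).

```python
import secrets

P = 2**61 - 1  # Mersenne prime used as the field modulus (illustrative choice)

def share(secret, n_parties):
    """Split a secret into n additive shares that sum to it mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Three parties add two private inputs without revealing them: each
# input is shared, each party adds its local shares, and only the
# shares of the *result* are ever combined.
x_shares = share(42, 3)
y_shares = share(100, 3)
sum_shares = [(a + b) % P for a, b in zip(x_shares, y_shares)]
assert reconstruct(sum_shares) == 142
```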


2016 ◽  
Vol 12 (1) ◽  
pp. 49-68 ◽  
Author(s):  
Christian Esposito ◽  
Massimo Ficco

The demand to access large volumes of data, distributed across hundreds or thousands of machines, has opened new opportunities in commerce, science, and computing applications. MapReduce is a paradigm that offers a programming model and an associated implementation for processing massive datasets in a parallel fashion, using non-dedicated distributed computing hardware. It has been successfully adopted in several academic and industrial projects for Big Data Analytics. However, since such analytics is increasingly demanded within the context of mission-critical applications, security and reliability in MapReduce frameworks are strongly required in order to manage sensitive information and to obtain the right answer at the right time. In this paper, the authors present the main implementation of the MapReduce programming paradigm, provided by Apache under the name Hadoop. They illustrate the security and reliability concerns in the context of a large-scale data processing infrastructure. They review the available solutions and their limitations in supporting security and reliability within the context of MapReduce frameworks. The authors conclude by describing the ongoing evolution of such solutions and possible directions for improvement, which could be challenging research opportunities for academic researchers.
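
For readers unfamiliar with the paradigm, a minimal single-process sketch of the map and reduce functions for the canonical word-count example (illustrative only; a real Hadoop job implements Mapper and Reducer classes in Java and the framework handles the shuffle):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Emit an intermediate (word, 1) pair for every word.
    for word in document.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Sum the counts for one key; the framework groups keys for us.
    return (word, sum(counts))

docs = ["map reduce map", "reduce shuffle sort"]
pairs = sorted(p for d in docs for p in map_phase(d))  # shuffle + sort
result = [reduce_phase(k, (c for _, c in g))
          for k, g in groupby(pairs, key=itemgetter(0))]
print(result)  # [('map', 2), ('reduce', 2), ('shuffle', 1), ('sort', 1)]
```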


2020 ◽  
Vol 2020 ◽  
pp. 1-7
Author(s):  
Tuozhong Yao ◽  
Wenfeng Wang ◽  
Yuhong Gu

Multiview active learning (MAL) is a technique that can shrink the version space far more than traditional active learning and has great potential in large-scale data analysis. In this paper, we present a new deep multiview active learning (DMAL) framework, which is the first to combine multiview active learning and deep learning for annotation-effort reduction. In this framework, our approach advances existing active learning methods in two respects. First, we incorporate two different deep convolutional neural networks into active learning, using complementary multiview information to improve feature learning. Second, through the properly designed framework, the feature representation and the classifier can be updated simultaneously with progressively annotated informative samples. Experiments on two challenging image datasets demonstrate that our proposed DMAL algorithm achieves more promising results than several state-of-the-art active learning algorithms.
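
A minimal sketch of the multiview selection idea (a generic disagreement-based criterion, not the exact DMAL architecture): two view-specific models score the unlabeled pool, and the samples on which the views disagree most are queried for annotation. The logistic scorers below are stand-ins for the paper's two CNNs, and all data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def view_probability(features, weights):
    """Stand-in for one view's CNN classifier: a logistic score."""
    return 1.0 / (1.0 + np.exp(-features @ weights))

# Hypothetical unlabeled pool with two feature views per sample.
pool_view1 = rng.normal(size=(1000, 16))
pool_view2 = rng.normal(size=(1000, 32))
w1 = rng.normal(size=16)
w2 = rng.normal(size=32)

p1 = view_probability(pool_view1, w1)
p2 = view_probability(pool_view2, w2)

# Query the samples on which the two views disagree the most; these
# are the ones most likely to shrink the version space.
disagreement = np.abs(p1 - p2)
query_indices = np.argsort(disagreement)[-10:]
print(query_indices)
```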


2018 ◽  
Vol 7 (3.8) ◽  
pp. 16
Author(s):  
Md Tahsir Ahmed Munna ◽  
Shaikh Muhammad Allayear ◽  
Mirza Mohtashim Alam ◽  
Sheikh Shah Mohammad Motiur Rahman ◽  
Md Samadur Rahman ◽  
...  

MapReduce has become a popular programming model for processing large-scale data sets in a parallel, distributed fashion on a cluster. Hadoop MapReduce is especially needed for large-scale workloads such as big data processing. In this paper, we modify the Hadoop MapReduce algorithm and implement the modification to reduce processing time.
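
The abstract does not detail the modification; as one common illustration of how a MapReduce pipeline's processing time can be reduced, here is a sketch of a combiner step that pre-aggregates map output locally before the shuffle, cutting the volume of intermediate records (a generic optimization, not necessarily the authors' modification).

```python
from collections import Counter

def map_phase(document):
    for word in document.split():
        yield (word, 1)

def combine(mapper_output):
    # Runs on the mapper node: merge (word, 1) pairs locally so that
    # fewer records are sent across the network to the reducers.
    local = Counter()
    for word, count in mapper_output:
        local[word] += count
    return local.items()

doc = "the quick the lazy the dog"
raw = list(map_phase(doc))        # 6 intermediate records
combined = list(combine(raw))     # 4 records cross the network
print(len(raw), len(combined))    # 6 4
```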


2019 ◽  
Vol 48 (4) ◽  
pp. 673-681
Author(s):  
Shufen Zhang ◽  
Zhiyu Liu ◽  
Xuebin Chen ◽  
Changyin Luo

To overcome the difficulties that the traditional K-Means clustering algorithm faces on large-scale data sets, a Hadoop K-Means (HKM) clustering algorithm is proposed. Firstly, the algorithm eliminates the effect of noise points in the data set according to sample density. Secondly, it optimizes the selection of the initial center points using the max-min distance principle. Finally, it uses the MapReduce programming model to parallelize the computation. Experimental results show that the proposed algorithm not only achieves high accuracy and stability in its clustering results, but also addresses the scalability problems that traditional clustering algorithms encounter on large-scale data.
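
A minimal sketch of max-min distance center initialization as described (the density-based noise elimination and the MapReduce parallelization are omitted; the data here is synthetic):

```python
import numpy as np

def max_min_init(points, k):
    """Pick k initial centers: each new center is the point whose
    distance to its nearest already-chosen center is largest."""
    centers = [points[0]]  # seed with an arbitrary point
    for _ in range(k - 1):
        dists = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    return np.array(centers)

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 2))
print(max_min_init(data, k=3))
```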


2021 ◽  
Author(s):  
Qi Zhai ◽  
Zhigang Kan ◽  
Linhui Feng ◽  
Linbo Qiao ◽  
Feng Liu

Recently, Chinese event detection has attracted more and more attention. As characters of a logographic script, Chinese glyphs are semantically useful but still unexplored in this task. In this paper, we propose a novel Glyph-Aware Fusion Network, named GlyFN, which introduces glyph information into the pre-trained language model representation. To obtain a better representation, we design a Vector Linear Fusion mechanism to fuse the two: it first applies max-pooling to capture salient information, and then uses a linear operation on the vectors to retain each one's unique information. Moreover, for large-scale unstructured text, we distribute the data across different clusters for parallel processing. Finally, we conduct extensive experiments on ACE2005 and large-scale data. Experimental results show that GlyFN obtains increases of 7.48 (10.18%) and 6.17 (8.7%) in F1-score for trigger identification and classification over the state-of-the-art methods, respectively. Furthermore, the event detection task for large-scale unstructured text can be accomplished efficiently through this distribution.
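
A hedged sketch of the described fusion idea, assuming max-pooling over the two representations captures the salient features and a learned linear map retains each view's unique information; the exact formulation in the paper may differ, and the dimensions, weights, and combination rule below are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 768  # hidden size of the pre-trained language model (illustrative)

lm_repr = rng.normal(size=d)     # token representation from the LM
glyph_repr = rng.normal(size=d)  # glyph representation (e.g., from a CNN)

# Element-wise max-pooling across the two vectors keeps the most
# salient features from either view...
pooled = np.maximum(lm_repr, glyph_repr)

# ...while a linear operation on the concatenated vectors retains
# each view's unique information (random stand-ins for trained weights).
W = rng.normal(size=(d, 2 * d)) * 0.01
linear = W @ np.concatenate([lm_repr, glyph_repr])

fused = pooled + linear  # assumed combination of the two branches
print(fused.shape)       # (768,)
```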


2020 ◽  
Author(s):  
Than Le

In this paper, we focus on a simple data-driven deep learning approach that implements the Mask R-CNN module and relies on deeper manipulation of the datasets. We first apply affine transformations and projective representations for data augmentation, manually enlarging the dataset in line with the state of the art in computer vision. We then evaluate our method concretely by visualizing our datasets and testing against many methods for object detection and segmentation, using more than 5000 images of many similar objects. Finally, we illustrate its efficiency in small applications such as food recognition and grasping and manipulation in robotics.
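
A minimal sketch of affine and projective (perspective) augmentation using OpenCV, the kind of transformation pipeline described; the file paths and transform parameters below are illustrative choices, not the paper's settings.

```python
import cv2
import numpy as np

image = cv2.imread("sample.jpg")  # hypothetical input image
h, w = image.shape[:2]

# Affine augmentation: rotate 15 degrees about the center, scale 1.1x.
M_affine = cv2.getRotationMatrix2D((w / 2, h / 2), angle=15, scale=1.1)
affine_aug = cv2.warpAffine(image, M_affine, (w, h))

# Projective augmentation: map the image corners to a skewed quad.
src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
dst = np.float32([[w * 0.05, h * 0.05], [w * 0.95, 0], [w, h], [0, h * 0.9]])
M_proj = cv2.getPerspectiveTransform(src, dst)
projective_aug = cv2.warpPerspective(image, M_proj, (w, h))

cv2.imwrite("affine_aug.jpg", affine_aug)
cv2.imwrite("projective_aug.jpg", projective_aug)
```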


CONVERTER ◽  
2021 ◽  
pp. 116-127
Author(s):  
Ajay Bansal ◽  
Manmohan Sharma ◽  
Ashu Gupta

One of the most challenging aspects of using MapReduce to parallelize and distribute large-scale data processing is detecting straggler tasks, that is, identifying tasks running on weak nodes. The total computation time is the sum of the execution times of the two stages in the Map phase (copy, combine) and the three stages in the Reduce phase (shuffle, sort, and reduce). The main aim of this paper is to estimate the execution time accurately in each location. The proposed approach uses a backpropagation neural network on Hadoop to detect straggler tasks and calculate the remaining task execution time, which is crucial for straggler identification. A comparative analysis is carried out against efficient models in this domain, such as LATE and ESAMR, and against the real remaining time, on the WordCount and Sort benchmarks. It was found that the proposed model is capable of detecting straggler tasks by accurately estimating execution time, and it also helps reduce the time it takes to complete a task.
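
A minimal sketch of the core estimation step, assuming a small backpropagation-trained network that maps per-stage progress features to remaining execution time; the feature set, training data, and straggler threshold below are hypothetical, and the paper's actual model may differ.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)

# Hypothetical features per running task: progress fraction of each
# stage (copy, combine, shuffle, sort, reduce) plus node load.
X_train = rng.uniform(size=(2000, 6))
# Synthetic target: remaining seconds (stand-in for historical logs).
y_train = 100 * (1 - X_train[:, :5].mean(axis=1)) * (1 + X_train[:, 5])

model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=500,
                     random_state=0).fit(X_train, y_train)

# Flag a task as a straggler when its estimated remaining time far
# exceeds the median estimate across its peers (illustrative rule).
estimates = model.predict(rng.uniform(size=(50, 6)))
stragglers = np.where(estimates > 2 * np.median(estimates))[0]
print(stragglers)
```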


2015 ◽  
Vol 25 (04) ◽  
pp. 1550009 ◽  
Author(s):  
N. P. Gopalan ◽  
S. Suresh

Hadoop is a widely used open-source implementation of MapReduce, a popular programming model for the parallel processing of large-scale, data-intensive applications in a cloud environment. Sharing Hadoop clusters involves a tradeoff between fairness and data locality. When launching a local task is not possible, the Hadoop Fair Scheduler (HFS) with delay scheduling postpones node allocation for a while for the job that is next to be scheduled according to fairness, in order to achieve high locality. This waiting is wasted when the desired locality cannot be achieved within a reasonable period. In this paper, a modified delay scheduling scheme for HFS is proposed and implemented in Hadoop. It avoids the aforementioned waiting when achieving locality is not possible. Instead of blindly waiting for a local node, the proposed algorithm first estimates the time to wait for a local node for the job and skips the wait whenever locality cannot be achieved within the predefined delay threshold, while still accomplishing the same locality. The performance of the proposed algorithm is evaluated through extensive experiments, and it has been observed that the algorithm performs significantly better in terms of response time and fairness, achieving up to a 20% speedup and up to 38% better fairness in certain cases.
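
A minimal sketch of the modified waiting decision (the wait-time estimation model is reduced to a supplied estimate; the class, function names, and threshold are illustrative):

```python
def should_wait(estimated_wait_for_local_node, delay_threshold):
    """Classic delay scheduling always waits up to the threshold; the
    modification skips the wait entirely when the estimated time until
    a local node frees up already exceeds it."""
    return estimated_wait_for_local_node <= delay_threshold

def schedule(job, free_node, estimate_local_wait, delay_threshold=3.0):
    # A node holding the job's input data gets a task immediately.
    if free_node in job.local_nodes:
        return "launch local task"
    # Otherwise, wait only if a local node is expected soon enough.
    if should_wait(estimate_local_wait(job), delay_threshold):
        return "wait for a local node"
    return "launch non-local task now"  # waiting would be wasted

class Job:
    def __init__(self, local_nodes):
        self.local_nodes = set(local_nodes)

job = Job(local_nodes=["node-1", "node-2"])
print(schedule(job, "node-7", estimate_local_wait=lambda j: 10.0))
# -> launch non-local task now
```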


2016 ◽  
Vol 6 (1) ◽  
pp. 59-87 ◽  
Author(s):  
Amer Al-Badarneh ◽  
Amr Mohammad ◽  
Salah Harb

MapReduce, a distinguished and successful platform for parallel data processing, is attracting significant momentum from both academia and industry as the volume of data to capture, transform, and analyse grows rapidly. Although MapReduce is used in many applications to analyse large-scale data sets, there is still a lot of debate among scientists and researchers about its efficiency, performance, and usability for supporting more classes of applications. This survey presents a comprehensive review of various implementations of the MapReduce framework. Initially, the authors give an overview of the MapReduce programming model. They then present a broad description of various technical aspects of the most successful implementations of the MapReduce framework reported in the literature and discuss their main strengths and weaknesses. Finally, the authors conclude by presenting a comparison between MapReduce implementations and discussing open issues and challenges in enhancing MapReduce.

