Optimizations for filter-based join algorithms in MapReduce

2021 ◽  
pp. 1-18
Author(s):  
Salahaldeen Rababa ◽  
Amer Al-Badarneh

Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for processing large-scale data, but it has limitations when processing heterogeneous datasets, because a large number of redundant intermediate records are transferred through the network. Several filtering techniques have been developed to improve join performance, but they require multiple MapReduce jobs to process the input datasets. To address this issue, adaptive filter-based join algorithms are presented in this paper. Specifically, three join algorithms are introduced that perform filter creation and redundant-record elimination within a single MapReduce job. A cost analysis of the introduced join algorithms shows that the I/O cost is reduced compared to the state-of-the-art filter-based join algorithms. The performance of the join algorithms was evaluated in terms of the total execution time and the total amount of I/O data transferred. The experimental results show that the adaptive Bloom join, semi-adaptive intersection Bloom join, and adaptive intersection Bloom join decrease the total execution time by 30%, 25%, and 35%, respectively, and reduce the total amount of I/O data transferred by 18%, 25%, and 50%, respectively.
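
A minimal sketch of the general idea behind a Bloom join (an illustration of the technique, not the authors' single-job adaptive algorithms): a Bloom filter built from the join keys of the smaller dataset is used to discard non-matching records of the larger dataset before the shuffle phase. The `BloomFilter` class and the sample datasets here are hypothetical.

```python
import hashlib

class BloomFilter:
    """A simple Bloom filter over string join keys."""
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Build the filter from the smaller dataset's join keys...
small = [("k1", "a"), ("k2", "b")]
large = [("k1", "x"), ("k3", "y"), ("k2", "z")]

bf = BloomFilter()
for key, _ in small:
    bf.add(key)

# ...then, while mapping over the larger dataset, emit only records
# whose key may exist in the smaller dataset.  False positives are
# possible, so the reduce phase still performs the exact join; true
# non-matches are filtered before they cross the network.
candidates = [(k, v) for k, v in large if k in bf]
print(candidates)  # [('k1', 'x'), ('k2', 'z')]
```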

Author(s):  
Surabhi Kumari

Abstract: MPC (multi-party computation) is a general cryptographic technique for carrying out computations while preserving the privacy of the inputs. MPC allows a group of parties to jointly evaluate a function without revealing their private inputs or outputs in plaintext. Privacy-preserving voting, arithmetic computation, and large-scale data processing are just a few of the applications of MPC. From a system perspective, each MPC party can run on a single computing node. The parties' computing nodes may be homogeneous or heterogeneous; nevertheless, the distributed workloads of MPC protocols are always homogeneous (symmetric). In this paper, we investigate the system performance of a representative MPC framework and a collection of MPC applications. On homogeneous and heterogeneous compute nodes, we describe the complete online calculation workflow of a state-of-the-art MPC protocol and examine the fundamental cause of its stall time and performance limitation. Keywords: Cloud Computing, IoT, MPC, Amazon Service, Virtualization.
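
As background for how MPC parties can compute jointly without seeing each other's inputs, here is a minimal sketch of additive secret sharing over a prime field, a building block of many MPC protocols (an illustration only, not the specific protocol the paper studies; the modulus choice is arbitrary).

```python
import secrets

P = 2**61 - 1  # Mersenne prime used as the field modulus (illustrative choice)

def share(secret, n_parties):
    """Split a secret into n additive shares that sum to it mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Three parties add two private inputs without revealing them: each
# input is shared, each party adds its local shares, and only the
# shares of the *result* are ever combined.
x_shares = share(42, 3)
y_shares = share(100, 3)
sum_shares = [(a + b) % P for a, b in zip(x_shares, y_shares)]
assert reconstruct(sum_shares) == 142
```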


2016 ◽  
Vol 12 (1) ◽  
pp. 49-68 ◽  
Author(s):  
Christian Esposito ◽  
Massimo Ficco

The demand to access large volumes of data, distributed across hundreds or thousands of machines, has opened new opportunities in commerce, science, and computing applications. MapReduce is a paradigm that offers a programming model and an associated implementation for processing massive datasets in a parallel fashion, using non-dedicated distributed computing hardware. It has been successfully adopted in several academic and industrial projects for Big Data Analytics. However, since such analytics is increasingly demanded within the context of mission-critical applications, security and reliability in MapReduce frameworks are strongly required in order to manage sensitive information and to obtain the right answer at the right time. In this paper, the authors present the main implementation of the MapReduce programming paradigm, provided by Apache under the name Hadoop. They illustrate the security and reliability concerns in the context of a large-scale data processing infrastructure. They review the available solutions and their limitations in supporting security and reliability within the context of MapReduce frameworks. The authors conclude by describing the ongoing evolution of such solutions and possible directions for improvement, which could be challenging research opportunities for academic researchers.
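
For readers unfamiliar with the paradigm, a minimal single-process sketch of the map and reduce functions for the canonical word-count example (illustrative only; a real Hadoop job implements Mapper and Reducer classes in Java and the framework handles the shuffle):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Emit an intermediate (word, 1) pair for every word.
    for word in document.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Sum the counts for one key; the framework groups keys for us.
    return (word, sum(counts))

docs = ["map reduce map", "reduce shuffle sort"]
pairs = sorted(p for d in docs for p in map_phase(d))  # shuffle + sort
result = [reduce_phase(k, (c for _, c in g))
          for k, g in groupby(pairs, key=itemgetter(0))]
print(result)  # [('map', 2), ('reduce', 2), ('shuffle', 1), ('sort', 1)]
```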


2020 ◽  
Vol 2020 ◽  
pp. 1-7
Author(s):  
Tuozhong Yao ◽  
Wenfeng Wang ◽  
Yuhong Gu

Multiview active learning (MAL) is a technique that can shrink the version space far more than traditional active learning and has great potential in large-scale data analysis. In this paper, we present a new deep multiview active learning (DMAL) framework, which is the first to combine multiview active learning and deep learning for annotation-effort reduction. In this framework, our approach advances existing active learning methods in two respects. First, we incorporate two different deep convolutional neural networks into active learning, using complementary multiview information to improve feature learning. Second, through the properly designed framework, the feature representation and the classifier can be updated simultaneously with progressively annotated informative samples. Experiments on two challenging image datasets demonstrate that our proposed DMAL algorithm achieves more promising results than several state-of-the-art active learning algorithms.
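
A minimal sketch of the multiview selection idea (a generic disagreement-based criterion, not the exact DMAL architecture): two view-specific models score the unlabeled pool, and the samples on which the views disagree most are queried for annotation. The logistic scorers below are stand-ins for the paper's two CNNs, and all data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def view_probability(features, weights):
    """Stand-in for one view's CNN classifier: a logistic score."""
    return 1.0 / (1.0 + np.exp(-features @ weights))

# Hypothetical unlabeled pool with two feature views per sample.
pool_view1 = rng.normal(size=(1000, 16))
pool_view2 = rng.normal(size=(1000, 32))
w1 = rng.normal(size=16)
w2 = rng.normal(size=32)

p1 = view_probability(pool_view1, w1)
p2 = view_probability(pool_view2, w2)

# Query the samples on which the two views disagree the most; these
# are the ones most likely to shrink the version space.
disagreement = np.abs(p1 - p2)
query_indices = np.argsort(disagreement)[-10:]
print(query_indices)
```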


2018 ◽  
Vol 7 (3.8) ◽  
pp. 16
Author(s):  
Md Tahsir Ahmed Munna ◽  
Shaikh Muhammad Allayear ◽  
Mirza Mohtashim Alam ◽  
Sheikh Shah Mohammad Motiur Rahman ◽  
Md Samadur Rahman ◽  
...  

MapReduce has become a popular programming model for processing large-scale data sets in a parallel, distributed fashion on a cluster. Hadoop MapReduce is especially needed for large-scale workloads such as big data processing. In this paper, we modify the Hadoop MapReduce algorithm and implement the modification to reduce processing time.
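
The abstract does not detail the modification; as one common illustration of how a MapReduce pipeline's processing time can be reduced, here is a sketch of a combiner step that pre-aggregates map output locally before the shuffle, cutting the volume of intermediate records (a generic optimization, not necessarily the authors' modification).

```python
from collections import Counter

def map_phase(document):
    for word in document.split():
        yield (word, 1)

def combine(mapper_output):
    # Runs on the mapper node: merge (word, 1) pairs locally so that
    # fewer records are sent across the network to the reducers.
    local = Counter()
    for word, count in mapper_output:
        local[word] += count
    return local.items()

doc = "the quick the lazy the dog"
raw = list(map_phase(doc))        # 6 intermediate records
combined = list(combine(raw))     # 4 records cross the network
print(len(raw), len(combined))    # 6 4
```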


2019 ◽  
Vol 48 (4) ◽  
pp. 673-681
Author(s):  
Shufen Zhang ◽  
Zhiyu Liu ◽  
Xuebin Chen ◽  
Changyin Luo

To overcome the difficulties that the traditional K-Means clustering algorithm faces on large-scale data sets, a Hadoop K-Means (HKM) clustering algorithm is proposed. Firstly, the algorithm eliminates the effect of noise points in the data set according to sample density. Secondly, it optimizes the selection of the initial center points using the max-min distance principle. Finally, it uses the MapReduce programming model to parallelize the computation. Experimental results show that the proposed algorithm not only achieves high accuracy and stability in its clustering results, but also addresses the scalability problems that traditional clustering algorithms encounter on large-scale data.
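
A minimal sketch of max-min distance center initialization as described (the density-based noise elimination and the MapReduce parallelization are omitted; the data here is synthetic):

```python
import numpy as np

def max_min_init(points, k):
    """Pick k initial centers: each new center is the point whose
    distance to its nearest already-chosen center is largest."""
    centers = [points[0]]  # seed with an arbitrary point
    for _ in range(k - 1):
        dists = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    return np.array(centers)

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 2))
print(max_min_init(data, k=3))
```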


2021 ◽  
Author(s):  
Qi Zhai ◽  
Zhigang Kan ◽  
Linhui Feng ◽  
Linbo Qiao ◽  
Feng Liu

Recently, Chinese event detection has attracted more and more attention. As characters of a logographic script, Chinese glyphs are semantically useful but still unexplored in this task. In this paper, we propose a novel Glyph-Aware Fusion Network, named GlyFN, which introduces glyph information into the pre-trained language model representation. To obtain a better representation, we design a Vector Linear Fusion mechanism to fuse the two: it first applies max-pooling to capture salient information, and then uses a linear operation on the vectors to retain each one's unique information. Moreover, for large-scale unstructured text, we distribute the data across different clusters for parallel processing. Finally, we conduct extensive experiments on ACE2005 and large-scale data. Experimental results show that GlyFN obtains increases of 7.48 (10.18%) and 6.17 (8.7%) in F1-score for trigger identification and classification over the state-of-the-art methods, respectively. Furthermore, the event detection task for large-scale unstructured text can be accomplished efficiently through this distribution.
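
A hedged sketch of the described fusion idea, assuming max-pooling over the two representations captures the salient features and a learned linear map retains each view's unique information; the exact formulation in the paper may differ, and the dimensions, weights, and combination rule below are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 768  # hidden size of the pre-trained language model (illustrative)

lm_repr = rng.normal(size=d)     # token representation from the LM
glyph_repr = rng.normal(size=d)  # glyph representation (e.g., from a CNN)

# Element-wise max-pooling across the two vectors keeps the most
# salient features from either view...
pooled = np.maximum(lm_repr, glyph_repr)

# ...while a linear operation on the concatenated vectors retains
# each view's unique information (random stand-ins for trained weights).
W = rng.normal(size=(d, 2 * d)) * 0.01
linear = W @ np.concatenate([lm_repr, glyph_repr])

fused = pooled + linear  # assumed combination of the two branches
print(fused.shape)       # (768,)
```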


2020 ◽  
Author(s):  
Than Le

In this paper, we focus on a simple data-driven deep learning approach that implements the Mask R-CNN module and relies on deeper manipulation of the datasets. We first apply affine transformations and projective representations for data augmentation, manually enlarging the dataset in line with the state of the art in computer vision. We then evaluate our method concretely by visualizing our datasets and testing against many methods for object detection and segmentation, using more than 5000 images of many similar objects. Finally, we illustrate its efficiency in small applications such as food recognition and grasping and manipulation in robotics.
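
A minimal sketch of affine and projective (perspective) augmentation using OpenCV, the kind of transformation pipeline described; the file paths and transform parameters below are illustrative choices, not the paper's settings.

```python
import cv2
import numpy as np

image = cv2.imread("sample.jpg")  # hypothetical input image
h, w = image.shape[:2]

# Affine augmentation: rotate 15 degrees about the center, scale 1.1x.
M_affine = cv2.getRotationMatrix2D((w / 2, h / 2), angle=15, scale=1.1)
affine_aug = cv2.warpAffine(image, M_affine, (w, h))

# Projective augmentation: map the image corners to a skewed quad.
src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
dst = np.float32([[w * 0.05, h * 0.05], [w * 0.95, 0], [w, h], [0, h * 0.9]])
M_proj = cv2.getPerspectiveTransform(src, dst)
projective_aug = cv2.warpPerspective(image, M_proj, (w, h))

cv2.imwrite("affine_aug.jpg", affine_aug)
cv2.imwrite("projective_aug.jpg", projective_aug)
```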


CONVERTER ◽  
2021 ◽  
pp. 116-127
Author(s):  
Ajay Bansal ◽  
Manmohan Sharma ◽  
Ashu Gupta

One of the most challenging aspects of using MapReduce to parallelize and distribute large-scale data processing is detecting straggler tasks, that is, identifying tasks running on weak nodes. The total computation time is the sum of the execution times of the two stages in the Map phase (copy, combine) and the three stages in the Reduce phase (shuffle, sort, and reduce). The main aim of this paper is to estimate the execution time accurately in each location. The proposed approach uses a backpropagation neural network on Hadoop to detect straggler tasks and calculate the remaining task execution time, which is crucial for straggler identification. A comparative analysis is carried out against efficient models in this domain, such as LATE and ESAMR, and against the real remaining time, on the WordCount and Sort benchmarks. It was found that the proposed model is capable of detecting straggler tasks by accurately estimating execution time, and it also helps reduce the time it takes to complete a task.
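
A minimal sketch of the core estimation step, assuming a small backpropagation-trained network that maps per-stage progress features to remaining execution time; the feature set, training data, and straggler threshold below are hypothetical, and the paper's actual model may differ.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)

# Hypothetical features per running task: progress fraction of each
# stage (copy, combine, shuffle, sort, reduce) plus node load.
X_train = rng.uniform(size=(2000, 6))
# Synthetic target: remaining seconds (stand-in for historical logs).
y_train = 100 * (1 - X_train[:, :5].mean(axis=1)) * (1 + X_train[:, 5])

model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=500,
                     random_state=0).fit(X_train, y_train)

# Flag a task as a straggler when its estimated remaining time far
# exceeds the median estimate across its peers (illustrative rule).
estimates = model.predict(rng.uniform(size=(50, 6)))
stragglers = np.where(estimates > 2 * np.median(estimates))[0]
print(stragglers)
```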


2015 ◽  
Vol 25 (04) ◽  
pp. 1550009 ◽  
Author(s):  
N. P. Gopalan ◽  
S. Suresh

Hadoop is a widely used open-source implementation of MapReduce, a popular programming model for the parallel processing of large-scale, data-intensive applications in a cloud environment. Sharing Hadoop clusters involves a tradeoff between fairness and data locality. When launching a local task is not possible, the Hadoop Fair Scheduler (HFS) with delay scheduling postpones node allocation for a while for the job that is next to be scheduled according to fairness, in order to achieve high locality. This waiting is wasted when the desired locality cannot be achieved within a reasonable period. In this paper, a modified delay scheduling scheme for HFS is proposed and implemented in Hadoop. It avoids the aforementioned waiting when achieving locality is not possible. Instead of blindly waiting for a local node, the proposed algorithm first estimates the time to wait for a local node for the job and skips the wait whenever locality cannot be achieved within the predefined delay threshold, while still accomplishing the same locality. The performance of the proposed algorithm is evaluated through extensive experiments, and it has been observed that the algorithm performs significantly better in terms of response time and fairness, achieving up to a 20% speedup and up to 38% better fairness in certain cases.
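
A minimal sketch of the modified waiting decision (the wait-time estimation model is reduced to a supplied estimate; the class, function names, and threshold are illustrative):

```python
def should_wait(estimated_wait_for_local_node, delay_threshold):
    """Classic delay scheduling always waits up to the threshold; the
    modification skips the wait entirely when the estimated time until
    a local node frees up already exceeds it."""
    return estimated_wait_for_local_node <= delay_threshold

def schedule(job, free_node, estimate_local_wait, delay_threshold=3.0):
    # A node holding the job's input data gets a task immediately.
    if free_node in job.local_nodes:
        return "launch local task"
    # Otherwise, wait only if a local node is expected soon enough.
    if should_wait(estimate_local_wait(job), delay_threshold):
        return "wait for a local node"
    return "launch non-local task now"  # waiting would be wasted

class Job:
    def __init__(self, local_nodes):
        self.local_nodes = set(local_nodes)

job = Job(local_nodes=["node-1", "node-2"])
print(schedule(job, "node-7", estimate_local_wait=lambda j: 10.0))
# -> launch non-local task now
```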


2016 ◽  
Vol 6 (1) ◽  
pp. 59-87 ◽  
Author(s):  
Amer Al-Badarneh ◽  
Amr Mohammad ◽  
Salah Harb

MapReduce, a distinguished and successful platform for parallel data processing, is attracting significant momentum from both academia and industry as the volume of data to capture, transform, and analyse grows rapidly. Although MapReduce is used in many applications to analyse large-scale data sets, there is still a lot of debate among scientists and researchers about its efficiency, performance, and usability for supporting more classes of applications. This survey presents a comprehensive review of various implementations of the MapReduce framework. Initially, the authors give an overview of the MapReduce programming model. They then present a broad description of various technical aspects of the most successful implementations of the MapReduce framework reported in the literature and discuss their main strengths and weaknesses. Finally, the authors conclude by presenting a comparison between MapReduce implementations and discussing open issues and challenges in enhancing MapReduce.

