Significance of hierarchical and partitioning based clustering in grouping aware data placement for data intensive applications

In this data era, massive volumes of data are generated every second in a variety of domains such as geoscience, the social web, finance, e-commerce, health care, climate modelling, physics, astronomy, and government sectors. Hadoop is well recognized as the de facto big data processing platform and has been extensively adopted in many application domains. Even though it is considered an efficient solution for complex query processing, it has limitations when the data to be processed exhibit interest locality. The data required for query execution follow a grouping behavior wherein only a part of the big data is accessed frequently. In such scenarios, the time taken to execute a query and return results increases exponentially as the amount of data increases, leading to long waiting times for the user. Since the Hadoop default data placement strategy (HDDPS) does not consider such grouping behavior, it performs inefficiently, resulting in shortcomings such as fewer local map task executions and increased query execution time. Hence, we propose an Optimal Data Placement Strategy (ODPS) based on grouping semantics. In this paper we examine the significance of two promising clustering techniques, Hierarchical Agglomerative Clustering (HAC) and Markov Clustering (MCL), in grouping-aware data placement for data-intensive applications with interest locality. Initially, the user access pattern is identified by dynamically analyzing the history log. Then both clustering techniques (HAC and MCL) are applied separately over the access pattern to obtain independent clusters. These clusters are interpreted and validated to extract the Optimal Data Groupings (ODG). Finally, the proposed strategy reorganizes the default data layouts in HDFS based on the ODG to achieve maximum parallel execution per group, subject to the Load Balancer and Rack Awareness.
The proposed strategy is tested on a 10-node, multi-rack cluster deployed on a cloud platform, with Hadoop installed on every node. It reduces the query execution time, significantly improves data locality, and proves more efficient for processing massive datasets in a heterogeneous distributed environment. MCL also shows marginally better performance than HAC for queries exhibiting interest locality.
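
The pipeline described in this abstract (access pattern from a history log, clustering, then group-aware placement) can be illustrated with a minimal sketch. This is not the authors' implementation. The log format, the Jaccard co-access similarity, the average-linkage merge rule, the 0.5 threshold, and the names `co_access_similarity`, `hac_groups`, and `place_groups` are all assumptions made for illustration. The placement step simply spreads each group's blocks over distinct nodes and ignores replication, load balancing, and rack awareness.

```python
from itertools import combinations

def co_access_similarity(log):
    """Jaccard similarity between blocks, based on the queries that read them."""
    queries_by_block = {}
    for query_id, blocks in log.items():
        for block in blocks:
            queries_by_block.setdefault(block, set()).add(query_id)
    sim = {}
    for a, b in combinations(sorted(queries_by_block), 2):
        qa, qb = queries_by_block[a], queries_by_block[b]
        sim[(a, b)] = len(qa & qb) / len(qa | qb)
    return sorted(queries_by_block), sim

def pair_sim(sim, a, b):
    # Similarities are stored once per unordered pair, keyed in sorted order.
    return sim[(a, b)] if a < b else sim[(b, a)]

def average_linkage(sim, c1, c2):
    total = sum(pair_sim(sim, a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def hac_groups(log, threshold=0.5):
    """Agglomerative clustering: merge the closest clusters until none is close enough."""
    blocks, sim = co_access_similarity(log)
    clusters = [[b] for b in blocks]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(sim, clusters[ij[0]], clusters[ij[1]]),
        )
        if average_linkage(sim, clusters[i], clusters[j]) < threshold:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)

def place_groups(groups, nodes):
    """Spread the blocks of each group over different nodes (round robin)."""
    placement, offset = {}, 0
    for group in groups:
        for k, block in enumerate(group):
            placement[block] = nodes[(offset + k) % len(nodes)]
        offset += len(group)
    return placement

# Hypothetical history log: query id -> blocks read by that query.
history_log = {
    "q1": ["b1", "b2", "b3"],
    "q2": ["b1", "b2"],
    "q3": ["b4", "b5"],
    "q4": ["b4", "b5", "b3"],
    "q5": ["b1", "b2"],
}
groups = hac_groups(history_log)
placement = place_groups(groups, ["n1", "n2", "n3"])
print(groups)
print(placement)
```

Spreading a group's blocks over different nodes reflects the stated goal of maximum parallel execution per group: a query touching the whole group can then run its map tasks locally on several nodes at once.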

DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications with Interest Locality

International Journal of Recent Trends in Engineering and Research ◽

10.23883/ijrter.2018.4333.w8uli ◽

2018 ◽

Vol 4 (6) ◽

pp. 172-181

Keyword(s):

Data Placement ◽

Data Intensive ◽

Data Grouping ◽

Data Intensive Applications

Genetic Based Data Placement for Geo-Distributed Data-Intensive Applications in Cloud Computing

Lecture Notes in Computer Science - Advances in Services Computing ◽

10.1007/978-3-319-49178-3_20 ◽

2016 ◽

pp. 253-265 ◽

Cited By ~ 1

Author(s):

Weifeng Fan ◽

Jun Peng ◽

Xiaoyong Zhang ◽

Zhiwu Huang

Keyword(s):

Cloud Computing ◽

Data Placement ◽

Distributed Data ◽

Data Intensive ◽

Data Intensive Applications

BRPS: A Big Data Placement Strategy for Data Intensive Applications

2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW) ◽

10.1109/icdmw.2016.0120 ◽

2016 ◽

Cited By ~ 4

Author(s):

Lihui Liu ◽

Junping Song ◽

Haibo Wang ◽

Pin Lv

Keyword(s):

Big Data ◽

Data Placement ◽

Data Intensive ◽

Data Intensive Applications

Heuristic Data Placement for Data-Intensive Applications in Heterogeneous Cloud

Journal of Electrical and Computer Engineering ◽

10.1155/2016/3516358 ◽

2016 ◽

Vol 2016 ◽

pp. 1-8 ◽

Cited By ~ 4

Author(s):

Qing Zhao ◽

Congcong Xiong ◽

Peng Wang

Keyword(s):

Clustering Algorithm ◽

Recursive Partitioning ◽

Data Placement ◽

Data Intensive ◽

High Bandwidth ◽

Tree Data ◽

Placement Algorithm ◽

Heterogeneous Cloud ◽

The Cost ◽

Data Intensive Applications

Data placement is an important issue that aims at reducing the cost of internode data transfers in the cloud, especially for data-intensive applications, in order to improve the performance of the entire cloud system. This paper proposes an improved data placement algorithm for heterogeneous cloud environments. In the initialization phase, a data clustering algorithm based on data dependency clustering and recursive partitioning is presented, which incorporates both data size and fixed position. A heuristic tree-to-tree data placement strategy is then proposed so that frequent data movements occur on high-bandwidth channels. Simulation results show that, compared with two classical strategies, this strategy effectively reduces the amount of data transmission and its time consumption during execution.
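
The idea of dependency-based recursive partitioning can be sketched as follows. This is a simplified illustration, not the algorithm from the paper. The dependency measure (number of tasks that read both datasets), the cut rule (sever the least dependency), the capacity limit, and the names `dependency`, `cross_dependency`, and `recursive_partition` are assumptions. Fixed-position datasets and the tree-to-tree mapping onto the network are not modeled.

```python
def dependency(tasks, a, b):
    """Assumed measure: number of tasks that read both datasets a and b."""
    return sum(1 for used in tasks.values() if a in used and b in used)

def cross_dependency(tasks, left, right):
    """Total dependency severed if left and right are stored apart."""
    return sum(dependency(tasks, a, b) for a in left for b in right)

def recursive_partition(tasks, datasets, sizes, capacity):
    """Split an ordered dataset list until every part fits within capacity."""
    if len(datasets) == 1 or sum(sizes[d] for d in datasets) <= capacity:
        return [datasets]
    # Try every cut point and keep the one that severs the least dependency.
    cut = min(
        range(1, len(datasets)),
        key=lambda p: cross_dependency(tasks, datasets[:p], datasets[p:]),
    )
    return (recursive_partition(tasks, datasets[:cut], sizes, capacity)
            + recursive_partition(tasks, datasets[cut:], sizes, capacity))

# Hypothetical workload: task id -> datasets it reads; sizes in GB.
tasks = {
    "t1": {"d1", "d2"},
    "t2": {"d1", "d2"},
    "t3": {"d2", "d3"},
    "t4": {"d3", "d4"},
    "t5": {"d3", "d4"},
}
sizes = {"d1": 40, "d2": 30, "d3": 50, "d4": 20}
parts = recursive_partition(tasks, ["d1", "d2", "d3", "d4"], sizes, capacity=80)
print(parts)
```

In this toy workload the cut between `d2` and `d3` severs only one shared task, so the strongly coupled pairs stay together and transfers between storage locations are limited to that single dependency.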

Efficient location-aware data placement for data-intensive applications in geo-distributed scientific data centers

Tsinghua Science & Technology ◽

10.1109/tst.2016.7590316 ◽

2016 ◽

Vol 21 (5) ◽

pp. 471-481 ◽

Cited By ~ 13

Author(s):

Jinghui Zhang ◽

Jian Chen ◽

Junzhou Luo ◽

Aibo Song

Keyword(s):

Data Centers ◽

Data Placement ◽

Scientific Data ◽

Data Intensive ◽

Location Aware ◽

Data Intensive Applications

Special Issue on Infrastructures and Algorithms for Scalable Computing

Scalable Computing Practice and Experience ◽

10.12694/scpe.v19i3.1441 ◽

2018 ◽

Vol 19 (3) ◽

pp. iii-iv

Author(s):

Sasko Ristov

Keyword(s):

Sparse Matrices ◽

Data Placement ◽

Greedy Heuristic ◽

Special Issue ◽

Clustering Techniques ◽

Scalable Computing ◽

Data Intensive ◽

Wide Range ◽

Markov Clustering ◽

Data Intensive Applications

We are happy to present this special issue of the scientific journal Scalable Computing: Practice and Experience. In this special issue on Infrastructures and Algorithms for Scalable Computing (Volume 19, No 3, June 2018), we have selected four papers out of nine submitted, all of which have gone through peer review according to the journal policy. All papers present novel results in the fields of distributed algorithms and infrastructures for scalable computing. The first paper presents a novel approach for efficient data placement, which improves the performance of workflow execution in distributed datacenters. The greedy heuristic algorithm, which is based on a network flow optimization framework, minimizes the total storage cost, including efforts to move and store the data from different source locations and dependencies. The second paper evaluates the significance of different clustering techniques, namely k-means, Hierarchical Agglomerative Clustering, and Markov Clustering, in grouping-aware data placement for data-intensive applications with interest locality. The evaluation in Azure reports that the Markov Clustering-based data placement strategy improves local map execution and reduces the execution time compared to Hadoop's Default Data Placement Strategy and the other evaluated clustering techniques. The improvement is more pronounced for data-intensive applications that have interest locality. The third paper presents an experimental evaluation of OpenMP thread-mapping strategies in different hardware environments (Intel Xeon Phi coprocessor and hybrid CPU-MIC platforms). The paper shows the choice of thread affinity, number of threads, and execution mode that can provide optimal performance of the LU factorization. In the fourth paper, the authors study the amount of memory occupied by sparse matrices split up into same-size blocks. The paper considers and statistically evaluates four popular storage formats and combinations among them.
The conclusion is that block-based storage formats may significantly reduce the memory footprints of sparse matrices arising from a wide range of application domains. We use this opportunity to thank all contributors to this Special Issue: all authors who submitted the results of their latest research and all reviewers for their valuable comments and suggestions for improvement. We would like to express our special gratitude to the Editor-in-Chief, Professor Dana Petcu, for her constant support during the whole process of this Special Issue.
