Significance of Hierarchical and Markov Clustering in Grouping Aware Data Placement for Data Intensive Applications Having Interest Locality

Vengadeswaran Shanmugasundaram; Balasundaram Sadhu Ramakrishnan

doi:10.12694/scpe.v19i3.1375

Significance of Hierarchical and Markov Clustering in Grouping Aware Data Placement for Data Intensive Applications Having Interest Locality

Scalable Computing Practice and Experience ◽

10.12694/scpe.v19i3.1375 ◽

2018 ◽

Vol 19 (3) ◽

pp. 245-258

Author(s):

Vengadeswaran Shanmugasundaram ◽

Balasundaram Sadhu Ramakrishnan

Keyword(s):

Big Data ◽

Data Placement ◽

Query Execution ◽

Access Pattern ◽

Clustering Techniques ◽

Data Intensive ◽

Markov Clustering ◽

Default Data ◽

Data Intensive Applications ◽

Grouping Behavior

In this data era, massive volumes of data are being generated every second in variety of domains such as Geoscience, Social Web, Finance, e-Commerce, Health Care, Climate modelling, Physics, Astronomy, Government sectors etc. Hadoop has been well-recognized as de factobig data processing platform that have been extensively adopted, and is currently widely used, in many application domains processing Big Data. Even though it is considered as an efficient solution for such complex query processing, it has its own limitation when the data to be processed exhibit interest locality. The data required for any query execution follows grouping behavior wherein only a part of the Big-Data is accessed frequently. During such scenarion, the time taken to execute a queryand return results, increases exponentially as the amount of data increases leading to much waiting time for the user. Since Hadoop default data placement strategy (HDDPS) does not consider such grouping behavior, it does not perform efficiently resulting in lacunas such as decreased local map task execution, increased query execution time etc. Hence proposed an Optimal Data Placement Strategy (ODPS) based on grouping semantics. In this paper we experiment the significance oftwo most promising clustering techniques viz. Hierarchical Agglomerative Clustering (HAC) and Markov Clustering (MCL) in grouping aware data placement for data intensive applications having interest locality. Initially user access pattern is identified by dynamically analyzing history log.Then both clustering techniques (HAC & MCL) are separately applied over the access pattern to obtain independent clusters. These clusters are interpreted and validated to extract the Optimal Data Groupings (ODG). Finally proposed strategy reorganizes the default data layouts in HDFSbased on ODG to achieve maximum parallel execution per group subjective to Load Balancer and Rack Awareness. Our proposed strategy is tested in 10 node cluster placed in a multi rack with Hadoop installed in every node deployed in cloud platform. Proposed strategy reduces the query execution time, significantly improves the data locality and has proved to be more efficient for massive datasets processing in heterogeneous distributed environment. Also MCL shows a marginal improved performance over HAC for queries exhibiting interest localities.

Download Full-text

Special Issue on Infrastructures and Algorithms for Scalable Computing

Scalable Computing Practice and Experience ◽

10.12694/scpe.v19i3.1441 ◽

2018 ◽

Vol 19 (3) ◽

pp. iii-iv

Author(s):

Sasko Ristov

Keyword(s):

Sparse Matrices ◽

Data Placement ◽

Greedy Heuristic ◽

Special Issue ◽

Clustering Techniques ◽

Scalable Computing ◽

Data Intensive ◽

Wide Range ◽

Markov Clustering ◽

Data Intensive Applications

We are happy to present this special issue of the scientific journal Scalable Computing: Practice and Experience. In this special issue on Infrastructures and Algorithms for Scalable Computing (Volume 19, No 3 June 2018), we have selected four papers out of submitted nine, which gone through a peer review according to the journal policy. All papers represent novel results in the fields of distributed algorithms and infrastructures for scalable computing. The first paper presents present a novel approach for efficient data placement, which improves the performance of workflow execution in distributed datacenters. The greedy heuristic algorithm, which is based on a network flow optimization framework, minimizes the total storage cost, including efforts to move and store the data from different source locations and dependencies. The second paper evaluated the significance of different clustering techniques viz. k-means, Hierarchical Agglomerative Clustering and Markov Clustering in groupingawaredata placement for data-intensive applications with interest locality. The evaluation in Azure reported that Markov Clustering-based data placement strategy improves the local map execution and reduces the execution time compared to Hadoops Default Data Placement Strategy and other evaluated clustering techniques. This is more emphasized for data-intensive applications that have interest locality. The third paper presents an experimental evaluation of the openMP thread-mapping strategies in different hardware environments (IntelXeon Phi coprocessor and hybrid CPU-MIC platforms). The paper shows the optimal choice of thread affinity, the number of threads and the execution mode that can provide optimal performance of the LU factorization. In the fourth paper, the authors study the amount of memory occupied by sparse matrices split up into same-size blocks. The paper considers and statistically evaluates four popular storage formats and combinations among them. The conclusion is that block-based storage formats may significantly reduce memory footprints of sparse matrices arising from a wide range of application domains. We use this opportunity to thank all contributors to this Special Issue: all authors who submitted the results of their latest research and all reviewers for their valuable comments and suggestions for improvement. We would like to express our special gratitude for the Editor-in-Chief, Professor Dana Petcu, for her constant support during the whole process of this Special Issue.

Download Full-text

An Optimal Data Placement Strategy for Improving System Performance of Massive Data Applications Using Graph Clustering

International Journal of Ambient Computing and Intelligence ◽

10.4018/ijaci.2018070102 ◽

2018 ◽

Vol 9 (3) ◽

pp. 15-30 ◽

Cited By ~ 4

Author(s):

S. Vengadeswaran ◽

S. R. Balasundaram

Keyword(s):

Big Data ◽

Execution Time ◽

Clustering Algorithm ◽

Graph Clustering ◽

Data Placement ◽

Data Locality ◽

Query Execution ◽

Data Set ◽

Statistical Measures ◽

Default Data

This article describes how the time taken to execute a query and return the results, increase exponentially as the data size increases, leading to more waiting times of the user. Hadoop with its distributed processing capability is considered as an efficient solution for processing such large data. Hadoop's Default Data Placement Strategy (HDDPS) allocates the data blocks randomly across the cluster of nodes without considering any of the execution parameters. This result in non-availability of the blocks required for execution in local machine so that the data has to be transferred across the network for execution, leading to data locality issue. Also, it is commonly observed that most of the data intensive applications show grouping semantics. Hence during query execution, only a part of the Big-Data set is utilized. Since such execution parameters and grouping behavior are not considered, the default placement does not perform well resulting in several lacunas such as decreased local map task execution, increased query execution time, query latency, etc. In order to overcome such issues, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. Initially, user history log is dynamically analyzed for identifying access pattern which is depicted as a graph. Markov clustering, a Graph clustering algorithm is applied to identify groupings among the dataset. Then, an Optimal Data Placement Algorithm (ODPA) is proposed based on the statistical measures estimated from the clustered graph. This in turn re-organizes the default data layouts in HDFS to achieve improved performance for Big-Data sets in heterogeneous distributed environment. Our proposed strategy is tested in a 15 node cluster placed in a single rack topology. The result has proved to be more efficient for massive datasets, reducing query execution time by 26% and significantly improves the data locality by 38% compared to HDDPS.

Download Full-text

BRPS: A Big Data Placement Strategy for Data Intensive Applications

2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW) ◽

10.1109/icdmw.2016.0120 ◽

2016 ◽

Cited By ~ 4

Author(s):

Lihui Liu ◽

Junping Song ◽

Haibo Wang ◽

Pin Lv

Keyword(s):

Big Data ◽

Data Placement ◽

Data Intensive ◽

Data Intensive Applications

Download Full-text

NoSQL Databases

Advances in Data Mining and Database Management - Handbook of Research on Cloud Infrastructures for Big Data Analytics ◽

10.4018/978-1-4666-5864-6.ch008 ◽

2014 ◽

pp. 186-215 ◽

Cited By ~ 2

Author(s):

Ganesh Chandra Deka

Keyword(s):

Cloud Computing ◽

Big Data ◽

Data Processing ◽

Open Source ◽

Data Storage ◽

Big Data Processing ◽

Nosql Databases ◽

Data Intensive ◽

Huge Data ◽

Data Intensive Applications

NoSQL databases are designed to meet the huge data storage requirements of cloud computing and big data processing. NoSQL databases have lots of advanced features in addition to the conventional RDBMS features. Hence, the “NoSQL” databases are popularly known as “Not only SQL” databases. A variety of NoSQL databases having different features to deal with exponentially growing data-intensive applications are available with open source and proprietary option. This chapter discusses some of the popular NoSQL databases and their features on the light of CAP theorem.

Download Full-text

Optimizing VM allocation and data placement for data-intensive applications in cloud using ACO metaheuristic algorithm

Engineering Science and Technology an International Journal ◽

10.1016/j.jestch.2016.11.006 ◽

2017 ◽

Vol 20 (2) ◽

pp. 616-628 ◽

Cited By ~ 25

Author(s):

T.P. Shabeera ◽

S.D. Madhu Kumar ◽

Sameera M. Salam ◽

K. Murali Krishnan

Keyword(s):

Data Placement ◽

Metaheuristic Algorithm ◽

Data Intensive ◽

Vm Allocation ◽

Data Intensive Applications

Download Full-text

Grouping-Aware Data Placement in HDFS for Data-Intensive Applications Based on Graph Clustering

Advances in Computer and Computational Sciences - Advances in Intelligent Systems and Computing ◽

10.1007/978-981-10-3773-3_3 ◽

2017 ◽

pp. 21-31 ◽

Cited By ~ 2

Author(s):

S. Vengadeswaran ◽

S. R. Balasundaram

Keyword(s):

Graph Clustering ◽

Data Placement ◽

Data Intensive ◽

Data Intensive Applications

Download Full-text

Analysis of big data for data-intensive applications

2016 International Conference on Recent Advances and Innovations in Engineering (ICRAIE) ◽

10.1109/icraie.2016.7939551 ◽

2016 ◽

Author(s):

Meenu Dave ◽

Hemant Kumar Gianey

Keyword(s):

Big Data ◽

Data Intensive ◽

Data Intensive Applications

Download Full-text

Location-aware associated data placement for geo-distributed data-intensive applications

2015 IEEE Conference on Computer Communications (INFOCOM) ◽

10.1109/infocom.2015.7218428 ◽

2015 ◽

Cited By ~ 32

Author(s):

Boyang Yu ◽

Jianping Pan

Keyword(s):

Data Placement ◽

Distributed Data ◽

Data Intensive ◽

Location Aware ◽

Data Intensive Applications ◽

Associated Data

Download Full-text

DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications with Interest Locality

International Journal of Recent Trends in Engineering and Research ◽

10.23883/ijrter.2018.4333.w8uli ◽

2018 ◽

Vol 4 (6) ◽

pp. 172-181

Keyword(s):

Data Placement ◽

Data Intensive ◽

Data Grouping ◽

Data Intensive Applications

Download Full-text

Decoding Big Data Analytics for Emerging Business Through Data-Intensive Applications and Business Intelligence

Big Data Analytics for Sustainable Computing - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-9750-6.ch004 ◽

2020 ◽

pp. 66-80

Author(s):

Vinay Kellengere Shankarnarayan

Keyword(s):

Big Data ◽

Business Intelligence ◽

Business Models ◽

Big Data Analytics ◽

Research Area ◽

Future Perspective ◽

Massive Data ◽

Data Intensive ◽

Tools And Techniques ◽

Data Intensive Applications

In recent years, big data have gained massive popularity among researchers, decision analysts, and data architects in any enterprise. Big data had been just another way of saying analytics. In today's world, the company's capital lies with big data. Think of worlds huge companies. The value they offer comes from their data, which they analyze for their proactive benefits. This chapter showcases the insight of big data and its tools and techniques the companies have adopted to deal with data problems. The authors also focus on framework and methodologies to handle the massive data in order to make more accurate and precise decisions. The chapter begins with the current organizational scenario and what is meant by big data. Next, it draws out various challenges faced by organizations. The authors also observe big data business models and different frameworks available and how it has been categorized and finally the conclusion discusses the challenges and what is the future perspective of this research area.

Download Full-text