Heuristic Data Placement for Data-Intensive Applications in Heterogeneous Cloud

Data placement is an important issue which aims at reducing the cost of internode data transfers in cloud especially for data-intensive applications, in order to improve the performance of the entire cloud system. This paper proposes an improved data placement algorithm for heterogeneous cloud environments. In the initialization phase, a data clustering algorithm based on data dependency clustering and recursive partitioning has been presented, and both the factor of data size and fixed position are incorporated. And then a heuristic tree-to-tree data placement strategy is advanced in order to make frequent data movements occur on high-bandwidth channels. Simulation results show that, compared with two classical strategies, this strategy can effectively reduce the amount of data transmission and its time consumption during execution.

Download Full-text

A Data Placement Algorithm for Data Intensive Applications in Cloud

International Journal of Grid and Distributed Computing ◽

10.14257/ijgdc.2016.9.2.13 ◽

2016 ◽

Vol 9 (2) ◽

pp. 145-156 ◽

Cited By ~ 8

Author(s):

Qing Zhao ◽

Congcong Xiong ◽

Kunyu Zhang ◽

Yang Yue ◽

Jucheng Yang

Keyword(s):

Data Placement ◽

Data Intensive ◽

Placement Algorithm ◽

Data Intensive Applications

Download Full-text

A Data Placement Strategy for Data-Intensive Cloud Storage

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.354-355.896 ◽

2011 ◽

Vol 354-355 ◽

pp. 896-900 ◽

Cited By ~ 2

Author(s):

Jie Ding ◽

Hai Yun Han ◽

Ai Hua Zhou

Keyword(s):

Power Systems ◽

Cloud Storage ◽

Data Centers ◽

Data Placement ◽

Distributed Environment ◽

Data Dependencies ◽

Data Intensive ◽

Consistent Hashing ◽

The Cost ◽

Data Intensive Applications

Data-Intensive applications in power systems often perform complex computations which always involve large amount of datasets. In a distributed environment, an application may needs several datasets located in different data centers which faces two challenges including the high cost of data movements between data centers and data dependencies within the same data centers. In this paper, a data placement strategy among and within data centers in a cloud environment is proposed. Datasets are placed in different centers by a clustering scheme based on the data dependencies. And within the center, data is partitioned and replicated using consistent hashing. Simulations show that the algorithm can effectively reduce the cost of data movements and perform a evenly data distribution.

Download Full-text

Custom templates based heterogeneous resource allocation for data-intensive applications

10.32469/10355/86482 ◽

2020 ◽

Author(s):

◽

Ronny Bazan Antequera

Keyword(s):

High Performance ◽

Real Data ◽

University Of Missouri ◽

Application Performance ◽

Data Intensive ◽

Edge Based ◽

The Right ◽

Heterogeneous Cloud ◽

Data Intensive Applications ◽

Cloud Resources

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI-COLUMBIA AT REQUEST OF AUTHOR.] The increase of data-intensive applications in science and engineering fields (i.e., bioinformatics, cybermanufacturing) demand the use of high-performance computing resources. However, data-intensive applications' local resources usually present limited capacity and availability due to sizable upfront costs. Moreover, using remote public resources presents constraints at the private edge network domain. Specifically, mis-configured network policies cause bottlenecks due to the other application cross-traffic attempting to use shared networking resources. Additionally, selecting the right remote resources can be cumbersome especially for those users who are interested in the application execution considering nonfunctional requirements such as performance, security and cost. The data-intensive applications have recurrent deployments and similar infrastructure requirements that can be addressed by creating templates. In this thesis, we handle applications requirements through intelligent resource 'abstractions' coupled with 'reusable' approaches that save time and effort in deploying new cloud architectures. Specifically, we design a novel custom template middleware that can retrieve blue prints of resource configuration, technical/policy information, and benchmarks of workflow performance to facilitate repeatable/reusable resource composition. The middleware considers hybrid-recommendation methodology (Online and offline recommendation) to leverage a catalog to rapidly check custom template solution correctness before/during resource consumption. Further, it prescribes application adaptations by fostering effective social interactions during the application's scaling stages. Based on the above approach, we organize the thesis contributions under two main thrusts: (i) Custom Templates for Cloud Networking for Data-intensive Applications: This involves scheduling transit selection, engineering at the campus-edge based upon real-time policy control. Our solution ensures prioritized application performance delivery for multi-tenant traffic profiles from a diverse set of actual data intensive applications in bioinformatics. (ii) Custom Templates for Cloud Computing for Data-intensive Applications: This involves recommending cloud resources for data-intensive applications based on a custom template catalog. We develop a novel expert system approach that is implemented as a middleware to abstracts data-intensive application requirements for custom templates composition. We uniquely consider heterogeneous cloud resources selection for the deployment of cloud architectures for real data-intensive applications in cybermanufacturing.

Download Full-text

Optimizing VM allocation and data placement for data-intensive applications in cloud using ACO metaheuristic algorithm

Engineering Science and Technology an International Journal ◽

10.1016/j.jestch.2016.11.006 ◽

2017 ◽

Vol 20 (2) ◽

pp. 616-628 ◽

Cited By ~ 25

Author(s):

T.P. Shabeera ◽

S.D. Madhu Kumar ◽

Sameera M. Salam ◽

K. Murali Krishnan

Keyword(s):

Data Placement ◽

Metaheuristic Algorithm ◽

Data Intensive ◽

Vm Allocation ◽

Data Intensive Applications

Download Full-text

Grouping-Aware Data Placement in HDFS for Data-Intensive Applications Based on Graph Clustering

Advances in Computer and Computational Sciences - Advances in Intelligent Systems and Computing ◽

10.1007/978-981-10-3773-3_3 ◽

2017 ◽

pp. 21-31 ◽

Cited By ~ 2

Author(s):

S. Vengadeswaran ◽

S. R. Balasundaram

Keyword(s):

Graph Clustering ◽

Data Placement ◽

Data Intensive ◽

Data Intensive Applications

Download Full-text

On the Benefits of Multipath Routing for Distributed Data-Intensive Applications with High Bandwidth Requirements and Multidomain Reach

2009 Seventh Annual Communication Networks and Services Research Conference ◽

10.1109/cnsr.2009.26 ◽

2009 ◽

Cited By ~ 11

Author(s):

Xiaomin Chen ◽

Mohit Chamania ◽

Admela Jukan ◽

André C. Drummond ◽

Nelson L. S. da Fonseca

Keyword(s):

Multipath Routing ◽

Distributed Data ◽

Data Intensive ◽

High Bandwidth ◽

Data Intensive Applications

Download Full-text

Location-aware associated data placement for geo-distributed data-intensive applications

2015 IEEE Conference on Computer Communications (INFOCOM) ◽

10.1109/infocom.2015.7218428 ◽

2015 ◽

Cited By ~ 32

Author(s):

Boyang Yu ◽

Jianping Pan

Keyword(s):

Data Placement ◽

Distributed Data ◽

Data Intensive ◽

Location Aware ◽

Data Intensive Applications ◽

Associated Data

Download Full-text

Significance of Hierarchical and Markov Clustering in Grouping Aware Data Placement for Data Intensive Applications Having Interest Locality

Scalable Computing Practice and Experience ◽

10.12694/scpe.v19i3.1375 ◽

2018 ◽

Vol 19 (3) ◽

pp. 245-258

Author(s):

Vengadeswaran Shanmugasundaram ◽

Balasundaram Sadhu Ramakrishnan

Keyword(s):

Big Data ◽

Data Placement ◽

Query Execution ◽

Access Pattern ◽

Clustering Techniques ◽

Data Intensive ◽

Markov Clustering ◽

Default Data ◽

Data Intensive Applications ◽

Grouping Behavior

In this data era, massive volumes of data are being generated every second in variety of domains such as Geoscience, Social Web, Finance, e-Commerce, Health Care, Climate modelling, Physics, Astronomy, Government sectors etc. Hadoop has been well-recognized as de factobig data processing platform that have been extensively adopted, and is currently widely used, in many application domains processing Big Data. Even though it is considered as an efficient solution for such complex query processing, it has its own limitation when the data to be processed exhibit interest locality. The data required for any query execution follows grouping behavior wherein only a part of the Big-Data is accessed frequently. During such scenarion, the time taken to execute a queryand return results, increases exponentially as the amount of data increases leading to much waiting time for the user. Since Hadoop default data placement strategy (HDDPS) does not consider such grouping behavior, it does not perform efficiently resulting in lacunas such as decreased local map task execution, increased query execution time etc. Hence proposed an Optimal Data Placement Strategy (ODPS) based on grouping semantics. In this paper we experiment the significance oftwo most promising clustering techniques viz. Hierarchical Agglomerative Clustering (HAC) and Markov Clustering (MCL) in grouping aware data placement for data intensive applications having interest locality. Initially user access pattern is identified by dynamically analyzing history log.Then both clustering techniques (HAC & MCL) are separately applied over the access pattern to obtain independent clusters. These clusters are interpreted and validated to extract the Optimal Data Groupings (ODG). Finally proposed strategy reorganizes the default data layouts in HDFSbased on ODG to achieve maximum parallel execution per group subjective to Load Balancer and Rack Awareness. Our proposed strategy is tested in 10 node cluster placed in a multi rack with Hadoop installed in every node deployed in cloud platform. Proposed strategy reduces the query execution time, significantly improves the data locality and has proved to be more efficient for massive datasets processing in heterogeneous distributed environment. Also MCL shows a marginal improved performance over HAC for queries exhibiting interest localities.

Download Full-text