Custom templates based heterogeneous resource allocation for data-intensive applications

Cloud Resources

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI-COLUMBIA AT REQUEST OF AUTHOR.] The increase of data-intensive applications in science and engineering fields (i.e., bioinformatics, cybermanufacturing) demand the use of high-performance computing resources. However, data-intensive applications' local resources usually present limited capacity and availability due to sizable upfront costs. Moreover, using remote public resources presents constraints at the private edge network domain. Specifically, mis-configured network policies cause bottlenecks due to the other application cross-traffic attempting to use shared networking resources. Additionally, selecting the right remote resources can be cumbersome especially for those users who are interested in the application execution considering nonfunctional requirements such as performance, security and cost. The data-intensive applications have recurrent deployments and similar infrastructure requirements that can be addressed by creating templates. In this thesis, we handle applications requirements through intelligent resource 'abstractions' coupled with 'reusable' approaches that save time and effort in deploying new cloud architectures. Specifically, we design a novel custom template middleware that can retrieve blue prints of resource configuration, technical/policy information, and benchmarks of workflow performance to facilitate repeatable/reusable resource composition. The middleware considers hybrid-recommendation methodology (Online and offline recommendation) to leverage a catalog to rapidly check custom template solution correctness before/during resource consumption. Further, it prescribes application adaptations by fostering effective social interactions during the application's scaling stages. Based on the above approach, we organize the thesis contributions under two main thrusts: (i) Custom Templates for Cloud Networking for Data-intensive Applications: This involves scheduling transit selection, engineering at the campus-edge based upon real-time policy control. Our solution ensures prioritized application performance delivery for multi-tenant traffic profiles from a diverse set of actual data intensive applications in bioinformatics. (ii) Custom Templates for Cloud Computing for Data-intensive Applications: This involves recommending cloud resources for data-intensive applications based on a custom template catalog. We develop a novel expert system approach that is implemented as a middleware to abstracts data-intensive application requirements for custom templates composition. We uniquely consider heterogeneous cloud resources selection for the deployment of cloud architectures for real data-intensive applications in cybermanufacturing.

Empirical Performance Analysis of HPC Benchmarks Across Variations in Cloud Computing

International Journal of Cloud Applications and Computing ◽

10.4018/ijcac.2013010102 ◽

2013 ◽

Vol 3 (1) ◽

pp. 13-26 ◽

Cited By ~ 4

Author(s):

Sanjay P. Ahuja ◽

Sindhu Mani

Keyword(s):

Data Storage ◽

High Performance ◽

Large Data ◽

Extensive Study ◽

Memory Bandwidth ◽

Platform As A Service ◽

Data Intensive ◽

Computational Performance ◽

Empirical Performance ◽

High Performance Computing (HPC) applications are scientific applications that require significant CPU capabilities. They are also data-intensive applications requiring large data storage. While many researchers have examined the performance of Amazon’s EC2 platform across some HPC benchmarks, an extensive study and their comparison between Amazon’s EC2 and Microsoft’s Windows Azure is largely missing with metrics such as memory bandwidth, I/O performance, and communication and computational performance. The purpose of this paper is to implement existing benchmarks to evaluate and analyze these metrics for EC2 and Windows Azure that span both Infrastructure-as-a-Service and Platform-as-a-Service types. This was accomplished by running MPI versions of STREAM, Interleaved or Random (IOR) and NAS Parallel (NPB) benchmarks on small and medium instance types. In addition a new EC2 medium instance type (m1.medium) was also included in the analysis. These benchmarks measure the memory bandwidth, I/O performance, communication and computational performance.

Proceedings of the 17th international symposium on High performance distributed computing - HPDC '08 ◽

Evaluating the usefulness of content addressable storage for high-performance data intensive applications

10.1145/1383422.1383428 ◽

2008 ◽

Cited By ~ 13

Author(s):

Partho Nath ◽

Bhuvan Urgaonkar ◽

Anand Sivasubramaniam

Keyword(s):

High Performance ◽

Performance Data ◽

Data Intensive ◽

Overview of Big Data-Intensive Storage and its Technologies for Cloud and Fog Computing

Research Anthology on Privatizing and Securing Data ◽

10.4018/978-1-7998-8954-0.ch005 ◽

2021 ◽

pp. 112-153

Author(s):

Richard S. Segall ◽

Jeffrey S Cook ◽

Gao Niu

Keyword(s):

Big Data ◽

Data Storage ◽

High Performance ◽

Storage Systems ◽

Fog Computing ◽

Storage Management ◽

Data Intensive Computing ◽

Computing Systems ◽

Application Performance ◽

Data Intensive

Computing systems are becoming increasingly data-intensive because of the explosion of data and the needs for processing the data, and subsequently storage management is critical to application performance in such data-intensive computing systems. However, if existing resource management frameworks in these systems lack the support for storage management, this would cause unpredictable performance degradation when applications are under input/output (I/O) contention. Storage management of data-intensive systems is a challenge. Big Data plays a most major role in storage systems for data-intensive computing. This article deals with these difficulties along with discussion of High Performance Computing (HPC) systems, background for storage systems for data-intensive applications, storage patterns and storage mechanisms for Big Data, the Top 10 Cloud Storage Systems for data-intensive computing in today's world, and the interface between Big Data Intensive Storage and Cloud/Fog Computing. Big Data storage and its server statistics and usage distributions for the Top 500 Supercomputers in the world are also presented graphically and discussed as data-intensive storage components that can be interfaced with Fog-to-cloud interactions and enabling protocols.

Model-driven development of data intensive applications over cloud resources

Future Generation Computer Systems ◽

10.1016/j.future.2017.12.046 ◽

2018 ◽

Vol 87 ◽

pp. 888-909 ◽

Cited By ~ 7

Author(s):

Rafael Tolosana-Calasanz ◽

José Ángel Bañares ◽

José-Manuel Colom

Keyword(s):

Model Driven Development ◽

Model Driven ◽

Data Intensive ◽

Cloud Resources

Heuristic Data Placement for Data-Intensive Applications in Heterogeneous Cloud

Journal of Electrical and Computer Engineering ◽

10.1155/2016/3516358 ◽

2016 ◽

Vol 2016 ◽

pp. 1-8 ◽

Cited By ~ 4

Author(s):

Qing Zhao ◽

Congcong Xiong ◽

Peng Wang

Keyword(s):

Clustering Algorithm ◽

Recursive Partitioning ◽

Data Placement ◽

Data Intensive ◽

High Bandwidth ◽

Tree Data ◽

Placement Algorithm ◽

Heterogeneous Cloud ◽

The Cost ◽

Data placement is an important issue which aims at reducing the cost of internode data transfers in cloud especially for data-intensive applications, in order to improve the performance of the entire cloud system. This paper proposes an improved data placement algorithm for heterogeneous cloud environments. In the initialization phase, a data clustering algorithm based on data dependency clustering and recursive partitioning has been presented, and both the factor of data size and fixed position are incorporated. And then a heuristic tree-to-tree data placement strategy is advanced in order to make frequent data movements occur on high-bandwidth channels. Simulation results show that, compared with two classical strategies, this strategy can effectively reduce the amount of data transmission and its time consumption during execution.

High Performance Distributed Computing, 2003. Proceedings. 12th IEEE International Symposium on ◽

Impact of high performance sockets on data intensive applications

10.1109/hpdc.2003.1210013 ◽

2004 ◽

Cited By ~ 10

Author(s):

P. Balaji ◽

Jiesheng Wu ◽

T. Kurc ◽

U. Catalyurek ◽

D.K. Panda ◽

...

Keyword(s):

High Performance ◽

Data Intensive ◽

Building High-Resolution Sky Images Using the Cell/B.E.

Scientific Programming ◽

10.1155/2009/408370 ◽

2009 ◽

Vol 17 (1-2) ◽

pp. 113-134 ◽

Cited By ~ 4

Author(s):

Ana Lucia Varbanescu ◽

Alexander S. van Amesfoort ◽

Tim Cornwell ◽

Ger van Diepen ◽

Rob van Nieuwpoort ◽

...

Keyword(s):

High Performance ◽

Complete Solution ◽

Performance Potential ◽

Data Intensive ◽

Irregular Data ◽

Original Application ◽

The One ◽

High Level ◽

Application Specific

The performance potential of the Cell/B.E., as well as its availability, have attracted a lot of attention from various high-performance computing (HPC) fields. While computation intensive kernels proved to be exceptionally well suited for running on the Cell, irregular data-intensive applications are usually considered as poor matches. In this paper, we present our complete solution for enabling such a data-intensive application to run efficiently on the Cell/B.E. processor. Specifically, we target radioastronomy data gridding and degridding, two resembling imaging filters based on convolutional resampling. Our solution is based on building a high-level application model, used to evaluate parallelization alternatives. Next, we choose the one with the best performance potential, and we gradually exploit this potential by applying platform-specific and application-specific optimizations. After several iterations, our target application shows a speed-up factor between 10 and 20 on a dual-Cell blade when compared with the original application running on a commodity machine. Given these results, and based on our empirical observations, we are able to pinpoint a set of ten guidelines for parallelizing similar applications on the Cell/B.E. Finally, we conclude the Cell/B.E. can provide high performance for data-intensive applications at the price of increased programming efforts and with a significant aid from aggressive application-specific optimizations.

PHash: A memory-efficient, high-performance key-value store for large-scale data-intensive applications

Journal of Systems and Software ◽

10.1016/j.jss.2016.09.047 ◽

2017 ◽

Vol 123 ◽

pp. 33-44 ◽

Cited By ~ 2

Author(s):

Hyotaek Shim

Keyword(s):

High Performance ◽

Large Scale ◽

Data Intensive ◽

Large Scale Data ◽

Scale Data ◽

Memory Efficient

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Data Intensive Distributed Computing ◽

Towards Data Intensive Many-Task Computing

10.4018/978-1-61520-971-2.ch002 ◽

2012 ◽

pp. 28-73 ◽

Cited By ~ 8

Author(s):

Ioan Raicu ◽

Ian Foster ◽

Yong Zhao ◽

Alex Szalay ◽

Philip Little ◽

...

Keyword(s):

High Performance ◽

File Systems ◽

Data Locality ◽

Resource Provisioning ◽

Parallel File Systems ◽

Data Intensive ◽

Dynamic Resource Provisioning ◽

Rate Of Increase ◽

Parallel File ◽

Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e. the reliance on parallel file systems with static configurations) do not scale to today’s largest systems for data intensive application, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a “data diffusion” approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real world performance, and develop a competitive online caching eviction policy. They also offer many empirical experiments to explore the benefits of data diffusion, both under static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.