Evaluating the usefulness of content addressable storage for high-performance data intensive applications

Author(s):  
Partho Nath ◽  
Bhuvan Urgaonkar ◽  
Anand Sivasubramaniam
2020 ◽  
Author(s):  
Ronny Bazan Antequera

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI-COLUMBIA AT REQUEST OF AUTHOR.] The increase of data-intensive applications in science and engineering fields (e.g., bioinformatics, cybermanufacturing) demands the use of high-performance computing resources. However, the local resources available to data-intensive applications usually offer limited capacity and availability due to sizable upfront costs. Moreover, using remote public resources presents constraints at the private edge network domain. Specifically, misconfigured network policies cause bottlenecks as cross-traffic from other applications competes for shared networking resources. Additionally, selecting the right remote resources can be cumbersome, especially for users who are interested in application execution under nonfunctional requirements such as performance, security, and cost. Data-intensive applications have recurrent deployments and similar infrastructure requirements that can be addressed by creating templates. In this thesis, we handle application requirements through intelligent resource 'abstractions' coupled with 'reusable' approaches that save time and effort in deploying new cloud architectures. Specifically, we design a novel custom template middleware that can retrieve blueprints of resource configuration, technical/policy information, and benchmarks of workflow performance to facilitate repeatable/reusable resource composition. The middleware uses a hybrid recommendation methodology (online and offline recommendation) that leverages a catalog to rapidly check custom template solution correctness before and during resource consumption. Further, it prescribes application adaptations by fostering effective social interactions during the application's scaling stages. Based on the above approach, we organize the thesis contributions under two main thrusts: (i) Custom Templates for Cloud Networking for Data-intensive Applications: this involves scheduling transit selection and engineering at the campus edge based upon real-time policy control. Our solution ensures prioritized application performance delivery for multi-tenant traffic profiles from a diverse set of actual data-intensive applications in bioinformatics. (ii) Custom Templates for Cloud Computing for Data-intensive Applications: this involves recommending cloud resources for data-intensive applications based on a custom template catalog. We develop a novel expert system approach, implemented as middleware, that abstracts data-intensive application requirements for custom template composition. We uniquely consider heterogeneous cloud resource selection for the deployment of cloud architectures for real data-intensive applications in cybermanufacturing.
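A minimal sketch of the catalog-lookup idea behind such a custom template middleware: filter templates by hard nonfunctional requirements (an offline step), then rank the survivors by recorded benchmark performance (an online step). All names here (Template, Catalog, recommend) are hypothetical illustrations, not the thesis implementation; Python 3.10+ is assumed.

```python
# Illustrative catalog-based template recommendation (hypothetical names).
from dataclasses import dataclass


@dataclass
class Template:
    name: str
    cpu_cores: int
    memory_gb: int
    security_level: int      # e.g., 1 (low) .. 3 (high)
    hourly_cost: float
    benchmark_score: float   # recorded workflow performance


class Catalog:
    def __init__(self):
        self.templates: list[Template] = []

    def add(self, t: Template) -> None:
        self.templates.append(t)

    def recommend(self, min_cores: int, min_memory_gb: int,
                  min_security: int, max_cost: float) -> Template | None:
        # Offline step: keep only templates that satisfy the hard
        # (nonfunctional) requirements.
        feasible = [t for t in self.templates
                    if t.cpu_cores >= min_cores
                    and t.memory_gb >= min_memory_gb
                    and t.security_level >= min_security
                    and t.hourly_cost <= max_cost]
        # Online step: rank the survivors by recorded benchmark performance.
        return max(feasible, key=lambda t: t.benchmark_score, default=None)


catalog = Catalog()
catalog.add(Template("bio-small", 8, 32, 2, 0.40, 71.5))
catalog.add(Template("bio-large", 32, 128, 3, 1.60, 94.2))
print(catalog.recommend(min_cores=16, min_memory_gb=64,
                        min_security=2, max_cost=2.00))
```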


2013 ◽  
Vol 3 (1) ◽  
pp. 13-26 ◽  
Author(s):  
Sanjay P. Ahuja ◽  
Sindhu Mani

High-Performance Computing (HPC) applications are scientific applications that require significant CPU capabilities. They are also data-intensive applications requiring large data storage. While many researchers have examined the performance of Amazon's EC2 platform across some HPC benchmarks, an extensive study comparing Amazon's EC2 with Microsoft's Windows Azure on metrics such as memory bandwidth, I/O performance, and communication and computational performance is largely missing. The purpose of this paper is to implement existing benchmarks to evaluate and analyze these metrics for EC2 and Windows Azure offerings that span both the Infrastructure-as-a-Service and Platform-as-a-Service models. This was accomplished by running MPI versions of the STREAM, Interleaved or Random (IOR), and NAS Parallel Benchmark (NPB) suites on small and medium instance types. In addition, a new EC2 medium instance type (m1.medium) was also included in the analysis. Together, these benchmarks measure memory bandwidth, I/O performance, and communication and computational performance.
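To make the "memory bandwidth" metric concrete: the STREAM benchmark times simple vector kernels such as the triad (a = b + q*c) and reports bytes moved per second. The paper ran the official MPI STREAM code, so the following NumPy version is only a rough illustrative sketch of the measurement, not the benchmark itself.

```python
# Rough STREAM-triad-style bandwidth estimate (illustrative only).
import time
import numpy as np

N = 50_000_000                 # array length; ~400 MB per float64 array
a = np.zeros(N)
b = np.random.rand(N)
c = np.random.rand(N)
scalar = 3.0

start = time.perf_counter()
a[:] = b + scalar * c          # triad kernel: a = b + q*c
elapsed = time.perf_counter() - start

# The triad touches three 8-byte arrays: two reads plus one write.
bytes_moved = 3 * N * 8
print(f"Triad bandwidth: {bytes_moved / elapsed / 1e9:.2f} GB/s")
```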


Computer ◽  
2008 ◽  
Vol 41 (4) ◽  
pp. 60-68 ◽  
Author(s):  
Maya Gokhale ◽  
Jonathan Cohen ◽  
Andy Yoo ◽  
W. Marcus Miller ◽  
Arpith Jacob ◽  
...  

2009 ◽  
Vol 17 (1-2) ◽  
pp. 113-134 ◽  
Author(s):  
Ana Lucia Varbanescu ◽  
Alexander S. van Amesfoort ◽  
Tim Cornwell ◽  
Ger van Diepen ◽  
Rob van Nieuwpoort ◽  
...  

The performance potential of the Cell/B.E., as well as its availability, has attracted a lot of attention from various high-performance computing (HPC) fields. While computation-intensive kernels proved to be exceptionally well suited for running on the Cell, irregular data-intensive applications are usually considered poor matches. In this paper, we present our complete solution for enabling such a data-intensive application to run efficiently on the Cell/B.E. processor. Specifically, we target radio-astronomy data gridding and degridding, two similar imaging filters based on convolutional resampling. Our solution is based on building a high-level application model, used to evaluate parallelization alternatives. Next, we choose the alternative with the best performance potential, and we gradually exploit this potential by applying platform-specific and application-specific optimizations. After several iterations, our target application shows a speed-up factor between 10 and 20 on a dual-Cell blade when compared with the original application running on a commodity machine. Given these results, and based on our empirical observations, we pinpoint a set of ten guidelines for parallelizing similar applications on the Cell/B.E. Finally, we conclude that the Cell/B.E. can provide high performance for data-intensive applications at the price of increased programming effort and with significant aid from aggressive application-specific optimizations.
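For readers unfamiliar with convolutional resampling: gridding smears each irregularly sampled visibility onto a regular grid through a small convolution kernel, which produces the irregular read-modify-write memory pattern that makes the workload hard on the Cell. The sketch below is an illustrative assumption (including the Gaussian stand-in kernel and all names), not the paper's code.

```python
# Toy convolutional gridding: scatter irregular (u, v) samples onto a grid.
import numpy as np


def grid(visibilities, u, v, grid_size=256, support=3):
    """Accumulate irregular (u, v) samples onto a regular complex grid."""
    # Small separable Gaussian as a stand-in convolution kernel.
    k = np.arange(-support, support + 1)
    kernel = np.exp(-0.5 * (k / support) ** 2)
    kernel2d = np.outer(kernel, kernel)

    g = np.zeros((grid_size, grid_size), dtype=complex)
    for vis, ui, vi in zip(visibilities, u, v):
        gu = int(round(float(ui)))   # nearest grid cell for this sample
        gv = int(round(float(vi)))
        if support <= gu < grid_size - support and \
           support <= gv < grid_size - support:
            # Data-intensive inner step: a (2*support+1)^2 read-modify-write
            # per sample, with irregular access -- the hard part on the Cell.
            g[gv - support:gv + support + 1,
              gu - support:gu + support + 1] += vis * kernel2d
    return g


rng = np.random.default_rng(0)
n = 10_000
result = grid(rng.standard_normal(n) + 1j * rng.standard_normal(n),
              rng.uniform(10, 246, n), rng.uniform(10, 246, n))
print(result.shape, abs(result).sum())
```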


Author(s):  
Ioan Raicu ◽  
Ian Foster ◽  
Yong Zhao ◽  
Alex Szalay ◽  
Philip Little ◽  
...  

Many-task computing aims to bridge the gap between two computing paradigms: high-throughput computing and high-performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e., the reliance on parallel file systems with static configurations) do not scale to today's largest systems for data-intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a "data diffusion" approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real-world performance, and develop a competitive online cache-eviction policy. They also offer many empirical experiments to explore the benefits of data diffusion, under both static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.
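A toy illustration of the locality idea behind data diffusion: dispatch a task to a node that already caches its input where possible, and evict cached items LRU-style when a node's cache fills, so data gradually "diffuses" to where it is used. Class and method names here are hypothetical, not the authors' implementation or their actual (competitive online) eviction policy.

```python
# Toy locality-aware scheduling with per-node LRU caches (illustrative).
from collections import OrderedDict


class Node:
    def __init__(self, name, cache_slots):
        self.name = name
        self.cache = OrderedDict()        # file -> None, ordered by recency
        self.cache_slots = cache_slots

    def touch(self, f):
        if f in self.cache:
            self.cache.move_to_end(f)     # cache hit: refresh recency
        else:
            if len(self.cache) >= self.cache_slots:
                self.cache.popitem(last=False)   # evict least recently used
            self.cache[f] = None          # "diffuse" the file to this node


def schedule(task_file, nodes):
    # Locality-aware policy: send the task where its data already lives.
    for node in nodes:
        if task_file in node.cache:
            node.touch(task_file)
            return node, True             # cache hit
    # Otherwise fall back to any node (here: the first), pulling the file in.
    nodes[0].touch(task_file)
    return nodes[0], False                # cache miss


nodes = [Node("n0", 2), Node("n1", 2)]
for f in ["a", "b", "a", "c", "a"]:
    node, hit = schedule(f, nodes)
    print(f, "->", node.name, "hit" if hit else "miss")
```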

