Autonomic Runtime Adaptation Framework for Power Management in Large-Scale High-Performance Computing Systems

Author(s): Sumit Kumar Saurav, S Bindhumadhva Bapu
2012, Vol. 63(2), pp. 365-377
Author(s): Yulai Yuan, Yongwei Wu, Qiuping Wang, Guangwen Yang, Weimin Zheng

Author(s): Edgar Gabriel

This chapter discusses runtime adaptation techniques for high-performance computing applications. To exploit the capabilities of modern high-end computing systems, applications and system software must be able to adapt their behavior to hardware and application characteristics. Using the Abstract Data and Communication Library (ADCL) as the driving example, the chapter shows the advantages of using adaptive techniques to exploit characteristics of the network and of the application. This makes it possible to reduce application execution time significantly and avoids the need to maintain multiple architecture-dependent versions of the source code.
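
To illustrate the general idea, the following minimal C sketch (hypothetical code, not ADCL's actual API) shows the selection pattern that such adaptive libraries rely on: several interchangeable implementations of the same operation are timed for a few iterations at runtime, and the fastest one on the current machine is used for the rest of the run.

/* Runtime selection among interchangeable implementations
 * (a hypothetical sketch of the technique, not ADCL's API). */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define N 4096
typedef void (*impl_fn)(double *dst, const double *src, int n);

/* Two interchangeable "implementations" of the same operation. */
static void copy_loop(double *dst, const double *src, int n) {
    for (int i = 0; i < n; i++) dst[i] = src[i];
}
static void copy_memcpy(double *dst, const double *src, int n) {
    memcpy(dst, src, (size_t)n * sizeof(double));
}

/* Time one candidate over a fixed number of repetitions. */
static double time_impl(impl_fn f, double *dst, const double *src,
                        int n, int reps) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++) f(dst, src, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + 1e-9 * (double)(t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    static double src[N], dst[N];
    impl_fn candidates[] = { copy_loop, copy_memcpy };
    const char *names[]  = { "loop", "memcpy" };
    int best = 0;
    double best_t = 1e30;

    /* Measurement phase: evaluate every candidate on this machine. */
    for (int i = 0; i < 2; i++) {
        double t = time_impl(candidates[i], dst, src, N, 1000);
        printf("%-7s %.6f s\n", names[i], t);
        if (t < best_t) { best_t = t; best = i; }
    }
    /* Commit phase: all remaining iterations use the winner. */
    printf("selected: %s\n", names[best]);
    candidates[best](dst, src, N);
    return 0;
}

In a communication library the candidates would be, for example, different message-exchange strategies, and the measurement phase would run during the first iterations of the application itself, so the choice reflects the actual network and problem size.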


2019
Author(s): I.A. Sidorov, T.V. Sidorova, Ya.V. Kurzibova

High-performance computing systems include a large number of hardware and software components that can fail. Existing approaches to monitoring such systems and ensuring their fault tolerance do not provide a fully integrated solution. The aim of this paper is to develop methods and tools for identifying abnormal situations during large-scale computational experiments in high-performance computing environments, localizing these malfunctions, troubleshooting them automatically where possible, and automatically reconfiguring the computing environment otherwise. The proposed approach is based on integrating the monitoring systems used on the different nodes of the environment into a unified meta-monitoring system. This approach minimizes the time needed for diagnostics and troubleshooting through the use of parallel operations, and it improves the resilience of the computing environment through preventive diagnosis and troubleshooting of failures. These advantages increase the reliability and efficiency of the environment. The novelty of the proposed approach lies in the following elements: mechanisms for decentralized collection, storage, and processing of monitoring data; a new decision-making technique for reconfiguring the environment; and support for fault tolerance and reliability not only of the software and hardware but also of the environment's management systems.
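
As a rough illustration of the reconfiguration decision step (a hypothetical sketch under assumed heartbeat semantics, not the authors' implementation), the following C fragment marks nodes whose monitoring heartbeat has gone stale and excludes them from the active configuration:

/* Decision step of a meta-monitoring loop (hypothetical sketch):
 * each node's monitor reports a heartbeat timestamp; nodes whose
 * last heartbeat is older than a timeout are marked failed and
 * removed from the active configuration. */
#include <stdio.h>
#include <time.h>

#define NODES 4
#define TIMEOUT_S 30

typedef struct {
    const char *name;
    time_t last_heartbeat;  /* as collected by the per-node monitor */
    int active;
} node_t;

/* Returns the number of nodes removed from the configuration. */
static int reconfigure(node_t *nodes, int n, time_t now) {
    int removed = 0;
    for (int i = 0; i < n; i++) {
        if (nodes[i].active && now - nodes[i].last_heartbeat > TIMEOUT_S) {
            nodes[i].active = 0;  /* exclude the node; a real system would
                                     also migrate or restart its jobs */
            printf("node %s: heartbeat stale, removed\n", nodes[i].name);
            removed++;
        }
    }
    return removed;
}

int main(void) {
    time_t now = time(NULL);
    node_t nodes[NODES] = {
        { "n01", now - 5,  1 },
        { "n02", now - 65, 1 },   /* stale: will be removed */
        { "n03", now - 2,  1 },
        { "n04", now - 31, 1 },   /* stale: will be removed */
    };
    int removed = reconfigure(nodes, NODES, now);
    printf("%d node(s) reconfigured out\n", removed);
    return 0;
}

In the decentralized design the paper describes, this check would run in parallel on several monitoring agents rather than in a single central loop, which is what shortens diagnosis time.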


2019, Vol. 15(S367), pp. 365-367
Author(s): A. Stoev, P. Stoeva, S. Kuzin, M. Kostov, A. Pertsov

The increase in the amount of scientific information in heliophysics has both quantitative causes (the growing number of high-power telescopes and the size of the light receivers coupled to them) and qualitative ones (new modes of observation, large-scale and repeated studies of the solar corona in different wavelength ranges, large-scale numerical experiments simulating the evolution of various processes and formations, etc.). The paper discusses the role and importance of methods for processing images of the solar corona, the storage of the "raw" data obtained, the need for access to high-performance computing systems in order to derive scientific results from observational experiments, and the need for international collaboration and data access in this era of growing scientific information in heliophysics.

