Autonomic Runtime Adaptation Framework for Power Management in Large-Scale High-Performance Computing Systems

Author(s): Sumit Kumar Saurav, S Bindhumadhva Bapu
2012, Vol. 63(2), pp. 365-377
Author(s): Yulai Yuan, Yongwei Wu, Qiuping Wang, Guangwen Yang, Weimin Zheng

Author(s): Edgar Gabriel

This chapter discusses runtime adaptation techniques for high-performance computing applications. To exploit the capabilities of modern high-end computing systems, applications and system software must be able to adapt their behavior to hardware and application characteristics. Using the Abstract Data and Communication Library (ADCL) as the driving example, the chapter shows the advantages of using adaptive techniques to exploit characteristics of the network and of the application. This makes it possible to reduce application execution time significantly and avoids the need to maintain multiple architecture-dependent versions of the source code.
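
To illustrate the general idea, the following minimal C sketch (hypothetical code, not ADCL's actual API) shows the selection pattern that such adaptive libraries rely on: several interchangeable implementations of the same operation are timed for a few iterations at runtime, and the fastest one on the current machine is used for the rest of the run.

/* Runtime selection among interchangeable implementations
 * (a hypothetical sketch of the technique, not ADCL's API). */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define N 4096
typedef void (*impl_fn)(double *dst, const double *src, int n);

/* Two interchangeable "implementations" of the same operation. */
static void copy_loop(double *dst, const double *src, int n) {
    for (int i = 0; i < n; i++) dst[i] = src[i];
}
static void copy_memcpy(double *dst, const double *src, int n) {
    memcpy(dst, src, (size_t)n * sizeof(double));
}

/* Time one candidate over a fixed number of repetitions. */
static double time_impl(impl_fn f, double *dst, const double *src,
                        int n, int reps) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++) f(dst, src, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + 1e-9 * (double)(t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    static double src[N], dst[N];
    impl_fn candidates[] = { copy_loop, copy_memcpy };
    const char *names[]  = { "loop", "memcpy" };
    int best = 0;
    double best_t = 1e30;

    /* Measurement phase: evaluate every candidate on this machine. */
    for (int i = 0; i < 2; i++) {
        double t = time_impl(candidates[i], dst, src, N, 1000);
        printf("%-7s %.6f s\n", names[i], t);
        if (t < best_t) { best_t = t; best = i; }
    }
    /* Commit phase: all remaining iterations use the winner. */
    printf("selected: %s\n", names[best]);
    candidates[best](dst, src, N);
    return 0;
}

In a communication library the candidates would be, for example, different message-exchange strategies, and the measurement phase would run during the first iterations of the application itself, so the choice reflects the actual network and problem size.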


2019
Author(s): I.A. Sidorov, T.V. Sidorova, Ya.V. Kurzibova

High-performance computing systems include a large number of hardware and software components that can fail. Existing approaches to monitoring such systems and ensuring their fault tolerance do not provide a fully integrated solution. The aim of this paper is to develop methods and tools for identifying abnormal situations during large-scale computational experiments in high-performance computing environments, localizing these malfunctions, troubleshooting them automatically where possible, and automatically reconfiguring the computing environment otherwise. The proposed approach is based on integrating the monitoring systems used on the different nodes of the environment into a unified meta-monitoring system. This approach minimizes the time needed for diagnostics and troubleshooting through the use of parallel operations, and it improves the resilience of the computing environment through preventive diagnosis and troubleshooting of failures. These advantages increase the reliability and efficiency of the environment. The novelty of the proposed approach lies in the following elements: mechanisms for decentralized collection, storage, and processing of monitoring data; a new decision-making technique for reconfiguring the environment; and support for fault tolerance and reliability not only of the software and hardware but also of the environment's management systems.
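
As a rough illustration of the reconfiguration decision step (a hypothetical sketch under assumed heartbeat semantics, not the authors' implementation), the following C fragment marks nodes whose monitoring heartbeat has gone stale and excludes them from the active configuration:

/* Decision step of a meta-monitoring loop (hypothetical sketch):
 * each node's monitor reports a heartbeat timestamp; nodes whose
 * last heartbeat is older than a timeout are marked failed and
 * removed from the active configuration. */
#include <stdio.h>
#include <time.h>

#define NODES 4
#define TIMEOUT_S 30

typedef struct {
    const char *name;
    time_t last_heartbeat;  /* as collected by the per-node monitor */
    int active;
} node_t;

/* Returns the number of nodes removed from the configuration. */
static int reconfigure(node_t *nodes, int n, time_t now) {
    int removed = 0;
    for (int i = 0; i < n; i++) {
        if (nodes[i].active && now - nodes[i].last_heartbeat > TIMEOUT_S) {
            nodes[i].active = 0;  /* exclude the node; a real system would
                                     also migrate or restart its jobs */
            printf("node %s: heartbeat stale, removed\n", nodes[i].name);
            removed++;
        }
    }
    return removed;
}

int main(void) {
    time_t now = time(NULL);
    node_t nodes[NODES] = {
        { "n01", now - 5,  1 },
        { "n02", now - 65, 1 },   /* stale: will be removed */
        { "n03", now - 2,  1 },
        { "n04", now - 31, 1 },   /* stale: will be removed */
    };
    int removed = reconfigure(nodes, NODES, now);
    printf("%d node(s) reconfigured out\n", removed);
    return 0;
}

In the decentralized design the paper describes, this check would run in parallel on several monitoring agents rather than in a single central loop, which is what shortens diagnosis time.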


2019, Vol. 15(S367), pp. 365-367
Author(s): A. Stoev, P. Stoeva, S. Kuzin, M. Kostov, A. Pertsov

The increase in the amount of scientific information in heliophysics has both quantitative causes (the growing number of high-power telescopes and the size of the light receivers coupled to them) and qualitative ones (new modes of observation, large-scale and repeated studies of the solar corona in different wavelength ranges, large-scale numerical experiments simulating the evolution of various processes and formations, etc.). The paper discusses the role and importance of methods for processing images of the solar corona, the storage of the "raw" data obtained, the need for access to high-performance computing systems in order to derive scientific results from observational experiments, and the need for international collaboration and data access in this era of growing scientific information in heliophysics.

