High-performance computing environment: a review of twenty years of experiments in China

2016, Vol 3 (1), pp. 36-48
Author(s): Zhiwei Xu, Xuebin Chi, Nong Xiao

Abstract A high-performance computing environment, also known as a supercomputing environment, e-Science environment or cyberinfrastructure, is a crucial system that connects users’ applications to supercomputers, and provides usability, efficiency, sharing, and collaboration capabilities. This review presents important lessons drawn from China's nationwide efforts to build and use a high-performance computing environment over the past 20 years (1995–2015), including three observations and two open problems. We present evidence that such an environment helps to grow China's nationwide supercomputing ecosystem by orders of magnitude, where a loosely coupled architecture accommodates diversity. An important open problem is why technology for global networked supercomputing has not yet become as widespread as the Internet or Web. In the next 20 years, high-performance computing environments will need to provide zettaflops computing capability and 10 000 times better energy efficiency, and support seamless human-cyber-physical ternary computing.

2019
Author(s): I.A. Sidorov, T.V. Sidorova, Ya.V. Kurzibova

High-performance computing systems include a large number of hardware and software components, any of which can fail. The well-known approaches to monitoring and ensuring the fault tolerance of such systems do not yet provide a fully integrated solution. The aim of this paper is to develop methods and tools for identifying abnormal situations during large-scale computational experiments in high-performance computing environments, localizing the malfunctions, troubleshooting them automatically where possible, and automatically reconfiguring the computing environment otherwise. The proposed approach is based on integrating the monitoring systems used in different nodes of the environment into a unified meta-monitoring system. The approach minimizes the time needed for diagnostics and troubleshooting by performing these operations in parallel. It also improves the resiliency of the processes running in the computing environment through preventive diagnosis and troubleshooting of failures. These advantages increase the reliability and efficiency of the environment. The novelty of the proposed approach lies in the following elements: mechanisms for the decentralized collection, storage, and processing of monitoring data; a new decision-making technique for reconfiguring the environment; and support for fault tolerance and reliability not only for software and hardware, but also for environment management systems.

