Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities

Author(s):  
Franck Cappello

The emergence of petascale systems and the promise of future exascale systems have reinvigorated the community's interest in how to manage failures in such systems and ensure that large applications, lasting several hours or tens of hours, complete successfully. Most of the existing results for the key mechanisms associated with fault tolerance in high-performance computing (HPC) platforms follow the rollback-recovery approach. Over the last decade, these mechanisms have received a lot of attention from the community, with varying levels of success. Unfortunately, despite their high degree of optimization, existing approaches do not fit well with the challenging evolution of large-scale systems. There is room, and even a need, for new approaches. Opportunities may come from different origins: diskless checkpointing, algorithm-based fault tolerance, proactive operation, speculative execution, software transactional memory, forward recovery, etc. The contributions of this paper are as follows: (1) we summarize and analyze the existing results concerning failures in large-scale computers and point out the urgent need for drastic improvements or disruptive approaches for fault tolerance in these systems; (2) we sketch most of the known opportunities and analyze their associated limitations; (3) we extract and express the challenges that the HPC community will have to face in addressing the stringent issue of failures in HPC systems.
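
To make the rollback-recovery pattern that dominates these results concrete, here is a minimal, illustrative Python sketch of periodic checkpointing with restart from the last saved state. The state layout, step function, and file path are assumptions for illustration, not details from the paper.

```python
import pickle

def run_with_checkpoints(state, step, total_steps, interval, path="ckpt.pkl"):
    """Illustrative rollback-recovery loop: periodically persist application
    state so that, after a fail-stop error, execution resumes from the last
    checkpoint instead of restarting from scratch."""
    with open(path, "wb") as f:          # initial checkpoint
        pickle.dump(state, f)
    while state["iteration"] < total_steps:
        try:
            state = step(state)          # one unit of application work
            if state["iteration"] % interval == 0:
                with open(path, "wb") as f:   # periodic checkpoint
                    pickle.dump(state, f)
        except RuntimeError:             # a detected fail-stop failure
            with open(path, "rb") as f:  # roll back to the last checkpoint
                state = pickle.load(f)
    return state
```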

Author(s):  
Pietro Cicotti ◽  
Manu Shantharam ◽  
Laura Carrington

In many scientific and computational domains, graphs are used to represent and analyze data. Such graphs often exhibit the characteristics of small-world networks: a few high-degree vertexes connect many low-degree vertexes. Despite the randomness in a graph search, it is possible to capitalize on the characteristics of small-world networks and cache relevant information about high-degree vertexes. We applied this idea by caching remote vertex ids in a parallel breadth-first search benchmark. Our experiments with different implementations demonstrated significant performance improvements over the reference implementation in several configurations, using 64 to 1024 cores. We proposed a system design in which resources are dedicated exclusively to caching and shared among a set of nodes. Our evaluation demonstrates that this design reduces communication and has the potential to improve performance on large-scale systems in which the communication cost increases significantly with the distance between nodes. We also tested a memcached system as the cache server, finding that its generic protocol, which does not match our usage semantics, significantly hinders the potential performance improvements; we suggest that a generic system should also support a basic, lightweight communication protocol to meet the needs of high-performance computing applications. Finally, we explored different configurations to find efficient ways to utilize the resources allocated to solve a given problem size; to this end, we found that utilizing half of the compute cores per allocated node improves performance, and even in this case, the caching variants always outperform the reference implementation.
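
The caching idea is easy to state in code: in a distributed BFS, the first touch of a remote vertex costs one message, but a small-world hub is touched many times, so remembering ids already sent avoids most remote traffic. Below is a hedged Python sketch; `local_adj`, `is_local`, `remote_lookup`, and the cache size are hypothetical stand-ins, not the benchmark's actual interfaces.

```python
from collections import deque

def bfs_with_hub_cache(local_adj, is_local, remote_lookup, root, cache_size=1 << 16):
    """Parallel-BFS sketch: remote vertex ids already touched are cached so
    that the hubs of a small-world graph are resolved without communication."""
    cache = set()                   # remote vertex ids already notified
    visited, frontier = {root}, deque([root])
    while frontier:
        v = frontier.popleft()
        for u in local_adj(v):
            if is_local(u):
                if u not in visited:
                    visited.add(u)
                    frontier.append(u)
            elif u not in cache:    # first touch: one remote message
                remote_lookup(u)    # mark u visited on its owner
                if len(cache) < cache_size:
                    cache.add(u)    # hubs hit the cache on later touches
    return visited
```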


Author(s):  
ROBERT STEWART ◽  
PATRICK MAIER ◽  
PHIL TRINDER

Reliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model, which provides explicit supervision and recovery of actors with isolated state. We investigate scalable, transparent, fault-tolerant functional computation with automatic supervision and recovery of tasks. We do so by developing HdpH-RS, a variant of the Haskell distributed parallel Haskell (HdpH) DSL with Reliable Scheduling. Extending the distributed work stealing protocol of HdpH for task supervision and recovery is challenging. To eliminate elusive concurrency bugs, we validate the HdpH-RS work stealing protocol using the SPIN model checker. HdpH-RS differs from the actor model in that its principal entities are tasks, i.e. independent stateless computations, rather than isolated stateful actors. Thanks to statelessness, fault recovery can be performed automatically and entirely hidden in the HdpH-RS runtime system. Statelessness is also key for proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism. HdpH-RS provides a simple distributed fork/join-style programming model, with minimal exposure of fault tolerance at the language level, and a library of higher-level abstractions such as algorithmic skeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpH DSL, hence users can opt in or out of fault-tolerant execution without any refactoring. Computations in HdpH-RS are always as reliable as the root node, no matter how many nodes and cores are actually used. We benchmark HdpH-RS on conventional clusters and a High Performance Computing platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well, e.g. up to 1,400 cores on the High Performance Computing platform; and reliability and recovery overheads are consistently low even at scale.
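
HdpH-RS itself is a Haskell DSL; as a language-neutral illustration of its core idea, that stateless tasks can be transparently re-executed after a node crash without changing the program's result, here is a hedged Python sketch. The retry policy, failure signal, and function names are assumptions for illustration only and do not reflect the HdpH-RS API.

```python
from concurrent.futures import ProcessPoolExecutor, BrokenExecutor

def supervised_map(f, xs, workers=4, retries=3):
    """Fork/join with supervision sketch: tasks are stateless, so a task
    lost to a crashed worker is simply re-submitted; because tasks are
    pure, the final result matches a failure-free run."""
    results = [None] * len(xs)
    pending = list(enumerate(xs))
    for _ in range(retries):
        if not pending:
            break
        with ProcessPoolExecutor(max_workers=workers) as pool:
            futures = {i: pool.submit(f, x) for i, x in pending}
            failed = []
            for i, fut in futures.items():
                try:
                    results[i] = fut.result()
                except BrokenExecutor:    # worker died: task is recoverable
                    failed.append((i, xs[i]))
            pending = failed              # supervisor reschedules lost tasks
    return results
```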


Author(s):  
Camille Coti

This chapter gives an overview of techniques used to tolerate failures in high-performance distributed applications. We describe basic replication techniques, automatic rollback recovery, and application-based fault tolerance. We present the challenges raised specifically by distributed, high-performance computing and the performance overhead the fault tolerance mechanisms are likely to cost. Lastly, we give an example of a fault-tolerant algorithm that exploits specific properties of a recent algorithm.


2009 ◽  
Vol 10 (04) ◽  
pp. 345-364
Author(s):  
FATIHA BOUABACHE ◽  
THOMAS HERAULT ◽  
GILLES FEDAK ◽  
FRANCK CAPPELLO

An efficient and reliable fault tolerance protocol plays a key role in High Performance Computing. Rollback recovery is the most common fault tolerance technique used in High Performance Computing, especially in MPI applications. This technique relies on the reliability of the checkpoint storage. Most of the rollback recovery protocols assume that the checkpoint server machines are reliable. However, in a grid environment any unit can fail at any moment, including the components used to connect different administrative domains. Such failures lead to the loss of a whole set of machines, including the more reliable machines used to store the checkpoints in that administrative domain. Thus it is not safe to rely on the high Mean Time Between Failures of specific machines to store the checkpoint images. This paper introduces a new coordinated checkpoint protocol, which tolerates checkpoint server failures and cluster failures, and ensures checkpoint storage reliability in a grid environment. To provide this reliability, the protocol is based on a replication process. We propose new hierarchical replication strategies that exploit the locality of checkpoint images in order to minimize inter-cluster communication. We evaluate the effectiveness of our two hierarchical replication strategies through simulations against several criteria such as topology and scalability.
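
The hierarchical replication idea can be sketched as a placement rule: keep most copies of a checkpoint image inside its home cluster, where links are cheap, and ship only a small number of copies across clusters so a full-cluster failure is still covered. The sketch below is a hedged illustration, not the paper's protocol; the node objects and their `alive`/`store` interface are hypothetical.

```python
def place_replicas(image, home_cluster, clusters, k_local=2, k_remote=1):
    """Hierarchical replication sketch: k_local copies stay in the home
    cluster (cheap intra-cluster links); k_remote copies cross cluster
    boundaries so a whole-cluster loss does not destroy the image."""
    placements = []
    local_nodes = [n for n in clusters[home_cluster] if n.alive]
    placements += local_nodes[:k_local]            # intra-cluster copies
    remote = [c for c in clusters if c != home_cluster]
    for c in remote[:k_remote]:                    # minimal inter-cluster traffic
        nodes = [n for n in clusters[c] if n.alive]
        if nodes:
            placements.append(nodes[0])
    for node in placements:
        node.store(image)                          # hypothetical storage call
    return placements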


2021 ◽  
Vol 3 (3) ◽  
pp. 135-148
Author(s):  
Nayana Shetty

For the purpose of high-performance computation, several machines are being developed at the exascale level. These machines can perform at least one exaflop, i.e. a billion billion (10^18) calculations per second. The universe and nature can be understood better by addressing certain challenging computational issues with these machines. However, these machines face certain obstacles. As exascale machines encompass a huge quantity of components, failures may be frequent and resilience becomes challenging. A high progress rate must be maintained for applications by incorporating some form of fault tolerance into the system. Power management has to be performed across the system in parallel, and all layers, including the fault tolerance layer, must adhere to the system's power limitation. Large energy bills may be expected on installation of exascale machines due to their high power consumption. For various fault tolerance models, the energy profile must therefore be analyzed. This paper evaluates parallel-recovery, message-logging, and checkpoint/restart fault tolerance models for rollback recovery. For execution with failures, parallel recovery provides the most energy-efficient solution across programs with various programming models, and it also executes faster than the other techniques. An analytical model is used for exploring these models and their behavior at extreme scales.
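
The paper uses its own analytical model; as a hedged, generic stand-in, the sketch below uses Young's well-known first-order approximation for the optimal checkpoint interval to estimate the expected run time and energy of plain checkpoint/restart. All parameter values in the example are illustrative.

```python
from math import sqrt

def checkpoint_restart_energy(work, ckpt_cost, mtbf, power):
    """First-order sketch (Young's approximation): the optimal checkpoint
    interval is tau = sqrt(2 * ckpt_cost * mtbf); expected time adds the
    checkpointing fraction and the re-execution fraction, and energy is
    expected time at a fixed average machine power."""
    tau = sqrt(2 * ckpt_cost * mtbf)                # optimal interval (s)
    overhead = ckpt_cost / tau + tau / (2 * mtbf)   # ckpt + rework fractions
    expected_time = work * (1 + overhead)           # seconds
    return expected_time * power                    # joules

# Illustrative numbers: 10 h of work, 5 min checkpoints, 24 h MTBF, 2 MW machine
print(checkpoint_restart_energy(36000, 300, 86400, 2e6))
```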


Complexity ◽  
2019 ◽  
Vol 2019 ◽  
pp. 1-10 ◽  
Author(s):  
Kai Gong ◽  
Jia-Jian Wu ◽  
Ying Liu ◽  
Qing Li ◽  
Run-Ran Liu ◽  
...  

Many real-world infrastructure networks, such as power grids and communication networks, depend on each other through functional components that share geographic proximity. Many works have been devoted to revealing the vulnerability of interdependent spatially embedded networks (ISENs) facing node failures, and they have shown that ISENs are susceptible to geographically localized attacks caused by natural disasters or terrorist attacks. How to take emergency measures to prevent large-scale cascading failures on interdependent infrastructures is a longstanding problem. Here, we propose an effective strategy for the healing of local structures using the connection profile of a failed node, called the healing strategy by prioritizing minimum degrees (HPMD), in which a new link between two active low-degree neighbors of a failed node is established during the cascading process. Comparisons are then made between HPMD and three healing strategies based on random choice, degree centrality, and local centrality, respectively. Simulations are performed on ISENs composed of two diluted square lattices of the same size under localized attacks. Results show that HPMD can significantly improve the robustness of the system by enhancing the connectivity of low-degree nodes, which prevents the diffusion of failures from low-degree nodes to moderate-degree nodes. In particular, HPMD outperforms the other three strategies in the size of the giant component of the networks, the critical attack radius, and the number of iterative cascade steps for a given quota of newly added links, which means HPMD is more effective, more timely, and less costly. The high performance of HPMD indicates that low-degree nodes should be given top priority for effective healing to resist cascading failures in ISENs, which is totally different from traditional methods that usually take high-degree nodes as the critical nodes in a single network. Furthermore, HPMD considers the distance between a pair of nodes to control the variation in the network structure, which makes it more applicable to spatial networks than previous methods.
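
A hedged sketch of the HPMD rule as described above: for each failed node, link its two lowest-degree active neighbors, subject to a global quota of new links. The graph API is networkx; the spatial distance constraint mentioned in the abstract is omitted for brevity, and the `failed`/`active` bookkeeping is an assumption about how the cascade is tracked.

```python
import networkx as nx

def hpmd_heal(G, failed, active, quota):
    """HPMD sketch: for each failed node, add one new link between its two
    lowest-degree active neighbors (if absent), spending at most `quota`
    links overall; G holds the original topology, `active` the survivors."""
    added = 0
    for v in failed:
        if added >= quota:
            break
        nbrs = [u for u in G.neighbors(v) if u in active]
        nbrs.sort(key=G.degree)              # prioritize minimum degrees
        for a, b in zip(nbrs, nbrs[1:]):
            if not G.has_edge(a, b):
                G.add_edge(a, b)             # heal around the failed node
                added += 1
                break
    return added
```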


2019 ◽  
Author(s):  
I.A. Sidorov ◽  
T.V. Sidorova ◽  
Ya.V. Kurzibova

High-performance computing systems include a large number of hardware and software components that can cause failures. The well-known approaches to monitoring and ensuring the fault tolerance of high-performance computing systems do not currently allow a fully integrated solution. The aim of this paper is to develop methods and tools for identifying abnormal situations during large-scale computational experiments in high-performance computing environments, localizing these malfunctions, automatically troubleshooting them if possible, and automatically reconfiguring the computing environment otherwise. The proposed approach is based on the idea of integrating the monitoring systems used in different nodes of the environment into a unified meta-monitoring system. The proposed approach minimizes the time to perform diagnostics and troubleshooting through the use of parallel operations. It also improves the resiliency of the computing environment's processes through preventive measures to diagnose and troubleshoot failures. These advantages increase the reliability and efficiency of the environment's functioning. The novelty of the proposed approach lies in the following elements: mechanisms for the decentralized collection, storage, and processing of monitoring data; a new decision-making technique for reconfiguring the environment; and support for fault tolerance and reliability not only of software and hardware, but also of environment management systems.
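
As a rough illustration of the meta-monitoring loop described above (poll per-node agents in parallel, attempt automatic repair, otherwise reconfigure around the node), here is a hedged Python sketch. The agent interface (`check`, `node`) and the `repair`/`reconfigure` callbacks are hypothetical, not the paper's actual tooling.

```python
import concurrent.futures as cf

def meta_monitor(agents, repair, reconfigure):
    """Meta-monitoring sketch: collect diagnostics from all node-level
    monitoring agents in parallel, try automatic troubleshooting for each
    detected fault, and reconfigure the environment around nodes that
    cannot be repaired."""
    with cf.ThreadPoolExecutor() as pool:
        reports = list(pool.map(lambda a: (a, a.check()), agents))  # parallel diagnostics
    for agent, fault in reports:
        if fault is None:
            continue                            # node is healthy
        if not repair(agent.node, fault):       # automatic troubleshooting
            reconfigure(exclude=agent.node)     # otherwise route around it
```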


Author(s):  
Zizhong Chen

Today's long-running scientific applications typically tolerate failures by checkpoint/restart, in which all process states of an application are periodically saved to stable storage. However, as the number of processors in a system increases, the amount of data that needs to be saved to stable storage also increases linearly. Therefore, the classical checkpoint/restart approach has a potential scalability problem for large parallel systems. In this chapter, we introduce some scalable techniques to tolerate a small number of process failures in large parallel and distributed computing. We present several encoding strategies for diskless checkpointing to improve the scalability of the technique. We introduce the algorithm-based checkpoint-free fault tolerance technique to tolerate fail-stop failures without checkpointing or rollback recovery. Coding approaches and floating-point erasure correcting codes are also introduced to help applications survive multiple simultaneous process failures. The introduced techniques are scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. Experimental results demonstrate that the introduced techniques are highly scalable.
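
The simplest encoding strategy for diskless checkpointing is a single parity (XOR) checksum held by a dedicated process: any one lost checkpoint can be rebuilt from the parity and the survivors, with no stable storage involved. The runnable Python sketch below shows only this single-failure case; the chapter's floating-point erasure codes for multiple simultaneous failures generalize the same idea.

```python
import numpy as np

def encode_parity(local_states):
    """Diskless-checkpointing sketch: processes keep checkpoints in peer
    memory while a parity process stores the XOR of all of them."""
    parity = np.zeros_like(local_states[0])
    for s in local_states:
        parity ^= s                       # bitwise XOR encoding
    return parity

def recover(parity, surviving_states):
    """Rebuild the one lost state by XOR-ing out every survivor."""
    lost = parity.copy()
    for s in surviving_states:
        lost ^= s
    return lost

# Toy demo: four process states, lose the first, recover it from parity.
states = [np.random.randint(0, 256, 8, dtype=np.uint8) for _ in range(4)]
p = encode_parity(states)
assert (recover(p, states[1:]) == states[0]).all()
```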


Author(s):  
Jin-Xiu Wang ◽  
Fan-Zhou Kong ◽  
Hui-Xia Geng ◽  
Yue Zhao ◽  
Wei-Bing Guan ◽  
...  

The giant colony-forming haptophyte Phaeocystis globosa has caused several large-scale blooms in the Beibu Gulf since 2011, but the distribution and dynamics of the blooms remain largely unknown. In this study, colonies of P. globosa, as well as membrane-concentrated phytoplankton samples, were collected during eight cruises from September 2016 to August 2017 in the Beibu Gulf. Pigments were analyzed by high performance liquid chromatography coupled with a diode-array detector (HPLC-DAD). The pigment 19'-hexanoyloxyfucoxanthin (hex-fuco), generally considered a diagnostic pigment for Phaeocystis, was not detected in P. globosa colonies in the Beibu Gulf, whereas 19'-butanoyloxyfucoxanthin (but-fuco) was found in all colony samples. Moreover, but-fuco in membrane-concentrated phytoplankton samples exhibited a distribution pattern similar to that of P. globosa colonies, suggesting that but-fuco provides the diagnostic pigment for bloom-forming P. globosa in the Beibu Gulf. Based on the distribution of but-fuco in different water masses in the region prior to the formation of intensive blooms, it is suggested that P. globosa blooms in the Beibu Gulf could originate from two different sources. IMPORTANCE Phaeocystis globosa has formed intensive blooms in the South China Sea and around the world, causing huge socioeconomic losses and environmental damage. However, little is known about the formation mechanism and dynamics of P. globosa blooms. 19'-hexanoyloxyfucoxanthin (hex-fuco) is often used as the pigment proxy to estimate Phaeocystis biomass, but this is challenged by the giant colony-forming P. globosa in the Beibu Gulf, which contains only 19'-butanoyloxyfucoxanthin (but-fuco) and not hex-fuco. Using but-fuco as a diagnostic pigment, we traced two different origins of the P. globosa bloom in the Beibu Gulf. This study clarifies the development process of P. globosa blooms in the Beibu Gulf, providing a basis for early monitoring and prevention of the blooms.


Author(s):  
C.K. Wu ◽  
P. Chang ◽  
N. Godinho

Recently, the use of refractory metal silicides as low-resistivity, high-temperature, and high-oxidation-resistance gate materials in large-scale integrated circuits (LSI) has become an important approach in advanced MOS process development (1). This research is a systematic study of the structure and properties of molybdenum silicide thin films and their applicability to high-performance LSI fabrication.

