A Comprehensive Review on Power Efficient Fault Tolerance Models in High Performance Computation Systems

2021 ◽  
Vol 3 (3) ◽  
pp. 135-148
Author(s):  
Nayana Shetty

Several machines are being developed at the exascale level for high-performance computation. Such machines can perform at least one exaflop, that is, a billion billion (10^18) calculations per second. By addressing challenging computational problems with these machines, the universe and nature can be understood better. However, these machines face certain obstacles. Because an exascale machine comprises a huge number of components, failures are frequent and resilience is challenging. Some form of fault tolerance must be incorporated into the system so that applications can maintain a high progress rate. Power must also be managed across the parallel system, and all layers, including the fault tolerance layer, must adhere to the system's power limit. Given their high power consumption, exascale installations can be expected to incur huge energy bills, so the energy profile of the various fault tolerance models must be analyzed. This paper evaluates rollback-recovery fault tolerance models: checkpoint/restart, message logging, and parallel recovery. When programs written in various programming models execute in the presence of failures, parallel recovery provides the most energy-efficient solution, and it also executes faster than the other techniques. An analytical model is used to explore these models and their behavior at extreme scales.
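
As a rough illustration of the kind of analytical model such evaluations rest on, the sketch below compares checkpoint/restart costs using Young's first-order approximation of the optimal checkpoint interval. It is a minimal Python sketch; the numbers and function names are illustrative assumptions, not the paper's model.

    import math

    def young_optimal_interval(checkpoint_cost, mtbf):
        """Young's first-order approximation of the optimal
        checkpoint interval: tau = sqrt(2 * C * MTBF)."""
        return math.sqrt(2 * checkpoint_cost * mtbf)

    def expected_runtime(work, tau, checkpoint_cost, restart_cost, mtbf):
        """First-order expected runtime of checkpoint/restart:
        useful work plus checkpoint overhead, plus an expected
        restart and half a segment of rework per failure."""
        base = work + (work / tau) * checkpoint_cost
        failures = base / mtbf
        return base + failures * (restart_cost + tau / 2)

    # Illustrative numbers (assumptions, not measurements):
    work = 24 * 3600.0    # 24 h of useful computation (s)
    C, R = 300.0, 120.0   # checkpoint and restart costs (s)
    mtbf = 4 * 3600.0     # system mean time between failures (s)
    power = 20e6          # 20 MW machine power draw (W)

    tau = young_optimal_interval(C, mtbf)
    t = expected_runtime(work, tau, C, R, mtbf)
    print(f"optimal interval: {tau / 60:.1f} min")    # ~49 min
    print(f"expected runtime: {t / 3600:.2f} h, "
          f"energy: {power * t / 3.6e9:.1f} MWh")     # ~29.4 h, ~587 MWh

Multiplying expected runtime by machine power in this way is what lets such a model rank checkpoint/restart, message logging, and parallel recovery by energy rather than by time alone.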

Author(s):  
Camille Coti

This chapter gives an overview of techniques used to tolerate failures in high-performance distributed applications. We describe basic replication techniques, automatic rollback recovery, and application-based fault tolerance. We present the challenges raised specifically by distributed, high-performance computing and the performance overhead the fault tolerance mechanisms are likely to incur. Finally, we give an example of a fault-tolerant algorithm that exploits specific properties of a recent algorithm.
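
As a concrete example of the simplest of these techniques, the sketch below shows replication with majority voting (triple modular redundancy). It is a minimal Python sketch assuming a deterministic task; in a real system each replica would run on a separate node, and all names here are illustrative.

    import random
    from collections import Counter

    def run_replicated(task, args, replicas=3):
        """Run the same task on several replicas and majority-vote
        the results; tolerates a minority of faulty replicas."""
        results = [task(args) for _ in range(replicas)]
        value, votes = Counter(results).most_common(1)[0]
        if votes <= replicas // 2:
            raise RuntimeError("no majority: too many faulty replicas")
        return value

    def flaky_sum(xs):
        """Deterministic task with an injected soft error."""
        return sum(xs) + (1 if random.random() < 0.1 else 0)

    print(run_replicated(flaky_sum, (1, 2, 3)))  # almost always 6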


2009 ◽  
Vol 10 (04) ◽  
pp. 345-364
Author(s):  
FATIHA BOUABACHE ◽  
THOMAS HERAULT ◽  
GILLES FEDAK ◽  
FRANCK CAPPELLO

An efficient and reliable fault tolerance protocol plays a key role in High Performance Computing. Rollback recovery is the most common fault tolerance technique used in High Performance Computing, especially in MPI applications. This technique relies on the reliability of the checkpoint storage. Most rollback recovery protocols assume that the checkpoint server machines are reliable. However, in a grid environment any unit can fail at any moment, including the components used to connect different administrative domains. Such failures can lead to the loss of a whole set of machines, including the more reliable machines used to store the checkpoints in that administrative domain. It is therefore not safe to rely on the high Mean Time Between Failures of specific machines to store the checkpoint images. This paper introduces a new coordinated checkpoint protocol, which tolerates checkpoint server failures and whole-cluster failures and ensures checkpoint storage reliability in a grid environment. To provide this reliability, the protocol relies on a replication process. We propose new hierarchical replication strategies that exploit the locality of checkpoint images in order to minimize inter-cluster communication. We evaluate the effectiveness of our two hierarchical replication strategies through simulations against several criteria such as topology and scalability.
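
The placement rule at the heart of such a hierarchical strategy can be sketched in a few lines. The Python sketch below illustrates the locality idea only, with hypothetical names, and is not the protocol itself: most replicas stay inside the owner's cluster, and a small number cross cluster boundaries so that even the loss of an entire cluster leaves a copy alive.

    def place_replicas(owner, clusters, local_copies=2, remote_copies=1):
        """Locality-aware checkpoint replication: favor cheap
        intra-cluster links, but keep enough remote copies to
        survive a whole-cluster failure.

        owner: (cluster, node) that produced the checkpoint image.
        clusters: dict mapping cluster name -> list of node names.
        """
        own_cluster, own_node = owner
        placement = []
        # Local replicas on other nodes of the same cluster.
        peers = [n for n in clusters[own_cluster] if n != own_node]
        placement += [(own_cluster, n) for n in peers[:local_copies]]
        # A few remote replicas, one per foreign cluster.
        others = [c for c in clusters if c != own_cluster]
        for c in others[:remote_copies]:
            placement.append((c, clusters[c][0]))
        return placement

    clusters = {"A": ["a0", "a1", "a2"], "B": ["b0", "b1"], "C": ["c0"]}
    print(place_replicas(("A", "a0"), clusters))
    # [('A', 'a1'), ('A', 'a2'), ('B', 'b0')]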


Author(s):  
Franck Cappello

The emergence of petascale systems and the promise of future exascale systems have reinvigorated the community interest in how to manage failures in such systems and ensure that large applications, lasting several hours or tens of hours, are completed successfully. Most of the existing results for several key mechanisms associated with fault tolerance in high-performance computing (HPC) platforms follow the rollback-recovery approach. Over the last decade, these mechanisms have received a lot of attention from the community with different levels of success. Unfortunately, despite their high degree of optimization, existing approaches do not fit well with the challenging evolutions of large-scale systems. There is room and even a need for new approaches. Opportunities may come from different origins: diskless checkpointing, algorithmic-based fault tolerance, proactive operation, speculative execution, software transactional memory, forward recovery, etc. The contributions of this paper are as follows: (1) we summarize and analyze the existing results concerning the failures in large-scale computers and point out the urgent need for drastic improvements or disruptive approaches for fault tolerance in these systems; (2) we sketch most of the known opportunities and analyze their associated limitations; (3) we extract and express the challenges that the HPC community will have to face for addressing the stringent issue of failures in HPC systems.
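
Of the opportunities listed above, algorithmic-based fault tolerance is the easiest to make concrete. The sketch below shows the classic Huang-Abraham checksum scheme for matrix multiplication in Python/NumPy; it is an illustration of the general idea, not code from the paper.

    import numpy as np

    def abft_matmul(A, B, tol=1e-8):
        """Checksum-based ABFT for C = A @ B: extend A with a
        column-checksum row and B with a row-checksum column; the
        product then carries checksums that reveal a corrupted
        element without recomputing everything."""
        Ac = np.vstack([A, A.sum(axis=0)])                 # column checksums
        Br = np.hstack([B, B.sum(axis=1, keepdims=True)])  # row checksums
        C = Ac @ Br
        data = C[:-1, :-1]
        row_ok = np.allclose(C[-1, :-1], data.sum(axis=0), atol=tol)
        col_ok = np.allclose(C[:-1, -1], data.sum(axis=1), atol=tol)
        return data, row_ok and col_ok

    rng = np.random.default_rng(0)
    A, B = rng.random((4, 3)), rng.random((3, 5))
    C, ok = abft_matmul(A, B)
    print(ok)  # True on a fault-free run; a flipped entry in C would
               # break one row and one column checksum, locating it.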


2021 ◽  
Author(s):  
Yaser Rahmani ◽  
Saeed Rasouli Heikalabad ◽  
Mohammad Mosleh

Abstract Quantum-dot Cellular Automata (QCA) technology is believed to be a good alternative to CMOS technology. This nanoscale technology can provide a platform for the design and implementation of high-performance and power-efficient logic circuits. However, the fabrication of QCA circuits is susceptible to faults appearing in the form of missing cells, additional cells, rotated cells, and displaced cells. Over the years, several solutions have been proposed to address these problems. This paper presents a new solution for improving the fault tolerance of the three-input majority gate. The proposed majority gate is then used to design 2-1 and 4-1 multiplexers. The proposed designs are implemented in QCADesigner. Simulation results demonstrate significant improvements in terms of fault tolerance and area requirements.
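
The logical behavior behind these designs is easy to state: a three-input majority gate computes M(a, b, c) = ab + bc + ca, fixing one input to 0 or 1 turns it into an AND or an OR, and a 2-1 multiplexer then takes three gates. The Python sketch below checks that construction at the Boolean level only; it says nothing about the QCA cell layout or its fault tolerance.

    def maj(a, b, c):
        """Three-input majority gate, the basic QCA primitive:
        M(a, b, c) = ab + bc + ca."""
        return (a & b) | (b & c) | (a & c)

    def mux2to1(i0, i1, s):
        """2-1 multiplexer from three majority gates:
        AND(x, y) = M(x, y, 0), OR(x, y) = M(x, y, 1),
        so OUT = M(M(i0, not s, 0), M(i1, s, 0), 1)."""
        return maj(maj(i0, 1 - s, 0), maj(i1, s, 0), 1)

    # Exhaustive check against the behavioral definition.
    assert all(mux2to1(i0, i1, s) == (i1 if s else i0)
               for i0 in (0, 1) for i1 in (0, 1) for s in (0, 1))

A 4-1 multiplexer then follows by cascading three such 2-1 units.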


Nanophotonics ◽  
2020 ◽  
Vol 10 (2) ◽  
pp. 937-945
Author(s):  
Ruihuan Zhang ◽  
Yu He ◽  
Yong Zhang ◽  
Shaohua An ◽  
Qingming Zhu ◽  
...  

Abstract Ultracompact and low-power-consumption optical switches are desired for high-performance telecommunication networks and data centers. Here, we demonstrate an on-chip power-efficient 2 × 2 thermo-optic switch unit by using a suspended photonic crystal nanobeam structure. A submilliwatt switching power of 0.15 mW is obtained with a tuning efficiency of 7.71 nm/mW in a compact footprint of 60 μm × 16 μm. The bandwidth of the switch is properly designed for a four-level pulse amplitude modulation signal with a 124 Gb/s raw data rate. To the best of our knowledge, the proposed switch is the most power-efficient resonator-based thermo-optic switch unit, with the highest tuning efficiency and data rate ever reported.
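
The reported figures can be sanity-checked against each other: the resonance shift the heater induces is the switching power times the tuning efficiency. A one-line check in Python (illustrative arithmetic only):

    # Implied resonance shift = switching power * tuning efficiency.
    switching_power_mw = 0.15      # mW, as reported
    tuning_eff_nm_per_mw = 7.71    # nm/mW, as reported
    print(f"{switching_power_mw * tuning_eff_nm_per_mw:.2f} nm")  # ~1.16 nm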


2017 ◽  
Vol 16 ◽  
pp. 1-10 ◽  
Author(s):  
Naresh Kumar Reddy Beechu ◽  
Vasantha Moodabettu Harishchandra ◽  
Nithin Kumar Yernad Balachandra

Author(s):  
Simon McIntosh–Smith ◽  
Rob Hunt ◽  
James Price ◽  
Alex Warwick Vesztrocy

High-performance computing systems continue to increase in size in the quest for ever higher performance. The resulting increase in electronic component count, coupled with the decrease in feature sizes of the silicon manufacturing processes used to build these components, may make future exascale systems more susceptible to soft errors caused by cosmic radiation than current high-performance computing systems. Through techniques such as hardware-based error-correcting codes and checkpoint-restart, many of these faults can be mitigated, but at the cost of increased hardware overhead, run-time, and energy consumption that can be as much as 10–20%. Some predictions expect these overheads to continue to grow over time. For extreme-scale systems, these overheads will represent megawatts of power consumption and millions of dollars of additional hardware costs, which could potentially be avoided with more sophisticated fault-tolerance techniques. In this paper we present new software-based fault tolerance techniques that can be applied to one of the most important classes of software in high-performance computing: iterative sparse matrix solvers. Our new techniques enable us to exploit knowledge of the structure of sparse matrices so as to improve the performance, energy efficiency, and fault tolerance of the overall solution.
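
One way to exploit sparse-matrix structure for fault detection, in the spirit of the approach described here, is to check cheap invariants of the compressed sparse row (CSR) representation during the solve. The sketch below is a minimal Python/SciPy illustration of that idea, not the authors' implementation.

    import numpy as np
    from scipy.sparse import csr_matrix

    def csr_structure_ok(A):
        """Soft-error detection via CSR structural invariants:
        the row-pointer array must be monotone and consistent,
        and column indices must stay in range and be sorted
        within each row. A bit flip in the index arrays almost
        always violates one of these, so the check catches it
        without keeping a second copy of the matrix."""
        indptr, indices = A.indptr, A.indices
        if indptr[0] != 0 or indptr[-1] != len(indices):
            return False
        if np.any(np.diff(indptr) < 0):
            return False
        if np.any((indices < 0) | (indices >= A.shape[1])):
            return False
        for r in range(A.shape[0]):  # sorted indices per row
            if np.any(np.diff(indices[indptr[r]:indptr[r + 1]]) < 0):
                return False
        return True

    A = csr_matrix(np.eye(4))
    assert csr_structure_ok(A)
    A.indices[2] = 7                 # inject an index corruption
    assert not csr_structure_ok(A)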

