Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions

We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. We compare them to state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance and resilience against silent data corruption (SDC) as well as in terms of performance and scalability. New gossip-based reduction algorithms are proposed, which significantly improve the state-of-the-art in terms of resilience against SDC. Moreover, a new gossip-inspired reduction algorithm is proposed, which promises a much more competitive runtime performance in an HPC context than classical gossip-based algorithms, in particular for low accuracy requirements.

Download Full-text

The State-of-the-Art Trends in Education Strategy for Sustainable Development of the High Performance Computing Ecosystem

Communications in Computer and Information Science - Supercomputing ◽

10.1007/978-3-319-71255-0_40 ◽

2017 ◽

pp. 494-504 ◽

Cited By ~ 1

Author(s):

Sergey Mosin

Keyword(s):

Sustainable Development ◽

High Performance Computing ◽

High Performance ◽

State Of The Art ◽

The State ◽

Education Strategy ◽

Performance Computing

Download Full-text

Preface

CLEI electronic journal ◽

10.19153/cleiej.17.1.0 ◽

2014 ◽

Vol 17 (1) ◽

Author(s):

A. Marcela Printista ◽

Carlos García Garino

Keyword(s):

High Performance Computing ◽

High Performance ◽

State Of The Art ◽

The State ◽

Electronic Journal ◽

Special Issue ◽

Performance Computing

This Special Issue of CLEI Electronic Journal presents the invited contributions selected fromthe best evaluated papers presented in VI HPCLatAm 2013. These communications that havebeen conveniently extended and properly evaluated. In this way the selected papers are avaluable contributions to the development of high performance computing in Latin Americaand summarizes the state of the art of HPC in our region.

Download Full-text

Application-based fault tolerance techniques for sparse matrix solvers

The International Journal of High Performance Computing Applications ◽

10.1177/1094342017694946 ◽

2017 ◽

Vol 32 (5) ◽

pp. 627-640

Author(s):

Simon McIntosh–Smith ◽

Rob Hunt ◽

James Price ◽

Alex Warwick Vesztrocy

Keyword(s):

Fault Tolerance ◽

High Performance Computing ◽

High Performance ◽

Sparse Matrix ◽

Sparse Matrices ◽

Error Correcting Codes ◽

Computing Systems ◽

Hardware Costs ◽

Extreme Scale ◽

Performance Computing

High-performance computing systems continue to increase in size in the quest for ever higher performance. The resulting increased electronic component count, coupled with the decrease in feature sizes of the silicon manufacturing processes used to build these components, may result in future exascale systems being more susceptible to soft errors caused by cosmic radiation than in current high-performance computing systems. Through the use of techniques such as hardware-based error-correcting codes and checkpoint-restart, many of these faults can be mitigated at the cost of increased hardware overhead, run-time, and energy consumption that can be as much as 10–20%. Some predictions expect these overheads to continue to grow over time. For extreme scale systems, these overheads will represent megawatts of power consumption and millions of dollars of additional hardware costs, which could potentially be avoided with more sophisticated fault-tolerance techniques. In this paper we present new software-based fault tolerance techniques that can be applied to one of the most important classes of software in high-performance computing: iterative sparse matrix solvers. Our new techniques enables us to exploit knowledge of the structure of sparse matrices in such a way as to improve the performance, energy efficiency, and fault tolerance of the overall solution.

Download Full-text

High-Performance Computing Spare Replacement Hardware Fault Tolerance

10.2172/833493 ◽

2004 ◽

Author(s):

Jared Samuel Dreicer

Keyword(s):

Fault Tolerance ◽

High Performance Computing ◽

High Performance ◽

Performance Computing

Download Full-text

Architecture for the Integration of High Performance Computing Applications in PLM

Volume 2: 27th Computers and Information in Engineering Conference, Parts A and B ◽

10.1115/detc2007-35185 ◽

2007 ◽

Author(s):

Reiner Anderl ◽

Orkun Yaman

Keyword(s):

Data Management ◽

High Performance Computing ◽

High Performance ◽

State Of The Art ◽

Reference Information ◽

Simulation Domain ◽

Architectural Framework ◽

Industrial Context ◽

Performance Computing ◽

Integrate Data

High Performance Computing (HPC) has become ubiquitous for simulations in the industrial context. To identify the requirements for integration of HPC-relevant data and processes a survey has been conducted concerning the German car manufacturers and service and component suppliers. This contribution presents the results of the evaluation and suggests an architecture concept to integrate data and workflows related with CAE and HPC-facilities in PLM. It describes the state of the art of HPC-applications within the simulation domain. Intensive efforts are currently invested on CAE-data management. However, an approach to systematic data management of HPC does not exist. This study states importance of an integrating approach for data management of HPC-applications and develops an architectural framework to implement HPC-data management into the existing PLM landscape. Requirements on key functionalities and interfaces are defined as well as a framework for a reference information model is conceptualized.

Download Full-text

A Fault Tolerance Framework for High Performance Computing in Cloud

2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012) ◽

10.1109/ccgrid.2012.80 ◽

2012 ◽

Cited By ~ 26

Author(s):

Ifeanyi P. Egwutuoha ◽

Shiping Chen ◽

David Levy ◽

Bran Selic

Keyword(s):

Fault Tolerance ◽

High Performance Computing ◽

High Performance ◽

Performance Computing

Download Full-text

High-performance computing systems: Status and outlook

Acta Numerica ◽

10.1017/s0962492912000050 ◽

2012 ◽

Vol 21 ◽

pp. 379-474 ◽

Cited By ~ 36

Author(s):

J. J. Dongarra ◽

A. J. van der Steen

Keyword(s):

High Performance Computing ◽

High Performance ◽

State Of The Art ◽

Computing Systems ◽

Future Developments ◽

Steady Growth ◽

Current State ◽

Near Future ◽

Performance Computing ◽

Shed Light

This article describes the current state of the art of high-performance computing systems, and attempts to shed light on near-future developments that might prolong the steady growth in speed of such systems, which has been one of their most remarkable characteristics. We review the different ways devised to speed them up, both with regard to components and their architecture. In addition, we discuss the requirements for software that can take advantage of existing and future architectures.

Download Full-text

State of the Art and Future Trends in Data Reduction for High-Performance Computing

Supercomputing Frontiers and Innovations ◽

10.14529/jsfi200101 ◽

2020 ◽

Vol 7 (1) ◽

Keyword(s):

High Performance Computing ◽

Data Reduction ◽

High Performance ◽

State Of The Art ◽

Future Trends ◽

Performance Computing

Download Full-text

Energy-Aware High-Performance Computing: Survey of State-of-the-Art Tools, Techniques, and Environments

Scientific Programming ◽

10.1155/2019/8348791 ◽

2019 ◽

Vol 2019 ◽

pp. 1-19 ◽

Cited By ~ 4

Author(s):

Pawel Czarnul ◽

Jerzy Proficz ◽

Adam Krzywaniak

Keyword(s):

High Performance Computing ◽

High Performance ◽

Hybrid Methods ◽

State Of The Art ◽

Control Methods ◽

Energy Aware ◽

Power Capping ◽

Power Limits ◽

Performance Computing

The paper presents state of the art of energy-aware high-performance computing (HPC), in particular identification and classification of approaches by system and device types, optimization metrics, and energy/power control methods. System types include single device, clusters, grids, and clouds while considered device types include CPUs, GPUs, multiprocessor, and hybrid systems. Optimization goals include various combinations of metrics such as execution time, energy consumption, and temperature with consideration of imposed power limits. Control methods include scheduling, DVFS/DFS/DCT, power capping with programmatic APIs such as Intel RAPL, NVIDIA NVML, as well as application optimizations, and hybrid methods. We discuss tools and APIs for energy/power management as well as tools and environments for prediction and/or simulation of energy/power consumption in modern HPC systems. Finally, programming examples, i.e., applications and benchmarks used in particular works are discussed. Based on our review, we identified a set of open areas and important up-to-date problems concerning methods and tools for modern HPC systems allowing energy-aware processing.

Download Full-text

Transparent fault tolerance for scalable functional computation

Journal of Functional Programming ◽

10.1017/s095679681600006x ◽

2016 ◽

Vol 26 ◽

Cited By ~ 2

Author(s):

ROBERT STEWART ◽

PATRICK MAIER ◽

PHIL TRINDER

Keyword(s):

Fault Tolerance ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Programming Model ◽

Fault Tolerant ◽

Fault Recovery ◽

Actor Model ◽

Work Stealing ◽

Performance Computing

AbstractReliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model that provides explicit supervision and recovery of actors with isolated state. We investigate scalable transparent fault tolerant functional computation with automatic supervision and recovery of tasks. We do so by developing HdpH-RS, a variant of the Haskell distributed parallel Haskell (HdpH) DSL with Reliable Scheduling. Extending the distributed work stealing protocol of HdpH for task supervision and recovery is challenging. To eliminate elusive concurrency bugs, we validate the HdpH-RS work stealing protocol using the SPIN model checker. HdpH-RS differs from the actor model in that its principal entities are tasks, i.e. independent stateless computations, rather than isolated stateful actors. Thanks to statelessness, fault recovery can be performed automatically and entirely hidden in the HdpH-RS runtime system. Statelessness is also key for proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism. HdpH-RS provides a simple distributed fork/join-style programming model, with minimal exposure of fault tolerance at the language level, and a library of higher level abstractions such as algorithmic skeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpH DSL, hence users can opt in or out of fault tolerant execution without any refactoring. Computations in HdpH-RS are always as reliable as the root node, no matter how many nodes and cores are actually used. We benchmark HdpH-RS on conventional clusters and an High Performance Computing platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well e.g. up to 1,400 cores on the High Performance Computing; reliability and recovery overheads are consistently low even at scale.

Download Full-text