A SURVEY OF CHECKPOINT/RESTART TECHNIQUES ON DISTRIBUTED MEMORY SYSTEMS

2013 ◽  
Vol 23 (04) ◽  
pp. 1340011 ◽  
Author(s):  
FAISAL SHAHZAD ◽  
MARKUS WITTMANN ◽  
MORITZ KREUTZER ◽  
THOMAS ZEISER ◽  
GEORG HAGER ◽  
...  

The road to exascale computing poses many challenges for the High Performance Computing (HPC) community. Each step on the exascale path is mainly the result of a higher level of parallelism of the basic building blocks (i.e., CPUs, memory units, networking components, etc.). The reliability of each of these basic components does not increase at the same rate as the rate of hardware parallelism. This results in a reduction of the mean time to failure (MTTF) of the whole system. A fault tolerance environment is thus indispensable to run large applications on such clusters. Checkpoint/Restart (C/R) is the classic and most popular method to minimize failure damage. Its ease of implementation makes it useful, but typically it introduces significant overhead to the application. Several efforts have been made to reduce the C/R overhead. In this paper we compare various C/R techniques for their overheads by implementing them on two different categories of applications. These approaches are based on parallel-file-system (PFS)-level checkpoints (synchronous/asynchronous) and node-level checkpoints. We utilize the Scalable Checkpoint/Restart (SCR) library for the comparison of node-level checkpoints. For asynchronous PFS-level checkpoints, we use the Damaris library, the SCR asynchronous feature, and application-based checkpointing via dedicated threads. Our baseline for overhead comparison is the naïve application-based synchronous PFS-level checkpointing method. A 3D lattice-Boltzmann (LBM) flow solver and a Lanczos eigenvalue solver are used as prototypical applications in which all the techniques considered here may be applied.
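
For reference, the sketch below outlines the kind of naive synchronous, application-based PFS-level checkpointing that serves as the overhead baseline: every rank periodically blocks, writes its local state to the parallel file system, and resumes computation only after the I/O completes. The state layout, file naming, and checkpoint interval are illustrative assumptions, not code from the paper or from the SCR or Damaris libraries.

    // Minimal sketch of naive synchronous, application-level PFS checkpointing
    // (the baseline scheme described above). State layout, file naming and the
    // checkpoint interval are illustrative placeholders.
    #include <mpi.h>
    #include <cstdio>
    #include <string>
    #include <vector>

    static void write_checkpoint(int rank, int step, const std::vector<double>& state) {
        // Each rank writes its local state to the shared parallel file system.
        // All ranks block here, so compute stalls for the full I/O duration.
        std::string fname = "ckpt_step" + std::to_string(step) +
                            "_rank" + std::to_string(rank) + ".bin";
        FILE* f = std::fopen(fname.c_str(), "wb");
        if (f) {
            std::fwrite(state.data(), sizeof(double), state.size(), f);
            std::fclose(f);
        }
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        std::vector<double> state(1 << 20, 0.0);   // local solver state (placeholder)
        const int nsteps = 1000, ckpt_interval = 100;

        for (int step = 1; step <= nsteps; ++step) {
            // ... advance the solver (e.g., one LBM sweep) ...
            if (step % ckpt_interval == 0) {
                MPI_Barrier(MPI_COMM_WORLD);           // keep checkpoints globally consistent
                write_checkpoint(rank, step, state);   // synchronous: I/O overhead fully exposed
            }
        }
        MPI_Finalize();
        return 0;
    }

Asynchronous variants hide most of this exposed I/O time by handing the write off to a dedicated thread or helper process, which is precisely the overhead gap the comparison above quantifies.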

2010 ◽  
Vol 118-120 ◽  
pp. 596-600
Author(s):  
Jian Xin Zhu ◽  
Xue Dong Chen ◽  
Shi Yi Bao

This paper proposes an innovative nuisance-trip calculation method based on a Markov model, which is used to evaluate the effect of repair on system reliability. By analyzing the availability of the classic 1-out-of-2 (1oo2) repairable system, a new definition of the spurious trip is put forward in which online repair is considered. The nuisance trips caused by repair are analyzed and weighed against the benefits obtained from online repair. Numerical calculation reveals that online repair helps to prevent spurious trips in a 1oo2 redundant system. Dangerous failures that are not repaired, or that cannot be fixed online, have a complex influence on system reliability: an unrepaired dangerous failure can sometimes benefit anti-spurious performance, but the Mean Time To Failure Spurious (MTTFs) decreases as dangerous failures increase, provided that the dangerous failure rate is larger than the safe failure rate. The paper also finds that common-cause failures can reduce the chance of a nuisance trip, although this influence is small.
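
As a rough illustration of the kind of Markov calculation such an analysis rests on, the sketch below computes a mean time to the first spurious trip for a two-transient-state 1oo2 model with online repair, using first-step analysis. The state set and all rates are hypothetical placeholders, not the model or data from the paper.

    // Sketch of a Markov mean-time-to-absorption calculation for a 1oo2 system
    // with online repair. States and rates below are hypothetical placeholders.
    #include <cstdio>

    int main() {
        // Hypothetical per-channel rates (1/hour) and repair rate.
        const double lambda_s = 1e-5;  // safe (trip-inducing) failure rate
        const double lambda_d = 2e-5;  // dangerous, detectable failure rate
        const double mu       = 0.1;   // online repair rate

        // Transient states: S0 = both channels healthy, S1 = one channel under repair.
        // Absorbing state: spurious trip (assumed: any safe failure trips the 1oo2 logic).
        const double q0 = 2.0 * lambda_s + 2.0 * lambda_d;  // total exit rate from S0
        const double q1 = lambda_s + mu;                    // total exit rate from S1

        // First-step analysis: m_i = 1/q_i + sum_j P(i->j) * m_j over transient j.
        //   m0 = 1/q0 + (2*lambda_d/q0) * m1
        //   m1 = 1/q1 + (mu/q1)         * m0
        const double a  = 2.0 * lambda_d / q0;
        const double b  = mu / q1;
        const double m0 = (1.0 / q0 + a / q1) / (1.0 - a * b);  // mean time to spurious trip from S0

        std::printf("MTTF to spurious trip (hours): %.3e\n", m0);
        return 0;
    }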


2019 ◽  
Author(s):  
Andreas Müller ◽  
Willem Deconinck ◽  
Christian Kühnlein ◽  
Gianmarco Mengaldo ◽  
Michael Lange ◽  
...  

Abstract. In the simulation of complex multi-scale flow problems, such as those arising in weather and climate modelling, one of the biggest challenges is to satisfy operational requirements in terms of time-to-solution and energy-to-solution without compromising the accuracy and stability of the calculation. These competing factors require the development of state-of-the-art algorithms that can optimally exploit the targeted underlying hardware and efficiently deliver the extreme computational capabilities typically required in operational forecast production. These algorithms should (i) minimise the energy footprint along with the time required to produce a solution, (ii) maintain a satisfying level of accuracy, and (iii) be numerically stable and resilient in case of hardware or software failure. The European Centre for Medium-Range Weather Forecasts (ECMWF) is leading a project called ESCAPE (Energy-efficient SCalable Algorithms for weather Prediction on Exascale supercomputers), which is funded by Horizon 2020 (H2020) under the Future and Emerging Technologies in High Performance Computing (FET-HPC) initiative. The goal of the ESCAPE project is to develop a sustainable strategy to evolve weather and climate prediction models to next-generation computing technologies. The project partners incorporate the expertise of leading European regional forecasting consortia, university research, experienced high-performance computing centres and hardware vendors. This paper presents an overview of results obtained in the ESCAPE project, in which weather prediction models have been broken down into smaller building blocks called dwarfs. The participating weather prediction models are: IFS (Integrated Forecasting System); ALARO, a combination of AROME (Application de la Recherche à l'Opérationnel a Meso-Echelle) and ALADIN (Aire Limitée Adaptation Dynamique Développement International); and COSMO-EULAG, a combination of COSMO (Consortium for Small-scale Modeling) and EULAG (Eulerian/semi-Lagrangian fluid solver). The dwarfs are analysed and optimised in terms of computing performance for different hardware architectures (mainly Intel Skylake CPUs, NVIDIA GPUs, and Intel Xeon Phi). The ESCAPE project includes the development of new algorithms that are specifically designed for better energy efficiency and improved portability through domain-specific languages. In addition, the modularity of the algorithmic framework naturally allows testing different existing numerical approaches and their interplay with the emerging heterogeneous hardware landscape. Throughout the paper, we compare different numerical techniques for solving the main building blocks that constitute weather models, in terms of energy efficiency and performance, on a variety of computing technologies.


SIMULATION ◽  
2021 ◽  
pp. 003754972110641
Author(s):  
Aurelio Vivas ◽  
Harold Castro

Since simulation became the third pillar of scientific research, several forms of computers have become available to drive computer-aided simulations, and nowadays, clusters are the most popular type of computer supporting these tasks. For instance, cluster settings such as the so-called supercomputers, clusters of workstations (COW), clusters of desktops (COD), and clusters of virtual machines (COV) have been considered in the literature to embrace a variety of scientific applications. However, scientific applications categorized as high-performance computing (HPC) are conceptually restricted to being addressed only by supercomputers. In this respect, we introduce the notions of cluster overhead and cluster coupling to assess the capacity of non-HPC systems to handle HPC applications. We also compare the cluster overhead with an existing measure of overhead in computing systems, the total parallel overhead, to explain the correctness of our methodology. The evaluation of capacity considers the seven dwarfs of scientific computing, which are well-known scientific computing building blocks used in the development of HPC applications. The evaluation of these building blocks provides insights into the strengths and weaknesses of non-HPC systems in dealing with future HPC applications developed with one or a combination of these algorithmic building blocks.
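
For context, the total parallel overhead mentioned above is the standard metric T_o = p*T_p - T_1 (the work done by p processors beyond the best serial time); the short sketch below computes it together with speedup and efficiency from measured run times. The timing values are illustrative, and the authors' cluster overhead and cluster coupling metrics are defined in the paper rather than reproduced here.

    // Minimal sketch of the standard "total parallel overhead" metric:
    // T_o = p*T_p - T_1, with speedup S = T_1/T_p and efficiency E = S/p.
    // The timing numbers are illustrative placeholders.
    #include <cstdio>

    int main() {
        const double t_serial   = 480.0;  // best serial run time (s), illustrative
        const double t_parallel = 75.0;   // run time on p processes (s), illustrative
        const int    p          = 8;

        const double overhead   = p * t_parallel - t_serial;  // total parallel overhead
        const double speedup    = t_serial / t_parallel;
        const double efficiency = speedup / p;

        std::printf("overhead = %.1f s, speedup = %.2fx, efficiency = %.2f\n",
                    overhead, speedup, efficiency);
        return 0;
    }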


2013 ◽  
Vol 824 ◽  
pp. 170-177
Author(s):  
Olaitan Akinsanmi ◽  
K.R. Ekundayo ◽  
Patrick U. Okorie

This paper assesses the reliability of a Nokia N1650 mobile phone charger used in Zaria, Nigeria. The Part Stress Method was employed to assess the reliability of the system. Data on the failure rates of the system components were used, with special consideration given to factors such as the environment of use, the quality of the power supply, and service personnel. A comparative assessment was made of the system when operated within the Zaria environment and when operated within the country for which it was designed (China). The results show that a lower reliability level is associated with the use of the system in Zaria, Nigeria, compared with the reliability level when it is used in the country for which it was designed. The Mean Time To Failure (MTTF) of the system, which is the time it is expected to function without failure, is 1 year in Nigeria as against 10 years in China. The ratio is 10:1 in favour of the country of design. The ratio of the failure rates of the system is also 10:1 in favour of the country of design, meaning the charger fails ten times faster in the Zaria environment than in the country for which it was designed. This is accounted for by greater variation in environmental factors such as temperature, poor power quality, and a poor maintenance culture in the application environment.
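
As a generic illustration of a part-stress-style estimate, the sketch below scales each component's base failure rate by environment and quality factors, sums the scaled rates for a series system, and takes MTTF = 1/λ_system. The component list, base rates, and π-factors are illustrative placeholders, not the paper's data for the charger.

    // Generic part-stress-style reliability estimate for a series system.
    // Component list, base rates and pi-factors are illustrative placeholders.
    #include <cstdio>
    #include <vector>

    struct Part {
        const char* name;
        double lambda_base;  // base failure rate (failures per 10^6 hours)
        double pi_env;       // environment factor (e.g., harsher in Zaria than in China)
        double pi_quality;   // quality factor
    };

    int main() {
        std::vector<Part> parts = {
            {"transformer",             0.05, 8.0, 3.0},
            {"rectifier diode",         0.02, 8.0, 3.0},
            {"electrolytic capacitor",  0.12, 8.0, 3.0},
            {"switching transistor",    0.04, 8.0, 3.0},
        };

        double lambda_system = 0.0;  // series model: any part failure fails the charger
        for (const Part& part : parts)
            lambda_system += part.lambda_base * part.pi_env * part.pi_quality;

        const double mttf_hours = 1e6 / lambda_system;  // rates are per 10^6 hours
        std::printf("lambda_system = %.3f per 1e6 h, MTTF = %.0f h (%.1f years)\n",
                    lambda_system, mttf_hours, mttf_hours / 8760.0);
        return 0;
    }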


Nanophotonics ◽  
2020 ◽  
Vol 10 (1) ◽  
pp. 643-654
Author(s):  
Mark L. Brongersma

Abstract. The development of flat optics has taken the world by storm. The initial mission was to try to replace conventional optical elements by thinner, lightweight equivalents. However, while developing this technology and learning about its strengths and limitations, researchers have identified a myriad of exciting new opportunities. It is therefore a great moment to explore where flat optics can really make a difference and what materials and building blocks are needed to make further progress. Building on its strengths, flat optics is bound to impact computational imaging, active wavefront manipulation, ultrafast spatiotemporal control of light, quantum communications, thermal emission management, novel display technologies, and sensing. In parallel with the development of flat optics, we have witnessed incredible progress in the large-area synthesis and physical understanding of atomically thin, two-dimensional (2D) quantum materials. Given that these materials bring a wealth of unique physical properties and feature the same dimensionality as planar optical elements, they appear to have exactly what it takes to develop the next generation of high-performance flat optics.


Author(s):  
Anne Benoit ◽  
Saurabh K Raina ◽  
Yves Robert

Errors have become a critical problem for high-performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their peculiarity is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application state is correct. Checkpoints should be supplemented with verifications to detect silent errors. When a verification is successful, only the last checkpoint needs to be kept in memory because it is known to be correct. In this paper, we analytically determine the best balance of verifications and checkpoints so as to optimize platform throughput. We introduce a balanced algorithm using a pattern with p checkpoints and q verifications, which regularly interleaves both checkpoints and verifications across same-size computational chunks. We show how to compute the waste of an arbitrary pattern, and we prove that the balanced algorithm is optimal when the platform MTBF (mean time between failures) is large compared with the other parameters (checkpointing, verification, and recovery costs). We conduct several simulations to show the gain achieved by this balanced algorithm for well-chosen values of p and q, compared with the base algorithm that always performs a verification just before taking a checkpoint (p = q = 1), and we exhibit gains of up to 19%.
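
The compact Monte Carlo sketch below estimates the waste of the base scheme used for comparison (p = q = 1: each chunk of work is followed by a verification and a checkpoint, and a silent error is caught only at the next verification, forcing a rollback to the last checkpoint). All parameter values are illustrative, and the paper's balanced p/q patterns and analytical waste formulas are not reproduced here.

    // Monte Carlo estimate of the waste of the base p = q = 1 scheme.
    // Errors striking during verification, checkpoint or recovery are ignored
    // for simplicity; all parameter values are illustrative placeholders.
    #include <cstdio>
    #include <random>

    int main() {
        const double W = 3600.0;   // work per chunk (s)
        const double V = 30.0;     // verification cost (s)
        const double C = 60.0;     // checkpoint cost (s)
        const double R = 60.0;     // recovery cost (s)
        const double M = 86400.0;  // platform MTBF for silent errors (s)

        std::mt19937_64 gen(42);
        std::exponential_distribution<double> next_error(1.0 / M);

        const long chunks = 100000;  // successfully completed chunks to simulate
        double total = 0.0, useful = 0.0;
        for (long i = 0; i < chunks; ++i) {
            // Retry the chunk until it passes its verification error-free.
            while (true) {
                const double t_err = next_error(gen);   // time until the next silent error
                if (t_err > W) {                        // chunk finished clean
                    total  += W + V + C;                // work + verification + checkpoint
                    useful += W;
                    break;
                }
                total += W + V + R;  // wasted attempt: detected at verification, roll back
            }
        }
        std::printf("simulated waste = %.3f\n", 1.0 - useful / total);
        return 0;
    }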


2020 ◽  
Vol 245 ◽  
pp. 05003
Author(s):  
Christopher Jones ◽  
Patrick Gartung

The OpenMP standard is the primary mechanism used at high-performance computing facilities to allow intra-process parallelization. In contrast, many HEP-specific software packages (such as CMSSW, GaudiHive, and ROOT) make use of Intel’s Threading Building Blocks (TBB) library to accomplish the same goal. In these proceedings we discuss our work to compare TBB and OpenMP when used for scheduling algorithms to be run by a HEP-style data processing framework. This includes both the scheduling of different interdependent algorithms to run concurrently and the scheduling of concurrent work within one algorithm. As part of the discussion we present an overview of the OpenMP threading model. We also explain how we used OpenMP when creating a simplified HEP-like processing framework. Using that simplified framework, and a similar one written using TBB, we present performance comparisons between TBB and different compiler versions of OpenMP.
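
As a minimal illustration of the two threading models being compared, the sketch below schedules two independent placeholder algorithms concurrently and a third that depends on both, once with tbb::task_group and once with OpenMP tasks. The algorithm bodies are placeholders; this is not code from CMSSW, GaudiHive, or the simplified frameworks described in the paper.

    // Scheduling two independent algorithms plus a dependent one, with TBB and
    // with OpenMP tasks. Compile with -fopenmp and link against -ltbb.
    #include <tbb/task_group.h>
    #include <cstdio>

    static void produceA()  { std::printf("A done\n"); }      // placeholder algorithm
    static void produceB()  { std::printf("B done\n"); }      // placeholder algorithm
    static void consumeAB() { std::printf("C(A,B) done\n"); } // depends on A and B

    static void schedule_with_tbb() {
        tbb::task_group g;
        g.run([] { produceA(); });   // A and B have no mutual dependency,
        g.run([] { produceB(); });   // so they may run concurrently
        g.wait();                    // C must wait for both
        consumeAB();
    }

    static void schedule_with_openmp() {
        #pragma omp parallel
        #pragma omp single           // one thread spawns tasks, the team executes them
        {
            #pragma omp task
            produceA();
            #pragma omp task
            produceB();
            #pragma omp taskwait     // C must wait for both
            consumeAB();
        }
    }

    int main() {
        schedule_with_tbb();
        schedule_with_openmp();
        return 0;
    }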

