Hardware rollback recovery schemes for multiprocessor systems

Author(s):  
Z. Tong ◽  
R.Y. Kain ◽  
W.T. Tsai
Author(s):  
A Chien ◽  
P Balaji ◽  
N Dun ◽  
A Fang ◽  
H Fujita ◽  
...  

Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and when it is most productive, and customize for each application structure independently. This control is portable, and its embedding in application source makes it natural to express and easy to maintain. The ability to name multiple versions and “partially materialize” them efficiently makes ambitious forward-recovery based on “data slices” across versions or data structures both easy to express and efficient. Using several large applications (OpenMC, preconditioned conjugate gradient (PCG) solver, ddcMD, and Chombo), we evaluate the programming effort to add resilience. The required changes are small (< 2% lines of code (LOC)), localized and machine-independent, and perhaps most important, require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads < 2% are generally achieved. This overhead suggests that GVR can be implemented in large-scale codes and support portable error recovery with modest investment and runtime impact. Our results are drawn from both IBM BG/Q and Cray XC30 experiments, demonstrating portability. We also present two case studies of flexible error recovery, illustrating how GVR can be used for multi-version rollback recovery, and several different forward-recovery schemes. GVR’s multi-version enables applications to survive latent errors (silent data corruption) with significant detection latency, and forward recovery can make that recovery extremely efficient. Our results suggest that GVR is scalable, portable, and efficient. GVR interfaces are flexible, supporting a variety of recovery schemes, and altogether GVR embodies a gentle-slope path to tolerate growing error rates in future extreme-scale systems.


2009 ◽  
Vol 20 (10) ◽  
pp. 2628-2636 ◽  
Author(s):  
Jian WANG ◽  
Jian-Ling SUN ◽  
Xin-Yu WANG ◽  
Shen-Kang WANG ◽  
Jun-Bo CHEN

2018 ◽  
pp. 47-53
Author(s):  
B. Z. Shmeylin ◽  
E. A. Alekseeva

In this paper the tasks of managing the directory in coherence maintenance systems in multiprocessor systems with a large number of processors are solved. In microprocessor systems with a large number of processors (MSLP) the problem of maintaining the coherence of processor caches is significantly complicated. This is due to increased traffic on the memory buses and increased complexity of interprocessor communications. This problem is solved in various ways. In this paper, we propose the use of Bloom filters used to accelerate the determination of an element’s belonging to a certain array. In this article, such filters are used to establish the fact that the processor belongs to some subset of the processors and determine if the processor has a cache line in the set. In the paper, the processes of writing and reading information in the data shared between processors are discussed in detail, as well as the process of data replacement from private caches. The article also shows how the addresses of cache lines and processor numbers are removed from the Bloom filters. The system proposed in this paper allows significantly speeding up the implementation of operations to maintain cache coherence in the MSLP as compared to conventional systems. In terms of performance and additional hardware and software costs, the proposed system is not inferior to the most efficient of similar systems, but on some applications and significantly exceeds them.


Sign in / Sign up

Export Citation Format

Share Document