Overhead of using spare nodes

With the increasing fault rate on high-end supercomputers, the topic of fault tolerance has been gathering attention. To cope with this situation, various fault-tolerance techniques are under investigation; these include user-level, algorithm-based fault-tolerance techniques and parallel execution environments that enable jobs to continue following node failure. Even with these techniques, some programs with static load balancing, such as stencil computation, may underperform after a failure recovery. Even when spare nodes are present, they are not always substituted for failed nodes in an effective way. This article considers the questions of how spare nodes should be allocated, how to substitute them for faulty nodes, and how much the communication performance is affected by such a substitution. The third question stems from the modification of the rank mapping by node substitutions, which can incur additional message collisions. In a stencil computation, rank mapping is done in a straightforward way on a Cartesian network without incurring any message collisions. However, once a substitution has occurred, the optimal node-rank mapping may be destroyed. Therefore, these questions must be answered in a way that minimizes the degradation of communication performance. In this article, several spare node allocation and failed node substitution methods will be proposed, analyzed, and compared in terms of communication performance following the substitution. The proposed substitution methods are named sliding methods. The sliding methods are analyzed by using our developed simulation program and evaluated by using the K computer, Blue Gene/Q (BG/Q), and TSUBAME 2.5. It will be shown that when failures occur, the stencil communication performance on the K and BG/Q can be slowed around 10 times depending on the number of node failures. The barrier performance on the K can be cut in half. On BG/Q, barrier performance can be slowed by a factor of 10. Further, it will also be shown that almost no such communication performance degradation can be seen on TSUBAME 2.5. This is because TSUBAME 2.5 has an Infiniband network connected with a FatTree topology, while the K computer and BG/Q have dedicated Cartesian networks. Thus, the communication performance degradation depends on network characteristics.

Download Full-text

Fault Tolerance and Noise Immunity in Freespace Diffractive Optical Neural Networks

Engineering Research Express ◽

10.1088/2631-8695/ac4832 ◽

2022 ◽

Author(s):

Soumyashee Soumyaprakash Panda ◽

Ravi Hegde

Keyword(s):

Fault Tolerance ◽

Optical Networks ◽

Noise Immunity ◽

Performance Degradation ◽

Ex Situ ◽

Training Phase ◽

Optical Media ◽

Gradient Based ◽

Analog Systems ◽

Training Objective

Abstract Free-space diffractive optical networks are a class of trainable optical media that are currently being explored as a novel hardware platform for neural engines. The training phase of such systems is usually performed in a computer and the learned weights are then transferred onto optical hardware ("ex-situ training"). Although this process of weight transfer has many practical advantages, it is often accompanied by performance degrading faults in the fabricated hardware. Being analog systems, these engines are also subject to performance degradation due to noises in the inputs and during optoelectronic conversion. Considering diffractive optical networks (DON) trained for image classification tasks on standard datasets, we numerically study the performance degradation arising out of weight faults and injected noises and methods to ameliorate these effects. Training regimens based on intentional fault and noise injection during the training phase are only found marginally successful at imparting fault tolerance or noise immunity. We propose an alternative training regimen using gradient based regularization terms in the training objective that are found to impart some degree of fault tolerance and noise immunity in comparison to injection based training regimen.

Download Full-text

A Divide and Conquer Strategy for Scaling Weather Simulations with Multiple Regions of Interest

Scientific Programming ◽

10.1155/2013/682356 ◽

2013 ◽

Vol 21 (3-4) ◽

pp. 93-107

Author(s):

Preeti Malakar ◽

Thomas George ◽

Sameer Kumar ◽

Rashmi Mittal ◽

Vijay Natarajan ◽

...

Keyword(s):

Performance Improvement ◽

Performance Prediction ◽

Parallel Execution ◽

Regions Of Interest ◽

Divide And Conquer ◽

Processor Allocation ◽

Simulation Domain ◽

Multiple Regions ◽

Blue Gene ◽

Sequential Strategy

Accurate and timely prediction of weather phenomena, such as hurricanes and flash floods, require high-fidelity compute intensive simulations of multiple finer regions of interest within a coarse simulation domain. Current weather applications execute these nested simulations sequentially using all the available processors, which is sub-optimal due to their sub-linear scalability. In this work, we present a strategy for parallel execution of multiple nested domain simulations based on partitioning the 2-D processor grid into disjoint rectangular regions associated with each domain. We propose a novel combination of performance prediction, processor allocation methods and topology-aware mapping of the regions on torus interconnects. Experiments on IBM Blue Gene systems using WRF show that the proposed strategies result in performance improvement of up to 33% with topology-oblivious mapping and up to additional 7% with topology-aware mapping over the default sequential strategy.

Download Full-text

Assessment of risk modification due to safety barrier performance degradation in Natech events

Reliability Engineering & System Safety ◽

10.1016/j.ress.2021.107634 ◽

2021 ◽

Vol 212 ◽

pp. 107634

Author(s):

Alessio Misuri ◽

Gabriele Landucci ◽

Valerio Cozzani

Keyword(s):

Performance Degradation ◽

Safety Barrier ◽

Barrier Performance ◽

Risk Modification

Download Full-text

DBFT: A Byzantine Fault Tolerance Protocol with Graceful Performance Degradation

IEEE Transactions on Dependable and Secure Computing ◽

10.1109/tdsc.2021.3095544 ◽

2021 ◽

pp. 1-1

Author(s):

Jingjing Zhang ◽

Yingyao Rong ◽

Jiannong Cao ◽

Chunming Rong ◽

Jing Bian ◽

...

Keyword(s):

Fault Tolerance ◽

Performance Degradation ◽

Byzantine Fault Tolerance ◽

Byzantine Fault

Download Full-text

A Randomized Scheduling Algorithm for Multiprocessor Environments Using Local Search

Parallel Processing Letters ◽

10.1142/s012962641650002x ◽

2016 ◽

Vol 26 (01) ◽

pp. 1650002 ◽

Cited By ~ 5

Author(s):

Abhishek Mishra ◽

Pramod Kumar Mishra

Keyword(s):

Task Scheduling ◽

Scheduling Algorithm ◽

Scheduling Algorithms ◽

Parallel Execution ◽

Neighborhood Search ◽

Task Graph ◽

Communication Performance ◽

Task Scheduling Algorithm ◽

Dynamic Computation ◽

Performance Results

The LOCAL(A, B) randomized task scheduling algorithm is proposed for fully connected multiprocessors. It combines two given task scheduling algorithms (A, and B) using local neighborhood search to give a hybrid of the two given algorithms. Objective is to show that such type of hybridization can give much better performance results in terms of parallel execution times. Two task scheduling algorithms are selected: DSC (Dominant Sequence Clustering as algorithm A), and CPPS (Cluster Pair Priority Scheduling as algorithm B) and a hybrid is created (the LOCAL(DSC, CPPS) or simply the LOCAL task scheduling algorithm). The LOCAL task scheduling algorithm has time complexity O(|V||E|(|V |+|E|)), where V is the set of vertices, and E is the set of edges in the task graph. The LOCAL task scheduling algorithm is compared with six other algorithms: CPPS, DCCL (Dynamic Computation Communication Load), DSC, EZ (Edge Zeroing), LC (Linear Clustering), and RDCC (Randomized Dynamic Computation Communication). Performance evaluation of the LOCAL task scheduling algorithm shows that it gives up to 80.47 % improvement of NSL (Normalized Schedule Length) over other algorithms.

Download Full-text

Restoration of target work in automatic failure- and fault-tolerant multitasking distributed information-control system

Engineering Journal Science and Innovation ◽

10.18698/2308-6033-2019-7-1902 ◽

2019 ◽

Cited By ~ 2

Author(s):

A.V. Lobanov ◽

I.V. Asharina

Keyword(s):

Fault Tolerance ◽

Digital Computer ◽

Fault Tolerant ◽

Recovery Process ◽

Parallel Execution ◽

Point Of View ◽

Information Control ◽

Distributed Information ◽

Recovery Processes ◽

Digital Computers

The paper deals with the organization of target work recovery processes after admissible failures and faults in an automatic failure and fault tolerant multitask distributed multi-machine system of the network structure performing a set of the target functions set by external users. The system is characterized by parallel execution of a set of interacting target tasks performed on separate computer subsystems, which are organized sets of digital computers. The specified level of failure- and fault-tolerance of the task is provided by its replication, i.e. parallel execution of copies of this task on several computers that make up the system, with the exchange of results and the choice of the correct one. The study introduces the characteristics, principles of construction, features of the considered systems and their "philosophical" essence from the point of view of failure- and fault-tolerance. Within the research, we determined the factors of complexity in the design of failure- and fault-tolerant systems of this class. The most general model of malicious computer failure is adopted, in which the computer behavior can be arbitrary, different in relation to other computers interacting with it, and even as malicious. We focus on the part of the problem of organizing dynamic redundancy in the developed system. The problem arises after an acceptable set of faults is detected in this system in some complex (or some set of F complexes) by each of the fault-free digital computers of each such complex and each such fault is also synchronously and consistently identified by place of origin and by type as a software failure of a certain digital computer of this complex. This part of the problem is solved by restoring all necessary information identified in a state of software malfunction of a certain complex. The information is transmitted to this digital computer from fault-free digital computers of this complex. The list of instructions required for such a recovery, as well as the actions of the complex in the recovery process, is determined.

Download Full-text