Epidemic failure detection and consensus for extreme parallelism

Author(s): Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, Christian Engelmann

Future extreme-scale high-performance computing systems will be required to operate under frequent component failures. The MPI Forum’s User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the surviving processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault-tolerance techniques. The MPI_Comm_shrink operation requires a failure detection and consensus algorithm. This paper presents three novel failure detection and consensus algorithms based on gossiping. Stochastic pinging is used to quickly detect failures during the execution of the algorithm; detected failures are then disseminated to all fault-free processes in the system, and agreement on the failure list is reached using three different consensus techniques. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that stochastic pinging detects all failures in the system. In all three algorithms, the number of gossip cycles needed to reach global consensus scales logarithmically with the system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus. The third algorithm is a three-phase distributed failure detection and consensus algorithm that provides consistency guarantees even in very large and extreme-scale systems while remaining memory- and bandwidth-efficient.
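The gossip scheme described above lends itself to a compact simulation. Below is a minimal sketch, assuming a synchronous cycle model and an oracle for which processes have actually failed; the function name, parameters, and data structures are illustrative and not taken from the paper.

```python
import random

def gossip_failure_detection(n=64, failed=frozenset({3, 17}), seed=0, max_cycles=100):
    """Toy simulation of gossip-based failure detection and dissemination.

    Each cycle, every alive process pings one random peer (stochastic
    pinging) and merges its local suspect list with one random gossip
    partner.  Returns the number of cycles until every alive process
    knows the full failed set, i.e. global consensus on the failures.
    """
    rng = random.Random(seed)
    alive = [p for p in range(n) if p not in failed]
    suspects = {p: set() for p in alive}        # each process's local view

    for cycle in range(1, max_cycles + 1):
        for p in alive:
            # Stochastic pinging: an unanswered ping marks the target as failed.
            target = rng.randrange(n)
            if target in failed:
                suspects[p].add(target)
            # Gossip exchange: both partners end up with the merged suspect list.
            partner = rng.choice(alive)
            merged = suspects[p] | suspects[partner]
            suspects[p] = suspects[partner] = merged
        if all(suspects[p] == set(failed) for p in alive):
            return cycle
    return None

print(gossip_failure_detection())   # e.g. consensus after a handful of cycles
```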

2017, Vol. 26 (03), pp. 1750002
Author(s): Fouad Hanna, Lionel Droz-Bartholet, Jean-Christophe Lapayre

The consensus problem has become a key issue in the field of collaborative telemedicine systems because of the need to guarantee the consistency of shared data. In this paper, we focus on the performance of consensus algorithms. First, we studied the best-known algorithms in the literature. Experiments with these algorithms allowed us to propose a new algorithm that enhances the performance of consensus in different situations. In 2014, we presented our initial ideas for enhancing the performance of consensus algorithms, but the proposed solution gave only moderate results. The goal of this paper is to present a new, enhanced consensus algorithm, named Fouad, Lionel and J.-Christophe (FLC). The new algorithm is built on the architecture of the Mostefaoui-Raynal (MR) consensus algorithm and integrates new features and some known techniques in order to improve the performance of consensus when process crashes are present in the system. Results from our experiments on the Neko simulation platform show that the FLC algorithm gives the best performance with a multicast network model in two scenarios: a first scenario with no process crashes or wrong suspicions, and a second in which multiple simultaneous process crashes take place in the system.
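As a rough orientation only, the sketch below shows one round of a generic rotating-coordinator consensus scheme of the kind MR-style algorithms build on; it is a simplified illustration under assumed data structures, not the FLC or MR algorithm itself.

```python
def rotating_coordinator_round(estimates, suspected, round_no):
    """One schematic round of coordinator-based consensus.

    estimates : dict mapping process id -> current estimate
    suspected : set of process ids currently suspected by the failure detector
    Returns (decided_value_or_None, updated_estimates).
    """
    procs = sorted(estimates)
    n = len(procs)
    coord = procs[round_no % n]                 # rotating coordinator
    if coord in suspected:
        return None, estimates                  # give up on this round, try the next
    proposal = estimates[coord]
    new_estimates = dict(estimates)
    acks = 0
    for p in procs:
        if p not in suspected:                  # non-suspected processes adopt and ack
            new_estimates[p] = proposal
            acks += 1
    decided = proposal if acks > n // 2 else None   # decide on a majority of acks
    return decided, new_estimates

# Example: process 0 is suspected, so the round-1 coordinator (process 1) decides.
print(rotating_coordinator_round({0: "a", 1: "b", 2: "c"}, suspected={0}, round_no=1))
```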


Author(s): M. M. S. Khan, M. S. Arifin, M. H. Rahaman, I. K. Amin, M. R. T. Hossain, ...
Keyword(s):

Author(s): Simon McIntosh–Smith, Rob Hunt, James Price, Alex Warwick Vesztrocy

High-performance computing systems continue to increase in size in the quest for ever higher performance. The resulting increase in electronic component count, coupled with the decrease in feature sizes of the silicon manufacturing processes used to build these components, may make future exascale systems more susceptible to soft errors caused by cosmic radiation than current high-performance computing systems. Through the use of techniques such as hardware-based error-correcting codes and checkpoint-restart, many of these faults can be mitigated, at the cost of increased hardware overhead, run-time, and energy consumption that can be as much as 10–20%. Some predictions expect these overheads to continue to grow over time. For extreme-scale systems, these overheads will represent megawatts of power consumption and millions of dollars in additional hardware costs, which could potentially be avoided with more sophisticated fault-tolerance techniques. In this paper we present new software-based fault-tolerance techniques that can be applied to one of the most important classes of software in high-performance computing: iterative sparse matrix solvers. Our new techniques enable us to exploit knowledge of the structure of sparse matrices in such a way as to improve the performance, energy efficiency, and fault tolerance of the overall solution.
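One widely known software technique in this space is protecting the sparse matrix-vector product with checksums; the sketch below illustrates that general idea and is not necessarily the authors' method. The checksum vector, tolerance, and error handling are assumptions.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def checked_spmv(A_csr, colsum, x, tol=1e-8):
    """Sparse matrix-vector product with a column-sum checksum test.

    colsum is precomputed once as 1^T A; for a fault-free product,
    sum(y) must equal colsum @ x, so a mismatch flags a soft error in
    A's stored values or in the computation of y.
    """
    y = A_csr @ x
    if abs(y.sum() - colsum @ x) > tol * (np.abs(y).sum() + 1.0):
        raise RuntimeError("soft error detected in SpMV: recompute or roll back")
    return y

# Build a random sparse matrix once and keep its column-sum checksum alongside it.
A = sparse_random(1000, 1000, density=0.01, format="csr", random_state=0)
colsum = np.asarray(A.sum(axis=0)).ravel()      # 1^T A, computed once
x = np.ones(1000)
y = checked_spmv(A, colsum, x)
```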


2019, Vol. 214, pp. 08009
Author(s): Matthias J. Schnepf, R. Florian von Cube, Max Fischer, Manuel Giffels, Christoph Heidecker, ...

Demand for computing resources in high energy physics (HEP) shows highly dynamic behavior, while the resources provided by the Worldwide LHC Computing Grid (WLCG) remain static. It has become evident that opportunistic resources such as High Performance Computing (HPC) centers and commercial clouds are well suited to cover peak loads. However, utilizing these resources introduces new levels of complexity: resources need to be managed highly dynamically, and HEP applications require a very specific software environment that is usually not provided at opportunistic resources. Furthermore, limited network bandwidth can cause I/O-intensive workflows to run inefficiently. The key component for dynamically running HEP applications on opportunistic resources is the use of modern container and virtualization technologies. Based on these technologies, the Karlsruhe Institute of Technology (KIT) has developed ROCED, a resource manager that dynamically integrates and manages a variety of opportunistic resources. In combination with ROCED, the HTCondor batch system acts as a powerful single entry point to all available computing resources, leading to a seamless and transparent integration of opportunistic resources into HEP computing. KIT is currently improving resource management and job scheduling by focusing on the I/O requirements of individual workflows, the available network bandwidth, and scalability. For these reasons, we are developing a new resource manager called TARDIS. In this paper, we give an overview of the technologies used, the dynamic management and integration of resources, and the status of the I/O-based resource and job scheduling.
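The scale-up decision such a resource manager has to make can be sketched as follows; the class and function names are hypothetical and do not reflect the ROCED, TARDIS, or HTCondor code bases.

```python
from dataclasses import dataclass

@dataclass
class SiteState:
    """Hypothetical view of one opportunistic site (HPC centre or cloud)."""
    name: str
    booted: int          # slots already integrated into the batch system
    booting: int         # slots requested but not yet available
    max_slots: int

def plan_scaling(idle_jobs: int, idle_slots: int, sites: list[SiteState]) -> dict[str, int]:
    """Decide how many additional slots to request per site.

    A very simplified demand model: request enough slots to cover the idle
    jobs that the currently idle and already-booting slots cannot absorb.
    """
    demand = max(0, idle_jobs - idle_slots - sum(s.booting for s in sites))
    requests: dict[str, int] = {}
    for site in sites:
        if demand <= 0:
            break
        headroom = site.max_slots - site.booted - site.booting
        grant = min(headroom, demand)
        if grant > 0:
            requests[site.name] = grant
            demand -= grant
    return requests

sites = [SiteState("hpc-centre", booted=200, booting=20, max_slots=500),
         SiteState("cloud", booted=0, booting=0, max_slots=1000)]
print(plan_scaling(idle_jobs=400, idle_slots=50, sites=sites))
# {'hpc-centre': 280, 'cloud': 50}
```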


2020, Vol. 65 (2), pp. 66
Author(s): M. Petrescu, R. Petrescu

The implementation of a fault-tolerant system requires some type of consensus algorithm for correct operation. From Paxos to View-stamped Replication and Raft, multiple algorithms have been developed to handle this problem. This paper presents and compares the Raft algorithm and Apache Kafka, a distributed messaging system which, although operating at a higher level, implements many concepts present in Raft (strong leadership, an append-only log, log compaction, etc.). This shows that mechanisms conceived to handle one class of problems (consensus algorithms) are very useful for handling a larger category of problems in the context of distributed systems.
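The two shared ideas the comparison rests on, an append-only log and key-based log compaction, can be sketched as follows; this toy class is illustrative and not Kafka's or Raft's actual implementation.

```python
class AppendOnlyLog:
    """Toy append-only log with Kafka-style key-based compaction."""

    def __init__(self):
        self.entries = []                 # list of (offset, key, value)
        self.next_offset = 0

    def append(self, key, value):
        """Entries are only ever added at the tail, never modified in place."""
        self.entries.append((self.next_offset, key, value))
        self.next_offset += 1
        return self.next_offset - 1

    def compact(self):
        """Keep only the latest value per key, preserving original offsets."""
        latest = {}
        for offset, key, value in self.entries:
            latest[key] = (offset, key, value)
        self.entries = sorted(latest.values())

log = AppendOnlyLog()
log.append("x", 1); log.append("y", 2); log.append("x", 3)
log.compact()
print(log.entries)        # [(1, 'y', 2), (2, 'x', 3)]
```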


Author(s): J. Lamterkati, L. Ouboubker, M. Khafallah, A. El afia

The study made in this paper concerns the use of voltage-oriented control (VOC) of a three-phase pulse width modulation (PWM) rectifier with constant switching frequency. This control method is called voltage-oriented control with space vector modulation (VOC-SVM). The proposed control scheme is founded on the transformation between the stationary (α-β) and the synchronously rotating (d-q) coordinate systems. It is based on two cascaded control loops, in which a fast inner loop controls the grid current and an outer loop controls the DC-link voltage, so that the DC-bus voltage is maintained at the desired level and unity power factor operation is ensured. In this way, both the steady-state performance and the robustness against load disturbances of the PWM rectifier are improved. The proposed scheme has been implemented and simulated in the MATLAB/Simulink environment. The control system of the VOC-SVM strategy has been built on a dSPACE system with a DS1104 controller board. The results obtained show the validity of the model and its control method. Compared with the conventional SPWM method, VOC-SVM ensures high performance and fast transient response.
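The (α-β) and (d-q) descriptions rest on the standard Clarke and Park transforms, sketched below under the amplitude-invariant scaling convention; the grid angle theta is assumed to come from a PLL and is not part of the paper's listing.

```python
import numpy as np

def clarke(i_a, i_b, i_c):
    """Amplitude-invariant Clarke transform: abc -> stationary alpha-beta frame."""
    i_alpha = (2.0 / 3.0) * (i_a - 0.5 * i_b - 0.5 * i_c)
    i_beta = (1.0 / np.sqrt(3.0)) * (i_b - i_c)
    return i_alpha, i_beta

def park(i_alpha, i_beta, theta):
    """Park transform: alpha-beta -> synchronously rotating d-q frame.

    theta is the grid-voltage angle; aligning the d axis with the grid
    voltage makes i_d the active and i_q the reactive current, so unity
    power factor corresponds to regulating i_q to zero.
    """
    i_d = np.cos(theta) * i_alpha + np.sin(theta) * i_beta
    i_q = -np.sin(theta) * i_alpha + np.cos(theta) * i_beta
    return i_d, i_q

# Balanced three-phase currents at angle theta map to a constant d-q pair.
theta = 0.7
i_a = np.cos(theta); i_b = np.cos(theta - 2*np.pi/3); i_c = np.cos(theta + 2*np.pi/3)
print(park(*clarke(i_a, i_b, i_c), theta))   # ~ (1.0, 0.0)
```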

