A Probabilistic Fault-Tolerant Recovery Mechanism for Task and Result Certification of Large-Scale Distributed Applications

Author(s):  
Rim Chayeh ◽  
Christophe Cerin ◽  
Mohamed Jemni
Author(s):  
Valentin Cristea ◽  
Ciprian Dobre ◽  
Corina Stratan ◽  
Florin Pop

The use of large-scale distributed systems has been extending over the past years from scientific to commercial applications. Together with the extension of the application domains, new requirements have emerged for large-scale distributed systems. Among these requirements, fault tolerance is needed by more and more modern distributed applications, not only by the critical ones. In this chapter we analyze existing work on enabling fault tolerance in large-scale distributed systems, presenting specific problems, existing solutions, and several future trends. The characteristics of these systems make fault tolerance hard to ensure, especially because of their complexity, involving many geographically distributed resources and users, because of the volatility of resources that are available only for limited amounts of time, and because of the constraints imposed by the applications and resource owners. A general fault-tolerant architecture should comprise, at a minimum, a mechanism to detect failures and a component capable of recovering from and handling the detected failures, usually using some form of replication. We analyze existing fault tolerance implementations, as well as solutions adopted in real-world large-scale distributed systems, and the fault tolerance architectures proposed for particular distributed architectures, such as Grid or P2P systems.


Author(s):  
Florin Pop

This chapter presents a fault-tolerant framework for application scheduling in large-scale distributed systems (LSDS). Due to the specific characteristics and requirements of distributed systems, a good scheduling model should be dynamic. More specifically, it should adapt the scheduling decisions to resource state changes, which are commonly captured through monitoring. The scheduler and the monitor are two important middleware pieces that correlate their actions to ensure the high-performance execution of distributed applications. The chapter presents and analyzes an agent-based architecture for scheduling in large-scale distributed systems. User and resource management are then presented. Optimization schemes for scheduling consider a near-optimal algorithm for distributed scheduling, and the chapter presents a solution for scheduling optimization. The chapter also covers and explains the fault tolerance cases for Grid environments and describes two possible scenarios for the scheduling system.


1996 ◽  
Author(s):  
Richard Hayton ◽  
Jean Bacon ◽  
John Bates ◽  
Ken Moody

2021 ◽  
Vol 20 (5s) ◽  
pp. 1-22
Author(s):  
Haoran Li ◽  
Chenyang Lu ◽  
Christopher D. Gill

Fault-tolerant coordination services have been widely used in distributed applications in cloud environments. Recent years have witnessed the emergence of time-sensitive applications deployed in edge computing environments, which introduces both challenges and opportunities for coordination services. On one hand, coordination services must recover from failures in a timely manner. On the other hand, edge computing employs local networked platforms that can be exploited to achieve timely recovery. In this work, we first identify the limitations of the leader election and recovery protocols underlying Apache ZooKeeper, the prevailing open-source coordination service. To reduce recovery latency from leader failures, we then design RT-ZooKeeper with a set of novel features including a fast-convergence election protocol, a quorum channel notification mechanism, and a distributed epoch persistence protocol. We have implemented RT-ZooKeeper based on ZooKeeper version 3.5.8. Empirical evaluation shows that RT-ZooKeeper achieves a 91% reduction in maximum recovery latency in comparison to ZooKeeper. Furthermore, a case study demonstrates that fast failure recovery in RT-ZooKeeper can benefit a common messaging service like Kafka in terms of message latency.


2014 ◽  
Vol 933 ◽  
pp. 584-589
Author(s):  
Zhi Chun Zhang ◽  
Song Wei Li ◽  
Wei Ren Wang ◽  
Wei Zhang ◽  
Li Jun Qi

This paper presents a system in which the cluster devices are controlled by single-chip microcomputers (SCMs), with emphasis on the cluster management techniques for single-chip microcomputers. Each device in a cluster is controlled by a single-chip microcomputer, which collects sample data and sends it to the cluster management computer, and drives the device using driving data received from the same cluster management computer, through COMs. The cluster management system running on the cluster management computer handles initial SCM identification, run time slice management, communication resource utilization, fault tolerance, and error correction for the single-chip microcomputers. Initial SCM identification is achieved by signal responses between the single-chip microcomputers and the cluster management computer. By using port priority and the parallelization of serial communications, the system's real-time performance is maximized. The real-time performance can be adjusted and improved by increasing or decreasing the number of COMs and the ports linked to each COM, and it can also be raised by configuring more cluster management computers. Fault-tolerant control occurs in the initialization phase and the operational phase. In the initialization phase, the cluster management system incorporates unidentified single-chip microcomputers into the system based on the history information recorded on external storage media. In the operational phase, if the number of read and write errors on a single-chip microcomputer reaches a predetermined threshold, the single-chip microcomputer is regarded as seriously faulty or absent. The cluster management system maintains an accuracy maintenance database on an external storage medium to handle the nonlinear control of specific devices and accuracy maintenance due to wear. The cluster management system uses an object-oriented method to design a unified driving framework so that the implementation of the cluster management system is simplified, standardized, and easy to port. The system has been applied in a large-scale simulation system of 230 single-chip microcomputers, which shows that the system is reliable, real-time, and easy to maintain.
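The operational-phase rule in the abstract above (a single-chip microcomputer is treated as faulty once its read/write errors reach a predetermined threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `ScmHealth`, the method `record_io`, and the threshold value of 3 are hypothetical, and the sketch assumes that errors are counted cumulatively rather than consecutively, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass
class ScmHealth:
    """Tracks read/write errors for one single-chip microcomputer (SCM).

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    error_threshold: int = 3   # assumed value; the abstract gives no number
    error_count: int = 0
    faulty: bool = False

    def record_io(self, ok: bool) -> None:
        """Record one read/write outcome; mark the SCM faulty at the threshold."""
        if self.faulty:
            return  # already excluded; nothing more to count
        if not ok:
            # Assumption: errors accumulate and are not reset by a success.
            self.error_count += 1
            if self.error_count >= self.error_threshold:
                self.faulty = True


scm = ScmHealth(error_threshold=3)
for ok in [True, False, False, True, False]:
    scm.record_io(ok)
print(scm.faulty)  # True: three failed operations were recorded
```

A variant that resets `error_count` after a successful operation would tolerate occasional transient errors; which behaviour the described system uses is not stated in the abstract.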


Author(s):  
B. Darsana ◽  
Karabi Konar

Current advances in portable devices, wireless technologies, and distributed systems have created a mobile computing environment that is characterized by a high degree of dynamism. Differences in network connectivity, platform capability, and resource availability can significantly affect application performance. Traditional middleware systems are not prepared to offer proper support for addressing the dynamic aspects of mobile systems. Modern distributed applications need a middleware that is capable of adapting to environment changes and that supports the required level of quality of service. This paper presents the experience of several research projects related to next-generation middleware systems. We first indicate the major challenges in mobile computing systems and try to identify the main requirements for mobile middleware systems. The different categories of mobile middleware technologies are then reviewed and their strengths and weaknesses are analyzed.

