A Probabilistic Fault-Tolerant Recovery Mechanism for Task and Result Certification of Large-Scale Distributed Applications

The use of large scale distributed systems has extended over the past years from scientific to commercial applications. As the application domains have broadened, new requirements have emerged for these systems. Among them, fault tolerance is needed by a growing number of modern distributed applications, not only by the critical ones. In this chapter we analyze existing work on enabling fault tolerance in large scale distributed systems, presenting specific problems, existing solutions, and several future trends. The characteristics of these systems make fault tolerance difficult to ensure, especially because of their complexity, which involves many geographically distributed resources and users, because of the volatility of resources that are available only for limited amounts of time, and because of the constraints imposed by applications and resource owners. A general fault tolerant architecture should comprise at least a mechanism to detect failures and a component capable of recovering from and handling the detected failures, usually using some form of replication. We analyze existing fault tolerance implementations, as well as solutions adopted in real-world large scale distributed systems, and the fault tolerance architectures proposed for particular distributed architectures, such as Grid or P2P systems.
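
The two components named in the abstract above, a failure detector and a replication-based recovery step, can be illustrated with a minimal sketch. It is not taken from the chapter: the heartbeat timeout, the class and function names, and the "first live replica" policy are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FailureDetector:
    """Timeout-based detector (hypothetical): a node is suspected once
    its most recent heartbeat is older than `timeout` seconds."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Remember the time of the latest heartbeat from this node.
        self.last_seen[node] = now

    def suspected(self, now: float) -> set:
        # Nodes that have been silent for longer than the timeout.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


def recover(assignments: dict, replicas: dict, failed: set) -> dict:
    """Move each task whose node failed to its first live replica
    (assumed policy); None means no live replica remains."""
    plan = {}
    for task, node in assignments.items():
        if node in failed:
            live = [r for r in replicas.get(task, []) if r not in failed]
            plan[task] = live[0] if live else None
        else:
            plan[task] = node
    return plan


detector = FailureDetector(timeout=3.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("b", now=4.0)  # node "a" stays silent
failed = detector.suspected(now=5.0)
plan = recover({"t1": "a", "t2": "b"}, {"t1": ["b"], "t2": ["a"]}, failed)
```

Here node "a" is suspected after its silence exceeds the timeout, and task "t1" is moved to its replica on "b". Real systems would typically add replica state synchronization and handle false suspicions, which this sketch omits.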

A Fault Tolerant Decentralized Scheduling in Large Scale Distributed Systems

Handbook of Research on P2P and Grid Systems for Service-Oriented Computing; 10.4018/978-1-61520-686-5.ch024; 2010; pp. 566-588; Cited By ~ 2

Author(s): Florin Pop

Keyword(s): Distributed Systems, High Performance, Large Scale, Fault Tolerant, Optimal Algorithm, Distributed Applications, Distributed Scheduling, Agent Based, Decentralized Scheduling, Optimization Schemes

This chapter presents a fault tolerant framework for application scheduling in large scale distributed systems (LSDS). Because of the specific characteristics and requirements of distributed systems, a good scheduling model should be dynamic: it should adapt scheduling decisions to changes in resource state, which are commonly captured through monitoring. The scheduler and the monitor are two important middleware components that coordinate their actions to ensure high-performance execution of distributed applications. The chapter presents and analyzes an agent-based architecture for scheduling in large scale distributed systems, followed by user and resource management. Optimization schemes for scheduling consider a near-optimal algorithm for distributed scheduling, and the chapter presents a solution for scheduling optimization. Finally, it explains the fault tolerance cases for Grid environments and describes two possible scenarios for the scheduling system.

Using events to build large scale distributed applications

10.1145/504450.504453; 1996; Cited By ~ 8

Author(s): Richard Hayton, Jean Bacon, John Bates, Ken Moody

Keyword(s): Large Scale, Distributed Applications

RT-ZooKeeper: Taming the Recovery Latency of a Coordination Service

ACM Transactions on Embedded Computing Systems; 10.1145/3477034; 2021; Vol 20 (5s); pp. 1-22

Author(s): Haoran Li, Chenyang Lu, Christopher D. Gill

Keyword(s): Fault Tolerant, Empirical Evaluation, Distributed Applications, Edge Computing, Challenges And Opportunities, Cloud Environments, Hand Coordination, Recovery Protocols, Recovery Latency

Fault-tolerant coordination services have been widely used in distributed applications in cloud environments. Recent years have witnessed the emergence of time-sensitive applications deployed in edge computing environments, which introduces both challenges and opportunities for coordination services. On the one hand, coordination services must recover from failures in a timely manner. On the other hand, edge computing employs local networked platforms that can be exploited to achieve timely recovery. In this work, we first identify the limitations of the leader election and recovery protocols underlying Apache ZooKeeper, the prevailing open-source coordination service. To reduce recovery latency from leader failures, we then design RT-ZooKeeper with a set of novel features including a fast-convergence election protocol, a quorum channel notification mechanism, and a distributed epoch persistence protocol. We have implemented RT-ZooKeeper based on ZooKeeper version 3.5.8. Empirical evaluation shows that RT-ZooKeeper achieves a 91% reduction in maximum recovery latency in comparison to ZooKeeper. Furthermore, a case study demonstrates that fast failure recovery in RT-ZooKeeper can benefit a common messaging service like Kafka in terms of message latency.

Observer-Based Decentralized Adaptive NNs Fault-Tolerant Control of a Class of Large-Scale Uncertain Nonlinear Systems With Actuator Failures

IEEE Transactions on Systems Man and Cybernetics Systems; 10.1109/tsmc.2017.2744676; 2019; Vol 49 (3); pp. 528-542; Cited By ~ 13

Author(s): Yang Yang, Dong Yue

Keyword(s): Nonlinear Systems, Large Scale, Fault Tolerant, Fault Tolerant Control, Uncertain Nonlinear Systems, Actuator Failures

Single Chip Microcomputer Cluster Management

Advanced Materials Research; 10.4028/www.scientific.net/amr.933.584; 2014; Vol 933; pp. 584-589

Author(s): Zhi Chun Zhang, Song Wei Li, Wei Ren Wang, Wei Zhang, Li Jun Qi

Keyword(s): Real Time, Management System, Large Scale, Fault Tolerant, Single Chip, Single Chip Microcomputer, Cluster Management, The Real, Time Performance, Operational Phase

This paper presents a system in which cluster devices are controlled by single-chip microcomputers (SCMs), with emphasis on the cluster management techniques for single-chip microcomputers. Each device in a cluster is controlled by a single-chip microcomputer that collects sample data to send to the cluster management computer and drives the device using driving data received from that same computer, through COMs. The cluster management system running on the cluster management computer handles initial SCM identification, run time slice management, communication resource utilization, fault tolerance, and error correction for the single-chip microcomputers. Initial SCM identification is achieved by signal responses between the single-chip microcomputers and the cluster management computer. By using port priority and the parallelization of serial communications, the system's real-time performance is maximized. The real-time performance can be adjusted and improved by increasing or decreasing the number of COMs and the ports linked to each COM, and it can also be raised by configuring more cluster management computers. Fault-tolerant control occurs in the initialization phase and the operational phase. In the initialization phase, the cluster management system incorporates unidentified single-chip microcomputers into the system based on the history information recorded on external storage media. In the operational phase, if the number of read or write errors on a single-chip microcomputer reaches a predetermined threshold, that microcomputer is regarded as seriously faulty or absent. The cluster management system maintains an accuracy maintenance database on an external storage medium to handle nonlinear control of specific devices and accuracy maintenance due to wear. The cluster management system uses an object-oriented method to design a unified driving framework so that its implementation is simplified, standardized, and easy to port. The system has been applied in a large-scale simulation system of 230 single-chip microcomputers, demonstrating that it is reliable, real-time, and easy to maintain.
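
The operational-phase rule described in the abstract above (a microcomputer is treated as faulty once its read/write errors reach a predetermined threshold) can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the class name, the device identifier, and the choice to count errors cumulatively rather than consecutively are assumptions.

```python
class ErrorThresholdMonitor:
    """Hypothetical monitor: marks a device faulty once its cumulative
    read/write error count reaches `threshold` (assumed to be cumulative;
    the abstract does not say whether successes reset the count)."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.error_counts = {}  # device id -> errors seen so far
        self.faulty = set()     # devices regarded as faulty or absent

    def record(self, device_id: str, ok: bool) -> bool:
        """Record one read/write outcome; return True while the device
        is still considered usable."""
        if device_id in self.faulty:
            return False
        if not ok:
            count = self.error_counts.get(device_id, 0) + 1
            self.error_counts[device_id] = count
            if count >= self.threshold:
                self.faulty.add(device_id)
                return False
        return True


monitor = ErrorThresholdMonitor(threshold=3)
# Three failed operations, interleaved with successes, on one device.
outcomes = [monitor.record("scm-7", ok) for ok in (False, True, False, False, True)]
```

In this run the third error trips the threshold, so the last two calls report the device as unusable. A consecutive-error variant would reset the count on every successful operation.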

Lightweight Fault-tolerant Message Passing System for Parallel and Distributed Applications

International e-Conference of Computer Science 2006; 10.1201/b12168-6; 2007; pp. 30-33

Keyword(s): Message Passing, Fault Tolerant, Distributed Applications

Next Generation Middleware Technology for Mobile Computing

International Journal of Computer and Communication Technology; 10.47893/ijcct.2011.1074; 2011; pp. 74-80

Author(s): B. Darsana, Karabi Konar

Keyword(s): Mobile Computing, Large Scale, Network Connectivity, Distributed Applications, Mobile Systems, Portable Devices, Next Generation, Mobile Middleware, Mobile Computing Environment, Middleware Technology

Current advances in portable devices, wireless technologies, and distributed systems have created a mobile computing environment that is characterized by a high degree of dynamism. Differences in network connectivity, platform capability, and resource availability can significantly affect application performance. Traditional middleware systems are not prepared to offer proper support for the dynamic aspects of mobile systems. Modern distributed applications need middleware that is capable of adapting to environment changes and that supports the required level of quality of service. This paper presents the experience of several research projects related to next generation middleware systems. We first indicate the major challenges in mobile computing systems and try to identify the main requirements for mobile middleware systems. The different categories of mobile middleware technologies are then reviewed and their strengths and weaknesses are analyzed.

A Fault-Tolerant Strong Conjunctive Predicate Detection Algorithm for Large-Scale Networks

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum; 10.1109/ipdpsw.2013.156; 2013; Cited By ~ 1

Author(s): Min Shen, Ajay D. Kshemkalyani

Keyword(s): Large Scale, Fault Tolerant, Detection Algorithm, Predicate Detection, Large Scale Networks
