Fault Tolerance

The use of large scale distributed systems has expanded in recent years from scientific to commercial applications, and with that expansion new requirements have emerged. Among them, fault tolerance is needed by more and more modern distributed applications, not only by critical ones. In this chapter we analyze existing work on enabling fault tolerance in large scale distributed systems, presenting specific problems, existing solutions, and several future trends. The characteristics of these systems make fault tolerance hard to ensure, particularly because of their complexity (many geographically distributed resources and users), the volatility of resources that are available only for limited amounts of time, and the constraints imposed by applications and resource owners. A general fault tolerant architecture should comprise, at a minimum, a mechanism to detect failures and a component capable of handling and recovering from the detected failures, usually through some form of replication. We analyze existing fault tolerance implementations, as well as solutions adopted in real world large scale distributed systems, and the fault tolerance architectures proposed for particular distributed architectures, such as Grid or P2P systems.
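
The two minimum components named in this abstract, a failure detector and a recovery component backed by replication, can be illustrated with a small sketch. It is not taken from the chapter: the `HeartbeatDetector` class, the `pick_replica` helper, and the timeout-based suspicion rule are assumptions chosen for illustration, and timestamps are passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Hypothetical detector: suspects a node whose last heartbeat is older than `timeout`."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Record the time at which `node` last reported itself alive.
        self.last_seen[node] = now

    def suspected(self, now: float) -> set:
        # A node is suspected when its last heartbeat is more than `timeout` old.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


def pick_replica(replicas, suspected):
    """Recovery step: return the first replica that is not suspected, or None."""
    for node in replicas:
        if node not in suspected:
            return node
    return None


detector = HeartbeatDetector(timeout=5.0)
detector.heartbeat("node-a", now=0.0)
detector.heartbeat("node-b", now=4.0)
failed = detector.suspected(now=6.0)                # node-a has been silent for 6 s
target = pick_replica(["node-a", "node-b"], failed)  # fail over to node-b
```

A timeout-based detector can wrongly suspect a slow but healthy node, so a real system would typically tune the timeout or combine several signals before triggering recovery.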

A Fault Tolerant Decentralized Scheduling in Large Scale Distributed Systems

Handbook of Research on P2P and Grid Systems for Service-Oriented Computing ◽

10.4018/978-1-61520-686-5.ch024 ◽

2010 ◽

pp. 566-588 ◽

Cited By ~ 2

Author(s):

Florin Pop

Keyword(s):

Distributed Systems ◽

High Performance ◽

Large Scale ◽

Fault Tolerant ◽

Optimal Algorithm ◽

Distributed Applications ◽

Distributed Scheduling ◽

Agent Based ◽

Decentralized Scheduling ◽

Optimization Schemes

This chapter presents a fault tolerant framework for application scheduling in large scale distributed systems (LSDS). Due to the specific characteristics and requirements of distributed systems, a good scheduling model should be dynamic. More specifically, it should adapt its scheduling decisions to changes in resource state, which are commonly captured through monitoring. The scheduler and the monitor are two important middleware components that coordinate their actions to ensure the high performance execution of distributed applications. The chapter presents and analyzes an agent-based architecture for scheduling in large scale distributed systems, followed by user and resource management. The optimization schemes for scheduling are built around a near-optimal algorithm for distributed scheduling, and the chapter presents the resulting solution for scheduling optimization. Finally, the chapter explains the fault tolerance cases for Grid environments and describes two possible scenarios for the scheduling system.

A Probabilistic Fault-Tolerant Recovery Mechanism for Task and Result Certification of Large-Scale Distributed Applications

Advances in Grid and Pervasive Computing - Lecture Notes in Computer Science ◽

10.1007/978-3-642-01671-4_42 ◽

2009 ◽

pp. 471-482

Author(s):

Rim Chayeh ◽

Christophe Cerin ◽

Mohamed Jemni

Keyword(s):

Large Scale ◽

Fault Tolerant ◽

Distributed Applications ◽

Recovery Mechanism

Applications

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Large-Scale Distributed Computing and Applications ◽

10.4018/978-1-61520-703-9.ch011 ◽

2010 ◽

pp. 235-252

Author(s):

Valentin Cristea ◽

Ciprian Dobre ◽

Corina Stratan ◽

Florin Pop

Keyword(s):

Distributed Systems ◽

Large Scale ◽

P2P Systems ◽

Grid Applications ◽

The Subject ◽

Entire World

This chapter covers applications in large scale distributed systems (LSDS). It is organized in two parts, which present two aspects of LSDS applications: an overview of applications worldwide and the methods used to develop them. The chapter also describes current projects and applications in large scale distributed systems, such as applications from the OSG projects in the USA, EGEE and SEE-GRID applications in Europe and Asia, and the DEISA initiative (Distributed European Infrastructure for Supercomputing Applications). This part also presents the relevant applications for specific kinds of LSDS, such as Grids and P2P systems.

Security

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Large-Scale Distributed Computing and Applications ◽

10.4018/978-1-61520-703-9.ch009 ◽

2010 ◽

pp. 194-216

Author(s):

Valentin Cristea ◽

Ciprian Dobre ◽

Corina Stratan ◽

Florin Pop

Keyword(s):

Distributed Systems ◽

Access Control ◽

Key Management ◽

Secure Communication ◽

Large Scale ◽

Internet Banking ◽

Distributed Applications ◽

Sensitive Information ◽

Security Research ◽

Security Models

Security in distributed systems is a combination of the confidentiality, integrity and availability of their components. It mainly targets the communication channels between users and/or processes located on different computers, the access control of users and processes to resources and services, and the management of keys, users and user groups. Distributed systems are more vulnerable to security threats because of several characteristics, such as their large scale, the distributed nature of control, and the remote nature of access. In addition, an increasing number of distributed applications (such as Internet banking) manipulate sensitive information and have special security requirements. After discussing important security concepts in the Background section, this chapter addresses several important problems that are the focus of current research in the security of large scale distributed systems: security models (which represent the theoretical foundation for solving security problems), access control (more specifically, access control in distributed multi-organizational platforms), secure communication (with emphasis on secure group communication, which is a hot topic in security research today), security management (especially key management for collaborative environments), secure distributed architectures (which are the blueprints for designing and building security systems), and security environments and frameworks.

Checkpointing Algorithms for Fault-Tolerant Execution of Large-Scale Distributed Applications in Cloud

Wireless Personal Communications ◽

10.1007/s11277-020-07949-0 ◽

2020 ◽

Author(s):

Priti Kumari ◽

Parmeet Kaur

Keyword(s):

Large Scale ◽

Fault Tolerant ◽

Distributed Applications

Fault-tolerant strategy and workspace of the subreflector parallel adjusting mechanism

Proceedings of the Institution of Mechanical Engineers Part C Journal of Mechanical Engineering Science ◽

10.1177/0954406219865289 ◽

2019 ◽

Vol 233 (18) ◽

pp. 6656-6667

Author(s):

Jiantao Yao ◽

Bo Han ◽

Yuchao Dou ◽

Yundou Xu ◽

Yongsheng Zhao

Keyword(s):

Fault Tolerance ◽

Large Scale ◽

Fault Tolerant ◽

Degree Of Freedom ◽

Proper Function ◽

Maintenance Cost ◽

Mechanical Equipment ◽

Identification Method ◽

Boundary Identification ◽

Workspace Boundary

Parallel mechanisms have been widely used in large-scale and heavy-duty attitude adjustment equipment. To improve the work reliability of the subreflector parallel adjusting mechanism, this paper proposes a fault-tolerant strategy based on the redundant degree-of-freedom and a workspace boundary identification method, which together allow the mechanism to function properly when it has a driven fault. The configuration and parameters of the parallel adjusting mechanism are introduced first. The degrees-of-freedom of the mechanism are then calculated for the case of a driven fault, and the principle of the fault-tolerant strategy based on the redundant degree-of-freedom is deduced in detail. Next, a method is proposed to solve the workspace boundary identification problem for the parallel adjusting mechanism under fault tolerance conditions, and the maximum and minimum workspaces of the mechanism under fault tolerance conditions in different frequency bands are analyzed. The results show that the workspace calculated by the fault-tolerant strategy in the fault condition can completely meet the needs of the subreflector, and that the method can also be applied to other parallel mechanisms. Lastly, an experiment is conducted to verify the correctness and effectiveness of the fault-tolerant strategy; its results show that the strategy can effectively improve the work reliability of the parallel adjusting mechanism. The fault-tolerant strategy and workspace boundary identification method allow the subreflector parallel adjusting mechanism to work normally when it has a driven fault, which can significantly improve work reliability and work efficiency and can also reduce maintenance cost. Both can also be applied to the research and development of this kind of parallel mechanical equipment.

Fault Tolerance Using a Front-End Service for Large Scale Distributed Systems

2009 11th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing ◽

10.1109/synasc.2009.13 ◽

2009 ◽

Cited By ~ 7

Author(s):

Marieta Nastase ◽

Ciprian Dobre ◽

Florin Pop ◽

Valentin Cristea

Keyword(s):

Distributed Systems ◽

Fault Tolerance ◽

Large Scale ◽

Front End

Dynamic Fault Tolerant Topology Control for Wireless Sensor Network Based on Node Cascading Failure

International Journal of Online Engineering (iJOE) ◽

10.3991/ijoe.v14i05.8644 ◽

2018 ◽

Vol 14 (05) ◽

pp. 118 ◽

Cited By ~ 1

Author(s):

Yang Xiao

Keyword(s):

Fault Tolerance ◽

Degree Distribution ◽

Large Scale ◽

Fault Tolerant ◽

Wireless Sensor ◽

Cascading Failure ◽

Network Node ◽

Simulation Test ◽

Node Failure ◽

Scale Free

To address node cascading failure (CF) in wireless sensor networks (WSNs), and considering factors such as node load and maximum capacity in a scale-free topology, this paper establishes a WSN dynamic fault tolerant topology model based on node cascading failure, analyses the relationships between node load, topology and dynamic fault tolerance, and demonstrates the proposed model through simulation tests. It studies the effects of the topology parameters and load on node cascading failure under random node failure, and uses theoretical derivation to obtain the structural features of the scale-free topology and the capacity limit for large-scale cascading failure in WSNs, effectively enhancing the cascading fault tolerance of traditional WSNs. The simulation results show that, as the degree distribution parameter C increases, the minimum network node degree increases accordingly, and the dynamic fault tolerance is better in highly dense topologies; as the parameter λ increases, the maximum node degree gradually decreases and the degree distribution of the topology tends to be uniform, so that large-scale cascading failure caused by node failure has less influence on the WSN, which further improves the dynamic fault tolerance of the system.

A Replica Based Co-Scheduler (RBS) for Fault Tolerant Computational Grid

Cloud, Grid and High Performance Computing ◽

10.4018/978-1-60960-603-9.ch007 ◽

2011 ◽

pp. 101-116

Author(s):

Zahid Raza ◽

Deo P. Vidyarthi

Keyword(s):

Fault Tolerance ◽

Performance Optimization ◽

Large Scale ◽

Heterogeneous Computing ◽

Fault Tolerant ◽

Turnaround Time ◽

Computational Grid ◽

Successful Execution ◽

Computational Resources ◽

The Cost

A Grid is a parallel and distributed computing network system comprising heterogeneous computing resources spread over multiple administrative domains that offers high throughput computing. Since the Grid operates at a large scale, there is always a possibility of failure, ranging from hardware to software. The penalty paid for these failures may be very large. The system needs to tolerate the various possible failures which, in spite of many precautions, are bound to happen. Replication is a strategy often used to introduce fault tolerance in the system and ensure successful execution of a job even when some of the computational resources fail. Though replication incurs a heavy cost, a selective degree of replication can offer a good compromise between performance and cost. This chapter proposes a co-scheduler that can be integrated with the main scheduler for the execution of jobs submitted to a computational Grid. The main scheduler may have any performance optimization criterion; integrating the co-scheduler adds fault tolerance on top of it. The chapter evaluates the performance of the co-scheduler with a main scheduler designed to minimize the turnaround time of a modular job, introducing module replication to counter the effects of node failures in a Grid. A simulation study reveals that the model works well under various conditions, resulting in a graceful degradation of the scheduler's performance while improving the overall reliability offered to the job.
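
The trade-off between replication cost and reliability mentioned in this abstract can be made concrete with a small calculation. The sketch below is not the chapter's co-scheduler: it assumes, purely for illustration, that each node fails independently with the same probability `p_fail`, so that a module with `r` replicas completes with probability 1 - p_fail^r. The function names and the `max_replicas` cap are hypothetical.

```python
def job_reliability(p_fail: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent copies succeeds."""
    return 1.0 - p_fail ** replicas


def min_replicas(p_fail: float, target: float, max_replicas: int = 16):
    """Smallest replication degree meeting `target` reliability, or None if the cap is hit."""
    for r in range(1, max_replicas + 1):
        if job_reliability(p_fail, r) >= target:
            return r
    return None


# With a 50% per-node failure probability, four replicas are the fewest
# that reach 90% reliability: 0.5, 0.75, 0.875, 0.9375.
degree = min_replicas(p_fail=0.5, target=0.9)
```

Each additional replica multiplies the remaining failure probability by `p_fail`, so the gain per replica shrinks quickly while the resource cost grows linearly, which is why a selective degree of replication can be a reasonable compromise.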

Improved Checkpoint Using the Effective Management of I/O in a Cloud Environment

Advances in Computer and Electrical Engineering - Advanced Methodologies and Technologies in Network Architecture, Mobile Computing, and Data Analytics ◽

10.4018/978-1-5225-7598-6.ch016 ◽

2019 ◽

pp. 219-233

Author(s):

Bakhta Meroufel ◽

Ghalem Belalem

Keyword(s):

Fault Tolerance ◽

Execution Time ◽

Large Scale ◽

Fault Tolerant ◽

Experimental Results ◽

Cloud Environment ◽

Large Scale Systems ◽

Major Disadvantage ◽

Reliability And Robustness ◽

Effective Use

One of the most important requirements for effective use of a cloud environment is undoubtedly the reliability and robustness of the services it provides. Fault tolerance is necessary to ensure that reliability and to reduce SLA violations. Checkpointing is a popular fault tolerance technique in large-scale systems. However, its major disadvantage is the overhead caused by the time needed to store checkpoint files, which increases execution time and reduces the chance of meeting the desired deadlines. In this chapter, the authors propose a checkpointing strategy with lightweight storage. The storage is provided by creating a virtual topology, VRbIO, and using an intelligent and fault tolerant I/O technique, CSDS (collective and selective data sieving). The proposal is executed by active and reactive agents, and it solves many problems of checkpointing with standard I/O. To evaluate the approach, the authors compare it with checkpointing that uses ROMIO as the I/O strategy. Experimental results show the effectiveness and reliability of the proposed approach.
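
For readers unfamiliar with the technique, basic checkpoint and restart can be sketched as follows. This is a generic illustration, not the authors' VRbIO or CSDS approach: the JSON file format, the `save_checkpoint`, `load_checkpoint` and `run` functions, and the `fail_at` parameter used to simulate a crash are all assumptions. Every call to `save_checkpoint` is storage time added to the run, which is the overhead the abstract refers to.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over `path` so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, every, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `every` steps and resuming from the last checkpoint."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

After a failure, calling `run` again with the same `path` repeats only the steps completed since the last checkpoint. A smaller `every` loses less work per failure but pays the storage cost more often.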
