Fault Tolerance

Author(s):  
Valentin Cristea ◽  
Ciprian Dobre ◽  
Corina Stratan ◽  
Florin Pop

The domains of usage of large scale distributed systems have been extending during the past years from scientific to commercial applications. Together with the extension of the application domains, new requirements have emerged for large scale distributed systems. Among these requirements, fault tolerance is needed by more and more modern distributed applications, not only by the critical ones. In this chapter we analyze current existing work in enabling fault tolerance in case of large scale distributed systems, presenting specific problem, existing solution, as well as several future trends. The characteristics of these systems pose problems to ensuring fault tolerance especially because of their complexity, involving many resources and users geographically distributed, because of the volatility of resources that are available only for limited amounts of time, and because of the constraints imposed by the applications and resource owners. A general fault tolerant architecture should, at a minimum, be comprised of at least a mechanism to detect failures and a component capable to recover and handle the detected failures, usually using some form of a replication mechanism. In this chapter we analyzed existing fault tolerance implementations, as well as solutions adopted in real world large scale distributed systems. We analyzed the fault tolerance architectures being proposed for particular distributed architectures, such as Grid or P2P systems.

Author(s):  
Florin Pop

This chapter presents a fault tolerant framework for the applications scheduling in large scale distributed systems (LSDS). Due to the specific characteristics and requirements of distributed systems, a good scheduling model should be dynamic. More specifically, it should adapt the scheduling decisions to resource state changes, which are commonly captured through monitoring. The scheduler and the monitor are two important middleware pieces that correlate their actions to ensure the high performance execution of distributed applications. The chapter presents and analyses agent based architecture for scheduling in large scale distributed systems. Then the user and resources management are presented. Optimization schemes for scheduling consider the near-optimal algorithm for distributed scheduling. The chapter presents the solution for scheduling optimization. The chapter covers and explains the fault tolerance cases for Grid environments and describes two possible scenarios for scheduling system.


Author(s):  
Valentin Cristea ◽  
Ciprian Dobre ◽  
Corina Stratan ◽  
Florin Pop

This chapter covers the subject of application in LSDS. The chapter is organized in two parts. The chapter parts present two aspect of application in LSDS: the overview of applications in entire world and the method of applications development. It is also presented a description of current projects and applications in large scale distributed systems, like applications from OSG projects in USA, EGEE and SEE-GRID applications in Europe and Asia, DEISA initiative (Distributed European Infrastructure for Supercomputing Applications). This part also presents the relevant applications in LSDS, like Grids, P2P systems.


Author(s):  
Valentin Cristea ◽  
Ciprian Dobre ◽  
Corina Stratan ◽  
Florin Pop

Security in distributed systems is a combination of confidentiality, integrity and availability of their components. It mainly targets the communication channels between users and/or processes located in different computers, the access control of users / processes to resources and services, and the management of keys, users and user groups. Distributed systems are more vulnerable to security threats due to several characteristics such as their large scale, the distributed nature of the control, and the remote nature of the access. In addition, an increasing number of distributed applications (such as Internet banking) manipulate sensitive information and have special security requirements. After discussing important security concepts in the Background section, this chapter addresses several important problems that are at the aim of current research in the security of large scale distributed systems: security models (which represent the theoretical foundation for solving security problems), access control (more specific the access control in distributed multi-organizational platforms), secure communication (with emphasis on the secure group communication, which is a hot topic in security research today), security management (especially key management for collaborative environments), secure distributed architectures (which are the blueprints for designing and building security systems), and security environments / frameworks.


Author(s):  
Jiantao Yao ◽  
Bo Han ◽  
Yuchao Dou ◽  
Yundou Xu ◽  
Yongsheng Zhao

Parallel mechanism has been widely used in large-scale and heavy-duty attitude adjustment equipment. In order to improve the work reliability of the subreflector parallel adjusting mechanism, a fault-tolerant strategy based on the redundant degree-of-freedom and a workspace boundary identification method are proposed in this paper, which can realize the proper function of the subreflector parallel adjusting mechanism when it has a driven fault. The configuration and parameters of the parallel adjusting mechanism are introduced firstly, then the degrees-of-freedom of the parallel adjusting mechanism is calculated when it has a driven fault, and the principle of the fault-tolerant strategy based on the redundant degree-of-freedom is deduced in detail. Next, the method to solve the workspace boundary identification problem for the parallel adjusting mechanism in fault tolerance conditions is proposed, the maximum and minimum workspaces of the parallel adjusting mechanism at fault tolerance conditions in different frequency bands are analyzed. The results showed that the workspace calculated by the fault-tolerant strategy in the fault condition can completely meet the needs of the subreflector, where this method can also be applied to other parallel mechanisms. Lastly, an experiment is conducted to verify the correctness and effectiveness of the fault-tolerant strategy, in which the results showed that the fault-tolerant strategy can effectively improve the work reliability of the parallel adjusting mechanism. The fault-tolerant strategy and workspace boundary identification method can make the subreflector parallel adjusting mechanism work normally when it has a driven fault, which can significantly improve the work reliability and work efficiency and the maintenance cost can also be reduced. The fault-tolerant strategy and workspace boundary identification method can also be well applied to the research and development for this kind of parallel mechanical equipment.


2018 ◽  
Vol 14 (05) ◽  
pp. 118 ◽  
Author(s):  
Yang Xiao

To address the node cascading failure (CF) of the wireless sensor networks (WSNs), considering such factors as node load and maximum capacity in scale-free topology, this paper establishes the WSN dynamic fault tolerant topology model based on node cascading failure, analyses the relationships between node load, topology and dynamic fault tolerance, and demonstrates the proposed model through simulation test. It studies the effects of topology parameter and load in case of random node failure in the network node cascading failure, and utilizes the theoretical derivation method to derive the structural feature of scale-free topology and the capacity limit for the WSNs large-scale cascading failure, effectively enhancing the cascading fault tolerance of traditional WSNs. The simulation test results show that, with the degree distribution parameter <em>C</em> increasing, the minimum network node degree will increase accordingly, and in highly intensive topology, the dynamic fault tolerance will be better; with the parameter<em> λ </em>increasing, the maximum degree of the network node will gradually decrease, and the degree distribution of topology structure tends to be uniform, so that the large-scale cascading failure caused by node failure will have less influence on the WSN, and further improve the dynamic fault tolerance performance of the system.


Author(s):  
Zahid Raza ◽  
Deo P. Vidyarthi

Grid is a parallel and distributed computing network system comprising of heterogeneous computing resources spread over multiple administrative domains that offers high throughput computing. Since the Grid operates at a large scale, there is always a possibility of failure ranging from hardware to software. The penalty paid of these failures may be on a very large scale. System needs to be tolerant to various possible failures which, in spite of many precautions, are bound to happen. Replication is a strategy often used to introduce fault tolerance in the system to ensure successful execution of the job, even when some of the computational resources fail. Though replication incurs a heavy cost, a selective degree of replication can offer a good compromise between the performance and the cost. This chapter proposes a co-scheduler that can be integrated with main scheduler for the execution of the jobs submitted to computational Grid. The main scheduler may have any performance optimization criteria; the integration of co-scheduler will be an added advantage towards fault tolerance. The chapter evaluates the performance of the co-scheduler with the main scheduler designed to minimize the turnaround time of a modular job by introducing module replication to counter the effects of node failures in a Grid. Simulation study reveals that the model works well under various conditions resulting in a graceful degradation of the scheduler’s performance with improving the overall reliability offered to the job.


Author(s):  
Bakhta Meroufel ◽  
Ghalem Belalem

One of the most important points for more effective use in the environment of cloud is undoubtedly the study of reliability and robustness of services related to this environment. In this case, fault tolerance is necessary to ensure that reliability and reduce the SLA violation. Checkpointing is a popular fault tolerance technique in large-scale systems. However, its major disadvantage is the overhead caused by the storage time of checkpointing files, which increases the execution time and minimizes the possibility to meet the desired deadlines. In this chapter, the authors propose a checkpointing strategy with lightweight storage. The storage is provided by creating a virtual topology VRbIO and the use of an intelligent and fault tolerant I/O technique CSDS (collective and selective data sieving). The proposal is executed by active and reactive agents and it solves many problems of checkpointing with standard I/O. To evaluate the approach, the authors compare it with a checkpointing with ROMIO as I/O strategy. Experimental results show the effectiveness and reliability of the proposed approach.


Sign in / Sign up

Export Citation Format

Share Document