Preliminary Models of the Cost of Fault Tolerance

Author(s):  
Ronald J. Leach


Author(s):
Zahid Raza ◽  
Deo P. Vidyarthi

Grid is a parallel and distributed computing network system comprising heterogeneous computing resources spread over multiple administrative domains, and it offers high-throughput computing. Since the Grid operates at a large scale, there is always a possibility of failure, ranging from hardware to software, and the penalty paid for these failures can be very large. The system needs to tolerate the various possible failures which, in spite of many precautions, are bound to happen. Replication is a strategy often used to introduce fault tolerance into the system, ensuring successful execution of a job even when some of the computational resources fail. Though replication incurs a heavy cost, a selective degree of replication can offer a good compromise between performance and cost. This chapter proposes a co-scheduler that can be integrated with the main scheduler for the execution of jobs submitted to a computational Grid. The main scheduler may use any performance optimization criterion; integrating the co-scheduler adds fault tolerance on top of it. The chapter evaluates the performance of the co-scheduler together with a main scheduler designed to minimize the turnaround time of a modular job, using module replication to counter the effects of node failures in the Grid. A simulation study reveals that the model works well under various conditions, resulting in graceful degradation of the scheduler's performance while improving the overall reliability offered to the job.
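
The replication idea lends itself to a simple simulation. The sketch below is illustrative only (node counts, failure probability, and function names are assumptions, not the chapter's model): each module is copied onto `degree` distinct nodes, and the job survives a round of node failures as long as every module keeps at least one live replica.

```python
import random

def co_schedule(modules, nodes, degree):
    """Assign each module to `degree` distinct nodes (illustrative policy)."""
    return {m: random.sample(nodes, degree) for m in modules}

def job_survives(assignment, failed_nodes):
    """The job finishes iff every module kept at least one live replica."""
    return all(any(n not in failed_nodes for n in replicas)
               for replicas in assignment.values())

if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(10)]
    modules = ["m1", "m2", "m3", "m4"]
    trials, p_fail = 10_000, 0.2           # assumed per-node failure probability
    for degree in (1, 2, 3):               # degree 1 = no replication
        ok = sum(
            job_survives(co_schedule(modules, nodes, degree),
                         {n for n in nodes if random.random() < p_fail})
            for _ in range(trials))
        print(f"degree={degree}: estimated job reliability {ok / trials:.3f}")
```

Raising the degree improves reliability but multiplies the compute cost, which is exactly the performance/cost compromise the chapter explores.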


2014 ◽  
Vol 61 (3) ◽  
pp. 1-64 ◽  
Author(s):  
Binbin Chen ◽  
Haifeng Yu ◽  
Yuda Zhao ◽  
Phillip B. Gibbons

Checkpointing a consistent global state is one of the commonly used methods to provide fault tolerance in distributed systems, so that the system can keep operating even if one or more components fail. However, mobile computing systems are constrained by low bandwidth, mobility, lack of stable storage, frequent disconnections, and limited battery life, so checkpointing protocols that take fewer checkpoints are favored in mobile environments. In this paper, we propose a minimum-process coordinated checkpointing protocol for deterministic distributed applications on mobile computing systems. We eliminate useless checkpoints, as well as the blocking of processes during checkpointing, at the cost of logging anti-messages for a very small number of messages. We also try to minimize the loss of checkpointing effort when any process fails to take its checkpoint during an initiation, thereby coping with repeated failures during checkpointing.
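
As an illustration only (the sketch below uses invented names and greatly simplified semantics, not the paper's actual protocol), the minimum-process idea can be reduced to computing the set of processes on which the initiator transitively depends through message exchange; processes outside this set are not asked to checkpoint.

```python
def minimum_checkpoint_set(initiator, depends_on):
    """depends_on[p] = set of processes from which p has received messages."""
    needed, stack = {initiator}, [initiator]
    while stack:                      # depth-first walk of the dependency graph
        p = stack.pop()
        for q in depends_on.get(p, set()):
            if q not in needed:
                needed.add(q)
                stack.append(q)
    return needed

if __name__ == "__main__":
    # Hypothetical dependency graph: P0 heard from P1, P1 from P2; P3 is idle.
    deps = {"P0": {"P1"}, "P1": {"P2"}}
    print(sorted(minimum_checkpoint_set("P0", deps)))  # ['P0', 'P1', 'P2']; P3 skipped
```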


Author(s):  
Said Limam ◽  
Ghalem Belalem

Cloud computing has become a significant technology and a great solution for providing a flexible, on-demand, and dynamically scalable computing infrastructure for many applications; it also represents a major technology trend. With cloud computing, users employ a variety of devices to access programs, storage, and application-development platforms over the Internet, via services offered by cloud computing providers. The probability that a failure occurs during execution grows as the number of nodes increases; since it is impossible to fully prevent failures, one solution is to implement fault tolerance mechanisms. Fault tolerance has become a major task for computer engineers and software developers because the occurrence of faults increases the cost of using resources. In this paper, the authors propose an approach that combines migration and checkpoint mechanisms. The checkpoint mechanism minimizes the time lost and reduces the effect of failures on application execution, while the migration mechanism guarantees the continuity of application execution and avoids losses due to hardware failures, in a transparent and efficient way. The simulation results show the effectiveness of the approach in terms of execution time and the masking of failure effects.
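
A rough sketch of how the two mechanisms could interact is given below; the failure model, interval, and names are assumptions for illustration, not the authors' implementation. On a failure, the task rolls back to its last checkpoint and migrates to a different node instead of restarting from scratch.

```python
import random

def run_with_checkpoint_and_migration(total_steps, interval, p_fail, nodes):
    node, checkpoint, progress, wasted = nodes[0], 0, 0, 0
    while progress < total_steps:
        progress += 1
        if random.random() < p_fail:          # the current node fails
            wasted += progress - checkpoint   # work since last checkpoint is lost
            progress = checkpoint             # roll back to the checkpoint...
            node = random.choice([n for n in nodes if n != node])  # ...and migrate
            continue
        if progress % interval == 0:
            checkpoint = progress             # periodic checkpoint
    return wasted, node

if __name__ == "__main__":
    random.seed(1)
    wasted, node = run_with_checkpoint_and_migration(
        total_steps=1000, interval=50, p_fail=0.01,
        nodes=[f"vm{i}" for i in range(4)])
    print(f"job completed on {node}; only {wasted} steps were recomputed")
```

Without the checkpoint mechanism, every failure would waste all progress so far; without migration, the task could be rescheduled onto the same faulty node.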


VLSI Design ◽  
2007 ◽  
Vol 2007 ◽  
pp. 1-13 ◽  
Author(s):  
Teijo Lehtonen ◽  
Pasi Liljeberg ◽  
Juha Plosila

We propose link structures for NoC that efficiently tolerate transient, intermittent, and permanent errors. This is a necessary step toward implementing reliable systems in future nanoscale technologies. Protection against transient errors is realized using Hamming coding and interleaving for error detection, with retransmission as the recovery method. We introduce two approaches for tackling intermittent and permanent errors. In the first approach, spare wires are introduced together with reconfiguration circuitry. The other approach uses time redundancy: the transmission is split into two parts, and the data is doubled. In both structures the presence of permanent or intermittent errors is monitored by analyzing previous error syndromes. The links are based on self-timed signaling, in which the handshake signals are protected using triple modular redundancy. We present the structures, operation, and designs of the different link components. The fault tolerance properties are analyzed using a fault model containing temporary, intermittent, and permanent faults that occur both as bursts and as single faults. The results show a considerable enhancement in fault tolerance at the cost of performance and area, with only a slight increase in power consumption.
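
To make the two coding mechanisms concrete, here is a small illustrative sketch of Hamming(7,4) single-error correction and a majority voter of the kind used for triple-modular-redundant handshake signals; it models only the coding, not the paper's link circuitry or interleaving.

```python
def hamming74_encode(d):
    """d: four data bits -> 7-bit codeword (positions 1..7)."""
    c = [0] * 8                        # index 0 unused, for 1-based positions
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]          # parity over positions 1,3,5,7
    c[2] = c[3] ^ c[6] ^ c[7]          # parity over positions 2,3,6,7
    c[4] = c[5] ^ c[6] ^ c[7]          # parity over positions 4,5,6,7
    return c[1:]

def hamming74_decode(word):
    """Correct up to one flipped bit, then return the four data bits."""
    c = [0] + list(word)
    s = ((c[1] ^ c[3] ^ c[5] ^ c[7])
         + 2 * (c[2] ^ c[3] ^ c[6] ^ c[7])
         + 4 * (c[4] ^ c[5] ^ c[6] ^ c[7]))
    if s:                              # syndrome = position of the flipped bit
        c[s] ^= 1
    return [c[3], c[5], c[6], c[7]]

def tmr_vote(a, b, c):
    """Majority vote over three copies of a handshake bit."""
    return (a & b) | (a & c) | (b & c)

if __name__ == "__main__":
    code = hamming74_encode([1, 0, 1, 1])
    code[2] ^= 1                       # inject a single transient bit flip
    print(hamming74_decode(code))      # -> [1, 0, 1, 1]: the error is corrected
    print(tmr_vote(1, 1, 0))           # -> 1: one faulty handshake wire masked
```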


Cloud computing, as a delivery model, is moving ahead swiftly, being adopted by small and large organizations alike, and this new model opens up many research challenges. Since cloud computing services are offered over the Internet on a pay-per-use basis, it is essential to provide fault tolerant services to users. To ensure high availability, data centers are replicated; replication is costly, but its reliability benefit outweighs the cost. A vast amount of work on fault tolerance exists for other computing environments, but it cannot be applied directly to the cloud, which creates an opportunity for new, effective solutions. In this paper, we propose policies for delivering fault tolerant services in a private cloud computing environment, focused on virtual machine allocation. The experimental test results and the derived policies are described with respect to virtual machine provisioning.
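
One plausible policy of this kind, sketched below purely for illustration (function and host names are invented; the paper's actual policies may differ), is anti-affinity placement: replicas of the same service are provisioned on distinct hosts, so a single host failure can never take down every replica.

```python
def place_replicas(service, replicas, hosts, capacity, placement):
    """Greedy anti-affinity placement; appends to `placement`, returns hosts used."""
    def load(h):
        return sum(1 for _, host in placement if host == h)
    chosen = []
    for _ in range(replicas):
        candidates = [h for h in hosts if h not in chosen and load(h) < capacity[h]]
        if not candidates:
            raise RuntimeError("not enough independent hosts for all replicas")
        host = min(candidates, key=load)   # prefer the least-loaded eligible host
        placement.append((service, host))
        chosen.append(host)
    return chosen

if __name__ == "__main__":
    hosts = ["host1", "host2", "host3"]
    capacity = {h: 2 for h in hosts}       # max VMs per host (assumed)
    placement = []                         # global list of (service, host) pairs
    print(place_replicas("web", 2, hosts, capacity, placement))
    print(place_replicas("db", 2, hosts, capacity, placement))
```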


2011 ◽  
Vol 1 (4) ◽  
pp. 60-69 ◽  
Author(s):  
Ghalem Belalem ◽  
Said Limam

Cloud computing refers both to applications delivered as services over the Internet and to the hardware and systems software in the datacenters that provide those services. Failures of all types are common in current datacenters, partly due to the sheer number of nodes. Fault tolerance has become a major task for computer engineers and software developers because the occurrence of faults increases the cost of using resources; moreover, the most fundamental user expectation is, of course, that an application finishes correctly regardless of faults in the nodes. This paper proposes a fault tolerant architecture for cloud computing that uses an adaptive checkpoint mechanism to ensure that a running task can finish correctly in spite of faults in the nodes on which it runs. The proposed fault tolerant architecture is simultaneously transparent and scalable.
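
The abstract does not spell out the adaptation rule, so the sketch below shows only one way an adaptive checkpoint interval could behave (class name, thresholds, and the halving/growth factors are assumptions): checkpoints are taken more often when failures are frequent and less often while the node stays healthy.

```python
class AdaptiveCheckpointer:
    def __init__(self, interval=100, lo=10, hi=1000):
        self.interval, self.lo, self.hi = interval, lo, hi
        self.last_saved = 0               # step of the most recent checkpoint

    def on_progress(self, step, save):
        if step - self.last_saved >= self.interval:
            save(step)                    # persist the task state
            self.last_saved = step
            # no failure since the last checkpoint: relax the interval
            self.interval = min(self.hi, int(self.interval * 1.5))

    def on_failure(self):
        # failures are frequent: checkpoint more aggressively from now on
        self.interval = max(self.lo, self.interval // 2)
        return self.last_saved            # step to roll the task back to

if __name__ == "__main__":
    cp, saved = AdaptiveCheckpointer(), []
    for step in range(1, 301):
        cp.on_progress(step, saved.append)
    print(saved, "-> after a fault, restart from step", cp.on_failure())
```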


Author(s):  
James F. Mancuso

IBM PC compatible computers are widely used in microscopy for applications ranging from control to image acquisition and analysis. The choice of IBM PC based systems over competing computer platforms can be based on technical merit alone or on a number of factors relating to economics, availability of peripherals, management dictum, or simple personal preference. The IBM PC got a strong "head start" by first dominating clerical, document-processing, and financial applications. The use of these computers spilled over into the laboratory, where the DOS-based IBM PC replaced minicomputers. Compared to a minicomputer, the PC provided a more cost-effective platform for applications in numerical analysis, engineering and design, instrument control, image acquisition, and image processing. In addition, the site-wide use of a common PC platform could reduce the cost of training and support services relative to cases where many different computer platforms were used. This could be especially true for microscopists, who must use computers in both the laboratory and the office.

