A Replica Based Co-Scheduler (RBS) for Fault Tolerant Computational Grid

Author(s):  
Zahid Raza ◽  
Deo P. Vidyarthi

A Grid is a parallel and distributed computing system comprising heterogeneous computing resources spread over multiple administrative domains that offers high-throughput computing. Since the Grid operates at a large scale, there is always a possibility of failure, ranging from hardware to software, and the penalty paid for these failures can be very large. The system needs to tolerate the various failures which, in spite of many precautions, are bound to happen. Replication is a strategy often used to introduce fault tolerance and ensure successful execution of a job even when some of the computational resources fail. Though replication incurs a heavy cost, a selective degree of replication can offer a good compromise between performance and cost. This chapter proposes a co-scheduler that can be integrated with the main scheduler for the execution of jobs submitted to a computational Grid. The main scheduler may have any performance-optimization criterion; the integration of the co-scheduler adds fault tolerance on top of it. The chapter evaluates the performance of the co-scheduler with a main scheduler designed to minimize the turnaround time of a modular job, introducing module replication to counter the effects of node failures in the Grid. A simulation study reveals that the model works well under various conditions, resulting in a graceful degradation of the scheduler’s performance while improving the overall reliability offered to the job.
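The selective-replication idea above can be sketched in a few lines of Python. This is a hypothetical illustration, not the chapter's actual co-scheduler: the round-robin placement, the fixed replication degree, and the module/node names are all assumptions.

```python
def schedule_with_replicas(modules, nodes, degree):
    """Assign each job module to `degree` distinct nodes, round-robin.

    Returns a mapping module -> list of replica nodes. A minimal sketch
    of selective replication; the chapter's co-scheduler additionally
    optimizes turnaround time under the main scheduler's criterion.
    """
    assignment = {}
    for i, m in enumerate(modules):
        assignment[m] = [nodes[(i + r) % len(nodes)] for r in range(degree)]
    return assignment

def job_succeeds(assignment, failed_nodes):
    """The modular job completes iff every module keeps >= 1 live replica."""
    return all(any(n not in failed_nodes for n in replicas)
               for replicas in assignment.values())
```

With degree 2 and replicas placed on distinct nodes, any single node failure still leaves every module with a live replica, which is the graceful-degradation effect the abstract describes.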

Author(s):  
Jiantao Yao ◽  
Bo Han ◽  
Yuchao Dou ◽  
Yundou Xu ◽  
Yongsheng Zhao

Parallel mechanisms are widely used in large-scale, heavy-duty attitude-adjustment equipment. To improve the working reliability of the subreflector parallel adjusting mechanism, this paper proposes a fault-tolerant strategy based on the redundant degree of freedom, together with a workspace-boundary identification method, which allow the mechanism to keep functioning when one of its drives fails. The configuration and parameters of the parallel adjusting mechanism are introduced first; the degrees of freedom of the mechanism under a drive fault are then calculated, and the principle of the fault-tolerant strategy based on the redundant degree of freedom is deduced in detail. Next, a method to identify the workspace boundary of the parallel adjusting mechanism under fault-tolerant conditions is proposed, and the maximum and minimum workspaces in different frequency bands are analyzed. The results show that the workspace obtained by the fault-tolerant strategy in the fault condition fully meets the needs of the subreflector, and that the method can also be applied to other parallel mechanisms. Lastly, an experiment verifies the correctness and effectiveness of the fault-tolerant strategy, with the results showing that it effectively improves the working reliability of the parallel adjusting mechanism. The fault-tolerant strategy and the workspace-boundary identification method keep the subreflector parallel adjusting mechanism working normally under a drive fault, significantly improving working reliability and efficiency while also reducing maintenance cost, and they can be readily applied in the research and development of this kind of parallel mechanical equipment.


Author(s):  
Harald Kruggel-Emden ◽  
Frantisek Stepanek ◽  
Ante Munjiza

The time-driven and event-driven discrete element methods are increasingly applied to realistic industrial-scale problems. However, they are still computationally very demanding, and realistic modeling is often limited or even impeded by the cost of the computational resources required. In this paper the time-driven and event-driven discrete element methods are reviewed, with particular attention to the available algorithms, and their options for simultaneously modeling an interstitial fluid are discussed. A potential extension of the time-driven method, currently under development, that functions as a link between the event- and time-driven methods is suggested and briefly addressed.


2018 ◽  
Vol 14 (05) ◽  
pp. 118 ◽  
Author(s):  
Yang Xiao

To address node cascading failure (CF) in wireless sensor networks (WSNs), and considering factors such as node load and maximum capacity in a scale-free topology, this paper establishes a WSN dynamic fault-tolerant topology model based on node cascading failure, analyses the relationships between node load, topology and dynamic fault tolerance, and validates the proposed model through simulation. It studies the effects of the topology parameters and the load when random node failures trigger cascading failure, and uses theoretical derivation to obtain the structural features of the scale-free topology and the capacity limit for large-scale cascading failure in WSNs, effectively enhancing the cascading fault tolerance of traditional WSNs. The simulation results show that, as the degree-distribution parameter C increases, the minimum node degree increases accordingly, and in a highly intensive topology the dynamic fault tolerance is better; as the parameter λ increases, the maximum node degree gradually decreases and the degree distribution of the topology tends to become uniform, so that large-scale cascading failures caused by node failures have less influence on the WSN, further improving the dynamic fault-tolerance performance of the system.
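The load/capacity mechanism behind cascading failure can be illustrated with a small simulation. This is a generic Motter–Lai-style sketch, not the paper's WSN model (which additionally uses a scale-free degree distribution with parameters C and λ): the even load redistribution and the tolerance parameter `alpha` are assumptions.

```python
def cascade(adj, load, alpha, initial_failure):
    """Simulate a load-driven cascade on a graph.

    adj:  node -> set of neighbours; load: initial node loads.
    Each node's capacity is (1 + alpha) * its initial load. When a node
    fails, its load is split evenly among surviving neighbours, and any
    neighbour pushed past its capacity fails in turn.
    Returns the set of failed nodes.
    """
    capacity = {n: (1 + alpha) * l for n, l in load.items()}
    load = dict(load)
    failed, frontier = set(), [initial_failure]
    while frontier:
        n = frontier.pop()
        if n in failed:
            continue
        failed.add(n)
        survivors = [m for m in adj[n] if m not in failed]
        for m in survivors:
            load[m] += load[n] / len(survivors)  # redistribute evenly
        frontier.extend(m for m in survivors if load[m] > capacity[m])
    return failed
```

On a 3-node line graph with unit loads, a small tolerance (alpha = 0.4) lets a single failure cascade through the whole network, while a large one (alpha = 1.5) confines it to the failed node, mirroring the capacity-limit analysis in the abstract.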


Author(s):  
Bakhta Meroufel ◽  
Ghalem Belalem

One of the most important points for more effective use of the cloud environment is undoubtedly the study of the reliability and robustness of the services it provides. Fault tolerance is necessary to ensure that reliability and to reduce SLA violations. Checkpointing is a popular fault-tolerance technique in large-scale systems; however, its major disadvantage is the overhead caused by storing checkpoint files, which increases the execution time and reduces the possibility of meeting the desired deadlines. In this chapter, the authors propose a checkpointing strategy with lightweight storage. The storage is provided by creating a virtual topology, VRbIO, and by using an intelligent, fault-tolerant I/O technique, CSDS (collective and selective data sieving). The proposal is executed by active and reactive agents and solves many of the problems of checkpointing with standard I/O. To evaluate the approach, the authors compare it with checkpointing using ROMIO as the I/O strategy. Experimental results show the effectiveness and reliability of the proposed approach.


Author(s):  
Ghalem Belalem ◽  
Said Limam

Cloud computing refers both to the applications delivered as services over the Internet and to the hardware and systems software in the datacenters that provide those services. Failures of all types are common in current datacenters, partly due to the sheer number of nodes. Fault tolerance has become a major task for computer engineers and software developers because the occurrence of faults increases the cost of using resources; the most fundamental user expectation is, of course, that an application finishes correctly regardless of faults in the nodes. This paper proposes a fault-tolerant architecture for cloud computing that uses an adaptive checkpoint mechanism to ensure that a running task can finish correctly in spite of faults in the nodes on which it runs. The proposed fault-tolerant architecture is simultaneously transparent and scalable.


Author(s):  
ROBERT STEWART ◽  
PATRICK MAIER ◽  
PHIL TRINDER

Abstract: Reliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model, which provides explicit supervision and recovery of actors with isolated state. We investigate scalable, transparent, fault-tolerant functional computation with automatic supervision and recovery of tasks. We do so by developing HdpH-RS, a variant of the Haskell distributed parallel Haskell (HdpH) DSL with Reliable Scheduling. Extending the distributed work-stealing protocol of HdpH for task supervision and recovery is challenging. To eliminate elusive concurrency bugs, we validate the HdpH-RS work-stealing protocol using the SPIN model checker. HdpH-RS differs from the actor model in that its principal entities are tasks, i.e. independent stateless computations, rather than isolated stateful actors. Thanks to statelessness, fault recovery can be performed automatically and entirely hidden in the HdpH-RS runtime system. Statelessness is also key for proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism. HdpH-RS provides a simple distributed fork/join-style programming model, with minimal exposure of fault tolerance at the language level, and a library of higher-level abstractions such as algorithmic skeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpH DSL, hence users can opt in or out of fault-tolerant execution without any refactoring. Computations in HdpH-RS are always as reliable as the root node, no matter how many nodes and cores are actually used. We benchmark HdpH-RS on conventional clusters and a High Performance Computing platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well, e.g. up to 1,400 cores on the High Performance Computing platform; and reliability and recovery overheads are consistently low even at scale.


2013 ◽  
Vol 3 (1) ◽  
Author(s):  
Mohammed Amoon

Abstract: Fault tolerance is an important property in computational grids since the resources are geographically distributed. Job checkpointing is one of the most commonly used techniques for providing fault tolerance in computational grids. The efficiency of checkpointing depends on the choice of the checkpoint interval: an inappropriate interval can delay job execution. In this paper, a fault-tolerant scheduling system based on checkpointing is presented and evaluated. When scheduling a job, the system uses both the average failure time and the failure rate of grid resources, combined with the resources' response times, to generate scheduling decisions. The system uses the failure rate of the assigned resources to calculate the checkpoint interval for each job. Extensive simulation experiments are conducted to quantify the performance of the proposed system, and they show that it can considerably improve the throughput, turnaround time, grid load and failure tendency of computational grids.
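Deriving a checkpoint interval from a resource's failure rate, as the system above does, has a standard first-order form: Young's approximation. The sketch below shows that classic formula as one plausible instance; the paper's exact rule may differ, and the numeric parameters are assumptions.

```python
import math

def young_interval(checkpoint_cost, failure_rate):
    """Young's first-order optimal checkpoint interval: sqrt(2 * C / lambda).

    checkpoint_cost: time to write one checkpoint (seconds).
    failure_rate:    failures per second, i.e. 1 / MTBF of the resource.

    A shorter MTBF (higher failure rate) yields a shorter interval, so
    less work is lost per failure at the price of more checkpoint overhead.
    """
    return math.sqrt(2 * checkpoint_cost / failure_rate)
```

For example, a 10-second checkpoint cost on a resource with a 2000-second MTBF gives an interval of sqrt(2 * 10 * 2000) = 200 seconds.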


Author(s):  
MALARVIZHI NANDAGOPAL ◽  
S. GAJALAKSHMI ◽  
V. RHYMEND UTHARIARAJ

Computational grids have the potential to solve large-scale scientific applications using heterogeneous, geographically distributed resources. In addition to the challenges of managing and scheduling these applications, reliability challenges arise from the unreliable nature of the grid infrastructure. Two problems critical to the effective utilization of computational resources are efficient job scheduling and providing fault tolerance in a reliable manner. This paper addresses both by combining a checkpoint-replication-based fault-tolerance mechanism with the minimum total time to release (MTTR) job-scheduling algorithm. TTR includes the service time of the job, the waiting time in the queue, and the transfer of input and output data to and from the resource. The MTTR algorithm minimizes the response time by selecting a computational resource based on job requirements, job characteristics, and the hardware features of the resources. The fault-tolerance mechanism sets the job checkpoints based on the resource failure rate; if a resource failure occurs, the job is restarted from its last successful state using a checkpoint file from another grid resource. The Globus Toolkit is used as the grid middleware to set up a grid environment and evaluate the performance of the proposed approach, with the monitoring tools Ganglia and Network Weather Service gathering hardware and network details, respectively. The experimental results demonstrate that the proposed approach effectively schedules grid jobs in a fault-tolerant way, thereby reducing the TTR of the jobs submitted to the grid; it also increases the percentage of jobs completed within the specified deadline, making the grid trustworthy.
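The TTR components named above (queue wait, input transfer, service, output transfer) make the MTTR selection easy to sketch. This is a minimal illustration under assumed resource attributes (`wait`, `bw_mbps`, `speed`) and job attributes; the paper's algorithm also filters resources on job requirements and hardware features.

```python
def ttr(resource, job):
    """Total time to release on a resource: queue wait + input transfer
    + service time + output transfer (the components named in the paper).
    """
    return (resource["wait"]
            + job["input_mb"] / resource["bw_mbps"]   # input staging
            + job["work"] / resource["speed"]         # service time
            + job["output_mb"] / resource["bw_mbps"]) # output staging

def mttr_select(resources, job):
    """Pick the resource that minimizes TTR for this job."""
    return min(resources, key=lambda r: ttr(r, job))
```

Note that a resource with a non-empty queue can still win if its speed and bandwidth make its overall TTR lower, which is exactly why MTTR beats picking the idlest resource.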


2011 ◽  
Vol 58-60 ◽  
pp. 1442-1447
Author(s):  
Feng Ji Luo ◽  
Zhao Yang Dong ◽  
Can Wan ◽  
Ying Ying Chen ◽  
Ke Meng ◽  
...  

This paper proposes a computational grid platform for solving large-scale power system applications. The platform is based on the Globus Toolkit middleware and the GridWay meta-scheduler, and it enables large-scale sharing of computational resources across institutional boundaries. The paper first discusses the architecture and each component of the platform, and then describes the test bed. Finally, the results of a probabilistic load flow (PLF) test by Monte-Carlo simulation are presented, showing that the computational grid system provides comparable performance.


Sensors ◽  
2021 ◽  
Vol 21 (17) ◽  
pp. 5906
Author(s):  
Roxana-Gabriela Stan ◽  
Lidia Băjenaru ◽  
Cătălin Negru ◽  
Florin Pop

This work establishes a set of methodologies to evaluate the performance of any task-scheduling policy in heterogeneous computing contexts. We formally state a scheduling model for hybrid edge–cloud computing ecosystems and conduct simulation-based experiments on large workloads. In addition to conventional cloud datacenters, we consider edge datacenters comprising battery-powered smartphone and Raspberry Pi edge devices, and we define realistic capacities for the computational resources; once a schedule is found, the various task demands may or may not be fulfilled by the resource capacities. We build a scheduling and evaluation framework and measure typical scheduling metrics such as mean waiting time, mean turnaround time, makespan and throughput for the Round-Robin, Shortest Job First, Min-Min and Max-Min scheduling schemes. Our analysis and results show that state-of-the-art independent task-scheduling algorithms suffer performance degradation, in terms of significant task failures and non-optimal resource utilization of datacenters, in heterogeneous edge–cloud settings compared with cloud-only settings. In particular, for large sets of tasks, more than 25% of tasks fail to execute under each scheduling scheme due to low battery or limited memory.
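The scheduling metrics listed above have compact textbook definitions. The sketch below computes them for tasks run back-to-back on a single machine in a given order (all arriving at time 0); it is a minimal illustration, not the paper's multi-resource simulator, and running the same function on a sorted task list gives the Shortest Job First ordering.

```python
def metrics(burst_times):
    """Mean waiting time, mean turnaround time, makespan and throughput
    for tasks executed sequentially in the given order on one machine,
    all arriving at t = 0.
    """
    t, waits, turnarounds = 0, [], []
    for b in burst_times:
        waits.append(t)        # time spent waiting before this task starts
        t += b
        turnarounds.append(t)  # completion time = turnaround (arrival at 0)
    return {
        "mean_wait": sum(waits) / len(waits),
        "mean_turnaround": sum(turnarounds) / len(turnarounds),
        "makespan": t,
        "throughput": len(burst_times) / t,
    }

# FCFS uses the submission order; SJF simply sorts by burst time first.
fcfs = metrics([6, 2, 4])
sjf = metrics(sorted([6, 2, 4]))
```

Makespan and throughput are identical for the two orderings on one machine, while SJF provably minimizes the mean waiting time, which is the kind of trade-off the evaluation framework measures across schemes.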

