Fault-tolerance and availability awareness in computational grids

Problem Statement. Advances in human civilization have made the problems we need to solve increasingly complex. Grid computing is an efficient technology for solving such complex problems. In computational grids, the grid scheduler schedules tasks and finds an appropriate resource for each task. The scheduler must consider several factors, such as user demand, communication time, failure-handling mechanisms, and makespan reduction. Most existing algorithms do not consider user satisfaction. A scheduling algorithm that both handles resource failures and achieves user satisfaction is therefore increasingly important.

Approach. A new bicriteria scheduling algorithm (BSA) that considers user satisfaction along with fault tolerance is introduced. The main contributions of this paper are achieving user satisfaction together with fault tolerance and minimizing the makespan of jobs.

Results. The performance of the proposed algorithm is evaluated using GridSim, based on makespan and on the number of jobs completed successfully within the user deadline.

Conclusions/Recommendations. The proposed BSA achieves a reduced makespan and a better hit rate, with higher user satisfaction and fault tolerance.
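The Results paragraph evaluates BSA on makespan and on the number of jobs completed within the user deadline (the hit rate). The abstract does not define either metric formally, so the Python sketch below assumes common definitions: makespan is the latest completion time among successfully finished jobs, and hit rate is the fraction of submitted jobs that finish successfully by their own deadline. The `JobResult` record and its fields are hypothetical, and the sketch only computes the two metrics; it does not implement BSA itself.

```python
from dataclasses import dataclass


@dataclass
class JobResult:
    """Outcome of one grid job (hypothetical record for illustration)."""
    job_id: str
    completion_time: float  # seconds since the batch was submitted
    deadline: float         # user deadline, on the same time base
    succeeded: bool         # False if the job was lost to a resource failure


def makespan(results: list[JobResult]) -> float:
    """Completion time of the last successfully finished job (assumed definition)."""
    finished = [r.completion_time for r in results if r.succeeded]
    return max(finished) if finished else 0.0


def hit_rate(results: list[JobResult]) -> float:
    """Fraction of submitted jobs that succeeded no later than their deadline."""
    if not results:
        return 0.0
    hits = sum(1 for r in results if r.succeeded and r.completion_time <= r.deadline)
    return hits / len(results)


jobs = [
    JobResult("j1", completion_time=40.0, deadline=50.0, succeeded=True),
    JobResult("j2", completion_time=70.0, deadline=60.0, succeeded=True),   # late
    JobResult("j3", completion_time=30.0, deadline=45.0, succeeded=False),  # failed
]
print(makespan(jobs), hit_rate(jobs))
```

Here only `j1` counts as a hit, so the hit rate is one third even though two jobs completed; a late job and a failed job are both misses from the user's point of view.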

Survey on Job Scheduling, Load Balancing and Fault Tolerance Techniques for Computational Grids

Global Journal of Technology and Optimization, 2015, Vol 06 (01). DOI: 10.4172/2229-8711.1000169

Author(s):

Jasma Balasangameshwara

Keyword(s):

Fault Tolerance, Load Balancing, Job Scheduling, Computational Grids

Fault-tolerance and availability awareness in computational grids

Fundamentals of Grid Computing, 2009, pp. 167-200. DOI: 10.1201/9781439803684-13

Keyword(s):

Fault Tolerance, Computational Grids

A job checkpointing system for computational grids

Open Computer Science, 2013, Vol 3 (1). DOI: 10.2478/s13537-013-0103-3. Cited by ~4

Author(s):

Mohammed Amoon

Keyword(s):

Fault Tolerance, Failure Rate, Failure Time, Fault Tolerant, Turnaround Time, Computational Grids, Simulation Experiments, Geographically Distributed, Grid Load, Grid Resources

Fault tolerance is an important property in computational grids because their resources are geographically distributed. Job checkpointing is one of the most commonly used techniques for providing fault tolerance in computational grids. The efficiency of checkpointing depends on the choice of checkpoint interval: checkpointing too often wastes time writing state, while checkpointing too rarely loses more work when a failure occurs, so an inappropriate interval can delay job execution. In this paper, a fault-tolerant scheduling system based on checkpointing is presented and evaluated. When scheduling a job, the system combines the average failure time and failure rate of grid resources with their response time to generate scheduling decisions. The system uses the failure rate of the assigned resource to calculate the checkpoint interval for each job. Extensive simulation experiments were conducted to quantify the performance of the proposed system. The experiments show that the proposed system can considerably improve the throughput, turnaround time, grid load, and failure tendency of computational grids.
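The abstract states that each job's checkpoint interval is derived from the failure rate of the assigned resource, but it does not give the formula. As a stand-in, the Python sketch below uses Young's well-known first-order approximation, `T = sqrt(2 * C * MTBF)`, where `C` is the cost of writing one checkpoint, and assumes exponentially distributed failures so that `MTBF = 1 / failure_rate`. This is not necessarily the formula used in the paper; the function names and the simple failures-per-observed-time estimate of the failure rate are hypothetical.

```python
import math


def estimate_failure_rate(failures: int, observed_seconds: float) -> float:
    """Failures per second observed on a resource (hypothetical simple estimator)."""
    if observed_seconds <= 0:
        raise ValueError("observation window must be positive")
    return failures / observed_seconds


def checkpoint_interval(failure_rate: float, checkpoint_cost: float) -> float:
    """Young's approximation of the checkpoint interval, sqrt(2 * C * MTBF).

    Assumes exponentially distributed failures, so MTBF = 1 / failure_rate.
    checkpoint_cost is the time (seconds) needed to write one checkpoint.
    """
    if failure_rate <= 0:
        return math.inf  # no observed failures: never force a checkpoint
    mtbf = 1.0 / failure_rate
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# A resource that failed 3 times during 30 days, with a 60-second checkpoint cost.
lam = estimate_failure_rate(3, 30 * 86400)
print(round(checkpoint_interval(lam, 60.0)))
```

With three failures in 30 days and a 60-second checkpoint cost, the interval comes out at about 10,182 seconds (roughly 2.8 hours). A resource that fails more often receives a shorter interval, which reflects the trade-off between checkpoint overhead and lost work described above.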

Scheduling with Job Checkpoint in Computational Grid Environment

International Journal of Modeling Simulation and Scientific Computing, 2011, Vol 02 (03), pp. 299-316. DOI: 10.1142/s1793962311000517

Author(s):

Malarvizhi Nandagopal, S. Gajalakshmi, V. Rhymend Uthariaraj

Keyword(s):

Fault Tolerance, Large Scale, Job Scheduling, Fault Tolerant, Scheduling Algorithm, Computational Grids, Tolerance Mechanism, Grid Resource, Distributed Resources, Grid Environment

Computational grids have the potential to run large-scale scientific applications on heterogeneous and geographically distributed resources. In addition to the challenges of managing and scheduling these applications, reliability challenges arise from the unreliable nature of grid infrastructure. Two problems are critical to the effective utilization of computational resources: efficient job scheduling and reliable fault tolerance. This paper addresses both by combining a checkpoint-replication-based fault tolerance mechanism with the minimum total time to release (MTTR) job scheduling algorithm. The total time to release (TTR) of a job includes its service time, its waiting time in the queue, and the time to transfer input and output data to and from the resource. The MTTR algorithm minimizes response time by selecting a computational resource based on job requirements, job characteristics, and the hardware features of the resources. The fault tolerance mechanism sets job checkpoints based on the resource failure rate. If a resource fails, the job is restarted from its last successful state using a checkpoint file from another grid resource. The Globus Toolkit is used as the grid middleware to set up a grid environment and evaluate the performance of the proposed approach, and the monitoring tools Ganglia and Network Weather Service gather hardware and network details, respectively. The experimental results demonstrate that the proposed approach schedules grid jobs effectively and in a fault-tolerant way, thereby reducing the TTR of jobs submitted to the grid. It also increases the percentage of jobs completed within the specified deadline, making the grid more trustworthy.
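The TTR definition above (service time, queue waiting time, and input/output transfer time) lends itself to a short sketch of MTTR resource selection. The Python code below assumes a deliberately simple cost model: service time is job length divided by resource speed, and transfer time is data size divided by a single link bandwidth. In the paper these inputs come from job requirements and monitoring data (Ganglia, Network Weather Service); the `Resource` and `Job` fields here are hypothetical, and checkpoint replication is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical resource profile, e.g. filled from monitoring data."""
    name: str
    mips: float        # processing speed, million instructions per second
    queue_wait: float  # expected waiting time in the local queue, seconds
    bandwidth: float   # link bandwidth to the resource, MB per second


@dataclass
class Job:
    """Hypothetical job description."""
    length_mi: float   # job length, million instructions
    input_mb: float    # input data staged to the resource, MB
    output_mb: float   # output data returned from the resource, MB


def ttr(job: Job, res: Resource) -> float:
    """Total time to release: input transfer + queue wait + service + output transfer."""
    transfer_in = job.input_mb / res.bandwidth
    service = job.length_mi / res.mips
    transfer_out = job.output_mb / res.bandwidth
    return transfer_in + res.queue_wait + service + transfer_out


def select_mttr(job: Job, resources: list[Resource]) -> Resource:
    """Pick the resource with the minimum total time to release."""
    return min(resources, key=lambda r: ttr(job, r))


job = Job(length_mi=120_000, input_mb=200, output_mb=50)
candidates = [
    Resource("fast-busy", mips=2000, queue_wait=120, bandwidth=10),
    Resource("slow-idle", mips=1000, queue_wait=0, bandwidth=20),
]
best = select_mttr(job, candidates)
print(best.name, ttr(job, best))
```

In this example the faster processor loses: its queue wait and slower link outweigh its shorter service time (205 s versus 132.5 s), which is why selecting on total time to release rather than CPU speed alone can matter.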

Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids

2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009. DOI: 10.1109/ccgrid.2009.59. Cited by ~24

Author(s):

Yang Zhang, Anirban Mandal, Charles Koelbel, Keith Cooper

Keyword(s):

Fault Tolerance, Computational Grids

Fault tolerance in computational grids: perspectives, challenges, and issues

SpringerPlus, 2016, Vol 5 (1). DOI: 10.1186/s40064-016-3669-0. Cited by ~3

Author(s):

Sajjad Haider, Babar Nazir

Keyword(s):

Fault Tolerance, Computational Grids

High Performance Computational Grids Fault Tolerance at System Level

2008 First International Conference on Emerging Trends in Engineering and Technology, 2008. DOI: 10.1109/icetet.2008.21. Cited by ~1

Author(s):

Manik Mujumdar, Meenakshi Bheevgade, Latesh Malik, Rajendra Patrikar

Keyword(s):

Fault Tolerance, High Performance, System Level, Computational Grids

The Application of Fractal Transform and Entropy for Improving Fault Tolerance and Load Balancing in Grid Computing Environments

Entropy, 2020, Vol 22 (12), pp. 1410. DOI: 10.3390/e22121410

Author(s):

Murad B. Khorsheed, Qasim M. Zainel, Oday A. Hassen, Saad M. Darwish

Keyword(s):

Fault Tolerance, Load Balancing, Grid Computing, Computing Time, Index Structure, Computational Grids, Optimum Number, Logical Network, Proposed Model, Cloud Of Points

This paper applies an entropy-based fractal indexing scheme that enables fast indexing and querying in the grid environment. It addresses fault tolerance and load balancing through fractal-based management to make computational grids more effective and reliable. The fractal dimension of a cloud of points gives an estimate of the intrinsic dimensionality of the data in that space; the main drawback of this technique is its long computing time. The main contribution of the suggested work is to investigate the effect of the fractal transform by adding entropy based on an R-tree index structure to existing grid computing models, in order to obtain a balanced infrastructure with minimal faults. To this end, the work extends commonly used scheduling algorithms, which are built on the physical grid structure, to a reduced logical network. The objective of this logical network is to reduce searching along grid paths according to the arrival time rate and path bandwidth, with respect to load balance and fault tolerance, respectively. Furthermore, an optimization search technique is used to enhance grid performance by finding the optimum number of nodes extracted from the logical grid. The experimental results indicate that the proposed model achieves better execution time, throughput, makespan, latency, load balancing, and success rate.
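The abstract notes that the fractal dimension of a cloud of points estimates its intrinsic dimensionality. One standard estimator is the box-counting dimension: count the number N(eps) of boxes of side eps that contain at least one point, then take the slope of log N(eps) against log(1/eps). The Python sketch below implements that estimator using only the standard library. The paper may use a different estimator (for example, a correlation dimension), and the function names are hypothetical. Each scale requires a full pass over the points, which hints at the computing-time cost the abstract mentions.

```python
import math


def box_count(points, eps):
    """Number of axis-aligned boxes of side eps that contain at least one point."""
    return len({tuple(math.floor(c / eps) for c in p) for p in points})


def box_counting_dimension(points, scales):
    """Least-squares slope of log N(eps) versus log(1 / eps) over the given scales."""
    xs = [math.log(1.0 / e) for e in scales]
    ys = [math.log(box_count(points, e)) for e in scales]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


scales = [1 / 4, 1 / 8, 1 / 16, 1 / 32]
# A straight line embedded in 3-D space: intrinsic dimension 1.
line_3d = [(t, 0.5 * t, 0.25 * t) for t in (i / 1000 for i in range(1000))]
# A filled unit square sampled on a regular grid: intrinsic dimension 2.
square = [(i / 64, j / 64) for i in range(64) for j in range(64)]
print(box_counting_dimension(line_3d, scales), box_counting_dimension(square, scales))
```

The line yields a dimension of about 1 even though each point has three coordinates, and the square yields about 2, which is the sense in which the fractal dimension measures intrinsic rather than embedding dimensionality.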

On Fault Tolerance of Resources in Computational Grids

International Journal of Grid Computing & Applications, 2012, Vol 3 (3), pp. 1-10. DOI: 10.5121/ijgca.2012.3301. Cited by ~4

Author(s):

Arindam Das

Keyword(s):

Fault Tolerance, Computational Grids
