Supporting Fault-Tolerance for Time-Critical Events in Distributed Environments

In this paper, we consider the problem of supporting fault tolerance for adaptive and time-critical applications in heterogeneous and unreliable grid computing environments. Our goal for this class of applications is to optimize a user-specified benefit function while meeting the time deadline. Our first contribution in this paper is a multi-objective optimization algorithm for scheduling the application onto the most efficient and reliable resources. In this way, the processing can achieve the maximum benefit while also maximizing the success-rate, which is the probability of finishing execution without failures. However, for the cases where failures do occur, we have developed a hybrid failure recovery scheme to ensure that the application can complete within the pre-specified time interval. Our experimental results show that our scheduling algorithm achieves better benefit than several heuristics-based greedy scheduling algorithms, while still having a negligible overhead. Benefit is further improved when we apply the hybrid failure recovery scheme, and the success-rate becomes 100%.
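
The trade-off between benefit and success-rate described above can be illustrated with a small sketch. This is not the authors' algorithm: it assumes that resource failures are independent (so per-resource survival probabilities multiply) and uses a plain Pareto filter as a stand-in for the multi-objective optimization. The names `success_rate`, `pareto_front`, and the candidate schedules are hypothetical.

```python
from math import prod


def success_rate(reliabilities):
    """Probability that every resource in a schedule survives the run.

    Assumption: failures are independent, so the probabilities multiply.
    """
    return prod(reliabilities)


def pareto_front(schedules):
    """Return names of schedules not dominated in (benefit, success rate).

    `schedules` maps a name to a (benefit, success_rate) pair; higher is
    better for both objectives.
    """
    front = []
    for name, (benefit, rate) in schedules.items():
        dominated = any(
            b >= benefit and r >= rate and (b > benefit or r > rate)
            for other, (b, r) in schedules.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical candidate schedules: (estimated benefit, success rate).
candidates = {
    "fast": (10.0, success_rate([0.9, 0.8])),
    "safe": (7.0, success_rate([0.99, 0.99])),
    "poor": (6.0, success_rate([0.9, 0.8])),
}
print(pareto_front(candidates))
```

In this example "poor" is dropped because "fast" offers more benefit at the same success rate, while "fast" and "safe" both remain because neither is better on both objectives.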

A Genetic Algorithm Based Check Pointing and Failure Recovery Scheme in Wireless Sensor Network

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i8.557562 ◽

2018 ◽

Vol 6 (8) ◽

pp. 557-562

Author(s):

Shilpa . ◽

Deepak Dhadwal

Keyword(s):

Genetic Algorithm ◽

Wireless Sensor Network ◽

Sensor Network ◽

Failure Recovery ◽

Wireless Sensor ◽

Recovery Scheme

A hybrid meta-heuristic task scheduling algorithm based on genetic and thermodynamic simulated annealing algorithms in cloud computing environments

Neural Computing and Applications ◽

10.1007/s00521-021-06289-9 ◽

2021 ◽

Author(s):

Mozhdeh Tanha ◽

Mirsaeid Hosseini Shirvani ◽

Amir Masoud Rahmani

Keyword(s):

Cloud Computing ◽

Simulated Annealing ◽

Task Scheduling ◽

Scheduling Algorithm ◽

Task Scheduling Algorithm ◽

Computing Environments ◽

Annealing Algorithms

Efficient Fault Tolerance on Cloud Environments

International Journal of Cloud Applications and Computing ◽

10.4018/ijcac.2018070102 ◽

2018 ◽

Vol 8 (3) ◽

pp. 20-31 ◽

Cited By ~ 3

Author(s):

Sam Goundar ◽

Akashdeep Bhardwaj

Keyword(s):

Fault Tolerance ◽

Web Applications ◽

Fault Tolerant ◽

Cloud Services ◽

Level Of Service ◽

Cloud Environments ◽

Computing Environments ◽

Service Assurance ◽

Mission Critical ◽

The Impact

With mission-critical web applications and resources being hosted on cloud environments, and cloud services growing fast, the need for a greater level of service assurance regarding fault tolerance, availability, and reliability has increased. The high priority now is ensuring a fault-tolerant environment that can keep the systems up and running. To minimize the impact of downtime or accessibility failures caused by systems, network devices, or hardware, such failures are expected to be anticipated and handled proactively in a fast, intelligent way. This article discusses a fault tolerance system for cloud computing environments and analyzes whether it is effective for cloud environments.

A multiple disk failure recovery scheme in RAID systems

Journal of Systems Architecture ◽

10.1016/j.sysarc.2003.06.004 ◽

2004 ◽

Vol 50 (4) ◽

pp. 169-175 ◽

Cited By ~ 1

Author(s):

Chong Won Park ◽

Jin-Won Park

Keyword(s):

Failure Recovery ◽

Recovery Scheme ◽

Disk Failure

Fault Tolerance and Resilience in Cloud Computing Environments

Computer and Information Security Handbook ◽

10.1016/b978-0-12-803843-7.00009-0 ◽

2017 ◽

pp. 165-181 ◽

Cited By ~ 15

Author(s):

Ravi Jhawar ◽

Vincenzo Piuri

Keyword(s):

Cloud Computing ◽

Fault Tolerance ◽

Computing Environments

Quasi Path Restoration: A post-failure recovery scheme over pre-allocated backup resource for elastic optical networks

Optical Fiber Technology ◽

10.1016/j.yofte.2018.01.011 ◽

2018 ◽

Vol 41 ◽

pp. 139-154 ◽

Cited By ~ 7

Author(s):

Dharmendra Singh Yadav ◽

Sarath Babu ◽

B.S. Manoj

Keyword(s):

Optical Networks ◽

Failure Recovery ◽

Recovery Scheme ◽

Path Restoration

Smart Intra-query Fault Tolerance for Massive Parallel Processing Databases

Data Science and Engineering ◽

10.1007/s41019-019-00114-z ◽

2019 ◽

Vol 5 (1) ◽

pp. 65-79

Author(s):

Yunhong Ji ◽

Yunpeng Chai ◽

Xuan Zhou ◽

Lipeng Ren ◽

Yajie Qin

Keyword(s):

Cost Effectiveness ◽

Fault Tolerance ◽

Parallel Processing ◽

Query Processing ◽

Success Rate ◽

Database Systems ◽

Analytical Processing ◽

Commodity Clusters ◽

Massive Parallel Processing ◽

Query Latency

Intra-query fault tolerance has increasingly been a concern for online analytical processing, as more and more enterprises migrate data analytical systems from mainframes to commodity computers. Most massive parallel processing (MPP) databases do not support intra-query fault tolerance. They may suffer from prolonged query latency when running on unreliable commodity clusters. While SQL-on-Hadoop systems can utilize the fault tolerance support of low-level frameworks, such as MapReduce and Spark, their cost-effectiveness is not always acceptable. In this paper, we propose a smart intra-query fault tolerance (SIFT) mechanism for MPP databases. SIFT achieves fault tolerance by performing checkpointing, i.e., materializing intermediate results of selected operators. Unlike existing approaches, SIFT aims at promoting query success rate within a given time. To achieve its goal, it needs to: (1) minimize query rerunning time after encountering failures and (2) introduce as little checkpointing overhead as possible. To evaluate SIFT in real-world MPP database systems, we implemented it in Greenplum. The experimental results indicate that it can improve the success rate of query processing effectively, especially when working with unreliable hardware.
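
The two competing costs named in the abstract, rerunning time after a failure and checkpointing overhead, can be made concrete with a toy cost model. The sketch below is not SIFT's implementation: it assumes a linear pipeline of operators, a fixed run cost and checkpoint cost per operator, and that a failed query resumes just after the most recent checkpointed operator. The function names and cost lists are hypothetical.

```python
def rerun_time(op_costs, checkpointed, failed_at):
    """Work that must be redone when operator `failed_at` fails.

    Execution resumes right after the latest checkpointed operator that
    precedes `failed_at`, or from the start if there is none.
    """
    start = 0
    for i in range(failed_at - 1, -1, -1):
        if i in checkpointed:
            start = i + 1
            break
    return sum(op_costs[start:failed_at + 1])


def checkpoint_overhead(ckpt_costs, checkpointed):
    """Total cost of materializing the selected operators' results."""
    return sum(ckpt_costs[i] for i in checkpointed)


# Hypothetical four-operator pipeline (arbitrary time units).
op_costs = [4, 6, 2, 8]
ckpt_costs = [1, 3, 1, 2]

no_ckpt = rerun_time(op_costs, set(), failed_at=3)
with_ckpt = rerun_time(op_costs, {1}, failed_at=3)
print(no_ckpt, with_ckpt, checkpoint_overhead(ckpt_costs, {1}))
```

Under this model, checkpointing operator 1 halves the rerun time for a failure in the last operator at the price of one checkpoint's overhead, which is the kind of trade-off a checkpoint selection policy has to weigh.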

An Energy-Efficient Routing Protocol for Reliable Data Transmission in Wireless Body Area Networks

Sensors ◽

10.3390/s19194238 ◽

2019 ◽

Vol 19 (19) ◽

pp. 4238

Author(s):

Yating Qu ◽

Guoqiang Zheng ◽

Honghai Wu ◽

Baofeng Ji ◽

Huahong Ma

Keyword(s):

Routing Protocol ◽

Data Transmission ◽

Energy Efficient ◽

Body Area Networks ◽

Wireless Body Area Networks ◽

Reliable Data ◽

Body Area ◽

Energy Efficient Routing ◽

Maximum Benefit ◽

Benefit Function

Wireless body area networks will inevitably bring tremendous convenience to human society in future development, and also enable people to benefit from ubiquitous technological services. However, one of the reasons hindering development is the limited energy of the network nodes. Therefore, the energy consumption in the selection of the next hop must be minimized in multi-hop routing. To solve this problem, this paper proposes an energy efficient routing protocol for reliable data transmission in a wireless body area network. The protocol takes multiple parameters of the network node into account, such as residual energy, transmission efficiency, available bandwidth, and the number of hops to the sink. We construct the maximum benefit function to select the next hop node by normalizing the node parameters, and dynamically select the node with the largest function value as the next hop node. Based on the above work, the proposed method can achieve efficient multi-hop routing transmission of data and improve the reliability of network data transmission. Compared with the priority-based energy-efficient routing algorithm (PERA) and modified new-attempt routing protocol (NEW-ATTEMPT), the simulation results show that the proposed routing protocol uses the maximum benefit function to select the next hop node dynamically, which not only improves the reliability of data transmission, but also significantly improves the energy utilization efficiency of the node and prolongs the network lifetime.
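
The next-hop selection described above can be sketched as follows. The abstract does not give the exact form of the maximum benefit function, so this sketch assumes min-max normalization of each parameter, equal weights, and that fewer hops to the sink is better. The `Node` class, the `select_next_hop` function, and the weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    residual_energy: float
    transmission_efficiency: float
    bandwidth: float
    hops_to_sink: int


def _normalize(values):
    """Min-max normalize to [0, 1]; equal values all map to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_next_hop(candidates, weights=(0.25, 0.25, 0.25, 0.25)):
    """Return the candidate with the largest weighted benefit score."""
    if not candidates:
        raise ValueError("no candidate nodes")
    energy = _normalize([n.residual_energy for n in candidates])
    efficiency = _normalize([n.transmission_efficiency for n in candidates])
    bandwidth = _normalize([n.bandwidth for n in candidates])
    # Fewer hops is better, so invert the normalized hop count.
    hops = [1.0 - h for h in _normalize([n.hops_to_sink for n in candidates])]

    w_e, w_t, w_b, w_h = weights
    scores = [
        w_e * e + w_t * t + w_b * b + w_h * h
        for e, t, b, h in zip(energy, efficiency, bandwidth, hops)
    ]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


# Hypothetical neighbor table for one forwarding decision.
neighbors = [
    Node("A", 0.9, 0.8, 5.0, 2),
    Node("B", 0.2, 0.5, 1.0, 4),
    Node("C", 0.5, 0.9, 3.0, 1),
]
print(select_next_hop(neighbors).name)
```

Because the scores are recomputed from the current neighbor table on every call, the choice changes as residual energy or bandwidth changes, which matches the dynamic selection the abstract describes.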

A novel distributed scheduling algorithm for time-critical multi-agent systems

2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) ◽

10.1109/iros.2015.7354299 ◽

2015 ◽

Cited By ~ 5

Author(s):

Amanda Whitbrook ◽

Qinggang Meng ◽

Paul W. H. Chung

Keyword(s):

Scheduling Algorithm ◽

Distributed Scheduling ◽

Multi Agent Systems ◽

Agent Systems ◽

Multi Agent ◽

Time Critical

An Efficient Grid Scheduling Algorithm with Fault Tolerance and User Satisfaction

Mathematical Problems in Engineering ◽

10.1155/2013/340294 ◽

2013 ◽

Vol 2013 ◽

pp. 1-9 ◽

Cited By ~ 4

Author(s):

P. Keerthika ◽

N. Kasthuri

Keyword(s):

Fault Tolerance ◽

User Satisfaction ◽

Scheduling Algorithm ◽

Computational Grids ◽

Efficient Technology ◽

Problem Statement ◽

Grid Scheduling ◽

Failure Handling ◽

Communication Time ◽

User Demand

Problem Statement. The advances in human civilization lead to more complications in problem solving. Grid computing serves as an efficient technology in solving those complicated problems. In computational grids, the grid scheduler schedules the task and finds the appropriate resource for each task. The scheduler must consider several factors such as user demand, communication time, failure handling mechanisms, and reduced makespan. Most of the existing algorithms do not consider user satisfaction. Thus a scheduling algorithm that handles failure of resources and achieves user satisfaction gains more importance. Approach. A new bicriteria scheduling algorithm (BSA) that considers user satisfaction along with fault tolerance has been introduced. The main contribution of this paper includes achieving user satisfaction along with fault tolerance and minimizing the makespan of jobs. Results. The performance of this proposed algorithm is evaluated using GridSim based on makespan and number of jobs completed successfully within user deadline. Conclusions/Recommendations. The proposed BSA algorithm achieves reduced makespan and better hit rate with higher user satisfaction and fault tolerance.
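
The two evaluation metrics named in the abstract can be computed as in the sketch below. It assumes that makespan is the latest job finish time and that hit rate is the fraction of jobs finishing within their user deadline; the abstract does not define either formally, and the function names and sample data are hypothetical.

```python
def makespan(finish_times):
    """Latest finish time across all jobs."""
    return max(finish_times.values())


def hit_rate(finish_times, deadlines):
    """Fraction of jobs that finish no later than their deadline."""
    hits = sum(1 for job, t in finish_times.items() if t <= deadlines[job])
    return hits / len(finish_times)


# Hypothetical finish times and user deadlines (arbitrary time units).
finish_times = {"j1": 5, "j2": 12, "j3": 9}
deadlines = {"j1": 6, "j2": 10, "j3": 9}
print(makespan(finish_times), hit_rate(finish_times, deadlines))
```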
