Fault-Tolerant Scheduling of Fine-Grained Tasks in Grid Environments

Author(s):  
Gosia Wrzesińska ◽  
Rob V. van Nieuwpoort ◽  
Jason Maassen ◽  
Thilo Kielmann ◽  
Henri E. Bal

Author(s):  
Saranya R ◽  
Pradeep C ◽  
Neena Baby ◽  
Radhakrishnan R

Reconfigurable computing for DSP remains an active research area as the need to integrate it with more conventional DSP technologies becomes apparent. Traditionally, most work on reconfigurable computing has targeted fine-grained FPGA devices. Over the years, the focus has shifted from bit-level granularity to coarse-grained composition. The FIR filter remains an important building block in many DSP systems. It computes each output by multiplying input samples by a set of coefficients and summing the products. Here, the multipliers and adders are modeled using a divide-and-conquer approach. To develop a reconfigurable FIR filter, filters with different numbers of taps are designed as separate reconfigurable modules. A further concern is making the system fault tolerant: a fault detection mechanism is introduced that detects faults based on the nature of the operands. The reconfigurable modules are structurally modeled in Verilog HDL and simulated and synthesized using Xilinx ISE 14.2. The paper also compares the device utilization of the reconfigurable modules by implementing the design on various Virtex FPGA devices.
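The multiply-and-add computation that the abstract attributes to an FIR filter can be sketched in a few lines of Python. This is only a software illustration of the arithmetic, not the paper's Verilog design; the function name `fir_filter` and the assumption that samples before the start of the input are zero are choices made for this example.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * x[n - k].

    Samples before the start of the input are assumed to be zero
    (an assumption of this sketch, not stated in the abstract).
    """
    output = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                # One tap: multiply a delayed input sample by its coefficient.
                acc += c * samples[n - k]
        output.append(acc)
    return output


# A 3-tap filter with unit coefficients applied to a short input sequence.
result = fir_filter([1, 2, 3, 4], [1, 1, 1])
print(result)
```

The number of taps is simply `len(coeffs)`, so filters with different tap counts differ only in the coefficient list passed in.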


Author(s):  
Omer Subasi ◽  
Tatiana Martsinkevich ◽  
Ferad Zyulkyarov ◽  
Osman Unsal ◽  
Jesus Labarta ◽  
...  

We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Secondly, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.


Author(s):  
Vincenzo De Florio

Failure detection is a fundamental building block for developing fault-tolerant distributed systems. Accurate failure detection in asynchronous systems (Chapter II) is notoriously difficult, as it is impossible to tell whether a process has actually failed or is just slow. Because of this, several impossibility results have been derived; see for instance the well-known paper (Fischer, Lynch, & Paterson, 1985). As a consequence of these pessimistic results, many researchers have devoted their time and abilities to understanding how to reformulate the concept of system model in a fine-grained alternative way. Their goal was to tackle problems such as distributed consensus with minimal requirements on the system environment. This led to the theory of unreliable failure detectors for reliable systems, pioneered by the work of Chandra and Toueg (Chandra & Toueg, 1996). This chapter introduces these concepts and the formulation of failure detection protocols in the application layer. In particular, a linguistic framework is proposed for expressing those protocols. As a case study, the chapter describes the failure detection algorithm used in the EFTOS DIR net and in the TIRAN Backbone, that is, the fault-tolerance managers introduced in Chapter III and Chapter VI, respectively.
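The difficulty of telling a failed process from a slow one can be illustrated with a minimal heartbeat-based detector. This is a generic sketch and not the EFTOS or TIRAN algorithm; the class name `HeartbeatDetector`, the explicit `now` timestamps, and the timeout-doubling rule are all assumptions made for the example. The detector is "unreliable" in the sense used above: it may suspect a process that is merely slow, and it revises that suspicion when a late heartbeat arrives.

```python
class HeartbeatDetector:
    """Hypothetical timeout-based failure detector (illustrative only)."""

    def __init__(self, timeout):
        self.timeout = timeout    # how long silence is tolerated
        self.last_seen = {}       # process name -> time of last heartbeat
        self.suspected = set()    # processes currently suspected

    def heartbeat(self, process, now):
        """Record a heartbeat from `process` received at time `now`."""
        self.last_seen[process] = now
        if process in self.suspected:
            # The suspicion was wrong: the process was slow, not dead.
            # Withdraw it and tolerate longer silences from now on.
            self.suspected.discard(process)
            self.timeout *= 2

    def check(self, now):
        """Return the set of processes suspected at time `now`."""
        for process, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.suspected.add(process)
        return set(self.suspected)


detector = HeartbeatDetector(timeout=1.0)
detector.heartbeat("p", now=0.0)
print(detector.check(now=0.5))   # "p" is not yet suspected
print(detector.check(now=2.0))   # "p" has been silent too long
detector.heartbeat("p", now=2.5) # late heartbeat: suspicion withdrawn
print(detector.check(now=2.6))
```

Passing timestamps explicitly instead of reading a clock keeps the sketch deterministic and easy to test.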


Author(s):  
Steffen Ortmann ◽  
Michael Maaser ◽  
Peter Langendoerfer

Wireless Sensor Networks are the key enabler for low-cost ubiquitous applications in the areas of homeland security, health care, and environmental monitoring. A necessary prerequisite is reliable and efficient event detection in spite of sudden failures and environmental changes. Because the sensors need to be low cost, they have only scarce resources, which leads to a certain level of failures of sensor nodes or of the sensing devices attached to them. Available fault-tolerant solutions are mainly customized approaches that have revealed several shortcomings, particularly in adaptability and energy efficiency. The authors present a complete event detection concept including all necessary steps from formal event definition to autonomous device configuration. It features an event definition language that allows complex events to be defined and that enhances reliability through tailor-made voting schemes and application constraints. Based on that, this paper introduces a novel approach for self-adapting on-node and in-network processing, called Event Decision Tree (EDT). EDT autonomously adapts to available resources and environmental conditions, even when this requires (re-)organizing collaboration between neighboring nodes for evaluation. The authors’ approach achieves fine-grained event-related fault tolerance with a configurable adaptation rate while enhancing maintainability and energy efficiency.
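As a rough illustration of how a voting scheme can mask a faulty sensor, the sketch below applies a generic k-out-of-n vote to threshold readings. It does not reproduce the paper's event definition language or the EDT; the function name `vote`, the temperature values, and the threshold are hypothetical.

```python
def vote(readings, threshold, quorum):
    """Report an event if at least `quorum` readings exceed `threshold`."""
    positives = sum(1 for r in readings if r > threshold)
    return positives >= quorum


# Three neighboring nodes report temperatures; the third sensor is faulty
# and reports a value far below the others.
readings = [71.5, 70.2, 12.0]
event_detected = vote(readings, threshold=60.0, quorum=2)
print(event_detected)
```

With a quorum of two out of three, a single faulty reading can neither suppress nor trigger the event on its own.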

