Online Reconfigurable Self-Timed Links for Fault Tolerant NoC

VLSI Design ◽  
2007 ◽  
Vol 2007 ◽  
pp. 1-13 ◽  
Author(s):  
Teijo Lehtonen ◽  
Pasi Liljeberg ◽  
Juha Plosila

We propose link structures for NoC that efficiently tolerate transient, intermittent, and permanent errors. This is a necessary step toward implementing reliable systems in future nanoscale technologies. Protection against transient errors is realized using Hamming coding with interleaving for error detection and retransmission as the recovery method. We introduce two approaches for tackling intermittent and permanent errors. In the first approach, spare wires are introduced together with reconfiguration circuitry. The other approach uses time redundancy: the transmission is split into two parts and the data is doubled. In both structures the presence of permanent or intermittent errors is monitored by analyzing previous error syndromes. The links are based on self-timed signaling in which the handshake signals are protected using triple modular redundancy. We present the structures, operation, and designs for the different components of the links. The fault tolerance properties are analyzed using a fault model containing temporary, intermittent, and permanent faults that occur both as bursts and as single faults. The results show a considerable enhancement in fault tolerance at the cost of performance and area, with only a slight increase in power consumption.
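
As a concrete illustration of this style of error detection (a minimal sketch: the bit layout, the two-codeword interleaving, and all function names are illustrative assumptions, not the paper's exact link design), the following Python snippet encodes nibbles with Hamming(7,4), interleaves two codewords across the link wires, and computes the syndrome whose non-zero value would trigger a retransmission:

def hamming74_encode(d1, d2, d3, d4):
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]          # codeword positions 1..7

def hamming74_syndrome(c):
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]               # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]               # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]               # parity over positions 4,5,6,7
    return s1 + 2 * s2 + 4 * s3                  # 0 = no error, else the faulty position

def interleave(words):
    # Transmit bit i of every codeword before bit i+1, so a short burst on the
    # link hits several codewords with one flip each instead of one codeword badly.
    return [bit for column in zip(*words) for bit in column]

frame = interleave([hamming74_encode(1, 0, 1, 1), hamming74_encode(0, 1, 1, 0)])
word = hamming74_encode(1, 0, 1, 1)
word[5] ^= 1                                     # inject a single transient error
assert hamming74_syndrome(word) == 6             # non-zero syndrome: request retransmission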

2007 ◽  
Vol 10 (2) ◽  
Author(s):  
Goutam Kumar Saha

The term "self-healing" denotes the capability of a software system to deal with bugs. Fault tolerance for dependable computing aims to provide the specified service through rigorous design, whereas self-healing addresses run-time issues. The paper describes various issues in designing a self-healing software application system that relies on on-the-fly error detection and repair of web application or service agent code and data. Self-healing is a very new area of research that deals with fault tolerance for dynamic systems. Self-healing deals with imprecise specifications, uncontrolled environments, and reconfiguration of the system according to its dynamics. Software that is capable of detecting and reacting to its malfunctions is called self-healing software. Such a software system has the ability to examine its failures and to take appropriate corrective actions. A self-healing system must have knowledge of its expected behavior in order to examine whether its actual behavior deviates from it in relation to the environment. The fault model of a self-healing system states which faults or injuries are to be self-healed, including the fault duration and the fault source, such as operational errors, defective system requirements, or implementation errors. The aspects of self-healing fall into the categories of fault model (or fault hypothesis), system response, system completeness, and design context. Drawing on the important literature, this paper also aims to illustrate critical points of the emerging research topic of self-healing software systems.
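
The detect-and-repair loop described above can be sketched as follows (a minimal illustration assuming a caller-supplied service callable, validation predicate, and repair action; it is not the system proposed in the paper):

import logging

def self_healing_call(service, validate, repair, retries=2):
    """Call a service, check the result against expected behavior, repair on deviation."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            result = service()
            if validate(result):              # does actual behavior match expected behavior?
                return result
            last_error = ValueError("result failed validation")
        except Exception as exc:              # malfunction detected at run time
            last_error = exc
        logging.warning("attempt %d failed (%s); repairing", attempt, last_error)
        repair()                              # e.g. reset state, reload code, reconfigure
    raise RuntimeError("self-healing attempts exhausted") from last_error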


2021 ◽  
Author(s):  
Gopalakrishnan Sundararajan

This chapter presents a solution for fault tolerance in Multi-Valued Logic (MVL) circuits composed of Carbon Nanotube Field Effect Transistors (CNTFETs). The chapter reviews basic primitives of MVL and describes ternary implementations of CNTFET circuits. Finally, it describes a method for error correction called Restorative Feedback (RFB). The RFB method is a variant of Triple Modular Redundancy (TMR) that utilizes the fault-masking capability of the Muller C element to provide added protection against noisy transient faults. The fault-tolerant properties of the Muller C element are discussed, and the error correction capability of the RFB method is demonstrated in detail.
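
The masking behavior that RFB builds on can be pictured with a short sketch (binary for simplicity, whereas the chapter targets ternary CNTFET logic; this shows the C element's hold behavior and a plain majority vote, not the exact RFB circuit):

def muller_c(a, b, prev):
    # The C element updates its output only when both inputs agree;
    # otherwise it holds its previous state, masking a transient glitch.
    return a if a == b else prev

def majority(a, b, c):
    # Classic TMR voter.
    return (a & b) | (b & c) | (a & c)

state = 0
clean, glitched = 1, 0                       # one replica suffers a transient fault
state = muller_c(clean, glitched, state)     # inputs disagree -> output holds at 0
state = muller_c(clean, clean, state)        # fault gone, inputs agree -> output becomes 1
assert state == 1
print(majority(1, 0, 1))                     # the voter also masks the single faulty replica -> 1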


Author(s):  
Zahid Raza ◽  
Deo P. Vidyarthi

A Grid is a parallel and distributed computing network system comprising heterogeneous computing resources spread over multiple administrative domains that offers high-throughput computing. Since the Grid operates at a large scale, there is always a possibility of failures ranging from hardware to software, and the penalty paid for these failures may be very large. The system needs to tolerate the various possible failures which, in spite of many precautions, are bound to happen. Replication is a strategy often used to introduce fault tolerance into the system and ensure successful execution of the job even when some of the computational resources fail. Though replication incurs a heavy cost, a selective degree of replication can offer a good compromise between performance and cost. This chapter proposes a co-scheduler that can be integrated with the main scheduler for the execution of jobs submitted to a computational Grid. The main scheduler may have any performance optimization criteria; the integration of the co-scheduler adds fault tolerance on top of them. The chapter evaluates the performance of the co-scheduler with a main scheduler designed to minimize the turnaround time of a modular job, introducing module replication to counter the effects of node failures in a Grid. A simulation study reveals that the model works well under various conditions, with the scheduler's performance degrading gracefully while improving the overall reliability offered to the job.
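
The trade-off behind selective replication can be seen in a back-of-the-envelope calculation (an illustration of the general idea, not the chapter's scheduling model): a modular job finishes only if at least one replica of every module survives, so even one extra replica per module changes the picture dramatically.

def job_reliability(num_modules, node_reliability, replicas):
    # Probability that at least one of the replicas of a module survives ...
    module_ok = 1 - (1 - node_reliability) ** replicas
    # ... and that this holds for every module of the job.
    return module_ok ** num_modules

print(job_reliability(num_modules=10, node_reliability=0.9, replicas=1))   # ~0.35 without replication
print(job_reliability(num_modules=10, node_reliability=0.9, replicas=2))   # ~0.90 with one extra replica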


Author(s):  
Ghalem Belalem ◽  
Said Limam

Cloud computing refers both to the applications delivered as services over the Internet and to the hardware and systems software in the datacenters that provide those services. Failures of all types are common in current datacenters, partly due to the sheer number of nodes. Fault tolerance has become a major task for computer engineers and software developers because the occurrence of faults increases the cost of using resources and makes it harder to meet user expectations; the most fundamental expectation is, of course, that an application finishes correctly regardless of faults in the nodes. This paper proposes a fault-tolerant architecture for Cloud Computing that uses an adaptive checkpoint mechanism to ensure that a running task can finish correctly in spite of faults in the nodes on which it is running. The proposed fault-tolerant architecture is simultaneously transparent and scalable.
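
One way an adaptive checkpoint mechanism can behave is sketched below (a minimal illustration with an invented adaptation policy, fault signal, and function names; the paper's architecture is not reproduced here): checkpoints are taken more often while faults are being observed and less often during stable periods.

import copy

def run_with_checkpoints(task_steps, interval, min_interval=1, max_interval=16):
    state, step = {}, 0
    ckpt_step, ckpt_state = 0, {}
    while step < len(task_steps):
        try:
            state = task_steps[step](state)                         # execute one unit of work
            step += 1
            if step % interval == 0:
                ckpt_step, ckpt_state = step, copy.deepcopy(state)  # save a checkpoint
                interval = min(interval * 2, max_interval)          # stable run: checkpoint less often
        except RuntimeError:                                        # a node fault was reported
            step, state = ckpt_step, copy.deepcopy(ckpt_state)      # roll back to the last checkpoint
            interval = max(interval // 2, min_interval)             # faulty run: checkpoint more often
    return state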


1994 ◽  
Vol 6 (2) ◽  
pp. 150-154
Author(s):  
Shigeki Abe ◽  
Michitaka Kameyama ◽  
Tatsuo Higuchi ◽  
...  

To achieve safety in an intelligent digital system for real-world applications, not only hardware faults in the processors but also any other faults and errors related to the real world, such as sensor faults, actuator faults, and human errors, must be removed. From this point of view, an intelligent fault-tolerant system for real-world applications is proposed based on triple modular redundancy. The system consists of a master processor that performs the actual control operations and two redundant processors that simulate the real-world process together with the control operations using a knowledge-based inference strategy. To realize independence between the triplicated modules, the simulation for error detection and recovery is performed without the actual external sensor signals used in the master processor.


1999 ◽  
Vol 121 (3) ◽  
pp. 504-508 ◽  
Author(s):  
E. H. Maslen ◽  
C. K. Sortore ◽  
G. T. Gillies ◽  
R. D. Williams ◽  
S. J. Fedigan ◽  
...  

A fault tolerant magnetic bearing system was developed and demonstrated on a large flexible-rotor test rig. The bearing system comprises a high speed, fault tolerant digital controller, three high capacity radial magnetic bearings, one thrust bearing, conventional variable reluctance position sensors, and an array of commercial switching amplifiers. Controller fault tolerance is achieved through a very high speed voting mechanism which implements triple modular redundancy with a powered spare CPU, thereby permitting failure of up to three CPU modules without system failure. Amplifier/cabling/coil fault tolerance is achieved by using a separate power amplifier for each bearing coil and permitting amplifier reconfiguration by the controller upon detection of faults. This allows hot replacement of failed amplifiers without any system degradation and without providing any excess amplifier kVA capacity over the nominal system requirement. Implemented on a large (2440 mm in length) flexible rotor, the system shows excellent rejection of faults including the failure of three CPUs as well as failure of two adjacent amplifiers (or cabling) controlling an entire stator quadrant.
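
The voting-plus-spare arrangement can be pictured with a simplified sketch (the class, the retirement policy, and the way the spare is switched in are invented for illustration and are not the controller's actual voting hardware): modules that disagree with the majority are retired and the powered spare takes their place.

def majority(values):
    return max(set(values), key=values.count)

class VotingController:
    def __init__(self, n_active=3, n_spare=1):
        self.active = list(range(n_active))                  # CPUs currently voted on
        self.spares = list(range(n_active, n_active + n_spare))

    def step(self, outputs):
        result = majority([outputs[i] for i in self.active])
        for i in list(self.active):
            if outputs[i] != result:                         # CPU disagrees with the majority
                self.active.remove(i)                        # retire it ...
                if self.spares:
                    self.active.append(self.spares.pop(0))   # ... and switch in the spare
        return result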


2016 ◽  
Vol 58 (6) ◽  
Author(s):  
Vahid Lari ◽  
Andreas Weichslgartner ◽  
Alexandru Tanase ◽  
Michael Witterauf ◽  
Faramarz Khosravi ◽  
...  

As a consequence of technology scaling, today's complex multi-processor systems have become more and more susceptible to errors. In order to satisfy reliability requirements, such systems require methods to detect and tolerate errors. This entails two major challenges: (a) providing a comprehensive approach that ensures fault-tolerant execution of parallel applications across different types of resources, and (b) optimizing resource usage in the face of dynamic fault probabilities or varying fault tolerance needs of different applications. In this paper, we present a holistic and adaptive approach, based on invasive computing, that provides fault tolerance on a Multi-Processor System-on-a-Chip (MPSoC) on the demand of an application or of environmental needs. We show how invasive computing may provide adaptive fault tolerance on a heterogeneous MPSoC including hardware accelerators and communication infrastructure such as a Network-on-Chip (NoC). In addition, we present (a) compile-time transformations to automatically adopt well-known redundancy schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) for fault-tolerant loop execution on a class of massively parallel processor arrays called Tightly Coupled Processor Arrays (TCPAs). Based on timing characteristics derived from our compilation flow, we further develop (b) a reliability analysis guiding the selection of a suitable degree of fault tolerance. Finally, we present (c) a methodology to detect and adaptively mitigate faults in invasive NoCs.
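
As a point of reference for such a reliability analysis (this is the textbook TMR formula, not the specific analysis derived in the paper), if each replica of a loop survives its execution with probability $R(t) = e^{-\lambda t}$, a perfect majority voter gives

$$R_{\mathrm{TMR}}(t) = 3R(t)^2 - 2R(t)^3,$$

which beats a single replica only while $R(t) > 0.5$. Comparisons of this kind are what let a compiler or run-time system choose between no redundancy, DMR (detection only), and TMR per application or loop.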


Electronics ◽  
2020 ◽  
Vol 9 (11) ◽  
pp. 1783 ◽  
Author(s):  
Ayaz Hussain ◽  
Muhammad Irfan ◽  
Naveed Khan Baloch ◽  
Umar Draz ◽  
Tariq Ali ◽  
...  

The router plays an important role in communication among the processing cores of an on-chip network. Technology scaling has, on one hand, enabled designers to integrate multiple processing components on a single chip; on the other hand, it has become a source of faults. A generic router consists of buffers and pipeline stages. A single fault may lead to the undesirable situation of degraded performance, or the whole chip may stop working. Therefore, it is necessary to protect all the components of the router against permanent faults. In this paper, we propose a mechanism that can tolerate permanent faults occurring in the router. We exploit the fault-tolerant techniques of resource sharing and pairing between components for the input port unit and the routing computation (RC) unit, resource borrowing for the virtual channel allocator (VA), and multiple paths for the switch allocator (SA) and crossbar (XB). The experimental results and analysis show that the proposed mechanism enhances the reliability of the router architecture against permanent faults at the cost of a 29% area overhead. The proposed router architecture achieves the highest Silicon Protection Factor (SPF) metric, 24.8, compared with state-of-the-art fault-tolerant architectures. It incurs a minimal increase in latency for SPLASH2 and PARSEC benchmark traffic compared with the baseline router.
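
The flavor of component pairing can be sketched in software terms (an illustrative sketch only; the paper's pairing and borrowing are hardware mechanisms, and the XY routing, class, and pairing table here are assumptions): a port whose RC unit is marked permanently faulty borrows the RC result of a paired port.

class RCUnit:
    def __init__(self, faulty=False):
        self.faulty = faulty

    def lookup(self, dest_x, dest_y, cur_x, cur_y):
        # Simple XY routing: move along X first, then along Y.
        if dest_x != cur_x:
            return "EAST" if dest_x > cur_x else "WEST"
        if dest_y != cur_y:
            return "NORTH" if dest_y > cur_y else "SOUTH"
        return "LOCAL"

PAIR = {"N": "S", "S": "N", "E": "W", "W": "E"}

def route(port, rc_units, dest, cur):
    rc = rc_units[port]
    if rc.faulty:                        # permanent fault detected on this port's RC unit
        rc = rc_units[PAIR[port]]        # borrow the paired port's RC unit
    return rc.lookup(*dest, *cur)

rc_units = {"N": RCUnit(faulty=True), "S": RCUnit(), "E": RCUnit(), "W": RCUnit()}
print(route("N", rc_units, dest=(3, 2), cur=(1, 2)))   # -> "EAST", despite N's faulty RC unit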


2011 ◽  
Vol 2011 ◽  
pp. 1-15 ◽  
Author(s):  
Mohsin Amin ◽  
Abbas Ramazani ◽  
Fabrice Monteiro ◽  
Camille Diou ◽  
Abbas Dandache

We introduce a specialized self-checking hardware journal used as the centerpiece of our design strategy to build a processor tolerant to transient faults. Fault tolerance here relies on the use of error detection techniques in the processor core together with journalization and rollback execution to recover from erroneous situations. Effective rollback recovery is possible thanks to the hardware journal and to the choice of a stack computing architecture for the processor core instead of the usual RISC or CISC. The main objective of the journalization and the self-checking hardware journal is to prevent data that has not yet been validated from being sent to the main memory and to allow fast rollback execution in faulty situations. The main memory, assumed to be fault-secure in our model, contains only valid (uncorrupted) data obtained from fault-free computations. Error control coding techniques are used both in the processor core to detect errors and in the hardware journal to protect the temporarily stored data from possible changes induced by transient faults. Implementation results on an FPGA of the Altera Stratix-II family clearly show the relevance of the approach, both in terms of the performance/area trade-off and of fault tolerance effectiveness, even for high error rates.
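
The journalization idea of keeping unvalidated stores out of main memory can be sketched as follows (a minimal software analogy; the class, the validation trigger, and the commit policy are illustrative, not the paper's hardware design):

class JournalizedMemory:
    def __init__(self):
        self.main = {}        # fault-secure main memory (address -> value)
        self.journal = []     # unvalidated writes since the last validation point

    def write(self, addr, value):
        self.journal.append((addr, value))    # held back from main memory

    def read(self, addr):
        for a, v in reversed(self.journal):   # the newest journal entry wins
            if a == addr:
                return v
        return self.main.get(addr)

    def validate(self):
        # No error detected since the last validation point: commit the journal.
        for a, v in self.journal:
            self.main[a] = v
        self.journal.clear()

    def rollback(self):
        # Error detected: discard unvalidated writes; main memory stays clean.
        self.journal.clear()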


Author(s):  
Omer Subasi ◽  
Tatiana Martsinkevich ◽  
Ferad Zyulkyarov ◽  
Osman Unsal ◽  
Jesus Labarta ◽  
...  

We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as to the recovery of tasks. Second, we develop a mathematical model that unifies task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.
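
For intuition about the optimal checkpointing interval (this is the classic first-order Young/Daly approximation, not the closed formulas derived in the paper), with checkpoint cost $C$ and mean time between failures $M$, the system-wide checkpoint interval that minimizes expected overhead is approximately

$$T_{\mathrm{opt}} \approx \sqrt{2\,C\,M},$$

so cheaper checkpoints or less reliable systems both push toward checkpointing more frequently, while task-level recovery that covers most failures effectively raises the $M$ seen by the system-wide level and lets it checkpoint less often.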

