A Self-Checking Hardware Journal for a Fault-Tolerant Processor Architecture

We introduce a specialized self-checking hardware journal that serves as the centerpiece of our design strategy for building a processor tolerant to transient faults. Fault tolerance here relies on error detection techniques in the processor core together with journalization and rollback execution to recover from erroneous situations. Effective rollback recovery is made possible by the hardware journal and by choosing a stack computing architecture for the processor core instead of the usual RISC or CISC. The main objective of journalization and of the self-checking hardware journal is to prevent data that has not yet been validated from being sent to main memory, and to allow fast rollback when a fault occurs. The main memory, assumed to be fault-secure in our model, contains only valid (uncorrupted) data obtained from fault-free computations. Error control coding techniques are used both in the processor core to detect errors and in the hardware journal to protect temporarily stored data from changes induced by transient faults. Implementation results on an FPGA of the Altera Stratix-II family clearly show the relevance of the approach, in terms of both performance/area tradeoff and fault tolerance effectiveness, even at high error rates.
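The journal-and-rollback idea in this abstract can be illustrated in software. The Python model below is only a sketch under stated assumptions: the class name `WriteJournal`, its methods, and the dictionary-based memory are hypothetical, and the error control coding that protects the real hardware journal is left out.

```python
class WriteJournal:
    """Hypothetical software model of a write journal with rollback.

    Writes are buffered until they are validated, so the backing memory
    (assumed fault-secure) only ever receives validated data.
    """

    def __init__(self, memory):
        self.memory = memory   # dict: address -> value
        self.pending = {}      # writes not yet validated

    def write(self, addr, value):
        # Unvalidated data stays in the journal, never in main memory.
        self.pending[addr] = value

    def read(self, addr):
        # The newest value wins: journal first, then main memory.
        if addr in self.pending:
            return self.pending[addr]
        return self.memory.get(addr, 0)

    def commit(self):
        # Called when the current sequence completed without a detected error.
        self.memory.update(self.pending)
        self.pending.clear()

    def rollback(self):
        # Called when an error is detected: discard all unvalidated writes.
        self.pending.clear()
```

In this model, a detected error triggers `rollback()` and the sequence is executed again from the last committed state, while an error-free sequence ends with `commit()`.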

Exponential suppression of bit or phase errors with cyclic error correction

Nature ◽ 10.1038/s41586-021-03588-y ◽ 2021 ◽ Vol 595 (7867) ◽ pp. 383-387

Author(s): Zijun Chen ◽ Kevin J. Satzinger ◽ Juan Atalaya ◽ Alexander N. Korotkov ◽ ...

Keyword(s): Error Correction ◽ Error Detection ◽ Fault Tolerant ◽ Error Rates ◽ Quantum Error Correction ◽ Superconducting Qubits ◽ Two Dimensional ◽ Logical Error ◽ Quantum Error ◽ Logical Qubit

Realizing the potential of quantum computing requires sufficiently low logical error rates (ref. 1). Many applications call for error rates as low as 10^−15 (refs. 2–9), but state-of-the-art quantum platforms typically have physical error rates near 10^−3 (refs. 10–14). Quantum error correction (refs. 15–17) promises to bridge this divide by distributing quantum logical information across many physical qubits in such a way that errors can be detected and corrected. Errors on the encoded logical qubit state can be exponentially suppressed as the number of physical qubits grows, provided that the physical error rates are below a certain threshold and stable over the course of a computation. Here we implement one-dimensional repetition codes embedded in a two-dimensional grid of superconducting qubits that demonstrate exponential suppression of bit-flip or phase-flip errors, reducing logical error per round more than 100-fold when increasing the number of qubits from 5 to 21. Crucially, this error suppression is stable over 50 rounds of error correction. We also introduce a method for analysing error correlations with high precision, allowing us to characterize error locality while performing quantum error correction. Finally, we perform error detection with a small logical qubit using the 2D surface code on the same device (refs. 18, 19) and show that the results from both one- and two-dimensional codes agree with numerical simulations that use a simple depolarizing error model. These experimental demonstrations provide a foundation for building a scalable fault-tolerant quantum computer with superconducting qubits.
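The suppression of logical errors with code size has a simple classical analogue that may help intuition. The Python sketch below assumes a classical repetition code with independent bit flips of probability p and majority-vote decoding. It does not model the quantum experiment, its measurement rounds, or its decoder, and the function names are hypothetical.

```python
from math import comb


def majority_decode(bits):
    """Decode a repetition-code word by majority vote (odd length assumed)."""
    return int(sum(bits) > len(bits) // 2)


def logical_error_rate(n, p):
    """Probability that majority voting fails for an n-bit repetition code.

    Assumes each bit flips independently with probability p. Decoding fails
    when more than half of the n bits are flipped.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that the majority is unambiguous")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )
```

For p = 0.1, `logical_error_rate(3, 0.1)` is approximately 0.028, and the value keeps shrinking as n grows, as long as p stays below 0.5.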

Proposal of an Adaptive Fault Tolerance Mechanism to Tolerate Intermittent Faults in RAM

Electronics ◽ 10.3390/electronics9122074 ◽ 2020 ◽ Vol 9 (12) ◽ pp. 2074

Author(s): J.-Carlos Baraza-Calvo ◽ Joaquín Gracia-Morán ◽ Luis-J. Saiz-Adalid ◽ Daniel Gil-Tomás ◽ Pedro-J. Gil-Vicente

Keyword(s): Fault Tolerance ◽ Error Correction ◽ Error Detection ◽ Fault Injection ◽ Error Correction Codes ◽ Transient Faults ◽ Tolerance Mechanism ◽ Intermittent Faults ◽ Risc Processor ◽ Simulation Based

Due to transistor shrinking, intermittent faults are a major concern in current digital systems. This work presents an adaptive fault tolerance mechanism based on error correction codes (ECC), able to modify its behavior when the error conditions change without increasing the redundancy. As a case example, we have designed a mechanism that can detect intermittent faults and swap from an initial generic ECC to a specific ECC capable of tolerating one intermittent fault. We have inserted the mechanism in the memory system of a 32-bit RISC processor and validated it by using VHDL simulation-based fault injection. We have used two (39, 32) codes: a single error correction–double error detection (SEC–DED) and a code developed by our research group, called EPB3932, capable of correcting single errors and double and triple adjacent errors that include a bit previously tagged as error-prone. The results of injecting transient, intermittent, and combinations of intermittent and transient faults show that the proposed mechanism works properly. As an example, the percentage of failures and latent errors is 0% when injecting a triple adjacent fault after an intermittent stuck-at fault. We have synthesized the adaptive fault tolerance mechanism proposed in two types of FPGAs: non-reconfigurable and partially reconfigurable. In both cases, the overhead introduced is affordable in terms of hardware, time and power consumption.
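The SEC–DED principle mentioned in this abstract can be shown on a much smaller code. The Python sketch below uses an extended Hamming (8, 4) code, not the (39, 32) codes of the paper, and it does not implement EPB3932 or the adaptive switching mechanism. The function names and the status strings are hypothetical.

```python
def secded_encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Layout: index 0 holds the overall parity bit; indices 1..7 form a
    Hamming(7,4) word with parity bits at 1, 2, 4 and data at 3, 5, 6, 7.
    """
    c = [0] * 8
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
    return c


def secded_decode(word):
    """Return (data, status): status is 'ok', 'corrected', or 'double-error'."""
    c = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall = 0
    for bit in c:
        overall ^= bit               # parity over all 8 bits
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one, located at index `syndrome`
        # (syndrome 0 means the overall parity bit itself was hit).
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "double-error"
    return [c[3], c[5], c[6], c[7]], status
```

Any single flipped bit is corrected and any pair of flipped bits is reported as uncorrectable. Three or more flips can be miscorrected, which is one reason codes tailored to adjacent errors are of interest.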

Survey on Fault Tolerance Startgies for Advance Microelectronics Chip

International Journal on Recent and Innovation Trends in Computing and Communication ◽ 10.17762/ijritcc.v7i1.5217 ◽ 2019 ◽ Vol 7 (1) ◽ pp. 01-04

Author(s): Himanshu Shekhar ◽ Prof. Deepa Gianchandani

Keyword(s): Fault Tolerance ◽ Power Supply ◽ Fault Tolerant ◽ Full Adder ◽ Equipment Design ◽ Transient Faults ◽ Transient Fault

In complex systems based on advanced microelectronics, processing units handle devices of ever smaller size, which are sensitive to transient faults. A system should be built that recognizes the presence of faults and incorporates techniques to tolerate them without disrupting normal operation. A transient fault in a circuit is caused by electromagnetic noise, cosmic rays, crosstalk, and power supply noise. It is very hard to detect these faults during offline testing. Hence, an area-efficient fault-tolerant full adder is proposed for testing and repairing transient and permanent faults occurring in single and multiple nets. Furthermore, the proposed design can also detect and repair permanent faults. This design incurs much lower hardware overhead than the conventional hardware design. This paper discusses various fault-tolerance strategies for CMOS and ICs.

Fault Tolerance in Carbon Nanotube Transistors Based Multi Valued Logic

10.5772/intechopen.95361 ◽ 2021

Author(s): Gopalakrishnan Sundararajan

Keyword(s): Fault Tolerance ◽ Error Correction ◽ Field Effect ◽ Fault Tolerant ◽ Field Effect Transistors ◽ Transient Faults ◽ Carbon Nanotube Transistors ◽ Nanotube Transistors ◽ Modular Redundancy ◽ Carbon Nano Tube

This chapter presents a solution for fault tolerance in Multi-Valued Logic (MVL) circuits composed of Carbon Nano-Tube Field Effect Transistors (CNTFET). It reviews the basic primitives of MVL and describes ternary implementations of CNTFET circuits. Finally, it describes a method for error correction called Restorative Feedback (RFB). The RFB method is a variant of Triple-Modular Redundancy (TMR) that uses the fault-masking capabilities of the Muller C element to provide added protection against noisy transient faults. The fault-tolerant properties of the Muller C element are discussed, and the error correction capability of the RFB method is demonstrated in detail.
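The two primitives named in this abstract, TMR voting and the Muller C element, can be sketched behaviorally. The Python functions below assume binary signals rather than the ternary logic of the chapter. They do not reproduce the RFB circuit itself, and their names are hypothetical.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three redundant copies (classic TMR voter)."""
    return (a & b) | (a & c) | (b & c)


def muller_c(a, b, previous):
    """Behavioral model of a two-input Muller C element.

    The output follows the inputs only when they agree; when they disagree,
    the element holds its previous output, which masks a transient glitch
    on a single input.
    """
    return a if a == b else previous
```

For example, `muller_c(1, 0, 0)` keeps the output at 0 because the inputs disagree, while `muller_c(1, 1, 0)` switches it to 1.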

DuckCore: A Fault-Tolerant Processor Core Architecture Based on the RISC-V ISA

Electronics ◽ 10.3390/electronics11010122 ◽ 2021 ◽ Vol 11 (1) ◽ pp. 122

Author(s): Jiemin Li ◽ Shancong Zhang ◽ Chong Bao

Keyword(s): Error Detection ◽ Integrated Circuit ◽ Large Scale ◽ Fault Tolerant ◽ Soft Errors ◽ Implementation Process ◽ Radiation Environment ◽ Cmos Integrated Circuit ◽ Processor Core ◽ The Impact

With the development of large-scale CMOS integrated circuit manufacturing technology, microprocessor chips are more vulnerable to soft errors and radiation interference, resulting in reduced reliability. Core reliability is an important element of the microprocessor’s ability to resist soft errors. This paper proposes DuckCore, a fault-tolerant processor core architecture based on the free and open instruction set architecture (ISA) RISC-V. This architecture uses an improved SECDED (single error correction, double error detection) code between pipeline stages, detects processor operating errors in real time through the Supervision unit, and performs instruction rollbacks for different error types, which both saves resources and improves the reliability of the processor core. In the implementation process, all error injection tests are passed, verifying the completeness of the function. To better verify the performance of the processor under different error injection intensities, software is used to inject errors while the program runs on the FPGA (Field Programmable Gate Array), and the impact of the actual radiation environment on the architecture is evaluated through the results. The architecture is applied to three–five-stage open-source processor cores, and the results show that this method consumes fewer resources and that its discrete design makes it more portable.

Fault-Tolerant Protocols Using Compilers and Translators

Application-Layer Fault-Tolerance Protocols ◽ 10.4018/978-1-60566-182-7.ch004 ◽ 2009 ◽ pp. 133-160

Author(s): Vincenzo De Florio

Keyword(s): Fault Tolerance ◽ Programming Languages ◽ Data Structures ◽ Error Detection ◽ Fault Tolerant ◽ Error Recovery ◽ Application Layer ◽ Redundant Data ◽ New Research ◽ The University

In this chapter our survey of methods and structures for application-level fault-tolerance continues, getting closer to the programming language: Indeed, tools such as compilers and translators work at the level of the language—they parse, interpret, compile or transform our programs, so they are interesting candidates for managing dependability aspects in the application layer. An important property of this family of methods is the fact that fault-tolerance complexity is extracted from the program and turned into architectural complexity in the compiler or the translator. Apart from continuing with our survey, this chapter also aims at providing the reader with two practical examples: • Reflective and refractive variables, that is, a syntactical structure to express adaptive feedback loops in the application layer. This is useful to resilient computing because a feedback loop can attach error recovery strategies to error detection events. • Redundant variables, that is, a tool that allows designers to make use of adaptively redundant data structures with commodity programming languages such as C or Java. Designers using such tools can define redundant data structures in which the degree of redundancy is not fixed once and for all at design time, but rather it changes dynamically with respect to the disturbances experienced during the run time. Both tools are new research activities that are currently being carried out by the author of this book at the PATS research group of the University of Antwerp. It is shown how through a simple translation approach it is possible to provide sophisticated features such as adaptive fault-tolerance to programs written in any language, even plain old C.

Intravenous Chemotherapy Compounding Errors in a Follow-Up Pan-Canadian Observational Study

Journal of Oncology Practice ◽ 10.1200/jop.17.00007 ◽ 2018 ◽ Vol 14 (5) ◽ pp. e295-e303 ◽ Cited By ~ 5

Author(s): Rachel E. Gilbert ◽ Melissa C. Kozak ◽ Roxanne B. Dobish ◽ Venetia C. Bourrier ◽ Paul M. Koke ◽ ...

Keyword(s): Observational Study ◽ Error Detection ◽ Significant Degree ◽ Error Rates ◽ International Standards ◽ Cancer Center ◽ Loss Of Function ◽ Detection Techniques ◽ Wrong Drug ◽ Practice Standards

Purpose: Intravenous (IV) compounding safety has garnered recent attention as a result of high-profile incidents, awareness efforts from the safety community, and increasingly stringent practice standards. New research with more-sensitive error detection techniques continues to reinforce that error rates with manual IV compounding are unacceptably high. In 2014, our team published an observational study that described three types of previously unrecognized and potentially catastrophic latent chemotherapy preparation errors in Canadian oncology pharmacies that would otherwise be undetectable. We expand on this research and explore whether additional potential human failures are yet to be addressed by practice standards. Methods: Field observations were conducted in four cancer center pharmacies in four Canadian provinces from January 2013 to February 2015. Human factors specialists observed and interviewed pharmacy managers, oncology pharmacists, pharmacy technicians, and pharmacy assistants as they carried out their work. Emphasis was on latent errors (potential human failures) that could lead to outcomes such as wrong drug, dose, or diluent. Results: Given the relatively short observational period, no active failures or actual errors were observed. However, 11 latent errors in chemotherapy compounding were identified. In terms of severity, all 11 errors create the potential for a patient to receive the wrong drug or dose, which in the context of cancer care, could lead to death or permanent loss of function. Three of the 11 practices were observed in our previous study, but eight were new. Applicable Canadian and international standards and guidelines do not explicitly address many of the potentially error-prone practices observed. Conclusion: We observed a significant degree of risk for error in manual mixing practice. These latent errors may exist in other regions where manual compounding of IV chemotherapy takes place. Continued efforts to advance standards, guidelines, technological innovation, and chemical quality testing are needed.

A new architecture for online error detection and isolation in network on chip

Journal of High Speed Networks ◽ 10.3233/jhs-200646 ◽ 2020 ◽ Vol 26 (4) ◽ pp. 307-323

Author(s): Chakib Nehnouh

Keyword(s): Error Detection ◽ Fault Tolerant ◽ High Reliability ◽ Low Cost ◽ Network On Chip ◽ Fault Detection And Isolation ◽ Main Concern ◽ Transient Faults ◽ Protection Factor ◽ On Chip

The Network-on-Chip (NoC) has become a promising communication infrastructure for Multiprocessor System-on-Chip (MPSoC). Reliability is a main concern in NoC, and performance degrades when the NoC is susceptible to faults. A fault can be defined as the cause of a deviation from the desired operation of the system (an error). To deal with these reliability challenges, this work proposes OFDIM (Online Fault Detection and Isolation Mechanism), a novel combined methodology to tolerate multiple permanent and transient faults. The new router architecture uses two modules to ensure a highly reliable and low-cost fault-tolerant strategy. In contrast to existing works, our architecture offers less area, more fault tolerance, and high reliability. The reliability comparison using the Silicon Protection Factor (SPF) shows a 22-fold improvement, and the additional circuitry incurs an area overhead of 27%, which is better than state-of-the-art reliable router architectures. The results also show that throughput decreases by only 5.19% and average latency increases by only 2.40%, while providing high reliability.

Model Checking-based Software-FMEA: Assessment of Fault Tolerance and Error Detection Mechanisms

Periodica Polytechnica Electrical Engineering and Computer Science ◽ 10.3311/ppee.9755 ◽ 2017 ◽ Vol 61 (2) ◽ pp. 132 ◽ Cited By ~ 4

Author(s): Vince Molnár ◽ István Majzik

Keyword(s): Fault Tolerance ◽ Model Checking ◽ Error Detection ◽ Failure Modes ◽ System Level ◽ Complex Nature ◽ Software Faults ◽ Software Failures ◽ Detection Techniques ◽ Embedded Operating Systems

Failure Mode and Effects Analysis (FMEA) is a systematic technique to explore the possible failure modes of individual components or subsystems and determine their potential effects at the system level. Applications of FMEA are common in case of hardware and communication failures, but analyzing software failures (SW-FMEA) poses a number of challenges. Failures may originate in permanent software faults commonly called bugs, and their effects can be very subtle and hard to predict, due to the complex nature of programs. Therefore, a behavior-based automatic method to analyze the potential effects of different types of bugs is desirable. Such a method could be used to automatically build an FMEA report about the fault effects, or to evaluate different failure mitigation and detection techniques. This paper follows the latter direction, demonstrating the use of a model checking-based automated SW-FMEA approach to evaluate error detection and fault tolerance mechanisms, demonstrated on a case study inspired by safety-critical embedded operating systems.

Effects of Physical Injection of Transient Faults on Control Flow and Evaluation of Some Software-Implemented Error Detection Techniques

Dependable Computing and Fault-Tolerant Systems - Dependable Computing for Critical Applications 4 ◽ 10.1007/978-3-7091-9396-9_36 ◽ 1995 ◽ pp. 435-457 ◽ Cited By ~ 1

Author(s): Ghassem Miremadi ◽ Jan Torin

Keyword(s): Error Detection ◽ Control Flow ◽ Transient Faults ◽ Detection Techniques
