Fault-Tolerant Protocols Using Compilers and Translators

Author(s):  
Vincenzo De Florio

In this chapter our survey of methods and structures for application-level fault-tolerance continues, getting closer to the programming language. Tools such as compilers and translators work at the level of the language itself: they parse, interpret, compile, or transform our programs, and are therefore interesting candidates for managing dependability aspects in the application layer. An important property of this family of methods is that fault-tolerance complexity is extracted from the program and turned into architectural complexity in the compiler or the translator. Apart from continuing our survey, this chapter also aims at providing the reader with two practical examples:

• Reflective and refractive variables, that is, a syntactical structure to express adaptive feedback loops in the application layer. This is useful to resilient computing because a feedback loop can attach error-recovery strategies to error-detection events.

• Redundant variables, that is, a tool that allows designers to make use of adaptively redundant data structures with commodity programming languages such as C or Java. Designers using such tools can define redundant data structures in which the degree of redundancy is not fixed once and for all at design time, but changes dynamically with respect to the disturbances experienced at run time (a minimal sketch of this idea is given below).

Both tools are new research activities currently being carried out by the author of this book at the PATS research group of the University of Antwerp. It is shown how, through a simple translation approach, it is possible to provide sophisticated features such as adaptive fault-tolerance to programs written in any language, even plain old C.
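A minimal C sketch of the redundant-variable idea follows, under the assumption of an integer replicated a configurable number of times; all names (rint_t, rint_write, rint_read, rint_adapt) are hypothetical and do not belong to the tools described in the chapter.

    /* Sketch of an adaptively redundant integer variable (illustration only). */
    #include <stdio.h>

    #define MAX_REPLICAS 7

    typedef struct {
        int replicas[MAX_REPLICAS];
        int degree;               /* current degree of redundancy (odd) */
    } rint_t;

    static void rint_write(rint_t *v, int value) {
        for (int i = 0; i < v->degree; i++)
            v->replicas[i] = value;          /* update every replica */
    }

    static int rint_read(const rint_t *v) {
        /* majority vote: return the value held by most replicas */
        int best = v->replicas[0], best_count = 0;
        for (int i = 0; i < v->degree; i++) {
            int count = 0;
            for (int j = 0; j < v->degree; j++)
                if (v->replicas[j] == v->replicas[i]) count++;
            if (count > best_count) { best_count = count; best = v->replicas[i]; }
        }
        return best;
    }

    static void rint_adapt(rint_t *v, double observed_error_rate) {
        /* raise or lower redundancy with respect to observed disturbances */
        int wanted = observed_error_rate > 0.01 ? 5 : 3;
        if (wanted > MAX_REPLICAS) wanted = MAX_REPLICAS;
        int value = rint_read(v);
        v->degree = wanted;
        rint_write(v, value);
    }

    int main(void) {
        rint_t x = { .degree = 3 };
        rint_write(&x, 42);
        x.replicas[1] = 17;                   /* simulate a corrupted replica */
        printf("read: %d\n", rint_read(&x));  /* majority vote still yields 42 */
        rint_adapt(&x, 0.02);                 /* disturbances increased: degree -> 5 */
        return 0;
    }

The intent of the translation approach is that such replication, voting, and adaptation logic is generated by the tool rather than written by hand, so the application code keeps looking like plain C.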

Author(s):  
Vincenzo De Florio

This chapter resumes our survey of application-level fault-tolerance protocols, considering approaches based on aspect-oriented programming. Aspect-oriented programming languages allow source code to be regarded as a pliable web that the designer can weave so as to specialize or optimize it towards a certain goal without having to recode it. This useful property keeps concerns separated, bounds complexity, and enhances maintainability. Aspect programs may be used for different objectives, including non-functional properties such as dependability. To date, it is not known whether aspect orientation will actually provide satisfactory solutions for fault-tolerance in the application layer. Some researchers believe this is not the case, at least for some fault-tolerance paradigms (Kienzle & Guerraoui, 2002). Some preliminary studies have been carried out (for instance in (Lippert & Videira Lopes, 2000)), but no definitive word has been said on the matter. It is our belief that, at least for some paradigms, aspects may reveal themselves as invaluable tools for engineering application-level fault-tolerance services. For this reason their approach is described in this chapter.


Author(s):  
Vincenzo De Florio

The programming language itself is the focus of this chapter: fault-tolerance is not embedded in the program (as is the case, for example, with single-version fault-tolerance), nor placed around the language (through compilers or translators); on the contrary, fault-tolerance is provided through the syntactical structures and the run-time executives of fault-tolerance programming languages. In this case, too, a significant part of the complexity of dependability enforcement is moved from each individual program to the architecture, here the programming language. Many fault-tolerance programming languages exist; this chapter presents a few of them, considering three cases: object-oriented languages, functional languages, and hybrid languages. In particular, the case of Oz is discussed, a multi-paradigm programming language that achieves both transparent distribution and translucent failure handling.


Author(s):  
Vincenzo De Florio

After having described the main characteristics of dependability and fault-tolerance, this chapter analyzes in more detail what it means for a program to be fault-tolerant and which properties are expected from a fault-tolerant program. The main objective of this chapter is to introduce two sets of design assumptions that shape the way our fault-tolerant software is structured: the system model and the fault model. Often misunderstood or underestimated, these models describe

• what is expected from the execution environment in order to let our software system function correctly, and
• which faults our system is going to consider.

Note that a fault-tolerant program shall (try to) tolerate only those faults stated in the fault model, and will be as defenseless against all other faults as any non-fault-tolerant program. Together with the system specification, the fault and system models represent the foundation on top of which our computer services are built. It is not surprising that weak foundations often result in failing constructions. What is really surprising is that in so many cases little or no attention has been given to these important factors in fault-tolerant software engineering. To give an idea of this, three well-known accidents are described: the Ariane 5 flight 501 and Mariner 1 disasters, and the Therac-25 accidents. In each case it is stressed what went wrong, what the biggest mistakes were, and how a careful understanding of fault models and system models would have helped highlight the path to avoid catastrophic failures that cost considerable amounts of money and even the lives of innocent people. The other important objective of this chapter is to introduce the core subject of this book: software fault-tolerance situated at the level of the application layer. First of all, it is explained why targeting (also) the application layer is not an open option but a mandatory design choice for effective fault-tolerant software engineering. Secondly, given the peculiarities of the application layer, three properties to measure the quality of methods to achieve fault-tolerant application software are introduced:

1. Separation of design concerns, that is, how good the method is at keeping the functional aspects and the fault-tolerance aspects separated from each other.
2. Syntactical adequacy, namely how versatile the employed method is in covering the wider spectrum of fault-tolerance strategies.
3. Adaptability, that is, how good the employed fault-tolerance method is at dealing with the inevitable changes characterizing the system and its run-time environment, including the dynamics of faults that manifest themselves at service time.

Finally, this chapter also defines a few fundamental fault-tolerance services, namely watchdog timers, exception handling, transactions, and checkpointing-and-rollback; a minimal sketch of a watchdog timer is given below.
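As a concrete illustration of the first of those services, the following is a minimal application-level watchdog timer in C using POSIX threads; it is a generic illustration under assumed names (heartbeat, recover, watchdog), not an implementation taken from the book.

    /* Minimal application-level watchdog timer (generic illustration only).
     * The watched task must call heartbeat() at least once per period,
     * otherwise the watchdog thread invokes an error-recovery handler. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_int alive = 0;                   /* heartbeat flag */

    void heartbeat(void) { atomic_store(&alive, 1); }

    static void recover(void) {
        /* placeholder: restart the task, roll back, switch to a spare, ... */
        fprintf(stderr, "watchdog: task missed its deadline, recovering\n");
    }

    static void *watchdog(void *arg) {
        unsigned period_s = *(unsigned *)arg;
        for (;;) {
            sleep(period_s);                       /* one watchdog period */
            if (!atomic_exchange(&alive, 0))       /* no heartbeat observed? */
                recover();                         /* error detected: recover */
        }
        return NULL;
    }

    int main(void) {
        unsigned period = 1;                       /* seconds */
        pthread_t tid;
        pthread_create(&tid, NULL, watchdog, &period);

        for (int i = 0; i < 3; i++) {              /* the watched computation */
            heartbeat();
            sleep(1);                              /* simulated useful work */
        }
        sleep(3);                                  /* heartbeats stop here, */
        return 0;                                  /* so the watchdog fires */
    }

The watched computation signals liveness once per period; when the heartbeat stops, the watchdog detects the error and triggers whatever recovery action the fault model prescribes.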


2021
Vol 8 (4)
pp. 1-19
Author(s):
Xuejiao Kang
David F. Gleich
Ahmed Sameh
Ananth Grama

As parallel and distributed systems scale, fault tolerance is an increasingly important problem—particularly on systems with limited I/O capacity and bandwidth. Erasure-coded computations address this problem by augmenting a given problem instance with redundant data and then solving the augmented problem in a fault-oblivious manner in a faulty parallel environment. In the event of faults, a computationally inexpensive procedure is used to compute the true solution from a potentially fault-prone solution. These techniques are significantly more efficient than conventional solutions to the fault tolerance problem. In this article, we show how we can minimize, to optimality, the overhead associated with our problem augmentation techniques for linear system solvers. Specifically, we present a technique that adaptively augments the problem only when faults are detected. At any point in execution, we only solve a system whose size is identical to the original input system. This has several advantages in terms of maintaining the size and conditioning of the system, as well as in only adding the minimal amount of computation needed to tolerate observed faults. We present, in detail, the augmentation process, the parallel formulation, and an evaluation of the performance of our technique. Specifically, we show that the proposed adaptive fault tolerance mechanism has minimal overhead in terms of FLOP counts with respect to the original solver executing in a non-faulty environment, has good convergence properties, and yields excellent parallel performance. We also demonstrate that our approach significantly outperforms an optimized application-level checkpointing scheme that only checkpoints needed data structures.
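In general terms (and only as a generic illustration in LaTeX notation, not necessarily the exact encoding used by the authors), erasure-coded linear solvers replace the input system Ax = b by a system augmented with rows built from an encoding matrix E:

\[
\tilde{A} = \begin{pmatrix} A \\ E A \end{pmatrix}, \qquad
\tilde{b} = \begin{pmatrix} b \\ E b \end{pmatrix},
\]

so that x remains recoverable even when some of the original equations are lost to faults. The adaptive scheme described above defers this augmentation until faults are actually detected, which is why the system being solved always keeps the size and conditioning of the original input system.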


2011
Vol 2011
pp. 1-15
Author(s):
Mohsin Amin
Abbas Ramazani
Fabrice Monteiro
Camille Diou
Abbas Dandache

We introduce a specialized self-checking hardware journal being used as a centerpiece in our design strategy to build a processor tolerant to transient faults. Fault tolerance here relies on the use of error detection techniques in the processor core together with journalization and rollback execution to recover from erroneous situations. Effective rollback recovery is possible thanks to using a hardware journal and chosing a stack computing architecture for the processor core instead of the usual RISC or CISC. The main objective of the journalization and the hardware self-checking journal is to prevent data not yet validated to be sent to the main memory, and allow to fast rollback execution on faulty situations. The main memory, supposed to be fault secure in our model, only contains valid (uncorrupted) data obtained from fault-free computations. Error control coding techniques are used both in the processor core to detect errors and in the HW journal to protect the temporarily stored data from possible changes induced by transient faults. Implementation results on an FPGA of the Altera Stratix-II family show clearly the relevance of the approach, both in terms of performance/area tradeoff and fault tolerance effectiveness, even for high error rates.
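The role of the journal can be pictured in software terms roughly as follows; this is a conceptual C sketch only (the actual mechanism is a self-checking hardware block feeding a fault-secure main memory), with all names invented for the illustration.

    /* Software analogy of the journal: writes are buffered until the current
     * computation has been checked for errors, then committed to main memory;
     * on error detection they are discarded, so main memory never receives
     * unvalidated data. */
    #include <assert.h>
    #include <stdio.h>

    #define MEM_SIZE     64
    #define JOURNAL_SIZE 16

    static int memory[MEM_SIZE];                     /* "fault-secure" main memory */

    typedef struct { int addr; int value; } entry_t;
    static entry_t journal[JOURNAL_SIZE];
    static int journal_len = 0;

    void journal_write(int addr, int value) {        /* not yet validated */
        assert(journal_len < JOURNAL_SIZE && addr >= 0 && addr < MEM_SIZE);
        journal[journal_len].addr  = addr;
        journal[journal_len].value = value;
        journal_len++;
    }

    void commit(void) {                              /* self-checks passed */
        for (int i = 0; i < journal_len; i++)
            memory[journal[i].addr] = journal[i].value;
        journal_len = 0;
    }

    void rollback(void) {                            /* error detected */
        journal_len = 0;                             /* main memory untouched */
    }

    int main(void) {
        journal_write(3, 7);
        journal_write(4, 9);
        int error_detected = 0;                      /* outcome of the core's checks */
        if (error_detected) rollback(); else commit();
        printf("memory[3] = %d, memory[4] = %d\n", memory[3], memory[4]);
        return 0;
    }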


Author(s):  
YUNG-YUAN CHEN

In recent years, the very long instruction word (VLIW) processor has attracted much attention because it offers high instruction-level parallelism and reduces hardware design complexity. In this paper, we present two fault-tolerant schemes for VLIW processors. The first one is termed the test-instruction scheme and is based on the concept of instruction duplication to detect errors. The process of the test-instruction scheme consists of error detection, error rollback recovery, and reconfiguration. The second approach is called the self-checking scheme and adopts the concept of self-checking logic to detect errors. A real-time error recovery procedure is developed to handle the errors. We implement the proposed designs of the fault-tolerant VLIW processor in VHDL and employ fault injection and fault simulation to validate our schemes. The main contribution of this research is to present complete frameworks, from error detection to error recovery, for the fault-tolerant design of VLIW processors. A lesson learned from this investigation is that the issues of error detection and error recovery must be considered together; without taking both issues into account simultaneously, the outcomes may lead to improper conclusions.
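The duplicate-and-compare idea underlying instruction duplication can be sketched in software roughly as follows; this is a conceptual illustration only, not the paper's VHDL design.

    /* Duplicate-and-compare illustration: each operation is executed twice and
     * the results are compared; on a mismatch the state is rolled back to the
     * last checkpoint and the operation is retried. Concept sketch only. */
    #include <stdio.h>

    static int state = 0;             /* architectural state being protected */
    static int checkpoint;            /* last known-good copy of the state */

    static int faulty = 1;            /* inject one transient fault for the demo */

    int step(int x) {                 /* the operation being duplicated */
        int r = state + x;
        if (faulty) { faulty = 0; r ^= 4; }   /* simulated transient error */
        return r;
    }

    void execute(int x) {
        for (;;) {
            checkpoint = state;       /* take a checkpoint */
            int r1 = step(x);         /* primary execution */
            int r2 = step(x);         /* duplicated execution */
            if (r1 == r2) {           /* results agree: commit */
                state = r1;
                return;
            }
            state = checkpoint;       /* mismatch: roll back and retry */
        }
    }

    int main(void) {
        execute(5);
        execute(7);
        printf("state = %d\n", state);    /* 12, despite the injected fault */
        return 0;
    }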


10.29007/brkj
2019
Author(s):
Jia Xu

In a real-time embedded system that uses a primary and an alternate for each real-time task to achieve fault tolerance, both primaries and alternates need to be allowed to have critical sections/segments in which shared data structures can be read and updated, while guaranteeing that the execution of any part of one critical section will not be interleaved or overlapped with the execution of any part of a critical section belonging to some other primary or alternate that reads and writes those shared data structures. In this paper a software architecture is presented which effectively handles critical-section constraints where both primaries and alternates may have critical sections that can either overrun or underrun, while still guaranteeing that all primaries or alternates that do not overrun will always meet their deadlines and that the shared data are kept in a consistent state on a multiprocessor in a fault-tolerant real-time embedded system.
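The critical-section constraint itself (not the paper's overrun/underrun-aware scheduling architecture) can be illustrated with a minimal POSIX-threads sketch in C, in which a primary and an alternate version of a task both update shared data only inside mutually exclusive critical sections.

    /* Illustration of the critical-section constraint only: the primary and the
     * alternate both touch the shared data exclusively inside a mutex-protected
     * critical section, so their updates can never interleave with the critical
     * sections of other primaries/alternates using the same lock. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_counter = 0;                 /* the shared data structure */

    void *primary(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);                 /* enter critical section */
        shared_counter += 2;                       /* full-precision result */
        pthread_mutex_unlock(&lock);               /* leave critical section */
        return NULL;
    }

    void *alternate(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);                 /* enter critical section */
        shared_counter += 1;                       /* degraded but safe result */
        pthread_mutex_unlock(&lock);               /* leave critical section */
        return NULL;
    }

    int main(void) {
        pthread_t p, a;
        pthread_create(&p, NULL, primary, NULL);
        pthread_create(&a, NULL, alternate, NULL); /* in the demo both versions run;
                                                      a real system would release the
                                                      alternate only if the primary
                                                      fails or overruns */
        pthread_join(p, NULL);
        pthread_join(a, NULL);
        printf("shared_counter = %d\n", shared_counter);
        return 0;
    }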


Micromachines
2019
Vol 10 (5)
pp. 278
Author(s):
Binhan Du
Zhiyong Shi
Jinlong Song
Huaiguang Wang
Lanyi Han

The application of the Micro-Electro-Mechanical System (MEMS) inertial measurement unit has become a new research hotspot in the field of inertial navigation. To address the poor accuracy and stability of MEMS sensors, redundant design is an effective method under the restrictions of current technology. Redundant data processing is the most important part of a MEMS redundant inertial navigation system, and it includes the handling of abnormal data and the fusion estimation of redundant data. A quality index of the MEMS gyro measurement data is designed from the parity vector and the covariance matrix of the distributed Kalman filtering, and the weight coefficients of the gyros are calculated according to this index. The fault-tolerant fusion estimation of the redundant data is realized through the framework of the distributed Kalman filtering. Simulation experiments are conducted to test the performance of the new method with different types of anomalies.
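The fusion step can be written, in a generic covariance-weighted form that is standard for distributed Kalman filtering (and not necessarily the paper's exact formulation), as

\[
\hat{x} = \left(\sum_{i=1}^{n} w_i\, P_i^{-1}\right)^{-1} \sum_{i=1}^{n} w_i\, P_i^{-1}\, \hat{x}_i ,
\]

where each local estimate $\hat{x}_i$ with covariance $P_i$ contributes in inverse proportion to its uncertainty, and the weights $w_i$ would here be derived from the quality index built on the parity vector, so that a gyro whose data look anomalous receives a small weight and contributes little to the fused estimate.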


VLSI Design
2007
Vol 2007
pp. 1-13
Author(s):
Teijo Lehtonen
Pasi Liljeberg
Juha Plosila

We propose link structures for networks-on-chip (NoC) that efficiently tolerate transient, intermittent, and permanent errors. This is a necessary step towards implementing reliable systems in future nanoscale technologies. Protection against transient errors is realized using Hamming coding and interleaving for error detection, with retransmission as the recovery method. We introduce two approaches for tackling intermittent and permanent errors. In the first approach, spare wires are introduced together with reconfiguration circuitry. The other approach uses time redundancy: the transmission is split into two parts in which the data is doubled. In both structures the presence of permanent or intermittent errors is monitored by analyzing previous error syndromes. The links are based on self-timed signaling in which the handshake signals are protected using triple modular redundancy. We present the structures, operation, and designs of the different components of the links. The fault-tolerance properties are analyzed using a fault model containing temporary, intermittent, and permanent faults that occur both as bursts and as single faults. The results show a considerable enhancement in fault tolerance at the cost of performance and area, and with only a slight increase in power consumption.
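For the triple-modular-redundancy protection of the handshake signals, the voter reduces to a bitwise 2-out-of-3 majority function; a conceptual C sketch follows (the links, of course, realize this in hardware).

    /* Bitwise 2-out-of-3 majority voter, as used conceptually to protect TMR
     * handshake signals: the output follows whatever at least two copies agree
     * on, so a single corrupted copy is masked. */
    #include <assert.h>
    #include <stdio.h>

    unsigned majority3(unsigned a, unsigned b, unsigned c) {
        return (a & b) | (b & c) | (a & c);
    }

    int main(void) {
        /* three copies of a request line; copy b is hit by a transient fault */
        unsigned a = 1, b = 0, c = 1;
        assert(majority3(a, b, c) == 1);      /* the fault is masked */
        printf("voted handshake = %u\n", majority3(a, b, c));
        return 0;
    }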


2020
Author(s):
Ulrich Konrad

Currently, the qualitative spectrum of methods in the philological sciences is being substantially expanded, with far-reaching implications, through the integration of the empirical, quantitative, and evaluative possibilities of the Digital Humanities. The example of the planning and establishment of "Kallimachos," the Center for Philology and Digitality (ZPD) at the University of Würzburg, demonstrates how a research center at the interface between the humanities and cultural studies, digital humanities, and computer science can bring about a surge of change by providing in-depth insights into each other's subjects and ways of thinking. It not only brings with it a new view of the epistemological interests of philology, its questions, its canon, and its key concepts, but also makes computer science aware of the 'recalcitrance' of humanities subjects and thus confronts it with new tasks. The ZPD is the result of a systematic reflection on the digital transformation of philology, with its traditional focus on editing and analyzing, in order to advance this development both in terms of content and methodology. For example, the formation of linguistic conventions in speaking and writing about music in 19th-century composers' texts and in music journals would be an ideal subject for the application of digital methods of analysis and for the development of new research questions based on them. Research networks that jointly develop and rethink methods at the level of data structures across disciplines are likely to be a proven means of preserving our own discipline in the future, even if this may occasionally be a relationship borne more by reason than by love.

