Application-Layer Fault-Tolerance Protocols

Published By IGI Global

9781605661827, 9781605661834

Author(s):  
Vincenzo De Florio

As mentioned in Chapter I, a service’s dependability must be justified quantitatively and proved through extensive on-field testing and fault injection, verification and validation techniques, simulation, source-code instrumentation, monitoring, and debugging. An exhaustive treatment of all these techniques falls outside the scope of this book. Nevertheless, the author feels it is important to include in this text an analysis of the effect on dependability of some of the methods introduced in previous chapters.


Author(s):  
Vincenzo De Florio

This chapter describes some hybrid approaches for application-level software fault-tolerance. All the approaches reported in this chapter exploit the recovery language approach (ReL) introduced in Chapter VI and couple it with other tools and paradigms described in other parts of this book. The objective is to demonstrate how ReL can serve as a tool to further enhance some of the application-level fault-tolerance paradigms introduced in previous chapters. But why hybrid approaches in the first place? The main reason is that by joining two or more concepts and their “system structures” (Randell, 1975), that is, the conceptual and syntactical axioms used in disparate application-level software fault-tolerance provisions, one comes up with a tool with better Syntactical Adequacy (the SA attribute introduced in Chapter II). As already mentioned, a wider syntactical structure can facilitate the expression of our codes, whereas awkward structures often lead to clumsy, buggy applications. Hybrid approaches are often more versatile and can also inspire brand new designs. A drawback of hybrid approaches is that they are modifications of existing designs: the extra design complexity must be added carefully to prevent the introduction of design faults in the architecture.


Author(s):  
Vincenzo De Florio

The programming language itself is the focus of this chapter: fault-tolerance is not embedded in the program (as is the case, e.g., for single-version fault-tolerance), nor around the language (through compilers or translators); on the contrary, fault-tolerance is provided through the syntactical structures and the run-time executives of fault-tolerance programming languages. Here too, a significant part of the complexity of dependability enforcement is moved from each single code to the architecture, in this case the programming language. Many fault-tolerance programming languages exist; this chapter presents a few of them, considering three cases: object-oriented languages, functional languages, and hybrid languages. In particular, the case of Oz is discussed, a multi-paradigm programming language that achieves both transparent distribution and translucent failure handling.


Author(s):  
Vincenzo De Florio

After having described the main characteristics of dependability and fault-tolerance, this chapter analyzes in more detail what it means for a program to be fault-tolerant and what properties are expected from a fault-tolerant program. The main objective of this chapter is to introduce two sets of design assumptions that shape the way our fault-tolerant software is structured: the system model and the fault model. Often misunderstood or underestimated, those models describe
• what is expected from the execution environment in order to let our software system function correctly, and
• what are the faults that our system is going to consider.
Note that a fault-tolerant program shall (try to) tolerate only those faults stated in the fault model, and will be as defenseless against all other faults as any non-fault-tolerant program. Together with the system specification, the fault and system models represent the foundation on top of which our computer services are built. It is not surprising that weak foundations often result in failing constructions. What is really surprising is that in so many cases little or no attention has been given to those important factors in fault-tolerant software engineering. To give an idea of this, three well-known cases are described: the Ariane 5 flight 501 and Mariner-1 disasters, and the Therac-25 accidents. In each case it is stressed what went wrong, what the biggest mistakes were, and how a careful understanding of fault models and system models would have helped highlight the path to avoid catastrophic failures that cost considerable amounts of money and even the lives of innocent people. The other important objective of this chapter is to introduce the core subject of this book: software fault-tolerance situated at the level of the application layer. First of all, it is explained why targeting (also) the application layer is not an open option but a mandatory design choice for effective fault-tolerant software engineering.
Secondly, given the peculiarities of the application layer, three properties to measure the quality of the methods to achieve fault-tolerant application software are introduced:
1. Separation of design concerns, that is, how good the method is in keeping the functional aspects and the fault-tolerance aspects separated from each other.
2. Syntactical adequacy, namely how versatile the employed method is in including the wider spectrum of fault-tolerance strategies.
3. Adaptability, that is, how good the employed fault-tolerance method is in dealing with the inevitable changes characterizing the system and its run-time environment, including the dynamics of faults that manifest themselves at service time.
Finally, this chapter also defines a few fundamental fault-tolerance services, namely watchdog timers, exception handling, transactions, and checkpointing-and-rollback.
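
Two of the services named above, watchdog timers and checkpointing-and-rollback, can be illustrated with a minimal Python sketch. This is not code from the book: the class names `WatchdogTimer` and `Checkpointed`, the injected `clock` callable, and the use of an in-memory deep copy as the checkpoint are all assumptions made for illustration. Exception handling and transactions are left out.

```python
import copy


class WatchdogTimer:
    """Flags the watched task as failed if it is not kicked within `timeout` seconds.

    Hypothetical illustration; `clock` is any callable returning the current
    time in seconds, injected so the timer can be tested without sleeping.
    """

    def __init__(self, timeout, clock):
        self.timeout = timeout
        self.clock = clock
        self.last_kick = clock()

    def kick(self):
        """Called periodically by the watched task to signal that it is alive."""
        self.last_kick = self.clock()

    def expired(self):
        """True when the task has been silent for longer than the timeout."""
        return self.clock() - self.last_kick > self.timeout


class Checkpointed:
    """Holds a state that can be saved and later restored (in memory only)."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        """Save a private copy of the current state."""
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the current state and restore the last checkpoint."""
        self.state = copy.deepcopy(self._saved)
```

A real watchdog would typically run in a separate task or in hardware, and a real checkpoint would go to stable storage; both choices depend on the fault model assumed.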


Author(s):  
Vincenzo De Florio

Failure detection is a fundamental building block for developing fault-tolerant distributed systems. Accurate failure detection in asynchronous systems (Chapter II) is notoriously difficult, as it is impossible to tell whether a process has actually failed or is just slow. Because of this, several impossibility results have been derived; see for instance the well-known paper by Fischer, Lynch, and Paterson (1985). As a consequence of these pessimistic results, many researchers have devoted their time and abilities to understanding how to reformulate the concept of system model in a fine-grained alternative way. Their goal was to be able to tackle problems such as distributed consensus with minimal requirements on the system environment. This led to the theory of unreliable failure detectors for reliable systems, pioneered by Chandra and Toueg (1996). This chapter introduces these concepts and the formulation of failure detection protocols in the application layer. In particular, a linguistic framework is proposed for the expression of those protocols. As a case study, the chapter describes the algorithm for failure detection used in the EFTOS DIR net and in the TIRAN Backbone, that is, the fault-tolerance managers introduced respectively in Chapter III and Chapter VI.
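
The idea of an unreliable failure detector can be sketched as a heartbeat monitor that may wrongly suspect a slow process and later revise its opinion. The Python sketch below is a generic timeout-based illustration and is not the EFTOS or TIRAN algorithm; the class name `HeartbeatFailureDetector` and the parameters `initial_timeout` and `increment` are assumptions. Enlarging the timeout after a false suspicion is a common way to make mistakes rarer as the detector learns how slow a process can be.

```python
class HeartbeatFailureDetector:
    """Timeout-based failure detector that can make and correct mistakes.

    Hypothetical illustration: times are plain numbers supplied by the caller.
    """

    def __init__(self, processes, initial_timeout=1.0, increment=0.5):
        self.timeout = {p: initial_timeout for p in processes}
        self.last_seen = {p: 0.0 for p in processes}
        self.suspected = set()
        self.increment = increment

    def heartbeat(self, process, now):
        """Record an 'I am alive' message received from `process` at time `now`."""
        self.last_seen[process] = now
        if process in self.suspected:
            # The process was slow, not crashed: retract the suspicion and
            # be more patient with this process from now on.
            self.suspected.discard(process)
            self.timeout[process] += self.increment

    def check(self, now):
        """Suspect every process silent for longer than its timeout."""
        for process, seen in self.last_seen.items():
            if now - seen > self.timeout[process]:
                self.suspected.add(process)
        return set(self.suspected)
```

The detector never claims certainty: a suspicion only means that no heartbeat arrived in time, which is exactly the ambiguity between a crashed and a slow process described above.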


Author(s):  
Vincenzo De Florio

After having discussed the general approach of fault-tolerance languages and their main features, the focus is now set on one particular case: the ARIEL recovery language. Also described is an approach towards resilient computing that is based on ARIEL and is therefore dubbed the “recovery language approach” (ReL). In this chapter, the main elements of ReL are first introduced in general terms, coupling each concept to the technical foundations behind it. After this, a quite extensive description of ARIEL and of a compliant architecture is provided. Target applications for such an architecture are distributed codes, characterized by non-strict real-time requirements, written in a procedural language such as C, to be executed on distributed or parallel computers consisting of a predefined (fixed) set of processing nodes. The reason for giving special emphasis to ARIEL and its approach lies not in their special qualities but in the first-hand experience of the author, who conceived, designed, and implemented ARIEL in the course of his studies. This made it possible to provide the reader with what may be considered a sort of practical exercise in system and fault modeling and in application-level fault-tolerance design, recalling and applying several of the concepts introduced.


Author(s):  
Vincenzo De Florio

In this chapter our survey of methods and structures for application-level fault-tolerance continues, getting closer to the programming language: indeed, tools such as compilers and translators work at the level of the language (they parse, interpret, compile, or transform our programs), so they are interesting candidates for managing dependability aspects in the application layer. An important property of this family of methods is the fact that fault-tolerance complexity is extracted from the program and turned into architectural complexity in the compiler or the translator. Apart from continuing with our survey, this chapter also aims at providing the reader with two practical examples:
• Reflective and refractive variables, that is, a syntactical structure to express adaptive feedback loops in the application layer. This is useful to resilient computing because a feedback loop can attach error recovery strategies to error detection events.
• Redundant variables, that is, a tool that allows designers to make use of adaptively redundant data structures with commodity programming languages such as C or Java. Designers using such tools can define redundant data structures in which the degree of redundancy is not fixed once and for all at design time, but rather changes dynamically with respect to the disturbances experienced during the run time.
Both tools are new research activities that are currently being carried out by the author of this book at the PATS research group of the University of Antwerp. It is shown how through a simple translation approach it is possible to provide sophisticated features such as adaptive fault-tolerance to programs written in any language, even plain old C.
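
The notion of a redundant variable whose degree of redundancy follows the observed disturbances can be sketched in a few lines of Python. This is only an illustration of the idea and not the author's tool: the class name `RedundantVariable`, the majority-voting read, the policy of adding two replicas after each detected disturbance, and the `max_replicas` bound are all assumptions. Values must be hashable for the vote to work, and shrinking the redundancy after quiet periods is omitted.

```python
from collections import Counter


class RedundantVariable:
    """Stores several copies of a value and reads it back by majority vote.

    Hypothetical illustration of adaptive redundancy.
    """

    def __init__(self, value, replicas=3, max_replicas=9):
        self.copies = [value] * replicas
        self.max_replicas = max_replicas

    def write(self, value):
        """Overwrite every replica with the new value."""
        self.copies = [value] * len(self.copies)

    def read(self):
        """Return the majority value, repairing replicas and adapting redundancy."""
        value, votes = Counter(self.copies).most_common(1)[0]
        if votes <= len(self.copies) // 2:
            raise RuntimeError("no majority among replicas")
        if votes < len(self.copies):
            # A disturbance was observed: repair the corrupted copies and
            # raise the degree of redundancy (by two, so it stays odd).
            degree = min(len(self.copies) + 2, self.max_replicas)
            self.copies = [value] * degree
        return value
```

For example, corrupting one of three replicas still yields the correct value on the next read, after which the variable holds five replicas instead of three.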


Author(s):  
Vincenzo De Florio

This chapter resumes our survey of application-level fault-tolerance protocols, considering approaches based on aspect-oriented programming. Aspect-compliant programming languages allow source code to be regarded as a pliable web that the designer can weave so as to specialize or optimize it towards a certain goal without having to recode it. This useful property keeps concerns separated, bounds complexity, and enhances maintainability. Aspect programs may be used for different objectives, including non-functional properties such as dependability. To date, it is not known whether aspect-orientation will actually provide satisfactory solutions for fault-tolerance in the application layer. Some researchers believe this is not the case (Kienzle & Guerraoui, 2002), at least for some fault-tolerance paradigms. Some preliminary studies have been carried out (for instance, by Lippert & Videira Lopes, 2000), but no definitive word has been said on the matter. It is our belief that, at least for some paradigms, aspects may reveal themselves as invaluable tools to engineer application-level fault-tolerance services. For this reason their approach is described in this chapter.
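
The separation of concerns that aspects aim for can be approximated, very loosely, with a Python decorator that attaches a retry policy to a function without touching its body. A decorator is not a full aspect weaver (there are no pointcuts, and each function must be decorated explicitly), so the sketch below only conveys the flavor of the idea; the name `retry_aspect` and its parameters are assumptions.

```python
import functools


def retry_aspect(max_attempts=3, exceptions=(Exception,)):
    """Wrap a function so that failed calls are retried up to `max_attempts` times.

    Hypothetical illustration: the fault-tolerance concern lives here,
    separate from the functional code it is applied to.
    """

    def weave(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # retries exhausted: let the caller see the error
        return wrapper

    return weave
```

Applying `@retry_aspect(max_attempts=3)` to a function that fails transiently leaves the function's own code free of any recovery logic, which is the property the chapter attributes to aspect-oriented designs.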


Author(s):  
Vincenzo De Florio

This chapter discusses two large classes of fault-tolerance protocols:
• Single-version protocols, that is, methods that use a non-distributed, single-task provision, running side-by-side with the functional software, often available in the form of a library and a run-time executive.
• Multiple-version protocols, that is, methods that actively use a form of redundancy, as explained in what follows. In particular, recovery blocks and N-version programming are discussed.
The two families have been grouped together in this chapter because of the several similarities they share.
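
The two multiple-version schemes named above can be sketched as follows. In a recovery block, alternates are tried one after the other until one passes an acceptance test; in N-version programming, all versions run and a vote decides the result. The Python sketch is a simplified illustration under stated assumptions: the function names `recovery_block` and `n_version` are hypothetical, versions are pure functions (so the state restoration that a full recovery block performs before each alternate is not needed), versions run sequentially rather than in parallel, and results must be hashable.

```python
from collections import Counter


def recovery_block(acceptance_test, alternates, *args):
    """Try each alternate in order; return the first result that is accepted."""
    for alternate in alternates:
        try:
            result = alternate(*args)
        except Exception:
            continue  # a crashing alternate counts as a failed one
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def n_version(versions, *args):
    """Run every version and return the value produced by a strict majority."""
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            pass  # a crashing version simply casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) // 2:
            return value
    raise RuntimeError("no majority among versions")
```

The design difference is visible in the code: the recovery block depends on the quality of its acceptance test, whereas N-version programming depends on the versions failing independently so that a majority stays correct.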


Author(s):  
Vincenzo De Florio

The general objective of this chapter is to introduce the basic concepts and terminology of the domain of dependability. Concepts such as reliability, safety, or security have been used inconsistently by different communities of researchers: the real-time system community, the secure computing community, and so forth each had its own “lingo” and referred to concepts such as faults, errors, and failures without the required formal foundation. This changed in the early 1990s, when Jean-Claude Laprie finally introduced a tentative model for dependable computing. To date, the Laprie model of dependability is the most widespread and accepted formal definition for the terms that play a key role in this book. As a consequence, the rest of this chapter introduces that model.

