Fault-Tolerant Software

Author(s):  
Vincenzo De Florio

Having described the main characteristics of dependability and fault-tolerance, we now analyze in more detail what it means for a program to be fault-tolerant and which properties are expected from a fault-tolerant program. The main objective of this chapter is to introduce two sets of design assumptions that shape the way our fault-tolerant software is structured: the system model and the fault model. Often misunderstood or underestimated, these models describe what is expected from the execution environment in order for our software system to function correctly, and which faults our system is going to consider. Note that a fault-tolerant program shall (try to) tolerate only those faults stated in the fault model, and will be as defenseless against all other faults as any non-fault-tolerant program. Together with the system specification, the fault and system models represent the foundation on top of which our computer services are built. It is not surprising that weak foundations often result in failing constructions. What is really surprising is that in so many cases little or no attention has been given to these important factors in fault-tolerant software engineering. To illustrate this, three well-known cases are described: the Ariane 5 flight 501 and Mariner-1 disasters, and the Therac-25 accidents. In each case it is stressed what went wrong, what the biggest mistakes were, and how a careful understanding of fault models and system models would have helped to highlight the path to avoiding catastrophic failures that cost considerable amounts of money and even the lives of innocent people.
The other important objective of this chapter is to introduce the core subject of this book: software fault-tolerance situated at the level of the application layer. First, it is explained why targeting (also) the application layer is not an open option but a mandatory design choice for effective fault-tolerant software engineering. Second, given the peculiarities of the application layer, three properties are introduced to measure the quality of the methods for achieving fault-tolerant application software: (1) separation of design concerns, that is, how good the method is at keeping the functional aspects and the fault-tolerance aspects separated from each other; (2) syntactical adequacy, namely how versatile the employed method is in including the wider spectrum of fault-tolerance strategies; and (3) adaptability, that is, how good the employed fault-tolerance method is at dealing with the inevitable changes characterizing the system and its run-time environment, including the dynamics of faults that manifest themselves at service time. Finally, this chapter also defines a few fundamental fault-tolerance services, namely watchdog timers, exception handling, transactions, and checkpointing-and-rollback.
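
As an illustration of the last of those services, checkpointing-and-rollback can be sketched in a few lines of Python. This is a minimal sketch, not code from the chapter: the names `Checkpointed` and `run_step` are hypothetical, and the state is assumed to be a small in-memory object that can be deep-copied.

```python
import copy


class Checkpointed:
    """Holds a state object and can roll it back to the last checkpoint."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        # Save a deep copy so later mutations cannot corrupt the checkpoint.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Restore the state as it was at the last checkpoint.
        self.state = copy.deepcopy(self._saved)


def run_step(store, step):
    """Run one step; if it raises, roll back and report failure."""
    store.checkpoint()
    try:
        step(store.state)
        return True
    except Exception:
        store.rollback()
        return False
```

A step that fails halfway leaves no partial update behind, which is the property the service is meant to provide. A real implementation would typically write checkpoints to stable storage, which this sketch does not attempt.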

Author(s):  
Vincenzo De Florio

In this chapter our survey of methods and structures for application-level fault-tolerance continues, getting closer to the programming language. Tools such as compilers and translators work at the level of the language: they parse, interpret, compile, or transform our programs, so they are interesting candidates for managing dependability aspects in the application layer. An important property of this family of methods is that fault-tolerance complexity is extracted from the program and turned into architectural complexity in the compiler or the translator. Apart from continuing our survey, this chapter also aims to provide the reader with two practical examples. The first is reflective and refractive variables, a syntactical structure to express adaptive feedback loops in the application layer. This is useful for resilient computing because a feedback loop can attach error recovery strategies to error detection events. The second is redundant variables, a tool that allows designers to use adaptively redundant data structures with commodity programming languages such as C or Java. Designers using such tools can define redundant data structures in which the degree of redundancy is not fixed once and for all at design time, but changes dynamically with the disturbances experienced at run time. Both tools are the subject of new research currently being carried out by the author of this book at the PATS research group of the University of Antwerp. It is shown how a simple translation approach makes it possible to provide sophisticated features such as adaptive fault-tolerance to programs written in any language, even plain old C.
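
The idea of a redundant variable whose degree of redundancy follows the observed disturbances can be sketched as follows. This is an illustration only, not the author's tool (which works by translating C programs): the class `RedundantVar`, its parameters, and the policy of adding two replicas after each detected disagreement are assumptions made here, and values are assumed to be hashable.

```python
from collections import Counter


class RedundantVar:
    """Stores several replicas of a value and reads by majority vote."""

    def __init__(self, value, replicas=3, max_replicas=9):
        self.replicas = [value] * replicas
        self.max_replicas = max_replicas

    def write(self, value):
        # Every replica receives the new value.
        self.replicas = [value] * len(self.replicas)

    def read(self):
        winner, votes = Counter(self.replicas).most_common(1)[0]
        n = len(self.replicas)
        if votes <= n // 2:
            raise ValueError("no majority among replicas")
        if votes < n and n + 2 <= self.max_replicas:
            # Disagreement observed: raise the degree of redundancy.
            n += 2
        # Repair any corrupted replicas with the voted value.
        self.replicas = [winner] * n
        return winner
```

The sketch only ever increases redundancy; a fuller version would presumably also lower it again after a quiet period.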


Author(s):  
Vincenzo De Florio

This chapter resumes our survey of application-level fault-tolerance protocols by considering approaches based on aspect-oriented programming. Aspect-compliant programming languages allow source code to be regarded as a pliable web that the designer can weave so as to specialize or optimize it towards a certain goal without having to recode it. This useful property keeps concerns separated, bounds complexity, and enhances maintainability. Aspect programs may be used for different objectives, including non-functional properties such as dependability. To date, it is not known whether aspect orientation will actually provide satisfactory solutions for fault-tolerance in the application layer. Some researchers believe this is not the case (Kienzle & Guerraoui, 2002), at least for some fault-tolerance paradigms. Some preliminary studies have been carried out (for instance in (Lippert & Videira Lopes, 2000)), but no definitive word has been said on the matter. It is our belief that, at least for some paradigms, aspects may prove to be invaluable tools for engineering application-level fault-tolerance services. For this reason the aspect-oriented approach is described in this chapter.
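
The separation of concerns described above can be hinted at with a small Python sketch. Python has no built-in aspect weaver, so a decorator stands in for an aspect here. The names `retry` and `read_sensor` are hypothetical, and retrying on failure is just one example of a fault-tolerance policy, not one the chapter prescribes.

```python
import functools


def retry(times):
    """Advice-like wrapper: call the function up to `times` times (times >= 1)."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(times):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    last_error = exc
            # All attempts failed: surface the last error to the caller.
            raise last_error
        return wrapper
    return decorate


attempts = {"count": 0}


@retry(times=3)
def read_sensor():
    # Functional code only; it knows nothing about the retry policy.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise IOError("transient fault")
    return 21.5
```

The retry policy can be changed or removed without touching the body of `read_sensor`, which is the kind of separation the aspect-oriented approach aims for.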


2014 ◽  
Vol 1030-1032 ◽  
pp. 1905-1908
Author(s):  
Yue Hua Ding ◽  
Ri Hua Xiang

Business developers need to develop fault-tolerant processes, but developing them with the exception handling mechanism provided by BPEL is time-consuming and error-prone. We analyze the application exception throw chain theory of service-oriented systems. Exceptions thrown by a BPEL process are broadcast among the component layer, the service layer, and the process layer. We also propose an application exception handling method for BPEL that uses EHPDL-P. EHPDL-P separates normal business logic from exception business logic, and thus enhances the fault-tolerance ability of BPEL.


2019 ◽  
Vol 2 (1) ◽  
pp. 43-52
Author(s):  
Alireza Alikhani ◽  
Safa Dehghan M ◽  
Iman Shafieenejad

In this study, satellite formation flying guidance in the presence of underactuation, using inter-vehicle Coulomb forces, is investigated. The Coulomb forces are used to stabilize the formation flying mission. For this purpose, the charge of the satellites is determined so as to create appropriate attraction and repulsion and to maintain the distance between satellites. The equations for a static Coulomb formation of three satellites in triangular form were developed. Furthermore, the charge value of the Coulomb propulsion system required for such a formation was obtained. Considering underactuation of one of the formation satellites, a fault-tolerance approach is proposed for achieving the mission goals. Following this approach, in the first step a fault-tolerant guidance law is designed. The obtained results show a stationary formation. In the next step, to maintain the formation shape and dimension, a fault-tolerant control law is designed.


Fault Tolerant Reliable Protocol (FTRP) is proposed as a novel routing protocol designed for Wireless Sensor Networks (WSNs). FTRP offers fault tolerance and reliability for packet exchange, and support for dynamic network changes. The key concept is logical clustering of nodes. The protocol delegates routing ownership to the cluster heads, where the fault tolerance functionality is implemented. FTRP uses cluster head nodes along with cluster head groups to store packets in transit. In addition, FTRP uses broadcast, which reduces the message overhead compared to classical flooding mechanisms. FTRP manipulates Time to Live values for the various routing messages to control message broadcast. FTRP uses jitter in message transmission to reduce the effect of synchronized node states, which in turn reduces collisions. FTRP performance has been extensively evaluated through simulations against the Ad-hoc On-demand Distance Vector (AODV) and Optimized Link State Routing (OLSR) protocols. Packet Delivery Ratio (PDR), aggregate throughput, and end-to-end delay (E-2-E) were used as performance metrics. In terms of PDR and aggregate throughput, FTRP is found to be an excellent performer in all mobility scenarios, whether the network is sparse or dense. In stationary scenarios, FTRP performed well in sparse networks; in dense networks its performance degraded, yet stayed within an acceptable range. This degradation is attributed to synchronized node states. Reliably delivering a message comes at a cost in terms of E-2-E delay. Results show that FTRP is a good performer in all mobility scenarios where the network is sparse. In sparse stationary scenarios FTRP is also a good performer; however, in dense stationary scenarios FTRP's E-2-E delay is not acceptable. There are times when receiving a network message is more important than other costs such as energy or delay.
That makes FTRP suitable for a wide range of WSN applications, such as military applications that monitor soldiers' biological data and supplies on the battlefield, and battle damage assessment. FTRP can also be used in health applications, in addition to a wide range of geo-fencing, environmental monitoring, resource monitoring, production line monitoring, agriculture, and animal tracking applications. FTRP should be avoided in dense stationary deployments such as, but not limited to, scenarios where fast application response is critical and delays are life-endangering, such as biohazard detection or intensive care units.
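
The use of Time to Live values to bound a broadcast can be sketched generically. The code below is not FTRP's actual message handling: the function `ttl_broadcast`, the adjacency-dictionary network model, and the rule of decrementing the TTL once per hop are assumptions made for illustration.

```python
from collections import deque


def ttl_broadcast(neighbors, source, ttl):
    """Return the set of nodes reached by a broadcast limited to `ttl` hops."""
    reached = {source}
    queue = deque([(source, ttl)])
    while queue:
        node, remaining = queue.popleft()
        if remaining == 0:
            # TTL expired: this node does not rebroadcast the message.
            continue
        for nxt in neighbors[node]:
            if nxt not in reached:
                reached.add(nxt)
                queue.append((nxt, remaining - 1))
    return reached
```

On a chain of nodes a, b, c, d, a broadcast from a with a TTL of 2 reaches a, b, and c but not d, so a smaller TTL confines a routing message to the sender's neighborhood.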


Energies ◽  
2021 ◽  
Vol 14 (8) ◽  
pp. 2210
Author(s):  
Luís Caseiro ◽  
André Mendes

Fault-tolerance is critical in power electronics, especially in Uninterruptible Power Supplies, given their role in protecting critical loads. Hence, it is crucial to develop fault-tolerant techniques to improve the resilience of these systems. This paper proposes a non-redundant fault-tolerant double conversion uninterruptible power supply based on 3-level converters. The proposed solution can correct open-circuit faults in all semiconductors (IGBTs and diodes) of all converters of the system (including the DC-DC converter), ensuring full-rated post-fault operation. This technique leverages the versatility of Finite-Control-Set Model Predictive Control to implement highly specific fault correction. This type of control enables a conditional exclusion of the switching states affected by each fault, allowing the converter to avoid these states when the fault compromises their output but still use them in all other conditions. Three main types of corrective actions are used: predictive controller adaptations, hardware reconfiguration, and DC bus voltage adjustment. However, highly differentiated corrective actions are taken depending on the fault type and location, maximizing post-fault performance in each case. Faults can be corrected simultaneously in all converters, as well as some combinations of multiple faults in the same converter. Experimental results are presented demonstrating the performance of the proposed solution.
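
The conditional exclusion of switching states can be sketched as one step of a finite-control-set predictive controller. Everything specific below is an assumption for illustration and is not taken from the paper: the function `choose_state`, the first-order resistive-inductive load model, the parameter values, and the example fault rule.

```python
def choose_state(i_now, i_ref, states, is_compromised, R=1.0, L=0.01, Ts=1e-4):
    """Pick the switching state whose predicted current is closest to i_ref.

    `states` maps a state name to its output voltage. `is_compromised` is a
    hypothetical predicate that says whether a fault makes a given state
    unusable under the present operating condition.
    """
    best_state, best_cost = None, float("inf")
    for name, v_out in states.items():
        if is_compromised(name, i_now):
            # Conditional exclusion: skip the state only when the fault
            # actually affects it under the present current.
            continue
        # One-step forward-Euler prediction for the assumed RL load.
        i_next = i_now + (Ts / L) * (v_out - R * i_now)
        cost = abs(i_ref - i_next)
        if cost < best_cost:
            best_state, best_cost = name, cost
    return best_state
```

With a fault rule such as "state P is unusable while the current is non-negative", the controller falls back to the next-best state in that condition and still selects P when the current is negative.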


2021 ◽  
Vol 9 (6) ◽  
pp. 574
Author(s):  
Zhuo Liu ◽  
Tianhao Tang ◽  
Azeddine Houari ◽  
Mohamed Machmoum ◽  
Mohamed Fouad Benkhoris

This paper first adopts a fault accommodation structure, a five-phase permanent magnet synchronous generator (PMSG) with trapezoidal back-electromagnetic forces, in order to enhance the fault tolerance of tidal current energy conversion systems. A fault-tolerant control (FTC) method is then proposed that uses multiple second-order generalized integrators (multiple SOGIs) to further improve the fault tolerance of the system. Next, additional harmonic disturbances from phase current or back-electromagnetic forces in the original and Park's frames are characterized under a single-phase open condition. Relying on a classical field-oriented vector control scheme, fault-tolerant composite controllers are then reconfigured using multiple SOGIs by compensating q-axis control commands. Finally, a real power-scale simulation setup with a gearless back-to-back tidal current energy conversion chain and a small power-scale laboratory prototype on the machine side are established to comprehensively validate the feasibility and fault tolerance of the proposed method. Simulation results show that the proposed method is able to suppress the main harmonic disturbances and maintain satisfactory fault tolerance when the third harmonic flux varies. Experimental results reveal that the proposed model-free fault-tolerant design is simple to implement and contributes to better fault-tolerant behavior, higher power quality, and lower copper losses. The main advantage of the multiple SOGIs lies in convenient online implementation and efficient multi-harmonic extraction, without the need to consider the system's model parameters. The proposed FTC design provides a model-free fault-tolerant solution for the energy harvesting process of actual tidal current energy conversion systems under different working conditions.


2014 ◽  
Vol 548-549 ◽  
pp. 1326-1329
Author(s):  
Juan Jin ◽  
Qing Fan Gu

To address the unsustainable health diagnosis, fault location, and fault tolerance mechanisms found in current avionics applications, this paper proposes a fault-tolerant communication middleware based on a time-triggered mechanism. The middleware is designed to provide a support platform for real-time applications. Working at the communication middleware level, and combining the time-triggered mechanism with a fault-tolerant strategy, it first diagnoses general faults and then routes them to the appropriate fault-handling mechanism for processing. The middleware thus completely separates fault-tolerant processing from the functions of the application software.


2018 ◽  
Vol 8 (3) ◽  
pp. 20-31
Author(s):  
Sam Goundar ◽  
Akashdeep Bhardwaj

With mission-critical web applications and resources being hosted in cloud environments, and cloud services growing fast, the need for a greater level of service assurance regarding fault tolerance for availability and reliability has increased. The high priority now is ensuring a fault-tolerant environment that can keep systems up and running. To minimize the impact of downtime or accessibility failures caused by systems, network devices, or hardware, such failures are expected to be anticipated and handled proactively in a fast, intelligent way. This article discusses a fault tolerance system for cloud computing environments and analyzes whether it is effective for such environments.

