A Design for Fault-Tolerant Communication Middleware Based on Time-Triggered

2014 ◽  
Vol 548-549 ◽  
pp. 1326-1329
Author(s):  
Juan Jin ◽  
Qing Fan Gu

To address the persistent problems of health diagnosis, fault location and fault-tolerance mechanisms in current avionics applications, this paper proposes a fault-tolerant communication middleware based on a time-triggered mechanism. The middleware is designed to provide a support platform for real-time applications built on communication middleware. Working at the communication middleware level and combining the time-triggered mechanism with a fault-tolerant strategy, it first diagnoses general faults and then routes them to the appropriate fault-handling mechanism for processing. The middleware thus completely separates fault-tolerant processing from the application software functions.

2014 ◽  
Vol 933 ◽  
pp. 584-589
Author(s):  
Zhi Chun Zhang ◽  
Song Wei Li ◽  
Wei Ren Wang ◽  
Wei Zhang ◽  
Li Jun Qi

This paper presents a system in which cluster devices are controlled by single-chip microcomputers (SCMs), with emphasis on SCM cluster management techniques. Each device in a cluster is controlled by an SCM, which collects sample data and sends it to the cluster management computer, and drives the device using driving data received from the same cluster management computer, through COMs. The cluster management system running on the cluster management computer carries out initial SCM identification, run-time slice management, communication resource utilization, fault tolerance and error correction for the SCMs. Initial SCM identification is achieved by signal responses between the SCMs and the cluster management computer. By using port priority and the parallelization of serial communications, the system's real-time performance is maximized. The real-time performance can be adjusted and improved by increasing or decreasing the number of COMs and the ports linked to each COM, and it can also be raised by configuring more cluster management computers. Fault-tolerant control occurs in the initialization phase and the operational phase. In the initialization phase, the cluster management system incorporates unidentified SCMs into the system based on the history information recorded on external storage media. In the operational phase, if the read and write errors on an SCM reach a predetermined threshold, the SCM is regarded as seriously faulty or absent. The cluster management system maintains an accuracy-maintenance database on an external storage medium to handle the nonlinear control of specific devices and to maintain accuracy as devices wear.
The cluster management system uses an object-oriented method to design a unified driving framework so that its implementation is simplified, standardized and easy to port. The system has been applied in a large-scale simulation system of 230 single-chip microcomputers, which demonstrates that it is reliable, real-time and easy to maintain.
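
The operational-phase rule in the abstract above (an SCM is treated as faulty once its read/write errors reach a threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `ScmHealth`, the threshold value of 3, and the choice to count cumulative rather than consecutive errors are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScmHealth:
    """Tracks read/write errors for one single-chip microcomputer (SCM).

    Hypothetical helper: the abstract gives neither a threshold value nor
    says whether errors are counted cumulatively or consecutively.
    """
    error_threshold: int
    error_count: int = 0
    faulty: bool = False

    def record(self, ok: bool) -> None:
        """Record the outcome of one read or write operation."""
        if self.faulty:
            return  # already regarded as seriously faulty or absent
        if not ok:
            self.error_count += 1  # assumption: cumulative error count
            if self.error_count >= self.error_threshold:
                self.faulty = True


scm = ScmHealth(error_threshold=3)
for outcome in (True, False, False, True, False):
    scm.record(outcome)
print(scm.faulty)
```

In this sketch the third failed operation pushes the count to the threshold, after which further outcomes are ignored for that SCM.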


2012 ◽  
Vol 433-440 ◽  
pp. 4095-4100
Author(s):  
Chan Juan Li ◽  
Chuan De Zhang ◽  
Qing Guo Zhou

A number of recent works are concerned with both virtualization and fault-tolerance technology, because a virtualization system provides an environment in which multiple operating systems can run concurrently. In this paper, based on the real-time hypervisor XtratuM, we propose the architecture of a fault-tolerant real-time control system (XFTRTS), which provides local backup execution and supports different levels of diversity, including N-version programming, on a single host. Furthermore, we implement a prototype of XFTRTS and test an important performance metric, latency, which is within two microseconds.


2010 ◽  
Vol 19 (05) ◽  
pp. 1041-1068 ◽  
Author(s):  
REFIK SAMET

This paper proposes a methodology for supporting the design of fault-tolerant computers for real-time applications. To this end, the paper first presents the steps of fault tolerance and describes mechanisms that can be used to realize them. Then, design options built from the described mechanisms are proposed and summarized in a table. From that, the paper proposes a flowchart for choosing between the many design options available for building a redundant computer system. An optimal design option is chosen according to the number of redundant computers, the mode of operation of the redundant computers, the computer failure mode and the severity of the real-time constraint. Finally, graphical models for sequencing the mechanisms of the design options are proposed. The main merits of the proposed methodology are a spectrum of design options of fault-tolerant mechanisms for real-time computers tolerating a single fault at a time, and a guide for choosing between them.


2011 ◽  
Vol 383-390 ◽  
pp. 4377-4384
Author(s):  
Zhou Ma ◽  
Xiao Ning Li ◽  
Xiao Ming Zhang

A new practical fault location algorithm using two-terminal electrical quantities is presented in this article, which takes into account the distributed parameter line model. The analytical expression of the algorithm is derived from three-phase decoupling. First, an analytical synchronization of the unsynchronized measurements is performed using the determined synchronization operator, and the non-synchronizing angle is calculated from the two-terminal pre-fault electrical quantities. Then, the real-time transmission line parameters are calculated using the two-terminal non-synchronized electrical quantities and the non-synchronizing angle. The algorithm overcomes the drawbacks of traditional fault location algorithms and does not suffer from the pseudo-root problem. It is also simple, practical and robust, requires little computation, and needs no search or iteration. The algorithm is not influenced by fault type, transition resistance or other factors. Finally, the developed fault location algorithm is tested using signals from versatile ATP-EMTP simulations of faults on a transmission line.


Energy-aware real-time scheduling has been gaining attention in recent years owing to environmental concerns and applications in numerous fields. System reliability is also adversely affected by increasing energy dissipation, which poses serious challenges for researchers. With this in view, researchers have recently turned to addressing fault tolerance and energy efficiency together. In the literature, DVFS and DPM, the most commonly used techniques for power management in task scheduling, are often combined with the Primary/Backup technique to achieve fault tolerance against transient and permanent faults. The optimal algorithms Earliest Deadline First (EDF) and Rate-Monotonic (RM), meant for scheduling dynamic-priority and fixed-priority tasks respectively, have mainly been analyzed using a dual-processor approach for fault tolerance and energy efficiency. In this paper, to handle higher workloads of fixed-priority real-time tasks, energy-aware fault-tolerant scheduling algorithms are proposed for multiprocessor systems with balanced and unbalanced numbers of main and auxiliary processors. Simulations over extensive task sets indicate that the balanced approach is more energy-efficient than the unbalanced one.
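
As background for the EDF and RM algorithms named in the abstract above, the following sketch shows the classical uniprocessor utilization tests for periodic tasks whose deadlines equal their periods. It is not the multiprocessor algorithm proposed in the paper. The function names and the example task sets are assumptions chosen for illustration; the bounds themselves are the standard Liu and Layland results (total utilization at most n(2^(1/n) - 1) is sufficient for RM, and at most 1 is necessary and sufficient for EDF).

```python
from typing import Sequence, Tuple

# A task is (worst-case execution time, period); deadline = period is assumed.
Task = Tuple[float, float]


def utilization(tasks: Sequence[Task]) -> float:
    """Total processor utilization of a periodic task set."""
    return sum(c / t for c, t in tasks)


def rm_sufficient(tasks: Sequence[Task]) -> bool:
    """Liu-Layland bound: sufficient (not necessary) for RM on one processor."""
    n = len(tasks)
    if n == 0:
        return True
    return utilization(tasks) <= n * (2 ** (1 / n) - 1)


def edf_schedulable(tasks: Sequence[Task]) -> bool:
    """Utilization test for EDF on one processor (implicit deadlines)."""
    return utilization(tasks) <= 1.0


light = [(1, 4), (1, 5), (2, 10)]   # hypothetical task set, U = 0.65
heavy = [(2, 4), (2, 5)]            # hypothetical task set, U = 0.90

print(rm_sufficient(light), rm_sufficient(heavy), edf_schedulable(heavy))
```

A task set that fails the RM bound, such as `heavy` here, is not necessarily unschedulable under RM; the bound is only a sufficient condition, and an exact answer needs response-time analysis.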


Author(s):  
Camille Coti

This chapter gives an overview of techniques used to tolerate failures in high-performance distributed applications. We describe basic replication techniques, automatic rollback recovery and application-based fault tolerance. We present the challenges raised specifically by distributed, high performance computing and the performance overhead the fault tolerance mechanisms are likely to cost. Last, we give an example of a fault-tolerant algorithm that exploits specific properties of a recent algorithm.


2012 ◽  
Vol 21 (01) ◽  
pp. 1250004 ◽  
Author(s):  
LINJIE ZHU ◽  
TONGQUAN WEI ◽  
XIAODAO CHEN ◽  
YONGHE GUO ◽  
SHIYAN HU

Fault tolerance and energy have become important design issues in multiprocessor system-on-chips (SoCs) with technology scaling and the proliferation of battery-powered multiprocessor SoCs. This paper proposes an energy-efficient fault-tolerant task allocation scheme for multiprocessor SoCs in real-time energy harvesting systems. The proposed fault-tolerance scheme is based on the principle of primary/backup task scheduling and can tolerate at most one transient fault. Extensive simulation experiments show that the proposed scheme can save up to 30% of energy consumption and reduce the miss ratio to about 8% in the presence of faults.
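
The primary/backup principle mentioned in the abstract above can be sketched in its simplest form: the backup copy runs only if the primary copy fails. This is an illustrative assumption-laden sketch, not the paper's allocation scheme; it ignores deadlines, energy and processor assignment, and the names `TransientFault` and `run_with_backup` are hypothetical.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientFault(Exception):
    """Hypothetical signal that a task copy was hit by a transient fault."""


def run_with_backup(primary: Callable[[], T], backup: Callable[[], T]) -> T:
    """Run the primary copy; fall back to the backup copy on a fault.

    Tolerates at most one fault: a fault in the backup propagates.
    """
    try:
        return primary()
    except TransientFault:
        return backup()


def faulty_primary() -> int:
    raise TransientFault("bit flip during primary execution")


def healthy_backup() -> int:
    return 42


print(run_with_backup(faulty_primary, healthy_backup))
```

If the primary completes without a fault, the backup is never executed, which is where primary/backup schemes typically recover energy.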


10.29007/brkj ◽  
2019 ◽  
Author(s):  
Jia Xu

In a real-time embedded system that uses a primary and an alternate for each real-time task to achieve fault tolerance, both primaries and alternates need to be allowed critical sections/segments in which shared data structures can be read and updated. At the same time, it must be guaranteed that the execution of any part of one critical section is not interleaved with, and does not overlap, the execution of any part of a critical section belonging to some other primary or alternate that reads and writes those shared data structures. This paper presents a software architecture that effectively handles critical section constraints where both primaries and alternates may have critical sections that can either overrun or underrun. The architecture still guarantees that all primaries or alternates that do not overrun always meet their deadlines, while keeping the shared data in a consistent state on a multiprocessor in a fault-tolerant real-time embedded system.


2021 ◽  
Author(s):  
Lukas Hübner ◽  
Alexey M. Kozlov ◽  
Demian Hespe ◽  
Peter Sanders ◽  
Alexandros Stamatakis

Phylogenetic trees are now routinely inferred on large-scale HPC systems with thousands of cores, as the parallel scalability of phylogenetic inference tools has improved over the past years to cope with the molecular data avalanche. Thus, the parallel fault tolerance of phylogenetic inference tools has become a relevant challenge. To this end, we explore parallel fault tolerance mechanisms and algorithms, the software modifications required, and the performance penalties induced by enabling parallel fault tolerance, using the example of RAxML-NG, the successor of the widely used RAxML tool for maximum-likelihood-based phylogenetic tree inference. We find that the slowdown induced by the necessary additional recovery mechanisms in RAxML-NG is 2% on average. The overall slowdown from using these recovery mechanisms in conjunction with a fault-tolerant MPI implementation amounts to 8% on average for large empirical datasets. Via failure simulations, we show that RAxML-NG can successfully recover from multiple simultaneous failures, subsequent failures, failures during recovery, and failures during checkpointing. Recoveries are automatic and transparent to the user. The modified fault-tolerant RAxML-NG code is available under GNU GPL at https://github.com/lukashuebner/ft-raxml-ng
Contact: lukas.huebner@{kit.edu,h-its.org}
Supplementary information: Supplementary data are available at bioRχiv.

