On the impact of fault tolerance tactics on architecture patterns



The Impact of Manufacturing Defects on the Fault Tolerance of TMR-Systems

2010 IEEE 25th International Symposium on Defect and Fault Tolerance in VLSI Systems ◽

10.1109/dft.2010.19 ◽

2010 ◽

Cited By ~ 9

Author(s):

Marc Hunger ◽

Sybille Hellebrand

Keyword(s):

Fault Tolerance ◽

Manufacturing Defects ◽

The Impact


The Impact of Lossless Information on Algorithms

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.644-650.1714 ◽

2014 ◽

Vol 644-650 ◽

pp. 1714-1716

Author(s):

Xiao Peng Wang

Keyword(s):

Information Retrieval ◽

Fault Tolerance ◽

Byzantine Fault Tolerance ◽

Retrieval Systems ◽

Byzantine Fault ◽

Recent Advances ◽

Information Retrieval Systems ◽

Np Complete ◽

The Impact ◽

Pervasive Communication

Recent advances in semantic algorithms and pervasive communication are based entirely on the assumption that DHTs and hierarchical databases are not in conflict with write-back caches. Here, we argue the emulation of the partition table that would allow for further study into information retrieval systems. Our focus in this paper is not on whether the famous embedded algorithm for the study of e-business by Harris et al. is NP-complete, but rather on motivating an application for Byzantine fault tolerance (Mesophryon).


Modular Avionics System Architecture (MASA)-the impact of fault tolerance

10.1109/dasc.1990.111306 ◽

2002 ◽

Author(s):

L.D. Brock ◽

A.L. Schor

Keyword(s):

Fault Tolerance ◽

System Architecture ◽

The Impact


Flocking with Fault Tolerant Control and Obstacles Avoidance for Mobile Robots Based on Grid Maps

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.631-632.669 ◽

2014 ◽

Vol 631-632 ◽

pp. 669-675

Author(s):

Yong Xiong ◽

Ji Liang Lin

Keyword(s):

Fault Tolerance ◽

Fault Diagnosis ◽

Mobile Robots ◽

Control Strategy ◽

Control Algorithm ◽

Fault Tolerant ◽

Actuator Failure ◽

Diagnosis Method ◽

Tolerance Control ◽

The Impact

Taking α-lattice flocking as the research object, this paper studies the influence of faults occurring in the flock and a fault tolerance control algorithm for it. The impact on flocking performance is analyzed by means of flocking property indexes when a communication error, actuator failure or sensor malfunction occurs. A flocking fault diagnosis method and a fault tolerance control strategy based on communication and data association are introduced. Treating failed mobile robots as obstacles, an avoidance algorithm for complex-shaped obstacles is proposed. Simulation shows the effectiveness of the method.


Analyzing the Impact of Fault Tolerance Methods in ARM Processors under Soft Errors running Linux and Parallelization APIs

IEEE Transactions on Nuclear Science ◽

10.1109/tns.2017.2706519 ◽

2017 ◽

pp. 1-1 ◽

Cited By ~ 8

Author(s):

Gennaro Rodrigues ◽

Felipe Rosa ◽

Adria de Oliveira ◽

Fernanda Lima Kastensmidt ◽

Luciano Ost ◽

...

Keyword(s):

Fault Tolerance ◽

Soft Errors ◽

The Impact


SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems

Journal of Computer Science and Technology ◽

10.24215/16666038.20.e14 ◽

2020 ◽

Vol 20 (2) ◽

pp. e14

Author(s):

Diego Montezanti

Keyword(s):

Fault Tolerance ◽

Error Detection ◽

Large Scale ◽

Soft Error ◽

Scientific Applications ◽

Case Scenario ◽

Worst Case ◽

Main Challenge ◽

Increased Risk ◽

The Impact

Reliability and fault tolerance have become increasingly relevant in HPC because faults of different kinds are more likely to occur in these systems. One cause is the growing complexity of processors: in the search for better performance, the scale of integration rises and more components work near their technological limits, making them increasingly prone to failure. Another is the growth of parallel systems, in number of cores and processing nodes, to obtain greater computational power. As applications demand longer uninterrupted computation times, the impact of faults grows, because an execution that aborts on a fault, or that concludes with erroneous results, has to be relaunched. Consequently, these applications need to run on highly available and reliable systems, with strategies capable of providing detection, protection and recovery against faults. Exascale is expected to be reached in the coming years, with supercomputers of millions of processing cores capable of performing on the order of 10^18 operations per second. This is a great window of opportunity for HPC applications, but it also increases the risk that they will not complete their executions. Recent studies show that, as systems continue to include more processors, the Mean Time Between Errors decreases, resulting in higher failure rates and increased risk of corrupted results; large parallel applications are expected to deal with errors that occur every few minutes, requiring external help to progress efficiently. Silent Data Corruptions are the most dangerous errors that can occur, since they can generate incorrect results in programs that appear to execute correctly.
Scientific applications and large-scale simulations are the most affected, making silent error handling the main challenge towards resilience in HPC. In message passing applications, a silent error affecting a single task can produce a pattern of corruption that spreads to all communicating processes; in the worst case scenario, the erroneous final results cannot be detected at the end of the execution and will be taken as correct. Since scientific applications have execution times on the order of hours or even days, it is essential to find strategies that allow applications to reach correct solutions in a bounded time, despite the underlying failures. These strategies also prevent energy consumption from skyrocketing, since without them the executions would have to be launched again from the beginning. However, the most popular parallel programming models used in supercomputers lack support for fault tolerance.
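
The claim that the Mean Time Between Errors shrinks as processors are added can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes that errors on each node are independent and exponentially distributed, and the function names and the example figures (a 5-year per-node MTBE, 100,000 nodes) are hypothetical values chosen only for illustration.

```python
import math


def system_mtbe_hours(node_mtbe_hours: float, node_count: int) -> float:
    """Mean time between errors for the whole machine.

    Assumption (not from the source): node errors are independent and
    exponentially distributed, so the per-node error rates simply add up.
    """
    if node_mtbe_hours <= 0 or node_count < 1:
        raise ValueError("node_mtbe_hours must be > 0 and node_count >= 1")
    return node_mtbe_hours / node_count


def probability_of_error_free_run(
    node_mtbe_hours: float, node_count: int, run_hours: float
) -> float:
    """Probability that no node sees an error during a run of run_hours."""
    if run_hours < 0:
        raise ValueError("run_hours must be >= 0")
    system_rate = node_count / node_mtbe_hours  # errors per hour, whole system
    return math.exp(-system_rate * run_hours)


if __name__ == "__main__":
    # Hypothetical figures: each node errs once every 5 years on average.
    node_mtbe = 5 * 365 * 24  # hours
    for nodes in (1_000, 10_000, 100_000):
        minutes = system_mtbe_hours(node_mtbe, nodes) * 60
        p_ok = probability_of_error_free_run(node_mtbe, nodes, run_hours=24)
        print(f"{nodes:>7} nodes: system MTBE ~ {minutes:8.1f} min, "
              f"P(error-free 24 h run) = {p_ok:.3e}")
```

Under these assumptions the system-level MTBE falls in direct proportion to the node count, which is consistent with the abstract's expectation of errors every few minutes on very large machines.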


Efficient Fault Tolerance on Cloud Environments

Research Anthology on Architectures, Frameworks, and Integration Strategies for Distributed and Cloud Computing ◽

10.4018/978-1-7998-5339-8.ch059 ◽

2021 ◽

pp. 1231-1243

Author(s):

Sam Goundar ◽

Akashdeep Bhardwaj

Keyword(s):

Cloud Computing ◽

Fault Tolerance ◽

Web Applications ◽

Fault Tolerant ◽

Cloud Services ◽

Cloud Environments ◽

Computing Environments ◽

Service Assurance ◽

Mission Critical ◽

The Impact

With mission-critical web applications and resources hosted in cloud environments, and cloud services growing fast, the need for a greater level of service assurance regarding fault tolerance, for both availability and reliability, has increased. The priority now is ensuring a fault-tolerant environment that keeps systems up and running. To minimize the impact of downtime or accessibility failures caused by systems, network devices or hardware, such failures need to be anticipated and handled proactively in a fast, intelligent way. This article discusses a fault tolerance system for cloud computing environments and analyzes whether it is effective for them.


DESIGN FEATURES OF MULTI-BIT CMOS-CNI-ADC TO CREATE MULTI-CHANNEL HIGH-SPEED DSP SYSTEMS WITH INCREASED FAULT TOLERANCE

Modeling of systems and processes ◽

10.12737/2219-0767-2019-12-3-59-64 ◽

2019 ◽

Vol 12 (3) ◽

pp. 59-64 ◽

Cited By ~ 1

Author(s):

Vladimir Kononov

Keyword(s):

Fault Tolerance ◽

Conversion Rate ◽

High Speed ◽

Current Source ◽

Design Features ◽

Conductivity Type ◽

Fault Resistance ◽

Dsp Systems ◽

The Impact ◽

Additional Current

The features and possibilities of using foreign and domestic foundry technologies to create CMOS ADCs for high-speed multichannel DSP systems with increased fault resistance to the effects of TKCH are considered. The architecture and the ADC balancing technique, which increase the conversion rate when several ADCs operate in alternating mode, are presented. The technique of double redundancy for the sources of "weight" currents is considered. The need to use an additional current source and dual series-connected CMOS transistors, instead of single transistors of the same conductivity type, is substantiated. It is noted that the proposed solutions effectively mitigate the impact of TKCH with the most "dangerous" energies of 20-100 MeV/nucleon.
