A Formal Approach for Failure Detection in Large-Scale Distributed Systems Using Abstract State Machines

Failure detection is a fundamental building block for ensuring fault tolerance in large-scale distributed systems. It is also a difficult problem. Resources under heavy load can be mistaken for failed ones. The failure of a network link can be detected by the lack of a response, but a lack of response also occurs when a computational resource fails. Although progress has been made, no existing approach covers all essential aspects of a distributed environment. This paper presents a failure detection system based on adaptive, decentralized failure detectors. The system is developed as an independent substrate that works asynchronously and independently of the application flow. It uses a hierarchical protocol that creates a clustering mechanism, which ensures dynamic configuration and traffic optimization. It also uses a gossip strategy for failure detection at the local level to minimize detection time and remove wrong suspicions. Results show that the system scales with the number of monitored resources while still taking into account the QoS requirements of both applications and resources.
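
The gossip strategy mentioned in the abstract can be illustrated with a minimal sketch of heartbeat-counter gossip inside one local cluster. This is not the authors' protocol: the class `GossipNode`, the function `run_round`, and the `fail_after` parameter are hypothetical names chosen for illustration, and the sketch leaves out the adaptive timeouts and the hierarchical clustering the abstract describes.

```python
import random


class GossipNode:
    """One member of a local cluster (hypothetical illustration)."""

    def __init__(self, name, peers, fail_after=5):
        self.name = name
        self.fail_after = fail_after  # rounds without news before a peer is suspected
        self.clock = 0                # local round counter
        self.heartbeat = 0            # this node's own heartbeat counter
        # peer name -> (highest heartbeat seen, local round when it was seen)
        self.table = {p: (0, 0) for p in peers}

    def tick(self):
        """Advance one gossip round and bump our own heartbeat."""
        self.clock += 1
        self.heartbeat += 1

    def digest(self):
        """Heartbeat view that is sent to a gossip partner."""
        view = {p: hb for p, (hb, _) in self.table.items()}
        view[self.name] = self.heartbeat
        return view

    def merge(self, view):
        """Adopt any heartbeat that is newer than the one we hold."""
        for peer, hb in view.items():
            if peer != self.name and hb > self.table.get(peer, (-1, 0))[0]:
                self.table[peer] = (hb, self.clock)

    def suspects(self):
        """Peers whose heartbeat has not advanced for fail_after rounds."""
        return sorted(p for p, (_, seen) in self.table.items()
                      if self.clock - seen >= self.fail_after)


def run_round(nodes, alive, rng):
    """Every live node ticks, then pushes its digest to one random live peer."""
    for name in alive:
        nodes[name].tick()
    for name in alive:
        others = [p for p in alive if p != name]
        if others:
            nodes[rng.choice(others)].merge(nodes[name].digest())
```

Because fresh heartbeats also travel indirectly through third parties, a slow but live node is less likely to be suspected than under point-to-point pinging, which is one way a gossip scheme can remove wrong suspicions.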

A Failure Detection System for Large Scale Distributed Systems

Development of Distributed Systems from Design to Application and Maintenance ◽ 10.4018/978-1-4666-2647-8.ch008 ◽ 2012 ◽ pp. 127-151

Author(s): Andrei Lavinia ◽ Ciprian Dobre ◽ Florin Pop ◽ Valentin Cristea

Keyword(s): Distributed Systems ◽ Large Scale ◽ Detection System ◽ Failure Detection ◽ Difficult Problem ◽ Distributed Environment ◽ Dynamic Configuration ◽ Fundamental Building Block ◽ Heavy Loads ◽ Traffic Optimization

Abstract state machines: Designing distributed systems with state machines and B

Lecture Notes in Computer Science - B’98: Recent Advances in the Development and Use of the B Method ◽ 10.1007/bfb0053364 ◽ 1998 ◽ pp. 226-242 ◽ Cited By ~ 4

Author(s): Bill Stoddart ◽ Steve Dunne ◽ Andy Galloway ◽ Richard Shore

Keyword(s): Distributed Systems ◽ State Machines ◽ Abstract State Machines

An Integrated Specification and Verification Environment for Component-Based Architectures of Large-Scale Distributed Systems

10.21236/ada501823 ◽ 2009 ◽ Cited By ~ 1

Author(s): John Hatcliff ◽ Torben Amtoft ◽ Anindya Banerjee

Keyword(s): Distributed Systems ◽ Large Scale ◽ Specification And Verification

A Combined Approach for Model-Based PV Power Plant Failure Detection and Diagnostic

Energies ◽ 10.3390/en14051261 ◽ 2021 ◽ Vol 14 (5) ◽ pp. 1261

Author(s): Christopher Gradwohl ◽ Vesna Dimitrievska ◽ Federico Pittino ◽ Wolfgang Muehleisen ◽ András Montvay ◽ ...

Keyword(s): Power Plants ◽ Large Scale ◽ Failure Detection ◽ Energy Yield ◽ Combined Approach ◽ Levelized Cost Of Electricity ◽ Term Operation ◽ Model Based

Photovoltaic (PV) technology allows large-scale investments in a renewable power-generating system at a competitive levelized cost of electricity (LCOE) and with a low environmental impact. Large-scale PV installations operate in a highly competitive market environment where even small performance losses have a high impact on profit margins. Operation at maximum performance is therefore the key to long-term profitability. This can be achieved through advanced performance monitoring and methodologies that detect both instant and gradual failures. In this paper we present a combined approach: model-based fault detection by means of physical and statistical models, and failure diagnosis based on physics of failure. Both contribute to optimized PV plant operation and maintenance based on typically available supervisory control and data acquisition (SCADA) data. The failure detection and diagnosis capabilities were demonstrated in a case study based on six years of SCADA data from a PV plant in Slovenia. In this case study, underperforming values of the inverters of the PV plant were reliably detected and possible root causes were identified. We conclude that the combined approach can contribute to efficient, long-term operation of photovoltaic power plants with maximum energy yield and can be applied to the monitoring of photovoltaic plants.
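
As a rough illustration of what model-based detection of underperforming inverters can look like, the sketch below compares measured power with the output expected from a toy physical model and flags inverters whose median ratio falls below a threshold. It is not the method of the paper: the linear irradiance model, the function names, the 0.8 threshold, and the 200 W/m² low-light cutoff are all assumptions made for illustration, and effects such as module temperature are ignored.

```python
from statistics import median


def expected_power_kw(irradiance_w_m2, rated_kw, stc_irradiance=1000.0):
    """Toy physical model (assumption): output scales linearly with irradiance."""
    return rated_kw * irradiance_w_m2 / stc_irradiance


def underperforming(samples, rated_kw, threshold=0.8, min_irradiance=200.0):
    """Return ids of inverters whose median measured/expected ratio is below threshold.

    samples: iterable of (inverter_id, irradiance_w_m2, measured_kw) tuples.
    """
    ratios = {}
    for inverter_id, irradiance, measured_kw in samples:
        if irradiance < min_irradiance:
            continue  # skip low-light samples, where the ratio is dominated by noise
        expected = expected_power_kw(irradiance, rated_kw)
        ratios.setdefault(inverter_id, []).append(measured_kw / expected)
    return sorted(i for i, r in ratios.items() if median(r) < threshold)
```

Using the median rather than the mean keeps a single outlying SCADA sample from triggering or masking a flag; a statistical model fitted to historical data could replace `expected_power_kw` without changing the comparison step.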

Workshop on large-scale distributed systems for information retrieval

ACM SIGIR Forum ◽ 10.1145/1328964.1328979 ◽ 2007 ◽ Vol 41 (2) ◽ pp. 83-88

Author(s): Flavio P. Junqueira ◽ Vassilis Plachouras ◽ Fabrizio Silvestri ◽ Ivana Podnar

Keyword(s): Information Retrieval ◽ Distributed Systems ◽ Large Scale
