A Survey of AIOps Methods for Failure Management

ACM Transactions on Intelligent Systems and Technology ◽

10.1145/3483424 ◽

2021 ◽

Vol 12 (6) ◽

pp. 1-45

Author(s):

Paolo Notaro ◽

Jorge Cardoso ◽

Michael Gerndt

Keyword(s):

Modern Society ◽

IT Industry ◽

Distributed Computing Systems ◽

Computing Systems ◽

Intelligent Monitoring ◽

Daily Monitoring ◽

Quantitative Results ◽

Increasing Demand ◽

Data Requirements ◽

Modern society is increasingly moving toward complex and distributed computing systems. The increase in scale and complexity of these systems challenges O&M teams that perform daily monitoring and repair operations, in contrast with the increasing demand for reliability and scalability of modern applications. For this reason, the study of automated and intelligent monitoring systems has recently sparked much interest across applied IT industry and academia. Artificial Intelligence for IT Operations (AIOps) has been proposed to tackle modern IT administration challenges thanks to Machine Learning, AI, and Big Data. However, AIOps as a research topic is still largely unstructured and unexplored, due to missing conventions in categorizing contributions for their data requirements, target goals, and components. In this work, we focus on AIOps for Failure Management (FM), characterizing and describing 5 different categories and 14 subcategories of contributions, based on their time intervention window and the target problem being solved. We review 100 FM solutions, focusing on applicability requirements and the quantitative results achieved, to facilitate an effective application of AIOps solutions. Finally, we discuss current development problems in the areas covered by AIOps and delineate possible future trends for AI-based failure management.

Data-Aware Distributed Batch Scheduling

Handbook of Research on Grid Technologies and Utility Computing ◽

10.4018/978-1-60566-184-1.ch005 ◽

2009 ◽

pp. 41-48

Author(s):

Tevfik Kosar

Keyword(s):

Data Placement ◽

Distributed Applications ◽

Complex Data ◽

Data Handling ◽

Future Trends ◽

Distributed Computing Systems ◽

Computing Systems ◽

Computation Data ◽

Data Requirements

As the data requirements of scientific distributed applications increase, access to remote data becomes the main performance bottleneck for these applications. Traditional distributed computing systems closely couple data placement and computation, and consider data placement as a side effect of computation. Data placement is either embedded in the computation, delaying it, or performed by simple scripts that do not have the privileges of a job. The inability of traditional systems and existing CPU-oriented schedulers to deal with the complex data handling problem has ushered in a new era: that of data-aware schedulers. This chapter discusses the challenges in this area as well as future trends, with a focus on the Stork case study.

Efficient Resource Allocation Algorithm in Dependable Distributed Computing Systems Using A Colony Optimization

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i1.168171 ◽

2018 ◽

Vol 6 (1) ◽

pp. 168-171

Author(s):

Manas Kumar Yogi ◽

G. Kumari ◽

L. Yamuna ◽

...

Keyword(s):

Resource Allocation ◽

Distributed Computing ◽

Distributed Computing Systems ◽

Computing Systems ◽

Resource Allocation Algorithm ◽

Allocation Algorithm ◽

Efficient Resource

An Enhancement of Leveled DAG Prioritized Task Scheduling Algorithm in Distributed Computing Systems

Menoufia Journal of Electronic Engineering Research ◽

10.21608/mjeer.2017.63443 ◽

2017 ◽

Vol 26 (1) ◽

pp. 171-192

Author(s):

Amal EL-NATTAT ◽

Nirmeen A. El-Bahnasawy ◽

Ayman EL-SAYED

Keyword(s):

Distributed Computing ◽

Task Scheduling ◽

Scheduling Algorithm ◽

Distributed Computing Systems ◽

Computing Systems ◽

Task Scheduling Algorithm

Optimization procedure for algorithms of task scheduling in high performance heterogeneous distributed computing systems

Egyptian Informatics Journal ◽

10.1016/j.eij.2011.10.001 ◽

2011 ◽

Vol 12 (3) ◽

pp. 219-229

Author(s):

Nirmeen A. Bahnasawy ◽

Fatma Omara ◽

Magdy A. Koutb ◽

Mervat Mosa

Keyword(s):

Distributed Computing ◽

Task Scheduling ◽

High Performance ◽

Optimization Procedure ◽

Distributed Computing Systems ◽

Computing Systems ◽

Heterogeneous Distributed Computing ◽

Heterogeneous Distributed Computing Systems

Middleware of real-time object based fault tolerant distributed computing systems: issues and some approaches

Proceedings 2001 Pacific Rim International Symposium on Dependable Computing ◽

10.1109/prdc.2001.992672 ◽

2002 ◽

Author(s):

K.H. Kim

Keyword(s):

Distributed Computing ◽

Real Time ◽

Fault Tolerant ◽

Distributed Computing Systems ◽

Computing Systems

Artificial Intelligent Load Balance Agent on Network Traffic Across Multiple Heterogeneous Distributed Computing Systems

SSRN Electronic Journal ◽

10.2139/ssrn.3739322 ◽

2020 ◽

Author(s):

Anit Kumar ◽

Dhanpratap Singh

Keyword(s):

Distributed Computing ◽

Network Traffic ◽

Load Balance ◽

Distributed Computing Systems ◽

Computing Systems ◽

Artificial Intelligent ◽

Heterogeneous Distributed Computing ◽

Heterogeneous Distributed Computing Systems

2021 IEEE 41st International Conference on Distributed Computing Systems Workshops (ICDCSW)

10.1109/icdcsw53096.2021 ◽

2021 ◽

Keyword(s):

Distributed Computing ◽

Distributed Computing Systems ◽

Computing Systems ◽

International Conference

Optimization Issues in Distributed Computing Systems Design

Modeling, Simulation and Optimization of Complex Processes - HPSC 2012 ◽

10.1007/978-3-319-09063-4_21 ◽

2014 ◽

pp. 261-272

Author(s):

Krzysztof Walkowiak ◽

Jacek Rak

Keyword(s):

Distributed Computing ◽

Systems Design ◽

Distributed Computing Systems ◽

Computing Systems

Decentralized Load Balancing Consensus Control in Distributed Computing Systems

2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT) ◽

10.1109/stc-csit.2018.8526636 ◽

2018 ◽

Author(s):

Leonid Lyubchyk ◽

Yuri Dorofieiev

Keyword(s):

Distributed Computing ◽

Load Balancing ◽

Consensus Control ◽

Distributed Computing Systems ◽

Computing Systems ◽

Decentralized Load Balancing

Estimation of Maximum Values of Communication Overhead and Delay for Distributed Computing Systems

IETE Journal of Research ◽

10.1080/03772063.2002.11416264 ◽

2002 ◽

Vol 48 (2) ◽

pp. 105-111

Author(s):

S Selvam ◽

Moinuddin ◽

Ibraheem ◽

M T Beg

Keyword(s):

Distributed Computing ◽

Communication Overhead ◽

Distributed Computing Systems ◽

Computing Systems
