Graph Based Root Cause Analysis in Cloud Data Center

Author(s):  
Divyaansh Dandona ◽  
Mevlut Demir ◽  
John J. Prevost
2020 ◽  
Vol 9 (1) ◽  
pp. 2146-2148

A data center is a complex amalgamation of servers, comprising thousands of services, storage, networking, routers, switches, and software that provide services 24x7 to customers. The services provided range from websites and storage to cloud platforms and email marketing. A team is established to detect the anomalies reported by the monitoring system. Anomalies or issues in servers cause high service downtime. Detecting these anomalies with high accuracy and performing root cause analysis has been a major issue. The team often remediates the symptom rather than the anomaly. With the use of artificial neural networks, a trained model can provide solutions with high accuracy and scalability, which results in higher uptime and reduced MTTR for customers.


2021 ◽  
Vol 27 (11) ◽  
pp. 1152-1173
Author(s):  
Arnak Poghosyan ◽  
Ashot Harutyunyan ◽  
Naira Grigoryan ◽  
Nicholas Kushmerick

Effective root cause analysis (RCA) of performance issues in modern cloud environments remains a hard problem. Traditional RCA tracks complex issues by their signatures, known as problem incidents. Common approaches to incident discovery rely mainly on the expertise of users who define an environment-specific set of alerts and target detection of problems through their occurrence in the monitoring system. Adequately modeling all possible problem patterns for today's extremely sophisticated data center applications is a very complex task. It may result in alert/event storms that include large numbers of non-indicative precautions. Thus, the crucial task for incident-based RCA is the reduction of redundant recommendations, either by prioritizing events according to importance/impact criteria or by deriving meaningful groupings of them into separable situations. In this paper, we consider automation of incident discovery based on rule induction algorithms that retrieve conditions directly from monitoring datasets without consuming the system events. Rule-learning algorithms are very flexible and powerful for many regression and classification problems, with high-level explainability. Since annotated or labeled data sets are mostly unavailable in this area of technology, we discuss data self-labelling principles which allow transforming originally unsupervised learning tasks into classification problems, with further application of rule induction methods to incident detection.
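The self-labelling idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that labels are derived by thresholding one key metric (here a hypothetical `latency` series at mean plus two standard deviations), and it substitutes a simple one-condition rule learner for the rule induction methods the paper refers to. The metric names `cpu` and `queue_depth` and all sample values are made up for illustration.

```python
from statistics import mean, pstdev

def self_label(values, k=2.0):
    """Label a sample 1 if it lies more than k population std devs above the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [1 if v > mu + k * sigma else 0 for v in values]

def induce_rule(rows, labels, features):
    """Return (feature, threshold, errors) for the single rule
    'feature > threshold' with the fewest misclassified samples."""
    best = None
    for f in features:
        candidates = sorted({r[f] for r in rows})
        # Try the midpoint between each pair of adjacent observed values.
        for lo, hi in zip(candidates, candidates[1:]):
            t = (lo + hi) / 2
            errors = sum((r[f] > t) != bool(y) for r, y in zip(rows, labels))
            if best is None or errors < best[2]:
                best = (f, t, errors)
    return best

# Hypothetical monitoring samples: the key metric used for self-labelling ...
latency = [100, 105, 98, 102, 110, 101, 99, 400, 103, 97, 420, 104]
# ... and the other metrics from which an explanatory rule is induced.
rows = [
    {"cpu": 35, "queue_depth": 4},  {"cpu": 80, "queue_depth": 5},
    {"cpu": 40, "queue_depth": 3},  {"cpu": 55, "queue_depth": 6},
    {"cpu": 90, "queue_depth": 8},  {"cpu": 45, "queue_depth": 4},
    {"cpu": 50, "queue_depth": 5},  {"cpu": 70, "queue_depth": 40},
    {"cpu": 60, "queue_depth": 6},  {"cpu": 30, "queue_depth": 3},
    {"cpu": 75, "queue_depth": 45}, {"cpu": 85, "queue_depth": 7},
]

labels = self_label(latency)  # unsupervised data becomes a labeled set
rule = induce_rule(rows, labels, ["cpu", "queue_depth"])
print(f"IF {rule[0]} > {rule[1]} THEN incident (training errors: {rule[2]})")
```

On this toy data the learned condition involves `queue_depth` rather than `cpu`, which shows how an induced rule stays human-readable. A practical rule learner would combine several conditions and validate on held-out data.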


2011 ◽  
pp. 78-86
Author(s):  
R. Kilian ◽  
J. Beck ◽  
H. Lang ◽  
V. Schneider ◽  
T. Schönherr ◽  
...  

2012 ◽  
Vol 132 (10) ◽  
pp. 1689-1697
Author(s):  
Yutaka Kudo ◽  
Tomohiro Morimura ◽  
Kiminori Sugauchi ◽  
Tetsuya Masuishi ◽  
Norihisa Komoda

2018 ◽  
Vol 6 (2) ◽  
pp. 287-292
Author(s):  
M.R. Dave ◽  
H.B. Patel ◽  
B. Shrimali ◽  
...  

Author(s):  
Dan Bodoh ◽  
Kent Erington ◽  
Kris Dickson ◽  
George Lange ◽  
Carey Wu ◽  
...  

Abstract: Laser-assisted device alteration (LADA) is an established technique used to identify critical speed paths in integrated circuits. LADA can reveal the physical location of a speed path, but not its timing. This paper describes the root cause analysis benefits of 1064 nm time-resolved LADA (TR-LADA) with a picosecond laser. It shows several examples of how picosecond TR-LADA has complemented the existing fault isolation toolset and has allowed for quicker resolution of design and manufacturing issues. The paper explains how TR-LADA increases the LADA localization resolution by eliminating the well interaction, provides the timing of the event detected by LADA, indicates the propagation direction of the critical signals detected by LADA, allows the analyst to infer the logic values of the critical signals, and separates multiple interactions occurring at the same site for better understanding of the critical signals.

