Iterative Selection of Categorical Variables for Log Data Anomaly Detection

A Log-Based Anomaly Detection Method with Efficient Neighbor Searching and Automatic K Neighbor Selection

Scientific Programming ◽

10.1155/2020/4365356 ◽

2020 ◽

Vol 2020 ◽

pp. 1-17

Author(s):

Bingming Wang ◽

Shi Ying ◽

Zhe Yang

Keyword(s):

Anomaly Detection ◽

Large Scale ◽

Nearest Neighbor ◽

Detection Method ◽

Tree Model ◽

Log Data ◽

Neighbor Search ◽

Different Types ◽

Efficient Selection ◽

Selection Of

Using the k-nearest neighbor (kNN) algorithm as a supervised learning method for anomaly detection can yield more accurate results. However, kNN is inefficient at finding k neighbors in large-scale log data; in addition, log data are imbalanced in quantity, so selecting proper k neighbors for different data distributions is a challenge. In this paper, we propose a log-based anomaly detection method with efficient neighbor search and automatic selection of k neighbors. First, we propose a neighbor search method based on minhash and MVP-tree. The minhash algorithm groups similar logs into the same bucket, and an MVP-tree model is built for the samples in each bucket. This reduces both the cost of distance calculation and the number of neighbor samples that need to be compared, which improves the efficiency of finding neighbors. For selecting k neighbors, we propose an automatic method based on the Silhouette Coefficient, which selects proper k neighbors to improve the accuracy of anomaly detection. Our method is verified on six different types of log data to demonstrate its universality and feasibility.
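The bucketing step described in this abstract can be illustrated with a small sketch. The abstract does not give the paper's exact bucketing scheme, so the code below assumes standard MinHash signatures with LSH banding; the constants `NUM_HASHES` and `BANDS`, the whitespace tokenizer, and all function names are illustrative assumptions, and the MVP-tree that the paper builds inside each bucket is not shown.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64  # signature length (illustrative choice, not from the paper)
BANDS = 16       # 16 bands of 4 rows each (illustrative choice)


def tokens(log_line):
    """Split a non-empty log message into a set of lowercase word tokens."""
    return set(log_line.lower().split())


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(token_set):
    """For each seeded hash function, keep the minimum hash over the set."""
    return tuple(min(_hash(seed, t) for t in token_set) for seed in range(NUM_HASHES))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions; estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def bucket_logs(log_lines):
    """Map each (band, band-slice) key to the indices of logs that share it."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for idx, line in enumerate(log_lines):
        sig = minhash_signature(tokens(line))
        for band in range(BANDS):
            key = (band, sig[band * rows:(band + 1) * rows])
            buckets[key].add(idx)
    return buckets


# Hypothetical log lines, made up for illustration.
logs = [
    "Receiving block blk_1 src 10.0.0.1 dest 10.0.0.2",
    "Receiving block blk_1 src 10.0.0.1 dest 10.0.0.2",
    "kernel panic fatal error unrecoverable",
]
buckets = bucket_logs(logs)
```

Only logs that share at least one bucket key need to be compared when searching for neighbors, which is where the saving in distance computations comes from.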

Bridging the Gap between Quantitative and Qualitative Historical Research: an application of multiple regression analysis and homogeneity analysis with alternating least squares

History and Computing ◽

10.3366/hac.1996.8.3.133 ◽

1996 ◽

Vol 8 (3) ◽

pp. 133-144 ◽

Cited By ~ 1

Author(s):

María del Mar del Pozo Andrés ◽

Jacques F A Braster

Keyword(s):

Least Squares ◽

Multiple Regression ◽

Historical Research ◽

Alternating Least Squares ◽

Categorical Variables ◽

Two Dimensional ◽

Homogeneity Analysis ◽

Regression Approach ◽

Dimensional Picture ◽

Selection Of

In this article we propose two research techniques that can bridge the gap between quantitative and qualitative historical research. These are: (1) a multiple regression approach that gives information about general patterns between numerical variables and the selection of outliers for qualitative analysis; (2) a homogeneity analysis with alternating least squares that results in a two-dimensional picture in which the relationships between categorical variables are graphically presented.

Anomaly Detection via Mining Numerical Workflow Relations from Logs

10.36227/techrxiv.12570926.v1 ◽

2020 ◽

Author(s):

Bo Zhang ◽

Hongyu Zhang ◽

Pablo Moscato

Keyword(s):

Distributed Systems ◽

Anomaly Detection ◽

Text Messages ◽

Distributed File System ◽

Log Data ◽

Sliding Windows ◽

Novel Approach ◽

Hadoop Distributed File System ◽

Blue Gene ◽

Online Anomaly Detection

Complex software intensive systems, especially distributed systems, generate logs for troubleshooting. The logs are text messages recording system events, which can help engineers determine the system's runtime status. This paper proposes a novel approach named ADR (stands for Anomaly Detection by workflow Relations) that employs matrix nullspace to mine numerical relations from log data. The mined relations can be used for both offline and online anomaly detection and facilitate fault diagnosis. We have evaluated ADR on log data collected from two distributed systems, HDFS (Hadoop Distributed File System) and BGL (IBM Blue Gene/L supercomputers system). ADR successfully mined 87 and 669 numerical relations from the logs and used them to detect anomalies with high precision and recall. For online anomaly detection, ADR employs PSO (Particle Swarm Optimization) to find the optimal sliding windows' size and achieves fast anomaly detection. The experimental results confirm that ADR is effective for both offline and online anomaly detection.
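To make the nullspace idea concrete, here is a minimal sketch under stated assumptions: each row of a matrix holds per-session counts of three hypothetical event types (open, write, close), and any vector in the nullspace of that matrix is a linear relation that every normal session satisfies. The data, the function names, and the exact Gauss-Jordan routine are illustrative and are not taken from ADR itself.

```python
from fractions import Fraction


def nullspace(matrix):
    """Return a nullspace basis of `matrix` using exact Gauss-Jordan elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0])
    pivot_cols = []
    r = 0
    for c in range(n_cols):
        # Find a row at or below r with a non-zero entry in column c.
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        lead = rows[r][c]
        rows[r] = [x / lead for x in rows[r]]
        # Clear column c in every other row.
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivot_cols.append(c)
        r += 1
        if r == len(rows):
            break
    # Each non-pivot (free) column yields one basis vector.
    basis = []
    for free in (c for c in range(n_cols) if c not in pivot_cols):
        vec = [Fraction(0)] * n_cols
        vec[free] = Fraction(1)
        for row_idx, pc in enumerate(pivot_cols):
            vec[pc] = -rows[row_idx][free]
        basis.append(vec)
    return basis


def violates(event_counts, relations):
    """True if the count vector breaks any mined relation (non-zero dot product)."""
    return any(sum(a * b for a, b in zip(event_counts, rel)) != 0 for rel in relations)


# Hypothetical per-session counts of (open, write, close) events.
normal_sessions = [[1, 3, 1], [2, 5, 2], [1, 0, 1], [3, 7, 3]]
relations = nullspace(normal_sessions)  # one relation: close - open = 0
```

With this data, a session such as `[2, 4, 1]` (two opens, one close) is flagged by `violates`, while `[4, 9, 4]` is not.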

Hybrid approach with Deep Auto-Encoder and optimized LSTM based Deep Learning approach to detect anomaly in cloud logs

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-201707 ◽

2021 ◽

pp. 1-15

Author(s):

Savaridassan Pankajashan ◽

G. Maragatham ◽

T. Kirthiga Devi

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Anomaly Detection ◽

Performance Metrics ◽

Hybrid Approach ◽

Machine Learning Techniques ◽

Support Vector ◽

Paper Machine ◽

Log Data ◽

Isolation Forest

Anomaly-based detection is concerned with recognizing the uncommon: catching unusual activity and finding the strange action behind that activity. Anomaly-based detection has a wide scope of critical applications, from bank application security to natural sciences to medical systems to marketing apps. Anomaly-based detection built on various Machine Learning techniques is a type of system that relies on artificial intelligence. With the ever-expanding volume and new sorts of information, for example, sensor information from an enormous number of IoT devices and network flow data from cloud computing, it is unsurprising that there is growing enthusiasm for reaching more conclusions automatically by means of AI and ML applications. But with respect to anomaly detection, many applications of the scheme are simply the passion for detection. In this paper, Machine Learning (ML) techniques, namely the SVM and Isolation Forest classifiers, are evaluated and, with reference to Deep Learning (DL) techniques, the proposed DA-LSTM (Deep Auto-Encoder LSTM) model is adopted for preprocessing of log data and anomaly-based detection to obtain better detection performance. An enhanced LSTM (long short-term memory) model, whose parameters are optimized using a genetic algorithm (GA), is utilized to better recognize anomalies in log data that has been filtered by a Deep Auto-Encoder (DA). The deep neural network models are utilized to convert unstructured log information into training-ready features, which are suitable for log classification in detecting anomalies. These models are assessed using two benchmark datasets, the OpenStack logs and the CIDDS-001 intrusion detection OpenStack server dataset. The outcomes show that the DA-LSTM model performs better than other notable ML techniques.
We further investigated the performance of the ML and DL models through well-known indicators, specifically the F-measure, Accuracy, Recall, and Precision. The experimental results show that the Isolation Forest and Support Vector Machine classifiers reach roughly 81% and 79% accuracy on the CIDDS-001 OpenStack server dataset, while the proposed DA-LSTM classifier reaches around 99.1% accuracy, improving on the familiar ML algorithms. Further, the DA-LSTM outcomes on the OpenStack log datasets show better anomaly detection compared with other notable machine learning models.
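For reference, the four indicators named in this abstract can be computed from a binary confusion matrix as in the sketch below. The function name and the example counts are made up for illustration and do not reproduce the paper's results.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F-measure and accuracy for anomaly detection.

    tp: anomalies correctly flagged    fp: normal entries wrongly flagged
    fn: anomalies missed               tn: normal entries correctly passed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical counts: 100 log entries, 10 of them true anomalies.
metrics = detection_metrics(tp=8, fp=2, fn=2, tn=88)
```

On these counts accuracy is 0.96 while precision and recall are both 0.8, which shows why accuracy alone can flatter a detector when anomalies are rare.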

Variable Selection for Correlated High-Dimensional Data with Infrequent Categorical Variables: Based on Sparse Sample Regression and Anomaly Detection Technology

Intelligent Decision Technologies - Smart Innovation, Systems and Technologies ◽

10.1007/978-981-16-2765-1_9 ◽

2021 ◽

pp. 109-125

Author(s):

Yuhei Kotsuka ◽

Sumika Arima

Keyword(s):

Variable Selection ◽

Anomaly Detection ◽

High Dimensional Data ◽

Categorical Variables ◽

High Dimensional ◽

Detection Technology ◽

Selection For

Well Log Data Standardization, Imputation and Anomaly Detection Using Hidden Markov Models

10.3997/2214-4609.201902208 ◽

2019 ◽

Author(s):

K. Struminskiy ◽

A. Klenitskiy ◽

A. Reshytko ◽

D. Egorov ◽

A. Shchepetnov ◽

...

Keyword(s):

Anomaly Detection ◽

Hidden Markov Models ◽

Markov Models ◽

Hidden Markov ◽

Well Log ◽

Log Data ◽

Data Standardization

Anomaly Detection of Hadoop Log Data Using Moving Average and 3-Sigma

KIPS Transactions on Software and Data Engineering ◽

10.3745/ktsde.2016.5.6.283 ◽

2016 ◽

Vol 5 (6) ◽

pp. 283-288

Author(s):

Siwoon Son ◽

Myeong-Seon Gil ◽

Yang-Sae Moon ◽

Hee-Sun Won

Keyword(s):

Anomaly Detection ◽

Moving Average ◽

Log Data

Habitat selection of Capercaillie (Tetrao urogallus) displaying males: Case from Rila Mountain, Bulgaria

ARPHA Conference Abstracts ◽

10.3897/aca.2.e46462 ◽

2019 ◽

Vol 2 ◽

Author(s):

Dimitar Plachiyski ◽

Georgi Popgeorgiev ◽

Stefan Avramov ◽

Yurii Kornilev

Keyword(s):

Habitat Selection ◽

Home Range ◽

Scots Pine ◽

Vegetation Cover ◽

Habitat Management ◽

Categorical Variables ◽

Tetrao Urogallus ◽

Adult Males ◽

Convex Polygons ◽

Selection Of

Current habitat management of the peripheral, regionally unique, and isolated Balkan capercaillie Tetrao urogallus rudolfi Dombrowski, 1912 meta-population in Bulgaria is based on obsolete knowledge of the spatial requirements of the species. Thus, we studied habitat availability and the patterns of use by Capercaillie adult males at the home range scale, to inform and contribute to the conservation-oriented management of the threatened subspecies and its habitats. The field study was conducted during 2014–2015 in the northeastern part of Rila Mtn., Southwestern Bulgaria. Using GPS tags (“Bird 2A”, e-obs Digital Telemetry, Grünwald, Germany), a total of 38,640 GPS fixes were obtained from 3 displaying males associated with one lek. On this basis, we calculated annual and seasonal Minimum Convex Polygons (MCP), traditionally used as a measure of the maximum area of activity. Capercaillie habitat preference was computed using Manly’s habitat selection ratios (w), design III, combined with 90% Bonferroni simultaneous confidence intervals. To calculate habitat selection, we determined surface (Steepness and Exposure), forest stand succession and vegetation cover categorical variables. The habitat and surface layers were rasterized into 8 m square pixels. At the home range (MCP) scale, tagged roosters used vegetation cover non-randomly (annual: Khi2L=5738.89, df=14, p<0.001; winter: Khi2L=3773.28, df=13, p<0.001; summer: Khi2L=3646.32, df=14, p<0.001), and preferred forests dominated by Scots pine and Macedonian pine, with the annual selection of Scots pine and the summer selection of Macedonian pine being significantly different. In terms of forest stage succession, roosters used forest stages non-randomly (annual: Khi2L=3492.57, df=8, p<0.001; winter: Khi2L=2075.18, df=8, p<0.001; summer: Khi2L=1670.1, df=6, p<0.001), and demonstrated clear avoidance of forest stands in the age classes “0 to 40” and “41 to 80” years within the summer and annual ranges.
The roosters demonstrated significant preference for southeastern exposure during the winter and annually, and significant overall avoidance of northern exposure, as well as avoidance of north-eastern aspect during the winter and south aspect during the summer (annual: Khi2L=4671.87, df=18, p<0.001; winter: Khi2L=3909.04, df=16, p<0.001; summer: Khi2L=3095.84, df=18, p<0.001). The slope class “63.1 to 73°” was not used. In the summer, Capercaillie males significantly preferred slopes within the class “27.1 to 36°” and avoided the classes “0 to 9°”, “9.1 to 18°” and “54.1 to 63°”. The birds also demonstrated significant avoidance of flat terrains within the “0 to 9°” class annually (annual: Khi2L=608.24, df=17, p<0.001; winter: Khi2L=1148.37, df=16, p<0.001; summer: Khi2L=906.54, df=17, p<0.001).
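As a rough illustration of the selection ratio used in this abstract: a Manly selection ratio for a habitat category is the proportion of use divided by the proportion available, so values above 1 suggest preference and values below 1 suggest avoidance. The sketch below computes only the point estimates for a single animal; it omits the Bonferroni confidence intervals and the per-animal pooling of design III, and the category names and numbers are invented for illustration.

```python
def selection_ratios(used_counts, available_props):
    """Return w = (proportion used) / (proportion available) per category."""
    total_used = sum(used_counts.values())
    return {
        category: (used_counts[category] / total_used) / available_props[category]
        for category in used_counts
    }


# Hypothetical GPS-fix counts per cover type inside one home range.
used = {"scots_pine": 60, "spruce": 30, "meadow": 10}
# Hypothetical share of the home range covered by each type (sums to 1).
available = {"scots_pine": 0.3, "spruce": 0.4, "meadow": 0.3}

ratios = selection_ratios(used, available)
preferred = [c for c, w in ratios.items() if w > 1]
```

Here only `scots_pine` ends up in `preferred`, since it accounts for 60% of fixes but only 30% of the available area.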

Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models

10.1109/cloudintelligence52565.2021.00013 ◽

2021 ◽

Author(s):

Harold Ott ◽

Jasmin Bogatinovski ◽

Alexander Acker ◽

Sasho Nedelkoski ◽

Odej Kao

Keyword(s):

Anomaly Detection ◽

Language Models ◽

Log Data

Robust log-based anomaly detection on unstable log data

Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering - ESEC/FSE 2019 ◽

10.1145/3338906.3338931 ◽

2019 ◽

Cited By ~ 19

Author(s):

Xu Zhang ◽

Yong Xu ◽

Qingwei Lin ◽

Bo Qiao ◽

Hongyu Zhang ◽

...

Keyword(s):

Anomaly Detection ◽

Log Data