Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets

Electronics ◽  
2020 ◽  
Vol 9 (7) ◽  
pp. 1164
Author(s):  
João Henriques ◽  
Filipe Caldeira ◽  
Tiago Cruz ◽  
Paulo Simões

Computing and networking systems traditionally record their activity in log files, which have been used for multiple purposes, such as troubleshooting, accounting, post-incident analysis of security breaches, capacity planning and anomaly detection. In earlier systems those log files were processed manually by system administrators, or with the support of basic applications for filtering, compiling and pre-processing the logs for specific purposes. However, as the volume of these log files continues to grow (more logs per system, more systems per domain), it is becoming increasingly difficult to process those logs using traditional tools, especially for less straightforward purposes such as anomaly detection. On the other hand, as systems continue to become more complex, the potential of using large datasets built of logs from heterogeneous sources for detecting anomalies without prior domain knowledge becomes higher. Anomaly detection tools for such scenarios face two challenges. First, devising appropriate data analysis solutions for effectively detecting anomalies from large data sources, possibly without prior domain knowledge. Second, adopting data processing platforms able to cope with the large datasets and complex data analysis algorithms required for such purposes. In this paper we address those challenges by proposing an integrated scalable framework that aims at efficiently detecting anomalous events in large amounts of unlabeled data logs. Detection is supported by clustering and classification methods that take advantage of parallel computing environments. We validate our approach using the well-known NASA Hypertext Transfer Protocol (HTTP) log datasets. Fourteen features were extracted in order to train a k-means model that separates anomalous and normal events into highly coherent clusters. A second model, based on the XGBoost implementation of a gradient tree boosting algorithm, uses the binary clustered data from the first stage to produce a set of simple, interpretable rules. These rules provide the rationale for generalizing the classification over a massive number of unseen events in a distributed computing environment. The classified anomaly events produced by our framework can be used, for instance, as candidates for further forensic and compliance auditing analysis in security management.
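As a rough illustration of the two-stage approach described above, the sketch below clusters extracted log features with k-means and then fits an XGBoost classifier to the resulting binary cluster labels. It is a minimal sketch, not the paper's implementation: the feature matrix `features` is a hypothetical placeholder for the 14 extracted log features, and the paper's feature extraction and distributed execution environment are not reproduced here.

```python
# Minimal sketch of the two-stage approach: k-means separates events into two
# clusters, then a gradient-boosted tree model is trained on those cluster
# labels so that its rules can be applied to unseen events.
import numpy as np
from sklearn.cluster import KMeans
import xgboost as xgb

rng = np.random.default_rng(0)
features = rng.normal(size=(10_000, 14))   # hypothetical stand-in for the 14 extracted log features

# Stage 1: unsupervised separation into "normal" vs. "anomalous" clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = kmeans.labels_                    # binary pseudo-labels from clustering

# Stage 2: supervised model that generalizes the clustering as a set of tree rules.
booster = xgb.XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
booster.fit(features, labels)

# The fitted trees can be dumped in human-readable form and applied to new events.
print(len(booster.get_booster().get_dump()), "trees dumped as rule sets")
new_events = rng.normal(size=(5, 14))
print(booster.predict(new_events))
```

Because the second stage is an ordinary classifier, it can be serialized and scored over very large volumes of unseen events independently of the clustering step.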

2010 ◽  
pp. 1797-1803
Author(s):  
Lisa Friedland

In traditional data analysis, data points lie in a Cartesian space, and an analyst asks certain questions: (1) What distribution can I fit to the data? (2) Which points are outliers? (3) Are there distinct clusters or substructure? Today, data mining treats richer and richer types of data. Social networks encode information about people and their communities; relational data sets incorporate multiple types of entities and links; and temporal information describes the dynamics of these systems. With such semantically complex data sets, a greater variety of patterns can be described and views constructed of the data. This article describes a specific social structure that may be present in such data sources and presents a framework for detecting it. The goal is to identify tribes, or small groups of individuals that intentionally coordinate their behavior—individuals with enough in common that they are unlikely to be acting independently. While this task can only be conceived of in a domain of interacting entities, the solution techniques return to the traditional data analysis questions. In order to find hidden structure (3), we use an anomaly detection approach: develop a model to describe the data (1), then identify outliers (2).
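The model-then-outliers pattern referenced in points (1) and (2) can be shown with a minimal generic sketch; this is not the article's tribe-detection method, only an illustration of the underlying anomaly detection approach. A mixture model is fitted to the points, and the lowest-likelihood points are flagged as outlier candidates.

```python
# Generic "fit a model (1), then identify outliers (2)" sketch, not the
# article's tribe-detection technique. Points whose likelihood under a fitted
# mixture model falls in the lowest percentile are reported as anomalies.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(6, 0.5, (20, 2))])

model = GaussianMixture(n_components=2, random_state=1).fit(points)  # (1) describe the data
log_likelihood = model.score_samples(points)
threshold = np.percentile(log_likelihood, 2)
outliers = np.where(log_likelihood < threshold)[0]                   # (2) identify outliers
print(f"{len(outliers)} candidate outliers")
```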


Author(s):  
Edwin Diday ◽  
M. Narasimha Murthy

In data mining, we generate class/cluster models from large datasets. Symbolic Data Analysis (SDA) is a powerful tool that permits dealing with complex data (Diday, 1988), where a combination of variables and logical and hierarchical relationships among them is used. Such a view permits us to deal with data at a conceptual level, and as a consequence, SDA is ideally suited for data mining. Symbolic data have their own internal structure, which necessitates new techniques that generally differ from the ones used on conventional data (Billard & Diday, 2003). Clustering generates abstractions that can be used in a variety of decision-making applications (Jain, Murty, & Flynn, 1999). In this article, we deal with the application of clustering to SDA.
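As a small illustration of what clustering symbolic data can look like, the sketch below groups interval-valued objects (one common kind of symbolic data) using hierarchical clustering on a Hausdorff-style distance between intervals. This is only a simplified example under those assumptions; the SDA methods discussed in the article are richer.

```python
# Illustrative sketch: objects described by interval-valued variables are
# clustered with a Hausdorff-style distance max(|low diff|, |high diff|),
# summed over variables. Not the article's specific SDA methodology.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Each object: two interval-valued variables, stored as (low, high) pairs.
objects = np.array([
    [[1, 3], [10, 14]],
    [[2, 4], [11, 15]],
    [[8, 9], [30, 35]],
    [[7, 10], [29, 36]],
])

def interval_distance(a, b):
    """Sum over variables of the Hausdorff distance between two intervals."""
    return sum(max(abs(ia[0] - ib[0]), abs(ia[1] - ib[1])) for ia, ib in zip(a, b))

n = len(objects)
dist = np.array([[interval_distance(objects[i], objects[j]) for j in range(n)] for i in range(n)])
clusters = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")
print(clusters)   # e.g. [1 1 2 2]: two groups of symbolic objects
```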


Data Mining ◽  
2013 ◽  
pp. 734-750
Author(s):  
T. Ravindra Babu ◽  
M. Narasimha Murty ◽  
S. V. Subrahmanya

Data mining deals with efficient algorithms for handling large data. When such algorithms are combined with data compaction, they can lead to superior performance. One approach to dealing with large data is to work with representatives of the data instead of the entire dataset; these representatives should preferably be generated with minimal data scans. In the current chapter we discuss lossy and non-lossy data compression methods combined with clustering and classification of large datasets, and we demonstrate the working of such schemes on two large data sets.
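One classic way to obtain representatives with a single data scan is the leader algorithm, sketched below. This is a generic illustration of the idea, not necessarily the exact compression scheme used in the chapter; downstream clustering or classification would then operate on the representatives rather than on the full dataset.

```python
# Single-scan prototype selection (leader algorithm): each point joins the first
# existing leader within a distance threshold, otherwise it becomes a new leader.
import numpy as np

def leader_clustering(data, threshold):
    """Return representative points (leaders) using one pass over `data`."""
    leaders = []
    for point in data:
        if not any(np.linalg.norm(point - lead) <= threshold for lead in leaders):
            leaders.append(point)
    return np.array(leaders)

rng = np.random.default_rng(2)
data = rng.normal(size=(5_000, 4))
representatives = leader_clustering(data, threshold=1.5)
print(f"{len(representatives)} representatives stand in for {len(data)} points")
```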


2021 ◽  
Vol 2 ◽  
pp. 125-131
Author(s):  
Martin Wallner ◽  
Tomáš Peráček

Data has become one of the most valuable resources for companies. The large data volumes of Big Data projects allow institutions to apply various data analysis methods. Compared to older analysis methods, which mostly have an informative function, predictive and prescriptive analysis methods allow foresight and the prevention of future problems and errors. This paper evaluates the current state of advanced data analysis in Austrian industrial companies. Furthermore, it investigates whether the advantages of complex data analyses can be monetized and whether corporate figures such as turnover or company size influence the answers to the survey. For that reason, a survey among industrial companies in Austria was performed to assess the usage of complex data analysis methods and Big Data. It is shown that small companies use descriptive and diagnostic analysis methods, while big companies use more advanced analytical methods. Companies with a high turnover are also more likely to perform Big Data projects. In an international comparison, Big Data is not the main focus of the IT departments of most Austrian industrial companies, and modern data architectures are not as extensively implemented as in other countries of the DACH region. However, there is a clear perception among Austrian industrial companies that forward-looking data analysis methods will be predominant in five years.


Author(s):  
Lokukaluge P. Perera ◽  
Brage Mo

An overview of data veracity issues in ship performance and navigation monitoring, in relation to data sets collected from a selected vessel, is presented in this study. Data veracity relates to the quality of ship performance and navigation parameters obtained by onboard IoT (Internet of Things) systems. Industrial IoT can introduce various anomalies into measured ship performance and navigation parameters, and these anomalies can degrade the outcome of the respective data analysis. Therefore, the identification and isolation of such data anomalies can play an important role in the outcome of ship performance and navigation monitoring. In general, these data anomalies can be divided into sensor and data acquisition (DAQ) faults and system abnormal events. A considerable amount of domain knowledge is required to detect and classify such data anomalies; therefore, data anomaly detection layers are proposed in this study for that purpose. These layers are divided into two levels: preliminary and advanced. The outcome of a preliminary anomaly detection layer with respect to ship performance and navigation data sets of a selected vessel is presented, together with the respective data handling challenges, as the main contribution of this study.
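A preliminary detection layer of the kind described above can be illustrated with simple range and stuck-sensor checks, as in the sketch below. The parameter names, thresholds and window lengths are hypothetical and are not taken from the study; they only show how sensor/DAQ faults might be flagged before any advanced analysis.

```python
# Hypothetical preliminary anomaly-detection layer: range checks and
# frozen-sensor checks flag likely sensor/DAQ faults in ship parameters.
# Parameter names and limits are illustrative, not from the study.
import pandas as pd

VALID_RANGES = {
    "speed_knots": (0.0, 30.0),
    "shaft_rpm": (0.0, 120.0),
    "fuel_tonnes_per_day": (0.0, 60.0),
}

def preliminary_layer(df: pd.DataFrame) -> pd.DataFrame:
    """Flag out-of-range readings and frozen (constant) sensor windows."""
    flags = pd.DataFrame(index=df.index)
    for col, (low, high) in VALID_RANGES.items():
        flags[f"{col}_out_of_range"] = (df[col] < low) | (df[col] > high)
        # A sensor repeating essentially the same value for 30 samples is treated as stuck.
        flags[f"{col}_stuck"] = df[col].rolling(30).std() < 1e-9
    return flags

# Example usage with a toy data frame:
data = pd.DataFrame({"speed_knots": [12.0] * 40, "shaft_rpm": [80.0] * 40,
                     "fuel_tonnes_per_day": [25.0] * 39 + [999.0]})
print(preliminary_layer(data).any())
```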


2021 ◽  
Author(s):  
Amy Bednar

A growing area of mathematics, topological data analysis (TDA), uses fundamental concepts of topology to analyze complex, high-dimensional data. A topological network represents the data, and TDA uses the network to analyze the shape of the data and identify features in the network that correspond to patterns in the data. These patterns make it possible to extract knowledge from the data. TDA provides a framework to advance machine learning’s ability to understand and analyze large, complex data. This paper provides background information about TDA, TDA applications for large data sets, and details related to the investigation and implementation of existing tools and environments.
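One way a topological network can represent data is a Mapper-style construction: project the data with a filter function, cover the projection with overlapping intervals, cluster each slice, and connect clusters that share points. The sketch below is a deliberately simplified illustration of that idea under those assumptions, not one of the production TDA tools the paper investigates.

```python
# Simplified Mapper-style construction of a topological network from data:
# overlapping slices of a filter function are clustered, clusters become nodes,
# and nodes sharing points are connected by edges.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
filter_values = X[:, 0]                      # a simple filter: the first coordinate

edges, nodes = set(), []
lo, hi = filter_values.min(), filter_values.max()
intervals = np.linspace(lo, hi, 6)
for i in range(len(intervals) - 1):
    # Overlapping slice of the filter range.
    a, b = intervals[i], intervals[i + 1] + 0.25 * (hi - lo) / 5
    idx = np.where((filter_values >= a) & (filter_values <= b))[0]
    if len(idx) == 0:
        continue
    labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X[idx])
    for lab in set(labels) - {-1}:
        members = set(idx[labels == lab])
        # Connect the new node to any earlier node it shares points with.
        for j, earlier in enumerate(nodes):
            if members & earlier:
                edges.add((j, len(nodes)))
        nodes.append(members)

print(f"network with {len(nodes)} nodes and {len(edges)} edges")
```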


2004 ◽  
Vol 95 (2) ◽  
pp. 97-101 ◽  
Author(s):  
Hongyuan Sun ◽  
Qiye Wen ◽  
Peixin Zhang ◽  
Jianhong Liu ◽  
Qianling Zhang ◽  
...  
