Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets

Electronics ◽  
2020 ◽  
Vol 9 (7) ◽  
pp. 1164
Author(s):  
João Henriques ◽  
Filipe Caldeira ◽  
Tiago Cruz ◽  
Paulo Simões

Computing and networking systems traditionally record their activity in log files, which have been used for multiple purposes, such as troubleshooting, accounting, post-incident analysis of security breaches, capacity planning and anomaly detection. In earlier systems those log files were processed manually by system administrators, or with the support of basic applications for filtering, compiling and pre-processing the logs for specific purposes. However, as the volume of these log files continues to grow (more logs per system, more systems per domain), it is becoming increasingly difficult to process those logs using traditional tools, especially for less straightforward purposes such as anomaly detection. On the other hand, as systems continue to become more complex, the potential of using large datasets built of logs from heterogeneous sources for detecting anomalies without prior domain knowledge becomes higher. Anomaly detection tools for such scenarios face two challenges. First, devising appropriate data analysis solutions for effectively detecting anomalies from large data sources, possibly without prior domain knowledge. Second, adopting data processing platforms able to cope with the large datasets and complex data analysis algorithms required for such purposes. In this paper we address those challenges by proposing an integrated scalable framework that aims at efficiently detecting anomalous events in large amounts of unlabeled data logs. Detection is supported by clustering and classification methods that take advantage of parallel computing environments. We validate our approach using the well-known NASA Hypertext Transfer Protocol (HTTP) log datasets. Fourteen features were extracted in order to train a k-means model that separates anomalous and normal events into highly coherent clusters. A second model, based on the XGBoost implementation of a gradient tree boosting algorithm, uses the binary clustered data from the first stage to produce a set of simple, interpretable rules. These rules provide the rationale for generalizing the classification over a massive number of unseen events in a distributed computing environment. The classified anomaly events produced by our framework can be used, for instance, as candidates for further forensic and compliance auditing analysis in security management.
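As a rough illustration of the two-stage approach described above, the sketch below clusters extracted log features with k-means and then fits an XGBoost classifier to the resulting binary cluster labels. It is a minimal sketch, not the paper's implementation: the feature matrix `features` is a hypothetical placeholder for the 14 extracted log features, and the paper's feature extraction and distributed execution environment are not reproduced here.

```python
# Minimal sketch of the two-stage approach: k-means separates events into two
# clusters, then a gradient-boosted tree model is trained on those cluster
# labels so that its rules can be applied to unseen events.
import numpy as np
from sklearn.cluster import KMeans
import xgboost as xgb

rng = np.random.default_rng(0)
features = rng.normal(size=(10_000, 14))   # hypothetical stand-in for the 14 extracted log features

# Stage 1: unsupervised separation into "normal" vs. "anomalous" clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = kmeans.labels_                    # binary pseudo-labels from clustering

# Stage 2: supervised model that generalizes the clustering as a set of tree rules.
booster = xgb.XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
booster.fit(features, labels)

# The fitted trees can be dumped in human-readable form and applied to new events.
print(len(booster.get_booster().get_dump()), "trees dumped as rule sets")
new_events = rng.normal(size=(5, 14))
print(booster.predict(new_events))
```

Because the second stage is an ordinary classifier, it can be serialized and scored over very large volumes of unseen events independently of the clustering step.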

2010 ◽  
pp. 1797-1803
Author(s):  
Lisa Friedland

In traditional data analysis, data points lie in a Cartesian space, and an analyst asks certain questions: (1) What distribution can I fit to the data? (2) Which points are outliers? (3) Are there distinct clusters or substructure? Today, data mining treats richer and richer types of data. Social networks encode information about people and their communities; relational data sets incorporate multiple types of entities and links; and temporal information describes the dynamics of these systems. With such semantically complex data sets, a greater variety of patterns can be described and views constructed of the data. This article describes a specific social structure that may be present in such data sources and presents a framework for detecting it. The goal is to identify tribes, or small groups of individuals that intentionally coordinate their behavior—individuals with enough in common that they are unlikely to be acting independently. While this task can only be conceived of in a domain of interacting entities, the solution techniques return to the traditional data analysis questions. In order to find hidden structure (3), we use an anomaly detection approach: develop a model to describe the data (1), then identify outliers (2).
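The model-then-outliers pattern referenced in points (1) and (2) can be shown with a minimal generic sketch; this is not the article's tribe-detection method, only an illustration of the underlying anomaly detection approach. A mixture model is fitted to the points, and the lowest-likelihood points are flagged as outlier candidates.

```python
# Generic "fit a model (1), then identify outliers (2)" sketch, not the
# article's tribe-detection technique. Points whose likelihood under a fitted
# mixture model falls in the lowest percentile are reported as anomalies.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(6, 0.5, (20, 2))])

model = GaussianMixture(n_components=2, random_state=1).fit(points)  # (1) describe the data
log_likelihood = model.score_samples(points)
threshold = np.percentile(log_likelihood, 2)
outliers = np.where(log_likelihood < threshold)[0]                   # (2) identify outliers
print(f"{len(outliers)} candidate outliers")
```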


Author(s):  
Edwin Diday ◽  
M. Narasimha Murthy

In data mining, we generate class/cluster models from large datasets. Symbolic Data Analysis (SDA) is a powerful tool that permits dealing with complex data (Diday, 1988), where a combination of variables and logical and hierarchical relationships among them is used. Such a view permits us to deal with data at a conceptual level, and as a consequence, SDA is ideally suited for data mining. Symbolic data have their own internal structure, which necessitates new techniques that generally differ from the ones used on conventional data (Billard & Diday, 2003). Clustering generates abstractions that can be used in a variety of decision-making applications (Jain, Murty, & Flynn, 1999). In this article, we deal with the application of clustering to SDA.
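As a small illustration of what clustering symbolic data can look like, the sketch below groups interval-valued objects (one common kind of symbolic data) using hierarchical clustering on a Hausdorff-style distance between intervals. This is only a simplified example under those assumptions; the SDA methods discussed in the article are richer.

```python
# Illustrative sketch: objects described by interval-valued variables are
# clustered with a Hausdorff-style distance max(|low diff|, |high diff|),
# summed over variables. Not the article's specific SDA methodology.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Each object: two interval-valued variables, stored as (low, high) pairs.
objects = np.array([
    [[1, 3], [10, 14]],
    [[2, 4], [11, 15]],
    [[8, 9], [30, 35]],
    [[7, 10], [29, 36]],
])

def interval_distance(a, b):
    """Sum over variables of the Hausdorff distance between two intervals."""
    return sum(max(abs(ia[0] - ib[0]), abs(ia[1] - ib[1])) for ia, ib in zip(a, b))

n = len(objects)
dist = np.array([[interval_distance(objects[i], objects[j]) for j in range(n)] for i in range(n)])
clusters = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")
print(clusters)   # e.g. [1 1 2 2]: two groups of symbolic objects
```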


Data Mining ◽  
2013 ◽  
pp. 734-750
Author(s):  
T. Ravindra Babu ◽  
M. Narasimha Murty ◽  
S. V. Subrahmanya

Data mining deals with efficient algorithms for handling large data. When such algorithms are combined with data compaction, they can lead to superior performance. One approach to dealing with large data is to work with representatives of the data instead of the entire dataset; these representatives should preferably be generated with minimal data scans. In the current chapter we discuss lossy and non-lossy data compression methods combined with clustering and classification of large datasets, and we demonstrate the working of such schemes on two large data sets.
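One classic way to obtain representatives with a single data scan is the leader algorithm, sketched below. This is a generic illustration of the idea, not necessarily the exact compression scheme used in the chapter; downstream clustering or classification would then operate on the representatives rather than on the full dataset.

```python
# Single-scan prototype selection (leader algorithm): each point joins the first
# existing leader within a distance threshold, otherwise it becomes a new leader.
import numpy as np

def leader_clustering(data, threshold):
    """Return representative points (leaders) using one pass over `data`."""
    leaders = []
    for point in data:
        if not any(np.linalg.norm(point - lead) <= threshold for lead in leaders):
            leaders.append(point)
    return np.array(leaders)

rng = np.random.default_rng(2)
data = rng.normal(size=(5_000, 4))
representatives = leader_clustering(data, threshold=1.5)
print(f"{len(representatives)} representatives stand in for {len(data)} points")
```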


2021 ◽  
Vol 2 ◽  
pp. 125-131
Author(s):  
Martin Wallner ◽  
Tomáš Peráček

Data has become one of the most valuable resources for companies. The large data volumes of Big Data projects allow institutions to apply various data analysis methods. Compared to older analysis methods, which mostly have an informative function, predictive and prescriptive analysis methods allow foresight and the prevention of future problems and errors. This paper evaluates the current state of advanced data analysis in Austrian industrial companies. Furthermore, it investigates whether the advantages of complex data analyses can be monetized and whether corporate figures such as turnover or company size influence the answers to the survey. For that reason, a survey among industrial companies in Austria was performed to assess the usage of complex data analysis methods and Big Data. It is shown that small companies use descriptive and diagnostic analysis methods, while big companies use more advanced analytical methods. Companies with a high turnover are also more likely to perform Big Data projects. In an international comparison, Big Data is not the main focus of the IT departments of most Austrian industrial companies, and modern data architectures are not as extensively implemented as in other countries of the DACH region. However, there is a clear perception among Austrian industrial companies that forward-looking data analysis methods will be predominant in five years.


Author(s):  
Lokukaluge P. Perera ◽  
Brage Mo

An overview of data veracity issues in ship performance and navigation monitoring, in relation to data sets collected from a selected vessel, is presented in this study. Data veracity relates to the quality of ship performance and navigation parameters obtained by onboard IoT (Internet of Things) systems. Industrial IoT can introduce various anomalies into measured ship performance and navigation parameters, and these anomalies can degrade the outcome of the respective data analysis. Therefore, the identification and isolation of such data anomalies can play an important role in the outcome of ship performance and navigation monitoring. In general, these data anomalies can be divided into sensor and data acquisition (DAQ) faults and system abnormal events. A considerable amount of domain knowledge is required to detect and classify such data anomalies; therefore, data anomaly detection layers are proposed in this study for that purpose. These layers are divided into two levels: preliminary and advanced. The outcome of a preliminary anomaly detection layer with respect to ship performance and navigation data sets of a selected vessel is presented, together with the respective data handling challenges, as the main contribution of this study.
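A preliminary detection layer of the kind described above can be illustrated with simple range and stuck-sensor checks, as in the sketch below. The parameter names, thresholds and window lengths are hypothetical and are not taken from the study; they only show how sensor/DAQ faults might be flagged before any advanced analysis.

```python
# Hypothetical preliminary anomaly-detection layer: range checks and
# frozen-sensor checks flag likely sensor/DAQ faults in ship parameters.
# Parameter names and limits are illustrative, not from the study.
import pandas as pd

VALID_RANGES = {
    "speed_knots": (0.0, 30.0),
    "shaft_rpm": (0.0, 120.0),
    "fuel_tonnes_per_day": (0.0, 60.0),
}

def preliminary_layer(df: pd.DataFrame) -> pd.DataFrame:
    """Flag out-of-range readings and frozen (constant) sensor windows."""
    flags = pd.DataFrame(index=df.index)
    for col, (low, high) in VALID_RANGES.items():
        flags[f"{col}_out_of_range"] = (df[col] < low) | (df[col] > high)
        # A sensor repeating essentially the same value for 30 samples is treated as stuck.
        flags[f"{col}_stuck"] = df[col].rolling(30).std() < 1e-9
    return flags

# Example usage with a toy data frame:
data = pd.DataFrame({"speed_knots": [12.0] * 40, "shaft_rpm": [80.0] * 40,
                     "fuel_tonnes_per_day": [25.0] * 39 + [999.0]})
print(preliminary_layer(data).any())
```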


2021 ◽  
Author(s):  
Amy Bednar

A growing area of mathematics, topological data analysis (TDA), uses fundamental concepts of topology to analyze complex, high-dimensional data. A topological network represents the data, and TDA uses the network to analyze the shape of the data and identify features in the network that correspond to patterns in the data. These patterns make it possible to extract knowledge from the data. TDA provides a framework to advance machine learning’s ability to understand and analyze large, complex data. This paper provides background information about TDA, TDA applications for large data sets, and details related to the investigation and implementation of existing tools and environments.
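One way a topological network can represent data is a Mapper-style construction: project the data with a filter function, cover the projection with overlapping intervals, cluster each slice, and connect clusters that share points. The sketch below is a deliberately simplified illustration of that idea under those assumptions, not one of the production TDA tools the paper investigates.

```python
# Simplified Mapper-style construction of a topological network from data:
# overlapping slices of a filter function are clustered, clusters become nodes,
# and nodes sharing points are connected by edges.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
filter_values = X[:, 0]                      # a simple filter: the first coordinate

edges, nodes = set(), []
lo, hi = filter_values.min(), filter_values.max()
intervals = np.linspace(lo, hi, 6)
for i in range(len(intervals) - 1):
    # Overlapping slice of the filter range.
    a, b = intervals[i], intervals[i + 1] + 0.25 * (hi - lo) / 5
    idx = np.where((filter_values >= a) & (filter_values <= b))[0]
    if len(idx) == 0:
        continue
    labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X[idx])
    for lab in set(labels) - {-1}:
        members = set(idx[labels == lab])
        # Connect the new node to any earlier node it shares points with.
        for j, earlier in enumerate(nodes):
            if members & earlier:
                edges.add((j, len(nodes)))
        nodes.append(members)

print(f"network with {len(nodes)} nodes and {len(edges)} edges")
```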


2004 ◽  
Vol 95 (2) ◽  
pp. 97-101 ◽  
Author(s):  
Hongyuan Sun ◽  
Qiye Wen ◽  
Peixin Zhang ◽  
Jianhong Liu ◽  
Qianling Zhang ◽  
...  
