An Anomaly Detection and Explainability Framework using Convolutional Autoencoders for Data Storage Systems

Author(s):  
Roy Assaf ◽  
Ioana Giurgiu ◽  
Jonas Pfefferle ◽  
Serge Monney ◽  
Haris Pozidis ◽  
...  

Anomaly detection in data storage systems is a challenging problem due to the high-dimensional sequential data involved and the lack of labels. The state of the art for automating anomaly detection in these systems typically relies on hand-crafted rules and thresholds, which mainly distinguish between normal and abnormal behavior of each indicator in isolation. In this work we present an end-to-end framework based on convolutional autoencoders which not only allows for anomaly detection on multivariate time series data, but also provides explainability. This is done by identifying similar historic anomalies and extracting the most influential indicators. These are then presented to relevant personnel, such as system designers and architects, or to support engineers for further analysis. We demonstrate the application of this framework along with an intuitive interactive web interface which was developed for data storage system anomaly detection. We discuss how this framework, along with its explainability aspects, enables support engineers to effectively tackle abnormal behaviors, all while allowing for crucial feedback.
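The abstract does not say how the most influential indicators are extracted. As a minimal sketch, assuming that an autoencoder has already produced a reconstruction of a window and that influence is measured by per-indicator reconstruction error (both assumptions, as are the names `indicator_errors`, `explain`, and the example indicators), the ranking step could look like this:

```python
from typing import Dict, List, Tuple


def indicator_errors(window: Dict[str, List[float]],
                     reconstruction: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean squared reconstruction error for each indicator in one time window."""
    errors = {}
    for name, values in window.items():
        recon = reconstruction[name]
        errors[name] = sum((v - r) ** 2 for v, r in zip(values, recon)) / len(values)
    return errors


def explain(window: Dict[str, List[float]],
            reconstruction: Dict[str, List[float]],
            threshold: float,
            top_k: int = 2) -> Tuple[bool, List[str]]:
    """Flag the window if its average error exceeds the threshold (an assumed rule)
    and return the top_k indicators with the largest error."""
    errors = indicator_errors(window, reconstruction)
    score = sum(errors.values()) / len(errors)
    ranked = sorted(errors, key=errors.get, reverse=True)[:top_k]
    return score > threshold, ranked


# Hypothetical window: latency spikes at the last step, the model reconstructs "normal".
window = {"latency": [1.0, 1.0, 9.0], "iops": [5.0, 5.0, 5.0], "cpu": [2.0, 2.0, 3.0]}
reconstruction = {"latency": [1.0, 1.0, 1.0], "iops": [5.0, 5.0, 5.0], "cpu": [2.0, 2.0, 2.0]}
is_anomaly, influential = explain(window, reconstruction, threshold=1.0)
```

Here the window is flagged and `latency` ranks first, which is the kind of output a support engineer could start from.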

Water ◽  
2021 ◽  
Vol 13 (12) ◽  
pp. 1633
Author(s):  
Elena-Simona Apostol ◽  
Ciprian-Octavian Truică ◽  
Florin Pop ◽  
Christian Esposito

Due to the exponential growth of Internet of Things networks and the massive amount of time series data collected from these networks, it is essential to apply efficient methods for Big Data analysis in order to extract meaningful information and statistics. Anomaly detection is an important part of time series analysis, improving the quality of further analysis, such as prediction and forecasting. Thus, detecting sudden change points that reflect normal behavior and distinguishing them from abnormal behavior, i.e., outliers, is a crucial step for minimizing the false positive rate and building accurate machine learning models for prediction and forecasting. In this paper, we propose a rule-based decision system that enhances anomaly detection in multivariate time series using change point detection. Our architecture uses a pipeline that automatically detects real anomalies and removes the false positives introduced by change points. We employ both traditional and deep learning unsupervised algorithms: in total, five anomaly detection and five change point detection algorithms. Additionally, we propose a new confidence metric based on the support for a time series point to be an anomaly and the support for the same point to be a change point. In our experiments, we use a large real-world dataset containing multivariate time series about water consumption collected from smart meters. As an evaluation metric, we use Mean Absolute Error (MAE). The low MAE values show that the algorithms accurately determine anomalies and change points. The experimental results strengthen our assumption that anomaly detection can be improved by determining and removing change points, and validate the correctness of our proposed rules in real-world scenarios.
Furthermore, the proposed rule-based decision support systems enable users to make informed decisions regarding the status of the water distribution network and to effectively perform predictive and proactive maintenance.
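The abstract does not give the rules or the formula of the confidence metric. The following is only a sketch of the general idea, assuming that "support" is the fraction of detectors that flag a point and that confidence is the difference between the two supports; the function names, the 0.5 cut-off, and the confidence formula are all hypothetical:

```python
from typing import List, Tuple


def support(votes: List[bool]) -> float:
    """Fraction of detectors that flagged the point."""
    return sum(votes) / len(votes)


def classify_point(anomaly_votes: List[bool],
                   change_votes: List[bool],
                   min_support: float = 0.5) -> Tuple[str, float]:
    """Label one time series point from the votes of several detectors.

    A point backed by the change point detectors is treated as a change point,
    so an anomaly flag on it is discarded as a false positive.
    """
    a = support(anomaly_votes)
    c = support(change_votes)
    confidence = a - c  # assumed form: high when only anomaly detectors agree
    if c >= min_support:
        return "change_point", confidence
    if a >= min_support:
        return "anomaly", confidence
    return "normal", confidence


# Five anomaly detectors and five change point detectors vote on one point.
label, confidence = classify_point(
    anomaly_votes=[True, True, True, True, False],
    change_votes=[False, False, False, False, False],
)
```

With these votes the point is labeled an anomaly; if most change point detectors had also fired, the same point would be relabeled as a change point.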


2021 ◽  
Vol 100 ◽  
pp. 106919
Author(s):  
Jinbo Li ◽  
Hesam Izakian ◽  
Witold Pedrycz ◽  
Iqbal Jamal

Author(s):  
Igor Boyarshin ◽  
Anna Doroshenko ◽  
Pavlo Rehida

The article describes a new method for improving the efficiency of systems that use replication to store and provide access to data shared by many users. Existing methods of load balancing in data storage systems are described, namely round robin (RR) and weighted round robin (WRR). A new method of balancing requests among multiple data storage nodes is proposed that adjusts to the intensity of the input request stream in real time while using disk space efficiently.
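For readers unfamiliar with the two baseline methods, here is a minimal sketch of RR and a simple form of WRR. It does not reproduce the article's proposed adaptive method, and the node names and weights are made up for illustration:

```python
from itertools import cycle
from typing import Dict, Iterator, List


def round_robin(nodes: List[str]) -> Iterator[str]:
    """Hand out nodes in a fixed rotating order, one request each."""
    return cycle(nodes)


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Repeat each node in the rotation as many times as its integer weight."""
    schedule = [node for node, weight in weights.items() for _ in range(weight)]
    return cycle(schedule)


rr = round_robin(["node_a", "node_b", "node_c"])
rr_order = [next(rr) for _ in range(4)]

# node_a is assumed to have twice the capacity of node_b.
wrr = weighted_round_robin({"node_a": 2, "node_b": 1})
wrr_order = [next(wrr) for _ in range(6)]
```

Both schedules are fixed in advance, which is the limitation an approach that reacts to request intensity in real time would aim to remove.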


Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Yaw Nti-Addae ◽  
Dave Matthews ◽  
Victor Jun Ulat ◽  
Raza Syed ◽  
Guilhem Sempéré ◽  
...  

Motivation: With high-throughput genotyping systems now available, it has become feasible to fully integrate genotyping information into breeding programs. Making effective use of this information requires DNA extraction facilities and marker production facilities that can efficiently deploy the desired set of markers across samples with a turnaround time rapid enough to allow selection before crosses need to be made. In reality, breeders often have a short window of time to make decisions by the time they are able to collect all their phenotyping data and receive the corresponding genotyping data. This makes it challenging to organize information and use it in downstream analyses to support breeders' decisions. In order to implement genomic selection routinely as part of breeding programs, one needs an efficient genotyping data storage system. We selected and benchmarked six popular open-source data storage systems, including relational database management and columnar storage systems.
Results: We found that data extract times are greatly influenced by the orientation in which genotype data is stored in a system. HDF5 consistently performed best, in part because it can more efficiently work with both orientations of the allele matrix.
Availability: http://gobiin1.bti.cornell.edu:6083/projects/GBM/repos/benchmarking/browse
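To illustrate why orientation matters, the sketch below stores a tiny allele matrix as one flat sequence, as a file or table ultimately would. It is not taken from the benchmark; the helper names and genotype values are hypothetical. Reading along the stored orientation is one contiguous slice, while reading across it touches one element per stored row:

```python
from typing import List


def to_flat(matrix: List[List[str]]) -> List[str]:
    """Lay the matrix out row after row (row-major order)."""
    return [value for row in matrix for value in row]


def read_row(flat: List[str], n_cols: int, i: int) -> List[str]:
    """Contiguous read: all values of stored row i."""
    return flat[i * n_cols:(i + 1) * n_cols]


def read_col(flat: List[str], n_cols: int, j: int) -> List[str]:
    """Strided read: one value out of every stored row."""
    return flat[j::n_cols]


# Sample-major orientation: each row is one sample, each column one marker.
by_sample = [["AA", "AT", "TT"],
             ["AT", "AT", "AA"]]
flat_by_sample = to_flat(by_sample)

# Marker-major orientation: the transpose of the same data.
by_marker = [list(col) for col in zip(*by_sample)]
flat_by_marker = to_flat(by_marker)

# All samples for marker 0: strided in one layout, contiguous in the other.
strided = read_col(flat_by_sample, n_cols=3, j=0)
contiguous = read_row(flat_by_marker, n_cols=2, i=0)
```

Both calls return the same genotypes, but only the second reads neighboring elements, which is the access pattern storage systems generally serve fastest.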


2020 ◽  
pp. 082-093
Author(s):  
S.Yu. Punda

A review of modern data storage architectures was conducted, and the advantages and disadvantages of each were given. The data storage systems of the IBM FlashSystem family were analyzed, as well as Spectrum Virtualize software, which is responsible for virtualization, compression, distribution and replication of data stored on the storage system. A mathematical model of the IBM Storwize v5030E data storage system was developed. Well-known metrics are used to evaluate its performance when using spindle and solid-state drives. The effect of hardware and software data compression on system performance has been shown experimentally. Recommendations are formulated to help a business user determine which media and which technology stack to use for their tasks.


2021 ◽  
Vol 2132 (1) ◽  
pp. 012047
Author(s):  
Yu Ye ◽  
Bailin Feng ◽  
Wujun Tao

One of the bottlenecks restricting the development of the electric vehicle industry is the safety problem. Although numerous anomaly detection algorithms for electric vehicles have been proposed, most of them may perform poorly due to the complexity and unpredictability of real scenes. We consider that there may be a certain degree of potential safety hazard in the battery system of electric vehicles before, during and after the process of faults in real scenes, that is, label noise. In order to solve this problem, we propose a Multi-Instance Learning based Anomaly Detection (MILAD) framework to perform anomaly detection for electric vehicles under label noise. Extensive cross-validation experiments verify that the framework can effectively detect the existence of abnormal conditions in the presence of label noise in multivariate time series data.
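The abstract does not describe MILAD's internals. The sketch below only illustrates the standard multi-instance learning assumption that such a framework builds on: labels are attached to bags of instances rather than to single instances, and a bag is positive if at least one of its instances is. The per-instance scores, bag names, and threshold are hypothetical:

```python
from typing import Dict, List


def bag_score(instance_scores: List[float]) -> float:
    """Score a bag by its most anomalous instance (max pooling)."""
    return max(instance_scores)


def predict_bags(bags: Dict[str, List[float]], threshold: float = 0.5) -> Dict[str, bool]:
    """Label each bag as anomalous if its pooled score reaches the threshold."""
    return {bag_id: bag_score(scores) >= threshold for bag_id, scores in bags.items()}


# Each bag is a run of time windows from one vehicle; scores are assumed to come
# from some per-window anomaly model.
bags = {
    "vehicle_1": [0.1, 0.2, 0.9],  # one abnormal window is enough
    "vehicle_2": [0.1, 0.3, 0.2],
}
predictions = predict_bags(bags)
```

Because only the bag needs a label, it does not matter that the exact windows before, during, or after a fault are unknown, which is how this setting tolerates noisy instance labels.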


Author(s):  
Richard S. Segall ◽  
Jeffrey S. Cook

This chapter gives a detailed discussion of storage systems for data-intensive computing using Big Data. The chapter begins with a brief introduction to data-intensive computing and the types of parallel processing approaches. It also highlights how data-intensive computing systems differ from other forms of computing. The importance of Big Data computing is discussed. The current and future challenges of storage in genomics are discussed in detail, and storage and data management strategies are given. The chapter then focuses on the software challenges for storage. Storage use cases, such as DataDirect Networks and SDSC, are provided. A list of storage tools and their details is provided. A small section discusses the sensor data storage system. A table then shows the top 10 cloud storage systems in the world for data-intensive computing using Big Data. Top 500 Big Data storage server statistics are also shown using images from the Top500 website.


Author(s):  
Hongsheng Yan ◽  
Jianzhong Sun ◽  
Hongfu Zuo

It is almost impossible to detect the health status of the aircraft hydraulic system via a single variable, because of the complexity of the system and the coupling between its components. To this end, a novel anomaly detection method considering multivariate monitoring data is proposed in this article. An unsupervised auto-encoder model with long short-term memory layers is used to reconstruct multivariate time series data, and a new comprehensive decision-making index based on two conventional ones is proposed to measure the difference between the observation and the reconstruction. Then, the health threshold of the decision-making index is calculated by kernel density estimation. The flight data are divided into several samples, and whether each sample is anomalous is determined by a specific rule. The health status of each flight is determined by voting based on the detection results of all samples included in the flight. The performance of the proposed method is validated on real continuous monitoring data, and the results confirm that the proposed model overcomes the problems of multistage and multivariate parameters in the anomaly detection of the aircraft system and improves the detection efficiency.
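The thresholding and voting steps can be sketched as follows. The abstract does not state the decision-making index, the sample rule, or the voting scheme, so this sketch assumes a Gaussian kernel with a hand-picked bandwidth, a threshold at the 0.99 quantile of the estimated density, and a simple majority vote; the function names and all numbers are hypothetical:

```python
import math
from typing import List


def kde_cdf(x: float, data: List[float], bandwidth: float) -> float:
    """CDF of a Gaussian kernel density estimate fitted to data."""
    scale = bandwidth * math.sqrt(2.0)
    return sum(0.5 * (1.0 + math.erf((x - d) / scale)) for d in data) / len(data)


def kde_threshold(data: List[float], bandwidth: float, quantile: float = 0.99) -> float:
    """Find the index value below which the estimated density holds `quantile` mass."""
    lo = min(data) - 10.0 * bandwidth
    hi = max(data) + 10.0 * bandwidth
    for _ in range(60):  # bisection on the monotone CDF
        mid = (lo + hi) / 2.0
        if kde_cdf(mid, data, bandwidth) < quantile:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def flight_is_anomalous(sample_scores: List[float], threshold: float) -> bool:
    """Majority vote over the samples of one flight (an assumed voting rule)."""
    votes = [score > threshold for score in sample_scores]
    return sum(votes) > len(votes) / 2


# Index values of samples from healthy flights (made-up reconstruction errors).
healthy_scores = [0.10, 0.12, 0.09, 0.11, 0.10]
threshold = kde_threshold(healthy_scores, bandwidth=0.01)

faulty_flight = flight_is_anomalous([0.30, 0.30, 0.10], threshold)
healthy_flight = flight_is_anomalous([0.10, 0.10, 0.30], threshold)
```

Voting over samples means that a single outlying sample, as in the second flight, does not by itself mark the whole flight as unhealthy.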

