Detecting Metachanges in Data Streams from the Viewpoint of the MDL Principle

This paper addresses the issue of how we can detect changes of changes, which we call metachanges, in data streams. A metachange refers to a change in the patterns of when and how changes occur, referred to as “metachanges along time” and “metachanges along state”, respectively. Metachanges along time mean that the intervals between change points vary significantly, whereas metachanges along state mean that the magnitude of changes varies. It is practically important to detect metachanges because they may be early warning signals of important events. This paper introduces a novel notion of metachange statistics as a measure of the degree of a metachange. The key idea is to integrate metachanges along both time and state in terms of “code length” according to the minimum description length (MDL) principle. We develop an online metachange detection algorithm (MCD) based on these statistics so that it can be applied to a data stream. With synthetic datasets, we demonstrated that MCD detects metachanges earlier and more accurately than existing methods. With real datasets, we demonstrated that MCD can lead to the discovery of important events that might be overlooked by conventional change detection methods.

Change sign detection with differential MDL change statistics and its applications to COVID-19 pandemic analysis

Scientific Reports ◽ 10.1038/s41598-021-98781-4 ◽ 2021 ◽ Vol 11 (1)

Author(s): Kenji Yamanishi ◽ Linchuan Xu ◽ Ryo Yuki ◽ Shintaro Fukushima ◽ Chuan-hao Lin

Keyword(s): Time Series ◽ Early Warning ◽ Data Science ◽ Minimum Description Length ◽ Length Change ◽ Warning Signals ◽ Early Warning Signals ◽ New Information ◽ Long Time ◽ Synthetic Datasets

We are concerned with the issue of detecting changes and their signs from a data stream. For example, when given time series of COVID-19 cases in a region, we may raise early warning signals of an epidemic by detecting signs of changes in the data. We propose a novel methodology to address this issue. The key idea is to employ a new information-theoretic notion, which we call the differential minimum description length change statistics (D-MDL), for measuring the scores of change signs. We first give a fundamental theory for D-MDL. We then demonstrate its effectiveness using synthetic datasets. We apply it to detecting early warning signals of the COVID-19 epidemic using time series of the cases for individual countries. We empirically demonstrate that D-MDL is able to raise early warning signals of events such as a significant increase or decrease of cases. Remarkably, for about 64% of the events of significant increase of cases in the studied countries, our method can detect warning signals nearly six days on average before the events, buying considerable time for a response. We further relate the warning signals to the dynamics of the basic reproduction number R0 and the timing of social distancing. The results show that our method is a promising approach to epidemic analysis from a data science viewpoint.
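As a rough illustration of the idea (not the authors' implementation), the sketch below scores a candidate change point by how much the total code length of a window shrinks when the window is split there, and then takes discrete first and second differences of that score over time. The Gaussian model, the omission of the parametric-complexity term, the variance floor, and all function names are assumptions made for this example.

```python
import math

def gaussian_code_length(xs):
    """Approximate code length (in nats) of xs under a Gaussian whose
    mean and variance are fitted by maximum likelihood.
    Assumption: the parametric-complexity term is omitted and the
    variance is floored to keep the logarithm finite."""
    n = len(xs)
    mean = sum(xs) / n
    var = max(sum((x - mean) ** 2 for x in xs) / n, 1e-8)
    return 0.5 * n * (math.log(2 * math.pi * var) + 1)

def mdl_change_score(window, t):
    """Per-sample reduction in code length obtained by encoding
    window[:t] and window[t:] separately instead of jointly."""
    joint = gaussian_code_length(window)
    split = gaussian_code_length(window[:t]) + gaussian_code_length(window[t:])
    return (joint - split) / len(window)

def differential_scores(window, t):
    """Discrete first and second differences of the change score
    around t (requires 2 <= t <= len(window) - 2)."""
    s_prev = mdl_change_score(window, t - 1)
    s_curr = mdl_change_score(window, t)
    s_next = mdl_change_score(window, t + 1)
    return s_next - s_curr, s_next - 2 * s_curr + s_prev

# Hypothetical data: the mean jumps from about 0 to about 5 at index 20.
window = [0.1, -0.1] * 10 + [5.1, 4.9] * 10
print(mdl_change_score(window, 20), mdl_change_score(window, 10))
print(differential_scores(window, 20))
```

On this toy series the score is largest at the true change point, and the second difference is negative there, which is the kind of local behavior a differential statistic is meant to expose.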

Microcluster-Based Incremental Ensemble Learning for Noisy, Nonstationary Data Streams

Complexity ◽ 10.1155/2020/6147378 ◽ 2020 ◽ Vol 2020 ◽ pp. 1-12

Author(s): Sanmin Liu ◽ Shan Xue ◽ Fanzhen Liu ◽ Jieren Cheng ◽ Xiulai Li ◽ ...

Keyword(s): Ensemble Learning ◽ Data Streams ◽ Data Stream ◽ Concept Drift ◽ Majority Vote ◽ Stream Classification ◽ Model Stability ◽ Data Stream Classification ◽ Nonstationary Data ◽ Synthetic Datasets

Data stream classification has become a promising prediction task with relevance to many practical environments. However, in the presence of concept drift and noise, data stream classification faces many challenges. Hence, a new incremental ensemble model is presented for classifying nonstationary data streams with noise. Our approach integrates three strategies: incremental learning to monitor and adapt to concept drift; ensemble learning to improve model stability; and a microclustering procedure that distinguishes drift from noise and predicts the labels of incoming instances via majority vote. Experiments with two synthetic datasets designed to test for both gradual and abrupt drift show that our method provides more accurate classification in nonstationary data streams with noise than two popular baselines.

Targeted Adaptable Sample for Accurate and Efficient Quantile Estimation in Non-Stationary Data Streams

Machine Learning and Knowledge Extraction ◽ 10.3390/make1030049 ◽ 2019 ◽ Vol 1 (3) ◽ pp. 848-870

Author(s): Ognjen Arandjelović

Keyword(s): Data Streams ◽ Data Stream ◽ Buffer Capacity ◽ Comprehensive Evaluation ◽ Estimation Algorithm ◽ Quantile Estimation ◽ Motion Features ◽ Synthetic Datasets ◽ High Level ◽ Stochastic Properties

The need to detect outliers or otherwise unusual data, which can be formalized as the estimation of a particular quantile of a distribution, is an important problem that frequently arises in a variety of applications of pattern recognition, computer vision and signal processing. For example, our work was most proximally motivated by the practical limitations and requirements of many semi-automatic surveillance analytics systems that detect abnormalities in closed-circuit television (CCTV) footage using statistical models of low-level motion features. In this paper, we specifically address the problem of estimating the running quantile of a data stream with non-stationary stochasticity when the absolute (rather than asymptotic) memory for storing observations is severely limited. We make several major contributions: (i) we derive an important theoretical result that shows that the change in the quantile of a stream is constrained regardless of the stochastic properties of data; (ii) we describe a set of high-level design goals for an effective estimation algorithm that emerge as a consequence of our theoretical findings; (iii) we introduce a novel algorithm that implements the aforementioned design goals by retaining a sample of data values in a manner adaptive to changes in the distribution of data and progressively narrowing down its focus in the periods of quasi-stationary stochasticity; and (iv) we present a comprehensive evaluation of the proposed algorithm and compare it with the existing methods in the literature on both synthetic datasets and three large “real-world” streams acquired in the course of operation of an existing commercial surveillance system. Our results and their detailed analysis convincingly and comprehensively demonstrate that the proposed method is highly successful and vastly outperforms the existing alternatives, especially when the target quantile is high-valued and the available buffer capacity is severely limited.

Soft Fault Detection Algorithms for Multi-Parallel Data Streams Under the Cloud Computing

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽ 10.20965/jaciii.2018.p1114 ◽ 2018 ◽ Vol 22 (7) ◽ pp. 1114-1119

Author(s): Hongbing Meng

Keyword(s): Fault Detection ◽ Data Streams ◽ Data Stream ◽ Error Probability ◽ Detection Efficiency ◽ Detection Algorithm ◽ Traditional Methods ◽ Detection Algorithms ◽ Parallel Data ◽ Soft Fault

In fault detection for multi-parallel data streams, traditional methods have a high error probability and cannot effectively detect soft faults, which results in low detection efficiency. A soft fault detection algorithm based on adaptive multi-parallel data streams is proposed. Soft fault features are extracted from the data stream, and the adaptive soft fault detection algorithm is used to detect faults in the multi-parallel data stream. This overcomes the disadvantages of traditional methods and effectively improves efficiency, safety, and accuracy. Experimental results showed that the proposed method can effectively improve the efficiency of fault detection.

Anomaly Pattern Detection in Streaming Data Based on the Transformation to Multiple Binary-Valued Data Streams

Journal of Artificial Intelligence and Soft Computing Research ◽ 10.2478/jaiscr-2022-0002 ◽ 2021 ◽ Vol 12 (1) ◽ pp. 19-27

Author(s): Taegong Kim ◽ Cheong Hee Park

Keyword(s): Outlier Detection ◽ Data Streams ◽ Data Stream ◽ Detection Method ◽ Binary Classification ◽ Streaming Data ◽ Pattern Detection ◽ Detection Methods ◽ Anomaly Pattern ◽ Isolation Forest

Anomaly pattern detection in a data stream aims to detect a time point where outliers begin to occur abnormally. Recently, a method for anomaly pattern detection has been proposed based on binary classification for outliers and statistical tests in the data stream of binary labels (normal or outlier). It showed that an anomaly pattern can be detected accurately even when outlier detection performance is relatively low. However, since the anomaly pattern detection method is based on binary classification for outliers, most well-known outlier detection methods, which output real-valued outlier scores, cannot be used directly. In this paper, we propose an anomaly pattern detection method in a data stream using the transformation to multiple binary-valued data streams from real-valued outlier scores. Using three outlier detection methods, Isolation Forest (IF), Autoencoder-based outlier detection, and Local Outlier Factor (LOF), the proposed anomaly pattern detection method is tested on artificial and real data sets. The experimental results show that anomaly pattern detection using Isolation Forest gives the best performance.
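The transformation described above can be pictured with a minimal sketch: each threshold turns the real-valued outlier scores into one 0/1 stream, and each stream is then checked for a shift in its outlier rate. The thresholds, the window size, and the simple rate comparison (standing in for the statistical test the abstract mentions) are assumptions for illustration, not the authors' procedure.

```python
def to_binary_streams(scores, thresholds):
    """Build one binary stream per threshold: 1 where the outlier
    score exceeds the threshold, 0 otherwise."""
    return {th: [1 if s > th else 0 for s in scores] for th in thresholds}

def outlier_rate_jump(stream, window):
    """Difference between the outlier rate in the most recent window
    and the rate in the window just before it.
    Assumption: a plain rate difference replaces a formal test, and
    the stream holds at least 2 * window labels."""
    recent = stream[-window:]
    previous = stream[-2 * window:-window]
    return sum(recent) / window - sum(previous) / window

# Hypothetical outlier scores that rise in the second half.
scores = [0.1, 0.2, 0.1, 0.3, 0.8, 0.9, 0.7, 0.95]
streams = to_binary_streams(scores, thresholds=[0.5, 0.85])

print(streams[0.5])                         # [0, 0, 0, 0, 1, 1, 1, 1]
print(outlier_rate_jump(streams[0.5], 4))   # 1.0
print(outlier_rate_jump(streams[0.85], 4))  # 0.5
```

Using several thresholds avoids committing to a single cut-off on the scores: a shift that is weak at one threshold may be clear at another.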

A Framework to Test Resistency of Detection Algorithms for Stepping-Stone Intrusion on Time-Jittering Manipulation

Wireless Communications and Mobile Computing ◽ 10.1155/2021/1807509 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-8

Author(s): Lixin Wang ◽ Jianhua Yang ◽ Michael Workman ◽ Peng-Jun Wan

Keyword(s): Data Stream ◽ Detection Method ◽ Detection Algorithm ◽ Efficient Algorithms ◽ Detection Methods ◽ The Internet ◽ Stepping Stones ◽ Stepping Stone ◽ Detection Algorithms ◽ A Chain

Hackers on the Internet usually send attacking packets using compromised hosts, called stepping-stones, in order to avoid being detected and caught. In a stepping-stone attack, an intruder remotely logs in to these stepping-stones using programs like SSH or telnet, uses a chain of Internet hosts as relay machines, and then sends the attacking packets. A great number of detection approaches have been developed for stepping-stone intrusion (SSI) in the literature. Many of these existing detection methods work effectively only when session manipulation by intruders is not present. When the session is manipulated by attackers, there are few known effective detection methods for SSI. It is important to know whether a detection algorithm for SSI is resistant to session manipulation by attackers. For session manipulation with chaff perturbation, software tools such as Scapy can be used to inject meaningless packets into a data stream. However, to the best of our knowledge, there are no existing effective tools or efficient algorithms to produce time-jittered network traffic that can be used to test whether an SSI detection method is resistant to intruders’ time-jittering manipulation. In this paper, we propose a framework to test the resistance of detection algorithms for SSI to time-jittering manipulation. Our proposed framework can be used to test whether an existing or new SSI detection method is resistant to session manipulation by intruders with time-jittering.

A Survey of Class Imbalance Problem on Evolving Data Stream

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽ 10.4018/978-1-7998-7371-6.ch002 ◽ 2021 ◽ pp. 23-41

Author(s): D. Himaja ◽ T. Maruthi Padmaja ◽ P. Radha Krishna

Keyword(s): Change Detection ◽ Data Streams ◽ Data Stream ◽ Concept Drift ◽ Class Imbalance ◽ Detection Methods ◽ Class Imbalance Problem ◽ Imbalance Problem ◽ Learning From Data ◽ Main Emphasis

Learning from data streams with both online class imbalance and concept drift (OCI-CD) is receiving much attention. This combined problem degrades the performance of current models that learn in both stationary and non-stationary environments. In non-stationary environments, the imbalance makes it hard to spot concept drift using conventional drift detection methods, which track change based on the learner's performance. There is limited work on the combined problem for imbalanced evolving streams in both stationary and non-stationary environments. Here the data may evolve with complete labels or with only limited labels. This chapter's main emphasis is to present different methods for resolving class imbalance in emerging streams, covering changing and unchanging environments, under both full supervision and limited label availability.

Data Streams Oriented Outlier Detection Method: A Fast Minimal Infrequent Pattern Mining

The International Arab Journal of Information Technology ◽ 10.34028/iajit/18/6/14 ◽ 2021

Author(s): ZhongYu Zhou ◽ DeChang Pi

Keyword(s): Outlier Detection ◽ Data Streams ◽ Pattern Mining ◽ Detection Method ◽ Detection Algorithm ◽ Detection Methods ◽ Mining Method ◽ Telemetry Data ◽ Process Data ◽ Mining Data Streams

Outlier detection is a common method for analyzing data streams. Most existing outlier detection methods compute distances between points to solve specific outlier detection problems. However, these methods are computationally expensive and cannot process data streams quickly. Outlier detection based on pattern mining resolves these issues, but the existing methods are inefficient and cannot meet the requirement of mining data streams quickly. To improve efficiency, a new outlier detection method is proposed in this paper. First, a fast minimal infrequent pattern mining method is proposed to mine the minimal infrequent patterns from data streams. Second, an efficient outlier detection algorithm based on minimal infrequent patterns is proposed for detecting the outliers in the data streams. The proposed algorithm is demonstrated on real telemetry data from a satellite in orbit. The experimental results show that the proposed method not only can be applied to satellite outlier detection, but also is superior to the existing methods.

A Novel Algorithm for Detecting Pedestrians on Rainy Image

Sensors ◽ 10.3390/s21010112 ◽ 2020 ◽ Vol 21 (1) ◽ pp. 112

Author(s): Yuhang Liu ◽ Jianxiao Ma ◽ Yuchen Wang ◽ Chenhong Zong

Keyword(s): State Of The Art ◽ Pedestrian Detection ◽ Heavy Rain ◽ Detection Algorithm ◽ Detection Methods ◽ Detection Accuracy ◽ Traffic Data ◽ Average Precision ◽ Synthetic Datasets ◽ Light Medium

Pedestrian detection is widely used in cooperative vehicle infrastructure systems. Traditional pedestrian detection methods perform sufficiently well under sunny scenarios and obtain trustworthy traffic data. However, detection performance drops drastically under rainy scenarios. This study proposes a pedestrian detection algorithm with a de-raining module that improves detection accuracy under various rainy scenarios. Specifically, this algorithm determines the density information of rain and effectively removes rain streaks through the de-raining module. Then the algorithm detects pedestrians as a pair of keypoints through the pedestrian detection module to solve the problem of occlusion. Furthermore, a new pedestrian dataset containing rain density labels is established and used to train the algorithm. For the scenarios of light, medium, and heavy rain, extensive experiments on synthetic datasets demonstrate that the proposed algorithm increases the AP (average precision) of pedestrian detection by 21.1%, 48.1%, and 60.9%, respectively. Moreover, the proposed algorithm performs well on real datasets and achieves improvements over the state-of-the-art methods, which shows that the proposed algorithm can significantly improve the accuracy of pedestrian detection in rainy scenarios.

A Novel Drift Detection Algorithm Based on Features’ Importance Analysis in a Data Streams Environment

Journal of Artificial Intelligence and Soft Computing Research ◽ 10.2478/jaiscr-2020-0019 ◽ 2020 ◽ Vol 10 (4) ◽ pp. 287-298

Author(s): Piotr Duda ◽ Krzysztof Przybyszewski ◽ Lipo Wang

Keyword(s): Random Forest ◽ Data Streams ◽ Data Stream ◽ Concept Drift ◽ Ensemble Methods ◽ Real Data ◽ Relevant Information ◽ Detection Algorithm ◽ Important Indicator ◽ Features Importance

The training set consists of many features that influence the classifier to different degrees. Choosing the most important features and rejecting those that do not carry relevant information is of great importance to the operation of the learned model. In the case of data streams, the importance of the features may additionally change over time. Such changes affect the performance of the classifier but can also be an important indicator of concept drift. In this work, we propose a new algorithm for data stream classification, called Random Forest with Features Importance (RFFI), which uses the measure of feature importance as a drift detector. The RFFI algorithm adapts solutions inspired by the Random Forest algorithm to data stream scenarios. The proposed algorithm combines the ability of ensemble methods to handle slow changes in a data stream with a new method for detecting the occurrence of concept drift. The work contains an experimental analysis of the proposed algorithm, carried out on synthetic and real data.
