A Novel Drift Detection Algorithm Based on Features’ Importance Analysis in a Data Streams Environment

Abstract: The training set consists of many features that influence the classifier to different degrees. Choosing the most important features and rejecting those that carry no relevant information is of great importance to the operation of the learned model. In the case of data streams, the importance of features may additionally change over time. Such changes affect the performance of the classifier but can also be an important indicator of concept drift. In this work, we propose a new algorithm for data stream classification, called Random Forest with Features Importance (RFFI), which uses a measure of feature importance as a drift detector. The RFFI algorithm adapts solutions inspired by the Random Forest algorithm to data stream scenarios. The proposed algorithm combines the ability of ensemble methods to handle slow changes in a data stream with a new method for detecting concept drift. The work contains an experimental analysis of the proposed algorithm, carried out on synthetic and real data.
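
The idea of using a change in feature importance as a drift signal can be illustrated with a minimal sketch. This is not the RFFI algorithm: the importance score below (absolute difference of class-conditional means) is a simple stand-in for a Random Forest importance measure, and the names `feature_importance`, `importance_shift`, `make_window`, and the value of `THRESHOLD` are all hypothetical choices made for illustration.

```python
import random

def feature_importance(X, y):
    """Stand-in importance: |mean(x | y=1) - mean(x | y=0)| per feature,
    normalised so the scores sum to 1. Not a Random Forest importance."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        if not pos or not neg:
            scores.append(0.0)
            continue
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    total = sum(scores)
    if total == 0:
        return [1.0 / n_features] * n_features
    return [s / total for s in scores]

def importance_shift(ref, cur):
    """L1 distance between two normalised importance vectors (range 0..2)."""
    return sum(abs(a - b) for a, b in zip(ref, cur))

def make_window(rng, n, informative):
    """Toy two-feature window: only feature `informative` depends on the label."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        row[informative] += 3.0 * label
        X.append(row)
        y.append(label)
    return X, y

rng = random.Random(0)
THRESHOLD = 0.5  # hypothetical drift threshold, chosen for this toy data only

ref_imp = feature_importance(*make_window(rng, 500, informative=0))
same_imp = feature_importance(*make_window(rng, 500, informative=0))
drift_imp = feature_importance(*make_window(rng, 500, informative=1))

# Same concept: importance vectors stay close. New concept: they diverge.
no_drift = importance_shift(ref_imp, same_imp) > THRESHOLD
drift = importance_shift(ref_imp, drift_imp) > THRESHOLD
```

In this toy setup the informative feature switches between windows, so the importance vector moves from roughly all weight on feature 0 to roughly all weight on feature 1, and the shift exceeds the threshold.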

A Data Stream Outlier Detection Algorithm Based on Reverse K Nearest Neighbors

Advanced Materials Research ◽ 10.4028/www.scientific.net/amr.225-226.1032 ◽ 2011 ◽ Vol 225-226 ◽ pp. 1032-1035 ◽ Cited By ~ 1

Author(s): Zhong Ping Zhang ◽ Yong Xin Liang

Keyword(s): Outlier Detection ◽ Data Stream ◽ Concept Drift ◽ Real Data ◽ Nearest Neighbors ◽ Detection Algorithm ◽ Data Sets ◽ K Nearest Neighbors ◽ Query Manager ◽ Current Window

This paper proposes SODRNN, a new data stream outlier detection algorithm based on reverse k nearest neighbors. We work with the sliding window model, where outlier queries are performed to detect anomalies in the current window. Each insertion or deletion update needs only one scan of the current window, which improves efficiency. Queries over the whole current window at arbitrary times are handled by the Query Manager procedure, which can capture concept drift in the data stream in a timely manner. Experiments conducted on both synthetic and real data sets show that the SODRNN algorithm is both effective and efficient.
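
A rough sketch of reverse k nearest neighbor scoring over a sliding window is given below. It is not SODRNN itself: it recomputes all neighbor lists by brute force on each query rather than using the one-scan incremental update the abstract describes, and the function names, the 1-D data, and the `min_reverse` cutoff are assumptions made for illustration.

```python
from collections import deque

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i] (1-D values, brute force)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: abs(points[j] - points[i]))
    return others[:k]

def reverse_knn_counts(points, k):
    """counts[j] = number of points that list j among their k nearest neighbours."""
    counts = [0] * len(points)
    for i in range(len(points)):
        for j in knn_indices(points, i, k):
            counts[j] += 1
    return counts

def window_outliers(window, k, min_reverse):
    """Points in the current window with fewer than min_reverse reverse neighbours."""
    pts = list(window)
    counts = reverse_knn_counts(pts, k)
    return [p for p, c in zip(pts, counts) if c < min_reverse]

# Sliding window of fixed size: the oldest value is dropped automatically.
window = deque(maxlen=6)
for value in [1.0, 1.1, 0.9, 1.2, 9.0, 1.05, 0.95]:
    window.append(value)

outliers = window_outliers(window, k=2, min_reverse=1)
```

A point that no other point counts among its nearest neighbors sits away from every local cluster, which is why a low reverse neighbor count serves as the outlier criterion here.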

Measuring the Effectiveness of Adaptive Random Forest for Handling Concept Drift in Big Data Streams

Entropy ◽ 10.3390/e23070859 ◽ 2021 ◽ Vol 23 (7) ◽ pp. 859

Author(s): Abdulaziz O. AlQabbany ◽ Aqil M. Azmi

Keyword(s): Big Data ◽ Random Forest ◽ Real Time ◽ Data Streams ◽ Learning Algorithm ◽ Concept Drift ◽ The United States ◽ Careful Consideration ◽ Data Sets ◽ Stream Data

We are living in the age of big data, the majority of which is stream data. The real-time processing of this data requires careful consideration from different perspectives. Concept drift is a change in the data’s underlying distribution, a significant issue, especially when learning from data streams. It requires learners to be adaptive to dynamic changes. Random forest is an ensemble approach that is widely used in classical non-streaming settings of machine learning applications. At the same time, the Adaptive Random Forest (ARF) is a stream learning algorithm that showed promising results in terms of its accuracy and ability to deal with various types of drift. The incoming instances’ continuity allows for their binomial distribution to be approximated by a Poisson(1) distribution. In this study, we propose a mechanism to increase such streaming algorithms’ efficiency by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects in online learning: accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value for ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tackling this important aspect. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that our proposed method of enhancement exhibited considerable improvement in most of the situations.
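
The Poisson resampling step mentioned above can be sketched as follows. This is only an illustration of drawing per-learner training weights from Poisson(λ); it is not the ARF implementation and does not compute the paper's ρ measure, whose definition is not given here. The helper names and the sample sizes are assumptions.

```python
import math
import random

def poisson(rng, lam):
    """Draw from Poisson(lam) using Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def resampling_weights(rng, n_instances, n_learners, lam):
    """weights[i][m] = number of times learner m would train on instance i."""
    return [[poisson(rng, lam) for _ in range(n_learners)]
            for _ in range(n_instances)]

rng = random.Random(42)
weights = resampling_weights(rng, n_instances=2000, n_learners=5, lam=1.0)

flat = [w for row in weights for w in row]
mean_weight = sum(flat) / len(flat)                   # close to lam
skipped = sum(1 for w in flat if w == 0) / len(flat)  # close to exp(-lam)
```

The expected number of training updates per instance and learner equals λ, so a larger λ means more training work per arriving instance, while a weight of zero means that learner skips the instance. This is the trade-off between accuracy and execution time that a resampling parameter controls.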

Knowledge Discovery From Evolving Data Streams

Advances in Business Information Systems and Analytics - Machine Learning Techniques for Improved Business Analytics ◽ 10.4018/978-1-5225-3534-8.ch002 ◽ 2019 ◽ pp. 19-39

Author(s): Prasanna Lakshmi Kompalli

Keyword(s): Real Time ◽ Data Streams ◽ Data Stream ◽ Concept Drift ◽ Data Stream Mining ◽ Time Data ◽ Stream Mining ◽ New Challenges ◽ Mining Data Streams ◽ Different Sources

Data coming from different sources is referred to as data streams. Data stream mining is an online learning technique in which each data point must be processed as it arrives and discarded once processing is complete. Advances in technology have made it possible to monitor these data streams in real time, which has created many new challenges for researchers. The main characteristics of this type of data are that it is fast flowing, large in volume, continuous, and growing, and that its characteristics may change over time, which is termed concept drift. This chapter addresses the problems in mining data streams with concept drift. Isolating the relevant literature on this topic can be a grueling task for researchers and practitioners. This chapter tries to provide a solution by bringing together the techniques used for data stream mining with concept drift.

Cost-Sensitive Classification for Evolving Data Streams with Concept Drift and Class Imbalance

Computational Intelligence and Neuroscience ◽ 10.1155/2021/8813806 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-9

Author(s): Yange Sun ◽ Meng Li ◽ Lei Li ◽ Han Shao ◽ Yi Sun

Keyword(s): Data Streams ◽ Data Stream ◽ Learning Strategy ◽ Concept Drift ◽ Class Imbalance ◽ Data Preprocessing ◽ Cost Information ◽ Detection Mechanism ◽ Stream Classification ◽ Data Stream Classification

Class imbalance and concept drift are two primary issues that exist concurrently in data stream classification. Although each issue has drawn considerable attention separately, their joint treatment remains largely unexplored. Moreover, the class imbalance issue is further complicated in data streams with concept drift. A novel Cost-Sensitive based Data Stream (CSDS) classification method is introduced to overcome the two issues simultaneously. CSDS considers cost information during both data preprocessing and classification. During data preprocessing, a cost-sensitive learning strategy is introduced into the ReliefF algorithm to alleviate the class imbalance at the data level. In the classification process, a cost-sensitive weighting schema is devised to enhance the overall performance of the ensemble. In addition, a change detection mechanism is embedded in the algorithm, which ensures that the ensemble can capture and react to drift promptly. Experimental results validate that our method can obtain better classification results under different imbalanced concept-drifting data stream scenarios.

Research on the Fastest Detection Method for Weak Trends under Noise Interference

Entropy ◽ 10.3390/e23081093 ◽ 2021 ◽ Vol 23 (8) ◽ pp. 1093

Author(s): Guang Li ◽ Jing Liang ◽ Caitong Yue

Keyword(s): Anomaly Detection ◽ Data Streams ◽ Concept Drift ◽ Detection Algorithm ◽ Detection Accuracy ◽ Industrial Data ◽ Oil Drilling ◽ Noise Interference ◽ Detection Score ◽ Weak Trend

Trend anomaly detection is the practice of comparing and analyzing current and historical data trends to detect real-time abnormalities in online industrial data-streams. It has the advantages of tracking a concept drift automatically and predicting trend changes in the shortest time, making it important both for algorithmic research and industry. However, industrial data streams contain considerable noise that interferes with detecting weak anomalies. In this paper, the fastest detection algorithm “sliding nesting” is adopted. It is based on calculating the data weight in each window by applying variable weights, while maintaining the method of trend-effective integration accumulation. The new algorithm changes the traditional calculation method of the trend anomaly detection score, which calculates the score in a short window. This algorithm, SNWFD–DS, can detect weak trend abnormalities in the presence of noise interference. Compared with other methods, it has significant advantages. An on-site oil drilling data test shows that this method can significantly reduce delays compared with other methods and can improve the detection accuracy of weak trend anomalies under noise interference.

Microcluster-Based Incremental Ensemble Learning for Noisy, Nonstationary Data Streams

Complexity ◽ 10.1155/2020/6147378 ◽ 2020 ◽ Vol 2020 ◽ pp. 1-12

Author(s): Sanmin Liu ◽ Shan Xue ◽ Fanzhen Liu ◽ Jieren Cheng ◽ Xiulai Li ◽ ...

Keyword(s): Ensemble Learning ◽ Data Streams ◽ Data Stream ◽ Concept Drift ◽ Majority Vote ◽ Stream Classification ◽ Model Stability ◽ Data Stream Classification ◽ Nonstationary Data ◽ Synthetic Datasets

Data stream classification has become a promising prediction task with relevance to many practical environments. However, in the presence of concept drift and noise, research on data stream classification faces many challenges. Hence, a new incremental ensemble model is presented for classifying nonstationary data streams with noise. Our approach integrates three strategies: incremental learning to monitor and adapt to concept drift; ensemble learning to improve model stability; and a microclustering procedure that distinguishes drift from noise and predicts the labels of incoming instances via majority vote. Experiments with two synthetic datasets designed to test for both gradual and abrupt drift show that our method provides more accurate classification in nonstationary data streams with noise than two popular baselines.

Detecting Metachanges in Data Streams from the Viewpoint of the MDL Principle

Entropy ◽ 10.3390/e21121134 ◽ 2019 ◽ Vol 21 (12) ◽ pp. 1134 ◽ Cited By ~ 1

Author(s): Shintaro Fukushima ◽ Kenji Yamanishi

Keyword(s): Data Streams ◽ Data Stream ◽ Minimum Description Length ◽ Detection Algorithm ◽ Change Points ◽ Detection Methods ◽ Code Length ◽ Warning Signals ◽ Mdl Principle ◽ Synthetic Datasets

This paper addresses the issue of how we can detect changes of changes, which we call metachanges, in data streams. A metachange refers to a change in patterns of when and how changes occur, referred to as “metachanges along time” and “metachanges along state”, respectively. Metachanges along time mean that the intervals between change points significantly vary, whereas metachanges along state mean that the magnitude of changes varies. It is practically important to detect metachanges because they may be early warning signals of important events. This paper introduces a novel notion of metachange statistics as a measure of the degree of a metachange. The key idea is to integrate metachanges along both time and state in terms of “code length” according to the minimum description length (MDL) principle. We develop an online metachange detection algorithm (MCD) based on the statistics to apply it to a data stream. With synthetic datasets, we demonstrated that MCD detects metachanges earlier and more accurately than existing methods. With real datasets, we demonstrated that MCD can lead to the discovery of important events that might be overlooked by conventional change detection methods.

Soft Fault Detection Algorithms for Multi-Parallel Data Streams Under the Cloud Computing

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽ 10.20965/jaciii.2018.p1114 ◽ 2018 ◽ Vol 22 (7) ◽ pp. 1114-1119

Author(s): Hongbing Meng

Keyword(s): Fault Detection ◽ Data Streams ◽ Data Stream ◽ Error Probability ◽ Detection Efficiency ◽ Detection Algorithm ◽ Traditional Methods ◽ Detection Algorithms ◽ Parallel Data ◽ Soft Fault

In fault detection for multi-parallel data streams, traditional methods have a high error probability and cannot effectively handle soft fault detection, which results in low detection efficiency. A soft fault detection algorithm based on adaptive multi-parallel data streams is proposed. The soft fault features in the data stream are extracted, and the adaptive soft fault detection algorithm is used to detect faults in the multi-parallel data stream, which overcomes the disadvantages of traditional methods and effectively improves efficiency, safety, and accuracy. Experimental results show that the proposed method can effectively improve the efficiency of fault detection.

Calculating feature importance in data streams with concept drift using Online Random Forest

2014 IEEE International Conference on Big Data (Big Data) ◽ 10.1109/bigdata.2014.7004352 ◽ 2014 ◽ Cited By ~ 4

Author(s): Andrew Phelps Cassidy ◽ Frank A. Deviney

Keyword(s): Random Forest ◽ Data Streams ◽ Concept Drift ◽ Feature Importance

Research on Outlier Detection Algorithm for Evaluation of Battery System Safety

Advances in Mechanical Engineering ◽ 10.1155/2014/830402 ◽ 2014 ◽ Vol 6 ◽ pp. 830402 ◽ Cited By ~ 3

Author(s): Changhao Piao ◽ Zhi Huang ◽ Ling Su ◽ Sheng Lu

Keyword(s): Outlier Detection ◽ Data Stream ◽ Concept Drift ◽ Detection Algorithm ◽ High Dimensional ◽ Small Scale ◽ Data Sets ◽ Angle Distribution ◽ System Safety ◽ Battery System

The battery system is a key part of the electric vehicle. To detect outliers effectively while the battery system is running, a new high-dimensional data stream outlier detection algorithm (DSOD) based on angle distribution is proposed. First, to improve the stability of the algorithm in high-dimensional space, an angle distribution-based outlier detection method is employed. Second, to reduce the computational complexity, a small-scale calculation set of the data stream is established, composed of a normal set and a border set. To address concept drift, an update mechanism for the normal set and border set is developed in this paper. In this way, hidden abnormal points can be detected rapidly. The experimental results on real data sets and battery system simulation data sets demonstrate that DSOD is more efficient than Simple variance of angles (Simple VOA) and angle-based outlier detection (ABOD) and is well suited to the evaluation of battery system safety.
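
The angle-based idea behind this family of methods can be sketched with a plain variance-of-angles score. This is not DSOD: it has no normal set, border set, or update mechanism, and it evaluates every pair of other points by brute force. The function names and the toy 2-D points are assumptions for illustration.

```python
import math
from itertools import combinations

def angle(p, a, b):
    """Angle at p (in radians) between the vectors p->a and p->b."""
    va = [x - y for x, y in zip(a, p)]
    vb = [x - y for x, y in zip(b, p)]
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (na * nb)))
    return math.acos(cos)

def variance_of_angles(p, others):
    """Variance of the angles that p forms with every pair of other points."""
    angles = [angle(p, a, b) for a, b in combinations(others, 2)]
    mean = sum(angles) / len(angles)
    return sum((t - mean) ** 2 for t in angles) / len(angles)

# Toy data: a small cluster plus one distant point (all points distinct).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
scores = {p: variance_of_angles(p, [q for q in points if q != p]) for p in points}

# A low variance means all other points lie in roughly the same direction.
most_outlying = min(scores, key=scores.get)
```

A point inside a cluster sees its neighbors in many directions, so its angles vary widely, whereas a distant point sees the whole cluster within a narrow cone and gets a small variance.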
