FlexSketch: Estimation of Probability Density for Stationary and Non-Stationary Data Streams

Efficient and accurate estimation of the probability distribution of a data stream is an important problem in many sensor systems. It is especially challenging when the data stream is non-stationary, i.e., its probability distribution changes over time. Statistical models for non-stationary data streams demand agile adaptation for concept drift while tolerating temporal fluctuations. To this end, a statistical model needs to forget old data samples and to detect concept drift swiftly. In this paper, we propose FlexSketch, an online probability density estimation algorithm for data streams. Our algorithm uses an ensemble of histograms, each of which represents a different length of data history. FlexSketch updates each histogram for a new data sample and generates probability distribution by combining the ensemble of histograms while monitoring discrepancy between recent data and existing models periodically. When it detects concept drift, a new histogram is added to the ensemble and the oldest histogram is removed. This allows us to estimate the probability density function with high update speed and high accuracy using only limited memory. Experimental results demonstrate that our algorithm shows improved speed and accuracy compared to existing methods for both stationary and non-stationary data streams.

Download Full-text

Knowledge Discovery From Evolving Data Streams

Advances in Business Information Systems and Analytics - Machine Learning Techniques for Improved Business Analytics ◽

10.4018/978-1-5225-3534-8.ch002 ◽

2019 ◽

pp. 19-39

Author(s):

Prasanna Lakshmi Kompalli

Keyword(s):

Real Time ◽

Data Streams ◽

Data Stream ◽

Concept Drift ◽

Data Stream Mining ◽

Time Data ◽

Stream Mining ◽

New Challenges ◽

Mining Data Streams ◽

Different Sources

Data coming from different sources is referred to as data streams. Data stream mining is an online learning technique where each data point must be processed as the data arrives and discarded as the processing is completed. Progress of technologies has resulted in the monitoring these data streams in real time. Data streams has created many new challenges to the researchers in real time. The main features of this type of data are they are fast flowing, large amounts of data which are continuous and growing in nature, and characteristics of data might change in course of time which is termed as concept drift. This chapter addresses the problems in mining data streams with concept drift. Due to which, isolating the correct literature would be a grueling task for researchers and practitioners. This chapter tries to provide a solution as it would be an amalgamation of all techniques used for data stream mining with concept drift.

Download Full-text

Cost-Sensitive Classification for Evolving Data Streams with Concept Drift and Class Imbalance

Computational Intelligence and Neuroscience ◽

10.1155/2021/8813806 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Yange Sun ◽

Meng Li ◽

Lei Li ◽

Han Shao ◽

Yi Sun

Keyword(s):

Data Streams ◽

Data Stream ◽

Learning Strategy ◽

Concept Drift ◽

Class Imbalance ◽

Data Preprocessing ◽

Cost Information ◽

Detection Mechanism ◽

Stream Classification ◽

Data Stream Classification

Class imbalance and concept drift are two primary principles that exist concurrently in data stream classification. Although the two issues have drawn enough attention separately, the joint treatment largely remains unexplored. Moreover, the class imbalance issue is further complicated if data streams with concept drift. A novel Cost-Sensitive based Data Stream (CSDS) classification is introduced to overcome the two issues simultaneously. The CSDS considers cost information during the procedures of data preprocessing and classification. During the data preprocessing, a cost-sensitive learning strategy is introduced into the ReliefF algorithm for alleviating the class imbalance at the data level. In the classification process, a cost-sensitive weighting schema is devised to enhance the overall performance of the ensemble. Besides, a change detection mechanism is embedded in our algorithm, which guarantees that an ensemble can capture and react to drift promptly. Experimental results validate that our method can obtain better classification results under different imbalanced concept drifting data stream scenarios.

Download Full-text

Microcluster-Based Incremental Ensemble Learning for Noisy, Nonstationary Data Streams

Complexity ◽

10.1155/2020/6147378 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Sanmin Liu ◽

Shan Xue ◽

Fanzhen Liu ◽

Jieren Cheng ◽

Xiulai Li ◽

...

Keyword(s):

Ensemble Learning ◽

Data Streams ◽

Data Stream ◽

Concept Drift ◽

Majority Vote ◽

Stream Classification ◽

Model Stability ◽

Data Stream Classification ◽

Nonstationary Data ◽

Synthetic Datasets

Data stream classification becomes a promising prediction work with relevance to many practical environments. However, under the environment of concept drift and noise, the research of data stream classification faces lots of challenges. Hence, a new incremental ensemble model is presented for classifying nonstationary data streams with noise. Our approach integrates three strategies: incremental learning to monitor and adapt to concept drift; ensemble learning to improve model stability; and a microclustering procedure that distinguishes drift from noise and predicts the labels of incoming instances via majority vote. Experiments with two synthetic datasets designed to test for both gradual and abrupt drift show that our method provides more accurate classification in nonstationary data streams with noise than the two popular baselines.

Download Full-text

Targeted Adaptable Sample for Accurate and Efficient Quantile Estimation in Non-Stationary Data Streams

Machine Learning and Knowledge Extraction ◽

10.3390/make1030049 ◽

2019 ◽

Vol 1 (3) ◽

pp. 848-870

Author(s):

Ognjen Arandjelović

Keyword(s):

Data Streams ◽

Data Stream ◽

Buffer Capacity ◽

Comprehensive Evaluation ◽

Estimation Algorithm ◽

Quantile Estimation ◽

Motion Features ◽

Synthetic Datasets ◽

High Level ◽

Stochastic Properties

The need to detect outliers or otherwise unusual data, which can be formalized as the estimation a particular quantile of a distribution, is an important problem that frequently arises in a variety of applications of pattern recognition, computer vision and signal processing. For example, our work was most proximally motivated by the practical limitations and requirements of many semi-automatic surveillance analytics systems that detect abnormalities in closed-circuit television (CCTV) footage using statistical models of low-level motion features. In this paper, we specifically address the problem of estimating the running quantile of a data stream with non-stationary stochasticity when the absolute (rather than asymptotic) memory for storing observations is severely limited. We make several major contributions: (i) we derive an important theoretical result that shows that the change in the quantile of a stream is constrained regardless of the stochastic properties of data; (ii) we describe a set of high-level design goals for an effective estimation algorithm that emerge as a consequence of our theoretical findings; (iii) we introduce a novel algorithm that implements the aforementioned design goals by retaining a sample of data values in a manner adaptive to changes in the distribution of data and progressively narrowing down its focus in the periods of quasi-stationary stochasticity; and (iv) we present a comprehensive evaluation of the proposed algorithm and compare it with the existing methods in the literature on both synthetic datasets and three large “real-world” streams acquired in the course of operation of an existing commercial surveillance system. Our results and their detailed analysis convincingly and comprehensively demonstrate that the proposed method is highly successful and vastly outperforms the existing alternatives, especially when the target quantile is high-valued and the available buffer capacity severely limited.

Download Full-text

Incremental Learning on Non-stationary Data Stream using Ensemble Approach

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v6i4.10255 ◽

2016 ◽

Vol 6 (4) ◽

pp. 1811 ◽

Cited By ~ 1

Author(s):

Meenakshi Anurag Thalor ◽

Shrishailapa Patil

Keyword(s):

Machine Learning ◽

Incremental Learning ◽

Data Stream ◽

Recommendation System ◽

Concept Drift ◽

Joint Probability ◽

Machine Learning Algorithms ◽

Training Data ◽

Joint Probability Distribution ◽

Changes Over Time

<span lang="EN-US">Incremental Learning on non stationary distribution has been shown to be a very challenging problem in machine learning and data mining, because the joint probability distribution between the data and classes changes over time. Many real time problems suffer concept drift as they changes with time. For example, an advertisement recommendation system, in which customer’s behavior may change depending on the season of the year, on the inflation and on new products made available. An extra challenge arises when the classes to be learned are not represented equally in the training data i.e. classes are imbalanced, as most machine learning algorithms work well only when the training data is balanced. The objective of this paper is to develop an ensemble based classification algorithm for non-stationary data stream (ENSDS) with focus on two-class problems. In addition, we are presenting here an exhaustive comparison of purposed algorithms with state-of-the-art classification approaches using different evaluation measures like recall, f-measure and g-mean</span>

Download Full-text

Concept Drift Detection in Data Stream Clustering and its Application on Weather Data

International Journal of Agricultural and Environmental Information Systems ◽

10.4018/ijaeis.2020010104 ◽

2020 ◽

Vol 11 (1) ◽

pp. 67-85 ◽

Cited By ~ 1

Author(s):

Namitha K. ◽

Santhosh Kumar G.

Keyword(s):

Data Streams ◽

Data Stream ◽

Weather Forecasting ◽

Concept Drift ◽

Clustering Algorithms ◽

Weather Data ◽

Stream Clustering ◽

Cluster Evolution ◽

Data Stream Clustering ◽

Concept Drift Detection

This article presents a stream mining framework to cluster the data stream and monitor its evolution. Even though concept drift is expected to be present in data streams, explicit drift detection is rarely done in stream clustering algorithms. The proposed framework is capable of explicit concept drift detection and cluster evolution analysis. Concept drift is caused by the changes in data distribution over time. Relationship between concept drift and the occurrence of physical events has been studied by applying the framework on the weather data stream. Experiments led to the conclusion that the concept drift accompanied by a change in the number of clusters indicates a significant weather event. This kind of online monitoring and its results can be utilized in weather forecasting systems in various ways. Weather data streams produced by automatic weather stations (AWS) are used to conduct this study.

Download Full-text

Learning in the presence of concept recurrence in data stream clustering

Journal Of Big Data ◽

10.1186/s40537-020-00354-1 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

K. Namitha ◽

G. Santhosh Kumar

Keyword(s):

Data Streams ◽

Data Stream ◽

Concept Drift ◽

Synthetic Data ◽

Real World Data ◽

Stream Classification ◽

Stream Clustering ◽

The Past ◽

Data Stream Clustering ◽

Learning Scenarios

Abstract In the case of real-world data streams, the underlying data distribution will not be static; it is subject to variation over time, which is known as the primary reason for concept drift. Concept drift poses severe problems to the accuracy of a model in online learning scenarios. The recurring concept is a particular case of concept drift where the concepts already seen in the past reappear as the stream evolves. This problem is not yet studied in the context of stream clustering. This paper proposes a novel algorithm for identifying the recurring concepts in data stream clustering. During concept recurrence, the most matching model is retrieved from the repository and reused. The algorithm has minimum memory requirements and works online with the stream. Some of the concepts and definitions, already familiar in concept recurrence studies of stream classification have been redefined for clustering. The experiments conducted on real and synthetic data streams reveal that the proposed algorithm has the potential to identify recurring concepts.

Download Full-text

Minority Resampling Based Ensemble Framework Using Enhanced Early Drift Detection Method For Imbalanced Data Streams

10.21203/rs.3.rs-141880/v1 ◽

2021 ◽

Author(s):

Priya S ◽

Annie Uthra

Keyword(s):

Data Streams ◽

Data Stream ◽

Detection Method ◽

Concept Drift ◽

Class Imbalance ◽

Current Data ◽

Classification Model ◽

Ensemble Classifiers ◽

K Nearest Neighbor ◽

Jaccard Similarity

Abstract As the data mining applications are increasing popularly, large volumes of data streams are generated over the period of time. The main problem in data streams is that it exhibits a high degree of class imbalance and distribution of data changes over time. In this paper, Timely Drift Detection and Minority Resampling Technique (TDDMRT) based on K-nearest neighbor and Jaccard similarity is proposed to handle the class imbalance by finding the current ratio of class labels. The Enhanced Early Drift Detection Method (EEDDM) is proposed for detecting the concept drift and the Minority Resampling Method (KNN-JS) determines whether the current data stream should be regarded as imbalance and it resamples the minority instances in the drifting data stream. The K-Nearest Neighbors technique is used to resample the minority classes and the Jaccard similarity measure is established over the resampled data to generate the synthetic data similar to the original data and it is handled by ensemble classifiers. The proposed ensemble based classification model outperforms the existing over sampling and under sampling techniques with accuracy of 98.52%.

Download Full-text

An accurate estimation algorithm for big data streams

Distributed and Parallel Databases ◽

10.1007/s10619-018-7225-5 ◽

2018 ◽

Vol 36 (3) ◽

pp. 461-483

Author(s):

Qin Xin ◽

Jianping Wu

Keyword(s):

Big Data ◽

Data Streams ◽

Estimation Algorithm ◽

Accurate Estimation ◽

Big Data Streams

Download Full-text

A Survey of Class Imbalance Problem on Evolving Data Stream

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-7371-6.ch002 ◽

2021 ◽

pp. 23-41

Author(s):

D. Himaja ◽

T. Maruthi Padmaja ◽

P. Radha Krishna

Keyword(s):

Change Detection ◽

Data Streams ◽

Data Stream ◽

Concept Drift ◽

Class Imbalance ◽

Detection Methods ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Learning From Data ◽

Main Emphasis

Learning from data streams with both online class imbalance and concept drift (OCI-CD) is receiving much attention in today's world. Due to this problem, the performance is affected for the current models that learn from both stationary as well as non-stationary environments. In the case of non-stationary environments, due to the imbalance, it is hard to spot the concept drift using conventional drift detection methods that aim at tracking the change detection based on the learner's performance. There is limited work on the combined problem from imbalanced evolving streams both from stationary and non-stationary environments. Here the data may be evolved with complete labels or with only limited labels. This chapter's main emphasis is to provide different methods for the purpose of resolving the issue of class imbalance in emerging streams, which involves changing and unchanging environments with supervised and availability of limited labels.

Download Full-text