Dynamically Adjusting Diversity in Ensembles for the Classification of Data Streams with Concept Drift

2021 ◽  
Vol 16 (2) ◽  
pp. 1-30
Author(s):  
Juan I. G. Hidalgo ◽  
Silas G. T. C. Santos ◽  
Roberto S. M. Barros

A data stream can be defined as a system that continually generates large volumes of data over time. Today, processing data streams poses new demands and challenges in the data mining and machine learning areas. Concept drift is a problem commonly characterized as changes in the distribution of the data within a data stream. Implementing new methods for data streams in which concept drifts occur requires algorithms that can adapt to several scenarios and so improve their performance across the different experimental situations in which they are tested. This research proposes a strategy for dynamic parameter adjustment in the presence of concept drifts. The Parameter Estimation Procedure (PEP) is a general method for dynamically adjusting parameters, applied here to the diversity parameter (λ) of several classification ensembles commonly used in the area. To this end, PEP was used to create Boosting-like Online Learning Ensemble with Parameter Estimation (BOLE-PE), Online AdaBoost-based M1 with Parameter Estimation (OABM1-PE), and Oza and Russell’s Online Bagging with Parameter Estimation (OzaBag-PE), based on the existing ensembles BOLE, OABM1, and OzaBag, respectively. To validate them, experiments were performed with artificial and real-world datasets using Hoeffding Tree (HT) as the base classifier. The accuracy results were statistically evaluated using a variation of the Friedman test and the Nemenyi post-hoc test. The experimental results showed that dynamically estimating the diversity parameter (λ) produced good results in most scenarios, i.e., the modified methods improved accuracy in the experiments with both artificial and real-world datasets.
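
The role of a diversity parameter λ in an online bagging ensemble can be sketched as follows. This is a minimal illustration, assuming λ is the rate of the Poisson distribution that decides how many times each base learner trains on an incoming instance, as in Oza and Russell's online bagging. The abstract does not describe how PEP estimates λ, so λ is simply a fixed constructor argument here. `MajorityClass` is a hypothetical stand-in for a Hoeffding Tree, and all class and function names are illustrative.

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


class MajorityClass:
    """Toy incremental learner (hypothetical stand-in for a Hoeffding Tree)."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, x, y):
        self.counts[y] += 1

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


class OnlineBagging:
    """Online bagging where lam controls how often each learner sees an instance."""

    def __init__(self, n_models=10, lam=1.0, seed=0):
        self.lam = lam  # assumed diversity parameter (Poisson rate)
        self.rng = random.Random(seed)
        self.models = [MajorityClass() for _ in range(n_models)]

    def learn_one(self, x, y):
        # Each base learner trains on the instance k ~ Poisson(lam) times.
        for model in self.models:
            for _ in range(poisson(self.lam, self.rng)):
                model.learn_one(x, y)

    def predict_one(self, x):
        # Plain majority vote over the learners that have seen any data.
        votes = Counter(m.predict_one(x) for m in self.models)
        votes.pop(None, None)
        return votes.most_common(1)[0][0] if votes else None
```

With λ = 1 this reduces to the standard online bagging scheme. Smaller values of λ make each learner skip more instances, so the ensemble members see more dissimilar samples of the stream.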

2021 ◽  
Vol 2021 ◽  
pp. 1-17
Author(s):  
Tinofirei Museba ◽  
Fulufhelo Nelwamondo ◽  
Khmaies Ouahada

Beyond applying machine learning predictive models to static tasks, a significant corpus of research applies them to streaming environments that incur concept drift. With the prevalence of streaming real-world applications whose underlying data distribution changes, the need for applications capable of adapting to evolving, time-varying dynamic environments can hardly be overstated. Dynamic environments are nonstationary: the target variables to be predicted by the learning algorithm often evolve with time, a phenomenon known as concept drift. Most work on handling concept drift focuses on updating the prediction model so that it can recover from drift, while little effort has been dedicated to formulating a learning system that can learn different types of drifting concepts at any time with minimum overhead. This work proposes a novel, evolving data stream classifier called Adaptive Diversified Ensemble Selection Classifier (ADES) that significantly optimizes adaptation to different types of concept drift at any time and improves convergence to new concepts by exploiting different amounts of ensemble diversity. The ADES algorithm generates diverse base classifiers, thereby optimizing the margin distribution, and exploits ensemble diversity to formulate an ensemble classifier that generalizes well to unseen instances and recovers quickly from different types of concept drift. Empirical experiments conducted on both artificial and real-world data streams demonstrate that ADES can adapt to different types of drift at any given time. The prediction performance of ADES is compared with that of three other ensemble classifiers designed to handle concept drift, using both artificial and real-world data streams. The comparative evaluation demonstrated the ability of ADES to handle different types of concept drift. The experimental results, including statistical test results, indicate performance comparable to that of other algorithms designed to handle concept drift and confirm the significance and effectiveness of ADES.


Information ◽  
2019 ◽  
Vol 10 (5) ◽  
pp. 158 ◽  
Author(s):  
Yange Sun ◽  
Han Shao ◽  
Shasha Wang

Most existing multi-label data stream classification methods focus on extending single-label stream classification approaches to multi-label cases, without considering the special characteristics of multi-label stream data, such as label dependency, concept drift, and recurrent concepts. Motivated by these challenges, we devise an efficient ensemble paradigm for multi-label data stream classification. The algorithm deploys a novel change detection method based on Jensen–Shannon divergence to identify different kinds of concept drift in data streams. Moreover, our method accounts for label dependency by pruning away infrequent label combinations to enhance classification performance. Empirical results on both synthetic and real-world datasets demonstrate its effectiveness.
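
Change detection based on Jensen–Shannon divergence can be illustrated with a small sketch. It is a generic example rather than the detector described in the abstract: it assumes drift is flagged by comparing the empirical distribution of label combinations in a reference window with that of a current window, and the `threshold` value of 0.1 is an arbitrary, hypothetical choice.

```python
import math
from collections import Counter


def distribution(window, support):
    """Empirical probabilities of each symbol in `support` within `window`."""
    counts = Counter(window)
    n = len(window)
    return [counts[s] / n for s in support]


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def drift_detected(reference, current, threshold=0.1):
    """Flag drift when the two windows' distributions diverge past threshold."""
    support = sorted(set(reference) | set(current))
    p = distribution(reference, support)
    q = distribution(current, support)
    return js_divergence(p, q) > threshold
```

Because the mixture `m` is positive wherever `p` or `q` is positive, the divergence stays finite even when a label combination appears in only one of the two windows, which is one reason it suits comparing windows of a stream.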


2021 ◽  
Vol 10 (6) ◽  
pp. 3361-3368
Author(s):  
Ibnu Daqiqil Id ◽  
Pardomuan Robinson Sihombing ◽  
Supratman Zakir

When predicting data streams, changes in data distribution may decrease model accuracy over time, thereby making the model obsolete. This phenomenon is known as concept drift. Detecting concept drifts and then adapting to them are critical operations for maintaining model performance. However, model adaptation can only be made if labeled data is available. Labeling data is both costly and time-consuming because it has to be done by humans. Only part of the data in a stream can be labeled because the data is massive and arrives at high speed. To solve these problems simultaneously, we apply a technique that updates the model using both labeled and unlabeled instances. The experimental results show that our proposed method can adapt to concept drift with pseudo-labels and maintain its accuracy even when label availability is drastically reduced from 95% to 5%. The proposed method also has the highest overall accuracy and outperforms other methods on 5 of 10 datasets.


Smart Cities ◽  
2021 ◽  
Vol 4 (1) ◽  
pp. 349-371
Author(s):  
Hassan Mehmood ◽  
Panos Kostakos ◽  
Marta Cortes ◽  
Theodoros Anagnostopoulos ◽  
Susanna Pirttikangas ◽  
...  

Real-world data streams pose a unique challenge to the implementation of machine learning (ML) models and data analysis. A notable problem that has been introduced by the growth of Internet of Things (IoT) deployments across the smart city ecosystem is that the statistical properties of data streams can change over time, resulting in poor prediction performance and ineffective decisions. While concept drift detection methods aim to patch this problem, emerging communication and sensing technologies are generating a massive amount of data, requiring distributed environments to perform computation tasks across smart city administrative domains. In this article, we implement and test a number of state-of-the-art active concept drift detection algorithms for time series analysis within a distributed environment. We use real-world data streams and provide critical analysis of results retrieved. The challenges of implementing concept drift adaptation algorithms, along with their applications in smart cities, are also discussed.


Author(s):  
Prasanna Lakshmi Kompalli

Data coming from different sources is referred to as data streams. Data stream mining is an online learning technique in which each data point must be processed as it arrives and discarded once processing is complete. Technological progress has made it possible to monitor these data streams in real time, which has created many new challenges for researchers. The main features of this type of data are that it is fast flowing, that it arrives in large, continuous, and growing volumes, and that its characteristics may change over time, which is termed concept drift. This chapter addresses the problems in mining data streams with concept drift. Because of these problems, isolating the relevant literature can be a grueling task for researchers and practitioners. This chapter tries to provide a solution by bringing together all the techniques used for data stream mining with concept drift.


2020 ◽  
Vol 8 (4) ◽  
pp. 63-73
Author(s):  
Sikha Bagui ◽  
Katie Jin

This survey performs a thorough enumeration and analysis of existing methods for data stream processing. It is a survey of the challenges facing streaming data. The challenges addressed are preprocessing of streaming data, detecting and dealing with concept drift in streaming data, data reduction in the face of data streams, and approximate queries and blocking operations in streaming data.


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Yange Sun ◽  
Meng Li ◽  
Lei Li ◽  
Han Shao ◽  
Yi Sun

Class imbalance and concept drift are two primary issues that exist concurrently in data stream classification. Although the two issues have each drawn considerable attention separately, their joint treatment remains largely unexplored. Moreover, the class imbalance issue is further complicated when data streams exhibit concept drift. A novel Cost-Sensitive based Data Stream (CSDS) classification method is introduced to overcome the two issues simultaneously. CSDS considers cost information during both data preprocessing and classification. During data preprocessing, a cost-sensitive learning strategy is introduced into the ReliefF algorithm to alleviate the class imbalance at the data level. In the classification process, a cost-sensitive weighting schema is devised to enhance the overall performance of the ensemble. In addition, a change detection mechanism is embedded in our algorithm, which ensures that the ensemble can capture and react to drift promptly. Experimental results validate that our method obtains better classification results under different imbalanced, concept-drifting data stream scenarios.


PLoS ONE ◽  
2021 ◽  
Vol 16 (10) ◽  
pp. e0258442
Author(s):  
Sean C. Epstein ◽  
Timothy J. P. Bray ◽  
Margaret A. Hall-Craggs ◽  
Hui Zhang

This paper proposes a task-driven computational framework for assessing diffusion MRI experimental designs which, rather than relying on parameter-estimation metrics, directly measures quantitative task performance. Traditional computational experimental design (CED) methods may be ill-suited to experimental tasks, such as clinical classification, where outcome does not depend on parameter-estimation accuracy or precision alone. Current assessment metrics evaluate experiments’ ability to faithfully recover microstructural parameters rather than their task performance. The method we propose addresses this shortcoming. For a given MRI experimental design (protocol, parameter-estimation method, model, etc.), experiments are simulated start-to-finish and task performance is computed from receiver operating characteristic (ROC) curves and associated summary metrics (e.g. area under the curve (AUC)). Two experiments were performed: first, a validation of the pipeline’s task performance predictions against clinical results, comparing in-silico predictions to real-world ROC/AUC; and second, a demonstration of the pipeline’s advantages over traditional CED approaches, using two simulated clinical classification tasks. Comparison with clinical datasets validates our method’s predictions of (a) the qualitative form of ROC curves, (b) the relative task performance of different experimental designs, and (c) the absolute performance (AUC) of each experimental design. Furthermore, we show that our method outperforms traditional task-agnostic assessment methods, enabling improved, more useful experimental design. Our pipeline produces accurate, quantitative predictions of real-world task performance. Compared to current approaches, such task-driven assessment is more likely to identify experimental designs that perform well in practice. 
Our method is not limited to diffusion MRI; the pipeline generalises to any task-based quantitative MRI application, and provides the foundation for developing future task-driven end-to-end CED frameworks.
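
The AUC summary metric mentioned above can be computed directly from classifier scores. The sketch below is a generic illustration, not the authors' pipeline: it uses the rank-based (Mann–Whitney) interpretation of AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The function name and inputs are hypothetical.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison of scores.

    scores_pos: scores assigned to positive (e.g. diseased) cases
    scores_neg: scores assigned to negative (e.g. healthy) cases
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0   # positive case correctly ranked higher
            elif sp == sn:
                wins += 0.5   # ties count as half a win
    return wins / (len(scores_pos) * len(scores_neg))
```

The pairwise form is quadratic in the number of cases, which is acceptable for small simulated cohorts; a sort-based rank computation would be preferable for large ones.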


Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Sanmin Liu ◽  
Shan Xue ◽  
Fanzhen Liu ◽  
Jieren Cheng ◽  
Xiulai Li ◽  
...  

Data stream classification has become a promising prediction task relevant to many practical environments. However, in the presence of concept drift and noise, research on data stream classification faces many challenges. Hence, a new incremental ensemble model is presented for classifying nonstationary data streams with noise. Our approach integrates three strategies: incremental learning to monitor and adapt to concept drift; ensemble learning to improve model stability; and a microclustering procedure that distinguishes drift from noise and predicts the labels of incoming instances via majority vote. Experiments with two synthetic datasets designed to test for both gradual and abrupt drift show that our method provides more accurate classification of nonstationary data streams with noise than two popular baselines.


1994 ◽  
Vol 116 (1) ◽  
pp. 19-29 ◽  
Author(s):  
J. P. Laible ◽  
D. Pflaster ◽  
B. R. Simon ◽  
M. H. Krag ◽  
M. Pope ◽  
...  

A three-dimensional finite element model for a poroelastic medium has been coupled with a least squares parameter estimation method for the purpose of assessing material properties based on intradiscal displacement and reactive forces. Parameter optimization may be based on either load or displacement control experiments. In this paper we present the basis of the finite element model and the parameter estimation process. The method is then applied to a test problem and the computational behavior is discussed. Sequential optimization on different parameter groups was found to have superior convergence properties. Some guidelines for choosing the starting parameter values for optimization were deduced by considering the form of the objective function. For load control experiments, in which displacement data is used for the optimization, the starting values for the elastic modulus should be lower in magnitude than an “anticipated” modulus. The permeability starting values should be higher than an anticipated permeability. For displacement control experiments, the reverse is true. The optimization scheme was also tested on data with random variations.

