Concept Drift Identification using Classifier Ensemble Approach

Leena Deshpande; M. Narsing Rao

doi:10.11591/ijece.v8i1.pp19-25

Concept Drift Identification using Classifier Ensemble Approach

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v8i1.pp19-25 ◽

2018 ◽

Vol 8 (1) ◽

pp. 19-25

Author(s):

Leena Deshpande ◽

M. Narsing Rao

Keyword(s):

Concept Drift ◽

Data Distribution ◽

Ensemble Classifier ◽

Classification Model ◽

Data Sets ◽

Ensemble Approach ◽

Telecommunication Systems ◽

Traditional Classification ◽

Financial Domain ◽

New Feature

In internetworking systems, a huge amount of data is scattered, generated, and processed over the network. Data mining techniques are used to discover unknown patterns in the underlying data. A traditional classification model classifies data based on past labelled data. However, in many current applications data is increasing in size with fluctuating patterns, and as a result new features may arrive in the data. This occurs in many applications, such as sensor networks, banking and telecommunication systems, the financial domain, and electricity usage and prices driven by demand and supply. Such a change in data distribution reduces classification accuracy: some patterns may be discovered as frequent, while other patterns tend to disappear and are wrongly classified. Traditional classification techniques may not be suitable for mining such data, because the distribution generating the items can change over time, so data from the past may become irrelevant or even false for the current prediction. To handle such varying patterns of data, a concept drift mining approach is used to improve the accuracy of classification techniques. In this paper we propose an ensemble approach for improving classifier accuracy. The ensemble classifier is applied to 3 different data sets. We investigated different features for different chunks of data, which are then given to the ensemble classifier. We observed that the proposed approach improves classifier accuracy for different chunks of data.
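
The abstract does not spell out how the ensemble is built, so the following is only a minimal sketch of the general chunk-based idea: train one base learner per chunk, keep the most recent few, and classify by majority vote. The class names `NearestCentroid` and `ChunkEnsemble`, the choice of base learner, and the `max_members` parameter are illustrative assumptions, not the authors' method.

```python
from collections import Counter, deque


class NearestCentroid:
    """Tiny base learner (assumed for illustration): predicts the class
    whose mean feature vector is closest to the input row."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, v in enumerate(row):
                acc[i] += v
        self.centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
        return self

    def predict(self, row):
        def dist2(c):
            return sum((a - b) ** 2 for a, b in zip(row, self.centroids[c]))
        return min(self.centroids, key=dist2)


class ChunkEnsemble:
    """Keeps the classifiers trained on the most recent chunks and
    combines them by unweighted majority vote."""

    def __init__(self, max_members=3):
        # Oldest members are dropped automatically once the deque is full.
        self.members = deque(maxlen=max_members)

    def add_chunk(self, X, y):
        self.members.append(NearestCentroid().fit(X, y))

    def predict(self, row):
        votes = Counter(m.predict(row) for m in self.members)
        return votes.most_common(1)[0][0]
```

Because old members are evicted as new chunks arrive, the vote gradually follows the current distribution; how quickly it does so depends on the assumed `max_members` window.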

Concepts Seeds Gathering and Dataset Updating Algorithm for Handling Concept Drift

International Journal of Decision Support System Technology ◽

10.4018/ijdsst.2015040103 ◽

2015 ◽

Vol 7 (2) ◽

pp. 29-57 ◽

Cited By ~ 1

Author(s):

Nabil M. Hewahi ◽

Ibrahim M. Elbouhissi

Keyword(s):

Classification Accuracy ◽

Data Stream ◽

Concept Drift ◽

Data Distribution ◽

Classification Model ◽

Classification Models ◽

Comparison Study ◽

New Approach ◽

New Concepts ◽

Traditional Classification

In data mining, the phenomenon of change in data distribution over time is known as concept drift. In this research, the authors introduce a new approach called the Concepts Seeds Gathering and Dataset Updating algorithm (CSG-DU), which gives traditional classification models the ability to adapt to and cope with concept drift as time passes. CSG-DU is concerned with discovering new concepts in data streams and aims to increase the classification accuracy of any classification model when changes occur in the underlying concepts. The proposed approach has been tested using synthetic and real datasets. The experiments conducted show that after applying the authors' approach, the classification accuracy increased from low values to high and acceptable ones. Finally, a comparison study between CSG-DU and the Set Formation for Delayed Labeling algorithm (SFDL) has been conducted; SFDL is an approach that handles sudden and gradual concept drift. CSG-DU outperforms SFDL in terms of classification accuracy.

An http data-federation ecosystem with caching functionality using DPM and Dynafed

EPJ Web of Conferences ◽

10.1051/epjconf/201921404041 ◽

2019 ◽

Vol 214 ◽

pp. 04041 ◽

Cited By ~ 1

Author(s):

Davide Michelino ◽

Silvio Pardi ◽

Guido Russo ◽

Bernardino Spisso

Keyword(s):

Data Distribution ◽

Controlled Environment ◽

Data Sets ◽

Preliminary Results ◽

Computing Model ◽

Data Federation ◽

Belle Ii ◽

New Feature ◽

Grid Storage ◽

Computing Infrastructure

The implementation of cache systems in the computing model of HEP experiments accelerates scientists' access to hot data sets, opening new scenarios of data distribution and making it possible to exploit the paradigm of storage-less sites. In this work, we present a study for the creation of an http data-federation ecosystem with caching functionality. We created a plug-in, integrated into the logic of a DPM storage, that reproduces cache behaviour by taking advantage of a new feature introduced in the latest version of Disk Pool Manager, called volatile-pool. We then used Dynafed as a lightweight federation system to aggregate a set of standard Grid storage systems together with the caching system. With the designed setup, clients asking for a file present on the Data-Grid are automatically redirected to the cache, if the cache is the closest storage, thanks to the geo-plugin run by Dynafed. As a proof of concept, we tested the whole system in a controlled environment within the Belle II computing infrastructure using a set of files located in production Storage Elements. Preliminary results demonstrate that the logic functions properly and encourage continuing the work.

Design of adaptive ensemble classifier for online sentiment analysis and opinion mining

PeerJ Computer Science ◽

10.7717/peerj-cs.660 ◽

2021 ◽

Vol 7 ◽

pp. e660

Author(s):

Sanjeev Kumar ◽

Ravendra Singh ◽

Mohammad Zubair Khan ◽

Abdulfattah Noorwali

Keyword(s):

Sentiment Analysis ◽

False Positive ◽

Opinion Mining ◽

Negative Impact ◽

Concept Drift ◽

Data Distribution ◽

Detection Algorithm ◽

Ensemble Classifier ◽

Detection Algorithms ◽

Highly Sensitive

Data stream mining is a challenging task for researchers because of the change in data distribution during classification, known as concept drift. Drift detection algorithms focus on detecting the drift. A drift detection algorithm needs to be very sensitive to changes in data distribution in order to detect the maximum number of drifts in the data stream, but highly sensitive drift detectors lead to more false-positive drift detections. This paper proposes a Drift Detection-based Adaptive Ensemble classifier for sentiment analysis and opinion mining, which turns these false-positive drift detections to its benefit and minimizes the negative impact of false-positive drift detection signals. The proposed method creates and adds a new classifier to the ensemble whenever a drift happens. A weighting mechanism assigns a weight to each classifier in the ensemble, and this weight determines the contribution of each classifier to the final classification result. The experiments are performed using different classification algorithms, and the results are evaluated on accuracy, precision, recall, and F1-measure. The proposed method is also compared with the following state-of-the-art methods: OzaBaggingADWINClassifier, Accuracy Weighted Ensemble, Additive Expert Ensemble, Streaming Random Patches, and Adaptive Random Forest Classifier. The results show that the proposed method handles both true-positive and false-positive drifts efficiently.
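
The mechanism described above (add a member on every drift signal, then let weights decide each member's influence) can be sketched as follows. This is an illustrative sketch only: the abstract does not give the weighting rule, so the multiplicative penalty `beta`, the class name `WeightedEnsemble`, and its method names are assumptions, and the drift detector itself is left external.

```python
from collections import defaultdict


class WeightedEnsemble:
    """Members are callables mapping an input to a label. Weights follow a
    simple multiplicative-penalty rule (an assumption, not the paper's rule)."""

    def __init__(self, beta=0.5):
        self.beta = beta      # factor applied to a member's weight when it errs (assumed)
        self.members = []     # list of [classifier, weight] pairs

    def on_drift(self, classifier):
        """Called by an external drift detector: add a fresh member with weight 1."""
        self.members.append([classifier, 1.0])

    def predict(self, x):
        # Each member votes for its predicted label with its current weight.
        scores = defaultdict(float)
        for clf, weight in self.members:
            scores[clf(x)] += weight
        return max(scores, key=scores.get)

    def update(self, x, y_true):
        """Once the true label arrives, shrink the weight of members that were wrong."""
        for member in self.members:
            if member[0](x) != y_true:
                member[1] *= self.beta
```

Under this rule a member added on a false-positive signal does little harm: if it performs no better than the existing members, its weight decays like any other, while a member added on a true drift quickly comes to dominate the vote.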

Novel Class Detection with Concept Drift in Data Stream - AhtNODE

International Journal of Distributed Systems and Technologies ◽

10.4018/ijdst.2020010102 ◽

2020 ◽

Vol 11 (1) ◽

pp. 15-26

Author(s):

Jay Gandhi ◽

Vaibhav Gandhi

Keyword(s):

Data Stream ◽

Concept Drift ◽

Ensemble Classifier ◽

Streaming Data ◽

Classification Model ◽

Infinite Length ◽

The Novel ◽

Stream Data ◽

Hoeffding Tree ◽

Discovery Method

Data stream mining has become an active research topic and is of growing interest as a data discovery method. Several applications rely on stream data processing, such as device networks and electronic networks. Our approach, AhtNODE (Adaptive Hoeffding Tree based NOvel class DEtection), detects novel classes in the presence of concept drift in streaming data. It addresses three challenges of streaming data: infinite length, concept drift, and concept evolution. This approach automatically detects a novel class whenever it arrives in the data stream. It is a multi-class approach that distinguishes novel classes from existing classes. The authors apply the Adaptive Hoeffding Tree (AHT) as a classification model that also handles concept drift. Previous approaches used an ensemble model to handle concept drift. In AHT, classification is done in a single pass. The experimental results show the effectiveness of AhtNODE compared to existing ensemble classifiers in terms of classification accuracy, speed, and memory use.
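
For context, Hoeffding trees can learn in a single pass because they use the Hoeffding bound to decide when enough instances have been seen to commit to a split. The sketch below shows only that standard bound and split test; it does not reproduce AhtNODE's novel class detection or the adaptive drift handling, and the function names are illustrative.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the observed mean of n
    samples is within epsilon of the true mean of a variable with the given range."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split once the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

The bound shrinks as `n` grows, so a leaf that cannot yet separate two close candidates simply waits for more instances instead of storing or re-reading the stream.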

Number of Instances for Reliable Feature Ranking in a Given Problem

Business Systems Research Journal ◽

10.2478/bsrj-2018-0017 ◽

2018 ◽

Vol 9 (2) ◽

pp. 35-44

Author(s):

Marko Bohanec ◽

Mirjana Kljajić Borštnar ◽

Marko Robnik-Šikonja

Keyword(s):

Confidence Intervals ◽

Missing Values ◽

Classification Model ◽

Low Rank ◽

Data Sets ◽

Feature Evaluation ◽

Evaluation Measure ◽

New Feature ◽

The Impact

Background: In practical use of machine learning models, users may add new features to an existing classification model, reflecting their (changed) empirical understanding of a field. New features potentially increase classification accuracy of the model or improve its interpretability. Objectives: We introduce a guideline for determining the sample size needed to reliably estimate the impact of a new feature. Methods/Approach: Our approach is based on the feature evaluation measure ReliefF and the bootstrap-based estimation of confidence intervals for feature ranks. Results: We test our approach using real-world qualitative business-to-business sales forecasting data and two UCI data sets, one with missing values. The results show that new features with a high or a low rank can be detected using a relatively small number of instances, but features ranked near the border of useful features need larger samples to determine their impact. Conclusions: A combination of the feature evaluation measure ReliefF and the bootstrap-based estimation of confidence intervals can be used to reliably estimate the impact of a new feature in a given problem.
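
The bootstrap step can be illustrated with a short sketch. Note the substitution: instead of ReliefF, it uses a much simpler stand-in score (the gap between class means of a feature) so that the example stays self-contained. The function names, the binary 0/1 labels, and the percentile-interval construction are assumptions made for illustration.

```python
import random


def mean_gap_score(rows, labels, j):
    """Stand-in relevance score (not ReliefF): absolute difference between
    the mean of feature j in class 1 and its mean in class 0."""
    pos = [r[j] for r, y in zip(rows, labels) if y == 1]
    neg = [r[j] for r, y in zip(rows, labels) if y == 0]
    if not pos or not neg:
        return 0.0  # a resample may contain only one class
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def rank_interval(rows, labels, feature, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the rank of one feature (1 = best)."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    ranks = []
    for _ in range(n_boot):
        # Resample instances with replacement and re-rank all features.
        idx = [rng.randrange(n) for _ in range(n)]
        sample = [rows[i] for i in idx]
        sample_y = [labels[i] for i in idx]
        scores = [mean_gap_score(sample, sample_y, j) for j in range(d)]
        order = sorted(range(d), key=lambda j: -scores[j])
        ranks.append(order.index(feature) + 1)
    ranks.sort()
    low = ranks[int((alpha / 2) * n_boot)]
    high = ranks[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return low, high
```

A narrow interval suggests the available instances are enough to place the feature; a wide interval, which the abstract reports for features near the border of usefulness, suggests more instances are needed.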

Concept Drift and Evolution Detection in Fusion Diagnosis With Evolving Data Streams

Volume 2A: 43rd Design Automation Conference ◽

10.1115/detc2017-68373 ◽

2017 ◽

Author(s):

Amirmahyar Abdolsamadi ◽

Pingfeng Wang

Keyword(s):

Data Streams ◽

Concept Drift ◽

Data Distribution ◽

Streaming Data ◽

Majority Voting ◽

Classification Model ◽

Engineering System ◽

Concept Evolution ◽

Adaptive Fusion

Health diagnosis interprets data streams acquired by smart sensors and makes inferences about the health condition of an engineering system, thereby supporting critical operational decisions. A data stream is a continuous flow of data that poses several challenges for data mining. This paper addresses concept drift and concept evolution as two major challenges in the classification of streaming data. Concept drift occurs as a result of data distribution changes. Concept evolution happens when new classes appear in the stream. These changes may degrade classification results over time. This paper presents an adaptive fusion learning approach to build a robust classification model. The proposed approach consists of three steps: (i) a fusion formulation using weighted majority voting, (ii) active learning to request labels selectively instead of querying for all true labels, and (iii) a distance-based approach to monitoring the movement of the data distribution. A diagnosis case study is used to demonstrate the developed fusion diagnosis methodology.
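
Step (iii) can be pictured with a very small sketch. The abstract does not say which distance is used, so the example below assumes the Euclidean distance between the centroids of a reference window and the current window, compared against a user-chosen threshold; the function names and the threshold are hypothetical.

```python
import math


def centroid(window):
    """Mean feature vector of a window of equally sized rows."""
    n = len(window)
    return [sum(col) / n for col in zip(*window)]


def distribution_shift(reference, current):
    """Euclidean distance between the centroids of two windows (assumed metric)."""
    return math.dist(centroid(reference), centroid(current))


def drift_detected(reference, current, threshold):
    """Flag drift when the centroid has moved further than the chosen threshold."""
    return distribution_shift(reference, current) > threshold
```

A centroid distance only captures shifts in the mean; changes in spread or shape would need a richer statistic.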

Learning emotional word embeddings for sentiment analysis

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-201993 ◽

2021 ◽

pp. 1-13

Author(s):

Qingtian Zeng ◽

Xishi Zhao ◽

Xiaohui Hu ◽

Hua Duan ◽

Zhongying Zhao ◽

...

Keyword(s):

Sentiment Analysis ◽

Language Processing ◽

State Of The Art ◽

Research Problem ◽

Emotional Word ◽

Classification Model ◽

Data Sets ◽

Word Embeddings ◽

Real World Data ◽

Text Documents

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis. The method first applies pre-trained word vectors to represent document features using two different linear weighting methods. The resulting document vectors are then input to a classification model and used to train a text sentiment classifier based on a neural network. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.

Measuring the Effectiveness of Adaptive Random Forest for Handling Concept Drift in Big Data Streams

Entropy ◽

10.3390/e23070859 ◽

2021 ◽

Vol 23 (7) ◽

pp. 859

Author(s):

Abdulaziz O. AlQabbany ◽

Aqil M. Azmi

Keyword(s):

Big Data ◽

Random Forest ◽

Real Time ◽

Data Streams ◽

Learning Algorithm ◽

Concept Drift ◽

The United States ◽

Careful Consideration ◽

Data Sets ◽

Stream Data

We are living in the age of big data, a majority of which is stream data. The real-time processing of this data requires careful consideration from different perspectives. Concept drift, a change in the data’s underlying distribution, is a significant issue, especially when learning from data streams. It requires learners to be adaptive to dynamic changes. Random forest is an ensemble approach that is widely used in classical non-streaming settings of machine learning applications. The Adaptive Random Forest (ARF), in turn, is a stream learning algorithm that has shown promising results in terms of its accuracy and ability to deal with various types of drift. The continuity of the incoming instances allows their binomial distribution to be approximated by a Poisson(1) distribution. In this study, we propose a mechanism to increase the efficiency of such streaming algorithms by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects of online learning: accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value for ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tackling this important aspect. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that our proposed enhancement method yields considerable improvement in most situations.
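
The resampling step that the parameter λ controls can be sketched as follows: in online bagging-style ensembles, each incoming instance is shown to each ensemble member k times, with k drawn from Poisson(λ). The sampler below uses Knuth's classical method with only the standard library. The function names are illustrative, and the sketch covers only the resampling, not ARF itself or the ρ measure.

```python
import math
import random


def poisson(lam, rng):
    """Knuth's method: multiply uniform draws until the product drops
    to exp(-lam) or below; the number of draws before that is the sample."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def resample_weights(n_members, lam, rng):
    """How many times each ensemble member trains on the current instance."""
    return [poisson(lam, rng) for _ in range(n_members)]
```

A larger λ makes each member train on each instance more often, which tends to cost execution time in exchange for faster learning; that is the accuracy versus time trade-off the abstract's measure ρ is meant to capture.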

Bayesian Classifier for Sparsity-Promoting Feature Selection

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001415500226 ◽

2015 ◽

Vol 29 (06) ◽

pp. 1550022 ◽

Cited By ~ 1

Author(s):

Danlei Xu ◽

Lan Du ◽

Hongwei Liu ◽

Penghui Wang

Keyword(s):

Feature Selection ◽

Synthetic Data ◽

Original Data ◽

Radar Data ◽

Bayesian Classifier ◽

Classification Model ◽

Data Sets ◽

Data Set ◽

Classification Boundary ◽

Nonlinear Mappings

A Bayesian classifier for sparsity-promoting feature selection is developed in this paper, where a set of nonlinear mappings of the original data is applied as a pre-processing step. With such mappings from the original input space to a nonlinear transformation space, the linear classification model can not only construct a nonlinear classification boundary but also perform feature selection on the original data. A zero-mean Gaussian prior with Gamma precision and a finite approximation of the Beta process prior are used to promote sparsity in the utilization of features and nonlinear mappings in our model, respectively. We derive the Variational Bayesian (VB) inference algorithm for the proposed linear classifier. Experimental results on a synthetic data set, a measured radar data set, a high-dimensional gene expression data set, and several benchmark data sets demonstrate the aggressive and robust feature selection capability of our method and classification accuracy comparable to that of some other existing classifiers.

Learning from Ontology Streams with Semantic Concept Drift

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/133 ◽

2017 ◽

Cited By ~ 7

Author(s):

Jiaoyan Chen ◽

Freddy Lecue ◽

Jeff Z. Pan ◽

Huajun Chen

Keyword(s):

Semantic Web ◽

Data Stream ◽

Concept Drift ◽

Data Distribution ◽

Accurate Prediction ◽

Knowledge Structures ◽

Semantic Concept ◽

Web Data ◽

Semantic Inference

Data stream learning has been widely studied for extracting knowledge structures from continuous and rapid data records. In the semantic Web, data is interpreted in ontologies and its ordered sequence is represented as an ontology stream. Our work exploits the semantics of such streams to tackle the problem of concept drift, i.e., unexpected changes in data distribution that cause most models to become less accurate as time passes. To this end we revisit (i) semantic inference in the context of supervised stream learning, and (ii) models with semantic embeddings. The experiments show accurate prediction with data from Dublin and Beijing.
