Stream Classification Algorithm Based on Decision Tree

With the rise of fields such as e-commerce platforms, a large volume of stream data has emerged. Incomplete labeling and concept drift in these data pose a major challenge to existing stream data classification methods. To address this, a dynamic stream data classification algorithm is proposed. For the incomplete labeling problem, the method introduces randomization and an iterative strategy on top of the very fast decision tree (VFDT) algorithm to design an iterative integration algorithm: each model's classification result is used as input to the next model, and new data are classified by a voting mechanism. At the same time, a window mechanism stores the data and computes the distribution characteristics of the data in the window; the size of the sliding window is then adjusted based on the computed result and the predicted amount of data. Experiments show the superiority of the algorithm in classification accuracy. The aim of the study is to compare different algorithms in order to evaluate whether the classification model adapts to the current data environment.
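For readers unfamiliar with VFDT, the split test at its core can be sketched as follows. This is a minimal illustration of the standard Hoeffding bound, not the iterative algorithm proposed in the abstract; the function names `hoeffding_bound` and `should_split` and the example values are assumptions chosen for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2 * n)).

    value_range (R) is the range of the split metric, delta is the allowed
    probability of error, and n is the number of examples seen at the leaf.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Hypothetical gains observed at a leaf after 1000 examples.
    print(should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000))
```

Because the bound shrinks as more examples arrive, a leaf that cannot be split confidently now may become splittable later without revisiting old data.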


A Review of Classification and Novel Class Detection Technique of Data Streams

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v3i2c.2891 ◽

2012 ◽

Vol 3 (2) ◽

pp. 314-316

Author(s):

Manish Rai ◽

Rekha Pandit

Keyword(s):

Machine Learning ◽

Data Streams ◽

Concept Drift ◽

Data Classification ◽

Classification Model ◽

Infinite Length ◽

Stream Data ◽

Machine Learning Technique ◽

Feature Evaluation ◽

Learning Technique

Stream data classification suffers from the problems of infinite length, concept evaluation, feature evaluation and data drift. Labeling data streams is more challenging than labeling static data because of several unique properties of data streams. Data streams are assumed to have infinite length, which makes it difficult to store and use all the historical data for training. Earlier multi-pass machine learning techniques cannot be applied directly to data streams. Data streams exhibit concept drift, which occurs when the underlying concept of the data changes over time. To address concept drift, a classification model must continuously adapt itself to the most recent concept. Various authors have reduced these problems using machine learning approaches and feature optimization techniques. In this paper we present various methods for reducing the problems that occur in stream data classification. We also discuss a machine learning technique for the feature evaluation process used in generating novel classes.


Novel Class Detection with Concept Drift in Data Stream - AhtNODE

International Journal of Distributed Systems and Technologies ◽

10.4018/ijdst.2020010102 ◽

2020 ◽

Vol 11 (1) ◽

pp. 15-26

Author(s):

Jay Gandhi ◽

Vaibhav Gandhi

Keyword(s):

Data Stream ◽

Concept Drift ◽

Ensemble Classifier ◽

Streaming Data ◽

Classification Model ◽

Infinite Length ◽

The Novel ◽

Stream Data ◽

Hoeffding Tree ◽

Discovery Method

Data stream mining has become an active research topic, and interest in data discovery methods is growing. Several applications rely on stream data processing, such as device networks and electronic networks. Our approach, AhtNODE (Adaptive Hoeffding Tree based NOvel class DEtection), detects novel classes in the presence of concept drift in streaming data. It addresses three challenges of streaming data: infinite length, concept drift, and concept evolution. The approach automatically detects a novel class whenever it arrives in the data stream. It is a multi-class approach that distinguishes a novel class from existing classes. The authors apply the Adaptive Hoeffding Tree (AHT) as the classification model, which is also used to handle concept drift. Previous approaches used an ensemble model to handle concept drift. In AHT, classification is done in a single pass. The experimental results demonstrate the effectiveness of AhtNODE compared to an existing ensemble classifier in terms of classification accuracy, speed, and memory use.


A Data Classification Model: For Effective Classification of Intrusion in an Intrusion Detection System Based on Decision Tree Learning Algorithm

Information and Communication Technology for Sustainable Development - Lecture Notes in Networks and Systems ◽

10.1007/978-981-10-3932-4_7 ◽

2017 ◽

pp. 61-66

Author(s):

Latika Mehrotra ◽

Prashant Sahai Saxena ◽

Nitika Vats Doohan

Keyword(s):

Intrusion Detection ◽

Decision Tree ◽

Intrusion Detection System ◽

Learning Algorithm ◽

Detection System ◽

Data Classification ◽

Classification Model ◽

Decision Tree Learning


Minority Resampling Based Ensemble Framework Using Enhanced Early Drift Detection Method For Imbalanced Data Streams

10.21203/rs.3.rs-141880/v1 ◽

2021 ◽

Author(s):

Priya S ◽

Annie Uthra

Keyword(s):

Data Streams ◽

Data Stream ◽

Detection Method ◽

Concept Drift ◽

Class Imbalance ◽

Current Data ◽

Classification Model ◽

Ensemble Classifiers ◽

K Nearest Neighbor ◽

Jaccard Similarity

As data mining applications grow in popularity, large volumes of data streams are generated over time. The main problem with data streams is that they exhibit a high degree of class imbalance and that the distribution of the data changes over time. In this paper, the Timely Drift Detection and Minority Resampling Technique (TDDMRT), based on K-nearest neighbors and Jaccard similarity, is proposed to handle class imbalance by finding the current ratio of class labels. The Enhanced Early Drift Detection Method (EEDDM) is proposed for detecting concept drift, and the Minority Resampling Method (KNN-JS) determines whether the current data stream should be regarded as imbalanced and resamples the minority instances in the drifting data stream. The K-nearest neighbors technique is used to resample the minority classes, and the Jaccard similarity measure is applied to the resampled data to generate synthetic data similar to the original data, which is then handled by ensemble classifiers. The proposed ensemble-based classification model outperforms existing over-sampling and under-sampling techniques, with an accuracy of 98.52%.
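The Jaccard similarity measure mentioned in the abstract can be sketched as follows. This shows only the standard set-based definition, |A ∩ B| / |A ∪ B|; how the paper applies it to resampled instances is not described here, so the function name `jaccard_similarity` and the encoding of instances as sets of feature values are assumptions for illustration.

```python
from typing import Hashable, Iterable


def jaccard_similarity(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Return |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        # Convention assumed here: two empty sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Hypothetical instances encoded as sets of active feature identifiers.
    original = {1, 2, 3}
    synthetic = {2, 3, 4}
    print(jaccard_similarity(original, synthetic))  # 2 shared out of 4 total
```

A value near 1 indicates that a synthetic instance shares most of its features with an original one; a value near 0 indicates little overlap.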


Imbalanced Data Classification Algorithm Based on Clustering and SVM

Journal of Circuits, Systems and Computers ◽

10.1142/s0218126621500365 ◽

2020 ◽

pp. 2150036

Author(s):

Bo Huang ◽

Yimin Zhu ◽

Zhongzhen Wang ◽

Zhijun Fang

Keyword(s):

Class Imbalance ◽

Imbalanced Data ◽

Data Classification ◽

Classification Algorithm ◽

Classification Model ◽

Imbalance Data ◽

Imbalance Problem ◽

Imbalanced Data Classification ◽

Under Sampling ◽

Feature Dimension

Class-imbalance learning is one of the most significant research topics in data mining and machine learning. The imbalance problem means that one class has many more samples than the other classes. To deal with the issues of low classification accuracy and high time complexity, this paper proposes a novel imbalanced data classification algorithm based on clustering and SVM. The algorithm under-samples the majority samples based on the distribution characteristics of the minority samples. First, specific clusters are detected by cluster analysis on the minority class. Second, a cluster boundary strategy is proposed to eliminate the adverse influence of noise samples. To construct a balanced dataset from imbalanced data, the paper proposes three principles for under-sampling the majority samples according to the characteristics of the samples in each cluster. Finally, the optimal classification model is obtained from a linear combination of hybrid-kernel SVMs. Experiments on datasets from the UCI and KEEL databases show that the algorithm effectively decreases the interference of noise samples. Compared with SMOTE and Fast-CBUS, the proposed algorithm not only reduces the feature dimension but also generally improves the precision of the minority classes under different labeled sample rates.
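As a point of reference for the under-sampling idea, the simplest baseline can be sketched as follows. This is plain random under-sampling, deliberately swapped in for the paper's cluster-guided principles, which the abstract does not specify; the function name `random_undersample` and the toy data are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def random_undersample(samples: Sequence, labels: Sequence[Hashable],
                       seed: int = 0) -> Tuple[List, List]:
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # rng.sample draws `target` distinct items without replacement.
        for x in rng.sample(xs, target):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


if __name__ == "__main__":
    # Toy dataset: 90 majority samples (label 0), 10 minority samples (label 1).
    xs = list(range(100))
    ys = [0] * 90 + [1] * 10
    bal_x, bal_y = random_undersample(xs, ys)
    print(len(bal_x), bal_y.count(0), bal_y.count(1))
```

Random selection ignores the data distribution, which is the weakness that cluster-guided under-sampling is meant to address.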
