Random Forest Based Approach for Concept Drift Handling

We are living in the age of big data, much of which is stream data. Processing this data in real time requires careful consideration from several perspectives. Concept drift, a change in the data’s underlying distribution, is a significant issue, especially when learning from data streams, because it requires learners to adapt to dynamic changes. Random forest is an ensemble approach that is widely used in classical non-streaming machine learning applications. Adaptive Random Forest (ARF), in turn, is a stream learning algorithm that has shown promising results in terms of accuracy and the ability to deal with various types of drift. Because instances arrive continuously, the binomial distribution used to resample them can be approximated by a Poisson(1) distribution. In this study, we propose a mechanism to increase the efficiency of such streaming algorithms by focusing on resampling. Our measure, resampling effectiveness (ρ), combines the two most essential aspects of online learning: accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value for ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tuning this resampling parameter. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that our proposed enhancement yields considerable improvement in most of the situations.
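The Poisson(λ) resampling step mentioned in this abstract can be illustrated with a minimal sketch of online bagging, in which each base learner sees every incoming instance k times, with k drawn from Poisson(λ). The function names, the use of Knuth's sampling method, and the λ values below are assumptions chosen for illustration, not the authors' implementation. The measure ρ is not defined in the abstract, so it is not computed here.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    # Multiply uniforms until the running product drops to e^(-lam) or below.
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def online_bagging_weights(n_learners, lam, rng):
    """Hypothetical helper: replication count per base learner for one instance."""
    return [poisson_sample(lam, rng) for _ in range(n_learners)]


def skipped_fraction(lam, n_instances=20000, seed=0):
    """Empirical share of instances that a single learner never trains on (k == 0)."""
    rng = random.Random(seed)
    skipped = sum(poisson_sample(lam, rng) == 0 for _ in range(n_instances))
    return skipped / n_instances


if __name__ == "__main__":
    rng = random.Random(42)
    print(online_bagging_weights(n_learners=5, lam=1.0, rng=rng))
    for lam in (1.0, 3.0, 6.0):  # illustrative values only
        print(lam, skipped_fraction(lam), math.exp(-lam))
```

For any λ, the probability that a learner skips an instance is e^(-λ), so with λ = 1 each learner ignores roughly 37% of the stream. Larger values of λ train each learner on more copies of each instance, which costs more time per instance; this is the accuracy versus execution time trade-off that tuning λ addresses.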


Calculating feature importance in data streams with concept drift using Online Random Forest

2014 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2014.7004352 ◽

2014 ◽

Cited By ~ 4

Author(s):

Andrew Phelps Cassidy ◽

Frank A. Deviney

Keyword(s):

Random Forest ◽

Data Streams ◽

Concept Drift ◽

Feature Importance


Modification of Random Forest Based Approach for Streaming Data with Concept Drift

Bulletin of the South Ural State University Series Mathematical Modelling Programming and Computer Software ◽

10.14529/mmp160408 ◽

2016 ◽

Vol 9 (4) ◽

pp. 86-95

Author(s):

A.V. Zhukov ◽

D.N. Sidorov

Keyword(s):

Random Forest ◽

Concept Drift ◽

Streaming Data


A Novel Drift Detection Algorithm Based on Features’ Importance Analysis in a Data Streams Environment

Journal of Artificial Intelligence and Soft Computing Research ◽

10.2478/jaiscr-2020-0019 ◽

2020 ◽

Vol 10 (4) ◽

pp. 287-298

Author(s):

Piotr Duda ◽

Krzysztof Przybyszewski ◽

Lipo Wang

Keyword(s):

Random Forest ◽

Data Streams ◽

Data Stream ◽

Concept Drift ◽

Ensemble Methods ◽

Real Data ◽

Relevant Information ◽

Detection Algorithm ◽

Important Indicator ◽

Features Importance

The training set consists of many features that influence the classifier to different degrees. Choosing the most important features and rejecting those that do not carry relevant information is of great importance to the operation of the learned model. In the case of data streams, the importance of the features may additionally change over time. Such changes affect the performance of the classifier but can also be an important indicator of concept drift. In this work, we propose a new algorithm for data stream classification, called Random Forest with Features Importance (RFFI), which uses the measure of features importance as a drift detector. The RFFI algorithm adapts solutions inspired by the Random Forest algorithm to data stream scenarios. The proposed algorithm combines the ability of ensemble methods to handle slow changes in a data stream with a new method for detecting the occurrence of concept drift. The work contains an experimental analysis of the proposed algorithm, carried out on synthetic and real data.
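The idea of treating a change in feature importance as a drift signal can be sketched as follows. This is not the RFFI algorithm, whose details are not given in the abstract. It is a simplified illustration that assumes binary labels, uses a stand-in importance score (the per-feature gap between class means) instead of a Random Forest importance, and flags drift when the L1 distance between two windows' importance vectors exceeds a hypothetical threshold.

```python
from statistics import mean


def proxy_importance(rows, labels):
    """Stand-in importance: |mean(x | y=1) - mean(x | y=0)| per feature,
    normalized to sum to 1. Both classes must be present in the window."""
    n_features = len(rows[0])
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    raw = [
        abs(mean(r[j] for r in pos) - mean(r[j] for r in neg))
        for j in range(n_features)
    ]
    total = sum(raw) or 1.0  # avoid division by zero if no feature separates
    return [v / total for v in raw]


def importance_shift(reference, current):
    """L1 distance between two importance vectors."""
    return sum(abs(a - b) for a, b in zip(reference, current))


def drift_detected(ref_window, cur_window, threshold=0.5):
    """Flag drift when importances move by more than `threshold` (assumed value)."""
    ref = proxy_importance(*ref_window)
    cur = proxy_importance(*cur_window)
    return importance_shift(ref, cur) > threshold


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    # In the first window only feature 0 separates the classes.
    window_a = ([(0.0, 5.0), (0.1, 5.0), (1.0, 5.0), (0.9, 5.0)], labels)
    # In the second window only feature 1 does.
    window_b = ([(5.0, 0.0), (5.0, 0.1), (5.0, 1.0), (5.0, 0.9)], labels)
    print(drift_detected(window_a, window_a))
    print(drift_detected(window_a, window_b))
```

In a streaming setting the reference window would be refreshed after each detected drift, and the importance score would come from the ensemble itself; both choices are left open here.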


Implementation of data mining as a support of business application strategy

Journal of Applied Information, Communication and Technology ◽

10.33555/ejaict.v5i1.49 ◽

2018 ◽

Vol 5 (1) ◽

pp. 47-55

Author(s):

Florensia Unggul Damayanti

Keyword(s):

Data Mining ◽

Random Forest ◽

Business Strategy ◽

Input Parameter ◽

Data Mining Algorithm ◽

Complex Data ◽

Business Decision ◽

Marketing Department ◽

Business Application ◽

Complex Data Sets

Data mining helps industries make intelligent decisions on complex problems. Data mining algorithms can be applied to data in order to forecast, identify patterns, derive rules and recommendations, analyze sequences in complex data sets, and retrieve fresh insights. Advances in technology and the variety of available data mining techniques give industries the opportunity to explore their data, gain valuable information from it, and use that information to support business decision making. This paper applies classification-based data mining to retrieve knowledge from customer databases in order to support the marketing department in planning a strategy for predicting plan premium. The dataset is decomposed through conceptual analysis to identify characteristic data that can be used as input parameters of the data mining model. Business decisions and applications are characterized by processing step, processing characteristic, and processing outcome (Seng, J.L., Chen T.C. 2010). This paper sets up data mining experiments based on the J48 and Random Forest classifiers and sheds light on the comparative performance of J48 and Random Forest on a dataset from the insurance industry. The experimental results cover the classification accuracy and efficiency of J48 and Random Forest, and also identify the attributes most useful for predicting plan premium in the context of strategic planning to support business strategy.


Database-Driven Modeling based on Variable Selection using Random Forest and Its Application for Linear Air Fuel Ratio Sensor Output Prediction

IEEJ Transactions on Electronics Information and Systems ◽

10.1541/ieejeiss.139.850 ◽

2019 ◽

Vol 139 (8) ◽

pp. 850-857

Author(s):

Hiromu Imaji ◽

Takuya Kinoshita ◽

Toru Yamamoto ◽

Keisuke Ito ◽

Masahiro Yoshida ◽

...

Keyword(s):

Random Forest ◽

Variable Selection ◽

Sensor Output ◽

Fuel Ratio


Multiple fault diagnosis for hydraulic systems using Nearest-centroid-with-DBA and Random-Forest-based-time-series-classification

2020 39th Chinese Control Conference (CCC) ◽

10.23919/ccc50068.2020.9189401 ◽

2020 ◽

Author(s):

Zhijie Peng ◽

Ke Zhang ◽

Yi Chai

Keyword(s):

Time Series ◽

Fault Diagnosis ◽

Random Forest ◽

Time Series Classification ◽

Hydraulic Systems ◽

Multiple Fault ◽

Multiple Fault Diagnosis


Research on Prediction Method of Finish Rolling Power Consumption of Multi-Specific Strip Steel Based on Random Forest Optimization Model

2020 39th Chinese Control Conference (CCC) ◽

10.23919/ccc50068.2020.9188937 ◽

2020 ◽

Author(s):

XIAO Xiong ◽

DENG Daoming ◽

XIAO Yuxiong ◽

GUO Qiang ◽

ZHANG Yongjun

Keyword(s):

Random Forest ◽

Power Consumption ◽

Optimization Model ◽

Prediction Method ◽

Strip Steel ◽

Finish Rolling


Random Forest: A Review

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse/v7i1/01113 ◽

2017 ◽

Vol 7 (1) ◽

pp. 251-257 ◽

Cited By ~ 28

Author(s):

Eesha Goel ◽

Er. Abhilasha

Keyword(s):

Random Forest


Random Forest Refinement of Pairwise Potentials for Protein-ligand Decoy Detection

10.26434/chemrxiv.8047820.v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jun Pei ◽

Zheng Zheng ◽

Hyunji Kim ◽

Lin Song ◽

Sarah Walworth ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Probability Function ◽

Pair Potential ◽

Scoring Function ◽

Stable Structure ◽

Scoring Functions ◽

Atom Pair ◽

Data Set ◽

Atom Pairs

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relative importance of each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance of each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF6) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept, and the resultant RF models were tested on CASF-2013. In a comparison of the performance of our RF models against 29 scoring functions, we found that our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificially designed potential models to assess the importance of the GARF potential in the RF models: (1) a scrambled probability function set, which was obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which shares the same peak positions with GARF but has fixed peak heights. The accuracy comparison of RF models based on the scrambled, uniform, and original GARF potentials clearly showed that the peak positions in the GARF potential are important while the well depths are not.
