Measuring the Effectiveness of Adaptive Random Forest for Handling Concept Drift in Big Data Streams

We are living in the age of big data, a majority of which is stream data. The real-time processing of this data requires careful consideration from different perspectives. Concept drift, a change in the data’s underlying distribution, is a significant issue, especially when learning from data streams. It requires learners to adapt to dynamic changes. Random forest is an ensemble approach that is widely used in classical non-streaming machine learning applications. Its streaming counterpart, the Adaptive Random Forest (ARF), is a stream learning algorithm that has shown promising results in terms of its accuracy and ability to deal with various types of drift. The continuity of the incoming instances allows their binomial distribution to be approximated by a Poisson(1) distribution. In this study, we propose a mechanism to increase the efficiency of such streaming algorithms by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects of online learning: accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value for ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tuning this resampling parameter. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that our proposed enhancement method yielded considerable improvement in most situations.
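
As a rough illustration of the resampling step discussed in this abstract, the following Python sketch trains each ensemble member on every incoming instance k times, with k drawn from Poisson(λ). The base learner (`MajorityClassLearner`), the function names, and the update counter are hypothetical stand-ins and not the ARF implementation. The abstract does not give the formula for ρ, so none is assumed here; the sketch only shows why a larger λ means more training work per instance.

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


class MajorityClassLearner:
    """Hypothetical placeholder base learner: predicts the heaviest label seen."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, y, weight):
        self.counts[y] += weight

    def predict_one(self):
        return self.counts.most_common(1)[0][0] if self.counts else None


def train_online_bagging(labels, n_members, lam, seed=0):
    """Feed a label stream to an ensemble, weighting each instance by Poisson(lam)."""
    rng = random.Random(seed)
    members = [MajorityClassLearner() for _ in range(n_members)]
    total_updates = 0  # crude proxy for training cost
    for y in labels:
        for member in members:
            k = poisson(lam, rng)
            if k > 0:  # k == 0 means this member skips the instance
                member.learn_one(y, k)
                total_updates += k
    return members, total_updates
```

Calling `train_online_bagging(stream, n_members=10, lam=1.0)` and again with a larger `lam` on the same stream gives a feel for the trade-off the abstract describes: more resampling per instance costs more updates, which any measure combining accuracy and execution time has to weigh.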

Anomalies Detection Using Isolation in Concept-Drifting Data Streams

Computers ◽

10.3390/computers10010013 ◽

2021 ◽

Vol 10 (1) ◽

pp. 13

Author(s):

Maurras Ulbricht Togbe ◽

Yousra Chabchoub ◽

Aliou Boly ◽

Mariam Barry ◽

Raja Chiky ◽

...

Keyword(s):

Anomaly Detection ◽

Half Space ◽

Data Streams ◽

Detection Efficiency ◽

Concept Drift ◽

Streaming Data ◽

Detection Methods ◽

Data Sets ◽

Stream Data ◽

Isolation Forest

Detecting anomalies in streaming data is an important issue for many application domains, such as cybersecurity, natural disasters, or bank fraud. Different approaches have been designed to detect anomalies: statistics-based, isolation-based, clustering-based, etc. In this paper, we present a structured survey of the existing anomaly detection methods for data streams, with an in-depth look at Isolation Forest (iForest). We first provide an implementation of Isolation Forest Anomalies detection in Stream Data (IForestASD), a variant of iForest for data streams. This implementation is built on top of scikit-multiflow (River), an open source machine learning framework for data streams that contains a single anomaly detection algorithm for data streams, called Streaming half-space trees. We performed experiments on different real and well-known data sets to compare the performance of our implementation of IForestASD with that of half-space trees. Moreover, we extended the IForestASD algorithm to handle drifting data by proposing three algorithms that involve two well-known drift detection methods: ADWIN and KSWIN. ADWIN is an adaptive sliding window algorithm for detecting change in a data stream. KSWIN, a more recent method, is the Kolmogorov–Smirnov Windowing method for concept drift detection. More precisely, we extended KSWIN to deal with n-dimensional data streams. We validated and compared all of the proposed methods on both real and synthetic data sets. In particular, we evaluated the F1-score, the execution time, and the memory consumption. The experiments show that our extensions have lower resource consumption than the original version of IForestASD, with similar or better detection efficiency.
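
To make the idea of Kolmogorov–Smirnov windowing on n-dimensional data concrete, here is a minimal Python sketch. It is not the authors' extension, which the abstract does not describe. It assumes one plausible approach: run a two-sample KS test per feature between a reference window and a recent window, and flag drift if any feature exceeds the large-sample critical value. The Bonferroni split of `alpha` across features and the use of full windows (rather than sampled ones) are assumptions made for illustration.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def drift_detected(reference, recent, alpha=0.01):
    """Flag drift if any feature's KS statistic exceeds its critical value.

    reference, recent: non-empty lists of equal-length feature vectors.
    """
    n, m = len(reference), len(recent)
    n_features = len(reference[0])
    per_feature_alpha = alpha / n_features  # Bonferroni correction (assumption)
    critical = (math.sqrt(-0.5 * math.log(per_feature_alpha / 2))
                * math.sqrt((n + m) / (n * m)))
    for f in range(n_features):
        d = ks_statistic([row[f] for row in reference],
                         [row[f] for row in recent])
        if d > critical:
            return True
    return False
```

In a streaming setting one would slide the recent window forward as instances arrive and, on detection, reset the reference window and retrain or discard the affected model; those steps are left out here.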

Big Data Management in the Context of Real-Time Data Warehousing

Big Data Management, Technologies, and Applications - Advances in Data Mining and Database Management ◽

10.4018/978-1-4666-4699-5.ch007 ◽

2013 ◽

pp. 150-176

Author(s):

M. Asif Naeem ◽

Gillian Dobbie ◽

Gerald Weber

Keyword(s):

Big Data ◽

Data Integration ◽

Real Time ◽

Real Life ◽

Skewed Distribution ◽

Stream Data ◽

Time Data ◽

Master Data ◽

Real Time Data ◽

Resource Aware

In order to make timely and effective decisions, businesses need the latest information from big data warehouse repositories. To keep these repositories up to date, real-time data integration is required. An important phase in real-time data integration is data transformation where a stream of updates, which is huge in volume and infinite, is joined with large disk-based master data. Stream processing is an important concept in Big Data, since large volumes of data are often best processed immediately. A well-known algorithm called Mesh Join (MESHJOIN) was proposed to process stream data with disk-based master data, which uses limited memory. MESHJOIN is a candidate for a resource-aware system setup. The problem that the authors consider in this chapter is that MESHJOIN is not very selective. In particular, the performance of the algorithm is always inversely proportional to the size of the master data table. As a consequence, the resource consumption is in some scenarios suboptimal. They present an algorithm called Cache Join (CACHEJOIN), which performs asymptotically at least as well as MESHJOIN but performs better in realistic scenarios, particularly if parts of the master data are used with different frequencies. In order to quantify the performance differences, the authors compare both algorithms with a synthetic dataset of a known skewed distribution as well as TPC-H and real-life datasets.
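
The observation that master data is used with different frequencies can be illustrated with a small Python sketch. This is not MESHJOIN or CACHEJOIN, whose internals the abstract does not spell out; it is a simplified stream-to-master lookup join with an in-memory LRU cache, where a plain dictionary stands in for the disk-based master table and `disk_reads` counts simulated disk accesses. All class and function names are hypothetical.

```python
from collections import OrderedDict


class CachedMasterLookup:
    """Looks up master rows, keeping the most recently used ones in memory."""

    def __init__(self, master, capacity):
        self.master = master        # stand-in for the disk-based master table
        self.capacity = capacity    # maximum number of cached rows
        self.cache = OrderedDict()
        self.disk_reads = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        self.disk_reads += 1             # cache miss: simulated disk access
        row = self.master.get(key)
        self.cache[key] = row
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used row
        return row


def join_stream(stream, lookup):
    """Join each (key, payload) stream tuple with its master row, if any."""
    for key, payload in stream:
        row = lookup.get(key)
        if row is not None:
            yield key, payload, row
```

With a skewed stream in which a few keys dominate, most lookups are served from the cache and `disk_reads` stays far below the stream length; with a uniform stream over many distinct keys the cache helps little, which mirrors the kind of scenario in which the abstract says the performance difference shows up.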

A Detailed Study on Classification Algorithms in Big Data

Big Data Analytics for Sustainable Computing - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-9750-6.ch002 ◽

2020 ◽

pp. 30-46

Author(s):

Saranya N. ◽

Saravana Selvam

Keyword(s):

Big Data ◽

Random Forest ◽

Linear Regression ◽

Comprehensive Evaluation ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Classification Methods ◽

Computing Science ◽

Data Collections

After an era of struggling with data collection, the problem today is how to process these vast amounts of information. Scientists and researchers alike consider Big Data to be probably the most essential topic in computing science today. The term Big Data describes huge volumes of data that can exist in any structure. This makes it difficult for standard processing approaches to mine useful information from such large data sets. Classification in Big Data is a procedure for grouping data sets based on various examples. Different classification frameworks help to classify data collections. The methods discussed in the chapter include Multi-Layer Perceptron, Linear Regression, C4.5, CART, J48, SVM, ID3, Random Forest, and KNN. The aim of this chapter is to provide a comprehensive evaluation of the classification methods that are commonly used.

Knowledge Discovery From Evolving Data Streams

Advances in Business Information Systems and Analytics - Machine Learning Techniques for Improved Business Analytics ◽

10.4018/978-1-5225-3534-8.ch002 ◽

2019 ◽

pp. 19-39

Author(s):

Prasanna Lakshmi Kompalli

Keyword(s):

Real Time ◽

Data Streams ◽

Data Stream ◽

Concept Drift ◽

Data Stream Mining ◽

Time Data ◽

Stream Mining ◽

New Challenges ◽

Mining Data Streams ◽

Different Sources

Data coming from different sources is referred to as data streams. Data stream mining is an online learning technique in which each data point must be processed as it arrives and discarded once processing is complete. Advances in technology have made it possible to monitor these data streams in real time. Data streams have created many new challenges for researchers working in real time. The main features of this type of data are that it is fast flowing, large in volume, continuous, and growing, and that its characteristics might change over the course of time, which is termed concept drift. This chapter addresses the problems in mining data streams with concept drift. Because of these problems, isolating the relevant literature can be a grueling task for researchers and practitioners. This chapter tries to provide a solution by bringing together the techniques used for data stream mining with concept drift.

Discrete Event Simulation and Real Time Locating Systems

International Journal of E-Adoption ◽

10.4018/jea.2012100102 ◽

2012 ◽

Vol 4 (4) ◽

pp. 16-28

Author(s):

T. Eugene Day ◽

Ajit N. Babu ◽

Steven M. Kymes ◽

Nathan Ravi

Keyword(s):

Health Care ◽

Real Time ◽

Discrete Event Simulation ◽

Medical Center ◽

Discrete Event ◽

Healthcare Delivery ◽

The United States ◽

Careful Consideration ◽

Health Administration ◽

Event Simulation

The Veterans Health Administration (VHA) is the largest integrated health care system in the United States, forming the arm of the Department of Veterans Affairs (VA) that delivers medical services. From a troubled past, the VHA today is regarded as a model for healthcare transformation. The VA has evaluated and adopted a variety of cutting-edge approaches to foster greater efficiency and effectiveness in healthcare delivery as part of its systems redesign initiative. This paper discusses the integration of two health care analysis platforms, Discrete Event Simulation (DES) and Real Time Locating Systems (RTLS), presenting examples of work done at the St. Louis VA Medical Center. Use of RTLS data for generation and validation of DES models is detailed, with prescriptive discussion of methodologies. The authors recommend the careful consideration of these relatively new approaches, which show promise in assisting systems redesign initiatives across the health care spectrum.

Data-driven decision support under concept drift in streamed big data

Complex & Intelligent Systems ◽

10.1007/s40747-019-00124-4 ◽

2019 ◽

Vol 6 (1) ◽

pp. 157-163 ◽

Cited By ~ 2

Author(s):

Jie Lu ◽

Anjin Liu ◽

Yiliao Song ◽

Guangquan Zhang

Keyword(s):

Decision Making ◽

Big Data ◽

Real Time ◽

Concept Drift ◽

High Volume ◽

Streaming Data ◽

Data Driven ◽

Research Directions ◽

Decision Outcomes ◽

Past Data

Data-driven decision-making (D³M) is often confronted by the problem of uncertainty or unknown dynamics in streaming data. To provide real-time accurate decision solutions, the systems have to promptly address changes in data distribution in streaming data, a phenomenon known as concept drift. Past data patterns may not be relevant to new data when a data stream experiences significant drift, thus to continue using models based on past data will lead to poor prediction and poor decision outcomes. This position paper discusses the basic framework and prevailing techniques in streaming type big data and concept drift for D³M. The study first establishes a technical framework for real-time D³M under concept drift and details the characteristics of high-volume streaming data. The main methodologies and approaches for detecting concept drift and supporting D³M are highlighted and presented. Lastly, further research directions, related methods and procedures for using streaming data to support decision-making in concept drift environments are identified. We hope the observations in this paper could support researchers and professionals to better understand the fundamentals and research directions of D³M in streamed big data environments.

Research on real-time outlier detection over big data streams

International Journal of Computers and Applications ◽

10.1080/1206212x.2017.1397388 ◽

2017 ◽

Vol 42 (1) ◽

pp. 93-101

Author(s):

Liangchen Chen ◽

Shu Gao ◽

Xiufeng Cao

Keyword(s):

Big Data ◽

Real Time ◽

Outlier Detection ◽

Data Streams ◽

Big Data Streams

Scalable real-time classification of data streams with concept drift

Future Generation Computer Systems ◽

10.1016/j.future.2017.03.026 ◽

2017 ◽

Vol 75 ◽

pp. 187-199 ◽

Cited By ~ 35

Author(s):

Mark Tennant ◽

Frederic Stahl ◽

Omer Rana ◽

João Bártolo Gomes

Keyword(s):

Real Time ◽

Data Streams ◽

Concept Drift ◽

Real Time Classification

Spatiotemporal Traffic Analysis using Big Data

International Journal of Advanced Information and Communication Technology ◽

10.46532/ijaict-2020010 ◽

2020 ◽

pp. 37-40

Author(s):

Anandakumar H ◽

Abishek Sailesh ◽

Muthumeenal C ◽

Visalakshi S ◽

Muthumani K

Keyword(s):

Random Forest ◽

Real Time ◽

Historical Data ◽

Learning Algorithm ◽

Ensemble Classifier ◽

Convergence Time ◽

Traffic Patterns ◽

Context Model ◽

Real Time Traffic ◽

Context Data

A collaborative online traffic prediction method is proposed, based on a distributed context-aware random forest learning algorithm. The random forest is an ensemble classifier that learns different traffic and context models from distributed traffic patterns. One major challenge in predicting traffic is how much to rely on a prediction model built from historical data in a real-time traffic situation, which may differ from the historical data because traffic situations are numerous and change over time. The proposed algorithm is an online predictor of real-time traffic, and the global prediction is achieved with less convergence time. The distributed scenarios (traffic data and context data) are collected together to improve the learning accuracy of the classifier. Experimental results on a traffic dataset show that the proposed algorithm significantly outperforms the existing algorithm.

An Analytical Model for Prediction of Heart Disease using Machine Learning Classifiers

10.36227/techrxiv.14867175 ◽

2021 ◽

Author(s):

Diti Roy ◽

Md. Ashiq Mahmood ◽

Tamal Joyti Roy

Keyword(s):

Machine Learning ◽

Heart Disease ◽

Random Forest ◽

Learning Algorithm ◽

Modern Technology ◽

Learning Approach ◽

Data Sets ◽

Machine Learning Classifiers ◽

Machine Learning Approach ◽

Day By Day

Heart disease is the most dominant disease, causing a large number of deaths every year. A WHO report from 2016 stated that at least 17 million people die of heart disease every year. This number is gradually increasing, and WHO estimated that the death toll will reach 75 million by 2030. Despite modern technology and health care systems, the ability to predict heart disease remains limited. Because machine learning algorithms are a vital means of making predictions from available data sets, we have used a machine learning approach to predict heart disease. We collected data from the UCI repository. In our study, we used the Random Forest, Zero R, Voted Perceptron, and K star classifiers. We obtained the best result with the Random Forest classifier, with an accuracy of 97.69.
