Concept Drift for Big Data

Author(s):  
Raihan Seraj ◽  
Mohiuddin Ahmed
Keyword(s):  
Big Data

Entropy ◽  
2021 ◽  
Vol 23 (7) ◽  
pp. 859
Author(s):  
Abdulaziz O. AlQabbany ◽  
Aqil M. Azmi

We are living in the age of big data, a majority of which arrives as stream data. Real-time processing of this data requires careful consideration from different perspectives. Concept drift, a change in the data's underlying distribution, is a significant issue, especially when learning from data streams, and it requires learners to adapt to dynamic changes. Random forest is an ensemble approach that is widely used in classical, non-streaming machine learning applications, while the Adaptive Random Forest (ARF) is a stream learning algorithm that has shown promising results in terms of accuracy and the ability to deal with various types of drift. Because instances arrive continuously, the binomial distribution governing their resampling can be approximated by a Poisson(1) distribution. In this study, we propose a mechanism to increase the efficiency of such streaming algorithms by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects of online learning: accuracy and execution time. We use six different synthetic data sets, each with a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value of ρ. By comparing the standard ARF with its tuned variants, we show that ARF performance can be enhanced by addressing this important aspect. Finally, we present three case studies from different contexts to test the proposed enhancement and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that the proposed enhancement yields considerable improvement in most situations.
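
The Poisson(λ) resampling the abstract describes can be sketched in a few lines. The snippet below is a minimal illustration of online-bagging-style resampling in the spirit of ARF, not the authors' implementation: the ensemble size, the use of scikit-learn's SGDClassifier as a stand-in incremental learner (ARF itself uses Hoeffding trees), and the particular λ value are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier  # stand-in incremental learner (assumption)

rng = np.random.default_rng(0)
classes = np.array([0, 1])
ensemble = [SGDClassifier() for _ in range(10)]

def update_ensemble(x, y, lam=1.0):
    """Online-bagging-style update: each base learner trains on the incoming
    instance k times, where k ~ Poisson(lam). lam = 1 approximates bootstrap
    resampling in a stream; tuning lam is the knob whose accuracy/runtime
    trade-off the resampling-effectiveness measure (rho) evaluates."""
    for learner in ensemble:
        k = rng.poisson(lam)                       # per-learner resampling weight
        for _ in range(k):                         # replay the instance k times
            learner.partial_fit([x], [y], classes=classes)

# Simulated stream: instances are processed one at a time, in arrival order.
for _ in range(1000):
    x = rng.normal(size=2)
    y = int(x[0] + x[1] > 0)
    update_ensemble(x, y, lam=6.0)                 # hypothetical, empirically tuned lambda
```

Drawing the per-learner weight from Poisson(λ) keeps the update cost bounded per instance, which is why λ directly shapes the accuracy/execution-time balance that ρ measures.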


2019 ◽  
Vol 6 (1) ◽  
pp. 157-163 ◽  
Author(s):  
Jie Lu ◽  
Anjin Liu ◽  
Yiliao Song ◽  
Guangquan Zhang

Abstract: Data-driven decision-making ($\mathrm{D^3}$M) is often confronted with uncertainty or unknown dynamics in streaming data. To provide accurate decision solutions in real time, systems have to promptly address changes in the data distribution of streaming data, a phenomenon known as concept drift. Past data patterns may not be relevant to new data when a data stream experiences significant drift, so continuing to use models based on past data leads to poor predictions and poor decision outcomes. This position paper discusses the basic framework and prevailing techniques for streaming big data and concept drift in $\mathrm{D^3}$M. The study first establishes a technical framework for real-time $\mathrm{D^3}$M under concept drift and details the characteristics of high-volume streaming data. The main methodologies and approaches for detecting concept drift and supporting $\mathrm{D^3}$M are then highlighted and presented. Lastly, further research directions, related methods, and procedures for using streaming data to support decision-making in concept-drift environments are identified. We hope the observations in this paper will help researchers and professionals better understand the fundamentals and research directions of $\mathrm{D^3}$M in streamed big data environments.
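
To make "detecting concept drift" concrete, the sketch below shows one of the simplest window-based detectors: it compares the deployed model's error rate over a recent window with a reference window and flags drift when the gap exceeds a threshold. This is a generic illustration with assumed window size and threshold, not the framework proposed in the paper.

```python
from collections import deque
from statistics import mean

class TwoWindowDriftDetector:
    """Minimal window-based drift detector (illustrative only): flag drift when
    the recent error rate exceeds the reference error rate by a fixed margin."""

    def __init__(self, window=200, threshold=0.1):
        self.reference = deque(maxlen=window)   # older prediction errors
        self.recent = deque(maxlen=window)      # most recent prediction errors
        self.threshold = threshold

    def add_error(self, err):
        """err is 1 if the deployed model misclassified the instance, else 0."""
        if len(self.recent) == self.recent.maxlen:
            # The error about to fall out of the recent window becomes reference history.
            self.reference.append(self.recent[0])
        self.recent.append(err)

    def drift_detected(self):
        if len(self.reference) < self.reference.maxlen:
            return False                        # not enough history yet
        return mean(self.recent) - mean(self.reference) > self.threshold
```

On a detection, a $\mathrm{D^3}$M pipeline would typically retrain or replace the deployed model using only recent data, which is exactly the adaptation step the paper's framework is concerned with.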


2018 ◽  
Vol 9 (2) ◽  
pp. 69-79 ◽  
Author(s):  
Klemen Kenda ◽  
Dunja Mladenić

Abstract. Background: The Internet of Things (IoT), earth observation, and big scientific experiments are today's sources of extensive amounts of sensor big data. We are faced with large amounts of data acquired at low measurement cost. A standard approach in such cases is stream mining, in which each measurement is examined only once during real-time processing. This requires the methods to be completely autonomous. In the past, very little attention was given to the most time-consuming part of the data mining process, namely data pre-processing. Objectives: In this paper we propose an algorithm for data cleaning that can be applied to real-world streaming big data. Methods/Approach: We use a short-term prediction method based on the Kalman filter to derive admissible intervals for future measurements. The model can adapt to concept drift and is useful for detecting random additive outliers in a sensor data stream. Results: For datasets with low noise, our method performs better than the method commonly used in batch-processing scenarios; on higher-noise datasets the results are comparable. Conclusions: We have demonstrated successful application of the proposed method in real-world scenarios, including groundwater-level, server-load, and smart-grid data.
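
The admissible-interval idea can be illustrated with a one-dimensional Kalman filter. The sketch below uses a simple random-walk (constant-level) model and assumed noise variances, so it is only an illustration of the approach the abstract describes, not the authors' implementation: a new measurement is accepted only if it falls within the predicted value plus or minus k standard deviations of the innovation; otherwise it is flagged as an additive outlier and excluded from the update.

```python
import numpy as np

def clean_stream(measurements, q=1e-4, r=0.5, k=3.0):
    """Flag additive outliers in a 1-D sensor stream with a Kalman filter
    (random-walk level model; q, r and k are assumed, not taken from the paper).
    A measurement outside the admissible interval x_pred +/- k*sqrt(S),
    where S is the innovation variance, is flagged and skipped."""
    x, p = measurements[0], 1.0            # initial state estimate and its variance
    flags = [False]                        # the first measurement initialises the filter
    for z in measurements[1:]:
        p_pred = p + q                     # predict: level unchanged, uncertainty grows
        s = p_pred + r                     # innovation variance
        if abs(z - x) > k * np.sqrt(s):    # outside the admissible interval -> outlier
            flags.append(True)
            p = p_pred                     # keep the prediction, skip the update
            continue
        gain = p_pred / s                  # Kalman gain
        x = x + gain * (z - x)             # update state with the accepted measurement
        p = (1.0 - gain) * p_pred
        flags.append(False)
    return flags

# Example: a slowly rising groundwater-level series with one spurious spike.
series = np.concatenate([np.linspace(10.0, 10.5, 50), [14.0], np.linspace(10.5, 10.6, 10)])
print([i for i, bad in enumerate(clean_stream(series)) if bad])    # expected: [50]
```

The admissible interval widens when the filter is uncertain and narrows as it converges, which is what makes such a check usable on an unattended stream.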


Author(s):  
Qi Wang ◽  
Xia Zhao ◽  
Jincai Huang ◽  
Yanghe Feng ◽  
Zhong Liu ◽  
...  

The concept of ‘big data’ has been widely discussed, and its value has been demonstrated across a variety of domains. To quickly mine potential value from the ever-increasing volume of information, machine learning is playing an increasingly important role and faces more challenges than ever. Because few studies exist on how to adapt machine learning techniques to big data environments, we provide a comprehensive overview of the history and evolution of big data, the foundations of machine learning, and the bottlenecks and trends of machine learning in the big data era. More specifically, based on learning principles, we discuss regularization as a means of enhancing generalization. The data-quality challenges of big data are reduced to the curse of dimensionality, class imbalance, concept drift, and label noise, and the underlying reasons and mainstream methodologies for addressing these challenges are introduced. Learning model development has been driven by domain specifics, dataset complexities, and the presence or absence of human involvement. In this paper, we propose a robust learning paradigm that aggregates the aforementioned factors. Over the next few decades, we believe these perspectives will lead to novel ideas and encourage more studies aimed at incorporating knowledge and establishing data-driven learning systems that consider both data quality and human interaction.
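
The "regularization to enhance generalization" mentioned above is usually formalised as regularized empirical-risk minimisation; the objective below is a generic textbook formulation, not an equation taken from the paper:

$$\min_{\theta}\;\frac{1}{n}\sum_{i=1}^{n}\ell\!\left(f_\theta(x_i),\,y_i\right)\;+\;\lambda\,\Omega(\theta),\qquad \Omega(\theta)=\lVert\theta\rVert_2^2\ \text{(ridge)}\ \text{or}\ \lVert\theta\rVert_1\ \text{(lasso)}$$

Here the penalty weight λ trades data fit against model complexity; a larger λ shrinks the parameters and typically improves generalization on noisy, high-dimensional data.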


IEEE Access ◽  
2021 ◽  
pp. 1-1
Author(s):  
Ahmed Abbasi ◽  
Abdul Rehman Javed ◽  
Chinmay Chakraborty ◽  
Jamel Nebhen ◽  
Wisha Zehra ◽  
...  


2014 ◽  
Vol 23 (01) ◽  
pp. 08-13 ◽  
Author(s):  
Riccardo Bellazzi

Summary: Big data are receiving increasing attention in biomedicine and healthcare. It is therefore important to understand why big data are assuming a crucial role for the biomedical informatics community. The capability of handling big data is becoming an enabler for unprecedented research studies and for new models of healthcare delivery. Therefore, it is first necessary to deeply understand the four elements that constitute big data, namely Volume, Variety, Velocity, and Veracity, and their meaning in practice. Then, it is necessary to understand where big data are present and where they can be beneficially collected. Research fields such as translational bioinformatics need to rely on big data technologies to withstand the shock wave of data generated every day. Other areas, ranging from epidemiology to clinical care, can benefit from exploiting the large amounts of data that are nowadays available, from personal monitoring to primary care. However, building big-data-enabled systems has significant implications for the reproducibility of research studies and the management of privacy and data access, and proper actions should be taken to deal with these issues. An interesting consequence of the big data scenario is the availability of new software, methods, and tools, such as map-reduce, cloud computing, and concept-drift machine learning algorithms, which will not only contribute to big data research but may also benefit many biomedical informatics applications. The way forward with the big data opportunity will require properly applied engineering principles to design studies and applications, to avoid preconceptions or over-enthusiasm, to fully exploit the available technologies, and to improve data processing and data management regulations.

