Diversity in Ensemble Model for Classification of Data Streams with Concept Drift

Data streams can be defined as continuous flows of data arriving from different sources and in different forms. Streams are often highly dynamic, and their underlying structure usually changes over time, which may result in a phenomenon called concept drift. When predictive problems are solved on streaming data, traditional machine learning models trained on historical data may become invalid when such changes occur. Adaptive models equipped with mechanisms that reflect changes in the data have proved suitable for handling drifting streams. Adaptive ensemble models are a popular group of these methods for classifying drifting data streams. In this paper, we present a heterogeneous adaptive ensemble model for data stream classification, which uses a dynamic class weighting scheme and a mechanism to maintain the diversity of the ensemble members. Our main objective was to design a model consisting of a heterogeneous group of base learners (Naive Bayes, k-NN, decision trees) with an adaptive mechanism that, besides the performance of the members, also takes into account the diversity of the ensemble. The model was experimentally evaluated on both real-world and synthetic datasets. We compared the presented model with other existing adaptive ensemble methods from the perspective of both predictive performance and computational resource requirements.
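The abstract above does not give the weighting formula, so the following is only a minimal sketch of how member performance and ensemble diversity could be combined into voting weights. The blend of windowed accuracy with mean pairwise disagreement, the mixing parameter `alpha`, and all function names are assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict


def disagreement(preds_a, preds_b):
    """Fraction of instances on which two members predict different labels."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def member_weights(window_preds, window_labels, alpha=0.5):
    """Blend each member's windowed accuracy with its mean pairwise disagreement.

    window_preds maps a member name to its predictions on a recent window.
    alpha is a hypothetical mixing parameter (0 = accuracy only, 1 = diversity only).
    """
    names = list(window_preds)
    weights = {}
    for name in names:
        preds = window_preds[name]
        acc = sum(p == y for p, y in zip(preds, window_labels)) / len(window_labels)
        others = [disagreement(preds, window_preds[o]) for o in names if o != name]
        div = sum(others) / len(others) if others else 0.0
        weights[name] = (1 - alpha) * acc + alpha * div
    return weights


def weighted_vote(current_preds, weights):
    """Return the label with the largest total weight among the members' votes."""
    scores = defaultdict(float)
    for name, label in current_preds.items():
        scores[label] += weights[name]
    return max(scores, key=scores.get)
```

In a streaming setting the weights would be recomputed on each new labelled window, so a member that degrades after a drift loses influence while a member that errs differently from the rest keeps some weight.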

Study on the classification of data streams with concept drift

2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD) ◽

10.1109/fskd.2011.6019889 ◽

2011 ◽

Cited By ~ 7

Author(s):

Zhenzheng Ouyang ◽

Yuhai Gao ◽

Zipeng Zhao ◽

Tao Wang

Keyword(s):

Data Streams ◽

Concept Drift

Decision Tree Classification Algorithm within Concept Similarity

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.235.9 ◽

2012 ◽

Vol 235 ◽

pp. 9-14

Author(s):

Chun Hua Ju ◽

Li Li Mao

Keyword(s):

Data Streams ◽

Data Stream ◽

Concept Drift ◽

Classification Algorithm ◽

Streaming Data ◽

Decision Tree Classification ◽

The Cost ◽

Prediction Efficiency ◽

Concept Similarity

Data stream mining has been applied in many domains, but concept drift in data streams poses a major obstacle to data mining. Current research on classification algorithms for streaming data with concept drift has achieved many successes, but it pays little attention to recurrence in data streams, namely the situation in which a historical concept reappears. To address this characteristic, this paper proposes calculating concept similarity and then using the classifier model of the matching historical concept, or of a highly similar concept, to classify and predict. In this way, no further training is needed. This also reduces the cost of updating the model, speeds up classification, and improves prediction efficiency.
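The idea of reusing a stored classifier when a historical concept reappears can be sketched as follows. The abstract does not say how concept similarity is computed, so the per-class mean signature, the cosine similarity, the `threshold` value, and the `train` callback are all assumptions chosen for illustration.

```python
import math


def concept_signature(batch):
    """Summarize a batch of (features, label) pairs as the mean feature vector per label."""
    sums, counts = {}, {}
    for x, y in batch:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def similarity(sig_a, sig_b):
    """Mean cosine similarity of the class means over labels present in both signatures."""
    shared = set(sig_a) & set(sig_b)
    if not shared:
        return 0.0
    total = 0.0
    for y in shared:
        a, b = sig_a[y], sig_b[y]
        dot = sum(p * q for p, q in zip(a, b))
        norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
        total += dot / norm if norm else 0.0
    return total / len(shared)


def select_model(batch, repository, train, threshold=0.95):
    """Reuse the most similar stored model, or train and store a new one.

    repository is a list of (signature, model) pairs; train is a hypothetical
    callback that builds a classifier from a batch. Returns (model, reused).
    """
    sig = concept_signature(batch)
    best = max(repository, key=lambda entry: similarity(sig, entry[0]), default=None)
    if best is not None and similarity(sig, best[0]) >= threshold:
        return best[1], True
    model = train(batch)
    repository.append((sig, model))
    return model, False
```

When the incoming batch matches a stored concept, `select_model` skips training entirely, which is where the reduced update cost described in the abstract would come from.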

Automated classification of clinical trial eligibility criteria text based on ensemble learning and metric learning

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01492-z ◽

2021 ◽

Vol 21 (S2) ◽

Author(s):

Kun Zeng ◽

Yibin Xu ◽

Ge Lin ◽

Likeng Liang ◽

Tianyong Hao

Keyword(s):

Clinical Trial ◽

Ensemble Learning ◽

Metric Learning ◽

Classification Performance ◽

Ensemble Model ◽

Automated Classification ◽

Eligibility Criteria ◽

Data Imbalance ◽

The Impact

Background: Eligibility criteria are the primary strategy for screening the target participants of a clinical trial. Automated classification of clinical trial eligibility criteria text with machine learning methods improves recruitment efficiency and thereby reduces the cost of clinical research. However, existing methods suffer from poor classification performance because eligibility criteria text data are complex and imbalanced. Methods: An ensemble learning-based model with metric learning is proposed for eligibility criteria classification. The model integrates a set of pre-trained models including Bidirectional Encoder Representations from Transformers (BERT), A Robustly Optimized BERT Pretraining Approach (RoBERTa), XLNet, Pre-training Text Encoders as Discriminators Rather Than Generators (ELECTRA), and Enhanced Representation through Knowledge Integration (ERNIE). Focal Loss is used as the loss function to address the data imbalance problem. Metric learning is employed to train the embedding of each base model so that features become more distinguishable. Soft Voting is applied to obtain the final classification of the ensemble model. The dataset is from standard evaluation Task 3 of the 5th China Health Information Processing Conference and contains 38,341 eligibility criteria texts in 44 categories. Results: Our ensemble method had an accuracy of 0.8497, a precision of 0.8229, and a recall of 0.8216 on the dataset. The macro F1-score was 0.8169, outperforming state-of-the-art baseline methods by 0.84% on average. In addition, the performance improvement had a p-value of 2.152e-07 with a standard t-test, indicating that our model achieved a significant improvement. Conclusions: A model for classifying eligibility criteria text of clinical trials based on multi-model ensemble learning and metric learning was proposed. The experiments demonstrated that our ensemble model significantly improved classification performance. In addition, metric learning improved the word embedding representation, and the focal loss reduced the impact of data imbalance on model performance.
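Two of the components named in this abstract, soft voting and focal loss, are compact enough to sketch. The version below uses unweighted averaging and the standard focal loss form -(1 - p_t)^γ · log(p_t) without the optional class-balancing factor; whether the paper weights models or uses γ = 2 is not stated, so both are assumptions.

```python
import math


def soft_vote(prob_rows):
    """Average per-model class probabilities; return (winning class index, averages).

    prob_rows holds one probability vector per base model for a single example.
    """
    n_models = len(prob_rows)
    avg = [sum(col) / n_models for col in zip(*prob_rows)]
    return max(range(len(avg)), key=avg.__getitem__), avg


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one example; gamma=2.0 is an assumed default, not from the paper."""
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With `gamma=0` the focal loss reduces to ordinary cross-entropy; larger values shrink the loss of well-classified examples so that hard, often minority-class, examples dominate training.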

Concept Drift Adaptation Techniques in Distributed Environment for Real-World Data Streams

Smart Cities ◽

10.3390/smartcities4010021 ◽

2021 ◽

Vol 4 (1) ◽

pp. 349-371

Author(s):

Hassan Mehmood ◽

Panos Kostakos ◽

Marta Cortes ◽

Theodoros Anagnostopoulos ◽

Susanna Pirttikangas ◽

...

Keyword(s):

Real World ◽

Data Streams ◽

Smart City ◽

Smart Cities ◽

Concept Drift ◽

Distributed Environment ◽

Real World Data ◽

Unique Challenge ◽

World Data ◽

Concept Drift Detection

Real-world data streams pose a unique challenge to the implementation of machine learning (ML) models and data analysis. A notable problem introduced by the growth of Internet of Things (IoT) deployments across the smart city ecosystem is that the statistical properties of data streams can change over time, resulting in poor prediction performance and ineffective decisions. While concept drift detection methods aim to patch this problem, emerging communication and sensing technologies are generating massive amounts of data, requiring distributed environments to perform computation tasks across smart city administrative domains. In this article, we implement and test a number of state-of-the-art active concept drift detection algorithms for time series analysis within a distributed environment. We use real-world data streams and provide a critical analysis of the results obtained. The challenges of implementing concept drift adaptation algorithms, along with their applications in smart cities, are also discussed.
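The abstract does not name the detectors that were tested. As one illustration of what an active drift detector does, here is a minimal sketch of the Page-Hinkley test, a well-known change detector that is swapped in purely as an example; the parameter values are assumptions.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a monitored signal."""

    def __init__(self, delta=0.005, threshold=5.0, min_samples=30):
        self.delta = delta              # tolerated magnitude of change
        self.threshold = threshold      # alarm level for the test statistic
        self.min_samples = min_samples  # warm-up length before alarms are allowed
        self.reset()

    def reset(self):
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0      # cumulative deviation from the running mean
        self.cum_min = 0.0  # smallest cumulative deviation seen so far

    def update(self, x):
        """Feed one observation (e.g. a 0/1 prediction error); return True on drift."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        drift = self.n >= self.min_samples and self.cum - self.cum_min > self.threshold
        if drift:
            self.reset()  # start monitoring the new concept from scratch
        return drift
```

A typical use is to feed the detector the per-instance error of the deployed model and retrain or replace the model whenever `update` returns `True`.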

Measuring the Effectiveness of Adaptive Random Forest for Handling Concept Drift in Big Data Streams

Entropy ◽

10.3390/e23070859 ◽

2021 ◽

Vol 23 (7) ◽

pp. 859

Author(s):

Abdulaziz O. AlQabbany ◽

Aqil M. Azmi

Keyword(s):

Big Data ◽

Random Forest ◽

Real Time ◽

Data Streams ◽

Learning Algorithm ◽

Concept Drift ◽

The United States ◽

Careful Consideration ◽

Data Sets ◽

Stream Data

We are living in the age of big data, a majority of which is stream data. The real-time processing of this data requires careful consideration from different perspectives. Concept drift, a change in the data’s underlying distribution, is a significant issue, especially when learning from data streams. It requires learners to be adaptive to dynamic changes. Random forest is an ensemble approach that is widely used in classical non-streaming settings of machine learning applications. The Adaptive Random Forest (ARF), by contrast, is a stream learning algorithm that has shown promising results in terms of its accuracy and ability to deal with various types of drift. The continuity of the incoming instances allows their binomial distribution to be approximated by a Poisson(1) distribution. In this study, we propose a mechanism to increase the efficiency of such streaming algorithms by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects of online learning: accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value for ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tackling this important aspect. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that our proposed enhancement method yielded considerable improvement in most situations.
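The Poisson-based resampling that the abstract refers to can be sketched as an online bagging step: each base learner sees every incoming instance k times, with k drawn from Poisson(λ). The sampler uses Knuth's multiplication method; `CountingLearner` and its `learn_one` method are hypothetical stand-ins for a real incremental learner, and the measure ρ is not reproduced because the abstract does not define it.

```python
import math
import random


def poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method (suited to small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


class CountingLearner:
    """Hypothetical incremental learner that only counts its training updates."""

    def __init__(self):
        self.updates = 0

    def learn_one(self, x, y):
        self.updates += 1


def online_bagging_step(learners, x, y, lam, rng):
    """Train each learner on (x, y) a Poisson(lam)-distributed number of times."""
    for learner in learners:
        for _ in range(poisson(lam, rng)):
            learner.learn_one(x, y)
```

Raising λ makes every learner train on each instance more often, which tends to cost more execution time; that trade-off between accuracy and time is what tuning λ explores.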