Streaming Data Mining with Massive Online Analytics (MOA)

Abstract: Finding frequent itemsets in continuous streaming data is an important data mining task that is widely used in network monitoring, Internet of Things data analysis, and similar applications. In the era of big data, a distributed frequent itemset mining algorithm is needed to process massive streaming data. Apache Spark is a unified analytics engine for large-scale data processing that has been applied successfully in many data mining fields. In this paper, we propose SWEclat, a distributed algorithm for mining frequent itemsets over massive streaming data. The algorithm uses a sliding window to process the stream and a vertical data structure to store the dataset inside the window. It is implemented on Apache Spark and stores both the streaming data and the vertical-format dataset in Spark RDDs, which are divided into partitions for distributed processing. Experimental results show that SWEclat achieves good speedup, parallel scalability, and load balancing.
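The combination of a sliding window with a vertical (item to transaction-id set) layout can be illustrated with a minimal single-machine sketch in Python. This is not the SWEclat implementation and it does not use Spark; the class name `SlidingWindowEclat`, its parameters, and the absolute-count support threshold are assumptions made for illustration only.

```python
from collections import deque
from itertools import count


class SlidingWindowEclat:
    """Hypothetical sketch: Eclat-style mining over the last `window_size` transactions."""

    def __init__(self, window_size, min_support):
        self.window_size = window_size
        self.min_support = min_support  # absolute count inside the window (assumption)
        self.window = deque()           # (tid, items) pairs, oldest first
        self.tidsets = {}               # vertical format: item -> set of tids in the window
        self._next_tid = count()

    def add(self, transaction):
        """Insert one transaction and evict the oldest one if the window is full."""
        tid = next(self._next_tid)
        items = frozenset(transaction)
        self.window.append((tid, items))
        for item in items:
            self.tidsets.setdefault(item, set()).add(tid)
        if len(self.window) > self.window_size:
            old_tid, old_items = self.window.popleft()
            for item in old_items:
                tids = self.tidsets[item]
                tids.discard(old_tid)
                if not tids:
                    del self.tidsets[item]

    def frequent_itemsets(self):
        """Return {itemset tuple: support count} for the current window."""
        frequent = sorted(
            ((item, tids) for item, tids in self.tidsets.items()
             if len(tids) >= self.min_support),
            key=lambda pair: pair[0],
        )
        result = {}
        self._extend((), frequent, result)
        return result

    def _extend(self, prefix, candidates, result):
        # Depth-first search: the support of a longer itemset is the size
        # of the intersection of the tidsets of its parts.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            result[itemset] = len(tids)
            narrower = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if len(common) >= self.min_support:
                    narrower.append((other, common))
            self._extend(itemset, narrower, result)


miner = SlidingWindowEclat(window_size=4, min_support=2)
for transaction in (["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]):
    miner.add(transaction)
print(miner.frequent_itemsets())
```

In a distributed setting the tidset intersections would presumably be spread across partitions; the sketch keeps everything in one process so that the window update and the vertical-format search stay visible.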


Reducing labeling complexity in streaming data mining

10.31274/etd-180810-6013 ◽ 2018

Author(s): Yesdaulet Izenov

Keyword(s): Data Mining ◽ Streaming Data


Classification of Imbalanced Data Stream: Techniques and Challenges

Transactions on Machine Learning and Artificial Intelligence ◽ 10.14738/tmlai.92.9964 ◽ 2021 ◽ Vol 9 (2) ◽ pp. 36-52

Author(s): Mashaal A. Alfhaid ◽ Manal Abdullah

Keyword(s): Data Mining ◽ Data Stream ◽ Concept Drift ◽ Class Imbalance ◽ Imbalanced Data ◽ Predictive Performance ◽ Knowledge Extraction ◽ Streaming Data ◽ Stream Data ◽ Stream Data Mining

As the volume of generated data grows every day, data mining and knowledge extraction have become increasingly important. Traditional data mining can extract knowledge offline. Stream data mining is different: data arrive continuously and can be processed in only a single scan, and concept drift may appear. Because pre-processing is critical to knowledge extraction, imbalanced stream data have attracted considerable attention from researchers in the last few years. Many real-world applications suffer from class imbalance, including medical, business, and fraud detection applications. Supervised learning involves classes, whether binary or multi-class. These classes are often imbalanced, divided into a majority (negative) class and a minority (positive) class, which can bias a model toward the majority class and skew its predictive performance. Handling imbalanced streaming data is therefore essential for more accurate and reliable learning models. In this paper, we present an overview of data stream mining and its tools. We also summarize the problem of class imbalance and the different approaches to it. In addition, we present the popular evaluation metrics and the challenges arising from imbalanced streaming data.
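The effect of majority-class bias on predictive performance can be made concrete with a small Python sketch. The abstract does not name specific metrics, so the choice of accuracy, recall, specificity, and G-mean here is an assumption, as are the function name `imbalance_report` and the synthetic 95:5 label split.

```python
from math import sqrt


def imbalance_report(y_true, y_pred, positive=1):
    """Compute accuracy alongside metrics that expose minority-class errors."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)

    recall = tp / (tp + fn) if tp + fn else 0.0        # minority (positive) class
    specificity = tn / (tn + fp) if tn + fp else 0.0   # majority (negative) class
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "g_mean": sqrt(recall * specificity),
    }


# Synthetic example: 95 negatives, 5 positives, and a model that
# always predicts the majority (negative) class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
report = imbalance_report(y_true, y_pred)
print(report)
```

The always-negative model scores high on accuracy while never detecting a single minority instance, which is why metrics that weight both classes are usually reported for imbalanced streams.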


The Research and Application on Streaming Data of GIS Data Mining

2009 First International Workshop on Database Technology and Applications ◽ 10.1109/dbta.2009.140 ◽ 2009

Author(s): Jia Liu ◽ Lin Liu ◽ Wei Chen

Keyword(s): Data Mining ◽ Streaming Data ◽ Gis Data


Graphing Model of Prediction Data for Occupational Incidents in Chemical and Gas Industries

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽ 10.35940/ijitee.d2208.029420 ◽ 2020 ◽ Vol 9 (4) ◽ pp. 3112-3116

Keyword(s): Data Mining ◽ Data Visualization ◽ Large Data ◽ Data Representation ◽ Daily Basis ◽ Streaming Data ◽ Visual Data ◽ Visual Data Mining ◽ Real Time Analysis ◽ Inspection Data

The constant streaming of data at high volumes provides insight to various organizations. Analyzing and identifying patterns in such huge volumes of raw data has become difficult. Information visualization and visual data mining help to deal with this flood of information. Visual data representation conveys the data and the results of the data mining process to all stakeholders in a meaningful manner. Recent developments have produced a large number of information visualization techniques for exploring large data sets, which can then be converted into useful information and knowledge. Observation and inspection data gathered from the chemical and gas industries pile up daily as raw data. Continuous analysis is an emerging term in the industry for analysis that runs continuously on streaming data to provide real-time analysis and live prediction. In this paper, the use of various graphing models, chosen according to the information obtained from the organization, is discussed and justified. The paper also describes the value that representation through graphs and charts adds to decision making by improving understanding. Heatmap, scattergram, and customized radar plots present the analyzed data in the required format to visualize the predictions made for occupational incidents in the chemical and gas industries. As a result of the graphing model, the representation provides a higher level of confidence in the findings of the analysis. Better visual representation techniques thus yield better results, with faster processing and understanding.


A Frequent Pattern Conjunction Heuristic for Rule Generation in Data Streams

Information ◽ 10.3390/info12010024 ◽ 2021 ◽ Vol 12 (1) ◽ pp. 24

Author(s): Frederic Stahl ◽ Thien Le ◽ Atta Badii ◽ Mohamed Medhat Gaber

Keyword(s): Data Mining ◽ Real Time ◽ Data Streams ◽ Data Stream ◽ Practical Importance ◽ Rule Induction ◽ Streaming Data ◽ Frequent Pattern ◽ Unseen Data ◽ Rule Sets

This paper introduces a new and expressive algorithm for inducing descriptive rule-sets from streaming data in real-time in order to describe frequent patterns explicitly encoded in the stream. Data Stream Mining (DSM) is concerned with the automatic analysis of data streams in real-time. Rapid flows of data challenge the state-of-the-art processing and communication infrastructure, hence the motivation for research and innovation into real-time algorithms that analyse data streams on-the-fly and can automatically adapt to concept drifts. To date, DSM techniques have largely focused on predictive data mining applications that aim to forecast the value of a particular target feature of unseen data instances, answering questions such as whether a credit card transaction is fraudulent or not. A real-time, expressive and descriptive Data Mining technique for streaming data has not been previously established as part of the DSM toolkit. This has motivated the work reported in this paper, which has resulted in developing and validating a Generalised Rule Induction (GRI) tool, thus producing expressive rules as explanations that can be easily understood by human analysts. The expressiveness of decision models in data streams serves the objectives of transparency, underpinning the vision of 'explainable AI', and yet it is an area of research that has attracted less attention despite being of high practical importance. The algorithm introduced and described in this paper is termed Fast Generalised Rule Induction (FGRI). FGRI is able to induce descriptive rules incrementally for raw data from both categorical and numerical features. FGRI is able to adapt rule-sets to changes of the pattern encoded in the data stream (concept drift) on the fly as new data arrives and can thus be applied continuously in real-time. The paper also provides a theoretical, qualitative and empirical evaluation of FGRI.
