PARASOL: a hybrid approximation approach for scalable frequent itemset mining in streaming data

Here, we present a novel algorithm for frequent itemset mining in streaming data (FIM-SD). Over the past decade, various FIM-SD methods have been proposed for one-pass approximation settings, in which the support of each itemset is approximated. They can be categorized into two approximation types: parameter-constrained (PC) mining and resource-constrained (RC) mining. PC methods bound the maximum error in the approximate support by a pre-defined parameter. In contrast, RC methods bound the maximum memory consumption according to resource constraints. However, the memory consumption of existing PC methods can grow exponentially, while the maximum error of existing RC methods can grow rapidly. In this study, we address this problem by introducing a hybrid of the PC and RC approximations, called PARASOL. For any streaming data, PARASOL guarantees a condensed representation, called a Δ-covered set, which can be regarded as an extension of closedness compression; when Δ = 0, the solution corresponds to the ordinary closed itemsets. PARASOL searches for approximate closed itemsets from which the frequent itemsets and their supports can be restored, with the maximum error bounded by an integer, Δ. We then empirically demonstrate that the proposed algorithm significantly outperforms the state-of-the-art PC and RC methods for FIM-SD.
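To make the Δ = 0 case concrete, the following is a minimal brute-force sketch of ordinary closed itemsets on a tiny static database. It is not PARASOL and does no streaming or approximation; the function names (`supports`, `closed_itemsets`, `restore_support`) and the toy transactions are assumptions made for illustration. It only shows why closed itemsets are a lossless compression: the support of any itemset can be restored as the largest support among its closed supersets.

```python
from itertools import combinations


def supports(transactions):
    """Count the support of every non-empty itemset occurring in some transaction."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for r in range(1, len(items) + 1):
            for combo in combinations(items, r):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return counts


def closed_itemsets(counts):
    """Keep only itemsets that have no proper superset with the same support."""
    return {
        s: c
        for s, c in counts.items()
        if not any(s < other and c == oc for other, oc in counts.items())
    }


def restore_support(itemset, closed):
    """Restore a support as the maximum over closed supersets (0 if there is none)."""
    return max((c for s, c in closed.items() if itemset <= s), default=0)


# Hypothetical toy database of four transactions.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
counts = supports(transactions)
closed = closed_itemsets(counts)

for itemset, support in sorted(closed.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
```

Here seven itemsets occur in the data but only four are closed, and every original support can still be recovered with `restore_support`. A Δ-covered set, as described in the abstract, relaxes this exactness so that restored supports may be off by at most Δ.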

Data Mining Itemset of Big Data Using Pre-Processing Based on Mapreduce FrameWork with ETL Tools

APTIKOM Journal on Computer Science and Information Technologies ◽

10.11591/aptikom.j.csit.103 ◽

2017 ◽

Vol 2 (2) ◽

pp. 57-62

Author(s):

Padmanathan Anantharaman ◽

H.V. Ramakrishan

Keyword(s):

Big Data ◽

Clustering Algorithm ◽

Programming Model ◽

Hybrid Approach ◽

Processing Technique ◽

Frequent Itemsets ◽

Frequent Itemset ◽

Frequent Itemset Mining ◽

Itemset Mining ◽

Dataset Size

As data volumes continue to grow, they quickly consume the capacity of data warehouses and application databases, forcing IT organizations into costly upgrades to expensive databases and data warehouse hardware appliances. An enormous amount of data is also being generated through the Internet of Things (IoT) as technologies advance and people use them in day-to-day activities; this data is termed Big Data, with its own characteristics and challenges. Frequent itemset mining algorithms aim to discover frequent itemsets in a transactional database, but as the dataset size increases, traditional frequent itemset mining can no longer handle it. The MapReduce programming model solves the problem of large datasets, but it incurs a large communication cost that reduces execution efficiency. This paper proposes a new k-means pre-processing technique applied to the BigFIM algorithm. ClustBigFIM uses a hybrid approach: clustering with the k-means algorithm to generate clusters from huge datasets, then Apriori and Eclat to mine frequent itemsets from the generated clusters using the MapReduce programming model. Results show that the execution efficiency of the ClustBigFIM algorithm increases when k-means clustering is applied before the BigFIM algorithm as a pre-processing technique.
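The abstract names Apriori as one of the miners run on each cluster. As a rough illustration of that single step only, here is a minimal single-machine Apriori sketch. It includes no k-means clustering, no Eclat, and no MapReduce, and it is not the ClustBigFIM implementation; the function name `apriori`, the absolute `min_support` threshold, and the sample baskets are assumptions for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {
            a | b for a, b in combinations(list(frequent), 2) if len(a | b) == k
        }
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c
            for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        # Count step: one scan of the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical sample data: four shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent_itemsets = apriori(baskets, min_support=2)
for itemset, support in sorted(frequent_itemsets.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
```

The repeated scan in the count step is what becomes expensive on large datasets, which is the cost that partitioning the data before mining is meant to reduce.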

Finding tendencies in streaming data using Big Data frequent itemset mining

Knowledge-Based Systems ◽

10.1016/j.knosys.2018.09.026 ◽

2019 ◽

Vol 163 ◽

pp. 666-674 ◽

Cited By ~ 12

Author(s):

Carlos Fernandez-Basso ◽

Abel J. Francisco-Agra ◽

Maria J. Martin-Bautista ◽

M. Dolores Ruiz

Keyword(s):

Big Data ◽

Streaming Data ◽

Frequent Itemset ◽

Frequent Itemset Mining ◽

Itemset Mining

An Algorithm for In-Core Frequent Itemset Mining on Streaming Data

Fifth IEEE International Conference on Data Mining (ICDM'05) ◽

10.1109/icdm.2005.21 ◽

2006 ◽

Cited By ~ 2

Author(s):

Ruoming Jin ◽

G. Agrawal

Keyword(s):

Streaming Data ◽

Frequent Itemset ◽

Frequent Itemset Mining ◽

Itemset Mining

TIFIM: Tree based Incremental Frequent Itemset Mining over Streaming Data

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v10i5.4149 ◽

2013 ◽

Vol 10 (5) ◽

pp. 1580-1586

Author(s):

V. Sidda Reddy ◽

Dr T.V. Rao ◽

Dr A. Govardhan

Keyword(s):

Data Streams ◽

Data Stream ◽

Streaming Data ◽

Frequent Itemset ◽

Frequent Itemset Mining ◽

Itemset Mining ◽

Proposed Model ◽

Mining Model ◽

Mining Algorithms ◽

Memory Efficient

Data stream mining algorithms operate under constraints on the space used and the time taken, which arise from the streaming property. The relaxation of these constraints is inversely proportional to the streaming speed of the data. Since caching and mining streaming data is sensitive to these constraints, this paper devises a scalable, memory-efficient caching and frequent itemset mining model. The proposed model is an incremental approach that builds single-level, multi-node trees, called bushes, from each window of the streaming data; henceforth we refer to the proposed algorithm as Tree (bush) based Incremental Frequent Itemset Mining (TIFIM) over data streams.

Approximate Frequent Itemset Mining for streaming data on FPGA

2016 26th International Conference on Field Programmable Logic and Applications (FPL) ◽

10.1109/fpl.2016.7577331 ◽

2016 ◽

Author(s):

Yubin Li ◽

Yuliang Sun ◽

Guohao Dai ◽

Qiang Xu ◽

Yu Wang ◽

...

Keyword(s):

Streaming Data ◽

Frequent Itemset ◽

Frequent Itemset Mining ◽

Itemset Mining

Parallel frequent itemset mining on streaming data

2014 10th International Conference on Natural Computation (ICNC) ◽

10.1109/icnc.2014.6975926 ◽

2014 ◽

Cited By ~ 1

Author(s):

Yanshan He ◽

Min Yue

Keyword(s):

Streaming Data ◽

Frequent Itemset ◽

Frequent Itemset Mining ◽

Itemset Mining

Frequent Itemset Mining A Metadata Based Approach for Knowledge Discovery

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i3.316320 ◽

2018 ◽

Vol 6 (3) ◽

pp. 316-320

Author(s):

Basavaraj A. Goudannavar ◽

Prashant Bhat

Keyword(s):

Knowledge Discovery ◽

Frequent Itemset ◽

Frequent Itemset Mining ◽

Itemset Mining

Inverse Frequent Itemset Mining Based on FP-Tree

Journal of Software ◽

10.3724/sp.j.1001.2008.00338 ◽

2008 ◽

Vol 19 (2) ◽

pp. 338-350 ◽

Cited By ~ 2

Author(s):

Yu-Hong GUO

Keyword(s):

Frequent Itemset ◽

Frequent Itemset Mining ◽

Itemset Mining

A Synopsis Based Approach for Itemset Frequency Estimation over Massive Multi-Transaction Stream

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3465238 ◽

2021 ◽

Vol 16 (2) ◽

pp. 1-30

Author(s):

Guangtao Wang ◽

Gao Cong ◽

Ying Zhang ◽

Zhen Hai ◽

Jieping Ye

Keyword(s):

Frequency Estimation ◽

Frequent Itemsets ◽

Frequent Itemset ◽

Experimental Results ◽

Closure Property ◽

Frequent Itemset Mining ◽

Itemset Mining ◽

Minimum Value ◽

Downward Closure ◽

Bounded Size

Streams in which multiple transactions are associated with the same key are prevalent in practice, e.g., a customer has multiple shopping records arriving at different times. Itemset frequency estimation on such streams is very challenging, since sampling-based methods, such as the widely used reservoir sampling, cannot be used. In this article, we propose a novel k-Minimum Value (KMV) synopsis based method to estimate the frequency of itemsets over multi-transaction streams. First, we extract the KMV synopses for each item from the stream. Then, we propose a novel estimator to estimate the frequency of an itemset over the KMV synopses. Compared to the existing estimator, our method is not only more accurate and more efficient to calculate but also follows the downward-closure property. These properties enable the incorporation of our new estimator into existing frequent itemset mining (FIM) algorithms (e.g., FP-Growth) to mine frequent itemsets over multi-transaction streams. To demonstrate this, we implement a KMV synopsis based FIM algorithm by integrating our estimator into existing FIM algorithms, and we prove that it is capable of guaranteeing the accuracy of FIM with a bounded size of KMV synopsis. Experimental results on massive streams show that our estimator can significantly improve the accuracy of both itemset frequency estimation and FIM compared to the existing estimators.
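For readers unfamiliar with KMV synopses, the sketch below shows the textbook building block only: a synopsis that keeps the k smallest hash values of the keys it has seen and estimates the number of distinct keys as (k - 1) divided by the k-th smallest hash. It is not the article's novel itemset estimator. The class name `KMVSynopsis`, the SHA-256 based `unit_hash`, and the sample stream are assumptions for illustration.

```python
import hashlib
import heapq


def unit_hash(value):
    """Map a value to a pseudo-uniform float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSynopsis:
    """Keep the k smallest hash values of the distinct keys seen so far."""

    def __init__(self, k):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hashes currently kept, used to skip duplicates

    def add(self, key):
        h = unit_hash(key)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Insert the new hash and evict the largest hash kept so far.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        """Exact count below k distinct keys, otherwise (k - 1) / k-th smallest hash."""
        if len(self._heap) < self.k:
            return float(len(self._heap))
        return (self.k - 1) / (-self._heap[0])


# Hypothetical stream of (customer key, transaction) pairs, one synopsis per item.
stream = [
    ("c1", {"milk", "bread"}),
    ("c2", {"milk"}),
    ("c1", {"milk"}),   # same customer again: does not add a new distinct key
    ("c3", {"bread"}),
]
per_item = {}
for key, items in stream:
    for item in items:
        per_item.setdefault(item, KMVSynopsis(k=64)).add(key)

print({item: syn.estimate() for item, syn in per_item.items()})
```

Because each synopsis holds at most k hash values, memory per item stays bounded regardless of stream length, which is the property the abstract relies on when it refers to a bounded size of KMV synopsis.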

Frequent Itemset Mining and Multi-Layer Network-Based Analysis of RDF Databases

Mathematics ◽

10.3390/math9040450 ◽

2021 ◽

Vol 9 (4) ◽

pp. 450

Author(s):

Gergely Honti ◽

János Abonyi

Keyword(s):

Climate Change ◽

Extraction Process ◽

Knowledge Extraction ◽

Frequent Itemset ◽

Frequent Itemset Mining ◽

Underlying Structure ◽

Multilayer Network ◽

Interdisciplinary Science ◽

Academic Knowledge ◽

Itemset Mining

Triplestores, or resource description framework (RDF) stores, are purpose-built databases used to organise, store and share data with context. Knowledge extraction from a large amount of interconnected data requires effective tools and methods to address the complexity and the underlying structure of semantic information. We propose a method that generates an interpretable multilayered network from an RDF database. The method utilises frequent itemset mining (FIM) of the subjects, predicates and objects of the RDF data, and automatically extracts informative subsets of the database for the analysis. The results are used to form layers in an analysable multidimensional network. The methodology enables consistent, transparent, multi-aspect-oriented knowledge extraction from the linked dataset. To demonstrate the usability and effectiveness of the methodology, we analyse how the science of sustainability and climate change is structured using the Microsoft Academic Knowledge Graph. In the case study, the FIM forms networks of disciplines to reveal the significant interdisciplinary science communities in sustainability and climate change. The constructed multilayer network then enables an analysis of the significant disciplines and interdisciplinary scientific areas. To demonstrate the proposed knowledge extraction process, we search for interdisciplinary science communities and then measure and rank their multidisciplinary effects. The analysis identifies discipline similarities, pinpointing the similarity between atmospheric science and meteorology as well as between geomorphology and oceanography. The results confirm that frequent itemset mining provides informative sampled subsets of RDF databases which can be simultaneously analysed as layers of a multilayer network.
