Frequent Itemset Mining in Large Datasets: A Survey

2017 ◽  
Vol 7 (4) ◽  
pp. 37-49
Author(s):  
Amrit Pal ◽  
Manish Kumar

Frequent itemset mining is a well-known area of data mining. Most of the available frequent itemset mining techniques require complete information about the data, from which association rules can then be generated. The amount of data is growing day by day, taking the form of Big Data, and algorithms must be adapted to work at such scale. Parallel implementations of mining techniques can provide a solution to this problem. This paper surveys frequent itemset mining techniques that can be used in a parallel environment. Programming models such as MapReduce provide an efficient architecture for working with Big Data; the paper also discusses the issues and feasibility of implementing each technique in such an environment.
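Many of the MapReduce-based techniques the survey covers build on the same counting pattern: mappers emit candidate itemsets from their partition of the transactions, and reducers sum the counts and apply the support threshold. The following is a minimal plain-Python sketch of that pattern, not any specific surveyed algorithm; the transactions, the fixed itemset size k, and min_support are illustrative assumptions.

```python
# Minimal stand-in for the MapReduce counting pattern: mappers emit
# (itemset, 1) pairs, reducers sum them and apply the support threshold.
# Transactions, k, and min_support are illustrative, not from any paper.
from collections import Counter
from itertools import combinations

def map_phase(transactions, k):
    """Emit (k-itemset, 1) for every k-itemset in each transaction."""
    for t in transactions:
        for itemset in combinations(sorted(t), k):
            yield itemset, 1

def reduce_phase(pairs, min_support):
    """Sum the counts per itemset and keep only the frequent ones."""
    counts = Counter()
    for itemset, c in pairs:
        counts[itemset] += c
    return {s: n for s, n in counts.items() if n >= min_support}

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"milk", "eggs"}]
print(reduce_phase(map_phase(transactions, 2), min_support=2))
# -> {('bread', 'milk'): 2, ('eggs', 'milk'): 2}
```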

Author(s):  
Priyanka R. ◽  
Mohammed Ibrahim M. ◽  
Ranjith Kumar M.

In today’s world, voluminous data are generated from various sources in various forms. Mining or analyzing such large-scale data efficiently enough to make it useful is difficult with existing approaches. Frequent itemset mining is one technique for such analysis, used in fields like finance and health care, where the focus is on gathering frequent patterns and grouping them meaningfully in order to extract useful insights from the data. Major applications include customer segmentation in marketing, shopping cart analysis, relationship management, web usage mining, player tracking, and so on. Several parallel algorithms, such as the Dist-Eclat and BigFIM algorithms, are available for large-scale frequent itemset mining. The Dist-Eclat algorithm partitions datasets with a round-robin technique; the proposed system instead uses a hybrid partitioning approach, which can improve the overall efficiency of the system. The system works as follows: the collected data are first distributed by MapReduce. The local frequent k-itemsets are then computed using an FP-Tree and sent to the map phase. The mining results are subsequently combined at the central node. Finally, the global frequent itemsets are gathered by MapReduce. The proposed system is expected to improve efficiency by partitioning the datasets with a hybrid approach based on the identification of frequent items.
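To make the partitioning contrast concrete, here is a small, hypothetical Python sketch of the two assignment strategies: round-robin distribution of frequent-item prefixes, as in Dist-Eclat, versus a load-balanced assignment of the kind a hybrid scheme can exploit. The prefix names and load estimates are invented for illustration.

```python
# Hypothetical comparison of prefix-assignment strategies for 2 workers.
# round_robin mirrors Dist-Eclat's distribution of frequent-item prefixes;
# load_balanced greedily assigns heavy prefixes first, the kind of
# balancing a hybrid scheme can exploit. The loads below are invented.
def round_robin(prefixes, n_workers):
    buckets = [[] for _ in range(n_workers)]
    for i, p in enumerate(prefixes):
        buckets[i % n_workers].append(p)
    return buckets

def load_balanced(prefix_loads, n_workers):
    """Give the next-heaviest prefix to the currently lightest worker."""
    buckets = [[] for _ in range(n_workers)]
    loads = [0] * n_workers
    for p, load in sorted(prefix_loads.items(), key=lambda kv: -kv[1]):
        w = loads.index(min(loads))
        buckets[w].append(p)
        loads[w] += load
    return buckets

loads = {"a": 90, "b": 60, "c": 30, "d": 20, "e": 10}
print(round_robin(list(loads), 2))   # -> [['a', 'c', 'e'], ['b', 'd']]
print(load_balanced(loads, 2))       # -> [['a', 'd'], ['b', 'c', 'e']]
```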


Author(s):  
Nur Rokhman ◽  
Amelia Nursanti

The implementation of parallel algorithms has recently become a very active research topic. Parallelism is well suited to large-scale data processing, and MapReduce is one of the parallel and distributed programming models. Implementing parallel programs, however, raises many difficulties. Cascading provides an easy scheme over the Hadoop system, which implements the MapReduce model. Frequent itemsets are the objects that appear most often in a dataset, and Frequent Itemset Mining (FIM) requires complex computation; FIM becomes a complicated problem when implemented on large-scale data. This paper discusses the implementation of the MapReduce model on Cascading for FIM. The experiment uses the Amazon product co-purchasing network metadata dataset. It shows that the simple mechanism of Cascading can be used to solve the FIM problem, giving time complexity O(n), more efficient than the non-parallel approach, which has complexity O(n²/m).
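Cascading expresses a Hadoop job as a chain of pipes rather than explicit map and reduce functions. The plain-Python sketch below mimics that pipe-assembly style for single-item counting, the first step of FIM; the records and support threshold are invented, and a real Cascading flow would be written in Java against its Pipe/Each/GroupBy API.

```python
# Plain-Python mimicry of a Cascading pipe assembly for the first FIM
# step (single-item counting): each stage feeds the next, and Hadoop
# would run the assembled chain as MapReduce jobs. Records are invented.
from collections import Counter

def tokenize(records):
    """'Each'-style pipe: emit one item per co-purchased product."""
    for r in records:
        yield from r.split(",")

def count(items):
    """'GroupBy' + aggregation: count occurrences of every item."""
    return Counter(items)

def keep_frequent(counts, min_support):
    """Filter pipe: keep items meeting the support threshold."""
    return {i: n for i, n in counts.items() if n >= min_support}

records = ["book1,book2", "book1,book3", "book1,book2"]
print(keep_frequent(count(tokenize(records)), min_support=2))
# -> {'book1': 3, 'book2': 2}
```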


2018 ◽  
Vol 439-440 ◽  
pp. 19-38 ◽  
Author(s):  
Kang-Wook Chon ◽  
Sang-Hyun Hwang ◽  
Min-Soo Kim

Author(s):  
Krzysztof Jurczuk ◽  
Marcin Czajkowski ◽  
Marek Kretowski

Abstract This paper concerns the evolutionary induction of decision trees (DTs) for large-scale data. Such a global approach is one of the alternatives to top-down inducers: it searches for the tree structure and the split tests simultaneously, and thus in many situations improves the prediction and size of the resulting classifiers. However, as a population-based, iterative approach it can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems on large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines knowledge of global DT induction and evolutionary algorithm parallelization with efficient utilization of GPU memory and computing resources. The searches for the tree structure and tests are performed simultaneously on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets, and in both cases the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. As the number of GPUs grows, nearly linear scalability is observed, which suggests that the data-size boundaries for evolutionary DT mining are fading.
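The CPU/GPU division of labor described above can be sketched compactly: the evolutionary loop mutates and selects candidates on the CPU, while fitness over all instances is evaluated data-parallel on the device. In the illustrative Python sketch below, NumPy vectorization stands in for the paper's CUDA kernels, each "tree" is collapsed to a single threshold stump, and all data and parameters are invented.

```python
# Schematic of the CPU/GPU split: selection and mutation stay on the CPU,
# while fitness over all instances is computed in one data-parallel step.
# NumPy stands in for CUDA kernels; trees are reduced to threshold stumps.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)                  # one feature, many instances
y = (X > 0.3).astype(int)                     # ground-truth labels

def fitness_on_device(thresholds):
    """Data-parallel step: accuracy of every candidate over all instances."""
    preds = X[None, :] > thresholds[:, None]  # population x instances
    return (preds == y).mean(axis=1)

population = rng.normal(size=32)              # candidate split thresholds
for _ in range(50):                           # CPU-side evolutionary loop
    fit = fitness_on_device(population)
    parents = population[np.argsort(fit)[-8:]]        # select the best
    children = rng.choice(parents, 24) + rng.normal(scale=0.1, size=24)
    population = np.concatenate([parents, children])  # next generation
print("best threshold:", population[np.argmax(fitness_on_device(population))])
```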


2019 ◽  
Vol 34 (1) ◽  
pp. 101-123 ◽  
Author(s):  
Taito Lee ◽  
Shin Matsushima ◽  
Kenji Yamanishi

Abstract We consider the class of linear predictors over all logical conjunctions of binary attributes, which we refer to as the class of combinatorial binary models (CBMs) in this paper. CBMs offer high knowledge interpretability, but naïve learning of them from labeled data requires computational cost that is exponential in the length of the conjunctions. On the other hand, for large-scale datasets, long conjunctions are effective for learning predictors. To overcome this computational difficulty, we propose an algorithm, GRAfting for Binary datasets (GRAB), which efficiently learns CBMs within the L1-regularized loss minimization framework. The key idea of GRAB is to adopt weighted frequent itemset mining for the most time-consuming step of the grafting algorithm, which is designed to solve large-scale L1-regularized empirical risk minimization (L1-RERM) problems iteratively. Furthermore, we show experimentally that linear predictors of CBMs are effective in terms of prediction accuracy and knowledge discovery.
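The grafting loop at the heart of GRAB can be illustrated with a simplified sketch: maintain an active set of conjunction features, and in each step add the inactive conjunction whose loss gradient exceeds the regularization weight. Where GRAB finds that conjunction via weighted frequent itemset mining, the toy Python version below brute-forces over short conjunctions, uses squared loss, and refits the active set without the L1 penalty; the data and the penalty lam are made up.

```python
# Toy grafting loop: add the inactive conjunction feature whose gradient
# magnitude exceeds lam, then refit the active set. GRAB replaces the
# brute-force search here with weighted frequent itemset mining.
from itertools import combinations
import numpy as np

X = np.array([[1, 0, 1], [1, 1, 1], [0, 1, 0], [1, 1, 0]])  # binary attributes
y = np.array([1.0, 1.0, -1.0, -1.0])
lam = 0.5                                    # made-up L1 weight

def conj_feature(cols):
    """Value of the conjunction of the given attribute columns."""
    return X[:, cols].all(axis=1).astype(float)

candidates = [c for k in (1, 2) for c in combinations(range(X.shape[1]), k)]
active, w = [], []
residual = y.copy()                          # squared-loss gradient = -f.r
for _ in range(3):                           # a few grafting steps
    grads = {c: abs(conj_feature(c) @ residual)
             for c in candidates if c not in active}
    best = max(grads, key=grads.get)
    if grads[best] <= lam:                   # no violating feature remains
        break
    active.append(best)                      # graft it into the model
    F = np.stack([conj_feature(c) for c in active], axis=1)
    w, *_ = np.linalg.lstsq(F, y, rcond=None)  # refit (L1 omitted here)
    residual = y - F @ w
print(active, w)
```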


Author(s):  
Karthikeyani Visalakshi N. ◽  
Shanthi S. ◽  
Lakshmi K.

Cluster analysis is a prominent data mining technique in knowledge discovery: it uncovers hidden patterns in the data. K-Means, K-Modes, and K-Prototypes are partition-based clustering algorithms that select their initial centroids randomly. Because of this random selection, they can converge to local optima. To address this issue, the strategy of the Crow Search algorithm is combined with these algorithms to obtain a globally optimal solution. With advances in information technology, data sizes have grown drastically, from terabytes to petabytes. To make the proposed algorithms suitable for such voluminous data, they are implemented in parallel using the Hadoop MapReduce framework. The proposed algorithms are evaluated on large-scale data, and the results are compared in terms of cluster evaluation measures and computation time across different numbers of nodes.
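As an illustration of how a Crow Search step can drive centroid selection for K-Means, the following Python sketch treats each crow as one candidate set of k centroids and minimizes the within-cluster sum of squared errors; the flight length fl, awareness probability ap, and the toy data are assumptions, not the paper's settings.

```python
# Minimal Crow Search over candidate centroid sets: each crow follows a
# random crow's remembered best position, or relocates randomly when that
# crow is "aware". fl, ap, and the data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
k, n_crows, fl, ap = 2, 10, 2.0, 0.1

def sse(centroids):
    """Within-cluster sum of squared errors: the fitness being minimized."""
    d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

crows = rng.uniform(data.min(), data.max(), (n_crows, k, 2))
memory = crows.copy()                          # each crow's best-known spot
for _ in range(100):
    for i in range(n_crows):
        j = rng.integers(n_crows)              # crow i tails crow j
        if rng.random() >= ap:                 # j unaware: move toward m_j
            crows[i] += rng.random() * fl * (memory[j] - crows[i])
        else:                                  # j aware: random relocation
            crows[i] = rng.uniform(data.min(), data.max(), (k, 2))
        if sse(crows[i]) < sse(memory[i]):     # update the memory
            memory[i] = crows[i]
best = memory[min(range(n_crows), key=lambda i: sse(memory[i]))]
print("initial centroids for K-Means:", best)
```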

