Towards Scalable Algorithm for Closed Itemset Mining in High-Dimensional Data

Author(s):  
Fatimah Audah Md. Zaki ◽  
Nurul Fariza Zulkurnain

<p>Mining frequent itemsets from large datasets has a major drawback: the explosive number of itemsets requires an additional mining process to filter out the interesting ones. As a solution, the concept of closed frequent itemsets was introduced, a lossless and condensed representation of all the frequent itemsets and their corresponding supports. Unfortunately, many algorithms are not memory-efficient, since they require closed itemsets to be stored in main memory for duplication checks. This paper presents BFF, a scalable algorithm for discovering closed frequent itemsets from high-dimensional data. Unlike many well-known algorithms, BFF traverses the search tree in a breadth-first manner, resulting in minimal memory use and lower running time. Tests conducted on a number of microarray datasets show that the performance of this algorithm improves significantly as the support threshold decreases, which is crucial for generating more interesting rules.</p>
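To make the "lossless and condensed" claim concrete, the following is a minimal brute-force sketch on toy data. It is not the BFF algorithm described in the abstract; the function names and the four-transaction dataset are assumptions made purely for illustration. An itemset is treated as closed when no proper superset has the same support, and the support of any frequent itemset is recovered as the largest support among the closed itemsets that contain it.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration: fine for toy data, exponential in general."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        x: s
        for x, s in frequent.items()
        if not any(x < y and s == sy for y, sy in frequent.items())
    }


def support_from_closed(itemset, closed):
    """Recover the support of a frequent itemset from the closed ones alone."""
    return max(s for c, s in closed.items() if itemset <= c)


# Hypothetical toy database with three items.
transactions = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c"},
)]
frequent = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(frequent)
```

On this toy input, seven frequent itemsets condense to three closed ones, and `support_from_closed` reproduces every original support, which is what "lossless" means here.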

Author(s):  
Eka Karyawati ◽  
Edi Winarko

Frequent pattern (itemset) discovery is an important problem in associative classification rule mining. Different approaches have been proposed, such as Apriori-like, Frequent Pattern (FP)-growth, and Transaction Data Location (Tid)-list Intersection algorithms. This paper surveys and compares state-of-the-art associative classification techniques with regard to the rule generation phase of associative classification algorithms. This phase includes frequent itemset discovery and rule mining/extraction methods that generate the set of class association rules (CARs). Several techniques have been proposed to improve the rule generation method. A technique that uses the discriminative power of itemsets can reduce the number of frequent itemsets by pruning the useless ones. The closed frequent itemset concept can be used to compress the rules into a more compact set, which may reduce the number of generated rules. Another technique concerns determining the support threshold of an itemset: specifying multiple support thresholds according to the class label frequencies, rather than a single one, can give more appropriate threshold values and may generate more accurate rules. An alternative technique for rule generation uses a vertical layout to represent the dataset. This method is very effective because it needs only one scan over the dataset, compared with other techniques that need multiple scans. However, one problem with this approach is that the initial set of tid-lists may be too large to fit into main memory, so more sophisticated techniques are required to compress the tid-lists.
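The vertical layout mentioned above can be sketched as follows. This is a minimal illustration under assumed names (`build_tidlists`, `itemset_support`) and a made-up four-transaction dataset, not code from the surveyed work: one pass over the data builds a tid-list per item, after which the support of any itemset is the size of the intersection of its items' tid-lists.

```python
from collections import defaultdict


def build_tidlists(transactions):
    """Single pass: map each item to the set of transaction ids that contain it."""
    tidlists = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)


def itemset_support(itemset, tidlists):
    """Support of an itemset is the size of the intersection of its tid-lists."""
    lists = [tidlists.get(item, set()) for item in itemset]
    return len(set.intersection(*lists)) if lists else 0


# Hypothetical toy database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
tidlists = build_tidlists(transactions)
pair_support = itemset_support({"bread", "milk"}, tidlists)
```

The memory concern raised in the abstract is visible here: `tidlists` holds one transaction id per item occurrence, so it grows with the total size of the dataset.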


Author(s):  
Luminita Dumitriu

Association rules, introduced by Agrawal, Imielinski and Swami (1993), provide useful means to discover associations in data. The problem of mining association rules in a database is defined as finding all the association rules that hold with more than a user-given minimum support threshold and a user-given minimum confidence threshold. According to Agrawal, Imielinski and Swami, this problem is solved in two steps: 1. Find all frequent itemsets in the database. 2. For each frequent itemset I, generate all the association rules I′ ⇒ I \ I′, where I′ ⊂ I.
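The second step can be sketched as below, assuming the first step has already produced support counts. The function name `generate_rules`, the `support_of` dictionary, and its values are hypothetical. For a frequent itemset, each non-empty proper subset serves as the antecedent, the remaining items form the consequent, and the rule is kept when its confidence (support of the whole itemset divided by support of the antecedent) meets the minimum.

```python
from itertools import combinations


def generate_rules(itemset, support_of, min_confidence):
    """Return (antecedent, consequent, confidence) for each qualifying proper subset."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), size)):
            confidence = support_of[itemset] / support_of[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical support counts for a toy database (step 1 is assumed done).
support_of = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 3,
    frozenset({"a", "b"}): 3,
}
rules = generate_rules({"a", "b"}, support_of, min_confidence=0.8)
```

With these counts, the rule from b to a has confidence 3/3 and is kept, while the rule from a to b has confidence 3/4 and is dropped.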


The concept and limitations of frequent itemset mining (FIM) are explored in this paper, for the purpose of extracting unknown hidden patterns as itemsets from a transactional database. Since candidate generation and support calculation are the major tasks in FIM, its major limitations are addressed: (i) a huge number of possible frequent itemsets are generated as candidates at each pass; (ii) the database is scanned at each pass to calculate the support of the generated itemsets; and (iii) the generated itemsets are highly sensitive to the minimum support threshold. SS-FIM, a single-scan algorithm, deals with the above limitations. However, several unnecessary itemsets are hashed into its buckets. To overcome this limitation, a partition-based approach is proposed in this paper. The proposed approach, PSSFIM, takes a single scan of the database to identify frequent itemsets. A unique feature of PSSFIM is that the number of candidate itemsets it generates is independent of the minimum support. It admits into the hash only those candidates that can possibly be frequent, which reduces the cost of verifying the support of generated candidates. It is compared with SS-FIM and Apriori on standard datasets. The results show that PSSFIM compares favorably with SS-FIM and Apriori.


2021 ◽  
Vol 7 ◽  
pp. e385
Author(s):  
Saood Iqbal ◽  
Abdul Shahid ◽  
Muhammad Roman ◽  
Zahid Khan ◽  
Shaha Al-Otaibi ◽  
...  

Mining frequently used items is a significant subject of data mining studies. In the last ten years, due to innovative development, the quantity of data has grown exponentially, which imposes new challenges for applications that mine frequent itemsets (FIs). Misconceived information may be found in recent algorithms, including both threshold-based and size-based algorithms. The threshold value plays a central role in generating frequent itemsets from a given dataset, and selecting a support threshold is very complicated for those unaware of the dataset’s characteristics. The performance of algorithms for finding FIs without a support threshold is, however, deficient due to heavy computation. Therefore, we propose a method to discover FIs without a support threshold, called Top-k frequent itemsets mining (TKFIM). It uses class equivalence and set-theory concepts for mining FIs. The proposed procedure does not miss any FIs; thus, accurate frequent patterns are mined. Furthermore, the results are compared with state-of-the-art techniques such as Top-k miner and Build Once and Mine Once (BOMO). The proposed TKFIM outperforms these approaches in terms of execution and performance, achieving gains of 92.70, 35.87, 28.53, and 81.27 percent over Top-k miner on the Chess, Mushroom, Connect, and T1014D100K datasets, respectively. Similarly, it achieves performance gains of 97.14, 100, 78.10, and 99.70 percent over BOMO on the Chess, Mushroom, Connect, and T1014D100K datasets, respectively. Therefore, it is argued that the proposed procedure may be adopted on large datasets for better performance.
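The problem statement (return the k most frequent itemsets with no support threshold supplied) can be illustrated with a naive baseline. This is not TKFIM, Top-k miner, or BOMO; it is a brute-force sketch under assumed names that counts every subset of every transaction, which is exactly the kind of heavy computation the abstract says makes threshold-free mining expensive.

```python
from collections import Counter
from itertools import combinations


def top_k_itemsets(transactions, k):
    """Count every non-empty subset of every transaction, then keep the k most frequent."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            counts.update(combinations(items, size))
    # Sort by descending support; break ties by itemset so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


# Hypothetical toy database.
transactions = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"a"}]
top3 = top_k_itemsets(transactions, k=3)
```

The user chooses only `k`; the effective support cutoff (here 3) falls out of the data rather than being guessed in advance.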


2021 ◽  
Vol 11 (19) ◽  
pp. 8971
Author(s):  
Yalong Zhang ◽  
Wei Yu ◽  
Xuan Ma ◽  
Hisakazu Ogura ◽  
Dongfen Ye

The solution space of frequent itemsets generally grows exponentially because of the high-dimensional attributes of big data. However, big data association rule analysis presupposes mining the frequent itemsets in high-dimensional transaction sets. Traditional and classical algorithms such as Apriori and FP-Growth, as well as their derivative algorithms, are unacceptable for practical big data analysis in an explosive solution space because of their huge consumption of storage space and running time. A multi-objective optimization algorithm is proposed to mine the frequent itemsets of high-dimensional data. First, all frequent 2-itemsets are generated by scanning the transaction sets; new items are then added to them to form the objects of population evolution. The algorithm aims to search for maximal frequent itemsets in order to gather more non-empty subsets, because all non-empty subsets of a frequent itemset are themselves frequent. During the operation of the algorithm, lethal gene fragments in individuals are recorded and eliminated so that individuals may resurge. Finally, the set of Pareto-optimal solutions of frequent itemsets is obtained. All non-empty subsets of these solutions are frequent itemsets, and all of their supersets are infrequent. The practicability and validity of the proposed algorithm on big data are demonstrated by experiments.


2010 ◽  
Vol 35 (7) ◽  
pp. 825-843 ◽  
Author(s):  
Xiaohui Yu ◽  
Junfeng Dong

2015 ◽  
Vol 710 ◽  
pp. 127-131
Author(s):  
Qing Chao Jiang

In the mining of association rules, the generation of frequent itemsets is a key factor that influences the efficiency and performance of the algorithm. As data dimensionality increases, traditional association rule mining algorithms cannot meet the demands of high-dimensional data mining. On the basis of the Apriori algorithm, we put forward the Split Mtrix _Apriori algorithm in this paper. By generating the Boolean matrix of the database, the Split Mtrix _Apriori algorithm reduces the number of database scans needed when generating the frequent itemsets. By adopting a grouping strategy for processing the Boolean matrix, the algorithm remains efficient when dealing with high-dimensional data. Split Mtrix _Apriori thus improves the efficiency of association rule mining significantly.
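The Boolean-matrix idea can be sketched as follows. This is not the Split Mtrix _Apriori algorithm itself (its grouping strategy is not described in enough detail here); it only illustrates, under assumed names, why a Boolean matrix avoids repeated scans. Each item's column is stored as a Python integer used as a bit-vector, built in one pass, and the support of an itemset is the number of bits left after AND-ing its columns.

```python
def build_boolean_columns(transactions):
    """Single database scan: one bit-vector per item, with bit r set if row r has the item."""
    columns = {}
    for row, t in enumerate(transactions):
        for item in t:
            columns[item] = columns.get(item, 0) | (1 << row)
    return columns


def support(itemset, columns, n_rows):
    """AND the columns of the itemset and count the rows that survive."""
    mask = (1 << n_rows) - 1  # start with every row selected
    for item in itemset:
        mask &= columns.get(item, 0)
    return bin(mask).count("1")


# Hypothetical toy database.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
columns = build_boolean_columns(transactions)
ab_support = support({"a", "b"}, columns, len(transactions))
```

After the single scan that builds `columns`, every candidate's support comes from bitwise operations on the matrix, so the original database is never read again.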

