DISSparse: Efficient Mining of Discriminative Itemsets

Author(s):  
Majid Seyfi ◽  
Richi Nayak ◽  
Yue Xu ◽  
Shlomo Geva

We tackle the problem of discriminative itemset mining. Given a set of datasets, we want to find the itemsets that are frequent in the target dataset and have much higher frequencies than the same itemsets in the other datasets. Such itemsets are very useful for dataset discrimination. We demonstrate that this problem has important applications and, at the same time, is very challenging. We present the DISSparse algorithm, a mining method that uses two determinative heuristics based on the sparsity characteristics of discriminative itemsets, which form a small subset of the frequent itemsets. We prove that the DISSparse algorithm is sound and complete. We experimentally investigate the performance of DISSparse on a range of datasets, evaluating its efficiency and stability and demonstrating that it is substantially faster than the baseline method.
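
To make the problem statement concrete, here is a brute-force sketch of the discriminative-itemset criterion: itemsets frequent in the target dataset whose relative frequency exceeds that in a background dataset by a ratio threshold. The parameter names `min_sup` and `min_ratio` are illustrative, and DISSparse's sparsity-based pruning heuristics are not shown.

```python
from itertools import combinations

def discriminative_itemsets(target, background, min_sup=0.4, min_ratio=3.0, max_len=3):
    """Naive enumeration of the discriminative-itemset criterion."""
    def rel_freq(itemset, dataset):
        return sum(itemset <= t for t in dataset) / len(dataset)

    items = sorted({i for t in target for i in t})
    result = {}
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            f_t = rel_freq(itemset, target)
            if f_t < min_sup:
                continue  # not frequent in the target dataset
            f_b = rel_freq(itemset, background)
            # Absence in the background counts as maximally discriminative.
            if f_b == 0 or f_t / f_b >= min_ratio:
                result[itemset] = (f_t, f_b)
    return result

target = [{"a", "b"}, {"a", "b", "c"}, {"a"}]
background = [{"b", "c"}, {"c"}, {"b"}]
print(discriminative_itemsets(target, background))
```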

2021 ◽  
Vol 16 (2) ◽  
pp. 1-30
Author(s):  
Guangtao Wang ◽  
Gao Cong ◽  
Ying Zhang ◽  
Zhen Hai ◽  
Jieping Ye

Streams in which multiple transactions are associated with the same key are prevalent in practice; e.g., a customer has multiple shopping records arriving at different times. Itemset frequency estimation on such streams is very challenging, since sampling-based methods, such as the popularly used reservoir sampling, cannot be applied. In this article, we propose a novel k-Minimum Values (KMV) synopsis based method to estimate the frequency of itemsets over multi-transaction streams. First, we extract a KMV synopsis for each item from the stream. Then, we propose a novel estimator to estimate the frequency of an itemset over the KMV synopses. Compared to the existing estimator, our method is not only more accurate and efficient to calculate but also follows the downward-closure property. These properties enable the incorporation of our new estimator into existing frequent itemset mining (FIM) algorithms (e.g., FP-Growth) to mine frequent itemsets over multi-transaction streams. To demonstrate this, we implement a KMV synopsis based FIM algorithm by integrating our estimator into existing FIM algorithms, and we prove it is capable of guaranteeing the accuracy of FIM with a bounded KMV synopsis size. Experimental results on massive streams show that our estimator can significantly improve accuracy, for both itemset frequency estimation and FIM, compared to the existing estimators.
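
The KMV building block can be sketched as follows: each item keeps the k smallest hash values of the keys it appears under, and the number of distinct keys containing an itemset is estimated from the merged synopses. This is a standard KMV set-operation estimate in the style of Beyer et al., shown only for intuition; the paper's estimator with the downward-closure property is more refined than this sketch.

```python
import hashlib

def h(key, salt="kmv"):
    """Hash a key to a pseudo-uniform float in (0, 1]."""
    digest = hashlib.sha1((salt + str(key)).encode()).hexdigest()
    return (int(digest, 16) + 1) / 16**40

class KMV:
    """k-Minimum Values synopsis of the set of keys an item appears under."""
    def __init__(self, k):
        self.k = k
        self.values = []          # sorted k smallest hash values seen

    def add(self, key):
        v = h(key)
        if v not in self.values:
            self.values.append(v)
            self.values.sort()
            del self.values[self.k:]   # keep only the k smallest

def estimate_itemset_keys(synopses, k):
    """Estimate how many distinct keys contain all the items, from the
    items' per-item KMV synopses."""
    merged = sorted(set().union(*(s.values for s in synopses)))
    union_k = merged[:k]
    in_all = sum(all(v in s.values for s in synopses) for v in union_k)
    if len(merged) < k:                # synopses cover the sets exactly
        return float(in_all)
    d_union = (k - 1) / union_k[-1]    # KMV distinct-count estimate of the union
    return in_all / k * d_union        # Jaccard fraction times union size

a, b = KMV(64), KMV(64)
for key in range(1000):
    a.add(key)
    if key % 2 == 0:
        b.add(key)
print(estimate_itemset_keys([a, b], 64))   # roughly 500
```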


2009 ◽  
Vol 12 (11) ◽  
pp. 49-56
Author(s):  
Bac Hoai Le ◽  
Bay Dinh Vo

In traditional association rule mining, finding all association rules that satisfy minSup and minConf runs into problems when the number of frequent itemsets is large. It is therefore desirable to have a method that mines fewer rules while still covering all the rules of the traditional method. One such approach is mining essential rules: it keeps only the rules whose left-hand side is minimal and whose right-hand side is maximal (with respect to the parent-child relationship). In this paper, we propose a new algorithm for mining essential rules from the frequent closed itemsets lattice, with the goal of reducing rule mining time. We use the parent-child relationships already encoded in the lattice to reduce the cost of checking them, which in turn reduces the time needed to mine the rules.
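
As a minimal illustration of the essential-rule definition, the sketch below filters a rule set, keeping only rules not dominated by another rule with a smaller antecedent and a larger consequent. It is a naive O(n²) pairwise check shown for clarity; the paper instead exploits the closed-itemset lattice to avoid these comparisons.

```python
def essential_rules(rules):
    """Keep rules (A => C) for which no other rule (A2 => C2) exists
    with A2 a subset of A and C2 a superset of C."""
    kept = []
    for a1, c1 in rules:
        dominated = any(
            (a2, c2) != (a1, c1) and a2 <= a1 and c2 >= c1
            for a2, c2 in rules
        )
        if not dominated:
            kept.append((a1, c1))
    return kept

rules = [
    (frozenset("a"), frozenset("bc")),
    (frozenset("ab"), frozenset("c")),   # dominated by the rule above
]
print(essential_rules(rules))            # only {a} => {b, c} survives
```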


2013 ◽  
Vol 9 (4) ◽  
pp. 62-75 ◽  
Author(s):  
Damla Oguz ◽  
Baris Yildiz ◽  
Belgin Ergenc

Updates on an operational database bring forth the challenge of keeping the frequent itemsets up to date without re-running the itemset mining algorithms. Studies on dynamic itemset mining, which is the solution to this update problem, have to address several challenges: handling (i) updates without re-running the base algorithm, (ii) changes in the support threshold, (iii) new items, and (iv) additions and deletions in updates. The study in this paper extends the Incremental Matrix Apriori algorithm, which proposes solutions to the first three challenges while inheriting the advantages of the base algorithm, which works without candidate generation. In their current work, the authors have improved the former algorithm to handle updates composed of both additions and deletions. They have also carried out a detailed performance evaluation on one real and two benchmark datasets.
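
A minimal sketch of the additions/deletions idea: itemset counts are adjusted using the delta transactions alone, with no rescan of the original database. It illustrates only challenge (iv); the matrix structure of Incremental Matrix Apriori, threshold changes, and new-item handling are not modeled, and `max_len` is an illustrative cap.

```python
from collections import Counter
from itertools import combinations

def delta_update(counts, added, deleted, max_len=2):
    """Adjust itemset counts for a batch of added and deleted
    transactions without rescanning the original database."""
    def subsets(t):
        for k in range(1, max_len + 1):
            for c in combinations(sorted(t), k):
                yield frozenset(c)

    for t in added:
        for s in subsets(t):
            counts[s] += 1
    for t in deleted:
        for s in subsets(t):
            counts[s] -= 1
    return counts

counts = Counter({frozenset({"a"}): 5, frozenset({"a", "b"}): 3})
delta_update(counts, added=[{"a", "b"}], deleted=[{"a"}])
print(counts[frozenset({"a", "b"})])   # 4
```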


2017 ◽  
Vol 8 (1) ◽  
pp. 31-43
Author(s):  
Zuber Shaikh ◽  
Antara Mohadikar ◽  
Rachana Nayak ◽  
Rohith Padamadan

Frequent itemsets refer to sets of data values (e.g., product items) whose number of co-occurrences exceeds a given threshold. The challenge is that the design of proofs and verification objects has to be customized for different data mining algorithms. The intended method implements a basic completeness verification and authentication approach in which the client uses a set of frequent itemsets as evidence and checks whether the server has missed any frequent itemset in its returned result. This helps the client detect an untrusted server, and the system becomes much more efficient by reducing verification time. In the authentication process, CaRP is used; it is both a CAPTCHA and a graphical password scheme. CaRP addresses a number of security problems together, such as online guessing attacks, relay attacks, and, if combined with dual-view technologies, shoulder-surfing attacks.
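
A hedged sketch of the client-side completeness check follows: every evidence itemset the client independently knows to be frequent must appear, with the correct support, in the server's answer. The names `evidence` and `returned` are illustrative, not the paper's notation.

```python
def verify_completeness(evidence, returned):
    """`evidence` maps itemsets the client knows to be frequent to their
    supports; `returned` maps the server's reported frequent itemsets to
    supports. Any missing or misreported evidence itemset exposes an
    incomplete or untrusted server."""
    missing = [e for e in evidence if e not in returned]
    wrong = [e for e in evidence
             if e in returned and returned[e] != evidence[e]]
    return (not missing and not wrong), missing, wrong

evidence = {frozenset({"milk", "bread"}): 42}
server_result = {frozenset({"milk", "bread"}): 42, frozenset({"eggs"}): 17}
ok, missing, wrong = verify_completeness(evidence, server_result)
print(ok)   # True: all evidence itemsets were returned correctly
```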


Author(s):  
Padmanathan Anantharaman ◽  
H.V. Ramakrishan

As data volumes continue to grow, they quickly consume the capacity of data warehouses and application databases, forcing IT organizations into costly upgrades to expensive databases and data warehouse hardware appliances. At the same time, an enormous amount of data is being generated through the Internet of Things (IoT) as technologies advance and people use them in day-to-day activities; this data is termed Big Data, with its own characteristics and challenges. Frequent itemset mining algorithms aim to disclose frequent itemsets from a transactional database, but as the dataset size increases, it can no longer be handled by traditional frequent itemset mining. The MapReduce programming model solves the problem of large datasets, but it has a large communication cost, which reduces execution efficiency. The proposed approach applies a new k-means pre-processing technique to the BigFIM algorithm. ClustBigFIM uses a hybrid approach: clustering with the k-means algorithm to generate clusters from huge datasets, and Apriori and Eclat to mine frequent itemsets from the generated clusters using the MapReduce programming model. Results show that the execution efficiency of the ClustBigFIM algorithm is increased by applying the k-means clustering algorithm before the BigFIM algorithm as a pre-processing technique.
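
A single-machine stand-in for this pipeline is sketched below: k-means clusters the binarized transactions, then each cluster is mined separately. scikit-learn's KMeans and a naive counting loop substitute for the MapReduce-based BigFIM/Apriori/Eclat stages described in the paper, so this shows only the shape of the hybrid approach.

```python
from collections import Counter
from itertools import combinations

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

def clustered_fim(transactions, n_clusters=2, min_sup=2):
    """Pre-process with k-means, then mine each cluster separately."""
    # Binarize transactions so k-means can cluster them.
    vec = CountVectorizer(analyzer=lambda t: t, binary=True)
    X = vec.fit_transform(transactions)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)

    per_cluster = {}
    for c in range(n_clusters):
        cluster = [t for t, l in zip(transactions, labels) if l == c]
        counts = Counter()
        for t in cluster:
            for k in (1, 2):
                counts.update(frozenset(s) for s in combinations(sorted(t), k))
        per_cluster[c] = {s: n for s, n in counts.items() if n >= min_sup}
    return per_cluster

tx = [["a", "b"], ["a", "b", "c"], ["a", "c"], ["x", "y"], ["x", "y", "z"], ["x", "z"]]
print(clustered_fim(tx))
```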


This paper explores the frequent itemset mining (FIM) concept and its limitations, with the purpose of extracting unknown hidden patterns as itemsets from a transactional database. Since candidate generation and support calculation are the major tasks in FIM, the paper tackles its major limitations: (i) a huge number of possibly frequent itemsets are generated as candidates at each pass, (ii) the database is scanned at each pass to calculate the support of the generated itemsets, and (iii) the generated itemsets are highly sensitive to the minimum support threshold. SS-FIM, a single-scan algorithm, deals with these limitations; however, it hashes many unnecessary itemsets into its buckets. To overcome this, a partition-based approach is proposed in this paper. The proposed approach, PSSFIM, takes a single scan of the database to identify frequent itemsets. A unique feature of PSSFIM is that the size of the candidate itemsets it generates is independent of the minimum support. It hashes only candidates that can possibly be frequent, which intuitively reduces the cost of verifying the support of the generated candidates. It is compared with SS-FIM and Apriori on standard datasets, and the results show that PSSFIM compares favorably with both.
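
The partition idea can be sketched as follows: one scan routes each transaction to a partition and hashes its candidate subsets into that partition's buckets, and the per-partition counts are then merged. This illustrates partitioned single-scan hashing only, not PSSFIM's actual pruning rules; `max_len` and the round-robin routing are illustrative choices.

```python
from collections import defaultdict
from itertools import combinations

def partitioned_single_scan(transactions, n_parts, min_sup, max_len=2):
    """One pass: hash each transaction's candidate subsets into the
    buckets of its partition, then merge partition counts."""
    buckets = [defaultdict(int) for _ in range(n_parts)]
    for i, t in enumerate(transactions):          # the single scan
        part = i % n_parts                        # round-robin routing
        for k in range(1, max_len + 1):
            for c in combinations(sorted(t), k):
                buckets[part][frozenset(c)] += 1

    totals = defaultdict(int)
    for b in buckets:
        for itemset, n in b.items():
            totals[itemset] += n
    return {s: n for s, n in totals.items() if n >= min_sup}

tx = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a", "c"}]
print(partitioned_single_scan(tx, n_parts=2, min_sup=2))
```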


Author(s):  
Hanaa Ibrahim Abu Zahra ◽  
Shaker El-Sappagh ◽  
Tarek Ahmef El Shishtawy

Most frequent itemset mining algorithms (FIMA) discover hidden relationships among otherwise unrelated items. They find the most frequent itemsets depending only on the frequency of an item's occurrence in the dataset. These algorithms give all items the same importance and neglect the differences in importance between items. They also assume the data is fully certain, but in most cases real-world data may be uncertain; as a result, the data can be incomplete and/or imprecise. These two problems are the most common challenges facing FIMA algorithms. Some newer algorithms propose solutions to these two issues separately: some handle item importance only, and others handle uncertainty only. Few algorithms deal with the two issues together. In this article, the single scan for weighted itemsets over uncertain databases (SSU-Wfim) is proposed. It builds on the single scan frequent itemsets algorithm (SS_FIM) and enhances it to deal with weighted items in an uncertain database. SSU-Wfim handles the uncertainty of the data by giving each item in a transaction an additional value indicating its occurrence likelihood, and it assigns items different values to define their weights. It uses a table called Ptable to store the items and their probability values; this table is used to generate all possible candidate itemsets. The results indicate the high performance of SSU-Wfim in runtime, memory consumption, and scalability compared with the UApriori algorithm. The proposed algorithm saves time and memory by more than 70% on all tested datasets.
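
A minimal sketch of weighted expected support over an uncertain database: each transaction maps items to existence probabilities (the Ptable idea), and an itemset's score is its expected support scaled by the mean weight of its items. The averaging convention is an assumption for illustration; the paper's exact weighting scheme may differ.

```python
from math import prod

def weighted_expected_support(itemset, transactions, weights):
    """Expected support of `itemset` over an uncertain database, scaled
    by the mean of the items' weights. Each transaction maps items to
    their existence probabilities."""
    expected = sum(
        prod(t[i] for i in itemset)        # P(all items co-occur in t)
        for t in transactions
        if all(i in t for i in itemset)
    )
    w = sum(weights[i] for i in itemset) / len(itemset)
    return w * expected

db = [{"a": 0.9, "b": 0.5}, {"a": 0.4}, {"a": 1.0, "b": 0.8}]
print(weighted_expected_support({"a", "b"}, db, {"a": 0.6, "b": 1.0}))  # 1.0
```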


2020 ◽  
Vol 76 (10) ◽  
pp. 7619-7634 ◽  
Author(s):  
Wen Xiao ◽  
Juan Hu

Finding frequent itemsets in continuous streaming data is an important data mining task, widely used in network monitoring, Internet of Things data analysis, and so on. In the era of big data, it is necessary to develop distributed frequent itemset mining algorithms to meet the needs of massive streaming data processing. Apache Spark is a unified analytics engine for massive data processing that has been successfully used in many data mining fields. In this paper, we propose SWEclat, a distributed algorithm for mining frequent itemsets over massive streaming data. The algorithm uses a sliding window to process the streaming data and a vertical data structure to store the dataset within the window. It is implemented in Apache Spark and uses Spark RDDs to store the streaming data and the dataset in vertical format, dividing these RDDs into partitions for distributed processing. Experimental results show that the SWEclat algorithm achieves good speedup, parallel scalability, and load balancing.
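
A toy PySpark sketch of the vertical representation: (item, tidset) pairs are built as an RDD from one window of transactions, infrequent items are filtered out, and tidsets are intersected Eclat-style for 2-itemsets. Window sliding and the deeper recursive mining that SWEclat performs across partitions are omitted.

```python
from itertools import combinations
from pyspark import SparkContext

sc = SparkContext(appName="sweclat-sketch")

# Current sliding window: (transaction_id, items) pairs.
window = [(1, {"a", "b"}), (2, {"a", "c"}), (3, {"a", "b", "c"})]
min_sup = 2

# Vertical format: one (item, tidset) pair per item, stored as an RDD.
vertical = (sc.parallelize(window)
              .flatMap(lambda tx: [(item, {tx[0]}) for item in tx[1]])
              .reduceByKey(lambda s, t: s | t)
              .filter(lambda kv: len(kv[1]) >= min_sup))

# Eclat-style step: intersect tidsets to find frequent 2-itemsets.
items = vertical.collect()
pairs = [(frozenset([a, b]), ta & tb)
         for (a, ta), (b, tb) in combinations(items, 2)
         if len(ta & tb) >= min_sup]
print(pairs)
sc.stop()
```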


Author(s):  
Fathima Sherin T K ◽  
Anish Kumar B.

Frequent itemset mining (FIM) is a data mining task that extracts frequent itemsets from a database. Existing methods assume that datasets are static and that the discovered rules are applicable throughout the entire dataset. This is not the case when the data is temporal, i.e., when it contains time-related information that changes the mining results. Patterns may occur over the whole period or only during specific intervals; to localize those intervals, frequent itemset mining with a time cube is proposed to handle time hierarchies in the mining technique. In this way, patterns that occur periodically, within a time interval, or both are recognized. This paper therefore centres on developing an efficient algorithm to mine frequent itemsets and their associated time intervals from a transactional database, extending the Apriori algorithm with support and density as thresholds. Density is proposed to deal with the overestimated time-interval problem and to ensure the authenticity of the patterns found. As an extension of the existing framework, the density rate and minimum threshold, previously user-specified parameters, are generated dynamically. In addition, a comparison of computation time is made between mining the dataset with and without partitioning, showing that computation time is lower with the partitioning technique.
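
As an illustration of the density idea, the sketch below computes an itemset's support and density within a candidate time interval, with density taken as the fraction of the interval's transactions that contain the itemset. This is an assumed definition for illustration only; the paper's density measure may be defined differently.

```python
def interval_support_density(itemset, transactions, start, end):
    """Support and density of `itemset` inside [start, end], where each
    transaction is a (timestamp, items) pair. A low density flags an
    interval whose span is likely overestimated."""
    in_interval = [items for ts, items in transactions if start <= ts <= end]
    if not in_interval:
        return 0, 0.0
    support = sum(itemset <= t for t in in_interval)
    return support, support / len(in_interval)

tx = [(1, {"a", "b"}), (2, {"a"}), (5, {"b"}), (6, {"a", "b"})]
print(interval_support_density({"a", "b"}, tx, start=1, end=2))  # (1, 0.5)
```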

