Author(s):  
Rosa Meo ◽  
Giuseppe Psaila

Mining of association rules is one of the most widely adopted data mining techniques across application domains. A great deal of work has been carried out in recent years on the development of efficient algorithms for association rule extraction. Indeed, the problem is computationally difficult, known to be NP-hard (Calders, 2004), and is compounded by the fact that association rules are normally extracted from very large databases. Moreover, in order to increase the relevance and interestingness of the results and to reduce the volume of the overall output, constraints on association rules are introduced and must be evaluated (Ng et al., 1998; Srikant et al., 1997). In this contribution, however, we focus not on the problem of developing efficient algorithms but on the semantic problem behind the extraction of association rules (see Tsur et al. [1998] for an interesting generalization of this problem).
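To make the notions of support, confidence, and rule constraints concrete, the following is a minimal brute-force sketch (not any of the cited algorithms; the `constraint` callback and all names are illustrative) of constraint-based rule extraction in the spirit of Ng et al. (1998) and Srikant et al. (1997):

```python
from itertools import combinations

def mine_rules(transactions, min_support, min_confidence,
               constraint=lambda ante, cons: True):
    """Brute-force association-rule miner with a user-supplied rule constraint.

    `constraint(antecedent, consequent)` lets the caller prune the output,
    illustrating how constraints reduce the volume of the result.
    """
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    # Count support for every candidate itemset up to size 3
    # (exponential in general -- hence the need for efficient algorithms).
    support = {}
    for size in range(1, 4):
        for cand in combinations(items, size):
            count = sum(1 for t in transactions if set(cand) <= t)
            if count / n >= min_support:
                support[frozenset(cand)] = count / n

    rules = []
    for itemset, sup in support.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), k):
                ante = frozenset(ante)
                cons = itemset - ante
                conf = sup / support[ante]  # subsets of a frequent set are frequent
                if conf >= min_confidence and constraint(ante, cons):
                    rules.append((set(ante), set(cons), sup, conf))
    return rules

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
rules = mine_rules(baskets, min_support=0.5, min_confidence=0.6)
```

Passing, say, `constraint=lambda a, c: "bread" in c` would keep only rules whose consequent mentions bread, evaluated during extraction rather than by post-filtering.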


Author(s):  
LAWRENCE MAZLACK

Determining causality has been a tantalizing goal throughout human history. Proper sacrifices to the gods were thought to bring rewards; failure to make suitable observations was thought to lead to disaster. Today, data mining holds the promise of extracting unsuspected information from very large databases. Methods have been developed to build association rules from large data sets. Association rules indicate the strength of association between two or more data attributes. In many ways, the interest in association rules lies in their promise (or illusion) of causal, or at least predictive, relationships. However, association rules only calculate a joint probability; they do not express a causal relationship. Discovering genuine causal relationships would be very useful. Our goal is to explore causality in the data mining context.
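The point that a rule's confidence is merely a conditional relative frequency, not evidence of causation, can be seen in a tiny sketch with hypothetical data (the rain/umbrella/accident scenario below is an invented confounding example, not from the article):

```python
# Hypothetical data: rain drives both umbrella sales and traffic accidents.
# The rule {umbrella} -> {accident} scores a high confidence even though
# umbrellas do not cause accidents -- the association reflects a confounder.
days = [
    {"rain", "umbrella", "accident"},
    {"rain", "umbrella", "accident"},
    {"rain", "umbrella"},
    {"sunny"},
    {"sunny"},
]

def confidence(antecedent, consequent, data):
    """Rule confidence is just P(consequent | antecedent) estimated from
    co-occurrence counts -- a joint-probability statement, not a causal one."""
    has_ante = [d for d in data if antecedent <= d]
    return sum(1 for d in has_ante if consequent <= d) / len(has_ante)

conf = confidence({"umbrella"}, {"accident"}, days)  # 2 of 3 umbrella days
```

Any intervention-free dataset with this co-occurrence pattern yields the same confidence, regardless of which variable actually causes which.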


2011 ◽  
Vol 1 (2) ◽  
Author(s):  
Venkatapathy Umarani ◽  
Muthusamy Punithavalli

The discovery of association rules is an important and challenging data mining task. Most existing algorithms for finding association rules require multiple passes over the entire database, and the I/O overhead incurred is extremely high for very large databases. An obvious approach to reducing the complexity of association rule mining is sampling. In recent times, several sampling-based approaches have been developed to speed up association rule mining. Here, a proficient progressive sampling-based approach is presented for mining association rules from large databases. First, frequent itemsets are mined from an initial sample; subsequently, the negative border is computed from the mined frequent itemsets. Based on the support computed for the midpoint itemset of the sorted negative border, either the sample size is increased or association rules are mined from the current sample. In this paper, we present an extensive analysis of the progressive sampling-based approach on different real-life datasets and, in addition, compare its performance with the well-known association rule mining algorithm Apriori. The experimental results show that both the accuracy and the computation time of mining association rules from real-life datasets are effectively improved by the progressive sampling-based approach.
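The control loop described above can be sketched as follows. This is a minimal, brute-force reconstruction under stated assumptions (exhaustive itemset counting, a simple doubling schedule, and a stop test on the midpoint border itemset); it is not the authors' implementation, and all function names are illustrative:

```python
import random
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Exhaustively count itemsets up to max_size (illustrative only)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    freq = {}
    for size in range(1, max_size + 1):
        for cand in combinations(items, size):
            s = sum(1 for t in transactions if set(cand) <= t) / n
            if s >= min_support:
                freq[frozenset(cand)] = s
    return freq

def negative_border(freq, items, max_size=3):
    """Minimal itemsets that are infrequent but whose proper subsets are all frequent."""
    border = []
    for size in range(1, max_size + 1):
        for cand in combinations(items, size):
            cs = frozenset(cand)
            if cs in freq:
                continue
            if all(frozenset(sub) in freq
                   for k in range(1, size)
                   for sub in combinations(cand, k)):
                border.append(cs)
    return border

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def progressive_mine(db, min_support, start=0.25, growth=2.0, seed=0):
    """Grow the sample until the midpoint negative-border itemset is
    confirmed infrequent in the full database, then return the sample's
    frequent itemsets."""
    rng = random.Random(seed)
    frac = start
    while True:
        sample = rng.sample(db, max(1, int(frac * len(db))))
        freq = frequent_itemsets(sample, min_support)
        items = sorted({i for t in db for i in t})
        border = sorted(negative_border(freq, items),
                        key=lambda s: support(s, sample))
        if not border:
            return freq
        mid = border[len(border) // 2]
        # If the midpoint border itemset is actually frequent in the full
        # database, the sample was too small: enlarge it and retry.
        if support(mid, db) >= min_support and frac < 1.0:
            frac = min(1.0, frac * growth)
            continue
        return freq
```

The negative border acts as a self-check: if itemsets just outside the sample's frequent set turn out frequent in the full database, the sample has underestimated support and must grow.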


2017 ◽  
Vol 26 (1) ◽  
pp. 69-85
Author(s):  
Mohammed M. Fouad ◽  
Mostafa G.M. Mostafa ◽  
Abdulfattah S. Mashat ◽  
Tarek F. Gharib

Association rules provide important knowledge that can be extracted from transactional databases. Owing to the massive exchange of information nowadays, databases have become dynamic, changing rapidly and periodically: new transactions are added to the database, and/or old transactions are updated or removed from it. Incremental mining was introduced to overcome the problem of maintaining previously generated association rules in dynamic databases. In this paper, we propose an efficient algorithm (IMIDB) for incremental itemset mining in large databases. The algorithm utilizes the trie data structure to index the transactions of the dynamic database. Performance comparison of the proposed algorithm with recently cited algorithms shows that it achieves a significant improvement of about two orders of magnitude. The proposed algorithm also exhibits linear scalability with respect to database size.
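To illustrate why a trie suits dynamic transaction databases, here is a minimal sketch of a transaction trie supporting insertion, deletion, and support queries. This is an assumption-laden illustration of the general technique, not the IMIDB algorithm itself; all names are hypothetical:

```python
class Node:
    def __init__(self):
        self.children = {}   # item -> Node
        self.passing = 0     # transactions whose sorted path goes through here

class TransactionTrie:
    """Trie over lexicographically sorted transactions: insert and delete
    cost O(|t|) each, so the index can follow a changing database without
    re-reading it."""

    def __init__(self):
        self.root = Node()

    def _walk(self, transaction, delta):
        node = self.root
        node.passing += delta
        for item in sorted(transaction):
            node = node.children.setdefault(item, Node())
            node.passing += delta

    def add(self, transaction):
        self._walk(transaction, +1)

    def remove(self, transaction):
        # Sketch: assumes the transaction was previously added.
        self._walk(transaction, -1)

    def support(self, itemset):
        """Count indexed transactions containing `itemset`."""
        target = sorted(itemset)

        def count(node, idx):
            if idx == len(target):
                return node.passing  # all remaining items already matched
            total = 0
            for item, child in node.children.items():
                if item == target[idx]:
                    total += count(child, idx + 1)
                elif item < target[idx]:
                    total += count(child, idx)
                # item > target[idx]: prune -- paths are sorted, so
                # target[idx] cannot appear deeper in this branch
            return total

        return count(self.root, 0)
```

Because updates only touch one root-to-leaf path, adding or removing a transaction never requires rescanning the database, which is the core benefit incremental miners exploit.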


2011 ◽  
Vol 145 ◽  
pp. 292-296
Author(s):  
Lee Wen Huang

Data mining is the process of nontrivial extraction of implicit, previously unknown, and potentially useful information from data in databases. Mining closed large itemsets extends the mining of association rules: the aim is to find the set of necessary subsets of large itemsets that can represent all large itemsets. In this paper, we design a hybrid approach that takes the character of the data into account to mine closed large itemsets efficiently. Two features of market basket analysis are considered: the number of items is large, while the number of items associated with each item is small. By combining the cut-point method and the hash concept, the new algorithm finds closed large itemsets efficiently. The simulation results show that the new algorithm outperforms the FP-CLOSE algorithm in both execution time and storage space.
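The defining property of closed itemsets (a frequent itemset is closed iff no proper superset has the same support, so the closed sets losslessly represent all frequent sets) can be shown with a naive sketch. This illustrates only the definition, not the paper's cut-point/hash algorithm:

```python
from itertools import combinations

def closed_itemsets(transactions, min_support):
    """Naive closed-itemset miner: enumerate frequent itemsets, then keep
    those with no equally-supported proper superset. Exponential; for
    definition purposes only."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    support = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            c = sum(1 for t in transactions if set(cand) <= t)
            if c / n >= min_support:
                support[frozenset(cand)] = c

    # A superset always has support <= its subset, so an equal-count
    # superset of a frequent itemset is itself frequent: checking only
    # within `support` is sufficient.
    closed = {}
    for itemset, c in support.items():
        if not any(itemset < other and cnt == c
                   for other, cnt in support.items()):
            closed[itemset] = c / n
    return closed
```

For example, if every basket containing bread also contains milk, {bread} is not closed ({bread, milk} has the same support), and reporting only {bread, milk} loses no information.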


Author(s):  
Gautam Das

In recent years, advances in data collection and management technologies have led to a proliferation of very large databases. These large data repositories are typically created in the hope that, through analysis such as data mining and decision support, they will yield new insights into the data and the real-world processes that created them. In practice, however, while the collection and storage of massive datasets have become relatively straightforward, effective data analysis has proven more difficult to achieve. One reason data analysis successes have proven elusive is that most analysis queries, by their nature, require aggregation or summarization of large portions of the data being analyzed. For multi-gigabyte data repositories, this means that processing even a single analysis query involves accessing enormous amounts of data, leading to prohibitively expensive running times. This severely limits the feasibility of many types of analysis applications, especially those that depend on timeliness or interactivity.
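One standard response to this cost, common in approximate query processing, is to answer aggregate queries from a small uniform sample and scale the result, trading exact answers for interactive latency. The sketch below (illustrative names and a textbook normal-approximation error bound, not any specific system) shows the idea for SUM:

```python
import math
import random
import statistics

def approximate_sum(table, column, sample_frac=0.01, seed=0):
    """Estimate SUM(column) from a uniform random sample by scaling the
    sample sum by N/k, returning the estimate and a rough 95% confidence
    half-width from the sample standard deviation."""
    rng = random.Random(seed)
    k = max(2, int(len(table) * sample_frac))
    sample = [row[column] for row in rng.sample(table, k)]

    estimate = sum(sample) * (len(table) / k)
    std_err = statistics.stdev(sample) / math.sqrt(k)
    half_width = 1.96 * std_err * len(table)
    return estimate, half_width
```

Reading a 1% sample touches roughly 1% of the data, so a query that scanned gigabytes now scans tens of megabytes, at the cost of a quantifiable error bar.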


Author(s):  
Andrew Borthwick ◽  
Stephen Ash ◽  
Bin Pang ◽  
Shehzad Qureshi ◽  
Timothy Jones
