Finding Similar Documents Using Frequent Pattern Mining Methods

Mining massive datasets raises a number of new problems, one of which is finding similar documents. The shingling method converts this problem into a set-based problem. Some existing methods use min-hashing to compress the results derived from shingling and then apply the LSH method to find candidate pairs for similarity search among all pairs of documents. In this paper, an Apriori-based method is proposed for finding similar documents using a frequent itemset mining approach. To this end, the Apriori algorithm is modified and customized for the similarity search problem. The most important advantages of the proposed method are that it models similarity search as a frequent pattern mining problem, uses a modified version of Apriori, and selects the minimum support threshold dynamically; together these lead to a reasonable execution time and high-quality results. The proposed method finds similar documents in less time than the combined method and the MCVM method because it generates fewer candidate pairs. Furthermore, experimental results show the high quality of the answers produced by the proposed method.
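
The set-based view described in this abstract can be illustrated with a small sketch. It is not the paper's modified Apriori: it only shows word-level shingling, Jaccard similarity, and one plausible way to read similarity search as frequent 2-itemset counting, where each shingle is treated as a "transaction" whose items are the documents containing it. The function names, the shingle length `k`, and the fixed `min_support` are assumptions made for illustration (the paper selects the threshold dynamically).

```python
from collections import Counter
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_pairs(docs: dict[str, str], k: int = 3,
                    min_support: int = 2) -> dict[tuple[str, str], int]:
    """Count, for every document pair, how many shingles they share.

    Each shingle plays the role of a transaction; a pair of documents is a
    2-itemset whose support is the number of shared shingles.
    """
    index: dict[str, set[str]] = {}
    for name, text in docs.items():
        for s in shingles(text, k):
            index.setdefault(s, set()).add(name)
    counts: Counter = Counter()
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


# Toy documents, invented for this example.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox leaps over the lazy dog",
    "c": "an entirely different sentence about frequent itemsets",
}
pairs = candidate_pairs(docs, k=3, min_support=2)
similarity = jaccard(shingles(docs["a"]), shingles(docs["b"]))
print(pairs, similarity)
```

Only pairs that reach the support threshold need an exact Jaccard check, which is the sense in which fewer candidate pairs reduce the running time.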

Association Rules Optimization Algorithm Based on Fuzzy Clustering

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.602-605.3536 ◽

2014 ◽

Vol 602-605 ◽

pp. 3536-3539

Author(s):

Yu Fu ◽

Jun Rui Yang

Keyword(s):

Association Rules ◽

Fuzzy Clustering ◽

Pattern Mining ◽

Computing Time ◽

Frequent Pattern Mining ◽

Research Direction ◽

Frequent Itemset ◽

Frequent Pattern ◽

Good Prospect ◽

Original Dataset

Frequent pattern mining has been an important research direction in association rules. This paper preprocesses the original dataset using fuzzy clustering, which maps quantitative datasets into linguistic datasets. We then propose an algorithm based on a fuzzy frequent pattern tree for extracting fuzzy frequent itemsets from the mapped linguistic datasets. Experimental results show that our algorithm requires less computing time than F-Apriori on huge databases. For large databases, the algorithm presented in this paper is shown to have good prospects.

Bi-Directional Constraint Pushing in Frequent Pattern Mining

Data Mining Patterns ◽

10.4018/978-1-59904-162-9.ch002 ◽

2011 ◽

pp. 32-56

Author(s):

Osmar R. Zaïane ◽

Mohammed El-Hajj

Keyword(s):

Pattern Mining ◽

Frequent Pattern Mining ◽

Large Data ◽

Frequent Itemset ◽

Frequent Pattern ◽

Frequent Itemset Mining ◽

Data Sets ◽

Itemset Mining ◽

Transactional Databases ◽

The Cost

Frequent Itemset Mining (FIM) is a key component of many algorithms that extract patterns from transactional databases. For example, FIM can be leveraged to produce association rules, clusters, classifiers or contrast sets. This capability provides a strategic resource for decision support, and is most commonly used for market basket analysis. One challenge for frequent itemset mining is the potentially huge number of extracted patterns, which can eclipse the original database in size. In addition to increasing the cost of mining, this makes it more difficult for users to find the valuable patterns. Introducing constraints to the mining process helps mitigate both issues. Decision makers can restrict discovered patterns according to specified rules. By applying these restrictions as early as possible, the cost of mining can be constrained. For example, users may be interested in purchases whose total price exceeds $100, or whose items cost between $50 and $100. In cases of extremely large data sets, pushing constraints sequentially is not enough and parallelization becomes a must. However, specific design is needed to achieve sizes never reported before in the literature.
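
A minimal sketch of applying constraints early, under stated assumptions, is given here. It is not the chapter's algorithm and does not cover parallelization. It assumes non-negative prices, so that "total price at most `max_total`" is anti-monotone (once violated, every superset violates it too) and can prune candidates during level-wise search, while "total price at least `min_total`" is monotone and is applied when reporting results. The function name, parameters, and data are hypothetical.

```python
from itertools import combinations


def mine_with_constraints(transactions, prices, min_support,
                          min_total, max_total):
    """Level-wise frequent itemset mining with two price constraints.

    Anti-monotone constraints (support, total <= max_total) prune the
    search; the monotone constraint (total >= min_total) filters output.
    """
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def total(itemset):
        return sum(prices[i] for i in itemset)

    def keep(itemset):
        # Both checks are anti-monotone, so failing itemsets are never extended.
        return total(itemset) <= max_total and support(itemset) >= min_support

    items = sorted({i for t in transactions for i in t})
    level = [s for s in (frozenset([i]) for i in items) if keep(s)]
    results = {}
    while level:
        for s in level:
            if total(s) >= min_total:  # monotone constraint, checked on output
                results[s] = support(s)
        # Join pairs of surviving k-itemsets into (k+1)-itemset candidates.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if keep(c)]
    return results


# Toy data, invented for this example.
prices = {"tv": 90, "cable": 15, "mouse": 20, "pad": 5}
transactions = [
    frozenset({"tv", "cable"}),
    frozenset({"tv", "cable", "mouse"}),
    frozenset({"mouse", "pad"}),
    frozenset({"tv", "cable", "pad"}),
    frozenset({"mouse", "pad"}),
]
found = mine_with_constraints(transactions, prices, min_support=2,
                              min_total=100, max_total=120)
print(found)
```

In this toy run, `{mouse, pad}` is frequent but is filtered out by the monotone price constraint, so only `{tv, cable}` is reported.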

Incorporating occupancy into frequent pattern mining for high quality pattern recommendation

Proceedings of the 21st ACM international conference on Information and knowledge management - CIKM '12 ◽

10.1145/2396761.2396775 ◽

2012 ◽

Cited By ~ 14

Author(s):

Linpeng Tang ◽

Lei Zhang ◽

Ping Luo ◽

Min Wang

Keyword(s):

Pattern Mining ◽

Frequent Pattern Mining ◽

Frequent Pattern ◽

High Quality

Mining Closed Weighed Frequent Patterns from a Sliding Window over Data Stream

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.756-759.2606 ◽

2013 ◽

Vol 756-759 ◽

pp. 2606-2609

Author(s):

Cui Cui Ge ◽

Xiu Fen Fu

Keyword(s):

Data Stream ◽

Pattern Mining ◽

Frequent Pattern Mining ◽

Sliding Window ◽

Frequent Itemset ◽

Experimental Results ◽

Frequent Pattern ◽

Frequent Patterns ◽

Itemset Mining ◽

Basic Window

Weighted frequent pattern mining aims to discover more important frequent patterns by considering the different weights of items, while closed frequent pattern mining can significantly reduce the number of frequent itemsets mined and still keep sufficient result information. In this paper, we propose an algorithm, DS_CRWF, to mine closed weighted frequent patterns over a data stream. It is based on a sliding window and uses the basic window as the unit of updating, so that all closed weighted frequent patterns can be mined in a single scan. The experimental results show the feasibility of the algorithm.

Efficient Algorithms for Mining Frequent Patterns from Sparse and Dense Databases

Journal of Intelligent Systems ◽

10.1515/jisys-2014-0040 ◽

2015 ◽

Vol 24 (2) ◽

pp. 181-197

Author(s):

Lan Vu ◽

Gita Alaghband

Keyword(s):

Execution Time ◽

Pattern Mining ◽

Frequent Pattern Mining ◽

Optimization Techniques ◽

Frequent Pattern ◽

Frequent Patterns ◽

Memory Usage ◽

New Approach ◽

Support Threshold ◽

Best Fit

In this article, we present a new approach for frequent pattern mining (FPM) that runs fast for both sparse and dense databases. Two algorithms, FEM and DFEM, based on our approach are also introduced. FEM applies a fixed threshold as the condition for switching between the two mining strategies; meanwhile, DFEM adapts this threshold dynamically at runtime to best fit the characteristics of the database during the mining process, especially when the minimum support threshold is low. Additionally, we present optimization techniques for the proposed algorithms to speed up the mining process, reduce the memory usage, and optimize the I/O cost. We also analyze in depth the performance of FEM and DFEM and compare them with several existing algorithms. The experimental results show that FEM and DFEM achieve a significant improvement in execution time and consume less memory than many popular FPM algorithms, including the well-known Apriori, FP-growth, and Eclat.

PENERAPAN DATA MINING MENGGUNAKAN ALGORITMA APRIORI UNTUK MENENTUKAN POLA GOLONGAN PENYANDANG MASALAH KESEJAHTERAAN SOSIAL

Sebatik ◽

10.46984/sebatik.v26i1.1622 ◽

2022 ◽

Vol 26 (1) ◽

Author(s):

Irwan Adji Darmawan ◽

Muhammad Fakhri Randy ◽

Imam Yunianto ◽

Muhamad Malik Mutoffar ◽

M Tio Putra Salis

Keyword(s):

Data Mining ◽

Association Rule ◽

Pattern Mining ◽

Frequent Pattern Mining ◽

Frequent Itemset ◽

Frequent Pattern ◽

Data Set ◽

Minimum Support

People with Social Welfare Problems (Penyandang Masalah Kesejahteraan Sosial, PMKS) are one of the many problems found in urban areas, because they can disrupt urban development, public order, security, and stability. So far, the measures taken have focused on handling PMKS and have not yet turned to prevention. Determining the pattern of PMKS groups is one possible approach. The Apriori algorithm helps find patterns in data (frequent pattern mining) by determining frequent itemsets using the Association Rule method in data mining. The manual calculation with the Apriori algorithm yielded combination patterns comprising 3 rules with a minimum support of 15% and a highest confidence of 100%. The RapidMiner application was used to test the Apriori algorithm. RapidMiner is one of several data mining software packages; it can, for example, analyze text and extract patterns from data sets, combining them with statistical, database, and artificial intelligence methods to obtain high-quality information from the processed data. The testing yields a comparison of patterns between PMKS groups. From the RapidMiner tests and the manual Apriori calculation, it is concluded, under the test criteria, that the group patterns (rules) and confidence values (c) from the manual Apriori calculation do not come close to the RapidMiner results, so the test accuracy can be considered low, at only 37.5%.

Penerapan Data Mining Menggunakan Algoritma Apriori untuk Menentukan Pola Penyebab Gelandangan dan Pengemis

Jurnal Teknologi Informasi dan Ilmu Komputer ◽

10.25126/jtiik.2020721376 ◽

2020 ◽

Vol 7 (2) ◽

pp. 229

Author(s):

Wirta Agustin ◽

Yulya Muharmi

Keyword(s):

Data Mining ◽

Association Rule ◽

Urban Areas ◽

Pattern Mining ◽

Frequent Pattern Mining ◽

Frequent Itemset ◽

Frequent Pattern ◽

Data Sets ◽

Apriori Algorithm ◽

Data Set

Homeless people and beggars are one of the problems in urban areas, because they can disrupt public order, security, stability, and urban development. Current efforts still focus on handling existing homeless people and beggars rather than on prevention. One possible approach is to determine the age pattern of homeless people and beggars. The Apriori algorithm is an Association Rule method in data mining for determining frequent itemsets, which helps find patterns in data (frequent pattern mining). The manual calculation with the Apriori algorithm produced combination patterns comprising 3 rules with a minimum support of 30% and a highest confidence of 100%. These patterns serve as a reference for the responsible department in taking preventive action against the growing number of homeless people and beggars. The application of the Apriori algorithm was tested with RapidMiner, a data mining software package whose capabilities include text analysis and extracting patterns from data sets and combining them with statistical methods, artificial intelligence, and databases to obtain high-quality information from the processed data. The test results show a comparison of the age patterns of people at risk of becoming homeless or beggars. Based on the RapidMiner results and the manual Apriori calculation, it is concluded, under the test criteria, that the age patterns (rules) and confidence values (c) from the manual calculation do not come close to the values obtained with RapidMiner, so the test accuracy is low, at 37.5%.
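
The support and confidence figures quoted in this abstract follow the standard association-rule definitions, which the short sketch here makes concrete. The records are invented toy data with placeholder item labels, not the study's dataset, and the function names are chosen for illustration only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Ten toy records with placeholder items A, B, C (invented for this example).
records = [
    {"A", "B"}, {"A", "B"}, {"A", "B", "C"}, {"B", "C"}, {"C"},
    {"A", "B"}, {"C"}, {"B"}, {"C"}, {"B", "C"},
]
sup_ab = support(records, {"A", "B"})         # 4 of 10 records
conf_a_b = confidence(records, {"A"}, {"B"})  # every record with A also has B
conf_b_a = confidence(records, {"B"}, {"A"})  # 4 of the 7 records with B have A
print(sup_ab, conf_a_b, conf_b_a)
```

With a minimum support of 30%, the rule A -> B would pass (support 40%) with 100% confidence, while B -> A has the same support but lower confidence.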

Evaluation of Frequent Itemset Mining Algorithms-Apriori and FP Growth

International Journal of Engineering Technology and Management Sciences ◽

10.46647/ijetms.2020.v04i06.001 ◽

2020 ◽

Vol 4 (6) ◽

pp. 1-4

Author(s):

Jismy Joseph ◽

Kesavaraj G

Keyword(s):

Pattern Mining ◽

Frequent Pattern Mining ◽

Frequent Itemset ◽

Frequent Pattern ◽

Frequent Itemset Mining ◽

Frequent Patterns ◽

Apriori Algorithm ◽

Itemset Mining ◽

Mining Algorithms ◽

Time And Space Complexity

Frequent itemset mining (FIM) is an essential task for retrieving frequently occurring patterns, correlations, events, or associations in a transactional database. Understanding such frequent patterns helps in making substantial decisions in decisive situations. Multiple algorithms have been proposed for finding such patterns; however, the time and space complexity of these algorithms increases rapidly with the number of items in a dataset. It is therefore necessary to analyze the efficiency of these algorithms using different datasets. The aim of this paper is to evaluate the performance of two frequent itemset mining algorithms, Apriori and Frequent Pattern (FP) growth, by comparing their features. The study shows that the FP-growth algorithm is more efficient than the Apriori algorithm for generating rules and mining frequent patterns.
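
For readers unfamiliar with the Apriori side of this comparison, a compact textbook-style sketch follows. It is not the paper's implementation; it only shows the level-wise candidate generation and repeated database scans that FP-growth is designed to avoid. The function name and the toy market-basket data are assumptions made for illustration, and `min_support` is an absolute count.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidate):
        # One pass over the database per candidate: the costly part of Apriori.
        return sum(1 for t in transactions if candidate <= t)

    def frequent_only(candidates):
        counted = {c: count(c) for c in candidates}
        return {c: n for c, n in counted.items() if n >= min_support}

    items = {i for t in transactions for i in t}
    level = frequent_only({frozenset([i]) for i in items})
    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        level = frequent_only(candidates)
    return frequent


# Toy market-basket data, invented for this example.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
freq = apriori(baskets, min_support=3)
for itemset, n in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The number of candidates, and hence the number of scans, grows quickly with the number of items, which is the complexity concern the abstract raises.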

Performance of IF-Postdiffset and R-Eclat Variants in Large Dataset

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i4.1.28241 ◽

2018 ◽

Vol 7 (4.1) ◽

pp. 134

Author(s):

Julaily Aida Jusoh ◽

Mustafa Man ◽

Wan Aezwani Wan Abu Bakar

Keyword(s):

Pattern Mining ◽

Frequent Pattern Mining ◽

Frequent Itemset ◽

Frequent Pattern ◽

Data Discovery ◽

Large Dataset ◽

New Variant ◽

Transaction Database ◽

Pattern Occurrences ◽

Multiple Variants

Pattern mining is a subfield of data mining that uncovers interesting, unexpected, and useful patterns in transaction databases. Such patterns include both frequent and infrequent patterns. An abundant literature is dedicated to frequent pattern mining, and many efficient algorithms exist for frequent itemset mining in transaction databases. Nonetheless, infrequent pattern mining has emerged as an interesting problem: discovering patterns that rarely occur in the transaction database. A growing number of researchers reckon that rare pattern occurrences may offer valuable information in the knowledge data discovery process. R-Eclat is a novel algorithm that determines infrequent patterns in a transaction database. Its multiple variants perform differently when mining infrequent patterns. This paper proposes IF-Postdiffset as a new variant of the R-Eclat algorithm and compares the execution time of infrequent pattern mining across the different R-Eclat variants.
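
The Eclat family works on a vertical layout, where each item maps to the set of transaction IDs (tidset) that contain it. The sketch here shows that layout, tidset intersection, and the diffset identity used by Eclat-style algorithms, applied to item pairs that occur rarely. It is a generic illustration only: it does not reproduce R-Eclat or the IF-Postdiffset variant, and the function names, the `max_support` parameter, and the data are assumptions.

```python
from itertools import combinations


def vertical(transactions):
    """Convert horizontal transactions to item -> set of transaction IDs."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def rare_pairs(transactions, max_support):
    """Item pairs that co-occur at least once but at most `max_support` times."""
    tidsets = vertical(transactions)
    result = {}
    for a, b in combinations(sorted(tidsets), 2):
        common = tidsets[a] & tidsets[b]  # tidset intersection
        if 0 < len(common) <= max_support:
            result[(a, b)] = len(common)
    return result


def pair_support_via_diffset(tidsets, a, b):
    """support(ab) = support(a) - |tidset(a) - tidset(b)| (diffset identity)."""
    diffset = tidsets[a] - tidsets[b]
    return len(tidsets[a]) - len(diffset)


# Toy transactions, invented for this example.
db = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"a", "b"}]
tids = vertical(db)
rare = rare_pairs(db, max_support=2)
print(rare, pair_support_via_diffset(tids, "a", "b"))
```

Diffsets tend to be smaller than tidsets on dense data, which is why variants built on them can differ in execution time.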

Various Research Opportunities in High Utility Itemset Mining

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d7213.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 2455-2461

Keyword(s):

Pattern Mining ◽

Research Work ◽

Frequent Pattern Mining ◽

Frequent Itemset ◽

Frequent Pattern ◽

Frequent Itemset Mining ◽

Research Opportunities ◽

Itemset Mining ◽

Utility Factor ◽

High Utility

Pattern mining is a technique that discovers interesting, hidden, unpredicted, and useful patterns in the data of a database. Most research work in pattern mining has focused on traditional Frequent Itemset Mining (FIM) and Association Rule Mining (ARM) for pattern discovery. Patterns in frequent itemset mining are based on the occurrence frequency of items. Although frequent pattern mining is useful, the assumption that ‘frequent patterns are interesting’ does not hold for numerous applications. High Utility Itemset Mining (HUIM) overcomes this limitation of frequent itemset mining. The aim of HUIM is to find patterns based on a utility function, where utility can be measured in terms of revenue, profit, weight, frequency, interestingness, time spent on a webpage, and so on. Mining patterns with high utility can be seen as a generalization of FIM in which the transaction database is the input, every item has a utility factor representing its importance, and items may have non-binary quantities in the transactions. This paper surveys various recent advances and research opportunities in the field of high utility itemset mining.
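
The difference between frequency and utility can be seen in a few lines. The sketch assumes the common formulation in which each transaction stores item quantities, each item has a unit profit, and the utility of an itemset is the summed quantity times profit over the transactions that contain the whole itemset. The names and numbers are invented for illustration and are not taken from the survey.

```python
def itemset_support(transactions, itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset).issubset(t))


def itemset_utility(transactions, unit_profit, itemset):
    """Sum of quantity * unit profit over transactions containing `itemset`."""
    itemset = set(itemset)
    total = 0
    for t in transactions:  # t maps item -> purchased quantity
        if itemset.issubset(t):
            total += sum(t[i] * unit_profit[i] for i in itemset)
    return total


# Toy data, invented for this example.
unit_profit = {"a": 5, "b": 2, "c": 1}
sales = [
    {"a": 1, "b": 2},
    {"b": 4, "c": 3},
    {"a": 2, "b": 1, "c": 6},
    {"c": 10},
]
u_ab = itemset_utility(sales, unit_profit, {"a", "b"})
u_a = itemset_utility(sales, unit_profit, {"a"})
u_c = itemset_utility(sales, unit_profit, {"c"})
print(u_ab, u_a, u_c, itemset_support(sales, {"c"}), itemset_support(sales, {"a", "b"}))
```

Here `{c}` is the more frequent itemset but `{a, b}` has the higher utility, and `{a, b}` also has higher utility than its subset `{a}`. Utility is therefore not anti-monotone, so the support-based pruning used in FIM cannot be applied to it directly.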
