RHUPS

Databases that deal with the real world have various characteristics. New data is continuously inserted over time without limiting the length of the database, and a variety of information about the items constituting the database is contained. Recently generated data has a greater influence than the previously generated data. These are called the time-sensitive non-binary stream databases, and they include databases such as web-server click data, market sales data, data from sensor networks, and network traffic measurement. Many high utility pattern mining and stream pattern mining methods have been proposed so far. However, they have a limitation that they are not suitable to analyze these databases, because they find valid patterns by analyzing a database with only some of the features described above. Therefore, knowledge-based software about how to find meaningful information efficiently by analyzing databases with these characteristics is required. In this article, we propose an intelligent information system that calculates the influence of the insertion time of each batch in a large-scale stream database by applying the sliding window model and mines recent high utility patterns without generating candidate patterns. In addition, a novel list-based data structure is suggested for a fast and efficient management of the time-sensitive stream databases. Moreover, our technique is compared with state-of-the-art algorithms through various experiments using real datasets and synthetic datasets. The experimental results show that our approach outperforms the previously proposed methods in terms of runtime, memory usage, and scalability.

Download Full-text

Efficient Utility Tree-Based Algorithm to Mine High Utility Patterns Having Strong Correlation

Complexity ◽

10.1155/2021/7310137 ◽

2021 ◽

Vol 2021 ◽

pp. 1-18

Author(s):

Rashad Saeed ◽

Azhar Rauf ◽

Fahmi H. Quradaa ◽

Syed Muhammad Asim

Keyword(s):

Strong Correlation ◽

Pattern Mining ◽

Computational Cost ◽

Search Space ◽

Sufficient Information ◽

Memory Consumption ◽

Cost Cutting ◽

E Learning ◽

High Utility ◽

High Utility Patterns

High Utility Itemset Mining (HUIM) is one of the most investigated tasks of data mining. It has broad applications in domains such as product recommendation, market basket analysis, e-learning, text mining, bioinformatics, and web click stream analysis. Insights from such pattern analysis provide numerous benefits, including cost cutting, improved competitive advantage, and increased revenue. However, HUIM methods may discover misleading patterns as they do not evaluate the correlation of extracted patterns. As a consequence, a number of algorithms have been proposed to mine correlated HUIs. These algorithms still suffer from the issue of the computational cost in terms of both time and memory consumption. This paper presents an algorithm, named Efficient Correlated High Utility Pattern Mining (ECoHUPM), to efficiently mine the high utility patterns having strong correlation items. A new data structure based on utility tree (UTtree) named CoUTlist is proposed to store sufficient information for mining the desired patterns. Three pruning properties are introduced to reduce the search space and improve the mining performance. Experiments on sparse, very sparse, dense, and very dense datasets indicate that the proposed ECoHUPM algorithm is efficient as compared to the state-of-the-art CoHUIM and CoHUI-Miner algorithms in terms of both time and memory consumption.

Download Full-text

Scalable Mining of High-Utility Sequential Patterns With Three-Tier MapReduce Model

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3487046 ◽

2022 ◽

Vol 16 (3) ◽

pp. 1-26

Author(s):

Jerry Chun-Wei Lin ◽

Youcef Djenouri ◽

Gautam Srivastava ◽

Yuanfa Li ◽

Philip S. Yu

Keyword(s):

Large Scale ◽

Pattern Mining ◽

Sequential Pattern Mining ◽

Main Memory ◽

Frequent Itemset ◽

Sequential Pattern ◽

Sequential Patterns ◽

Speed Up ◽

Mapreduce Model ◽

High Utility

High-utility sequential pattern mining (HUSPM) is a hot research topic in recent decades since it combines both sequential and utility properties to reveal more information and knowledge rather than the traditional frequent itemset mining or sequential pattern mining. Several works of HUSPM have been presented but most of them are based on main memory to speed up mining performance. However, this assumption is not realistic and not suitable in large-scale environments since in real industry, the size of the collected data is very huge and it is impossible to fit the data into the main memory of a single machine. In this article, we first develop a parallel and distributed three-stage MapReduce model for mining high-utility sequential patterns based on large-scale databases. Two properties are then developed to hold the correctness and completeness of the discovered patterns in the developed framework. In addition, two data structures called sidset and utility-linked list are utilized in the developed framework to accelerate the computation for mining the required patterns. From the results, we can observe that the designed model has good performance in large-scale datasets in terms of runtime, memory, efficiency of the number of distributed nodes, and scalability compared to the serial HUSP-Span approach.

Download Full-text

Mining large‐scale high utility patterns in vehicular ad hoc network environments

Transactions on Emerging Telecommunications Technologies ◽

10.1002/ett.4168 ◽

2020 ◽

Author(s):

Jimmy Ming‐Tai Wu ◽

Min Wei ◽

Gautam Srivastava ◽

Chien‐Ming Chen ◽

Jerry Chun‐Wei Lin

Keyword(s):

Ad Hoc Network ◽

Large Scale ◽

Ad Hoc ◽

Vehicular Ad Hoc Network ◽

High Utility ◽

High Utility Patterns ◽

Network Environments

Download Full-text

A Survey of incremental high-utility pattern mining based on storage structure

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202745 ◽

2021 ◽

pp. 1-26

Author(s):

Haodong Cheng ◽

Meng Han ◽

Ni Zhang ◽

Xiaojuan Li ◽

Le Wang

Keyword(s):

Pattern Mining ◽

Business Decisions ◽

Practical Applications ◽

Itemset Mining ◽

High Utility ◽

High Utility Itemsets ◽

High Utility Patterns ◽

Mining Algorithms ◽

Purchase Quantity ◽

Storage Structures

Traditional association rule mining has been widely studied, but this is not applicable to practical applications that must consider factors such as the unit profit of the item and the purchase quantity. High-utility itemset mining (HUIM) aims to find high-utility patterns by considering the number of items purchased and the unit profit. However, most high-utility itemset mining algorithms are designed for static databases. In real-world applications (such as market analysis and business decisions), databases are usually updated by inserting new data dynamically. Some researchers have proposed algorithms for finding high-utility itemsets in dynamically updated databases. Different from the batch processing algorithms that always process the databases from scratch, the incremental HUIM algorithms update and output high-utility itemsets in an incremental manner, thereby reducing the cost of finding high-utility itemsets. This paper provides the latest research on incremental high-utility itemset mining algorithms, including methods of storing itemsets and utilities based on tree, list, array and hash set storage structures. It also points out several important derivative algorithms and research challenges for incremental high-utility itemset mining.

Download Full-text

An Efficient Algorithm for Extracting High-Utility Hierarchical Sequential Patterns

Wireless Communications and Mobile Computing ◽

10.1155/2020/8816228 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Chunkai Zhang ◽

Zilin Du ◽

Yiwen Zu

Keyword(s):

Pattern Mining ◽

Search Space ◽

Sequential Pattern Mining ◽

Sequential Pattern ◽

Sequential Patterns ◽

Second Phase ◽

Two Phase ◽

High Utility ◽

Synthetic Datasets ◽

Hierarchical Relation

High-utility sequential pattern mining (HUSPM) is an emerging topic in data mining, where utility is used to measure the importance or weight of a sequence. However, the underlying informative knowledge of hierarchical relation between different items is ignored in HUSPM, which makes HUSPM unable to extract more interesting patterns. In this paper, we incorporate the hierarchical relation of items into HUSPM and propose a two-phase algorithm MHUH, the first algorithm for high-utility hierarchical sequential pattern mining (HUHSPM). In the first phase named Extension, we use the existing algorithm FHUSpan which we proposed earlier to efficiently mine the general high-utility sequences (g-sequences); in the second phase named Replacement, we mine the special high-utility sequences with the hierarchical relation (s-sequences) as high-utility hierarchical sequential patterns from g-sequences. For further improvements of efficiency, MHUH takes several strategies such as Reduction, FGS, and PBS and a novel upper bounder TSWU, which will be able to greatly reduce the search space. Substantial experiments were conducted on both real and synthetic datasets to assess the performance of the two-phase algorithm MHUH in terms of runtime, number of patterns, and scalability. Conclusion can be drawn from the experiment that MHUH extracts more interesting patterns with underlying informative knowledge efficiently in HUHSPM.

Download Full-text

Improved Strategy for High-Utility Pattern Mining Algorithm

Mathematical Problems in Engineering ◽

10.1155/2020/1971805 ◽

2020 ◽

Vol 2020 ◽

pp. 1-11

Author(s):

Le Wang ◽

Shui Wang ◽

Haiyan Li ◽

Chunliang Zhou

Keyword(s):

Pattern Mining ◽

State Of The Art ◽

Search Space ◽

Research Topics ◽

Main Research ◽

Mining Algorithm ◽

Temporal Efficiency ◽

High Utility ◽

High Utility Patterns ◽

Mining Algorithms

High-utility pattern mining is a research hotspot in the field of pattern mining, and one of its main research topics is how to improve the efficiency of the mining algorithm. Based on the study on the state-of-the-art high-utility pattern mining algorithms, this paper proposes an improved strategy that removes noncandidate items from the global header table and local header table as early as possible, thus reducing search space and improving efficiency of the algorithm. The proposed strategy is applied to the algorithm EFIM (EFficient high-utility Itemset Mining). Experimental verification was carried out on nine typical datasets (including two large datasets); results show that our strategy can effectively improve temporal efficiency for mining high-utility patterns.

Download Full-text

Efficiently Enumerating Substrings with Statistically Significant Frequencies of Locally Optimal Occurrences in Gigantic String

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5969 ◽

2020 ◽

Vol 34 (04) ◽

pp. 5240-5247

Author(s):

Atsuyoshi Nakamura ◽

Ichigaku Takigawa ◽

Hiroshi Mamitsuka

Keyword(s):

Large Scale ◽

Tandem Repeats ◽

Pattern Mining ◽

Information Source ◽

Statistical Significance ◽

Generation Model ◽

Genome Sequences ◽

Scale Experiment ◽

Synthetic Datasets ◽

Single Sequence

We propose new frequent substring pattern mining which can enumerate all substrings with statistically significant frequencies of their locally optimal occurrences from a given single sequence. Our target application is genome sequences, around a half being said to be covered by interspersed and consecutive (tandem) repeats, and detecting these repeats is an important task in molecular life sciences. We evaluate the statistical significance of frequent substrings by using a string generation model with a memoryless stationary information source. We combine this idea with an existing algorithm, ESFLOO-0G.C (Nakamura et al. 2016), to enumerate all statistically significant substrings with locally optimal occurrences. We further develop a parallelized version of our algorithm. Experimental results using synthetic datasets showed the proposed algorithm achieved far higher F-measure in extracting substrings (with various lengths and frequencies) embedded in a randomly generated string with noise, than conventional algorithms. The large-scale experiment using the whole human genome sequence with 3,095,677,412 bases (letters) showed that our parallel algorithm covers 75% of the whole positions analyzed, around 4% and 24% higher than the recent report and the current cutting-edge knowledge, implying a biologically unique finding.

Download Full-text

On-Shelf Utility Mining of Sequence Data

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3457570 ◽

2021 ◽

Vol 16 (2) ◽

pp. 1-31

Author(s):

Chunkai Zhang ◽

Zilin Du ◽

Yuting Yang ◽

Wensheng Gan ◽

Philip S. Yu

Keyword(s):

High Efficiency ◽

Sequence Data ◽

Real Life ◽

Search Space ◽

Upper Bounds ◽

Utility Mining ◽

Limited Memory ◽

Time Periods ◽

High Utility ◽

Synthetic Datasets

Utility mining has emerged as an important and interesting topic owing to its wide application and considerable popularity. However, conventional utility mining methods have a bias toward items that have longer on-shelf time as they have a greater chance to generate a high utility. To eliminate the bias, the problem of on-shelf utility mining (OSUM) is introduced. In this article, we focus on the task of OSUM of sequence data, where the sequential database is divided into several partitions according to time periods and items are associated with utilities and several on-shelf time periods. To address the problem, we propose two methods, OSUM of sequence data (OSUMS) and OSUMS + , to extract on-shelf high-utility sequential patterns. For further efficiency, we also design several strategies to reduce the search space and avoid redundant calculation with two upper bounds time prefix extension utility ( TPEU ) and time reduced sequence utility ( TRSU ). In addition, two novel data structures are developed for facilitating the calculation of upper bounds and utilities. Substantial experimental results on certain real and synthetic datasets show that the two methods outperform the state-of-the-art algorithm. In conclusion, OSUMS may consume a large amount of memory and is unsuitable for cases with limited memory, while OSUMS + has wider real-life applications owing to its high efficiency.

Download Full-text