Search Space Partition-Based Sequential Pattern Mining

Author(s):  
Jean-Marc Adamo
Author(s):  
UNIL YUN ◽  
KEUN HO RYU

Sequential pattern mining with constraints has been developed to improve the efficiency and effectiveness of the mining process. Two constraints are of particular interest. First, some sequences are more important than others; weight constraints account for the importance of sequences and of the items within them. Second, patterns containing only a few items are interesting if they have high support, whereas long patterns can be interesting even when their support is relatively small. Weight constraints and length-decreasing support constraints are two paradigms aimed at finding important sequential patterns and reducing the number of uninteresting ones. Although both constraints are vital, previous approaches make it hard to consider them together. In this paper, we integrate weight and length-decreasing support constraints by pushing both into the prefix projection growth method. For pruning, we define the Weighted Smallest Valid Extension property and apply it in our pruning methods to reduce the search space. In performance tests, we show that our algorithm mines important sequential patterns under length-decreasing support constraints.
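As a rough illustration of how the two constraints described above might be combined, the following Python sketch checks a pattern against a threshold that falls as the pattern gets longer. The names `min_support` and `is_weighted_frequent`, the linear shape of the threshold, and the use of support multiplied by average item weight are all assumptions made for this example; they are not taken from the paper, and the Weighted Smallest Valid Extension pruning is not implemented here.

```python
def min_support(length, high=0.5, low=0.1, max_len=5):
    """Hypothetical length-decreasing threshold (assumption, not the paper's).

    Falls linearly from `high` at length 1 to `low` at `max_len`,
    then stays flat for longer patterns.
    """
    if length >= max_len:
        return low
    step = (high - low) / (max_len - 1)
    return high - step * (length - 1)


def is_weighted_frequent(pattern, support, weights):
    """Compare weighted support with the threshold for this pattern length.

    Weighted support is assumed here to be the relative support times
    the average weight of the items in the pattern.
    """
    avg_weight = sum(weights[item] for item in pattern) / len(pattern)
    return support * avg_weight >= min_support(len(pattern))
```

With this shape, a short pattern needs high weighted support to survive, while a pattern of five or more items only has to reach the lower bound `low`.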


Information ◽  
2020 ◽  
Vol 11 (1) ◽  
pp. 44
Author(s):  
Scott Buffett

A ubiquitous challenge throughout all areas of data mining, particularly in the mining of frequent patterns in large databases, is centered on the necessity to reduce the time and space required to perform the search. The extent of this reduction proportionally facilitates the ability to identify patterns of interest. High utility sequential pattern mining (HUSPM) seeks to identify frequent patterns that are (1) sequential in nature and (2) hold a significant magnitude of utility in a sequence database, by considering the aspect of item value or importance. While traditional sequential pattern mining relies on the downward closure property to significantly reduce the required search space, with HUSPM, this property does not hold. To address this drawback, an approach is proposed that establishes a tight upper bound on the utility of future candidate sequential patterns by maintaining a list of items that are deemed potential candidates for concatenation. Such candidates are provably the only items that are ever needed for any extension of a given sequential pattern or its descendants in the search tree. This list is then exploited to significantly further tighten the upper bound on the utilities of descendent patterns. An extension of this work is then proposed that significantly reduces the computational cost of updating database utilities each time a candidate item is removed from the list, resulting in a massive reduction in the number of candidate sequential patterns that need to be generated in the search. Sequential pattern mining methods implementing these new techniques for bound reduction and further candidate list reduction are demonstrated via the introduction of the CRUSP and CRUSPPivot algorithms, respectively. Validation of the techniques was conducted on six public datasets. Tests show that use of the CRUSP algorithm results in a significant reduction in the overall number of candidate sequential patterns that need to be considered, and subsequently a significant reduction in run time, when compared to the current state of the art in bounding techniques. When employing the CRUSPPivot algorithm, the further reduction in the size of the search space was found to be dramatic, with the reduction in run time found to be dramatic to moderate, depending on the dataset. Demonstrating the practical significance of the work, experiments showed that time required for one particularly complex dataset was reduced from many hours to less than one minute.
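The remark that downward closure does not hold for utility can be seen on a toy database. The Python sketch below is only an illustration under simplifying assumptions: each event holds a single item, a sequence is a list of hypothetical `(item, utility)` pairs, and the utility of a pattern in a sequence is taken as the maximum total utility over its occurrences. It does not implement CRUSP, CRUSPPivot, or any of the upper bounds from the paper.

```python
def sequence_utility(pattern, sequence):
    """Best total utility over all occurrences of `pattern` in `sequence`.

    Returns 0 when the pattern does not occur.
    """
    neg = float("-inf")
    # best[k] = highest utility found so far for the first k pattern items
    best = [0] + [neg] * len(pattern)
    for item, util in sequence:
        # Go from the end so one event cannot match two pattern positions.
        for k in range(len(pattern), 0, -1):
            if pattern[k - 1] == item and best[k - 1] != neg:
                best[k] = max(best[k], best[k - 1] + util)
    return best[-1] if best[-1] != neg else 0


def database_utility(pattern, db):
    """Sum of the per-sequence utilities of `pattern` over the database."""
    return sum(sequence_utility(pattern, seq) for seq in db)


# Toy data, invented for this example.
db = [
    [("a", 1), ("b", 5)],
    [("a", 2), ("b", 4)],
    [("a", 3)],
]
u_a = database_utility(["a"], db)        # 1 + 2 + 3 = 6
u_ab = database_utility(["a", "b"], db)  # 6 + 6 = 12
```

Here the longer pattern `a, b` occurs in fewer sequences than `a` yet has twice its utility, so a utility threshold cannot be used to prune extensions the way a support threshold can.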


2016 ◽  
Vol 10 (1) ◽  
pp. 23
Author(s):  
Edith Belise Kenmogne

Sequential pattern mining is an efficient technique for discovering recurring structures or patterns in very large datasets. It is widely addressed by the data mining community and has a very large field of applications, such as cross-marketing, DNA analysis, web log analysis, user behavior, sensor data, etc. Sequential pattern mining aims at extracting a set of attributes, shared across time among a large number of objects in a given database. Previous studies have developed two major classes of sequential pattern mining methods, namely, the candidate generation-and-test approach based on either horizontal or vertical data formats, represented respectively by GSP and SPADE, and the pattern-growth approach represented by FreeSpan and PrefixSpan. In this paper, we study the impact of the pattern-growth ordering on the performance of pattern growth-based sequential pattern mining algorithms. To this end, we introduce a class of pattern-growth orderings, called linear orderings, in which patterns are grown by extending either the current pattern prefix or the current pattern suffix from the same position at each growth step. We study the problem of pruning and partitioning the search space following linear orderings. Experiments show that the order in which patterns grow has a significant influence on performance.
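To make the pattern-growth idea concrete, here is a minimal Python sketch of prefix-projection growth in the style of PrefixSpan, where the pattern is always extended at its right end. It assumes each event is a single item and that `min_support` is an absolute count; the function name `prefix_span` and the toy database are made up for this example, and the linear orderings studied in the paper are not implemented.

```python
from collections import Counter


def prefix_span(db, min_support, prefix=()):
    """Yield (pattern, support) pairs for frequent sequential patterns.

    Each recursive call partitions the search space by the next item
    appended to `prefix` and works on the projected database only.
    """
    counts = Counter()
    for seq in db:
        counts.update(set(seq))  # count each item once per sequence
    for item, support in sorted(counts.items()):
        if support < min_support:
            continue
        pattern = prefix + (item,)
        yield pattern, support
        # Keep only what follows the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        yield from prefix_span(projected, min_support, pattern)


db = [list("abc"), list("ac"), list("bc")]
patterns = dict(prefix_span(db, min_support=2))
# patterns == {("a",): 2, ("a", "c"): 2, ("b",): 2, ("b", "c"): 2, ("c",): 3}
```

Growing the suffix instead of the prefix would project on what precedes the last occurrence of an item rather than on what follows the first one.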


2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Chunkai Zhang ◽  
Zilin Du ◽  
Yiwen Zu

High-utility sequential pattern mining (HUSPM) is an emerging topic in data mining, where utility is used to measure the importance or weight of a sequence. However, HUSPM ignores the informative knowledge carried by the hierarchical relation between items, which prevents it from extracting more interesting patterns. In this paper, we incorporate the hierarchical relation of items into HUSPM and propose a two-phase algorithm, MHUH, the first algorithm for high-utility hierarchical sequential pattern mining (HUHSPM). In the first phase, named Extension, we use our earlier algorithm FHUSpan to efficiently mine the general high-utility sequences (g-sequences); in the second phase, named Replacement, we mine the special high-utility sequences with the hierarchical relation (s-sequences) from the g-sequences as high-utility hierarchical sequential patterns. To further improve efficiency, MHUH adopts several strategies, such as Reduction, FGS, and PBS, and a novel upper bound, TSWU, which greatly reduce the search space. Extensive experiments were conducted on both real and synthetic datasets to assess the performance of MHUH in terms of runtime, number of patterns, and scalability. The experiments show that MHUH efficiently extracts more interesting patterns with underlying informative knowledge in HUHSPM.


Sequential pattern mining is a data mining approach that aims to discover common interesting patterns in sequence datasets. It has attracted significant research interest because of its real-world applications in fields such as web click stream mining, retail business, the stock market, and bioinformatics. Each sequence in a sequence dataset is composed of time-ordered events, and each event is an itemset. Sequential pattern mining discovers all frequent subsequences whose frequency is greater than a given minimum support threshold. Discovering sequential patterns is expensive in both mining time and memory, because the search space grows aggressively: the number of frequent subsequences explodes with the sequence length, the number of distinct items, and the volume of the sequence dataset. Research in this domain therefore aims at developing effective data structures that address frequency counting and the large search space, as well as scalable algorithms that reduce execution time and memory use. We propose two efficient data structures: the Pre-order Post-order Coded Aggregate Tree (PPCA-Tree) for compact representation of the sequence dataset, and the Root-node List of First-Occurrence Sub Trees Map (RLFOST-Map) for efficient representation of projected databases. We also developed an efficient Partially ordered Sequential PAttern Mining algorithm called PSPAM, and a parallel implementation of it called PAPSPAM, both based on the PPCA-Tree and using the RLFOST-Map, which eliminates reconstruction of the projected databases. Experimental analysis on various synthetic datasets shows that PSPAM and PAPSPAM outperform PrefixSpan and other conventional and state-of-the-art algorithms on dense datasets, with better scalability.
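The data model described above, sequences of time-ordered events where each event is an itemset, can be sketched in a few lines of Python. The helper names `is_subsequence` and `support` and the toy database are assumptions for this example; the sketch shows plain support counting only and has nothing to do with the PPCA-Tree or RLFOST-Map structures.

```python
def is_subsequence(pattern, sequence):
    """True if every itemset of `pattern` is contained in its own event
    of `sequence`, with the events appearing in the same order."""
    events = iter(sequence)
    # Each any() resumes where the previous one stopped, so order is kept
    # and no event is matched twice.
    return all(any(p <= e for e in events) for p in pattern)


def support(pattern, db):
    """Number of sequences in `db` that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in db)


# Toy database, invented for this example.
db = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "b"}, {"c"}],
    [{"b"}, {"a"}],
]
s1 = support([{"a"}, {"c"}], db)  # found in the first two sequences: 2
s2 = support([{"a", "b"}], db)    # only the second sequence: 1
```

A pattern is then reported as frequent when its support meets the chosen minimum support threshold.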

