Mining of Sequential Patterns using Directed Graphs

Sequential pattern mining is one of the important tasks in data mining. It analyzes a sequence database to discover sequential patterns, that is, interesting subsequences in a set of sequences. Factors such as frequency of occurrence, length, and profit are used to define the interestingness of a subsequence derived from the sequence database. Sequential pattern mining has many real-life applications because data is naturally encoded as sequences of symbols in many fields, such as bioinformatics, e-learning, market basket analysis, text analysis, and webpage click-stream analysis. Many efficient algorithms, such as PrefixSpan, GSP, and FreeSpan, have been proposed in the past few years. In this paper we propose a data model for organizing the sequential database, which consists of a directed graph DGS (cycles and multiple edges are allowed) and an organization of directed paths in DGS that represents the sequential data, for discovering sequential patterns from a sequence database. Efficient algorithms are proposed for constructing the digraph model (DGS), extracting all sequential patterns, and mining association rules. A number of theoretical parameters of the digraph model are also introduced, which lead to a better understanding of the problem.

MINING TOP-K FREQUENT SEQUENTIAL PATTERN IN ITEM INTERVAL EXTENDED SEQUENCE DATABASE

Journal of Computer Science and Cybernetics ◽ 10.15625/1813-9663/34/3/13053 ◽ 2018 ◽ Vol 34 (3) ◽ pp. 249-263

Author(s): Duong Huy Tran ◽ Thang Truong Nguyen ◽ Thi Duc Vu ◽ Anh The Tran

Keyword(s): Pattern Mining ◽ Real Life ◽ Sequential Pattern Mining ◽ Sequential Pattern ◽ Sequential Patterns ◽ Sequence Database ◽ Extended Sequence ◽ Support Threshold ◽ Interesting Task ◽ Frequent Sequential Pattern

Abstract. Frequent sequential pattern mining in item interval extended sequence databases (iSDB) has been an interesting task in recent years. Unlike classic frequent sequential pattern mining, pattern mining in an iSDB also considers the item interval between successive items; thus, it may extract more meaningful sequential patterns in real life. Most previous algorithms for frequent sequential pattern mining in iSDBs need a minimum support threshold (minsup) to perform the mining. However, it is not easy for users to provide an appropriate threshold in practice. A minsup value that is too high leads to missing valuable patterns, while one that is too low may generate too many useless patterns. To address this problem, we propose an algorithm, TopKWFP (Top-k weighted frequent sequential pattern mining in item interval extended sequence database). Our algorithm does not require a fixed minsup value; instead, the minsup value is raised dynamically during the mining process.
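
The idea of a threshold that rises during mining can be illustrated with a small sketch. This is not TopKWFP itself: it ignores item intervals and weights, treats each sequence as a plain list of single items, and uses a simple prefix-growth search. The function name `top_k_sequential_patterns` and its parameters are hypothetical. A min-heap keeps the k best patterns found so far; once it is full, the support of its weakest entry becomes the internal minsup, and any extension below that value is pruned (safe because an extension never has higher support than its prefix).

```python
import heapq
from collections import defaultdict


def top_k_sequential_patterns(sequences, k):
    """Return the k most frequent sequential patterns as (support, pattern) pairs.

    Illustrative only: plain item sequences, no intervals or weights.
    Ties at the k-th support value are broken arbitrarily.
    """
    heap = []    # min-heap of (support, pattern); heap[0] is the weakest kept pattern
    minsup = 1   # internal threshold, raised as better patterns are found

    def grow(prefix, projected):
        nonlocal minsup
        # For each item, record where the remainder of each sequence starts
        # after the first occurrence of that item (one entry per sequence).
        extensions = defaultdict(list)
        for seq_id, start in projected:
            seen = set()
            seq = sequences[seq_id]
            for pos in range(start, len(seq)):
                if seq[pos] not in seen:
                    seen.add(seq[pos])
                    extensions[seq[pos]].append((seq_id, pos + 1))
        # Visit the most frequent extensions first so the threshold rises quickly.
        ordered = sorted(extensions.items(), key=lambda kv: (-len(kv[1]), kv[0]))
        for item, occurrences in ordered:
            support = len(occurrences)
            if support < minsup:
                break  # every remaining extension is at most this frequent
            pattern = prefix + (item,)
            heapq.heappush(heap, (support, pattern))
            if len(heap) > k:
                heapq.heappop(heap)
            if len(heap) == k:
                minsup = max(minsup, heap[0][0])  # raise the threshold
            grow(pattern, occurrences)

    grow((), [(i, 0) for i in range(len(sequences))])
    return sorted(heap, key=lambda sp: (-sp[0], sp[1]))


database = [list("abc"), list("abd"), list("acb"), list("ab")]
result = top_k_sequential_patterns(database, 3)
```

The caller supplies only k; no support threshold has to be guessed in advance.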

HIGH UTILITY ITEM INTERVAL SEQUENTIAL PATTERN MINING ALGORITHM

Journal of Computer Science and Cybernetics ◽ 10.15625/1813-9663/1/1/14398 ◽ 2020 ◽ Vol 36 (1) ◽ pp. 1-15

Author(s): Tran Huy Duong ◽ Nguyen Truong Thang ◽ Vu Duc Thi ◽ Tran The Anh

Keyword(s): Data Mining ◽ Pattern Mining ◽ Sequential Pattern Mining ◽ Sequential Pattern ◽ Sequential Patterns ◽ Sequence Database ◽ Mining Algorithm ◽ Pattern Growth ◽ High Utility ◽ Growth Approach

High utility sequential pattern mining is a popular topic in data mining whose main purpose is to extract sequential patterns with high utility from a sequence database. Many recent works have proposed methods to solve this problem. However, most of them do not consider the item intervals of sequential patterns, which can lead to the extraction of sequential patterns whose item intervals are too long to be meaningful. In this paper, we propose a High Utility Item Interval Sequential Pattern (HUISP) algorithm to solve this problem. Our algorithm uses a pattern-growth approach and some techniques to increase the algorithm's performance.
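
To make the item-interval idea concrete, here is a minimal sketch of an interval check only. It does not compute utilities and is not the HUISP algorithm. It assumes a sequence is a list of `(timestamp, item)` pairs sorted by timestamp and that a single `max_interval` bounds the gap between successive pattern items; the function name and parameters are hypothetical. The check tracks every timestamp at which the matched prefix can end, because always taking the earliest match can miss a valid occurrence.

```python
def occurs_with_max_interval(sequence, pattern, max_interval):
    """Check whether `pattern` occurs in `sequence` with bounded item intervals.

    sequence: list of (timestamp, item) pairs sorted by timestamp.
    pattern: list of items that must appear in this order.
    max_interval: largest allowed gap between successive matched items.
    """
    ends = None  # timestamps at which the prefix matched so far can end
    for wanted in pattern:
        new_ends = []
        for timestamp, item in sequence:
            if item != wanted:
                continue
            # The first pattern item may match anywhere; later items must
            # follow some earlier end strictly later and within the bound.
            if ends is None or any(0 < timestamp - e <= max_interval for e in ends):
                new_ends.append(timestamp)
        if not new_ends:
            return False
        ends = new_ends
    return True


events = [(1, "a"), (2, "b"), (10, "a"), (12, "b"), (13, "c")]
within_three = occurs_with_max_interval(events, ["a", "b", "c"], 3)
within_one = occurs_with_max_interval(events, ["a", "b", "c"], 1)
```

In this example the pattern a, b, c is found only through the later occurrence of a at time 10; matching the first a and b would leave a gap of 11 before c.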

High-throughput phenotyping with temporal sequences

Journal of the American Medical Informatics Association ◽ 10.1093/jamia/ocaa288 ◽ 2020

Author(s): Hossein Estiri ◽ Zachary H Strasser ◽ Shawn N Murphy

Keyword(s): High Throughput ◽ Pattern Mining ◽ Sequential Pattern Mining ◽ Classification Performance ◽ Sequential Pattern ◽ Sequential Patterns ◽ Sequential Data ◽ High Throughput Phenotyping ◽ Using Data ◽ Temporal Sequences

Objective: High-throughput electronic phenotyping algorithms can accelerate translational research using data from electronic health record (EHR) systems. The temporal information buried in EHRs is often underutilized in developing computational phenotypic definitions. This study aims to develop a high-throughput phenotyping method, leveraging temporal sequential patterns from EHRs.

Materials and Methods: We develop a representation mining algorithm to extract 5 classes of representations from EHR diagnosis and medication records: the aggregated vector of the records (aggregated vector representation), the standard sequential patterns (sequential pattern mining), the transitive sequential patterns (transitive sequential pattern mining), and 2 hybrid classes. Using EHR data on 10 phenotypes from the Mass General Brigham Biobank, we train and validate phenotyping algorithms.

Results: Phenotyping with temporal sequences resulted in a superior classification performance across all 10 phenotypes compared with the standard representations in electronic phenotyping. The high-throughput algorithm’s classification performance was superior or similar to the performance of previously published electronic phenotyping algorithms. We characterize and evaluate the top transitive sequences of diagnosis records paired with the records of risk factors, symptoms, complications, medications, or vaccinations.

Discussion: The proposed high-throughput phenotyping approach enables seamless discovery of sequential record combinations that may be difficult to assume from raw EHR data. Transitive sequences offer more accurate characterization of the phenotype, compared with its individual components, and reflect the actual lived experiences of the patients with that particular disease.

Conclusion: Sequential data representations provide a precise mechanism for incorporating raw EHR records into downstream machine learning. Our approach starts with user interpretability and works backward to the technology.

Applications of Pattern Discovery Using Sequential Data Mining

Pattern Discovery Using Sequence Data Mining ◽ 10.4018/978-1-61350-056-9.ch001 ◽ 2012 ◽ pp. 1-23 ◽ Cited By ~ 8

Author(s): Manish Gupta ◽ Jiawei Han

Keyword(s): Data Mining ◽ Text Mining ◽ Intrusion Detection ◽ Pattern Mining ◽ Pattern Discovery ◽ Sequential Pattern Mining ◽ Web Usage Mining ◽ Sequential Pattern ◽ Sequential Data ◽ Mining Methods

Sequential pattern mining methods have been found to be applicable in a large number of domains. Sequential data is omnipresent. Sequential pattern mining methods have been used to analyze this data and identify patterns. Such patterns have been used to implement efficient systems that can recommend based on previously observed patterns, help in making predictions, improve usability of systems, detect events, and in general help in making strategic product decisions. In this chapter, we discuss the applications of sequential data mining in a variety of domains like healthcare, education, Web usage mining, text mining, bioinformatics, telecommunications, intrusion detection, et cetera. We conclude with a summary of the work.

Sequential Pattern Mining Algorithm Based on Text Data: Taking the Fault Text Records as an Example

Sustainability ◽ 10.3390/su10114330 ◽ 2018 ◽ Vol 10 (11) ◽ pp. 4330 ◽ Cited By ~ 2

Author(s): Xinglong Yuan ◽ Wenbing Chang ◽ Shenghan Zhou ◽ Yang Cheng

Keyword(s): Time Series ◽ Pattern Mining ◽ Sequential Pattern Mining ◽ Sequential Pattern ◽ Fault Classification ◽ Sequential Patterns ◽ Series Data ◽ Similarity Measurement ◽ Text Similarity ◽ Text Data

Sequential pattern mining (SPM) is an effective and important method for analyzing time series. This paper proposes an SPM algorithm to mine fault sequential patterns in text data. Because text data is poorly structured and the same concept can be expressed in many different forms, traditional SPM algorithms cannot be applied directly to text data. The proposed algorithm is designed to solve this problem. First, this study measures the similarity of fault text data and groups similar faults into one class, using a new text similarity measurement model based on the word embedding distance. Compared with the classic text similarity measurement method, this model achieves good results in short text classification. Then, on the basis of the fault classification, this paper proposes an SPM algorithm with an event window, which is a soft time constraint for obtaining a certain number of sequential patterns according to needs. Finally, this study uses the fault text records of a certain aircraft as experimental data for mining fault sequential patterns. The experiments show that the algorithm can effectively mine sequential patterns in text data. The proposed algorithm can be widely applied to text time series data in many fields, such as industry, business, and finance.

A heuristic to predict the optimal pattern-growth direction for the pattern growth-based sequential pattern mining approach

Journal of Advanced Computer Science & Technology ◽ 10.14419/jacst.v6i2.7011 ◽ 2017 ◽ Vol 6 (2) ◽ pp. 20

Author(s): Kenmogne Edith Belise ◽ Nkambou Roger ◽ Tadmon Calvin ◽ Engelbert Mephu Nguifo

Keyword(s): Pattern Mining ◽ Real Life ◽ Growth Direction ◽ Sequential Pattern Mining ◽ Sequential Pattern ◽ Large Field ◽ Data Formats ◽ Pattern Growth ◽ Very Large Datasets ◽ Synthetic Datasets

Sequential pattern mining is an efficient technique for discovering recurring structures or patterns in very large datasets, with a very wide field of applications. It aims at extracting a set of attributes shared across time among a large number of objects in a given database. Previous studies have developed two major classes of sequential pattern mining methods: the candidate generation-and-test approach, based on either horizontal or vertical data formats and represented respectively by GSP and SPADE, and the pattern-growth approach, represented by FreeSpan, PrefixSpan, and their further extensions. The performance of these algorithms depends on how patterns grow. Because of this, we introduce a heuristic to predict the optimal pattern-growth direction, i.e., the pattern-growth direction leading to the best performance in terms of runtime and memory usage. Then, we perform a number of experiments on both real-life and synthetic datasets to test the heuristic. The performance analysis of these experiments shows that the heuristic prediction is reliable in general.

Detecting Implicit Security Exceptions Using an Improved Variable-Length Sequential Pattern Mining Method

International Journal of Software Engineering and Knowledge Engineering ◽ 10.1142/s0218194017500462 ◽ 2017 ◽ Vol 27 (08) ◽ pp. 1235-1268

Author(s): Jinfu Chen ◽ Saihua Cai ◽ Dave Towey ◽ Lili Zhu ◽ Rubing Huang ◽ ...

Keyword(s): Visual Inspection ◽ Pattern Mining ◽ Sequential Pattern Mining ◽ Variable Length ◽ Sequential Pattern ◽ Sequential Patterns ◽ Mining Method ◽ Security Testing ◽ String Searching ◽ Correct Execution

The process of component security testing can produce massive amounts of monitor logs. Current approaches to detect implicit security exceptions (those which cannot be identified by visual inspection alone) compare correct execution sequences with fixed patterns mined from the execution of sequential patterns in the monitor logs. However, this is not efficient and is not suitable for mining large monitor logs. To enable effective mining of implicit security exceptions from large monitor logs, this paper proposes a method based on improved variable-length sequential pattern mining. The proposed method first mines the variable-length sequential patterns from correct execution sequences and from actual execution sequences, thus reducing the number of patterns. The sequential patterns are then detected using the Sunday string-searching algorithm. We conducted an experimental study based on this method, the results of which show that the proposed method can efficiently detect the implicit security exceptions of components.
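
For readers unfamiliar with the Sunday string-searching algorithm mentioned above, the following is a generic sketch of that algorithm, not the authors' detection pipeline. The function name `sunday_search` is hypothetical. After each alignment, the algorithm looks at the symbol just past the current window and shifts the pattern so that its rightmost occurrence of that symbol lines up with it, or skips past the symbol entirely if the pattern does not contain it.

```python
def sunday_search(text, pattern):
    """Return the start indices of all occurrences of `pattern` in `text`.

    Works on strings, or on tuples of hashable event symbols, as long as
    `text` and `pattern` have the same type.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # For each symbol, the distance from its rightmost position in the
    # pattern to one past the pattern's end.
    shift = {symbol: m - i for i, symbol in enumerate(pattern)}
    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        following = pos + m  # index of the symbol just past the window
        if following >= n:
            break
        # Symbols absent from the pattern allow a jump of m + 1.
        pos += shift.get(text[following], m + 1)
    return matches


hits = sunday_search("abcabcab", "cab")
event_hits = sunday_search(("open", "read", "close", "open", "read"), ("open", "read"))
```

Because the shift depends only on one symbol per alignment, the preprocessing is a single pass over the pattern.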

Sequential Pattern Mining from Sequential Data

Handbook of Research on Innovations in Database Technologies and Applications ◽ 10.4018/978-1-60566-242-8.ch067 ◽ 2009 ◽ pp. 622-631

Author(s): Shigeaki Sakurai

Keyword(s): Pattern Mining ◽ Pattern Discovery ◽ Sequential Pattern ◽ The Other ◽ Sequential Patterns ◽ Sequential Data ◽ Frequent Patterns ◽ New Knowledge ◽ Discovery Method ◽ Time Information

Owing to the progress of computer and network environments, it is easy to collect data with time information such as daily business reports, weblog data, and physiological information. This is the context in which methods of analyzing data with time information have been studied. This chapter focuses on a sequential pattern discovery method from discrete sequential data. The methods proposed by Pei et al. (2001), Srikant & Agrawal (1996), and Zaki (2001) efficiently discover the frequent patterns as characteristic patterns. However, the discovered patterns do not always correspond to the interests of analysts, because the patterns are common and are not a source of new knowledge for the analysts. The problem has been pointed out in connection with the discovery of associative rules. Blanchard et al. (2005), Brin et al. (1997), Silberschatz et al. (1996), and Suzuki et al. (2005) propose other criteria in order to discover other kinds of characteristic patterns. The patterns discovered by the criteria are not always frequent but are characteristic of viewpoints. The criteria may be applicable to discovery methods of sequential patterns. However, these criteria do not satisfy the Apriori property. It is difficult for the methods based on the criteria to efficiently discover the patterns. On the other hand, methods that use the background knowledge of analysts have been proposed in order to discover sequential patterns corresponding to the interests of analysts (Garofalakis et al., 1999; Pei et al., 2002; Sakurai et al., 2008b; Yen, 2005).

Scalable Mining of High-Utility Sequential Patterns With Three-Tier MapReduce Model

ACM Transactions on Knowledge Discovery from Data ◽ 10.1145/3487046 ◽ 2022 ◽ Vol 16 (3) ◽ pp. 1-26

Author(s): Jerry Chun-Wei Lin ◽ Youcef Djenouri ◽ Gautam Srivastava ◽ Yuanfa Li ◽ Philip S. Yu

Keyword(s): Large Scale ◽ Pattern Mining ◽ Sequential Pattern Mining ◽ Main Memory ◽ Frequent Itemset ◽ Sequential Pattern ◽ Sequential Patterns ◽ Speed Up ◽ Mapreduce Model ◽ High Utility

High-utility sequential pattern mining (HUSPM) has been a hot research topic in recent decades because it combines both sequential and utility properties to reveal more information and knowledge than traditional frequent itemset mining or sequential pattern mining. Several HUSPM works have been presented, but most of them rely on main memory to speed up mining. However, this assumption is not realistic in large-scale environments: in real industry, the collected data is very large and cannot fit into the main memory of a single machine. In this article, we first develop a parallel and distributed three-stage MapReduce model for mining high-utility sequential patterns from large-scale databases. Two properties are then developed to guarantee the correctness and completeness of the discovered patterns in the developed framework. In addition, two data structures called sidset and utility-linked list are used in the developed framework to accelerate the computation for mining the required patterns. The results show that the designed model performs well on large-scale datasets in terms of runtime, memory, efficiency of the number of distributed nodes, and scalability compared to the serial HUSP-Span approach.
