Advances in Classification of Sequence Data

Author(s):  
Pradeep Kumar ◽  
P. Radha Krishna ◽  
Raju S. Bapi ◽  
T. M. Padmaja

In recent years, advanced information systems have enabled collection of increasingly large amounts of data that are sequential in nature. To analyze huge amounts of sequential data, the interdisciplinary field of Knowledge Discovery in Databases (KDD) is very useful. The most important step within the process of KDD is data mining, which is concerned with the extraction of the valid patterns. Recent research focus in data mining includes stream data mining, sequence data mining, web mining, text mining, visual mining, multimedia mining and multi-relational data mining. Sequence data may be discrete or continuous in nature. Most of the research on discrete sequence data concentrated on the discovery of frequently occurring patterns. However, comparatively less amount of work has been carried out in the area of discrete sequence data classification. In this chapter, data taxonomy is introduced with a review of the state of art for sequence data classification. The usefulness of embedding partial subsequence information extracted using sliding window technique into traditional classifier like kNN has been demonstrated. kNN has been tested with various vector based distance/similarity metrics. Further, with the use of S3M similarity metric, the full subsequence information embedded in the data sequences is extracted. The experimental data taken is DARPA’98 IDS benchmark dataset collected from UCIML dataset repository. The chapter closes by pointing out various application areas of sequence data and also the open issues in sequence data classification problem.

Data Mining ◽  
2013 ◽  
pp. 1835-1851
Author(s):  
Manish Gupta ◽  
Jiawei Han

In this chapter we first introduce sequence data. We then discuss different approaches for mining of patterns from sequence data, studied in literature. Apriori based methods and the pattern growth methods are the earliest and the most influential methods for sequential pattern mining. There is also a vertical format based method which works on a dual representation of the sequence database. Work has also been done for mining patterns with constraints, mining closed patterns, mining patterns from multi-dimensional databases, mining closed repetitive gapped subsequences, and other forms of sequential pattern mining. Some works also focus on mining incremental patterns and mining from stream data. We present at least one method of each of these types and discuss their advantages and disadvantages. We conclude with a summary of the work.


Author(s):  
Manish Gupta ◽  
Jiawei Han

In this chapter we first introduce sequence data. We then discuss different approaches for mining of patterns from sequence data, studied in literature. Apriori based methods and the pattern growth methods are the earliest and the most influential methods for sequential pattern mining. There is also a vertical format based method which works on a dual representation of the sequence database. Work has also been done for mining patterns with constraints, mining closed patterns, mining patterns from multi-dimensional databases, mining closed repetitive gapped subsequences, and other forms of sequential pattern mining. Some works also focus on mining incremental patterns and mining from stream data. We present at least one method of each of these types and discuss their advantages and disadvantages. We conclude with a summary of the work.


Author(s):  
Manmohan Singh ◽  
Rajendra Pamula ◽  
Alok Kumar

There are various applications of clustering in the fields of machine learning, data mining, data compression along with pattern recognition. The existent techniques like the Llyods algorithm (sometimes called k-means) were affected by the issue of the algorithm which converges to a local optimum along with no approximation guarantee. For overcoming these shortcomings, an efficient k-means clustering approach is offered by this paper for stream data mining. Coreset is a popular and fundamental concept for k-means clustering in stream data. In each step, reduction determines a coreset of inputs, and represents the error, where P represents number of input points according to nested property of coreset. Hence, a bit reduction in error of final coreset gets n times more accurate. Therefore, this motivated the author to propose a new coreset-reduction algorithm. The proposed algorithm executed on the Covertype dataset, Spambase dataset, Census 1990 dataset, Bigcross dataset, and Tower dataset. Our algorithm outperforms with competitive algorithms like Streamkm[Formula: see text], BICO (BIRCH meets Coresets for k-means clustering), and BIRCH (Balance Iterative Reducing and Clustering using Hierarchies.


2020 ◽  
Vol 106 ◽  
pp. 672-684 ◽  
Author(s):  
José Maia ◽  
Carlos Alberto Severiano ◽  
Frederico Gadelha Guimarães ◽  
Cristiano Leite de Castro ◽  
André Paim Lemos ◽  
...  

2010 ◽  
Vol 6 (4) ◽  
pp. 16-32 ◽  
Author(s):  
Pradeep Kumar ◽  
Bapi S. Raju ◽  
P. Radha Krishna

In many data mining applications, both classification and clustering algorithms require a distance/similarity measure. The central problem in similarity based clustering/classification comprising sequential data is deciding an appropriate similarity metric. The existing metrics like Euclidean, Jaccard, Cosine, and so forth do not exploit the sequential nature of data explicitly. In this paper, the authors propose a similarity preserving function called Sequence and Set Similarity Measure (S3M) that captures both the order of occurrence of items in sequences and the constituent items of sequences. The authors demonstrate the usefulness of the proposed measure for classification and clustering tasks. Experiments were conducted on benchmark datasets, that is, DARPA’98 and msnbc, for classification task in intrusion detection and clustering task in web mining domains. Results show the usefulness of the proposed measure.


Author(s):  
Richard Weber

Since the First KDD Workshop back in 1989 when “Knowledge Mining” was recognized as one of the top 5 topics in future database research (Piatetsky-Shapiro 1991), many scientists as well as users in industry and public organizations have considered data mining as highly relevant for their respective professional activities. We have witnessed the development of advanced data mining techniques as well as the successful implementation of knowledge discovery systems in many companies and organizations worldwide. Most of these implementations are static in the sense that they do not contemplate explicitly a changing environment. However, since most analyzed phenomena change over time, the respective systems should be adapted to the new environment in order to provide useful and reliable analyses. If we consider for example a system for credit card fraud detection, we may want to segment our customers, process stream data generated by their transactions, and finally classify them according to their fraud probability where fraud pattern change over time. If our segmentation should group together homogeneous customers using not only their current feature values but also their trajectories, things get even more difficult since we have to cluster vectors of functions instead of vectors of real values. An example for such a trajectory could be the development of our customers’ number of transactions over the past six months or so if such a development tells us more about their behavior than just a single value; e.g., the most recent number of transactions. It is in this kind of applications is where dynamic data mining comes into play! Since data mining is just one step of the iterative KDD (Knowledge Discovery in Databases) process (Han & Kamber, 2001), dynamic elements should be considered also during the other steps. The entire process consists basically of activities that are performed before doing data mining (such as: selection, pre-processing, transformation of data (Famili et al., 1997)), the actual data mining part, and subsequent steps (such as: interpretation, evaluation of results). In subsequent sections we will present the background regarding dynamic data mining by studying existing methodological approaches as well as already performed applications and even patents and tools. Then we will provide the main focus of this chapter by presenting dynamic approaches for each step of the KDD process. Some methodological aspects regarding dynamic data mining will be presented in more detail. After envisioning future trends regarding dynamic data mining we will conclude this chapter.


Sign in / Sign up

Export Citation Format

Share Document