Pattern Discovery Using Sequence Data Mining

Sequence Pattern Mining for Web Logs

Pattern Discovery Using Sequence Data Mining ◽

10.4018/978-1-61350-056-9.ch014 ◽

2012 ◽

pp. 237-243

Author(s):

Pradeep Kumar ◽

Raju S. Bapi ◽

P. Radha Krishna

Keyword(s):

Pattern Mining ◽

Time And Space ◽

Sequence Pattern ◽

Interestingness Measures ◽

Web Logs ◽

Space Cost ◽

Reduced Time

Interestingness measures play an important role in finding frequently occurring patterns, regardless of the kind of patterns being mined. In this work, we propose a variation of the AprioriALL algorithm, which is commonly used for sequence pattern mining. The proposed variation applies an interest measure at every step of candidate generation to reduce the number of candidates, thereby reducing time and space cost. The proposed algorithm derives patterns that qualify and are of greater interest to the user. By using the interest measure, the algorithm limits the size of the candidate set whenever it is produced, giving the user more say in obtaining the desired patterns.
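The idea of filtering candidates by interest at every level can be illustrated with a small sketch. This is not the authors' algorithm: it is a simplified level-wise (Apriori-style) miner that grows patterns by appending one frequent item at a time, and the names `mine_patterns`, `is_interesting`, and `starts_at_home`, as well as the sample sessions, are hypothetical. The abstract does not state which interest measure is used, so the predicate here is only a stand-in.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def is_subsequence(pattern: Pattern, sequence: Sequence[str]) -> bool:
    """True if the pattern's items occur in the sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern: Pattern, sequences: List[Sequence[str]]) -> float:
    """Fraction of sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)


def mine_patterns(
    sequences: List[Sequence[str]],
    min_support: float,
    is_interesting: Callable[[Pattern, float], bool],
    max_len: int = 5,
) -> Dict[Pattern, float]:
    """Level-wise mining; a candidate survives only if it is frequent AND interesting."""
    items = sorted({item for s in sequences for item in s})
    frequent_items = [i for i in items if support((i,), sequences) >= min_support]

    results: Dict[Pattern, float] = {}
    level: List[Pattern] = [(i,) for i in frequent_items]
    length = 1
    while level and length <= max_len:
        kept: List[Pattern] = []
        for candidate in level:
            sup = support(candidate, sequences)
            # Assumption: the interest test is applied together with the support test.
            if sup >= min_support and is_interesting(candidate, sup):
                results[candidate] = sup
                kept.append(candidate)
        # Only surviving patterns are extended, so the next candidate set is smaller.
        level = [p + (i,) for p in kept for i in frequent_items]
        length += 1
    return results


def starts_at_home(pattern: Pattern, sup: float) -> bool:
    """Hypothetical interest predicate: the user only cares about paths from 'home'."""
    return pattern[0] == "home"


# Hypothetical web-log sessions (one list of visited pages per user session).
logs = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["search", "product"],
    ["home", "search", "cart"],
]
patterns = mine_patterns(logs, min_support=0.5, is_interesting=starts_at_home)
for pattern, sup in sorted(patterns.items()):
    print(" -> ".join(pattern), sup)
```

With these sessions, the path search -> product meets the support threshold, but the single-item pattern search is discarded at the first level because it does not start at home, so none of its extensions are ever generated.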

Analysis of Kinase Inhibitors and Druggability of Kinase-Targets Using Machine Learning Techniques

Pattern Discovery Using Sequence Data Mining ◽

10.4018/978-1-61350-056-9.ch009 ◽

2012 ◽

pp. 155-165

Author(s):

S. Prasanthi ◽

S. Durga Bhavani ◽

T. Sobha Rani ◽

Raju S. Bapi

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Kinase Inhibitors ◽

Kinase Inhibitor ◽

Classification Problem ◽

Machine Learning Techniques ◽

Learning Approaches ◽

Decision Tree Classifier ◽

Data Set ◽

Learning Techniques

The vast majority of successful drugs or inhibitors achieve their activity by binding to a protein and modifying its activity, which leads to the concept of druggability. A target protein is druggable if it has the potential to bind drug-like molecules. Hence kinase inhibitors need to be studied to understand the specificity of a kinase inhibitor in choosing a particular kinase target. In this paper we focus on human kinase drug target sequences, since kinases are known to be potential drug targets. We also carry out a preliminary analysis of kinase inhibitors in order to study the problem in the protein-ligand space in future. The identification of druggable kinases is treated as a classification problem in which druggable kinases form the positive data set and non-druggable kinases form the negative data set. The classification problem is addressed using machine learning techniques such as support vector machine (SVM) and decision tree (DT) with sequence-specific features. One of the challenges of this classification problem is the unbalanced data, with only 48 druggable kinases available against 509 non-druggable kinases present in UniProt. The accuracy obtained by the decision tree classifier is 57.65, which is not satisfactory. A two-tier architecture of decision trees is carefully designed such that recognition on the non-druggable dataset also improves. The overall model is thus shown to achieve a final accuracy of 88.37. To the best of our knowledge, kinase druggability prediction using machine learning approaches has not been reported in the literature.
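The difficulty posed by the 48 versus 509 class split can be made concrete with a little arithmetic. The sketch below is not the authors' evaluation; it only uses the two class counts from the abstract and a hypothetical trivial classifier that labels every kinase as non-druggable, to show why plain accuracy can look high while no druggable kinase is recognised. The function name `summarize` is an assumption for illustration.

```python
from typing import Dict


def summarize(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Compute basic metrics from a binary confusion matrix."""
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on the positive class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on the negative class
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Class counts taken from the abstract.
DRUGGABLE = 48
NON_DRUGGABLE = 509

# Hypothetical baseline: predict "non-druggable" for every kinase.
baseline = summarize(tp=0, fn=DRUGGABLE, tn=NON_DRUGGABLE, fp=0)
for name, value in baseline.items():
    print(f"{name}: {value:.4f}")
```

The baseline reaches an accuracy of 509/557 while its sensitivity is zero and its balanced accuracy is 0.5, which is why per-class recognition matters more than overall accuracy on data this skewed.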

A Review of Kernel Methods Based Approaches to Classification and Clustering of Sequential Patterns, Part II

Pattern Discovery Using Sequence Data Mining ◽

10.4018/978-1-61350-056-9.ch003 ◽

2012 ◽

pp. 51-71

Author(s):

Veena T. ◽

Dileep A. D. ◽

C. Chandra Sekhar

Keyword(s):

Speech Processing ◽

Kernel Methods ◽

Text Analysis ◽

Character Recognition ◽

Pattern Analysis ◽

Biological Sequence ◽

Biological Sequence Analysis ◽

Handwritten Character ◽

Varying Length ◽

Classification And Clustering

Pattern analysis tasks on sequences of discrete symbols are important for pattern discovery in bioinformatics, text analysis, speech processing, and handwritten character recognition. Discrete symbols may correspond to amino acids or nucleotides in biological sequence analysis, characters in text analysis, and codebook indices in processing of speech and handwritten character data. The main issues in kernel methods based approaches to pattern analysis tasks on discrete symbol sequences are related to defining a measure of similarity between sequences of discrete symbols, and handling the varying length nature of sequences. We present a review of methods to design dynamic kernels for sequences of discrete symbols. We then present a review of approaches to classification and clustering of sequences of discrete symbols using the dynamic kernel based methods.
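Both issues named above, measuring similarity between symbol sequences and coping with varying length, can be seen in a small example. The sketch below implements the k-spectrum kernel, a well-known string kernel that compares sequences through counts of their length-k substrings; the abstract does not say which kernels the chapter reviews, so its use here is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every contiguous length-k substring; empty if the sequence is shorter than k."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(x: str, y: str, k: int) -> int:
    """Inner product of the two k-mer count vectors."""
    counts_x, counts_y = kmer_counts(x, k), kmer_counts(y, k)
    return sum(counts_x[kmer] * counts_y[kmer] for kmer in counts_x)


def normalized_spectrum_kernel(x: str, y: str, k: int) -> float:
    """Cosine-normalised kernel, which reduces the effect of sequence length."""
    denominator = math.sqrt(spectrum_kernel(x, x, k) * spectrum_kernel(y, y, k))
    return spectrum_kernel(x, y, k) / denominator if denominator else 0.0


# Two nucleotide strings of different lengths map to the same feature space.
a, b = "ACGTAC", "ACGT"
print(spectrum_kernel(a, b, 2))
print(round(normalized_spectrum_kernel(a, b, 2), 4))
```

Because every sequence is mapped to a fixed set of k-mer counts, sequences of any length can be compared, and the resulting kernel values can be passed to any kernel-based classifier or clustering method.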

Applications of Pattern Discovery Using Sequential Data Mining

Pattern Discovery Using Sequence Data Mining ◽

10.4018/978-1-61350-056-9.ch001 ◽

2012 ◽

pp. 1-23 ◽

Cited By ~ 8

Author(s):

Manish Gupta ◽

Jiawei Han

Keyword(s):

Data Mining ◽

Text Mining ◽

Intrusion Detection ◽

Pattern Mining ◽

Pattern Discovery ◽

Sequential Pattern Mining ◽

Web Usage Mining ◽

Sequential Pattern ◽

Sequential Data ◽

Mining Methods

Sequential pattern mining methods have been found to be applicable in a large number of domains. Sequential data is omnipresent. Sequential pattern mining methods have been used to analyze this data and identify patterns. Such patterns have been used to implement efficient systems that can recommend based on previously observed patterns, help in making predictions, improve usability of systems, detect events, and in general help in making strategic product decisions. In this chapter, we discuss the applications of sequential data mining in a variety of domains like healthcare, education, Web usage mining, text mining, bioinformatics, telecommunications, intrusion detection, et cetera. We conclude with a summary of the work.

Mining Statistically Significant Substrings Based on the Chi-Square Measure

Pattern Discovery Using Sequence Data Mining ◽

10.4018/978-1-61350-056-9.ch004 ◽

2012 ◽

pp. 73-82 ◽

Cited By ~ 1

Author(s):

Sourav Dutta ◽

Arnab Bhattacharya

Keyword(s):

Intrusion Detection ◽

Sequence Data ◽

Long Strings ◽

Intrusion Detection Systems ◽

Chi Square ◽

Detection Systems ◽

Statistical Measures ◽

Normal Behavior ◽

Practical Challenge ◽

String Databases

With the tremendous expansion of reservoirs of sequence data stored worldwide, efficient mining of large string databases in various domains including intrusion detection systems, player statistics, texts, and proteins, has emerged as a practical challenge. Searching for an unusual pattern within long strings of data is one of the foremost requirements for many diverse applications. Given a string, the problem is to identify the substrings that differ the most from the expected or normal behavior, i.e., the substrings that are statistically significant (or, in other words, less likely to occur due to chance alone). We first survey and analyze the different statistical measures available to meet this end. Next, we argue that the most appropriate metric is the chi-square measure. Finally, we discuss different approaches and algorithms proposed for retrieving the top-k substrings with the largest chi-square measure.
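The top-k problem described above can be sketched with a brute-force search. This is not one of the algorithms discussed in the chapter; it simply scores every substring and keeps the best k. It assumes the usual Pearson chi-square statistic, the sum over symbols of (observed - expected)^2 / expected, where the expected count of a symbol in a substring of length l is l times its null-model probability. The function names and the sample string are hypothetical, and every character of the text is assumed to appear in `probs`.

```python
import heapq
from typing import Dict, List, Tuple


def chi_square(counts: Dict[str, int], length: int, probs: Dict[str, float]) -> float:
    """Pearson chi-square of observed symbol counts against the null model."""
    return sum(
        (counts.get(symbol, 0) - length * p) ** 2 / (length * p)
        for symbol, p in probs.items()
    )


def top_k_substrings(text: str, probs: Dict[str, float], k: int) -> List[Tuple[float, int, int]]:
    """Return the k highest-scoring substrings as (score, start, end) with end exclusive.

    Brute force over all O(n^2) substrings, for illustration only.
    """
    scored: List[Tuple[float, int, int]] = []
    for start in range(len(text)):
        counts: Dict[str, int] = {}
        for end in range(start, len(text)):
            # Extend the current substring by one symbol and update its counts.
            counts[text[end]] = counts.get(text[end], 0) + 1
            length = end - start + 1
            scored.append((chi_square(counts, length, probs), start, end + 1))
    return heapq.nlargest(k, scored)


# Hypothetical input: a two-symbol string with an unusually long run of 'a'.
text = "ababaaaaab"
null_model = {"a": 0.5, "b": 0.5}
for score, start, end in top_k_substrings(text, null_model, k=3):
    print(f"{text[start:end]!r} at [{start}, {end}) score={score:.3f}")
```

Under a uniform two-symbol null model the statistic reduces to (count of a - count of b)^2 / l, so the run of five consecutive 'a' characters scores highest.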

Discovering Patterns in Order to Detect Weak Signals and Define New Strategies

Pattern Discovery Using Sequence Data Mining ◽

10.4018/978-1-61350-056-9.ch012 ◽

2012 ◽

pp. 195-211 ◽

Cited By ~ 1

Author(s):

Anass El Haddadi ◽

Bernard Dousset ◽

Ilham Berrada

Keyword(s):

Data Mining ◽

Correspondence Analysis ◽

Sequence Data ◽

Multiple Correspondence Analysis ◽

Decision Makers ◽

Competitive Intelligence ◽

Weak Signals ◽

New Strategies

Competitive intelligence activities rely on collecting and analyzing data in order to discover patterns using sequence data mining. The discovered patterns are used to help decision-makers consider innovation and define the strategy for their business. In this chapter we present four methods for discovering patterns in the competitive intelligence process: “correspondence analysis,” “multiple correspondence analysis,” “evolutionary graph,” and “multi-term method.”

Identification of Genomic Islands by Pattern Discovery

Pattern Discovery Using Sequence Data Mining ◽

10.4018/978-1-61350-056-9.ch010 ◽

2012 ◽

pp. 166-181

Author(s):

Nita Parekh

Keyword(s):

Pattern Recognition ◽

Pattern Discovery ◽

Genetic Material ◽

Genomic Islands ◽

Evolutionary Relationships ◽

Bacterial Genomes ◽

Pattern Recognition Problem ◽

Functional Behavior ◽

Virulent Strains ◽

Biological Patterns

Pattern discovery is at the heart of bioinformatics, and algorithms from computer science have been widely used for identifying biological patterns. The assumption behind pattern discovery approaches is that a pattern that occurs often enough in biological sequences/structures or is conserved across organisms is expected to play a role in defining the respective sequence’s or structure’s functional behavior and/or evolutionary relationships. The pattern recognition problem addressed here is at the genomic level and involves identifying horizontally transferred regions, called genomic islands. A horizontal transfer event is defined as the movement of genetic material between phylogenetically unrelated organisms by mechanisms other than parent-to-progeny inheritance. Increasing evidence suggests the importance of horizontal transfer events in the evolution of bacteria, influencing traits such as antibiotic resistance, symbiosis and fitness, virulence, and adaptation in general. In the genomic era, with the availability of a large number of bacterial genomes, the identification of genomic islands also forms the first step in the annotation of newly sequenced genomes and in identifying the differences between virulent and non-virulent strains of a species. Considerable effort is being made in their identification and analysis, and this chapter gives a brief summary of various approaches used in the identification and validation of horizontally acquired regions.

Classification of Biological Sequences

Pattern Discovery Using Sequence Data Mining ◽

10.4018/978-1-61350-056-9.ch007 ◽

2012 ◽

pp. 111-135

Author(s):

Pratibha Rani ◽

Vikram Pudi

Keyword(s):

Naive Bayes ◽

Analysis Data ◽

Naïve Bayes ◽

Biological Data ◽

Rapid Progress ◽

Biological Data Analysis ◽

Depth Analysis ◽

Iterative Scaling ◽

Mining Methods

The rapid progress of computational biology, biotechnology, and bioinformatics in the last two decades has led to the accumulation of tremendous amounts of biological data that demand in-depth analysis. Data mining methods have been applied successfully for analyzing this data. An important problem in biological data analysis is to classify a newly discovered sequence, such as a protein or DNA sequence, based on its important features and functions, using the collection of available sequences. In this chapter, we study this problem and present two Bayesian classifiers, RBNBC (Rani & Pudi, 2008a) and REBMEC (Rani & Pudi, 2008c). The algorithms used in these classifiers incorporate repeated occurrences of subsequences within each sequence (Rani, 2008). Specifically, the Repeat Based Naive Bayes Classifier (RBNBC) uses a novel formulation of Naive Bayes, and the second classifier, the Repeat Based Maximum Entropy Classifier (REBMEC), uses a novel framework based on the classical Generalized Iterative Scaling (GIS) algorithm.

Quantization based Sequence Generation and Subsequence Pruning for Data Mining Applications

Pattern Discovery Using Sequence Data Mining ◽

10.4018/978-1-61350-056-9.ch006 ◽

2012 ◽

pp. 94-110 ◽

Cited By ~ 1

Author(s):

T. Ravindra Babu ◽

M. Narasimha Murty ◽

S. V. Subrahmanya

Keyword(s):

Data Mining ◽

Large Data ◽

Large Datasets ◽

Superior Performance ◽

Data Sets ◽

Sequence Generation ◽

Data Compaction ◽

Clustering And Classification ◽

Minimal Data

Data mining deals with efficient algorithms for handling large data. When such algorithms are combined with data compaction, they lead to superior performance. Approaches for dealing with large data include working with representatives of the data instead of the entire data. The representatives should preferably be generated with minimal data scans. In this chapter we discuss lossy and non-lossy data compression methods combined with clustering and classification of large datasets. We demonstrate the working of such schemes on two large data sets.

A Review of Kernel Methods Based Approaches to Classification and Clustering of Sequential Patterns, Part I

Pattern Discovery Using Sequence Data Mining ◽

10.4018/978-1-61350-056-9.ch002 ◽

2012 ◽

pp. 24-50

Author(s):

Dileep A. D. ◽

Veena T. ◽

C. Chandra Sekhar

Keyword(s):

Pattern Classification ◽

Kernel Methods ◽

Pattern Analysis ◽

Kernel Functions ◽

Sequential Pattern ◽

Sequential Patterns ◽

Sequential Data ◽

Feature Vectors ◽

Varying Length ◽

Classification And Clustering

Sequential data mining involves analysis of sequential patterns of varying length. Sequential pattern analysis is important for pattern discovery from sequences of discrete symbols as in bioinformatics and text analysis, and from sequences or sets of continuous valued feature vectors as in processing of audio, speech, music, image, and video data. Pattern analysis techniques using kernel methods have been explored for static patterns as well as sequential patterns. The main issue in sequential pattern analysis using kernel methods is the design of a suitable kernel for sequential patterns of varying length. Kernel functions designed for sequential patterns are known as dynamic kernels. In this chapter, we present a brief description of kernel methods for pattern classification and clustering. Then we describe dynamic kernels for sequences of continuous feature vectors. We then present a review of approaches to sequential pattern classification and clustering using dynamic kernels.

Published By IGI Global
