Data Mining Patterns | ScienceGate

Topic and Cluster Evolution Over Noisy Document Streams

Data Mining Patterns ◽

10.4018/978-1-59904-162-9.ch010 ◽

2011 ◽

pp. 220-239 ◽

Cited By ~ 1

Author(s):

Sascha Schulz ◽

Myra Spiliopoulou ◽

Rene Schult

Keyword(s):

Strategic Management ◽

Hierarchical Clustering ◽

Relevant Literature ◽

Information Filtering ◽

Feature Space ◽

Classification Schemes ◽

Topic Evolution ◽

Cluster Evolution ◽

Time Periods ◽

Emerging Topics

We study the problem of discovering and tracing thematic topics in a stream of documents. This problem, often studied under the label “topic evolution”, is of interest in many applications where thematic trends should be identified and monitored, including environmental modelling for marketing and strategic management applications, information filtering over streams of news, and enrichment of classification schemes with emerging new classes. We concentrate on the last of these areas and describe an example application from the automotive industry: the discovery of emerging topics in repair and maintenance reports. We first discuss relevant literature on (a) the discovery and monitoring of topics over document streams and (b) the monitoring of evolving clusters over arbitrary data streams. Then, we propose our own method for topic evolution over a stream of small noisy documents: we combine hierarchical clustering, performed at different time periods, with cluster comparison over adjacent time periods, taking into account that the feature space itself may change from one period to the next. We elaborate on the behaviour of this method and show how human experts can be assisted in identifying class candidates among the topics thus identified.

Social Network Mining from the Web

Data Mining Patterns ◽

10.4018/978-1-59904-162-9.ch007 ◽

2011 ◽

pp. 149-175 ◽

Cited By ~ 2

Author(s):

Yutaka Matsuo ◽

Junichiro Mori ◽

Mitsuru Ishizuka

Keyword(s):

Social Network ◽

Search Engine ◽

Relevant Information ◽

Web Content ◽

Message Boards ◽

Network Information ◽

Network Mining ◽

Social Network Mining ◽

The Social ◽

The Web

This chapter describes social network mining from the Web. Since the end of the 1990s, several attempts have been made to mine social network information from e-mail messages, message boards, Web linkage structure, and Web content. In this chapter, we specifically examine social network extraction from the Web using a search engine. The Web is a huge source of information about relations among persons. Therefore, we can build a social network by merging the information distributed on the Web. The growth of information on the Web, together with the development of search engines, opens new possibilities for processing the vast amounts of relevant information and mining important structures and knowledge.

Pattern Discovery in Biosequences

Data Mining Patterns ◽

10.4018/978-1-59904-162-9.ch004 ◽

2011 ◽

pp. 85-105 ◽

Cited By ~ 1

Author(s):

Simona Ester Rombo ◽

Luigi Palopoli

Keyword(s):

Pattern Discovery ◽

Protein Sequences ◽

Biological Data ◽

Data Sets ◽

Biological Sequences ◽

Biological Functions ◽

Pattern Extraction ◽

New Methods

In recent years, the information stored in biological data sets has grown exponentially, and new methods and tools have been proposed to interpret and retrieve useful information from such data. Most biological data sets contain biological sequences (e.g., DNA and protein sequences). It is therefore important to have techniques capable of mining patterns from such sequences in order to discover interesting information in them. For instance, singling out common or similar sub-sequences in sets of bio-sequences is sensible because these are usually associated with similar biological functions expressed by the corresponding macromolecules. The aim of this chapter is to explain how pattern discovery can be applied to such important biological problems, and to describe a number of relevant techniques proposed in the literature. A simple formalization of the problem is given and specialized for each of the presented approaches. This formalization should make the material easier to read and understand by providing a simple-to-follow roadmap through the diverse methods for pattern extraction that we illustrate.

Bi-Directional Constraint Pushing in Frequent Pattern Mining

Data Mining Patterns ◽

10.4018/978-1-59904-162-9.ch002 ◽

2011 ◽

pp. 32-56

Author(s):

Osmar R. Zaïane ◽

Mohammed El-Hajj

Keyword(s):

Pattern Mining ◽

Frequent Pattern Mining ◽

Large Data ◽

Frequent Itemset ◽

Frequent Pattern ◽

Frequent Itemset Mining ◽

Data Sets ◽

Itemset Mining ◽

Transactional Databases ◽

The Cost

Frequent Itemset Mining (FIM) is a key component of many algorithms that extract patterns from transactional databases. For example, FIM can be leveraged to produce association rules, clusters, classifiers or contrast sets. This capability provides a strategic resource for decision support, and is most commonly used for market basket analysis. One challenge for frequent itemset mining is the potentially huge number of extracted patterns, which can eclipse the original database in size. In addition to increasing the cost of mining, this makes it more difficult for users to find the valuable patterns. Introducing constraints to the mining process helps mitigate both issues. Decision makers can restrict discovered patterns according to specified rules. By applying these restrictions as early as possible, the cost of mining can be constrained. For example, users may be interested in purchases whose total price exceeds $100, or whose items cost between $50 and $100. In cases of extremely large data sets, pushing constraints sequentially is not enough and parallelization becomes a must. However, specific design is needed to achieve sizes never reported before in the literature.
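To make the idea of applying restrictions "as early as possible" concrete, here is a minimal sketch of a levelwise (Apriori-style) search that checks a price cap while candidates are generated rather than after mining. It is not the chapter's bi-directional algorithm. The function name `frequent_itemsets`, the `prices` mapping, and the cap `max_total_price` are all assumptions made for illustration, and the cap is chosen because, with non-negative prices, it is anti-monotone and can therefore prune safely.

```python
from itertools import combinations


def frequent_itemsets(transactions, prices, min_support, max_total_price):
    """Levelwise search that prunes on support and on a price cap together.

    Hypothetical helper: both conditions are anti-monotone (a superset can
    never be more frequent, nor cheaper when prices are non-negative), so an
    itemset that fails either one is dropped before it is extended.
    """
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    def within_budget(itemset):
        return sum(prices[i] for i in itemset) <= max_total_price

    def keep(itemset):
        # The cheap constraint check runs first, before any database scan.
        return within_budget(itemset) and support(itemset) >= min_support

    items = sorted({i for t in transactions for i in t})
    level = [s for s in (frozenset([i]) for i in items) if keep(s)]
    result = {}
    while level:
        for s in level:
            result[s] = support(s)
        # Join pairs of surviving k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if keep(c)]
    return result


baskets = [{"bread", "milk"}, {"bread", "milk", "wine"},
           {"milk", "wine"}, {"bread", "milk"}]
item_prices = {"bread": 3, "milk": 2, "wine": 40}
found = frequent_itemsets(baskets, item_prices, min_support=2, max_total_price=10)
```

In this toy run, `wine` is discarded at the first level because its price alone exceeds the cap, so no itemset containing it is ever counted against the database. A monotone constraint such as "total price exceeds $100" cannot be pruned this way, which is one reason dedicated techniques are needed.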

Discovering Spatio-Textual Association Rules in Document Images

Data Mining Patterns ◽

10.4018/978-1-59904-162-9.ch008 ◽

2011 ◽

pp. 176-197

Author(s):

Donato Malerba ◽

Margherita Berardi ◽

Michelangelo Ceci

Keyword(s):

Association Rules ◽

Spatial Information ◽

Inductive Logic ◽

Logical Structure ◽

Document Image ◽

Programming Approach ◽

Analysis Process ◽

Preprocessing Technique ◽

Double Entry ◽

Textual Content

This chapter introduces a data mining method for the discovery of association rules from images of scanned paper documents. It argues that a document image is a multi-modal unit of analysis whose semantics is deduced from a combination of the textual content, the layout structure, and the logical structure. It therefore proposes a method in which three sources of information are considered simultaneously to generate interesting patterns: the spatial information derived from a complex document image analysis process (layout analysis), the information extracted from the logical structure of the document (document image classification and understanding), and the textual information extracted by means of OCR. The proposed method is based on an inductive logic programming approach, which is argued to be the most appropriate for analyzing data available in more than one modality. It illustrates a possible evolution of the unimodal knowledge discovery scheme, in which different types of data describing the units of analysis are handled by applying preprocessing techniques that transform them into a single double-entry table.

Metric Methods in Data Mining

Data Mining Patterns ◽

10.4018/978-1-59904-162-9.ch001 ◽

2011 ◽

pp. 1-31 ◽

Cited By ~ 1

Author(s):

Dan A. Simovici

Keyword(s):

Data Mining ◽

Metric Space ◽

Training Data ◽

Future Research ◽

Open Problems ◽

Research Directions ◽

Data Mining Techniques ◽

Future Research Directions ◽

Major Data ◽

Geometric Study

This chapter presents data mining techniques that make use of metrics defined on the set of partitions of finite sets. Partitions are naturally associated with object attributes, and major data mining problems such as classification, clustering, and data preparation benefit from an algebraic and geometric study of the metric space of partitions. The metrics we find most useful are derived from a generalization of the entropic metric. We discuss techniques that produce smaller classifiers, allow incremental clustering of categorical data, and help users better prepare training data for constructing classifiers. Finally, we discuss open problems and future research directions.
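As a rough illustration of what a metric on partitions looks like, the sketch below computes the plain Shannon-entropy distance d(π, σ) = H(π|σ) + H(σ|π) between two partitions of the same finite set. This is only the classical special case, not the generalization the chapter develops. The function names and the representation of a partition as a list of block labels are assumptions made for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the partition induced by the labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def partition_distance(a, b):
    """Entropic distance H(a|b) + H(b|a) between two partitions.

    a[i] and b[i] are the block labels of element i in each partition.
    Uses the identity H(a|b) + H(b|a) = 2*H(a, b) - H(a) - H(b), where
    H(a, b) is the entropy of the common refinement of the two partitions.
    """
    if len(a) != len(b):
        raise ValueError("both partitions must cover the same elements")
    joint = list(zip(a, b))  # blocks of the common refinement
    return 2 * entropy(joint) - entropy(a) - entropy(b)


same = partition_distance([0, 0, 1, 1], ["x", "x", "y", "y"])
crossed = partition_distance([0, 0, 1, 1], [0, 1, 0, 1])
```

Here `same` is zero because the two labelings describe the same partition under different block names, while `crossed` is positive because each partition splits every block of the other. Treating an attribute's values as block labels is what lets such a distance compare attributes with one another or with a class column.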

Finding Patterns in Class-Labeled Data Using Data Visualization

Data Mining Patterns ◽

10.4018/978-1-59904-162-9.ch005 ◽

2011 ◽

pp. 106-123

Author(s):

Gregor Leban ◽

Minca Mramor ◽

Blaž Zupan ◽

Janez Demšar ◽

Ivan Bratko

Keyword(s):

Gene Expression ◽

Data Mining ◽

Data Analysis ◽

Data Visualization ◽

Gene Expression Data ◽

Data Sets ◽

Cancer Gene ◽

Manual Search ◽

Degree Of Separation ◽

Using Data

Data visualization plays a crucial role in data mining and knowledge discovery. Its use is, however, often difficult due to the large number of possible data projections. Manual search through such sets of projections can be prohibitively time-consuming or even impossible, especially in data analysis problems that involve many data features. The chapter describes a method called VizRank, which can be used to automatically identify interesting data projections for multivariate visualizations of class-labeled data. VizRank assigns a score of interestingness to each considered projection based on the degree of separation of data instances with different class labels. We demonstrate the usefulness of this approach on six cancer gene expression data sets, showing that the method can reveal interesting data patterns and can further be used for data classification and outlier detection.

Mining XML Documents

Data Mining Patterns ◽

10.4018/978-1-59904-162-9.ch009 ◽

2011 ◽

pp. 198-219 ◽

Cited By ~ 1

Author(s):

Laurent Candillier ◽

Ludovic Denoyer ◽

Patrick Gallinari ◽

Marie Christine Rousset ◽

Alexandre Termier ◽

...

Keyword(s):

Experimental Evaluation ◽

Information Sources ◽

Tree Structure ◽

Complex Structures ◽

Xml Documents ◽

New Methods ◽

Text Collections ◽

Mining Algorithms ◽

Classification And Clustering

XML documents are becoming ubiquitous because of their rich and flexible format, which can be used for a variety of applications. Given the increasing size of XML collections as information sources, mining techniques that traditionally exist for text collections or databases need to be adapted, and new methods need to be invented to exploit the particular structure of XML documents. Basically, XML documents can be seen as trees, which are well known to be complex structures. This chapter describes various ways of using and simplifying this tree structure to model documents and support efficient mining algorithms. We focus on three mining tasks: classification and clustering, which are standard for text collections, and the discovery of frequent tree structures, which is especially important for heterogeneous collections. This chapter presents some recent approaches and algorithms to support these tasks, together with experimental evaluation on a variety of large XML collections.

Summarizing Data Cubes Using Blocks

Data Mining Patterns ◽

10.4018/978-1-59904-162-9.ch006 ◽

2011 ◽

pp. 124-148

Author(s):

Yeow Wei Choong ◽

Anne Laurent ◽

Dominique Laurent

Keyword(s):

Data Cube ◽

Multidimensional Data ◽

Data Set ◽

Data Cubes ◽

Abstract Knowledge ◽

Novel Approach ◽

Fuzzy Partitions

In the context of multidimensional data, OLAP tools are appropriate for navigating the data with the aim of discovering pertinent and abstract knowledge. However, due to the size of the data set, a systematic and exhaustive exploration is not feasible. The problem is therefore to design automatic tools that ease navigation in the data and their visualization. In this chapter, we present a novel approach that automatically builds blocks of similar values in a given data cube; these blocks are meant to summarize the content of the cube. Our method is based on a levelwise algorithm (à la Apriori) whose complexity is shown to be polynomial in the number of scans of the data cube. The experiments reported in the chapter show that our approach is scalable, in particular when the measure values present in the data cube are discretized using crisp or fuzzy partitions.

Discovery of Latent Patterns with Hierarchical Bayesian Mixed-Membership Models and the Issue of Model Choice

Data Mining Patterns ◽

10.4018/978-1-59904-162-9.ch011 ◽

2011 ◽

pp. 240-275 ◽

Cited By ~ 1

Author(s):

Cyrille J. Joutard ◽

Edoardo M. Airoldi ◽

Stephen E. Fienberg ◽

Tanzy M. Love

Keyword(s):

Biological Sciences ◽

Functional Disability ◽

Data Sets ◽

Hierarchical Bayesian ◽

Model Choice ◽

National Academy Of Sciences ◽

Scientific Publications ◽

Care Survey ◽

Mixed Membership Models ◽

Academy Of Sciences

Statistical models involving a latent structure often support clustering, classification, and other data-mining tasks. Parameterizations, specifications, and constraints of alternative models can be very different, however, and may lead to contrasting conclusions. Thus model choice becomes a fundamental issue in applications, both methodological and substantive. Here, we work from a general formulation of hierarchical Bayesian models of mixed-membership that subsumes many popular models successfully applied to problems in the computing, social and biological sciences. We present both parametric and nonparametric specifications for discovering latent patterns. Context for the discussion is provided by novel analyses of the following two data sets: (1) 5 years of scientific publications from the Proceedings of the National Academy of Sciences; (2) an extract on the functional disability of Americans age 65+ from the National Long Term Care Survey. For both, we elucidate strategies for model choice and our analyses bring new insights compared with earlier published analyses.

Published By IGI Global
