Data Mining Patterns
Latest Publications


TOTAL DOCUMENTS: 11 (FIVE YEARS 0)
H-INDEX: 1 (FIVE YEARS 0)

Published By IGI Global

9781599041629, 9781599041643

2011, pp. 220-239
Author(s): Sascha Schulz, Myra Spiliopoulou, Rene Schult

We study the issue of discovering and tracing thematic topics in a stream of documents. This issue, often studied under the label “topic evolution”, is of interest in many applications where thematic trends should be identified and monitored, including environmental modelling for marketing and strategic management applications, information filtering over streams of news, and enrichment of classification schemes with emerging new classes. We concentrate on the last of these areas and present an example application from the automotive industry: the discovery of emerging topics in repair and maintenance reports. We first discuss relevant literature on (a) the discovery and monitoring of topics over document streams and (b) the monitoring of evolving clusters over arbitrary data streams. Then, we propose our own method for topic evolution over a stream of small noisy documents: we combine hierarchical clustering, performed at different time periods, with cluster comparison over adjacent time periods, taking into account that the feature space itself may change from one period to the next. We elaborate on the behaviour of this method and show how human experts can be assisted in identifying class candidates among the topics thus identified.


2011, pp. 149-175
Author(s): Yutaka Matsuo, Junichiro Mori, Mitsuru Ishizuka

This chapter describes social network mining from the Web. Since the end of the 1990s, several attempts have been made to mine social network information from e-mail messages, message boards, Web linkage structure, and Web content. In this chapter, we specifically examine social network extraction from the Web using a search engine. The Web is a huge source of information about relations among persons. Therefore, we can build a social network by merging the information distributed on the Web. The growth of information on the Web, together with the development of search engines, opens new possibilities to process the vast amounts of relevant information and mine important structures and knowledge.


2011, pp. 85-105
Author(s): Simona Ester Rombo, Luigi Palopoli

In recent years, the information stored in biological data sets has grown exponentially, and new methods and tools have been proposed to interpret and retrieve useful information from such data. Most biological data sets contain biological sequences (e.g., DNA and protein sequences). Thus, it is highly important to have techniques capable of mining patterns from such sequences to discover interesting information. For instance, singling out common or similar sub-sequences in sets of bio-sequences is sensible, as these are usually associated with similar biological functions expressed by the corresponding macromolecules. The aim of this chapter is to explain how pattern discovery can be applied to such important biological problems, and to describe a number of relevant techniques proposed in the literature. A simple formalization of the problem is given and specialized for each of the presented approaches. This formalization should make the material easier to read and understand by providing a simple-to-follow roadmap through the diverse methods for pattern extraction that we illustrate.


2011, pp. 32-56
Author(s): Osmar R. Zaïane, Mohammed El-Hajj

Frequent Itemset Mining (FIM) is a key component of many algorithms that extract patterns from transactional databases. For example, FIM can be leveraged to produce association rules, clusters, classifiers or contrast sets. This capability provides a strategic resource for decision support, and is most commonly used for market basket analysis. One challenge for frequent itemset mining is the potentially huge number of extracted patterns, which can eclipse the original database in size. In addition to increasing the cost of mining, this makes it more difficult for users to find the valuable patterns. Introducing constraints to the mining process helps mitigate both issues. Decision makers can restrict discovered patterns according to specified rules. By applying these restrictions as early as possible, the cost of mining can be reduced. For example, users may be interested in purchases whose total price exceeds $100, or whose items cost between $50 and $100. In cases of extremely large data sets, pushing constraints sequentially is not enough and parallelization becomes a must. However, a specific design is needed to handle data set sizes never reported before in the literature.
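
The idea of applying a constraint as early as possible can be illustrated with a minimal sketch. It is not the chapter's algorithm: it is a plain level-wise (Apriori-style) search in which a per-item price constraint, such as "items cost between $50 and $100", is used to filter items before any support counting. The function name `frequent_itemsets`, the `item_ok` parameter, and the toy prices and transactions are all hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, item_ok=lambda item: True):
    """Level-wise frequent itemset search with a per-item constraint.

    item_ok is applied BEFORE mining, so items that can never appear in a
    valid pattern are never counted (the constraint is "pushed" early).
    """
    # Drop disallowed items from every transaction up front.
    filtered = [frozenset(i for i in t if item_ok(i)) for t in transactions]

    def support(itemset):
        return sum(1 for t in filtered if itemset <= t)

    # Level 1: frequent single items.
    items = sorted({i for t in filtered for i in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support]

    result = {}
    while level:
        for itemset in level:
            result[itemset] = support(itemset)
        # Join frequent k-itemsets into (k+1)-candidates and keep only those
        # whose k-subsets are all frequent and whose support is high enough.
        previous = set(level)
        k = len(level[0]) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in previous for s in combinations(c, k - 1))
                 and support(c) >= min_support]
    return result


# Hypothetical toy data: item prices and four purchases.
prices = {"cable": 12, "tire": 80, "battery": 95, "wiper": 20, "rim": 60}
transactions = [
    {"cable", "tire", "battery"},
    {"tire", "battery", "rim"},
    {"tire", "battery", "wiper"},
    {"cable", "rim"},
]
patterns = frequent_itemsets(
    transactions,
    min_support=2,
    item_ok=lambda item: 50 <= prices[item] <= 100,
)
```

In this toy run, "cable" occurs in two purchases and would be frequent without the constraint, but it is removed before counting because its price falls outside the range. A constraint on the total price of an itemset cannot be handled by item filtering alone and would need a check on the candidate itemsets themselves.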


2011, pp. 176-197
Author(s): Donato Malerba, Margherita Berardi, Michelangelo Ceci

This chapter introduces a data mining method for the discovery of association rules from images of scanned paper documents. It argues that a document image is a multi-modal unit of analysis whose semantics is deduced from a combination of the textual content, the layout structure, and the logical structure. Therefore, it proposes a method in which three kinds of information are considered simultaneously to generate interesting patterns: the spatial information derived from a complex document image analysis process (layout analysis), the information extracted from the logical structure of the document (document image classification and understanding), and the textual information extracted by means of OCR. The proposed method is based on an inductive logic programming approach, which is argued to be the most appropriate for analyzing data available in more than one modality. The chapter shows a possible evolution of the unimodal knowledge discovery scheme, according to which different types of data describing the units of analysis are handled by applying some preprocessing technique that transforms them into a single double-entry table.


2011, pp. 1-31
Author(s): Dan A. Simovici

This chapter presents data mining techniques that make use of metrics defined on the set of partitions of finite sets. Partitions are naturally associated with object attributes, and major data mining problems such as classification, clustering, and data preparation benefit from an algebraic and geometric study of the metric space of partitions. The metrics we find most useful are derived from a generalization of the entropic metric. We discuss techniques that produce smaller classifiers, allow incremental clustering of categorical data, and help users better prepare training data for constructing classifiers. Finally, we discuss open problems and future research directions.
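
As a small illustration of a metric on partitions, the sketch below computes the Shannon-entropy distance d(P, Q) = H(P|Q) + H(Q|P) = 2 H(P ∧ Q) − H(P) − H(Q), where P ∧ Q is the common refinement of the two partitions. This is only the classical Shannon case, assumed here for simplicity; the chapter works with a generalization that this sketch does not implement. Each partition is represented as a list of block labels over the same objects (for example, one attribute column), and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the partition induced by the labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def partition_distance(p, q):
    """Entropic distance between two partitions of the same objects.

    p[i] and q[i] are the block labels of object i in each partition.
    """
    if len(p) != len(q):
        raise ValueError("partitions must cover the same objects")
    # Pairs of labels identify the blocks of the common refinement.
    joint = list(zip(p, q))
    return 2 * entropy(joint) - entropy(p) - entropy(q)


# Two attributes that induce the same partition (labels merely renamed).
same = partition_distance(["a", "a", "b", "b"], ["y", "y", "x", "x"])
# Two attributes whose partitions are as unrelated as possible here.
far = partition_distance(["a", "a", "b", "b"], ["x", "y", "x", "y"])
```

The distance depends only on how the objects are grouped, not on the label names, so two attributes that split the data identically are at distance zero.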


2011, pp. 106-123
Author(s): Gregor Leban, Minca Mramor, Blaž Zupan, Janez Demšar, Ivan Bratko

Data visualization plays a crucial role in data mining and knowledge discovery. Its use is, however, often difficult due to the large number of possible data projections. Manual search through such sets of projections can be prohibitively time-consuming or even impossible, especially in data analysis problems that involve many features. The chapter describes a method called VizRank, which can be used to automatically identify interesting data projections for multivariate visualizations of class-labeled data. VizRank assigns an interestingness score to each considered projection based on the degree of separation of data instances with different class labels. We demonstrate the usefulness of this approach on six cancer gene expression data sets, showing that the method can reveal interesting data patterns and can further be used for data classification and outlier detection.
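
The ranking idea can be sketched as follows. The abstract does not state how class separation is measured, so this sketch assumes a stand-in score: the leave-one-out accuracy of a k-nearest-neighbour classifier on each two-feature scatterplot projection. The function names, the choice of k, and the toy data are hypothetical, and the code is not the VizRank implementation.

```python
from collections import Counter
from itertools import combinations
from math import dist


def projection_score(points, labels, k=3):
    """Leave-one-out k-NN accuracy on a 2-D projection (assumed score)."""
    correct = 0
    for i, p in enumerate(points):
        # The k nearest other points in this projection.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        vote = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
        correct += vote == labels[i]
    return correct / len(points)


def rank_projections(rows, labels, k=3):
    """Score every pair of features and return (score, pair), best first."""
    scored = []
    for a, b in combinations(range(len(rows[0])), 2):
        points = [(r[a], r[b]) for r in rows]
        scored.append((projection_score(points, labels, k), (a, b)))
    return sorted(scored, reverse=True)


# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 do not.
rows = [
    (0.0, 5.0, 1.0), (0.1, 1.0, 4.0), (0.2, 3.0, 2.0), (0.3, 2.0, 5.0),
    (5.0, 5.1, 1.1), (5.1, 1.1, 4.1), (5.2, 3.1, 2.1), (5.3, 2.1, 5.1),
]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
ranked = rank_projections(rows, labels)
```

With many features the number of projections grows quickly, so a practical tool would need a search heuristic instead of the exhaustive loop used here.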


2011, pp. 198-219
Author(s): Laurent Candillier, Ludovic Denoyer, Patrick Gallinari, Marie Christine Rousset, Alexandre Termier, ...

XML documents are becoming ubiquitous because of their rich and flexible format, which can be used for a variety of applications. Given the increasing size of XML collections as information sources, mining techniques that traditionally exist for text collections or databases need to be adapted, and new methods need to be invented, to exploit the particular structure of XML documents. Basically, XML documents can be seen as trees, which are well known to be complex structures. This chapter describes various ways of using and simplifying this tree structure to model documents and support efficient mining algorithms. We focus on three mining tasks: classification and clustering, which are standard for text collections, and the discovery of frequent tree structures, which is especially important for heterogeneous collections. This chapter presents some recent approaches and algorithms to support these tasks, together with experimental evaluation on a variety of large XML collections.


2011, pp. 124-148
Author(s): Yeow Wei Choong, Anne Laurent, Dominique Laurent

In the context of multidimensional data, OLAP tools are appropriate for navigating the data with the aim of discovering pertinent and abstract knowledge. However, due to the size of the data set, a systematic and exhaustive exploration is not feasible. Therefore, the problem is to design automatic tools that ease navigation in the data and their visualization. In this chapter, we present a novel approach that automatically builds blocks of similar values in a given data cube; these blocks are meant to summarize the content of the cube. Our method is based on a levelwise algorithm (à la Apriori) whose complexity is shown to be polynomial in the number of scans of the data cube. The experiments reported in the chapter show that our approach is scalable, in particular in the case where the measure values present in the data cube are discretized using crisp or fuzzy partitions.


2011, pp. 240-275
Author(s): Cyrille J. Joutard, Edoardo M. Airoldi, Stephen E. Fienberg, Tanzy M. Love

Statistical models involving a latent structure often support clustering, classification, and other data-mining tasks. Parameterizations, specifications, and constraints of alternative models can be very different, however, and may lead to contrasting conclusions. Thus, model choice becomes a fundamental issue in applications, both methodological and substantive. Here, we work from a general formulation of hierarchical Bayesian models of mixed-membership that subsumes many popular models successfully applied to problems in the computing, social and biological sciences. We present both parametric and nonparametric specifications for discovering latent patterns. Context for the discussion is provided by novel analyses of the following two data sets: (1) 5 years of scientific publications from the Proceedings of the National Academy of Sciences; (2) an extract on the functional disability of Americans age 65+ from the National Long Term Care Survey. For both, we elucidate strategies for model choice, and our analyses bring new insights compared with earlier published analyses.

