Data Warehousing and Mining
Latest Publications


TOTAL DOCUMENTS

232
(FIVE YEARS 0)

H-INDEX

4
(FIVE YEARS 0)

Published By IGI Global

9781599049519, 9781599049526

2008 ◽  
pp. 3235-3251
Author(s):  
Yongqiao Xiao ◽  
Jenq-Foung Yao ◽  
Guizhen Yang

Recent years have witnessed a surge of research interest in knowledge discovery from data domains with complex structures, such as trees and graphs. In this paper, we address the problem of mining maximal frequent embedded subtrees, which is motivated by such important applications as mining “hot” spots of Web sites from Web usage logs and discovering significant “deep” structures from tree-like bioinformatic data. One major challenge arises from the fact that embedded subtrees are not ordinary subtrees, but preserve only part of the ancestor-descendant relationships in the original trees. To solve the embedded subtree mining problem, we propose a novel algorithm, called TreeGrow, which is optimized in two important respects. First, it obtains frequency counts of root-to-leaf paths through efficient compression of trees, thereby being able to quickly grow an embedded subtree pattern path by path instead of node by node. Second, candidate subtree generation is highly localized so as to avoid unnecessary computational overhead. Experimental results on benchmark synthetic data sets have shown that our algorithm can outperform unoptimized methods by up to 20 times.
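The path-by-path growth idea rests on counting root-to-leaf paths across a forest. A minimal sketch of that counting step only (not the TreeGrow algorithm itself; the `(label, children)` tree representation and function names are illustrative):

```python
from collections import Counter

def root_to_leaf_paths(tree):
    """Yield every root-to-leaf label path of a tree given as
    (label, [children]) tuples."""
    label, children = tree
    if not children:
        yield (label,)
        return
    for child in children:
        for path in root_to_leaf_paths(child):
            yield (label,) + path

def frequent_paths(forest, min_support):
    """Count in how many trees each root-to-leaf path occurs and
    keep the paths meeting the support threshold."""
    counts = Counter()
    for tree in forest:
        for path in set(root_to_leaf_paths(tree)):  # count once per tree
            counts[path] += 1
    return {p: c for p, c in counts.items() if c >= min_support}
```

Frequent paths found this way would then serve as the units from which embedded-subtree patterns are grown.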


2008 ◽  
pp. 3194-3211
Author(s):  
Simon K. Milton ◽  
Ed Kazmierczak

Data modelling languages are used in today’s information systems engineering environments. Many are surrounded by a degree of hype about their quality and applicability, with narrow and specific justifications often given in support of one over another. We want to understand the fundamental nature of data modelling languages more deeply. We thus propose a theory, based on ontology, that should allow us to understand, compare, evaluate, and strengthen data modelling languages. In this paper we present a method (conceptual evaluation) and its extension (conceptual comparison) as part of our theory. Our methods are largely independent of any specific ontology. We introduce Chisholm’s ontology and use it to apply our methods to several data modelling languages. We find a good degree of overlap between all of the data modelling languages analysed and the core concepts of Chisholm’s ontology, and conclude that the data modelling languages investigated reflect an ontology of commonsense realism.


2008 ◽  
pp. 3176-3193
Author(s):  
Ying Chen ◽  
Frank Dehne ◽  
Todd Eavis ◽  
A. Rau-Chaplin

This paper presents an improved parallel method for generating ROLAP data cubes on a shared-nothing multiprocessor based on a novel optimized data partitioning technique. Since no shared disk is required, our method can be used for highly scalable processor clusters consisting of standard PCs with local disks only, connected via a data switch. Experiments show that our improved parallel method provides optimal, linear, speedup for at least 32 processors. The approach taken, which uses a ROLAP representation of the data cube, is well suited for large data warehouses and high dimensional data, and supports the generation of both fully materialized and partially materialized data cubes.
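Shared-nothing cube generation works because distributive aggregates (counts, sums) merge by simple addition: each node can build a complete cube over its own partition of the fact rows, and the partial cubes are then merged. A toy single-process sketch of that principle (illustrative only; the paper's optimized partitioning is far more refined):

```python
from itertools import combinations
from collections import defaultdict

def local_cube(rows, dims):
    """On one 'node', aggregate (row counts here) every group-by over
    each subset of the dimensions -- a full local ROLAP data cube."""
    cube = defaultdict(int)
    for row in rows:
        for r in range(len(dims) + 1):
            for subset in combinations(dims, r):
                key = (subset, tuple(row[d] for d in subset))
                cube[key] += 1
    return cube

def merge_cubes(cubes):
    """Merge per-partition cubes; counts are distributive, so merging
    is just a sum per group-by cell."""
    total = defaultdict(int)
    for cube in cubes:
        for key, cnt in cube.items():
            total[key] += cnt
    return total
```

Merging the cubes of disjoint partitions yields exactly the cube of the full dataset, which is what makes the cluster nodes independent.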


2008 ◽  
pp. 3142-3163
Author(s):  
Rodrigo Salvador Monteiro ◽  
Geraldo Zimbrao ◽  
Holger Schwarz ◽  
Bernhard Mitschang ◽  
Jano Moreira de Souza

This chapter presents the core of the DWFIST approach, which is concerned with supporting the analysis and exploration of frequent itemsets and derived patterns, e.g., association rules in transactional datasets. The goal of this new approach is to provide: (1) flexible pattern-retrieval capabilities without requiring the original data during the analysis phase; and (2) a standard modeling for data warehouses of frequent itemsets, allowing an easier development and reuse of tools for analysis and exploration of itemset-based patterns. Instead of storing the original datasets, our approach organizes frequent itemsets holding on different partitions of the original transactions in a data warehouse that retains sufficient information for future analysis. A running example for mining calendar-based patterns on data streams is presented. Staging area tasks are discussed and standard conceptual and logical schemas are presented. Properties of this standard modeling allow retrieval of frequent itemsets holding on any set of partitions, along with upper and lower bounds on their frequency counts. Furthermore, precision guarantees for some interestingness measures of association rules are provided as well.
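The frequency bounds mentioned above follow from simple arithmetic: a partition that stored the itemset contributes its exact count, while any other partition can hide at most minsup − 1 occurrences of it (otherwise it would have been stored there). A sketch under that reading (data layout and names are illustrative, not the DWFIST schema):

```python
def count_bounds(itemset, partitions, minsup_count):
    """Lower/upper bounds on an itemset's total count when only each
    partition's frequent itemsets were materialized.

    partitions: list of (stored_counts, n_transactions), where
    stored_counts maps the itemsets frequent in that partition to counts.
    """
    lower = upper = 0
    for stored, n in partitions:
        if itemset in stored:
            lower += stored[itemset]
            upper += stored[itemset]
        else:
            # Not stored: it occurred fewer than minsup_count times
            # there, possibly zero times.
            upper += min(minsup_count - 1, n)
    return lower, upper
```

Any union of partitions can be queried this way, which is what gives the warehouse its flexible pattern-retrieval capability.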


2008 ◽  
pp. 3116-3141
Author(s):  
Shi-Ming Huang ◽  
David C. Yen ◽  
Hsiang-Yuan Hsueh

The materialized view approach is widely adopted in implementations of data warehouse systems for efficiency purposes. In terms of the construction of a materialized data warehouse system, some managerial problems still exist for most developers and users, in the view resource maintenance area in particular. Resource redundancy and data inconsistency among materialized views in a data warehouse system is a problem that many developers and users struggle with. In this article, a space-efficient protocol for materialized view maintenance with a global data view on data warehouses with embedded proxies is proposed. In the protocol set, multilevel proxy-based protocols with a data compensating mechanism are provided to ensure the consistency and uniqueness of materialized data among data resources and materialized views. The authors also provide a set of evaluation experiences and derivations to verify the feasibility of the proposed protocols and mechanisms. With such protocols as proxy services, the performance and space utilization of the materialized view approach will be improved. Furthermore, the consistency issue among materialized data warehouses and heterogeneous data sources can be properly addressed by applying a dynamic compensating and synchronization mechanism. The trade-off between efficiency, storage consumption, and data validity for view maintenance tasks can be properly balanced.


2008 ◽  
pp. 3085-3115
Author(s):  
Biren Shah ◽  
Karthik Ramachandran ◽  
Vijay Raghavan

Materialized view selection is one of the crucial decisions in designing a data warehouse for optimal efficiency. Static selection of views may materialize certain views that are not beneficial as the data and usage trends change over time. Conversely, dynamic selection of views works better only for queries demanding a high degree of aggregation. These facts point to the need for a technique that combines the improved response time of the static approach and the automated tuning capability of the dynamic approach. In this article, we propose a hybrid approach for the selection of materialized views. The idea is to partition the collection of all views into a static and a dynamic set such that views selected for materialization from the static set are persistent over multiple query (and maintenance) windows, whereas views selected from the dynamic set can be queried and/or replaced on the fly. Highly aggregated views are selected on the fly based on the query access patterns of users, whereas the more detailed static set of views plays a significant role in the efficient maintenance of the dynamic set of views and in answering certain detailed view queries. We prove that our proposed strategy satisfies the monotonicity requirement, which is essential for the greedy heuristic to deliver competitive solutions. Experimental results show that our approach outperforms DynaMat, a well-known dynamic view management system that is known to outperform optimal static view selection.
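The greedy heuristic referred to here repeatedly picks the view with the largest marginal benefit; monotonicity (a view's marginal benefit never grows as more views are materialized) is what guarantees the greedy result stays within a constant factor of optimal. A generic sketch of that loop, with a caller-supplied `benefit` function (names are illustrative, not the paper's implementation):

```python
def greedy_select(views, benefit, k):
    """Greedy materialized-view selection: pick up to k views, each time
    taking the one whose marginal benefit over the already-chosen set
    is largest.  `benefit(view, chosen)` is assumed monotone: it never
    increases as `chosen` grows."""
    chosen = []
    for _ in range(k):
        candidates = [v for v in views if v not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda v: benefit(v, chosen))
        if benefit(best, chosen) <= 0:
            break  # nothing left that helps
        chosen.append(best)
    return chosen
```

In the hybrid scheme, a loop like this would run over the dynamic candidate set per query window, while the static set stays fixed across windows.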


2008 ◽  
pp. 3067-3084
Author(s):  
John Talburt ◽  
Richard Wang ◽  
Kimberly Hess ◽  
Emily Kuo

This chapter introduces abstract algebra as a means of understanding and creating data quality metrics for entity resolution, the process in which records determined to represent the same real-world entity are successively located and merged. Entity resolution is a particular form of data mining that is foundational to a number of applications in both industry and government. Examples include commercial customer recognition systems and information sharing on “persons of interest” across federal intelligence agencies. Despite the importance of these applications, most of the data quality literature focuses on measuring the intrinsic quality of individual records rather than the quality of record grouping or integration. In this chapter, the authors describe current research into the creation and validation of quality metrics for entity resolution, primarily in the context of customer recognition systems. The approach is based on an algebraic view of the system as creating a partition of a set of entity records based on the indicative information for the entities in question. In this view, the relative quality of entity identification between two systems can be measured in terms of the similarity between the partitions they produce. The authors discuss the difficulty of applying statistical cluster analysis to this problem when the datasets are large and propose an alternative index suitable for these situations. They also report some preliminary experimental results, and outline areas and approaches for further research in this area.
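Classical pair-counting indices (e.g., Rand) compare all record pairs and become expensive at scale; an index built from block overlaps avoids pair enumeration entirely. The sketch below is one such block-overlap formulation, given as a hypothetical illustration of the idea rather than the specific index the authors propose: it equals 1 exactly when the two partitions coincide and shrinks as their blocks fragment one another.

```python
from math import sqrt

def partition_similarity(a, b):
    """Similarity of two partitions of the same record set, computed
    from block overlaps rather than record pairs, so cost is roughly
    proportional to the number of intersecting block pairs.
    a, b: lists of disjoint blocks (sets of record ids).
    Returns a value in (0, 1]; 1 iff the partitions are identical."""
    overlaps = sum(1 for x in a for y in b if x & y)
    return sqrt(len(a) * len(b)) / overlaps
```

Because every block intersects at least one block of the other partition, the overlap count is at least `max(len(a), len(b))`, which keeps the value at or below 1.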


2008 ◽  
pp. 2993-3004
Author(s):  
George Tzanis ◽  
Christos Berberidis

Association rule mining is a popular task that involves the discovery of co-occurrences of items in transaction databases. Several extensions of the traditional association rule mining model have been proposed so far; however, the problem of mining for mutually exclusive items has not been directly tackled yet. Such information could be useful in various cases (e.g., when the expression of one gene excludes the expression of another), or it can be used as a serious hint in order to reveal inherent taxonomical information. In this article, we address the problem of mining pairs of items such that the presence of one excludes the other. First, we provide a concise review of the literature; then we define the problem, propose a probability-based evaluation metric, and finally present a mining algorithm that we test on transaction data.
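One way to make "presence of one excludes the other" probability-based is to ask how likely a co-occurrence count of zero would be if the two items were independent; only pairs where that is improbable are reported. The sketch below follows that reasoning as an illustration; the article's actual metric and algorithm may differ:

```python
from itertools import combinations

def mutually_exclusive_pairs(transactions, min_item_support, alpha=0.01):
    """Find pairs of individually frequent items that never co-occur,
    keeping a pair only when zero co-occurrence would have probability
    below `alpha` under independence of the two items."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = [i for i, c in counts.items() if c / n >= min_item_support]
    pairs = []
    for a, b in combinations(sorted(frequent), 2):
        if any(a in t and b in t for t in transactions):
            continue  # they do co-occur somewhere
        # P(no transaction contains both) if a and b were independent
        p_zero = (1 - (counts[a] / n) * (counts[b] / n)) ** n
        if p_zero < alpha:
            pairs.append((a, b, p_zero))
    return pairs
```

The support floor screens out rare items, for which a zero joint count carries little evidence of genuine exclusion.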


2008 ◽  
pp. 2978-2992
Author(s):  
Jianting Zhang ◽  
Wieguo Liu ◽  
Le Gruenwald

Decision trees (DTs) have been widely used for training and classification of remotely sensed image data due to their capability to generate human-interpretable decision rules and their relatively fast speed in training and classification. This chapter proposes a successive decision tree (SDT) approach, where the samples in the ill-classified branches of a previous resulting decision tree are used to construct a successive decision tree. The decision trees are chained together through pointers and used for classification. SDT aims at constructing more interpretable decision trees while attempting to improve classification accuracies. The proposed approach is applied to two real remotely sensed image datasets for evaluation in terms of classification accuracy and interpretability of the resulting decision rules.
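The chaining logic can be sketched independently of how each individual tree is trained. In this illustrative skeleton, `train_tree`, `is_well_classified`, and `apply_tree` are placeholders supplied by the caller, and a plain list stands in for the pointer chain the chapter describes:

```python
def train_sdt(samples, train_tree, is_well_classified, max_trees=5):
    """Successive decision trees: train a tree, keep only the samples
    it classifies poorly, and train the next tree on those alone."""
    chain = []
    remaining = samples
    for _ in range(max_trees):
        if not remaining:
            break
        tree = train_tree(remaining)
        chain.append(tree)
        remaining = [s for s in remaining if not is_well_classified(tree, s)]
    return chain

def classify_sdt(chain, sample, apply_tree):
    """Walk the chain; return the first confident prediction, falling
    back to the last tree's answer.  apply_tree -> (label, confident)."""
    pred = None
    for tree in chain:
        pred, confident = apply_tree(tree, sample)
        if confident:
            return pred
    return pred
```

Each later tree is trained on a smaller, harder subset, which is what lets the individual trees stay small and interpretable.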


2008 ◽  
pp. 2943-2963
Author(s):  
Malcolm J. Beynon

The efficacy of data mining lies in its ability to identify relationships amongst data. This chapter argues that this efficacy is constrained by the quality of the data analysed, including whether the data are imprecise or, in the worst case, incomplete. Through a description of Dempster-Shafer theory (DST), a general methodology based on uncertain reasoning, it argues that traditional data mining techniques are not structured to handle such imperfect data and instead require the external management of missing values, and so forth. One DST-based technique is classification and ranking belief simplex (CaRBS), which allows intelligent data mining through the acceptance of missing values in the data analysed, treating them as a factor of ignorance rather than requiring their external management. Results presented here, using CaRBS and a number of simplex plots, show the effect of managing, and of not managing, imperfect data.
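The operation at the heart of DST is Dempster's rule for combining two independent bodies of evidence, each expressed as a mass function over focal sets; assigning mass to the whole frame is exactly how ignorance (e.g., a missing value) is represented. A standard sketch of the rule (the general DST operation, not CaRBS itself):

```python
def combine(m1, m2):
    """Dempster's rule of combination for two mass functions given as
    {frozenset_of_hypotheses: mass}.  Products landing on an empty
    intersection are conflict mass, renormalised away at the end."""
    combined = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}
```

Mass placed on the full frame of discernment (here, a focal set containing every hypothesis) propagates through the rule without forcing a decision, which is how a DST-based classifier can accept missing values instead of imputing them.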

