Complex Data Warehousing and Knowledge Discovery for Advanced Retrieval Development
Latest Publications

TOTAL DOCUMENTS: 16 (five years: 0)
H-INDEX: 1 (five years: 0)

Published by IGI Global
ISBN: 9781605667485, 9781605667492

Author(s): Rodrigo Salvador Monteiro, Geraldo Zimbrão, Holger Schwarz, Bernhard Mitschang, Jano Moreira de Souza

Calendar-based pattern mining aims at identifying patterns on specific calendar partitions, for example every Monday, every first working day of a month, or every holiday. Providing flexible mining capabilities for calendar-based partitions is especially challenging in a data stream scenario: the calendar partitions of interest are not known a priori, and at each point in time only a subset of the detailed data is available. The authors show how a data warehouse approach can be applied to this problem. The data warehouse, which keeps track of frequent itemsets holding on different partitions of the original stream, has low storage requirements; nevertheless, it allows complete and precise sets of patterns to be derived. Furthermore, the authors demonstrate the effectiveness of their approach through a series of experiments.
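To illustrate what a calendar partition is, the sketch below encodes each partition as a predicate over dates; the partition names and predicates are illustrative examples, not the chapter's formalism.

```python
from datetime import date

# Each calendar partition is modelled as a function mapping a date to True
# when the date belongs to that partition (hypothetical encoding).
PARTITIONS = {
    "every Monday": lambda d: d.weekday() == 0,
    # First working day of the month: a weekday, with no earlier weekday
    # in the same month.
    "first working day of month": lambda d: d.weekday() < 5
        and all(date(d.year, d.month, k).weekday() >= 5
                for k in range(1, d.day)),
}

def partitions_of(d: date) -> list:
    """Return the names of all calendar partitions containing date d."""
    return [name for name, pred in PARTITIONS.items() if pred(d)]
```

A mined pattern can then be associated with every partition its timestamp falls into, which is the grouping the chapter's warehouse maintains per partition.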


Author(s): Todd Eavis

In multi-dimensional database environments, such as those typically associated with contemporary data warehousing, we generally require effective indexing mechanisms for all but the smallest data sets. While numerous such methods have been proposed, the R-tree has emerged as one of the most common and reliable indexing models. Nevertheless, as user queries grow in terms of both size and dimensionality, R-tree performance can deteriorate significantly. Moreover, in the multi-terabyte spaces of today's enterprise warehouses, the combination of data and indexes, R-tree or otherwise, can produce unacceptably large storage requirements. In this chapter, the authors present a framework that addresses both of these concerns. First, they propose a variation of the classic R-tree that specifically targets data warehousing architectures. Their new LBF R-tree not only improves performance on common user-defined range queries, but also gracefully degrades to a linear scan of the data on pathologically large queries. Experimental results demonstrate a reduction in disk seeks of more than 50% relative to more conventional R-tree designs. Second, the authors present a fully integrated, block-oriented compression model that reduces the storage footprint of both data and indexes. It does so by exploiting the same Hilbert space-filling curve that is used to construct the LBF R-tree itself. Extensive testing demonstrates compression rates of more than 90% for multi-dimensional data, and up to 98% for the associated indexes.
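The Hilbert space-filling curve underlying the LBF R-tree can be sketched with the standard two-dimensional coordinate-to-distance mapping below; this is the textbook algorithm, not the authors' LBF R-tree implementation.

```python
def xy2d(n: int, x: int, y: int) -> int:
    """Hilbert-curve distance of cell (x, y) on an n x n grid
    (n must be a power of two). Nearby distances map to nearby cells,
    which is what makes the ordering useful for indexing and compression."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the base pattern repeats.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Sorting points by `xy2d` linearises the space while preserving locality, so consecutive index blocks cover spatially close data, which is what the chapter's block-oriented compression exploits.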


Author(s): Svetlana Mansmann, Thomas Neumuth, Oliver Burgert, Matthias Röger

The emerging area of business process intelligence aims at enhancing the analysis power of business process management systems by employing performance-oriented technologies from data warehousing and mining. However, the differences in the assumptions and objectives of the underlying models, namely the business process model and the multidimensional data model, aggravate a straightforward and meaningful convergence of the two concepts. The authors present an approach to designing a data warehouse that enables the multidimensional analysis of business processes and their execution. The aims of such analysis are manifold, ranging from quantitative and qualitative assessment to process discovery, pattern recognition, and mining. The authors demonstrate that business processes and workflows represent a non-conventional application scenario for the data warehousing approach and that multiple challenges arise at various design stages. They describe deficiencies of conventional OLAP technology with respect to business process modeling and formulate the requirements for an adequate multidimensional presentation of process descriptions. Modeling extensions proposed at the conceptual level are verified by implementing them in a relational OLAP system, accessible via state-of-the-art visual front-end tools. The authors demonstrate the benefits of the proposed modeling framework by presenting relevant analysis tasks from the domain of medical engineering and showing the type of decision support provided by their solution.


Author(s): Matthew Gebski, Alex Penev, Raymond K. Wong

Traffic analysis is an important issue for network monitoring and security. The authors focus on identifying protocols in network traffic by analysing the size, timing and direction of network packets. Using these network stream characteristics, they propose a technique for modelling the behaviour of various TCP protocols. This model can be used to recognise protocols even when they run under encrypted tunnels. This is complemented with an experimental evaluation on real-world network data.
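A toy illustration of classification from packet size and direction follows; the protocol profiles below are invented placeholders (not measured fingerprints), and the nearest-profile matcher is a stand-in for the authors' model.

```python
from statistics import mean

# Hypothetical flow encoding: each number is a payload size, with sign giving
# direction (+ = client-to-server, - = server-to-client).
PROFILES = {
    "ssh":  [+48, -48, +96, -480],
    "http": [+320, -1460, -1460, +64],
}

def features(flow):
    """Summarise a flow: mean outbound size, mean inbound size, outbound ratio."""
    out = [p for p in flow if p > 0]
    inb = [-p for p in flow if p < 0]
    return (mean(out) if out else 0.0,
            mean(inb) if inb else 0.0,
            len(out) / len(flow))

def classify(flow):
    """Assign the profile whose feature vector is closest (squared Euclidean)."""
    fx = features(flow)
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(PROFILES, key=lambda name: dist(fx, features(PROFILES[name])))
```

Because sizes, timings and directions survive encryption (only payload content is hidden), such features remain usable for tunnelled traffic, which is the point the chapter makes.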


Author(s): Jia-Ling Koh, Shu-Ning Shin, Yuan-Bin Don

A data stream is an unbounded sequence of data elements generated at a rapid rate, and it provides a dynamic environment for collecting data sources. The knowledge embedded in a data stream is likely to change quickly as time goes by. Therefore, catching the recent trend of the data is an important issue when mining frequent itemsets over data streams. Although the sliding window model offers a good solution to this problem, the traditional approach has to maintain the occurrence information of patterns within a sliding window completely. To estimate the approximate supports of patterns within a sliding window, the frequency changing point (FCP) method is proposed for monitoring the recent occurrences of itemsets over a data stream. In addition to a basic design proposed under the assumption that exactly one transaction arrives at each time point, the FCP method is extended to maintain recent patterns over a data stream in which a block containing a varying number of transactions (zero or more) arrives within each fixed time unit. Accordingly, the recently frequent itemsets or representative patterns are discovered approximately from the maintained structure. Experimental studies demonstrate that the proposed algorithms achieve high true-positive rates and guarantee no false dismissals in the results yielded; a theoretical analysis is provided for this guarantee. In addition, the authors' approach significantly outperforms the previously proposed method in terms of run-time memory usage.
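For context, the baseline sliding-window model that the FCP method improves on can be sketched as exact counting over the last w transactions; this is the memory-hungry baseline, not the authors' FCP structure, and is limited to pairs for brevity.

```python
from collections import deque
from itertools import combinations

class SlidingWindowMiner:
    """Exact frequent-pair mining over the most recent `width` transactions."""

    def __init__(self, width, min_support):
        self.window = deque()            # the last `width` transactions
        self.width = width
        self.min_support = min_support   # minimum fraction of the window

    def add(self, transaction):
        self.window.append(frozenset(transaction))
        if len(self.window) > self.width:
            self.window.popleft()        # expire the oldest transaction

    def frequent_pairs(self):
        counts = {}
        for t in self.window:
            for pair in combinations(sorted(t), 2):
                counts[pair] = counts.get(pair, 0) + 1
        threshold = self.min_support * len(self.window)
        return {p for p, c in counts.items() if c >= threshold}
```

The FCP method avoids keeping all of this occurrence information by recording only the points where an itemset's frequency status changes, trading exactness for bounded memory while still guaranteeing no false dismissals.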


Author(s): Manoranjan Dash, Vivekanand Gopalkrishnan

Feature selection and tuple selection help the classifier to focus, achieving accuracy similar to (or even better than) classification without them. Although feature selection and tuple selection have been studied in various research areas such as machine learning and data mining, they have rarely been studied together. The contribution of this chapter is a novel distance measure for selecting the most representative features and tuples. The experiments are conducted over microarray gene expression datasets as well as UCI machine learning and KDD datasets. Results show that the proposed method significantly outperforms existing methods.
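A hedged sketch of distance-based feature ranking in the same spirit: the measure below (class-mean separation over pooled spread) is a simple stand-in, not the chapter's proposed distance measure.

```python
from statistics import mean, pstdev

def feature_scores(rows, labels):
    """Score each feature by how far apart the two class means sit,
    normalised by the pooled spread of the feature's values.
    rows: list of feature vectors; labels: binary class labels (0/1)."""
    scores = []
    for j in range(len(rows[0])):
        a = [r[j] for r, y in zip(rows, labels) if y == 0]
        b = [r[j] for r, y in zip(rows, labels) if y == 1]
        spread = pstdev(a + b) or 1.0   # guard against constant features
        scores.append(abs(mean(a) - mean(b)) / spread)
    return scores

def select_top(rows, labels, k):
    """Indices of the k most class-separating features."""
    s = feature_scores(rows, labels)
    return sorted(range(len(s)), key=lambda j: -s[j])[:k]
```

Tuple selection can be treated symmetrically by scoring rows instead of columns, which is the joint view the chapter argues for.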


Author(s): Jérôme Cubillé, Christian Derquenne, Sabine Goutier, Françoise Guisnel, Henri Klajnmic, ...

This chapter is concerned with static and dynamic discovery-driven exploration of a data cube. It presents different methods to facilitate the whole process of data exploration. Each kind of analysis (static or dynamic) is developed for either a count measure or a quantitative measure, and both are based on the on-the-fly calculation of specific built-in statistical indicators. First, a global methodology is proposed to support dynamic discovery-driven exploration; it aims at identifying the most relevant dimensions to expand, and a built-in ranking of dimensions is presented interactively at each step of the process. Second, to support static discovery-driven exploration, generalized statistical criteria are detailed to detect and highlight interesting cells within a cube slice. A cell's degree of interest is determined by calculating either its test-value or its chi-square contribution, and interesting cells are displayed through a color-coding system. A proof-of-concept implementation on the Oracle 10g system is described at the end of the chapter.
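The chi-square contribution mentioned above is a standard statistic: for a count measure, each cell's contribution measures how far its observed count deviates from the count expected under row/column independence. A minimal computation over a two-dimensional slice:

```python
def chi2_contributions(table):
    """Per-cell chi-square contributions for a 2-D table of positive counts:
    (observed - expected)^2 / expected, where expected assumes row/column
    independence. Large contributions flag cells worth highlighting."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    out = []
    for i, row in enumerate(table):
        out.append([])
        for j, obs in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            out[i].append((obs - expected) ** 2 / expected)
    return out
```

Mapping each contribution to a colour scale then gives the kind of highlighted slice the chapter describes.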


Author(s): Ronnie Alves, Joel Ribeiro, Orlando Belo, Jiawei Han

Business organizations must pay attention to interesting changes in customer behavior in order to anticipate customers' needs and respond with appropriate business actions. Tracking customers' commercial paths through the products they are interested in is an essential technique for improving business and increasing customer satisfaction. Data warehousing (DW) provides the basic means to record every customer transaction according to the different business strategies established. Although managing such huge amounts of records may yield business advantage, their exploration, especially in a multidimensional space (MDS), is a nontrivial task: the more dimensions we want to explore, the higher the computational costs involved in multidimensional data analysis (MDA). To make MDA practical for real-world business problems, DW researchers have been combining data cubing and mining techniques to detect interesting changes in an MDS. Such changes can also be detected through gradient queries. While those studies have provided the basis for future research in MDA, only a few of them address preference query selection in an MDS. Thus, not only is the exploration of changes in an MDS an essential task, but ranking the most interesting gradients is even more important. In this chapter, the authors investigate how to mine and rank the most interesting changes in an MDS by applying a top-k gradient strategy. Additionally, they propose a gradient-based cubing method to evaluate interesting gradient regions in an MDS; the challenge is to find maximum gradient regions (MGRs) that maximize the task of ranking gradients. The authors' evaluation study demonstrates that the proposed method is a promising strategy for ranking gradients in an MDS.
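A toy sketch of a top-k gradient query over cube cells that differ in exactly one dimension value; this pairwise scan is illustrative only, whereas the authors' gradient-based cubing prunes the search space far more aggressively.

```python
import heapq

def top_k_gradients(cells, k):
    """cells: dict mapping dimension-value tuples to a positive measure.
    Returns the k (ratio, cell_a, cell_b) triples with the largest
    relative change between cells differing in exactly one dimension."""
    grads = []
    keys = list(cells)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if sum(x != y for x, y in zip(a, b)) == 1:  # neighbouring cells
                lo, hi = sorted((cells[a], cells[b]))
                if lo > 0:
                    grads.append((hi / lo, a, b))
    return heapq.nlargest(k, grads)
```

For example, comparing sales of the same region across years, or of different regions in the same year, surfaces the steepest changes first, the gradients a decision-maker most wants to see.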


Author(s): Guillaume Cabanac, Max Chevalier, Franck Ravat, Olivier Teste

This chapter deals with an annotation-based decisional system. The decisional system the authors present is based on multidimensional databases, which are composed of facts and dimensions. The expertise of decision-makers is modeled, shared and stored through annotations, which allow decision-makers to carry out active analyses and to collaborate with other decision-makers on a common analysis.
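A minimal sketch of anchoring annotations to multidimensional coordinates; the class and field names below are assumptions for illustration, not the authors' model.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    author: str
    text: str
    anchor: tuple  # sorted (dimension, member) pairs identifying the cell

@dataclass
class AnnotatedCube:
    annotations: list = field(default_factory=list)

    def annotate(self, author, text, **coords):
        """Attach a note to the cell identified by dimension members."""
        anchor = tuple(sorted(coords.items()))
        self.annotations.append(Annotation(author, text, anchor))

    def annotations_at(self, **coords):
        """Retrieve every note anchored at exactly these coordinates."""
        key = tuple(sorted(coords.items()))
        return [a for a in self.annotations if a.anchor == key]
```

Storing the anchor alongside the note is what lets a second decision-maker land on the same cell during their own analysis and see a colleague's remark in context.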


Author(s): Fadila Bentayeb, Cécile Favre, Omar Boussaid

A data warehouse allows the integration of heterogeneous data sources for identified analysis purposes. The data warehouse schema is designed according to the available data sources and the users' analysis requirements. In order to answer new individual analysis needs, the authors previously proposed, in recent work, a solution for on-line analysis personalization. Their solution follows a user-driven approach to data warehouse schema evolution, which consists of creating new hierarchy levels in OLAP (on-line analytical processing) dimensions. One of the main objectives of OLAP, as the acronym itself suggests, is performance during the analysis process. Since data warehouses contain large volumes of data, answering decision queries efficiently requires particular access methods. The main approach is to use redundant optimization structures such as views and indices, which involves selecting an appropriate set of materialized views and indices that minimizes total query response time given a limited storage space. A judicious selection must be cost-driven and based on a workload representing a set of users' queries on the data warehouse. In this chapter, the authors address the issues of workload evolution and maintenance in data warehouse systems in response to the new requirements resulting from users' personalized analysis needs. The main goal is to avoid regenerating the workload from scratch. Hence, they propose a workload management system that helps the administrator dynamically maintain and adapt the workload according to changes arising in the data warehouse schema. To achieve this maintenance, the authors propose two types of workload updates: (1) maintaining existing queries consistent with respect to the new data warehouse schema, and (2) creating new queries based on the new dimension hierarchy levels.
Their system helps the administrator adopt a proactive behaviour in managing data warehouse performance. To validate their workload management system, the authors address the implementation issues of their prototype, which has been developed in a client/server architecture with a Web client interfacing with the Oracle 10g database management system.
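The second update type, deriving new workload queries when a hierarchy level is added to a dimension, can be sketched as a simple query rewrite; the table and column names below are illustrative, while the chapter's system operates on a real Oracle 10g workload.

```python
def roll_up_query(query, old_level, new_level):
    """Rewrite a GROUP BY query so it aggregates at a new, coarser
    hierarchy level (naive textual substitution for illustration)."""
    return query.replace(old_level, new_level)

# Hypothetical workload query at the 'city' level of a customer dimension:
base = ("SELECT d.city, SUM(f.amount) "
        "FROM sales f JOIN customer_dim d ON f.cust_id = d.cust_id "
        "GROUP BY d.city")

# Adding a 'region' level above 'city' yields a new query for the workload:
new_query = roll_up_query(base, "d.city", "d.region")
```

A production system would of course manipulate a parsed query representation rather than raw strings, but the principle, generating coarser variants of existing queries instead of rebuilding the workload from scratch, is the one the chapter automates.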

