Integrations of Data Warehousing, Data Mining and Database Technologies
Latest Publications

TOTAL DOCUMENTS: 14 (FIVE YEARS: 0)
H-INDEX: 2 (FIVE YEARS: 0)

Published by IGI Global
ISBN: 9781609605377, 9781609605384

Author(s): Oscar Romero, Alberto Abelló

In recent years, data warehousing systems have gained relevance as a support for decision making within organizations. The core component of these systems is the data warehouse, and it is now widely accepted that data warehouse design must follow the multidimensional paradigm. Accordingly, many methods have been presented to support the multidimensional design of the data warehouse. The first methods introduced were requirement-driven, but the semantics of the data warehouse (which results from homogenizing and integrating relevant data of the organization into a single, detailed view of the organization's business) require that the data sources also be considered during the design process. Considering the data sources gave rise to several data-driven methods that automate the data warehouse design process, mainly from relational data sources. Research on multidimensional modeling remains an active topic, with two main research lines. On the one hand, new hybrid automatic methods have been introduced that combine data-driven and requirement-driven approaches; these methods focus on automating the whole process and improving the feedback between the two approaches to produce better results. On the other hand, some new approaches consider scenarios other than relational sources; these methods handle (semi-)structured data sources, such as ontologies or XML, that have gained relevance in recent years, and they introduce innovative solutions for overcoming the heterogeneity of the data sources. All in all, we discuss the current state of multidimensional modeling by carrying out a survey of multidimensional design methods. We present the most relevant methods introduced in the literature and a detailed comparison showing the main features of each approach.
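
As a concrete illustration of the multidimensional paradigm these design methods target, here is a minimal Python sketch of a star schema: a fact table of measures keyed into dimension tables, plus a simple roll-up. All table, attribute, and value names are hypothetical.

```python
# Minimal sketch of a star schema, the core structure that multidimensional
# design methods produce. All names and values below are hypothetical.

# Dimension tables: descriptive context, one row per member.
dim_date = {1: {"day": "2011-03-01", "month": "2011-03", "year": 2011},
            2: {"day": "2011-03-02", "month": "2011-03", "year": 2011}}
dim_product = {10: {"name": "widget", "category": "hardware"},
               11: {"name": "gizmo", "category": "hardware"}}

# Fact table: foreign keys into the dimensions plus numeric measures.
fact_sales = [
    {"date_id": 1, "product_id": 10, "units": 3, "revenue": 30.0},
    {"date_id": 1, "product_id": 11, "units": 1, "revenue": 25.0},
    {"date_id": 2, "product_id": 10, "units": 2, "revenue": 20.0},
]

# A roll-up: aggregate the 'revenue' measure along the product hierarchy
# (product -> category), the kind of analysis this design enables.
revenue_by_category = {}
for row in fact_sales:
    category = dim_product[row["product_id"]]["category"]
    revenue_by_category[category] = revenue_by_category.get(category, 0.0) + row["revenue"]

print(revenue_by_category)  # {'hardware': 75.0}
```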


Author(s): Rosine Cicchetti, Lotfi Lakhal, Sébastien Nedjar, Noël Novelli, Alain Casali

Datacubes are especially useful for efficiently answering queries over data warehouses. Nevertheless, the amount of aggregated data they generate is huge relative to the initial data, which is itself very large. Recent research has addressed the issue of summarizing datacubes in order to reduce their size. In this chapter, we present three different approaches, each proposing a structure that reduces the size of the datacube representation. The first two, the Closed Cube and the Quotient Cube, are semantic: they discard the redundancies captured within datacubes. The size of the underlying representations is greatly reduced, but the counterpart is additional response time when answering OLAP queries. The third approach is syntactic, since it enforces an optimization at the logical level. It is called the Partition Cube and is based on the concept of partition; we also give an algorithm to compute it. We further propose the Relational Partition Cube, a novel R-OLAP cubing solution for managing Partition Cubes using relational technology. An analytical evaluation shows that the storage space of Partition Cubes is smaller than that of datacubes. To confirm the analytical comparison, experiments compare our approach with datacubes and with two of the best reduction methods, the Quotient Cube and the Closed Cube.
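
For orientation, the following minimal sketch computes the structure these methods compress: a full datacube, i.e. one aggregate per group-by over every subset of the dimensions. Dimension names and data are hypothetical, and the sketch deliberately materializes everything to show where the size blow-up comes from.

```python
# A minimal sketch of full datacube computation: one aggregate per group-by
# over every subset of the dimensions. The summarization structures discussed
# here (Closed, Quotient, Partition cubes) compress exactly this output.
from itertools import combinations

DIMENSIONS = ("city", "product")       # hypothetical dimension names
rows = [
    {"city": "Lyon",  "product": "A", "qty": 2},
    {"city": "Lyon",  "product": "B", "qty": 3},
    {"city": "Paris", "product": "A", "qty": 5},
]

cube = {}
for k in range(len(DIMENSIONS) + 1):
    for dims in combinations(DIMENSIONS, k):        # one group-by per subset
        for row in rows:
            # 'ALL' marks the dimensions rolled up in this group-by.
            key = tuple(row[d] if d in dims else "ALL" for d in DIMENSIONS)
            cube[key] = cube.get(key, 0) + row["qty"]   # SUM(qty)

for key, total in sorted(cube.items()):
    print(key, total)
# ('ALL', 'ALL') 10  -- plus every finer-grained aggregate
```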


Author(s): Ladjel Bellatreche, Kamel Boukhalfa, Pascal Richard

Horizontal partitioning has evolved significantly in recent years and is widely advocated by the academic and industrial communities. It positively affects query performance, database manageability, and availability. Two types of horizontal partitioning are supported: primary and referential. In the context of relational data warehouses, horizontal fragmentation consists of partitioning the dimension tables by primary fragmentation and then fragmenting the fact table by referential fragmentation. This can generate a very large number of fragments, which may make the maintenance task very complicated. In this paper, we first focus on the evolution of horizontal partitioning in commercial DBMSs as motivated by decision support applications. Secondly, we formalize the referential fragmentation schema selection problem in the data warehouse and study the hardness of selecting an optimal solution. Due to its high complexity, we develop two algorithms, hill climbing and simulated annealing, with several variants, to select a near-optimal partitioning schema. We present ParAdmin, an advisor tool assisting administrators in using primary and referential partitioning during the physical design of their data warehouses. Finally, extensive experimental studies are conducted on the data set of the APB-1 benchmark to compare the quality of the proposed algorithms using a mathematical cost model. Based on these experiments, some recommendations are given to ensure the effective use of horizontal partitioning.
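
The following minimal sketch illustrates the two partitioning types the chapter builds on: primary fragmentation of a dimension table by a predicate on its own attributes, followed by referential fragmentation that routes each fact row to the fragment of the dimension row it references. Table, column, and predicate names are hypothetical.

```python
# A minimal sketch of primary vs. referential horizontal partitioning.
# All table, column, and predicate names are hypothetical.

customer = [  # dimension table
    {"cust_id": 1, "region": "EU"},
    {"cust_id": 2, "region": "US"},
    {"cust_id": 3, "region": "EU"},
]
sales = [     # fact table referencing the dimension via cust_id
    {"sale_id": 100, "cust_id": 1, "amount": 10.0},
    {"sale_id": 101, "cust_id": 2, "amount": 20.0},
    {"sale_id": 102, "cust_id": 3, "amount": 15.0},
]

# Primary fragmentation: split the dimension by a predicate on its attributes.
cust_fragments = {}
for row in customer:
    cust_fragments.setdefault(row["region"], []).append(row)

# Referential fragmentation: route each fact row to the fragment of the
# dimension row it references, so fact fragments mirror dimension fragments.
cust_to_fragment = {row["cust_id"]: region
                    for region, frag in cust_fragments.items() for row in frag}
sales_fragments = {}
for row in sales:
    sales_fragments.setdefault(cust_to_fragment[row["cust_id"]], []).append(row)

print({k: len(v) for k, v in sales_fragments.items()})  # {'EU': 2, 'US': 1}
```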


Author(s): Robert Wrembel

A data warehouse architecture (DWA) has been developed for the purpose of integrating data from multiple heterogeneous, distributed, and autonomous external data sources (EDSs), as well as for providing means for advanced analysis of the integrated data. The major components of this architecture include: an external data source (EDS) layer, an extraction-transformation-loading (ETL) layer, a data warehouse (DW) layer, and an on-line analytical processing (OLAP) layer. Methods of designing a DWA, research developments, and most of the commercially available DW technologies have tacitly assumed that a DWA is static. In practice, however, a DWA requires changes as a result of, among other things, the evolution of EDSs, changes in the real world represented in a DW, and new user requirements. Changes in the structures of EDSs impact the ETL, DW, and OLAP layers. Since such changes are frequent, developing a technology for handling them automatically or semi-automatically in a DWA is of high practical importance. This chapter discusses challenges in designing, building, and managing a DWA that supports the evolution of EDS structures, the evolution of an ETL layer, and the evolution of a DW. The challenges and their solutions presented here are based on the experience of building a prototype Evolving ETL and a prototype Multiversion Data Warehouse (MVDW). In detail, this chapter presents the following issues: the concept of the MVDW, an approach to querying the MVDW, an approach to handling the evolution of an ETL layer, a technique for sharing data between multiple DW versions, and two index structures for the MVDW.
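
To give the flavor of the multiversion idea (not the chapter's actual MVDW design), here is a minimal sketch in which each DW version carries its own schema and data, and a query is evaluated per version and annotated with its origin. Version names, attributes, and values are hypothetical.

```python
# A minimal sketch of the multiversion data warehouse idea: each DW version
# has its own schema and facts, and a cross-version query merges per-version
# results. This illustrates the concept only; names are hypothetical.

versions = {
    "V1": {"schema": ("product", "amount"),
           "facts": [("widget", 10.0), ("gizmo", 25.0)]},
    # V2 evolved: a 'channel' attribute was added to the fact schema.
    "V2": {"schema": ("product", "amount", "channel"),
           "facts": [("widget", 12.0, "web"), ("gizmo", 30.0, "store")]},
}

def total_amount(version):
    """Aggregate a measure inside one version, using that version's schema."""
    idx = versions[version]["schema"].index("amount")
    return sum(fact[idx] for fact in versions[version]["facts"])

# A cross-version query: evaluate per version, then report results together
# with the version they came from, since schemas (and semantics) may differ.
for v in versions:
    print(v, total_amount(v))   # V1 35.0, V2 42.0
```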


Author(s): Vasudha Bhatnagar, Sharanjit Kaur, Laurent Mignet

Clustering of data streams finds important applications in tracking the evolution of various phenomena in medical, meteorological, astrophysical, and seismic studies. Algorithms designed for this purpose can adapt the discovered clustering model to changes in data characteristics, but they cannot adapt to the user's requirements themselves. Based on this observation, we perform a comparative study of the different approaches taken by existing stream clustering algorithms and present a parameterized architectural framework that exploits the nuances of these algorithms. The framework empowers end users of KDD technology to build a clustering model tailored to their specific application needs and delivers results according to the user's requirements. We also present two assembled algorithms, G-kMeans and G-dbscan, that instantiate the proposed framework, and we compare their performance with existing stream clustering algorithms.
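
As background for the kind of incremental processing such algorithms perform, here is a minimal sketch of a generic online k-means update (not the authors' G-kMeans): each arriving point moves its nearest centroid, so the model adapts as the stream evolves. Initial centers and data are hypothetical.

```python
# A minimal sketch of incremental stream clustering via online k-means.
# This is a generic textbook update rule, not the authors' G-kMeans.
import math

centroids = [(0.0, 0.0), (10.0, 10.0)]   # k = 2, hypothetical initial centers
counts = [1, 1]

def absorb(point):
    """Assign an incoming point to its nearest centroid and update that
    centroid incrementally (per-cluster learning rate 1/count)."""
    j = min(range(len(centroids)),
            key=lambda i: math.dist(point, centroids[i]))
    counts[j] += 1
    eta = 1.0 / counts[j]
    centroids[j] = tuple(c + eta * (p - c)
                         for c, p in zip(centroids[j], point))

for point in [(1.0, 1.2), (9.5, 9.8), (0.5, 0.7)]:   # points arriving one at a time
    absorb(point)
print(centroids)
```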


The industrial use of open source Business Intelligence (BI) tools is becoming more common, but it is still not as widespread as for other types of software. It is therefore of interest to explore which possibilities are available for open source BI and to compare the tools. In this survey article, we consider the capabilities of a number of open source BI tools: Extract-Transform-Load (ETL) tools, database management systems (DBMSs), On-Line Analytical Processing (OLAP) servers, and OLAP clients. We find that, unlike the situation a few years ago, mature and powerful tools now exist in all these categories. However, their functionality still falls somewhat short of that found in commercial tools.


Author(s): Zoran Bosnic, Igor Kononenko

In machine learning, reliability estimates for individual predictions provide more information about individual prediction error than the average accuracy of a predictive model (e.g., relative mean squared error). Such reliability estimates may represent decisive information in risk-sensitive applications of machine learning (e.g., medicine, engineering, and business), where they enable users to distinguish between more and less reliable predictions. In the authors' previous work, they proposed eight reliability estimates for individual examples in regression and evaluated their performance. The results showed that the performance of each estimate varies strongly depending on the domain and the properties of the regression model. In this paper, they empirically analyze how the performance of the reliability estimates depends on data set and model properties. They present results showing that the reliability estimates perform better when used with more accurate regression models, in domains with a greater number of examples, and in domains with less noisy data.
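
To make "per-example reliability" concrete, here is a minimal sketch of one common flavor of such an estimate, the variance of bagged models' predictions; this is an illustrative stand-in, not the authors' exact formulation, and the synthetic dataset is an assumption.

```python
# A minimal sketch of a per-example reliability estimate for regression:
# the spread of bagged models' predictions (one common flavor of such
# estimates; not the authors' exact formulation).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = BaggingRegressor(n_estimators=30, random_state=0).fit(X, y)

# Collect each bootstrap model's prediction for every example.
per_model = np.stack([est.predict(X) for est in model.estimators_])

prediction = per_model.mean(axis=0)     # the bagged prediction itself
reliability = per_model.std(axis=0)     # higher std = less reliable prediction

# Flag the least reliable individual predictions for closer inspection.
worst = np.argsort(reliability)[-5:]
print(worst, reliability[worst])
```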


Author(s): Matteo Golfarelli, Stefano Rizzi

Optimizing decisions has become a vital factor for companies. In order to evaluate the impact of a decision beforehand, managers need reliable predictive systems. Though data warehouses enable analysis of past data, they cannot anticipate future trends. What-if analysis fills this gap by enabling users to simulate and inspect the behavior of a complex system under given hypotheses. A crucial issue in the design of what-if applications is finding an adequate formalism to conceptually express the underlying simulation model. In this paper we report on how, within the framework of a comprehensive design methodology, this can be accomplished by extending UML 2 with a set of stereotypes. Our proposal is centered on the use of activity diagrams enriched with object flows, aimed at expressing functional, dynamic, and static aspects in an integrated fashion. The paper is completed by examples taken from a real case study in the commercial domain.
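
For readers unfamiliar with what-if analysis itself (as opposed to its UML modeling, which is the paper's subject), here is a minimal sketch: a hypothetical scenario is applied to historical facts and the simulated outcome is inspected. The price-elasticity model and every number below are illustrative assumptions.

```python
# A minimal sketch of what-if analysis over warehoused data: apply a
# hypothetical scenario to historical facts and inspect the simulated outcome.
# The constant-elasticity demand model and all numbers are assumptions.

past_sales = [  # historical facts: (product, units, unit_price)
    ("widget", 100, 5.0),
    ("gizmo",   40, 12.0),
]

def simulate_revenue(price_change, elasticity=-1.5):
    """What-if scenario: change prices by `price_change` (e.g. +0.10 = +10%)
    and assume demand responds with constant elasticity."""
    total = 0.0
    for _, units, price in past_sales:
        new_units = units * (1 + price_change) ** elasticity
        total += new_units * price * (1 + price_change)
    return total

for change in (-0.10, 0.0, 0.10):     # inspect behavior under each hypothesis
    print(f"price change {change:+.0%}: revenue {simulate_revenue(change):,.2f}")
```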


Author(s): Leticia Gómez, Bart Kuijpers, Bart Moelans, Alejandro Vaisman

Geographic Information Systems (GIS) have been extensively used in various application domains, ranging from economic, ecological, and demographic analysis to city and route planning. Nowadays, organizations need sophisticated GIS-based Decision Support Systems (DSSs) to analyze their data with respect to geographic information, represented not only as attribute data but also in maps. Thus, vendors are increasingly integrating their products, leading to the concept of SOLAP (Spatial OLAP). In recent years, motivated by the explosive growth in the use of PDA devices, the field of moving object data has also been receiving attention from the GIS community, although little work has been done to provide moving object databases with OLAP capabilities. In the first part of this paper we survey the SOLAP literature. We then address the problem of trajectory analysis and review recent efforts regarding trajectory data warehousing and mining. We also provide an in-depth comparative study between two proposals: the GeoPKDD project (which makes use of the Hermes system), and Piet, a proposal for SOLAP and moving objects developed at the University of Buenos Aires, Argentina. Finally, we discuss future directions in the field, including SOLAP analysis over raster data.
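
To illustrate what a trajectory data warehouse pre-aggregates (a generic pattern, not the GeoPKDD or Piet design), here is a minimal sketch that buckets raw position samples into spatio-temporal cells and counts distinct moving objects per cell. Cell sizes and all data are hypothetical.

```python
# A minimal sketch of trajectory data warehousing: aggregate GPS-style
# trajectory samples into spatio-temporal cells, the kind of pre-aggregation
# a trajectory DW stores for SOLAP-style queries. All values are hypothetical.
from collections import defaultdict

# (object_id, x, y, t) samples for two hypothetical moving objects
points = [(1, 0.2, 0.7, 5), (1, 1.4, 0.9, 65), (2, 0.3, 0.6, 10), (2, 0.8, 1.7, 70)]

CELL, BUCKET = 1.0, 60                 # spatial cell size and time bucket (s)
visits = defaultdict(set)              # cell -> distinct objects seen there

for obj, x, y, t in points:
    cell = (int(x // CELL), int(y // CELL), int(t // BUCKET))
    visits[cell].add(obj)              # count distinct objects, not raw points

# Answer: how many distinct objects crossed each cell in each time bucket?
for cell, objs in sorted(visits.items()):
    print(cell, len(objs))
```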


Author(s): Pedro Furtado

Data warehouses are a crucial technology for competitive organizations in today's globalized world. Size, speed, and distributed operation are major challenges for these systems. Many data warehouses are huge and must process queries quickly and efficiently, so parallel solutions are deployed to provide the necessary performance. Distributed operation, on the other hand, concerns global commercial and scientific organizations that need to share their data in a coherent distributed data warehouse. In this article we review the major concepts, systems, and research results behind parallel and distributed data warehouses.
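
A minimal sketch of the core execution pattern behind parallel warehouses follows: each node aggregates its horizontal fragment of the fact table independently, and a coordinator merges the partial results. Threads stand in for cluster nodes, and all data is hypothetical.

```python
# A minimal sketch of partition-parallel aggregation in a parallel data
# warehouse: per-node local aggregation followed by a global merge.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fact rows (region, amount), hash-partitioned across 3 "nodes"
fragments = [
    [("EU", 10), ("US", 5)],
    [("EU", 7), ("ASIA", 3)],
    [("US", 8), ("ASIA", 2)],
]

def local_aggregate(fragment):
    """Phase 1: each node computes SUM(amount) GROUP BY region on its data."""
    partial = Counter()
    for region, amount in fragment:
        partial[region] += amount
    return partial

with ThreadPoolExecutor() as pool:           # stand-in for a cluster of nodes
    partials = list(pool.map(local_aggregate, fragments))

# Phase 2: a coordinator merges the partial aggregates into the final answer.
result = sum(partials, Counter())
print(dict(result))   # {'EU': 17, 'US': 13, 'ASIA': 5}
```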

