A Methodology for Building XML Data Warehouses

2008 ◽  
pp. 530-555
Author(s):  
Laura Irina Rusu ◽  
J. Wenny Rahayu ◽  
David Taniar

Developing a data warehouse for XML documents involves two major processes: creating it, by processing raw XML documents into a specified data warehouse repository; and querying it, by applying techniques that better answer users' queries. This paper focuses on the first part, that is, identifying a systematic approach for building a data warehouse of XML documents, specifically for transferring data from an underlying XML database into a defined XML data warehouse. The proposed methodology for building XML data warehouses covers processes including data cleaning and integration, summarization, creating intermediate XML documents, updating/linking existing documents, and creating fact tables. In this paper, we also present a case study on how to put this methodology into practice. We utilise XQuery technology in all of the above processes.
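The abstract's transformations are expressed in XQuery; as a language-neutral illustration only, the summarization step can be sketched in Python with the standard library's xml.etree, turning raw sale documents into a small fact document. All element names here (sale, product, qty) are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of the summarization step: aggregate raw XML
# sale records into a fact document. Element names are invented.
import xml.etree.ElementTree as ET
from collections import defaultdict

RAW = """
<sales>
  <sale><product>water</product><qty>3</qty></sale>
  <sale><product>water</product><qty>2</qty></sale>
  <sale><product>juice</product><qty>5</qty></sale>
</sales>
"""

def summarise(raw_xml: str) -> ET.Element:
    """Sum quantities per product and emit a fact document."""
    totals = defaultdict(int)
    for sale in ET.fromstring(raw_xml).iter("sale"):
        totals[sale.findtext("product")] += int(sale.findtext("qty"))
    facts = ET.Element("facts")
    for product, qty in sorted(totals.items()):
        fact = ET.SubElement(facts, "fact", product=product)
        fact.text = str(qty)
    return facts

facts = summarise(RAW)
print(ET.tostring(facts, encoding="unicode"))
```

A real pipeline following the paper would express the same grouping and aggregation as a FLWOR expression in XQuery rather than Python.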

Author(s):  
Hadj Mahboubi ◽  
Jérôme Darmont

XML data warehouses form an interesting basis for decision-support applications that exploit complex data. However, native XML database management systems (DBMSs) currently offer limited performance, so it is necessary to find ways to optimize them. In this chapter, the authors present two such techniques. First, they propose an XML join index that is specifically adapted to the multidimensional architecture of XML warehouses; it eliminates join operations while preserving the information contained in the original warehouse. Second, they present a strategy for selecting XML materialized views by clustering the query workload. To validate these proposals, the authors measure the response time of a set of decision-support XQueries over an XML data warehouse, with and without their optimization techniques. The experimental results demonstrate the efficiency of these techniques, even when queries are complex and data are voluminous.
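The join-index idea can be illustrated independently of XML: fact-to-dimension joins are precomputed once, so later queries scan the index instead of re-joining. This is a generic sketch with invented data, not the authors' implementation.

```python
# Illustrative join index: precompute the fact-dimension join so that
# decision-support queries avoid join operations entirely.
facts = [
    {"id": 1, "prod_id": 10, "qty": 3},
    {"id": 2, "prod_id": 11, "qty": 5},
    {"id": 3, "prod_id": 10, "qty": 2},
]
products = {10: {"type": "water"}, 11: {"type": "juice"}}

# Build the index once: each fact id maps to its fully joined row.
join_index = {f["id"]: {**f, **products[f["prod_id"]]} for f in facts}

# A query now filters and aggregates over the index, no join needed.
water_total = sum(r["qty"] for r in join_index.values() if r["type"] == "water")
print(water_total)  # 5
```

The trade-off, as with any materialized structure, is index maintenance when facts or dimensions change.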


Author(s):  
Lars Frank ◽  
Christian Frank

A star schema data warehouse looks like a star, with a central, so-called fact table in the middle, surrounded by so-called dimension tables that have one-to-many relationships to the central fact table. Dimensions are called dynamic, or slowly changing, if the attributes or relationships of a dimension can be updated. Aggregations of fact data to the level of the related dynamic dimensions may be misleading if the fact data are aggregated without considering the changes of the dimensions. In this chapter, we first prove that the problems of slowly changing dimensions (SCD) in a data warehouse may be viewed as a special case of the read-skew anomaly that can occur when different transactions access and update records without concurrency control. That is, we prove that aggregating fact data to the levels of a dynamic dimension may not make sense. On the other hand, we also illustrate, by examples, that in some situations it does make sense to aggregate fact data to the levels of a dynamic dimension. That is, it is the semantics of the data that determine whether historical dimension data should be preserved or destroyed. Worse still, we illustrate that some applications need a history-preserving response while, at the same time, other applications need a history-destroying response. Kimball et al. (2002) described three classic solutions/responses for handling the aggregation problems caused by slowly changing dimensions. In this chapter, we describe and evaluate four more responses, of which one is new. This is important because the responses have very different properties, and it is not possible to select a best solution without knowing the semantics of the data.
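The contrast between history-destroying and history-preserving responses can be sketched with the two best-known Kimball responses: overwriting an attribute in place ("Type 1") versus appending a versioned row ("Type 2"). The data and field layout below are invented for illustration.

```python
# Sketch of two classic SCD responses on a toy dimension table.
# Rows are (business_key, attribute, valid_from) tuples; data invented.
from dataclasses import dataclass, field

@dataclass
class Dimension:
    rows: list = field(default_factory=list)

    def type1_update(self, key, attr):
        """Overwrite the attribute everywhere: history is destroyed."""
        self.rows = [(k, attr if k == key else a, v) for k, a, v in self.rows]

    def type2_update(self, key, attr, valid_from):
        """Append a new version: history is preserved."""
        self.rows.append((key, attr, valid_from))

dim = Dimension([("c1", "region-A", "2001")])
dim.type2_update("c1", "region-B", "2003")
# Both versions survive, so facts can be aggregated per valid period;
# a type1_update would instead rewrite history for every old fact.
print(dim.rows)
```

Which response is right depends, as the chapter argues, on the semantics of the data rather than on the mechanism itself.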


2017 ◽  
Vol 2 (1) ◽  
pp. 15
Author(s):  
Becky Yoose

The rise of evidence-based practices and assessment in libraries in recent years, combined with the tying of outcomes to future funding and resource allotments, has made libraries more reliant on patron data to determine how to allocate limited resources and funding. Libraries that want to use data for research and analysis, while also wanting to protect patron privacy, find themselves wondering how to balance these two priorities. This article explores The Seattle Public Library's attempt to strike a balance between patron privacy and data analysis through a data warehouse of de-identified patron data, as well as the implications of data warehouses and de-identification as an option for other libraries.
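A common de-identification pattern in this spirit replaces direct identifiers with salted hashes and generalises quasi-identifiers before loading the warehouse. The field names and steps below are a generic sketch, not The Seattle Public Library's actual pipeline.

```python
# Minimal de-identification sketch: pseudonymize the patron ID and
# generalise age into a band, keeping only the analytic value.
# Field names are hypothetical.
import hashlib

SALT = b"local-secret-salt"  # kept outside the warehouse

def deidentify(record: dict) -> dict:
    pseudonym = hashlib.sha256(SALT + record["patron_id"].encode()).hexdigest()[:12]
    return {
        "patron": pseudonym,                         # no raw identifier
        "age_band": f"{record['age'] // 10 * 10}s",  # generalised value
        "checkouts": record["checkouts"],            # analytic value kept
    }

row = deidentify({"patron_id": "P-0042", "age": 37, "checkouts": 12})
print(row)
```

Hashing alone is not sufficient against re-identification; real deployments also consider generalisation levels, suppression, and who holds the salt.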


2010 ◽  
pp. 865-886
Author(s):  
Pedro Furtado

Data warehouses are a crucial technology for competitive organizations in today's globalized world. Size, speed, and distributed operation are the major challenges facing those systems. Many data warehouses are huge yet must process queries quickly and efficiently, so parallel solutions are deployed to deliver the necessary performance. Distributed operation, on the other hand, concerns global commercial and scientific organizations that need to share their data in a coherent distributed data warehouse. In this article we review the major concepts, systems, and research results behind parallel and distributed data warehouses.
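The core mechanism behind parallel warehouse processing can be sketched in a few lines: hash-partition the fact table across nodes, aggregate each partition locally, then merge the cheap partial results. The "nodes" below are plain lists; a real system would distribute them across machines.

```python
# Sketch of partitioned parallel aggregation over a toy fact table.
from collections import Counter

facts = [("water", 3), ("juice", 5), ("water", 2), ("juice", 1)]
N_NODES = 2

# Partition step: route each fact to a node by hashing its key.
partitions = [[] for _ in range(N_NODES)]
for product, qty in facts:
    partitions[hash(product) % N_NODES].append((product, qty))

# Local aggregation on each node, then a global merge of partials.
partials = [Counter() for _ in range(N_NODES)]
for node, part in enumerate(partitions):
    for product, qty in part:
        partials[node][product] += qty
result = sum(partials, Counter())
print(dict(result))
```

Hash partitioning on the grouping key guarantees that each group lands on one node, which is what makes the final merge a simple union of disjoint partial sums.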


Author(s):  
Michael Aram ◽  
Felix Mödritscher ◽  
Gustaf Neumann ◽  
Monika Andergassen

E-assessment comprises a variety of activities in and beyond the classroom. However, traditional e-learning platforms support only part of assessment (e.g., individual and group assignments, the grading of such activities, and student record management). Typically, such platforms lack competency orientation or face performance issues due to increasing application complexity and usage intensity. To overcome these technical limitations and provide a basis for competency-based assessment, the authors present an analytics component inspired by data warehouses. The potential of this artifact is elaborated, and the improvements are evaluated through a case study of Learn@WU, the learning management system of WU Vienna. Although the focus was competency-based aggregation of learning results, early experiences show performance increases of 45% to 98% for retrieving simple grades. Sample scenarios demonstrate how to define and calculate indicators along activity hierarchies and competency graphs, enabling the measurement of learning performance along both generic indicators and competency-oriented assessment.
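Rolling an indicator up an activity hierarchy, as described above, amounts to a recursive aggregation over a tree. The hierarchy, scores, and averaging rule below are invented for illustration and are not taken from Learn@WU.

```python
# Hypothetical roll-up of a score indicator along an activity hierarchy.
hierarchy = {
    "course": ["unit1", "unit2"],
    "unit1": ["quiz1"],
    "unit2": ["quiz2", "quiz3"],
}
scores = {"quiz1": 80, "quiz2": 60, "quiz3": 100}  # leaf activities

def rollup(node):
    """Average the rolled-up scores of a node's children; leaves return
    their own score."""
    if node in scores:
        return scores[node]
    children = [rollup(c) for c in hierarchy[node]]
    return sum(children) / len(children)

print(rollup("course"))  # 80.0
```

In a warehouse-style analytics component, such roll-ups would be precomputed or cached per level rather than recomputed on every request, which is where the reported retrieval speedups come from.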


2012 ◽  
Vol 2 (1) ◽  
pp. 21-64
Author(s):  
Zurinahni Zainol ◽  
Bing Wang

Designing “good” XML documents is a difficult task for a database designer. Although many theories for XML database design have been proposed, no commercial design tool has been developed to assist the XML document designer. In this paper, the authors present a formal framework for XML document design that incorporates a conceptual model of XML schema, called Graph-Document Type Definition (G-DTD), with a theory of database normalization. This framework is designed as a blueprint to help XML database designers perform XML document schema design quickly and accurately. The G-DTD is used to describe the structure of XML documents at the schema level. A set of normal forms for G-DTD, based on rules proposed by Arenas and Libkin and by Lv et al., provides a guideline for a well-designed XML document schema. The authors develop a prototype of XML document schema design using the Z formal specification language. Finally, using a case study, the formal specification is validated to check the correctness and consistency of the specification. This gives confidence that the authors' prototype can be implemented successfully to generate an automatic XML document design.
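The kind of redundancy that XML normal forms rule out can be shown with a toy check: if one value functionally determines another, repeating the dependent value under every element invites inconsistency. This is only an illustration of the underlying idea; it is not G-DTD, and the element names are invented.

```python
# Toy detector for functional-dependency violations in an XML document:
# a determinant value mapped to more than one dependent value signals
# the redundancy/inconsistency that normalization is meant to prevent.
import xml.etree.ElementTree as ET

DOC = """
<enrolments>
  <enrolment><course>CS1</course><title>Databases</title></enrolment>
  <enrolment><course>CS1</course><title>Data bases</title></enrolment>
</enrolments>
"""

def fd_violations(xml_text, det, dep):
    """Return determinant values associated with >1 dependent value."""
    seen = {}
    for e in ET.fromstring(xml_text).iter("enrolment"):
        seen.setdefault(e.findtext(det), set()).add(e.findtext(dep))
    return {k: v for k, v in seen.items() if len(v) > 1}

print(fd_violations(DOC, "course", "title"))
```

A normalized design would store the course title once, under a course element, instead of repeating it per enrolment.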


Author(s):  
Michel Schneider

Basically, the schema of a data warehouse rests on two kinds of elements: facts and dimensions. Facts are used to record measures about situations or events. Dimensions are used to analyse these measures, particularly through aggregation operations (counting, summation, average, etc.). To fix ideas, consider the analysis of the sales in a shop according to product type and month of the year. Each sale of a product is a fact, which can be characterized by a quantity. An aggregation function can be calculated over the quantities of several facts; for example, one can sum the quantities sold for the product type “mineral water” during January in 2001, 2002, and 2003. Product type is a criterion of the Product dimension; Month and Year are criteria of the Time dimension. A quantity is thus connected both with a type of product and with a month of one year. This type of connection concerns the organization of facts with regard to dimensions. On the other hand, a month is connected to one year; this type of connection concerns the organization of criteria within a dimension. The possibilities of fact analysis depend on these two forms of connection and on the schema of the warehouse. This schema is chosen by the designer in accordance with the users' needs. Determining the schema of a data warehouse cannot be achieved without adequate modelling of dimensions and facts. In this article we present a general model for dimensions and facts and their relationships. This model greatly facilitates the choice of the schema and its manipulation by the users.
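The abstract's running example can be sketched directly: facts carry a quantity plus criteria from the Product and Time dimensions, and the aggregation sums quantities for product type "mineral water" in January of 2001 to 2003. The data values below are invented.

```python
# The mineral-water example as a tiny flat star schema (invented data).
facts = [
    {"ptype": "mineral water", "month": 1, "year": 2001, "qty": 10},
    {"ptype": "mineral water", "month": 1, "year": 2002, "qty": 7},
    {"ptype": "mineral water", "month": 2, "year": 2002, "qty": 4},
    {"ptype": "soda",          "month": 1, "year": 2003, "qty": 9},
]

# Sum of quantities for "mineral water" in January 2001-2003.
total = sum(
    f["qty"]
    for f in facts
    if f["ptype"] == "mineral water"
    and f["month"] == 1
    and f["year"] in (2001, 2002, 2003)
)
print(total)  # 17
```

The filter on ptype uses the fact-to-dimension connection, while the month/year pair reflects the criterion hierarchy inside the Time dimension.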


2019 ◽  
Vol 15 (2) ◽  
pp. 1-21 ◽  
Author(s):  
Sandro Bimonte ◽  
Omar Boussaid ◽  
Michel Schneider ◽  
Fabien Ruelle

In the era of Big Data, more and more stream data is available. At the same time, Decision Support System (DSS) tools, such as data warehouses and alert systems, have become more and more sophisticated, and conceptual modeling tools are consequently mandatory for successful DSS projects. Formalisms such as UML and ER have been widely used for classical information systems and data warehouse systems, but they have not yet been investigated for stream data warehouses that must deal with alert systems. Therefore, in this article, the authors introduce the notion of an Active Stream Data Warehouse (ASDW) and propose a UML profile for designing such warehouses. In particular, they extend the ICSOLAP profile to take into account continuous and window OLAP queries. Moreover, they study the duality of the stream and OLAP decision-making processes and propose a set of ECA rules to automatically trigger OLAP operators. The UML profile is implemented in a new OLAP architecture and validated using an environmental case study concerning wind monitoring.
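The Event-Condition-Action pattern over a stream window can be sketched as follows: each new reading is the event, a windowed aggregate forms the condition, and crossing a threshold fires the action (here, recording an alert that could stand in for triggering an OLAP operator). The threshold and wind-speed readings are invented for this wind-monitoring flavour.

```python
# ECA sketch over a sliding window of wind-speed readings (invented data).
from collections import deque

WINDOW, THRESHOLD = 3, 20.0
window = deque(maxlen=WINDOW)
alerts = []

def on_reading(speed):
    """Event: a new stream tuple arrives."""
    window.append(speed)
    avg = sum(window) / len(window)
    # Condition: the window is full and its average exceeds the threshold.
    if len(window) == WINDOW and avg > THRESHOLD:
        # Action: fire an alert (a real ASDW might trigger a drill-down).
        alerts.append(round(avg, 1))

for s in [12.0, 18.0, 25.0, 31.0]:
    on_reading(s)
print(alerts)
```

In the article's setting the action side would invoke OLAP operators over the warehouse rather than simply appending to a list.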

