Reconciling requirement-driven data warehouses with data sources via multidimensional normal forms

2007 ◽  
Vol 63 (3) ◽  
pp. 725-751 ◽  
Author(s):  
Jose-Norberto Mazón ◽  
Juan Trujillo ◽  
Jens Lechtenbörger
2013 ◽  
Vol 38 (2) ◽  
pp. 131-142 ◽  
Author(s):  
Artur Wojciechowski

Abstract: Data warehouses integrate external data sources (EDSs), which often change their data structures (schemas). In many cases, such changes cause erroneous execution of an already deployed ETL workflow. Because structural changes of EDSs are frequent, automatic reparation of an ETL workflow after such changes is of high importance. This paper presents a framework, called E-ETL, for handling the evolution of an ETL layer. Detection of changes in EDSs triggers a reparation of the fragment of the ETL workflow that interacts with the changed EDSs. The proposed framework was developed as a module external to a standard commercial or open-source ETL engine, accessing the engine by means of its API. The innovation of this framework consists in: (1) the algorithms for semi-automatic reparation of an ETL workflow and (2) its ability to interact with various ETL engines that provide an API.
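The mechanism the abstract describes — detecting schema changes in a source and repairing only the workflow fragment that touches it — can be sketched as follows. This is a minimal illustration with hypothetical names, not the E-ETL algorithms themselves:

```python
# Sketch of the E-ETL idea: compare a source's previous and current
# schema, then repair only the ETL steps that read the changed source.
# All names and structures here are illustrative assumptions.

def detect_changes(old_schema: dict, new_schema: dict) -> list:
    """Return (kind, column) change events between two column->type maps."""
    changes = []
    for col, typ in old_schema.items():
        if col not in new_schema:
            changes.append(("dropped", col))
        elif new_schema[col] != typ:
            changes.append(("retyped", col))
    for col in new_schema:
        if col not in old_schema:
            changes.append(("added", col))
    return changes

def repair_workflow(steps: list, source: str, changes: list) -> list:
    """Semi-automatic repair: remove references to dropped columns in the
    fragment of the workflow that interacts with the changed source."""
    dropped = {col for kind, col in changes if kind == "dropped"}
    repaired = []
    for step in steps:
        if step["source"] == source:
            step = dict(step,
                        columns=[c for c in step["columns"] if c not in dropped])
        repaired.append(step)
    return repaired
```

In a real system the repair rules would be richer (renames, type coercions, operator-specific rewrites) and would be applied through the ETL engine's API rather than on plain dictionaries.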


Author(s):  
Francesco Di Tria ◽  
Ezio Lefons ◽  
Filippo Tangorra

Big Data warehouses are a new class of databases that largely use unstructured and volatile data for analytical purposes. Examples of such data sources are those coming from the Web, such as social networks and blogs, or from sensor networks, where huge amounts of data may be available only for short intervals of time. To manage massive data sources, a strategy must be adopted for defining multidimensional schemas in the presence of fast-changing situations or even undefined business requirements. In this paper, we propose a design methodology that adopts agile and automatic approaches in order to reduce the time necessary to integrate new data sources and to include new business requirements on the fly. The data are immediately available for analysis, since the underlying architecture is based on a virtual data warehouse that does not require an importing phase. Examples of the application of the methodology are presented throughout the paper to show the validity of this approach compared to a traditional one.
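The virtual-warehouse architecture mentioned above (no importing phase; sources are queried at analysis time) can be illustrated with a minimal sketch. The source names and functions are hypothetical:

```python
# Sketch of a virtual data warehouse: instead of loading sources into a
# materialized store, a "virtual fact" unions live source fetchers at
# query time. Names below are illustrative assumptions.

sources = {
    "web":    lambda: [{"topic": "dw", "hits": 3}],
    "sensor": lambda: [{"topic": "dw", "hits": 1}],
}

def virtual_fact():
    """Yield fact rows drawn from the registered sources on demand."""
    for fetch in sources.values():
        yield from fetch()

def total_hits(topic: str) -> int:
    """An analysis query evaluated directly over the virtual fact."""
    return sum(row["hits"] for row in virtual_fact() if row["topic"] == topic)
```

Registering a new source is then a matter of adding one entry to `sources`, which is the point of the agile methodology: new data become analyzable without an ETL loading step.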


2011 ◽  
pp. 277-297 ◽  
Author(s):  
Carlo Combi ◽  
Barbara Oliboni

This chapter describes a graph-based approach to representing information stored in a data warehouse by means of a temporal semistructured data model. We consider issues related to the representation of semistructured data warehouses, and discuss the set of constraints needed to correctly manage warehouse time, i.e., the time dimension considered when storing data in the data warehouse itself. We use a temporal semistructured data model because a data warehouse can contain data coming from different and heterogeneous data sources. This means that data stored in a data warehouse are semistructured in nature: the same information can be represented in different ways in different documents, and the document schemata may or may not be available. Moreover, information stored in a data warehouse is often time-varying; thus, as for semistructured data, it can be useful to consider time in the data warehouse context as well.
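A temporal semistructured graph of the kind described here can be sketched as labeled nodes connected by edges that carry a validity interval, so the warehouse can answer "what did this document look like at time t?". The types below are illustrative assumptions, not the chapter's model:

```python
from dataclasses import dataclass

# Sketch of a temporal semistructured graph: edges carry valid-time
# intervals; a snapshot at time t recovers the graph as it was then.

@dataclass(frozen=True)
class Node:
    label: str
    value: object = None

@dataclass(frozen=True)
class Edge:
    parent: Node
    child: Node
    valid_from: int
    valid_to: int  # exclusive; a large sentinel stands for "until now"

def snapshot(edges: list, t: int) -> list:
    """Edges that were valid at time t, i.e., the graph's state at t."""
    return [e for e in edges if e.valid_from <= t < e.valid_to]
```

The constraints the chapter discusses would be enforced on top of such a structure, e.g., requiring that the intervals of edges replacing one another do not overlap.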


2015 ◽  
Vol 54 (01) ◽  
pp. 41-44 ◽  
Author(s):  
A. Taweel ◽  
S. Miles ◽  
B. C. Delaney ◽  
R. Bache

Summary: Introduction: This article is part of the Focus Theme of Methods of Information in Medicine on "Managing Interoperability and Complexity in Health Systems". Objectives: The increasing availability of electronic clinical data provides great potential for finding eligible patients for clinical research. However, data heterogeneity makes it difficult for clinical researchers to interrogate sources consistently. Existing standard query languages are often not sufficient to query across diverse representations. Thus, a higher-level domain language is needed so that queries become data-representation agnostic. To this end, we define a clinician-readable computational language for querying whether patients meet eligibility criteria (ECs) from clinical trials. This language is capable of implementing the temporal semantics required by many ECs, and can be automatically evaluated on heterogeneous data sources. Methods: By reference to standards and examples of existing ECs, a clinician-readable query language was developed. Using a model-based approach, it was implemented to transform captured ECs into queries that interrogate heterogeneous data warehouses. The query language was evaluated on two types of data sources, each different in structure and content. Results: The query language abstracts the level of expressivity so that researchers construct their ECs with no prior knowledge of the data sources. It was evaluated on two types of semantically and structurally diverse data warehouses. This query language is now used to express ECs in the EHR4CR project. A survey shows that it was perceived by the majority of users to be useful, easy to understand and unambiguous. Discussion: An EC-specific language enables clinical researchers to express their ECs as a query such that the user is isolated from the complexities of different heterogeneous clinical data sets.
More generally, the approach demonstrates that a domain query language has potential for overcoming the problems of semantic interoperability and is applicable where the nature of the queries is well understood and the data are conceptually similar but in different representations. Conclusions: Our language provides a strong basis for use across different clinical domains for expressing ECs by overcoming the heterogeneous nature of electronic clinical data whilst maintaining semantic consistency. It is readily comprehensible by target users. This demonstrates that a domain query language can be both usable and interoperable.
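The core idea — eligibility criteria with temporal semantics expressed against an abstract patient model, with per-source adapters hiding each warehouse's concrete representation — can be sketched as follows. This is an illustrative sketch only; the names are assumptions and this is not the EHR4CR language:

```python
from datetime import date

# Sketch of a representation-agnostic eligibility criterion: the
# criterion sees only an abstract patient model; source adapters map
# heterogeneous row formats onto it. All names are hypothetical.

def diagnosed_within(patient: dict, code: str, start: date, end: date) -> bool:
    """Temporal EC: patient has a diagnosis `code` dated in [start, end]."""
    return any(d["code"] == code and start <= d["date"] <= end
               for d in patient["diagnoses"])

def adapt_source_a(row: dict) -> dict:
    """Adapter for one source whose rows store diagnoses as (code, date)
    pairs under a 'dx' key; a second source would get its own adapter."""
    return {"diagnoses": [{"code": c, "date": d} for c, d in row["dx"]]}
```

The researcher writes only `diagnosed_within`-style criteria; supporting a new warehouse means writing one adapter, not rewriting the ECs.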


2010 ◽  
pp. 894-928 ◽  
Author(s):  
Robert Wrembel

Methods of designing a data warehouse (DW) usually assume that its structure is static. In practice, however, a DW structure changes, among other reasons, as a result of the evolution of external data sources and of changes in the real world represented in the DW. The most advanced research approaches to this problem are based on temporal extensions and versioning techniques. This article surveys challenges in designing, building, and managing data warehouses whose structure and content evolve in time. The survey is based on the so-called Multiversion Data Warehouse (MVDW). In detail, this article presents the following issues: the concept of the MVDW, a language for querying the MVDW, a framework for detecting changes in data sources, a structure for sharing data in the MVDW, and index structures for indexing data in the MVDW.
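The multiversion idea can be sketched as fact rows tagged with the schema version under which they were loaded, plus a mapping that reconciles renamed dimension members so a query can span versions. The structures below are illustrative assumptions, not the MVDW query language:

```python
# Sketch of cross-version querying in a multiversion DW: facts carry a
# version tag; a member map reconciles a dimension member renamed
# between versions. All data and names here are hypothetical.

facts = [
    {"version": "V1", "region": "North", "sales": 100},
    {"version": "V2", "region": "N",     "sales": 120},  # renamed in V2
]

# Per-version mapping of local member names onto a common vocabulary.
member_map = {"V2": {"N": "North"}}

def query_sales(versions: list) -> dict:
    """Total sales per (reconciled) region over the addressed versions."""
    totals = {}
    for row in facts:
        if row["version"] in versions:
            region = member_map.get(row["version"], {}) \
                               .get(row["region"], row["region"])
            totals[region] = totals.get(region, 0) + row["sales"]
    return totals
```

A single-version query simply restricts `versions`, while a cross-version query unions the versions and relies on the mapping to keep results comparable.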


Author(s):  
Fadila Bentayeb ◽  
Cécile Favre ◽  
Omar Boussaid

A data warehouse allows the integration of heterogeneous data sources for identified analysis purposes. The data warehouse schema is designed according to the available data sources and the users' analysis requirements. In order to answer new individual analysis needs, the authors previously proposed, in recent work, a solution for on-line analysis personalization. They based their solution on a user-driven approach to data warehouse schema evolution, which consists in creating new hierarchy levels in OLAP (on-line analytical processing) dimensions. One of the main objectives of OLAP, as the acronym itself suggests, is performance during the analysis process. Since data warehouses contain a large volume of data, answering decision queries efficiently requires particular access methods. The main approach is to use redundant optimization structures such as views and indices. This implies selecting an appropriate set of materialized views and indices that minimizes total query response time, given a limited storage space. A judicious choice in this selection must be cost-driven and based on a workload representing a set of users' queries on the data warehouse. In this chapter, the authors address the issues related to workload evolution and maintenance in data warehouse systems in response to new requirements resulting from users' personalized analysis needs. The main goal is to avoid regenerating the workload from scratch. Hence, they propose a workload management system that helps the administrator maintain and dynamically adapt the workload according to changes arising in the data warehouse schema. To achieve this maintenance, the authors propose two types of workload updates: (1) maintaining existing queries consistent with respect to the new data warehouse schema and (2) creating new queries based on the new dimension hierarchy levels.
Their system helps the administrator adopt a proactive behaviour in managing data warehouse performance. In order to validate their workload management system, the authors address the implementation issues of their proposed prototype. The latter has been developed within a client/server architecture, with a Web client interfaced with the Oracle 10g database management system.
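The two workload updates described above can be sketched minimally: rewriting existing queries so they remain consistent after a level rename, and generating a fresh aggregation query for a newly created hierarchy level. The helper names and the naive string rewrite are illustrative assumptions, not the authors' system:

```python
# Sketch of the two workload updates: (1) keep existing queries
# consistent with a renamed dimension level, and (2) create a query
# for a newly added level. Names and the simple textual rewrite are
# hypothetical simplifications of real query maintenance.

def rewrite_query(sql: str, renames: dict) -> str:
    """Update an existing workload query after level renames."""
    for old, new in renames.items():
        sql = sql.replace(old, new)
    return sql

def query_for_new_level(fact: str, level: str, measure: str) -> str:
    """Generate a workload query aggregating a measure at the new level."""
    return f"SELECT {level}, SUM({measure}) FROM {fact} GROUP BY {level}"
```

A production system would rewrite parsed query trees rather than raw strings, and would feed the updated workload back into the cost-driven view/index selection.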

