Extract Transform Load (ETL) Process in Distributed Database Academic Data Warehouse

Author(s):  
Ardhian Agung Yulianto

While a data warehouse is designed to support the decision-making function, the most time-consuming part of building one is the Extract Transform Load (ETL) process. In the case of an academic data warehouse whose data sources are the faculties' distributed databases, integration is not straightforward even though the databases share a typical structure. This paper presents the ETL process in detail, following the Data Flow Thread in the data staging area: identifying and profiling the data sources, analyzing the content of all tables, and then cleaning, conforming dimensions, and delivering the data to the data warehouse. These steps run gradually over each distributed-database data source until the data are merged. Dimension tables and fact tables are generated in a multidimensional model. The ETL tool is Pentaho Data Integration 6.1. ETL testing is done by comparing the data source with the data target, and DW testing is conducted by comparing the analysis results of SQL queries with those of the Saiku Analytics plugin in the Pentaho Business Analytic Server.

Author(s):  
Ardhian Agung Yulianto

While a data warehouse is designed to support the decision-making function, the most time-consuming part is the Extract Transform Load (ETL) process. In the case of an academic data warehouse whose data come from the faculties' distributed databases, integration is not straightforward even though the databases share a typical structure. This paper presents an ETL process for a distributed-database academic data warehouse. Following the Data Flow Thread process in the data staging area, a deep analysis is performed to identify all tables in each data source, including content profiling. The cleaning, conforming, and data delivery steps then pour the different data sources into the data warehouse (DW). Since the DW is developed using Kimball's bottom-up multidimensional approach, we identify three types of extraction activities from the data source tables: merge, merge-union, and union. The cleaning and conforming step results in conformed dimensions created from data source analysis, refinement, and hierarchy structuring. The final ETL step loads the data into integrated dimension and fact tables with generated surrogate keys. These steps run gradually over each distributed-database data source until the data are incorporated. The technical activities in this distributed-database ETL process can generally be adopted in other industries, provided the designer has advanced knowledge of the structure and content of the data sources.
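As an illustration of the union-style extraction and surrogate-key generation described above, the following minimal sketch unions student rows from two faculty databases into one conformed dimension with generated surrogate keys. The table and column names (such as nim) are invented for illustration; the paper itself performs these steps in Pentaho Data Integration.

```python
# Minimal sketch: union extraction from distributed faculty databases
# into a conformed dimension with generated surrogate keys.
# Table and column names are hypothetical, not taken from the paper.

faculty_a_students = [
    {"nim": "1101", "name": "Andi",  "dept": "Informatics"},
    {"nim": "1102", "name": "Budi",  "dept": "Informatics"},
]
faculty_b_students = [
    {"nim": "2201", "name": "Citra", "dept": "Civil Engineering"},
]

def build_student_dimension(*sources):
    """Union rows from every source, de-duplicate on the natural key (nim),
    and assign a warehouse surrogate key."""
    dimension, seen = [], set()
    surrogate_key = 1
    for source in sources:
        for row in source:
            if row["nim"] in seen:          # already loaded from another faculty
                continue
            seen.add(row["nim"])
            dimension.append({"student_sk": surrogate_key, **row})
            surrogate_key += 1
    return dimension

for record in build_student_dimension(faculty_a_students, faculty_b_students):
    print(record)
```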


2014 ◽  
Vol 668-669 ◽  
pp. 1374-1377 ◽  
Author(s):  
Wei Jun Wen

ETL refers to the process of data extraction, transformation and loading and is deemed a critical step in ensuring the quality, specification and standardization of marine environmental data. Marine data, due to their complexity, field diversity and huge volume, still remain decentralized, multi-sourced and heterogeneous, with differing semantics, and hence are far from able to provide effective data sources for decision making. ETL enables the construction of a marine environmental data warehouse through the cleaning, transformation, integration, loading and periodic updating of basic marine data. The paper presents research on rules for the cleaning, transformation and integration of marine data, based on which an original ETL system for a marine environmental data warehouse is designed and developed. The system further guarantees data quality and correctness in future analysis and decision making based on marine environmental data.
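The kind of cleaning and transformation rules described above can be pictured with a small, purely illustrative sketch. The field names (station_id, temp, temp_unit, obs_date) and the rules themselves are assumptions, not taken from the paper.

```python
# Illustrative sketch only: example cleaning/transformation rules of the kind
# an ETL step for marine observation data might apply. Field names are hypothetical.
from datetime import datetime

raw_records = [
    {"station_id": "ST-01", "temp": "59.0", "temp_unit": "F", "obs_date": "2013/07/21"},
    {"station_id": "",      "temp": "18.2", "temp_unit": "C", "obs_date": "2013-07-22"},
    {"station_id": "ST-02", "temp": "17.5", "temp_unit": "C", "obs_date": "2013-07-22"},
]

def clean(record):
    """Apply cleaning rules; return None when a record violates a hard rule."""
    if not record["station_id"]:                      # rule: key fields must be present
        return None
    temp = float(record["temp"])
    if record["temp_unit"] == "F":                    # rule: standardise units to Celsius
        temp = (temp - 32) * 5 / 9
    date = record["obs_date"].replace("/", "-")       # rule: one canonical date format
    datetime.strptime(date, "%Y-%m-%d")               # validate the canonical format
    return {"station_id": record["station_id"], "temp_c": round(temp, 1), "obs_date": date}

cleaned = [r for r in (clean(rec) for rec in raw_records) if r is not None]
print(cleaned)
```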


2016 ◽  
Vol 12 (3) ◽  
pp. 32-50
Author(s):  
Xiufeng Liu ◽  
Nadeem Iftikhar ◽  
Huan Huo ◽  
Per Sieverts Nielsen

In data warehousing, the data from source systems are populated into a central data warehouse (DW) through extraction, transformation and loading (ETL). The standard ETL approach usually uses sequential jobs to process data with dependencies, such as dimension and fact data. It is a non-trivial task to process so-called early-/late-arriving data, which arrive out of order. This paper proposes a two-level data staging area method to optimize ETL. The proposed method is an all-in-one solution that supports processing different types of data from operational systems, including early-/late-arriving data and fast-/slowly-changing data. The additional staging area decouples the loading process from data extraction and transformation, which improves ETL flexibility and minimizes intervention in the data warehouse. The paper evaluates the proposed method empirically, showing that it is more efficient and less intrusive than the standard ETL method.
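A minimal sketch of the second-level staging idea, assuming a simple late-arriving-dimension scenario (all names and structures are illustrative, not the authors' implementation): early-arriving fact rows are parked in a staging structure and only released once their dimension member exists, so the data warehouse itself is never touched with incomplete keys.

```python
# Sketch: fact rows whose dimension member has not arrived yet are parked in a
# second staging level and released once the dimension row shows up.
# The names (customer_key, amount) are illustrative only.

dimension = {}          # natural key -> surrogate key
fact_table = []         # loaded facts
held_facts = []         # second-level staging area for early-arriving facts
next_sk = 1

def try_load_fact(fact):
    sk = dimension.get(fact["customer_key"])
    if sk is None:
        return False
    fact_table.append({"customer_sk": sk, "amount": fact["amount"]})
    return True

def load_fact_row(fact):
    if not try_load_fact(fact):
        held_facts.append(fact)      # early-arriving fact: park it, leave the DW alone

def load_dimension_row(natural_key):
    global next_sk
    if natural_key not in dimension:
        dimension[natural_key] = next_sk
        next_sk += 1
    held_facts[:] = [f for f in held_facts if not try_load_fact(f)]  # retry parked facts

load_fact_row({"customer_key": "C7", "amount": 120})   # arrives before its dimension row
load_dimension_row("C7")                               # late-arriving dimension member
print(fact_table, held_facts)
```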


Author(s):  
Nouha Arfaoui ◽  
Jalel Akaichi

The healthcare industry generates a huge amount of data that is underused for decision-making needs because of the absence of a specific design mastered by healthcare actors and the lack of collaboration and information exchange between institutions. In this work, a new approach is proposed to design the schema of a Hospital Data Warehouse (HDW). It starts by generating the schemas of the Hospital Data Marts (HDM), one for each department, taking into consideration the requirements of the healthcare staff and the existing data sources. It then merges them to build the schema of the HDW. The bottom-up approach is suitable because the healthcare departments operate separately. To merge the schemas, a new schema integration methodology is used. It starts by extracting the similar elements of the schemas and the conflicts between them and presents them as mapping rules. It then transforms the rules into queries and applies them to merge the schemas.
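A hedged sketch of the merge idea, with invented department schemas and a naive name-based matching rule (the authors' methodology is more elaborate): shared and conflicting elements are collected as mapping rules, and the schemas are then merged.

```python
# Hypothetical sketch: find elements shared by two data-mart schemas, record them
# as mapping rules, then build one merged schema. Names are invented for illustration.

cardiology_dm = {"Patient": {"patient_id", "name", "birth_date"},
                 "Stay":    {"stay_id", "patient_id", "admission_date"}}
radiology_dm  = {"Patient": {"patient_id", "name", "insurance_no"},
                 "Exam":    {"exam_id", "patient_id", "modality"}}

def build_mapping_rules(schema_a, schema_b):
    """One rule per entity name found in both schemas, listing shared and conflicting attributes."""
    return [{"entity": entity,
             "shared":   schema_a[entity] & schema_b[entity],
             "conflict": schema_a[entity] ^ schema_b[entity]}
            for entity in schema_a.keys() & schema_b.keys()]

def merge_schemas(schema_a, schema_b):
    """Union of entities and attributes; conflicts flagged by the rules are left to the designer."""
    merged = {entity: set(attrs) for entity, attrs in schema_a.items()}
    for entity, attrs in schema_b.items():
        merged.setdefault(entity, set()).update(attrs)
    return merged

print(build_mapping_rules(cardiology_dm, radiology_dm))
print(merge_schemas(cardiology_dm, radiology_dm))
```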


Author(s):  
Komang Budiarta ◽  
Putu Agung Ananta Wijaya ◽  
Cokorde Gede Indra Partha

College accreditation by BAN-PT is one of the parameters for determining the quality of universities in Indonesia. To reach the BAN-PT standard, a study program or college carries out its own self-evaluation process against the standards set by BAN-PT. Carrying out the self-evaluation process requires data sources to be used as the basis for assessing each criterion. In most study programs, the data are spread across different information systems and physical documents, which takes more time and effort to integrate and interpret. A data warehouse plays an important role in collecting the scattered data and turning them into information. The data warehouse is populated through an ETL process used to integrate, extract, clean, transform and load the data into the data warehouse. With an academic data warehouse at STIMIK STIKOM Bali, executives can more easily obtain the information needed to support accreditation standard three, and the warehouse can serve as a reference in decision making.


Author(s):  
Ivan Bojicic ◽  
Zoran Marjanovic ◽  
Nina Turajlic ◽  
Marko Petrovic ◽  
Milica Vuckovic ◽  
...  

In order for a data warehouse to be able to adequately fulfill its integrative and historical purpose, its data model must enable the appropriate and consistent representation of the different states of a system. In effect, a DW data model, representing the physical structure of the DW, must be general enough to consume data from heterogeneous data sources and reconcile the semantic differences of the data source models, and, at the same time, be resilient to constant changes in the structure of the data sources. One of the main problems related to DW development is the absence of a standardized DW data model. In this paper a comparative analysis of the four most prominent DW data models (namely the relational/normalized model, the data vault model, the anchor model and the dimensional model) is given. On the basis of the results of [1], a new DW data model (the Domain/Mapping model, DMM), which more adequately fulfills the posed requirements, is presented.
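To make the contrast concrete, the following purely illustrative sketch (not drawn from the paper) shows the same Customer concept as a dimensional-model dimension table and as data-vault-style hub and satellite structures; the column names are hypothetical.

```python
# Illustrative contrast only: one business concept in two DW data models.

# Dimensional model: one wide, denormalised dimension row.
dim_customer = [
    {"customer_sk": 1, "customer_no": "C7", "name": "Ana", "city": "Belgrade"},
]

# Data vault style: the business key lives in a hub, descriptive attributes in a
# satellite that is appended to whenever the source changes, so history and
# structural change are absorbed without rewriting existing rows.
hub_customer = [
    {"customer_hk": "h1", "customer_no": "C7", "load_ts": "2020-01-01"},
]
sat_customer = [
    {"customer_hk": "h1", "name": "Ana", "city": "Belgrade", "load_ts": "2020-01-01"},
    {"customer_hk": "h1", "name": "Ana", "city": "Novi Sad", "load_ts": "2021-06-15"},
]

print(dim_customer[0])
print(hub_customer[0], sat_customer[-1])
```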


Author(s):  
Robert Wrembel

A data warehouse architecture (DWA) has been developed for the purpose of integrating data from multiple heterogeneous, distributed, and autonomous external data sources (EDSs), as well as for providing means for advanced analysis of the integrated data. The major components of this architecture include: an external data source (EDS) layer, an extraction-transformation-loading (ETL) layer, a data warehouse (DW) layer, and an on-line analytical processing (OLAP) layer. Methods of designing a DWA, research developments, and most of the commercially available DW technologies tacitly assume that a DWA is static. In practice, however, a DWA requires changes as a result of, among other things, the evolution of EDSs, changes in the real world represented in a DW, and new user requirements. Changes in the structures of EDSs impact the ETL, DW, and OLAP layers. Since such changes are frequent, developing a technology for handling them automatically or semi-automatically in a DWA is of high practical importance. This chapter discusses challenges in designing, building, and managing a DWA that supports the evolution of the structures of EDSs, the evolution of an ETL layer, and the evolution of a DW. The challenges and their solutions presented here are based on the experience of building a prototype Evolving-ETL and a prototype Multiversion Data Warehouse (MVDW). In detail, this chapter presents the following issues: the concept of the MVDW, an approach to querying the MVDW, an approach to handling the evolution of an ETL layer, a technique for sharing data between multiple DW versions, and two index structures for the MVDW.
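A minimal sketch of cross-version querying in the spirit of the MVDW, assuming a hypothetical per-version attribute mapping (this is not the authors' implementation): each DW version keeps its own column names, and a query reconciles them through the mapping before aggregating across versions.

```python
# Hedged sketch: two DW versions whose fact-table column names differ, queried
# together via a per-version attribute mapping. All names and data are invented.

dw_versions = {
    "v1": {"rows": [{"prod": "A", "qty": 10}, {"prod": "B", "qty": 4}],
           "mapping": {"product": "prod", "quantity": "qty"}},
    "v2": {"rows": [{"product_id": "A", "quantity_sold": 7}],
           "mapping": {"product": "product_id", "quantity": "quantity_sold"}},
}

def total_quantity_by_product(versions):
    """Aggregate a logical measure across DW versions using each version's mapping."""
    totals = {}
    for version in versions.values():
        prod_col = version["mapping"]["product"]
        qty_col = version["mapping"]["quantity"]
        for row in version["rows"]:
            totals[row[prod_col]] = totals.get(row[prod_col], 0) + row[qty_col]
    return totals

print(total_quantity_by_product(dw_versions))   # {'A': 17, 'B': 4}
```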


Author(s):  
Ping Yi ◽  
Songling Zhang

This paper introduces applications of the Dempster–Shafer (D-S) data fusion technique in transportation system decision making. D-S inference is a statistics-based data classification technique that can be used when data sources contribute discontinuous and incomplete information and no single data source can produce an overwhelmingly high probability of certainty for identifying the most probable event. The technique captures and combines the information contributed by the data sources by using Dempster's rule to find the conjunction of the events and to determine the highest associated probability. The D-S theory is explained and its implementation described through numerical examples of a ride-hailing service and of crowd management at a subway station. Results from the applications show that the technique is very effective in dealing with incomplete information and multiple data sources in the era of big data.
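Dempster's rule of combination, referred to above, can be implemented in a few lines. The sketch below uses made-up mass assignments for two traffic sensors over a small frame of discernment; it is not the paper's case-study data.

```python
# Minimal sketch of Dempster's rule of combination for two mass functions.
from itertools import product

def combine(m1, m2):
    """Dempster's rule: m(A) = sum over B∩C=A of m1(B)*m2(C), divided by (1 - K),
    where K is the total mass assigned to conflicting (empty) intersections."""
    combined, conflict = {}, 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        intersection = b & c
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mb * mc
        else:
            conflict += mb * mc
    return {a: mass / (1.0 - conflict) for a, mass in combined.items()}

# Two sensors reporting belief over the frame {congested, free_flow} (made-up numbers).
m_sensor1 = {frozenset({"congested"}): 0.6,
             frozenset({"congested", "free_flow"}): 0.4}
m_sensor2 = {frozenset({"congested"}): 0.5,
             frozenset({"free_flow"}): 0.3,
             frozenset({"congested", "free_flow"}): 0.2}

print(combine(m_sensor1, m_sensor2))   # most mass ends up on {congested}
```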


2011 ◽  
Vol 1 (1) ◽  
Author(s):  
Payal Pahwa ◽  
Shweta Taneja ◽  
Shalini Jain

A data warehouse is a single repository of data which includes data generated from various operational systems. Conceptual modeling is an important concept in the successful design of a data warehouse. The Unified Modeling Language (UML) has become a standard for object modeling during the analysis and design steps of software system development. The paper proposes an object-oriented approach to model the process of data warehouse design. The hierarchies of each data element can be explicitly defined, thus highlighting the data granularity. We propose a UML multidimensional model using various data sources based on UML schemas. We present a conceptual-level integration framework for diverse UML data sources on which OLAP operations can be performed. Our integration framework takes into account the benefits of UML (its concepts, relationships and extended features), which is closer to the real world and can model even complex problems easily and accurately. Two steps are involved in our integration framework. The first is to convert UML schemas into UML class diagrams. The second is to build a multidimensional model from the UML class diagrams. The paper focuses on the transformations used in the second step. We describe how to represent a multidimensional model using a UML star or snowflake diagram with the help of a case study. To the best of our knowledge, we are the first to represent a UML snowflake diagram that integrates heterogeneous UML data sources.
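As a rough illustration of the second step (building a multidimensional model from class-diagram-style input), the following sketch maps a hypothetical measure-bearing class to a fact table and its associated classes to dimensions. The class names and the mapping rule are simplifications, not the authors' transformations.

```python
# Hypothetical sketch: derive a star schema from a simplified class-diagram
# description. The rule "the measure-bearing class becomes the fact table,
# every associated class becomes a dimension" is an illustrative simplification.

uml_classes = {
    "Sale":    {"attributes": ["amount", "discount"], "associations": ["Product", "Store", "Date"]},
    "Product": {"attributes": ["product_id", "name", "category"], "associations": []},
    "Store":   {"attributes": ["store_id", "city"], "associations": []},
    "Date":    {"attributes": ["date_id", "month", "year"], "associations": []},
}

def to_star_schema(classes, fact_class):
    fact = {"table": f"fact_{fact_class.lower()}",
            "measures": classes[fact_class]["attributes"],
            "foreign_keys": []}
    dimensions = []
    for assoc in classes[fact_class]["associations"]:
        dimensions.append({"table": f"dim_{assoc.lower()}",
                           "attributes": classes[assoc]["attributes"]})
        fact["foreign_keys"].append(f"{assoc.lower()}_key")
    return fact, dimensions

fact, dims = to_star_schema(uml_classes, "Sale")
print(fact)
for d in dims:
    print(d)
```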

