A Proposal of Methodology for Designing Big Data Warehouses

Author(s):  
Francesco Di Tria ◽  
Ezio Lefons ◽  
Filippo Tangorra

Big Data warehouses are a new class of databases that largely use unstructured and volatile data for analytical purposes. Examples of such data sources are those coming from the Web, such as social networks and blogs, or from sensor networks, where huge amounts of data may be available only for short intervals of time. To manage massive data sources, a strategy must be adopted for defining multidimensional schemas in the presence of fast-changing situations or even undefined business requirements. In this paper, we propose a design methodology that adopts agile and automatic approaches in order to reduce the time needed to integrate new data sources and to include new business requirements on the fly. The data are immediately available for analysis, since the underlying architecture is based on a virtual data warehouse that does not require an importing phase. Examples of the methodology's application are presented throughout the paper to show the validity of this approach compared to a traditional one.
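
As a rough illustration of the virtual-warehouse idea, the sketch below mediates an analytical query over live source adapters instead of importing their data first. The `Row` schema, adapter names, and data are invented for illustration; this is not the authors' implementation.

```python
# Minimal sketch of a virtual data warehouse: queries are pushed to the
# source adapters at analysis time, so no importing/ETL phase is needed.
# The adapter names and the Row schema are illustrative assumptions,
# not the paper's actual architecture.
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Row:
    source: str
    dimension: str   # e.g., a product category
    measure: float   # e.g., a sales amount


class SourceAdapter:
    """Wraps one live data source (blog feed, sensor stream, ...)."""

    def __init__(self, name: str, records: List[tuple]):
        self.name = name
        self._records = records  # stand-in for a remote query

    def fetch(self) -> Iterable[Row]:
        for dim, val in self._records:
            yield Row(self.name, dim, val)


def virtual_query(adapters: List[SourceAdapter], dimension: str) -> float:
    """Aggregate a measure over all sources on the fly (no staging area)."""
    return sum(r.measure
               for a in adapters
               for r in a.fetch()
               if r.dimension == dimension)


if __name__ == "__main__":
    web = SourceAdapter("web", [("books", 10.0), ("music", 5.0)])
    sensors = SourceAdapter("sensors", [("books", 2.5)])
    print(virtual_query([web, sensors], "books"))  # 12.5
```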

2020 ◽  
Vol 12 (1) ◽  
pp. 1-24
Author(s):  
Khaled Dehdouh ◽  
Omar Boussaid ◽  
Fadila Bentayeb

In the Big Data warehouse context, a column-oriented NoSQL database system is considered a storage model highly suited to data warehouses and online analysis. Indeed, NoSQL models make data scalability easy, and the columnar store is well suited to storing and managing massive data, especially for decisional queries. However, column-oriented NoSQL DBMSs do not offer online analytical processing (OLAP) operators. To build OLAP cubes corresponding to the analysis contexts, the most common way is to integrate other software, such as Hive or Kylin, which provides a CUBE operator to build data cubes. In that case, however, the cube is built according to the row-oriented approach, which does not fully deliver the benefits of a column-oriented one. In this article, the focus is on defining a cube operator called MC-CUBE (MapReduce Columnar CUBE), which builds columnar NoSQL cubes according to the columnar approach, taking into account the non-relational and distributed aspects of data warehouse storage.
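
The article defines MC-CUBE over a distributed column-oriented NoSQL store; the sketch below only illustrates the underlying idea, i.e., computing all 2^n grouping sets of a CUBE as map and reduce steps over column-wise data. The column names, data, and the in-process stand-ins for MapReduce jobs are assumptions for illustration.

```python
# Illustrative sketch only: MC-CUBE targets a distributed column-oriented
# NoSQL store; here plain Python dicts stand in for column families, and
# local generators replace actual MapReduce jobs.
from collections import defaultdict
from itertools import chain, combinations

# Columnar layout: one list per column, aligned by row position.
columns = {
    "city":    ["Lyon", "Lyon", "Bari"],
    "product": ["pen",  "ink",  "pen"],
    "sales":   [10.0,   4.0,    7.0],
}
dims, measure = ("city", "product"), "sales"


def grouping_sets(ds):
    """All 2^n dimension subsets computed by the CUBE operator."""
    return chain.from_iterable(combinations(ds, k) for k in range(len(ds) + 1))


def map_phase(n_rows):
    """Emit (grouping-set key, measure) pairs, reading columns independently.
    The empty tuple plays the role of the ALL (grand total) cell."""
    for i in range(n_rows):
        for gs in grouping_sets(dims):
            key = tuple((d, columns[d][i]) for d in gs)
            yield key, columns[measure][i]


def reduce_phase(pairs):
    """Sum the measure per key, as a MapReduce reducer would."""
    cube = defaultdict(float)
    for key, value in pairs:
        cube[key] += value
    return dict(cube)


if __name__ == "__main__":
    for key, total in sorted(reduce_phase(map_phase(3)).items()):
        print(key or ("ALL",), total)
```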


Author(s):  
Khaled Dehdouh

In the big data warehouse context, a column-oriented NoSQL database system is considered a storage model highly suited to data warehouses and online analysis. Indeed, NoSQL models make data scalability easy, and the columnar store is well suited to storing and managing massive data, especially for decisional queries. However, column-oriented NoSQL DBMSs do not offer online analytical processing (OLAP) operators. To build OLAP cubes corresponding to the analysis contexts, the most common way is to integrate other software, such as Hive or Kylin, which provides a CUBE operator to build data cubes. In that case, however, the cube is built according to the row-oriented approach, which does not fully deliver the benefits of a column-oriented one. In this chapter, the main contribution is to define a cube operator called MC-CUBE (MapReduce Columnar CUBE), which builds columnar NoSQL cubes according to the columnar approach, taking into account the non-relational and distributed aspects of data warehouse storage.


2016 ◽  
Vol 6 (6) ◽  
pp. 1241-1244 ◽  
Author(s):  
M. Faridi Masouleh ◽  
M. A. Afshar Kazemi ◽  
M. Alborzi ◽  
A. Toloie Eshlaghy

Extraction, Transformation and Loading (ETL) is one of the notable subjects in the optimization, management, improvement, and acceleration of processes and operations in databases and data warehouses. Creating ETL processes is potentially one of the greatest tasks of data warehousing, and their production is a time-consuming and complicated procedure. Without optimizing these processes, implementing projects in the data warehouse area is costly, complicated, and time-consuming. The present paper combines parallelization methods with shared cache memory in distributed systems based on the data warehouse. According to the conducted assessment, the proposed method exhibited a 7.1% speed improvement over the Kettle optimization tool and 7.9% over the Talend tool in terms of ETL process execution time. Therefore, parallelization can notably improve the ETL process, ultimately allowing the management and integration of big data to be implemented simply and at acceptable speed.
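
A minimal sketch of the combination described above, assuming an invented normalization step: worker processes run the transformation in parallel while sharing one lookup cache. This illustrates the general technique, not the paper's implementation.

```python
# Hedged sketch: parallel ETL transformation with a cache shared across
# worker processes. The lookup logic and data are invented; a real flow
# would read from staging tables rather than an in-memory list.
from multiprocessing import Manager, Pool
from functools import partial


def transform(row, cache):
    """Normalize one record; the shared cache avoids repeating the lookup.
    (The check-then-set is a simplification; it tolerates duplicate work.)"""
    country = row["country"]
    if country not in cache:             # simulate an expensive lookup
        cache[country] = country.strip().upper()
    return {"country": cache[country], "amount": float(row["amount"])}


def run_parallel_etl(rows, workers=4):
    with Manager() as mgr:
        cache = mgr.dict()               # visible to all worker processes
        with Pool(workers) as pool:
            return pool.map(partial(transform, cache=cache), rows)


if __name__ == "__main__":
    data = [{"country": " france ", "amount": "3.2"},
            {"country": " france ", "amount": "1.1"}]
    print(run_parallel_etl(data))
```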


Transformation is the second step of the ETL process, which is responsible for extracting, transforming, and loading data into a data warehouse. The role of transformation is to set up several operations that clean, format, and unify the types and data coming from multiple, heterogeneous data sources. The goal is to make the data conform to the schema of the data warehouse, so as to avoid ambiguity problems during data storage and analytical operations. Transforming data coming from structured, semi-structured, and unstructured data sources requires two levels of treatment: the first is schema-to-schema transformation, to obtain a unified schema for all the selected data sources; the second is data-to-data transformation, to unify the types and formats of all the data gathered. To set up these steps, this paper proposes a process for switching from one database schema to another, as part of the schema-to-schema transformation, and a meta-model based on the MDA approach to describe the main data-to-data transformation operations. The results of our transformations allow the data to be loaded into one of the four NoSQL schemas, so as to best meet the constraints and requirements of Big Data.
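
A minimal sketch of the two treatment levels, with invented source and field names: a schema-to-schema mapping first unifies field names across sources, then data-to-data casts unify the types and formats of the values.

```python
# Sketch of the two transformation levels described above, with invented
# field names: (1) schema-to-schema maps each source's fields onto one
# unified schema; (2) data-to-data unifies types and formats of values.
from datetime import datetime

# Level 1 -- schema-to-schema: per-source field mappings to a unified schema.
SCHEMA_MAPS = {
    "crm":  {"client": "customer", "total": "amount", "day":  "date"},
    "shop": {"buyer":  "customer", "price": "amount", "when": "date"},
}

# Level 2 -- data-to-data: one canonical type/format per unified field.
CASTS = {
    "customer": lambda v: str(v).strip().title(),
    "amount":   float,
    "date":     lambda v: datetime.strptime(v, "%d/%m/%Y").date().isoformat(),
}


def transform(source, record):
    unified = {SCHEMA_MAPS[source][k]: v for k, v in record.items()}
    return {k: CASTS[k](v) for k, v in unified.items()}


if __name__ == "__main__":
    print(transform("crm",  {"client": " alice ", "total": "12", "day": "02/03/2021"}))
    print(transform("shop", {"buyer": "BOB", "price": 7, "when": "05/03/2021"}))
```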


2015 ◽  
Vol 78 (1) ◽  
Author(s):  
Mimi Safinaz Jamaluddin ◽  
Nurulhuda Firdaus Mohd Azmi

Data integration is important for consolidating all the data inside or outside an organization to provide a unified view of the organization's information. Extraction, Transformation and Loading (ETL) is the back-end process of data integration, which involves collecting data from various data sources, preparing and transforming the data according to business requirements, and loading them into a Data Warehouse (DW). This paper explains the integration of rubber import and export data between the Malaysian Rubber Board (MRB) and the Royal Malaysian Customs Department (Customs) using an ETL solution. Microsoft SQL Server Integration Services (SSIS) and Microsoft SQL Server Agent Jobs were used as the ETL tool and the ETL scheduler, respectively.
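
The actual integration uses SSIS packages; purely as an illustration of the same back-end flow, the sketch below extracts from two agencies' exports, conforms the records, and loads them into one warehouse table. The file names, columns, and target table are invented.

```python
# Not the paper's SSIS package -- just a minimal Python sketch of the same
# extract/conform/load flow. File and column names are hypothetical.
import csv
import sqlite3


def extract(path, source):
    """Read one agency's CSV export, tagging each record with its origin."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield source, row


def load(rows, db="rubber_dw.sqlite"):
    """Load conformed records into a single warehouse fact table."""
    con = sqlite3.connect(db)
    con.execute("""CREATE TABLE IF NOT EXISTS rubber_trade
                   (source TEXT, product TEXT, tonnes REAL)""")
    con.executemany("INSERT INTO rubber_trade VALUES (?, ?, ?)",
                    ((src, r["product"], float(r["tonnes"])) for src, r in rows))
    con.commit()
    con.close()


if __name__ == "__main__":
    rows = list(extract("mrb_export.csv", "MRB"))       # hypothetical files
    rows += list(extract("customs_import.csv", "Customs"))
    load(rows)
```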


2021 ◽  
pp. 1-13
Author(s):  
Setia Pramana ◽  
Siti Mariyah ◽  
Takdir

The rapid development of Big Data, resulting from increasing interaction with online systems by humans (e.g., online shopping, marketplaces) and machines (Internet of Things, mobile phones, etc.), has led to a measurement revolution. If mined and analyzed correctly, these massive data can provide valuable alternative data sources for official statistics, especially price statistics. Several studies on using diverse Big Data as new sources of price statistics in Indonesia have been initiated. This article provides a comprehensive review of experiences in exploiting various Big Data sources for price statistics, followed by current developments and near-future plans. The development of the system and IT infrastructure is also discussed. Based on this experience, the limitations, challenges, and advances of each approach are presented.
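
As one concrete example of turning scraped prices into price statistics (not the article's system): elementary indices for web-scraped data are often computed as a Jevons index, the unweighted geometric mean of price relatives. The prices below are invented.

```python
# Illustration only: a Jevons index over two periods of (invented)
# web-scraped prices. Items missing in either period are dropped,
# a common simplification for unmatched scraped data.
from math import exp, log


def jevons(base_prices, current_prices):
    """Geometric mean of p_t / p_0 over items present in both periods."""
    items = base_prices.keys() & current_prices.keys()
    ratios = [current_prices[i] / base_prices[i] for i in items]
    return exp(sum(map(log, ratios)) / len(ratios))


if __name__ == "__main__":
    p0 = {"rice_1kg": 12_000, "egg_10": 15_000, "oil_1l": 14_000}
    p1 = {"rice_1kg": 12_600, "egg_10": 15_300, "oil_1l": 14_700}
    print(f"index = {100 * jevons(p0, p1):.1f}")  # ~104.0
```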


2011 ◽  
pp. 277-297 ◽  
Author(s):  
Carlo Combi ◽  
Barbara Oliboni

This chapter describes a graph-based approach to representing information stored in a data warehouse by means of a temporal semistructured data model. We consider issues related to the representation of semistructured data warehouses and discuss the set of constraints needed to correctly manage the warehouse time, i.e., the time dimension considered when storing data in the data warehouse itself. We use a temporal semistructured data model because a data warehouse can contain data coming from different, heterogeneous data sources. This means that the data stored in a data warehouse are semistructured in nature: in different documents, the same information can be represented in different ways, and the document schemata may or may not be available. Moreover, the information stored in a data warehouse is often time-varying; thus, as with semistructured data, it is useful to consider time in the data warehouse context as well.
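
A toy sketch of a model of this kind, with invented classes and one typical constraint of the family the chapter discusses: nodes and edges carry validity intervals, and an edge may only be valid while both of its endpoints are valid. This is not the chapter's exact constraint set.

```python
# Toy temporal semistructured graph: nodes and edges labeled with
# validity intervals, plus one containment constraint. Illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int          # year granularity, for brevity
    end: int            # use a large sentinel for "now"

    def contains(self, other: "Interval") -> bool:
        return self.start <= other.start and other.end <= self.end


@dataclass
class Node:
    label: str
    valid: Interval


@dataclass
class Edge:
    src: Node
    dst: Node
    valid: Interval

    def consistent(self) -> bool:
        """An edge may only be valid while both endpoints are valid."""
        return (self.src.valid.contains(self.valid)
                and self.dst.valid.contains(self.valid))


if __name__ == "__main__":
    order = Node("order", Interval(2005, 2010))
    item = Node("item", Interval(2000, 2020))
    print(Edge(order, item, Interval(2006, 2009)).consistent())  # True
    print(Edge(order, item, Interval(2006, 2015)).consistent())  # False
```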


Big Data ◽  
2016 ◽  
pp. 454-492
Author(s):  
Francesco Di Tria ◽  
Ezio Lefons ◽  
Filippo Tangorra

Traditional data warehouse design methodologies are based on two opposite approaches. The first is data-oriented and aims to realize the data warehouse mainly through a reengineering process of well-structured data sources alone, while minimizing the involvement of end users. The second is requirement-oriented and aims to realize the data warehouse only on the basis of business goals expressed by end users, with no regard to the information obtainable from the data sources. Since these approaches cannot address the problems that arise when dealing with big data, the need has emerged for hybrid methodologies, which allow multidimensional schemas to be defined by considering user requirements and reconciling them against non-structured data sources. As a counterpart, hybrid methodologies may require a more complex design process. For this reason, current research is devoted to introducing automatisms that reduce the design effort and support the designer in creating the big data warehouse. In this chapter, the authors present a methodology based on a hybrid approach that adopts a graph-based multidimensional model. In order to automate the whole design process, the methodology has been implemented using logic programming.
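
A rough sketch of what a graph-based multidimensional model can look like, with invented node names: the schema is a directed graph whose fact node points to dimension levels and whose edges encode roll-ups, so aggregation paths can be derived mechanically. This is illustrative, not the authors' formal model.

```python
# Illustrative graph-based multidimensional schema: a fact node points to
# dimension levels, and roll-up edges link each level to coarser ones.
from collections import defaultdict

graph = defaultdict(list)          # adjacency list: node -> coarser levels


def add_rollup(finer, coarser):
    graph[finer].append(coarser)


def rollup_paths(level, path=()):
    """Enumerate aggregation paths from a node upward through the hierarchy."""
    path = path + (level,)
    if not graph[level]:           # top of the hierarchy reached
        yield path
    for up in graph[level]:
        yield from rollup_paths(up, path)


if __name__ == "__main__":
    add_rollup("sales_fact", "store")   # fact -> dimension level
    add_rollup("store", "city")
    add_rollup("city", "country")
    for p in rollup_paths("sales_fact"):
        print(" -> ".join(p))           # sales_fact -> store -> city -> country
```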


Author(s):  
Sam Schutte ◽  
Thilini Ariyachandra ◽  
Mark Frolick

Test-driven development is a software development methodology that has recently gained a great deal of traction in the software development community. It focuses on creating software-based test cases that define the business requirements of an application before beginning the coding of the application itself. This paper proposes that test-driven development could be a useful methodology for data warehouse projects, in that it could help team members avoid some of the major pitfalls of data warehousing, and result in a higher-quality end product.
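
As an illustration only (the paper does not prescribe a specific toolchain): in a test-driven approach, a test encoding the warehouse's business rules is written before the load logic, and the ETL code is then developed until the test passes. The table, columns, and rules below are invented.

```python
# Hypothetical test-first DW example: the test below would be written
# before the load logic, encoding business rules up front.
import sqlite3


def load_sales(con):
    """Stub ETL step under test; a real load would read from the sources."""
    con.execute("CREATE TABLE sales_fact (customer_key INTEGER, amount REAL)")
    con.execute("INSERT INTO sales_fact VALUES (1, 9.5), (2, 3.0)")


def test_no_orphan_keys_and_positive_amounts():
    con = sqlite3.connect(":memory:")
    load_sales(con)
    nulls = con.execute(
        "SELECT COUNT(*) FROM sales_fact WHERE customer_key IS NULL").fetchone()[0]
    negatives = con.execute(
        "SELECT COUNT(*) FROM sales_fact WHERE amount <= 0").fetchone()[0]
    assert nulls == 0 and negatives == 0   # business rules defined up front


if __name__ == "__main__":
    test_no_orphan_keys_and_positive_amounts()
    print("requirements test passed")
```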

