A Proposal of Methodology for Designing Big Data Warehouses

Author(s):  
Francesco Di Tria ◽  
Ezio Lefons ◽  
Filippo Tangorra

Big Data warehouses are a new class of databases that largely use unstructured and volatile data for analytical purposes. Examples of such data sources are those coming from the Web, such as social networks and blogs, or from sensor networks, where huge amounts of data may be available only for short intervals of time. To manage massive data sources, a strategy must be adopted for defining multidimensional schemas in the presence of fast-changing situations or even undefined business requirements. In this paper, we propose a design methodology that adopts agile and automatic approaches in order to reduce the time needed to integrate new data sources and to include new business requirements on the fly. The data are immediately available for analysis, since the underlying architecture is based on a virtual data warehouse that does not require an importing phase. Examples of the methodology's application are presented throughout the paper to show the validity of this approach compared to a traditional one.
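
As a rough illustration of the virtual-warehouse idea, the sketch below mediates an analytical query over live source adapters instead of importing their data first. The `Row` schema, adapter names, and data are invented for illustration; this is not the authors' implementation.

```python
# Minimal sketch of a virtual data warehouse: queries are pushed to the
# source adapters at analysis time, so no importing/ETL phase is needed.
# The adapter names and the Row schema are illustrative assumptions,
# not the paper's actual architecture.
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Row:
    source: str
    dimension: str   # e.g., a product category
    measure: float   # e.g., a sales amount


class SourceAdapter:
    """Wraps one live data source (blog feed, sensor stream, ...)."""

    def __init__(self, name: str, records: List[tuple]):
        self.name = name
        self._records = records  # stand-in for a remote query

    def fetch(self) -> Iterable[Row]:
        for dim, val in self._records:
            yield Row(self.name, dim, val)


def virtual_query(adapters: List[SourceAdapter], dimension: str) -> float:
    """Aggregate a measure over all sources on the fly (no staging area)."""
    return sum(r.measure
               for a in adapters
               for r in a.fetch()
               if r.dimension == dimension)


if __name__ == "__main__":
    web = SourceAdapter("web", [("books", 10.0), ("music", 5.0)])
    sensors = SourceAdapter("sensors", [("books", 2.5)])
    print(virtual_query([web, sensors], "books"))  # 12.5
```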

2020 ◽  
Vol 12 (1) ◽  
pp. 1-24
Author(s):  
Khaled Dehdouh ◽  
Omar Boussaid ◽  
Fadila Bentayeb

In the Big Data warehouse context, a column-oriented NoSQL database system is considered a storage model highly suited to data warehouses and online analysis. Indeed, NoSQL models make data scalability easy, and the columnar store is well suited to storing and managing massive data, especially for decisional queries. However, column-oriented NoSQL DBMSs do not offer online analytical processing (OLAP) operators. To build OLAP cubes corresponding to the analysis contexts, the most common way is to integrate other software, such as Hive or Kylin, which provides a CUBE operator to build data cubes. In that case, however, the cube is built according to the row-oriented approach, which does not fully deliver the benefits of a column-oriented one. In this article, the focus is on defining a cube operator called MC-CUBE (MapReduce Columnar CUBE), which builds columnar NoSQL cubes according to the columnar approach, taking into account the non-relational and distributed aspects of data warehouse storage.
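
The article defines MC-CUBE over a distributed column-oriented NoSQL store; the sketch below only illustrates the underlying idea, i.e., computing all 2^n grouping sets of a CUBE as map and reduce steps over column-wise data. The column names, data, and the in-process stand-ins for MapReduce jobs are assumptions for illustration.

```python
# Illustrative sketch only: MC-CUBE targets a distributed column-oriented
# NoSQL store; here plain Python dicts stand in for column families, and
# local generators replace actual MapReduce jobs.
from collections import defaultdict
from itertools import chain, combinations

# Columnar layout: one list per column, aligned by row position.
columns = {
    "city":    ["Lyon", "Lyon", "Bari"],
    "product": ["pen",  "ink",  "pen"],
    "sales":   [10.0,   4.0,    7.0],
}
dims, measure = ("city", "product"), "sales"


def grouping_sets(ds):
    """All 2^n dimension subsets computed by the CUBE operator."""
    return chain.from_iterable(combinations(ds, k) for k in range(len(ds) + 1))


def map_phase(n_rows):
    """Emit (grouping-set key, measure) pairs, reading columns independently.
    The empty tuple plays the role of the ALL (grand total) cell."""
    for i in range(n_rows):
        for gs in grouping_sets(dims):
            key = tuple((d, columns[d][i]) for d in gs)
            yield key, columns[measure][i]


def reduce_phase(pairs):
    """Sum the measure per key, as a MapReduce reducer would."""
    cube = defaultdict(float)
    for key, value in pairs:
        cube[key] += value
    return dict(cube)


if __name__ == "__main__":
    for key, total in sorted(reduce_phase(map_phase(3)).items()):
        print(key or ("ALL",), total)
```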


Author(s):  
Khaled Dehdouh

In the big data warehouse context, a column-oriented NoSQL database system is considered a storage model highly suited to data warehouses and online analysis. Indeed, NoSQL models make data scalability easy, and the columnar store is well suited to storing and managing massive data, especially for decisional queries. However, column-oriented NoSQL DBMSs do not offer online analytical processing (OLAP) operators. To build OLAP cubes corresponding to the analysis contexts, the most common way is to integrate other software, such as Hive or Kylin, which provides a CUBE operator to build data cubes. In that case, however, the cube is built according to the row-oriented approach, which does not fully deliver the benefits of a column-oriented one. In this chapter, the main contribution is to define a cube operator called MC-CUBE (MapReduce Columnar CUBE), which builds columnar NoSQL cubes according to the columnar approach, taking into account the non-relational and distributed aspects of data warehouse storage.


2016 ◽  
Vol 6 (6) ◽  
pp. 1241-1244 ◽  
Author(s):  
M. Faridi Masouleh ◽  
M. A. Afshar Kazemi ◽  
M. Alborzi ◽  
A. Toloie Eshlaghy

Extraction, Transformation and Loading (ETL) is one of the notable subjects in the optimization, management, improvement, and acceleration of processes and operations in databases and data warehouses. Creating ETL processes is potentially one of the greatest tasks of data warehousing, and their production is a time-consuming and complicated procedure. Without optimizing these processes, implementing projects in the data warehouse area is costly, complicated, and time-consuming. The present paper combines parallelization methods with shared cache memory in distributed systems based on the data warehouse. According to the conducted assessment, the proposed method exhibited a 7.1% speed improvement over the Kettle optimization tool and 7.9% over the Talend tool in terms of ETL process execution time. Therefore, parallelization can notably improve the ETL process, ultimately allowing the management and integration of big data to be implemented simply and at acceptable speed.
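
A minimal sketch of the combination described above, assuming an invented normalization step: worker processes run the transformation in parallel while sharing one lookup cache. This illustrates the general technique, not the paper's implementation.

```python
# Hedged sketch: parallel ETL transformation with a cache shared across
# worker processes. The lookup logic and data are invented; a real flow
# would read from staging tables rather than an in-memory list.
from multiprocessing import Manager, Pool
from functools import partial


def transform(row, cache):
    """Normalize one record; the shared cache avoids repeating the lookup.
    (The check-then-set is a simplification; it tolerates duplicate work.)"""
    country = row["country"]
    if country not in cache:             # simulate an expensive lookup
        cache[country] = country.strip().upper()
    return {"country": cache[country], "amount": float(row["amount"])}


def run_parallel_etl(rows, workers=4):
    with Manager() as mgr:
        cache = mgr.dict()               # visible to all worker processes
        with Pool(workers) as pool:
            return pool.map(partial(transform, cache=cache), rows)


if __name__ == "__main__":
    data = [{"country": " france ", "amount": "3.2"},
            {"country": " france ", "amount": "1.1"}]
    print(run_parallel_etl(data))
```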


Transformation is the second step of the ETL process, which is responsible for extracting, transforming, and loading data into a data warehouse. The role of transformation is to set up several operations that clean, format, and unify the types and data coming from multiple, heterogeneous data sources. The goal is to make the data conform to the schema of the data warehouse, so as to avoid ambiguity problems during data storage and analytical operations. Transforming data coming from structured, semi-structured, and unstructured data sources requires two levels of treatment: the first is schema-to-schema transformation, to obtain a unified schema for all the selected data sources; the second is data-to-data transformation, to unify the types and formats of all the data gathered. To set up these steps, this paper proposes a process for switching from one database schema to another, as part of the schema-to-schema transformation, and a meta-model based on the MDA approach to describe the main data-to-data transformation operations. The results of our transformations allow the data to be loaded into one of the four NoSQL schemas, so as to best meet the constraints and requirements of Big Data.
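
A minimal sketch of the two treatment levels, with invented source and field names: a schema-to-schema mapping first unifies field names across sources, then data-to-data casts unify the types and formats of the values.

```python
# Sketch of the two transformation levels described above, with invented
# field names: (1) schema-to-schema maps each source's fields onto one
# unified schema; (2) data-to-data unifies types and formats of values.
from datetime import datetime

# Level 1 -- schema-to-schema: per-source field mappings to a unified schema.
SCHEMA_MAPS = {
    "crm":  {"client": "customer", "total": "amount", "day":  "date"},
    "shop": {"buyer":  "customer", "price": "amount", "when": "date"},
}

# Level 2 -- data-to-data: one canonical type/format per unified field.
CASTS = {
    "customer": lambda v: str(v).strip().title(),
    "amount":   float,
    "date":     lambda v: datetime.strptime(v, "%d/%m/%Y").date().isoformat(),
}


def transform(source, record):
    unified = {SCHEMA_MAPS[source][k]: v for k, v in record.items()}
    return {k: CASTS[k](v) for k, v in unified.items()}


if __name__ == "__main__":
    print(transform("crm",  {"client": " alice ", "total": "12", "day": "02/03/2021"}))
    print(transform("shop", {"buyer": "BOB", "price": 7, "when": "05/03/2021"}))
```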


2015 ◽  
Vol 78 (1) ◽  
Author(s):  
Mimi Safinaz Jamaluddin ◽  
Nurulhuda Firdaus Mohd Azmi

Data integration is important for consolidating all the data inside or outside an organization to provide a unified view of the organization's information. Extraction, Transformation and Loading (ETL) is the back-end process of data integration, which involves collecting data from various data sources, preparing and transforming the data according to business requirements, and loading them into a Data Warehouse (DW). This paper explains the integration of rubber import and export data between the Malaysian Rubber Board (MRB) and the Royal Malaysian Customs Department (Customs) using an ETL solution. Microsoft SQL Server Integration Services (SSIS) and Microsoft SQL Server Agent Jobs were used as the ETL tool and the ETL scheduler, respectively.
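
The actual integration uses SSIS packages; purely as an illustration of the same back-end flow, the sketch below extracts from two agencies' exports, conforms the records, and loads them into one warehouse table. The file names, columns, and target table are invented.

```python
# Not the paper's SSIS package -- just a minimal Python sketch of the same
# extract/conform/load flow. File and column names are hypothetical.
import csv
import sqlite3


def extract(path, source):
    """Read one agency's CSV export, tagging each record with its origin."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield source, row


def load(rows, db="rubber_dw.sqlite"):
    """Load conformed records into a single warehouse fact table."""
    con = sqlite3.connect(db)
    con.execute("""CREATE TABLE IF NOT EXISTS rubber_trade
                   (source TEXT, product TEXT, tonnes REAL)""")
    con.executemany("INSERT INTO rubber_trade VALUES (?, ?, ?)",
                    ((src, r["product"], float(r["tonnes"])) for src, r in rows))
    con.commit()
    con.close()


if __name__ == "__main__":
    rows = list(extract("mrb_export.csv", "MRB"))       # hypothetical files
    rows += list(extract("customs_import.csv", "Customs"))
    load(rows)
```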


2021 ◽  
pp. 1-13
Author(s):  
Setia Pramana ◽  
Siti Mariyah ◽  
Takdir

The rapid development of Big Data, resulting from increasing interaction with online systems by humans (e.g., online shopping, marketplaces) and machines (Internet of Things, mobile phones, etc.), has led to a measurement revolution. If mined and analyzed correctly, these massive data can provide valuable alternative data sources for official statistics, especially price statistics. Several studies on using diverse Big Data as new sources of price statistics in Indonesia have been initiated. This article provides a comprehensive review of experiences in exploiting various Big Data sources for price statistics, followed by current developments and near-future plans. The development of the system and IT infrastructure is also discussed. Based on this experience, the limitations, challenges, and advances of each approach are presented.
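
As one concrete example of turning scraped prices into price statistics (not the article's system): elementary indices for web-scraped data are often computed as a Jevons index, the unweighted geometric mean of price relatives. The prices below are invented.

```python
# Illustration only: a Jevons index over two periods of (invented)
# web-scraped prices. Items missing in either period are dropped,
# a common simplification for unmatched scraped data.
from math import exp, log


def jevons(base_prices, current_prices):
    """Geometric mean of p_t / p_0 over items present in both periods."""
    items = base_prices.keys() & current_prices.keys()
    ratios = [current_prices[i] / base_prices[i] for i in items]
    return exp(sum(map(log, ratios)) / len(ratios))


if __name__ == "__main__":
    p0 = {"rice_1kg": 12_000, "egg_10": 15_000, "oil_1l": 14_000}
    p1 = {"rice_1kg": 12_600, "egg_10": 15_300, "oil_1l": 14_700}
    print(f"index = {100 * jevons(p0, p1):.1f}")  # ~104.0
```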


2011 ◽  
pp. 277-297 ◽  
Author(s):  
Carlo Combi ◽  
Barbara Oliboni

This chapter describes a graph-based approach to representing information stored in a data warehouse by means of a temporal semistructured data model. We consider issues related to the representation of semistructured data warehouses and discuss the set of constraints needed to correctly manage the warehouse time, i.e., the time dimension considered when storing data in the data warehouse itself. We use a temporal semistructured data model because a data warehouse can contain data coming from different, heterogeneous data sources. This means that the data stored in a data warehouse are semistructured in nature: in different documents, the same information can be represented in different ways, and the document schemata may or may not be available. Moreover, the information stored in a data warehouse is often time-varying; thus, as with semistructured data, it is useful to consider time in the data warehouse context as well.
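
A toy sketch of a model of this kind, with invented classes and one typical constraint of the family the chapter discusses: nodes and edges carry validity intervals, and an edge may only be valid while both of its endpoints are valid. This is not the chapter's exact constraint set.

```python
# Toy temporal semistructured graph: nodes and edges labeled with
# validity intervals, plus one containment constraint. Illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int          # year granularity, for brevity
    end: int            # use a large sentinel for "now"

    def contains(self, other: "Interval") -> bool:
        return self.start <= other.start and other.end <= self.end


@dataclass
class Node:
    label: str
    valid: Interval


@dataclass
class Edge:
    src: Node
    dst: Node
    valid: Interval

    def consistent(self) -> bool:
        """An edge may only be valid while both endpoints are valid."""
        return (self.src.valid.contains(self.valid)
                and self.dst.valid.contains(self.valid))


if __name__ == "__main__":
    order = Node("order", Interval(2005, 2010))
    item = Node("item", Interval(2000, 2020))
    print(Edge(order, item, Interval(2006, 2009)).consistent())  # True
    print(Edge(order, item, Interval(2006, 2015)).consistent())  # False
```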


Big Data ◽  
2016 ◽  
pp. 454-492
Author(s):  
Francesco Di Tria ◽  
Ezio Lefons ◽  
Filippo Tangorra

Traditional data warehouse design methodologies are based on two opposite approaches. The first is data-oriented and aims to realize the data warehouse mainly through a reengineering process of well-structured data sources alone, while minimizing the involvement of end users. The second is requirement-oriented and aims to realize the data warehouse only on the basis of business goals expressed by end users, with no regard to the information obtainable from the data sources. Since these approaches cannot address the problems that arise when dealing with big data, the need has emerged for hybrid methodologies, which allow multidimensional schemas to be defined by considering user requirements and reconciling them against non-structured data sources. As a counterpart, hybrid methodologies may require a more complex design process. For this reason, current research is devoted to introducing automatisms that reduce the design effort and support the designer in creating the big data warehouse. In this chapter, the authors present a methodology based on a hybrid approach that adopts a graph-based multidimensional model. In order to automate the whole design process, the methodology has been implemented using logic programming.
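
A rough sketch of what a graph-based multidimensional model can look like, with invented node names: the schema is a directed graph whose fact node points to dimension levels and whose edges encode roll-ups, so aggregation paths can be derived mechanically. This is illustrative, not the authors' formal model.

```python
# Illustrative graph-based multidimensional schema: a fact node points to
# dimension levels, and roll-up edges link each level to coarser ones.
from collections import defaultdict

graph = defaultdict(list)          # adjacency list: node -> coarser levels


def add_rollup(finer, coarser):
    graph[finer].append(coarser)


def rollup_paths(level, path=()):
    """Enumerate aggregation paths from a node upward through the hierarchy."""
    path = path + (level,)
    if not graph[level]:           # top of the hierarchy reached
        yield path
    for up in graph[level]:
        yield from rollup_paths(up, path)


if __name__ == "__main__":
    add_rollup("sales_fact", "store")   # fact -> dimension level
    add_rollup("store", "city")
    add_rollup("city", "country")
    for p in rollup_paths("sales_fact"):
        print(" -> ".join(p))           # sales_fact -> store -> city -> country
```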


Author(s):  
Sam Schutte ◽  
Thilini Ariyachandra ◽  
Mark Frolick

Test-driven development is a software development methodology that has recently gained a great deal of traction in the software development community. It focuses on creating software-based test cases that define the business requirements of an application before beginning the coding of the application itself. This paper proposes that test-driven development could be a useful methodology for data warehouse projects, in that it could help team members avoid some of the major pitfalls of data warehousing, and result in a higher-quality end product.
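
As an illustration only (the paper does not prescribe a specific toolchain): in a test-driven approach, a test encoding the warehouse's business rules is written before the load logic, and the ETL code is then developed until the test passes. The table, columns, and rules below are invented.

```python
# Hypothetical test-first DW example: the test below would be written
# before the load logic, encoding business rules up front.
import sqlite3


def load_sales(con):
    """Stub ETL step under test; a real load would read from the sources."""
    con.execute("CREATE TABLE sales_fact (customer_key INTEGER, amount REAL)")
    con.execute("INSERT INTO sales_fact VALUES (1, 9.5), (2, 3.0)")


def test_no_orphan_keys_and_positive_amounts():
    con = sqlite3.connect(":memory:")
    load_sales(con)
    nulls = con.execute(
        "SELECT COUNT(*) FROM sales_fact WHERE customer_key IS NULL").fetchone()[0]
    negatives = con.execute(
        "SELECT COUNT(*) FROM sales_fact WHERE amount <= 0").fetchone()[0]
    assert nulls == 0 and negatives == 0   # business rules defined up front


if __name__ == "__main__":
    test_no_orphan_keys_and_positive_amounts()
    print("requirements test passed")
```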

