SDWP: A New Data Placement Strategy for Distributed Big Data Warehouses in Hadoop

A Data Modelling Method for Big Data Warehouses

Information Systems - Lecture Notes in Business Information Processing ◽

10.1007/978-3-030-44322-1_7 ◽

2020 ◽

pp. 85-98

Author(s):

Marta Nogueira ◽

João Galvão ◽

Maribel Y. Santos

Keyword(s):

Big Data ◽

Data Modelling ◽

Data Warehouses ◽

Modelling Method

Download Full-text

SkipSJoin: A New Physical Design for Distributed Big Data Warehouses in Hadoop

Conceptual Modeling - Lecture Notes in Computer Science ◽

10.1007/978-3-030-33223-5_21 ◽

2019 ◽

pp. 255-263

Author(s):

Yassine Ramdane ◽

Nadia Kabachi ◽

Omar Boussaid ◽

Fadila Bentayeb

Keyword(s):

Big Data ◽

Physical Design ◽

Data Warehouses

Download Full-text

Data Mining and Cloud Computing

JITA - Journal of Information Technology and Applications (Banja Luka) - APEIRON ◽

10.7251/jit1202075v ◽

2012 ◽

Vol 4 (2) ◽

Cited By ~ 2

Author(s):

Robert Vrbić

Keyword(s):

Data Mining ◽

Cloud Computing ◽

Big Data ◽

Data Warehouses ◽

Cloud Data ◽

Big Data Mining ◽

New Knowledge ◽

Flexible Infrastructure

Cloud computing provides a powerful, scalable and flexible infrastructure into which one can integrate, previously known, techniques and methods of Data Mining. The result of such integration should be strong and capacitive platform that will be able to deal with the increasing production of data, or that will create the conditions for the efficient mining of massive amounts of data from various data warehouses with the aim of creating (useful) information or the production of new knowledge. This paper discusses such technology - the technology of big data mining, known as Cloud Data Mining (CDM).

Download Full-text

Building OLAP Cubes From Columnar NoSQL Data Warehouses

Emerging Perspectives in Big Data Warehousing - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-5516-2.ch006 ◽

2019 ◽

pp. 129-157

Author(s):

Khaled Dehdouh

Keyword(s):

Big Data ◽

Database System ◽

Massive Data ◽

Data Warehouses ◽

Online Analysis ◽

Storage Model ◽

Data Cubes ◽

Nosql Database ◽

Oriented Approach

In the big data warehouses context, a column-oriented NoSQL database system is considered as the storage model which is highly adapted to data warehouses and online analysis. Indeed, the use of NoSQL models allows data scalability easily and the columnar store is suitable for storing and managing massive data, especially for decisional queries. However, the column-oriented NoSQL DBMS do not offer online analysis operators (OLAP). To build OLAP cubes corresponding to the analysis contexts, the most common way is to integrate other software such as HIVE or Kylin which has a CUBE operator to build data cubes. By using that, the cube is built according to the row-oriented approach and does not allow to fully obtain the benefits of a column-oriented approach. In this chapter, the main contribution is to define a cube operator called MC-CUBE (MapReduce Columnar CUBE), which allows building columnar NoSQL cubes according to the columnar approach by taking into account the non-relational and distributed aspects when data warehouses are stored.

Download Full-text

Efficient Data Placement and Replication for QoS-Aware Approximate Query Evaluation of Big Data Analytics

IEEE Transactions on Parallel and Distributed Systems ◽

10.1109/tpds.2019.2921337 ◽

2019 ◽

Vol 30 (12) ◽

pp. 2677-2691 ◽

Cited By ~ 4

Author(s):

Qiufen Xia ◽

Zichuan Xu ◽

Weifa Liang ◽

Shui Yu ◽

Song Guo ◽

...

Keyword(s):

Big Data ◽

Data Analytics ◽

Big Data Analytics ◽

Data Placement ◽

Query Evaluation ◽

Efficient Data ◽

Approximate Query

Download Full-text

Big Data Warehouses for Smart Industries

Encyclopedia of Big Data Technologies ◽

10.1007/978-3-319-77525-8_204 ◽

2019 ◽

pp. 341-351

Author(s):

Carlos Costa ◽

Carina Andrade ◽

Maribel Yasmina Santos

Keyword(s):

Big Data ◽

Data Warehouses

Download Full-text

Concept of operations for knowledge discovery from Big Data across enterprise data warehouses

10.1117/12.2016321 ◽

2013 ◽

Cited By ~ 3

Author(s):

Sreenivas R. Sukumar ◽

Mohammed M. Olama ◽

Allen W. McNair ◽

James J. Nutaro

Keyword(s):

Big Data ◽

Knowledge Discovery ◽

Data Warehouses ◽

Concept Of Operations

Download Full-text

Optimization of ETL Process in Data Warehouse Through a Combination of Parallelization and Shared Cache Memory

Engineering, Technology & Applied Science Research ◽

10.48084/etasr.849 ◽

2016 ◽

Vol 6 (6) ◽

pp. 1241-1244 ◽

Cited By ~ 2

Author(s):

M. Faridi Masouleh ◽

M. A. Afshar Kazemi ◽

M. Alborzi ◽

A. Toloie Eshlaghy

Keyword(s):

Big Data ◽

Data Warehouse ◽

Cache Memory ◽

Data Warehouses ◽

Data Bases ◽

Shared Cache ◽

Optimization Management ◽

Implementation Time ◽

The Creation ◽

Management Improvement

Extraction, Transformation and Loading (ETL) is introduced as one of the notable subjects in optimization, management, improvement and acceleration of processes and operations in data bases and data warehouses. The creation of ETL processes is potentially one of the greatest tasks of data warehouses and so its production is a time-consuming and complicated procedure. Without optimization of these processes, the implementation of projects in data warehouses area is costly, complicated and time-consuming. The present paper used the combination of parallelization methods and shared cache memory in systems distributed on the basis of data warehouse. According to the conducted assessment, the proposed method exhibited 7.1% speed improvement to kattle optimization instrument and 7.9% to talend instrument in terms of implementation time of the ETL process. Therefore, parallelization could notably improve the ETL process. It eventually caused the management and integration processes of big data to be implemented in a simple way and with acceptable speed.

Download Full-text

Significance of Hierarchical and Markov Clustering in Grouping Aware Data Placement for Data Intensive Applications Having Interest Locality

Scalable Computing Practice and Experience ◽

10.12694/scpe.v19i3.1375 ◽

2018 ◽

Vol 19 (3) ◽

pp. 245-258

Author(s):

Vengadeswaran Shanmugasundaram ◽

Balasundaram Sadhu Ramakrishnan

Keyword(s):

Big Data ◽

Data Placement ◽

Query Execution ◽

Access Pattern ◽

Clustering Techniques ◽

Data Intensive ◽

Markov Clustering ◽

Default Data ◽

Data Intensive Applications ◽

Grouping Behavior

In this data era, massive volumes of data are being generated every second in variety of domains such as Geoscience, Social Web, Finance, e-Commerce, Health Care, Climate modelling, Physics, Astronomy, Government sectors etc. Hadoop has been well-recognized as de factobig data processing platform that have been extensively adopted, and is currently widely used, in many application domains processing Big Data. Even though it is considered as an efficient solution for such complex query processing, it has its own limitation when the data to be processed exhibit interest locality. The data required for any query execution follows grouping behavior wherein only a part of the Big-Data is accessed frequently. During such scenarion, the time taken to execute a queryand return results, increases exponentially as the amount of data increases leading to much waiting time for the user. Since Hadoop default data placement strategy (HDDPS) does not consider such grouping behavior, it does not perform efficiently resulting in lacunas such as decreased local map task execution, increased query execution time etc. Hence proposed an Optimal Data Placement Strategy (ODPS) based on grouping semantics. In this paper we experiment the significance oftwo most promising clustering techniques viz. Hierarchical Agglomerative Clustering (HAC) and Markov Clustering (MCL) in grouping aware data placement for data intensive applications having interest locality. Initially user access pattern is identified by dynamically analyzing history log.Then both clustering techniques (HAC & MCL) are separately applied over the access pattern to obtain independent clusters. These clusters are interpreted and validated to extract the Optimal Data Groupings (ODG). Finally proposed strategy reorganizes the default data layouts in HDFSbased on ODG to achieve maximum parallel execution per group subjective to Load Balancer and Rack Awareness. Our proposed strategy is tested in 10 node cluster placed in a multi rack with Hadoop installed in every node deployed in cloud platform. Proposed strategy reduces the query execution time, significantly improves the data locality and has proved to be more efficient for massive datasets processing in heterogeneous distributed environment. Also MCL shows a marginal improved performance over HAC for queries exhibiting interest localities.

Download Full-text

A Survey on Data Placement Strategy in Big Data Heterogeneous Environments

2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI) ◽

10.1109/icoei.2019.8862676 ◽

2019 ◽

Author(s):

Anilkumar Ambore ◽

Udaya Rani V.

Keyword(s):

Big Data ◽

Data Placement ◽

Heterogeneous Environments

Download Full-text