A Context-Based Performance Enhancement Algorithm for Columnar Storage in MapReduce with Hive

2013 ◽  
Vol 3 (4) ◽  
pp. 38-50 ◽  
Author(s):  
Yashvardhan Sharma ◽  
Saurabh Verma ◽  
Sumit Kumar ◽  
Shivam U.

To achieve high reliability and scalability, most large-scale data warehouse systems have adopted a cluster-based architecture. In this context, MapReduce has emerged as a promising architecture for large-scale data warehousing and data analytics on commodity clusters. The MapReduce framework offers several attractive features, such as high fault tolerance, scalability, and support for a wide range of hardware from the low to the high end. However, these benefits come at a substantial performance cost. In this paper, we propose the design of a novel cluster-based data warehouse system, Daenyrys, for data processing on Hadoop, an open-source implementation of the MapReduce framework under the Apache umbrella. Daenyrys is a data management system that can decide on the optimum partitioning scheme for Hadoop's distributed file system (DFS). The optimum partitioning scheme improves the performance of the complete framework, and the choice of partitioning is query-context dependent. In Daenyrys, columns are formed into optimized groups that provide the basis for partitioning tables vertically. Daenyrys includes an algorithm that monitors the context of current queries and, based on these observations, re-partitions the DFS for better performance and resource utilization. In the proposed system, Hive, a MapReduce-based SQL-like query engine, is supported on top of the DFS.
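The abstract does not publish the Daenyrys grouping algorithm itself, but the core idea it describes — forming columns into groups based on query context as the basis for vertical partitioning — can be sketched. The following is a purely illustrative Python sketch under the assumption that "query context" means which columns are accessed together in recent queries; the function name, threshold parameter, and merging strategy are all hypothetical, not the authors' actual method.

```python
from collections import defaultdict
from itertools import combinations

def group_columns(query_log, threshold=2):
    """Hypothetical sketch: group columns that co-occur in at least
    `threshold` queries, as a basis for vertical partitioning.
    `query_log` is a list of column-name lists, one per observed query."""
    cooccur = defaultdict(int)
    for cols in query_log:
        for a, b in combinations(sorted(set(cols)), 2):
            cooccur[(a, b)] += 1
    # Union-find style merging of frequently co-accessed columns.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (a, b), n in cooccur.items():
        if n >= threshold:
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for c in {c for cols in query_log for c in cols}:
        groups[find(c)].add(c)
    return [sorted(g) for g in groups.values()]
```

Each returned group would map to one vertical partition (column file) in the DFS, so that frequently co-queried columns are read together; a re-partitioning monitor like the one the abstract describes could rerun this grouping as the query mix shifts.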


2010 ◽  
Vol 1 (3) ◽  
pp. 66-76
Author(s):  
Nenad Jukic ◽  
Miguel Velasco

Defining data warehouse requirements is widely recognized as one of the most important steps in the larger data warehouse system development process. This paper examines the potential risks and pitfalls within the data warehouse requirement collection and definition process. A real scenario of a large-scale data warehouse implementation is given, and details of this project, which ultimately failed due to inadequate requirement collection and definition process, are described. The presented case underscores and illustrates the impact of the requirement collection and definition process on the data warehouse implementation, while the case is analyzed within the context of the existing approaches, methodologies, and best practices for prevention and avoidance of typical data warehouse requirement errors and oversights.


Biotechnology ◽  
2019 ◽  
pp. 1177-1189
Author(s):  
Sagar Ap. ◽  
Pooja Mehta ◽  
Anuradha J. ◽  
B.K. Tripathy

The integration of computer science with bioscience has led to the new field of computational biology, which has created opportunities to speed up the analysis of biological data. DNA sequence analysis, especially finding base pairs, helps identify the order of nucleotides present in all living beings; it also supports forensics through DNA profiling and paternity testing. Such sequence analysis has been a challenging task in computational biology due to the large volumes of data and the need for greater computational resources. Using a distributed file system with distributed computation of tasks is one solution to this problem. In this paper, the authors use Spark, an engine for large-scale data processing, to analyze DNA sequences and extract base pairs, and they also attempt to improve base-pair extraction with improved algorithms.
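The abstract's approach — splitting a long DNA sequence across workers, counting nucleotides per chunk, and merging the partial results — follows the classic map/reduce decomposition. The authors' Spark code is not given, so the sketch below illustrates the same decomposition in plain Python; the function names, chunk size, and the complement-pairing step are assumptions for illustration, not the paper's implementation.

```python
from collections import Counter
from functools import reduce

# In the DNA double helix, A pairs with T and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def map_chunk(chunk):
    """Map step: count valid nucleotides in one chunk of the sequence."""
    return Counter(base for base in chunk if base in COMPLEMENT)

def reduce_counts(a, b):
    """Reduce step: merge two partial nucleotide counts."""
    return a + b

def base_pair_profile(sequence, chunk_size=4):
    """Split a DNA string into chunks, count nucleotides chunk by chunk
    (sequentially here; a Spark job would distribute the map step),
    then report the complementary base-pair totals."""
    chunks = [sequence[i:i + chunk_size]
              for i in range(0, len(sequence), chunk_size)]
    counts = reduce(reduce_counts, map(map_chunk, chunks), Counter())
    return {"A-T": counts["A"] + counts["T"],
            "C-G": counts["C"] + counts["G"]}
```

In Spark, `map_chunk` and `reduce_counts` would be passed to `map` and `reduce` over an RDD of sequence chunks, letting the cluster parallelize the counting over large genomes.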


Author(s):  
Beshoy Morkos ◽  
Shraddha Joshi ◽  
Joshua D. Summers ◽  
Gregory G. Mocko

This paper presents an industrial case study of an in-house data management system developed for an automation firm. This data management system has been in use, and evolving, over a span of fifteen years. To ensure the system is robust enough to withstand the future growth of the corporation, a study was conducted to identify deficiencies that may prohibit efficient large-scale data management. Specifically, the case study focused on how project requirements are managed and explored issues of perceived utility in the system. Two major findings are presented: completion metrics are neither consistent nor expressive of actual needs, and there is no linking between activities and the original client requirements. The results of the study were thus used to illustrate the potential vulnerability created by such deficiencies.


2016 ◽  
Vol 6 (1) ◽  
pp. 59-87 ◽  
Author(s):  
Amer Al-Badarneh ◽  
Amr Mohammad ◽  
Salah Harb

MapReduce, a distinguished and successful platform for parallel data processing, is attracting significant momentum from both academia and industry as the volume of data to capture, transform, and analyze grows rapidly. Although MapReduce is used in many applications to analyze large-scale data sets, there is still much debate among scientists and researchers about its efficiency, performance, and usability for supporting more classes of applications. This survey presents a comprehensive review of various implementations of the MapReduce framework. The authors first give an overview of the MapReduce programming model. They then broadly describe the technical aspects of the most successful MapReduce implementations reported in the literature and discuss their main strengths and weaknesses. Finally, they conclude by comparing MapReduce implementations and discussing open issues and challenges in enhancing MapReduce.
