A Context-Based Performance Enhancement Algorithm for Columnar Storage in MapReduce with Hive

2013 ◽  
Vol 3 (4) ◽  
pp. 38-50 ◽  
Author(s):  
Yashvardhan Sharma ◽  
Saurabh Verma ◽  
Sumit Kumar ◽  
Shivam U.

To achieve high reliability and scalability, most large-scale data warehouse systems have adopted a cluster-based architecture. In this context, MapReduce has emerged as a promising architecture for large-scale data warehousing and data analytics on commodity clusters. The MapReduce framework offers several attractive features, such as high fault tolerance, scalability, and support for a wide range of hardware from the low to the high end. However, these benefits come at a substantial performance cost. In this paper, we propose the design of a novel cluster-based data warehouse system, Daenyrys, for data processing on Hadoop, an open-source implementation of the MapReduce framework under the Apache umbrella. Daenyrys is a data management system that can decide on the optimum partitioning scheme for Hadoop's distributed file system (DFS). The optimum partitioning scheme improves the performance of the complete framework, and the choice of partitioning is query-context dependent. In Daenyrys, columns are formed into optimized groups that provide the basis for partitioning tables vertically. Daenyrys includes an algorithm that monitors the context of current queries and, based on these observations, re-partitions the DFS for better performance and resource utilization. In the proposed system, Hive, a MapReduce-based SQL-like query engine, is supported on top of the DFS.
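The abstract does not publish the Daenyrys grouping algorithm itself, but the core idea it describes — forming columns into groups based on query context as the basis for vertical partitioning — can be sketched. The following is a purely illustrative Python sketch under the assumption that "query context" means which columns are accessed together in recent queries; the function name, threshold parameter, and merging strategy are all hypothetical, not the authors' actual method.

```python
from collections import defaultdict
from itertools import combinations

def group_columns(query_log, threshold=2):
    """Hypothetical sketch: group columns that co-occur in at least
    `threshold` queries, as a basis for vertical partitioning.
    `query_log` is a list of column-name lists, one per observed query."""
    cooccur = defaultdict(int)
    for cols in query_log:
        for a, b in combinations(sorted(set(cols)), 2):
            cooccur[(a, b)] += 1
    # Union-find style merging of frequently co-accessed columns.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (a, b), n in cooccur.items():
        if n >= threshold:
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for c in {c for cols in query_log for c in cols}:
        groups[find(c)].add(c)
    return [sorted(g) for g in groups.values()]
```

Each returned group would map to one vertical partition (column file) in the DFS, so that frequently co-queried columns are read together; a re-partitioning monitor like the one the abstract describes could rerun this grouping as the query mix shifts.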


2010 ◽  
Vol 1 (3) ◽  
pp. 66-76
Author(s):  
Nenad Jukic ◽  
Miguel Velasco

Defining data warehouse requirements is widely recognized as one of the most important steps in the larger data warehouse system development process. This paper examines the potential risks and pitfalls within the data warehouse requirement collection and definition process. A real scenario of a large-scale data warehouse implementation is given, and details of this project, which ultimately failed due to inadequate requirement collection and definition process, are described. The presented case underscores and illustrates the impact of the requirement collection and definition process on the data warehouse implementation, while the case is analyzed within the context of the existing approaches, methodologies, and best practices for prevention and avoidance of typical data warehouse requirement errors and oversights.


Biotechnology ◽  
2019 ◽  
pp. 1177-1189
Author(s):  
Sagar Ap. ◽  
Pooja Mehta ◽  
Anuradha J. ◽  
B.K. Tripathy

The integration of computer science with bioscience has led to the new field of computational biology, which has created opportunities to speed up the analysis of biological data. DNA sequence analysis, especially finding base pairs, helps identify the order of nucleotides present in all living beings; it also supports forensics through DNA profiling and paternity testing. Such sequence analysis has been a challenging task in computational biology due to the large volumes of data and the need for greater computational resources. Using a distributed file system with distributed computation of tasks is one solution to this problem. In this paper, the authors use Spark, an engine for large-scale data processing, to analyze DNA sequences and extract base pairs, and they also attempt to improve base-pair extraction with improved algorithms.
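The abstract's approach — splitting a long DNA sequence across workers, counting nucleotides per chunk, and merging the partial results — follows the classic map/reduce decomposition. The authors' Spark code is not given, so the sketch below illustrates the same decomposition in plain Python; the function names, chunk size, and the complement-pairing step are assumptions for illustration, not the paper's implementation.

```python
from collections import Counter
from functools import reduce

# In the DNA double helix, A pairs with T and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def map_chunk(chunk):
    """Map step: count valid nucleotides in one chunk of the sequence."""
    return Counter(base for base in chunk if base in COMPLEMENT)

def reduce_counts(a, b):
    """Reduce step: merge two partial nucleotide counts."""
    return a + b

def base_pair_profile(sequence, chunk_size=4):
    """Split a DNA string into chunks, count nucleotides chunk by chunk
    (sequentially here; a Spark job would distribute the map step),
    then report the complementary base-pair totals."""
    chunks = [sequence[i:i + chunk_size]
              for i in range(0, len(sequence), chunk_size)]
    counts = reduce(reduce_counts, map(map_chunk, chunks), Counter())
    return {"A-T": counts["A"] + counts["T"],
            "C-G": counts["C"] + counts["G"]}
```

In Spark, `map_chunk` and `reduce_counts` would be passed to `map` and `reduce` over an RDD of sequence chunks, letting the cluster parallelize the counting over large genomes.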


Author(s):  
Beshoy Morkos ◽  
Shraddha Joshi ◽  
Joshua D. Summers ◽  
Gregory G. Mocko

This paper presents an industrial case study of an in-house data management system developed for an automation firm. This data management system has been in use, and evolving, over a span of fifteen years. To ensure the system is robust enough to withstand the future growth of the corporation, a study was conducted to identify deficiencies that may prohibit efficient large-scale data management. Specifically, the case study focused on how project requirements are managed and explored issues of perceived utility in the system. Two major findings are presented: completion metrics are neither consistent nor expressive of actual needs, and there is no linking between activities and the original client requirements. The results of the study were thus used to illustrate the potential vulnerability created by such deficiencies.


2016 ◽  
Vol 6 (1) ◽  
pp. 59-87 ◽  
Author(s):  
Amer Al-Badarneh ◽  
Amr Mohammad ◽  
Salah Harb

MapReduce, a distinguished and successful platform for parallel data processing, is attracting significant momentum from both academia and industry as the volume of data to capture, transform, and analyze grows rapidly. Although MapReduce is used in many applications to analyze large-scale data sets, there is still much debate among scientists and researchers about its efficiency, performance, and usability for supporting more classes of applications. This survey presents a comprehensive review of various implementations of the MapReduce framework. The authors first give an overview of the MapReduce programming model. They then broadly describe the technical aspects of the most successful MapReduce implementations reported in the literature and discuss their main strengths and weaknesses. Finally, they conclude by comparing MapReduce implementations and discussing open issues and challenges in enhancing MapReduce.
