Choosing a Data Storage Format in the Apache Hadoop System Based on Experimental Evaluation Using Apache Spark

Symmetry ◽  
2021 ◽  
Vol 13 (2) ◽  
pp. 195 ◽  
Author(s):  
Vladimir Belov ◽  
Andrey Tatarintsev ◽  
Evgeny Nikulchev

One of the most important tasks of any platform for big data processing is storing the received data. Different systems impose different requirements on big data storage formats, which raises the problem of choosing the optimal format for the task at hand. This paper describes the five most popular formats for storing big data, presents an experimental evaluation of these formats, and proposes a methodology for choosing among them. The following data storage formats were considered: Avro, CSV, JSON, ORC, and Parquet. At the first stage, a comparative analysis of the main characteristics of the studied formats was carried out; at the second stage, an experimental evaluation of these formats was prepared and conducted. For the experiment, a test bench was deployed with big data processing tools installed on it. The aim of the experiment was to measure characteristics of the data storage formats, such as storage volume and processing speed for different operations, using the Apache Spark framework. In addition, within the study, an algorithm for choosing the optimal format from the presented alternatives was developed using tropical optimization methods. The result of the study is presented in the form of a technique for obtaining a vector of ratings of data storage formats for the Apache Hadoop system, based on an experimental assessment using Apache Spark.
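To make the experimental procedure concrete, the sketch below times write and read operations for the five formats with PySpark. It is a minimal illustration, not the authors' benchmark code; the synthetic data, output paths, and the external spark-avro dependency are assumptions.

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-benchmark").getOrCreate()

    # A small synthetic DataFrame stands in for the experimental dataset.
    df = spark.range(1_000_000).withColumnRenamed("id", "value")

    formats = ["avro", "csv", "json", "orc", "parquet"]
    for fmt in formats:
        path = f"/tmp/bench/{fmt}"
        start = time.time()
        # Avro requires the external spark-avro package on the classpath.
        df.write.mode("overwrite").format(fmt).save(path)
        write_s = time.time() - start

        start = time.time()
        spark.read.format(fmt).load(path).count()  # force a full read
        read_s = time.time() - start
        print(f"{fmt}: write {write_s:.2f}s, read {read_s:.2f}s")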


Author(s):  
Abou_el_ela Abdou Hussein

Day by day, advanced web technologies have led to tremendous growth in the volume of data generated daily. This mountain of huge and distributed data sets leads to the phenomenon called big data: collections of massive, heterogeneous, unstructured, and complex data. The big data life cycle can be represented as collecting (capturing), storing, distributing, manipulating, interpreting, analyzing, investigating, and visualizing data. Traditional techniques such as Relational Database Management Systems (RDBMS) cannot handle big data because of their inherent limitations, so advances in computing architecture are required to handle both the storage requirements and the heavy processing needed to analyze huge volumes and varieties of data economically. Among the many technologies for manipulating big data, one of the most prominent and well-known is Hadoop, an open-source distributed data processing framework. Apache Hadoop is based on the Google File System and the MapReduce programming paradigm, as sketched below. In this paper we survey big data characteristics, starting from the first three V's, which have been extended over time by researchers to more than fifty-six V's, and compare the proposals of different researchers to reach the best representation and a precise clarification of all big data V's. We highlight the challenges that face big data processing, show how to overcome them using Hadoop, and discuss its use in processing big data sets as a solution for various problems in a distributed, cloud-based environment. The paper mainly focuses on the different components of Hadoop, such as Hive, Pig, and HBase. We also give a description of Hadoop's pros and cons and of improvements that address Hadoop's problems, including a proposed cost-efficient scheduler algorithm for heterogeneous Hadoop systems.
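As a small aid to the MapReduce description above, the following plain-Python word-count sketch mimics the map, shuffle, and reduce phases conceptually; it is not a real Hadoop job, and the sample input is invented.

    from itertools import groupby
    from operator import itemgetter

    def mapper(line):
        # Emit (word, 1) pairs, as a Hadoop mapper would.
        for word in line.split():
            yield (word.lower(), 1)

    def reducer(word, counts):
        # Sum the counts for one key, as a Hadoop reducer would.
        return (word, sum(counts))

    lines = ["big data needs distributed storage", "big data needs parallel processing"]
    pairs = [kv for line in lines for kv in mapper(line)]
    # The "shuffle" phase: group intermediate pairs by key.
    pairs.sort(key=itemgetter(0))
    results = [reducer(k, (c for _, c in grp)) for k, grp in groupby(pairs, key=itemgetter(0))]
    print(results)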



2017 ◽  
Vol 10 (3) ◽  
pp. 597-602
Author(s):  
Jyotindra Tiwari ◽  
Dr. Mahesh Pawar ◽  
Dr. Anjajana Pandey

Big Data is defined by the 3Vs: variety, volume, and velocity. The volume of data is very large, the data exist in a variety of file types, and the data grow very rapidly. Big data storage and processing have always been major issues, and big data has become even more challenging to handle in recent years. To handle it, high-performance techniques have been introduced, and several frameworks such as Apache Hadoop have been developed to process big data. Apache Hadoop provides MapReduce to process big data, but MapReduce can be accelerated further. In this paper, a survey of techniques for MapReduce acceleration and energy-efficient computation is presented.
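One widely cited way to accelerate map/reduce-style workloads, illustrated with the hedged PySpark sketch below, is in-memory caching of intermediate results; this shows the general idea only, not one of the surveyed techniques, and the input path is hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mapreduce-acceleration").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.textFile("/tmp/input.txt")  # hypothetical input path
    counts = (rdd.flatMap(lambda line: line.split())
                 .map(lambda w: (w, 1))
                 .reduceByKey(lambda a, b: a + b)
                 .cache())               # keep results in memory for reuse

    print(counts.count())                # first action materializes and caches
    print(counts.take(5))                # later actions reuse the cached RDD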



Author(s):  
Ankit Shah ◽  
Mamta C. Padole

Big Data processing and analysis require tremendous processing capability. Distributed computing brings many commodity systems onto a common platform to answer the need for Big Data processing and analysis, and Apache Hadoop is the most suitable set of tools for Big Data storage, processing, and analysis. However, Hadoop proves inefficient when run on heterogeneous sets of computers with different processing capabilities. In this research, we propose the Saksham model, which optimizes processing time through efficient use of node processing capability and file management. The proposed model shows a performance improvement for Big Data processing. To achieve better performance, the Saksham model uses two vital aspects of heterogeneous distributed computing: an effective block rearrangement policy and the use of node processing capability. The results demonstrate that the proposed model achieves better job execution time and improves data locality.
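The toy sketch below illustrates only the general idea of assigning more blocks to more capable nodes; it is not the Saksham algorithm, and the node names and capacity figures are invented.

    def assign_blocks(blocks, node_capacity):
        """Distribute block IDs proportionally to each node's relative capacity."""
        total = sum(node_capacity.values())
        assignment = {node: [] for node in node_capacity}
        nodes = sorted(node_capacity, key=node_capacity.get, reverse=True)
        # Greedy proportional split: faster nodes receive larger shares first.
        start = 0
        for node in nodes:
            share = round(len(blocks) * node_capacity[node] / total)
            assignment[node] = blocks[start:start + share]
            start += share
        # Any leftover blocks (from rounding) go to the fastest node.
        assignment[nodes[0]].extend(blocks[start:])
        return assignment

    blocks = [f"blk_{i}" for i in range(10)]
    print(assign_blocks(blocks, {"nodeA": 4, "nodeB": 2, "nodeC": 1}))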



2021 ◽  
Vol 11 (18) ◽  
pp. 8651
Author(s):  
Vladimir Belov ◽  
Alexander N. Kosenkov ◽  
Evgeny Nikulchev

One of the most popular approaches to building analytical platforms involves the concept of data lakes. A data lake is a storage system in which data are kept in their original format, which makes it difficult to run analytics or present aggregated data directly. To solve this issue, data marts are used: environments of highly specialized stored data, focused on the requests of the employees of a particular department or line of an organization's work. This article presents a study of big data storage formats in the Apache Hadoop platform when used to build data marts.
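A minimal PySpark sketch of the data-lake-to-data-mart step described above follows; the paths, schema, and aggregation are assumptions chosen for illustration, and Parquet stands in for whichever storage format the study selects.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("data-mart").getOrCreate()

    # Raw events land in the lake in their original JSON form (hypothetical path).
    raw = spark.read.json("hdfs:///lake/raw/sales/")

    # A sales-department mart: one aggregated, query-friendly table.
    mart = (raw.groupBy("region", "product")
               .agg(F.sum("amount").alias("total_amount"),
                    F.count("*").alias("orders")))

    mart.write.mode("overwrite").partitionBy("region").parquet("hdfs:///marts/sales_summary/")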



At present, Big Data applications such as social networking, medical healthcare, agriculture, banking, the stock market, education, Facebook, and so on generate data at a very high rate. The volume and velocity of Big Data play a significant role in the performance of Big Data applications, and that performance can be affected by various parameters; search speed, efficiency, and accuracy are some of the dominant parameters that influence the overall execution of any Big Data application. Owing to the direct and indirect involvement of the 7 V's characteristics of Big Data, every Big Data service expects high performance, and high performance is the biggest challenge in the present changing scenario. In this paper we propose a Big Data classification approach to speed up Big Data applications. This is a review paper: we refer to various Big Data technologies and to related work in the field of Big Data classification. After studying and understanding the literature, we identify the gaps in existing work and techniques. Finally, we propose a novel approach to Big Data classification. Our approach relies on deep learning and the Apache Spark architecture. The proposed work consists of two stages: the first stage is feature selection and the second stage is Big Data classification. Apache Spark is the most suitable and prominent technology for executing this proposed work. Apache Spark has two sets of nodes, initial nodes and final nodes; feature selection takes place in the initial nodes and Big Data classification takes place in the final nodes of Apache Spark.
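The hedged sketch below mirrors the two-stage structure described above using Spark MLlib: a feature-selection stage followed by a classifier. The dataset path, column names, and the particular algorithms (ChiSqSelector and a small feed-forward network as a stand-in for the deep learning model) are assumptions, not the paper's exact design.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import ChiSqSelector
    from pyspark.ml.classification import MultilayerPerceptronClassifier
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bigdata-classification").getOrCreate()

    # Expecting a DataFrame with a "features" vector column (categorical/count
    # valued, as ChiSqSelector assumes) and a numeric "label" column.
    train = spark.read.parquet("hdfs:///data/train_features/")  # hypothetical path

    # Stage 1: feature selection.
    selector = ChiSqSelector(numTopFeatures=20, featuresCol="features",
                             outputCol="selected", labelCol="label")
    # Stage 2: classification with a small feed-forward network.
    mlp = MultilayerPerceptronClassifier(featuresCol="selected", labelCol="label",
                                         layers=[20, 16, 8, 2], maxIter=100)

    model = Pipeline(stages=[selector, mlp]).fit(train)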



Author(s):  
Santosh Jankatti ◽  
Raghavendra B. K. ◽  
Raghavendra S. ◽  
Meenakshi Meenakshi

Big data is one of the biggest challenges, as it requires huge processing power and good algorithms to support decision making. This calls for a Hadoop environment with Pig, Hive, machine learning, and other Hadoop ecosystem components. The data come from industry, from the many devices and sensors around us, and from social media sites. According to McKinsey, there will be a shortage of 15,000,000 big data professionals by the end of 2020. There are many technologies for solving the problems of big data storage and processing, such as Apache Hadoop, Apache Spark, and Apache Kafka. Here we analyze the processing speed for 4 GB of data on CloudxLab using Hadoop MapReduce with varying numbers of mappers and reducers, Pig scripts, Hive queries, and the Spark environment, together with machine learning technology. From the results we can say that machine learning with Hadoop enhances processing performance along with Spark, and that Spark is better than Hadoop MapReduce, Pig, and Hive; Spark combined with Hive and machine learning gives the best performance compared with Pig, Hive, and the Hadoop MapReduce jar.
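As an illustration of the kind of timing comparison described above, the sketch below runs the same aggregation through the DataFrame API and through a Spark SQL query (standing in for Hive-style queries). The dataset path and column name are assumptions; this is not the authors' CloudxLab setup.

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("speed-comparison").getOrCreate()

    df = spark.read.csv("hdfs:///data/sample_4gb/", header=True, inferSchema=True)
    df.createOrReplaceTempView("events")

    start = time.time()
    df.groupBy("category").count().collect()
    print(f"DataFrame API: {time.time() - start:.2f}s")

    start = time.time()
    spark.sql("SELECT category, COUNT(*) FROM events GROUP BY category").collect()
    print(f"Spark SQL:     {time.time() - start:.2f}s")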



Author(s):  
Md. Armanur Rahman ◽  
J. Hossen ◽  
Venkataseshaiah C ◽  
CK Ho ◽  
Tan Kim Geok ◽  
...  

The Apache Hadoop framework is an open-source implementation of MapReduce for processing and storing big data. However, getting the best performance from it is a big challenge because of its large number of configuration parameters. In this paper, the critical issues of the Hadoop system, big data, and machine learning are highlighted, and an analysis of some machine learning techniques applied so far to improve Hadoop performance is presented. Then, a promising machine learning technique using a deep learning algorithm is proposed for improving Hadoop system performance.
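To illustrate the general idea of learning a mapping from configuration parameters to job performance (not the technique proposed in the paper), the sketch below fits a small neural-network regressor on synthetic data; the chosen parameters, data, and model are assumptions.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    # Hypothetical features: io.sort.mb, number of reduce tasks, container memory (MB).
    X = rng.uniform([100, 1, 1024], [1024, 64, 8192], size=(500, 3))
    # Synthetic runtimes standing in for measured job execution times (seconds).
    y = 600 - 0.1 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * X[:, 2] + rng.normal(0, 5, 500)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = make_pipeline(StandardScaler(),
                          MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0))
    model.fit(X_train, y_train)
    print("R^2 on held-out configurations:", model.score(X_test, y_test))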



2019 ◽  
Vol 8 (1) ◽  
pp. 14 ◽  
Author(s):  
Elham Nazari ◽  
Mohammad Hasan Shahriari ◽  
Hamed Tabesh

Introduction: Health care data are increasing, and correct analysis of such data can improve the quality of care and reduce costs. These data have certain features, such as high volume, variety, and high-speed production, that make them impossible to analyze with ordinary hardware and software platforms, so choosing the right platform for managing them is very important. The purpose of this study is to introduce and compare the most popular and most widely used platform for processing big data, Apache Hadoop MapReduce, with the Apache Spark and Apache Flink platforms, which have recently gained great prominence.

Material and Methods: This study is a survey based on a subject search of the ProQuest, PubMed, Google Scholar, Science Direct, Scopus, IranMedex, Irandoc, Magiran, ParsMedline, and Scientific Information Database (SID) databases, as well as web reviews and specialized books, using related keywords and standards. Finally, 80 articles related to the subject of the study were reviewed.

Results: The findings showed that the studied platforms differ in features such as data processing model, support for different languages, processing speed, computational model, memory management, optimization, latency, fault tolerance, scalability, performance, compatibility, security, and so on. Overall, the Apache Hadoop environment offers simplicity, error detection, and cluster-based scalability management, but because its processing is batch based it is slow for complex analyses and does not support stream processing. Apache Spark is a distributed computational platform that can process a big data set in memory with a very fast response time. Apache Flink allows users to store data in memory and load them multiple times, and provides a sophisticated fault tolerance mechanism that continuously recovers the state of the data stream.

Conclusion: The choice of a big data analysis and processing platform varies according to the needs. In other words, these technologies are complementary; each is applicable in a particular field and cannot be separated from the others. Depending on the purpose and the expected outcome, the appropriate platform must be selected for analysis, or custom tools must be designed on top of these platforms.
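The in-memory reuse attributed to Apache Spark in the results above can be illustrated with the short PySpark sketch below; the dataset path and column names are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("inmemory-demo").getOrCreate()

    records = spark.read.parquet("hdfs:///health/records/").cache()  # hypothetical path

    records.count()                                   # first action loads and caches
    records.groupBy("department").count().show()      # served from memory
    records.filter("age > 65").count()                # served from memory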



Author(s):  
Сергей Юрьевич Золотов ◽  
Игорь Юрьевич Турчановский

An experiment on the use of Apache Big Data technologies in climate system research is described. Speeding up the computations with Apache Big Data technologies proved achievable, and the most effective approach was found in the fourth variant of solving the test problem: converting the source data sets into a format suitable for storage in a distributed file system and applying Spark SQL from the Apache Big Data stack for parallel data processing on computing clusters. The core of the Apache Big Data stack consists of two technologies: Apache Hadoop for organizing distributed file storage of unlimited capacity and Apache Spark for organizing parallel computing on computing clusters. The combination of Apache Spark and Apache Hadoop is fully applicable for creating big data processing systems. The main idea implemented by Spark is dividing data into separate parts (partitions) and processing these parts in the memory of many computers connected within a network. Data are sent only when needed, and Spark automatically detects when the exchange should take place. For testing, we chose the problem of calculating the monthly, annual, and seasonal trends in the temperature of the Earth's atmosphere for the period from 1960 to 2010 according to the NCEP/NCAR and JRA-55 reanalysis data. During the experiment, four variants of solving the test problem were implemented. The first variant is the simplest implementation, without parallelism. The second variant assumes parallel reading of data from the local file system, aggregation, and calculation of trends. The third variant runs the test problem on a two-node cluster: the NCEP and JRA-55 reanalysis files were placed in their original format in Hadoop storage (HDFS), which combines the disk subsystems of the two computers. The disadvantage of this variant is that all reanalysis files are loaded completely into the random access memory of the worker process. The solution proposed in the fourth variant is to pre-convert the original file format into a form in which reading from HDFS is selective, based on the specified parameters.
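A hedged PySpark sketch of the idea behind the fourth variant follows: convert the source data to a partitioned, column-oriented format in HDFS so that reads can be selective. The schema, paths, and partition keys are assumptions, not the exact pipeline used in the experiment.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("climate-trends").getOrCreate()

    # One-time conversion: flattened reanalysis records -> Parquet partitioned by year.
    raw = spark.read.csv("hdfs:///reanalysis/ncep_flat/", header=True, inferSchema=True)
    raw.write.mode("overwrite").partitionBy("year").parquet("hdfs:///reanalysis/ncep_parquet/")

    # Selective read: only the requested years and variable are loaded from HDFS.
    temps = (spark.read.parquet("hdfs:///reanalysis/ncep_parquet/")
                  .filter((F.col("year").between(1960, 2010)) &
                          (F.col("variable") == "air_temperature")))

    # Monthly means as input to the trend calculation.
    monthly = temps.groupBy("year", "month").agg(F.avg("value").alias("mean_temp"))
    monthly.show()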





