Choosing a Data Storage Format in the Apache Hadoop System Based on Experimental Evaluation Using Apache Spark

Symmetry ◽  
2021 ◽  
Vol 13 (2) ◽  
pp. 195 ◽  
Author(s):  
Vladimir Belov ◽  
Andrey Tatarintsev ◽  
Evgeny Nikulchev

One of the most important tasks of any platform for big data processing is storing the received data. Different systems impose different requirements on big data storage formats, which raises the problem of choosing the optimal format for the task at hand. This paper describes the five most popular formats for storing big data, presents an experimental evaluation of these formats, and proposes a methodology for choosing among them. The following data storage formats were considered: Avro, CSV, JSON, ORC, and Parquet. At the first stage, a comparative analysis of the main characteristics of the studied formats was carried out; at the second stage, an experimental evaluation of these formats was prepared and conducted. For the experiment, a test bench was deployed with big data processing tools installed on it. The aim of the experiment was to measure characteristics of the data storage formats, such as storage volume and processing speed for different operations, using the Apache Spark framework. In addition, within the study, an algorithm for choosing the optimal format from the presented alternatives was developed using tropical optimization methods. The result of the study is presented in the form of a technique for obtaining a vector of ratings of data storage formats for the Apache Hadoop system, based on an experimental assessment using Apache Spark.
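To make the experimental procedure concrete, the sketch below times write and read operations for the five formats with PySpark. It is a minimal illustration, not the authors' benchmark code; the synthetic data, output paths, and the external spark-avro dependency are assumptions.

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-benchmark").getOrCreate()

    # A small synthetic DataFrame stands in for the experimental dataset.
    df = spark.range(1_000_000).withColumnRenamed("id", "value")

    formats = ["avro", "csv", "json", "orc", "parquet"]
    for fmt in formats:
        path = f"/tmp/bench/{fmt}"
        start = time.time()
        # Avro requires the external spark-avro package on the classpath.
        df.write.mode("overwrite").format(fmt).save(path)
        write_s = time.time() - start

        start = time.time()
        spark.read.format(fmt).load(path).count()  # force a full read
        read_s = time.time() - start
        print(f"{fmt}: write {write_s:.2f}s, read {read_s:.2f}s")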


Author(s):  
Abou_el_ela Abdou Hussein

Day by day, advanced web technologies have led to tremendous growth in the volume of data generated daily. This mountain of huge and distributed data sets leads to the phenomenon called big data: collections of massive, heterogeneous, unstructured, and complex data. The big data life cycle can be represented as collecting (capturing), storing, distributing, manipulating, interpreting, analyzing, investigating, and visualizing data. Traditional techniques such as Relational Database Management Systems (RDBMS) cannot handle big data because of their inherent limitations, so advances in computing architecture are required to handle both the storage requirements and the heavy processing needed to analyze huge volumes and varieties of data economically. Among the many technologies for manipulating big data, one of the most prominent and well-known is Hadoop, an open-source distributed data processing framework. Apache Hadoop is based on the Google File System and the MapReduce programming paradigm, as sketched below. In this paper we survey big data characteristics, starting from the first three V's, which have been extended over time by researchers to more than fifty-six V's, and compare the proposals of different researchers to reach the best representation and a precise clarification of all big data V's. We highlight the challenges that face big data processing, show how to overcome them using Hadoop, and discuss its use in processing big data sets as a solution for various problems in a distributed, cloud-based environment. The paper mainly focuses on the different components of Hadoop, such as Hive, Pig, and HBase. We also give a description of Hadoop's pros and cons and of improvements that address Hadoop's problems, including a proposed cost-efficient scheduler algorithm for heterogeneous Hadoop systems.
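As a small aid to the MapReduce description above, the following plain-Python word-count sketch mimics the map, shuffle, and reduce phases conceptually; it is not a real Hadoop job, and the sample input is invented.

    from itertools import groupby
    from operator import itemgetter

    def mapper(line):
        # Emit (word, 1) pairs, as a Hadoop mapper would.
        for word in line.split():
            yield (word.lower(), 1)

    def reducer(word, counts):
        # Sum the counts for one key, as a Hadoop reducer would.
        return (word, sum(counts))

    lines = ["big data needs distributed storage", "big data needs parallel processing"]
    pairs = [kv for line in lines for kv in mapper(line)]
    # The "shuffle" phase: group intermediate pairs by key.
    pairs.sort(key=itemgetter(0))
    results = [reducer(k, (c for _, c in grp)) for k, grp in groupby(pairs, key=itemgetter(0))]
    print(results)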



2017 ◽  
Vol 10 (3) ◽  
pp. 597-602
Author(s):  
Jyotindra Tiwari ◽  
Dr. Mahesh Pawar ◽  
Dr. Anjajana Pandey

Big Data is defined by the 3Vs: variety, volume, and velocity. The volume of data is very large, the data exist in a variety of file types, and the data grow very rapidly. Big data storage and processing have always been major issues, and big data has become even more challenging to handle in recent years. To handle it, high-performance techniques have been introduced, and several frameworks such as Apache Hadoop have been developed to process big data. Apache Hadoop provides MapReduce to process big data, but MapReduce can be accelerated further. In this paper, a survey of techniques for MapReduce acceleration and energy-efficient computation is presented.
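One widely cited way to accelerate map/reduce-style workloads, illustrated with the hedged PySpark sketch below, is in-memory caching of intermediate results; this shows the general idea only, not one of the surveyed techniques, and the input path is hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mapreduce-acceleration").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.textFile("/tmp/input.txt")  # hypothetical input path
    counts = (rdd.flatMap(lambda line: line.split())
                 .map(lambda w: (w, 1))
                 .reduceByKey(lambda a, b: a + b)
                 .cache())               # keep results in memory for reuse

    print(counts.count())                # first action materializes and caches
    print(counts.take(5))                # later actions reuse the cached RDD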



Author(s):  
Ankit Shah ◽  
Mamta C. Padole

Big Data processing and analysis require tremendous processing capability. Distributed computing brings many commodity systems onto a common platform to answer the need for Big Data processing and analysis, and Apache Hadoop is the most suitable set of tools for Big Data storage, processing, and analysis. However, Hadoop proves inefficient when run on heterogeneous sets of computers with different processing capabilities. In this research, we propose the Saksham model, which optimizes processing time through efficient use of node processing capability and file management. The proposed model shows a performance improvement for Big Data processing. To achieve better performance, the Saksham model uses two vital aspects of heterogeneous distributed computing: an effective block rearrangement policy and the use of node processing capability. The results demonstrate that the proposed model achieves better job execution time and improves data locality.
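The toy sketch below illustrates only the general idea of assigning more blocks to more capable nodes; it is not the Saksham algorithm, and the node names and capacity figures are invented.

    def assign_blocks(blocks, node_capacity):
        """Distribute block IDs proportionally to each node's relative capacity."""
        total = sum(node_capacity.values())
        assignment = {node: [] for node in node_capacity}
        nodes = sorted(node_capacity, key=node_capacity.get, reverse=True)
        # Greedy proportional split: faster nodes receive larger shares first.
        start = 0
        for node in nodes:
            share = round(len(blocks) * node_capacity[node] / total)
            assignment[node] = blocks[start:start + share]
            start += share
        # Any leftover blocks (from rounding) go to the fastest node.
        assignment[nodes[0]].extend(blocks[start:])
        return assignment

    blocks = [f"blk_{i}" for i in range(10)]
    print(assign_blocks(blocks, {"nodeA": 4, "nodeB": 2, "nodeC": 1}))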



2021 ◽  
Vol 11 (18) ◽  
pp. 8651
Author(s):  
Vladimir Belov ◽  
Alexander N. Kosenkov ◽  
Evgeny Nikulchev

One of the most popular approaches to building analytical platforms involves the concept of data lakes. A data lake is a storage system in which data are kept in their original format, which makes it difficult to run analytics or present aggregated data directly. To solve this issue, data marts are used: environments of highly specialized stored data, focused on the requests of the employees of a particular department or line of an organization's work. This article presents a study of big data storage formats in the Apache Hadoop platform when used to build data marts.
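A minimal PySpark sketch of the data-lake-to-data-mart step described above follows; the paths, schema, and aggregation are assumptions chosen for illustration, and Parquet stands in for whichever storage format the study selects.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("data-mart").getOrCreate()

    # Raw events land in the lake in their original JSON form (hypothetical path).
    raw = spark.read.json("hdfs:///lake/raw/sales/")

    # A sales-department mart: one aggregated, query-friendly table.
    mart = (raw.groupBy("region", "product")
               .agg(F.sum("amount").alias("total_amount"),
                    F.count("*").alias("orders")))

    mart.write.mode("overwrite").partitionBy("region").parquet("hdfs:///marts/sales_summary/")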



At present, Big Data applications such as social networking, medical healthcare, agriculture, banking, the stock market, education, Facebook, and so on generate data at a very high rate. The volume and velocity of Big Data play a significant role in the performance of Big Data applications, and that performance can be affected by various parameters; search speed, efficiency, and accuracy are some of the dominant parameters that influence the overall execution of any Big Data application. Owing to the direct and indirect involvement of the 7 V's characteristics of Big Data, every Big Data service expects high performance, and high performance is the biggest challenge in the present changing scenario. In this paper we propose a Big Data classification approach to speed up Big Data applications. This is a review paper: we refer to various Big Data technologies and to related work in the field of Big Data classification. After studying and understanding the literature, we identify the gaps in existing work and techniques. Finally, we propose a novel approach to Big Data classification. Our approach relies on deep learning and the Apache Spark architecture. The proposed work consists of two stages: the first stage is feature selection and the second stage is Big Data classification. Apache Spark is the most suitable and prominent technology for executing this proposed work. Apache Spark has two sets of nodes, initial nodes and final nodes; feature selection takes place in the initial nodes and Big Data classification takes place in the final nodes of Apache Spark.
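The hedged sketch below mirrors the two-stage structure described above using Spark MLlib: a feature-selection stage followed by a classifier. The dataset path, column names, and the particular algorithms (ChiSqSelector and a small feed-forward network as a stand-in for the deep learning model) are assumptions, not the paper's exact design.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import ChiSqSelector
    from pyspark.ml.classification import MultilayerPerceptronClassifier
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bigdata-classification").getOrCreate()

    # Expecting a DataFrame with a "features" vector column (categorical/count
    # valued, as ChiSqSelector assumes) and a numeric "label" column.
    train = spark.read.parquet("hdfs:///data/train_features/")  # hypothetical path

    # Stage 1: feature selection.
    selector = ChiSqSelector(numTopFeatures=20, featuresCol="features",
                             outputCol="selected", labelCol="label")
    # Stage 2: classification with a small feed-forward network.
    mlp = MultilayerPerceptronClassifier(featuresCol="selected", labelCol="label",
                                         layers=[20, 16, 8, 2], maxIter=100)

    model = Pipeline(stages=[selector, mlp]).fit(train)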



Author(s):  
Santosh Jankatti ◽  
Raghavendra B. K. ◽  
Raghavendra S. ◽  
Meenakshi Meenakshi

Big data is one of the biggest challenges, as it requires huge processing power and good algorithms to support decision making. This calls for a Hadoop environment with Pig, Hive, machine learning, and other Hadoop ecosystem components. The data come from industry, from the many devices and sensors around us, and from social media sites. According to McKinsey, there will be a shortage of 15,000,000 big data professionals by the end of 2020. There are many technologies for solving the problems of big data storage and processing, such as Apache Hadoop, Apache Spark, and Apache Kafka. Here we analyze the processing speed for 4 GB of data on CloudxLab using Hadoop MapReduce with varying numbers of mappers and reducers, Pig scripts, Hive queries, and the Spark environment, together with machine learning technology. From the results we can say that machine learning with Hadoop enhances processing performance along with Spark, and that Spark is better than Hadoop MapReduce, Pig, and Hive; Spark combined with Hive and machine learning gives the best performance compared with Pig, Hive, and the Hadoop MapReduce jar.
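As an illustration of the kind of timing comparison described above, the sketch below runs the same aggregation through the DataFrame API and through a Spark SQL query (standing in for Hive-style queries). The dataset path and column name are assumptions; this is not the authors' CloudxLab setup.

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("speed-comparison").getOrCreate()

    df = spark.read.csv("hdfs:///data/sample_4gb/", header=True, inferSchema=True)
    df.createOrReplaceTempView("events")

    start = time.time()
    df.groupBy("category").count().collect()
    print(f"DataFrame API: {time.time() - start:.2f}s")

    start = time.time()
    spark.sql("SELECT category, COUNT(*) FROM events GROUP BY category").collect()
    print(f"Spark SQL:     {time.time() - start:.2f}s")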



Author(s):  
Md. Armanur Rahman ◽  
J. Hossen ◽  
Venkataseshaiah C ◽  
CK Ho ◽  
Tan Kim Geok ◽  
...  

The Apache Hadoop framework is an open-source implementation of MapReduce for processing and storing big data. However, getting the best performance from it is a big challenge because of its large number of configuration parameters. In this paper, the critical issues of the Hadoop system, big data, and machine learning are highlighted, and an analysis of some machine learning techniques applied so far to improve Hadoop performance is presented. Then, a promising machine learning technique using a deep learning algorithm is proposed for improving Hadoop system performance.
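To illustrate the general idea of learning a mapping from configuration parameters to job performance (not the technique proposed in the paper), the sketch below fits a small neural-network regressor on synthetic data; the chosen parameters, data, and model are assumptions.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    # Hypothetical features: io.sort.mb, number of reduce tasks, container memory (MB).
    X = rng.uniform([100, 1, 1024], [1024, 64, 8192], size=(500, 3))
    # Synthetic runtimes standing in for measured job execution times (seconds).
    y = 600 - 0.1 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * X[:, 2] + rng.normal(0, 5, 500)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = make_pipeline(StandardScaler(),
                          MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0))
    model.fit(X_train, y_train)
    print("R^2 on held-out configurations:", model.score(X_test, y_test))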



2019 ◽  
Vol 8 (1) ◽  
pp. 14 ◽  
Author(s):  
Elham Nazari ◽  
Mohammad Hasan Shahriari ◽  
Hamed Tabesh

Introduction: Health care data are increasing, and correct analysis of such data can improve the quality of care and reduce costs. These data have certain features, such as high volume, variety, and high-speed production, that make them impossible to analyze with ordinary hardware and software platforms, so choosing the right platform for managing them is very important. The purpose of this study is to introduce and compare the most popular and most widely used platform for processing big data, Apache Hadoop MapReduce, with the Apache Spark and Apache Flink platforms, which have recently gained great prominence.

Material and Methods: This study is a survey based on a subject search of the ProQuest, PubMed, Google Scholar, Science Direct, Scopus, IranMedex, Irandoc, Magiran, ParsMedline, and Scientific Information Database (SID) databases, as well as web reviews and specialized books, using related keywords and standards. Finally, 80 articles related to the subject of the study were reviewed.

Results: The findings showed that the studied platforms differ in features such as data processing model, support for different languages, processing speed, computational model, memory management, optimization, latency, fault tolerance, scalability, performance, compatibility, security, and so on. Overall, the Apache Hadoop environment offers simplicity, error detection, and cluster-based scalability management, but because its processing is batch based it is slow for complex analyses and does not support stream processing. Apache Spark is a distributed computational platform that can process a big data set in memory with a very fast response time. Apache Flink allows users to store data in memory and load them multiple times, and provides a sophisticated fault tolerance mechanism that continuously recovers the state of the data stream.

Conclusion: The choice of a big data analysis and processing platform varies according to the needs. In other words, these technologies are complementary; each is applicable in a particular field and cannot be separated from the others. Depending on the purpose and the expected outcome, the appropriate platform must be selected for analysis, or custom tools must be designed on top of these platforms.
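The in-memory reuse attributed to Apache Spark in the results above can be illustrated with the short PySpark sketch below; the dataset path and column names are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("inmemory-demo").getOrCreate()

    records = spark.read.parquet("hdfs:///health/records/").cache()  # hypothetical path

    records.count()                                   # first action loads and caches
    records.groupBy("department").count().show()      # served from memory
    records.filter("age > 65").count()                # served from memory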



Author(s):  
Сергей Юрьевич Золотов ◽  
Игорь Юрьевич Турчановский

An experiment on the use of Apache Big Data technologies in climate system research is described. Speeding up the computations with Apache Big Data technologies proved achievable, and the most effective approach was found in the fourth variant of solving the test problem: converting the source data sets into a format suitable for storage in a distributed file system and applying Spark SQL from the Apache Big Data stack for parallel data processing on computing clusters. The core of the Apache Big Data stack consists of two technologies: Apache Hadoop for organizing distributed file storage of unlimited capacity and Apache Spark for organizing parallel computing on computing clusters. The combination of Apache Spark and Apache Hadoop is fully applicable for creating big data processing systems. The main idea implemented by Spark is dividing data into separate parts (partitions) and processing these parts in the memory of many computers connected within a network. Data are sent only when needed, and Spark automatically detects when the exchange should take place. For testing, we chose the problem of calculating the monthly, annual, and seasonal trends in the temperature of the Earth's atmosphere for the period from 1960 to 2010 according to the NCEP/NCAR and JRA-55 reanalysis data. During the experiment, four variants of solving the test problem were implemented. The first variant is the simplest implementation, without parallelism. The second variant assumes parallel reading of data from the local file system, aggregation, and calculation of trends. The third variant runs the test problem on a two-node cluster: the NCEP and JRA-55 reanalysis files were placed in their original format in Hadoop storage (HDFS), which combines the disk subsystems of the two computers. The disadvantage of this variant is that all reanalysis files are loaded completely into the random access memory of the worker process. The solution proposed in the fourth variant is to pre-convert the original file format into a form in which reading from HDFS is selective, based on the specified parameters.
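A hedged PySpark sketch of the idea behind the fourth variant follows: convert the source data to a partitioned, column-oriented format in HDFS so that reads can be selective. The schema, paths, and partition keys are assumptions, not the exact pipeline used in the experiment.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("climate-trends").getOrCreate()

    # One-time conversion: flattened reanalysis records -> Parquet partitioned by year.
    raw = spark.read.csv("hdfs:///reanalysis/ncep_flat/", header=True, inferSchema=True)
    raw.write.mode("overwrite").partitionBy("year").parquet("hdfs:///reanalysis/ncep_parquet/")

    # Selective read: only the requested years and variable are loaded from HDFS.
    temps = (spark.read.parquet("hdfs:///reanalysis/ncep_parquet/")
                  .filter((F.col("year").between(1960, 2010)) &
                          (F.col("variable") == "air_temperature")))

    # Monthly means as input to the trend calculation.
    monthly = temps.groupBy("year", "month").agg(F.avg("value").alias("mean_temp"))
    monthly.show()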





