Hadoop System
Recently Published Documents


TOTAL DOCUMENTS: 36 (five years: 11)

H-INDEX: 4 (five years: 1)

Author(s):  
Badr-Eddine Boudriki Semlali ◽  
Chaker El Amrani

Currently, remote sensing is widely used in environmental monitoring applications, most notably air quality mapping and climate change supervision. However, satellite sensors produce massive volumes of data in near-real-time, stored in multiple formats and delivered with high velocity and variety, which makes processing satellite big data challenging. This study therefore aims to demonstrate that satellite data qualify as big data and proposes a new big data architecture for satellite data processing. The developed software enables efficient ingestion and preprocessing of remote sensing big data. The experimental results show that 86 percent of the unnecessary daily files are discarded, and that data cleansing removes 20 percent of the erroneous and inaccurate plots. The final output is integrated into the Hadoop system, particularly HDFS, HBase, and Hive, for further computation and processing.
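The abstract describes an ingest-and-filter pipeline that discards unusable daily files before loading the rest into HDFS. A minimal Python sketch of that kind of filter follows; it assumes the third-party "hdfs" WebHDFS client, and the NameNode address, file pattern, and size threshold are invented for illustration (the paper's actual pipeline is not shown here):

import os
from hdfs import InsecureClient  # third-party "hdfs" WebHDFS client

# Hypothetical NameNode address and user; the paper does not specify these.
client = InsecureClient("http://namenode:9870", user="ingest")

def ingest_daily_files(local_dir, hdfs_dir, min_bytes=1024):
    """Discard unusable daily files, then upload the remainder to HDFS."""
    for name in os.listdir(local_dir):
        path = os.path.join(local_dir, name)
        # Illustrative cleansing step: skip non-data or undersized files,
        # analogous to the large share of daily files the paper discards.
        if not name.endswith(".nc") or os.path.getsize(path) < min_bytes:
            continue
        client.upload(os.path.join(hdfs_dir, name), path)

ingest_daily_files("/data/incoming", "/satellite/raw")

The HBase and Hive integration steps the abstract mentions would follow after this upload stage.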


Symmetry ◽  
2021 ◽  
Vol 13 (2) ◽  
pp. 195 ◽  
Author(s):  
Vladimir Belov ◽  
Andrey Tatarintsev ◽  
Evgeny Nikulchev

One of the most important tasks of any platform for big data processing is storing the data received. Different systems have different requirements for big data storage formats, which raises the problem of choosing the optimal format for the task at hand. This paper describes the five most popular formats for storing big data, presents an experimental evaluation of these formats, and proposes a methodology for choosing among them. The following storage formats are considered: Avro, CSV, JSON, ORC, and Parquet. At the first stage, a comparative analysis of the main characteristics of the studied formats was carried out; at the second stage, an experimental evaluation of these formats was prepared and conducted. For the experiment, a test stand with big data processing tools was deployed. The aim of the experiment was to measure characteristics of the storage formats, such as storage volume and processing speed for different operations, using the Apache Spark framework. In addition, an algorithm for choosing the optimal format from the presented alternatives was developed using tropical optimization methods. The result of the study is a technique for obtaining a vector of ratings of data storage formats for the Apache Hadoop system, based on an experimental assessment using Apache Spark.
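The abstract does not reproduce the benchmark itself; a sketch of this kind of measurement in PySpark might look as follows. The data is synthetic and the paths are placeholders; Avro is left out of the loop because it requires the external spark-avro package, while the other four formats are built in:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-benchmark").getOrCreate()
df = spark.range(10_000_000).withColumnRenamed("id", "value")  # synthetic data

for fmt in ["csv", "json", "orc", "parquet"]:
    path = f"/tmp/bench.{fmt}"
    t0 = time.time()
    df.write.mode("overwrite").format(fmt).save(path)
    write_s = time.time() - t0
    t0 = time.time()
    spark.read.format(fmt).load(path).count()  # force a full read
    read_s = time.time() - t0
    print(f"{fmt}: write {write_s:.1f}s, read {read_s:.1f}s")

Comparing the on-disk size of each output directory alongside these timings gives the volume-versus-speed trade-off the paper's rating vector is built on.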


Author(s):  
Abou_el_ela Abdou Hussein

Day by day, advances in web technologies have led to tremendous growth in the volume of data generated daily. This mountain of huge, widespread data sets leads to the phenomenon called big data: a collection of massive, heterogeneous, unstructured, and complex data sets. The big data life cycle can be represented as collecting (capture), storing, distributing, manipulating, interpreting, analyzing, investigating, and visualizing the data. Traditional techniques such as the Relational Database Management System (RDBMS) cannot handle big data because of their inherent limitations, so advances in computing architecture are required to handle both the data storage requirements and the heavy processing needed to analyze huge volumes and varieties of data economically. Among the many technologies for manipulating big data, one of the most prominent and well-known is Hadoop, an open source distributed data processing framework for overcoming the problem of handling big data. Apache Hadoop was based on the Google File System and the MapReduce programming paradigm. In this paper we survey big data characteristics, starting from the first three V's, which have been extended over time through research to more than fifty-six V's, and we compare researchers' definitions to reach the best representation and a precise clarification of all the big data V's. We highlight the challenges that face big data processing and how to overcome them using Hadoop, and its use in processing big data sets as a solution to various problems in a distributed cloud-based environment. The paper mainly focuses on the different components of Hadoop, such as Hive, Pig, and HBase, and also gives a thorough description of Hadoop's pros and cons, along with improvements that address Hadoop's problems, including a proposed cost-efficient scheduler algorithm for heterogeneous Hadoop systems.
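As an illustration of the MapReduce programming paradigm the abstract refers to, here is the classic word-count example written in Python for Hadoop Streaming (a standard textbook sketch, not code from the paper):

#!/usr/bin/env python3
# mapper.py -- emits one tab-separated (word, 1) pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- sums counts; Hadoop Streaming delivers keys already sorted.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")

A typical invocation (paths are placeholders):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py -reducer reducer.py \
    -input /in/words -output /out/counts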


Author(s):  
Manisha K. Gupta ◽  
Md. Nadeem Akhtar Hasid ◽  
Sourav Dhar ◽  
H. S. Mruthyunjaya
Keyword(s):  
Big Data ◽  

Information ◽  
2019 ◽  
Vol 10 (7) ◽  
pp. 222 ◽  
Author(s):  
Sungchul Lee ◽  
Ju-Yeon Jo ◽  
Yoohwan Kim

Background: Hadoop has become the base framework for big data systems, built on the simple principle that moving computation is cheaper than moving data. Hadoop increases data locality in the Hadoop Distributed File System (HDFS) to improve system performance: network traffic among the nodes is reduced by increasing the fraction of data-local tasks on each machine. Previous research increased data locality in one of the MapReduce stages to improve Hadoop performance; however, there has been no mathematical performance model for data locality in Hadoop. Methods: This study built a Hadoop performance analysis model with data locality that covers the entire MapReduce process. The paper explains the data locality concept in the map stage and the shuffle stage, and shows how to apply the performance analysis model to increase the performance of a Hadoop system through deep data locality. Results: The research validated deep data locality as a way to increase Hadoop performance through three tests: a simulation-based test, a cloud test, and a physical test. In these tests, the authors improved the Hadoop system by over 34% using deep data locality. Conclusions: Deep data locality improved Hadoop performance by reducing data movement in HDFS.
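The paper's actual performance model is not reproduced in the abstract. The following toy Python model only illustrates the underlying idea that a higher data-local task fraction reduces map-stage time; all parameters are invented and do not come from the paper:

# Toy cost model (not the authors' model): map-stage time as a function
# of the data-local task fraction. Remote reads pay a network penalty.
def map_stage_time(blocks, local_fraction,
                   t_local=1.0,           # seconds to process a local block
                   network_penalty=0.6):  # extra relative cost of a remote read
    local = blocks * local_fraction
    remote = blocks * (1.0 - local_fraction)
    return local * t_local + remote * t_local * (1.0 + network_penalty)

for f in (0.5, 0.8, 1.0):
    print(f"locality {f:.0%}: {map_stage_time(1000, f):.0f}s")

In this toy setting, raising locality from 50% to 100% shortens the map stage by the full network penalty on the formerly remote half of the blocks, which is the effect deep data locality exploits across the map and shuffle stages.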

