apache hadoop Latest Research Papers

The term big data analytics refers to mining and analyzing of the voluminous amount of data in big data by using various tools and platforms. Some of the popular tools are Apache Hadoop, Apache Spark, HBase, Storm, Grid Gain, HPCC, Casandra, Pig, Hive, and No SQL, etc. These tools are used depending on the parameter taken for big data analysis. So, we need a comparative analysis of such analytical tools to choose best and simpler way of analysis to gain more optimal throughput and efficient mining. This chapter contributes to a comparative study of big data analytics tools based on different aspects such as their functionality, pros, and cons based on characteristics that can be used to determine the best and most efficient among them. Through the comparative study, people are capable of using such tools in a more efficient way.

Download Full-text

Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10110763 ◽

2021 ◽

Vol 10 (11) ◽

pp. 763

Author(s):

Panagiotis Moutafis ◽

George Mavrommatis ◽

Michael Vassilakopoulos ◽

Antonio Corral

Keyword(s):

Query Processing ◽

Nearest Neighbor ◽

Apache Spark ◽

Spatial Query ◽

K Nearest Neighbor ◽

Distributed Computing Systems ◽

Spatial Query Processing ◽

Apache Hadoop ◽

The One ◽

Query Algorithm

Aiming at the problem of spatial query processing in distributed computing systems, the design and implementation of new distributed spatial query algorithms is a current challenge. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users to work on distributed in-memory data, without worrying about the data distribution mechanism and fault-tolerance. Given two datasets of points (called Query and Training), the group K nearest-neighbor (GKNN) query retrieves (K) points of the Training with the smallest sum of distances to every point of the Query. This spatial query has been actively studied in centralized environments and several performance improving techniques and pruning heuristics have been also proposed, while, a distributed algorithm in Apache Hadoop was recently proposed by our team. Since, in general, Apache Hadoop exhibits lower performance than Spark, in this paper, we present the first distributed GKNN query algorithm in Apache Spark and compare it against the one in Apache Hadoop. This algorithm incorporates programming features and facilities that are specific to Apache Spark. Moreover, techniques that improve performance and are applicable in Apache Spark are also incorporated. The results of an extensive set of experiments with real-world spatial datasets are presented, demonstrating that our Apache Spark GKNN solution, with its improvements, is efficient and a clear winner in comparison to processing this query in Apache Hadoop.

Download Full-text

A Survey on Security of the Hadoop Framework in the Environment of Bigdata

Journal of Physics Conference Series ◽

10.1088/1742-6596/2089/1/012031 ◽

2021 ◽

Vol 2089 (1) ◽

pp. 012031

Author(s):

Saritha Gattoju ◽

NagaLakshmi Vadlamani

Keyword(s):

Big Data ◽

Cost Effective ◽

Corporate Management ◽

Sensitive Data ◽

Apache Hadoop ◽

Huge Data ◽

Data Platform ◽

Healthcare Finance ◽

Day To Day Operations ◽

The Moment

Abstract The world is becoming increasingly digital at the moment. Every day, a significant amount of data is generated by everyone who uses the internet nowadays. The data are critical for carrying out day-to-day operations, as well as assisting corporate management in achieving their objectives and making the best judgments possible based on the information gathered. BigData is the process of merging many hardware and software solutions to deal with extremely huge amounts of data that surpass storage capability. It’s possible that large amounts of data will be generated. Hadoop systems are used in a variety of areas, including healthcare, finance, and government. insurance, and social media, in order to provide a quick and cost-effective big data solution. The Apache Hadoop is a framework for storing and processing data, managing, and distributing large amounts of information over a large number of server nodes. Here are some solutions that work on top of the Apache Hadoop stack to guarantee data security. To get a complete picture of the problem, we decided to conduct an investigation into existing security solutions for Apache Hadoop security in sensitive data which is stored on a huge data platform employing distributed computing on a cluster of commodity devices. The goal of this paper is to provide knowledge of security and Big Data issues.

Download Full-text

Efficient Transfer of data from RDBMS to HDFS and conversion to JSON format

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.38710 ◽

2021 ◽

Vol 9 (10) ◽

pp. 1869-1871

Author(s):

Dr. C. K. Gomathy

Keyword(s):

Data Warehouse ◽

Relational Databases ◽

Command Line ◽

Command Line Interface ◽

Apache Hadoop ◽

Enterprise Data Warehouse ◽

Efficient Transfer ◽

Efficient Execution

Abstract: Apache Sqoop is mainly used to efficiently transfer large volumes of data between Apache Hadoop and relational databases. It helps to certain tasks, such as ETL (Extract transform load) processing, from an enterprise data warehouse to Hadoop, for efficient execution at a much less cost. Here first we import the table which presents in MYSQL Database with the help of command-line interface application called Sqoop and there is a chance of addition of new rows and updating new rows then we have to execute the query again. So, with the help of our project there is no need of executing queries again for that we are using Sqoop job, which consists of total commands for import and next after import we retrieve the data from hive using Java JDBC and we convert the data to JSON Format, which consists of data in an organized way and easy to access manner by using GSON Library. Keywords: Sqoop, Json, Gson, Maven and JDBC

Download Full-text

A Critical Analysis of Apache Hadoop and Spark for Big Data Processing

10.1109/ispcc53510.2021.9609518 ◽

2021 ◽

Author(s):

Piyush Sewal ◽

Hari Singh

Keyword(s):

Big Data ◽

Data Processing ◽

Critical Analysis ◽

Big Data Processing ◽

Apache Hadoop

Download Full-text

Diseño de una arquitectura para el procesamiento distribuido de grandes volúmenes de datos

Ñawparisun - Revista de Investigación Científica - ÑAWPARISUN - Revista de Investigación Científica ◽

10.47190/nric.v3i3.9 ◽

2021 ◽

Vol 3 (Vol. 3, Num. 3) ◽

pp. 73-77

Keyword(s):

Big Data ◽

Apache Hadoop

Actualmente, Big Data se ha convertido en un concepto que está presente en muchas actividades, y su importancia es debido a que es utilizado en diversos aspectos que conduzcan a mejorar decisiones en el campo empresarial y gubernamental. Es posible analizar los grandes volúmenes de datos, tanto estructurados como no estructurados, que a cada día aumentan en los diferentes negocios y campos del conocimiento. Para obtener resultados satisfactorios es importante diseñar una arquitectura físicamente en base a Hardware Commodity (homogénea, heterogénea), escalable horizontalmente y con tolerancia a fallas. De esta manera, actualmente, con la evolución de las herramientas, es conveniente utilizar un híbrido donde la parte lógica trabaja con el Framework Apache Hadoop 2.0, que realiza el procesamiento de datos en paralelo (utilizando YARN), con almacenamiento HDFS (Sistema de Archivos Distribuidos sobre Hadoop) y agregando Spark para el tratamiento en memoria con respuestas en tiempo real y la utilización de recursos gráficos mediante Apache Ambari.

Download Full-text

A Hybrid Machine Learning Approach for Performance Modeling of Cloud-Based Big Data Applications

The Computer Journal ◽

10.1093/comjnl/bxab131 ◽

2021 ◽

Author(s):

Ehsan Ataie ◽

Athanasia Evangelinou ◽

Eugenio Gianniti ◽

Danilo Ardagna

Keyword(s):

Machine Learning ◽

Big Data ◽

Hybrid Approach ◽

Service Level ◽

Analytical Models ◽

Prediction Errors ◽

Apache Hadoop ◽

Big Data Applications ◽

Machine Learning Approach ◽

Distributed Solutions

Abstract Nowadays, Apache Hadoop and Apache Spark are two of the most prominent distributed solutions for processing big data applications on the market. Since in many cases these frameworks are adopted to support business critical activities, it is often important to predict with fair confidence the execution time of submitted applications, for instance when service-level agreements are established with end-users. In this work, we propose and validate a hybrid approach for the performance prediction of big data applications running on clouds, which exploits both analytical modeling and machine learning (ML) techniques and it is able to achieve a good accuracy without too many time consuming and costly experiments on a real setup. The experimental results show how the proposed approach attains improvement in accuracy, number of experiments to be run on the operational system and cost over applying ML techniques without any support from analytical models. Moreover, we compare our approach with Ernest, an ML-based technique proposed in the literature by the Spark inventors. Experiments show that Ernest can accurately estimate the performance in interpolating scenarios while it fails to predict the performance when configurations with increasing number of cores are considered. Finally, a comparison with a similar hybrid approach proposed in the literature demonstrates how our approach significantly reduce prediction errors especially when few experiments on the real system are performed.

Download Full-text

Experimental Characteristics Study of Data Storage Formats for Data Marts Development within Data Lakes

Applied Sciences ◽

10.3390/app11188651 ◽

2021 ◽

Vol 11 (18) ◽

pp. 8651

Author(s):

Vladimir Belov ◽

Alexander N. Kosenkov ◽

Evgeny Nikulchev

Keyword(s):

Big Data ◽

Data Storage ◽

Storage System ◽

Apache Hadoop ◽

Aggregated Data ◽

Data Marts ◽

Hadoop Platform ◽

Analytical Platforms ◽

Big Data Storage

One of the most popular methods for building analytical platforms involves the use of the concept of data lakes. A data lake is a storage system in which the data are presented in their original format, making it difficult to conduct analytics or present aggregated data. To solve this issue, data marts are used, representing environments of stored data of highly specialized information, focused on the requests of employees of a certain department, the vector of an organization’s work. This article presents a study of big data storage formats in the Apache Hadoop platform when used to build data marts.

Download Full-text

Hadoop solution for large data protection

Physico-mathematical modelling and informational technologies ◽

10.15407/fmmit2021.33.023 ◽

2021 ◽

pp. 23-27

Author(s):

Nataliya Maslova ◽

Olha Polovynka

Keyword(s):

Data Analysis ◽

Data Protection ◽

Large Data ◽

Unstructured Data ◽

Management Systems ◽

Intelligent Data Analysis ◽

Apache Hadoop ◽

Parallel Version ◽

Cryptographic Algorithms ◽

Using Data

Investigated one of large data problems of - providing protection in the process of accumulation and processing. The case of application of Hadoop technology and its latest modification Apache Hadoop 3.3.0 is considered. A solution is proposed with strengthening the protection of processed data, connecting the Apache Knox Gateway, Apache Ranger and Apache Atlas tools. The possibil-ity of using data obtained as a result of the work of local databases, electronic archives, database management systems and individual users is provided. The solution also features the use of a pri-vate cloud and cryptographic algorithms. An example of the implementation of a secure solution to the problem of Intelligent Data Analysis is given on the example of a parallel version of the problem of finding association rules when working with unstructured data of large volumes.

Download Full-text

Performance evaluation of GPU- and cluster-computing for parallelization of compute-intensive tasks

International Journal of Web Information Systems ◽

10.1108/ijwis-03-2021-0032 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Alexander Döschl ◽

Max-Emanuel Keller ◽

Peter Mandl

Keyword(s):

Distributed Computing ◽

Cluster Computing ◽

Graphics Processing Unit ◽

Apache Spark ◽

Processing Unit ◽

Data Set ◽

Apache Hadoop ◽

Content Type ◽

All Solutions ◽

Computationally Intensive

Purpose This paper aims to evaluate different approaches for the parallelization of compute-intensive tasks. The study compares a Java multi-threaded algorithm, distributed computing solutions with MapReduce (Apache Hadoop) and resilient distributed data set (RDD) (Apache Spark) paradigms and a graphics processing unit (GPU) approach with Numba for compute unified device architecture (CUDA). Design/methodology/approach The paper uses a simple but computationally intensive puzzle as a case study for experiments. To find all solutions using brute force search, 15! permutations had to be computed and tested against the solution rules. The experimental application comprises a Java multi-threaded algorithm, distributed computing solutions with MapReduce (Apache Hadoop) and RDD (Apache Spark) paradigms and a GPU approach with Numba for CUDA. The implementations were benchmarked on Amazon-EC2 instances for performance and scalability measurements. Findings The comparison of the solutions with Apache Hadoop and Apache Spark under Amazon EMR showed that the processing time measured in CPU minutes with Spark was up to 30% lower, while the performance of Spark especially benefits from an increasing number of tasks. With the CUDA implementation, more than 16 times faster execution is achievable for the same price compared to the Spark solution. Apart from the multi-threaded implementation, the processing times of all solutions scale approximately linearly. Finally, several application suggestions for the different parallelization approaches are derived from the insights of this study. Originality/value There are numerous studies that have examined the performance of parallelization approaches. Most of these studies deal with processing large amounts of data or mathematical problems. This work, in contrast, compares these technologies on their ability to implement computationally intensive distributed algorithms.

Download Full-text

apache hadoop
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

Big Data Analytics Tools and Platform in Big Data Landscape

Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

A Survey on Security of the Hadoop Framework in the Environment of Bigdata

Efficient Transfer of data from RDBMS to HDFS and conversion to JSON format

A Critical Analysis of Apache Hadoop and Spark for Big Data Processing

Diseño de una arquitectura para el procesamiento distribuido de grandes volúmenes de datos

A Hybrid Machine Learning Approach for Performance Modeling of Cloud-Based Big Data Applications

Experimental Characteristics Study of Data Storage Formats for Data Marts Development within Data Lakes

Hadoop solution for large data protection

Performance evaluation of GPU- and cluster-computing for parallelization of compute-intensive tasks

Export Citation Format

apache hadoopRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

Big Data Analytics Tools and Platform in Big Data Landscape

Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

A Survey on Security of the Hadoop Framework in the Environment of Bigdata

Efficient Transfer of data from RDBMS to HDFS and conversion to JSON format

A Critical Analysis of Apache Hadoop and Spark for Big Data Processing

Diseño de una arquitectura para el procesamiento distribuido de grandes volúmenes de datos

A Hybrid Machine Learning Approach for Performance Modeling of Cloud-Based Big Data Applications

Experimental Characteristics Study of Data Storage Formats for Data Marts Development within Data Lakes

Hadoop solution for large data protection

Performance evaluation of GPU- and cluster-computing for parallelization of compute-intensive tasks

apache hadoop
Recently Published Documents