Performance Evaluation of a Big Data Application on Apache Spark

Author(s):  
Jeanne Alcantara

Apache Spark enables a big data application, one that takes massive data as input and may produce massive data during its execution, to run in parallel on multiple nodes. Performance is therefore a vital issue for such an application. This project analyzes a WordCount application running on Apache Spark and assesses how executor settings affect execution time and average node utilization. For this assessment, the number of executor cores and the size of executor memory are varied across different input data sizes and different numbers of nodes in the cluster that the application runs on. It is concluded that different pairs (data size, number of nodes in the cluster) require different numbers of executor cores and different executor memory sizes to obtain optimum execution time and average node utilization.
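
As a rough illustration of the parameter sweep this abstract describes, the sketch below builds one spark-submit command per (executor cores, executor memory) pair. The flags `--executor-cores` and `--executor-memory` are standard spark-submit options; the sweep values, the script name `wordcount.py`, and the input path are hypothetical placeholders, since the abstract does not state the settings that were tested.

```python
from itertools import product

# Hypothetical sweep values; the abstract does not list the tested settings.
EXECUTOR_CORES = [1, 2, 4]
EXECUTOR_MEMORY = ["1g", "2g", "4g"]


def build_commands(app="wordcount.py", input_path="input.txt"):
    """Return one spark-submit argument list per (cores, memory) pair.

    The application name and input path are placeholders for illustration.
    """
    commands = []
    for cores, memory in product(EXECUTOR_CORES, EXECUTOR_MEMORY):
        commands.append([
            "spark-submit",
            "--executor-cores", str(cores),
            "--executor-memory", memory,
            app,
            input_path,
        ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so no cluster is needed.
    for command in build_commands():
        print(" ".join(command))
```

In a real study, each command would be repeated for every data size and cluster size, with execution time and node utilization recorded per run.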

2021 ◽  

2020 ◽  
Vol 10 (23) ◽  
pp. 8524
Author(s):  
Cornelia A. Győrödi ◽  
Diana V. Dumşe-Burescu ◽  
Doina R. Zmaranda ◽  
Robert Ş. Győrödi ◽  
Gianina A. Gabor ◽  
...  

With several types of database systems now available (relational and non-relational), choosing the type and the specific database system for storing large amounts of data in today’s big data applications has become an important challenge. In this paper, we provide a comparative evaluation of two popular open-source database management systems (DBMSs): MySQL as a relational DBMS and, more recently, as a non-relational DBMS, and CouchDB as a non-relational DBMS. The comparison is based on a performance evaluation of CRUD (CREATE, READ, UPDATE, DELETE) operations for different amounts of data, to show how these two databases can be modeled and used in an application and to highlight the differences in response time and complexity. The main objective is a comparative analysis of the impact that each DBMS has on application performance when carrying out CRUD requests. To perform the analysis and ensure consistent tests, two similar applications were developed in Java, one using MySQL and the other using CouchDB; these applications were then used to measure the response times of each database technology for the same CRUD operations. Finally, the results are discussed in detail and several conclusions are drawn. Advantages and drawbacks of each DBMS are outlined to support the choice of a specific type of DBMS for a big data application.
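
The kind of CRUD timing measurement described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the paper's applications were written in Java against MySQL and CouchDB, whereas this sketch uses Python's built-in `sqlite3` module with an in-memory database as a stand-in store. The table name `items`, its columns, and the function `time_crud` are hypothetical.

```python
import sqlite3
import time


def time_crud(n_rows):
    """Time each CRUD operation on n_rows records.

    Returns (timings in seconds per operation, rows read, rows left after delete).
    """
    conn = sqlite3.connect(":memory:")  # stand-in store, not MySQL or CouchDB
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    rows = [(i, f"item-{i}") for i in range(n_rows)]
    timings = {}

    # CREATE: insert all records in one batch.
    start = time.perf_counter()
    conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", rows)
    conn.commit()
    timings["create"] = time.perf_counter() - start

    # READ: fetch every record.
    start = time.perf_counter()
    fetched = conn.execute("SELECT id, name FROM items").fetchall()
    timings["read"] = time.perf_counter() - start

    # UPDATE: modify every record.
    start = time.perf_counter()
    conn.execute("UPDATE items SET name = name || '-updated'")
    conn.commit()
    timings["update"] = time.perf_counter() - start

    # DELETE: remove every record.
    start = time.perf_counter()
    conn.execute("DELETE FROM items")
    conn.commit()
    timings["delete"] = time.perf_counter() - start

    remaining = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    conn.close()
    return timings, len(fetched), remaining
```

Calling `time_crud` with increasing values of `n_rows` mirrors the idea of comparing response times across different amounts of data; the absolute numbers say nothing about MySQL or CouchDB.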


At present, big data applications such as social networking, healthcare, agriculture, banking, stock markets, education, and Facebook generate data at very high speed. The volume and velocity of big data play a fundamental role in the performance of big data applications, which can be affected by various parameters. Fast search, storage, and accuracy are among the dominant parameters that affect the overall performance of any big data application. Because the 7Vs characteristics of big data are involved both directly and indirectly, every big data organization expects high performance, and achieving it is the most visible challenge in the present, changing environment. In this paper we propose a parallel design approach to speed up the search for the nearest neighbor node. The k-NN classifier is the most basic and widely used method for classification. We apply parallelism to k-NN to search for the next nearest neighbor. This neighbor node is then used to fill in missing values and to process the incoming data streams. The classifier also updates and coordinates the out-of-date data streams. We use Apache Spark and a distributed computation space organization for faster evaluation.
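
The parallel nearest-neighbor search mentioned in this abstract can be illustrated with a minimal partition-and-merge sketch. It is not the authors' method: the abstract gives no algorithmic detail, and here a local thread pool stands in for Spark's distributed partitions. The function names and the partitioning scheme are assumptions, and points are assumed to be tuples of numbers.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def local_top_k(partition, query, k):
    """Return the k closest (distance, point) pairs within one partition."""
    scored = ((math.dist(point, query), point) for point in partition)
    return heapq.nsmallest(k, scored)


def parallel_knn(points, query, k, n_partitions=4):
    """Split points into partitions, search each concurrently, merge results."""
    # Round-robin partitioning; a distributed engine would shard the data instead.
    partitions = [points[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = pool.map(lambda part: local_top_k(part, query, k), partitions)
        candidates = [pair for partial in partials for pair in partial]
    # The global k nearest are guaranteed to be among the per-partition top k.
    return [point for _, point in heapq.nsmallest(k, candidates)]
```

Each partition only needs to return its own k best candidates, so the merge step handles at most k times the number of partitions entries regardless of data size. Threads in CPython show the structure of the computation rather than a real speedup.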

