An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster

2021 · Vol 5 (4) · pp. 65
Author(s): Nasim Ahmed, Andre L. C. Barczak, Mohammad A. Rashid, Teo Susnjak

Big data frameworks play a vital role in storing, processing, and analysing large datasets. Apache Spark has been established as one of the most popular big data engines for its efficiency and reliability. However, one of the significant problems of the Spark system is performance prediction. Spark has more than 150 configurable parameters, and configuring so many of them is a challenging task when determining suitable settings for the system. In this paper, we propose two distinct parallelisation models for performance prediction. Our insight is that each node in a Hadoop cluster can communicate with identical nodes, and a certain function of the non-parallelisable runtime can be estimated accordingly. Both models use simple equations that allow us to predict the runtime when the size of the job and the number of executors are known. The proposed models were evaluated on five HiBench workloads: Kmeans, PageRank, Graph (NWeight), SVM, and WordCount. Each workload's empirical data were fitted with whichever of the two models met the accuracy requirements. The experimental findings show that the models can be a handy and helpful tool for scheduling and planning system deployment.
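As an illustration of the kind of parallelisation model described here, the sketch below fits an Amdahl-style equation, a fixed non-parallelisable term plus a term that shrinks with the number of executors, to measured runtimes and then predicts the runtime at a larger executor count. The functional form and the sample data are illustrative assumptions, not the paper's actual models or benchmark results.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative Amdahl-style model: runtime = serial part + parallel part / executors.
# The functional form and the sample measurements below are assumptions for
# demonstration, not the equations or results reported in the paper.
def runtime_model(executors, serial, parallel):
    return serial + parallel / executors

# Hypothetical measured runtimes (seconds) for a fixed-size workload.
executors = np.array([2, 4, 8, 16, 32], dtype=float)
runtimes = np.array([310.0, 175.0, 110.0, 78.0, 61.0])

(serial, parallel), _ = curve_fit(runtime_model, executors, runtimes)
print(f"estimated non-parallelisable time: {serial:.1f} s")
print(f"predicted runtime with 64 executors: {runtime_model(64, serial, parallel):.1f} s")
```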

Nowadays, clustering plays a vital role in big data, yet it is difficult to analyse and cluster very large volumes of data. Clustering is a procedure for grouping similar data objects of a data set so that there is high intra-cluster similarity within each cluster and low inter-cluster similarity between clusters. Clustering is used in statistical analysis, geographical mapping, biological cell analysis, and Google Maps. The main approaches to clustering include grid-based clustering, density-based clustering, hierarchical methods, and partitioning approaches. In this survey paper we examine these algorithms for large datasets such as big data and report a comparison among them, using time complexity as the main metric to differentiate the algorithms.
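To make the comparison concrete, the snippet below runs a partitioning method (mini-batch k-means) and a density-based method (DBSCAN) on a synthetic dataset and times each. The dataset, parameter values, and library choice are illustrative assumptions and are not part of the survey itself.

```python
import time
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans, DBSCAN

# Synthetic stand-in for a large dataset; purely illustrative.
X, _ = make_blobs(n_samples=50_000, centers=5, n_features=8, random_state=0)

for name, algo in [
    ("MiniBatchKMeans (partitioning)", MiniBatchKMeans(n_clusters=5, random_state=0)),
    ("DBSCAN (density-based)", DBSCAN(eps=1.5, min_samples=10)),
]:
    start = time.perf_counter()
    labels = algo.fit_predict(X)
    n_clusters = len(set(labels) - {-1})  # DBSCAN labels noise points as -1
    print(f"{name}: {time.perf_counter() - start:.2f} s, {n_clusters} clusters found")
```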


2020 · Vol 13 (4) · pp. 790-797
Author(s): Gurjit Singh Bhathal, Amardeep Singh Dhiman

Background: In the current internet scenario, large amounts of data are generated and processed. The Hadoop framework is widely used to store and process big data in a highly distributed manner, yet it is argued that Hadoop is not mature enough to deal with current cyberattacks on the data. Objective: The main objective of the proposed work is to provide a complete security approach comprising authorisation and authentication for users and Hadoop cluster nodes, and to secure the data at rest as well as in transit. Methods: The proposed algorithm uses the Kerberos network authentication protocol to authenticate and authorise users and cluster nodes. Ciphertext-Policy Attribute-Based Encryption (CP-ABE) is used to protect data at rest and in transit. Users encrypt files with their own set of attributes and store them on the Hadoop Distributed File System; only intended users with matching attributes can decrypt those files. Results: The proposed algorithm was implemented with datasets of different sizes, processed with and without encryption. The results show little difference in processing time: performance was affected by 0.8% to 3.1%, a range that also includes the impact of other factors such as system configuration, the number of parallel jobs running, and the virtual environment. Conclusion: The solutions available for handling the big data security problems faced in the Hadoop framework are inefficient or incomplete. A complete security framework is proposed for the Hadoop environment, and the solution is experimentally shown to have little effect on system performance for datasets of different sizes.
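A minimal sketch of the hybrid pattern such a scheme typically implies: the file is encrypted with a fast symmetric cipher, and the symmetric key would then be wrapped under a CP-ABE access policy before the ciphertext is stored on HDFS. The `cpabe_encrypt`/`cpabe_decrypt` calls are hypothetical placeholders for a real CP-ABE implementation; only the symmetric layer here is concrete, and none of this is the authors' exact algorithm.

```python
from cryptography.fernet import Fernet

# Hypothetical CP-ABE wrapper; a real deployment would call an actual CP-ABE
# library here. These two functions are placeholders for illustration only.
def cpabe_encrypt(sym_key: bytes, policy: str) -> bytes:
    raise NotImplementedError("wrap sym_key under the attribute policy")

def cpabe_decrypt(wrapped_key: bytes, user_attribute_key: bytes) -> bytes:
    raise NotImplementedError("unwrap sym_key if the user's attributes satisfy the policy")

def protect_file(plaintext: bytes, policy: str) -> tuple[bytes, bytes]:
    """Encrypt the file with a fresh symmetric key, then wrap that key with CP-ABE."""
    sym_key = Fernet.generate_key()
    ciphertext = Fernet(sym_key).encrypt(plaintext)
    wrapped_key = cpabe_encrypt(sym_key, policy)  # e.g. "(dept:finance AND role:analyst)"
    return ciphertext, wrapped_key                # both would be stored on HDFS

def open_file(ciphertext: bytes, wrapped_key: bytes, user_attribute_key: bytes) -> bytes:
    """Recover the symmetric key via CP-ABE, then decrypt the file."""
    sym_key = cpabe_decrypt(wrapped_key, user_attribute_key)
    return Fernet(sym_key).decrypt(ciphertext)
```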


2020 · Vol 7 (04)
Author(s): Pradeep H K, Jasma Balasangameshwara, K Rajan, Prabhudev Jagadeesh

Irrigation automation plays a vital role in agricultural water management systems, and an efficient automatic irrigation system is crucial to improving crop water productivity. Soil-moisture-based irrigation is an economical and efficient approach to automating irrigation. An experiment was conducted on irrigation automation driven by soil moisture content and crop growth stage. The experimental findings showed that the automatic irrigation system based on the proposed model triggers the water supply accurately from real-time soil moisture values.
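A minimal sketch of the kind of decision logic such a system embodies: the moisture threshold varies with the crop growth stage, and irrigation is triggered when the real-time reading falls below it. The stage names, threshold values, and function names are illustrative assumptions, not the authors' model.

```python
# Illustrative soil-moisture thresholds (% volumetric water content) per growth
# stage. All names and values here are assumptions for demonstration only.
STAGE_THRESHOLDS = {
    "initial": 22.0,
    "development": 26.0,
    "mid_season": 30.0,
    "late_season": 24.0,
}

def should_irrigate(soil_moisture: float, growth_stage: str) -> bool:
    """Trigger irrigation when the reading drops below the stage-specific threshold."""
    return soil_moisture < STAGE_THRESHOLDS[growth_stage]

# Example: a mid-season reading of 27.4% falls below the 30.0% threshold,
# so the water supply would be switched on.
print("valve ON" if should_irrigate(27.4, "mid_season") else "valve OFF")
```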


Author(s): Muhammad Junaid, Shiraz Ali Wagan, Nawab Muhammad Faseeh Qureshi, Choon Sung Nam, Dong Ryeol Shin

2021 · Vol 149 · pp. 40-51
Author(s): Guoli Cheng, Shi Ying, Bingming Wang, Yuhang Li

2021 · Vol 464 · pp. 432-437
Author(s): Mario Juez-Gil, Álvar Arnaiz-González, Juan J. Rodríguez, Carlos López-Nozal, César García-Osorio
Keyword(s): Big Data

2021
Author(s): Samuel Boone, Fabian Kohlmann, Moritz Theile, Wayne Noble, Barry Kohn, ...

The AuScope Geochemistry Network (AGN) and partners Lithodat Pty Ltd are developing AusGeochem, a novel cloud-based platform for Australian-produced geochemistry data from around the globe. The open platform will allow laboratories to upload, archive, disseminate and publish their datasets, as well as perform statistical analyses and data synthesis within the context of large volumes of publicly funded geochemical data. As part of this endeavour, representatives from four Australian low-temperature thermochronology laboratories (University of Melbourne, University of Adelaide, Curtin University and University of Queensland) are advising the AGN and Lithodat on the development of low-temperature thermochronology (LTT)-specific data models for the relational AusGeochem database and its international counterpart, LithoSurfer. These schemas will facilitate the structured archiving of a wide variety of thermochronology data, enabling geoscientists to readily perform LTT Big Data analytics and gain new insights into the thermo-tectonic evolution of Earth's crust.

Adopting established international data reporting best practices, the LTT expert advisory group has designed database schemas for the fission track and (U-Th-Sm)/He methods, as well as for thermal history modelling results and metadata. In addition to recording the parameters required for LTT analyses, the schemas include fields for reference material results and error reporting, allowing AusGeochem users to independently perform QA/QC on data archived in the database. Development of scripts for the automated upload of data directly from analytical instruments into AusGeochem using its open-source Application Programming Interface is currently under way.

The advent of an LTT relational database heralds the beginning of a new era of Big Data analytics in the field of low-temperature thermochronology. By methodically archiving detailed LTT (meta-)data in structured schemas, intractably large datasets comprising thousands of analyses produced by numerous laboratories can be readily interrogated in new and powerful ways. These include rapid derivation of inter-data relationships, facilitating on-the-fly age computation, statistical analysis and data visualisation. With the detailed LTT data stored in relational schemas, measurements can then be re-calculated and re-modelled using user-defined constants and kinetic algorithms, enabling analyses determined with different parameters to be equated and compared across regional to global scales.

The development of this novel tool improves laboratories' ability to manage and share their data in alignment with FAIR data principles, while enabling analysts to readily interrogate intractably large datasets in new and powerful ways.
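The abstract mentions scripts that push data from analytical instruments into AusGeochem through its open-source API. The snippet below sketches what such an upload could look like as a plain HTTP POST; the endpoint URL, field names, and token handling are assumptions for illustration, not the actual AusGeochem/LithoSurfer API.

```python
import requests

# Hypothetical endpoint and payload shape; the real AusGeochem/LithoSurfer API
# defines its own routes, schema fields, and authentication.
API_URL = "https://example.org/ausgeochem/api/v1/ft-analyses"  # placeholder URL
TOKEN = "YOUR_API_TOKEN"                                       # placeholder credential

record = {
    "sample_name": "MEL-001",          # illustrative sample identifier
    "mineral": "apatite",
    "method": "fission_track",
    "pooled_age_ma": 55.2,             # age in Ma
    "pooled_age_1sigma_ma": 3.1,       # 1-sigma uncertainty in Ma
    "reference_material": "Durango",   # reference material for QA/QC fields
}

response = requests.post(
    API_URL,
    json=record,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print("uploaded:", response.json())
```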

