An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster

2021 · Vol 5 (4) · pp. 65
Author(s): Nasim Ahmed, Andre L. C. Barczak, Mohammad A. Rashid, Teo Susnjak

Big data frameworks play a vital role in storing, processing, and analysing large datasets. Apache Spark has been established as one of the most popular big data engines for its efficiency and reliability. However, one of the significant problems of the Spark system is performance prediction. Spark has more than 150 configurable parameters, and configuring so many of them is a challenging task when determining suitable settings for the system. In this paper, we propose two distinct parallelisation models for performance prediction. Our insight is that each node in a Hadoop cluster can communicate with identical nodes, and a certain function of the non-parallelisable runtime can be estimated accordingly. Both models use simple equations that allow us to predict the runtime when the size of the job and the number of executors are known. The proposed models were evaluated on five HiBench workloads: Kmeans, PageRank, Graph (NWeight), SVM, and WordCount. Each workload's empirical data were fitted with whichever of the two models met the accuracy requirements. The experimental findings show that the models can be a handy and helpful tool for scheduling and planning system deployment.
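As an illustration of the kind of parallelisation model described here, the sketch below fits an Amdahl-style equation, a fixed non-parallelisable term plus a term that shrinks with the number of executors, to measured runtimes and then predicts the runtime at a larger executor count. The functional form and the sample data are illustrative assumptions, not the paper's actual models or benchmark results.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative Amdahl-style model: runtime = serial part + parallel part / executors.
# The functional form and the sample measurements below are assumptions for
# demonstration, not the equations or results reported in the paper.
def runtime_model(executors, serial, parallel):
    return serial + parallel / executors

# Hypothetical measured runtimes (seconds) for a fixed-size workload.
executors = np.array([2, 4, 8, 16, 32], dtype=float)
runtimes = np.array([310.0, 175.0, 110.0, 78.0, 61.0])

(serial, parallel), _ = curve_fit(runtime_model, executors, runtimes)
print(f"estimated non-parallelisable time: {serial:.1f} s")
print(f"predicted runtime with 64 executors: {runtime_model(64, serial, parallel):.1f} s")
```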

Nowadays, clustering plays a vital role in big data, yet it is difficult to analyse and cluster very large volumes of data. Clustering is a procedure for grouping similar data objects of a data set so that there is high intra-cluster similarity within each cluster and low inter-cluster similarity between clusters. Clustering is used in statistical analysis, geographical mapping, biological cell analysis, and Google Maps. The main approaches to clustering include grid-based clustering, density-based clustering, hierarchical methods, and partitioning approaches. In this survey paper we examine these algorithms for large datasets such as big data and report a comparison among them, using time complexity as the main metric to differentiate the algorithms.
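To make the comparison concrete, the snippet below runs a partitioning method (mini-batch k-means) and a density-based method (DBSCAN) on a synthetic dataset and times each. The dataset, parameter values, and library choice are illustrative assumptions and are not part of the survey itself.

```python
import time
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans, DBSCAN

# Synthetic stand-in for a large dataset; purely illustrative.
X, _ = make_blobs(n_samples=50_000, centers=5, n_features=8, random_state=0)

for name, algo in [
    ("MiniBatchKMeans (partitioning)", MiniBatchKMeans(n_clusters=5, random_state=0)),
    ("DBSCAN (density-based)", DBSCAN(eps=1.5, min_samples=10)),
]:
    start = time.perf_counter()
    labels = algo.fit_predict(X)
    n_clusters = len(set(labels) - {-1})  # DBSCAN labels noise points as -1
    print(f"{name}: {time.perf_counter() - start:.2f} s, {n_clusters} clusters found")
```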


2020 · Vol 13 (4) · pp. 790-797
Author(s): Gurjit Singh Bhathal, Amardeep Singh Dhiman

Background: In the current internet scenario, large amounts of data are generated and processed. The Hadoop framework is widely used to store and process big data in a highly distributed manner, yet it is argued that Hadoop is not mature enough to deal with current cyberattacks on the data. Objective: The main objective of the proposed work is to provide a complete security approach comprising authorisation and authentication for users and Hadoop cluster nodes, and to secure the data at rest as well as in transit. Methods: The proposed algorithm uses the Kerberos network authentication protocol to authenticate and authorise users and cluster nodes. Ciphertext-Policy Attribute-Based Encryption (CP-ABE) is used to protect data at rest and in transit. Users encrypt files with their own set of attributes and store them on the Hadoop Distributed File System; only intended users with matching attributes can decrypt those files. Results: The proposed algorithm was implemented with datasets of different sizes, processed with and without encryption. The results show little difference in processing time: performance was affected by 0.8% to 3.1%, a range that also includes the impact of other factors such as system configuration, the number of parallel jobs running, and the virtual environment. Conclusion: The solutions available for handling the big data security problems faced in the Hadoop framework are inefficient or incomplete. A complete security framework is proposed for the Hadoop environment, and the solution is experimentally shown to have little effect on system performance for datasets of different sizes.
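A minimal sketch of the hybrid pattern such a scheme typically implies: the file is encrypted with a fast symmetric cipher, and the symmetric key would then be wrapped under a CP-ABE access policy before the ciphertext is stored on HDFS. The `cpabe_encrypt`/`cpabe_decrypt` calls are hypothetical placeholders for a real CP-ABE implementation; only the symmetric layer here is concrete, and none of this is the authors' exact algorithm.

```python
from cryptography.fernet import Fernet

# Hypothetical CP-ABE wrapper; a real deployment would call an actual CP-ABE
# library here. These two functions are placeholders for illustration only.
def cpabe_encrypt(sym_key: bytes, policy: str) -> bytes:
    raise NotImplementedError("wrap sym_key under the attribute policy")

def cpabe_decrypt(wrapped_key: bytes, user_attribute_key: bytes) -> bytes:
    raise NotImplementedError("unwrap sym_key if the user's attributes satisfy the policy")

def protect_file(plaintext: bytes, policy: str) -> tuple[bytes, bytes]:
    """Encrypt the file with a fresh symmetric key, then wrap that key with CP-ABE."""
    sym_key = Fernet.generate_key()
    ciphertext = Fernet(sym_key).encrypt(plaintext)
    wrapped_key = cpabe_encrypt(sym_key, policy)  # e.g. "(dept:finance AND role:analyst)"
    return ciphertext, wrapped_key                # both would be stored on HDFS

def open_file(ciphertext: bytes, wrapped_key: bytes, user_attribute_key: bytes) -> bytes:
    """Recover the symmetric key via CP-ABE, then decrypt the file."""
    sym_key = cpabe_decrypt(wrapped_key, user_attribute_key)
    return Fernet(sym_key).decrypt(ciphertext)
```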


2020 · Vol 7 (04)
Author(s): Pradeep H K, Jasma Balasangameshwara, K Rajan, Prabhudev Jagadeesh

Irrigation automation plays a vital role in agricultural water management systems, and an efficient automatic irrigation system is crucial to improving crop water productivity. Soil-moisture-based irrigation is an economical and efficient approach to automating irrigation. An experiment was conducted on irrigation automation driven by soil moisture content and crop growth stage. The experimental findings showed that the automatic irrigation system based on the proposed model triggers the water supply accurately from real-time soil moisture values.
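A minimal sketch of the kind of decision logic such a system embodies: the moisture threshold varies with the crop growth stage, and irrigation is triggered when the real-time reading falls below it. The stage names, threshold values, and function names are illustrative assumptions, not the authors' model.

```python
# Illustrative soil-moisture thresholds (% volumetric water content) per growth
# stage. All names and values here are assumptions for demonstration only.
STAGE_THRESHOLDS = {
    "initial": 22.0,
    "development": 26.0,
    "mid_season": 30.0,
    "late_season": 24.0,
}

def should_irrigate(soil_moisture: float, growth_stage: str) -> bool:
    """Trigger irrigation when the reading drops below the stage-specific threshold."""
    return soil_moisture < STAGE_THRESHOLDS[growth_stage]

# Example: a mid-season reading of 27.4% falls below the 30.0% threshold,
# so the water supply would be switched on.
print("valve ON" if should_irrigate(27.4, "mid_season") else "valve OFF")
```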


Author(s): Muhammad Junaid, Shiraz Ali Wagan, Nawab Muhammad Faseeh Qureshi, Choon Sung Nam, Dong Ryeol Shin

2021 · Vol 149 · pp. 40-51
Author(s): Guoli Cheng, Shi Ying, Bingming Wang, Yuhang Li

2021 · Vol 464 · pp. 432-437
Author(s): Mario Juez-Gil, Álvar Arnaiz-González, Juan J. Rodríguez, Carlos López-Nozal, César García-Osorio
Keyword(s): Big Data

2021
Author(s): Samuel Boone, Fabian Kohlmann, Moritz Theile, Wayne Noble, Barry Kohn, ...

The AuScope Geochemistry Network (AGN) and partners Lithodat Pty Ltd are developing AusGeochem, a novel cloud-based platform for Australian-produced geochemistry data from around the globe. The open platform will allow laboratories to upload, archive, disseminate and publish their datasets, as well as perform statistical analyses and data synthesis within the context of large volumes of publicly funded geochemical data. As part of this endeavour, representatives from four Australian low-temperature thermochronology laboratories (University of Melbourne, University of Adelaide, Curtin University and University of Queensland) are advising the AGN and Lithodat on the development of low-temperature thermochronology (LTT)-specific data models for the relational AusGeochem database and its international counterpart, LithoSurfer. These schemas will facilitate the structured archiving of a wide variety of thermochronology data, enabling geoscientists to readily perform LTT Big Data analytics and gain new insights into the thermo-tectonic evolution of Earth's crust.

Adopting established international data reporting best practices, the LTT expert advisory group has designed database schemas for the fission track and (U-Th-Sm)/He methods, as well as for thermal history modelling results and metadata. In addition to recording the parameters required for LTT analyses, the schemas include fields for reference material results and error reporting, allowing AusGeochem users to independently perform QA/QC on data archived in the database. Development of scripts for the automated upload of data directly from analytical instruments into AusGeochem using its open-source Application Programming Interface is currently under way.

The advent of an LTT relational database heralds the beginning of a new era of Big Data analytics in the field of low-temperature thermochronology. By methodically archiving detailed LTT (meta-)data in structured schemas, intractably large datasets comprising thousands of analyses produced by numerous laboratories can be readily interrogated in new and powerful ways. These include rapid derivation of inter-data relationships, facilitating on-the-fly age computation, statistical analysis and data visualisation. With the detailed LTT data stored in relational schemas, measurements can then be re-calculated and re-modelled using user-defined constants and kinetic algorithms, enabling analyses determined with different parameters to be equated and compared across regional to global scales.

The development of this novel tool improves laboratories' ability to manage and share their data in alignment with FAIR data principles, while enabling analysts to readily interrogate intractably large datasets in new and powerful ways.
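The abstract mentions scripts that push data from analytical instruments into AusGeochem through its open-source API. The snippet below sketches what such an upload could look like as a plain HTTP POST; the endpoint URL, field names, and token handling are assumptions for illustration, not the actual AusGeochem/LithoSurfer API.

```python
import requests

# Hypothetical endpoint and payload shape; the real AusGeochem/LithoSurfer API
# defines its own routes, schema fields, and authentication.
API_URL = "https://example.org/ausgeochem/api/v1/ft-analyses"  # placeholder URL
TOKEN = "YOUR_API_TOKEN"                                       # placeholder credential

record = {
    "sample_name": "MEL-001",          # illustrative sample identifier
    "mineral": "apatite",
    "method": "fission_track",
    "pooled_age_ma": 55.2,             # age in Ma
    "pooled_age_1sigma_ma": 3.1,       # 1-sigma uncertainty in Ma
    "reference_material": "Durango",   # reference material for QA/QC fields
}

response = requests.post(
    API_URL,
    json=record,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print("uploaded:", response.json())
```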

