Resource and Performance Modelling of Hadoop Clusters Using Machine Learning

2021 ◽  
Author(s):  
Hassan Tariq

There is a huge and rapidly increasing amount of data being generated by social media, mobile applications and sensing devices. Big data is the term usually used to describe such data, and it is commonly characterized in terms of the 3Vs: volume, variety and velocity. To process and mine such massive amounts of data, several approaches and platforms have been developed, such as Hadoop, a popular open-source distributed and parallel computing framework. Hadoop has a large number of configurable parameters which can be set before job execution to optimize the resource utilization and execution time of a cluster. These parameters have a significant impact on system resources and execution time, and tuning such a large number of parameters to optimize cluster performance is a tedious task. Most current big data modeling approaches do not capture the complex interaction between configuration parameters and changes in the cluster environment, such as the use of different datasets or types of query. This makes it difficult to predict, for example, the execution time of a job or the resource utilization of a cluster. Other relevant attributes include the configuration parameters, the structure of the query, the dataset, the number of nodes and the infrastructure used. Our first main objective was to design reliable experiments to understand the relationships between these attributes. Before designing and implementing the actual experiments we applied Hazard and Operability (HAZOP) analysis to identify operational hazards that can affect the normal working of a cluster and the execution of Hadoop jobs. This brainstorming activity improved the design and implementation of our experiments by improving their internal validity, and it helped us identify the considerations that must be taken into account for reliable results. After implementing our design, we characterized the relationships between different Hadoop configuration parameters and network and system performance measures. Our second main objective was to investigate the use of machine learning to model and predict the resource utilization and execution time of Hadoop jobs, both of which are affected by attributes such as the configuration parameters and the structure of the query. To estimate or predict, either qualitatively or quantitatively, the level of resource utilization and execution time, it is important to understand the impact of different combinations of these Hadoop job attributes. One could conduct experiments with many different combinations of parameters to uncover this, but it is very difficult to run such a large number of jobs with different combinations of Hadoop job attributes and then interpret the data manually; extracting patterns by hand and producing a model that generalizes to unseen scenarios is impractical. We therefore used machine learning to automate the process of data extraction and to model the complex behavior of the different attributes of a Hadoop job. Our decision-tree-based approach enabled us to systematically discover significant patterns in the data. Our results showed that the decision tree models constructed for different resources and for execution time were informative and robust, and that they generalized over a wide range of minor and major environmental changes, such as changes in dataset, cluster size and infrastructure (for example, Amazon EC2).
Moreover, the use of different correlation, regression and clustering techniques, such as M5P, Pearson's correlation and k-means clustering, confirmed our findings and provided further insight into how the different attributes relate to one another. M5P is a classification and regression technique that predicted the functional relationships among the different job attributes, while k-means clustering allowed us to identify experimental runs that show similar resource utilization and execution time. Statistical significance tests, used to validate the significance of changes in the results across different experimental runs, also showed the effectiveness of our resource and performance modelling and prediction method.
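
As a rough illustration of the kind of modelling described above, the sketch below fits a decision-tree regressor to per-run job attributes and clusters runs with k-means using scikit-learn; the file name and feature columns (block_size_mb, reducers, dataset_gb, nodes, exec_time_s) are hypothetical placeholders, not the study's actual attribute set or tooling.

```python
# Hypothetical sketch: modelling Hadoop job execution time from run attributes.
# Column names are illustrative placeholders, not the thesis's real feature set.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

runs = pd.read_csv("hadoop_runs.csv")            # one row per experimental run
features = ["block_size_mb", "reducers", "dataset_gb", "nodes"]
X, y = runs[features], runs["exec_time_s"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# A decision-tree regressor stands in for the thesis's tree-based models.
tree = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_train, y_train)
print("R^2 on held-out runs:", tree.score(X_test, y_test))

# k-means groups runs with similar resource/time profiles, as described above.
runs["cluster"] = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)
```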


2019 ◽  
Vol 8 (11) ◽  
pp. e298111473
Author(s):  
Hugo Kenji Rodrigues Okada ◽  
Andre Ricardo Nascimento das Neves ◽  
Ricardo Shitsuka

Decision trees are data structures or computational methods that enable nonparametric supervised machine learning and are used in classification and regression tasks. The aim of this paper is to present a comparison between the decision tree induction algorithms C4.5 and CART. A quantitative study is performed in which the two methods are compared by analyzing the following aspects: operation and complexity. The experiments showed practically equal hit percentages for the two algorithms; in the execution time for tree induction, however, the CART algorithm was approximately 46.24% slower than C4.5, which was therefore considered the more effective algorithm.
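
As a rough proxy for this comparison (scikit-learn does not implement C4.5 itself), the sketch below induces an entropy-based tree standing in for C4.5 and a Gini-based tree standing in for CART on a standard dataset, reporting cross-validated hit percentage and induction time; it illustrates the experimental setup rather than reproducing the paper's results.

```python
# Illustrative proxy for the C4.5 vs CART comparison using scikit-learn trees.
import time
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for name, criterion in [("C4.5-like (entropy)", "entropy"), ("CART (gini)", "gini")]:
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    start = time.perf_counter()
    scores = cross_val_score(clf, X, y, cv=10)          # hit percentage
    elapsed = time.perf_counter() - start
    print(f"{name}: accuracy={scores.mean():.3f}, induction+CV time={elapsed:.3f}s")
```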


2016 ◽  
pp. 180-196
Author(s):  
Tu-Bao Ho ◽  
Siriwon Taewijit ◽  
Quang-Bach Ho ◽  
Hieu-Chi Dam

Big data is about handling huge and/or complex datasets that conventional technologies cannot handle, or cannot handle well. Big data is currently receiving tremendous attention from both industry and academia as there is much more data around us than ever before. This chapter addresses the relationship between big data and service science, especially how big data can contribute to the process of co-creation of service value. In particular, value co-creation in the context of customer relationship management is discussed. The chapter starts with brief descriptions of big data, machine learning and data mining methods, service science and its model of value co-creation, and then addresses the key idea of how big data can contribute to co-creating service value.


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Ginevra Gravili ◽  
Francesco Manta ◽  
Concetta Lucia Cristofaro ◽  
Rocco Reina ◽  
Pierluigi Toma

Purpose The aim of this paper is to analyze and measure the effects of intellectual capital (IC), i.e. human capital (HC), relational capital (RC) and structural capital (SC), on healthcare industry organizational performance, and to understand the role of data analytics and big data (BD) in healthcare value creation (Wang et al., 2018). Through the assessment of variables specific to each component of IC, the paper identifies guidelines and suggests propositions for a more efficient response in terms of services provided to citizens and, specifically, patients, as well as predicting effective strategies to improve care management efficiency in terms of cost reduction. Design/methodology/approach The study has a twofold approach: in the first part, the authors conducted a systematic review of the academic literature to investigate the relationship between IC, big data analytics (BDA) and the healthcare system, which were also the descriptors employed. In the second part, the authors built an econometric model, analyzed through panel data analysis, studying the relationship between IC, namely human, relational and structural capital indicators, and the performance of the healthcare system. The study was conducted on a sample of 28 European countries, regardless of their membership of specific international or supranational bodies, between 2011 and 2016. Findings The paper proposes a data-driven model that presents a new approach to IC assessment, extendable to other economic sectors beyond healthcare. It shows the existence of a positive impact (expressed mathematically as an inverse relationship) of human, relational and structural capital on the performance indicator, while physical assets (i.e. the available beds in hospitals relative to total population) positively mediate the relationship, translating into a negative impact of non-IC-related inputs on healthcare performance. The result is relevant in terms of managerial implications, enhancing the opportunity to highlight the crucial role of IC in the healthcare sector. Research limitations/implications The relationship between IC indicators and performance could be employed in other sectors, disseminating new approaches in academic research. Through the establishment of a relationship between IC factors and performance, the authors implemented an approach in which healthcare organizations are active participants in their economic and social value creation. This challenges views of knowledge sharing deeply held inside organizations by creating “new value” developed through a more collaborative and permeated approach in terms of knowledge spillovers. A limitation lies in the fragmented policymaking process, which produces different results in each country. Practical implications The analysis provides interesting implications from multiple perspectives, and the novelty of the study makes it relevant for managers, practitioners and governmental bodies. A more efficient healthcare system could provide better results in terms of cost minimization and reduction of hospitalization periods. Moreover, the dissemination of new scientific knowledge and drivers of specialization enhances best-practice sharing in the healthcare sector.
On the other hand, an improvement in preventive medicine practices could help reduce the overload of demand for curative treatments, with the prospect of sharply decreasing the rate of avoidable deaths and improving societal standards. Originality/value The authors provide a new holistic framework on the relationship between IC, BDA and organizational performance in healthcare organizations through a systematic review approach and an empirical panel analysis at a multinational level, which is quite a novelty in the healthcare field. There is little research focused on the organizational performance of healthcare industries, and, specifically, most of the research on IC in healthcare has delivered results in terms of theoretical contributions and qualitative analyses. The authors also contribute by analyzing the healthcare industry in the light of the possible existence of synergies and networks among countries.
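
A minimal sketch of the kind of panel specification described above, assuming a simple two-way fixed-effects regression in statsmodels; the variable names (performance, human_capital, relational_capital, structural_capital, beds_per_capita) and the input file are illustrative placeholders rather than the authors' exact indicators or estimator.

```python
# Illustrative fixed-effects panel regression of healthcare performance on IC
# indicators. All variable names and the data file are placeholder assumptions.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.read_csv("healthcare_panel.csv")      # 28 countries x years 2011-2016

# Country and year dummies give a simple two-way fixed-effects specification;
# standard errors are clustered by country.
model = smf.ols(
    "performance ~ human_capital + relational_capital + structural_capital"
    " + beds_per_capita + C(country) + C(year)",
    data=panel,
).fit(cov_type="cluster", cov_kwds={"groups": panel["country"]})

print(model.summary())
```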


2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Peter Appiahene ◽  
Yaw Marfo Missah ◽  
Ussiph Najim

The financial crisis that hit Ghana from 2015 to 2018 raised various issues with respect to the efficiency of banks and the safety of depositors in the banking industry. As part of measures to improve the banking sector and restore customers’ confidence, efficiency and performance analysis in the banking industry has become a hot issue, because stakeholders have to detect the underlying causes of inefficiencies within the industry. Nonparametric methods such as Data Envelopment Analysis (DEA) have been suggested in the literature as a good measure of banks’ efficiency and performance. Machine learning algorithms have also been viewed as a good tool for estimating various nonparametric and nonlinear problems. This paper combines DEA with three machine learning approaches to evaluate bank efficiency and performance using 444 Ghanaian bank branches as Decision Making Units (DMUs). The results were compared with the corresponding efficiency ratings obtained from the DEA. Finally, the prediction accuracies of the three machine learning models were compared. The results suggested that the decision tree (DT) and its C5.0 algorithm provided the best predictive model: it had 100% accuracy in predicting the 134-branch holdout sample (30% of the branches) with a P value of 0.00. The DT was followed closely by the random forest algorithm, with a predictive accuracy of 98.5% and a P value of 0.00, and finally by the neural network (86.6% accuracy) with a P value of 0.66. The study concluded that banks in Ghana can use the results of this study to predict their respective efficiencies. All experiments were performed within a simulation environment in RStudio using R code.
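
The sketch below illustrates the DEA-then-classify idea in scikit-learn rather than the authors' R/C5.0 setup, training a decision tree and a random forest on a 30% holdout; the input file, feature names and the dea_efficient label are assumptions for illustration only.

```python
# Hypothetical sketch of predicting DEA-derived efficiency classes with trees.
# A scikit-learn decision tree stands in for the paper's C5.0 (an R algorithm).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

branches = pd.read_csv("bank_branches.csv")        # one row per DMU (branch)
X = branches[["deposits", "loans", "staff", "operating_cost"]]  # placeholder inputs
y = branches["dea_efficient"]                       # label derived from DEA scores

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1   # 30% holdout as in the paper
)

for name, model in [("decision tree", DecisionTreeClassifier(random_state=1)),
                    ("random forest", RandomForestClassifier(random_state=1))]:
    pred = model.fit(X_train, y_train).predict(X_test)
    print(name, "holdout accuracy:", accuracy_score(y_test, pred))
```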


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Yejin Lee ◽  
Dae-Young Kim

Purpose Using the decision tree model, this study aims to understand online travelers’ booking behaviors on Expedia.com by examining influential determinants of online hotel booking, especially for longer-stay travelers. Geographical distance is also considered in understanding booking behaviors, with travel destinations trisected into the Americas, Europe and Asia. Design/methodology/approach The data were obtained from the American Statistical Association DataFest and Expedia.com. Based on US travelers who made hotel reservations on the website, the study used a machine learning algorithm, the decision tree, to analyze the influential determinants of hotel booking, taking into account the geographical distance between origin and destination. Findings The results demonstrate that the choice of a package product is the prioritized determinant for longer-stay hotel guests. Several similarities and differences were found among the significant determinants of the decision tree, in accordance with the geographic distance among the Americas, Europe and Asia. Research limitations/implications This paper presents an extension to an existing machine learning environment, especially the decision tree model. The findings are anticipated to expand the understanding of online hotel booking and to clarify the influential determinants in consumers’ decision-making process regarding the relationship between geographical distance and travelers’ length of hotel stay. Originality/value This research brings a meaningful understanding to the hospitality and tourism industry, especially in the realm of machine learning applied to an online booking website. It provides a unique approach to comprehending and forecasting consumer behavior with data mining.
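
A hedged sketch of a decision tree of this kind, fitted on placeholder booking features (is_package, orig_dest_distance_km, srch_adults, srch_children, is_longer_stay) that are inspired by, but not taken from, the Expedia data used in the study.

```python
# Illustrative only: a shallow decision tree separating longer-stay from
# shorter-stay bookings. File and column names are placeholder assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

bookings = pd.read_csv("expedia_sample.csv")
features = ["is_package", "orig_dest_distance_km", "srch_adults", "srch_children"]
X, y = bookings[features], bookings["is_longer_stay"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

print("holdout accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=features))   # inspect which splits dominate
```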


2018 ◽  
Vol 7 (3.34) ◽  
pp. 291
Author(s):  
M Malleswari ◽  
R.J Manira ◽  
Praveen Kumar ◽  
Murugan .

Big data analytics has become the focus of large-scale data processing, and machine learning combined with big data has great potential for prediction. Churn prediction is one of the subdomains of big data analytics: preventing customer attrition, especially in telecom, is its main advantage. Churn prediction is a day-to-day affair involving millions, so a solution that prevents customer attrition can save a lot. This paper proposes a comparison of three machine learning techniques, the Decision Tree, Random Forest and Gradient Boosted Tree algorithms, using Apache Spark. Apache Spark is a data processing engine used in big data that provides in-memory processing, so the processing speed is higher. The analysis is made by extracting the features of the dataset and training the models. Scala is a programming language that combines object-oriented and functional programming, which makes it a powerful language. The analysis is implemented using Apache Spark and the modelling is done using Scala ML. The accuracy of the Decision Tree model came out at 86%, the Random Forest model at 87% and the Gradient Boosted Tree at 85%.
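
The pyspark.ml sketch below mirrors the three-model comparison described above (the paper used Scala, but the Python API is analogous); the CSV path, feature columns and the assumption of a numeric 0/1 churn label are placeholders, not the authors' actual dataset.

```python
# Illustrative PySpark sketch of comparing DT, RF and GBT for churn prediction.
from pyspark.ml.classification import (DecisionTreeClassifier, GBTClassifier,
                                       RandomForestClassifier)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("churn-comparison").getOrCreate()
df = spark.read.csv("telecom_churn.csv", header=True, inferSchema=True)

# Assemble placeholder feature columns into a single vector; assumes a numeric
# 0/1 "churn" column to use as the label.
assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_charges", "support_calls"],
    outputCol="features")
data = assembler.transform(df).withColumnRenamed("churn", "label")
train, test = data.randomSplit([0.7, 0.3], seed=42)

evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
for name, clf in [("decision tree", DecisionTreeClassifier()),
                  ("random forest", RandomForestClassifier()),
                  ("gradient-boosted trees", GBTClassifier())]:
    model = clf.fit(train)
    print(name, "accuracy:", evaluator.evaluate(model.transform(test)))
```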


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Thérence Nibareke ◽  
Jalal Laassiri

Abstract Introduction Nowadays, large data volumes are generated daily at a high rate. Data from health systems, social networks, finance, government, marketing and bank transactions, as well as from sensors and smart devices, are increasing, and the tools and models used to handle them have to be optimized. In this paper we applied and compared machine learning algorithms (Linear Regression, Naïve Bayes, Decision Tree) to predict diabetes. Furthermore, we performed analytics on flight delays. The main contribution of this paper is to give an overview of big data tools and machine learning models. We highlight some metrics that allow us to choose a more accurate model. We predict diabetes using three machine learning models and then compare their performance. Furthermore, we analyzed flight delays and produced a dashboard that can help managers of airline companies have a 360° view of their flights and make strategic decisions. Case description We applied three machine learning algorithms to predict diabetes and compared their performance to see which model gives the best results. We performed analytics on flight datasets to support decision-making and to predict flight delays. Discussion and evaluation The experiment shows that Linear Regression, Naïve Bayes and Decision Tree give the same accuracy (0.766), but the Decision Tree outperforms the other two models with the greatest score (1) and the smallest error (0). For the flight delay analytics, the model could show, for example, the airport that recorded the most flight delays. Conclusions Several tools and machine learning models for big data analytics have been discussed in this paper. We concluded that, for the same dataset, the model to use in prediction has to be chosen carefully. In future work, we will test different models in other fields (climate, banking, insurance).
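
A hedged scikit-learn sketch of the diabetes comparison, with logistic regression standing in for the paper's linear regression model; the file name and a Pima-style Outcome column are assumptions, not the authors' actual pipeline or tooling.

```python
# Illustrative comparison of three classifiers for diabetes prediction.
# The dataset path and column layout are placeholder assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("diabetes.csv")                 # assumed Pima-style layout
X, y = data.drop(columns="Outcome"), data["Outcome"]

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```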

