Comprehend the Performance of MapReduce Programming model for K-Means algorithm on Hadoop Cluster

MapReduce is a programming model used for processing Big Data. There has been considerable research into improving the performance of the MapReduce model. This paper examines the performance of the MapReduce model using the K-Means algorithm on a Hadoop cluster. Different input sizes were run on various configurations to discover the impact of the number of CPU cores and the amount of primary memory. The results of this evaluation show that the number of cores has the greatest impact on the performance of the MapReduce model.
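The abstract does not include code, but the core of one K-Means iteration expressed as a MapReduce job fits in a few lines. The Python sketch below is illustrative only; the function names and the in-memory shuffle are ours, not the paper's. The map step assigns each point to its nearest centroid, and the reduce step averages each cluster's points to produce the next centroids.

    import math
    from collections import defaultdict

    def kmeans_mapper(point, centroids):
        """Map: emit (index of nearest centroid, point) for one input record."""
        best = min(range(len(centroids)),
                   key=lambda i: math.dist(point, centroids[i]))
        return best, point

    def kmeans_reducer(points):
        """Reduce: average all points assigned to one centroid."""
        dims = len(points[0])
        return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

    def kmeans_iteration(points, centroids):
        """Simulate one MapReduce round (map, shuffle, reduce) in memory."""
        groups = defaultdict(list)
        for p in points:                             # map phase
            cid, value = kmeans_mapper(p, centroids)
            groups[cid].append(value)                # shuffle by key
        return [kmeans_reducer(pts) for _, pts in sorted(groups.items())]

    points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
    print(kmeans_iteration(points, centroids=[(1.0, 1.0), (9.0, 9.0)]))

On a real Hadoop cluster each iteration runs as one job with the current centroids broadcast to every mapper, and the number of available cores caps how many map tasks execute in parallel, which is consistent with the paper's finding that core count dominates performance.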

Author(s):  
Uttama Garg

The amount of data in today’s world is increasing exponentially, and effectively analyzing Big Data is a very complex task. The MapReduce programming model, created by Google in 2004, revolutionized the big-data computing market. Nowadays the model is used for scientific and research analysis as well as for commercial purposes. The MapReduce model, however, is quite a low-level programming model and has many limitations. Active research is being undertaken to build models that overcome these limitations. In this paper we study some popular data analytic models that redress some of the limitations of MapReduce, namely ASTERIX and Pregel (Giraph). We discuss these models briefly and, through the discussion, highlight how they are able to overcome MapReduce's limitations.
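To make the contrast with MapReduce concrete, a Pregel-style computation can be viewed as a loop of supersteps in which vertices exchange messages, something MapReduce can only emulate with a chain of separate jobs. The Python sketch below is a toy in-memory stand-in, not Giraph's actual API: it propagates the minimum vertex label through a graph, the standard connected-components example.

    def propagate_min_labels(edges, num_vertices):
        """Pregel-style supersteps: each vertex keeps the smallest label it
        has seen and forwards improvements to its neighbors."""
        neighbors = {v: set() for v in range(num_vertices)}
        for u, v in edges:
            neighbors[u].add(v)
            neighbors[v].add(u)

        labels = {v: v for v in range(num_vertices)}
        # Superstep 0: every vertex sends its own label to its neighbors.
        inbox = {v: [labels[u] for u in neighbors[v]] for v in range(num_vertices)}

        while any(inbox.values()):                 # run until no messages remain
            outbox = {v: [] for v in range(num_vertices)}
            for v, messages in inbox.items():      # a vertex with no messages
                if messages and min(messages) < labels[v]:   # stays inactive
                    labels[v] = min(messages)      # update the vertex value
                    for n in neighbors[v]:         # pass the improvement on
                        outbox[n].append(labels[v])
            inbox = outbox                         # next superstep
        return labels

    # Two components: {0, 1, 2} and {3, 4}; labels converge to 0 and 3.
    print(propagate_min_labels([(0, 1), (1, 2), (3, 4)], num_vertices=5))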


Complexity ◽  
2019 ◽  
Vol 2019 ◽  
pp. 1-10 ◽  
Author(s):  
Jian Yang ◽  
Chongchong Zhao ◽  
Chunxiao Xing

In recent years, data has become a special kind of information commodity and has promoted the development of the information commodity economy through its distribution. With the development of big data, data markets have emerged and provide convenience for data transactions. However, the issues of optimal pricing and data quality allocation in the big data market have not yet been fully studied. In this paper, we propose a big data market pricing model based on data quality. We first analyze the dimensional indicators that affect data quality and establish a linear evaluation model. Then, from the perspective of data science, we analyze the impact of quality level on big data analysis (i.e., machine learning algorithms) and define a utility function of data quality. Experimental results on real data sets show the applicability of the proposed quality utility function. In addition, we formulate the profit maximization problem and give a theoretical analysis. Finally, numerical examples illustrate how the data market can maximize profits through the proposed model.
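Since the abstract does not give the exact functional forms, the Python sketch below only illustrates the general shape of such a model under explicitly assumed forms: a linear weighted score over quality dimensions, a concave utility of quality (diminishing returns), a toy demand curve, and a grid search for the profit-maximizing price. All names and formulas here are our assumptions, not the paper's.

    import math

    def quality_score(indicators, weights):
        """Linear evaluation model: weighted sum of per-dimension scores in [0, 1]."""
        return sum(w * x for w, x in zip(weights, indicators))

    def quality_utility(q, alpha=2.0):
        """Assumed concave utility of quality with diminishing returns
        (our choice of form; the paper's actual function is not given)."""
        return 1.0 - math.exp(-alpha * q)

    def profit(price, q, cost):
        """Toy model: demand decays with price and grows with quality utility."""
        demand = 100.0 * quality_utility(q) * math.exp(-price)
        return (price - cost) * demand

    # Completeness 0.9, accuracy 0.8, timeliness 0.6; weights sum to 1.
    q = quality_score([0.9, 0.8, 0.6], [0.4, 0.4, 0.2])
    best = max((profit(p / 100.0, q, cost=0.2), p / 100.0) for p in range(1, 500))
    print(f"quality={q:.2f}  max profit={best[0]:.2f} at price={best[1]:.2f}")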


2022 ◽  
Vol 2022 ◽  
pp. 1-11
Author(s):  
Min Zhu

This article first establishes a model of a university network education system based on physical failure repair behavior at the big data infrastructure layer and then, based on the principles of mobile edge computing, examines in depth the complex common causes by which a single physical machine failure triggers multiple data failures in a big data environment. At the application service layer, a performance model based on queuing theory is established, with the amount of available resources as a conditional parameter; the model captures important events in mobile edge computing such as queue overflow and timeout failure. The impact of failure repair behavior on random changes in the system's dynamic energy consumption is thoroughly investigated, and a system energy consumption model is developed as a result. The network education system comprises a user login module, a teaching resource management module, a student and teacher management module, an online teaching management module, a student achievement management module, a student homework management module, a system data management module, and other business functions. A set of comprehensive evaluation indicators characterizing this trade-off, such as expected performance and expected energy consumption, is then derived from mobile edge computing theory, and on this basis a new indicator is proposed to quantify the complex constraint relationship. Finally, a functional use-case test was conducted, focusing on the query function for online education information, and a performance test was run in the software operating environment: after the test scenario was developed, the server's CPU utilization was measured while the software was running. The results show that the designed network education platform is relatively stable and can withstand user access pressure, and that the performance ratio indicator can effectively help a cloud computing system select a more appropriate option for the migrated traditional service system.
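The abstract does not specify the queuing model, so the Python sketch below uses a standard M/M/1/K queue as an assumed stand-in to show how such a performance model links the amount of available resources (here, the service rate) to the failure events the paper names, queue overflow and slow responses.

    def mm1k_metrics(arrival_rate, service_rate, buffer_size):
        """Steady-state metrics of an M/M/1/K queue (an assumed stand-in for
        the paper's performance model, which the abstract does not give)."""
        rho = arrival_rate / service_rate
        k = buffer_size
        if abs(rho - 1.0) < 1e-12:
            probs = [1.0 / (k + 1)] * (k + 1)      # degenerate case rho == 1
        else:
            norm = (1 - rho) / (1 - rho ** (k + 1))
            probs = [norm * rho ** n for n in range(k + 1)]
        p_overflow = probs[k]                      # arriving job finds queue full
        mean_jobs = sum(n * p for n, p in enumerate(probs))
        throughput = arrival_rate * (1 - p_overflow)
        mean_response = mean_jobs / throughput     # Little's law
        return p_overflow, mean_response

    # More resources -> higher service rate -> less overflow, faster responses.
    for mu in (8.0, 12.0, 16.0):
        pb, rt = mm1k_metrics(arrival_rate=10.0, service_rate=mu, buffer_size=20)
        print(f"mu={mu:4.1f}  P(overflow)={pb:.4f}  E[response]={rt:.3f}")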


Author(s):  
Zhehuang Huang ◽  
Jianxin Huang

The rapid updates of resources and media in the big data age provide new opportunities for overseas Chinese education, and it is an urgent task to use big data effectively to boost its development. However, very few studies have been conducted in this area. Map-Reduce is a programming model of cloud computing used for parallel computing over large-scale data sets; it enables programmers to run their own programs in a distributed system. In this paper we propose a personalized overseas Chinese education model based on the Map-Reduce mechanism, which can analyze the behavioral habits and personal preferences of users across a large pool of Chinese educational resources. In this way, customer needs can be accurately grasped and users' favorite resources can be recommended from huge amounts of material. The proposed model has good application prospects for overseas Chinese education.
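The paper's actual job design is not given, so the following Python sketch only illustrates the general pattern under assumed record and function names: a mapper emits one (user, resource-tag) pair per access-log record, the counts are summed per pair in the reduce step, and each user's most-accessed tag then drives recommendations.

    from collections import defaultdict

    def map_access_log(record):
        """Mapper: emit one ((user, resource_tag), 1) pair per access record."""
        yield (record["user"], record["tag"]), 1

    def build_profiles(logs):
        """Simulate the job in memory: shuffle by key, reduce by summing,
        and keep each user's most-accessed tag as the recommendation driver."""
        grouped = defaultdict(int)
        for record in logs:                        # map + shuffle
            for key, one in map_access_log(record):
                grouped[key] += one                # reduce: sum the counts
        profiles = defaultdict(dict)
        for (user, tag), count in grouped.items():
            profiles[user][tag] = count
        return {u: max(tags, key=tags.get) for u, tags in profiles.items()}

    logs = [{"user": "u1", "tag": "grammar"}, {"user": "u1", "tag": "grammar"},
            {"user": "u1", "tag": "poetry"}, {"user": "u2", "tag": "poetry"}]
    print(build_profiles(logs))                    # {'u1': 'grammar', 'u2': 'poetry'}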


2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Xu Sun ◽  
Zixiu Bai ◽  
Kun Lin ◽  
Pengpeng Jiao ◽  
HuaPu Lu

In order to improve the accuracy, reliability, and economy of urban traffic information collection, an optimization model of traffic sensor layout is proposed in this paper. Considering the impact of traffic big data, a set of impact factors for traffic sensor layout is established, including system cost, multisource data sharing, data demand, sensor failures, road infrastructure, and sensor type. These influential factors are taken into account in the traffic sensor layout optimization problem, which is formulated as a multiobjective programming model that includes minimum system cost, maximum truncation flow, minimum path coverage, and an origin-destination (OD) coverage constraint. The model is solved by the tolerant lexicographic method based on a genetic algorithm. A case study shows that the model reflects the influence of multisource data sharing and fault conditions and satisfies the origin-destination coverage constraint, achieving multiobjective optimization of the traffic sensor layout.
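The paper's exact tolerant lexicographic formulation is not reproduced here, but its core idea can be sketched as a comparison rule usable inside a genetic algorithm's selection step: objectives are compared in priority order, and a higher-priority objective counts as tied when the gap is within a tolerance, so lower-priority objectives can break ties. The objective values and tolerances below are invented for illustration.

    def tolerant_lexicographic_better(a, b, tolerances):
        """Compare two objective vectors (all to be minimized) in priority
        order; a higher-priority objective counts as tied when the gap is
        within its tolerance, letting lower-priority objectives decide."""
        for fa, fb, tol in zip(a, b, tolerances):
            if abs(fa - fb) > tol:
                return fa < fb
        return False

    # Objectives in priority order: system cost, negated truncation flow
    # (to turn maximization into minimization), path-coverage shortfall.
    layout_a = (120.0, -950.0, 3.0)
    layout_b = (118.0, -990.0, 5.0)
    print(tolerant_lexicographic_better(layout_a, layout_b, tolerances=(5.0, 20.0, 0.0)))

Here the two layouts are within the 5-unit cost tolerance, so the comparison falls through to truncation flow, where layout_b is better and the call prints False; inside a genetic algorithm this comparator would rank individuals during selection.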


Electronics ◽  
2021 ◽  
Vol 10 (14) ◽  
pp. 1728
Author(s):  
Carmen Lacave ◽  
Ana Isabel Molina

Collaborative learning activities have become a common practice in current university studies due to the implementation of the EHEA. However, the COVID-19 pandemic has led to a radical and abrupt change in the teaching–learning model used in most universities, and in the way students' group work is carried out. Given this new situation, our interest is focused on discovering how computer science students have approached group programming tasks. For this purpose, we designed a cross-sectional pilot study to explore, from both social and technological points of view, how students carried out their group programming activities during the shutdown of universities, how they are doing them now, when social distance must be maintained, and what they have missed in both situations. The results of the study indicate that during the imposed confinement, the students adopted a programming model based on work division or distributed peer programming, and very few made use of synchronous distributed collaboration tools. After the lockdown, the students mostly opted for a model based on collaborative programming, and there was an increased use of synchronous distributed collaboration tools. The specific communication, synchronization, and coordination functionalities they considered most useful or necessary were also analyzed. Among the desirable features of software for synchronous distributed programming, the students considered that an audio channel can be very useful and is possibly the most agile way to communicate. A video signal is not considered very necessary and in many cases is rather a source of distraction, while textual communication through a chat, to which they are very accustomed, is also well valued. In addition, version control and the possibility of recovering previous states of the practical projects were highly appreciated by the students, and they considered it necessary to record the individual contributions of each member of the team to the result.


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Youwen Ma ◽  
Yi Wan

Based on cloud computing and statistical theory, this paper proposes a practical method for analyzing big data from film and television. The method takes the open-source Hadoop cloud platform as its basis and combines key cloud computing technologies such as the MapReduce distributed programming model and the HDFS distributed file system. To cope with the different data processing needs of the film and television industry, association analysis, cluster analysis, factor analysis, and a combined K-means + association analysis algorithm were applied to model, process, and analyze the full data of films and TV series. The film and television data of recent years are analyzed according to film type, producer, production region, investment, box office, audience rating, network score, audience group, and other factors. Building on a study of how each attribute of a film or TV drama affects box office and audience ratings, the work aims at prediction for the film and television industry and continually verifies and improves the algorithm model.
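As an illustration of the association-analysis step (a K-Means sketch appears earlier in this listing), the toy Python function below mines single-antecedent rules by brute force from attribute sets such as genre, budget level, and box-office level; the data, attribute names, and thresholds are invented, not taken from the paper.

    from itertools import combinations

    def association_rules(transactions, min_support=0.5, min_confidence=0.7):
        """Brute-force single-antecedent rules: a toy stand-in for the
        paper's association-analysis component."""
        n = len(transactions)
        items = {i for t in transactions for i in t}
        support = {frozenset(c): sum(1 for t in transactions if set(c) <= t) / n
                   for r in (1, 2) for c in combinations(sorted(items), r)}
        rules = []
        for (a, b) in combinations(sorted(items), 2):
            pair = frozenset((a, b))
            if support[pair] < min_support:
                continue                           # prune infrequent pairs
            for lhs, rhs in ((a, b), (b, a)):
                conf = support[pair] / support[frozenset((lhs,))]
                if conf >= min_confidence:
                    rules.append((lhs, rhs, support[pair], conf))
        return rules

    films = [{"action", "high_budget", "high_boxoffice"},
             {"action", "high_budget", "high_boxoffice"},
             {"drama", "low_budget"},
             {"action", "high_boxoffice"}]
    print(association_rules(films))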


2017 ◽  
Vol 2017 ◽  
pp. 1-12 ◽  
Author(s):  
Yufei Gao ◽  
Yanjie Zhou ◽  
Bing Zhou ◽  
Lei Shi ◽  
Jiacai Zhang

The healthcare industry has generated large amounts of data, and analyzing these has emerged as an important problem in recent years. The MapReduce programming model has been successfully used for big data analytics. However, data skew invariably occurs in big data analytics and seriously affects efficiency. To overcome the data skew problem in MapReduce, we previously proposed a data processing algorithm called Partition Tuning-based Skew Handling (PTSH). In comparison with the one-stage partitioning strategy used in the traditional MapReduce model, PTSH uses a two-stage strategy and a partition tuning method to disperse key-value pairs across virtual partitions and recombines each partition in case of data skew. The robustness and efficiency of the proposed algorithm were tested on a wide variety of simulated datasets and real healthcare datasets. The results showed that the PTSH algorithm can handle data skew in MapReduce efficiently and improves the performance of MapReduce jobs in comparison with native Hadoop, Closer, and locality-aware and fairness-aware key partitioning (LEEN). We also found that the time needed for rule extraction can be reduced significantly by adopting PTSH, since it is well suited to association rule mining (ARM) on healthcare data.
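The abstract does not reproduce the PTSH procedure, so the Python sketch below only conveys the two-stage idea under assumed details: sampled key frequencies are first spread over many virtual partitions, with heavy keys split across several of them, and the virtual partitions are then greedily packed onto reducers to balance load.

    from collections import Counter

    def plan_partitions(sampled_keys, num_reducers, num_virtual=None):
        """Two-stage partitioning in the spirit of PTSH (details assumed,
        not the paper's exact algorithm): spread key loads over virtual
        partitions, then pack virtual partitions onto real reducers."""
        num_virtual = num_virtual or 4 * num_reducers
        freq = Counter(sampled_keys)
        # Stage 1: assign each key's estimated load to virtual partitions,
        # splitting the heaviest keys across several of them.
        virtual = [0.0] * num_virtual
        for key, count in freq.most_common():
            splits = max(1, count * num_virtual // len(sampled_keys))
            for _ in range(splits):
                target = min(range(num_virtual), key=virtual.__getitem__)
                virtual[target] += count / splits
        # Stage 2: greedy longest-processing-time packing onto reducers.
        loads = [0.0] * num_reducers
        for v in sorted(virtual, reverse=True):
            loads[loads.index(min(loads))] += v
        return loads

    keys = ["hot"] * 900 + ["warm"] * 60 + ["cold"] * 40
    # Near-balanced loads instead of one reducer taking all 900 "hot" pairs.
    print(plan_partitions(keys, num_reducers=4))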

