The Analysis and Implementation of the K-Means Algorithm Based on Hadoop Platform

2018 ◽  
Vol 11 (1) ◽  
pp. 98
Author(s):  
Liu Xiang Wei

Society has entered the era of big data. The growing diversity and volume of data pose great challenges for data storage and processing, and Hadoop's HDFS and MapReduce address these two problems well. The classical K-means algorithm is the most widely used partition-based clustering algorithm. After completing the cluster configuration, this paper examines the operating principle of the K-means algorithm in cluster mode, implements the algorithm in that mode, analyzes the experimental results, and summarizes the strengths and limitations of running K-means on the Hadoop platform.
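The abstract does not give the implementation, so the following is only a minimal single-process sketch of how one K-means iteration is commonly split into a map step (assign each point to its nearest centroid) and a reduce step (average the points of each cluster). It does not use Hadoop itself; the function names `map_phase`, `reduce_phase`, and `kmeans_mapreduce` are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map step: emit (cluster_index, point) for every input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce step: average the points emitted for each cluster index."""
    sums = {}
    counts = defaultdict(int)
    for idx, point in pairs:
        if idx in sums:
            sums[idx] = [s + p for s, p in zip(sums[idx], point)]
        else:
            sums[idx] = list(point)
        counts[idx] += 1
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(s / counts[i] for s in sums[i]) if i in sums else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans_mapreduce(points, centroids, max_iter=20):
    """Repeat map + reduce until the centroids stop moving (assumed criterion)."""
    for _ in range(max_iter):
        new_centroids = reduce_phase(map_phase(points, centroids), centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans_mapreduce(data, [(0, 0), (10, 10)]))
```

On a real cluster, each iteration would typically be one MapReduce job, with the current centroids distributed to every mapper; that detail is an assumption here, not something the abstract states.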

Author(s):  
Joaquín Pérez Ortega ◽  
Nelva Nely Almanza Ortega ◽  
Andrea Vega Villalobos ◽  
Marco A. Aguirre L. ◽  
Crispín Zavala Díaz ◽  
...  

In recent years, the amount of natural-language text in digital format has increased impressively. To obtain useful information from a large volume of data, new specialized techniques and efficient algorithms are required. Text mining consists of extracting meaningful patterns from texts; one of the basic approaches is clustering. The most used clustering algorithm is k-means. This chapter proposes an improvement of the k-means algorithm in the convergence step: the process stops whenever the number of objects that change their assigned cluster in the current iteration is bigger than the number that changed in the previous iteration. Experimental results showed a reduction in execution time of up to 93%. It is remarkable that, in general, better results are obtained as the volume of text increases, particularly for texts in big data environments.
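The stopping rule described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function name `kmeans_early_stop`, the use of squared Euclidean distance, and the handling of empty clusters are all assumptions.

```python
def kmeans_early_stop(points, centroids, max_iter=100):
    """K-means that also stops when more objects change cluster in the
    current iteration than changed in the previous one (assumed reading)."""

    def nearest(point):
        # Index of the closest centroid by squared Euclidean distance.
        return min(
            range(len(centroids)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
        )

    centroids = [tuple(c) for c in centroids]
    labels = [-1] * len(points)
    prev_changes = None
    iterations = 0
    for iterations in range(1, max_iter + 1):
        new_labels = [nearest(p) for p in points]
        changes = sum(a != b for a, b in zip(labels, new_labels))
        labels = new_labels
        if changes == 0:
            break  # classical convergence: no object changed cluster
        if prev_changes is not None and changes > prev_changes:
            break  # early stop: changes grew relative to the previous iteration
        prev_changes = changes
        # Recompute each centroid as the mean of its assigned points.
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its old centroid (assumption)
                centroids[i] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids, iterations


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans_early_stop(data, [(0, 0), (0, 1)]))
```

The first iteration always counts every object as changed, so the early-stop comparison can only take effect from the third iteration onward in this sketch.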


Sensors ◽  
2019 ◽  
Vol 19 (15) ◽  
pp. 3438 ◽  
Author(s):  
Xia ◽  
Huang ◽  
Li ◽  
Zhou ◽  
Zhang

Remote sensing big data (RSBD) is generally characterized by huge volumes, diversity, and high dimensionality. Mining hidden information from RSBD for different applications imposes significant computational challenges. Clustering is an important data mining technique widely used in processing and analyzing remote sensing imagery. However, conventional clustering algorithms are designed for relatively small datasets. When applied to problems with RSBD, they are, in general, too slow or inefficient for practical use. In this paper, we propose a parallel subsampling-based clustering (PARSUC) method for improving the performance of RSBD clustering in terms of both efficiency and accuracy. PARSUC leverages a novel subsampling-based data partitioning (SubDP) method to realize three-step parallel clustering, effectively solving the notable performance bottleneck of the existing parallel clustering algorithms; that is, they must cope with numerous repeated calculations to get a reasonable result. Furthermore, we propose a centroid filtering algorithm (CFA) to eliminate subsampling errors and to guarantee the accuracy of the clustering results. PARSUC was implemented on a Hadoop platform by using the MapReduce parallel model. Experiments conducted on massive remote sensing imagery of different sizes showed that PARSUC (1) provided much better accuracy than conventional remote sensing clustering algorithms in handling larger image data; (2) achieved notable scalability as computing nodes were added; and (3) spent much less time than the existing parallel clustering algorithm in handling RSBD.


2021 ◽  
Vol 11 (18) ◽  
pp. 8651
Author(s):  
Vladimir Belov ◽  
Alexander N. Kosenkov ◽  
Evgeny Nikulchev

One of the most popular methods for building analytical platforms involves the use of the concept of data lakes. A data lake is a storage system in which the data are presented in their original format, making it difficult to conduct analytics or present aggregated data. To solve this issue, data marts are used: environments of stored, highly specialized data focused on the requests of employees of a certain department or a particular line of an organization’s work. This article presents a study of big data storage formats in the Apache Hadoop platform when used to build data marts.


2017 ◽  
Vol 4 (3) ◽  
pp. 108-117
Author(s):  
Shilpa G. Kolte ◽  
Jagdish W. Bakal

This paper proposes a big data (i.e., documents, texts) summarization method that uses a novel clustering algorithm and semantic features. The proposed system works in four phases and provides a modular implementation of multi-document summarization. Experimental results using the Iris dataset show that the proposed clustering algorithm performs better than the K-means and K-medoids algorithms. The performance of big data (i.e., documents, texts) summarization is evaluated using Australian legal cases from the Federal Court of Australia (FCA) database. The experimental results demonstrate that the proposed method summarizes big data documents better than existing systems.


2021 ◽  
Vol 2143 (1) ◽  
pp. 012039
Author(s):  
Yu Zhou ◽  
Xuesong Shao ◽  
Zhuowen Mu ◽  
Qixin Cai ◽  
Yue Li ◽  
...  

To address the problem of unstable operation of smart meters, this paper proposes a smart meter evaluation system based on big data. First, it introduces big data related technologies, such as data storage, analysis, and mining. Second, it designs the big data smart meter evaluation system model. Finally, it uses big data technology and a clustering algorithm to realize the design of the big data smart meter operation evaluation system. The system can convert massive data from multiple systems into operational evaluation reports, which helps to reduce the waste of human and material resources.


2021 ◽  
Vol 1948 (1) ◽  
pp. 012016
Author(s):  
Taizhi Lv ◽  
Chenyong He ◽  
Juan Zhang ◽  
Zhiyang Song

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Hossein Ahmadvand ◽  
Fouzhan Foroutan ◽  
Mahmood Fathy

Data variety is one of the most important features of Big Data. Data variety is the result of aggregating data from multiple sources and of the uneven distribution of data. This feature of Big Data causes high variation in the consumption of processing resources such as CPU. This issue has been overlooked in previous works. To overcome this problem, in the present work, we used Dynamic Voltage and Frequency Scaling (DVFS) to reduce the energy consumption of computation. To this end, we consider two types of deadlines as our constraint. Before applying the DVFS technique to computer nodes, we estimate the processing time and the frequency needed to meet the deadline. In the evaluation phase, we used a set of data sets and applications. The experimental results show that our proposed approach surpasses the other scenarios in processing real datasets. Based on the experimental results in this paper, DV-DVFS can achieve up to 15% improvement in energy consumption.
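The step of estimating the frequency needed to meet a deadline can be illustrated with a deliberately simple model. This is not the DV-DVFS method from the paper: it assumes that the work is a known number of CPU cycles, that run time equals cycles divided by frequency, and that the node offers a discrete set of frequency levels. The function name `pick_frequency` and its parameters are hypothetical.

```python
def pick_frequency(cycles, deadline_s, levels_hz):
    """Return the lowest available frequency that finishes `cycles` of work
    within `deadline_s` seconds, or None if no level is fast enough.

    Assumed model: run time = cycles / frequency.
    """
    if deadline_s <= 0:
        raise ValueError("deadline must be positive")
    required_hz = cycles / deadline_s
    for freq in sorted(levels_hz):
        if freq >= required_hz:
            return freq
    return None


if __name__ == "__main__":
    levels = [0.8e9, 1.2e9, 2.0e9]  # hypothetical frequency levels in Hz
    print(pick_frequency(2e9, 2.0, levels))
```

Choosing the lowest feasible level reflects the usual motivation for DVFS, since running slower generally draws less power; how the paper estimates cycles and handles its two deadline types is not stated in the abstract.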


2021 ◽  
pp. 1-10
Author(s):  
Meng Huang ◽  
Shuai Liu ◽  
Yahao Zhang ◽  
Kewei Cui ◽  
Yana Wen

The integration of artificial intelligence technology and school education has become a future trend and an important driving force for the development of education. With the advent of the era of big data, the relationships among students’ learning status data are closer to nonlinear; nevertheless, analysis with artificial intelligence technology shows that students’ living habits are closely related to their academic performance. In this paper, we investigated and analyzed the living habits and learning conditions of more than 2000 students in the past 10 grades in the Information College of the Institute of Disaster Prevention. We used a hierarchical clustering algorithm to classify the nearly 180000 records collected, and used the big data visualization technology of Echarts + iView + GIS and the JavaScript development method to dynamically display the students’ life tracks and learning information on a map. We then applied the three-dimensional ArcGIS for JS API technology to show the network infrastructure of the campus. Finally, a training model was established based on historical learning achievements, life trajectories, graduates’ salaries, school infrastructure, and other information, combined with the Back Propagation neural network algorithm. Analysis of the training results showed that students’ academic performance was related to reasonable laboratory study time, dormitory stay time, physical exercise time, and social entertainment time. The system can intelligently predict students’ academic performance and give reasonable suggestions according to the established prediction model. The realization of this project can provide technical support for university educators.

