The Analysis and Implementation of the K-Means Algorithm Based on Hadoop Platform

Society has entered the era of big data, and the diversity and growing volume of data pose great challenges for data storage and processing. Hadoop's HDFS and MapReduce address these two problems well. The classical K-means algorithm is the most widely used partition-based clustering algorithm. After completing the cluster configuration, this paper examines the operating principle of the K-means algorithm in cluster mode, implements K-means in cluster mode, analyzes the experimental results, and summarizes the strengths and limitations of running K-means on the Hadoop platform.
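As an illustration only, one K-means iteration can be split into a map phase and a reduce phase. The sketch below is plain Python rather than Hadoop code, and it is not the paper's implementation; the function names `map_phase`, `reduce_phase`, and `kmeans` are hypothetical, and the in-memory grouping stands in for the shuffle step that MapReduce would perform.

```python
from collections import defaultdict
from math import dist


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every input point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: average the points grouped under each centroid index."""
    groups = defaultdict(list)  # stands in for the MapReduce shuffle
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids


def kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Repeat map + reduce until no centroid moves by more than tol."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 0), (10, 2)]
    print(kmeans(data, [(0, 0), (10, 0)]))
```

On a real cluster each iteration would typically be a separate job, with the current centroids distributed to every mapper; that detail is an assumption about common practice, not something the abstract states.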

Improving the K-Means Clustering Algorithm Oriented to Big Data Environments

Handbook of Research on Natural Language Processing and Smart Service Systems - Advances in Computational Intelligence and Robotics ◽

10.4018/978-1-7998-4730-4.ch013 ◽

2021 ◽

pp. 289-308

Author(s):

Joaquín Pérez Ortega ◽

Nelva Nely Almanza Ortega ◽

Andrea Vega Villalobos ◽

Marco A. Aguirre L. ◽

Crispín Zavala Díaz ◽

...

Keyword(s):

Big Data ◽

Text Mining ◽

Large Volume ◽

Execution Time ◽

Clustering Algorithm ◽

Efficient Algorithms ◽

Experimental Results ◽

Digital Format ◽

Basic Approaches ◽

Previous Iteration

In recent years, the amount of natural-language text in digital format has grown impressively. To obtain useful information from a large volume of data, new specialized techniques and efficient algorithms are required. Text mining consists of extracting meaningful patterns from texts; one of the basic approaches is clustering. The most used clustering algorithm is k-means. This chapter proposes an improvement of the k-means algorithm in the convergence step; the process stops whenever the number of objects that change their assigned cluster in the current iteration is bigger than the number that changed in the previous iteration. Experimental results showed a reduction in execution time of up to 93%. Notably, better results are generally obtained as the volume of text increases, particularly for texts in big data environments.
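The stopping rule described in this abstract can be sketched as follows. This is a minimal illustration, not the chapter's code: the function names are hypothetical, and keeping the classical "no object changed cluster" test alongside the new rule is an assumption.

```python
from math import dist


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points]


def update(points, labels, centroids):
    """Recompute each centroid as the mean of its members."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)  # an empty cluster keeps its old centroid
    return new_centroids


def kmeans_early_stop(points, centroids, max_iter=100):
    """K-means that stops once more objects change cluster than last time."""
    labels = assign(points, centroids)
    prev_changes = None
    iteration = 0
    for iteration in range(1, max_iter + 1):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        changes = sum(a != b for a, b in zip(labels, new_labels))
        labels = new_labels
        if changes == 0:
            break  # classical convergence (assumed to be kept)
        if prev_changes is not None and changes > prev_changes:
            break  # rule from the abstract: changes grew versus last iteration
        prev_changes = changes
    return centroids, labels, iteration
```

The early exit trades a small amount of clustering quality for fewer iterations; how large that trade-off is depends on the data and is not something this sketch measures.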

PARSUC: A Parallel Subsampling-Based Method for Clustering Remote Sensing Big Data

Sensors ◽

10.3390/s19153438 ◽

2019 ◽

Vol 19 (15) ◽

pp. 3438 ◽

Cited By ~ 3

Author(s):

Xia ◽

Huang ◽

Li ◽

Zhou ◽

Zhang

Keyword(s):

Remote Sensing ◽

Big Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Image Data ◽

Data Partitioning ◽

Data Mining Technique ◽

Mining Technique ◽

Hadoop Platform ◽

Parallel Clustering

Remote sensing big data (RSBD) is generally characterized by huge volumes, diversity, and high dimensionality. Mining hidden information from RSBD for different applications imposes significant computational challenges. Clustering is an important data mining technique widely used in processing and analyzing remote sensing imagery. However, conventional clustering algorithms are designed for relatively small datasets. When applied to problems with RSBD, they are, in general, too slow or inefficient for practical use. In this paper, we propose a parallel subsampling-based clustering (PARSUC) method for improving the performance of RSBD clustering in terms of both efficiency and accuracy. PARSUC leverages a novel subsampling-based data partitioning (SubDP) method to realize three-step parallel clustering, effectively solving the notable performance bottleneck of the existing parallel clustering algorithms; that is, they must cope with numerous repeated calculations to get a reasonable result. Furthermore, we propose a centroid filtering algorithm (CFA) to eliminate subsampling errors and to guarantee the accuracy of the clustering results. PARSUC was implemented on a Hadoop platform by using the MapReduce parallel model. Experiments conducted on massive remote sensing images of different sizes showed that PARSUC (1) provided much better accuracy than conventional remote sensing clustering algorithms in handling larger image data; (2) achieved notable scalability as computing nodes were added; and (3) spent much less time than the existing parallel clustering algorithm in handling RSBD.

Experimental Characteristics Study of Data Storage Formats for Data Marts Development within Data Lakes

Applied Sciences ◽

10.3390/app11188651 ◽

2021 ◽

Vol 11 (18) ◽

pp. 8651

Author(s):

Vladimir Belov ◽

Alexander N. Kosenkov ◽

Evgeny Nikulchev

Keyword(s):

Big Data ◽

Data Storage ◽

Storage System ◽

Apache Hadoop ◽

Aggregated Data ◽

Data Marts ◽

Hadoop Platform ◽

Analytical Platforms ◽

Big Data Storage

One of the most popular methods for building analytical platforms involves the use of the concept of data lakes. A data lake is a storage system in which the data are presented in their original format, making it difficult to conduct analytics or present aggregated data. To solve this issue, data marts are used: stores of highly specialized information focused on the requests of employees of a particular department or line of an organization’s work. This article presents a study of big data storage formats in the Apache Hadoop platform when used to build data marts.

Big Data Summarization Using Novel Clustering Algorithm and Semantic Feature Approach

International Journal of Rough Sets and Data Analysis ◽

10.4018/ijrsda.2017070108 ◽

2017 ◽

Vol 4 (3) ◽

pp. 108-117

Author(s):

Shilpa G. Kolte ◽

Jagdish W. Bakal

Keyword(s):

Big Data ◽

Clustering Algorithm ◽

Federal Court ◽

Semantic Feature ◽

Experimental Results ◽

Semantic Features ◽

Legal Cases ◽

Data Summarization ◽

Summarization Method ◽

Better Than

This paper proposes a big data (i.e., documents, texts) summarization method based on a novel clustering algorithm and semantic features. The proposed system works in four phases and provides a modular implementation of multi-document summarization. Experimental results on the Iris dataset show that the proposed clustering algorithm performs better than the K-means and K-medoids algorithms. The performance of big data summarization is evaluated using Australian legal cases from the Federal Court of Australia (FCA) database. The experimental results demonstrate that the proposed method summarizes big data documents better than existing systems.

Smart Meter Evaluation System Based on Big Data

Journal of Physics Conference Series ◽

10.1088/1742-6596/2143/1/012039 ◽

2021 ◽

Vol 2143 (1) ◽

pp. 012039

Author(s):

Yu Zhou ◽

Xuesong Shao ◽

Zhuowen Mu ◽

Qixin Cai ◽

Yue Li ◽

...

Keyword(s):

Big Data ◽

Data Storage ◽

Evaluation System ◽

Clustering Algorithm ◽

Massive Data ◽

Smart Meter ◽

Smart Meters ◽

Multiple Systems ◽

Material Resources ◽

Big Data Technology

To address the problem of unstable operation of smart meters, this paper proposes a smart meter evaluation system based on big data. First, it introduces big data technologies such as data storage, analysis, and mining. Second, it designs the model of the big data smart meter evaluation system. Finally, it uses big data technology and a clustering algorithm to realize the design of the big data smart meter operation evaluation system. The system can convert massive data from multiple systems into operational evaluation reports, which helps to reduce the waste of human and material resources.

Analysis of Big Data Storage Tools for Data Lakes based on Apache Hadoop Platform

International Journal of Advanced Computer Science and Applications ◽

10.14569/ijacsa.2021.0120864 ◽

2021 ◽

Vol 12 (8) ◽

Author(s):

Vladimir Belov ◽

Evgeny Nikulchev

Keyword(s):

Big Data ◽

Data Storage ◽

Apache Hadoop ◽

Hadoop Platform ◽

Big Data Storage

Massive AIS data storage and query based on Hadoop platform

Journal of Physics Conference Series ◽

10.1088/1742-6596/1948/1/012016 ◽

2021 ◽

Vol 1948 (1) ◽

pp. 012016

Author(s):

Taizhi Lv ◽

Chenyong He ◽

Juan Zhang ◽

Zhiyang Song

Keyword(s):

Data Storage ◽

Hadoop Platform

Big Data Storage Concepts

Big Data ◽

10.1002/9781119701859.ch2 ◽

2021 ◽

pp. 31-52

Keyword(s):

Big Data ◽

Data Storage ◽

Big Data Storage

DV-DVFS: merging data variety and DVFS technique to manage the energy consumption of big data processing

Journal Of Big Data ◽

10.1186/s40537-021-00437-7 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Hossein Ahmadvand ◽

Fouzhan Foroutan ◽

Mahmood Fathy

Keyword(s):

Big Data ◽

Energy Consumption ◽

Processing Time ◽

Experimental Results ◽

The Other ◽

Data Sets ◽

Multiple Sources ◽

Evaluation Phase ◽

Dynamic Voltage ◽

Processing Resources

Data variety is one of the most important features of Big Data. Data variety is the result of aggregating data from multiple sources and uneven distribution of data. This feature of Big Data causes high variation in the consumption of processing resources such as CPU consumption. This issue has been overlooked in previous works. To overcome this problem, in the present work, we use Dynamic Voltage and Frequency Scaling (DVFS) to reduce the energy consumption of computation. To this end, we consider two types of deadlines as our constraint. Before applying the DVFS technique to compute nodes, we estimate the processing time and the frequency needed to meet the deadline. In the evaluation phase, we used a set of data sets and applications. The experimental results show that our proposed approach surpasses the other scenarios in processing real datasets. Based on the experimental results in this paper, DV-DVFS can achieve up to 15% improvement in energy consumption.
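The idea of estimating the frequency needed to meet a deadline can be illustrated with a deliberately simple model. The sketch below is not the DV-DVFS estimator: it assumes that processing time scales inversely with CPU frequency, and the function name, parameters, and example values are all hypothetical.

```python
def lowest_frequency_for_deadline(time_at_ref, ref_freq, deadline, available_freqs):
    """Pick the lowest available frequency whose estimated time meets the deadline.

    Assumption (not from the paper): time(f) = time_at_ref * ref_freq / f.
    Returns None when even the highest frequency misses the deadline.
    """
    for freq in sorted(available_freqs):
        estimated_time = time_at_ref * ref_freq / freq
        if estimated_time <= deadline:
            return freq
    return None


if __name__ == "__main__":
    # A task measured at 100 s on a 2.0 GHz setting, with a 150 s deadline.
    print(lowest_frequency_for_deadline(100.0, 2.0, 150.0, [1.0, 1.2, 1.6, 2.0]))
```

Choosing the lowest frequency that still meets the deadline is what saves energy under this model; real workloads with memory-bound or I/O-bound phases would not follow the inverse-scaling assumption exactly.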

Research on the university intelligent learning analysis system based on AI

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189820 ◽

2021 ◽

pp. 1-10

Author(s):

Meng Huang ◽

Shuai Liu ◽

Yahao Zhang ◽

Kewei Cui ◽

Yana Wen

Keyword(s):

Artificial Intelligence ◽

Big Data ◽

Academic Performance ◽

Clustering Algorithm ◽

Back Propagation ◽

Three Dimensional ◽

Training Model ◽

Future Trend ◽

Artificial Intelligence Technology ◽

Visualization Technology

The integration of artificial intelligence technology and school education has become a trend and an important driving force for the development of education. With the advent of the era of big data, although the relationships among students’ learning status data are closer to nonlinear, analysis with artificial intelligence technology shows that students’ living habits are closely related to their academic performance. In this paper, by investigating and analyzing the living habits and learning conditions of more than 2000 students across the past 10 grades in the Information College of the Institute of Disaster Prevention, we used a hierarchical clustering algorithm to classify the nearly 180000 records collected, and used the big data visualization technology of Echarts + iView + GIS with JavaScript development to dynamically display students’ life tracks and learning information on a map; we then applied the Three Dimensional ArcGIS for JS API to show the network infrastructure of the campus. A training model was then established from historical learning achievements, life trajectories, graduates’ salaries, school infrastructure, and other information, combined with the Back Propagation neural network algorithm. Analysis of the training results showed that students’ academic performance was related to reasonable laboratory study time, dormitory stay time, physical exercise time, and social entertainment time. Finally, the system can intelligently predict students’ academic performance and give reasonable suggestions according to the established prediction model. This project can provide technical support for university educators.
