Improving the K-Means Clustering Algorithm Oriented to Big Data Environments

Author(s):  
Joaquín Pérez Ortega ◽  
Nelva Nely Almanza Ortega ◽  
Andrea Vega Villalobos ◽  
Marco A. Aguirre L. ◽  
Crispín Zavala Díaz ◽  
...  

In recent years, the amount of texts in natural language, in digital format, has had an impressive increase. To obtain useful information from a large volume of data, new specialized techniques and efficient algorithms are required. Text mining consists of extracting meaningful patterns from texts; one of the basic approaches is clustering. The most used clustering algorithm is k-means. This chapter proposes an improvement of the k-means algorithm in the convergence step; the process stops whenever the number of objects that change their assigned cluster in the current iteration is bigger than the ones that changed in the previous iteration. Experimental results showed a reduction in execution time up to 93%. It is remarkable that, in general, better results are obtained when the volume of the text increase, particularly in those texts within big data environments.

2018 ◽  
Vol 9 (3) ◽  
pp. 15-30 ◽  
Author(s):  
S. Vengadeswaran ◽  
S. R. Balasundaram

This article describes how the time taken to execute a query and return the results, increase exponentially as the data size increases, leading to more waiting times of the user. Hadoop with its distributed processing capability is considered as an efficient solution for processing such large data. Hadoop's Default Data Placement Strategy (HDDPS) allocates the data blocks randomly across the cluster of nodes without considering any of the execution parameters. This result in non-availability of the blocks required for execution in local machine so that the data has to be transferred across the network for execution, leading to data locality issue. Also, it is commonly observed that most of the data intensive applications show grouping semantics. Hence during query execution, only a part of the Big-Data set is utilized. Since such execution parameters and grouping behavior are not considered, the default placement does not perform well resulting in several lacunas such as decreased local map task execution, increased query execution time, query latency, etc. In order to overcome such issues, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. Initially, user history log is dynamically analyzed for identifying access pattern which is depicted as a graph. Markov clustering, a Graph clustering algorithm is applied to identify groupings among the dataset. Then, an Optimal Data Placement Algorithm (ODPA) is proposed based on the statistical measures estimated from the clustered graph. This in turn re-organizes the default data layouts in HDFS to achieve improved performance for Big-Data sets in heterogeneous distributed environment. Our proposed strategy is tested in a 15 node cluster placed in a single rack topology. The result has proved to be more efficient for massive datasets, reducing query execution time by 26% and significantly improves the data locality by 38% compared to HDDPS.


2018 ◽  
Vol 11 (1) ◽  
pp. 98
Author(s):  
Liu Xiang Wei

In today's society has entered the era of big data, data of the diversity and the amount of data increases to the data storage and processing brought great challenges, Hadoop HDFS and MapReduce better solves the these two problems. Classical K-means algorithm is the most widely used one based on the partition of the clustering algorithm. At the completion of the cluster configuration based on, the k-means algorithm in cluster mode of operation principle and in the cluster mode realized kmeans algorithm, and the experimental results are research and analysis, summarized the k-means algorithm is run on the Hadoop platform's strengths and limitations.


2017 ◽  
Vol 4 (3) ◽  
pp. 108-117
Author(s):  
Shilpa G. Kolte ◽  
Jagdish W. Bakal

This paper proposes a big data (i.e., documents, texts) summarization method using proposed clustering and semantic features. This paper proposes a novel clustering algorithm which is used for big data summarization. The proposed system works in four phases and provides a modular implementation of multiple documents summarization. The experimental results using Iris dataset show that the proposed clustering algorithm performs better than K-means and K-medodis algorithm. The performance of big data (i.e., documents, texts) summarization is evaluated using Australian legal cases from the Federal Court of Australia (FCA) database. The experimental results demonstrate that the proposed method can summarize big data document superior as compared with existing systems.


2021 ◽  
Vol 11 (15) ◽  
pp. 7169
Author(s):  
Mohamed Allouche ◽  
Tarek Frikha ◽  
Mihai Mitrea ◽  
Gérard Memmi ◽  
Faten Chaabane

To bridge the current gap between the Blockchain expectancies and their intensive computation constraints, the present paper advances a lightweight processing solution, based on a load-balancing architecture, compatible with the lightweight/embedding processing paradigms. In this way, the execution of complex operations is securely delegated to an off-chain general-purpose computing machine while the intimate Blockchain operations are kept on-chain. The illustrations correspond to an on-chain Tezos configuration and to a multiprocessor ARM embedded platform (integrated into a Raspberry Pi). The performances are assessed in terms of security, execution time, and CPU consumption when achieving a visual document fingerprint task. It is thus demonstrated that the advanced solution makes it possible for a computing intensive application to be deployed under severely constrained computation and memory resources, as set by a Raspberry Pi 3. The experimental results show that up to nine Tezos nodes can be deployed on a single Raspberry Pi 3 and that the limitation is not derived from the memory but from the computation resources. The execution time with a limited number of fingerprints is 40% higher than using a classical PC solution (value computed with 95% relative error lower than 5%).


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Hossein Ahmadvand ◽  
Fouzhan Foroutan ◽  
Mahmood Fathy

AbstractData variety is one of the most important features of Big Data. Data variety is the result of aggregating data from multiple sources and uneven distribution of data. This feature of Big Data causes high variation in the consumption of processing resources such as CPU consumption. This issue has been overlooked in previous works. To overcome the mentioned problem, in the present work, we used Dynamic Voltage and Frequency Scaling (DVFS) to reduce the energy consumption of computation. To this goal, we consider two types of deadlines as our constraint. Before applying the DVFS technique to computer nodes, we estimate the processing time and the frequency needed to meet the deadline. In the evaluation phase, we have used a set of data sets and applications. The experimental results show that our proposed approach surpasses the other scenarios in processing real datasets. Based on the experimental results in this paper, DV-DVFS can achieve up to 15% improvement in energy consumption.


2021 ◽  
pp. 1-10
Author(s):  
Meng Huang ◽  
Shuai Liu ◽  
Yahao Zhang ◽  
Kewei Cui ◽  
Yana Wen

The integration of Artificial Intelligence technology and school education had become a future trend, and became an important driving force for the development of education. With the advent of the era of big data, although the relationship between students’ learning status data was closer to nonlinear relationship, combined with the application analysis of artificial intelligence technology, it could be found that students’ living habits were closely related to their academic performance. In this paper, through the investigation and analysis of the living habits and learning conditions of more than 2000 students in the past 10 grades in Information College of Institute of Disaster Prevention, we used the hierarchical clustering algorithm to classify the nearly 180000 records collected, and used the big data visualization technology of Echarts + iView + GIS and the JavaScript development method to dynamically display the students’ life track and learning information based on the map, then apply Three Dimensional ArcGIS for JS API technology showed the network infrastructure of the campus. Finally, a training model was established based on the historical learning achievements, life trajectory, graduates’ salary, school infrastructure and other information combined with the artificial intelligence Back Propagation neural network algorithm. Through the analysis of the training resulted, it was found that the students’ academic performance was related to the reasonable laboratory study time, dormitory stay time, physical exercise time and social entertainment time. Finally, the system could intelligently predict students’ academic performance and give reasonable suggestions according to the established prediction model. The realization of this project could provide technical support for university educators.


Sign in / Sign up

Export Citation Format

Share Document