Improving the K-Means Clustering Algorithm Oriented to Big Data Environments

In recent years, the volume of natural-language text in digital format has grown impressively. Obtaining useful information from such large volumes of data requires new specialized techniques and efficient algorithms. Text mining consists of extracting meaningful patterns from texts, and one of its basic approaches is clustering. The most widely used clustering algorithm is k-means. This chapter proposes an improvement to the convergence step of the k-means algorithm: the process stops whenever the number of objects that change their assigned cluster in the current iteration is larger than the number that changed in the previous iteration. Experimental results showed a reduction in execution time of up to 93%. Notably, better results are generally obtained as the volume of text increases, particularly for texts in big data environments.
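The stopping rule described in this abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the function name `kmeans_early_stop`, the seeding of centroids from the first `k` points, the squared Euclidean distance, and the choice to keep the labels of the final iteration are all assumptions made for illustration.

```python
import numpy as np

def kmeans_early_stop(points, k, max_iter=100):
    """k-means with an extra stop: quit when more objects change
    cluster in this iteration than in the previous one."""
    x = np.asarray(points, dtype=float)
    centroids = x[:k].copy()          # assumption: first k points seed the centroids
    labels = np.full(len(x), -1)      # -1 means "not yet assigned"
    prev_changed = len(x)             # first pass changes every label, so never stop on it
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        changed = int((new_labels != labels).sum())
        labels = new_labels
        # Classic criterion (no changes) plus the rule from the abstract.
        if changed == 0 or changed > prev_changed:
            break
        prev_changed = changed
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids, iterations

labels, centroids, iterations = kmeans_early_stop(
    [[0, 0], [0, 1], [10, 10], [10, 11]], k=2
)
print(labels.tolist(), iterations)
```

On well-separated data such as this toy input, the count of changed objects shrinks every iteration, so only the classic criterion fires. The extra rule matters on larger, noisier inputs where assignments start to oscillate.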

An Optimal Data Placement Strategy for Improving System Performance of Massive Data Applications Using Graph Clustering

International Journal of Ambient Computing and Intelligence, 2018, Vol 9 (3), pp. 15-30, DOI 10.4018/ijaci.2018070102

Cited By ~ 4

Author(s): S. Vengadeswaran, S. R. Balasundaram

Keyword(s): Big Data, Execution Time, Clustering Algorithm, Graph Clustering, Data Placement, Data Locality, Query Execution, Data Set, Statistical Measures, Default Data

This article describes how the time taken to execute a query and return the results increases exponentially as the data size increases, leading to longer waiting times for the user. Hadoop, with its distributed processing capability, is considered an efficient solution for processing such large data. Hadoop's Default Data Placement Strategy (HDDPS) allocates data blocks randomly across the cluster of nodes without considering any execution parameters. As a result, the blocks required for execution are not available on the local machine, so the data has to be transferred across the network, which leads to a data locality issue. It is also commonly observed that most data-intensive applications show grouping semantics, so only a part of the big data set is used during query execution. Because such execution parameters and grouping behavior are not considered, the default placement does not perform well, resulting in shortcomings such as fewer local map task executions, increased query execution time, and higher query latency. To overcome these issues, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. First, the user history log is dynamically analyzed to identify the access pattern, which is depicted as a graph. Markov clustering, a graph clustering algorithm, is applied to identify groupings within the dataset. Then, an Optimal Data Placement Algorithm (ODPA) is proposed based on statistical measures estimated from the clustered graph. This in turn reorganizes the default data layouts in HDFS to achieve improved performance for big data sets in a heterogeneous distributed environment. The proposed strategy was tested on a 15-node cluster placed in a single-rack topology. It proved more efficient for massive datasets, reducing query execution time by 26% and improving data locality by 38% compared to HDDPS.
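The Markov clustering step mentioned in this abstract can be sketched as follows. This is a generic, textbook-style version and not the article's ODPS code: the function name, the inflation value, the tolerance, and the hand-written adjacency matrix (standing in for a graph built from the user history log) are all assumptions.

```python
import numpy as np

def markov_clustering(adjacency, inflation=2.0, max_iter=100, tol=1e-6):
    """Cluster an undirected graph by alternating expansion and inflation."""
    a = np.asarray(adjacency, dtype=float)
    m = a + np.eye(a.shape[0])                 # add self-loops
    m = m / m.sum(axis=0, keepdims=True)       # make columns sum to 1
    for _ in range(max_iter):
        prev = m
        m = m @ m                              # expansion: two-step random walks
        m = m ** inflation                     # inflation: strengthen strong flows
        m = m / m.sum(axis=0, keepdims=True)   # renormalize columns
        if np.abs(m - prev).max() < tol:
            break
    # Each row with non-negligible mass lists the members of one cluster.
    clusters = set()
    for row in m:
        members = tuple(int(i) for i in np.nonzero(row > 1e-6)[0])
        if members:
            clusters.add(members)
    return sorted(clusters)

# Hypothetical access graph: two groups of three blocks that are accessed together.
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_clustering(graph))
```

In a placement strategy of the kind the abstract describes, the resulting groups would indicate which data blocks are candidates for co-location. How the article turns clusters into placements is not stated here and is not modeled in the sketch.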

The Analysis and Implementation of the K - Means Algorithm Based on Hadoop Platform

Computer and Information Science, 2018, Vol 11 (1), pp. 98, DOI 10.5539/cis.v11n1p98

Author(s): Liu Xiang Wei

Keyword(s): Big Data, Data Storage, Clustering Algorithm, Experimental Results, Mode Of Operation, Cluster Configuration, Hadoop Platform, Kmeans Algorithm

Society has entered the era of big data. The diversity and volume of data pose great challenges for data storage and processing, and Hadoop's HDFS and MapReduce address these two problems well. The classical K-means algorithm is the most widely used partition-based clustering algorithm. After completing the cluster configuration, this paper describes the operating principle of the k-means algorithm in cluster mode, implements the algorithm in that mode, analyzes the experimental results, and summarizes the strengths and limitations of running k-means on the Hadoop platform.
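One common way to express a k-means iteration in the MapReduce style is sketched below in plain Python. It does not use the Hadoop API and is not taken from the paper: the function names and the in-memory grouping that stands in for the shuffle phase are assumptions for illustration.

```python
from collections import defaultdict

def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances)), point

def reduce_cluster(points):
    """Reduce step: average all points that share a centroid index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_iteration(points, centroids):
    """One iteration: map every point, group by key, reduce every group."""
    groups = defaultdict(list)        # stands in for the shuffle phase
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)
    # A centroid with no assigned points is kept unchanged.
    return [
        reduce_cluster(groups[i]) if groups[i] else tuple(centroids[i])
        for i in range(len(centroids))
    ]

points = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(kmeans_iteration(points, [(0, 0), (10, 10)]))
```

On a real cluster each iteration would typically run as a separate job, with the updated centroids passed to the next one. That per-iteration overhead is often cited as a limitation of running k-means on MapReduce.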

Big Data Summarization Using Novel Clustering Algorithm and Semantic Feature Approach

International Journal of Rough Sets and Data Analysis, 2017, Vol 4 (3), pp. 108-117, DOI 10.4018/ijrsda.2017070108

Author(s): Shilpa G. Kolte, Jagdish W. Bakal

Keyword(s): Big Data, Clustering Algorithm, Federal Court, Semantic Feature, Experimental Results, Semantic Features, Legal Cases, Data Summarization, Summarization Method, Better Than

This paper proposes a summarization method for big data (i.e., documents and texts) that uses a novel clustering algorithm and semantic features. The proposed system works in four phases and provides a modular implementation of multi-document summarization. Experimental results on the Iris dataset show that the proposed clustering algorithm performs better than the K-means and K-medoids algorithms. The performance of the summarization method is evaluated using Australian legal cases from the Federal Court of Australia (FCA) database. The experimental results demonstrate that the proposed method summarizes big data documents better than existing systems.

An Analysis on the Trend of Dissertation Results Related to Chinese Education in Korea Using Big Data Text Mining : A Study of 20 Years from 2000 to 2019

The Journal of Chinese Language and Literature, 2019, Vol 119, pp. 253-283, DOI 10.25021/jcll.2019.12.119.253

Author(s): Eun-jae Choi

Keyword(s): Big Data, Text Mining, Chinese Education

Big Data Text Mining Analysis of Chinese Electric Vehicle Design -Focusing on the Chinese Market-

The Treatise on The Plastic Media, 2020, Vol 23 (4), pp. 69-77, DOI 10.35280/kotpm.2020.23.4.8

Author(s): Lei Sun, Sang Young Lee

Keyword(s): Big Data, Text Mining, Electric Vehicle, Vehicle Design, Chinese Market

A Comparison of Perception on Creativity between Academic Research and Social Big Data Using Text-Mining Techniques

Korean Society for Creativity Education, 2020, Vol 20 (3), pp. 47-67, DOI 10.36358/jce.2020.20.3.47

Author(s): Eunbyul Cho, Jiyeon Min, Soowon Park

Keyword(s): Big Data, Text Mining, Academic Research, Social Big Data

Lightweight Blockchain Processing. Case Study: Scanned Document Tracking on Tezos Blockchain

Applied Sciences, 2021, Vol 11 (15), pp. 7169, DOI 10.3390/app11157169

Author(s): Mohamed Allouche, Tarek Frikha, Mihai Mitrea, Gérard Memmi, Faten Chaabane

Keyword(s): Load Balancing, Relative Error, Execution Time, General Purpose, Experimental Results, Raspberry Pi, Embedded Platform, Memory Resources, Processing Solution

To bridge the current gap between Blockchain expectations and their intensive computation constraints, the present paper advances a lightweight processing solution, based on a load-balancing architecture, that is compatible with the lightweight/embedded processing paradigms. The execution of complex operations is securely delegated to an off-chain general-purpose computing machine, while the intimate Blockchain operations are kept on-chain. The illustrations correspond to an on-chain Tezos configuration and to a multiprocessor ARM embedded platform (integrated into a Raspberry Pi). Performance is assessed in terms of security, execution time, and CPU consumption when carrying out a visual document fingerprint task. It is thus demonstrated that the advanced solution makes it possible to deploy a computing-intensive application under severely constrained computation and memory resources, as set by a Raspberry Pi 3. The experimental results show that up to nine Tezos nodes can be deployed on a single Raspberry Pi 3 and that the limitation derives not from the memory but from the computation resources. The execution time with a limited number of fingerprints is 40% higher than using a classical PC solution (value computed with 95% relative error lower than 5%).

DV-DVFS: merging data variety and DVFS technique to manage the energy consumption of big data processing

Journal Of Big Data, 2021, Vol 8 (1), DOI 10.1186/s40537-021-00437-7

Author(s): Hossein Ahmadvand, Fouzhan Foroutan, Mahmood Fathy

Keyword(s): Big Data, Energy Consumption, Processing Time, Experimental Results, The Other, Data Sets, Multiple Sources, Evaluation Phase, Dynamic Voltage, Processing Resources

Data variety is one of the most important features of Big Data. Data variety is the result of aggregating data from multiple sources and of the uneven distribution of data. This feature of Big Data causes high variation in the consumption of processing resources such as CPU. This issue has been overlooked in previous works. To overcome this problem, in the present work we use Dynamic Voltage and Frequency Scaling (DVFS) to reduce the energy consumption of computation. To this end, we consider two types of deadlines as our constraint. Before applying the DVFS technique to computer nodes, we estimate the processing time and the frequency needed to meet the deadline. In the evaluation phase, we use a set of data sets and applications. The experimental results show that our proposed approach surpasses the other scenarios in processing real datasets. Based on the experimental results in this paper, DV-DVFS can achieve up to 15% improvement in energy consumption.
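The idea of estimating the frequency needed to meet a deadline can be illustrated with a small sketch. It is not the DV-DVFS method itself: the function names, the cycle-count workload model, the list of available frequencies, and the simplified assumption that energy for a fixed workload scales with the square of the frequency (voltage taken as proportional to frequency) are all hypothetical.

```python
def pick_frequency(cycles, deadline_s, available_hz):
    """Return the lowest available frequency that finishes `cycles`
    of work within `deadline_s` seconds, or None if none can."""
    for freq in sorted(available_hz):
        if cycles / freq <= deadline_s:   # estimated processing time at this frequency
            return freq
    return None

def relative_energy(freq, max_freq):
    """Energy relative to running at max_freq, under the simplified
    assumption that energy for fixed work scales with frequency squared."""
    return (freq / max_freq) ** 2

# Hypothetical workload: 3e9 cycles that must finish within 2 seconds.
levels = [1.0e9, 1.5e9, 2.0e9, 3.0e9]
chosen = pick_frequency(3.0e9, 2.0, levels)
print(chosen, relative_energy(chosen, max(levels)))
```

Choosing the lowest frequency that still meets the deadline is what lets a DVFS scheme save energy on lightly loaded parts of the data, which is where variation in CPU consumption across a varied dataset becomes useful.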

Research on Judicial Big Data Text Mining and Sentencing Prediction Model

Journal of Physics Conference Series, 2021, Vol 1883 (1), pp. 012158, DOI 10.1088/1742-6596/1883/1/012158

Author(s): Juan Xu

Keyword(s): Big Data, Text Mining, Prediction Model

Research on the university intelligent learning analysis system based on AI

Journal of Intelligent & Fuzzy Systems, 2021, pp. 1-10, DOI 10.3233/jifs-189820

Author(s): Meng Huang, Shuai Liu, Yahao Zhang, Kewei Cui, Yana Wen

Keyword(s): Artificial Intelligence, Big Data, Academic Performance, Clustering Algorithm, Back Propagation, Three Dimensional, Training Model, Future Trend, Artificial Intelligence Technology, Visualization Technology

The integration of artificial intelligence technology and school education has become a future trend and an important driving force for the development of education. With the advent of the era of big data, although the relationships among students’ learning status data are closer to nonlinear, analysis with artificial intelligence technology shows that students’ living habits are closely related to their academic performance. In this paper, through an investigation and analysis of the living habits and learning conditions of more than 2000 students across the past 10 grades in the Information College of the Institute of Disaster Prevention, we used a hierarchical clustering algorithm to classify the nearly 180000 records collected, and used the big data visualization technology of Echarts + iView + GIS together with JavaScript development to dynamically display the students’ life tracks and learning information on a map. We then applied the three-dimensional ArcGIS for JS API technology to show the network infrastructure of the campus. Finally, a training model was established based on historical learning achievements, life trajectories, graduates’ salaries, school infrastructure, and other information, combined with the Back Propagation neural network algorithm. Analysis of the training results showed that students’ academic performance is related to reasonable laboratory study time, dormitory stay time, physical exercise time, and social entertainment time. The system can intelligently predict students’ academic performance and give reasonable suggestions according to the established prediction model. The realization of this project can provide technical support for university educators.
