Parallel Implementation of Ant-Based Clustering Algorithm Based on Hadoop

Author(s):  
Yan Yang ◽  
Xianhua Ni ◽  
Hongjun Wang ◽  
Yiteng Zhao
2016 ◽  
Vol 48 ◽  
pp. 35-41 ◽  
Author(s):  
Isabel Timón ◽  
Jesús Soto ◽  
Horacio Pérez-Sánchez ◽  
José M. Cecilia

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Biqiu Li ◽  
Jiabin Wang ◽  
Xueli Liu

Data is an important source for knowledge discovery, but similar duplicate records not only increase database redundancy but also hamper subsequent data mining, so cleaning them improves work efficiency. Given the complexity of the Chinese language and the performance bottleneck that single-machine systems face on large-scale data, this paper proposes a Chinese data cleaning method that combines the BERT model with the k-means clustering algorithm and gives a parallel implementation scheme for it. When converting text to vectors, a position vector is introduced to capture the contextual features of words, and the vectors are adjusted dynamically according to semantics, so that polysemous words obtain different representations in different contexts. This vectorization process is parallelized on Hadoop. The k-means algorithm then clusters similar duplicate records so that they can be cleaned. Experimental results on a variety of data sets show that the proposed parallel cleaning algorithm not only achieves good speedup and scalability but also improves the precision and recall of similar duplicate data cleaning, which is of great significance for subsequent data mining.
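The clustering step of the abstract can be sketched as follows. The BERT embedding stage is out of scope here, so the toy two-dimensional "embeddings", the evenly spaced center initialization, and the data are illustrative assumptions, not the paper's implementation:

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means; returns a cluster label for each point (k >= 2 assumed)."""
    # Deterministic init for this sketch: evenly spaced points as centers.
    centers = [list(points[round(i * (len(points) - 1) / (k - 1))])
               for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centers[c]))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy "embeddings": two near-duplicate pairs plus one unique record.
vectors = [[0.9, 0.1], [0.92, 0.11],   # near-duplicates of record A
           [0.1, 0.9], [0.12, 0.88],   # near-duplicates of record B
           [0.5, 0.5]]                 # unique record
labels = kmeans(vectors, k=3)
# Records sharing a label are candidate duplicates; keep one per cluster.
```

In the paper's setting, the assignment step is what gets distributed across Hadoop nodes, since each record's nearest-center computation is independent.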


Author(s):  
Trevor Cickovski ◽  
Tiffany Flor ◽  
Galen Irving-Sachs ◽  
Philip Novikov ◽  
James Parda ◽  
...  

2019 ◽  
Vol 48 (4) ◽  
pp. 673-681
Author(s):  
Shufen Zhang ◽  
Zhiyu Liu ◽  
Xuebin Chen ◽  
Changyin Luo

To address the limitations of the traditional K-Means clustering algorithm on large-scale data sets, a Hadoop K-Means (HKM) clustering algorithm is proposed. First, the algorithm uses sample density to eliminate the effect of noise points in the data set. Second, it optimizes the selection of the initial center points using the max-min distance principle. Finally, it uses the MapReduce programming model to parallelize the computation. Experimental results show that the proposed algorithm not only produces accurate and stable clustering results but also overcomes the scalability problems that traditional clustering algorithms encounter on large-scale data.
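The max-min distance idea mentioned above can be sketched as a farthest-first traversal: start from one point, then repeatedly add the point whose distance to its nearest chosen center is largest. The data and the choice of seeding with the first point are illustrative assumptions, not HKM's exact procedure:

```python
import math

def max_min_centers(points, k):
    """Max-min (farthest-first) selection of k initial cluster centers."""
    centers = [points[0]]  # assumption: seed with the first point
    while len(centers) < k:
        # Take the point whose distance to its nearest chosen center
        # is largest; this spreads the centers across the data.
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers

# Toy data: three well-separated regions.
data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 5.0]]
centers = max_min_centers(data, k=3)
# Each chosen center lands in a different region of the data.
```

Compared with random initialization, this makes the initial centers mutually distant, which is why the paper pairs it with density-based noise removal: without that step, an outlier would be the farthest point and would be picked as a center.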


2017 ◽  
Vol 107 ◽  
pp. 442-447 ◽  
Author(s):  
Rui Liu ◽  
Xiaoge Li ◽  
Liping Du ◽  
Shuting Zhi ◽  
Mian Wei
