Parallel Implementation of Ant-Based Clustering Algorithm Based on Hadoop

Data is an important source of knowledge discovery, but the existence of similar duplicate data not only increases the redundancy of the database but also affects the subsequent data mining work. Cleaning similar duplicate data is helpful to improve work efficiency. Based on the complexity of the Chinese language and the bottleneck of the single machine system to large-scale data computing performance, this paper proposes a Chinese data cleaning method that combines the BERT model and a k-means clustering algorithm and gives a parallel implementation scheme of the algorithm. In the process of text to vector, the position vector is introduced to obtain the context features of words, and the vector is dynamically adjusted according to the semantics so that the polysemous words can obtain different vector representations in different contexts. At the same time, the parallel implementation of the process is designed based on Hadoop. After that, k-means clustering algorithm is used to cluster similar duplicate data to achieve the purpose of cleaning. Experimental results on a variety of data sets show that the parallel cleaning algorithm proposed in this paper not only has good speedup and scalability but also improves the precision and recall of similar duplicate data cleaning, which will be of great significance for subsequent data mining.

Download Full-text

GPUDePiCt: A Parallel Implementation of a Clustering Algorithm for Computing Degenerate Primers on Graphics Processing Units

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2014.2355231 ◽

2015 ◽

Vol 12 (2) ◽

pp. 445-454 ◽

Cited By ~ 1

Author(s):

Trevor Cickovski ◽

Tiffany Flor ◽

Galen Irving-Sachs ◽

Philip Novikov ◽

James Parda ◽

...

Keyword(s):

Graphics Processing Units ◽

Clustering Algorithm ◽

Parallel Implementation ◽

Degenerate Primers ◽

Graphics Processing

Download Full-text

Efficient parallel implementation of a density peaks clustering algorithm on graphics processing unit

Frontiers of Information Technology & Electronic Engineering ◽

10.1631/fitee.1601786 ◽

2017 ◽

Vol 18 (7) ◽

pp. 915-927 ◽

Cited By ~ 3

Author(s):

Ke-shi Ge ◽

Hua-you Su ◽

Dong-sheng Li ◽

Xi-cheng Lu

Keyword(s):

Clustering Algorithm ◽

Graphics Processing Unit ◽

Parallel Implementation ◽

Processing Unit ◽

Density Peaks ◽

Density Peaks Clustering ◽

Graphics Processing

Download Full-text

Parallel Implementation of Improved K-Means Based on a Cloud Platform

Information Technology And Control ◽

10.5755/j01.itc.48.4.23881 ◽

2019 ◽

Vol 48 (4) ◽

pp. 673-681

Author(s):

Shufen Zhang ◽

Zhiyu Liu ◽

Xuebin Chen ◽

Changyin Luo

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Programming Model ◽

Parallel Implementation ◽

Clustering Algorithms ◽

Data Set ◽

Large Scale Data ◽

Sample Density ◽

Scale Data ◽

Selection Of

In order to solve the problem of traditional K-Means clustering algorithm in dealing with large-scale data set, a Hadoop K-Means (referred to HKM) clustering algorithm is proposed. Firstly, according to the sample density, the algorithm eliminates the effects of noise points in the data set. Secondly, it optimizes the selection of the initial center point using the thought of the max-min distance. Finally, it uses a MapReduce programming model to realize the parallelization. Experimental results show that the proposed algorithm not only has high accuracy and stability in clustering results, but can also solve the problems of scalability encountered by traditional clustering algorithms in dealing with large scale data.

Download Full-text

Parallel Implementation of a Density-Based Stream Clustering Algorithm Over a GPU Scheduling System

Lecture Notes in Computer Science - Trends and Applications in Knowledge Discovery and Data Mining ◽

10.1007/978-3-319-13186-3_40 ◽

2014 ◽

pp. 441-453 ◽

Cited By ~ 2

Author(s):

Marwan Hassani ◽

Ayman Tarakji ◽

Lyubomir Georgiev ◽

Thomas Seidl

Keyword(s):

Clustering Algorithm ◽

Parallel Implementation ◽

Scheduling System ◽

Stream Clustering ◽

Gpu Scheduling

Download Full-text

Parallel Implementation of Fuzzy Joint Points Clustering Algorithm on GPU

2019 4th International Conference on Computer Science and Engineering (UBMK) ◽

10.1109/ubmk.2019.8907166 ◽

2019 ◽

Author(s):

Can Atilgan ◽

Baris Tekin Tezel ◽

Efendi Nasiboglu

Keyword(s):

Clustering Algorithm ◽

Parallel Implementation

Download Full-text

Parallel Implementation of Density Peaks Clustering Algorithm Based on Spark

Procedia Computer Science ◽

10.1016/j.procs.2017.03.138 ◽

2017 ◽

Vol 107 ◽

pp. 442-447 ◽

Cited By ~ 5

Author(s):

Rui Liu ◽

Xiaoge Li ◽

Liping Du ◽

Shuting Zhi ◽

Mian Wei

Keyword(s):

Clustering Algorithm ◽

Parallel Implementation ◽

Density Peaks ◽

Density Peaks Clustering

Download Full-text