Parallel clustering method for non-disjoint partitioning of large-scale data based on the Spark framework

Author(s):
Abir Zayani, Chiheb-Eddine Ben N'Cir, Nadia Essoussi

Author(s):
Xu Yin, Hong Xingyong, Zhou Wenjiang, Wang Lunwen, Zhang Ling, ...

2017, Vol 52 (3), pp. 619-636
Author(s):
Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N’cir, Nadia Essoussi

Information, 2020, Vol 11 (3), pp. 148
Author(s):
Anbang Yang, Jiangbo Qian, Huahui Chen, Yihong Dong

With the rapid development of modern society, the volume of generated data has grown exponentially, and finding the required data in this huge pool has become an urgent problem. Hashing is widely used for similarity search over large-scale data, and among hashing methods, ranking-based hashing has been widely studied because of the accuracy and speed of its search results. Most ranking-based hashing algorithms construct their loss functions by comparing the rank consistency of data in Euclidean space and Hamming space. However, most of them suffer from high time complexity and long training times, so they cannot meet practical requirements. To address these problems, this paper introduces the distributed Spark framework and implements a ranking-based hashing algorithm in a parallel environment across multiple machines. The experimental results show that Spark-RLSH (Ranking Listwise Supervision Hashing) can greatly reduce the training time and improve the training efficiency compared with other ranking-based hashing algorithms.
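Below is a minimal, hypothetical PySpark sketch of the training pattern the abstract describes: each executor computes a gradient over its own partition, the gradients are aggregated, and the driver updates the hash projection matrix. The surrogate loss, learning rate, pair-sampling scheme, and the helper local_gradient are illustrative assumptions, not the paper's exact Spark-RLSH objective.

```python
# Hypothetical sketch, not the paper's code: one distributed gradient
# step per iteration for a distance-consistency surrogate loss.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-rlsh-sketch").getOrCreate()
sc = spark.sparkContext

d, b = 128, 32              # feature dimension, number of hash bits
lr, n_parts = 0.01, 8       # learning rate and partitions (assumptions)
W = np.random.randn(d, b)   # projection matrix, kept on the driver

def local_gradient(points, W):
    """Per-partition gradient of a surrogate loss that pushes relaxed
    Hamming distances to follow Euclidean distances (an assumed proxy
    for rank consistency; the two scales are taken as comparable)."""
    X = np.vstack(list(points))
    H = np.tanh(X @ W)                        # relaxed codes in (-1, 1)
    G = np.zeros_like(W)
    rng = np.random.default_rng(0)
    for _ in range(200):                      # sample a few pairs
        i, j = rng.integers(0, len(X), size=2)
        de = np.linalg.norm(X[i] - X[j])      # Euclidean distance
        dh = 0.5 * (b - H[i] @ H[j])          # relaxed Hamming distance
        err = dh - de
        gi = np.outer(X[i], H[j] * (1.0 - H[i] ** 2))  # tanh' chain rule
        gj = np.outer(X[j], H[i] * (1.0 - H[j] ** 2))
        G += err * (-0.5) * (gi + gj)
    return [G]

data = sc.parallelize(list(np.random.randn(10000, d)), n_parts).cache()

for step in range(20):
    W_b = sc.broadcast(W)                     # ship current W to executors
    grad = (data.mapPartitions(lambda pts: local_gradient(pts, W_b.value))
                .treeReduce(lambda a, c: a + c))
    W -= lr * grad / n_parts                  # average and update on driver
```

Because only the small gradient matrices cross the network while the data stays partitioned, training time scales with the number of executors, which is the effect the abstract reports.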


Author(s):
Ahmed M. Serdah, Wesam M. Ashour

Abstract Traditional clustering algorithms are no longer suitable for data mining applications that involve large-scale data. Many large-scale data clustering algorithms have been proposed in recent years, but most of them do not achieve high-quality clustering. Although Affinity Propagation (AP) is effective and accurate on data of ordinary size, it is not effective for large-scale data. This paper proposes two methods for large-scale data clustering that depend on a modified version of the AP algorithm, designed to ensure both low time complexity and good accuracy. Firstly, the data set is divided into several subsets using one of two methods: random fragmentation or K-means. Secondly, each subset is clustered into K clusters using the K-Affinity Propagation (KAP) algorithm to select local cluster exemplars. Thirdly, the inverse weighted clustering algorithm is applied to all local exemplars to select well-suited global exemplars for the whole data set. Finally, each data point is assigned to a cluster according to its similarity to the global exemplars. Results show that the proposed method significantly reduces clustering time and produces clustering results that are more effective and accurate than those of the AP, KAP, and HAP algorithms.
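A compact sketch of the four-step pipeline follows, using scikit-learn stand-ins: plain AffinityPropagation replaces KAP, K-means replaces the inverse weighted clustering step, and nearest-exemplar search gives the final assignment. All parameter values are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage exemplar pipeline described above.
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.metrics import pairwise_distances_argmin

def cluster_large_scale(X, n_subsets=10, k_final=5, seed=0):
    # Step 1: divide the data into subsets (the K-means variant; the
    # random-fragmentation variant would just shuffle and split X).
    subset = KMeans(n_clusters=n_subsets, random_state=seed).fit_predict(X)

    # Step 2: local exemplars per subset via affinity propagation
    # (stand-in for KAP, which fixes the number of exemplars at K).
    local_exemplars = []
    for s in range(n_subsets):
        Xs = X[subset == s]
        ap = AffinityPropagation(random_state=seed).fit(Xs)
        local_exemplars.append(Xs[ap.cluster_centers_indices_])
    local_exemplars = np.vstack(local_exemplars)

    # Step 3: cluster the local exemplars to obtain global exemplars
    # (K-means here; the paper uses inverse weighted clustering).
    global_exemplars = KMeans(n_clusters=k_final,
                              random_state=seed).fit(local_exemplars).cluster_centers_

    # Step 4: assign every point to its nearest global exemplar.
    return pairwise_distances_argmin(X, global_exemplars)

labels = cluster_large_scale(np.random.rand(5000, 8))
```

The cost saving comes from never running the expensive exemplar search on the full data set: AP's message passing touches only one subset at a time, and the final pass over all points is a cheap nearest-neighbor assignment.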


Author(s):
Kaiyang Liao, Fan Zhao, Yuanlin Zheng, Congjun Cao, Mingzhu Zhang

Using clustering to detect useful patterns in large datasets has attracted considerable interest in recent years. The hierarchical K-means (HKM) clustering algorithm is very efficient for large-scale data analysis and has been widely used to build visual vocabularies for large-scale video/image retrieval systems. However, both the speed and the accuracy of hierarchical K-means clustering still leave room for improvement. In this paper, we propose a parallel N-path quantification hierarchical K-means clustering algorithm that improves on hierarchical K-means in three ways. Firstly, we replace the Euclidean kernel with the Hellinger kernel to improve accuracy. Secondly, the Greedy N-best Paths Labeling method is adopted to further improve clustering accuracy. Thirdly, a parallel clustering algorithm based on multi-core processors is proposed to improve speed. Our results confirm that the proposed algorithm is both faster and more effective.
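The sketch below illustrates these ideas under simplifying assumptions: Hellinger distance between L1-normalized descriptors, greedy N-best-paths descent through a prebuilt vocabulary tree, and a multiprocessing wrapper for the multi-core step. The tree layout (nested dicts with integer word-id leaves, all leaves at the same depth) is an assumption, not the paper's data structure.

```python
# Hypothetical sketch, not the paper's implementation.
import numpy as np
from multiprocessing import Pool

def hellinger_distance(p, q):
    # Hellinger distance between nonnegative, L1-normalized histograms;
    # equal to the Euclidean distance between their square roots.
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q))

def n_best_paths_lookup(node, x, n_paths=3):
    """Descend the tree keeping the n_paths closest branches per level.

    Assumed node layout: {"centers": (k, d) array, "children": list of
    child nodes, or integer word ids at the last level}; the tree is
    assumed balanced so all leaves sit at the same depth."""
    frontier = [(0.0, node)]
    while isinstance(frontier[0][1], dict):
        candidates = []
        for _, nd in frontier:
            dists = [hellinger_distance(c, x) for c in nd["centers"]]
            for i in np.argsort(dists)[:n_paths]:
                candidates.append((dists[i], nd["children"][i]))
        candidates.sort(key=lambda t: t[0])
        frontier = candidates[:n_paths]       # greedy: keep N best paths
    return frontier[0][1]                     # word id on the best path

def quantize_parallel(tree, descriptors, n_paths=3, workers=4):
    # Multi-core quantization of many descriptors (call this under
    # `if __name__ == "__main__":` on platforms that spawn processes).
    with Pool(workers) as pool:
        return pool.starmap(n_best_paths_lookup,
                            [(tree, x, n_paths) for x in descriptors])
```

Keeping N candidate paths instead of one reduces the quantization errors that a single greedy descent commits near cell boundaries, at the cost of visiting N times as many nodes per level.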


2009, Vol 28 (11), pp. 2737-2740
Author(s):
Xiao ZHANG, Shan WANG, Na LIAN
