Parallel clustering method for non-disjoint partitioning of large-scale data based on the Spark framework

Author(s):
Abir Zayani, Chiheb-Eddine Ben N'Cir, Nadia Essoussi

Author(s):
Xu Yin, Hong Xingyong, Zhou Wenjiang, Wang Lunwen, Zhang Ling, ...

2017, Vol 52 (3), pp. 619-636
Author(s):
Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N’cir, Nadia Essoussi

Information, 2020, Vol 11 (3), pp. 148
Author(s):
Anbang Yang, Jiangbo Qian, Huahui Chen, Yihong Dong

With the rapid development of modern society, the volume of generated data has grown exponentially, and finding the required data in this huge pool has become an urgent problem. Hashing is widely used for similarity search over large-scale data, and among hashing methods, ranking-based hashing has been widely studied because of the accuracy and speed of its search results. Most ranking-based hashing algorithms construct their loss functions by comparing the rank consistency of data in Euclidean space and Hamming space. However, most of them suffer from high time complexity and long training times, so they cannot meet practical requirements. To address these problems, this paper introduces the distributed Spark framework and implements a ranking-based hashing algorithm in a parallel environment across multiple machines. The experimental results show that Spark-RLSH (Ranking Listwise Supervision Hashing) can greatly reduce the training time and improve the training efficiency compared with other ranking-based hashing algorithms.
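Below is a minimal, hypothetical PySpark sketch of the training pattern the abstract describes: each executor computes a gradient over its own partition, the gradients are aggregated, and the driver updates the hash projection matrix. The surrogate loss, learning rate, pair-sampling scheme, and the helper local_gradient are illustrative assumptions, not the paper's exact Spark-RLSH objective.

```python
# Hypothetical sketch, not the paper's code: one distributed gradient
# step per iteration for a distance-consistency surrogate loss.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-rlsh-sketch").getOrCreate()
sc = spark.sparkContext

d, b = 128, 32              # feature dimension, number of hash bits
lr, n_parts = 0.01, 8       # learning rate and partitions (assumptions)
W = np.random.randn(d, b)   # projection matrix, kept on the driver

def local_gradient(points, W):
    """Per-partition gradient of a surrogate loss that pushes relaxed
    Hamming distances to follow Euclidean distances (an assumed proxy
    for rank consistency; the two scales are taken as comparable)."""
    X = np.vstack(list(points))
    H = np.tanh(X @ W)                        # relaxed codes in (-1, 1)
    G = np.zeros_like(W)
    rng = np.random.default_rng(0)
    for _ in range(200):                      # sample a few pairs
        i, j = rng.integers(0, len(X), size=2)
        de = np.linalg.norm(X[i] - X[j])      # Euclidean distance
        dh = 0.5 * (b - H[i] @ H[j])          # relaxed Hamming distance
        err = dh - de
        gi = np.outer(X[i], H[j] * (1.0 - H[i] ** 2))  # tanh' chain rule
        gj = np.outer(X[j], H[i] * (1.0 - H[j] ** 2))
        G += err * (-0.5) * (gi + gj)
    return [G]

data = sc.parallelize(list(np.random.randn(10000, d)), n_parts).cache()

for step in range(20):
    W_b = sc.broadcast(W)                     # ship current W to executors
    grad = (data.mapPartitions(lambda pts: local_gradient(pts, W_b.value))
                .treeReduce(lambda a, c: a + c))
    W -= lr * grad / n_parts                  # average and update on driver
```

Because only the small gradient matrices cross the network while the data stays partitioned, training time scales with the number of executors, which is the effect the abstract reports.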


Author(s):
Ahmed M. Serdah, Wesam M. Ashour

Abstract Traditional clustering algorithms are no longer suitable for data mining applications that involve large-scale data. Many large-scale data clustering algorithms have been proposed in recent years, but most of them do not achieve high-quality clustering. Although Affinity Propagation (AP) is effective and accurate on data of ordinary size, it is not effective for large-scale data. This paper proposes two methods for large-scale data clustering that depend on a modified version of the AP algorithm, designed to ensure both low time complexity and good accuracy. Firstly, the data set is divided into several subsets using one of two methods: random fragmentation or K-means. Secondly, each subset is clustered into K clusters using the K-Affinity Propagation (KAP) algorithm to select local cluster exemplars. Thirdly, the inverse weighted clustering algorithm is applied to all local exemplars to select well-suited global exemplars for the whole data set. Finally, each data point is assigned to a cluster according to its similarity to the global exemplars. Results show that the proposed method significantly reduces clustering time and produces clustering results that are more effective and accurate than those of the AP, KAP, and HAP algorithms.
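A compact sketch of the four-step pipeline follows, using scikit-learn stand-ins: plain AffinityPropagation replaces KAP, K-means replaces the inverse weighted clustering step, and nearest-exemplar search gives the final assignment. All parameter values are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage exemplar pipeline described above.
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.metrics import pairwise_distances_argmin

def cluster_large_scale(X, n_subsets=10, k_final=5, seed=0):
    # Step 1: divide the data into subsets (the K-means variant; the
    # random-fragmentation variant would just shuffle and split X).
    subset = KMeans(n_clusters=n_subsets, random_state=seed).fit_predict(X)

    # Step 2: local exemplars per subset via affinity propagation
    # (stand-in for KAP, which fixes the number of exemplars at K).
    local_exemplars = []
    for s in range(n_subsets):
        Xs = X[subset == s]
        ap = AffinityPropagation(random_state=seed).fit(Xs)
        local_exemplars.append(Xs[ap.cluster_centers_indices_])
    local_exemplars = np.vstack(local_exemplars)

    # Step 3: cluster the local exemplars to obtain global exemplars
    # (K-means here; the paper uses inverse weighted clustering).
    global_exemplars = KMeans(n_clusters=k_final,
                              random_state=seed).fit(local_exemplars).cluster_centers_

    # Step 4: assign every point to its nearest global exemplar.
    return pairwise_distances_argmin(X, global_exemplars)

labels = cluster_large_scale(np.random.rand(5000, 8))
```

The cost saving comes from never running the expensive exemplar search on the full data set: AP's message passing touches only one subset at a time, and the final pass over all points is a cheap nearest-neighbor assignment.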


Author(s):
Kaiyang Liao, Fan Zhao, Yuanlin Zheng, Congjun Cao, Mingzhu Zhang

Using clustering to detect useful patterns in large datasets has attracted considerable interest in recent years. The hierarchical K-means (HKM) clustering algorithm is very efficient for large-scale data analysis and has been widely used to build visual vocabularies for large-scale video/image retrieval systems. However, both the speed and the accuracy of hierarchical K-means clustering still leave room for improvement. In this paper, we propose a parallel N-path quantification hierarchical K-means clustering algorithm that improves on hierarchical K-means in three ways. Firstly, we replace the Euclidean kernel with the Hellinger kernel to improve accuracy. Secondly, the Greedy N-best Paths Labeling method is adopted to further improve clustering accuracy. Thirdly, a parallel clustering algorithm based on multi-core processors is proposed to improve speed. Our results confirm that the proposed algorithm is both faster and more effective.
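The sketch below illustrates these ideas under simplifying assumptions: Hellinger distance between L1-normalized descriptors, greedy N-best-paths descent through a prebuilt vocabulary tree, and a multiprocessing wrapper for the multi-core step. The tree layout (nested dicts with integer word-id leaves, all leaves at the same depth) is an assumption, not the paper's data structure.

```python
# Hypothetical sketch, not the paper's implementation.
import numpy as np
from multiprocessing import Pool

def hellinger_distance(p, q):
    # Hellinger distance between nonnegative, L1-normalized histograms;
    # equal to the Euclidean distance between their square roots.
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q))

def n_best_paths_lookup(node, x, n_paths=3):
    """Descend the tree keeping the n_paths closest branches per level.

    Assumed node layout: {"centers": (k, d) array, "children": list of
    child nodes, or integer word ids at the last level}; the tree is
    assumed balanced so all leaves sit at the same depth."""
    frontier = [(0.0, node)]
    while isinstance(frontier[0][1], dict):
        candidates = []
        for _, nd in frontier:
            dists = [hellinger_distance(c, x) for c in nd["centers"]]
            for i in np.argsort(dists)[:n_paths]:
                candidates.append((dists[i], nd["children"][i]))
        candidates.sort(key=lambda t: t[0])
        frontier = candidates[:n_paths]       # greedy: keep N best paths
    return frontier[0][1]                     # word id on the best path

def quantize_parallel(tree, descriptors, n_paths=3, workers=4):
    # Multi-core quantization of many descriptors (call this under
    # `if __name__ == "__main__":` on platforms that spawn processes).
    with Pool(workers) as pool:
        return pool.starmap(n_best_paths_lookup,
                            [(tree, x, n_paths) for x in descriptors])
```

Keeping N candidate paths instead of one reduces the quantization errors that a single greedy descent commits near cell boundaries, at the cost of visiting N times as many nodes per level.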


2009, Vol 28 (11), pp. 2737-2740
Author(s):
Xiao ZHANG, Shan WANG, Na LIAN
