An Adaptive Sweep-circle Spatial Clustering Algorithm Based on Gestalt

An adaptive spatial clustering (ASC) algorithm is proposed that employs sweep-circle techniques and a dynamic threshold setting based on Gestalt theory to detect spatial clusters. The proposed algorithm discovers clusters automatically in one pass, rather than by modifying an initial model (for example, a minimal spanning tree, Delaunay triangulation, or Voronoi diagram). It can quickly identify arbitrarily shaped clusters while adapting efficiently to the non-homogeneous density characteristics of spatial data, without the need for a priori knowledge or parameters. The proposed algorithm is also ideal for data streaming settings, where large spatial data sets with dynamic characteristics arrive as a flow.

An Adaptive Sweep-Circle Spatial Clustering Algorithm Based on Gestalt

10.20944/preprints201708.0040.v2 ◽

2017 ◽

Author(s):

Qingming Zhan ◽

Shuguang Deng ◽

Zhihua Zheng

Keyword(s):

Spatial Data ◽

Clustering Algorithm ◽

Spatial Clustering ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Data Streaming ◽

Spatial Clusters ◽

Streaming Technology ◽

Threshold Setting

An adaptive spatial clustering (ASC) algorithm is proposed in the present study, which employs sweep-circle techniques and a dynamic threshold setting based on Gestalt theory to detect spatial clusters. The proposed algorithm discovers clusters automatically in one pass, rather than by modifying an initial model (for example, a minimal spanning tree, Delaunay triangulation, or Voronoi diagram). It can quickly identify arbitrarily shaped clusters while adapting efficiently to the non-homogeneous density characteristics of spatial data, without the need for prior knowledge or parameters. The proposed algorithm is also ideal for data streaming settings, where large spatial data sets with dynamic characteristics arrive as a flow.

Summary of Affinity Propagation

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.268-270.811 ◽

2011 ◽

Vol 268-270 ◽

pp. 811-816

Author(s):

Yong Zhou ◽

Yan Xing

Keyword(s):

Clustering Algorithm ◽

Large Data ◽

Large Data Sets ◽

Affinity Propagation ◽

Damping Factor ◽

Data Sets ◽

Similarity Matrix ◽

Data Points

Affinity Propagation (AP) is a new clustering algorithm based on the similarity matrix between pairs of data points: messages are exchanged between data points until a clustering result emerges. It is efficient and fast, and it can handle clustering on large data sets. However, traditional Affinity Propagation has many limitations. This paper introduces Affinity Propagation, analyzes its advantages and limitations in depth, and focuses on improvements to the algorithm: improving the similarity matrix, adjusting the preference and the damping factor, and combining it with other algorithms. Finally, it discusses the development of Affinity Propagation.
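The message exchange described in this abstract can be illustrated with a small sketch. It is a plain-Python rendering of the standard responsibility/availability updates, not code from the paper; the function name `affinity_propagation`, the default damping of 0.5, and the fixed iteration count are assumptions made for illustration. The diagonal of the similarity matrix holds the preference.

```python
def affinity_propagation(S, damping=0.5, iterations=200):
    """Cluster with standard affinity propagation message passing.

    S is an n x n similarity matrix (list of lists, n >= 2);
    S[k][k] is the preference of point k to act as an exemplar.
    Returns, for each point, the index of the exemplar it chooses.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        for i in range(n):
            vals = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(v for j, v in enumerate(vals) if j != k)
                new = S[i][k] - best_other
                R[i][k] = damping * R[i][k] + (1 - damping) * new
        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k})
        # a(k,k) = sum of positive r(i',k), i' != k
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # Each point picks the candidate exemplar with the largest a + r.
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]
```

A lower preference tends to yield fewer exemplars, and a larger damping factor slows the updates to suppress oscillation, which is why those two settings are the usual tuning targets.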

Applying the K-Means Algorithm in Big Raw Data Sets with Hadoop and MapReduce

Business Intelligence ◽

10.4018/978-1-4666-9562-7.ch062 ◽

2016 ◽

pp. 1220-1243

Author(s):

Ilias K. Savvas ◽

Georgia N. Sofianidou ◽

M-Tahar Kechadi

Keyword(s):

Big Data ◽

Clustering Algorithm ◽

File System ◽

Large Data ◽

Large Data Sets ◽

Distributed File System ◽

Data Sets ◽

Raw Data ◽

Hadoop Distributed File System ◽

Access To Data

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for distributed processing of large data sets, while HDFS is a distributed file system that provides high-throughput access to data-driven applications, and MapReduce is a software framework for distributed computing on large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results demonstrate the efficiency of the technique; thus, HDFS and MapReduce can be applied to big data with very promising results.
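How K-means decomposes into map and reduce steps can be sketched in a single process. The code below does not use Hadoop and is not the authors' implementation; `map_phase`, `reduce_phase`, and `kmeans_mapreduce` are hypothetical names, and the list of chunks stands in for data blocks that a real cluster would hold on separate nodes.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])),
    )


def map_phase(chunk, centers):
    """Map: emit (center_index, (point, 1)) for every point in one chunk."""
    return [(nearest(p, centers), (p, 1)) for p in chunk]


def reduce_phase(pairs, centers):
    """Reduce: sum points and counts per key, then average into new centers."""
    totals = {}
    for idx, (point, count) in pairs:
        acc, n = totals.get(idx, ([0.0] * len(point), 0))
        totals[idx] = ([a + p for a, p in zip(acc, point)], n + count)
    # A center that received no points keeps its previous position.
    return [
        [a / totals[i][1] for a in totals[i][0]] if i in totals else list(centers[i])
        for i in range(len(centers))
    ]


def kmeans_mapreduce(chunks, centers, iterations=10):
    """Run a fixed number of map/reduce rounds over the chunked data."""
    for _ in range(iterations):
        pairs = [pair for chunk in chunks for pair in map_phase(chunk, centers)]
        centers = reduce_phase(pairs, centers)
    return centers
```

Only per-center sums and counts cross the map/reduce boundary, so the amount of data exchanged per round depends on the number of centers rather than on the number of points.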

Radar Emission Sources Identification Based on Hierarchical Agglomerative Clustering for Large Data Sets

Journal of Sensors ◽

10.1155/2016/1879327 ◽

2016 ◽

Vol 2016 ◽

pp. 1-9 ◽

Cited By ~ 21

Author(s):

Janusz Dudczyk

Keyword(s):

Clustering Algorithm ◽

Large Data ◽

Large Data Sets ◽

Emission Sources ◽

Data Sets ◽

Agglomerative Clustering ◽

Distinctive Features ◽

Identification Process ◽

Hierarchical Agglomerative Clustering ◽

Repetition Interval

More advanced recognition methods, which may recognize particular copies of radars of the same type, are called identification. The identification process of radar devices is a more specialized task which requires methods based on the analysis of distinctive features. These features are extracted from the signals coming from the identified devices. Such a process is called Specific Emitter Identification (SEI). The identification of radar emission sources with classic techniques based on the statistical analysis of basic measurable parameters of a signal, such as Radio Frequency, Amplitude, Pulse Width, or Pulse Repetition Interval, is not sufficient for SEI problems. This paper presents the method of hierarchical data clustering which is used in the process of radar identification. The Hierarchical Agglomerative Clustering Algorithm (HACA), based on the Generalized Agglomerative Scheme (GAS) and implemented and used in the research method, is parameterized; therefore, it is possible to compare the results. The results of clustering are presented in dendrograms in this paper. The results of grouping and identification obtained with HACA are compared with other SEI methods in order to assess their usefulness and effectiveness for systems of the ESM/ELINT class.

A GA-based clustering algorithm for large data sets with mixed numeric and categorical values

10.1117/12.538864 ◽

2003 ◽

Cited By ~ 6

Author(s):

Jie Li ◽

Xinbo Gao ◽

Licheng Jiao

Keyword(s):

Clustering Algorithm ◽

Large Data ◽

Large Data Sets ◽

Data Sets

DETERMINISTIC INITIALIZATION OF THE K-MEANS ALGORITHM USING HIERARCHICAL CLUSTERING

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001412500188 ◽

2012 ◽

Vol 26 (07) ◽

pp. 1250018 ◽

Cited By ~ 29

Author(s):

M. EMRE CELEBI ◽

HASSAN A. KINGRAVI

Keyword(s):

Clustering Algorithm ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Partitional Clustering ◽

Highly Sensitive ◽

Data Points ◽

Initial Placement ◽

Random Initialization ◽

Common Deficiency

K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. Many of these methods, however, have superlinear complexity in the number of data points, making them impractical for large data sets. On the other hand, linear methods are often random and/or order-sensitive, which renders their results unrepeatable. Recently, Su and Dy proposed two highly successful hierarchical initialization methods named Var-Part and PCA-Part that are not only linear, but also deterministic (nonrandom) and order-invariant. In this paper, we propose a discriminant analysis based approach that addresses a common deficiency of these two methods. Experiments on a large and diverse collection of data sets from the UCI machine learning repository demonstrate that Var-Part and PCA-Part are highly competitive with one of the best random initialization methods to date, i.e. k-means++, and that the proposed approach significantly improves the performance of both hierarchical methods.
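A deterministic, order-invariant hierarchical initialization in the spirit of Var-Part can be sketched as follows. This is an assumption-laden illustration based on the usual description of Var-Part (repeatedly split the cluster with the largest sum of squared errors at its mean, along its highest-variance axis); it is neither Su and Dy's code nor the discriminant-analysis variant proposed in the paper, and `var_part` is a hypothetical name.

```python
def var_part(points, k):
    """Return up to k initial centers by repeated variance-based splitting."""

    def mean(cluster):
        return [sum(col) / len(cluster) for col in zip(*cluster)]

    def sse(cluster):
        m = mean(cluster)
        return sum((x - mj) ** 2 for p in cluster for x, mj in zip(p, m))

    clusters = [list(points)]
    while len(clusters) < k:
        # Pick the cluster with the largest sum of squared errors.
        idx = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        target = clusters[idx]
        m = mean(target)
        # Unnormalised variance along each coordinate axis.
        spread = [sum((p[d] - m[d]) ** 2 for p in target) for d in range(len(m))]
        d = max(range(len(m)), key=spread.__getitem__)
        if spread[d] == 0:
            break  # all remaining points coincide; nothing left to split
        # Split at the mean along the chosen axis; both halves are non-empty.
        left = [p for p in target if p[d] <= m[d]]
        right = [p for p in target if p[d] > m[d]]
        clusters.pop(idx)
        clusters += [left, right]
    return [mean(c) for c in clusters]
```

Because every step depends only on means and variances, the split points are identical for any ordering of the input and no random seed is involved, which is the property the abstract contrasts with random initialization.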

High performance spatial data mining for very large data-sets

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '03 ◽

10.1145/781498.781509 ◽

2003 ◽

Author(s):

Baris Kazar

Keyword(s):

Data Mining ◽

Spatial Data ◽

High Performance ◽

Large Data ◽

Spatial Data Mining ◽

Large Data Sets ◽

Data Sets

A GA-based clustering algorithm for large data sets with mixed and categorical values

Proceedings Fifth International Conference on Computational Intelligence and Multimedia Applications. ICCIMA 2003 ◽

10.1109/iccima.2003.1238108 ◽

2004 ◽

Cited By ~ 1

Author(s):

Li Jie ◽

Gao Xinbo ◽

Jiao Li-cheng

Keyword(s):

Clustering Algorithm ◽

Large Data ◽

Large Data Sets ◽

Data Sets

DBSCANI: Noise-Resistant Method for Missing Value Imputation

Journal of Intelligent Systems ◽

10.1515/jisys-2014-0172 ◽

2016 ◽

Vol 25 (3) ◽

pp. 431-440 ◽

Cited By ~ 1

Author(s):

Archana Purwar ◽

Sandeep Kumar Singh

Keyword(s):

Spatial Data ◽

Missing Values ◽

Clustering Algorithm ◽

Spatial Clustering ◽

Data Sets ◽

Quality Of Data ◽

Data Set ◽

Dbscan Clustering ◽

Density Based Clustering

The quality of data is an important concern in data mining. The validity of mining algorithms is reduced if the data is not of good quality. The quality of data can be assessed in terms of missing values (MV) as well as noise present in the data set. Various imputation techniques have been studied for MV, but little attention has been given to noise in earlier work. Moreover, to the best of our knowledge, no one has used density-based spatial clustering of applications with noise (DBSCAN) for MV imputation. This paper proposes a novel technique, density-based imputation (DBSCANI), built on density-based clustering to deal with incomplete values in the presence of noise. The density-based clustering algorithm proposed by Kriegel groups objects according to their density in spatial databases. The high-density regions are known as clusters, and the low-density regions correspond to the noise objects in the data set. Many experiments have been performed on the Iris data set from the life science domain and Jain's (2D) data set from the shape data sets. The performance of the proposed method is evaluated using root mean square error (RMSE) and compared with existing K-means imputation (KMI). Results show that our method is more noise resistant than KMI on the data sets used in this study.
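The density-based grouping that the abstract relies on can be sketched with a minimal DBSCAN. This shows only the clustering step, since the abstract does not say how DBSCANI derives imputed values from the clusters; `dbscan`, `eps`, and `min_pts` follow common usage rather than the paper, and the brute-force neighbor search is chosen for brevity.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [
            j
            for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2
        ]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(nbrs)
    return labels
```

Points left with the label -1 are the low-density noise objects; an imputation scheme built on this would presumably treat them differently from cluster members when estimating missing values.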
