DCAD: a dual clustering algorithm for distributed spatial databases

Geographic outliers at GBIF (Global Biodiversity Information Facility) are a known problem. Outliers can be errors, coordinates with high uncertainty, or simply occurrences from an undersampled region. In data cleaning pipelines, outliers are often removed (even if they are legitimate points) because the researcher does not have time to verify each record one by one. Outlier points are usually occurrences that need attention. Currently, no outlier detection is implemented at GBIF, and it is up to users to flag outliers themselves.

DBSCAN (a density-based algorithm for discovering clusters in large spatial databases with noise) is a simple and popular clustering algorithm. It uses two parameters, (1) a distance and (2) a minimum number of points per cluster, to decide whether something is an outlier. Since occurrence data can be very patchy, non-clustering distance-based methods will often fail (Fig. 1). DBSCAN does not need to know the expected number of clusters in advance. It does well using only distance and does not require additional environmental variables such as Bioclim.

Advantages of DBSCAN: it is simple and easy to understand; it has only two parameters to set; it scales well; it needs no additional data sources; and users would understand how their data was changed.

Drawbacks: it only uses distance; parameter settings must be chosen; it is sensitive to sparse global sampling; it does not include any other relevant environmental information; and it can only flag outliers outside of a point blob.

Outlier detection and error detection are different. If your goal is to produce a system with no false positives, it will fail.
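The two-parameter rule described above can be illustrated with a small sketch. This is not the GBIF Spark/Scala job; it is a minimal pure-Python DBSCAN written for illustration, and the haversine distance, the function names, and the `eps_km` and `min_pts` values are all assumptions chosen for the example.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def dbscan(points, eps_km, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Brute-force neighbour lists (each point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if haversine_km(points[i], points[j]) <= eps_km]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels

# Hypothetical occurrences: five nearby records and one isolated record.
points = [(52.0, 5.0), (52.1, 5.1), (52.2, 4.9), (51.9, 5.2),
          (52.05, 4.95), (-30.0, 140.0)]
labels = dbscan(points, eps_km=100.0, min_pts=3)
outliers = [p for p, label in zip(points, labels) if label == -1]
print(outliers)
```

Points labelled -1 belong to no cluster and are the candidates to flag as outliers. Whether such a point is an actual error still has to be checked separately.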
While more complex, environmentally informed outlier detection methods (like reverse jackknifing (Chapman 2005)) might perform better for certain examples or even in general, DBSCAN performs adequately on almost everything despite being very simple. Currently I am using DBSCAN to find errors and assess dataset quality. It is a Spark job written in Scala (github). It does not run on species with many (>30K) unique latitude-longitude points, since the current implementation relies on an in-memory distance matrix. However, around 99% of species (plants, animals, fungi) on GBIF have fewer than 30K unique lat-long points (only 2,283 of 222,993 species keys exceed that limit). There are other implementations (example) that might scale to many more points. There are no immediate plans to include DBSCAN outliers as a data quality flag on GBIF, but it could be done fairly easily, since this type of method does not rely on any external environmental data sources and already runs on the GBIF cluster.
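A rough estimate shows why an in-memory distance matrix stops being practical at around this size. The sketch below assumes a dense, full n x n matrix of 8-byte floating-point values; the source does not say how the Spark job actually stores its distances, so the numbers are only indicative.

```python
def dense_matrix_gib(n_points, bytes_per_entry=8):
    """Memory in GiB for a full n x n distance matrix (assumed dense layout)."""
    return n_points * n_points * bytes_per_entry / 2**30

for n in (1_000, 10_000, 30_000, 100_000):
    print(f"{n:>7} points: {dense_matrix_gib(n):8.2f} GiB")
```

Because memory grows with the square of the number of unique points, tripling the point count makes the matrix roughly nine times larger.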


Hierarchical Clustering Algorithm Based on Neighborhood-Linked in Large Spatial Databases

Lecture Notes in Computer Science - Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing ◽ 10.1007/3-540-39205-x_102 ◽ 2007 ◽ pp. 619-622
Author(s): Yi-hong Dong
Keyword(s): Hierarchical Clustering ◽ Clustering Algorithm ◽ Spatial Databases ◽ Hierarchical Clustering Algorithm


A distribution-based clustering algorithm for mining in large spatial databases

Proceedings 14th International Conference on Data Engineering ◽ 10.1109/icde.1998.655795 ◽ 2002 ◽ Cited By ~ 29
Author(s): Xiaowei Xu ◽ M. Ester ◽ H.-P. Kriegel ◽ J. Sander
Keyword(s): Clustering Algorithm ◽ Spatial Databases


A Scalable Grid-Based Clustering Algorithm for Very Large Spatial Databases

2006 International Conference on Computational Intelligence and Security ◽ 10.1109/iccias.2006.294238 ◽ 2006
Author(s): Yufen Sun ◽ Yansheng Lu
Keyword(s): Clustering Algorithm ◽ Spatial Databases ◽ Grid Based


PROFILE BASED PROTECTION SCHEME AGAINST DDOS ATTACK IN WSN WITH DUAL CLUSTERING ALGORITHM

International Journal of Research in Engineering and Technology ◽ 10.15623/ijret.2015.0414008 ◽ 2015 ◽ Vol 04 (14) ◽ pp. 29-32
Author(s): Subhashini .
Keyword(s): Clustering Algorithm ◽ Protection Scheme ◽ Ddos Attack ◽ Dual Clustering


Towards Real-Time Geodemographics: Clustering Algorithm Performance for Large Multidimensional Spatial Databases

Transactions in GIS ◽ 10.1111/j.1467-9671.2010.01197.x ◽ 2010 ◽ Vol 14 (3) ◽ pp. 283-297 ◽ Cited By ~ 17
Author(s): Muhammad Adnan ◽ Paul A Longley ◽ Alex D Singleton ◽ Chris Brunsdon
Keyword(s): Real Time ◽ Clustering Algorithm ◽ Spatial Databases ◽ Algorithm Performance


A Fast Parallel Clustering Algorithm for Large Spatial Databases

High Performance Data Mining ◽ 10.1007/0-306-47011-x_3 ◽ 2006 ◽ pp. 263-290 ◽ Cited By ~ 23
Author(s): Xiaowei Xu ◽ Jochen Jäger ◽ Hans-Peter Kriegel
Keyword(s): Clustering Algorithm ◽ Spatial Databases ◽ Parallel Clustering


An adaptive dual clustering algorithm based on hierarchical structure: A case study of settlement zoning

Transactions in GIS ◽ 10.1111/tgis.12246 ◽ 2016 ◽ Vol 21 (5) ◽ pp. 916-933 ◽ Cited By ~ 4
Author(s): Yaolin Liu ◽ Xiaomi Wang ◽ Dianfeng Liu ◽ Leilei Liu
Keyword(s): Hierarchical Structure ◽ Clustering Algorithm ◽ Dual Clustering


An Optimized K-means with Density and Distance-Based Clustering Algorithm for Multidimensional Spatial Databases

International Journal of Computer Network and Information Security ◽ 10.5815/ijcnis.2021.06.06 ◽ 2021 ◽ Vol 13 (6) ◽ pp. 70-82
Author(s): K Laskhmaiah ◽ S Murali Krishna ◽ B Eswara Reddy
Keyword(s): Data Mining ◽ Clustering Algorithm ◽ Spatial Databases ◽ Spatial Database ◽ Spatial Data Mining ◽ Rand Index ◽ Spatial Distance ◽ Experimental Result ◽ Adjusted Rand Index ◽ Second Phase

Spatial data mining extracts useful information and knowledge from massive, complex spatial databases. Efficient clustering algorithms for spatial databases are widely used in this area of research, and many applications rely on clustering to discover the geographic areas that contain spatial points. Many approaches have been proposed for the spatial clustering problem with spatial attributes, but they do not consider non-overlapping constraints, and most existing data mining algorithms perform poorly in high dimensions. To address this, the proposed system introduces a multidimensional optimization clustering method, Non Overlapping Constraint based Optimized K-Means with Density and Distance-based Clustering (NOC-OKMDDC), which quickly finds clusters of diverse shapes and densities in spatial databases. The proposed method consists of three main phases. In the first phase, a weighted convolutional neural network (Weighted CNN) reduces the attributes of the multidimensional dataset. In the second phase, Optimized K-Means with Density and Distance-based Clustering (OKMDD) uses a partition-based algorithm (K-means) to divide the dataset into several relatively small spherical (ball-shaped) sub-clusters. The optimal number of sub-clusters is determined with the Adaptive Adjustment Factor based Glowworm Swarm Optimization algorithm (AAFGSO). The proposed system then defines an Enhanced Penalized Spatial Distance (EPSD) measure to satisfy the non-overlapping condition; EPSD adjusts the spatial distance between two points according to their spatial attribute values. In the third phase, the proposed system merges the sub-clusters using density-based clustering with a relative distance scheme. Experimental results show that the proposed system outperforms the existing system in terms of the adjusted Rand index, Rand index, Mirkin's index, and Hubert's index.
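Two of the evaluation measures named above, the Rand index and the adjusted Rand index, can be computed from pair counts. The sketch below uses the standard pair-counting definitions; it is not taken from the paper, and the function name and example labels are hypothetical.

```python
from collections import Counter
from math import comb

def rand_indices(labels_true, labels_pred):
    """Return (Rand index, adjusted Rand index) for two labelings of the same items."""
    n = len(labels_true)
    pairs = comb(n, 2)  # total number of item pairs
    # Pairs placed together in both labelings, in the reference, in the prediction.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Pairs separated in both labelings.
    apart_both = pairs - together_true - together_pred + together_both
    rand = (together_both + apart_both) / pairs
    # Adjusted Rand index: correct the pair agreement for chance.
    expected = together_true * together_pred / pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return rand, 1.0
    return rand, (together_both - expected) / (maximum - expected)

# Same partition with swapped cluster ids: both indices equal 1.0.
print(rand_indices([0, 0, 1, 1], [1, 1, 0, 0]))
```

Both measures ignore the cluster ids themselves and only compare which items are grouped together, so relabeling the clusters does not change the score.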
