Significant DBSCAN+: Statistically Robust Density-based Clustering

2021 ◽  
Vol 12 (5) ◽  
pp. 1-26
Author(s):  
Yiqun Xie ◽  
Xiaowei Jia ◽  
Shashi Shekhar ◽  
Han Bao ◽  
Xun Zhou

Cluster detection is important and widely used in a variety of applications, including public health, public safety, and transportation. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. Clustering is a classical topic in data mining and learning, and a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single “hotspot” of over-density, and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed up the proposed approach. Experimental results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10–20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from thousands/hundreds of spurious detections to none or just a few across 100 datasets), and the acceleration methods can improve the efficiency for both clustered and non-clustered data.
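The idea of attaching statistical significance to density-based clusters can be illustrated with a generic Monte Carlo test. The sketch below is not the Significant DBSCAN+ algorithm from the abstract. It is a minimal illustration that assumes a null model of uniformly random points in the data's bounding box and uses the size of the largest DBSCAN cluster as the test statistic. The function names, the test statistic, and the default of 99 trials are all assumptions made for this example.

```python
import random
from math import dist


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN. Returns one label per point (-1 means noise)."""
    n = len(points)
    # Each neighborhood includes the point itself, as in the usual definition.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


def largest_cluster_size(points, eps, min_pts):
    """Test statistic (an assumption of this sketch): size of the largest cluster."""
    labels = [label for label in dbscan(points, eps, min_pts) if label != -1]
    return max((labels.count(c) for c in set(labels)), default=0)


def monte_carlo_p_value(points, eps, min_pts, trials=99, seed=0):
    """Estimate how often uniformly random data yields a cluster at least as large."""
    rng = random.Random(seed)
    observed = largest_cluster_size(points, eps, min_pts)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    at_least = 0
    for _ in range(trials):
        null_points = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
                       for _ in points]
        if largest_cluster_size(null_points, eps, min_pts) >= observed:
            at_least += 1
    # The +1 terms count the observed data as one of the samples.
    return (at_least + 1) / (trials + 1)
```

A small p-value suggests that the largest detected cluster is unlikely to arise from random scatter alone. A cluster found in data that is itself uniformly scattered would tend to receive a large p-value and could be discarded as spurious.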

2016 ◽  
Vol 13 (10) ◽  
pp. 6935-6943 ◽  
Author(s):  
Jia-Lin Hua ◽  
Jian Yu ◽  
Miin-Shen Yang

Mountains, which are heaped up from the densities of a data set, intuitively reflect the structure of the data points. Mountain clustering methods are therefore useful for grouping data points. However, previous mountain-based clustering methods suffer from the choice of the parameters used to compute the density. In this paper, we adopt correlation analysis to determine the density and propose a new clustering algorithm, called Correlative Density-based Clustering (CDC). The new algorithm computes the density in a modified way and determines the parameters based on the inherent structure of the data points. Experiments on artificial and real datasets demonstrate the simplicity and effectiveness of the proposed approach.
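The parameter sensitivity mentioned above can be seen in a generic kernel-style density of the kind that mountain-based methods build on. The sketch below is not the CDC algorithm. The Gaussian kernel form, the parameter name `alpha`, and both function names are assumptions chosen only to illustrate how a density "mountain" is computed over the data points.

```python
from math import dist, exp


def mountain_density(points, alpha):
    """Density at each data point: sum of Gaussian-style contributions from all points.

    `alpha` controls how quickly a point's influence decays with distance.
    It is the kind of user-chosen parameter the abstract refers to.
    """
    return [sum(exp(-alpha * dist(p, q) ** 2) for q in points) for p in points]


def highest_peak(points, alpha):
    """Return the data point with the largest density (the top of the mountain)."""
    densities = mountain_density(points, alpha)
    best = max(range(len(points)), key=densities.__getitem__)
    return points[best]
```

With a very large `alpha`, every point sees only itself and all densities collapse towards 1. With a very small `alpha`, all points look equally dense. Either extreme hides the structure, which is why the choice of this parameter matters.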


2021 ◽  
Vol 40 (6) ◽  
pp. 10781-10796
Author(s):  
Xin Yu ◽  
Feng Zeng ◽  
Deborah Simon Mwakapesa ◽  
Y.A. Nanehkaran ◽  
Yi-Min Mao ◽  
...  

The main goal of this paper is to design a density-based clustering algorithm using the weighted grid and information entropy based on MapReduce, denoted DBWGIE-MR, to address three problems of density-based clustering algorithms for big data: unreasonable division of the data grid, low accuracy of clustering results, and low efficiency of parallelization. The algorithm is implemented in three stages: data partitioning, local clustering, and global clustering. For each stage, we propose several strategies to improve the algorithm. In the first stage, based on the spatial distribution of data points, we propose an adaptive division strategy (ADG) to divide the grid adaptively. In the second stage, we design a weighted grid construction strategy (NE), which strengthens the relevance between grids to improve the accuracy of clustering. Meanwhile, based on the weighted grid and information entropy, we design a density calculation strategy (WGIE) to calculate the density of the grid. Finally, to improve the parallel efficiency, a core cluster computing algorithm based on MapReduce (COMCORE-MR) is proposed to compute the core clusters in parallel. In the third stage, based on the disjoint-set data structure, we propose a core cluster merging algorithm (MECORE) to speed up the convergence of merging local clusters. Furthermore, based on MapReduce, a parallel core cluster merging algorithm (MECORE-MR) is proposed to obtain the clustering results faster, which improves the core cluster merging efficiency of the density-based clustering algorithm. We conduct experiments on four synthetic datasets. Compared with H-DBSCAN, DBSCAN-MR and MR-VDBSCAN, the experimental results show that the DBWGIE-MR algorithm has higher stability and accuracy, and that it takes less time in parallel clustering.
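The role of a disjoint-set structure in merging local clusters can be sketched as follows. This is not the MECORE or MECORE-MR algorithm. It is a minimal illustration that assumes local clusters from different partitions should be merged whenever they share at least one point id. The names `DisjointSet` and `merge_local_clusters` and the input layout are hypothetical.

```python
class DisjointSet:
    """Union-find over arbitrary hashable items, with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def merge_local_clusters(local_clusters):
    """Merge local clusters that share a point.

    `local_clusters` maps a local cluster id to the set of point ids it contains
    (an assumed layout). Returns the merged clusters as sorted lists of point ids.
    """
    sets = DisjointSet()
    first_owner = {}  # point id -> first local cluster seen containing it
    for cluster_id, point_ids in local_clusters.items():
        sets.find(cluster_id)  # register the cluster even if it shares nothing
        for point in point_ids:
            if point in first_owner:
                sets.union(first_owner[point], cluster_id)
            else:
                first_owner[point] = cluster_id
    merged = {}
    for cluster_id, point_ids in local_clusters.items():
        merged.setdefault(sets.find(cluster_id), set()).update(point_ids)
    return sorted(sorted(points) for points in merged.values())
```

Because each union and find runs in near-constant amortized time, chains of overlapping local clusters are resolved in a single pass over the points rather than by repeated pairwise comparison of clusters.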


2013 ◽  
Vol 2013 ◽  
pp. 1-8 ◽  
Author(s):  
Amit Banerjee

Density-based clustering methods are known to be robust against outliers in data; however, they are sensitive to user-specified parameters, the selection of which is not trivial. Moreover, relational data clustering is an area that has received considerably less attention than object data clustering. In this paper, two approaches to robust density-based clustering for relational data using evolutionary computation are investigated.


2019 ◽  
Vol 11 (03n04) ◽  
pp. 1950006
Author(s):  
Hedi Xia ◽  
Hector D. Ceniceros

A new method for hierarchical clustering of data points is presented. It combines treelets, a particular multiresolution decomposition of data, with a mapping on a reproducing kernel Hilbert space. The proposed approach, called kernel treelets (KT), uses this mapping to go from a hierarchical clustering over attributes (the natural output of treelets) to a hierarchical clustering over data. KT effectively substitutes the correlation coefficient matrix used in treelets with a symmetric and positive semi-definite matrix efficiently constructed from a symmetric and positive semi-definite kernel function. Unlike most clustering methods, which require data sets to be numeric, KT can be applied to more general data and yields a multiresolution sequence of orthonormal bases on the data directly in feature space. The effectiveness and potential of KT in clustering analysis are illustrated with some examples.


Author(s):  
Shapol M. Mohammed ◽  
Karwan Jacksi ◽  
Subhi R. M. Zeebaree

Semantic similarity is the process of identifying data that are semantically relevant. The traditional way of identifying document similarity relies on synonymous keywords and syntax. In contrast, semantic similarity finds similar data using the meaning of words and semantics. Clustering groups objects that have the same features and properties into a cluster and separates them from objects that have different features and properties. In semantic document clustering, documents are clustered using semantic similarity techniques with similarity measurements. One common family of techniques for clustering documents is density-based clustering, which uses the density of data points as the main strategy for measuring the similarity between them. In this paper, a state-of-the-art survey is presented to analyze the density-based algorithms for clustering documents. Furthermore, the similarity and evaluation measures used with the selected algorithms are investigated to identify the common ones. The review revealed that the most used density-based algorithms in document clustering are DBSCAN and DPC. The most effective similarity measurement that has been used with density-based algorithms, specifically DBSCAN and DPC, is cosine similarity, with F-measure for performance and accuracy evaluation.
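The two measures named at the end of this abstract can be written down compactly. The sketch below assumes documents are represented as raw term-frequency vectors over whitespace-separated tokens, which is a simplification chosen for this example. The surveyed papers may use other representations such as TF-IDF or embeddings.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-frequency vectors of two documents."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def f_measure(predicted, relevant, beta=1.0):
    """F-measure of a predicted set of items against the relevant (true) set."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(relevant)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

A density-based algorithm such as DBSCAN would typically consume `1 - cosine_similarity(a, b)` as its distance, and `f_measure` would then compare each resulting cluster against a labeled reference group.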


Author(s):  
Mouhcine El Hassani ◽  
Noureddine Falih ◽  
Belaid Bouikhalene

Classification of information is a vague and difficult-to-explore area of research, hence the emergence of grouping techniques, often referred to as clustering. It is necessary to differentiate between unsupervised and supervised classification. Clustering methods are numerous. Data partitioning and hierarchization push to use them in parametric form or not. Their use is also influenced by algorithms of a probabilistic nature during the partitioning of data. The choice of a method depends on the clustering result that we want to obtain. This work focuses on classification using the density-based spatial clustering of applications with noise (DBSCAN) and DENsity-based CLUstEring (DENCLUE) algorithms through an application written in C#. Using three databases, namely the IRIS database, the Breast Cancer Wisconsin (Diagnostic) data set, and the Bank Marketing data set, we show experimentally that the choice of the initial data parameters is important for accelerating the processing and can minimize the number of iterations, reducing the execution time of the application.


TEM Journal ◽  
2020 ◽  
pp. 929-936
Author(s):  
Mochammad Haldi Widianto ◽  
Ivan Diryana Sudirman ◽  
Muhammad Hanif Awaluddin

Online platforms are used to find information, Twitter being one such medium. Natural disasters are highly damaging, so an application is needed to monitor natural disasters through Twitter. This research builds on the small number of studies that use clustering methods based on the density of Twitter user data. The availability of data in certain areas makes the data easy to group. The data are then grouped based on a high degree of similarity. One result of applying this method is the location of the disaster. NER-based rules are used to find the area of the disaster. Data accuracy testing is performed using the Silhouette coefficient.
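The Silhouette coefficient used for evaluation here can be computed directly from pairwise distances. The sketch below assumes Euclidean distance between coordinate tuples, for example tweet locations. The function name and input layout are chosen for this example and are not taken from the paper.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette value over all points, using Euclidean distance.

    For each point, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to any other cluster.
    Its silhouette is (b - a) / max(a, b).
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, point in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = sum(dist(point, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(point, points[j]) for j in members) / len(members)
                for label, members in clusters.items() if label != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

Values close to 1 indicate compact, well-separated clusters. Values near 0 indicate overlapping clusters, and negative values suggest that points have been assigned to the wrong cluster.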


2013 ◽  
Vol 6 (3) ◽  
pp. 441-448 ◽  
Author(s):  
Sajid Nagi ◽  
Dhruba Kumar Bhattacharyya ◽  
Jugal K. Kalita

When clustering high-dimensional data, traditional clustering methods are found to be lacking, since they consider all of the dimensions of the dataset in discovering clusters whereas only some of the dimensions are relevant. This may give rise to subspaces within the dataset where clusters may be found. Using feature selection, we can remove irrelevant and redundant dimensions by analyzing the entire dataset. The problem of automatically identifying clusters that exist in multiple and possibly overlapping subspaces of high-dimensional data, allowing better clustering of the data points, is known as subspace clustering. There are two major approaches to subspace clustering based on search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches start by finding low-dimensional dense regions and then use them to form clusters. Based on a survey of subspace clustering, we identify the challenges and issues involved in clustering gene expression data.

