DBWGIE-MR: A density-based clustering algorithm by using the weighted grid and information entropy based on MapReduce

2021 ◽  
Vol 40 (6) ◽  
pp. 10781-10796
Author(s):  
Xin Yu ◽  
Feng Zeng ◽  
Deborah Simon Mwakapesa ◽  
Y.A. Nanehkaran ◽  
Yi-Min Mao ◽  
...  

The main target of this paper is to design a density-based clustering algorithm using the weighted grid and information entropy based on MapReduce, denoted DBWGIE-MR, to deal with the problems of unreasonable grid partitioning of the data, low accuracy of clustering results, and low parallelization efficiency in density-based big data clustering algorithms. The algorithm is implemented in three stages: data partitioning, local clustering, and global clustering, and for each stage we propose several strategies to improve it. In the first stage, based on the spatial distribution of the data points, we propose an adaptive division strategy (ADG) to divide the grid adaptively. In the second stage, we design a weighted grid construction strategy (NE) that strengthens the relevance between grids to improve clustering accuracy. Meanwhile, based on the weighted grid and information entropy, we design a density calculation strategy (WGIE) to compute the density of each grid. Finally, to improve parallel efficiency, a MapReduce-based core-cluster computing algorithm (COMCORE-MR) is proposed to compute the core clusters of the clustering algorithm in parallel. In the third stage, based on a disjoint-set structure, we propose a core-cluster merging algorithm (MECORE) to speed up the convergence of merging local clusters. Furthermore, based on MapReduce, a parallel core-cluster merging algorithm (MECORE-MR) is proposed to obtain the clustering results faster, which improves the core-cluster merging efficiency of the density-based clustering algorithm. We conduct experiments on four synthetic datasets. Compared with H-DBSCAN, DBSCAN-MR and MR-VDBSCAN, the experimental results show that the DBWGIE-MR algorithm has higher stability and accuracy and takes less time in parallel clustering.
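
For illustration, here is a minimal Python sketch of the general idea of weighting a grid cell's density with information entropy. The cell size, the sub-cell split used to measure spread, and the weighting count * (1 + entropy) are assumptions for the sketch, not the paper's WGIE formulas.

```python
import numpy as np
from collections import Counter

def cell_of(p, size):
    """Map a point to its integer grid-cell index."""
    return tuple((p // size).astype(int))

def grid_entropy_density(points, cell_size):
    """Bin points into grid cells, then weight each cell's point count by the
    Shannon entropy of how its points spread over finer sub-cells (toy sketch)."""
    cells = Counter(cell_of(p, cell_size) for p in points)
    densities = {}
    for cell, count in cells.items():
        members = [p for p in points if cell_of(p, cell_size) == cell]
        sub = Counter(cell_of(p, cell_size / 2) for p in members)
        probs = np.array(list(sub.values()), dtype=float) / count
        entropy = -np.sum(probs * np.log2(probs))
        densities[cell] = count * (1.0 + entropy)  # entropy-weighted grid density
    return densities

pts = np.random.rand(1000, 2)
print(sorted(grid_entropy_density(pts, 0.25).values(), reverse=True)[:3])
```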

2007 ◽  
Vol 2007 ◽  
pp. 1-10
Author(s):  
Zhan Zhang ◽  
Yong Tang ◽  
Shigang Chen ◽  
Ying Jian

Unstructured peer-to-peer networks have gained a lot of popularity due to their resilience to network dynamics. The core operation in such networks is to efficiently locate resources. However, existing query schemes, for example flooding, random walks, and interest-based shortcuts, suffer from various problems in reducing communication overhead and in shortening response time. In this paper, we study the problems in the existing approaches and propose a new hybrid query scheme that mixes inter-cluster and intra-cluster queries. Specifically, the proposed scheme works by first locating the clusters that share similar interests using inter-cluster queries, and then exhaustively searching the nodes in the found clusters using intra-cluster queries. To facilitate the scheme, we propose a clustering algorithm to cluster nodes that share similar interests, and a labeling algorithm to explicitly capture the clusters in the underlying overlays. As demonstrated by extensive simulations, our new query scheme can improve the system performance significantly by achieving a better tradeoff among communication overhead, response time, and the ability to locate more resources.
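
The two-phase lookup can be sketched in a few lines of Python. All identifiers below (interest tags, cluster profiles, resource sets) are illustrative assumptions, not the paper's protocol or message format.

```python
def hybrid_query(interest, resource, cluster_interests, cluster_members, node_resources):
    """Toy sketch: inter-cluster step finds clusters sharing the query's interest,
    then the intra-cluster step exhaustively searches the nodes of those clusters."""
    matching = [c for c, tags in cluster_interests.items() if interest in tags]
    return [n for c in matching for n in cluster_members[c]
            if resource in node_resources.get(n, set())]

clusters = {"music": {"music", "audio"}, "video": {"video"}}
members = {"music": ["n1", "n2"], "video": ["n3"]}
resources = {"n1": {"song.mp3"}, "n2": {"mix.flac"}, "n3": {"clip.mp4"}}
print(hybrid_query("audio", "song.mp3", clusters, members, resources))  # ['n1']
```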


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Shuai Luo ◽  
Hongwei Liu ◽  
Ershi Qi

Purpose: The purpose of this paper is to recognize and label the faults in wind turbines with a new density-based clustering algorithm, named the contour density scanning clustering (CDSC) algorithm.
Design/methodology/approach: The algorithm includes four components: (1) computation of neighborhood density, (2) selection of core and noise data, (3) scanning of core data and (4) updating of clusters. The proposed algorithm considers the relationship between neighborhood data points according to a contour density scanning strategy.
Findings: The first experiment is conducted with artificial data to validate that the proposed CDSC algorithm is suitable for handling data points with arbitrary shapes. The second experiment, with industrial gearbox vibration data, is carried out to demonstrate the time complexity and accuracy of the proposed CDSC algorithm in comparison with other conventional clustering algorithms, including k-means, density-based spatial clustering of applications with noise, density peak clustering, neighborhood grid clustering, support vector clustering, random forest, core-fusion-based density peak clustering, AdaBoost and extreme gradient boosting. The third experiment is conducted with an industrial bearing vibration data set to highlight that the CDSC algorithm can automatically track the emerging fault patterns of bearings in wind turbines over time.
Originality/value: Data points with different densities are clustered using three strategies: direct density reachability, density reachability and density connectivity. A contour density scanning strategy is proposed to determine whether data points with the same density belong to one cluster. The proposed CDSC algorithm achieves automatic clustering, which means that the trends of the fault pattern can be tracked.
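
The contour-scanning details are specific to the paper, but components (1) and (2), and a generic expansion standing in for (3) and (4), can be sketched as follows. The DBSCAN-style expansion and the parameter values are assumptions used only to make the sketch runnable.

```python
import numpy as np

def neighborhood_density(X, eps):
    """Component (1): pairwise distances and neighbor counts within radius eps."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return (d <= eps).sum(axis=1) - 1, d

def cluster(X, eps=0.1, min_pts=5):
    """Components (2)-(4), sketched: mark core points, then grow clusters from
    them (a plain DBSCAN-style expansion stands in for contour scanning)."""
    density, d = neighborhood_density(X, eps)
    core = density >= min_pts
    labels = np.full(len(X), -1, dtype=int)   # -1 = noise / unassigned
    cid = 0
    for i in np.where(core)[0]:
        if labels[i] != -1:
            continue
        labels[i] = cid
        stack = [i]
        while stack:
            j = stack.pop()
            for k in np.where(d[j] <= eps)[0]:
                if labels[k] == -1:
                    labels[k] = cid
                    if core[k]:
                        stack.append(k)
        cid += 1
    return labels

X = np.vstack([np.random.randn(100, 2) * 0.05 + c for c in ([0, 0], [1, 1])])
print(np.unique(cluster(X), return_counts=True))
```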


2016 ◽  
Vol 13 (10) ◽  
pp. 6935-6943 ◽  
Author(s):  
Jia-Lin Hua ◽  
Jian Yu ◽  
Miin-Shen Yang

Mountains, which are heaped up according to the densities of a data set, intuitively reflect the structure of the data points. Mountain clustering methods are therefore useful for grouping data points. However, previous mountain-based clustering suffers from the choice of the parameters used to compute the density. In this paper, we adopt correlation analysis to determine the density and propose a new clustering algorithm, called Correlative Density-based Clustering (CDC). The new algorithm computes the density in a modified way and determines the parameters based on the inherent structure of the data points. Experiments on artificial and real datasets demonstrate the simplicity and effectiveness of the proposed approach.
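
As background, the classic mountain function estimates each candidate's density as a Gaussian-weighted sum over all points; the bandwidth sigma below is exactly the hand-tuned parameter that, per the abstract, CDC replaces with one derived from correlation analysis (that derivation is not reproduced here).

```python
import numpy as np

def mountain_density(X, sigma=0.5):
    """Classic mountain function: density of each point is a Gaussian-weighted
    sum of squared distances to all points (sigma is the manual parameter)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum(axis=1)

X = np.random.rand(200, 2)
peaks = np.argsort(mountain_density(X, sigma=0.2))[-3:]   # highest-density candidates
print(X[peaks])
```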


Author(s):  
Yatish H. R. ◽  
Shubham Milind Phal ◽  
Tanmay Sanjay Hukkeri ◽  
Lili Xu ◽  
Shobha G ◽  
...  

<span id="docs-internal-guid-919b015d-7fff-56da-f81d-8f032097bce2"><span>Dealing with large samples of unlabeled data is a key challenge in today’s world, especially in applications such as traffic pattern analysis and disaster management. DBSCAN, or density based spatial clustering of applications with noise, is a well-known density-based clustering algorithm. Its key strengths lie in its capability to detect outliers and handle arbitrarily shaped clusters. However, the algorithm, being fundamentally sequential in nature, proves expensive and time consuming when operated on extensively large data chunks. This paper thus presents a novel implementation of a parallel and distributed DBSCAN algorithm on the HPCC Systems platform. The algorithm seeks to fully parallelize the algorithm implementation by making use of HPCC Systems optimal distributed architecture and performing a tree-based union to merge local clusters. The proposed approach* was tested both on synthetic as well as standard datasets (MFCCs Data Set) and found to be completely accurate. Additionally, when compared against a single node setup, a significant decrease in computation time was observed with no impact to accuracy. The parallelized algorithm performed eight times better for higher number of data points and takes exponentially lesser time as the number of data points increases.</span></span>


Symmetry ◽  
2019 ◽  
Vol 11 (7) ◽  
pp. 859 ◽  
Author(s):  
Lin

The Density Peak Clustering (DPC) algorithm is a new density-based clustering method. It spends most of its execution time on calculating the local density and the separation distance for each data point in a dataset. The purpose of this study is to accelerate its computation. On average, the DPC algorithm scans half of the dataset to calculate the separation distance of each data point. We propose an approach to calculate the separation distance of a data point by scanning only the neighbors of the data point. Additionally, the purpose of the separation distance is to assist in choosing the density peaks, which are the data points with both high local density and high separation distance. We propose an approach to identify non-peak data points at an early stage to avoid calculating their separation distances. Our experimental results show that most of the data points in a dataset can benefit from the proposed approaches to accelerate the DPC algorithm.
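
For reference, the two DPC quantities mentioned above can be computed as follows: the local density rho (cut-off kernel) and the separation distance delta, defined as the distance to the nearest point of higher density. This baseline deliberately uses the full scan that the paper accelerates; the neighbor-only and early-rejection strategies are not reproduced.

```python
import numpy as np

def dpc_quantities(X, dc):
    """Baseline DPC: rho[i] = #points within cut-off dc; delta[i] = distance
    to the nearest point with higher density (global max for the densest point)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = (d < dc).sum(axis=1) - 1
    delta = np.full(len(X), d.max())
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        if higher.size:
            delta[i] = d[i, higher].min()   # the full scan the paper avoids
    return rho, delta

X = np.random.rand(300, 2)
rho, delta = dpc_quantities(X, dc=0.1)
print(np.argsort(rho * delta)[-3:])   # candidate density peaks
```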


Algorithms ◽  
2019 ◽  
Vol 12 (3) ◽  
pp. 60 ◽  
Author(s):  
David Pfander ◽  
Gregor Daiß ◽  
Dirk Pflüger

Clustering is an important task in data mining that has become more challenging due to the ever-increasing size of available datasets. To cope with these big data scenarios, a high-performance clustering approach is required. Sparse grid clustering is a density-based clustering method that uses a sparse grid density estimation as its central building block. The underlying density estimation approach enables the detection of clusters with non-convex shapes and without a predetermined number of clusters. In this work, we introduce a new distributed and performance-portable variant of the sparse grid clustering algorithm that is suited for big data settings. Our compute kernels were implemented in OpenCL to enable portability across a wide range of architectures. For distributed environments, we added a manager–worker scheme that was implemented using MPI. In experiments on two supercomputers, Piz Daint and Hazel Hen, with up to 100 million data points in a ten-dimensional dataset, we show the performance and scalability of our approach. The dataset with 100 million data points was clustered in 1198 s using 128 nodes of Piz Daint. This translates to an overall performance of 352 TFLOPS. On the node-level, we provide results for two GPUs, Nvidia’s Tesla P100 and the AMD FirePro W8100, and one processor-based platform that uses Intel Xeon E5-2680v3 processors. In these experiments, we achieved between 43% and 66% of the peak performance across all compute kernels and devices, demonstrating the performance portability of our approach.
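
The manager–worker distribution pattern can be sketched with mpi4py. This is only an illustrative skeleton (the actual system couples MPI with OpenCL compute kernels); the chunking, tags, and the stand-in "density kernel" are assumptions. Run with at least two MPI ranks, e.g. mpirun -n 4 python manager_worker.py.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
TAG_WORK, TAG_DONE = 1, 2

if rank == 0:                                            # manager
    chunks = np.array_split(np.random.rand(10000, 10), 4 * (size - 1))
    results, next_chunk, status = [], 0, MPI.Status()
    for w in range(1, size):                             # prime every worker
        comm.send(chunks[next_chunk], dest=w, tag=TAG_WORK); next_chunk += 1
    while len(results) < len(chunks):
        res = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        results.append(res)
        w = status.Get_source()
        if next_chunk < len(chunks):                     # hand out the next chunk
            comm.send(chunks[next_chunk], dest=w, tag=TAG_WORK); next_chunk += 1
        else:                                            # no work left: release worker
            comm.send(None, dest=w, tag=TAG_DONE)
    print(f"received {len(results)} partial results")
else:                                                    # worker
    status = MPI.Status()
    while True:
        chunk = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == TAG_DONE:
            break
        comm.send(float(chunk.sum()), dest=0, tag=TAG_WORK)  # stand-in for a density kernel
```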


2019 ◽  
Vol 17 (02) ◽  
pp. 1950016 ◽  
Author(s):  
Sanjay Chakraborty ◽  
Soharab Hossain Shaikh ◽  
Sudhindu Bikash Mandal ◽  
Ranjan Ghosh ◽  
Amlan Chakrabarti

Traditional machine learning shares several benefits with the quantum information processing field. The study of machine learning with quantum mechanics is called quantum machine learning. Data clustering is an important tool for machine learning, where quantum computing, with its inherent speed-up capability, plays a vital role. In this paper, a hybrid quantum algorithm for data clustering (quantum walk-based hybrid clustering (QWBHC)) is introduced, in which one-dimensional discrete-time quantum walks (DTQW) play the central role of updating the positions of data points according to their probability distributions. A quantum oracle is also designed; it is mainly implemented on a finite [Formula: see text]-regular bipartite graph where data points are initially distributed as a predefined set of clusters. An overview of a quantum walk (QW) based clustering algorithm on a 1D lattice structure is also introduced and described in this paper. In order to search for the nearest neighbors, a unitary and reversible DTQW gives a quadratic speed-up over the traditional classical random walk. This paper also compares the proposed hybrid quantum clustering algorithm with some state-of-the-art clustering algorithms in terms of clustering accuracy and time complexity. The proposed quantum oracle needs [Formula: see text] queries to mark the nearest data points among clusters and modify the existing clusters. Finally, the proposed QWBHC algorithm achieves [Formula: see text] performance.
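
The generic DTQW building block on a 1D lattice can be simulated classically in a few lines; the sketch below shows only a standard Hadamard-coin walk (the paper's oracle and cluster-update rules are not reproduced, and the symmetric initial state is an assumption).

```python
import numpy as np

def dtqw_1d(steps, n_positions):
    """One-dimensional discrete-time quantum walk with a Hadamard coin.
    Returns the position probability distribution after `steps` steps."""
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)          # Hadamard coin
    # amp[c, x]: amplitude for coin state c (0 = move left, 1 = move right) at position x
    amp = np.zeros((2, n_positions), dtype=complex)
    amp[:, n_positions // 2] = [1 / np.sqrt(2), 1j / np.sqrt(2)]  # symmetric start
    for _ in range(steps):
        amp = np.tensordot(H, amp, axes=([1], [0]))       # apply the coin
        amp[0] = np.roll(amp[0], -1)                      # coin 0 shifts left
        amp[1] = np.roll(amp[1], +1)                      # coin 1 shifts right
    return (np.abs(amp) ** 2).sum(axis=0)

probs = dtqw_1d(steps=50, n_positions=201)
print(probs.sum().round(6), probs.max().round(3))         # distribution stays normalized
```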


2021 ◽  
Vol 2021 ◽  
pp. 1-25
Author(s):  
Shuoben Bi ◽  
Ruizhuang Xu ◽  
Aili Liu ◽  
Luye Wang ◽  
Lei Wan

Because density-based clustering algorithms are sensitive to the input data, which limits their computing space and timeliness, a new method based on a grid information entropy clustering algorithm is proposed for mining the hotspots of taxi passengers. This paper selects representative geographical areas of Nanjing and Beijing as the research areas and uses information entropy and aggregation degree to analyze the distribution of passenger pickup points. The algorithm uses a grid instead of the original trajectory data to compute and mine taxi passenger hotspots. Through a comparative analysis of the taxi pickup-point data of Nanjing and Beijing, the experimental results are found to be consistent with the actual urban passenger hotspots, which verifies the effectiveness of the algorithm. The method overcomes the shortcomings of density-based clustering algorithms that are limited by computing space and poor timeliness, reduces the size of the data that need to be processed, and offers greater flexibility in processing and analyzing massive data. The research results can provide an important scientific basis for urban traffic guidance and urban management.
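
The key computational saving, working on grid cells rather than raw trajectory points, can be sketched as follows. The cell size and the simple count-based ranking are assumptions; the paper's information-entropy and aggregation-degree measures are not reproduced.

```python
import numpy as np
from collections import Counter

def pickups_to_grid(pickups, cell_deg=0.005):
    """Toy sketch: snap taxi pickup coordinates (lon, lat) to grid cells so that
    hotspot mining operates on cell counts instead of millions of GPS points."""
    cells = Counter((int(lon // cell_deg), int(lat // cell_deg)) for lon, lat in pickups)
    return cells.most_common(10)   # candidate hotspot cells by pickup count

pickups = np.random.rand(50000, 2) * [0.2, 0.1] + [118.7, 32.0]  # synthetic area
print(pickups_to_grid(pickups)[:3])
```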


2021 ◽  
Vol 12 (5) ◽  
pp. 1-26
Author(s):  
Yiqun Xie ◽  
Xiaowei Jia ◽  
Shashi Shekhar ◽  
Han Bao ◽  
Xun Zhou

Cluster detection is important and widely used in a variety of applications, including public health, public safety, transportation, and so on. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging, because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. As a classical topic in data mining and learning, a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single “hotspot” of over-density and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed up the proposed approach. Experimental results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10–20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from thousands/hundreds of spurious detections to none or just a few across 100 datasets), and the acceleration methods can improve the efficiency for both clustered and non-clustered data.
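
The general flavor of combining density-based clustering with statistical control can be illustrated with a generic Monte Carlo test: compare a cluster statistic on the data against the same statistic on uniformly random points. This is not Significant DBSCAN+'s actual test; the statistic, the uniform null model, and the parameter values are assumptions for the sketch.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def max_cluster_size(X, eps=0.05, min_samples=10):
    """Cluster statistic for the sketch: size of the largest DBSCAN cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return np.bincount(labels[labels >= 0]).max() if (labels >= 0).any() else 0

def monte_carlo_p_value(X, trials=99, seed=0):
    """Generic Monte Carlo significance idea: how often do random points in the
    same bounding box produce a cluster at least as large as the observed one?"""
    rng = np.random.default_rng(seed)
    observed = max_cluster_size(X)
    lo, hi = X.min(axis=0), X.max(axis=0)
    null = [max_cluster_size(rng.uniform(lo, hi, size=X.shape)) for _ in range(trials)]
    return (1 + sum(s >= observed for s in null)) / (trials + 1)

X = np.random.rand(500, 2)                  # no real clusters
print(monte_carlo_p_value(X, trials=19))    # expect a non-significant p-value
```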

