A Data-Driven Parameter Adaptive Clustering Algorithm Based on Density Peak

Complexity ◽

10.1155/2018/5232543 ◽

2018 ◽

Vol 2018 ◽

pp. 1-14

Author(s):

Tao Du ◽

Shouning Qu ◽

Qin Wang

Keyword(s):

Time Complexity ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Training Data ◽

Data Driven ◽

Density Peak ◽

Data Set ◽

Adaptive Clustering ◽

Key Factor ◽

Parameter Adaptive

Clustering is an important unsupervised machine learning method which can efficiently partition points without training data set. However, most of the existing clustering algorithms need to set parameters artificially, and the results of clustering are much influenced by these parameters, so optimizing clustering parameters is a key factor of improving clustering performance. In this paper, we propose a parameter adaptive clustering algorithm DDPA-DP which is based on density-peak algorithm. In DDPA-DP, all parameters can be adaptively adjusted based on the data-driven thought, and then the accuracy of clustering is highly improved, and the time complexity is not increased obviously. To prove the performance of DDPA-DP, a series of experiments are designed with some artificial data sets and a real application data set, and the clustering results of DDPA-DP are compared with some typical algorithms by these experiments. Based on these results, the accuracy of DDPA-DP has obvious advantage of all, and its time complexity is close to classical DP-Clust.

Get full-text (via PubEx)

Urban green economic development indicators based on spatial clustering algorithm and blockchain

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189535 ◽

2020 ◽

pp. 1-12

Author(s):

Xiaoguang Gao

Keyword(s):

Development Strategy ◽

Clustering Algorithm ◽

Spatial Clustering ◽

Clustering Algorithms ◽

High Dimensional Data ◽

Large Data ◽

Experimental Comparison ◽

High Dimensional ◽

Density Peak ◽

Data Set

The unbalanced development strategy makes the regional development unbalanced. Therefore, in the development process, resources must be effectively utilized according to the level and characteristics of each region. Considering the resource and environmental constraints, this paper measures and analyzes China’s green economic efficiency and green total factor productivity. Moreover, by expounding the characteristics of high-dimensional data, this paper points out the problems of traditional clustering algorithms in high-dimensional data clustering. This paper proposes a density peak clustering algorithm based on sampling and residual squares, which is suitable for high-dimensional large data sets. The algorithm finds abnormal points and boundary points by identifying halo points, and finally determines clusters. In addition, from the experimental comparison on the data set, it can be seen that the improved algorithm is better than the DPC algorithm in both time complexity and clustering results. Finally, this article analyzes data based on actual cases. The research results show that the method proposed in this paper is effective.

Get full-text (via PubEx)

A Data-Driven Surrogate Approach for the Temporal Stability Forecasting of Vegetation Covered Dikes

Water ◽

10.3390/w13010107 ◽

2021 ◽

Vol 13 (1) ◽

pp. 107

Author(s):

Elahe Jamalinia ◽

Faraz S. Tehrani ◽

Susan C. Steele-Dunne ◽

Philip J. Vardon

Keyword(s):

Numerical Simulation ◽

Water Flux ◽

Temporal Stability ◽

Synthetic Data ◽

Climatic Conditions ◽

Training Data ◽

Data Driven ◽

Data Set ◽

Surface Cracking ◽

Real Time Analysis

Climatic conditions and vegetation cover influence water flux in a dike, and potentially the dike stability. A comprehensive numerical simulation is computationally too expensive to be used for the near real-time analysis of a dike network. Therefore, this study investigates a random forest (RF) regressor to build a data-driven surrogate for a numerical model to forecast the temporal macro-stability of dikes. To that end, daily inputs and outputs of a ten-year coupled numerical simulation of an idealised dike (2009–2019) are used to create a synthetic data set, comprising features that can be observed from a dike surface, with the calculated factor of safety (FoS) as the target variable. The data set before 2018 is split into training and testing sets to build and train the RF. The predicted FoS is strongly correlated with the numerical FoS for data that belong to the test set (before 2018). However, the trained model shows lower performance for data in the evaluation set (after 2018) if further surface cracking occurs. This proof-of-concept shows that a data-driven surrogate can be used to determine dike stability for conditions similar to the training data, which could be used to identify vulnerable locations in a dike network for further examination.

Get full-text (via PubEx)

An improved OPTICS clustering algorithm for discovering clusters with uneven densities

Intelligent Data Analysis ◽

10.3233/ida-205497 ◽

2021 ◽

Vol 25 (6) ◽

pp. 1453-1471

Author(s):

Chunhua Tang ◽

Han Wang ◽

Zhiwen Wang ◽

Xiangkun Zeng ◽

Huaran Yan ◽

...

Keyword(s):

Time Complexity ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Substantial Improvement ◽

Experimental Results ◽

High Time ◽

Parameter Setting ◽

K Nearest Neighbor ◽

Density Based Clustering

Most density-based clustering algorithms have the problems of difficult parameter setting, high time complexity, poor noise recognition, and weak clustering for datasets with uneven density. To solve these problems, this paper proposes FOP-OPTICS algorithm (Finding of the Ordering Peaks Based on OPTICS), which is a substantial improvement of OPTICS (Ordering Points To Identify the Clustering Structure). The proposed algorithm finds the demarcation point (DP) from the Augmented Cluster-Ordering generated by OPTICS and uses the reachability-distance of DP as the radius of neighborhood eps of its corresponding cluster. It overcomes the weakness of most algorithms in clustering datasets with uneven densities. By computing the distance of the k-nearest neighbor of each point, it reduces the time complexity of OPTICS; by calculating density-mutation points within the clusters, it can efficiently recognize noise. The experimental results show that FOP-OPTICS has the lowest time complexity, and outperforms other algorithms in parameter setting and noise recognition.

Get full-text (via PubEx)

A SELF-ORGANIZING MAP FOR MIXED CONTINUOUS AND CATEGORICAL DATA

International Journal of Computing ◽

10.47839/ijc.10.1.733 ◽

2011 ◽

pp. 24-32 ◽

Cited By ~ 1

Author(s):

Nicoleta Rogovschi ◽

Mustapha Lebbah ◽

Younès Bennani

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Mixed Data ◽

Categorical Variables ◽

Data Sets ◽

Self Organizing Map ◽

Data Set ◽

Public Data ◽

Self Organizing

Most traditional clustering algorithms are limited to handle data sets that contain either continuous or categorical variables. However data sets with mixed types of variables are commonly used in data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis and visualization mixed data (continuous/binary). The learning of weights and prototypes is done in a simultaneous manner assuring an optimized data clustering. More variables has a high weight, more the clustering algorithm will take into account the informations transmitted by these variables. The learning of these topological maps is combined with a weighting process of different variables by computing weights which influence the quality of clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, Zoo data set and other three mixed data sets. The results show a good quality of the topological ordering and homogenous clustering.

Get full-text (via PubEx)

Research on the Method of Coke Optical Tissue Segmentation Based on Adaptive Clustering

International Journal of Photoenergy ◽

10.1155/2021/4378823 ◽

2021 ◽

Vol 2021 ◽

pp. 1-16

Author(s):

Huaiguang Liu ◽

Liheng Zhang ◽

Shiyang Zhou ◽

Li Fang

Keyword(s):

Mahalanobis Distance ◽

Euclidean Distance ◽

Clustering Algorithm ◽

Adaptive Threshold ◽

Segmentation Method ◽

Tissue Segmentation ◽

Adaptive Clustering ◽

Morphological Filtering ◽

Key Factor ◽

The Relationship

The microstructure is the key factor for quality discriminate of coke. In view of the characteristics of coke optical tissue (COT), a segmentation method of coke microstructures based on adaptive clustering was proposed. According to the strategy of multiresolution, adaptive threshold binarization and morphological filtering were carried out on COT images with lower resolution. The contour of the COT body was detected through the relationship checking between contours in the binary image, and hence, COT pixels were picked out to cluster for tissue segmentation. In order to get the optimum segmentation for each tissue, an advanced K -means method with adaptive clustering centers was provided according to the Calinski-Harabasz score. Meanwhile, Euclidean distance was substituted with Mahalanobis distance between each pixel in HSV space to improve the accuracy. The experimental results show that compared with the traditional K -means algorithm, FCM algorithm, and Meanshift algorithm, the adaptive clustering algorithm proposed in this paper is more accurate in the segmentation of various tissue components in COT images, and the accuracy of tissue segmentation reaches 94.3500%.

Get full-text (via PubEx)

Clustering Bathymetric Data for Electronic Navigational Charts

Journal of Navigation ◽

10.1017/s0373463316000035 ◽

2016 ◽

Vol 69 (5) ◽

pp. 1143-1153 ◽

Cited By ~ 24

Author(s):

Marta Wlodarczyk–Sielicka ◽

Andrzej Stateczny

Keyword(s):

Clustering Algorithm ◽

Search Algorithm ◽

Clustering Algorithms ◽

Data Set ◽

Bathymetric Data ◽

Large Sets ◽

Analysis Of Results ◽

Comparison And Analysis ◽

Self Organising Map ◽

Source Of Information

An electronic navigational chart is a major source of information for the navigator. The component that contributes most significantly to the safety of navigation on water is the information on the depth of an area. For the purposes of this article, the authors use data obtained by the interferometric sonar GeoSwath Plus. The data were collected in the area of the Port of Szczecin. The samples constitute large sets of data. Data reduction is a procedure to reduce the size of a data set to make it easier and more effective to analyse. The main objective of the authors is the compilation of a new reduction algorithm for bathymetric data. The clustering of data is the first part of the search algorithm. The next step consists of generalisation of bathymetric data. This article presents a comparison and analysis of results of clustering bathymetric data using the following selected methods:K-means clustering algorithm, traditional hierarchical clustering algorithms and self-organising map (using artificial neural networks).

Get full-text (via PubEx)

Techniques for Fast Screening of 3D Heterogeneous Shale Barrier Configurations and Their Impacts on SAGD Chamber Development

SPE Journal ◽

10.2118/199906-pa ◽

2021 ◽

pp. 1-25

Author(s):

Chang Gao ◽

Juliana Y. Leung

Keyword(s):

Distance Measure ◽

Flow Simulation ◽

Training Data ◽

Distance Measures ◽

Data Driven ◽

Data Set ◽

Flow Simulations ◽

Steam Chamber ◽

Reservoir Models ◽

Tracking Model

Summary The steam-assisted gravity drainage (SAGD) recovery process is strongly impacted by the spatial distributions of heterogeneous shale barriers. Though detailed compositional flow simulators are available for SAGD recovery performance evaluation, the simulation process is usually quite computationally demanding, rendering their use over a large number of reservoir models for assessing the impacts of heterogeneity (uncertainties) to be impractical. In recent years, data-driven proxies have been widely proposed to reduce the computational effort; nevertheless, the proxy must be trained using a large data set consisting of many flow simulation cases that are ideally spanning the model parameter spaces. The question remains: is there a more efficient way to screen a large number of heterogeneous SAGD models? Such techniques could help to construct a training data set with less redundancy; they can also be used to quickly identify a subset of heterogeneous models for detailed flow simulation. In this work, we formulated two particular distance measures, flow-based and static-based, to quantify the similarity among a set of 3D heterogeneous SAGD models. First, to formulate the flow-based distance measure, a physics-basedparticle-tracking model is used: Darcy’s law and energy balance are integrated to mimic the steam chamber expansion process; steam particles that are located at the edge of the chamber would release their energy to the surrounding cold bitumen, while detailed fluid displacements are not explicitly simulated. The steam chamber evolution is modeled, and a flow-based distance between two given reservoir models is defined as the difference in their chamber sizes over time. Second, to formulate the static-based distance, the Hausdorff distance (Hausdorff 1914) is used: it is often used in image processing to compare two images according to their corresponding spatial arrangement and shapes of various objects. A suite of 3D models is constructed using representative petrophysical properties and operating constraints extracted from several pads in Suncor Energy’s Firebag project. The computed distance measures are used to partition the models into different groups. To establish a baseline for comparison, flow simulations are performed on these models to predict the actual chamber evolution and production profiles. The grouping results according to the proposed flow- and static-based distance measures match reasonably well to those obtained from detailed flow simulations. Significant improvement in computational efficiency is achieved with the proposed techniques. They can be used to efficiently screen a large number of reservoir models and facilitate the clustering of these models into groups with distinct shale heterogeneity characteristics. It presents a significant potential to be integrated with other data-driven approaches for reducing the computational load typically associated with detailed flow simulations involving multiple heterogeneous reservoir realizations.

Get full-text (via PubEx)

Data Fusion Using a Multi-Sensor Sparse-Based Clustering Algorithm

Remote Sensing ◽

10.3390/rs12234007 ◽

2020 ◽

Vol 12 (23) ◽

pp. 4007

Author(s):

Kasra Rafiezadeh Shahi ◽

Pedram Ghamisi ◽

Behnood Rasti ◽

Robert Jackisch ◽

Paul Scheunders ◽

...

Keyword(s):

Clustering Algorithm ◽

Spatial Information ◽

Clustering Algorithms ◽

Hyperspectral Data ◽

Sensor Data ◽

Data Sets ◽

Data Types ◽

Data Set ◽

Multiple Data Sets ◽

Imaging Sensors

The increasing amount of information acquired by imaging sensors in Earth Sciences results in the availability of a multitude of complementary data (e.g., spectral, spatial, elevation) for monitoring of the Earth’s surface. Many studies were devoted to investigating the usage of multi-sensor data sets in the performance of supervised learning-based approaches at various tasks (i.e., classification and regression) while unsupervised learning-based approaches have received less attention. In this paper, we propose a new approach to fuse multiple data sets from imaging sensors using a multi-sensor sparse-based clustering algorithm (Multi-SSC). A technique for the extraction of spatial features (i.e., morphological profiles (MPs) and invariant attribute profiles (IAPs)) is applied to high spatial-resolution data to derive the spatial and contextual information. This information is then fused with spectrally rich data such as multi- or hyperspectral data. In order to fuse multi-sensor data sets a hierarchical sparse subspace clustering approach is employed. More specifically, a lasso-based binary algorithm is used to fuse the spectral and spatial information prior to automatic clustering. The proposed framework ensures that the generated clustering map is smooth and preserves the spatial structures of the scene. In order to evaluate the generalization capability of the proposed approach, we investigate its performance not only on diverse scenes but also on different sensors and data types. The first two data sets are geological data sets, which consist of hyperspectral and RGB data. The third data set is the well-known benchmark Trento data set, including hyperspectral and LiDAR data. Experimental results indicate that this novel multi-sensor clustering algorithm can provide an accurate clustering map compared to the state-of-the-art sparse subspace-based clustering algorithms.

Get full-text (via PubEx)

A Dynamic Genetic Algorithm for Clustering Problems

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.411-414.1884 ◽

2013 ◽

Vol 411-414 ◽

pp. 1884-1893

Author(s):

Yong Chun Cao ◽

Ya Bin Shao ◽

Shuang Liang Tian ◽

Zheng Qi Cai

Keyword(s):

Genetic Algorithm ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Life ◽

Search Space ◽

Adaptive Mutation ◽

Data Sets ◽

Data Set ◽

Local Optima ◽

Clustering Problems

Due to many of the clustering algorithms based on GAs suffer from degeneracy and are easy to fall in local optima, a novel dynamic genetic algorithm for clustering problems (DGA) is proposed. The algorithm adopted the variable length coding to represent individuals and processed the parallel crossover operation in the subpopulation with individuals of the same length, which allows the DGA algorithm clustering to explore the search space more effectively and can automatically obtain the proper number of clusters and the proper partition from a given data set; the algorithm used the dynamic crossover probability and adaptive mutation probability, which prevented the dynamic clustering algorithm from getting stuck at a local optimal solution. The clustering results in the experiments on three artificial data sets and two real-life data sets show that the DGA algorithm derives better performance and higher accuracy on clustering problems.

Get full-text (via PubEx)

Soft Sensor Modeling Based on Radial Basis Function Neural Network and Fuzzy C-Means

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.219-220.1263 ◽

2011 ◽

Vol 219-220 ◽

pp. 1263-1266

Author(s):

Xi Huai Wang ◽

Jian Mei Xiao

Keyword(s):

Neural Network ◽

Fuzzy Clustering ◽

Clustering Algorithm ◽

Soft Sensor ◽

Fuzzy Cluster ◽

Training Data ◽

Temperature Estimation ◽

Data Set ◽

Slab Temperature ◽

Walking Beam Reheating Furnace

A neural network soft sensor based on fuzzy clustering is presented. The training data set is separated into several clusters with different centers, the number of fuzzy cluster is decided automatically, and the clustering centers are modified using an adaptive fuzzy clustering algorithm in the online stage. The proposed approach has been applied to the slab temperature estimation in a practical walking beam reheating furnace. Simulation results show that the approach is effective.

Get full-text (via PubEx)