Spectral Clustering of Mixed-Type Data

Stats ◽  
2021 ◽  
Vol 5 (1) ◽  
pp. 1-11
Author(s):  
Felix Mbuga ◽  
Cristina Tortora

Cluster analysis seeks to assign objects with similar characteristics to groups called clusters, so that objects within a group are similar to each other and dissimilar to objects in other groups. Spectral clustering has been shown to perform well in different scenarios on continuous data: it can detect convex and non-convex clusters, and can detect overlapping clusters. However, the restriction to continuous data can be limiting in real applications, where data are often of mixed type, i.e., they contain both continuous and categorical features. This paper looks at extending spectral clustering to mixed-type data. The new method replaces the Euclidean-based distance used in conventional spectral clustering with different dissimilarity measures for continuous and categorical variables. A global dissimilarity measure is then computed using a weighted sum, and a Gaussian kernel is used to convert the dissimilarity matrix into a similarity matrix. The new method includes automatic tuning of the variable weight and the kernel parameter. The performance of spectral clustering in different scenarios is compared with that of two state-of-the-art mixed-type data clustering methods, k-prototypes and KAMILA, using several simulated and real data sets.
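
The pipeline described in the abstract above (separate dissimilarities, a weighted sum, then a Gaussian kernel) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: Euclidean distance for the continuous part, simple matching for the categorical part, the function names `mixed_dissimilarity` and `similarity_matrix`, and hand-picked values for `weight` and `sigma` (the paper tunes both automatically). The resulting matrix would then be passed to an ordinary spectral clustering routine.

```python
import math

def mixed_dissimilarity(a_num, b_num, a_cat, b_cat, weight):
    """Weighted sum of a continuous and a categorical dissimilarity.

    Assumed for illustration only: Euclidean distance on the continuous
    features and the proportion of mismatches on the categorical ones.
    """
    d_num = math.sqrt(sum((x - y) ** 2 for x, y in zip(a_num, b_num)))
    d_cat = sum(x != y for x, y in zip(a_cat, b_cat)) / len(a_cat)
    return d_num + weight * d_cat

def similarity_matrix(nums, cats, weight=1.0, sigma=1.0):
    """Convert pairwise mixed dissimilarities to similarities
    with a Gaussian kernel exp(-d^2 / (2 * sigma^2))."""
    n = len(nums)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = mixed_dissimilarity(nums[i], nums[j], cats[i], cats[j], weight)
            S[i][j] = math.exp(-d ** 2 / (2 * sigma ** 2))
    return S

# Toy data (hypothetical): two nearby objects and one distant object.
nums = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1]]
cats = [["a"], ["a"], ["b"]]
S = similarity_matrix(nums, cats, weight=1.0, sigma=1.0)
```

In this toy example the first two objects receive a similarity close to 1, while the third is nearly disconnected from both, which is the structure a spectral method would then exploit.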

2020 ◽  
Vol 52 (8) ◽  
pp. 1035-1048
Author(s):  
H. Talebi ◽  
L. J. M. Peeters ◽  
U. Mueller ◽  
R. Tolosana-Delgado ◽  
K. G. van den Boogaart

Abstract The particularities of geosystems and geoscience data must be understood before any development or implementation of statistical learning algorithms. Without such knowledge, the predictions and inferences may not be accurate and physically consistent. Accuracy, transparency and interpretability, credibility, and physical realism are minimum criteria for statistical learning algorithms when applied to the geosciences. This study briefly reviews several characteristics of geoscience data and challenges for novel statistical learning algorithms. A novel spatial spectral clustering approach is introduced to illustrate how statistical learners can be adapted for modelling geoscience data. The spatial awareness and physical realism of the spectral clustering are improved by utilising a dissimilarity matrix based on nonparametric higher-order spatial statistics. The proposed model-free technique can identify meaningful spatial clusters (i.e. meaningful geographical subregions) from multivariate spatial data at different scales without the need to define a model of co-dependence. Several mixed (e.g. continuous and categorical) variables can be used as inputs to the proposed clustering technique. The proposed technique is illustrated using synthetic and real mining datasets. The results of the case studies confirm the usefulness of the proposed method for modelling spatial data.


2020 ◽  
Vol 13 (3-4) ◽  
pp. 51-60
Author(s):  
Lisa B. Clark ◽  
Eduardo González ◽  
Annie L. Henry ◽  
Anna A. Sher

Abstract Coupled human and natural systems (CHANS) are frequently represented by large datasets with varied data including continuous, ordinal, and categorical variables. Conventional multivariate analyses cannot handle these mixed data types. In this paper, our goal was to show how a clustering method that has not previously been applied to understanding the human dimension of CHANS, a Gower dissimilarity matrix with partitioning around medoids (PAM), can be used to treat mixed-type human datasets. A case study of land managers responsible for invasive plant control projects across rivers of the southwestern U.S. was used to characterize managers’ backgrounds and decisions, and project properties through clustering. Results showed that managers could be classified as “federal multitaskers” or as “educated specialists”. Decisions were characterized by being either “quick and active” or “thorough and careful”. Project goals were either comprehensive with ecological goals or more limited in scope. This study shows that clustering with Gower and PAM can simplify the complex human dimension of this system, demonstrating the utility of this approach for systems frequently composed of mixed-type data such as CHANS. This clustering approach can be used to direct scientific recommendations towards homogeneous groups of managers and project types.
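
The combination named in the abstract above, a Gower dissimilarity matrix followed by partitioning around medoids, can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: ordinal variables are omitted, the medoids are initialised naively with the first `k` objects instead of PAM's BUILD step, swaps are accepted on first improvement, and the toy records and the names `gower`, `total_cost`, and `pam` are hypothetical.

```python
def gower(a, b, ranges):
    """Gower dissimilarity: mean of per-feature dissimilarities in [0, 1].

    ranges[k] is the observed range of numeric feature k,
    or None when feature k is categorical.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0      # categorical: mismatch indicator
        else:
            total += abs(x - y) / r if r > 0 else 0.0  # numeric: range-scaled gap
    return total / len(ranges)

def total_cost(D, medoids):
    """Sum over objects of the dissimilarity to the nearest medoid."""
    return sum(min(D[i][m] for m in medoids) for i in range(len(D)))

def pam(D, k):
    """Simplified PAM swap phase on a precomputed dissimilarity matrix D."""
    n = len(D)
    medoids = list(range(k))  # naive initialisation (assumption)
    best = total_cost(D, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(D, trial)
                if cost < best - 1e-12:  # accept strictly better swaps only
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda c: D[i][medoids[c]]) for i in range(n)]
    return medoids, labels

# Hypothetical records: (years of experience, agency type).
rows = [(1.0, "federal"), (1.5, "federal"), (9.0, "state"), (9.5, "state")]
ranges = [8.5, None]  # numeric range for feature 0; feature 1 is categorical
D = [[gower(a, b, ranges) for b in rows] for a in rows]
medoids, labels = pam(D, k=2)
```

Because each cluster is represented by an actual record (its medoid), the groups can be described by inspecting those records, which suits the kind of labelled manager profiles the abstract reports.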

