Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Gregoire Preud’homme ◽  
Kevin Duarte ◽  
Kevin Dalleau ◽  
Claire Lacomblez ◽  
Emmanuel Bresso ◽  
...  

Abstract: The choice of the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. data with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables using 7 scenarios with varying population sizes, number of clusters, number of continuous and categorical variables, proportions of relevant (non-noisy) variables and degree of variable relevance (low, mild, high). Clustering methods were then applied to the EPHESUS randomized clinical trial data (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI compared to model-based methods in all scenarios. When applying clustering methods to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes typically performed best in the setting of heterogeneous data.
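
The Adjusted Rand Index used to score the simulated partitions can be sketched in a few lines. The benchmark itself relied on R tools; the Python below is only an illustrative stand-in written from the standard pair-counting definition of the ARI, and the function name `adjusted_rand_index` is an assumption made for this sketch, not something taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same items.

    Illustrative sketch only (hypothetical helper, not the paper's code).
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both label lists must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree about

    # Contingency table cells and its row/column margins.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs placed together in both / in each partition.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # best achievable agreement

    if max_index == expected:
        # Both partitions are one single cluster, or both are all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

With this sketch, two partitions that differ only in label names score 1, while a partition that cuts across every true cluster, such as `[0, 1, 0, 1]` against `[0, 0, 1, 1]`, scores about -0.5, i.e. worse than chance.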

2005 ◽  
Vol 2 (2) ◽  
Author(s):  
Matej Francetič ◽  
Mateja Nagode ◽  
Bojan Nastav

Clustering methods are among the most widely used methods in multivariate analysis. Two main groups of clustering methods can be distinguished: hierarchical and non-hierarchical. Due to the nature of the problem examined, this paper focuses on hierarchical methods such as the nearest neighbour, the furthest neighbour, Ward's method, between-groups linkage, within-groups linkage, centroid and median clustering. The goal is to assess the performance of different clustering methods on concave sets of data, and to determine in which types of data structures these methods can reveal and correctly assign group membership. The simulations were run in two- and three-dimensional space. Each of the two original shapes was further modified by using different standard deviations of points around the skeleton. In this manner, various shapes of sets with different inter-cluster distances were generated. Generating the data sets provides the essential knowledge of cluster membership for comparing the clustering methods' performance. The conclusions are important and interesting, since real-life data seldom follow a simple convex-shaped structure, but they need further work, such as a bootstrap application, the inclusion of dendrogram-based analysis, or other data structures. This paper can therefore serve as a basis for further study of hierarchical clustering performance with concave sets.
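
As a rough illustration of why the linkage rule matters on concave sets, the sketch below runs a naive agglomerative procedure with nearest-neighbour (single) or furthest-neighbour (complete) linkage. The two concentric rings are an assumed toy data set, not the shapes simulated in the paper, and the helper names `ring` and `agglomerate` are hypothetical.

```python
import math


def ring(radius, n_points):
    """Evenly spaced points on a circle (assumed toy concave shape)."""
    return [
        (radius * math.cos(2 * math.pi * i / n_points),
         radius * math.sin(2 * math.pi * i / n_points))
        for i in range(n_points)
    ]


def agglomerate(points, k, linkage="single"):
    """Naive agglomerative clustering down to k clusters.

    linkage="single" uses the nearest pair of points between two clusters,
    linkage="complete" uses the furthest pair. Returns one label per point.
    """
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    combine = min if linkage == "single" else max

    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        return combine(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the closest pair of clusters under the chosen linkage.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

On `ring(1.0, 8) + ring(3.0, 16)`, neighbouring points within each ring are closer together than the two rings are to each other, so single linkage with `k=2` keeps each ring in its own cluster. Running the same data with `linkage="complete"` is one way to see whether a compactness-seeking rule cuts across the rings instead.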


2011 ◽  
Vol 19 (2) ◽  
pp. 173-187 ◽  
Author(s):  
Drew A. Linzer

Contingency tables are among the most basic and useful techniques available for analyzing categorical data, but they produce highly imprecise estimates in small samples or for population subgroups that arise following repeated stratification. I demonstrate that preprocessing an observed set of categorical variables using a latent class model can greatly improve the quality of table-based inferences. As a density estimator, the latent class model closely approximates the underlying joint distribution of the variables of interest, which enables reliable estimation of conditional probabilities and marginal effects, even among subgroups containing fewer than 40 observations. Though here focused on applications to public opinion, the procedure has a wide range of potential uses. I illustrate the benefits of the latent class model-based approach for greatly improved accuracy in estimating and forecasting vote preferences within small demographic subgroups using survey data from the 2004 and 2008 U.S. presidential election campaigns.
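
To make the density-estimator idea concrete: under a latent class model, the probability of a response pattern is a weighted sum over classes of the product of per-variable response probabilities (local independence within a class). The sketch below evaluates that sum for already-fitted parameters. It does not fit the model, and the function name, variable names and numbers are all assumptions made for illustration.

```python
def lcm_joint_probability(pattern, class_weights, item_probs):
    """P(pattern) under a latent class model with known parameters.

    pattern       : tuple of observed categories, one per variable
    class_weights : class_weights[c] is the prior probability of class c
    item_probs    : item_probs[c][j][x] is P(variable j = x | class c)
    """
    total = 0.0
    for c, weight in enumerate(class_weights):
        prob = weight
        for j, category in enumerate(pattern):
            prob *= item_probs[c][j][category]  # local independence
        total += prob
    return total


# Hypothetical two-class model for two binary survey items.
weights = [0.6, 0.4]
probs = [
    [{"yes": 0.9, "no": 0.1}, {"yes": 0.8, "no": 0.2}],  # class 0
    [{"yes": 0.2, "no": 0.8}, {"yes": 0.3, "no": 0.7}],  # class 1
]
p_yes_yes = lcm_joint_probability(("yes", "yes"), weights, probs)
```

Because every cell of the implied contingency table gets a smoothed probability this way, conditional probabilities for small subgroups can be read from the model rather than from sparse raw counts.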


Author(s):  
Ioulia Papageorgiou

Quantitative Archaeology has developed rapidly in the past few decades, owing to the parallel development of methodologies in Physics, Chemistry and Geology that can be applied to archaeological findings to produce measurements on a number of variables. Those measurements form the data, the basis for a statistical analysis, which in turn can provide objective results and answers, within the prediction or estimation framework, about the archaeological findings. Initially, exploratory statistical analysis was used almost exclusively for analyzing such data, mainly because of its simplicity. The simplicity originates from the fact that exploratory techniques do not rely on any distributional assumption and conduct a non-parametric statistical analysis. However, recent developments in statistical methodology and computing software allow us to use more sophisticated statistical techniques and obtain more informative results. We explore and present applications of three such techniques: the finite mixture approach for model-based clustering, the latent class model, and the Bayesian mixture of normal distributions with an unknown number of components. All three methods can be used to identify sub-groups in the sample and to classify the items.


2021 ◽  
pp. 133-178
Author(s):  
Magy Seif El-Nasr ◽  
Truong Huy Nguyen Dinh ◽  
Alessandro Canossa ◽  
Anders Drachen

This chapter discusses different clustering methods and their application to game data. In particular, the chapter details K-means, Fuzzy C-Means, Hierarchical Clustering, Archetypical Analysis, and Model-based clustering techniques. It discusses the advantages and disadvantages of the different methods and when you may use one method vs. another. It also shows you ways to visualize the results to make sense of the resulting clusters, and includes details on how one would evaluate such clusters or go about applying the algorithms to a game dataset. The chapter includes labs to delve deeper into the application of these algorithms on real game data.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Maéva Kyheng ◽  
Génia Babykina ◽  
Camille Ternynck ◽  
David Devos ◽  
Julien Labreuche ◽  
...  

Abstract: Background: In many clinical applications, the evolution of a longitudinal marker is censored by the occurrence of an event and, symmetrically, event occurrence can be influenced by the evolution of the longitudinal marker. In such frameworks, joint modeling is of high interest. The Joint Latent Class Model (JLCM) makes it possible to stratify the population into groups (classes) of patients that are homogeneous with respect to both the evolution of a longitudinal marker and the occurrence of an event; this model is widely employed in real-life applications. However, the finite sample-size properties of this model remain poorly explored. Methods: In the present paper, a simulation study is carried out to assess the impact of the number of individuals, of the censoring rate and of the degree of class separation on the finite sample-size properties of the JLCM. A real-life application from the neurology domain is also presented. This study assesses the precision of class membership prediction and the impact of covariate omission on the model parameter estimates. Results: The simulation study reveals some departures from normality for the parameters of the survival sub-model. The censoring rate and the number of individuals affect the relative bias of parameters, especially when the classes are weakly distinguished. In the real-data application, the observed heterogeneity in individual profiles, in terms of longitudinal marker evolution and event occurrence, remains after adjusting for clinically relevant and available covariates. Conclusion: The JLCM properties have been evaluated. We have illustrated the discovery in practice and highlighted the usefulness of joint models with latent classes for this kind of data, even with pre-specified factors. We make some recommendations for the use of this model and for future research.


2020 ◽  
Vol 13 (3-4) ◽  
pp. 51-60
Author(s):  
Lisa B. Clark ◽  
Eduardo González ◽  
Annie L. Henry ◽  
Anna A. Sher

Abstract: Coupled human and natural systems (CHANS) are frequently represented by large datasets with varied data including continuous, ordinal, and categorical variables. Conventional multivariate analyses cannot handle these mixed data types. In this paper, our goal was to show how a clustering method that has not previously been applied to understanding the human dimension of CHANS, a Gower dissimilarity matrix with partitioning around medoids (PAM), can be used to treat mixed-type human datasets. A case study of land managers responsible for invasive plant control projects across rivers of the southwestern U.S. was used to characterize managers’ backgrounds and decisions, and project properties, through clustering. Results showed that managers could be classified as “federal multitaskers” or as “educated specialists”. Decisions were characterized as either “quick and active” or “thorough and careful”. Project goals were either comprehensive with ecological goals or more limited in scope. This study shows that clustering with Gower and PAM can simplify the complex human dimension of this system, demonstrating the utility of this approach for systems frequently composed of mixed-type data such as CHANS. This clustering approach can be used to direct scientific recommendations towards homogeneous groups of managers and project types.
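
The Gower-plus-medoids idea can be sketched as follows. This is a minimal illustration under stated assumptions: only continuous and categorical columns are handled (no ordinal treatment), every helper name is hypothetical, and the medoids are chosen by exhaustive search over all candidate sets, a simple stand-in that is only feasible for tiny inputs and is not the build-and-swap heuristic that PAM uses.

```python
from itertools import combinations


def gower_matrix(rows, numeric_cols):
    """Pairwise Gower dissimilarities for mixed-type records.

    rows         : list of equal-length tuples
    numeric_cols : set of column indices treated as continuous;
                   all other columns are treated as categorical
    """
    n_cols = len(rows[0])
    # Range of each continuous column, used to scale differences to [0, 1].
    ranges = {}
    for j in numeric_cols:
        values = [r[j] for r in rows]
        ranges[j] = max(values) - min(values)

    def dissimilarity(a, b):
        total = 0.0
        for j in range(n_cols):
            if j in numeric_cols:
                if ranges[j] > 0:
                    total += abs(a[j] - b[j]) / ranges[j]
            elif a[j] != b[j]:
                total += 1.0  # simple mismatch for categorical columns
        return total / n_cols

    return [[dissimilarity(a, b) for b in rows] for a in rows]


def best_medoids(dist, k):
    """Exhaustive k-medoids search (stand-in for PAM on tiny inputs)."""
    n = len(dist)

    def cost(medoids):
        return sum(min(dist[i][m] for m in medoids) for i in range(n))

    return min(combinations(range(n), k), key=cost)


def assign_to_medoids(dist, medoids):
    """Label each record with the index of its nearest medoid."""
    return [min(medoids, key=lambda m: dist[i][m]) for i in range(len(dist))]
```

For example, with made-up records such as `(2, "federal", "BSc")` and `(20, "state", "PhD")`, where column 0 is years of experience, `gower_matrix(rows, {0})` gives a matrix that `best_medoids` and `assign_to_medoids` can turn into groups of similar managers.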


2004 ◽  
Vol 12 (1) ◽  
pp. 3-27 ◽  
Author(s):  
Annabel Bolck ◽  
Marcel Croon ◽  
Jacques Hagenaars

We study the properties of a three-step approach to estimating the parameters of a latent structure model for categorical data and propose a simple correction for a common source of bias. Such models have a measurement part (essentially the latent class model) and a structural (causal) part (essentially a system of logit equations). In the three-step approach, a stand-alone measurement model is first defined and its parameters are estimated. Individual predicted scores on the latent variables are then computed from the parameter estimates of the measurement model and the individual observed scoring patterns on the indicators. Finally, these predicted scores are used in the causal part and treated as observed variables. We show that such a naive use of predicted latent scores cannot be recommended since it leads to a systematic underestimation of the strength of the association among the variables in the structural part of the models. However, a simple correction procedure can eliminate this systematic bias. This approach is illustrated on simulated and real data. A method that uses multiple imputation to account for the fact that the predicted latent variables are random variables can produce standard errors for the parameters in the structural part of the model.

