Model-based clustering of high-dimensional binary data

2015, Vol. 87, pp. 84-101
Author(s): Yang Tang, Ryan P. Browne, Paul D. McNicholas
2013, Vol. 7 (3), pp. 281-300
Author(s): Anastasios Bellas, Charles Bouveyron, Marie Cottrell, Jérôme Lacaille

2019
Author(s): Siva Rajesh Kasa, Vaibhav Rajan

Abstract: We study two practically important cases of model-based clustering using Gaussian Mixture Models: (1) when there is misspecification and (2) on high-dimensional data, in the light of recent advances in Gradient Descent (GD)-based optimization using Automatic Differentiation (AD). Our simulation studies show that EM has better clustering performance, measured by the Adjusted Rand Index, than GD in cases of misspecification, whereas on high-dimensional data GD outperforms EM. We observe that, with both EM and GD, there are many solutions with high likelihood but poor cluster interpretation. To address this problem, we design a new penalty term for the likelihood based on the Kullback-Leibler divergence between pairs of fitted components. Closed-form expressions for the gradients of this penalized likelihood are difficult to derive, but AD handles them effortlessly, illustrating the advantage of AD-based optimization. Extensions of this penalty to high-dimensional data and to model selection are discussed. Numerical experiments on synthetic and real datasets demonstrate the efficacy of clustering using the proposed penalized likelihood approach.
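As a concrete illustration of the workflow this abstract describes, the sketch below builds a KL-penalized GMM objective and obtains its gradients through automatic differentiation with JAX. It is a minimal sketch, not the authors' code: the specific penalty form (a sum of exp(-KL) terms over component pairs, discouraging heavily overlapping components), the hyperparameter lam, and all function names are assumptions made for illustration.

```python
# Minimal sketch of AD-based optimization of a KL-penalized GMM likelihood.
# Assumption: the penalty sum_{i != j} exp(-KL_ij) is an illustrative choice,
# not necessarily the exact penalty used in the paper.
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp
from jax.scipy.stats import multivariate_normal

def kl_gaussians(mu1, cov1, mu2, cov2):
    """Closed-form KL( N(mu1, cov1) || N(mu2, cov2) )."""
    d = mu1.shape[0]
    cov2_inv = jnp.linalg.inv(cov2)
    diff = mu2 - mu1
    return 0.5 * (jnp.trace(cov2_inv @ cov1) + diff @ cov2_inv @ diff - d
                  + jnp.linalg.slogdet(cov2)[1] - jnp.linalg.slogdet(cov1)[1])

def penalized_neg_loglik(params, X, lam=1.0):
    mus, covs, logits = params                # (K, d), (K, d, d), (K,)
    weights = jax.nn.softmax(logits)          # mixture weights on the simplex
    # GMM log-likelihood via log-sum-exp over components
    comp_logpdf = jnp.stack([multivariate_normal.logpdf(X, m, c)
                             for m, c in zip(mus, covs)], axis=1)  # (n, K)
    loglik = jnp.sum(logsumexp(comp_logpdf + jnp.log(weights), axis=1))
    # Penalize component pairs whose KL divergence is small (heavy overlap)
    K = mus.shape[0]
    penalty = sum(jnp.exp(-kl_gaussians(mus[i], covs[i], mus[j], covs[j]))
                  for i in range(K) for j in range(K) if i != j)
    return -loglik + lam * penalty

# Gradients of the penalized objective come directly from AD; no closed-form
# derivation is needed.
grad_fn = jax.grad(penalized_neg_loglik)

X = jax.random.normal(jax.random.PRNGKey(0), (200, 2))    # toy data
params = (jnp.array([[0.0, 0.0], [1.0, 1.0]]),             # means
          jnp.stack([jnp.eye(2), jnp.eye(2)]),             # covariances
          jnp.zeros(2))                                     # weight logits
grads = grad_fn(params, X)
```

In a full implementation the covariances would be parameterized through Cholesky factors so that gradient updates keep them positive definite.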


The R Journal, 2017, Vol. 9 (1), p. 403
Author(s): Panagiotis Papastamoulis, Magnus Rattray

2014, Vol. 71, pp. 52-78
Author(s): Charles Bouveyron, Camille Brunet-Saumard

Biometrics, 2009, Vol. 66 (3), pp. 793-804
Author(s): Jian Guo, Elizaveta Levina, George Michailidis, Ji Zhu

Author(s): Siva Rajesh Kasa, Sakyajit Bhattacharya, Vaibhav Rajan

Abstract
Motivation: The identification of sub-populations of patients with similar characteristics, called patient subtyping, is important for realizing the goals of precision medicine. Accurate subtyping is crucial for tailoring therapeutic strategies that can potentially lead to reduced mortality and morbidity. Model-based clustering, such as Gaussian mixture models, provides a principled and interpretable methodology that is widely used to identify subtypes. However, such models impose identical marginal distributions on each variable; this assumption restricts their modeling flexibility and degrades clustering performance.
Results: In this paper, we use the statistical framework of copulas to decouple the modeling of marginals from the dependencies between them. Current copula-based methods cannot scale to high dimensions due to challenges in parameter inference. We develop HD-GMCM, which addresses these challenges and, to our knowledge, is the first copula-based clustering method that can fit high-dimensional data. Our experiments on real high-dimensional gene-expression and clinical datasets show that HD-GMCM outperforms state-of-the-art model-based clustering methods, by virtue of modeling non-Gaussian data and being robust to outliers through the use of Gaussian mixture copulas. We present a case study on lung cancer data from TCGA. Clusters obtained from HD-GMCM can be interpreted based on the dependencies they model, which offers a new way of characterizing subtypes. Empirically, such modeling not only uncovers latent structure that leads to better clustering but also yields meaningful clinical subtypes in terms of patient survival rates.
Availability and implementation: An implementation of HD-GMCM in R is available at: https://bitbucket.org/cdal/hdgmcm/.
Supplementary information: Supplementary data are available at Bioinformatics online.
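To make the copula idea concrete, here is a conceptual sketch, not the HD-GMCM implementation (which is the R package at the URL above) and not its high-dimensional inference machinery: each variable is mapped through its empirical CDF and the standard normal quantile function, which decouples the marginals from the dependence structure, and a Gaussian mixture is then fitted on the latent scores. All names and the toy data are illustrative assumptions.

```python
# Conceptual sketch of Gaussian-copula-style clustering: rank-transform each
# marginal to a latent Gaussian scale, then cluster in the latent space.
# This is NOT the HD-GMCM algorithm; it only illustrates the decoupling idea.
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

def to_latent_gaussian(X):
    """Map each column to approximately N(0, 1) scores via its empirical CDF."""
    n = X.shape[0]
    U = stats.rankdata(X, axis=0) / (n + 1.0)   # empirical CDF values in (0, 1)
    return stats.norm.ppf(U)                     # latent Gaussian scores

rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, size=(500, 10))         # skewed (non-Gaussian) toy data
Z = to_latent_gaussian(X)
labels = GaussianMixture(n_components=3, covariance_type="full",
                         random_state=0).fit_predict(Z)
```

The rank transform makes each marginal approximately Gaussian regardless of its original shape, so the mixture model only has to capture the dependence between variables; scaling this to genuinely high-dimensional data is what HD-GMCM's inference procedure addresses.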


2018, Vol. 18 (2), pp. 175-196
Author(s): Gertraud Malsiner-Walli, Daniela Pauger, Helga Wagner

Abstract: In social and economic studies, many of the collected variables are measured on a nominal scale, often with a large number of categories. The definition of categories can be ambiguous, and different classification schemes using either a finer or a coarser grid are possible. Categorization has an impact when such a variable is included as a covariate in a regression model: too fine a grid results in imprecise estimates of the corresponding effects, whereas too coarse a grid misses important effects, leading to biased effect estimates and poor predictive performance. To achieve an automatic grouping of the levels of a categorical covariate with essentially the same effect, we adopt a Bayesian approach and specify the prior on the level effects as a location mixture of spiky Normal components. Model-based clustering of the effects during MCMC sampling makes it possible to simultaneously detect categories that have essentially the same effect size and to identify variables with no effect at all. Fusion of level effects is induced by a prior on the mixture weights which encourages empty components. The properties of this approach are investigated in simulation studies. Finally, the method is applied to analyse the effects of high-dimensional categorical predictors on income in Austria.
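The sketch below illustrates the kind of prior this abstract describes: level effects are modeled a priori as draws from a location mixture of "spiky" (small-variance) Normal components, so that effects assigned to the same component are effectively fused. It is a minimal, assumption-laden illustration, not the authors' MCMC sampler; the locations, weights, and spike standard deviation are hypothetical values.

```python
# Minimal sketch of a location mixture of "spiky" Normal components as a prior
# on the level effects of a categorical covariate. Illustration only, not the
# authors' sampler; all numeric values are hypothetical.
import numpy as np
from scipy import stats

def spiky_mixture_logprior(effects, locations, weights, spike_sd=0.05):
    """log p(effects) under sum_k weights[k] * N(locations[k], spike_sd^2)."""
    dens = np.zeros_like(effects, dtype=float)
    for w, mu in zip(weights, locations):
        dens += w * stats.norm.pdf(effects, loc=mu, scale=spike_sd)
    return float(np.sum(np.log(dens)))

locations = np.array([0.0, 0.5, 1.0])   # candidate effect sizes; 0.0 plays the
                                        # role of a "no effect" component
weights = np.array([0.6, 0.3, 0.1])     # e.g. drawn from a sparse Dirichlet prior
effects = np.array([0.02, 0.48, 0.51, 0.97])   # level effects of one covariate
print(spiky_mixture_logprior(effects, locations, weights))
```

In the full Bayesian treatment the locations and weights are themselves parameters updated during MCMC sampling, and the sparse prior on the weights is what encourages empty components and hence fusion of levels.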


2013, Vol. 54 (1), pp. 196-215
Author(s): Leonard K.M. Poon, Nevin L. Zhang, Tengfei Liu, April H. Liu
