Gaussian mixture copulas for high-dimensional clustering and dependency-based subtyping

Bioinformatics ◽

10.1093/bioinformatics/btz599 ◽

2019 ◽

Cited By ~ 3

Author(s):

Siva Rajesh Kasa ◽

Sakyajit Bhattacharya ◽

Vaibhav Rajan

Keyword(s):

Survival Rates ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Supplementary Information ◽

High Dimensional ◽

Clustering Methods ◽

Mortality And Morbidity ◽

Model Based Clustering ◽

Cancer Data ◽

Model Based

Abstract Motivation The identification of sub-populations of patients with similar characteristics, called patient subtyping, is important for realizing the goals of precision medicine. Accurate subtyping is crucial for tailoring therapeutic strategies that can potentially lead to reduced mortality and morbidity. Model-based clustering methods, such as Gaussian mixture models, provide a principled and interpretable methodology that is widely used to identify subtypes. However, they impose identical marginal distributions on each variable; such assumptions restrict their modeling flexibility and deteriorate clustering performance. Results In this paper, we use the statistical framework of copulas to decouple the modeling of marginals from the dependencies between them. Current copula-based methods cannot scale to high dimensions due to challenges in parameter inference. We develop HD-GMCM, which addresses these challenges and, to our knowledge, is the first copula-based clustering method that can fit high-dimensional data. Our experiments on real high-dimensional gene-expression and clinical datasets show that HD-GMCM outperforms state-of-the-art model-based clustering methods, by virtue of modeling non-Gaussian data and being robust to outliers through the use of Gaussian mixture copulas. We present a case study on lung cancer data from TCGA. Clusters obtained from HD-GMCM can be interpreted based on the dependencies they model, which offers a new way of characterizing subtypes. Empirically, such modeling not only uncovers latent structure that leads to better clustering but also yields meaningful clinical subtypes in terms of survival rates of patients. Availability and implementation An implementation of HD-GMCM in R is available at: https://bitbucket.org/cdal/hdgmcm/. Supplementary information Supplementary data are available at Bioinformatics online.
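
The idea of decoupling marginals from dependencies can be illustrated with a rank transform, a common first step in copula modeling. The sketch below is not HD-GMCM and is not taken from the paper; the function name `pseudo_observations` and the toy data are hypothetical. It only shows that scaled ranks stay the same when a variable's marginal distribution is reshaped by a strictly increasing transform, which is why a model fitted to ranks captures dependence rather than marginal shape.

```python
import math


def pseudo_observations(values):
    """Map a sample to (0, 1) via scaled ranks, rank / (n + 1).

    Assumes there are no ties, to keep the sketch short.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return [r / (n + 1) for r in ranks]


# Hypothetical toy measurements for a single variable.
x = [0.3, -1.2, 2.5, 0.9, -0.4]
# A strictly increasing transform changes the marginal distribution ...
x_skewed = [math.exp(v) for v in x]

u = pseudo_observations(x)
u_skewed = pseudo_observations(x_skewed)
# ... but the rank-based pseudo-observations are identical.
print(u == u_skewed)
```

Applied column by column, a transform of this kind removes the marginal scale of each variable before any dependence structure is modeled.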

Regularized Parameter Estimation in High-Dimensional Gaussian Mixture Models

Neural Computation ◽

10.1162/neco_a_00128 ◽

2011 ◽

Vol 23 (6) ◽

pp. 1605-1622 ◽

Cited By ~ 12

Author(s):

Lingyan Ruan ◽

Ming Yuan ◽

Hui Zou

Keyword(s):

Parameter Estimation ◽

Mixture Models ◽

Gaussian Mixture Models ◽

Expectation Maximization Algorithm ◽

Real Data ◽

Gaussian Mixture ◽

High Dimensional ◽

Model Based Clustering ◽

Text Type ◽

Effective Dimensionality

Finite gaussian mixture models are widely used in statistics thanks to their great flexibility. However, parameter estimation for gaussian mixture models with high dimensionality can be challenging because of the large number of parameters that need to be estimated. In this letter, we propose a penalized likelihood estimator to address this difficulty. The [Formula: see text]-type penalty we impose on the inverse covariance matrices encourages sparsity on its entries and therefore helps to reduce the effective dimensionality of the problem. We show that the proposed estimate can be efficiently computed using an expectation-maximization algorithm. To illustrate the practical merits of the proposed method, we consider its applications in model-based clustering and mixture discriminant analysis. Numerical experiments with both simulated and real data show that the new method is a valuable tool for high-dimensional data analysis.

flowEMMi: an automated model-based clustering tool for microbial cytometric data

BMC Bioinformatics ◽

10.1186/s12859-019-3152-3 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 2

Author(s):

Joachim Ludwig ◽

Christian Höner zu Siederdissen ◽

Zishu Liu ◽

Peter F. Stadler ◽

Susann Müller

Keyword(s):

Microbial Community ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Medical Diagnostics ◽

Main Concern ◽

Accurate Identification ◽

Technical Noise ◽

Running Time ◽

Model Based Clustering ◽

Model Based

Abstract Background Flow cytometry (FCM) is a powerful single-cell based measurement method to ascertain multidimensional optical properties of millions of cells. FCM is widely used in medical diagnostics and health research. There is also a broad range of applications in the analysis of complex microbial communities. The main concern in microbial community analyses is to track the dynamics of microbial subcommunities. So far, this can be achieved with the help of time-consuming manual clustering procedures that require extensive user-dependent input. In addition, several tools have recently been developed by using different approaches which, however, focus mainly on the clustering of medical FCM data or of microbial samples with a well-known background, while much less work has been done on high-throughput, online algorithms for two-channel FCM. Results We bridge this gap with flowEMMi, a model-based clustering tool based on multivariate Gaussian mixture models with subsampling and foreground/background separation. These extensions provide a fast and accurate identification of cell clusters in FCM data, in particular for microbial community FCM data that are often affected by irrelevant information like technical noise, beads or cell debris. flowEMMi outperforms other available tools with regard to running time and information content of the clustering results and provides near-online results and optional heuristics to reduce the running-time further. Conclusions flowEMMi is a useful tool for the automated cluster analysis of microbial FCM data. It overcomes the user-dependent and time-consuming manual clustering procedure and provides consistent results with ancillary information and statistical proof.

Model-based Clustering using Automatic Differentiation: Confronting Misspecification and High-Dimensional Data

10.1101/2019.12.13.876326 ◽

2019 ◽

Author(s):

Siva Rajesh Kasa ◽

Vaibhav Rajan

Keyword(s):

Automatic Differentiation ◽

High Dimensional Data ◽

Penalized Likelihood ◽

Gaussian Mixture ◽

High Dimensional ◽

Adjusted Rand Index ◽

Penalty Term ◽

Model Based Clustering ◽

Model Based ◽

Leibler Divergence

Abstract We study two practically important cases of model based clustering using Gaussian Mixture Models: (1) when there is misspecification and (2) on high dimensional data, in the light of recent advances in Gradient Descent (GD) based optimization using Automatic Differentiation (AD). Our simulation studies show that EM has better clustering performance, measured by Adjusted Rand Index, compared to GD in cases of misspecification, whereas on high dimensional data GD outperforms EM. We observe that both with EM and GD there are many solutions with high likelihood but poor cluster interpretation. To address this problem we design a new penalty term for the likelihood based on the Kullback-Leibler divergence between pairs of fitted components. Closed form expressions for the gradients of this penalized likelihood are difficult to derive but AD can be done effortlessly, illustrating the advantage of AD-based optimization. Extensions of this penalty for high dimensional data and for model selection are discussed. Numerical experiments on synthetic and real datasets demonstrate the efficacy of clustering using the proposed penalized likelihood approach.
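
The building block of such a penalty, the Kullback-Leibler divergence between two Gaussian components, has a closed form. The sketch below covers only the univariate case and sums the divergence over all ordered pairs of components. The function names and the example components are hypothetical, and the abstract does not specify how the paper combines these terms with the likelihood, so this is an illustration of the quantity involved rather than the authors' penalty.

```python
import math
from itertools import permutations


def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) for p = N(mu_p, sigma_p^2) and q = N(mu_q, sigma_q^2)."""
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    )


def pairwise_kl_sum(components):
    """Sum KL(p || q) over all ordered pairs of distinct components.

    Each component is a (mean, standard deviation) tuple.
    """
    return sum(
        kl_gaussian(mu_p, sd_p, mu_q, sd_q)
        for (mu_p, sd_p), (mu_q, sd_q) in permutations(components, 2)
    )


# Hypothetical fitted components: (mean, standard deviation).
components = [(0.0, 1.0), (1.0, 1.0), (4.0, 2.0)]
print(kl_gaussian(0.0, 1.0, 1.0, 1.0))  # unit variances, means one apart
print(pairwise_kl_sum(components))
```

The divergence is zero for identical components and grows as components separate in mean or differ in spread, which is what makes it usable as a measure of how distinct fitted components are.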

Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models

Computational Statistics & Data Analysis ◽

10.1016/j.csda.2009.02.011 ◽

2010 ◽

Vol 54 (3) ◽

pp. 711-723 ◽

Cited By ~ 74

Author(s):

P.D. McNicholas ◽

T.B. Murphy ◽

A.F. McDaid ◽

D. Frost

Keyword(s):

Mixture Models ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Model Based Clustering ◽

Model Based ◽

Parallel Implementations

flowEMMi: An automated model-based clustering tool for microbial cytometric data

10.1101/667691 ◽

2019 ◽

Author(s):

Joachim Ludwig ◽

Christian Höner zu Siederdissen ◽

Zishu Liu ◽

Peter F Stadler ◽

Susann Müller

Keyword(s):

Microbial Community ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Medical Diagnostics ◽

Main Concern ◽

Accurate Identification ◽

Technical Noise ◽

Running Time ◽

Model Based Clustering ◽

Model Based

Abstract Background Flow cytometry (FCM) is a powerful single-cell based measurement method to ascertain multidimensional optical properties of millions of cells. FCM is widely used in medical diagnostics and health research. There is also a broad range of applications in the analysis of complex microbial communities. The main concern in microbial community analyses is to track the dynamics of microbial subcommunities. So far, this can be achieved with the help of time-consuming manual clustering procedures that require extensive user-dependent input. In addition, several tools have recently been developed by using different approaches which, however, focus mainly on the clustering of medical FCM data or of microbial samples with a well-known background, while much less work has been done on high-throughput, online algorithms for two-channel FCM. Results We bridge this gap with flowEMMi, a model-based clustering tool based on multivariate Gaussian mixture models with subsampling and foreground/background separation. These extensions provide a fast and accurate identification of cell clusters in FCM data, in particular for microbial community FCM data that are often affected by irrelevant information like technical noise, beads or cell debris. flowEMMi outperforms other available tools with regard to running time and information content of the clustering results and provides near-online results and optional heuristics to reduce the running-time further. Conclusions flowEMMi is a useful tool for the automated cluster analysis of microbial FCM data. It overcomes the user-dependent and time-consuming manual clustering procedure and provides consistent results with ancillary information and statistical proof.

Model Based Clustering of Audio Clips Using Gaussian Mixture Models

2009 Seventh International Conference on Advances in Pattern Recognition ◽

10.1109/icapr.2009.92 ◽

2009 ◽

Cited By ~ 2

Author(s):

S. Chandrakala ◽

C. Chandra Sekhar

Keyword(s):

Mixture Models ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Model Based Clustering ◽

Model Based

Model-based clustering of microarray expression data via latent Gaussian mixture models

Bioinformatics ◽

10.1093/bioinformatics/btq498 ◽

2010 ◽

Vol 26 (21) ◽

pp. 2705-2712 ◽

Cited By ~ 112

Author(s):

P. D. McNicholas ◽

T. B. Murphy

Keyword(s):

Mixture Models ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Expression Data ◽

Model Based Clustering ◽

Microarray Expression Data ◽

Model Based ◽

Microarray Expression

Massive MIMO Codebook Design Using Gaussian Mixture Model Based Clustering

Intelligent Automation & Soft Computing ◽

10.32604/iasc.2022.021779 ◽

2022 ◽

Vol 32 (1) ◽

pp. 361-375

Author(s):

S. Markkandan ◽

S. Sivasubramanian ◽

Jaison Mulerikkal ◽

Nazeer Shaik ◽

Beulah Jackson ◽

...

Keyword(s):

Gaussian Mixture Model ◽

Mixture Model ◽

Massive Mimo ◽

Gaussian Mixture ◽

Model Based Clustering ◽

Codebook Design ◽

Model Based

Comparison of dimensionality reduction and clustering methods for SARS-CoV-2 genome

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v10i4.2803 ◽

2021 ◽

Vol 10 (4) ◽

pp. 2170-2180

Author(s):

Untari N. Wisesty ◽

Tati Rajab Mengko

Keyword(s):

Dimensionality Reduction ◽

Dimensional Reduction ◽

Clustering Algorithm ◽

Sequence Data ◽

Clustering Algorithms ◽

Gaussian Mixture Models ◽

Reduction Process ◽

Principal Component ◽

Gaussian Mixture ◽

Clustering Methods

This paper analyzes SARS-CoV-2 genome variation by comparing the results of genome clustering using several clustering algorithms and the distribution of sequences in each cluster. The clustering algorithms used are K-means, Gaussian mixture models, agglomerative hierarchical clustering, mean-shift clustering, and DBSCAN. However, these clustering algorithms have difficulty grouping very high-dimensional data such as genome data, so a dimensionality reduction step is needed. In this research, dimensionality reduction was carried out using principal component analysis (PCA) and an autoencoder, with three models that produce 2, 10, and 50 features. The main contributions are the dimensionality reduction and clustering scheme for SARS-CoV-2 sequence data and the performance analysis of each experiment for each scheme and the hyperparameters of each method. Based on the experimental results, PCA combined with DBSCAN achieves the highest silhouette score of 0.8770 with three clusters when using two features. However, dimensionality reduction using the autoencoder needs more iterations to converge. In testing with Indonesian sequence data, more than half of the sequences fall into one cluster and the rest are distributed across the other two clusters.
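
The silhouette score used to compare the schemes can be computed directly from its definition. The sketch below is a minimal pure-Python version using Euclidean distance; the function name, the toy points and the labels are hypothetical and unrelated to the paper's data, and the paper's own implementation is not stated in the abstract.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance from p to the other members of its own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance from p to the members of another cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical two-feature data: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(round(score, 4))
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.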

Identification of typical building daily electricity usage profiles using Gaussian mixture model-based clustering and hierarchical clustering

Applied Energy ◽

10.1016/j.apenergy.2018.09.050 ◽

2018 ◽

Vol 231 ◽

pp. 331-342 ◽

Cited By ~ 23

Author(s):

Kehua Li ◽

Zhenjun Ma ◽

Duane Robinson ◽

Jun Ma

Keyword(s):

Gaussian Mixture Model ◽

Mixture Model ◽

Hierarchical Clustering ◽

Gaussian Mixture ◽

Model Based Clustering ◽

Model Based
