Gaussian mixture copulas for high-dimensional clustering and dependency-based subtyping

Author(s):  
Siva Rajesh Kasa ◽  
Sakyajit Bhattacharya ◽  
Vaibhav Rajan

Abstract. Motivation: The identification of sub-populations of patients with similar characteristics, called patient subtyping, is important for realizing the goals of precision medicine. Accurate subtyping is crucial for tailoring therapeutic strategies that can potentially lead to reduced mortality and morbidity. Model-based clustering, such as Gaussian mixture models, provides a principled and interpretable methodology that is widely used to identify subtypes. However, such models impose identical marginal distributions on each variable; this assumption restricts their modeling flexibility and degrades clustering performance. Results: In this paper, we use the statistical framework of copulas to decouple the modeling of marginals from the dependencies between them. Current copula-based methods cannot scale to high dimensions due to challenges in parameter inference. We develop HD-GMCM, which addresses these challenges and, to our knowledge, is the first copula-based clustering method that can fit high-dimensional data. Our experiments on real high-dimensional gene-expression and clinical datasets show that HD-GMCM outperforms state-of-the-art model-based clustering methods, by virtue of modeling non-Gaussian data and being robust to outliers through the use of Gaussian mixture copulas. We present a case study on lung cancer data from TCGA. Clusters obtained from HD-GMCM can be interpreted based on the dependencies they model, which offers a new way of characterizing subtypes. Empirically, such modeling uncovers not only latent structure that leads to better clustering but also meaningful clinical subtypes in terms of patient survival rates. Availability and implementation: An implementation of HD-GMCM in R is available at: https://bitbucket.org/cdal/hdgmcm/. Supplementary information: Supplementary data are available at Bioinformatics online.
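The idea of separating marginals from dependencies can be illustrated with a rank-based transform. The sketch below is not the HD-GMCM algorithm; it is a generic, Gaussian-copula-style "normal scores" step, and the function names pseudo_observations and normal_scores as well as the synthetic data are assumptions made for illustration. Each column is replaced by its scaled ranks, which discards the shape of the marginal while keeping the ordering that carries the dependence.

```python
from statistics import NormalDist

import numpy as np


def pseudo_observations(x):
    """Map each column into (0, 1) through its ranks: u = rank / (n + 1)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # argsort of argsort yields 0-based ranks within each column
    order = np.argsort(x, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable") + 1
    return ranks / (n + 1.0)


def normal_scores(u):
    """Apply the standard normal inverse CDF element-wise."""
    inv_cdf = np.vectorize(NormalDist().inv_cdf)
    return inv_cdf(u)


rng = np.random.default_rng(0)
# Two dependent variables with very different, non-Gaussian marginals
# (hypothetical data, not taken from any of the listed papers).
z = rng.normal(size=(500, 1))
data = np.hstack([np.exp(z), z**3 + 0.1 * rng.normal(size=(500, 1))])

u = pseudo_observations(data)  # marginal shapes removed, rank dependence kept
scores = normal_scores(u)      # values on a Gaussian scale
```

Because the transform is monotone within each column, the ordering of observations is unchanged; only the marginal shape is discarded.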

2011 ◽  
Vol 23 (6) ◽  
pp. 1605-1622 ◽  
Author(s):  
Lingyan Ruan ◽  
Ming Yuan ◽  
Hui Zou

Finite Gaussian mixture models are widely used in statistics thanks to their great flexibility. However, parameter estimation for Gaussian mixture models with high dimensionality can be challenging because of the large number of parameters that need to be estimated. In this letter, we propose a penalized likelihood estimator to address this difficulty. The L1-type penalty we impose on the inverse covariance matrices encourages sparsity in their entries and therefore helps to reduce the effective dimensionality of the problem. We show that the proposed estimate can be efficiently computed using an expectation-maximization algorithm. To illustrate the practical merits of the proposed method, we consider its applications in model-based clustering and mixture discriminant analysis. Numerical experiments with both simulated and real data show that the new method is a valuable tool for high-dimensional data analysis.
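A penalized mixture likelihood of this kind can be written down in a few lines. The sketch below only evaluates such an objective; it does not implement the authors' EM algorithm. It assumes an absolute-value (L1) penalty applied to the off-diagonal entries of each inverse covariance (precision) matrix; whether diagonal entries are also penalized is a modeling choice not stated in the abstract. The names gaussian_logpdf, penalized_loglik and lam are hypothetical.

```python
import numpy as np


def gaussian_logpdf(x, mean, precision):
    """Log-density of N(mean, precision^{-1}) at each row of x."""
    d = x.shape[1]
    diff = x - mean
    _, logdet = np.linalg.slogdet(precision)
    quad = np.einsum("ij,jk,ik->i", diff, precision, diff)
    return 0.5 * (logdet - d * np.log(2.0 * np.pi) - quad)


def penalized_loglik(x, weights, means, precisions, lam):
    """Mixture log-likelihood minus lam * sum of |off-diagonal precision entries|."""
    log_comp = np.stack(
        [np.log(w) + gaussian_logpdf(x, m, p)
         for w, m, p in zip(weights, means, precisions)],
        axis=1,
    )
    # log-sum-exp over components for numerical stability
    top = log_comp.max(axis=1, keepdims=True)
    loglik = np.sum(top[:, 0] + np.log(np.exp(log_comp - top).sum(axis=1)))
    # assumption: only off-diagonal entries are penalized
    penalty = sum(np.abs(p).sum() - np.abs(np.diag(p)).sum() for p in precisions)
    return loglik - lam * penalty


rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
weights = [0.5, 0.5]
means = [np.zeros(2), np.ones(2)]
precisions = [np.eye(2), np.array([[2.0, 0.5], [0.5, 2.0]])]

plain = penalized_loglik(x, weights, means, precisions, lam=0.0)
penalized = penalized_loglik(x, weights, means, precisions, lam=0.3)
```

Larger values of lam make dense precision matrices more costly, which is what pushes an estimator toward sparse solutions.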


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Joachim Ludwig ◽  
Christian Höner zu Siederdissen ◽  
Zishu Liu ◽  
Peter F. Stadler ◽  
Susann Müller

Abstract. Background: Flow cytometry (FCM) is a powerful single-cell based measurement method to ascertain multidimensional optical properties of millions of cells. FCM is widely used in medical diagnostics and health research. There is also a broad range of applications in the analysis of complex microbial communities. The main concern in microbial community analyses is to track the dynamics of microbial subcommunities. So far, this can be achieved with the help of time-consuming manual clustering procedures that require extensive user-dependent input. In addition, several tools have recently been developed by using different approaches which, however, focus mainly on the clustering of medical FCM data or of microbial samples with a well-known background, while much less work has been done on high-throughput, online algorithms for two-channel FCM. Results: We bridge this gap with flowEMMi, a model-based clustering tool based on multivariate Gaussian mixture models with subsampling and foreground/background separation. These extensions provide a fast and accurate identification of cell clusters in FCM data, in particular for microbial community FCM data that are often affected by irrelevant information like technical noise, beads or cell debris. flowEMMi outperforms other available tools with regard to running time and information content of the clustering results and provides near-online results and optional heuristics to reduce the running time further. Conclusions: flowEMMi is a useful tool for the automated cluster analysis of microbial FCM data. It overcomes the user-dependent and time-consuming manual clustering procedure and provides consistent results with ancillary information and statistical proof.


2019 ◽  
Author(s):  
Siva Rajesh Kasa ◽  
Vaibhav Rajan

Abstract. We study two practically important cases of model-based clustering using Gaussian Mixture Models: (1) when there is misspecification and (2) on high-dimensional data, in the light of recent advances in Gradient Descent (GD) based optimization using Automatic Differentiation (AD). Our simulation studies show that EM has better clustering performance, measured by Adjusted Rand Index, compared to GD in cases of misspecification, whereas on high-dimensional data GD outperforms EM. We observe that with both EM and GD there are many solutions with high likelihood but poor cluster interpretation. To address this problem, we design a new penalty term for the likelihood based on the Kullback-Leibler divergence between pairs of fitted components. Closed-form expressions for the gradients of this penalized likelihood are difficult to derive, but AD handles them with little effort, illustrating the advantage of AD-based optimization. Extensions of this penalty for high-dimensional data and for model selection are discussed. Numerical experiments on synthetic and real datasets demonstrate the efficacy of clustering using the proposed penalized likelihood approach.
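The building block of such a penalty, the Kullback-Leibler divergence between two Gaussian components, has a standard closed form. The sketch below computes it for every ordered pair of components. How the authors combine these pairwise values into their penalty is not given in the abstract, so that step is deliberately left out; the names kl_gaussian and pairwise_kl are hypothetical.

```python
import numpy as np


def kl_gaussian(m0, s0, m1, s1):
    """KL( N(m0, s0) || N(m1, s1) ) for mean vectors and covariance matrices."""
    d = m0.shape[0]
    s1_inv = np.linalg.inv(s1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(s0)
    _, logdet1 = np.linalg.slogdet(s1)
    return 0.5 * (
        np.trace(s1_inv @ s0) + diff @ s1_inv @ diff - d + logdet1 - logdet0
    )


def pairwise_kl(means, covs):
    """Matrix whose (i, j) entry is KL(component i || component j)."""
    k = len(means)
    out = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i != j:
                out[i, j] = kl_gaussian(means[i], covs[i], means[j], covs[j])
    return out


# Two unit-variance components in one dimension, one unit apart.
means = [np.array([0.0]), np.array([1.0])]
covs = [np.array([[1.0]]), np.array([[1.0]])]
kl = pairwise_kl(means, covs)
```

Written with an AD-capable array library instead of NumPy, the same expression could be differentiated with respect to the component parameters without deriving gradients by hand.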


2019 ◽  
Author(s):  
Joachim Ludwig ◽  
Christian Höner zu Siederdissen ◽  
Zishu Liu ◽  
Peter F Stadler ◽  
Susann Müller

Abstract. Background: Flow cytometry (FCM) is a powerful single-cell based measurement method to ascertain multidimensional optical properties of millions of cells. FCM is widely used in medical diagnostics and health research. There is also a broad range of applications in the analysis of complex microbial communities. The main concern in microbial community analyses is to track the dynamics of microbial subcommunities. So far, this can be achieved with the help of time-consuming manual clustering procedures that require extensive user-dependent input. In addition, several tools have recently been developed by using different approaches which, however, focus mainly on the clustering of medical FCM data or of microbial samples with a well-known background, while much less work has been done on high-throughput, online algorithms for two-channel FCM. Results: We bridge this gap with flowEMMi, a model-based clustering tool based on multivariate Gaussian mixture models with subsampling and foreground/background separation. These extensions provide a fast and accurate identification of cell clusters in FCM data, in particular for microbial community FCM data that are often affected by irrelevant information like technical noise, beads or cell debris. flowEMMi outperforms other available tools with regard to running time and information content of the clustering results and provides near-online results and optional heuristics to reduce the running time further. Conclusions: flowEMMi is a useful tool for the automated cluster analysis of microbial FCM data. It overcomes the user-dependent and time-consuming manual clustering procedure and provides consistent results with ancillary information and statistical proof.


2022 ◽  
Vol 32 (1) ◽  
pp. 361-375
Author(s):  
S. Markkandan ◽  
S. Sivasubramanian ◽  
Jaison Mulerikkal ◽  
Nazeer Shaik ◽  
Beulah Jackson ◽  
...  

2021 ◽  
Vol 10 (4) ◽  
pp. 2170-2180
Author(s):  
Untari N. Wisesty ◽  
Tati Rajab Mengko

This paper analyzes SARS-CoV-2 genome variation by comparing the results of genome clustering using several clustering algorithms and the distribution of sequences in each cluster. The clustering algorithms used are K-means, Gaussian mixture models, agglomerative hierarchical clustering, mean-shift clustering, and DBSCAN. However, these clustering algorithms have a weakness in grouping data with very high dimensions, such as genome data, so a dimensionality reduction process is needed. In this research, dimensionality reduction was carried out using principal component analysis (PCA) and an autoencoder method with three models that produce 2, 10, and 50 features. The main contributions are the dimensionality reduction and clustering scheme for SARS-CoV-2 sequence data and the performance analysis of each experiment on each scheme and of the hyperparameters for each method. Based on the experimental results, PCA combined with the DBSCAN algorithm achieves the highest silhouette score of 0.8770 with three clusters when using two features. However, dimensionality reduction using the autoencoder needs more iterations to converge. In testing with Indonesian sequence data, more than half of the sequences fall into one cluster and the rest are distributed across the other two clusters.
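The evaluation pipeline described here (reduce dimensions, cluster, score with the silhouette coefficient) can be sketched with NumPy alone. The example below is an illustration under stated assumptions, not the paper's code: the data are synthetic blobs rather than genome features, the cluster labels are taken as given instead of coming from DBSCAN, and the names pca_reduce and silhouette_score are hypothetical helpers written for this sketch.

```python
import numpy as np


def pca_reduce(x, n_components):
    """Project centered data onto its leading principal components via SVD."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def silhouette_score(x, labels):
    """Mean silhouette coefficient with Euclidean distance (needs >= 2 clusters)."""
    labels = np.asarray(labels)
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a singleton cluster keeps score 0 by convention
        # mean distance to the other members of the point's own cluster
        a = dist[i, same].sum() / (n_same - 1)
        # smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


rng = np.random.default_rng(0)
# Three well-separated synthetic groups in 20 dimensions (hypothetical data).
centers = np.zeros((3, 20))
centers[1, 0] = 10.0
centers[2, 1] = 10.0
labels = np.repeat(np.arange(3), 40)
x = centers[labels] + 0.5 * rng.normal(size=(120, 20))

reduced = pca_reduce(x, 2)                # two features, as in the best-scoring setup
score = silhouette_score(reduced, labels)
```

Scores close to 1 indicate compact, well-separated clusters; values near 0 indicate overlapping ones.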

