MODEL-BASED CLUSTERING WITH GENE RANKING USING PENALIZED MIXTURES OF HEAVY-TAILED DISTRIBUTIONS

2013 ◽  
Vol 11 (03) ◽  
pp. 1341007 ◽  
Author(s):  
ALBERTO COZZINI ◽  
AJAY JASRA ◽  
GIOVANNI MONTANA

Cluster analysis of biological samples using gene expression measurements is a common task that aids the discovery of heterogeneous biological sub-populations having distinct mRNA profiles. Several model-based clustering algorithms have been proposed in which the distribution of gene expression values within each sub-group is assumed to be Gaussian. In the presence of noise and extreme observations, a mixture of Gaussian densities may over-fit and overestimate the true number of clusters. Moreover, commonly used model-based clustering algorithms do not generally provide a mechanism to quantify the relative contribution of each gene to the final partitioning of the data. We propose a penalized mixture of Student's t distributions for model-based clustering and gene ranking. Together with a resampling procedure, the proposed approach provides a means for ranking genes according to their contributions to the clustering process. Experimental results show that the algorithm performs well compared with traditional Gaussian mixtures in the presence of outliers and longer-tailed distributions. The algorithm also identifies the true informative genes with high sensitivity and achieves improved model selection. An illustrative application to breast cancer data is also presented, which confirms established tumor sub-classes.
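To illustrate why a Student's t component tolerates extreme observations better than a Gaussian one, the following minimal Python sketch compares the two log-densities at an outlying value and computes the scale weight that EM for a univariate t model commonly assigns to each observation. It is not the penalized algorithm described in the abstract; the function names and the parameter values (`nu = 4`, unit scale) are assumptions chosen for illustration.

```python
import math


def student_t_logpdf(x, mu, sigma, nu):
    """Log-density of a univariate Student's t with location mu, scale sigma, nu degrees of freedom."""
    z = (x - mu) / sigma
    return (
        math.lgamma((nu + 1) / 2)
        - math.lgamma(nu / 2)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
        - (nu + 1) / 2 * math.log1p(z * z / nu)
    )


def gaussian_logpdf(x, mu, sigma):
    """Log-density of a univariate Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z


def outlier_weight(x, mu, sigma, nu):
    """Scale weight (nu + 1) / (nu + z^2) used in EM for a univariate t model.

    Observations far from mu receive a small weight, so they pull the
    location and scale estimates less than they would under a Gaussian.
    """
    z = (x - mu) / sigma
    return (nu + 1) / (nu + z * z)


# Hypothetical values: a typical observation and an extreme one.
typical, extreme = 0.0, 10.0
t_gap = student_t_logpdf(typical, 0.0, 1.0, 4) - student_t_logpdf(extreme, 0.0, 1.0, 4)
g_gap = gaussian_logpdf(typical, 0.0, 1.0) - gaussian_logpdf(extreme, 0.0, 1.0)
print("log-density drop under t:", t_gap)
print("log-density drop under Gaussian:", g_gap)
print("weights:", outlier_weight(typical, 0.0, 1.0, 4), outlier_weight(extreme, 0.0, 1.0, 4))
```

The drop in log-density at the extreme value is much smaller under the t component, which is one reason a t mixture is less tempted to open an extra cluster just to absorb outliers.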

2012 ◽  
Vol 28 (15) ◽  
pp. 2004-2007 ◽  
Author(s):  
M. Nascimento ◽  
T. Safadi ◽  
F. F. e. Silva ◽  
A. C. C. Nascimento

Author(s):  
Naohiko Kinoshita ◽  
Yasunori Endo ◽  
Akira Sugawara ◽  
...  

Clustering is a representative form of unsupervised classification. Many researchers have proposed clustering algorithms based on mathematical models, methods we call model-based clustering. Clustering techniques are very useful for determining data structures, but model-based clustering is difficult to apply correctly because a suitable method cannot be selected unless the data structure is at least partially known. The new clustering algorithm we propose introduces soft computing techniques such as fuzzy reasoning in what we call linguistic-based clustering, whose features do not depend on the data structure. We verify the method’s effectiveness through numerical examples.


2020 ◽  
Vol 13 (2) ◽  
pp. 178-187
Author(s):  
Farzane Ahmadi ◽  
Ali-Reza Abadi ◽  
Zahra Bazi ◽  
Abolfazl Movafagh

Background: Aging is an organized biological process regulated by highly interconnected pathways between different cells and tissues in the living organism. Identifying genes that behave similarly across tissues at different ages may help to uncover the general mechanism of aging or to support more effective therapeutic decisions. Objective: Given the wide application of model-based clustering techniques, the aim is to evaluate the performance of the Mixture of Multivariate Normal Distributions (MMNDs) as a valid method for clustering time series gene expression data, in comparison with the Mixture of Matrix-Variate Normal Distributions (MMVNDs). Methods: In this study, aging expression data from NCBI’s Gene Expression Omnibus were processed to obtain a suitable dataset. A set of common genes that were differentially expressed between different tissues was selected and then clustered using the two methods. Finally, the biological significance of the clusters was evaluated, based on their ability to find genes in the cell, using Enricher. Results: The MMVNDs was more efficient at finding co-expressed genes. Six clusters of genes were observed using the MMVNDs. According to the functional analysis, most genes in clusters 1-6 are related to the B-cell receptors and IgG immunoglobulin complex, the proliferating cell nuclear antigen complex, the metabolic pathways of iron, fat, and body mass control, the defense against bacteria, the incidence of cancer development, and chronic kidney failure, respectively. Conclusion: The results showed that most biological changes of aging between tissues are related to specific components of immune cells. In addition, applying MMVNDs can increase the ability to find similar genes.
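As a point of reference for the model-based clustering mentioned above, the following minimal Python sketch fits a two-component Gaussian mixture by EM. It is a deliberately simplified univariate illustration, not the multivariate or matrix-variate models compared in the study; the function name, the initialization at the data extremes, and the toy data are assumptions.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a univariate normal with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def fit_two_component_gmm(data, n_iter=50):
    """EM for a two-component univariate Gaussian mixture (illustrative only)."""
    n = len(data)
    # Assumed initialization: means at the extremes, a shared broad variance.
    mus = [min(data), max(data)]
    spread = max((max(data) - min(data)) ** 2 / 4, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)]
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # floor avoids a degenerate component
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return weights, mus, variances, labels


# Hypothetical expression values forming two well-separated groups.
data = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.2]
weights, mus, variances, labels = fit_two_component_gmm(data)
print("means:", mus)
print("labels:", labels)
```

A matrix-variate mixture replaces each univariate observation with a genes-by-conditions style matrix and each variance with separate row and column covariance matrices, but the alternation between responsibilities and parameter updates follows the same pattern.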


Author(s):  
Siva Rajesh Kasa ◽  
Sakyajit Bhattacharya ◽  
Vaibhav Rajan

Motivation: The identification of sub-populations of patients with similar characteristics, called patient subtyping, is important for realizing the goals of precision medicine. Accurate subtyping is crucial for tailoring therapeutic strategies that can potentially lead to reduced mortality and morbidity. Model-based clustering, such as Gaussian mixture models, provides a principled and interpretable methodology that is widely used to identify subtypes. However, such models impose identical marginal distributions on each variable; these assumptions restrict their modeling flexibility and deteriorate clustering performance. Results: In this paper, we use the statistical framework of copulas to decouple the modeling of marginals from the dependencies between them. Current copula-based methods cannot scale to high dimensions due to challenges in parameter inference. We develop HD-GMCM, which addresses these challenges and, to our knowledge, is the first copula-based clustering method that can fit high-dimensional data. Our experiments on real high-dimensional gene-expression and clinical datasets show that HD-GMCM outperforms state-of-the-art model-based clustering methods, by virtue of modeling non-Gaussian data and being robust to outliers through the use of Gaussian mixture copulas. We present a case study on lung cancer data from TCGA. Clusters obtained from HD-GMCM can be interpreted based on the dependencies they model, which offers a new way of characterizing subtypes. Empirically, such modeling uncovers not only latent structure that leads to better clustering but also meaningful clinical subtypes in terms of survival rates of patients. Availability and implementation: An implementation of HD-GMCM in R is available at: https://bitbucket.org/cdal/hdgmcm/. Supplementary information: Supplementary data are available at Bioinformatics online.
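The idea of decoupling marginals from dependencies can be illustrated with the rank transform that copula methods typically start from. The minimal Python sketch below converts each variable to pseudo-observations in (0, 1) and then to normal scores, so that only the ordering of the values matters and not their marginal shape. This is not the HD-GMCM algorithm and does not reflect its R implementation; the function names are assumptions, and the standard-normal mapping is a simplification chosen for illustration.

```python
import math
from statistics import NormalDist


def pseudo_observations(values):
    """Map values to rank / (n + 1), which lies strictly between 0 and 1.

    Ties are broken by original order, a simplification for this sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return [r / (n + 1) for r in ranks]


def normal_scores(values):
    """Push pseudo-observations through the standard normal quantile function."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(u) for u in pseudo_observations(values)]


# Hypothetical skewed measurements and a monotone transform of them.
raw = [1.0, 2.0, 3.0, 100.0]
logged = [math.log(v) for v in raw]
print(normal_scores(raw))
print(normal_scores(logged))
```

Because the scores depend only on ranks, any strictly increasing transformation of a variable, such as taking logarithms, leaves them unchanged; the dependence structure can then be modeled separately from the marginal distributions.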

