Multi-Partitions Subspace Clustering

Mathematics ◽  
2020 ◽  
Vol 8 (4) ◽  
pp. 597 ◽  
Author(s):  
Vincent Vandewalle

In model-based clustering, it is often assumed that a single latent clustering variable explains the heterogeneity of the whole dataset. However, in many cases several latent variables could explain the heterogeneity of the data at hand. Finding such class variables could result in a richer interpretation of the data. In the continuous data setting, a multi-partition model-based clustering method is proposed. It assumes the existence of several latent clustering variables, each one explaining the heterogeneity of the data with respect to some clustering subspace. It allows the multi-partitions and the related subspaces to be found simultaneously. Parameters of the model are estimated through an EM algorithm relying on a probabilistic reinterpretation of factorial discriminant analysis. A model choice strategy relying on the BIC criterion is proposed to select the number of subspaces and the number of clusters per subspace. The results obtained are thus several projections of the data, each one conveying its own clustering of the data. The model's behavior is illustrated on simulated and real data.
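
As a rough illustration of the BIC-based model choice mentioned above, the sketch below scores a few candidate configurations and keeps the one with the lowest BIC. It assumes the common "smaller is better" convention, BIC = -2 log L + k log n. The candidate log-likelihoods, parameter counts, and dictionary keys are hypothetical placeholders, not values or names from the paper, and the paper's own parameter counting is not reproduced here.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """BIC in the 'smaller is better' convention: -2 log L + k log n."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)


def select_model(candidates, n_samples):
    """Return the candidate dict with the lowest BIC."""
    return min(
        candidates,
        key=lambda c: bic(c["log_likelihood"], c["n_params"], n_samples),
    )


# Hypothetical fitted candidates (illustrative numbers only).
# n_clusters holds one cluster count per subspace.
candidates = [
    {"n_subspaces": 1, "n_clusters": (3,), "log_likelihood": -520.0, "n_params": 14},
    {"n_subspaces": 2, "n_clusters": (2, 2), "log_likelihood": -498.0, "n_params": 19},
    {"n_subspaces": 2, "n_clusters": (3, 3), "log_likelihood": -495.0, "n_params": 27},
]

best = select_model(candidates, n_samples=200)
print(best["n_subspaces"], best["n_clusters"])
```

In this toy comparison the richest model has the highest likelihood but is penalized for its extra parameters, which is the trade-off a BIC-based choice is meant to capture.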

2016 ◽  
Vol 2016 ◽  
pp. 1-13 ◽  
Author(s):  
Yanbo Wang ◽  
Quan Liu ◽  
Bo Yuan

Learning a Gaussian graphical model with latent variables is ill posed when there is insufficient sample complexity, and thus has to be appropriately regularized. A common choice is a convex ℓ1 plus nuclear norm penalty to regularize the search process. However, the best estimator performance is not always achieved with these additive convex regularizations, especially when the sample complexity is low. In this paper, we consider a concave additive regularization which does not require the strong irrepresentable condition. We also use concave regularization to correct the intrinsic estimation biases of the Lasso and nuclear penalties. We establish the proximity operators for our concave regularizations, which induce sparsity and low rankness, respectively. In addition, we extend our method to allow the decomposition of fused structure-sparsity plus low rankness, providing a powerful tool for models with temporal information. Specifically, we develop a nontrivial modified alternating direction method of multipliers with at least local convergence. Finally, we use both synthetic and real data to validate the performance of our method. In the application of reconstructing two-stage cancer networks, “the Warburg effect” can be revealed directly.
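
For readers unfamiliar with proximity operators, the sketch below shows the standard one for the convex ℓ1 penalty, namely elementwise soft-thresholding. This is the convex baseline that the abstract contrasts against, not the concave operators derived in the paper. The function names are chosen here for illustration.

```python
def soft_threshold(x, lam):
    """Proximity operator of lam * |x| for a scalar x."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def prox_l1(values, lam):
    """Apply soft-thresholding elementwise; small entries become exactly zero."""
    return [soft_threshold(v, lam) for v in values]


print(prox_l1([3.0, -0.5, -2.0, 0.2], 1.0))
```

Applying the same shrinkage to the singular values of a matrix gives the proximity operator of the nuclear norm, which is how the convex penalty encourages low rank. Soft-thresholding shrinks every surviving entry by lam, which is one source of the estimation bias that concave penalties aim to reduce.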


2021 ◽  
Author(s):  
Gordana C. Popovic ◽  
Francis K.C. Hui ◽  
David I. Warton

Visualising data is a vital part of analysis, allowing researchers to find patterns, and assess and communicate the results of statistical modeling. In ecology, visualisation is often challenging when there are many variables (often for different species or other taxonomic groups) and they are not normally distributed (often counts or presence-absence data). Ordination is a common and powerful way to overcome this hurdle by reducing data from many response variables to just two or three, to be easily plotted. Ordination is traditionally done using dissimilarity-based methods, most commonly non-metric multidimensional scaling (nMDS). In the last decade, however, model-based methods for unconstrained ordination have gained popularity. These are primarily based on latent variable models, with latent variables estimating the underlying, unobserved ecological gradients. Despite some major benefits, a major drawback of model-based ordination methods is their speed, as they typically take much longer to return a result than dissimilarity-based methods, especially for large sample sizes. We introduce copula ordination, a new, scalable model-based approach to unconstrained ordination. This method has all the desirable properties of model-based ordination methods, with the added advantage that it is computationally far more efficient. In particular, simulations show copula ordination is an order of magnitude faster than current model-based methods, and can even be faster than nMDS for large sample sizes, while being able to produce similar ordination plots and trends to these methods.


2020 ◽  
Vol 43 ◽  
pp. e49929
Author(s):  
Gislene Araujo Pereira ◽  
Mariana Resende ◽  
Marcelo Ângelo Cirillo

Multicollinearity is detected in regression models in which independent variables are strongly correlated. Since they entail linear relations between observed or latent variables, structural equation models (SEM) are subject to the multicollinearity effect, whose numerous consequences include the singularity of the matrices that must be inverted in estimation methods. Given this behavior, it is natural to expect that the suitability of these estimators for structural equation models shows the same features, whether in simulation results that validate the estimators at different degrees of multicollinearity or in their application to real data. Because multicollinearity makes matrix inversion impracticable, the numerical procedures required by maximum likelihood methods lead to numerical singularity problems. An alternative could be the Partial Least Squares (PLS) method; however, it requires that the observed variables be constructed assuming a positive correlation with the latent variable. Thus, theoretically, the signs of the loadings are expected to be positive; however, the algorithms used in the PLS method place no restrictions on these signs. This calls for corrective measures, such as removing observed variables or reformulating the theoretical model. In view of this problem, this paper aimed to propose adaptations of six generalized ridge estimators as alternative methods to estimate SEM parameters. The conclusion is that the evaluated estimators presented the same performance in terms of accuracy and precision across the scenarios considered: models with and without specification error, different levels of multicollinearity, and different sample sizes.


2021 ◽  
Vol 32 (1) ◽  
Author(s):  
Luis A. García-Escudero ◽  
Agustín Mayo-Iscar ◽  
Marco Riani

Abstract: A new methodology for constrained parsimonious model-based clustering is introduced, in which a tuning parameter controls the strength of the constraints. The methodology includes, as limit cases, the 14 parsimonious models that are often applied in model-based clustering when assuming normal components. This is done in a natural way by filling the gap among models and providing a smooth transition among them. The methodology provides mathematically well-defined problems and also helps to avoid spurious solutions. Novel information criteria are proposed to help the user in choosing parameters. The interest of the proposed methodology is illustrated through simulation studies and a real-data application on COVID data.


2021 ◽  
Author(s):  
Bert van der Veen ◽  
Francis K.C. Hui ◽  
Knut A. Hovstad ◽  
Robert B. O’Hara

Summary: In community ecology, unconstrained ordination can be used to predict, from a multivariate dataset, the latent variables that generated the observed species composition. Latent variables can be understood as ecological gradients, which are represented as a function of measured predictors in constrained ordination, so that ecologists can better relate species composition to the environment while reducing dimensionality of the predictors and the response data. However, existing constrained ordination methods do not explicitly account for information provided by species responses, so that they have the potential to misrepresent community structure if not all predictors are measured. We propose a new method for model-based ordination with constrained latent variables in the Generalized Linear Latent Variable Model framework, which incorporates both measured predictors and residual covariation to optimally represent ecological gradients. Simulations of unconstrained and constrained ordination show that the proposed method outperforms CCA and RDA.


2020 ◽  
Vol 2 (3) ◽  
Author(s):  
Stijn Hawinkel ◽  
Luc Bijnens ◽  
Kim-Anh Lê Cao ◽  
Olivier Thas

Abstract: The integration of multiple omics datasets measured on the same samples is a challenging task: data come from heterogeneous sources and vary in signal quality. In addition, some omics data are inherently compositional, e.g. sequence count data. Most integrative methods are limited in their ability to handle covariates, missing values, compositional structure and heteroscedasticity. In this article we introduce a flexible model-based approach to data integration to address these current limitations: COMBI. We combine concepts such as compositional biplots and log-ratio link functions with latent variable models, and propose an attractive visualization through multiplots to improve interpretation. Using real data examples and simulations, we illustrate and compare our method with other data integration techniques. Our algorithm is available in the R-package combi.


2021 ◽  
Vol 16 ◽  
Author(s):  
Zhaoyang Liu ◽  
Hongsheng Yin ◽  
Shutao Chen ◽  
Hui Liu ◽  
Jia Meng ◽  
...  

Background: m6A methylation is a ubiquitous post-transcriptional modification in mammals. MeRIP-seq technology makes it possible to acquire m6A data across the whole transcriptome under different conditions. The specific regulation by enzymes produces co-methylation modules in m6A methylation level data. Thus, mining co-methylation modules from these data can help to unveil the mechanism of m6A methylation modification and its role in the occurrence and development of complex diseases such as cancer. Objective: To develop a clustering algorithm that can effectively mine m6A co-methylation modules. Method: In this study, a novel beta mixture model-based clustering algorithm named MBMM is proposed, which is based on the EM framework and uses method-of-moments estimation in the M-step for parameter estimation to handle high-dimensional, small-sample m6A data. A simulation study was used to evaluate the clustering performance of the proposed algorithm, which was then used to mine co-methylation modules from real data. Biological significance correlation analysis was employed to explore whether the clustering results are co-methylation modules. Results and Conclusion: The simulation study demonstrated that MBMM outperformed other clustering algorithms. In real data, seven co-methylation modules were found by MBMM. Pathway-specific analysis of six m6A-related pathways showed that six co-methylation modules were enriched in the pathways and differed from one another. Substrate-specific analysis of five enzymes revealed that the seven co-methylation modules showed varying degrees of enrichment. Gene Ontology enrichment analysis indicated that these modules may be regulated by enzymes while having potential functional specificity.
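
The method-of-moments step described above can be sketched for a single beta component as follows. This is a minimal illustration under the assumption that the M-step matches the responsibility-weighted mean and variance to those of a Beta(a, b) distribution. The function name and its arguments are hypothetical and are not taken from MBMM.

```python
def beta_moment_estimates(values, weights):
    """Weighted method-of-moments estimates (a, b) for a Beta distribution.

    values:  observations in the open interval (0, 1)
    weights: non-negative responsibilities for one mixture component
    """
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, values)) / total
    # A Beta distribution requires 0 < var < mean * (1 - mean).
    if var <= 0.0 or var >= mean * (1.0 - mean):
        raise ValueError("moments are not compatible with a Beta distribution")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


# Equal weights: mean 0.5 and variance 0.05 correspond to Beta(2, 2).
a, b = beta_moment_estimates([0.2, 0.4, 0.6, 0.8], [1.0, 1.0, 1.0, 1.0])
print(a, b)
```

Closed-form moment matching avoids the iterative optimization that maximum likelihood would need for the beta parameters at every M-step, which is one plausible reason to prefer it when samples are few.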


Author(s):  
Wentian Zhao ◽  
Shaojie Wang ◽  
Zhihuai Xie ◽  
Jing Shi ◽  
Chenliang Xu

The expectation maximization (EM) algorithm finds maximum likelihood solutions for models with latent variables. A typical example is the Gaussian Mixture Model (GMM), which requires a Gaussian assumption; however, natural images are highly non-Gaussian, so GMM cannot be applied to image clustering in pixel space. To overcome this limitation, we propose a GAN-based EM learning framework that can maximize the likelihood of images and estimate the latent variables. We call this model GAN-EM, which is a framework for image clustering, semi-supervised classification and dimensionality reduction. In the M-step, we design a novel loss function for the discriminator of the GAN to perform maximum likelihood estimation (MLE) on data with soft class label assignments. Specifically, a conditional generator captures the data distribution for K classes, and a discriminator tells whether a sample is real or fake for each class. Since our model is unsupervised, the class label of real data is regarded as a latent variable, which is estimated by an additional network (E-net) in the E-step. The proposed GAN-EM achieves state-of-the-art clustering and semi-supervised classification results on MNIST, SVHN and CelebA, as well as comparable quality of generated images to other recently developed generative models.
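
To make the E-step and M-step concrete, here is a minimal sketch of classical EM for a one-dimensional Gaussian mixture, the baseline the abstract mentions. It is not GAN-EM: there are no networks, and the M-step uses the closed-form Gaussian updates. The function names, the toy data, and the starting values are assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=50, var_floor=1e-6):
    """Run EM for a 1D Gaussian mixture from the given starting values."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: soft class assignments (responsibilities) for each point.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: weighted maximum likelihood updates for each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)  # guard against collapse
    return means, variances, weights


# Toy data with two well-separated groups (illustrative only).
data = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
means, variances, weights = em_gmm_1d(data, [-1.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means, weights)
```

In GAN-EM as described above, the responsibilities would instead come from the E-net and the closed-form M-step would be replaced by training the generator and discriminator on soft labels.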


Author(s):  
Christian Damgaard ◽  
Rikke Reisner Hansen ◽  
Francis K. C. Hui

Abstract: Recently, there has been an increasing interest in model-based approaches for the statistical modelling of the joint distribution of multi-species abundances. The Dirichlet-multinomial distribution has been proposed as a suitable candidate distribution for the joint species distribution of pin-point plant cover data and is here applied in a model-based ordination framework. Unlike most model-based ordination methods, both fixed and random effects are in our proposed model structured as p-dimensional vectors and added to the latent variables before the inner product with the species-specific coefficients. This changes the interpretation of the parameters, so that the fixed and random effects now measure the relative displacement of the vegetation by the fixed and random factors in the p-dimensional latent variable space. This parameterization allows statistical inference of the effect of fixed and random factors in vector space, and makes it easier for practitioners to perform inferences on species composition in a multivariate setting. The method was applied to plant pin-point cover data from dry heathlands that had received different management treatments (burned, grazed, harvested, unmanaged), and it was found that treatment has a significant effect on heathland vegetation both when considering plant functional groups and when the taxonomic resolution was at the species level.

