Fast model-based ordination with copulas

Mapping Intimacies ◽

10.1101/2021.03.28.437086 ◽

2021 ◽

Author(s):

Gordana C. Popovic ◽

Francis K.C. Hui ◽

David I. Warton

Keyword(s):

Latent Variables ◽

Latent Variable ◽

Current Model ◽

Sample Sizes ◽

Major Drawback ◽

Large Sample ◽

Model Based ◽

Ordination Methods ◽

Order Of Magnitude ◽

Taxonomic Groups

Visualising data is a vital part of analysis, allowing researchers to find patterns, and assess and communicate the results of statistical modeling. In ecology, visualisation is often challenging when there are many variables (often for different species or other taxonomic groups) and they are not normally distributed (often counts or presence-absence data). Ordination is a common and powerful way to overcome this hurdle by reducing data from many response variables to just two or three, to be easily plotted. Ordination is traditionally done using dissimilarity-based methods, most commonly non-metric multidimensional scaling (nMDS). In the last decade, however, model-based methods for unconstrained ordination have gained popularity. These are primarily based on latent variable models, with latent variables estimating the underlying, unobserved ecological gradients. Despite some major benefits, a major drawback of model-based ordination methods is their speed, as they typically take much longer to return a result than dissimilarity-based methods, especially for large sample sizes. We introduce copula ordination, a new, scalable model-based approach to unconstrained ordination. This method has all the desirable properties of model-based ordination methods, with the added advantage that it is computationally far more efficient. In particular, simulations show copula ordination is an order of magnitude faster than current model-based methods, and can even be faster than nMDS for large sample sizes, while being able to produce similar ordination plots and trends as these methods.
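To make the idea of reducing many non-normal response variables to two plotting axes concrete, here is a minimal sketch in Python. It is not the authors' copula ordination algorithm: it only converts each variable to rank-based normal scores (a loose stand-in for the Gaussian-copula step) and then takes the two leading principal axes. The function names, the simulated counts, and the crude tie handling are all assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist


def normal_scores(column):
    """Map one response variable to normal scores via its ranks."""
    n = len(column)
    # Ranks 1..n; ties are broken by order of appearance (a crude choice).
    ranks = np.argsort(np.argsort(column, kind="stable"), kind="stable") + 1
    # Rescale ranks into (0, 1), then apply the standard normal quantile function.
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(r / (n + 1)) for r in ranks])


def sketch_ordination(counts, n_axes=2):
    """Return site scores on the leading principal axes of the normal scores."""
    z = np.column_stack(
        [normal_scores(counts[:, j]) for j in range(counts.shape[1])]
    )
    z = z - z.mean(axis=0)
    # Left singular vectors scaled by singular values give the site scores.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_axes] * s[:n_axes]


# Hypothetical data: 30 sites, 8 species, counts driven by one hidden gradient.
rng = np.random.default_rng(0)
gradient = rng.normal(size=30)
loadings = rng.normal(size=8)
counts = rng.poisson(np.exp(0.5 + 0.5 * np.outer(gradient, loadings)))

scores = sketch_ordination(counts)
print(scores.shape)  # (30, 2)
```

Each row of `scores` is one site, so plotting the two columns against each other gives an ordination-style scatter plot.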

PCR-Multiplex of Six Chloroplast Microsatellites for Population Studies and Genetic Typing in Pinus sylvestris

Silvae Genetica ◽

10.1515/sg-2004-0045 ◽

2004 ◽

Vol 53 (1-6) ◽

pp. 246-248 ◽

Cited By ~ 3

Author(s):

A. Dzialuk ◽

J. Burczyk

Keyword(s):

Pinus Sylvestris ◽

Population Studies ◽

Chloroplast Microsatellites ◽

Dna Fragments ◽

Sample Sizes ◽

Major Drawback ◽

Genetic Typing ◽

Large Sample ◽

Pcr Multiplex ◽

Skilled Personnel

Abstract: The major drawback of microsatellite analysis is that the markers are expensive to develop, labor-intensive, and demand skilled personnel. However, such studies might still be simplified and accelerated by multiplexing the markers and using high-throughput systems for genotyping DNA fragments. In this paper we present a single, simple and highly effective PCR-multiplex reaction composed of six chloroplast microsatellites widely used for population studies in pines but here applied to Pinus sylvestris. The reaction allows for rapid genotyping of large sample sizes.

Inferring Population Size Histories using Coalescent Hidden Markov Models with TMRCA and Total Branch Length as Hidden States

10.1101/2021.05.22.445274 ◽

2021 ◽

Author(s):

Gautam Upadhya ◽

Matthias Steinruecken

Keyword(s):

Population Size ◽

Hidden Markov Models ◽

Latent Variable ◽

Markov Models ◽

Hidden Markov ◽

Branch Length ◽

Quality Data ◽

Sample Sizes ◽

Large Sample ◽

Total Branch Length

Unraveling the complex demographic histories of natural populations is a central problem in population genetics. Understanding past demographic events is of general anthropological interest and is also an important step in establishing accurate null models when identifying adaptive or disease-associated genetic variation. An important class of tools for inferring past population size changes are Coalescent Hidden Markov Models (CHMMs). These models make efficient use of the linkage information in population genomic datasets by using the local genealogies relating sampled individuals as latent states that evolve across the chromosome in an HMM framework. Extending these models to large sample sizes is challenging, since the number of possible latent states increases rapidly. Here, we present our method CHIMP (CHMM History-Inference ML Procedure), a novel CHMM method for inferring the size history of a population. It can be applied to large samples (hundreds of haplotypes) and only requires unphased genomes as input. The two implementations of CHIMP that we present here use either the height of the genealogical tree (TMRCA) or the total branch length, respectively, as the latent variable at each position in the genome. The requisite transition and emission probabilities are obtained by numerically solving certain systems of differential equations derived from the ancestral process with recombination. The parameters of the population size history are subsequently inferred using an Expectation-Maximization algorithm. In addition, we implement a composite likelihood scheme to allow the method to scale to large sample sizes. We demonstrate the efficiency and accuracy of our method in a variety of benchmark tests using simulated data and present comparisons to other state-of-the-art methods. 
Specifically, our implementation using TMRCA as the latent variable shows comparable performance and provides accurate estimates of effective population sizes in intermediate and ancient times. Our method is agnostic to the phasing of the data, which makes it a promising alternative in scenarios where high quality data is not available, and has potential applications for pseudo-haploid data.
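For readers unfamiliar with the HMM machinery that coalescent HMMs build on, the following is a minimal sketch of the scaled forward algorithm for a generic discrete HMM. It is not CHIMP: the two hidden states, the matrices, and the observation sequence are made-up illustrative values, whereas in a CHMM the hidden states would be discretised TMRCA or total branch length values and the transition and emission probabilities would come from the coalescent with recombination.

```python
import numpy as np


def forward_log_likelihood(initial, transition, emission, observations):
    """Log-likelihood of an observation sequence under a discrete HMM.

    initial:    (K,) start probabilities over hidden states
    transition: (K, K) matrix; row i is P(next state | current state i)
    emission:   (K, M) matrix; row i is P(observed symbol | state i)
    """
    alpha = initial * emission[:, observations[0]]
    log_lik = 0.0
    for obs in observations[1:]:
        # Rescale at every step to avoid underflow on long sequences.
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha = (alpha / scale) @ transition * emission[:, obs]
    return log_lik + np.log(alpha.sum())


# Hypothetical two-state model with two observable symbols.
initial = np.array([0.6, 0.4])
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
emission = np.array([[0.7, 0.3],
                     [0.1, 0.9]])

print(forward_log_likelihood(initial, transition, emission, [0, 1, 1]))
```

The cost of each step grows with the square of the number of hidden states, which is one way to see why a rapidly growing state space makes large samples hard.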

Model-based ordination with constrained latent variables

10.1101/2021.10.11.463884 ◽

2021 ◽

Author(s):

Bert van der Veen ◽

Francis K.C. Hui ◽

Knut A. Hovstad ◽

Robert B. O’Hara

Keyword(s):

Species Composition ◽

Latent Variables ◽

Latent Variable ◽

List Type ◽

Ecological Gradients ◽

Model Framework ◽

Variable Model ◽

Response Data ◽

Model Based ◽

Constrained Ordination

Summary: In community ecology, unconstrained ordination can be used to predict, from a multivariate dataset, the latent variables that generated the observed species composition. Latent variables can be understood as ecological gradients, which are represented as a function of measured predictors in constrained ordination, so that ecologists can better relate species composition to the environment while reducing dimensionality of the predictors and the response data. However, existing constrained ordination methods do not explicitly account for information provided by species responses, so that they have the potential to misrepresent community structure if not all predictors are measured. We propose a new method for model-based ordination with constrained latent variables in the Generalized Linear Latent Variable Model framework, which incorporates both measured predictors and residual covariation to optimally represent ecological gradients. Simulations of unconstrained and constrained ordination show that the proposed method outperforms CCA and RDA.

Model-based ordination of pin-point cover data: effect of management on dry heathland

10.1101/2020.03.05.980060 ◽

2020 ◽

Cited By ~ 1

Author(s):

Christian Damgaard ◽

Rikke Reisner Hansen ◽

Francis K. C. Hui

Keyword(s):

Random Effects ◽

Latent Variables ◽

Latent Variable ◽

Plant Cover ◽

Multinomial Distribution ◽

Inner Product ◽

Plant Functional Groups ◽

Taxonomic Resolution ◽

Model Based ◽

Fixed And Random Effects

Abstract: Recently, there has been an increasing interest in model-based approaches for the statistical modelling of the joint distribution of multi-species abundances. The Dirichlet-multinomial distribution has been proposed as a suitable candidate distribution for the joint species distribution of pin-point plant cover data and is here applied in a model-based ordination framework. Unlike most model-based ordination methods, both fixed and random effects in our proposed model are structured as p-dimensional vectors and added to the latent variables before the inner product with the species-specific coefficients. This changes the interpretation of the parameters, so that the fixed and random effects now measure the relative displacement of the vegetation by the fixed and random factors in the p-dimensional latent variable space. This parameterization allows statistical inference of the effect of fixed and random factors in vector space, and makes it easier for practitioners to perform inferences on species composition in a multivariate setting. The method was applied to plant pin-point cover data from dry heathlands that had received different management treatments (burned, grazed, harvested, unmanaged), and it was found that treatment had a significant effect on heathland vegetation both when considering plant functional groups and when the taxonomic resolution was at the species level.

Multi-Partitions Subspace Clustering

Mathematics ◽

10.3390/math8040597 ◽

2020 ◽

Vol 8 (4) ◽

pp. 597 ◽

Cited By ~ 1

Author(s):

Vincent Vandewalle

Keyword(s):

Latent Variables ◽

Latent Variable ◽

Subspace Clustering ◽

Real Data ◽

Model Choice ◽

Model Based Clustering ◽

Model Based ◽

Choice Strategy ◽

Factorial Discriminant Analysis ◽

Bic Criterion

In model-based clustering, it is often supposed that only one clustering latent variable explains the heterogeneity of the whole dataset. However, in many cases several latent variables could explain the heterogeneity of the data at hand. Finding such class variables could result in a richer interpretation of the data. In the continuous data setting, a multi-partition model-based clustering is proposed. It assumes the existence of several latent clustering variables, each one explaining the heterogeneity of the data with respect to some clustering subspace. It allows the multi-partitions and the related subspaces to be found simultaneously. Parameters of the model are estimated through an EM algorithm relying on a probabilistic reinterpretation of the factorial discriminant analysis. A model choice strategy relying on the BIC criterion is proposed to select the number of subspaces and the number of clusters per subspace. The results obtained are thus several projections of the data, each one conveying its own clustering of the data. The model's behavior is illustrated on simulated and real data.
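For reference, the BIC used for this kind of model choice is commonly written as below. This is the generic textbook form, not a formula taken from the paper; the symbols are assumptions, with $\hat{L}_m$ the maximised likelihood of candidate model $m$, $\nu_m$ its number of free parameters, and $n$ the number of observations. Some authors use the opposite sign convention and maximise the criterion instead.

```latex
\mathrm{BIC}(m) = -2 \ln \hat{L}_m + \nu_m \ln n
```

Under this convention, the candidate (here, a choice of the number of subspaces and of clusters per subspace) with the smallest value is preferred.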

Probability-Based and Measurement-Related Hypotheses With Full Restriction for Investigations by Means of Confirmatory Factor Analysis

Methodology ◽

10.1027/1614-2241/a000033 ◽

2011 ◽

Vol 7 (4) ◽

pp. 157-164

Author(s):

Karl Schweizer

Keyword(s):

Factor Analysis ◽

Confirmatory Factor Analysis ◽

Cognitive Processing ◽

Latent Variables ◽

Repeated Measures ◽

Latent Variable ◽

Model Fit ◽

Repeated Measures Data ◽

Confirmatory Factor ◽

And Performance

Probability-based and measurement-related hypotheses for confirmatory factor analysis of repeated-measures data are investigated. Such hypotheses comprise precise assumptions concerning the relationships among the true components associated with the levels of the design or the items of the measure. Measurement-related hypotheses concentrate on the assumed processes, as, for example, transformation and memory processes, and represent treatment-dependent differences in processing. In contrast, probability-based hypotheses provide the opportunity to consider probabilities as outcome predictions that summarize the effects of various influences. The prediction of performance guided by inexact cues serves as an example. In the empirical part of this paper probability-based and measurement-related hypotheses are applied to working-memory data. Latent variables according to both hypotheses contribute to a good model fit. The best model fit is achieved for the model including latent variables that represented serial cognitive processing and performance according to inexact cues in combination with a latent variable for subsidiary processes.

Conceptualizing Protective Family Context and Its effect on Substance Use: Comparisons Across Diverse Ethnic-Racial Youth

10.31234/osf.io/abfs3 ◽

2019 ◽

Author(s):

Kevin Constante ◽

Edward Huntley ◽

Emma Schillinger ◽

Christine Wagner ◽

Daniel Keating

Keyword(s):

Substance Use ◽

Measurement Invariance ◽

Latent Variables ◽

Latent Variable ◽

Path Model ◽

Family Context ◽

Partial Metric ◽

Racial Groups ◽

Protective Methods ◽

Family Variables

Background: Although family behaviors are known to be important for buffering youth against substance use, research in this area often evaluates a particular type of family interaction and how it shapes adolescents’ behaviors, when it is likely that youth experience the co-occurrence of multiple types of family behaviors that may be protective. Methods: The current study (N = 1716, 10th and 12th graders, 55% female) examined associations between protective family context, a latent variable comprised of five different measures of family behaviors, and past 12 months substance use: alcohol, cigarettes, marijuana, and e-cigarettes. Results: A multi-group measurement invariance assessment supported protective family context as a coherent latent construct with partial (metric) measurement invariance among Black, Latinx, and White youth. A multi-group path model indicated that protective family context was significantly associated with less substance use for all youth, but of varying magnitudes across ethnic-racial groups. Conclusion: These results emphasize the importance of evaluating psychometric properties of family-relevant latent variables on the basis of group membership in order to draw appropriate inferences on how such family variables relate to substance use among diverse samples.

High heterogeneity undermines generalization of differential expression results in RNA-Seq analysis

Human Genomics ◽

10.1186/s40246-021-00308-5 ◽

2021 ◽

Vol 15 (1) ◽

Author(s):

Weitong Cui ◽

Huaru Xue ◽

Lei Wei ◽

Jinghua Jin ◽

Xuewen Tian ◽

...

Keyword(s):

Gene Expression ◽

Differential Expression ◽

Small Sample ◽

Differentially Expressed ◽

Cancer Type ◽

Rna Seq ◽

Sample Sizes ◽

Large Sample ◽

Expression Levels ◽

Gene Expression Levels

Background: RNA sequencing (RNA-Seq) has been widely applied in oncology for monitoring transcriptome changes. However, the emerging problem that high variation of gene expression levels caused by tumor heterogeneity may affect the reproducibility of differential expression (DE) results has rarely been studied. Here, we investigated the reproducibility of DE results for any given number of biological replicates between 3 and 24 and explored why a great many differentially expressed genes (DEGs) were not reproducible. Results: Our findings demonstrate that poor reproducibility of DE results exists not only for small sample sizes, but also for relatively large sample sizes. Quite a few of the DEGs detected are specific to the samples in use, rather than genuinely differentially expressed under different conditions. Poor reproducibility of DE results is mainly caused by high variation of gene expression levels for the same gene in different samples. Even though biological variation may account for much of the high variation of gene expression levels, the effect of outlier count data also needs to be treated seriously, as outlier data severely interfere with DE analysis. Conclusions: High heterogeneity exists not only in tumor tissue samples of each cancer type studied, but also in normal samples. High heterogeneity leads to poor reproducibility of DEGs, undermining generalization of differential expression results. Therefore, it is necessary to use large sample sizes (at least 10 if possible) in RNA-Seq experimental designs to reduce the impact of biological variability, and DE results should be interpreted cautiously unless soundly validated.

Interpretable Variational Graph Autoencoder with Noninformative Prior

Future Internet ◽

10.3390/fi13020051 ◽

2021 ◽

Vol 13 (2) ◽

pp. 51

Author(s):

Lili Sun ◽

Xueyan Liu ◽

Min Zhao ◽

Bo Yang

Keyword(s):

Latent Variables ◽

Latent Variable ◽

Expert Knowledge ◽

Structural Information ◽

Standard Normal Distribution ◽

Noninformative Prior ◽

Latent Space ◽

Distribution Parameters ◽

Standard Normal ◽

Low Dimensional

The variational graph autoencoder, which can encode structural information and attribute information in the graph into low-dimensional representations, has become a powerful method for studying graph-structured data. However, most existing methods based on the variational (graph) autoencoder assume that the prior of the latent variables is the standard normal distribution, which encourages all nodes to gather around 0. This prevents the latent space from being fully utilized. Choosing a suitable prior without incorporating additional expert knowledge therefore becomes a challenge. Given this, we propose a novel noninformative prior-based interpretable variational graph autoencoder (NPIVGAE). Specifically, we exploit the noninformative prior as the prior distribution of latent variables. This prior enables the posterior distribution parameters to be learned almost entirely from the sample data. Furthermore, we regard each dimension of a latent variable as the probability that the node belongs to each block, thereby improving the interpretability of the model. The correlation within and between blocks is described by a block–block correlation matrix. We compare our model with state-of-the-art methods on three real datasets, verifying its effectiveness and superiority.
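The pull toward 0 under a standard normal prior can be read off the usual closed-form regulariser of a variational autoencoder. The expression below is the standard result for a diagonal Gaussian posterior, not a formula quoted from the paper; the symbols $\mu_j$ and $\sigma_j^2$ (posterior mean and variance of latent dimension $j$ for one node) and $d$ (latent dimension) are assumptions.

```latex
\mathrm{KL}\!\left(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\middle\|\, \mathcal{N}(0, I)\right)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \ln \sigma_j^2 - 1 \right)
```

The $\mu_j^2$ term penalises any posterior mean away from the origin, and the whole expression is zero only when $\mu_j = 0$ and $\sigma_j^2 = 1$ for every $j$.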

A general class of latent variable models for ordinal manifest variables with covariate effects on the manifest and latent variables

British Journal of Mathematical and Statistical Psychology ◽

10.1348/000711003770480075 ◽

2003 ◽

Vol 56 (2) ◽

pp. 337-357 ◽

Cited By ~ 56

Author(s):

Irini Moustaki

Keyword(s):

Latent Variables ◽

General Class ◽

Latent Variable ◽

Latent Variable Models ◽

Covariate Effects
