Approximate Information and Accelerating for High-throughput Heterogeneous Data Analysis with Linear Mixed Models

Author(s):  
Shengxin Zhu

Linear mixed models are frequently used for analysing heterogeneous data in a broad range of applications. The restricted maximum likelihood method is often preferred for estimating covariance parameters in such models because of its unbiased estimation of the underlying variance parameters. The restricted log-likelihood function involves log determinants of a complicated covariance matrix. Estimating the underlying model parameters efficiently and quantifying the accuracy of the estimates require the first and second derivatives of the restricted log-likelihood function, the latter giving the observed information. Standard approaches to computing the observed information and its expectation, the Fisher information, are computationally prohibitive for linear mixed models with thousands of random and fixed effects. Customized algorithms are in high demand to keep mixed-model analysis scalable for increasingly high-throughput heterogeneous data sets. In this paper, we explore how to leverage an averaged information splitting technique and a dedicated matrix transform to significantly reduce computation and accelerate computing. Together with a fill-in reducing multi-frontal sparse direct solver, the averaged information splitting approach improves the performance of the computation process.
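
To make the quantities named in this abstract concrete, here is a minimal dense sketch of the standard average-information identity for restricted maximum likelihood. It assumes a simple variance-components model V = sigma2_e * I + sigma2_u * Z Z^T, which the abstract does not specify. The function name `reml_information` and all variable names are hypothetical, and the paper's sparse multi-frontal solver and matrix transform are not reproduced here.

```python
import numpy as np


def reml_information(y, X, Z, sigma2_e, sigma2_u):
    """Observed, Fisher and averaged information for the REML log-likelihood.

    Assumed model (illustrative only): y ~ N(X b, V) with
    V = sigma2_e * I + sigma2_u * Z Z^T, so V is linear in both parameters.
    """
    n = y.shape[0]
    # Derivatives of V with respect to (sigma2_e, sigma2_u).
    V_parts = [np.eye(n), Z @ Z.T]
    V = sigma2_e * V_parts[0] + sigma2_u * V_parts[1]

    # Dense inverse for clarity; this does not scale to large problems.
    Vinv = np.linalg.inv(V)
    VinvX = Vinv @ X
    # P = V^-1 - V^-1 X (X^T V^-1 X)^-1 X^T V^-1
    P = Vinv - VinvX @ np.linalg.solve(X.T @ VinvX, VinvX.T)
    Py = P @ y

    k = len(V_parts)
    observed = np.empty((k, k))
    fisher = np.empty((k, k))
    averaged = np.empty((k, k))
    for i, Vi in enumerate(V_parts):
        for j, Vj in enumerate(V_parts):
            trace_term = 0.5 * np.trace(P @ Vi @ P @ Vj)  # needs matrix products
            quad_term = Py @ Vi @ P @ Vj @ Py             # y^T P Vi P Vj P y
            fisher[i, j] = trace_term
            observed[i, j] = quad_term - trace_term
            # Averaging cancels the trace term, leaving only a quadratic form.
            averaged[i, j] = 0.5 * quad_term
    return P, observed, fisher, averaged


# Small synthetic example: 40 observations, 2 fixed effects, 6 random effects.
rng = np.random.default_rng(0)
n_obs, n_groups = 40, 6
X_demo = np.column_stack([np.ones(n_obs), rng.normal(size=n_obs)])
Z_demo = np.eye(n_groups)[rng.integers(0, n_groups, size=n_obs)]
y_demo = rng.normal(size=n_obs)
_, obs_demo, fis_demo, avg_demo = reml_information(y_demo, X_demo, Z_demo, 1.0, 0.5)
print(np.allclose(avg_demo, 0.5 * (obs_demo + fis_demo)))
```

In this sketch the averaged matrix needs only quadratic forms in the data, while the observed and Fisher information each need traces of matrix products. That difference is the usual reason for averaging the two in large models.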

F1000Research ◽  
2019 ◽  
Vol 6 ◽  
pp. 748 ◽  
Author(s):  
Malgorzata Nowicka ◽  
Carsten Krieg ◽  
Helena L. Crowell ◽  
Lukas M. Weber ◽  
Felix J. Hartmann ◽  
...  

High-dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high-throughput interrogation and characterization of cell populations. Here, we present an updated R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signalling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models or linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g., multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g., plots of aggregated signals).


2011 ◽  
Vol 23 (11) ◽  
pp. 2833-2867 ◽  
Author(s):  
Yi Dong ◽  
Stefan Mihalas ◽  
Alexander Russell ◽  
Ralph Etienne-Cummings ◽  
Ernst Niebur

When a neuronal spike train is observed, what can we deduce from it about the properties of the neuron that generated it? A natural way to answer this question is to make an assumption about the type of neuron, select an appropriate model for this type, and then choose the model parameters as those that are most likely to generate the observed spike train. This is the maximum likelihood method. If the neuron obeys simple integrate-and-fire dynamics, Paninski, Pillow, and Simoncelli (2004) showed that its negative log-likelihood function is convex and that, at least in principle, its unique global minimum can thus be found by gradient descent techniques. Many biological neurons are, however, known to generate a richer repertoire of spiking behaviors than can be explained in a simple integrate-and-fire model. For instance, such a model retains only an implicit (through spike-induced currents), not an explicit, memory of its input; an example of a physiological situation that cannot be explained is the absence of firing if the input current is increased very slowly. Therefore, we use an expanded model (Mihalas & Niebur, 2009), which is capable of generating a large number of complex firing patterns while still being linear. Linearity is important because it maintains the distribution of the random variables and still allows maximum likelihood methods to be used. In this study, we show that although convexity of the negative log-likelihood function is not guaranteed for this model, the minimum of this function yields a good estimate for the model parameters, in particular if the noise level is treated as a free parameter. Furthermore, we show that a nonlinear function minimization method (r-algorithm with space dilation) usually reaches the global minimum.


Author(s):  
Muhammad Abu Shadeque Mullah ◽  
Andrea Benedetti

Besides being mainly used for analyzing clustered or longitudinal data, generalized linear mixed models can also be used for smoothing via restricting changes in the fit at the knots in regression splines. The resulting models are usually called semiparametric mixed models (SPMMs). We investigate the effect of smoothing using SPMMs on the correlation and variance parameter estimates for serially correlated longitudinal normal, Poisson and binary data. Through simulations, we compare the performance of SPMMs to other simpler methods for estimating the nonlinear association, such as fractional polynomials and using a parametric nonlinear function. Simulation results suggest that, in general, the SPMMs recover the true curves very well and yield reasonable estimates of the correlation and variance parameters. However, for binary outcomes, SPMMs produce biased estimates of the variance parameters for highly serially correlated data. We apply these methods to a dataset investigating the association between CD4 cell count and time since seroconversion for HIV infected men enrolled in the Multicenter AIDS Cohort Study.


2021 ◽  
Vol 13 (6) ◽  
pp. 3274
Author(s):  
Suzanne Maas ◽  
Paraskevas Nikolaou ◽  
Maria Attard ◽  
Loukas Dimitriou

Bicycle sharing systems (BSSs) have been implemented in cities worldwide in an attempt to promote cycling. Despite exhibiting characteristics considered to be barriers to cycling, such as hot summers, hilliness and car-oriented infrastructure, Southern European island cities and tourist destinations Limassol (Cyprus), Las Palmas de Gran Canaria (Canary Islands, Spain) and the Valletta conurbation (Malta) are all experiencing the implementation of BSSs and policies to promote cycling. In this study, a year of trip data and secondary datasets are used to analyze dock-based BSS usage in the three case-study cities. How land use, socio-economic, network and temporal factors influence BSS use at station locations, both as an origin and as a destination, was examined using bivariate correlation analysis and through the development of linear mixed models for each case study. Bivariate correlations showed significant positive associations with the number of cafes and restaurants, vicinity to the beach or promenade and the percentage of foreign population at the BSS station locations in all cities. A positive relation with cycling infrastructure was evident in Limassol and Las Palmas de Gran Canaria, but not in Malta, as no cycling infrastructure is present in the island’s conurbation, where the BSS is primarily operational. Elevation had a negative association with BSS use in all three cities. In Limassol and Malta, where seasonality in weather patterns is strongest, a negative effect of rainfall and a positive effect of higher temperature were observed. Although there was a positive association between BSS use and the number of visiting tourists in Limassol and Malta, this is predominantly explained through the multi-collinearity with weather factors rather than by intensive use of the BSS by tourists. 
The linear mixed models showed more fine-grained results and explained differences in BSS use at stations, including differences for station use as an origin and as a destination. The insights from the correlation analysis and linear mixed models can be used to inform policies promoting cycling and BSS use and support sustainable mobility policies in the case-study cities and cities with similar characteristics.


Psych ◽  
2021 ◽  
Vol 3 (2) ◽  
pp. 197-232
Author(s):  
Yves Rosseel

This paper discusses maximum likelihood estimation for two-level structural equation models when data are missing at random at both levels. Building on existing literature, a computationally efficient expression is derived to evaluate the observed log-likelihood. Unlike previous work, the expression is valid for the special case where the model implied variance–covariance matrix at the between level is singular. Next, the log-likelihood function is translated to R code. A sequence of R scripts is presented, starting from a naive implementation and ending at the final implementation as found in the lavaan package. Along the way, various computational tips and tricks are given.


2019 ◽  
Vol 38 (30) ◽  
pp. 5603-5622 ◽  
Author(s):  
Bernard G. Francq ◽  
Dan Lin ◽  
Walter Hoyer
