Hierarchical likelihood methods for nonlinear and generalized linear mixed models with missing data and measurement errors in covariates

Abstract

Background: Converting electronic health record (EHR) entries to useful clinical inferences requires one to address the poor scalability of existing implementations of Generalized Linear Mixed Models (GLMMs) for repeated measures. The major computational bottleneck is the numerical evaluation of multivariable integrals, which even for the simplest EHR analyses may involve millions of dimensions (one for each patient). The hierarchical likelihood (h-lik) approach is a methodologically rigorous framework for the estimation of GLMMs that is based on the Laplace Approximation (LA), which replaces integration with numerical optimization and thus scales very well with dimensionality.

Methods: We present a high-performance, direct implementation of the h-lik for GLMMs in the R package TMB. Using this approach, we examined the relation between repeated serum potassium measurements and survival in the Cerner Real World Data (CRWD) EHR database. Analyzing these data requires the evaluation of an integral in over 3 million dimensions, putting this problem beyond the reach of conventional approaches. We also assessed the scalability and accuracy of the LA in smaller samples, of 1% and 10% of the full dataset, that were analyzed via a) the original interconnected Generalized Linear Models (iGLM) approach to h-lik, b) adaptive Gauss-Hermite (AGH) quadrature and c) Markov Chain Monte Carlo (MCMC), the gold standard for multivariate integration.

Results: Random effects estimates generated by the LA were within 10% of the values obtained by the iGLM, AGH and MCMC techniques. The h-lik approach was 4–30 times faster than AGH and nearly 800 times faster than MCMC. The major clinical inferences in this problem are the establishment of the non-linear relationship between the potassium level and the risk of mortality, as well as estimates of the individual and health care facility sources of variation for mortality risk in CRWD.

Conclusions: We found that the direct implementation of the h-lik offers a computationally efficient, numerically accurate approach to the analysis of extremely large, real-world repeated measures data with GLMMs. The clinical inference from our analysis may guide choices of treatment thresholds for treating potassium disorders in the clinic.
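The statement that the Laplace Approximation replaces integration with optimization can be illustrated on a single cluster of a random-intercept logistic model. The sketch below is not the authors' TMB implementation; it is a minimal Python illustration in which the function names, the toy responses and the parameter values (beta, sigma) are all assumptions made for the example. It finds the mode of the joint log density by Newton steps, applies the Laplace formula, and compares the result with a plain (non-adaptive) Gauss-Hermite quadrature of the same one-dimensional integral.

```python
import numpy as np

def log_joint(u, y, beta, sigma):
    """Log joint density of one cluster's binary responses y and its random intercept u."""
    eta = beta + u
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    logprior = -0.5 * u**2 / sigma**2 - 0.5 * np.log(2.0 * np.pi * sigma**2)
    return loglik + logprior

def laplace_log_marginal(y, beta, sigma, iters=50, tol=1e-12):
    """Laplace approximation: optimize over u, then correct with the curvature at the mode."""
    n = len(y)
    u = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(beta + u)))
        grad = np.sum(y - p) - u / sigma**2          # first derivative of log_joint in u
        hess = -n * p * (1.0 - p) - 1.0 / sigma**2   # second derivative, always negative
        step = grad / hess
        u -= step                                    # Newton update toward the mode
        if abs(step) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-(beta + u)))
    hess = -n * p * (1.0 - p) - 1.0 / sigma**2
    return log_joint(u, y, beta, sigma) + 0.5 * np.log(2.0 * np.pi) - 0.5 * np.log(-hess)

def quadrature_log_marginal(y, beta, sigma, nodes=80):
    """Reference value: Gauss-Hermite quadrature over the N(0, sigma^2) random intercept."""
    x, w = np.polynomial.hermite.hermgauss(nodes)
    u = np.sqrt(2.0) * sigma * x
    eta = beta + u[:, None]                          # shape (nodes, n)
    loglik = np.sum(y[None, :] * eta - np.log1p(np.exp(eta)), axis=1)
    return np.log(np.sum(w * np.exp(loglik)) / np.sqrt(np.pi))

# Toy cluster (hypothetical data): eight binary outcomes for one patient.
y = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
la = laplace_log_marginal(y, beta=-0.5, sigma=1.0)
gh = quadrature_log_marginal(y, beta=-0.5, sigma=1.0)
print("Laplace:", la, "quadrature:", gh)
```

With one random effect per patient, the full log-likelihood is a sum of such terms, so the cost of the Laplace route grows with the number of patients through the optimization rather than through a quadrature grid.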

Power for balanced linear mixed models with complex missing data processes

Communications in Statistics - Theory and Methods. DOI: 10.1080/03610926.2021.1909732. 2021, pp. 1-19.

Author(s): Kevin P. Josey, Brandy M. Ringham, Anna E. Barón, Margaret Schenkman, Katherine A. Sauder, ...

Keyword(s): Missing Data, Mixed Models, Linear Mixed Models

l2-Penalized temporal logit-mixed models for the estimation of regional obesity prevalence over time

Statistical Methods in Medical Research. DOI: 10.1177/09622802211017583. 2021, pp. 096228022110175.

Author(s): Jan P Burgard, Joscha Krause, Ralf Münnich, Domingo Morales

Keyword(s): Parameter Estimation, Medical Treatment, Mixed Models, Generalized Linear Mixed Models, Linear Mixed Models, Obesity Prevalence, Model Parameter, Model Parameter Estimation, Public Health Reporting, Over Time

Obesity is considered to be one of the primary health risks in modern industrialized societies. Estimating the evolution of its prevalence over time is an essential element of public health reporting. This requires the application of suitable statistical methods on epidemiologic data with substantial local detail. Generalized linear-mixed models with medical treatment records as covariates mark a powerful combination for this purpose. However, the task is methodologically challenging. Disease frequencies are subject to both regional and temporal heterogeneity. Medical treatment records often show strong internal correlation due to diagnosis-related grouping. This frequently causes excessive variance in model parameter estimation due to rank-deficiency problems. Further, generalized linear-mixed models are often estimated via approximate inference methods as their likelihood functions do not have closed forms. These problems combined lead to unacceptable uncertainty in prevalence estimates over time. We propose an l2-penalized temporal logit-mixed model to solve these issues. We derive empirical best predictors and present a parametric bootstrap to estimate their mean-squared errors. A novel penalized maximum approximate likelihood algorithm for model parameter estimation is stated. With this new methodology, the regional obesity prevalence in Germany from 2009 to 2012 is estimated. We find that the national prevalence ranges between 15 and 16%, with significant regional clustering in eastern Germany.
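How an l2 penalty addresses rank deficiency from strongly correlated covariates can be sketched with a fixed-effects logistic regression. This is not the authors' penalized maximum approximate likelihood algorithm and it omits the random effects and the temporal structure; the function name, the simulated data and the penalty weight lam are assumptions made for illustration. With an exactly duplicated covariate, the unpenalized Hessian is singular, while adding lam times the identity makes each Newton step well defined.

```python
import numpy as np

def fit_ridge_logit(X, y, lam, iters=100, tol=1e-10):
    """Newton iterations for logistic regression with penalty (lam / 2) * ||beta||^2."""
    k = X.shape[1]
    beta = np.zeros(k)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p) - lam * beta                       # penalized score
        weights = p * (1.0 - p)
        hess = X.T @ (X * weights[:, None]) + lam * np.eye(k)   # penalized information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated data (hypothetical): the second covariate duplicates the first,
# so X'WX alone has rank 1 and cannot be inverted.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1.copy()])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-x1))).astype(float)

beta_hat = fit_ridge_logit(X, y, lam=1.0)
print("penalized coefficients:", beta_hat)
```

In this toy setting the penalty splits the shared effect evenly between the two duplicated columns, which is the kind of stabilization the abstract attributes to the l2 term.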
