Hierarchical Shrinkage Priors and Model Fitting for High-dimensional Generalized Linear Models

Author(s):  
Nengjun Yi ◽  
Shuangge Ma

2012 ◽  
Vol 55 (2) ◽  
pp. 327-347 ◽  
Author(s):  
Dengke Xu ◽  
Zhongzhan Zhang ◽  
Liucang Wu

2019 ◽  
Vol 116 (12) ◽  
pp. 5451-5460 ◽  
Author(s):  
Jean Barbier ◽  
Florent Krzakala ◽  
Nicolas Macris ◽  
Léo Miolane ◽  
Lenka Zdeborová

Generalized linear models (GLMs) are used in high-dimensional machine learning, statistics, communications, and signal processing. In this paper we analyze GLMs when the data matrix is random, as relevant in problems such as compressed sensing, error-correcting codes, or benchmark models in neural networks. We evaluate the mutual information (or “free entropy”) from which we deduce the Bayes-optimal estimation and generalization errors. Our analysis applies to the high-dimensional limit where both the number of samples and the dimension are large and their ratio is fixed. Nonrigorous predictions for the optimal errors existed for special cases of GLMs, e.g., for the perceptron, in the field of statistical physics based on the so-called replica method. Our present paper rigorously establishes those decades-old conjectures and brings forward their algorithmic interpretation in terms of performance of the generalized approximate message-passing algorithm. Furthermore, we tightly characterize, for many learning problems, regions of parameters for which this algorithm achieves the optimal performance and locate the associated sharp phase transitions separating learnable and nonlearnable regions. We believe that this random version of GLMs can serve as a challenging benchmark for multipurpose algorithms.
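The teacher-student ensemble the paper analyzes is straightforward to instantiate numerically. The sketch below is a minimal illustration under assumptions chosen here for exposition: it draws an i.i.d. Gaussian data matrix, generates labels through a perceptron channel, and measures generalization error at a few fixed sampling ratios. The estimator is plain logistic regression, used only as a baseline; it is not the generalized approximate message-passing algorithm whose optimality the paper establishes.

```python
# Minimal sketch of the random-design teacher-student GLM (the "perceptron"
# channel): X has i.i.d. Gaussian entries, labels y = sign(X w* / sqrt(d)),
# and the sampling ratio alpha = n/d stays fixed as n and d grow.
# Logistic regression is a simple baseline only, NOT the GAMP algorithm.
import numpy as np
from sklearn.linear_model import LogisticRegression

def test_error(alpha, d, rng):
    n = int(alpha * d)
    w_star = rng.standard_normal(d)            # teacher weights
    X = rng.standard_normal((n, d))            # random Gaussian data matrix
    y = np.sign(X @ w_star / np.sqrt(d))       # perceptron channel
    clf = LogisticRegression(C=1e3, max_iter=2000).fit(X, y)
    X_new = rng.standard_normal((5 * n, d))    # fresh samples, same ensemble
    y_new = np.sign(X_new @ w_star / np.sqrt(d))
    return np.mean(clf.predict(X_new) != y_new)

rng = np.random.default_rng(0)
for alpha in (0.5, 1.0, 2.0, 4.0):
    print(f"alpha = {alpha:3.1f}  generalization error ~ {test_error(alpha, 500, rng):.3f}")
```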


2018 ◽  
Author(s):  
Julián Candia ◽  
John S. Tsang

Background
Regularized generalized linear models (GLMs) are popular regression methods in bioinformatics, particularly useful in scenarios with fewer observations than parameters/features or when many of the features are correlated. In both ridge and lasso regularization, feature shrinkage is controlled by a penalty parameter λ. The elastic net introduces a mixing parameter α to tune the shrinkage continuously from ridge to lasso. Selecting α objectively and determining which features contributed significantly to prediction after model fitting remain a practical challenge given the paucity of available software to evaluate performance and statistical significance.

Results
eNetXplorer builds on top of glmnet to address the above issues for linear (Gaussian), binomial (logistic), and multinomial GLMs. It provides new functionalities to empower practical applications by using a cross-validation framework that assesses the predictive performance and statistical significance of a family of elastic net models (as α is varied) and of the corresponding features that contribute to prediction. The user can select which quality metrics to use to quantify the concordance between predicted and observed values, with defaults provided for each GLM. Statistical significance for each model (as defined by α) is determined based on comparison to a set of null models generated by random permutations of the response; the same permutation-based approach is used to evaluate the significance of individual features. In the analysis of large and complex biological datasets, such as transcriptomic and proteomic data, eNetXplorer provides summary statistics, output tables, and visualizations to help assess which subset(s) of features have predictive value for a set of response measurements, and to what extent those subset(s) of features can be expanded or reduced via regularization.

Conclusions
This package presents a framework and software for exploratory data analysis and visualization. By making regularized GLMs more accessible and interpretable, eNetXplorer guides the process to generate hypotheses based on features significantly associated with biological phenotypes of interest, e.g. to identify biomarkers for therapeutic responsiveness. eNetXplorer is also generally applicable to any research area that may benefit from predictive modeling and feature identification using regularized GLMs.

Availability and implementation
The package is available under GPL-3 license at the CRAN repository, https://CRAN.R-project.org/package=eNetXplorer
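eNetXplorer itself is an R package; the sketch below is a rough Python analogue of its workflow using scikit-learn, an assumption for illustration rather than the package's own API. It sweeps the elastic-net mixing parameter by cross-validation, then gauges significance of the selected model against null models fit to permuted responses. Note the naming clash: scikit-learn's l1_ratio plays the role of the mixing parameter α above, while scikit-learn's alpha is the penalty strength λ.

```python
# Rough Python analogue of the eNetXplorer workflow (NOT its API):
# cross-validated sweep of the elastic-net mixing parameter, followed by a
# permutation-based significance check against response-permuted null models.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.model_selection import permutation_test_score

# p >> n, as in typical transcriptomic/proteomic settings
X, y = make_regression(n_samples=80, n_features=300, n_informative=10,
                       noise=5.0, random_state=0)

# Mixing-parameter sweep from ridge-like (0.1) to lasso (1.0)
cv_model = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
                        cv=5, random_state=0).fit(X, y)
print("selected mixing:", cv_model.l1_ratio_, " penalty:", cv_model.alpha_)

# Significance against null models fit to randomly permuted responses,
# in the spirit of eNetXplorer's permutation framework
best = ElasticNet(alpha=cv_model.alpha_, l1_ratio=cv_model.l1_ratio_)
score, perm_scores, pvalue = permutation_test_score(
    best, X, y, scoring="r2", cv=5, n_permutations=200, random_state=0)
print(f"CV R^2 = {score:.3f}, permutation p-value = {pvalue:.3f}")
```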


Biometrika ◽  
2021 ◽  
Author(s):  
Emre Demirkaya ◽  
Yang Feng ◽  
Pallavi Basu ◽  
Jinchi Lv

Summary Model selection is crucial both to high-dimensional learning and to inference for contemporary big data applications in pinpointing the best set of covariates among a sequence of candidate interpretable models. Most existing work assumes implicitly that the models are correctly specified or of fixed dimensionality, yet both model misspecification and high dimensionality are prevalent in practice. In this paper, we exploit the framework of model selection principles under misspecified generalized linear models presented in Lv and Liu (2014) and investigate the asymptotic expansion of the posterior model probability in the setting of high-dimensional misspecified models. With a natural choice of prior probabilities that encourages interpretability and incorporates the Kullback–Leibler divergence, we suggest the high-dimensional generalized Bayesian information criterion with prior probability for large-scale model selection with misspecification. Our new information criterion characterizes the impacts of both model misspecification and high dimensionality on model selection. We further establish the consistency of covariance contrast matrix estimation and the model selection consistency of the new information criterion in ultrahigh dimensions under some mild regularity conditions. The numerical studies demonstrate that our new method enjoys improved model selection consistency over its main competitors.
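The paper's criterion builds on the GBIC_p of Lv and Liu (2014), whose exact form is not reproduced in this summary. The sketch below therefore illustrates only the general recipe of adding a prior-probability term to a likelihood-based criterion, here with a prior uniform over model size, which yields an extended-BIC-style penalty in the sense of Chen and Chen (2008); all function names are hypothetical, and this is not the paper's HGBIC_p formula.

```python
# Illustration of a prior-probability penalty in a BIC-type criterion:
# score a candidate support M of a logistic model by
#   -2*loglik + |M|*log(n) + 2*log C(p, |M|),
# where the last term comes from a prior uniform over model size
# (an extended-BIC-style choice, Chen and Chen, 2008).
import numpy as np
from math import lgamma, log
from sklearn.linear_model import LogisticRegression

def log_binom(p, k):                           # log of C(p, k)
    return lgamma(p + 1) - lgamma(k + 1) - lgamma(p - k + 1)

def prior_bic(X, y, support):
    n, p = X.shape
    k = len(support)
    # large C approximates the unpenalized maximum-likelihood fit
    mle = LogisticRegression(C=1e6, max_iter=5000).fit(X[:, support], y)
    prob = np.clip(mle.predict_proba(X[:, support])[:, 1], 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
    return -2 * loglik + k * log(n) + 2 * log_binom(p, k)

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:3] = 1.5             # true support {0, 1, 2}
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta))).astype(int)

# Lower is better: the true support should beat both a superset and a wrong set
for support in ([0, 1, 2], [0, 1, 2, 3, 4], [5, 6, 7]):
    print(support, round(prior_bic(X, y, support), 1))
```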

