Bayesian approaches to variable selection: a comparative study from practical perspectives

2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Zihang Lu ◽  
Wendy Lou

Abstract In many clinical studies, researchers are interested in parsimonious models that simultaneously achieve consistent variable selection and optimal prediction. Such models facilitate meaningful biological interpretation and scientific findings. Variable selection via Bayesian inference has advanced significantly in recent years. Despite its increasing popularity, there is limited practical guidance for implementing these Bayesian approaches and evaluating their comparative performance in clinical datasets. In this paper, we review several commonly used Bayesian approaches to variable selection, with emphasis on application and implementation in R. These approaches can be roughly categorized into four classes: Bayesian model selection, spike-and-slab priors, shrinkage priors, and hybrids of spike-and-slab and shrinkage priors. To evaluate their variable selection performance under various scenarios, we compare these four classes of approaches using real and simulated datasets. These results provide practical guidance to researchers who are interested in applying Bayesian approaches for the purpose of variable selection.
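As a rough illustration of the first class named above (Bayesian model selection), the sketch below enumerates every subset of predictors in a small linear regression and approximates posterior model probabilities from BIC values, assuming a uniform prior over models. It is not the authors' implementation (the paper works in R); the function name `bic_model_posterior`, the simulated data, and the BIC approximation are all assumptions made for this example.

```python
import itertools

import numpy as np


def bic_model_posterior(X, y):
    """Approximate posterior model probabilities for all predictor subsets via BIC."""
    n, p = X.shape
    models, bics = [], []
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            # Design matrix: intercept plus the selected predictor columns.
            design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            beta, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = float(np.sum((y - design @ beta) ** 2))
            # Gaussian-likelihood BIC up to an additive constant.
            bics.append(n * np.log(rss / n) + design.shape[1] * np.log(n))
            models.append(subset)
    bics = np.array(bics)
    # exp(-BIC/2) is proportional to the approximate marginal likelihood.
    weights = np.exp(-0.5 * (bics - bics.min()))
    weights /= weights.sum()
    # Posterior inclusion probability: total weight of models containing predictor j.
    pip = np.array(
        [sum(w for m, w in zip(models, weights) if j in m) for j in range(p)]
    )
    return models, weights, pip


# Simulated data (hypothetical): only predictors 0 and 2 affect the response.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)

models, weights, pip = bic_model_posterior(X, y)
print("posterior inclusion probabilities:", np.round(pip, 3))
```

Full enumeration is only feasible for a handful of predictors, since the number of models doubles with each one added; larger problems typically need a stochastic search or one of the prior-based classes listed in the abstract.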

2020 ◽  
Author(s):  
Connor Donegan ◽  
Yongwan Chun ◽  
Amy E. Hughes

This paper proposes a Bayesian method for spatial regression using eigenvector spatial filtering (ESF) and Piironen and Vehtari's (2017) regularized horseshoe (RHS) prior. ESF models are most often estimated using variable selection procedures such as stepwise selection, but in the absence of a Bayesian model averaging procedure variable selection methods cannot properly account for parameter uncertainty. Hierarchical shrinkage priors such as the RHS address the foregoing concern in a computationally efficient manner by encoding prior information about spatial filters into an adaptive prior distribution which shrinks posterior estimates towards zero in the absence of a strong signal while only minimally regularizing coefficients of important eigenvectors. This paper presents results from a large simulation study which compares the performance of the proposed Bayesian model (RHS-ESF) to alternative spatial models under a variety of spatial autocorrelation scenarios. The RHS-ESF model performance matched that of the SAR model from which the data was generated. The study highlights that reliable uncertainty estimates require greater attention to spatial autocorrelation in covariates than is typically given. A demonstration analysis of 2016 U.S. Presidential election results further evidences robustness of results to hyper-prior specifications as well as the advantages of estimating spatial models using the Stan probabilistic programming language.
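For reference, the regularized horseshoe prior of Piironen and Vehtari (2017) cited above can be written as follows. Here $\beta_j$ stands for a generic regression coefficient (in the ESF setting, presumably the coefficient on the $j$-th eigenvector); the abstract does not say how the paper sets the global scale $\tau$ or the slab scale $c$, so those are left unspecified.

```latex
\begin{aligned}
\beta_j \mid \lambda_j, \tau, c &\sim \mathcal{N}\!\left(0,\ \tau^2 \tilde{\lambda}_j^2\right), \\
\tilde{\lambda}_j^2 &= \frac{c^2 \lambda_j^2}{c^2 + \tau^2 \lambda_j^2}, \\
\lambda_j &\sim \mathrm{C}^{+}(0, 1).
\end{aligned}
```

When $\tau^2 \lambda_j^2 \ll c^2$, $\tilde{\lambda}_j^2 \approx \lambda_j^2$ and the prior behaves like the original horseshoe, shrinking weak signals towards zero. When $\tau^2 \lambda_j^2 \gg c^2$, $\tau^2 \tilde{\lambda}_j^2 \approx c^2$, so large coefficients are regularized only by a Gaussian slab of variance $c^2$.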


Nutrients ◽  
2021 ◽  
Vol 13 (4) ◽  
pp. 1098
Author(s):  
Ewelina Łukaszyk ◽  
Katarzyna Bień-Barkowska ◽  
Barbara Bień

Identifying factors that affect mortality requires a robust statistical approach. This study’s objective is to identify an optimal set of variables that are independently associated with the mortality risk of 433 older comorbid adults who had been discharged from the geriatric ward. We applied both stepwise backward variable selection and iterative Bayesian model averaging (BMA) to Cox proportional hazards models. Potential predictors of the mortality rate were drawn from a broad range of clinical data and functional and laboratory tests, including the geriatric nutritional risk index (GNRI), lymphocyte count, vitamin D, and the age-weighted Charlson comorbidity index. The multivariable analysis identified seven explanatory variables that are independently associated with the length of survival. The mortality rate was higher in males than in females; it increased with the comorbidity level and the C-reactive protein plasma level but decreased with a person’s mobility, GNRI, lymphocyte count, and vitamin D plasma level.


2021 ◽  
Author(s):  
Arinjita Bhattacharyya ◽  
Subhadip Pal ◽  
Riten Mitra ◽  
Shesh Rai

Abstract Background: Prediction and classification algorithms are commonly used in clinical research for identifying patients susceptible to clinical conditions like diabetes, colon cancer, and Alzheimer’s disease. Developing accurate prediction and classification methods has implications for personalized medicine. Building an excellent predictive model involves selecting features that are most significantly associated with the response at hand. These features can include several biological and demographic characteristics, such as genomic biomarkers and health history. Such variable selection becomes challenging when the number of potential predictors is large. Bayesian shrinkage models have emerged as popular and flexible methods of variable selection in regression settings. The article discusses variable selection with three shrinkage priors and illustrates its application to clinical data sets such as Pima Indians Diabetes, Colon cancer, ADNI, and OASIS Alzheimer’s data sets. Methods: We present a unified Bayesian hierarchical framework that implements and compares shrinkage priors in binary and multinomial logistic regression models. The key feature is the representation of the likelihood by a Polya-Gamma data augmentation, which admits a natural integration with a family of shrinkage priors. We specifically focus on the Horseshoe, Dirichlet Laplace, and Double Pareto priors. Extensive simulation studies are conducted to assess the performances under different data dimensions and parameter settings. Measures of accuracy, AUC, Brier score, L1 error, cross-entropy, and ROC surface plots are used as evaluation criteria comparing the priors to frequentist methods like Lasso, Elastic-Net, and Ridge regression. Results: All three priors can be used for robust prediction with significant metrics, irrespective of their categorical response model choices. The simulation study achieved a mean prediction accuracy of 91% (95% CI: 90.7, 91.2) and 74% (95% CI: 73.8, 74.1) for logistic regression and multinomial logistic models, respectively. The model can identify significant variables for disease risk prediction and is computationally efficient. Conclusions: The models are robust enough to conduct both variable selection and future prediction because of their high shrinkage property and applicability to a broad range of classification problems.
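Three of the evaluation criteria named in this abstract (accuracy, Brier score, and cross-entropy) are simple to compute for a binary classifier. The sketch below shows one way to do so with NumPy; the function name `classification_metrics`, the 0.5 decision threshold, and the toy labels and probabilities are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def classification_metrics(y_true, p_hat, threshold=0.5):
    """Accuracy, Brier score, and cross-entropy for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    # Clip probabilities so the logarithms below stay finite.
    p_clip = np.clip(p_hat, 1e-12, 1.0 - 1e-12)
    # Accuracy: share of observations whose thresholded prediction matches the label.
    accuracy = float(np.mean((p_hat >= threshold) == (y_true == 1)))
    # Brier score: mean squared difference between probability and outcome.
    brier = float(np.mean((p_hat - y_true) ** 2))
    # Cross-entropy: average negative log-likelihood of the observed labels.
    cross_entropy = float(
        -np.mean(y_true * np.log(p_clip) + (1.0 - y_true) * np.log(1.0 - p_clip))
    )
    return {"accuracy": accuracy, "brier": brier, "cross_entropy": cross_entropy}


# Toy example (hypothetical values).
metrics = classification_metrics([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
print(metrics)
```

Lower Brier score and cross-entropy indicate better-calibrated probabilities, whereas accuracy only reflects the thresholded decisions.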


2021 ◽  
pp. 179-198
Author(s):  
Yan Dora Zhang ◽  
Weichang Yu ◽  
Howard D. Bondell

Author(s):  
Oliver M. Crook ◽  
Laurent Gatto ◽  
Paul D. W. Kirk

Abstract The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang & Dunson, 2011) was proposed as a fast method for performing approximate Bayesian inference in DP mixture models, by posing clustering as a Bayesian model selection (BMS) problem and avoiding the use of computationally costly Markov chain Monte Carlo methods. Here we consider how this approach may be extended to permit variable selection for clustering, and also demonstrate the benefits of Bayesian model averaging (BMA) in place of BMS. Through an array of simulation examples and well-studied examples from cancer transcriptomics, we show that our method performs competitively with the current state-of-the-art, while also offering computational benefits. We apply our approach to reverse-phase protein array (RPPA) data from The Cancer Genome Atlas (TCGA) in order to perform a pan-cancer proteomic characterisation of 5157 tumour samples. We have implemented our approach, together with the original SUGS algorithm, in an open-source R package named sugsvarsel, which accelerates analysis by performing intensive computations in C++ and provides automated parallel processing. The R package is freely available from: https://github.com/ococrook/sugsvarsel
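The idea of sequential updating with a greedy search can be sketched for a toy one-dimensional case. The code below is a simplified illustration in the spirit of SUGS, not the algorithm of Wang & Dunson (2011) or the sugsvarsel package: it assumes a fixed concentration parameter `alpha`, a Gaussian likelihood with known variance `sigma2`, and a conjugate Normal prior on cluster means (`m0`, `s02`), and it omits the model selection over `alpha` and the variable selection extension described in the abstract. All names and default values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a Normal(mean, var) distribution evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def greedy_sequential_clustering(data, alpha=1.0, sigma2=1.0, m0=0.0, s02=100.0):
    """Assign each observation, in order, to its most probable cluster."""
    labels = []
    counts, sums = [], []  # per-cluster sufficient statistics
    for x in data:
        scores = []
        for n_h, sum_h in zip(counts, sums):
            # Conjugate posterior for the cluster mean given its current members.
            post_prec = 1.0 / s02 + n_h / sigma2
            post_mean = (m0 / s02 + sum_h / sigma2) / post_prec
            # Prior weight (cluster size) times posterior predictive density.
            scores.append(n_h * normal_pdf(x, post_mean, sigma2 + 1.0 / post_prec))
        # Option of opening a new cluster: weight alpha times prior predictive density.
        scores.append(alpha * normal_pdf(x, m0, sigma2 + s02))
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == len(counts):
            counts.append(0)
            sums.append(0.0)
        # Greedy step: commit to the best allocation and update its statistics.
        counts[best] += 1
        sums[best] += x
        labels.append(best)
    return labels


labels = greedy_sequential_clustering([0.1, -0.2, 0.0, 10.1, 9.8, 10.0])
print(labels)
```

Because each observation is allocated once and never revisited, the procedure needs a single pass over the data, which is the source of the speed advantage over MCMC mentioned in the abstract. The result can depend on the order in which observations are processed.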


Author(s):  
I. Tsamardinos ◽  
G. Borboudakis ◽  
E. G. Christodoulou ◽  
O. D. Røe

The chemosensitivity of tumours to specific drugs can be predicted based on molecular quantities, such as gene expressions, miRNA expressions, and protein concentrations. This finding is important for improving drug efficacy and personalizing drug use. In this paper, the authors present an analysis strategy that, compared to prior work, retains more information in the data for analysis and may lead to improved chemosensitivity prediction. The authors apply improved methods for estimating the GI50 value of a drug (an indicator of the response to the drug), regression methods for constructing predictive models of the GI50 value, advanced variable selection techniques, such as MMPC, and a multi-task variable selection technique for identifying a small-size signature that is simultaneously predictive for several drugs and cell lines. The methods are applied to gene expression, miRNA expression, and proteomics data from 53 tumour cell lines after treatment with 120 drugs, obtained from the National Cancer Institute databases. A biological interpretation and discussion of the results is presented for the most clinically important subset of 14 drugs.


2013 ◽  
Vol 50 (7) ◽  
pp. 766-776 ◽  
Author(s):  
Yu Wang ◽  
Kai Huang ◽  
Zijun Cao

This paper develops Bayesian approaches for underground soil stratum identification and soil classification using cone penetration tests (CPTs). The uncertainty in the CPT-based soil classification using the Robertson chart is modeled explicitly in the Bayesian approaches, and the probability that the soil belongs to one of the nine soil types in the Robertson chart based on a set of CPT data is formulated using the maximum entropy principle. The proposed Bayesian approaches contain two major components: a Bayesian model class selection approach to identify the most probable number of underground soil layers and a Bayesian system identification approach to simultaneously estimate the most probable layer thicknesses and classify the soil types. Equations are derived for the Bayesian approaches, and the proposed approaches are illustrated using a real-life CPT performed at the National Geotechnical Experimentation Site (NGES) at Texas A&M University, USA. It has been shown that the proposed approaches properly identify the underground soil stratification and classify the soil type of each layer. In addition, as the number of model classes increases, the Bayesian model class selection approach identifies the soil layers progressively, starting from the statistically most significant boundary and gradually zooming into less significant ones with improved resolution. Furthermore, it is found that the evolution of the identified soil strata as the model class increases provides additional valuable information for assisting in the interpretation of CPT data in a rational and transparent manner.


2001 ◽  
Vol 20 (21) ◽  
pp. 3215-3230 ◽  
Author(s):  
Valerie Viallefont ◽  
Adrian E. Raftery ◽  
Sylvia Richardson
