scholarly journals High-Dimensional Regression and Variable Selection Using CAR Scores

Author(s):  
Verena Zuber ◽  
Korbinian Strimmer

Variable selection is a difficult problem that is particularly challenging in the analysis of high-dimensional genomic data. Here, we introduce the CAR score, a novel and highly effective criterion for variable ranking in linear regression based on Mahalanobis-decorrelation of the explanatory variables. The CAR score provides a canonical ordering that encourages grouping of correlated predictors and down-weights antagonistic variables. It decomposes the proportion of variance explained and it is an intermediate between marginal correlation and the standardized regression coefficient. As a population quantity, any preferred inference scheme can be applied for its estimation. Using simulations, we demonstrate that variable selection by CAR scores is very effective and yields prediction errors and true and false positive rates that compare favorably with modern regression techniques such as elastic net and boosting. We illustrate our approach by analyzing data concerned with diabetes progression and with the effect of aging on gene expression in the human brain. The R package “care” implementing CAR score regression is available from CRAN.

2011 ◽  
Vol 55 (10) ◽  
pp. 2807-2818 ◽  
Author(s):  
Deukwoo Kwon ◽  
Maria Teresa Landi ◽  
Marina Vannucci ◽  
Haleem J. Issaq ◽  
DaRue Prieto ◽  
...  

Author(s):  
Wencan Zhu ◽  
Céline Lévy-Leduc ◽  
Nils Ternès

Abstract Motivation In genomic studies, identifying biomarkers associated with a variable of interest is a major concern in biomedical research. Regularized approaches are classically used to perform variable selection in high-dimensional linear models. However, these methods can fail in highly correlated settings. Results We propose a novel variable selection approach called WLasso, taking these correlations into account. It consists in rewriting the initial high-dimensional linear model to remove the correlation between the biomarkers (predictors) and in applying the generalized Lasso criterion. The performance of WLasso is assessed using synthetic data in several scenarios and compared with recent alternative approaches. The results show that when the biomarkers are highly correlated, WLasso outperforms the other approaches in sparse high-dimensional frameworks. The method is also illustrated on publicly available gene expression data in breast cancer. Availabilityand implementation Our method is implemented in the WLasso R package which is available from the Comprehensive R Archive Network (CRAN). Supplementary information Supplementary data are available at Bioinformatics online.


Mathematics ◽  
2021 ◽  
Vol 9 (3) ◽  
pp. 222
Author(s):  
Juan C. Laria ◽  
M. Carmen Aguilera-Morillo ◽  
Enrique Álvarez ◽  
Rosa E. Lillo ◽  
Sara López-Taruella ◽  
...  

Over the last decade, regularized regression methods have offered alternatives for performing multi-marker analysis and feature selection in a whole genome context. The process of defining a list of genes that will characterize an expression profile remains unclear. It currently relies upon advanced statistics and can use an agnostic point of view or include some a priori knowledge, but overfitting remains a problem. This paper introduces a methodology to deal with the variable selection and model estimation problems in the high-dimensional set-up, which can be particularly useful in the whole genome context. Results are validated using simulated data and a real dataset from a triple-negative breast cancer study.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Matthew D. Koslovsky ◽  
Marina Vannucci

An amendment to this paper has been published and can be accessed via the original article.


Sign in / Sign up

Export Citation Format

Share Document