A fast and consistent variable selection method for high-dimensional multivariate linear regression with a large number of explanatory variables

Ryoya Oda; Hirokazu Yanagihara

doi:10.1214/20-ejs1701

A fast and consistent variable selection method for high-dimensional multivariate linear regression with a large number of explanatory variables

Electronic Journal of Statistics ◽

10.1214/20-ejs1701 ◽

2020 ◽

Vol 14 (1) ◽

pp. 1386-1412

Author(s):

Ryoya Oda ◽

Hirokazu Yanagihara

Keyword(s):

Linear Regression ◽

Variable Selection ◽

Selection Method ◽

Multivariate Linear Regression ◽

High Dimensional ◽

Explanatory Variables ◽

Variable Selection Method

Download Full-text

A Consistent Likelihood-Based Variable Selection Method in Normal Multivariate Linear Regression

Intelligent Decision Technologies - Smart Innovation, Systems and Technologies ◽

10.1007/978-981-16-2765-1_33 ◽

2021 ◽

pp. 391-401

Author(s):

Ryoya Oda ◽

Hirokazu Yanagihara

Keyword(s):

Linear Regression ◽

Variable Selection ◽

Selection Method ◽

Multivariate Linear Regression ◽

Variable Selection Method

Download Full-text

BS-SIM: An effective variable selection method for high-dimensional single index model

Electronic Journal of Statistics ◽

10.1214/17-ejs1329 ◽

2017 ◽

Vol 11 (2) ◽

pp. 3522-3548 ◽

Cited By ~ 1

Author(s):

Longjie Cheng ◽

Peng Zeng ◽

Yu Zhu

Keyword(s):

Variable Selection ◽

Selection Method ◽

High Dimensional ◽

Single Index ◽

Index Model ◽

Variable Selection Method ◽

Single Index Model

Download Full-text

Kick-one-out-based variable selection method for Euclidean distance-based classifier in high-dimensional settings

Journal of Multivariate Analysis ◽

10.1016/j.jmva.2021.104756 ◽

2021 ◽

Vol 184 ◽

pp. 104756

Author(s):

Tomoyuki Nakagawa ◽

Hiroki Watanabe ◽

Masashi Hyodo

Keyword(s):

Variable Selection ◽

Euclidean Distance ◽

Selection Method ◽

High Dimensional ◽

Variable Selection Method

Download Full-text

A nonlinear multi-dimensional variable selection method for high dimensional data: Sparse MAVE

Computational Statistics & Data Analysis ◽

10.1016/j.csda.2008.03.003 ◽

2008 ◽

Vol 52 (9) ◽

pp. 4512-4520 ◽

Cited By ~ 36

Author(s):

Qin Wang ◽

Xiangrong Yin

Keyword(s):

Variable Selection ◽

High Dimensional Data ◽

Selection Method ◽

High Dimensional ◽

Variable Selection Method ◽

Dimensional Variable

Download Full-text

A consistent variable selection method in high-dimensional canonical discriminant analysis

Journal of Multivariate Analysis ◽

10.1016/j.jmva.2019.104561 ◽

2020 ◽

Vol 175 ◽

pp. 104561 ◽

Cited By ~ 1

Author(s):

Ryoya Oda ◽

Yuya Suzuki ◽

Hirokazu Yanagihara ◽

Yasunori Fujikoshi

Keyword(s):

Discriminant Analysis ◽

Variable Selection ◽

Selection Method ◽

Canonical Discriminant Analysis ◽

High Dimensional ◽

Variable Selection Method

Download Full-text

Using a supervised principal components analysis for variable selection in high-dimensional datasets reduces false discovery rates

10.1101/2020.05.15.097774 ◽

2020 ◽

Author(s):

Insha Ullah ◽

Kerrie Mengersen ◽

Anthony Pettitt ◽

Benoit Liquet

Keyword(s):

Gene Expression ◽

Variable Selection ◽

Selection Method ◽

High Dimensional ◽

Computationally Efficient ◽

Statistical Tools ◽

Multivariate Statistical ◽

Variable Selection Method ◽

False Discovery ◽

High Dimensional Datasets

AbstractHigh-dimensional datasets, where the number of variables ‘p’ is much larger compared to the number of samples ‘n’, are ubiquitous and often render standard classification and regression techniques unreliable due to overfitting. An important research problem is feature selection — ranking of candidate variables based on their relevance to the outcome variable and retaining those that satisfy a chosen criterion. In this article, we propose a computationally efficient variable selection method based on principal component analysis. The method is very simple, accessible, and suitable for the analysis of high-dimensional datasets. It allows to correct for population structure in genome-wide association studies (GWAS) which otherwise would induce spurious associations and is less likely to overfit. We expect our method to accurately identify important features but at the same time reduce the False Discovery Rate (FDR) (the expected proportion of erroneously rejected null hypotheses) through accounting for the correlation between variables and through de-noising data in the training phase, which also make it robust to outliers in the training data. Being almost as fast as univariate filters, our method allows for valid statistical inference. The ability to make such inferences sets this method apart from most of the current multivariate statistical tools designed for today’s high-dimensional data. We demonstrate the superior performance of our method through extensive simulations. A semi-real gene-expression dataset, a challenging childhood acute lymphoblastic leukemia (CALL) gene expression study, and a GWAS that attempts to identify single-nucleotide polymorphisms (SNPs) associated with the rice grain length further demonstrate the usefulness of our method in genomic applications.Author summaryAn integral part of modern statistical research is feature selection, which has claimed various scientific discoveries, especially in the emerging genomics applications such as gene expression and proteomics studies, where data has thousands or tens of thousands of features but a limited number of samples. However, in practice, due to unavailability of suitable multivariate methods, researchers often resort to univariate filters when it comes to deal with a large number of variables. These univariate filters do not take into account the dependencies between variables because they independently assess variables one-by-one. This leads to loss of information, loss of statistical power (the probability of correctly rejecting the null hypothesis) and potentially biased estimates. In our paper, we propose a new variable selection method. Being computationally efficient, our method allows for valid inference. The ability to make such inferences sets this method apart from most of the current multivariate statistical tools designed for today’s high-dimensional data.

Download Full-text

A robust and efficient variable selection method for linear regression

Journal of Applied Statistics ◽

10.1080/02664763.2021.1962259 ◽

2021 ◽

pp. 1-16

Author(s):

Zhuoran Yang ◽

Liya Fu ◽

You-Gan Wang ◽

Zhixiong Dong ◽

Yunlu Jiang

Keyword(s):

Linear Regression ◽

Variable Selection ◽

Selection Method ◽

Variable Selection Method

Download Full-text

New Variable Selection Method Using Interval Segmentation Purity with Application to Blockwise Kernel Transform Support Vector Machine Classification of High-Dimensional Microarray Data

Journal of Chemical Information and Modeling ◽

10.1021/ci900032q ◽

2009 ◽

Vol 49 (8) ◽

pp. 2002-2009 ◽

Cited By ~ 9

Author(s):

Li-Juan Tang ◽

Wen Du ◽

Hai-Yan Fu ◽

Jian-Hui Jiang ◽

Hai-Long Wu ◽

...

Keyword(s):

Support Vector Machine ◽

Variable Selection ◽

Microarray Data ◽

Selection Method ◽

High Dimensional ◽

Support Vector ◽

Variable Selection Method ◽

Support Vector Machine Classification

Download Full-text

Variable Selection Method Based on Partial Mutual Information and Its Application to NOx Emission Prediction

2020 39th Chinese Control Conference (CCC) ◽

10.23919/ccc50068.2020.9189070 ◽

2020 ◽

Author(s):

QIN Tianmu ◽

ZHANG Jinzhe ◽

YOU Mo ◽

YANG Tingting

Keyword(s):

Mutual Information ◽

Variable Selection ◽

Selection Method ◽

Nox Emission ◽

Variable Selection Method

Download Full-text

Predictive and Descriptive CoMFA Models: The Effect of Variable Selection

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207321666180212162028 ◽

2018 ◽

Vol 21 (2) ◽

pp. 117-124 ◽

Cited By ~ 4

Author(s):

Bakhtyar Sepehri ◽

Nematollah Omidikia ◽

Mohsen Kompany-Zareh ◽

Raouf Ghavami

Keyword(s):

Variable Selection ◽

Predictive Power ◽

Selection Method ◽

Data Sets ◽

Data Set ◽

Comfa Model ◽

Variable Selection Method

Aims & Scope: In this research, 8 variable selection approaches were used to investigate the effect of variable selection on the predictive power and stability of CoMFA models. Materials & Methods: Three data sets including 36 EPAC antagonists, 79 CD38 inhibitors and 57 ATAD2 bromodomain inhibitors were modelled by CoMFA. First of all, for all three data sets, CoMFA models with all CoMFA descriptors were created then by applying each variable selection method a new CoMFA model was developed so for each data set, 9 CoMFA models were built. Obtained results show noisy and uninformative variables affect CoMFA results. Based on created models, applying 5 variable selection approaches including FFD, SRD-FFD, IVE-PLS, SRD-UVEPLS and SPA-jackknife increases the predictive power and stability of CoMFA models significantly. Result & Conclusion: Among them, SPA-jackknife removes most of the variables while FFD retains most of them. FFD and IVE-PLS are time consuming process while SRD-FFD and SRD-UVE-PLS run need to few seconds. Also applying FFD, SRD-FFD, IVE-PLS, SRD-UVE-PLS protect CoMFA countor maps information for both fields.

Download Full-text