When Choosing the Best Subset Is Not the Best Choice
Abstract

Background: Variable selection in linear regression settings is a much-discussed problem. Best subset selection (BSS) is often regarded as an intuitively appealing ‘gold standard’, with its use restricted mainly by its NP-hard nature. Instead, alternatives such as the least absolute shrinkage and selection operator (Lasso) or the elastic net (Enet) have become the methods of choice in high-dimensional settings. A recent proposal formulates BSS as a mixed integer optimization problem, so that much larger problems have become feasible in reasonable computation time. This has been exploited to study the prediction performance of BSS and its competitors. Here, we present an extensive simulation study assessing, instead, the variable selection performance of BSS compared with forward stepwise selection (FSS), the Lasso and the Enet. The analysis covers a wide range of settings that are challenging with regard to dimensionality, signal-to-noise ratio and correlations between relevant and irrelevant direct predictors. As a measure of performance we used the best possible F1 score for each method, so as to ensure a fair comparison irrespective of any criterion for choosing the tuning parameters.

Results: Somewhat surprisingly, only in settings where the signal-to-noise ratio was high and the variables were (nearly) uncorrelated did BSS reliably outperform the other methods. This was the case even in low-dimensional settings where the number of observations exceeded the number of variables by a factor of ten. Furthermore, FSS performed nearly identically to BSS.

Conclusion: Our results shed new light on the usual presumption that BSS is, in principle, the best choice for variable selection. More attention needs to be paid to the data-generating process when considering variable selection methods. Especially for correlated variables, convex alternatives like the Enet are not only faster but also appear to be more accurate in practical settings.
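The abstract's performance measure, the best possible F1 score attained anywhere along a method's selection path, can be illustrated for FSS with a minimal sketch. The data-generating setup below (three true predictors in an uncorrelated design with high signal-to-noise ratio) is purely illustrative and is not taken from the paper's simulation settings.

```python
import numpy as np

def forward_stepwise(X, y, max_steps):
    """Greedy forward stepwise selection: at each step, add the variable
    that most reduces the residual sum of squares of the refitted model."""
    n, p = X.shape
    selected, path = [], []
    for _ in range(max_steps):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        path.append(list(selected))  # model support after this step
    return path

def f1_support(true_support, selected):
    """F1 score of a selected support against the true support."""
    tp = len(set(selected) & set(true_support))
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(true_support)
    return 2 * precision * recall / (precision + recall)

# Toy data: 3 relevant predictors out of 10, uncorrelated, high SNR.
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
true_support = [0, 1, 2]
y = X[:, true_support] @ np.array([3.0, -2.0, 1.5]) + rng.standard_normal(n)

path = forward_stepwise(X, y, max_steps=p)
# Best possible F1: maximize over the whole path, i.e. over the tuning
# parameter (model size), exactly as the abstract's fairness criterion requires.
best_f1 = max(f1_support(true_support, s) for s in path)
```

Taking the maximum over the path sidesteps the choice of a stopping criterion (AIC, cross-validation, etc.): each method is credited with the best support it could ever have delivered, so methods are compared on their selection paths alone. The same construction applies to the Lasso or Enet by sweeping the regularization parameter instead of the step count.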