Stability selection for lasso, ridge and elastic net implemented with AFT models

Author(s):  
Md Hasinur Rahaman Khan ◽  
Anamika Bhadra ◽  
Tamanna Howlader

Abstract The instability of model selection is a major concern for data sets containing a large number of covariates. We focus on stability selection, a technique used to improve variable selection performance for a range of selection methods by aggregating the results of applying a selection procedure to sub-samples of the data, in a setting where the observations are subject to right censoring. Accelerated failure time (AFT) models have proved useful in many contexts, including heavy censoring (as, for example, in cancer survival) and high dimensionality (as, for example, in microarray data). We implement the stability selection approach using three variable selection techniques, namely the lasso, ridge regression, and the elastic net, applied to censored data using AFT models. We compare the performance of these regularized techniques with and without stability selection in simulation studies and in two real data examples: a breast cancer data set and a diffuse large B-cell lymphoma data set. The results suggest that stability selection consistently yields stable variable selection and that, as the dimension of the data increases, methods with stability selection increasingly outperform their counterparts without it, irrespective of collinearity between the covariates.
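
To make the aggregation step concrete, the following Python sketch illustrates the core stability selection loop: repeatedly subsample the data, run a penalized fit, and record how often each covariate receives a non-zero coefficient. scikit-learn has no censored AFT regression, so an elastic net on an uncensored response stands in for the paper's AFT-based lasso, ridge and elastic net fits; the subsample fraction, penalty settings and selection threshold are illustrative choices, not the authors' settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def stability_selection(X, y, alpha=0.1, l1_ratio=0.9,
                        n_subsamples=100, frac=0.5, threshold=0.6, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)   # random sub-sample
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10_000)
        model.fit(X[idx], y[idx])
        counts += (model.coef_ != 0)          # which covariates survived this subsample
    freq = counts / n_subsamples              # empirical selection probabilities
    return np.flatnonzero(freq >= threshold), freq

# Toy usage: 200 observations, 50 covariates, 5 truly active ones.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([2.0, -1.5, 1.0, 0.8, -0.6]) + rng.standard_normal(200)
selected, freq = stability_selection(X, y)
print("stable covariates:", selected)
```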

2018 ◽  
Vol 34 (7) ◽  
Author(s):  
Manuel Lozano ◽  
Lara Manyes ◽  
Juanjo Peiró ◽  
Adina Iftimi ◽  
José María Ramada

Multidisciplinary research in public health is approached using methods from many scientific disciplines. One of the main characteristics of this type of research is dealing with large data sets. Classic statistical variable selection methods, known as "screen and clean" and used in a single step, select the variables with the greatest explanatory weight in the model. These methods, commonly used in public health research, may induce masking and multicollinearity, excluding variables that are relevant to the experts in each discipline and skewing the results. Specific techniques such as penalized regressions and Bayesian statistics are used to address this problem; they offer more balanced results among subsets of variables, but with less restrictive selection thresholds. Using a combination of classical methods, a three-step procedure is proposed in this manuscript that captures the relevant variables of each scientific discipline, minimizes the number of variables selected in each of them, and obtains a balanced distribution that explains most of the variability. This procedure was applied to a dataset from a public health study. Compared with the single-step methods, the proposed method shows a greater reduction in the number of variables, as well as a balanced distribution among the scientific disciplines associated with the response variable. We propose an innovative procedure for variable selection and apply it to our dataset. Furthermore, we compare the new method with the classic single-step procedures.
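
The abstract does not spell out the three steps, so the sketch below only illustrates the general idea of discipline-aware selection: screen and penalize within each discipline's block of variables, cap the number kept per discipline for balance, and fit one pooled model on the union. The column grouping, screening rule, penalty and caps are all hypothetical assumptions, not the authors' procedure.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def discipline_aware_selection(X, y, discipline_of_column, max_per_discipline=3):
    # discipline_of_column: one label per column of X indicating its discipline.
    selected = []
    for disc in sorted(set(discipline_of_column)):
        cols = [j for j, d in enumerate(discipline_of_column) if d == disc]
        # Step 1: marginal screening inside the discipline block.
        scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in cols]
        keep = [c for _, c in sorted(zip(scores, cols), reverse=True)][: 2 * max_per_discipline]
        # Step 2: penalized selection inside the screened block.
        lasso = LassoCV(cv=5).fit(X[:, keep], y)
        ranked = [c for c, b in sorted(zip(keep, np.abs(lasso.coef_)),
                                       key=lambda t: -t[1]) if b > 0]
        selected.extend(ranked[:max_per_discipline])   # cap per discipline for balance
    # Step 3: one pooled model on the union of per-discipline selections.
    final = LinearRegression().fit(X[:, selected], y)
    return selected, final
```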


2021 ◽  
Author(s):  
Reetika Sarkar ◽  
Sithija Manage ◽  
Xiaoli Gao

Abstract Background: High-dimensional genomic data studies often exhibit strong correlations, which result in instability and inconsistency of the estimates obtained from commonly used regularization approaches, including the lasso, MCP, and related methods. Result: In this paper, we perform a comparative study of regularization approaches for variable selection under different correlation structures and propose a two-stage procedure named rPGBS to address the issue of stable variable selection in various strong correlation settings. This approach involves repeatedly running a two-stage hierarchical procedure consisting of random pseudo-group clustering and bi-level variable selection. Conclusion: Both the simulation studies and the high-dimensional genomic data analysis demonstrate the advantage of the proposed rPGBS method over most commonly used regularization methods. In particular, rPGBS yields more stable selection of variables across a variety of correlation settings than recent work addressing variable selection with strong correlations. Moreover, rPGBS is computationally efficient across various settings.
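
A schematic of the "repeat {random pseudo-group clustering + bi-level selection}" idea is sketched below. The group-level score (sum of squared marginal correlations) and the within-group lasso are stand-ins for the bi-level selector actually used by rPGBS; the group size, number of repeats and stability threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def rpgbs_sketch(X, y, group_size=5, n_repeats=50, n_groups_kept=4,
                 alpha=0.05, threshold=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_repeats):
        perm = rng.permutation(p)                        # random pseudo-group clustering
        groups = [perm[i:i + group_size] for i in range(0, p, group_size)]
        # Level 1: keep the groups most associated with the response.
        score = lambda g: float(np.sum([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in g]))
        kept = sorted(groups, key=score, reverse=True)[:n_groups_kept]
        active = np.concatenate(kept)
        # Level 2: sparse selection of individual variables inside the kept groups.
        lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X[:, active], y)
        counts[active[lasso.coef_ != 0]] += 1
    freq = counts / n_repeats                            # stability of each variable
    return np.flatnonzero(freq >= threshold), freq
```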


2019 ◽  
Vol 29 (3) ◽  
pp. 677-694 ◽  
Author(s):  
Oliver Dukes ◽  
Stijn Vansteelandt

The problem of how to best select variables for confounding adjustment forms one of the key challenges in the evaluation of exposure or treatment effects in observational studies. Routine practice is often based on stepwise selection procedures that use hypothesis testing, change-in-estimate assessments or the lasso, which have all been criticised for – amongst other things – not giving sufficient priority to the selection of confounders. This has prompted vigorous recent activity in developing procedures that prioritise the selection of confounders, while preventing the selection of so-called instrumental variables that are associated with exposure, but not outcome (after adjustment for the exposure). A major drawback of all these procedures is that there is no finite sample size at which they are guaranteed to deliver treatment effect estimators and associated confidence intervals with adequate performance. This is the result of the estimator jumping back and forth between different selected models, and standard confidence intervals ignoring the resulting model selection uncertainty. In this paper, we will develop insight into this by evaluating the finite-sample distribution of the exposure effect estimator in linear regression, under a number of the aforementioned confounder selection procedures. We will show that by making clever use of propensity scores, a simple and generic solution is obtained in the context of generalized linear models, which overcomes this concern (under weaker conditions than competing proposals). Specifically, we propose to use separate regularized regressions for the outcome and propensity score models in order to construct a doubly robust ‘g-estimator’; when these models are sufficiently sparse and correctly specified, standard confidence intervals for the g-estimator implicitly incorporate the uncertainty induced by the variable selection procedure.
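
As a rough illustration of the construction described in the last sentence, the sketch below fits two regularized nuisance regressions, an L1-penalized propensity score model for the exposure given the confounders and a lasso outcome regression on the confounders alone, and then solves a simple linear estimating equation for the exposure effect. This is one standard doubly robust, g-estimation-flavoured recipe; it is not claimed to reproduce the authors' exact estimator, and all tuning choices are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV

def dr_g_estimate(L, A, Y):
    # Propensity score model: P(A = 1 | L) with an L1 penalty.
    ps_model = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5).fit(L, A)
    pi_hat = ps_model.predict_proba(L)[:, 1]
    # Outcome regression for Y given the confounders only (exposure left out).
    out_model = LassoCV(cv=5).fit(L, Y)
    m_hat = out_model.predict(L)
    # Solve the estimating equation  sum (A - pi)(Y - psi*A - m) = 0  for psi.
    resid_a = A - pi_hat
    psi_hat = np.sum(resid_a * (Y - m_hat)) / np.sum(resid_a * A)
    return psi_hat

# Toy usage with a known exposure effect of 2.0.
rng = np.random.default_rng(0)
L = rng.standard_normal((500, 10))
A = rng.binomial(1, 1 / (1 + np.exp(-L[:, 0])))
Y = 2.0 * A + L[:, 0] + 0.5 * L[:, 1] + rng.standard_normal(500)
print(dr_g_estimate(L, A, Y))
```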


2016 ◽  
Vol 27 (3) ◽  
pp. 785-797 ◽  
Author(s):  
Ismaïl Ahmed ◽  
Antoine Pariente ◽  
Pascale Tubert-Bitter

Background All methods routinely used to generate safety signals from pharmacovigilance databases rely on disproportionality analyses of counts aggregating patients' spontaneous reports. Recently, it was proposed to analyze individual spontaneous reports directly using Bayesian lasso logistic regressions. Nevertheless, this raises the issue of choosing an adequate regularization parameter in a variable selection framework while accounting for the computational constraints due to the high dimension of the data. Purpose Our main objective is to propose a method that exploits the subsampling idea from Stability Selection, a variable selection procedure combining subsampling with a high-dimensional selection algorithm, and adapts it to the specificities of spontaneous reporting data, which are characterized by their large size, their binary nature and their sparsity. Materials and method Given the large imbalance between the presence and absence of a given adverse event, we propose an alternative subsampling scheme to that of Stability Selection, resulting in an over-representation of the minority class and a drastic reduction in the number of observations in each subsample. Simulations are used to help define the detection threshold with regard to the average proportion of false signals. They are also used to compare the performance of the proposed sampling scheme with that originally proposed for Stability Selection. Finally, we compare the proposed method to the gamma Poisson shrinker, a disproportionality method, and to a lasso logistic regression approach through an empirical study conducted on the French national pharmacovigilance database and two sets of reference signals. Results Simulations show that the proposed sampling strategy performs better in terms of false discoveries and is faster than the equiprobable sampling of Stability Selection. The empirical evaluation illustrates the better performance of the proposed method compared with the gamma Poisson shrinker and the lasso in terms of the number of reference signals retrieved.
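
The subsampling scheme can be sketched as follows: every subsample keeps all reports mentioning the adverse event (the rare class) and only a small random draw of the remaining reports, then an L1-penalized logistic regression is fit and non-zero coefficients are counted across subsamples. The control fraction, penalty and detection threshold below are illustrative placeholders, not the values calibrated in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def imbalanced_stability_selection(X, y, n_subsamples=100, control_frac=0.05,
                                   C=0.5, threshold=0.7, seed=0):
    rng = np.random.default_rng(seed)
    cases = np.flatnonzero(y == 1)           # reports with the adverse event
    controls = np.flatnonzero(y == 0)        # all other reports
    counts = np.zeros(X.shape[1])
    for _ in range(n_subsamples):
        sub_controls = rng.choice(controls, size=int(control_frac * len(controls)),
                                  replace=False)
        idx = np.concatenate([cases, sub_controls])      # minority class over-represented
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X[idx], y[idx])
        counts += (clf.coef_.ravel() != 0)
    freq = counts / n_subsamples
    return np.flatnonzero(freq >= threshold), freq       # flagged drug covariates
```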


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Hannah E. Correia

Abstract Ecologists and fisheries managers are interested in monitoring economically important marine fish species and using these data to inform management strategies. Determining the environmental factors that best predict changes in these populations, particularly under rapid climate change, is a priority. I illustrate the application of the least squares-based spline estimation and group LASSO (LSSGLASSO) procedure for selection of coefficient functions in single index varying coefficient models (SIVCMs) on an ecological data set that includes spatiotemporal environmental covariates suspected to play a role in the catches and weights of six groundfish species. Temporal trends in variable selection were apparent, though the selection of variables was largely unrelated to common North Pacific climate indices. These results indicate that the strength of an environmental variable's effect on a groundfish population may change over time, and not necessarily in step with known low-frequency patterns of ocean-climate variability commonly attributed to large-scale regime shifts in the North Pacific. My application of the LSSGLASSO procedure for SIVCMs to deep water species using environmental data from various sources illustrates how variable selection with a flexible model structure can produce informative inference for remote and hard-to-reach animal populations.
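
The varying-coefficient and group-selection ingredients can be sketched as follows: each coefficient function beta_j(u) is expanded in a spline basis of the index variable u, the expanded design then has one column block per covariate, and a group lasso zeroes out whole blocks, i.e. whole coefficient functions. The proximal-gradient solver and all tuning values are illustrative, and the single-index estimation step of a full SIVCM is omitted here by treating u as known (e.g. time).

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

def varying_coefficient_design(X, u, n_knots=5, degree=3):
    basis = SplineTransformer(n_knots=n_knots, degree=degree).fit_transform(u.reshape(-1, 1))
    p = X.shape[1]
    k = basis.shape[1]
    Z = np.hstack([basis * X[:, [j]] for j in range(p)])   # block j carries beta_j(u) * x_j
    groups = np.repeat(np.arange(p), k)                    # one group per covariate
    return Z, groups

def group_lasso(Z, y, groups, lam, n_iter=500):
    # Proximal gradient for 0.5/n ||y - Z b||^2 + lam * sum_g ||b_g||_2.
    n, d = Z.shape
    beta = np.zeros(d)
    L = np.linalg.norm(Z, 2) ** 2 / n          # Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = Z.T @ (Z @ beta - y) / n
        z = beta - grad / L
        for g in np.unique(groups):
            block = groups == g
            norm = np.linalg.norm(z[block])
            shrink = max(0.0, 1.0 - (lam / L) / norm) if norm > 0 else 0.0
            beta[block] = shrink * z[block]    # block soft-thresholding drops whole functions
    return beta

# Covariate j is "selected" when its whole coefficient-function block is non-zero:
#   selected = [j for j in range(p) if np.any(beta[groups == j])]
```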


2015 ◽  
Vol 744-746 ◽  
pp. 1222-1225
Author(s):  
Peng Tian ◽  
Gao Feng Zhan ◽  
Lei Nai

By combining an RBF neural network with the MIV (mean impact value) algorithm, the main factors influencing asphalt mixture pavement performance are selected. First, MIV values are calculated with the MIV method, and variables are selected according to the magnitude of their MIV; 8 of the 12 candidate variables are retained. A new RBF neural network is then built from the variables that have the greatest impact on the output. Comparing the simulation results of the two RBF networks shows that the MIV method is feasible for variable selection: with MIV-based selection, the RBF simulation results are obtained faster and more accurately.
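
Mean impact value screening is easy to sketch: perturb each input column up and down by a fixed percentage and take the mean difference of the trained model's predictions as that variable's impact. scikit-learn has no RBF network, so a kernel ridge regressor with an RBF kernel stands in below; the "keep 8 of 12" cut mirrors the abstract, while the data, kernel settings and the conventional ±10% perturbation are illustrative.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def mean_impact_values(model, X, perturbation=0.10):
    mivs = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_up, X_down = X.copy(), X.copy()
        X_up[:, j] *= 1 + perturbation         # increase variable j by 10%
        X_down[:, j] *= 1 - perturbation       # decrease variable j by 10%
        mivs[j] = np.mean(model.predict(X_up) - model.predict(X_down))
    return mivs

# Fit on all 12 candidate variables, rank by |MIV|, keep the top 8, then refit.
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 1.5, size=(150, 12))
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] ** 2 + 0.1 * rng.standard_normal(150)
full_model = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5).fit(X, y)
miv = mean_impact_values(full_model, X)
keep = np.argsort(-np.abs(miv))[:8]
reduced_model = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5).fit(X[:, keep], y)
```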


2019 ◽  
Vol 9 (3) ◽  
pp. 4169-4175
Author(s):  
R. F. Kamala ◽  
P. R. J. Thangaiah

In feature subset selection, the variable selection procedure selects a subset of the most relevant features. Filter and wrapper methods are the main categories of variable selection methods. Feature subset selection is akin to data pre-processing and is applied to reduce the feature dimensions of very large datasets. In this paper, to deal with such problems, feature subset selection driven by the fitness evaluation of the classifier is introduced to simplify the classification task and improve classification performance. To reduce the dimensionality of the feature space, a novel approach for selecting optimal features, the two-stage selection of feature subsets (TSFS) method, is developed and assessed both theoretically and experimentally. The results of this method include improvements in performance measures such as the efficiency, accuracy, and scalability of machine learning algorithms. The proposed method is compared with known relevant methods on benchmark databases. It performs better than the earlier hybrid feature selection methodologies discussed in related work with regard to classifier accuracy and error.
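
The abstract does not describe the internals of TSFS, so the sketch below only shows a generic two-stage filter-then-wrapper pipeline in the same spirit: a cheap mutual-information filter keeps a moderate pool of features, and a wrapper (sequential forward selection scored by cross-validated classifier accuracy) picks the final subset. Every component and parameter is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=100, n_informative=8, random_state=0)

two_stage = make_pipeline(
    SelectKBest(mutual_info_classif, k=30),                       # stage 1: filter
    SequentialFeatureSelector(KNeighborsClassifier(),             # stage 2: wrapper
                              n_features_to_select=10, cv=5,
                              scoring="accuracy"),
    KNeighborsClassifier(),
)
two_stage.fit(X, y)
print("training accuracy after two-stage selection:", two_stage.score(X, y))
```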


Methodology ◽  
2018 ◽  
Vol 14 (4) ◽  
pp. 177-188 ◽  
Author(s):  
Martin Schultze ◽  
Michael Eid

Abstract. In the construction of scales intended for the use in cross-cultural studies, the selection of items needs to be guided not only by traditional criteria of item quality, but has to take information about the measurement invariance of the scale into account. We present an approach to automated item selection which depicts the process as a combinatorial optimization problem and aims at finding a scale which fulfils predefined target criteria – such as measurement invariance across cultures. The search for an optimal solution is performed using an adaptation of the [Formula: see text] Ant System algorithm. The approach is illustrated using an application to item selection for a personality scale assuming measurement invariance across multiple countries.
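
A toy ant-system loop for this kind of combinatorial item selection is sketched below: each "ant" samples a fixed-size item subset with probabilities proportional to pheromone, the best subset of an iteration reinforces its items, and evaporation keeps the search exploring. The fitness function is a placeholder for a real target criterion such as a measurement-invariance index, and this is a generic sketch rather than the authors' adapted algorithm.

```python
import numpy as np

def ant_system_item_selection(n_items, scale_size, fitness,
                              n_ants=20, n_iter=100, evaporation=0.1, seed=0):
    rng = np.random.default_rng(seed)
    pheromone = np.ones(n_items)
    best_subset, best_fit = None, -np.inf
    for _ in range(n_iter):
        iter_best, iter_fit = None, -np.inf
        for _ in range(n_ants):
            probs = pheromone / pheromone.sum()
            subset = rng.choice(n_items, size=scale_size, replace=False, p=probs)
            f = fitness(subset)
            if f > iter_fit:
                iter_best, iter_fit = subset, f
        pheromone *= 1 - evaporation                 # evaporation
        pheromone[iter_best] += iter_fit             # reinforce the iteration's best scale
        if iter_fit > best_fit:
            best_subset, best_fit = iter_best, iter_fit
    return best_subset, best_fit

# Placeholder fitness: prefer items whose (hypothetical) loadings are large.
loadings = np.random.default_rng(1).uniform(0.2, 0.9, size=40)
subset, fit = ant_system_item_selection(40, scale_size=10,
                                        fitness=lambda s: loadings[s].mean())
print(sorted(subset.tolist()), round(fit, 3))
```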

