Robust Variable Selection with Optimality Guarantees for High-Dimensional Logistic Regression

Stats, 2021, Vol 4 (3), pp. 665-681
Author(s): Luca Insolia, Ana Kenney, Martina Calovi, Francesca Chiaromonte

High-dimensional classification studies have become widespread across various domains. The large dimensionality, coupled with the possible presence of data contamination, motivates the use of robust, sparse estimation methods to improve model interpretability and ensure that the majority of observations agree with the underlying parametric model. In this study, we propose a robust and sparse estimator for logistic regression models, which simultaneously tackles the presence of outliers and/or irrelevant features. Specifically, we propose the use of L0-constraints and mixed-integer conic programming techniques to solve the underlying double combinatorial problem in a framework that allows one to pursue optimality guarantees. We use our proposal to investigate the main drivers of honey bee (Apis mellifera) loss through the annual winter loss survey data collected by the Pennsylvania State Beekeepers Association. Previous studies mainly focused on predictive performance; our approach, by contrast, produces a more interpretable classification model and provides evidence for several outlying observations within the survey data. We compare our proposal with existing heuristic methods and non-robust procedures, demonstrating its effectiveness. In addition to the application to honey bee loss, we present a simulation study where our proposal outperforms other methods across most performance measures and settings.
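
For orientation, one common way to write a problem of this kind is a mean-shift formulation, in which every observation receives its own shift parameter and both the coefficient vector and the shift vector are constrained to be sparse. The notation below (β, φ, k_p, k_n) is assumed for illustration and is not taken from the abstract; the formulation in the paper itself may differ in its details.

```latex
\min_{\beta \in \mathbb{R}^p,\ \phi \in \mathbb{R}^n}
\sum_{i=1}^{n} \Big[ \log\!\big(1 + e^{x_i^\top \beta + \phi_i}\big)
  - y_i \big(x_i^\top \beta + \phi_i\big) \Big]
\quad \text{s.t.} \quad
\|\beta\|_0 \le k_p, \qquad \|\phi\|_0 \le k_n
```

Here y_i ∈ {0, 1}, the objective is the negative logistic log-likelihood, a nonzero β_j marks feature j as selected, and a nonzero φ_i marks observation i as a potential outlier. The two L0 constraints together are what makes the problem doubly combinatorial.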

2019
Author(s): Jacques Muthusi, Samuel Mwalili, Peter Young

Introduction: Reproducible research is increasingly gaining interest in the research community. Automating the production of research manuscript tables from statistical software can help increase the reproducibility of findings. Logistic regression is used in studying disease prevalence and associated factors in epidemiological studies and can be easily performed using widely available software including SAS, SUDAAN, Stata or R. However, output from these packages must be processed further to make it readily presentable. A number of procedures have been developed to organize regression output, though many of them suffer from limited flexibility, complexity, a lack of validation checks for input parameters, and an inability to incorporate survey design. Methods: We developed a SAS macro, %svy_logistic_regression, for fitting simple and multiple logistic regression models. The macro also creates publication-ready tables from survey or non-survey data, which aims to increase the transparency of data analyses. It further significantly reduces turn-around time for conducting analysis and preparing output tables while also addressing the limitations of existing procedures. Results: We demonstrate the use of the macro in the analysis of the 2013-2014 National Health and Nutrition Examination Survey (NHANES), a complex survey designed to assess the health and nutritional status of adults and children in the United States. The output presented here is directly from the macro and is consistent with how regression results are often presented in the epidemiological and biomedical literature, with unadjusted and adjusted model results presented side by side. Conclusions: The SAS code presented in this macro is comprehensive and easy to follow, manipulate, and extend to other areas of interest. It can also be incorporated quickly by the statistician for immediate use. It is an especially valuable tool for generating quality, easy-to-review tables which can be incorporated directly in a publication.


Author(s): Beibei Yuan, Willem Heiser, Mark de Rooij

The δ-machine is a statistical learning tool for classification based on dissimilarities or distances between profiles of the observations and profiles of a representation set, which was proposed by Yuan et al. (J Classif 36(3): 442–470, 2019). So far, the δ-machine was restricted to continuous predictor variables only. In this article, we extend the δ-machine to handle continuous, ordinal, nominal, and binary predictor variables. We use a tailored dissimilarity function for mixed-type variables defined by Gower. This measure has the properties of a Manhattan distance. We develop, in a similar vein, a Euclidean dissimilarity function for mixed-type variables. In simulation studies we compare the performance of the two dissimilarity functions and we compare the predictive performance of the δ-machine to logistic regression models. We generated data according to two population distributions where the type of predictor variables, the distribution of categorical variables, and the number of predictor variables were varied. The performance of the δ-machine using the two dissimilarity functions and different types of representation set was investigated. The simulation studies showed that the adjusted Euclidean dissimilarity function performed better than the adjusted Gower dissimilarity function; that the δ-machine outperformed logistic regression; and that, for constructing the representation set, K-medoids clustering yielded fewer active exemplars than K-means clustering while maintaining accuracy. We also applied the δ-machine to an empirical example, discussed its interpretation in detail, and compared the classification performance with five other classification methods. The results showed that the δ-machine has a good balance between accuracy and interpretability.
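
To make the Manhattan-like character of a Gower-type dissimilarity concrete, here is a minimal sketch of the textbook Gower measure for two mixed-type profiles. The function name, the "num"/"cat" labels, and the example data are hypothetical; the adjusted variants studied in the article are not reproduced here, and ordinal variables are left out for brevity.

```python
from typing import Optional, Sequence, Union

Value = Union[float, str]


def gower_dissimilarity(
    a: Sequence[Value],
    b: Sequence[Value],
    kinds: Sequence[str],
    ranges: Sequence[Optional[float]],
) -> float:
    """Average per-variable dissimilarity between two mixed-type profiles.

    kinds[j] is "num" (range-scaled absolute difference) or "cat"
    (0 if the categories match, 1 otherwise). ranges[j] is the observed
    range of variable j and is only used for numeric variables.
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            # Scale by the range so each variable contributes at most 1.
            total += abs(x - y) / rng if rng else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


# Hypothetical profiles: (age, residence, smoker)
profile_a = (30.0, "urban", "yes")
profile_b = (50.0, "rural", "yes")
d = gower_dissimilarity(
    profile_a, profile_b,
    kinds=("num", "cat", "cat"),
    ranges=(40.0, None, None),
)
print(d)
```

Each variable contributes a value in [0, 1] and the contributions are summed before averaging, which is why the measure behaves like a scaled Manhattan distance.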


2021, Vol 20 (1)
Author(s): Eimear Cleary, Manuel W. Hetzel, Paul Siba, Colleen L. Lau, Archie C. A. Clements

Background: Considerable progress towards controlling malaria has been made in Papua New Guinea through the national malaria control programme’s free distribution of long-lasting insecticidal nets, improved diagnosis with rapid diagnostic tests and improved access to artemisinin combination therapy. Predictive prevalence maps can help to inform targeted interventions and monitor changes in malaria epidemiology over time as control efforts continue. This study aims to compare the predictive performance of prevalence maps generated using Bayesian decision network (BDN) models and multilevel logistic regression models (a type of generalized linear model, GLM) in terms of malaria spatial risk prediction accuracy. Methods: Multilevel logistic regression models and BDN models were developed using 2010/2011 malaria prevalence survey data collected from 77 randomly selected villages to determine associations of Plasmodium falciparum and Plasmodium vivax prevalence with precipitation, temperature, elevation, slope (terrain aspect), enhanced vegetation index and distance to the coast. The predictive performance of the multilevel logistic regression and BDN models was compared by cross-validation. Results: Based on the GLM results, prevalence of P. falciparum was significantly associated with precipitation during the 3 driest months of the year, June to August (β = 0.015; 95% CI = 0.01–0.03), whereas P. vivax infection was associated with elevation (β = −0.26; 95% CI = −0.38 to −3.04), precipitation during the 3 driest months of the year (β = 0.01; 95% CI = −0.01 to 0.02) and slope (β = 0.12; 95% CI = 0.05–0.19). Compared with the GLMs, BDNs showed improved accuracy in prediction of the prevalence of P. falciparum (AUC = 0.49 versus 0.75, respectively) and P. vivax (AUC = 0.56 versus 0.74, respectively) on cross-validation. Conclusions: BDNs provide a more flexible modelling framework than GLMs and may have better predictive performance when developing malaria prevalence maps, owing to the multiple interacting factors that drive malaria prevalence in different geographical areas. When developing malaria prevalence maps, BDNs may be particularly useful in predicting prevalence where spatial variation in climate and environmental drivers of malaria transmission exists, as is the case in Papua New Guinea.
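
As a reminder of what an AUC value measures, the sketch below computes it as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The function name and the held-out labels and scores are made up for illustration and have no connection to the survey data.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels are 1 (positive) or 0 (negative); scores are predicted risks.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical held-out fold: true infection status and predicted risk.
fold_labels = [1, 0, 1, 0]
fold_scores = [0.9, 0.8, 0.4, 0.1]
fold_auc = auc(fold_labels, fold_scores)
print(fold_auc)
```

An AUC near 0.5 means the scores rank cases no better than chance, while values closer to 1 indicate better discrimination. In a cross-validation setting, the function would be applied to each held-out fold and the results averaged.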


Author(s): Ziqian Zhuang, Wei Xu, Rahi Jain

Introduction: The High dimensional Selection with Interactions for Binary Outcome (HDSI-BO) algorithm can incorporate interaction terms and be combined with existing techniques for feature selection. Simulation studies have validated the ability of HDSI-BO to select true features and, consequently, to improve prediction accuracy compared with standard algorithms. Our goal is to assess the applicability of HDSI-BO in combination with different techniques and to measure its predictive performance in a real data study of predicting height indicators by social-life and well-being factors. Methods: HDSI-BO was combined with logistic regression, ridge regression, LASSO, adaptive LASSO, and elastic net. Two-way interaction terms were considered. Hyperparameters used in HDSI-BO were optimized through genetic algorithms with five-fold cross-validation. To measure the performance of feature selection, we fitted final models by logistic regression based on the sets of selected features and used the model’s AUC as a measure. Thirty trials were run to generate a range for the number of selected features and a 95% confidence interval for AUC. Results: When combined with each of the above methods, HDSI-BO achieved higher final AUC values in terms of both the mean and the confidence interval. In addition, HDSI-BO methods effectively narrowed down the sets of selected features and interaction terms compared with standard methods. Conclusion: The HDSI-BO algorithm combines well with multiple standard methods and has comparable or better predictive performance compared with the standard methods. The computational and time complexity of HDSI-BO is higher but still acceptable. Using AUC as the sole metric cannot comprehensively measure feature selection performance. More effective performance metrics should be explored in future work.
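
The abstract does not say how the 95% confidence interval for AUC was obtained from the repeated trials. The sketch below shows one simple possibility, a normal-approximation interval around the mean of the per-trial AUCs. The function name and the AUC values are hypothetical, and a t-based or percentile interval would be an equally plausible choice.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def mean_with_ci(values: Sequence[float], level: float = 0.95) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal approximation.

    Assumes the per-trial values are independent replicates.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two trials")
    m = mean(values)
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    half_width = z * stdev(values) / sqrt(n)
    return m, m - half_width, m + half_width


# Hypothetical AUCs from three repeated trials (the study used thirty).
trial_aucs = [0.70, 0.72, 0.74]
m, lower, upper = mean_with_ci(trial_aucs)
print(f"mean AUC = {m:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

With only a handful of trials the normal approximation is optimistic; it becomes more reasonable as the number of trials grows.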

