Stochastic variable selection strategies for zero-inflated models

2017 ◽  
Vol 18 (1) ◽  
pp. 3-23 ◽  
Author(s):  
Eva Cantoni ◽  
Marie Auda

When count data exhibit excess zero, that is more zero counts than a simpler parametric distribution can model, the zero-inflated Poisson (ZIP) or zero-inflated negative binomial (ZINB) models are often used. Variable selection for these models is even more challenging than for other regression situations because the availability of p covariates implies 4 p possible models. We adapt to zero-inflated models an approach for variable selection that avoids the screening of all possible models. This approach is based on a stochastic search through the space of all possible models, which generates a chain of interesting models. As an additional novelty, we propose three ways of extracting information from this rich chain and we compare them in two simulation studies, where we also contrast our approach with regularization (penalized) techniques available in the literature. The analysis of a typical dataset that has motivated our research is also presented, before concluding with some recommendations.

Author(s):  
Cindy Xin Feng

AbstractCounts data with excessive zeros are frequently encountered in practice. For example, the number of health services visits often includes many zeros representing the patients with no utilization during a follow-up time. A common feature of this type of data is that the count measure tends to have excessive zero beyond a common count distribution can accommodate, such as Poisson or negative binomial. Zero-inflated or hurdle models are often used to fit such data. Despite the increasing popularity of ZI and hurdle models, there is still a lack of investigation of the fundamental differences between these two types of models. In this article, we reviewed the zero-inflated and hurdle models and highlighted their differences in terms of their data generating processes. We also conducted simulation studies to evaluate the performances of both types of models. The final choice of regression model should be made after a careful assessment of goodness of fit and should be tailored to a particular data in question.


2021 ◽  
Vol 36 (4) ◽  
pp. 475-491
Author(s):  
Liu-cang Wu ◽  
Song-qin Yang ◽  
Ye Tao

AbstractAlthough there are many papers on variable selection methods based on mean model in the finite mixture of regression models, little work has been done on how to select significant explanatory variables in the modeling of the variance parameter. In this paper, we propose and study a novel class of models: a skew-normal mixture of joint location and scale models to analyze the heteroscedastic skew-normal data coming from a heterogeneous population. The problem of variable selection for the proposed models is considered. In particular, a modified Expectation-Maximization(EM) algorithm for estimating the model parameters is developed. The consistency and the oracle property of the penalized estimators is established. Simulation studies are conducted to investigate the finite sample performance of the proposed methodologies. An example is illustrated by the proposed methodologies.


2017 ◽  
Vol 51 (3) ◽  
pp. 198-208 ◽  
Author(s):  
John S. Preisser ◽  
D. Leann Long ◽  
John W. Stamm

Marginalized zero-inflated count regression models have recently been introduced for the statistical analysis of dental caries indices and other zero-inflated count data as alternatives to traditional zero-inflated and hurdle models. Unlike the standard approaches, the marginalized models directly estimate overall exposure or treatment effects by relating covariates to the marginal mean count. This article discusses model interpretation and model class choice according to the research question being addressed in caries research. Two data sets, one consisting of fictional dmft counts in 2 groups and the other on DMFS among schoolchildren from a randomized clinical trial comparing 3 toothpaste formulations to prevent incident dental caries, are analyzed with negative binomial hurdle, zero-inflated negative binomial, and marginalized zero-inflated negative binomial models. In the first example, estimates of treatment effects vary according to the type of incidence rate ratio (IRR) estimated by the model. Estimates of IRRs in the analysis of the randomized clinical trial were similar despite their distinctive interpretations. The choice of statistical model class should match the study's purpose, while accounting for the broad decline in children's caries experience, such that dmft and DMFS indices more frequently generate zero counts. Marginalized (marginal mean) models for zero-inflated count data should be considered for direct assessment of exposure effects on the marginal mean dental caries count in the presence of high frequencies of zero counts.


2018 ◽  
Vol 28 (12) ◽  
pp. 3683-3696
Author(s):  
Peng Ye ◽  
Wan Tang ◽  
Jiang He ◽  
Hua He

Count outcomes with excessive zeros are common in behavioral and social studies, and zero-inflated count models such as zero-inflated Poisson (ZIP) and zero-inflated Negative Binomial (ZINB) can be applied when such zero-inflated count data are used as response variable. However, when the zero-inflated count data are used as predictors, ignoring the difference of structural and random zeros can result in biased estimates. In this paper, a generalized estimating equation (GEE)-type mixture model is proposed to jointly model the response of interest and the zero-inflated count predictors. Simulation studies show that the proposed method performs well for practical settings and is more robust for model misspecification than the likelihood-based approach. A case study is also provided for illustration.


2016 ◽  
Vol 25 (6) ◽  
pp. 2685-2703 ◽  
Author(s):  
Zhu Wang ◽  
Shuangge Ma ◽  
Michael Zappitelli ◽  
Chirag Parikh ◽  
Ching-Yun Wang ◽  
...  

Pediatric cardiac surgery may lead to poor outcomes such as acute kidney injury (AKI) and prolonged hospital length of stay (LOS). Plasma and urine biomarkers may help with early identification and prediction of these adverse clinical outcomes. In a recent multi-center study, 311 children undergoing cardiac surgery were enrolled to evaluate multiple biomarkers for diagnosis and prognosis of AKI and other clinical outcomes. LOS is often analyzed as count data, thus Poisson regression and negative binomial (NB) regression are common choices for developing predictive models. With many correlated prognostic factors and biomarkers, variable selection is an important step. The present paper proposes new variable selection methods for Poisson and NB regression. We evaluated regularized regression through penalized likelihood function. We first extend the elastic net (Enet) Poisson to two penalized Poisson regression: Mnet, a combination of minimax concave and ridge penalties; and Snet, a combination of smoothly clipped absolute deviation (SCAD) and ridge penalties. Furthermore, we extend the above methods to the penalized NB regression. For the Enet, Mnet, and Snet penalties (EMSnet), we develop a unified algorithm to estimate the parameters and conduct variable selection simultaneously. Simulation studies show that the proposed methods have advantages with highly correlated predictors, against some of the competing methods. Applying the proposed methods to the aforementioned data, it is discovered that early postoperative urine biomarkers including NGAL, IL18, and KIM-1 independently predict LOS, after adjusting for risk and biomarker variables.


Sign in / Sign up

Export Citation Format

Share Document