Models for Analyzing Zero-Inflated and Overdispersed Count Data: An Application to Cigarette and Marijuana Use

2018 ◽  
Vol 22 (8) ◽  
pp. 1390-1398 ◽  
Author(s):  
Brian Pittman ◽  
Eugenia Buta ◽  
Suchitra Krishnan-Sarin ◽  
Stephanie S O’Malley ◽  
Thomas Liss ◽  
...  

Abstract
Introduction: This article describes different methods for analyzing counts and illustrates their use on cigarette and marijuana smoking data.
Methods: The Poisson, zero-inflated Poisson (ZIP), hurdle Poisson (HUP), negative binomial (NB), zero-inflated negative binomial (ZINB), and hurdle negative binomial (HUNB) regression models are considered. The different approaches are evaluated in terms of their ability to account for zero-inflation (extra zeros) and overdispersion (variance larger than expected) in count outcomes, with emphasis placed on model fit, interpretation, and choosing an appropriate model given the nature of the data. The illustrative data example focuses on cigarette and marijuana smoking reports from a study on smoking habits among youth e-cigarette users, with gender, age, and e-cigarette use included as predictors.
Results: Of the 69 subjects available for analysis, 36% and 64% reported smoking no cigarettes and no marijuana, respectively, suggesting both outcomes might be zero-inflated. Both outcomes were also overdispersed with large positive skew. The ZINB and HUNB models fit the cigarette counts best. According to goodness-of-fit statistics, the NB, HUNB, and ZINB models fit the marijuana data well, but the ZINB provided better interpretation.
Conclusion: In the absence of zero-inflation, the NB model provides a good fit to smoking data, which are typically overdispersed. In the presence of zero-inflation, the ZINB or HUNB model is recommended to account for additional heterogeneity. In addition to model fit and interpretability, the choice between a zero-inflated and a hurdle model should ultimately depend on the assumptions regarding the zeros, the study design, and the research question being asked.
Implications: Count outcomes are frequent in tobacco research and often have many zeros and exhibit large variance and skew. Analyzing such data with methods that require a normally distributed outcome is inappropriate and will likely produce spurious results. This study compares and contrasts appropriate methods for analyzing count data, specifically those with an overabundance of zeros, and illustrates their use on cigarette and marijuana smoking data. Recommendations are provided.
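
As a rough illustration of the two features discussed in this abstract, the sketch below compares a sample of counts with what a Poisson distribution of the same mean would imply: a variance-to-mean ratio well above 1 points to overdispersion, and an observed share of zeros well above exp(-mean) points to possible zero-inflation. The function name `poisson_diagnostics` and the counts are hypothetical and are not taken from the study.

```python
import math
from statistics import mean, variance


def poisson_diagnostics(counts):
    """Compare observed spread and zero share with a Poisson of the same mean."""
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 denominator)
    observed_zero = sum(1 for c in counts if c == 0) / len(counts)
    expected_zero = math.exp(-m)  # P(Y = 0) under Poisson(m)
    return {
        "mean": m,
        "variance": v,
        "dispersion_ratio": v / m,  # about 1 under a Poisson model
        "observed_zero": observed_zero,
        "expected_zero_poisson": expected_zero,
    }


# Hypothetical counts for illustration only; NOT the study's data.
counts = [0, 0, 0, 0, 0, 0, 1, 2, 3, 5, 8, 20]
diag = poisson_diagnostics(counts)
for name, value in diag.items():
    print(f"{name}: {value:.3f}")
```

These are descriptive checks only. They suggest which models are worth fitting, but they do not replace the formal comparison of fitted models that the abstract describes.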

Author(s):  
Cindy Xin Feng

Abstract Count data with excessive zeros are frequently encountered in practice. For example, the number of health services visits often includes many zeros representing patients with no utilization during a follow-up period. A common feature of this type of data is that the count measure tends to have more zeros than a common count distribution, such as the Poisson or negative binomial, can accommodate. Zero-inflated (ZI) or hurdle models are often used to fit such data. Despite the increasing popularity of ZI and hurdle models, there is still a lack of investigation of the fundamental differences between these two types of models. In this article, we review the zero-inflated and hurdle models and highlight their differences in terms of their data-generating processes. We also conduct simulation studies to evaluate the performance of both types of models. The final choice of regression model should be made after a careful assessment of goodness of fit and should be tailored to the particular data in question.
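
The difference in data-generating processes can be sketched with the Poisson versions of the two models. In the usual textbook formulation, a zero-inflated model mixes a point mass at zero with an ordinary count distribution, so zeros arise from two sources; a hurdle model uses one process for zero versus positive and a zero-truncated count distribution for the positives. The function names and parameter values below are illustrative assumptions, not the article's notation.

```python
import math


def poisson_pmf(k, lam):
    """Ordinary Poisson probability P(Y = k)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: structural zeros with probability pi,
    otherwise a Poisson draw that can itself be zero."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def hurdle_pmf(k, lam, p_zero):
    """Hurdle Poisson: all zeros come from one binary process;
    positive counts follow a zero-truncated Poisson."""
    if k == 0:
        return p_zero
    truncated = poisson_pmf(k, lam) / (1.0 - math.exp(-lam))
    return (1.0 - p_zero) * truncated


lam = 2.0  # hypothetical count-process mean
zip_zero = zip_pmf(0, lam, pi=0.3)            # always >= P(0) under plain Poisson
hurdle_zero = hurdle_pmf(0, lam, p_zero=0.05)  # may fall below P(0) under plain Poisson
print(zip_zero, hurdle_zero, poisson_pmf(0, lam))
```

One consequence visible here is that the zero-inflated form can only add zeros relative to the base count distribution, whereas the hurdle form can also represent fewer zeros than the base distribution implies.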


Author(s):  
Moritz Berger ◽  
Gerhard Tutz

Abstract A flexible semiparametric class of models is introduced that offers an alternative to classical regression models for count data, such as the Poisson and negative binomial models, as well as to more general models accounting for excess zeros that are also based on fixed distributional assumptions. The model lets the data themselves determine the distribution of the response variable but, in its basic form, uses a parametric term that specifies the effect of explanatory variables. In addition, an extended version is considered in which the effects of covariates are specified nonparametrically. The proposed model and traditional models are compared in simulations and in several real data applications from the areas of health and social science.


2016 ◽  
Vol 63 (1) ◽  
pp. 77-87 ◽  
Author(s):  
William H. Fisher ◽  
Stephanie W. Hartwell ◽  
Xiaogang Deng

Poisson and negative binomial regression procedures have proliferated, and now are available in virtually all statistical packages. Along with the regression procedures themselves are procedures for addressing issues related to the over-dispersion and excessive zeros commonly observed in count data. These approaches, zero-inflated Poisson and zero-inflated negative binomial models, use logit or probit models for the “excess” zeros and count regression models for the counted data. Although these models are often appropriate on statistical grounds, their interpretation may prove substantively difficult. This article explores this dilemma, using data from a study of individuals released from facilities maintained by the Massachusetts Department of Correction.


Author(s):  
Getu Segni Tulu ◽  
M. Mazharul Haque ◽  
Simon Washington ◽  
Mark J. King

Pedestrian crashes represent about 40% of total fatal crashes in low-income developing countries. Although many pedestrian crashes in these countries occur at unsignalized intersections such as roundabouts, studies focusing on this issue are limited. The objective of this study was to develop safety performance functions for pedestrian crashes at modern roundabouts to identify significant roadway geometric, traffic, and land use characteristics related to pedestrian safety. Detailed data, including various forms of exposure, geometric and traffic characteristics, and spatial factors such as proximity to schools and to drinking establishments were collected from a sample of 22 modern roundabouts in Addis Ababa, Ethiopia, representing about 56% of such roundabouts in Addis Ababa. To account for spatial correlation resulting from multiple observations at a roundabout, both the random effect Poisson (REP) and random effect negative binomial (RENB) regression models were estimated. Model goodness-of-fit statistics revealed a marginally superior fit of the REP model to the data compared with the RENB model. Pedestrian crossing volume and the product of traffic volumes along major and minor roads had significant and positive associations with pedestrian crashes at roundabouts. The presence of a public transport (bus or taxi) terminal beside a roundabout was associated with increased pedestrian crashes. Although the maximum gradient of an approach road was negatively associated with pedestrian safety, the provision of a raised median along an approach appeared to increase pedestrian safety at roundabouts. Remedial measures were identified for combating pedestrian safety problems at roundabouts in the context of a developing country.


2021 ◽  
Author(s):  
Daniel Lüdecke ◽  
Mattan S. Ben-Shachar ◽  
Indrajeet Patil ◽  
Philip Waggoner ◽  
Dominique Makowski

A crucial part of statistical analysis is evaluating a model's quality and fit, or performance. During analysis, especially with regression models, investigating the fit of models to data also often involves selecting the best fitting model amongst many competing models. Upon investigation, fit indices should also be reported both visually and numerically to bring readers in on the investigative effort. While functions to build and produce diagnostic plots or to compute fit statistics exist, these are located across many packages, which results in a lack of a unique and consistent approach to assess the performance of many types of models. The result is a difficult-to-navigate, unorganized ecosystem of individual packages with different syntax, making it onerous for researchers to locate and use fit indices relevant for their unique purposes. The performance package in R fills this gap by offering researchers a suite of intuitive functions with consistent syntax for computing, building, and presenting regression model fit statistics and visualizations.


2017 ◽  
Vol 51 (3) ◽  
pp. 198-208 ◽  
Author(s):  
John S. Preisser ◽  
D. Leann Long ◽  
John W. Stamm

Marginalized zero-inflated count regression models have recently been introduced for the statistical analysis of dental caries indices and other zero-inflated count data as alternatives to traditional zero-inflated and hurdle models. Unlike the standard approaches, the marginalized models directly estimate overall exposure or treatment effects by relating covariates to the marginal mean count. This article discusses model interpretation and model class choice according to the research question being addressed in caries research. Two data sets, one consisting of fictional dmft counts in 2 groups and the other on DMFS among schoolchildren from a randomized clinical trial comparing 3 toothpaste formulations to prevent incident dental caries, are analyzed with negative binomial hurdle, zero-inflated negative binomial, and marginalized zero-inflated negative binomial models. In the first example, estimates of treatment effects vary according to the type of incidence rate ratio (IRR) estimated by the model. Estimates of IRRs in the analysis of the randomized clinical trial were similar despite their distinctive interpretations. The choice of statistical model class should match the study's purpose, while accounting for the broad decline in children's caries experience, such that dmft and DMFS indices more frequently generate zero counts. Marginalized (marginal mean) models for zero-inflated count data should be considered for direct assessment of exposure effects on the marginal mean dental caries count in the presence of high frequencies of zero counts.
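
To see why the type of IRR matters, recall that in a standard zero-inflated model with excess-zero probability pi and count-process mean mu, the overall (marginal) mean is (1 - pi) * mu. The sketch below contrasts the ratio of count-process means with the ratio of marginal means for two groups. The parameter values and variable names are hypothetical and are not drawn from either data set in the article.

```python
def marginal_mean(pi, mu):
    """Overall mean of a zero-inflated count: E[Y] = (1 - pi) * mu."""
    return (1.0 - pi) * mu


# Hypothetical group parameters, for illustration only.
control = {"pi": 0.30, "mu": 4.0}
treated = {"pi": 0.50, "mu": 3.0}

# Ratio of means within the susceptible (count-process) class.
latent_irr = treated["mu"] / control["mu"]

# Ratio of overall means, which also reflects the change in excess zeros.
marginal_irr = marginal_mean(**treated) / marginal_mean(**control)

print(f"latent-class IRR: {latent_irr:.3f}")
print(f"marginal IRR:     {marginal_irr:.3f}")
```

When the excess-zero probability differs between groups, the two ratios diverge; when it is the same in both groups, they coincide. A marginalized model targets the second quantity directly instead of requiring it to be derived from two sets of coefficients.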


2018 ◽  
Vol 52 (4) ◽  
pp. 339-345 ◽  
Author(s):  
Alex Man Him Chau ◽  
Edward Chin Man Lo ◽  
May Chun Mei Wong ◽  
Chun Hung Chu

Oral epidemiology involves studying and investigating the distribution and determinants of dental-related diseases in a specified population group to inform decisions in the management of health problems. In oral epidemiology studies, the hypothesis is typically followed by a cogent study design and data collection. Appropriate statistical analysis is essential to demonstrate the scientific association between the independent factors and the target variable. Analysis also helps to develop and build a statistical model. Poisson regression and its extensions have gained more attention in caries epidemiology than other working models such as logistic regression. This review discusses the fundamental principles and basic knowledge of Poisson regression models. It also introduces the use of a robust variance estimator with a focus on the “robust” interpretation of the model. In addition, extensions of regression models, including the zero-inflated model, hurdle model, and negative binomial model, and their interpretation in caries studies are reviewed. Principles of model fitting, including goodness-of-fit measures, are also discussed. Clinicians and researchers should pay attention to the statistical context of the models used and interpret the models to improve the oral and general health of the communities in which they live.


2019 ◽  
Vol 67 (2) ◽  
pp. 117-122
Author(s):  
Nasiba Maruf Ahmed ◽  
Taslim Sazzad Mallick

In medical science, pharmaceutical studies, public health, and socio-economic research, we often encounter an excess of zeros in count data. This preponderance of zeros leads to overdispersion. In such cases, traditional count regression models such as Poisson and negative binomial (NB) regression may not be appropriate for inference. The two most commonly used types of model developed to adjust for excessive zeros in count data are hurdle and zero-inflated models. In this study, we analyzed the antenatal care (ANC) visit data of pregnant women in Bangladesh using traditional and zero-modified count models. Based on the model selection criteria, we found that the negative binomial hurdle model fits the data best. Through this analysis, we found that the variables age of mother, division, birth order (the order in which a child is born), place of residence, economic condition, media exposure of the mother, main access road to village, and education gap between husband and wife have a significant impact on the mean number of ANC visits. Dhaka Univ. J. Sci. 67(2): 117-122, 2019 (July)
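
The abstract does not say which model selection criteria were used. Assuming information criteria such as AIC and BIC, which are common choices for comparing count models, a comparison could be sketched as follows. The model labels, log-likelihoods, parameter counts, and sample size are hypothetical placeholders, not results from the study.

```python
import math


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood (lower is better)."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * log-likelihood."""
    return n_params * math.log(n_obs) - 2 * loglik


# Hypothetical (log-likelihood, number of parameters) pairs; NOT study results.
fits = {
    "simple": (-520.0, 4),
    "medium": (-410.0, 5),
    "complex": (-408.0, 9),
}
n_obs = 500  # hypothetical sample size

best = min(fits, key=lambda name: aic(*fits[name]))
for name, (loglik, k) in fits.items():
    print(f"{name}: AIC={aic(loglik, k):.1f}  BIC={bic(loglik, k, n_obs):.1f}")
print("lowest AIC:", best)
```

In this made-up comparison the most complex model has the highest log-likelihood but not the lowest AIC, because the penalty for its extra parameters outweighs the small gain in fit.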

