MODELLING ZERO-INFLATED COUNT DATA WITH A SPECIAL CASE OF THE GENERALISED POISSON DISTRIBUTION

Abstract: A one-parameter version of the generalised Poisson distribution provided by Consul and Jain (1973) is considered in this paper. The distribution is unimodal with a zero vertex and is over-dispersed. A generalised linear model related to this distribution is also presented. Its parameters can be estimated using a Fisher scoring algorithm, which is equivalent to iteratively reweighted least squares. Owing to its flexibility and its capacity to describe highly skewed data with an excessive number of zeros, the model is suitable for insurance settings as an alternative to the negative binomial and zero-inflated models.
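As a rough illustration of the over-dispersion and the mode at zero mentioned above, the following sketch evaluates the standard two-parameter generalised Poisson distribution of Consul and Jain (1973). The abstract does not state which one-parameter restriction the paper uses, so the two-parameter form, the function names and the values `theta=1.0`, `lam=0.5` are assumptions chosen only for illustration.

```python
import math

def gp_logpmf(x, theta, lam):
    """Log-pmf of the two-parameter generalised Poisson distribution.

    P(X = x) = theta * (theta + lam*x)**(x - 1) * exp(-theta - lam*x) / x!
    for theta > 0 and 0 <= lam < 1. This is NOT the paper's one-parameter
    special case, which the abstract does not spell out.
    """
    return (math.log(theta) + (x - 1) * math.log(theta + lam * x)
            - theta - lam * x - math.lgamma(x + 1))

def gp_summary(theta, lam, x_max=2000):
    """Return (total mass, mean, variance, P(X = 0)) from a truncated sum."""
    probs = [math.exp(gp_logpmf(x, theta, lam)) for x in range(x_max + 1)]
    total = sum(probs)
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return total, mean, var, probs[0]

# Illustrative parameter values (assumed, not taken from the paper).
total, mean, var, p0 = gp_summary(theta=1.0, lam=0.5)
p1 = math.exp(gp_logpmf(1, 1.0, 0.5))
print(f"mass={total:.6f} mean={mean:.4f} var={var:.4f} P(0)={p0:.4f} P(1)={p1:.4f}")
```

For these values the numerical mean and variance agree with the known closed forms theta/(1 - lam) and theta/(1 - lam)^3, so the variance exceeds the mean, and P(0) is larger than P(1).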


A Bimodal Discrete Shifted Poisson Distribution. A Case Study of Tourists’ Length of Stay

Symmetry ◽

10.3390/sym12030442 ◽

2020 ◽

Vol 12 (3) ◽

pp. 442 ◽

Cited By ~ 1

Author(s):

Emilio Gómez-Déniz ◽

Jorge Vicente Pérez-Rodríguez ◽

Jimmy Reyes ◽

Héctor W. Gómez

Keyword(s):

Length Of Stay ◽

Poisson Distribution ◽

Canary Islands ◽

Count Data ◽

Mean Value ◽

Data Models ◽

Skewed Data ◽

Count Data Models ◽

Empirical Results

Although the Poisson distribution is appropriate for modelling equi-dispersed distributions, it reflects bimodality less well. In this paper, we propose a distribution which is more suitable for the latter purpose. It can be fitted to both positively and negatively skewed data and appears to represent overdispersion phenomena correctly in count data models obtained using a Poisson distribution. Furthermore, the distribution can be normalised in terms of its mean value, and therefore covariates can be included. Our empirical results are based on tourists’ length of stay in the Canary Islands (Spain), a popular holiday destination. The study analyses data supplied by the Canary Islands Tourist Expenditure Survey. Our findings show that the model presented is valid and that the fit obtained is reasonably good.


Zero-Inflated Time Series Modelling of COVID-19 Deaths in Ghana

Journal of Environmental and Public Health ◽

10.1155/2021/5543977 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Kassim Tawiah ◽

Wahab Abdul Iddrisu ◽

Killian Asampana Asosega

Keyword(s):

Time Series ◽

Autoregressive Model ◽

Negative Binomial ◽

Time Series Data ◽

Partial Likelihood ◽

Series Data ◽

Number Of Zeros ◽

Time Series Modelling ◽

Poisson Autoregressive Model ◽

Excessive Number

Discrete count time series data with an excessive number of zeros have warranted the development of zero-inflated time series models that incorporate both the inflation of zeros and the overdispersion that comes with it. In this paper, we investigated the characteristics of the trend of the daily count of COVID-19 deaths in Ghana using zero-inflated models. We envisaged that the trend of COVID-19 deaths per day in Ghana shows a general increase from the onset of the pandemic in the country to about day 160, after which there is a general decrease. We fitted a zero-inflated Poisson autoregressive model and a zero-inflated negative binomial autoregressive model to the data in the partial-likelihood framework. The zero-inflated negative binomial autoregressive model outperformed the zero-inflated Poisson autoregressive model. On the other hand, the dynamic zero-inflated Poisson autoregressive model performed better than the dynamic negative binomial autoregressive model. The predicted new deaths based on the zero-inflated negative binomial autoregressive model indicated that Ghana’s COVID-19 deaths per day will rise sharply a few days after 30th November 2020 and then fall drastically, just as in the observed data.


Zero-Inflated Models for RNA-Seq Count Data

Journal of Biomedical Analytics ◽

10.30577/jba.2018.v1n2.23 ◽

2018 ◽

Vol 1 (2) ◽

pp. 55-70 ◽

Cited By ~ 2

Author(s):

Morshed Alam ◽

Naim Al Mahi ◽

Munni Begum

Keyword(s):

Count Data ◽

Negative Binomial ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Mixed Effects ◽

Mixed Effects Model ◽

Rna Seq ◽

Number Of Zeros ◽

Biological Studies ◽

Differential Gene

One of the main objectives of many biological studies is to explore differential gene expression profiles between samples. Genes are referred to as differentially expressed (DE) if the read counts change systematically across treatments or conditions. Poisson and negative binomial (NB) regressions are widely used methods for non-over-dispersed (NOD) and over-dispersed (OD) count data respectively. However, in the presence of an excessive number of zeros, these methods need adjustments. In this paper, we consider a zero-inflated Poisson mixed effects model (ZIPMM) and a zero-inflated negative binomial mixed effects model (ZINBMM) to address excessive zero counts in NOD and OD RNA-seq data respectively in the presence of random effects. We apply these methods to both simulated and real RNA-seq datasets. The ZIPMM and ZINBMM perform better on both simulated and real datasets.


A comparison of zero-inflated and hurdle models for modeling zero-inflated count data

Journal of Statistical Distributions and Applications ◽

10.1186/s40488-021-00121-4 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Cindy Xin Feng

Keyword(s):

Health Services ◽

Count Data ◽

Goodness Of Fit ◽

Negative Binomial ◽

Simulation Studies ◽

Final Choice ◽

Hurdle Models ◽

Count Distribution ◽

Careful Assessment

Abstract: Count data with excessive zeros are frequently encountered in practice. For example, the number of health services visits often includes many zeros representing the patients with no utilization during a follow-up time. A common feature of this type of data is that the count measure tends to have more zeros than a common count distribution, such as the Poisson or negative binomial, can accommodate. Zero-inflated (ZI) or hurdle models are often used to fit such data. Despite the increasing popularity of ZI and hurdle models, there is still a lack of investigation of the fundamental differences between these two types of models. In this article, we reviewed the zero-inflated and hurdle models and highlighted their differences in terms of their data generating processes. We also conducted simulation studies to evaluate the performance of both types of models. The final choice of regression model should be made after a careful assessment of goodness of fit and should be tailored to the particular data in question.
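The difference in data generating processes can be made concrete with a minimal sketch of the two probability mass functions, assuming a Poisson count component in both cases. The function and parameter names (`zip_pmf`, `hurdle_pmf`, `pi`, `p_zero`, `mu`) are illustrative choices, not notation taken from the article.

```python
import math

def poisson_pmf(y, mu):
    """Ordinary Poisson pmf, computed in log space for stability."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

def zip_pmf(y, pi, mu):
    """Zero-inflated Poisson: zeros arise from two sources, a structural-zero
    state (probability pi) and the Poisson component itself."""
    base = (1.0 - pi) * poisson_pmf(y, mu)
    return pi + base if y == 0 else base

def hurdle_pmf(y, p_zero, mu):
    """Hurdle Poisson: every zero comes from a single binary process
    (probability p_zero); positive counts follow a zero-truncated Poisson."""
    if y == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(y, mu) / (1.0 - math.exp(-mu))

# Illustrative values only.
print(zip_pmf(0, pi=0.3, mu=2.0), hurdle_pmf(0, p_zero=0.3, mu=2.0))
```

One consequence of this construction is that a ZIP model can only add zeros relative to the Poisson baseline, whereas in a hurdle model `p_zero` is free, so it can also represent fewer zeros than the baseline would predict.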


Analysis of count data in the setting of cervical cancer detection

Journal of Investigative Medicine ◽

10.1136/jim-2020-001381 ◽

2020 ◽

Vol 68 (6) ◽

pp. 1196-1198

Author(s):

Christina G Bracamontes ◽

Thelma Carrillo ◽

Jane Montealegre ◽

Leonid Fradkin ◽

Michele Follen ◽

...

Keyword(s):

Sexual Abuse ◽

Count Data ◽

Pap Smear ◽

Negative Binomial ◽

El Paso ◽

Language Preference ◽

A Value ◽

History Of ◽

Zip Model ◽

Endocervical Canal

Women with an abnormal Pap smear are often referred to colposcopy, a procedure during which endocervical curettage (ECC) may be performed. ECC is a scraping of the endocervical canal lining. Our goal was to compare the performance of a naïve Poisson (NP) regression model with that of a zero-inflated Poisson (ZIP) model when identifying predictors of the number of distress/pain vocalizations made by women undergoing ECC. Data on women seen in the colposcopy clinic at a medical school in El Paso, Texas, were analyzed. The outcome was the number of pain vocalizations made by the patient during ECC. Six dichotomous predictors were evaluated. Initially, NP regression was used to model the data. A high proportion of patients did not make any vocalizations, and hence a ZIP model was also fit, and relative rates (RRs) and 95% CIs were calculated. AIC was used to identify the best model (NP or ZIP). Of the 210 women, 154 (73.3%) had a value of 0 for the number of ECC vocalizations. NP identified three statistically significant predictors (language preference of the subject, sexual abuse history and length of the colposcopy), while ZIP identified one: history of sexual abuse (yes vs no; adjusted RR=2.70, 95% CI 1.47 to 4.97). ZIP performed better than NP regression and was the preferred model. Clinicians and epidemiologists should consider using the ZIP model (or the zero-inflated negative binomial model) for zero-inflated count data.
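The kind of AIC comparison described above can be sketched for the simplest case, an intercept-only Poisson model against an intercept-only ZIP model fitted by EM. The counts below are synthetic and purely illustrative; they are not the study's data, the study's models included six predictors, and the EM routine is one common way to fit a ZIP model rather than the authors' stated procedure.

```python
import math

def poisson_loglik(y, mu):
    return sum(v * math.log(mu) - mu - math.lgamma(v + 1) for v in y)

def zip_loglik(y, pi, mu):
    ll = 0.0
    for v in y:
        if v == 0:
            ll += math.log(pi + (1.0 - pi) * math.exp(-mu))
        else:
            ll += math.log(1.0 - pi) + v * math.log(mu) - mu - math.lgamma(v + 1)
    return ll

def fit_zip_em(y, n_iter=500):
    """EM for an intercept-only ZIP model; returns (pi_hat, mu_hat)."""
    n, total = len(y), sum(y)
    n_zero = sum(1 for v in y if v == 0)
    pi, mu = 0.5, total / max(n - n_zero, 1)  # crude starting values
    for _ in range(n_iter):
        # E-step: probability that an observed zero is a structural zero.
        w = pi / (pi + (1.0 - pi) * math.exp(-mu))
        # M-step: update the mixing weight and the Poisson mean.
        pi = n_zero * w / n
        mu = total / (n - n_zero * w)
    return pi, mu

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

# Synthetic zero-heavy counts (hypothetical, for illustration only).
y = [0] * 70 + [1] * 5 + [2] * 8 + [3] * 8 + [4] * 5 + [5] * 4

mu_pois = sum(y) / len(y)                 # Poisson MLE is the sample mean
pi_hat, mu_hat = fit_zip_em(y)
aic_pois = aic(poisson_loglik(y, mu_pois), 1)
aic_zip = aic(zip_loglik(y, pi_hat, mu_hat), 2)
print(f"AIC Poisson={aic_pois:.1f}  AIC ZIP={aic_zip:.1f}")
```

The model with the lower AIC is preferred; on zero-heavy data like this synthetic sample the extra ZIP parameter is easily paid for by the improved likelihood.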


Statistical models for analyzing count data: predictors of length of stay among HIV patients in Portugal using a multilevel model

BMC Health Services Research ◽

10.1186/s12913-021-06389-1 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Ahmed Nabil Shaaban ◽

Bárbara Peleteiro ◽

Maria Rosario O. Martins

Keyword(s):

Length Of Stay ◽

Regression Model ◽

Random Effects ◽

Count Data ◽

Negative Binomial ◽

Negative Binomial Regression ◽

Comprehensive Approach ◽

Negative Binomial Regression Model ◽

Hiv Patients ◽

Binomial Regression

Abstract. Background: This study offers a comprehensive approach to precisely analyze the complexly distributed length of stay among HIV admissions in Portugal. Objective: To provide an illustration of statistical techniques for analysing count data using longitudinal predictors of length of stay among HIV hospitalizations in Portugal. Method: Registered discharges in Portuguese National Health Service (NHS) facilities between January 2009 and December 2017, a total of 26,505 classified under the Major Diagnostic Category (MDC) created for patients with HIV infection, with HIV/AIDS as a main or secondary cause of admission, were used to predict length of stay among HIV hospitalizations in Portugal. Several strategies were applied to select the best-fitting count model among the Poisson regression model, the zero-inflated Poisson model, the negative binomial regression model, and the zero-inflated negative binomial regression model. A random hospital effects term was incorporated into the negative binomial model to examine the dependence between observations within the same hospital. A multivariable analysis was performed to assess the effect of covariates on length of stay. Results: The median length of stay in our study was 11 days (interquartile range: 6–22). Statistical comparisons among the count models revealed that the random-effects negative binomial models provided the best fit to the observed data. Admissions among males or admissions associated with TB infection, pneumocystis, cytomegalovirus, candidiasis, toxoplasmosis, or mycobacterium disease exhibit a highly significant increase in length of stay. Perfect trends were observed in which a higher number of diagnoses or procedures leads to a significantly higher length of stay. The random-effects term included in our model, which refers to unexplained factors specific to each hospital, revealed obvious differences in quality among the hospitals included in our study.
Conclusions: This study provides a comprehensive approach to address unique problems associated with the prediction of length of stay among HIV patients in Portugal.


Improved inference for areal unit count data using graph-based optimisation

Statistics and Computing ◽

10.1007/s11222-021-10025-7 ◽

2021 ◽

Vol 31 (4) ◽

Author(s):

Duncan Lee ◽

Kitty Meeks ◽

William Pettersson

Keyword(s):

Random Effects ◽

Count Data ◽

Prior Distribution ◽

Disease Surveillance ◽

Sharing Rule ◽

Markov Random ◽

Spatial Correlation Structure ◽

Conditional Autoregressive ◽

Spatio Temporal ◽

Special Case

Abstract: Spatio-temporal count data relating to a set of non-overlapping areal units are prevalent in many fields, including epidemiology and social science. The spatial autocorrelation inherent in these data is typically modelled by a set of random effects that are assigned a conditional autoregressive prior distribution, which is a special case of a Gaussian Markov random field. The autocorrelation structure implied by this model depends on a binary neighbourhood matrix, where two random effects are assumed to be partially autocorrelated if their areal units share a common border, and are conditionally independent otherwise. This paper proposes a novel graph-based optimisation algorithm for estimating either a static or a temporally varying neighbourhood matrix for the data that better represents its spatial correlation structure, by viewing the areal units as the vertices of a graph and the neighbour relations as the set of edges. The improved estimation performance of our methodology compared to the commonly used border sharing rule is evidenced by simulation, before the method is applied to a new respiratory disease surveillance study in Scotland between 2011 and 2017.
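The border sharing rule and its link to conditional independence can be sketched as follows. The example builds a binary neighbourhood matrix from a list of shared borders and forms the intrinsic CAR precision matrix Q = D - W (up to a scale factor), where D holds the neighbour counts. The abstract does not say which CAR variant the paper uses, so the intrinsic form, the function names and the four-unit layout are assumptions for illustration; the paper's graph-based optimisation itself is not reproduced here.

```python
def neighbourhood_matrix(n_units, borders):
    """Binary, symmetric W with W[i][j] = 1 when units i and j share a border."""
    W = [[0] * n_units for _ in range(n_units)]
    for i, j in borders:
        W[i][j] = 1
        W[j][i] = 1
    return W

def icar_precision(W):
    """Intrinsic CAR precision (up to scale): Q = D - W, D = diag(row sums of W)."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0) - W[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical layout: four areal units arranged in a row, 0-1-2-3.
W = neighbourhood_matrix(4, [(0, 1), (1, 2), (2, 3)])
Q = icar_precision(W)
for row in Q:
    print(row)
```

A zero off-diagonal entry of Q, such as the one for units 0 and 3, corresponds to conditional independence of the two random effects given the rest, which is the Gaussian Markov random field property the abstract refers to.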


Transition models for count data: a flexible alternative to fixed distribution models

Statistical Methods & Applications ◽

10.1007/s10260-021-00558-6 ◽

2021 ◽

Author(s):

Moritz Berger ◽

Gerhard Tutz

Keyword(s):

Count Data ◽

Regression Models ◽

Negative Binomial ◽

Real Data ◽

Distribution Models ◽

Explanatory Variables ◽

Excess Zeros ◽

Proposed Model ◽

Transition Models ◽

Fixed Distribution

Abstract: A flexible semiparametric class of models is introduced that offers an alternative to classical regression models for count data, such as the Poisson and negative binomial models, as well as to more general models accounting for excess zeros that are also based on fixed distributional assumptions. The model allows the data themselves to determine the distribution of the response variable but, in its basic form, uses a parametric term that specifies the effect of explanatory variables. In addition, an extended version is considered, in which the effects of covariates are specified nonparametrically. The proposed model and traditional models are compared in simulations and in several real data applications from the areas of health and social science.


Beta-binomial models for meta-analysis with binary outcomes: Variations, extensions, and additional insights from econometrics

Research Methods in Medicine & Health Sciences ◽

10.1177/2632084321996225 ◽

2021 ◽

pp. 263208432199622

Author(s):

Tim Mathes ◽

Oliver Kuss

Keyword(s):

Simulation Study ◽

Count Data ◽

Negative Binomial ◽

Meta Analysis ◽

Negative Binomial Regression ◽

Binary Outcomes ◽

Small Scale ◽

Panel Count Data ◽

Count Data Models ◽

Meta Analyses

Background: Meta-analysis of systematically reviewed studies on interventions is the cornerstone of evidence-based medicine. In the following, we introduce the common-beta beta-binomial (BB) model for meta-analysis with binary outcomes and elucidate its equivalence to panel count data models. Methods: We present a variation of the standard “common-rho” BB (BBST model) for meta-analysis, namely a “common-beta” BB model. This model has an interesting connection to fixed-effect negative binomial regression models (FE-NegBin) for panel count data. Using this equivalence, it is possible to estimate an extension of the FE-NegBin with an additional multiplicative overdispersion term (RE-NegBin), while preserving a closed-form likelihood. An advantage of the connection to econometric models is that the models can be easily implemented, because “standard” statistical software for panel count data can be used. We illustrate the methods with two real-world example datasets. Furthermore, we show the results of a small-scale simulation study that compares the new models to the BBST. The input parameters of the simulation were informed by meta-analyses that were actually performed. Results: In both example datasets, the NegBin models, in particular the RE-NegBin, showed a smaller effect and had narrower 95% confidence intervals. In our simulation study, median bias was negligible for all methods, but the upper quartile for median bias suggested that BBST is most affected by positive bias. Regarding coverage probability, BBST and the RE-NegBin model outperformed the FE-NegBin model. Conclusion: For meta-analyses with binary outcomes, the considered common-beta BB models may be valuable extensions to the family of BB models.


Flexible models for overdispersed and underdispersed count data

Statistical Papers ◽

10.1007/s00362-021-01222-7 ◽

2021 ◽

Author(s):

Dexter Cahoy ◽

Elvira Di Nardo ◽

Federico Polito

Keyword(s):

Poisson Distribution ◽

Count Data ◽

Hypergeometric Functions ◽

Natural Generalization ◽

Model Parameters ◽

Probability Models ◽

Limiting Behavior ◽

Poisson Models ◽

Special Cases ◽

Flexible Models

Abstract: Within the framework of probability models for overdispersed count data, we propose the generalized fractional Poisson distribution (gfPd), which is a natural generalization of the fractional Poisson distribution (fPd), and the standard Poisson distribution. We derive some properties of gfPd and more specifically we study moments, limiting behavior and other features of fPd. The skewness suggests that fPd can be left-skewed, right-skewed or symmetric; this makes the model flexible and appealing in practice. We apply the model to real big count data and estimate the model parameters using maximum likelihood. Then, we turn to the very general class of weighted Poisson distributions (WPD’s) to allow both overdispersion and underdispersion. Similarly to Kemp’s generalized hypergeometric probability distribution, which is based on hypergeometric functions, we analyze a class of WPD’s related to a generalization of Mittag–Leffler functions. The proposed class of distributions includes the well-known COM-Poisson and the hyper-Poisson models. We characterize conditions on the parameters allowing for overdispersion and underdispersion, and analyze two special cases of interest which have not yet appeared in the literature.
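How a weighted Poisson distribution can produce either overdispersion or underdispersion can be seen in the COM-Poisson model, one of the known members the abstract mentions. The sketch below uses the standard COM-Poisson form, with probabilities proportional to lam**x / (x!)**nu, and computes its mean and variance by truncated summation. It does not implement the paper's new Mittag–Leffler-based class, and the parameter values and truncation point are assumptions for illustration.

```python
import math

def com_poisson_moments(lam, nu, x_max=500):
    """Mean and variance of COM-Poisson(lam, nu) via a truncated sum.

    Unnormalised weights are lam**x / (x!)**nu; nu = 1 gives the Poisson.
    """
    logw = [x * math.log(lam) - nu * math.lgamma(x + 1) for x in range(x_max + 1)]
    m = max(logw)                                # subtract max to avoid overflow
    w = [math.exp(v - m) for v in logw]
    z = sum(w)                                   # truncated normalising constant
    probs = [v / z for v in w]
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, var

for nu in (0.5, 1.0, 2.0):                       # illustrative values
    mean, var = com_poisson_moments(lam=2.0, nu=nu)
    print(f"nu={nu}: mean={mean:.3f} var={var:.3f} var/mean={var / mean:.3f}")
```

With `nu` below 1 the variance exceeds the mean, with `nu` above 1 it falls below the mean, and `nu = 1` recovers the equidispersed Poisson case.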
