The Effect of Sample Size on the Efficiency of Count Data Models: Application to Marriage Data

Journal of Economics and Behavioral Studies ◽

10.22610/jebs.v9i3.1742 ◽

2017 ◽

Vol 9 (3) ◽

pp. 6

Author(s):

Volition Tlhalitshi Montshiwa ◽

Ntebogang Dinah Moroke

Keyword(s):

Regression Model ◽

Sample Size ◽

Count Data ◽

Negative Binomial ◽

Information Criterion ◽

Data Models ◽

Hurdle Model ◽

Negative Binomial Regression Model ◽

Count Data Models ◽

Over Dispersion

Abstract: Sample size requirements are common in many multivariate analysis techniques as one of the measures taken to ensure the robustness of such techniques, but such requirements have received little attention in the area of count data models. This study therefore investigated the effect of sample size on the efficiency of six commonly used count data models: the Poisson regression model (PRM), the negative binomial regression model (NBRM), the zero-inflated Poisson model (ZIP), the zero-inflated negative binomial model (ZINB), the Poisson hurdle model (PHM) and the negative binomial hurdle model (NBHM). The data used in this study were sourced from Data First and were collected by Statistics South Africa through the Marriage and Divorce database. The six models were applied to ten randomly selected samples ranging in size from 4392 to 43916 and differing by 10% in size. The models were compared using the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Vuong’s test for over-dispersion, McFadden RSQ, Mean Square Error (MSE) and Mean Absolute Deviation (MAD). The results revealed that, in general, the negative binomial-based models outperformed the Poisson-based models. However, the results did not reveal an effect of sample size on the efficiency of the models, since AIC, BIC, Vuong’s test for over-dispersion, McFadden RSQ, MSE and MAD did not change consistently as the sample size increased.
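
The sampling scheme and some of the comparison criteria named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The helper names are hypothetical, the intermediate sample sizes are an assumption (the abstract gives only the endpoints 4392 and 43916 and the 10% step), and MAD is assumed here to mean the mean absolute difference between observed and fitted counts. AIC and BIC use their standard definitions in terms of the maximized log-likelihood.

```python
import math

def nested_sample_sizes(n_total, steps=10):
    """Sizes at 10%, 20%, ..., 100% of n_total (assumed scheme, rounded to whole units)."""
    return [round(n_total * i / steps) for i in range(1, steps + 1)]

def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2 ln L, with k estimated parameters."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion: k ln n - 2 ln L, with n observations."""
    return k * math.log(n) - 2 * loglik

def mse(observed, fitted):
    """Mean of squared differences between observed and fitted counts."""
    return sum((y - f) ** 2 for y, f in zip(observed, fitted)) / len(observed)

def mad(observed, fitted):
    """Mean of absolute differences between observed and fitted counts (assumed definition)."""
    return sum(abs(y - f) for y, f in zip(observed, fitted)) / len(observed)

sizes = nested_sample_sizes(43916)
print(sizes[0], sizes[-1])
```

Lower AIC, BIC, MSE and MAD values indicate a better-fitting model. BIC's penalty grows with the sample size while AIC's does not, which is one reason the two criteria can rank models differently as n changes.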

Statistical models for analyzing count data: predictors of length of stay among HIV patients in Portugal using a multilevel model

BMC Health Services Research ◽

10.1186/s12913-021-06389-1 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Ahmed Nabil Shaaban ◽

Bárbara Peleteiro ◽

Maria Rosario O. Martins

Keyword(s):

Length Of Stay ◽

Regression Model ◽

Random Effects ◽

Count Data ◽

Negative Binomial ◽

Negative Binomial Regression ◽

Comprehensive Approach ◽

Negative Binomial Regression Model ◽

HIV Patients ◽

Binomial Regression

Abstract: Background: This study offers a comprehensive approach to precisely analyzing the complexly distributed length of stay among HIV admissions in Portugal. Objective: To illustrate statistical techniques for analysing count data using longitudinal predictors of length of stay among HIV hospitalizations in Portugal. Method: A total of 26,505 discharges registered in Portuguese National Health Service (NHS) facilities between January 2009 and December 2017, classified under the Major Diagnostic Category (MDC) created for patients with HIV infection, with HIV/AIDS as a main or secondary cause of admission, were used to predict length of stay among HIV hospitalizations in Portugal. Several strategies were applied to select the best-fitting count model from among the Poisson regression model, the zero-inflated Poisson model, the negative binomial regression model, and the zero-inflated negative binomial regression model. A random hospital effects term was incorporated into the negative binomial model to examine the dependence between observations within the same hospital. A multivariable analysis was performed to assess the effect of covariates on length of stay. Results: The median length of stay in our study was 11 days (interquartile range: 6–22). Statistical comparisons among the count models revealed that the random-effects negative binomial models provided the best fit to the observed data. Admissions among males and admissions associated with TB infection, pneumocystis, cytomegalovirus, candidiasis, toxoplasmosis, or mycobacterium disease exhibit a highly significant increase in length of stay. Perfect trends were observed in which a higher number of diagnoses or procedures leads to a significantly longer length of stay. The random-effects term included in our model, which refers to unexplained factors specific to each hospital, revealed obvious differences in quality among the hospitals included in our study. Conclusions: This study provides a comprehensive approach to addressing the unique problems associated with the prediction of length of stay among HIV patients in Portugal.

A Combined PLS and Negative Binomial Regression Model for Inferring Association Networks from Next-Generation Sequencing Count Data

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2017.2665495 ◽

2018 ◽

Vol 15 (3) ◽

pp. 760-773 ◽

Cited By ~ 2

Author(s):

Maiju Pesonen ◽

Jaakko Nevalainen ◽

Steven Potter ◽

Somnath Datta ◽

Susmita Datta

Keyword(s):

Next Generation Sequencing ◽

Regression Model ◽

Count Data ◽

Negative Binomial ◽

Negative Binomial Regression ◽

Negative Binomial Regression Model ◽

Next Generation ◽

Binomial Regression ◽

Generation Sequencing

Modeling with Geographically Weighted Negative Binomial Regression (Case Study: The Number of Leprosy Patients in West Java)

Xplore Journal of Statistics ◽

10.29244/xplore.v10i3.833 ◽

2021 ◽

Vol 10 (3) ◽

pp. 226-236

Author(s):

Khusnul Khotimah ◽

Itasia Dina Sulvianti ◽

Pika Silvianti

Keyword(s):

Regression Model ◽

Count Data ◽

Poisson Regression ◽

Negative Binomial ◽

Negative Binomial Regression ◽

Kernel Weight ◽

Negative Binomial Regression Model ◽

West Java ◽

Binomial Regression ◽

Spatial Heterogeneity

The number of leprosy cases in West Java is an example of count data. The analysis commonly used for count data is Poisson regression. This research determines the variables that influence the number of leprosy cases in West Java, using data for 2019. The data exhibit overdispersion and spatial heterogeneity. To handle overdispersion, the negative binomial regression model can be employed, while spatial heterogeneity is addressed by adding an adaptive bisquare kernel weight. The resulting Geographically Weighted Negative Binomial Regression (GWNBR) model with adaptive bisquare kernel weighting classifies the regencies/cities of West Java into ten groups based on the variables that significantly influence the number of leprosy cases. In general, the percentage of households with Clean and Healthy Behavior (PHBS) has a significant effect in all regencies/cities in West Java. For Bogor Regency, Depok City, Bogor City, and Pangandaran Regency specifically, the percentage of people living in poverty does not have a significant effect on the number of leprosy cases.
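
The abstract does not spell out the adaptive bisquare kernel, so the following Python sketch uses the form commonly seen in geographically weighted regression and should be read as an assumption, not as the authors' formulation. The bandwidth for a focal location is taken to be the distance to its k-th nearest observation, and the weight is (1 - (d/b)^2)^2 for distances d below the bandwidth b and zero otherwise. The function name and the parameter `k` are hypothetical.

```python
def adaptive_bisquare_weights(distances, k):
    """Bisquare weights for one focal location.

    distances: distances from the focal location to every observation.
    k: number of nearest observations that defines the adaptive bandwidth
       (assumed convention: bandwidth = k-th smallest distance).
    """
    if not 1 <= k <= len(distances):
        raise ValueError("k must be between 1 and the number of observations")
    bandwidth = sorted(distances)[k - 1]
    weights = []
    for d in distances:
        if d < bandwidth:
            # Weight decays smoothly from 1 at d = 0 to 0 at the bandwidth.
            weights.append((1 - (d / bandwidth) ** 2) ** 2)
        else:
            # Observations at or beyond the bandwidth get no weight.
            weights.append(0.0)
    return weights

weights = adaptive_bisquare_weights([0.0, 1.0, 2.0, 3.0], k=3)
print(weights)
```

Because the bandwidth adapts to the local density of observations, sparsely covered areas borrow information from a wider neighbourhood than densely covered ones.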

Sample size calculations for the differential expression analysis of RNA-seq data using a negative binomial regression model

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2018-0021 ◽

2019 ◽

Vol 18 (1) ◽

Cited By ~ 2

Author(s):

Xiaohong Li ◽

Dongfeng Wu ◽

Nigel G.F. Cooper ◽

Shesh N. Rai

Keyword(s):

Regression Model ◽

Sample Size ◽

Negative Binomial ◽

Negative Binomial Regression ◽

Wald Test ◽

Maximum Likelihood Estimates ◽

Negative Binomial Regression Model ◽

RNA-Seq ◽

Sample Sizes ◽

Binomial Regression

Abstract: High-throughput RNA sequencing (RNA-seq) technology is increasingly used in disease-related biomarker studies. The negative binomial distribution has become the popular choice for modeling read counts of genes in RNA-seq data because the read counts are over-dispersed. In this study, we propose two explicit sample size calculation methods for RNA-seq data using a negative binomial regression model. To derive these new sample size formulas, the common dispersion parameter and the size factor, as an offset via a natural logarithm link function, are incorporated. A two-sided Wald test statistic derived from the coefficient parameter is used for testing a single gene at a nominal significance level of 0.05 and multiple genes at a false discovery rate of 0.05. The variance for the Wald test is computed from the variance-covariance matrix with the parameters estimated from the maximum likelihood estimates under the unrestricted and constrained scenarios. The performance of our new formulas, and a side-by-side comparison with three existing methods that use a Wald test, a likelihood ratio test or an exact test, are evaluated via simulation studies. Since the other methods are computationally much more intensive, we recommend our M1 method for quick and direct estimation of sample sizes in an experimental design. Finally, we illustrate sample size estimation using an existing breast cancer RNA-seq dataset.
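
The sample size formulas themselves are not given in the abstract and are not reproduced here. As a minimal illustration of the testing step only, the Python sketch below computes a generic two-sided Wald test for a single regression coefficient, assuming the estimate and its standard error are already available from a fitted model. The function and parameter names are hypothetical.

```python
import math

def wald_test(beta_hat, std_err, alpha=0.05):
    """Two-sided Wald test of H0: beta = 0 for one coefficient.

    Returns the z statistic, the two-sided p-value under a standard normal
    reference distribution, and whether H0 is rejected at level alpha.
    """
    z = beta_hat / std_err
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value, p_value < alpha

z, p_value, reject = wald_test(beta_hat=0.98, std_err=0.5)
print(z, p_value, reject)
```

The standard error would come from the estimated variance-covariance matrix of the fitted negative binomial model. Testing many genes at a false discovery rate of 0.05 additionally requires a multiple-testing adjustment, which this sketch does not cover.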

Non-Central Negative Binomial Regression Model for Count Data

2020 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO) ◽

10.1109/icrito48877.2020.9197974 ◽

2020 ◽

Author(s):

Anwar Hassan ◽

Ishfaq S. Ahmad ◽

Peer Bilal Ahmad

Keyword(s):

Regression Model ◽

Count Data ◽

Negative Binomial ◽

Negative Binomial Regression ◽

Negative Binomial Regression Model ◽

Binomial Regression

Endogenous switching regression model and treatment effects of count-data outcome

The Stata Journal: Promoting communications on statistics and Stata ◽

10.1177/1536867x20953573 ◽

2020 ◽

Vol 20 (3) ◽

pp. 627-646

Author(s):

Takuya Hasebe

Keyword(s):

Regression Model ◽

Count Data ◽

Treatment Effects ◽

Negative Binomial ◽

Negative Binomial Regression ◽

Negative Binomial Regression Model ◽

Switching Regression ◽

Latent Heterogeneity ◽

Endogenous Switching ◽

Binomial Regression

In this article, I describe the escount command, which implements the estimation of an endogenous switching model with count-data outcomes, where a potential outcome differs across two alternate treatment statuses. escount allows for either a Poisson or a negative binomial regression model with lognormal latent heterogeneity. After estimating the parameters of the switching regression model, one can estimate various treatment effects with the command teescount. I also describe the command lncount, which fits the Poisson or negative binomial regression model with lognormal latent heterogeneity.

Testing Exogeneity of Multinomial Regressors in Count Data Models: Does Two-stage Residual Inclusion Work?

Journal of Econometric Methods ◽

10.1515/jem-2014-0019 ◽

2016 ◽

Vol 7 (1) ◽

Cited By ~ 3

Author(s):

Andrea Geraci ◽

Daniele Fabbri ◽

Chiara Monfardini

Keyword(s):

Count Data ◽

Estimation Method ◽

Data Models ◽

Parametric Models ◽

Finite Sample ◽

Count Data Models ◽

Two Stage ◽

Simulation Experiments ◽

Original Application ◽

Over Dispersion

Abstract: We study a simple exogeneity test in count data models with possibly endogenous multinomial treatment. The test is based on Two Stage Residual Inclusion (2SRI), an estimation method which has been proved to be consistent for a general class of nonlinear parametric models. Results from a broad set of simulation experiments provide novel evidence on important features of this approach. We find differences in the finite sample performance of various likelihood-based tests, analyze their robustness to misspecification arising from neglected over-dispersion or from incorrect specification of the first stage model, and find that standardizing the variance of the first stage residuals leads to better results. An original application to testing the endogeneity status of insurance in a model of healthcare demand corroborates our Monte Carlo findings.

A Study of Count Regression Models for Mortality Rate

CAUCHY ◽

10.18860/ca.v7i1.13642 ◽

2021 ◽

Vol 7 (1) ◽

pp. 142-151

Author(s):

Anwar Fitrianto

Keyword(s):

Mortality Rate ◽

Regression Model ◽

Count Data ◽

Bayesian Information Criterion ◽

Regression Models ◽

Negative Binomial ◽

Negative Binomial Regression ◽

Information Criterion ◽

Poisson Regression Model ◽

Binomial Regression

This paper discusses how to fit overdispersed count data. The Poisson regression model, the Negative Binomial 1 regression model (NEGBIN 1) and the Negative Binomial 2 regression model (NEGBIN 2) were proposed to fit mortality rate data. The models are compared using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) to find out which one suits the data best. The results show that the data do indeed display higher variability. Among the three models, the preferred model is NEGBIN 1.
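
The abstract does not define the three models. As a rough sketch, the Python snippet below uses the textbook variance functions usually associated with these names, where `mu` is the conditional mean and `alpha` is a dispersion parameter: the Poisson variance equals the mean, the NEGBIN 1 variance is linear in the mean, and the NEGBIN 2 variance is quadratic in the mean. Whether the paper uses exactly this parameterization is an assumption, and the function names are hypothetical.

```python
def poisson_variance(mu):
    """Poisson: the variance equals the mean (equidispersion)."""
    return mu

def negbin1_variance(mu, alpha):
    """NEGBIN 1: variance is linear in the mean, (1 + alpha) * mu."""
    return (1 + alpha) * mu

def negbin2_variance(mu, alpha):
    """NEGBIN 2: variance is quadratic in the mean, mu + alpha * mu**2."""
    return mu + alpha * mu ** 2

mu, alpha = 2.0, 0.5
print(poisson_variance(mu), negbin1_variance(mu, alpha), negbin2_variance(mu, alpha))
```

With `alpha` equal to zero both negative binomial forms reduce to the Poisson variance, which is why the Poisson model is treated as the baseline in such comparisons.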

A Comparison Between Ridge Regression and Liu-Type Methods for Estimating the Parameters of the Negative Binomial Regression Model in the Presence of Multicollinearity Using Simulation

Journal of Economics and Administrative Sciences ◽

10.33095/jeas.v24i109.1564 ◽

2018 ◽

Vol 24 (109) ◽

pp. 515

Author(s):

سهيل نجم عبود ◽

ايناس صلاح خورشيد

Keyword(s):

Regression Model ◽

Ridge Regression ◽

Count Data ◽

Negative Binomial ◽

Negative Binomial Regression ◽

Negative Binomial Regression Model ◽

Regression Estimator ◽

Binomial Regression ◽

Ridge Regression Estimator

Multicollinearity is a common problem that concerns, to a large extent, the intercorrelation among explanatory variables, and it arises particularly in economics and applied research. Multicollinearity has a negative effect on the regression model, such as inflated variance and unstable parameter estimates when ordinary least squares (OLS) estimators are used. For this reason, other methods were used to estimate the parameters of the negative binomial model, including the ridge regression estimator and the Liu-type estimator. The negative binomial regression model is regarded as a nonlinear regression model, or as part of the generalized exponential family. It is the basic framework for analyzing count data and is used as an alternative to the Poisson model when there is overdispersion, that is, when the variance of the response variable (Y) is greater than its mean. A Monte Carlo simulation study was designed to compare the ridge regression estimator and the Liu-type estimator using the mean squared error (MSE) as the comparison criterion. The simulation results showed that the Liu-type estimator is better than the ridge regression estimator, since its mean squared error was lower in its third and fourth estimation formulas.
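
The overdispersion condition described in this abstract, a response variance larger than its mean, can be checked on raw counts with a short Python sketch. The counts below are invented purely for illustration and the function name is hypothetical. A ratio well above 1 only suggests overdispersion, since in a regression setting the check properly applies to the variance conditional on the covariates.

```python
import statistics

def dispersion_ratio(counts):
    """Ratio of the sample variance to the sample mean of a list of counts."""
    mean = statistics.mean(counts)
    variance = statistics.variance(counts)  # unbiased sample variance (n - 1 denominator)
    return variance / mean

# Hypothetical counts, used only to illustrate the calculation.
counts = [0, 0, 1, 1, 2, 8]
ratio = dispersion_ratio(counts)
print(ratio)
```

A Poisson-distributed response has a ratio close to 1, so a clearly larger value is the usual motivation for moving to a negative binomial model.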