A Flexible Multivariate Distribution for Correlated Count Data

Stats ◽  
2021 ◽  
Vol 4 (2) ◽  
pp. 308-326
Author(s):  
Kimberly F. Sellers ◽  
Tong Li ◽  
Yixuan Wu ◽  
Narayanaswamy Balakrishnan

Multivariate count data are often modeled via a multivariate Poisson distribution, but that model carries the constraining assumption of equi-dispersion (variance equal to the mean). Real data are oftentimes over-dispersed, motivating various extensions of the negative binomial structure. While over-dispersion is more prevalent than under-dispersion in real data, examples containing under-dispersed data are surfacing with greater frequency. Thus, there is a demonstrated need for a flexible model that can accommodate both data types. We develop a multivariate Conway–Maxwell–Poisson (MCMP) distribution to serve as a flexible alternative for correlated count data that exhibit dispersion. This structure contains the multivariate Poisson, multivariate geometric, and multivariate Bernoulli distributions as special cases, and serves as a bridge distribution across these three classical models to address other levels of over- or under-dispersion. In this work, we not only derive the distributional form and statistical properties of this model, but also address parameter estimation, establish informative hypothesis tests to detect statistically significant dispersion and aid in model parsimony, and illustrate the distribution's flexibility through several simulated and real-world data examples. These examples demonstrate that the MCMP distribution performs on par with the multivariate negative binomial distribution for over-dispersed data, and proves particularly beneficial in representing under-dispersed data. The MCMP distribution thus offers an effective, unifying framework for modeling over- or under-dispersed multivariate correlated count data that do not necessarily adhere to Poisson assumptions.
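A minimal univariate sketch illustrates the Conway–Maxwell–Poisson building block and how its extra parameter moves the variance above or below the mean (the paper's multivariate construction is more involved; the truncation bound `kmax` and function names here are implustrative implementation choices, not from the paper):

```python
import math

def cmp_pmf(k, lam, nu, kmax=200):
    """Conway-Maxwell-Poisson pmf: P(X = k) is proportional to
    lam**k / (k!)**nu, normalized by a truncated series."""
    logw = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(kmax)]
    m = max(logw)
    logz = m + math.log(sum(math.exp(w - m) for w in logw))  # log-sum-exp
    return math.exp(k * math.log(lam) - nu * math.lgamma(k + 1) - logz)

def dispersion(lam, nu, kmax=200):
    """Mean and variance of the (truncated) CMP distribution."""
    p = [cmp_pmf(k, lam, nu, kmax) for k in range(kmax)]
    mean = sum(k * pk for k, pk in enumerate(p))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(p))
    return mean, var

# nu = 1 recovers the Poisson (variance equals mean); nu > 1 yields
# under-dispersion and nu < 1 over-dispersion -- the flexibility the
# MCMP distribution carries into the multivariate setting.
m1, v1 = dispersion(3.0, 1.0)  # equi-dispersed
m2, v2 = dispersion(3.0, 2.0)  # under-dispersed: v2 < m2
m3, v3 = dispersion(3.0, 0.5)  # over-dispersed:  v3 > m3
```

The special cases named in the abstract sit at the edges of the same parameter: nu = 1 is Poisson, nu = 0 (with lam < 1) is geometric, and nu → ∞ approaches Bernoulli.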

Author(s):  
Winai Bodhisuwan ◽  
Pornpop Saengthong

In this paper, a new mixed negative binomial (NB) distribution, named the negative binomial–weighted Garima (NB-WG) distribution, is introduced for modeling count data. Two special cases of the proposed distribution, the negative binomial–Garima (NB-G) and the negative binomial–size-biased Garima (NB-SBG), are obtained by fixing the specified parameter. Some statistical properties such as the factorial moments, the first four moments, variance, and skewness are also derived. Parameter estimation is implemented using maximum likelihood estimation (MLE), and real data sets are discussed to demonstrate the usefulness and applicability of the proposed distribution.


Author(s):  
Moritz Berger ◽  
Gerhard Tutz

Abstract: A flexible semiparametric class of models is introduced that offers an alternative to classical regression models for count data, such as the Poisson and negative binomial models, as well as to more general models that account for excess zeros but are likewise based on fixed distributional assumptions. The model lets the data themselves determine the distribution of the response variable but, in its basic form, uses a parametric term that specifies the effect of explanatory variables. In addition, an extended version is considered in which the effects of covariates are specified nonparametrically. The proposed model and traditional models are compared in simulations and in several real data applications from the areas of health and social science.


2016 ◽  
Vol 63 (1) ◽  
pp. 77-87 ◽  
Author(s):  
William H. Fisher ◽  
Stephanie W. Hartwell ◽  
Xiaogang Deng

Poisson and negative binomial regression procedures have proliferated and are now available in virtually all statistical packages. Alongside the regression procedures themselves are procedures for addressing the over-dispersion and excess zeros commonly observed in count data. These approaches, zero-inflated Poisson and zero-inflated negative binomial models, use logit or probit models for the “excess” zeros and count regression models for the counts. Although these models are often appropriate on statistical grounds, their interpretation may prove substantively difficult. This article explores this dilemma, using data from a study of individuals released from facilities maintained by the Massachusetts Department of Correction.
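The two-part structure behind the interpretive dilemma can be written down directly; a minimal sketch, where the mixing weight `pi` stands in for the fitted probability from the logit/probit part:

```python
import math

def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: with probability pi the count is a
    'structural' zero; otherwise it is drawn from Poisson(lam)."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return pi * (k == 0) + (1 - pi) * poisson

# An observed zero mixes two sources (structural and sampling zeros),
# which is precisely what complicates substantive interpretation:
lam, pi = 2.0, 0.3
plain_zero = math.exp(-lam)          # sampling zeros only
inflated_zero = zip_pmf(0, lam, pi)  # structural + sampling zeros
```

Nothing in the observed data tags an individual zero as structural or sampling, so the two components must be interpreted jointly.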


2017 ◽  
Author(s):  
Fangzheng Xie ◽  
Mingyuan Zhou ◽  
Yanxun Xu

Abstract: Tumors are heterogeneous: a tumor sample usually consists of a set of subclones with distinct transcriptional profiles and potentially different degrees of aggressiveness and responses to drugs. Understanding tumor heterogeneity is therefore critical for precise cancer prognosis and treatment. In this paper, we introduce BayCount, a Bayesian decomposition method to infer tumor heterogeneity from highly over-dispersed RNA sequencing count data. Using negative binomial factor analysis, BayCount takes into account both between-sample and gene-specific random effects on the raw counts of sequencing reads mapped to each gene. For posterior inference, we develop an efficient compound Poisson-based blocked Gibbs sampler. Simulation studies show that BayCount accurately recovers the subclonal structure, including the number of subclones, their proportions in each tumor sample, and the gene expression profiles in each subclone. In real-world examples, we apply BayCount to The Cancer Genome Atlas lung cancer and kidney cancer RNA sequencing count data and obtain biologically interpretable results. Our method represents the first effort to characterize tumor heterogeneity using RNA sequencing count data in a way that simultaneously removes the need to normalize the counts, achieves statistical robustness, and yields biologically and clinically meaningful insights. The R package BayCount implementing our model and algorithm is available for download.


Econometrics ◽  
2020 ◽  
Vol 8 (1) ◽  
pp. 9 ◽  
Author(s):  
Brendan P. M. McCabe ◽  
Christopher L. Skeels

The Poisson regression model remains an important tool in the econometric analysis of count data. In a pioneering contribution to the econometric analysis of such models, Lung-Fei Lee presented a specification test for a Poisson model against a broad class of discrete distributions sometimes called the Katz family. Two members of this alternative class are the binomial and negative binomial distributions, which are commonly used with count data to allow for under- and over-dispersion, respectively. In this paper we explore the structure of other distributions within the class and their suitability as alternatives to the Poisson model. Potential difficulties with the Katz likelihood lead us to investigate a class of point optimal tests of the Poisson assumption against the alternative of over-dispersion in both the regression and intercept-only cases. In a simulation study, we compare score tests of ‘Poisson-ness’ with various point optimal tests based on the Katz family, and conclude that it is possible to choose a point optimal test that is better in the intercept-only case, although the nuisance parameters arising in the regression case are problematic. One possible cause is a poor choice of the point at which to optimize, so we explore the use of Hellinger distance to aid this choice. Ultimately we conclude that score tests remain the most practical approach to testing for over-dispersion in this context.
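The Katz family is easiest to state through its probability recursion; a short sketch, with generic parameter names `omega` and `gamma` rather than Lee's notation:

```python
def katz_pmf(omega, gamma, kmax=200):
    """Katz family via its recursion p_{k+1}/p_k = (omega + gamma*k)/(k+1):
    gamma = 0 gives the Poisson, 0 < gamma < 1 the negative binomial
    (over-dispersed), and gamma < 0 the binomial (under-dispersed)."""
    p = [1.0]
    for k in range(kmax):
        ratio = (omega + gamma * k) / (k + 1)
        if ratio <= 0:  # gamma < 0: the support is finite and ends here
            break
        p.append(p[-1] * ratio)
    total = sum(p)
    return [q / total for q in p]

# gamma = 0 reproduces Poisson(1.5); omega = 2, gamma = -1 reproduces
# the under-dispersed Binomial(2, 1/2) with pmf [0.25, 0.5, 0.25].
poisson_like = katz_pmf(1.5, 0.0)
binomial_like = katz_pmf(2.0, -1.0)
```

The family's mean is omega/(1 - gamma) and its variance omega/(1 - gamma)**2, so the sign of gamma alone determines the direction of dispersion being tested.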


2018 ◽  
Vol 28 (5) ◽  
pp. 1540-1551
Author(s):  
Maengseok Noh ◽  
Youngjo Lee

Poisson models are widely used for statistical inference on count data. However, zero-inflation or zero-deflation may occur together with either overdispersion or underdispersion. Currently, no model for count data allows an excessive occurrence of zeros along with underdispersion in the non-zero counts, even though the need for such models has been reported. Furthermore, given an excessive zero rate, we need a model that allows a larger degree of overdispersion than existing models. In this paper, we use a random-effect model to produce a general statistical model that accommodates these phenomena occurring in real data analyses.


Author(s):  
Chenangnon Frédéric Tovissodé ◽  
Romain Glele Kakai

It is quite easy to stochastically distort an original count variable to obtain a new count variable with relatively more variability than in the original variable. Many popular overdispersion models (variance greater than mean) can indeed be obtained by mixtures, compounding, or randomly stopped sums. There is no analogous stochastic mechanism for the construction of underdispersed count variables (variance less than mean), starting from an original count distribution of interest. This work proposes a generic method to stochastically distort an original count variable to obtain a new count variable with relatively less variability than in the original variable. The proposed mechanism, termed condensation, attracts probability mass from the quantiles in the tails of the original distribution and redirects it toward quantiles around the expected value. If the original distribution can be simulated, then the simulation of variates from a condensed distribution is straightforward. Moreover, condensed distributions have a simple mean parametrization, a characteristic useful in a count regression context. An application to the negative binomial distribution yields a distribution allowing under-, equi-, and overdispersion. In addition to graphical insights, fields of application of special cases of the condensed Poisson and condensed negative binomial distributions are pointed out, indicating the potential of condensation for a flexible analysis of count data.
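The "easy direction" the authors contrast with, building overdispersion by mixing, can be sketched with the classical gamma–Poisson compound; condensation itself is the paper's contribution and is not reproduced here (sampler and function names are ours):

```python
import math
import random

def poisson_sample(lam, rng):
    """Knuth's multiplicative Poisson sampler (fine for moderate lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod < threshold:
            return k
        k += 1

def overdispersed_count(r, theta, rng):
    """Gamma(r, theta)-mixed Poisson: marginally negative binomial,
    with mean r*theta and variance r*theta*(1 + theta) > mean."""
    return poisson_sample(rng.gammavariate(r, theta), rng)

rng = random.Random(0)
xs = [overdispersed_count(2.0, 1.5, rng) for _ in range(20000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
# empirically var > mean (theory: mean = 3.0, variance = 7.5)
```

Randomizing the Poisson rate can only add variance, which is exactly why no such simple mechanism produces underdispersion and why condensation is needed.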


2017 ◽  
Vol 9 (3) ◽  
pp. 6
Author(s):  
Volition Tlhalitshi Montshiwa ◽  
Ntebogang Dinah Moroke

Abstract: Sample size requirements are common in many multivariate analysis techniques as one of the measures taken to ensure their robustness, but such requirements have received little attention in the area of count data models. This study therefore investigated the effect of sample size on the efficiency of six commonly used count data models: the Poisson regression model (PRM), negative binomial regression model (NBRM), zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB), Poisson hurdle model (PHM), and negative binomial hurdle model (NBHM). The data used in this study were sourced from Data First and were collected by Statistics South Africa through the Marriage and Divorce database. The six models were applied to ten randomly selected samples ranging in size from 4392 to 43916 and differing by 10%, and were compared using the Akaike information criterion (AIC), Bayesian information criterion (BIC), Vuong’s test for over-dispersion, McFadden’s R-squared, mean square error (MSE), and mean absolute deviation (MAD). The results revealed that, in general, the negative binomial-based models outperformed the Poisson-based models. However, the results did not reveal an effect of sample size on the efficiency of the models, since there was no consistent change in AIC, BIC, Vuong’s test for over-dispersion, McFadden’s R-squared, MSE, or MAD as the sample size increased.
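A toy sketch of the information-criterion comparison logic, using a one-parameter geometric as a stand-in for the heavier-tailed NB-type models; the data and names are illustrative, not from the study:

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2*logL (lower is better)."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: log(n)*k - 2*logL."""
    return math.log(n) * k - 2 * loglik

# Toy over-dispersed sample (variance well above the mean):
data = [0, 0, 1, 2, 7, 9]
n, mean = len(data), sum(data) / len(data)

# Poisson log-likelihood at its MLE (lambda = sample mean):
ll_pois = sum(-mean + x * math.log(mean) - math.lgamma(x + 1) for x in data)

# Geometric log-likelihood at its MLE (p = 1 / (1 + mean)):
p = 1 / (1 + mean)
ll_geom = n * math.log(p) + sum(data) * math.log(1 - p)

# Lower AIC wins; the heavier-tailed geometric does here:
better = min((aic(ll_pois, 1), "Poisson"), (aic(ll_geom, 1), "geometric"))
```

The study's comparison applies the same logic with six fitted models and four additional criteria, but the mechanics of penalized likelihood comparison are as above.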


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Ashenafi A. Yirga ◽  
Sileshi F. Melesse ◽  
Henry G. Mwambi ◽  
Dawit G. Ayele

Abstract: It is of great interest for a biomedical analyst or an investigator to correctly model the CD4 cell count or disease biomarkers of a patient in the presence of covariates or factors determining the disease progression over time. The Poisson mixed-effects model (PMM) can be an appropriate choice for repeated count data. However, this model is unrealistic because of the restriction that the mean and variance are equal. Therefore, the PMM is replaced by the negative binomial mixed-effects model (NBMM), which effectively manages the over-dispersion of the longitudinal data. We evaluate and compare the proposed models and their application to the number of CD4 cells of HIV-infected patients recruited in the CAPRISA 002 Acute Infection Study. The results show that the NBMM has appropriate properties and outperforms the PMM in terms of handling over-dispersion of the data. Multiple imputation techniques are also used to handle missing values in the dataset to obtain valid inferences for parameter estimates. In addition, the results imply that baseline BMI, HAART initiation, baseline viral load, and the number of sexual partners were significantly associated with the patient’s CD4 count in both fitted models. Comparison, discussion, and conclusion of the results of the fitted models complete the study.
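The PMM restriction, and the way the NBMM relaxes it, can be seen in two small functions; a hedged sketch with diagnostic names of our own choosing, not the authors':

```python
def dispersion_ratio(counts):
    """Sample variance-to-mean ratio: about 1 is consistent with a
    Poisson model; well above 1 signals over-dispersion and favours
    a negative binomial specification."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((x - mean) ** 2 for x in counts) / (n - 1)
    return var / mean

def nb_variance(mu, r):
    """NB2 variance function used by negative binomial mixed models:
    mu + mu**2 / r, which collapses to the Poisson variance (= mu)
    as the shape parameter r grows large."""
    return mu + mu ** 2 / r

equi = dispersion_ratio([1, 2, 3, 4, 5, 6])  # exactly 1.0
over = dispersion_ratio([0, 0, 1, 10])       # well above 1
```

A ratio well above 1 on the observed CD4 counts is the kind of evidence that motivates swapping the PMM for the NBMM.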


2020 ◽  
Vol 36 (8) ◽  
pp. 2345-2351 ◽  
Author(s):  
Xinyan Zhang ◽  
Nengjun Yi

Abstract: Motivation: Longitudinal metagenomics data, including both 16S rRNA and whole-metagenome shotgun sequencing data, have enhanced our ability to understand the dynamic associations between the human microbiome and various diseases. However, analytic tools have not been fully developed to simultaneously address the main challenges of longitudinal metagenomics data, i.e. high dimensionality, dependence among samples, and zero-inflation of observed counts.
Results: We propose a fast zero-inflated negative binomial mixed modeling (FZINBMM) approach to analyze high-dimensional longitudinal metagenomic count data. The FZINBMM approach is based on zero-inflated negative binomial mixed models (ZINBMMs) for modeling longitudinal metagenomic count data and a fast EM-IWLS algorithm for fitting ZINBMMs. FZINBMM takes advantage of a commonly used procedure for fitting linear mixed models, which allows us to include various types of fixed and random effects and within-subject correlation structures and to quickly analyze many taxa. We found that FZINBMM remarkably outperformed, in computational efficiency, two R packages that use numerical integration to fit ZINBMMs, GLMMadaptive and glmmTMB, while remaining statistically comparable with them. Extensive simulations and real data applications showed that FZINBMM outperformed other previous methods, including linear mixed models, negative binomial mixed models and zero-inflated Gaussian mixed models.
Availability and implementation: FZINBMM has been implemented in the R package NBZIMM, available in the public GitHub repository http://github.com//nyiuab//NBZIMM.
Supplementary information: Supplementary data are available at Bioinformatics online.

