On the Discretization of Continuous Probability Distributions Using a Probabilistic Rounding Mechanism

Chénangnon Frédéric Tovissodé; Sèwanou Hermann Honfo; Jonas Têlé Doumatè; Romain Glèlè Kakaï

doi:10.3390/math9050555

On the Discretization of Continuous Probability Distributions Using a Probabilistic Rounding Mechanism

Mathematics ◽

10.3390/math9050555 ◽

2021 ◽

Vol 9 (5) ◽

pp. 555

Author(s):

Chénangnon Frédéric Tovissodé ◽

Sèwanou Hermann Honfo ◽

Jonas Têlé Doumatè ◽

Romain Glèlè Kakaï

Keyword(s):

Poisson Distribution ◽

Count Data ◽

Probability Distributions ◽

Continuous Distribution ◽

Simple Expression ◽

Distribution Model ◽

Stochastic Representation ◽

Gamma Distributions ◽

The Family ◽

Jensen Shannon Divergence

Most existing flexible count distributions allow only approximate inference when used in a regression context. This work proposes a new framework to provide an exact and flexible alternative for modeling and simulating count data with various types of dispersion (equi-, under-, and over-dispersion). The new method, referred to as “balanced discretization”, consists of discretizing continuous probability distributions while preserving expectations. It is easy to generate pseudo random variates from the resulting balanced discrete distribution since it has a simple stochastic representation (probabilistic rounding) in terms of the continuous distribution. For illustrative purposes, we develop the family of balanced discrete gamma distributions that can model equi-, under-, and over-dispersed count data. This family of count distributions is appropriate for building flexible count regression models because the expectation of the distribution has a simple expression in terms of the parameters of the distribution. Using the Jensen–Shannon divergence measure, we show that under the equidispersion restriction, the family of balanced discrete gamma distributions is similar to the Poisson distribution. Based on this, we conjecture that while covering all types of dispersions, a count regression model based on the balanced discrete gamma distribution will allow recovering a near Poisson distribution model fit when the data are Poisson distributed.
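The probabilistic rounding described in this abstract can be sketched in a few lines. The sketch below assumes the mechanism takes the usual stochastic-rounding form, where a continuous value is rounded down and then incremented with probability equal to its fractional part; this choice keeps the expectation unchanged. The function names and the shape/scale parameterization of the gamma distribution are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def balanced_round(x, rng):
    """Round each value down or up at random so that E[result | x] = x."""
    x = np.asarray(x, dtype=float)
    lower = np.floor(x)
    frac = x - lower  # fractional part, in [0, 1)
    # Round up with probability equal to the fractional part.
    round_up = rng.random(x.shape) < frac
    return (lower + round_up).astype(np.int64)

def sample_balanced_discrete_gamma(shape, scale, size, seed=0):
    """Draw counts by probabilistically rounding gamma variates (assumed form)."""
    rng = np.random.default_rng(seed)
    x = rng.gamma(shape, scale, size=size)
    return balanced_round(x, rng)

counts = sample_balanced_discrete_gamma(shape=3.0, scale=1.5, size=200_000)
# The sample mean should be close to the gamma mean, shape * scale = 4.5.
print(counts.mean())
```

Because the rounding step is unbiased for every input value, the mean of the counts matches the mean of the underlying continuous distribution, which is the property the abstract relies on for regression modeling.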

On the Discretization of Continuous Probability Distributions for Flexible Count Regression

10.20944/preprints202101.0332.v1 ◽

2021 ◽

Author(s):

Chénangnon Frédéric Tovissodé ◽

Romain Glèlè Kakaï ◽

Sèwanou Hermann Honfo ◽

Jonas Têlé Doumatè

Keyword(s):

Poisson Distribution ◽

Count Data ◽

Probability Distributions ◽

Continuous Distribution ◽

Simple Expression ◽

Distribution Model ◽

Stochastic Representation ◽

Gamma Distributions ◽

The Family ◽

Jensen Shannon Divergence

Most existing flexible count regression models allow only approximate inference. This work proposes a new framework to provide an exact and flexible alternative for modeling and simulating count data with various types of dispersion (equi-, under- and overdispersion). The new method, referred to as “balanced discretization”, consists of discretizing continuous probability distributions while preserving expectations. It is easy to generate pseudo random variates from the resulting balanced discrete distribution since it has a simple stochastic representation in terms of the continuous distribution. For illustrative purposes, we have developed the family of balanced discrete gamma distributions, which can model equi-, under- and overdispersed count data. This family of count distributions is appropriate for building flexible count regression models because the expectation of the distribution has a simple expression in terms of the parameters of the distribution. Using the Jensen–Shannon divergence measure, we have shown that under the equidispersion restriction, the family of balanced discrete gamma distributions is similar to the Poisson distribution. Based on this, we conjecture that while covering all types of dispersion, a count regression model based on the balanced discrete gamma distribution will allow recovering a near Poisson distribution model fit when the data are Poisson distributed.
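To make the Jensen–Shannon comparison concrete, here is a minimal sketch of the divergence between two probability mass functions on a common truncated support. The abstract does not give the mass function of the balanced discrete gamma distribution, so two Poisson distributions stand in as inputs; the helper names and the truncation at 60 support points are assumptions made for illustration.

```python
import math

def poisson_pmf(mean, support_size):
    """Poisson probabilities for k = 0 .. support_size - 1, built recursively."""
    pmf = [math.exp(-mean)]
    for k in range(1, support_size):
        pmf.append(pmf[-1] * mean / k)
    return pmf

def kl_divergence(p, q):
    """Kullback-Leibler divergence in nats; terms with p_k = 0 contribute 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def jensen_shannon_divergence(p, q):
    """Average KL divergence of p and q from their midpoint mixture."""
    m = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = poisson_pmf(4.0, 60)
q = poisson_pmf(4.5, 60)
print(jensen_shannon_divergence(p, p))  # identical distributions
print(jensen_shannon_divergence(p, q))  # nearby distributions
```

The divergence is symmetric and, in nats, bounded above by ln 2, so values near zero indicate that two count distributions are close.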

minicore: Fast scRNA-seq clustering with various distances

10.1101/2021.03.24.436859 ◽

2021 ◽

Author(s):

Daniel N. Baker ◽

Nathan Dyjack ◽

Vladimir Braverman ◽

Stephanie C. Hicks ◽

Ben Langmead

Keyword(s):

Open Source ◽

Count Data ◽

Probability Distributions ◽

Expression Profiles ◽

Distance Measures ◽

Bhattacharyya Distance ◽

Link Type ◽

Leibler Divergence ◽

Careful Handling ◽

Jensen Shannon Divergence

Abstract: Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data produced by dimensionality reduction. Minicore’s novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and minibatch k-means can cluster a 4-million cell dataset in minutes, using less than 10GiB of RAM. This memory-efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.

Availability: The open source library is at https://github.com/dnbaker/minicore. Code used for experiments is at https://github.com/dnbaker/minicore-experiments.
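For readers unfamiliar with k-means++ center finding, the sketch below shows the textbook seeding rule: each new center is drawn with probability proportional to a point's current cost, here the squared Euclidean distance to its nearest chosen center. This is not minicore's API or its vectorized weighted reservoir sampling algorithm; the function name, the toy data, and the choice of distance are assumptions made for illustration, and the data are assumed to contain at least k distinct points.

```python
import numpy as np

def kmeanspp_centers(points, k, seed=0):
    """Return indices of k seed centers chosen by D^2 sampling."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [int(rng.integers(n))]  # first center: uniform at random
    # Cost of a point = squared distance to its nearest chosen center.
    cost = np.sum((points - points[chosen[0]]) ** 2, axis=1)
    for _ in range(1, k):
        # Sample the next center with probability proportional to cost.
        nxt = int(rng.choice(n, p=cost / cost.sum()))
        chosen.append(nxt)
        new_cost = np.sum((points - points[nxt]) ** 2, axis=1)
        cost = np.minimum(cost, new_cost)
    return np.array(chosen)

# Toy data: three well-separated groups of 50 points each.
rng = np.random.default_rng(1)
offsets = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([rng.normal(size=(50, 2)) + o for o in offsets])
centers = kmeanspp_centers(points, k=3)
print(points[centers])
```

Replacing the squared Euclidean cost with another dissimilarity, such as a divergence between normalized count vectors, changes only the two lines that compute `cost`.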

An Alternative Distribution for Modelling Overdispersion Count Data: Poisson Shanker Distribution

ICSA - International Conference on Statistics and Analytics 2019 ◽

10.29244/icsa.2019.pp108-120 ◽

2021 ◽

pp. 108-120

Author(s):

A Meytrianti ◽

S Nurrohmah ◽

M Novita

Keyword(s):

Poisson Distribution ◽

Count Data ◽

Continuous Distribution ◽

Mixture Distribution ◽

Real Data ◽

Data Set ◽

Mixing Distribution ◽

A Value ◽

Mean And Variance ◽

Parameter Values

The Poisson distribution is a common distribution for modelling count data under the assumption that the mean and variance are equal (equidispersion). In practice, most count data have a mean that is smaller than the variance (overdispersion), and the Poisson distribution cannot be used for modelling this kind of data. Several alternative distributions have therefore been introduced to solve this problem. One of them is the Shanker distribution, which has only one parameter. Since the Shanker distribution is continuous, it cannot be used for modelling count data. Therefore, a new distribution is proposed: the Poisson-Shanker distribution. It is obtained by mixing the Poisson and Shanker distributions, with the Shanker distribution as the mixing distribution. The result is a one-parameter mixture distribution that can be used for modelling overdispersed count data. In this paper, we show that the Poisson-Shanker distribution is unimodal, overdispersed, and right-skewed, with an increasing hazard rate. The first four raw moments and central moments are obtained. The parameter is estimated by maximum likelihood, and the solution is computed by numerical iteration. A real data set is used to illustrate the proposed distribution. The behaviour of the Poisson-Shanker parameter estimate is also studied by numerical simulation with several parameter values and sample sizes. The results show that the MSE and bias of the estimated parameter theta increase as the parameter value rises for a fixed n, and decrease as n rises for a fixed parameter value.
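The reason a Poisson mixture is overdispersed follows from the laws of total expectation and total variance. As a sketch, let $\Lambda$ denote a generic positive mixing variable (the symbol is an assumption, not notation from the abstract) and let $Y \mid \Lambda \sim \mathrm{Poisson}(\Lambda)$:

```latex
\begin{aligned}
\mathbb{E}[Y] &= \mathbb{E}\bigl[\mathbb{E}[Y \mid \Lambda]\bigr] = \mathbb{E}[\Lambda], \\
\operatorname{Var}(Y) &= \mathbb{E}\bigl[\operatorname{Var}(Y \mid \Lambda)\bigr]
  + \operatorname{Var}\bigl(\mathbb{E}[Y \mid \Lambda]\bigr)
  = \mathbb{E}[\Lambda] + \operatorname{Var}(\Lambda)
  \;\ge\; \mathbb{E}[Y].
\end{aligned}
```

The inequality is strict whenever $\Lambda$ is not constant, so any non-degenerate mixing distribution yields a variance larger than the mean.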

Log-Compound-Poisson Distribution Model of Intermittency in Turbulence

Zeitschrift für Naturforschung A ◽

10.1515/zna-1998-10-1105 ◽

1998 ◽

Vol 53 (10-11) ◽

pp. 828-832

Author(s):

Feng Quing-Zeng

Keyword(s):

Energy Dissipation ◽

Poisson Distribution ◽

Turbulent Energy ◽

Compound Poisson Distribution ◽

Distribution Model ◽

Velocity Difference ◽

Scaling Exponents ◽

Compound Poisson ◽

Developed Turbulence ◽

Experimental Values

Abstract: The log-compound-Poisson distribution for the breakdown coefficients of turbulent energy dissipation is proposed, and the scaling exponents for the velocity difference moments in fully developed turbulence are obtained, which agree well with experimental values up to measurable orders. The underlying physics of this model is directly related to the burst phenomenon in turbulence, and a detailed discussion is given in the last section.

Some properties of the unified skew-normal distribution

Statistical Papers ◽

10.1007/s00362-021-01235-2 ◽

2021 ◽

Author(s):

Reinaldo B. Arellano-Valle ◽

Adelchi Azzalini

Keyword(s):

Normal Distribution ◽

Fourth Order ◽

Probability Distributions ◽

Skew Normal Distribution ◽

Present Contribution ◽

The Family ◽

Multivariate Skewness ◽

Unified Skew Normal Distribution ◽

Skewness And Kurtosis ◽

Skew Normal

Abstract: For the family of multivariate probability distributions variously denoted as unified skew-normal, closed skew-normal and other names, a number of properties are already known, but many others are not, even some basic ones. The present contribution aims to fill some of these gaps. Specifically, the moments up to the fourth order are obtained, and from these the expressions of Mardia’s measures of multivariate skewness and kurtosis. Other results concern the property of log-concavity of the distribution, closure with respect to conditioning on intervals, and a possible alternative parameterization.

Flexible models for overdispersed and underdispersed count data

Statistical Papers ◽

10.1007/s00362-021-01222-7 ◽

2021 ◽

Author(s):

Dexter Cahoy ◽

Elvira Di Nardo ◽

Federico Polito

Keyword(s):

Poisson Distribution ◽

Count Data ◽

Hypergeometric Functions ◽

Natural Generalization ◽

Model Parameters ◽

Probability Models ◽

Limiting Behavior ◽

Poisson Models ◽

Special Cases ◽

Flexible Models

Abstract: Within the framework of probability models for overdispersed count data, we propose the generalized fractional Poisson distribution (gfPd), which is a natural generalization of the fractional Poisson distribution (fPd), and the standard Poisson distribution. We derive some properties of gfPd and more specifically we study moments, limiting behavior and other features of fPd. The skewness suggests that fPd can be left-skewed, right-skewed or symmetric; this makes the model flexible and appealing in practice. We apply the model to real big count data and estimate the model parameters using maximum likelihood. Then, we turn to the very general class of weighted Poisson distributions (WPD’s) to allow both overdispersion and underdispersion. Similarly to Kemp’s generalized hypergeometric probability distribution, which is based on hypergeometric functions, we analyze a class of WPD’s related to a generalization of Mittag–Leffler functions. The proposed class of distributions includes the well-known COM-Poisson and the hyper-Poisson models. We characterize conditions on the parameters allowing for overdispersion and underdispersion, and analyze two special cases of interest which have not yet appeared in the literature.

Marginal regression models for clustered count data based on zero-inflated Conway-Maxwell-Poisson distribution with applications

Biometrics ◽

10.1111/biom.12436 ◽

2015 ◽

Vol 72 (2) ◽

pp. 606-618 ◽

Cited By ~ 13

Author(s):

Hyoyoung Choo-Wosoba ◽

Steven M. Levy ◽

Somnath Datta

Keyword(s):

Poisson Distribution ◽

Count Data ◽

Regression Models ◽

Marginal Regression

Continuous distribution model for the investigation of complex molecular architectures near interfaces with scattering techniques

Journal of Applied Physics ◽

10.1063/1.3661986 ◽

2011 ◽

Vol 110 (10) ◽

pp. 102216 ◽

Cited By ~ 41

Author(s):

Prabhanshu Shekhar ◽

Hirsh Nanda ◽

Mathias Lösche ◽

Frank Heinrich

Keyword(s):

Continuous Distribution ◽

Distribution Model ◽

Molecular Architectures

Risk Planning with Discrete Distribution Analysis Applied to Petroleum Spills

International Journal of Risk and Contingency Management ◽

10.4018/ijrcm.2013100105 ◽

2013 ◽

Vol 2 (4) ◽

pp. 61-78 ◽

Cited By ~ 9

Author(s):

Roy L. Nersesian ◽

Kenneth David Strang

Keyword(s):

Probability Distribution ◽

Probability Distributions ◽

Low Cost ◽

Discrete Distribution ◽

Distribution Model ◽

Distribution Analysis ◽

Distribution Models ◽

Discrete Probability ◽

Nonparametric Statistical ◽

Spreadsheet Software

This study discussed the theoretical literature related to developing probability distributions for estimating uncertainty. A theoretically selected ten-year empirical sample was collected and evaluated for the Albany NY area (N=942). A discrete probability distribution model was developed and applied for part of the sample, to illustrate the likelihood of petroleum spills by industry and day of week. The benefit of this paper for the community of practice was to demonstrate how to select, develop, test and apply a probability distribution to analyze the patterns in disaster events, using inferential parametric and nonparametric statistical techniques. The method, not the model, was intended to be generalized to other researchers and populations. An interesting side benefit from this study was that it revealed significant findings about where and when most of the human-attributed petroleum leaks had occurred in the Albany NY area over the last ten years (ending in 2013). The researchers demonstrated how to develop and apply distribution models in low cost spreadsheet software (Excel).

Two Useful Discrete Distributions to Model Overdispersed Count Data

Revista Colombiana de Estadística ◽

10.15446/rce.v43n1.77052 ◽

2020 ◽

Vol 43 (1) ◽

pp. 21-48

Author(s):

Josmar Mazucheli ◽

Wesley Bertoli ◽

Ricardo Puziol Oliveira

Keyword(s):

Infinite Series ◽

Count Data ◽

Survival Function ◽

Continuous Distribution ◽

Discrete Distributions ◽

Mathematical Expressions ◽

Probability Mass ◽

The Difference ◽

Discrete Analogues ◽

Mass Functions

Methods for obtaining discrete analogues of continuous distributions have been widely considered in recent years. In general, the discretization process provides probability mass functions that can be competitive with the traditional model used in the analysis of count data, the Poisson distribution. The discretization procedure also avoids the use of a continuous distribution in the analysis of strictly discrete data. In this paper, we seek to introduce two discrete analogues of the Shanker distribution, using the method of infinite series and the method based on the survival function, as alternatives to model overdispersed datasets. Despite the difference between discretization methods, the resulting distributions are interchangeable. However, the distribution generated by the method of infinite series has simpler mathematical expressions for the shape, the generating functions and the central moments. The maximum likelihood theory is considered for estimation and asymptotic inference concerns. A simulation study is carried out in order to evaluate some frequentist properties of the developed methodology. The usefulness of the proposed models is evaluated using real datasets provided by the literature.
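The survival-function method mentioned in this abstract is commonly stated as assigning to each integer k the probability S(k) - S(k + 1), where S is the survival function of the continuous distribution. The sketch below applies that rule to an exponential distribution rather than the Shanker distribution, because the exponential case has a known closed form (a geometric distribution) that can be checked; the function name and the rate value are illustrative assumptions.

```python
import math

def discretize_by_survival(survival, k_max):
    """Mass at k is S(k) - S(k + 1), for k = 0 .. k_max - 1."""
    return [survival(k) - survival(k + 1) for k in range(k_max)]

rate = 0.7  # hypothetical exponential rate, chosen for illustration

def exponential_survival(x):
    """Survival function of an exponential distribution with the rate above."""
    return math.exp(-rate * x)

pmf = discretize_by_survival(exponential_survival, 50)

# Known result: the discretized exponential is geometric on {0, 1, 2, ...}
# with success probability p = 1 - exp(-rate).
p = 1 - math.exp(-rate)
geometric = [p * (1 - p) ** k for k in range(50)]

max_gap = max(abs(a - b) for a, b in zip(pmf, geometric))
print(max_gap)  # floating-point rounding only
```

The same function accepts any survival function on the non-negative reals, so a discrete analogue of another continuous model can be obtained by passing its survival function instead.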