Poisson regression for linguists: A tutorial introduction to modeling count data with brms

2021 ◽  
Author(s):  
Bodo Winter ◽  
Paul-Christian Bürkner

Count data is prevalent in many different areas of linguistics, such as when counting words, syntactic constructions, discourse particles, case markers, or speech errors. The Poisson distribution is the canonical distribution for characterizing count data with no or unknown upper bound. Whereas logistic regression is very common in linguistics, Poisson regression is little known. This tutorial introduces readers to foundational concepts needed for Poisson regression, followed by a hands-on tutorial using the R package brms. We discuss a dataset where Catalan and Korean speakers change the frequency of their co-speech gestures as a function of politeness contexts. This dataset also involves exposure variables (the incorporation of time to deal with unequal intervals) and overdispersion (excess variance). Altogether, we hope that more linguists will consider Poisson regression for the analysis of count data.
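As a rough illustration of the exposure idea mentioned in this abstract, the sketch below shows how an observation interval can enter a Poisson model as a log offset, so that the expected count scales with the time observed. This is plain Python rather than brms, and the coefficient values, the predictor coding, and the function names are hypothetical; none of them come from the paper.

```python
import math


def poisson_pmf(k, lam):
    """Return P(Y = k) for a Poisson variable with mean lam (lam > 0)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def expected_count(intercept, slope, x, exposure):
    """Mean count under a log link, with log(exposure) added as an offset."""
    return math.exp(intercept + slope * x + math.log(exposure))


# Hypothetical values: a baseline rate of 0.5 events per second and a
# predictor x (say, 1 = polite context) that lowers the log rate by 0.2.
b0, b1 = math.log(0.5), -0.2

mu_10s = expected_count(b0, b1, x=0, exposure=10.0)     # 10-second interval
mu_20s = expected_count(b0, b1, x=0, exposure=20.0)     # 20-second interval
mu_polite = expected_count(b0, b1, x=1, exposure=10.0)  # polite, 10 seconds

print(f"expected counts: {mu_10s:.2f}, {mu_20s:.2f}, {mu_polite:.2f}")
print(f"P(Y = 3 | mean {mu_10s:.2f}) = {poisson_pmf(3, mu_10s):.4f}")
```

Because the offset has a fixed coefficient of 1, doubling the exposure doubles the expected count while the rate per unit time stays the same, which is what makes counts from unequal intervals comparable.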

Author(s):  
Osval Antonio Montesinos López ◽  
Abelardo Montesinos López ◽  
Jose Crossa

Abstract. In this chapter, we explain, under a Bayesian framework, the fundamentals and practical issues for implementing genomic prediction models for categorical and count traits. First, we derive the Bayesian ordinal model and exemplify it with plant breeding data. These examples were implemented in the library BGLR. We also derive the ordinal logistic regression. The fundamentals and practical issues of penalized multinomial logistic regression and penalized Poisson regression are given including several examples illustrating the use of the glmnet library. All the examples include main effects of environments and genotypes as well as the genotype × environment interaction term.


Author(s):  
Lili Puspita Rahayu ◽  
Kusman Sadik ◽  
Indahwati Indahwati

The Poisson distribution is a discrete distribution that is often used to model rare events. Such data take the form of counts, that is, non-negative integers. Poisson regression is one method for modeling count data. A frequent violation of its assumptions is overdispersion. One cause of overdispersion is an excess of zeros in the response variable. Zero-inflated Poisson (ZIP) regression is a model used to address this kind of overdispersion. This research aimed to study overdispersion in Poisson and ZIP regression under different data characteristics. These characteristics were simulated by combining the Poisson parameter (λ), the zero probability (p), and the sample size (n) of the response variable, and the Poisson and ZIP regression models were then compared. The simulation study showed that the larger λ, n, and p are, the better the ZIP model performs relative to Poisson regression. These results are also supported by an exploration of the Pearson residuals in Poisson and ZIP regression.
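The link between excess zeros and overdispersion can be made concrete with the standard ZIP moments: with zero probability p and Poisson mean λ, the mean is (1 − p)λ and the variance is (1 − p)λ(1 + pλ), so the variance-to-mean ratio is 1 + pλ. The following minimal sketch, which uses illustrative parameter values rather than the settings of the study, evaluates these quantities.

```python
import math


def poisson_pmf(k, lam):
    """Return P(Y = k) for a Poisson variable with mean lam (lam > 0)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def zip_pmf(k, lam, p):
    """P(Y = k) under a ZIP model: a structural zero with probability p,
    otherwise a draw from Poisson(lam)."""
    poisson_part = (1.0 - p) * poisson_pmf(k, lam)
    return p + poisson_part if k == 0 else poisson_part


def zip_mean_var(lam, p):
    """Closed-form mean and variance of the ZIP distribution."""
    mean = (1.0 - p) * lam
    var = mean * (1.0 + p * lam)
    return mean, var


# Illustrative (lambda, p) pairs, chosen only for demonstration.
for lam, p in [(1.0, 0.1), (4.0, 0.3), (8.0, 0.5)]:
    mean, var = zip_mean_var(lam, p)
    print(f"lambda={lam}, p={p}: mean={mean:.2f}, var={var:.2f}, "
          f"var/mean={var / mean:.2f}")
```

The ratio 1 + pλ grows with both p and λ, which is consistent with the abstract's finding that the ZIP model gains more over Poisson regression as these parameters increase.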


Author(s):  
Constantin Ahlmann-Eltze ◽  
Wolfgang Huber

Abstract. Motivation. The Gamma-Poisson distribution is a theoretically and empirically motivated model for the sampling variability of single cell RNA-sequencing counts (Grün et al., 2014; Svensson, 2020; Silverman et al., 2018; Hafemeister and Satija, 2019) and an essential building block for analysis approaches including differential expression analysis (Robinson et al., 2010; McCarthy et al., 2012; Anders and Huber, 2010; Love et al., 2014), principal component analysis (Townes et al., 2019) and factor analysis (Risso et al., 2018). Existing implementations for inferring its parameters from data often struggle with the size of single cell datasets, which can comprise millions of cells; at the same time, they do not take full advantage of the fact that zero and other small numbers are frequent in the data. These limitations have hampered uptake of the model, leaving room for statistically inferior approaches such as logarithm(-like) transformation. Results. We present a new R package for fitting the Gamma-Poisson distribution to data with the characteristics of modern single cell datasets more quickly and more accurately than existing methods. The software can work with data on disk without having to load them into RAM simultaneously. Availability. The package glmGamPoi is available from Bioconductor for Windows, macOS, and Linux, and source code is available on github.com/const-ae/glmGamPoi under a GPL-3 license.
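For readers unfamiliar with the Gamma-Poisson (negative binomial) distribution, the sketch below writes out its probability mass function in a mean and overdispersion parameterization, where the variance is μ + αμ². This parameterization and the function names are assumptions made for illustration; the code is a plain Python sketch and does not reproduce the glmGamPoi fitting algorithm or its API.

```python
import math


def gamma_poisson_pmf(k, mu, alpha):
    """P(Y = k) for a Gamma-Poisson variable with mean mu > 0 and
    overdispersion alpha > 0 (assumed parameterization: var = mu + alpha * mu**2)."""
    r = 1.0 / alpha  # shape ("size") of the underlying Gamma mixing distribution
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu))
             + k * math.log(mu / (r + mu)))
    return math.exp(log_p)


def numeric_moments(mu, alpha, upper=2000):
    """Approximate mean and variance by summing the pmf over 0..upper-1."""
    probs = [gamma_poisson_pmf(k, mu, alpha) for k in range(upper)]
    mean = sum(k * p for k, p in enumerate(probs))
    second = sum(k * k * p for k, p in enumerate(probs))
    return mean, second - mean ** 2


# Illustrative values only: a low mean with moderate overdispersion.
mu, alpha = 3.0, 0.5
mean, var = numeric_moments(mu, alpha)
print(f"mean={mean:.3f}, var={var:.3f}, mu + alpha*mu^2={mu + alpha * mu**2:.3f}")
print(f"P(Y = 0) = {gamma_poisson_pmf(0, mu, alpha):.4f}")
```

With a small mean the mass at zero is large, which illustrates why zeros and other small counts dominate data of this kind.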


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Arnaud Liehrmann ◽  
Guillem Rigaill ◽  
Toby Dylan Hocking

Abstract. Background. Histone modification constitutes a basic mechanism for the genetic regulation of gene expression. In the early 2000s, a powerful technique emerged that couples chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq). This technique provides a direct survey of the DNA regions associated with these modifications. In order to realize the full potential of this technique, increasingly sophisticated statistical algorithms have been developed or adapted to analyze the massive amount of data it generates. Many of these algorithms were built around natural assumptions such as the Poisson distribution to model the noise in the count data. In this work we start from these natural assumptions and show that it is possible to improve upon them. Results. Our comparisons on seven reference datasets of histone modifications (H3K36me3 & H3K4me3) suggest that natural assumptions are not always realistic under application conditions. We show that the unconstrained multiple changepoint detection model with alternative noise assumptions and supervised learning of the penalty parameter reduces the over-dispersion exhibited by count data. These models, implemented in the R package CROCS (https://github.com/aLiehrmann/CROCS), detect the peaks more accurately than algorithms which rely on natural assumptions. Conclusion. The segmentation models we propose can benefit researchers in the field of epigenetics by providing new high-quality peak prediction tracks for H3K36me3 and H3K4me3 histone modifications.


Author(s):  
J. M. Muñoz-Pichardo ◽  
R. Pino-Mejías ◽  
J. García-Heras ◽  
F. Ruiz-Muñoz ◽  
M. Luz González-Regalado

Author(s):  
Dexter Cahoy ◽  
Elvira Di Nardo ◽  
Federico Polito

Abstract. Within the framework of probability models for overdispersed count data, we propose the generalized fractional Poisson distribution (gfPd), which is a natural generalization of the fractional Poisson distribution (fPd), and the standard Poisson distribution. We derive some properties of gfPd and more specifically we study moments, limiting behavior and other features of fPd. The skewness suggests that fPd can be left-skewed, right-skewed or symmetric; this makes the model flexible and appealing in practice. We apply the model to real big count data and estimate the model parameters using maximum likelihood. Then, we turn to the very general class of weighted Poisson distributions (WPD’s) to allow both overdispersion and underdispersion. Similarly to Kemp’s generalized hypergeometric probability distribution, which is based on hypergeometric functions, we analyze a class of WPD’s related to a generalization of Mittag–Leffler functions. The proposed class of distributions includes the well-known COM-Poisson and the hyper-Poisson models. We characterize conditions on the parameters allowing for overdispersion and underdispersion, and analyze two special cases of interest which have not yet appeared in the literature.


2020 ◽  
Vol 2020 ◽  
pp. 1-9
Author(s):  
Yong Wu ◽  
Qigai Yin ◽  
Xiaobao Zhang ◽  
Pin Zhu ◽  
Hengfei Luan ◽  
...  

Background. Sepsis is a systemic inflammatory syndrome caused by infection, with high incidence and mortality. Although long noncoding RNAs (lncRNAs) have been identified as closely involved in many inflammatory diseases, little is known about the role of lncRNAs in pediatric septic shock. Methods. We downloaded the mRNA profiles GSE13904 and GSE4607, of which GSE13904 includes 106 blood samples of pediatric patients with septic shock and 18 healthy control samples; GSE4607 includes 69 blood samples of pediatric patients with septic shock and 15 healthy control samples. The differentially expressed lncRNAs were identified through the limma R package; meanwhile, GO terms and KEGG pathway enrichment analysis was performed via the clusterProfiler R package. The protein-protein interaction (PPI) network was constructed based on the STRING database using the targets of differentially expressed lncRNAs. The MCODE plug-in of Cytoscape was used to screen significant clustering modules composed of key genes. Finally, stepwise regression analysis was performed to screen the optimal lncRNAs and construct the logistic regression model, and the ROC curve was applied to evaluate the accuracy of the model. Results. A total of 13 lncRNAs that exhibited significant differences between the septic shock group and the control group in both datasets were identified. According to the 18 targets of differentially expressed lncRNAs, we identified some inflammatory and immune response-related pathways. In addition, several target mRNAs were predicted to be potentially involved in the occurrence of septic shock. The logistic regression model constructed based on two optimal lncRNAs, THAP9-AS1 and TSPOAP1-AS1, could efficiently separate samples with septic shock from normal controls. Conclusion. In summary, a predictive model based on the lncRNAs THAP9-AS1 and TSPOAP1-AS1 provided novel insights into diagnostic research on septic shock.


2012 ◽  
Vol 57 (1) ◽  
Author(s):  
SEYED EHSAN SAFFAR ◽  
ROBIAH ADNAN ◽  
WILLIAM GREENE

A Poisson model is typically assumed for count data. In many cases the dependent variable contains many zeros, so its mean and variance are no longer equal, as the Poisson model requires. In fact, the variance of the dependent variable will be much larger than its mean; this is called over-dispersion. The Poisson model is therefore no longer suitable for this kind of data. Thus, it is suggested to use a hurdle Poisson regression model to overcome the over-dispersion problem. Furthermore, the response variable in such cases is censored for some values. In this paper, a censored hurdle Poisson regression model is introduced for count data with many zeros. In this model, we consider a response variable and one or more explanatory variables. The estimation of regression parameters using the maximum likelihood method is discussed and the goodness-of-fit of the regression model is examined. We study the effects of right censoring on estimated parameters and their standard errors via an example.
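To show how right censoring can enter a hurdle Poisson likelihood, here is a minimal sketch using the textbook hurdle form: zeros occur with probability π₀, and positive counts follow a zero-truncated Poisson. A right-censored observation at value y contributes P(Y ≥ y) instead of P(Y = y). The abstract does not give the paper's exact specification, so this formulation, the parameter values, and the toy observations are all assumptions.

```python
import math


def hurdle_poisson_pmf(k, lam, pi0):
    """P(Y = k): zero with probability pi0, otherwise zero-truncated Poisson(lam)."""
    if k == 0:
        return pi0
    log_pois = k * math.log(lam) - lam - math.lgamma(k + 1)
    truncation = 1.0 - math.exp(-lam)  # P(Poisson(lam) > 0)
    return (1.0 - pi0) * math.exp(log_pois) / truncation


def log_lik_contribution(y, lam, pi0, censored=False):
    """Log-likelihood of one observation; if right-censored at y, use P(Y >= y)."""
    if censored:
        below = sum(hurdle_poisson_pmf(j, lam, pi0) for j in range(y))
        return math.log(1.0 - below)
    return math.log(hurdle_poisson_pmf(y, lam, pi0))


# Hypothetical data as (value, censored?) pairs; the last count is only
# known to be at least 5.
observations = [(0, False), (2, False), (0, False), (5, True)]
lam, pi0 = 2.5, 0.4  # illustrative parameter values, not estimates

total = sum(log_lik_contribution(y, lam, pi0, c) for y, c in observations)
print(f"log-likelihood at lam={lam}, pi0={pi0}: {total:.4f}")
```

In a regression setting, λ and π₀ would each depend on explanatory variables through a link function, and the summed log-likelihood would be maximized over the regression coefficients.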


Author(s):  
Mamoru Saito

Japanese exhibits some unique features with respect to phrase structure and movement. It is well-known that its phrase structure is strictly head-final. It also provides ample evidence that a sentence may have more complex structure than its surface form suggests. Causative sentences are the best-known example of this. They appear to be simple sentences with verbs accompanying the causative suffix, -sase. But the causative suffix is an independent verb and takes a small clause vP complement in the syntactic representation. Japanese sentences can have a rich structure in the right periphery. For example, embedded clauses may contain up to three overt complementizers, corresponding to Finite (no), Interrogative (ka), and Report/Force (to). Matrix clauses may end in a sequence of discourse particles, such as wa, yo, and ne. Each of the complementizers and discourse particles has a selectional requirement of its own. More research is required to settle on the functional heads in the nominal structure. Among the controversial issues are whether D is present and whether Case markers should be analyzed as independent heads. Various kinds of movement operations are observed in the language. NP-movement to the subject position takes place in passive and unaccusative sentences, and clausal comparatives and clefts are derived by operator-movement. Scrambling is a unique movement operation that should be distinguished from both NP-movement and operator-movement. It does not establish operator-variable relations but is not subject to the locality requirements imposed on NP-movement. It cannot be PF-movement as it creates new binding possibilities. It is still debated whether head movement, for example, the movement of verb to tense, takes place in the language.

