Large-sample confidence intervals of information-theoretic measures in linguistics

Author(s):  
Ryan Ka Yau Lai ◽  
Youngah Do

This article explores a method of creating confidence bounds for information-theoretic measures in linguistics, such as entropy, Kullback-Leibler Divergence (KLD), and mutual information. We show that a useful measure of uncertainty can be derived from simple statistical principles, namely the asymptotic distribution of the maximum likelihood estimator (MLE) and the delta method. Three case studies from phonology and corpus linguistics are used to demonstrate how to apply it and examine its robustness against common violations of its assumptions in linguistics, such as insufficient sample size and non-independence of data points.
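As an illustration of the general idea, the sketch below computes a delta-method interval for the plug-in (MLE) entropy of a categorical distribution. It is a minimal example under stated assumptions, not the authors' implementation: the function name `entropy_ci`, the example counts, and the choice of a 95% Wald interval (`z=1.96`) are all hypothetical. It relies on the standard large-sample result that the plug-in entropy has approximate variance (Σ p_i (log p_i)² − H²) / n under independent multinomial sampling.

```python
import math

def entropy_ci(counts, z=1.96):
    """Plug-in entropy in nats with a delta-method Wald interval.

    Assumes the counts come from independent multinomial sampling.
    """
    n = sum(counts)
    # MLE of the category probabilities; empty categories contribute nothing.
    probs = [c / n for c in counts if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    # Delta method: Var(H_hat) ~ (sum p (log p)^2 - H^2) / n
    second_moment = sum(p * math.log(p) ** 2 for p in probs)
    var = max(second_moment - h * h, 0.0) / n
    half_width = z * math.sqrt(var)
    return h, h - half_width, h + half_width

# Hypothetical category counts, e.g. frequencies of four segments.
counts = [50, 30, 15, 5]
h, lo, hi = entropy_ci(counts)
print(f"entropy = {h:.3f} nats, approx. 95% CI = ({lo:.3f}, {hi:.3f})")
```

The interval collapses to a point when the estimated distribution is exactly uniform, because the first-order variance term vanishes there. This is one of the situations in which a first-order delta-method interval should not be trusted.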

2020 ◽  
Vol 15 (2) ◽  
pp. 2335-2348
Author(s):  
Issa Cherif Geraldo

In this paper, we study the maximum likelihood estimator (MLE) of the parameter vector of a discrete multivariate crash frequencies model used in the statistical analysis of the effectiveness of a road safety measure. We derive the closed-form expression of the MLE, prove its strong consistency, and obtain the exact variance of each component of the MLE except one, whose variance is approximated via the delta method.


Author(s):  
Frank Nielsen ◽  
Ke Sun

Information-theoretic measures such as the entropy, cross-entropy and the Kullback-Leibler divergence between two mixture models are core primitives in many signal processing tasks. Since the Kullback-Leibler divergence of mixtures provably does not admit a closed-form formula, it is in practice either estimated using costly Monte-Carlo stochastic integration, approximated, or bounded using various techniques. We present a fast and generic method that builds algorithmically closed-form lower and upper bounds on the entropy, the cross-entropy and the Kullback-Leibler divergence of mixtures. We illustrate the versatile method by reporting on our experiments for approximating the Kullback-Leibler divergence between univariate exponential mixtures, Gaussian mixtures, Rayleigh mixtures, and Gamma mixtures.
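For context, the Monte Carlo baseline mentioned in the abstract can be sketched as follows for univariate Gaussian mixtures. This is not the closed-form bounding method the paper proposes; it is only the sampling estimate that such bounds are meant to avoid. The function names, the `(weights, means, sds)` tuple representation, and the example mixtures are assumptions made for illustration.

```python
import math
import random

def gmm_pdf(x, weights, means, sds):
    """Density of a univariate Gaussian mixture at x."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, sds)
    )

def gmm_sample(rng, weights, means, sds):
    """Draw one value: pick a component by weight, then sample from it."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], sds[k])

def mc_kl(p, q, n=20000, seed=0):
    """Monte Carlo estimate of KL(p || q) as the mean of log p(x)/q(x), x ~ p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = gmm_sample(rng, *p)
        total += math.log(gmm_pdf(x, *p) / gmm_pdf(x, *q))
    return total / n

# Hypothetical mixtures: a bimodal p and a single standard normal q.
p = ([0.5, 0.5], [-2.0, 2.0], [0.5, 0.5])
q = ([1.0], [0.0], [1.0])
est = mc_kl(p, q)
print(f"KL(p || q) is roughly {est:.3f} nats")
```

The estimate is stochastic and its error shrinks only at rate 1/√n, which is the cost the abstract refers to. The sketch also assumes q(x) does not underflow to zero at any sampled point.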


Author(s):  
Hazim Mansour Gorgees ◽  
Bushra Abdualrasool Ali ◽  
Raghad Ibrahim Kathum

In this paper, the maximum likelihood estimator and the Bayes estimator of the reliability function for the negative exponential distribution are derived, and a Monte Carlo simulation is then used to compare the performance of these estimators. The integral mean square error (IMSE) is used as the criterion for this comparison. The simulation results show that the Bayes estimator performs better than the maximum likelihood estimator for different sample sizes.
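A comparison of this kind can be sketched as below. The abstract does not state the parameterization, the prior, or the loss function, so the following are all assumptions: a rate parameterization with R(t) = exp(−λt), a conjugate Gamma(a, b) prior on λ, squared-error loss, and an IMSE approximated by averaging squared errors over a fixed grid of t values. The function names and default settings are hypothetical.

```python
import math
import random

def mle_reliability(xs, t):
    """Plug-in MLE of R(t) = exp(-rate * t), with rate_hat = n / sum(xs)."""
    rate_hat = len(xs) / sum(xs)
    return math.exp(-rate_hat * t)

def bayes_reliability(xs, t, a=1.0, b=1.0):
    """Posterior mean of exp(-rate * t) under an assumed Gamma(a, b) prior.

    The posterior is Gamma(a + n, b + sum(xs)); its Laplace transform at t
    gives the closed form below.
    """
    s = b + sum(xs)
    return (s / (s + t)) ** (a + len(xs))

def imse(estimator, true_rate=1.0, n=10, reps=2000, seed=0):
    """Grid-averaged squared error of an estimator of R(t), averaged over reps."""
    rng = random.Random(seed)
    grid = [0.1 * k for k in range(1, 31)]  # t = 0.1, 0.2, ..., 3.0
    total = 0.0
    for _ in range(reps):
        xs = [rng.expovariate(true_rate) for _ in range(n)]
        total += sum(
            (estimator(xs, t) - math.exp(-true_rate * t)) ** 2 for t in grid
        ) / len(grid)
    return total / reps

print("MLE   IMSE:", imse(mle_reliability))
print("Bayes IMSE:", imse(bayes_reliability))
```

Which estimator comes out ahead in such a simulation depends on how well the assumed prior matches the true rate, so this sketch should not be read as reproducing the paper's results.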


2021 ◽  
Author(s):  
Jakob Raymaekers ◽  
Peter J. Rousseeuw

Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box–Cox and Yeo–Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so the transformed data can be approximately normal in the center and a few outliers may deviate from it. It compares favorably to existing techniques in an extensive simulation study and on real data.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
James M. Kunert-Graf ◽  
Nikita A. Sakhanenko ◽  
David J. Galas

Background: Permutation testing is often considered the “gold standard” for multi-test significance analysis, as it is an exact test requiring few assumptions about the distribution being computed. However, it can be computationally very expensive, particularly in its naive form in which the full analysis pipeline is re-run after permuting the phenotype labels. This can become intractable in multi-locus genome-wide association studies (GWAS), in which the number of potential interactions to be tested is combinatorially large. Results: In this paper, we develop an approach for permutation testing in multi-locus GWAS, specifically focusing on SNP–SNP-phenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based in Information Theory. We find that the computational bottleneck in this process is the construction of the count tables themselves, and that this step can be eliminated at each iteration of the permutation testing by transforming the count tables directly. This leads to a speed-up by a factor of over 10³ for a typical permutation test compared to the naive approach. Additionally, this approach is insensitive to the number of samples, making it suitable for datasets with a large number of samples. Conclusions: The proliferation of large-scale datasets with genotype data for hundreds of thousands of individuals enables new and more powerful approaches for the detection of multi-locus genotype-phenotype interactions. Our approach significantly improves the computational tractability of permutation testing for these studies. Moreover, our approach is insensitive to the large number of samples in these modern datasets. The code for performing these computations and replicating the figures in this paper is freely available at https://github.com/kunert/permute-counts.
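To make the naive baseline concrete, the sketch below runs a label-permutation test for the mutual information between one categorical variable and a phenotype, rebuilding the count table on every permutation. It is a simplified two-variable illustration, not the authors' count-table transformation and not the code in their repository. The function names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in mutual information (nats) from the joint count table of xs, ys."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        c / n * math.log(c * n / (px[x] * py[y])) for (x, y), c in joint.items()
    )

def permutation_pvalue(xs, ys, n_perm=999, seed=0):
    """Naive permutation test: shuffle the labels and recount every time."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # The count table is rebuilt from scratch here; this is the
        # bottleneck the abstract describes.
        if mutual_information(xs, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: a genotype-like variable that fully determines the phenotype.
genotype = [0, 1] * 50
phenotype = list(genotype)
mi = mutual_information(genotype, phenotype)
p_value = permutation_pvalue(genotype, phenotype)
print(f"MI = {mi:.3f} nats, permutation p-value = {p_value:.3f}")
```

Each permutation costs a full pass over the samples in this form, so runtime grows with both the number of permutations and the sample size.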


Author(s):  
Laurie Beth Feldman ◽  
Vidhushini Srinivasan ◽  
Rachel B. Fernandes ◽  
Samira Shaikh

Twitter data from a crisis that impacted many English–Spanish bilinguals show that the direction of codeswitches is associated with the statistically documented tendency of single speakers to prefer one language over another in their tweets, as gleaned from their tweeting history. Further, lexical diversity, a measure of vocabulary richness derived from information-theoretic measures of uncertainty in communication, is greater in proximity to a codeswitch than in productions remote from a switch. The prospects of a role for lexical diversity in characterizing the conditions for a language switch suggest that communicative precision may induce conditions that attenuate constraints against language mixing.

