Field-theoretic density estimation for biological sequence space with applications to 5′ splice site diversity and aneuploidy in cancer

2021 ◽  
Vol 118 (40) ◽  
pp. e2025782118
Author(s):  
Wei-Chia Chen ◽  
Juannan Zhou ◽  
Jason M. Sheltzer ◽  
Justin B. Kinney ◽  
David M. McCandlish

Density estimation in sequence space is a fundamental problem in machine learning that is also of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy (i.e., calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates). Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data are plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data are sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5′ splice sites found in the human genome and to understand patterns of chromosomal abnormalities across human cancers.
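The two endpoints of the family described above can be sketched on toy data. The snippet below is not the authors' method: it assumes the simplest maximum entropy case, where only single-site frequencies are matched, in which case the maximum entropy distribution is the product of the site marginals. The function names and the toy sample are hypothetical.

```python
from collections import Counter
from itertools import product


def empirical_frequencies(seqs):
    """Observed sample frequency of each distinct sequence."""
    counts = Counter(seqs)
    total = len(seqs)
    return {s: c / total for s, c in counts.items()}


def site_marginals(seqs, alphabet):
    """Per-position character frequencies (the first-order constraints)."""
    total = len(seqs)
    marginals = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        marginals.append({a: counts.get(a, 0) / total for a in alphabet})
    return marginals


def maxent_first_order(seqs, alphabet):
    """Maximum entropy distribution matching only single-site frequencies.

    Assumption: with first-order constraints alone, the solution is the
    product of the site marginals, so no iterative fitting is needed.
    """
    marginals = site_marginals(seqs, alphabet)
    dist = {}
    for chars in product(alphabet, repeat=len(marginals)):
        p = 1.0
        for pos, a in enumerate(chars):
            p *= marginals[pos][a]
        dist["".join(chars)] = p
    return dist


# Toy data for illustration only; not taken from the paper.
sample = ["GTA", "GTA", "GTG", "GCA"]
alphabet = "ACGT"
emp = empirical_frequencies(sample)
me = maxent_first_order(sample, alphabet)
print(emp.get("GCG", 0.0), me["GCG"])
```

In this toy case the sequence `GCG` never appears in the sample, so its observed frequency is zero, while the first-order maximum entropy estimate still assigns it positive probability. The estimates described in the abstract lie between these two extremes.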

2020 ◽  
Author(s):  
Wei-Chia Chen ◽  
Juannan Zhou ◽  
Jason M Sheltzer ◽  
Justin B Kinney ◽  
David M McCandlish

Density estimation in sequence space is a fundamental problem in machine learning that is of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy, i.e. calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates. Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data is plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data is sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5′ splice sites found in the human genome and to understand the accumulation of chromosomal abnormalities during cancer progression.


Author(s):  
Qunfeng Dong ◽  
Xiang Gao

Accurate estimations of the seroprevalence of antibodies to severe acute respiratory syndrome coronavirus 2 need to properly consider the specificity and sensitivity of the antibody tests. In addition, prior knowledge of the extent of viral infection in a population may also be important for adjusting the estimation of seroprevalence. For this purpose, we have developed a Bayesian approach that can incorporate the variabilities of specificity and sensitivity of the antibody tests, as well as the prior probability distribution of seroprevalence. We have demonstrated the utility of our approach by applying it to a recently published large-scale dataset from the US CDC, with our results providing entire probability distributions of seroprevalence instead of single-point estimates. Our Bayesian code is freely available at https://github.com/qunfengdong/AntibodyTest.
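The idea of reporting a full posterior for seroprevalence, adjusted for test error, can be illustrated with a grid approximation. This is a minimal sketch and not the code in the repository above: it assumes a uniform prior on prevalence and holds sensitivity and specificity fixed, whereas the abstract describes modelling their variability. All counts and test characteristics below are made-up illustration values.

```python
import math


def seroprevalence_posterior(positives, tested, sensitivity, specificity,
                             grid_size=1001):
    """Grid posterior over true prevalence under a uniform prior.

    Assumption: sensitivity and specificity are known constants.
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    log_post = []
    for prev in grid:
        # Probability that a randomly tested person returns a positive result.
        p_pos = prev * sensitivity + (1 - prev) * (1 - specificity)
        p_pos = min(max(p_pos, 1e-12), 1 - 1e-12)  # guard against log(0)
        log_post.append(positives * math.log(p_pos)
                        + (tested - positives) * math.log(1 - p_pos))
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]  # stable normalisation
    total = sum(weights)
    return grid, [w / total for w in weights]


def credible_interval(grid, post, mass=0.95):
    """Equal-tailed credible interval read off the cumulative grid posterior."""
    lower_tail = (1 - mass) / 2
    cum = 0.0
    lo = hi = None
    for g, p in zip(grid, post):
        cum += p
        if lo is None and cum >= lower_tail:
            lo = g
        if hi is None and cum >= 1 - lower_tail:
            hi = g
    return lo, hi


# Hypothetical survey: 50 positives out of 1000 tests.
grid, post = seroprevalence_posterior(positives=50, tested=1000,
                                      sensitivity=0.90, specificity=0.98)
posterior_mean = sum(g * p for g, p in zip(grid, post))
interval = credible_interval(grid, post)
print(posterior_mean, interval)
```

An informative prior could be introduced by adding its log density at each grid point to `log_post` before normalising.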


2019 ◽  
Vol 7 ◽  
Author(s):  
SERGEY NORIN

We construct an $S_{3}$ -symmetric probability distribution on $\{(a,b,c)\in \mathbb{Z}_{{\geqslant}0}^{3}\,:\,a+b+c=n\}$ such that its marginal achieves the maximum entropy among all probability distributions on $\{0,1,\ldots ,n\}$ with mean $n/3$ . Existence of such a distribution verifies a conjecture of Kleinberg et al. [‘The growth rate of tri-colored sum-free sets’, Discrete Anal. (2018), Paper No. 12, arXiv:1607.00047v1], which is motivated by the study of sum-free sets.
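The target marginal in this statement can be computed numerically. By a standard Lagrange-multiplier argument, the maximum entropy distribution on $\{0,1,\ldots ,n\}$ with a fixed mean has probabilities proportional to $r^{k}$ for some $r>0$. The sketch below finds $r$ by bisection; it only computes this one-dimensional marginal and does not attempt the $S_{3}$-symmetric construction of the paper. The function name is hypothetical, and the search range assumes a target mean strictly between 0 and $n/2$, which holds for $n/3$.

```python
def maxent_with_mean(n, target_mean, tol=1e-12):
    """Maximum entropy distribution on {0, ..., n} with a given mean.

    Assumes 0 < target_mean < n / 2, so the solution has p_k proportional
    to r**k with 0 < r < 1.
    """
    def dist(r):
        weights = [r ** k for k in range(n + 1)]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(r):
        return sum(k * p for k, p in enumerate(dist(r)))

    # The mean increases with r; r = 1 is the uniform case with mean n / 2.
    lo, hi = 1e-9, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)


n = 6
p = maxent_with_mean(n, n / 3)
achieved_mean = sum(k * pk for k, pk in enumerate(p))
print(p, achieved_mean)
```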


1980 ◽  
Vol 102 (3) ◽  
pp. 460-468
Author(s):  
J. N. Siddall ◽  
Ali Badawy

A new algorithm using the maximum entropy principle is introduced to estimate the probability distribution of a random variable, using directly a ranked sample. It is demonstrated that almost all of the analytical probability distributions can be approximated by the new algorithm. A comparison is made between existing methods and the new algorithm; and examples are given of fitting the new distribution to an actual ranked sample.


2018 ◽  
Vol 107 (3) ◽  
pp. 302-318
Author(s):  
JONATHAN BORWEIN ◽  
PHIL HOWLETT

In modelling joint probability distributions it is often desirable to incorporate standard marginal distributions and match a set of key observed mixed moments. At the same time it may also be prudent to avoid additional unwarranted assumptions. The problem is to find the least ordered distribution that respects the prescribed constraints. In this paper we will construct a suitable joint probability distribution by finding the checkerboard copula of maximum entropy that allows us to incorporate the appropriate marginal distributions and match the nominated set of observed moments.


1984 ◽  
Vol R-33 (4) ◽  
pp. 353-357
Author(s):  
James E. Miller ◽  
Richard W. Kulp ◽  
George E. Orr

Author(s):  
MICHAEL J. MARKHAM

In an expert system having a consistent set of linear constraints it is known that the Method of Tribus may be used to determine a probability distribution which exhibits maximised entropy. The method is extended here to include independence constraints (Accommodation). The paper proceeds to discuss this extension, and its limitations, then goes on to advance a technique for determining a small set of independencies which can be added to the linear constraints required in a particular representation of an expert system called a causal network, so that the Maximum Entropy and Causal Networks methodologies give matching distributions (Emulation). This technique may also be applied in cases where no initial independencies are given and the linear constraints are incomplete, in order to provide an optimal ME fill-in for the missing information.


2011 ◽  
Vol 09 (supp01) ◽  
pp. 39-47
Author(s):  
ALESSIA ALLEVI ◽  
MARIA BONDANI ◽  
ALESSANDRA ANDREONI

We present the experimental reconstruction of the Wigner function of some optical states. The method is based on direct intensity measurements by non-ideal photodetectors operated in the linear regime. The signal state is mixed at a beam-splitter with a set of coherent probes of known complex amplitudes and the probability distribution of the detected photons is measured. The Wigner function is given by a suitable sum of these probability distributions measured for different values of the probe. For comparison, the same data are analyzed to obtain the number distributions and the Wigner functions for photons.

