Statistical Properties of Similarity Score Functions

2006 ◽  
Vol DMTCS Proceedings vol. AG,... (Proceedings) ◽  
Author(s):  
Jérémie Bourdon ◽  
Alban Mancheron

In computational biology, a large number of problems, such as pattern discovery, deal with the comparison of several sequences (of nucleotides, proteins, or genes, for instance). Very often, algorithms that address these problems use score functions that reflect a notion of similarity between the sequences. The most efficient methods benefit from theoretical knowledge of the classical behavior of these score functions, such as their mean, their variance, and sometimes their asymptotic distribution in a given probabilistic model. In this paper, we study a recent family of score functions introduced in Mancheron 2003, which makes it possible to compare two words of the same length. Here, the similarity takes into account all matches and mismatches between two sequences and not only the longest common subsequence as in the case of classical algorithms such as BLAST or FASTA. Based on generating functions, we provide closed formulas for the mean and the variance of these functions in an independent probabilistic model. Finally, we prove that every function in this family asymptotically behaves as a Gaussian random variable.

2012 ◽  
Vol DMTCS Proceedings vol. AQ,... (Proceedings) ◽  
Author(s):  
Patrick Bindjeme ◽  
James Allen Fill

In a continuous-time setting, Fill (2012) proved, for a large class of probabilistic sources, that the number of symbol comparisons used by $\texttt{QuickSort}$, when centered by subtracting the mean and scaled by dividing by time, has a limiting distribution, but proved little about that limiting random variable $Y$, not even that it is nondegenerate. We establish the nondegeneracy of $Y$. The proof is perhaps surprisingly difficult.


2006 ◽  
Vol 38 (03) ◽  
pp. 827-852 ◽  
Author(s):  
Raphael Hauser ◽  
Servet Martínez ◽  
Heinrich Matzinger

Consider the random variable $L_n$ defined as the length of a longest common subsequence of two random strings of length $n$ whose random characters are independent and identically distributed over a finite alphabet. Chvátal and Sankoff showed that the limit $\gamma = \lim_{n\to\infty} E[L_n]/n$ is well defined. The exact value of this constant is not known, but various methods for the computation of upper and lower bounds have been discussed in the literature. Even so, high-precision bounds are hard to come by. In this paper we discuss how large deviation theory can be used to derive a consistent sequence of upper bounds, $(q_m)_{m\in\mathbb{N}}$, on $\gamma$, and how Monte Carlo simulation can be used in theory to compute estimates, $\hat{q}_m$, of the $q_m$ such that, for given $\Xi > 0$ and $\Lambda \in (0,1)$, we have $P[\gamma < \hat{q} < \gamma + \Xi] \ge \Lambda$. In other words, with high probability the result is an upper bound that approximates $\gamma$ to high precision. We establish $O((1-\Lambda)^{-1}\Xi^{-(4+\varepsilon)})$ as a theoretical upper bound on the complexity of computing $\hat{q}_m$ to the given level of accuracy and confidence. Finally, we discuss a practical heuristic based on our theoretical approach and discuss its empirical behavior.
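
To make the quantity E[L_n]/n concrete, here is a minimal Python sketch that averages longest-common-subsequence lengths over random string pairs. It is not the large-deviation upper-bound method of the paper: it is a plain Monte Carlo average, and the function names `lcs_length` and `estimate_ratio`, the binary alphabet, the trial count, and the seed are all illustrative assumptions. Because E[L_n] is superadditive, E[L_n]/n does not exceed γ, so this naive estimate tends to sit below γ, whereas the abstract is concerned with upper bounds.

```python
import random


def lcs_length(a, b):
    """Length of a longest common subsequence, by the standard row-by-row DP."""
    prev = [0] * (len(b) + 1)  # DP row for the prefix of `a` processed so far
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def estimate_ratio(n, alphabet="01", trials=200, seed=0):
    """Monte Carlo average of L_n / n over i.i.d. uniform strings (hypothetical helper)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = [rng.choice(alphabet) for _ in range(n)]
        b = [rng.choice(alphabet) for _ in range(n)]
        total += lcs_length(a, b)
    return total / (trials * n)


if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, estimate_ratio(n, trials=50))
```

Each call costs on the order of trials × n² character comparisons, so large n quickly becomes expensive with this direct approach.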


2021 ◽  
Vol 58 (2) ◽  
pp. 335-346
Author(s):  
Mackenzie Simper

Consider an urn containing balls labeled with integer values. Define a discrete-time random process by drawing two balls, one at a time and with replacement, and noting the labels. Add a new ball labeled with the sum of the two drawn labels. This model was introduced by Siegmund and Yakir (2005), Ann. Prob. 33, 2036, for labels taking values in a finite group, in which case the distribution defined by the urn converges to the uniform distribution on the group. For the urn of integers, the main result of this paper is an exponential limit law. The mean of the exponential is a random variable with distribution depending on the starting configuration. This is a novel urn model which combines multi-drawing and an infinite type of balls. The proof of convergence uses the contraction method for recursive distributional equations.
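
The urn dynamics described above can be sketched in a few lines of Python. This is only a simulation of the drawing rule as stated in the abstract (two draws with replacement, then add a ball labeled with their sum); it does not reproduce the paper's normalization or its limit law. The function name `simulate_urn`, the starting configuration, and the seed are illustrative assumptions.

```python
import random


def simulate_urn(initial_labels, steps, seed=0):
    """Run the sum-urn for `steps` rounds and return the list of all ball labels."""
    rng = random.Random(seed)
    urn = list(initial_labels)
    for _ in range(steps):
        first = rng.choice(urn)   # first draw; the ball goes back into the urn
        second = rng.choice(urn)  # second draw, so the same ball may be drawn twice
        urn.append(first + second)
    return urn


if __name__ == "__main__":
    urn = simulate_urn([1], steps=1000)
    print(len(urn), max(urn), sum(urn) / len(urn))
```

Running it from different starting configurations gives a rough feel for how the empirical distribution of labels depends on the initial urn.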

