Statistical Properties of Similarity Score Functions

Jérémie Bourdon; Alban Mancheron

doi:10.46298/dmtcs.3502

Statistical Properties of Similarity Score Functions

Discrete Mathematics & Theoretical Computer Science ◽

10.46298/dmtcs.3502 ◽

2006 ◽

Vol DMTCS Proceedings vol. AG,... (Proceedings) ◽

Author(s):

Jérémie Bourdon ◽

Alban Mancheron

Keyword(s):

Probabilistic Model ◽

Random Variable ◽

Similarity Score ◽

Longest Common Subsequence ◽

Score Functions ◽

Common Subsequence ◽

The Mean ◽

Classical Behavior ◽

Or Genes

In computational biology, a large number of problems, such as pattern discovery, deal with the comparison of several sequences (of nucleotides, proteins or genes, for instance). Very often, algorithms that address these problems use score functions that reflect a notion of similarity between the sequences. The most efficient methods benefit from theoretical knowledge of the classical behavior of these score functions, such as their mean, their variance, and sometimes their asymptotic distribution in a given probabilistic model. In this paper, we study a recent family of score functions introduced in Mancheron 2003, which allows the comparison of two words of the same length. Here, the similarity takes into account all matches and mismatches between two sequences and not only the longest common subsequence as in the case of classical algorithms such as BLAST or FASTA. Based on generating functions, we provide closed formulas for the mean and the variance of these functions in an independent probabilistic model. Finally, we prove that every function in this family asymptotically behaves as a Gaussian random variable.
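To make the idea of a score with a closed-form mean and variance concrete, here is a minimal sketch for the simplest possible stand-in: the number of positions at which two equal-length words agree. This is not the score family of Mancheron 2003, which the abstract does not define. The function names and the uniform nucleotide distribution are assumptions made for illustration. In an independent model where both words draw each letter from the same distribution, the match count is binomial, so its mean is $np$ and its variance is $np(1-p)$ with $p = \sum_a p_a^2$.

```python
def match_probability(probs):
    """Probability that two independent letters drawn from `probs` are equal."""
    return sum(p * p for p in probs.values())


def match_count_moments(probs, n):
    """Mean and variance of the number of matching positions between two
    independent random words of length n (a Binomial(n, p) variable).

    Illustrative stand-in only: NOT the score family studied in the paper.
    """
    p = match_probability(probs)
    return n * p, n * p * (1 - p)


def count_matches(u, v):
    """Number of positions where the equal-length words u and v agree."""
    if len(u) != len(v):
        raise ValueError("words must have the same length")
    return sum(a == b for a, b in zip(u, v))


# Hypothetical uniform model over the four nucleotides.
uniform = {letter: 0.25 for letter in "ACGT"}
mean, variance = match_count_moments(uniform, 100)
```

Since a binomial count is asymptotically Gaussian after centering and scaling, this toy score already shows the kind of limiting behavior that the abstract states for the whole family.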


The Limiting Distribution for the Number of Symbol Comparisons Used by QuickSort is Nondegenerate (Extended Abstract)

Discrete Mathematics & Theoretical Computer Science ◽

10.46298/dmtcs.3004 ◽

2012 ◽

Vol DMTCS Proceedings vol. AQ,... (Proceedings) ◽

Author(s):

Patrick Bindjeme ◽

James Allen Fill

Keyword(s):

Large Class ◽

Continuous Time ◽

Random Variable ◽

Limiting Distribution ◽

The Mean

In a continuous-time setting, Fill (2012) proved, for a large class of probabilistic sources, that the number of symbol comparisons used by $\texttt{QuickSort}$, when centered by subtracting the mean and scaled by dividing by time, has a limiting distribution, but proved little about that limiting random variable $Y$, not even that it is nondegenerate. We establish the nondegeneracy of $Y$. The proof is perhaps surprisingly difficult.
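As an illustration of what counting symbol comparisons (rather than key comparisons) can mean, the following is a minimal sketch and not the continuous-time model of the paper. It assumes string keys, a first-element pivot, and one count for every pair of symbols examined while comparing two keys. The names `symbol_compare` and `quicksort` are hypothetical.

```python
def symbol_compare(a, b, counter):
    """Compare strings a and b symbol by symbol.

    counter[0] is increased by one for every pair of symbols examined.
    Returns a negative, zero, or positive value, as a comparator would.
    """
    for x, y in zip(a, b):
        counter[0] += 1
        if x != y:
            return -1 if x < y else 1
    # One string is a prefix of the other: the shorter one sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


def quicksort(keys, counter):
    """Return a sorted copy of keys, tallying symbol comparisons in counter."""
    if len(keys) <= 1:
        return list(keys)
    pivot, rest = keys[0], keys[1:]
    smaller, larger = [], []
    for key in rest:
        if symbol_compare(key, pivot, counter) < 0:
            smaller.append(key)
        else:
            larger.append(key)
    return quicksort(smaller, counter) + [pivot] + quicksort(larger, counter)


counter = [0]
result = quicksort(["ba", "ab", "aa"], counter)
```

Keys that share long prefixes cost more symbol comparisons per key comparison, which is why the two counts can behave differently.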


Large deviations-based upper bounds on the expected relative length of longest common subsequences

Advances in Applied Probability ◽

10.1017/s0001867800001294 ◽

2006 ◽

Vol 38 (03) ◽

pp. 827-852 ◽

Cited By ~ 1

Author(s):

Raphael Hauser ◽

Servet Martínez ◽

Heinrich Matzinger

Keyword(s):

High Precision ◽

Upper Bound ◽

Large Deviation ◽

Relative Length ◽

Random Variable ◽

Upper Bounds ◽

Longest Common Subsequence ◽

Upper And Lower Bounds ◽

Finite Alphabet ◽

Common Subsequence

Consider the random variable $L_n$ defined as the length of a longest common subsequence of two random strings of length $n$ and whose random characters are independent and identically distributed over a finite alphabet. Chvátal and Sankoff showed that the limit $\gamma = \lim_{n\to\infty} E[L_n]/n$ is well defined. The exact value of this constant is not known, but various methods for the computation of upper and lower bounds have been discussed in the literature. Even so, high-precision bounds are hard to come by. In this paper we discuss how large deviation theory can be used to derive a consistent sequence of upper bounds, $(q_m)_{m\in\mathbb{N}}$, on $\gamma$, and how Monte Carlo simulation can be used in theory to compute estimates, $\hat{q}_m$, of the $q_m$ such that, for given $\Xi > 0$ and $\Lambda \in (0,1)$, we have $P[\gamma < \hat{q} < \gamma + \Xi] \geq \Lambda$. In other words, with high probability the result is an upper bound that approximates $\gamma$ to high precision. We establish $O((1 - \Lambda)^{-1}\Xi^{-(4+\varepsilon)})$ as a theoretical upper bound on the complexity of computing $\hat{q}_m$ to the given level of accuracy and confidence. Finally, we discuss a practical heuristic based on our theoretical approach and discuss its empirical behavior.
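The quantity $E[L_n]/n$ from the abstract above can be estimated naively by simulation. The sketch below is not the large-deviation method of the paper and does not compute the upper bounds $q_m$. It assumes a uniform distribution over a small alphabet, and the names `lcs_length` and `estimate_ratio` are hypothetical. Because $E[L_n]$ is superadditive in $n$, $E[L_n]/n$ never exceeds $\gamma$, so an estimate of this kind targets a value at or below $\gamma$.

```python
import random


def lcs_length(a, b):
    """Length of a longest common subsequence of sequences a and b.

    Standard dynamic program, O(len(a) * len(b)) time, O(len(b)) memory.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def estimate_ratio(n, alphabet="01", trials=200, seed=0):
    """Monte Carlo estimate of E[L_n]/n for two i.i.d. uniform strings.

    The uniform distribution over `alphabet` is an assumption of this sketch.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = [rng.choice(alphabet) for _ in range(n)]
        b = [rng.choice(alphabet) for _ in range(n)]
        total += lcs_length(a, b)
    return total / (trials * n)


ratio = estimate_ratio(50, trials=20)
```

Increasing `n` moves $E[L_n]/n$ toward $\gamma$ from below, while increasing `trials` only reduces the sampling noise at a fixed `n`.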


Correction: The Rate of Convergence of the Mean Length of the Longest Common Subsequence

The Annals of Applied Probability ◽

10.1214/aoap/1177004844 ◽

1995 ◽

Vol 5 (1) ◽

pp. 327-327

Author(s):

K. S. Alexander

Keyword(s):

Rate Of Convergence ◽

Longest Common Subsequence ◽

Common Subsequence ◽

The Mean


A Formula for the Mean Length of the Longest Common Subsequence

Journal of Siberian Federal University Mathematics & Physics ◽

10.17516/1997-1397-2017-10-1-71-74 ◽

2017 ◽

Vol 10 (1) ◽

pp. 71-74

Author(s):

Sergej V. Znamenskij ◽

Keyword(s):

Longest Common Subsequence ◽

Common Subsequence ◽

The Mean


The Rate of Convergence of the Mean Length of the Longest Common Subsequence

The Annals of Applied Probability ◽

10.1214/aoap/1177004903 ◽

1994 ◽

Vol 4 (4) ◽

pp. 1074-1082 ◽

Cited By ~ 30

Author(s):

Kenneth S. Alexander

Keyword(s):

Rate Of Convergence ◽

Longest Common Subsequence ◽

Common Subsequence ◽

The Mean


Large deviations-based upper bounds on the expected relative length of longest common subsequences

Advances in Applied Probability ◽

10.1239/aap/1158685004 ◽

2006 ◽

Vol 38 (3) ◽

pp. 827-852 ◽

Cited By ~ 5

Author(s):

Raphael Hauser ◽

Servet Martínez ◽

Heinrich Matzinger

Keyword(s):

High Precision ◽

Upper Bound ◽

Large Deviation ◽

Relative Length ◽

Random Variable ◽

Upper Bounds ◽

Longest Common Subsequence ◽

Upper And Lower Bounds ◽

Finite Alphabet ◽

Common Subsequence

Consider the random variable $L_n$ defined as the length of a longest common subsequence of two random strings of length $n$ and whose random characters are independent and identically distributed over a finite alphabet. Chvátal and Sankoff showed that the limit $\gamma = \lim_{n\to\infty} E[L_n]/n$ is well defined. The exact value of this constant is not known, but various methods for the computation of upper and lower bounds have been discussed in the literature. Even so, high-precision bounds are hard to come by. In this paper we discuss how large deviation theory can be used to derive a consistent sequence of upper bounds, $(q_m)_{m\in\mathbb{N}}$, on $\gamma$, and how Monte Carlo simulation can be used in theory to compute estimates, $\hat{q}_m$, of the $q_m$ such that, for given $\Xi > 0$ and $\Lambda \in (0,1)$, we have $P[\gamma < \hat{q} < \gamma + \Xi] \geq \Lambda$. In other words, with high probability the result is an upper bound that approximates $\gamma$ to high precision. We establish $O((1 - \Lambda)^{-1}\Xi^{-(4+\varepsilon)})$ as a theoretical upper bound on the complexity of computing $\hat{q}_m$ to the given level of accuracy and confidence. Finally, we discuss a practical heuristic based on our theoretical approach and discuss its empirical behavior.


XLCS: A New Bit-Parallel Longest Common Subsequence Algorithm on Xeon Phi Clusters

2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS) ◽

10.1109/hpcc/smartcity/dss.2019.00204 ◽

2019 ◽

Author(s):

Zekun Yin ◽

Hao Zhang ◽

Kai Xu ◽

Yuandong Chan ◽

Shaoliang Peng ◽

...

Keyword(s):

Longest Common Subsequence ◽

Xeon Phi ◽

Common Subsequence


Longest common subsequence as private search

Proceedings of the 8th ACM workshop on Privacy in the electronic society - WPES '09 ◽

10.1145/1655188.1655200 ◽

2009 ◽

Cited By ~ 5

Author(s):

Mark Gondree ◽

Payman Mohassel

Keyword(s):

Longest Common Subsequence ◽

Common Subsequence ◽

Private Search


Random additions in urns of integers

Journal of Applied Probability ◽

10.1017/jpr.2020.90 ◽

2021 ◽

Vol 58 (2) ◽

pp. 335-346

Author(s):

Mackenzie Simper

Keyword(s):

Finite Group ◽

Random Process ◽

Uniform Distribution ◽

Discrete Time ◽

Random Variable ◽

Urn Model ◽

Infinite Type ◽

The Mean ◽

A Finite Group ◽

Proof Of Convergence

Consider an urn containing balls labeled with integer values. Define a discrete-time random process by drawing two balls, one at a time and with replacement, and noting the labels. Add a new ball labeled with the sum of the two drawn labels. This model was introduced by Siegmund and Yakir (2005) Ann. Prob. 33, 2036 for labels taking values in a finite group, in which case the distribution defined by the urn converges to the uniform distribution on the group. For the urn of integers, the main result of this paper is an exponential limit law. The mean of the exponential is a random variable with distribution depending on the starting configuration. This is a novel urn model which combines multi-drawing and an infinite type of balls. The proof of convergence uses the contraction method for recursive distributional equations.
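The urn process described in the abstract above can be simulated in a few lines. This is a minimal sketch of the dynamics only: it does not apply the normalization behind the exponential limit law, which the abstract does not specify. The function name `simulate_urn` and the starting configuration are assumptions made for illustration.

```python
import random


def simulate_urn(initial, steps, seed=0):
    """Run the urn process for `steps` rounds and return all labels.

    Each round draws two balls uniformly at random, with replacement,
    and adds a new ball labeled with the sum of the two drawn labels.
    """
    rng = random.Random(seed)
    urn = list(initial)
    for _ in range(steps):
        first = rng.choice(urn)
        second = rng.choice(urn)  # with replacement: may be the same ball
        urn.append(first + second)
    return urn


# Hypothetical starting configuration: a single ball labeled 1.
urn = simulate_urn([1], steps=1000)
```

Running the simulation from different starting configurations gives a quick empirical view of how the label distribution depends on the initial urn.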


Side Channel Leakage Alignment Based on Longest Common Subsequence

2020 IEEE 14th International Conference on Big Data Science and Engineering (BigDataSE) ◽

10.1109/bigdatase50710.2020.00025 ◽

2020 ◽

Author(s):

Anni Jia ◽

Wei Yang ◽

Gongxuan Zhang

Keyword(s):

Longest Common Subsequence ◽

Side Channel ◽

Common Subsequence
