Merging Multi-Version Texts: a Generic Solution to the Overlap Problem

Author(s):  
Desmond Schmidt

Multi-Version Documents, or MVDs, as described by Schmidt and Colomb (Schm09), provide a simple format for representing overlapping structures in digital text. They permit the reuse of existing technologies, such as XML, to encode the content of individual versions, while allowing overlapping hierarchies (separate, partial or conditional) and textual variation (insertions, deletions, alternatives and transpositions) to exist within the same document. Most desired operations on MVDs can be performed by simple algorithms in linear time. Creating and editing MVDs, however, is a much harder operation, one that resembles the multiple sequence alignment problem in biology. The inclusion of transposition in the alignment process makes this a hard problem, with no known solutions that are both optimal and practical. A suitable heuristic algorithm can nevertheless be devised, based in part on recent biological alignment programs; its time complexity is quadratic in the worst case and often much better in practice. The results are satisfactory both in terms of speed and alignment quality. This means that MVDs can be considered a practical and editable format suitable for representing many cases of overlapping structure in digital text.
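
A minimal sketch in Python of the idea behind the format (the pair-list layout below is an assumption for exposition, not Schmidt and Colomb's exact serialization): an MVD can be viewed as an ordered list of (set-of-versions, fragment) pairs, from which any single version is recovered in one linear pass.

    # Hypothetical MVD-like structure: each fragment is tagged with the
    # set of versions it belongs to; list order is global document order.
    mvd = [
        ({1, 2}, "The quick "),
        ({1},    "brown "),   # insertion present only in version 1
        ({2},    "red "),     # alternative reading in version 2
        ({1, 2}, "fox."),
    ]

    def read_version(mvd, v):
        # One linear pass: keep the fragments whose version set contains v.
        return "".join(text for versions, text in mvd if v in versions)

    print(read_version(mvd, 1))  # The quick brown fox.
    print(read_version(mvd, 2))  # The quick red fox.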

2020
Author(s):
Ahsan Sanaullah
Degui Zhi
Shaojie Zhang

Durbin's PBWT, a scalable data structure for haplotype matching, has been successfully applied to identical-by-descent (IBD) segment identification and genotype imputation. Once the PBWT of a haplotype panel is constructed, it supports efficient retrieval of all shared long segments among all individuals (long matches) and efficient query between an external haplotype and the panel. However, the standard PBWT is an array-based static data structure and does not support dynamic updates of the panel. Here, we generalize the static PBWT to a dynamic data structure, d-PBWT, where the reverse prefix sorting at each position is represented by linked lists. We developed efficient algorithms for insertion and deletion of individual haplotypes. In addition, we verified that d-PBWT can support all algorithms of PBWT. In doing so, we systematically investigated variations of the set-maximal match and long match query algorithms: while they all have average-case time complexity independent of database size, they differ in worst-case complexity, in whether they are linear in the size of the genome, and in their dependency on additional data structures.
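
For context, a minimal sketch of the static, array-based construction that d-PBWT generalizes, following Durbin's positional sort; the function name and the 0/1 panel encoding are illustrative assumptions.

    def build_pbwt_orders(haps):
        # haps: list of M binary haplotypes (sequences of 0/1 alleles),
        # all of length N. orders[k] is Durbin's positional prefix array
        # a_k: haplotype indices sorted by reversed prefix before site k.
        M, N = len(haps), len(haps[0])
        a = list(range(M))
        orders = [a[:]]
        for k in range(N):
            # A stable partition by the allele at site k preserves the
            # reverse-prefix order; d-PBWT performs the same partition on
            # linked lists so single haplotypes can be inserted/deleted.
            a = [i for i in a if haps[i][k] == 0] + \
                [i for i in a if haps[i][k] == 1]
            orders.append(a[:])
        return orders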


Algorithms
2019
Vol 12 (6)
pp. 124
Author(s):  
Sukhpal Ghuman
Emanuele Giaquinta
Jorma Tarhio

We present two modifications of Duval's algorithm for computing the Lyndon factorization of a string. The first is designed for strings containing runs of the smallest character: it works best on small alphabets, is able to skip a significant number of characters of the string, and can be engineered to have linear time complexity in the worst case. The second computes the Lyndon factorization of a run-length encoded string R of length ρ in O(ρ) time and constant space. Experimental results show that the new variations are faster than Duval's original algorithm in many scenarios.
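
For reference, a standard textbook rendering of the unmodified Duval's algorithm that both variants build on (not the authors' engineered versions):

    def duval(s):
        # Lyndon factorization in O(n) time and O(1) extra space:
        # returns factors w1 >= w2 >= ... >= wk, each a Lyndon word.
        n, i = len(s), 0
        factors = []
        while i < n:
            j, k = i + 1, i
            while j < n and s[k] <= s[j]:
                k = i if s[k] < s[j] else k + 1
                j += 1
            while i <= k:
                factors.append(s[i:i + j - k])
                i += j - k
        return factors

    print(duval("banana"))  # ['b', 'an', 'an', 'a']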


2017
Vol 27 (01n02)
pp. 85-119
Author(s):  
Karl Bringmann
Marvin Künnemann

The Fréchet distance is a well-studied and very popular measure of similarity of two curves. The best known algorithms have quadratic time complexity, which has recently been shown to be optimal assuming the Strong Exponential Time Hypothesis (SETH) [Bringmann, FOCS'14]. To overcome the worst-case quadratic time barrier, restricted classes of curves have been studied that attempt to capture realistic input curves. The most popular such class is c-packed curves, for which the Fréchet distance has a (1+ε)-approximation in time O(cn/ε + cn log n) [Driemel et al., DCG'12]. In dimension d ≥ 5 this cannot be improved to O((cn/√ε)^(1−δ)) for any δ > 0 unless SETH fails [Bringmann, FOCS'14]. In this paper, exploiting properties that prevent stronger lower bounds, we present an improved algorithm with time complexity Õ(cn/√ε). This improves upon the algorithm by Driemel et al. for any ε < 1. Moreover, our algorithm's dependence on c, n, and ε is optimal in high dimensions apart from lower-order factors, unless SETH fails. Our main new ingredients are as follows: for filling the classical free-space diagram we project short subcurves onto a line, which yields one-dimensional separated curves with roughly the same pairwise distances between vertices. Then we tackle this special case in near-linear time by carefully extending a greedy algorithm for the Fréchet distance of one-dimensional separated curves.
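
To make the quadratic baseline concrete, here is the classical O(nm) dynamic program for the discrete Fréchet distance; it is a simplified stand-in for the continuous free-space computation discussed above, not the paper's improved algorithm.

    from math import dist

    def discrete_frechet(P, Q):
        # D[i][j] = discrete Fréchet distance of prefixes P[:i+1], Q[:j+1];
        # this quadratic table is the barrier the c-packed analysis beats.
        n, m = len(P), len(Q)
        D = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                d = dist(P[i], Q[j])
                if i == 0 and j == 0:
                    D[i][j] = d
                elif i == 0:
                    D[i][j] = max(D[i][j - 1], d)
                elif j == 0:
                    D[i][j] = max(D[i - 1][j], d)
                else:
                    D[i][j] = max(d, min(D[i - 1][j - 1],
                                         D[i - 1][j], D[i][j - 1]))
        return D[n - 1][m - 1]

    print(discrete_frechet([(0, 0), (1, 0)], [(0, 1), (1, 1)]))  # 1.0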


2005
Vol 03 (01)
pp. 1-18
Author(s):  
Francis Y. L. Chin
N. L. Ho
T. W. Lam
Prudence W. H. Wong

The constrained multiple sequence alignment problem is to align a set of sequences of maximum length n subject to a given constraint sequence, which arises from some knowledge of the structure of the sequences. This paper presents new algorithms for this problem that are more efficient in both time and space (memory) than the previous algorithms [15], and that carry a worst-case guarantee on the quality of the alignment. Reducing the space requirement by a quadratic factor is particularly significant, as the previous O(n^4)-space algorithm has limited application due to its huge memory requirement. Experiments on real data sets confirm that the new algorithms improve both alignment quality and resource requirements.
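
As background for the space trade-off, the unconstrained pairwise building block is the standard quadratic-time, quadratic-space alignment dynamic program sketched below; this is an illustrative baseline, not the authors' constrained algorithm, which in addition tracks how much of the constraint sequence has been matched.

    def align_score(a, b, match=1, mismatch=-1, gap=-1):
        # Needleman-Wunsch: O(|a||b|) time and space for two sequences.
        # Constrained variants layer extra dimensions over this table,
        # which is one way earlier algorithms reached O(n^4) space.
        n, m = len(a), len(b)
        D = [[j * gap for j in range(m + 1)] for _ in range(n + 1)]
        for i in range(1, n + 1):
            D[i][0] = i * gap
            for j in range(1, m + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                D[i][j] = max(D[i - 1][j - 1] + s,
                              D[i - 1][j] + gap,
                              D[i][j - 1] + gap)
        return D[n][m]

    print(align_score("GATTACA", "GCATGCU"))  # 0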


2018
Author(s):  
Edgar Garriga
Paolo Di Tommaso
Cedrik Magis
Ionas Erb
Hafid Laayouni
...  

Inferences derived from large multiple alignments of biological sequences are critical to many areas of biology, including evolution, genomics, biochemistry, and structural biology. However, the complexity of the alignment problem imposes the use of approximate solutions. The most common is the progressive algorithm, which starts by aligning the most similar sequences and incorporates the remaining ones in the order imposed by a guide tree. We developed, and validated on protein sequences, a regressive algorithm that works the other way around, aligning the most dissimilar sequences first. Our algorithm produces more accurate alignments than non-regressive methods, especially on datasets larger than 10,000 sequences. By design, it can run any existing alignment method in linear time, thus allowing the scale-up required for extremely large genomic analyses.

One Sentence Summary: Initiating alignments with the most dissimilar sequences allows slow and accurate methods to be used on large datasets.
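
A schematic sketch of the regressive decomposition (the Node class and job layout are hypothetical, intended only to show the top-down order; the merging of child alignments through their shared representatives is omitted):

    class Node:
        def __init__(self, children=(), seq_id=None):
            self.children = list(children)  # internal node: child clades
            self.seq_id = seq_id            # leaf: one sequence identifier

    def representative(node):
        # One representative per clade; here simply the leftmost leaf.
        return node.seq_id if not node.children \
            else representative(node.children[0])

    def alignment_jobs(node):
        # The root job aligns representatives of the deepest splits, i.e.
        # the most dissimilar sequences, before any similar pair is touched.
        if not node.children:
            return []
        jobs = [[representative(c) for c in node.children]]
        for c in node.children:
            jobs += alignment_jobs(c)
        return jobs

    tree = Node([Node([Node(seq_id="s1"), Node(seq_id="s2")]),
                 Node([Node(seq_id="s3"), Node(seq_id="s4")])])
    print(alignment_jobs(tree))  # [['s1', 's3'], ['s1', 's2'], ['s3', 's4']]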


Author(s):  
Nirmal K. Nair ◽  
James H. Oliver

An efficient algorithm is presented to determine the blank shape necessary to manufacture a surface by press forming. The technique is independent of material properties; instead, it uses surface geometry and an area-conservation constraint to generate a geometrically feasible blank shape. The algorithm is formulated as an approximate geometric interpretation of the reversal of the forming process. The primary applications of this technique are in preliminary surface design, assessment of manufacturability, and location of binder wrap. Since the algorithm exhibits linear time complexity, it is amenable to implementation as an interactive design aid. The algorithm is applied to two example surfaces and the results are discussed.
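
The area-conservation ingredient can be made concrete with a small helper (an illustrative fragment, not the authors' algorithm): a geometrically feasible flat blank must enclose the same area as the formed surface, which for a polygonal boundary reduces to the shoelace formula, itself computable in linear time.

    def polygon_area(pts):
        # Shoelace formula: area enclosed by a simple planar polygon,
        # given as a list of (x, y) vertices in boundary order.
        s = 0.0
        for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
            s += x1 * y2 - x2 * y1
        return abs(s) / 2.0

    print(polygon_area([(0, 0), (2, 0), (2, 1), (0, 1)]))  # 2.0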

