insertions and deletions
Recently Published Documents


TOTAL DOCUMENTS

241
(FIVE YEARS 51)

H-INDEX

33
(FIVE YEARS 3)

2021 ◽  
Vol 17 (4) ◽  
pp. 1-12
Author(s):  
Robert E. Tarjan ◽  
Caleb Levy ◽  
Stephen Timmel

We introduce the zip tree, a form of randomized binary search tree that integrates previous ideas into one practical, performant, and pleasant-to-implement package. A zip tree is a binary search tree in which each node has a numeric rank and the tree is (max)-heap-ordered with respect to ranks, with rank ties broken in favor of smaller keys. Zip trees are essentially treaps [8], except that ranks are drawn from a geometric distribution instead of a uniform distribution, and we allow rank ties. These changes enable us to use fewer random bits per node. We perform insertions and deletions by unmerging and merging paths (unzipping and zipping) rather than by doing rotations, which avoids some pointer changes and improves efficiency. The methods of zipping and unzipping take inspiration from previous top-down approaches to insertion and deletion by Stephenson [10], Martínez and Roura [5], and Sprugnoli [9]. From a theoretical standpoint, this work provides two main results. First, zip trees require only O(log log n) bits (with high probability) to represent the largest rank in an n-node binary search tree; previous data structures require O(log n) bits for the largest rank. Second, zip trees are naturally isomorphic to skip lists [7], and simplify Dean and Jones’ mapping between skip lists
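The unzip and zip operations described in this abstract can be sketched in a few lines. The following is a minimal recursive illustration, not the authors' implementation: it assumes distinct integer keys and a geometric rank with success probability 1/2, and the names `Node`, `random_rank`, `unzip`, `zip_trees`, `insert`, and `delete` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    key: int
    rank: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def random_rank(rng: random.Random) -> int:
    """Geometric rank: number of successful fair coin flips before the first failure."""
    rank = 0
    while rng.random() < 0.5:
        rank += 1
    return rank


def unzip(node: Optional[Node], key: int) -> Tuple[Optional[Node], Optional[Node]]:
    """Split a subtree into (keys smaller than key, keys larger than key)."""
    if node is None:
        return None, None
    if node.key < key:
        smaller, larger = unzip(node.right, key)
        node.right = smaller
        return node, larger
    smaller, larger = unzip(node.left, key)
    node.left = larger
    return smaller, node


def zip_trees(a: Optional[Node], b: Optional[Node]) -> Optional[Node]:
    """Merge two trees where every key in a is smaller than every key in b."""
    if a is None:
        return b
    if b is None:
        return a
    if a.rank >= b.rank:  # rank ties favor the smaller key, so a stays on top
        a.right = zip_trees(a.right, b)
        return a
    b.left = zip_trees(a, b.left)
    return b


def insert(root: Optional[Node], key: int, rank: int) -> Node:
    """Insert a new key (assumed absent) with the given rank; return the new root."""
    if root is None:
        return Node(key, rank)
    # The new node displaces root if it has a higher rank,
    # or an equal rank and a smaller key.
    if rank > root.rank or (rank == root.rank and key < root.key):
        smaller, larger = unzip(root, key)
        return Node(key, rank, smaller, larger)
    if key < root.key:
        root.left = insert(root.left, key, rank)
    else:
        root.right = insert(root.right, key, rank)
    return root


def delete(root: Optional[Node], key: int) -> Optional[Node]:
    """Remove key if present by zipping the two subtrees of its node."""
    if root is None:
        return None
    if key == root.key:
        return zip_trees(root.left, root.right)
    if key < root.key:
        root.left = delete(root.left, key)
    else:
        root.right = delete(root.right, key)
    return root
```

A caller would draw a rank once per key, for example `root = insert(root, k, random_rank(rng))`. The recursion keeps the sketch short; an iterative version would avoid Python's recursion limit on very deep trees.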


2021 ◽  
Vol 68 (5) ◽  
pp. 1-39
Author(s):  
Bernhard Haeupler ◽  
Amirbehshad Shahrasbi

We introduce synchronization strings, which provide a novel way to efficiently deal with synchronization errors, i.e., insertions and deletions. Synchronization errors are strictly more general and much harder to cope with than more commonly considered Hamming-type errors, i.e., symbol substitutions and erasures. For every ε > 0, synchronization strings allow us to index a sequence with an ε^{-O(1)}-size alphabet, such that one can efficiently transform k synchronization errors into (1 + ε)k Hamming-type errors. This powerful new technique has many applications. In this article, we focus on designing insdel codes, i.e., error correcting block codes (ECCs) for insertion-deletion channels. While ECCs for both Hamming-type errors and synchronization errors have been intensely studied, the latter has largely resisted progress. As Mitzenmacher puts it in his 2009 survey [30]: “Channels with synchronization errors...are simply not adequately understood by current theory. Given the near-complete knowledge we have for channels with erasures and errors...our lack of understanding about channels with synchronization errors is truly remarkable.” Indeed, it took until 1999 for the first insdel codes with constant rate, constant distance, and constant alphabet size to be constructed and only since 2016 are there constructions of constant rate insdel codes for asymptotically large noise rates. Even in the asymptotically large or small noise regimes, these codes are polynomially far from the optimal rate-distance tradeoff. This makes the understanding of insdel codes up to this work equivalent to what was known for regular ECCs after Forney introduced concatenated codes in his doctoral thesis 50 years ago. A straightforward application of our synchronization strings-based indexing method gives a simple black-box construction that transforms any ECC into an equally efficient insdel code with only a small increase in the alphabet size.
This instantly transfers much of the highly developed understanding for regular ECCs into the realm of insdel codes. Most notably, for the complete noise spectrum, we obtain efficient “near-MDS” insdel codes, which get arbitrarily close to the optimal rate-distance tradeoff given by the Singleton bound. In particular, for any δ ∈ (0,1) and ε > 0, we give a family of insdel codes achieving a rate of 1 - δ - ε over a constant-size alphabet that efficiently corrects a δ fraction of insertions or deletions.
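To make the contrast between synchronization errors and Hamming-type errors concrete, the sketch below compares two ways of measuring the damage done by a single deletion. It is an illustration only, not part of the synchronization-string construction. It relies on the standard fact that the minimum number of insertions and deletions needed to turn one string into another is the sum of their lengths minus twice the length of their longest common subsequence. The function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def insdel_distance(a: str, b: str) -> int:
    """Minimum number of insertions and deletions turning a into b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def positional_mismatches(a: str, b: str) -> int:
    """Mismatches seen by a receiver that compares symbols position by position."""
    return sum(x != y for x, y in zip(a, b))


sent = "0101010101"
received = sent[1:]  # the channel deleted the first symbol

print(insdel_distance(sent, received))        # one deletion explains the change
print(positional_mismatches(sent, received))  # every aligned position disagrees
```

A receiver that only reasons about fixed positions sees the whole word as corrupted, while the insertion-deletion view attributes the change to one error. Indexing schemes such as the one described in the abstract aim to close that gap.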


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Massimo Maiolo ◽  
Lorenzo Gatti ◽  
Diego Frei ◽  
Tiziano Leidi ◽  
Manuel Gil ◽  
...  

Background: Current alignment tools typically lack an explicit model of indel evolution, leading to artificially short inferred alignments (i.e., over-alignment) due to inconsistencies between the indel history and the phylogeny relating the input sequences. Results: We present a new progressive multiple sequence alignment tool, ProPIP. The process of insertions and deletions is described using an explicit evolutionary model, the Poisson Indel Process (PIP). The method is based on dynamic programming and is implemented in a frequentist framework. The algorithm is implemented in C++ as a standalone program, and the source code can be compiled on Linux, macOS and Microsoft Windows platforms. The source code is freely available on GitHub at https://github.com/acg-team/ProPIP and is distributed under the terms of the GNU GPL v3 license. Conclusions: The use of an explicit indel evolution model makes it possible to avoid over-alignment, to infer gaps in a phylogenetically consistent way and to make inferences about the rates of insertions and deletions. Instead of arbitrary gap penalties, the parameters used by ProPIP are the insertion and deletion rates, which have a biological interpretation and are contextualized in a probabilistic environment. As a result, indel rate settings may be optimised in order to infer phylogenetically meaningful gap patterns.


Computability ◽  
2021 ◽  
pp. 1-27
Author(s):  
Martin Vu ◽  
Henning Fernau

In this paper, we discuss the addition of substitutions as a further type of operation to (in particular, context-free) insertion-deletion systems, i.e., in addition to insertions and deletions we allow single-letter replacements to occur. We investigate the effect of the addition of substitution rules on the context dependency of such systems, thereby also obtaining new characterizations of and even normal forms for context-sensitive (CS) and recursively enumerable (RE) languages and their phrase-structure grammars. More specifically, we prove that for each RE language, there is a system generating this language that only inserts and deletes strings of length two without considering the context of the insertion or deletion site, but which may change symbols (by a substitution operation) by checking a single symbol to the left of the substitution site. When we allow checking left and right single-letter context in substitutions, even context-free insertions and deletions of single letters suffice to reach computational completeness. When allowing context-free insertions only, checking left and right single-letter context in substitutions gives a new characterization of CS. This clearly shows the power of this new type of rule.


2021 ◽  
Vol 18 (183) ◽  
Author(s):  
Nora S. Martin ◽  
Sebastian E. Ahnert

Genotype–phenotype maps link genetic changes to their fitness effect and are thus an essential component of evolutionary models. The map between RNA sequences and their secondary structures is a key example and has applications in functional RNA evolution. For this map, the structural effect of substitutions is well understood, but models usually assume a constant sequence length and do not consider insertions or deletions. Here, we expand the sequence–structure map to include single nucleotide insertions and deletions by using the RNAshapes concept. To quantify the structural effect of insertions and deletions, we generalize existing definitions for robustness and non-neutral mutation probabilities. We find striking similarities between substitutions, deletions and insertions: robustness to substitutions is correlated with robustness to insertions and, for most structures, to deletions. In addition, frequent structural changes after substitutions also tend to be common for insertions and deletions. This is consistent with the connection between energetically suboptimal folds and possible structural transitions. The similarities observed hold both for genotypic and phenotypic robustness and mutation probabilities, i.e. for individual sequences and for averages over sequences with the same structure. Our results could have implications for the rate of neutral and non-neutral evolution.


2021 ◽  
Vol 22 (S4) ◽  
Author(s):  
Zexian Zeng ◽  
Chengsheng Mao ◽  
Andy Vo ◽  
Xiaoyu Li ◽  
Janna Ore Nugent ◽  
...  

Background: Genetic information is becoming more readily available and is increasingly being used to predict patient cancer types as well as their subtypes. Most classification methods thus far utilize somatic mutations as independent features for classification and are limited by study power. We aim to develop a novel method to effectively explore the landscape of genetic variants, including germline variants, and small insertions and deletions for cancer type prediction. Results: We proposed DeepCues, a deep learning model that utilizes convolutional neural networks to derive features in an unbiased manner from raw cancer DNA sequencing data for disease classification and relevant gene discovery. Using raw whole-exome sequencing as features, germline variants and somatic mutations, including insertions and deletions, were interactively amalgamated for feature generation and cancer prediction. We applied DeepCues to a dataset from TCGA to classify seven different types of major cancers and obtained an overall accuracy of 77.6%. We compared DeepCues to conventional methods and demonstrated a significant overall improvement (p < 0.001). Strikingly, the top 20 breast cancer relevant genes identified using DeepCues had a 40% overlap with the top 20 known breast cancer driver genes. Conclusion: Our results support DeepCues as a novel method to improve the representational resolution of DNA sequencing data and its power in deriving features from raw sequences for cancer type prediction, as well as discovering new cancer relevant genes.


Author(s):  
Gil Loewenthal ◽  
Dana Rapoport ◽  
Oren Avram ◽  
Asher Moshe ◽  
Elya Wygoda ◽  
...  

Insertions and deletions (indels) are common molecular evolutionary events. However, probabilistic models for indel evolution are under-developed due to their computational complexity. Here we introduce several improvements to indel modeling: (1) While previous models for indel evolution assumed that the rates and length distributions of insertions and deletions are equal, here we propose a richer model that explicitly distinguishes between the two; (2) We introduce numerous summary statistics that allow Approximate Bayesian Computation (ABC) based parameter estimation; (3) We develop a method to correct for biases introduced by alignment programs, when inferring indel parameters from empirical datasets; (4) Using a model-selection scheme we test whether the richer model better fits biological data compared to the simpler model. Our analyses suggest that both our inference scheme and the model-selection procedure achieve high accuracy on simulated data. We further demonstrate that our proposed richer model better fits a large number of empirical datasets and that, for the majority of these datasets, the deletion rate is higher than the insertion rate.


Author(s):  
Kohei Hagiwara ◽  
Michael N Edmonson ◽  
David A Wheeler ◽  
Jinghui Zhang

Summary: Small insertions and deletions (indels) in nucleotide sequence may be represented differently between mapping algorithms and variant callers, or in the flanking sequence context. Representational ambiguity is especially profound for complex indels, complicating comparisons between multiple mappings and call sets. Complex indels may additionally suffer from incomplete allele representation, potentially leading to critical misannotation of variant effect. We present indelPost, a Python library that harmonizes these ambiguities for simple and complex indels via realignment and read-based phasing. We demonstrate that indelPost enables accurate analysis of ambiguous data and can derive the correct complex indel alleles from the simple indel predictions provided by standard small variant detectors, with improved performance over a specialized tool for complex indel analysis. Availability: indelPost is freely available at: https://github.com/stjude/indelPost. Supplementary information: Supplementary data are available at Bioinformatics online.
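The representational ambiguity mentioned in this abstract is easiest to see in a repeat region, where deleting the same number of bases at several different positions yields the identical sequence. The sketch below shows this for a simple deletion and applies the common left-shifting convention to pick one canonical position. It is a generic illustration and does not use or reproduce the indelPost API, which according to the abstract relies on realignment and read-based phasing. The function names and the 0-based coordinates are assumptions.

```python
def apply_deletion(ref: str, pos: int, length: int) -> str:
    """Return ref with `length` bases removed starting at 0-based index pos."""
    return ref[:pos] + ref[pos + length:]


def left_align_deletion(ref: str, pos: int, length: int) -> int:
    """Shift a deletion left while the resulting sequence stays unchanged.

    Moving the deleted window one base to the left keeps the result identical
    exactly when the base entering the window equals the base leaving it.
    """
    while pos > 0 and ref[pos - 1] == ref[pos + length - 1]:
        pos -= 1
    return pos


ref = "GCACACAT"
# Several callers could report the same 2-base deletion at different positions.
reported = [1, 2, 3, 4, 5]
haplotypes = {apply_deletion(ref, p, 2) for p in reported}
canonical = {left_align_deletion(ref, p, 2) for p in reported}

print(haplotypes)  # a single resulting sequence
print(canonical)   # a single canonical start position
```

Comparing call sets after such a normalization step avoids counting equivalent records as distinct variants. Complex indels, which combine inserted and deleted bases, need more than this simple shift.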


Author(s):  
Filip Wierzbicki ◽  
Robert Kofler ◽  
Sarah Signor

Small RNAs produced from transposable element (TE) rich sections of the genome, termed piRNA clusters, are a crucial component in the genomic defense against selfish DNA. In animals, it is thought that the invasion of a TE is stopped when a copy of the TE inserts into a piRNA cluster, triggering the production of cognate small RNAs that silence the TE. Despite this importance for TE control, little is known about the evolutionary dynamics of piRNA clusters, mostly because these repeat-rich regions are difficult to assemble and compare. Here we establish a framework for studying the evolution of piRNA clusters quantitatively. Previously introduced quality metrics and newly developed software for multiple alignments of repeat annotations (Manna) allow us to estimate the level of polymorphism segregating in piRNA clusters and the divergence among homologous piRNA clusters. By studying 20 conserved piRNA clusters in multiple assemblies of four Drosophila species we show that piRNA clusters are evolving rapidly. While 70-80% of the clusters are conserved within species, the clusters share almost no similarity between species as closely related as D. melanogaster and D. simulans. Furthermore, abundant insertions and deletions are segregating within the Drosophila species. We show that the evolution of clusters is mainly driven by large insertions of recently active TEs, and smaller deletions mostly in older TEs. The effect of these forces is so rapid that homologous clusters often do not contain insertions from the same TE families.

