letter alphabet
Recently Published Documents


TOTAL DOCUMENTS

119
(FIVE YEARS 16)

H-INDEX

13
(FIVE YEARS 1)

2021 ◽  
Vol 3 (4) ◽  
Author(s):  
Guilherme de Sena Brandine ◽  
Andrew D Smith

DNA cytosine methylation is an important epigenomic mark with a wide range of functions in many organisms. Whole genome bisulfite sequencing is the gold standard to interrogate cytosine methylation genome-wide. Algorithms used to map bisulfite-converted reads often encode the four-base DNA alphabet with three letters by reducing two bases to a common letter. This encoding substantially reduces the entropy of nucleotide frequencies in the resulting reference genome. Within the paradigm of read mapping by first filtering possible candidate alignments, reduced entropy in the sequence space can increase the required computing effort. We introduce another bisulfite mapping algorithm (abismal), based on the idea of encoding a four-letter DNA sequence as only two letters, one for purines and one for pyrimidines. We show that this encoding can lead to greater specificity compared to existing encodings used to map bisulfite sequencing reads. Through the two-letter encoding, the abismal software tool maps reads in less time and using less memory than most bisulfite sequencing read mapping software tools, while attaining similar accuracy. This allows in silico methylation analysis to be performed on a wider range of computing machines with limited hardware settings.
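As a rough illustration of the two kinds of encoding mentioned in this abstract, the Python sketch below maps a DNA string to a three-letter alphabet and to the two-letter purine/pyrimidine alphabet. The choice of collapsing C into T for the three-letter case, the letters R and Y, and the function names are assumptions made for the example; this is not the abismal implementation.

```python
# Illustrative sketch only; not the abismal implementation.
# Assumption: the three-letter encoding collapses C into T.
THREE_LETTER = str.maketrans({"C": "T"})
# Assumption: purines (A, G) become R, pyrimidines (C, T) become Y.
TWO_LETTER = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y"})


def encode_three_letter(seq):
    """Reduce the four-base alphabet to three letters (C merged into T)."""
    return seq.upper().translate(THREE_LETTER)


def encode_two_letter(seq):
    """Reduce the four-base alphabet to two letters (purine/pyrimidine)."""
    return seq.upper().translate(TWO_LETTER)


read = "ACGTTCGA"
print(encode_three_letter(read))  # ATGTTTGA
print(encode_two_letter(read))    # RYRYYYRR

# A C-to-T converted read has the same two-letter encoding as the original,
# because C and T are both pyrimidines.
converted = read.replace("C", "T")
print(encode_two_letter(converted) == encode_two_letter(read))  # True
```

Under these assumptions, both encodings make a C-to-T converted read comparable with the unconverted reference; they differ in how many distinct letters remain for filtering candidate alignments.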


2021 ◽  
Vol 8 (5) ◽  
pp. 379-388
Author(s):  
Tulus Nadapdap ◽  
Tulus . ◽  
Opim Salim

Systems of equations of the form X = Y + Z and X = C are considered, in which the unknowns are sets of integers, "+" denotes the pairwise sum of sets, S + T = {m + n | m ∈ S, n ∈ T}, and C is an ultimately periodic constant. When restricted to sets of natural numbers, such equations can equally be seen as language equations over a one-letter alphabet with concatenation and regular constants, and it is shown that such systems are computationally universal, in the sense that for every recursive set S ⊆ N there exists a system with a unique solution containing T with S = {n | 16n + 13 ∈ T}. For systems over sets of all integers, both positive and negative, there is a similar construction of a system with a unique solution S = {n | 16n ∈ T} representing any hyper-arithmetical set S ⊆ N. Keywords: Language equations, Natural numbers, Equations of natural number.
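The pairwise sum and the encoding of a set S inside a solution component T can be made concrete for finite sets. The sketch below is only an illustration: the sets in the abstract are in general infinite, the helper names are hypothetical, and the bound parameter exists only to keep the computation finite.

```python
def set_sum(s, t):
    """Pairwise sum S + T = {m + n | m in S, n in T} of two finite sets."""
    return {m + n for m in s for n in t}


def decode(t, bound):
    """Recover {n | 16n + 13 in T} restricted to 0 <= n < bound."""
    return {n for n in range(bound) if 16 * n + 13 in t}


print(sorted(set_sum({0, 1}, {10, 20})))  # [10, 11, 20, 21]

# A finite stand-in for a solution component T:
# 13 = 16*0 + 13, 29 = 16*1 + 13, 61 = 16*3 + 13, while 30 encodes nothing.
t = {13, 29, 30, 61}
print(sorted(decode(t, 5)))  # [0, 1, 3]
```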


2021 ◽  
Author(s):  
Pamela Fleischmann

The domain of Combinatorics on Words, first introduced by Axel Thue in 1906, covers by now many subdomains. In this work we are investigating scattered factors as a representation of non-complete information and two measurements for words, namely the locality of a word and prefix normality, which have applications in pattern matching. In the first part of the thesis we investigate scattered factors: A word u is a scattered factor of w if u can be obtained from w by deleting some of its letters. That is, there exist the (potentially empty) words u_1, u_2, ..., u_n, and v_0, v_1, ..., v_n such that u = u_1 u_2 ··· u_n and w = v_0 u_1 v_1 u_2 v_2 ··· u_n v_n. First, we consider the set of length-k scattered factors of a given word w, called the k-spectrum of w and denoted by ScatFact_k(w). We prove a series of properties of the sets ScatFact_k(w) for binary weakly-0-balanced and, respectively, weakly-c-balanced words w, i.e., words over a two-letter alphabet where the number of occurrences of each letter is the same, or, respectively, one letter has c occurrences more than the other. In particular, we consider the question which cardinalities n = |ScatFact_k(w)| are obtainable, for a positive integer k, when w is either a weakly-0-balanced binary word of length 2k, or a weakly-c-balanced binary word of length 2k − c. Second, we investigate k-spectra that contain all possible words of length k, i.e., k-spectra of so-called k-universal words. We present an algorithm deciding whether the k-spectra for given k of two words are equal or not, running in optimal time. Moreover, we present several results regarding k-universal words and extend this notion to circular universality that helps in investigating how the universality of repetitions of a given word can be determined.
We conclude the part about scattered factors with results on the reconstruction problem of words from scattered factors that asks for the minimal information, like multisets of scattered factors of a given length or the number of occurrences of scattered factors from a given set, necessary to uniquely determine a word. We show that a word w ∈ {a, b}* can be reconstructed from the number of occurrences of at most min(|w|_a, |w|_b) + 1 scattered factors of the form a^i b, where |w|_a is the number of occurrences of the letter a in w. Moreover, we generalise the result to alphabets of the form {1, ..., q} by showing that at most ∑_{i=1}^{q−1} |w|_i (q − i + 1) scattered factors suffice to reconstruct w. Both results improve on the upper bounds known so far. Complexity time bounds on reconstruction algorithms are also considered here. In the second part we consider patterns, i.e., words consisting of not only letters but also variables, and in particular their locality. A pattern is called k-local if on marking the pattern in a given order never more than k marked blocks occur. We start with the proof that determining the minimal k for a given pattern such that the pattern is k-local is NP-complete. Afterwards we present results on the behaviour of the locality of repetitions and palindromes. We end this part with the proof that the matching problem becomes also NP-hard if we do not consider a regular pattern - for which the matching problem is efficiently solvable - but repetitions of regular patterns. In the last part we investigate prefix normal words which are binary words in which each prefix has at least the same number of 1s as any factor of the same length. First introduced in 2011 by Fici and Lipták, the problem of determining the index (amount of equivalence classes for a given word length) of the prefix normal equivalence relation is still open.
In this paper, we investigate two aspects of the problem, namely prefix normal palindromes and so-called collapsing words (extending the notion of critical words). We prove characterizations for both the palindromes and the collapsing words and show their connection. Based on this, we show that still open problems regarding prefix normal words can be split into certain subproblems.
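The definitions used in this abstract (k-spectrum, k-universality, prefix normality) can be checked by brute force on small words. The sketch below does only that; it is exponential in the worst case, the function names are hypothetical, and it is not one of the algorithms of the thesis.

```python
from itertools import combinations, product


def scat_fact(w, k):
    """k-spectrum of w: the set of all length-k scattered factors of w."""
    # combinations() keeps positions in increasing order, so each tuple
    # is obtained from w by deleting letters.
    return {"".join(c) for c in combinations(w, k)}


def is_k_universal(w, k, alphabet):
    """True if the k-spectrum of w contains every length-k word."""
    all_words = {"".join(p) for p in product(alphabet, repeat=k)}
    return scat_fact(w, k) == all_words


def is_prefix_normal(w):
    """True if each prefix of the binary word w has at least as many 1s
    as any factor of w of the same length."""
    n = len(w)
    for length in range(1, n + 1):
        prefix_ones = w[:length].count("1")
        max_ones = max(w[i:i + length].count("1")
                       for i in range(n - length + 1))
        if prefix_ones < max_ones:
            return False
    return True


print(sorted(scat_fact("aabb", 2)))      # ['aa', 'ab', 'bb']
print(is_k_universal("abab", 2, "ab"))   # True
print(is_k_universal("aabb", 2, "ab"))   # False
print(is_prefix_normal("1101"))          # True
print(is_prefix_normal("1011"))          # False
```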


2020 ◽  
Author(s):  
Guilherme de Sena Brandine ◽  
Andrew D. Smith

DNA methylation, characterized by the presence of a methyl group at cytosines in a DNA sequence, is an important epigenomic mark with a wide range of functions across diverse organisms. Whole genome bisulfite sequencing (WGBS) has emerged as the gold standard to interrogate cytosine methylation. Accurately mapping WGBS reads to a reference genome allows reconstruction of tissue methylomes at single-base resolution. Algorithms used to map WGBS reads often encode the four-base DNA alphabet with three letters by reducing two bases to a common letter. We introduce another bisulfite mapping algorithm (abismal), based on the novel idea of encoding a four-letter DNA sequence as two letters, one for purines and one for pyrimidines. We show theoretically that this encoding benefits from higher uniformity and specificity when subsequences are selected from reads for filtration. In our implementation, this leads to a decreased mapping time relative to the three-letter encoding. We demonstrate, using data from multiple public studies, that the abismal software tool improves mapping accuracy at significantly lower mapping times compared to commonly used mappers, with most notable improvements observed in samples originating from the random priming post-bisulfite adapter tagging protocol.


Author(s):  
Mikhail V. Berlinkov ◽  
Cyril Nicaud

In this paper we address the question of synchronizing random automata in the critical settings of almost-group automata. Group automata are automata where all letters act as permutations on the set of states, and they are not synchronizing (unless they have one state). In almost-group automata, one of the letters acts as a permutation on [Formula: see text] states, and the others as permutations. We prove that this small change is enough for automata to become synchronizing with high probability. More precisely, we establish that the probability that a strongly-connected almost-group automaton is not synchronizing is [Formula: see text], for a [Formula: see text]-letter alphabet. We also present an efficient algorithm that decides whether a strongly-connected almost-group automaton is synchronizing. For a natural model of computation, we establish a [Formula: see text] worst-case lower bound for this problem ([Formula: see text] for the average case), which is almost matched by our algorithm.
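For readers unfamiliar with the notion, a deterministic automaton is synchronizing when some word sends every state to the same state, and a classical criterion says this holds exactly when every pair of states can be merged by some word. The sketch below applies that textbook criterion by breadth-first search over pairs of states; it is a generic check and not the algorithm of the paper. The transition-table layout and the names are assumptions made for the example.

```python
from collections import deque
from itertools import combinations


def is_synchronizing(delta):
    """delta[a][q] is the state reached from state q by letter a.

    States are 0..n-1. Uses the pairwise-merging criterion: the automaton
    is synchronizing iff every pair of states can be mapped to one state.
    """
    n = len(delta[0])

    def mergeable(p, q):
        seen = {(p, q)}
        queue = deque([(p, q)])
        while queue:
            x, y = queue.popleft()
            if x == y:
                return True
            for row in delta:
                nxt = (row[x], row[y])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    return all(mergeable(p, q) for p, q in combinations(range(n), 2))


# Both letters act as permutations (a group automaton): not synchronizing.
group = [[1, 2, 3, 0], [1, 0, 2, 3]]
# The second letter sends state 0 to state 1 and fixes the other states.
almost_group = [[1, 2, 3, 0], [1, 1, 2, 3]]

print(is_synchronizing(group))         # False
print(is_synchronizing(almost_group))  # True
```

The first example shows why group automata with more than one state cannot synchronize: permutations never merge two distinct states.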


Entropy ◽  
2020 ◽  
Vol 22 (12) ◽  
pp. 1333
Author(s):  
James Kunert-Graf ◽  
Nikita Sakhanenko ◽  
David Galas

Information theory provides robust measures of multivariable interdependence, but classically does little to characterize the multivariable relationships it detects. The Partial Information Decomposition (PID) characterizes the mutual information between variables by decomposing it into unique, redundant, and synergistic components. This has been usefully applied, particularly in neuroscience, but there is currently no generally accepted method for its computation. Independently, the Information Delta framework characterizes non-pairwise dependencies in genetic datasets. This framework has developed an intuitive geometric interpretation for how discrete functions encode information, but lacks some important generalizations. This paper shows that the PID and Delta frameworks are largely equivalent. We equate their key expressions, allowing for results in one framework to apply towards open questions in the other. For example, we find that the approach of Bertschinger et al. is useful for the open Information Delta question of how to deal with linkage disequilibrium. We also show how PID solutions can be mapped onto the space of delta measures. Using Bertschinger et al. as an example solution, we identify a specific plane in delta-space on which this approach’s optimization is constrained, and compute it for all possible three-variable discrete functions of a three-letter alphabet. This yields a clear geometric picture of how a given solution decomposes information.


2020 ◽  
Vol 21 (19) ◽  
pp. 7392
Author(s):  
Peter R. Wills ◽  
Charles W. Carter

We recently observed that errors in gene replication and translation could be seen qualitatively to behave analogously to the impedances in acoustical and electronic energy transducing systems. We develop here quantitative relationships necessary to confirm that analogy and to place it into the context of the minimization of dissipative losses of both chemical free energy and information. The formal developments include expressions for the information transferred from a template to a new polymer, I_σ; an impedance parameter, Z; and an effective alphabet size, n_eff; all of which have non-linear dependences on the fidelity parameter, q, and the alphabet size, n. Surfaces of these functions over the {n,q} plane reveal key new insights into the origin of coding. Our conclusion is that the emergence and evolutionary refinement of information transfer in biology follow principles previously identified to govern physical energy flows, strengthening analogies (i) between chemical self-organization and biological natural selection, and (ii) between the course of evolutionary trajectories and the most probable pathways for time-dependent transitions in physics. Matching the informational impedance of translation to the four-letter alphabet of genes uncovers a pivotal role for the redundancy of triplet codons in preserving as much intrinsic genetic information as possible, especially in early stages when the coding alphabet size was small.


Mathematics ◽  
2020 ◽  
Vol 8 (5) ◽  
pp. 778
Author(s):  
Herman Z. Q. Chen ◽  
Sergey Kitaev ◽  
Brian Y. Sun

A universal cycle, or u-cycle, for a given set of words is a circular word that contains each word from the set exactly once as a contiguous subword. The celebrated de Bruijn sequences are a particular case of such a u-cycle, where the set in question is the set A^n of all words of length n over a k-letter alphabet A. A universal word, or u-word, is a linear, i.e., non-circular, version of the notion of a u-cycle, and it is defined similarly. Removing some words from A^n may, or may not, result in a set of words for which a u-cycle, or u-word, exists. The goal of this paper is to study the probability of existence of the universal objects in such a situation. We give lower bounds for the probability in general cases, and also derive explicit answers for the case of removing up to two words from A^n, or the case when k = 2 and n ≤ 4.
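To make the definitions concrete, the sketch below builds a de Bruijn sequence with the standard recursive Lyndon-word construction and checks the u-cycle property directly from the definition. It illustrates the objects discussed in the abstract, not the probabilistic results of the paper; the function names are hypothetical and letters are represented as integers 0..k-1.

```python
from itertools import product


def de_bruijn(k, n):
    """De Bruijn sequence over {0, ..., k-1} for words of length n,
    built with the standard recursive Lyndon-word construction."""
    a = [0] * (k * n)
    seq = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return seq


def is_u_cycle(cycle, words, n):
    """True if the circular word `cycle` contains each word in `words`
    exactly once as a contiguous subword of length n, and nothing else."""
    doubled = cycle + cycle[:n - 1]
    windows = [tuple(doubled[i:i + n]) for i in range(len(cycle))]
    return sorted(windows) == sorted(words)


all_words = list(product(range(2), repeat=3))
cycle = de_bruijn(2, 3)
print(cycle)                            # [0, 0, 0, 1, 0, 1, 1, 1]
print(is_u_cycle(cycle, all_words, 3))  # True

# Removing the word 000 from A^3 still leaves a set with a u-cycle.
without_000 = [w for w in all_words if w != (0, 0, 0)]
print(is_u_cycle([0, 0, 1, 0, 1, 1, 1], without_000, 3))  # True
```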


2020 ◽  
Author(s):  
Vasil Dinev Penchev

The “four-color” theorem seems to be generalizable as follows. The four-letter alphabet is sufficient to encode unambiguously any set of well-orderings including a geographical map or the “map” of any logic and thus that of all logics or the DNA (RNA) plan(s) of any (all) alive being(s). Then the corresponding maximally generalizing conjecture would state: anything in the universe or mind can be encoded unambiguously by four letters. That admits to be formulated as a “four-letter theorem”, and thus one can search for a properly mathematical proof of the statement. It would imply the “four colour theorem”, the proof of which many philosophers and mathematicians believe not to be entirely satisfactory for it is not a “human proof”, but intermediated by computers unavoidably since the necessary calculations exceed the human capabilities fundamentally. It is furthermore rather unsatisfactory because it consists in enumerating and proving all cases one by one. Sometimes, a more general theorem turns out to be much easier for proving including a general “human” method, and the particular and too difficult for proving theorem to be implied as a corollary in certain simple conditions. The same approach will be followed as to the four colour theorem, i.e. to be deduced more or less trivially from the “four-letter theorem” if the latter is proved. References are only classical and thus very well-known papers: their complete bibliographic description is omitted.

