finite alphabet
Recently Published Documents


TOTAL DOCUMENTS

438
(FIVE YEARS 90)

H-INDEX

24
(FIVE YEARS 3)

Entropy ◽  
2021 ◽  
Vol 24 (1) ◽  
pp. 65
Author(s):  
Jesús E. García ◽  
Verónica A. González-López ◽  
Gustavo H. Tasca ◽  
Karina Y. Yaginuma

In the framework of coding theory, under the assumption of a Markov process (Xt) on a finite alphabet A, the compressed representation of the data is composed of a description of the model used to code the data and the encoded data. Given the model, Huffman's algorithm is optimal with respect to the number of bits needed to encode the data. On the other hand, modeling (Xt) through a Partition Markov Model (PMM) reduces the number of transition probabilities needed to define the model. This paper shows how using Huffman coding with a PMM reduces the number of bits needed in this process. We prove that estimating a PMM allows the entropy of (Xt) to be estimated, providing an estimator of the minimum expected codeword length per symbol. We show the efficiency of the new methodology in a simulation study and on a real problem of compressing DNA sequences of SARS-CoV-2, obtaining a reduction of at least 10.4% on the real data.
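The link between Huffman codeword lengths and entropy that this abstract relies on can be illustrated with a minimal Python sketch. It is not the authors' PMM procedure: the function names and the toy four-letter distribution are assumptions made for illustration, and the code handles a single fixed distribution rather than a Markov model.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Return {symbol: codeword length} of a Huffman code for `probs`."""
    if len(probs) == 1:
        # A single symbol still needs one bit in a practical code.
        return {next(iter(probs)): 1}
    # Heap items: (probability, unique tie-breaker, {symbol: depth so far}).
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]


def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def expected_length(probs, lengths):
    """Expected codeword length in bits per symbol."""
    return sum(p * lengths[s] for s, p in probs.items())


# Hypothetical distribution over a DNA-like alphabet (illustrative only).
probs = {"a": 0.5, "c": 0.25, "g": 0.125, "t": 0.125}
lengths = huffman_code_lengths(probs)
print(lengths)                          # a=1, c=2, g=3, t=3
print(entropy(probs))                   # 1.75
print(expected_length(probs, lengths))  # 1.75
```

For this dyadic distribution the expected length equals the entropy exactly. In general the expected Huffman length L satisfies H ≤ L < H + 1, which is why an entropy estimate serves as a lower bound on the expected codeword length per symbol.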


2021 ◽  
Vol 3 ◽  
pp. 4
Author(s):  
Tai-Danae Bradley ◽  
Yiannis Vlassopoulos

This work originates from the observation that today's state-of-the-art statistical language models are impressive not only for their performance, but also, and quite crucially, because they are built entirely from correlations in unstructured text data. The latter observation prompts a fundamental question that lies at the heart of this paper: What mathematical structure exists in unstructured text data? We put forth enriched category theory as a natural answer. We show that sequences of symbols from a finite alphabet, such as those found in a corpus of text, form a category enriched over probabilities. We then address a second fundamental question: How can this information be stored and modeled in a way that preserves the categorical structure? We answer this by constructing a functor from our enriched category of text to a particular enriched category of reduced density operators. The latter leverages the Loewner order on positive semidefinite operators, which can further be interpreted as a toy example of entailment.


Author(s):  
Галина Николаевна Жукова ◽  
Михаил Васильевич Ульянов

The article considers the problem of recovering symbolic periodic sequences distorted by insertion noise as well as by the replacement and deletion of symbols. Since the level of detail of the symbolic description of a process is determined by the cardinality of the alphabet, it is of interest to study how this level of detail affects the possibility of recovering complete information about the original periodic sequence. An experimental study is presented of how the quality characteristics of the period recovery method proposed by the authors depend on the cardinality of the alphabet. For alphabets of different cardinalities, the proportion of sequences with a satisfactorily recovered period and the relative error in determining the period length are given. The quality of recovery is assessed by the ratio of the edit distance from the recovered periodic sequence to the original strictly periodic sequence
The relevance of this study stems from the wide range of applied problems in real-world data processing and analysis. In such problems it is sensible to encode information using symbols from a finite alphabet. By varying the cardinality of the alphabet, the symbolic description of the process can be given a level of detail sufficient for real-world data analysis. However, in a number of subject areas where the trajectories of the examined processes can be coded symbolically, researchers face distortions, noise, and fragmentation of information. This occurs in bioinformatics, medicine, the digital economy, time series forecasting, and the analysis of business processes. Periodic processes are widely represented in these subject areas. Without noise, these processes correspond to periodic symbolic sequences, i.e. words over a finite alphabet. Instead of the expected periodic symbolic sequence, a researcher often receives experimental data in the form of a sequence distorted by noise of various origins. 
Under these conditions, identifying the periodicity, which includes determining both the periodically repeating symbolic fragment and its length (hereinafter called the period), requires reducing the effect of noise on the experimental results. The article deals with the problem of recovering periodic sequences distorted by insertion noise as well as by replaced and deleted symbols. Since the level of detail in the description of the process depends on the cardinality of the alphabet, it is of interest to study the influence of the level of detail in the symbolic description on the possibility of recovering complete information about the initially periodic sequences. The article experimentally examines how the quality characteristics of the period recovery method proposed by the authors depend on the cardinality of the alphabet. For alphabets of different cardinalities, the proportion of sequences with a satisfactorily reconstructed period and the relative error in determining the length of the period are given. The quality of reconstruction of a periodically repeating fragment is estimated by the ratio of the edit distance from the reconstructed periodic sequence to the original sequence distorted by noise
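
The edit distance used above as a quality measure can be sketched as follows. This is only the standard Levenshtein distance applied to a strictly periodic sequence and noisy copies of it; it is not the authors' period recovery method, and the helper names and the example fragment "abc" are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and
    replacements of single symbols needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace or match
        prev = curr
    return prev[-1]


def periodic_sequence(fragment, length):
    """Strictly periodic word of the given length built from `fragment`."""
    reps = length // len(fragment) + 1
    return (fragment * reps)[:length]


clean = periodic_sequence("abc", 9)        # "abcabcabc"
print(edit_distance("abcabxabc", clean))   # one replacement -> 1
print(edit_distance("abcbcabc", clean))    # one deletion    -> 1
print(edit_distance("abcaabcabc", clean))  # one insertion   -> 1
```

Each of the three noise types named in the abstract (insertion, replacement, deletion) costs one unit, so the distance counts the minimum number of such distortions separating two sequences.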


Entropy ◽  
2021 ◽  
Vol 23 (9) ◽  
pp. 1148
Author(s):  
Łukasz Dębowski

We present a hypothetical argument against finite-state processes in statistical language modeling that is based on semantics rather than syntax. In this theoretical model, we suppose that the semantic properties of texts in a natural language could be approximately captured by a recently introduced concept of a perigraphic process. Perigraphic processes are a class of stochastic processes that satisfy a Zipf-law accumulation of a subset of factual knowledge, which is time-independent, compressed, and effectively inferrable from the process. We show that the classes of finite-state processes and of perigraphic processes are disjoint, and we present a new simple example of perigraphic processes over a finite alphabet called Oracle processes. The disjointness result makes use of the Hilberg condition, i.e., the almost sure power-law growth of algorithmic mutual information. Using a strongly consistent estimator of the number of hidden states, we show that finite-state processes do not satisfy the Hilberg condition whereas Oracle processes satisfy the Hilberg condition via the data-processing inequality. We discuss the relevance of these mathematical results for theoretical and computational linguistics.


2021 ◽  
Author(s):  
Giuseppe De Giacomo ◽  
Aniello Murano ◽  
Fabio Patrizi ◽  
Giuseppe Perelli

Trace Alignment is a prominent problem in Declarative Process Mining, which consists in identifying a minimal set of modifications that a log trace (produced by a system under execution) requires in order to be made compliant with a temporal specification. In its simplest form, log traces are sequences of events from a finite alphabet and specifications are written in DECLARE, a strict sublanguage of linear-time temporal logic over finite traces (LTLf). The best approach for trace alignment has been developed in AI, using cost-optimal planning, and handles the whole of LTLf. In this paper, we study the timed version of trace alignment, where events are paired with timestamps and specifications are provided in metric temporal logic over finite traces (MTLf), essentially a superlanguage of LTLf. Due to the infiniteness of timestamps, this variant is substantially more challenging than the basic version, as the structures involved in the search are (uncountably) infinite-state, and calls for more sophisticated machinery based on alternating (timed) automata, as opposed to the standard finite-state automata sufficient for the untimed version. The main contribution of the paper is a provably correct, effective technique for Timed Trace Alignment that takes advantage of results on MTLf decidability as well as on reachability for well-structured transition systems.


Entropy ◽  
2021 ◽  
Vol 23 (8) ◽  
pp. 1045
Author(s):  
Farzad Shahrivari ◽  
Nikola Zlatanov

In this paper, we investigate the problem of classifying feature vectors with mutually independent but non-identically distributed elements that take values from a finite alphabet set. First, we show the importance of this problem. Next, we propose a classifier and derive an analytical upper bound on its error probability. We show that the error probability tends to zero as the length of the feature vectors grows, even when there is only one training feature vector per label available. Thereby, we show that for this important problem at least one asymptotically optimal classifier exists. Finally, we provide numerical examples where we show that the proposed classifier outperforms conventional classification algorithms when the amount of training data is small and the length of the feature vectors is sufficiently high.


Author(s):  
Victor I. Bakhtin ◽  
Bruno Sadok

We consider a space of infinite signals composed of letters from a finite alphabet. Each signal generates a sequence of empirical measures on the alphabet and the limit set corresponding to this sequence. The space of signals is partitioned into narrow basins consisting of signals with identical limit sets for the sequence of empirical measures and for each narrow basin its packing dimension is computed. Furthermore, we compute packing dimensions for two other types of basins defined in terms of limit behaviour of the empirical measures.
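
As a rough illustration of the sequence of empirical measures generated by a signal, the following Python sketch computes the measures for every prefix of a finite word. Real signals in the abstract are infinite, so this only shows the first few terms; the function name and the binary example are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction


def empirical_measures(signal, alphabet):
    """For each prefix length n = 1, 2, ..., return the empirical measure
    of the prefix: the relative frequency of every letter of `alphabet`."""
    counts = Counter()
    measures = []
    for n, letter in enumerate(signal, start=1):
        counts[letter] += 1
        # Exact fractions avoid floating-point noise in the frequencies.
        measures.append({a: Fraction(counts[a], n) for a in alphabet})
    return measures


for n, mu in enumerate(empirical_measures("0110", "01"), start=1):
    print(n, mu)
```

The limit set mentioned in the abstract is the set of accumulation points of this sequence as the prefix length grows; a finite prefix can only hint at it.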


2021 ◽  
Vol 1964 (6) ◽  
pp. 062049
Author(s):  
B Paulchamy ◽  
S Chidambaram ◽  
S Vairaprakash ◽  
A N Duraivel
