Large-scale detection of repetitions

Author(s):  
W. F. Smyth

Combinatorics on words began more than a century ago with a demonstration that an infinitely long string with no repetitions could be constructed on an alphabet of only three letters. Computing all the repetitions (such as ⋯TTT⋯ or ⋯CGACGA⋯) in a given string x of length n is one of the oldest and most important problems of computational stringology, requiring Θ(n log n) time in the worst case, since a string may contain that many repetitions. About a dozen years ago, it was discovered that repetitions can be computed as a by-product of the Θ(n)-time computation of all the maximal periodicities, or runs, in x. However, even though the computation is linear, it is also brute force: global data structures, such as the suffix array, the longest common prefix array and the Lempel–Ziv factorization, need to be computed in a preprocessing phase. Furthermore, all of this effort is required despite the fact that the expected number of runs in a string is generally a small fraction of the string length. In this paper, I explore the possibility that repetitions (perhaps also other regularities in strings) can be computed in a manner commensurate with the size of the output.
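To make the objects concrete, here is a brute-force sketch of run detection; it is my illustration, not the Θ(n)-time algorithm (or the output-sensitive approach) discussed in this abstract, and it runs in roughly quadratic time:

```python
def runs(x):
    """Enumerate runs (maximal periodicities) in x as triples (i, j, p):
    x[i..j] has smallest period p, spans at least two full periods, and
    extends neither left nor right. Brute force, for illustration only."""
    n = len(x)
    found = []
    for p in range(1, n // 2 + 1):              # candidate period
        i = 0
        while i + 2 * p <= n:
            j = i
            while j + p < n and x[j] == x[j + p]:   # extend the match
                j += 1
            if j - i >= p:                      # at least 2p characters repeat
                seg = x[i:j + p]
                # keep (i, j+p-1, p) only if no shorter period works
                if all(any(seg[k] != seg[k + q] for k in range(len(seg) - q))
                       for q in range(1, p)):
                    found.append((i, j + p - 1, p))
                i = j + 1
            else:
                i += 1
    return sorted(found)

# The two repetitions named in the abstract, embedded in one string:
print(runs("ATTTCGACGA"))   # [(1, 3, 1), (4, 9, 3)] -> TTT and CGACGA
```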

2015
Author(s):
Richard Beal

The parameterized string (p-string), a generalization of the traditional string, is composed of constant and parameter symbols. A parameterized match (p-match) exists between two p-strings if the constants match exactly and there exists a bijection between the parameter symbols. Historically, p-strings have been employed in source code cloning, plagiarism detection, and structural similarity between biological sequences. By handling the intricacies of the parameterized suffix, we can efficiently address complex applications with data structures that are also reusable in traditional matching scenarios. In this dissertation, we extend data structures for p-strings (and variants) to address sophisticated string computations.

We introduce a taxonomy of classes for longest factor problems. Using this taxonomy, we show an interesting connection between the parameterized longest previous factor (pLPF) and familiar data structures in string theory, including the border array, prefix array, longest common prefix array, and analogous p-string data structures. Exploiting this connection, we construct a multitude of data structures using the same general pLPF framework.

Before this dissertation, the p-match was defined predominantly by matching between uncompressed p-strings. Here, we introduce the compressed parameterized pattern match to find all p-matches between a pattern and a text, using only the pattern and a compressed form of the text. We present parameterized compression (p-compression) as a new way to losslessly compress data to support p-matching. Experimentally, it is shown that p-compression is competitive with standard compression schemes. Using p-compression, we address the compressed p-match independently of the underlying compression routine.

Currently, p-string theory lacks the capability to support indeterminate symbols, a capability essential for applications involving inexact matching, such as music analysis. In this work, we propose and efficiently address two new types of p-matching with indeterminate symbols. (1) We introduce the indeterminate parameterized match (ip-match) to permit matching with indeterminate holes in a p-string. We support the ip-match by introducing data structures that extend the prefix array. (2) From a different perspective, the equivalence parameterized match (e-match) evolves the p-match to consider intra-alphabet symbol classes as equivalence classes. We propose a method to perform the e-match using the p-string suffix array framework, i.e. the parameterized suffix array (pSA) and the parameterized longest common prefix array (pLCP). Historically, direct constructions of the pSA and pLCP have suffered from quadratic worst-case time bounds. Here, we introduce new p-string theory to efficiently construct the pSA/pLCP and break the theoretical worst-case time barrier.

Biological applications have become a classical use of p-string theory. Here, we introduce the structural border array to provide a lightweight solution to the biologically oriented variant of the p-match, i.e. the structural match (s-match) on structural strings (s-strings). Following the s-match, we show how to use s-string suffix structures to support various pattern matching problems involving RNA secondary structures. Finally, we propose and construct the forward stem matrix (FSM), a data structure for accessing RNA stem structures, and we apply the FSM to the detection of hairpins and pseudoknots in an RNA sequence.

This dissertation advances the state of the art in p-string theory by developing data structures for p-strings/s-strings and using p-string/s-string theory in new and old contexts to address various applications. Due to the flexibility of the p-string/s-string, the data structures and algorithms in this work are also applicable to the myriad problems in the string community that involve traditional strings.
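For readers new to the p-match, a minimal sketch of the classical prev-encoding test (due to Baker) may help; it assumes the parameter alphabet is given explicitly, and the helper names are mine, not the dissertation's:

```python
def prev_encode(s, params):
    """Baker-style prev() encoding: a constant maps to itself; a parameter
    maps to the distance back to its previous occurrence (0 if first).
    Two p-strings p-match exactly when their encodings are equal."""
    last, out = {}, []
    for i, c in enumerate(s):
        if c in params:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out

def p_match(s, t, params):
    return len(s) == len(t) and prev_encode(s, params) == prev_encode(t, params)

# x and y are parameters, a is a constant: the bijection x->y, y->x works.
print(p_match("xaxy", "yayx", {"x", "y"}))   # True
```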


2010
Vol 20 (01)
pp. 15-30
Author(s):
EDDY CARON
FRÉDÉRIC DESPREZ
FRANCK PETIT
CÉDRIC TEDESCHI

Several factors still hinder the deployment of computational grids over large-scale platforms. Among them, resource discovery is one crucial issue. New approaches, based on peer-to-peer technologies, tackle this issue. Because they efficiently support range queries, tries (a.k.a. prefix trees) appear to be among the most promising designs for distributed data structures indexing resources. Despite their lack of robustness in dynamic settings, trie-structured approaches outperform other peer-to-peer technologies by efficiently supporting range queries. In recent trie-based approaches, fault tolerance is handled by preventive mechanisms that make intensive use of replication. However, replication can be very costly in terms of computing and storage resources and does not ensure recovery of the system after arbitrary failures. Self-stabilization is an efficient approach to the design of reliable solutions for dynamic systems. It ensures that a system converges to its intended behavior, regardless of its initial state, in finite time. A snap-stabilizing algorithm guarantees that it always behaves according to its specification once the protocol is launched. In this paper, we provide the first snap-stabilizing protocol for trie construction. We design particular tries called Proper Greatest Common Prefix (PGCP) Trees. The proposed algorithm arranges the n label values stored in the tree in O(h + h′) rounds on average, where h and h′ are the initial and final heights of the tree, respectively. In the worst case, the algorithm requires O(n) extra space on each node, O(n) rounds, and O(n²) actions. However, simulations indicate that this worst case is far from being reached and confirm the average complexities, demonstrating the practical efficiency of this protocol.
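For orientation, here is a minimal sequential trie supporting the prefix-style queries that make tries attractive for resource discovery; it is only an illustration of the query pattern, not the snap-stabilizing distributed PGCP construction, and the resource labels are invented:

```python
class Trie:
    """Plain character trie: insert keys, then retrieve all keys sharing
    a prefix -- the range-style query that distinguishes tries from
    DHT-style flat hashing."""
    def __init__(self):
        self.children = {}
        self.is_key = False

    def insert(self, key):
        node = self
        for c in key:
            node = node.children.setdefault(c, Trie())
        node.is_key = True

    def keys_with_prefix(self, prefix):
        node = self
        for c in prefix:                  # descend to the prefix node
            if c not in node.children:
                return []
            node = node.children[c]
        out = []
        def walk(n, acc):                 # collect every key below it
            if n.is_key:
                out.append(prefix + acc)
            for c, child in sorted(n.children.items()):
                walk(child, acc + c)
        walk(node, "")
        return out

t = Trie()
for resource in ["cpu/x86/3GHz", "cpu/x86/2GHz", "cpu/arm/2GHz"]:
    t.insert(resource)
print(t.keys_with_prefix("cpu/x86/"))   # both x86 resources
```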


AMB Express
2019
Vol 9 (1)
Author(s):
Marcelo dos Santos Barbosa
Iara Beatriz Andrade de Sousa
Simone Simionatto
Sibele Borsuk
Silvana Beutinger Marchioro

Abstract
Current prevention methods for the transmission of Mycobacterium leprae, the causative agent of leprosy, are inadequate, as suggested by the rate of new leprosy cases reported. Simple large-scale detection methods for M. leprae infection are crucial for early detection of leprosy and disease control. The present study investigates the production and seroreactivity of a recombinant polypeptide composed of various M. leprae protein epitopes. The structural and physicochemical parameters of this construction were assessed using in silico tools. Parameters such as subcellular localization, presence of a signal peptide, primary, secondary, and tertiary structures, and the 3D model were ascertained using several bioinformatics tools. The resulting purified recombinant polypeptide, designated rMLP15, is composed of 15 peptides from six selected M. leprae proteins (ML1358, ML2055, ML0885, ML1811, ML1812, and ML1214) that induce T cell reactivity in leprosy patients from different hyperendemic regions. Using rMLP15 as the antigen, sera from 24 positive patients and 14 healthy controls were evaluated for reactivity via ELISA. ELISA-rMLP15 was able to diagnose 79.17% of leprosy patients with a specificity of 92.86%. rMLP15 was also able to detect multibacillary and paucibacillary patients in the same proportions, a desirable property in leprosy diagnosis. These results indicate the utility of the recombinant protein rMLP15 in the diagnosis of leprosy and the future development of a viable screening test.
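As a back-of-envelope check (an assumption on my part; the abstract reports only percentages), the figures are consistent with 19 of 24 patient sera and 13 of 14 control sera classified correctly:

```python
# Assumed counts: 19/24 reactive patient sera, 13/14 non-reactive controls.
sensitivity = 19 / 24   # fraction of leprosy patients detected
specificity = 13 / 14   # fraction of healthy controls correctly negative
print(f"sensitivity = {sensitivity:.2%}")   # 79.17%
print(f"specificity = {specificity:.2%}")   # 92.86%
```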


2010
Vol DMTCS Proceedings vol. AM,... (Proceedings)
Author(s):
Thomas Fernique
Damien Regnault

This paper introduces a Markov process inspired by the problem of quasicrystal growth. It acts over dimer tilings of the triangular grid by randomly performing local transformations, called $\textit{flips}$, which do not increase the number of identical adjacent tiles (this number can be thought of as the tiling energy). Fixed points of such a process play the role of quasicrystals. We are interested here in the worst-case expected number of flips needed to converge to a fixed point. Numerical experiments suggest a $\Theta(n^2)$ bound, where $n$ is the number of tiles of the tiling. We prove an $O(n^{2.5})$ upper bound and discuss the gap between this bound and the previous one. We also briefly discuss the average case.
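To convey the flavor of such a process, here is a toy one-dimensional analog: binary cells instead of dimer tilings of the triangular grid, with only strictly energy-decreasing flips so the simulation is guaranteed to stop. It is an invented illustration, not the model analyzed in the paper:

```python
import random

def flip_delta(s, i):
    """Energy change from flipping cell i, where energy counts identical
    adjacent cells (only the two pairs touching i are affected)."""
    d = 0
    if i > 0:
        d += (s[i - 1] == 1 - s[i]) - (s[i - 1] == s[i])
    if i < len(s) - 1:
        d += (1 - s[i] == s[i + 1]) - (s[i] == s[i + 1])
    return d

def relax(s, rng):
    """Apply random strictly energy-decreasing flips until a fixed point
    (a local energy minimum) is reached; count the flips."""
    s = list(s)
    flips = 0
    while True:
        candidates = [i for i in range(len(s)) if flip_delta(s, i) < 0]
        if not candidates:
            energy = sum(s[i] == s[i + 1] for i in range(len(s) - 1))
            return flips, energy
        s[rng.choice(candidates)] ^= 1
        flips += 1

rng = random.Random(0)
n = 100
results = [relax([rng.randint(0, 1) for _ in range(n)], rng) for _ in range(10)]
print("average flips:", sum(f for f, _ in results) / len(results))
print("fixed-point energies:", sorted(e for _, e in results))
```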


2019
Author(s):
Jaclyn Marjorie Smith
Melvin Lathara
Hollis Wright
Brian Hill
Nalini Ganapati
...

Abstract

Background: The affordability of next-generation genomic sequencing and the improvement of medical data management have contributed largely to the evolution of biological analysis, from both a clinical and a research perspective. Precision medicine is a response to these advancements that places individuals into better-defined subsets based on shared clinical and genetic features. The identification of personalized diagnosis and treatment options depends on the ability to draw insights from large-scale, multi-modal analysis of biomedical datasets. Driven by a real use case, we posit that platforms supporting precision medicine analysis should maintain data in their optimal data stores, should support distributed storage and query mechanisms, and should scale as more samples are added to the system.

Results: We extended a genomics-based columnar data store, GenomicsDB, for ease of use within a distributed analytics platform for clinical and genomic data integration, known as the ODA framework. The framework supports interaction from an i2b2 plugin as well as a notebook environment. We show that the ODA framework exhibits worst-case linear scaling for array size (storage), import time (data construction), and query time for an increasing number of samples. We go on to show worst-case linear time for both import of clinical data and aggregate query execution time within a distributed environment.

Conclusions: This work highlights the integration of a distributed genomic database with a distributed compute environment to support scalable and efficient precision medicine queries from a HIPAA-compliant cohort system in a real-world setting. The ODA framework is currently deployed in production to support precision medicine exploration and analysis by clinicians and researchers at the UCLA David Geffen School of Medicine.
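A hypothetical harness for the kind of scaling check reported here might look as follows; run_query is a stand-in for an actual ODA/GenomicsDB call, which this sketch does not model:

```python
import time

def measure(run_query, sample_counts):
    """Time run_query at each sample count; return (count, seconds) pairs."""
    timings = []
    for n in sample_counts:
        t0 = time.perf_counter()
        run_query(n)
        timings.append((n, time.perf_counter() - t0))
    return timings

def per_sample_cost(timings):
    """A roughly constant per-sample cost across scales suggests the
    growth in query time is at worst linear."""
    return [(n, t / n) for n, t in timings]

# Stand-in linear workload in place of a real distributed query.
fake_query = lambda n: sum(range(n * 100_000))
for n, cost in per_sample_cost(measure(fake_query, [1, 2, 4, 8])):
    print(f"{n} samples: {cost:.4f} s per sample")
```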


2015
Vol 138 (1)
Author(s):
Jesse Austin-Breneman
Bo Yang Yu
Maria C. Yang

During the early-stage design of large-scale engineering systems, design teams are challenged to balance a complex set of considerations. The established structured approaches for optimizing complex system designs offer strategies for achieving optimal solutions, but in practice suboptimal system-level results are often reached due to factors such as satisficing, ill-defined problems, or other project constraints. Twelve subsystem and system-level practitioners at a large aerospace organization were interviewed to understand the ways in which they integrate subsystems in their own work. Responses showed that subsystem team members often presented conservative, worst-case scenarios to other subsystems when negotiating a tradeoff, as a way of hedging against their own future needs. This practice of biased information passing, referred to informally by the practitioners as adding “margins,” is modeled in this paper with a series of optimization simulations. Three “bias” conditions were tested: no bias, a constant bias, and a bias that decreases with time. Results from the simulations show that biased information passing increases the number of iterations needed to converge and degrades the Pareto optimality of system-level solutions. Results are also compared with the interview responses and highlight several themes with respect to complex system design practice.
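A toy version of such a simulation might look as follows; the coupling rules, margin sizes, and tolerance are invented for illustration and are not taken from the paper:

```python
def negotiate(bias, tol=1e-4, max_iter=1000):
    """Fixed-point negotiation between two coupled subsystems; each
    reports its requirement inflated by bias(k) at iteration k.
    Returns (iterations to converge, total 'mass' of the design)."""
    a, b = 0.0, 0.0
    for k in range(1, max_iter + 1):
        a_new = 0.5 * b * (1 + bias(k)) + 10   # invented sizing rule for A
        b_new = 0.3 * a * (1 + bias(k)) + 5    # invented sizing rule for B
        if abs(a_new - a) + abs(b_new - b) < tol:
            return k, a_new + b_new
        a, b = a_new, b_new
    return max_iter, a + b

conditions = [("no bias",             lambda k: 0.0),
              ("constant 20% margin", lambda k: 0.2),
              ("decaying margin",     lambda k: 0.2 / k)]
for name, bias in conditions:
    iters, mass = negotiate(bias)
    print(f"{name:20s} iterations = {iters:3d}, total mass = {mass:.2f}")
```

Under these invented rules the constant margin settles on a heavier design, while the decaying margin recovers the unbiased result at the cost of extra iterations, mirroring the qualitative finding above.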


Author(s):
Juha Kärkkäinen
Giovanni Manzini
Simon J. Puglisi
