unique substrings Latest Research Papers

Nonderived environment blocking and input-oriented computation

Evolutionary Linguistic Theory ◽

10.1075/elt.00031.cha ◽

2021 ◽

Vol 3 (2) ◽

pp. 129-153

Author(s):

Jane Chandlee

Keyword(s):

Computational Complexity ◽

Complexity Classes ◽

Input Structure ◽

Unique Substrings

Abstract This paper presents a computational account of nonderived environment blocking (NDEB) that indicates the challenges it has posed for phonological theory do not stem from any inherent complexity of the patterns themselves. Specifically, it makes use of input strictly local (ISL) functions, which are among the most restrictive (i.e., lowest computational complexity) classes of functions in the subregular hierarchy (Heinz 2018) and shows that NDEB is ISL provided the derived and nonderived environments correspond to unique substrings in the input structure. Using three classic examples of NDEB from Finnish, Polish, and Turkish, it is shown that the distinction between derived and nonderived sequences is fully determined by the input structure and can be achieved without serial derivation or intermediate representations. This result reveals that such cases of NDEB are computationally unexceptional and lends support to proposals in rule- and constraint-based theories that make use of its input-oriented nature.
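The input-oriented idea in this abstract can be illustrated with a toy sketch. It assumes a hypothetical encoding in which a morpheme boundary is written as `+` in the input string, and it is loosely modeled on the often-cited Finnish assibilation pattern (t becomes s before i across a boundary). The function name, the window size of 3, and the example forms are illustrative assumptions, not the paper's formalization.

```python
def apply_windowed_rule(word: str) -> str:
    """Rewrite 't' as 's' only when the input window is exactly 't+i'.

    Each output symbol depends solely on a bounded window of the INPUT
    (here: the current symbol plus the next two), which is the flavor of
    an input strictly local map. No intermediate forms are consulted.
    """
    out = []
    for i, ch in enumerate(word):
        window = word[i:i + 3]  # current symbol and two symbols of lookahead
        if window == "t+i":
            out.append("s")     # derived environment: boundary inside the window
        else:
            out.append(ch)      # nonderived 'ti' (no boundary) is left alone
    return "".join(out)


print(apply_windowed_rule("tilat+i"))  # the initial 'ti' is untouched
print(apply_windowed_rule("koti"))     # no boundary, so no change
```

Because the substring `t+i` differs from the substring `ti` in the input itself, the derived and nonderived cases can be told apart without any serial derivation.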

Computing Minimal Unique Substrings for a Sliding Window

Algorithmica ◽

10.1007/s00453-021-00864-1 ◽

2021 ◽

Author(s):

Takuya Mieno ◽

Yuta Fujishige ◽

Yuto Nakashima ◽

Shunsuke Inenaga ◽

Hideo Bannai ◽

...

Keyword(s):

Sliding Window ◽

Input String ◽

Unique Substrings

Abstract A substring u of a string T is called a minimal unique substring (MUS) of T if u occurs exactly once in T and any proper substring of u occurs at least twice in T. In this paper, we study the problem of computing MUSs for a sliding window over a given string T. We first show how the set of MUSs can change when the window slides over T. We then present an $O(n \log \sigma')$-time and $O(d)$-space algorithm to compute MUSs for a sliding window of size d over the input string T of length n, where $\sigma' \le d$ is the maximum number of distinct characters in every window.
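The MUS definition above can be checked directly with a brute-force sketch. This is not the paper's sliding-window algorithm: it recounts all substrings of each window and is far slower, and the function names are hypothetical. It relies on one observation: if the two longest proper substrings of u (u without its first or last character) each occur at least twice, then every proper substring of u does too.

```python
from collections import Counter


def minimal_unique_substrings(text: str) -> list[tuple[int, int]]:
    """Return all MUSs of `text` as half-open intervals (start, end)."""
    n = len(text)
    counts = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))

    def repeats(s: str) -> bool:
        # The empty string is treated as repeating, so a character that
        # occurs exactly once is a MUS of length 1.
        return s == "" or counts[s] >= 2

    result = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            unique = counts[text[i:j]] == 1
            if unique and repeats(text[i + 1:j]) and repeats(text[i:j - 1]):
                result.append((i, j))
    return result


def mus_per_window(text: str, d: int) -> list[list[tuple[int, int]]]:
    """Naively recompute the MUSs for every window of size d.

    Intervals are reported in coordinates of the full string.
    """
    windows = []
    for start in range(len(text) - d + 1):
        local = minimal_unique_substrings(text[start:start + d])
        windows.append([(start + i, start + j) for i, j in local])
    return windows


print(minimal_unique_substrings("abaab"))
print(mus_per_window("abaab", 3))
```

For the string `abaab`, the whole-string MUSs are `ba` and `aa`, while the windows of size 3 show how the set changes as the window slides.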

Space-efficient algorithms for computing minimal/shortest unique substrings

Theoretical Computer Science ◽

10.1016/j.tcs.2020.09.017 ◽

2020 ◽

Vol 845 ◽

pp. 230-242

Author(s):

Takuya Mieno ◽

Dominik Köppl ◽

Yuto Nakashima ◽

Shunsuke Inenaga ◽

Hideo Bannai ◽

...

Keyword(s):

Efficient Algorithms ◽

Unique Substrings

Efficient Data Structures for Range Shortest Unique Substring Queries

Algorithms ◽

10.3390/a13110276 ◽

2020 ◽

Vol 13 (11) ◽

pp. 276

Author(s):

Paniz Abedin ◽

Arnab Ganguly ◽

Solon P. Pissis ◽

Sharma V. Thankachan

Keyword(s):

Information Retrieval ◽

Data Structure ◽

Data Structures ◽

Query Time ◽

Geometric Data ◽

Small Constant ◽

Efficient Data ◽

Online Queries ◽

Efficient Data Structures ◽

Unique Substrings

Let T[1,n] be a string of length n and T[i,j] be the substring of T starting at position i and ending at position j. A substring T[i,j] of T is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string T as input, the Shortest Unique Substring problem is to find a shortest substring of T that does not occur elsewhere in T. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem. The task is to construct a data structure over T answering the following type of online queries efficiently. Given a range [α,β], return a shortest substring T[i,j] of T with exactly one occurrence in [α,β]. We present an $O(n \log n)$-word data structure with $O(\log_w n)$ query time, where $w = \Omega(\log n)$ is the word size. Our construction is based on a non-trivial reduction allowing us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018]. Additionally, we present an $O(n)$-word data structure with $O(\sqrt{n} \log^{\epsilon} n)$ query time, where $\epsilon > 0$ is an arbitrarily small constant. The latter data structure relies heavily on another geometric data structure [Nekrich and Navarro, SWAT 2012].
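The query semantics can be made concrete with a naive sketch that answers each query from scratch, with no preprocessing. It is not the data structure from the paper. It assumes that "exactly one occurrence in [α,β]" means an occurrence lying entirely inside T[α,β], and the function names are hypothetical. Positions in `range_sus` are 1-indexed and inclusive, as in the abstract.

```python
from collections import Counter
from typing import Optional


def shortest_unique_substring(text: str) -> Optional[tuple[int, int]]:
    """Leftmost shortest substring occurring exactly once in `text`.

    Returns a 0-indexed half-open interval (start, end), or None if
    `text` is empty.
    """
    n = len(text)
    for length in range(1, n + 1):
        counts = Counter(text[i:i + length] for i in range(n - length + 1))
        for i in range(n - length + 1):
            if counts[text[i:i + length]] == 1:
                return (i, i + length)
    return None


def range_sus(text: str, alpha: int, beta: int) -> Optional[tuple[int, int]]:
    """Naive range query: a shortest substring unique within T[alpha, beta].

    Assumption: occurrences are counted only if they lie fully inside
    the range. Input and output positions are 1-indexed and inclusive.
    """
    window = text[alpha - 1:beta]
    hit = shortest_unique_substring(window)
    if hit is None:
        return None
    i, j = hit
    return (alpha + i, alpha + j - 1)


print(shortest_unique_substring("abaab"))
print(range_sus("abaab", 1, 3))
print(range_sus("abaab", 3, 4))
```

Each call rescans the window, so the cost grows with the range length; the point of the data structures in the abstract is to avoid exactly this per-query work.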

More Time-Space Tradeoffs for Finding a Shortest Unique Substring

Algorithms ◽

10.3390/a13090234 ◽

2020 ◽

Vol 13 (9) ◽

pp. 234

Author(s):

Hideo Bannai ◽

Travis Gagie ◽

Gary Hoppenworth ◽

Simon J. Puglisi ◽

Luís M. S. Russo

Keyword(s):

High Probability ◽

Random Access ◽

Deterministic Algorithm ◽

Time Space ◽

Unique Substrings ◽

New Time

We extend recent results regarding finding shortest unique substrings (SUSs) to obtain new time-space tradeoffs for this problem and the generalization of finding k-mismatch SUSs. Our new results include the first algorithm for finding a k-mismatch SUS in sublinear space, which we obtain by extending an algorithm by Senanayaka (2019) and combining it with a result on sketching by Gawrychowski and Starikovskaya (2019). We first describe how, given a text T of length n and m words of workspace, with high probability we can find an SUS of length L in $O(n (L/m) \log L)$ time using random access to T, or in $O(n (L/m) \log^2(L) \log\log\sigma)$ time using $O((L/m) \log^2 L)$ sequential passes over T. We then describe how, for constant k, with high probability, we can find a k-mismatch SUS in $O(n^{1+\epsilon} L/m)$ time using $O(n^{\epsilon} L/m)$ sequential passes over T, again using only m words of workspace. Finally, we also describe a deterministic algorithm that takes $O(n \tau \log\sigma \log n)$ time to find an SUS using $O(n/\tau)$ words of workspace, where $\tau$ is a parameter.

Strain Level Microbial Detection and Quantification with Applications to Single Cell Metagenomics

10.1101/2020.06.12.149245 ◽

2020 ◽

Author(s):

Kaiyuan Zhu ◽

Welles Robinson ◽

Alejandro A. Schäffer ◽

Junyan Xu ◽

Eytan Ruppin ◽

...

Keyword(s):

Single Cells ◽

Distinct Species ◽

Strain Level ◽

Abundance Estimation ◽

Microbial Abundance ◽

Sequencing Data ◽

Arbitrary Length ◽

Methodological Innovation ◽

Microbial Identification ◽

Unique Substrings

Abstract The identification and quantification of microbial abundance at the species or strain level from sequencing data is crucial for our understanding of human health and disease. Existing approaches for microbial abundance estimation either use accurate but computationally expensive alignment-based approaches for species-level estimation or less accurate but computationally fast alignment-free approaches that fail to classify many reads accurately at the species or strain-level. Here we introduce CAMMiQ, a novel combinatorial solution to the microbial identification and abundance estimation problem, which performs better than the best used tools on simulated and real datasets with respect to the number of correctly classified reads (i.e., specificity) by an order of magnitude and resolves possible mixtures of similar genomes. As we demonstrate, CAMMiQ can better distinguish between single cells deliberately infected with distinct Salmonella strains and sequenced using scRNA-seq reads than alternative approaches. We also demonstrate that CAMMiQ is also more accurate than the best used approaches on a variety of synthetic genomic read data involving some of the most challenging bacterial genomes derived from NCBI RefSeq database; it can distinguish not only distinct species but also closely related strains of bacteria. The key methodological innovation of CAMMiQ is its use of arbitrary length, doubly-unique substrings, i.e. substrings that appear in (exactly) two genomes in the input database, instead of fixed-length, unique substrings. To resolve the ambiguity in the genomic origin of doubly-unique substrings, CAMMiQ employs a combinatorial optimization formulation, which can be solved surprisingly quickly. CAMMiQ’s index consists of a sparsified subset of the shortest unique and doubly-unique substrings of each genome in the database, within a user specified length range and as such it is fairly compact. In short, CAMMiQ offers more accurate genomic identification and abundance estimation than the best used alternatives while using similar computational resources.

Availability: https://github.com/algo-cancer/CAMMiQ
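The distinction between unique and doubly-unique substrings can be sketched on toy data. This is not CAMMiQ's index or its optimization step. It assumes that "unique" means a substring found in exactly one genome of the database, and the function names and the three short sequences are made up for illustration.

```python
from collections import defaultdict


def owners_by_substring(genomes: dict[str, str], min_len: int, max_len: int) -> dict[str, set[str]]:
    """Map each substring in the length range to the genomes containing it."""
    owners: dict[str, set[str]] = defaultdict(set)
    for name, seq in genomes.items():
        for length in range(min_len, max_len + 1):
            for i in range(len(seq) - length + 1):
                owners[seq[i:i + length]].add(name)
    return owners


def classify_substrings(genomes: dict[str, str], min_len: int, max_len: int):
    """Split substrings into unique (one genome) and doubly-unique (two genomes)."""
    owners = owners_by_substring(genomes, min_len, max_len)
    unique = {s: next(iter(g)) for s, g in owners.items() if len(g) == 1}
    doubly = {s: tuple(sorted(g)) for s, g in owners.items() if len(g) == 2}
    return unique, doubly


# Hypothetical toy database; real genomes are orders of magnitude longer.
toy = {"g1": "ACGT", "g2": "ACGA", "g3": "TCGA"}
unique, doubly = classify_substrings(toy, 3, 3)
print(unique)
print(doubly)
```

In this toy database, `g2` has no unique substring of length 3 at all, yet it is covered by two doubly-unique ones, which hints at why allowing doubly-unique substrings can help separate closely related genomes.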
