A Character Frequency based Approach to Search for Substrings of a Circular Pattern and its Conjugates in an Online Text

2021 ◽  
Vol 22 (2) ◽  
Author(s):  
Vinod Prasad

A fundamental problem in computational biology is dealing with circular patterns. The problem consists of finding, in a database, substrings of at least a certain length of a pattern and its rotations. In this paper, a novel method is presented to deal with circular patterns. The problem is solved in two incremental steps. First, an algorithm is provided that reports all substrings of a given linear pattern in an online text. Next, without losing efficiency, the algorithm is extended to process all circular rotations of the pattern. For a given pattern P of size M and a text T of size N, the algorithm reports all locations in the text where a substring of Pc is found, where Pc is one of the rotations of P. For an alphabet of size σ, using O(M) space, the desired goals are achieved in O(MN/σ) time on average, which is O(N) for all patterns of length M ≤ σ. Traditional string processing algorithms make use of advanced data structures such as suffix trees and automata. We show that basic data structures such as arrays can be used in text processing algorithms without compromising efficiency.
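To make the problem statement concrete, the following minimal sketch reports every text position where a length-k substring of some rotation of the pattern begins. It is not the character-frequency algorithm of the paper: it is a naive baseline that stores all circular k-grams of the pattern in a hash set, and the function names and the fixed length parameter `k` are assumptions introduced here for illustration.

```python
def circular_kgrams(pattern: str, k: int) -> set[str]:
    """Return all length-k substrings of all rotations of `pattern`.

    Every such substring starts at one of the M positions of the pattern
    and wraps around its end, so appending the first k-1 characters to the
    pattern exposes all of them as ordinary substrings.
    """
    m = len(pattern)
    if not 1 <= k <= m:
        raise ValueError("k must satisfy 1 <= k <= len(pattern)")
    doubled = pattern + pattern[:k - 1]
    return {doubled[i:i + k] for i in range(m)}


def find_circular_substrings(text: str, pattern: str, k: int) -> list[int]:
    """Naive baseline (hypothetical helper): positions in `text` where a
    length-k substring of some rotation of `pattern` starts."""
    grams = circular_kgrams(pattern, k)
    return [i for i in range(len(text) - k + 1) if text[i:i + k] in grams]


if __name__ == "__main__":
    # Rotations of "abc" are "abc", "bca", "cab"; their 2-grams are ab, bc, ca.
    print(find_circular_substrings("xcabx", "abc", 2))
```

The set holds at most M strings of length k, so this baseline uses O(Mk) space rather than the O(M) reported in the abstract; it is meant only as a reference against which a faster method could be checked.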

10.37236/1947 ◽  
2005 ◽  
Vol 12 (1) ◽  
Author(s):  
Avraham Goldstein ◽  
Petr Kolman ◽  
Jie Zheng

String comparison is a fundamental problem in computer science, with applications in areas such as computational biology, text processing and compression. In this paper we address the minimum common string partition problem, a string comparison problem with a tight connection to the problem of sorting by reversals with duplicates, a key problem in genome rearrangement. A partition of a string $A$ is a sequence ${\cal P} = (P_1,P_2,\dots,P_m)$ of strings, called the blocks, whose concatenation is equal to $A$. Given a partition ${\cal P}$ of a string $A$ and a partition ${\cal Q}$ of a string $B$, we say that the pair $\langle{{\cal P},{\cal Q}}\rangle$ is a common partition of $A$ and $B$ if ${\cal Q}$ is a permutation of ${\cal P}$. The minimum common string partition problem (MCSP) is to find a common partition of two strings $A$ and $B$ with the minimum number of blocks. The restricted version of MCSP where each letter occurs at most $k$ times in each input string is denoted by $k$-MCSP. In this paper, we show that $2$-MCSP (and therefore MCSP) is NP-hard and, moreover, even APX-hard. We describe a $1.1037$-approximation for $2$-MCSP and a linear time $4$-approximation algorithm for $3$-MCSP. We are not aware of any better approximations.
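The definition of a common partition can be checked mechanically. The sketch below only verifies a candidate pair of partitions against that definition; it does not search for a minimum partition and is unrelated to the approximation algorithms of the paper. The function name and the non-empty-block requirement are assumptions made here.

```python
from collections import Counter
from typing import Sequence


def is_common_partition(a: str, b: str,
                        p: Sequence[str], q: Sequence[str]) -> bool:
    """Check whether (p, q) is a common partition of strings a and b.

    p must concatenate to a, q must concatenate to b, and q must be a
    permutation of p, i.e. both contain the same blocks with the same
    multiplicities. Empty blocks are rejected (an assumption of this sketch).
    """
    if not all(p) or not all(q):
        return False
    return ("".join(p) == a
            and "".join(q) == b
            and Counter(p) == Counter(q))


if __name__ == "__main__":
    # "abcab" = abc + ab and "ababc" = ab + abc: a common partition, 2 blocks.
    print(is_common_partition("abcab", "ababc", ["abc", "ab"], ["ab", "abc"]))
```

Comparing the blocks as multisets with `Counter` matters because the same block may occur several times in a partition.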


Web Mining ◽  
2011 ◽  
pp. 322-338 ◽  
Author(s):  
Zhixiang Chen ◽  
Richard H. Fowler ◽  
Ada Wai-Chee Fu ◽  
Chunyue Wang

A maximal forward reference of a Web user is a longest consecutive sequence of Web pages visited by the user in a session without revisiting any previously visited page in the sequence. Efficient mining of frequent traversal path patterns, that is, large reference sequences of maximal forward references, from very large Web logs is a fundamental problem in Web mining. This chapter aims at designing algorithms for this problem with the best possible efficiency. First, two optimal linear time algorithms are designed for finding maximal forward references from Web logs. Second, two algorithms for mining frequent traversal path patterns are devised with the help of a fast construction of shallow generalized suffix trees over a very large alphabet. These two algorithms have provable linear and sublinear time complexity, respectively, and their performance is analyzed in comparison with the Apriori-like algorithms and the Ukkonen algorithm. It is shown that these two new algorithms are substantially more efficient than the Apriori-like algorithms and the Ukkonen algorithm.
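As an illustration of the first step, the sketch below extracts maximal forward references from a single session under one common reading of the definition: a revisit is treated as backtracking to the earlier page, and the forward path accumulated up to that point is emitted. This is an assumed interpretation and not one of the chapter's optimal algorithms; the membership test on a list makes it slower than linear time, and the function name and the example session are made up for illustration.

```python
from typing import Hashable, Sequence


def maximal_forward_references(session: Sequence[Hashable]) -> list[list]:
    """Split one session into maximal forward references.

    `path` holds the current forward path. When a page already on the path
    is visited again, the path is emitted (if the previous move was forward)
    and then truncated back to that page.
    """
    path: list = []
    results: list[list] = []
    moving_forward = False
    for page in session:
        if page in path:                      # backward reference
            if moving_forward:
                results.append(list(path))
            del path[path.index(page) + 1:]   # backtrack to the revisited page
            moving_forward = False
        else:                                 # forward reference
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        results.append(list(path))
    return results


if __name__ == "__main__":
    for ref in maximal_forward_references("ABCDCBEGHGWAOUOV"):
        print("".join(ref))
```

Replacing the list scan with a dictionary from page to its position on the path would bring the per-page cost down to amortized constant time.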


1996 ◽  
Vol 05 (01n02) ◽  
pp. 199-218 ◽  
Author(s):  
J.R. BENTON ◽  
S.S. IYENGAR ◽  
W. DENG ◽  
N. BRENER ◽  
V.S. SUBRAHMANIAN

This paper defines a new approach and investigates a fundamental problem in route planners. This capability is important for robotic vehicles (Martian Rovers, etc.) and for planning off-road military maneuvers. The emphasis throughout this paper will be on the design, analysis, and hierarchical implementation of our route planner. This work was motivated by anticipation of the need to search a grid of a trillion points for optimum routes. This cannot be done simply by scaling upward from the algorithms used to search a grid of 10,000 points. Algorithms sufficient for the small grid are totally inadequate for the large grid. Soon, the challenge will be to compute off-road routes more than 100 km long on a one- or two-meter grid. Previous efforts are reviewed, the data structures, decomposition methods and search algorithms are analyzed, and their limitations are discussed. A detailed discussion of a hierarchical implementation is provided and the experimental results are analyzed.


2015 ◽  
Vol 27 (2) ◽  
pp. 277-295 ◽  
Author(s):  
MAXIME CROCHEMORE ◽  
COSTAS S. ILIOPOULOS ◽  
ALESSIO LANGIU ◽  
FILIPPO MIGNOSI

Given a set $\mathcal{D}$ of $q$ documents, the Longest Common Substring (LCS) problem asks, for any integer $2 \leqslant k \leqslant q$, for the longest substring that appears in $k$ documents. LCS is a well-studied problem having a wide range of applications in Bioinformatics: from microarrays to DNA sequence alignment and analysis. This problem has been solved by Hui (2000, International Journal of Computer Science and Engineering 15, 73–76) by using a famous constant-time solution to the Lowest Common Ancestor (LCA) problem in trees coupled with the use of suffix trees. In this article, we present a simple method for solving the LCS problem by using suffix trees (STs) and classical union-find data structures. In turn, we show how this simple algorithm can be adapted in order to work with other space-efficient data structures such as the enhanced suffix arrays (ESA) and the compressed suffix tree.
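For small inputs, the problem statement can be made concrete with a brute-force sketch. It enumerates every substring of every document, so it takes quadratic space per document and is in no way the suffix-tree and union-find method of the article. The function name is hypothetical, and "appears in k documents" is read here as "appears in at least k documents", which is an assumption.

```python
from collections import Counter
from typing import Sequence


def longest_common_substring(docs: Sequence[str], k: int) -> str:
    """Brute force: longest substring occurring in at least k documents.

    Ties between equally long substrings are broken by taking the
    lexicographically largest one, so the result is deterministic.
    Returns "" when no non-empty substring is shared by k documents.
    """
    if not 2 <= k <= len(docs):
        raise ValueError("k must satisfy 2 <= k <= number of documents")
    doc_count: Counter = Counter()
    for doc in docs:
        # Use a set so each document contributes at most once per substring.
        substrings = {doc[i:j]
                      for i in range(len(doc))
                      for j in range(i + 1, len(doc) + 1)}
        doc_count.update(substrings)
    shared = [s for s, c in doc_count.items() if c >= k]
    return max(shared, key=lambda s: (len(s), s), default="")


if __name__ == "__main__":
    documents = ["banana", "bandana", "cabana"]
    print(longest_common_substring(documents, 2))
    print(longest_common_substring(documents, 3))
```

A sketch like this is mainly useful as a correctness oracle when testing an efficient implementation on short strings.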


Author(s):  
Amir Adel Mabrouk Eldeib, Moulay Ibrahim El-Khalil Ghembaza

The science of diacritical marks is closely related to the Holy Quran, where it was used to remove confusion and error from the reader's pronunciation. Introducing any technique for processing Quranic texts will therefore ease the work of researchers in Quranic studies, whether by helping the reader of the Quran recite accurately and correctly, or by helping the tutor compile examples appropriate for training. The importance of this research lies in employing automated text-processing algorithms to determine the locations of the Nunation vowelization types in the Holy Quran, and in the possibility of computerizing them, in order to facilitate accurate recitation of the Holy Quran and, at the same time, to collect training examples in a database or build a corpus for future use in research and software applications for the Holy Quran and its sciences. This research presents a new idea by proposing a framework architecture that automatically identifies the locations and types of Nunation in the Holy Quran. The framework uses a part-of-speech tagging algorithm for the Arabic language to determine the type of each word, then uses a knowledge base to discover the appropriate Nunation words and their locations, and finally discovers the type of Nunation in order to determine the vowelization of the last letter of each Nunation word according to the science of Quran diacritical marks. A further benefit is linking search processes with Quranic texts in order to extract the composition Nunation and the sequence Nunations in the Holy Quran, as they emerge from the science of Quran diacritical marks, and to display them as data according to a set of options selected by the user through suitable application interfaces.
The basic elements that the results of searching Quranic texts should display are highlighted, in order to extract the positions and types of Nunation vowelizations. In addition, a template is given for the results of searching for all types of Nunation in a specific Quranic chapter, with several possible options for retrieving all data in detail.


2017 ◽  
Author(s):  
Artem Babaian ◽  
Anicet Ebou ◽  
Alyssa Fegen ◽  
Ho Yin (Jeffrey) Kam ◽  
German E. Novakovsky ◽  
...  

Computational biology requires the reading and comprehension of biological data files. Plain-text formats such as SAM, VCF, GTF, PDB and FASTA often contain critical information that is obfuscated by the complexity of the data structures. bioSyntax (http://bioSyntax.org) is a freely available suite of syntax highlighting packages for vim, gedit, Sublime, and less, which helps computational scientists parse and work with their data more efficiently.


2006 ◽  
Vol 17 (06) ◽  
pp. 1385-1402 ◽  
Author(s):  
Élise Prieur ◽  
Thierry Lecroq

We present a first formal setting for suffix vectors, which are space-economical alternative data structures to suffix trees. We give two linear algorithms for converting a suffix tree into a suffix vector and conversely. We enrich suffix vectors with formulas for counting the number of occurrences of repeated substrings. We also propose an alternative implementation for suffix vectors that should outperform the existing one.


1996 ◽  
Vol 06 (01) ◽  
pp. 35-44 ◽  
Author(s):  
DANY BRESLAUER ◽  
RAMESH HARIHARAN

This paper gives optimal parallel algorithms for the construction of the smallest deterministic finite automata recognizing all the suffixes and the factors of a string. The algorithms use recently discovered optimal parallel suffix tree construction algorithms together with data structures for the efficient manipulation of trees, exploiting the well-known relation between suffix and factor automata and suffix trees.


Symmetry ◽  
2020 ◽  
Vol 12 (9) ◽  
pp. 1505
Author(s):  
Luis Acedo ◽  
Abraham J. Arenas ◽  
Nicolas De La Espriella

In this article, we design a novel method for finding the exact solution of the geodesic equation in Schwarzschild spacetime, which represents the trajectories of the particles. This is a fundamental problem in astrophysics and astrodynamics if we want to incorporate relativistic effects in high-precision calculations. Here, we show that exact analytical expressions can be given, in terms of modal transseries, for the spiral orbits as they approach the limit cycles given by the two circular orbits that appear for each angular momentum value. The solution is expressed in terms of transseries generated by transmonomials of the form $e^{-n\theta}$, $n = 1, 2, \dots$, where $\theta$ is the angle measured in the orbital plane. Examples are presented that verify the effect of the solutions.

