String Matching Problems from Bioinformatics Which Still Need Better Solutions

Author(s):  
Gaston H. Gonnet
Entropy ◽  
2020 ◽  
Vol 23 (1) ◽  
pp. 31
Author(s):  
Ivan Markić ◽  
Maja Štula ◽  
Marija Zorić ◽  
Darko Stipaničev

The string-matching paradigm is applied in every branch of computer science and in science in general. The plethora of existing string-matching algorithms makes it hard to choose the best one for any particular case. Expressing, measuring, and testing algorithm efficiency is a challenging task with many potential pitfalls. Algorithm efficiency can be measured based on the usage of different resources. In software engineering, algorithmic productivity is a property of an algorithm execution identified with the computational resources the algorithm consumes. Resource usage during execution can be measured, and for maximum efficiency the goal is to minimize it. Standard measures of algorithm efficiency, such as execution time, depend directly on the number of executed actions. Guided by this fact, and leaving aside computer power consumption and memory use, which also depend on the algorithm type and the techniques used in its development, we have developed a methodology that enables researchers to choose an efficient algorithm for a specific domain. The efficiency of string-searching algorithms is usually observed independently of the domain texts being searched. This research paper aims to present the idea that algorithm efficiency depends on the properties of the searched string and the properties of the texts being searched, accompanied by a theoretical analysis of the proposed approach. In the proposed methodology, algorithm efficiency is expressed through the character comparison count metric, a formal quantitative measure independent of algorithm implementation subtleties and computer platform differences. The model is developed for a particular problem domain by using appropriate domain data (patterns and texts) and provides, for that domain, a ranking of algorithms according to the patterns' entropy. The proposed approach is limited to on-line exact string-matching problems and is based on the information entropy of the search pattern. Meticulous empirical testing illustrates the implementation of the methodology and supports its soundness.
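The two quantities this abstract relies on, pattern entropy and character comparison count, can be illustrated with a minimal sketch. It assumes that "entropy" means Shannon entropy over the pattern's character frequencies (the abstract does not give the exact definition) and uses brute-force search only as a stand-in for the algorithms being ranked. The function names `pattern_entropy` and `naive_search_comparisons` are hypothetical.

```python
import math
from collections import Counter


def pattern_entropy(pattern: str) -> float:
    """Shannon entropy, in bits per character, of the pattern's characters.

    Assumption: entropy is taken over single-character frequencies.
    """
    n = len(pattern)
    counts = Counter(pattern)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def naive_search_comparisons(text: str, pattern: str):
    """Brute-force exact search that also counts character comparisons.

    Returns (0-based match positions, number of character comparisons).
    """
    positions = []
    comparisons = 0
    m = len(pattern)
    for i in range(len(text) - m + 1):
        j = 0
        while j < m:
            comparisons += 1  # one text/pattern character comparison
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            positions.append(i)
    return positions, comparisons


if __name__ == "__main__":
    for pat in ("aaaa", "abab", "abcd"):
        hits, count = naive_search_comparisons("abcdababaaaa", pat)
        print(pat, round(pattern_entropy(pat), 3), hits, count)
```

Counting comparisons instead of timing the run is what makes the measure independent of the platform: the same text and pattern always yield the same count.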


Author(s):  
Yangjun Chen

In computer engineering, a number of programming tasks involve a special problem, the so-called tree matching problem (Cole & Hariharan, 1997), as a crucial step. Examples include the design of interpreters for nonprocedural programming languages, automatic implementation of abstract data types, code optimization in compilers, symbolic computation, context searching in structure editors, and automatic theorem proving. Recently, it has been shown that this problem can be transformed in linear time to another problem, the so-called subset matching problem (Cole & Hariharan, 2002, 2003), which is to find all occurrences of a pattern string p of length m in a text string t of length n, where each pattern and text position is a set of characters drawn from some alphabet S. The pattern is said to occur at text position i if the set p[j] is a subset of the set t[i + j - 1] for all j (1 ≤ j ≤ m). This is a generalization of ordinary string matching and is of interest because an efficient algorithm for this problem implies an efficient solution to the tree matching problem. In addition, as shown in (Indyk, 1997), this problem can also be used to solve general string matching and counting matching (Muthukrishan, 1997; Muthukrishan & Palem, 1994), and enables us to design efficient algorithms for several geometric pattern matching problems. In this article, we propose a new algorithm for this problem, which needs only O(n + m) time when the size of S is small and O(n + m·n^0.5) time on average in the general case.
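The definition of subset matching given above can be made concrete with a brute-force sketch. This is not the algorithm proposed in the article; it is a direct O(n·m) check of the definition, with the hypothetical function name `subset_match` and Python sets standing in for the character sets at each position.

```python
def subset_match(text, pattern):
    """Return all 1-based text positions i where the pattern occurs.

    text and pattern are sequences of sets of characters. The pattern
    occurs at position i if pattern[j] is a subset of text[i + j - 1]
    for every j with 1 <= j <= m (1-based, as in the definition).
    """
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):  # i is the 0-based start position
        if all(pattern[j] <= text[i + j] for j in range(m)):
            hits.append(i + 1)  # report 1-based positions
    return hits


if __name__ == "__main__":
    t = [{"a", "b"}, {"b"}, {"a", "c"}, {"b", "c"}]
    p = [{"a"}, {"b"}]
    print(subset_match(t, p))
```

Ordinary exact string matching is the special case in which every set contains exactly one character.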


2001 ◽  
Vol 11 (05) ◽  
pp. 445-453 ◽  
Author(s):  
TATIANA TAMBOURATZIS

Three artificial neural networks (ANNs) are proposed for solving a variety of on- and off-line string matching problems. The ANN structure employed as the building block of these ANNs is derived from the harmony theory (HT) ANN, whereby the resulting string matching ANNs are characterized by fast match-mismatch decisions, low computational complexity, and activation values of the ANN output nodes that can be used as indicators of substitution, insertion (addition) and deletion spelling errors.


1992 ◽  
Vol 101 (2) ◽  
pp. 131-149 ◽  
Author(s):  
Kosaburo Hashiguchi ◽  
Kazuya Yamada

Author(s):  
Zhan Peng ◽  
Yuping Wang ◽  
Wei Yue

Multi-string matching (MSM) is a core technique that searches a text string for all occurrences of a set of string patterns. It is widely used in many applications. However, as the number of string patterns increases, most existing algorithms suffer from two issues: long matching time and high memory consumption. To address these issues, this paper proposes a fast matching engine for large-scale string matching problems. The engine consists of a filter module and a verification module. The filter module is based on several bitmaps that quickly filter out invalid positions in the text, while the verification module confirms, for each potential match position, whether a pattern truly occurs there. In particular, we design a compact data structure called the Adaptive Matching Tree (AMT) for the verification module, in which each tree node stores only some fragments of the patterns in the pattern set, and the inner structure of each node is chosen adaptively according to the features of the corresponding pattern fragments. This makes the engine both time and space efficient. The experiments indicate that our matching engine performs better than the compared algorithms, especially for large pattern sets.
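The filter-then-verify structure described above can be sketched as follows. This is only an illustration of the general idea under stated assumptions: it uses a single bitmap over hashed two-character pattern prefixes as the filter and a plain dictionary of prefix buckets as the verifier, not the paper's multiple bitmaps or its Adaptive Matching Tree. The names `build_engine` and `search` and the bitmap size are hypothetical, and all patterns are assumed to be non-empty.

```python
def build_engine(patterns, bitmap_bits=1 << 16):
    """Build a prefix bitmap (filter) and prefix buckets (verifier)."""
    # Window length: at most 2, and no longer than the shortest pattern.
    q = min(2, min(len(p) for p in patterns))
    bitmap = bytearray(bitmap_bits // 8)
    buckets = {}
    for p in patterns:
        prefix = p[:q]
        h = hash(prefix) % bitmap_bits
        bitmap[h >> 3] |= 1 << (h & 7)  # mark this prefix as possible
        buckets.setdefault(prefix, []).append(p)
    return q, bitmap, bitmap_bits, buckets


def search(text, engine):
    """Return (position, pattern) pairs for every pattern occurrence."""
    q, bitmap, bitmap_bits, buckets = engine
    hits = []
    for i in range(len(text) - q + 1):
        window = text[i:i + q]
        h = hash(window) % bitmap_bits
        if not bitmap[h >> 3] & (1 << (h & 7)):
            continue  # filter: no pattern can start at position i
        # Verification: compare only the patterns sharing this prefix.
        for p in buckets.get(window, ()):
            if text.startswith(p, i):
                hits.append((i, p))
    return hits


if __name__ == "__main__":
    engine = build_engine(["he", "she", "his", "hers"])
    print(search("ushers", engine))
```

The bitmap may let through a position that matches no pattern (a hash collision), which is why every surviving position still goes through verification.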


2012 ◽  
Vol 433-440 ◽  
pp. 4468-4474
Author(s):  
Qiang Zheng

The design of high-performance exact single-pattern string matching algorithms is the basis of all string matching problems. To overcome the low efficiency of pattern matching, this paper improves SBNDM2, one of the fastest exact single-pattern matching algorithms known for English text. The simplest form of the BNDM core loop is obtained, with only 5 instructions per character read, by amending the relationship between a position in the pattern and a bit in the bit mask. In addition, an out-of-bounds protection method is added to the algorithm to reduce the cost of bounds checking. Two algorithms, named S2BNDM and S2BNDM′, are presented. The experimental results indicate that both S2BNDM and S2BNDM′ are faster than SBNDM2 in all cases.
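For orientation, the sketch below shows the basic BNDM (Backward Nondeterministic DAWG Matching) algorithm on which SBNDM2 and the paper's variants build. It is the textbook form, not SBNDM2, S2BNDM, or S2BNDM′, and it omits their simplified core loop and bounds protection. The function name `bndm_search` is hypothetical, and the pattern is assumed to be non-empty.

```python
def bndm_search(text, pattern):
    """Basic BNDM: return 0-based start positions of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Bit mask per character: pattern position i maps to bit m - 1 - i.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))
    full = (1 << m) - 1
    top = 1 << (m - 1)

    hits = []
    pos = 0
    while pos <= n - m:
        j = m      # number of window characters not yet read
        last = m   # shift to apply after this window
        state = full
        while state != 0:
            # Read the window right to left.
            state &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if state & top:
                if j > 0:
                    last = j          # a pattern prefix starts at pos + j
                else:
                    hits.append(pos)  # whole window matched
            state = (state << 1) & full
        pos += last
    return hits


if __name__ == "__main__":
    print(bndm_search("cabab", "ab"))
```

In a language with fixed-width integers the pattern length is limited by the machine word size; Python integers are unbounded, so the explicit `& full` mask keeps the state to m bits.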

