Optimizing the cost matrix for approximate string matching using genetic algorithms

Quantifying Substitutability

10.26686/wgtn.17009885.v1 ◽

2021 ◽

Author(s):

David X. Wang

Keyword(s):

String Matching ◽

Evaluation Process ◽

Approximate String Matching ◽

Keyphrase Extraction ◽

Human Volunteers ◽

Matching Criteria ◽

The Cost ◽

Generic Design ◽

Matching Techniques ◽

Generic System

In this thesis, we tackle the problem of how keyphrase extraction systems can be evaluated to reveal their true efficacy. The aim is to develop a new semantically oriented approximate string matching criterion, one that is comparable to human judgements but without the cost and energy associated with manual evaluation. This matching criterion can also be adapted for any information retrieval (IR) system where the evaluation process involves comparing candidate strings (produced by the IR system) to a gold standard (created by humans). Our contributions are threefold. First, we define a new semantic relationship called substitutability (how suitable a phrase is when used in place of another) and then design a generic system that quantifies this relationship by exploiting the interlinking structure of external knowledge sources. Second, we develop two concrete substitutability systems based on our generic design: WordSub, which is backed by WordNet, and WikiSub, which is backed by Wikipedia. Third, we construct a dataset, with the help of human volunteers, that isolates the task of measuring substitutability. This dataset is then used to evaluate the performance of our substitutability systems, along with existing approximate string matching techniques, by comparing them using a set of agreement metrics. Our results clearly demonstrate that WordSub and WikiSub comfortably outperform current approaches to approximate string matching, including both lexical methods, such as R-precision, and semantically oriented techniques, such as METEOR. In fact, WikiSub's performance comes reasonably close to that of an average human volunteer when compared against the optimistic (best-case) interhuman agreement.

On Some Properties of Semi-rings

Mechanical Engineering and Computer Science ◽

10.24108/0318.0001379 ◽

2018 ◽

pp. 35-50

Author(s):

A. I. Belousov

Keyword(s):

Linear Equations ◽

Oriented Graph ◽

Theoretical Computer Science ◽

Alternative Methods ◽

Main Role ◽

Cost Matrix ◽

Sequential Elimination ◽

The Matrix ◽

Systems Of Linear Equations ◽

The Cost

The main objective of this paper is to prove a theorem stating that the method of successive elimination of unknowns, applied to systems of linear equations over semi-rings with iteration, yields the least solution of the system. The proof is based on a graph interpretation of the system and establishes a relationship between the method of sequential elimination of unknowns and the method of calculating the cost matrix of a labeled oriented graph by sequentially calculating cost matrices over paths of increasing rank. In preparation for the proof of the main theorem, we consider the following important properties of closed semi-rings and semi-rings with iteration. We prove the properties of an infinite sum (the supremum of a sequence in the natural ordering of an idempotent semi-ring). In particular, we give a proof of the continuity of the addition operation, which underlies the well-known algorithm for solving a linear equation in a semi-ring with iteration, that is much simpler than those in the known sources. Next, we prove a theorem on the closure of semi-rings with iteration with respect to solutions of systems of linear equations. We also give a detailed proof of the theorem characterizing the cost matrix of an oriented graph labeled over a semi-ring as the iteration of the matrix of arc labels. The concept of an automaton over a semi-ring is introduced, which, unlike the usual labeled oriented graph, has a distinguished "final" vertex with zero out-degree. All of the foregoing provides a basis for the proof of the main theorem, in which the concept of an automaton over a semi-ring plays the main role. The article's results are of scientific and methodological value. The proposed proof of the main theorem relates two alternative methods for calculating the cost matrix of a labeled oriented graph, and the proposed proofs of already known statements can be useful in presenting the elements of the theory of semi-rings, which plays an important role in the mathematical training of students majoring in software technologies and theoretical computer science.
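
As a rough illustration of the cost matrix described in this abstract, the sketch below computes the iteration (closure) of an arc-label matrix by allowing one intermediate vertex at a time. It is not taken from the paper: it assumes the (min, +) semiring with non-negative labels, where the iteration of every element equals the semiring unit 0, and the name `cost_matrix` is hypothetical.

```python
import math

INF = math.inf  # semiring zero of (min, +): "no path"


def cost_matrix(labels):
    """Closure of the arc-label matrix over the (min, +) semiring.

    labels[i][j] is the label of arc i -> j, or INF if the arc is absent.
    Assumption: all labels are non-negative, so the iteration x* of any
    element is the semiring unit 0 and can be left out of the update.
    """
    n = len(labels)
    cost = [row[:] for row in labels]  # do not modify the input
    for k in range(n):  # admit vertex k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                via_k = cost[i][k] + cost[k][j]  # semiring product
                if via_k < cost[i][j]:           # semiring sum (min)
                    cost[i][j] = via_k
    for i in range(n):
        cost[i][i] = 0  # empty path: add the identity matrix
    return cost


arcs = [
    [INF, 1, 4],
    [INF, INF, 2],
    [INF, INF, INF],
]
print(cost_matrix(arcs))
```

Under these assumptions the result is the familiar shortest-path matrix; for a general semiring with iteration the update would also have to multiply by the iteration of `cost[k][k]`.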

Incremental maintenance of length normalized indexes for approximate string matching

Proceedings of the 35th SIGMOD international conference on Management of data - SIGMOD '09 ◽

10.1145/1559845.1559891 ◽

2009 ◽

Cited By ~ 23

Author(s):

Marios Hadjieleftheriou ◽

Nick Koudas ◽

Divesh Srivastava

Keyword(s):

String Matching ◽

Approximate String Matching ◽

Incremental Maintenance

Optimal implementations of the approximate string matching and the approximate discrete signal matching on the memory machine models

International Journal of Parallel Emergent and Distributed Systems ◽

10.1080/17445760.2013.773330 ◽

2013 ◽

Vol 29 (2) ◽

pp. 104-118 ◽

Cited By ~ 1

Author(s):

Koji Nakano

Keyword(s):

String Matching ◽

Approximate String Matching ◽

Discrete Signal

Sublinear approximate string matching and biological applications

Algorithmica ◽

10.1007/bf01185431 ◽

1994 ◽

Vol 12 (4-5) ◽

pp. 327-344 ◽

Cited By ~ 87

Author(s):

W. I. Chang ◽

E. L. Lawler

Keyword(s):

String Matching ◽

Approximate String Matching ◽

Biological Applications

Average-optimal single and multiple approximate string matching

Journal of Experimental Algorithmics ◽

10.1145/1005813.1041513 ◽

2004 ◽

Vol 9 ◽

Cited By ~ 17

Author(s):

Kimmo Fredriksson ◽

Gonzalo Navarro

Keyword(s):

String Matching ◽

Approximate String Matching

Searching Best Strategies Algorithm for the No Balance Assignment Problem

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.50-51.386 ◽

2011 ◽

Vol 50-51 ◽

pp. 386-390

Author(s):

Mao Yan Fang ◽

Min Le Wang ◽

Yi Ming Bi

Keyword(s):

Assignment Problem ◽

Cost Matrix ◽

Classical Algorithm ◽

The Cost

The No Balance Assignment Problem (NBAP) is currently solved mainly by converting it into a Balance Assignment Problem (BAP) and applying a classical algorithm. This paper proposes the Searching Best Strategies Algorithm (SBSA), which solves the NBAP without converting it into a BAP. SBSA solves the NBAP by searching for the best answer in the cost matrix. The algorithm's theory is simple, and it is easy to operate. The results of the research indicate that the algorithm can deal not only with the NBAP but also with the BAP and other problems such as the translation problem.
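
To make concrete what solving an unbalanced assignment problem directly on its rectangular cost matrix can look like, here is a minimal sketch. It is not SBSA, whose steps the abstract does not give; it is a plain exhaustive search, practical only for small matrices, and the name `assign_unbalanced` is hypothetical.

```python
from itertools import permutations


def assign_unbalanced(cost):
    """Exhaustively search a rectangular cost matrix with rows <= columns.

    Each row (agent) receives a distinct column (task); surplus columns
    stay unassigned, so no dummy rows are added to balance the matrix.
    Returns (minimum total cost, chosen column for each row).
    """
    rows, cols = len(cost), len(cost[0])
    if rows > cols:
        raise ValueError("expected rows <= columns; transpose the matrix first")
    best_total, best_choice = None, None
    for choice in permutations(range(cols), rows):
        total = sum(cost[r][c] for r, c in enumerate(choice))
        if best_total is None or total < best_total:
            best_total, best_choice = total, choice
    return best_total, list(best_choice)


# Two agents, three tasks: one task is left unassigned.
print(assign_unbalanced([[4, 1, 3], [2, 6, 5]]))
```

The search visits every ordered selection of columns, so its running time grows factorially; it is meant only to show the problem statement, not an efficient method.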
