Quantifying Substitutability

2021 ◽  
Author(s):  
David X. Wang

<p>This thesis tackles the problem of how keyphrase extraction systems can be evaluated to reveal their true efficacy. The aim is to develop a new semantically oriented approximate string matching criterion, one that is comparable to human judgement but without the cost and effort of manual evaluation. This matching criterion can also be adapted to any information retrieval (IR) system whose evaluation involves comparing candidate strings (produced by the IR system) to a gold standard (created by humans). Our contributions are threefold. First, we define a new semantic relationship called substitutability – how suitable a phrase is when used in place of another – and design a generic system that quantifies this relationship by exploiting the interlinking structure of external knowledge sources. Second, we develop two concrete substitutability systems based on our generic design: WordSub, which is backed by WordNet, and WikiSub, which is backed by Wikipedia. Third, we construct a dataset, with the help of human volunteers, that isolates the task of measuring substitutability. This dataset is then used to evaluate our substitutability systems alongside existing approximate string matching techniques, comparing them with a set of agreement metrics. Our results show that WordSub and WikiSub comfortably outperform current approaches to approximate string matching, including both lexical methods, such as R-precision, and semantically oriented techniques, such as METEOR. In fact, WikiSub’s performance comes reasonably close to that of an average human volunteer when compared against the optimistic (best-case) inter-human agreement.</p>
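
The evaluation setting described in the abstract, scoring ranked candidate keyphrases against a human gold standard, can be illustrated with a minimal Python sketch. It assumes the standard IR definition of R-precision and a pluggable matching function; the names `exact_match`, `r_precision`, and `matches` are hypothetical and do not come from the thesis, and the interfaces of WordSub and WikiSub are not described here, so they are not reproduced.

```python
from typing import Callable, Sequence


def exact_match(candidate: str, gold: str) -> bool:
    """Baseline criterion (assumption): case-insensitive exact equality."""
    return candidate.strip().lower() == gold.strip().lower()


def r_precision(
    candidates: Sequence[str],
    gold: Sequence[str],
    matches: Callable[[str, str], bool] = exact_match,
) -> float:
    """Share of the top-R ranked candidates that match a gold phrase,
    where R is the number of gold phrases. Each gold phrase can be
    matched at most once, so duplicates are not double-counted."""
    r = len(gold)
    if r == 0:
        return 0.0
    unmatched = list(gold)
    hits = 0
    for candidate in candidates[:r]:
        for phrase in unmatched:
            if matches(candidate, phrase):
                unmatched.remove(phrase)  # safe: we break right after
                hits += 1
                break
    return hits / r


# Illustrative data, invented for this sketch.
gold = ["neural network", "keyphrase extraction", "evaluation"]
ranked = ["neural net", "keyphrase extraction", "evaluation", "dataset"]
score = r_precision(ranked, gold)
```

With exact matching, "neural net" earns no credit against "neural network", which is the kind of near miss a semantic criterion is meant to handle. A more lenient scorer could be passed in through the `matches` parameter without changing the metric itself.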


2014 ◽  
Vol 513-517 ◽  
pp. 1017-1020
Author(s):  
Bing Liu ◽  
Dan Han ◽  
Shuang Zhang

String matching is one of the most typical problems in computer science. Previous studies focused mainly on exact string matching. However, with the rapid development of computers and the Internet and the continual emergence of new problems, researching and designing efficient approximate string matching algorithms has gained considerable theoretical and practical importance. Approximate string matching, also called string matching that allows errors, aims to find occurrences of a pattern string in a text or database while allowing up to k differences between the pattern and its occurrences. Although a number of algorithms have been proposed for this problem, few studies focus on large alphabets; most work targets small or medium-sized alphabets. For large alphabets, especially Chinese characters and Asian phonetics, efficient algorithms are scarce. For these reasons, this paper focuses on approximate matching of Chinese strings based on the pinyin input method.
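
The k-differences problem stated above can be sketched with the textbook dynamic-programming formulation, shown here in Python. This is not the pinyin-based algorithm the paper proposes, which the abstract does not detail; the function name `approximate_occurrences` and the example strings are assumptions made for illustration.

```python
from typing import List


def approximate_occurrences(pattern: str, text: str, k: int) -> List[int]:
    """Return the 1-based end positions in `text` where some substring
    matches `pattern` with at most k edits (insertion, deletion,
    substitution). Runs in O(len(pattern) * len(text)) time."""
    m = len(pattern)
    # column[i] holds the minimum edit distance between pattern[:i]
    # and any substring of text ending at the current position.
    column = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = column[0]  # value from the previous column, row i-1
        column[0] = 0          # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(
                prev_diag + cost,   # substitute or match
                column[i] + 1,      # extra character in the text
                column[i - 1] + 1,  # pattern character left unmatched
            )
            prev_diag = column[i]
            column[i] = new
        if column[m] <= k:
            ends.append(j)
    return ends


# Romanized syllables used purely as sample strings.
exact = approximate_occurrences("abc", "xxabcxx", 0)
one_error = approximate_occurrences("zhong", "zhang", 1)
```

Setting the top row to zero is what distinguishes searching inside a text from computing the edit distance between two whole strings. The cost grows with the pattern and text lengths rather than with the alphabet size, so this baseline works for any alphabet but does not exploit its structure.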


1998 ◽  
Vol 31 (4) ◽  
pp. 431-440 ◽  
Author(s):  
Marc Parizeau ◽  
Nadia Ghazzali ◽  
Jean-François Hébert

2020 ◽  
Vol 20 (1) ◽  
Author(s):  
Nathaniel Bell ◽  
Rebecca Wilkerson ◽  
Kathy Mayfield-Smith ◽  
Ana Lòpez-De Fede

Background: Patient-Centered Medical Home (PCMH) adoption is an important strategy to help improve primary care quality within Health Resources and Services Administration (HRSA) community health centers (CHCs), but evidence of its effect thus far remains mixed. A limitation of previous evaluations has been the inability to account for the proportion of CHC delivery sites that are designated medical homes. Methods: Retrospective cross-sectional study using the HRSA Uniform Data System (UDS) and certification files from the National Committee for Quality Assurance (NCQA) and the Joint Commission (JC). Datasets were linked through geocoding and an approximate string-matching algorithm. Using beta logistic regression, predicted probability scores were regressed onto 11 clinical performance measures at 10% increments in site-level designation. Results: The geocoding and approximate string-matching algorithm identified 2615 of the 6851 (41.8%) delivery sites included in the analyses as having been designated through the NCQA and/or JC. In total, 74.7% (n = 777) of the 1039 CHCs that met the inclusion criteria for the analysis managed at least one NCQA- and/or JC-designated site. A proportional increase in site-level designation showed a positive association with adherence scores for the majority of indicators, but primarily among CHCs that designated at least 50% of their delivery sites. Once this threshold was achieved, there was a stepwise percentage point increase in adherence scores, ranging from 1.9 to 11.8% improvement, depending on the measure. Conclusion: Geocoding and approximate string-matching techniques offer a more reliable and nuanced approach for monitoring the association between site-level PCMH designation and clinical performance within HRSA’s CHC delivery sites. Our findings suggest that transformation does in fact matter, but that its effect may not appear until half of the delivery sites become designated. There also appears to be a continued stepwise increase in adherence scores once this threshold is achieved.
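
The name-matching half of such a linkage step might look like the following Python sketch. The abstract does not say which algorithm or threshold the authors used, so the use of `difflib.SequenceMatcher`, the 0.85 cutoff, the helper names, and the site names below are all assumptions made for illustration. Geocoding is omitted.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def normalize(name: str) -> str:
    """Lowercase, drop commas and periods, and collapse whitespace."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def best_match(
    name: str, candidates: Iterable[str], threshold: float = 0.85
) -> Tuple[Optional[str], float]:
    """Return the most similar candidate and its similarity score, or
    (None, score) when nothing reaches the (assumed) threshold."""
    best, best_score = None, 0.0
    target = normalize(name)
    for candidate in candidates:
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


# Invented site names; not taken from the UDS, NCQA, or JC files.
certified = ["Riverside Community Health Center", "Lakeview Family Clinic"]
linked, linked_score = best_match("Riverside Community Health Ctr.", certified)
unlinked, _ = best_match("Completely Different Name", certified)
```

In practice a threshold like this would be tuned against a hand-checked sample, and a geographic check would be used to rule out similarly named sites in different locations.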


Algorithmica ◽  
1994 ◽  
Vol 12 (4-5) ◽  
pp. 327-344 ◽  
Author(s):  
W. I. Chang ◽  
E. L. Lawler
