approximate string matching
Recently Published Documents


TOTAL DOCUMENTS

294
(FIVE YEARS 21)

H-INDEX

29
(FIVE YEARS 2)

2021 ◽  
Author(s):  
David X. Wang

<p>In this thesis, we tackle the problem of how keyphrase extraction systems can be evaluated to reveal their true efficacy. The aim is to develop a new semantically oriented approximate string matching criterion, one that is comparable to human judgements but without the cost and effort of manual evaluation. This matching criterion can also be adapted for any information retrieval (IR) system whose evaluation involves comparing candidate strings (produced by the IR system) to a gold standard (created by humans). Our contributions are threefold. First, we define a new semantic relationship called substitutability, that is, how suitable a phrase is when used in place of another, and then design a generic system that quantifies this relationship by exploiting the interlinking structure of external knowledge sources. Second, we develop two concrete substitutability systems based on our generic design: WordSub, which is backed by WordNet, and WikiSub, which is backed by Wikipedia. Third, we construct a dataset, with the help of human volunteers, that isolates the task of measuring substitutability. This dataset is then used to evaluate our substitutability systems, along with existing approximate string matching techniques, by comparing them using a set of agreement metrics. Our results demonstrate that WordSub and WikiSub comfortably outperform current approaches to approximate string matching, including both lexical methods, such as R-precision, and semantically oriented techniques, such as METEOR. In fact, WikiSub's performance comes reasonably close to that of an average human volunteer when compared against the optimistic (best-case) inter-human agreement.</p>


2021 ◽  
Vol 9 (2) ◽  
pp. 168-175
Author(s):  
Sebastianus A S Mola ◽  
Meiton Boru ◽  
Emerensye Sofia Yublina Pandie

Written communication on social media, which emphasizes the speed of information dissemination, often exhibits non-standard language use at the sentence, clause, phrase, and word levels. As a data source, social media with this phenomenon poses a challenge for information extraction. Normalizing non-standard language into standard language begins with word normalization, in which a non-standard word (NSW) is normalized to its standard form (standard word, SW). Normalization based on edit distance is limited by its static weighting of the mismatch, match, and gap values. When computing the mismatch value, static weighting cannot assign different weights to errors caused by pressing the wrong key on the keyboard, especially an adjacent key. Because of this limitation of edit distance weighting, this study proposes a dynamic weighting method for the mismatch weight. The result of this study is a new dynamic weighting method, based on keyboard key positions, that can be used to normalize NSWs using approximate string matching.
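
The idea of a keyboard-aware mismatch weight can be illustrated with a minimal sketch. The abstract does not give the paper's actual weights or key-distance rule, so everything below is an assumption made for illustration: the QWERTY row offsets, the adjacency threshold of 1.5 key widths, the reduced mismatch cost of 0.5 for neighbouring keys, and the function names `mismatch_cost` and `weighted_edit_distance`.

```python
import math

# QWERTY letter rows. The horizontal stagger offsets are an illustrative
# assumption, not values taken from the paper.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
OFFSETS = [0.0, 0.25, 0.75]

# Map each letter to an approximate (row, column) position on the keyboard.
KEY_POS = {
    ch: (r, c + OFFSETS[r])
    for r, row in enumerate(ROWS)
    for c, ch in enumerate(row)
}


def mismatch_cost(a: str, b: str) -> float:
    """Dynamic mismatch weight: cheaper when the two keys are neighbours."""
    if a == b:
        return 0.0
    if a not in KEY_POS or b not in KEY_POS:
        return 1.0  # fall back to the static weight for unknown characters
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    distance = math.hypot(r1 - r2, c1 - c2)
    # Assumed rule: keys closer than 1.5 key widths count as adjacent.
    return 0.5 if distance < 1.5 else 1.0


def weighted_edit_distance(s: str, t: str, gap: float = 1.0) -> float:
    """Edit distance with a static gap cost and a dynamic mismatch cost."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * gap]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + gap,                      # deletion from s
                cur[j - 1] + gap,                   # insertion into s
                prev[j - 1] + mismatch_cost(a, b),  # match or substitution
            ))
        prev = cur
    return prev[-1]
```

Under these assumed weights, a slip onto a neighbouring key (`m` for `n`) is penalized less than a substitution between distant keys (`p` for `n`), so the standard word that differs only by an adjacent-key typo ranks first among normalization candidates.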


2021 ◽  
Author(s):  
David Castells-Rufas ◽  
Santiago Marco-Sola ◽  
Quim Aguado-Puig ◽  
Antonio Espinosa-Morales ◽  
Juan Carlos Moure ◽  
...  

2021 ◽  
Vol 11 (2) ◽  
pp. 63-70
Author(s):  
Nadhia Nurin Syarafina ◽  
Jozua Ferjanus Palandi ◽  

Good scriptwriting and reporting require a high level of accuracy, but not every author is equally careful. Careless typing introduces misspelled words into sentences, which makes those words non-standard or, worse, meaningless. A recommendation application addresses this by suggesting correct spellings when a typing error occurs, which reduces the writer's error rate. One method for correcting spelling is approximate string matching, which searches for strings that are close to, rather than identical to, the input. The Levenshtein distance algorithm is one such method. In this application, the text first passes through a preprocessing stage, after which incorrectly written words are corrected using the Levenshtein distance algorithm. The application was tested on ten texts of 100 words, ten texts of 100 to 250 words, and ten texts of 250 to 500 words. The average accuracy for these three groups was 95%, 94%, and 90%, respectively.
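
The spelling-recommendation step described above can be sketched as follows. This is a minimal illustration rather than the authors' application: the function names `levenshtein` and `suggest`, the `max_distance` and `limit` parameters, and the tiny word list in the usage note are all assumptions.

```python
def levenshtein(s: str, t: str) -> int:
    """Classic Levenshtein distance, keeping only two rows of the table."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (a != b),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]


def suggest(word: str, dictionary: list[str],
            max_distance: int = 2, limit: int = 3) -> list[str]:
    """Return up to `limit` dictionary words within `max_distance` edits,
    closest first (ties broken alphabetically)."""
    scored = ((levenshtein(word, entry), entry) for entry in dictionary)
    candidates = sorted((d, w) for d, w in scored if d <= max_distance)
    return [w for _, w in candidates[:limit]]
```

For example, with a hypothetical dictionary `["accuracy", "accurate", "actually"]`, calling `suggest("acuracy", ...)` keeps only `accuracy`, which is one insertion away, while the other two entries exceed the two-edit threshold.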


2020 ◽  
Vol 20 (1) ◽  
Author(s):  
Nathaniel Bell ◽  
Rebecca Wilkerson ◽  
Kathy Mayfield-Smith ◽  
Ana Lòpez-De Fede

Abstract Background: Patient-Centered Medical Home (PCMH) adoption is an important strategy to help improve primary care quality within Health Resources and Services Administration (HRSA) community health centers (CHC), but evidence of its effect thus far remains mixed. A limitation of previous evaluations has been the inability to account for the proportion of CHC delivery sites that are designated medical homes. Methods: Retrospective cross-sectional study using HRSA Uniform Data System (UDS) and certification files from the National Committee for Quality Assurance (NCQA) and the Joint Commission (JC). Datasets were linked through geocoding and an approximate string-matching algorithm. Predicted probability scores were regressed onto 11 clinical performance measures using 10% increments in site-level designation using beta logistic regression. Results: The geocoding and approximate string-matching algorithm identified 2615 of the 6851 (41.8%) delivery sites included in the analyses as having been designated through the NCQA and/or JC. In total, 74.7% (n = 777) of the 1039 CHCs that met the inclusion criteria for the analysis managed at least one NCQA- and/or JC-designated site. A proportional increase in site-level designation showed a positive association with adherence scores for the majority of indicators, but primarily among CHCs that designated at least 50% of their delivery sites. Once this threshold was achieved, there was a stepwise percentage point increase in adherence scores, ranging from 1.9 to 11.8% improvement, depending on the measure. Conclusion: Geocoding and approximate string-matching techniques offer a more reliable and nuanced approach for monitoring the association between site-level PCMH designation and clinical performance within HRSA's CHC delivery sites. Our findings suggest that transformation does in fact matter, but that it may not appear until half of the delivery sites become designated. There also appears to be a continued stepwise increase in adherence scores once this threshold is achieved.
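
Linking two datasets by geocoding and approximate string matching, as the Methods describe, might look like the sketch below. The abstract does not specify the algorithm, distance cutoff, or similarity measure the study used, so this is one plausible arrangement under stated assumptions: candidates are first restricted to records geocoded within `max_km` of the site, and names are then compared with `difflib.SequenceMatcher` from the Python standard library. The record fields (`id`, `name`, `lat`, `lon`), the thresholds, and the function names are hypothetical.

```python
import math
from difflib import SequenceMatcher


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two geocoded points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def name_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two case-normalized site names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def link_sites(sites: list[dict], certified: list[dict],
               max_km: float = 0.5, min_similarity: float = 0.8) -> dict:
    """Map each site id to the best-matching certified record id, if any.

    A certified record is a candidate only if it is geocoded within
    `max_km` of the site (assumed cutoff) and its name similarity is at
    least `min_similarity` (assumed threshold).
    """
    links = {}
    for site in sites:
        best = None
        for cert in certified:
            if haversine_km(site["lat"], site["lon"],
                            cert["lat"], cert["lon"]) > max_km:
                continue  # too far apart to be the same delivery site
            score = name_similarity(site["name"], cert["name"])
            if score >= min_similarity and (best is None or score > best[0]):
                best = (score, cert["id"])
        if best is not None:
            links[site["id"]] = best[1]
    return links
```

Filtering by location before comparing names keeps the number of string comparisons small and prevents identically named sites in different cities from being linked. In practice, borderline matches would still need manual review.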


2020 ◽  
Author(s):  
Nathaniel Bell ◽  
Rebecca Wilkerson ◽  
Kathy Mayfield-Smith ◽  
Ana Lòpez-De Fede

Abstract Background: Patient-Centered Medical Home (PCMH) adoption is an important strategy to help improve primary care quality within Health Resources and Services Administration (HRSA) community health centers (CHC), but evidence of its effect thus far remains mixed. A limitation of previous evaluations has been the inability to account for the proportion of CHC delivery sites that are designated medical homes. Methods: Retrospective cross-sectional study using HRSA Uniform Data System (UDS) and certification files from the National Committee for Quality Assurance (NCQA) and the Joint Commission (JC). Datasets were linked through geocoding and an approximate string-matching algorithm. Predicted probability scores were regressed onto 11 clinical performance measures using 10% increments in site-level designation using beta logistic regression. Results: The geocoding and approximate string-matching algorithm identified 2,615 of the 6,851 (41.8%) delivery sites included in the analyses as having been designated through the NCQA and/or JC. In total, 74.7% (n=777) of the 1,039 CHCs that met the inclusion criteria for the analysis managed at least one NCQA- and/or JC-designated site. A proportional increase in site-level designation showed a positive association with adherence scores for the majority of indicators, but primarily among CHCs that designated at least 50% of their delivery sites. Once this threshold was achieved, there was a stepwise percentage point increase in adherence scores, ranging from 1.9% to 11.8% improvement, depending on the measure. Conclusion: Geocoding and approximate string-matching techniques offer a reliable approach for monitoring the association between site-level PCMH designation and clinical performance within HRSA's CHC delivery sites. The model also offers preliminary evidence of a stepwise increase in quality metrics once half of a CHC's delivery sites become designated medical homes.


2020 ◽  
Author(s):  
Nathaniel Bell ◽  
Rebecca Wilkerson ◽  
Kathy Mayfield-Smith ◽  
Ana Lòpez-De Fede

Abstract Background: Patient-Centered Medical Home (PCMH) adoption is an important strategy to help improve primary care quality within Health Resources and Services Administration (HRSA) Community Health Centers (CHC), but evidence of its effect thus far remains mixed. A limitation of previous evaluations has been the inability to account for the proportion of CHC delivery sites that are designated medical homes. Methods: Retrospective cross-sectional study using HRSA Uniform Data System (UDS) and certification files from the National Committee for Quality Assurance (NCQA) and the Joint Commission (JC). Datasets were linked through geocoding and an approximate string matching algorithm. Predicted probability scores were regressed onto 11 clinical performance measures using 10% increments in site-level designation using beta logistic regression. Results: The geocoding and approximate string matching algorithm identified 2,615 of the 6,851 (41.8%) delivery sites included in the analyses as having been designated through the NCQA and/or JC. In total, 74.7% (n=777) of the 1,039 CHCs that met the inclusion criteria for the analysis managed at least 1 NCQA- and/or JC-designated site. A proportional increase in site-level designation showed a positive association with adherence scores for the majority of indicators, but primarily among CHCs that designated at least 50% of their delivery sites. Once this threshold was achieved, there was a stepwise percentage point increase in adherence scores, ranging from 1.9% to 11.8% improvement, depending on the measure. Conclusion: Geocoding and approximate string matching techniques offer a more nuanced approach for addressing ongoing limitations in HRSA's PCMH evaluations. The study methodology raises new questions as to whether there is a threshold effect when measuring the association between designation and care quality. The model also offers preliminary evidence of a stepwise increase in quality metrics once half of a CHC's delivery sites become designated medical homes.


2020 ◽  
Author(s):  
Nathaniel Bell ◽  
Rebecca Wilkerson ◽  
Kathy Mayfield-Smith ◽  
Ana Lòpez-De Fede

Abstract Background: Patient-Centered Medical Home (PCMH) adoption has been proposed as an important strategy to help improve primary care quality within Health Resources and Services Administration (HRSA) Community Health Centers (CHC), but evidence of its effect thus far remains mixed. A limitation of previous evaluations has been the inability to account for the proportion of CHC delivery sites that are designated sites. Methods: Retrospective cross-sectional study using HRSA Uniform Data System (UDS) and certification files from the National Committee for Quality Assurance (NCQA) and the Joint Commission (JC). All datasets were linked through geocoding and approximate string matching. Proportional implementation was assessed in 10% increments and regressed onto 14 clinical performance measures. The analysis included 1,281 CHCs and 8,022 delivery sites within the lower 48 states and the District of Columbia. Results: The geocoding and approximate string matching algorithm identified 2,615 of the 6,851 (41.8%) delivery sites included in the analyses as having been designated through the NCQA and/or JC. In total, 74.7% (n=777) of the 1,039 CHCs included in the analysis after removing false positive/negative matches managed at least 1 NCQA- and/or JC-designated site. There was no stepwise improvement in clinical quality across all 14 indicators as the proportion of designated delivery sites increased. For numerous indicators, however, site-level designation rates of at least 90% were associated with better indicator adherence. Conclusion: Geocoding and approximate string matching offer a more accurate approach to monitoring the impact of PCMH transformation on meeting quality performance targets. The lack of a consistent stepwise association between increased site-level designation and clinical quality underscores the need for additional risk-adjustment criteria within annual quality performance reporting in order to assess whether PCMH interventions are improving patient care.

