string matching
Recently Published Documents


TOTAL DOCUMENTS

1272
(FIVE YEARS 159)

H-INDEX

51
(FIVE YEARS 3)

Author(s):  
Mohd Kamir Yusof ◽  
Wan Mohd Amir Fazamin Wan Hamzah ◽  
Nur Shuhada Md Rusli

The coronavirus COVID-19 is affecting 196 countries and territories around the world, and the number of deaths keeps increasing each day. According to the World Health Organization (WHO), the number of COVID-19 infections is rising daily and has now reached 570,000. WHO prefers to conduct COVID-19 screening tests via online systems. A suitable approach, especially string matching based on symptoms, is required to produce fast and accurate results during the retrieval process. Four recent approaches have been applied to string matching: character-based algorithms, hashing algorithms, suffix automaton algorithms and hybrid algorithms. Meanwhile, extensible markup language (XML), JavaScript object notation (JSON), asynchronous JavaScript and XML (AJAX) and jQuery technologies are widely used for data transmission, data storage and data retrieval. This paper proposes combining the hybrid algorithm with JSON and jQuery in order to produce fast and accurate results during the COVID-19 screening process. Several experiments were conducted to compare performance in terms of execution time and memory usage across five different collections of datasets. Based on the experiments, the results show that the hybrid approach performs better than JSON and jQuery alone. Online COVID-19 screening will hopefully reduce the number of infections and deaths caused by COVID-19.
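The abstract's core idea of screening by matching patient symptoms against stored records can be sketched as follows. This is a minimal illustration, not the paper's implementation; the record data, field names, and matching rule are all hypothetical, and the stored records use JSON as the abstract proposes.

```python
import json

# Hypothetical symptom records, stored as JSON as in the proposed system.
records_json = json.dumps([
    {"id": 1, "symptoms": ["fever", "dry cough", "fatigue"]},
    {"id": 2, "symptoms": ["headache", "sore throat"]},
])

def screen(records_json, query_symptoms):
    """Return the ids of records sharing at least one symptom with the query."""
    records = json.loads(records_json)
    query = {s.lower() for s in query_symptoms}
    return [r["id"] for r in records
            if query & {s.lower() for s in r["symptoms"]}]

matches = screen(records_json, ["Fever", "chills"])
```

A real system would combine this retrieval step with the fuzzy matching algorithms the paper compares, so that misspelled symptom names still match.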


2021 ◽  
Author(s):  
◽  
David X. Wang

<p>In this thesis, we will tackle the problem of how keyphrase extraction systems can be evaluated to reveal their true efficacy. The aim is to develop a new semantically-oriented approximate string matching criterion, one that is comparable to human judgements but without the cost and energy associated with manual evaluation. This matching criterion can also be adapted for any information retrieval (IR) system where the evaluation process involves comparing candidate strings (produced by the IR system) to a gold standard (created by humans). Our contributions are threefold. First, we define a new semantic relationship called substitutability – how suitable a phrase is when used in place of another – and then design a generic system which measures this relationship by exploiting the interlinking structure of external knowledge sources. Second, we develop two concrete substitutability systems based on our generic design: WordSub, which is backed by WordNet; and WikiSub, which is backed by Wikipedia. Third, we construct a dataset, with the help of human volunteers, that isolates the task of measuring substitutability. This dataset is then used to evaluate the performance of our substitutability systems, along with existing approximate string matching techniques, by comparing them using a set of agreement metrics. Our results clearly demonstrate that WordSub and WikiSub comfortably outperform current approaches to approximate string matching, including both lexically based methods, such as R-precision, and semantically-oriented techniques, such as METEOR. In fact, WikiSub's performance comes reasonably close to that of an average human volunteer when compared against the optimistic (best-case) inter-human agreement.</p>
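The lexical baselines that WordSub and WikiSub are compared against reduce to surface-string similarity between candidate and gold keyphrases. A minimal sketch of such a baseline, using Python's standard-library `difflib` (the phrases and threshold are illustrative, not from the thesis):

```python
from difflib import SequenceMatcher

# Hypothetical candidate and gold keyphrases; the thesis's WordSub/WikiSub
# instead consult external knowledge sources (WordNet, Wikipedia).
gold = ["information retrieval", "string matching"]
candidates = ["information retreival", "graph theory"]

def lexical_match(candidate, gold_phrases, threshold=0.9):
    """Approximate lexical matching: true if any gold phrase is close enough."""
    return any(SequenceMatcher(None, candidate, g).ratio() >= threshold
               for g in gold_phrases)

hits = [c for c in candidates if lexical_match(c, gold)]
```

A purely lexical criterion like this catches the misspelling but would miss a semantically equivalent phrase such as "document search", which is exactly the gap substitutability is meant to close.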


2021 ◽  
Vol 14 (11) ◽  
pp. 6711-6740
Author(s):  
Ranee Joshi ◽  
Kavitha Madaiah ◽  
Mark Jessell ◽  
Mark Lindsay ◽  
Guillaume Pirot

Abstract. A huge amount of legacy drilling data is available in geological surveys but cannot be used directly, as it is compiled and recorded in unstructured textual form and in different formats depending on the database structure, company, logging geologist, investigation method, investigated materials and/or drilling campaign. These data are subjective and plagued by uncertainty, as they are likely to have been logged by tens to hundreds of geologists, all of whom have their own personal biases. dh2loop (https://github.com/Loop3D/dh2loop, last access: 30 September 2021) is an open-source Python library for extracting and standardizing geologic drill hole data and exporting them into readily importable interval tables (collar, survey, lithology). In this contribution, we extract, process and classify lithological logs from the Geological Survey of Western Australia (GSWA) Mineral Exploration Reports (WAMEX) database in the Yalgoo–Singleton greenstone belt (YSGB) region. The contribution also addresses the subjective nature and variability of the nomenclature of lithological descriptions within and across different drilling campaigns by using thesauri and fuzzy string matching. For this case study, 86 % of the extracted lithology data are successfully matched to lithologies in the thesauri. Since this process can be tedious, we also tested string matching on the free-text comments, which resulted in a matching rate of 16 % (7870 successfully matched records out of 47 823 records). The standardized lithological data are then classified into multi-level groupings that can be used to systematically upscale and downscale drill hole data inputs for multiscale 3D geological modelling. dh2loop thus formats legacy data, bridging the gap between the utilization of legacy drill hole data and the drill hole analysis functionalities available in existing Python libraries (lasio, welly, striplog).
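The thesaurus-plus-fuzzy-matching step described above can be illustrated with a short sketch. The thesaurus entries, raw log term, and cutoff here are hypothetical; dh2loop's actual thesauri and matcher are far more elaborate.

```python
from difflib import get_close_matches

# Hypothetical slice of a lithology thesaurus.
thesaurus = ["granite", "basalt", "dolerite", "sandstone"]

def standardize(raw_term, vocabulary, cutoff=0.8):
    """Map a raw lithology term to its closest thesaurus entry, if any."""
    hits = get_close_matches(raw_term.lower(), vocabulary, n=1, cutoff=cutoff)
    return hits[0] if hits else None

result = standardize("Granitte", thesaurus)  # misspelled field-log entry
```

Terms that fall below the cutoff return no match and would be routed to manual curation, which mirrors how unmatched records are handled in a standardization pipeline.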


2021 ◽  
Vol 1 (2) ◽  
pp. 87-95
Author(s):  
Nur Aini Rakhmawati ◽  
Miftahul Jannah

Open Food Facts provides a database of food products, including product names, compositions, and additives, where anyone can contribute new data or reuse existing data. The Open Food Facts data are noisy and need to be processed before being stored in our system. To reduce redundancy in the food ingredients data, we measure the similarity of ingredients using two measures: conceptual similarity and textual similarity. Conceptual similarity measures the similarity between two entries by word meaning (synonymy), while textual similarity is based on fuzzy string matching, namely Levenshtein distance, Jaro-Winkler distance, and Jaccard distance. Based on our evaluation, the combination of textual similarity and WordNet (conceptual) similarity was the most effective similarity method for food ingredients.
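Two of the three textual measures the abstract names can be written out compactly; Jaro-Winkler is omitted here for brevity. This sketch uses standard definitions (Levenshtein via dynamic programming, Jaccard distance over character sets); the test strings are illustrative, not from the dataset.

```python
def levenshtein(a, b):
    """Classic edit distance via a rolling-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jaccard(a, b):
    """Jaccard distance over character sets: 1 - |A∩B| / |A∪B|."""
    sa, sb = set(a), set(b)
    return 1 - len(sa & sb) / len(sa | sb)

d1 = levenshtein("sugar", "suggar")
d2 = jaccard("salt", "salt")
```

Combining such textual scores with a WordNet-based conceptual score, as the paper reports, lets near-duplicate spellings and true synonyms both be collapsed.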


2021 ◽  
Author(s):  
Timothée Poisot ◽  
Rory Gibb ◽  
Sadie Jane Ryan ◽  
Colin Carlson

NCBITaxonomy.jl is a package designed to facilitate the reconciliation and cleaning of taxonomic names using a local copy of the NCBI taxonomic backbone (Federhen 2012, Schoch et al. 2020). The basic search functions are coupled with quality-of-life functions, including case-insensitive search and custom fuzzy string matching, to maximize the amount of information that can be extracted automatically while allowing efficient manual curation and inspection of results. NCBITaxonomy.jl works with version 1.6 of the Julia programming language (Bezanson et al. 2017) and relies on the Apache Arrow format to store a local copy of the NCBI raw taxonomy files. The design of NCBITaxonomy.jl has been inspired by similar efforts, like the R package taxadb (Norman et al. 2020), which provides an offline alternative to packages like taxize (Chamberlain and Szöcs 2013).
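The kind of case-insensitive, fuzzy name reconciliation the package provides can be sketched language-agnostically. This Python sketch is not NCBITaxonomy.jl's API; the name table, function name, and cutoff are all hypothetical stand-ins for a lookup against the full NCBI backbone.

```python
from difflib import get_close_matches

# Hypothetical two-row slice of a taxonomy name table (name -> taxid).
names = {"Homo sapiens": 9606, "Mus musculus": 10090}

def taxid(query, table, cutoff=0.85):
    """Case-insensitive, fuzzy lookup of a taxon name to its identifier."""
    lowered = {n.lower(): n for n in table}
    hit = get_close_matches(query.lower(), lowered, n=1, cutoff=cutoff)
    return table[lowered[hit[0]]] if hit else None

tid = taxid("homo sapien", names)  # truncated, lowercased spelling
```

Returning nothing below the cutoff, rather than the nearest name regardless of distance, is what makes such a tool safe to combine with manual inspection of the unmatched remainder.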


2021 ◽  
Vol 9 (2) ◽  
pp. 168-175
Author(s):  
Sebastianus A S Mola ◽  
Meiton Boru ◽  
Emerensye Sofia Yublina Pandie

Written communication on social media, which prioritizes the speed of information dissemination, frequently exhibits non-standard language use at the sentence, clause, phrase and word levels. As a data source, social media affected by this phenomenon poses a challenge for information extraction. Normalizing non-standard language into standard language begins with word normalization, in which a non-standard word (NSW) is normalized into its standard word (SW) form. Normalization using edit distance is limited by its static weighting of the mismatch, match, and gap values. When computing the mismatch value, static weighting cannot assign different weights to errors caused by pressing the wrong key on a keyboard, especially adjacent keys. Because of this limitation of edit distance weighting, this study proposes a dynamic weighting method for the mismatch weight. The result of this study is a new dynamic weighting method, based on keyboard key positions, that can be used to normalize NSWs using approximate string matching.
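The keyboard-position idea can be sketched as an edit distance whose substitution cost drops for adjacent keys. This is an illustration of the general technique, not the paper's method: the neighbour table covers only a few QWERTY keys and the 0.5 discount is an invented, illustrative weight.

```python
# Hypothetical QWERTY neighbourhoods: a mismatch between adjacent keys
# costs less than one between distant keys (weights are illustrative).
NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "r": "edft", "t": "rfgy", "o": "iklp", "i": "ujko",
}

def sub_cost(a, b):
    """Cheaper substitution when the two keys sit next to each other."""
    if a == b:
        return 0.0
    return 0.5 if b in NEIGHBOURS.get(a, "") else 1.0

def weighted_edit_distance(a, b):
    """Edit distance with the dynamic mismatch weight above; gaps cost 1."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [float(i)]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1.0,                      # deletion
                           cur[j - 1] + 1.0,                   # insertion
                           prev[j - 1] + sub_cost(ca, cb)))    # substitution
        prev = cur
    return prev[-1]

d = weighted_edit_distance("tidak", "todak")  # 'i'/'o' are adjacent keys
```

Under this weighting, a typo on a neighbouring key pulls the misspelled form closer to its standard word than an arbitrary substitution would, which is the behaviour the proposed dynamic mismatch weight is designed to produce.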

