Typo handling in searching of Quran verse based on phonetic similarities

2020 ◽  
Vol 6 (2) ◽  
pp. 130
Author(s):  
Naila Iffah Purwita ◽  
Moch Arif Bijaksana ◽  
Kemas Muslim Lhaksmana ◽  
Muhammad Zidny Naf’an

The Quran search system was built to make it easier for Indonesians to find a verse by typing its text according to Indonesian pronunciation, a solution for users who have difficulty writing or typing Arabic characters. A Quran search system based on phonetic similarity can thus help Indonesian Muslims locate a particular verse. Lafzi was one of the systems that pioneered this kind of search, and it was later extended under the name Lafzi+. Lafzi+ can handle queries containing typos, but it covers only a limited range of typing-error types. This research presents Lafzi++, an improvement over the previous development that handles more typographical error types by applying typo correction: the autocomplete method corrects faulty queries and the Damerau-Levenshtein distance computes the edit distance, so the system can offer query suggestions when a user mistypes a search, whether by substitution, insertion, deletion, or transposition. Users can also search easily because queries use Latin characters matching Indonesian pronunciation. The evaluation shows that the system improves on its predecessor: the accuracy of every tested query surpasses that of the previous system, with a highest recall of 96.20% and a highest Mean Average Precision (MAP) of 90.69%.
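As an illustration of the edit-distance component described above, the Damerau-Levenshtein distance (in its common optimal-string-alignment form) can be sketched as follows. This is our own minimal sketch, not the Lafzi++ code; the function name and example words are illustrative.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitution, insertion, deletion,
    and transposition of adjacent characters (OSA variant)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```

A query suggester would compute this distance between the mistyped Latin-script query and each candidate in the phonetic index, then rank candidates by increasing distance.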

2016 ◽  
Vol 7 (2) ◽  
Author(s):  
Yeny Rochmawati ◽  
Retno Kusumaningrum

Abstract. Typing errors that turn standard words into non-standard words are often caused by misspelling. This can be addressed by developing a system to identify typing errors. Approximate string matching is one widely implemented approach to identifying typing errors, with several string search algorithms: Levenshtein Distance, Hamming Distance, Damerau-Levenshtein Distance, and Jaro-Winkler Distance. However, no study has compared the performance of these four algorithms for Indonesian. This research therefore compares the four algorithms in order to identify which is the most accurate and precise at string search across varied typing errors. Evaluation is performed using users' relevance judgments, which produce the mean average precision (MAP) that determines the best algorithm. The results show that the Jaro-Winkler Distance algorithm performs best at word checking, with a MAP of 0.87 when identifying the typing errors of 50 incorrect words.
Keywords: typing errors, Levenshtein, Hamming, Damerau-Levenshtein, Jaro-Winkler
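For reference, the winning measure in the comparison above, Jaro-Winkler, can be sketched as follows. This is a simplified textbook implementation (not the study's code); the `p` weight and four-character prefix cap are the conventional defaults.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matches within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0  # half-transpositions between matched characters
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro score for strings sharing a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

The prefix boost is why Jaro-Winkler tends to do well on typing errors: typos cluster toward the ends of words, so mistyped words usually share their opening letters with the intended word.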


2021 ◽  
Vol 27 (10) ◽  
pp. 531-541
Author(s):  
G. N. Zhukova ◽  
M. V. Ulyanov

The problem considered is constructing a periodic sequence consisting of at least eight periods from a given sequence that was obtained from an unknown periodic sequence, also containing at least eight periods, by introducing noise in the form of deletion, replacement, and insertion of symbols. To construct a periodic sequence approximating the noisy input, the length of the repeating fragment (the period) must first be estimated. The distorted sequence is then divided into successive sections of equal length, with the length taking integer values from 80% to 120% of the period estimate. Each section is compared with each of the remaining sections, and the section chosen to build the periodic sequence is the one with the minimum edit distance (Levenshtein distance) to any of the remaining sections; minimization is carried out over all sections of a fixed length, and then over all lengths from 80% to 120% of the period estimate. For a correct comparison of fragments of different lengths, the ratio of the edit distance to the fragment length is considered. The length of the fragment that minimizes this ratio is taken as the period of the approximating sequence, and the fragment itself, repeated the required number of times, forms the approximating sequence. The constructed sequence may end with an incomplete repetition of the fragment. The quality of the approximation is estimated by the ratio of the edit distance between the original distorted sequence and a constructed periodic sequence of the same length to that length.
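The search procedure described above can be sketched as follows. This is our own simplification, not the authors' code: on noise-free input several fragment lengths can tie at a ratio of zero, which is why the method is aimed at noisy sequences; the function names are ours.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(cur[j - 1] + 1,          # insertion
                           prev[j] + 1,             # deletion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def estimate_period(seq: str, rough: int):
    """Try fragment lengths from 80% to 120% of a rough period estimate,
    cut the sequence into consecutive fragments of that length, and keep
    the fragment whose minimum (edit distance / length) to some other
    fragment of the same length is smallest."""
    best_ratio, best_len, best_frag = float("inf"), None, None
    for L in range(max(1, rough * 8 // 10), rough * 12 // 10 + 1):
        frags = [seq[i:i + L] for i in range(0, len(seq) - L + 1, L)]
        for i, f in enumerate(frags):
            for j, g in enumerate(frags):
                if i == j:
                    continue
                ratio = levenshtein(f, g) / L
                if ratio < best_ratio:
                    best_ratio, best_len, best_frag = ratio, L, f
    return best_len, best_frag
```

Repeating the returned fragment to the length of the input (truncating the last copy if necessary) then yields the approximating periodic sequence.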


2021 ◽  
pp. 1-12
Author(s):  
Anita Ramalingam ◽  
Subalalitha Chinnaudayar Navaneethakrishnan

Thirukkural, a Tamil classic written around 300 BCE, is a work of didactic literature. Although the Thirukkural comprises 1330 couplets organized into three sections and 133 chapters, retrieving a meaningful couplet for a given query in a search system requires a better organization of the text. This paper lays such a foundation by classifying the Thirukkural into ten new categories, called superclasses, that help in building a better Information Retrieval (IR) system. The classifier is trained using the Multinomial Naïve Bayes algorithm, and each superclass is further divided into two subcategories based on the didactic information. The proposed classification framework is evaluated using precision, recall, and F-score, achieving an overall F-score of 82.33%; a comparative analysis was performed against the Support Vector Machine, Logistic Regression, and Random Forest algorithms. An IR system built on top of the proposed framework was compared with Google search and a locally built keyword search: it achieved a mean average precision of 89%, whereas Google search and keyword search yielded 59% and 68%, respectively.
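The Multinomial Naïve Bayes step can be sketched with a tiny from-scratch classifier. The training snippets and class labels below are invented placeholders; the paper trains on the actual Thirukkural couplets and its ten superclasses.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, docs, labels, alpha=1.0):
        self.vocab = {w for d in docs for w in d.split()}
        class_counts = Counter(labels)
        word_counts = defaultdict(Counter)
        for d, y in zip(docs, labels):
            word_counts[y].update(d.split())
        # log P(class) and log P(word | class) with add-alpha smoothing
        self.logprior = {y: math.log(n / len(labels))
                         for y, n in class_counts.items()}
        self.loglik = {
            y: {w: math.log((word_counts[y][w] + alpha) /
                            (sum(word_counts[y].values()) + alpha * len(self.vocab)))
                for w in self.vocab}
            for y in class_counts
        }
        return self

    def predict(self, doc):
        def score(y):
            return self.logprior[y] + sum(
                self.loglik[y][w] for w in doc.split() if w in self.vocab)
        return max(self.logprior, key=score)
```

Unknown words are simply skipped at prediction time; a production system would instead use proper tokenization and a held-out vocabulary policy.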


2009 ◽  
Vol 50 ◽  
pp. 217-222 ◽  
Author(s):  
Viktoras Paliulionis

Lithuanian Address Geocoding: Problems and Solutions. Geocoding is the process of converting a textual description of a location into geographic coordinates. One of the most frequently used ways to describe a place is its postal address, which contains a city name, street name, house number, and other address components. The paper deals with the problems of geocoding Lithuanian addresses; the main problems are the variety of address formats in use and possible typing and spelling errors. The paper describes the steps of the geocoding process and the algorithms they use. We propose a phonetic algorithm called LT-Soundex, adapted to the Lithuanian language, which enables indexing address components by phonetic similarity and performing approximate address search. It is used together with the Levenshtein distance for effective approximate address searching.
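The idea of phonetic indexing can be sketched with the classic Soundex scheme: words are reduced to short codes so that similar-sounding spellings collide in the index. The letter groups below are the standard English ones (with simplified h/w handling); the actual LT-Soundex groups are adapted to Lithuanian phonetics and will differ.

```python
# Classic English Soundex consonant groups (illustrative only; the
# LT-Soundex algorithm in the paper uses Lithuanian-specific groups).
SOUNDEX_GROUPS = {
    **{c: "1" for c in "bfpv"},
    **{c: "2" for c in "cgjkqsxz"},
    **{c: "3" for c in "dt"},
    **{c: "4" for c in "l"},
    **{c: "5" for c in "mn"},
    **{c: "6" for c in "r"},
}

def soundex(word: str) -> str:
    """Reduce a word to a 4-character phonetic code: first letter,
    then digits for consonant groups, collapsing adjacent repeats."""
    word = word.lower()
    if not word:
        return ""
    code = word[0].upper()
    prev = SOUNDEX_GROUPS.get(word[0], "")
    for c in word[1:]:
        d = SOUNDEX_GROUPS.get(c, "")
        if d and d != prev:
            code += d
        prev = d
    return (code + "000")[:4]
```

An approximate address index would bucket street and city names by such codes, so a misspelled query still lands in the right bucket, and then rank the bucket's candidates by Levenshtein distance.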


2019 ◽  
Vol 484 (4) ◽  
pp. 401-404
Author(s):  
P. A. Yakovlev

A method for efficient comparison of a symbol sequence with all strings of a set is presented; it performs considerably faster than naively comparing the sequence with each string in succession. The speedup comes from an original algorithm that combines a prefix tree with the standard dynamic programming algorithm for the edit distance (Levenshtein distance) between strings. The efficiency of the method is confirmed by numerical experiments on arrays of tens of millions of biological sequences of variable domains of monoclonal antibodies.
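The core trick of combining a prefix tree with the edit-distance DP is well known and can be sketched as follows: strings sharing a prefix share the corresponding DP rows, and a branch is pruned once no entry in its row can stay within the distance budget. This is our own generic sketch, not the paper's algorithm or data.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set at nodes that terminate a dictionary word

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for c in w:
            node = node.children.setdefault(c, TrieNode())
        node.word = w
    return root

def search(root, query, max_dist):
    """Return (word, distance) pairs within max_dist edits of query,
    sharing DP rows across words via the prefix tree."""
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        # Extend the DP by one row for character `ch` on this trie edge.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(row[j - 1] + 1,                         # insertion
                           prev_row[j] + 1,                        # deletion
                           prev_row[j - 1] + (query[j - 1] != ch)))  # substitution
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        if min(row) <= max_dist:  # prune: no completion can get closer
            for c, child in node.children.items():
                walk(child, c, row)

    for c, child in root.children.items():
        walk(child, c, first_row)
    return results
```

Over tens of millions of highly similar antibody sequences, the shared prefixes and early pruning are what turn the naive all-pairs cost into something tractable.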


2021 ◽  
Vol 11 (2) ◽  
pp. 63-70
Author(s):  
Nadhia Nurin Syarafina ◽  
Jozua Ferjanus Palandi
Good script or report writing requires a high level of accuracy, but the level of accuracy varies from writer to writer. Low accuracy allows words in a sentence to be mistyped; typing errors turn words into non-standard forms or, worse, render them meaningless. In this case, the recommendation application serves to provide word-writing recommendations when a typing error occurs, and can thus reduce the writer's error rate while typing. One method for improving word spelling is approximate string matching, which applies an approximation approach to the string search process; the Levenshtein Distance algorithm is part of this method. With this method, words first go through a preprocessing stage, after which incorrectly written words are corrected using the Levenshtein Distance algorithm. The application testing phase used ten texts of up to 100 words, ten texts of 100 to 250 words, and ten texts of 250 to 500 words; the average accuracy rates in these tests were 95%, 94%, and 90%, respectively.
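The dictionary-lookup flavor of approximate string matching can also be demonstrated with the Python standard library. Note the hedge: `difflib` ranks candidates by `SequenceMatcher` ratio, a cheap stand-in for the Levenshtein-based ranking the paper uses, and the toy Indonesian lexicon below is ours.

```python
import difflib

# Toy lexicon of standard Indonesian words (illustrative only).
DICTIONARY = ["makan", "minum", "menulis", "membaca", "laporan"]

def suggest(word, n=3, cutoff=0.6):
    """Offer up to n spelling suggestions for a possibly mistyped word,
    keeping only candidates whose similarity ratio exceeds the cutoff."""
    return difflib.get_close_matches(word, DICTIONARY, n=n, cutoff=cutoff)
```

A full application would run each token of the text through such a lookup after preprocessing, flag tokens absent from the dictionary, and present the top-ranked suggestions.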


Telematika ◽  
2020 ◽  
Vol 17 (2) ◽  
pp. 101
Author(s):  
Budi Prabowo ◽  
Heru Cahya Rustamadji ◽  
Yuli Fauziah

The use of standard words and correct sentence structure is one of the requirements for writing a research report. Writing errors can occur unnoticed, both as typing errors and in sentence structure; among the causes are habits formed when writing short messages, the evolution of everyday language, and keyboard keys that sit too close together. Writing errors are usually corrected right after writing, but correcting them takes time and care. The CYK algorithm is a membership parsing algorithm for context-free grammars that can be used to check sentence structure, while the DLD (Damerau-Levenshtein Distance) algorithm computes the distance between two strings and can therefore be used to recommend words and sentences. The aim of this research is to apply the CYK algorithm to detect sentence structure and the DLD algorithm to recommend words and sentence structures. Sentence checking is performed by grouping every word in the text by its part of speech; the grouped words are then reassembled into sentences and checked with the CYK algorithm to determine whether each sentence is correct. If a sentence is wrong, a sentence recommendation is produced with the DLD algorithm by computing the edit distance; besides repairing sentences, the DLD algorithm also corrects misspelled words. Testing showed a success rate of 96% for the CYK algorithm in detecting sentence structure and 96% for the DLD algorithm in recommending words, and 88% in recommending sentences.
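The CYK membership test described above can be sketched for a grammar in Chomsky normal form. The tiny Indonesian grammar in the test is an invented placeholder, not the grammar used in the paper.

```python
def cyk(words, grammar, start="S"):
    """CYK membership test for a CNF grammar.
    `grammar` maps each head to a list of bodies, where a body is either
    two nonterminals (e.g. ["NP", "VP"]) or one terminal (e.g. ["buku"])."""
    n = len(words)
    table = [[set() for _ in range(n)] for _ in range(n)]
    # Fill the diagonal with heads that directly produce each word.
    for i, w in enumerate(words):
        for head, bodies in grammar.items():
            if [w] in bodies:
                table[i][i].add(head)
    # Combine adjacent spans bottom-up.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for head, bodies in grammar.items():
                    for body in bodies:
                        if (len(body) == 2
                                and body[0] in table[i][k]
                                and body[1] in table[k + 1][j]):
                            table[i][j].add(head)
    return start in table[0][n - 1]
```

A sentence rejected by this test would then be handed to the DLD step, which ranks candidate corrections by edit distance.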


Linguistics ◽  
2015 ◽  
Vol 53 (2) ◽  
Author(s):  
Robert Möller ◽  
Ludger Zeevaert

Abstract. This article presents methods and findings from research on the factors that determine the recognizability of cognate words in Germanic intercomprehension (more precisely, German speakers' reading comprehension in Germanic languages they have not learnt). Different types of written tests were carried out (free response, multiple choice, judgments on the probability of two words being cognates) in order to assess the importance of different aspects of linguistic similarity for the transparency of written cognates. Apart from the overall number of differing segments in the cognate words, phonetic similarity between the differing segments, in particular an identical place of articulation, turned out to be most important; this can reflect either a spontaneous sense of similarity or familiarity with variation between phonetically similar sounds. By adjusting similarity measurement to these results (a weighted Levenshtein distance), it is possible, to a certain degree, to assess the transparency of cognates. Nevertheless, certain results could not be explained by phonetic similarity. Recordings were therefore made of subjects commenting on how they proceeded in such tasks and in text decoding, which made clear that subjects often give other associations the same weight as phonetic ones: even in the recognition of isolated words, semantic connections in the mental lexicon are involved. In text context, the most frequent strategy seems to rely on an interplay of phonetic similarity and inference; in the subjects' reflections, however, the aspect of semantic probability manifestly overrides intuitions about phonetic similarity.
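A weighted Levenshtein distance of the kind mentioned above can be sketched by discounting substitutions between phonetically similar segments. The similarity classes and the 0.5 discount below are illustrative assumptions of ours; the article derives its actual weights from the experimental results.

```python
# Illustrative similarity classes, loosely grouped by place of
# articulation (labial, alveolar, velar). The paper's weighting differs.
SIMILAR = [set("pbmfvw"), set("tdnsz"), set("kgch")]

def sub_cost(a: str, b: str) -> float:
    """Cheaper substitution for phonetically similar characters."""
    if a == b:
        return 0.0
    return 0.5 if any(a in g and b in g for g in SIMILAR) else 1.0

def weighted_levenshtein(s: str, t: str) -> float:
    """Levenshtein DP with graded substitution costs."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[m][n]
```

Under such a measure, cognate pairs like German "Wasser" and Dutch "water" score as closer than an unweighted edit distance would suggest, because the differing segments share a place of articulation.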


2020 ◽  
Vol 188 ◽  
pp. 00027
Author(s):  
Viny Christanti Mawardi ◽  
Fendy Augusfian ◽  
Jeanny Pragantha ◽  
Stéphane Bressan

This research was intended to create a spelling-correction application that helps teachers examine question scripts, with the capability to find typographical errors and give suggestions for non-real-word errors. The application is built with the simple Damerau-Levenshtein Distance method to detect errors and suggest words for each mistyped word. Teachers can use it to examine documents in the form of short answers, essays, and multiple-choice questions, then save them back as the original documents. The application uses a dictionary lookup consisting of 41,312 Indonesian words. In the first test, the application detected non-real-word errors in 50 sentences, each containing one such error, with an accuracy of 88%. The second test detected typographical errors in an exam script consisting of 15 sample questions: five essay questions, five short answers, and five multiple-choice questions.


2020 ◽  
Vol 7 (2) ◽  
pp. 98-101
Author(s):  
Wiwi Clarissa ◽  
Farica Perdana Putri

Typographical errors happen often. They can occur due to mechanical errors or slips of the hand or fingers when typing; not knowing how to spell a word correctly can also cause them. Dictionary applications have been developed by various parties to make dictionary search more efficient. However, there is no word-search optimization for when a typographical error happens, and a typo in the search process can mean the sought information cannot be found. The Damerau-Levenshtein Distance algorithm is implemented to provide search suggestions when a typographical error occurs. This research aims to design and build a health dictionary application, MeDict, using the Damerau-Levenshtein Distance algorithm, with the Technology Acceptance Model (TAM) used to evaluate the application. In the evaluation, 86.2% of respondents strongly agree that the application is useful and 86.9% strongly agree that it is easy to use.

