Dictionary Distribution Based on Number of Characters for Damerau-Levenshtein Distance Spell Checker Optimization

Author(s):  
Utomo Pujianto ◽  
Aji Prasetya Wibawa ◽  
Raditha Ulfah
Author(s):  
Yunita Purnama Sari ◽  
Gede Aditra Pradnyana ◽  
I Made Agus Wirawan

Abstract: The Bima language is one of the cultural identities of the Bima people. Globalization has increasingly marginalized the Bima language, to the point that it seems to have become a “foreign” language to the Bima people themselves. As a mother tongue, the Bima language must be preserved, and one way to do so is to develop a Bima language dictionary application. The objectives of this research are: (1) to determine how to design an Android-based Bima-Indonesian dictionary application that uses the Levenshtein Distance algorithm as a spell checker; (2) to determine how to implement an Android-based Bima-Indonesian dictionary application that uses the Levenshtein Distance algorithm as a spell checker. The research method used is research and development with the waterfall model. The application was developed by applying the Levenshtein Distance algorithm, an algorithm devised by Vladimir Levenshtein that determines the number of differences between two words or sentences. The application uses an SQLite database so that it can be used offline. The result of this research is a Bima-Indonesian dictionary application that can be installed on Android smartphones. The application is expected to help preserve the Bima language and to ease communication between the Bima people and others.
Keywords: Dictionary, Bima-Indonesian, Levenshtein Distance, Android
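As an illustration of the distance this abstract refers to, the following is a minimal Python sketch of the classic Levenshtein distance. It is not the application's code; the function name `levenshtein` and the sample words are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    # prev holds the distances from the previous prefix of a to every prefix of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


# Hypothetical misspellings of the Indonesian words "bahasa" and "kamus".
print(levenshtein("bahsa", "bahasa"))  # one missing letter
print(levenshtein("kamsu", "kamus"))   # two swapped letters count as two substitutions
```

Only two rows of the distance matrix are kept at a time, which is enough when just the final distance is needed.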


2020 ◽  
Vol 2 (2) ◽  
pp. 57
Author(s):  
Puji Santoso ◽  
Pundhi Yuliawati ◽  
Ridwan Shalahuddin ◽  
Ilham Ari Elbaith Zaeni

Damerau-Levenshtein Distance determines the distance, that is, the minimum number of operations needed to transform one string into another, where the operations used to measure the similarity between strings are insertion, deletion, substitution, and transposition. The algorithm can also be used to correct misspelled words. However, the Damerau-Levenshtein Distance algorithm has a weakness: long processing time. When the distance between two strings is computed with the Damerau-Levenshtein algorithm, every letter of both strings is compared by building a distance matrix. Because the Indonesian dictionary contains more than 30,000 root words, the distance computation is performed more than 30,000 times for each misspelling. This research proposes an improvement that shortens the processing time of the Damerau-Levenshtein algorithm by reducing the rows and columns of the distance matrix. The expected end result is faster processing without sacrificing accuracy.
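The four operations can be sketched in Python as follows. This is the restricted "optimal string alignment" variant of Damerau-Levenshtein distance, written with a full distance matrix for clarity. The length-bucket pre-filter is an assumption added for illustration: it relies only on the fact that two strings within distance k differ in length by at most k, and it is not claimed to be the matrix-reduction method the abstract proposes. All names (`osa_distance`, `bucket_by_length`, `candidates`) and the word list are hypothetical.

```python
from collections import defaultdict


def osa_distance(a: str, b: str) -> int:
    """Edit distance with insertion, deletion, substitution,
    and transposition of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition: the last two characters are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def bucket_by_length(words):
    """Group dictionary words by their number of characters."""
    buckets = defaultdict(list)
    for w in words:
        buckets[len(w)].append(w)
    return buckets


def candidates(word, buckets, max_distance=1):
    """Compare only against words whose length is within max_distance."""
    found = []
    for n in range(len(word) - max_distance, len(word) + max_distance + 1):
        for w in buckets.get(n, []):
            if osa_distance(word, w) <= max_distance:
                found.append(w)
    return found


buckets = bucket_by_length(["makan", "minum", "main", "makanan"])
print(candidates("mkaan", buckets))  # "makan" is one transposition away
```

With such a pre-filter, a misspelled word is compared only with dictionary entries of similar length instead of with every entry.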


2017 ◽  
Vol 3 (2) ◽  
pp. 1-6
Author(s):  
Ferly Gunawan ◽  
M. Ali Fauzi ◽  
Putra Pandu Adikara

The rapid growth of mobile applications has led to many applications being created with various uses to meet user needs. Each application allows users to leave reviews about it. The purpose of reviews is to evaluate and improve the quality of the product going forward. To that end, sentiment analysis can be used to classify reviews into positive or negative sentiment. Application reviews often contain misspellings that make them hard to understand. Misspelled words need to be normalized into their standard form, so word normalization is required to resolve the misspelling problem. This research uses word normalization based on Levenshtein distance. In testing, the highest accuracy was obtained with a split of 70% training data and 30% test data. The highest accuracy, 100%, was obtained with an edit value <=2; the second highest, 96.4%, with an edit value <=1; and the lowest, 66.6%, with edit values <=4 and <=5. Naive Bayes with Levenshtein Distance achieved the highest accuracy, 96.9%, compared with 94.4% for Naive Bayes without Levenshtein Distance.
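The role of the edit threshold can be sketched in Python as below. This is an assumed, simplified normalizer, not the study's implementation: an out-of-vocabulary token is replaced by its closest vocabulary word only if the distance does not exceed `max_edit`. The function names, the tie-breaking rule, and the tiny vocabulary are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Two-row dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def normalize(tokens, vocabulary, max_edit=2):
    """Replace misspelled tokens with the nearest standard word,
    if one lies within max_edit edits; otherwise keep the token."""
    result = []
    for token in tokens:
        if token in vocabulary:
            result.append(token)
            continue
        # Ties are broken alphabetically so the output is deterministic.
        distance, best = min((levenshtein(token, w), w) for w in vocabulary)
        result.append(best if distance <= max_edit else token)
    return result


# Hypothetical standard words: "aplikasi" (application), "bagus" (good), "sekali" (very).
vocabulary = {"aplikasi", "bagus", "sekali"}
print(normalize(["aplikasi", "bgus", "skali", "xyzzy"], vocabulary))
```

A larger threshold lets more tokens be rewritten, including tokens that were never misspellings of a vocabulary word, which is one plausible reason accuracy could fall at higher edit values.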


2020 ◽  
Vol 56 (4) ◽  
pp. 629-650
Author(s):  
Filip Graliński ◽  
Krzysztof Jassem

The paper describes a method for finding diachronic spelling variants in a corpus that consists of historical and modern Polish texts. The procedure applies the Levenshtein distance and the similarity measure determined with a Word2vec model. The method was applied for both words and sub-word units. A sample of spelling variants was manually evaluated and compared against an existing morphological analyser for Polish historical texts. The resulting lists of spelling variants and spelling modernisation rules were used in a text modernisation tool and their contribution was evaluated. The paper also presents an analogous method for finding spelling variants that result from erroneous OCR. The obtained lists of OCR variants and rules may serve for the correction of OCR output.


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Mukesh Kumar ◽  
Palak Rehan

Social media networks such as Twitter, Facebook, and WhatsApp are among the most commonly used media for sharing news and opinions and for staying in touch with peers. Messages on Twitter are limited to 140 characters. This has led users to create their own novel syntax in tweets to express more in fewer words. Free writing style, URLs, markup syntax, inappropriate punctuation, ungrammatical structures, abbreviations, etc. make it harder to mine useful information from tweets. For each tweet, we can get an explicit time stamp, the name of the user, the social network the user belongs to, or even the GPS coordinates if the tweet is created with a GPS-enabled mobile device. With these features, Twitter is, by nature, a good resource for detecting and analyzing real-time events happening around the world. By using the speed and coverage of Twitter, we can detect events, i.e. sequences of important keywords being discussed, in a timely manner, which can be used in applications such as natural calamity relief support, earthquake relief support, product launches, and suspicious activity detection. Keyword detection from Twitter can be seen as a two-step process: detection of keywords in raw text form (words as posted by the users) and keyword normalization (reforming the users' unstructured words into complete, meaningful English words). In this paper, a keyword detection technique based on a graph, a spanning tree, and the PageRank algorithm is proposed. A text normalization technique based on a hybrid approach using Levenshtein distance, the demetaphone algorithm, and dictionary mapping is proposed to work on the unstructured keywords produced by the proposed keyword detector. The proposed normalization technique is validated using the standard lexnorm 1.2 dataset. The proposed system is used to detect keywords from Twitter text as it is posted in real time. The detected and normalized keywords are later validated against search engine results for the detection of events.


2021 ◽  
Vol 27 (10) ◽  
pp. 531-541
Author(s):  
G. N. Zhukova ◽  
M. V. Ulyanov

The paper considers the problem of constructing a periodic sequence of at least eight periods from a given sequence that was obtained from an unknown periodic sequence, also containing at least eight periods, by introducing noise in the form of deletions, replacements, and insertions of symbols. To construct a periodic sequence that approximates the given noisy one, the length of the repeating fragment (period) must first be estimated. The distorted original sequence is then divided into successive sections of equal length, where the length takes integer values from 80 to 120% of the period estimate. Each section is compared with each of the remaining sections, and the section selected to build the periodic sequence is the one that has the minimum edit distance (Levenshtein distance) to any of the remaining sections; the minimization is carried out over all sections of a fixed length and then over all lengths from 80 to 120% of the period estimate. To compare fragments of different lengths correctly, the ratio between the edit distance and the fragment length is used. The length of the fragment that minimizes the ratio of its edit distance to another fragment of the same length to the fragment length is taken as the period of the approximating periodic sequence, and the fragment itself, repeated the required number of times, forms the approximating sequence. The constructed sequence may contain an incomplete repeating fragment at the end. The quality of the approximation is estimated by the ratio of the edit distance between the original distorted sequence and the constructed periodic sequence of the same length to that length.
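The procedure can be sketched in Python under stated assumptions. The abstract does not fix how the distances from one section to the remaining sections are aggregated; this sketch assumes the mean of the length-normalized distances, which is one possible reading and not necessarily the authors' choice. The function names and the test sequence are hypothetical.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Two-row dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def approximate_periodic(seq: str, period_estimate: int):
    """Return (period, fragment, approximation, quality) for a noisy sequence."""
    best = None  # (score, length, fragment)
    lo = max(1, math.ceil(0.8 * period_estimate))
    hi = math.floor(1.2 * period_estimate)
    for length in range(lo, hi + 1):
        # Successive complete sections of this length; an incomplete tail is ignored.
        sections = [seq[k:k + length]
                    for k in range(0, len(seq) - length + 1, length)]
        if len(sections) < 2:
            continue
        for i, s in enumerate(sections):
            others = [t for j, t in enumerate(sections) if j != i]
            # Assumed aggregation: mean edit distance divided by section length.
            score = sum(levenshtein(s, t) for t in others) / (len(others) * length)
            if best is None or score < best[0]:
                best = (score, length, s)
    if best is None:
        raise ValueError("sequence too short for the given period estimate")
    _, length, fragment = best
    repeats = -(-len(seq) // length)  # ceiling division
    approximation = (fragment * repeats)[:len(seq)]  # may end with a partial fragment
    quality = levenshtein(seq, approximation) / len(seq)
    return length, fragment, approximation, quality


period, fragment, approximation, quality = approximate_periodic("abcde" * 8, 5)
print(period, fragment, quality)
```

For a noise-free input, the selected fragment reproduces the sequence exactly, so the quality ratio is zero; noisier inputs yield larger ratios.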


2021 ◽  
pp. 014272372110422
Author(s):  
Jolien Faes ◽  
Joris Gillis ◽  
Steven Gillis

Auditory brainstem implantation (ABI) is a recent innovation in pediatric hearing restoration in children with a sensorineural hearing impairment. Only limited information is available on the spontaneous speech development of severe-to-profound congenitally hearing-impaired children who received an ABI. The purpose of this study was to investigate longitudinally the accuracy of ABI children’s word productions in spontaneous speech in comparison to the accuracy of children who received a cochlear implant and children with normal hearing. The data of this study consist of recordings of the spontaneous speech of the first three Dutch-speaking children living in Belgium who received an ABI. The children’s utterances were phonemically transcribed and for each word, the distance between the child’s production and the standard adult phonemic transcription was computed using the Levenshtein Distance as a metric. The same procedure was applied to the longitudinal data of the children with CI and the normally hearing children. The main result was that the Levenshtein Distance decreased in the three children with ABI but it remained significantly higher than that of children with typical hearing and cochlear implants matched on chronological age, hearing age, and lexicon size. In other words, the phonemic accuracy increased in the children with ABI but stayed well below that of children without hearing loss and children with cochlear implants. Moreover, the analyses revealed considerable individual variation between the children with ABI.

