Position-restricted approximate string matching with metric Hamming distance

Abstract. Typing errors that turn standard words into non-standard words are often caused by misspelling. This can be addressed by developing a system that identifies typing errors. Approximate string matching is widely used for this purpose, with several string search algorithms: Levenshtein Distance, Hamming Distance, Damerau Levenshtein Distance, and Jaro Winkler Distance. However, no study has compared the performance of these four algorithms for the Indonesian language. This research therefore compares the four algorithms to identify which is the most accurate and precise in string search across various typing errors. Evaluation uses users' relevance judgments, which produce the mean average precision (MAP) used to determine the best algorithm. The results show that the Jaro Winkler Distance algorithm performs best in word checking, with a MAP value of 0.87 when identifying the typing errors in 50 incorrect words.

Keywords: Typing errors, Levenshtein, Hamming, Damerau Levenshtein, Jaro Winkler
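As a rough illustration of two ingredients named in the abstract, the sketch below computes the Hamming distance between two words and the mean average precision over ranked suggestion lists. It is not the study's implementation: the function names are hypothetical, and average precision is normalised by the number of relevant items that appear in each list, which is one common convention and an assumption here.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def average_precision(relevance: list[bool]) -> float:
    """Average precision of one ranked list of relevance judgments.

    relevance[i] is True when the suggestion at rank i + 1 was judged relevant.
    Assumption: normalise by the number of relevant items found in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(judgments: list[list[bool]]) -> float:
    """Mean of the average precision over all queries (misspelled words)."""
    if not judgments:
        raise ValueError("at least one ranked list is required")
    return sum(average_precision(j) for j in judgments) / len(judgments)
```

For example, `hamming_distance("kitten", "sitten")` counts one differing position, and two queries whose relevant suggestion appears at rank 1 and rank 2 respectively give a MAP of 0.75.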

Correction to: New algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance

The Journal of Supercomputing ◽ 10.1007/s11227-018-2324-7 ◽ 2018 ◽ Vol 74 (5) ◽ pp. 1835-1835

Author(s): ThienLuan Ho ◽ Seung-Rohk Oh ◽ HyunJin Kim

Keyword(s): Hamming Distance ◽ String Matching ◽ Approximate String Matching ◽ Fixed Length ◽ New Algorithms

Generalised Implementation for Fixed-Length Approximate String Matching under Hamming Distance and Applications

2015 IEEE International Parallel and Distributed Processing Symposium Workshop ◽ 10.1109/ipdpsw.2015.106 ◽ 2015 ◽ Cited By ~ 4

Author(s): Solon Pissis ◽ Ahmad Retha

Keyword(s): Hamming Distance ◽ String Matching ◽ Approximate String Matching ◽ Fixed Length

New algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance

The Journal of Supercomputing ◽ 10.1007/s11227-017-2192-6 ◽ 2017 ◽ Vol 74 (5) ◽ pp. 1815-1834 ◽ Cited By ~ 1

Author(s): ThienLuan Ho ◽ Seung-Rohk Oh ◽ HyunJin Kim

Keyword(s): Hamming Distance ◽ String Matching ◽ Approximate String Matching ◽ Fixed Length ◽ New Algorithms

BDD-Based Analysis of Gapped q-Gram Filters

International Journal of Foundations of Computer Science ◽ 10.1142/s0129054105003698 ◽ 2005 ◽ Vol 16 (06) ◽ pp. 1121-1134 ◽ Cited By ~ 2

Author(s): Marc Fontaine ◽ Stefan Burkhardt ◽ Juha Kärkkäinen

Keyword(s): Hamming Distance ◽ String Matching ◽ Combinatorial Problem ◽ Good Choice ◽ Design Parameters ◽ Second Step ◽ Approximate String Matching ◽ Binary Decision ◽ New Class ◽ Important Design

Recently, there has been a surge of interest in gapped q-gram filters for approximate string matching. Important design parameters for filters include the value of q, the filter threshold, and in particular the shape (aka seed) of the filter. A good choice of parameters can improve the performance of a q-gram filter by orders of magnitude, and optimizing these parameters is a nontrivial combinatorial problem. We describe a new method for analyzing gapped q-gram filters. This method is simple and generic. It applies to a variety of filters, overcomes many restrictions that are present in existing algorithms, and can easily be extended to new filter variants. To implement our approach, we use an extended version of BDDs (Binary Decision Diagrams), a data structure that efficiently represents sets of bit-strings. In a second step, we define a new class of multi-shape filters and analyze these filters with the BDD-based approach. Experiments show that multi-shape filters can outperform the best single-shape filters currently in use in many aspects. The BDD-based algorithm is crucial for the design and analysis of these new and better multi-shape filters. Our results apply to the k-mismatches problem, i.e., approximate string matching with Hamming distance.
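To make the roles of shape and threshold concrete, here is a minimal sketch of a single-shape gapped q-gram count for the k-mismatches setting. It does not reproduce the BDD-based analysis or the multi-shape filters of the paper. The shape notation (`#` for a sampled position, `.` for a gap) and all function names are assumptions made for illustration, and the threshold is only the simple counting lower bound (each mismatch can destroy at most q q-grams), not an optimised value.

```python
def shape_offsets(shape: str) -> list[int]:
    """Offsets of the sampled positions in a shape such as '##.#'."""
    return [i for i, c in enumerate(shape) if c == "#"]


def surviving_qgrams(pattern: str, window: str, shape: str) -> int:
    """Count start positions where pattern and window agree on every sampled offset.

    Under Hamming distance there are no insertions or deletions, so the
    q-gram starting at position i in the pattern is compared with the q-gram
    starting at the same position i in the window.
    """
    if len(pattern) != len(window):
        raise ValueError("pattern and window must have equal length")
    offsets = shape_offsets(shape)
    span = len(shape)
    return sum(
        all(pattern[i + o] == window[i + o] for o in offsets)
        for i in range(len(pattern) - span + 1)
    )


def lower_bound_threshold(m: int, shape: str, k: int) -> int:
    """Simple threshold: each of k mismatches destroys at most q of the q-grams."""
    q = shape.count("#")
    total = m - len(shape) + 1  # number of q-grams in a length-m pattern
    return max(0, total - q * k)
```

With the hypothetical pattern `ACGTACGTAC`, a window that differs only at index 4, and the shape `##.#`, four of the seven gapped q-grams survive, which meets the threshold computed for k = 1. A window whose count falls below the threshold can be discarded without a full comparison.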

Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

10.1101/301085 ◽ 2018 ◽ Cited By ~ 1

Author(s): Kiavash Kianfar ◽ Christopher Pockrandt ◽ Bahman Torkamandi ◽ Haochen Luo ◽ Knut Reinert

Keyword(s): Dynamic Programming ◽ Optimization Problem ◽ Ad Hoc ◽ Hamming Distance ◽ String Matching ◽ Integer Program ◽ Mixed Integer ◽ Mixed Integer Program ◽ Approximate String Matching ◽ Optimal Search

Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard, allowing the search to start from anywhere within the pattern and extend in both directions. In particular, the use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Here, for the first time, we propose a mixed integer program (MIP) capable of solving this optimization problem for Hamming distance with a given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in the bidirectional FM-index over previous ad hoc solutions. For example, approximate matching of 101-bp Illumina reads (with two errors) becomes 35 times faster than standard backtracking. Moreover, despite being performed purely in the index, the running time of search using our optimal schemes (for up to two errors) is comparable to that of the best state-of-the-art aligners, which benefit from combining search in the index with in-text verification using dynamic programming. As a result, we anticipate that a full-fledged aligner combining search in the bidirectional FM-index using our optimal search schemes with in-text verification using dynamic programming will outperform today’s best aligners. The development of such an aligner, called FAMOUS (Fast Approximate string Matching using OptimUm search Schemes), is ongoing as our future work.
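The idea of partitioning a pattern into pieces can be illustrated with the classical pigeonhole argument: if a pattern is split into k + 1 pieces, any occurrence with at most k mismatches contains at least one piece without error. The sketch below uses that simpler technique with a plain text scan and Hamming verification. It is not the MIP-derived search schemes of the paper and does not use a bidirectional FM-index; the function names and the toy sequences are assumptions for illustration.

```python
def split_pattern(pattern: str, pieces: int) -> list[tuple[int, str]]:
    """Split the pattern into nearly equal pieces, returned as (offset, piece)."""
    m = len(pattern)
    bounds = [m * i // pieces for i in range(pieces + 1)]
    return [(bounds[i], pattern[bounds[i]:bounds[i + 1]]) for i in range(pieces)]


def k_mismatch_search(text: str, pattern: str, k: int) -> list[int]:
    """Start positions where the pattern occurs with at most k mismatches.

    Pigeonhole filter: with k + 1 pieces, at least one piece of any valid
    occurrence matches exactly, so exact hits of each piece give candidates
    that are then verified by Hamming distance.
    """
    m = len(pattern)
    if k < 0 or k + 1 > m:
        raise ValueError("need 0 <= k < len(pattern) so every piece is non-empty")
    hits = set()
    for offset, piece in split_pattern(pattern, k + 1):
        start = text.find(piece)
        while start != -1:
            candidate = start - offset  # implied start of the whole pattern
            if 0 <= candidate <= len(text) - m:
                window = text[candidate:candidate + m]
                if sum(a != b for a, b in zip(pattern, window)) <= k:
                    hits.add(candidate)
            start = text.find(piece, start + 1)
    return sorted(hits)
```

On the toy text `GATTACAGATTTCA`, searching for `GATTACA` with k = 1 reports positions 0 (exact) and 7 (one mismatch). An index-based search scheme replaces the text scan with index lookups and bounds the errors allowed per piece, which is where the optimisation described in the abstract applies.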
