BDD-BASED ANALYSIS OF GAPPED q-GRAM FILTERS

Recently, there has been a surge of interest in gapped q-gram filters for approximate string matching. Important design parameters for filters are, for example, the value of q, the filter threshold, and in particular the shape (also known as the seed) of the filter. A good choice of parameters can improve the performance of a q-gram filter by orders of magnitude, and optimizing these parameters is a nontrivial combinatorial problem. We describe a new method for analyzing gapped q-gram filters. This method is simple and generic. It applies to a variety of filters, overcomes many restrictions that are present in existing algorithms, and can easily be extended to new filter variants. To implement our approach, we use an extended version of BDDs (Binary Decision Diagrams), a data structure that efficiently represents sets of bit-strings. In a second step, we define a new class of multi-shape filters and analyze these filters with the BDD-based approach. Experiments show that multi-shape filters can outperform the best single-shape filters currently in use in many respects. The BDD-based algorithm is crucial for the design and analysis of these new and better multi-shape filters. Our results apply to the k-mismatches problem, i.e., approximate string matching with Hamming distance.
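
To make the notion of a filter threshold concrete, here is a minimal brute-force sketch in Python. It is not the BDD-based method of the paper: it simply enumerates every placement of k mismatches in a window of length m and counts how many placements of the shape stay untouched in the worst case. The function name `filter_threshold` and the encoding of a shape as a tuple of offsets are assumptions made for illustration.

```python
from itertools import combinations


def filter_threshold(shape, m, k):
    """Brute-force threshold of a (gapped) q-gram shape for k mismatches.

    shape: offsets of the sampled positions, e.g. (0, 1, 3) for a
           shape that reads positions 0, 1 and 3 (assumed encoding)
    m:     length of the matching window
    k:     number of mismatches allowed (Hamming distance)
    """
    span = max(shape) + 1
    if span > m or k > m:
        raise ValueError("shape and k must fit inside the window")
    # Every placement of the shape in the window, as the set of
    # window positions that placement reads.
    placements = [frozenset(start + off for off in shape)
                  for start in range(m - span + 1)]
    worst = len(placements)
    # Try every way to place k mismatches; a placement survives
    # only if it reads none of the mismatching positions.
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        surviving = sum(1 for p in placements if p.isdisjoint(bad))
        worst = min(worst, surviving)
    return worst
```

For the contiguous shape `(0, 1, 2)` with m = 10 and k = 2, this returns 2, which agrees with the classical q-gram lemma value m - q + 1 - kq. The enumeration grows combinatorially with m and k, which is the cost a more compact representation such as a BDD is meant to avoid.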

Position-restricted approximate string matching with metric Hamming distance

2017 IEEE International Conference on Big Data and Smart Computing (BigComp) ◽

10.1109/bigcomp.2017.7881724 ◽

2017 ◽

Author(s):

Sung-Hwan Kim ◽

Hwan-Gue Cho

Keyword(s):

Hamming Distance ◽

String Matching ◽

Approximate String Matching

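Studi Perbandingan Algoritma Pencarian String dalam Metode Approximate String Matching untuk Identifikasi Kesalahan Pengetikan Teks (A Comparative Study of String Search Algorithms in the Approximate String Matching Method for Identifying Text Typing Errors)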

Jurnal Buana Informatika ◽

10.24002/jbi.v7i2.491 ◽

2016 ◽

Vol 7 (2) ◽

Cited By ~ 1

Author(s):

Yeny Rochmawati ◽

Retno Kusumaningrum

Keyword(s):

Hamming Distance ◽

String Matching ◽

Mean Average Precision ◽

Levenshtein Distance ◽

Approximate String Matching ◽

Average Precision ◽

Relevance Judgments ◽

Typing Error ◽

The Mean ◽

Distance Hamming

Abstract. Typing errors that turn standard words into non-standard words are often caused by misspelling. This can be addressed by developing a system that identifies typing errors. Approximate string matching is a widely used method for identifying typing errors, and it relies on string search algorithms such as Levenshtein Distance, Hamming Distance, Damerau-Levenshtein Distance, and Jaro-Winkler Distance. However, no study has compared the performance of these four algorithms for the Indonesian language. This research therefore compares the four algorithms to determine which is the most accurate and precise in string search across various typing errors. Evaluation uses users' relevance judgments, which yield the mean average precision (MAP) used to determine the best algorithm. The results show that the Jaro-Winkler Distance algorithm performs best at word checking, with a MAP value of 0.87 when identifying the typing errors in 50 incorrect words.

Keywords: Typing errors, Levenshtein, Hamming, Damerau-Levenshtein, Jaro-Winkler
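
As a point of reference for two of the measures named in this abstract, the following is a minimal Python sketch of Hamming distance and Levenshtein distance. It is a textbook formulation, not the implementation evaluated in the study; the function names are chosen here for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn a into b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, x in enumerate(a, 1):
        cur = [i]                   # distance from a[:i] to ""
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or keep
        prev = cur
    return prev[-1]
```

Hamming distance only compares strings of equal length, so it cannot account for a dropped or extra character, whereas Levenshtein distance can. For example, `levenshtein("kitten", "sitting")` is 3.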

Correction to: New algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance

The Journal of Supercomputing ◽

10.1007/s11227-018-2324-7 ◽

2018 ◽

Vol 74 (5) ◽

pp. 1835-1835

Author(s):

ThienLuan Ho ◽

Seung-Rohk Oh ◽

HyunJin Kim

Keyword(s):

Hamming Distance ◽

String Matching ◽

Approximate String Matching ◽

Fixed Length ◽

New Algorithms

Generalised Implementation for Fixed-Length Approximate String Matching under Hamming Distance and Applications

2015 IEEE International Parallel and Distributed Processing Symposium Workshop ◽

10.1109/ipdpsw.2015.106 ◽

2015 ◽

Cited By ~ 4

Author(s):

Solon Pissis ◽

Ahmad Retha

Keyword(s):

Hamming Distance ◽

String Matching ◽

Approximate String Matching ◽

Fixed Length

New algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance

The Journal of Supercomputing ◽

10.1007/s11227-017-2192-6 ◽

2017 ◽

Vol 74 (5) ◽

pp. 1815-1834 ◽

Cited By ~ 1

Author(s):

ThienLuan Ho ◽

Seung-Rohk Oh ◽

HyunJin Kim

Keyword(s):

Hamming Distance ◽

String Matching ◽

Approximate String Matching ◽

Fixed Length ◽

New Algorithms

Business Process Automation: A Workflow Incorporating Optical Character Recognition and Approximate String and Pattern Matching for Solving Practical Industry Problems

Applied System Innovation ◽

10.3390/asi2040033 ◽

2019 ◽

Vol 2 (4) ◽

pp. 33 ◽

Cited By ~ 2

Author(s):

Coenrad de Jager ◽

Marinda Nel

Keyword(s):

Character Recognition ◽

Optical Character Recognition ◽

Business Processes ◽

String Matching ◽

Image Data ◽

Second Step ◽

Approximate String Matching ◽

Process Automation ◽

Optical Character ◽

Image Digitization

Companies are relying more on artificial intelligence and machine learning in order to enhance and automate existing business processes. While the power of OCR (Optical Character Recognition) technologies can be harnessed for the digitization of image data, the digitized text still needs to be validated and enhanced to ensure that data quality standards are met for the data to be usable. This research paper focuses on finding and creating an automated workflow that can follow image digitization and produce a dictionary consisting of the desired information. The workflow introduced consists of a three-step process that is implemented after the OCR output has been generated. With the introduction of each step, the accuracy of key-value matches of field names and values is increased. The first step takes the raw OCR output and identifies field names using exact string matching and field values using regular expressions from an externally maintained file. The second step introduces index pairing, which matches field values to field names based on the location of the field name and value on the document. Finally, approximate string matching is introduced to the workflow, which increases accuracy. By implementing these steps, the F-measure for key-value pair matches is measured at 60.18% in the first step, 80.61% once index pairing is introduced, and finally 90.06% after approximate string matching is introduced. The research showed that accurate, usable data can be obtained automatically from images with the implementation of a workflow after OCR.
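
The final step of such a workflow, matching a noisy OCR token to a known field name, might look like the sketch below. The abstract does not say which similarity measure the authors used, so this example substitutes the similarity ratio from Python's standard `difflib` module. The function name `match_field`, the example field names, and the 0.8 cutoff are all hypothetical.

```python
import difflib


def match_field(ocr_text, known_fields, cutoff=0.8):
    """Return the known field name closest to an OCR token, or None.

    Comparison is case-insensitive; `cutoff` is the minimum difflib
    similarity ratio (0..1) for a match to be accepted (assumed value).
    """
    # Map lower-cased names back to their canonical spelling.
    lookup = {name.lower(): name for name in known_fields}
    hits = difflib.get_close_matches(
        ocr_text.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None
```

With the hypothetical field list `["Invoice Number", "Date", "Total"]`, the misread token `"Invoce Numbr"` resolves to `"Invoice Number"`, while an unrelated token falls below the cutoff and yields `None`. Exact string matching alone would miss the misread token entirely.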

Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

10.1101/301085 ◽

2018 ◽

Cited By ~ 1

Author(s):

Kiavash Kianfar ◽

Christopher Pockrandt ◽

Bahman Torkamandi ◽

Haochen Luo ◽

Knut Reinert

Keyword(s):

Dynamic Programming ◽

Optimization Problem ◽

Ad Hoc ◽

Hamming Distance ◽

String Matching ◽

Integer Program ◽

Mixed Integer ◽

Mixed Integer Program ◽

Approximate String Matching ◽

Optimal Search

Abstract. Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard, allowing the search to start from anywhere within the pattern and extend in both directions. In particular, use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Here, for the first time, we propose a mixed integer program (MIP) capable of solving this optimization problem for Hamming distance with a given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in the bidirectional FM-index over previous ad hoc solutions. For example, approximate matching of 101-bp Illumina reads (with two errors) becomes 35 times faster than standard backtracking. Moreover, despite being performed purely in the index, the running time of search using our optimal schemes (for up to two errors) is comparable to the best state-of-the-art aligners, which benefit from combining search in the index with in-text verification using dynamic programming. As a result, we anticipate that a full-fledged aligner employing an intelligent combination of search in the bidirectional FM-index using our optimal search schemes and in-text verification using dynamic programming will outperform today’s best aligners. The development of such an aligner, called FAMOUS (Fast Approximate string Matching using OptimUm search Schemes), is ongoing as our future work.
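
The idea of partitioning the pattern into pieces can be illustrated with the simplest such strategy, the pigeonhole principle: if a pattern is cut into k + 1 pieces and matches with at most k mismatches, at least one piece must match exactly. The Python sketch below is not the MIP-derived search schemes of the paper and uses no FM-index; it finds exact piece hits with plain `str.find` and then verifies each candidate in the text. The function name `k_mismatch_occurrences` is an assumption made for illustration.

```python
def k_mismatch_occurrences(text, pattern, k):
    """Start positions where pattern matches text with at most k
    mismatches (Hamming distance), found by pigeonhole filtering."""
    m = len(pattern)
    pieces = k + 1
    if m < pieces:
        raise ValueError("pattern must have at least k + 1 characters")
    # Cut the pattern into k + 1 non-empty, nearly equal pieces.
    bounds = [m * i // pieces for i in range(pieces + 1)]
    found = set()
    for i in range(pieces):
        lo, hi = bounds[i], bounds[i + 1]
        piece = pattern[lo:hi]
        hit = text.find(piece)
        while hit != -1:
            start = hit - lo  # where the whole pattern would begin
            if 0 <= start <= len(text) - m:
                window = text[start:start + m]
                # In-text verification of the candidate position.
                if sum(a != b for a, b in zip(pattern, window)) <= k:
                    found.add(start)
            hit = text.find(piece, hit + 1)
    return sorted(found)
```

For the text `"ACGTACGTTACGA"`, the pattern `"ACGA"` and k = 1, this reports the positions 0, 4 and 9. Search schemes generalize the same idea by also allowing bounded numbers of errors inside the pieces and by choosing the order in which the pieces are searched.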

Binary Amplitude Reflection Gratings for X-ray Shearing and Hartmann Wavefront Sensors

Sensors ◽

10.3390/s21020536 ◽

2021 ◽

Vol 21 (2) ◽

pp. 536

Author(s):

Kenneth A. Goldberg ◽

Antoine Wojdyla ◽

Diane Bryant

Keyword(s):

Operating Conditions ◽

Design Parameters ◽

Wavefront Aberration ◽

Light Sources ◽

Wavefront Sensors ◽

X Ray ◽

New Class ◽

Coherent Wave ◽

Reflection Gratings ◽

Dynamic Operating

New, high-coherent-flux X-ray beamlines at synchrotron and free-electron laser light sources rely on wavefront sensors to achieve and maintain optimal alignment under dynamic operating conditions. This includes feedback to adaptive X-ray optics. We describe the design and modeling of a new class of binary-amplitude reflective gratings for shearing interferometry and Hartmann wavefront sensing. Compact arrays of deeply etched gratings illuminated at glancing incidence can withstand higher power densities than transmission membranes and can be designed to operate across a broad range of photon energies with a fixed grating-to-detector distance. Coherent wave-propagation is used to study the energy bandwidth of individual elements in an array and to set the design parameters. We observe that shearing operates well over a ±10% bandwidth, while Hartmann can be extended to ±30% or more, in our configuration. We apply this methodology to the design of a wavefront sensor for a soft X-ray beamline operating from 230 eV to 1400 eV and model shearing and Hartmann tests in the presence of varying wavefront aberration types and magnitudes.
