Suffix array for multi-pattern matching with variable length wildcards

Intelligent Data Analysis ◽

10.3233/ida-205087 ◽

2021 ◽

Vol 25 (2) ◽

pp. 283-303

Author(s):

Na Liu ◽

Fei Xie ◽

Xindong Wu

Keyword(s):

Dynamic Programming ◽

Data Structure ◽

Pattern Matching ◽

Edit Distance ◽

State Of The Art ◽

Suffix Array ◽

Variable Length ◽

Distance Method ◽

Efficient Data ◽

Comparison Algorithms

Approximate multi-pattern matching is an important and widely used operation, particularly when the patterns contain variable-length wildcards. This paper proposes two suffix array-based algorithms for this problem. Existing studies have shown the suffix array to be an efficient data structure for exact string matching, as well as for approximate pattern matching and multi-pattern matching. The first algorithm, MMSA-S, handles the short exact characters in a pattern by dynamic programming, while the second, MMSA-L, handles the long exact characters by the edit distance method. Experimental results on the Pizza & Chili corpus demonstrate that, in most cases, the two proposed algorithms are more time-efficient than the state-of-the-art comparison algorithms.
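To make concrete why a suffix array suits multi-pattern matching, here is a minimal sketch of exact matching by binary search over a suffix array. It is not MMSA-S or MMSA-L and handles no wildcards; the function names (`build_suffix_array`, `find_occurrences`, `match_all`) are hypothetical, and the naive construction is chosen for brevity rather than speed.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Simple but slow; real indexes use linear-time builders.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    # Suffixes are sorted, so their length-m prefixes are non-decreasing.
    # Two binary searches bound the block of suffixes starting with pattern.
    m = len(pattern)

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    hi = len(sa)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


def match_all(text, patterns):
    # The index is built once and reused for every pattern.
    sa = build_suffix_array(text)
    return {p: find_occurrences(text, sa, p) for p in patterns}


if __name__ == "__main__":
    print(match_all("banana", ["ana", "na", "x"]))
```

Building the index once and querying it per pattern is the property that makes the structure attractive when many patterns are searched in the same text.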

An Efficient Data Structure for Maxplus Merge in Dynamic Programming

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems ◽

10.1109/tcad.2006.882479 ◽

2006 ◽

Vol 25 (12) ◽

pp. 3004-3009 ◽

Cited By ~ 2

Author(s):

Ruiming Chen ◽

Hai Zhou

Keyword(s):

Dynamic Programming ◽

Data Structure ◽

Efficient Data

Dynamic Generalized Suffix Arrays

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.263-266.1398 ◽

2012 ◽

Vol 263-266 ◽

pp. 1398-1401

Author(s):

Song Feng Lu ◽

Hua Zhao

Keyword(s):

Data Structure ◽

Pattern Matching ◽

Time Complexity ◽

Document Retrieval ◽

Suffix Array ◽

Index Structure ◽

Suffix Arrays ◽

Basic Task ◽

Insertion And Deletion ◽

Dynamic Version

Document retrieval is the basic task of search engines and has attracted considerable attention from the pattern matching community. In this paper, we focus on the dynamic version of this problem, in which text insertion and deletion are allowed. Using the generalized suffix array and other data structures, we propose a new index structure. Our scheme achieves better time complexity than existing ones, at the cost of slightly more space overhead.

ADDMC: Weighted Model Counting with Algebraic Decision Diagrams

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i02.5505 ◽

2020 ◽

Vol 34 (02) ◽

pp. 1468-1476

Author(s):

Jeffrey Dudek ◽

Vu Phan ◽

Moshe Vardi

Keyword(s):

Dynamic Programming ◽

Data Structure ◽

Normal Form ◽

Standard Model ◽

State Of The Art ◽

Conjunctive Normal Form ◽

Decision Diagrams ◽

Boolean Formulas ◽

Model Counting ◽

Weighted Model

We present an algorithm to compute exact literal-weighted model counts of Boolean formulas in Conjunctive Normal Form. Our algorithm employs dynamic programming and uses Algebraic Decision Diagrams as the main data structure. We implement this technique in ADDMC, a new model counter. We empirically evaluate various heuristics that can be used with ADDMC. We then compare ADDMC to four state-of-the-art weighted model counters (Cachet, c2d, d4, and miniC2D) on 1914 standard model counting benchmarks and show that ADDMC significantly improves the virtual best solver.
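For readers unfamiliar with the quantity being computed, the following brute-force sketch spells out what a literal-weighted model count is. It is not ADDMC and uses no decision diagrams or dynamic programming; it enumerates all assignments, which is exactly the exponential cost such tools aim to avoid. The function name, the DIMACS-style clause encoding, and the weight dictionary layout are assumptions made for illustration.

```python
from itertools import product


def weighted_model_count(clauses, weights, num_vars):
    # clauses: list of clauses, each a list of nonzero ints
    #          (v means variable v is true, -v means it is false).
    # weights: dict mapping every literal v and -v to a number.
    total = 0.0
    for bits in product([False, True], repeat=num_vars):
        # An assignment is a model if every clause has a satisfied literal.
        satisfied = all(
            any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        )
        if not satisfied:
            continue
        # Weight of a model = product of the weights of its literals.
        w = 1.0
        for v in range(1, num_vars + 1):
            w *= weights[v] if bits[v - 1] else weights[-v]
        total += w
    return total


if __name__ == "__main__":
    # (x1 OR x2) with independent literal weights.
    weights = {1: 0.3, -1: 0.7, 2: 0.6, -2: 0.4}
    print(weighted_model_count([[1, 2]], weights, 2))
```

With every literal weight set to 1, the same function returns the plain number of satisfying assignments.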

Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds

10.1101/2021.11.05.467453 ◽

2021 ◽

Author(s):

Pesho Ivanov ◽

Benjamin Bichsel ◽

Martin Vechev

Keyword(s):

Dynamic Programming ◽

Edit Distance ◽

Reference Genome ◽

State Of The Art ◽

Optimal Alignment ◽

Reference Mark ◽

A* Algorithm ◽

Optimal Sequence ◽

E Coli ◽

Graph Alignment

We present a novel A* seed heuristic enabling fast and optimal sequence-to-graph alignment, guaranteed to minimize the edit distance of the alignment assuming non-negative edit costs. We phrase optimal alignment as a shortest path problem and solve it by instantiating the A* algorithm with our novel seed heuristic. The key idea of the seed heuristic is to extract seeds from the read, locate them in the reference, mark preceding reference positions by crumbs, and use the crumbs to direct the A* search. We prove admissibility of the seed heuristic, thus guaranteeing alignment optimality. Our implementation extends the free and open source AStarix aligner and demonstrates that the seed heuristic outperforms all state-of-the-art optimal aligners including GraphAligner, Vargas, PaSGAL, and the prefix heuristic previously employed by AStarix. Specifically, we achieve a consistent speedup of >60x on both short Illumina reads and long HiFi reads (up to 25kbp), on both the E. coli linear reference genome (1Mbp) and the MHC variant graph (5Mbp). Our speedup is enabled by the seed heuristic consistently skipping >99.99% of the table cells that optimal aligners based on dynamic programming compute.

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

10.1101/472423 ◽

2018 ◽

Author(s):

Alan Kuhnle ◽

Taher Mun ◽

Christina Boucher ◽

Travis Gagie ◽

Ben Langmead ◽

...

Keyword(s):

Data Structure ◽

State Of The Art ◽

Suffix Array ◽

Genomic Databases ◽

Run Length ◽

Slowing Down ◽

Human Genomes ◽

Efficient Construction ◽

Main Components ◽

Burrows Wheeler Transform

While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (BWT) of the string, which allows us to find the interval in the string’s suffix array (SA) containing pointers to the starting positions of occurrences of a given pattern; second, a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases by run-length compressing the BWT, but until recently there was no known means to keep the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building Gagie et al.’s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to indexing partial and whole human genomes, and show that it improves over Bowtie with respect to both memory and time. Availability: The implementation of our methods can be found here: https://github.com/alshai/r-index.
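The role of the rank structure over the BWT can be illustrated with a small backward-search sketch. This is a textbook-style toy, not the r-index implementation linked above: it stores the full suffix array instead of a sample, computes rank by a linear scan, and does no run-length compression. The names `bwt_from_text` and `backward_search` are hypothetical, and the sketch assumes the sentinel `$` does not occur in the text and sorts before every text character.

```python
def bwt_from_text(text):
    # Append a sentinel assumed to be absent from text and smaller than
    # every character in it, then sort the suffixes naively.
    s = text + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT[k] is the character preceding the k-th smallest suffix
    # (s[-1] wraps around to the sentinel for the suffix at position 0).
    bwt = "".join(s[i - 1] for i in sa)
    return sa, bwt


def backward_search(bwt, pattern):
    # C[c] = number of characters in the BWT strictly smaller than c.
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def rank(c, i):
        # Occurrences of c in bwt[:i]; a real index answers this quickly.
        return bwt.count(c, 0, i)

    # Maintain the half-open SA interval [lo, hi) of suffixes that start
    # with the part of the pattern processed so far (right to left).
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return (0, 0)
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return (0, 0)
    return (lo, hi)


if __name__ == "__main__":
    sa, bwt = bwt_from_text("banana")
    lo, hi = backward_search(bwt, "ana")
    print(bwt, sorted(sa[lo:hi]))
```

The interval returned by `backward_search` only counts occurrences; turning it into text positions needs the suffix array entries, which is why the size of the SA sample matters for large collections.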

An efficient General Variable Neighborhood Search for large Travelling Salesman Problem with Time Windows

Yugoslav journal of operations research ◽

10.2298/yjor120530015m ◽

2013 ◽

Vol 23 (1) ◽

pp. 19-30 ◽

Cited By ~ 22

Author(s):

Nenad Mladenovic ◽

Raca Todosijevic ◽

Dragan Urosevic

Keyword(s):

Data Structure ◽

Variable Neighborhood Search ◽

State Of The Art ◽

Travelling Salesman Problem ◽

Time Windows ◽

Neighborhood Search ◽

Travelling Salesman ◽

Large Size ◽

Efficient Data ◽

Checking Procedure

General Variable Neighborhood Search (GVNS) has been shown to be a powerful and robust methodology for solving travelling salesman and vehicle routing problems. However, an efficient implementation may play a significant role in solving large instances. In this paper we suggest a new GVNS heuristic for solving the travelling salesman problem with time windows. It uses a different set of neighborhoods, a new feasibility checking procedure, and a more efficient data structure than the recent GVNS method that can be considered a state-of-the-art heuristic. As a result, our GVNS is much faster and more effective than the previous GVNS. It is able to improve 14 out of 25 best known solutions for large test instances from the literature.

A novel optimal multi-pattern matching method with wildcards for DNA sequence

Technology and Health Care ◽

10.3233/thc-218012 ◽

2021 ◽

Vol 29 ◽

pp. 115-124

Author(s):

Xinlu Wang ◽

Ahmed A.F. Saif ◽

Dayou Liu ◽

Yungang Zhu ◽

Jon Atli Benediktsson

Keyword(s):

Dna Sequence ◽

Pattern Matching ◽

Health Informatics ◽

State Of The Art ◽

Machine Language ◽

Data Sets ◽

Fundamental Issue ◽

Matching Method ◽

Dna Sequence Alignment ◽

The Given

BACKGROUND: DNA sequence alignment is one of the most fundamental and important operations for identifying which gene family may contain a given sequence, and pattern matching for DNA sequences has been a fundamental issue in biomedical engineering, biotechnology and health informatics. OBJECTIVE: To solve this problem, this study proposes an optimal multi-pattern matching method with wildcards for DNA sequences. METHODS: The proposed method packs the patterns and a sliding window of the text; the window slides along the packed text, matching against the stored packed patterns. RESULTS: Three data sets are used to test the performance of the proposed algorithm, which proved more efficient than its competitors because its operations are close to machine language. CONCLUSIONS: Theoretical analysis and experimental results both demonstrate that the proposed method outperforms the state-of-the-art methods and is especially effective for DNA sequences.

Efficient Data Transmission Method Considering Hierarchical Data Structure

Proceedings of the 2020 International Conference on Computer Communication and Information Systems ◽

10.1145/3418994.3418996 ◽

2020 ◽

Author(s):

SeoYoon Jang ◽

JiHoon Kang

Keyword(s):

Data Structure ◽

Data Transmission ◽

Hierarchical Data ◽

Transmission Method ◽

Hierarchical Data Structure ◽

Efficient Data

Cache-efficient sweeping-based interval joins for extended Allen relation predicates

The VLDB Journal ◽

10.1007/s00778-020-00650-5 ◽

2021 ◽

Author(s):

Danila Piatov ◽

Sven Helmer ◽

Anton Dignös ◽

Fabio Persia

Keyword(s):

Data Structure ◽

Experimental Evaluation ◽

State Of The Art ◽

Temporal Databases ◽

Access Method ◽

Wide Range ◽

Interval Relation ◽

Cache Efficient ◽

Join Algorithms ◽

Better Than

We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen’s relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
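The general idea of a plane-sweep interval join can be sketched for the simplest predicate, plain overlap. This is a generic textbook-style sweep, not the authors' framework: it uses neither the Timeline Index nor the gapless hash map, and it covers no other Allen relation. The function name `overlap_join` and the half-open `(start, end)` tuple representation are assumptions made for illustration.

```python
def overlap_join(r, s):
    # r, s: lists of half-open intervals (start, end) with start < end.
    # Returns every pair (a, b), a from r and b from s, that overlaps.
    events = [(iv[0], 0, iv) for iv in r] + [(iv[0], 1, iv) for iv in s]
    events.sort()  # sweep the start points from left to right

    active = ([], [])  # intervals from r and from s still open at the sweep line
    result = []
    for start, side, iv in events:
        other = 1 - side
        # Drop intervals of the other relation that ended at or before start.
        active[other][:] = [o for o in active[other] if o[1] > start]
        # Every remaining one began no later than iv and is still open,
        # so it overlaps iv.
        for o in active[other]:
            result.append((iv, o) if side == 0 else (o, iv))
        active[side].append(iv)
    return result


if __name__ == "__main__":
    r = [(1, 5), (6, 9)]
    s = [(4, 7), (9, 12)]
    print(sorted(overlap_join(r, s)))
```

Each overlapping pair is reported exactly once, when the later-starting interval of the pair is reached by the sweep. The list-based active sets are the part that a cache-conscious structure would replace.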

Efficient data structure for representing and simplifying simplicial complexes in high dimensions

Proceedings of the 27th annual ACM symposium on Computational geometry - SoCG '11 ◽

10.1145/1998196.1998277 ◽

2011 ◽

Cited By ~ 6

Author(s):

Dominique Attali ◽

André Lieutier ◽

David Salinas

Keyword(s):

Data Structure ◽

Simplicial Complexes ◽

High Dimensions ◽

Efficient Data
