suffix array Latest Research Papers

An optimized FM-index library for nucleotide and amino acid search

Abstract. Background: Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementing the FM-index is reasonably complicated, so wider adoption would be aided by the availability of a fast and flexible FM-index library. Results: We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index’s suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3’s FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is ~2–4x faster than SeqAn3 for nucleotide search and ~2–6x faster for amino acid search; it is also ~4x faster with a similar memory footprint when the suffix array is stored on an SSD. Conclusions: AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high level (count or locate all instances of a query string) and a low level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.
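
As a rough illustration of the backward-search process and the occurrence function mentioned in the abstract, the following Python sketch builds a naive, uncompressed FM-index and uses it for count and locate queries. It does not reflect AWFM-index's API or its AVX2 strided bit-vector layout. The function names (`build_fm_index`, `backward_search`, `locate`) are hypothetical, the suffix array is built by plain sorting, and `$` is assumed not to occur in the text.

```python
def build_fm_index(text):
    """Build a naive FM-index: suffix array, C table, and occurrence table."""
    assert "$" not in text  # assumption: "$" is reserved as the sentinel
    s = text + "$"  # the sentinel sorts below the letters used here
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix sorting
    bwt = "".join(s[i - 1] for i in sa)  # s[-1] is "$" when i == 0
    alphabet = sorted(set(s))
    # c[ch] = number of symbols in s that are strictly smaller than ch
    c, total = {}, 0
    for ch in alphabet:
        c[ch] = total
        total += s.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, c, occ


def backward_search(pattern, c, occ, n):
    """Return the half-open suffix array interval [lo, hi) matching pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):  # consume the pattern right to left
        if ch not in c:
            return 0, 0
        lo = c[ch] + occ[ch][lo]
        hi = c[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def locate(pattern, sa, c, occ):
    """Return the sorted text positions at which pattern occurs."""
    lo, hi = backward_search(pattern, c, occ, len(sa))
    return sorted(sa[lo:hi])
```

Each backward-search step needs only two occurrence lookups, which is why the number of steps depends on the pattern length rather than the text length. In this sketch, `hi - lo` gives the match count and `locate` reads the positions from the stored suffix array.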

Computation of the suffix array, Burrows-Wheeler transform and FM-index in V-order

Theoretical Computer Science, DOI 10.1016/j.tcs.2021.06.004, 2021

Author(s): Jacqueline W. Daykin, Neerja Mhaskar, W.F. Smyth

Keyword(s): Suffix Array, Burrows Wheeler Transform

Suffix array for multi-pattern matching with variable length wildcards

Intelligent Data Analysis, DOI 10.3233/ida-205087, 2021, Vol 25 (2), pp. 283-303

Author(s): Na Liu, Fei Xie, Xindong Wu

Keyword(s): Dynamic Programming, Data Structure, Pattern Matching, Edit Distance, State Of The Art, Suffix Array, Variable Length, Distance Method, Efficient Data, Comparison Algorithms

Approximate multi-pattern matching is an important and widely used problem, particularly when patterns contain variable-length wildcards. This paper proposes two suffix array-based algorithms to solve it. In existing studies, the suffix array is an efficient data structure for exact string matching, as well as for approximate pattern matching and multi-pattern matching. The first algorithm, MMSA-S, handles short runs of exact characters in a pattern using dynamic programming, while the second, MMSA-L, handles long runs of exact characters using the edit distance method. Experimental results on the Pizza & Chili corpus show that, in most cases, the two proposed algorithms are more time-efficient than the state-of-the-art algorithms they are compared against.
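
To make the setting concrete, here is a minimal Python sketch of matching a single pattern made of exact segments separated by variable-length wildcard gaps, using a suffix array to find each segment. This is neither MMSA-S nor MMSA-L, and it handles only exact segments and a single pattern. The function names and the `(min, max)` gap representation are assumptions made for illustration.

```python
from bisect import bisect_left


def build_suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, segment):
    """Sorted start positions of segment, found by binary search on sa."""
    m = len(segment)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= segment
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < segment:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-prefix is > segment
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= segment:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def match_with_wildcards(text, segments, gaps):
    """Start positions of segments[0] from which the whole pattern matches.

    gaps[k] = (min, max) number of wildcard characters allowed between
    segments[k] and segments[k + 1].
    """
    sa = build_suffix_array(text)
    valid = occurrences(text, sa, segments[-1])
    # Work backwards: keep a start of segment k only if some valid start
    # of segment k + 1 lies within the allowed gap after it.
    for k in range(len(segments) - 2, -1, -1):
        gap_min, gap_max = gaps[k]
        kept = []
        for p in occurrences(text, sa, segments[k]):
            end = p + len(segments[k])
            j = bisect_left(valid, end + gap_min)
            if j < len(valid) and valid[j] <= end + gap_max:
                kept.append(p)
        valid = kept
    return valid
```

For example, `match_with_wildcards("abxxcdabycd", ["ab", "cd"], [(1, 2)])` asks for `ab`, then one or two arbitrary characters, then `cd`.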

Designing efficient algorithms for querying large corpora

Oslo Studies in Language, DOI 10.5617/osla.8504, 2021, Vol 11 (2), pp. 283-302

Author(s): Paul Meurer

Keyword(s): Regular Expression, Linear Time, Suffix Array, Efficient Algorithms, Regular Expressions, Efficient Treatment, Suffix Arrays, Regular Expression Matching, Finite State, Query System

I describe several new efficient algorithms for querying large annotated corpora. The search algorithms implemented in several popular corpus search engines are suboptimal in two respects: regular expression string matching in the lexicon is done in linear time, and regular expressions over corpus positions are evaluated starting in those corpus positions that match the constraints of the initial edges of the corresponding network. To address these shortcomings, I have developed an algorithm for regular expression matching on suffix arrays that allows fast lexicon lookup, and a technique for running finite state automata from the edges with the lowest corpus counts. The implementation of the lexicon as a suffix array also lends itself to an elegant and efficient treatment of multi-valued and set-valued attributes. The described techniques have been implemented in a fully functional corpus management system and are also used in a treebank query system.
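
The idea of starting evaluation from the edge with the lowest corpus count can be illustrated with a much simpler case than a full automaton: a fixed token sequence. The Python sketch below is an assumption-laden simplification, not the paper's algorithm. The names `build_index` and `find_sequence` are hypothetical, and a plain inverted index stands in for the suffix array lexicon.

```python
from collections import defaultdict


def build_index(corpus):
    """Map each token to the sorted list of corpus positions where it occurs."""
    index = defaultdict(list)
    for pos, token in enumerate(corpus):
        index[token].append(pos)
    return index


def find_sequence(corpus, index, query):
    """Return the start positions at which the token sequence query occurs.

    Instead of scanning every occurrence of query[0], anchor the search at
    the query token with the fewest corpus occurrences and verify the rest.
    """
    anchor = min(range(len(query)), key=lambda k: len(index.get(query[k], ())))
    hits = []
    for pos in index.get(query[anchor], ()):
        start = pos - anchor
        if start < 0 or start + len(query) > len(corpus):
            continue  # the query would run off either end of the corpus
        if corpus[start:start + len(query)] == query:
            hits.append(start)
    return hits
```

With the corpus `"the cat sat on the mat the cat ran".split()` and the query `["the", "cat", "ran"]`, only the single occurrence of `ran` is verified, rather than all three occurrences of `the`.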

Building and Checking Suffix Array Simultaneously by Induced Sorting Method

IEEE Transactions on Computers, DOI 10.1109/tc.2021.3061709, 2021, pp. 1-1

Author(s): Bin Lao, Yi Wu, Ge Nong, Wai Hong Chan

Keyword(s): Suffix Array, Sorting Method

Computing Maximal Lyndon Substrings of a String

Algorithms, DOI 10.3390/a13110294, 2020, Vol 13 (11), pp. 294

Author(s): Frantisek Franek, Michael Liut

Keyword(s): Theoretical Analysis, Suffix Array, Input String, Sorting Algorithm, Worst Case, Average Case, Linear Algorithm, Case Complexity, Worst Case Complexity, Novel Algorithm

There are two reasons to want an efficient algorithm for identifying all right-maximal Lyndon substrings of a string. First, Bannai et al. introduced in 2015 a linear algorithm to compute all runs of a string that relies on knowing all right-maximal Lyndon substrings of the input string. Second, Franek et al. showed in 2017 a linear equivalence of sorting suffixes and sorting right-maximal Lyndon substrings of a string, inspired by a novel suffix sorting algorithm of Baier. In 2016, Franek et al. presented a brief overview of algorithms for computing the Lyndon array, which encodes the knowledge of right-maximal Lyndon substrings of the input string. Among those presented were two well-known algorithms for computing the Lyndon array: a quadratic in-place algorithm based on the iterated Duval algorithm for Lyndon factorization, and a linear algorithmic scheme based on linear suffix sorting, computing the inverse suffix array, and applying the next smaller value algorithm to it. Duval’s algorithm works for strings over any ordered alphabet, while linear suffix sorting requires a constant or an integer alphabet. The authors at that time were not aware of Baier’s algorithm. In 2017, our research group proposed a novel algorithm for the Lyndon array. Though the proposed algorithm is linear in the average case and has O(n log n) worst-case complexity, it is interesting as it emulates the fast Fourier algorithm’s recursive approach and introduces τ-reduction, which might be of independent interest. In 2018, we presented a linear algorithm to compute the Lyndon array of a string inspired by Phase I of Baier’s algorithm for suffix sorting. This paper presents the theoretical analysis of these two algorithms and provides empirical comparisons of both of their C++ implementations with respect to the iterated Duval algorithm.
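
The suffix-sorting scheme mentioned in the abstract (sort the suffixes, invert the suffix array, then apply next smaller value) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `lyndon_array` is hypothetical, and the suffix array is built by naive sorting, so the sketch as a whole is not linear-time, unlike the scheme it illustrates.

```python
def lyndon_array(s):
    """Return lam, where lam[i] is the length of the longest Lyndon
    substring of s that starts at position i."""
    n = len(s)
    # Naive suffix sorting; a linear-time construction would replace this.
    sa = sorted(range(n), key=lambda i: s[i:])
    # Inverse suffix array: isa[i] is the rank of the suffix starting at i.
    isa = [0] * n
    for rank, i in enumerate(sa):
        isa[i] = rank
    # Next smaller value on isa with a stack: the longest Lyndon substring
    # at i ends just before the first later suffix that ranks below suffix i.
    lam = [0] * n
    stack = []
    for j in range(n):
        while stack and isa[stack[-1]] > isa[j]:
            i = stack.pop()
            lam[i] = j - i
        stack.append(j)
    for i in stack:  # no smaller suffix to the right: extend to the end
        lam[i] = n - i
    return lam
```

The stack pass touches each position a constant number of times, so the overall cost is dominated by the suffix sorting step.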

Property Suffix Array with Applications in Indexing Weighted Sequences

Journal of Experimental Algorithmics, DOI 10.1145/3385898, 2020, Vol 25, pp. 1-16, Cited By ~ 1

Author(s): Panagiotis Charalampopoulos, Costas S. Iliopoulos, Chang Liu, Solon P. Pissis

Keyword(s): Suffix Array, Weighted Sequences

Sapling: accelerating suffix array queries with learned data models

Bioinformatics, DOI 10.1093/bioinformatics/btaa911, 2020, Cited By ~ 1

Author(s): Melanie Kirsche, Arun Das, Michael C Schatz

Keyword(s): Open Source, Sequence Alignment, Search Algorithm, Piecewise Linear, Network Models, Suffix Array, Binary Search, Data Models, Supplementary Information, Neural Network Models

Abstract. Motivation: As genomic data becomes more abundant, efficient algorithms and data structures for sequence alignment become increasingly important. The suffix array is a widely used data structure to accelerate alignment, but the binary search algorithm used to query it requires widespread memory accesses, causing a large number of cache misses on large datasets. Results: Here, we present Sapling, an algorithm for sequence alignment, which uses a learned data model to augment the suffix array and enable faster queries. We investigate different types of data models, providing an analysis of different neural network models as well as providing an open-source aligner with a compact, practical piecewise linear model. We show that Sapling outperforms both an optimized binary search approach and multiple widely used read aligners on a diverse collection of genomes, including human, bacteria and plants, speeding up the algorithm by more than a factor of two while adding <1% to the suffix array’s memory footprint. Availability and implementation: The source code and tutorial are available open-source at https://github.com/mkirsche/sapling. Supplementary information: Supplementary data are available at Bioinformatics online.
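
The general idea of augmenting a suffix array with a piecewise linear model can be sketched as follows: a small table maps the numeric value of a pattern's first k symbols to an approximate suffix array position, and a local search starting from that guess finds the exact position. This Python sketch is an illustration under stated assumptions, not Sapling's implementation. All names are hypothetical, the text is assumed to be non-empty and over the alphabet A, C, G, T, and `bins` is assumed to divide `4 ** k`.

```python
from bisect import bisect_left

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_key(s, k):
    """Base-4 value of the first k symbols of s, padded with 'A' if shorter."""
    v = 0
    for i in range(k):
        v = v * 4 + (CODE[s[i]] if i < len(s) else 0)
    return v


def build_model(text, k=4, bins=16):
    """Return the suffix array, the model knots, and the bin width."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    keys = [kmer_key(text[i:i + k], k) for i in sa]  # non-decreasing along sa
    width = 4 ** k // bins
    # knots[b] = number of suffixes whose key is below b * width
    knots = [bisect_left(keys, b * width) for b in range(bins + 1)]
    return sa, knots, width


def predict(pattern, knots, width, k):
    """Interpolate linearly between the two knots around the pattern's key."""
    v = kmer_key(pattern, k)
    b = v // width
    frac = (v - b * width) / width
    return int(knots[b] + frac * (knots[b + 1] - knots[b]))


def lower_bound_from(text, sa, pattern, guess):
    """First suffix array index whose m-prefix is >= pattern, searching
    outwards from guess with doubling steps, then by binary search."""
    m, n = len(pattern), len(sa)
    below = lambda i: text[sa[i]:sa[i] + m] < pattern
    guess = min(max(guess, 0), n - 1)
    step = 1
    if below(guess):  # the answer lies to the right of guess
        lo, hi = guess + 1, guess + step
        while hi < n and below(hi):
            lo, step = hi + 1, step * 2
            hi = guess + step
        hi = min(hi, n)
    else:  # the answer is at guess or to its left
        hi, lo = guess, guess - step
        while lo >= 0 and not below(lo):
            hi, step = lo, step * 2
            lo = guess - step
        lo = max(lo + 1, 0)
    while lo < hi:
        mid = (lo + hi) // 2
        if below(mid):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find(text, sa, knots, width, k, pattern):
    """Return the sorted positions at which pattern occurs in text."""
    m = len(pattern)
    i = lower_bound_from(text, sa, pattern, predict(pattern, knots, width, k))
    hits = []
    while i < len(sa) and text[sa[i]:sa[i] + m] == pattern:
        hits.append(sa[i])
        i += 1
    return sorted(hits)
```

The result is correct regardless of how accurate the model is; a better guess only shortens the local search, which is where the reduction in scattered memory accesses would come from.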

Computing Maximal Lyndon Substrings of a String

DOI 10.20944/preprints202009.0557.v1, 2020

Author(s): Frantisek Franek, Michael Liut

Keyword(s): Theoretical Analysis, Suffix Array, Input String, Sorting Algorithm, Worst Case, Average Case, Linear Algorithm, Case Complexity, Worst Case Complexity, Novel Algorithm

There are two reasons to want an efficient algorithm for identifying all maximal Lyndon substrings of a string. First, Bannai et al. introduced in 2015 a linear algorithm to compute all runs of a string that relies on knowing all maximal Lyndon substrings of the input string. Second, Franek et al. showed in 2017 a linear equivalence of sorting suffixes and sorting maximal Lyndon substrings of a string, inspired by a novel suffix sorting algorithm of Baier. In 2016, Franek et al. presented a brief overview of algorithms for computing the Lyndon array, which encodes the knowledge of maximal Lyndon substrings of the input string. Among those presented were two well-known algorithms for computing the Lyndon array: a quadratic in-place algorithm based on the iterated Duval's algorithm for Lyndon factorization, and a linear algorithmic scheme based on linear suffix sorting, computing the inverse suffix array, and applying the Next Smaller Value algorithm to it. Duval's algorithm works for strings over any ordered alphabet, while linear suffix sorting requires a constant or an integer alphabet. The authors at that time were not aware of Baier's algorithm. In 2017, our research group proposed a novel algorithm for the Lyndon array. Though the proposed algorithm is linear in the average case and has O(n log n) worst-case complexity, it is interesting as it emulates the fast Fourier algorithm's recursive approach and introduces tau-reduction, which might be of independent interest. In 2018, we presented a linear algorithm to compute the Lyndon array of a string inspired by Phase I of Baier's algorithm for suffix sorting. This paper presents a theoretical analysis of these two algorithms and provides empirical comparisons of both of their C++ implementations with respect to the iterated Duval's algorithm.

gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections

Algorithms for Molecular Biology, DOI 10.1186/s13015-020-00177-y, 2020, Vol 15 (1)

Author(s): Felipe A. Louza, Guilherme P. Telles, Simon Gog, Nicola Prezza, Giovanna Rosone

Keyword(s): Data Structures, Suffix Array, Data Indexing, Related Data, Suffix Arrays, Common Prefix, Different Types, Single String, Burrows Wheeler Transform, Longest Common Prefix Array

Abstract. Background: The construction of a suffix array for a collection of strings is a fundamental task in bioinformatics and in many other applications that process strings. Related data structures, such as the Longest Common Prefix array, the Burrows–Wheeler transform, and the document array, are often needed alongside the suffix array to efficiently solve a wide variety of problems. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms to construct suffix arrays for string collections. Result: In this paper we introduce gsufsort, an open-source software tool for constructing the suffix array and related data indexing structures for a string collection with N symbols in O(N) time. Our tool is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22–39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large fasta, fastq and text files with multiple strings as input. Experiments have shown very good performance on different types of strings. Conclusions: gsufsort is a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections.
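
To show what these structures look like for a string collection, the Python sketch below builds a generalized suffix array together with the document array, LCP array, and BWT by naive sorting. It is not gsufsort or gSACA-K and does not run in O(N) time. The function name `index_collection`, the representation of suffix array entries as `(document, offset)` pairs, and the use of `$` for every end-of-string symbol are assumptions made for illustration.

```python
def index_collection(strings):
    """Return (sa, da, lcp, bwt) for a collection of strings.

    sa  : list of (document, offset) pairs in suffix order
    da  : document array, da[r] = document of the r-th smallest suffix
    lcp : lcp[r] = longest common prefix of suffixes r - 1 and r (lcp[0] = 0)
    bwt : symbol preceding each suffix, "$" at the start of a string
    """
    entries = []
    for d, s in enumerate(strings):
        # Include the empty suffix, which stands for the string's terminator.
        for j in range(len(s) + 1):
            entries.append((s[j:], d, j))
    # Equal suffixes from different strings are ordered by document index.
    entries.sort()
    sa = [(d, j) for _, d, j in entries]
    da = [d for d, _ in sa]
    lcp = [0] * len(entries)
    for r in range(1, len(entries)):
        a, b = entries[r - 1][0], entries[r][0]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[r] = k  # never extends past the end of either string
    bwt = [strings[d][j - 1] if j > 0 else "$" for d, j in sa]
    return sa, da, lcp, bwt
```

Because each suffix is cut at the end of its own string, LCP values never span a string boundary, which is one of the properties that distinguishes indexing a collection from indexing a plain concatenation.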
