MONI: A Pangenomics Index for Finding MEMs

Mapping Intimacies ◽

10.1101/2021.07.06.451246 ◽

2021 ◽

Author(s):

Massimiliano Rossi ◽

Marco Oliva ◽

Ben Langmead ◽

Travis Gagie ◽

Christina Boucher

Keyword(s):

Pattern Matching ◽

Linear Time ◽

Repetitive Sequences ◽

Major Advance ◽

Human Chromosomes ◽

Time And Space ◽

Index Construction ◽

Approximate Pattern Matching ◽

Human Genomes ◽

Novel Algorithm

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Kuhnle et al. then showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in time and space linear in the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared to other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2 to 11 times less memory and was 2 to 32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references. Availability: MONI is publicly available at https://github.com/maxrossi91/moni.
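To make the notion of a MEM concrete, the following is a minimal brute-force sketch in Python. It is not MONI's algorithm: it uses neither the r-index nor thresholds, and its running time is far from linear. The function names `matching_statistics` and `find_mems` and the `min_len` parameter are hypothetical, chosen only for illustration. It computes, for each position of the pattern, the length of the longest match that starts there and occurs somewhere in the text, and then reports the matches that cannot be extended to the left.

```python
def matching_statistics(pattern, text):
    """For each i, length of the longest prefix of pattern[i:] occurring in text.

    Brute force via substring search; for illustration only.
    """
    lengths = []
    for i in range(len(pattern)):
        length = 0
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        lengths.append(length)
    return lengths


def find_mems(pattern, text, min_len=1):
    """Return (start, length) pairs of maximal exact matches of pattern in text.

    A match starting at i is right-maximal by construction. It is also
    left-maximal when i == 0 or when the match starting at i - 1 is not
    longer than this one (otherwise it would contain this match).
    """
    ms = matching_statistics(pattern, text)
    mems = []
    for i, length in enumerate(ms):
        if length >= min_len and (i == 0 or ms[i - 1] <= length):
            mems.append((i, length))
    return mems


if __name__ == "__main__":
    # Prints the MEMs of a short made-up pattern against a made-up text.
    for start, length in find_mems("TTAGAT", "GATTACA"):
        print(start, length, "TTAGAT"[start:start + length])
```

An index such as the one described in the abstract serves the same kind of query without rescanning the text for every pattern position.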


INDEXING GAPPED-FACTORS USING A TREE

International Journal of Foundations of Computer Science ◽

10.1142/s0129054108005541 ◽

2008 ◽

Vol 19 (01) ◽

pp. 71-87 ◽

Cited By ~ 3

Author(s):

PIERRE PETERLONGO ◽

JULIEN ALLALI ◽

MARIE-FRANCE SAGOT

Keyword(s):

Data Structure ◽

Pattern Matching ◽

Suffix Tree ◽

Linear Time ◽

Fixed Size ◽

Specific Kind ◽

Time And Space ◽

Inference Problems

We present a data structure to index a specific kind of factors (that is, substrings) called gapped-factors. A gapped-factor is a factor containing a gap that is ignored during indexing. The data structure is based on the suffix tree and indexes all the gapped-factors of a text with a fixed gap size, and only those. It is constructed online in linear time and space. Such a data structure may play an important role in various pattern matching and motif inference problems, for instance in text filtration.
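As a rough illustration of what indexing gapped-factors means, the sketch below enumerates them with a plain dictionary. It is not the suffix-tree-based structure from the abstract and is not built online. The function name `index_gapped_factors` and the parameters `left_len`, `gap`, and `right_len` are assumptions introduced here: each gapped-factor is taken to be a left part of fixed length, a gap of fixed size whose characters are ignored, and a right part of fixed length.

```python
from collections import defaultdict


def index_gapped_factors(text, left_len, gap, right_len):
    """Map each (left, right) pair to the start positions where it occurs.

    The `gap` characters between the two parts are skipped, so two
    occurrences that differ only inside the gap share one key.
    """
    index = defaultdict(list)
    span = left_len + gap + right_len
    for start in range(len(text) - span + 1):
        left = text[start:start + left_len]
        right_start = start + left_len + gap
        right = text[right_start:right_start + right_len]
        index[(left, right)].append(start)
    return dict(index)


if __name__ == "__main__":
    idx = index_gapped_factors("ABXCDABYCD", left_len=2, gap=1, right_len=2)
    # "AB?CD" occurs twice, with different characters in the gap.
    print(idx[("AB", "CD")])
```

Looking up the key `("AB", "CD")` returns every position where `AB` is followed, one ignored character later, by `CD`.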


Linear Approximate Pattern Matching Algorithm

10.21203/rs.3.rs-1021063/v1 ◽

2021 ◽

Author(s):

Anas Al-okaily ◽

Abdelghani Tbakhi

Keyword(s):

Pattern Matching ◽

Linear Time ◽

Search Costs ◽

Exact Matching ◽

Time And Space ◽

Matching Problem ◽

Approximate Matching ◽

Large Length ◽

Reference Stream ◽

Inexact Matching

Pattern matching is a fundamental process in almost every scientific domain. The problem involves finding the positions of a given pattern (usually short) in a reference stream of data (usually long). The matching can be exact or approximate (inexact). Exact matching searches for the pattern without allowing mismatches, insertions, or deletions of one or more characters in the pattern, while approximate matching allows them. For exact matching, several data structures that can be built in linear time and space are in practical use today. For approximate matching, the solutions proposed so far are non-linear and currently impractical. In this paper, we designed and implemented a structure that can be built in linear time and space and solves the approximate matching problem with a search cost of $O(m + \frac{\log_{\Sigma}^{k} n}{k!} + occ)$, where m is the length of the pattern, n is the length of the reference, and k is the number of tolerated mismatches (and insertions and deletions).
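The problem statement above can be illustrated with the classic dynamic-programming scan often attributed to Sellers, sketched below in Python. This is a swapped-in textbook technique, not the structure proposed in the paper: it builds no index and takes time proportional to m times n. The function name `approximate_end_positions` is hypothetical. It reports every position in the reference where some substring ending there is within k edits (mismatches, insertions, deletions) of the pattern.

```python
def approximate_end_positions(pattern, text, k):
    """Return end indices j such that some substring of text ending at j
    has edit distance at most k from pattern."""
    m = len(pattern)
    # Column for the empty text prefix: matching i pattern characters
    # against nothing costs i edits.
    prev = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        # curr[0] = 0 because a match may start at any text position.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr[i] = min(
                prev[i - 1] + cost,  # match or mismatch
                prev[i] + 1,         # extra character in the text
                curr[i - 1] + 1,     # pattern character left unmatched
            )
        if curr[m] <= k:
            ends.append(j)
        prev = curr
    return ends


if __name__ == "__main__":
    print(approximate_end_positions("abc", "xxabcxx", 0))
    print(approximate_end_positions("abc", "abd", 1))
```

With k = 0 the scan degenerates to exact matching, which is a convenient sanity check.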


Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data ◽

10.1145/3318464.3380566 ◽

2020 ◽

Author(s):

Tashin Reza ◽

Matei Ripeanu ◽

Geoffrey Sanders ◽

Roger Pearce

Keyword(s):

Pattern Matching ◽

Massive Graphs ◽

Approximate Pattern Matching


Toward accurate dynamic time warping in linear time and space

Intelligent Data Analysis ◽

10.3233/ida-2007-11508 ◽

2007 ◽

Vol 11 (5) ◽

pp. 561-580 ◽

Cited By ~ 502

Author(s):

Stan Salvador ◽

Philip Chan

Keyword(s):

Dynamic Time Warping ◽

Linear Time ◽

Time Warping ◽

Time And Space ◽

Dynamic Time


Computing quality scores and uncertainty for approximate pattern matching in geospatial semantic graphs

Statistical Analysis and Data Mining The ASA Data Science Journal ◽

10.1002/sam.11294 ◽

2015 ◽

Vol 8 (5-6) ◽

pp. 340-352 ◽

Cited By ~ 2

Author(s):

David J. Stracuzzi ◽

Randy C. Brost ◽

Cynthia A. Phillips ◽

David G. Robinson ◽

Alyson G. Wilson ◽

...

Keyword(s):

Pattern Matching ◽

Approximate Pattern Matching


A novel algorithm for pattern matching with back references

2015 IEEE 34th International Performance Computing and Communications Conference (IPCCC) ◽

10.1109/pccc.2015.7410264 ◽

2015 ◽

Cited By ~ 1

Author(s):

Liu Yang ◽

Vinod Ganapathy ◽

Pratyusa Manadhata ◽

Ye Wu

Keyword(s):

Pattern Matching ◽

Novel Algorithm


Numerical solution for the linear time and space fractional diffusion equation

Journal of Vibration and Control ◽

10.1177/1077546313500687 ◽

2013 ◽

Vol 21 (9) ◽

pp. 1769-1777 ◽

Cited By ~ 2

Author(s):

Talaat S El Danaf

Keyword(s):

Diffusion Equation ◽

Numerical Solution ◽

Linear Time ◽

Fractional Diffusion ◽

Fractional Diffusion Equation ◽

Time And Space ◽

Space Fractional Diffusion Equation


State Complexity of Neighbourhoods and Approximate Pattern Matching

International Journal of Foundations of Computer Science ◽

10.1142/s0129054118400099 ◽

2018 ◽

Vol 29 (02) ◽

pp. 315-329 ◽

Cited By ~ 5

Author(s):

Timothy Ng ◽

David Rappaport ◽

Kai Salomaa

Keyword(s):

Lower Bound ◽

Pattern Matching ◽

Finite Automaton ◽

The State ◽

Worst Case ◽

State Complexity ◽

Approximate Pattern Matching ◽

The Given ◽

Nondeterministic Finite Automaton ◽

Distance Formula

The neighbourhood of a language [Formula: see text] with respect to an additive distance consists of all strings that have distance at most the given radius from some string of [Formula: see text]. We show that the worst case deterministic state complexity of a radius [Formula: see text] neighbourhood of a language recognized by an [Formula: see text] state nondeterministic finite automaton [Formula: see text] is [Formula: see text]. In the case where [Formula: see text] is deterministic we get the same lower bound for the state complexity of the neighbourhood if we use an additive quasi-distance. The lower bound constructions use an alphabet of size linear in [Formula: see text]. We show that the worst case state complexity of the set of strings that contain a substring within distance [Formula: see text] from a string recognized by [Formula: see text] is [Formula: see text].


Robotomata: A framework for approximate pattern matching of big data on an automata processor

2017 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2017.8257936 ◽

2017 ◽

Cited By ~ 2

Author(s):

Xiaodong Yu ◽

Kaixi Hou ◽

Hao Wang ◽

Wu-chun Feng

Keyword(s):

Big Data ◽

Pattern Matching ◽

Approximate Pattern Matching


A Linear Size Index for Approximate Pattern Matching

Combinatorial Pattern Matching - Lecture Notes in Computer Science ◽

10.1007/11780441_6 ◽

2006 ◽

pp. 49-59 ◽

Cited By ~ 14

Author(s):

Ho-Leung Chan ◽

Tak-Wah Lam ◽

Wing-Kin Sung ◽

Siu-Lung Tam ◽

Swee-Seong Wong

Keyword(s):

Pattern Matching ◽

Linear Size ◽

Approximate Pattern Matching ◽

Size Index
