Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds

We present a novel A* seed heuristic enabling fast and optimal sequence-to-graph alignment, guaranteed to minimize the edit distance of the alignment assuming non-negative edit costs. We phrase optimal alignment as a shortest path problem and solve it by instantiating the A* algorithm with our novel seed heuristic. The key idea of the seed heuristic is to extract seeds from the read, locate them in the reference, mark preceding reference positions by crumbs, and use the crumbs to direct the A* search. We prove admissibility of the seed heuristic, thus guaranteeing alignment optimality. Our implementation extends the free and open source AStarix aligner and demonstrates that the seed heuristic outperforms all state-of-the-art optimal aligners including GraphAligner, Vargas, PaSGAL, and the prefix heuristic previously employed by AStarix. Specifically, we achieve a consistent speedup of >60x on both short Illumina reads and long HiFi reads (up to 25kbp), on both the E. coli linear reference genome (1Mbp) and the MHC variant graph (5Mbp). Our speedup is enabled by the seed heuristic consistently skipping >99.99% of the table cells that optimal aligners based on dynamic programming compute.
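
The counting argument behind admissibility can be illustrated with a deliberately weakened sketch. It splits the read into non-overlapping seeds and counts the seeds that have no exact occurrence anywhere in a linear reference. Each such seed forces at least one edit, and because seeds do not overlap, the count cannot exceed the edit distance under unit costs. This sketch omits crumbs and graph references entirely, so it is position-independent and weaker than the seed heuristic described above. The function names `split_into_seeds` and `missing_seed_lower_bound` are hypothetical and not taken from AStarix.

```python
def split_into_seeds(read: str, seed_len: int) -> list[str]:
    """Cut the read into consecutive non-overlapping seeds; a short tail is dropped."""
    return [read[i:i + seed_len] for i in range(0, len(read) - seed_len + 1, seed_len)]


def missing_seed_lower_bound(read: str, reference: str, seed_len: int) -> int:
    """Count seeds with no exact occurrence in the reference.

    Each missing seed needs at least one edit, and seeds do not overlap,
    so the count is a lower bound on the edit distance (unit costs assumed).
    """
    return sum(1 for seed in split_into_seeds(read, seed_len) if seed not in reference)


reference = "ACGTTGCAAGGCTTAC"
exact_read = "GTTGCAAGGCTT"      # equals reference[2:14]
mutated_read = "GTTGCTAGGCTT"    # same read with one substitution
print(missing_seed_lower_bound(exact_read, reference, 4))
print(missing_seed_lower_bound(mutated_read, reference, 4))
```

A bound of this kind only becomes useful for guiding a search once it depends on the current position in the reference, which is the role the abstract assigns to crumbs.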

AStarix: Fast and Optimal Sequence-to-Graph Alignment

10.1101/2020.01.22.915496 ◽

2020 ◽

Author(s):

Pesho Ivanov ◽

Benjamin Bichsel ◽

Harun Mustafa ◽

André Kahles ◽

Gunnar Rätsch ◽

...

Keyword(s):

Shortest Path ◽

Edit Distance ◽

Reference Genome ◽

State Of The Art ◽

Alignment Algorithm ◽

Optimal Alignment ◽

Heuristic Function ◽

Optimal Sequence ◽

Domain Specific ◽

Distance Minimization

Abstract: We present an algorithm for the optimal alignment of sequences to genome graphs. It works by phrasing the edit distance minimization task as finding a shortest path on an implicit alignment graph. To find a shortest path, we instantiate the A⋆ paradigm with a novel domain-specific heuristic function that accounts for the upcoming subsequence in the query to be aligned, resulting in a provably optimal alignment algorithm called AStarix. Experimental evaluation of AStarix shows that it is 1–2 orders of magnitude faster than state-of-the-art optimal algorithms on the task of aligning Illumina reads to reference genome graphs. Implementations and evaluations are available at https://github.com/eth-sri/astarix.
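
The shortest-path formulation can be sketched on a toy graph. In the minimal example below, a state is a pair of the next graph node and the number of query characters consumed. Edges correspond to matches or substitutions, insertions, and deletions with unit costs, and states are generated on demand so the alignment graph stays implicit. The graph encoding, the `align` function, and the zero default heuristic (which reduces A⋆ to Dijkstra's algorithm) are assumptions made for illustration; this is not the AStarix implementation or its heuristic.

```python
import heapq
from itertools import count
from typing import Callable, Optional

Graph = dict[str, tuple[str, list[str]]]  # node id -> (base, successor ids)
State = tuple[Optional[str], int]         # (next graph node or None, query chars consumed)


def align(query: str, graph: Graph,
          heuristic: Callable[[State], int] = lambda state: 0) -> int:
    """Return the minimum edit distance between query and any path in graph."""
    n = len(query)
    best: dict[State, int] = {}
    tie = count()  # tie-breaker so the heap never compares states
    heap: list[tuple[int, int, int, State]] = []

    def push(state: State, g: int) -> None:
        # Keep a state only if it improves on the best known cost.
        if g < best.get(state, g + 1):
            best[state] = g
            heapq.heappush(heap, (g + heuristic(state), next(tie), g, state))

    for node in graph:  # the alignment may start at any node for free
        push((node, 0), 0)

    while heap:
        _, _, g, (node, i) = heapq.heappop(heap)
        if g > best[(node, i)]:
            continue  # stale queue entry
        if i == n:
            return g  # whole query consumed: a shortest path is found
        push((node, i + 1), g + 1)  # insertion: skip a query character
        if node is not None:
            base, successors = graph[node]
            for nxt in successors or [None]:
                push((nxt, i), g + 1)  # deletion: skip a graph character
                push((nxt, i + 1), g + (base != query[i]))  # match or substitution
    raise ValueError("graph has no nodes")


# Toy variant graph spelling ACGA and ACTA.
toy_graph: Graph = {
    "1": ("A", ["2"]),
    "2": ("C", ["3a", "3b"]),
    "3a": ("G", ["4"]),
    "3b": ("T", ["4"]),
    "4": ("A", []),
}
print(align("ACTA", toy_graph), align("ACCA", toy_graph))
```

Any admissible heuristic can be passed as `heuristic`. Stale queue entries are skipped rather than tracked in a closed set, so states can be re-expanded if the heuristic is admissible but not consistent.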

Suffix array for multi-pattern matching with variable length wildcards

Intelligent Data Analysis ◽

10.3233/ida-205087 ◽

2021 ◽

Vol 25 (2) ◽

pp. 283-303

Author(s):

Na Liu ◽

Fei Xie ◽

Xindong Wu

Keyword(s):

Dynamic Programming ◽

Data Structure ◽

Pattern Matching ◽

Edit Distance ◽

State Of The Art ◽

Suffix Array ◽

Variable Length ◽

Distance Method ◽

Efficient Data ◽

Comparison Algorithms

Approximate multi-pattern matching, in which patterns contain variable-length wildcards, is an important and widely used problem. This paper proposes two suffix-array-based algorithms to solve it. The suffix array is an efficient data structure for exact string matching in existing studies, as well as for approximate pattern matching and multi-pattern matching. The first algorithm, MMSA-S, handles the short exact characters in a pattern by dynamic programming, while the second, MMSA-L, handles the long exact characters by the edit distance method. Experimental results on the Pizza & Chili corpus demonstrate that, in most cases, the two proposed algorithms are more time-efficient than the state-of-the-art comparison algorithms.
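
For readers unfamiliar with the data structure, the sketch below shows how a suffix array supports exact string matching with two binary searches. It uses a simple sort-based construction for brevity and does not handle wildcards. It is not MMSA-S or MMSA-L, and the function names are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Sort suffix start positions lexicographically (simple, not linear-time)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions of pattern, found by two binary searches."""
    m = len(pattern)

    def first_index(strictly_greater: bool) -> int:
        # First suffix whose length-m prefix is >= pattern (or > pattern).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[suffix_array[mid]:suffix_array[mid] + m]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return sorted(suffix_array[first_index(False):first_index(True)])


text = "banana"
sa = build_suffix_array(text)
print(sa, find_occurrences(text, sa, "ana"))
```

All suffixes starting with the pattern form one contiguous block of the suffix array, which is why two boundary searches are enough.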

Context-Aware Seeds for Read Mapping

10.21203/rs.2.19241/v1 ◽

2019 ◽

Author(s):

Hongyi Xin ◽

Mingfu Shao ◽

Carl Kingsford

Keyword(s):

Edit Distance ◽

State Of The Art ◽

Linear Time ◽

Pigeonhole Principle ◽

Context Aware ◽

Read Mapping ◽

Distance Threshold ◽

E Coli ◽

Long Reads ◽

Confidence Radius

Abstract Motivation: Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting t non-overlapping seeds in each read in order to find all valid mappings under an edit distance threshold of t. As t grows (such as in long reads with high error rate), this seeding scheme forces mappers to use more and shorter seeds, which increases the seed hits (seed frequencies) and therefore reduces the efficiency of mappers. Results: We propose a novel seeding framework, context-aware seeds (CAS). CAS guarantees finding all valid mappings but uses fewer (and longer) seeds, which reduces seed frequencies and increases efficiency of mappers. CAS achieves this improvement by attaching a confidence radius to each seed in the reference. We prove that all valid mappings can be found if the sum of the confidence radii of the seeds is greater than t. CAS generalizes the existing pigeonhole-principle-based seeding scheme in which this confidence radius is implicitly always 1. Moreover, we design an efficient algorithm that constructs the confidence radius database in linear time. We evaluate CAS on the E. coli genome and show that CAS reduces seed frequencies by up to 20.3% when compared with the state-of-the-art pigeonhole-principle-based seeding algorithm, the Optimal Seed Solver.
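
The pigeonhole-principle-based scheme that CAS generalizes can be sketched as follows. The example assumes the classical convention in which a read is split into t + 1 non-overlapping seeds, so that any mapping with at most t edits leaves at least one seed intact. This matches the condition that the confidence radii, each implicitly 1, sum to more than t, although the abstract's own counting convention may differ. The function names are hypothetical, and confidence radii are not modelled.

```python
def pigeonhole_seeds(read: str, t: int) -> list[tuple[int, str]]:
    """Split read into t + 1 non-overlapping seeds, returned as (offset, seed)."""
    k = t + 1
    if len(read) < k:
        raise ValueError("read is too short for t + 1 non-empty seeds")
    bounds = [len(read) * j // k for j in range(k + 1)]
    return [(bounds[j], read[bounds[j]:bounds[j + 1]]) for j in range(k)]


def candidate_positions(read: str, reference: str, t: int) -> set[int]:
    """Collect approximate read start positions implied by exact seed hits."""
    candidates = set()
    for offset, seed in pigeonhole_seeds(read, t):
        hit = reference.find(seed)
        while hit != -1:
            # With indels, the true start may differ from this value by up to t.
            candidates.add(hit - offset)
            hit = reference.find(seed, hit + 1)
    return candidates


reference = "ACGTTGCAAGGCTTAC"
read = "GTTGCTAGGCTT"  # reference[2:14] with one substitution
print(candidate_positions(read, reference, 1))
```

Larger t yields more and shorter seeds, and shorter seeds hit the reference more often. That is the efficiency problem the abstract describes.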

A Dynamic Programming A* Algorithm for Computing Unordered Tree Edit Distance

2013 Second IIAI International Conference on Advanced Applied Informatics ◽

10.1109/iiai-aai.2013.71 ◽

2013 ◽

Author(s):

Takuya Yoshino ◽

Shoichi Higuchi ◽

Kouichi Hirata

Keyword(s):

Dynamic Programming ◽

Edit Distance ◽

A Algorithm ◽

Tree Edit Distance ◽

Unordered Tree

Context-Aware Seeds for Read Mapping

10.1101/643072 ◽

2019 ◽

Author(s):

Hongyi Xin ◽

Mingfu Shao ◽

Carl Kingsford

Keyword(s):

Edit Distance ◽

State Of The Art ◽

Linear Time ◽

Pigeonhole Principle ◽

Context Aware ◽

Read Mapping ◽

Distance Threshold ◽

E Coli ◽

Long Reads ◽

Confidence Radius

Abstract Motivation: Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting t non-overlapping seeds in each read in order to find all valid mappings under an edit distance threshold of t. As t grows (such as in long reads with high error rate), this seeding scheme forces mappers to use more and shorter seeds, which increases the seed hits (seed frequencies) and therefore reduces the efficiency of mappers. Results: We propose a novel seeding framework, context-aware seeds (CAS). CAS guarantees finding all valid mappings but uses fewer (and longer) seeds, which reduces seed frequencies and increases efficiency of mappers. CAS achieves this improvement by attaching a confidence radius to each seed. We prove that all valid mappings can be found if the sum of the confidence radii of the seeds is greater than t. CAS generalizes the existing pigeonhole-principle-based seeding scheme in which this confidence radius is implicitly always 1. Moreover, we design an efficient algorithm that constructs the confidence radius database in linear time. We evaluate CAS on the E. coli genome and show that CAS reduces seed frequencies by up to 25.4% when compared with the state-of-the-art pigeonhole-principle-based seeding algorithm, the Optimal Seed Solver. Availability: https://github.com/Kingsford-Group/CAS_code

Computing Robust Principal Components by A* Search

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213018600138 ◽

2018 ◽

Vol 27 (07) ◽

pp. 1860013 ◽

Cited By ~ 1

Author(s):

Swair Shah ◽

Baokun He ◽

Crystal Maung ◽

Haim Schweitzer

Keyword(s):

State Of The Art ◽

Principal Component ◽

Low Rank ◽

A Algorithm ◽

Running Time ◽

Current State ◽

Dimensionality Reduction Technique ◽

Related Variant ◽

Low Rank Representation ◽

The Cost

Principal Component Analysis (PCA) is a classical dimensionality reduction technique that computes a low rank representation of the data. Recent studies have shown how to compute this low rank representation from most of the data, excluding a small amount of outlier data. We show how to convert this problem into graph search, and describe an algorithm that solves this problem optimally by applying a variant of the A* algorithm to search for the outliers. The results obtained by our algorithm are optimal in terms of accuracy, and are shown to be more accurate than results obtained by the current state-of-the-art algorithms which are shown not to be optimal. This comes at the cost of running time, which is typically slower than the current state of the art. We also describe a related variant of the A* algorithm that runs much faster than the optimal variant and produces a solution that is guaranteed to be near the optimal. This variant is shown experimentally to be more accurate than the current state-of-the-art and has a comparable running time.

Finding an Optimal Sequence by Dynamic Programming: An Extension to Precedence-Related Tasks

Operations Research ◽

10.1287/opre.26.1.111 ◽

1978 ◽

Vol 26 (1) ◽

pp. 111-120 ◽

Cited By ~ 84

Author(s):

Kenneth R. Baker ◽

Linus E. Schrage

Keyword(s):

Dynamic Programming ◽

Optimal Sequence

Speeding-Up the Dynamic Programming Procedure for the Edit Distance of Two Strings

Communications in Computer and Information Science - Database and Expert Systems Applications ◽

10.1007/978-3-030-27684-3_9 ◽

2019 ◽

pp. 59-66

Author(s):

Giuseppe Lancia ◽

Marcello Dalpasso

Keyword(s):

Dynamic Programming ◽

Edit Distance

A dynamic alignment algorithm for imperfect speech and transcript

Computer Science and Information Systems ◽

10.2298/csis1001075t ◽

2010 ◽

Vol 7 (1) ◽

pp. 75-84 ◽

Cited By ~ 4

Author(s):

Ye Tao ◽

Li Xueqing ◽

Wu Bian

Keyword(s):

Dynamic Programming ◽

Boundary Detection ◽

Multimedia Content ◽

Alignment Algorithm ◽

Optimal Alignment ◽

Multi Stage ◽

Sentence Level ◽

Sentence Boundary ◽

English Training ◽

Dynamic Alignment

This paper presents a novel alignment approach for imperfect speech and the corresponding transcription. The algorithm starts with multi-stage sentence boundary detection in the audio, followed by a dynamic-programming-based search that finds the optimal alignment and detects mismatches at the sentence level. Experiments show promising performance compared to the traditional forced alignment approach. The proposed algorithm has already been applied in preparing multimedia content for an online English training platform.
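
A generic version of the dynamic programming step can be sketched as a unit-cost global alignment between two sentence sequences, with a backtrace that reports unmatched sentences as mismatches. The sketch assumes sentences are already segmented and compared by exact string equality. It does not reproduce the paper's boundary detection or its scoring, and `align_sentences` is a hypothetical name.

```python
from typing import Optional

Pair = tuple[Optional[str], Optional[str]]


def align_sentences(spoken: list[str], transcript: list[str]) -> list[Pair]:
    """Globally align two sentence lists; None marks a sentence with no partner."""
    n, m = len(spoken), len(transcript)
    # cost[i][j] = minimum edits to align spoken[:i] with transcript[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = spoken[i - 1] != transcript[j - 1]
            cost[i][j] = min(cost[i - 1][j - 1] + mismatch,
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover one optimal alignment.
    pairs: list[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (spoken[i - 1] != transcript[j - 1])):
            pairs.append((spoken[i - 1], transcript[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((spoken[i - 1], None))  # spoken but not in transcript
            i -= 1
        else:
            pairs.append((None, transcript[j - 1]))  # in transcript but not spoken
            j -= 1
    return pairs[::-1]


print(align_sentences(["a", "b", "d"], ["a", "b", "c", "d"]))
```

Pairs that contain `None` or two different sentences are the sentence-level mismatches in this simplified setting.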

Enabling multiscale variation analysis with genome graphs

10.1101/2021.02.03.429603 ◽

2021 ◽

Author(s):

Brice Letcher ◽

Martin Hunt ◽

Zamin Iqbal

Keyword(s):

Genetic Variation ◽

Directed Acyclic Graph ◽

Structural Variation ◽

Reference Genome ◽

Multiple Scales ◽

State Of The Art ◽

Variant Calling ◽

Variation Analysis ◽

New Algorithms ◽

Genome Graph

Abstract Background: Standard approaches to characterising genetic variation revolve around mapping reads to a reference genome and describing variants in terms of differences from the reference; this is based on the assumption that these differences will be small and provides a simple coordinate system. However, this fails, and the coordinates break down, when there are diverged haplotypes at a locus (e.g. one haplotype contains a multi-kilobase deletion, a second contains a few SNPs, and a third is highly diverged with hundreds of SNPs). To handle these, we need to model genetic variation that occurs at different length-scales (SNPs to large structural variants) and that occurs on alternate backgrounds. We refer to these together as multiscale variation. Results: We model the genome as a directed acyclic graph consisting of successive hierarchical subgraphs (“sites”) that naturally incorporate multiscale variation, and introduce an algorithm for genotyping, implemented in the software gramtools. This enables variant calling on different sequence backgrounds. In addition to producing regular VCF files, we introduce a JSON file format based on VCF, which records variant site relationships and alternate sequence backgrounds. We show two applications. First, we benchmark gramtools against existing state-of-the-art methods in joint-genotyping 17 M. tuberculosis samples at long deletions and the overlapping small variants that segregate in a cohort of 1,017 genomes. Second, in 706 African and SE Asian P. falciparum genomes, we analyse a dimorphic surface antigen gene which possesses variation on two diverged backgrounds which appeared to not recombine. This generates the first map of variation on both backgrounds, revealing patterns of recombination that were previously unknown. Conclusions: We need new approaches to be able to jointly analyse SNP and structural variation in cohorts, and even more to handle variants on different genetic backgrounds. We have demonstrated that by modelling with a directed, acyclic and locally hierarchical genome graph, we can apply new algorithms to accurately genotype dense variation at multiple scales. We also propose a generalisation of VCF for accessing multiscale variation in genome graphs, which we hope will be of wide utility.
