Bit-parallel sequence-to-graph alignment

Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction, and variant calling with respect to a variation graph. Here, we generalize two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers’ bitvector algorithm for semi-global alignment. These linear algorithms are both based on processing w sequence characters with a constant number of operations, where w is the word size of the machine (commonly 64), and achieve a speedup of w over naive algorithms. For a sequence of length m and a graph with V nodes and E edges, our bitvector-based graph alignment algorithm reaches a worst case runtime of O(V + ⌈m/w⌉E log w) for acyclic graphs and O(V + mE log w) for arbitrary cyclic graphs. We apply it to four different types of graphs and observe a speedup between 3.1-fold and 10.1-fold compared to previous algorithms.
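
As an illustration of the bit-parallel idea behind the abstract above, the following is a minimal sketch of the classic Shift-And algorithm for exact matching on a plain linear sequence. It is not the authors' graph generalization. The function name `shift_and` is chosen here for illustration, and Python's arbitrary-precision integers stand in for the w-bit machine word, so the pattern is not limited to 64 characters.

```python
def shift_and(text: str, pattern: str) -> list[int]:
    """Return the start index of every exact occurrence of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit i set exactly when pattern[i] == c.
    masks: dict[str, int] = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit m-1 set means the whole pattern has matched
    state = 0              # bit i set means pattern[:i+1] matches a suffix of the text read so far
    hits = []
    for j, c in enumerate(text):
        # Extend every partial match by one character and start a new one at bit 0.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_and("GATTACATTA", "ATTA"))
```

Each text character is handled with one shift, one OR and one AND regardless of how many pattern positions are active, which is where the word-size speedup over a character-by-character comparison comes from.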

Bit-parallel sequence-to-graph alignment

Bioinformatics ◽

10.1093/bioinformatics/btz162 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3599-3607 ◽

Cited By ~ 25

Author(s):

Mikko Rautiainen ◽

Veli Mäkinen ◽

Tobias Marschall

Keyword(s):

Variant Calling ◽

Supplementary Information ◽

Global Alignment ◽

Alignment Algorithm ◽

Exact Matching ◽

Worst Case ◽

De Bruijn Graphs ◽

Graph Alignment ◽

Linear Sequence ◽

String Graphs

Motivation: Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction and variant calling with respect to a variation graph. Results: We generalize two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers’ bitvector algorithm for semi-global alignment. These linear algorithms are both based on processing w sequence characters with a constant number of operations, where w is the word size of the machine (commonly 64), and achieve a speedup of up to w over naive algorithms. For a graph with |V| nodes and |E| edges and a sequence of length m, our bitvector-based graph alignment algorithm reaches a worst case runtime of O(|V| + ⌈m/w⌉|E| log w) for acyclic graphs and O(|V| + m|E| log w) for arbitrary cyclic graphs. We apply it to five different types of graphs and observe a speedup between 3-fold and 20-fold compared with a previous (asymptotically optimal) alignment algorithm. Availability and implementation: https://github.com/maickrau/GraphAligner. Supplementary information: Supplementary data are available at Bioinformatics online.
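
The second linear algorithm named in the abstract, Myers' bitvector algorithm for semi-global alignment, can be sketched as follows for two plain sequences. This is the textbook sequence-to-sequence formulation (in the style commonly attributed to Myers and Hyyrö), not the graph version described in the paper. The function name `myers_semiglobal` is an assumption made for this example, and a single Python integer replaces the ⌈m/w⌉ machine words a real implementation would use.

```python
def myers_semiglobal(pattern: str, text: str) -> list[int]:
    """For each text position j, return the minimum edit distance between
    pattern and any substring of text ending at j (free start in the text)."""
    m = len(pattern)
    if m == 0:
        return [0] * len(text)
    mask = (1 << m) - 1   # keeps all bitvectors at m bits
    high = 1 << (m - 1)   # bit of the last pattern row
    peq: dict[str, int] = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv = mask, 0      # vertical deltas of the column: +1 everywhere, -1 nowhere
    score = m             # DP value in the last row of the current column
    scores = []
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # rows whose horizontal delta is +1
        mh = pv & xh                    # rows whose horizontal delta is -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shift in 0: the top row is all zeros because a match may start anywhere.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        scores.append(score)
    return scores


if __name__ == "__main__":
    print(min(myers_semiglobal("ACGT", "TTACGTTT")))
```

A whole column of the dynamic programming matrix is updated with a fixed number of word operations, so an exact occurrence of the pattern shows up as a score of 0 at the text position where it ends.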

Aligning sequences to general graphs in O(V + mE) time

10.1101/216127 ◽

2017 ◽

Cited By ~ 10

Author(s):

Mikko Rautiainen ◽

Tobias Marschall

Keyword(s):

Basic Problem ◽

Variant Calling ◽

Theory And Practice ◽

De Bruijn Graphs ◽

Topological Features ◽

String Graphs ◽

Wide Range ◽

Read Error Correction ◽

Genome Assemblies ◽

General Graphs

Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction, and variant calling with respect to a variation graph. Given the wide range of applications of this basic problem, it is surprising that algorithms with optimal runtime are, to the best of our knowledge, yet unknown. In particular, aligning sequences to cyclic graphs currently represents a challenge both in theory and practice. Here, we introduce an algorithm to compute the minimum edit distance of a sequence of length m to any path in a node-labeled directed graph (V, E) in O(|V| + m|E|) time and O(|V|) space. The corresponding alignment can be obtained in the same runtime using space. The time complexity depends only on the length of the sequence and the size of the graph. In particular, it does not depend on the cyclicity of the graph, or any other topological features.
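
The problem stated in the abstract above, the minimum edit distance of a sequence to any path in a node-labeled directed graph, can be made concrete with a small sketch. This is a deliberately simple row-by-row dynamic program that resolves cycles with a Dijkstra-style propagation inside each row; it is not the authors' O(|V| + m|E|) algorithm and is slower by a logarithmic factor. The function name `graph_edit_distance`, the one-character-per-node labeling and the edge-list input format are assumptions made for this example.

```python
import heapq


def graph_edit_distance(labels: str, edges: list[tuple[int, int]], seq: str) -> int:
    """Minimum unit-cost edit distance between seq and the string spelled by
    any path in a directed graph whose node v carries the character labels[v]."""
    n = len(labels)
    preds = [[] for _ in range(n)]
    succs = [[] for _ in range(n)]
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    def propagate(row: list[int]) -> list[int]:
        # Skipping a graph character costs 1 and stays in the same row, so a
        # row can depend on itself through cycles; settle it like Dijkstra.
        heap = [(d, v) for v, d in enumerate(row)]
        heapq.heapify(heap)
        while heap:
            d, u = heapq.heappop(heap)
            if d > row[u]:
                continue  # stale heap entry
            for v in succs[u]:
                if d + 1 < row[v]:
                    row[v] = d + 1
                    heapq.heappush(heap, (d + 1, v))
        return row

    # prev[v]: cost of aligning the processed prefix to a path ending at v.
    # For the empty prefix the cheapest such path is v alone, deleted (cost 1).
    prev = [1] * n
    for i, c in enumerate(seq, start=1):
        row = []
        for v in range(n):
            # A path may start at any node: i - 1 is the cost of having
            # inserted the first i - 1 characters before the path begins.
            best_pred = min([prev[u] for u in preds[v]] + [i - 1])
            mismatch = 0 if labels[v] == c else 1
            row.append(min(best_pred + mismatch, prev[v] + 1, i + 1))
        prev = propagate(row)
    # len(seq) covers the empty path (and an empty graph).
    return min(prev + [len(seq)])


if __name__ == "__main__":
    # A small variation graph: A, then C or G, then T.
    bubble = [(0, 1), (0, 2), (1, 3), (2, 3)]
    print(graph_edit_distance("ACGT", bubble, "AGT"))
```

With a self-loop on a single node labeled A, the sequence AAAA aligns at cost 0, which shows that cycles are handled without unrolling the graph.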

CLASSIFICATION AND IDENTIFICATION OF FUNGAL SEQUENCES USING CHARACTERISTIC RESTRICTION ENDONUCLEASE CUT ORDER

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720010004616 ◽

2010 ◽

Vol 08 (02) ◽

pp. 181-198 ◽

Cited By ~ 2

Author(s):

RAJIB SENGUPTA ◽

DHUNDY R. BASTOLA ◽

HESHAM H. ALI

Keyword(s):

Dna Sequences ◽

Restriction Enzymes ◽

Epidemiological Studies ◽

Global Alignment ◽

Target Sequence ◽

Alignment Algorithm ◽

Molecular Fingerprinting ◽

Data Set ◽

Alignment Free ◽

Wet Lab

Restriction Fragment Length Polymorphism (RFLP) is a powerful molecular tool that is extensively used in the molecular fingerprinting and epidemiological studies of microorganisms. In a wet-lab setting, the DNA is cut with one or more restriction enzymes and subjected to gel electrophoresis to obtain signature fragment patterns, which are used in the classification and identification of organisms. This wet-lab approach may not be practical when the experimental data set includes a large number of genetic sequences and a wide pool of restriction enzymes to choose from. In this study, we introduce a novel concept of Enzyme Cut Order, a biological property-based characteristic of DNA sequences that can be defined and analyzed computationally without any alignment algorithm. In this alignment-free approach, a similarity matrix is developed based on the pairwise Longest Common Subsequences (LCS) of the Enzyme Cut Orders. The choice of an ideal set of restriction enzymes used for analysis is augmented by using genetic algorithms. The results obtained from this approach using internal transcribed spacer regions of rDNA from fungi as the target sequence show that phylogenetically related organisms form a single cluster and that successful grouping of phylogenetically close or distant organisms depends on the choice of restriction enzymes used in the analysis. Additionally, comparison of trees obtained with this alignment-free method and the legacy method revealed highly similar tree topologies. This novel alignment-free method, which utilizes the Enzyme Cut Order and restriction enzyme profile, is a reliable alternative to local or global alignment-based classification and identification of organisms.
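
The LCS-based similarity matrix mentioned above can be sketched as follows. The abstract does not say how a cut order is encoded or how the LCS length is normalized, so both are assumptions here: a cut order is taken to be the list of enzyme names in the order their sites occur along a sequence, and the LCS length is divided by the longer of the two lists. The example cut orders are made up for illustration and are not data from the study.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity_matrix(cut_orders: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Pairwise LCS similarity, normalized by the longer cut order (an assumption)."""
    sim = {}
    for p, a in cut_orders.items():
        for q, b in cut_orders.items():
            longest = max(len(a), len(b), 1)  # avoid division by zero
            sim[(p, q)] = lcs_length(a, b) / longest
    return sim


if __name__ == "__main__":
    # Hypothetical cut orders for three made-up isolates.
    orders = {
        "isolate_1": ["EcoRI", "HindIII", "BamHI", "EcoRI"],
        "isolate_2": ["EcoRI", "BamHI", "EcoRI"],
        "isolate_3": ["HindIII", "HindIII"],
    }
    for pair, value in similarity_matrix(orders).items():
        print(pair, round(value, 2))
```

The resulting matrix could then be fed to any standard clustering or tree-building method, which is the role the similarity matrix plays in the approach described above.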

MPSAGA: a matrix-based pair-wise sequence alignment algorithm for global alignment with position based sequence representation

Sadhana ◽

10.1007/s12046-019-1141-x ◽

2019 ◽

Vol 44 (7) ◽

Cited By ~ 1

Author(s):

Jyoti Lakhani ◽

Ajay Khunteta ◽

Anupama Choudhary ◽

Dharmesh Harwani

Keyword(s):

Sequence Alignment ◽

Global Alignment ◽

Alignment Algorithm ◽

Sequence Alignment Algorithm ◽

Sequence Representation

Worst-case errors of linear algorithms for identification in H

International Journal of Control ◽

10.1080/002071798222884 ◽

1998 ◽

Vol 69 (2) ◽

pp. 347-352 ◽

Cited By ~ 4

Author(s):

Jonathan R. Partington

Keyword(s):

Worst Case ◽

Linear Algorithms

Coordinated Reasoning for Cross-Lingual Knowledge Graph Alignment

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6476 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9354-9361

Author(s):

Kun Xu ◽

Linfeng Song ◽

Yansong Feng ◽

Yan Song ◽

Dong Yu

Keyword(s):

State Of The Art ◽

The Other ◽

Knowledge Graph ◽

Alignment Algorithm ◽

Graph Alignment ◽

Current State ◽

Alignment Model ◽

The Many ◽

The One ◽

Cross Lingual

Existing entity alignment methods mainly vary on the choices of encoding the knowledge graph, but they typically use the same decoding method, which independently chooses the local optimal match for each source entity. This decoding method may not only cause the “many-to-one” problem but also neglect the coordinated nature of this task, that is, each alignment decision may highly correlate to the other decisions. In this paper, we introduce two coordinated reasoning methods, i.e., the Easy-to-Hard decoding strategy and joint entity alignment algorithm. Specifically, the Easy-to-Hard strategy first retrieves the model-confident alignments from the predicted results and then incorporates them as additional knowledge to resolve the remaining model-uncertain alignments. To achieve this, we further propose an enhanced alignment model that is built on the current state-of-the-art baseline. In addition, to address the many-to-one problem, we propose to jointly predict entity alignments so that the one-to-one constraint can be naturally incorporated into the alignment prediction. Experimental results show that our model achieves the state-of-the-art performance and our reasoning methods can also significantly improve existing baselines.
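
The "many-to-one" problem described above, and the effect of a one-to-one constraint, can be illustrated with a toy similarity matrix. The sketch below is not the authors' Easy-to-Hard strategy or their joint alignment model. It swaps in a simple greedy matching that accepts the most confident pairs first, and the function names and scores are hypothetical.

```python
def independent_decode(sim: list[list[float]]) -> list[int]:
    """Pick the best target for each source entity independently (local argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def greedy_one_to_one(sim: list[list[float]]) -> dict[int, int]:
    """Accept pairs in order of decreasing score, using each source entity
    and each target entity at most once."""
    pairs = sorted(
        ((score, i, j) for i, row in enumerate(sim) for j, score in enumerate(row)),
        reverse=True,
    )
    used_sources, used_targets = set(), set()
    matching = {}
    for score, i, j in pairs:
        if i not in used_sources and j not in used_targets:
            matching[i] = j
            used_sources.add(i)
            used_targets.add(j)
    return matching


if __name__ == "__main__":
    # Rows are source entities, columns are target entities (made-up scores).
    sim = [
        [0.9, 0.1, 0.0],
        [0.8, 0.7, 0.1],
        [0.2, 0.3, 0.6],
    ]
    print(independent_decode(sim))
    print(greedy_one_to_one(sim))
```

In this toy matrix, independent decoding sends sources 0 and 1 to the same target, while the greedy one-to-one pass lets the confident pair claim that target first and pushes source 1 to its second choice.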

Accelerating Needleman-Wunsch global alignment algorithm with GPUs

2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA) ◽

10.1109/aiccsa.2015.7507113 ◽

2015 ◽

Cited By ~ 7

Author(s):

Maged Fakirah ◽

Mohammed A. Shehab ◽

Yaser Jararweh ◽

Mahmoud Al-Ayyoub

Keyword(s):

Global Alignment ◽

Alignment Algorithm

Sequence Alignment on Directed Graphs

10.1101/124941 ◽

2017 ◽

Cited By ~ 2

Author(s):

Kavya Vaddadi ◽

Naveen Sivadasan ◽

Kshitij Tayal ◽

Rajgopal Srinivasan

Keyword(s):

Blow Up ◽

Directed Graphs ◽

Input Sequence ◽

Directed Acyclic Graphs ◽

Sequence Length ◽

Alignment Algorithm ◽

Input Graph ◽

Worst Case ◽

Gapped Alignment ◽

Vertex Set

Genomic variations in a reference collection are naturally represented as genome variation graphs. Such graphs encode common subsequences as vertices, and the variations are captured using additional vertices and directed edges. The resulting graphs are directed graphs, possibly with cycles. Existing algorithms for aligning sequences on such graphs make use of partial order alignment (POA) techniques that work on directed acyclic graphs (DAG). For this, acyclic extensions of the input graphs are first constructed through expensive loop unrolling steps (DAGification). Such graph extensions can also blow up considerably in size, and in the worst case the blow-up factor is proportional to the input sequence length. We provide a novel alignment algorithm, V-ALIGN, that aligns the input sequence directly on the input graph while avoiding such expensive DAGification steps. V-ALIGN is based on a novel dynamic programming formulation that allows gapped alignment directly on the input graph. It supports affine and linear gaps. We also propose refinements to V-ALIGN for better performance in practice. With these refinements, the time to fill the DP table depends linearly on the sizes of the sequence, the graph and its feedback vertex set. We perform experiments to compare against the POA-based alignment. For aligning short sequences, standard approaches restrict the expensive gapped alignment to small filtered subgraphs having high ‘similarity’ to the input sequence. In such cases, the performance of V-ALIGN for gapped alignment on the filtered subgraph depends on the subgraph sizes.

Acceleration of Nucleotide Semi-Global Alignment with Adaptive Banded Dynamic Programming

10.1101/130633 ◽

2017 ◽

Cited By ~ 9

Author(s):

Hajime Suzuki ◽

Masahiro Kasahara

Keyword(s):

Dynamic Programming ◽

Single Molecule ◽

Computation Time ◽

Error Rates ◽

Nucleotide Sequences ◽

Sequencing Error ◽

Local Alignment ◽

Global Alignment ◽

Alignment Algorithm ◽

Short Read

Motivation: Pairwise alignment of nucleotide sequences has previously been carried out using the seed-and-extend strategy, where we enumerate seeds (shared patterns) between sequences and then extend the seeds by Smith-Waterman-like semi-global dynamic programming to obtain full pairwise alignments. With the advent of massively parallel short read sequencers, algorithms and data structures for efficiently finding seeds have been extensively explored. However, recent advances in single-molecule sequencing technologies have enabled us to obtain millions of reads, each of which is orders of magnitude longer than those output by the short-read sequencers, demanding a faster algorithm for the extension step that accounts for most of the computation time required for pairwise local alignment. Our goal is to design a faster extension algorithm suitable for single-molecule sequencers with high sequencing error rates (e.g., 10-15%) and with more frequent insertions and deletions than substitutions. Results: We propose an adaptive banded dynamic programming algorithm for calculating pairwise semi-global alignment of nucleotide sequences that allows a relatively high insertion or deletion rate while keeping band width relatively low (e.g., 32 or 64 cells) regardless of sequence lengths. Our new algorithm eliminated mutual dependences between elements in a vector, allowing an efficient Single-Instruction-Multiple-Data parallelization. We experimentally demonstrate that our algorithm runs approximately 5× faster than the extension alignment algorithm in NCBI BLAST+ while retaining similar sensitivity (recall). We also show that our extension algorithm is more sensitive than the extension alignment routine in DALIGNER, while the computation time is comparable. Availability: The implementation of the algorithm and the benchmarking scripts are available at https://github.com/ocxtal/[email protected]

GLobal Alignment Tool (GLAT) – A Proposed Protein Alignment Algorithm

Advances in Intelligent and Soft Computing - Proceedings of the International Conference on Soft Computing for Problem Solving (SocProS 2011) December 20-22, 2011 ◽

10.1007/978-81-322-0491-6_81 ◽

2012 ◽

pp. 885-890

Author(s):

Samarjeet Borah ◽

Krishna Bikram Shah

Keyword(s):

Global Alignment ◽

Alignment Algorithm ◽

Protein Alignment ◽

Alignment Tool
