Bit-parallel sequence-to-graph alignment

2018 ◽  
Author(s):  
Mikko Rautiainen ◽  
Veli Mäkinen ◽  
Tobias Marschall

Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction, and variant calling with respect to a variation graph. Here, we generalize two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers’ bitvector algorithm for semi-global alignment. These linear algorithms are both based on processing w sequence characters with a constant number of operations, where w is the word size of the machine (commonly 64), and achieve a speedup of w over naive algorithms. Our bitvector-based graph alignment algorithm reaches a worst case runtime of O(V + ⌈m/w⌉E log w) for acyclic graphs and O(V + mE log w) for arbitrary cyclic graphs. We apply it to four different types of graphs and observe a speedup between 3.1-fold and 10.1-fold compared to previous algorithms.
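As an illustration of the linear building block named in this abstract, the following is a minimal sketch of the classic Shift-And algorithm for exact matching of a pattern against a plain sequence. It is not the graph generalization the authors describe. The function name `shift_and` is an assumption made for this example, and a Python integer stands in for the machine word, so the sketch shows the bit logic rather than the w-fold speedup.

```python
def shift_and(pattern, text):
    """Return the start positions of all exact occurrences of pattern in text.

    Illustrative sketch of linear Shift-And (not the graph version).
    Bit i of `state` is set when pattern[:i + 1] matches the text ending
    at the current position.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")

    # One bitmask per character: bit i is set where pattern[i] == character.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    accept = 1 << (m - 1)  # bit that signals a full match
    state = 0
    hits = []
    for j, ch in enumerate(text):
        # Extend every partial match by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_and("aba", "ababa"))
```

Each text character costs a shift, an OR, and an AND, regardless of how many partial matches are active, which is the source of the word-level parallelism the abstract refers to.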

2019 ◽  
Vol 35 (19) ◽  
pp. 3599-3607 ◽  
Author(s):  
Mikko Rautiainen ◽  
Veli Mäkinen ◽  
Tobias Marschall

Motivation: Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction and variant calling with respect to a variation graph. Results: We generalize two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers’ bitvector algorithm for semi-global alignment. These linear algorithms are both based on processing w sequence characters with a constant number of operations, where w is the word size of the machine (commonly 64), and achieve a speedup of up to w over naive algorithms. For a graph with |V| nodes and |E| edges and a sequence of length m, our bitvector-based graph alignment algorithm reaches a worst case runtime of O(|V|+⌈m/w⌉|E| log w) for acyclic graphs and O(|V|+m|E| log w) for arbitrary cyclic graphs. We apply it to five different types of graphs and observe a speedup between 3-fold and 20-fold compared with a previous (asymptotically optimal) alignment algorithm. Availability and implementation: https://github.com/maickrau/GraphAligner. Supplementary information: Supplementary data are available at Bioinformatics online.
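The second linear algorithm mentioned in this abstract, Myers’ bitvector algorithm, can be sketched for plain sequences as follows. This is the textbook sequence-to-sequence formulation, written in the commonly used form due to Hyyrö, and not the authors' graph extension. The function name `myers_semi_global` is an assumption for this example, and the whole pattern is held in one Python integer rather than in blocks of w bits.

```python
def myers_semi_global(pattern, text):
    """Bit-parallel semi-global edit distance on plain sequences.

    Returns a list whose j-th entry is the minimum edit distance between
    `pattern` and any substring of `text` that ends at text[j].
    Illustrative sketch; a Python int emulates a machine word of m bits.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1   # keeps every value within m bits
    high = 1 << (m - 1)   # bit of the last pattern row

    # peq[ch] has bit i set where pattern[i] == ch.
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    pv, mv = mask, 0      # vertical deltas: +1 everywhere in column 0
    score = m             # value of the bottom cell of the current column
    scores = []
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal delta +1
        mh = pv & xh                    # horizontal delta -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shifting in 0 makes the top row free, i.e. the match may start anywhere.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        scores.append(score)
    return scores


if __name__ == "__main__":
    print(min(myers_semi_global("GATTACA", "CCGATTTACAGG")))
```

Every column of the dynamic programming matrix is updated with a fixed number of word operations, which is where the factor ⌈m/w⌉ in the acyclic bound comes from once the pattern is split into w-bit blocks.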


2017 ◽  
Author(s):  
Mikko Rautiainen ◽  
Tobias Marschall

Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction, and variant calling with respect to a variation graph. Given the wide range of applications of this basic problem, it is surprising that algorithms with optimal runtime are, to the best of our knowledge, yet unknown. In particular, aligning sequences to cyclic graphs currently represents a challenge both in theory and practice. Here, we introduce an algorithm to compute the minimum edit distance of a sequence of length m to any path in a node-labeled directed graph (V, E) in O(|V |+m|E|) time and O(|V |) space. The corresponding alignment can be obtained in the same runtime using space. The time complexity depends only on the length of the sequence and the size of the graph. In particular, it does not depend on the cyclicity of the graph, or any other topological features.
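To make the problem statement concrete, here is a naive baseline for the minimum edit distance between a sequence and any path in a node-labeled directed graph, cycles included. It is a sketch under stated assumptions and not the algorithm of this paper: each node carries a single character, the path must contain at least one node, and deletions of graph characters are propagated by repeated edge relaxation, so its running time is far from the O(|V|+m|E|) bound claimed above. The function name `graph_edit_distance` is hypothetical.

```python
def graph_edit_distance(labels, edges, seq):
    """Minimum edit distance from `seq` to any non-empty path in the graph.

    labels: one character per node (assumption: single-character labels).
    edges:  list of (u, v) pairs, cycles allowed.
    Naive row-by-row dynamic programming, not the paper's algorithm.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("graph must contain at least one node")
    preds = [[] for _ in range(n)]
    for u, v in edges:
        preds[v].append(u)

    # prev[v]: cost of aligning seq[:j] to a path that ends at node v.
    # Row 0: the path is the single node v, whose character is deleted.
    prev = [1] * n
    for j in range(1, len(seq) + 1):
        ch = seq[j - 1]
        cur = []
        for v in range(n):
            sub = 0 if labels[v] == ch else 1
            # j - 1 covers a path starting at v after inserting seq[:j - 1].
            diag = min([j - 1] + [prev[u] for u in preds[v]])
            cur.append(min(diag + sub,    # match or mismatch
                           prev[v] + 1,   # insert seq character after v
                           j + 1))        # insert seq[:j], then delete v
        # Deleting graph characters can chain along edges, including
        # around cycles, so relax until nothing changes.
        changed = True
        while changed:
            changed = False
            for u, v in edges:
                if cur[u] + 1 < cur[v]:
                    cur[v] = cur[u] + 1
                    changed = True
        prev = cur
    return min(prev)


if __name__ == "__main__":
    # Two-node cycle A -> B -> A spells ABABAB... along its paths.
    print(graph_edit_distance("AB", [(0, 1), (1, 0)], "ABBAB"))
```

The inner relaxation loop is the step that makes cyclic graphs awkward for a row-wise computation.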


2010 ◽  
Vol 08 (02) ◽  
pp. 181-198 ◽  
Author(s):  
RAJIB SENGUPTA ◽  
DHUNDY R. BASTOLA ◽  
HESHAM H. ALI

Restriction Fragment Length Polymorphism (RFLP) is a powerful molecular tool that is extensively used in the molecular fingerprinting and epidemiological studies of microorganisms. In a wet-lab setting, the DNA is cut with one or more restriction enzymes and subjected to gel electrophoresis to obtain signature fragment patterns, which is utilized in the classification and identification of organisms. This wet-lab approach may not be practical when the experimental data set includes a large number of genetic sequences and a wide pool of restriction enzymes to choose from. In this study, we introduce a novel concept of Enzyme Cut Order — a biological property-based characteristic of DNA sequences which can be defined and analyzed computationally without any alignment algorithm. In this alignment-free approach, a similarity matrix is developed based on the pairwise Longest Common Subsequences (LCS) of the Enzyme Cut Orders. The choice of an ideal set of restriction enzymes used for analysis is augmented by using genetic algorithms. The results obtained from this approach using internal transcribed spacer regions of rDNA from fungi as the target sequence show that the phylogenetically-related organisms form a single cluster and successful grouping of phylogenetically close or distant organisms is dependent on the choice of restriction enzymes used in the analysis. Additionally, comparison of trees obtained with this alignment-free and the legacy method revealed highly similar tree topologies. This novel alignment-free method, which utilizes the Enzyme Cut Order and restriction enzyme profile, is a reliable alternative to local or global alignment-based classification and identification of organisms.
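The similarity matrix described here rests on pairwise Longest Common Subsequences, which can be sketched as below. The abstract does not specify the representation, so the example assumes an Enzyme Cut Order is a list of enzyme names in cutting order; the cut orders shown are invented for illustration, and the matrix holds raw LCS lengths because no normalization is stated.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_matrix(cut_orders):
    """Pairwise LCS lengths between all cut orders (raw, unnormalized)."""
    return [[lcs_length(a, b) for b in cut_orders] for a in cut_orders]


if __name__ == "__main__":
    # Hypothetical cut orders for three sequences (made-up data).
    orders = [
        ["EcoRI", "HindIII", "BamHI", "EcoRI"],
        ["HindIII", "EcoRI", "BamHI"],
        ["BamHI", "BamHI"],
    ]
    for row in lcs_matrix(orders):
        print(row)
```

Because the comparison works on the order of cut sites rather than on nucleotides, no sequence alignment is involved at any step.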


2020 ◽  
Vol 34 (05) ◽  
pp. 9354-9361
Author(s):  
Kun Xu ◽  
Linfeng Song ◽  
Yansong Feng ◽  
Yan Song ◽  
Dong Yu

Existing entity alignment methods mainly vary on the choices of encoding the knowledge graph, but they typically use the same decoding method, which independently chooses the local optimal match for each source entity. This decoding method may not only cause the “many-to-one” problem but also neglect the coordinated nature of this task, that is, each alignment decision may highly correlate to the other decisions. In this paper, we introduce two coordinated reasoning methods, i.e., the Easy-to-Hard decoding strategy and joint entity alignment algorithm. Specifically, the Easy-to-Hard strategy first retrieves the model-confident alignments from the predicted results and then incorporates them as additional knowledge to resolve the remaining model-uncertain alignments. To achieve this, we further propose an enhanced alignment model that is built on the current state-of-the-art baseline. In addition, to address the many-to-one problem, we propose to jointly predict entity alignments so that the one-to-one constraint can be naturally incorporated into the alignment prediction. Experimental results show that our model achieves the state-of-the-art performance and our reasoning methods can also significantly improve existing baselines.
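The many-to-one problem and the effect of a one-to-one constraint can be illustrated with a small sketch. This is a simple greedy stand-in and not the Easy-to-Hard strategy or the joint alignment model proposed in the paper: it assumes a precomputed score matrix, and the function names and scores are invented for the example.

```python
def independent_decode(scores):
    """Pick the best target for each source entity on its own (local argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def greedy_one_to_one(scores):
    """Accept the most confident pairs first; never reuse a source or target.

    Simple greedy stand-in for a one-to-one constrained decoder.
    """
    pairs = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        reverse=True,
    )
    used_src, used_tgt, match = set(), set(), {}
    for _, i, j in pairs:
        if i not in used_src and j not in used_tgt:
            match[i] = j
            used_src.add(i)
            used_tgt.add(j)
    return match


if __name__ == "__main__":
    # Hypothetical similarity scores: rows are source entities, columns targets.
    scores = [
        [0.9, 0.1, 0.0],
        [0.8, 0.7, 0.1],
        [0.2, 0.3, 0.6],
    ]
    print(independent_decode(scores))  # sources 0 and 1 both pick target 0
    print(greedy_one_to_one(scores))
```

In this toy input the independent decoder maps two sources to the same target, while the constrained decoder resolves the less confident source only after the confident pair has been fixed.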


2017 ◽  
Author(s):  
Kavya Vaddadi ◽  
Naveen Sivadasan ◽  
Kshitij Tayal ◽  
Rajgopal Srinivasan

Genomic variations in a reference collection are naturally represented as genome variation graphs. Such graphs encode common subsequences as vertices and the variations are captured using additional vertices and directed edges. The resulting graphs are directed graphs possibly with cycles. Existing algorithms for aligning sequences on such graphs make use of partial order alignment (POA) techniques that work on directed acyclic graphs (DAG). For this, acyclic extensions of the input graphs are first constructed through expensive loop unrolling steps (DAGification). Also, such graph extensions could have considerable blow up in their size and in the worst case the blow up factor is proportional to the input sequence length. We provide a novel alignment algorithm V-ALIGN that aligns the input sequence directly on the input graph while avoiding such expensive DAGification steps. V-ALIGN is based on a novel dynamic programming formulation that allows gapped alignment directly on the input graph. It supports affine and linear gaps. We also propose refinements to V-ALIGN for better performance in practice. In this, the time to fill the DP table has linear dependence on the sizes of the sequence, the graph and its feedback vertex set. We perform experiments to compare against the POA based alignment. For aligning short sequences, standard approaches restrict the expensive gapped alignment to small filtered subgraphs having high ‘similarity’ to the input sequence. In such cases, the performance of V-ALIGN for gapped alignment on the filtered subgraph depends on the subgraph sizes.


2017 ◽  
Author(s):  
Hajime Suzuki ◽  
Masahiro Kasahara

Motivation: Pairwise alignment of nucleotide sequences has previously been carried out using the seed-and-extend strategy, where we enumerate seeds (shared patterns) between sequences and then extend the seeds by Smith-Waterman-like semi-global dynamic programming to obtain full pairwise alignments. With the advent of massively parallel short read sequencers, algorithms and data structures for efficiently finding seeds have been extensively explored. However, recent advances in single-molecule sequencing technologies have enabled us to obtain millions of reads, each of which is orders of magnitude longer than those output by the short-read sequencers, demanding a faster algorithm for the extension step that accounts for most of the computation time required for pairwise local alignment. Our goal is to design a faster extension algorithm suitable for single-molecule sequencers with high sequencing error rates (e.g., 10-15%) and with more frequent insertions and deletions than substitutions. Results: We propose an adaptive banded dynamic programming algorithm for calculating pairwise semi-global alignment of nucleotide sequences that allows a relatively high insertion or deletion rate while keeping band width relatively low (e.g., 32 or 64 cells) regardless of sequence lengths. Our new algorithm eliminated mutual dependences between elements in a vector, allowing an efficient Single-Instruction-Multiple-Data parallelization. We experimentally demonstrate that our algorithm runs approximately 5× faster than the extension alignment algorithm in NCBI BLAST+ while retaining similar sensitivity (recall). We also show that our extension algorithm is more sensitive than the extension alignment routine in DALIGNER, while the computation time is comparable. Availability: The implementation of the algorithm and the benchmarking scripts are available at https://github.com/ocxtal/[email protected]
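The idea of restricting dynamic programming to a narrow band can be sketched with a fixed band around the main diagonal. This is a deliberately simpler technique than the adaptive band proposed in the abstract, where the band moves with the alignment; it uses unit edit costs on a global alignment, has no SIMD parallelization, and the name `banded_edit_distance` is an assumption for this example.

```python
def banded_edit_distance(a, b, band):
    """Edit distance computed only for cells with |i - j| <= band.

    Fixed-band sketch (not the adaptive band of the paper). The result is
    exact whenever the true distance is at most `band`; otherwise it is an
    upper bound. Returns None if the end cell lies outside the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = float("inf")

    # Row 0: aligning the empty prefix of a costs j insertions.
    prev = {j: j for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                best = i  # delete the first i characters of a
            else:
                best = min(
                    prev.get(j - 1, inf) + (a[i - 1] != b[j - 1]),  # (mis)match
                    cur.get(j - 1, inf) + 1,                        # insertion
                )
            best = min(best, prev.get(j, inf) + 1)                  # deletion
            cur[j] = best
        prev = cur
    return prev[m]


if __name__ == "__main__":
    print(banded_edit_distance("kitten", "sitting", 3))
```

Only about (2 × band + 1) cells are filled per row, so the work grows with the band width rather than with the product of the two sequence lengths. A fixed band loses the optimal path once insertions and deletions accumulate in one direction, which is the limitation an adaptive band is meant to address.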

