alignment algorithms Latest Research Papers

Improving the time and space complexity of the WFA algorithm and generalizing its scoring

10.1101/2022.01.12.476087 ◽ 2022

Author(s): Jordan M Eizenga ◽ Benedict Paten

Keyword(s): Scoring System ◽ Space Complexity ◽ Genomic Sequencing ◽ Sequencing Data ◽ Time And Space ◽ Alignment Algorithms ◽ Run Time ◽ Modern Genomic ◽ Time And Space Complexity

Modern genomic sequencing data is trending toward longer sequences with higher accuracy. Many analyses using these data will center on alignments, but classical exact alignment algorithms are infeasible for long sequences. The recently proposed WFA algorithm demonstrated how to perform exact alignment for long, similar sequences in O(sN) time and O(s^2) memory, where s is a score that is low for similar sequences (Marco-Sola et al., 2021). However, this algorithm still has infeasible memory requirements for longer sequences. Also, it uses an alternate scoring system that is unfamiliar to many bioinformaticians. We describe variants of WFA that improve its asymptotic memory use from O(s^2) to O(s^(3/2)) and its asymptotic run time from O(sN) to O(s^2 + N). We expect the reduction in memory use to be particularly impactful, as it makes it practical to perform highly multithreaded megabase-scale exact alignments in common compute environments. In addition, we show how to fold WFA's alternate scoring into the broader literature on alignment scores.
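
To make the wavefront idea concrete, here is a minimal sketch of the furthest-reaching technique for plain unit-cost edit distance. It is not the gap-affine WFA of Marco-Sola et al., nor the improved variants this abstract describes; the function name `wavefront_edit_distance` and the unit-cost scoring are assumptions chosen for brevity. Because only the score is returned, the sketch keeps a single wavefront in memory; recovering the alignment itself is what requires retaining earlier wavefronts.

```python
def wavefront_edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via furthest-reaching wavefronts (illustrative sketch)."""
    n, m = len(a), len(b)

    def extend(k: int, i: int) -> int:
        # Slide along diagonal k (k = i - j) while characters match, at no cost.
        j = i - k
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return i

    final_diag = n - m                 # diagonal that contains the end cell (n, m)
    wavefront = {0: extend(0, 0)}      # diagonal -> furthest offset into `a`
    score = 0
    while wavefront.get(final_diag) != n:
        score += 1
        nxt = {}
        for k in range(min(wavefront) - 1, max(wavefront) + 2):
            candidates = []
            if k in wavefront:
                candidates.append(wavefront[k] + 1)      # mismatch
            if k - 1 in wavefront:
                candidates.append(wavefront[k - 1] + 1)  # skip a character of `a`
            if k + 1 in wavefront:
                candidates.append(wavefront[k + 1])      # skip a character of `b`
            # Discard steps that would leave the alignment matrix.
            valid = [i for i in candidates if 0 <= i <= n and 0 <= i - k <= m]
            if valid:
                nxt[k] = extend(k, max(valid))
        wavefront = nxt
    return score


print(wavefront_edit_distance("kitten", "sitting"))  # 3
```

Each round touches at most 2s + 1 diagonals, which is why the work grows with the score s rather than with the full N x N matrix when the sequences are similar.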

LaRA 2: parallel and vectorized program for sequence–structure alignment of RNA sequences

BMC Bioinformatics ◽ 10.1186/s12859-021-04532-7 ◽ 2022 ◽ Vol 23 (1)

Author(s): Jörg Winkler ◽ Gianvito Urgese ◽ Elisa Ficarra ◽ Knut Reinert

Keyword(s): Structural Information ◽ Structural Alignment ◽ Lower Boundary ◽ Secondary Structures ◽ Parallel Execution ◽ Task Demands ◽ Structure Alignment ◽ Rna Sequences ◽ Genomic Databases ◽ Alignment Algorithms

Background: The function of non-coding RNA sequences is largely determined by their spatial conformation, namely the secondary structure of the molecule, formed by Watson–Crick interactions between nucleotides. Hence, modern RNA alignment algorithms routinely take structural information into account. In order to discover yet unknown RNA families and infer their possible functions, the structural alignment of RNAs is an essential task. This task demands a lot of computational resources, especially for aligning many long sequences, and it therefore requires efficient algorithms that utilize modern hardware when available. A subset of the secondary structures contains overlapping interactions (called pseudoknots), which add additional complexity to the problem and are often ignored in available software.

Results: We present the SeqAn-based software LaRA 2 that is significantly faster than comparable software for accurate pairwise and multiple alignments of structured RNA sequences. In contrast to other programs our approach can handle arbitrary pseudoknots. As an improved re-implementation of the LaRA tool for structural alignments, LaRA 2 uses multi-threading and vectorization for parallel execution and a new heuristic for computing a lower boundary of the solution. Our algorithmic improvements yield a program that is up to 130 times faster than the previous version.

Conclusions: With LaRA 2 we provide a tool to analyse large sets of RNA secondary structures in relatively short time, based on structural alignment. The produced alignments can be used to derive structural motifs for the search in genomic databases.

Align, then memorise: the dynamics of learning with feedback alignment

Journal of Physics A Mathematical and Theoretical ◽ 10.1088/1751-8121/ac411b ◽ 2021

Author(s): Maria Refinetti ◽ Stéphane d'Ascoli ◽ Ruben Ohana ◽ Sebastian Goldt

Keyword(s): Neural Networks ◽ Deep Neural Networks ◽ State Of The Art ◽ Simple Explanation ◽ Low Loss ◽ Convolutional Networks ◽ Linear Networks ◽ Alignment Algorithms ◽ Direct Feedback ◽ The Impact

Direct Feedback Alignment (DFA) is emerging as an efficient and biologically plausible alternative to backpropagation for training deep neural networks. Despite relying on random feedback weights for the backward pass, DFA successfully trains state-of-the-art models such as Transformers. On the other hand, it notoriously fails to train convolutional networks. An understanding of the inner workings of DFA to explain these diverging results remains elusive. Here, we propose a theory of feedback alignment algorithms. We first show that learning in shallow networks proceeds in two steps: an alignment phase, where the model adapts its weights to align the approximate gradient with the true gradient of the loss function, is followed by a memorisation phase, where the model focuses on fitting the data. This two-step process has a degeneracy breaking effect: out of all the low-loss solutions in the landscape, a network trained with DFA naturally converges to the solution which maximises gradient alignment. We also identify a key quantity underlying alignment in deep linear networks: the conditioning of the alignment matrices. The latter enables a detailed understanding of the impact of data structure on alignment, and suggests a simple explanation for the well-known failure of DFA to train convolutional neural networks. Numerical experiments on MNIST and CIFAR10 clearly demonstrate degeneracy breaking in deep non-linear networks and show that the align-then-memorize process occurs sequentially from the bottom layers of the network to the top.
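
The notion of "gradient alignment" used above can be illustrated with a small sketch. It compares the hidden-layer error signal from backpropagation, which uses the transpose of the output weights, with the DFA signal, which uses a fixed feedback matrix instead, and reports their cosine similarity. The two-layer tanh network, the squared-error loss, and every name and number here (`hidden_gradients`, `W1`, `W2`, `B_random`) are illustrative assumptions, not the authors' setup.

```python
import math


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def hidden_gradients(W1, W2, B, x, target):
    """Return (backprop, DFA) error signals at the hidden layer for one sample.

    Assumed network: h = tanh(W1 x), y = W2 h, loss = 0.5 * ||y - target||^2.
    """
    h = [math.tanh(z) for z in matvec(W1, x)]
    y = matvec(W2, h)
    error = [yi - ti for yi, ti in zip(y, target)]
    dtanh = [1.0 - hi * hi for hi in h]
    # Backpropagation sends the error back through the transposed forward weights.
    bp = [g * d for g, d in zip(matvec(transpose(W2), error), dtanh)]
    # DFA sends the same error back through a fixed feedback matrix B.
    dfa = [g * d for g, d in zip(matvec(B, error), dtanh)]
    return bp, dfa


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy weights: 2 inputs, 3 hidden units, 1 output.
W1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W2 = [[0.3, -0.1, 0.2]]
B_random = [[0.7], [0.4], [-0.6]]   # stands in for a random feedback matrix
x, target = [1.0, 2.0], [0.5]

bp, dfa = hidden_gradients(W1, W2, B_random, x, target)
print("gradient alignment:", cosine(bp, dfa))
```

When `B` equals the transpose of `W2`, the two signals coincide and the alignment is exactly 1; during the alignment phase described above it is the forward weights that adapt so that this quantity increases.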

Lossless Indexing with Counting de Bruijn Graphs

10.1101/2021.11.09.467907 ◽ 2021

Author(s): Mikhail Karasikov ◽ Harun Mustafa ◽ Gunnar Rätsch ◽ André Kahles

Keyword(s): High Throughput Sequencing ◽ Sparse Matrices ◽ Rna Seq ◽ Sequencing Data ◽ De Bruijn Graphs ◽ High Throughput Sequencing Data ◽ Alignment Algorithms ◽ Compressed Data Structures ◽ De Bruijn ◽ Public Repositories

High-throughput sequencing data is rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in solving the experiment discovery problem and building compressed representations of annotated de Bruijn graphs where k-mer sets can be efficiently indexed and interactively queried. However, approaches for representing and retrieving other quantitative attributes such as gene expression or genome positions in a general manner have yet to be developed. In this work, we propose the concept of Counting de Bruijn graphs generalizing the notion of annotated (or colored) de Bruijn graphs. Counting de Bruijn graphs supplement each node-label relation with one or many attributes (e.g., a k-mer count or its positions in genome). To represent them, we first observe that many schemes for the representation of compressed binary matrices already support the rank operation on the columns or rows, which can be used to define an inherent indexing of any additional quantitative attributes. Based on this property, we generalize these schemes and introduce a new approach for representing non-binary sparse matrices in compressed data structures. Finally, we notice that relation attributes are often easily predictable from a node's local neighborhood in the graph. Notable examples are genome positions shifting by 1 for neighboring nodes in the graph, or expression levels that are often shared across neighbors. We exploit this regularity of graph annotations and apply an invertible delta-like coding to achieve better compression. We show that Counting de Bruijn graphs index k-mer counts from 2,652 human RNA-Seq read sets in representations over 8-fold smaller and yet faster to query compared to state-of-the-art bioinformatics tools. 
Furthermore, Counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip -9 for human Illumina RNA-Seq and 57% smaller for PacBio HiFi sequencing of viral samples. A complete joint searchable index of all viral PacBio SMRT reads from NCBI's SRA (152,884 read sets, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, they generate a lossless and fully queryable index that is 4.4-fold smaller compared to the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools employing de Bruijn graphs and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed and fully searchable graph-based sequence indexes.
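
The observation that genome positions shift by 1 between neighboring nodes suggests why a delta-like code compresses well. The sketch below is a simplification under stated assumptions: it treats the nodes as a single linear path rather than a graph neighborhood, and the function names and the `expected_shift` parameter are hypothetical, not taken from the paper. Each position is stored as its deviation from "previous position plus one", so predictable runs collapse to zeros, and the transform is exactly invertible.

```python
def delta_encode(positions, expected_shift=1):
    """Store the first position, then each deviation from the predicted next position."""
    if not positions:
        return []
    codes = [positions[0]]
    for prev, cur in zip(positions, positions[1:]):
        codes.append(cur - prev - expected_shift)
    return codes


def delta_decode(codes, expected_shift=1):
    """Invert delta_encode exactly."""
    positions = []
    for idx, code in enumerate(codes):
        if idx == 0:
            positions.append(code)
        else:
            positions.append(positions[-1] + expected_shift + code)
    return positions


# Hypothetical positions of consecutive k-mers along a path; one jump at 103 -> 250.
path_positions = [100, 101, 102, 103, 250, 251]
encoded = delta_encode(path_positions)
print(encoded)  # [100, 0, 0, 0, 146, 0]
assert delta_decode(encoded) == path_positions
```

The long runs of zeros are what a downstream compressed representation can exploit; the sketch itself performs no compression.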

Russian-English dataset and comparative analysis of algorithms for cross-language embedding-based entity alignment

Journal of Physics Conference Series ◽ 10.1088/1742-6596/2099/1/012023 ◽ 2021 ◽ Vol 2099 (1) ◽ pp. 012023

Author(s): V A Gnezdilova ◽ Z V Apanovich

Keyword(s): Comparative Analysis ◽ Analysis Of Algorithms ◽ Main Step ◽ Data Bases ◽ Future Directions ◽ Identity Resolution ◽ Alignment Algorithms ◽ Alignment Problem ◽ Knowledge Graphs ◽ Cross Language

The problem of fusing data from databases and knowledge graphs in different languages is becoming increasingly important. The main step of such a fusion is the identification of equivalent entities in different knowledge graphs and the merging of their descriptions. This problem is known as the identity resolution, or entity alignment, problem. Recently, a large group of new entity alignment methods has emerged. They compute so-called "embeddings" of entities and establish the equivalence of entities by comparing their embeddings. This paper presents experiments with embedding-based entity alignment algorithms on a Russian-English dataset. The purpose of this work is to identify language-specific features of the entity alignment algorithms. Also, future directions of research are outlined.
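
Establishing equivalence "by comparing embeddings" can be sketched as a nearest-neighbor search under cosine similarity. This is a minimal illustration, assuming both knowledge graphs have already been embedded into one shared vector space; the entity identifiers, the two-dimensional vectors, and the function `align_entities` are made up for the example and do not come from the paper or its dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_entities(source, target):
    """Map each source entity to the target entity with the most similar embedding."""
    return {
        s_id: max(target, key=lambda t_id: cosine(s_vec, target[t_id]))
        for s_id, s_vec in source.items()
    }


# Hypothetical embeddings in a shared space.
russian_kg = {"ru/Moskva": [0.90, 0.10], "ru/Volga": [0.10, 0.95]}
english_kg = {"en/Moscow": [0.88, 0.15], "en/Volga": [0.05, 0.90]}

print(align_entities(russian_kg, english_kg))
# {'ru/Moskva': 'en/Moscow', 'ru/Volga': 'en/Volga'}
```

Greedy nearest-neighbor matching can assign two source entities to the same target; methods that enforce one-to-one matches need an additional assignment step, which this sketch omits.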

Boosting Graph Alignment Algorithms

10.1145/3459637.3482067 ◽ 2021

Author(s): Alexander Frederiksen Kyster ◽ Simon Daugaard Nielsen ◽ Judith Hermanns ◽ Davide Mottin ◽ Panagiotis Karras

Keyword(s): Graph Alignment ◽ Alignment Algorithms

On the sensitivity of acoustic distance measures to different parameterizations of mel-frequency cepstral coefficients and temporal alignment algorithms

The Journal of the Acoustical Society of America ◽ 10.1121/10.0008572 ◽ 2021 ◽ Vol 150 (4) ◽ pp. A355-A356

Author(s): Charles H. Redmon

Keyword(s): Distance Measures ◽ Mel Frequency Cepstral Coefficients ◽ Temporal Alignment ◽ Alignment Algorithms ◽ Cepstral Coefficients

Characterization of the tracrARN-DRARN genetic complex associated with the CRISPR-Cas9 system of the phytosymbiont Acholeplasma palmae: biotechnological interest

Revista de la Facultad de Agronomía, Universidad del Zulia ◽ 10.47280/revfacagron(luz).v38.n4.13 ◽ 2021 ◽ Vol 38 (4) ◽ pp. 970-992

Author(s): Luis Moncayo ◽ Alex Castro ◽ Diego Arcos ◽ Paulo Centanaro ◽ Diego Vaca ◽ ...

Keyword(s): Gene Editing ◽ Genetic Manipulation ◽ Plant Biotechnology ◽ Data Bank ◽ Molecular Tools ◽ Biophysical Characterization ◽ Guide Rna ◽ Alignment Algorithms ◽ Docking Power

The CRISPR-Cas9 technology used in plant biotechnology relies on Cas9 endonucleases to generate precise cuts in the genome, and on a duplex consisting of a trans-activating CRISPR RNA (tracrRNA) and a CRISPR RNA (DRRNA), which are precursors of the guide RNA (sgRNA) that is commercially redesigned (sgRNA-Cas9) to guide gene cleavage. Most of these tools come from clinical bacteria. However, there are several CRISPR-Cas9 systems in environmental microorganisms such as phytoendosymbionts of plants of the genus Acholeplasma. Exploiting these systems, which are more compatible with plants, requires bioinformatics tools for prediction and study. We identified and characterized the elements associated with the duplex in the genome of A. palmae. For this, the protein information was obtained from the Protein Data Bank and the genomic information from GenBank/NCBI. The CRISPR system was studied with the CRISPRfinder software. Alignment algorithms and NUPACK software were used to identify the tracrRNA and DRRNA modules, together with various computational software for genetic, structural and biophysical characterization. A CRISPR-Cas system with type II-C characteristics was found in A. palmae, as well as a thermodynamically very stable duplex with flexible regions, whose docking with Cas9 is thermodynamically favored. These results are desirable in programmable gene editing systems and show the possibility of exploring native molecular tools in environmental microorganisms applicable to the genetic manipulation of plants, as more research is carried out. This study represents the first report on the thermodynamic stability and molecular docking of elements associated with the tracrRNA-DRRNA duplex in the phytosymbiont A. palmae.

Accurate Sequence-Based Prediction of Deleterious nsSNPs with Multiple Sequence Profiles and Putative Binding Residues

Biomolecules ◽ 10.3390/biom11091337 ◽ 2021 ◽ Vol 11 (9) ◽ pp. 1337

Author(s): Ruiyang Song ◽ Baixin Cao ◽ Zhenling Peng ◽ Christopher J. Oldfield ◽ Lukasz Kurgan ◽ ...

Keyword(s): State Of The Art ◽ Predictive Performance ◽ Machine Learning Algorithms ◽ Nucleotide Polymorphisms ◽ Multiple Sequence ◽ Predictive Quality ◽ Alignment Algorithms ◽ Benchmark Datasets ◽ Binding Residues ◽ Sequence Profiles

Non-synonymous single nucleotide polymorphisms (nsSNPs) may result in pathogenic changes that are associated with human diseases. Accurate prediction of these deleterious nsSNPs is in high demand. The existing predictors of deleterious nsSNPs secure modest levels of predictive performance, leaving room for improvements. We propose a new sequence-based predictor, DMBS, which addresses the need to improve the predictive quality. The design of DMBS relies on the observation that the deleterious mutations are likely to occur at the highly conserved and functionally important positions in the protein sequence. Correspondingly, we introduce two innovative components. First, we improve the estimates of the conservation computed from the multiple sequence profiles based on two complementary databases and two complementary alignment algorithms. Second, we utilize putative annotations of functional/binding residues produced by two state-of-the-art sequence-based methods. These inputs are processed by a random forests model that provides favorable predictive performance when empirically compared against five other machine-learning algorithms. Empirical results on four benchmark datasets reveal that DMBS achieves AUC > 0.94, outperforming current methods, including protein structure-based approaches. In particular, DMBS secures AUC = 0.97 for the SNPdbe and ExoVar datasets, compared to AUC = 0.70 and 0.88, respectively, that were obtained by the best available methods. Further tests on the independent HumVar dataset show that our method significantly outperforms the state-of-the-art method SNPdryad. We conclude that DMBS provides accurate predictions that can effectively guide wet-lab experiments in a high-throughput manner.

Technology dictates algorithms: recent developments in read alignment

Genome Biology ◽ 10.1186/s13059-021-02443-7 ◽ 2021 ◽ Vol 22 (1)

Author(s): Mohammed Alser ◽ Jeremy Rotman ◽ Dhrithi Deshpande ◽ Kodi Taraszka ◽ Huwenbo Shi ◽ ...

Keyword(s): Experimental Evaluation ◽ Genomic Analysis ◽ Computational Algorithms ◽ Read Alignment ◽ Systematic Survey ◽ Essential Step ◽ Technological Advances ◽ Alignment Algorithms ◽ Long Reads ◽ Recent Developments

Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
