alignment algorithms
Recently Published Documents


2022 ◽  
Author(s):  
Jordan M Eizenga ◽  
Benedict Paten

Modern genomic sequencing data is trending toward longer sequences with higher accuracy. Many analyses using these data will center on alignments, but classical exact alignment algorithms are infeasible for long sequences. The recently proposed WFA algorithm demonstrated how to perform exact alignment for long, similar sequences in O(sN) time and O(s^2) memory, where s is a score that is low for similar sequences (Marco-Sola et al., 2021). However, this algorithm still has infeasible memory requirements for longer sequences. Also, it uses an alternate scoring system that is unfamiliar to many bioinformaticians. We describe variants of WFA that improve its asymptotic memory use from O(s^2) to O(s^(3/2)) and its asymptotic run time from O(sN) to O(s^2 + N). We expect the reduction in memory use to be particularly impactful, as it makes it practical to perform highly multithreaded megabase-scale exact alignments in common compute environments. In addition, we show how to fold WFA's alternate scoring into the broader literature on alignment scores.
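
To make the wavefront idea concrete, here is a minimal sketch in Python. It is not the authors' method: it assumes plain unit-cost edit distance instead of WFA's gap-affine scoring, keeps only the most recent wavefront, and does not implement the memory-reduced variants described above. The function name `wavefront_edit_distance` is hypothetical.

```python
def wavefront_edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance computed with diagonal wavefronts.

    Diagonal k holds cells (i, j) with k = i - j, where i indexes `a`
    and j indexes `b`. For each score we store, per diagonal, the
    furthest i reached. Hypothetical helper, simplified from WFA.
    """
    n, m = len(a), len(b)
    final_diag = n - m  # diagonal that contains the end cell (n, m)

    def extend(i: int, k: int) -> int:
        # Slide along diagonal k for free while characters match.
        j = i - k
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return i

    wavefront = {0: extend(0, 0)}
    score = 0
    while wavefront.get(final_diag, -1) < n:
        score += 1
        nxt = {}
        for k in range(-score, score + 1):
            candidates = []
            if k - 1 in wavefront:  # skip one character of a
                candidates.append(wavefront[k - 1] + 1)
            if k in wavefront:      # mismatch: advance in both strings
                candidates.append(wavefront[k] + 1)
            if k + 1 in wavefront:  # skip one character of b
                candidates.append(wavefront[k + 1])
            # Keep only cells that stay inside the alignment matrix.
            valid = [i for i in candidates if i <= n and i - k <= m]
            if valid:
                nxt[k] = extend(max(valid), k)
        wavefront = nxt
    return score


print(wavefront_edit_distance("kitten", "sitting"))
```

Only about 2s + 1 diagonals are touched at score s, which is why the work grows with the score rather than with the full N x N matrix when the sequences are similar.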


2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Jörg Winkler ◽  
Gianvito Urgese ◽  
Elisa Ficarra ◽  
Knut Reinert

Abstract Background: The function of non-coding RNA sequences is largely determined by their spatial conformation, namely the secondary structure of the molecule, formed by Watson–Crick interactions between nucleotides. Hence, modern RNA alignment algorithms routinely take structural information into account. In order to discover yet unknown RNA families and infer their possible functions, the structural alignment of RNAs is an essential task. This task demands a lot of computational resources, especially for aligning many long sequences, and it therefore requires efficient algorithms that utilize modern hardware when available. A subset of the secondary structures contains overlapping interactions (called pseudoknots), which add complexity to the problem and are often ignored in available software. Results: We present the SeqAn-based software LaRA 2 that is significantly faster than comparable software for accurate pairwise and multiple alignments of structured RNA sequences. In contrast to other programs, our approach can handle arbitrary pseudoknots. As an improved re-implementation of the LaRA tool for structural alignments, LaRA 2 uses multi-threading and vectorization for parallel execution and a new heuristic for computing a lower boundary of the solution. Our algorithmic improvements yield a program that is up to 130 times faster than the previous version. Conclusions: With LaRA 2 we provide a tool to analyse large sets of RNA secondary structures in relatively short time, based on structural alignment. The produced alignments can be used to derive structural motifs for the search in genomic databases.


Author(s):  
Maria Refinetti ◽  
Stéphane d'Ascoli ◽  
Ruben Ohana ◽  
Sebastian Goldt

Abstract Direct Feedback Alignment (DFA) is emerging as an efficient and biologically plausible alternative to backpropagation for training deep neural networks. Despite relying on random feedback weights for the backward pass, DFA successfully trains state-of-the-art models such as Transformers. On the other hand, it notoriously fails to train convolutional networks. An understanding of the inner workings of DFA to explain these diverging results remains elusive. Here, we propose a theory of feedback alignment algorithms. We first show that learning in shallow networks proceeds in two steps: an alignment phase, where the model adapts its weights to align the approximate gradient with the true gradient of the loss function, is followed by a memorisation phase, where the model focuses on fitting the data. This two-step process has a degeneracy breaking effect: out of all the low-loss solutions in the landscape, a network trained with DFA naturally converges to the solution which maximises gradient alignment. We also identify a key quantity underlying alignment in deep linear networks: the conditioning of the alignment matrices. The latter enables a detailed understanding of the impact of data structure on alignment, and suggests a simple explanation for the well-known failure of DFA to train convolutional neural networks. Numerical experiments on MNIST and CIFAR10 clearly demonstrate degeneracy breaking in deep non-linear networks and show that the align-then-memorize process occurs sequentially from the bottom layers of the network to the top.
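
The notion of gradient alignment can be illustrated with a small sketch. It assumes a two-layer network with a tanh hidden layer, a linear output, and squared-error loss; the matrix shapes, the helper names (`matvec`, `cosine`, `hidden_deltas`, `random_matrix`) and the toy input are all hypothetical and not taken from the paper. Backpropagation sends the output error back through the transpose of the forward weights `W2`, whereas DFA sends it through a fixed random matrix `B`; the cosine between the two hidden-layer signals measures how well they align.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hidden_deltas(x, y, W1, W2, B):
    """Return the hidden-layer error signal under backprop and under DFA."""
    h = [math.tanh(z) for z in matvec(W1, x)]          # hidden activations
    err = [p - t for p, t in zip(matvec(W2, h), y)]    # output error
    W2_T = [list(col) for col in zip(*W2)]             # transpose of W2
    # tanh'(z) = 1 - tanh(z)^2
    bp = [g * (1 - a * a) for g, a in zip(matvec(W2_T, err), h)]
    dfa = [g * (1 - a * a) for g, a in zip(matvec(B, err), h)]
    return bp, dfa


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
W1 = random_matrix(4, 3, rng)   # input (3) -> hidden (4)
W2 = random_matrix(2, 4, rng)   # hidden (4) -> output (2)
B = random_matrix(4, 2, rng)    # fixed random feedback, output -> hidden
x, y = [0.5, -1.0, 2.0], [1.0, 0.0]

bp, dfa = hidden_deltas(x, y, W1, W2, B)
print("gradient alignment (cosine):", cosine(bp, dfa))
```

If `B` is set to the transpose of `W2`, the two signals coincide and the cosine is 1; for a random `B` the value starts out arbitrary, and the alignment phase described in the abstract is the period in which training pushes it upward.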


2021 ◽  
Author(s):  
Mikhail Karasikov ◽  
Harun Mustafa ◽  
Gunnar Rätsch ◽  
André Kahles

High-throughput sequencing data is rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in solving the experiment discovery problem and building compressed representations of annotated de Bruijn graphs where k-mer sets can be efficiently indexed and interactively queried. However, approaches for representing and retrieving other quantitative attributes such as gene expression or genome positions in a general manner have yet to be developed. In this work, we propose the concept of Counting de Bruijn graphs generalizing the notion of annotated (or colored) de Bruijn graphs. Counting de Bruijn graphs supplement each node-label relation with one or many attributes (e.g., a k-mer count or its positions in genome). To represent them, we first observe that many schemes for the representation of compressed binary matrices already support the rank operation on the columns or rows, which can be used to define an inherent indexing of any additional quantitative attributes. Based on this property, we generalize these schemes and introduce a new approach for representing non-binary sparse matrices in compressed data structures. Finally, we notice that relation attributes are often easily predictable from a node's local neighborhood in the graph. Notable examples are genome positions shifting by 1 for neighboring nodes in the graph, or expression levels that are often shared across neighbors. We exploit this regularity of graph annotations and apply an invertible delta-like coding to achieve better compression. We show that Counting de Bruijn graphs index k-mer counts from 2,652 human RNA-Seq read sets in representations over 8-fold smaller and yet faster to query compared to state-of-the-art bioinformatics tools. 
Furthermore, Counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip -9 for human Illumina RNA-Seq and 57% smaller for PacBio HiFi sequencing of viral samples. A complete joint searchable index of all viral PacBio SMRT reads from NCBI's SRA (152,884 read sets, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, they generate a lossless and fully queryable index that is 4.4-fold smaller compared to the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools employing de Bruijn graphs and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed and fully searchable graph-based sequence indexes.
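
Two of the ideas in this abstract lend themselves to a short sketch: using rank on a binary presence vector to index a separate array of attribute values, and delta-coding attributes that are predictable from neighboring nodes. The code below is an illustration under simplifying assumptions, not the paper's data structure: it uses an uncompressed prefix-sum array in place of a succinct rank structure, and plain differences along a path in place of the paper's delta-like coding. The names `RankIndexedRow`, `delta_encode`, and `delta_decode` are hypothetical.

```python
from itertools import accumulate


class RankIndexedRow:
    """One sparse row: a presence bit per column plus the non-zero values.

    rank(i) counts set bits before position i, which is exactly the index
    of column i's value in the packed `values` list.
    """

    def __init__(self, dense):
        self.bits = [1 if v != 0 else 0 for v in dense]
        self.values = [v for v in dense if v != 0]
        # prefix[i] = number of set bits in bits[:i]
        self.prefix = [0] + list(accumulate(self.bits))

    def rank(self, i):
        return self.prefix[i]

    def get(self, i):
        return self.values[self.rank(i)] if self.bits[i] else 0


def delta_encode(values):
    """Store the first value, then each difference to its predecessor."""
    return [v - p for p, v in zip([0] + values[:-1], values)]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    return list(accumulate(deltas))


row = RankIndexedRow([0, 5, 0, 0, 7, 2])
print(row.get(4))

# Genome positions along a path of neighboring nodes shift by 1,
# so the deltas are mostly 1 and compress well.
positions = [100, 101, 102, 103, 250, 251]
encoded = delta_encode(positions)
print(encoded)
assert delta_decode(encoded) == positions
```

The binary vector can be stored in any compressed scheme that supports rank, which is the property the abstract relies on to attach quantitative attributes to existing annotation formats.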


2021 ◽  
Vol 2099 (1) ◽  
pp. 012023
Author(s):  
V A Gnezdilova ◽  
Z V Apanovich

Abstract The problem of fusing data from databases and knowledge graphs in different languages is becoming increasingly important. The main step of such a fusion is identifying equivalent entities in different knowledge graphs and merging their descriptions. This problem is known as identity resolution, or the entity alignment problem. Recently, a large group of new entity alignment methods has emerged. They compute so-called “embeddings” of entities and establish the equivalence of entities by comparing their embeddings. This paper presents experiments with embedding-based entity alignment algorithms on a Russian-English dataset. The purpose of this work is to identify language-specific features of the entity alignment algorithms. Future directions of research are also outlined.
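
The comparison step of embedding-based entity alignment can be sketched as a nearest-neighbor search. The example assumes the embeddings of both knowledge graphs already live in a shared vector space, which is what the alignment methods are trained to achieve; the entity names and three-dimensional vectors are toy values invented for illustration, and `align_entities` is a hypothetical helper, not an API from any of the evaluated systems.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_entities(source, target):
    """Map each source entity to the target entity with the closest embedding."""
    return {
        s_name: max(target, key=lambda t_name: cosine(s_vec, target[t_name]))
        for s_name, s_vec in source.items()
    }


# Toy embeddings (made-up numbers) for a Russian and an English graph.
russian = {
    "Москва": [0.9, 0.1, 0.0],
    "Лондон": [0.1, 0.8, 0.2],
}
english = {
    "Moscow": [1.0, 0.0, 0.1],
    "London": [0.0, 1.0, 0.1],
    "Paris": [0.1, 0.1, 1.0],
}

print(align_entities(russian, english))
```

Greedy nearest-neighbor matching can assign several source entities to the same target; a one-to-one constraint would need an additional matching step.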


2021 ◽  
Author(s):  
Alexander Frederiksen Kyster ◽  
Simon Daugaard Nielsen ◽  
Judith Hermanns ◽  
Davide Mottin ◽  
Panagiotis Karras

Author(s):  
Luis Moncayo ◽  
Alex Castro ◽  
Diego Arcos ◽  
Paulo Centanaro ◽  
Diego Vaca ◽  
...  

The CRISPR-Cas9 technology used in plant biotechnology is based on the use of Cas9 endonucleases to generate precise cuts in the genome, and a duplex consisting of a trans-activating CRISPR RNA (tracrRNA) and a CRISPR RNA (DRRNA), which are precursors of the guide RNA (sgRNA) commercially redesigned (sgRNA-Cas9) to guide gene cleavage. Most of these tools come from clinical bacteria. However, several CRISPR-Cas9 systems exist in environmental microorganisms, such as phytoendosymbionts of plants of the genus Acholeplasma. Exploiting these systems, which are more compatible with plants, requires bioinformatics tools for prediction and study. We identified and characterized the elements associated with the duplex in the genome of A. palmae. Protein information was obtained from the Protein Data Bank and genomic data from GenBank/NCBI. The CRISPR system was studied with the CRISPRfinder software. Alignment algorithms and NUPACK software were used to identify the tracrRNA and DRRNA modules, together with various computational tools for genetic, structural and biophysical characterization. A CRISPR-Cas system with type II-C characteristics was found in A. palmae, as well as a thermodynamically very stable duplex with flexible regions and thermodynamically favored docking with Cas9. These properties are desirable in programmable gene editing systems and show the possibility of exploring native molecular tools in environmental microorganisms for the genetic manipulation of plants as further research is carried out. This study represents the first report on the thermodynamic stability and molecular docking of elements associated with the tracrRNA-DRRNA duplex in the phytosymbiont A. palmae.


Biomolecules ◽  
2021 ◽  
Vol 11 (9) ◽  
pp. 1337
Author(s):  
Ruiyang Song ◽  
Baixin Cao ◽  
Zhenling Peng ◽  
Christopher J. Oldfield ◽  
Lukasz Kurgan ◽  
...  

Non-synonymous single nucleotide polymorphisms (nsSNPs) may result in pathogenic changes that are associated with human diseases. Accurate prediction of these deleterious nsSNPs is in high demand. The existing predictors of deleterious nsSNPs secure modest levels of predictive performance, leaving room for improvements. We propose a new sequence-based predictor, DMBS, which addresses the need to improve the predictive quality. The design of DMBS relies on the observation that deleterious mutations are likely to occur at highly conserved and functionally important positions in the protein sequence. Correspondingly, we introduce two innovative components. First, we improve the estimates of the conservation computed from the multiple sequence profiles based on two complementary databases and two complementary alignment algorithms. Second, we utilize putative annotations of functional/binding residues produced by two state-of-the-art sequence-based methods. These inputs are processed by a random forests model that provides favorable predictive performance when empirically compared against five other machine-learning algorithms. Empirical results on four benchmark datasets reveal that DMBS achieves AUC > 0.94, outperforming current methods, including protein structure-based approaches. In particular, DMBS secures AUC = 0.97 for the SNPdbe and ExoVar datasets, compared to AUC = 0.70 and 0.88, respectively, that were obtained by the best available methods. Further tests on the independent HumVar dataset show that our method significantly outperforms the state-of-the-art method SNPdryad. We conclude that DMBS provides accurate predictions that can effectively guide wet-lab experiments in a high-throughput manner.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Mohammed Alser ◽  
Jeremy Rotman ◽  
Dhrithi Deshpande ◽  
Kodi Taraszka ◽  
Huwenbo Shi ◽  
...  

Abstract Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today’s diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.

