LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities

Motivation: RNA secondary structure prediction is widely used to understand RNA function. Recently, there has been a shift away from the classical minimum free energy methods to partition function-based methods that account for folding ensembles and can therefore estimate structure and base pair probabilities. However, the classical partition function algorithm scales cubically with sequence length, and is therefore prohibitively slow for long sequences. This slowness is even more severe than that of cubic-time free energy minimization due to a substantially larger constant factor in runtime.

Results: Inspired by the success of our recent LinearFold algorithm, which predicts the approximate minimum free energy structure in linear time, we design a similar linear-time heuristic algorithm, LinearPartition, to approximate the partition function and base-pairing probabilities. It is shown to be orders of magnitude faster than Vienna RNAfold and CONTRAfold (e.g. 2.5 days versus 1.3 min on a sequence of length 32 753 nt). More interestingly, the resulting base-pairing probabilities are even better correlated with the ground-truth structures. LinearPartition also leads to a small accuracy improvement when used for downstream structure prediction on the families with the longest sequences (16S and 23S rRNAs), as well as a substantial improvement on long-distance base pairs (500+ nt apart).

Availability and implementation: Code: http://github.com/LinearFold/LinearPartition; Server: http://linearfold.org/partition.

Supplementary information: Supplementary data are available at Bioinformatics online.
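To make the term "partition function" in this abstract concrete, here is a minimal sketch of the classical cubic-time recursion on a toy energy model. It is not LinearPartition and not the model used by Vienna RNAfold or CONTRAfold: the function name `partition_function`, the single Boltzmann weight `pair_weight` applied to every base pair, and the minimum hairpin loop size `min_loop` are all assumptions chosen for illustration.

```python
import math

# Watson-Crick and wobble pairs allowed in this toy model (an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def partition_function(seq, pair_weight=math.e, min_loop=3):
    """Sum of Boltzmann weights over all nested structures of seq.

    Q[i][j] covers the half-open span seq[i:j]. Each base pair multiplies a
    structure's weight by pair_weight; the unpaired chain has weight 1.
    """
    n = len(seq)
    Q = [[1.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            total = Q[i][j - 1]  # case 1: last base seq[j-1] is unpaired
            # case 2: seq[j-1] pairs with seq[k], enclosing at least min_loop bases
            for k in range(i, j - 1 - min_loop):
                if (seq[k], seq[j - 1]) in PAIRS:
                    total += Q[i][k] * pair_weight * Q[k + 1][j - 1]
            Q[i][j] = total
    return Q[0][n]
```

For the sequence `GAAAC` the only structures are the open chain and the single G-C pair, so the result is `1 + pair_weight`. The three nested loops over `length`, `i` and `k` are the source of the cubic scaling the abstract refers to.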


Sparse RNA folding revisited: space-efficient minimum free energy structure prediction

Algorithms for Molecular Biology ◽ 10.1186/s13015-016-0071-y ◽ 2016 ◽ Vol 11 (1) ◽ Cited By ~ 4

Author(s): Sebastian Will ◽ Hosna Jabbari

Keyword(s): Free Energy ◽ Structure Prediction ◽ RNA Folding ◽ Minimum Free Energy ◽ Energy Structure ◽ Minimum Free Energy Structure


Practicality and Time Complexity of a Sparsified RNA Folding Algorithm

Journal of Bioinformatics and Computational Biology ◽ 10.1142/s0219720012410077 ◽ 2012 ◽ Vol 10 (02) ◽ pp. 1241007 ◽ Cited By ~ 7

Author(s): Slavica Dimitrieva ◽ Philipp Bucher

Keyword(s): Time Complexity ◽ RNA Folding ◽ Minimum Free Energy ◽ Energy Structure ◽ Base Pairing ◽ Time Analysis ◽ Folding Algorithm ◽ Run Time Analysis ◽ Run Time ◽ Standard Energy

Commonly used RNA folding programs compute the minimum free energy structure of a sequence under the pseudoknot exclusion constraint. They are based on Zuker's algorithm, which runs in O(n^3) time. Recently, it has been claimed that RNA folding can be achieved in O(n^2) average time using a sparsification technique. A proof of quadratic time complexity was based on the assumption that computational RNA folding obeys the "polymer-zeta property". Several variants of sparse RNA folding algorithms were later developed. Here, we present our own version, which is readily applicable to existing RNA folding programs, as it is extremely simple and does not require any new data structure. We applied it to the widely used Vienna RNAfold program to create sibRNAfold, the first public sparsified version of a standard RNA folding program. To gain a better understanding of the time complexity of sparsified RNA folding in general, we carried out a thorough run-time analysis with synthetic random sequences, both in the context of energy minimization and base pairing maximization. Contrary to previous claims, the asymptotic time complexity of a sparsified RNA folding algorithm using standard energy parameters remains O(n^3) under a wide variety of conditions. Consistent with our run-time analysis, we found that RNA folding does not obey the "polymer-zeta property" as claimed previously. Yet, a basic version of a sparsified RNA folding algorithm provides a 15- to 50-fold speed gain. Surprisingly, the same sparsification technique has a different effect when applied to base pairing optimization. There, its asymptotic running time complexity appears to be either quadratic or cubic depending on the base composition. The code used in this work is available at: http://sibRNAfold.sourceforge.net/ .
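The "base pairing maximization" setting mentioned in this abstract can be illustrated with a minimal sketch of the unsparsified cubic dynamic program. This is a textbook Nussinov-style recursion, not the sibRNAfold code and not Zuker's full energy model; the function name `max_base_pairs`, the allowed pair set and the `min_loop` parameter are assumptions made for the example.

```python
# Watson-Crick and wobble pairs allowed in this toy model (an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (no pseudoknots).

    N[i][j] is the optimum for the half-open span seq[i:j].
    """
    n = len(seq)
    N = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            best = N[i][j - 1]  # last base seq[j-1] left unpaired
            # Inner loop over split points k: this is the O(n) factor that
            # sparsification techniques try to prune.
            for k in range(i, j - 1 - min_loop):
                if (seq[k], seq[j - 1]) in PAIRS:
                    best = max(best, N[i][k] + 1 + N[k + 1][j - 1])
            N[i][j] = best
    return N[0][n]
```

The table has O(n^2) cells and each cell scans O(n) split points, giving the cubic bound. Sparsification keeps only a subset of candidate split points per cell, and the abstract's finding is that this subset does not shrink enough to change the asymptotic cost under standard energy parameters.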


Minimum free energy predicted base pairing in the 39 nt spliced leader and 5’ UTR of calmodulin mRNA from Trypanosoma cruzi: influence of the multiple trans-splicing sites

Anais da Academia Brasileira de Ciências ◽ 10.1590/0001-3765201720170082 ◽ 2018 ◽ Vol 90 (2 suppl 1) ◽ pp. 2311-2316

Author(s): Franklyn Samudio ◽ Adeilton Brandão

Keyword(s): Free Energy ◽ Trypanosoma Cruzi ◽ Minimum Free Energy ◽ Base Pairing ◽ Spliced Leader ◽ Trans Splicing


LinearTurboFold: Linear-Time RNA Structural Alignment and Conserved Structure Prediction with Applications to Coronaviruses

10.1101/2020.11.23.393488 ◽ 2020

Author(s): Sizhen Li ◽ He Zhang ◽ Liang Zhang ◽ Kaibo Liu ◽ Boxiang Liu ◽ ...

Keyword(s): Structure Prediction ◽ Linear Time ◽ Average Length ◽ Structural Alignment ◽ Scale Up ◽ Sequence Length ◽ Multiple Sequence ◽ Viral Genomes ◽ Conserved Regions ◽ RNA Structural Alignment

Many functional RNA structures are conserved across evolution, and such conserved structures provide critical targets for diagnostics and treatment. TurboFold II is state-of-the-art software that can predict conserved structures and alignments given homologous sequences, but its cubic runtime and quadratic memory usage with sequence length prevent it from being applied to most full-length viral genomes. As the COVID-19 outbreak spreads, there is a growing need for a fast and accurate tool to identify conserved regions of SARS-CoV-2. To address this issue, we present LinearTurboFold, which successfully accelerates TurboFold II without sacrificing accuracy on secondary structure and multiple sequence alignment prediction. LinearTurboFold is orders of magnitude faster than TurboFold II, e.g., 372× faster (12 minutes vs. 3.1 days) on a group of five HIV-1 homologs with average length 9,686 nt. LinearTurboFold is able to scale up to the full sequence of SARS-CoV-2, and identifies conserved structures that have been supported by previous studies. Additionally, LinearTurboFold finds a list of novel conserved regions, including long-range base pairs, which may be useful for better understanding the virus.


LinearSampling: Linear-Time Stochastic Sampling of RNA Secondary Structure with Applications to SARS-CoV-2

10.1101/2020.12.29.424617 ◽ 2020

Author(s): He Zhang ◽ Liang Zhang ◽ Sizhen Li ◽ David Mathews ◽ Liang Huang

Keyword(s): Partition Function ◽ Linear Time ◽ Reference Sequence ◽ Sequence Length ◽ Sampling Algorithm ◽ Stochastic Sampling ◽ Viral Genomes ◽ Multiple Structures ◽ Sampling Algorithms ◽ Current Sampling

Many RNAs fold into multiple structures at equilibrium. The classical stochastic sampling algorithm can sample secondary structures according to their probabilities in the Boltzmann ensemble, and is widely used, e.g., for accessibility prediction. However, the current sampling algorithm, consisting of a bottom-up partition function phase followed by a top-down sampling phase, suffers from three limitations: (a) the formulation and implementation of the sampling phase are unnecessarily complicated; (b) much redundant work is repeatedly performed in the sampling phase; (c) the partition function runtime scales cubically with the sequence length. These issues prevent it from being used for full-length viral genomes such as SARS-CoV-2. To address these problems, we first present a hypergraph framework under which the sampling algorithm can be greatly simplified. We then present three sampling algorithms under this framework, two of which eliminate redundant work in the sampling phase. Finally, we present LinearSampling, an end-to-end linear-time sampling algorithm that is orders of magnitude faster than the standard algorithm. For instance, LinearSampling is 111 times faster (48 s vs. 1.5 h) than Vienna RNAsubopt on the longest sequence in the RNAcentral dataset that RNAsubopt can run (15,780 nt). More importantly, LinearSampling is the first sampling algorithm to scale to the full genome of SARS-CoV-2, taking only 96 seconds on its reference sequence (29,903 nt). It finds 23 regions of 15 nt with high accessibilities, which can potentially be used for COVID-19 diagnostics and drug design.
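The two-phase scheme described in this abstract (a bottom-up partition function phase followed by a top-down sampling phase) can be sketched on a toy model as follows. This is the conventional cubic-table approach, not LinearSampling or its hypergraph framework, and not the RNAsubopt implementation. The names `fill_table` and `sample_structure`, the uniform `pair_weight` per base pair and the `min_loop` constraint are assumptions for illustration only.

```python
import random

# Watson-Crick and wobble pairs allowed in this toy model (an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def fill_table(seq, pair_weight=2.0, min_loop=3):
    """Bottom-up phase: Q[i][j] is the partition function of seq[i:j]."""
    n = len(seq)
    Q = [[1.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            total = Q[i][j - 1]  # last base seq[j-1] unpaired
            for k in range(i, j - 1 - min_loop):
                if (seq[k], seq[j - 1]) in PAIRS:
                    total += Q[i][k] * pair_weight * Q[k + 1][j - 1]
            Q[i][j] = total
    return Q


def sample_structure(seq, Q, rng, pair_weight=2.0, min_loop=3):
    """Top-down phase: draw one dot-bracket structure from the ensemble.

    pair_weight and min_loop must match the values used to build Q.
    """
    structure = ["."] * len(seq)
    stack = [(0, len(seq))]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop + 1:
            continue  # span too short to contain a pair
        # Pick a case with probability proportional to its share of Q[i][j].
        r = rng.random() * Q[i][j] - Q[i][j - 1]
        partner = None
        if r >= 0:
            for k in range(i, j - 1 - min_loop):
                if (seq[k], seq[j - 1]) in PAIRS:
                    partner = k  # last valid k also guards against round-off
                    r -= Q[i][k] * pair_weight * Q[k + 1][j - 1]
                    if r < 0:
                        break
        if partner is None:
            stack.append((i, j - 1))  # seq[j-1] stays unpaired
        else:
            structure[partner] = "("
            structure[j - 1] = ")"
            stack.append((i, partner))          # region left of the pair
            stack.append((partner + 1, j - 1))  # region enclosed by the pair
    return "".join(structure)
```

A call such as `sample_structure(seq, fill_table(seq), random.Random(0))` returns one structure; repeated calls reuse the same table. Note that every sample rescans the candidate split points of each span it visits, which is one form of the redundant work that limitation (b) refers to.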


RNA Secondary Structure Prediction by Minimum Free Energy, 2006; Ogurtsov, Shabalina, Kondrashov, Roytberg

10.1007/springerreference_57866 ◽ 2011

Keyword(s): Free Energy ◽ Secondary Structure ◽ Structure Prediction ◽ RNA Secondary Structure ◽ Secondary Structure Prediction ◽ Minimum Free Energy ◽ RNA Secondary Structure Prediction


Prediction of Minimum Free Energy Structure for Simple Non-standard Pseudoknot

Biomedical Engineering Systems and Technologies - Communications in Computer and Information Science ◽ 10.1007/978-3-642-18472-7_27 ◽ 2011 ◽ pp. 345-355

Author(s): Thomas K. F. Wong ◽ S. M. Yiu

Keyword(s): Free Energy ◽ Minimum Free Energy ◽ Energy Structure ◽ Minimum Free Energy Structure


RNA Secondary Structure Prediction by Minimum Free Energy

Encyclopedia of Algorithms ◽ 10.1007/978-3-642-27848-8_347-2 ◽ 2015 ◽ pp. 1-6

Author(s): Rune B. Lyngsø

Keyword(s): Free Energy ◽ Secondary Structure ◽ Structure Prediction ◽ RNA Secondary Structure ◽ Secondary Structure Prediction ◽ Minimum Free Energy ◽ RNA Secondary Structure Prediction


Mathematical and Biological Modelling of RNA Secondary Structure and Its Effects on Gene Expression

Computational and Mathematical Methods in Medicine ◽ 10.1080/10273660600906416 ◽ 2006 ◽ Vol 7 (1) ◽ pp. 37-43 ◽ Cited By ~ 2

Author(s): T. A. Hughes ◽ J. N. McElwaine

Keyword(s): Gene Expression ◽ Free Energy ◽ Secondary Structure ◽ Structure Prediction ◽ Secondary Structure Prediction ◽ Secondary Structures ◽ Minimum Free Energy ◽ Messenger RNAs ◽ RNA Sequences ◽ Translational Machinery

Secondary structures within the 5′ untranslated regions of messenger RNAs can have profound effects on the efficiency of translation of their messages and thereby on gene expression. Consequently they can act as important regulatory motifs in both physiological and pathological settings. Current approaches to predicting the secondary structure of these RNA sequences find the structure with the global-minimum free energy. However, since RNA folds progressively from the 5′ end when synthesised or released from the translational machinery, this may not be the most probable structure. We discuss secondary structure prediction based on local-minimisation of free energy with thermodynamic fluctuations as nucleotides are added to the 3′ end and show that these can result in different secondary structures. We also discuss approaches for studying the extent of the translational inhibition specified by structures within the 5′ untranslated region.


Machine learning a model for RNA structure prediction

NAR Genomics and Bioinformatics ◽ 10.1093/nargab/lqaa090 ◽ 2020 ◽ Vol 2 (4)

Author(s): Nicola Calonaci ◽ Alisha Jones ◽ Francesca Cuturello ◽ Michael Sattler ◽ Giovanni Bussi

Keyword(s): Free Energy ◽ RNA Structure ◽ Structure Prediction ◽ Secondary Structure Prediction ◽ Structural Information ◽ Minimum Free Energy ◽ Experimental Information ◽ RNA Structures ◽ Chemical Probing ◽ Validation Set

RNA function crucially depends on its structure. Thermodynamic models currently used for secondary structure prediction rely on computing the partition function of folding ensembles, and can thus estimate minimum free-energy structures and ensemble populations. These models sometimes fail to identify native structures unless complemented by auxiliary experimental data. Here, we build a set of models that combine thermodynamic parameters, chemical probing data (DMS and SHAPE) and co-evolutionary data (direct coupling analysis) through a network that outputs perturbations to the ensemble free energy. Perturbations are trained to increase the ensemble populations of a representative set of known native RNA structures. In the chemical probing nodes of the network, a convolutional window combines neighboring reactivities, highlighting their structural information content and the contribution of local conformational ensembles. Regularization is used to limit overfitting and improve transferability. The most transferable model is selected through a cross-validation strategy that estimates the performance of models on systems on which they are not trained. With the selected model we obtain increased ensemble populations for native structures and more accurate predictions in an independent validation set. The flexibility of the approach allows the model to be easily retrained and adapted to incorporate arbitrary experimental information.
