LinearSampling: Linear-Time Stochastic Sampling of RNA Secondary Structure with Applications to SARS-CoV-2

2020. Author(s): He Zhang, Liang Zhang, Sizhen Li, David Mathews, Liang Huang

Many RNAs fold into multiple structures at equilibrium. The classical stochastic sampling algorithm can sample secondary structures according to their probabilities in the Boltzmann ensemble, and is widely used, e.g., for accessibility prediction. However, the current sampling algorithm, consisting of a bottom-up partition function phase followed by a top-down sampling phase, suffers from three limitations: (a) the formulation and implementation of the sampling phase are unnecessarily complicated; (b) much redundant work is repeatedly performed in the sampling phase; (c) the partition function runtime scales cubically with the sequence length. These issues prevent it from being used for full-length viral genomes such as SARS-CoV-2. To address these problems, we first present a hypergraph framework under which the sampling algorithm can be greatly simplified. We then present three sampling algorithms under this framework, two of which eliminate redundant work in the sampling phase. Finally, we present LinearSampling, an end-to-end linear-time sampling algorithm that is orders of magnitude faster than the standard algorithm. For instance, LinearSampling is 111 times faster (48s vs. 1.5h) than Vienna RNAsubopt on the longest sequence in the RNAcentral dataset that RNAsubopt can run (15,780 nt). More importantly, LinearSampling is the first sampling algorithm to scale to the full genome of SARS-CoV-2, taking only 96 seconds on its reference sequence (29,903 nt). It finds 23 regions of 15 nt with high accessibility, which can potentially be used for COVID-19 diagnostics and drug design.
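
As a toy illustration of the two phases described above (a bottom-up partition-function pass followed by a top-down stochastic traceback), the sketch below samples structures under a heavily simplified Nussinov-style model with a single assumed weight per base pair. It is not LinearSampling's algorithm or energy model; the pair weight, hairpin constraint and demo sequence are placeholders.

```python
# Boltzmann sampling sketch under a toy single-pair-weight model (not the
# nearest-neighbor model used by LinearSampling or RNAsubopt).
import random
from functools import lru_cache

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
PAIR_WEIGHT = 3.0   # assumed Boltzmann weight per base pair (illustrative only)

def partition_and_sample(seq, n_samples=5, min_hairpin=3):
    n = len(seq)

    @lru_cache(maxsize=None)
    def Q(i, j):
        # Partition function of subsequence seq[i..j]; empty intervals contribute 1.
        if j - i < 0:
            return 1.0
        total = Q(i, j - 1)  # case: position j unpaired
        for k in range(i, j - min_hairpin):
            if (seq[k], seq[j]) in PAIRS:
                total += Q(i, k - 1) * PAIR_WEIGHT * Q(k + 1, j - 1)
        return total

    def sample(i, j, struct):
        # Top-down stochastic traceback: pick each case with probability
        # proportional to its contribution to Q(i, j).
        if j - i < 0:
            return
        r = random.random() * Q(i, j)
        acc = Q(i, j - 1)
        if r < acc:
            sample(i, j - 1, struct)
            return
        for k in range(i, j - min_hairpin):
            if (seq[k], seq[j]) in PAIRS:
                acc += Q(i, k - 1) * PAIR_WEIGHT * Q(k + 1, j - 1)
                if r < acc:
                    struct[k], struct[j] = "(", ")"
                    sample(i, k - 1, struct)
                    sample(k + 1, j - 1, struct)
                    return

    samples = []
    for _ in range(n_samples):
        struct = ["."] * n
        sample(0, n - 1, struct)
        samples.append("".join(struct))
    return samples

print(partition_and_sample("GGGAAAUCCC"))
```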

2020, Vol. 36 (Supplement_1), pp. i258-i267. Author(s): He Zhang, Liang Zhang, David H. Mathews, Liang Huang

Abstract Motivation RNA secondary structure prediction is widely used to understand RNA function. Recently, there has been a shift away from the classical minimum free energy methods to partition function-based methods that account for folding ensembles and can therefore estimate structure and base pair probabilities. However, the classical partition function algorithm scales cubically with sequence length, and is therefore prohibitively slow for long sequences. This slowness is even more severe than cubic-time free energy minimization due to a substantially larger constant factor in runtime. Results Inspired by the success of our recent LinearFold algorithm that predicts the approximate minimum free energy structure in linear time, we design a similar linear-time heuristic algorithm, LinearPartition, to approximate the partition function and base-pairing probabilities, which is shown to be orders of magnitude faster than Vienna RNAfold and CONTRAfold (e.g. 2.5 days versus 1.3 min on a sequence with length 32 753 nt). More interestingly, the resulting base-pairing probabilities are even better correlated with the ground-truth structures. LinearPartition also leads to a small accuracy improvement when used for downstream structure prediction on families with the longest length sequences (16S and 23S rRNAs), as well as a substantial improvement on long-distance base pairs (500+ nt apart). Availability and implementation Code: http://github.com/LinearFold/LinearPartition; Server: http://linearfold.org/partition. Supplementary information Supplementary data are available at Bioinformatics online.
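
The flavour of a left-to-right, beam-pruned partition function can be illustrated with a minimal sketch: spans are extended incrementally and, at each position, only the b highest-weight spans are kept, giving roughly O(n·b²) time in this toy version. This again uses a single assumed pair weight rather than the nearest-neighbor model, and it is not LinearPartition's actual recursion; the beam size and weights are placeholders.

```python
# Beam-pruned, left-to-right partition function sketch on a toy pair-weight model.
from collections import defaultdict

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
PAIR_WEIGHT = 3.0   # assumed Boltzmann weight per base pair (illustrative)
MIN_HAIRPIN = 3     # minimum number of unpaired bases enclosed by a pair

def linear_partition_sketch(seq, beam=100):
    n = len(seq)
    # Q[j][i] = partition function of the half-open span seq[i:j]
    Q = [defaultdict(float) for _ in range(n + 1)]
    Q[0][0] = 1.0
    for j in range(n):
        for i, q in Q[j].items():
            # SKIP: seq[j] stays unpaired, extending span seq[i:j] to seq[i:j+1]
            Q[j + 1][i] += q
            # POP: seq[j] closes a pair opened at i-1; combine with every outer span
            if i >= 1 and (seq[i - 1], seq[j]) in PAIRS and j - i >= MIN_HAIRPIN:
                for k, q_out in Q[i - 1].items():
                    Q[j + 1][k] += q_out * PAIR_WEIGHT * q
        # PUSH: seq[j] may open a pair to be closed later (empty inner span)
        Q[j + 1][j + 1] += 1.0
        # beam pruning: keep only the `beam` highest-weight spans ending at j+1
        if len(Q[j + 1]) > beam:
            top = sorted(Q[j + 1].items(), key=lambda kv: kv[1], reverse=True)[:beam]
            Q[j + 1] = defaultdict(float, top)
    return Q[n][0]   # partition function of the whole sequence (approximate if pruned)

print(linear_partition_sketch("GGGAAAUCCC"))
```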


2020. Author(s): Sizhen Li, He Zhang, Liang Zhang, Kaibo Liu, Boxiang Liu, et al.

Many functional RNA structures are conserved across evolution, and such conserved structures provide critical targets for diagnostics and treatment. TurboFold II is state-of-the-art software that can predict conserved structures and alignments given homologous sequences, but its cubic runtime and quadratic memory usage with respect to sequence length prevent it from being applied to most full-length viral genomes. As the COVID-19 outbreak spreads, there is a growing need for a fast and accurate tool to identify conserved regions of SARS-CoV-2. To address this issue, we present LinearTurboFold, which accelerates TurboFold II without sacrificing accuracy in secondary structure and multiple sequence alignment prediction. LinearTurboFold is orders of magnitude faster than TurboFold II, e.g., 372× faster (12 minutes vs. 3.1 days) on a group of five HIV-1 homologs with average length 9,686 nt. LinearTurboFold scales to the full sequence of SARS-CoV-2 and identifies conserved structures that are supported by previous studies. Additionally, LinearTurboFold finds a list of novel conserved regions, including long-range base pairs, which may be useful for better understanding the virus.


2000, Vol. 13, pp. 155-188. Author(s): J. Cheng, M. J. Druzdzel

Stochastic sampling algorithms, while an attractive alternative to exact algorithms in very large Bayesian network models, have been observed to perform poorly in evidential reasoning with extremely unlikely evidence. To address this problem, we propose an adaptive importance sampling algorithm, AIS-BN, that shows promising convergence rates even under extreme conditions and seems to outperform the existing sampling algorithms consistently. Three sources of this performance improvement are (1) two heuristics for initialization of the importance function that are based on the theoretical properties of importance sampling in finite-dimensional integrals and the structural advantages of Bayesian networks, (2) a smooth learning method for the importance function, and (3) a dynamic weighting function for combining samples from different stages of the algorithm. We tested the performance of the AIS-BN algorithm along with two state-of-the-art general-purpose sampling algorithms, likelihood weighting (Fung & Chang, 1989; Shachter & Peot, 1989) and self-importance sampling (Shachter & Peot, 1989). In our tests we used three large real Bayesian network models available to the scientific community: the CPCS network (Pradhan et al., 1994), the PathFinder network (Heckerman, Horvitz, & Nathwani, 1990), and the ANDES network (Conati, Gertner, VanLehn, & Druzdzel, 1997), with evidence as unlikely as 10^-41. While the AIS-BN algorithm always performed better than the other two algorithms, in the majority of the test cases it achieved orders-of-magnitude improvement in the precision of the results. The improvement in speed for a desired precision is even more dramatic, although we are unable to report numerical results here, as the other algorithms almost never achieved the precision reached even by the first few iterations of the AIS-BN algorithm.
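
The core idea of adapting the importance function can be sketched on a toy two-node network A → B with unlikely evidence B = 1: samples are drawn from an importance distribution that starts at the prior and is nudged toward the estimated posterior between stages. The network, probabilities and learning rate below are invented for illustration; the real AIS-BN initialization heuristics, smooth learning schedule and dynamic sample weighting are not reproduced.

```python
# Adaptive importance sampling sketch on a two-node network A -> B, evidence B = 1.
import random

P_A = 0.01                       # prior P(A = 1); the evidence is unlikely under it
P_B_GIVEN_A = {1: 0.9, 0: 0.05}  # P(B = 1 | A)

def sample_posterior_A(n_stages=5, n_per_stage=2000, seed=0):
    random.seed(seed)
    q_A = P_A  # importance function for A, initialized to the prior
    estimates = []
    for _ in range(n_stages):
        w_sum, w_sum_A1 = 0.0, 0.0
        for _ in range(n_per_stage):
            a = 1 if random.random() < q_A else 0
            # importance weight = target / proposal, with the evidence likelihood folded in
            prior_a = P_A if a == 1 else 1.0 - P_A
            prop_a = q_A if a == 1 else 1.0 - q_A
            w = (prior_a / prop_a) * P_B_GIVEN_A[a]
            w_sum += w
            w_sum_A1 += w * a
        post = w_sum_A1 / w_sum          # current estimate of P(A = 1 | B = 1)
        estimates.append(post)
        # adapt: move the importance function toward the estimated posterior
        q_A = 0.5 * q_A + 0.5 * post
    return estimates

print(sample_posterior_A())   # stages converge toward the exact value ~0.154
```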


PeerJ, 2018, Vol. 6, e5722. Author(s): Wartini Ng, Budiman Minasny, Brendan Malone, Patrick Filippi

Background The use of visible-near infrared (vis-NIR) spectroscopy for rapid soil characterisation has gained considerable interest in recent years. Soil spectral absorbance from the visible-infrared range can be calibrated using regression models to predict a set of soil properties. The accuracy of these regression models relies heavily on the calibration set. The optimum sample size and the overall representativeness of the calibration samples could further improve model performance. However, there is no guideline on which sampling method should be used for datasets of different sizes. Methods Here, we show that different sampling algorithms perform differently for different dataset sizes and regression models (Cubist regression tree and Partial Least Squares Regression (PLSR)). We analysed the effect of three sampling algorithms, Kennard-Stone (KS), conditioned Latin Hypercube Sampling (cLHS) and k-means clustering (KM), against random sampling on the prediction of up to five soil properties (sand, clay, carbon content, cation exchange capacity and pH) on three datasets. These datasets have different coverages: a European continental dataset (LUCAS, n = 5,639), a regional dataset from Australia (Geeves, n = 379), and a local dataset from New South Wales, Australia (Hillston, n = 384). Calibration sample sizes ranging from 50 to 3,000 were derived and tested for the continental dataset, and from 50 to 200 samples for the regional and local datasets. Results Overall, PLSR gives better predictions than the Cubist model for the various soil properties, and is less sensitive to the choice of sampling algorithm. The KM algorithm is more representative in the larger dataset up to a certain calibration sample size. The KS algorithm appears to be more efficient than random sampling in small datasets; however, its prediction performance varied considerably between soil properties. The cLHS sampling algorithm is the most robust sampling method for multiple soil properties regardless of the sample size. Discussion Our results suggest that the optimum calibration sample size depends on how much generalization the model has to achieve. The use of a sampling algorithm is more beneficial for larger datasets than for smaller ones, where only small improvements can be made. KM is suitable for large datasets, KS is efficient in small datasets but its results can be variable, while cLHS is least affected by sample size.
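
For concreteness, here is a minimal sketch of one of the three selection methods, Kennard-Stone, applied to a spectral matrix X (rows = samples, columns = wavelengths): it seeds the calibration set with the two most distant spectra and then repeatedly adds the sample farthest from its nearest already-selected neighbour. The Euclidean distance and the demo matrix size are assumptions; cLHS and k-means selection are not shown.

```python
# Kennard-Stone calibration-sample selection sketch.
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select calibration samples chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    # full pairwise Euclidean distance matrix (n x n)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    i, j = np.unravel_index(np.argmax(dist), dist.shape)   # two most distant samples
    selected = [i, j]
    remaining = [k for k in range(len(X)) if k not in (i, j)]
    while len(selected) < n_select and remaining:
        # distance of every remaining sample to its nearest selected sample
        d_nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining.pop(int(np.argmax(d_nearest)))
        selected.append(pick)
    return selected

# e.g. pick 50 calibration spectra from a hypothetical 384-sample vis-NIR matrix
X_demo = np.random.rand(384, 500)
print(kennard_stone(X_demo, 50)[:10])
```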


2020, Vol. 498 (3), pp. 4492-4502. Author(s): Rory J. E. Smith, Gregory Ashton, Avi Vajpeyi, Colm Talbot

ABSTRACT Understanding the properties of transient gravitational waves (GWs) and their sources is of broad interest in physics and astronomy. Bayesian inference is the standard framework for astrophysical measurement in transient GW astronomy. Usually, stochastic sampling algorithms are used to estimate posterior probability distributions over the parameter spaces of models describing the experimental data. The most physically accurate models typically come with a large computational overhead, which can render data analysis extremely time-consuming, or even prohibitive. In some cases, highly specialized optimizations can mitigate these issues, though they can be difficult to implement and to generalize to arbitrary models of the data. Here, we investigate an accurate, flexible, and scalable method for astrophysical inference: parallelized nested sampling. The reduction in the wall time of inference scales almost linearly with the number of parallel processes running on a high-performance computing cluster. By utilizing a pool of several hundred or thousand CPUs in a high-performance cluster, the large wall times of many astrophysical inferences can be alleviated while simultaneously ensuring that any GW signal model can be used ‘out of the box’, i.e. without additional optimization or approximation. Our method will be useful to both the LIGO-Virgo-KAGRA collaborations and the wider scientific community performing astrophysical analyses on GWs. An implementation is available in the open source gravitational-wave inference library pBilby (parallel bilby).
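
To make the parallelism concrete, the sketch below runs a naive nested-sampling loop on a toy one-dimensional problem (unit Gaussian likelihood, uniform prior on [-5, 5]) and farms the constrained replacement draws out to a process pool; it is only meant to show why wall time shrinks with the number of workers. The toy likelihood, tuning constants and use of multiprocessing are assumptions, not pBilby's actual MPI-based implementation.

```python
# Toy parallelized nested sampling: constrained prior draws run in a worker pool.
import math
import random
from multiprocessing import Pool

import numpy as np

def log_likelihood(x):
    # unit Gaussian likelihood centred at 0
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def draw_above(threshold):
    # naive constrained draw: rejection-sample the prior until L beats the threshold
    while True:
        x = random.uniform(-5.0, 5.0)
        logl = log_likelihood(x)
        if logl > threshold:
            return x, logl

def nested_sampling(n_live=100, n_iter=600, n_workers=4):
    live = [random.uniform(-5.0, 5.0) for _ in range(n_live)]
    live_logl = [log_likelihood(x) for x in live]
    log_z, log_x = -np.inf, 0.0                      # log-evidence, log prior volume
    with Pool(n_workers) as pool:
        for i in range(n_iter):
            worst = min(range(n_live), key=lambda k: live_logl[k])
            log_x_new = -(i + 1) / n_live            # expected shrinkage per iteration
            log_w = live_logl[worst] + math.log(math.exp(log_x) - math.exp(log_x_new))
            log_z = np.logaddexp(log_z, log_w)
            log_x = log_x_new
            # the expensive part: constrained draws, run in parallel (one per worker,
            # keeping only the first here; a real sampler would use every accepted draw)
            candidates = pool.map(draw_above, [live_logl[worst]] * n_workers)
            live[worst], live_logl[worst] = candidates[0]
    # add the contribution of the remaining live points
    log_z = np.logaddexp(log_z, np.log(np.mean(np.exp(live_logl))) + log_x)
    return float(log_z)

if __name__ == "__main__":
    print(nested_sampling())   # should land close to log(0.1) ~ -2.30
```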


2008, Vol. 2008, pp. 1-21. Author(s): M. De La Sen, S. Alonso-Quesada

This paper discusses the generation of an environmental carrying capacity such that the famous Beverton-Holt equation of ecology has a prescribed solution. The tracking objective is achieved by designing the carrying capacity through a feedback law so that the prescribed reference sequence, which defines the desired behavior, is followed. The approach exploits the fact that the inverse of the Beverton-Holt equation is a linear time-varying discrete dynamic system whose external input is the inverse of the environmental carrying capacity. When the intrinsic growth rate is not perfectly known, an adaptive law involving parameter estimation is incorporated into the scheme, so that tracking of the reference sequence becomes an asymptotic objective in the absence of additive disturbances. The main advantage of the proposal is that the population evolution can be made to follow a prescribed trajectory either for all time or asymptotically. The technique might be of interest in industrial exploitation problems such as aquaculture management.
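
A short worked example of this mechanism, using the standard Beverton-Holt form x_{k+1} = mu K_k x_k / (K_k + (mu - 1) x_k) rather than the paper's exact notation: writing y_k = 1/x_k gives the linear relation y_{k+1} = y_k/mu + ((mu - 1)/mu)(1/K_k), so solving for 1/K_k makes the population hit a prescribed reference exactly when the growth rate mu is known. The adaptive case, with mu estimated online, is not shown, and all numbers below are illustrative.

```python
# Carrying-capacity feedback for exact tracking of a prescribed population sequence.
mu = 2.0          # intrinsic growth rate (assumed known here)
x = 5.0           # initial population
reference = [8.0, 12.0, 15.0, 15.0, 10.0, 10.0]   # prescribed population sequence

for x_ref_next in reference:
    # inverse Beverton-Holt: 1/x_{k+1} = 1/(mu*x_k) + ((mu-1)/mu) * (1/K_k)
    # solve for 1/K_k so that x_{k+1} equals the reference value (K_k must stay positive)
    inv_K = (mu / (mu - 1.0)) * (1.0 / x_ref_next - 1.0 / (mu * x))
    K = 1.0 / inv_K
    x = mu * K * x / (K + (mu - 1.0) * x)   # forward Beverton-Holt step with that K
    print(f"K = {K:8.3f}  ->  x = {x:6.3f}  (reference {x_ref_next})")
```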


2016, Vol. 27 (05), 1650052. Author(s): Zeinab S. Jalali, Alireza Rezvanian, Mohammad Reza Meybodi

Due to their large scale and limitations on access, it is hard or infeasible to directly study and analyze most online social networks in a reasonable amount of time. Hence, network sampling has emerged as a suitable technique for studying and analyzing real networks. The main goal of sampling online social networks is to construct a small-scale sampled network that preserves the most important properties of the original network. In this paper, we propose two sampling algorithms for online social networks based on spanning trees. The first algorithm finds several spanning trees from randomly chosen starting nodes; the edges in these spanning trees are then ranked according to the number of times each edge appears across the found spanning trees. The sampled network is constructed as a subgraph of the original network containing the fraction of nodes incident on the highest-ranked edges. To avoid traversing the entire network, the second algorithm follows the same procedure but uses partial spanning trees. Several experiments are conducted to examine the performance of the proposed sampling algorithms on well-known real networks. The obtained results, in comparison with other popular sampling methods, demonstrate the efficiency of the proposed algorithms in terms of Kolmogorov–Smirnov distance (KSD), skew divergence distance (SDD) and normalized distance (ND).
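
A rough sketch of the first (full-spanning-tree) algorithm might look like the following, using randomized DFS trees from random start nodes, edge-frequency ranking and an induced subgraph on the endpoints of the top-ranked edges; the exact tree construction, ranking and node-selection rules of the proposed algorithms may differ, and the graph and parameters are illustrative.

```python
# Spanning-tree-based network sampling sketch (networkx).
import random
from collections import Counter

import networkx as nx

def random_spanning_tree_edges(G, start, rng):
    # randomized DFS from `start`: each newly discovered node contributes one tree edge
    visited, stack, edges = {start}, [start], []
    while stack:
        u = stack.pop()
        nbrs = [v for v in G.neighbors(u) if v not in visited]
        rng.shuffle(nbrs)
        for v in nbrs:
            visited.add(v)
            edges.append((u, v))
            stack.append(v)
    return edges

def spanning_tree_sample(G, n_trees=20, node_fraction=0.2, seed=0):
    rng = random.Random(seed)
    nodes = list(G.nodes())
    edge_counts = Counter()
    for _ in range(n_trees):
        for u, v in random_spanning_tree_edges(G, rng.choice(nodes), rng):
            edge_counts[frozenset((u, v))] += 1
    target = int(node_fraction * G.number_of_nodes())
    sampled_nodes = set()
    # walk the edges from most to least frequent, collecting their endpoints
    for edge, _ in edge_counts.most_common():
        sampled_nodes.update(edge)
        if len(sampled_nodes) >= target:
            break
    return G.subgraph(sampled_nodes).copy()

G = nx.barabasi_albert_graph(1000, 3, seed=1)
S = spanning_tree_sample(G)
print(S.number_of_nodes(), S.number_of_edges())
```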


2020, Vol. 2 (4). Author(s): Massimo Maiolo, Simone Ulzega, Manuel Gil, Maria Anisimova

Abstract Recently we presented a frequentist dynamic programming (DP) approach for multiple sequence alignment based on an explicit model of indel evolution, the Poisson Indel Process (PIP). This phylogeny-aware approach produces evolutionarily meaningful gap patterns and is robust to the ‘over-alignment’ bias. Despite linear time complexity for the computation of marginal likelihoods, the overall method’s complexity is cubic in sequence length. Inspired by the popular aligner MAFFT, we propose a new technique to accelerate evolutionary indel-based alignment. Amino acid sequences are converted to sequences representing their physicochemical properties, and homologous blocks are identified by a multi-scale short-time Fourier transform. Three three-dimensional DP matrices are then created under PIP, with the homologous blocks defining sparse structures in which most cells are excluded from the calculations. The homologous blocks are connected through intermediate ‘linking blocks’. The homologous and linking blocks are aligned under PIP as independent DP sub-matrices and their tracebacks are merged to yield the final alignment. The new algorithm can profit substantially from parallel computing, yielding a theoretical speed-up estimated to be proportional to the cubic power of the number of sub-blocks in the DP matrices. We compare the new method to the original PIP approach and demonstrate it on real data.
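
The block-finding step can be illustrated with a small sketch: each amino-acid sequence is mapped to a numerical physicochemical signal (here the Kyte-Doolittle hydrophobicity scale is used as a stand-in property) and the FFT-computed cross-correlation of the two signals highlights the relative shifts at which long similar segments, i.e. candidate homologous blocks, line up. The multi-scale short-time windowing and the PIP-based DP over sparse matrices are not reproduced.

```python
# FFT cross-correlation of physicochemical signals to locate candidate homologous shifts.
import numpy as np

# Kyte-Doolittle hydrophobicity values, used as a stand-in physicochemical property
KD = dict(zip("ARNDCQEGHILKMFPSTWYV",
              [1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
               3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2]))

def to_signal(seq):
    x = np.array([KD.get(a, 0.0) for a in seq])
    return x - x.mean()   # centre the signal so correlation peaks reflect shape, not bias

def correlation_peaks(seq_a, seq_b, n_peaks=3):
    a, b = to_signal(seq_a), to_signal(seq_b)
    n = len(a) + len(b) - 1   # zero-pad so circular correlation equals linear correlation
    corr = np.fft.irfft(np.fft.rfft(a, n) * np.conj(np.fft.rfft(b, n)), n)
    top = np.argsort(corr)[::-1][:n_peaks]
    # each index is a relative shift (mod n) at which the two property profiles match well
    return [(int(k), float(corr[k])) for k in top]

print(correlation_peaks("MKVLAAGICLLVAAA" * 3, "AAGICLLV" * 3))
```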

