STEM KERNELS FOR RNA SEQUENCE ANALYSES

2007 ◽  
Vol 05 (05) ◽  
pp. 1103-1122 ◽  
Author(s):  
YASUBUMI SAKAKIBARA ◽  
KRIS POPENDORF ◽  
NANA OGAWA ◽  
KIYOSHI ASAI ◽  
KENGO SATO

Several computational methods based on stochastic context-free grammars have been developed for modeling and analyzing functional RNA sequences. These grammatical methods have succeeded in modeling typical secondary structures of RNA, and are used for structural alignment of RNA sequences. However, such stochastic models cannot sufficiently discriminate member sequences of an RNA family from nonmembers, and hence cannot reliably detect noncoding RNA regions in genome sequences. A novel kernel function, the stem kernel, is proposed for the discrimination and detection of functional RNA sequences using support vector machines (SVMs). The stem kernel is a natural extension of the string kernel, specifically the all-subsequences kernel, and is tailored to measure the similarity of two RNA sequences from the viewpoint of secondary structures. The stem kernel examines all possible common base pairs and stem structures of arbitrary lengths, including pseudoknots, between two RNA sequences, and calculates the inner product of common stem structure counts. An efficient algorithm based on dynamic programming is developed to calculate the stem kernels. The stem kernels are then applied to discriminate members of an RNA family from nonmembers using SVMs. The study indicates that the discrimination ability of the stem kernel is strong compared with conventional methods. Furthermore, the potential application of the stem kernel is demonstrated by the detection of remotely homologous RNA families in terms of secondary structures, an application motivated by the proven effectiveness of the string kernel for remote homology detection of protein sequences. These experimental results have convinced us to apply the stem kernel to find novel RNA families in genome sequences.
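The abstract describes the stem kernel as an extension of the all-subsequences kernel computed by dynamic programming. As a point of reference, the following is a minimal sketch of the plain all-subsequences kernel only; it is not the stem kernel itself, which additionally tracks base pairs and stems. The function name `all_subsequences_kernel` and the inclusion-exclusion form of the recurrence are choices made for this illustration.

```python
def all_subsequences_kernel(s: str, t: str) -> int:
    """Count pairs of (possibly non-contiguous) subsequence occurrences,
    one in s and one in t, that spell the same string (empty string included)."""
    n, m = len(s), len(t)
    # dp[i][j] holds the kernel value for the prefixes s[:i] and t[:j].
    # A prefix and an empty string share exactly one subsequence: the empty one.
    dp = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Inclusion-exclusion over pairs that skip s[i-1] or skip t[j-1].
            dp[i][j] = dp[i - 1][j] + dp[i][j - 1] - dp[i - 1][j - 1]
            # Pairs that end on both s[i-1] and t[j-1] require matching symbols.
            if s[i - 1] == t[j - 1]:
                dp[i][j] += dp[i - 1][j - 1]
    return dp[n][m]
```

For example, `all_subsequences_kernel("GA", "GA")` counts the shared subsequences "", "G", "A", and "GA". The table has (n+1)(m+1) cells, so the cost is quadratic in the sequence lengths.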

2021 ◽  
Vol 22 (S3) ◽  
Author(s):  
Jun Meng ◽  
Qiang Kang ◽  
Zheng Chang ◽  
Yushi Luan

Abstract

Background: Long noncoding RNAs (lncRNAs) play an important role in regulating biological activities, and their prediction is significant for exploring biological processes. Long short-term memory (LSTM) and convolutional neural network (CNN) models can automatically extract and learn abstract information from encoded RNA sequences, avoiding complex feature engineering. An ensemble model learns information from multiple perspectives and shows better performance than a single model. It is therefore feasible to treat an RNA sequence as a sentence to train an LSTM and as an image to train a CNN, and then to hybridize the trained models to predict lncRNAs. Various lncRNA predictors exist to date, but few of them target plants. A reliable and powerful predictor for plant lncRNAs is needed.

Results: To boost the performance of lncRNA prediction, this paper proposes a hybrid deep learning model based on two encoding styles (PlncRNA-HDeep), which does not require prior knowledge and uses only RNA sequences to train the models for predicting plant lncRNAs. It not only learns diversified information from RNA sequences encoded by p-nucleotide and one-hot encodings, but also takes advantage of a CNN and of lncRNA-LSTM, proposed in our previous study. The parameters are adjusted and three hybrid strategies are tested to maximize its performance. Experimental results show that PlncRNA-HDeep is more effective than lncRNA-LSTM and CNN, and obtains 97.9% sensitivity, 95.1% precision, 96.5% accuracy and a 96.5% F1 score on a Zea mays dataset. These results are better than those of several shallow machine learning methods (support vector machine, random forest, k-nearest neighbor, decision tree, naive Bayes and logistic regression) and some existing tools (CNCI, PLEK, CPC2, LncADeep and lncRNAnet).

Conclusions: PlncRNA-HDeep is feasible and obtains credible predictive results. It may also provide a valuable reference for other related research.
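The two encoding styles named in this abstract can be illustrated with a short sketch. The one-hot part is standard; the p-nucleotide part assumes that each overlapping p-length word is mapped to an integer token, which may differ from the scheme used in the paper. The names `NUCLEOTIDES`, `one_hot_encode`, and `p_nucleotide_encode` are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGU"  # assumed alphabet; DNA-style input would use T instead of U


def one_hot_encode(seq: str) -> list[list[int]]:
    """Return one 4-element indicator row per nucleotide (image-like input)."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    rows = []
    for base in seq.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:  # unknown symbols such as N stay all-zero
            row[index[base]] = 1
        rows.append(row)
    return rows


def p_nucleotide_encode(seq: str, p: int = 3) -> list[int]:
    """Map each overlapping p-length word to an integer id (sentence-like input).

    Hypothetical scheme: ids follow the lexicographic order of all p-length
    words over NUCLEOTIDES; words containing unknown symbols are skipped.
    """
    vocab = {"".join(w): i for i, w in enumerate(product(NUCLEOTIDES, repeat=p))}
    seq = seq.upper()
    words = (seq[i:i + p] for i in range(len(seq) - p + 1))
    return [vocab[w] for w in words if w in vocab]
```

Under these assumptions, `one_hot_encode("AU")` gives two indicator rows, and `p_nucleotide_encode("ACGU", 2)` gives one token per overlapping dinucleotide.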


2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i857-i865
Author(s):  
Derrick Blakely ◽  
Eamon Collins ◽  
Ritambhara Singh ◽  
Andrew Norton ◽  
Jack Lanchantin ◽  
...  

Abstract

Motivation: Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task's alphabet size.

Results: In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines.

Availability and implementation: Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK

Supplementary information: Supplementary data are available at Bioinformatics online.
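The decomposition into independent counting operations over mismatch positions can be sketched naively as follows. This is an exact, brute-force illustration written for this listing, not the FastSK implementation or its Monte Carlo approximation. The function names and the parameter names `g` (window length) and `k` (number of positions that must match) are assumptions.

```python
from collections import Counter
from itertools import combinations


def _gapped_counts(seq: str, g: int, kept: tuple) -> Counter:
    """Count the letters seen at the kept offsets of every length-g window."""
    return Counter(
        tuple(seq[start + offset] for offset in kept)
        for start in range(len(seq) - g + 1)
    )


def gapped_kmer_kernel(x: str, y: str, g: int, k: int) -> int:
    """Sum, over every choice of k kept positions out of g, the number of
    window pairs (one from x, one from y) that agree on the kept positions."""
    total = 0
    for kept in combinations(range(g), k):
        counts_x = _gapped_counts(x, g, kept)
        counts_y = _gapped_counts(y, g, kept)
        # Each choice of kept positions is an independent counting problem.
        total += sum(c * counts_y[key] for key, c in counts_x.items())
    return total
```

Because the terms for different position sets do not interact, a sampling-based estimate can evaluate only a random subset of them; the loop above simply evaluates all of them.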


2011 ◽  
Vol 12 (1) ◽  
pp. 108 ◽  
Author(s):  
Arif O Harmanci ◽  
Gaurav Sharma ◽  
David H Mathews

2018 ◽  
Vol 13 (5) ◽  
pp. 450-460 ◽  
Author(s):  
Xingli Guo ◽  
Lin Gao ◽  
Yu Wang ◽  
David K.Y. Chiu ◽  
Bingbo Wang ◽  
...  

2002 ◽  
Vol 59 (6) ◽  
pp. 903-909 ◽  
Author(s):  
R Bundschuh ◽  
T Hwa

2013 ◽  
Vol 12 (06) ◽  
pp. 1175-1199 ◽  
Author(s):  
MINGHE SUN

A multi-class support vector machine (M-SVM) is developed, its dual is derived and mapped to high-dimensional feature spaces using inner product kernels, and its performance is tested. The M-SVM is formulated as a quadratic programming model. Its dual, also a quadratic programming model, is very elegant and is easier to solve than the primal. The discriminant functions can be constructed directly from the dual solution. By using inner product kernels, the M-SVM can be built and nonlinear discriminant functions can be constructed in high-dimensional feature spaces without carrying out the mappings from the input space to the feature spaces. The size of the dual, measured by the number of variables and constraints, is independent of the dimension of the input space and stays the same whether the M-SVM is built in the input space or in a feature space. Compared to other models published in the literature, this M-SVM is equally or more effective. An example is presented to demonstrate the dual formulation and solution in feature spaces. Very good results were obtained on benchmark test problems from the literature.
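The claim that inner product kernels avoid the explicit mapping to the feature space can be checked on a small case. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D inputs as an illustrative choice; the abstract does not say which kernels the paper uses, and the function names are hypothetical.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel, evaluated in the input space."""
    return dot(x, z) ** 2


def poly2_features(x):
    """Explicit 3-D feature map for 2-D inputs that matches poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)
```

For any 2-D `x` and `z`, `poly2_kernel(x, z)` equals `dot(poly2_features(x), poly2_features(z))` up to floating-point rounding, so a dual that only needs inner products never has to form the feature vectors.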


2017 ◽  
Vol 92 (1) ◽  
Author(s):  
Grace Logan ◽  
Joseph Newman ◽  
Caroline F. Wright ◽  
Lidia Lasecka-Dykes ◽  
Daniel T. Haydon ◽  
...  

ABSTRACT Nonenveloped viruses protect their genomes by packaging them into an outer shell or capsid of virus-encoded proteins. Packaging and capsid assembly in RNA viruses can involve interactions between capsid proteins and secondary structures in the viral genome, as exemplified by the RNA bacteriophage MS2 and as proposed for other RNA viruses of plants, animals, and humans. In the picornavirus family of nonenveloped RNA viruses, the requirements for genome packaging remain poorly understood. Here, we show a novel and simple approach to identify predicted RNA secondary structures involved in genome packaging in the picornavirus foot-and-mouth disease virus (FMDV). By interrogating deep sequencing data generated from both packaged and unpackaged populations of RNA, we have determined multiple regions of the genome with constrained variation in the packaged population. Predicted secondary structures of these regions revealed stem-loops with conservation of structure and a common motif at the loop. Disruption of these features resulted in attenuation of virus growth in cell culture due to a reduction in assembly of mature virions. This study provides evidence for the involvement of predicted RNA structures in picornavirus packaging and offers a readily transferable methodology for identifying packaging requirements in many other viruses.

IMPORTANCE In order to transmit their genetic material to a new host, nonenveloped viruses must protect their genomes by packaging them into an outer shell or capsid of virus-encoded proteins. For many nonenveloped RNA viruses, the requirements for this critical part of the viral life cycle remain poorly understood. We have identified RNA sequences involved in genome packaging of the picornavirus foot-and-mouth disease virus. This virus causes an economically devastating disease of livestock affecting both the developed and developing world. The experimental methods developed to carry out this work are novel, simple, and transferable to the study of packaging signals in other RNA viruses. Improved understanding of RNA packaging may lead to novel vaccine approaches or targets for antiviral drugs with broad-spectrum activity.


2005 ◽  
Vol 03 (03) ◽  
pp. 527-550 ◽  
Author(s):  
RUI KUANG ◽  
EUGENE IE ◽  
KE WANG ◽  
KAI WANG ◽  
MAHIRA SIDDIQI ◽  
...  

We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We further examine how to incorporate predicted secondary structure information into the profile kernel to obtain a small but significant performance improvement. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs" (short regions of the original profile that contribute almost all the weight of the SVM classification score) and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results also outperform cluster kernels while providing much better scalability to large datasets.
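One way to picture a position-dependent mutation neighborhood is sketched below. It assumes that a k-mer belongs to the neighborhood of a profile window when its negative log-probability under the window's columns is below a threshold `sigma`; this rule, the brute-force enumeration, and all names are assumptions made for illustration. The abstract's efficient data structure is not reproduced here.

```python
import math
from collections import Counter
from itertools import product


def mutation_neighborhood(window, sigma, alphabet):
    """Return the k-mers whose negative log-probability under the profile
    window (a list of {letter: probability} columns) is below sigma."""
    neighbors = set()
    for letters in product(alphabet, repeat=len(window)):
        score = 0.0
        for column, letter in zip(window, letters):
            p = column.get(letter, 0.0)
            if p <= 0.0:  # a zero-probability letter rules the k-mer out
                score = math.inf
                break
            score -= math.log(p)
        if score < sigma:
            neighbors.add("".join(letters))
    return neighbors


def profile_kernel(profile_x, profile_y, k, sigma, alphabet):
    """Inner product of neighborhood-membership counts over all k-length windows."""
    def feature_counts(profile):
        counts = Counter()
        for start in range(len(profile) - k + 1):
            counts.update(mutation_neighborhood(profile[start:start + k], sigma, alphabet))
        return counts

    counts_x, counts_y = feature_counts(profile_x), feature_counts(profile_y)
    return sum(c * counts_y[kmer] for kmer, c in counts_x.items())
```

The enumeration is exponential in `k`, so this form is only practical for toy alphabets and short windows.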

