Informative RNA-base embedding for functional RNA structural alignment and clustering by deep representation learning

Mapping Intimacies ◽

10.1101/2021.08.23.457433 ◽

2021 ◽

Author(s):

Manato Akiyama ◽

Yasubumi Sakakibara

Keyword(s):

Time Complexity ◽

Learning Algorithm ◽

Structural Alignment ◽

Representation Learning ◽

Sequence Motif ◽

Alignment Algorithm ◽

Dependent Manner ◽

RNA Sequences ◽

RNA Sequence ◽

RNA Structural Alignment

Deep learning is being actively applied to biomolecular data to learn effective embeddings. Better embeddings improve the quality of downstream analyses such as DNA sequence motif detection and protein function prediction. In this study, we adopt a pre-training algorithm for the effective embedding of RNA bases to acquire semantically rich representations, and apply it to two fundamental RNA sequence problems: structural alignment and clustering. By using the pre-training algorithm to embed the four RNA bases in a position-dependent manner, trained on a large number of RNA sequences from various RNA families, we obtain a context-sensitive embedding representation. As a result, each base embedding captures not only base identity but also the secondary structure and context information of the RNA sequence. We call this informative base embedding and use it to achieve accuracy superior to that of existing state-of-the-art methods in RNA structural alignment and RNA family clustering tasks. Furthermore, by combining this informative base embedding with a simple Needleman-Wunsch alignment algorithm, we succeed in calculating a structural alignment in O(n^2) time instead of the O(n^6) time of Sankoff-style algorithms.
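
The combination of per-base embeddings with Needleman-Wunsch described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the cosine-similarity match score, the linear gap penalty of -0.5, and the function names are all assumptions chosen for the example. Each sequence is given as a list of embedding vectors, one per base.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def needleman_wunsch(emb_a, emb_b, gap=-0.5):
    """Global alignment of two sequences of per-base embedding vectors.

    The match score is cosine similarity (an assumption for this sketch) and
    gaps carry a linear penalty. Fills an (n+1) x (m+1) table, so the number
    of cells grows as O(n * m). Returns (score, aligned_index_pairs).
    """
    n, m = len(emb_a), len(emb_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + cosine(emb_a[i - 1], emb_b[j - 1])
            up = score[i - 1][j] + gap      # base of A aligned to a gap
            left = score[i][j - 1] + gap    # base of B aligned to a gap
            score[i][j] = max(diag, up, left)

    # Trace back from the bottom-right cell to recover the matched positions.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = score[i][j]
        if math.isclose(s, score[i - 1][j - 1] + cosine(emb_a[i - 1], emb_b[j - 1])):
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif math.isclose(s, score[i - 1][j] + gap):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs
```

In this sketch all structural information has to come from the embeddings themselves, which is the point made in the abstract: the dynamic program is the ordinary quadratic one, and only the scoring function differs from plain sequence alignment.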

TOPAS: network-based structural alignment of RNA sequences

Bioinformatics ◽

10.1093/bioinformatics/btz001 ◽

2019 ◽

Vol 35 (17) ◽

pp. 2941-2948 ◽

Cited By ~ 2

Author(s):

Chun-Chi Chen ◽

Hyundoo Jeong ◽

Xiaoning Qian ◽

Byung-Jun Yoon

Keyword(s):

Computational Complexity ◽

Secondary Structure ◽

Large Scale ◽

Structural Alignment ◽

Programming Approach ◽

RNA Sequences ◽

Optimal Sequence ◽

Dynamic Programming Approach ◽

Probabilistic Network ◽

RNA Structural Alignment

Motivation: For many RNA families, the secondary structure is known to be better conserved among the member RNAs than the primary sequence. For this reason, it is important to consider the underlying folding structures when aligning RNA sequences, especially those with relatively low sequence identity. Given a set of RNAs with unknown structures, simultaneous RNA alignment and folding algorithms aim to accurately align the RNAs by jointly predicting their consensus secondary structure and the optimal sequence alignment. Despite the improved accuracy of the resulting alignment, the computational complexity of simultaneous alignment and folding for a pair of RNAs is O(N^6), which is too costly for large-scale analysis.

Results: To address this shortcoming, we propose a novel network-based scheme for pairwise structural alignment of RNAs. The proposed algorithm, TOPAS, builds on the concept of topological networks that provide structural maps of the RNAs to be aligned. For each RNA sequence, TOPAS first constructs a topological network based on the predicted folding structure, which consists of sequential edges and structural edges weighted by the base-pairing probabilities. The obtained networks can then be efficiently aligned by using probabilistic network alignment techniques, thereby yielding the structural alignment of the RNAs. The computational complexity of our proposed method is significantly lower than that of the Sankoff-style dynamic programming approach, while yielding favorable alignment results. Another important advantage of the proposed algorithm is its capability of handling RNAs with pseudoknots while predicting the RNA structural alignment. We demonstrate that TOPAS generally outperforms previous RNA structural alignment methods on RNA benchmarks in terms of both speed and accuracy.

Availability and implementation: Source code of TOPAS and the benchmark data used in this paper are available at https://github.com/bjyoontamu/TOPAS.
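
The network construction step described in this abstract (sequential edges along the backbone, plus structural edges weighted by base-pairing probabilities) might be represented as in the sketch below. It is an illustration under stated assumptions, not the TOPAS code: the function name, the edge-tuple layout, the fixed weight of 1.0 on sequential edges, and the 0.05 probability cutoff are all hypothetical, and the pairing probabilities are assumed to have been computed beforehand by some folding tool.

```python
def build_topological_network(seq, pair_probs, min_prob=0.05, seq_weight=1.0):
    """Build an edge list for one RNA sequence.

    seq        : RNA sequence string; node i is the base at position i.
    pair_probs : dict mapping (i, j) with i < j to a base-pairing probability
                 (assumed to be precomputed elsewhere).
    min_prob   : hypothetical cutoff below which structural edges are dropped.
    seq_weight : hypothetical constant weight for backbone edges.

    Returns a list of (i, j, kind, weight) tuples.
    """
    n = len(seq)
    edges = []

    # Sequential edges connect neighbouring bases along the backbone.
    for i in range(n - 1):
        edges.append((i, i + 1, "sequential", seq_weight))

    # Structural edges connect bases that may pair, weighted by probability.
    for (i, j), p in sorted(pair_probs.items()):
        if not (0 <= i < j < n):
            raise ValueError(f"invalid base pair indices: {(i, j)}")
        if p >= min_prob:
            edges.append((i, j, "structural", p))

    return edges
```

Because structural edges are just index pairs, crossing pairs such as (0, 5) and (3, 8) can coexist in this representation, which is consistent with the abstract's remark that the network view can accommodate pseudoknots.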

RNAfamProb Plus NeoFold: Estimations of Posterior Probabilities on RNA Structural Alignment and RNA Secondary Structures with Incorporating Homologous-RNA Sequences

10.1101/812891 ◽

2019 ◽

Author(s):

Masaki Tagashira ◽

Kiyoshi Asai

Keyword(s):

Secondary Structure ◽

Sequence Alignment ◽

Structural Alignment ◽

Secondary Structures ◽

Simultaneous Optimization ◽

Supplementary Information ◽

Sequence Alignments ◽

RNA Sequences ◽

Link Type ◽

RNA Structural Alignment

Motivation: Structural alignment, the simultaneous optimization of the sequence alignment and secondary structures among RNAs, is needed to compare functional ncRNAs more appropriately than sequence alignment alone allows. Pseudo-probabilities on structural alignment, given RNA sequences, are desirable for obtaining more accurate secondary structures, sequence alignments, consensus secondary structures, and structural alignments. However, no algorithm has been proposed for estimating these pseudo-probabilities.

Results: We developed RNAfamProb, an algorithm for estimating these pseudo-probabilities. We applied the pseudo-probabilities to two biological problems: visualization and maximum-expected-accuracy secondary-structure estimation. The RNAfamProb program, an implementation of this algorithm, together with the NeoFold program, a maximum-expected-accuracy secondary-structure program that uses these pseudo-probabilities, demonstrated better prediction accuracy than three state-of-the-art maximum-expected-accuracy secondary-structure programs. As expected, its running time was far longer than that of these three programs, owing to the intrinsically greater complexity of structural alignment compared with independent secondary-structure prediction and sequence alignment. Both the RNAfamProb and NeoFold programs produce more accurate estimates when homologous RNA sequences are incorporated.

Availability: The source code of the two programs is available at https://github.com/heartsh/rnafamprob and https://github.com/heartsh/neofold, respectively.

Contact: [email protected] and [email protected].

Supplementary information: Supplementary data are available at Bioinformatics online.

The Network Representation Learning Algorithm Based on Semi-Supervised Random Walk

IEEE Access ◽

10.1109/access.2020.3044367 ◽

2020 ◽

Vol 8 ◽

pp. 222956-222965

Author(s):

Dong Liu ◽

Qinpeng Li ◽

Yan Ru ◽

Jun Zhang

Keyword(s):

Random Walk ◽

Learning Algorithm ◽

Representation Learning ◽

Network Representation

Network Representation Learning Algorithm Combined with Node Text Information

Journal of Physics Conference Series ◽

10.1088/1742-6596/1769/1/012054 ◽

2021 ◽

Vol 1769 (1) ◽

pp. 012054

Author(s):

Rui Wang ◽

Yu Liu ◽

Jiawang Chen

Keyword(s):

Learning Algorithm ◽

Representation Learning ◽

Network Representation ◽

Text Information

A Graph Representation Learning Algorithm Based on Attention Mechanism and Node Similarity

Computer Supported Cooperative Work and Social Computing - Communications in Computer and Information Science ◽

10.1007/978-981-15-1377-0_46 ◽

2019 ◽

pp. 591-604

Author(s):

Kun Guo ◽

Deqin Wang ◽

Jiangsheng Huang ◽

Yuzhong Chen ◽

Zhihao Zhu ◽

...

Keyword(s):

Learning Algorithm ◽

Representation Learning ◽

Attention Mechanism ◽

Graph Representation ◽

Node Similarity

Tile2Vec: Unsupervised Representation Learning for Spatially Distributed Data

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013967 ◽

2019 ◽

Vol 33 ◽

pp. 3967-3974 ◽

Cited By ~ 7

Author(s):

Neal Jean ◽

Sherrie Wang ◽

Anshul Samar ◽

George Azzari ◽

David Lobell ◽

...

Keyword(s):

Natural Language ◽

Learning Algorithm ◽

Representation Learning ◽

Distributed Data ◽

Simple Arithmetic ◽

Latent Space ◽

Wide Range ◽

Spatially Distributed ◽

Vector Representations ◽

Spatially Distributed Data

Geospatial analysis lacks methods like the word vector representations and pre-trained networks that significantly boost performance across a wide range of natural language and computer vision tasks. To fill this gap, we introduce Tile2Vec, an unsupervised representation learning algorithm that extends the distributional hypothesis from natural language — words appearing in similar contexts tend to have similar meanings — to spatially distributed data. We demonstrate empirically that Tile2Vec learns semantically meaningful representations for both image and non-image datasets. Our learned representations significantly improve performance in downstream classification tasks and, similarly to word vectors, allow visual analogies to be obtained via simple arithmetic in the latent space.
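
The "simple arithmetic in the latent space" mentioned in this abstract can be illustrated with a toy analogy query: given embeddings for items a, b and c, look for the item closest to b - a + c. The sketch below is not taken from Tile2Vec; the function name, the use of Euclidean distance, and the two-dimensional vectors in the usage note are all made up for illustration.

```python
import math


def analogy(embeddings, a, b, c):
    """Return the key whose vector is nearest to emb[b] - emb[a] + emb[c].

    embeddings : dict mapping item names to equal-length lists of floats.
    The three query items are excluded from the candidates. Euclidean
    distance is an assumption of this sketch.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    best_key, best_dist = None, math.inf
    for key, vec in embeddings.items():
        if key in (a, b, c):
            continue
        dist = math.dist(vec, target)
        if dist < best_dist:
            best_key, best_dist = key, dist
    return best_key
```

For example, with hypothetical tile embeddings {"rural_dry": [0, 0], "rural_wet": [0, 1], "urban_dry": [1, 0], "urban_wet": [1, 1], "forest": [-1, -1]}, the query analogy(emb, "rural_dry", "rural_wet", "urban_dry") lands on "urban_wet", because the dry-to-wet offset is added to the urban tile.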

Stochastic sampling of the RNA structural alignment space

Nucleic Acids Research ◽

10.1093/nar/gkp276 ◽

2009 ◽

Vol 37 (12) ◽

pp. 4063-4075 ◽

Cited By ~ 11

Author(s):

Arif Ozgun Harmanci ◽

Gaurav Sharma ◽

David H. Mathews

Keyword(s):

Structural Alignment ◽

Stochastic Sampling ◽

RNA Structural Alignment

A non-negative representation learning algorithm for selecting neighbors

Machine Learning ◽

10.1007/s10994-015-5501-4 ◽

2015 ◽

Vol 102 (2) ◽

pp. 133-153 ◽

Cited By ~ 3

Author(s):

Lili Li ◽

Jiancheng Lv ◽

Zhang Yi

Keyword(s):

Learning Algorithm ◽

Representation Learning ◽

Negative Representation

Genomic RNA sequence of feline coronavirus strain FCoV C1Je

Journal of Feline Medicine and Surgery ◽

10.1016/j.jfms.2006.12.002 ◽

2007 ◽

Vol 9 (3) ◽

pp. 202-213 ◽

Cited By ~ 22

Author(s):

Charlotte Dye ◽

Stuart G. Siddell

Keyword(s):

Consensus Sequence ◽

PCR Amplification ◽

Laboratory Strain ◽

Open Reading Frames ◽

Viral RNA ◽

Genomic RNA ◽

RT-PCR ◽

RNA Sequences ◽

RNA Sequence ◽

Feline Coronavirus

This paper reports the first genomic RNA sequence of a field strain feline coronavirus (FCoV). Viral RNA was isolated at post mortem from the jejunum and liver of a cat with feline infectious peritonitis (FIP). A consensus sequence of the jejunum-derived genomic RNA (FCoV C1Je) was determined from overlapping cDNA fragments produced by reverse transcriptase polymerase chain reaction (RT-PCR) amplification. RT-PCR products were sequenced by a reiterative sequencing strategy and the genomic RNA termini were determined using a rapid amplification of cDNA ends PCR strategy. The FCoV C1Je genome was found to be 29,255 nucleotides in length, excluding the poly(A) tail. Comparison of the FCoV C1Je genomic RNA sequence with that of the laboratory strain FCoV FIP virus (FIPV) 79-1146 showed that both viruses have a similar genome organisation and predictions made for the open reading frames and cis-acting elements of the FIPV 79-1146 genome hold true for FCoV C1Je. In addition, the sequence of the 3′-proximal third of the liver derived genomic RNA (FCoV C1Li), which encompasses the structural and accessory protein genes of the virus, was also determined. Comparisons of the enteric (jejunum) and non-enteric (liver) derived viral RNA sequences revealed 100% nucleotide identity, a finding that questions the well accepted ‘internal mutation theory’ of FIPV pathogenicity.

Differential accumulation of poly(A)+ RNA between virulent and double-stranded RNA-induced hypovirulent strains of Cryphonectria (Endothia) parasitica.

Molecular and Cellular Biology ◽

10.1128/mcb.7.10.3688 ◽

1987 ◽

Vol 7 (10) ◽

pp. 3688-3693 ◽

Cited By ~ 39

Author(s):

W A Powell ◽

N K Van Alfen

Keyword(s):

Gene Expression ◽

Specific Effect ◽

RNA Sequences ◽

Double-Stranded RNA ◽

Total RNA ◽

RNA Sequence ◽

Differential Hybridization ◽

Endothia Parasitica ◽

Fungal Gene Expression ◽

Fungal Gene

The double-stranded RNA responsible for transmissible hypovirulence in Cryphonectria (Endothia) parasitica was found to affect the accumulation of specific poly(A)+ RNA. Using differential hybridization techniques, two genes were isolated, Vir1 and Vir2, which were specifically expressed as poly(A)+ RNAs in the virulent cells. The highly expressed RNA sequences from these genes were not found in total RNA isolated from either American or European hypovirulent strains, although the genes were present in their genomes. Other virulence- and hypovirulence-specific RNA sequences were also detected. One isolated hypovirulence-specific RNA sequence was expressed in both virulent and hypovirulent cells, but in a two- to fourfold-higher concentration in the hypovirulent cells. The results show that hypovirulence is associated with concurrent changes in a few highly expressed poly(A)+ RNAs, which suggests a specific effect of the double-stranded RNA on fungal gene expression.
