RNAsamba: neural network-based assessment of the protein-coding potential of RNA sequences

Abstract The advent of high-throughput sequencing technologies has made it possible to obtain large volumes of genetic information quickly and inexpensively. Consequently, many efforts are devoted to unveiling the biological roles of genomic elements, and distinguishing protein-coding from long non-coding RNAs is one of the most important tasks. We describe RNAsamba, a tool that predicts the coding potential of RNA molecules from sequence information using a neural network-based model that processes both the whole sequence and the ORF to identify patterns that distinguish coding from non-coding transcripts. We evaluated RNAsamba’s classification performance using transcripts from humans and several other model organisms and show that it repeatedly outperforms other state-of-the-art methods. Our results also show that RNAsamba can identify coding signals in partial-length ORFs and UTR sequences, indicating that its algorithm does not depend on complete transcript sequences. Furthermore, RNAsamba can also predict small ORFs, which are traditionally identified with ribosome profiling experiments. We believe that RNAsamba will enable faster and more accurate biological findings from genomic data of species that are being sequenced for the first time. A user-friendly web interface, documentation with instructions for local installation and usage, and the source code of RNAsamba can be found at https://rnasamba.lge.ibi.unicamp.br/.
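
The abstract says the model takes both the whole transcript and its ORF as input, but it does not say how the ORF is chosen. As a minimal sketch only, assuming the ORF is the longest forward-strand stretch from an ATG to the first in-frame stop codon, the extraction step could look like the following. The function name `longest_orf` and the selection rule are illustrative assumptions, not RNAsamba's actual code.

```python
# Hypothetical helper: NOT taken from RNAsamba's source code.
# Assumption: the ORF is the longest ATG-to-stop stretch on the forward strand.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Return the longest ATG-to-stop ORF (stop codon included), or '' if none."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = ""
    for frame in range(3):
        start = None  # index of the currently open start codon in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # close the ORF and look for the next start
    return best


transcript = "GGATGAAACCCTAGTT"
whole_sequence_input = transcript           # first input: the full transcript
orf_input = longest_orf(transcript)         # second input: the extracted ORF
```

A classifier in this style would receive both `whole_sequence_input` and `orf_input`; an empty ORF string signals that no complete ORF was found under the assumed rule.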

RNAsamba: coding potential assessment using ORF and whole transcript sequence information

10.1101/620880 ◽

2019 ◽

Author(s):

Antonio P. Camargo ◽

Vsevolod Sourkov ◽

Marcelo F. Carazzolle

Keyword(s):

High Throughput Sequencing ◽

Model Organisms ◽

Sequence Information ◽

Protein Coding ◽

Rna Molecules ◽

Coding Regions ◽

Sequencing Technologies ◽

Partial Length ◽

Non Coding Rnas ◽

Coding Potential

Abstract Motivation: The advent of high-throughput sequencing technologies has made it possible to obtain large volumes of genetic information quickly and inexpensively. Consequently, many efforts are devoted to unveiling the biological roles of genomic elements, and one of the main tasks is the identification of protein-coding and long non-coding RNAs. Results: We describe RNAsamba, a tool that predicts the coding potential of RNA molecules from sequence information using a deep-learning model that processes both the whole sequence and the ORF to look for patterns that distinguish coding from non-coding RNAs. We evaluated the model on the classification of coding and non-coding transcripts of humans and five other model organisms and show that RNAsamba mostly outperforms other state-of-the-art methods. We also show that RNAsamba can identify coding signals in partial-length ORFs and UTR sequences, indicating that its model does not depend on the presence of complete coding regions. RNAsamba is a fast and easy-to-use tool that can provide valuable contributions to genome annotation pipelines. Availability and implementation: The source code of RNAsamba is freely available at https://github.com/apcamargo/RNAsamba.

DeepCPP: a deep neural network based on nucleotide bias information and minimum distribution similarity feature selection for RNA coding potential prediction

Briefings in Bioinformatics ◽

10.1093/bib/bbaa039 ◽

2020 ◽

Cited By ~ 3

Author(s):

Yu Zhang ◽

Cangzhi Jia ◽

Melissa Jane Fullwood ◽

Chee Keong Kwoh

Keyword(s):

Neural Network ◽

Feature Selection ◽

Deep Neural Network ◽

Feature Selection Method ◽

Classification Problem ◽

Open Reading Frames ◽

Nucleotide Bias ◽

Sequencing Technologies ◽

Type Data ◽

Coding Potential

Abstract The development of deep sequencing technologies has led to the discovery of novel transcripts. Many in silico methods have been developed to assess the coding potential of these transcripts so that their functions can be investigated further. Existing methods perform well at distinguishing the majority of long noncoding RNAs (lncRNAs) from coding RNAs (mRNAs) but poorly on RNAs with small open reading frames (sORFs). Here, we present DeepCPP (deep neural network for coding potential prediction), a deep learning method for RNA coding potential prediction. Extensive evaluations on four previous datasets and six new datasets constructed in different species show that DeepCPP outperforms other state-of-the-art methods, especially on sORF-type data, where it overcomes the bottleneck of sORF mRNA identification by improving accuracy by more than 4.31%, 37.24% and 5.89% on newly discovered human, vertebrate and insect data, respectively. Additionally, we show that discontinuous k-mers, together with our newly proposed nucleotide bias and minimal distribution similarity feature selection method, play crucial roles in this classification problem. Taken together, DeepCPP is an effective method for RNA coding potential prediction.

Predicting circRNA-RBP interaction sites using a codon-based encoding and hybrid deep neural networks

10.1101/499012 ◽

2018 ◽

Cited By ~ 2

Author(s):

Kaiming Zhang ◽

Xiaoyong Pan ◽

Yang Yang ◽

Hong-Bin Shen

Keyword(s):

Neural Network ◽

Machine Learning ◽

Deep Learning ◽

Binding Sites ◽

Large Scale ◽

Rna Binding ◽

Sequence Information ◽

Rna Sequences ◽

Encoding Scheme ◽

Interaction Sites

Abstract Circular RNAs (circRNAs), with their crucial roles in gene regulation and disease development, have become a rising star in the RNA world. Many previous wet-lab studies focused on the interaction mechanisms between circRNAs and RNA-binding proteins (RBPs), because knowledge of circRNA-RBP association is very important for understanding the functions of circRNAs. Recently, abundant CLIP-Seq experimental data have made the large-scale identification and analysis of circRNA-RBP interactions possible, yet no computational tool based on machine learning has been developed. We present a new deep learning-based method, CRIP (CircRNAs Interact with Proteins), for the prediction of RBP binding sites on circRNAs using only the RNA sequences. To fully exploit the sequence information, we propose a stacked codon-based encoding scheme and a hybrid deep learning architecture, in which a convolutional neural network (CNN) learns high-level abstract features and a recurrent neural network (RNN) learns long-range dependencies in the sequences. We construct 37 datasets of sequence fragments of binding sites on circRNAs, each corresponding to one RBP. The experimental results show that the new encoding scheme is superior to existing feature representation methods for RNA sequences, and that the hybrid network outperforms conventional classifiers by a large margin, with both the CNN and RNN components contributing to the performance improvement. To the best of our knowledge, CRIP is the first machine learning-based tool specialized in the prediction of circRNA-RBP interactions, and it is expected to play an important role in large-scale functional analysis of circRNAs.
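
The abstract names a stacked codon-based encoding but does not define it. As a minimal sketch under an explicit assumption, one plausible reading is that the sequence is cut into overlapping nucleotide triplets (a window of 3 with stride 1) and each triplet is one-hot encoded over the 64 possible codons. The function name `stacked_codon_one_hot` and the 64-dimensional layout are illustrative choices, not CRIP's published implementation.

```python
from itertools import product

# Assumption: "stacked" codons are overlapping triplets taken with stride 1.
# This is an illustrative reading of the abstract, not CRIP's actual encoder.
CODONS = ["".join(p) for p in product("ACGU", repeat=3)]   # 64 triplets
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def stacked_codon_one_hot(seq):
    """Encode an RNA sequence as a list of 64-dim one-hot rows, one per overlapping triplet."""
    seq = seq.upper().replace("T", "U")  # accept DNA or RNA alphabets
    rows = []
    for i in range(len(seq) - 2):
        row = [0] * len(CODONS)
        idx = CODON_INDEX.get(seq[i:i + 3])
        if idx is not None:              # triplets with ambiguous bases stay all-zero
            row[idx] = 1
        rows.append(row)
    return rows


encoded = stacked_codon_one_hot("AUGGC")  # triplets: AUG, UGG, GGC
```

The resulting matrix (one row per triplet) has the shape a convolutional layer followed by a recurrent layer would typically consume.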

Potentially translated sequences determine protein-coding potential of RNAs in cellular organisms

10.1101/2021.04.14.439730 ◽

2021 ◽

Author(s):

Yusuke Suenaga ◽

Mamoru Kato ◽

Momoko Nagai ◽

Kazuma Nakatani ◽

Hiroyuki Kogashi ◽

...

Keyword(s):

Noncoding Rna ◽

Rna Virus ◽

Host Cells ◽

Rna Sequences ◽

Protein Coding ◽

Functional Peptides ◽

Definition Of ◽

Virus Genomes ◽

Sequence Characteristics ◽

Coding Potential

Abstract RNA sequence characteristics determine whether transcripts are coding or noncoding. Recent studies have shown that, contrary to the definition of noncoding RNA, several long noncoding RNAs (lncRNAs) are translated into functional peptides or proteins. However, the characteristics of RNA sequences that distinguish such newly identified coding transcripts from lncRNAs remain largely unknown. In this study, we found that potentially translated sequences in RNAs determine the protein-coding potential of RNAs in cellular organisms. We defined the potentially translated island (PTI) score as the fraction of the length of the longest potentially translated region among all regions. To analyze its relationship with protein-coding potential, we calculated the PTI scores of 3.4 million RNA transcripts from 100 cellular organisms, including 5 bacteria, 10 archaea, and 85 eukaryotes, as well as 105 positive-sense single-strand RNA virus genomes. In bacteria and archaea, coding and noncoding transcripts exclusively presented high and low PTI scores, respectively, whereas eukaryotic coding and noncoding transcripts showed relatively broader distributions. The relationship between the PTI score and protein-coding potential was sigmoidal in most eukaryotes; however, it was linear and passed through the origin in three distinct eutherian lineages, including humans. The RNA sequences of virus genomes appeared to adapt to the translation systems of host organisms by maximizing protein-coding potential in host cells. Hence, the PTIs determined the protein-coding potential of RNAs in cellular organisms. Additionally, coding and noncoding RNAs do not exhibit dichotomous sequence characteristics in eukaryotes; instead, they exhibit a gradient of protein-coding potential.
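
The PTI score definition above can be illustrated with a short sketch. The abstract does not state how potentially translated regions are delimited, so the code assumes they are forward-strand ATG-to-stop stretches in all three frames, and that the score is the longest region's length divided by the summed length of all regions. The function names are hypothetical and this is not the authors' implementation.

```python
# Assumptions (not confirmed by the abstract):
#   * a potentially translated region runs from an ATG to the first in-frame stop;
#   * the PTI score is longest region length / total length of all regions.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def translated_region_lengths(seq):
    """Lengths (in nt, stop codon included) of all ATG-to-stop regions in three frames."""
    seq = seq.upper().replace("U", "T")
    lengths = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                lengths.append(i + 3 - start)
                start = None
    return lengths


def pti_score(seq):
    """Fraction of total region length taken up by the longest region (0.0 if none)."""
    lengths = translated_region_lengths(seq)
    return max(lengths) / sum(lengths) if lengths else 0.0


# Two regions of 9 nt and 6 nt, so the longest one holds 9 of the 15 nt.
score = pti_score("ATGAAATAGATGTAA")
```

Under this reading, a transcript dominated by a single long region scores close to 1, while a transcript with many similarly sized regions scores low.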

Discovering the Unknown: Improving Detection of Novel Species and Genera from Short Reads

Journal of Biomedicine and Biotechnology ◽

10.1155/2011/495849 ◽

2011 ◽

Vol 2011 ◽

pp. 1-11 ◽

Cited By ~ 6

Author(s):

Gail L. Rosen ◽

Robi Polikar ◽

Diamantino A. Caseiro ◽

Steven D. Essinger ◽

Bahrad A. Sokhansanj

Keyword(s):

High Throughput Sequencing ◽

Mine Drainage ◽

Novel Species ◽

Classification Performance ◽

Species Level ◽

Metagenomic Data ◽

Sequencing Technologies ◽

Novel Taxa ◽

New Algorithms ◽

Better Than

High-throughput sequencing technologies enable metagenome profiling, the simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data include sequence fragments (“reads”) from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between “known” and “unknown” taxa and that can be used with composition-based methods as well as with a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that the integration of a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an “unknown” class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially at the species level and for ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset.

SALTS – SURFR (sncRNA) And LAGOOn (lncRNA) Transcriptomics Suite

10.1101/2021.02.08.430280 ◽

2021 ◽

Author(s):

Mohan V Kasukurthi ◽

Dominika Houserova ◽

Yulong Huang ◽

Addison A. Barchie ◽

Justin T. Roberts ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Protein Coding ◽

Web Based ◽

Functional Roles ◽

Sequencing Technologies ◽

Active Research ◽

The Cost ◽

Analysis Platform ◽

Transcriptional Output

Abstract The widespread utilization of high-throughput sequencing technologies has unequivocally demonstrated that eukaryotic transcriptomes consist primarily (>98%) of non-coding RNA (ncRNA) transcripts that are significantly more diverse than their protein-coding counterparts. ncRNAs are typically divided into two categories based on their length. (1) ncRNAs less than 200 nucleotides (nt) long are referred to as small non-coding RNAs (sncRNAs) and include microRNAs (miRNAs), piwi-interacting RNAs (piRNAs), small nucleolar RNAs (snoRNAs), transfer RNAs (tRNAs), etc., and the majority of these are thought to function primarily in controlling gene expression. That said, the full repertoire of sncRNAs remains poorly defined, as evidenced by two entirely new classes of sncRNAs only recently being reported, i.e., snoRNA-derived RNAs (sdRNAs) and tRNA-derived fragments (tRFs). (2) ncRNAs longer than 200 nt are known as long ncRNAs (lncRNAs). lncRNAs represent the second largest transcriptional output of the cell (behind only ribosomal RNAs), and although functional roles for several lncRNAs have been reported, most lncRNAs remain largely uncharacterized due to a lack of predictive tools aimed at guiding functional characterizations. Importantly, whereas the cost of high-throughput transcriptome sequencing is now affordable for most active research programs, the tools necessary for interpreting these sequencing data typically require significant computational expertise and resources, markedly hindering widespread utilization of these datasets. In light of this, we have developed a powerful new ncRNA transcriptomics suite, SALTS, which is highly accurate, markedly efficient, and extremely user-friendly. SALTS stands for SURFR (sncRNA) And LAGOOn (lncRNA) Transcriptomics Suite and offers platforms for comprehensive sncRNA and lncRNA profiling and discovery, ncRNA functional prediction, and the identification of significant differential expression among datasets. Notably, SALTS is accessed through an intuitive Web-based interface, can be used to analyze either user-generated, standard next-generation sequencing (NGS) output file uploads (e.g., FASTQ) or existing NCBI Sequence Read Archive (SRA) data, and requires absolutely no dataset pre-processing or knowledge of library adapters/oligonucleotides. SALTS constitutes the first publicly available, Web-based, comprehensive ncRNA transcriptomic NGS analysis platform designed specifically for users with no computational background, providing a much needed, powerful new resource capable of enabling more widespread ncRNA transcriptomic analyses. The SALTS WebServer is freely available online at http://salts.soc.southalabama.edu.
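
The length-based division of ncRNAs described in this abstract can be written as a one-line rule. The sketch below is only an illustration of that 200 nt convention and is not part of SALTS. The abstract does not say how a transcript of exactly 200 nt is classified, so placing it in the long class is an assumption, as are the names used.

```python
# Illustrative only; not SALTS code.
# Assumption: a transcript of exactly 200 nt is treated as a lncRNA,
# since the abstract only covers "less than 200" and "longer than 200".
SMALL_NCRNA_MAX_NT = 200


def ncrna_length_class(seq):
    """Classify a non-coding transcript by length alone."""
    return "sncRNA" if len(seq) < SMALL_NCRNA_MAX_NT else "lncRNA"


mirna_like = "A" * 22       # typical miRNA-scale length
long_transcript = "A" * 1500

labels = [ncrna_length_class(mirna_like), ncrna_length_class(long_transcript)]
```

Length alone says nothing about function or coding status; it only routes a transcript to the sncRNA or lncRNA category.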

Comprehensive Annotations of Human Herpesvirus 6A and 6B Genomes Reveal Novel and Conserved Genomic Features

10.1101/730028 ◽

2019 ◽

Author(s):

Yaara Finkel ◽

Dominik Schmiedel ◽

Julie Tai-Schmiedel ◽

Aharon Nachshon ◽

Michal Schwartz ◽

...

Keyword(s):

Human Herpesvirus ◽

Ribosome Profiling ◽

Open Reading Frames ◽

Temporal Expression ◽

Protein Coding ◽

Functional Studies ◽

Viral Genes ◽

Non Coding Rnas ◽

Coding Potential ◽

Reading Frames

Abstract Human herpesvirus 6 (HHV-6) A and B are ubiquitous betaherpesviruses that infect the majority of the human population. Like other herpesviruses, they have large genomes, and our understanding of their protein-coding potential is far from complete. Here we employ ribosome profiling and systematic transcript analysis to experimentally define the HHV-6 translation products and to follow their temporal expression. We identify hundreds of new open reading frames (ORFs), including many upstream ORFs (uORFs) and internal ORFs (iORFs), generating a complete, unbiased atlas of the HHV-6 proteome. Furthermore, by integrating systematic data from the prototypic betaherpesvirus, human cytomegalovirus, we uncover numerous uORFs and iORFs that are conserved across betaherpesviruses, and we show that uORFs are specifically enriched in late viral genes. Using our transcriptome measurements, we identified three highly abundant HHV-6-encoded long non-coding RNAs (lncRNAs), one of which generates a non-polyadenylated stable intron that appears to be a conserved feature of betaherpesviruses. Overall, our work reveals the complexity of HHV-6 genomes and highlights novel features that are conserved between betaherpesviruses, providing a rich resource for future functional studies.

Predicting Protein Phosphorylation Sites Based on Deep Learning

Current Bioinformatics ◽

10.2174/1574893614666190902154332 ◽

2020 ◽

Vol 15 (4) ◽

pp. 300-308

Author(s):

Haixia Long ◽

Zhao Sun ◽

Manzhi Li ◽

Hai Yan Fu ◽

Ming Cai Lin

Keyword(s):

Neural Network ◽

Deep Learning ◽

Amino Acid ◽

Protein Phosphorylation ◽

High Throughput Sequencing ◽

Short Term Memory ◽

Basic Research ◽

Phosphorylation Sites ◽

Sequencing Technologies ◽

And Function

Background: Protein phosphorylation is one of the most important Post-translational Modifications (PTMs), occurring at the amino acid residues serine (S), threonine (T), and tyrosine (Y). It plays critical roles in protein structure and function. With the development of novel high-throughput sequencing technologies, a huge number of protein sequences are being generated and stored in databases. Objective: It is of great importance in both basic research and drug development to quickly and accurately predict which S, T, or Y residues can be phosphorylated. Methods: To address this problem, a novel hybrid deep learning model combining a convolutional neural network and a bi-directional long short-term memory recurrent neural network (CNN+BLSTM) is proposed for predicting phosphorylation sites in proteins. The model contains a series of layers that transform the input data into an output class: the convolution layer captures higher-level abstract features of amino acids, while the recurrent layer captures long-term dependencies between amino acids to improve predictions. The joint model learns interactions between higher-level features derived from the protein sequence to predict the phosphorylated sites. Results: We compared our model with two canonical methods, namely iPhos-PseEn and MusiteDeep. A 5-fold cross-validation process indicated that CNN+BLSTM outperforms the two competitors on various evaluation metrics, such as the areas under the receiver operating characteristic and precision-recall curves, the Matthews correlation coefficient, F-measure, and accuracy. Conclusion: CNN+BLSTM is promising for identifying potential protein phosphorylation sites for further experimental validation.

ORFik: a comprehensive R toolkit for the analysis of translation

BMC Bioinformatics ◽

10.1186/s12859-021-04254-w ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Håkon Tjeldnes ◽

Kornel Labun ◽

Yamila Torres Cleuren ◽

Katarzyna Chyżyńska ◽

Michał Świrski ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Ribosome Profiling ◽

Open Reading Frames ◽

Sequencing Data ◽

Protein Coding ◽

High Throughput Sequencing Data ◽

Upstream Open Reading Frames ◽

Many Core ◽

User Friendly

Abstract Background: With the rapid growth in the use of high-throughput methods for characterizing translation and the continued expansion of multi-omics, there is a need for back-end functions and streamlined tools for processing, analyzing, and characterizing data produced by these assays. Results: Here, we introduce ORFik, a user-friendly R/Bioconductor API and toolbox for studying translation and its regulation. It extends GenomicRanges from the genome to the transcriptome and implements a framework that integrates data from several sources. ORFik streamlines the steps to process, analyze, and visualize the different steps of translation with a particular focus on initiation and elongation. It accepts high-throughput sequencing data from ribosome profiling to quantify ribosome elongation or RCP-seq/TCP-seq to also quantify ribosome scanning. In addition, ORFik can use CAGE data to accurately determine 5′ UTRs and RNA-seq for determining translation relative to RNA abundance. ORFik supports and calculates over 30 different translation-related features and metrics from the literature and can annotate translated regions such as proteins or upstream open reading frames (uORFs). As a use-case, we demonstrate using ORFik to rapidly annotate the dynamics of 5′ UTRs across different tissues, detect their uORFs, and characterize their scanning and translation in the downstream protein-coding regions. Conclusion: In summary, ORFik introduces hundreds of tested, documented and optimized methods. ORFik is designed to be easily customizable, enabling users to create complete workflows from raw data to publication-ready figures for several types of sequencing data. Finally, by improving speed and scope of many core Bioconductor functions, ORFik offers enhancement benefiting the entire Bioconductor environment. Availability: http://bioconductor.org/packages/ORFik.

A Deep Recurrent Neural Network Discovers Complex Biological Rules to Decipher RNA Protein-Coding Potential

10.1101/200758 ◽

2017 ◽

Cited By ~ 1

Author(s):

Steven T. Hill ◽

Rachael Kuintzle ◽

Amy Teegarden ◽

Erich Merrill ◽

Padideh Danaee ◽

...

Keyword(s):

Neural Network ◽

Recurrent Neural Network ◽

Messenger Rna ◽

Noncoding Rna ◽

De Novo ◽

Biological Knowledge ◽

Protein Coding ◽

Deep Recurrent Neural Network ◽

Wide Range ◽

Coding Potential

Abstract The current deluge of newly identified RNA transcripts presents a singular opportunity for improved assessment of coding potential, a cornerstone of genome annotation, and for machine-driven discovery of biological knowledge. While traditional, feature-based methods for RNA classification are limited by current scientific knowledge, deep learning methods can independently discover complex biological rules in the data de novo. We trained a gated recurrent neural network (RNN) on human messenger RNA (mRNA) and long noncoding RNA (lncRNA) sequences. Our model, mRNA RNN (mRNN), surpasses state-of-the-art methods at predicting protein-coding potential. To understand what mRNN learned, we probed the network and uncovered several context-sensitive codons highly predictive of coding potential. Our results suggest that gated RNNs can learn complex and long-range patterns in full-length human transcripts, making them ideal for performing a wide range of difficult classification tasks and, most importantly, for harvesting new biological insights from the rising flood of sequencing data.
