De novo prediction of RNA-protein interactions with Graph Neural Networks


10.1101/2021.09.28.462100 ◽

2021 ◽

Author(s):

Viplove Arora ◽

Guido Sanguinetti

Keyword(s):

Neural Networks ◽

Protein Interactions ◽

Large Scale ◽

RNA Binding ◽

De Novo ◽

RNA Binding Proteins ◽

Large Data ◽

Data Sets ◽

Graph Neural Networks ◽

Post Transcriptional Regulation

RNA-binding proteins (RBPs) are key co- and post-transcriptional regulators of gene expression, playing a crucial role in many biological processes. Experimental methods like CLIP-seq have enabled the identification of transcriptome-wide RNA-protein interactions for select proteins; however, the time- and resource-intensive nature of these technologies calls for the development of computational methods to complement their predictions. Here we leverage recent, large-scale CLIP-seq experiments to construct a de novo predictor of RNA-protein interactions based on graph neural networks (GNN). We show that the GNN method can not only predict missing links in an RNA-protein network, but also predict the entire complement of targets of previously unassayed proteins, and even reconstruct the entire network of RNA-protein interactions in different conditions based on minimal information. Our results demonstrate the potential of machine learning methods to extract useful information on post-transcriptional regulation from large data sets.
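The abstract distinguishes two prediction settings: recovering missing links in a partly observed network, and predicting all targets of a protein that was never assayed. A minimal sketch of how the two held-out sets might be built from an edge list, assuming interactions are stored as (protein, RNA) pairs. The function names and toy identifiers are hypothetical and are not taken from the paper:

```python
import random

def split_missing_links(edges, holdout_fraction=0.2, seed=0):
    """Missing-link setting: hide a random fraction of edges.

    Returns (train_edges, test_edges). Every protein may still keep
    some of its edges in the training set.
    """
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * holdout_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def split_unseen_protein(edges, protein):
    """Unassayed-protein setting: hide every edge of one protein."""
    train = [edge for edge in edges if edge[0] != protein]
    test = [edge for edge in edges if edge[0] == protein]
    return train, test

# Toy (protein, RNA) pairs; the identifiers are made up for illustration.
toy_edges = [
    ("RBP1", "geneA"), ("RBP1", "geneB"),
    ("RBP2", "geneA"),
    ("RBP3", "geneC"), ("RBP3", "geneA"),
]
train_links, test_links = split_missing_links(toy_edges, holdout_fraction=0.4)
train_unseen, test_unseen = split_unseen_protein(toy_edges, "RBP1")
```

The second split is the harder one: the model sees no interactions at all for the held-out protein, so any prediction has to come from other node information rather than from the protein's known neighbours.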


iSUMO - integrative prediction of functionally relevant SUMOylation events

10.1101/056564 ◽

2016 ◽

Author(s):

Xiaotong Yao ◽

Shuvadeep Maity ◽

Shashank Gandhi ◽

Marcin Imielinski ◽

Christine Vogel

Keyword(s):

Protein Interactions ◽

Large Scale ◽

RNA Binding ◽

RNA Binding Proteins ◽

False Positive Rate ◽

Protein Protein Interactions ◽

Cellular Functions ◽

Positive Rate ◽

Protein Nucleic Acid ◽

Scale Experiment

Post-translational modifications by the Small Ubiquitin-like Modifier (SUMO) are essential for diverse cellular functions. Large-scale experiments and sequence-based predictions have identified thousands of SUMOylated proteins. However, the overlap between the datasets is small, suggesting many false positives with low functional relevance. Therefore, we integrated ~800 sequence features and protein characteristics such as cellular function and protein-protein interactions in a machine learning approach to score likely functional SUMOylation events (iSUMO). iSUMO is trained on a total of 24 large-scale datasets, and it predicts 2,291 and 706 SUMO targets in human and yeast, respectively. These estimates are five times higher than what existing sequence-based tools predict at the same 5% false positive rate. Protein-protein and protein-nucleic acid interactions are highly predictive of protein SUMOylation, supporting a role of the modification in protein complex formation. We note the marked prevalence of SUMOylation amongst RNA-binding proteins. We validate iSUMO predictions by experimental or other evidence. iSUMO therefore represents a comprehensive tool to identify high-confidence, functional SUMOylation events for human and yeast.


The increasing diversity and complexity of the RNA-binding protein repertoire in plants

Proceedings of The Royal Society B Biological Sciences ◽

10.1098/rspb.2020.1397 ◽

2020 ◽

Vol 287 (1935) ◽

pp. 20201397 ◽

Cited By ~ 1

Author(s):

C. Marondedze

Keyword(s):

Stress Responses ◽

Large Scale ◽

RNA Binding ◽

RNA Binding Proteins ◽

Functional Roles ◽

Binding Domains ◽

Quantitative Studies ◽

Current State ◽

RNA Interaction ◽

Post Transcriptional Regulation

Post-transcriptional regulation has far-reaching implications on the fate of RNAs. It is gaining increasing momentum as a critical component in adjusting global cellular transcript levels during development and in response to environmental stresses. In this process, RNA-binding proteins (RBPs) are indispensable chaperones that naturally bind RNA via one or multiple globular RNA-binding domains (RBDs), changing the function or fate of the bound RNAs. Despite the technical challenges faced in plants in large-scale studies, several hundred of these RBPs have been discovered and elucidated globally over the past few years. Recent discoveries have more than doubled the number of proteins implicated in RNA interaction, including identification of RBPs lacking classical RBDs. This review will discuss these new emerging classes of RBPs, focusing on the current state of the RBP repertoire in Arabidopsis thaliana, including the diverse functional roles derived from quantitative studies implicating RBPs in abiotic stress responses. Notably, this review highlights that 836 RBPs are enriched as Arabidopsis RBPs while 1865 can be classified as candidate RBPs. The review will also outline outstanding areas within this field that require addressing to advance our understanding and potential biotechnological applications of RBPs.


Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks

10.1101/146175 ◽

2017 ◽

Cited By ~ 3

Author(s):

Xiaoyong Pan ◽

Peter Rijnbeek ◽

Junchi Yan ◽

Hong-Bin Shen

Keyword(s):

Neural Networks ◽

Binding Sites ◽

Large Scale ◽

RNA Binding ◽

Short Term Memory ◽

RNA Binding Proteins ◽

Point Of View ◽

RNA Sequences ◽

Close Relationship ◽

Binding Sequence

RNA regulation depends significantly on its binding protein partners, known as RNA-binding proteins (RBPs). Unfortunately, the binding preferences of most RBPs are still not well characterized, especially from a structural point of view. Hidden informative signals and interdependencies between sequence and structure specificities are two challenging problems for both predicting RBP binding sites and accurately mining sequence and structure motifs. In this study, we propose a deep learning-based method, iDeepS, to simultaneously identify the binding sequence and structure motifs from RNA sequences using convolutional neural networks (CNNs) and a bidirectional long short term memory network (BLSTM). We first perform one-hot encoding for both the sequence and predicted secondary structure, which are appropriate for subsequent convolution operations. To reveal the hidden binding knowledge from the observations, the CNNs are applied to learn the abstract motif features. Considering the close relationship between sequences and predicted structures, we use the BLSTM to capture the long range dependencies between binding sequence and structure motifs identified by the CNNs. Finally, the learned weighted representations are fed into a classification layer to predict the RBP binding sites. We evaluated iDeepS on verified RBP binding sites derived from large-scale representative CLIP-seq datasets, and the results demonstrate that iDeepS can reliably predict the RBP binding sites on RNAs, and outperforms the state-of-the-art methods. An important advantage is that iDeepS is able to automatically extract both binding sequence and structure motifs, which will improve our transparent understanding of the mechanisms of binding specificities of RBPs. iDeepS is available at https://github.com/xypan1232/iDeepS.
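The one-hot encoding step mentioned in the abstract can be illustrated as follows. This is a minimal sketch and not code from the iDeepS repository: the helper name `one_hot_encode` is hypothetical, and the three-symbol dot-bracket structure alphabet is an assumption chosen for illustration (the abstract does not state which structure alphabet is used).

```python
RNA_ALPHABET = "ACGU"

def one_hot_encode(symbols, alphabet=RNA_ALPHABET):
    """Encode a string as a list of rows, one row per position.

    Each row has a single 1.0 at the index of the symbol in `alphabet`.
    Symbols outside the alphabet (for example N) give an all-zero row.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    rows = []
    for symbol in symbols:
        row = [0.0] * len(alphabet)
        if symbol in index:
            row[index[symbol]] = 1.0
        rows.append(row)
    return rows

# Sequence and (assumed) dot-bracket structure are encoded separately,
# giving two position-by-channel matrices of the same length.
sequence_matrix = one_hot_encode("GCAUGC")
structure_matrix = one_hot_encode("((..))", alphabet="(.)")
```

Matrices of this shape are what a one-dimensional convolution typically slides over, with each filter spanning a few positions and all channels, which is why the encoding suits the subsequent CNN layers.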


Development of Large-scale Cross-linking Mass Spectrometry

Molecular & Cellular Proteomics ◽

10.1074/mcp.r116.061663 ◽

2017 ◽

Vol 17 (6) ◽

pp. 1055-1066 ◽

Cited By ~ 16

Author(s):

Helena Maria Barysz ◽

Johan Malmström

Keyword(s):

Mass Spectrometry ◽

Protein Interactions ◽

Protein Function ◽

Large Scale ◽

Unmet Need ◽

Large Data ◽

Cross Linking ◽

Data Sets ◽

Protein Protein Interactions ◽

Model Complex

Cross-linking mass spectrometry (CLMS) provides distance constraints to study the structure of proteins, multiprotein complexes and protein-protein interactions, which are critical for the understanding of protein function. CLMS is an attractive technology to bridge the gap between high-resolution structural biology techniques and proteomic-based interactome studies. However, as outlined in this review, there are still several bottlenecks associated with CLMS which limit its application on a proteome-wide level. Specifically, there is an unmet need for comprehensive software that can reliably identify cross-linked peptides from large data sets. In this review we provide supporting information to reason that targeted proteomics of cross-links may provide the required sensitivity to reliably detect and quantify cross-linked peptides and that a reporter ion signature for cross-linked peptides may become a useful approach to increase confidence in the identification process of cross-linked peptides. In addition, the review summarizes the recent advances in CLMS workflows using the analysis of the condensin complex in intact chromosomes as a model complex.


Deep neural networks for inferring binding sites of RNA-binding proteins by using distributed representations of RNA primary sequence and secondary structure

BMC Genomics ◽

10.1186/s12864-020-07239-w ◽

2020 ◽

Vol 21 (S13) ◽

Author(s):

Lei Deng ◽

Youzhi Liu ◽

Yechuan Shi ◽

Wenhao Zhang ◽

Chun Yang ◽

...

Keyword(s):

Neural Networks ◽

Secondary Structure ◽

Binding Sites ◽

Large Scale ◽

Binding Proteins ◽

RNA Binding ◽

RNA Binding Proteins ◽

Secondary Structures ◽

RNA Sequences ◽

Distributed Representations

Background: RNA binding proteins (RBPs) play a vital role in post-transcriptional processes in all eukaryotes, such as splicing regulation, mRNA transport, and modulation of mRNA translation and decay. The identification of RBP binding sites is a crucial step in understanding the biological mechanism of post-transcriptional gene regulation. However, the determination of RBP binding sites on a large scale is a challenging task due to the high cost of biochemical assays. Quite a number of studies have exploited machine learning methods to predict binding sites. In particular, deep learning is increasingly used in the bioinformatics field by virtue of its ability to learn generalized representations from DNA and protein sequences. Results: In this paper, we implemented a novel deep neural network model, DeepRKE, which combines primary RNA sequence and secondary structure information to effectively predict RBP binding sites. Specifically, we used a word embedding algorithm to extract features of RNA sequences and secondary structures, i.e., distributed representations of k-mer sequences rather than traditional one-hot encoding. The distributed representations are taken as input to convolutional neural networks (CNN) and bidirectional long short-term memory networks (BiLSTM) to identify RBP binding sites. Our results show that DeepRKE outperforms existing counterpart methods on two large-scale benchmark datasets. Conclusions: Our extensive experimental results show that DeepRKE is an efficacious tool for predicting RBP binding sites. The distributed representations of RNA sequences and secondary structures can effectively detect the latent relationship and similarity between k-mers, and thus improve the predictive performance. The source code of DeepRKE is available at https://github.com/youzhiliu/DeepRKE/.
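Before a word embedding algorithm can learn distributed representations, each sequence has to be turned into a "sentence" of k-mer "words". A minimal sketch of that tokenisation step, assuming overlapping k-mers with a configurable stride. The function names, the choice k=3, and the toy sequences are hypothetical and are not taken from the DeepRKE source code:

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a sequence into k-mers taken every `stride` positions.

    Returns an empty list when the sequence is shorter than k.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

def build_vocabulary(token_lists):
    """Map each distinct k-mer to an integer id, in order of first appearance."""
    vocabulary = {}
    for tokens in token_lists:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

# Toy sequences, made up for illustration.
sentences = [kmer_tokens(seq) for seq in ["ACGUA", "CGUAC"]]
vocabulary = build_vocabulary(sentences)
token_ids = [[vocabulary[token] for token in tokens] for tokens in sentences]
```

The integer ids are what an embedding layer or a word2vec-style trainer would consume. Unlike one-hot rows, the learned vectors can place similar k-mers close together, which is the property the abstract credits for the improved performance.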


An RNA tagging approach for system-wide RNA-binding proteome profiling and dynamics investigation upon transcription inhibition

Nucleic Acids Research ◽

10.1093/nar/gkab156 ◽

2021 ◽

Author(s):

Zheng Zhang ◽

Tong Liu ◽

Hangyan Dong ◽

Jian Li ◽

Haofan Sun ◽

...

Keyword(s):

Protein Interactions ◽

Large Scale ◽

RNA Binding ◽

RNA Binding Proteins ◽

Decay Rates ◽

Mass Spectrometry Analysis ◽

Transcription Inhibition ◽

Distribution Profile ◽

Proteome Profiling ◽

Spectrometry Analysis

RNA-protein interactions play key roles in epigenetic, transcriptional and posttranscriptional regulation. To reveal the regulatory mechanisms of these interactions, global investigation of RNA-binding proteins (RBPs) and monitoring of their changes under various physiological conditions are needed. Herein, we developed a psoralen probe (PP)-based method for RNA tagging and ribonucleic-protein complex (RNP) enrichment. Isolation of both coding and noncoding RNAs and mapping of 2986 RBPs, including 782 unknown candidate RBPs, from HeLa cells was achieved by PP enrichment, RNA-sequencing and mass spectrometry analysis. The dynamics study of RNPs by PP enrichment after the inhibition of RNA synthesis provides the first large-scale distribution profile of RBPs bound to RNAs with different decay rates. Furthermore, the remarkably greater decreases in the abundance of the RBPs obtained by PP enrichment than by global proteome profiling suggest that PP enrichment after transcription inhibition offers a valuable way for large-scale evaluation of the candidate RBPs.


The search for RNA-binding proteins: a technical and interdisciplinary challenge

Biochemical Society Transactions ◽

10.1042/bst20200688 ◽

2021 ◽

Author(s):

Jeffrey M. Smith ◽

Jarrod J. Sandow ◽

Andrew I. Webb

Keyword(s):

Gene Expression ◽

Protein Interactions ◽

Binding Proteins ◽

RNA Binding ◽

RNA Binding Proteins ◽

Target Validation ◽

Binding Function ◽

Technical Features ◽

Post Transcriptional Regulation ◽

The Impact

RNA-binding proteins are customarily regarded as important facilitators of gene expression. In recent years, RNA–protein interactions have also emerged as a pervasive force in the regulation of homeostasis. The compendium of proteins with provable RNA-binding function has swelled from the hundreds to the thousands astride the partnership of mass spectrometry-based proteomics and RNA sequencing. At the foundation of these advances is the adaptation of RNA-centric capture methods that can extract bound protein that has been cross-linked in its native environment. These methods reveal snapshots in time displaying an extensive network of regulation and a wealth of data that can be used for both the discovery of RNA-binding function and the molecular interfaces at which these interactions occur. This review will focus on the impact of these developments on our broader perception of post-transcriptional regulation, and how the technical features of current capture methods, as applied in mammalian systems, create a challenging medium for interpretation by systems biologists and target validation by experimental researchers.


Global importance analysis: An interpretability method to quantify importance of genomic features in deep neural networks

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008925 ◽

2021 ◽

Vol 17 (5) ◽

pp. e1008925

Author(s):

Peter K. Koo ◽

Antonio Majdandzic ◽

Matthew Ploenzke ◽

Praveen Anand ◽

Steffan B. Paul

Keyword(s):

Neural Networks ◽

Protein Interactions ◽

Effect Size ◽

Deep Neural Networks ◽

RNA Binding ◽

RNA Binding Proteins ◽

Population Level ◽

Sequence Motifs ◽

Convolutional Network ◽

Importance Analysis

Deep neural networks have demonstrated improved performance at predicting the sequence specificities of DNA- and RNA-binding proteins compared to previous methods that rely on k-mers and position weight matrices. To gain insights into why a DNN makes a given prediction, model interpretability methods, such as attribution methods, can be employed to identify motif-like representations along a given sequence. Because explanations are given on an individual sequence basis and can vary substantially across sequences, deducing generalizable trends across the dataset and quantifying their effect size remains a challenge. Here we introduce global importance analysis (GIA), a model interpretability method that quantifies the population-level effect size that putative patterns have on model predictions. GIA provides an avenue to quantitatively test hypotheses of putative patterns and their interactions with other patterns, as well as map out specific functions the network has learned. As a case study, we demonstrate the utility of GIA on the computational task of predicting RNA-protein interactions from sequence. We first introduce a convolutional network, which we call ResidualBind, and benchmark its performance against previous methods on RNAcompete data. Using GIA, we then demonstrate that in addition to sequence motifs, ResidualBind learns a model that considers the number of motifs, their spacing, and sequence context, such as RNA secondary structure and GC-bias.
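The idea of a population-level effect size can be sketched as an average difference in model output with and without an embedded pattern. The code below is an illustrative reading of the abstract, not the authors' implementation: the background sampler (uniform random nucleotides), the toy model, the motif, and all function names are assumptions made for the example.

```python
import random

def random_background(n_sequences, length, alphabet="ACGU", seed=0):
    """Sample background sequences uniformly (an assumed, simple null model)."""
    rng = random.Random(seed)
    return ["".join(rng.choice(alphabet) for _ in range(length))
            for _ in range(n_sequences)]

def embed_pattern(sequence, pattern, position):
    """Overwrite part of `sequence` with `pattern`, keeping the length fixed."""
    if position < 0 or position + len(pattern) > len(sequence):
        raise ValueError("pattern does not fit at this position")
    return sequence[:position] + pattern + sequence[position + len(pattern):]

def global_importance(model, backgrounds, pattern, position):
    """Mean prediction with the pattern embedded minus mean prediction without."""
    with_pattern = [model(embed_pattern(seq, pattern, position))
                    for seq in backgrounds]
    without_pattern = [model(seq) for seq in backgrounds]
    return (sum(with_pattern) / len(with_pattern)
            - sum(without_pattern) / len(without_pattern))

def toy_model(sequence):
    """Stand-in for a trained network: counts occurrences of a made-up motif."""
    return float(sequence.count("UGCAUG"))

backgrounds = random_background(n_sequences=200, length=40)
effect = global_importance(toy_model, backgrounds, "UGCAUG", position=10)
```

Because the difference is averaged over many backgrounds, it reflects the pattern itself rather than the context of any single sequence. Embedding two patterns at varying distances and repeating the calculation is one way the same routine could probe spacing or motif-count effects.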


Post-transcriptionally impaired de novo mutations contribute to the genetic etiology of four neuropsychiatric disorders

10.1101/175844 ◽

2017 ◽

Author(s):

Fengbiao Mao ◽

Lu Wang ◽

Xiaolu Zhao ◽

Zhongshan Li ◽

Luoyuan Xiao ◽

...

Keyword(s):

Protein Interactions ◽

RNA Binding ◽

De Novo ◽

RNA Binding Proteins ◽

Genetic Disorders ◽

Interaction Network ◽

Neuropsychiatric Disorders ◽

Protein Protein Interactions ◽

De Novo Mutations ◽

Coding Region

While deleterious de novo mutations (DNMs) in coding regions conferring risk in neuropsychiatric disorders have been revealed by next-generation sequencing, the role of DNMs involved in post-transcriptional regulation in the pathogenesis of these disorders remains to be elucidated. Here, we identified 1,736 post-transcriptionally impaired DNMs (piDNMs), and prioritized 1,482 candidate genes in four neuropsychiatric disorders from 7,748 families. Our results revealed a higher prevalence of piDNMs in the probands than in controls (P = 8.19×10^−17), and piDNM-harboring genes were enriched for epigenetic modifications and neuronal or synaptic functions. Moreover, we identified 86 piDNM-containing genes forming convergent co-expression modules and intensive protein-protein interactions in at least two neuropsychiatric disorders. These cross-disorder genes carrying piDNMs could form an interaction network centered on RNA binding proteins, suggesting a shared post-transcriptional etiology underlying these disorders. Our findings illustrate the significant contribution of piDNMs to four neuropsychiatric disorders, and lay emphasis on combining functional and network-based evidence to identify regulatory causes of genetic disorders.


Recent advances in proximity-based labeling methods for interactome mapping

F1000Research ◽

10.12688/f1000research.16903.1 ◽

2019 ◽

Vol 8 ◽

pp. 135 ◽

Cited By ~ 29

Author(s):

Laura Trinkle-Mulcahy

Keyword(s):

Protein Interactions ◽

Large Scale ◽

RNA Binding ◽

RNA Binding Proteins ◽

Protein Complexes ◽

Affinity Purification ◽

Small Scale ◽

Protein Protein Interactions ◽

Large Scale Network ◽

Complementary Approach

Proximity-based labeling has emerged as a powerful complementary approach to classic affinity purification of multiprotein complexes in the mapping of protein–protein interactions. Ongoing optimization of enzyme tags and delivery methods has improved both temporal and spatial resolution, and the technique has been successfully employed in numerous small-scale (single complex mapping) and large-scale (network mapping) initiatives. When paired with quantitative proteomic approaches, the ability of these assays to provide snapshots of stable and transient interactions over time greatly facilitates the mapping of dynamic interactomes. Furthermore, recent innovations have extended biotin-based proximity labeling techniques such as BioID and APEX beyond classic protein-centric assays (tag a protein to label neighboring proteins) to include RNA-centric (tag an RNA species to label RNA-binding proteins) and DNA-centric (tag a gene locus to label associated protein complexes) assays.
