DeepPPSite: A deep learning-based model for analysis and prediction of phosphorylation sites using efficient sequence information

AbstractProtein phosphorylation, which is one of the most important post-translational modifications (PTMs), is involved in regulating myriad cellular processes. Herein, we present a novel deep learning based approach for organism-specific protein phosphorylation site prediction in Chlamydomonas reinhardtii, a model algal phototroph. An ensemble model combining convolutional neural networks and long short-term memory (LSTM) achieves the best performance in predicting phosphorylation sites in C. reinhardtii. Deemed Chlamy-EnPhosSite, the measured best AUC and MCC are 0.90 and 0.64 respectively for a combined dataset of serine (S) and threonine (T) in independent testing higher than those measures for other predictors. When applied to the entire C. reinhardtii proteome (totaling 1,809,304 S and T sites), Chlamy-EnPhosSite yielded 499,411 phosphorylated sites with a cut-off value of 0.5 and 237,949 phosphorylated sites with a cut-off value of 0.7. These predictions were compared to an experimental dataset of phosphosites identified by liquid chromatography-tandem mass spectrometry (LC–MS/MS) in a blinded study and approximately 89.69% of 2,663 C. reinhardtii S and T phosphorylation sites were successfully predicted by Chlamy-EnPhosSite at a probability cut-off of 0.5 and 76.83% of sites were successfully identified at a more stringent 0.7 cut-off. Interestingly, Chlamy-EnPhosSite also successfully predicted experimentally confirmed phosphorylation sites in a protein sequence (e.g., RPS6 S245) which did not appear in the training dataset, highlighting prediction accuracy and the power of leveraging predictions to identify biologically relevant PTM sites. These results demonstrate that our method represents a robust and complementary technique for high-throughput phosphorylation site prediction in C. reinhardtii. It has potential to serve as a useful tool to the community. Chlamy-EnPhosSite will contribute to the understanding of how protein phosphorylation influences various biological processes in this important model microalga.

Download Full-text

MUFFIN: multi-scale feature fusion for drug–drug interaction prediction

Bioinformatics ◽

10.1093/bioinformatics/btab169 ◽

2021 ◽

Author(s):

Yujie Chen ◽

Tengfei Ma ◽

Xixi Yang ◽

Jianmin Wang ◽

Bosheng Song ◽

...

Keyword(s):

Molecular Structure ◽

Deep Learning ◽

Medical Information ◽

Feature Fusion ◽

Molecular Graph ◽

Knowledge Graph ◽

Sequence Information ◽

Learning Models ◽

Scale Feature ◽

Multi Scale

Abstract Motivation Adverse drug–drug interactions (DDIs) are crucial for drug research and mainly cause morbidity and mortality. Thus, the identification of potential DDIs is essential for doctors, patients and the society. Existing traditional machine learning models rely heavily on handcraft features and lack generalization. Recently, the deep learning approaches that can automatically learn drug features from the molecular graph or drug-related network have improved the ability of computational models to predict unknown DDIs. However, previous works utilized large labeled data and merely considered the structure or sequence information of drugs without considering the relations or topological information between drug and other biomedical objects (e.g. gene, disease and pathway), or considered knowledge graph (KG) without considering the information from the drug molecular structure. Results Accordingly, to effectively explore the joint effect of drug molecular structure and semantic information of drugs in knowledge graph for DDI prediction, we propose a multi-scale feature fusion deep learning model named MUFFIN. MUFFIN can jointly learn the drug representation based on both the drug-self structure information and the KG with rich bio-medical information. In MUFFIN, we designed a bi-level cross strategy that includes cross- and scalar-level components to fuse multi-modal features well. MUFFIN can alleviate the restriction of limited labeled data on deep learning models by crossing the features learned from large-scale KG and drug molecular graph. We evaluated our approach on three datasets and three different tasks including binary-class, multi-class and multi-label DDI prediction tasks. The results showed that MUFFIN outperformed other state-of-the-art baselines. Availability and implementation The source code and data are available at https://github.com/xzenglab/MUFFIN.

Download Full-text

Self-Supervised Representation Learning of Protein Tertiary Structures (PtsRep): Protein Engineering as A Case Study

10.1101/2020.12.22.423916 ◽

2020 ◽

Author(s):

Junwen Luo ◽

Yi Cai ◽

Jialin Wu ◽

Hongmin Cai ◽

Xiaofeng Yang ◽

...

Keyword(s):

Deep Learning ◽

Protein Engineering ◽

Structural Information ◽

Representation Learning ◽

Sequence Information ◽

Structural Representation ◽

Tertiary Structures ◽

Structural Space ◽

General Protein ◽

And Function

AbstractIn recent years, deep learning has been increasingly used to decipher the relationships among protein sequence, structure, and function. Thus far deep learning of proteins has mostly utilized protein primary sequence information, while the vast amount of protein tertiary structural information remains unused. In this study, we devised a self-supervised representation learning framework to extract the fundamental features of unlabeled protein tertiary structures (PtsRep), and the embedded representations were transferred to two commonly recognized protein engineering tasks, protein stability and GFP fluorescence prediction. On both tasks, PtsRep significantly outperformed the two benchmark methods (UniRep and TAPE-BERT), which are based on protein primary sequences. Protein clustering analyses demonstrated that PtsRep can capture the structural signals in proteins. PtsRep reveals an avenue for general protein structural representation learning, and for exploring protein structural space for protein engineering and drug design.

Download Full-text

BGFE: A Deep Learning Model for ncRNA-Protein Interaction Predictions Based on Improved Sequence Information

International Journal of Molecular Sciences ◽

10.3390/ijms20040978 ◽

2019 ◽

Vol 20 (4) ◽

pp. 978 ◽

Cited By ~ 5

Author(s):

Zhao-Hui Zhan ◽

Li-Na Jia ◽

Yong Zhou ◽

Li-Ping Li ◽

Hai-Cheng Yi

Keyword(s):

Deep Learning ◽

Protein Interactions ◽

Prediction Accuracy ◽

Sparse Matrices ◽

Protein Sequences ◽

Biological Research ◽

Sequence Information ◽

Feature Extraction Method ◽

Cellular Processes ◽

High Level

The interactions between ncRNAs and proteins are critical for regulating various cellular processes in organisms, such as gene expression regulations. However, due to limitations, including financial and material consumptions in recent experimental methods for predicting ncRNA and protein interactions, it is essential to propose an innovative and practical approach with convincing performance of prediction accuracy. In this study, based on the protein sequences from a biological perspective, we put forward an effective deep learning method, named BGFE, to predict ncRNA and protein interactions. Protein sequences are represented by bi-gram probability feature extraction method from Position Specific Scoring Matrix (PSSM), and for ncRNA sequences, k-mers sparse matrices are employed to represent them. Furthermore, to extract hidden high-level feature information, a stacked auto-encoder network is employed with the stacked ensemble integration strategy. We evaluate the performance of the proposed method by using three datasets and a five-fold cross-validation after classifying the features through the random forest classifier. The experimental results clearly demonstrate the effectiveness and the prediction accuracy of our approach. In general, the proposed method is helpful for ncRNA and protein interacting predictions and it provides some serviceable guidance in future biological research.

Download Full-text

Identification of Antioxidant Proteins With Deep Learning From Sequence Information

Frontiers in Pharmacology ◽

10.3389/fphar.2018.01036 ◽

2018 ◽

Vol 9 ◽

Cited By ~ 6

Author(s):

Lifen Shao ◽

Hui Gao ◽

Zhen Liu ◽

Juan Feng ◽

Lixia Tang ◽

...

Keyword(s):

Deep Learning ◽

Sequence Information ◽

Antioxidant Proteins

Download Full-text

Beyond Homology Transfer: Deep Learning for Automated Annotation of Proteins

10.1101/168120 ◽

2017 ◽

Cited By ~ 1

Author(s):

Mohammad Nauman ◽

Hafeez Ur Rehman ◽

Gianfranco Politano ◽

Alfredo Benso

Keyword(s):

Deep Learning ◽

Protein Interactions ◽

Protein Function ◽

Sequence Data ◽

3D Structure ◽

Input Sequence ◽

Sequence Information ◽

Large Set ◽

Automated Annotation ◽

Protein Functions

ABSTRACTAccurate annotation of protein functions is important for a profound understanding of molecular biology. A large number of proteins remain uncharacterized because of the sparsity of available supporting information. For a large set of uncharacterized proteins, the only type of information available is their amino acid sequence. In this paper, we propose DeepSeq – a deep learning architecture – that utilizes only the protein sequence information to predict its associated functions. The prediction process does not require handcrafted features; rather, the architecture automatically extracts representations from the input sequence data. Results of our experiments with DeepSeq indicate significant improvements in terms of prediction accuracy when compared with other sequence-based methods. Our deep learning model achieves an overall validation accuracy of 86.72%, with an F1 score of 71.13%. Moreover, using the automatically learned features and without any changes to DeepSeq, we successfully solved a different problem i.e. protein function localization, with no human intervention. Finally, we discuss how this same architecture can be used to solve even more complicated problems such as prediction of 2D and 3D structure as well as protein-protein interactions.

Download Full-text

Predicting circRNA-RBP interaction sites using a codon-based encoding and hybrid deep neural networks

10.1101/499012 ◽

2018 ◽

Cited By ~ 2

Author(s):

Kaiming Zhang ◽

Xiaoyong Pan ◽

Yang Yang ◽

Hong-Bin Shen

Keyword(s):

Neural Network ◽

Machine Learning ◽

Deep Learning ◽

Binding Sites ◽

Large Scale ◽

Rna Binding ◽

Sequence Information ◽

Rna Sequences ◽

Encoding Scheme ◽

Interaction Sites

AbstractCircular RNAs (circRNAs), with their crucial roles in gene regulation and disease development, have become a rising star in the RNA world. A lot of previous wet-lab studies focused on the interaction mechanisms between circRNAs and RNA-binding proteins (RBPs), as the knowledge of circRNA-RBP association is very important for understanding functions of circRNAs. Recently, the abundant CLIP-Seq experimental data has made the large-scale identification and analysis of circRNA-RBP interactions possible, while no computational tool based on machine learning has been developed yet.We present a new deep learning-based method, CRIP (CircRNAs Interact with Proteins), for the prediction of RBP binding sites on circRNAs, using only the RNA sequences. In order to fully exploit the sequence information, we propose a stacked codon-based encoding scheme and a hybrid deep learning architecture, in which a convolutional neural network (CNN) learns high-level abstract features and a recurrent neural network (RNN) learns long dependency in the sequences. We construct 37 datasets including sequence fragments of binding sites on circRNAs, and each set corresponds to one RBP. The experimental results show that the new encoding scheme is superior to the existing feature representation methods for RNA sequences, and the hybrid network outperforms conventional classifiers by a large margin, where both the CNN and RNN components contribute to the performance improvement. To the best of our knowledge, CRIP is the first machine learning-based tool specialized in the prediction of circRNA-RBP interactions, which is expected to play an important role for large-scale function analysis of circRNAs.

Download Full-text

FilterDCA: interpretable supervised contact prediction using inter-domain coevolution

10.1101/2019.12.24.887877 ◽

2019 ◽

Cited By ~ 1

Author(s):

Maureen Muscat ◽

Giancarlo Croce ◽

Edoardo Sarti ◽

Martin Weigt

Keyword(s):

Deep Learning ◽

De Novo ◽

Protein Complexes ◽

Protein Structures ◽

Direct Coupling ◽

Sequence Information ◽

Coupling Analysis ◽

Contact Patterns ◽

Direct Coupling Analysis ◽

Training Sets

AbstractPredicting three-dimensional protein structure and assembling protein complexes using sequence information belongs to the most prominent tasks in computational biology. Recently substantial progress has been obtained in the case of single proteins using a combination of unsupervised coevolutionary sequence analysis with structurally supervised deep learning. While reaching impressive accuracies in predicting residue-residue contacts, deep learning has a number of disadvantages. The need for large structural training sets limits the applicability to multi-protein complexes; and their deep architecture makes the interpretability of the convolutional neural networks intrinsically hard. Here we introduce FilterDCA, a simpler supervised predictor for inter-domain and inter-protein contacts. It is based on the fact that contact maps of proteins show typical contact patterns, which results from secondary structure and are reflected by patterns in coevolutionary analysis. We explicitly integrate averaged contacts patterns with coevolutionary scores derived by Direct Coupling Analysis, reaching results comparable to more complex deep-learning approaches, while remaining fully transparent and interpretable. The FilterDCA code is available at http://gitlab.lcqb.upmc.fr/muscat/FilterDCA.Author summaryThe de novo prediction of tertiary and quaternary protein structures has recently seen important advances, by combining unsupervised, purely sequence-based coevolutionary analyses with structure-based supervision using deep learning for contact-map prediction. While showing impressive performance, deep-learning methods require large training sets and pose severe obstacles for their interpretability. Here we construct a simple, transparent and therefore fully interpretable inter-domain contact predictor, which uses the results of coevolutionary Direct Coupling Analysis in combination with explicitly constructed filters reflecting typical contact patterns in a training set of known protein structures, and which improves the accuracy of predicted contacts significantly. Our approach thereby sheds light on the question how contact information is encoded in coevolutionary signals.

Download Full-text

Hidden dynamic signatures drive substrate selectivity in the disordered phosphoproteome

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1921473117 ◽

2020 ◽

Vol 117 (38) ◽

pp. 23606-23616

Author(s):

Min-Hyung Cho ◽

James O. Wrabl ◽

James Taylor ◽

Vincent J. Hilser

Keyword(s):

Conformational Equilibrium ◽

Energy Difference ◽

Statistical Thermodynamic ◽

Sequence Information ◽

Substrate Selectivity ◽

Phosphorylation Sites ◽

Information Encoding ◽

Conformational Ensembles ◽

Protein Substrates ◽

Conservation Patterns

Phosphorylation sites are hyperabundant in the eukaryotic disordered proteome, suggesting that conformational fluctuations play a major role in determining to what extent a kinase interacts with a particular substrate. In biophysical terms, substrate selectivity may be determined not just by the structural–chemical complementarity between the kinase and its protein substrates but also by the free energy difference between the conformational ensembles that are, or are not, recognized by the kinase. To test this hypothesis, we developed a statistical-thermodynamics-based informatics framework, which allows us to probe for the contribution of equilibrium fluctuations to phosphorylation, as evaluated by the ability to predict Ser/Thr/Tyr phosphorylation sites in the disordered proteome. Essential to this framework is a decomposition of substrate sequence information into two types: vertical information encoding conserved kinase specificity motifs and horizontal information encoding substrate conformational equilibrium that is embedded, but often not apparent, within position-specific conservation patterns. We find not only that conformational fluctuations play a major role but also that they are the dominant contribution to substrate selectivity. In fact, the main substrate classifier distinguishing selectivity is the magnitude of change in local compaction of the disordered chain upon phosphorylation of these mostly singly phosphorylated sites. In addition to providing fundamental insights into the consequences of phosphorylation across the proteome, our approach provides a statistical-thermodynamic strategy for partitioning any sequence-based search into contributions from structural–chemical complementarity and those from changes in conformational equilibrium.

Download Full-text

Fast and accurate microRNA search using CNN

BMC Bioinformatics ◽

10.1186/s12859-019-3279-2 ◽

2019 ◽

Vol 20 (S23) ◽

Author(s):

Xubo Tang ◽

Yanni Sun

Keyword(s):

Deep Learning ◽

Secondary Structure ◽

Classification Accuracy ◽

Feature Learning ◽

Sequence Information ◽

Learning Models ◽

Structure Conservation ◽

Different Types ◽

Key Issues ◽

Negative Class

Abstract Background There are many different types of microRNAs (miRNAs) and elucidating their functions is still under intensive research. A fundamental step in functional annotation of a new miRNA is to classify it into characterized miRNA families, such as those in Rfam and miRBase. With the accumulation of annotated miRNAs, it becomes possible to use deep learning-based models to classify different types of miRNAs. In this work, we investigate several key issues associated with successful application of deep learning models for miRNA classification. First, as secondary structure conservation is a prominent feature for noncoding RNAs including miRNAs, we examine whether secondary structure-based encoding improves classification accuracy. Second, as there are many more non-miRNA sequences than miRNAs, instead of assigning a negative class for all non-miRNA sequences, we test whether using softmax output can distinguish in-distribution and out-of-distribution samples. Finally, we investigate whether deep learning models can correctly classify sequences from small miRNA families. Results We present our trained convolutional neural network (CNN) models for classifying miRNAs using different types of feature learning and encoding methods. In the first method, we explicitly encode the predicted secondary structure in a matrix. In the second method, we use only the primary sequence information and one-hot encoding matrix. In addition, in order to reject sequences that should not be classified into targeted miRNA families, we use a threshold derived from softmax layer to exclude out-of-distribution sequences, which is an important feature to make this model useful for real transcriptomic data. The comparison with the state-of-the-art ncRNA classification tools such as Infernal shows that our method can achieve comparable sensitivity and accuracy while being significantly faster. Conclusion Automatic feature learning in CNN can lead to better classification accuracy and sensitivity for miRNA classification and annotation. The trained models and also associated codes are freely available at https://github.com/HubertTang/DeepMir.

Download Full-text