Deep learning program to predict protein functions based on sequence information

ABSTRACT. Accurate annotation of protein functions is important for a profound understanding of molecular biology. A large number of proteins remain uncharacterized because of the sparsity of available supporting information. For a large set of uncharacterized proteins, the only type of information available is their amino acid sequence. In this paper, we propose DeepSeq – a deep learning architecture – that utilizes only the protein sequence information to predict its associated functions. The prediction process does not require handcrafted features; rather, the architecture automatically extracts representations from the input sequence data. Results of our experiments with DeepSeq indicate significant improvements in terms of prediction accuracy when compared with other sequence-based methods. Our deep learning model achieves an overall validation accuracy of 86.72%, with an F1 score of 71.13%. Moreover, using the automatically learned features and without any changes to DeepSeq, we successfully solved a different problem i.e. protein function localization, with no human intervention. Finally, we discuss how this same architecture can be used to solve even more complicated problems such as prediction of 2D and 3D structure as well as protein-protein interactions.


MUFFIN: multi-scale feature fusion for drug–drug interaction prediction

Bioinformatics ◽

10.1093/bioinformatics/btab169 ◽

2021 ◽

Author(s):

Yujie Chen ◽

Tengfei Ma ◽

Xixi Yang ◽

Jianmin Wang ◽

Bosheng Song ◽

...

Keyword(s):

Molecular Structure ◽

Deep Learning ◽

Medical Information ◽

Feature Fusion ◽

Molecular Graph ◽

Knowledge Graph ◽

Sequence Information ◽

Learning Models ◽

Scale Feature ◽

Multi Scale

Abstract. Motivation: Adverse drug–drug interactions (DDIs) are crucial for drug research and mainly cause morbidity and mortality. Thus, the identification of potential DDIs is essential for doctors, patients and the society. Existing traditional machine learning models rely heavily on handcraft features and lack generalization. Recently, the deep learning approaches that can automatically learn drug features from the molecular graph or drug-related network have improved the ability of computational models to predict unknown DDIs. However, previous works utilized large labeled data and merely considered the structure or sequence information of drugs without considering the relations or topological information between drug and other biomedical objects (e.g. gene, disease and pathway), or considered knowledge graph (KG) without considering the information from the drug molecular structure. Results: Accordingly, to effectively explore the joint effect of drug molecular structure and semantic information of drugs in knowledge graph for DDI prediction, we propose a multi-scale feature fusion deep learning model named MUFFIN. MUFFIN can jointly learn the drug representation based on both the drug-self structure information and the KG with rich bio-medical information. In MUFFIN, we designed a bi-level cross strategy that includes cross- and scalar-level components to fuse multi-modal features well. MUFFIN can alleviate the restriction of limited labeled data on deep learning models by crossing the features learned from large-scale KG and drug molecular graph. We evaluated our approach on three datasets and three different tasks including binary-class, multi-class and multi-label DDI prediction tasks. The results showed that MUFFIN outperformed other state-of-the-art baselines. Availability and implementation: The source code and data are available at https://github.com/xzenglab/MUFFIN.


Self-Supervised Representation Learning of Protein Tertiary Structures (PtsRep): Protein Engineering as A Case Study

10.1101/2020.12.22.423916 ◽

2020 ◽

Author(s):

Junwen Luo ◽

Yi Cai ◽

Jialin Wu ◽

Hongmin Cai ◽

Xiaofeng Yang ◽

...

Keyword(s):

Deep Learning ◽

Protein Engineering ◽

Structural Information ◽

Representation Learning ◽

Sequence Information ◽

Structural Representation ◽

Tertiary Structures ◽

Structural Space ◽

General Protein ◽

And Function

Abstract. In recent years, deep learning has been increasingly used to decipher the relationships among protein sequence, structure, and function. Thus far deep learning of proteins has mostly utilized protein primary sequence information, while the vast amount of protein tertiary structural information remains unused. In this study, we devised a self-supervised representation learning framework to extract the fundamental features of unlabeled protein tertiary structures (PtsRep), and the embedded representations were transferred to two commonly recognized protein engineering tasks, protein stability and GFP fluorescence prediction. On both tasks, PtsRep significantly outperformed the two benchmark methods (UniRep and TAPE-BERT), which are based on protein primary sequences. Protein clustering analyses demonstrated that PtsRep can capture the structural signals in proteins. PtsRep reveals an avenue for general protein structural representation learning, and for exploring protein structural space for protein engineering and drug design.


BGFE: A Deep Learning Model for ncRNA-Protein Interaction Predictions Based on Improved Sequence Information

International Journal of Molecular Sciences ◽

10.3390/ijms20040978 ◽

2019 ◽

Vol 20 (4) ◽

pp. 978 ◽

Cited By ~ 5

Author(s):

Zhao-Hui Zhan ◽

Li-Na Jia ◽

Yong Zhou ◽

Li-Ping Li ◽

Hai-Cheng Yi

Keyword(s):

Deep Learning ◽

Protein Interactions ◽

Prediction Accuracy ◽

Sparse Matrices ◽

Protein Sequences ◽

Biological Research ◽

Sequence Information ◽

Feature Extraction Method ◽

Cellular Processes ◽

High Level

The interactions between ncRNAs and proteins are critical for regulating various cellular processes in organisms, such as gene expression regulations. However, due to limitations, including financial and material consumptions in recent experimental methods for predicting ncRNA and protein interactions, it is essential to propose an innovative and practical approach with convincing performance of prediction accuracy. In this study, based on the protein sequences from a biological perspective, we put forward an effective deep learning method, named BGFE, to predict ncRNA and protein interactions. Protein sequences are represented by bi-gram probability feature extraction method from Position Specific Scoring Matrix (PSSM), and for ncRNA sequences, k-mers sparse matrices are employed to represent them. Furthermore, to extract hidden high-level feature information, a stacked auto-encoder network is employed with the stacked ensemble integration strategy. We evaluate the performance of the proposed method by using three datasets and a five-fold cross-validation after classifying the features through the random forest classifier. The experimental results clearly demonstrate the effectiveness and the prediction accuracy of our approach. In general, the proposed method is helpful for ncRNA and protein interacting predictions and it provides some serviceable guidance in future biological research.
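The two sequence representations named in this abstract can be sketched roughly as follows. This is a minimal illustration, not the BGFE implementation: the function names `bigram_pssm` and `kmer_frequencies`, the choice of k = 3, and the normalization to frequencies are assumptions made here. The bi-gram feature uses the commonly cited definition B[m, n] = sum over i of P[i, m] * P[i+1, n] on an L x 20 PSSM, which yields 400 values per protein.

```python
import itertools
import numpy as np


def bigram_pssm(pssm):
    """Bi-gram features from an (L, 20) PSSM.

    B[m, n] = sum_i P[i, m] * P[i + 1, n], flattened to a 400-dim vector.
    """
    pssm = np.asarray(pssm, dtype=float)
    # Rows 0..L-2 paired with rows 1..L-1: a (20, L-1) @ (L-1, 20) product.
    return (pssm[:-1].T @ pssm[1:]).ravel()


def kmer_frequencies(seq, k=3, alphabet="ACGU"):
    """Normalized k-mer counts for an RNA sequence (k = 3 is an assumption)."""
    index = {"".join(p): i
             for i, p in enumerate(itertools.product(alphabet, repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing unknown symbols
            counts[j] += 1
    total = counts.sum()
    return counts / total if total else counts
```

In a pipeline like the one described, vectors of this kind would then be passed to the stacked auto-encoder and the random forest classifier; those stages are not sketched here.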


Identification of Antioxidant Proteins With Deep Learning From Sequence Information

Frontiers in Pharmacology ◽

10.3389/fphar.2018.01036 ◽

2018 ◽

Vol 9 ◽

Cited By ~ 6

Author(s):

Lifen Shao ◽

Hui Gao ◽

Zhen Liu ◽

Juan Feng ◽

Lixia Tang ◽

...

Keyword(s):

Deep Learning ◽

Sequence Information ◽

Antioxidant Proteins


Predicting circRNA-RBP interaction sites using a codon-based encoding and hybrid deep neural networks

10.1101/499012 ◽

2018 ◽

Cited By ~ 2

Author(s):

Kaiming Zhang ◽

Xiaoyong Pan ◽

Yang Yang ◽

Hong-Bin Shen

Keyword(s):

Neural Network ◽

Machine Learning ◽

Deep Learning ◽

Binding Sites ◽

Large Scale ◽

Rna Binding ◽

Sequence Information ◽

Rna Sequences ◽

Encoding Scheme ◽

Interaction Sites

Abstract. Circular RNAs (circRNAs), with their crucial roles in gene regulation and disease development, have become a rising star in the RNA world. A lot of previous wet-lab studies focused on the interaction mechanisms between circRNAs and RNA-binding proteins (RBPs), as the knowledge of circRNA-RBP association is very important for understanding functions of circRNAs. Recently, the abundant CLIP-Seq experimental data has made the large-scale identification and analysis of circRNA-RBP interactions possible, while no computational tool based on machine learning has been developed yet. We present a new deep learning-based method, CRIP (CircRNAs Interact with Proteins), for the prediction of RBP binding sites on circRNAs, using only the RNA sequences. In order to fully exploit the sequence information, we propose a stacked codon-based encoding scheme and a hybrid deep learning architecture, in which a convolutional neural network (CNN) learns high-level abstract features and a recurrent neural network (RNN) learns long dependency in the sequences. We construct 37 datasets including sequence fragments of binding sites on circRNAs, and each set corresponds to one RBP. The experimental results show that the new encoding scheme is superior to the existing feature representation methods for RNA sequences, and the hybrid network outperforms conventional classifiers by a large margin, where both the CNN and RNN components contribute to the performance improvement. To the best of our knowledge, CRIP is the first machine learning-based tool specialized in the prediction of circRNA-RBP interactions, which is expected to play an important role for large-scale function analysis of circRNAs.
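One way to picture a codon-based encoding is to slide a window of three nucleotides along the sequence with stride 1 and one-hot encode each overlapping triplet. The sketch below does exactly that over the 64 possible triplets. The abstract does not specify the details, so the stride, the 64-way vocabulary, and the name `stacked_codon_one_hot` are assumptions made for illustration; the published scheme may group codons differently.

```python
import itertools
import numpy as np

# All 64 nucleotide triplets in a fixed order (ordering is an arbitrary choice).
CODONS = ["".join(p) for p in itertools.product("ACGU", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def stacked_codon_one_hot(seq):
    """Encode overlapping triplets of an RNA sequence as one-hot rows.

    Returns an array of shape (len(seq) - 2, 64); triplets containing
    symbols outside ACGU are left as all-zero rows.
    """
    seq = seq.upper().replace("T", "U")
    n = max(len(seq) - 2, 0)
    out = np.zeros((n, 64), dtype=np.float32)
    for i in range(n):
        j = CODON_INDEX.get(seq[i:i + 3])
        if j is not None:
            out[i, j] = 1.0
    return out
```

A matrix of this shape is the kind of input a CNN followed by an RNN could consume, with the convolution running along the sequence axis.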


Insights into SARS-CoV-2, the Coronavirus Underlying COVID-19: Recent Genomic Data and the Development of Reverse Genetics Systems

Journal of General Virology ◽

10.1099/jgv.0.001458 ◽

2020 ◽

Vol 101 (10) ◽

pp. 1021-1024

Author(s):

Severino Jefferson Ribeiro da Silva ◽

Renata Pessôa Germano Mendes ◽

Caroline Targino Alves da Silva ◽

Alessio Lorusso ◽

Alain Kohl ◽

...

Keyword(s):

Severe Acute Respiratory Syndrome ◽

Reverse Genetics ◽

World Health ◽

Close Relative ◽

Sequence Information ◽

Recombinant Viruses ◽

Protein Functions ◽

Genomic Studies ◽

Health Organization

The emergence and rapid worldwide spread of a novel pandemic of acute respiratory disease – eventually named coronavirus disease 2019 (COVID-19) by the World Health Organization (WHO) – across the human population has raised great concerns. It prompted a mobilization around the globe to study the underlying pathogen, a close relative of severe acute respiratory syndrome coronavirus (SARS-CoV) called severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Numerous genome sequences of SARS-CoV-2 are now available and in-depth analyses are advancing. These will allow detailed characterization of sequence and protein functions, including comparative studies. Care should be taken when inferring function from sequence information alone, and reverse genetics systems can be used to unequivocally identify key features. For example, the molecular markers of virulence, host range and transmissibility of SARS-CoV-2 can be compared to those of related viruses in order to shed light on the biology of this emerging pathogen. Here, we summarize some recent insights from genomic studies and strategies for reverse genetics systems to generate recombinant viruses, which will be useful to investigate viral genome properties and evolution.


FilterDCA: interpretable supervised contact prediction using inter-domain coevolution

10.1101/2019.12.24.887877 ◽

2019 ◽

Cited By ~ 1

Author(s):

Maureen Muscat ◽

Giancarlo Croce ◽

Edoardo Sarti ◽

Martin Weigt

Keyword(s):

Deep Learning ◽

De Novo ◽

Protein Complexes ◽

Protein Structures ◽

Direct Coupling ◽

Sequence Information ◽

Coupling Analysis ◽

Contact Patterns ◽

Direct Coupling Analysis ◽

Training Sets

Abstract. Predicting three-dimensional protein structure and assembling protein complexes using sequence information belongs to the most prominent tasks in computational biology. Recently substantial progress has been obtained in the case of single proteins using a combination of unsupervised coevolutionary sequence analysis with structurally supervised deep learning. While reaching impressive accuracies in predicting residue-residue contacts, deep learning has a number of disadvantages. The need for large structural training sets limits the applicability to multi-protein complexes; and their deep architecture makes the interpretability of the convolutional neural networks intrinsically hard. Here we introduce FilterDCA, a simpler supervised predictor for inter-domain and inter-protein contacts. It is based on the fact that contact maps of proteins show typical contact patterns, which results from secondary structure and are reflected by patterns in coevolutionary analysis. We explicitly integrate averaged contacts patterns with coevolutionary scores derived by Direct Coupling Analysis, reaching results comparable to more complex deep-learning approaches, while remaining fully transparent and interpretable. The FilterDCA code is available at http://gitlab.lcqb.upmc.fr/muscat/FilterDCA.

Author summary. The de novo prediction of tertiary and quaternary protein structures has recently seen important advances, by combining unsupervised, purely sequence-based coevolutionary analyses with structure-based supervision using deep learning for contact-map prediction. While showing impressive performance, deep-learning methods require large training sets and pose severe obstacles for their interpretability. Here we construct a simple, transparent and therefore fully interpretable inter-domain contact predictor, which uses the results of coevolutionary Direct Coupling Analysis in combination with explicitly constructed filters reflecting typical contact patterns in a training set of known protein structures, and which improves the accuracy of predicted contacts significantly. Our approach thereby sheds light on the question how contact information is encoded in coevolutionary signals.
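The idea of matching a typical contact pattern against a map of coevolutionary scores can be illustrated with a plain sliding-window correlation. The sketch below is only a schematic of that idea under stated assumptions: the function name `filter_scores`, the zero padding, and the use of a single square filter are choices made here, and the way FilterDCA actually builds and combines its filters is not described in the abstract.

```python
import numpy as np


def filter_scores(dca, filt):
    """Correlate a square, odd-sized pattern filter with a DCA score map.

    out[i, j] is the sum of the k x k window of `dca` centred on (i, j),
    weighted element-wise by `filt`; the map is zero-padded at the borders.
    """
    dca = np.asarray(dca, dtype=float)
    filt = np.asarray(filt, dtype=float)
    k = filt.shape[0]
    r = k // 2
    padded = np.pad(dca, r, mode="constant")
    out = np.zeros_like(dca)
    for i in range(dca.shape[0]):
        for j in range(dca.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * filt)
    return out
```

With a diagonal filter such as `np.eye(3)`, a residue pair scores highly when its diagonal neighbours in the map also carry signal, which is one simple way a secondary-structure-like pattern could be rewarded.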


Fast and accurate microRNA search using CNN

BMC Bioinformatics ◽

10.1186/s12859-019-3279-2 ◽

2019 ◽

Vol 20 (S23) ◽

Author(s):

Xubo Tang ◽

Yanni Sun

Keyword(s):

Deep Learning ◽

Secondary Structure ◽

Classification Accuracy ◽

Feature Learning ◽

Sequence Information ◽

Learning Models ◽

Structure Conservation ◽

Different Types ◽

Key Issues ◽

Negative Class

Abstract. Background: There are many different types of microRNAs (miRNAs) and elucidating their functions is still under intensive research. A fundamental step in functional annotation of a new miRNA is to classify it into characterized miRNA families, such as those in Rfam and miRBase. With the accumulation of annotated miRNAs, it becomes possible to use deep learning-based models to classify different types of miRNAs. In this work, we investigate several key issues associated with successful application of deep learning models for miRNA classification. First, as secondary structure conservation is a prominent feature for noncoding RNAs including miRNAs, we examine whether secondary structure-based encoding improves classification accuracy. Second, as there are many more non-miRNA sequences than miRNAs, instead of assigning a negative class for all non-miRNA sequences, we test whether using softmax output can distinguish in-distribution and out-of-distribution samples. Finally, we investigate whether deep learning models can correctly classify sequences from small miRNA families. Results: We present our trained convolutional neural network (CNN) models for classifying miRNAs using different types of feature learning and encoding methods. In the first method, we explicitly encode the predicted secondary structure in a matrix. In the second method, we use only the primary sequence information and one-hot encoding matrix. In addition, in order to reject sequences that should not be classified into targeted miRNA families, we use a threshold derived from softmax layer to exclude out-of-distribution sequences, which is an important feature to make this model useful for real transcriptomic data. The comparison with the state-of-the-art ncRNA classification tools such as Infernal shows that our method can achieve comparable sensitivity and accuracy while being significantly faster. Conclusion: Automatic feature learning in CNN can lead to better classification accuracy and sensitivity for miRNA classification and annotation. The trained models and also associated codes are freely available at https://github.com/HubertTang/DeepMir.
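The one-hot input encoding and the softmax-threshold rejection step mentioned in this abstract can be sketched as below. This is a minimal illustration rather than the DeepMir code: the function names, the threshold value of 0.9, and the use of -1 as the "rejected" label are assumptions made here, and the logits stand in for the output of a trained CNN.

```python
import numpy as np


def one_hot(seq, alphabet="ACGU"):
    """Encode an RNA sequence as a (len(seq), 4) one-hot matrix."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    out = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for pos, ch in enumerate(seq.upper()):
        if ch in index:  # unknown symbols stay all-zero
            out[pos, index[ch]] = 1.0
    return out


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def classify_with_rejection(logits, threshold=0.9):
    """Return the predicted family index per row, or -1 when the top
    softmax probability falls below `threshold` (treated as out-of-distribution).
    """
    probs = softmax(np.asarray(logits, dtype=float))
    best = probs.argmax(axis=-1)
    confidence = probs.max(axis=-1)
    return np.where(confidence >= threshold, best, -1)
```

In practice the threshold would be tuned on held-out data, trading sensitivity on true family members against the rate at which unrelated sequences are accepted.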
