Predicting subcellular location of protein with evolution information and sequence-based deep learning

Background: Predicting protein subcellular localization plays an important role in biological research. Because traditional experimental methods are laborious and time-consuming, many machine learning-based prediction methods have been proposed. However, most of them ignore the evolutionary information of proteins. To improve prediction accuracy, we present a deep learning-based method for predicting protein subcellular locations. Results: Our method uses not only the amino acid sequences of proteins but also their evolution matrices. A bidirectional long short-term memory network processes the entire protein sequence, and a convolutional neural network extracts features from it. The position-specific scoring matrix is used as a supplement to the protein sequence. The method was trained and tested on two benchmark datasets, on which it yields accurate results with an average precision of 0.7901, a ranking loss of 0.0758 and a coverage of 1.2848. Conclusion: The experimental results show that our method outperforms five currently available methods and is an acceptable alternative for predicting protein subcellular location.
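The three figures reported in the abstract (average precision, ranking loss, coverage) are multi-label ranking metrics. The abstract does not spell out the exact definitions used, so the following is only a minimal sketch under common conventions, assuming every protein has at least one true and at least one false location label. The function names and the toy data are hypothetical.

```python
from typing import List, Sequence


def _rank(scores: Sequence[float], j: int) -> int:
    """Rank of label j: how many labels score at least as high (1 = top)."""
    return sum(1 for s in scores if s >= scores[j])


def coverage(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Average depth of the ranked label list needed to reach every true label."""
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        total += max(_rank(scores, j) for j, t in enumerate(truth) if t)
    return total / len(y_true)


def ranking_loss(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Average fraction of (true, false) label pairs that are ordered wrongly."""
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        pos = [j for j, t in enumerate(truth) if t]
        neg = [j for j, t in enumerate(truth) if not t]
        # A pair is wrong when the false label scores at least as high as the true one.
        bad = sum(1 for p in pos for n in neg if scores[n] >= scores[p])
        total += bad / (len(pos) * len(neg))
    return total / len(y_true)


def average_precision(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """For each true label, the share of labels ranked at or above it that are true."""
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        pos = [j for j, t in enumerate(truth) if t]
        per_label = []
        for p in pos:
            hits = sum(1 for q in pos if scores[q] >= scores[p])
            per_label.append(hits / _rank(scores, p))
        total += sum(per_label) / len(per_label)
    return total / len(y_true)


# Toy data: two proteins, three candidate locations (illustrative values only).
y_true = [[1, 0, 0], [0, 1, 1]]
y_score = [[0.9, 0.2, 0.1], [0.8, 0.7, 0.3]]
print(coverage(y_true, y_score), ranking_loss(y_true, y_score),
      average_precision(y_true, y_score))
```

Higher average precision is better, while lower ranking loss and coverage are better. Under this convention, coverage cannot fall below the average number of true labels per protein.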

A Hybrid Deep Learning Model for Predicting Protein Hydroxylation Sites

International Journal of Molecular Sciences ◽

10.3390/ijms19092817 ◽

2018 ◽

Vol 19 (9) ◽

pp. 2817 ◽

Cited By ~ 9

Author(s):

Haixia Long ◽

Bo Liao ◽

Xingyu Xu ◽

Jialiang Yang

Keyword(s):

Deep Learning ◽

Short Term Memory ◽

Learning Model ◽

New Drugs ◽

Post Translational Modifications ◽

Novel Approach ◽

Benchmark Datasets ◽

Memory Network ◽

Scoring Matrix ◽

Deep Learning Model

Protein hydroxylation is a type of post-translational modification (PTM) that plays critical roles in human diseases. Protein sequences are known to contain many uncharacterized proline and lysine residues. The question that needs to be answered is which of these residues can be hydroxylated and which cannot. The answer will not only help explain the mechanism of hydroxylation but can also benefit the development of new drugs. In this paper, we propose a novel approach for predicting hydroxylation using a hybrid deep learning model that integrates a convolutional neural network (CNN) and a long short-term memory network (LSTM). We employed a pseudo amino acid composition (PseAAC) method to construct valid benchmark datasets based on a sliding window strategy and used the position-specific scoring matrix (PSSM) to represent samples as inputs to the deep learning model. In addition, we compared our method with popular predictors including CNN, iHyd-PseAAC, and iHyd-PseCp. The results of 5-fold cross-validation demonstrated that our method significantly outperforms the other methods in prediction accuracy.
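The sliding window strategy mentioned above can be illustrated with a short sketch that cuts a fixed-length peptide around each candidate proline (P) or lysine (K). The abstract does not give the window length or the padding rule, so the `flank` size, the `X` padding character and the function name are assumptions made only for illustration.

```python
from typing import List, Tuple


def extract_windows(sequence: str, targets: str = "PK", flank: int = 6,
                    pad: str = "X") -> List[Tuple[int, str]]:
    """Return (1-based position, peptide) for every target residue.

    Each peptide has length 2 * flank + 1 with the target residue in the
    centre; positions beyond the sequence ends are filled with `pad`.
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue in targets:
            # Index i in `sequence` is index i + flank in `padded`,
            # so the window starts at i and ends at i + 2 * flank.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


# Hypothetical five-residue sequence with a short window for readability.
for position, peptide in extract_windows("MAPGK", flank=2):
    print(position, peptide)
```

Each peptide would then be labelled as a positive or negative sample and encoded (for example with PSSM rows) before being passed to the model.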

Advances in the Prediction of Protein Subcellular Locations with Machine Learning

Current Bioinformatics ◽

10.2174/1574893614666181217145156 ◽

2019 ◽

Vol 14 (5) ◽

pp. 406-421 ◽

Cited By ~ 3

Author(s):

Ting-He Zhang ◽

Shao-Wu Zhang

Keyword(s):

Machine Learning ◽

Feature Fusion ◽

Protein Sequences ◽

Subcellular Location ◽

Automated Analysis ◽

Cellular Level ◽

Machine Learning Algorithms ◽

Feature Representation ◽

Protein Subcellular Location ◽

Protein Subcellular Locations

Background: Revealing the subcellular location of a newly discovered protein can bring insight into its function and guide research at the cellular level. The experimental methods currently used to identify protein subcellular locations are both time-consuming and expensive. Thus, computational methods for identifying protein subcellular locations efficiently and effectively are highly desirable. In particular, the rapidly increasing number of protein sequences entering the genome databases calls for the development of automated analysis methods. Methods: In this review, we describe recent advances in predicting protein subcellular locations with machine learning from the following aspects: i) protein subcellular location benchmark dataset construction, ii) protein feature representation and feature descriptors, iii) common machine learning algorithms, iv) cross-validation test methods and assessment metrics, v) web servers. Result & Conclusion: With the large number of protein sequences generated by high-throughput technologies, four future directions for predicting protein subcellular locations with machine learning deserve attention. The first is the selection of novel and effective features (e.g., statistical, physicochemical, evolutionary) from the sequences and structures of proteins. The second is the feature fusion strategy. The third is the design of a powerful predictor, and the fourth is the prediction of proteins with multiple location sites.

Learning complex subcellular distribution patterns of proteins via analysis of immunohistochemistry images

Bioinformatics ◽

10.1093/bioinformatics/btz844 ◽

2019 ◽

Vol 36 (6) ◽

pp. 1908-1914 ◽

Cited By ~ 4

Author(s):

Ying-Ying Xu ◽

Hong-Bin Shen ◽

Robert F Murphy

Keyword(s):

Subcellular Location ◽

Distribution Patterns ◽

Cell Types ◽

Supplementary Information ◽

Protein Distribution ◽

Protein Subcellular Location ◽

Location Patterns ◽

Location Proteomics ◽

Human Protein Atlas ◽

Protein Subcellular Locations

Motivation: Systematic and comprehensive analysis of protein subcellular location as a critical part of proteomics (‘location proteomics’) has been studied for many years, but annotating protein subcellular locations and understanding variation of the location patterns across various cell types and states is still challenging. Results: In this work, we used immunohistochemistry images from the Human Protein Atlas as the source of subcellular location information, and built classification models for the complex protein spatial distribution in normal and cancerous tissues. The models can automatically estimate the fractions of protein in different subcellular locations, and can help to quantify the changes of protein distribution from normal to cancer tissues. In addition, we examined the extent to which different annotated protein pathways and complexes showed similarity in the locations of their member proteins, and then predicted new potential proteins for these networks. Availability and implementation: The dataset and code are available at: www.csbio.sjtu.edu.cn/bioinf/complexsubcellularpatterns. Supplementary information: Supplementary data are available at Bioinformatics online.

Predictive Analysis of Cryptocurrency Price Using Deep Learning

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i3.27.17889 ◽

2018 ◽

Vol 7 (3.27) ◽

pp. 258 ◽

Cited By ~ 4

Author(s):

Yecheng Yao ◽

Jungho Yi ◽

Shengjun Zhai ◽

Yuwen Lin ◽

Taekseung Kim ◽

...

Keyword(s):

Deep Learning ◽

International Relations ◽

Short Term Memory ◽

Training Data ◽

Short Term ◽

Effective Learning ◽

Learning Techniques ◽

Benchmark Datasets ◽

Novel Method ◽

Long Short Term Memory

The decentralization of cryptocurrencies has greatly reduced the level of central control over them, impacting international relations and trade. Further, wide fluctuations in cryptocurrency prices indicate an urgent need for an accurate way to forecast them. This paper proposes a novel method to predict cryptocurrency price by considering factors such as market cap, volume, circulating supply, and maximum supply, based on deep learning techniques such as the recurrent neural network (RNN) and the long short-term memory (LSTM), which are effective learning models for training data, with the LSTM being better at recognizing longer-term associations. The proposed approach is implemented in Python and validated on benchmark datasets. The results verify the applicability of the proposed approach for the accurate prediction of cryptocurrency price.

Deep Learning based Semantic Similarity Detection using Text Data

Information Technology And Control ◽

10.5755/j01.itc.49.4.27118 ◽

2020 ◽

Vol 49 (4) ◽

pp. 495-510

Author(s):

Muhammad Mansoor ◽

Zahoor ur Rehman ◽

Muhammad Shaheen ◽

Muhammad Attique Khan ◽

Mohamed Habib

Keyword(s):

Deep Learning ◽

Language Processing ◽

Short Term Memory ◽

Main Task ◽

Detection Algorithms ◽

Similarity Detection ◽

Novel Approach ◽

Proposed Model ◽

Memory Network ◽

Numeric Data

Similarity detection in text is the main task for a number of Natural Language Processing (NLP) applications. Because textual data is larger in quantity and volume than numeric data, measuring textual similarity is an important problem. Most similarity detection algorithms are based on word-to-word matching, sentence/paragraph matching, or matching of the whole document. In this research, a novel approach is proposed that combines a Long Short Term Memory (LSTM) network with a Convolutional Neural Network (CNN) to measure the semantic similarity between two questions. The proposed model takes sentence pairs as input and measures the similarity between them. The model is tested on the publicly available Quora dataset, where it achieved 87.50% accuracy, which is better than the previous approaches.

Advancing PICO element detection in biomedical text via deep neural networks

Bioinformatics ◽

10.1093/bioinformatics/btaa256 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3856-3862

Author(s):

Di Jin ◽

Peter Szolovits

Keyword(s):

Deep Learning ◽

Short Term Memory ◽

Conditional Random Field ◽

Contextual Information ◽

Learning Model ◽

Detection Accuracy ◽

Clinical Question ◽

Specific Patient ◽

Benchmark Datasets ◽

Deep Learning Model

Motivation: In evidence-based medicine, defining a clinical question in terms of the specific patient problem helps physicians efficiently identify appropriate resources and search for the best available evidence for medical treatment. To formulate a well-defined, focused clinical question, a framework called PICO is widely used, which identifies the sentences in a given medical text that belong to the four components typically reported in clinical trials: Participants/Problem (P), Intervention (I), Comparison (C) and Outcome (O). In this work, we propose a novel deep learning model for recognizing PICO elements in biomedical abstracts. Based on the previous state-of-the-art bidirectional long short-term memory (bi-LSTM) plus conditional random field architecture, we add another bi-LSTM layer on top of the sentence representation vectors so that contextual information from surrounding sentences can be gathered to help infer the interpretation of the current one. In addition, we propose two methods to further generalize and improve the model: adversarial training and unsupervised pre-training over large corpora. Results: We tested our proposed approach on two benchmark datasets. On the PubMed-PICO dataset, our best results outperform the previous best by 5.5%, 7.9% and 5.8% for P, I and O elements in terms of F1 score, respectively. On the other dataset, NICTA-PIBOSO, the improvements for P/I/O elements are 3.9%, 15.6% and 1.3% in F1 score, respectively. Overall, our proposed deep learning model can obtain unprecedented PICO element detection accuracy while avoiding the need for any manual feature selection. Availability and implementation: Code is available at https://github.com/jind11/Deep-PICO-Detection.

Deep learning for quality prediction of nonlinear dynamic processes with variable attention‐based long short‐term memory network

The Canadian Journal of Chemical Engineering ◽

10.1002/cjce.23665 ◽

2019 ◽

Vol 98 (6) ◽

pp. 1377-1389 ◽

Cited By ~ 12

Author(s):

Xiaofeng Yuan ◽

Lin Li ◽

Yalin Wang ◽

Chunhua Yang ◽

Weihua Gui

Keyword(s):

Deep Learning ◽

Nonlinear Dynamic ◽

Short Term Memory ◽

Dynamic Processes ◽

Quality Prediction ◽

Short Term ◽

Term Memory ◽

Memory Network ◽

Long Short Term Memory ◽

Nonlinear Dynamic Processes

Predicting Apoptosis Protein Subcellular Locations based on the Protein Overlapping Property Matrix and Tri-Gram Encoding

International Journal of Molecular Sciences ◽

10.3390/ijms20092344 ◽

2019 ◽

Vol 20 (9) ◽

pp. 2344

Author(s):

Yang Yang ◽

Huiwen Zheng ◽

Chunhua Wang ◽

Wanyue Xiao ◽

Taigang Liu

Keyword(s):

Support Vector Machine ◽

Subcellular Location ◽

Recursive Feature Elimination ◽

Support Vector ◽

Svm Classifier ◽

Protein Subcellular Location ◽

Promising Tool ◽

Apoptosis Protein ◽

Benchmark Datasets ◽

Apoptosis Proteins

To reveal the working pattern of programmed cell death, knowledge of the subcellular location of apoptosis proteins is essential. Because experimental determination is costly and time-consuming, research into computational locating schemes, focusing mainly on new representation techniques for protein sequences and on the selection of classification algorithms, has become popular in recent decades. In this study, a novel tri-gram encoding model based on the protein overlapping property matrix (POPM) is proposed for predicting apoptosis protein subcellular location. Next, a 1000-dimensional feature vector is built to represent a protein. Finally, with the help of support vector machine-recursive feature elimination (SVM-RFE), we select the optimal features and feed them into a support vector machine (SVM) classifier for prediction. The results of jackknife tests on two benchmark datasets demonstrate that our proposed method achieves satisfactory prediction performance while requiring less computing capacity, and could work as a promising tool to predict the subcellular locations of apoptosis proteins.
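A 1000-dimensional tri-gram vector is what results when every residue is mapped to one of ten classes and all 10 × 10 × 10 class triples are counted. The sketch below shows only that counting step. The ten-class grouping is hypothetical, and it assigns each residue to exactly one class, so it does not reproduce the overlapping properties that the POPM encodes.

```python
from typing import Dict, List

# Hypothetical 10-class grouping of the 20 standard amino acids.
# This is an illustration, not the property matrix used in the paper.
GROUPS = ["AG", "VL", "IM", "FW", "YH", "ST", "CP", "NQ", "DE", "KR"]
CLASS_OF: Dict[str, int] = {aa: k for k, grp in enumerate(GROUPS) for aa in grp}


def trigram_features(sequence: str) -> List[float]:
    """Normalised counts of class tri-grams: a 10**3 = 1000-dimensional vector."""
    # Residues outside the 20 standard amino acids are skipped.
    classes = [CLASS_OF[aa] for aa in sequence if aa in CLASS_OF]
    counts = [0.0] * 1000
    # Slide over consecutive triples; (a, b, c) maps to index 100a + 10b + c.
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        counts[100 * a + 10 * b + c] += 1.0
    total = sum(counts)
    return [v / total for v in counts] if total else counts


vector = trigram_features("KRDE")
print(len(vector), vector[998], vector[988])
```

Most entries of such a vector are zero for a typical protein, which is one reason a feature selection step such as SVM-RFE is useful before classification.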

Deep-Kcr: accurate detection of lysine crotonylation sites using deep learning method

Briefings in Bioinformatics ◽

10.1093/bib/bbaa255 ◽

2020 ◽

Author(s):

Hao Lv ◽

Fu-Ying Dao ◽

Zheng-Xing Guan ◽

Hui Yang ◽

Yan-Wen Li ◽

...

Keyword(s):

Deep Learning ◽

Large Scale ◽

Short Term Memory ◽

Information Gain ◽

Independent Set ◽

Cost Effective ◽

Cellular Regulation ◽

Proposed Model ◽

Experimental Approaches ◽

Memory Network

As a newly discovered protein post-translational modification, histone lysine crotonylation (Kcr) is involved in cellular regulation and human diseases. Various proteomics technologies have been developed to detect Kcr sites. However, experimental approaches for identifying Kcr sites are often time-consuming and labor-intensive, which makes them difficult to apply widely across species at large scale. Computational approaches are cost-effective and can be used in a high-throughput manner to generate relatively precise identification. In this study, we develop a deep learning-based method termed Deep-Kcr for Kcr site prediction by combining sequence-based features, physicochemical property-based features and numerical space-derived information with information gain feature selection. We investigate the performance of a convolutional neural network (CNN) and five commonly used classifiers (long short-term memory network, random forest, LogitBoost, naive Bayes and logistic regression) using 10-fold cross-validation and an independent test set. Results show that the CNN consistently displays the best performance with high computational efficiency on large datasets. We also compare Deep-Kcr with other existing tools to demonstrate the excellent predictive power and robustness of our method. Based on the proposed model, a web server called Deep-Kcr was established and is freely accessible at http://lin-group.cn/server/Deep-Kcr.
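Information gain feature selection ranks each feature by how much knowing its value reduces the entropy of the class label. The following is a minimal sketch for discrete features, not the Deep-Kcr implementation. The function names are hypothetical, and real-valued features would first need to be discretised.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature: Sequence[int], labels: Sequence[int]) -> float:
    """IG = H(labels) - H(labels | feature) for one discrete feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def top_k_features(columns: List[Sequence[int]], labels: Sequence[int],
                   k: int) -> List[int]:
    """Indices of the k feature columns with the highest information gain."""
    gains = [information_gain(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda j: gains[j], reverse=True)[:k]


# Toy data: the second feature separates the classes perfectly, the first does not.
labels = [1, 1, 0, 0]
columns = [[1, 0, 1, 0], [1, 1, 0, 0]]
print(top_k_features(columns, labels, k=1))
```

Only the selected columns would then be kept as classifier input.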

Deep learning-based container throughput forecasting: a triple bottom line approach

Industrial Management & Data Systems ◽

10.1108/imds-12-2020-0704 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Sonali Shankar ◽

Sushil Punia ◽

P. Vigneswara Ilavarasan

Keyword(s):

Deep Learning ◽

Short Term Memory ◽

Moving Average ◽

Short Term ◽

Bottom Line ◽

Content Type ◽

Autoregressive Integrated Moving Average ◽

Forecasting Method ◽

Memory Network ◽

Long Short Term Memory

Purpose: Container throughput forecasting plays a pivotal role in strategic, tactical and operational level decision-making. The determination and analysis of the influencing factors of container throughput are observed to enhance the predicting accuracy. Therefore, for effective port planning and management, this study employs a deep learning-based method to forecast the container throughput while considering the influence of economic, environmental and social factors on throughput forecasting. Design/methodology/approach: A novel multivariate container throughput forecasting method is proposed using a long short-term memory network (LSTM). The external factors influencing container throughput, delineated using the triple bottom line, are considered as input to the forecasting method. Principal component analysis (PCA) is employed to reduce the redundancy of the input variables. The container throughput data of the Port of Los Angeles (PLA) is considered for empirical analysis. The forecasting accuracy of the proposed method is measured via an error matrix. The accuracy of the results is further substantiated by the Diebold-Mariano statistical test. Findings: The result of the proposed method is benchmarked against vector autoregression (VAR), autoregressive integrated moving average with external factors (ARIMAX) and LSTM. It is observed that the proposed method outperforms the other methods. Though PCA was not an integral part of the forecasting process, it facilitated the prediction by means of “less data, more accuracy.” Originality/value: A novel deep learning-based forecasting method is proposed to predict container throughput using a hybridized autoregressive integrated moving average with external factors model and long short-term memory network (ARIMAX-LSTM).
