scholarly journals LPI-deepGBDT: a multiple-layer deep framework based on gradient boosting decision trees for lncRNA–protein interaction identification

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Liqian Zhou ◽  
Zhao Wang ◽  
Xiongfei Tian ◽  
Lihong Peng

Abstract Background Long noncoding RNAs (lncRNAs) play important roles in various biological and pathological processes. Discovery of lncRNA–protein interactions (LPIs) contributes to understand the biological functions and mechanisms of lncRNAs. Although wet experiments find a few interactions between lncRNAs and proteins, experimental techniques are costly and time-consuming. Therefore, computational methods are increasingly exploited to uncover the possible associations. However, existing computational methods have several limitations. First, majority of them were measured based on one simple dataset, which may result in the prediction bias. Second, few of them are applied to identify relevant data for new lncRNAs (or proteins). Finally, they failed to utilize diverse biological information of lncRNAs and proteins. Results Under the feed-forward deep architecture based on gradient boosting decision trees (LPI-deepGBDT), this work focuses on classify unobserved LPIs. First, three human LPI datasets and two plant LPI datasets are arranged. Second, the biological features of lncRNAs and proteins are extracted by Pyfeat and BioProt, respectively. Thirdly, the features are dimensionally reduced and concatenated as a vector to represent an lncRNA–protein pair. Finally, a deep architecture composed of forward mappings and inverse mappings is developed to predict underlying linkages between lncRNAs and proteins. LPI-deepGBDT is compared with five classical LPI prediction models (LPI-BLS, LPI-CatBoost, PLIPCOM, LPI-SKF, and LPI-HNM) under three cross validations on lncRNAs, proteins, lncRNA–protein pairs, respectively. It obtains the best average AUC and AUPR values under the majority of situations, significantly outperforming other five LPI identification methods. That is, AUCs computed by LPI-deepGBDT are 0.8321, 0.6815, and 0.9073, respectively and AUPRs are 0.8095, 0.6771, and 0.8849, respectively. The results demonstrate the powerful classification ability of LPI-deepGBDT. Case study analyses show that there may be interactions between GAS5 and Q15717, RAB30-AS1 and O00425, and LINC-01572 and P35637. Conclusions Integrating ensemble learning and hierarchical distributed representations and building a multiple-layered deep architecture, this work improves LPI prediction performance as well as effectively probes interaction data for new lncRNAs/proteins.

2021 ◽  
Author(s):  
Liqian ZhouZhou ◽  
Zhao Wang ◽  
Xiongfei Tian ◽  
Lihong Peng

Abstract Background: Long noncoding RNAs (lncRNAs) play important roles in various biological and pathological processes. Discovery of lncRNA-protein interactions (LPIs) contributes to understand the biological functions and mechanisms of lncRNAs. Although wet experiments find a few interactions between lncRNAs and proteins, experimental techniques are costly and time-consuming. Therefore, computational methods are increasingly exploited to uncover the possible associations. However, existing computational methods have several limitations. First, majority of them were measured based on one simple dataset, which may result in the prediction bias. Second, few of them are applied to identify relevant data for new lncRNAs (or proteins). Finally, they failed to utilize diverse biological information of lncRNAs and proteins. Results: Under the feed-forward deep architecture based on Gradient Boosting Decision Trees (LPI-deepGBDT), this work focuses on classify unobserved LPIs. First, three human LPI datasets and two plant LPI datasets are arranged. Second, the biological features of lncRNAs and proteins are extracted by Pyfeat and BioProt, respectively. Thirdly, the features are dimensionally reduced and concatenated as a vector to represent an lncRNA-protein pair. Finally, a deep architecture composed of the forward mappings and inverse mappings is developed to predict underlying linkages between lncRNAs and proteins. LPI-deepGBDT is compared with four classical LPI prediction models (LPI-BLS, LPI-CatBoost, PLIPCOM, and LPI-SKF) under three cross validations on lncRNAs, proteins, lncRNA-protein pairs, respectively. It obtains the best average AUC and AUPR values on the five datasets under the three cross validations, significantly outperforming other four LPI identification methods. That is, AUCs computed by LPI-deepGBDT are 0.8321, 0.6815, and 0.9073, respectively and AUPRs are 0.8095, 0.6771, and 0.8849, respectively. The results demonstrate the powerful classification ability of LPI-deepGBDT. Case study analyses show that there may be interactions between GAS5 and Q15717, RAB30-AS1 and O00425, and LINC-01572 and P35637. Conclusions: Integrating ensemble learning and hierarchical distributed representations and building a multiple-layered deep architecture, this work improves LPI prediction performance as well as effectively probes interaction data for new lncRNAs/proteins.


Protein-Protein Interactions referred as PPIs perform significant role in biological functions like cell metabolism, immune response, signal transduction etc. Hot spots are small fractions of residues in interfaces and provide substantial binding energy in PPIs. Therefore, identification of hot spots is important to discover and analyze molecular medicines and diseases. The current strategy, alanine scanning isn't pertinent to enormous scope applications since the technique is very costly and tedious. The existing computational methods are poor in classification performance as well as accuracy in prediction. They are concerned with the topological structure and gene expression of hub proteins. The proposed system focuses on hot spots of hub proteins by eliminating redundant as well as highly correlated features using Pearson Correlation Coefficient and Support Vector Machine based feature elimination. Extreme Gradient boosting and LightGBM algorithms are used to ensemble a set of weak classifiers to form a strong classifier. The proposed system shows better accuracy than the existing computational methods. The model can also be used to predict accurate molecular inhibitors for specific PPIs


2019 ◽  
Vol 20 (3) ◽  
pp. 177-184 ◽  
Author(s):  
Nantao Zheng ◽  
Kairou Wang ◽  
Weihua Zhan ◽  
Lei Deng

Background:Targeting critical viral-host Protein-Protein Interactions (PPIs) has enormous application prospects for therapeutics. Using experimental methods to evaluate all possible virus-host PPIs is labor-intensive and time-consuming. Recent growth in computational identification of virus-host PPIs provides new opportunities for gaining biological insights, including applications in disease control. We provide an overview of recent computational approaches for studying virus-host PPI interactions.Methods:In this review, a variety of computational methods for virus-host PPIs prediction have been surveyed. These methods are categorized based on the features they utilize and different machine learning algorithms including classical and novel methods.Results:We describe the pivotal and representative features extracted from relevant sources of biological data, mainly include sequence signatures, known domain interactions, protein motifs and protein structure information. We focus on state-of-the-art machine learning algorithms that are used to build binary prediction models for the classification of virus-host protein pairs and discuss their abilities, weakness and future directions.Conclusion:The findings of this review confirm the importance of computational methods for finding the potential protein-protein interactions between virus and host. Although there has been significant progress in the prediction of virus-host PPIs in recent years, there is a lot of room for improvement in virus-host PPI prediction.


2018 ◽  
Vol 20 (6) ◽  
pp. 2185-2199 ◽  
Author(s):  
Yanju Zhang ◽  
Ruopeng Xie ◽  
Jiawei Wang ◽  
André Leier ◽  
Tatiana T Marquez-Lago ◽  
...  

AbstractAs a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.


Symmetry ◽  
2021 ◽  
Vol 13 (4) ◽  
pp. 652
Author(s):  
David Alaminos ◽  
José Ignacio Peláez ◽  
M. Belén Salas ◽  
Manuel A. Fernández-Gámez

Sovereign debt and currencies play an increasingly influential role in the development of any country, given the need to obtain financing and establish international relations. A recurring theme in the literature on financial crises has been the prediction of sovereign debt and currency crises due to their extreme importance in international economic activity. Nevertheless, the limitations of the existing models are related to accuracy and the literature calls for more investigation on the subject and lacks geographic diversity in the samples used. This article presents new models for the prediction of sovereign debt and currency crises, using various computational techniques, which increase their precision. Also, these models present experiences with a wide global sample of the main geographical world zones, such as Africa and the Middle East, Latin America, Asia, Europe, and globally. Our models demonstrate the superiority of computational techniques concerning statistics in terms of the level of precision, which are the best methods for the sovereign debt crisis: fuzzy decision trees, AdaBoost, extreme gradient boosting, and deep learning neural decision trees, and for forecasting the currency crisis: deep learning neural decision trees, extreme gradient boosting, random forests, and deep belief network. Our research has a large and potentially significant impact on the macroeconomic policy adequacy of the countries against the risks arising from financial crises and provides instruments that make it possible to improve the balance in the finance of the countries.


Electronics ◽  
2021 ◽  
Vol 10 (3) ◽  
pp. 314
Author(s):  
Adrián Alcolea ◽  
Javier Resano

A decision tree is a well-known machine learning technique. Recently their popularity has increased due to the powerful Gradient Boosting ensemble method that allows to gradually increasing accuracy at the cost of executing a large number of decision trees. In this paper we present an accelerator designed to optimize the execution of these trees while reducing the energy consumption. We have implemented it in an FPGA for embedded systems, and we have tested it with a relevant case-study: pixel classification of hyperspectral images. In our experiments with different images our accelerator can process the hyperspectral images at the same speed at which they are generated by the hyperspectral sensors. Compared to a high-performance processor running optimized software, on average our design is twice as fast and consumes 72 times less energy. Compared to an embedded processor, it is 30 times faster and consumes 23 times less energy.


2021 ◽  
Author(s):  
Yang Zhang ◽  
Yue Wu

Traditional response surface methodology (RSM) has utilized the ordinary least squared (OLS) technique to numerically estimate the coefficients for multiple influence factors to achieve the values of the responsive factor while considering the intersection and quadratic terms of the influencers if any. With the emergence and popularization of machine learning (ML), more competitive methods has been developed which can be adopted to complement or replace the tradition RSM method, i.e. the OLS with or without the polynomial terms. In this chapter, several commonly used regression models in the ML including the improved linear models (the least absolute shrinkage and selection operator model and the generalized linear model), the decision trees family (decision trees, random forests and gradient boosting trees), the model of the neural nets, (the multi-layer perceptrons) and the support vector machine will be introduced. Those ML models will provide a more flexible way to estimate the response surface function that is difficult to be represented by a polynomial as deployed in the traditional RSM. The advantage of the ML models in predicting precise response factor values is then demonstrated by implementation on an engineering case study. The case study has shown that the various choices of the ML models can reach a more satisfactory estimation for the responsive surface function in comparison to the RSM. The GDBT has exhibited to outperform the RSM with an accuracy improvement for 50% on unseen experimental data.


2021 ◽  
Vol 22 (S6) ◽  
Author(s):  
Weixia Xu ◽  
Yangyun Gao ◽  
Yang Wang ◽  
Jihong Guan

Abstract Background Protein protein interactions (PPIs) are essential to most of the biological processes. The prediction of PPIs is beneficial to the understanding of protein functions and thus is helpful to pathological analysis, disease diagnosis and drug design etc. As the amount of protein data is growing fast in the post genomic era, high-throughput experimental methods are expensive and time-consuming for the prediction of PPIs. Thus, computational methods have attracted researcher’s attention in recent years. A large number of computational methods have been proposed based on different protein sequence encoders. Results Notably, the confidence score of a protein sequence pair could be regarded as a kind of measurement to PPIs. The higher the confidence score for one protein pair is, the more likely the protein pair interacts. Thus in this paper, a deep learning framework, called ordinal regression and recurrent convolutional neural network (OR-RCNN) method, is introduced to predict PPIs from the perspective of confidence score. It mainly contains two parts: the encoder part of protein sequence pair and the prediction part of PPIs by confidence score. In the first part, two recurrent convolutional neural networks (RCNNs) with shared parameters are applied to construct two protein sequence embedding vectors, which can automatically extract robust local features and sequential information from the protein pairs. Based on it, the two embedding vectors are encoded into one novel embedding vector by element-wise multiplication. By taking the ordinal information behind confidence score into consideration, ordinal regression is used to construct multiple sub-classifiers in the second part. The results of multiple sub-classifiers are aggregated to obtain the final confidence score. Following that, the existence of PPIs is determined by the confidence score. We set a threshold $$\theta$$ θ , and say the interaction exists between the protein pair if its confidence score is bigger than $$\theta$$ θ . Conclusions We applied our method to predict PPIs on data sets S. cerevisiae and Homo sapiens. Through experimental verification, our method outperforms state-of-the-art PPI prediction models.


2011 ◽  
Vol 33 (1) ◽  
pp. 8-11
Author(s):  
Hung Xuan Ta ◽  
Liisa Holm

A great number of cellular behaviours are mediated by proteins which always carry out their functions by interacting with each other. Unravelling protein–protein interactions (PPIs) is one of the central goals in proteomics, which will decipher the molecular mechanisms underlying the biological functions and thereby help to understand human diseases on a system-wide level. A number of experimental techniques, especially high-throughput approaches, have resulted in a large amount of PPI data that still suffer from incompleteness and contradiction. Moreover, these experimental techniques are expensive, time-consuming and labour-intensive. Computational methods have emerged as complementary tools to experimental approaches to discover PPIs. Promisingly, computational methods can guide, assess and validate experimental data and finally predict novel PPIs.


Sign in / Sign up

Export Citation Format

Share Document