Develop machine learning-based regression predictive models for engineering protein solubility

2019 ◽  
Vol 35 (22) ◽  
pp. 4640-4646 ◽  
Author(s):  
Xi Han ◽  
Xiaonan Wang ◽  
Kang Zhou

Abstract. Motivation: Protein activity is a significant characteristic of recombinant proteins that can be used as biocatalysts. High protein activity reduces the cost of biocatalysts. A model that can predict protein activity from amino acid sequence is highly desired, as it would aid experimental improvement of proteins. However, only limited data on protein activity are currently available, which prevents the development of such models. Since protein activity and solubility are correlated for some proteins, the publicly available solubility dataset may be adopted to develop models that predict protein solubility from sequence. Such models could serve as a tool to indirectly predict protein activity from sequence. In the literature, predicting protein solubility from sequence has been intensively explored, but the existing models all predict solubility as a binary value, which is not suitable for guiding experimental designs to improve protein solubility. Here we propose new machine learning (ML) models for improving protein solubility in vivo. Results: We first implemented a novel approach that predicts protein solubility as continuous numerical values instead of binary ones. After combining it with various ML algorithms, we achieved an R² of 0.4115 when the support vector machine algorithm was used. Continuous values of solubility are more meaningful in protein engineering, as they enable researchers to choose proteins with higher predicted solubility for experimental validation, whereas binary values fail to distinguish proteins with the same value: with only two possible values, many proteins share the same one. Availability and implementation: We present the ML workflow as a series of IPython notebooks hosted on GitHub (https://github.com/xiaomizhou616/protein_solubility). The workflow can be used as a template for analysis of other expression and solubility datasets.
Supplementary information: Supplementary data are available at Bioinformatics online.
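
The contrast between continuous and binary predictions can be illustrated with a minimal sketch. Everything below is assumed for illustration: the amino acid composition features, the measured and predicted values, and the 0.5 threshold are made up and are not taken from the paper or its notebooks. The sketch computes R² (the metric reported above) and shows that continuous predictions give a full ranking of candidates while thresholded ones collapse into two groups.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the 20-dimensional amino acid composition of a sequence.

    Assumption: composition is used here only as a simple example of
    sequence-derived features; the source does not specify the features.
    """
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up solubility values for four hypothetical proteins.
measured = [0.10, 0.40, 0.55, 0.90]
predicted = [0.20, 0.35, 0.60, 0.80]

score = r_squared(measured, predicted)

# Continuous predictions rank every candidate; indices sorted best-first.
ranking = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)

# Thresholding (hypothetical cut-off of 0.5) leaves only two groups,
# so proteins inside a group can no longer be told apart.
binary = [int(p >= 0.5) for p in predicted]
```

Here `ranking` orders all four proteins, whereas `binary` only separates them into two indistinguishable pairs.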

2019 ◽  
Author(s):  
Xi Han ◽  
Wenbo Ning ◽  
Xiaoqiang Ma ◽  
Xiaonan Wang ◽  
Kang Zhou

Abstract. Improving the catalytic ability of protein biocatalysts reduces the production cost of biocatalytic manufacturing processes, but the search space of possible proteins/mutants is too large to explore exhaustively through experiments. To some extent, highly soluble recombinant proteins tend to exhibit high activity. Here, we demonstrate that an optimization methodology based on a machine learning prediction model can effectively and quantitatively predict which peptide tags improve protein solubility. Based on protein sequence information, a support vector machine model we recently developed was used to evaluate protein solubility after randomly mutated tags were added to a target protein. The optimization algorithm guided the tags to evolve towards variants that result in higher solubility. Moreover, the optimization results were validated by adding the tags designed by our optimization algorithm to a model protein, expressing it in vivo and experimentally quantifying its solubility and activity. For example, the solubility of a tyrosine ammonia lyase was more than doubled by adding two tags, one at its N-terminus and one at its C-terminus. Its activity was also increased nearly 3.5-fold by adding the tags. Additional experiments also supported that the designed tags were effective for improving the activity of multiple proteins and performed better than previously reported tags. The presented optimization methodology thus provides a valuable tool for understanding the correlation between amino acid sequence and protein solubility and for protein engineering.
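
The mutate-and-evaluate loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' method: `toy_solubility` is a hypothetical stand-in for the trained support vector machine model, the search is a simple hill climb (the abstract does not name the optimization algorithm), only an N-terminal tag is designed, and the target sequence is invented.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_solubility(seq):
    """Hypothetical scoring function standing in for a trained predictor.

    It returns the fraction of charged residues (D, E, K, R). This is an
    illustrative assumption, NOT the SVM model from the paper.
    """
    return sum(seq.count(a) for a in "DEKR") / len(seq)


def mutate(tag, rng):
    """Replace one randomly chosen position of the tag with a random residue."""
    pos = rng.randrange(len(tag))
    return tag[:pos] + rng.choice(AMINO_ACIDS) + tag[pos + 1:]


def optimize_tag(target, tag_length=6, steps=200, seed=0):
    """Hill-climb an N-terminal tag towards a higher predicted score."""
    rng = random.Random(seed)
    tag = "".join(rng.choice(AMINO_ACIDS) for _ in range(tag_length))
    best = toy_solubility(tag + target)
    history = [best]
    for _ in range(steps):
        candidate = mutate(tag, rng)
        score = toy_solubility(candidate + target)
        if score >= best:  # keep mutations that do not lower the score
            tag, best = candidate, score
        history.append(best)
    return tag, best, history


# Invented target sequence, used only to exercise the loop.
target = "MSTLVAGPLLAGIWFNACQ"
tag, best, history = optimize_tag(target)
```

Swapping `toy_solubility` for a real sequence-based predictor would leave the loop unchanged, which is the main design point: the optimizer only needs a function that maps a sequence to a score.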


Sensors ◽  
2021 ◽  
Vol 21 (11) ◽  
pp. 3827
Author(s):  
Gemma Urbanos ◽  
Alberto Martín ◽  
Guillermo Vázquez ◽  
Marta Villanueva ◽  
Manuel Villa ◽  
...  

Hyperspectral imaging (HSI) techniques do not require contact with patients and are non-ionizing as well as non-invasive. As a consequence, they have been extensively applied in the medical field. HSI is being combined with machine learning (ML) processes to obtain models that assist in diagnosis. In particular, the combination of these techniques has proven to be a reliable aid in the differentiation of healthy and tumor tissue during brain tumor surgery. ML algorithms such as support vector machine (SVM), random forest (RF) and convolutional neural networks (CNN) are used to make predictions and provide in-vivo visualizations that may help neurosurgeons be more precise, hence reducing damage to healthy tissue. In this work, thirteen in-vivo hyperspectral images from twelve different patients with high-grade gliomas (grade III and IV) have been selected to train SVM, RF and CNN classifiers. Five different classes have been defined during the experiments: healthy tissue, tumor, venous blood vessel, arterial blood vessel and dura mater. Overall accuracy (OACC) results vary from 60% to 95% depending on the training conditions. Finally, as far as the contribution of each band to the OACC is concerned, the results obtained in this work are 3.81 times greater than those reported in the literature.


2019 ◽  
Vol 36 (1) ◽  
pp. 272-279 ◽  
Author(s):  
Hannah F Löchel ◽  
Dominic Eger ◽  
Theodor Sperlea ◽  
Dominik Heider

Abstract. Motivation: Classification of protein sequences is a major task in bioinformatics and has many applications. Different machine learning methods exist and are applied to these problems, such as support vector machines (SVM), random forests (RF) and neural networks (NN). All of these methods have in common that protein sequences have to be made machine-readable and comparable in the first step, for which different encodings exist. These encodings are typically based on physical or chemical properties of the sequence. However, due to the outstanding performance of deep neural networks (DNN) on image recognition, we used frequency matrix chaos game representation (FCGR) for encoding of protein sequences into images. In this study, we compare the performance of SVMs, RFs and DNNs trained on FCGR-encoded protein sequences. While the original chaos game representation (CGR) has been used mainly for genome sequence encoding and classification, we modified it to work also for protein sequences, resulting in the n-flakes representation, an image with several icosagons. Results: We could show that all applied machine learning techniques (RF, SVM and DNN) show promising results compared to the state-of-the-art methods on our benchmark datasets, with DNNs outperforming the other methods, and that FCGR is a promising new encoding method for protein sequences. Availability and implementation: https://cran.r-project.org/. Supplementary information: Supplementary data are available at Bioinformatics online.
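
A frequency matrix chaos game representation on a 20-sided polygon can be sketched as below. This is a hedged illustration, not the authors' R implementation: the vertex order, the grid resolution and the contraction ratio (the standard n-flake ratio, assumed here to be a reasonable choice) are all assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def nflake_ratio(n):
    """Standard n-flake contraction ratio for a regular n-gon.

    Assumption: whether the paper uses exactly this ratio is not stated
    in the abstract.
    """
    total = sum(math.cos(2 * math.pi * k / n) for k in range(1, n // 4 + 1))
    return 1.0 / (2.0 * (1.0 + total))


def fcgr(seq, resolution=16):
    """Count chaos-game points of a protein sequence on a square grid."""
    n = len(AMINO_ACIDS)
    # One icosagon vertex per amino acid, placed on the unit circle.
    vertices = {
        a: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, a in enumerate(AMINO_ACIDS)
    }
    r = nflake_ratio(n)
    x, y = 0.0, 0.0  # start at the centre of the polygon
    grid = [[0] * resolution for _ in range(resolution)]
    for residue in seq:
        vx, vy = vertices[residue]  # raises KeyError for non-standard residues
        # Move towards the residue's vertex, keeping a fraction r of the offset.
        x = vx + r * (x - vx)
        y = vy + r * (y - vy)
        col = min(int((x + 1) / 2 * resolution), resolution - 1)
        row = min(int((y + 1) / 2 * resolution), resolution - 1)
        grid[row][col] += 1
    return grid


# Invented example sequence.
matrix = fcgr("MKTAYIAKQRQISFVKSHFSRQ")
```

The resulting `matrix` is a fixed-size image-like array regardless of sequence length, which is what makes such an encoding usable as input to image-oriented classifiers.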


Author(s):  
Sunday Olakunle Idowu ◽  
Amos Akintayo Fatokun

Oxidative stress induced by excessive levels of reactive oxygen species (ROS) underlies several diseases. Therapeutic strategies to combat oxidative damage are, therefore, a subject of intense scientific investigation to prevent and treat such diseases, with the use of phytochemical antioxidants, especially polyphenols, being a major part. Polyphenols, however, exhibit structural diversity that determines different mechanisms of antioxidant action, such as hydrogen atom transfer (HAT) and single-electron transfer (SET). They also suffer from inadequate in vivo bioavailability, with their antioxidant bioactivity governed by permeability, gut-wall and first-pass metabolism, and HAT-based ROS trapping. Unfortunately, no current antioxidant assay captures these multiple dimensions to be sufficiently “biorelevant,” because the assays tend to be unidimensional, whereas biorelevance requires integration of several inputs. Finding a method to reliably evaluate the antioxidant capacity of these phytochemicals, therefore, remains an unmet need. To address this deficiency, we propose using artificial intelligence (AI)-based machine learning (ML) to relate a polyphenol’s antioxidant action as the output variable to molecular descriptors (factors governing in vivo antioxidant activity) as input variables, in the context of a biomarker selectively produced by lipid peroxidation (a consequence of oxidative stress), for example F2-isoprostanes. Support vector machines, artificial neural networks, and Bayesian probabilistic learning are some key algorithms that could be deployed. Such a model will represent a robust predictive tool in assessing biorelevant antioxidant capacity of polyphenols, and thus facilitate the identification or design of antioxidant molecules. The approach will also help to fulfill the principles of the 3Rs (replacement, reduction, and refinement) in using animals in biomedical research.


Cancers ◽  
2020 ◽  
Vol 12 (11) ◽  
pp. 3406
Author(s):  
Elisabeth Bumes ◽  
Fro-Philip Wirtz ◽  
Claudia Fellner ◽  
Jirka Grosse ◽  
Dirk Hellwig ◽  
...  

Isocitrate dehydrogenase (IDH)-1 mutation is an important prognostic factor and a potential therapeutic target in glioma. Immunohistological and molecular diagnosis of IDH mutation status is invasive. To avoid tumor biopsy, dedicated spectroscopic techniques have been proposed to detect D-2-hydroxyglutarate (2-HG), the main metabolite of IDH, directly in vivo. However, these methods are technically challenging and not broadly available. Therefore, we explored the use of machine learning for the non-invasive, inexpensive and fast diagnosis of IDH status in standard 1H-magnetic resonance spectroscopy (1H-MRS). To this end, 30 of 34 consecutive patients with known or suspected glioma WHO grade II-IV were subjected to metabolic positron emission tomography (PET) imaging with O-(2-18F-fluoroethyl)-L-tyrosine (18F-FET) for optimized voxel placement in 1H-MRS. Routine 1H-magnetic resonance (1H-MR) spectra of tumor and contralateral healthy brain regions were acquired on a 3 Tesla magnetic resonance (3T-MR) scanner, prior to surgical tumor resection and molecular analysis of IDH status. Since 2-HG spectral signals were too overlapped for reliable discrimination of IDH mutated (IDHmut) and IDH wild-type (IDHwt) glioma, we used a nested cross-validation approach, whereby we trained a linear support vector machine (SVM) on the complete spectral information of the 1H-MRS data to predict IDH status. Using this approach, we predicted IDH status with an accuracy of 88.2%, a sensitivity of 95.5% (95% CI, 77.2–99.9%) and a specificity of 75.0% (95% CI, 42.9–94.5%), respectively. The area under the curve (AUC) amounted to 0.83. Subsequent ex vivo 1H-nuclear magnetic resonance (1H-NMR) measurements performed on metabolite extracts of resected tumor material (eight specimens) revealed myo-inositol (M-ins) and glycine (Gly) to be the major discriminators of IDH status. 
We conclude that our approach allows a reliable, non-invasive, fast and cost-effective prediction of IDH status in a standard clinical setting.
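
The relationship between the reported accuracy, sensitivity and specificity can be made concrete with a short worked example. The confusion-matrix counts below are hypothetical: the abstract does not state them, and they were chosen only because they reproduce the reported percentages.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity


# Hypothetical counts (positive class = IDH mutated), not taken from the paper.
accuracy, sensitivity, specificity = classification_metrics(tp=21, fn=1, tn=9, fp=3)

print(f"accuracy    {accuracy:.1%}")
print(f"sensitivity {sensitivity:.1%}")
print(f"specificity {specificity:.1%}")
```

With these counts, accuracy is 30/34, sensitivity is 21/22 and specificity is 9/12, which shows how a high accuracy can coexist with a noticeably lower specificity when the negative class is small.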


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Muhammad Farrukh Khan ◽  
Taher M. Ghazal ◽  
Raed A. Said ◽  
Areej Fatima ◽  
Sagheer Abbas ◽  
...  

The Internet of Medical Things (IoMT) enables digital devices to gather, infer, and broadcast health data via the cloud platform. The phenomenal growth of the IoMT is fueled by many factors, including the widespread and growing availability of wearables and the ever-decreasing cost of sensor-based technology. The cost of related healthcare will rise as the global population of elderly people grows in parallel with overall life expectancy, which demands affordable healthcare services, solutions, and developments. Combined with machine learning (ML) algorithms, the IoMT may revolutionize the quality of healthcare for elderly people. The effectiveness of the smart healthcare (SHC) model for monitoring elderly people was observed by performing tests on IoMT datasets. For evaluation, the precision, recall, F-score, accuracy, and ROC values are computed. The authors also compare the results of the SHC model with those of conventional popular ML techniques, e.g., support vector machine (SVM), K-nearest neighbor (KNN), and decision tree (DT), to analyze the effectiveness of the result.


2019 ◽  
Vol 35 (20) ◽  
pp. 4072-4080 ◽  
Author(s):  
Timo M Deist ◽  
Andrew Patti ◽  
Zhaoqi Wang ◽  
David Krane ◽  
Taylor Sorenson ◽  
...  

Abstract. Motivation: In a predictive modeling setting, if sufficient details of the system behavior are known, one can build and use a simulation for making predictions. When sufficient system details are not known, one typically turns to machine learning, which builds a black-box model of the system using a large dataset of input sample features and outputs. We consider a setting which is between these two extremes: some details of the system mechanics are known but not enough for creating simulations that can be used to make high quality predictions. In this context we propose using approximate simulations to build a kernel for use in kernelized machine learning methods, such as support vector machines. The results of multiple simulations (under various uncertainty scenarios) are used to compute similarity measures between every pair of samples: sample pairs are given a high similarity score if they behave similarly under a wide range of simulation parameters. These similarity values, rather than the original high dimensional feature data, are used to build the kernel. Results: We demonstrate and explore the simulation-based kernel (SimKern) concept using four synthetic complex systems: three biologically inspired models and one network flow optimization model. We show that, when the number of training samples is small compared to the number of features, the SimKern approach dominates over no-prior-knowledge methods. This approach should be applicable in all disciplines where predictive models are sought and informative yet approximate simulations are available. Availability and implementation: The Python SimKern software, the demonstration models (in MATLAB, R), and the datasets are available at https://github.com/davidcraft/SimKern. Supplementary information: Supplementary data are available at Bioinformatics online.
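
The idea of scoring sample pairs by how similarly they behave across simulations can be sketched as follows. This is an illustrative sketch, not the SimKern software: `approximate_simulation` is a hypothetical toy simulator, the parameter distribution is invented, and similarity is assumed to be the fraction of simulation runs in which two samples produce the same outcome.

```python
import random


def approximate_simulation(sample, params):
    """Hypothetical simulator: binary outcome of a weighted sum of features."""
    total = sum(w * x for w, x in zip(params, sample))
    return 1 if total > 0 else 0


def simulation_kernel(samples, n_simulations=50, seed=0):
    """Build a similarity matrix from outcomes under random simulation parameters."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    outcomes = []
    for _ in range(n_simulations):
        # Each run draws uncertain parameters (assumed Gaussian here).
        params = [rng.gauss(1.0, 0.5) for _ in range(n_features)]
        outcomes.append([approximate_simulation(s, params) for s in samples])
    n = len(samples)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Similarity = fraction of runs in which samples i and j agree.
            agree = sum(1 for run in outcomes if run[i] == run[j])
            kernel[i][j] = agree / n_simulations
    return kernel


# Invented samples with three features each.
samples = [[1.0, -0.5, 0.2], [0.9, -0.4, 0.1], [-1.0, 0.3, -0.2], [0.1, -0.1, 0.0]]
K = simulation_kernel(samples)
```

A matrix built this way is symmetric with ones on the diagonal, and it could be handed to any method that accepts a precomputed kernel, for example a support vector machine.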


2015 ◽  
Vol 2015 ◽  
pp. 1-5 ◽  
Author(s):  
Masayuki Yarimizu ◽  
Cao Wei ◽  
Yusuke Komiyama ◽  
Kokoro Ueki ◽  
Shugo Nakamura ◽  
...  

Receptor tyrosine kinases are essential proteins involved in cellular differentiation and proliferation in vivo and are heavily involved in allergic diseases, diabetes, and the onset and proliferation of cancerous cells. Identifying the interacting partner of such a protein, a growth factor ligand, will provide a deeper understanding of cellular proliferation/differentiation and other cell processes. In this study, we developed a method for predicting tyrosine kinase ligand-receptor pairs from their amino acid sequences. We collected tyrosine kinase ligand-receptor pairs from the Database of Interacting Proteins (DIP) and UniProtKB, filtered them by removing sequence redundancy, and used them as a dataset for machine learning and assessment of predictive performance. Our prediction method is based on support vector machines (SVMs); we evaluated several input features suitable for tyrosine kinases and compared and analyzed the results. Using sequence pattern information and domain information extracted from sequences as input features, we obtained an area under the receiver operating characteristic curve of 0.996. This accuracy is higher than that obtained from general protein-protein interaction pair predictions.
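
The area under the receiver operating characteristic curve used above can be computed directly from classifier scores. The sketch below uses the pairwise-ranking formulation (the probability that a randomly chosen positive pair scores higher than a randomly chosen negative one); the labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = interacting pair) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
auc = roc_auc(labels, scores)
```

In this toy example eight of the nine positive/negative pairs are ordered correctly, so the AUC is 8/9; a value of 1.0 would mean every positive scores above every negative.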


2020 ◽  
Author(s):  
Jianwen Chen ◽  
Shuangjia Zheng ◽  
Huiying Zhao ◽  
Yuedong Yang

Abstract. Motivation: Protein solubility is significant in producing new soluble proteins that can reduce the cost of biocatalysts or therapeutic agents. Therefore, a computational model is highly desired to accurately predict protein solubility from the amino acid sequence. Many methods have been developed, but they are mostly based on one-dimensional embeddings of amino acids, which are limited in capturing spatial structural information. Results: In this study, we have developed a new structure-aware method, GraphSol, to predict protein solubility with an attentive graph convolutional network (GCN), where the protein topology attribute graph was constructed from contact maps predicted from the sequence. GraphSol was shown to substantially outperform other sequence-based methods. The model was shown to be stable, with a consistent R² of 0.48 in both the cross-validation and the independent test on the eSOL dataset. To the best of our knowledge, this is the first study to utilize the GCN for sequence-based predictions. More importantly, this architecture could be extended to other protein prediction tasks. Availability: The package is available at http://[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
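
How a predicted contact map can turn a sequence into a graph for convolution is sketched below. This is a generic illustration, not the GraphSol architecture: the contact threshold is assumed, and only the neighbourhood-averaging step of a plain graph convolution with symmetric normalization is shown, with the learned weights, nonlinearity and attention omitted.

```python
import math


def adjacency_from_contacts(contact_probs, threshold=0.5):
    """Connect residue pairs whose predicted contact probability passes a threshold.

    Self-loops are always added so each residue keeps its own features.
    The threshold of 0.5 is an illustrative assumption.
    """
    n = len(contact_probs)
    return [
        [1.0 if i == j or contact_probs[i][j] >= threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]


def gcn_propagate(adj, features):
    """One propagation step: D^(-1/2) A D^(-1/2) H, without learned weights."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    out = []
    for i in range(n):
        row = [0.0] * len(features[0])
        for j in range(n):
            if adj[i][j]:
                w = adj[i][j] / math.sqrt(deg[i] * deg[j])
                for k, value in enumerate(features[j]):
                    row[k] += w * value
        out.append(row)
    return out


# Invented 3-residue contact map and one feature per residue.
contacts = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
features = [[1.0], [3.0], [5.0]]
adj = adjacency_from_contacts(contacts)
hidden = gcn_propagate(adj, features)
```

In this toy case residues 0 and 1 are in contact, so their features are averaged, while the isolated residue 2 keeps its own value; this mixing across spatial neighbours is what a one-dimensional sequence embedding cannot express.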

