Solubility-Weighted Index: fast and accurate prediction of protein solubility

Abstract Motivation Recombinant protein production is a widely used technique in the biotechnology and biomedical industries, yet only a quarter of target proteins are soluble and can therefore be purified. Results We have discovered that global structural flexibility, which can be modeled by normalised B-factors, accurately predicts the solubility of 12,216 recombinant proteins expressed in Escherichia coli. We have optimised B-factors, and derived a new set of values for solubility scoring that further improves prediction accuracy. We call this new predictor the ‘Solubility-Weighted Index’ (SWI). Importantly, SWI outperforms many existing protein solubility prediction tools. Furthermore, we have developed ‘SoDoPE’ (Soluble Domain for Protein Expression), a web interface that allows users to choose a protein region of interest for predicting and maximising both protein expression and solubility. Availability The SoDoPE web server and source code are freely available at https://tisigner.com/sodope and https://github.com/Gardner-BinfLab/TISIGNER-ReactJS, respectively. The code and data for reproducing our analysis can be found at https://github.com/Gardner-BinfLab/SoDoPE_paper2020.
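As a rough illustration of scoring a sequence from per-residue values, the sketch below averages a weight for each amino acid over the whole sequence. It assumes the score is a plain arithmetic mean, which the abstract does not state. The weight table is a made-up placeholder, not the published SWI values, and `sequence_average_score` is a hypothetical name.

```python
from statistics import mean

# Placeholder weights for illustration only; NOT the published SWI values.
PLACEHOLDER_WEIGHTS = {"A": 0.8, "G": 0.7, "K": 0.9, "L": 0.6, "E": 1.0}


def sequence_average_score(sequence, weights):
    """Return the mean per-residue weight of a protein sequence.

    Assumption: the score is a simple arithmetic mean over residues.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    try:
        values = [weights[residue] for residue in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"no weight defined for residue {err}") from None
    return mean(values)


if __name__ == "__main__":
    print(sequence_average_score("AGKLE", PLACEHOLDER_WEIGHTS))
```

Scoring a region of interest, as the web interface described above allows, would amount to passing a slice of the sequence to the same function.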

SOLart: a structure-based method to predict protein solubility and aggregation

Bioinformatics ◽

10.1093/bioinformatics/btz773 ◽

2019 ◽

Cited By ~ 4

Author(s):

Qingzhen Hou ◽

Jean Marc Kwasigroch ◽

Marianne Rooman ◽

Fabrizio Pucci

Keyword(s):

Solvent Accessibility ◽

Pearson Correlation ◽

Protein Solubility ◽

Independent Set ◽

Supplementary Information ◽

Training Dataset ◽

Solubility Prediction ◽

High Concentration ◽

Structural Genomic ◽

Genomic Studies

Abstract Motivation The solubility of a protein is often decisive for its proper functioning. Lack of solubility is a major bottleneck in high-throughput structural genomic studies and in high-concentration protein production, and the formation of protein aggregates causes a wide variety of diseases. Since solubility measurements are time-consuming and expensive, there is a strong need for solubility prediction tools. Results We have recently introduced solubility-dependent distance potentials that are able to unravel the role of residue–residue interactions in promoting or decreasing protein solubility. Here, we extended their construction by defining solubility-dependent potentials based on backbone torsion angles and solvent accessibility, and integrated them, together with other structure- and sequence-based features, into a random forest model trained on a set of Escherichia coli proteins with experimental structures and solubility values. We thus obtained the SOLart protein solubility predictor, whose most informative features turned out to be folding free energy differences computed from our solubility-dependent statistical potentials. SOLart performs well, with a Pearson correlation coefficient between experimental and predicted solubility values of almost 0.7 both in cross-validation on the training dataset and in an independent set of Saccharomyces cerevisiae proteins. On test sets of modeled structures, only a limited drop in performance is observed. SOLart can thus be used with both high-resolution and low-resolution structures, and clearly outperforms state-of-the-art solubility predictors. It is available through a webserver that is easy for non-expert scientists to use. Availability and implementation The SOLart webserver is freely available at http://babylone.ulb.ac.be/SOLART/. Supplementary information Supplementary data are available at Bioinformatics online.
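For readers unfamiliar with the metric quoted above, the following minimal sketch computes a Pearson correlation coefficient between experimental and predicted values. It is a generic textbook implementation, not code from SOLart, and the function name `pearson_r` is illustrative.

```python
import math


def pearson_r(experimental, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(experimental) != len(predicted) or len(experimental) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(experimental) / len(experimental)
    mean_y = sum(predicted) / len(predicted)
    # Covariance and variances (unnormalised; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(experimental, predicted))
    var_x = sum((x - mean_x) ** 2 for x in experimental)
    var_y = sum((y - mean_y) ** 2 for y in predicted)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values only; not data from the paper.
    print(pearson_r([10, 40, 70, 90], [20, 35, 60, 95]))
```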

Structure-aware Protein Solubility Prediction From Sequence Through Graph Convolutional Network And Predicted Contact Map

10.1101/2020.06.24.169011 ◽

2020 ◽

Author(s):

Jianwen Chen ◽

Shuangjia Zheng ◽

Huiying Zhao ◽

Yuedong Yang

Keyword(s):

Structural Information ◽

Protein Solubility ◽

Supplementary Information ◽

Solubility Prediction ◽

Contact Map ◽

Convolutional Network ◽

Contact Maps ◽

Protein Prediction ◽

The One ◽

The Cost

Abstract Motivation Protein solubility is significant in producing new soluble proteins that can reduce the cost of biocatalysts or therapeutic agents. Therefore, a computational model is highly desired to accurately predict protein solubility from the amino acid sequence. Many methods have been developed, but they are mostly based on the one-dimensional embedding of amino acids, which is limited in its ability to capture spatial structural information. Results In this study, we have developed a new structure-aware method to predict protein solubility using an attentive graph convolutional network (GCN), where the protein topology attribute graph was constructed through predicted contact maps from the sequence. GraphSol was shown to substantially outperform other sequence-based methods. The model was proven to be stable by a consistent R2 of 0.48 in both the cross-validation and the independent test of the eSOL dataset. To the best of our knowledge, this is the first study to utilize the GCN for sequence-based predictions. More importantly, this architecture could be extended to other protein prediction tasks. Availability The package is available at http://[email protected] Supplementary information Supplementary data are available at Bioinformatics online.
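One way to picture how a predicted contact map becomes a graph is the sketch below, which thresholds contact probabilities into a symmetric adjacency matrix with self-loops. The 0.5 threshold, the averaging used to symmetrise the map, and the name `contact_map_to_adjacency` are all assumptions for illustration; the abstract does not describe how GraphSol builds its graph in this detail.

```python
def contact_map_to_adjacency(contact_probs, threshold=0.5):
    """Turn an n x n matrix of contact probabilities into a 0/1 adjacency matrix.

    Assumptions (not from the paper): residues i and j are connected when
    the symmetrised probability is >= threshold, and every residue gets a
    self-loop, as is common before graph convolution.
    """
    n = len(contact_probs)
    if any(len(row) != n for row in contact_probs):
        raise ValueError("contact map must be square")
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        adjacency[i][i] = 1  # self-loop
        for j in range(i + 1, n):
            # Average the two directions so the graph is undirected.
            prob = (contact_probs[i][j] + contact_probs[j][i]) / 2
            if prob >= threshold:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency


if __name__ == "__main__":
    toy_map = [
        [1.0, 0.9, 0.1],
        [0.9, 1.0, 0.7],
        [0.1, 0.5, 1.0],
    ]
    for row in contact_map_to_adjacency(toy_map):
        print(row)
```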

SoluProt: Prediction of Soluble Protein Expression in Escherichia coli

10.26434/chemrxiv.13047818 ◽

2020 ◽

Author(s):

Jiri Hon ◽

Martin Marusiak ◽

Tomas Martinek ◽

Antonin Kunka ◽

Jaroslav Zendulka ◽

...

Keyword(s):

Escherichia Coli ◽

Protein Expression ◽

Soluble Protein ◽

Experimental Studies ◽

Computational Prediction ◽

Supplementary Information ◽

Gradient Boosting ◽

Sequence Information ◽

Success Rates ◽

Solubility Prediction

Motivation: Poor protein solubility hinders the production of many therapeutic and industrially useful proteins. Experimental efforts to increase solubility are plagued by low success rates and often reduce biological activity. Computational prediction of protein expressibility and solubility in Escherichia coli using only sequence information could reduce the cost of experimental studies by enabling prioritisation of highly soluble proteins. Results: A new tool for sequence-based prediction of soluble protein expression in Escherichia coli, SoluProt, was created using the gradient boosting machine technique with the TargetTrack database as a training set. When evaluated against a balanced independent test set derived from the NESG database, SoluProt’s accuracy of 58.4% and AUC of 0.60 exceeded those of a suite of alternative solubility prediction tools. There is also evidence that it could significantly increase the success rate of experimental protein studies. SoluProt is freely available as a standalone program and a user-friendly webserver at https://loschmidt.chemi.muni.cz/soluprot/. Availability and Implementation: https://loschmidt.chemi.muni.cz/soluprot/ Contact: [email protected] Supplementary Information: Supplementary data are available at Bioinformatics online.
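The two metrics quoted above can be computed as in the minimal sketch below. It is a generic illustration on toy data, not SoluProt's evaluation code. The AUC is obtained from its rank interpretation: the fraction of (soluble, insoluble) pairs in which the soluble protein receives the higher score, with ties counted as half.

```python
def accuracy(labels, predictions):
    """Fraction of binary predictions that match the true labels."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Labels are 1 (soluble) or 0 (insoluble); scores are real-valued.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have equal length")
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy labels and scores only; not data from the paper.
    y_true = [1, 0, 1, 0]
    y_score = [0.9, 0.8, 0.4, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in y_score]
    print(accuracy(y_true, y_pred), roc_auc(y_true, y_score))
```

On a balanced test set, a classifier that guesses at random is expected to score about 50% accuracy and an AUC of about 0.5, which is the baseline against which the reported figures should be read.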

Bioinformatics approaches for improved recombinant protein production in Escherichia coli: protein solubility prediction

Briefings in Bioinformatics ◽

10.1093/bib/bbt057 ◽

2013 ◽

Vol 15 (6) ◽

pp. 953-962 ◽

Cited By ~ 27

Author(s):

C. C. H. Chang ◽

J. Song ◽

B. T. Tey ◽

R. N. Ramanan

Keyword(s):

Escherichia Coli ◽

Recombinant Protein ◽

Recombinant Protein Production ◽

Protein Production ◽

Protein Solubility ◽

Solubility Prediction

Mitigation of deleterious phenotypes in chloroplast-engineered plants accumulating high levels of foreign proteins

Biotechnology for Biofuels ◽

10.1186/s13068-021-01893-2 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Jennifer A. Schmidt ◽

Lubna V. Richter ◽

Lisa A. Condoluci ◽

Beth A. Ahner

Keyword(s):

Protein Expression ◽

Protein Production ◽

Biomass Accumulation ◽

Cellulase Production ◽

Protein Yield ◽

Total Soluble Protein ◽

Foreign Protein ◽

Active Components ◽

Dwarf Phenotype ◽

Mutant Phenotypes

Abstract Background The global demand for functional proteins is extensive, diverse, and constantly increasing. Medicine, agriculture, and industrial manufacturing all rely on high-quality proteins as major active components or process additives. Historically, these demands have been met by microbial bioreactors that are expensive to operate and maintain, prone to contamination, and relatively inflexible to changing market demands. Well-established crop cultivation techniques coupled with new advancements in genetic engineering may offer a cheaper and more versatile protein production platform. Chloroplast-engineered plants, like tobacco, have the potential to produce large quantities of high-value proteins, but often result in engineered plants with mutant phenotypes. This technology needs to be fine-tuned for commercial applications to maximize target protein yield while maintaining robust plant growth. Results Here, we show that a previously developed Nicotiana tabacum line, TetC-cel6A, can produce an industrial cellulase at levels of up to 28% of total soluble protein (TSP) with a slight dwarf phenotype but no loss in biomass. In seedlings, the dwarf phenotype is recovered by exogenous application of gibberellic acid. We also demonstrate that accumulating foreign protein represents an added burden to the plants’ metabolism that can make them more sensitive to limiting growth conditions such as low nitrogen. The biomass of nitrogen-limited TetC-cel6A plants was found to be as much as 40% lower than wildtype (WT) tobacco, although heterologous cellulase production was not greatly reduced compared to well-fertilized TetC-cel6A plants. Furthermore, cultivation at elevated carbon dioxide (1600 ppm CO2) restored biomass accumulation in TetC-cel6A plants to that of WT, while also increasing total heterologous protein yield (mg Cel6A plant−1) by 50–70%. 
Conclusions The work reported here demonstrates that well-fertilized tobacco plants have a substantial degree of flexibility in protein metabolism and can accommodate considerable levels of some recombinant proteins without exhibiting deleterious mutant phenotypes. Furthermore, we show that the alterations to protein expression triggered by growth at elevated CO2 can help rebalance endogenous protein expression and/or increase foreign protein production in chloroplast-engineered tobacco.

MetaADEDB 2.0: a comprehensive database on adverse drug events

Bioinformatics ◽

10.1093/bioinformatics/btaa973 ◽

2020 ◽

Author(s):

Zhuohang Yu ◽

Zengrui Wu ◽

Weihua Li ◽

Guixia Liu ◽

Yun Tang

Keyword(s):

Safety Assessment ◽

Adverse Drug Events ◽

Adverse Event Reporting System ◽

Adverse Event Reporting ◽

Supplementary Information ◽

Online Database ◽

Web Interface ◽

Drug Discovery And Development ◽

Comprehensive Information ◽

User Friendly

Abstract Summary MetaADEDB is an online database we developed to integrate comprehensive information on adverse drug events (ADEs). The first version of MetaADEDB was released in 2013 and has been widely used by researchers. However, it has not been updated for more than seven years. Here, we report its second version, which adds more and newer data from the U.S. FDA Adverse Event Reporting System (FAERS) and the Canada Vigilance Adverse Reaction Online Database to the original three sources. The new version consists of 744 709 drug–ADE associations between 8498 drugs and 13 193 ADEs, an increase of over 40% in drug–ADE associations compared to the previous version. We also developed a new and user-friendly web interface for data search and analysis. We hope that MetaADEDB 2.0 will provide a useful tool for drug safety assessment and related studies in drug discovery and development. Availability and implementation The database is freely available at: http://lmmd.ecust.edu.cn/metaadedb/. Supplementary information Supplementary data are available at Bioinformatics online.

Spliceogen: an integrative, scalable tool for the discovery of splice-altering variants

Bioinformatics ◽

10.1093/bioinformatics/btz263 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4405-4407 ◽

Cited By ~ 1

Author(s):

Steven Monger ◽

Michael Troup ◽

Eddie Ip ◽

Sally L Dunwoodie ◽

Eleni Giannoulatou

Keyword(s):

Supplementary Information ◽

Command Line ◽

Supplementary Data ◽

In Silico Prediction ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Prediction Tools ◽

Motif Prediction ◽

Command Line Tool ◽

Genome Scale

Abstract Motivation In silico prediction tools are essential for identifying variants which create or disrupt cis-splicing motifs. However, there are limited options for genome-scale discovery of splice-altering variants. Results We have developed Spliceogen, a highly scalable pipeline integrating predictions from some of the individually best performing models for splice motif prediction: MaxEntScan, GeneSplicer, ESRseq and Branchpointer. Availability and implementation Spliceogen is available as a command line tool which accepts VCF/BED inputs and handles both single nucleotide variants (SNVs) and indels (https://github.com/VCCRI/Spliceogen). SNV databases with prediction scores are also available, covering all possible SNVs at all genomic positions within all Gencode-annotated multi-exon transcripts. Supplementary information Supplementary data are available at Bioinformatics online.

RobNorm: model-based robust normalization method for labeled quantitative mass spectrometry proteomics data

Bioinformatics ◽

10.1093/bioinformatics/btaa904 ◽

2020 ◽

Author(s):

Meng Wang ◽

Lihua Jiang ◽

Ruiqi Jian ◽

Joanne Y Chan ◽

Qing Liu ◽

...

Keyword(s):

Mass Spectrometry ◽

Protein Expression ◽

Real Data ◽

Tissue Expression ◽

Supplementary Information ◽

Systematic Bias ◽

Proteomics Data ◽

Robust Fitting ◽

Fitting Method ◽

The One

Abstract Motivation Data normalization is an important step in processing proteomics data generated in mass spectrometry experiments, which aims to reduce sample-level variation and facilitate comparisons of samples. Previously published methods for normalization primarily depend on the assumption that the distribution of protein expression is similar across all samples. However, this assumption fails when the protein expression data is generated from heterogeneous samples, such as from various tissue types. This led us to develop a novel data-driven method for improved normalization that corrects the systematic bias while maintaining underlying biological heterogeneity. Results To robustly correct the systematic bias, we used the density-power-weight method to down-weight outliers and extended the one-dimensional robust fitting method described in the previous work to our structured data. We then constructed a robustness criterion and developed a new normalization algorithm, called RobNorm. In simulation studies and analysis of real data from the genotype-tissue expression project, we compared and evaluated the performance of RobNorm against other normalization methods. We found that the RobNorm approach exhibits the greatest reduction in systematic bias while maintaining across-tissue variation, especially for datasets from highly heterogeneous samples. Availability and implementation https://github.com/mwgrassgreen/RobNorm. Supplementary information Supplementary data are available at Bioinformatics online.
