Structure-aware Protein Solubility Prediction From Sequence Through Graph Convolutional Network And Predicted Contact Map

2020 ◽  
Author(s):  
Jianwen Chen ◽  
Shuangjia Zheng ◽  
Huiying Zhao ◽  
Yuedong Yang

Abstract Motivation Protein solubility is significant in producing new soluble proteins that can reduce the cost of biocatalysts or therapeutic agents. Therefore, a computational model is highly desired to accurately predict protein solubility from the amino acid sequence. Many methods have been developed, but they are mostly based on one-dimensional embeddings of amino acids, which are limited in capturing spatial structural information. Results In this study, we have developed a new structure-aware method, GraphSol, to predict protein solubility by attentive graph convolutional network (GCN), where the protein topology attribute graph was constructed through predicted contact maps from the sequence. GraphSol was shown to substantially outperform other sequence-based methods. The model was proven to be stable by a consistent R² of 0.48 in both the cross-validation and the independent test on the eSOL dataset. To the best of our knowledge, this is the first study to utilize the GCN for sequence-based predictions. More importantly, this architecture could be extended to other protein prediction tasks. Availability The package is available at http://[email protected] Supplementary information Supplementary data are available at Bioinformatics online.

2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Jianwen Chen ◽  
Shuangjia Zheng ◽  
Huiying Zhao ◽  
Yuedong Yang

Abstract Protein solubility is significant in producing new soluble proteins that can reduce the cost of biocatalysts or therapeutic agents. Therefore, a computational model is highly desired to accurately predict protein solubility from the amino acid sequence. Many methods have been developed, but they are mostly based on one-dimensional embeddings of amino acids, which are limited in capturing spatial structural information. In this study, we have developed a new structure-aware method, GraphSol, to predict protein solubility by attentive graph convolutional network (GCN), where the protein topology attribute graph was constructed through predicted contact maps only from the sequence. GraphSol was shown to substantially outperform other sequence-based methods. The model was proven to be stable by a consistent R² of 0.48 in both the cross-validation and the independent test on the eSOL dataset. To the best of our knowledge, this is the first study to utilize the GCN for sequence-based protein solubility predictions. More importantly, this architecture could be easily extended to other protein prediction tasks requiring a raw protein sequence.
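
The idea of running a graph convolution over a contact-map graph can be sketched as follows. This is a minimal illustration, not the GraphSol implementation: it assumes a symmetric 0/1 contact map with a zero diagonal, uses the commonly cited symmetric-normalized propagation rule ReLU(D^-1/2 (A + I) D^-1/2 X W), and omits the attention mechanism and solubility regression head the abstract mentions. The function names and the toy inputs are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(contact_map):
    """Return D^-1/2 (A + I) D^-1/2 for a symmetric contact map with zero diagonal."""
    n = len(contact_map)
    # Add self-loops so each residue keeps its own features.
    a_hat = [[contact_map[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(contact_map, features, weights):
    """One graph-convolution step: ReLU(A_norm @ X @ W)."""
    a_norm = normalize_adjacency(contact_map)
    hidden = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]


# Toy example (assumed data): three residues in a chain, one feature per residue.
toy_contacts = [[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]]
toy_features = [[1.0], [0.0], [0.0]]
toy_weights = [[1.0]]  # in a real model these would be learned
toy_output = gcn_layer(toy_contacts, toy_features, toy_weights)
```

In this toy case the feature on residue 0 spreads to its contact partner (residue 1) but not to residue 2, which is how predicted contacts let sequence-distant but spatially close residues exchange information.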


Author(s):  
Qingzhen Hou ◽  
Jean Marc Kwasigroch ◽  
Marianne Rooman ◽  
Fabrizio Pucci

Abstract Motivation The solubility of a protein is often decisive for its proper functioning. Lack of solubility is a major bottleneck in high-throughput structural genomic studies and in high-concentration protein production, and the formation of protein aggregates causes a wide variety of diseases. Since solubility measurements are time-consuming and expensive, there is a strong need for solubility prediction tools. Results We have recently introduced solubility-dependent distance potentials that are able to unravel the role of residue–residue interactions in promoting or decreasing protein solubility. Here, we extended their construction by defining solubility-dependent potentials based on backbone torsion angles and solvent accessibility, and integrated them, together with other structure- and sequence-based features, into a random forest model trained on a set of Escherichia coli proteins with experimental structures and solubility values. We thus obtained the SOLart protein solubility predictor, whose most informative features turned out to be folding free energy differences computed from our solubility-dependent statistical potentials. SOLart performs very well, with a Pearson correlation coefficient between experimental and predicted solubility values of almost 0.7 both in cross-validation on the training dataset and on an independent set of Saccharomyces cerevisiae proteins. On test sets of modeled structures, only a limited drop in performance is observed. SOLart can thus be used with both high-resolution and low-resolution structures, and clearly outperforms state-of-the-art solubility predictors. It is available through a user-friendly webserver that non-expert scientists can use easily. Availability and implementation The SOLart webserver is freely available at http://babylone.ulb.ac.be/SOLART/. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (12) ◽  
pp. 3645-3651
Author(s):  
Lyam Baudry ◽  
Gaël A Millot ◽  
Agnes Thierry ◽  
Romain Koszul ◽  
Vittore F Scolari

Abstract Motivation Hi-C contact maps reflect the relative contact frequencies between pairs of genomic loci, quantified through deep sequencing. Differential analyses of these maps enable downstream biological interpretations. However, the multi-fractal nature of the chromatin polymer inside the cellular envelope results in contact frequency values spanning several orders of magnitude: contacts between loci pairs separated by large genomic distances are much sparser than closer pairs. The same is true for poorly covered regions, such as repeated sequences. Both distant and poorly covered regions translate into low signal-to-noise ratios. There is no clear consensus to address this limitation. Results We present Serpentine, a fast, flexible procedure operating on raw data, which considers the contacts in each region of a contact map. Binning is performed only when necessary on noisy regions, preserving informative ones. This results in high-quality, low-noise contact maps that can be conveniently visualized for rigorous comparative analyses. Availability and implementation Serpentine is available on the PyPI repository and https://github.com/koszullab/serpentine; documentation and tutorials are provided at https://serpentine.readthedocs.io/en/latest/. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Filomeno Sánchez Rodríguez ◽  
Shahram Mesdaghi ◽  
Adam J Simpkin ◽  
J Javier Burgos-Mármol ◽  
David L Murphy ◽  
...  

Abstract Summary Covariance-based predictions of residue contacts and inter-residue distances are an increasingly popular data type in protein bioinformatics. Here we present ConPlot, a web-based application for convenient display and analysis of contact maps and distograms. Integration of predicted contact data with other predictions is often required to facilitate inference of structural features. ConPlot can therefore use the empty space near the contact map diagonal to display multiple coloured tracks representing other sequence-based predictions. Popular file formats are natively read and bespoke data can also be flexibly displayed. This novel visualization will enable easier interpretation of predicted contact maps. Availability and implementation ConPlot is available online at www.conplot.org, along with documentation and examples. Alternatively, ConPlot can be installed and used locally using the docker image from the project’s Docker Hub repository. ConPlot is licensed under the BSD 3-Clause. Supplementary information Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Oana Ursu ◽  
Nathan Boley ◽  
Maryna Taranova ◽  
Y.X. Rachel Wang ◽  
Galip Gurkan Yardimci ◽  
...  

Abstract Motivation The three-dimensional organization of chromatin plays a critical role in gene regulation and disease. High-throughput chromosome conformation capture experiments such as Hi-C are used to obtain genome-wide maps of 3D chromatin contacts. However, robust estimation of data quality and systematic comparison of these contact maps is challenging due to the multi-scale, hierarchical structure of chromatin contacts and the resulting properties of experimental noise in the data. Measuring concordance of contact maps is important for assessing reproducibility of replicate experiments and for modeling variation between different cellular contexts. Results We introduce a concordance measure called GenomeDISCO (DIfferences between Smoothed COntact maps) for assessing the similarity of a pair of contact maps obtained from chromosome conformation capture experiments. The key idea is to smooth contact maps using random walks on the contact map graph, before estimating concordance. We use simulated datasets to benchmark GenomeDISCO’s sensitivity to different types of noise that affect chromatin contact maps. When applied to a large collection of Hi-C datasets, GenomeDISCO accurately distinguishes biological replicates from samples obtained from different cell types. GenomeDISCO also generalizes to other chromosome conformation capture assays, such as HiChIP. Availability Software implementing GenomeDISCO is available at https://github.com/kundajelab/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.
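
The "smooth with random walks, then compare" idea can be sketched roughly as below. This is an illustrative simplification, not the GenomeDISCO code: it assumes each contact map is a square matrix of non-negative counts, treats the row-normalized map as a random-walk transition matrix, takes a fixed number of steps (at least one), and summarizes disagreement as a mean absolute difference. The scoring formula, the default step count, and all function names are assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_normalize(matrix):
    """Turn counts into random-walk transition probabilities (rows sum to 1)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def smooth(contact_map, steps):
    """Smooth a contact map by taking `steps` (>= 1) random-walk steps on its graph."""
    transition = row_normalize(contact_map)
    result = transition
    for _ in range(steps - 1):
        result = matmul(result, transition)
    return result


def mean_abs_difference(map_a, map_b, steps=3):
    """Average entry-wise difference between two smoothed maps (0 means identical)."""
    smoothed_a, smoothed_b = smooth(map_a, steps), smooth(map_b, steps)
    n = len(smoothed_a)
    total = sum(abs(x - y)
                for row_a, row_b in zip(smoothed_a, smoothed_b)
                for x, y in zip(row_a, row_b))
    return total / (n * n)


# Toy 2x2 maps (assumed data): one with off-diagonal contacts only, one uniform.
off_diagonal = [[0, 2], [2, 0]]
uniform = [[1, 1], [1, 1]]
difference = mean_abs_difference(off_diagonal, uniform)
```

A lower difference indicates more concordant maps; smoothing spreads sparse counts over neighbouring loci so that sampling noise in individual entries matters less.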


2020 ◽  
Vol 36 (18) ◽  
pp. 4691-4698 ◽  
Author(s):  
Bikash K Bhandari ◽  
Paul P Gardner ◽  
Chun Shen Lim

Abstract Motivation Recombinant protein production is a widely used technique in the biotechnology and biomedical industries, yet only a quarter of target proteins are soluble and can therefore be purified. Results We have discovered that global structural flexibility, which can be modeled by normalized B-factors, accurately predicts the solubility of 12 216 recombinant proteins expressed in Escherichia coli. We have optimized these B-factors, and derived a new set of values for solubility scoring that further improves prediction accuracy. We call this new predictor the ‘Solubility-Weighted Index’ (SWI). Importantly, SWI outperforms many existing protein solubility prediction tools. Furthermore, we have developed ‘SoDoPE’ (Soluble Domain for Protein Expression), a web interface that allows users to choose a protein region of interest for predicting and maximizing both protein expression and solubility. Availability and implementation The SoDoPE web server and source code are freely available at https://tisigner.com/sodope and https://github.com/Gardner-BinfLab/TISIGNER-ReactJS, respectively. The code and data for reproducing our analysis can be found at https://github.com/Gardner-BinfLab/SoDoPE_paper_2020. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (22) ◽  
pp. 4640-4646 ◽  
Author(s):  
Xi Han ◽  
Xiaonan Wang ◽  
Kang Zhou

Abstract Motivation Protein activity is a significant characteristic for recombinant proteins which can be used as biocatalysts. High activity of proteins reduces the cost of biocatalysts. A model that can predict protein activity from amino acid sequence is highly desired, as it aids experimental improvement of proteins. However, only limited data for protein activity are currently available, which prevents the development of such models. Since protein activity and solubility are correlated for some proteins, the publicly available solubility dataset may be adopted to develop models that can predict protein solubility from sequence. The models could serve as a tool to indirectly predict protein activity from sequence. In the literature, predicting protein solubility from sequence has been intensively explored, but the existing models predict solubility as binary values, which are not suitable for guiding experimental designs to improve protein solubility. Here we propose new machine learning (ML) models for improving protein solubility in vivo. Results We first implemented a novel approach that predicted protein solubility in continuous numerical values instead of binary ones. After combining it with various ML algorithms, we achieved an R² of 0.4115 when the support vector machine algorithm was used. Continuous values of solubility are more meaningful in protein engineering, as they enable researchers to choose proteins with higher predicted solubility for experimental validation, while binary values fail to distinguish proteins with the same value: with only two possible values, many proteins share one. Availability and implementation We present the ML workflow as a series of IPython notebooks hosted on GitHub (https://github.com/xiaomizhou616/protein_solubility). The workflow can be used as a template for analysis of other expression and solubility datasets.
Supplementary information Supplementary data are available at Bioinformatics online.
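
Several abstracts on this page report R² for continuous solubility predictions. As a reminder of what that number measures, here is a small sketch of the coefficient of determination, R² = 1 − SS_res / SS_tot. It assumes this standard definition; the abstracts do not state which definition was used (some studies instead report the squared Pearson correlation), and the function name is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Perfect predictions give 1.0; always predicting the mean gives 0.0.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Under this definition, an R² near 0.41 or 0.48 means the model explains a bit under half of the variance in measured solubility relative to a constant predictor.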


Cancers ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 2111
Author(s):  
Bo-Wei Zhao ◽  
Zhu-Hong You ◽  
Lun Hu ◽  
Zhen-Hao Guo ◽  
Lei Wang ◽  
...  

Identification of drug-target interactions (DTIs) is a significant step in the drug discovery or repositioning process. Compared with time-consuming and labor-intensive in vivo experimental methods, computational models can provide high-quality DTI candidates quickly. In this study, we propose a novel method called LGDTI to predict DTIs based on large-scale graph representation learning. LGDTI can capture the local and global structural information of the graph. Specifically, the first-order neighbor information of nodes can be aggregated by the graph convolutional network (GCN); on the other hand, the high-order neighbor information of nodes can be learned by the graph embedding method called DeepWalk. Finally, the two kinds of features are fed into the random forest classifier to train and predict potential DTIs. The results show that our method obtained an area under the receiver operating characteristic curve (AUROC) of 0.9455 and an area under the precision-recall curve (AUPR) of 0.9491 under 5-fold cross-validation. Moreover, we compare the presented method with some existing state-of-the-art methods. These results imply that LGDTI can efficiently and robustly capture undiscovered DTIs. The proposed model is also expected to bring new inspiration and provide novel perspectives to relevant researchers.


Energies ◽  
2021 ◽  
Vol 14 (12) ◽  
pp. 3611
Author(s):  
Sandra Gonzalez-Piedra ◽  
Héctor Hernández-García ◽  
Juan M. Perez-Morales ◽  
Laura Acosta-Domínguez ◽  
Juan-Rodrigo Bastidas-Oyanedel ◽  
...  

In this paper, a study on the feasibility of the treatment of raw cheese whey by anaerobic co-digestion using coffee pulp residues as a co-substrate is presented. It considers raw whey generated by artisanal cheese makers, which is generally not treated, thus causing environmental pollution problems. An experimental design was carried out evaluating the effect of pH and the substrate ratio on methane production at 35 °C (i.e., mesophilic conditions). The interaction of the parameters on the co-substrate degradation and the methane production was analyzed using a response surface analysis. Furthermore, two kinetic models were proposed (first-order and modified Gompertz models) to determine the dynamic profiles of methane yield. The results show that co-digestion of the raw whey is favored at pH = 6, reaching a maximum yield of 71.54 mL CH4 gVSrem⁻¹ (31.5% VS removed) for a raw cheese whey to coffee pulp ratio of 1 gVSwhey gVScoffee⁻¹. The proposed kinetic models successfully fit the experimental methane production data, with the Gompertz model showing the best fit. The results thus show that anaerobic co-digestion can be used to reduce the environmental impact of raw whey. Likewise, the methane obtained can be integrated into the cheese production process, which could contribute to reducing energy costs.
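
The two kinetic models named in the abstract can be sketched as follows. The abstract does not give the equations, so this assumes the forms commonly used for cumulative methane yield: first-order, B(t) = B_max (1 − exp(−k t)), and modified Gompertz, B(t) = B_max exp(−exp(R_max e / B_max (λ − t) + 1)). The parameter names and the numeric values of k, R_max, and λ below are hypothetical; only the maximum yield of 71.54 is taken from the text.

```python
import math


def first_order(t, b_max, k):
    """Cumulative yield under first-order kinetics with rate constant k."""
    return b_max * (1.0 - math.exp(-k * t))


def modified_gompertz(t, b_max, r_max, lag):
    """Cumulative yield with maximum rate r_max and lag phase `lag` (same time units as t)."""
    return b_max * math.exp(-math.exp(r_max * math.e / b_max * (lag - t) + 1.0))


# Illustrative curve: b_max from the abstract, other parameters assumed.
B_MAX = 71.54
days = range(0, 31, 5)
first_order_curve = [first_order(day, B_MAX, 0.1) for day in days]
gompertz_curve = [modified_gompertz(day, B_MAX, 5.0, 2.0) for day in days]
```

The Gompertz form adds a lag phase before production accelerates, which is one reason it often fits batch digestion data better than a plain first-order curve.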


Author(s):  
Frederico Finan ◽  
Maurizio Mazzocco

Abstract Politicians allocate public resources in ways that maximize political gains, and potentially at the cost of lower welfare. In this paper, we quantify these welfare costs in the context of Brazil’s federal legislature, which grants its members a budget to fund public projects within their states. Using data from the state of Roraima, we estimate a model of politicians’ allocation decisions and find that 26.8% of the public funds allocated by legislators are distorted relative to a social planner’s allocation. We then use the model to simulate three potential policy reforms to the electoral system: the adoption of approval voting, imposing a one-term limit, and redistricting. We find that a one-term limit and redistricting are both effective at reducing distortions. The one-term limit policy, however, increases corruption, which makes it a welfare-reducing policy.

