DSResSol: A sequence-based solubility predictor created with Dilated Squeeze Excitation Residual Networks

2021 ◽  
Author(s):  
Mohammad Madani ◽  
Kaixiang Lin ◽  
Anna Tarakanova

Protein solubility is an important thermodynamic parameter critical for the characterization of a protein's function, and a key determinant for the production yield of a protein in both the research setting and within industrial applications. Thus, a highly accurate in silico bioinformatics tool for predicting protein solubility from protein sequence is sought. In this study, we developed a deep learning sequence-based solubility predictor, DSResSol, that takes advantage of the integration of squeeze excitation residual networks with dilated convolutional neural networks. The model captures the frequently occurring amino acid k-mers and their local and global interactions, and highlights the importance of identifying long-range interaction information between amino acid k-mers to achieve higher performance in comparison to existing deep learning-based models. DSResSol uses protein sequence as input, outperforming all available sequence-based solubility predictors by at least 5 percent in accuracy when the performance is evaluated by two different independent test sets. Compared to existing predictors, DSResSol not only reduces prediction bias for insoluble proteins but also predicts soluble proteins within the test sets with an accuracy that is at least 13 percent higher. We derive the key amino acids, dipeptides, and tripeptides contributing to protein solubility, identifying glutamic acid and serine as critical amino acids for protein solubility prediction. Overall, DSResSol can be used for fast, reliable, and inexpensive prediction of a protein's solubility to guide experimental design.
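To make the term "amino acid k-mer" concrete, the following is a minimal sketch that counts overlapping k-mers in a protein sequence. It is only an illustration of what a k-mer is: the abstract says the network captures k-mers through its convolutional layers, not through explicit counting, and the function names and the example sequence here are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer (substring of length k) in a sequence."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty range and an empty Counter.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_frequencies(sequence, k):
    """Convert k-mer counts to relative frequencies that sum to 1."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


# Hypothetical toy sequence, not taken from the paper's data sets.
toy_sequence = "MKESSEESKE"
for k in (1, 2, 3):  # amino acids, dipeptides, tripeptides
    print(k, kmer_counts(toy_sequence, k).most_common(3))
```

With k set to 1, 2 and 3 the counts correspond to the amino acids, dipeptides and tripeptides mentioned in the abstract.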

2021 ◽  
Vol 22 (24) ◽  
pp. 13555
Author(s):  
Mohammad Madani ◽  
Kaixiang Lin ◽  
Anna Tarakanova

Protein solubility is an important thermodynamic parameter that is critical for the characterization of a protein’s function, and a key determinant for the production yield of a protein in both the research setting and within industrial (e.g., pharmaceutical) applications. Experimental approaches to predict protein solubility are costly, time-consuming, and frequently offer only low success rates. To reduce cost and expedite the development of therapeutic and industrially relevant proteins, a highly accurate computational tool for predicting protein solubility from protein sequence is sought. While a number of in silico prediction tools exist, they suffer from relatively low prediction accuracy, bias toward the soluble proteins, and limited applicability for various classes of proteins. In this study, we developed a novel deep learning sequence-based solubility predictor, DSResSol, that takes advantage of the integration of squeeze excitation residual networks with dilated convolutional neural networks and outperforms all existing protein solubility prediction models. This model captures the frequently occurring amino acid k-mers and their local and global interactions and highlights the importance of identifying long-range interaction information between amino acid k-mers to achieve improved accuracy, using only protein sequence as input. DSResSol outperforms all available sequence-based solubility predictors by at least 5% in terms of accuracy when evaluated by two different independent test sets. Compared to existing predictors, DSResSol not only reduces prediction bias for insoluble proteins but also predicts soluble proteins within the test sets with an accuracy that is at least 13% higher than existing models. We derive the key amino acids, dipeptides, and tripeptides contributing to protein solubility, identifying glutamic acid and serine as critical amino acids for protein solubility prediction. 
Overall, DSResSol can be used for the fast, reliable, and inexpensive prediction of a protein’s solubility to guide experimental design.
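The two building blocks named in the abstract, dilated convolution and squeeze-excitation with a residual connection, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the DSResSol architecture: it uses one kernel per channel (a depthwise simplification), hand-supplied weights, and hypothetical function names.

```python
import math


def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded 1D convolution whose taps are spaced `dilation` positions apart."""
    centre = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        total = 0.0
        for j, weight in enumerate(kernel):
            pos = i + (j - centre) * dilation
            if 0 <= pos < len(signal):
                total += weight * signal[pos]
        out.append(total)
    return out


def squeeze_excite(channels, w_reduce, w_expand):
    """Rescale each channel by a gate computed from the global channel means."""
    # Squeeze: one average per channel over all sequence positions.
    squeezed = [sum(ch) / len(ch) for ch in channels]
    # Excitation: bottleneck layer with ReLU, then sigmoid gates per channel.
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed))) for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    return [[g * v for v in ch] for g, ch in zip(gates, channels)]


def se_residual_block(channels, kernels, dilation, w_reduce, w_expand):
    """Dilated convolution + ReLU, squeeze-excitation, then add the input back."""
    conv = [[max(0.0, v) for v in dilated_conv1d(ch, k, dilation)]
            for ch, k in zip(channels, kernels)]
    gated = squeeze_excite(conv, w_reduce, w_expand)
    return [[x + y for x, y in zip(ch_in, ch_out)]
            for ch_in, ch_out in zip(channels, gated)]


# Hypothetical toy input: 2 channels over 6 sequence positions.
features = [[1.0, 0.0, 2.0, 0.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]]
kernels = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
w_reduce = [[0.5, 0.5]]        # 2 channels -> 1 hidden unit
w_expand = [[1.0], [-1.0]]     # 1 hidden unit -> 2 channel gates
print(se_residual_block(features, kernels, 2, w_reduce, w_expand))
```

Increasing `dilation` widens the span of positions a kernel sees without adding weights, which is one way a convolutional model can pick up longer-range interactions between sequence positions.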


Author(s):  
D. Filimonov ◽  
A. Lagunin

It is advisable to use data on the chemical structures of peptides with amino acid (AMA) substitutions, together with the corresponding sections of the protein sequence without mutation, to construct classification models that predict the pathogenic effects of AMA substitutions based on MNA descriptors.


2018 ◽  
Author(s):  
Jeffrey I. Boucher ◽  
Troy W. Whitfield ◽  
Ann Dauphin ◽  
Gily Nachum ◽  
Carl Hollins ◽  
...  

Abstract: The evolution of HIV-1 protein sequences should be governed by a combination of factors including nucleotide mutational probabilities, the genetic code, and fitness. The impacts of these factors on protein sequence evolution are interdependent, making it challenging to infer the individual contribution of each factor from phylogenetic analyses alone. We investigated the protein sequence evolution of HIV-1 by determining an experimental fitness landscape of all individual amino acid changes in protease. We compared our experimental results to the frequency of protease variants in a publicly available dataset of 32,163 sequenced isolates from drug-naïve individuals. The most common amino acids in sequenced isolates supported robust experimental fitness, indicating that the experimental fitness landscape captured key features of selection acting on protease during viral infections of hosts. Amino acid changes requiring multiple mutations from the likely ancestor were slightly less likely to support robust experimental fitness than single mutations, consistent with the genetic code favoring chemically conservative amino acid changes. Amino acids that were common in sequenced isolates were predominantly accessible by single mutations from the likely protease ancestor. Multiple mutations commonly observed in isolates were accessible by mutational walks with highly fit single mutation intermediates. Our results indicate that the prevalence of multiple base mutations in HIV-1 protease is strongly influenced by mutational sampling.
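The notion of an amino acid change being "accessible by a single mutation" can be illustrated with the standard genetic code. The sketch below lists the amino acids reachable from a codon by changing exactly one nucleotide. It is not the authors' analysis; the function name is hypothetical, and codons are written with DNA letters (T rather than U) as an assumption about how sequenced isolates are usually reported.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' marks stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def single_mutation_neighbours(codon):
    """Return the residues (or stop) reachable by changing one base of `codon`."""
    codon = codon.upper()
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                reachable.add(CODON_TABLE[mutant])
    # Synonymous changes do not alter the protein, so drop the original residue.
    reachable.discard(CODON_TABLE[codon])
    return reachable


print(sorted(single_mutation_neighbours("ATG")))  # neighbours of methionine
```

Any amino acid missing from the returned set needs at least two nucleotide changes from that codon, which is the sense in which such changes require "multiple mutations".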


2021 ◽  
Vol 8 (6) ◽  
pp. 201852
Author(s):  
Yi Qian ◽  
Rui Zhang ◽  
Xinglu Jiang ◽  
Guoqiu Wu

Four nucleotides (A, U, C and G) combine freely to form 64 codons, but these 64 codons are unequally assigned to 21 items (20 amino acids plus one stop signal). About 500 amino acids are known, but only 20 are selected to make up proteins. However, the relationships between amino acids and codons, and among the 20 amino acids, have been unclear. In this paper, we studied the relationships among the 20 amino acids in 33 species and found three constraints: a relatively stable mean carbon-to-hydrogen (C : H) ratio (0.50), similarity interactions between the constituent ratios of amino acids, and amino acid frequencies that follow a Poisson distribution under certain conditions. We demonstrated that the unequal distribution of the 64 codons and the choice of amino acids in molecular evolution are constrained so that C : H ratios remain stable. The constituent ratios and frequencies of the 20 amino acids in a species or a protein are two determinants of protein sequence evolution, so this finding shows that the constraints among the 20 amino acids play an important role in protein sequence evolution.
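The unequal assignment of 64 codons to 21 items can be checked directly against the standard genetic code. The following sketch enumerates all codons and counts how many encode each amino acid or the stop signal; the variable names are illustrative and the code does not reproduce the paper's C : H or Poisson analyses.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G ('*' marks stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codon_table = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Degeneracy: number of codons assigned to each amino acid or to stop.
degeneracy = Counter(codon_table.values())

print(len(codon_table), "codons for", len(degeneracy), "items")
for item, n_codons in sorted(degeneracy.items(), key=lambda kv: (-kv[1], kv[0])):
    print(item, n_codons)
```

Leucine, serine and arginine each receive six codons while methionine and tryptophan receive one, which is the imbalance the abstract refers to.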


2018 ◽  
Vol 34 (15) ◽  
pp. 2605-2613 ◽  
Author(s):  
Sameer Khurana ◽  
Reda Rawi ◽  
Khalid Kunji ◽  
Gwo-Yu Chuang ◽  
Halima Bensmail ◽  
...  

Author(s):  
Namrata Anand-Achim ◽  
Raphael R. Eguchi ◽  
Alexander Derry ◽  
Russ B. Altman ◽  
Po-Ssu Huang

Abstract: The primary challenge of fixed-backbone protein design is to find a distribution of sequences that fold to the backbone of interest. This task is central to nearly all protein engineering problems, as achieving a particular backbone conformation is often a prerequisite for hosting specific functions. In this study, we investigate the capability of a deep neural network to learn the requisite patterns needed to design sequences. The trained model serves as a potential function defined over the space of amino acid identities and rotamer states, conditioned on the local chemical environment at each residue. While most deep learning based methods for sequence design only produce amino acid sequences, our method generates full-atom structural models, which can be evaluated using established sequence quality metrics. Under these metrics we are able to produce realistic and variable designs with quality comparable to the state-of-the-art. Additionally, we experimentally test designs for a de novo TIM-barrel structure and find designs that fold, demonstrating the algorithm’s generalizability to novel structures. Overall, our results demonstrate that a deep learning model can match state-of-the-art energy functions for guiding protein design.

Significance: Protein design tasks typically depend on carefully modeled and parameterized heuristic energy functions. In this study, we propose a novel machine learning method for fixed-backbone protein sequence design, using a learned neural network potential to not only design the sequence of amino acids but also select their side-chain configurations, or rotamers. Factoring through a structural representation of the protein, the network generates designs on par with the state-of-the-art, despite having been entirely learned from data. These results indicate an exciting future for protein design driven by machine learning.


2019 ◽  
Vol 15 (4) ◽  
pp. 367-375
Author(s):  
Martin A. Mune Mune ◽  
Christian B. Bassogog ◽  
Pierre A. Bayiga ◽  
Carine E. Nyobe ◽  
Samuel R. Minka

Background: There is a constant search for new plant proteins with adequate nutritional, functional and bioactive properties and low cost for use in various food formulations. Objective: The aim of this work was to assess the nutritional and functional potential of protein from Irvingia gabonensis for use as an ingredient or supplement in food. Methods: Proximate composition and amino acid composition were analyzed. Nutritional parameters were calculated from the amino acid composition. Physicochemical properties and secondary structure of the protein were determined. Finally, the effects of oil-to-water ratio (OWR), pH and concentration on emulsifying properties were analyzed. Results: The flour contained 22.26% protein, 5.30% ash and 60% carbohydrates. Proteins contained all essential amino acids, with high contents of Leu, Ile, Val, Thr and sulfur-containing amino acids. Essential amino acid index (69%), protein efficiency ratio (2.39-2.63) and biological value (79.91%) were studied. The maximum protein solubility (61%) was observed at pH 8, while high hydrophobicity was observed at pH 2. A transition from an irregular secondary structure to a more ordered structure was found from pH 2-4 to pH 6-10. pH, OWR and concentration significantly affected the emulsifying properties of Irvingia gabonensis almonds. The maximum emulsifying capacity (EC) was observed under acidic pH and high flour concentration. EC increased with increasing OWR and concentration, but decreased with increasing pH. High emulsion stability (ES; 25-35%) was observed at pH 4-8 and OWR of 1/3 to 1/2 (v/v), at a flour concentration of 3-4% (w/v). Conclusion: Irvingia gabonensis showed good potential as a food ingredient or supplement.


2022 ◽  
Author(s):  
Yuling Zhu ◽  
Jifeng Yuan

Enantiopure amino acids are of particular interest in the agrochemical and pharmaceutical industries. Here, we report a multi-enzyme cascade for efficient production of L-phenylglycine (L-Phg) from biobased L-phenylalanine (L-Phe). We first attempted to engineer Escherichia coli to express L-amino acid deaminase (LAAD) from Proteus mirabilis, hydroxymandelate synthase (HmaS) from Amycolatopsis orientalis, (S)-mandelate dehydrogenase (SMDH) from Pseudomonas putida, the endogenous aminotransferase (AT) encoded by ilvE and L-glutamate dehydrogenase (GluDH) from E. coli. However, 10 mM L-Phe afforded only 7.21 mM L-Phg. The accumulation of benzoylformic acid suggested that the transamination step might be rate-limiting. We next used leucine dehydrogenase (LeuDH) from Bacillus cereus to bypass the use of L-glutamate as amine donor, and 40 mM L-Phe gave 39.97 mM (6.04 g/L) L-Phg, reaching 99.9% conversion. In summary, this work demonstrates a concise four-step enzymatic cascade for L-Phg synthesis from biobased L-Phe, with potential for future industrial applications.
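The conversion and titre figures quoted in the abstract can be checked with simple arithmetic. The sketch below assumes a molar mass of 151.16 g/mol for phenylglycine (C8H9NO2); that value and the function names are not given in the abstract and are supplied here only for the check.

```python
# Assumed molar mass of phenylglycine, C8H9NO2, in g/mol (not stated in the abstract).
PHG_MOLAR_MASS = 151.16


def conversion_percent(substrate_mM, product_mM):
    """Molar conversion of substrate to product, as a percentage."""
    return 100.0 * product_mM / substrate_mM


def titre_g_per_l(product_mM, molar_mass):
    """Convert a millimolar concentration to grams per litre."""
    return product_mM / 1000.0 * molar_mass


# First cascade: 10 mM L-Phe gave 7.21 mM L-Phg.
print(round(conversion_percent(10, 7.21), 1))
# LeuDH cascade: 40 mM L-Phe gave 39.97 mM L-Phg.
print(round(conversion_percent(40, 39.97), 1))
print(round(titre_g_per_l(39.97, PHG_MOLAR_MASS), 2))
```

Under that assumption the LeuDH route works out to 99.9% conversion and 6.04 g/L, matching the reported values, while the first route corresponds to roughly 72% conversion.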


2018 ◽  
Author(s):  
Antara Sengupta ◽  
Pabitra Pal Choudhury

Abstract: The aim of this paper is to analyze quantitatively the properties that are carried from a DNA sequence, through its primary protein sequence, to the properties of the protein structure. The paper presents a theory that is applicable to any arbitrary DNA sequence, whether it comes from different species, from mutated data, or from a group of genes responsible for a particular function. Irrespective of gene family, species, or wild-type or mutated status, the paper gives a standard model that defines a mapping between the physicochemical properties of an arbitrary DNA sequence and the physicochemical properties of its amino acid sequence. Experiments carried out with the PPCA protein family and its four homologs, PPC(B–E), establish that a DNA sequence keeps its signature even after translation into the corresponding amino acid sequence.


2008 ◽  
Vol 2 (1) ◽  
pp. 37-49 ◽  
Author(s):  
Kevin Campbell ◽  
Lukasz Kurgan

Development of accurate β-turn (beta-turn) type prediction methods would contribute towards the prediction of the tertiary protein structure and would provide useful insights and inputs for fold recognition and drug design. Only one existing sequence-only method is available for the prediction of beta-turn types (for type I and II) for entire protein chains, while the proposed method allows for prediction of type I, II, IV, VII, and non-specific (NS) beta-turns, filling in the gap. The proposed predictor, which is based solely on protein sequence, is shown to provide similar performance to other sequence-only methods for prediction of beta-turns and beta-turn types. The main advantage of the proposed method is the simplicity and interpretability of the underlying model. We developed novel sequence-based features that allow identifying beta-turn types and differentiating them from non-beta-turns. The features, which are based on tetrapeptides (entire beta-turns) rather than a window centered over the predicted residues as in the case of recent competing methods, provide a more biologically sound model. They include 12 features based on collocation of amino acid pairs, focusing on amino acids (Gly, Asp, and Asn) that are known to be predisposed to form beta-turns. At the same time, our model also includes features that are geared towards exclusion of non-beta-turns, which are based on amino acids known to be strongly detrimental to formation of beta-turns (Met, Ile, Leu, and Val).
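The idea of scoring whole tetrapeptides rather than a window centered on one residue can be sketched as follows. The abstract does not define its 12 collocation features, so the three counts computed here (turn-favoring residues, turn-disfavoring residues, and pairs of turn-favoring residues within a tetrapeptide) are hypothetical stand-ins chosen only to illustrate the approach.

```python
# Residue groups named in the abstract (one-letter codes).
TURN_FORMERS = set("GDN")    # Gly, Asp, Asn: predisposed to form beta-turns
TURN_BREAKERS = set("MILV")  # Met, Ile, Leu, Val: detrimental to beta-turns


def tetrapeptide_features(sequence):
    """Return (start, tetrapeptide, n_formers, n_breakers, n_former_pairs) tuples."""
    sequence = sequence.upper()
    features = []
    for start in range(len(sequence) - 3):
        window = sequence[start:start + 4]
        n_formers = sum(res in TURN_FORMERS for res in window)
        n_breakers = sum(res in TURN_BREAKERS for res in window)
        # Collocation stand-in: pairs of positions that both hold a turn former.
        n_pairs = sum(1 for i in range(4) for j in range(i + 1, 4)
                      if window[i] in TURN_FORMERS and window[j] in TURN_FORMERS)
        features.append((start, window, n_formers, n_breakers, n_pairs))
    return features


# Hypothetical toy sequence.
for row in tetrapeptide_features("AGDNLVK"):
    print(row)
```

Each tuple describes one candidate four-residue turn, so a downstream classifier would label tetrapeptides directly instead of individual residues.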

