Where Natural Protein Sequences Stand out From Randomness

2019 ◽  
Author(s):  
Laura Weidmann ◽  
Tjeerd Dijkstra ◽  
Oliver Kohlbacher ◽  
Andrei Lupas

Abstract Biological sequences are the product of natural selection, raising the expectation that they differ substantially from random sequences. We test this expectation by analyzing all fragments of a given length derived from either a natural dataset or different random models. For this, we compile all distances in sequence space between fragments within each dataset and compare the resulting distance distributions between sets. Even for 100mers, 95.4% of all distances between natural fragments are in accordance with those of a random model incorporating the natural residue composition. Hence, natural sequences are distributed almost randomly in global sequence space. When further accounting for the specific residue composition of domain-sized fragments, 99.2% of all distances between natural fragments can be modeled. Local residue composition, which might reflect biophysical constraints on protein structure, is thus the predominant feature characterizing distances between natural sequences globally, whereas homologous effects are only barely detectable.
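The comparison described in this abstract can be illustrated with a small sketch. It is not the authors' pipeline: the Hamming distance, the pooled-shuffle random model, and the `overlap` measure of agreement between two distance distributions are all assumptions chosen for simplicity, and the function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def fragments(sequences, k):
    """Return all overlapping fragments of length k from each sequence."""
    return [s[i:i + k] for s in sequences for i in range(len(s) - k + 1)]


def hamming(a, b):
    """Count positions at which two equal-length fragments differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_distribution(frags):
    """Normalized histogram of all pairwise Hamming distances."""
    counts = Counter(hamming(a, b) for a, b in combinations(frags, 2))
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}


def shuffled_model(sequences, rng):
    """Assumed random model: pool all residues, shuffle them, and cut the
    pool back into the original sequence lengths. This preserves the
    global residue composition and nothing else."""
    pool = [r for s in sequences for r in s]
    rng.shuffle(pool)
    out, pos = [], 0
    for s in sequences:
        out.append("".join(pool[pos:pos + len(s)]))
        pos += len(s)
    return out


def overlap(p, q):
    """Shared probability mass of two distributions (1.0 means identical).
    This is an illustrative agreement measure, not the paper's statistic."""
    return sum(min(p.get(d, 0.0), q.get(d, 0.0)) for d in set(p) | set(q))
```

With a list of natural sequences `seqs`, one would compare `distance_distribution(fragments(seqs, k))` against the same quantity computed on `shuffled_model(seqs, random.Random(0))`. Enumerating all pairs is quadratic in the number of fragments, so this form is only practical for small toy inputs.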

Author(s):  
Yanping Zhang ◽  
Pengcheng Chen ◽  
Ya Gao ◽  
Jianwei Ni ◽  
Xiaosheng Wang

Aim and Objective: Given the rapidly increasing amount of molecular biology data available, computational methods of low complexity are necessary to infer protein structure, function, and evolution. Method: In this work, we propose a novel method, FermatS, which is based on the global position information and local position representation from the curve and normalized moments of inertia, respectively, to extract feature information from protein sequences. Furthermore, we use the features generated by the FermatS method to analyze the similarity/dissimilarity of nine ND5 proteins and to establish a prediction model for DNA-binding proteins based on logistic regression with 5-fold cross-validation. Results: In the similarity/dissimilarity analysis of nine ND5 proteins, the results are consistent with evolutionary theory. Moreover, this method can effectively predict DNA-binding proteins in realistic situations. Conclusion: The findings demonstrate that the proposed method is effective for comparing, recognizing and predicting protein sequences. The main code and datasets can be downloaded from https://github.com/GaoYa1122/FermatS.


2021 ◽  
Vol 20 (1) ◽  
Author(s):  
Li-Yun Lin ◽  
Hui-Ying Huang ◽  
Xue-Yan Liang ◽  
Dong-De Xie ◽  
Jiang-Tao Chen ◽  
...  

Abstract Background Thrombospondin-related adhesive protein (TRAP) is a transmembrane protein that plays a crucial role during the invasion of Plasmodium falciparum into liver cells. As PfTRAP is a potential malaria vaccine candidate, its genetic diversity and natural selection were assessed and the global PfTRAP polymorphism pattern was described. Methods 153 blood spot samples from Bioko malaria patients were collected during 2016–2018 and the target TRAP gene was amplified. Together with sequences from the database, nucleotide diversity and natural selection analyses and structural prediction were performed using bioinformatics tools. Results A total of 119 Bioko PfTRAP sequences were amplified successfully. On Bioko Island, PfTRAP shows a high degree of genetic diversity and heterogeneity, with a π value of 0.01046 and an Hd of 0.99. The value of dN–dS (6.2231, p < 0.05) hinted at natural selection of PfTRAP on Bioko Island. Globally, the African PfTRAPs were more diverse than the Asian ones, and significant genetic differentiation between African and Asian countries was revealed by the fixation index (Fst > 0.15, p < 0.05). In network analysis, 667 Asian isolates clustered in 136 haplotypes and 739 African isolates clustered in 528 haplotypes. The mutations I116T, L221I, Y128F, G228V and P299S were predicted as probably damaging by the PolyPhen online service, while mutations L49V, R285G, R285S, P299S and K421N would lead to a significant increase of free energy difference (ΔΔG > 1), indicating a destabilization of protein structure. Conclusions Evidence from the present investigation supports that the PfTRAP gene from Bioko Island and other malaria-endemic countries is highly polymorphic (especially at T cell epitopes), which provides the genetic background information for developing a universally effective PfTRAP-based vaccine. Moreover, some mutations have been shown to be detrimental to the protein structure or function and deserve further study and continuous monitoring.
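For readers unfamiliar with the π and Hd statistics quoted in this abstract, the sketch below computes them from a set of aligned sequences using the standard textbook definitions: π as the mean number of pairwise differences per site, and Hd as n/(n-1) × (1 − Σ pᵢ²) over haplotype frequencies pᵢ. It assumes gap-free, equal-length sequences and is not the software used in the study; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean number of pairwise differences per site (pi).
    Assumes at least two aligned, gap-free sequences of equal length."""
    n, length = len(seqs), len(seqs[0])
    diffs = sum(
        sum(a != b for a, b in zip(s, t))
        for s, t in combinations(seqs, 2)
    )
    n_pairs = n * (n - 1) / 2
    return diffs / n_pairs / length


def haplotype_diversity(seqs):
    """Haplotype diversity Hd = n/(n-1) * (1 - sum of squared
    haplotype frequencies), treating each distinct sequence as a haplotype."""
    n = len(seqs)
    freqs = [count / n for count in Counter(seqs).values()]
    return n / (n - 1) * (1 - sum(f * f for f in freqs))
```

An Hd close to 1, as reported for Bioko Island, means that almost every sampled sequence is a distinct haplotype.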


2016 ◽  
Vol 73 (15) ◽  
pp. 2949-2957 ◽  
Author(s):  
Jia-Feng Yu ◽  
Zanxia Cao ◽  
Yuedong Yang ◽  
Chun-Ling Wang ◽  
Zhen-Dong Su ◽  
...  

2021 ◽  
Vol 8 ◽  
Author(s):  
Charles Christoffer ◽  
Vijay Bharadwaj ◽  
Ryan Luu ◽  
Daisuke Kihara

Protein-protein docking is a useful tool for modeling the structures of protein complexes that have yet to be experimentally determined. Understanding the structures of protein complexes is a key component for formulating hypotheses in biophysics regarding the functional mechanisms of complexes. Protein-protein docking is an established technique for cases where the structures of the subunits have been determined. While the number of known structures deposited in the Protein Data Bank is increasing, there are still many cases where the structures of individual proteins that users want to dock are not determined yet. Here, we have integrated the AttentiveDist method for protein structure prediction into our LZerD webserver for protein-protein docking, which enables users to simply submit protein sequences and obtain full-complex atomic models, without having to supply any structure themselves. We have further extended the LZerD docking interface with a symmetrical homodimer mode. The LZerD server is available at https://lzerd.kiharalab.org/.


2016 ◽  
Author(s):  
Sergei Spirin

There are many algorithms and programs for reconstructing the phylogeny of a set of proteins based on a multiple sequence alignment. Many programs allow users to choose a number of parameters, for example, a model for maximum likelihood programs. Different programs and different parameters often produce different results. However, at the moment, all published benchmarks for evaluating the relative accuracy of programs or of different parameter choices are based on simulated sequences. The aim of the present work is to create a benchmark that allows a comparison of phylogenetic programs on large sets of alignments of natural protein sequences.


Author(s):  
Edwin Rodriguez Horta ◽  
Martin Weigt

Abstract Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop two strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. An analysis of these data shows that the strongest coevolutionary couplings, i.e. those used by Direct Coupling Analysis to predict contacts, are only weakly influenced by phylogeny. However, phylogeny-induced spurious couplings are of similar size to the bulk of coevolutionary couplings, and dissecting functional from phylogeny-induced couplings might lead to more accurate contact predictions in the range of intermediate-size couplings. The code is available at https://github.com/ed-rodh/Null_models_I_and_II. Author summary Many homologous protein families contain thousands of highly diverged amino-acid sequences, which fold in close-to-identical three-dimensional structures and fulfill almost identical biological tasks. Global coevolutionary models, like those inferred by the Direct Coupling Analysis (DCA), assume that families can be considered as samples of some unknown statistical model, and that the parameters of these models represent evolutionary constraints acting on protein sequences. To learn these models from data, DCA and related approaches have to also assume that the distinct sequences in a protein family are close to independent, while in reality they are characterized by involved hierarchical phylogenetic relationships. Here we propose Null models for sequence alignments, which maintain patterns of amino-acid conservation and phylogeny contained in the data, but destroy any coevolutionary couplings, frequently used in protein structure prediction. We find that phylogeny actually induces spurious non-zero couplings. These are, however, significantly smaller than the largest couplings derived from natural sequences, and therefore have only little influence on the first predicted contacts. However, in the range of intermediate couplings, they may lead to statistically significant effects. Dissecting phylogenetic from functional couplings might therefore extend the range of accurately predicted structural contacts down to smaller coupling strengths than those currently used.
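To make the idea of a randomized alignment concrete, the sketch below shows the simplest common randomization: shuffling each alignment column independently. This keeps every column's residue frequencies (conservation) and breaks correlations between columns, but it also destroys phylogenetic relations between sequences, so it is deliberately weaker than the null models the authors describe and should not be read as their method. The function names are hypothetical.

```python
import random
from collections import Counter


def shuffle_columns(msa, rng):
    """Shuffle each alignment column independently.
    Preserves per-column residue counts; removes inter-column correlations.
    Unlike the null models in the abstract, it does NOT preserve phylogeny."""
    columns = [list(col) for col in zip(*msa)]
    for col in columns:
        rng.shuffle(col)
    # Transpose the shuffled columns back into rows (sequences).
    return ["".join(row) for row in zip(*columns)]


def column_counts(msa):
    """Residue counts for each alignment column."""
    return [Counter(col) for col in zip(*msa)]
```

Couplings inferred from such a shuffled alignment give a baseline for what finite sampling alone produces; separating out the phylogenetic contribution requires a null model that also retains the relationships between sequences, which is the gap the abstract addresses.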


Author(s):  
Lewis Moffat ◽  
Joe G. Greener ◽  
David T. Jones

Abstract The prediction of protein structure and the design of novel protein sequences and structures have long been intertwined. The recently released AlphaFold has heralded a new generation of accurate protein structure prediction, but the extent to which this affects protein design remains unexplored. Here we develop a rapid and effective approach for fixed backbone computational protein design, leveraging the predictive power of AlphaFold. For several designs we demonstrate that not only are the AlphaFold predicted structures in agreement with the desired backbones, but they are also supported by the structure predictions of other supervised methods as well as ab initio folding. These results suggest that AlphaFold, and methods like it, are able to facilitate the development of a new range of novel and accurate protein design methodologies.


2011 ◽  
pp. 85-105 ◽  
Author(s):  
Simona Este Rombo ◽  
Luigi Palopoli

In recent years, the information stored in biological datasets has grown exponentially, and new methods and tools have been proposed to interpret and retrieve useful information from such data. Most biological datasets contain biological sequences (e.g., DNA and protein sequences). Thus, it is important to have techniques capable of mining patterns from such sequences to discover interesting information. For instance, singling out common or similar sub-sequences in sets of bio-sequences is sensible, as these are usually associated with similar biological functions expressed by the corresponding macromolecules. The aim of this chapter is to explain how pattern discovery can be applied to such important biological problems, and to describe a number of relevant techniques proposed in the literature. A simple formalization of the problem is given and specialized for each of the presented approaches. This formalization should make the material easier to read and understand by providing a simple-to-follow roadmap through the diverse pattern extraction methods illustrated.

