substitution matrices
2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Nicole Balasco ◽  
Gianluca Damaggio ◽  
Luciana Esposito ◽  
Flavia Villani ◽  
Rita Berisio ◽  
...  

Abstract The ability of SARS-CoV-2 to rapidly mutate represents a remarkable complication. Quantitative evaluation of the effects that these mutations have on the virus structure/function is of great relevance, and the availability of a large number of SARS-CoV-2 sequences since the early phases of the pandemic represents a unique opportunity to follow the adaptation of the virus to humans. Here, we evaluated the SARS-CoV-2 amino acid mutations and their progression by analyzing publicly available viral genomes at three stages of the pandemic (March 15th and October 7th, 2020, and February 7th, 2021). Mutations were classified as conservative or non-conservative based on their probability of being accepted during evolution according to the Point Accepted Mutation substitution matrices and on the similarity of the encoding codons. We found that the most frequent substitutions are T > I, L > F, and A > V, and we observed an accumulation of hydrophobic residues. These findings are consistent among the three stages analyzed. We also found that non-conservative mutations are less frequent than conservative ones. This finding may be ascribed to a progressive adaptation of the virus to the host. In conclusion, the present study provides indications of the early evolution of the virus and tools for the global and genome-specific evaluation of the possible impact of mutations on the structure/function of SARS-CoV-2 variants.


2021 ◽  
Vol 3 (3) ◽  
Author(s):  
Tair Shauli ◽  
Nadav Brandes ◽  
Michal Linial

Abstract Human genetic variation in coding regions is fundamental to the study of protein structure and function. Most methods for interpreting missense variants consider substitution measures derived from homologous proteins across different species. In this study, we introduce human-specific amino acid (AA) substitution matrices that are based on genetic variations in the modern human population. We analyzed the frequencies of >4.8M single nucleotide variants (SNVs) at codon and AA resolution and compiled human-centric substitution matrices that are fundamentally different from classic cross-species matrices (e.g. BLOSUM, PAM). Our matrices are asymmetric, with some AA replacements showing significant directional preference. Moreover, these AA matrices are only partly predicted by nucleotide substitution rates. We further test the utility of our matrices in exposing functional signals of experimentally-validated protein annotations. A significant reduction in AA transition frequencies was observed across nine post-translational modification (PTM) types and four ion-binding sites. Our results propose a purifying selection signal in the human proteome across a diverse set of functional protein annotations and provide an empirical baseline for interpreting human genetic variation in coding regions.
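As a rough illustration of what a direction-aware (asymmetric) substitution table looks like, the sketch below counts reference-to-alternate amino-acid changes and normalizes them per reference residue. It is not the authors' pipeline: the function name `directional_frequencies` and the toy variant list are hypothetical, and real matrices would be built from population-scale SNV data.

```python
from collections import Counter


def directional_frequencies(variants):
    """Map (ref, alt) to the fraction of ref's substitutions that go to alt.

    `variants` is an iterable of (reference_aa, alternate_aa) pairs.
    Because counts are normalized per reference residue, the value for
    (X, Y) need not equal the value for (Y, X).
    """
    counts = Counter(variants)
    totals = Counter()
    for (ref, _alt), n in counts.items():
        totals[ref] += n
    return {(ref, alt): n / totals[ref] for (ref, alt), n in counts.items()}


# Hypothetical toy data, NOT taken from the study.
toy_variants = [
    ("R", "H"), ("R", "H"), ("R", "C"),
    ("H", "R"), ("H", "Y"), ("H", "Y"), ("H", "Y"),
]

freqs = directional_frequencies(toy_variants)
# A positive value means R>H is relatively more common than H>R in the toy data.
asymmetry = freqs[("R", "H")] - freqs[("H", "R")]
```

Comparing `freqs[(X, Y)]` with `freqs[(Y, X)]` is one simple way to look for a directional preference; a symmetric matrix such as BLOSUM or PAM cannot express that difference.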


2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Marco Pietrosanto ◽  
Marta Adinolfi ◽  
Andrea Guarracino ◽  
Fabrizio Ferrè ◽  
Gabriele Ausiello ◽  
...  

Abstract Structural characterization of RNAs is a dynamic field, offering many modelling possibilities. RNA secondary structure models are usually characterized by an encoding that depicts structural information of the molecule through string representations or graphs. In this work, we provide a generalization of the BEAR encoding (a context-aware structural encoding we previously developed) by expanding the set of alignments used for the construction of substitution matrices and then applying it to secondary structure encodings ranging from fine-grained to more coarse-grained representations. We also introduce a re-interpretation of the Shannon Information applied on RNA alignments, proposing a new scoring metric, the Relative Information Gain (RIG). The RIG score is available for any position in an alignment, showing how different levels of detail encoded in the RNA representation can contribute differently to convey structural information. The approaches presented in this study can be used alongside state-of-the-art tools to synergistically gain insights into the structural elements that RNAs and RNA families are composed of. This additional information could potentially contribute to their improvement or increase the degree of confidence in the secondary structure of families and any set of aligned RNAs.
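For orientation, the classic per-column Shannon information content of an alignment can be sketched as below. This is the standard quantity that position-wise scores build on, not the RIG score itself, whose definition is not given in the abstract. The function names, the three-symbol dot-bracket alphabet, and the toy alignment are assumptions made for illustration.

```python
import math
from collections import Counter


def column_information(column, alphabet_size):
    """Information content (bits) of one alignment column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    symbol frequencies observed in the column.
    """
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def alignment_information(rows, alphabet_size):
    """Information content for every column of equally long aligned rows."""
    return [column_information(col, alphabet_size) for col in zip(*rows)]


# Hypothetical aligned secondary-structure strings over the alphabet "(", ")", ".".
toy_alignment = ["((.))", "((.))", "(..))", "(...)"]
info = alignment_information(toy_alignment, alphabet_size=3)
```

Fully conserved columns reach the maximum of `log2(alphabet_size)` bits, while columns with mixed symbols score lower. A coarser or finer structural encoding changes the alphabet size and therefore the attainable maximum.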


2020 ◽  
Author(s):  
Christoffer Norn ◽  
Ingemar André ◽  
Douglas L. Theobald

Abstract Proteins evolve under a myriad of biophysical selection pressures that collectively control the patterns of amino acid substitutions. Averaged over time and across proteins, these evolutionary pressures are sufficiently consistent to produce global substitution patterns that can be used to successfully find homologues, infer phylogenies, and reconstruct ancestral sequences. Although the factors that govern the variation of protein substitution rates have received much attention, the influence of thermodynamic stability constraints remains unresolved. Here we develop a simple model to calculate amino acid rate matrices from evolutionary dynamics controlled by a fitness function that reports on the thermodynamic effects of amino acid mutations in protein structures. This hybrid biophysical and evolutionary model accounts for nucleotide transition/transversion rate bias, multi-nucleotide codon changes, the number of codons per amino acid, and thermodynamic protein stability. We find that our theoretical model accurately recapitulates the complex pattern of empirical rates observed in common global amino acid substitution matrices used in phylogenetics. These results suggest that selection for thermodynamically stable proteins, coupled with nucleotide mutation bias filtered by the structure of the genetic code, is the primary global driver behind the amino acid substitution patterns observed in proteins throughout the tree of life.


2020 ◽  
Author(s):  
Tair Shauli ◽  
Nadav Brandes ◽  
Michal Linial

Abstract The characterization of human genetic variation in coding regions is fundamental to the understanding of protein function, structure and evolution. Amino-acid (AA) substitution matrices encapsulate the stochastic nature of such proteomic variation and are widely used in studying protein families and evolutionary processes. The conventional substitution matrices, namely BLOSUM and PAM, were constructed to reflect polymorphism across species. In this study, we analyzed the frequencies of >4.8M single nucleotide variants within the healthy human population to accurately represent proteomic variability within the human species, at codon and AA resolution. Our model exposes various AA substitutions which are observed more frequently in one specific direction than in the opposite direction. We further demonstrate that nucleotide substitution rates only partially determine AA substitution rates. Finally, we investigate AA substitutions in post-translational modification and ion-binding sites, exposing purifying selection over a range of residue-based functions. These novel matrices provide a robust baseline for the analysis of protein variation in health and disease.


2020 ◽  
Vol 21 (S11) ◽  
Author(s):  
Valery Polyanovsky ◽  
Alexander Lifanov ◽  
Natalia Esipova ◽  
Vladimir Tumanyan

Abstract Background: The alignment of character sequences is important in bioinformatics. The quality of this procedure is determined by the substitution matrix and the parameters of the insertion-deletion penalty function. These matrices are derived from sequence alignment and thus reflect the evolutionary process. Currently, in addition to evolutionary matrices, a large number of matrices of different backgrounds have been obtained. To make an optimal choice of the substitution matrix and the penalty parameters, we conducted a numerical experiment using a representative sample of existing matrices of various types and origins. Results: We tested the classical evolutionary matrix series (PAM, Blosum, VTML, Pfasum), matrices based on structural alignment, a contact energy matrix, and a matrix based on the properties of the genetic code. This study presents results for two test set types: first, we simulated sequences that reflect divergent evolution; second, we performed tests on Balibase sequences. In both cases, we obtained the dependence of the alignment quality (Accuracy, Confidence) on the evolutionary distance between sequences and on the evolutionary distance to which the substitution matrices correspond. The combination of matrices and penalty parameters was optimized for local and global alignment over the values of the penalty function parameters. We found that the best alignment quality is achieved with matrices corresponding to the largest evolutionary distance. These matrices prove to be universal, i.e. suitable for aligning sequences separated by both large and small evolutionary distances. We analysed the correspondence of the correlation coefficients of matrices to the alignment quality. It was found that matrices showing high quality alignment have an above average correlation value, but the converse is not true.
Conclusions: This study showed that the best alignment quality is achieved with evolutionary matrices designed for long distances: Gonnet, VTML250, PAM250, MIQS, and Pfasum050. The same property is inherent not only in matrices of evolutionary origin, but also in matrices of another background that correspond to a large evolutionary distance. Therefore, matrices based on structural data show alignment quality close to that of evolutionary matrices. This agrees with the idea that the spatial structure is more conservative than the protein sequence.
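To make the interplay between a substitution matrix and an insertion-deletion penalty concrete, here is a minimal sketch of global (Needleman-Wunsch) alignment scoring. It assumes a linear gap penalty, which is a simplification, since the abstract does not specify the form of the penalty function. The scoring function `toy_score` is a hypothetical stand-in for a real matrix such as PAM250 or VTML250.

```python
def global_alignment_score(a, b, score, gap):
    """Optimal global alignment score of sequences a and b.

    `score(x, y)` returns the substitution score for aligning x with y;
    `gap` is the (negative) cost of each inserted or deleted position.
    Only two rows of the dynamic-programming table are kept in memory.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            ))
        prev = curr
    return prev[-1]


def toy_score(x, y):
    """Hypothetical scoring: +2 for identical symbols, -1 otherwise."""
    return 2 if x == y else -1


identical = global_alignment_score("ACGT", "ACGT", toy_score, gap=-2)
one_deletion = global_alignment_score("ACGT", "AGT", toy_score, gap=-2)
```

Swapping `toy_score` for a lookup into a real substitution matrix and varying `gap` is the kind of parameter sweep the study describes, although the authors' actual test harness is not shown in the abstract.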


2020 ◽  
Author(s):  
Igor Lima ◽  
Elio A. Cino

Abstract Homologous proteins are often compared by pairwise sequence alignment, and by structure superposition if the atomic coordinates are available. Unification of sequence and structure data is an important task in structural biology. Here, we present Sequence Similarity 3D (SS3D), a new method for integrating sequence and structure information for comparison of homologous proteins. SS3D quantifies the spatial similarity of residues within a given radius of homologous through-space contacts. The spatial alignments are scored using native BLOSUM and PAM substitution matrices. This work details the SS3D approach and demonstrates its utility through case studies comparing members of several protein families: GPCR, p53, kelch, SUMO, and SARS coronavirus spike protein. We show that SS3D can more clearly highlight biologically important regions of similarity and dissimilarity compared to pairwise sequence alignments or structure superposition alone. SS3D is written in C++, and is available with a manual and tutorial at https://github.com/0x462e41/SS3D/.


2020 ◽  
Author(s):  
Tair Shauli ◽  
Nadav Brandes ◽  
Michal Linial

Abstract The characterization of human genetic variation in coding regions is fundamental to our understanding of protein function, structure, and evolution. Amino-acid (AA) substitution matrices such as BLOSUM (BLOcks SUbstitution Matrix) and PAM (Point Accepted Mutations) encapsulate the stochastic nature of such proteomic variation and are used in studying protein families and evolutionary processes. However, these matrices were constructed from protein sequences spanning long evolutionary distances and are not designed to reflect polymorphism within species. To accurately represent proteomic variation within the human population, we constructed a set of human-centric substitution matrices derived from genetic variations by analyzing the frequencies of >4.8M single nucleotide variants (SNVs). These human-specific matrices expose short-term evolutionary trends at both codon and AA resolution and therefore present an evolutionary perspective that differs from that implied by the traditional matrices. Specifically, our matrices consider the directionality of variants, and uncover a set of AA pairs that exhibit a strong tendency to substitute in a specific direction. We further demonstrate that the substitution rates of nucleotides only partially determine AA substitution rates. Finally, we investigate AA substitutions in post-translational modification (PTM) and ion-binding sites. We confirm a strong propensity towards conservation of the identity of the AA that participates in such functions. The empirically-derived human-specific substitution matrices expose purifying selection over a range of residue-based protein properties. The new substitution matrices provide a robust baseline for the analysis of protein variations in health and disease. The underlying methodology is openly available to the biomedical community.


Author(s):  
David P. Cavanaugh ◽  
Krishnan K. Chittur

Abstract Motivation: Sequence database search and matching algorithms are an important tool when trying to understand the structure (and so the function) of proteins. Proteins with similar structure and function often have very similar primary structure. There are, however, many cases where proteins with similar structure have very different primary structures. Substitution matrices (PAM, BLOSUM, Gonnet) can be used to identify proteins of similar structure, but they fail when the sequence similarity falls below about 25%. Results: We have described a new algorithm for examining the primary structure of proteins against a database of known proteins with a new hydrophobicity index. In this paper, we examine the ability of TMATCH to identify proteins of similar structure using sequence matching with the hydrophobicity index. We compare results from TMATCH with those obtained using FASTA and PSI-BLAST. We show that by using similarity patterns spread across the entire length of two proteins we get a more robust indicator of remote relatedness than by relying upon high-similarity scoring pair regions. Availability: The program TMATCH is available on [email protected]

