Grouping of amino acid types and extraction of amino acid properties from multiple sequence alignments using variance maximization

The Shannon entropy is a common way of measuring conservation of sites in multiple sequence alignments, and has been extended to the relative Shannon entropy to account for background frequencies. The von Neumann entropy is another extension of the Shannon entropy, adapted from quantum mechanics in order to account for amino acid similarities. However, there is as yet no relative von Neumann entropy defined for sequence analysis. We introduce a new definition of the von Neumann entropy for use in sequence analysis, which we found to perform better than the previous definition. We also introduce the relative von Neumann entropy and a way of parametrizing it so that the Shannon entropy, the relative Shannon entropy and the von Neumann entropy are obtained at special parameter values. We performed an exhaustive search of this parameter space and found better predictions of catalytic sites than with any of the previously used entropies.
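
As a rough illustration of the two simpler measures named above, the following sketch computes the Shannon entropy and the relative Shannon entropy of a single alignment column. The function names, the gap handling (gaps are ignored), and the uniform background used in the usage note are assumptions made for this example and are not taken from the paper. The von Neumann entropy is left out because it depends on an amino acid similarity matrix that the abstract does not specify.

```python
import math
from collections import Counter


def column_frequencies(column):
    """Relative residue frequencies in one alignment column (gaps '-' ignored)."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return {}
    counts = Counter(residues)
    return {aa: n / len(residues) for aa, n in counts.items()}


def shannon_entropy(column):
    """Shannon entropy in bits; 0 for a fully conserved column."""
    freqs = column_frequencies(column)
    return -sum(p * math.log2(p) for p in freqs.values())


def relative_entropy(column, background):
    """Relative entropy (Kullback-Leibler divergence) in bits against
    background frequencies. Every residue in the column must have a
    non-zero entry in `background`, otherwise this raises an error."""
    freqs = column_frequencies(column)
    return sum(p * math.log2(p / background[aa]) for aa, p in freqs.items())
```

For example, `shannon_entropy("AACC")` is 1 bit, and with a hypothetical uniform background of 0.05 per amino acid a fully conserved column has a relative entropy of `log2(20)` bits. Low Shannon entropy and high relative entropy both indicate conservation, which is why the two are used as complementary scores.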

Experimental Assessment of the Importance of Amino Acid Positions Identified by an Entropy-Based Correlation Analysis of Multiple-Sequence Alignments

Biochemistry, 2012, Vol 51 (28), pp. 5633-5641. DOI: 10.1021/bi300747r. Cited By ~ 11

Author(s): Susanne Dietrich, Nadine Borst, Sandra Schlee, Daniel Schneider, Jan-Oliver Janda, ...

Keyword(s): Amino Acid, Correlation Analysis, Sequence Alignments, Multiple Sequence, Experimental Assessment, Multiple Sequence Alignments

QMaker: Fast and accurate method to estimate empirical models of protein evolution

2020. DOI: 10.1101/2020.02.20.958819. Cited By ~ 1

Author(s): Bui Quang Minh, Cuong Cao Dang, Le Sy Vinh, Robert Lanfear

Keyword(s): Amino Acid, Amino Acid Substitution, Search Algorithm, Phylogenetic Analyses, Accurate Method, Sequence Alignments, Multiple Sequence, Multiple Sequence Alignments, Substitution Models, Protein Alignments

Abstract: Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this paper, we propose QMaker, a new ML method to estimate a general time-reversible Q matrix from a large protein dataset consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection step for handling model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores, so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies.
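
To make the term "general time-reversible Q matrix" concrete, here is a minimal sketch of the standard parameterization: off-diagonal rates are a symmetric exchangeability multiplied by the stationary frequency of the target state. This is not QMaker's estimation procedure; the function name, the three-state toy input, and the normalization to one expected substitution per unit time are assumptions chosen for illustration.

```python
def build_gtr_q(exchange, freqs):
    """Build a time-reversible rate matrix Q from a symmetric
    exchangeability matrix and stationary frequencies.

    Off-diagonal entries are Q[i][j] = exchange[i][j] * freqs[j];
    each diagonal entry makes its row sum to zero. The result is
    scaled so that the expected rate -sum_i freqs[i] * Q[i][i] is 1.
    """
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchange[i][j] * freqs[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when the row is summed
    rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[value / rate for value in row] for row in q]


# Toy three-state example (real protein models use 20 states).
toy_exchange = [[0.0, 1.0, 2.0],
                [1.0, 0.0, 3.0],
                [2.0, 3.0, 0.0]]
toy_freqs = [0.5, 0.3, 0.2]
toy_q = build_gtr_q(toy_exchange, toy_freqs)
```

Time reversibility shows up as detailed balance: `toy_freqs[i] * toy_q[i][j]` equals `toy_freqs[j] * toy_q[j][i]` for every pair of states, which follows directly from the symmetry of the exchangeabilities.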

Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments

Molecules, 2018, Vol 24 (1), pp. 104. DOI: 10.3390/molecules24010104

Author(s): Patrice Koehl, Henri Orland, Marc Delarue

Keyword(s): Amino Acids, Amino Acid, Principal Components, Gaussian Model, Sequence Alignments, Multiple Sequence, Multiple Sequence Alignments, Substitution Matrices, Multivariate Gaussian, Multivariate Gaussian Model

Residues in proteins that are in close spatial proximity are more prone to covary, as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher-dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple 20-dimensional encoding in which each amino acid is represented by a vector with one along its own dimension and zero elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is lower than that obtained with the 20-dimensional binary encoding. We also highlight the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment.
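
The 20-dimensional binary encoding described above can be sketched as follows. The alphabet ordering, the function names, and the choice to encode gaps and unknown residues as all-zero vectors are assumptions for this example; the abstract does not say how such characters are treated.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def one_hot(residue):
    """Return a 20-element vector with 1 at the residue's own position.

    Gaps and unknown characters map to the all-zero vector
    (an assumption made for this sketch).
    """
    vector = [0] * len(AMINO_ACIDS)
    index = AMINO_ACIDS.find(residue)
    if index >= 0:
        vector[index] = 1
    return vector


def encode_sequence(sequence):
    """Concatenate the per-residue vectors into one flat vector of length 20 * L."""
    flat = []
    for residue in sequence:
        flat.extend(one_hot(residue))
    return flat
```

Encoding every aligned sequence this way turns an alignment of N sequences of length L into an N by 20L numeric matrix, which is the kind of input a multivariate Gaussian model needs in order to estimate a mean vector and a covariance matrix.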

Protein residues determining interaction specificity in paralogous families

Bioinformatics, 2020. DOI: 10.1093/bioinformatics/btaa934

Author(s): Borja Pitarch, Juan A G Ranea, Florencio Pazos

Keyword(s): Amino Acid, Fine Tuning, Supplementary Information, Sequence Information, Large Set, Supplementary Data, Sequence Alignments, Multiple Sequence, Multiple Sequence Alignments, Protein Residues

Abstract

Motivation: Predicting the residues controlling a protein’s interaction specificity is important not only to better understand its interactions but also to design mutations aimed at fine-tuning or swapping them.

Results: In this work, we present a methodology that combines sequence information (in the form of multiple sequence alignments) with interactome information to detect such residues in paralogous families of proteins. The interactome is used to define pairwise similarities of interaction contexts for the proteins in the alignment. The method looks for alignment positions with patterns of amino-acid changes reflecting the similarities/differences in the interaction neighborhoods of the corresponding proteins. We tested this new methodology on a large set of human paralogous families with structurally characterized interactions, and discuss in detail the results for the RasH family. We show that this approach is a better predictor of interfacial residues than both sequence conservation and an equivalent ‘unsupervised’ method that does not use interactome information.

Availability and implementation: http://csbg.cnb.csic.es/pazos/Xdet/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Faculty Opinions recommendation of Influence of conservation on calculations of amino acid covariance in multiple sequence alignments.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature, 2004. DOI: 10.3410/f.1020500.235440

Author(s): Andras Fiser

Keyword(s): Amino Acid, Sequence Alignments, Multiple Sequence, Multiple Sequence Alignments

Influence of conservation on calculations of amino acid covariance in multiple sequence alignments

Proteins Structure Function and Bioinformatics, 2004, Vol 56 (2), pp. 211-221. DOI: 10.1002/prot.20098. Cited By ~ 167

Author(s): Anthony A. Fodor, Richard W. Aldrich

Keyword(s): Amino Acid, Sequence Alignments, Multiple Sequence, Multiple Sequence Alignments

Size and structure of the sequence space of repeat proteins

2019. DOI: 10.1101/635581

Author(s): Jacopo Marchi, Ezequiel A. Galpern, Rocio Espada, Diego U. Ferreiro, Aleksandra M. Walczak, ...

Keyword(s): Amino Acid, Protein Design, Amino Acid Sequences, Sequence Alignments, Multiple Sequence, Multiple Sequence Alignments, Repeat Proteins, The Impact, New Strategies, Amino Acid Conservation

Abstract: The coding space of protein sequences is shaped by evolutionary constraints set by requirements of function and stability. We show that the coding space of a given protein family, that is, the total number of sequences in that family, can be estimated using models of maximum entropy trained on multiple sequence alignments of naturally occurring amino acid sequences. We analyzed and calculated the size of three abundant repeat protein families, whose members are large proteins made of many repetitions of conserved portions of ∼30 amino acids. While amino acid conservation at each position of the alignment explains most of the reduction of diversity relative to completely random sequences, we found that correlations between amino acid usage at different positions significantly impact that diversity. We quantified the impact of different types of correlations, functional and evolutionary, on sequence diversity. Analysis of the detailed structure of the coding space of the families revealed a rugged landscape, with many local energy minima of varying sizes with a hierarchical structure, reminiscent of frustrated energy landscapes of spin glasses in physics. This clustered structure indicates a multiplicity of subtypes within each family, and suggests new strategies for protein design.
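
The contribution of per-position conservation alone can be illustrated with the simplest maximum entropy model, one that matches only single-site frequencies and treats positions as independent. Under that assumption the entropy of the family is the sum of the site entropies, and its exponential gives an effective number of sequences. This sketch is not the authors' model, which also includes correlations between positions; the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in nats) of the residues observed in one column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def independent_site_log_size(alignment):
    """Log of the effective number of sequences under an independent-site
    model: the sum of per-column entropies. Assumes all sequences in the
    alignment have the same length."""
    length = len(alignment[0])
    return sum(site_entropy([seq[i] for seq in alignment]) for i in range(length))


# Toy alignment: column 1 is fully conserved, column 2 is split evenly.
toy_alignment = ["AC", "AD", "AC", "AD"]
toy_size = math.exp(independent_site_log_size(toy_alignment))
```

In the toy alignment only the second column varies, between two equally frequent residues, so the effective size is 2, compared with 20^2 = 400 for completely random two-residue sequences. Correlations between positions, which this sketch ignores, would change the estimate.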

The evolution of contact prediction: Evidence that contact selection in statistical contact prediction is changing

2019. DOI: 10.1101/660191

Author(s): Mark Chonofsky, Saulo H. P. de Oliveira, Konrad Krawczyk, Charlotte M. Deane

Keyword(s): Amino Acids, Protein Structure, Amino Acid, Structure Prediction, Prediction Methods, Sequence Alignments, Multiple Sequence, Contact Prediction, Multiple Sequence Alignments, Physico Chemical

Abstract: Over the last few years, the field of protein structure prediction has been transformed by increasingly accurate contact prediction software. These methods are based on the detection of coevolutionary relationships between residues from multiple sequence alignments. However, despite speculation, there is little evidence of a link between contact prediction and the physico-chemical interactions which drive amino-acid coevolution. Furthermore, existing protocols predict only a fraction of all protein contacts and it is not clear why some contacts are favoured over others.

Using a dataset of 863 protein domains, we assessed the physico-chemical interactions of contacts predicted by CCMpred, MetaPSICOV, and DNCON2, as examples of direct coupling analysis, meta-prediction, and deep learning, respectively. To further investigate what sets these predicted contacts apart, we considered correctly predicted contacts and compared their properties against the protein contacts that were not predicted.

We found that predicted contacts tend to form more bonds than non-predicted contacts, which suggests these contacts may be more important. Comparing the contacts predicted by each method, we found that MetaPSICOV and DNCON2 favour accuracy whereas CCMpred detects contacts with more bonds. This suggests that the push for higher accuracy may lead to a loss of physico-chemically important contacts.

These results underscore the connection between protein physico-chemistry and the coevolutionary couplings that can be derived from multiple sequence alignments. This relationship is likely to be relevant to protein structure prediction and functional analysis of protein structure and may be key to understanding their utility for different problems in structural biology.

Author summary: Accurate contact prediction has allowed scientists to predict protein structures with unprecedented levels of accuracy. The success of contact prediction methods, which are based on inferring correlations between amino acids in protein multiple sequence alignments, has prompted a great deal of work to improve the quality of contact prediction, leading to the development of several different methods for detecting amino acids in proximity.

In this paper, we investigate the properties of these contact prediction methods. We find that contacts which are predicted differ from the other contacts in the protein; in particular, they have more physico-chemical bonds, and the predicted contacts are more strongly conserved than other contacts across protein families. We also compared the properties of different contact prediction methods and found that the characteristics of the predicted sets depend on the prediction method used.

Our results point to a link between physico-chemical bonding interactions and the evolutionary history of proteins, a connection which is reflected in their amino acid sequences.
