Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments

Patrice Koehl; Henri Orland; Marc Delarue

doi:10.3390/molecules24010104

Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments

Molecules ◽

10.3390/molecules24010104 ◽

2018 ◽

Vol 24 (1) ◽

pp. 104

Author(s):

Patrice Koehl ◽

Henri Orland ◽

Marc Delarue

Keyword(s):

Amino Acids ◽

Amino Acid ◽

Principal Components ◽

Gaussian Model ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Substitution Matrices ◽

Multivariate Gaussian ◽

Multivariate Gaussian Model

Residues in proteins that are in close spatial proximity are more prone to covary, as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher-dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple 20-dimensional encoding in which each amino acid is represented by a vector with a one along its own dimension and zeros elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is lower than that obtained with the 20-dimensional binary encoding. We also highlight the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment.
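
To make the 20-dimensional binary encoding and the role of the prior concrete, the following is a minimal sketch rather than the authors' implementation. The function names, the toy alignment, the all-zero treatment of gaps, and the shrinkage-toward-diagonal prior with `prior_weight=0.5` are all illustrative assumptions that do not come from the abstract.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot(residue):
    """20-dimensional vector: one along the residue's own dimension, zero elsewhere.

    Assumption: gaps and unknown symbols map to the all-zero vector.
    """
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def encode_alignment(msa):
    """Encode each aligned sequence of length L as a flat vector of length 20 * L."""
    return [[x for residue in seq for x in one_hot(residue)] for seq in msa]


def regularized_covariance(rows, prior_weight=0.5):
    """Empirical covariance blended with a diagonal prior (hypothetical choice)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / n
            for j in range(d)] for i in range(d)]
    # Blending with a positive diagonal keeps the matrix positive definite,
    # even when the alignment has fewer sequences than encoded dimensions.
    for i in range(d):
        for j in range(d):
            target = 1.0 / len(AMINO_ACIDS) if i == j else 0.0
            cov[i][j] = (1 - prior_weight) * cov[i][j] + prior_weight * target
    return cov


msa = ["ACD", "ACE", "GCD", "GCE"]   # toy alignment: 4 sequences, 3 columns
X = encode_alignment(msa)            # 4 rows of 60 numbers each
C = regularized_covariance(X)        # 60 x 60 regularized covariance
```

With only four sequences and sixty encoded dimensions, the raw covariance here is singular, which is one way to see why the choice of prior matters for such a model.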

The evolution of contact prediction: Evidence that contact selection in statistical contact prediction is changing

10.1101/660191 ◽

2019 ◽

Author(s):

Mark Chonofsky ◽

Saulo H. P. de Oliveira ◽

Konrad Krawczyk ◽

Charlotte M. Deane

Keyword(s):

Amino Acids ◽

Protein Structure ◽

Amino Acid ◽

Structure Prediction ◽

Prediction Methods ◽

Sequence Alignments ◽

Multiple Sequence ◽

Contact Prediction ◽

Multiple Sequence Alignments ◽

Physico Chemical

Abstract: Over the last few years, the field of protein structure prediction has been transformed by increasingly-accurate contact prediction software. These methods are based on the detection of coevolutionary relationships between residues from multiple sequence alignments. However, despite speculation, there is little evidence of a link between contact prediction and the physico-chemical interactions which drive amino-acid coevolution. Furthermore, existing protocols predict only a fraction of all protein contacts and it is not clear why some contacts are favoured over others.

Using a dataset of 863 protein domains, we assessed the physico-chemical interactions of contacts predicted by CCMpred, MetaPSICOV, and DNCON2, as examples of direct coupling analysis, meta-prediction, and deep learning, respectively. To further investigate what sets these predicted contacts apart, we considered correctly-predicted contacts and compared their properties against the protein contacts that were not predicted.

We found that predicted contacts tend to form more bonds than non-predicted contacts, which suggests these contacts may be more important. Comparing the contacts predicted by each method, we found that MetaPSICOV and DNCON2 favour accuracy whereas CCMpred detects contacts with more bonds. This suggests that the push for higher accuracy may lead to a loss of physico-chemically important contacts.

These results underscore the connection between protein physico-chemistry and the coevolutionary couplings that can be derived from multiple sequence alignments. This relationship is likely to be relevant to protein structure prediction and functional analysis of protein structure and may be key to understanding their utility for different problems in structural biology.

Author summary: Accurate contact prediction has allowed scientists to predict protein structures with unprecedented levels of accuracy. The success of contact prediction methods, which are based on inferring correlations between amino acids in protein multiple sequence alignments, has prompted a great deal of work to improve the quality of contact prediction, leading to the development of several different methods for detecting amino acids in proximity.

In this paper, we investigate the properties of these contact prediction methods. We find that contacts which are predicted differ from the other contacts in the protein, in particular they have more physico-chemical bonds, and the predicted contacts are more strongly conserved than other contacts across protein families. We also compared the properties of different contact prediction methods and found that the characteristics of the predicted sets depend on the prediction method used.

Our results point to a link between physico-chemical bonding interactions and the evolutionary history of proteins, a connection which is reflected in their amino acid sequences.

A minimum reporting standard for multiple sequence alignments

10.1101/2020.01.15.907733 ◽

2020 ◽

Author(s):

Thomas KF Wong ◽

Subha Kalyaanamoorthy ◽

Karen Meusemann ◽

David K Yeates ◽

Bernhard Misof ◽

...

Keyword(s):

Amino Acids ◽

Sequence Data ◽

Pivotal Role ◽

Sequence Alignments ◽

Reporting Standard ◽

Multiple Sequence ◽

Molecular Sequence Data ◽

Molecular Sequence ◽

Multiple Sequence Alignments

Abstract: Multiple sequence alignments (MSAs) play a pivotal role in studies of molecular sequence data, but nobody has developed a minimum reporting standard (MRS) to quantify the completeness of MSAs in terms of completely-specified nucleotides or amino acids. We present an MRS that relies on four simple completeness metrics. The metrics are implemented in AliStat, a program developed to support the MRS. A survey of published MSAs illustrates the benefits and unprecedented transparency offered by the MRS.

A minimum reporting standard for multiple sequence alignments

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa024 ◽

2020 ◽

Vol 2 (2) ◽

Cited By ~ 8

Author(s):

Thomas K F Wong ◽

Subha Kalyaanamoorthy ◽

Karen Meusemann ◽

David K Yeates ◽

Bernhard Misof ◽

...

Keyword(s):

Amino Acids ◽

Sequence Data ◽

Pivotal Role ◽

Sequence Alignments ◽

Reporting Standard ◽

Multiple Sequence ◽

Molecular Sequence Data ◽

Molecular Sequence ◽

Multiple Sequence Alignments

Abstract: Multiple sequence alignments (MSAs) play a pivotal role in studies of molecular sequence data, but nobody has developed a minimum reporting standard (MRS) to quantify the completeness of MSAs in terms of completely specified nucleotides or amino acids. We present an MRS that relies on four simple completeness metrics. The metrics are implemented in AliStat, a program developed to support the MRS. A survey of published MSAs illustrates the benefits and unprecedented transparency offered by the MRS.
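
The abstract does not define the four completeness metrics, so the sketch below only illustrates the general idea of measuring completeness as the fraction of completely specified characters. The four quantities computed here (whole alignment, per sequence, per site, per pair of sequences), the function names, and the restriction to the 20 standard amino acids are assumptions made for illustration; this is not AliStat's code or interface.

```python
SPECIFIED = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: only these count as completely specified


def is_specified(ch):
    """True for a standard amino acid; gaps and ambiguity codes count as incomplete."""
    return ch.upper() in SPECIFIED


def completeness(msa):
    """Return (overall, per_sequence, per_site, pairwise) completeness fractions."""
    n_seq, n_site = len(msa), len(msa[0])
    mask = [[is_specified(ch) for ch in seq] for seq in msa]
    overall = sum(map(sum, mask)) / (n_seq * n_site)
    per_sequence = [sum(row) / n_site for row in mask]
    per_site = [sum(mask[i][j] for i in range(n_seq)) / n_seq
                for j in range(n_site)]
    # For each pair of sequences: fraction of sites specified in both.
    pairwise = {
        (i, k): sum(a and b for a, b in zip(mask[i], mask[k])) / n_site
        for i in range(n_seq) for k in range(i + 1, n_seq)
    }
    return overall, per_sequence, per_site, pairwise


overall, per_sequence, per_site, pairwise = completeness(["AC-E", "A-XE", "ACDE"])
```

Reporting several views of completeness rather than one number makes it easier to tell whether missing data are concentrated in a few sequences or in a few sites.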

Relative von Neumann Entropy for Evaluating Amino Acid Conservation

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972001000494x ◽

2010 ◽

Vol 08 (05) ◽

pp. 809-823 ◽

Cited By ~ 13

Author(s):

Fredrik Johansson ◽

Hiroyuki Toh

Keyword(s):

Amino Acid ◽

Sequence Analysis ◽

Shannon Entropy ◽

Von Neumann Entropy ◽

Sequence Alignments ◽

Multiple Sequence ◽

Von Neumann ◽

Multiple Sequence Alignments ◽

Previous Definition ◽

Definition Of

The Shannon entropy is a common way of measuring conservation of sites in multiple sequence alignments, and has also been extended with the relative Shannon entropy to account for background frequencies. The von Neumann entropy is another extension of the Shannon entropy, adapted from quantum mechanics in order to account for amino acid similarities. However, no relative von Neumann entropy has yet been defined for sequence analysis. We introduce a new definition of the von Neumann entropy for use in sequence analysis, which we found to perform better than the previous definition. We also introduce the relative von Neumann entropy and a way of parametrizing it in order to obtain the Shannon entropy, the relative Shannon entropy and the von Neumann entropy at special parameter values. We performed an exhaustive search of this parameter space and found better predictions of catalytic sites compared to any of the previously used entropies.
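
As a point of reference for the two classical measures named above, here is a minimal sketch of the Shannon entropy and the relative Shannon entropy (Kullback-Leibler divergence from background frequencies) of a single alignment column. The function names and the toy background distribution are assumptions; the von Neumann variants, which need an eigendecomposition of a similarity-weighted matrix, are not sketched here.

```python
import math
from collections import Counter


def column_frequencies(column):
    """Observed residue frequencies in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def shannon_entropy(column):
    """Shannon entropy in bits; 0 for a fully conserved column."""
    freqs = column_frequencies(column)
    return -sum(p * math.log2(p) for p in freqs.values())


def relative_entropy(column, background):
    """Relative Shannon entropy in bits against background frequencies.

    `background` must assign a non-zero frequency to every observed residue.
    """
    freqs = column_frequencies(column)
    return sum(p * math.log2(p / background[aa]) for aa, p in freqs.items())


background = {"A": 0.5, "C": 0.5}  # toy two-letter background, for illustration only
conserved = shannon_entropy("AAAA")
mixed = shannon_entropy("AACC")
surprise = relative_entropy("AAAA", background)
```

A conserved column scores low on Shannon entropy but high on relative entropy when the conserved residue is not overwhelmingly common in the background, which is the motivation for the relative form.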

Experimental Assessment of the Importance of Amino Acid Positions Identified by an Entropy-Based Correlation Analysis of Multiple-Sequence Alignments

Biochemistry ◽

10.1021/bi300747r ◽

2012 ◽

Vol 51 (28) ◽

pp. 5633-5641 ◽

Cited By ~ 11

Author(s):

Susanne Dietrich ◽

Nadine Borst ◽

Sandra Schlee ◽

Daniel Schneider ◽

Jan-Oliver Janda ◽

...

Keyword(s):

Amino Acid ◽

Correlation Analysis ◽

Sequence Alignments ◽

Multiple Sequence ◽

Experimental Assessment ◽

Multiple Sequence Alignments

QMaker: Fast and accurate method to estimate empirical models of protein evolution

10.1101/2020.02.20.958819 ◽

2020 ◽

Cited By ~ 1

Author(s):

Bui Quang Minh ◽

Cuong Cao Dang ◽

Le Sy Vinh ◽

Robert Lanfear

Keyword(s):

Amino Acid ◽

Amino Acid Substitution ◽

Search Algorithm ◽

Phylogenetic Analyses ◽

Accurate Method ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Substitution Models ◽

Protein Alignments

Abstract: Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this paper, we propose QMaker, a new ML method to estimate a general time-reversible Q matrix from a large protein dataset consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection step for handling the model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies.
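
For readers unfamiliar with the object being estimated, the sketch below shows the standard way a general time-reversible rate matrix is assembled from symmetric exchangeabilities and stationary frequencies. It does not reproduce QMaker's maximum-likelihood estimation; the function name and the three-state toy values are assumptions chosen to keep the example short (an amino acid model would have 20 states).

```python
def gtr_rate_matrix(exchange, freqs):
    """Build a time-reversible rate matrix Q with q_ij = r_ij * pi_j for i != j.

    `exchange` is a symmetric matrix of exchangeabilities r_ij and `freqs`
    holds the stationary frequencies pi. Q is scaled so that the expected
    number of substitutions per unit time is 1.
    """
    n = len(freqs)
    q = [[exchange[i][j] * freqs[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero
    rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[value / rate for value in row] for row in q]


exchange = [[0.0, 1.0, 2.0],
            [1.0, 0.0, 3.0],
            [2.0, 3.0, 0.0]]   # toy symmetric exchangeabilities
freqs = [0.5, 0.3, 0.2]        # toy stationary frequencies, summing to 1
Q = gtr_rate_matrix(exchange, freqs)
```

Because the exchangeabilities are symmetric, the resulting matrix satisfies detailed balance, pi_i * q_ij = pi_j * q_ji, which is what "time-reversible" refers to.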

Molecular Characterization of a Novel Fusarivirus Infecting the Plant-pathogenic Fungus Alternaria solani

10.21203/rs.3.rs-216353/v1 ◽

2021 ◽

Author(s):

Jie Wei Gong ◽

Hong Liu ◽

Fei Xiao Zhu ◽

Yun Shi Zhao ◽

Le Jia Cheng ◽

...

Keyword(s):

Amino Acids ◽

Pathogenic Fungus ◽

Alternaria Solani ◽

Open Reading Frames ◽

Phytopathogenic Fungus ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Putative Protein

Abstract: A novel mycovirus belonging to the proposed family "Fusariviridae" was discovered in Alternaria solani by sequencing a double-stranded RNA extracted from this phytopathogenic fungus. The virus was tentatively named "Alternaria solani fusarivirus 1" (AsFV1). AsFV1 has a single-stranded positive-sense (+ssRNA) genome of 6,845 nucleotides containing three open reading frames (ORFs) and a poly(A) tail. The largest ORF, ORF1, encodes a large polypeptide of 1,556 amino acids (aa) with conserved RNA-dependent RNA polymerase and helicase domains. ORF2 and ORF3 overlap and encode putative proteins of 522 aa and 105 aa, respectively, whose functions are currently unknown. Multiple sequence alignments and phylogenetic analysis indicated that AsFV1 belongs to Fusariviridae. This is the first report of the full-length nucleotide sequence of a fusarivirus infecting Alternaria solani.

Viewing multiple sequence alignments with the JavaScript Sequence Alignment Viewer (JSAV)

F1000Research ◽

10.12688/f1000research.5486.1 ◽

2014 ◽

Vol 3 ◽

pp. 249 ◽

Cited By ~ 9

Author(s):

Andrew C. R. Martin

Keyword(s):

Amino Acids ◽

Sequence Alignment ◽

Web Site ◽

Web Pages ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

The JavaScript Sequence Alignment Viewer (JSAV) is designed as a simple-to-use JavaScript component for displaying sequence alignments on web pages. The display of sequences is highly configurable, with options to allow alternative coloring schemes, sorting of sequences and ‘dotifying’ repeated amino acids. An option is also available to submit selected sequences to another web site, or to other JavaScript code. JSAV is implemented purely in JavaScript, making use of the jQuery and jQuery-UI libraries. To help with browser compatibility, it does not use any HTML5-specific options. The code is documented using JSDOC and is available from http://www.bioinf.org.uk/software/jsav/.

Grouping of amino acid types and extraction of amino acid properties from multiple sequence alignments using variance maximization

Proteins Structure Function and Bioinformatics ◽

10.1002/prot.20648 ◽

2005 ◽

Vol 61 (3) ◽

pp. 523-534 ◽

Cited By ~ 9

Author(s):

James O. Wrabl ◽

Nick V. Grishin

Keyword(s):

Amino Acid ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Acid Properties ◽

Amino Acid Properties

Protein residues determining interaction specificity in paralogous families

Bioinformatics ◽

10.1093/bioinformatics/btaa934 ◽

2020 ◽

Author(s):

Borja Pitarch ◽

Juan A G Ranea ◽

Florencio Pazos

Keyword(s):

Amino Acid ◽

Fine Tuning ◽

Supplementary Information ◽

Sequence Information ◽

Large Set ◽

Supplementary Data ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Protein Residues

Abstract

Motivation: Predicting the residues controlling a protein’s interaction specificity is important not only to better understand its interactions but also to design mutations aimed at fine-tuning or swapping them.

Results: In this work, we present a methodology that combines sequence information (in the form of multiple sequence alignments) with interactome information to detect such residues in paralogous families of proteins. The interactome is used to define pairwise similarities of interaction contexts for the proteins in the alignment. The method looks for alignment positions with patterns of amino-acid changes reflecting the similarities/differences in the interaction neighborhoods of the corresponding proteins. We tested this new methodology in a large set of human paralogous families with structurally characterized interactions, and discuss in detail the results for the RasH family. We show that this approach is a better predictor of interfacial residues than both sequence conservation and an equivalent ‘unsupervised’ method that does not use interactome information.

Availability and implementation: http://csbg.cnb.csic.es/pazos/Xdet/

Supplementary information: Supplementary data are available at Bioinformatics online.
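
The idea of scoring an alignment position by how well its pattern of amino-acid changes tracks the similarity of interaction contexts can be illustrated with a small sketch. This is not the Xdet implementation: the identity-based residue similarity, the use of a Pearson correlation over protein pairs, the function names, and the toy similarity matrix are all assumptions made for illustration.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def position_score(column, context_similarity):
    """Correlate residue similarity at one position with interaction-context similarity.

    Residue similarity is simplified to 1 for identical residues and 0 otherwise.
    """
    pairs = list(combinations(range(len(column)), 2))
    residue_sim = [1.0 if column[i] == column[j] else 0.0 for i, j in pairs]
    context_sim = [context_similarity[i][j] for i, j in pairs]
    return pearson(residue_sim, context_sim)


# Toy data: proteins 0 and 1 share interaction partners, as do proteins 2 and 3.
context_similarity = [
    [1.0, 1.0, 0.1, 0.1],
    [1.0, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 1.0],
    [0.1, 0.1, 1.0, 1.0],
]
tracking = position_score("KKEE", context_similarity)    # follows the two groups
scrambled = position_score("KEKE", context_similarity)   # cuts across the groups
conserved = position_score("AAAA", context_similarity)   # carries no signal
```

A fully conserved position gets a score of zero here, which matches the abstract's point that sequence conservation alone is a weaker predictor than a measure that uses interactome information.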
