Protein residues determining interaction specificity in paralogous families

Author(s):  
Borja Pitarch ◽  
Juan A G Ranea ◽  
Florencio Pazos

Abstract Motivation Predicting the residues controlling a protein’s interaction specificity is important not only to better understand its interactions but also to design mutations aimed at fine-tuning or swapping them as well. Results In this work, we present a methodology that combines sequence information (in the form of multiple sequence alignments) with interactome information to detect that kind of residues in paralogous families of proteins. The interactome is used to define pairwise similarities of interaction contexts for the proteins in the alignment. The method looks for alignment positions with patterns of amino-acid changes reflecting the similarities/differences in the interaction neighborhoods of the corresponding proteins. We tested this new methodology in a large set of human paralogous families with structurally characterized interactions, and discuss in detail the results for the RasH family. We show that this approach is a better predictor of interfacial residues than both, sequence conservation and an equivalent ‘unsupervised’ method that does not use interactome information. Availability and implementation http://csbg.cnb.csic.es/pazos/Xdet/. Supplementary information Supplementary data are available at Bioinformatics online.

Author(s):  
Jun Wang ◽  
Pu-Feng Du ◽  
Xin-Yu Xue ◽  
Guang-Ping Li ◽  
Yuan-Ke Zhou ◽  
...  

Abstract Summary Many efforts have been made in developing bioinformatics algorithms to predict functional attributes of genes and proteins from their primary sequences. One challenge in this process is to intuitively analyze and to understand the statistical features that have been selected by heuristic or iterative methods. In this paper, we developed VisFeature, which aims to be a helpful software tool that allows the users to intuitively visualize and analyze statistical features of all types of biological sequence, including DNA, RNA and proteins. VisFeature also integrates sequence data retrieval, multiple sequence alignments and statistical feature generation functions. Availability and implementation VisFeature is a desktop application that is implemented using JavaScript/Electron and R. The source codes of VisFeature are freely accessible from the GitHub repository (https://github.com/wangjun1996/VisFeature). The binary release, which includes an example dataset, can be freely downloaded from the same GitHub repository (https://github.com/wangjun1996/VisFeature/releases). Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Fabian Sievers ◽  
Desmond G Higgins

Abstract Motivation Secondary structure prediction accuracy (SSPA) in the QuanTest benchmark can be used to measure accuracy of a multiple sequence alignment. SSPA correlates well with the sum-of-pairs score, if the results are averaged over many alignments but not on an alignment-by-alignment basis. This is due to a sub-optimal selection of reference and non-reference sequences in QuanTest. Results We develop an improved strategy for selecting reference and non-reference sequences for a new benchmark, QuanTest2. In QuanTest2, SSPA and SP correlate better on an alignment-by-alignment basis than in QuanTest. Guide-trees for QuanTest2 are more balanced with respect to reference sequences than in QuanTest. QuanTest2 scores correlate well with other well-established benchmarks. Availability and implementation QuanTest2 is available at http://bioinf.ucd.ie/quantest2.tar, comprises of reference and non-reference sequence sets and a scoring script. Supplementary information Supplementary data are available at Bioinformatics online


2021 ◽  
Author(s):  
Jaspreet Singh ◽  
Kuldip Paliwal ◽  
Jaswinder Singh ◽  
Yaoqi Zhou

Protein language models have emerged as an alternative to multiple sequence alignment for enriching sequence information and improving downstream prediction tasks such as biophysical, structural, and functional properties. Here we show that a combination of traditional one-hot encoding with the embeddings from two different language models (ProtTrans and ESM-1b) allows a leap in accuracy over single-sequence based techniques in predicting protein 1D secondary and tertiary structural properties, including backbone torsion angles, solvent accessibility and contact numbers. This large improvement leads to an accuracy comparable to or better than the current state-of-the-art techniques for predicting these 1D structural properties based on sequence profiles generated from multiple sequence alignments. The high-accuracy prediction in both secondary and tertiary structural properties indicates that it is possible to make highly accurate prediction of protein structures without homologous sequences, the remaining obstacle in the post AlphaFold2 era.


Author(s):  
Saisai Sun ◽  
Wenkai Wang ◽  
Zhenling Peng ◽  
Jianyi Yang

Abstract Motivation Recent years have witnessed that the inter-residue contact/distance in proteins could be accurately predicted by deep neural networks, which significantly improve the accuracy of predicted protein structure models. In contrast, fewer studies have been done for the prediction of RNA inter-nucleotide 3D closeness. Results We proposed a new algorithm named RNAcontact for the prediction of RNA inter-nucleotide 3D closeness. RNAcontact was built based on the deep residual neural networks. The covariance information from multiple sequence alignments and the predicted secondary structure were used as the input features of the networks. Experiments show that RNAcontact achieves the respective precisions of 0.8 and 0.6 for the top L/10 and L (where L is the length of an RNA) predictions on an independent test set, significantly higher than other evolutionary coupling methods. Analysis shows that about 1/3 of the correctly predicted 3D closenesses are not base pairings of secondary structure, which are critical to the determination of RNA structure. In addition, we demonstrated that the predicted 3D closeness could be used as distance restraints to guide RNA structure folding by the 3dRNA package. More accurate models could be built by using the predicted 3D closeness than the models without using 3D closeness. Availability and implementation The webserver and a standalone package are available at: http://yanglab.nankai.edu.cn/RNAcontact/. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


2010 ◽  
Vol 08 (05) ◽  
pp. 809-823 ◽  
Author(s):  
FREDRIK JOHANSSON ◽  
HIROYUKI TOH

The Shannon entropy is a common way of measuring conservation of sites in multiple sequence alignments, and has also been extended with the relative Shannon entropy to account for background frequencies. The von Neumann entropy is another extension of the Shannon entropy, adapted from quantum mechanics in order to account for amino acid similarities. However, there is yet no relative von Neumann entropy defined for sequence analysis. We introduce a new definition of the von Neumann entropy for use in sequence analysis, which we found to perform better than the previous definition. We also introduce the relative von Neumann entropy and a way of parametrizing this in order to obtain the Shannon entropy, the relative Shannon entropy and the von Neumann entropy at special parameter values. We performed an exhaustive search of this parameter space and found better predictions of catalytic sites compared to any of the previously used entropies.


Biochemistry ◽  
2012 ◽  
Vol 51 (28) ◽  
pp. 5633-5641 ◽  
Author(s):  
Susanne Dietrich ◽  
Nadine Borst ◽  
Sandra Schlee ◽  
Daniel Schneider ◽  
Jan-Oliver Janda ◽  
...  

Author(s):  
Jacob L Steenwyk ◽  
Thomas J Buida ◽  
Abigail L Labella ◽  
Yuanning Li ◽  
Xing-Xing Shen ◽  
...  

Abstract Motivation Diverse disciplines in biology process and analyze multiple sequence alignments (MSAs) and phylogenetic trees to evaluate their information content, infer evolutionary events and processes, and predict gene function. However, automated processing of MSAs and trees remains a challenge due to the lack of a unified toolkit. To fill this gap, we introduce PhyKIT, a toolkit for the UNIX shell environment with 30 functions that process MSAs and trees, including but not limited to estimation of mutation rate, evaluation of sequence composition biases, calculation of the degree of violation of a molecular clock, and collapsing bipartitions (internal branches) with low support. Results To demonstrate the utility of PhyKIT, we detail three use cases: (1) summarizing information content in MSAs and phylogenetic trees for diagnosing potential biases in sequence or tree data; (2) evaluating gene-gene covariation of evolutionary rates to identify functional relationships, including novel ones, among genes; and (3) identify lack of resolution events or polytomies in phylogenetic trees, which are suggestive of rapid radiation events or lack of data. We anticipate PhyKIT will be useful for processing, examining, and deriving biological meaning from increasingly large phylogenomic datasets. Availability PhyKIT is freely available on GitHub (https://github.com/JLSteenwyk/PhyKIT), PyPi (https://pypi.org/project/phykit/), and the Anaconda Cloud (https://anaconda.org/JLSteenwyk/phykit) under the MIT license with extensive documentation and user tutorials (https://jlsteenwyk.com/PhyKIT). Supplementary information Supplementary data are available on figshare (doi: 10.6084/m9.figshare.13118600) and are available at Bioinformatics online.


Author(s):  
Bui Quang Minh ◽  
Cuong Cao Dang ◽  
Le Sy Vinh ◽  
Robert Lanfear

AbstractAmino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models, however, they are typically complicated and slow. In this paper, we propose QMaker, a new ML method to estimate a general time-reversible Q matrix from a large protein dataset consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection for handling the model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies.


2015 ◽  
Vol 32 (6) ◽  
pp. 814-820 ◽  
Author(s):  
Gearóid Fox ◽  
Fabian Sievers ◽  
Desmond G. Higgins

Abstract Motivation: Multiple sequence alignments (MSAs) with large numbers of sequences are now commonplace. However, current multiple alignment benchmarks are ill-suited for testing these types of alignments, as test cases either contain a very small number of sequences or are based purely on simulation rather than empirical data. Results: We take advantage of recent developments in protein structure prediction methods to create a benchmark (ContTest) for protein MSAs containing many thousands of sequences in each test case and which is based on empirical biological data. We rank popular MSA methods using this benchmark and verify a recent result showing that chained guide trees increase the accuracy of progressive alignment packages on datasets with thousands of proteins. Availability and implementation: Benchmark data and scripts are available for download at http://www.bioinf.ucd.ie/download/ContTest.tar.gz. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


Molecules ◽  
2018 ◽  
Vol 24 (1) ◽  
pp. 104
Author(s):  
Patrice Koehl ◽  
Henri Orland ◽  
Marc Delarue

Residues in proteins that are in close spatial proximity are more prone to covariate as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple twenty dimensional encoding with each amino acid represented with a vector of one along its own dimension and zero elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is less than the performance obtained with the 20-dimensional binary encoding. We highlight also the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment.


Sign in / Sign up

Export Citation Format

Share Document