Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments

2015 ◽  
Vol 32 (6) ◽  
pp. 814-820 ◽  
Author(s):  
Gearóid Fox ◽  
Fabian Sievers ◽  
Desmond G. Higgins

Abstract Motivation: Multiple sequence alignments (MSAs) with large numbers of sequences are now commonplace. However, current multiple alignment benchmarks are ill-suited for testing these types of alignments, as test cases either contain a very small number of sequences or are based purely on simulation rather than empirical data. Results: We take advantage of recent developments in protein structure prediction methods to create a benchmark (ContTest) for protein MSAs containing many thousands of sequences in each test case and which is based on empirical biological data. We rank popular MSA methods using this benchmark and verify a recent result showing that chained guide trees increase the accuracy of progressive alignment packages on datasets with thousands of proteins. Availability and implementation: Benchmark data and scripts are available for download at http://www.bioinf.ucd.ie/download/ContTest.tar.gz. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Author(s):  
Mark Chonofsky ◽  
Saulo H P de Oliveira ◽  
Konrad Krawczyk ◽  
Charlotte M Deane

Abstract Motivation Over the last few years, the field of protein structure prediction has been transformed by increasingly accurate contact prediction software. These methods are based on the detection of coevolutionary relationships between residues from multiple sequence alignments. However, despite speculation, there is little evidence of a link between contact prediction and the physico-chemical interactions which drive amino-acid coevolution. Furthermore, existing protocols predict only a fraction of all protein contacts and it is not clear why some contacts are favoured over others. Using a dataset of 863 protein domains, we assessed the physico-chemical interactions of contacts predicted by CCMpred, MetaPSICOV, and DNCON2, as examples of direct coupling analysis, meta-prediction, and deep learning, respectively. Results We considered correctly predicted contacts and compared their properties against the protein contacts that were not predicted. Predicted contacts tend to form more bonds than non-predicted contacts, which suggests these contacts may be more important than contacts that were not predicted. Comparing the contacts predicted by each method, we found that MetaPSICOV and DNCON2 favour accuracy whereas CCMpred detects contacts with more bonds. This suggests that the push for higher accuracy may lead to a loss of physico-chemically important contacts. These results underscore the connection between protein physico-chemistry and the coevolutionary couplings that can be derived from multiple sequence alignments. This relationship is likely to be relevant to protein structure prediction and functional analysis of protein structure and may be key to understanding their utility for different problems in structural biology. Availability We use publicly available databases. Our code is available for download at http://opig.stats.ox.ac.uk/. Supplementary information Supplementary information is available at Bioinformatics online.


Author(s):  
Fabian Sievers ◽  
Desmond G Higgins

Abstract Motivation Secondary structure prediction accuracy (SSPA) in the QuanTest benchmark can be used to measure the accuracy of a multiple sequence alignment. SSPA correlates well with the sum-of-pairs (SP) score if the results are averaged over many alignments, but not on an alignment-by-alignment basis. This is due to a sub-optimal selection of reference and non-reference sequences in QuanTest. Results We develop an improved strategy for selecting reference and non-reference sequences for a new benchmark, QuanTest2. In QuanTest2, SSPA and SP correlate better on an alignment-by-alignment basis than in QuanTest. Guide trees for QuanTest2 are more balanced with respect to reference sequences than in QuanTest. QuanTest2 scores correlate well with other well-established benchmarks. Availability and implementation QuanTest2 is available at http://bioinf.ucd.ie/quantest2.tar and comprises reference and non-reference sequence sets and a scoring script. Supplementary information Supplementary data are available at Bioinformatics online.
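
For readers unfamiliar with the sum-of-pairs score mentioned above, the following is a minimal sketch of one common definition: the fraction of residue pairs aligned in a reference alignment that are also aligned in the alignment under test. The function names and the toy alignments are illustrative assumptions, not part of the QuanTest2 scoring script.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' is a gap),
    with sequences in the same order in every alignment being compared.
    Each residue is identified as (sequence_index, residue_index).
    """
    counters = [0] * len(alignment)  # next residue index for each sequence
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sum_of_pairs_score(candidate, reference):
    """Fraction of reference residue pairs reproduced by the candidate."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(candidate)) / len(ref_pairs)

# Toy data (hypothetical): the same three sequences aligned two ways.
reference = ["AC-GT", "ACAGT", "A--GT"]
candidate = ["ACG-T", "ACAGT", "A-G-T"]
score = sum_of_pairs_score(candidate, reference)
print(f"SP score: {score:.2f}")
```

Because the score is a ratio over reference pairs only, extra pairs introduced by the candidate alignment are not penalised in this variant.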


2021 ◽  
Author(s):  
Liang Hong ◽  
Siqi Sun ◽  
Liangzhen Zheng ◽  
Qingxiong Tan ◽  
Yu Li

Evolutionarily related sequences provide information about protein structure and function. Multiple sequence alignment, which includes homolog searching in large databases followed by sequence alignment, is an effective way to extract this information and assist protein structure and function prediction, as demonstrated by AlphaFold. Despite the existing tools for multiple sequence alignment, searching for homologs across the entire UniProt is still time-consuming. Given the success of AlphaFold, large-scale multiple sequence alignment against massive databases is likely to become a trend in the field, so accelerating this step is highly desirable. Here, we propose a novel method, fastMSA, that improves the speed significantly. Our idea is orthogonal to all previous acceleration methods. Taking advantage of a protein language model based on BERT, we propose a novel dual encoder architecture that embeds protein sequences into a low-dimensional space and efficiently filters out unrelated sequences before running BLAST. Extensive experimental results suggest that we can recall most of the homologs with a 34-fold speed-up. Moreover, our method is compatible with downstream tasks, such as structure prediction using AlphaFold. Using multiple sequence alignments generated by our method, we see little loss of performance in protein structure prediction with much less running time. fastMSA will effectively assist protein sequence, structure, and function analysis based on homologs and multiple sequence alignment.
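
The embed-then-filter idea described above can be sketched generically as follows. This is not the fastMSA implementation: the embeddings are hard-coded toy vectors standing in for encoder outputs, and cosine similarity with a fixed `top_k` cut-off is an assumption made purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def prefilter(query_vec, db_vecs, top_k):
    """Return the ids of the top_k database entries closest to the query.

    db_vecs maps a sequence id to its embedding. Only the returned ids
    would be passed on to the slower alignment step.
    """
    ranked = sorted(db_vecs,
                    key=lambda sid: cosine(query_vec, db_vecs[sid]),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical embeddings; a real pipeline would obtain these from a model.
db = {
    "seqA": [0.9, 0.1, 0.0],
    "seqB": [0.0, 1.0, 0.0],
    "seqC": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
candidates = prefilter(query, db, top_k=2)
print(candidates)
```

The speed-up in such a scheme comes from comparing short vectors instead of aligning full sequences, at the cost of possibly discarding true homologs whose embeddings fall below the cut-off.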


2019 ◽  
Author(s):  
Mark Chonofsky ◽  
Saulo H. P. de Oliveira ◽  
Konrad Krawczyk ◽  
Charlotte M. Deane

Abstract Over the last few years, the field of protein structure prediction has been transformed by increasingly accurate contact prediction software. These methods are based on the detection of coevolutionary relationships between residues from multiple sequence alignments. However, despite speculation, there is little evidence of a link between contact prediction and the physico-chemical interactions which drive amino-acid coevolution. Furthermore, existing protocols predict only a fraction of all protein contacts and it is not clear why some contacts are favoured over others. Using a dataset of 863 protein domains, we assessed the physico-chemical interactions of contacts predicted by CCMpred, MetaPSICOV, and DNCON2, as examples of direct coupling analysis, meta-prediction, and deep learning, respectively. To further investigate what sets these predicted contacts apart, we considered correctly predicted contacts and compared their properties against the protein contacts that were not predicted. We found that predicted contacts tend to form more bonds than non-predicted contacts, which suggests these contacts may be more important. Comparing the contacts predicted by each method, we found that MetaPSICOV and DNCON2 favour accuracy whereas CCMpred detects contacts with more bonds. This suggests that the push for higher accuracy may lead to a loss of physico-chemically important contacts. These results underscore the connection between protein physico-chemistry and the coevolutionary couplings that can be derived from multiple sequence alignments. This relationship is likely to be relevant to protein structure prediction and functional analysis of protein structure and may be key to understanding their utility for different problems in structural biology.

Author summary Accurate contact prediction has allowed scientists to predict protein structures with unprecedented levels of accuracy. The success of contact prediction methods, which are based on inferring correlations between amino acids in protein multiple sequence alignments, has prompted a great deal of work to improve the quality of contact prediction, leading to the development of several different methods for detecting amino acids in proximity. In this paper, we investigate the properties of these contact prediction methods. We find that predicted contacts differ from the other contacts in the protein: in particular, they have more physico-chemical bonds, and they are more strongly conserved than other contacts across protein families. We also compared the properties of different contact prediction methods and found that the characteristics of the predicted sets depend on the prediction method used. Our results point to a link between physico-chemical bonding interactions and the evolutionary history of proteins, a connection which is reflected in their amino acid sequences.


2021 ◽  
Author(s):  
Diego del Alamo ◽  
Davide Sala ◽  
Hassane Mchaourab ◽  
Jens Meiler

Equilibrium fluctuations and triggered conformational changes often underlie the functional cycles of membrane proteins. For example, transporters mediate the passage of molecules across cell membranes by alternating between inward-facing (IF) and outward-facing (OF) states, while receptors undergo intracellular structural rearrangements that initiate signaling cascades. Although the conformational plasticity of these proteins has historically posed a challenge for traditional de novo protein structure prediction pipelines, the recent success of AlphaFold2 (AF2) in CASP14 culminated in the modeling of a transporter in multiple conformations to high accuracy. Given that AF2 was designed to predict static structures of proteins, it remains unclear if this result represents an underexplored capability to accurately predict multiple conformations and/or structural heterogeneity. Here, we present an approach to drive AF2 to sample alternative conformations of topologically diverse transporters and G-protein coupled receptors (GPCRs) that are absent from the AF2 training set. Whereas models generated using the default AF2 pipeline are conformationally homogeneous and nearly identical to one another, reducing the depth of the input multiple sequence alignments (MSAs) led to the generation of accurate models in multiple conformations. In our benchmark, these conformations were observed to span the range between two experimental structures of interest, suggesting that our protocol allows sampling of the conformational landscape at the energy minimum. Nevertheless, our results also highlight the need for the next generation of deep learning algorithms to be designed to predict ensembles of biophysically relevant states.
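
As a rough illustration of what reducing MSA depth can look like in practice, the sketch below keeps the query sequence and a random subset of the remaining rows. The abstract does not say how depth was reduced, so uniform random sampling, the function name `subsample_msa`, and the toy alignment are all assumptions.

```python
import random

def subsample_msa(msa, depth, seed=0):
    """Return a shallower MSA with at most `depth` rows.

    The first row is treated as the query and is always kept; the rest
    are drawn uniformly at random without replacement.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    query, others = msa[0], msa[1:]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    keep = min(depth - 1, len(others))
    return [query] + rng.sample(others, keep)

# Hypothetical five-row alignment; row 0 is the query.
msa = ["MKTAYIAK", "MKTAYLAK", "MRTAY-AK", "MKSAYIAR", "MKTGYIAK"]
shallow = subsample_msa(msa, depth=3)
print(shallow)
```

Running the function with several seeds gives different shallow alignments from the same deep MSA, each of which could be supplied to a predictor separately.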


2015 ◽  
Author(s):  
Robert Sheridan ◽  
Robert J. Fieldhouse ◽  
Sikander Hayat ◽  
Yichao Sun ◽  
Yevgeniy Antipin ◽  
...  

Recently developed maximum entropy methods infer evolutionary constraints on protein function and structure from the millions of protein sequences available in genomic databases. The EVfold web server (at EVfold.org) makes these methods available to predict functional and structural interactions in proteins. The key algorithmic development has been to disentangle direct and indirect residue-residue correlations in large multiple sequence alignments and derive direct residue-residue evolutionary couplings (EVcouplings or ECs). For proteins of unknown structure, distance constraints obtained from evolutionary couplings between residue pairs are used to predict all-atom 3D structures de novo, often to good accuracy. Given sufficient sequence information in a protein family, this is a major advance toward solving the problem of computing the native 3D fold of proteins from sequence information alone. Availability: EVfold server at http://evfold.org/ Contact: [email protected]


Author(s):  
Jun Wang ◽  
Pu-Feng Du ◽  
Xin-Yu Xue ◽  
Guang-Ping Li ◽  
Yuan-Ke Zhou ◽  
...  

Abstract Summary Many efforts have been made to develop bioinformatics algorithms that predict functional attributes of genes and proteins from their primary sequences. One challenge in this process is to intuitively analyze and understand the statistical features that have been selected by heuristic or iterative methods. In this paper, we present VisFeature, a software tool that allows users to intuitively visualize and analyze statistical features of all types of biological sequences, including DNA, RNA and proteins. VisFeature also integrates sequence data retrieval, multiple sequence alignment and statistical feature generation functions. Availability and implementation VisFeature is a desktop application implemented using JavaScript/Electron and R. The source code of VisFeature is freely accessible from the GitHub repository (https://github.com/wangjun1996/VisFeature). The binary release, which includes an example dataset, can be freely downloaded from the same GitHub repository (https://github.com/wangjun1996/VisFeature/releases). Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Aashish Jain ◽  
Genki Terashi ◽  
Yuki Kagaya ◽  
Sai Raghavendra Maddhuri Venkata Subramaniya ◽  
Charles Christoffer ◽  
...  

Abstract Protein 3D structure prediction has advanced significantly in recent years due to improving contact prediction accuracy. This improvement has been largely due to deep learning approaches that predict inter-residue contacts and, more recently, distances using multiple sequence alignments (MSAs). In this work we present AttentiveDist, a novel approach that uses different MSAs generated with different E-values in a single model to increase the co-evolutionary information provided to the model. To determine the importance of each MSA’s feature at the inter-residue level, we added an attention layer to the deep neural network. The model is trained in a multi-task fashion to also predict backbone and orientation angles, further improving the inter-residue distance prediction. We show that AttentiveDist outperforms the top methods for contact prediction in the CASP13 structure prediction competition. To aid in structure modeling we also developed two new deep learning-based sidechain center distance and peptide-bond nitrogen-oxygen distance prediction models. Together these led to a 12% increase in TM-score from the best server method in CASP13 for structure prediction.


Author(s):  
Saisai Sun ◽  
Wenkai Wang ◽  
Zhenling Peng ◽  
Jianyi Yang

Abstract Motivation In recent years, deep neural networks have made it possible to predict inter-residue contacts and distances in proteins accurately, which has significantly improved the accuracy of predicted protein structure models. In contrast, fewer studies have addressed the prediction of RNA inter-nucleotide 3D closeness. Results We propose a new algorithm named RNAcontact for the prediction of RNA inter-nucleotide 3D closeness. RNAcontact is built on deep residual neural networks. The covariance information from multiple sequence alignments and the predicted secondary structure are used as the input features of the networks. Experiments show that RNAcontact achieves precisions of 0.8 and 0.6 for the top L/10 and top L predictions, respectively (where L is the length of an RNA), on an independent test set, significantly higher than other evolutionary coupling methods. Analysis shows that about 1/3 of the correctly predicted 3D closenesses are not base pairings of secondary structure, which are critical to the determination of RNA structure. In addition, we demonstrate that the predicted 3D closeness can be used as distance restraints to guide RNA structure folding by the 3dRNA package. More accurate models can be built with the predicted 3D closeness than without it. Availability and implementation The webserver and a standalone package are available at: http://yanglab.nankai.edu.cn/RNAcontact/. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
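
The top L/10 and top L precision figures quoted above follow a standard evaluation convention, which can be sketched as below. The function name, the toy scores and the toy reference contacts are illustrative assumptions; RNAcontact's own evaluation code may differ, for example in how it filters pairs by sequence separation.

```python
def top_k_precision(scores, true_contacts, k):
    """Fraction of the k highest-scoring pairs that are true contacts.

    scores: dict mapping an (i, j) position pair to a predicted score.
    true_contacts: set of (i, j) pairs observed in the reference structure.
    If fewer than k pairs were scored, all scored pairs are evaluated.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for pair in ranked if pair in true_contacts)
    return hits / len(ranked)

# Hypothetical predictions for a sequence of length L = 20.
L = 20
scores = {(1, 10): 0.95, (2, 15): 0.90, (5, 19): 0.85, (3, 8): 0.40, (4, 12): 0.10}
truth = {(1, 10), (5, 19), (3, 8)}
p_top_l10 = top_k_precision(scores, truth, k=L // 10)
p_top_l = top_k_precision(scores, truth, k=L)
print(p_top_l10, p_top_l)
```

Evaluating only the top-ranked pairs rewards methods whose most confident predictions are correct, which matters when the predictions are later used as a small set of restraints.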

